A Clock Distribution Scheme for Large RSFQ Circuits

Krystof Gaj, Eby G. Friedman, and Marc J. Feldman
Department of Electrical Engineering, University of Rochester, Rochester, NY 14627

Andrzej Krasiecki
Institute of Telecommunications, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland

Abstract—A primary issue in maximizing the performance of large scale synchronous digital systems is the clock distribution scheme. We present a novel clocking scheme, developed specifically for RSFQ logic, which is based on the concurrent flow of the clock and data signals. The scheme permits the circuit throughput to be independent of inter-cell connection delays and significantly reduces the dependence of the throughput on the clock-to-output delay of the cells. Concurrent flow clocking is particularly well suited for structured architectures. The simulated maximum clock frequency of an RSFQ decimation digital filter currently under development at the University of Rochester can be as much as seven times higher using concurrent-flow clocking rather than conventional (counterflow) clocking. This advantage, however, is reduced to a factor of two due to fabrication process parameter variations in present day superconductive technologies.

I. INTRODUCTION

Full exploitation of the speed of Rapid Single Flux Quantum (RSFQ) logic will require the proper choice of system level architecture. An essential issue in system level design is the choice of synchronization strategy, ranging from fully synchronous to fully asynchronous architectures. RSFQ is the first superconducting digital technology that can use asynchronous schemes as well as synchronous schemes and the possibility of constructing superfast large asynchronous processors has therefore attracted significant interest [1]-[4].

The optimal choice of RSFQ logic architecture should consider the experiences of high-speed semiconductor devices. Although it is often asserted that asynchronous circuits have a performance advantage, a detailed analysis of semiconductor-based logic has shown that asynchronous architectures hardly ever outperform fully synchronous-based architectures. In fact, fully synchronous pipelined clocking almost always outperforms asynchronous timing schemes [5].

Since pipelining is common in digital signal processors, an important near-term application of RSFQ logic [1], [4], asynchronous RSFQ schemes begin with a significant disadvantage. Furthermore, other well-known advantages of synchronous circuits exist, such as smaller area, improved testability, and well established and easier design techniques. We therefore expect that as RSFQ logic matures, it will follow the design path of its semiconductor counterparts.

The primary argument against a synchronous architecture is the deleterious effects of clock skew on circuit performance [2], [4]. At the high frequencies expected with RSFQ circuits, clock skew can become the dominant factor limiting the minimum clock period. One solution for this problem in RSFQ circuits is the use of complex clock distribution networks, such as load-balanced H-trees [4], [6] targeted to minimize clock skew. These networks have some application to semiconductor circuits but are rather inadequate for low-fanout RSFQ logic, as complex clock distribution networks require a large number of splitters integrated into a complex and area inefficient clock distribution tree.

In this paper we propose a new approach to eliminating the deleterious influence of clock skew on RSFQ circuits: concurrent flow clocking. We show that by distributing the clock signal in the direction of the data flow, circuit performance can be enhanced as compared with schemes that attempt to eliminate or minimize clock skew.

Similar synchronization schemes have been developed for semiconductor logic [6]-[8] but have not as yet been widely adopted. The main reasons are conservative design conventions used within industry, relatively small performance improvements (up to 40%) as compared with schemes targeted to produce zero clock skew, and difficulties in implementing well-controlled delay lines within semiconductor-based clock distribution networks.

For RSFQ logic, the concurrent flow of clock and data has been applied to simple shift register [1], [2], [9] and full-adder structures [10]. The scheme permits either fast initialization of shift registers [1], or construction of fast circular shift registers [9]. Our methodology generalizes and optimizes this approach to make non-zero skew clock distribution applicable and efficient for a variety of synchronous RSFQ circuits. We show that, due to specific features of RSFQ logic, the potential performance improvements of applying this synchronous scheme may be significantly higher than in semiconductor devices.

Our paper is organized as follows. In the following section we present our scheme and compare it with the synchronization scheme based on the counterflow of clock and data signals. In section III, we provide a detailed comparison of concurrent and counterflow clocking, considering the deviations of circuit parameters due to variations in the fabrication process. Section IV is a brief prescription for using the equations to design a clock distribution network while accounting for these parameter deviations. In section V, as an example, we apply both clocking schemes to the design of a decimation digital filter currently under development at the University of Rochester. Finally, we present our conclusions in section VI.
II. GENERAL IDEA OF COUNTERFLOW AND CONCURRENT FLOW CLOCKING SCHEMES

In typical structured LSI and VLSI architectures, such as systolic arrays, the flow of data is closely associated with the circuit topology. In particular, data is most often exchanged between physically neighboring cells. We assume that the flow of the clock signal in RSFQ circuits is also closely related to the topology of the circuit and therefore to the flow of the data. This assumption permits a significant reduction in the complexity of the clock distribution network, including the number of splitters and Josephson transmission lines (JTLs) that constitute the network. Thus, two basic clocking schemes are possible:

In the counterflow scheme, the clock propagates in the opposite direction to that of the data flow. In the concurrent flow scheme, the subject of this paper, the clock flows in the same direction as the data. This is illustrated in Fig. 1. In both schemes extra delay ($\Delta_{DATA-SYN}$) may need to be added to the data path to avoid races in the circuit (i.e., the propagation of data pulses through several consecutive clocked gates within one clock cycle). However, only in concurrent flow clocking can an additional delay ($\Delta_{CLK-SYN}$) added to the clock path permit the designer to utilize the intrinsic speed of the RSFQ cells more efficiently. Timing diagrams describing the exchange of data between two cells are shown in Fig. 2. The necessary and sufficient conditions for correct data exchange are the same for both clocking schemes (see Figs. 1 and 2 for the notation):

$$t_{hold(2)} \leq data_{position(2)} \leq T_{CLK} - t_{setup(2)}.$$  

(1)

The first inequality eliminates any race conditions in the circuit while the second inequality imposes limits on the minimum clock period.

For counterflow clocking, (1) becomes (see Fig. 2a):

$$t_{hold(2)} \leq \Delta_{CLK-INT} + \Delta_{CELL} + \Delta_{DATA-PATH},$$

(2)

$$T_{CLK} \geq \Delta_{CLK-INT} + \Delta_{CELL} + \Delta_{DATA-PATH} + t_{setup(2)}.$$

(3)

Fig. 1. General idea of counterflow and concurrent flow clocking schemes.

Fig. 2. The exchange of information between neighboring cells in: (a) counterflow clocking scheme, (b) concurrent flow clocking scheme. Notation: $\Delta_{CELL}$ - clock-to-output delay of the cell; $t_{hold(2)}$ - hold time of the cell; $t_{setup(2)}$ - setup time of the cell; $data_{position(2)}$ - position of the data pulse within clock cycle (cell i).

The inequality (2) is typically satisfied for $\Delta_{DATA-PATH} = \Delta_{DATA-INT}$ (i.e., $\Delta_{DATA-SYN} = 0$). In this case (henceforth referred to as the counterflow clocking case 1), the minimum clock period in the circuit is

$$T_{MIN} = \Delta_{CLK-INT} + \Delta_{CELL} + \Delta_{DATA-INT} + t_{setup(2)}.$$

(4)

When (2) is not satisfied (the counterflow clocking case 2), additional JTL stages must be added in the data path to avoid races ($\Delta_{DATA-SYN} > 0$).

For concurrent flow clocking, (1) becomes (see Fig. 2b):

$$t_{hold(2)} \leq \Delta_{CELL} + \Delta_{DATA-PATH} - \Delta_{CLK-PATH},$$

(5)

$$T_{CLK} \geq \Delta_{CELL} + \Delta_{DATA-PATH} - \Delta_{CLK-PATH} + t_{setup(2)}.$$

(6)

Extra JTL stages can be added to the clock or data path so that (5) becomes an equality. The minimum clock period of the circuit is therefore

$$T_{MIN} = t_{hold(2)} + t_{setup(2)}.$$  

(7)

From (4) and (7) we see that counterflow clocking is usually much slower than concurrent flow clocking, insofar as

$$t_{hold(2)} \leq \Delta_{CLK-INT} + \Delta_{CELL} + \Delta_{DATA-INT}.$$  

(8)

Intuitively, this is because the counterflow data pulse cannot appear at the input of the second cell during the relatively long initial part of the clock period (see Fig. 2a). The clock pulse must propagate to the input of the first cell, which releases data stored in the first cell, thereby permitting the data signal to propagate through the interconnections between the cells. In concurrent flow clocking, this restriction does not apply and the data pulse may appear at the data input of the second cell as early as the hold time after the beginning of the clock cycle (Fig. 2b).
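The timing argument of this section can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and the numbers are the nominal PSR values from Table II with the interconnection delays assumed in Section V ($\Delta_{DATA-INT} = 0$, $\Delta_{CLK-INT} = 13$ ps). It computes the position of the data pulse at the second cell for each scheme and tests it against condition (1).

```python
def position_counterflow(clk_int, cell, data_path):
    """Data arrival at cell 2, measured from the clock pulse at cell 2.
    The clock reaches cell 1 only after clk_int, then the data travels back."""
    return clk_int + cell + data_path


def position_concurrent(cell, data_path, clk_path):
    """Data arrival at cell 2, measured from the clock pulse at cell 2.
    The clock reaches cell 2 clk_path after it reached cell 1."""
    return cell + data_path - clk_path


def exchange_ok(position, t_hold, t_setup, t_clk):
    """Condition (1): t_hold <= data position <= T_CLK - t_setup."""
    return t_hold <= position <= t_clk - t_setup


# Nominal PSR-to-PSR values in ps (Table II and Section V).
CELL, T_HOLD, T_SETUP = 23, -31, 42
CLK_INT, DATA_INT = 13, 0

# Counterflow, no extra data delay: the data position is fixed by the delays.
pos_cf = position_counterflow(CLK_INT, CELL, DATA_INT)
t_min_cf = pos_cf + T_SETUP

# Concurrent flow: choose the clock path so the data arrives exactly at t_hold.
clk_path = CELL + DATA_INT - T_HOLD
pos_cc = position_concurrent(CELL, DATA_INT, clk_path)
t_min_cc = T_HOLD + T_SETUP

if __name__ == "__main__":
    print("counterflow:", pos_cf, "ps position,", t_min_cf, "ps minimum period")
    print("concurrent: ", pos_cc, "ps position,", t_min_cc, "ps minimum period")
```

With these nominal values the clock path in the concurrent scheme must be 54 ps long, that is, 41 ps of extra JTL delay on top of the 13 ps interconnection.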
III. CLOCK DISTRIBUTION NETWORK DESIGN ACCOUNTING FOR PARAMETER VARIATIONS

A major issue in the design of clock distribution networks is accounting for deviations of circuit parameters due to variations in the fabrication process. These deviations can significantly change the timing parameters of the individual cells and the interconnections within the circuit, and must be taken into account at the system level.

In [11], the timing requirements for some basic RSFQ cells are analyzed in light of the design parameter uncertainties inherent in present day Nb/Al$_2$O$_3$/Nb Josephson junction integrated circuit fabrication technologies. It is found that the relative deviation in the clock-to-output delay, $\delta$ (defined as the $3\sigma$ standard deviation in the delay divided by the nominal value of the delay), for most of the basic cells is less than 20%. Also, the absolute deviation in the hold and setup times increases proportionally with $\delta$.

Rewriting (2), (3), (5), (6) to account for these deviations, the worst case conditions for the correct exchange of information between two adjacent cells become

* for counterflow clocking
  \[ t_{\text{hold}(2)}^{\text{MAX}} \leq \Delta_{\text{CLK-INT}}^{\text{MIN}} + \Delta_{\text{CELL}}^{\text{MIN}} + \Delta_{\text{DATA-PATH}}^{\text{MIN}} \tag{9} \]
  \[ T_{\text{CLK}} \geq \Delta_{\text{CLK-INT}}^{\text{MAX}} + \Delta_{\text{CELL}}^{\text{MAX}} + \Delta_{\text{DATA-PATH}}^{\text{MAX}} + t_{\text{setup}(2)}^{\text{MAX}} \tag{10} \]

* for concurrent flow clocking
  \[ t_{\text{hold}(2)}^{\text{MAX}} \leq \Delta_{\text{CELL}}^{\text{MIN}} + \min(\Delta_{\text{DATA-PATH}} - \Delta_{\text{CLK-PATH}}) \tag{11} \]
  \[ T_{\text{CLK}} \geq \Delta_{\text{CELL}}^{\text{MAX}} + \max(\Delta_{\text{DATA-PATH}} - \Delta_{\text{CLK-PATH}}) + t_{\text{setup}(2)}^{\text{MAX}} \tag{12} \]

In our concurrent flow clocking scheme
\[ \Delta_{\text{CLK-PATH}} = \Delta_{\text{CLK-INT}} + \Delta_{\text{CLK-SYN}} \tag{13} \]
where $\Delta_{\text{CLK-SYN}}$ is the delay of an extra JTL in the clock path (see Fig. 1b), chosen to make (11) an equality for the worst case deviation of all timing parameters.

These conditions lead us to the following general expressions for the minimum clock period for both clocking schemes:

* for counterflow clocking
  \[ T_{\text{MIN}} = t_{\text{setup}(2)}^{\text{MAX}} + \Delta_{\text{CELL}}^{\text{MAX}} + \Delta_{\text{CLK-INT}}^{\text{MAX}} + \Delta_{\text{DATA-PATH}}^{\text{MAX}} \tag{14} \]

* for concurrent flow clocking
  \[ T_{\text{MIN}} = t_{\text{hold}(2)}^{\text{MAX}} + t_{\text{setup}(2)}^{\text{MAX}} + \left(\Delta_{\text{CELL}}^{\text{MAX}} - \Delta_{\text{CELL}}^{\text{MIN}}\right) + \left[\max(\Delta_{\text{DATA-PATH}} - \Delta_{\text{CLK-PATH}}) - \min(\Delta_{\text{DATA-PATH}} - \Delta_{\text{CLK-PATH}})\right] \tag{15} \]

Equations (14) and (15) show the significant advantage of the concurrent flow clocking scheme. The minimum clock period in our scheme no longer directly depends on the clock-to-output delay of the cells but rather depends only on deviations in the delay (i.e., the difference between the maximum and minimum value). Concurrent flow clocking also eliminates the direct dependence of the minimum clock period on the maximum inter-cell connection delay of both data and clock paths, permitting a greater separation between the cells without incurring any degradation in performance.
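The step from the worst case conditions to the concurrent flow expression can be written out explicitly. The following is a sketch of the algebra, assuming the extra clock path delay is chosen so that the worst case hold condition is met with equality; the shorthand $D = \Delta_{\text{DATA-PATH}}$ and $C = \Delta_{\text{CLK-PATH}}$ is introduced here only for brevity.

```latex
\begin{align*}
t_{\text{hold}(2)}^{\text{MAX}} &= \Delta_{\text{CELL}}^{\text{MIN}} + \min(D - C)
  && \text{(hold condition met with equality)} \\
T_{\text{CLK}} &\geq \Delta_{\text{CELL}}^{\text{MAX}} + \max(D - C) + t_{\text{setup}(2)}^{\text{MAX}}
  && \text{(setup condition)} \\
T_{\text{MIN}} &= \Delta_{\text{CELL}}^{\text{MAX}} + \max(D - C) + t_{\text{setup}(2)}^{\text{MAX}}
  + \left[ t_{\text{hold}(2)}^{\text{MAX}} - \Delta_{\text{CELL}}^{\text{MIN}} - \min(D - C) \right] \\
 &= t_{\text{hold}(2)}^{\text{MAX}} + t_{\text{setup}(2)}^{\text{MAX}}
  + \left( \Delta_{\text{CELL}}^{\text{MAX}} - \Delta_{\text{CELL}}^{\text{MIN}} \right)
  + \left[ \max(D - C) - \min(D - C) \right]
\end{align*}
```

The bracketed term added in the third line is zero by the first line, so only differences between maximum and minimum delays survive in the result.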

However, the efficiency of the concurrent flow clocking scheme is greatly affected by the deviation of the timing parameters, as the difference between the maximum and minimum value of any delay changes two times faster as a function of relative delay deviation than the maximum value of the delay.

We further transform (14) and (15) to show their dependence on the nominal values of the timing parameters and their deviations. We assume that both the clock and the data path delays for a given pair of cells are implemented using JTLs located within the same area of the integrated circuit, and thus these deviations are correlated. Thus,

* for $\Delta_{\text{DATA-PATH}} - \Delta_{\text{CLK-PATH}} < 0$ (henceforth referred to as concurrent flow clocking case 1),
  \[ \max(\Delta_{\text{DATA-PATH}} - \Delta_{\text{CLK-PATH}}) = \Delta_{\text{DATA-PATH}}^{\text{MIN}} - \Delta_{\text{CLK-PATH}}^{\text{MIN}} \tag{16} \]

* for $\Delta_{\text{DATA-PATH}} - \Delta_{\text{CLK-PATH}} > 0$ (henceforth referred to as concurrent flow clocking case 2),
  \[ \max(\Delta_{\text{DATA-PATH}} - \Delta_{\text{CLK-PATH}}) = \Delta_{\text{DATA-PATH}}^{\text{MAX}} - \Delta_{\text{CLK-PATH}}^{\text{MAX}} \tag{17} \]

Similar relations hold for the minimum value of the difference between delays. Further transformations lead to the minimum clock period and the extra delay in the clock or data path for both clocking schemes, as summarized in Table I.

One can easily check that under the condition
\[ t_{\text{hold}(2)}^{\text{MAX}} \leq \Delta_{\text{CLK-INT}}^{\text{MIN}} + \Delta_{\text{CELL}}^{\text{MIN}} + \Delta_{\text{DATA-INT}}^{\text{MIN}} \tag{18} \]
concurrent flow clocking always outperforms counterflow clocking. For greater $t_{\text{hold}(2)}^{\text{MAX}}$, both schemes offer identical performance.

IV. DESIGN METHODOLOGY - HOW THE EQUATIONS ARE USED

To apply our design strategy for developing RSFQ clock distribution networks, the designer must first establish for each cell in the circuit:

a) the nominal values of the clock-to-output delay, hold time, and setup time;

b) the maximum relative deviation of the clock-to-output delay; and

c) the absolute maximum values of the hold and setup time.

Nominal values of timing parameters can be derived from a circuit level simulator, such as JSPICE or PSCAN. Hold and setup times should be defined to assure correct timing as well as correct functionality: the clock-to-output delay may be affected by too small an interval between the clock and data pulses (see definition of quiescent hold and setup times in [12]). An efficient procedure for establishing deviations of timing parameters is provided in [11]. The maximum of relative delay deviations for all cells is used in further computations as $\delta$.

For all distinct pairs of communicating cells in the circuit under consideration, we must further compute the minimum clock period imposed by the given pair of cells and the extra synchronization delay. All computations can be performed using formulas provided in Table I. The conditions necessary to distinguish between cases 1 and 2 for each clocking scheme are also defined in the table. The largest minimum clock period computed for each pair of cells limits the maximum clock frequency of the circuit. Extra synchronization delay can be implemented using an appropriate number of JTL stages introduced to either the clock path or data path, depending on the sign of the extra delay as specified in Table I.
TABLE I
DESIGN OF CLOCK DISTRIBUTION NETWORKS FOR COUNTERFLOW AND CONCURRENT FLOW CLOCKING SCHEMES

CLOCKING SCHEME | COUNTERFLOW | CONCURRENT FLOW
--- | --- | ---
CONDITION | if \( t_{\text{hold}(2)}^{\text{MAX}} \leq (1-\delta)(\Delta_{\text{CELL}}^{\text{nom}} + \Delta_{\text{DATA-INT}}^{\text{nom}} + \Delta_{\text{CLK-INT}}^{\text{nom}}) \) then CASE 1, else CASE 2 | if \( t_{\text{hold}(2)}^{\text{MAX}} \leq (1-\delta)\Delta_{\text{CELL}}^{\text{nom}} \) then CASE 1, else CASE 2
CASE 1: \( T_{\text{MIN}} \) | \( t_{\text{setup}(2)}^{\text{MAX}} + (1+\delta)(\Delta_{\text{CELL}}^{\text{nom}} + \Delta_{\text{DATA-INT}}^{\text{nom}} + \Delta_{\text{CLK-INT}}^{\text{nom}}) \) | \( t_{\text{setup}(2)}^{\text{MAX}} + \frac{1-\delta}{1+\delta}\, t_{\text{hold}(2)}^{\text{MAX}} + \frac{4\delta}{1+\delta}\, \Delta_{\text{CELL}}^{\text{nom}} \)
CASE 1: EXTRA DELAY | \( \Delta_{\text{DATA-SYN}} = 0 \) |
CASE 2: \( T_{\text{MIN}} \) | \( t_{\text{setup}(2)}^{\text{MAX}} + \frac{1+\delta}{1-\delta}\, t_{\text{hold}(2)}^{\text{MAX}} \) | \( t_{\text{setup}(2)}^{\text{MAX}} + \frac{1+\delta}{1-\delta}\, t_{\text{hold}(2)}^{\text{MAX}} \)
CASE 2: EXTRA DELAY | |
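The per-pair procedure of Section IV can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' tool. The closed forms are worked out here from the worst case hold and setup conditions of Section III, assuming every delay deviates by the same correlated factor between $1-\delta$ and $1+\delta$. The function names and the sign convention for the extra delay (positive into the clock path, negative into the data path) are hypothetical. The input values are those of Table II and Section V, and the fourth pair is taken to be an AAC cell feeding an AAC cell.

```python
def counterflow(t_hold_max, t_setup_max, cell, data_int, clk_int, delta):
    """Return (T_min, extra data-path delay) in ps for counterflow clocking."""
    path = cell + data_int + clk_int            # nominal clock-to-data-arrival delay
    if t_hold_max <= (1 - delta) * path:        # case 1: no race, nothing to add
        return t_setup_max + (1 + delta) * path, 0.0
    # Case 2: pad the data path until the worst case hold condition is just met.
    extra = t_hold_max / (1 - delta) - path
    return t_setup_max + (1 + delta) / (1 - delta) * t_hold_max, extra


def concurrent(t_hold_max, t_setup_max, cell, data_int, clk_int, delta):
    """Return (T_min, net extra delay) in ps for concurrent flow clocking.

    Assumed sign convention: a positive extra delay is inserted in the clock
    path, a negative one in the data path.
    """
    slack = t_hold_max - (1 - delta) * cell
    if slack <= 0:
        # Case 1: data path shorter than clock path (nominal difference <= 0).
        diff = slack / (1 + delta)
        t_min = t_setup_max + (1 + delta) * cell + (1 - delta) * diff
    else:
        # Case 2: data path longer than clock path (nominal difference > 0).
        diff = slack / (1 - delta)
        t_min = t_setup_max + (1 + delta) * cell + (1 + delta) * diff
    extra = (data_int - clk_int) - diff         # clock-path minus data-path padding
    return t_min, extra


# (driver clock-to-output delay, receiver max hold, receiver max setup), in ps
PAIRS = {
    "PSR-PSR": (23, -27, 46),
    "PSR-AND": (23, 12, -1),
    "AND-AAC": (30, -10, 27),
    "AAC-AAC": (24, -10, 27),
}


def analyze(delta=0.20, data_int=0, clk_int=13):
    """Apply both schemes to every communicating pair."""
    rows = {}
    for name, (cell, t_hold_max, t_setup_max) in PAIRS.items():
        cf = counterflow(t_hold_max, t_setup_max, cell, data_int, clk_int, delta)
        cc = concurrent(t_hold_max, t_setup_max, cell, data_int, clk_int, delta)
        rows[name] = (cf, cc)
    return rows


if __name__ == "__main__":
    for name, (cf, cc) in analyze().items():
        print(f"{name}: counterflow {cf[0]:.1f} ps, "
              f"concurrent {cc[0]:.1f} ps, extra delay {cc[1]:+.1f} ps")
```

The largest period over all pairs sets the maximum clock frequency of the circuit. Rounded to the nearest picosecond, the periods produced by this sketch agree with the entries of Table III.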

V. EXAMPLE CIRCUIT

To illustrate the design methodology described in section IV, and to compare the effectiveness of concurrent and counterflow clocking, we apply both schemes to a practical example circuit: the operational unit of the decimation digital filter, first described in [13]. The unit consists of three types of cells: the Parallel Shift Register (PSR), the AND gate (AND), and the Adder-Accumulator (AAC), connected together as shown in Fig. 3. The nominal values of the timing parameters for all cells are given in Table II together with the maximum values of the hold and setup times, which correspond to deviations in the Hypres, Inc. fabrication process, as determined in [11]. There are only four distinct pairs of communicating cells in the filter (see Fig. 3). For each pair of communicating cells, we assume \( \Delta_{\text{DATA-INT}} = 0 \) and \( \Delta_{\text{CLK-INT}} = 13 \) ps (equal to the delay of the splitter plus the delay of one extra JTL stage). In Table III, we provide the minimum clock period imposed by each pair of cells for each clocking scheme, as well as the corresponding synchronization delays. Both values for both clocking schemes are obtained from the general formulas shown in Table I. In Fig. 4, we present the maximum clock frequency of the filter for counterflow and concurrent flow clocking, compared to the intrinsic speed of the slowest component cell for various delay deviations. The intrinsic minimum clock period of the cell is

\[
T_{\text{MIN}} = t_{\text{setup}(2)}^{\text{MAX}} + t_{\text{hold}(2)}^{\text{MAX}}. \tag{19}
\]

We see that concurrent flow clocking has a significant performance advantage over counterflow clocking. For \( \delta = 20\% \) (our estimate for fabrication technology currently in use), Table III and Fig. 4 show that concurrent flow clocking offers a speed-up of \( 89/43 = 2.1 \) over counterflow clocking. The minimum clock period of the filter is limited by the

TABLE II

VALUES OF TIMING PARAMETERS FOR THE FILTER COMPONENT CELLS

<table>
<thead>
<tr>
<th>cell type</th>
<th>clock-to-output delay (ps)</th>
<th>hold time (ps)</th>
<th>setup time (ps)</th>
<th>max. hold time (ps)</th>
<th>max. setup time (ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>PSR</td>
<td>23</td>
<td>-31</td>
<td>42</td>
<td>-27</td>
<td>46</td>
</tr>
<tr>
<td>AND</td>
<td>30</td>
<td>-10</td>
<td>-3</td>
<td>12</td>
<td>-1</td>
</tr>
<tr>
<td>AAC</td>
<td>24</td>
<td>-13</td>
<td>24</td>
<td>-10</td>
<td>27</td>
</tr>
</tbody>
</table>

TABLE III

MINIMUM CLOCK PERIODS IMPOSED BY EACH PAIR OF COMMUNICATING CELLS (FOR RELATIVE DELAY DEVIATION \( \delta = 20\% \))

<table>
<thead>
<tr>
<th rowspan="2">CELL PAIR</th>
<th colspan="2">COUNTERFLOW</th>
<th colspan="2">CONCURRENT FLOW</th>
</tr>
<tr>
<th>\( T_{\text{MIN}} \) (ps)</th>
<th>EXTRA DELAY (ps)</th>
<th>\( T_{\text{MIN}} \) (ps)</th>
<th>EXTRA DELAY (ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>PSR-PSR</td>
<td><b>89</b></td>
<td>0</td>
<td><b>43</b></td>
<td>25</td>
</tr>
<tr>
<td>PSR-AND</td>
<td>42</td>
<td>0</td>
<td>22</td>
<td>8</td>
</tr>
<tr>
<td>AND-AAC</td>
<td>79</td>
<td>0</td>
<td>40</td>
<td>15</td>
</tr>
<tr>
<td>AAC-AAC</td>
<td>71</td>
<td>0</td>
<td>36</td>
<td>11</td>
</tr>
</tbody>
</table>

Fig. 3. General structure of the example circuit - operational unit of the decimation digital filter. Clock distribution network corresponds to the concurrent flow clocking scheme. Optimum values of the extra delays in the clock and data paths have been selected according to Table III.
PSR-PSR connection for either clocking scheme (shown as bold entries in Table III). We see that further improvements in the technology for fabricating superconducting digital circuits can yield a dramatic increase in the operational speed of the circuit. For an ideal fabrication technology (\( \delta = 0 \)), concurrent flow clocking takes full advantage of the intrinsic speed of the gates and outperforms counterflow clocking by a factor of seven.
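The dependence of the speed-up on the delay deviation can be illustrated for the limiting PSR-PSR pair. The sketch below is a rough illustration, not a reproduction of Fig. 4. It assumes that the maximum hold and setup times move linearly with \( \delta \) between their nominal values (at \( \delta = 0 \)) and the Table II maxima (at \( \delta = 20\% \)), following the proportionality noted in Section III. The two closed forms are worked out from the worst case timing conditions under correlated delay deviations and apply while the hold time stays below \( (1-\delta) \) times the cell delay. The function name is hypothetical.

```python
def psr_psr_periods(delta):
    """Minimum clock periods (ps) of the PSR-PSR pair for a relative
    delay deviation `delta` between 0 and 0.20."""
    cell, data_int, clk_int = 23.0, 0.0, 13.0      # ps, Table II and Section V
    scale = delta / 0.20                            # assumed linear interpolation
    t_hold_max = -31.0 + scale * (-27.0 - (-31.0))
    t_setup_max = 42.0 + scale * (46.0 - 42.0)

    # Counterflow, no extra data delay needed (hold time is negative).
    t_counter = t_setup_max + (1 + delta) * (cell + data_int + clk_int)

    # Concurrent flow with the clock path padded so the worst case
    # hold condition is met with equality.
    t_concur = t_setup_max + ((1 - delta) * t_hold_max
                              + 4 * delta * cell) / (1 + delta)
    return t_counter, t_concur


if __name__ == "__main__":
    for delta in (0.0, 0.05, 0.10, 0.15, 0.20):
        t_counter, t_concur = psr_psr_periods(delta)
        print(f"delta={delta:.2f}: counterflow {t_counter:.1f} ps, "
              f"concurrent {t_concur:.1f} ps, "
              f"speed-up {t_counter / t_concur:.1f}")
```

At the two end points the sketch gives speed-ups of about 7.1 and 2.1, in line with the factors of seven and two quoted in the text.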

Our theoretical expectations have been confirmed by simulations of the entire operational unit of the decimation filter using the RSFQ logic analyzer — URSULA [13]. Portions of the circuit have also been simulated with SPICE, giving consistent results. The optimized clock distribution network of the filter is shown in Fig. 3.

This paper has not considered constraints that may be imposed by multiple-input RSFQ cells on the minimum separation time between pulses appearing on two different data inputs of the cell [12]. Note that one must also be aware of constraints imposed by the exchange of information between the circuit and its environment (e.g., the control unit).

VI. CONCLUSIONS

In our paper we present a novel synchronization scheme developed specifically for medium to large scale RSFQ circuits. Concurrent flow clocking is compared with the standard counterflow clocking. Our analysis considers practical issues, such as the deviations of timing parameters due to variations in the fabrication process.

Concurrent flow clocking outperforms the conventional counterflow clocking scheme for a large variety of circuit types and a wide range of parameter deviations. The advantage of concurrent flow clocking in RSFQ logic is significantly greater than for semiconductor-based logic. This is due to the following unique features of RSFQ digital electronics:

- In RSFQ circuits, the combinational components of the circuit are combined with storage components (registers) to form simple RSFQ cells. This results in smaller differences between the maximum and minimum delay of the data path between two sequentially-adjacent synchronous cells.
- RSFQ synchronous cells demonstrate excellent correlation between the hold and setup times (to be shown in [11]).
- RSFQ circuits provide well controlled and mutually correlated delays using Josephson transmission lines.

Still, for present day fabrication technologies we show, using a digital filter as an example, only a factor of two performance advantage when using concurrent flow clocking. This smaller factor is due to the primitive state of today's Josephson junction circuit fabrication technology, primitive in comparison with the many mature commercial semiconductor circuit fabrication technologies. However, as superconductive technologies mature, a dramatic increase in circuit speed with concurrent flow clocking may be expected.

In any case, the performance improvements offered by concurrent flow clocking entail increased logic and layout design complexity. Thus, the choice of clocking scheme for a particular circuit is application-specific. Our paper is intended to provide a sufficient theoretical background for choosing an appropriate clock distribution design methodology for RSFQ circuits.

ACKNOWLEDGMENT

The authors would like to thank Q. P. Herr, S. S. Martinet and Q. Ke for providing optimized designs of RSFQ cells and for many discussions, and the anonymous reviewer for helpful comments.

REFERENCES