Clock Distribution Design in VLSI Circuits - an Overview

Eby G. Friedman
Department of Electrical Engineering
University of Rochester
Rochester, New York 14627 USA

Abstract - Clock distribution networks synchronize the flow of data signals between data paths, and the design of these networks can dramatically affect system-wide performance and reliability. Significant attention to this research area exists within both the industrial and academic communities, and a diverse spectrum of results has been developed. The field of clock distribution design can be grouped into four sub-topics: 1) circuit and layout techniques for structured custom VLSI systems, 2) the automated synthesis of clock distribution networks with application to automated placement and routing of gate arrays, standard cells, and larger block-oriented circuits, 3) the analysis and modeling of the timing characteristics of clock distribution networks, and 4) the specification of the optimal timing characteristics of clock distribution networks based on architectural and functional performance requirements. Each of these areas is described and summarized, and future trends are discussed.

I. INTRODUCTION

In a synchronous digital system, the global clock signal is used to define a relative time reference for the movement of data within that system. Because this function is vital to the operation of a synchronous system, much attention has been given to the characteristics of these clock signals and the networks used in their distribution. Most synchronous digital systems consist of cascaded banks of sequential registers with combinatorial logic between each set of registers. The functional requirements of the digital system are satisfied by the logic stages, while the global performance and local timing requirements are satisfied by the careful insertion of pipeline registers into equally spaced time windows to satisfy critical worst case timing constraints and by the proper design of the clock distribution network to satisfy critical timing requirements as well as to ensure that no race conditions exist [1-16].

Each data signal typically is stored in a latched state within a bistable register awaiting the incoming clock signal, which defines when the data should leave the register. Once the enabling clock signal reaches the register, the data signal leaves the bistable register, propagates through the combinatorial network, and, for a properly working system, enters the next register and is fully latched into that register before the next clock signal appears. The delay components that make up a general synchronous system are composed of the following three individual subsystems [17, 18]: 1) the memory storage elements, 2) the logic elements, and 3) the clocking circuitry and distribution. This paper provides an overview of the research which describes the interplay among these three subsystems, particularly how the timing characteristics of the memory and logic elements constrain the design and synthesis of clock distribution networks.

A schematic of a generalized synchronized data path is presented in Fig. 1, where Cj and Ck represent the clock signals driving the initial register and the final register, respectively, and both originate from the same clock signal source. The clock delay of the initial clock signal Tc0 and the final clock signal Tc1 define the time reference when the data signals begin to leave their respective registers. These clock signals originate from a clock distribution network which is designed to generate a specific clock signal waveform which synchronizes each register. The difference in delay between two sequentially adjacent clock paths, as shown in (1), is the clock skew Tskm. If the clock signals Cj and Ck are in complete synchronism (i.e., the clock signals arrive at their respective registers at exactly the same time), the clock skew is zero.

\[ T_{skm} = T_{c0} - T_{c1} \qquad (1) \]

II. TIMING CONSTRAINTS DUE TO CLOCK SKEW

The magnitude and polarity of the clock skew can have a significant effect on system performance and reliability. Depending upon whether Cj leads or lags Ck and upon the magnitude of Tskm with respect to the delay of the data path, system performance and reliability can either be degraded or enhanced. These cases are discussed below.

A. Maximum Data Path/Clock Skew Constraint Relationship

For a design to meet its specified timing requirements, the greatest collective propagation delay of any data path between a pair of data registers, Rj and Rk, being synchronized by a clock distribution network must be less than the minimum clock period (the inverse of the maximum clock frequency) of the circuit as shown in (2) [5-7,10,12,13,15,16,19]. If the clock signal arrives at the final register of a data path (at time Tc1) before it arrives at the initial register of the same sequential data path (at time Tc0), the clock skew is referred to as positive clock skew and, under this condition, the maximum attainable operating frequency is decreased. Positive clock skew is the additional amount of time which must be added to the minimum clock period to reliably apply a new clock signal at the final register, where reliable operation implies that the system will function correctly at low as well as at high frequencies.

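In the positive clock skew case, the clock signal arrives at Rk before it reaches Rj. From (2) and (3), the maximum permissible positive clock skew can be expressed as [5-7,10,12,13,15,16,19]

$$T_{\text{skew}} \leq T_{CP} - T_{PD(\max)} = T_{CP} - \left(T_{C\text{-}Q} + T_{\text{Logic(max)}} + T_{\text{Int}} + T_{\text{Set-up}}\right) \quad \text{for } T_{c0} > T_{c1}, \qquad (4)$$

where $T_{CP}$ is the clock period and $T_{PD(\max)}$ is the maximum delay of the data path, made up of the clock-to-output delay of the initial register $T_{C\text{-}Q}$, the maximum logic delay $T_{\text{Logic(max)}}$, the interconnect delay $T_{\text{Int}}$, and the set-up time of the final register $T_{\text{Set-up}}$.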

This situation is the typical critical path timing analysis requirement commonly seen in high performance synchronous digital systems. In circuits where positive clock skew is significant and (4) is not satisfied, the clock and data signals should be run in the same direction, thereby forcing $C_k$ to lag $C_j$ and making the clock skew negative.
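The maximum constraint can be checked numerically. The following is a minimal sketch, assuming the path delay is the sum of a clock-to-output delay, a worst-case logic delay, an interconnect delay, and a set-up time, and that the skew is positive when the clock reaches the final register first. The class and function names and all delay values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LocalDataPath:
    """Delay components of one local data path (hypothetical values, in ns)."""
    t_cq: float         # clock-to-output delay of the initial register
    t_logic_max: float  # worst-case delay of the logic stages
    t_int: float        # interconnect delay
    t_setup: float      # set-up time of the final register

    def max_path_delay(self) -> float:
        return self.t_cq + self.t_logic_max + self.t_int + self.t_setup


def max_positive_skew(path: LocalDataPath, clock_period: float) -> float:
    """Largest positive skew the path tolerates at the given clock period."""
    return clock_period - path.max_path_delay()


def min_clock_period(path: LocalDataPath, skew: float) -> float:
    """Positive skew lengthens the minimum period; negative skew shortens it."""
    return path.max_path_delay() + skew


path = LocalDataPath(t_cq=1, t_logic_max=6, t_int=1, t_setup=1)
print(max_positive_skew(path, clock_period=10))  # 1
print(min_clock_period(path, skew=2))            # 11
```

With these assumed numbers, a 9 ns path leaves 1 ns of tolerable positive skew at a 10 ns period, and 2 ns of positive skew pushes the minimum period to 11 ns.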

Fig. 2. Clock Timing Diagrams

B. Minimum Data Path/Clock Skew Constraint Relationship

If the clock signal arrives at $R_j$ before it reaches $R_k$ (see Fig. 2B), the clock skew is defined as being negative. Negative clock skew can be used to improve the maximum performance of a synchronous system by decreasing the delay of a critical path; however, a potential minimum constraint can occur, creating a race condition [12,15,16,20-23]. In this case, when $C_k$ lags $C_j$, the clock skew must be less than the time required for the data to leave the initial register, propagate through the interconnect and combinatorial logic, and set up in the final register (see Fig. 1). If this condition is not met, the data stored in register $R_k$ is overwritten by the data that had been stored in register $R_j$ and has propagated through the combinatorial logic. Correct operation requires that $R_k$ latches data which correspond to the data $R_j$ latched during the previous clock period. This constraint on clock skew is

$$|T_{\text{skew}}| \leq T_{PD(\min)} = T_{C\text{-}Q} + T_{\text{Logic(min)}} + T_{\text{Int}} + T_{\text{Set-up}} \quad \text{for } T_{c1} > T_{c0}, \qquad (5)$$

where $T_{PD(\min)}$ is the minimum delay of the data path, made up of the clock-to-output delay of the initial register $T_{C\text{-}Q}$, the minimum logic delay $T_{\text{Logic(min)}}$, the interconnect delay $T_{\text{Int}}$, and the set-up time of the final register $T_{\text{Set-up}}$.

An important example in which this minimum constraint can occur is in those designs which use cascaded registers, such as a serial shift register or a k-bit counter. In cascaded register circuits, $T_{\text{Logic(min)}}$ is zero and $T_{\text{Int}}$ approaches zero (since cascaded registers are typically designed, at the geometric level, to abut). If $T_{c1} > T_{c0}$ (i.e., negative clock skew), then the minimum constraint becomes

$$|T_{\text{skew}}| \leq T_{C\text{-}Q} + T_{\text{Set-up}} \quad \text{for } T_{c1} > T_{c0}, \qquad (6)$$

and all that is necessary for the system to malfunction is a poor relative placement of the flip flops or a highly resistive connection between $C_j$ and $C_k$. In a circuit configuration such as a shift register or counter, where negative clock skew is a more serious problem than positive clock skew, provision should be made to force $C_k$ to lead $C_j$.
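The minimum constraint can be illustrated in the same way. The sketch below follows the description in the text, where the minimum path delay is the time to leave the initial register, propagate through the logic and interconnect, and set up in the final register. The function names and delay values are hypothetical, and the skew is taken as negative when the clock reaches the initial register first.

```python
def min_path_delay(t_cq, t_logic_min, t_int, t_setup):
    """Minimum data path delay as described in the text (all values in ns)."""
    return t_cq + t_logic_min + t_int + t_setup


def has_race(skew, t_pd_min):
    """True when negative skew exceeds, in magnitude, the minimum path delay."""
    return skew < 0 and abs(skew) > t_pd_min


# A shift register stage: no logic between registers, abutting layout,
# so the logic and interconnect terms are taken as zero.
stage = min_path_delay(t_cq=2, t_logic_min=0, t_int=0, t_setup=1)
print(stage)                               # 3
print(has_race(skew=-2, t_pd_min=stage))   # False
print(has_race(skew=-4, t_pd_min=stage))   # True
```

In the cascaded register case only the register delays protect against the race, so a modest amount of negative skew (here 4 ns against a 3 ns bound) is enough to cause a malfunction.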

As higher levels of integration are achieved, on-chip testability becomes necessary. Data registers, configured in the form of serial set/scan chains when operating in the test mode, are a common example of a built-in test design technique. The placement of these circuits is typically optimized around the functional flow of the data. When the system is reconfigured to use the registers in the role of the set/scan function, different path delays are possible. In particular, the clock skew of the local data path can be negative and greater in magnitude than the local register delays. Therefore, with increased negative clock skew, (6) may not be satisfied and the incorrect data will latch into the final register of the reconfigured local data path.

Also, in ideal scaling of MOS devices, all linear dimensions and voltages are multiplied by the factor 1/S, where $S > 1$. Device dependent delays, such as the register clock-to-output delay $T_{C\text{-}Q}$, the set-up time $T_{\text{Set-up}}$, and the logic delay $T_{\text{Logic}}$, scale as 1/S, while interconnect dominated delays such as $T_{\text{skew}}$ remain constant to first order and, if fringing capacitance is considered, actually increase with decreasing dimensions. Therefore, when examining dimensional scaling, (5) and (6) should be considered carefully.
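The effect of scaling on the cascaded register constraint can be sketched as follows, assuming the register delays shrink exactly as 1/S while the skew magnitude stays fixed. The function name and the numerical values are hypothetical.

```python
def scaled_margin(t_cq, t_setup, skew_magnitude, s):
    """Margin left in the cascaded-register constraint after scaling by s >= 1.

    Register delays are assumed to scale as 1/s; the interconnect-dominated
    skew is assumed constant. A negative margin means the constraint fails.
    """
    register_delay = (t_cq + t_setup) / s
    return register_delay - skew_magnitude


for s in (1, 2, 4):
    print(s, scaled_margin(t_cq=2.0, t_setup=1.0, skew_magnitude=1.0, s=s))
# 1 2.0
# 2 0.5
# 4 -0.25
```

Under these assumptions a design with a comfortable 2 ns margin violates the constraint once the devices are scaled by a factor of four, even though the skew itself has not changed.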

C. Enhancing Synchronous Performance by Applying Negative Clock Skew

Negative clock skew can be used to improve synchronous performance by minimizing the delay of the critical worst case data paths [16-20,21,22]. By forcing $C_j$ to lead $C_k$ at each critical local data path, excess time is shifted from the neighboring less critical local data paths to the critical local data paths. This negative clock skew represents the additional amount of time that the data signal at $R_j$ has to propagate through the logic stages and interconnect sections and into the final register. Negative clock skew subtracts from the logic path delay, thereby decreasing the minimum clock period. This, in effect, increases the total time that a given critical data path has to accomplish its functional requirements by giving the data signal released from $R_j$ more time to propagate through the logic and interconnect stages and latch into $R_k$. Thus, the differences in delay between the local data paths are minimized, thereby compensating for any inefficient partitioning of the global data path into local data paths, which often occurs in practical systems.

The maximum permissible negative clock skew of a data path, however, is dependent upon the clock period itself as well as the time delay of the previous data paths. This results from the structure of the serially cascaded local data paths making up the global data path. Since a particular clock signal synchronizes a register which functions in a dual role, as the initial register of the next local data path and as the final register of the previous data path, the earlier $C_j$ is for a given data path, the earlier that same clock signal, now $C_k$, is for the previous data path. Thus, the use of negative clock skew in a given path results in a positive clock skew for the preceding path, which may then establish the new upper limit for the system clock frequency. It should be emphasized that in [12,15], Hatamian designates the lead/lag clock skew polarity (positive/negative clock skew) notation as the opposite of that used here. Furthermore, different terms have been used in the literature to describe negative clock skew, such as "double-clocking" [16], "deskewing data pulses" [20], "cycle stealing" [22,23], "useful clock skew" [24], and "prescribed skew" [25].
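The coupling between adjacent local data paths can be made concrete with a short sketch. It assumes a serial chain of registers, takes the skew of each path as the initial clock delay minus the final clock delay, and ignores the race constraint for brevity. The function name and all delays are hypothetical.

```python
def chain_min_period(path_delays, clock_delays):
    """Minimum clock period of serially cascaded local data paths.

    path_delays[i]  : maximum delay from register i to register i + 1
    clock_delays[i] : clock arrival time at register i
    """
    periods = []
    for i, t_pd in enumerate(path_delays):
        skew = clock_delays[i] - clock_delays[i + 1]  # initial minus final
        periods.append(t_pd + skew)
    return max(periods)


paths = [8, 12]                              # second path is critical
print(chain_min_period(paths, [0, 0, 0]))    # 12: zero skew everywhere
print(chain_min_period(paths, [2, 0, 2]))    # 10: middle clock 2 ns early
print(chain_min_period(paths, [4, 0, 4]))    # 12: first path now limits
```

Making the middle register's clock earlier gives the critical second path negative skew and the first path an equal positive skew. Two nanoseconds balances the two paths, while four nanoseconds moves the limit to the preceding path.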

III. CLOCK DISTRIBUTION DESIGN OF STRUCTURED CUSTOM VLSI CIRCUITS

Many different approaches, from ad hoc to algorithmic, have been developed for designing clock distribution networks in VLSI circuits. These approaches range from symmetric H-tree distribution networks [8,24], which ensure zero clock skew, to compensation techniques which minimize the variation of interconnect impedances and capacitive loads between clock distribution paths [7,20,26-28] by adding positive delay elements, sizing the W/L ratios of the transistors in the distributed buffers, or by other means. A number of specific examples of clock distribution circuits are described in the literature [1,6,12,28-30]. In each of these clock distribution networks, significant effort has been placed on accurately estimating the magnitude of the resistive and capacitive interconnect impedances to determine their effect on the shape of the clock pulse waveform. This information is typically back annotated into a SPICE-like circuit simulator to adjust the clock delays for minimum clock skew. Minimal work exists, however, in developing physical models which merge distributed RC interconnect delay models with distributed buffer delay models in order to estimate clock skews. The difficulty is that the accuracy required in calculating delay differences is much greater than that required when calculating absolute delay values.

Furthermore, in addition to being designed, these circuits must also be tested. Keezer [31] describes a functional test system for specifically evaluating the time differences in clock distribution networks.
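The symmetry argument behind the H-tree can be checked with a small geometric sketch. It assumes an idealized recursive H-tree in which each level halves the branch length, and it measures only wire length from the root. Equal length implies equal delay only if wire widths, buffers, and loads are also matched. The function name and dimensions are hypothetical.

```python
def h_tree_leaves(x, y, half, levels, length=0.0):
    """Return (x, y, wire_length_from_root) for every leaf of an ideal H-tree."""
    if levels == 0:
        return [(x, y, length)]
    leaves = []
    for dx in (-half, half):        # along the horizontal bar of the H
        for dy in (-half, half):    # then along a vertical stroke
            leaves += h_tree_leaves(x + dx, y + dy, half / 2, levels - 1,
                                    length + abs(dx) + abs(dy))
    return leaves


leaves = h_tree_leaves(0.0, 0.0, half=4.0, levels=3)
lengths = {wire for _, _, wire in leaves}
print(len(leaves), lengths)  # 64 {14.0}
```

All 64 leaves sit at the same wire length from the root, which is the property that the symmetric structure relies upon to balance the clock delays.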

IV. AUTOMATED SYNTHESIS AND LAYOUT OF CLOCK DISTRIBUTION NETWORKS

Different approaches have been taken in the automated synthesis and layout of clock distribution networks, ranging from procedural behavioral synthesis of pipelined registers [32-34] to the automated layout of clock distribution nets in gate arrays and standard cells [35-46]. In the area of automated layout, two research paths have been initially taken, though with time these approaches should converge. One path is oriented to the support of commercial semiconductor foundries and their design tools [35-37,40,46], in which a variety of approaches are in use. These are oriented around increasing the prioritization of clock signal nets over data signal nets and connecting these clock nets to previously placed distributed local buffers. These buffers are used for amplifying the clock signals as these signals traverse long interconnect sections. Empirical delay models coupled with back annotation are typically used to model the clock path delays, and either the clock skew or the clock timing is analyzed. The skew is then compensated for, thereby forcing the clock skew to an acceptable magnitude.

A second research path has been the development of algorithms which carefully control the variations in delay among clock signal nets so as to minimize clock skew [38,39,41-45]. These results tend to use simplistic delay models, such as linear delay, where the delay is linearly related to the path length, or the Elmore delay, where the delay along a path is the sum, over each resistive segment of the path, of the segment resistance multiplied by the capacitance downstream of that segment. The fundamental difficulty with both of these delay models, however, is their inability to accurately characterize the effects of active devices, such as distributed buffers, when estimating delay, as well as more subtle considerations such as bias dependent loading and varying waveform shapes. Focus has been placed on minimizing total wirelength, metal-to-metal contacts and crossovers, as well as system-wide clock skew.
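The Elmore model mentioned above can be sketched for an RC tree. The code assumes the standard definition, in which each segment contributes its resistance multiplied by the total capacitance downstream of it. The function name, the tree encoding, and the element values are hypothetical.

```python
def elmore_delays(parent, resistance, capacitance):
    """Elmore delay from the root (node 0) to every node of an RC tree.

    parent[i]      : index of node i's parent (None for the root);
                     each parent index must be smaller than its child's
    resistance[i]  : resistance of the segment joining node i to its parent
    capacitance[i] : capacitance lumped at node i
    """
    n = len(parent)
    downstream = list(capacitance)
    for i in range(n - 1, 0, -1):       # accumulate subtree capacitance
        downstream[parent[i]] += downstream[i]
    delay = [0.0] * n
    for i in range(1, n):               # add each segment's R * C_downstream
        delay[i] = delay[parent[i]] + resistance[i] * downstream[i]
    return delay


# Root 0 drives node 1, which branches to leaves 2 and 3.
delays = elmore_delays(parent=[None, 0, 1, 1],
                       resistance=[0, 1, 2, 1],
                       capacitance=[0, 1, 1, 2])
print(delays)                  # [0.0, 4.0, 6.0, 6.0]
print(delays[2] - delays[3])   # 0.0
```

The two branches differ in both resistance and load, yet their Elmore delays match, which is the kind of balance the skew-minimizing layout algorithms aim for. Buffers and waveform shape are outside this model, as the text notes.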

Localized clock distribution [33] has not yet been considered in automated layout or physical synthesis. However, early work in applying local clock skew to behavioral synthesis is described in [33,34]. In these papers, the delay equations characterizing a local data path, (2) and (3), are used to incorporate the effects of local clock distribution delays on timing by assuming regions of similar clock delay. Thus, as registers are moved from one region to another during the retiming process, the placed registers assume the clock delay of the new physical region. This permits clock skew to be determined locally at each iteration of the retiming process.
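The region-based bookkeeping described above can be pictured with a few lines of code. This is only an illustrative sketch, not the procedure of [33,34]: the region names, clock delays, and register names are hypothetical, and the skew is taken as the initial clock delay minus the final clock delay.

```python
# Hypothetical clock delay (ns) assumed uniform within each physical region.
region_clock_delay = {"A": 1.0, "B": 1.5, "C": 2.5}


def local_skew(placement, initial_reg, final_reg):
    """Skew of a local data path from the regions its registers occupy."""
    return (region_clock_delay[placement[initial_reg]]
            - region_clock_delay[placement[final_reg]])


placement = {"r1": "A", "r2": "B"}
print(local_skew(placement, "r1", "r2"))  # -0.5

placement["r2"] = "A"                     # a retiming step moves r2 to region A
print(local_skew(placement, "r1", "r2"))  # 0.0
```

Because the skew depends only on the regions involved, it can be re-evaluated cheaply after every register move.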

V. ANALYSIS AND MODELING OF THE TIMING CHARACTERISTICS OF CLOCK DISTRIBUTION NETWORKS

This research area has taken a number of disparate paths, all of which have in common the attributes of modeling the general characteristics of clock distribution networks. For example, Shoji [47] describes a method for minimizing clock skew induced by variances in process parameters. N-channel and P-channel parameters tend not to track each other as a process varies. Furthermore, the response times of these devices tend to move in opposite directions. Shoji quantitatively describes how the delay of the P-channel and N-channel transistors within the distributed buffers of the clock distribution network should be individually matched to ensure that as the process varies, the path delays of different clock paths track each other. Kugelmass and Steenkiste [48,49] describe a statistical approach for estimating clock skew. They provide upper bounds on clock skew assuming a Gaussian distributed clock delay with a variance proportional to the wire length. This approach to estimating clock skew is quite different from classical deterministic techniques that are used within industry and are described throughout the literature.
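The statistical view can be illustrated with a simple calculation. The sketch below is not the bound derived in [48,49]. It only assumes that the two clock path delays are independent Gaussian variables whose variances equal a hypothetical constant k times the wire length, so that the skew between them is Gaussian with the sum of the two variances.

```python
import math


def skew_std(k, len_a, len_b):
    """Standard deviation of the skew between two independent clock paths."""
    return math.sqrt(k * (len_a + len_b))


def prob_skew_exceeds(bound, k, len_a, len_b, mean_diff=0.0):
    """Probability that the skew magnitude exceeds the given bound."""
    sigma = skew_std(k, len_a, len_b)

    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mean_diff) / (sigma * math.sqrt(2.0))))

    return 1.0 - (cdf(bound) - cdf(-bound))


print(skew_std(k=0.25, len_a=8, len_b=8))   # 2.0
print(round(prob_skew_exceeds(2.0, k=0.25, len_a=8, len_b=8), 4))
```

With nominally balanced paths (zero mean difference), the chance of exceeding a one-sigma skew bound is roughly 32 percent, which shows why a deterministic zero-skew design can still exhibit appreciable skew.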

An important research area in VLSI circuits is timing analysis, where simplified RC models are used to estimate the delay through CMOS circuits. In these systems, clock characteristics are provided to a timing analyzer to define application-specific temporal constraints, such as minimum clock periods or hold times, on the functional timing of a specific synchronous system [50]. In [22,23], Tsay and Lin continue this approach by describing an innovative timing analyzer which considers negative clock skew, i.e., time is "stolen" from adjacent data paths to increase system performance. In [51], Dagenais and Rumin describe a timing analysis system which generates important clocking parameters from a circuit description of the system, such as minimum clock periods and hold times. This approach is useful for top-down design when performing exploratory estimation of system performance.

VI. SPECIFICATION OF THE OPTIMAL TIMING CHARACTERISTICS OF CLOCK DISTRIBUTION NETWORKS

Before the design of a clock distribution network can commence, certain timing constraints and goals must be specified. These timing traits are typically application specific and depend greatly on the architectural and circuit tradeoffs of a given system implementation. A number of papers exist which consider different aspects of these architectural tradeoffs. For example, Friedman and Mulligan [17,18] describe the tradeoff between latency and clock frequency when pipelining a synchronous digital system. They provide equations and graphical techniques to determine the optimal level of pipelining. Fishburn [16] describes a linear program for choosing the optimal clock delays, thereby solving the problem which defines the logical assignment of clock skew. Fishburn focuses on minimizing the clock period while avoiding "clock hazards," i.e., race conditions. Many papers [23,5-7,12,15,19,21,33] provide similar kinds of timing constraint equations as discussed in Section II of this paper.
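The latency versus clock frequency tradeoff can be seen in a deliberately simplified model. It is not the set of equations given in [17,18]: it assumes the total logic delay divides evenly among the pipeline stages and that each stage adds a fixed register overhead. The function name and values are hypothetical.

```python
def pipeline_metrics(total_logic_delay, stages, register_overhead):
    """Return (clock_period, latency) for an evenly partitioned pipeline."""
    period = total_logic_delay / stages + register_overhead
    return period, stages * period


for n in (1, 2, 4, 8):
    period, latency = pipeline_metrics(16.0, n, 2.0)
    print(n, period, latency)
# 1 18.0 18.0
# 2 10.0 20.0
# 4 6.0 24.0
# 8 4.0 32.0
```

Each doubling of the pipeline depth shortens the clock period by less than the previous one while the latency grows steadily, which is why an optimal level of pipelining exists for a given application.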

Sakallah et al. [52] followed by Szymanski [53] analyze the optimal clocking of synchronous circuits using linear programming techniques. Each group utilizes timing constraint relationships to generate clock schedules for improving the performance of synchronous systems.
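The clock scheduling problem can be sketched without a linear programming package. The code below is not the formulation of [16], [52], or [53]. It substitutes an equivalent technique for this simplified constraint set: for a fixed clock period, the maximum and minimum skew constraints of Section II are difference constraints, which a Bellman-Ford relaxation can test for feasibility, and a binary search then finds the smallest feasible period. The function names, the path list, and all delays are hypothetical.

```python
def feasible(period, n_regs, paths):
    """Return clock delays meeting all constraints, or None if none exist.

    paths holds tuples (i, j, d_min, d_max) for a path from register i to j.
    With skew = t[i] - t[j]:  t[i] - t[j] <= period - d_max  (long path)
                              t[j] - t[i] <= d_min           (race)
    """
    edges = []
    for i, j, d_min, d_max in paths:
        edges.append((j, i, period - d_max))
        edges.append((i, j, d_min))
    dist = [0.0] * n_regs
    for _ in range(n_regs):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            base = min(dist)
            return [d - base for d in dist]
    return None  # still relaxing: a negative cycle, so infeasible


def schedule_min_period(n_regs, paths, tol=1e-6):
    """Binary search for the smallest period that admits a clock schedule."""
    lo = 0.0
    hi = float(max(d_max for _, _, _, d_max in paths))  # zero skew works here
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid, n_regs, paths) is None:
            lo = mid
        else:
            hi = mid
    return hi, feasible(hi, n_regs, paths)


paths = [(0, 1, 2, 8), (1, 2, 5, 12), (2, 0, 2, 7)]  # a three-register loop
period, delays = schedule_min_period(3, paths)
print(round(period, 3), [round(d, 3) for d in delays])
```

For this loop the zero-skew period is set by the 12 ns path, whereas the scheduled period approaches the 9 ns average of the three path delays, the limit imposed by the loop as a whole.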

VII. DIRECTIONS FOR FUTURE RESEARCH IN THE DESIGN OF CLOCK DISTRIBUTION NETWORKS

Significant research still remains in the design of clock distribution networks. Much of it is currently focused on automating the synthesis of clock distribution networks to support higher performance requirements. Thus, the optimal placement of localized distributed buffers, improved delay models which account for non-linear active transistors, the use of negative clock skew to increase circuit speed, and integrated RC interconnect-buffer physical delay models, must be considered in the automated design and layout of clock distribution networks. The effects of clock skew, both positive and negative, must also be integrated into behavioral and RC timing analyzers so as to detect race conditions as well as satisfy performance constraints. Furthermore, synchronous timing constraints must be integrated into high level behavioral synthesis algorithms, thereby improving their accuracy and generality.

VIII. SUMMARY AND CONCLUSIONS

It is often cited that the design of the clock distribution network represents the fundamental circuit limitation to performance in high speed synchronous digital systems. The difficulty in the design of these networks is one of the primary reasons for the recent emphasis placed on asynchronous systems. Clearly, however, synchronous systems will be commonplace for a long time to come, necessitating improved techniques for designing and implementing high speed and reliable clock distribution networks. Furthermore, as control of the clocking parameters becomes tighter, approaches such as negative clock skew will be applied to the design of clock distribution networks to further enhance system performance.

A singular commentary on the current immaturity of the research area of clock distribution design is the lack of an agreed upon terminology and notation defining the primary concepts and terms. This is evidenced by the large variety of terms used to describe such issues as (using the notation defined in this paper) race conditions, negative clock skew, and $T_{\text{skew}}$. In summary, all electronic systems are fundamentally asynchronous in nature; by the careful insertion of precise localized timing relationships and storage elements, an asynchronous system can be adapted to appear to behave synchronously. This permits the use of clock frequency as a measure of how often new data appear at the output of a system, the key performance metric in synchronous systems. As long as specific local timing of functional relationships is satisfied, synchronous systems can be used, easing the timing constraints on data flow, albeit requiring a clock distribution network to provide the synchronizing reference signal.

REFERENCES