Research
 
Select Recent Publications
“A Performance-Correctness Explicitly-Decoupled Architecture”, Alok Garg and Michael Huang, in Proceedings of the 41st International Symposium on Microarchitecture, Nov. 2008
 

MICRO’08

“Replacing Associative Load Queues: A Timing-Centric Approach”, Fernando Castro, Regana Noor, Alok Garg, Dani Chaver, Michael Huang, Luis Pinuel, Manuel Prieto, and Francisco Tirado, in IEEE Transactions on Computers, 2008

(Based on papers published in MICRO’06 and ISLPED’06)
 

TC’08

“Injection-Locked Clocking: A Low-Power Clock Distribution Scheme for High-Performance Microprocessors”, Lin Zhang, Aaron Carpenter, Berkehan Ciftcioglu, Alok Garg, Michael Huang, and Hui Wu, in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(9):1251-1256, Sep. 2008

Extended technical report

 

TVLSI'08

“Supporting Highly-Decoupled Thread-Level Redundancy for Parallel Programs”, M. Wasiur Rashid and Michael Huang, in Proceedings of the 14th International Symposium on High-Performance Computer Architecture, Feb. 2008, pp. 393-404
 

HPCA'08

Generally speaking, we are interested in architecting future high-performance microprocessors and computer systems: how to design the processor microarchitecture, the memory hierarchy, and the communication substrate. A particular emphasis is to better understand the underlying device, circuit, and manufacturing technology, and develop practical architectural techniques to use new capabilities and address emerging challenges.


Below are some recent representative publications and descriptions of some of the topics we have explored.

Description of Research Topics and Findings (updated lazily)

Communication Substrate:


As microprocessor chips integrate a growing number of cores, the issue of interconnection becomes more important for overall system performance and efficiency.

Compared to traditional distributed shared-memory architectures, chip-multiprocessors offer a different set of design constraints and opportunities. As a result, a conventional packet-relay multiprocessor interconnect architecture is a valid, but not necessarily optimal, design point. For example, the advantage of off-the-shelf interconnect and the in-field scalability of the interconnect are less important in a chip-multiprocessor. On the other hand, even with worsening wire delays, packet switching represents a non-trivial component of overall latency. While much of the current research on on-chip communication substrates focuses on packet switching, we question its centrality, or even its necessity. We have investigated a number of cases where careful integration of device-, circuit-, and architecture-level designs provides compelling packet-switching-free solutions using either conventional or photonics technologies [ISCA’10, ISCA’11, ISLPED’11, PTL’11, ISCA’12, JETCAS’12, Carpenter PhD Thesis’12].



Reliability:


With aggressive device scaling, transistors require less and less energy to change state, which means that they are fundamentally more vulnerable to environmental noise, such as noise in the power supply, noise from the substrate, or disturbances due to particle strikes. Historically, for a variety of reasons, memory elements have been much more vulnerable than their logic counterparts in a system. Since memory can be protected with inexpensive error-correction codes, it has been relatively straightforward to protect general-purpose systems. However, device scaling and ever larger scale integration make logic elements increasingly important to protect. Unfortunately, typical random logic requires full-blown replication just to detect errors. While redundancy has long been studied, considerations for its practical application in future general-purpose systems are lacking.


We believe that, due to the nature of general-purpose systems, any such protection mechanism should only be activated on demand and should be non-intrusive: when disabled, the support should have little impact on the rest of the system, especially the critical paths. Furthermore, while users will increasingly demand higher dependability from general-purpose systems, they will still be sensitive to the cost of redundancy, such as energy overhead.


We have explored some aspects of the design of Thread-Level Redundancy [Rashid PhD Thesis’08]: complexity-effective, on-demand thread-level redundancy support for parallel applications [HPCA’08], and microarchitectural support to allow increased energy efficiency for redundancy [PACT’05]. A key fundamental inefficiency in redundant systems is that while errors in general are very rare, we pay significant overhead in providing redundancy simply to detect errors. We explored the possibility of other error detection mechanisms aimed at detecting the incoming particles (which are also rare). Our study suggests that a separate detector layer is unlikely to be feasible due to fundamental physics [Srinivas MS Thesis’08], but monitoring the current spike due to the intruding particle can be an efficient substitute for redundant computation [ISQED’09, Narsale MS Thesis’08]. In our ongoing work, we are also trying to understand reliability issues in real-world settings [USENIX’07, HOTDEP’07, USENIX’10].




Design and circuit complexity:


With scaling, architectural complexity has also escalated. Design complexity can become a limiting factor in translating innovations from the research community to the real world. Memory dependence logic is a prime example. First, the design complexity is high as the logic has to handle a variety of situations, taking into account different operand sizes, whether data is available for forwarding, coherence and consistency requirements, and so on. Second, the implementation relies on time- and energy-consuming circuits such as associative logic and priority encoders. Since store-to-load forwarding is on the critical path, circuit speed is crucial. On the other hand, hiding long communication and memory access latencies requires the logic to have a large capacity to handle many in-flight instructions. To provide sufficient capacity, some high-frequency designs (e.g., Pentium 4) introduced even more speculation, which further increases the design complexity. There are also numerous proposals for scalable memory dependence logic in recent literature. However, design complexity is often not the focus. Complex microarchitecture not only makes verification difficult but also increases the challenge for future optimization and innovation. Given the escalating cost of commercial microprocessor development, we need to search for practical complexity-effective algorithms. To this end, we explored various complexity-effective designs, especially for scalable memory dependence logic.


First, we explored a software-hardware cooperative approach [HPCA’06]. We showed that cross-layer cooperation can help achieve goals in a much more complexity-effective manner than is possible in one layer. Specifically, we use a software-based parser to analyze the program binary to identify loads that can safely bypass the dynamic memory disambiguation process. The hardware, on the other hand, only provides support for the software to specify the necessity of disambiguation. Software and hardware also work together when static analysis alone cannot offer complete information. Collectively, the mechanism is effective and also inexpensive since the complexity is shifted to the software. Our work demonstrated the potential of a vertically integrated optimization approach, where different system layers communicate with each other beyond standard functional interfaces. The layer most efficient in handling a task can pass information on to other layers for action. We also demonstrated that such a cooperative framework does not create backward compatibility obligations. We believe such a cooperative approach will be increasingly resorted to as a way to manage system complexity while continuing to deliver system improvements.
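As a rough illustration of the idea of a static pass that tags loads for the hardware, the toy sketch below marks a load as able to skip dynamic disambiguation when every older store in a small window uses the same, unmodified base register and touches a non-overlapping byte range. This is not the analysis used in [HPCA’06]; the `Instr` record, the `mark_safe_loads` function, the window size, and the assumption that no stores are in flight on block entry are all hypothetical simplifications chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    """Hypothetical, highly simplified instruction record."""
    op: str          # "load", "store", or "alu"
    base: str = ""   # base register of a memory access
    offset: int = 0  # byte offset from the base register
    size: int = 4    # access width in bytes
    dest: str = ""   # register written by the instruction, if any


def overlaps(a, b):
    """True if the two accesses touch at least one common byte."""
    return a.offset < b.offset + b.size and b.offset < a.offset + a.size


def mark_safe_loads(block, window=8):
    """Return indices of loads that provably cannot alias an older store.

    Assumption (toy model): no stores are in flight when the block is entered,
    and stores more than `window` instructions older have already retired.
    """
    safe = []
    for i, ins in enumerate(block):
        if ins.op != "load":
            continue
        ok = True
        for prev in block[max(0, i - window):i]:
            # Conservative: a redefined base register defeats the comparison.
            if ins.base and prev.dest == ins.base:
                ok = False
            # A store through another register, or to overlapping bytes,
            # means the load still needs dynamic disambiguation.
            if prev.op == "store" and (prev.base != ins.base or overlaps(prev, ins)):
                ok = False
        if ok:
            safe.append(i)
    return safe


block = [
    Instr("store", "sp", 0),
    Instr("load", "sp", 8, dest="r1"),   # same base, disjoint bytes
    Instr("load", "sp", 0, dest="r2"),   # overlaps the store above
    Instr("store", "r3", 0),             # unknown relation to sp
    Instr("load", "sp", 16, dest="r4"),  # may alias the r3 store
]
print(mark_safe_loads(block))  # [1]
```

In this sketch the analysis is deliberately conservative: any load it cannot prove independent is left to the hardware, which mirrors the division of labor described above without claiming to reproduce it.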


We also developed a slackened memory dependence enforcement approach that decouples the performance-critical forwarding logic from the logic that guarantees correctness [ISCA’06]. This decoupling removes the need to simultaneously achieve performance and correctness goals. As a result, the correctness-validation logic is straightforward as it is not on the timing-critical path; the forwarding logic is also much simplified as it only needs to focus on the common case and therefore can ignore corner-case correctness concerns. This allows us to use an index-based structure to support timing-critical forwarding. The resulting design is very scalable and, with two optional optimization techniques implemented, offers performance close to that of a conventional system with idealized load/store queues.
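The decoupling can be illustrated with a small toy model, which is only a sketch and not the design evaluated in [ISCA’06]. Here a direct-mapped, untagged table (the hypothetical `IndexedForwarder`) supplies load values quickly and is allowed to be wrong when two addresses share a slot, while a separate in-order check against committed memory catches and repairs any wrong value. The class name, the `run` function, the table size, and the in-order execution of the toy program are all assumptions made for this example.

```python
class IndexedForwarder:
    """Index-based forwarding table: no associative search, no tag check.

    It may return a stale or aliased value; correctness is enforced elsewhere.
    """

    def __init__(self, entries):
        self.entries = entries
        self.slots = [None] * entries

    def store(self, addr, value):
        self.slots[addr % self.entries] = value

    def load(self, addr):
        return self.slots[addr % self.entries]


def run(program, entries=4):
    """Execute ("st", addr, value) and ("ld", addr) operations.

    Returns the verified load results and the number of replays, i.e. loads
    whose speculative value disagreed with the in-order check.
    """
    forwarder = IndexedForwarder(entries)
    committed = {}  # architectural memory, updated in program order
    results, replays = [], 0
    for op in program:
        if op[0] == "st":
            _, addr, value = op
            forwarder.store(addr, value)
            committed[addr] = value
        else:
            _, addr = op
            guess = forwarder.load(addr)           # fast, possibly wrong
            if guess is None:
                guess = committed.get(addr, 0)
            correct = committed.get(addr, 0)       # slow, off the critical path
            if guess != correct:
                replays += 1                       # recover with the right value
                guess = correct
            results.append(guess)
    return results, replays


program = [("st", 0, 10), ("st", 4, 20), ("ld", 0), ("ld", 4)]
print(run(program, entries=4))  # ([10, 20], 1): addresses 0 and 4 share a slot
print(run(program, entries=8))  # ([10, 20], 0): no aliasing, no replay
```

In both runs the verified results are identical; only the number of replays changes with the table size, which is the sense in which the forwarding structure affects performance but not correctness in this sketch.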


Extending the philosophy of handling performance improvement and correctness guarantees in a relatively orthogonal fashion, we explored what we call explicitly-decoupled architecture. In such an architecture, the design is explicitly separated, from the ground up, into a performance domain and a correctness domain. By design, the performance domain only enables and facilitates high performance in a probabilistic fashion. The architectural design is not just conceptually but also physically partitioned into performance and correctness domains. The physical separation extends to the whole system stack from software and microarchitecture down to circuit and device, allowing freedom in the entire system stack to be optimistically designed. Our initial evaluation showed that an explicitly decoupled design (a) achieves a good performance boost, (b) does not consume excessive energy, and (c) provides robust performance and better tolerance than a conventional design to circuit-level issues and to the resulting conservatism [MICRO’08, PACT’11, Garg PhD Thesis’11].


“An Intra-Chip Free-Space Optical Interconnect”, Jing Xue et al., in Proceedings of the 37th International Symposium on Computer Architecture, June 2010
 

ISCA’10

“A Case for Globally-Shared-Medium On-Chip Interconnect”, Aaron Carpenter, Jianyun Hu, Jie Xu, Michael Huang, and Hui Wu, in Proceedings of the 38th International Symposium on Computer Architecture, June 2011
 

ISCA’11

“Enhancing Effective Throughput for Transmission Line-Based Bus”, Aaron Carpenter, Jianyun Hu, Ovunc Kocabas, Michael Huang, and Hui Wu, in Proceedings of the 39th International Symposium on Computer Architecture, June 2012
 

ISCA’12