PUBLICATIONS

Publications: Publications

September 8, 2023

A PPA STUDY OF REINFORCED PLACEMENT PARAMETER AUTOTUNING: PSEUDO-3D VS. TRUE-3D PLACERS

3D Place and Route (P&R) flows either involve true-3D placement algorithms or use commercial 2D tools to transform a 2D design into a 3D design. Irrespective of the nature of the placers, several placement parameters in these tools affect the quality of the final 3D designs. Different parameter settings work well with different circuits, and it is impossible to manually tune them for a particular circuit. Automated approaches involving reinforcement learning have been shown to adapt and learn the parameter settings and create trained models. However, their effectiveness depends on the input dataset quality. Using a set of 10 netlists and 10–21 handpicked placement parameters in P&R flows involving pseudo-3D or true-3D placement, the dataset quality is analyzed. The datasets are the design metrics obtained through different P&R stages, such as placement optimization, clock tree synthesis, or 3D partitioning and global routing. The training runtime and the quality of the final design metrics are compared. On a pseudo-3D flow, the training takes around 126–290 hours, whereas, on a true-3D placer-based flow, it takes around 305–410 hours. It is observed that the datasets obtained from different stages lead to drastically different final design results. With the RL-based training processes, the quality of results in 3D designs improves by up to 23.7% compared to their corresponding untrained P&R flows.

August 16, 2023

ON CONTINUING DNN ACCELERATOR ARCHITECTURE SCALING USING TIGHTLY COUPLED COMPUTE-ON-MEMORY 3-D ICS

This work identifies the architectural and design scaling limits of 2-D flexible interconnect deep neural network (DNN) accelerators and addresses them with 3-D ICs. We demonstrate how scaling up a baseline 2-D accelerator in the X / Y dimension fails and how vertical stacking effectively overcomes the failure. We designed multitier accelerators that are 1.67 × faster than the 2-D design. Using our 3-D architecture and circuit codesign methodology, we improve throughput, energy efficiency, and area efficiency by up to 5 × , 1.2 × , and 3.9 × , respectively, over 2-D counterparts. The IR-drop in our 3-D designs is within 10.7% of VDD, and the temperature variation is within 12 ∘ C.

April 13, 2022

ART-3D: ANALYTICAL 3D PLACEMENT WITH REINFORCED PARAMETER TUNING FOR MONOLITHIC 3D ICS

In this paper, we show that true 3D placement approaches, enhanced with reinforcement learning, can offer further PPA improvements over pseudo-3D approaches. To accomplish this goal, we integrate an academic true 3D placement engine into a commercial-grade 3D physical design flow, creating ART-3D flow (Analytical 3D Placement with Reinforced Parameter Tuning-based 3D flow). We use a reinforcement learning (RL) framework to find optimized placement parameter settings of the true 3D placement engine for a given netlist and perform high-quality 3D placement. We then use an efficient 3D optimization and routing engine based on a commercial place and route (P&R) tool to maintain or improve the benefits reaped from true 3D placement till design signoff. We evaluate our 3D flow by designing several gate-only and processor benchmarks on a commercial 28nm technology node. Our proposed 3D flow involving true 3D placement offers the best PPA results compared to existing 3D P&R flows and reduces power consumption by up to 31%, improves effective frequency by up to 25%, and therefore reduces power-delay product by up to 43% compared with commercial 2D IC design flow. These improvements predominantly come from RL-based parameter tuning, as it improves the performance of the 3D placer by up to 12%.

July 16, 2021

HETEROGENEOUS 3D ICS: CURRENT STATUS AND FUTURE DIRECTIONS FOR PHYSICAL DESIGN TECHNOLOGIES

One of the advantages of 3D IC technology is its ability to integrate different devices such as CMOS, SRAM, and RRAM, or multiple technology nodes of single or different devices onto a single chip due to the presence of multiple tiers. This ability to create heterogeneous 3D ICs finds a wide range of applications, from improving processor performance by integrating better memory technologies to building compute-in-memory ICs to support advanced machine learning algorithms. This paper discusses the current trends and future directions for the physical design of heterogeneous 3D ICs. We summarize various physical design and optimization flows, integration techniques, and existing academic works on heterogeneous 3D ICs.

July 15, 2021

STATISTICAL ARRAY ALLOCATION AND PARTITIONING FOR COMPUTE IN-MEMORY FABRICS

Compute in-memory (CIM) is a promising technique that minimizes data transport, the primary performance bottleneck and energy cost of most data intensive applications. This has found wide-spread adoption in accelerating neural networks for machine learning applications. Utilizing a crossbar architecture with emerging non-volatile memories (eNVM) such as dense resistive random access memory (RRAM) or phase change random access memory (PCRAM), various forms of neural networks can be implemented to greatly reduce power and increase on chip memory capacity. However, compute in-memory faces its own limitations at both the circuit and the device levels. Although compute in-memory using the crossbar architecture can greatly reduce data transport, the rigid nature of these large fixed weight matrices forfeits the flexibility of traditional CMOS and SRAM based designs. In this work, we explore the different synchronization barriers that occur from the CIM constraints. Furthermore, we propose a new allocation algorithm and data flow based on input data distributions to maximize utilization and performance for compute-in memory based designs. We demonstrate a 7.47× performance improvement over a naive allocation method for CIM accelerators on ResNet18.

February 24, 2021

CLOCK DELIVERY NETWORK DESIGN AND ANALYSIS FOR INTERPOSER-BASED 2.5-D HETEROGENEOUS SYSTEMS

The 2-D CMOS process technology scaling may have reached its pinnacle, yet it is not feasible to manufacture all computing elements at lower technological nodes. This has opened a new branch of chip designing that allows chiplets on different technological nodes to be integrated into a single package using interposers, the passive interconnection mediums. However, establishing a high-frequency communication over an entirely passive layer is one of the significant design challenges of 2.5-D systems. In this article, we present a robust clocking architecture for a 2.5-D system consisting of 64 processor cores. This clocking scheme consists of two major components, namely, interposer clocking and on-chiplet clocking. The interposer clocking consists of clocks used to achieve global synchronicity and clocks for interchiplet communication established using the AIB protocol. We synthesized these clocking components using commercial EDA tools and analyzed them using standard tools, on-chip, and package models. We also compare these results against a 2-D design of the same benchmark and another 2.5-D clocking architecture. Our experiments show that the absolute clock power is up to 16% less, and the ratio of clock power to system power is up to 4% less in the 2.5-D design than its 2-D counterpart.

December 14, 2020

HETEROGENEOUS MIXED-SIGNAL MONOLITHIC 3-D IN-MEMORY COMPUTING USING RESISTIVE RAM

Resistive random access memory (RRAM)-based compute-in-memory architecture helps overcome the bottleneck caused by large memory transactions in the convolutional neural network (CNN) accelerators. However, their deployment using 2-D IC technology faces challenges, as today's RRAM cells remain at legacy nodes above 20 nm due to high programming voltages. Besides, power-hungry analog-to-digital converter (ADC) units limit the throughput of RRAM accelerators. In this article, we present the first-ever heterogeneous (multiple technology nodes) mixed-signal monolithic 3-D IC designs of the RRAM CNN accelerator. Our RRAM remains at legacy 40-nm nodes in one tier, but CMOS periphery scales toward advanced 28/16 nm in another tier. Our 3-D designs overcome the bottleneck caused by ADCs and offer up to 4.9x improvement in energy efficiency in TOPS/W and up to 50% reduction in footprint …

Novemeber 2, 2020

RTL-TO-GDS DESIGN TOOLS FOR MONOLITHIC 3D ICS

In this paper, we propose RTL-to-GDS design flow for monolithic 3D ICs (M3D) built with carbon nanotube field-effect transistors and resistive memory. Our tool flow is based on commercial 2D tools and smart ways to extend them to conduct M3D design and simulation. We provide a post-route optimization flow, which exploits the full potential of the underlying M3D process design kit (PDK) for power, performance and area (PPA) optimization. We also conduct IR-drop and thermal analysis on M3D designs to improve the reliability. To enhance the testability of our M3D designs, we develop design-for-test (DFT) methodologies and integrate a low-overhead built-in self-test module into our design for testing inter-layer vias (ILVs) as well as logic circuitries in the individual tiers. Our benchmark design is RISC-V Rocketcore, which is an open source processor. Our experiments show 8.1% of power, 19.6% of wirelength …

August 24, 2020

ARCHITECTURE, CHIP, AND PACKAGE CODESIGN FLOW FOR INTERPOSER-BASED 2.5-D CHIPLET INTEGRATION ENABLING HETEROGENEOUS IP REUSE

A new trend in system-on-chip (SoC) design is chiplet-based IP reuse using 2.5-D integration. Complete electronic systems can be created through the integration of chiplets on an interposer, rather than through a monolithic flow. This approach expands access to a large catalog of off-the-shelf intellectual properties (IPs), allows reuse of them, and enables heterogeneous integration of blocks in different technologies. In this article, we present a highly integrated design flow that encompasses architecture, circuit, and package to build and simulate heterogeneous 2.5-D designs. Our target design is 64-core architecture based on Reduced Instruction Set Computer (RISC)-V processor. We first chipletize each IP by adding logical protocol translators and physical interface modules. We convert a given register transfer level (RTL) for 64-core processor into chiplets, which are enhanced with our centralized network-on-chip.

August 15, 2020

BREAKING BARRIERS: MAXIMIZING ARRAY UTILIZATION FOR COMPUTE IN-MEMORY FABRICS

Compute in-memory (CIM) is a promising technique that minimizes data transport, the primary performance bottleneck and energy cost of most data intensive applications. This has found wide-spread adoption in accelerating neural networks for machine learning applications. Utilizing a crossbar architecture with emerging non-volatile memories (eNVM) such as dense resistive random access memory (RRAM) or phase change random access memory (PCRAM), various forms of neural networks can be implemented to greatly reduce power and increase on chip memory capacity. However, compute in-memory faces its own limitations at both the circuit and the device levels. Although compute in-memory using the crossbar architecture can greatly reduce data transport, the rigid nature of these large fixed weight matrices forfeits the flexibility of traditional CMOS and SRAM based designs. In this work, we explore the different synchronization barriers that occur from the CIM constraints. Furthermore, we propose a new allocation algorithm and data flow based on input data distributions to maximize utilization and performance for compute-in memory based designs. We demonstrate a 7.47 performance improvement over a naive allocation method for CIM accelerators on ResNet18.

June 6, 2019

ARCHITECTURE, CHIP, AND PACKAGE CO-DESIGN FLOW FOR 2.5 D IC DESIGN ENABLING HETEROGENEOUS IP REUSE

A new trend in complex SoC design is chiplet-based IP reuse using 2.5 D integration. In this paper we present a highly-integrated design flow that encompasses architecture, circuit, and package to build and simulate heterogeneous 2.5 D designs. We chipletize each IP by adding logical protocol translators and physical interface modules. These chiplets are placed/routed on a silicon interposer next. Our package models are then used to calculate PPA and signal/power integrity of the overall system. Our design space exploration study using our tool flow shows that 2.5 D integration incurs 2.1 x PPA overhead compared with 2D SoC counterpart.

July 13, 2018

CLOCK DOMAIN CROSSING VERIFICATION OF A BIDIRECTIONAL IO IP

This paper was published at Synopsys User Group (SNUG), Conference, held in Bangalore, India, when I was associated with Intel. This paper talks about the efficient techniques that can be used in analyzing the clock domain crossing issues in an analog IP, which has several mesochronous clocks for source synchronous data communication and hundreds of bidirectional I/O's. This paper can be found here.