Circuit Design and Configuration for Low Power FPGAs



Abstract
Technology is moving toward ubiquitous computing (UbiComp), where sensors and other integrated circuits (ICs) are ever present in daily life, creating a smart environment in which people can constantly monitor and react to their surroundings. For UbiComp to be fully realized, the ICs deployed in this vast network must meet a host of requirements: low cost, low power consumption, flexibility, and adequate computation power. Current IC design methodologies do not meet all of these requirements. Application Specific Integrated Circuits (ASICs) are low power and have the necessary computing power for UbiComp applications, but they are prohibitively inflexible. Ultra-Low Power (ULP) General Purpose Processors (GPPs) have the necessary flexibility, but their power consumption is prohibitively high (by multiple orders of magnitude).
Field Programmable Gate Arrays (FPGAs) are another IC implementation that could potentially be used for UbiComp applications, because they bridge the gap between the high efficiency of the ASIC and the high flexibility of the GPP. However, FPGAs have historically been targeted at high-performance applications, where performance (or speed) is the main metric. As a result, commercial FPGAs are too power hungry for ULP applications (like UbiComp). Relatively little research has been done to bring reconfigurable logic into the ULP application space. However, if FPGAs can be retargeted for ULP applications, they will bridge that same gap, this time between ULP ASICs and ULP GPPs: they would provide better energy efficiency than ULP GPPs, while retaining the flexibility to update or re-target their applications (a necessary requirement for UbiComp) without the expensive, time-consuming respins that ULP ASICs require. This document proposes a dissertation that will explore the steps necessary to both build and configure Ultra-Low Power FPGAs. These steps include:
1. revisiting the circuit design of FPGA building blocks and re-designing them for low power operation
2. re-assessing FPGA architectural parameters to determine the architectures most suitable for ULP operation, as opposed to high-performance operation
3. developing a toolflow that will allow researchers to quickly co-optimize FPGA fabrics across different circuit and architectural parameters, and will allow future users to quickly generate FPGA fabrics with the necessary configurations for a given benchmark circuit
4. exploring the integration of the ULP FPGA fabric into a SoC designed for wireless sensing applications

1 Introduction

Motivation for Low-Power FPGAs - Ubiquitous Computing
Today, in a world where people have immediate access to massive amounts of information, increased control and awareness of the surrounding environment is becoming more and more viable, and perhaps necessary. For this to occur, integrated circuits (ICs) need to be redesigned such that a large number of sensors can be deployed simultaneously in different environments (wearable sensors, smart home sensors, infrastructure monitoring, etc.) and users can access the information provided by these sensors and respond accordingly. This landscape, known popularly as ubiquitous computing (UbiComp), is not far off. There are, however, certain challenges in this application space that prevent UbiComp adoption on a large, commercial scale. These challenges are as follows:
• Ultra-Low Power and Energy Operation - For UbiComp to occur, a large number of ICs will need to be deployed in various environments (homes, wearable sensors, etc.). Constantly changing batteries for all of these devices, or connecting all of them to wires, becomes invasive and infeasible. Therefore, these sensors need to be very low power, so as to maintain functionality within a small power budget, and also low energy, so as to not drain power sources too quickly. This will minimize battery changes, allowing the sensors to function for extended periods of time.
• Computational Flexibility - With many sensors deployed in the field, constantly gathering information, removing them from their deployments can be problematic. Moreover, algorithms used for sensing applications are subject to change, either from discoveries of new methods or from updated versions of the current approach. Additionally, these sensors are subject to different environments, and thus may need to perform slightly (or completely) different functions depending on the environment. For all of these reasons, these sensors need to be designed with a level of flexibility, in order to meet all of the requirements of the environments they are sensing.
• Computational Power - It is not enough for these sensors just to sense information. If the devices simply sent all collected information to a user, it would be highly inefficient: the sensor would send more information than is necessary, and the information would likely not be discernible by the user. Ideally, each sensor in this UbiComp platform will be able to process some of the information it collects on the node itself, in order to deliver the information to whoever needs it in a form that is easy to understand and work with.
• Low Cost - As mentioned before, UbiComp would require thousands of sensing nodes to be deployed at the same time. The cost of developing and fabricating these chips would be very high, and that cost goes up even further if the chips need to be re-designed and re-fabricated with each new or updated algorithm.
Unfortunately, current IC options cannot meet all of these requirements. Ultra-low power (ULP) Application Specific Integrated Circuits (ASICs) provide the most efficient computing platform, and can deliver high computing power at low power consumption. However, they have little or no flexibility, so updating or changing the functionality of an ASIC requires a full re-design and re-fabrication. This inflexibility also contributes to higher implementation costs. General purpose processors (GPPs) that are targeted for low power, such as ULP microcontrollers (MCUs) like the MSP430 from Texas Instruments, boast low power operation, high flexibility, and computing power [13]. However, the power consumption of these devices is still too high for certain applications within the UbiComp space.
A more viable option is a ULP system-on-chip (SoC), like the Body-Area Sensor Node (BASN) described in [1]. This device leverages both an MCU and ASICs to achieve extremely low power consumption (<20 µW), allowing the node to run on harvested energy. Battery-less operation is perfect for UbiComp applications. Unfortunately, devices like this one still suffer from a lack of flexibility. Although the node can perform many functions (ECG, EEG, aFib detection, etc.), it is limited to those functions, and needs to be re-designed and re-fabricated for any changes or updates.
Field Programmable Gate Arrays (FPGAs) are reconfigurable ICs that, at nominal voltages, split the difference between ASICs and GPPs in terms of flexibility and efficiency. The reconfigurability of the devices makes them inherently more flexible than ASICs, as they can be reconfigured to any functionality that fits on the logic resources available. That flexibility is bounded only by the amount of logic resources on the FPGA. Because FPGAs are still hardware implementations of algorithms, they consume less power and energy than GPPs for many applications.
Both ASICs and GPPs have been retargeted for ULP operation, as discussed earlier. However, there are fewer options for ULP FPGAs. This is primarily due to the current market for FPGAs. Applications that primarily use FPGAs today include aerospace, defense, automotive, high-power consumer electronics (like digital cameras), and medical devices (ultrasound, endoscopes, etc.), among others [14]. For these applications, power and energy consumption are not as important as performance. Thus, the FPGA industry has been driven by a different set of metrics, and designs for FPGAs have progressed accordingly. That being said, there are FPGAs from companies like Microsemi and Lattice Semiconductor that boast extremely low power consumption [11] [12]. But again, those devices still consume tens of mW in active mode, which is high for the UbiComp requirements. If sub-mW FPGAs were to be realized, they would provide an ideal solution for UbiComp applications: a power-efficient computing platform that is also flexible and powerful. This proposal argues that if the circuits and architectures of FPGAs are retargeted for ULP operation, FPGAs can fill the space needed to realize UbiComp pervasively.

Motivation for Configuration Scheme for Custom-FPGAs
In order to adequately test whether circuit and architecture optimizations are valid for minimizing power consumption in FPGAs, it will be important to test how incremental changes affect the FPGA circuit as a whole. To do this, we will need to build schematics of FPGAs with the different circuit-level and architectural parameters, and configure them to perform different functions. This poses two major problems.
• Design Time - FPGAs are large structures that require a large number of transistors. A single configurable logic block (CLB) consisting of nine 4-input basic logic elements (BLEs) can have more than 20,000 transistors (including test structures like registers). Thus, building entire FPGAs, which include multiple CLBs and a global interconnect, is very time consuming. Multiply that by the number of different FPGAs required for a thorough design space exploration, and comparing FPGA parameters through SPICE-level simulations becomes virtually impossible by hand.
• Configuration for Custom-built FPGAs - As it stands now, there is no commercially available way to configure a custom-built FPGA fabric. Commercial FPGA companies, like Xilinx and Altera, have their own compilers, which map Verilog code provided by the user to a configuration bitstream that can be loaded onto their hardware. Unfortunately, this method will not work for custom FPGAs, because the custom hardware (circuit designs and architectures) does not match what the commercial tools target. Configuration bit locations are different, so the bitstream would not set the intended configuration bits. Configuration mechanisms also vary from FPGA to FPGA, so it is entirely possible that the bitstream generated by a commercial tool would be completely invalid for the custom FPGA. Moreover, the underlying architectures of commercial FPGAs will differ from those of the custom fabrics, meaning that algorithms will be partitioned onto different amounts of logic and interconnect resources. For all of these reasons, using commercial configurations is impossible for custom fabrics.
What is needed is a method of taking a set of circuit level and architectural parameters (which define your physical FPGA structure) and a function for the FPGA to implement, and generating from those a netlist ready for simulation.

Thesis Statement
By re-targeting circuit elements and architecture parameters of reconfigurable fabrics for ultra-low power operation, FPGAs can provide an adequate combination of efficiency, flexibility, and computing power to enable ubiquitous computing to become a reality. These fabrics could be deployed as stand-alone ICs, or be included in ultra-low power SoCs to increase their flexibility.
In order to study FPGA circuits and architectures fully, it will be necessary to generate FPGA fabrics and configure them with algorithms used in ULP applications, in order to see how they will function. To do this, open source FPGA mapping tools (e.g., VTR) need to be extended to generate configurations for custom FPGAs.

Goals
The overarching goal of this work is to explore and determine circuit designs and architectures that allow FPGA operation with extremely low power consumption (<1 mW). The individual goals of the work are as follows:
• Explore viable circuit designs for FPGA sub-circuits (logic blocks, interconnect structures, and configuration bits)
• Determine best practices for FPGA sub-circuit design
• Re-visit FPGA architectures to determine if there are better architectural parameters for ULP FPGAs
• Determine optimal FPGA architectures for different classes of ULP applications
• Develop a toolflow enabling circuit-level and architectural co-optimization for FPGA fabrics
• Leverage the toolflow to create FPGA fabrics that perform tasks with ultra-low power and energy consumption
• Use the toolflow to help create an embedded reconfigurable fabric to be used in ULP SoCs, illustrating the benefits that ULP FPGA fabrics can provide

2 Background

General FPGA Architecture
FPGAs are reconfigurable ICs that consist of a collection of distributed logic blocks whose inputs and outputs are connected to each other through a reconfigurable interconnect, filled with switches and buffers. Configuration bits (usually SRAM) are distributed throughout the FPGA and control both the reconfigurable interconnect and the logic. Altering these configuration bits can change the functionality of the logic blocks, the connections between different logic blocks, or both. The I/Os of the FPGA chip also include configuration bits that determine whether each pin will be an input or an output, allowing flexibility in the number of input and output pins the device has. A high-level illustration of the general FPGA architecture is shown in Figure 2.1.
Each configurable logic block (CLB) consists of one or more Basic Logic Elements (BLEs). Each BLE is built around a k-input look-up table (LUT). One k-input LUT has 2^k SRAM configuration bits that are connected to a 2^k-to-1 multiplexer. The LUT output is configured to be either combinational or sequential by a multiplexer and a configuration bit at the output.
The logic blocks are connected by an interconnect built from horizontal and vertical wires. In 2D-mesh FPGAs, each individual wire is considered a "track," and the collection of adjacent tracks in a given direction is called a "channel." Horizontal and vertical channels are connected to each other through switch boxes. Connection boxes connect the CLBs to the FPGA interconnect wires. Configuration bits control which tracks are connected to each other and to the CLB inputs and outputs. Figure 2.3 gives a detailed picture of interconnect resources and how they are connected.
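The LUT structure described above (2^k configuration bits feeding a 2^k-to-1 multiplexer, followed by an output multiplexer that selects combinational or sequential mode) can be illustrated with a short behavioral sketch. This is only a functional model, not a circuit description from this work: the function names, the choice of `inputs[0]` as the least significant select line, and the explicit `flop_state` argument standing in for the BLE register are all assumptions made for illustration.

```python
from typing import Sequence


def lut_output(config_bits: Sequence[int], inputs: Sequence[int]) -> int:
    """Evaluate a k-input LUT: the inputs select one of 2**k stored bits."""
    k = len(inputs)
    if len(config_bits) != 2 ** k:
        raise ValueError("a k-input LUT needs exactly 2**k configuration bits")
    # Assumption: inputs[0] is the least significant multiplexer select line.
    index = sum(bit << position for position, bit in enumerate(inputs))
    return config_bits[index]


def ble_output(config_bits: Sequence[int], inputs: Sequence[int],
               sequential_mode: bool, flop_state: int) -> int:
    """Output multiplexer of a BLE, controlled by one configuration bit.

    sequential_mode models that configuration bit; flop_state is the value
    currently held in the (hypothetical) register behind the LUT.
    """
    lut_value = lut_output(config_bits, inputs)
    return flop_state if sequential_mode else lut_value


# Configuring a 2-input LUT as XOR: entries for select indices 0, 1, 2, 3.
xor_bits = [0, 1, 1, 0]
```

Reprogramming the same hardware to a different function only requires writing a different list of configuration bits, which is the source of both the flexibility and the configuration-bit overhead discussed in this section.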

Commercial Ultra-Low Power FPGAs
Most of the market share for FPGAs is shared by Xilinx (47%) and Altera (41%) as of 2012 [17].
However, these companies do not make FPGAs with low power consumption. Their devices consistently consume power in the mW-W range, which is too high for ULP applications.
There are companies, however, that do make FPGAs whose power consumption is sub-mW in certain modes, and their circuit designs and architectures are described in this section.

Microsemi IGLOOnano
One of the competitors to Xilinx and Altera is Microsemi, a company that makes some of the lower power FPGAs on the market. Their devices use non-volatile FLASH configuration bits instead of the SRAM bits that most industry FPGAs use. By using these, they can get leakage power for their devices as low as 2 µW. The smallest logical element in the IGLOO nano, which is analogous to a BLE, is called a VersaTile. A VersaTile is a 4-input block that can have one of three functions:
1. 3-input LUT
2. Latch with clear or set
3. D flip-flop with clear or set
The IGLOO has no explicit clustering of its VersaTiles, but clusters are implicitly created by the routing structures in the FPGA. There are multiple levels of hierarchy in the interconnect of the IGLOO. First, there are Local-Line resources that connect every VersaTile with its eight nearest neighbors. Next, Long-Line resources span 1, 2, or 4 VersaTiles, in either the x or y direction. Lastly, there are Very-Long-Line resources that act as the global interconnect for the FPGA, which span 12 VersaTiles in the x-direction and 16 VersaTiles in the y-direction. Because of the Local-Line interconnects, we will consider the cluster in these FPGAs to be 9 VersaTiles [11].

Lattice iCE40 Ultra
Lattice Semiconductor is a company that designs SRAM-based FPGAs for low-power operation.

Academic Ultra-Low Power FPGAs
While some commercial FPGAs (like the Microsemi IGLOO and the Lattice iCE40) have sub-mW operation, they only achieve this low power consumption in sleep mode. There have been academic efforts to design ultra-low power FPGAs that consume low power in active operation. One example is introduced in [7], which features a low power FPGA that consumes as little as 40 µW in active mode. This FPGA uses 6T latches instead of the SRAM bits that are used in commercial FPGAs. For interconnect resources (such as connection boxes and switch boxes), this FPGA uses buffered, unidirectional wires with buffered multiplexers at the switch points. The limitations of this FPGA design come mostly from the size of the device. This FPGA achieves record-low power consumption, but only has 4 configurable logic blocks. That severely limits the computation power of the device, and is almost assuredly not enough resources for many of the DSP algorithms needed for the IoT.
Another low-power FPGA, proposed in [8], uses the Altera Stratix III architecture as its baseline. Thus, it also uses standard SRAM configuration bitcells and buffered unidirectional multiplexers for the interconnect routing. This FPGA included IP blocks like multipliers and block RAMs. The low-power optimizations employed in this FPGA include mid-oxide, high-VT transistors.
In [6], researchers also used SRAM bitcells for configuration, but used 5T cells as opposed to the standard 6T. This made the cells harder to write, but because the cells are generally in the hold state, using 5T configuration bits only requires extra effort during the configuration stage, in the form of boosted voltages for the writing circuitry. For the interconnect, this FPGA does not use buffers at all, but instead uses a purely pass-gate interconnect. To make up for the degraded swing across the interconnect, a modified Schmitt trigger sense amplifier is used at the inputs to the logic blocks in order to restore the signal to full swing. This interconnect leverages the low swing that occurs throughout the interconnect to reduce power. Using these circuit design techniques, the researchers were able to reduce area per LUT by almost 3x, delay at a constant energy by 14x, and energy at a constant delay by almost 5x.
Table 1 shows different low-power FPGAs and their architectures. Although other commercial FPGA companies (like Xilinx and Altera) have moved to larger LUTs, low power FPGAs have interestingly stayed with 4-input LUTs, and all have similar clusterings. Segment lengths and configuration bit topologies, however, vary among the FPGAs. It is important to note that the segment lengths for the IGLOO nano are with respect to their logic elements, called VersaTiles, and not to their clusters of VersaTiles.
3 Circuit Exploration

The main circuit elements of an FPGA are the routing switches, the CLBs, and the configuration bits used to store configuration information for the FPGA. For FPGAs to be feasible for ULP applications, their energy consumption must be comparable to that of ASICs. To achieve this, each individual circuit element must be re-optimized for new metrics, namely low energy and small area. Once each of the circuit elements is redesigned for more efficient operation, the overall power consumption of the FPGA will be reduced dramatically, closing the gap in efficiency between FPGAs and ASICs. The challenge here is to find new designs for these building blocks that have better power and energy consumption than the state-of-the-art FPGA sub-circuits, while maintaining a level of flexibility suitable for UbiComp and other ULP applications.

Prior Art
In [9], researchers compared different FPGA interconnect switches to determine which would reduce power consumption most effectively. They compare the traditional routing switch, which they define to be a multiplexer followed by a buffer, with different low-power variants, which add headers to provide different modes (high speed, low power, and sleep). There is also an additional buffer topology, which body-biases the PMOS devices in the buffer to the virtual rail provided by the header devices in the switches. Looking at all the different combinations of techniques, they find that using any of the low-power routing switches lowers dynamic power by about 30% relative to the traditional switch topology, that the low-power modes lower leakage power by 28-52%, and that sleep mode lowers leakage by 61-79%. This work can be extended by looking at un-buffered interconnects, and also by incorporating a study of the effects of variation.
CLB topologies have not been investigated thoroughly. The authors of [6] introduced an alternative to the standard multiplexer-based local interconnect inside CLBs, which uses FPGA-style routing to connect the different BLEs within a CLB together. The authors of [10] introduced the notion of depopulating, or removing full connectivity from, the intra-CLB routing between the inputs/outputs of the CLB and the individual multiplexers inside. However, to our knowledge, no study has yet compared CLB connectivity topologies against each other across a range of circuit-level knobs.
The author of [7] compared different configuration bit topologies, including standard 6T SRAM, subthreshold 6T SRAM, and a 6T latch. That study found that the 6T latch had the most robust propagation delay, but the slowest write time. The study correctly pointed out that write pulse width is not an important metric for comparing configuration bit topologies. To extend this study, factors such as hold margin, minimum retention voltage, and leakage current should be evaluated for different configuration bit topologies. The authors of [8] compared using simply high-VT devices against using mid-oxide devices, with higher dielectric constants than the other high-VT devices. They found that using transistors that are both mid-oxide and high-VT reduces leakage power by multiple orders of magnitude. In this work, transistor type as well as configuration bit topology will be knobs to turn for optimizing configuration bits for low-power operation.

Research Question
How can we redesign the circuit elements in FPGAs (namely routing switches, CLBs, and configuration bits) to minimize power consumption, while still providing adequate functionality and performance for low-throughput applications?

Approach
The overall goal of this portion of the research is to search for better alternatives to the current state-of-the-art FPGA sub-circuits. This research will focus primarily on the topologies of switch boxes, CLBs, configuration bits, and LUTs. Other sub-circuits, such as connection boxes, I/Os, and registers (within the CLBs), will not be discussed in this work, and are left for future work.

Routing Switch Exploration
Different routing switch topologies will be compared to find the optimal low-voltage (near- to sub-threshold) option. We compare the different topologies using an interconnect model, which consists of a driver, a series of switch boxes, and one or more receiver circuits. This is a preliminary model, and the results from analyzing it will be presented in this proposal. In the future, a custom-built toolflow (discussed later in this document) will be able to generate full FPGA schematics, and thus test how changes in switch topology affect the FPGA as a whole. This will allow for comparisons between the different switch box topologies across a wide range of benchmarks, creating a more comprehensive exploration of the switch-box-topology design space. In this research thrust, we will explore using tri-state buffers (TSBs), pure pass gate (PG), pure transmission gate (TX), and multiplexer-based designs for the switches in the switch boxes. We will first compare the individual switches, and then their behavior in specified networks. We will also explore both bi-directionality and uni-directionality in the context of ULP operation. The knobs we will turn for this comparison are:
• Transistor Size - Increasing transistor size for the routing switches increases drive strength and performance, but also increases load capacitance and area, and thus power consumption. The effect of transistor size on interconnect metrics is complicated, and needs to be studied.
• Path Length - Interconnect nets in FPGAs vary greatly in length, with nets ranging from one wire segment long to the diagonal of the entire chip. It will be important to discover and quantify how the performance and power consumption of different routing switch topologies change across different path lengths.
• Operating Voltage - Dynamic power consumption has a quadratic relationship to operating voltage, so operating the interconnect at the lowest possible voltage is optimal. However, different design topologies are expected to show different trends in power consumption and performance across operating voltages.
• Transistor Type - Different types of transistors, with varying threshold voltages (VT) and dopings, are available. Leveraging these different types of transistors could result in increased power and energy savings.
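To see why path length matters so much for unbuffered switch topologies, consider a first-order Elmore delay estimate for a chain of identical pass-gate switches and wire segments. The sketch below is a textbook approximation, not the interconnect model used in this proposal: it assumes each switch is a lumped resistance and each wire segment a lumped capacitance, and the function name and parameters are hypothetical.

```python
def elmore_delay_chain(r_switch: float, c_segment: float, n_segments: int) -> float:
    """Elmore delay at the far end of n identical RC stages in series.

    The capacitance at node i (1-based) charges through all i switches
    upstream of it, so the total is sum(i * R * C) = R * C * n * (n + 1) / 2.
    """
    if n_segments < 0:
        raise ValueError("n_segments must be non-negative")
    return sum(i * r_switch * c_segment for i in range(1, n_segments + 1))


# Assumed, illustrative values: 2 kOhm per switch, 5 fF per wire segment.
short_path = elmore_delay_chain(2e3, 5e-15, 4)
long_path = elmore_delay_chain(2e3, 5e-15, 8)
```

Under these assumptions, doubling the path from 4 to 8 segments raises the estimated delay by a factor of 3.6 rather than 2, because the delay of an unbuffered chain grows roughly quadratically with its length. Buffered topologies trade this growth for extra area and switching power, which is one reason topologies have to be compared across path lengths rather than at a single point.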

Configurable Logic Block (CLB) Exploration
Configurable Logic Blocks (CLBs) are the building blocks of FPGAs, and house all of the logic performed by them. Thus, optimizing the CLBs for ULP operation will be paramount for developing ULP FPGAs. In modern FPGAs, CLBs are collections, or clusters, of basic logic elements (BLEs).
The focus of this exploration is not to change the topology of the BLEs themselves, but rather to redesign the local interconnect between the BLEs inside the CLB. Modern FPGAs (for the most part) use multiplexers to route signals between the inputs and outputs of the CLB and the individual BLEs inside. However, [6] proposed an alternative solution: a mini-FPGA style CLB, which uses switch boxes and connection boxes to connect the BLEs. With that design, the researchers were able to achieve a smaller CLB area and lower power. Figure 3.1 illustrates the two topologies. In this work, we will further explore the design space, both in simulation and in hardware, and determine under which set of parameters each of the two CLB topologies is optimal. The knobs for this comparison include:
• Channel Width − This parameter refers to the number of routing channels present in the local interconnect of the mini-FPGA CLB. Reducing the number of channels helps lower power consumption and area, but limits routeability through the CLB.
• Operating Voltage

Configuration Bit Exploration
Configuration bits are large contributors to the overhead of FPGAs over ASIC designs. These bits control the data inside each of the logic blocks, the combinational vs. sequential modes of the logic blocks, and the large global interconnect. Commercial FPGAs can have on the order of millions of these bits. Because of this, it is important to optimize these circuit elements. To do so, we will compare different memory cell topologies in a way that pertains directly to FPGA functionality. Generally, when looking at different bitcell topologies, researchers tend to focus on reading data from the memory cell and writing data into it, balancing both phases while also minimizing energy consumption. In FPGAs, however, the read and write phases are not important while the device is in use; only the hold phase is. Thus, this configuration bit exploration will be similar to those that have been discussed in the past, but with more of an emphasis on hold margin and leakage reduction. Our approach will be to take multiple bitcell topologies, including but not limited to:
• Standard 6T SRAM
• 5T SRAM (asymmetrical, proposed in [Ryan])
• SRAM with low-power header (presented in [Anderson])
These bitcells will be simulated using SPICE with a few different tests, both static and dynamic. We will vary transistor sizes for each of the bitcell topologies (or use existing knowledge) to determine the optimal sizing for each bitcell topology we test. If given the opportunity, we will also tape out banks of the different bitcell topologies to gather hardware test results, and compare them to simulation.
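Because every configuration bit sits in the hold state while the FPGA is running, per-bit leakage is multiplied by the full bit count. The following back-of-the-envelope sketch shows that scaling. It counts only LUT bits plus the one output-mode bit per BLE described in the background section (interconnect bits are ignored), and the per-bit leakage figure is a placeholder chosen for illustration, not a measured or simulated value.

```python
def ble_config_bits(lut_inputs: int) -> int:
    """Configuration bits in one BLE: 2**k LUT bits plus one output-mode bit."""
    return 2 ** lut_inputs + 1


def clb_logic_config_bits(lut_inputs: int, cluster_size: int) -> int:
    """Logic-only configuration bits in a CLB (local interconnect ignored)."""
    return cluster_size * ble_config_bits(lut_inputs)


def total_leakage_watts(num_bits: int, leakage_per_bit_watts: float) -> float:
    """Total hold-state leakage, assuming every bitcell leaks the same amount."""
    return num_bits * leakage_per_bit_watts


# Placeholder assumption: one million bits leaking 10 pW each.
example_total = total_leakage_watts(1_000_000, 10e-12)
```

With these placeholder numbers the configuration memory alone would dissipate about 10 µW, a significant fraction of a sub-mW budget, which is why hold-state leakage rather than read or write behavior drives the bitcell comparison.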

Evaluation Metrics
In our study of routing switches to be used in the FPGA interconnect, our metrics for evaluation include:
• Area − The switch matrices in an FPGA account for the largest percentage of the area in an FPGA. Reducing the area of the switch box, then, greatly decreases the area of the whole FPGA, thus also decreasing power and energy consumption. Area will be estimated by transistor count.
• Power Consumption − In trying to find an optimal design for ULP FPGAs, it will be important for the block to consume the least amount of power possible, to stay within the strict power constraints of ULP applications. We will measure power consumption through simulations of the device performing our given task.
• Energy Consumption − While it is clearly important to minimize power for devices in ULP applications, it is often the case that ULP applications have strict energy consumption constraints as well, particularly if the device runs on a small battery or harvested energy. Thus, while timing constraints might be lax, energy consumption still needs to be low to extend battery life.
• Robustness − Maintaining proper functionality in the presence of process-voltage-temperature (PVT) variations is especially important when discussing ULP applications, because the physical parameters of the devices are even more sensitive to these variations.
The evaluation metrics for the CLB comparison are: • Area − We will attempt to reduce CLB area as much as possible.Smaller CLB area corresponds to smaller capacitances, corresponding to lower power dissipation and energy consumption inside the CLBs.Moreover, smaller CLBs also reduce the wirelengths of both the intra-CLB and global interconnects, and therefore the power dissipation and energy consumption.
• Power Consumption

• Energy Consumption
• Routeability − More depopulation (for mux-based CLBs) and fewer routing channels (for mini-FPGA CLBs) reduce power, energy, and area, but also reduce the flexibility and potential functionality of the CLB.We want to have a metric for how much that flexibility to realize different circuits is reduced.There are many ways we could calculate, but we will simply represent routeability as the number of possible inputs to a pin.
• Robustness − measures of robustness for the purposes of this study will be more for individual circuit elements (muxes vs. routing resources) then they will be for the full CLBs.The CLBs are so large (can be 1000s of transistors) that Monte Carlo simulations, which would be conducted in order to characterize circuits in the presence of variation, would not be viable.
The evaluation metrics for the configuration bitcell topologies are:
• Leakage Power − Along with hold margin, this is one of the more important metrics by which to evaluate the different configuration bit topologies. These configuration bits are some of the most common circuit elements in an FPGA, so limiting the leakage power of these bitcells will have a strong impact on the overall leakage power of the device.
• Hold Margin − Another important attribute for the configuration bitcells. For FPGAs, the most important function of a configuration bitcell is to retain its state. Thus, it will be important to quantify how much voltage swing on a bitline a configuration bit can tolerate without losing its value while access to the bit is denied.
• Retention Voltage − Reducing voltage is the most effective way of lowering power consumption. As a result, it will be important to characterize different bitcell topologies based on how far the supply voltage of the bitcell can be reduced while still reliably retaining the information stored inside.
• Write Voltage Boost − A less important metric. When writing bitcells that are designed specifically to hold, without consideration given to read and write speed, it should still be understood what overheads are required to successfully write to the cell.
• WL Pulse Width − This metric quantifies the time it takes to write into each individual cell. This metric is again less important, because writing to the bitcell happens very rarely in the FPGA lifetime, compared to regular functionality.
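The retention voltage metric above can be extracted with a simple search over supply voltage. The following is a minimal sketch, assuming a hypothetical predicate `retains_state(vdd)` that would wrap a SPICE hold simulation and report whether the cell keeps its value; that predicate, the function name, and the assumption that retention is monotonic in supply voltage are illustrative and not part of the proposed flow.

```python
from typing import Callable


def min_retention_voltage(
    retains_state: Callable[[float], bool],
    v_low: float,
    v_high: float,
    tol: float = 1e-3,
) -> float:
    """Bisect for the lowest supply voltage at which the cell still holds data.

    Assumes retention is monotonic: if the cell holds at some voltage,
    it also holds at every higher voltage in [v_low, v_high].
    """
    if not retains_state(v_high):
        raise ValueError("cell does not retain state even at v_high")
    if retains_state(v_low):
        return v_low
    # Invariant: the cell fails at v_low and holds at v_high.
    while v_high - v_low > tol:
        mid = 0.5 * (v_low + v_high)
        if retains_state(mid):
            v_high = mid
        else:
            v_low = mid
    return v_high
```

In practice the predicate would be evaluated under variation as well, so the reported retention voltage would be the worst case over the sampled cells rather than a single nominal run.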

Routing Switch Exploration
The work presented in this proposal extends the work done in [6]. That FPGA design used pass transistors as interconnect routing switches, and proposed a modified Schmitt trigger sense amp receiver that transitions low-to-high at a lower threshold voltage than the traditional buffer, thus drastically lowering the static current that results from a low-swing input. In this work, we compare that design to other low-swing sense amp designs to see if there are better solutions. Figure 3.2 shows the circuit topologies of the different sense amps. The double buffer is a low-swing and a high-swing buffer multiplexed together, so that, depending on the transition, the proper buffer is used to maximize speed and limit static current. The low-swing buffer is very similar to the modified Schmitt trigger, only with the hysteresis transistor removed. Figure 3.3 shows the voltage transfer characteristics (VTCs) of the different sense amp topologies, as well as the static current consumption across input voltage. In addition to the three custom receivers, we also compare a buffer built from two inverters and a conventional Schmitt trigger. In this plot, the sense amps are compared at an operating voltage of 0.3 V. The double buffer has the best VTC overall, because it has the lowest low-to-high transition voltage, but also the highest high-to-low transition voltage. This means the device will switch quickly in both transitions, and should have reasonably low static current. It is also important to note that for both Schmitt trigger designs, the high-to-low transition is at a lower voltage than the low-to-high transition, which results in very slow high-to-low transitions. As a result, the high-to-low propagation delay is higher for the Schmitt triggers, and the static current in these transitions is much higher. The low-swing buffer has the smallest static current, which also peaks at a lower voltage.

In deep subthreshold (VDD = 0.25 V), we find that the pass gate network is 63% faster as well.

Figure 3.7 shows the cluster sizes at which the multiplexer-based design and the mini-FPGA design have approximately the same transistor count, which will be referred to as the break-even point. At cluster sizes higher than this break-even point, the mini-FPGA style CLBs have a smaller area than the multiplexer-based designs. As the channel width increases, the break-even point happens at larger cluster sizes, as the number of switch boxes and connection boxes increases. For a channel width of 6, the transistor count can break even at around 5 BLEs per CLB, if the multiplexer is fully buffered. By making the dependency of transistor count linear (in the mini-FPGA design) instead of exponential (in the multiplexer case), it becomes more feasible to further increase the cluster size of CLBs.
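The break-even search can be sketched as a scan over cluster sizes with two transistor-count models. Both cost functions below are hypothetical placeholders: the rule for the number of CLB inputs, the per-input and per-track transistor costs, and all function names are assumptions chosen for illustration, and they are not the models used to generate Figure 3.7.

```python
from typing import Callable, Iterable, Optional


def mux_clb_transistors(n: int, k: int = 4, t_per_mux_input: int = 4) -> int:
    """Toy model: every LUT input has its own (I + N)-input multiplexer."""
    cluster_inputs = k * (n + 1) // 2   # assumed rule of thumb for I
    mux_inputs = cluster_inputs + n     # CLB inputs plus BLE feedback paths
    return n * k * mux_inputs * t_per_mux_input


def mini_fpga_clb_transistors(n: int, w: int, t_per_track: int = 40) -> int:
    """Toy model: cost grows linearly with cluster size N and channel width W."""
    return n * w * t_per_track


def break_even_cluster_size(
    cost_mux: Callable[[int], int],
    cost_mini: Callable[[int], int],
    cluster_sizes: Iterable[int],
) -> Optional[int]:
    """Return the smallest N at which the mini-FPGA CLB is no larger, else None."""
    for n in cluster_sizes:
        if cost_mini(n) <= cost_mux(n):
            return n
    return None


if __name__ == "__main__":
    for width in (4, 6, 8):
        n_star = break_even_cluster_size(
            mux_clb_transistors,
            lambda n, w=width: mini_fpga_clb_transistors(n, w),
            range(1, 31),
        )
        print(f"W = {width}: break-even N = {n_star}")
```

With real transistor counts substituted for the two placeholder models, the same scan would locate the boundaries at which the preferred CLB style changes for each channel width.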
While discussion of transistor area is important, it is also important to characterize the effect of CLB implementation strategies on the functionality of the FPGA as a whole. For the purposes of this comparison, we will use delay as a notion of CLB performance, and energy as a metric for CLB efficiency. We chose architectural parameters for the simulation that were consistent with the mini-FPGA CLB used in [6]. The LUTs were not given true functions to perform, such as AND gates. Instead, we configured them with chessboard patterns, such that the odd bit values were set to 1 and the even bit values were set to 0. This allowed us to test each possible delay path through the CLB by driving the inputs with a binary counter (from 0 to 15). The delay measurements are pessimistic, as all 9 outputs are observed to find the worst-case delay for the CLB. Using all 9 BLEs will also give the worst-case energy estimate, because circuit elements in all nine of the BLEs will be switching. Figure 3.8 plots these delays as the supply voltage is swept from sub-threshold (0.3 V) to super-threshold (0.8 V). The two topologies have very similar delays, differing by approximately 7.5% at 0.8 V and by 14.6% at 0.3 V. The mini-FPGA CLB, however, has slightly better performance across the range of supply voltages, including both sub- and super-threshold. We also observe that the delay savings from using the mini-FPGA CLB increase as circuit operation moves deeper into the sub-threshold regime.
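The chessboard stimulus can be illustrated with a small behavioral model of a 4-input LUT. This is a sketch only: it assumes bit 0 is the first (even) address and models the LUT as a plain memory read, and the function names are introduced here for illustration.

```python
from typing import List


def chessboard_lut(k: int = 4) -> List[int]:
    """Configuration bits for a K-input LUT: odd addresses hold 1, even hold 0."""
    return [address % 2 for address in range(2 ** k)]


def lut_output(config: List[int], inputs: int) -> int:
    """A LUT behaves as a memory read: the input word selects one stored bit."""
    return config[inputs]


def count_output_toggles(config: List[int]) -> int:
    """Drive the LUT with a binary counter and count output transitions."""
    outputs = [lut_output(config, value) for value in range(len(config))]
    return sum(a != b for a, b in zip(outputs, outputs[1:]))


if __name__ == "__main__":
    config = chessboard_lut(4)
    print("configuration bits:", config)
    print("output toggles over one count from 0 to 15:", count_output_toggles(config))
```

Because adjacent addresses always store opposite values, the output switches on every counter step, so every configuration bit is read once per sweep of the counter.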
Figure 3.9 plots energy per operation against supply voltage for the different CLB topologies. For this comparison, we use a frequency of 10 kHz, giving enough time for the operation of the CLBs at 0.3 V. From this plot, we see that the mini-FPGA CLB is a lower energy option across supply voltages, with an 18.1% reduction at 0.3 V and a 34.9% reduction at 0.8 V. This happens for two reasons. First, each of the 29-input multiplexers in the multiplexer-based CLB has 28 buffers, each of which actively draws current as the inputs to those multiplexers transition. This is not true for the interconnects of the mini-FPGA style CLBs, which have no buffers at all in the interconnect. Second, the mini-FPGA CLB is slightly faster, which will slightly lower the energy due to leakage per operation as well.
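The two contributions to energy per operation can be written as a short calculation. This is a simplified sketch, assuming that energy per operation is the switching energy plus the leakage power integrated over the time the operation takes; the function and parameter names are illustrative and the numbers in the usage example are made up, not simulation results.

```python
def energy_per_operation(e_dynamic_j: float, p_leak_w: float, t_op_s: float) -> float:
    """Switching energy plus leakage power integrated over the operation time."""
    return e_dynamic_j + p_leak_w * t_op_s


def percent_reduction(baseline: float, improved: float) -> float:
    """Relative saving of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical values: removing buffers lowers switching energy, and a
    # shorter operation time lowers the leakage term.
    mux_based = energy_per_operation(e_dynamic_j=2.0e-12, p_leak_w=1.0e-9, t_op_s=1.0e-4)
    mini_fpga = energy_per_operation(e_dynamic_j=1.6e-12, p_leak_w=1.0e-9, t_op_s=0.9e-4)
    print(f"energy reduction: {percent_reduction(mux_based, mini_fpga):.1f}%")
```

Under this model a faster CLB reduces only the leakage term, while removing buffers reduces the switching term, which matches the two reasons given above.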

Anticipated Results
There are other switch box topologies that are used by FPGAs, both commercially and academically. Including those in our study of switch boxes will help to make this work more complete. Also, by conducting a more thorough design space exploration, in which paths of different depths and branching breadths are explored, we can understand more clearly how different switch topologies work in FPGAs.

For the CLB exploration, the next step is to expand the simulation exploration to encompass more of the design space. The example simulation result is only one small point in that design space. If the same simulation were run with N = 3 and a mux-based depopulation of 75%, or with N = 30 and a mini-FPGA channel width of 2, the results would be very different, and would point in opposite directions. It will be important to look across all of the values around the break-even points in area (Figure 3.7), and see where the boundaries are for choosing between the two topologies based on performance and energy efficiency, not just area.

For exploring configuration bit topologies, we will need to characterize different bitcell options, and to compare them based on area, hold stability, and leakage power, both in simulation and in hardware.

Because these circuits are targeted to operate at low voltages, variability becomes an even larger issue. Thus, it is paramount to include robustness to variability as a metric, and to design our devices accordingly. We will do this by performing Monte Carlo simulations for different FPGA sub-circuits to determine how they operate in the presence of local and global variations.
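To illustrate why variability matters so much at low voltage, the following sketch runs a toy Monte Carlo experiment. It assumes the textbook subthreshold model in which on-current depends exponentially on threshold voltage, so that gate delay scales as exp(ΔVth / (n·vT)); the Gaussian threshold shift, the slope factor, and the sigma value are assumptions for illustration and would be replaced by foundry statistical models in SPICE.

```python
import math
import random
import statistics
from typing import Dict, List

THERMAL_VOLTAGE = 0.0259  # kT/q near room temperature, in volts


def sample_relative_delays(
    num_samples: int,
    sigma_vth: float,
    slope_factor: float = 1.5,
    seed: int = 0,
) -> List[float]:
    """Sample gate delay relative to nominal under random threshold shifts.

    Assumes delay is inversely proportional to subthreshold on-current and
    that local variation shifts Vth by a zero-mean Gaussian (sigma in volts).
    """
    rng = random.Random(seed)
    n_vt = slope_factor * THERMAL_VOLTAGE
    return [math.exp(rng.gauss(0.0, sigma_vth) / n_vt) for _ in range(num_samples)]


def summarize(delays: List[float]) -> Dict[str, float]:
    """Mean, median, and an approximate 99th percentile of the samples."""
    ordered = sorted(delays)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[int(0.99 * (len(ordered) - 1))],
    }


if __name__ == "__main__":
    print(summarize(sample_relative_delays(num_samples=2000, sigma_vth=0.03)))
```

The exponential dependence turns a symmetric threshold distribution into a long-tailed delay distribution, which is why worst-case rather than average behavior should drive sub-circuit design choices.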

Contributions
The contributions of this section are as follows:
• Survey of different techniques for design of FPGA sub-circuits for ultra-low power operation
  - Configuration Bits
  - Routing switches
  - Configurable Logic Blocks (CLBs)
• Design space exploration across different knobs
  - Configuration Bits

4 Architecture Exploration

Motivation
In the last section, we discussed how today's FPGAs use circuit elements that allow for increased performance, as FPGA companies design their devices to keep up with commercial processors. This is probably even truer of FPGA architectures. The newest generations of Xilinx and Altera chips both have much more complicated CLB structures than the academic FPGA model, mostly to push performance. In addition, these FPGAs include many IP blocks for specific functionalities, such as embedded block RAMs. Many FPGAs have extremely large interconnects, because what is more important for some applications is to provide the logic required and to finish computations as fast as possible (in the GHz range), as opposed to doing so energy efficiently. The problem with these devices is that they are not feasible for ULP applications. Many of these applications require active power dissipation of less than 1 mW. High-end commercial FPGAs have sleep power consumption that is still over 1 W. FPGAs that champion low energy operation, like the Microsemi IGLOO, still have operating power in the tens of mW, which is much lower than the rest of the field, but still about an order of magnitude too high for ULP applications. We propose starting from the beginning and going back to the architectural sweeps done in the past, but this time trying to minimize power consumption and transistor area, with less emphasis on FPGA performance. In doing so, we may find that different architectures, such as different LUT sizes and CLB cluster sizes, are actually better for FPGAs designed for ULP applications.

Prior Art
Work has been done to optimize individual architectural parameters for FPGAs. In [2], LUT inputs (K) and cluster size (N) were co-optimized to minimize the area-delay product of a mapped FPGA. In this study, the VPR tool was used, and the conclusion was that K = 4 to 6 and N = 3 to 10 provide the best results for area-delay product. [3] extends that study by including larger cluster sizes (up to 20) and different routing architectures, which take into account both different segment lengths and switch topologies (pass gates vs. buffers). This study also explicitly targets power consumption. The researchers found that K = 4 minimized power dissipation across different clustering values, and that a cluster size of N = 12 minimized both the total power of the FPGA and the power-delay product. [5] compares uni-directionality and bi-directionality of interconnects in FPGAs. Uni-directional wiring only allows signals to propagate in one direction through the interconnect, whereas bi-directional wiring allows signals to propagate in both directions. Findings from this study show that bi-directional routing (in the form of tri-state buffers) has lower energy consumption for low-frequency designs, but that uni-directional routing is lower energy for high-frequency operation.

The work proposed for this dissertation extends this work by combining the architectural knobs listed above, while also adding additional knobs to the design space (which will be explained in detail later in section 3.4). This dissertation will also extend this work by leveraging the VersaPower extension to VTR, which we believe will provide more accurate estimates of power consumption, and will take into account all knob changes directly. Lastly, using our tool-flow, we will also be able to generate schematics of the architectures we choose and simulate full FPGAs, adding another level of verification to the architecture exploration conclusions.

Research Question
How does the optimal FPGA architecture change with a different set of primary metrics, namely area and power consumption?

Approach
We will first do architecture sweeps for FPGA parameters, assuming the conventional design for the different FPGA sub-blocks (BLEs, interconnect resources, I/Os, etc.). We will do this by leveraging the VTR research tool developed by researchers from the University of Toronto. This tool allows users to start with an FPGA fabric (either a provided one or one the user creates) and virtually map an algorithm given in Verilog or BLIF format (again, some are provided, but the user may also create one). The virtual mapping consists of a description of the interconnect (describing the resources used), a netlist describing which blocks are connected to each other, a description of the placement of each block relative to the others, and a set of statistics about the mapping, such as resource utilization, number of routing channels, and critical path delay, to name a few [18]. For this exploration, we will use VTR's wide variety of knobs to do a thorough design space exploration of FPGA architectures on multiple benchmark circuits. We will then verify the results by doing simulations on a few test cases, leveraging a custom tool flow developed to work with VTR to go straight from FPGA architectural and circuit parameters to FPGA simulation and fabrication (explained later in the document). The knobs for this architectural exploration are as follows:
• K - the number of inputs to LUTs
• Clustering (N) - the number of BLEs in each logic block
• Channel Width (W) - the number of routing channels in a routing segment. This, however, can also be an output, which will be explained in the following section.
• Channel Fanout (FC) - the number of channels in the global interconnect that a logic block or FPGA I/O can connect to
• Segment Length (L) - the number of CLBs spanned by individual wire segments
• Uni-directionality vs. bi-directionality - whether or not interconnect tracks allow signals to go in both directions
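The sweep over these knobs amounts to enumerating a Cartesian product of parameter values. The sketch below shows only that enumeration; the specific values in `KNOBS` are hypothetical examples, and the step that turns each point into a VTR architecture description and runs the tool is deliberately not shown. Channel width is left out because it can be an output of routing rather than an input.

```python
from itertools import product
from typing import Any, Dict, Iterator, List

# Hypothetical knob values for illustration only.
KNOBS: Dict[str, List[Any]] = {
    "K": [4, 5, 6],                    # LUT inputs
    "N": [4, 8, 12],                   # BLEs per logic block
    "Fc": [0.15, 0.5, 1.0],            # fraction of tracks a pin can reach
    "L": [1, 4],                       # CLBs spanned by a wire segment
    "directionality": ["uni", "bi"],   # uni- or bi-directional tracks
}


def architecture_points(knobs: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one dictionary per combination of knob values."""
    names = list(knobs)
    for values in product(*(knobs[name] for name in names)):
        yield dict(zip(names, values))


if __name__ == "__main__":
    points = list(architecture_points(KNOBS))
    print(f"{len(points)} architecture points, first: {points[0]}")
```

Because the number of points grows multiplicatively with each added knob value, a coarse first sweep followed by a finer sweep around promising regions keeps the number of tool runs manageable.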

Evaluation Metrics
There are two classes of metrics that we will use to evaluate the different architectures. First, we have a set of outputs that come directly from VTR. We will find the architectures that minimize each of those individually, and also develop a figure of merit that combines all of these dependent variables, in order to find the optimal architecture, based on our set of knobs, for each benchmark circuit we test. We expect there to be an overarching pattern that holds true for most of the benchmarks, but it is possible for different classes of benchmarks to have different optimal architectures. That will be part of what we find in our study. The metrics of importance that are given in the outputs of VTR are:
• Channel Utilization - The percentage of routing resources that are used. High channel utilization results in less wasted interconnect circuitry, and is thus preferred.
• FPGA Size - Represented as the number of CLBs. The larger the FPGA, the larger the power consumption. If the same algorithm can be realized in a smaller FPGA, then the power consumption will be smaller. Therefore, smaller FPGAs are better for ULP applications.
• Routing Area, Logic Area, and Total Area - The same argument applies as with the number of CLBs; smaller is better. However, logic area and FPGA size change in opposite ways as clustering (N) changes. Seeing how those two metrics relate to each other, and how they affect routing area as well, will be an interesting part of the findings in this study.
• Power Consumption - The newest version of VTR includes a tool called VersaPower, which reports power estimates for some of the given architectures for the VTR tool. We will be able to use VersaPower for the given architectures at the very least, but with some time spent available, but the figure here was generated using VPR 5.0. Certain features of this older version of VPR limit the effectiveness of the tool for this architecture exploration. First, the logic area for a CLB is fixed, regardless of the amount of clustering. As a result, when we increase the clustering parameter for VPR, the estimated logic block area remains fixed. Also, as clustering increases, the tool's estimate for routing area goes down. Thus, the tool reports larger reductions in area due to clustering than would likely occur. Additionally, the main metric we want to address, power, cannot be addressed by VPR 5.0, but is available in VTR 7.0.
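One possible form for the combined figure of merit mentioned in this section is a weighted geometric mean of the metrics, each normalized to a baseline architecture. This is only a candidate formulation: the choice of a geometric mean, the metric names, and the weights are assumptions for illustration, and every metric is assumed to be lower-is-better (so channel utilization would be passed in as the unused fraction).

```python
import math
from typing import Dict


def figure_of_merit(
    metrics: Dict[str, float],
    baseline: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted geometric mean of baseline-normalized metrics (lower is better).

    A value of 1.0 means the architecture matches the baseline overall.
    """
    total_weight = sum(weights.values())
    log_sum = sum(
        weight * math.log(metrics[name] / baseline[name])
        for name, weight in weights.items()
    )
    return math.exp(log_sum / total_weight)


if __name__ == "__main__":
    # Hypothetical numbers, not tool output.
    baseline = {"total_area": 100.0, "power": 10.0, "unused_routing": 0.4}
    candidate = {"total_area": 80.0, "power": 9.0, "unused_routing": 0.5}
    weights = {"total_area": 1.0, "power": 2.0, "unused_routing": 0.5}
    print(f"figure of merit: {figure_of_merit(candidate, baseline, weights):.3f}")
```

Normalizing to a baseline keeps metrics with different units comparable, and the weights make the priority given to power over area explicit rather than implicit.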

Anticipated Results
With the newest version of VTR, we will be able to explore power consumption as a function of architecture choice. Ideally, VTR will also address some of the issues posed in the first attempt at the architectural sweep. One of the issues is the unrealistic estimate of logic area that VPR

FPGAs is also a large IC, using 1000s of transistors. As a result, building FPGA schematics by hand becomes impossible, given time and monetary constraints. Secondly, in order to simulate these FPGAs, they need to be configured to perform certain tasks. Commercial FPGAs have their own compilers for configuration, but these software packages are specific to their own FPGA products, and will not work for our custom-built FPGA fabrics. Thus, a tool-flow needs to be built that not only generates FPGA fabrics based on circuit-level and architectural parameters, but that can also create configurations for those FPGA schematics in order to conduct SPICE-level simulations.

Prior Art
The entire design space of FPGA hardware design can be abstracted to 3 axes: circuit parameters, architecture parameters, and benchmark circuit designs. Commercial FPGA companies, like Xilinx or Altera, allow for large flexibility in terms of the types of circuits that can be mapped to their devices, but are extremely limited otherwise. They provide only one alternative in terms of circuit parameters, and a small set of different architectures, ranging from smaller, low-end devices to large, high-performance devices. The VTR tool is slightly limited in the types of benchmarks it can map, but has a large range of architectural flexibility, meaning it can handle a large design space in architectural parameters [18]. Our proposed tool-flow will cover a much larger portion of the possible design space, and be able to cover both architectural and circuit-level knobs while also leveraging a variety

• Bitstream Generator - In order to truly test the functionality of the FPGA, a circuit has to be mapped to it, which we can then simulate or power up in hardware. To do this, we develop a set of Perl scripts that take the virtual mapping given by the VTR tool-flow and determine which physical configuration bits need to be set to 1 and 0 in the FPGA fabric we have built, using the WL/BL map generated earlier. These scripts then provide both a bitstream for configuring an FPGA through the scan-chain configuration scheme we implement on our FPGAs, and a set of initial condition statements, which configure the FPGA for simulation. Thus, the tool-flow allows researchers to simulate the configuration of the FPGA, or bypass it and go straight to simulating the functionality of the FPGA.

A popular solution for ULP wireless sensing is developing ASIC-based SoCs with a low-power controller. An example system is introduced in [1]. This SoC is designed to run off of harvested energy, and does not even need a battery. While the extremely low power consumption is state-of-the-art, there are still potential issues with the design. First, the node is prohibitively inflexible for widespread adoption. Right now, this node can only be used for ECG sensing applications. A node like this would be a viable solution for all IoT applications, but the cost of re-designing the SoC to tailor it to each application (potentially 100s) is extreme, both in time and in money. Additionally, making any updates or changes to the algorithms implemented in any of these SoCs also requires a respin of the chip. Aside from flexibility, testing ASIC SoCs can be difficult and costly. Introducing test structures into the different ASIC blocks of the SoC can hinder block performance, and it is difficult to attain adequate observability and testability in large, complicated SoCs.

Results
Including small FPGA fabrics, designed for ULP functionality, could potentially revolutionize ULP sensing. Functions that are common to most sensing applications (filtering, arithmetic operations, etc.) should remain ASIC-implemented, as ASICs are the most power-efficient implementation of algorithms. Algorithms that are prone to change, or those that are specific to different applications, should be implemented in the low-power reconfigurable fabric in the SoC. That way, a single SoC could be retargeted for a host of different applications, and the only necessary change would be a re-configuration of the FPGA portion of the SoC for the specific application. Including reconfigurable fabric on SoCs provides benefits beyond increased flexibility. The FPGA fabric can be configured as test structures for both itself and the entire SoC, allowing built-in self-test (BIST) for the SoC with greatly reduced overhead. This enables the removal of test structures from the ASIC blocks, enhancing their efficiency.

Prior Art
[4] discusses the benefits of embedding FPGA fabrics into SoCs. This paper gives a methodology for using the FPGA fabric to test itself, and also for using it as a means of testing other cores in the SoC. Leveraging the techniques described in this paper would allow a ULP sensor to be tested with as little cost and overhead as possible. Commercial FPGA companies have seen the potential benefits of integrating FPGA fabrics into SoCs, and have begun developing FPGA-based SoCs themselves.

the reconfigurable logic and compare the results to the ASIC implementations. It is probable that the FPGA implementation of some of these algorithms will consume more power than the ASIC implementation, but that does not make it useless. If the FPGA implementation, along with the rest of the SoC, falls within the stringent power budgets of low-power applications, then the FPGA fabric allows the SoC, which before was only functional for a single application, to be targeted to a host of new applications, orienting ULP SoC development in a direction that is viable for large-scale ULP applications like the IoT.

Evaluation Metrics
We will determine whether the inclusion of ULP reconfigurable fabric is viable for ULP sensing SoCs through evaluation of the following metrics:
• FPGA Size - Devices for ULP applications have strict size restrictions, for comfort and aesthetics in implementation. While FPGA fabrics can implement any function, given enough logic area, there is a practical limit to the number of FPGA logic resources that can be available for an FPGA block on an SoC targeted for ULP applications. It will be important to make sure that the algorithms partitioned for the FPGA fabric can be implemented using relatively few logic blocks.
• Power Consumption - It will be important to characterize the power consumption of the reconfigurable logic block as it performs the different functions partitioned for it.
• Testability - In order to further incentivize the use of FPGA fabrics in ULP SoCs, showing how the FPGA fabric can be used to test a ULP SoC in a specific test case will be a proof of concept, further showing the flexibility of the FPGA fabric. The testability of the FPGA fabric, however, is directly proportional to the size of the FPGA fabric, so this metric and the FPGA size metric oppose each other, and must be co-optimized.
• Energy Consumption - While it is important for ULP SoCs to have low power consumption, so as to be able to use smaller and smaller power sources, it is also important for the circuit to have low energy consumption, so as to extend the lifetime of the power source. Therefore, it will be important to make sure that the energy consumption of the reconfigurable logic is still low enough for inclusion in the ULP SoC.

Figure 2.1: A 4x4 2D-mesh FPGA. Configurable Logic Blocks (CLBs) are distributed and connected to each other and to input/output (I/O) blocks by a reprogrammable global interconnect.

Figure 2.3: Programmable FPGA interconnect. Connection boxes connect signals from CLBs to individual routing tracks of horizontal and vertical routing channels. Switch boxes connect the channels to each other.

Their iCE40 device boasts typical standby currents of 71 µA, with IP cores including embedded block RAMs, communication interfaces, and DSP blocks. The device clusters eight 4-input LUTs into each programmable logic block (PLB). Like other commercial FPGAs, their logic blocks have dedicated carry logic for more efficient arithmetic functionality. Their routing structure consists of three different segment lengths spanning 1, 4, and 12 PLBs [12].

3.1 Motivation
FPGAs are often overlooked for ULP applications primarily because of the inefficient overhead incurred from reprogrammability. Logic is represented through Look-Up Tables (LUTs), which are implemented as memory cells connected to a set of multiplexers. This implementation is inherently less efficient than an ASIC implementation of the same logic. Moreover, FPGAs have reconfigurable interconnect that connects the different logic blocks together, resulting in unused routing resources and in configuration memory cells used to select which routing resources are active. ASICs, on the other hand, only have the routing necessary to perform the given function, which greatly reduces the amount of circuitry outside of the logic path. The first step to lowering the overall power and energy consumption of FPGA fabrics is to optimize each FPGA sub-circuit. FPGAs have relatively few sub-circuit building blocks, making an exploration of each of them feasible. These blocks include interconnect resources (switch boxes and connection boxes), configurable logic blocks (CLBs),

Figure 3.2: a) Modified Schmitt trigger, b) low-swing buffer, and c) double buffer sense amp topologies.

Figure 3.3: a) VTCs and b) static current profiles for the low-high and high-low transitions of the different sense amp topologies at VDD = 0.3 V. The double buffer has the fastest low-high and high-low transitions, and the low-swing buffer has the smallest static current.

Figure 3.4: ED curves for sense amp circuits, driven by 50 pass gate switch boxes in series. The low-swing buffer is the overall best option for ULP design.

Figure 3.7: Total number of transistors vs. clustering value for different CLB types. Multiplexer-based CLBs become very large with high clustering values. The channel width of mini-FPGAs contributes heavily to the transistor count.

Figure 3.8 shows the delay for the two different CLB topologies across supply voltage (VDD),

Figure 3.8: Delay vs. supply voltage for the two CLB types. The two designs have essentially the same delay, with the mini-FPGA style having a slight advantage in sub-threshold.

Figure 3.9: Energy vs. supply voltage for the two CLB types. Mini-FPGA CLBs have lower energy across supply voltage than the multiplexer-based CLBs due to a lack of switching buffers in the switch boxes and connection boxes.

Figure 4.1: Optimal architecture parameters for running VPR to build an FPGA to map the ALU4 benchmark. The data is misleading, however, as configuration block area does not change with increased K or N.

Table 1: FPGA architectures of commercial and academic low-power FPGAs. * - These segments span one VersaTile (BLE), not a CLB.

configuration bits. They also power-gate unused resources, and have a stand-by sleep mode. With these techniques, this FPGA consumes 13-130 mW of active power and 46-460 µW of sleep power for 1500-15000 logic cells (BLEs).

As discussed earlier in this document, there are many circuit and architectural knobs that we are turning in order to design an FPGA that is optimized for ULP functionality. While testing each individual knob (such as circuit topology) can provide interesting information, it is also very important to see how these knob changes affect the functionality of the FPGA as a whole. While FPGA modelling (like what is done in VTR) can give some notion of how the full FPGA fabric behaves with different circuit-level and architectural parameters, a more accurate representation of the circuit performance can be obtained by conducting SPICE-level simulations of FPGA fabrics. To do this, it is necessary to build full FPGA schematics that employ the different circuit-level and architectural parameters we are testing, configure them to perform some function, and simulate them. However, there are many challenges to conducting these FPGA-level simulations. First, we are exploring multiple knobs, creating potentially 1000s of different combinations for FPGA fabrics. Each one of those