---
title: "NaN-Gate Synthesis and Hardware Deceleration"
author: |
  Cassie Jones
  Witch of Light
  list+sigbovik@witchoflight.com
date: "1 April 2020"
numbersections: true
documentclass: article
classoption:
- twocolumn
geometry: margin=1cm
---

# Abstract {-}

In recent years there has been interest in the field of "hardware decelerators," which serve primarily to make computation more interesting rather than more efficient.
This work builds on "NaN Gates and Flip FLOPS" [9] to provide a hardware synthesis implementation of real-number computational logic using the Yosys Open SYnthesis Suite [1] framework, and evaluates the impact of different floating point formats.

**ACH Reference Format:**
Cassie Jones.
2020.
NaN-Gate Synthesis and Hardware Deceleration.
In *Proceedings of SIGBOVIK 2020*, Pittsburgh, PA, USA, April 1, 2020.
(SIGBOVIK '20, ACH).
# Introduction {-}

Hardware decelerators work on the principle of "stop and smell the roses."
There are some qualities that are more important than sheer efficiency, and often these improvements can only be realized by taking the computer and slowing it down to a more leisurely pace.
The largest advancements in the field happen in the emulation space, since it's the most widely accessible.
It may be most familiar in the form of video-game computers: computers built out of redstone in Minecraft, Factorio combinators, or the like [7, 6].

> "But of course speed is not what we're after here. We're after just, beautiful computation going on inside the heart of this machine." --- Tom7 [10]

The SIGBOVIK 2019 paper "NaN Gates and Flip FLOPS" decelerates computers in the name of elegance: it throws away the assumption of binary computers and builds ones based on real numbers, specifically IEEE-754 floating point numbers.
It aims to "reboot computing using the beautiful foundation of real numbers," but it still leaves us with room for improvement in a few areas.
It leaves the logic gates in the domain of emulation, which limits the types of hardware that are easy to build, and it limits the elegance that can be achieved.
By using an existing CPU as the floating point processor, you still have a computer that's based on binary emulating your real number logic.

Here, we attempt to remove this limitation by bringing NaN-gate computation to the domain of native hardware, via a custom Yosys synthesis pass.

# NaN-Gate Synthesis

The Yosys Open SYnthesis Suite [1] is a free and open source architecture-neutral logic synthesis framework.
It can synthesize Verilog into a variety of backend formats using a flexible and pluggable architecture of passes.
The Yosys manual has a chapter on how to write extensions [2, Ch. 6], which can be consulted for documentation and examples on how Yosys passes are built.
We provide a Yosys extension which synthesizes a circuit down to a network of small floating point units implementing the NaN-gate function.
This can be further synthesized to a final target, like a specific FPGA architecture.

## Yosys Synthesis

We will demonstrate all synthesis with the following toggle module, since it's small enough for all stages of synthesis results to be understandable and fit neatly on the page.

```verilog
module toggle(input clk, input en, output reg out);

always @(posedge clk) begin
    if (en) out <= ~out;
end

endmodule
```
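
Before looking at synthesis output, the intended behavior can be sketched as a tiny Python model (the helper name is hypothetical; this is purely illustrative, not part of the flow):

```python
# Behavioral model of the toggle module: on each rising clock edge,
# the flip-flop takes ~out when en is high, and holds otherwise.
def toggle_step(out, en):
    """One rising clock edge; returns the flip-flop's next value."""
    return (not out) if en else out

out = False
trace = []
for en in [1, 1, 0, 1]:
    out = toggle_step(out, en)
    trace.append(int(out))
print(trace)  # [1, 0, 0, 1]
```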

Yosys will take a Verilog module like this and flatten the procedural blocks into circuits with memory elements.
Running synthesis gives us a circuit with a flip-flop, a toggle, and a multiplexer driven by the enable line.

![The synthesized toggle circuit.](figures/toggle-base.svg)

We can also ask Yosys to synthesize this to exclusively NAND and NOT gates with a small synthesis script.

```tcl
read_verilog toggle.v
synth
abc -g NAND
```

This particular design synthesizes to 1 D flip-flop, 3 NAND gates, and 2 NOT gates.

![The circuit synthesized down to only NAND, NOT, and memory elements.](figures/toggle-nand.svg)
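
That gate count matches the textbook 2:1 multiplexer built from NANDs. As a sanity check (a hand-written sketch, not the exact netlist Yosys emits), the next-state logic can be modeled with exactly 3 NAND and 2 NOT evaluations:

```python
def NAND(a, b):
    return not (a and b)

def NOT(a):
    return not a

def toggle_next(out, en):
    """Next-state logic `en ? ~out : out` built from 3 NANDs and 2 NOTs."""
    n1 = NAND(out, NOT(en))   # selects `out` when en is low
    n2 = NAND(NOT(out), en)   # selects `~out` when en is high
    return NAND(n1, n2)       # final mux output

# Exhaustive equivalence check against the behavioral definition.
ok = all(toggle_next(o, e) == ((not o) if e else o)
         for o in (False, True) for e in (False, True))
print(ok)  # True
```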

## The Yosys `synth_nan` Pass

Our `synth_nan` pass is implemented as a Yosys extension.
For convenience, we'll describe its behavior in terms of the 3-bit float synthesis.
It converts a module to NaN-gate logic.
It summarizes itself as running:

```tcl
synth
abc -g NAND
nand_to_nan
share_nan
dff_nan
simplify_nan
clean
techmap_nan
```

The first two steps are standard synthesis.
The `synth` pass will convert a module into coarsely optimized circuits, and `abc -g NAND` will remap the entire thing into optimized NAND logic.

One complexity we have to deal with is external interfaces.
Despite the wonderful realms of pure real-number computation we would like to stay within, when interacting with fixed hardware component interfaces we have to convert between floats and bits.
To handle this, the NaN gate tech library has modules like `fp3_to_bit` and `bit_to_fp3` which perform this boundary conversion.
To deal with the chaotic diversity of real circuits, for robustness, the `nand_to_nan` pass converts each NAND gate to a `bit_to_fp3 -> NaN -> fp3_to_bit` chain.
Don't worry: these conversions will later be removed everywhere they don't turn out to be necessary.

![The toggle circuit synthesized to NaN gates. Note that the external logic ports have floating point conversion modules, but the clock line doesn't.](figures/toggle-nan3.svg)

The `share_nan` pass reduces the number of conversions by sharing ones that have the same inputs.
Then, the `dff_nan` pass expands each flip-flop in the circuit into enough flip-flops to store a floating point value.

The `simplify_nan` pass converts any instance of `fp3_to_bit -> bit_to_fp3` to just a wire that passes the floats straight through.

We run `clean` to remove dead wires and useless buffers, and then finally the `techmap_nan` pass replaces the opaque NaN-gate modules with actual implementations so that further synthesis can make them realizable on real hardware.
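
The overall effect of `simplify_nan` can be illustrated with a toy netlist model (single-input cells and hypothetical wire names for brevity; the real pass rewrites RTLIL cells):

```python
def simplify_nan(cells):
    """Collapse each fp3_to_bit -> bit_to_fp3 chain into a direct wire.

    `cells` is a list of (cell_type, input_wire, output_wire) tuples,
    assuming single fanout so each wire feeds at most one cell.
    """
    reader = {inp: idx for idx, (_, inp, _) in enumerate(cells)}
    removed, alias = set(), {}
    for idx, (typ, inp, out) in enumerate(cells):
        nxt = reader.get(out)
        if typ == "fp3_to_bit" and nxt is not None and cells[nxt][0] == "bit_to_fp3":
            removed |= {idx, nxt}
            alias[cells[nxt][2]] = inp  # the chain's output is now driven by `inp`
    return [(t, alias.get(i, i), o)
            for idx, (t, i, o) in enumerate(cells) if idx not in removed]

netlist = [
    ("nan_fp3",    "a_fp", "x_fp"),
    ("fp3_to_bit", "x_fp", "x_b"),   # redundant round-trip...
    ("bit_to_fp3", "x_b",  "y_fp"),  # ...collapsed into a plain wire
    ("nan_fp3",    "y_fp", "z_fp"),
]
print(simplify_nan(netlist))
# [('nan_fp3', 'a_fp', 'x_fp'), ('nan_fp3', 'x_fp', 'z_fp')]
```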

## Module Ports

If you want your circuit to support external floating-point based interfaces, you can use the floating point conversion modules yourself.

```verilog
module toggle(
    input clk, input [2:0] en, output [2:0] out);

wire en_b;
reg out_b;
fp3_to_bit en_cvt(en, en_b);
bit_to_fp3 out_cvt(out_b, out);
always @(posedge clk) begin
    if (en_b) out_b <= ~out_b;
end

endmodule
```

The NaN synthesis will end up erasing the floating point conversions on either side of the interface, since they connect to floating point units.
Future work could include automatically expanding ports using something like a `(* nan_port *)` attribute.

# Floating Point Formats

While tom7's work asserts that a **binary4** floating point format is "clearly allowed by the IEEE-754 standard," this doesn't seem to hold up under close examination.
As Steve Canon brought to my attention [8], there are two cases where these floating point formats fall down.
First, and most importantly in the case of **binary4**, you need to encode both quiet and signaling NaNs.
Section 3.3 of IEEE-754 says [5]:

> "Within each format, the following floating-point data shall be represented: [...] Two NaNs, qNaN (quiet) and sNaN (signaling)."

While **binary4** does have two separate NaN values (a positive and a negative), they are distinguished only by their sign bit, which isn't allowed to distinguish the two kinds of NaN, as we can see in 6.2.1:

> "When encoded, all NaNs have a sign bit and a pattern of bits necessary to identify the encoding as a NaN and which determines its kind (sNaN vs. qNaN)."
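
For illustration, the same rule is visible in the familiar binary32 format, where the most significant mantissa bit conventionally carries the quiet/signaling distinction (an aside, not part of the synthesis flow):

```python
import math
import struct

def f32(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

assert f32(0x7F800000) == math.inf  # all-ones exponent, zero mantissa
assert math.isnan(f32(0x7FC00000))  # quiet NaN: top mantissa bit set
assert math.isnan(f32(0x7FA00000))  # signaling-NaN pattern: still a NaN
assert math.isnan(f32(0xFFC00000))  # the sign bit alone doesn't change the kind
```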

\newpage
\begin{figure*}
\centering
\includegraphics[width=7in]{figures/synthesis-pass.pdf}
\caption{The full NaN-gate synthesis process for the toggle module. In step 1 we have the logical circuit after coarse synthesis. In step 2 it's synthesized to NAND and NOT gates. Step 3 converts the gates to NaN gates and adds conversion chains. Step 4 expands the flip-flops to store floats. Step 5 collapses redundant conversion chains to give the final NaN-synthesized module.}
\end{figure*}
\clearpage

This means that we need at least two bits in the mantissa to represent the infinities (stored as a 0 mantissa with the maximum exponent) and the NaN values (stored with two distinct non-zero mantissas).

![From Steve Canon's tweet [8]. As the cat says, since IEEE-754 requires $emax$ to be greater than $emin$, there must be two exponent bits, and since the mantissa must distinguish sNaN, qNaN, and infinity, it also needs at least two bits, leading to a minimum of 5-bit floats.](binary5.jpg)

The **binary3** format is further ruled out by section 3.3, which excludes the idea of an empty exponent range: $emin$ and $emax$ appear in an inequality, which forces $emin \le emax$, and other constraints elsewhere force the relation to be strict:

> "$q$ is any integer $emin \le q+p-1 \le emax$" (3.3)

Still, the **binary3** format is very useful for efficient implementation of NaN gates, and is worth including in synthesis for people who aren't bothered by standards compliance.
For completeness, the `synth_nan` implementation supports synthesis to **binary3**, **binary4**, and the definitely-IEEE-754-compliant **binary5** format NaN-gates.
Furthermore, the architecture supports easy extension to larger, more conventional floating point formats like **binary8**, or even larger, by simply loading your own library of modules named `nan_fpN`, `bit_to_fpN`, and `fpN_to_bit` for any value of `N` you want to synthesize with.

## The binary5 Representation

Here we document the representation of the **binary5** format, the smallest legal IEEE-754 compliant binary floating-point format.
It has a sign bit, a two-bit exponent, and a two-bit mantissa.
We include a table of all of the positive values here:

s E  T  value
- -- -- -----
0 00 00 +0.0
0 00 01 +0.25
0 00 10 +0.5
0 00 11 +0.75
0 01 00 +1.0
0 01 01 +1.25
0 01 10 +1.5
0 01 11 +1.75
0 10 00 +2.0
0 10 01 +2.5
0 10 10 +3.0
0 10 11 +3.5
0 11 00 +inf
0 11 01 sNaN
0 11 10 qNaN
0 11 11 qNaN

<!-- @TODO: Make table captions work -->
The positive values representable in the **binary5** format. Note that infinity, sNaN, and qNaN are all distinguished by the mantissa value when the exponent is all ones, so this is the smallest possible floating point format. The negative values for each are the same bit patterns but with a 1 in the sign bit.
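
The table follows mechanically from the standard's encoding rules with two exponent bits (bias 1, so $emin = 0$ and $emax = 1$) and two trailing significand bits. A small decoder, written for this presentation rather than taken from the synthesis library:

```python
import math

def decode_binary5(s, E, T):
    """Decode binary5: sign bit s, 2-bit exponent E, 2-bit mantissa T.

    Uses bias 1, so emin = 0 and emax = 1.
    """
    sign = -1.0 if s else 1.0
    if E == 0b11:                    # all-ones exponent: inf or NaN
        if T == 0b00:
            return sign * math.inf
        return "sNaN" if T == 0b01 else "qNaN"
    if E == 0b00:                    # subnormal: 2**emin * (0 + T/4)
        return sign * T / 4.0
    return sign * 2.0 ** (E - 1) * (1 + T / 4.0)  # normal value

# Spot-checks against the table above.
assert decode_binary5(0, 0b00, 0b10) == 0.5
assert decode_binary5(0, 0b01, 0b11) == 1.75
assert decode_binary5(0, 0b10, 0b01) == 2.5
assert decode_binary5(0, 0b11, 0b00) == math.inf
assert decode_binary5(1, 0b10, 0b10) == -3.0
assert decode_binary5(0, 0b11, 0b11) == "qNaN"
```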

## Evaluation

We compare the size (in both logic and memory elements) and clock speed of modules synthesized with the different floating point formats.
For the benchmarks, we use a pipelined 32-bit multiplier and the PicoRV32 processor [3], synthesized for the ECP5 architecture and placed and routed using nextpnr [4].
The numbers given for clock frequency are the best result of 10 runs of placement and routing.

<!-- @TODO: Actually run 10 times -->

The "NAND" variants are synthesized to NAND gates before architecture-specific optimization, in order to obscure some of the higher-level modules present in the original design and prevent optimizations that won't be available to the NaN-gate synthesis.
This gives a clearer baseline for comparison with the NaN gates, and so it is used as the basis for the relative numbers.
Entries marked **DNP** did not successfully place and route for timing analysis, so no frequency can be reported.

Design  Variant  Cells  Cell%  DFFs  DFF%  (MHz)
------  -------  -----  -----  ----  ----  -------
PicoRV  Direct    1884    57%   888   48%  103.55
        NAND      3328   100%  1848  100%  47.30
        fp3      43739  1314%  5544  299%  **DNP**
        fp4      32853   987%  7392  400%  **DNP**
        fp5      65511  1968%  9240  500%  **DNP**
Mult32  Direct    2879   104%   628  100%  143.04
        NAND      2773   100%   628  100%  154.37
        fp3      25349   880%  1884  300%  25.26
        fp4      19026   661%  2520  400%  21.88
        fp5      38001  1320%  3140  500%  21.18

It's interesting that fp4 is the smallest of the floating point variants in logic, even though not in memory.
It seems likely that this is because the ECP5 architecture is based on "LUT4" elements (4-input lookup tables), which means the individual NaN-gates may happen to synthesize more efficiently with 4-bit values.

## Flattening

For this benchmark, we synthesize designs without flattening after NaN-gate synthesis, because the optimizer is otherwise too effective and eliminates most of the floating point logic.
When the designs are flattened, the optimizer can consider the logic inside the individual NaN gates, re-combine it, and erase constant-value flip-flops.
Designs that are flattened before optimizing have no flip-flop overhead, and have on the order of 5% overhead in logic elements vs. the reference NAND-gate versions.

While synthesizing with post-NaN flattening substantially undermines the floating point logic and mostly demonstrates the impressive quality of Yosys's optimizations, it suggests as an option a sort of "homeopathic floating-point logic."
Users who require efficiency but still want *some* of the elegance benefits can flatten the design and optimize it away, keeping some peace of mind in knowing that their final circuit is derived from an elegant real-number system, regardless of how it currently behaves.

# Future Work

Floating point synthesis still has many avenues for improvement and future work.

The current synthesis approach used by `synth_nan` remains fragile in the face of flattening and pass ordering.
It should be possible to make it harder to accidentally flatten the designs away into nothing, but they do still need to be flattened eventually, since the `nextpnr` place-and-route flow is not fully reliable in the presence of un-flattened designs.
Currently the `synth_nan` pass must be run before any device-specific passes, which can be fine, but it prevents the utilization of resources such as distributed RAMs.

Float synthesis tools should make it easier to define module ports that should be expanded to accommodate floating-point based signals, so that designs can operate fully in the glorious domain of the real numbers without having to flatten all designs.

More work could be done on ensuring that the individual gates are properly optimized for different architectures, since it seems unreasonable for fp4 to remain more efficient than fp3.
The system could also benefit from implementing a larger set of primitive gates, to avoid the blowup of using NAND gates to emulate everything, since they should be implementable in similar amounts of elementary logic.

With **binary5** and larger, there looks to be potential in exploring designs that work *purely* on NaN values, exploiting the flexibility in the handling of signaling and quiet NaN values.

The NaN-gate synthesis plugin for Yosys can be found at <https://git.witchoflight.com/nan-gate>.
This paper and the example materials in it can be found at <https://git.witchoflight.com/sigbovik-nan-2020>.

\begingroup
# References {-}
\setlength{\parindent}{-0.2in}
\setlength{\leftskip}{0.2in}
\setlength{\parskip}{8pt}

\vspace{-20pt}
\indent{}
[1] Claire Wolf.
The Yosys Open SYnthesis Suite.
Online.
\noindent{}
http://www.clifford.at/yosys/

[2] Claire Wolf.
The Yosys Manual.
Online.
\noindent{}
http://www.clifford.at/yosys/files/yosys_manual.pdf

[3] Claire Wolf.
PicoRV32 - A Size-Optimized RISC-V CPU.
Online.
\noindent{}
https://github.com/cliffordwolf/picorv32

[4] Claire Wolf, David Shah, Dan Gisselquist, Serge Bazanski, Miodrag Milanovic, and Eddie Hung.
2018.
NextPNR Portable FPGA Place and Route Tool.
Online.
\noindent{}
https://github.com/YosysHQ/nextpnr

[5] IEEE Standard for Floating-Point Arithmetic.
2019.
In *IEEE Std 754-2019 (Revision of IEEE 754-2008).*
Pages 1--84.
DOI 10.1109/IEEESTD.2019.8766229.

[6] justarandomgeek.
2017.
justarandomgeek's Combinator Computer Mk5.
Online.
\noindent{}
https://github.com/justarandomgeek/factorio-computer

[7] legomasta99.
2018.
Minecraft Computer Engineering - Redstone Computers.
Online.
<!--\noindent{}-->
https://www.youtube.com/watch?v=aj6IUuwLyOE
&list=PLwdt_GQ3o0Xe6pUS1vdzy0ZqXrApB2MNu

[8] Stephen Canon.
2017.
A Tweet About IEEE-754.
Tweet by \@stephentyrone on April 1, 2017.
\noindent{}
https://twitter.com/stephentyrone/status/848172687268687873

[9] Tom Murphy VII.
2019.
NaN Gates and Flip FLOPS.
In *Proceedings of SIGBOVIK 2019*, Pittsburgh, PA, USA, April 1, 2019.
(SIGBOVIK '19, ACH).
Pages 98--102.

[10] Tom Murphy VII.
2019.
NaN Gates and Flip FLOPS.
Online.
http://tom7.org/nand/
\endgroup