---
title: "NaN-Gate Synthesis and Hardware Deceleration"
author: |
  Cassie Jones
  Witch of Light
  list+sigbovik@witchoflight.com
date: "1 April 2020"
numbersections: true
documentclass: article
classoption:
- twocolumn
geometry: margin=1cm
---

# Abstract {-}

In recent years there has been interest in the field of "hardware decelerators," which serve primarily to make computation more interesting rather than more efficient.
This work builds on "NaN Gates and Flip FLOPS" [9] to provide a hardware synthesis implementation of real-number computational logic using the Yosys Open SYnthesis Suite [1] framework, and evaluates the impact of different floating point formats.

**ACH Reference Format:**
Cassie Jones.
2020.
NaN-Gate Synthesis and Hardware Deceleration.
In *Proceedings of SIGBOVIK 2020*, Pittsburgh, PA, USA, April 1, 2020.
(SIGBOVIK '20, ACH).
# Introduction {-}

Hardware decelerators work on the principle of "stop and smell the roses."
There are some qualities that are more important than sheer efficiency, and often these improvements can only be realized by taking the computer and slowing it down to a more leisurely pace.
The largest advancements in the field happen in the emulation space, since it's the most widely accessible.
It may be most familiar in the form of video-game computers: computers built out of redstone in Minecraft, Factorio combinators, or the like [7, 6].

> "But of course speed is not what we're after here. We're after just, beautiful computation going on inside the heart of this machine." --- Tom7 [10]

The SIGBOVIK 2019 paper "NaN Gates and Flip FLOPS" decelerates computers in the name of elegance: it throws away the assumption of binary computers and builds ones based on real numbers, specifically IEEE-754 floating point numbers.
It aims to "reboot computing using the beautiful foundation of real numbers," but it still leaves us with room for improvement in a few areas.
It leaves the logic gates in the domain of emulation, which limits the types of hardware that are easy to build, and it limits the elegance that can be achieved.
By using an existing CPU as the floating point processor, you still have a computer that's based on binary emulating your real number logic.

Here, we attempt to remove this limitation by bringing NaN-gate computation to the domain of native hardware, via a custom Yosys synthesis pass.

# NaN-Gate Synthesis

The Yosys Open SYnthesis Suite [1] is a free and open source architecture-neutral logic synthesis framework.
It can synthesize Verilog into a variety of backend formats using a flexible and pluggable architecture of passes.
The Yosys manual has a chapter on how to write extensions [2, Ch. 6], which can be consulted for documentation and examples on how Yosys passes are built.
We provide a Yosys extension which synthesizes a circuit down to a network of small floating point units implementing the NaN-gate function.
This can be further synthesized to a final target, like a specific FPGA architecture.

## Yosys Synthesis

We will demonstrate all synthesis with the following toggle module, since it's small enough for all stages of synthesis results to be understandable and fit neatly on the page.

```verilog
module toggle(input clk, input en, output reg out);

always @(posedge clk) begin
    if (en) out <= ~out;
end

endmodule
```
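
Before looking at synthesis output, the intended behavior can be sketched as a tiny Python model (the helper name is hypothetical; this is purely illustrative, not part of the flow):

```python
# Behavioral model of the toggle module: on each rising clock edge,
# the flip-flop takes ~out when en is high, and holds otherwise.
def toggle_step(out, en):
    """One rising clock edge; returns the flip-flop's next value."""
    return (not out) if en else out

out = False
trace = []
for en in [1, 1, 0, 1]:
    out = toggle_step(out, en)
    trace.append(int(out))
print(trace)  # [1, 0, 0, 1]
```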

Yosys will take a Verilog module like this and flatten the procedural blocks into circuits with memory elements.
Running synthesis gives us a circuit with a flip-flop, a toggle, and a multiplexer driven by the enable line.

![The synthesized toggle circuit.](figures/toggle-base.svg)

We can also ask Yosys to synthesize this to exclusively NAND and NOT gates with a small synthesis script.

```tcl
read_verilog toggle.v
synth
abc -g NAND
```

This particular design synthesizes to 1 D flip-flop, 3 NAND gates, and 2 NOT gates.

![The circuit synthesized down to only NAND, NOT, and memory elements.](figures/toggle-nand.svg)
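
That gate count matches the textbook 2:1 multiplexer built from NANDs. As a sanity check (a hand-written sketch, not the exact netlist Yosys emits), the next-state logic can be modeled with exactly 3 NAND and 2 NOT evaluations:

```python
def NAND(a, b):
    return not (a and b)

def NOT(a):
    return not a

def toggle_next(out, en):
    """Next-state logic `en ? ~out : out` built from 3 NANDs and 2 NOTs."""
    n1 = NAND(out, NOT(en))   # selects `out` when en is low
    n2 = NAND(NOT(out), en)   # selects `~out` when en is high
    return NAND(n1, n2)       # final mux output

# Exhaustive equivalence check against the behavioral definition.
ok = all(toggle_next(o, e) == ((not o) if e else o)
         for o in (False, True) for e in (False, True))
print(ok)  # True
```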

## The Yosys `synth_nan` Pass

Our `synth_nan` pass is implemented as a Yosys extension.
For convenience, we'll describe its behavior in terms of the 3-bit float synthesis.
It converts a module to NaN-gate logic.
It summarizes itself as running:

```tcl
synth
abc -g NAND
nand_to_nan
share_nan
dff_nan
simplify_nan
clean
techmap_nan
```

The first two steps are standard synthesis.
The `synth` pass will convert a module into coarsely optimized circuits, and `abc -g NAND` will remap the entire thing into optimized NAND logic.

One complexity we have to deal with is external interfaces.
Despite the wonderful realms of pure real-number computation we would like to stay within, when interacting with fixed hardware component interfaces we have to convert between floats and bits.
To handle this, the NaN gate tech library has modules like `fp3_to_bit` and `bit_to_fp3` which perform this boundary conversion.
To deal with the chaotic diversity of real circuits, for robustness, the `nand_to_nan` pass converts each NAND gate to a `bit_to_fp3 -> NaN -> fp3_to_bit` chain.
Don't worry: these conversions will later be removed everywhere they don't turn out to be necessary.

![The toggle circuit synthesized to NaN gates. Note that the external logic ports have floating point conversion modules, but the clock line doesn't.](figures/toggle-nan3.svg)

The `share_nan` pass reduces the number of conversions by sharing ones that have the same inputs.
Then, the `dff_nan` pass expands each flip-flop in the circuit into enough flip-flops to store a floating point value.

The `simplify_nan` pass converts any instance of `fp3_to_bit -> bit_to_fp3` to just a wire that passes the floats straight through.

We run `clean` to remove dead wires and useless buffers, and then finally the `techmap_nan` pass replaces the opaque NaN-gate modules with actual implementations so that further synthesis can make them realizable on real hardware.
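
The overall effect of `simplify_nan` can be illustrated with a toy netlist model (single-input cells and hypothetical wire names for brevity; the real pass rewrites RTLIL cells):

```python
def simplify_nan(cells):
    """Collapse each fp3_to_bit -> bit_to_fp3 chain into a direct wire.

    `cells` is a list of (cell_type, input_wire, output_wire) tuples,
    assuming single fanout so each wire feeds at most one cell.
    """
    reader = {inp: idx for idx, (_, inp, _) in enumerate(cells)}
    removed, alias = set(), {}
    for idx, (typ, inp, out) in enumerate(cells):
        nxt = reader.get(out)
        if typ == "fp3_to_bit" and nxt is not None and cells[nxt][0] == "bit_to_fp3":
            removed |= {idx, nxt}
            alias[cells[nxt][2]] = inp  # the chain's output is now driven by `inp`
    return [(t, alias.get(i, i), o)
            for idx, (t, i, o) in enumerate(cells) if idx not in removed]

netlist = [
    ("nan_fp3",    "a_fp", "x_fp"),
    ("fp3_to_bit", "x_fp", "x_b"),   # redundant round-trip...
    ("bit_to_fp3", "x_b",  "y_fp"),  # ...collapsed into a plain wire
    ("nan_fp3",    "y_fp", "z_fp"),
]
print(simplify_nan(netlist))
# [('nan_fp3', 'a_fp', 'x_fp'), ('nan_fp3', 'x_fp', 'z_fp')]
```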

## Module Ports

If you want your circuit to support external floating-point based interfaces, you can use the floating point conversion modules yourself.

```verilog
module toggle(
    input clk, input [2:0] en, output [2:0] out);

wire en_b;
reg out_b;
fp3_to_bit en_cvt(en, en_b);
bit_to_fp3 out_cvt(out_b, out);
always @(posedge clk) begin
    if (en_b) out_b <= ~out_b;
end

endmodule
```

The NaN synthesis will end up erasing the floating point conversions on either side of the interface, since they connect to floating point units.
Future work could include automatically expanding ports using something like a `(* nan_port *)` attribute.

# Floating Point Formats

While tom7's work asserts that a **binary4** floating point format is "clearly allowed by the IEEE-754 standard," this doesn't seem to hold up under close examination.
As Steve Canon brought to my attention [8], there are two cases where these floating point formats fall down.
First, and most importantly in the case of **binary4**, you need to encode both quiet and signaling NaNs.
Section 3.3 of IEEE-754 says [5]:

> "Within each format, the following floating-point data shall be represented: [...] Two NaNs, qNaN (quiet) and sNaN (signaling)."

While **binary4** does have two separate NaN values (a positive and a negative), they are distinguished only by their sign bit, which isn't allowed to distinguish the two kinds of NaN, as we can see in 6.2.1:

> "When encoded, all NaNs have a sign bit and a pattern of bits necessary to identify the encoding as a NaN and which determines its kind (sNaN vs. qNaN)."
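
For illustration, the same rule is visible in the familiar binary32 format, where the most significant mantissa bit conventionally carries the quiet/signaling distinction (an aside, not part of the synthesis flow):

```python
import math
import struct

def f32(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

assert f32(0x7F800000) == math.inf  # all-ones exponent, zero mantissa
assert math.isnan(f32(0x7FC00000))  # quiet NaN: top mantissa bit set
assert math.isnan(f32(0x7FA00000))  # signaling-NaN pattern: still a NaN
assert math.isnan(f32(0xFFC00000))  # the sign bit alone doesn't change the kind
```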

\newpage
\begin{figure*}
\centering
\includegraphics[width=7in]{figures/synthesis-pass.pdf}
\caption{The full NaN-gate synthesis process for the toggle module. In step 1 we have the logical circuit after coarse synthesis. In step 2 it's synthesized to NAND and NOT gates. Step 3 converts the gates to NaN gates and adds conversion chains. Step 4 expands the flip-flops to store floats. Step 5 collapses redundant conversion chains to give the final NaN-synthesized module.}
\end{figure*}
\clearpage

This means that we need at least two bits in the mantissa to represent the infinities (stored as a 0 mantissa with the maximum exponent) and the NaN values (stored with two distinct non-zero mantissas).

![From Steve Canon's tweet [8]. As the cat says, since IEEE-754 requires $emax$ to be greater than $emin$, there must be two exponent bits, and since the mantissa must distinguish sNaN, qNaN, and infinity, it also needs at least two bits, leading to a minimum of 5-bit floats.](binary5.jpg)

The **binary3** format is further ruled out by section 3.3, which excludes the idea of an empty exponent range: $emin$ and $emax$ appear in an inequality, which forces $emin \le emax$, and other constraints elsewhere force the relation to be strict:

> "$q$ is any integer $emin \le q+p-1 \le emax$" (3.3)

Still, the **binary3** format is very useful for efficient implementation of NaN gates, and is worth including in synthesis for people who aren't bothered by standards compliance.
For completeness, the `synth_nan` implementation supports synthesis to **binary3**, **binary4**, and the definitely-IEEE-754-compliant **binary5** format NaN-gates.
Furthermore, the architecture supports easy extension to larger, more conventional floating point formats like **binary8**, or even larger, by simply loading your own library of modules named `nan_fpN`, `bit_to_fpN`, and `fpN_to_bit` for any value of `N` you want to synthesize with.

## The binary5 Representation

Here we document the representation of the **binary5** format, the smallest legal IEEE-754 compliant binary floating-point format.
It has a sign bit, a two-bit exponent, and a two-bit mantissa.
We include a table of all of the positive values here:

s E  T  value
- -- -- -----
0 00 00 +0.0
0 00 01 +0.25
0 00 10 +0.5
0 00 11 +0.75
0 01 00 +1.0
0 01 01 +1.25
0 01 10 +1.5
0 01 11 +1.75
0 10 00 +2.0
0 10 01 +2.5
0 10 10 +3.0
0 10 11 +3.5
0 11 00 +inf
0 11 01 sNaN
0 11 10 qNaN
0 11 11 qNaN

<!-- @TODO: Make table captions work -->
The positive values representable in the **binary5** format. Note that infinity, sNaN, and qNaN are all distinguished by the mantissa value when the exponent is all ones, so this is the smallest possible floating point format. The negative values for each are the same bit patterns but with a 1 in the sign bit.
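
The table follows mechanically from the standard's encoding rules with two exponent bits (bias 1, so $emin = 0$ and $emax = 1$) and two trailing significand bits. A small decoder, written for this presentation rather than taken from the synthesis library:

```python
import math

def decode_binary5(s, E, T):
    """Decode binary5: sign bit s, 2-bit exponent E, 2-bit mantissa T.

    Uses bias 1, so emin = 0 and emax = 1.
    """
    sign = -1.0 if s else 1.0
    if E == 0b11:                    # all-ones exponent: inf or NaN
        if T == 0b00:
            return sign * math.inf
        return "sNaN" if T == 0b01 else "qNaN"
    if E == 0b00:                    # subnormal: 2**emin * (0 + T/4)
        return sign * T / 4.0
    return sign * 2.0 ** (E - 1) * (1 + T / 4.0)  # normal value

# Spot-checks against the table above.
assert decode_binary5(0, 0b00, 0b10) == 0.5
assert decode_binary5(0, 0b01, 0b11) == 1.75
assert decode_binary5(0, 0b10, 0b01) == 2.5
assert decode_binary5(0, 0b11, 0b00) == math.inf
assert decode_binary5(1, 0b10, 0b10) == -3.0
assert decode_binary5(0, 0b11, 0b11) == "qNaN"
```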

## Evaluation

We compare the size (in both logic and memory elements) and clock speed of modules synthesized with the different floating point formats.
For the benchmarks, we use a pipelined 32-bit multiplier and the PicoRV32 processor [3], synthesized for the ECP5 architecture and placed and routed using nextpnr [4].
The numbers given for clock frequency are the best result of 10 runs of placement and routing.

<!-- @TODO: Actually run 10 times -->

The "NAND" variants are synthesized to NAND gates before architecture-specific optimization, in order to obscure some of the higher-level modules present in the original design and prevent optimizations that won't be available to the NaN-gate synthesis.
This gives a clearer baseline for comparison with the NaN gates, and so it is used as the basis for the relative numbers.
Entries marked **DNP** did not successfully place and route for timing analysis, so no frequency can be reported.

Design  Variant  Cells  Cell%  DFFs  DFF%  (MHz)
------  -------  -----  -----  ----  ----  -------
PicoRV  Direct    1884    57%   888   48%  103.55
        NAND      3328   100%  1848  100%  47.30
        fp3      43739  1314%  5544  299%  **DNP**
        fp4      32853   987%  7392  400%  **DNP**
        fp5      65511  1968%  9240  500%  **DNP**
Mult32  Direct    2879   104%   628  100%  143.04
        NAND      2773   100%   628  100%  154.37
        fp3      25349   880%  1884  300%  25.26
        fp4      19026   661%  2520  400%  21.88
        fp5      38001  1320%  3140  500%  21.18

It's interesting that fp4 is the smallest of the floating point variants in logic, even though not in memory.
It seems likely that this is because the ECP5 architecture is based on "LUT4" elements (4-input lookup tables), which means the individual NaN-gates may happen to synthesize more efficiently with 4-bit values.

## Flattening

For this benchmark, we synthesize designs without flattening after NaN-gate synthesis, because the optimizer is otherwise too effective and eliminates most of the floating point logic.
When the designs are flattened, the optimizer can consider the logic inside the individual NaN gates, re-combine it, and erase constant-value flip-flops.
Designs that are flattened before optimizing have no flip-flop overhead, and have on the order of 5% overhead in logic elements vs. the reference NAND-gate versions.

While synthesizing with post-NaN flattening substantially undermines the floating point logic and mostly demonstrates the impressive quality of Yosys's optimizations, it suggests as an option a sort of "homeopathic floating-point logic."
Users who require efficiency but still want *some* of the elegance benefits can flatten the design and optimize it away, keeping some peace of mind in knowing that their final circuit is derived from an elegant real-number system, regardless of how it currently behaves.

# Future Work

Floating point synthesis still has many avenues for improvement and future work.

The current synthesis approach used by `synth_nan` remains fragile in the face of flattening and pass ordering.
It should be possible to make it harder to accidentally flatten the designs away into nothing, but they do still need to be flattened eventually, since the `nextpnr` place-and-route flow is not fully reliable in the presence of un-flattened designs.
Currently the `synth_nan` pass must be run before any device-specific passes, which can be fine, but it prevents the utilization of resources such as distributed RAMs.

Float synthesis tools should make it easier to define module ports that should be expanded to accommodate floating-point based signals, so that designs can operate fully in the glorious domain of the real numbers without having to flatten all designs.

More work could be done on ensuring that the individual gates are properly optimized for different architectures, since it seems unreasonable for fp4 to remain more efficient than fp3.
The system could also benefit from implementing a larger set of primitive gates, to avoid the blowup of using NAND gates to emulate everything, since they should be implementable in similar amounts of elementary logic.

With **binary5** and larger, there looks to be potential in exploring designs that work *purely* on NaN values, exploiting the flexibility in the handling of signaling and quiet NaN values.

The NaN-gate synthesis plugin for Yosys can be found at <https://git.witchoflight.com/nan-gate>.
This paper and the example materials in it can be found at <https://git.witchoflight.com/sigbovik-nan-2020>.

\begingroup
# References {-}
\setlength{\parindent}{-0.2in}
\setlength{\leftskip}{0.2in}
\setlength{\parskip}{8pt}

\vspace{-20pt}
\indent{}
[1] Claire Wolf.
The Yosys Open SYnthesis Suite.
Online.
\noindent{}
http://www.clifford.at/yosys/

[2] Claire Wolf.
The Yosys Manual.
Online.
\noindent{}
http://www.clifford.at/yosys/files/yosys_manual.pdf

[3] Claire Wolf.
PicoRV32 - A Size-Optimized RISC-V CPU.
Online.
\noindent{}
https://github.com/cliffordwolf/picorv32

[4] Claire Wolf, David Shah, Dan Gisselquist, Serge Bazanski, Miodrag Milanovic, and Eddie Hung.
2018.
NextPNR Portable FPGA Place and Route Tool.
Online.
\noindent{}
https://github.com/YosysHQ/nextpnr

[5] IEEE Standard for Floating-Point Arithmetic.
2019.
In *IEEE Std 754-2019 (Revision of IEEE 754-2008).*
Pages 1--84.
DOI 10.1109/IEEESTD.2019.8766229.

[6] justarandomgeek.
2017.
justarandomgeek's Combinator Computer Mk5.
Online.
\noindent{}
https://github.com/justarandomgeek/factorio-computer

[7] legomasta99.
2018.
Minecraft Computer Engineering - Redstone Computers.
Online.
<!--\noindent{}-->
https://www.youtube.com/watch?v=aj6IUuwLyOE
&list=PLwdt_GQ3o0Xe6pUS1vdzy0ZqXrApB2MNu

[8] Stephen Canon.
2017.
A Tweet About IEEE-754.
Tweet by \@stephentyrone on April 1, 2017.
\noindent{}
https://twitter.com/stephentyrone/status/848172687268687873

[9] Tom Murphy VII.
2019.
NaN Gates and Flip FLOPS.
In *Proceedings of SIGBOVIK 2019*, Pittsburgh, PA, USA, April 1, 2019.
(SIGBOVIK '19, ACH).
Pages 98--102.

[10] Tom Murphy VII.
2019.
NaN Gates and Flip FLOPS.
Online.
http://tom7.org/nand/
\endgroup