Witch of Git - web/blog/blob - posts/2018/llvm-hearts-peano-addition.md

   1 ---
   2 title: "LLVM 💖s Peano Addition"
   3 date: 2018-09-07
   4 tags:
   5 - llvm
   6 - compilers
   7 - optimization
   8 summary: "I am surprised to discover that LLVM can optimize the standard peano definition of addition, so I set out to investigate."
   9 ---
  10
  11 This semester I'm taking an advanced compilers class.
  12 We're going to be learning by making changes to LLVM, so for the first assignment I was reading recommended [introduction to LLVM][].
  13 In order to give an example of some LLVM IR, it provides two small C functions implementing addition in different ways, and equivalent IR.
  14
  15 [introduction to LLVM]: http://www.aosabook.org/en/llvm.html
  16
  17 ```c
  18 unsigned add1(unsigned a, unsigned b) {
  19   return a+b;
  20 }
  21
  22 // Perhaps not the most efficient way to add two numbers.
  23 unsigned add2(unsigned a, unsigned b) {
  24   if (a == 0) return b;
  25   return add2(a-1, b+1);
  26 }
  27 ```
  28
  29 Being something of a mathematician myself, I felt I had to [defend the honor][] of "Peano-likers" from this defamation.
  30 I made that joke tweet and moved on, but after [someone suggested LLVM optimize it][suggestion], I started to think about writing some of those optimization passes as hopefully easy pattern-matching definitions.
  31
  32 [defend the honor]: https://twitter.com/porglezomp/status/1037408309140221952
  33 [suggestion]: https://twitter.com/hyperfekt/status/1037654687024119808
  34
  35 The next day, after compiling LLVM and getting a custom Hello World optimizer pass running, I decided to create some tests, and discovered (much to my surprise) that LLVM already handled Peano-style addition and multiplication perfectly competently!
  36
  37 I had just read John Regehr's [blog post on how LLVM optimizes a function][regehr], so I had an idea for how to investigate this.
  38 If you haven't read that yet, you should go read that first in order to see in some more detail LLVM's optimization passes like the ones I'm going to describe below.
  39
  40 [regehr]: https://blog.regehr.org/archives/1603
  41
  42 ## How to View the Optimizations
  43
  44 That blog post proceeds by running the LLVM `opt` tool and examining the changes between passes.
  45 You can easily get the LLVM IR corresponding to some C code using `clang`, just run:
  46
  47 ```shell-session
  48 $ clang peano.c -emit-llvm -S -o peano.ll
  49 ```
  50
  51 and you'll have a beautiful LLVM IR dump in the textual format.
  52 In order to view the optimizations on that code, you can run:
  53
  54 ```shell-session
  55 $ opt -O3 -print-before-all -print-after-all peano.ll
  56 ```
  57
  58 This gives you a huge wall of IR dumps after each optimization pass.
  59 If you want to do a similar investigation yourself, I wrote [a Python script that shows each pass's diff][diff script] and waits for you to continue it.
  60 Make sure you have [icdiff][] (a very nice color diff tool) installed in order to use it, or else modify the diff invocation in the script.
  61
  62 [diff script]: https://gist.github.com/porglezomp/f2dc233f971cf3f30d45e0b501ae5ead
  63 [icdiff]: https://github.com/jeffkaufman/icdiff
  64
  65 ## The Optimizations
  66
  67 As you can see from [John Regehr's blog post][regehr], LLVM's passes sometimes undo and redo lots of work without changing very much when working on a function this simple.
  68 Furthermore, the code emitted by the Clang frontend is a little bit of a mess that needs quite a bit of cleanup before it's decent code, in order to avoid needing to reimplement analyses that LLVM can do perfectly well itself.
  69
  70 In order to make this discussion clearer, I'll use the hand-written IR from the introductory article rather than the IR emitted by clang, and only run through the necessary passes to get the job done, not the whole `-O3` pipeline.
  71 At each step of the optimization, I'll provide the IR, and some roughly corresponding C code.
  72
  73 ### The Program
  74
  75 We'll be investigating this recursive definition of addition:
  76
  77 ```llvm
  78 define i32 @add(i32 %a, i32 %b) {
  79 entry:
  80   %tmp1 = icmp eq i32 %a, 0
  81   br i1 %tmp1, label %done, label %recurse
  82
  83 recurse:
  84   %tmp2 = sub i32 %a, 1
  85   %tmp3 = add i32 %b, 1
  86   %tmp4 = call i32 @add(i32 %tmp2, i32 %tmp3)
  87   ret i32 %tmp4
  88
  89 done:
  90   ret i32 %b
  91 }
  92 ```
  93
  94 Which corresponds to this C program:
  95
  96 ```c
  97 typedef unsigned nat;
  98
  99 nat add(nat a, nat b) {
 100   if (a == 0) return b;
 101   return add(a-1, b+1);
 102 }
 103 ```
 104
 105
 106 ### Tail Call Optimization
 107
 108 The first important optimization here is tail call optimization.
 109 Above we see that we call `@add` into `%tmp4` and then immediately return it without doing anything else in between, which makes this a tail call.
 110 Therefore, in order to avoid the cost of calling functions, the extra stack frames needed, and to expose more opportunities for optimizations, tail call optimization turns our tail recursion into a loop.
 111
 112 ```llvm
 113 define i32 @add(i32 %a, i32 %b) {
 114 entry:
 115   br label %tailrecurse
 116
 117 tailrecurse:
 118   %a.tr = phi i32 [ %a, %entry ], [ %tmp2, %recurse ]
 119   %b.tr = phi i32 [ %b, %entry ], [ %tmp3, %recurse ]
 120   %tmp1 = icmp eq i32 %a.tr, 0
 121   br i1 %tmp1, label %done, label %recurse
 122
 123 recurse:
 124   %tmp2 = sub i32 %a.tr, 1
 125   %tmp3 = add i32 %b.tr, 1
 126   br label %tailrecurse
 127
 128 done:
 129   ret i32 %b.tr
 130 }
 131 ```
 132
 133 This code approximately corresponds to:
 134
 135 ```c
 136 nat add(nat a, nat b) {
 137   while (a != 0) {
 138     a -= 1;
 139     b += 1;
 140   }
 141   return b;
 142 }
 143 ```
 144
 145 By removing the recursive call, further optimizations become visible.
 146 In particular...
 147
 148 ### Induction Variable Simplification
 149
 150 Loop optimizations are a primary focus of compiler optimizations, because many programs spend most of their time in a few loops, making those loops faster is the most fruitful optimization.
 151 "Induction Variable Simplification" is a specific optimization that works on identified "loop induction variables", variables that change by a constant amount each loop iteration, or that are derived from other induction variables.
 152
 153 Here, `a` and `b` are identified as loop induction variables.
 154 Event more critically, `a` is the induction variable that controls the loop condition, so `a` is counting down towards `0`.
 155 Therefore, LLVM can determine that the loop will run exactly `a` times, called the "trip count."
 156
 157 In cases where one of the induction variables is used after the loop and the trip count is statically known, LLVM performs an optimization where it computes the final value of the induction variable outside the loop, which splits the live range of the induction variable, and potentially makes it eligible for dead code elimination (which happens in this case).
 158
 159 ```llvm
 160 define i32 @add(i32 %a, i32 %b) {
 161 entry:
 162   br label %tailrecurse
 163
 164 ; Loop:
 165 tailrecurse:
 166   %a.tr = phi i32 [ %a, %entry ], [ %tmp2, %recurse ]
 167   %tmp1 = icmp eq i32 %a.tr, 0
 168   br i1 %tmp1, label %done, label %recurse
 169
 170 recurse:
 171   %tmp2 = sub i32 %a.tr, 1
 172   br label %tailrecurse
 173
 174 ; Exit blocks
 175 done:
 176   %0 = add i32 %b, %a
 177   ret i32 %0
 178 }
 179 ```
 180
 181 This IR looks basically like this C:
 182
 183 ```c
 184 nat add(nat a, nat b) {
 185   nat a0 = a;
 186   while (a0 != 0) {
 187     a0 -= 1;
 188   }
 189   return b + a;
 190 }
 191 ```
 192
 193 If you're interested in more details of these loop optimizations, my knowledge here comes from [some very nice lecture notes][] linked from Regehr's blog post, go read that if you want to know more about how you actually detect these cases.
 194
 195 [some very nice lecture notes]: https://www.cs.cmu.edu/~fp/courses/15411-f13/lectures/17-loopopt.pdf
 196
 197 ### Delete Dead Loops
 198
 199 This pass is very straightforward.
 200 The loop doesn't do anything anymore, and we know it will terminate, so we can just get rid of it.
 201
 202 ```llvm
 203 define i32 @add(i32 %a, i32 %b) {
 204 entry:
 205   %0 = add i32 %b, %a
 206   ret i32 %0
 207 }
 208 ```
 209
 210 And therefore, our code has been optimized down to:
 211 ```c
 212 nat add(nat a, nat b) {
 213   return b + a;
 214 }
 215 ```
 216
 217 Our recursive definition of addition turns out to actually be addition, and LLVM has proved it for us!
 218
 219 ## Takeaways
 220
 221 Very general optimizations can combine together to have some very surprising specific results, and optimizing compilers are very clever.
 222
 223 These same optimizations work to optimize Peano multiplication, since the loop induction variables like to work with linear functions, but they don't succeed with saturating subtraction, recursive comparisons, or min/max.
 224 It'll be interesting to see if I can come up with a loop optimization pass that can deal with those more complicated trip counts / induction variables in general at all, or if I'll only succeed at pattern matching these very specific functions.