
Floating Point Numbers

This is a very rough sheet containing calculations and explorations for Numerical Analysis in Sage.

 

For the following command (setting the RR print options) to work, you need to apply the patches at http://trac.sagemath.org/sage_trac/ticket/7682.  Alternatively, every time you print a real number m, you can print it as m.str(truncate=False).  This is important in this worksheet since we want to see the roundoff error, and not have Sage hide it by showing only correct digits.

RR.print_options['truncate']=False

First, let's convert 13.28125 to IEEE-754 format.

x=13.28125
x.str(base=2)
'1101.0100100000000000000000000000000000000000000000000'
xbin=2^3 +2^2 + 2^0+2^-2+2^-5; xbin
425/32
xbin.n()
13.281250000000000
x.exact_rational()
425/32
(x*2^(-3)).str(base=2)
'1.1010100100000000000000000000000000000000000000000000'
f=(x*2^(-3)).frac() # .frac() gets the fraction part, the part after the decimal point
mantissa=f.str(base=2)
mantissa
'0.10101001000000000000000000000000000000000000000000000'
c=(3+1023).str(base=2)
c
'10000000010'
len(c)
11
len(mantissa[2:54])
52
sign='0'

So here is the IEEE-754 binary representation.

sign+' '+c+' '+mantissa[2:54] # the [2:54] chops off the '0.' and keeps the 52 stored bits, since we just need the digits after the binary point
'0 10000000010 1010100100000000000000000000000000000000000000000000'
len(sign+c+mantissa[2:54]) # it's 64 bits!
64
((-1)^(int(sign)) * 2^(int(c,base=2)-1023)*(1+RR(mantissa[:54], base=2)))
13.281250000000000
x
13.281250000000000

So they agree!
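Outside Sage, plain Python can recover the same 64 bits with the standard struct module. A quick cross-check (pure Python, not part of the original worksheet):

```python
import struct

x = 13.28125
# Reinterpret the 8 bytes of the double as a 64-bit integer, then format as bits.
bits = format(struct.unpack('>Q', struct.pack('>d', x))[0], '064b')
sign_bit, exp_bits, mant_bits = bits[0], bits[1:12], bits[12:]
print(sign_bit, exp_bits, mant_bits)
```

This prints the same sign, exponent, and mantissa fields we assembled by hand above.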

x.nextabove()
13.281250000000002
x.nextbelow()
13.281249999999998
x.nextabove()-x.nextbelow()
3.5527136788005009e-15
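Python 3.9+ exposes the same neighboring doubles through math.nextafter; a sketch:

```python
import math

x = 13.28125
up = math.nextafter(x, math.inf)     # nearest double above x
down = math.nextafter(x, -math.inf)  # nearest double below x
print(up - down)                     # two ulps at x, i.e. 2**-48
```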

MPFR (the library behind Sage's RR) has a slightly different convention---it stores the mantissa as an integer.

x.sign_mantissa_exponent()
(1, 7476679068876800, -49)
7476679068876800*2^(-49)
425/32
7476679068876800.str(base=2)
'11010100100000000000000000000000000000000000000000000'
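Plain Python can produce the same integer-mantissa decomposition via math.frexp; a sketch:

```python
import math

x = 13.28125
m, e = math.frexp(x)     # x == m * 2**e with 0.5 <= abs(m) < 1
mant = int(m * 2**53)    # scale the 53-bit mantissa up to an integer
print(mant, e - 53)      # matches MPFR's (mantissa, exponent) pair
```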

We can't represent all numbers

(1.0).nextabove()
1.0000000000000002
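The classic example is 0.1, which has an infinite binary expansion. In plain Python, Fraction shows exactly which nearby rational actually gets stored:

```python
from fractions import Fraction

print(Fraction(0.1))     # the exact rational stored for the literal 0.1
print(0.1 + 0.2 == 0.3)  # False: each term carries its own roundoff
```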

A finite system

Let's imagine we have a 2-bit mantissa and a 2-bit exponent.  Then all of our machine numbers look like $1.b_1b_2 \times 2^{k-2}$, where $b_1$ and $b_2$ can each be $0$ or $1$, and $k$ can be $0$, $1$, $2$, or $3$.

Question: List all possible numbers in our set.

RR.print_options['skip_zeroes']=True
nums=[RR('1.%d%d'%(b1,b2),base=2)*2^(k-2) for b1,b2,k in CartesianProduct([0,1],[0,1],[0,1,2,3])]
sorted(nums)
[0.25, 0.3125, 0.375, 0.4375, 0.5, 0.625, 0.75, 0.875, 1., 1.25, 1.5, 1.75, 2., 2.5, 3., 3.5]

or as fractions

sorted([i.exact_rational() for i in nums])
[1/4, 5/16, 3/8, 7/16, 1/2, 5/8, 3/4, 7/8, 1, 5/4, 3/2, 7/4, 2, 5/2, 3, 7/2]
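The same 16 fractions can be generated in plain Python with exact rational arithmetic, which makes a handy cross-check on the Sage enumeration:

```python
from fractions import Fraction
from itertools import product

# 1.b1b2 in binary is (4 + 2*b1 + b2)/4; scale by 2**(k-2).
nums = sorted(Fraction(4 + 2*b1 + b2, 4) * Fraction(2)**(k - 2)
              for b1, b2, k in product([0, 1], [0, 1], range(4)))
print(nums)
```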

The distribution of these numbers is interesting.  Let's plot them on the real line.

points(zip(nums,[0]*len(nums)),pointsize=25).show(xmin=0,xmax=4,figsize=[10,1])

Note that the numbers are unevenly spaced.  In fact, count how many numbers are in $[2^{-2},2^{-1})$, how many are in $[2^{-1},2^0)$, and how many are in $[2^0,2^1)$.  What do you notice?  Is there a pattern, and why?
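A quick count per binade (a sketch in plain Python) confirms the pattern: every interval $[2^j, 2^{j+1})$ contains the same number of machine numbers, so the absolute spacing doubles each time the exponent goes up:

```python
from fractions import Fraction
from itertools import product

nums = [Fraction(4 + 2*b1 + b2, 4) * Fraction(2)**(k - 2)
        for b1, b2, k in product([0, 1], [0, 1], range(4))]
# Count how many machine numbers land in each binade [2^j, 2^(j+1)).
counts = [sum(Fraction(2)**j <= v < Fraction(2)**(j + 1) for v in nums)
          for j in (-2, -1, 0, 1)]
print(counts)
```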

Note that there is no machine number representing 1.1.  In order to deal with machine arithmetic, the machine needs to decide how to represent 1.1.  We notate the machine's version of 1.1 as $fl(1.1)$ ($fl$ stands for "float", as in "floating point approximation").

What are the reasonable possibilities for $fl(1.1)$?  What would you choose, and why?

The gap around zero is called the "hole at zero".  Where does this come from? Subnormal numbers fill this gap in IEEE 754-2008.
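In IEEE double precision you can see the normal/subnormal boundary directly from plain Python; a sketch (math.ulp needs Python 3.9+):

```python
import sys
import math

print(sys.float_info.min)  # smallest positive *normal* double, 2**-1022
print(math.ulp(0.0))       # smallest positive *subnormal* double, 2**-1074
print(math.ulp(0.0) / 2)   # anything below that underflows to 0.0
```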

What happens if we do $5/16-1/4=1/16$ on the machine? In other words, what is $fl(fl(5/16)-fl(1/4))$?

What is $fl(fl(1)+fl(5/4))$?

What is $1.1+1.1+1.1$, as the machine does it?  (First write down all the applicable values of $fl$, then calculate the result.)
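In double precision (rather than our toy 2-bit system) you can watch the same effect; a sketch:

```python
from fractions import Fraction

# fl(1.1) is not 11/10; the stored value is a nearby binary fraction.
print(Fraction(1.1) == Fraction(11, 10))  # False
s = 1.1 + 1.1 + 1.1
print(s == 3.3)                           # False: the roundoffs accumulate
```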

Rounding Modes

Look up in Wikipedia what the possible rounding modes are for IEEE 754-2008.  For each of these rounding modes, what is $fl(3.25)$? (To answer this for some of the rounding modes, you'll need to know what the mantissas for 3 and 3.5 are.)
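Python's decimal module implements the IEEE 754-2008 rounding modes (in decimal rather than binary), so you can experiment with a decimal analogue of this question. A sketch rounding 3.25 to two significant digits:

```python
from decimal import (Decimal, localcontext, ROUND_HALF_EVEN,
                     ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN)

results = {}
for mode in (ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN):
    with localcontext() as ctx:
        ctx.prec = 2          # two significant decimal digits
        ctx.rounding = mode
        results[mode] = +Decimal("3.25")  # unary + rounds to the context
print(results)
```

Note the tie-break: 3.25 is exactly halfway between 3.2 and 3.3, so round-half-even keeps the even last digit.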

Relative Error

What is the relative error of the approximation $fl(1.1)$ to 1.1? (See the definition from last class or at the bottom of p. 19 of the text.)

What is the maximum relative error for a number $x$ such that $fl(x)=1$?  This relative error is called the machine epsilon, and is often denoted by the Greek letter $\epsilon$ or $\epsilon_{\mathrm{mach}}$.

What is the maximum relative error for a number $y$ such that $fl(y)=2$?  What about the maximum relative error for a number $z$ such that $fl(z)=1/2$? Do you notice a pattern? What about the maximum relative error for a number $w$ such that $fl(w)=3$?

What is the machine epsilon for a 53-bit number? (Work it out by hand, from what you know about the exponent range, and verify the next computation.)

((RR(1).nextabove()-RR(1))/2*RR(1))
1.1102230246251565e-16

So intuitively, what does it mean if you do several computations using 53-bit precision floating point numbers (i.e., double precision) and the results differ by about $10^{-16}$, or if you get an answer that is about $10^{-16}$?

2.0^(-53) # Here's a hint for the by-hand part above...
1.1102230246251565e-16
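Plain Python agrees; note that sys.float_info.epsilon is defined as the gap from 1.0 to the next double ($2^{-52}$), so the maximum relative rounding error (the worksheet's $\epsilon_{\mathrm{mach}}$) is half of it:

```python
import sys

eps = sys.float_info.epsilon  # 2**-52, the gap between 1.0 and the next double
print(eps, eps / 2)           # eps/2 == 2**-53, the maximum relative error
```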

Some more topics

Thinking about $fl(x)=x+\delta$: then $fl(x)+fl(y)$ is really $(x+\delta_1)+(y+\delta_2)$.  So how should we perturb the inputs so that we get the exact answer (backward error analysis), or how much does our computed answer differ from the exact answer (direct error analysis)?

significant digits

subtraction

dividing by small number

quadratic formula

Horner's form
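Two of these topics, subtraction of nearly equal numbers and the quadratic formula, combine into the classic example of catastrophic cancellation. A sketch (the coefficients here are just an illustrative choice):

```python
import math

# Roots of x^2 - 1e8*x + 1 = 0: one near 1e8, one near 1e-8.
a, b, c = 1.0, -1e8, 1.0
disc = math.sqrt(b*b - 4*a*c)
x1 = (-b + disc) / (2*a)        # large root: no cancellation
x2_naive = (-b - disc) / (2*a)  # small root: -b and disc nearly cancel
x2_stable = c / (a * x1)        # uses x1*x2 == c/a instead; stays accurate
print(x1, x2_naive, x2_stable)
```

The naive small root loses most of its significant digits, while the rewritten form keeps full accuracy.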

You can read about two disasters in history coming from ignoring numerical issues here: http://www.ima.umn.edu/~arnold/455.f96/disasters.html

f(x)=x^3-6.1*x^2+3.2*x+1.5
R=RealField(10,truncate=False)
R(4.71)
4.7109
f(R(4.71))
-14.422
g=f(x).polynomial(RR)
g(x)=f(x)._maxima_().horner().sage()
g(x)
((x - 6.1)*x + 3.2)*x + 1.5
g(R(4.71))
-14.312
g(RR(4.71))
-14.263899
f(RR(4.71))
-14.263899000000011