Floating point numbers

Floats: their inaccuracy and their string representation
ericlin@ms1.hinet.net

Computers use bits to store numbers. Each bit holds 0 or 1, so every number is stored as a sum of powers of 2. The binary representation of an integer is easy to understand: the integer 7 is (111) in binary, that is 1*4+1*2+1*1. For a floating point number the representation is a bit more complex, and I won't go into the details. For simplicity, think of the computer using something like binary 0.1 to represent 0.5, that is 1*(1/2). Similarly, we would guess that 0.75 is stored as binary 0.11, because the value is 1*(1/2)+1*(1/4).
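The binary forms above can be checked with a few lines of code. This is only a sketch in Python 3 rather than ActionScript, and the helper name binary_fraction is made up for this illustration. It peels off one binary digit at a time.

```python
def binary_fraction(x, max_bits=16):
    """Return the binary digits of a fraction 0 <= x < 1 as a string."""
    digits = []
    while x > 0 and len(digits) < max_bits:
        x *= 2                  # move the next binary digit in front of the point
        bit = int(x)            # that digit is the integer part, 0 or 1
        digits.append(str(bit))
        x -= bit                # keep only the remaining fraction
    return "0." + ("".join(digits) or "0")

print(bin(7))                   # 0b111
print(binary_fraction(0.5))     # 0.1
print(binary_fraction(0.75))    # 0.11
print(binary_fraction(0.1))     # 0.0001100110011001 (cut off after 16 bits)
```

The last line hints at the problem: 0.1 has no short binary form, and the digits go on well past the 16 bits shown.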

Now think about this: we may have an integer so large that it needs more than 64 bits to store. That fails in Flash, because Flash uses 64 bits to store a number. Similarly, we may have a float whose fractional part needs more bits than are available, so it cannot be represented exactly. There are ways to handle very large or very small values, but they do not help with a large number that also carries a tiny decimal part.

OK, all of this is just to state one fact: not every float can be stored exactly in bits of 0 and 1, just as 1/3 or Math.sqrt(2) cannot be written out exactly in the decimal digits 0-9. So there is no absolute accuracy for floats in the computer world. A subtle inaccuracy is always possible.
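To see which values fit exactly and which do not, here is a minimal sketch in Python 3. I assume Python's float is the same 64-bit binary format that Flash uses for Number. Fraction(x) from the standard library returns the exact value that is really stored.

```python
from fractions import Fraction

# Fraction(x) gives the exact value of the stored number, with no rounding.
print(Fraction(0.5) == Fraction(1, 2))     # True: 0.5 fits exactly in binary
print(Fraction(0.75) == Fraction(3, 4))    # True: so does 0.75
print(Fraction(0.1) == Fraction(1, 10))    # False: 0.1 is only approximated

# The error of the stored 0.1 as an exact fraction: tiny, but not zero.
error = Fraction(0.1) - Fraction(1, 10)
print(float(error))
```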

The next thing we must know about is the "String" representation of a number. It converts the binary number into a string of base-10 digits. Some rounding is done along the way, so the String representation is not really the same as the original binary number. So when we "trace(f)", we should not always believe that the output is the real binary value. The value is rounded before it comes out as a base-10 number.

Here are some examples to clarify some concepts:

example 1: Limitation of String representation of a float number.

a=(1/3);
d=String(a);
e=Number(d);
trace(a); // output 0.333333333333333
trace(e); // output 0.333333333333333
trace(a*3); // output 1
trace(e*3); // output 0.999999999999999

From this we know that the string representation of 1/3 is (0.333333333333333). Obviously it is not really the same as (1/3). We can accept that, because the computer cannot show an infinite run of 3s. So the string representation is not really the binary number; some rounding happens along the way. Two floating point numbers with the same string representation may have different values.

Please note that the stored binary value may not be exactly (1/3) either. Just as the string cannot show an infinite run of 3s, the binary form cannot store an infinite run of bits. However, whatever the inaccuracy is, the result of (a*3) is correctly output as "1". That is, the stored value is still more accurate than we might have thought.
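The same experiment can be repeated outside Flash. The Python 3 sketch below uses a made-up helper, flash_string, which assumes that Flash's String() keeps about 15 significant digits. That assumption matches the outputs shown in this article, but it is only my reading of them.

```python
def flash_string(x):
    # Assumption: Flash's String(n) keeps about 15 significant digits.
    return "%.15g" % x

a = 1 / 3
d = flash_string(a)            # "0.333333333333333"
e = float(d)                   # back from the string to a number

print(d == flash_string(e))    # True: both print the same
print(a == e)                  # False: the stored values differ
print(a * 3 == 1.0)            # True
print(e * 3 == 1.0)            # False
```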

example 2: The tiny difference between String representation and the true value

a = (4/3)-0.333333333333333;
e = Number(String(a));
trace(a); //output 1
trace(e); //output 1
trace(String(a) == String(e)); //output true
trace(a == e); // output false;
trace(a-e); //2.22044604925031e-16

Here again, the String representation of the float a is '1'. But in fact the value is not 1. It is larger than 1 by a tiny amount.

Please note that if we trace both of them, the outputs are the same '1', because the String representations are the same. But if we check whether they are the same value, the output is 'false'.
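The tiny value 2.22044604925031e-16 is not random. As a sketch in Python 3 (again assuming the same 64-bit floats as Flash, and "%.15g" as a stand-in for Flash's String()), it turns out to be exactly the gap between 1 and the next number that can be stored above it.

```python
import sys

a = (4 / 3) - 0.333333333333333
e = float("%.15g" % a)      # assumption: mimics Number(String(a)) in Flash

print("%.15g" % a)          # 1
print(a == e)               # False
print(a - e == sys.float_info.epsilon)   # True: one step above 1.0
print(sys.float_info.epsilon)            # 2.220446049250313e-16
```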

This implies 2 things:

1. Debugging with 'trace(f)' is not always reliable for floating point numbers.

2. 'Equal' comparison between two floating point numbers should be avoided.

When we 'trace(f)' and the output is 102.3, we know the value is definitely not 102.1. But if the output is 102.1, we should remind ourselves that this string may be the result of rounding, so the value is not necessarily exactly 102.1.

'Equal' should be used to compare two integers, not floating point numbers. A floating point value is like a position along a line. It is nearly impossible to hit the same "spot" twice.

If we are willing to accept a rounded equality, we can compare the string representations instead, like: if (String(f1)==String(f2)). Then the condition still works when there is a minute difference.
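Another common way to accept a "rounded equal" is to allow a small tolerance instead of comparing strings. The Python 3 sketch below is only an illustration: the name nearly_equal and the default tolerance of 1e-9 are my own assumptions, and the right tolerance depends on the application.

```python
import math

def nearly_equal(f1, f2, tolerance=1e-9):
    # Hypothetical helper: equal if the gap is small relative to the values.
    scale = max(1.0, abs(f1), abs(f2))
    return abs(f1 - f2) <= tolerance * scale

b = math.sqrt(10) * math.sqrt(10)
print(b == 10)                           # False
print(nearly_equal(b, 10))               # True
print(nearly_equal(99.9 - 0.1, 99.8))    # True
print(nearly_equal(102.1, 102.3))        # False: a real difference still shows
```

The same idea carries over to ActionScript with Math.abs and Math.max.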

example 3: Complex calculation

a=10;
b=Math.sqrt(a)*Math.sqrt(a);
trace(b);// output 10
trace(b==10); // output false;

Theoretically, if I take the square root and then square it back, I should get the original number. By trace, it seems correct. But we know that it is nearly impossible to store the square root of 10 precisely in binary. So the result of squaring differs a little from 10.

This happens in many complex procedures. For example, when we set a clip's _width=100.1 and then compare it with the number 100.1 by (trace(_width==100.1)), the output is usually "false".

This is reasonable. We should remember that float calculation is not "absolutely accurate", and that the String representation of a number is not necessarily its true value. Lastly, do not use "equal" to compare two floating point numbers.

example 4: The order of calculation

a=1000*Math.PI/180;
b=(Math.PI/180)*1000;
trace(a);// output 17.4532925199433
trace(b);// output 17.4532925199433
trace(a==b); //false
trace(a-b); //output -3.5527136788005e-15

Remember, float calculation by computer comes with inaccuracy. The inaccuracy is negligible if we don't use "Equal". See the example: our teacher taught us that (a*b)/c is the same as (a/c)*b. That is true in mathematics, but in the computer world some difference exists between them.

In the first formula, (1000*Math.PI) gets rounded, so only a very tiny amount is trimmed before it is divided by 180. In the second formula, (Math.PI/180) gets trimmed first, and the effect of that trimming is magnified by 1000.
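The order matters for plain addition too. Here is a different, well-known case, sketched in Python 3 and assuming the same 64-bit floats as Flash: adding the same three numbers in a different grouping gives two different results.

```python
x = (0.1 + 0.2) + 0.3
y = 0.1 + (0.2 + 0.3)

print(x)          # 0.6000000000000001
print(y)          # 0.6
print(x == y)     # False
```

Python prints enough digits to tell the two values apart, which is why the difference is visible here without subtracting them.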

example 5: Inherent inaccuracy of float calculation

a=0.1*0.3;
b=0.1*0.6;
c=0.1*0.9;
trace(c==(a+b));// output false

This is the same as the example above. It tells us that the inaccurate calculation is not entirely the fault of Math.PI. It happens now and then in ordinary floating point calculations.

example 6: An inherent "bug" in float calculation

a=(99.9-0.1);
trace(a==99.8);// output false

I think this is really unacceptable?!

Well, we must accept it. Floating point calculation comes with a tiny inaccuracy. Anyway, the String representation of the result is correctly 99.8, as long as we don't check it with "Equal". The tiny difference can be seen with trace(a-99.8);

example 7: Accumulation of the inaccuracy

a = 10;
for (var k = 0; k<20; k++) {
    a -= 0.1;
    trace(a);
}

This is a famous loop. The inaccuracy accumulates in 'a', and in the end it is so large that we need not compute the difference between the expected and the true value: even the String representation shows it.

The output is:

9.9
9.8
9.7
9.6
9.5
9.4
9.3
9.2
9.1
9
8.9
8.8
8.7
8.6
8.50000000000001
8.40000000000001
8.30000000000001
8.20000000000001
8.10000000000001
8.00000000000001

So by the 15th step the accumulated inaccuracy has become visible.
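One way to keep the error from piling up is to count with an integer and compute each value fresh from it, so each result is rounded only once. The Python 3 sketch below compares the two approaches. The function names are made up, and flash_string assumes that Flash's String() keeps about 15 significant digits.

```python
def flash_string(x):
    # Assumption: Flash's String(n) keeps about 15 significant digits.
    return "%.15g" % x

def count_down_by_subtracting(steps):
    a = 10.0
    for _ in range(steps):
        a -= 0.1            # every pass adds its own rounding error
    return a

def count_down_from_counter(steps):
    # One division of exact integers, so only one rounding per result.
    return (100 - steps) / 10

print(flash_string(count_down_by_subtracting(14)))   # 8.6
print(flash_string(count_down_by_subtracting(15)))   # 8.50000000000001
print(flash_string(count_down_by_subtracting(20)))   # 8.00000000000001
print(flash_string(count_down_from_counter(20)))     # 8
```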

example 8: Calculation between a large number and a tiny number

a = 9999999999;
a += 0.000005;
trace(a);//9999999999.00001

Well, we expect the result to be 9999999999.000005, right? How come the decimal part gets rounded?

This happens frequently when we calculate with a large number and a small number. Think about it: is an inaccuracy of 0.000005 important to a number like 9999999999? I would say it is negligible.

The more technical reason is that a number is stored as binary data in a limited number of bits. Suppose 9999999999 needs m bits to store and 0.000005 needs n bits; then 9999999999.000005 will need about (m+n) bits to store its binary value. That may be more than the limited bits available for a floating point number, so the value gets rounded.
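We can look at what is really stored. The Python 3 sketch below assumes the same 64-bit floats as Flash, where neighbouring values near ten billion are 2**-19 apart. The sum has to land on one of those steps, and the 15-digit string then rounds it once more.

```python
big = 9999999999.0
a = big + 0.000005

# Near ten billion, neighbouring floats are 2**-19 (about 0.0000019) apart.
step = 2.0 ** -19

print(a - big)                # 5.7220458984375e-06, not 0.000005
print(a - big == 3 * step)    # True: the sum landed on the nearest step
print("%.15g" % a)            # 9999999999.00001 (assumed Flash-style rounding)
```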

example 9: The inaccuracy does not cancel but accumulates

a = 1.1;
b = 100000000000;
a = a+b-b;
trace(a);// 1.10000610351563

We just add a number and then subtract it again. The inaccuracy is significant for 'a', although it is negligible for 'b'. If we make b even larger, the inaccuracy created by rounding will be even larger.

So we might have expected that adding a number and then subtracting it would not affect the original number. We now know that this is wrong for floating point calculation.

Conclusions:

Flash has only one data type for numbers, and that is "Number". We cannot specifically force a variable to be a float or an integer. Anyway, a number with a decimal point must be a float.

Floating point calculation gives us acceptable results. However, they are only "acceptable". On many occasions there is inaccuracy. The inaccuracy may be tiny, but if the calculation involves large numbers, it may be large enough to be significant.

Because of this tiny inaccuracy, we should be careful when we compare two floating point numbers with "Equal". To put it simply, the "equal" comparison should be avoided as much as possible.

Oh, by the way, this floating point problem is not Flash specific. Do not blame Macromedia.