#Question about IEEE-754 floating point standard


karmic trench
#

I'm mainly talking about 32-bit floats. I will try to explain how I understand them, and I would appreciate it
if someone told me **if I understand it correctly** (bold to emphasise the actual question).
The first bit (fb) is the sign: 0 = positive and 1 = negative, i.e. (-1)^fb.

The next 8 bits are the exponent, stored as an 8-bit unsigned number. The actual exponent of the value is that 8-bit number minus the bias, 2^(exp length - 1) - 1 = 2^7 - 1 = 127. Calling that actual exponent exp from here on, the value so far is (-1)^fb * 2^exp.

The last 23 bits (mn) are the mantissa, which is like the base of a number in scientific notation (which is basically what a float is). The first digit of the base, like the 1 in 1.23, is implied because in binary it is always 1: the only other value it could take is 0, and in that case the exponent would be 1 bigger and the mantissa would be (for example) 0.1234... instead of 1.234, resulting in a loss of accuracy or 1 more bit used. So the base is 1.(mn digits). For example, the mantissa 11000... (everything else 0) contributes 2^-1 + 2^-2 = 0.5 + 0.25 = 0.75, so the base is 1.75 in decimal.
So converting the number from binary to decimal would be (-1)^fb * 2^exp * base (the mantissa converted to decimal in the way I explained).
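
The mantissa rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `mantissa_to_base` and the choice to pass the 23 bits as a string are assumptions made for the example, and it covers only the normal case with the implied leading 1.

```python
def mantissa_to_base(mn_bits: str) -> float:
    """Turn a 23-character string of mantissa bits into the base 1.(mn digits)."""
    assert len(mn_bits) == 23 and set(mn_bits) <= {"0", "1"}
    fraction = 0.0
    for i, bit in enumerate(mn_bits, start=1):
        if bit == "1":
            fraction += 2.0 ** -i  # the i-th mantissa bit is worth 2^-i
    return 1.0 + fraction  # add the implied leading 1


# Mantissa 11000... (everything else 0): 1 + 0.5 + 0.25
print(mantissa_to_base("11" + "0" * 21))  # 1.75
```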

slim solarBOT
#

<@&987246746478460948> please have a look, thanks.

#

Here is an AI assisted attempt to answer your question 🤖. Maybe it helps! In any case, a human is on the way 👍. To continue talking to the AI, you can use </chatgpt:1108714622413963314>.

#
TJ-Bot

The IEEE-754 floating point standard is a widely used binary format for representing real numbers. A 32-bit float is divided into three parts: the sign bit, the exponent, and the mantissa.

The sign bit determines whether the number is positive or negative. The exponent represents the magnitude of the number, while the mantissa holds the precision or significant digits of the number.

A 32-bit float has 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. The exponent is stored with a bias of 127 so that both positive and negative exponents can be represented.

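The formula to calculate the value of a normal float from its components, where mantissa is the 23-bit field read as a fraction in [0, 1) and bias is 127 for 32-bit floats, is as follows: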

(-1)^sign * (1 + mantissa) * 2^(exponent - bias)
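
As a rough check of this formula, here is a minimal Python sketch that pulls the three fields out of a 32-bit float with the standard `struct` module and recombines them. The function name `decode_float32` is made up for the example. It assumes a normal number, so stored exponents of 0 (zero and subnormals) and 255 (infinity and NaN) are rejected rather than handled.

```python
import struct


def decode_float32(x: float) -> float:
    """Rebuild a normal 32-bit float from its sign, exponent and mantissa fields."""
    # Pack as a big-endian float32, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 bit
    exponent = (bits >> 23) & 0xFF         # 8 bits, still biased
    mantissa = (bits & 0x7FFFFF) / 2**23   # 23 bits read as a fraction in [0, 1)
    bias = 127
    if exponent in (0, 255):
        raise ValueError("zero, subnormal, infinity or NaN: the formula does not apply")
    return (-1) ** sign * (1 + mantissa) * 2.0 ** (exponent - bias)


print(decode_float32(-6.25))  # -6.25
```

Here -6.25 is stored as sign 1, biased exponent 129 and mantissa 0.5625, so the formula gives -1 * 1.5625 * 2^2.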

It's important to note that not all real numbers can be represented exactly in this format due to limitations on precision and range. This can lead to rounding errors and loss of precision in calculations involving floats.
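
A small sketch of that rounding, assuming Python's `struct` module is used to squeeze a value through 32 bits (Python's own `float` is 64-bit). The helper name `to_float32` is hypothetical.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and convert it back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


print(to_float32(0.5) == 0.5)  # True: 0.5 is 2^-1, so it fits exactly
print(to_float32(0.1) == 0.1)  # False: 0.1 has no finite binary expansion
```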

Overall, your understanding seems correct based on your explanation. If you have any specific questions or want further clarification on any aspect of IEEE-754 floats, feel free to ask!

dull cloak
#

yeah, pretty much spot on 🙂

#

I don't know if the mantissa calculation is exact, but everything else sounds right

karmic trench
#

thanks, I have a few more questions

dull cloak
#

shoot

karmic trench
#

why does this converter do this? when converting, do I have to shift the mantissa exp digits before converting the mantissa to decimal? another converter does something else, not sure what is correct

#

wait I might be dumb let me try something

dull cloak
#

way better tool

#

well, it gives less info

karmic trench
#

I understand it now thanks

#

(I kept messing up the mantissa calculation when doing it semi manually)

#

let me think if I have any more questions and I will close the thread

dull cloak
#

no worries 🙂

#

also, look at bfloat16, really funny

#

"we don't need the rest of this mantissa" yeet

#

They just chopped it in half and threw away the other part 😄
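
To make the "chopped in half" idea concrete, here is a hedged Python sketch: bfloat16 keeps the sign bit, all 8 exponent bits and only the top 7 mantissa bits of a 32-bit float, which are exactly its upper 16 bits. The function name `truncate_to_bfloat16` is made up, and plain truncation is an assumption for illustration. Real conversions usually round to nearest instead.

```python
import struct


def truncate_to_bfloat16(x: float) -> float:
    """Keep only the upper 16 bits of the float32 encoding of x (truncation, no rounding)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    bits &= 0xFFFF0000  # sign + 8 exponent bits + top 7 mantissa bits survive
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(truncate_to_bfloat16(1.0))      # 1.0
print(truncate_to_bfloat16(3.14159))  # 3.140625
```

The range stays the same as a 32-bit float because the exponent is untouched, but only about 2 to 3 decimal digits of precision are left.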

karmic trench
#

yeah 💀

#

okay I don't have any more questions, thanks for helping!