Is there a floatN_t type?

06-08-2009

For giant integers, at least, you could just bundle two 64-bit integers in a struct. Probably best to assume the whole thing is unsigned. You wouldn't be able to do arithmetic operations on it directly but you'd have the right data in the right places. Of course there might be endian issues to consider as well...

I think there is a way to force the gcc compiler to make 128-bit integers on 64-bit systems the same way it makes 64-bit integers on 32-bit systems but that's skipping way out of 'compiler specific' into crazyland.
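One way to sidestep the endian issue is to assemble each half from individual bytes instead of copying raw memory. Here is a minimal sketch, assuming the device sends the most significant byte first; the names `u128` and `u128_from_be` are made up for illustration and are not from any library:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container: two 64-bit halves of an unsigned 128-bit value. */
typedef struct {
    uint64_t high;
    uint64_t low;
} u128;

/* Build the value from 16 bytes, most significant byte first (assumed wire
 * order). Shifting byte by byte gives the same result on any host,
 * whatever its native endianness. */
static u128 u128_from_be(const unsigned char *bytes)
{
    u128 v = { 0, 0 };
    size_t i;

    assert(bytes != NULL);
    for (i = 0; i < 8; i++)
        v.high = (v.high << 8) | bytes[i];
    for (i = 8; i < 16; i++)
        v.low = (v.low << 8) | bytes[i];
    return v;
}
```

If the device turns out to send the least significant byte first, the same loops work with the byte index reversed.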

Corona688

06-08-2009

There is some support in gcc for 128-bit vector registers, which are interpreted as 128-bit ints in Tetra Integer (TI) mode, but given that the information doesn't have to be processed, I would go with the struct: 2 x 64-bit or 4 x 32-bit unsigned would be more portable and easier to implement.

reborg

06-08-2009

Believe it or not, there is indeed a notion of a 128-bit floating point number which is implemented as two consecutive 64-bit floating point numbers. It's very nonstandard and in general a terrible idea, but it does exist. You add the two numbers together to get the actual value. The first number is considered to be an approximation of the value.

Imagine that 64 bits could exactly contain 3 decimal digits... if the first number was 1.23 and the second was .00456, the combined number would be 1.23456. Remember this isn't my idea, so don't kill the messenger.

link

Quote:

A 128-bit long double number consists of an ordered pair of 64-bit double-precision numbers. The first member of the ordered pair contains the high-order part of the number, and the second member contains the low-order part. The value of the long double quantity is the sum of the two 64-bit numbers.

Each of the two 64-bit numbers is itself a double-precision floating-point number with a sign, exponent, and significand. Typically the low-order member has a magnitude that is less than 0.5 units in the last place of the high part, so the values of the two 64-bit numbers do not overlap and the entire significand of the low-order number adds precision beyond the high-order number.
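To make the "sum of two doubles" idea concrete, here is a small sketch using the classic two-sum trick, which splits a + b into the rounded sum and the rounding error that a single double would lose. The `dd` type and `two_sum` name are illustrative only; this is not how any particular platform implements its long double, and it assumes the compiler does not reorder floating point operations (no fast-math options):

```c
#include <assert.h>
#include <math.h>

/* Illustrative pair: value = hi + lo, with lo much smaller than hi. */
typedef struct {
    double hi;
    double lo;
} dd;

/* Error-free addition: hi is a + b rounded to double, and lo is the
 * part of the exact sum that did not fit into hi. */
static dd two_sum(double a, double b)
{
    dd r;
    double bv;

    assert(isfinite(a) && isfinite(b));
    r.hi = a + b;
    bv   = r.hi - a;                       /* the part of b that made it into hi */
    r.lo = (a - (r.hi - bv)) + (b - bv);   /* what was rounded away */
    return r;
}
```

For example, two_sum(1.0, 1e-20) gives hi = 1.0 and lo = 1e-20: the plain double sum drops the small term, while the pair keeps it.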

Perderabo

06-09-2009

Quote:

Originally Posted by shamrock

Can you show the struct encoding...maybe it needs to be decoded properly before displaying. An embedded device sending a 128-bit int...what OS and platform are you working on. Also provide some details on the embedded device in use.

The only thing the spec reports is that
16-bit is 10-bit mantissa, 5-bit exponent
32-bit is 23-bit mantissa, 8-bit exponent
64-bit is 52-bit mantissa, 11-bit exponent
128-bit is 112-bit mantissa, 15-bit exponent

and it refers to IEEE 754r.

/me without coffee.

Thanks,
S.
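Going by the 128-bit field widths listed above (1 sign bit, 15 exponent bits, 112 mantissa bits, exponent bias 16383), a rough way to get a displayable value is to decode the fields by hand and squeeze the result into a double. This is only a sketch: it assumes the value has already been split into two 64-bit halves with `high` holding bits 127..64, the function name is made up, and the result is lossy (about 53 of the 113 significant bits survive, and the final bit may not be correctly rounded):

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>

/* Lossy decode of an IEEE 754-2008 binary128 value into a double.
 * Layout assumed: bit 127 sign, bits 126..112 exponent, bits 111..0 fraction. */
static double binary128_to_double(uint64_t high, uint64_t low)
{
    int      sign    = (int)(high >> 63);
    int      exp     = (int)((high >> 48) & 0x7fff);          /* 15 bits */
    uint64_t frac_hi = high & UINT64_C(0x0000ffffffffffff);   /* top 48 fraction bits */
    double   mag;

    assert(DBL_MANT_DIG == 53);   /* sketch assumes IEEE binary64 doubles */

    if (exp == 0x7fff) {
        mag = (frac_hi | low) ? NAN : INFINITY;   /* NaN or infinity */
    } else if (exp == 0) {
        /* Zero or subnormal. binary128 subnormals are below 2^-16382,
         * far under the double range, so both come out as zero. */
        mag = 0.0;
    } else {
        /* 1.fraction, scaled by 2^(exp - bias); ldexp saturates to
         * infinity or zero when the exponent is out of double range. */
        mag = 1.0 + ldexp((double)frac_hi, -48) + ldexp((double)low, -112);
        mag = ldexp(mag, exp - 16383);
    }
    return sign ? -mag : mag;
}
```

The result can then be printed with an ordinary `%g` or `%.17g` conversion. Showing all 34 or so decimal digits of a binary128 value would need arbitrary-precision arithmetic instead.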

-----Post Update-----

Quote:

Originally Posted by Corona688

For giant integers, at least, you could just bundle two 64-bit integers in a struct. Probably best to assume the whole thing is unsigned. You wouldn't be able to do arithmetic operations on it directly but you'd have the right data in the right places. Of course there might be endian issues to consider as well...

Yes, that's actually what I did.

Code:

typedef struct {
 uint64_t low;
 uint64_t high;
} uint128_t;

and likewise for the signed one. But my problem is still how to display the value.
I'd rather not use compiler-specific stuff.
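For the unsigned integer case, the value can be displayed in decimal with plain C99 by doing schoolbook long division by 10 on 32-bit limbs. A minimal sketch follows, reusing the struct above; the function name `uint128_to_decimal` is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t low;
    uint64_t high;
} uint128_t;

/* Write v in decimal into buf, which must hold at least 40 bytes
 * (39 digits plus the terminating NUL). Returns buf. */
static char *uint128_to_decimal(uint128_t v, char *buf, size_t size)
{
    uint32_t limb[4];
    char tmp[40];
    char *p = tmp + sizeof tmp - 1;
    size_t i;

    assert(buf != NULL && size >= sizeof tmp);

    /* Most significant limb first. */
    limb[0] = (uint32_t)(v.high >> 32);
    limb[1] = (uint32_t)(v.high & 0xffffffffu);
    limb[2] = (uint32_t)(v.low >> 32);
    limb[3] = (uint32_t)(v.low & 0xffffffffu);

    *p = '\0';
    do {
        uint64_t rem = 0;
        /* Divide the whole 128-bit number by 10; rem stays below 10,
         * so cur / 10 always fits back into 32 bits. */
        for (i = 0; i < 4; i++) {
            uint64_t cur = (rem << 32) | limb[i];
            limb[i] = (uint32_t)(cur / 10);
            rem = cur % 10;
        }
        *--p = (char)('0' + rem);          /* digits come out last-first */
    } while (limb[0] | limb[1] | limb[2] | limb[3]);

    memcpy(buf, p, (size_t)(tmp + sizeof tmp - p));   /* includes the NUL */
    return buf;
}
```

Hexadecimal output needs no arithmetic at all: printing `high` and then `low` with two `"%016" PRIx64` conversions from `<inttypes.h>` is enough.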

emitrax

06-09-2009

Emitrax, you can do what you want to do but it is going to take significant work on your part. The relevant standard is IEEE 754-2008 (previously known as IEEE 754r) which was published in August of 2008.

Floating point formats defined by this standard are classified as either interchange or non-interchange. In the standard, storage formats are narrow interchange formats, i.e. the set of floating point values that can be stored by the specified binary encoding is a proper subset of wider floating point formats such as the 32-bit float and 64-bit double.

For example, here is how to encode and decode a half-precision (i.e. 16-bit) binary-encoded floating point number.

Code:

/*
**  This program is free software; you can redistribute it and/or modify it under
**  the terms of the GNU Lesser General Public License, as published by the Free 
**  Software Foundation; either version 2 of the License, or (at your option) any
**  later version.
**
**   IEEE 754-2008 Half-precision Floating Point Format
**   --------------------------------------------------
**
**   | Field    | Last | First | Note
**   |----------|------|-------|----------
**   | Sign     | 15   | 15    |
**   | Exponent | 14   | 10    | Bias = 15
**   | Fraction | 9    | 0     |
*/

#include <stdio.h>
#include <inttypes.h>

typedef uint16_t HALF;

/* ----- prototypes ------ */
float HALFToFloat(HALF);
HALF floatToHALF(float);
static uint32_t halfToFloatI(HALF);
static HALF floatToHalfI(uint32_t);

float
HALFToFloat(HALF y)
{
    union { float f; uint32_t i; } v;
    v.i = halfToFloatI(y);
    return v.f;
}

static uint32_t
halfToFloatI(HALF y)
{
    uint32_t s = (y >> 15) & 0x00000001;                       // sign (unsigned, so s << 31 is well defined)
    int      e = (y >> 10) & 0x0000001f;                       // exponent
    uint32_t f =  y        & 0x000003ff;                       // fraction

    // e == 0 covers +-0 and denormals; e == 31 covers +-Inf (0x7c00, 0xfc00) and NaN
    if (e == 0) {
        // f == 0 means zero; the sign bit alone distinguishes +0 from -0 (0x8000)
        if (f == 0)                                            // Plus or minus zero
            return s << 31;
        else {                                                 // Denormalized number -- renormalize it
            while (!(f & 0x00000400)) {
                f <<= 1;
                e -=  1;
            }
            e += 1;
            f &= ~0x00000400;
        }
    } else if (e == 31) {
        if (f == 0)                                             // Inf
            return (s << 31) | 0x7f800000;
        else                                                    // NaN
            return (s << 31) | 0x7f800000 | (f << 13);
    }

    e = e + (127 - 15);
    f = f << 13;

    return ((s << 31) | (e << 23) | f);
}

HALF
floatToHALF(float i)
{
    union { float f; uint32_t i; } v;
    v.f = i;
    return floatToHalfI(v.i);
}

static HALF
floatToHalfI(uint32_t i)
{
    int s = (int)((i >> 16) & 0x00008000);                      // sign
    int e = (int)((i >> 23) & 0x000000ff) - (127 - 15);         // exponent, rebiased for half precision
    int f = (int)( i        & 0x007fffff);                      // fraction

    // Inf and NaN (e == 0xff - (127 - 15)) are handled in the second branch below
    if (e <= 0) {
        if (e < -10) {
            if (s)                                              // handle -0.0
               return 0x8000;
            else
               return 0;
        }
        f = (f | 0x00800000) >> (1 - e);
        return s | (f >> 13);
    } else if (e == 0xff - (127 - 15)) {
        if (f == 0)                                             // Inf
            return s | 0x7c00;
        else {                                                  // NAN
            f >>= 13;
            return s | 0x7c00 | f | (f == 0);
        }
    } else {
        if (e > 30)                                             // Overflow
            return s | 0x7c00;
        return s | (e << 10) | (f >> 13);
    }
}

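int
main(void)
{
   float f1, f2;
   HALF h;

   printf("Please enter a floating point number: ");
   if (scanf("%f", &f1) != 1) {                                 // reject non-numeric input
      fprintf(stderr, "Invalid input\n");
      return 1;
   }

   h = floatToHALF(f1);
   f2 = HALFToFloat(h);

   // h is a uint16_t, so print it as unsigned int rather than with %lx
   printf("Results are: %f %f %#x\n", f1, f2, (unsigned)h);
   return 0;
}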

See the blog entry Half-Precision Floating Point Format for further information and an example of how to do the same thing using Python.

fpmurphy
