Documentation
Tools for embedded systems
FP16 is a compact fixed-point number library intended for use in embedded systems. It includes a variety of transcendental functions and essential operators, carefully chosen for optimal performance.
The format is a signed Q16.16, which is good enough for most purposes.
The maximum representable value is 32767.999985. The minimum value is -32768.0.
The minimum value is also used as the overflow marker (the value named overflow), so for some operations it cannot be determined whether the operation overflowed or the result was legitimately the smallest possible value. In practice, this rarely matters.
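The ambiguity described above can be illustrated with a minimal sketch that works on raw 32-bit integers rather than on the library's own type. The names `q16_raw`, `kOverflow`, `from_int` and `add_checked` are hypothetical and are not part of qlibs; the sketch only assumes that the marker equals the most negative raw value.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Raw Q16.16 value held in a signed 32-bit integer (illustrative alias,
// not the library's qlibs::fp16 type).
using q16_raw = std::int32_t;

// Assumed marker: the most negative raw value, which also encodes -32768.0.
constexpr q16_raw kOverflow = std::numeric_limits<q16_raw>::min();

// Convert a whole number to raw Q16.16 (valid for -32768..32767).
inline q16_raw from_int(int v) {
    assert(v >= -32768 && v <= 32767);
    return static_cast<q16_raw>(static_cast<std::int64_t>(v) * 65536);
}

// Add in 64-bit, then report results outside the 32-bit range via the marker.
inline q16_raw add_checked(q16_raw a, q16_raw b) {
    const std::int64_t wide = static_cast<std::int64_t>(a) + b;
    if (wide > std::numeric_limits<q16_raw>::max() ||
        wide < std::numeric_limits<q16_raw>::min()) {
        return kOverflow;
    }
    return static_cast<q16_raw>(wide);
}
```

With these helpers, `add_checked(from_int(30000), from_int(30000))` returns the marker because 60000 is out of range, while `add_checked(from_int(-32767), from_int(-1))` returns the same bit pattern as the legitimate result -32768.0. The caller cannot tell the two cases apart.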
The smallest unit (machine precision) of the datatype is 1/65536 = 0.000015259.
All the provided functions operate on 32-bit numbers of type qlibs::fp16, which have a 16-bit integer part and a 16-bit fractional part.
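The range and precision figures quoted above follow directly from the bit layout. The following sketch reproduces them from a plain `std::int32_t`; the names `kFracBits`, `kOne`, `kRawMax`, `kRawMin`, `to_double` and `check_range` are assumptions made for illustration and do not come from the library.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

constexpr int kFracBits = 16;                     // fractional bits in Q16.16
constexpr std::int32_t kOne = 1 << kFracBits;     // raw 65536 encodes 1.0
constexpr std::int32_t kRawMax = std::numeric_limits<std::int32_t>::max();
constexpr std::int32_t kRawMin = std::numeric_limits<std::int32_t>::min();

// Interpret a raw Q16.16 integer as a real number.
inline double to_double(std::int32_t raw) {
    return static_cast<double>(raw) / kOne;
}

// Self-check of the figures quoted in the text.
inline void check_range() {
    assert(to_double(kRawMin) == -32768.0);        // minimum value
    assert(to_double(kRawMax) > 32767.99998);      // maximum is just below 32768
    assert(to_double(kRawMax) < 32768.0);
    assert(to_double(1) == 1.0 / 65536.0);         // smallest unit
}
```

Every raw step of 1 changes the represented value by exactly 1/65536, so the resolution is uniform over the whole range, unlike floating point.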
Conversion from integers and floating-point values.
These conversions retain the numeric value and perform rounding where necessary. Values of type int, float or double can be converted to the qlibs::fp16 type. For constants, use the fixed-point literal.
Basic operator overloading.
The overloaded operators also perform rounding and detect overflows. When overflow is detected, they return overflow as a marker value.
+ Addition
- Subtraction
* Multiplication
/ Division
Roots, exponents & similar.
±40 absolute error for negative inputs and ±0.003% for positive inputs. Average error is ±1 for negative inputs and ±0.0003% for positive inputs.
±3 absolute error; average error is less than 1 unit.
A fixed-point literal defines a compile-time constant whose value is specified in the source file. The suffix _fp indicates the type qlibs::fp16.
A plus (+) or minus (-) symbol can precede a fixed-point literal. However, it is not part of the literal; it is interpreted as a unary operator.
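The rule about the sign follows from how C++ user-defined literals work in general: the literal operator only ever receives a non-negative value, and a leading minus is applied to its result. The sketch below shows this with a hypothetical suffix `_q16` that yields a raw Q16.16 integer. It is not the library's `_fp` literal, whose implementation is not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical suffix _q16: converts a floating-point literal to a raw
// Q16.16 integer. The argument is never negative, so adding 0.5 before
// truncation rounds to the nearest representable value.
constexpr std::int32_t operator""_q16(long double v) {
    return static_cast<std::int32_t>(v * 65536.0L + 0.5L);
}

// Same suffix for integer literals such as 2_q16.
constexpr std::int32_t operator""_q16(unsigned long long v) {
    return static_cast<std::int32_t>(v * 65536ULL);
}

// The conversion happens at compile time.
static_assert(1.5_q16 == 98304, "1.5 * 65536");
static_assert(2_q16 == 131072, "2 * 65536");

inline void check_sign() {
    // Parsed as -(1.5_q16): unary minus is applied to the converted value.
    assert(-1.5_q16 == -98304);
}
```

Because the minus sign is an ordinary unary operator, `-1.5_q16` is evaluated as the negation of `1.5_q16` rather than as a single negative literal.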
This draft example computes one solution of the quadratic equation \( ax^{2} + bx + c = 0 \) using the fixed-point format. The equation is given by:
\( x = \frac{ -b + \sqrt{ b^{2} - 4ac} }{ 2a } \)
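As a library-independent illustration, the formula can also be evaluated directly on raw Q16.16 integers. This is a minimal sketch, not the qlibs implementation: the helpers `round_div`, `q16_mul`, `q16_div`, `q16_sqrt` and `quadratic_root` are hypothetical, perform no overflow detection, and assume that `a` is non-zero and that the discriminant is non-negative.

```cpp
#include <cassert>
#include <cstdint>

using q16 = std::int32_t;          // raw Q16.16 value (illustrative alias)
constexpr q16 kOne = 65536;        // raw encoding of 1.0

// Integer division rounded to nearest, halves away from zero.
inline std::int64_t round_div(std::int64_t n, std::int64_t d) {
    assert(d != 0);
    return ((n < 0) != (d < 0)) ? (n - d / 2) / d : (n + d / 2) / d;
}

// The 64-bit product carries 32 fractional bits; divide by 2^16 to rescale.
inline q16 q16_mul(q16 a, q16 b) {
    return static_cast<q16>(round_div(static_cast<std::int64_t>(a) * b, kOne));
}

// Pre-scale the numerator by 2^16 so the quotient keeps 16 fractional bits.
inline q16 q16_div(q16 a, q16 b) {
    return static_cast<q16>(round_div(static_cast<std::int64_t>(a) * kOne, b));
}

// Square root via the bit-by-bit integer square root of (raw * 2^16).
// Non-positive inputs return 0 in this sketch.
inline q16 q16_sqrt(q16 x) {
    if (x <= 0) return 0;
    std::uint64_t n = static_cast<std::uint64_t>(x) << 16;
    std::uint64_t res = 0;
    std::uint64_t bit = 1ULL << 46;          // largest power of four below 2^47
    while (bit > n) bit >>= 2;
    while (bit != 0) {
        if (n >= res + bit) {
            n -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return static_cast<q16>(res);
}

// x = (-b + sqrt(b^2 - 4ac)) / (2a), all quantities in raw Q16.16.
inline q16 quadratic_root(q16 a, q16 b, q16 c) {
    const q16 disc = q16_mul(b, b) - q16_mul(q16_mul(4 * kOne, a), c);
    return q16_div(-b + q16_sqrt(disc), q16_mul(2 * kOne, a));
}
```

For a = 1, b = -3, c = 2 every intermediate value is exactly representable, and `quadratic_root(kOne, -3 * kOne, 2 * kOne)` returns the raw encoding of 2.0. For inputs whose root is irrational, such as a = 1, b = 0, c = -2, the result is only accurate to within a few units of 1/65536.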