Numeric Datatypes | Heaton Research

Numeric Datatypes

In Jonathan Swift’s Gulliver’s Travels, the nations of Lilliput and Blefuscu find themselves at war over which end of a hardboiled egg to cut before eating. Lilliputians preferred the Little Endian approach of starting with the little end of the egg, whereas the people of Blefuscu preferred to start with the large end. An inane controversy indeed, but one that mirrors our own computer industry. When an integer is stored in memory and occupies more than one byte, it is necessary to decide which byte to place first. Take for example the number 1025. This number has to be stored in two bytes. The high-order byte is four and the low-order byte is one, because the integer division of 1025 by 256 is four, with a remainder of one. So we have the bytes four and one. Is this stored as 01 04 or as 04 01? Computer scientists call the two orderings little-endian and big-endian respectively: the same words as those used by Swift to describe the dilemma of the Lilliputians. The two systems can be seen in Figure 1.
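The arithmetic for 1025 can be checked with a few lines of Java. This is only an illustrative sketch; the class and method names are made up for this example and are not part of the BinaryFile class.

```java
final class EndianSplitExample {
    /** High-order byte of a 16-bit value; equals value / 256 for 0..65535. */
    static int highByte(int value) {
        return (value >> 8) & 0xff;
    }

    /** Low-order byte of a 16-bit value; equals value % 256 for 0..65535. */
    static int lowByte(int value) {
        return value & 0xff;
    }

    public static void main(String[] args) {
        int value = 1025;
        int high = highByte(value); // 4
        int low = lowByte(value);   // 1

        // Big-endian places the high-order byte first.
        System.out.printf("big-endian:    %02X %02X%n", high, low);
        // Little-endian places the low-order byte first.
        System.out.printf("little-endian: %02X %02X%n", low, high);
    }
}
```

Running it prints 04 01 for big-endian and 01 04 for little-endian.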

So which one is predominant in the industry? At this time it seems to be leaning towards little-endian. Both Intel PCs and the new Intel Macintoshes make use of little-endian. Big-endian is used in older PowerPC-based Macs and some higher-end workstations. As a result, the binary file class presented in this article will handle both standards.

In order to accommodate both little-endian and big-endian numbers, integers are first read in byte by byte and then assembled into the correct data type. For numbers that are four bytes, the next four bytes from the file are read into the variables a, b, c and d. Then, depending on whether the data is big-endian or little-endian, one of the following equations is used.

result = ((a << 24) | (b << 16) | (c << 8) | d); // big endian
result = (a | (b << 8) | (c << 16) | (d << 24)); // little endian
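As a rough, self-contained sketch of these two equations, the following wraps each one in a method. The class and method names are hypothetical. The sketch also masks each input with 0xff, which is an added precaution and not something the equations above show: it prevents sign extension if the values originally came from Java's signed byte type.

```java
final class EndianCombineExample {
    /** Treats a as the most significant byte. */
    static int bigEndian(int a, int b, int c, int d) {
        return ((a & 0xff) << 24) | ((b & 0xff) << 16) | ((c & 0xff) << 8) | (d & 0xff);
    }

    /** Treats a as the least significant byte. */
    static int littleEndian(int a, int b, int c, int d) {
        return (a & 0xff) | ((b & 0xff) << 8) | ((c & 0xff) << 16) | ((d & 0xff) << 24);
    }

    public static void main(String[] args) {
        // The same four bytes, in file order, give different values
        // depending on the byte order assumed.
        System.out.println(bigEndian(0x00, 0x00, 0x04, 0x01));    // 1025
        System.out.println(littleEndian(0x00, 0x00, 0x04, 0x01)); // 17039360
        System.out.println(littleEndian(0x01, 0x04, 0x00, 0x00)); // 1025
    }
}
```

Note that the result here is a signed Java int; holding an unsigned four-byte value would need a wider type such as long.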

In addition to the issue of little-endian or big-endian, numeric data types can be stored as signed or unsigned. Java has no unsigned integer types apart from char, but unsigned numbers are all too common in other programming languages. This means there are four major categories of numbers to be supported: signed big-endian, unsigned big-endian, signed little-endian and unsigned little-endian. To accommodate these different systems, the methods setEndian and setSigned are provided. The setEndian method accepts either BinaryFile.BIG_ENDIAN or BinaryFile.LITTLE_ENDIAN. There is also a getEndian method to determine the current mode. The setSigned method accepts a boolean value: true indicates that the numbers are signed, and false indicates that they are unsigned. There is also a getSigned method to determine the current mode.
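To see how the four categories differ, the sketch below interprets the same two bytes in each mode. It deliberately does not use the BinaryFile class; it relies on the standard java.nio.ByteBuffer and on Short.toUnsignedInt (available from Java 8), and the class and method names are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class FourModesExample {
    /** Reads a two-byte word using the given byte order and signedness. */
    static int readWord(byte[] bytes, ByteOrder order, boolean signed) {
        short raw = ByteBuffer.wrap(bytes).order(order).getShort();
        // A signed word is just the short; an unsigned word is widened to 0..65535.
        return signed ? raw : Short.toUnsignedInt(raw);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xFF, (byte) 0xFE};

        System.out.println(readWord(bytes, ByteOrder.BIG_ENDIAN, true));     // -2
        System.out.println(readWord(bytes, ByteOrder.BIG_ENDIAN, false));    // 65534
        System.out.println(readWord(bytes, ByteOrder.LITTLE_ENDIAN, true));  // -257
        System.out.println(readWord(bytes, ByteOrder.LITTLE_ENDIAN, false)); // 65279
    }
}
```

The same two bytes yield four different values, which is why both settings must be known before a binary file can be read correctly.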

Signed numbers are stored in a format called two’s complement. Two’s complement uses the most significant bit as a sign flag: a value of one for this bit signifies a negative number, and a value of zero signifies zero or a positive number. Positive numbers are stored just as they normally would be. Negative values are stored by subtracting their magnitude from one beyond the highest value that an unsigned number of that type would hold. For example, –1 in a word would be stored as 0x10000 – 1, or 0xffff.
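The subtraction rule for a two-byte word can be sketched as follows. The class and method names are hypothetical, and the code assumes values that fit in a word (-32768 to 32767 when encoding, 0 to 0xffff when decoding).

```java
final class TwosComplementExample {
    /** Encodes a signed value (-32768..32767) as a 16-bit pattern (0..0xffff). */
    static int encodeWord(int value) {
        // For a negative value, 0x10000 + value is 0x10000 minus its magnitude.
        return value < 0 ? 0x10000 + value : value;
    }

    /** Decodes a 16-bit pattern (0..0xffff) back into a signed value. */
    static int decodeWord(int bits) {
        // If the most significant bit is set, the value is negative.
        return (bits & 0x8000) != 0 ? bits - 0x10000 : bits;
    }

    public static void main(String[] args) {
        System.out.printf("%x%n", encodeWord(-1)); // ffff
        System.out.println(decodeWord(0xffff));    // -1
        System.out.println(decodeWord(0x7fff));    // 32767
    }
}
```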

In addition to handling signed and unsigned values, the BinaryFile object can read and write numbers of several sizes. The supported sizes are byte, word, and double-word. The methods used to read/write these types are readByte/writeByte, readWord/writeWord and readDWord/writeDWord. A byte occupies just one byte of memory, so the endian setting does not affect byte reads and writes, but a byte can still be signed or unsigned. A word occupies two bytes of memory. Words can be little- or big-endian, and they can also be signed or unsigned. The double-word occupies four bytes of memory. A double-word, like the word, follows the endian and signed modes.

Each of the numeric read/write methods deals in Java types that are one size bigger than the underlying data type: a byte is stored in a short, a word is stored in an int, and a double-word is stored in a long. This is done to accommodate the unsigned data types. The Java byte data type is signed and holds only -128 to 127, but the readByte method, when working in unsigned mode, can return numbers in the range of 0 to 255. That would overflow a Java byte, so readByte returns a short instead. These different types can be seen in Figure 2.
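The widening described here can be illustrated with plain Java masks. This is a sketch of the general idea, not the BinaryFile implementation, and the helper names are invented for this example.

```java
final class WideningExample {
    /** Unsigned view of a byte, 0..255, held in a short. */
    static short unsignedByte(byte raw) {
        return (short) (raw & 0xff);
    }

    /** Unsigned view of a two-byte word, 0..65535, held in an int. */
    static int unsignedWord(short raw) {
        return raw & 0xffff;
    }

    /** Unsigned view of a four-byte double-word, 0..4294967295, held in a long. */
    static long unsignedDWord(int raw) {
        return raw & 0xffffffffL;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xff;
        System.out.println(b);                         // -1 as a signed Java byte
        System.out.println(unsignedByte(b));           // 255
        System.out.println(unsignedWord((short) -1));  // 65535
        System.out.println(unsignedDWord(-1));         // 4294967295
    }
}
```

In each case the bit pattern is unchanged; only the wider container lets the value be read as a non-negative number.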

Copyright 2005-2008 by Heaton Research, Inc.