Memory & Encoding Operations

Very Lazy Cliff-Notes Version

BB

For a lot of people, including myself, this is not a particularly thrilling topic to read or write about. However, as someone who does not come from a CS background (not even close), I find that ignorance in this area is a major obstacle to better understanding the fundamentals of R, or of any language for that matter. For that reason, I want to document some basic facts so that I, and others, can use this as a reference, since I will inevitably forget a lot of this material. Because I'm obviously no expert in this field (or many others, for that matter), I will rely heavily on sources, base-R code examples, and just a sprinkling of my own commentary.

This follows the examples given by the R documentation, except that at times I make certain function arguments explicit to make it more clear what is happening.

The first, and most important, topic in my opinion is how data is stored in a computer's memory.

A computer cannot store “numbers” or “letters”. The only thing a computer can store and work with is bits. A bit is binary: it is either a 0 or a 1. In fact, from a physics perspective, a bit is just a blip of electricity that either is or isn’t there.

Consequently, it is from building blocks of 0's and 1's that everything, including R, is built. Bits are combined to form larger units of memory, listed in the table below. Note that a processor's word size is expressed in these units as well: a word is either 32 or 64 bits depending on the system (a 32-bit system can address at most 4 GB of RAM, a 64-bit system far more).

Name      Value
bit       1 binary digit (0/1)
nibble    4 bits
byte      8 bits
kilobyte  1024 bytes (i.e., 8192 bits)
megabyte  1024 kilobytes
gigabyte  1024 megabytes
terabyte  1024 gigabytes
petabyte  1024 terabytes
exabyte   1024 petabytes

Considering that a byte holds 8 binary digits, there are 2^8 = 256 possible combinations for a byte (00000000 through 11111111). In general, n bits give 2^n combinations, where the 2 is the number of possible values of a single bit (0/1). This number is sufficiently large to capture most characters on a standard keyboard.
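
As a quick sanity check of that arithmetic, here is a minimal base-R sketch that counts the values representable with a given number of bits. The names bit_widths, n_values and max_byte are only illustrative choices, not taken from any of the sources quoted here.

```r
# The number of distinct values that n bits can represent is 2^n
bit_widths <- c(1, 4, 7, 8, 16)
n_values <- 2^bit_widths
n_values

# The largest value one byte can hold is eight 1-bits, i.e. 2^8 - 1
max_byte <- strtoi("11111111", base = 2L)
max_byte  # 255
```

Counting starts at zero, which is why a byte has 256 possible values but a maximum of 255.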

Accordingly...

Since characters (letters, decimal digits and special characters such as punctuation marks, etc.) can be represented with bytes, a standard is needed to ensure that the code that's used on your computer is the same as the code that is used on mine. There are two standard codes that use one byte to represent a character, ASCII (ass'-key) and EBCDIC (ib'-suh-dik). ASCII, the American Standard Code for Information Interchange, is the code that is most commonly used today. EBCDIC, Extended Binary Coded Decimal Interchange Code, was used by IBM on its large mainframe computers in the past.

However...

"In the past the ASCII character set dominated computing. This set defines 128 characters including 0 to 9, upper and lower case alpha-numeric and a few control characters such as a new line. To store these characters required 7 bits since 2^7 = 128, but 8 bits were typically used for performance reasons...

...The limitation of only having 256 characters led to the development of Unicode, a standard framework aimed at creating a single character set for every reasonable writing system. Typically, Unicode characters require sixteen bits of storage. Eight bits is one byte, or ASCII character. So two ASCII characters would use two bytes or 16 bits. A pure text document containing 100 characters would use 100 bytes (800 bits)."

Efficient R
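
To make the byte counts in the quote concrete, here is a small sketch using nchar(). It assumes intToUtf8() is used to build a non-ASCII character (the euro sign, code point U+20AC) so that the example does not depend on how this page itself is encoded. The variable names are illustrative only.

```r
# 100 ASCII characters occupy 100 bytes
ascii_text <- strrep("a", 100)
nchar(ascii_text, type = "chars")  # 100
nchar(ascii_text, type = "bytes")  # 100

# A non-ASCII character stored as UTF-8 needs more than one byte
euro <- intToUtf8(0x20AC)          # the euro sign as a UTF-8 string
nchar(euro, type = "chars")        # 1
nchar(euro, type = "bytes")        # 3
charToRaw(euro)                    # the three bytes: e2 82 ac
```

Note that in UTF-8 a character takes between one and four bytes, so the "sixteen bits" in the quote is best read as a rough figure rather than a fixed size.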

Encoding can then be seen as the process of mapping characters to bytes as is shown in the ASCII sample mapping below.

Bit representation  Character
01000001            A
01000010            B
01000011            C
01000100            D
01000101            E
01010010            R
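
The sample mapping can be reproduced in base R. This is only a sketch: to_bit_string is a hypothetical helper name, and it relies on the fact that rawToBits() returns the least-significant bit first, so the bits are reversed to get the usual left-to-right notation.

```r
# Characters from the sample table
chars <- c("A", "B", "C", "D", "E", "R")

# rawToBits() returns the least-significant bit first, so reverse it
# to get the conventional most-significant-bit-first string
to_bit_string <- function(ch) {
  bits <- as.integer(rawToBits(charToRaw(ch)))
  paste(rev(bits), collapse = "")
}

bit_strings <- vapply(chars, to_bit_string, character(1))
data.frame(bits = bit_strings, character = chars, row.names = NULL)

# utf8ToInt() gives the same code as a decimal integer
utf8ToInt("A")  # 65
```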

Beyond bytes and encoding, it is also important to at least be familiar with the hexadecimal (base-16) system, as it is found everywhere in computing and, in particular, in memory addresses. The following table maps decimal numbers to their binary and hexadecimal representations.

Decimal  Binary  Hexadecimal
0        0       0
1        1       1
2        10      2
3        11      3
4        100     4
5        101     5
6        110     6
7        111     7
8        1000    8
9        1001    9
10       1010    A
11       1011    B
12       1100    C
13       1101    D
14       1110    E
15       1111    F
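
The same table can be rebuilt with base-R helpers, which is a handy way to check a conversion. This is a minimal sketch; to_binary4 is a hypothetical helper that zero-pads to four binary digits, so its output shows 0101 where the table above shows 101.

```r
# Rebuild the table for 0..15
decimal <- 0:15

# intToBits() is least-significant bit first; keep 4 bits and reverse them
to_binary4 <- function(n) {
  paste(rev(as.integer(intToBits(n))[1:4]), collapse = "")
}

conversion <- data.frame(
  decimal = decimal,
  binary  = vapply(decimal, to_binary4, character(1)),
  hex     = sprintf("%X", decimal)
)
conversion

# Going the other way: parse binary or hexadecimal strings
strtoi("1111", base = 2L)  # 15
strtoi("F", base = 16L)    # 15
```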

The following examples center around the 'raw' data type which, as the documentation puts it...

"The raw type is intended to hold raw bytes. It is possible to extract subsequences of bytes, and to replace elements (but only by elements of a raw vector)... ...A raw vector is printed with each byte separately represented as a pair of hex digits"

xx <- raw(length = 2) # length of raw vector

xx[1] <- as.raw(40)

xx[2] <- charToRaw("A")

xx
## [1] 28 41
# 0x28 = (2*16^1) + (8*16^0) = 40
# 0x41 = 65 = ASCII code for "A"

dput(xx)
## as.raw(c(0x28, 0x41))
# The 0x prefix marks a hexadecimal literal

as.integer(xx)
## [1] 40 65
rm(xx)

Conversions operate as follows:

"charToRaw converts a length-one character string to raw bytes. It does so without taking into account any declared encoding"

Whereas...

"rawToChar converts raw bytes either to a single character string or a character vector of single bytes (with "" for 0). (Note that a single character string could contain embedded nuls; only trailing nulls are allowed and will be removed.) In either case it is possible to create a result which is invalid in a multibyte locale, e.g. one using UTF-8. Long vectors are allowed if multiple is true."

x <- "A test string"
(y <- charToRaw(x))
##  [1] 41 20 74 65 73 74 20 73 74 72 69 6e 67
rawToChar(y)
## [1] "A test string"
rawToChar(y, multiple = TRUE)
##  [1] "A" " " "t" "e" "s" "t" " " "s" "t" "r" "i" "n" "g"
(xx <- c(y,  charToRaw("&"), charToRaw("more")))
##  [1] 41 20 74 65 73 74 20 73 74 72 69 6e 67 26 6d 6f 72 65
rawToChar(xx)
## [1] "A test string&more"
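
The remark about nuls in the rawToChar documentation is easier to see with a tiny example. This is a sketch with illustrative variable names; the tryCatch() is only there so that the failing call does not stop the script.

```r
# A trailing nul byte is dropped when converting to a single string
trailing <- rawToChar(as.raw(c(0x41, 0x00)))
trailing                                     # "A"

# With multiple = TRUE each byte becomes its own string, and nul becomes ""
with_nul <- as.raw(c(0x41, 0x00, 0x42))
pieces <- rawToChar(with_nul, multiple = TRUE)
pieces                                       # "A" ""  "B"

# An embedded nul cannot be stored in a single R string, so this call fails
embedded <- tryCatch(rawToChar(with_nul), error = function(e) "error")
embedded                                     # "error"
```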

I'm not sure how useful the bit shifting is within an R context, but I include the R documentation's example for illustrative purposes. Also, note the conversion functions as well...

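  • rawShift(x, n) shifts the bits in x by n positions. A positive n moves bits to the right in the least-significant-bit-first order printed by rawToBits, which is a left shift in conventional notation (0x41 becomes 0x82 for n = 1); a negative n shifts the other way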
  • rawToBits returns a raw vector of 8 times the length of a raw vector with entries 0 or 1
  • intToBits returns a raw vector of 32 times the length of an integer vector with entries 0 or 1

Finally, although not covered here, note that there are bitwise logical operators as well (see ?bitwAnd)...

rawShift(y, 1)
##  [1] 82 40 e8 ca e6 e8 40 e6 e8 e4 d2 dc ce
rawShift(y, -2)
##  [1] 10 08 1d 19 1c 1d 08 1c 1d 1c 1a 1b 19
# Gibberish
rawToChar(rawShift(y, 1))
## [1] "‚@èÊæè@æèäÒÜÎ"
rawToBits(y)
##   [1] 01 00 00 00 00 00 01 00 00 00 00 00 00 01 00 00 00 00 01 00 01 01 01
##  [24] 00 01 00 01 00 00 01 01 00 01 01 00 00 01 01 01 00 00 00 01 00 01 01
##  [47] 01 00 00 00 00 00 00 01 00 00 01 01 00 00 01 01 01 00 00 00 01 00 01
##  [70] 01 01 00 00 01 00 00 01 01 01 00 01 00 00 01 00 01 01 00 00 01 01 01
##  [93] 00 01 01 00 01 01 01 00 00 01 01 00
intToBits(5)
##  [1] 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
## [24] 00 00 00 00 00 00 00 00 00
showBits <- function(r) stats::symnum(as.logical(rawToBits(r))) # symbolic number coding

z <- as.raw(5)
z ; showBits(z)
## [1] 05
## [1] | . | . . . . .
showBits(rawShift(z, 1)) # shift to right
## [1] . | . | . . . .
showBits(rawShift(z, 2))
## [1] . . | . | . . .
showBits(z)
## [1] | . | . . . . .
showBits(rawShift(z, -1)) # shift to left
## [1] . | . . . . . .
showBits(rawShift(z, -2)) # ..
## [1] | . . . . . . .
showBits(rawShift(z, -3)) # shifted off entirely
## [1] . . . . . . . .
bitwShiftR(-1, 1:31) # shifts of 2^32-1 = 4294967295
##  [1] 2147483647 1073741823  536870911  268435455  134217727   67108863
##  [7]   33554431   16777215    8388607    4194303    2097151    1048575
## [13]     524287     262143     131071      65535      32767      16383
## [19]       8191       4095       2047       1023        511        255
## [25]        127         63         31         15          7          3
## [31]          1
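
For completeness, here is a short sketch of the bitwise logical operators documented under ?bitwAnd. The inputs 12 and 10 are arbitrary values chosen because their binary forms (1100 and 1010) make the results easy to verify by hand.

```r
a <- 12L  # binary 1100
b <- 10L  # binary 1010

bitwAnd(a, b)        # 8,  binary 1000: bits set in both
bitwOr(a, b)         # 14, binary 1110: bits set in either
bitwXor(a, b)        # 6,  binary 0110: bits set in exactly one
bitwShiftL(1L, 4L)   # 16: shift left by 4, i.e. multiply by 2^4
bitwShiftR(16L, 4L)  # 1:  shift right by 4, i.e. integer-divide by 2^4
```

Unlike rawShift, these functions work on integers rather than raw vectors.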

The R documentation has the following to say regarding encoding:

"Character strings in R can be declared to be encoded in "latin1" or "UTF-8" or as "bytes". These declarations can be read by Encoding, which will return a character vector of values "latin1", "UTF-8" "bytes" or "unknown", or set, when value is recycled as needed and other values are silently treated as "unknown".

ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings. Strings marked as "bytes" are intended to be non-ASCII strings which should be manipulated as bytes, and never converted to a character encoding (so writing them to a text file is not supported).

enc2native and enc2utf8 convert elements of character vectors to the native encoding or UTF-8 respectively, taking any marked encoding into account. They are primitive functions, designed to do minimal copying."
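
To see what "taking any marked encoding into account" means in practice, here is a minimal sketch of enc2utf8(). It builds the Latin-1 bytes explicitly with rawToChar() so that it behaves the same in any locale; the variable names are illustrative.

```r
# Build the Latin-1 bytes for "façile" explicitly (0xE7 is "ç" in Latin-1)
latin1_bytes <- as.raw(c(0x66, 0x61, 0xe7, 0x69, 0x6c, 0x65))
s_latin1 <- rawToChar(latin1_bytes)
Encoding(s_latin1) <- "latin1"   # declare how the bytes should be read

# enc2utf8() re-encodes the string using that declared encoding
s_utf8 <- enc2utf8(s_latin1)
Encoding(s_utf8)                 # "UTF-8"

# Same text, different bytes: "ç" is e7 in Latin-1 but c3 a7 in UTF-8
charToRaw(s_latin1)
charToRaw(s_utf8)
nchar(s_latin1, type = "bytes")  # 6
nchar(s_utf8, type = "bytes")    # 7
```

Setting Encoding() only changes the declaration; it is the conversion functions that actually rewrite the bytes.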

## x is intended to be in latin1
x <- "fa\xE7ile"
Encoding(x) # locale-dependent: "latin1" on my machine, "unknown" in a UTF-8 locale
## [1] "latin1"
Encoding(x) <- "latin1"
x
## [1] "façile"
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
## [1] "latin1" "UTF-8"
c(x, xx)
## [1] "façile" "façile"
Encoding(xx) <- "bytes"
xx # will be encoded in hex
## [1] "fa\\xc3\\xa7ile"
cat("xx = ", xx, "\n", sep = "")
## xx = fa\xc3\xa7ile
# as.hexmode stores integers but displays them in hexadecimal
i <- as.hexmode("7fffffff")
i; class(i)
## [1] "7fffffff"
## [1] "hexmode"
identical(as.integer(i), .Machine$integer.max)
## [1] TRUE
hm <- as.hexmode(c(NA, 1)); hm
## [1] NA  "1"
as.integer(hm)
## [1] NA  1

With strtoi it is possible to...

"Convert strings to integers according to the given base using the C function strtol, or choose a suitable base following the C rules.

For the default base = 0L, the base chosen from the string representation of that element of x, so different elements can have different bases (see the first example). The standard C rules for choosing the base are that octal constants (prefix 0 not followed by x or X) and hexadecimal constants (prefix 0x or 0X) are interpreted as base 8 and 16; all other strings are interpreted as base 10.

For a base greater than 10, letters a to z (or A to Z) are used to represent 10 to 35."

strtoi(c("0xff", "077", "123"))
## [1] 255  63 123
strtoi(c("ffff", "FFFF"), 16L)
## [1] 65535 65535
strtoi(c("177", "377"), 8L)
## [1] 127 255
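
A few more cases with an explicit base may help, including what happens when a string is not valid in that base. This is a small sketch in which every call passes base explicitly.

```r
strtoi("101", base = 2L)   # 5:  binary
strtoi("z", base = 36L)    # 35: in base 36 the letters run up to z
strtoi("08", base = 10L)   # 8:  a leading zero is harmless in base 10
strtoi("8", base = 8L)     # NA: 8 is not a valid octal digit
```

Strings that cannot be interpreted in the requested base come back as NA rather than raising an error, so it is worth checking results with is.na().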

With all that in context, one can use .Internal(address(x)) to find the address of an object in memory. Note that .Internal calls are not part of R's public API, so this may change between R versions...

x <- 5
.Internal(address(x))
## <pointer: 0x0000000010cc8690>
# The address changes from session to session. Converting another
# sample address from my machine to decimal...

strtoi('0x000000001ce05810',16)
## [1] 484464656
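
One caveat: strtoi returns an R integer, so addresses above .Machine$integer.max come back as NA. A possible workaround is to convert digit by digit into a double. This is only a sketch; hex_to_numeric is a hypothetical helper, and a double is exact only up to 2^53.

```r
# Hypothetical helper: convert a hex string to a double, digit by digit
hex_to_numeric <- function(hex) {
  digits <- strsplit(sub("^0[xX]", "", hex), "")[[1]]
  values <- strtoi(digits, base = 16L)       # each hex digit -> 0..15
  powers <- 16^(rev(seq_along(values)) - 1)  # place values, highest first
  sum(values * powers)
}

hex_to_numeric("0x000000001ce05810")  # 484464656, matching strtoi() above
strtoi("ffffffff", base = 16L)        # NA: exceeds .Machine$integer.max
hex_to_numeric("0xffffffff")          # 4294967295
```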
