Lec. 02: Data Types and Program Memory Layout

1. Numeric Data Types and Sign-ness
- 1.1. Basic Numeric Types
- 1.2. Signed Numeric Values
2. Pointers and Memory References
- 2.1. Pointers Basic
- 2.2. Randomization of Memory
3. Arrays and Strings
4. Program Memory Layout
- 4.1. Cowboys and Endian-s
- 4.2. Stack, Heap, Data, BSS

1 Numeric Data Types and Sign-ness

1.1 Basic Numeric Types

Let's recall the basic data types for C programs. One thing to consider is that we are going to be working with 32-bit architectures as opposed to 64-bit ones. This is mostly to keep things simple for the exploits, but it does mean some things may be a bit different at times.

For example, consider this program and its output with the basic numeric types.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]){

  char c = 0xef;
  short s = 0xbeef;
  int i = 0xdeadbeef;
  long l = 0xcafebabe;

  printf("char c=%d size=%u\n",c,sizeof(char));
  printf("short s=%d size=%u\n",s,sizeof(short));
  printf("int i=%d size=%u\n",i,sizeof(int));
  printf("long l=%ld size=%u\n",l,sizeof(long));

  return 0;
}

user@si485H-base:demo$ ./datatypes
char c=-17 size=1
short s=-16657 size=2
int i=-559038737 size=4
long l=-889275714 size=4

Notice that the long number is the same size as an integer, 4-bytes, as opposed to 8-bytes on 64 bit machines. This is because the registers on 32 bit machines are 32 bits wide, and storing 8-byte values is annoying, at best. You can still have 8-byte values, but you have to use long long.

1.2 Signed Numeric Values

The basic numeric types are also signed, that is, their range of values go from [ -2^(w-1) : 2^(w-1)-1 ] because one of the bits is used to indicate the sign. The signed bit is the leading one, if it is 1 then the number is negative, but if it is 0, the number is positive. Positive numbers are represented as you may expected, with regard to binary counting; however, negative numbers are represented with 2's-compliment. That means, you count backwards … sort of.

See the example below for the greatest and smallest negative values:

user@si485H-base:demo$ cat signess.c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv[]){

  //largest positive signed int: 
  //  binary: 0111 1111 1111 1111 1111 1111 1111 1111
  //     hex:   7    f     f    f    f    f   f    f
  int l_pos = 0x7fffffff;

  //smallest postivie signed int: 
  //  binary: 0000 0000  0000 0000 0000 0000 0000 0000
  //     hex:   0   0     0    0    0    0    0    0
  int s_pos = 0x00000000;

  //smallest negative signed int:
  //  binary: 1000 0000 0000 0000 0000 0000 0000 0000
  //     hex:   8    0    0     0    0    0    0   0
  int s_neg = 0x80000000;


  //largest negative signed int:
  //  binary: 1111 1111 1111 1111 1111 1111 1111 1111
  //     hex:   8    f   f     f    f    f    f   f
  int l_neg = 0xffffffff;



  printf(" largest positive=%d \t (0x%08x)\n",l_pos,l_pos);
  printf("smallest positive=%d \t\t (0x%08x)\n",s_pos,s_pos);
  printf("\n");

  printf(" largest negative=%d \t\t (0x%08x)\n",l_neg,l_neg);
  printf("smallest negative=%d \t (0x%08x)\n",s_neg,s_neg);
  printf("\n");

}

user@si485H-base:demo$ ./signess 
 largest positive=2147483647 	 (0x7fffffff)
smallest positive=0 		 (0x00000000)

 largest negative=-1 		 (0xffffffff)
smallest negative=-2147483648 	 (0x80000000)

If we were to count in twos-complement with negative numbers, we start at 0xffffffff (-1) and go to 0x80000000 (-2147483648). This where the counting backwards comes from, except we are counting backwards in the lower 31 bits. The program below shows this clearly:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv[]){

  int s_neg = 0x80000000;
  int l_neg = 0xffffffff;

  printf("largest negative  =%d \t\t(0x%08x)\n",l_neg,l_neg);
  printf("largest negative-1=%d \t\t(0x%08x)\n",l_neg-1,l_neg-1);
  printf("largest negative-2=%d \t\t(0x%08x)\n",l_neg-2,l_neg-2);
  printf("largest negative-3=%d \t\t(0x%08x)\n",l_neg-3,l_neg-3);
  printf("...\n");
  printf("smallest negative+3=%d (0x%08x)\n",s_neg+3,s_neg+3);
  printf("smallest negative+2=%d (0x%08x)\n",s_neg+2,s_neg+2);
  printf("smallest negative+1=%d (0x%08x)\n",s_neg+1,s_neg+1);
  printf("smallest negative  =%d (0x%08x)\n",s_neg,s_neg);

}

user@si485H-base:demo$ ./twos-comp
largest negative  =-1 		(0xffffffff)
largest negative-1=-2 		(0xfffffffe)
largest negative-2=-3 		(0xfffffffd)
largest negative-3=-4 		(0xfffffffc)
...
smallest negative+3=-2147483645 (0x80000003)
smallest negative+2=-2147483646 (0x80000002)
smallest negative+1=-2147483647 (0x80000001)
smallest negative  =-2147483648 (0x80000000)

Of course, as with all data in C, how we interpret ones and zeros is really in the eye of the beholder. If we were to instead interpret these values as unsigned, then we get very different output:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv[]){

  int s_neg = 0x80000000;
  int l_neg = 0xffffffff;

  printf("(unsigned) largest negative  =%u \t\t(0x%08x)\n",l_neg,l_neg);
  printf("(unsigned)  largest negative-1=%u \t\t(0x%08x)\n",l_neg-1,l_neg-1);
  printf("(unsigned) largest negative-2=%u \t\t(0x%08x)\n",l_neg-2,l_neg-2);
  printf("(unsigned) largest negative-3=%u \t\t(0x%08x)\n",l_neg-3,l_neg-3);
  printf("...\n");
  printf("(unsigned) smallest negative+3=%u (0x%08x)\n",s_neg+3,s_neg+3);
  printf("(unsigned) smallest negative+2=%u (0x%08x)\n",s_neg+2,s_neg+2);
  printf("(unsigned) smallest negative+1=%u (0x%08x)\n",s_neg+1,s_neg+1);
  printf("(unsigned) smallest negative  =%u (0x%08x)\n",s_neg,s_neg);

}

user@si485H-base:demo$ ./unsigned 
(unsigned) largest negative  =4294967295  (0xffffffff)
(unsigned) largest negative-1=4294967294  (0xfffffffe)
(unsigned) largest negative-2=4294967293  (0xfffffffd)
(unsigned) largest negative-3=4294967292  (0xfffffffc)
...
(unsigned) smallest negative+3=2147483651 (0x80000003)
(unsigned) smallest negative+2=2147483650 (0x80000002)
(unsigned) smallest negative+1=2147483649 (0x80000001)
(unsigned) smallest negative  =2147483648 (0x80000000)

2 Pointers and Memory References

2.1 Pointers Basic

Pointers (or memory references) are crucial for C programs. Some terminology for the syntax of pointers:

int * p : declaring an integer pointer p.
The value or pointer value of p is a memory reference, an address of a memory value.
*p : de-referencing means to follow the pointer to the memory referenced by p and change the memory value there.
&a : address of the variable a, which is a pointer value since it a memory reference.

The best way to get a feeling for this is to see it in action:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv[]){

  int a = 10;
  int b = 20;
  int * p = &a; //MARK 1

  *p = b; //MARK 2

  p = &b;
  b = 50; //MARK 3

  printf("a=%d &a=%p\n",a,&a);
  printf("b=%d &b=%p\n",b,&b);
  printf("p=%p &p=%p *p=%d\n",p,&p,*p);

  return 0;
}

We can analyze this using a memory diagram for each of the marks. We use arrows to indicate a memory reference.

 MARK 1            Mark 2           Mark 3

.----.----.       .----.----.      .----.----.
| a  | 10 |<-.    | a  | 20 |<-.   | a  | 20 |
|----+----|  |    |----+----|  |   |----+----|
| b  | 20 |  |    | b  | 20 |  |   | b  | 50 |<-.
|----+----|  |    |----+----|  |   |----+----|  |
| p  |  --+--'    | p  |  --+--'   | p  |  --+--'
'----'----'       '----'----'      '----'----'

And, we can see the last mark is the case when we run the program.

user@si485H-base:demo$ ./reference 
a=20 &a=0xbfd00ebc
b=50 &b=0xbfd00eb8
p=0xbfd00eb8 &p=0xbfd00eb4 *p=50

Notice, though, that the pointer values or memory references are really just numbers. Really, the way we should model this diagram is with the full references and values, for the end state:

     address     value
  .------------.------------.
a | 0xbfaebd9c |  20        |
  |------------+------------|
b | 0xbfd00eb8 |  50        | <-.
  |------------+------------|   |
p | 0xbfd00eb4 | 0xbfd00eb8 | --'
  '------------'------------'

2.2 Randomization of Memory

One thing that will trip you up is that on most modern linux install, each run of the program will randomize the address space. The reason for this will be clear later, but the implications is that when your run the program again, you'll get different values.

user@si485H-base:demo$ ./reference 
a=20 &a=0xbfaebd9c
b=20 &b=0xbfaebd98
p=0xbfaebd98 &p=0xbfaebd94 *p=50

To ensure that this will not be the case, you'll have to turn this feature off. Here's how to do that.

user@si485H-base:demo$ echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

Now, when you run the program, you'll get the same output ever time.

user@si485H-base:demo$ ./reference 
a=20 &a=0xbffff6bc
b=20 &b=0xbffff6b8
p=0xbffff6b8 &p=0xbffff6b4 *p=50
user@si485H-base:demo$ ./reference 
a=20 &a=0xbffff6bc
b=20 &b=0xbffff6b8
p=0xbffff6b8 &p=0xbffff6b4 *p=50
user@si485H-base:demo$

3 Arrays and Strings

3.1 Array Values and Pointers are the same thing!

Here is a fact: pointers and array values are the same thing. Hold this truth to be self evident, and you will never be lost in the dark forest of memory …

Why is this truth true? Well it has to do with the definition of an array. An array is a contiugous memory block of a sequence of similarly typed data types. An array is a reference to the first data item in the contiguous memory block, and thus an array's value references memory. That means an array value is a pointer. Boom!

If that is not so clear, let's look at an example.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv[]){

  int a[] = {10,11,12,13,14};

  int * p = a;

  p[2] = 5;

  printf("a=%p p=%p\n",a,p);

  int i;
  for(i=0;i<5;i++){
    printf("a[%d] = %d\n",i,a[i]);
  }

  return 0;
}

Here, an array a is declared with 5 values, and we allow the pointer p to hold the same value as a. Wait! Notice, that we did not use & to set the value of p. This is a straight assignment, so the value in a is already a memory reference to an integer because that is the kind of data that p stores.

Next, notice that this line:

p[2] = 5;

We are using the array index operator [ ] with p, so in a very real sense, we are treating p as an array. And, the operation does as we would expect in the output.

user@si485H-base:demo$ ./arrays 
a=0xbffff6b4 p=0xbffff6b4
a[0] = 10
a[1] = 11
a[2] = 5
a[3] = 13
a[4] = 14

This begs the question: What is the array index operator anyway? It's a special deference that adds the index to the base. For example:

p[i]  <-- same as --->   *(p+i)

You can see this to be true in the following example:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv[]){

  int a[] = {10,11,12,13,14};

  int * p = a;

  *(p+2) = 5;

  printf("a=%p p=%p\n",a,p);

  int i;
  for(i=0;i<5;i++){
    printf("a+%d=%p *(a+%d) = %d\n",i,a+i,i,a[i]);
  }

  return 0;
}

user@si485H-base:demo$ ./p_arrays 
a=0xbffff6b4 p=0xbffff6b4
a+0=0xbffff6b4 *(a+0) = 10
a+1=0xbffff6b8 *(a+1) = 11
a+2=0xbffff6bc *(a+2) = 5
a+3=0xbffff6c0 *(a+3) = 13
a+4=0xbffff6c4 *(a+4) = 14

Ok, so if array values and pointers the same, why do we have array values? Well, I lied, just a bit. Array values and pointers are the same in the sense that they are both memory references, but the declaration of an array and the declaration of a pointer are not the same. When you declare an array, you are declare a region of memory all at once, and you must be able to always reference that memory, otherwise you'll have a memory leak. That is, an array value is immutable or constant — it cannot change.

When you declare a pointer, on the other hand, you are creating a variable whose value reference memory, but the memory it references can change.

3.2 Pointer Arithmetic

Let's take a closer look at the last output:

user@si485H-base:demo$ ./p_arrays 
a=0xbffff6b4 p=0xbffff6b4
a+0=0xbffff6b4 *(a+0) = 10
a+1=0xbffff6b8 *(a+1) = 11
a+2=0xbffff6bc *(a+2) = 5
a+3=0xbffff6c0 *(a+3) = 13
a+4=0xbffff6c4 *(a+4) = 14

Notice the memory reference on each incrimination. While we are adding just 1 to the value a, the resulting memory reference is changing by 4. That is because each integer is four bytes wide, so the computer has to shift each array index reference by the size of the data types stored within the array.

Consider the same style program for different types, and this fact becomes apparent:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv[]){

  int a[] = {10,11,12,13,14};
  short s[] = {10,11,12,13,14};
  char c[] =  {10,11,12,13,14};

  int i;
  for(i=0;i<5;i++){
    printf("a+%d=%p *(a+%d) = %d\n",i,a+i,i,a[i]);
  }

  printf("\n");

  for(i=0;i<5;i++){
    printf("s+%d=%p *(s+%d) = %d\n",i,s+i,i,s[i]);
  }

  printf("\n");

  for(i=0;i<5;i++){
    printf("c+%d=%p *(c+%d) = %d\n",i,c+i,i,c[i]);
  }

  return 0;
}

user@si485H-base:demo$ ./pointer_arithemtic 
a+0=0xbffff6a8 *(a+0) = 10
a+1=0xbffff6ac *(a+1) = 11
a+2=0xbffff6b0 *(a+2) = 12
a+3=0xbffff6b4 *(a+3) = 13
a+4=0xbffff6b8 *(a+4) = 14

s+0=0xbffff69e *(s+0) = 10
s+1=0xbffff6a0 *(s+1) = 11
s+2=0xbffff6a2 *(s+2) = 12
s+3=0xbffff6a4 *(s+3) = 13
s+4=0xbffff6a6 *(s+4) = 14

c+0=0xbffff699 *(c+0) = 10
c+1=0xbffff69a *(c+1) = 11
c+2=0xbffff69b *(c+2) = 12
c+3=0xbffff69c *(c+3) = 13
c+4=0xbffff69d *(c+4) = 14

3.3 Strings

C-strings(!!!!) the bane of student programmers world wide, but they are not really that bad if you remember that string is simply an array of characters that is null terminated. In its most pedantic usage, we can declare a string as an array in the standard sense.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv[]){

  char str[] = {'H','e','l','l','o',' ','W','o','r','l','d','!','\n','\0'};

  printf("%s",str);

  return 0;
}

However, this is a huge burden, so in C, we can use the double-quote mark to declare strings which automatically creates the array and null terminates. As below:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv[]){

  char str[] = "Hello World!\n";

  printf("%s",str);

  return 0;
}

Notice, that when we declare a string with double-quotes, we still have to declare the variable that references the string as an array type (or a pointer type) that's because it is still an array.

And, like arrays of integer, we can still reference individual items in the array using pointer iteration. It is also typical with strings to leverage the fact that it is NULL terminated. NULL is just a pseudonym for 0, which is a false value.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv[]){

  char str[] = "Hello World!\n";
  char * c;

  for( c = str;*c; c++){
    putchar(*c);
  }

  return 0;
}

4 Program Memory Layout

4.1 Cowboys and Endian-s

If you think about it, there are two very different fundamental ways to organize bytes with meaning. As western language thinkers and learning, we assign meaning from left to right. For example, if I wrote down the number:

210500

That is the number two-hundred-and-ten thousand, five hundred. It is not the number five thousand and twelve (reading the number right to left).

Where the most-significant byte is represented is referred to as endian-ness in computer science. By far the most prevalent representation is little endian, by which the least significant bytes come first. This is in opposition to big endian, which is how we think of things, where the most significant byte comes first.

The impact of this becomes clear with a simple program below.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv){

  int a = 0xdeadbeef;

  //treat the integer like a array of one-byte values
  char * p = (char *) &a;

  int i;
  for(i=0;i<4;i++){
    printf("p[%d] = 0x%hhx\n",i,p[i]);
  }

  return 0;

}

Here, we treat the integer a as a buffer of 4-bytes. If we write those bytes out indexed from 0 to 3, we see that the number 0xdeadbeef is written out in reveres, 0xef, 0xbe, 0xad, 0xde.

user@si485H-base:demo$ ./endian 
p[0] = 0xef
p[1] = 0xbe
p[2] = 0xad
p[3] = 0xde

4.2 Stack, Heap, Data, BSS

Now that we have a decent sense of how we interact with memory, let's spend some time looking at the memory address layout. The big question is, when we declare a variable/value/data, where is that data located?

To start, we need to remind ourselves of the virtual address space, which allows a single program to treat the entire available memory space in an ordered way, from high to low addresses. In reality, the memory is stored in physical RAM in random locations, but from the program's perspective, it's all lined up.

When a program is loaded into memory, there are different regions for different kinds of data. A key concept, though, is that the program's data and its code all reside in memory together, which is what we will leverage when we exploit programs later.

Here are the general areas of the program memory layout, from higher address to lower address. Note that this is a 32-bit memory addresses, so there is total of about 4GB in the memory space.

higher address
0xffffffff --> .----------------.
               |    reserved    |  <-- command line args
               +----------------+      envirment variables
               |                |
               |     stack      |  <-- user stack, function frames
               |       |        |
               :       |        :
               '       v        '
                                   <-- mapped data
               .       ^        .
               :       |        :
               |       |        |
               |     heap       |  <-- user heap, dynamic memory
               +----------------+
               |      bss       |  <-- global memory 
               +----------------+
               |     text       |  <-- code segments
0x00000000 --> '----------------'
lower address

Here's what each of these sections are used for:

reserved: the reserved space is used for passing enviroment variables and command line arguments to the program.
stack: the stack is for organizing the execution of the program into stack frames for tracing functions and local variables. Each function call pushes a stack fram. from on the stack, and each return pops off a stack frame. The stack grows towards lower addresses, into empty memory address space.
heap : the heap is for dynamic, global memory allocations, such as called from malloc()
bss : the bss is used to store gloabl or stacically declared vaues
text : is where the program code, i.e., the x86 instructions, is stored.

We can see each of these memory locations in the following program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void foo(void) { return; }

int main(int argc, char *argp[], char *envp[]){

  int a = 10;

  char stack_str[] = "Hello";

  char * heap_str = malloc(strlen(stack_str)+1);
  strcpy(heap_str,stack_str);

  char * bss_str = "World";



  printf("(reserved)   evnp = 0x%08x \n", envp);
  printf("(stack)        &a = 0x%08x \n", &a);
  printf("(stack) stack_str = 0x%08x \n", stack_str);
  printf("(heap)   heap_str = 0x%08x \n", heap_str);
  printf("(bss)     bss_str = 0x%08x \n", bss_str);
  printf("(text)       main = 0x%08x \n", main);
  printf("(text)        foo = 0x%08x \n", foo);
}

user@si485H-base:demo$ ./mem_layout 
(reserved)   evnp = 0xbffff77c 
(stack)        &a = 0xbffff6c4 
(stack) stack_str = 0xbffff6be 
(heap)   heap_str = 0x0804b008 
(bss)     bss_str = 0x08048630 
(text)       main = 0x080484b3 
(text)        foo = 0x080484ad