Why Do Arrays Start at Zero?

Every programming beginner has a moment, while learning about the array data structure, where they find themselves wondering: why? I don't count anything else from zero. Let's talk about the historical reasoning behind this.

Prerequisites

This article only assumes that you are familiar with the idea of an array in any programming language.

If you program in a lower-level language, you likely already understand the reasoning and this article is probably not for you, but I'm glad you stopped by nonetheless!

What is an Array?

An array as defined by Merriam-Webster is "a data structure in which similar elements of data are arranged in a table". As higher-level languages like PHP, Ruby, Python, and JavaScript hit the scene, the lines became blurred, with terms like array, list, and collection all being used interchangeably.

For the instant-gratification folks, the short and sweet answer is that arrays start at 0 because of pointer arithmetic. What does that mean? Let's first draw a comparison to arrays in higher-level languages.

Arrays in High-Level Languages

Using PHP as an example, an array is written like this:

$arr = ["Jody", "Johnny", "Jenny", "Jessie"];

This looks more or less like an "array" does in pretty much any other language: a collection of names grouped into a single unit. The size of the array is known, and all entries share a single data type, string. However, this is also valid:

$arr = [1, "Jody", false];

This breaks one of the rules of arrays by mixing data types. The following, which PHP calls an associative array, also works.

$arr = [
    "name" => "Jody",
    "age" => 32,
];

Now we are not only mixing data types, but we have a key-value relationship like what JavaScript would call an object, Python would call a dictionary, and Java would call a hash map. This is because under the hood, an array in PHP is actually an implementation of an ordered map.

Arrays in Low-Level Languages

To contrast, let's look at the idea of an array in a lower-level language like C.

int ages[] = {32, 36, 41, 42, 45};

The syntax is almost identical, but vastly different behavior is occurring here. The array can only contain a single data type as dictated by its definition, in this case integers. The reason is that in a true array, the items are stored in contiguous units of memory. Imagine a table where each column represents 4 bytes of memory, the size of a 32-bit integer.
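To make that concrete, here is a quick sketch, assuming a typical platform where an int is 4 bytes: it prints the address of each element of ages, and the addresses come out exactly 4 bytes apart.

#include <stdio.h>

int main(void)
{
    int ages[] = {32, 36, 41, 42, 45};
    size_t count = sizeof(ages) / sizeof(ages[0]);

    /* Each element sits sizeof(int) bytes after the previous one. */
    for (size_t i = 0; i < count; i++) {
        printf("ages[%zu] = %d at %p\n", i, ages[i], (void *)&ages[i]);
    }

    return 0;
}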

This is also why the size of an array must be known at the point of initialization: if you create an array sized for five items and later want to add five more, there is no guarantee that five more blocks of memory are free right after the original allocation.
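As a quick sketch of that problem: a fixed array like ages cannot grow at all, so growing a collection in C means allocating heap memory and asking for a bigger block later, and that bigger block may end up at a completely different starting address.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Allocate room for five integers on the heap. */
    int *ages = malloc(5 * sizeof(int));
    if (ages == NULL) return 1;

    printf("original block starts at %p\n", (void *)ages);

    /* Ask for room for ten integers. realloc may have to move the
       data to a new region if the memory right after the original
       allocation is already in use. */
    int *bigger = realloc(ages, 10 * sizeof(int));
    if (bigger == NULL) {
        free(ages);
        return 1;
    }

    printf("resized block starts at %p\n", (void *)bigger);

    free(bigger);
    return 0;
}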

What makes contiguous storage so significant? It comes down to how C and similar languages allocate memory and then reference the values stored in that memory.

Pointers and Pointer Arithmetic

If you have not written in a language with lower-level memory controls, you may be unfamiliar with pointers. Just think of a pointer as a label for a memory address. Referencing the code above, ages is not a single value like a boolean or an integer saved to a variable, but rather, effectively, a pointer to a specific location in memory. That location is where the table we mentioned earlier lives.
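Here's a tiny sketch of that idea: printing ages and the address of its first element gives the same memory address, because the array name decays to a pointer to element zero.

#include <stdio.h>

int main(void)
{
    int ages[] = {32, 36, 41, 42, 45};

    /* The array name decays to a pointer to the first element,
       so both lines print the same address. */
    printf("ages     = %p\n", (void *)ages);
    printf("&ages[0] = %p\n", (void *)&ages[0]);

    return 0;
}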

The way pointer arithmetic works is that you take the value of a pointer, a memory address, and add the size of the data type (4 bytes, or 32 bits, for an integer in this case), which gives you the memory address of the next item in the array. C and similar low-level languages handle this for you behind the scenes when you look up an array item by its index, though you may see it done manually in other use cases.
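Here's that arithmetic written out by hand: ages[2] and *(ages + 2) refer to the same element, and the compiler scales the offset by sizeof(int) when computing the address.

#include <stdio.h>

int main(void)
{
    int ages[] = {32, 36, 41, 42, 45};

    /* Indexing is defined in terms of pointer arithmetic:
       ages[2] is literally *(ages + 2). */
    printf("ages[2]     = %d\n", ages[2]);
    printf("*(ages + 2) = %d\n", *(ages + 2));

    /* The offset is in elements, not bytes; the compiler multiplies
       by sizeof(int) when computing the address. */
    printf("address of ages[2]: %p\n", (void *)(ages + 2));

    return 0;
}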

Conclusion

This is the reason that arrays, in most languages, start at zero: you aren't actually dealing with an index like a numbered list of items on a to-do list, but with a memory offset. So arr[2] isn't saying "give me the second item in the array"; it's saying give me the item at the memory address:

desired address = starting memory address + (size of data type * 2)
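For example, say the array starts at memory address 1000 and each integer takes 4 bytes: arr[2] resolves to 1000 + (4 * 2) = 1008, while arr[0] resolves to 1000 + (4 * 0) = 1000, the starting address itself. The first element sits at an offset of zero, which is exactly why its index is zero.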

There are languages like Lua that index "arrays" starting at 1, but for the most part it will always be zero, either because the language uses true arrays under the hood even when the userland "array" is not an array in the strict sense, or simply to match a long-established convention.