Quickie Dev #2: Lists, sets, dicts, tuples… in Python

Because choosing the right data structure can save your program!


When you want to store a collection of data in Python, you can pick one of several built-in iterable types (the docs classify them as “sequence”, “set” and “mapping” types). Most of us know about lists and dicts, but there are others that can prove very efficient in specific situations.

It’s important to know the strengths and weaknesses of these common data types because choosing the proper data structure at the right time can drastically alter (for better or for worse) the performance of your application!

Lists, sets, tuples, dicts: what are they?

I’ll dive more into each of these data types and their properties in the following sections, but as a quick intro (a short syntax recap follows the list):

  • lists are mutable sequence types denoted by square brackets [] and they store contiguous elements (potentially of different types) ordered by integer indices
  • sets are denoted by curly braces {} and they’re an unordered collection of unique hashable elements – you may add or remove elements from a set
  • a frozenset is almost identical, but once you’ve created it you can’t add or remove elements anymore; it is immutable and thus hashable
  • tuples are sequences of ordered elements denoted by parentheses (): the elements are keyed by an integer and they can be of different types, and the resulting sequence is immutable (and hashable, as long as all of its elements are hashable)
  • (named tuples are an extension of tuples where you can also access elements by an attribute instead of the usual integer key)
  • dicts (or dictionaries) are the Python mapping type: they are an efficient way of creating and accessing key-value pairs (where the key is any hashable object and the value is anything you want); note that dicts officially preserve insertion order only since Python 3.7, so you should not rely on key order in older versions
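To make all of this concrete, here is the literal syntax for each type, with one classic gotcha: empty curly braces create a dict, not a set (use set() for an empty set):

> L = [1, 2, 3]           # list
> S = {1, 2, 3}           # set
> FS = frozenset([1, 2])  # frozenset (no literal syntax: built from an iterable)
> T = (1, 2, 3)           # tuple
> D = {'a': 1, 'b': 2}    # dict
> type({})
<class 'dict'>
> type(set())
<class 'set'>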

What properties can they have?

When you choose between data structures, you’re usually primarily interested in one or more of the following properties.

Uniqueness

Let’s start simple with an intuitive property: element uniqueness. Some iterable types ensure that all the elements they contain appear only once (anytime you try to re-add a value that is already present, it is simply ignored). This is the case of Python sets and, to some extent, dicts.

Sets are mathematical objects with a whole theory of their own. A set is an unordered collection of distinct elements, and sets are specifically designed to compute the union, difference or intersection of two collections.

The 4 basic set operations (union, intersection, difference and symmetric difference). Image from: https://www.datacamp.com/community/tutorials/sets-in-python

The picture above describes the 4 basic operations you can do with sets and shows their associated Venn diagrams. These operations are easily called in Python:

> A = {1, 2, 3}
> B = {3, 4, 5}
> A.union(B)
{1, 2, 3, 4, 5}
> A.intersection(B)
{3}
> A.difference(B)
{1, 2}
> B.difference(A)
{4, 5}
> A.symmetric_difference(B)
{1, 2, 4, 5}

We can even check whether two sets have an empty intersection with isdisjoint(), and there are operator shortcuts for all of these operations, as explained in this Datacamp tutorial.
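For reference, here are those operator shortcuts in action, plus isdisjoint() (which returns True only when the two sets share no element):

> A = {1, 2, 3}
> B = {3, 4, 5}
> A | B   # union
{1, 2, 3, 4, 5}
> A & B   # intersection
{3}
> A - B   # difference
{1, 2}
> A ^ B   # symmetric difference
{1, 2, 4, 5}
> A.isdisjoint(B)
False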

Since they only keep one copy of each value, sets are a neat way to remove duplicates from a list, provided you don’t care about the order of the elements:

> L = [1, 1, 4, 0, 4, 6, 3]
> S = set(L)
> S
{0, 1, 3, 4, 6}
> L2 = list(S)
> L2
[0, 1, 3, 4, 6]

And of course, sets are very efficient if you need to check whether two lists of words have any entry in common, for example!
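For instance, finding the words that two lists share boils down to a single intersection once they’ve been converted to sets:

> words1 = ['cat', 'dog', 'bird']
> words2 = ['fish', 'dog', 'hamster']
> set(words1) & set(words2)
{'dog'}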

Dicts somewhat share this uniqueness property too, since you cannot have two different values for the same key: if the key is already present in the dict and you set a new value for it, the first value is lost. This can be useful if you specifically want to perform memoization and overwrite values (for example when computing math sequences like Van Eck’s, as shown below). However, dicts are not optimized for set operations and are therefore not the best tool to get the elements common to two collections.
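As an illustration of this overwrite-based memoization, here is a minimal sketch that computes Van Eck’s sequence with a dict mapping each value to the index of its latest occurrence (the function and variable names are mine):

def van_eck(n):
    """Return the first n terms of Van Eck's sequence."""
    seq = [0]
    last_seen = {}  # value -> index of its previous occurrence
    for i in range(n - 1):
        current = seq[-1]
        # if we've seen this value before, the next term is the gap
        # between its two latest occurrences; otherwise it's 0
        seq.append(i - last_seen[current] if current in last_seen else 0)
        last_seen[current] = i  # overwrite: only the latest index matters
    return seq

print(van_eck(10))  # [0, 0, 1, 0, 2, 0, 2, 2, 1, 6]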

Order

Another easy-to-understand property of iterable types is whether or not they are ordered. Ordered collections record the order of insertion when you add elements, so that you can later:

  • access an element by a specific integer key (the index of the element in the collection)
  • slice the collection and extract just a part of it

By default, lists and tuples are ordered, so you can access their elements using integer indices. They are 0-indexed (meaning the first element has index 0) and support negative indexing to access elements starting from the end of the collection. For example:

> L = [1, 2, 3]
> L[0]
1
> L[-1]
3
> T = (4, 5, 6)
> T[-2]
5
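They also support slicing with the start:stop (and optional step) syntax, where the element at the stop index is excluded:

> L = [1, 2, 3, 4, 5]
> L[1:3]
[2, 3]
> L[::2]
[1, 3, 5]
> T = (4, 5, 6)
> T[:2]
(4, 5)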

Tuples can be extended using named tuples – this allows you to access elements using a custom attribute:

> from collections import namedtuple
> Point = namedtuple('Point', ['x', 'y'])
> p = Point(1, 2)
> p[0]
1
> p.x
1

Be extra careful with dicts: since Python 3.7, they officially preserve insertion order, but in older versions this was at best an implementation detail (CPython 3.6) and at worst simply not true. So if your code may run on an older interpreter, you should not assume the keys are ordered!

Note: I’m only focusing on built-in Python types here, but Python’s collections module is full of interesting additional data structures, such as the OrderedDict that guarantees the keys are walked through in the same order you originally added them, even on pre-3.7 interpreters.
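Besides the order guarantee, OrderedDict also offers a few order-aware operations that plain dicts lack, like move_to_end():

> from collections import OrderedDict
> d = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
> d.move_to_end('a')
> list(d.keys())
['b', 'c', 'a']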

Mutability (and hashable types)

If a variable is mutable, you can update its value after it’s been created; otherwise, it keeps the value it had upon creation for the rest of its life. For example, a list is mutable because you can add or remove elements whenever you want, and you can even change one particular element after you’ve put it in the list. On the other hand, tuples are immutable: you decide the value of your tuple when you first create it, and it then stays the same ad vitam aeternam.
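A quick demonstration of the difference:

> L = [1, 2, 3]
> L[0] = 10   # fine: lists are mutable
> L
[10, 2, 3]
> T = (1, 2, 3)
> T[0] = 10   # forbidden: tuples are immutable
TypeError: 'tuple' object does not support item assignment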

You may think that immutable types are useless because they only bring constraints. But that’s far from the truth! The big advantage of immutable types is that they can be associated with a compact fingerprint, called a hash. Immutable objects (whose contents are themselves hashable) are therefore said to be hashable, which means they can be used as keys for mapping types like dicts.
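In practice, this is why a tuple works as a dict key while a list does not:

> D = {(0, 1): 'corner'}
> D[(0, 1)]
'corner'
> D[[0, 1]] = 'nope'
TypeError: unhashable type: 'list'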

Memory consumption

Now, why would you care whether you use a list or a dict? It’s often about the space your collection takes up in memory.

As a general rule of thumb:

  • taking less memory is better
  • having a scalable data storage is better

Scalability is the ability of a system to adapt and handle a heavier workload when you give it more resources. In our case, it’s about having a program that still works if we give it more input data.

A very common example is when you’re programming little video games and you want to store a 2D game map in Python (yes, I’m thinking of roughly half the Codingame problems here ;)). If you need to store every single cell of the map, you can create a flat list where you transform 2D coordinates into 1D coordinates (as sketched below), or you can create a list of lists to keep the 2D structure. But if your map is empty except for a few points of interest, it’s a shame to use so much memory! Instead of lists, you can use a dict with tuples as keys.
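The 2D-to-1D coordinate transform mentioned above is simple index arithmetic; here is a sketch assuming row-major storage (the helper names are mine):

W, H = 12, 5
cells = [0] * (W * H)  # one flat list for the whole W x H grid

def get_cell(x, y):
    # row-major layout: skip y full rows of W cells, then move x cells right
    return cells[y * W + x]

def set_cell(x, y, value):
    cells[y * W + x] = value

set_cell(4, 2, 2)      # place a power-up at (4, 2)
print(get_cell(4, 2))  # 2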

Say you have a map like the one below, where:

  • spaces are empty cells and will be coded with 0
  • Xs are walls and will be coded with 1
  • Ps are power-ups and will be coded with 2

XXXXXXXXXXXX
X          X
X   P      X
X          X
XXXXXXXXXXXX

The naive list-based approach would make for the following data (with 60 elements):

[
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
  [1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1],
  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
]

Whereas the dict-based one gives us (with 31 elements):

{
  # top border
  (0, 0): 1, (1, 0): 1, (2, 0): 1, (3, 0): 1, (4, 0): 1, (5, 0): 1,
  (6, 0): 1, (7, 0): 1, (8, 0): 1, (9, 0): 1, (10, 0): 1, (11, 0): 1,
  # left border
  (0, 1): 1, (0, 2): 1, (0, 3): 1,
  # right border
  (11, 1): 1, (11, 2): 1, (11, 3): 1,
  # power up
  (4, 2): 2,
  # bottom border
  (0, 4): 1, (1, 4): 1, (2, 4): 1, (3, 4): 1, (4, 4): 1, (5, 4): 1,
  (6, 4): 1, (7, 4): 1, (8, 4): 1, (9, 4): 1, (10, 4): 1, (11, 4): 1,
}
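Rather than typing this dict by hand, we can of course build it directly from the ASCII map; a minimal sketch (the parse_map function and CODES mapping are my own naming):

CODES = {'X': 1, 'P': 2}  # spaces are "empty" and simply get no entry

def parse_map(ascii_map):
    grid = {}
    for y, row in enumerate(ascii_map.splitlines()):
        for x, char in enumerate(row):
            if char in CODES:
                grid[(x, y)] = CODES[char]
    return grid

raw = (
    "XXXXXXXXXXXX\n"
    "X          X\n"
    "X   P      X\n"
    "X          X\n"
    "XXXXXXXXXXXX"
)
grid = parse_map(raw)
print(len(grid))            # 31 non-empty cells
print(grid.get((4, 2)))     # 2: the power-up
print(grid.get((5, 2), 0))  # 0: missing keys are empty cells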

As you can see, the “list approach” always requires W x H cells, where W is the width of the map and H is its height: no matter how many cells are empty, we get this whole block of zeros in the middle. On the other hand, with our dict, we only have as many entries as there are non-empty cells. In our small map, the difference isn’t huge (though we’ve already almost halved the number of stored elements). But think of a map that is way larger and completely empty in the middle, like this other one:

XXXXXXXXXXXXXXX
X             X
X             X
X             X
X             X
X             X
X             X
X             X
X             X
XXXXXXXXXXXXXXX

Now, the “dict approach” is so much better! It only requires us to store 46 elements (for the border walls), instead of 150 cells with the “list approach”. We could reduce this even further with smart compression ideas; for example, we could store “blocks” of contiguous cells of the same type in a single entry by using 4-tuples representing rectangles: (min x, min y, max x, max y), as sketched below.
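Here is a minimal sketch of that rectangle idea applied to the first map (my own encoding, not a standard technique): each entry covers a whole block of identical cells, and lookups scan the blocks:

blocks = {
    # (min x, min y, max x, max y): cell type
    (0, 0, 11, 0): 1,   # top border
    (0, 4, 11, 4): 1,   # bottom border
    (0, 1, 0, 3): 1,    # left border
    (11, 1, 11, 3): 1,  # right border
    (4, 2, 4, 2): 2,    # the power-up, a 1x1 "rectangle"
}

def cell_at(x, y):
    for (x0, y0, x1, y1), code in blocks.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return code
    return 0  # anything not covered by a block is an empty cell

print(cell_at(4, 2))  # 2
print(cell_at(5, 2))  # 0

With this encoding, the first map shrinks from 31 dict entries down to 5.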

Note: if you’re interested in techniques for storing only “relevant” nD data, take a look at how to store sparse matrices!

To sum up

Here is a quick table to sum up the various properties of the data structures I talked about in this article:

Type       | Ordered | Unique    | Mutable | Keyed by            | Usual applications
-----------|---------|-----------|---------|---------------------|--------------------------------------------
List       | Y       | N         | Y       | integer             | sorted storage, mutations
Tuple      | Y       | N         | N       | integer / attribute | data grouping, sequence hashing
Set        | N       | Y         | Y       | –                   | de-duplication, set operations
Frozenset  | N       | Y         | N       | –                   | de-duplication, set operations, set hashing
Dict       | N       | ~Y (keys) | Y       | hashable type       | sparse storage, memoization

Conclusion

Picking the right data type is important: it determines how much memory and computing power your program needs, and it may even be the reason your application works or not. A well-known mantra whenever you work on a new project is: “Make it work, make it right, make it fast”. It’s a nice phrase that mirrors the 3 stages lots of your applications will probably go through:

  • stage 1: when you click “compile” (or “run”), it at least compiles (or runs)…
  • stage 2: now, it even computes the correct results, yay!
  • stage 3: amazing, it didn’t take the whole weekend but only 2 minutes!!

Many times, you’ll want to stop at stage 2. But stage 3 is important because a piece of code that computes the right thing but can’t actually run in a reasonable amount of time loses much of its value. There is even an entire field of mathematics devoted to classifying problems by the time and memory their solutions require, called computational complexity theory, which is mostly famous for its P versus NP problem. Plenty of coding competitions and training exercises (like the Codingame puzzles I mentioned earlier) force you to take this additional step by presenting you with test samples that gradually scale up in size, so a non-optimized program will struggle or even crash before it can complete the task on the larger datasets. And sometimes, the solution is not in the algorithm itself but in the way you store and access your data…
