Collections - Guides - Resources

From a programmer's perspective, taking care of memory and time are the most important issues. Computers have limited memory and accessing it has a computational cost.

The first step I always do before I start programming is to think about the problem and the data structure. To define the data structure well, it is necessary to know what will be the way to access the data. For example, how to iterate over the elements, access an element or insert elements. Also, it is important to know the relationship between the elements: are they unique, do they aggregates, or is there an order?

The Python language hides the implementation of the data structures that it provides. This relaxes the importance that the Python programmers must place on defining their data structures. Obviously, this is not a Python problem that we need to fix. Python is a great language and we just need to learn how it works in order to take advantage of its behavior.

To explain these data structures, I will base this text on the Java language. I studied for the Java Certification (SCJP and SCWCD) and I realized that the Java Collection Framework (JCF) is a well defined API. It defines a set of interfaces related to the different collections that a programmer can use: list, set and map (In this case "map" is not a function. In the 80's the programmers call "map" to the structure that relates a key with a value, nowadays it's usually called: dictionary. For example, in C++ this structure is called "std::map").

List: collection of elements that can be repeated in the collection and ordered by position (the programmer defines the position: this is the first, this is the second, and so on)
Set: collection of elements that cannot be repeated and aren't ordered (See below if you want to have the elements ordered).
Map: collection of key-value elements. In fact, this collection is a kind of Set, where the key is the element to insert into, and the value is the related data.

In Java, only List and Set inherit from the Collection interface. The problem is that Map has a relation between the key and the value, and the programmer cannot add an element into a Map using the method "add", which is defined in the Collection interface. The programmer must use the method "put", which has two parameters: the key and the value.

Collection, List, Set, and Map are interfaces in Java. The programmer needs to create instances of classes, and the Framework defines a bunch of classes to use the different collection types depending of the algorithm:

ArrayList: class that implements List using an array (continuous memory to store all the elements). It's fast indexing elements, but slow adding elements between others or if it needs more memory.
LinkedList: class that implements List using a linked list (a bunch of nodes containing an element and a reference to the next node). It's slow indexing elements, but fast adding new elements between others.
Stack: class that extends Vector, which is close similar to the ArrayList class. In this class, the programmer can push and pull elements into the collection.
HashSet: class that implements Set using a Hash table, which is an array where the element index is defined using a hash function (function that returns an integer from a bunch of bytes).
TreeSet: class that implements Set using a binary tree. In this case, the elements are ordered using the "Natural ordering" of its keys, which it is, for example, the alphabetical ordering for Strings, ascending for numbers...
HashMap and TreeMap: classes that implement Map and have the same functionality that the two previous one, but storing the value related to the key.

At this point, we can define this simple table:

Class	Can repeat an element	Ordered by position	Natural ordering
ArrayList	YES	YES	NO
LinkedList	YES	YES	NO
Stack	YES	YES	NO
HashSet	NO	NO	NO
TreeSet	NO	NO	YES
HashMap	Key: NO, Value: YES	NO	NO
TreeMap	Key: NO, Value: YES	NO	Key: YES, Value: NO

In Python, there are existing classes close similar to Java and we can create a relationship between them:

ArrayList: list
LinkedList: deque
Stack: deque
HashSet: set
TreeSet: it doesn't exist, but there is an another class, collections.OrderedSet, that orders by position.
HashMap: dict
TreeMap: it doesn't exist too, but there is the class collections.OrderedMap which orders the keys by position.

So, if you want to have a bunch of elements ordered by position you can use a list in Python. If you want that there are not duplicated elements, you can use a set. If you want both behaviors, you must program it by yourself, because neither list and set provides you this functionality (order in a set is not guaranteed!).

The breadth-first search algorithm is an useful example, where the main goal is find the shortest path between two nodes in a graph. In this case, the algorithm receives a graph, the start node (root) and the end node, and returns the shortest path between these nodes.

The algorithm uses two main structures: (i) "visited" that is a collection of nodes where the code just pass throw before and (ii) "seeds" that is a collection of nodes where the code reached, but needs to explore its children.

The next code uses a list in the structures "visited" and "seeds":

def find_path_1(g, f, t):
    if f == t:
        return [f, t]
    visited = []
    seeds = [(f, [])]
    while len(seeds):
        current, path = seeds.pop(0)
        if current == t:
            return path + [t]
        if current not in visited:
            visited.append(current)
            for _, next in g.edges(current):
                seeds.append((next, path + [current]))

The code uses "visited" to check if the node has been visited and avoid loops. Currently, the "list" checks if it contains a certain value while iterating over the elements. This has a linear cost. In order to improve this code, the type of this collection can be changed from list to a set, which has a logarithmic cost:

def find_path_2(g, f, t):
    if f == t:
        return [f, t]
    visited = set()
    seeds = [(f, [])]
    while len(seeds):
        current, path = seeds.pop(0)
        if current == t:
            return path + [t]
        if current not in visited:
            visited.add(current)
            for _, next in g.edges(current):
                seeds.append((next, path + [current]))

Another point is that the "seeds" structure acts as a queue. The next node to explorer is always the first in the list and its children will be appended at the end of the list. In this case, the list is not a good option because the cost of remove elements at the beginning of the list is high. In the next code, I changed the list by a deque:

def find_path_3(g, f, t):
    if f == t:
        return [f, t]
    visited = set()
    seeds = deque()
    seeds.append((f, []))
    while len(seeds):
        current, path = seeds.popleft()
        if current == t:
            return path + [t]
        if current not in visited:
            visited.add(current)
            for _, next in g.edges(current):
                seeds.append((next, path + [current]))

The test code is:

import networkx as nx
a = nx.barabasi_albert_graph(20000, 30)
find_path_1(a, 10, 9000)
find_path_2(a, 10, 9000)
find_path_3(a, 10, 9000)

And the different time code cost is:

find_path_1: 4.55 seconds
find_path_2: 3.5 seconds
find_path_3: 2.6 seconds

The three codes are so similar, but the third is running 40 % faster than the first!