Master Python's Dictionaries and Sets Fast(Python Core in Action 4)
Struggling with Data Efficiency in Python? Discover the Power of Dictionaries and Sets!
In the previous lessons, we learned about lists and tuples in Python, understanding their basic operations, and performance comparisons.
In this lesson, we will learn about two other commonly used and useful data structures: dictionaries (dict) and sets. Dictionaries and sets are widely used in Python and are highly optimized for performance, making them very important.
Basics of Dictionaries and Sets
What exactly are dictionaries and sets?
A dictionary is a collection of key-value pairs. In Python 3.7+, dictionaries are ordered (note: in 3.6, the ordering was an implementation detail, but it became a language feature in 3.7; thus, you cannot guarantee ordering in 3.6). Dictionaries are mutable, allowing the addition, deletion, and modification of elements.
Compared to lists and tuples, dictionaries perform better, especially for lookups, additions, and deletions, which can be done in constant time.
A set is similar to a dictionary but without key-value pairs. It is an unordered collection of unique elements.
Let's look at how to create dictionaries and sets in a few ways:
d1 = {'name': 'jason', 'age': 20, 'gender': 'male'}
d2 = dict({'name': 'jason', 'age': 20, 'gender': 'male'})
d3 = dict([('name', 'jason'), ('age', 20), ('gender', 'male')])
d4 = dict(name='jason', age=20, gender='male')
d1 == d2 == d3 ==d4
True
s1 = {1, 2, 3}
s2 = set([1, 2, 3])
s1 == s2
True
In Python, dictionaries and sets can have mixed types for keys and values. For example, you can create a set with elements 1, 'hello', and 5.0:
s = {1, 'hello', 5.0}
Accessing elements in a dictionary is done via keys. If the key doesn't exist, an exception is thrown:
d = {'name': 'jason', 'age': 20}
d['name']
'jason'
d['location']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'location'
You can also use the get(key, default) function to return a default value if the key doesn't exist:
d = {'name': 'jason', 'age': 20}
d.get('name')
'jason'
d.get('location', 'null')
'null'
Now, let's discuss accessing elements in a set. Sets do not support indexing because they are fundamentally hash tables, unlike lists. Thus, the following operation will throw an exception:
s = {1, 2, 3}
s[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'set' object does not support indexing
To check if an element is in a dictionary or set, use `value in dict/set`:
s = {1, 2, 3}
1 in s
True
10 in s
False
d = {'name': 'jason', 'age': 20}
'name' in d
True
'location' in d
False
Dictionaries and sets also support adding, deleting, and updating elements:
d = {'name': 'jason', 'age': 20}
d['gender'] = 'male' # Add the key-value pair 'gender': 'male'
d['dob'] = '1999-02-01' # Add the key-value pair 'dob': '1999-02-01'
d
# Output: {'name': 'jason', 'age': 20, 'gender': 'male', 'dob': '1999-02-01'}
d['dob'] = '1998-01-01' # Update the value associated with the key 'dob'
d.pop('dob') # Remove the key-value pair with the key 'dob'
# Output: '1998-01-01'
d
# Output: {'name': 'jason', 'age': 20, 'gender': 'male'}
s = {1, 2, 3}
s.add(4) # Add the element 4 to the set
s
# Output: {1, 2, 3, 4}
s.remove(4) # Remove the element 4 from the set
s
# Output: {1, 2, 3}
Note that the pop() operation on a set removes the last element, but since sets are unordered, you can't predict which element will be removed, so use this operation with caution.
In practice, we often need to sort dictionaries or sets, for example, to get the top 50 values.
For dictionaries, you can sort them by keys or values:
d = {'b': 1, 'a': 2, 'c': 10}
d_sorted_by_key = sorted(d.items(), key=lambda x: x[0]) # Sort dictionary by keys in ascending order
d_sorted_by_value = sorted(d.items(), key=lambda x: x[1]) # Sort dictionary by values in ascending order
d_sorted_by_key
# Output: [('a', 2), ('b', 1), ('c', 10)]
d_sorted_by_value
# Output: [('b', 1), ('a', 2), ('c', 10)]
This returns a list of tuples, where each tuple contains a key and its corresponding value from the original dictionary.
For sets, sorting is similar to lists and tuples. You can use sorted(set), which returns a sorted list:
s = {3, 4, 2, 1}
sorted(s) # Sort the elements of the set in ascending order
# Output: [1, 2, 3, 4]
Performance of Dictionaries and Sets
At the beginning of the article, I mentioned that dictionaries and sets are highly optimized for performance, especially for lookup, addition, and deletion operations.
Now, let's look at their performance in specific scenarios and compare them with lists and other data structures.
For example, in an e-commerce backend, we store each product's ID, name, and price. If we need to find the price of a product given its ID, we can use a list to store these data structures and perform a lookup. The code would be:
def find_product_price(products, product_id):
for id, price in products:
if id == product_id:
return price
return None
products = [
(143121312, 100),
(432314553, 30),
(32421912367, 150)
]
print('The price of product 432314553 is {}'.format(find_product_price(products, 432314553)))
# Output
# The price of product 432314553 is 30
If the list has n elements, the lookup process will have a time complexity of O(n). Even if we sort the list first and use binary search, it would still take O(log n) time complexity, plus O(n log n) for sorting.
However, if we use a dictionary to store this data, lookups become very efficient with a time complexity of O(1). This is because dictionaries are implemented as hash tables, allowing direct access via the key's hash value.
products = {
143121312: 100,
432314553: 30,
32421912367: 150
}
print('The price of product 432314553 is {}'.format(products[432314553]))
# Output
# The price of product 432314553 is 30
Now, let's say we need to find out how many unique prices there are. We can compare using a list and a set.
Using a list, the code would be:
# List version
def find_unique_price_using_list(products):
unique_price_list = []
for _, price in products:
if price not in unique_price_list:
unique_price_list.append(price)
return len(unique_price_list)
products = [
(143121312, 100),
(432314553, 30),
(32421912367, 150),
(937153201, 30)
]
print('Number of unique prices: {}'.format(find_unique_price_using_list(products)))
# Output
# Number of unique prices: 3
If the original list has n elements, in the worst case, the time complexity is O(n^2).
Using a set, the code would be:
# Set version
def find_unique_price_using_set(products):
unique_price_set = set()
for _, price in products:
unique_price_set.add(price)
return len(unique_price_set)
products = [
(143121312, 100),
(432314553, 30),
(32421912367, 150),
(937153201, 30)
]
print('Number of unique prices: {}'.format(find_unique_price_using_set(products)))
# Output
# Number of unique prices: 3
Because sets are highly optimized hash tables, elements cannot be repeated, and adding and lookup operations take O(1) time. Thus, the overall time complexity is O(n).
To give you a better sense of these time complexities, let's look at an example with real data.
The following code initializes a product list with 100,000 elements and measures the time to count unique prices using lists and sets:
import time
id = [x for x in range(0, 100000)]
price = [x for x in range(200000, 300000)]
products = list(zip(id, price))
# Measure time for list version
start_using_list = time.perf_counter()
find_unique_price_using_list(products)
end_using_list = time.perf_counter()
print("Time elapsed using list: {}".format(end_using_list - start_using_list))
# Output
# Time elapsed using list: 41.61519479751587
# Measure time for set version
start_using_set = time.perf_counter()
find_unique_price_using_set(products)
end_using_set = time.perf_counter()
print("Time elapsed using set: {}".format(end_using_set - start_using_set))
# Output
# Time elapsed using set: 0.008238077163696289
As you can see, even with just 100,000 elements, the speed difference is significant.
In large enterprises, backend data often involves billions of elements. Using the wrong data structure can easily crash servers, affecting user experience and causing huge financial losses for the company.
How Dictionaries and Sets Work
We've seen the efficiency of dictionary and set operations through examples and comparisons with lists.
But why are dictionaries and sets so efficient, especially for lookups, insertions, and deletions?
The reason lies in the internal data structure of dictionaries and sets.
Unlike other data structures, both dictionaries and sets use a hash table internally.
For dictionaries, this table stores the hash value, key, and value.
For sets, the table stores only single elements without key-value pairs.
Let's look at the hash table structure in older versions of Python:
--+-------------------------------+
| Hash Key Value
--+-------------------------------+
0 | hash0 key0 value0
--+-------------------------------+
1 | hash1 key1 value1
--+-------------------------------+
2 | hash2 key2 value2
--+-------------------------------+
. | ...
__+_______________________________+
As the hash table grows, it becomes more sparse. For example, consider this dictionary:
{'name': 'mike', 'dob': '1999-01-01', 'gender': 'male'}
It might be stored like this:
entries = [
['--', '--', '--'],
[-230273521, 'dob', '1999-01-01'],
['--', '--', '--'],
['--', '--', '--'],
[1231236123, 'name', 'mike'],
['--', '--', '--'],
[9371539127, 'gender', 'male']
]
This design wastes a lot of storage space.
To improve space utilization, modern hash tables separate the indices and the hash values, keys, and values, like this:
Indices
----------------------------------------------------
None | index | None | None | index | None | index ...
----------------------------------------------------
Entries
--------------------
hash0 key0 value0
---------------------
hash1 key1 value1
---------------------
hash2 key2 value2
---------------------
...
---------------------
So, the previous example would be stored like this in the new hash table structure:
indices = [None, 1, None, None, 0, None, 2]
entries = [
[1231236123, 'name', 'mike'],
[-230273521, 'dob', '1999-01-01'],
[9371539127, 'gender', 'male']
]
This new structure greatly improves space utilization.
Understanding this design, we can see why operations like lookup, insertion, and deletion are so efficient in dictionaries and sets.
Insert Operation
When inserting an element into a dictionary or set, Python first calculates the hash value of the key (`hash(key)`). It then uses a mask (`mask = PyDictMinSize - 1`) to determine the index where this element should go (`index = hash(key) & mask`).
If this index in the hash table is empty, the element is inserted there. If the index is occupied, Python compares the hash values and keys of the two elements.
If both are equal, the element already exists, and if the values differ, it updates the value.
If they are not equal, this is a hash collision, meaning different keys have the same hash value. Python then searches for the next empty slot. Usually, a linear search is performed, checking each subsequent position until an empty one is found.
Python optimizes this search process internally, making it more efficient.
Lookup Operation
The lookup operation is similar to the insert operation. Python uses the hash value to find the expected position. It then compares the hash value and key of the element at that position with the one being searched. If they match, it returns the element; if not, it continues searching until an empty slot is found or an exception is raised.
Delete Operation
For deletions, Python temporarily marks the position with a special value. When the hash table is resized, these elements are removed.
Hash collisions can slow down operations. To maintain efficiency, dictionaries and sets usually ensure at least one-third of the hash table is empty. When the empty space falls below this threshold, Python expands the hash table, reallocating all elements.
Although hash collisions and resizing slow down operations, these events are rare. On average, insert, lookup, and delete operations still have a time complexity of O(1).
Conclusion
In this lesson, we learned the basic operations of dictionaries and sets, and how their internal structure ensures high performance. Dictionaries are ordered in Python 3.7+, while sets are unordered. Their hash table structure guarantees efficient lookups, insertions, and deletions.
Dictionaries and sets are ideal for scenarios requiring efficient element lookup and deduplication.
Questions
1. Which of the following dictionary initializations is more efficient?
# Option A
d = {'name': 'jason', 'age': 20, 'gender': 'male'}
# Option B
d = dict({'name': 'jason', 'age': 20, 'gender': 'male'})
2. Can a dictionary key be a list? Is the following dictionary initialization correct? If not, why?
d = {'name': 'jason', ['education']: ['Tsinghua University', 'Stanford University']}
Thanks Meng!
Question 1: Option A is more efficient because Option B is duplicative (using the dict function on a dictionary
Question 2: Pretty sure you can’t have lists as keys, it’s only supposed to be a single value like a string or number but let me know if there’s a better explanation for why