Deep Dive into Strings (Python Core in Action 5)
Struggling with Python Strings? Here Are the Solutions You Need!
Strings are everywhere in Python code.
They are one of the most frequently used data types, appearing in log output, docstrings, database access, basic variable operations, and more.
You probably already know a bit about strings.
Today, we'll review common string operations and explain some useful tricks.
String Basics
What is a string?
A string is a sequence of characters, typically enclosed in single quotes (''), double quotes (""), or triple quotes (''' ''' or """ """). Here are a few examples:
name = 'jason'
city = 'beijing'
text = "welcome to jike shijian"
In these examples, `name`, `city`, and `text` are all strings.
In Python, single, double, and triple-quoted strings are identical. For example, `s1`, `s2`, and `s3` in the following example are exactly the same:
s1 = 'hello'
s2 = "hello"
s3 = """hello"""
s1 == s2 == s3
True
Python supports all three forms to allow for strings containing quotes. For instance:
"I'm a student"
Triple-quoted strings are mainly used for multi-line strings, such as function docstrings.
def calculate_similarity(item1, item2):
    """
    Calculate similarity between two items
    Args:
        item1: 1st item
        item2: 2nd item
    Returns:
        similarity score between item1 and item2
    """
Python also supports escape characters, which are special sequences starting with a backslash.
Common ones include `\n` (newline), `\t` (tab), `\\` (a literal backslash), `\'` (single quote), and `\"` (double quote).
For example:
s = 'a\nb\tc'
print(s)
a
b c
In this code, '\n' represents a newline character and '\t' represents a tab character, so `a` is printed on the first line, and `b` and `c` are printed on the second line, separated by a tab.
Note that the string `s` still contains only 5 characters:
len(s)
5
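As a small sketch of how escape sequences are counted, `repr()` shows the escapes instead of rendering them, and a raw string (the `r` prefix) keeps each backslash as a literal character. The variable name `raw` is only illustrative:

```python
s = 'a\nb\tc'
raw = r'a\nb\tc'   # raw string: backslashes are kept literally

print(repr(s))     # shows the escape sequences rather than rendering them
print(len(s))      # 5: 'a', newline, 'b', tab, 'c'
print(len(raw))    # 7: 'a', '\', 'n', 'b', '\', 't', 'c'
```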
The most common escape character is the newline '\n', often seen in file reading where each line ends with '\n'.
When processing data, we usually remove these newline characters.
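A minimal sketch of that cleanup step follows. It uses `io.StringIO` as a stand-in for a real file so that the example is self-contained; the sample text is made up for illustration:

```python
import io

# io.StringIO stands in for an opened text file.
fake_file = io.StringIO('first line\nsecond line\nthird line\n')

lines = []
for line in fake_file:
    # Each line read from the file still carries its trailing '\n'.
    lines.append(line.rstrip('\n'))

print(lines)  # ['first line', 'second line', 'third line']
```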
Common String Operations
After discussing the basics of strings, let's look at some common string operations.
Think of a string as an array of characters. Therefore, Python strings support indexing, slicing, and iteration.
name = 'jason'
name[0] # 'j'
name[1:3] # 'as'
Like lists and tuples, strings are indexed from 0, so `name[0]` is the first character. A slice `[index:index+2]` returns the substring from position `index` up to and including `index+1`; the end position `index+2` itself is excluded.
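A few more indexing and slicing forms, shown as a quick sketch on the same `name` string (the other variable names are illustrative):

```python
name = 'jason'

first = name[0]             # 'j'
last = name[-1]             # 'n'   (negative indices count from the end)
middle = name[1:3]          # 'as'  (the end index is excluded)
head = name[:2]             # 'ja'  (an omitted start means 0)
tail = name[2:]             # 'son' (an omitted end means len(name))
reversed_name = name[::-1]  # 'nosaj' (a step of -1 walks backwards)
```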
Iterating through a string is straightforward, as it involves iterating through each character.
for char in name:
    print(char)
# Output:
# j
# a
# s
# o
# n
It's crucial to note that Python strings are immutable. Hence, you cannot change a character in the string directly.
s = 'hello'
s[0] = 'H'
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# TypeError: 'str' object does not support item assignment
To change a string, create a new one. For example, either of the following turns 'hello' into 'Hello':
s = 'H' + s[1:]           # method 1
s = s.replace('h', 'H')   # method 2
The first method uses the `+` operator to concatenate 'H' with a slice of the original string.
The second method calls `replace`, which returns a new string with every 'h' replaced by 'H' (here there is only one).
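To make the "new string" point concrete, here is a short sketch. The name `alias` is introduced only for illustration: it keeps pointing at the original object, which is never modified.

```python
s = 'hello'
alias = s            # a second name for the same string object

s = 'H' + s[1:]      # builds a new string and rebinds the name s

print(s)      # Hello
print(alias)  # hello
```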
In other languages such as Java, mutable string types like `StringBuilder` allow efficient in-place modification. Python's `str` type has no mutable counterpart that plays the same role, so changing a string usually means creating a new one, which often takes O(n) time, where n is the length of the new string.
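When many pieces need to be accumulated, a common workaround is to collect them in a mutable container and build the string once at the end. A minimal sketch, using the standard library's `io.StringIO` and a plain list (the variable names and sample pieces are illustrative):

```python
import io

pieces = ['py', 'th', 'on']

# Option 1: write into an in-memory text buffer, then read it back once.
buffer = io.StringIO()
for piece in pieces:
    buffer.write(piece)
from_buffer = buffer.getvalue()

# Option 2: collect the pieces in a list and join them at the end.
parts = []
for piece in pieces:
    parts.append(piece)
from_list = ''.join(parts)

print(from_buffer, from_list)  # python python
```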
You might have noticed terms like "usually" or "often" rather than "always." This is because Python's performance optimizations have improved over time.
Consider the `+=` operator for string concatenation. Strings remain immutable, but the interpreter can sometimes avoid the cost of a full copy:
str1 += str2 # equivalent to str1 = str1 + str2
For example:
s = ''
for n in range(100000):
    s += str(n)
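Initially, you might think the time complexity is O(n^2), as each loop iteration appears to create a new string. But CPython has optimized this pattern since version 2.4: if `str1` has no other references, the interpreter tries to expand the existing buffer in place rather than allocating new memory. In that case the loop runs in roughly O(n) time. Keep in mind that this is an implementation detail of CPython, not a guarantee of the language.
So, feel free to use `+=` for convenience in everyday code without worrying too much about efficiency. For performance-sensitive code, or code that must run on other interpreters, `join` is the safer choice.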
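Additionally, you can use the `join` method for string concatenation:
l = [str(n) for n in range(100000)]
s = ' '.join(l)
Building the list takes O(1) amortized time per element, and `join` copies each piece only once, so the overall time complexity is O(n). Note that `' '.join` puts a space between the pieces; use `''.join` to get exactly the same result as the `+=` loop.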
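If you want to check the two approaches on your own machine, a rough sketch with the standard `timeit` module might look like this. The function names are made up for this example, and timings vary by machine and interpreter, so no expected numbers are shown:

```python
import timeit

def concat_with_plus(count):
    """Build the string with repeated += concatenation."""
    s = ''
    for n in range(count):
        s += str(n)
    return s

def concat_with_join(count):
    """Build the same string with a single join call."""
    return ''.join(str(n) for n in range(count))

# Both approaches produce the same string.
assert concat_with_plus(1000) == concat_with_join(1000)

# Run each version 20 times and report the total elapsed time.
plus_time = timeit.timeit(lambda: concat_with_plus(10000), number=20)
join_time = timeit.timeit(lambda: concat_with_join(10000), number=20)
print(f'+=:   {plus_time:.4f}s')
print(f'join: {join_time:.4f}s')
```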
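Next, let's look at the `split()` method. `string.split(separator)` splits a string into a list of substrings at each occurrence of `separator`, which is handy when parsing data:
path = 'hive://ads/training_table'
namespace = path.split('//')[1].split('/')[0] # 'ads'
table = path.split('//')[1].split('/')[1] # 'training_table'
data = query_data(namespace, table) # query_data is assumed to be defined elsewhere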
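The same parsing can be sketched a little more defensively with `str.partition`, which splits at the first occurrence of a separator and always returns three parts. `parse_table_path` is a hypothetical helper written for this example, not part of any library:

```python
def parse_table_path(path):
    """Split 'scheme://namespace/table' into (scheme, namespace, table)."""
    scheme, sep, rest = path.partition('://')
    if not sep:
        # partition returns an empty separator when '://' is not found.
        raise ValueError(f'missing "://" in path: {path!r}')
    # Split at the first '/' only; any later slashes stay in the table part.
    namespace, _, table = rest.partition('/')
    return scheme, namespace, table

print(parse_table_path('hive://ads/training_table'))
# ('hive', 'ads', 'training_table')
```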
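Other common methods include:
`string.strip(chars)` removes leading and trailing characters that appear in `chars` (whitespace if `chars` is omitted).
`string.lstrip(chars)` removes such characters from the start only.
`string.rstrip(chars)` removes such characters from the end only.
Note that `chars` is treated as a set of characters to remove, not as an exact prefix or suffix.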
These functions are useful for data parsing. For example, to remove leading and trailing whitespace:
s = ' my name is jason '
s.strip() # 'my name is jason'
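One detail that often causes mistakes: the optional argument to `strip`, `lstrip`, and `rstrip` is a set of characters, not a substring. A brief sketch (the sample strings and variable names are illustrative):

```python
s = '  my name is jason  '

both = s.strip()     # 'my name is jason'
left = s.lstrip()    # 'my name is jason  '
right = s.rstrip()   # '  my name is jason'

# Every leading/trailing character found in the argument is removed,
# in any order, until a character outside the set is reached.
trimmed = 'xxhelloxyx'.strip('xy')        # 'hello'
domain = 'www.example.com'.strip('w.com')  # 'example'
```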
Python strings have many more useful methods, such as `string.find(sub, start, end)`, which returns the lowest index at which the substring `sub` is found within the given range, or -1 if it is not found. While I've highlighted the most common and error-prone ones, you can explore further in the documentation.
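As a quick illustration of `find`, here is a sketch on the `text` string from earlier (the result variable names are illustrative):

```python
text = 'welcome to jike shijian'

first_ji = text.find('ji')        # 11: index of the first match
second_ji = text.find('ji', 12)   # 19: search starts at index 12
missing = text.find('python')     # -1: substring not present

# Use the `in` operator when only a yes/no answer is needed.
has_jike = 'jike' in text         # True
```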
String Formatting
Finally, let's look at string formatting. What is string formatting?
Usually, we use a string as a template with placeholders.
These placeholders reserve positions for actual values to display them in the desired format.
String formatting is often used in program outputs, logging, etc.
Here's a common example:
Suppose we have a task to fetch user info from a database using a given `userid`. If the user isn't found, we log it for future analysis or debugging.
We usually do this:
print('no data available for person with id: {}, name: {}'.format(id, name))
Here, `string.format()` is the formatting function, and `{}` are placeholders for the actual values, like `name`. If `id = '123'` and `name = 'jason'`, the output will be:
'no data available for person with id: 123, name: jason'
Simple, right?
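Note that `string.format()` is the newer of the two classic formatting styles. Since Python 3.6, formatted string literals (f-strings), such as `f'id: {id}'`, offer an even more concise syntax that uses the same format specification.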
We also have other methods. In earlier versions of Python, we used `%` for formatting. The above example can be written as:
print('no data available for person with id: %s, name: %s' % (id, name))
Here, `%s` represents a string, `%d` represents an integer, etc. These are basics you should know.
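However, I recommend using the `format` method (or f-strings on Python 3.6 and later) rather than `%` in new code. The official documentation notes that these newer styles help avoid common errors of `%` formatting, such as failing to display tuples and dictionaries correctly.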
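A few formatting variants, as a hedged sketch. The values `user_id`, `name`, and `score` are made up for illustration, and the last line requires Python 3.6 or later:

```python
user_id = '123'
name = 'jason'
score = 3.14159

by_position = 'id: {0}, name: {1}'.format(user_id, name)
by_keyword = 'id: {uid}, name: {name}'.format(uid=user_id, name=name)
two_decimals = 'score: {:.2f}'.format(score)   # 'score: 3.14'
right_aligned = '[{:>8}]'.format(name)         # '[   jason]'
f_string = f'id: {user_id}, name: {name}'      # f-string, Python 3.6+
```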
Some might ask, why use a formatting function when string concatenation works too?
True, string concatenation can meet many formatting needs.
However, using a formatting function is clearer, more readable, and more standardized, which reduces errors.
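One concrete difference, sketched below with an integer `user_id` (an assumption made for this example): concatenation needs an explicit `str()` for every non-string value, while `format` converts values for you.

```python
user_id = 123
name = 'jason'

# Concatenation: every non-string value must be converted by hand.
by_concat = ('no data available for person with id: ' + str(user_id)
             + ', name: ' + name)

# Formatting: the template stays readable and conversion is automatic.
by_format = 'no data available for person with id: {}, name: {}'.format(
    user_id, name)

# Forgetting the conversion raises a TypeError.
try:
    'id: ' + user_id
except TypeError as exc:
    error_message = str(exc)
```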
Conclusion
In this lesson, we covered the basics of Python strings and their common operations, with specific examples and scenarios. Pay special attention to the following points:
In Python, strings can be written with single quotes, double quotes, or triple quotes, and the three forms are equivalent. Triple quotes are usually used for multi-line strings.
Strings in Python are immutable, so you cannot change a character in place. The `+=` operator does not break this rule; CPython simply optimizes it so that it often avoids a full copy.
String concatenation in Python has become much more efficient than before, so you can use it with confidence in most everyday code.
String formatting (`string.format`) in Python is often used in outputs, logging, and similar scenarios.
Exercise
Finally, here is a question for you to think about.
Which of the following string concatenation methods do you think is better? Feel free to leave a comment and share your thoughts. Also, you are welcome to share this article with your colleagues and friends.
# Method 1
s = ''
for n in range(0, 100000):
    s += str(n)
# Method 2
l = []
for n in range(0, 100000):
    l.append(str(n))
s = ' '.join(l)