5 Working with Text

Learning Outcomes for this Lesson

Upon completion of this lesson you should:

…know when and how to convert numerical data to and from strings
…be able to perform operations to manipulate strings
…know how to format strings to get desired output
…be able to use f-strings to insert values into strings
…know how to write multiline strings
…know how string comparisons work

Figure 5.1: This cartoon references the Perl programming language, which is particularly good for working with text, as well as regular expressions, also known as “regex”, a syntax for creating complicated queries to find text embedded in other text. Python supports regex via a package called `re` included in the standard library. Regex is outside the scope of this class, but it’s worth logging its existence in the back of your mind for some future use case.

5.1 Creating Strings

We can create a variable containing text in the same way we assign a number:

s = 'some text'

We can also convert numerical data to string format:

number = str(11/7) 
print("Result:", number)

Result: 1.5714285714285714

Note that the evaluation of 11/7 occurs before the string conversion occurs.

In addition to assigning values using hand-written statements, it is also possible to ask the user for some text input using the following built-in function:

s = input()

This feature is rarely used in scientific Python code. However, it is often used in “utility” apps that do things like delete duplicate photos or upload files. These types of programs will often ask “Are you sure (y/n)”, which indicates that your two options are “y” and “n” (for yes and no). Note that input() always returns a string, even if the user types a number.
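
As a minimal sketch of how such a prompt might be handled, the following checks whether a reply means “yes”. The helper name is_yes is hypothetical, and the reply is hard-coded here so the example runs without waiting for a user; in a real program it would come from input().

```python
def is_yes(answer):
    """Return True if the user's reply looks like a 'yes'."""
    # strip() removes surrounding whitespace, lower() ignores case
    return answer.strip().lower() in ('y', 'yes')

# In a real program the reply would come from the user:
# reply = input("Are you sure (y/n)? ")
reply = " Y "  # stand-in for what a user might type
print(is_yes(reply))
```

Cleaning the reply before comparing it means that “Y”, “y” and “ yes ” are all accepted.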

Figure 5.2: A screenshot of a commandline program asking for user input.

5.2 Manipulating Strings

5.2.1 Strings are Containers

When working with strings it is often helpful to keep in mind that strings are containers, where each character is an “item” in a list.

For instance, subsets of strings can be accessed by index:

s = 'this is a string'
print(s[0])

t

Or:

s = 'this is a string'
print(s[0:4])

this

Or:

s = 'this is a string'
print(s[5:7])

is

Unfortunately, it is not possible to write to a string using indexing. In this regard strings are more like tuples than lists (recall that tuples are immutable).
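
A short sketch of what this looks like in practice: assigning to an index raises a TypeError, so we build a new string from slices instead. The variable name s_new is just for illustration.

```python
s = 'this is a string'

try:
    s[0] = 'T'  # attempt to overwrite the first character
except TypeError as err:
    print("Error:", err)

# Build a new string from pieces of the old one instead
s_new = 'T' + s[1:]
print(s_new)
```

The original string s is left unchanged; s_new is a separate string object.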

5.2.2 Operations on Strings

As has been pointed out already, the reason Python insists on everything having a type is so that it knows how to perform operations on them.

The + symbol is a very useful example of this. When placed between two ints like 1 + 3, we know that + means addition in the mathematical sense. However, when placed between two strs like 'taylor' + 'travis' we also know this means join in the physical sense.

Sure enough:

'taylor' + 'travis'

'taylortravis'

This is called concatenation.

We can use other operators on strings too:

'Nom'*5

'NomNomNomNomNom'

'worries'*0

''

Note that - and / don’t work, which also agrees with our understanding of operations you can do to text.

5.2.3 Using the String Object’s Methods

Of Python’s many built-in functions, only a few can be applied to strings in any useful way: sorted() and len(), for example. Luckily, a str object carries with it a rather large set of methods which return an altered copy of the string, such as removing trailing spaces or padding a number with 0’s.

Partial list of methods attached to a `str` object. For a complete list see Table C.1.
Method	Description
`capitalize()`	Converts the first character to upper case
`count()`	Returns the number of times a specified value occurs in a string
`endswith()`	Returns `True` if the string ends with the specified value
`find()`	Searches the string for a specified value and returns the position of where it was found
`join()`	Converts the elements of an iterable into a string
`lower()`	Converts a string into lower case
`replace()`	Returns a string where a specified value is replaced with another specified value
`split()`	Splits the string at the specified separator, and returns a `list`
`startswith()`	Returns `True` if the string starts with the specified value
`upper()`	Converts a string into upper case
`zfill()`	Pads the string with `0` characters at the beginning until it reaches a specified total width
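
To make the table more concrete, here is a small sketch that calls several of these methods on a sample string. The sample text and variable names are only for illustration.

```python
s = 'hello world'

capitalized = s.capitalize()    # 'Hello world'
n_o = s.count('o')              # 2, one in each word
ends = s.endswith('world')      # True
position = s.find('world')      # 6, the index where 'world' starts
shouting = s.upper()            # 'HELLO WORLD'
padded = '7'.zfill(3)           # '007', padded to a total width of 3

print(capitalized, n_o, ends, position, shouting, padded)
```

None of these calls change s itself; each one returns a new value.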

Several of these methods are especially useful. For instance, the replace() method can be used to replace any characters which are causing problems.

For instance, perhaps we would like to replace the \t escape sequence with a ,:

s = '123\t456'
s = s.replace('\t', ',')
print(s)

123,456

Another useful tool is split(), which will turn a single string into multiple strings by splitting it at the given character:

s = 'filename.txt'
s = s.split('.')
print(s)

['filename', 'txt']

5.2.4 Joining Multiple Strings into a Single String

We have seen that Python will interpret operators like + differently depending on data type. We can join two strings using:

a = 'Hello'
b = 'World'
c = a + ' ' + b
print(c)

Hello World

It is common to have many strings that need to be “joined”. We could do it the verbose way:

strings = ['this', 'is', 'a', 'list', 'of', 'several', 'strings']  # (1)
new_string = ''  # (2)
for item in strings:  # (3)
    new_string = new_string + ' ' + item  # (4)
print(new_string)

1: Each word is an item in a list
2: We need to initialize an empty string so we have something to add to
3: item will take on each value in the list
4: Here we use the + operator to add each item to the new string, so it gets longer by 1 word on each loop

 this is a list of several strings

But because it is so common Python has a shortcut for us. The join method on a string object accepts a list of strings as follows:

strings = ['this', 'is', 'a', 'list', 'of', 'several', 'strings']
new_string = ' '.join(strings)  # (1)
print(new_string)

1: The string that join() is called on acts as the separator. Here we use ' ' with a space between the quotes. We then call the join() method of this string, and pass our list of strings. The words in strings are joined into a single string, with the separator placed between each pair of words.

this is a list of several strings

We can convert a string to a list, then we can use indexing to change specific characters, then use the join() method to convert back:

s = 'this is a string'
s2 = list(s)  # (1)
print(s2)
s2[5] = 'wa'  # (2)
s3 = ''.join(s2)  # (3)
print(s3)

1: Converting a string to a list puts each character as a separate entry in the list.
2: We can then change any character as needed.
3: Converting back to a string is a bit tricky though. We can’t just use str(s2). We must use the join() method to “join” each character in the list back into a string.

['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g']
this was a string

Another very common situation is to separate a sentence into separate words. We can use the split() method for this, which returns a list with each word as a separate item.

s = 'this is a sentence'
s2 = s.split(' ')  # (1)
print(s2)

1: We can use any character or set of characters as the splitting criteria. Here we have used a space which gets us a list of each isolated word.

['this', 'is', 'a', 'sentence']

And we can use the join() method to put them all back together once we’re done:

s = 'this is a sentence'
s2 = s.split(' ')  # (1)
s2.insert(3, 'short')  # (2)
s3 = ' '.join(s2)  # (3)
print(s3)

1: We need to catch the result of the split() method in a new variable.
2: s2 is a list already, with each word as an item, so we can use list methods; in this case we used insert() to insert a new word.
3: We can use the join() method to put things back into a sentence.

this is a short sentence

5.2.5 Converting Strings to Other Types

It may seem a bit silly to convert 1.0 to '1.0' and back, but this is actually extremely common when reading and writing files. We will talk about reading and writing files later in the course, but the main issue is that we simple humans require files that are human readable, while computers prefer other more efficient formats. “Human readable” basically means a file filled with text. For this reason we need to convert all numerical values to str before writing to a file that is intended to be human readable, and vice versa.

The following code will not work:

number_of_cookies = 3
message = "There are " + number_of_cookies + " cookies."

Why not? Because you cannot add a string and a numerical value; Python raises a TypeError. Instead we must do:

number_of_cookies = 3
message = "There are " + str(number_of_cookies) + " cookies."
print(message)

There are 3 cookies.

And of course we can convert in the reverse direction too:

input_string = "4.56893"
input_value = float(input_string)
print("Result:", input_value)

Result: 4.56893

Note that the conversion between text and numerical values requires that the item is eligible for conversion. An error (a ValueError) will occur when doing int('a') for instance.
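
One way to guard against text that cannot be converted is to catch the error. This is only a sketch: the helper name to_float is hypothetical, and returning None for bad input is one design choice among several.

```python
def to_float(text):
    """Hypothetical helper: return float(text), or None if text is not a number."""
    try:
        return float(text)
    except ValueError:
        # float() raises ValueError when the text is not a valid number
        return None

good = to_float("4.56893")
bad = to_float("abc")
print(good, bad)
```

Returning None lets the calling code decide what to do with a bad value, rather than stopping the whole program.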

5.2.6 Escape Characters

One very common task is writing data to a file, and a common subset of this task is writing tabular data, like:

Time	Temperature	Humidity
13:31.04	29.1	55
14:29.11	28.2	56
15:30.40	28.5	54

To make our file “human readable” we would insert “tabs” (or spaces, or commas) between values, and “new lines” after each row is complete. Python does not do this for us by default, so we need to insert some “markers” into the text. These are:

\n for a new line
\t for a tab

The appearance of a \ tells Python that the next character is special. This \ is called an escape character because it tells Python to “escape” the act of reading the text as pure text.

This presents a small conundrum though: how do you put an actual \ in the text? You escape it with \\!

One more source of trouble comes from the use of ' and ". If you start your string with ', then Python will end the string when it sees the next '. If you want to use ' in your text then you have two options:

You can start your string with ", then you can use all the ' you like (and vice versa)
You can escape the ' with \'.

Table 5.1: List of escape sequences used to format strings for human readable output

Escape Sequence	Effect
`\'`	A single quote
`\"`	A double quote
`\\`	A backslash
`\n`	A new line
`\t`	A tab
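
As an illustration of these escape sequences, the sketch below builds the first rows of the table shown earlier using \t and \n, and then a string containing both kinds of quote and a backslash. The variable names are only for illustration.

```python
# Tabs separate the columns, a new line ends each row
header = "Time\tTemperature\tHumidity\n"
row1 = "13:31.04\t29.1\t55\n"
row2 = "14:29.11\t28.2\t56\n"
table = header + row1 + row2
print(table)

# Escaped single quote, unescaped double quotes, and an escaped backslash
quote_demo = 'It\'s a "quoted" backslash: \\'
print(quote_demo)
```

Each escape sequence counts as a single character in the string, so len('\n') is 1.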

5.2.7 Inserting Values Into Strings

Printing text that contains the value of variables is so common that Python has a shortcut for this. It is called an f-string.

print(f"The value is {11/7}")

The value is 1.5714285714285714

The above line has several crucial features:

The string is prefaced with an ‘f’ which tells Python this string is to be given special treatment.
The curly braces inside the f-string tell Python where active code is located. It will evaluate these expressions, and there can be more than one.
We did not need to convert the results of 11/7 to a string, as Python knows this is already inside a string so does it for us.

The output of numerical values when written to text can often be ugly (i.e. way too many decimal places). Since this is such a common task, Python offers a way to control the conversion.

print(f"The value is {11/7:.2f}")

The value is 1.57

Here we have followed our value (11/7) with a :, then .2f which means “print this value as a float (f) with 2 decimal places”.

You can also have values written in scientific notation:

print(f"The value is {1111/7:.2e}")

The value is 1.59e+02

Table 5.2 gives examples of different types of output and how to get it.

Table 5.2: Overview of various ways to format numbers when printing strings

Number	Format	Output	Description
3.1415926	{:.2f}	3.14	Format float 2 decimal places
3.1415926	{:+.2f}	+3.14	Format float 2 decimal places with sign
-1	{:+.2f}	-1.00	Format float 2 decimal places with sign
2.71828	{:.0f}	3	Format float with no decimal places
5	{:0>2d}	05	Pad number with zeros (left padding, width 2)
5	{:x<4d}	5xxx	Pad number with x’s (right padding, width 4)
1000000	{:,}	1,000,000	Number format with comma separator
0.25	{:.2%}	25.00%	Format percentage
1000000000	{:.2e}	1.00e+09	Exponent notation
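
A few rows of this table can be reproduced as follows. Note that inside an f-string the value goes before the colon, so {:.2f} in the table becomes {value:.2f}. The variable names are only for illustration.

```python
value = 3.1415926

two_places = f"{value:.2f}"         # 2 decimal places
with_sign = f"{value:+.2f}"         # always show the sign
padded = f"{5:0>2d}"                # pad with zeros on the left to width 2
with_commas = f"{1000000:,}"        # comma as thousands separator
percent = f"{0.25:.2%}"             # multiply by 100 and add a % sign
scientific = f"{1000000000:.2e}"    # exponent notation

print(two_places, with_sign, padded, with_commas, percent, scientific)
```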

And “f-strings” work with variables as well!

pi = 3.14159
print(f"The value of pi is {pi}")

The value of pi is 3.14159

And of course we can do operations on variables:

pi = 3.14159
D = 1.5
print(f"The area of a circle with a diameter of {D} is {pi/4*D**2}")

The area of a circle with a diameter of 1.5 is 1.767144375

“f-strings” were added to Python in version 3.6 and are a huge timesaver. The previous way to insert values into strings was verbose and hard to read:

print("The value is {a}".format(a=11/7))

The value is 1.5714285714285714

Or we could include the decimal place formatting information:

print("The value is {a:.2e}".format(a=11/7))

The value is 1.57e+00

And multiple variables were handled as:

pi = 3.14159
D = 1.5
print("The area of a circle with a diameter of {a} is {b:.2f}".format(a=D, b=pi/4*D**2))

The area of a circle with a diameter of 1.5 is 1.77

The introduction of “f-strings” was obviously very welcome! However, you will still often see code that uses the format() style, since it still works just fine and people are either:

used to the old approach so stick with it
too busy to find the time to update old code

5.3 Multiline Strings

To write long strings which span multiple lines we could write each line as a separate string and join them using +, as follows:

s = ("this \n"
    + "text \n"
    + "spans \n"
    + "multiple \n"
    + "lines \n")
print(s)

this 
text 
spans 
multiple 
lines

but this is rather tedious. Instead we can put our text between triple quotes and write as many lines between them as we wish:

s = """
This
text
spans
multiple
lines
"""
print(s)


This
text
spans
multiple
lines

If we inspect s we’ll find that \n has been inserted for us: '\nThis\ntext\nspans\nmultiple\nlines\n'
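
If the leading and trailing \n are not wanted, one option (a sketch, not the only approach) is to call the strip() method on the result, which removes whitespace from both ends:

```python
s = """
This
text
spans
multiple
lines
""".strip()
print(repr(s))
```

Using repr() here shows the \n characters explicitly instead of printing them as line breaks.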

5.4 Comparing Strings

Relational operators discussed in the previous chapter can also be applied to strings. This might seem strange at first, but the effect of relational operators on strings is quite intuitive…they evaluate based on alphabetical ordering.

'Fred' < 'Harry'

True

'Fred' < 'Fanny'

False

This type of ordering is also called lexicographic ordering and depends on the case and length of the strings.

Uppercase precedes lowercase:

'A' < 'a'

True

Comparisons are evaluated character by character:

'abc' < 'abd'

True

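Strings containing numbers can also be compared, which can be useful for comparing the version numbers of a software package. Be careful, though: the comparison is still character by character rather than numerical, so it is only reliable when corresponding parts have the same number of digits ("3.9.0" < "3.10.0" evaluates to False). The following works as expected: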

v1 = "3.11.3"
v2 = "3.12.1"
v1 < v2

True
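
Comparing version strings this way is fragile, because the characters are compared rather than the numbers. The sketch below shows a case where it fails, and one possible workaround that converts each part to an int first. The helper name version_tuple is hypothetical.

```python
# Character by character: '9' comes after '1', so this is False
naive = "3.9.0" < "3.10.0"
print(naive)

def version_tuple(version):
    """Hypothetical helper: '3.10.0' -> (3, 10, 0)."""
    return tuple(int(part) for part in version.split('.'))

# Tuples of ints are compared numerically, element by element
robust = version_tuple("3.9.0") < version_tuple("3.10.0")
print(robust)
```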

String length also matters:

'abc' < 'abcd'

True

Python also provides functionality to search for a string within another string using the in operator:

'abc' in 'abcd'

True

'acb' in 'abcd'

False

The empty string is always a substring of another string:

'' in 'abc'

True

The list of string methods also contains some tools for comparison. For instance, endswith and startswith are useful:

string = 'email.address@uwaterloo.ca'
check = string.endswith('@uwaterloo.ca')  # (1)
result = 'valid' if check else 'invalid'  # (2)
print(f"{string} is a {result} UW email address")  # (3)

1: The nice thing about this method is that we don’t need to worry about the length of the text we’re looking for.
2: Here we have used a ‘one line if-statement’ to create the word ‘valid’ or ‘invalid’ as appropriate
3: Note the use of “f-strings” here to insert both the queried string and the result into a final printout.

email.address@uwaterloo.ca is a valid UW email address

Example 5.1 (Use Dictionary Keys to Identify Data) Storing data in “Excel Sheets” is exceedingly common. The top row of each column is often the “header” which indicates the data stored there, while each row is a new data point. A typical example is weather data. The University of Waterloo maintains a weather station on campus, and the historical data back to 1998 is available here. The data is reported in a format like that shown below.

Table 5.3: Sample data from the UW Weather Station Archive

year	month	day	hour	minute	Temperature	RH	Pressure
2023	1	1	0	0	-1.43133	90.53333	101.62801
2023	1	1	0	15	-1.43133	90.53333	101.62801
2023	1	1	0	30	-1.55	90.53333	101.628
2023	1	1	0	45	-1.616	90.5	101.63467
2023	1	1	1	0	-1.616	90.5	101.64334
2023	1	1	1	15	-1.668	90.5	101.65135
2023	1	1	1	30	-1.79867	90.5	101.66534
2023	1	1	1	45	-1.91	90.48	101.67735

Tidy up the column names by giving them consistent case and adding units. Also, add a column for time elapsed since the start of the year (in minutes) so the trends can be plotted vs time.

Solution

The best way to work with tabular data is as a dictionary, so let’s convert the above data:

data = {
    'year': [2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023],
    'month': [1, 1, 1, 1, 1, 1, 1, 1],
    'day': [1, 1, 1, 1, 1, 1, 1, 1],
    'hour': [0, 0, 0, 0, 1, 1, 1, 1],
    'minute': [0, 15, 30, 45, 0, 15, 30, 45],
    'Temperature': [-1.43133, -1.43133, -1.55, -1.616, 
                    -1.616, -1.668, -1.79867, -1.91],
    'RH': [90.53333, 90.53333, 90.53333, 90.5, 
           90.5, 90.5, 90.5, 90.48],
    'Pressure': [101.62801, 101.62801, 101.628, 101.63467, 
                 101.64334, 101.65135, 101.66534, 101.67735],
}

We can use the keys() method to ensure that everything is present:

print(data.keys())

dict_keys(['year', 'month', 'day', 'hour', 'minute', 'Temperature', 'RH', 'Pressure'])

Let’s start by correcting the case. We’ll do it ‘programmatically’ (i.e., use Python to do it inside a loop instead of doing it by hand):

new_data = {}
for k in data.keys():
    new_data[k.lower()] = data[k]

Instead of creating a whole new dictionary, we can make our changes directly to data, but we need to keep Python happy. The following will cause an error because we are changing data while we are iterating over it. The data dictionary is supplying the keys to the for-loop, but when we call the pop method we are removing keys. This confuses Python and will result in a fairly self-explanatory error: RuntimeError: dictionary keys changed during iteration.

for k in data.keys():
    data[k.lower()] = data.pop(k)  # (1)

1: Using pop on a dict while simultaneously iterating over it will cause an error since the dict’s keys change during the loop

However, if we convert data.keys() to a normal list, then we have decoupled the for-loop index from the dict. This is quite “behind the scenes”, but it happens often enough that it needs to be pointed out.

for k in list(data.keys()):  # (1)
    data[k.lower()] = data.pop(k)

1: Converting the keys to a list will create an independent list of keys which does not care that the dict is changing size.
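
Another option, sketched here as an alternative rather than as the approach used in this example, is to build the renamed dictionary in a single step using a dictionary comprehension. The small data dictionary below is a shortened stand-in for the full weather data.

```python
# Shortened stand-in for the weather data
data = {
    'year': [2023, 2023],
    'Temperature': [-1.43133, -1.55],
    'RH': [90.53333, 90.53333],
}

# Loop over the key/value pairs and store each value under a lower-case key
data = {k.lower(): v for k, v in data.items()}
print(list(data.keys()))
```

Because the new dictionary is fully built before it is assigned back to data, nothing is modified while it is being iterated over.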

Now let’s create a new column which contains the time in minutes since the start of the year:

cumulative_time = 0  # (1)
total_time = []  # (2)
for i in range(len(data['minute'])):
    total_time.append(cumulative_time)  # (3)
    cumulative_time = cumulative_time + 15
data['total_time [min]'] = total_time  # (4)
print(total_time)

1: Initialize a counter which will be incremented
2: Initialize an empty list to which each new cumulative_time will be added
3: We need to add the time to the list first, then increment it, otherwise our first time will be 15.
4: After we’re done we add it to the data dictionary

[0, 15, 30, 45, 60, 75, 90, 105]

Finally, we can plot the result using matplotlib, which we’ll learn about after the midterm:

import matplotlib.pyplot as plt
plt.plot(data['total_time [min]'], data['temperature'])  # (1)

1: I am trying very hard to resist the urge to format this plot in a more professional manner since I don’t want to confuse things on your first introduction to matplotlib. But in a few weeks we’ll do a deep dive into how to make these plots look great!

Comments

“Massaging” data like this is a very common task in all aspects of engineering. Data comes from many diverse sources, each with their own styles, formats, conventions, etc. The field of machine learning, for instance, is mostly about collecting and organizing data. The actual “training” of the AI to learn from the data is almost a black box at this point.

It is worth pointing out that our total_time calculation is a bit naive. We assumed each row was 15 minutes apart in time, but what happens if a row is missing? We should have used a more robust logic to determine how much actual time had elapsed between rows.
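
One possible sketch of a more robust calculation uses the datetime module from the standard library to work out the elapsed time of each row from its own date and time columns. The shortened data dictionary below is a stand-in with two rows deliberately left out, and this is an illustration of the idea rather than the method used above.

```python
from datetime import datetime

# Shortened stand-in: the 0:30 and 0:45 rows are deliberately missing
data = {
    'year': [2023, 2023, 2023],
    'month': [1, 1, 1],
    'day': [1, 1, 1],
    'hour': [0, 0, 1],
    'minute': [0, 15, 0],
}

total_time = []
for i in range(len(data['minute'])):
    # Build a timestamp from the columns of row i
    stamp = datetime(data['year'][i], data['month'][i], data['day'][i],
                     data['hour'][i], data['minute'][i])
    start_of_year = datetime(data['year'][i], 1, 1)
    elapsed = stamp - start_of_year  # subtracting datetimes gives a timedelta
    total_time.append(int(elapsed.total_seconds() // 60))

data['total_time [min]'] = total_time
print(total_time)
```

Here the third value is 60 rather than 30, because each time is computed from the row’s own timestamp instead of assuming a fixed 15 minute step.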

5.5 Exercises

Copy and paste the following code to a new .py file in Spyder and work through each cell until you get the desired result.


# %% Problem 1: Combining variables of strings
# Given the following pieces of information about yourself:
name = "Bob"
age = 33
favourite_food = "takoyaki"

# Output the following sentence:
"My name is Bob, and I'm 33 years old, and my favorite food is takoyaki"

# %% Problem 2: Using str methods
# Given the text of someone's name, like the following:

name = 'mr. harry s truman'

# Convert it to this format:
"Mr. Harry S Truman"

# %% Problem 3: Using string formatting
# Given the following information:
a = 12.85949333
b = 25.04383990

# Print the following:
"The values of a and b are 12.859 and 25.044."

# %% Problem 4: Check prefix and suffix
# Given a list of file names like those below, convert them all to 'my_file.py'

a = 'my file.py'
b = 'my.file.py'
c = 'my_file.py.py'

a = ""
b = ""
c = ""

print(a)
print(b)
print(c)

# %% Problem 5: String Length and Slicing
# Given the string `s = "Python Programming"`:
# a. Find the length of the string.
# b. Extract and print the substring "Python".
# c. Extract and print the substring "Programming".
# 
# Fill in the code below:

s = "Python Programming"

length = # Get length of the string
substring_python = # Slice the first word
substring_programming = # Slice the second word

print("Length of the string:", length)
print("First word:", substring_python)
print("Second word:", substring_programming)

# %% Problem 6: Multiline Strings
# Write a function that:
# Given a string (s) and number (n), uses a multiline string to write s in n 
# different lines.
# 
# Example:
# Q6("I'm so excited for the NE121 Quiz!!!", 4)
# --> I'm so excited for the NE121 Quiz!!!
#     I'm so excited for the NE121 Quiz!!!
#     I'm so excited for the NE121 Quiz!!!
#     I'm so excited for the NE121 Quiz!!!
#
# NOTE: Only 1 print statement and one variable are needed!
#
# Fill in the code below:

def Q6(s, n):
    multiline_string = # Do not use any loops or new lines
    print(multiline_string)

# %% Problem 7: String Comparisons
# Write a function that given two strings, a and b, 
# uses a comparison operator to check if a is lexicographically smaller than b.
# If False, check if a is equal to b
# The function should print the description of the outcome
#
# Examples:
# Q7(a="abc", b="abd")
# --> a is lexicographically smaller? True.
# Q7("abc", "abc")
# --> a is lexicographically smaller? False.
# --> The strings are equal.
# Q7("abd", "abc")
# --> a is lexicographically smaller? False.
# --> The strings are not equal.
# 
# It is okay if the print statements are not the exact same, as long as the 
# message is clear. 
#
# Fill in the code below:

def Q7(a, b):
    # No help this time :)

# %% Problem 8: Creating a Table with Strings (Hard)
# Write a function `create_table(headers, data)` that takes two lists:
# a. `headers`: A list of column headers for the table (e.g., ["Name", "Age", "Country"])
# b. `data`: A list of lists containing rows of data (e.g., [["Alice", 25, "USA"], ["Bob", 30, "Canada"]])
#
# HINT: you will need to use a for loop, \t, and \n
#
# The function should:
# a. Print the headers, with each header separated by a tab.
# b. For each row in `data`, print the row's elements separated by tabs.
# c. Ensure that each row and the headers are printed on a new line.
#
# Example:
# If `headers = ["Name", "Age", "Country"]` and `data = [["Alice", 25, "USA"], ["Bob", 30, "Canada"]]`, 
# the output should be:
# 
# Name    Age    Country
# Alice   25     USA
# Bob     30     Canada
#
# Fill in the code below:

def create_table(headers, data):
    # I believe in you 
    # -Bardia
    print(table)