1. Matplotlib offers two ways to interact with it: pylab and pyplot. pylab is more familiar to Matlab users. All serious Python programmers use pyplot. The downside of this flexibility is that the import statement is annoyingly long!
2. This is the new standard way to invoke a new empty plot. All plots consist of a main figure (fig) which contains one or more axis objects (ax). The subplots() function returns the “handles” to both so we can edit their properties as needed later.
3. Here we tell the ax object to populate itself with a scatter plot of the x and y data. The trailing semicolon (;) is not needed, but it suppresses any text which may be emitted by this function.
10 Data Visualization and Other Tricks
So far we have focused on using Python and Numpy for “crunching numbers”…but crunching numbers is not enough! It is necessary to present the results of our efforts so that others can learn from them. In this chapter we’ll take a look at a few ways to make data look good.
Data visualization is actually an active area of academic research. There are always ways to present data that make it easier for a human to see things more clearly. Below are some examples.
10.1 Introduction to Matplotlib
Matplotlib is the main package for plotting in Python. There are several other contenders which offer their own perspectives on plotting, but Matplotlib is really all we need.
Matplotlib is part of the Numpy/Scipy ecosystem of packages. It is designed to work with Numpy arrays.
We have seen several examples of Matplotlib graphs already in the course notes, but now we’ll take a deep dive into how we get the plots we want.
10.1.1 The Standard Charts
10.1.1.1 Scatter Plots
The scatter plot, or x-y plot, is probably the most useful and commonly used type of plot. It allows us humans to see trends and relationships instantly.
Let’s plot a quadratic curve: \(y = ax^2 + bx + c\).
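A minimal sketch of such a plot using the pyplot interface. The coefficient values and the x-range are assumptions chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt  # the pyplot interface

# Assumed coefficients and x-range, chosen only for illustration
a, b, c = 1.0, -2.0, 1.0
x = np.linspace(-5, 5, 50)
y = a * x**2 + b * x + c

fig, ax = plt.subplots()  # create the figure and a single axis
ax.scatter(x, y);         # the trailing semicolon hides the returned object in a notebook
```

Keeping the fig and ax handles around is what makes the later formatting calls possible.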
Customizing the formatting of Matplotlib plots can become quite verbose since there are so many options! Let’s look at a few of the normal options:
1. The following code style is allowed by PEP8 and is quite helpful when a function has many arguments. This layout not only allows the coder to see each argument clearly, but it’s also easy to add and remove arguments by just inserting/deleting lines. Even the closing bracket is on its own line so that every argument can be easily adjusted in this way.
2. This is a complete list of the arguments that are listed in the scatter() function’s docstring, BUT there is another huge list of options that can be passed to (almost) any plot. This will be explored below.
3. Here is an example of applying some formatting to an axis using the ax handle. There are a lot of things we can do here, which will also be explored below.
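A sketch of the one-argument-per-line layout these annotations describe. Only a subset of the scatter() arguments is shown, and the data and argument values are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 50)
y = x**2 - 2 * x + 1  # assumed quadratic, for illustration only

fig, ax = plt.subplots()
ax.scatter(
    x=x,
    y=y,
    s=40,            # marker size in points**2
    c='tab:blue',    # marker color
    marker='o',      # marker shape
    alpha=0.7,       # transparency
    linewidths=0.5,  # width of the marker outline
    edgecolors='k',  # color of the marker outline
)
ax.set_xlabel('x')   # formatting applied through the ax handle
ax.set_ylabel('y');
```

With this layout, removing an option means deleting a single line, and the trailing comma after the last argument keeps later additions tidy.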
As mentioned above there are many more arguments that can be passed to matplotlib functions. In the following code block we use keyword-unpacking introduced in Chapter 7. We will put all our desired properties into a dict, then pass it to the function:
1. This entry and the next are examples of extra arguments that are accepted by most plot types.
2. Here we pass all the values in the formatting_args dict using the ** notation which tells Python to assign each “key” to the corresponding keyword argument.
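A possible version of this pattern is sketched below. The name formatting_args comes from the annotation above; the specific entries, including label and zorder as the two generic arguments, are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 50)
y = x**2  # assumed data, for illustration only

formatting_args = {
    's': 30,
    'c': 'tab:red',
    'marker': '^',
    'label': 'my data',  # generic argument accepted by most plot types
    'zorder': 2,         # generic argument accepted by most plot types
}

fig, ax = plt.subplots()
pc = ax.scatter(x, y, **formatting_args)  # each key becomes a keyword argument
```

One benefit of this approach is that the same dict can be reused across several plotting calls to keep their styling consistent.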
For the sake of completeness, let’s also review positional argument unpacking:
1. These need to be in the same order as they are expected by the scatter() method since the * notation indicates these are unpacked as “positional arguments”.
2. Here we have unpacked the positional arguments first followed by the keyword arguments. Note that this is the rule whether using unpacking or not.
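The two kinds of unpacking can be combined as in the following sketch. The variable names positional_args and keyword_args, and all values, are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 50)
y = x**2  # assumed data, for illustration only

positional_args = (x, y, 30)  # order matters: x, then y, then marker size s
keyword_args = {'c': 'tab:green', 'alpha': 0.5}

fig, ax = plt.subplots()
# Positional arguments are unpacked first, keyword arguments second
pc = ax.scatter(*positional_args, **keyword_args)
```

Swapping the order to `ax.scatter(**keyword_args, *positional_args)` is a syntax error, just as writing a positional argument after a keyword argument would be.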
10.1.1.2 Histograms and Bar Charts
Bar charts are a common option when plotting statistical distributions of data. We talked briefly in the last chapter about how to use Numpy to generate data following different common distributions.
    import numpy as np
    import matplotlib.pyplot as plt

    vals = np.random.normal(loc=1, scale=0.1, size=10000)
    fig, ax = plt.subplots()
    h = ax.hist(vals, bins=25, edgecolor='k', alpha=.5)
Another common use of bar charts is when the ‘x-axis’ is a category rather than a numerical value. Matplotlib lets us specify both an ‘x’ location for the bars, and individual labels:
1. It is not necessary to store the data in a dict but it is obviously quite convenient.
2. Since the data are in a dict we can use keys() as the x-coord (i.e. category) and values() as the y-coord (i.e. height of the bars).
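A minimal sketch of a categorical bar chart built from a dict. The category names and counts are hypothetical, and the keys and values are converted to lists here as a conservative choice.

```python
import matplotlib.pyplot as plt

# Hypothetical categories and counts, for illustration only
counts = {'red': 4, 'green': 7, 'blue': 2}

fig, ax = plt.subplots()
bars = ax.bar(
    x=list(counts.keys()),         # categories along the x-axis
    height=list(counts.values()),  # bar heights
)
```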
10.1.2 Formatting ax and fig objects
In programming there is a “design pattern” called getters and setters. This means functions which “get” data and functions which “set” data. It is common for these functions to start with the words get and set. In Table 10.1 we can see all the set methods on ax objects, but there are also a lot of get methods too.
In fact, when you write data to a dict in Python, like d[item] = value, behind the scenes Python actually runs a hidden function called __setitem__() with item and value as arguments like d.__setitem__(item, value). You can try this yourself! You can also retrieve data using d.__getitem__(item).
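The equivalence described above can be checked directly. The dict contents below are arbitrary.

```python
d = {}
d['a'] = 1                  # the usual syntax
d.__setitem__('b', 2)       # the equivalent explicit call

value = d.__getitem__('a')  # same as d['a']
print(d, value)
```

In everyday code the square-bracket syntax is preferred; the explicit calls are shown only to reveal what happens behind the scenes.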
The ax and fig objects have a lot of methods attached to them which allow us to manipulate the formatting. Of this huge list, the ones that start with set are for “setting” properties. Below is a list of methods on the ax objects. A similar selection of methods can be found on the fig object as well.
Table 10.1: set methods on the axis object
set
set_adjustable
set_agg_filter
set_alpha
set_anchor
set_animated
set_aspect
set_autoscale_on
set_autoscalex_on
set_autoscaley_on
set_axes_locator
set_axis_off
set_axis_on
set_axisbelow
set_box_aspect
set_clip_box
set_clip_on
set_clip_path
set_facecolor
set_fc
set_figure
set_forward_navigation_events
set_frame_on
set_gid
set_in_layout
set_label
set_mouseover
set_navigate
set_navigate_mode
set_path_effects
set_picker
set_position
set_prop_cycle
set_rasterization_zorder
set_rasterized
set_sketch_params
set_snap
set_subplotspec
set_title
set_transform
set_url
set_visible
set_xbound
set_xlabel
set_xlim
set_xmargin
set_xscale
set_xticklabels
set_xticks
set_ybound
set_ylabel
set_ylim
set_ymargin
set_yscale
set_yticklabels
set_yticks
set_zorder
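A list like this one can be generated by inspecting the ax object with Python's built-in dir() function, as in the following sketch. The exact set of names depends on the installed Matplotlib version, so the result may differ from the table.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Collect the names of all attributes that start with 'set' or 'get'
setters = [name for name in dir(ax) if name.startswith('set')]
getters = [name for name in dir(ax) if name.startswith('get')]
print(len(setters), len(getters))

# A setter and its matching getter
ax.set_xlabel('time')
print(ax.get_xlabel())
```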
Let’s see some of these in action:
    data = {
        'apples': 10,
        'pears': 3,
        'oranges': 7,
        'bananas': 9,
    }
    fig, ax = plt.subplots()
    ax.bar(x=data.keys(), height=data.values())

    ax.set_facecolor('grey')
    ax.set_yticks([0, 2, 4, 6, 8, 10])
    ax.set_yticklabels(('a', 'b', 'c', 'd', 'e', 'f'))
    ax.set_ylabel('Some nonsense values')
    ax.set_xlabel('The fruit name')
    ax.set_xmargin(0.1)

    fig.set_facecolor('silver')
    fig.set_edgecolor('black')
    fig.set_linewidth(10)
    fig.set_figwidth(4);
Note: This graph does not look very good. This is just a demonstration of how to go about making it look good.
10.1.3 Named Colors and Color Maps
Specifying colors is one of the main ways in which we customize plots. Matplotlib has a large list of “named colors” which can be passed to any color-related argument or method, like set_facecolor(). These colors are given below:
Matplotlib also has a small subset of colors which work well together called the “tableau palette”. These are shown below:
And here is an example of using the “tableau palette” when combining 4 different data sets.
    x1, y1 = np.random.rand(2, 1000)
    x2, y2 = np.random.normal(1, 0.1, (2, 1000))
    x3, y3 = np.random.weibull(a=2, size=(2, 1000))
    x4, y4 = np.random.gamma(scale=1, shape=1, size=(2, 1000))

    fig, ax = plt.subplots()
    ax.scatter(x4, y4, marker='.', alpha=0.5, c='tab:orange')
    ax.scatter(x3, y3, marker='.', alpha=0.5, c='tab:green')
    ax.scatter(x2, y2, marker='.', alpha=0.25, c='tab:red')
    ax.scatter(x1, y1, marker='.', alpha=0.15, c='tab:blue')
Another way to use colors effectively is assigning each data point some different color based on some additional value other than just “x” and “y” locations.
1. Here we create 2 subplots, which we’ll explore in more detail in the next section.
2. We assign the array color to the c argument and Matplotlib will color each marker accordingly.
3. We can override the default color map by passing a colormap to the cmap argument.
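A sketch of what these annotations describe. The data, the choice of distance from the center as the extra value, and the 'plasma' colormap are all assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y = rng.random((2, 200))
# Assumed extra value for each point: its distance from the center of the unit square
color = np.hypot(x - 0.5, y - 0.5)

fig, (ax1, ax2) = plt.subplots(1, 2)             # (1) two side-by-side subplots
pc1 = ax1.scatter(x, y, c=color)                 # (2) markers colored by the array
pc2 = ax2.scatter(x, y, c=color, cmap='plasma')  # (3) override the default colormap
```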
There are quite a few colormaps provided by Matplotlib, as shown below:
A good tutorial on the subject of plotting with colors is given here
10.1.4 Multiple Plots
The Matplotlib Gallery contains hundreds of examples to use as a starting point. You can browse this page until you find something that looks close to what you want.
It is pretty common to plot two things side-by-side, or even an N-by-N grid. Using the plt.subplots() function as our default approach allows this easily:
1. Note that the axes objects are returned as a collection, which we unpack into ax1 and ax2.
2. In this line and the next we add scatter plots to each of the axes we created.
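A minimal sketch of the side-by-side layout these annotations describe. The random data and colors are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x1, y1 = rng.random((2, 500))
x2, y2 = rng.normal(1, 0.1, (2, 500))

fig, (ax1, ax2) = plt.subplots(1, 2)           # (1) unpack the two axes directly
ax1.scatter(x1, y1, marker='.', c='tab:blue')  # (2) add a scatter plot to each axis
ax2.scatter(x2, y2, marker='.', c='tab:red')
```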
In the above demo we created the exact number of axes when we “unpacked” the collection, but we can also just use the collection directly:
    x1, y1 = np.random.rand(2, 1000)
    x2, y2 = np.random.normal(1, 0.1, (2, 1000))
    x3, y3 = np.random.weibull(a=2, size=(2, 1000))
    x4, y4 = np.random.gamma(scale=1, shape=1, size=(2, 1000))

    fig, ax = plt.subplots(2, 2)  # (1)
    ax[0][0].scatter(x1, y1, marker='.', c='tab:blue')  # (2)
    ax[0][1].scatter(x2, y2, marker='.', c='tab:red')
    ax[1][0].scatter(x3, y3, marker='.', c='tab:green')
    ax[1][1].scatter(x4, y4, marker='.', c='tab:orange')
1. Here we create a 2-by-2 grid of axes, and catch all of them in ax.
2. To add data to each axis we need to index into the ax collection according to the “row” and “column” location we want.
There is another version of the subplots() approach called subplot_mosaic() which is interesting since it gives us quite a lot of control over the layout. Instead of just a regular grid, we can do the following:
    x1, y1 = np.random.rand(2, 1000)
    x2, y2 = np.random.normal(1, 0.1, (2, 1000))
    x3, y3 = np.random.weibull(a=2, size=(2, 1000))
    x4, y4 = np.random.gamma(scale=1, shape=1, size=(2, 1000))

    fig, ax = plt.subplot_mosaic([['A', 'A', 'A', 'B'],  # (1)
                                  ['A', 'A', 'A', 'C'],
                                  ['A', 'A', 'A', 'D']])

    ax['A'].scatter(x1, y1, marker='.', c='tab:blue')  # (2)
    ax['B'].scatter(x2, y2, marker='.', c='tab:red')
    ax['C'].scatter(x3, y3, marker='.', c='tab:green')
    ax['D'].scatter(x4, y4, marker='.', c='tab:orange')
1. We create a grid of where we want plots 'A' through 'D' to appear. Note that we can use any names we want, like 'Panel A' or 'axis 1'.
2. We then use the names to index into the ax object to create each individual subplot.
10.1.5 Working with Images
We have already seen several examples of working with images. Images are a very common form of data since engineers use them for many things:
- microscope images of surfaces or tiny features on a part or sample
- aerial photographs of an area like a water treatment plant
- maps with color coding for terrain type or elevation
Let’s take one more look at color images by reading in an image that is often used as a test case:
    im = plt.imread('media/astronaut-eileen-collins.png')

    fig, ax = plt.subplots(2, 2)
    ax[0][0].imshow(im)
    ax[0][0].set_title('All Channels')
    ax[0][1].imshow(im[..., 0], cmap=plt.cm.Reds)
    ax[0][1].set_title('Red Channel')
    ax[1][0].imshow(im[..., 1], cmap=plt.cm.Greens)
    ax[1][0].set_title('Green Channel')  # channel index 1 is green
    ax[1][1].imshow(im[..., 2], cmap=plt.cm.Blues)
    ax[1][1].set_title('Blue Channel');  # channel index 2 is blue
The “png” format stands for “portable network graphics”. It has the usual three “RGB” “channels”, but also has a 4th “channel” which is the transparency. In the case of a normal photo, there is no transparency. In cases like logos we might want only the logo and letters to show up, and all the background to disappear.
Let’s look at this in greyscale. Greyscale is a very powerful way to present information. It means that each pixel is a value between 0 and N. This can be used to represent some precise numerical value like elevation, or labels to indicate different regions like countries.
1. The scikit-image package is part of a family of scikit packages. The name is meant to evoke a connection with scipy, and they are like additional toolboxes that are too specific to be included within scipy. These packages are generally abbreviated as sk<name> since Python package names cannot have dashes (-) in them. scikit-learn and scikit-image are the most popular scikit packages. They are abbreviated as sklearn and skimage.
2. This line discards the 4th channel that controls the transparency. Also note that we have introduced a new slicing syntax: three dots (...), called an ellipsis. This means “give me everything along all the leading axes, but only the first 3 steps along the last axis.”
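The ellipsis slicing can be tried on a small synthetic array standing in for an RGBA image. The array shape is an assumption, and the channel average used for the greyscale step is a simple stand-in, not the conversion performed by scikit-image.

```python
import numpy as np

# Synthetic stand-in for an RGBA image: 4 rows, 5 columns, 4 channels
im = np.ones((4, 5, 4))

rgb = im[..., :3]    # all leading axes, first 3 entries along the last axis
same = im[:, :, :3]  # the equivalent explicit slicing for a 3-D array

# Simple stand-in for a greyscale conversion: average the three color channels
grey = rgb.mean(axis=-1)
print(rgb.shape, grey.shape)
```

The advantage of the ellipsis is that the same expression keeps working if the array gains extra leading axes, for example a stack of images.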
10.2 Introduction to Jupyter Notebooks
Jupyter notebooks were originally called IPython Notebooks, where the “I” stood for “interactive”. The package was enhanced to allow any language for the backend, not just Python, so a new name was required. The word Jupyter is a partial anagram of “JUlia, PYThon, R”, which are three languages used by programmers doing numerical computations. The reference to the planet Jupiter was to indicate the strong gravitational pull of the planet pulling dozens of moons into its orbit. The programming languages are like the moons. It seems this metaphor was appropriate since there are hundreds of kernels available now.
Note that notebooks still have the .ipynb file extension regardless of which language was used within them.
In this course so far we have been focusing on using “IDEs”, but “Jupyter Notebooks” are quite popular so are worth introducing. They are a web-based pseudo-IDE. Jupyter Notebooks offer the following two benefits:
- You can mix programming-related information like code, outputs, and graphs with text-based information like headings, explanations, and equations.
- Once a notebook is complete you can save it to a permanent format like a PDF file to be shared like a normal document.
So basically, Notebooks are a way to share the results of our computations with others. They have become a popular way for programmers (such as ourselves) to do quick calculations because they make it easy for us to remember what we did and why (assuming we included good explanations!).
To launch the Jupyter environment, you can do either:
- Open Anaconda Navigator and click “Launch”
- Open the (Ana)conda prompt, and type jupyter notebook
You will end up seeing something like the following in your web-browser:
Notebooks are based on the concept of “cells”.
- Each notebook is a linear series of cells
- Cells can contain either text or code
- The output of code cells is rendered immediately below the cell
- Text cells can be formatted with a syntax called Markdown
- Text cells can also contain equations written using LaTeX syntax
When working with any program it is extremely useful to learn the shortcut keys. A list of shortcut keys for Jupyter is given in Appendix H.