Although it might not always seem like it, Python is actually very easy to use (compared to other languages). Not only does it have a very readable syntax, but it also lets us mix and match types with impunity. Python deals with the operations on different types “on-the-fly”. Consider the following case:
We have a collection where every item has a different type. The item*2 operation does something different on each loop! Python has to check the type of item EVERY. SINGLE. TIME. This really slows things down.
2
2.0
11
oneone
2
(2+0j)
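The listing that produced this output is not shown here. A minimal sketch that gives the same six lines, assuming a list holding an int, a float, two strings, a bool and a complex number (the exact contents of the original list are an assumption):

```python
# A collection where every item has a different type (contents assumed)
items = [1, 1.0, "1", "one", True, 1 + 0j]

results = []
for item in items:
    # Python inspects the type of item on every pass to decide what * means
    results.append(item * 2)
    print(item * 2)
```

Numbers are doubled, strings are repeated, and the bool is treated as the integer 1, so the same line of code takes a different path on every iteration.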
The slow-down incurred by having to check the type of each variable before doing an operation is quite problematic in numerical computing. Looping through large lists, vectors, arrays, and matrices of data is the basis for all numerical computations.
ndarray
numpy is not included in Python. We need to install it separately. If we used Anaconda to get ourselves started, then we get numpy included (along with a few hundred other packages). So despite the fact that numpy is pretty much mandatory for any numerical computing, it is still a separate package with its own “interface” that we must learn.
The main feature of numpy is a new data-type known as an ndarray. This array looks and acts a lot like a list but has one key difference:
Every item in an ndarray must be the same type!
This means that it is now possible to scan an ndarray without checking the type of each item, thus avoiding that slow process.
HOWEVER, it is not quite that simple. If we attempt to use Python for-loops to index into ndarrays, Python will still check the type on every iteration. Fortunately, numpy provides us with literally hundreds of functions that we can use to accomplish almost anything that would otherwise require for-loops.
The downside is that using numpy is like learning a language within a language.
Probably 99% of the time we can get by using some combination of numpy functions and functionality, but if we are doing something really special the use of a for-loop may be necessary. In this case we can still avoid slow Python for-loops by using a package called numba. numba lets us write functions with for-loops to scan through ndarrays, but the loops are fast. This package basically provides a way to write a for-loop that avoids the type-checking of each item. It’s a bit of a pain to use though.
It uses a technique called “just in time” compilation: it looks at the type of the input data to the function, then compiles a version of the function that works on that data type only. This is another way to avoid type checking, since it knows that the data received by the function is of one single type.
Since each element in an ndarray is of the same type, we shouldn’t be surprised that the ndarray itself has a type, and that type corresponds to the contents.
When creating an ndarray, we can optionally specify dtype, in this case int. dtype stands for “data type” and it limits which type of values can be stored within arr. If dtype is omitted, np.array infers the type from the data, while functions such as np.zeros and np.ones default to float.
Checking the type of arr confirms that it is an ndarray.
The ndarray itself has an “attribute” which tells us the type of the data inside arr. numpy gives us a bit more information than just int: it says int64, which means an integer represented using 64 bits, so it can hold a very wide range of values.
<class 'numpy.ndarray'>
int64
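The listing behind this output is not shown. A sketch consistent with the notes and the output above, assuming the array was built from a short list of integers (the values 1, 2, 3 are an assumption):

```python
import numpy as np

# Create an array from a list, asking for integer storage
arr = np.array([1, 2, 3], dtype=int)

print(type(arr))   # the type of the container itself
print(arr.dtype)   # the type of the items stored inside it
```

On most 64-bit platforms the second line prints int64, although the exact integer width that int maps to can depend on the platform and numpy version.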
We can create arrays with all the usual types:
arr2 = np.array([1, 2, 3], dtype=float)
print(arr2)
arr3 = np.array([0, 1, 0], dtype=bool)
print(arr3)
arr4 = np.array([1, 2, 3], dtype=complex)
print(arr4)
[1. 2. 3.]
[False True False]
[1.+0.j 2.+0.j 3.+0.j]
You can also specify the number of bits to use, to manually balance higher precision and range (more bits) against lower memory usage (fewer bits). Operations on arrays with fewer bits are also often faster since there is less “stuff” to compute. numpy includes these extra types, such as np.int8 and np.float128 (the latter is only available on some platforms). We can confirm that they do take up different amounts of space:
import numpy as np
arr2 = np.array([1, 2, 3], dtype=np.int8)
print(arr2.nbytes)
arr3 = np.array([1, 2, 3], dtype=np.int16)
print(arr3.nbytes)
arr4 = np.array([1, 2, 3], dtype=np.float64)
print(arr4.nbytes)
3
6
24
It is easy to convert between types using the astype() method attached to each ndarray:
arr = np.array([1, 2, 3])
arr = arr.astype(int)
print(arr)
[1 2 3]
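The conversion is easier to see when the starting values are not already integers. A small sketch, using made-up float values, which shows that astype(int) discards the fractional part rather than rounding, and that it returns a new array:

```python
import numpy as np

arr_float = np.array([1.7, 2.2, -3.9])
arr_int = arr_float.astype(int)   # fractional parts are dropped, not rounded

print(arr_float.dtype, arr_int.dtype)
print(arr_int)     # the converted copy
print(arr_float)   # the original array is left unchanged
```

If rounding is wanted instead, one option is to call np.round on the array before converting.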
One of the ways that numpy lets us avoid the use of for-loops is by changing the way mathematical operations like + and * work.
Recall that for lists, multiplication by 2 doubled the length of the list:
arr = [1, 2, 3] * 2
print(arr)
[1, 2, 3, 1, 2, 3]
With numpy, multiplication does actual math:
import numpy as np
arr = np.array([1, 2, 3]) * 2
print(arr)
[2 4 6]
This is called an “elementwise” operation because it operates on each element, rather than on the whole array.
numpy also allows for elementwise operations between 2 arrays:
arr1 = np.array([1, 2, 3])
arr2 = np.array([10, 20, 30])
print(arr1/arr2)
[0.1 0.1 0.1]
All mathematical operations are supported:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print(arr1 + arr2)
print(arr1 - arr2)
print(arr1 * arr2)
print(arr1 / arr2)
print(arr1 // arr2)
print(arr1 % arr2)
print(arr1**arr2)
[5 7 9]
[-3 -3 -3]
[ 4 10 18]
[0.25 0.4 0.5 ]
[0 0 0]
[1 2 3]
[ 1 32 729]
In all cases, each of the above lines performs the stated operation using element i of arr1 and element i of arr2.
When we perform elementwise operations with ndarrays, what actually happens is that “behind the scenes” some special code is run which performs for-loops that do not check the type. In this way we can achieve much faster speeds.
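To make the equivalence concrete, here is a small sketch comparing an explicit Python loop with the elementwise form. The array size and the variable names are arbitrary choices for illustration:

```python
import numpy as np

arr = np.arange(1000)

# Slow way: a Python for-loop that visits every element one at a time
doubled_loop = np.zeros_like(arr)
for i in range(arr.size):
    doubled_loop[i] = arr[i] * 2

# Fast way: one elementwise operation, the loop runs inside numpy
doubled_vec = arr * 2

print(np.array_equal(doubled_loop, doubled_vec))
```

Both approaches give the same values. Timing the two with a tool such as the timeit module is a simple way to see the difference in speed on a given machine.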
In the above example we used elementwise operations to perform an operation on an image, providing a huge speed-up. The lesson is that we should NOT use for-loops to process the individual elements of a big array. However, it is still OK to use a for-loop to process a big batch of images, using numpy functions inside the loop, like this:
ims = [im1, im2, im3, im4, im5]
mx = []
for im in ims:
    mx.append(im.max())
The point is that looping over a small number of items and doing big calculations on each loop is fine, common, and usually unavoidable.
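A self-contained version of this pattern might look as follows. The five random arrays are hypothetical stand-ins for real images, and the 200 by 200 size is an arbitrary choice:

```python
import numpy as np

# Five stand-in "images" filled with random values (hypothetical data)
ims = [np.random.rand(200, 200) for _ in range(5)]

mx = []
for im in ims:            # only 5 iterations, so the Python loop is cheap
    mx.append(im.max())   # the heavy work on 40,000 pixels happens in numpy

print(len(mx))
```

The Python loop runs only once per image, while the scan over every pixel is left to numpy.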
When we looked at Python containers like lists and strings, we used the various methods that were attached to them. These methods are just functions which operate on the object, so vals.sort() does much the same job as sorted(vals), except that it sorts the list in place rather than returning a new one.
ndarrays also have a lot of methods attached to them, which we’ll cover later. They also have a lot of “attributes” attached to them. Attributes store information about the array, such as size and shape. A list of useful “attributes” of ndarrays is given below:
Attributes of ndarrays
Attributes | Description |
---|---|
ndim | Number of dimensions of the array |
size | Number of elements in the array |
shape | The size of the array in each dimension |
dtype | Data type of elements in the array |
itemsize | The size (in bytes) of each element in the array |
data | The buffer containing actual elements of the array in memory |
T | View of the transposed array |
flags | Information about the memory layout of the array |
flat | A 1-D iterator over the array |
imag | The imaginary part of the array |
real | The real part of the array |
nbytes | Total bytes consumed by the elements of the array |
The shape, size and ndim attributes get used a lot, while the rest are mostly there for debugging.
import numpy as np
arr = np.array([[2, 3, 4], [3, 4, 5]])
print(arr.shape)
print(arr.size)
print(arr.ndim)
(2, 3)
6
2
The attributes discussed above are values which do not need to be computed, like arr.size. There are many other properties of an array which we would like to know, such as the maximum or minimum value within it. This sort of information must be calculated on demand using the methods attached to ndarrays.
arr = np.array([4, 3, 6])
print(arr.max())
print(arr.min())
6
3
Note that because these methods require an actual computation on the array, it is best to store their result if it needs to be used many times:
arr = np.array([4, 3, 6])
amax = arr.max()
Table 8.2: Methods of ndarrays which calculate some property based on the array contents
Method | Description |
---|---|
max | Return the maximum along a given axis. |
min | Return the minimum along a given axis. |
trace | Returns the sum along diagonals of the array. |
sum | Return the sum of the array elements over the given axis. |
cumsum | Returns the cumulative sum of the elements along the given axis. |
mean | Returns the average of the array elements along given axis. |
var | Returns the variance of the array elements, along given axis. |
std | Returns the standard deviation of the array elements along given axis. |
prod | Returns the product of the array elements over the given axis. |
cumprod | Returns the cumulative product of the elements along the given axis. |
all | Returns True if all elements evaluate to True. |
any | Returns True if any of the elements evaluate to True. |
By default all of the methods in Table 8.2 operate on the whole array. For instance, max returns a single value which is the maximum value for the entire array. Most of these methods also accept an axis argument, in which case a new ndarray is returned which is one dimension smaller than the original array, containing the result obtained by only looking at values along the given axis.
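As an illustration of the axis argument, here is a short sketch on a small 2D array. The values are made up for the example:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.max())         # 6, the maximum of the whole array
print(arr.max(axis=0))   # [4 5 6], the maximum down each column
print(arr.max(axis=1))   # [3 6], the maximum along each row
print(arr.sum(axis=0))   # [5 7 9], the column totals
```

The 2D input has shape (2, 3). Reducing along axis 0 leaves a 1D result of length 3, and reducing along axis 1 leaves a 1D result of length 2.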
Multidimensional arrays are very common in numerical programming. Some examples are:
[200, 200, 3]
As “simple humans” we can visualize 1D, 2D and 3D images and arrays quite well, but not higher. Figure 8.5 shows this progression. It also includes the definition of array “shape”:
It is actually possible to use the “list of lists” style indexing with ndarrays, as shown below:
import numpy as np
arr = np.random.rand(5, 5)
print(arr[0][0])
print(arr[0, 0])
0.2977770674188823
0.2977770674188823
But numpy offers several special features which require the [0, 0] style indexing, so it’s necessary to learn this. For instance, we can extract a subsection of an array using:
import numpy as np
arr = np.random.rand(9, 9)
arr2 = arr[:3, :3]
print(arr2)
[[0.08277747 0.7282641 0.79949882]
[0.79218985 0.45573028 0.80296608]
[0.3168688 0.56676578 0.97257359]]
ndarrays
Indexing an ndarray behaves a bit differently than we have seen with nested lists. Recall that indexing into a nested list (i.e. a “list of lists”) worked as follows:
With the first [0], we retrieve the list stored at location 0.
With the second [0], we index into the list we obtained from the first [0].
[1, 2, 3]
1
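The nested-list listing is not shown above. A sketch that reproduces this output, assuming the same 3 by 3 values used in the ndarray example that follows (only the first row is confirmed by the output, the rest is an assumption):

```python
# A "list of lists" (rows after the first are assumed)
vals = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

row = vals[0]        # the first [0] retrieves the inner list at location 0
print(row)
print(vals[0][0])    # the second [0] indexes into that inner list
```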
For an ndarray, the indexing is done within a single pair of brackets:
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
print(arr[0, 0])
We specify row and col simultaneously and receive the value stored at that location.
1
Visualizing 4D and higher becomes a challenge. Humans are fundamentally incapable of visualizing higher dimensions, so we need some tricks, as illustrated in the following example.
It is sometimes helpful to think of multidimensional arrays as being reshaped to 1D arrays. In fact, this is how computers “think” about them: the values are laid out as a 1D block in memory. We can change the shape of any array as follows:
arr = np.arange(24)
print(arr)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
arr = arr.reshape([4, 6])
print(arr)
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]]
arr = arr.reshape([2, 12])
print(arr)
[[ 0 1 2 3 4 5 6 7 8 9 10 11]
[12 13 14 15 16 17 18 19 20 21 22 23]]
We can swap between any shapes as long as the total number of elements is conserved.
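A short sketch of what "conserved" means in practice. The target shapes here are chosen only for illustration:

```python
import numpy as np

arr = np.arange(24)

cube = arr.reshape([2, 3, 4])   # 2 * 3 * 4 = 24 elements, so this works
auto = arr.reshape([-1, 8])     # -1 asks numpy to work out that dimension

try:
    arr.reshape([5, 5])         # 25 slots for 24 elements cannot work
    failed = False
except ValueError:
    failed = True

print(cube.shape, auto.shape, failed)
```

A mismatch in the total number of elements raises a ValueError rather than silently padding or truncating the data.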
We can also retrieve subsections of an array using “slicing” that we discussed in the previous chapter:
arr = np.arange(25)
arr = arr.reshape([5, 5])
print(arr)
print(arr[0:3, 0:3])
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
[[ 0 1 2]
[ 5 6 7]
[10 11 12]]
We can also write to subsections of an array:
arr = np.arange(25)
arr = arr.reshape([5, 5])
arr[0:3, 0:3] = 0
print(arr)
[[ 0 0 0 3 4]
[ 0 0 0 8 9]
[ 0 0 0 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
The defaults also apply, so we can omit some values:
arr = np.arange(25)
arr = arr.reshape([5, 5])
arr[:3, 3:] = 0
print(arr)
:3 means the first 3 rows (0, 1 and 2), but does not include row 3.
[[ 0 1 2 0 0]
[ 5 6 7 0 0]
[10 11 12 0 0]
[15 16 17 18 19]
[20 21 22 23 24]]
As with mathematical functions, logical comparisons are also done elementwise:
We find the locations where the values in arr are less than 0.5 and store the result in mask.
Printing mask reveals that it contains True and False values, which are the result of the logical comparison.
[[ True True False False True]
[False False False False False]
[ True False True True True]
[ True True True False False]
[ True False False True False]]
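The listing for this comparison is not shown. A sketch consistent with the notes above, assuming a 5 by 5 array of random values as in the later examples:

```python
import numpy as np

arr = np.random.rand(5, 5)   # random values between 0 and 1
mask = arr < 0.5             # elementwise comparison gives a boolean array

print(mask)
```

Because the values are random, the pattern of True and False will differ from run to run, but mask always has the same shape as arr.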
We can use ndarrays filled with boolean values as “masks” to retrieve or write values from the locations where the mask is True.
mask will contain a True value in all locations where arr < 0.1.
We then retrieve the values in arr that were < 0.1 using mask for the index.
[0.04582099 0.03236906 0.03972382]
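The retrieval step described above might be sketched like this, again assuming a 5 by 5 array of random values (the name small is a hypothetical choice):

```python
import numpy as np

arr = np.random.rand(5, 5)
mask = arr < 0.1       # True wherever the value is below 0.1
small = arr[mask]      # indexing with the mask keeps only those values

print(small)
```

Note that the result is always a 1D array, since the selected locations generally do not form a rectangular block.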
We can also use masks to write values:
import numpy as np
arr = np.random.rand(5, 5)
mask = arr < 0.5
arr[mask] = 0.0
print(arr)
This tells numpy to put the value of 0.0 into arr at all locations where mask is True.
[[0. 0.74978164 0. 0. 0.94784757]
[0.52248844 0. 0. 0.56699272 0.90824512]
[0. 0. 0. 0. 0. ]
[0.74340212 0.58336774 0.94278052 0.6906021 0. ]
[0.58078615 0.99266985 0.90875989 0. 0.5816829 ]]
Fancy indexing works like the masks discussed above, but with actual numerical index values:
[0. 0.08517893 0.19606165 0.26219769 0. 0.
0.47375759 0.72581009 0. 0.84949167]
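The listing for this 1D case is not shown. Judging from the positions of the zeros in the output, a sketch along these lines would behave the same way (the index values and the use of random data are assumptions):

```python
import numpy as np

arr = np.random.rand(10)
ind = [0, 4, 5, 8]     # positions to overwrite (assumed from the output)
arr[ind] = 0.0         # a list of integers selects several items at once

print(arr)
```

The same list can also be used to read values, for example arr[ind] returns a new array holding the items at those four positions.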
When working with higher-dimensional arrays (2D, 3D, etc), we must specify the index of each axis in its own list (like above for 1D), but then combine each list in a tuple:
import numpy as np
arr = np.random.rand(5, 5)
ind_x = [0, 0, 1, 3, 3]
ind_y = [0, 1, 3, 3, 4]
indices = (ind_x, ind_y)
arr[indices] = 0.0
print(arr)
[[0. 0. 0.1474802 0.33153472 0.21024793]
[0.14531768 0.19587551 0.56404873 0. 0.85686775]
[0.91742026 0.9183996 0.7044764 0.26802012 0.46294791]
[0.82980378 0.75089697 0.51382404 0. 0. ]
[0.03311313 0.85773795 0.05355766 0.87403983 0.72244447]]
In all of the above code snippets and examples we have seen that ndarrays can be created from lists as follows:
vals = [1, 2, 3, 4, 5]
arr = np.array(vals, dtype=int)
print(arr)
[1 2 3 4 5]
It is also possible to convert an ndarray back to a list:
vals = [1, 2, 3, 4, 5]
arr = np.array(vals, dtype=int)
vals2 = arr.tolist()
print(type(vals2))
<class 'list'>
Often we want to create an ndarray directly. We can create an empty array with a given shape and type:
arr = np.ndarray(shape=[5, 2])
print(arr)
[[0. 0.08517893]
[0.19606165 0.26219769]
[0. 0. ]
[0.47375759 0.72581009]
[0. 0.84949167]]
Note that the above array was filled with gibberish. These values are the numerical representation of whatever leftover data was stored in the memory that was assigned to hold arr. It is probably more useful to create an array of 1’s or 0’s, so numpy provides functions for that:
arr1 = np.ones(5, dtype=int)
arr2 = np.zeros(5, dtype=int)
print(arr1)
print(arr2)
[1 1 1 1 1]
[0 0 0 0 0]
Often we want to combine two 1D arrays into a 2D array, or vice versa. numpy uses the term “stack” to refer to joining arrays. This refers to the fact that arrays “stack” like blocks, either beside or on top of each other.
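A brief sketch of stacking with three of numpy's joining functions. The arrays a and b are made-up inputs for illustration:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

rows = np.vstack([a, b])         # on top of each other, shape (2, 3)
cols = np.column_stack([a, b])   # side by side as columns, shape (3, 2)
flat = np.hstack([a, b])         # end to end, shape (6,)

print(rows)
print(cols)
print(flat)
```

Each function takes a list of arrays, so more than two arrays can be joined in a single call.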
Splitting arrays into smaller ones is also fairly common. We have already seen that slice indexing can be used for this, such as:
arr = np.arange(16).reshape([4, 4])
a1 = arr[:2, :]
a2 = arr[2:, :]
print(a1)
print(a2)
[[0 1 2 3]
[4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]
However, it is often more convenient to use numpy's built-in functions. This makes mistakes less likely, and it is also fewer lines of code:
arr = np.arange(16).reshape([4, 4])
a1, a2 = np.vsplit(arr, 2)
print(a1)
print(a2)
[[0 1 2 3]
[4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]