diff --git a/source/_static/lecture_specific/parallelization/htop_parallel_npmat.png b/source/_static/lecture_specific/parallelization/htop_parallel_npmat.png new file mode 100644 index 00000000..ddb95667 Binary files /dev/null and b/source/_static/lecture_specific/parallelization/htop_parallel_npmat.png differ diff --git a/source/rst/index_python_scientific_libraries.rst b/source/rst/index_python_scientific_libraries.rst index da5f78d4..f7d0911f 100644 --- a/source/rst/index_python_scientific_libraries.rst +++ b/source/rst/index_python_scientific_libraries.rst @@ -23,4 +23,4 @@ Next we cover the third party libraries most useful for scientific work in Pytho matplotlib scipy numba - sci_libs + parallelization diff --git a/source/rst/need_for_speed.rst b/source/rst/need_for_speed.rst index 3620dd3f..e9d0140a 100644 --- a/source/rst/need_for_speed.rst +++ b/source/rst/need_for_speed.rst @@ -61,11 +61,14 @@ routines we want to use. For example, it's almost always better to use an existing routine for root finding than to write a new one from scratch. +(For standard algorithms, efficiency is maximized if the community can coordinate on a +common set of implementations, written by experts and tuned by users to be as fast and robust as possible.) + But this is not the only reason that we use Python's scientific libraries. Another is that pure Python, while flexible and elegant, is not fast. -So we need libraries that help us accelerate our Python code. +So we need libraries that are designed to accelerate execution of Python code. As we'll see below, there are now Python libraries that can do this extremely well. @@ -131,17 +134,19 @@ Indeed, the standard implementation of Python (called CPython) cannot match the Does that mean that we should just switch to C or Fortran for everything? -The answer is: no, no and one hundred times no! +The answer is: No, no and one hundred times no! + +(This is what you should say to the senior professor insisting that the model +needs to be rewritten in Fortran or C++.) There are two reasons why: First, for any given program, relatively few lines are ever going to be time-critical. -Hence we should write most of our code in a high productivity language like -Python. +Hence it is far more efficient to write most of our code in a high productivity language like Python. -Second, for those lines of code that *are* time-critical, we can now achieve the same speed as C or Fortran using Python's scientific libraries. +Second, even for those lines of code that *are* time-critical, we can now achieve the same speed as C or Fortran using Python's scientific libraries. Where are the Bottlenecks? @@ -150,6 +155,8 @@ Where are the Bottlenecks? Before we learn how to do this, let's try to understand why plain vanila Python is slower than C or Fortran. +This will, in turn, help us figure out how to speed things up. + Dynamic Typing ^^^^^^^^^^^^^^ @@ -281,16 +288,19 @@ Let's look at some ways around these problems. There is a clever method called **vectorization** that can be used to speed up high level languages in numerical applications. -The key idea is to send array processing operations in batch to precompiled +The key idea is to send array processing operations in batch to pre-compiled and efficient native machine code. The machine code itself is typically compiled from carefully optimized C or Fortran. 
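+
+Here is a small sketch of the idea (the array and the operation are arbitrary,
+chosen purely for illustration): the two snippets below compute the same thing,
+but the second hands the entire array to NumPy's precompiled routines in a
+single batch.
+
+.. code-block:: python3
+
+    import numpy as np
+
+    a = np.random.randn(1_000_000)
+
+    # Pure Python: the interpreter processes one element at a time
+    b = np.empty_like(a)
+    for i in range(len(a)):
+        b[i] = 2.0 * a[i] + 1.0
+
+    # Vectorized: the whole array is processed in compiled code in one step
+    c = 2.0 * a + 1.0
+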
+For example, when working in a high level language, the operation of inverting a large matrix can be subcontracted to efficient machine code that is pre-compiled for this purpose and supplied to users as part of a package. + This clever idea dates back to MATLAB, which uses vectorization extensively. -Vectorization can greatly accelerate many (but not all) numerical computations. +Vectorization can greatly accelerate many numerical computations (but not all, +as we shall see). -Let's see how it works in Python, using NumPy. +Let's see how vectorization works in Python, using NumPy. Operations on Arrays @@ -471,13 +481,15 @@ In the vectorized version, all the looping takes place in compiled code. As you can see, the second version is **much** faster. -(We'll make it even faster again below when we discuss Numba) +(We'll make it even faster again later on, using more scientific programming tricks.) + .. _numba-p_c_vectorization: -Pros and Cons of Vectorization ------------------------------- +Beyond Vectorization +==================== + At its best, vectorization yields fast, simple code. @@ -488,17 +500,17 @@ One issue is that it can be highly memory-intensive. For example, the vectorized maximization routine above is far more memory intensive than the non-vectorized version that preceded it. +This is because vectorization tends to create many intermediate arrays before +producing the final calculation. + Another issue is that not all algorithms can be vectorized. In these kinds of settings, we need to go back to loops. -Fortunately, there are nice ways to speed up Python loops. - - -Beyond Vectorization -==================== +Fortunately, there are alternative ways to speed up Python loops that work in +almost any setting. -In the last few years, a new Python library called `Numba +For example, in the last few years, a new Python library called `Numba `__ has appeared that solves the main problems with vectorization listed above. diff --git a/source/rst/numba.rst b/source/rst/numba.rst index 613345e1..77b02223 100644 --- a/source/rst/numba.rst +++ b/source/rst/numba.rst @@ -32,20 +32,21 @@ Let's start with some imports: Overview ======== -In our lecture on :doc:`NumPy `, we learned one method to improve speed -and efficiency in numerical work. +In an :doc:`earlier lecture ` we learned about vectorization, which is one method to improve speed and efficiency in numerical work. -That method, called *vectorization*, involved sending array processing +Vectorization involves sending array processing operations in batch to efficient low-level code. -Unfortunately, as we :ref:`discussed previously `, vectorization is limited and has several weaknesses. +However, as :ref:`discussed previously `, vectorization has several weaknesses. -One is that it is highly memory-intensive. +One is that it is highly memory-intensive when working with large amounts of data. -Another is that only some algorithms can be vectorized. +Another is that the set of algorithms that can be entirely vectorized is not universal. -In the last few years, a new Python library called `Numba -`__ has appeared that solves many of these problems. +In fact, for some algorithms, vectorization is entirely ineffective. + +Fortunately, a new Python library called `Numba `__ +solves many of these problems. It does so through something called **just in time (JIT) compilation**. 
@@ -53,8 +54,7 @@ JIT compilation is effective in many numerical settings and can generate extreme It can also do other tricks such as multithreading (a form of parallelization well suited to numerical work). -Numba will be a key part of our lectures --- especially those lectures -involving dynamic programming. +Numba will be a key part of our lectures --- especially those lectures involving dynamic programming. This lecture introduces the main ideas. @@ -67,16 +67,18 @@ Numba aims to automatically compile functions to native machine code instruction When it succeeds, the compiled code is extremely fast. But the process isn't flawless, since Numba needs to infer type information on -all variables to generate pure machine instructions. +all variables to generate fast machine-level instructions. + +(See our :doc:`earlier lecture ` on scientific computing for a discussion of types.) -Inference isn't possible in every setting, so you will need to invest some time in learning how Numba works. +Such inference isn't possible in every setting, so you will need to invest some time in learning how Numba works. However, you will find that, for simple routines, Numba infers types very well. -Moreover, the "hot loops" that we actually need to speed up are often -relatively simple. +Moreover, the "hot loops" that we actually need to speed up often fit into this category. -This explains why, despite its imperfections, we are huge fans of Numba at QuantEcon! +This explains why, despite its imperfections, Numba is used to accelerate a +great deal of our code. .. _numba_link: @@ -96,12 +98,9 @@ machine code during runtime. An Example ---------- -Let's consider some problems that are difficult to vectorize. - -One is generating the trajectory of a difference equation given an initial -condition. +Let's consider a problem that is difficult to vectorize: generating the trajectory of a difference equation given an initial condition. -Let's take the difference equation to be the quadratic map +We will take the difference equation to be the quadratic map .. math:: @@ -131,17 +130,23 @@ To speed this up using Numba is trivial using Numba's ``jit`` function from numba import jit - qm_numba = jit(qm) # qm_numba is now a 'compiled' version of qm + qm_numba = jit(qm) + +The function `qm_numba` is a version of `qm` that is "targeted" for +JIT-compilation. + +We will explain what this means momentarily. -Let's time and compare identical function calls across these two versions: +First let's time and compare identical function calls across these two versions, starting with the original function `qm`: .. code-block:: python3 - n = 1_000_000 + n = 10_000_000 qe.util.tic() qm(0.1, int(n)) time1 = qe.util.toc() +Now let's try `qm_numba` .. code-block:: python3 @@ -149,10 +154,9 @@ Let's time and compare identical function calls across these two versions: qm_numba(0.1, int(n)) time2 = qe.util.toc() +This is already a massive speed gain. -The first execution is relatively slow because of JIT compilation (see below). - -Next time and all subsequent times it runs much faster: +In fact, the next time and all subsequent times it runs even faster: .. _qm_numba_result: @@ -166,14 +170,44 @@ Next time and all subsequent times it runs much faster: time1 / time2 # Calculate speed gain -On our machines the output here is typically a two orders of magnitude speed gain. -(Your mileage will vary depending on hardware and so on.) +This kind of speed gain is huge relative to how simple and clear the implementation is. 
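+
+If you are curious about what Numba has done behind the scenes, the compiled
+function keeps a record of the type signatures it has generated machine code
+for (a minimal sketch, assuming a recent version of Numba; the exact types
+listed depend on your platform):
+
+.. code-block:: python3
+
+    qm_numba.signatures  # Argument types compiled so far, e.g. (float64, int64)
+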
+ + +How and When it Works +--------------------- + +Numba attempts to generate fast machine code using the infrastructure provided by the `LLVM Project `_. + +It does this by inferring type information on the fly. + +The basic idea is this: + +* Python is very flexible and hence we could call the function `qm` with many + types. + + * e.g., `x0` could be a NumPy array or a list, `n` could be an integer or a float, etc. + +* This makes it hard to *pre*-compile the function. + +* However, when we do actually call the function, by executing `qm(0.5, 10)`, + say, the types of `x0` and `n` become clear. + +* Moreover, the types of other variables in `qm` become clear at this point. + +* So the strategy of Numba and other JIT compilers is to wait until this + moment, and *then* compile the function. + +That's why it is called "just-in-time" compilation. + +Note that, if you make the call `qm(0.5, 10)` and then follow it with `qm(0.9, 20)`, compilation only takes place on the first call. + +The compiled code is then cached and recycled as required. + -Nonetheless, this kind of speed gain is huge relative to how simple and clear the implementation is. Decorator Notation -^^^^^^^^^^^^^^^^^^ +------------------ If you don't need a separate name for the "numbafied" version of ``qm``, you can just put ``@jit`` before the function @@ -192,14 +226,15 @@ you can just put ``@jit`` before the function This is equivalent to ``qm = jit(qm)``. -How and When it Works ---------------------- -Numba attempts to generate fast machine code using the infrastructure provided by the `LLVM Project `_. -It does this by inferring type information on the fly. +How Well Does Numba Work? +------------------------- + +It is clear from the above discussion that type inference is a key part of JIT +compilation. -As you can imagine, this is easier for simple Python objects (simple scalar data types, such as floats, integers, etc.). +As you can imagine, inferring types is easier for simple Python objects (e.g., simple scalar data types, such as floats and integers). Numba also plays well with NumPy arrays, which it treats as typed memory regions. @@ -213,13 +248,19 @@ When Numba cannot infer all type information, some Python objects are given gene In this second setting, Numba typically provides only minor speed gains --- or none at all. +* Note that you can force an error when this occurs, so you know what is + happening, by using `@jit(nopython=True)` or `@njit` instead of `@jit`. + Hence, it's prudent when using Numba to focus on speeding up small, time-critical snippets of code. This will give you much better performance than blanketing your Python programs with ``@jit`` statements. + A Gotcha: Global Variables -^^^^^^^^^^^^^^^^^^^^^^^^^^ +-------------------------- + +Here's one thing to be careful about when using Numba. Consider the following example @@ -247,70 +288,6 @@ When Numba compiles machine code for functions, it treats global variables as co -Numba for Vectorization ------------------------ - -Numba can also be used to create custom :ref:`ufuncs ` with the `@vectorize `__ decorator. - -To illustrate the advantage of using Numba to vectorize a function, we -return to a maximization problem :ref:`discussed previously ` - -.. code-block:: python3 - - def f(x, y): - return np.cos(x**2 + y**2) / (1 + x**2 + y**2) - - grid = np.linspace(-3, 3, 1000) - x, y = np.meshgrid(grid, grid) - - qe.tic() - np.max(f(x, y)) - qe.toc() - -This is plain vanilla vectorized code using NumPy. 
- -Here's a Numba version that does the same task, using `@vectorize` - -.. code-block:: python3 - - from numba import vectorize - - @vectorize - def f_vec(x, y): - return np.cos(x**2 + y**2) / (1 + x**2 + y**2) - - grid = np.linspace(-3, 3, 1000) - x, y = np.meshgrid(grid, grid) - - np.max(f_vec(x, y)) # Run once to compile - - qe.tic() - np.max(f_vec(x, y)) - qe.toc() - -So far there's no real advantage to using the `@vectorize` version. - -However, we can gain further speed improvements using Numba's automatic -parallelization feature by specifying ``target='parallel'``. - -In this case, we need to specify the types of our inputs and outputs - -.. code-block:: python3 - - @vectorize('float64(float64, float64)', target='parallel') - def f_vec(x, y): - return np.cos(x**2 + y**2) / (1 + x**2 + y**2) - - np.max(f_vec(x, y)) # Run once to compile - - qe.tic() - np.max(f_vec(x, y)) - qe.toc() - -Now our code runs significantly faster than the NumPy version. - -We'll discuss parallelization again below. - Compiling Classes ================== @@ -318,7 +295,7 @@ As mentioned above, at present Numba can only compile a subset of Python. However, that subset is ever expanding. -Numba is now quite effective at compiling classes. +For example, Numba is now quite effective at compiling classes. If a class is successfully compiled, then its methods acts as JIT-compiled functions. @@ -424,3 +401,181 @@ When we call the methods in the class, the methods are compiled just like functi plt.show() + + +Alternatives to Numba +===================== + +.. index:: + single: Python; Cython + + +There are additional options for accelerating Python loops. + +Here we quickly review them. + +However, we do so only for interest and completeness. + +If you prefer, you can safely skip this section. + +Cython +------ + +Like :doc:`Numba `, `Cython `__ provides an approach to generating fast compiled code that can be used from Python. + +As was the case with Numba, a key problem is the fact that Python is dynamically typed. + +As you'll recall, Numba solves this problem (where possible) by inferring type. + +Cython's approach is different --- programmers add type definitions directly to their "Python" code. + +As such, the Cython language can be thought of as Python with type definitions. + +In addition to a language specification, Cython is also a language translator, transforming Cython code into optimized C and C++ code. + +Cython also takes care of building language extensions --- the wrapper code that interfaces between the resulting compiled code and Python. + +While Cython has certain advantages, we generally find it both slower and more +cumbersome than Numba. + +Interfacing with Fortran via F2Py +--------------------------------- + +.. index:: + single: Python; Interfacing with Fortran + +If you are comfortable writing Fortran you will find it very easy to create +extension modules from Fortran code using `F2Py +`_. + +F2Py is a Fortran-to-Python interface generator that is particularly simple to +use. + +Robert Johansson provides a `nice introduction +`_ +to F2Py, among other things. + +Recently, `a Jupyter cell magic for Fortran +`_ has been developed --- you might want to give it a try. + + + + +Exercises +========= + +.. _speed_ex1: + +Exercise 1 +---------- + +Later we'll learn all about :doc:`finite-state Markov chains `. + +For now, let's just concentrate on simulating a very simple example of such a chain. + +Suppose that the volatility of returns on an asset can be in one of two regimes --- high or low. 
+ +The transition probabilities across states are as follows + +.. figure:: /_static/lecture_specific/sci_libs/nfs_ex1.png + + +For example, let the period length be one day, and suppose the current state is high. + +We see from the graph that the state tomorrow will be + +* high with probability 0.8 + +* low with probability 0.2 + +Your task is to simulate a sequence of daily volatility states according to this rule. + +Set the length of the sequence to ``n = 1_000_000`` and start in the high state. + +Implement a pure Python version and a Numba version, and compare speeds. + +To test your code, evaluate the fraction of time that the chain spends in the low state. + +If your code is correct, it should be about 2/3. + + + +Solutions +========= + + + +Exercise 1 +---------- + +We let + +- 0 represent "low" +- 1 represent "high" + +.. code-block:: python3 + + p, q = 0.1, 0.2 # Prob of leaving low and high state respectively + +Here's a pure Python version of the function + +.. code-block:: python3 + + def compute_series(n): + x = np.empty(n, dtype=np.int_) + x[0] = 1 # Start in state 1 + U = np.random.uniform(0, 1, size=n) + for t in range(1, n): + current_x = x[t-1] + if current_x == 0: + x[t] = U[t] < p + else: + x[t] = U[t] > q + return x + +Let's run this code and check that the fraction of time spent in the low +state is about 0.666 + +.. code-block:: python3 + + n = 1_000_000 + x = compute_series(n) + print(np.mean(x == 0)) # Fraction of time x is in state 0 + + +Now let's time it + +.. code-block:: python3 + + qe.util.tic() + compute_series(n) + qe.util.toc() + + +Next let's implement a Numba version, which is easy + +.. code-block:: python3 + + from numba import jit + + compute_series_numba = jit(compute_series) + +Let's check we still get the right numbers + +.. code-block:: python3 + + x = compute_series_numba(n) + print(np.mean(x == 0)) + + +Let's see the time + +.. code-block:: python3 + + qe.util.tic() + compute_series_numba(n) + qe.util.toc() + + +This is a nice speed improvement for one line of code! + diff --git a/source/rst/parallelization.rst b/source/rst/parallelization.rst new file mode 100644 index 00000000..be515871 --- /dev/null +++ b/source/rst/parallelization.rst @@ -0,0 +1,440 @@ +.. _parallel: + +.. include:: /_static/includes/header.raw + +*************** +Parallelization +*************** + +.. contents:: :depth: 2 + +In addition to what's in Anaconda, this lecture will need the following libraries: + +.. code-block:: ipython + :class: hide-output + + !pip install --upgrade quantecon + + +**Co-author: Matthew McKay** + + +Overview +======== + + +The growth of CPU clock speed (i.e., the speed at which a single chain of logic can +be run) has slowed dramatically in recent years. + +This is unlikely to change in the near future, due to inherent physical +limitations on the construction of chips and circuit boards. + +Chip designers and computer programmers have responded to the slowdown by +seeking a different path to fast execution: parallelization. + +Hardware makers have increased the number of cores (physical CPUs) embedded in each machine. + +For programmers, the challenge has been to exploit these multiple CPUs by running many processes in parallel (i.e., simultaneously). + +This is particularly important in scientific programming, which requires handling + +* large amounts of data and + +* CPU intensive simulations and other calculations. + +In this lecture we discuss parallelization for scientific computing, with a focus on + +#. 
the best tools for parallelization in Python and + +#. how these tools can be applied to quantitative economic problems. + + +Let's start with some imports: + +.. code-block:: ipython + + import numpy as np + import quantecon as qe + import matplotlib.pyplot as plt + + %matplotlib inline + + +Types of Parallelization +======================== + +Large textbooks have been written on different approaches to parallelization but we will keep a tight focus on what's most useful to us. + +We will briefly review the two main kinds of parallelization in common use in +scientific computing and discuss their pros and cons. + + +Multiprocessing +--------------- + +Multiprocessing means concurrent execution of multiple processes using more than one processor. + +In this context, a **process** is a chain of instructions (i.e., a program). + +Multiprocessing can be carried out on one machine with multiple CPUs or on a +collection of machines connected by a network. + +In the latter case, the collection of machines is usually called a +**cluster**. + +With multiprocessing, each process has its own memory space, although the +physical memory chip might be shared. + + +Multithreading +-------------- + +Multithreading is similar to multiprocessing, except that, during execution, the threads all share the same memory space. + +Native Python struggles to implement multithreading due to some `legacy design +features `__. + +But this is not a restriction for scientific libraries like NumPy and Numba. + +Functions imported from these libraries and JIT-compiled code run in low level +execution environments where Python's legacy restrictions don't apply. + + +Advantages and Disadvantages +---------------------------- + +Multithreading is more lightweight because most system and memory resources +are shared by the threads. + +In addition, the fact that multiple threads all access a shared pool of memory +is extremely convenient for numerical programming. + +On the other hand, multiprocessing is more flexible and can be distributed +across clusters. + +For the great majority of what we do in these lectures, multithreading will +suffice. + + +Implicit Multithreading in NumPy +================================ + +Actually, you have already been using multithreading in your Python code, +although you might not have realized it. + +(We are, as usual, assuming that you are running the latest version of +Anaconda Python.) + +This is because NumPy cleverly implements multithreading in a lot of its +compiled code. + +Let's look at some examples to see this in action. + +A Matrix Operation +------------------ + +The next piece of code computes the eigenvalues of a large number of randomly +generated matrices. + +It takes a few seconds to run. + +.. code-block:: python3 + + n = 20 + m = 1000 + for i in range(n): + X = np.random.randn(m, m) + λ = np.linalg.eigvals(X) + +Now, let's look at the output of the `htop` system monitor on our machine while +this code is running: + +.. figure:: /_static/lecture_specific/parallelization/htop_parallel_npmat.png + +We can see that 4 of the 8 cores are running at full speed. + + +This is because NumPy's ``eigvals`` routine neatly splits up the tasks and +distributes them to different threads. + + +A Multithreaded Ufunc +--------------------- + +Over the last few years, NumPy has managed to push this kind of multithreading +out to more and more operations. + +For example, let's return to a maximization problem :ref:`discussed previously `: + +.. 
code-block:: python3 + + def f(x, y): + return np.cos(x**2 + y**2) / (1 + x**2 + y**2) + + grid = np.linspace(-3, 3, 1000) + x, y = np.meshgrid(grid, grid) + + qe.tic() + np.max(f(x, y)) + qe.toc() + +If you have a system monitor such as `htop` (Linux/Mac) or `perfmon` +(Windows), then try running this and then observing the load on your CPUs. + +(You will probably need to bump up the grid size to see large effects.) + +At least on our machine, the output shows that the operation is successfully +distributed across multiple threads. + +This is one of the reasons why the vectorized code above is fast. + +A Comparison with Numba +----------------------- + +To get some basis for comparison for the last example, let's try the same +thing with Numba. + +In fact there is an easy way to do this, since Numba can also be used to +create custom :ref:`ufuncs ` with the `@vectorize +`__ decorator. + +.. code-block:: python3 + + from numba import vectorize + + @vectorize + def f_vec(x, y): + return np.cos(x**2 + y**2) / (1 + x**2 + y**2) + + grid = np.linspace(-3, 3, 1000) + x, y = np.meshgrid(grid, grid) + + np.max(f_vec(x, y)) # Run once to compile + + qe.tic() + np.max(f_vec(x, y)) + qe.toc() + +At least on our machine, the difference in the speed between the +Numba version and the vectorized NumPy version shown above is not large. + +But there's quite a bit going on here so let's try to break down what is +happening. + +Both Numba and NumPy use efficient machine code that's specialized to these +floating point operations. + +However, the code NumPy uses is, in some ways, less efficient. + +The reason is that, in NumPy, the operation ``np.cos(x**2 + y**2) / (1 + +x**2 + y**2)`` generates several intermediate arrays. + +For example, a new array is created when ``x**2`` is calculated. + +The same is true when ``y**2`` is calculated, and then ``x**2 + y**2`` and so on. + +Numba avoids creating all these intermediate arrays by compiling one +function that is specialized to the entire operation. + +But if this is true, then why isn't the Numba code faster? + +The reason is that NumPy makes up for its disadvantages with implicit +multithreading, as we've just discussed. + +Multithreading at Numba Ufunc +----------------------------- + +Can we get both of these advantages at once? + +In other words, can we pair + +* the efficiency of Numba's highly specialized JIT compiled function and + +* the speed gains from parallelization obtained by NumPy's implict + multithreading? + +It turns out that we can, by adding some type information plus ``target='parallel'``. + +.. code-block:: python3 + + @vectorize('float64(float64, float64)', target='parallel') + def f_vec(x, y): + return np.cos(x**2 + y**2) / (1 + x**2 + y**2) + + np.max(f_vec(x, y)) # Run once to compile + + qe.tic() + np.max(f_vec(x, y)) + qe.toc() + +Now our code runs significantly faster than the NumPy version! + + + +Multithreaded Loops in Numba +============================ + +We just saw one approach to parallelization in Numba, using the ``parallel`` +flag in ``@vectorize``. + +This is neat but, it turns out, not well suited to many problems we consider. + +Fortunately, Numba provides another approach to multithreading that will work +for us almost everywhere parallelization is possible. + +To illustrate, let's look first at a simple, single-threaded (i.e., non-parallelized) piece of code. + +The code simulates updating the wealth :math:`w_t` of a household via the rule + +.. 
math:: + + w_{t+1} = R_{t+1} w_t + Y_{t+1} + +Here :math:`R` is the gross rate of return on assets and :math:`Y` is labor +income. + +We model both :math:`R` and :math:`Y` as independent draws from a lognormal +distribution. + +Here's the code: + +.. code-block:: ipython + + from numpy.random import randn + from numba import jit + + @jit + def h(w, r=0.1, s=0.3, v1=0.1, v2=1.0): + """ + Dynamics of wealth. + """ + + # Draw shocks + R = np.exp(v1 * randn()) * (1 + r) + Y = np.exp(v2 * randn()) + + # Update wealth + w = R * s * w + Y + return w + + +Let's have a look at how wealth evolves under this rule. + +.. code-block:: ipython + + fig, ax = plt.subplots(figsize=(9, 6)) + + T = 100 + w = np.empty(T) + w[0] = 5 + for t in range(T-1): + w[t+1] = h(w[t]) + + ax.plot(range(T), w) + + +Now let's suppose that we have a large population of households and we want to +know what median wealth will be. + +This is not easy to solve with pencil and paper, so we will use simulation +instead. + +In particular, we will simulate a large number of households and then +calculate median wealth for this group. + +Suppose we are interested in the long-run average of this median over time. + +It turns out that, for the specification that we've chosen above, we can +calculate this this by taking a one-period snapshot of what has happened to median +wealth of the group at the end of a long simulation. + +Moreover, provided the simulation period is long enough, initial conditions +don't matter. + +* This is due to something called ergodicity, which we will discuss later on. + +So, in summary, we are going to simulate 50,000 households by + +#. arbitrarily setting initial wealth to 1 and + +#. simulating forward in time for 1,000 periods. + +Then we'll calculate median wealth at the end period. + +Here's the code: + +.. code-block:: ipython + + @jit + def compute_long_run_median(w0=1, T=1000, num_reps=50_000): + + obs = np.empty(num_reps) + for i in range(num_reps): + w = w0 + for t in range(T): + w = h(w) + obs[i] = w + + return np.median(obs) + +Let's see how fast this runs: + +.. code-block:: ipython + + %%time + compute_long_run_median() + + +To speed this up, we're going to parallelize it via multithreading. + +To do so, we add the ``parallel=True`` flag and change ``range`` to ``prange``: + +.. code-block:: ipython + + from numba import prange + + @jit(parallel=True) + def compute_long_run_median_parallel(w0=1, T=1000, num_reps=50_000): + + obs = np.empty(num_reps) + for i in prange(num_reps): + w = w0 + for t in range(T): + w = h(w) + obs[i] = w + + return np.median(obs) + +Let's look at the timing: + +.. code-block:: ipython + + %%time + compute_long_run_median_parallel() + +The speed-up is significant. + +A Warning +--------- + +Parallelization works well in the outer loop of the last example because the individual tasks inside the loop are independent of eachother. + +If this independence fails then parallelization is often problematic. + +For example, each step inside the inner loop depends on the last step, so +independence fails, and this is why we use ordinary ``range`` instead of ``prange``. + +When you see us using ``prange`` in later lectures, it is because the +independence of tasks hold true. + +When you see us using ordinary ``range`` in a jitted function, it is either because the speed gain from parallelization is small or because independence fails. + +.. Dask + +.. To be added. + + +.. GPUs + +.. Just say a few words about them. How do they relate to the foregoing? 
Explain that we can't introduce executable GPU code here. + diff --git a/source/rst/sci_libs.rst b/source/rst/sci_libs.rst deleted file mode 100644 index acf13533..00000000 --- a/source/rst/sci_libs.rst +++ /dev/null @@ -1,575 +0,0 @@ -.. _speed: - -.. include:: /_static/includes/header.raw - -************************** -Other Scientific Libraries -************************** - -.. contents:: :depth: 2 - -In addition to what's in Anaconda, this lecture will need the following libraries: - -.. code-block:: ipython - :class: hide-output - - !pip install --upgrade quantecon - -Overview -======== - - - -In this lecture, we review some other scientific libraries that are useful for -economic research and analysis. - -We have, however, already picked most of the low hanging fruit in terms of -economic research. - -Hence you should feel free to skip this lecture on first pass. - - -:index:`Cython` -=============== - -.. index:: - single: Python; Cython - - -Like :doc:`Numba `, `Cython `__ provides an approach to generating fast compiled code that can be used from Python. - -As was the case with Numba, a key problem is the fact that Python is dynamically typed. - -As you'll recall, Numba solves this problem (where possible) by inferring type. - -Cython's approach is different --- programmers add type definitions directly to their "Python" code. - -As such, the Cython language can be thought of as Python with type definitions. - -In addition to a language specification, Cython is also a language translator, transforming Cython code into optimized C and C++ code. - -Cython also takes care of building language extensions --- the wrapper code that interfaces between the resulting compiled code and Python. - - - -**Important Note:** - -In what follows code is executed in a Jupyter notebook. - -This is to take advantage of a Cython `cell magic `_ that makes Cython particularly easy to use. - -Some modifications are required to run the code outside a notebook. - -* See the book `Cython `__ by Kurt Smith or `the online documentation `_. - - - - - -A First Example ---------------- - -Let's start with a rather artificial example. - -Suppose that we want to compute the sum :math:`\sum_{i=0}^n \alpha^i` for given :math:`\alpha, n`. - -Suppose further that we've forgotten the basic formula - -.. math:: - - \sum_{i=0}^n \alpha^i = \frac{1 - \alpha^{n+1}}{1 - \alpha} - - -for a geometric progression and hence have resolved to rely on a loop. - -Python vs C -^^^^^^^^^^^ - -Here's a pure Python function that does the job - -.. code-block:: python3 - - def geo_prog(alpha, n): - current = 1.0 - sum = current - for i in range(n): - current = current * alpha - sum = sum + current - return sum - -This works fine but for large :math:`n` it is slow. - -Here's a C function that will do the same thing - -.. code-block:: c - :class: no-execute - - double geo_prog(double alpha, int n) { - double current = 1.0; - double sum = current; - int i; - for (i = 1; i <= n; i++) { - current = current * alpha; - sum = sum + current; - } - return sum; - } - - -If you're not familiar with C, the main thing you should take notice of is the -type definitions - -* ``int`` means integer - -* ``double`` means double precision floating-point number - -* the ``double`` in ``double geo_prog(...`` indicates that the function will - return a double - -Not surprisingly, the C code is faster than the Python code. - -A Cython Implementation -^^^^^^^^^^^^^^^^^^^^^^^ - -Cython implementations look like a convex combination of Python and C. 
- -We're going to run our Cython code in the Jupyter notebook, so we'll start by -loading the Cython extension in a notebook cell - -.. code-block:: ipython - - %load_ext Cython - -In the next cell, we execute the following - -.. code-block:: ipython - - %%cython - def geo_prog_cython(double alpha, int n): - cdef double current = 1.0 - cdef double sum = current - cdef int i - for i in range(n): - current = current * alpha - sum = sum + current - return sum - -Here ``cdef`` is a Cython keyword indicating a variable declaration and is followed by a type. - -The ``%%cython`` line at the top is not actually Cython code --- it's a Jupyter cell magic indicating the start of Cython code. - -After executing the cell, you can now call the function ``geo_prog_cython`` from within Python. - -What you are in fact calling is compiled C code with a Python call interface - -.. code-block:: python3 - - import quantecon as qe - qe.util.tic() - geo_prog(0.99, int(10**6)) - qe.util.toc() - -.. code-block:: python3 - - qe.util.tic() - geo_prog_cython(0.99, int(10**6)) - qe.util.toc() - - - - - -Example 2: Cython with NumPy Arrays ------------------------------------ - -Let's go back to the first problem that we worked with: generating the iterates of the quadratic map - -.. math:: - - x_{t+1} = 4 x_t (1 - x_t) - - -The problem of computing iterates and returning a time series requires us to work with arrays. - -The natural array type to work with is NumPy arrays. - -Here's a Cython implementation that initializes, populates and returns a NumPy -array - -.. code-block:: ipython - - %%cython - import numpy as np - - def qm_cython_first_pass(double x0, int n): - cdef int t - x = np.zeros(n+1, float) - x[0] = x0 - for t in range(n): - x[t+1] = 4.0 * x[t] * (1 - x[t]) - return np.asarray(x) - - -If you run this code and time it, you will see that its performance is disappointing --- nothing like the speed gain we got from Numba - -.. code-block:: python3 - - qe.util.tic() - qm_cython_first_pass(0.1, int(10**5)) - qe.util.toc() - -This example was also computed in the :ref:`Numba lecture `, and you can see Numba is around 90 times faster. - -The reason is that working with NumPy arrays incurs substantial Python overheads. - -We can do better by using Cython's `typed memoryviews `_, which provide more direct access to arrays in memory. - -When using them, the first step is to create a NumPy array. - -Next, we declare a memoryview and bind it to the NumPy array. - -Here's an example: - -.. code-block:: ipython - - %%cython - import numpy as np - from numpy cimport float_t - - def qm_cython(double x0, int n): - cdef int t - x_np_array = np.zeros(n+1, dtype=float) - cdef float_t [:] x = x_np_array - x[0] = x0 - for t in range(n): - x[t+1] = 4.0 * x[t] * (1 - x[t]) - return np.asarray(x) - -Here - -* ``cimport`` pulls in some compile-time information from NumPy - -* ``cdef float_t [:] x = x_np_array`` creates a memoryview on the NumPy array ``x_np_array`` - -* the return statement uses ``np.asarray(x)`` to convert the memoryview back to a NumPy array - -Let's time it: - -.. code-block:: python3 - - qe.util.tic() - qm_cython(0.1, int(10**5)) - qe.util.toc() - -This is fast, although still slightly slower than ``qm_numba``. - - - -Summary -------- - -Cython requires more expertise than Numba, and is a little more fiddly in terms of getting good performance. - -In fact, it's surprising how difficult it is to beat the speed improvements provided by Numba. 
- -Nonetheless, - -* Cython is a very mature, stable and widely used tool. - -* Cython can be more useful than Numba when working with larger, more sophisticated applications. - - - - - -Joblib -====== - - -`Joblib `__ is a popular Python library for -caching and parallelization. - -To install it, start Jupyter and type - -.. code-block:: ipython - - !pip install joblib - - -from within a notebook. - -Here we review just the basics. - - - - -Caching -------- - - -Perhaps, like us, you sometimes run a long computation that simulates a model -at a given set of parameters --- to generate a figure, say, or a table. - -20 minutes later you realize that you want to tweak the figure and now you have to -do it all again. - -What caching will do is automatically store results at each parameterization. - -With Joblib, results are compressed and stored on file, and automatically served -back up to you when you repeat the calculation. - - - -An Example ----------- - -Let's look at a toy example, related to the quadratic map model discussed :ref:`above `. - -Let's say we want to generate a long trajectory from a certain initial -condition :math:`x_0` and see what fraction of the sample is below 0.1. - -(We'll omit JIT compilation or other speedups for simplicity) - -Here's our code - -.. code-block:: python3 - - from joblib import Memory - location = './cachedir' - memory = Memory(location='./joblib_cache') - - @memory.cache - def qm(x0, n): - x = np.empty(n+1) - x[0] = x0 - for t in range(n): - x[t+1] = 4 * x[t] * (1 - x[t]) - return np.mean(x < 0.1) - - -We are using `joblib `_ to cache the result of calling `qm` at a given set of parameters. - -With the argument `location='./joblib_cache'`, any call to this function results in both the input values and output values being stored a subdirectory `joblib_cache` of the present working directory. - -(In UNIX shells, `.` refers to the present working directory) - -The first time we call the function with a given set of parameters we see some -extra output that notes information being cached - -.. code-block:: python3 - - qe.util.tic() - n = int(1e7) - qm(0.2, n) - qe.util.toc() - -The next time we call the function with the same set of parameters, the result is returned almost instantaneously - -.. code-block:: python3 - - qe.util.tic() - n = int(1e7) - qm(0.2, n) - qe.util.toc() - - - - - - - -Other Options -============= - -There are in fact many other approaches to speeding up your Python code. - -One is interfacing with Fortran. - -.. index:: - single: Python; Interfacing with Fortran - -If you are comfortable writing Fortran you will find it very easy to create -extension modules from Fortran code using `F2Py `_. - -F2Py is a Fortran-to-Python interface generator that is particularly simple to -use. - -Robert Johansson provides a `very nice introduction `_ to F2Py, among other things. - -Recently, `a Jupyter cell magic for Fortran -`_ has been developed --- you might want to give it a try. - - - - - - -Exercises -========= - -.. _speed_ex1: - -Exercise 1 ----------- - -Later we'll learn all about :doc:`finite-state Markov chains `. - -For now, let's just concentrate on simulating a very simple example of such a chain. - -Suppose that the volatility of returns on an asset can be in one of two regimes --- high or low. - -The transition probabilities across states are as follows - -.. figure:: /_static/lecture_specific/sci_libs/nfs_ex1.png - - -For example, let the period length be one month, and suppose the current state is high. 
- -We see from the graph that the state next month will be - -* high with probability 0.8 - -* low with probability 0.2 - -Your task is to simulate a sequence of monthly volatility states according to this rule. - -Set the length of the sequence to ``n = 100000`` and start in the high state. - -Implement a pure Python version, a Numba version and a Cython version, and compare speeds. - -To test your code, evaluate the fraction of time that the chain spends in the low state. - -If your code is correct, it should be about 2/3. - - - -Solutions -========= - - - -Exercise 1 ----------- - -We let - -- 0 represent "low" -- 1 represent "high" - -.. code-block:: python3 - - p, q = 0.1, 0.2 # Prob of leaving low and high state respectively - -Here's a pure Python version of the function - -.. code-block:: python3 - - def compute_series(n): - x = np.empty(n, dtype=np.int_) - x[0] = 1 # Start in state 1 - U = np.random.uniform(0, 1, size=n) - for t in range(1, n): - current_x = x[t-1] - if current_x == 0: - x[t] = U[t] < p - else: - x[t] = U[t] > q - return x - -Let's run this code and check that the fraction of time spent in the low -state is about 0.666 - -.. code-block:: python3 - - n = 100000 - x = compute_series(n) - print(np.mean(x == 0)) # Fraction of time x is in state 0 - - -Now let's time it - -.. code-block:: python3 - - qe.util.tic() - compute_series(n) - qe.util.toc() - - -Next let's implement a Numba version, which is easy - -.. code-block:: python3 - - from numba import jit - - compute_series_numba = jit(compute_series) - -Let's check we still get the right numbers - -.. code-block:: python3 - - x = compute_series_numba(n) - print(np.mean(x == 0)) - - -Let's see the time - -.. code-block:: python3 - - qe.util.tic() - compute_series_numba(n) - qe.util.toc() - - -This is a nice speed improvement for one line of code. - -Now let's implement a Cython version - -.. code-block:: ipython - - %load_ext Cython - -.. code-block:: ipython - - %%cython - import numpy as np - from numpy cimport int_t, float_t - - def compute_series_cy(int n): - # == Create NumPy arrays first == # - x_np = np.empty(n, dtype=int) - U_np = np.random.uniform(0, 1, size=n) - # == Now create memoryviews of the arrays == # - cdef int_t [:] x = x_np - cdef float_t [:] U = U_np - # == Other variable declarations == # - cdef float p = 0.1 - cdef float q = 0.2 - cdef int t - # == Main loop == # - x[0] = 1 - for t in range(1, n): - current_x = x[t-1] - if current_x == 0: - x[t] = U[t] < p - else: - x[t] = U[t] > q - return np.asarray(x) - -.. code-block:: python3 - - compute_series_cy(10) - -.. code-block:: python3 - - x = compute_series_cy(n) - print(np.mean(x == 0)) - - -.. code-block:: python3 - - qe.util.tic() - compute_series_cy(n) - qe.util.toc() - - -The Cython implementation is fast but not as fast as Numba.