Foundations of Deep Learning for the Social Sciences:
Day 1 Python Tutorial

This notebook will explore the concepts and models we discussed today using the Keras and PyTorch Python packages.

Familiarity with Python may be helpful but is not strictly necessary. For those who are familiar with R — Python is syntactically similar to R, but differs from R in that it is more of an object-oriented language, whereas R is more of a functional language. What this means in practice is that in Python, you'll see a lot of syntax that looks like object.attribute. For example, you might define a statistical model (the object), then access its coefficients (the attribute) by typing model.coef. In R, on the other hand, you typically use a function to get the parameters: coef(model). In my experience, this isn't too difficult to adjust to.

Python does have the upper hand, however, in that its deep learning packages are further developed and better supported than R's. Hence, most of our tutorials will be in Python. There are now R wrappers for Tensorflow and PyTorch, however, so I will also provide some R code for comparison.

Just like in R, in Python you need to import the packages you intend to use. In Python, packages are imported as objects called modules that you are welcome to rename for convenience. The functions included in each package can be used by typing module.function() (with appropriate arguments); this may feel similar to typing package::function() in R.

Let's begin by importing the deep learning packages we'll be using in this tutorial. We will also import NumPy, an important Python package for scientific computing.

Tensors: The Fundamental Deep Learning Data Structure

Tensors are the basic objects used for computing in deep learning frameworks. In the deep learning context, a tensor is just a multidimensional array. In statistics, we usually only work with 0-dimensional arrays (scalars), 1-dimensional arrays (vectors), and 2-dimensional arrays (matrices):

\begin{equation} \nonumber \underbrace{a}_{\text{a scalar}}, \qquad \mathbf{a} = \underbrace{\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \end{bmatrix}}_{\text{a vector}}, \qquad \mathbf{A} = \underbrace{\begin{bmatrix} a_{1, 1} & a_{1, 2} & a_{1, 3} \\ a_{2, 1} & a_{2, 2} & a_{2, 3} \\ a_{3, 1} & a_{3, 2} & a_{3, 3} \\ \end{bmatrix}}_{\text{a matrix}}. \end{equation}

In deep learning frameworks, however, arrays can have more dimensions than just two. For example, a 3-dimensional tensor is a stack of matrices:

3d-tensor.jpg

As an example, we might use a 3D tensor to hold a longitudinal data set where we have 30 people answer 10 questions at 5 time points. In this case, we'd typically format our data as a $5 \times 30 \times 10$ tensor. This is rather practical, because now instead of having a list of $5$ $30 \times 10$ matrices that must be operated on separately, all our data is contained in a single array structure that we can perform operations on directly.

Creating tensors in Keras/Tensorflow and PyTorch is straightforward. In NumPy, the equivalent data structure is the array, which is pretty much just a tensor (without automatic differentiation — although this can be added). It's also straightforward to convert between frameworks.

See below for some examples of creating and indexing tensors. For more in-depth tensor tutorials, see:

Univariate Logistic Regression

Similar to the first lecture, we will demonstrate some basic deep learning concepts using logistic regression. This will also help further introduce the syntax used by Keras and PyTorch before getting into more complicated deep learning examples.

We'll use the same toy example we used in lecture where $x \geq 0$ is the number of hours slept last night and $y \in \{0, 1\}$ is whether one experiences a depressive episode the following day ($y = 1$ means a depressive episode occurred). The data is generated as follows: \begin{align} x &\sim \text{Uniform}(0, 12) \nonumber \\ y &\sim \text{Bernoulli}(\sigma(-0.5 x + 3)), \nonumber \end{align} where $\sigma(z) = 1 / (1 + \exp[-z])$ is the inverse logistic link or sigmoid function. In words, the probability of experiencing a depressive episode decreases as the number of hours slept last night increases.

We simulate the data below using NumPy.

We're going to fit a univariate logistic regression model: \begin{equation} \hat{y} = \sigma(wx + b), \nonumber \end{equation} where $w$ is a slope, $b$ is an intercept, and $\hat{y}$ is the predicted probability that the outcome is equal to $1$. This is the model we used in lecture when we introduced backpropagation and stochastic gradient methods.

We'll begin by defining this model using Keras. Keras is a high-level interface to Tensorflow, one of Python's major deep learning frameworks. In Keras, we first the define the model, then compile and fit it to data. This approach amounts to defining a static (i.e., fixed) computational graph that we repeatedly make forward and backward passes through. In contrast, we will see that PyTorch allows us create a dynamic computational graph that can easily change across fitting iterations.

Below, we define and fit a univariate logistic regression model to the simulated data using Keras. The basic model building process is:

  1. define the model as a set of layers that process the predictors sequentially;
  2. tell the model the loss function and optimizer to use for fitting; and
  3. fit the model.

Keras keeps all of the details of backpropagation under the hood — the model is fitted with a simple model.fit() call. By default, upddates are printed to the screen as fitting progresses.

We can see that the fitted parameters are pretty accurate.

We can also plot how the loss function and the predictive accuracy change as fitting progresses. These plots are typically called training curves. The training curves show that the loss quickly decreases and the accuracy quickly increases during fitting.

Now let's try the same thing in PyTorch. In PyTorch, we need to be a bit more hands-on than in Keras. In particular, we actually need to write a for loop to iteratively fit the model. This is what I meant earlier when I said that PyTorch makes a dynamic computational graph: we're actually re-building the computational graph during each loop!

We can see that this PyTorch code executes faster than the Keras code and that the final parameter estimates are similar.

Again, we can plot the training curves. They look very similar to the Keras curves.

PyTorch also lets us look a bit more closely at the backpropagation algorithm.

Recall that backprop starts with a forward pass to compute the loss, then makes a backward pass to compute the derivatives of the loss with respect to the parameters. Let's reproduce our example forward and backward pass from the lecture. Here's the forward pass:

forward_pass.png

Here's how you'd write the forward pass in PyTorch:

Here's the backward pass:

backward_pass.png

And here's how you'd do it in PyTorch:

Multivariate Logistic Regression

It's quite easy to extend things to the multivariate setting in both frameworks. Let's say we now have an additional predictor $x_2 \geq 0$ representing the number of hours of exercise yesterday. Our data generating model is now: \begin{align} x_1 &\sim \text{Uniform}(0, 12) \nonumber \\ x_2 &\sim \text{Uniform}(0, 3) \nonumber \\ y &\sim \text{Bernoulli}(\sigma(-0.5 x_1 - x_2 + 3)). \nonumber \end{align} In words, increasing either number of hours of sleep last night ($x_1$) or number of hours of exercise yesterday ($x_2$) decreases the chances of having a depressive episode today.

Let's sample the data below.

Now we're fitting the following logistic regression model: \begin{equation} \hat{y} = \sigma(\mathbf{w}^\top \mathbf{x} + b), \nonumber \end{equation} where $\mathbf{w}$ is a slope vector and $\mathbf{x}$ is a predictor vector.

It's pretty easy to fit this model using Keras. In fact, we hardly need to change anything at all -- just the input data!

We see below that the parameter estimates are accurate.

Let's fit the same model in PyTorch. In the last example, we hand coded a bit more than we needed to. In fact, PyTorch has many built-in classes to make building complicated models easier. We demonstrate some of these classes below.

Again, fitted parameters are accurate.

Multilayer Perceptrons

Finally, we'll get into deep learning! Just like in lecture, we'll start with the multilayer perceptron, one of the most basic neural networks.

We'll begin with a simulated data set. Here's the setup: suppose we want to predict whether or not students obtain less than $80\%$ correct on an exam ($y \in \{0, 1 \}$. We have two predictors: the average number of hours they study per week ($0 \leq x_1 \leq 10$) and a continuous measure of their stress level ($0 \leq x_2 \leq 10$).

We simulate and plot the data below. Feel free to skip over the simulation procedure, which is a little complicated. Looking at the scatterplot, we see that students who have very high or very low stress levels don't study much, whereas those with moderate stress levels study the most. We can also see that studying less increases one's chances of getting less than $80\%$ correct.

We will first fit a logistic regression model as a baseline. We place $20\%$ of the data in a holdout test set and fit the model using the remaining $80\%$.

The test set accuracy is pretty low — only $0.575$.

As before, we show the loss and accuracy as fitting progresses. The loss and accuracy quickly decrease to a minimum and maximum, respectively.

We next plot the test set data along with the logistic regression decision boundry — that is, the curve below which the model flags students as at risk of scoring lower than $80\%$ (i.e., $\hat{y} > 0.5$) and above which students are not flagged (i.e., $\hat{y} < 0.5$). The logistic regression decision boundry is a straight line, which does a poor job of separating the red and blue points.

Now let's try to model the data using a multilayer perceptron (MLP). We'll fit the following model with a hidden layer of size $10$ and a $\text{tanh}$ hidden layer activation function:

After fitting, the model's test set accuracy is much better — $88.5\%$!

The loss and accuracy curves show that the MLP takes much longer to converge than the logistic regression model, but the final loss and accuracy are much better.

A plot of the MLP decision boundary shows that it cleanly separates the red and blue points — we've successfully flagged at-risk students!

Let's fit a similar MLP using PyTorch, but now with a ReLU hidden layer activation function. Again, this code executes faster than the Keras code, but the final accuracy is similar.

The training curves and the decision boundary are similar to those produced by Keras. Notice, however, that the boundary is now piecewise linear — that's because we're using the ReLU activation, which itself is piecewise linear.

Recurrent Neural Networks

Recurrent neural networks (RNNs) are highly flexible models for longitudinal data. They are capable of leveraging complex temporal relationships to predict the outcome(s) of interest.

Intuitively, the simplest RNNs are MLPs with loops carrying information forward through time. A basic RNN schematic diagram looks like this:

You can see that at each time point, we have a vector of predictors $\mathbf{x}^{(t)}$ that is mapped to a hidden layer $\mathbf{h}^{(t)}$ which is then mapped to an outcome $y^{(t)}$.

We will demonstrate RNNs using simulated data. Imagine that we track people's levels of sadness ($x_1^{(t)}$) and loneliness ($x_2^{(t)}$) each day in the morning; these are the predictors. We also record whether or not they experience a severe depressive episode by the nighttime ($y^{(t)} = 1$ if a severe depressive episode occurs); this is the outcome. We want to predict whether a person will experience a severe depressive episode each day using their morning mood time series up to that day.

The predictors are generated from the following stochastic dynamical system: \begin{equation} \nonumber x_1^{(0)}, x_2^{(0)} \sim \text{Uniform}(-2, 2), \qquad \Delta\begin{bmatrix} x_1^{(t)} \\ x_2^{(t)} \end{bmatrix} = \begin{bmatrix} -0.01 & 0.13 \\ -0.1 & -0.01 \end{bmatrix} \begin{bmatrix} x_1^{(t)} \\ x_2^{(t)} \end{bmatrix}\Delta t + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 0.0025), \end{equation} which produces noisy oscillations over time. We simulate depressive episodes as follows: If the sum of a person's sadness and loneliness scores has been higher than $1$ any time in the past $3$ days, they have an $80\%$ chance of experiencing a depressive episode; otherwise, they have a $20\%$ chance.

We simulate the data below and plot the predictors.

We will try to fit a long short-term memory (LSTM) network to the data (for details, see this excellent blog post colah.github.io/posts/2015-08-Understanding-LSTMs/). We'll start with Keras.

This is where things can get a little tricky. We need to make sure our data tensors are in the correct format for the framework we're using. The arrays we simulated have dimensions $\text{time_points} \times \text{batch_size} \times \text{n_variables}$, but Keras wants tensors with dimension $\text{batch_size} \times \text{time_points} \times \text{n_variables}$ (see, e.g., www.tensorflow.org/guide/keras/rnn). So we need to reformat these arrays for use with Keras.

The accuracy quickly increases to about $80\%$ (what we'd expect), and the loss decays a little more slowly.

Now we'll fit the same model in PyTorch. By default, PyTorch takes tensor inputs of size $\text{time_points} \times \text{batch_size} \times \text{n_variables}$, so we don't need to mess with our data tensor's shape. The model obtains similar accuracy to the Keras model, and the training curves are quite similar.