In R, many operations take advantage of parallel processing automatically, provided the BLAS in use supports it. As in Python, it's also possible to split calculations across multiple cores manually. The code below uses the doParallel package and a foreach loop combined with %dopar% to split a matrix multiplication into column blocks. If I limit the BLAS to a single thread, this speeds up the calculation. The same construction could also be used to run independent simulations.

I think this is a lot more intuitive than the Parallel Python implementation (see post). I'm curious to look at other parallel-processing Python modules to see how they compare.

require(parallel)
require(doParallel)
library(foreach)
library(iterators)

# Multiply A by a block of columns B
parFun <- function(A, B) {
  A %*% B
}

nCores <- 2
n <- 4000  # for this toy code, make sure n is divisible by nCores
a <- rnorm(n * n)
A <- matrix(a, ncol = n)

registerDoParallel(nCores)
systime <- system.time(
  # Each worker multiplies A by its own block of columns;
  # .combine = cbind stitches the column blocks back together.
  result <- foreach(i = 1:nCores, .combine = cbind) %dopar% {
    cols <- ((i - 1) * n/nCores + 1):(i * n/nCores)
    parFun(A, A[, cols])
  }
)
print(paste("Cores =", nCores))
print(systime)
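The single-threaded-BLAS condition mentioned above is usually set from the shell before launching R. A minimal sketch, assuming an OpenBLAS build (other BLAS builds read different variables, e.g. MKL_NUM_THREADS for MKL):

```shell
# Pin the BLAS to a single thread so that any speedup comes from foreach,
# not from the BLAS's own threading.
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1
echo "BLAS threads: $OPENBLAS_NUM_THREADS"
# Then launch R or Rscript from this shell so it inherits the setting.
```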

I'm all about efficiency, and I've been excited to learn more about parallel processing. I really like seeing my CPU using all four cores at 100%. I paid for all those cores, and I want them computing!

If I have some calculations running in Python, it's mildly annoying to check on Task Manager (in Windows: Ctrl+Shift+Esc, click on the "Performance" tab) and see my CPU usage at 25%. Only one of the four cores is being used. The reasons for this are somewhat complex (chiefly CPython's Global Interpreter Lock), but this is Python, so there are a number of modules that make it easy to parallelize operations (post on similar code in R).

To start learning how to parallelize calculations in Python, I used Parallel Python (the pp module). To test the speedup gained by using parallel processing, I wrote an intentionally inefficient matrix multiplication function using for loops (code below). Using NumPy's matrix class would be much faster here, but NumPy operations can do some parallelization on their own, which would muddy the comparison. Generating four 400 x 400 matrices and multiplying each by itself took 76.2 s when run serially and 21.8 s when distributed across the four cores, close to a full four-fold speedup. Also, I see I'm getting the full value from the four cores I paid for:

This trivial sort of parallelization is ideal for speeding up simulations, since each simulation is independent of the others.
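The independent-simulations idea can be sketched with the standard library's multiprocessing module; here each worker runs a hypothetical Monte Carlo trial (estimating pi from random points), and the per-core results are averaged at the end (Python 3 syntax, unlike the pp listing below):

```python
import random
from multiprocessing import Pool

def mc_pi_trial(args):
    """One independent simulation: estimate pi from n random points."""
    seed, n = args
    rng = random.Random(seed)  # per-trial RNG so workers don't share state
    hits = sum(1 for _ in range(n)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / n

if __name__ == "__main__":
    # Four independent trials, one per core; distinct seeds keep them independent
    trials = [(seed, 100_000) for seed in range(4)]
    with Pool(4) as pool:
        estimates = pool.map(mc_pi_trial, trials)
    print(sum(estimates) / len(estimates))  # average of the four estimates
```

Because the trials never communicate, the speedup from this pattern is close to the number of cores, just as in the matrix example.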

import pp
import time
import random

def randprod(n):
    """Waste time by inefficiently multiplying an n x n matrix by itself"""
    random.seed(0)
    n = int(n)
    # Generate an n x n matrix of random numbers
    X = [[random.random() for _ in range(n)] for _ in range(n)]
    # Build each row separately; [[None]*n]*n would alias one row n times
    Y = [[None]*n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            Y[i][j] = sum([X[i][k]*X[k][j] for k in range(n)])
    return Y

if __name__ == "__main__":
    test_n = [400]*4
    # Do 4 serial calculations:
    start_time = time.time()
    series_sums = [randprod(n) for n in test_n]
    elapsed = (time.time() - start_time)
    print "Time for serial processing: %.1f" % elapsed + " s"
    time.sleep(10)  # provide a break in the Windows CPU usage graph
    # Do the same 4 calculations in parallel, using Parallel Python
    par_start_time = time.time()
    n_cpus = 4
    job_server = pp.Server(n_cpus)
    jobs = [job_server.submit(randprod, (n,), (), ("random",)) for n in test_n]
    par_sums = [job() for job in jobs]
    par_elapsed = (time.time() - par_start_time)
    print "Time for parallel processing: %.1f" % par_elapsed + " s"
    print par_sums == series_sums
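The pp module targets old Python 2 releases; the same serial-versus-parallel comparison can be sketched with the standard-library multiprocessing module instead (Python 3; randprod here is the same inefficient multiplier as above, and the matrix size is shrunk to keep the demo short):

```python
import time
import random
from multiprocessing import Pool

def randprod(n):
    """Inefficiently multiply a random n x n matrix by itself."""
    random.seed(0)
    X = [[random.random() for _ in range(n)] for _ in range(n)]
    return [[sum(X[i][k] * X[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

if __name__ == "__main__":
    test_n = [150] * 4  # smaller than 400 so the demo finishes quickly

    start = time.time()
    serial = [randprod(n) for n in test_n]
    print("Serial:   %.1f s" % (time.time() - start))

    start = time.time()
    with Pool(4) as pool:          # one worker process per task
        parallel = pool.map(randprod, test_n)
    print("Parallel: %.1f s" % (time.time() - start))

    print(parallel == serial)      # same seed, so same answers either way
```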