How is multiprocessing different from parallel computing in Python?
By Budji N’Vuicha | October 2022

A look at ways to speed up your projects

Speeding up calculations is a goal that everyone wants to achieve. In this article, we will learn what multiprocessing is, its benefits, and how we can use parallel computing to speed up Python projects.

Image 1: Photo by @denismphoto — Unsplash.com

Multiprocessing enables computers to use multiple CPU cores to run tasks. Usually, when we run a Python script, our code runs as a single process on a single core of our CPU.

But modern computers have more than one core, so what if we could use more cores for our calculations?

Let’s look at a simple example to illustrate the main aspects of Python multiprocessing. Python’s multiprocessing module comes with the standard library. The task() function sleeps for two seconds and prints a message before and after sleeping:
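Here is a minimal sketch of such a script, reconstructed from the description and the output shown below (the exact wording of the original listing may differ):

import multiprocessing
import time

def task():
    # Print, sleep for two seconds, then print again
    print("Starting the Task")
    time.sleep(2)
    print("Task done")

if __name__ == "__main__":
    start = time.perf_counter()
    # Two process instances that each run task()
    p1 = multiprocessing.Process(target=task)
    p2 = multiprocessing.Process(target=task)
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    print(f"Done in {round(time.perf_counter() - start, 4)} seconds")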

multiprocessing.Process(target=task) defines a process instance, and we launch the two processes with p1.start() and p2.start().

Here is the output:

Starting the Task
Starting the Task
Task done
Task done
Done in 2.0744 seconds

Python’s multiprocessing works by starting a new process and then joining it back to the main process with the join() method. Starting thousands of processes at once will run out of memory.

A better way is to limit the number of processes running simultaneously by using multiprocessing.Pool(), which distributes tasks across the available processors.

In the code below, we illustrate this with a simple example: a cube(x) function that takes an integer and returns its cube.
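A minimal sketch of that function, together with the Pool import needed by the snippet that follows (an assumption, since the original listing is not reproduced in full):

from multiprocessing import Pool

def cube(x):
    # Return the cube of the given integer
    return x ** 3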

We then map cube over a range of inputs with a process pool:

if __name__ == "__main__":
    with Pool() as pool:
        result = pool.map(cube, range(10))
    print("Program finished!")

Parallel computing refers to using multiple computing resources or processors simultaneously to solve a problem.

When working on a data analysis or machine learning project, we often want to parallelize work on Pandas DataFrames, the most commonly used objects for storing tabular data.

Dask is a free and open-source library for parallel computing in Python. With just a few code changes, Dask DataFrames enable us to work with huge datasets, manipulate data, and build ML models. Dask is compatible with NumPy, scikit-learn, and other Python packages.

Dask DataFrames for Parallel Pandas

Let’s use a practical example to learn how to use Dask. It can be installed with conda, with pip, or from source. Specific functionality requires optional dependencies.

Installation

$ conda install dask
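Alternatively, with pip, where the [complete] extra also installs the optional dependencies such as the distributed scheduler and the dashboard:

$ pip install "dask[complete]"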

We can start a Dask client, which provides a dashboard for gaining insight into the computations.

from dask.distributed import Client

client = Client(n_workers=2, threads_per_worker=2, memory_limit="1GB")
client

This starts a dashboard at a local address on our machine. The dashboard is built with Bokeh (a Python library for creating interactive visualizations) and starts automatically whenever the scheduler is created, returning a link to it.

Figure 2 – Dask dashboard

To see how Dask works, we’ll use a small stock market dataset for Tesla Inc. from Kaggle.

# Import libraries
import numpy as np
import pandas as pd
import dask.dataframe as dd

Dask DataFrames can read and store data in many of the same formats as Pandas DataFrames.

df = dd.read_csv("tesla_inc.csv")
df

Unlike Pandas, a Dask DataFrame is loaded lazily, only when it is needed for a computation. The next figure shows that no data is printed; instead, the values are replaced by ellipses (...).

Most common Pandas operations can be used on a Dask DataFrame in a similar fashion. This example filters rows based on a condition and then computes the mean of a column, grouped by date.

# Keep only the rows with a positive closing price
df2 = df[df.Close > 0]

# Mean opening price per date (still lazy; nothing is computed yet)
df3 = df2.groupby("Date").Open.mean()
df3

# Trigger the actual computation; the result is a regular Pandas object
computed_df = df3.compute()
type(computed_df)
computed_df

Despite the difficulty of profiling parallel code, Dask’s dashboard simplifies this by providing real-time operations monitoring.

Figure 3 – Dask monitoring operations

Making complex computations more efficient with persistence

Dask allows us to store intermediate computation results so that they can be reused. By using the persist() method, Dask tries to keep intermediate results in memory as far as possible. This can be very useful for speeding up calculations on large datasets.

df = df.persist()

Dask Arrays and Dask-ML

Just as Dask DataFrames parallelize Pandas code, Dask Arrays let us execute NumPy code in parallel, as in the short sketch below.
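A minimal sketch of the idea (the array shape and chunk size here are arbitrary choices, not from the original example):

import dask.array as da

# A 10,000 x 10,000 array split into 1,000 x 1,000 chunks
x = da.random.random((10000, 10000), chunks=(1000, 1000))

# Build a lazy task graph, then trigger the computation
y = (x + x.T).mean(axis=0)
result = y.compute()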

Dask-ML is a great tool that provides scalable machine learning in Python using Dask along with popular machine learning libraries such as Scikit-Learn, XGBoost, and others.
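For example, here is a hedged sketch of fitting a scikit-learn-style estimator from dask-ml on Dask arrays (the synthetic data is purely illustrative):

import dask.array as da
from dask_ml.linear_model import LogisticRegression

# Synthetic, chunked training data (illustrative only)
X = da.random.random((100000, 10), chunks=(10000, 10))
y = (da.random.random(100000, chunks=10000) > 0.5).astype(int)

# The estimator follows the familiar scikit-learn fit/predict API
clf = LogisticRegression()
clf.fit(X, y)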

As shown in the figure below, Dask has two main components:

  • Collections, which establish steps to perform computations in parallel
  • Schedulers, which manage the actual computation work
Figure 4 — Key components of Dask — https://docs.dask.org/

The dask.distributed scheduler works well on a single machine and scales out to many machines in a cluster. Kubernetes is a popular system for deploying distributed applications on clusters, especially in the cloud. The next picture shows the components of Dask when running on Kubernetes.
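Connecting to such a cluster only changes how the client is created. A sketch, where the scheduler address is a placeholder for whatever address the cluster actually exposes:

from dask.distributed import Client

# "scheduler-address" is a placeholder; on Kubernetes the Dask scheduler
# is typically exposed as a Service reachable at a tcp:// address
client = Client("tcp://scheduler-address:8786")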

Figure 5 – Dask on a Kubernetes cluster – https://docs.dask.org/

In this article, we have talked about optimizing the performance of Python code with multiprocessing. We also introduced parallel computing and its implementation with Dask.

As always, it is a matter of considering the specific problem we are facing and evaluating the pros and cons of different solutions.
