Python multiprocessing - Process a large number of rows simultaneously

Python multiprocessing - Process a large number of rows simultaneously - python

SO community,
I am struggling with Python multicore programming. Whilst I cannot disclose too many details about the goal of my code, due to the company's information security policy, I'll try to be as specific as I can.
I am developing a code which will evaluate the weekly scores for several people, divided in 5 groups according to their overall scores. I have a fairly large dataframe, composed of dozens of thousands of rows, each one with the player's ID, the considered week (in a nine week period) and his/her score in the given week. On top of that, I have five arrays of indexes so I can loc/iloc the data for selecting people on one of five groups, depending on their scores.
I wrote a serialized version of this code, but it takes well over an hour to run for a small part (some 150k lines) of our data (I'm developing with a small sample, so I can test the code and, once it's A-OK, I can fire it up with the complete dataset). The structure of the serialized code in pythonesque pseudocode:
def weekscore(week, ID, players):
mask(week)
players.loc[players['player'==ID]
score = (calculate score)
return score
And then, for all players and all weeks:
sc_week = []
for play_ID in playerlist:
for week in np.arange(0,10): # To consider 9 weeks
sc_week.append(weekscore(week, play_ID, players)
i.e., for each player on the list, gets his/her score for the given week and append on a 9-element list, for further processing.
I know this isn't the best way for approaching this problem, but since I was trying to implement a multicore algorithm, I didn't bothered in optimizing this one. After all, this seemed the perfect opportunity to learn an important skill that will be useful later.
Anyway, I tried every single way I could find on SO to implement a multiprocessing algorithm for this problem. The idea is to delegate different player_IDs to different process, so I could greatly cut the total processing time. I have access to 32-core machines, so if I could make this run with 20 cores I would expect the code to run at least 10 times faster, minimum.
But no matter what I implemented, even though the code forks the process into several processes, it still just processes a single player_ID at a time. I monitored the execution through htop on Linux and through task manager on Windows and in both systems, setting a poolsize of 8, 8 subprocesses were created. Then, one of them would spike to 100% of CPU, output something to the screen, then another one would spike to 100%, output to the screen, another process spike to 100% and output and so on. I was expecting all processes to achieve 100%, but I couldn't managed to do this.
The closest I got was with this:
if __name__ == '__main__':
users, primeiro, dados, index_pred_80, intervalos, percents_ = prepara()
print('Multiprocessing step')
import multiprocessing as mp
from multiprocessing import freeze_support
from functools import partial
freeze_support()
pool = mp.Pool(10)
constparams = partial(perc1, week = week, ID = ID, players = players)
pool.map(constparams, users)
pool.close()
pool.join()
Calling the function perc1 which is (anonymized for privacy reasons)
def perc1(week, ID, players):
list1 = []
for week in np.arange(0,10):
mask(week)
players.loc[players['player'==ID]
print('Week = ', week, 'Player ID = ', player_ID)
score = (calculate score)
list1.append(score)
return (lista1)
So, after the long post, what is the best way for achieving this goal? What's the best way to delegate batches of IDs for different processes, so each process can churn through a part of the data?
EDIT 1: Fixed small typos.
EDIT 2: I am aware of the limitations of interactive interpreters for multicore Python, so I tested the code in command line both in Windows and Linux, but the behaviour was the same on both cases.

Related

How to maintain the presence of group of Interval variables in scheduling problem using Cplex Python?

So basically I have a scheduling problem where all tasks (interval variables) are optional and it belongs to different groups. Like as an house building problem, but here I have different tasks (not same at all) for different houses and I want to model that it fully completes as many houses as possible by finishing all tasks for those houses. I am also checking for the resource capacity that it won't exceed the given capacity. (So main goal is to fully complete the houses as much as possible and if capacity permits also select some tasks to use the remaining capacity, even if last house can't be complete )
So here I'm trying to count the presence_of() tasks (Model picks in solution) from particular house and then check if those are all tasks for a house to be finish completely, and then try to maximize the number of completed houses. But it's not working and giving me an error that "CPO expression can not be used as boolean".
Hope I've given the enough explanation. Please let me know if there's any question. I would really appreciate any help to make it work! Thank you in advance!
Edit:
I have list of Tasks unlike same task list for all houses. It's different tasks for different house. Like Tasks = {T1H1, T2H1, T3H2, T4H3, ... }. So aim (Objective) is to maximize the count of Houses that can be fully completed (100% completed houses from solution).
I tried below code for this, which gives me the error "CpoException: CPO expression can not be used as Boolean."
mdl0.add(cp.maximize(sum ( 1
for H in Houses
if
sum((cp.presence_of(dictIntrvalVarsTasks[T]) * 1) for T in Tasks) ==
dfAllTasks.loc[dfAllTasks['House'] == H, 'TaskName'].count()
)))
Here, in the code first I'm trying to get sum of houses which can be completed 100% using the condition (
if
sum((cp.presence_of(dictIntrvalVarsTasks[T]) * 1) for T in Tasks) == dfAllTasks.loc[dfAllTasks['House'] == H, 'TaskName'].count()
), and check that if sum of present tasks in solution are equal to the total tasks for the house? If so that means all tasks for that house is completed and so the house. And maximize that sum.
Hope I explained it enough, but please let me know if there is any question. Thanks a lot in advance. Hope you or somebody could help me with this! :)

How to monitor multiple assets at the same time with a trading robot?

I am developing a trading robot in Python 3.8 and I need to know if you can give me any ideas to monitor multiple open orders simultaneously.
The situation is as follows: When you want to sell an asset, the robot can monitor conditions permanently and easily evaluate the indicators to place the sell order (limit or market).
But when you have 3, 4, 5 assets or more the situation gets complicated because the robot monitors one asset and then moves on to the next one and so on. This means that when monitoring asset # 2 (for example) asset # 5 (which is not being monitored) may be suffering a sudden strong fluctuation that makes you lose money.
My question is: Is there a way to keep an eye on all 5 assets at the same time?

Investigating thoroughly on this problem I found a way to theoretically and technically solve this problem. This is multiprocessing in Python.
The technique consists of distributing the memory of our PC in portions to execute the same process many times and at the same time.
Graphically I explain it with the following images. Python runs sequentially as we see in this image:
This has the consequence that if the monitoring loop is calculating the indicators of asset 1, then asset 130 (for example) is unsupervised and could generate considerable losses.
But if we divide the memory of our machine or use multiple cores, we can execute the same process at the same time for several assets, as I show in the following image:
In this link you can see the result of applying a multithreading (take a good look at the time) and a multiprocess: http://pythondiario.com/2018/07/multihilo-y-multiprocesamiento.html
I also leave the link of the library: https://docs.python.org/3/library/multiprocessing.html
More information and more detailed examples of multiprocessing can be seen here: https://www.genbeta.com/desarrollo/multiprocesamiento-en-python-benchmarking
It only remains to develop the code and put it to the test.

Will the for loop effect the speed in pyspark dataframe

I have this code which splits the dataframe in 10000 rows and writes to file.
I tried instance with z1d with 24cpu and 192GB but even that didn't do much speed and for 1 million rows it took 9 mins.
This is code
total = df2.count()
offset = 10000
counter = int(total/offset) + 1
idxDf = df.withColumn("idx", monotonically_increasing_id())
for i in range(0, counter):
lower = i * offset
upper = lower + offset
filter = f"idx > {lower} and idx < {upper}"
ddf = idxDf.filter(filter)
ddf2 = ddf.drop("idx")
ddf2.write.option("header", "false").option("delimiter", " ").option("compression","gzip").csv(outputpath)
Is there any way i can make in it. Currently i am using single master node only. I have 100 million rows and want to know how fast i can do that with emr.
Look like my normal python code is also able to do the same stuff in same minutes

A few problems with what you’re trying to do here:
Stop trying to write pyspark code as if it’s normal python code. It isn’t. Read up on exactly how spark works first and foremost. You’ll have more success if you change the way you program when you use spark, not try to get spark to do what you want in the way you want.
Avoid for loops with Spark wherever possible. for loops only work within native python, so you’re not utilising spark when you start one. Which means one CPU on one Spark node will run the code.
Python is, by default, single threaded. Adding more CPUs will do literally nothing to performance for native python code (ie your for loop) unless you rewrite your code for either (a) multi-threaded processing (b) distributed processing (ie spark).
You only have one master node (and I assume zero slaves nodes). That’s going to take aaaaaaggggggggeeeessss to process a 192GB file. The point of Spark is to distribute the workload onto many other slave nodes. There’s some really technical ways to determine the optimal number of slave nodes for your problem. Try something like >50 or >100 or slaves. Should help you see a decent performance uplift (each node able to process at least between 1gb-4gb of data). Still too slow? Either add more slave nodes, or choose more powerful machines for the slaves. I remember running a 100GB file through some heavy lifting took a whole day on 16 nodes. Upping the machine spec and number of slaves brought it down to an hour.
For writing files, don’t try and reinvent the wheel if you don’t need to.
Spark will automatically write your files in a distributed manner according to the level of partitioning on the dataframe. On disk, it should create a directory called outputpath which contains the n distributed files:
df.repartition(n_files)
df.write.option("header", "false").option("delimiter", " ").option("compression","gzip").csv(outputpath)
You should get a directory structured something like this:
path/to/outputpath:
- part-737hdeu-74dhdhe-uru24.csv.gz
- part-24hejje—hrhehei-47dhe.csv.gz
- ...
Hope this helps. Also, partitioning is super important. If your initial file is not distributed (one big csv), it’s a good idea to do df.repartition(x) on the resulting dataframe after you load it, where x = number of slave nodes.

Multiprocessing with large no of files

I am trying to solve a problem. I would appreciate your valuable input on this.
Problem statement:
I am trying to read a lot of files (of the order of 10**6) in the same base directory. Each file has the name that matches the pattern (YYYY-mm-dd-hh), and the content of the files are as follows
mm1, vv1
mm2, vv2
mm3, vv3
.
.
.
where mm is the minute of the day and vv” is some numeric value with respect to that minute. I need to find, given a start-time (ex. 2010-09-22-00) and an end-time (ex. 2017-09-21-23), the average of all vv’s.
So basically user will provide me with a start_date and end_date, and I will have to get the average of all the files in between the given date range. So my function would be something like this:
get_average(start_time, end_time, file_root_directory):
Now, what I want to understand is how can I use multiprocessing to average out the smaller chunks, and then build upon that to get the final values.
NOTE: I am not looking for linear solution. Please advise me on how do I break the problem in smaller chunks and then sum it up to find the average.
I did tried using multiprocessing module in python by creating a pool of 4 processes, but I am not able to figure out how do I retain the values in memory and add the result together for all the chunks.

You process is going to be I/O bound.
Multiprocessing may not be very useful, if not counterproductive.
Moreover your storage system, base on enormous number of small files, is not the best. You should look at a time serie database such as influxdb.

Given that the actual processing is trivial—a sum and count of each file—using multiple processes or threads is not going to gain much. This is because 90+% of the effort is opening each file and transferring into memory its content.
However, the most obvious partitioning would be based on some per-data-file scheme. So if the search range is (your example) 2010-09-22-00 through 2017-09-21-23, then there are seven years with (maybe?) one file per hour for a total of 61,368 files (including two leap days).
61 thousand processes do not run very effectively on one system—at least so far. (Probably it will be a reasonable capability some years from now.) But for a real (non-supercomputing) system, partitioning the problem into a few segments, perhaps twice or thrice the number of CPUs available to do the work. This desktop computer has four cores, so I would first try 12 processes where each independently computes the sum and count (number of samples present, if variable) of 1/12 of the files.
Interprocess communication can be eliminated by using threads. Or for process oriented approach, setting up a pipe to each process to receive the results is a straightforward affair.

How to parallelize a backtesting for a trading strategy based on a machine learning model?

For the purpose of backtesting a trading strategy, based on a machine learning model, I wanted to compute the retraining procedure of the model in parallel.
Now my question is:
Is there an opportunity to improve the speed of my algorithm?
It is implemented in python using scikit-learn. The procedure of backtesting is defined as follows:
Train the model on 900 ( the first three years ) data points
Make a prediction for the next day t+1
Retrain the model on the next datapoint t+1
again make a prediction with the model for the day t+2
Retrain the model ....
make a prediction ...
Simply, make a prediction for the next data point and retrain the model on this data point. Then do this till the current day ( last data point ). For some stock predictions, this could be up to 5000 data points, which means by starting this backtesting procedure with a model trained on the first 900 data points, I need to retrain and predict the model 4100 times.
To parallelize the procedure I am using multiprocessing. I have the chance to use a server, that provides 40 CPU kernels. So what I am doing is:
Divide the data points 4100 in 40 parts
Start a process on every kernel which runs the procedure for 1 part
after finishing the procedure, write the result on disk
collect every result and put it together
Python Code:
pool = Pool()
results = []
for i in range(len(parts)):
try:
result = pool.apply_async(classify_parts,[parts[i], parts[i+1]])
results.append(result)
except IndexError:
continue
for result in results:
result.get()
The method classify_parts, starts the procedure for the given range.
For instance, if I have 500 data points and will start the whole backtesting by training the model on 100 data points, I have 400 data points ( days ) left for the backtesting, which means:
divide 400 data points in 40 parts [10,20,30,...,380,390,400]
start a process on every kernel:
classify_parts( 10, 20 ), ... , classify_parts( 390, 400 )
collect the results from disk and put them together
Hopefully I could illustrated my concept in a clear manner.
So my biggest question is, if there is another more efficient way of doing backtesting with a machine learning model, that is retrained on every next data point ( day )? Because with this concept, one backtesting procedure for 5000 data points runs more than 10 minutes.
Maybe incremental / online learning is here the way to go?

At first, this is not parallelizing the process scheduling.
If in doubts, one may revisit theory details on SERIAL, PARALLEL and "just"-CONCURRENT process scheduling before reading further.
Put in plain English
Things may go into SERIAL mode of operations by-principle ( by intent ) or by some sort of resource-limitation. If one is to type a word "parallel" on a keyboard, the letters p, next a, next r ... etc have to be touched SERIAL-ly, one after another, otherwise the input process fails to provide the guaranteed, correct result - i.e. the word "parallel" on screen ( and let's assume we forget about a chance to use a Shift key for the clarity of this example ).
Anyone may test, what gets produced, if one tries to press all the keys at the same moment, or in pairs, or while a cat is standing at your keyboard ... all additional ideas are welcome for hands-on experimentation.
SERIAL processes cannot go faster
In any other circumstances than by getting faster processing-resources and/or by lowering inter-process transport-delays ( or in case a new, smarter algorithm exists and is known to the kind programmer and yet additionally shows to be feasible to get implemented on resources available ).
What is indeed a PARALLEL process?
On the other hand, some things must be organised so as to happen right in the opposite end of the spectra -- if all things simply have to start + happen + finish at the same time, otherwise the results from the process are considered wrong. What could be brought as an example?
imagine a few visually simply "readable" examples - the show of Red Arrows - a group of 9 jet planes, which displays figures and manoeuvres of high pilotage
or Frecce Tricolori
where all planes move at the same time exactly the given ( same ) way, otherwise the lovely show would fail.
imagine a concert, that has to take place. This is exactly the ultimately PARALLEL process and one would immediately recognise how devastating that would be, if it were not
imagine a robot-control system for several degrees-of-freedom, where all motion has to be carried out and has to be kept under a control of a closed-loop [ actor - detector - actuator ] subsystem in sync, coordinated for all of the several programming-controlled axes. The system must control all the NC-axes together as a PARALLEL process -- otherwise the robot-arm will never be able to follow the black-line - the defined, curved, spatio-temporal trajectory in +4-D space ( best to view in real-time ).
The robot-control example shows, that the farther we move from the aerobatics or the symphony examples, for which we have some direct experience ( or where one's imagination can provide some comprehension ), the PARALLEL process-scheduling may seem harder to sense or to realise, but the symphony seems to be the best one to remind, when questioning if a problem is indeed a truly PARALLEL or not.
Just ask yourself, what would happen, if just half of the ensemble comes in due time to start playing at 19:00 sharp, or what would happen, if some of the violins plays at an accelerated speed... or what would happen, if the players will start playing just one note, in a sequence and let his/her left side-neighbour to play his/her note, waiting until the order of playing gets back to the same player to perform his/her next note in the concert. Correct, that would be pure-SERIAL scheduling. )
So, would 40-CPU-cores allow to somehow accelerate my problem?
The problem still depends on the processing.
If the learning-method ( in an undisclosed MCVE ) has a built in dependence, for example if classify_parts( 20, 30 ) needs to get some input or a piece of information, generated from a "previous" step in classify_parts( 10, 20 ), then the answer is no.
If the learning-method is independent, in the above defined sense, then there is a chance to schedule the process as a "just"-CONCURRENT and benefit from the fact that there are more "free"-lanes on the highway to allow cars run in not just one lane, but allow one to overtake another, slower one, on the way from start to finish. A rugby team can move from city A to city B "faster", if there is 2 or 3 or more 4-seat cars available in the city A stadium, than if there is just 1 and it has to go there and back and forth to move all the players successfully to city B.
Next, even if there are cars enough to move to city B for all players, the duration would be uncomfortably high in case, there is just one lane road from A to B and there is a high traffic in the opposite direction from B to A ( not allowing to overtake ) and a single tractor, also going from A to B, but at a speed of 1 mph.
Warning: ( if in a search for really FAST PROCESSING, the more for the FASTEST POSSIBLE one )
Still, even if there are none of the above sketched "external" obstacles, one has to be very careful, not to lose the benefits available from "bigger" machine + "wider" highways ( more resources available ) for a "just"-CONCURRENT process scheduling.
Yes, the problem here is related with programming tools.
Python can help a lot for fast prototyping, but beware for the Global Interpreter Lock ( GIL ) code-execution implications.
In your case, using a Pool() of 40-threads might look attractive, but will still wait for a GIL-acquisition/release and ALL THE WORK WILL turn to be SERIAL again, just due to omission to notice the known GIL-mechanics on a Pool() of python-threads.
Yes, there are other more promising alternatives, like a ThreadPool() et al, but not only the code-related GIL-stepping is devastating your efforts to accelerate the speed of the process, but also the shared data-structures is a problem to watch and to avoid. Additionally, the Pool() naively-replicated environments may grow beyond one's system RAM capacities and the swapping mechanism will yield such attempt to accelerate the process totally un-usable at all. Share nothing, otherwise your dream to accelerate processing will lose on sharing-mechanics and you are back on the square No.1.
The best way is to design the code as independent processes, that have zero-sharing, non-blocking ( MEM, IO ) operations.
Given all that above, the 40-CPU-cores may help to accelerate the "just"-CONCURRENT process.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.