I need to parse around 1000 URLs. So far I have a function that parses a URL and returns a pandas dataframe. How should I best structure the program so I can combine all the dataframes, and how do I get the return values back out of the futures? In the example below, how can I eventually merge all the temp dataframes into a single dataframe (i.e. finalDF = finalDF.append(temp))?
import concurrent.futures
import pandas as pd
import requests

def Parser(ptf):
    temp = pd.DataFrame()
    URL = "http://" + str(ptf)
    # ...some complex operations, including a requests.get(URL), which eventually fill temp, a pandas dataframe
    return temp  # returns a pandas dataframe

def conc_caller(ptf):
    temp = Parser(ptf)
    # this won't work because finalDF is not defined here; unclear how to handle this
    finalDF = finalDF.append(temp)
    return finalDF

booklist = ['a', 'b', 'c']
finalDF = pd.DataFrame()
executor = concurrent.futures.ProcessPoolExecutor(3)
futures = [executor.submit(conc_caller, item) for item in booklist]
concurrent.futures.wait(futures)
Another problem is that I get the error message:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
Any suggestions on how to fix the code are appreciated.
You have to protect your launch code with if __name__ == '__main__': to prevent the workers from spawning new processes forever. Put the guard around the code that creates the executor, submits the jobs, and calls concurrent.futures.wait(futures).
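To merge the per-URL results, the simplest structure is to have each worker return only its own dataframe and collect everything in the parent process; the intermediate conc_caller then isn't needed. A minimal sketch of that idea (using pd.concat, since repeatedly calling DataFrame.append is deprecated in recent pandas), reusing the Parser and booklist from above:

import concurrent.futures
import pandas as pd

if __name__ == '__main__':
    booklist = ['a', 'b', 'c']
    frames = []
    with concurrent.futures.ProcessPoolExecutor(3) as executor:
        futures = [executor.submit(Parser, item) for item in booklist]
        for future in concurrent.futures.as_completed(futures):
            frames.append(future.result())  # each result is the dataframe one worker returned
    finalDF = pd.concat(frames, ignore_index=True)

future.result() re-raises any exception the worker hit, so parse errors surface in the parent instead of being silently lost.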
I recently put a post on Stack Overflow, but it didn't contain enough information, so it got closed. Since then I have done more research and more trial and error, and I also got help on my original post, so I am trying again to see if I have more luck.
I am using Jupyter Notebook on a 64-bit Windows machine with Python 3.
I am trying to add multiprocessing to a for loop, so that instead of handling one item at a time and appending it to tmplist (an array that will eventually be converted to a dataframe), the code processes several values at once and finishes faster.
I replaced the for loop with a function: instead of for StockId in ArrId, I now have def AddToTable(StockId), which returns the record that went into tmplist. I want to add multiprocessing and then gather all the returns into one array, so that I have all the results in one place.
import numpy as np
from multiprocessing import Pool
import multiprocessing

ArrId = [2345, 1234, 234532, 4532, 3464532, 245345456, 34534, 234323, 4524]
tmplist = []

def AddToTable(StockId):
    return tmplist.append(StockId)  # note: list.append returns None

# (If you run this - you will see what is returned)
for StockId in ArrId:
    AddToTable(StockId)
print(tmplist)

# Then I added multiprocessing (see below), but the block of code never finishes.
if __name__ == '__main__':
    jobs = []
    with multiprocessing.Pool() as p:
        results = p.map(AddToTable, ArrId)
        jobs.append(results)
        p.close()
        p.join()

Any help will be appreciated!
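A sketch of one way to make this work, assuming the goal is simply to get one collected list back: let the worker return its value and let Pool.map gather the returns (each worker process has its own copy of tmplist, so appending to a global list inside a worker never reaches the parent). Note that in a Jupyter notebook on Windows the worker function usually has to live in an importable .py file; functions defined only in a notebook cell often make the pool hang exactly as described.

import multiprocessing

ArrId = [2345, 1234, 234532, 4532, 3464532, 245345456, 34534, 234323, 4524]

def AddToTable(StockId):
    # return the (possibly processed) record instead of appending to a shared list
    return StockId

if __name__ == '__main__':
    with multiprocessing.Pool() as p:
        tmplist = p.map(AddToTable, ArrId)  # one return value per input, in input order
    print(tmplist)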
I'm creating a function that reads an entire folder, creates a Dask dataframe, then processes the partitions of this dataframe and sums the results, like this:
import dask.dataframe as dd
from dask import delayed, compute

def partitions_func(folder):
    df = dd.read_csv(f'{folder}/*.csv')
    partial_results = []
    for partition in df.partitions:
        partial = another_function(partition)
        partial_results.append(partial)
    total = delayed(sum)(partial_results)
    return total
The function called in partitions_func (another_function) is also delayed:
@delayed
def another_function(partition):
    # Partition processing
    return result
I checked and the variables created during the processing are all small, so they shouldn't cause any issues. The partitions can be quite large but not larger than the available RAM.
When I execute partitions_func(folder), the process gets killed. At first, I thought the problem had to do with having two delayed, one on another_function and one on delayed(sum).
Removing the delayed decorator from another_function causes issues because the argument is a Dask dataframe and you can't do operations like tolist(). I tried removing delayed from sum, because I thought it could be a problem with parallelisation and the available resources but the process also gets killed.
However, I know there are 5 partitions. If I remove the statement total = delayed(sum)(partial_results) from partitions_func and compute the sum "manually" instead, everything works as expected:
total = partial_results[0].compute() + partial_results[1].compute() + partial_results[2].compute() \
+ partial_results[3].compute() + partial_results[4].compute()
Thanks!
Dask dataframe creates a series of delayed objects, so when you call a delayed function another_function on a partition you get a nested delayed, and dask.compute will not be able to handle it. One option is to use .map_partitions(); the typical example is df.map_partitions(len).compute(), which computes the length of each partition. So if you can rewrite another_function to accept a pandas dataframe and remove the delayed decorator, your code will roughly look like this:
df = dd.read_csv(f'{folder}/*.csv')
total = df.map_partitions(another_function)
Now total is a lazy Dask object that you can pass to dask.compute (or simply run total = df.map_partitions(another_function).compute()).
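A minimal sketch of that rewrite; the body of another_function here is only a stand-in, the point is that it now receives a plain pandas DataFrame per partition:

import dask.dataframe as dd

def another_function(pdf):
    # pdf is one partition as a plain pandas DataFrame
    return pdf.sum(numeric_only=True).sum()  # stand-in for the real per-partition result

def partitions_func(folder):
    df = dd.read_csv(f'{folder}/*.csv')
    per_partition = df.map_partitions(another_function)  # one value per partition, still lazy
    return per_partition.compute().sum()                 # combine the partial results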
I am processing hundreds of thousands of rows of text data using pandas DataFrames. Every so often (fewer than 5 per 100,000) a row raises an error, and I have chosen to drop those rows. The error handling function is as follows:
def unicodeHandle(datai):
    for i, row in enumerate(datai['LDTEXT']):
        print(i)
        #print(text)
        try:
            text = row.read()
            text.strip().split('[\W_]+')
            print(text)
        except UnicodeDecodeError as e:
            datai.drop(i, inplace=True)
            print('Error at index {}: {!r}'.format(i, row))
            print(e)
    return datai
The function works fine, and I have been using it for a few weeks.
The problem is that I never know when the error will occur, since the data comes from a DB that is constantly being added to (or I may pull different data). The point is, I must iterate through every row and run my error test function unicodeHandle in order to initialize my data. This process takes about 5 minutes, which gets a little annoying. I am trying to implement multiprocessing to speed up the loop. Via the web and various tutorials, I have come up with:
import multiprocessing as mp

def unicodeMP(datai):
    chunks = [datai[i::8] for i in range(8)]
    pool = mp.Pool(processes=8)
    results = pool.apply_async(unicodeHandle, chunks)
    while not results.ready():
        print("One Sec")
    return results.get()

if __name__ == "__main__":
    fast = unicodeMP(datai)
When I run the multiprocessing version, it takes the same amount of time as the regular one, even though my CPU says it is running at a much higher utilization. In addition, the code raises the error as a normal error instead of returning my cleaned dataframe. What am I missing here?
How can I use multiprocessing for functions on DataFrames?
You can try dask for multiprocessing a dataframe:
import dask.dataframe as dd

partitions = 7  # cpu_cores - 1
ddf = dd.from_pandas(df, npartitions=partitions)
cleaned = ddf.map_partitions(unicodeHandle).compute(scheduler='processes')
You can read more about dask in its documentation.
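If you would rather stay with the standard library, note that apply_async as written makes a single call with all eight chunks as separate arguments, which is what raises the error; the usual pattern is pool.map over the chunks and then pd.concat of the cleaned pieces. A rough sketch under those assumptions:

import multiprocessing as mp
import numpy as np
import pandas as pd

def unicodeMP(datai, processes=8):
    # re-number each chunk so unicodeHandle's drop-by-index keeps working
    chunks = [c.reset_index(drop=True) for c in np.array_split(datai, processes)]
    with mp.Pool(processes=processes) as pool:
        cleaned = pool.map(unicodeHandle, chunks)  # one cleaned dataframe per chunk
    return pd.concat(cleaned, ignore_index=True)

if __name__ == "__main__":
    fast = unicodeMP(datai)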
My code flow is something like:
import pandas as pd
import threading
import helpers

upload_thread = None  # so the isinstance check is safe on the first iteration

for file in files:
    df_full = pd.read_csv(file, chunksize=500000)
    for df in df_full:
        df_ready = prepare_df(df)
        # kwargs for the upload are built from df_ready (elided here)
        # test whether the previous upload is still running
        if isinstance(upload_thread, threading.Thread):
            if upload_thread.is_alive():
                print('waiting for the last upload op to finish')
                upload_thread.join()
        # start the upload in another thread, so the loop can continue with the next chunk
        upload_thread = threading.Thread(target=helpers.uploading, kwargs=kwargs)
        upload_thread.start()
It works; the problem is that running it with threading makes it slower!
My idea of the code flow is:
process a chunk of data
once it's done, upload it in the background
while it is uploading, advance the loop and process the next chunk of data
In theory this sounds great, but after a lot of trials and timing, I believe the threading is slowing down the code flow.
I'm sure I messed something up; please help me find out what it is.
Also, the function helpers.uploading returns results that are important to me. How can I access those results? Ideally I want to append the result of each iteration to a list of results. Without threading, this would be something like:
import pandas as pd
import helpers

results = []
for file in files:
    df_full = pd.read_csv(file, chunksize=500000)
    for df in df_full:
        df_ready = prepare_df(df)
        result = helpers.uploading(**kwargs)
        results.append(result)
Thanks!
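One way to keep the background upload and still collect the return values is a single-worker ThreadPoolExecutor: submit each upload, keep the Future objects, and read their .result() afterwards. A rough sketch of that idea (kwargs is still whatever you build from df_ready, as in your code):

import concurrent.futures
import pandas as pd
import helpers

futures = []
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
    for file in files:
        for df in pd.read_csv(file, chunksize=500000):
            df_ready = prepare_df(df)
            # the upload runs in the background while the loop reads and prepares the next chunk
            futures.append(executor.submit(helpers.uploading, **kwargs))

# collect the uploads' return values; result() also re-raises any exception from the worker
results = [f.result() for f in futures]

Unlike the join-based version, submit never blocks, so if uploads are consistently slower than preparation the queued chunks will accumulate in memory; whether that matters depends on your chunk size.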
I'm trying to parallelise an expensive part of my pandas calculations to speed things up.
I've already managed to make multiprocessing.Pool work with a simple example:
import multiprocessing as mpr
import numpy as np

def Test(l):
    for i in range(len(l)):
        l[i] = i**2
    return l

t = list(np.arange(100))
L = [t, t, t, t]

if __name__ == "__main__":
    pool = mpr.Pool(processes=4)
    E = pool.map(Test, L)
    pool.close()
    pool.join()
No problems here. Now my own algorithm is a bit more complicated; I can't post it here in its full glory and terribleness, so I'll use some pseudo-code to outline what I'm doing:
import pandas as pd
import time
import datetime as dt
import multiprocessing as mpr
import MPFunctions as mpf        # self-written worker functions that get called for the multiprocessing
import ClassGetDataFrames as gd  # self-written class that reads in all the data and puts it into dataframes

# === Settings
# === Use ClassGetDataFrames to get data
# === Lots of single-thread calculations and manipulations on the dataframe
# === Cut dataframe into 4 evenly big chunks, make a list of them called DDC

if __name__ == "__main__":
    pool = mpr.Pool(processes=4)
    LLT = pool.map(mpf.processChunks, DDC)
    pool.close()
    pool.join()

# === Join processed chunks LLT back into one dataframe
# === More calculations and manipulations
# === Data output
When I run this script, the following happens:
It reads in the data.
It does all calculations and manipulations until the Pool statement.
Suddenly it reads in the data again, fourfold.
Then it goes into the main script fourfold at the same time.
The whole thing cascades recursively and goes haywire.
I have read before that this can happen if you're not careful, but I do not know why it happens here. My multiprocessing code is protected by the required name-main guard (I'm on Win7 64-bit), it is only 4 lines long, it has close and join statements, and it calls one defined worker function which then calls a second worker function in a loop; that's it. As far as I know, it should just create the pool with four processes, call the four processes from the imported script, close the pool, wait until everything is done, and then continue with the script. On a side note, I first had the worker functions in the same script and the behaviour was the same. Instead of just doing what's in the pool, it seems to restart the whole script fourfold.
Can anyone enlighten me what might cause this behaviour? I seem to be missing some crucial understanding about Python's multiprocessing behaviour.
Also, I don't know if it's important: I'm on a virtual machine that sits on my company's mainframe.
Do I have to use individual processes instead of a pool?
I managed to make it work by enclosing the entire script, not just the multiprocessing part, in the if __name__ == "__main__": statement.
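The reason is that on Windows multiprocessing starts workers with the spawn method, which imports the main module afresh in every child; anything at module level (reading the data, the single-threaded calculations) therefore runs again in each process. A sketch of the usual layout, with placeholder comments standing in for the real steps:

import multiprocessing as mpr
import MPFunctions as mpf

def main():
    # settings, read data, single-threaded calculations and manipulations
    # cut the dataframe into 4 evenly sized chunks -> DDC
    DDC = ...

    with mpr.Pool(processes=4) as pool:
        LLT = pool.map(mpf.processChunks, DDC)

    # join the processed chunks in LLT back into one dataframe,
    # more calculations and manipulations, data output

if __name__ == "__main__":
    main()

The guard keeps the children from re-running the pool setup, and putting the rest of the work inside main() keeps them from re-running the data loading too.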