Python Multiprocessing Executing Function on Many Files

I have a list of CSV files that contain numbers, all the same size. I also have lists of metadata associated with each file (e.g. a list of dates, a list of authors). What I'm trying to do is read in each file, run a function on its values, then put the maximum value and its x-value, along with the metadata associated with that file, into a dataframe.
I have approximately 100,000 files to process, so multiprocessing would save an incredible amount of time. Important to note: I'm running the following code in a Jupyter Notebook on Windows, a combination which I know has known issues with multiprocessing. The code I've tried so far is:
import multiprocessing as mp
import pandas as pd
# ...plus whatever the computation itself needs

files = [file1, file2, ..., fileN]
dates = [date1, ..., dateN]
authors = [author1, ..., authorN]

def FindTriggers(file):
    # Reads in the file and computes the function on its values,
    # producing max_value and x_value (placeholders here)
    index = files.index(file)
    d1 = {'Filename': file, 'Max': max_value, 'X-value': x_value,
          'Date': dates[index], 'Author': authors[index]}
    return d1

if __name__ == '__main__':
    with mp.Pool() as pool:
        data = pool.map(FindTriggers, files)
    summaryDF = pd.DataFrame(data)
This runs indefinitely, which I understand is a known issue, but I'm not sure what the actual cause is or how to fix it. Thank you in advance for any help.
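On Windows (and in Jupyter) multiprocessing uses the spawn start method, so worker functions must be importable from a module; a function defined inside a notebook cell generally cannot be found by the spawned children, which is one common reason the cell hangs. A minimal sketch of a frequently suggested workaround, not taken from the question: move the worker into a separate file (here hypothetically named worker.py) and pass everything it needs explicitly instead of relying on notebook globals. The pd.read_csv call and the max computation below are illustrative placeholders for the real function:
# worker.py  (hypothetical module name; the worker must live in an importable file)
import pandas as pd

def find_triggers(args):
    # read one CSV, take its maximum value, and attach that file's metadata
    file, date, author = args
    df = pd.read_csv(file, header=None)   # assumes plain numeric CSVs with no header
    return {'Filename': file, 'Max': df.values.max(), 'Date': date, 'Author': author}
Then, in the notebook (or a script), import the worker from its module and pass each file together with its metadata, so the worker never has to look anything up in shared lists:
import multiprocessing as mp
import pandas as pd
from worker import find_triggers

if __name__ == '__main__':
    with mp.Pool() as pool:
        data = pool.map(find_triggers, zip(files, dates, authors))
    summaryDF = pd.DataFrame(data)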

Related

Pandas won't create .csv file

I recently started diving into algo trading and building a bot for crypto trading.
For this I created a backtester with pandas to run different strategies with different parameters. The datasets (CSV files) I use are rather large (around 40 MB each).
These are processed, but as soon as I want to save the processed data to a CSV, nothing happens. No output whatsoever, not even an error message. I tried using the full path, I tried saving it with just the filename, I even tried saving it as a .txt file. Nothing seems to work. I also tried the solutions I was able to find on Stack Overflow.
I am using Anaconda3, in case that could be the source of my problem.
Here is the part of my code which tries to save the dataframe to a file:
results_df = pd.DataFrame(results)
results_df.columns = ['strategy', 'number_of_trades', 'capital']
print(results_df)

for i in range(2, len(results_df)):
    if results_df.capital.iloc[i] < results_df.capital.iloc[0]:
        results_df.drop([i], axis="index")

# results to csv
current_dir = os.getcwd()
results_df.to_csv(os.getcwd() + '\\file.csv')
print(results_df)
Thank you for your help!
You can simplify your code a great deal and write it as follows (it should also run faster):
results_df = pd.DataFrame(results)
results_df.columns = ['strategy', 'number_of_trades', 'capital']
print(results_df)

first_row_capital = results_df.capital.iloc[0]
indexer_capital_smaller = results_df.capital < first_row_capital
values_to_delete = indexer_capital_smaller[indexer_capital_smaller].index
results_df.drop(index=values_to_delete, inplace=True)

# results to csv
current_dir = os.getcwd()
results_df.to_csv(os.getcwd() + '\\file.csv')
print(results_df)
I think the main problem in your code might be that you write the CSV each time you find an entry in the dataframe where capital satisfies the condition, and that you write it only if you find such a case.
And if you only need the filtered data for the CSV output and don't need the full dataframe in memory anymore, you can make it even simpler:
results_df = pd.DataFrame(results)
results_df.columns = ['strategy', 'number_of_trades', 'capital']
print(results_df)

first_row_capital = results_df.capital.iloc[0]
indexer_capital_smaller = results_df.capital < first_row_capital

# results to csv
current_dir = os.getcwd()
results_df[indexer_capital_smaller].to_csv(os.getcwd() + '\\file.csv')
print(results_df[indexer_capital_smaller])
This second variant just applies a boolean filter before writing the filtered rows to the CSV and before printing them.

Return response as an excel file written using multiprocessing, BytesIO, xlsxwriter

I have the following list schema:
multiple_years_data = [
    {
        'dates': ...,   # list of dates for one year
        'values': ...,  # list of values for one year
        'buffer': output
    },
    {
        'dates': ...,   # list of dates for another year
        'values': ...,  # list of values for another year
        'buffer': output
    },
    # ... and so on
]
I want to insert all the data from above into an Excel file with two columns (date, value), and for each dict in the list I want to create a new worksheet.
For example, if I had 5 dicts in the list, I would have 5 worksheets, each with the data from its dict.
With this much data, doing all the inserts synchronously takes some time, so I thought about implementing multiprocessing so that I can use processing power from other cores. If I had data for 5 years, I would create 5 processes. To start, I tried it with only one process and one dict; if that works I will add more processes.
By the way, I should mention that I use the Django framework.
So these are the most important parts of my code:
from io import BytesIO
from multiprocessing import Pool

import xlsxwriter
from django.http import HttpResponse

def add_one_year_to_excel(data):
    workbook = xlsxwriter.Workbook(data['buffer'])
    worksheet = workbook.add_worksheet('first_worksheet')
    # insert those 2 lists of data (`dates`, `values`) in two columns
    worksheet.write_column(1, 0, data['dates'])
    worksheet.write_column(1, 1, data['values'])
    return workbook

output = BytesIO()
workbook = xlsxwriter.Workbook(output)

pool = Pool(1)
result = pool.apply(add_one_year_to_excel, (multiple_years_data[0],))  # one process, one dict for now
pool.close()

# this makes it so that the resulting file is downloaded automatically
# when the assigned URL is accessed
response = HttpResponse(result.read(),
                        content_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
response['Content-Disposition'] = 'attachment; filename=example.xlsx'
response['Access-Control-Expose-Headers'] = 'Content-Disposition'
return response
I know some parts are written a little badly, but I tried to make a very simple schema of my code so you can understand my problem.
Now the problem is that once the result from add_one_year_to_excel(data) is returned, things break:
MaybeEncodingError at /api/v1/forecasted_profile_excel/
Error sending result: '<xlsxwriter.workbook.Workbook object at 0x7f5874306dc0>'. Reason: 'PicklingError("Can't pickle <class 'xlsxwriter.worksheet.Number'>: attribute lookup Number on xlsxwriter.worksheet failed")'
In the end I understood that only the BytesIO() object can be pickled; once I add a worksheet, the object can't be pickled anymore. So the whole problem revolves around pickling, both when returning results and when sharing data between multiple processes.
I've been struggling with this problem for 3 days. I've looked all over the internet and I still can't find a way to make this work.
So I'm asking if you can help me get past this problem, or suggest a better way of achieving the same result. I'm kind of new to this area, so maybe my code isn't well written.
Thanks in advance.
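A sketch of one possible way around this, not tested against the code above: xlsxwriter Workbook and Worksheet objects are not picklable, so they cannot be returned from a pool worker, but plain bytes are (each worker could build its own single-year workbook and return buffer.getvalue()). If everything must end up in one workbook with one worksheet per year, however, the writing itself has to happen in a single process, and only the data preparation can be parallelised. The single-process writing step might look like this (multiple_years_data is the list from the question; the worksheet names are illustrative):
from io import BytesIO
import xlsxwriter

output = BytesIO()
workbook = xlsxwriter.Workbook(output, {'in_memory': True})

# write one worksheet per year in the parent process; only picklable data
# (the dicts of dates/values) ever needs to cross process boundaries
for i, data in enumerate(multiple_years_data):
    worksheet = workbook.add_worksheet(f'year_{i}')
    worksheet.write_column(1, 0, data['dates'])
    worksheet.write_column(1, 1, data['values'])

workbook.close()   # finalises the xlsx content into the BytesIO buffer
output.seek(0)     # rewind before handing the bytes to HttpResponse
The buffer can then be passed to HttpResponse exactly as in the question, using output.read() instead of result.read().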

Increase speed numpy.loadtxt?

I have hundreds of thousands of data text files to read. As of now, I'm importing the data from the text files every time I run the code. Perhaps the easy solution would be to simply reformat the data into a file format that is faster to read.
Anyway, right now every text file I have looks like this:
User: unknown
Title : OE1_CHANNEL1_20181204_103805_01
Sample data
Wavelength OE1_CHANNEL1
185.000000 27.291955
186.000000 27.000877
187.000000 25.792290
188.000000 25.205620
189.000000 24.711882
.
.
.
The code where I read and import the txt files is:
# IMPORT DATA
path = 'T2'
if len(sys.argv) == 2:
    path = sys.argv[1]
files = os.listdir(path)

trans_import = []
for index, item in enumerate(files):
    trans_import.append(np.loadtxt(path + '/' + item, dtype=float, skiprows=4, usecols=(0, 1)))
In the variable explorer, the resulting array looks like:
{ndarray} = [[185. 27.291955]\n [186. 27.000877]\n ... ]
I'm wondering how I could speed up this part? It currently takes a little too long just to import ~4k text files. There are 841 lines inside every text file (one spectrum). The output I get with this code is 841 * 2 = 1682. Obviously, it considers the \n as a line...
It would probably be much faster if you had one large file instead of many small ones. This is generally more efficient. Additionally, you might get a speedup from just saving the numpy array directly and loading that .npy file instead of reading in a large text file. I'm not as sure about the last part though. As always when time is a concern, I would try both of these options and then measure the performance improvement.
If for some reason you really can't just have one large text file / .npy file, you could also probably get a speedup by using, e.g., multiprocessing to have multiple workers reading in the files at the same time. Then you can just concatenate the matrices together at the end.
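A minimal sketch of that multiprocessing idea, under the question's assumptions (every file has the same shape and the same four header lines); the function name and the .npy filename are illustrative, not from the original post:
import os
import numpy as np
from multiprocessing import Pool

def load_one(filepath):
    # same parsing options as in the question: skip the 4 header lines, keep 2 columns
    return np.loadtxt(filepath, dtype=float, skiprows=4, usecols=(0, 1))

if __name__ == '__main__':
    path = 'T2'
    filepaths = [os.path.join(path, f) for f in os.listdir(path)]
    with Pool() as pool:
        trans_import = pool.map(load_one, filepaths)
    # stack into one (n_files, 841, 2) array and cache it for later runs
    data = np.stack(trans_import)
    np.save('all_spectra.npy', data)   # np.load('all_spectra.npy') is much faster next time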
Not your primary question but since it seems to be an issue - you can rewrite the text files to not have those extra newlines, but I don't think np.loadtxt can ignore them. If you're open to using pandas, though, pandas.read_csv with skip_blank_lines=True should handle that for you. To get a numpy.ndarray from a pandas.DataFrame, just do dataframe.values.
Use pandas.read_csv (with its fast C parser) instead of numpy.loadtxt. This is a very helpful post:
http://akuederle.com/stop-using-numpy-loadtxt
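A brief sketch of that substitution, mirroring the options of the loadtxt call in the question (the function name is illustrative):
import pandas as pd

def load_one_fast(filepath):
    # read_csv's C engine is typically much faster than np.loadtxt for files like this
    df = pd.read_csv(filepath, sep=r'\s+', skiprows=4, header=None,
                     usecols=[0, 1], skip_blank_lines=True)
    return df.values   # plain (n_rows, 2) numpy array, like np.loadtxt returned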

python pandas dataframe variable not updating

I'm working on creating a program that will pull some data from a preformatted file that does not include a timestamp but requires one. I know the following things:
The name of the file, which includes the hour at which the data was logged. I can assume that the first data point was collected at the start of the hour, and I can parse that.
I know that each data point was collected at a frequency of 64 Hz, so I know the time delta between each data point.
As I write the code chunk to extract these data, I am running into a problem: my date is updating, but my hour isn't. The result is that all my data have the correct date but the same hour. I'm hoping this is just the result of missing something obvious due to sleep deprivation, but I'd appreciate it if someone could point out the problem with my code.
# Paths for files to process
advpath = '/Users/stnixon/Dropbox/GradSchool/Research/EddyCovarianceData/data/palmyra2016/**'
# Create list of files to process
advfiles = glob.glob(os.path.join(advpath, '*.A16'))
# Create data frames, load files, concatenate, and sort adv files and dfetfiles
advframe = []
for f in advfiles:
    advdf = pd.read_csv(f, sep=r'\s+', names=['ID', 'u', 'v', 'w', 'u1', 'v1', 'w1', 'ucorr', 'vcorr', 'wcorr'], usecols=[0, 1, 2, 3, 7, 8, 9])
    file_now = os.path.basename(f)
    print(int(file_now[4:6]))
    advdf['Time'] = pd.to_datetime(int(file_now[4:6]), unit='h')
    advdf['Date'] = pd.to_datetime('2016' + file_now[0:2] + file_now[2:4])
    advframe.append(advdf)
advdata = pd.concat(advframe)
Essentially, the Date column gives me the right date across each row, while the Time column just gives me the same value for all.
It turns out that this was not a bug but a weird coincidence. The hour was being parsed correctly for each file, but the files were being processed in a random order, and it just so happened that the hours for the first two and last two files were all the same, so it looked in the terminal as if the value wasn't updating.
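As a side note, once the hour is parsed, the full per-sample timestamps described in the question (first sample at the top of the hour, samples at 64 Hz) can be built directly inside the loop; a small sketch under those assumptions, reusing file_now and advdf from the code above:
# build one timestamp per row: start of the hour plus an offset at 64 Hz
start = pd.to_datetime('2016' + file_now[0:2] + file_now[2:4]) + pd.to_timedelta(int(file_now[4:6]), unit='h')
advdf['Timestamp'] = start + pd.to_timedelta(advdf.index / 64.0, unit='s')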

Join/merge multiple NetCDF files using xarray

I have a folder with NetCDF files from 2006-2100, in ten year blocks (2011-2020, 2021-2030 etc).
I want to create a new NetCDF file which contains all of these files joined together. So far I have read in the files:
ds = xarray.open_dataset('Path/to/file/20062010.nc')
ds1 = xarray.open_dataset('Path/to/file/20112020.nc')
etc.
Then merged these like this:
dsmerged = xarray.merge([ds,ds1])
This works, but is clunky and there must be a simpler way to automate this process, as I will be doing this for many different folders full of files. Is there a more efficient way to do this?
EDIT:
Trying to join these files using glob:
for filename in glob.glob('path/to/file/.*nc'):
    dsmerged = xarray.merge([filename])
Gives the error:
AttributeError: 'str' object has no attribute 'items'
This is reading only the text of the filename, not the actual file itself, so it can't merge it. How do I open each file, store it as a variable, and then merge them all without doing it bit by bit?
If you are looking for a clean way to get all your datasets merged together, you can use some form of list comprehension and the xarray.merge function to get it done. The following is an illustration:
ds = xarray.merge([xarray.open_dataset(f) for f in glob.glob('path/to/file/*.nc')])
In response to the out-of-memory issues you encountered, that is probably because you have more files than the Python process can handle. The best fix for that is to use the xarray.open_mfdataset function, which uses the dask library under the hood to break the data into smaller chunks to be processed. This is usually more memory efficient and will often allow you to bring your data into Python. With this function, you do not need a for-loop; you can just pass it a glob string in the form "path/to/my/files/*.nc". The following is equivalent to the previously provided solution, but more memory efficient:
ds = xarray.open_mfdataset('path/to/file/*.nc')
I hope this proves useful.
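To also write the combined dataset back out as a single NetCDF file, as the question asks, a brief sketch (the output filename is illustrative):
import glob
import xarray

# lazily open all decade files as one dataset, then save the merged result
ds = xarray.open_mfdataset('path/to/file/*.nc')
ds.to_netcdf('merged_2006_2100.nc')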
