I'm writing Python code that processes thousands of files, puts the data from each file in a data frame, and appends each data frame to a list. Afterwards, it takes this list and concatenates it so that the end result is one matrix containing all the data from all the data frames.
Here is the code to illustrate:
import os
import json
import pandas as pd

books = []
for root, dirs, filenames in os.walk(folder_name):
    for f in filenames:
        if f == '.DS_Store':
            continue
        fullpath = os.path.join(folder_name, f)
        with open(fullpath, 'r') as book:
            data = {u[0]: u[1] for u in json.load(book)}
        books.append(pd.DataFrame(data=[data], index=[f]))

df = pd.concat(books, axis=0).fillna(0).sort_index()
M = df.as_matrix()
I encounter no issue in the processing part; the for loop works perfectly. However, when I try to concatenate, the code keeps running for 20 minutes or so and then the script stops with an "exit code -9". Any idea what that could mean and/or how it could be fixed?
Any suggestion would be very much appreciated!
I am new to multiprocessing and I am using that library in Python to parallelize the computation of a parameter for the rows of a dataframe.
The idea is the following:
I have two functions: g for the actual computation and f for filling the dataframe with the computed values. I call the function f with pool.apply_async. The problem is that at the end of the apply_async calls the dataframe has not been updated, even though a print inside f clearly shows that it is saving the values correctly. So I thought of saving the results to a csv file inside the f function, as shown in my pseudo code below. However, what I get is that the csv file where I save the results stops being updated after 2 values, and the kernel keeps running even though the terminal shows that the script has computed all the values.
This is my pseudo code:
def g(path_to_image1, path_to_image2):
    # vectorize images
    # does computation
    return value  # value is a float

def f(row, index):
    value = g(row.image1, row.image2)
    df.at[index, 'value'] = value
    df.to_csv('dftest.csv')
    return df
def callbackf(result):
    global results
    results.append(result)

inside the main:

results = []
pool = mp.Pool(N_CORES)
for index, row in df.iterrows():
    pool.apply_async(f,
                     args=(row, index),
                     callback=callbackf)
I tried to use with get_context("spawn").Pool() as pool inside the main, as suggested by https://pythonspeed.com/articles/python-multiprocessing/, but it didn't solve my problem. What am I doing wrong? Is it possible that vectorizing the images at each row causes problems for the multiprocessing?
In the end I saved the results to a txt file instead of a csv and it worked. I don't know why it didn't work with csv, though.
Here's the code I put instead of the csv and pickle lines:
with open('results.txt', 'a') as f:
    f.write(image1 +
            '\t' + image2 +
            '\t' + str(value) +
            '\n')
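A likely explanation for both symptoms: each worker process gets its own copy of df, so df.at inside f only changes that copy, and every worker calling df.to_csv('dftest.csv') overwrites the same file. Below is a minimal sketch of the usual pattern, assuming the same f/g split as in the pseudo code above (the body of g and the image paths are placeholders): have the workers return (index, value) and apply the results in the parent process.

import multiprocessing as mp
import pandas as pd

def g(path_to_image1, path_to_image2):
    # placeholder for the real image comparison; assumed to return a float
    return 0.0

def f(row, index):
    # compute in the worker and return the result instead of mutating df:
    # the worker only sees its own copy of the dataframe
    return index, g(row.image1, row.image2)

if __name__ == '__main__':
    # tiny stand-in dataframe; the real image paths would go here
    df = pd.DataFrame({'image1': ['a.png', 'b.png'],
                       'image2': ['c.png', 'd.png']})
    with mp.Pool() as pool:
        async_results = [pool.apply_async(f, args=(row, index))
                         for index, row in df.iterrows()]
        for res in async_results:
            index, value = res.get()          # collected in the parent process
            df.loc[index, 'value'] = value
    df.to_csv('dftest.csv')                   # written once, by the parent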
I have about 20000 documents in subdirectories, and I would like to read them all and append them as one list of lists. This is my code so far:
topics = os.listdir(my_directory)
df = []

for topic in topics:
    files = os.listdir(my_directory + '/' + topic)
    print(files)
    for file in files:
        print(file)
        f = open(my_directory + '/' + topic + '/' + file, 'r', encoding='latin1')
        data = f.read().replace('\n', ' ')
        print(data)
        f.close()
        df = np.append(df, data)
However, this is inefficient and it takes a long time to read the files and append them to the df list. My expected output is:
df= [[doc1], [doc2], [doc3], [doc4],......,[doc20000]]
I ran the above code and it took more than 6 hours and was still not finished (it had probably done about half of the documents). How can I change the code to make it faster?
There is only so much you can do to speed up disk access. You can use threads to overlap some of the file reads with the latin1 decoding and newline replacement. But realistically, it won't make a huge difference.
import os
import multiprocessing.pool
import numpy as np

MEG = 2**20

filelist = []
topics = os.listdir(my_directory)
for topic in topics:
    files = os.listdir(my_directory + '/' + topic)
    print(files)
    for file in files:
        print(file)
        filelist.append(my_directory + '/' + topic + '/' + file)

def worker(filename):
    # read and decode in a worker thread so I/O overlaps with processing
    with open(filename, encoding='latin1', buffering=1*MEG) as f:
        data = f.read().replace('\n', ' ')
        #print(data)
        return data

with multiprocessing.pool.ThreadPool() as pool:
    datalist = pool.map(worker, filelist, chunksize=1)

df = np.array(datalist)
Generator functions allow you to declare a function that behaves like an iterator, i.e. it can be used in a for loop. Here is a lazy function (generator) that reads a file piece by piece:
def read_in_chunks(file, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data

with open('big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
class Reader(object):
    """Wrap a generator in a minimal file-like object so pd.read_csv can consume it."""
    def __init__(self, g):
        self.g = g

    def read(self, n=0):
        try:
            return next(self.g)
        except StopIteration:
            return ''

df = pd.concat(list(pd.read_csv(Reader(read_in_chunks(open('big_file.dat'))),
                                chunksize=10000)), axis=0)
df.to_csv("output.csv", index=False)
Note
I misread the line df = np.append(df, data) and assumed you were appending to a DataFrame, not to a numpy array. So my comment is somewhat irrelevant, but I am leaving it for others who may misread it like I did, or who have a similar problem with pandas' DataFrame append.
Actual Problem
It looks like answering your question as asked may not solve your actual problem.
Have you measured the performance of your two most important calls?
files = os.listdir(my_directory + '/' + topic)
df = np.append(df, data)
The way the code was formatted in the original post made me think there is a bug: df = np.append(df, data) appeared to be outside the file for loop's scope, in which case only your last data would be appended to your data frame. If that was just a formatting problem in the post and you really do append 20k files to your data frame, then this may be the problem - appending to a DataFrame is slow.
Potential Solution
As usual, slow performance can be tackled by throwing more memory at the problem. If you have enough memory to load all of the files beforehand and only then insert them into a DataFrame, this could prove to be faster.
The key is not to touch pandas at all until you have loaded all the data. Only then should you use DataFrame's from_records or one of its other factory methods.
A nice SO question I found that has a little more discussion:
Improve Row Append Performance On Pandas DataFrames
TL;DR
Measure the time it takes to read all the files without dealing with pandas at all.
If that proves to be much, much faster and you have enough memory to load all the files' contents at once, use another way to construct your DataFrame, e.g. DataFrame.from_records (see the sketch below).
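A minimal sketch of that approach, reusing my_directory and the latin1 encoding from the question (the topic/document column names are just my own choice): read everything into a plain list first and build the DataFrame once at the end.

import os
import pandas as pd

records = []
for topic in os.listdir(my_directory):
    topic_dir = os.path.join(my_directory, topic)
    for name in os.listdir(topic_dir):
        with open(os.path.join(topic_dir, name), encoding='latin1') as fh:
            records.append({'topic': topic,
                            'document': fh.read().replace('\n', ' ')})

# One DataFrame construction instead of 20000 incremental appends
df = pd.DataFrame.from_records(records)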
I'm using Python 3.5 to move through directories and subdirectories to access csv files and fill arrays with data from those files. The first csv file the code encounters looks like this:
The code I have is below:
import matplotlib.pyplot as plt
import numpy as np
import os, csv, datetime, time, glob

gpheight = []
RH = []
dewpt = []
temp = []
windspd = []
winddir = []

dirpath, dirnames, filenames = next(os.walk('/strm1/serino/DATA'))

count2 = 0
for dirname in dirnames:
    if len(dirname) >= 8:
        try:
            dt = datetime.datetime.strptime(dirname[:8], '%m%d%Y')
            csv_folder = os.path.join(dirpath, dirname)
            for csv_file2 in glob.glob(os.path.join(csv_folder, 'figs', '0*.csv')):
                if os.stat(csv_file2).st_size == 0:
                    continue
                # create new arrays for each case
                gpheight.append([])
                RH.append([])
                temp.append([])
                dewpt.append([])
                windspd.append([])
                winddir.append([])
                with open(csv_file2, newline='') as f2_input:
                    csv_input2 = csv.reader(f2_input, delimiter=' ')
                    for j, row2 in enumerate(csv_input2):
                        if j == 0:
                            continue  # skip header row
                        # fill arrays created above
                        windspd[count2].append(float(row2[5]))
                        winddir[count2].append(float(row2[6]))
                        gpheight[count2].append(float(row2[1]))
                        RH[count2].append(float(row2[4]))
                        temp[count2].append(float(row2[2]))
                        dewpt[count2].append(float(row2[3]))
                count2 = count2 + 1
        except ValueError as e:
            pass
I have it set up to create a new array for each new csv file. However, when I print the third (temperature) column,
for n in range(0,len(temp)):
    print(temp[0][n])
it only partially prints that column of data:
-70.949997
-68.149994
-60.449997
-63.649994
-57.449997
-51.049988
-45.349991
-40.249985
-35.549988
-31.249985
-27.149994
-24.549988
-22.149994
-19.449997
-16.349976
-13.25
-11.049988
-8.949982
-6.75
-4.449982
-2.25
-0.049988
In addition, I believe a related problem is that when I simply do,
print(temp)
it prints output in which the section belonging to this one csv file (highlighted in the original screenshot) is split across several arrays, when it should be in a single array. There are also additional empty arrays at the end that should not be there.
I have (not shown) a section of code before this that does the same thing but with different csv files, and that works as expected, separating each file's data into a new array, with no empty arrays. I appreciate any help!
The issue was my use of try and pass. All the files that matched my criteria were found, but some of those files had issues with how their contents were read, which caused the errors I was receiving later in the code. For anyone looking to use try and pass: make sure that you can safely pass on any exceptions that block of code may encounter; otherwise, it could cause problems later. You may still get an error if you don't pass on it, but that will force you to fix it appropriately instead of ignoring it.
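To illustrate that advice, here is a minimal sketch (parse_case_date is a made-up helper, not from the code above) that logs the exception and the offending name instead of silently passing, so bad input surfaces immediately:

import datetime
import logging

logging.basicConfig(level=logging.WARNING)

def parse_case_date(dirname):
    """Return the date encoded in a directory name, or None if it is malformed."""
    try:
        return datetime.datetime.strptime(dirname[:8], '%m%d%Y')
    except ValueError as e:
        # Log the problem instead of a bare pass, so bad directories are visible
        logging.warning("Skipping %s: %s", dirname, e)
        return None

print(parse_case_date('01152015_case'))   # parses fine
print(parse_case_date('figs'))            # logs a warning and returns None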
My program first clusters a big dataset into 100 clusters, then runs a model on each cluster of the dataset using multiprocessing. My goal is to concatenate all the output values into one big csv file, which is the concatenation of all the output data from the 100 fitted models.
For now, I am just creating 100 csv files, then looping over the folder containing these files and copying them, one by one and line by line, into a big file.
My question: is there a smarter method to get this big output file without exporting 100 intermediate files? I use pandas and scikit-learn for data processing, and multiprocessing for parallelization.
Have your worker processes return the dataset to the main process rather than writing the csv files themselves; then, as they give data back to your main process, have it write them to one continuous csv.
from multiprocessing import Process, Manager

def worker_func(proc_id, results):
    # Do your thing
    results[proc_id] = ["your dataset from %s" % proc_id]

def convert_dataset_to_csv(dataset):
    # Placeholder example. I realize what it's doing is ridiculous
    converted_dataset = [','.join(data.split()) for data in dataset]
    return converted_dataset

if __name__ == '__main__':
    m = Manager()
    d_results = m.dict()

    worker_count = 100
    jobs = [Process(target=worker_func,
                    args=(proc_id, d_results))
            for proc_id in range(worker_count)]

    for j in jobs:
        j.start()
    for j in jobs:
        j.join()

    with open('somecsv.csv', 'w') as f:
        for d in d_results.values():
            # if the actual conversion function benefits from multiprocessing,
            # you can do that there too instead of here
            for r in convert_dataset_to_csv(d):
                f.write(r + '\n')
If all of your partial csv files have no headers and share column number and order, you can concatenate them like this:
with open("unified.csv", "w") as unified_csv_file:
for partial_csv_name in partial_csv_names:
with open(partial_csv_name) as partial_csv_file:
unified_csv_file.write(partial_csv_file.read())
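If the partial files do have a header row, a small variation (my own addition, not part of the original answer) is to keep the header from the first file and skip it in the rest:

with open("unified.csv", "w") as unified_csv_file:
    for i, partial_csv_name in enumerate(partial_csv_names):
        with open(partial_csv_name) as partial_csv_file:
            if i > 0:
                partial_csv_file.readline()   # drop the repeated header line
            unified_csv_file.write(partial_csv_file.read())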
Pinched the guts of this from http://computer-programming-forum.com/56-python/b7650ebd401d958c.htm; it's a gem.
#!/usr/bin/python
# -*- coding: utf-8 -*-
from glob import glob

n = 1
file_list = glob('/home/rolf/*.csv')
concat_file = open('concatenated.csv', 'w')
files = map(lambda f: open(f, 'r').read, file_list)
print "There are {x} files to be concatenated".format(x=len(files))
for f in files:
    print "files added {n}".format(n=n)
    concat_file.write(f())
    n += 1
concat_file.close()
I have a large number of .asc files containing (x,y) coordinates for two given satellites. There are approximately 3,000 separate files for each satellite (e.g. Satellite1 = [file1,file2,..., file3000] and Satellite2= [file1,file2,..., file3000]).
I'm trying to write some code in Python (version 2.7.8 |Anaconda 2.0.) that finds the multiple points on the Earth's surface where both satellite tracks crossover.
I've written some basic code that takes two files as input (i.e. one from Sat1 and one from Sat2) using loadtxt. In a nutshell, the code looks like this:
sat1_in = loadtxt("sat1_file1.asc", usecols=(1, 2), comments="#")
sat2_in = loadtxt("sat2_file1.asc", usecols=(1, 2), comments="#")

def main():
    xover_search()   # Returns True or False whether a crossover is found.
    xover_final()    # Returns the (x,y) coordinates of the crossover.
    write_output()   # Appends these coordinates to a txt file for later display.

if __name__ == "__main__":
    main()
I would like to implement this code to the whole dataset, using a function that outputs "sat1_in" and "sat2_in" for all possible combinations of files between satellite 1 and satellite 2. These are my ideas so far:
# Create two empty lists to store all the files to process for Sat1 and Sat2:
sat1_files = []
sat2_files = []

# Use os.walk to fill each list with the respective file paths:
for root, dirs, filenames in os.walk('.'):
    for filename in fnmatch.filter(filenames, 'sat1*.asc'):
        sat1_files.append(os.path.join(root, filename))

for root, dirs, filenames in os.walk('.'):
    for filename in fnmatch.filter(filenames, 'sat2*.asc'):
        sat2_files.append(os.path.join(root, filename))

# Calculate all possible combinations between both lists using itertools.product:
iter_file = list(itertools.product(sat1_files, sat2_files))

# Extract two lists of files for sat1 and sat2 to be compared each iteration:
sat1_ordered = [seq[0] for seq in iter_file]
sat2_ordered = [seq[1] for seq in iter_file]
And this is where I get stuck. How can I iterate through "sat1_ordered" and "sat2_ordered", using loadtxt to extract the lists of coordinates for every single file? The only thing I have tried is:
for file in sat1_ordered:
    sat1_in = np.loadtxt(file, usecols=(1, 2), comments="#")
But this will create a huge list containing all the measurements for satellite 1.
Could someone give me some ideas about how to tackle this problem?
Maybe you are searching for something like this:
for file1, file2 in iter_file:
    sat1_in = np.loadtxt(file1, usecols=(1, 2), comments="#")
    sat2_in = np.loadtxt(file2, usecols=(1, 2), comments="#")
    ....
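A minor variation on the same idea (my own suggestion): since iter_file is just list(itertools.product(sat1_files, sat2_files)), you can also iterate the product lazily and skip building iter_file, sat1_ordered and sat2_ordered altogether:

import itertools
import numpy as np

for file1, file2 in itertools.product(sat1_files, sat2_files):
    sat1_in = np.loadtxt(file1, usecols=(1, 2), comments="#")
    sat2_in = np.loadtxt(file2, usecols=(1, 2), comments="#")
    # ... run the crossover search on this pair of tracks ...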