concatenating csv files nicely with python

concatenating csv files nicely with python - python

My program first clusters a big dataset in 100 clusters, then run a model on each cluster of the dataset using multiprocessing. My goal is to concatenate all the output values in one big csv file which is the concatenation of all output datas from the 100 fitted models.
For now, I am just creating 100 csv files, then loop on the folder containing these files and copying them one by one and line by line in a big file.
My question: is there a smarter method to get this big output file without exporting 100 files. I use pandas and scikit-learn for data processing, and multiprocessing for parallelization.

have your processing threads return the dataset to the main process rather than writing the csv files themselves, then as they give data back to your main process, have it write them to one continuous csv.
from multiprocessing import Process, Manager
def worker_func(proc_id,results):
# Do your thing
results[proc_id] = ["your dataset from %s" % proc_id]
def convert_dataset_to_csv(dataset):
# Placeholder example. I realize what its doing is ridiculous
converted_dataset = [ ','.join(data.split()) for data in dataset]
return converted_dataset
m = Manager()
d_results= m.dict()
worker_count = 100
jobs = [Process(target=worker_func,
args=(proc_id,d_results))
for proc_id in range(worker_count)]
for j in jobs:
j.start()
for j in jobs:
j.join()
with open('somecsv.csv','w') as f:
for d in d_results.values():
# if the actual conversion function benefits from multiprocessing,
# you can do that there too instead of here
for r in convert_dataset_to_csv(d):
f.write(r + '\n')

If all of your partial csv files have no headers and share column number and order, you can concatenate them like this:
with open("unified.csv", "w") as unified_csv_file:
for partial_csv_name in partial_csv_names:
with open(partial_csv_name) as partial_csv_file:
unified_csv_file.write(partial_csv_file.read())

Pinched the guts of this from http://computer-programming-forum.com/56-python/b7650ebd401d958c.htm it's a gem.
#!/usr/bin/python
# -*- coding: utf-8 -*-
from glob import glob
n=1
file_list = glob('/home/rolf/*.csv')
concat_file = open('concatenated.csv','w')
files = map(lambda f: open(f, 'r').read, file_list)
print "There are {x} files to be concatenated".format(x=len(files))
for f in files:
print "files added {n}".format(n=n)
concat_file.write(f())
n+=1
concat_file.close()

Related

Write examples to the same TFRecord file in parallel

I have a process that reads URIs of CSV files located in cloud storage, serializes the data (one file is an "example" in tensorflow speak), and writes them to the same TFRecord file.
The process is very slow and I would like to parallelize the writing using python multiprocessing. I've searched high and low and tried multiple implementations to no avail. This question is very similar to mine, but the question is never really answered.
This is the closest I've come (unfortunately, I can't really provide a replicable example due to the read from cloud storage):
import pandas as pd
import multiprocessing
import tensorflow as TF
TFR_PATH = "./tfr.tfrecord"
BANDS = ["B2", "B3","B4","B5","B6","B7","B8","B8A","B11","B12"]
def write_tfrecord(tfr_path, df_list, bands):
with tf.io.TFRecordWriter(tfr_path) as writer:
for _, grp in df_list:
band_data = {b: [] for b in bands}
for i, row in grp.iterrows():
try:
df = pd.read_csv(row['uri'])
except FileNotFoundError:
continue
df = prepare_df(df, bands)
label = row['FS_crop'].encode()
for b in bands:
band_data[b].append(list(df[b].astype('Int64')))
# pad to same length and flatten
mlen = max([len(j) for j in band_data[list(band_data.keys())[0]]])
npx = len(band_data[list(band_data.keys())[0]])
flat_band_data = {k: [] for k in band_data}
for k,v in band_data.items(): # for each band
for b in v:
flat_band_data[k].extend(b + [0] * int(mlen - len(b)))
example_proto = serialize_example(npx, flat_band_data, label)
writer.write(example_proto)
# List of grouped DF object, may be 1000's long
gqdf = list(qdf.groupby("field_centroid_str"))
n = 100 #Groups of files to write
processes = [multiprocessing.Process(target=write_tfrecord, args=(TFR_PATH, gqdf[i:i+n], BANDS)) for i in range(0, len(gqdf), n)]
for p in processes:
p.start()
for p in processes:
p.join()
p.close()
This processes will finish, but when I go to read a record, I like so:
raw_dataset = tf.data.TFRecordDataset(TFR_PATH)
for raw_record in raw_dataset.take(10):
example = tf.train.Example()
example.ParseFromString(raw_record.numpy())
print(example)
I always end up with a corrupted data error DataLossError: corrupted record at 7462 [Op:IteratorGetNext]
Any ideas on the correct approach for doing something like this? I've tried using Pool instead of Process, but the tf.io.TFRecordWriter can't be pickled, so it doesn't work.

Efficient way to read a lot of text files using python

I have about 20000 documents in subdirectories. And I would like to read them all and append them as a one list of lists. This is my code so far,
topics =os.listdir(my_directory)
df =[]
for topic in topics:
files = os.listdir (my_directory+ '/'+ topic)
print(files)
for file in files:
print(file)
f = open(my_directory+ '/'+ topic+ '/'+file, 'r', encoding ='latin1')
data = f.read().replace('\n', ' ')
print(data)
f.close()
df = np.append(df, data)
However this is inefficient, and it takes a long time to read and append them in the df list. My expected output is,
df= [[doc1], [doc2], [doc3], [doc4],......,[doc20000]]
I ran the above code and it took more than 6 hours and was still not finished(probably did half of the documents).How can I change the code to make it faster?

There is only so much you can do to speed disk access. You can use threads to overlap some file read operations with the latin1 decode and newline replacement. But realistically, it won't make a huge difference.
import multiprocessing.pool
MEG = 2**20
filelist = []
topics =os.listdir(my_directory)
for topic in topics:
files = os.listdir (my_directory+ '/'+ topic)
print(files)
for file in files:
print(file)
filelist.append(my_directory+ '/'+ topic+ '/'+file)
def worker(filename):
with open(filename, encoding ='latin1', bufsize=1*MEG) as f:
data = f.read().replace('\n', ' ')
#print(data)
return data
with multiprocessing.pool.ThreadPool() as pool:
datalist = pool.map(worker, filelist, chunksize=1)
df = np.array(datalist)

Generator functions allow you to declare a function that behaves like
an iterator, i.e. it can be used in a for loop.
generators
lazy function generator
def read_in_chunks(file, chunk_size=1024):
"""Lazy function (generator) to read a file piece by piece.
Default chunk size: 1k."""
while True:
data = file.read(chunk_size)
if not data:
break
yield data
with open('big_file.dat') as f:
for piece in read_in_chunks(f):
process_data(piece)
class Reader(object):
def __init__(self, g):
self.g = g
def read(self, n=0):
try:
return next(self.g)
except StopIteration:
return ''
df = pd.concat(list(pd.read_csv(Reader(read_in_chunks()),chunksize=10000)),axis=1)
df.to_csv("output.csv", index=False)

Note
I misread the line df = np.append(df, data) and I assumed you are appending to DataFrame, not to numpy array. So my comment is kind of irrelevant but I am leaving it for others that my misread like me or have a similar problem with pandas' DataFrame append.
Actual Problem
It looks like your question may not actually solve your actual problem.
Have you measured the performance of your two most important calls?
files = os.listdir (my_directory+ '/'+ topic)
df = np.append(df, data)
The way you formatted your code makes me think there is a bug: df = np.append(df, data) is outside the file's for loop scope so I think only your last data is appended to your data frame. In case that's just problem with code formatting here in the post and you really do append 20k files to your data frame then this may be the problem - appending to DataFrame is slow.
Potential Solution
As usual slow performance can be tackled by throwing more memory at the problem. If you have enough memory to load all of the files beforehand and only then insert them in a DataFrame this could prove to be faster.
The key is to not deal with any pandas operation until you have loaded all the data. Only then you could use DataFrame's from_records or one of its other factory methods.
A nice SO question that has a little more discussion I found:
Improve Row Append Performance On Pandas DataFrames
TL;DR
Measure the time to read all the files without dealing with pandas at all
If it proves to be much much faster and you have enough memory to load all the files' contents at once use another way to construct your DataFrame, say DataFrame.from_records

How to read multiple files into multiple threads/processess to optimize data analyses?

I am trying to read 3 different files in python and do something to extract the data outof it. Then I want to merge the data into one big file.
Since each individual files are already big and take sometime doing the data processing, I am thinking if
I can read all three files at once (in multiple threads/process)
wait for the process for all files to finish
when all output are ready then pipe all the data to downstream function to merge it.
Can someone suggest some improvement to this code to do what I want.
import pandas as pd
file01_output = ‘’
file02_output = ‘’
file03_output = ‘’
# I want to do all these three “with open(..)” at once.
with open(‘file01.txt’, ‘r’) as file01:
for line in file01:
something01 = do something in line
file01_output += something01
with open(‘file02.txt’, ‘r’) as file01:
for line in file01:
something02 = do something in line
file02_output += something02
with open(‘file03.txt’, ‘r’) as file01:
for line in file01:
something03 = do something in line
file03_output += something03
def merge(a,b,c):
a = file01_output
b = file01_output
c = file01_output
# compile the list of dataframes you want to merge
data_frames = [a, b, c]
df_merged = reduce(lambda left,right: pd.merge(left,right,
on=['common_column'], how='outer'), data_frames).fillna('.')

There are many ways to use multiprocessing in your problem so I'll just propose one way. Since, as you mentioned, the processing happening on the data in the file is CPU bound you can run that in a separate process and expect to see some improvement (how much improvement, if any, depends on the problem, algorithm, # cores, etc.). For example, the overall structure could look like just having a pool which you map a list of all the filenames which you need to process and in that function you do your computing.
It's easier with a concrete example. Let's pretend we have a list of CSVs 'file01.csv', 'file02.csv', 'file03.csv' which have a NUMBER column and we want to compute whether that number is prime (CPU bound). Example, file01.csv:
NUMBER
1
2
3
...
And the other files look similar but with different numbers to avoid duplicating work. The code to compute the primes could then look like this:
import pandas as pd
from multiprocessing import Pool
from sympy import isprime
def compute(filename):
# IO (probably not faster)
my_data_df = pd.read_csv(filename)
# do some computing (CPU)
my_data_df['IS_PRIME'] = my_data_df.NUMBER.map(isprime)
return my_data_df
if __name__ == '__main__':
filenames = ['file01.csv', 'file02.csv', 'file03.csv']
# construct the pool and map to the workers
with Pool(2) as pool:
results = pool.map(compute, filenames)
print(pd.concat(results))
I've used the sympy package for a convenient isprime method and I'm sure the structure of my data is quite different but, hopefully, that example illustrates a structure you could use too. The plan of performing your CPU bound computations in a pool (or list of Processes) and then merge/reduce/concatenating the result is a reasonable approach to the problem.

Processing a huge amount of files in python

I have a huge number of report files (about 650 files) which takes about 320 M of hard disk and I want to process them. There are a lot of entries in each file; I should count and log them based on their content. Some of them are related to each other and I should find, log and count them too; matches may be in different files. I have wrote a simple script to do the job. I used python profiler and it just took about 0.3 seconds to run the script for one single file with 2000 lines that we need half of them for processing. But for the whole directory it took 1 hour and a half to be done. This is how my script looks like:
# imports
class Parser(object):
def __init__(self):
# load some configurations
# open some log files
# set some initial values for some variables
def parse_packet(self, tags):
# extract some values from line
def found_matched(self, packet):
# search in the related list to find matched line
def save_packet(self, packet):
# write the line in the appropriate files and increase or decrease some counters
def parse(self, file_addr):
lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
for line in lines:
packet = parse_packet(line)
if found_matched(packet):
# count
self.save_packet(packet)
def process_files(self):
if not os.path.isdir(self.src_dir):
self.log('No such file or directory: ' + str(self.src_dir))
sys.exit(1)
input_dirs = os.walk(self.src_dir)
for dname in input_dirs:
file_list = dname[2]
for fname in file_list:
self.parse(os.path.join(dname[0], fname))
self.finalize_process()
def finalize_process(self):
# closing files
I want to decrease the time at least to the 10% percent of current execution time. Maybe multiprocessing can help me or just some enhancement in current script will do the task. Anyway could you please help me in this?
Edit 1:
I have changed my code according to #Reut Sharabani's answer:
def parse(self, file_addr):
lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
for line in lines:
packet = parse_packet(line)
if found_matched(packet):
# count
self.save_packet(packet)
def process_files(self):
if not os.path.isdir(self.src_dir):
self.log('No such file or directory: ' + str(self.src_dir))
sys.exit(1)
input_dirs = os.walk(self.src_dir)
for dname in input_dirs:
process_pool = multiprocessing.Pool(10)
for fname in file_list:
file_list = [os.path.join(dname[0], fname) for fname in dname[2]]
process_pool.map(self.parse, file_list)
self.finalize_process()
I also added below lines before my class definition to avoid PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup
__builtin__.instancemethod failed:
import copy_reg
import types
def _pickle_method(m):
if m.im_self is None:
return getattr, (m.im_class, m.im_func.func_name)
else:
return getattr, (m.im_self, m.im_func.func_name)
copy_reg.pickle(types.MethodType, _pickle_method)
Another thing that I have done into my code was not to keep open log files during file processing; I open and close them for writing each entry just to avoid ValueError: I/O operation on closed file.
Now the problem is that I have some files which are being processed multiple times. I also got wrong counts for my packets. What did I do wrong? Should I put process_pool = multiprocessing.Pool(10) before the for loop? Consider that I have just one directory right now and it doesn't seem to be the problem.
EDIT 2:
I also tried using ThreadPoolExecutor this way:
with ThreadPoolExecutor(max_workers=10) as executor:
for fname in file_list:
executor.submit(self.parse, fname)
Results were correct, but it took an hour and a half to be completed.

First of all, "about 650 files which takes about 320 M" is not a lot. Given that modern hard disks easily read and write 100 MB/s, the I/O performance of your system probably is not your bottleneck (also supported by "it just took about 0.3 seconds to run the script for one single file with 2000 lines", which clearly indicates CPU-limitation). However, the exact way you are reading files from within Python may not be efficient.
Furthermore, a simple multiprocessing-based architecture, run on a common multi core system, will allow you to perform your analysis much faster (no need to involve celery here, no need to cross machine boundaries).
multiprocessing architecture
Just have a look at multiprocessing, your architecture likely will involve one manager process (the parent), which defines a task Queue, and a Pool of worker processes. The manager (or feeder) puts tasks (e.g. file names) into the queue, and the workers consume these. After finishing with a task, a worker lets the manager know, and proceeds consuming the next one.
file processing method
This is quite inefficient:
lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
for line in lines:
...
readlines() reads the entire file before the list comprehension is evaluated. Only after that you again iterate through all lines. Hence, you iterate three times through your data. Combine everything into a single loop, so that you iterate the lines only once.

You should be using threads here. If you're blocked by cpu later, you can use processes.
To explain I first created a ten thousand files (0.txt ... 9999.txt), with a line count that's equivalent to the name (+1), using this command:
for i in `seq 0 999`; do for j in `seq 0 $i`; do echo $i >> $i.txt; done ; done
Next, I've created a python script using a ThreadPool with 10 threads to count the lines of all files that have an even value:
#!/usr/bin/env python
from multiprocessing.pool import ThreadPool
import time
import sys
print "creating %s threads" % sys.argv[1]
thread_pool = ThreadPool(int(sys.argv[1]))
files = ["%d.txt" % i for i in range(1000)]
def count_even_value_lines(filename):
with open(filename, 'r') as f:
# do some processing
line_count = 0
for line in f.readlines():
if int(line.strip()) % 2 == 0:
line_count += 1
print "finished file %s" % filename
return line_count
start = time.time()
print sum(thread_pool.map(count_even_value_lines, files))
total = time.time() - start
print total
As you can see this takes no time, and the results are correct. 10 files are processed in parallel and the cpu is fast enough to handle the results. If you want even more you may consider using threads and processes to utilize all cpus as well as not letting IO block you.
Edit:
As comments suggest, I was wrong and this is not I/O blocked, so you can speed it up using multiprocessing (cpu blocked). Because I used a ThreadPool which has the same interface as Pool you can make minimal edits and have the same code running:
#!/usr/bin/env python
import multiprocessing
import time
import sys
files = ["%d.txt" % i for i in range(2000)]
# function has to be defined before pool is opened and workers are forked
def count_even_value_lines(filename):
with open(filename, 'r') as f:
# do some processing
line_count = 0
for line in f:
if int(line.strip()) % 2 == 0:
line_count += 1
return line_count
print "creating %s processes" % sys.argv[1]
process_pool = multiprocessing.Pool(int(sys.argv[1]))
start = time.time()
print sum(process_pool.map(count_even_value_lines, files))
total = time.time() - start
print total
Results:
me#EliteBook-8470p:~/Desktop/tp$ python tp.py 1
creating 1 processes
25000000
21.2642059326
me#EliteBook-8470p:~/Desktop/tp$ python tp.py 10
creating 10 processes
25000000
12.4360249043

Aside from using parallel processing, your parse method is rather inefficient as #Jan-PhilipGehrcke already pointed out. To expand on his recommendation: The classical variant:
def parse(self, file_addr):
with open(file_addr, 'r') as f:
line_no = 0
for line in f:
line_no += 1
if line_no % 2 != 0:
packet = parse_packet(line)
if found_matched(packet):
# count
self.save_packet(packet)
Or using your style (assuming you use python 3):
def parse(self, file_addr):
with open(file_addr, 'r') as f:
filtered = (l for index,l in enumerate(f) if index % 2 != 0)
for line in filtered:
# and so on
The thing to notice here, is the use of iterators, all operations to build the filtered list (which is not actually a list!!) operate on and return iterators, which means that at no point the entire file is loaded into a list.

Chunking data from a large file for multiprocessing?

I'm trying to a parallelize an application using multiprocessing which takes in
a very large csv file (64MB to 500MB), does some work line by line, and then outputs a small, fixed size
file.
Currently I do a list(file_obj), which unfortunately is loaded entirely
into memory (I think) and I then I break that list up into n parts, n being the
number of processes I want to run. I then do a pool.map() on the broken up
lists.
This seems to have a really, really bad runtime in comparison to a single
threaded, just-open-the-file-and-iterate-over-it methodology. Can someone
suggest a better solution?
Additionally, I need to process the rows of the file in groups which preserve
the value of a certain column. These groups of rows can themselves be split up,
but no group should contain more than one value for this column.

list(file_obj) can require a lot of memory when fileobj is large. We can reduce that memory requirement by using itertools to pull out chunks of lines as we need them.
In particular, we can use
reader = csv.reader(f)
chunks = itertools.groupby(reader, keyfunc)
to split the file into processable chunks, and
groups = [list(chunk) for key, chunk in itertools.islice(chunks, num_chunks)]
result = pool.map(worker, groups)
to have the multiprocessing pool work on num_chunks chunks at a time.
By doing so, we need roughly only enough memory to hold a few (num_chunks) chunks in memory, instead of the whole file.
import multiprocessing as mp
import itertools
import time
import csv
def worker(chunk):
# `chunk` will be a list of CSV rows all with the same name column
# replace this with your real computation
# print(chunk)
return len(chunk)
def keyfunc(row):
# `row` is one row of the CSV file.
# replace this with the name column.
return row[0]
def main():
pool = mp.Pool()
largefile = 'test.dat'
num_chunks = 10
results = []
with open(largefile) as f:
reader = csv.reader(f)
chunks = itertools.groupby(reader, keyfunc)
while True:
# make a list of num_chunks chunks
groups = [list(chunk) for key, chunk in
itertools.islice(chunks, num_chunks)]
if groups:
result = pool.map(worker, groups)
results.extend(result)
else:
break
pool.close()
pool.join()
print(results)
if __name__ == '__main__':
main()

I would keep it simple. Have a single program open the file and read it line by line. You can choose how many files to split it into, open that many output files, and every line write to the next file. This will split the file into n equal parts. You can then run a Python program against each of the files in parallel.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.