Efficient way to read a lot of text files using python

I have about 20,000 documents in subdirectories, and I would like to read them all and append them as one list of lists. This is my code so far:
import os
import numpy as np

topics = os.listdir(my_directory)
df = []
for topic in topics:
    files = os.listdir(my_directory + '/' + topic)
    print(files)
    for file in files:
        print(file)
        f = open(my_directory + '/' + topic + '/' + file, 'r', encoding='latin1')
        data = f.read().replace('\n', ' ')
        print(data)
        f.close()
    df = np.append(df, data)
However, this is inefficient, and it takes a long time to read the files and append them to the df list. My expected output is:
df = [[doc1], [doc2], [doc3], [doc4],......,[doc20000]]
I ran the above code and it took more than 6 hours and was still not finished (it had probably gotten through about half of the documents). How can I change the code to make it faster?

There is only so much you can do to speed disk access. You can use threads to overlap some file read operations with the latin1 decode and newline replacement. But realistically, it won't make a huge difference.
import multiprocessing.pool
import os
import numpy as np

MEG = 2**20

filelist = []
topics = os.listdir(my_directory)
for topic in topics:
    files = os.listdir(my_directory + '/' + topic)
    print(files)
    for file in files:
        print(file)
        filelist.append(my_directory + '/' + topic + '/' + file)

def worker(filename):
    # open() takes "buffering", not "bufsize"
    with open(filename, encoding='latin1', buffering=1*MEG) as f:
        data = f.read().replace('\n', ' ')
        #print(data)
        return data

with multiprocessing.pool.ThreadPool() as pool:
    datalist = pool.map(worker, filelist, chunksize=1)

df = np.array(datalist)

Generator functions allow you to declare a function that behaves like an iterator, i.e. it can be used in a for loop. Here is a lazy function (generator) that reads a file piece by piece:
def read_in_chunks(file, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data

with open('big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
import pandas as pd

class Reader(object):
    def __init__(self, g):
        self.g = g
    def read(self, n=0):
        try:
            return next(self.g)
        except StopIteration:
            return ''

# read_in_chunks() needs an open file object, e.g. the big file from above
df = pd.concat(list(pd.read_csv(Reader(read_in_chunks(open('big_file.dat'))), chunksize=10000)), axis=1)
df.to_csv("output.csv", index=False)

Note
I misread the line df = np.append(df, data) and assumed you were appending to a DataFrame, not to a numpy array. So my comment is somewhat irrelevant, but I am leaving it for others who may misread it like I did, or who have a similar problem with pandas' DataFrame append.
Actual Problem
It looks like your question, as asked, may not address your actual problem.
Have you measured the performance of your two most important calls?
files = os.listdir(my_directory + '/' + topic)
df = np.append(df, data)
The way you formatted your code makes me think there is a bug: df = np.append(df, data) is outside the file's for loop scope, so I think only your last data is appended to your data frame. In case that's just a problem with code formatting here in the post and you really do append 20k files to your data frame, then this may be the problem: appending to a DataFrame is slow.
Potential Solution
As usual slow performance can be tackled by throwing more memory at the problem. If you have enough memory to load all of the files beforehand and only then insert them in a DataFrame this could prove to be faster.
The key is to not deal with any pandas operation until you have loaded all the data. Only then should you use DataFrame's from_records or one of its other factory methods.
A nice SO question I found that has a little more discussion:
Improve Row Append Performance On Pandas DataFrames
TL;DR
Measure the time to read all the files without dealing with pandas at all.
If it proves to be much, much faster and you have enough memory to load all the files' contents at once, use another way to construct your DataFrame, e.g. DataFrame.from_records.
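A minimal sketch of that from_records route (assuming filelist is the flat list of file paths built as in the threaded answer above, and that all contents fit in memory):
import pandas as pd

docs = []
for path in filelist:
    with open(path, encoding='latin1') as f:
        docs.append(f.read().replace('\n', ' '))

# build the DataFrame once, after all the reads are done - no per-file append
df = pd.DataFrame.from_records([(doc,) for doc in docs], columns=['text'])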

Related

How to append chunks of 2D numpy array to binary file as the chunks are created?

I have a large input file which consists of data frames (a data series (complex64), with an identifying header in each frame). It is larger than my available memory. The headers repeat, but are randomly ordered, so for example the input file could look like:
<FRAME header={0}, data={**first** 500 numbers...}>,
<FRAME header={18}, data={first 500 numbers...}>,
<FRAME header={4}, data={first 500 numbers...}>,
<FRAME header={0}, data={**next** 500 numbers...}>
...
I want to order the data into a new file that is a numpy array of shape (len(headers), len(data_series)). It has to build the output file as it reads the frames, because I can't fit it all in memory.
I've looked at numpy.savetxt and the python csv package but for disk size, precision, and speed reasons I would prefer for the output file to be binary. numpy.save is good except that I can't figure out how to make it append to an unknown array size.
I have to work in Python2.7 because of some dependencies needed to read these frames. What I have done so far is made a function able to write all of the frames with a matching header to a single binary file:
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
with open("singleFrameHeader", 'ab') as f:
    current_data = input_data.readFrame() # This loads the next frame in the file
    if current_data.header == 0:
        float_arr = np.array(current_data.data).view(float)
        float_arr.tofile(f)
This works great, but now I need to extend it to be two-dimensional. I'm starting to look at h5py as an option, but was hoping there is a simpler solution.
What would be great is something like
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
with open("bigMatrix", 'ab') as f:
    current_data = input_data.readFrame() # This loads the next frame in the file
    index = current_data.header
    float_arr = np.array(current_data.data).view(float)
    float_arr.tofile(f, index)
Any help is appreciated. I thought this would be a more common use-case to read and write to a 2D binary file in append mode.
You have two problems: one is that a file contains sequential data, and the other is that numpy binary files don't store shape information.
A simple way to start solving this would be to carry through with your initial idea of converting the data into files by header, then combining all the binary files into one large product (if you still feel the need to do so).
You could maintain a map of the headers you've found so far to their output files, data size, etc. This will allow you to combine the data more intelligently, if for example, there are missing chunks or headers or something.
from contextlib import ExitStack
from os import remove
from tempfile import NamedTemporaryFile
from shutil import copyfileobj
import sys
import numpy as np

class Header:
    __slots__ = ('id', 'count', 'file', 'name')
    def __init__(self, id):
        self.id = id
        self.count = 0
        self.file = NamedTemporaryFile(delete=False)
        self.name = self.file.name
    def write_frame(self, frame):
        data = np.array(frame.data).view(float)
        self.count += data.size
        data.tofile(self.file)

input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
file_map = {}
with ExitStack() as stack:
    while True:
        frame = input_data.next_frame()
        if frame is None:
            break  # recast this loop as necessary
        if frame.header not in file_map:
            header = Header(frame.header)
            stack.enter_context(header.file)
            file_map[frame.header] = header
        else:
            header = file_map[frame.header]
        header.write_frame(frame)

max_header = max(file_map)
max_count = max(h.count for h in file_map.values())
with open('singleFrameHeader', 'wb') as output:
    output.write(max_header.to_bytes(8, sys.byteorder))
    output.write(max_count.to_bytes(8, sys.byteorder))
    for i in range(max_header + 1):
        if i in file_map:
            h = file_map[i]
            with open(h.name, 'rb') as input:
                copyfileobj(input, output)
            remove(h.name)
            if h.count < max_count:
                np.full(max_count - h.count, np.nan, dtype=float).tofile(output)
        else:
            np.full(max_count, np.nan, dtype=float).tofile(output)
The first 16 bytes are two int64 values: the highest header index and the number of elements per header, respectively. Keep in mind that the file is in native byte order, whatever that may be, and is therefore not portable.
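For completeness, a minimal sketch of reading such a file back (an assumption on my part, based on the layout described above: two native-order int64 values followed by max_count float64 values per header index 0..max_header):
import sys
import numpy as np

with open('singleFrameHeader', 'rb') as f:
    max_header = int.from_bytes(f.read(8), sys.byteorder)  # highest header index
    max_count = int.from_bytes(f.read(8), sys.byteorder)   # float64 elements per header row
    matrix = np.fromfile(f, dtype=np.float64).reshape(max_header + 1, max_count)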
Alternative
If (and only if) you know the exact size of a header dataset ahead of time, you can do this in one pass, with no temporary files. It also helps if the headers are contiguous. Otherwise, missing swaths will be zero-filled. You will still need to maintain a dictionary of your current position within a header, but you will no longer have to keep a separate file pointer around for each one. All-in-all, this is a much better alternative than the original solution, if your use-case allows it:
header_size = 500 * N  # You must know this up front

input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
header_map = {}
with open('singleFrameHeader', 'wb') as output:
    output.write(max_header.to_bytes(8, sys.byteorder))  # assumed known up front as well
    output.write(max_count.to_bytes(8, sys.byteorder))
    while True:
        frame = input_data.next_frame()
        if frame is None:
            break
        if frame.header not in header_map:
            header_map[frame.header] = 0
        data = np.array(frame.data).view(float)
        output.seek(16 + frame.header * header_size + header_map[frame.header])
        data.tofile(output)
        header_map[frame.header] += data.size * data.dtype.itemsize
I asked a question regarding this sort of out-of-order write pattern as a consequence of this answer: What happens when you seek past the end of a file opened for writing?

Reading specific chunks pandas / not reading all chunks in pandas

I am trying to read a large CSV file in chunks and process it, following this question and answer. Since I'm not a native Python user, I have an optimization problem and am looking for a better solution here.
What my code does:
I read in the line count of my csv with
with open(file) as f:
    row_count = sum(1 for line in f)
Afterwards, I "slice" my data into 30 equal-sized chunks and process it according to the linked answer, with a for loop and pd.read_csv(file, chunksize). Since plotting 30 graphs in one figure is pretty unclear, I plot only every 5th chunk using a modulo check (the step may be varied). For this I use an external counter.
chunksize = row_count // 30
counter = 0
for chunk in pd.read_csv(file, chunksize=chunksize):
    df = chunk
    print(counter)
    if (counter % 5) == 0 or counter == 0:
        plt.plot(df["Variable"])
    counter = counter + 1
plt.show()
Now to my question:
It seems like this loop reads in the chunk before processing the loop body, which is reasonable. I can see this, since the print(counter) steps are also fairly slow. Since I read a few million rows of CSV, every step takes some time. Is there a way to skip the unwanted chunks in the for loop before reading them in? I was trying out something like:
wanted_plts = [1, 5, 10, 15, 20, 25, 30]
for i in wanted_plts:
    for chunk[i] in pd.read_csv(file, chunksize=chunksize):
        .
        .
I think I have trouble understanding how I can manipulate the syntax of the for loop here. There should be an elegant way to fix this.
Also: I found pandas' .get_chunk(x), but this seems to create just one chunk of size x.
Another attempt was to subset the reader object of pd.read_csv, like pd.read_csv()[0,1,2], but it seems that's not possible either.
Amendment: I'm aware that plotting a lot of data in matplotlib is slow. I preprocess it earlier, but to keep this code readable I removed all unnecessary parts.
You are wasting a lot of resources parsing CSV into a DataFrame without using it. To avoid this, you can create a line index during the first pass:
fp = open(file_name)
row_count = 0
pos = {0: 0}
line = fp.readline()
while line:
    row_count += 1
    pos[row_count] = fp.tell()
    line = fp.readline()
Do not dispose of the file handle yet! Because read_csv() accepts streams, you can move your file pointer as you want:
chunksize = row_count // 30
wanted_plts = [1, 5, 10, 15, 20, 25, 30]
for i in wanted_plts:
    fp.seek(pos[i*chunksize])  # this will bring you to the first line of the desired chunk
    obj = pd.read_csv(fp, chunksize=chunksize)  # read your chunk lazily
    df = obj.get_chunk()  # convert to DataFrame object
    plt.plot(df["Variable"])  # do something
fp.close()  # Don't forget to close the file when finished.
And finally a warning: when reading CSV this way you will lose column names. So make an adjustment:
obj = pd.read_csv(fp, chunksize=chunksize, names=[!!<column names you have>!!])
P.S. file shadows a built-in name (in Python 2), so avoid using it to prevent undesired side effects. You can use file_ or file_name instead.
I've toyed with your setup, trying to find a way to skip chunks, using another rendering library like pyqtgraph or using matplotlib.pyplot subroutines instead of plot(), all to no avail.
So the only fair advice I can give you is to limit the scope of read_csv to only the data you're interested in by passing the usecols parameter.
Instead of:
for chunk in pd.read_csv(file, chunksize=chunksize):
    plt.plot(chunk['Variable'])
Use:
for chunk in pd.read_csv(file, usecols=['Variable'], chunksize=chunksize):
    plt.plot(chunk)
And, if you haven't already, definitely limit the number of iterations by going for the biggest chunksize you possibly can (so in your case the lowest row_count divider).
I haven't quantified their respective weights, but you will gain on both the read_csv() and the plot() method overheads, even if only slightly, given that your current chunks are already quite big.
With my test data, quadrupling the chunksize cuts down processing time in half:
chunksize=1000 => executed in 12.7s
chunksize=2000 => executed in 9.06s
chunksize=3000 => executed in 7.68s
chunksize=4000 => executed in 6.94s
And specifying usecols at read time also cuts down processing time in half again:
chunksize=1000 + usecols=['Variable'] => executed in 8.33s
chunksize=2000 + usecols=['Variable'] => executed in 5.27s
chunksize=3000 + usecols=['Variable'] => executed in 4.39s
chunksize=4000 + usecols=['Variable'] => executed in 3.54s
As far as I know, pandas does not provide any support for skipping chunks of file. At least I never found anything about it in the documentation.
In general, skipping lines from file (not reading them at all) is difficult unless you know in advance how many lines you want to skip and how many characters you have in each of those lines. In this case you can try to play with IO and seek to move the stream position to the exact place you need the next iteration.
But it does not seem your case.
I think the best thing you can do to improve efficiency is to read the lines using standard IO, and convert to a dataframe only the lines you need / want to plot.
Consider for example the following custom iterator.
When instantiated, it saves the header (first line). Each iteration it reads a chunk of lines from the file and then skips the following n*chunksize lines. It returns the header line followed by the read lines, wrapped in an io.StringIO object (so it's a stream and can be fed directly to pandas.read_csv).
import io
from itertools import islice

class DfReaderChunks:
    def __init__(self, filename, chunksize, n):
        self.fo = open(filename)
        self.chs = chunksize
        self.skiplines = self.chs * n
        self.header = next(self.fo)
    def getchunk(self):
        ll = list(islice(self.fo, self.chs))
        if len(ll) == 0:
            raise StopIteration
        dd = list(islice(self.fo, self.skiplines))  # read and discard the lines to skip
        return self.header + ''.join(ll)
    def __iter__(self):
        return self
    def __next__(self):
        return io.StringIO(self.getchunk())
    def close(self):
        self.fo.close()
    def __del__(self):
        self.fo.close()
Using this class, you can read from your file:
reader = DfReaderChunks(file, chunksize, 4)
for dfst in reader:
    df = pd.read_csv(dfst)
    print(df)  # here I print to stdout, you can plot
reader.close()
which is "equivalent" to your setup:
for chunk in pd.read_csv(file, chunksize=chunksize):
    df = chunk
    if (counter % 5 == 0):
        print(df)  # again I print, you can plot
    counter += 1
I tested the time used by both of the above snippets using a dataframe of 39 MB (100,000 rows of random numbers).
On my machine, the former takes 0.458 seconds, the latter 0.821 seconds.
The only drawback is that the former snippet loses track of the row index (it's a new dataframe each time, so the index always starts from 0), but the printed chunks are the same.

Only read certain rows in a csv file with python

I want to read only a certain number of rows starting from a certain row in a CSV file, without iterating over the whole file to reach that point.
Let's say I have a CSV file with 100 rows and I want to read only rows 50 to 60. I don't want to iterate from row 1 to 49 to reach row 50 and start reading. Can I somehow achieve this with seek()?
For example:
Seek to row 50
read from 50 to 60
next time:
seek to row 27
read 27 to 34
and so on
So not only seeking continuously forward through the file, but also backwards.
Thank you a lot
An option would be to use Pandas. For example:
import pandas as pd
# Select file
infile = r'path/file'
# Use skiprows to choose starting point and nrows to choose number of rows
data = pd.read_csv(infile, skiprows = 50, nrows=10)
You can use chunksize
import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
If the # of columns/line lengths are variable, it isn't possible to find the line you want without "reading" (ie, processing) every character of the file that comes before that, and counting the line terminators. And the fastest way to process them in python, is to use iteration.
As to the fastest way to do that with a large file, I do not know whether it is faster to iterate by line this way:
with open(file_name) as f:
    for line, _ in zip(f, range(50)):
        pass
    lines = [line for line, _ in zip(f, range(10))]
...or to read a character at a time using seek, and count new line characters. But it is certainly MUCH more convenient to do the first.
However if the file gets read a lot, iterating over the lines will be slow over time. If the file contents do not change, you could instead accomplish this by reading the whole thing once and building a dict of the line lengths ahead of time:
from itertools import accumulate

with open(file_name) as f:
    cum_lens = dict(enumerate(accumulate(len(line) for line in f), 1))
This would allow you to seek to any line number in the file without processing the whole thing ever again:
def seek_line(path, line_num, cum_lens):
    with open(path) as f:
        f.seek(cum_lens[line_num], 0)
        return f.readline()

class LineX:
    """A file reading object that can quickly obtain any line number."""
    def __init__(self, path, cum_lens):
        self.cum_lens = cum_lens
        self.path = path
    def __getitem__(self, i):
        return seek_line(self.path, i, self.cum_lens)

linex = LineX(file_name, cum_lens)
line50 = linex[50]
But at this point, you might be better off loading the file contents into some kind of database. It depends on what you're trying to do, and what kind of data the file contains.
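If the database route appeals, here is a minimal sketch with the standard-library sqlite3 module (assuming file_name is the text file from above; the 'lines.db' name and the one-text-column schema are just placeholders):
import sqlite3

conn = sqlite3.connect('lines.db')
conn.execute('CREATE TABLE IF NOT EXISTS lines (lineno INTEGER PRIMARY KEY, text TEXT)')
with open(file_name) as f:
    conn.executemany('INSERT OR REPLACE INTO lines VALUES (?, ?)',
                     ((i, line.rstrip('\n')) for i, line in enumerate(f, 1)))
conn.commit()

# any row range can now be fetched without touching the rest of the file
rows = [r[0] for r in conn.execute('SELECT text FROM lines WHERE lineno BETWEEN 50 AND 60')]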
As others are saying, the most obvious solution is to use pandas read_csv!
The method has a parameter called skiprows.
From the docs, here is what is said:
skiprows : list-like, int or callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
You can have something like this:
import pandas as pd
data = pd.read_csv('path/to/your/file', skiprows=lambda x: x not in range(50, 60))
Since you specify that memory is your problem, you can use the chunksize parameter, as explained in this tutorial.
He said:
The parameter essentially means the number of rows to be read into a
dataframe at any single time in order to fit into the local memory.
Since the data consists of more than 70 millions of rows, I specified
the chunksize as 1 million rows each time that broke the large data
set into many smaller pieces.
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
You can try this and iterate over the chunks to retrieve only the rows you are looking for (see the sketch below).
The function should return true if the row number is in the specified list
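A minimal sketch of that chunk iteration (the file path, the 'Variable' column and the wanted chunk numbers are placeholders taken from the question; note that pandas still has to parse every chunk, you only skip the plotting work):
import pandas as pd
import matplotlib.pyplot as plt

wanted_plts = {1, 5, 10, 15, 20, 25, 30}
for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=1000000), start=1):
    if i in wanted_plts:            # only plot the chunks you care about
        plt.plot(chunk['Variable'])
plt.show()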
It's that easy:
with open("file.csv", "r") as file:
print(file.readlines()[50:60])

concatenating csv files nicely with python

My program first clusters a big dataset into 100 clusters, then runs a model on each cluster using multiprocessing. My goal is to concatenate all the output values into one big CSV file, which is the concatenation of all output data from the 100 fitted models.
For now, I am just creating 100 CSV files, then looping over the folder containing these files and copying them one by one, line by line, into a big file.
My question: is there a smarter method to get this big output file without exporting 100 files? I use pandas and scikit-learn for data processing, and multiprocessing for parallelization.
Have your processing threads return the dataset to the main process rather than writing the CSV files themselves; then, as they give data back to your main process, have it write them to one continuous CSV.
from multiprocessing import Process, Manager

def worker_func(proc_id, results):
    # Do your thing
    results[proc_id] = ["your dataset from %s" % proc_id]

def convert_dataset_to_csv(dataset):
    # Placeholder example. I realize what its doing is ridiculous
    converted_dataset = [','.join(data.split()) for data in dataset]
    return converted_dataset

m = Manager()
d_results = m.dict()
worker_count = 100

jobs = [Process(target=worker_func,
                args=(proc_id, d_results))
        for proc_id in range(worker_count)]

for j in jobs:
    j.start()
for j in jobs:
    j.join()

with open('somecsv.csv', 'w') as f:
    for d in d_results.values():
        # if the actual conversion function benefits from multiprocessing,
        # you can do that there too instead of here
        for r in convert_dataset_to_csv(d):
            f.write(r + '\n')
If all of your partial csv files have no headers and share column number and order, you can concatenate them like this:
with open("unified.csv", "w") as unified_csv_file:
for partial_csv_name in partial_csv_names:
with open(partial_csv_name) as partial_csv_file:
unified_csv_file.write(partial_csv_file.read())
Pinched the guts of this from http://computer-programming-forum.com/56-python/b7650ebd401d958c.htm it's a gem.
#!/usr/bin/python
# -*- coding: utf-8 -*-
from glob import glob

n = 1
file_list = glob('/home/rolf/*.csv')
concat_file = open('concatenated.csv', 'w')
files = map(lambda f: open(f, 'r').read, file_list)
print "There are {x} files to be concatenated".format(x=len(files))
for f in files:
    print "files added {n}".format(n=n)
    concat_file.write(f())
    n += 1
concat_file.close()

How to Compare 2 very large matrices using Python

I have an interesting problem.
I have a very large (larger than 300MB, more than 10,000,000 lines/rows in the file) CSV file with time series data points inside. Every month I get a new CSV file that is almost the same as the previous file, except for a few new lines have been added and/or removed and perhaps a couple of lines have been modified.
I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.
The issue is that the file is very large, so I need a solution that can handle the large file size and execute efficiently within a reasonable time, the faster the better.
Example of what a file and its new file might look like:
Old file
A,2008-01-01,23
A,2008-02-01,45
B,2008-01-01,56
B,2008-02-01,60
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,9
etc...
New file
A,2008-01-01,23
A,2008-02-01,45
A,2008-03-01,67 (added)
B,2008-01-01,56
B,2008-03-01,33 (removed and added)
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,22 (modified)
etc...
Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTables. Any ideas on how to solve this problem would be greatly appreciated.
Like this.
Step 1. Sort.
Step 2. Read each file, doing line-by-line comparison. Write differences to another file.
You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html
Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.
And 10,000,000 lines hardly matters. It's not that big. Two 300Mb files easily fit into memory.
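For illustration, a minimal sketch of the sort-then-diff idea with difflib (the file names here are placeholders; a modified row simply shows up as one removed line plus one added line):
import difflib

with open('old.csv') as f_old, open('new.csv') as f_new:
    old_lines = sorted(f_old)   # sort so that matching rows line up
    new_lines = sorted(f_new)

for line in difflib.unified_diff(old_lines, new_lines, fromfile='old.csv', tofile='new.csv'):
    # '-' lines exist only in the old file, '+' lines only in the new one
    if line.startswith(('-', '+')) and not line.startswith(('---', '+++')):
        print(line, end='')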
This is a little bit of a naive implementation but will deal with unsorted data:
import csv

file1_dict = {}
file2_dict = {}
with open('file1.csv') as handle:
    for row in csv.reader(handle):
        file1_dict[tuple(row[:2])] = tuple(row[2:])
with open('file2.csv') as handle:
    for row in csv.reader(handle):
        file2_dict[tuple(row[:2])] = tuple(row[2:])
with open('outfile.csv', 'w') as handle:
    writer = csv.writer(handle)
    for key, val in file1_dict.iteritems():
        if key in file2_dict:
            # deal with keys that are in both
            if file2_dict[key] == val:
                writer.writerow(key + val + ('Same',))
            else:
                writer.writerow(key + file2_dict[key] + ('Modified',))
            file2_dict.pop(key)
        else:
            writer.writerow(key + val + ('Removed',))
    # deal with added keys!
    for key, val in file2_dict.iteritems():
        writer.writerow(key + val + ('Added',))
You probably won't be able to "drop in" this solution, but it should get you ~95% of the way there. @S.Lott is right, two 300 MB files will easily fit in memory... if your files get into the 1-2 GB range then this may have to be modified with the assumption of sorted data.
Something like this is close... although you may have to change the comparisons around for the added and modified cases to make sense:
# assuming both files are sorted by columns 1 and 2
import csv
import datetime
from itertools import imap

def str2date(s):  # "in" is a reserved word, so the parameter is renamed
    return datetime.date(*map(int, s.split('-')))

def convert_tups(row):
    key = (row[0], str2date(row[1]))
    val = tuple(row[2:])
    return key, val

with open('file1.csv') as handle1:
    with open('file2.csv') as handle2:
        with open('outfile.csv', 'w') as outhandle:
            writer = csv.writer(outhandle)
            gen1 = imap(convert_tups, csv.reader(handle1))
            gen2 = imap(convert_tups, csv.reader(handle2))
            gen2key, gen2val = gen2.next()
            for gen1key, gen1val in gen1:
                if gen1key == gen2key and gen1val == gen2val:
                    writer.writerow(gen1key + gen1val + ('Same',))
                    gen2key, gen2val = gen2.next()
                elif gen1key == gen2key and gen1val != gen2val:
                    writer.writerow(gen2key + gen2val + ('Modified',))
                    gen2key, gen2val = gen2.next()
                elif gen1key > gen2key:
                    while gen1key > gen2key:
                        writer.writerow(gen2key + gen2val + ('Added',))
                        gen2key, gen2val = gen2.next()
                else:
                    writer.writerow(gen1key + gen1val + ('Removed',))
