What would make this code that combines some flat files run faster? - python

I'm new to Python and haven't done any optimization work yet. I'm attempting to take a bunch of files that are already pretty large themselves and combine them into one large file that I'd guess will wind up being close to 50-100 GB. More memory than I have, at any rate. I was given the code below and it works great for small files. When I try to run it over the actual files for my use case, it completely locks up my computer.
I understand that pandas is fast. I'm guessing that dataframes are stored in memory. If that's the case, then that's probably what's wrecking things here. Is there any kind of mechanism to spill to disk, or possibly to write to an existing file instead of trying to hold the whole thing in a dataframe before writing to disk? Or possibly another option that I didn't think of?
import pandas as pd
import os

file_masks = ['fhv', 'green', 'yellow']

def combine_files(file_mask):
    csvfiles = []
    for path, directories, files in os.walk('TaxiDriveData/'):
        csvfiles.extend([os.path.join(path, fn) for fn in files if fn.startswith(file_mask)])
    df = pd.concat((pd.read_csv(fn) for fn in csvfiles))
    df.to_csv(os.path.join('TaxiDriveCombinedData', file_mask + '_trip_data.csv'), index=False)

for m in file_masks:
    combine_files(m)
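For reference, the "write to an existing file" idea from the question is possible in pandas itself: read_csv accepts a chunksize argument and to_csv can append with mode='a', so only one chunk is ever held in memory. A rough sketch of that approach, reusing the directory layout above (the 100,000-row chunk size is an arbitrary assumption):

import os
import pandas as pd

file_masks = ['fhv', 'green', 'yellow']

def combine_files_chunked(file_mask, chunksize=100_000):
    out_path = os.path.join('TaxiDriveCombinedData', file_mask + '_trip_data.csv')
    first = True
    for path, directories, files in os.walk('TaxiDriveData/'):
        for fn in files:
            if not fn.startswith(file_mask):
                continue
            # stream each input file in chunks and append them to the output
            for chunk in pd.read_csv(os.path.join(path, fn), chunksize=chunksize):
                chunk.to_csv(out_path, mode='w' if first else 'a', header=first, index=False)
                first = False

for m in file_masks:
    combine_files_chunked(m)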

Here's a non-pandas solution that doesn't load everything into memory. I haven't tested it, but it should work.
import os

file_masks = ['fhv', 'green', 'yellow']

def combine_files(file_mask):
    with open(os.path.join('TaxiDriveCombinedData', file_mask + '_trip_data.csv'), 'w') as fout:
        csvfiles = []
        for path, directories, files in os.walk('TaxiDriveData/'):
            csvfiles.extend([os.path.join(path, fn) for fn in files if fn.startswith(file_mask)])
        for in_file in csvfiles:
            with open(in_file, 'r') as fin:
                # next(fin)  # uncomment this if you want to skip the header line of each file
                for line in fin:
                    fout.write(line)

for m in file_masks:
    combine_files(m)
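One practical wrinkle with plain concatenation of CSVs is the header row: if every input file starts with one, the combined file will contain repeated header lines. A small variation on the loop above (a sketch, assuming all input files share an identical header and taking the file list and output path as arguments) keeps only the first header:

def combine_files_one_header(csvfiles, out_path):
    with open(out_path, 'w') as fout:
        for i, in_file in enumerate(csvfiles):
            with open(in_file, 'r') as fin:
                header = fin.readline()
                if i == 0:
                    fout.write(header)  # write the header from the first file only
                for line in fin:
                    fout.write(line)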

You don't need Python to do that. There are plenty of tools on a Linux system that can join files and are optimized, or have parameters, to do this very efficiently: join, cat, dd...
This is not the most efficient option, but, for example:
cat input/*.csv > output/combined.csv
If you want a high-performance Python version, I recommend reading and writing the files in chunks instead of line by line.
Your biggest cost is I/O, and you can reduce it by reading and writing larger blocks at a time. If you read and write in a size that matches your drive and filesystem block size, you will notice the difference.
For example, a common block size for newer drives and filesystems is 4096 bytes (4 KiB).
You can try something like the following:
NEW_LINE = '\n'

def read_in_chunks(f, chunksize=4096):
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            break
        yield chunk

# (...)

fout = open('output.csv', 'w')
for fname in files:
    with open(fname) as fin:
        buffer = ''
        for chunk in read_in_chunks(fin):
            buffer += chunk
            if NEW_LINE not in buffer:
                continue  # no complete line yet, keep accumulating
            lines, tmp_buffer = buffer.rsplit(NEW_LINE, 1)
            lines += NEW_LINE  # rsplit removes the last new-line char; re-add it
            fout.write(lines)
            buffer = tmp_buffer
        if buffer:
            fout.write(buffer)  # flush any trailing partial line
fout.close()
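If the goal is simply a buffered byte-for-byte copy, the standard library already covers it: shutil.copyfileobj copies between file objects in chunks of a configurable size. A minimal sketch (assuming the same files list as above; the 1 MiB buffer is an arbitrary choice):

import shutil

with open('output.csv', 'wb') as fout:
    for fname in files:
        with open(fname, 'rb') as fin:
            # copy in 1 MiB blocks without decoding or splitting lines
            shutil.copyfileobj(fin, fout, length=1024 * 1024)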

Related

Parallel computing for loop with no last function

I'm trying to parallelize reading the contents of 16 gzip files with this script:
import gzip
import glob
from dask import delayed
from dask.distributed import Client, LocalCluster

@delayed
def get_gzip_delayed(gzip_file):
    with gzip.open(gzip_file) as f:
        reads = f.readlines()
        reads = [read.decode("utf-8") for read in reads]
    return reads

if __name__ == "__main__":
    cluster = LocalCluster()
    client = Client(cluster)

    read_files = glob.glob("*.txt.gz")
    all_files = []
    for file in read_files:
        reads = get_gzip_delayed(file)
        all_files.extend(reads)

    with open("all_reads.txt", "w") as f:
        w = delayed(all_files.writelines)(f)
        w.compute()
However, I get the following error:
> TypeError: Delayed objects of unspecified length are not iterable
How do I parallelize a for loop that uses extend/append and then write the results out to a file? All the dask examples include some final function performed on the product of the for loop.
The list all_files consists of delayed values, and calling delayed(f.writelines)(all_files) (note the different arguments relative to the code in the question) is not going to work for several reasons, the main one being that you prepare lazy instructions for writing but execute them only after closing the file.
There are different ways to solve this problem, at least two are:
if the data from the files fits into memory, then it's easiest to compute it and write to the file:
import dask

all_files, = dask.compute(all_files)  # dask.compute returns a tuple of results
with open("all_reads.txt", "w") as f:
    f.writelines(all_files)
if the data cannot fit into memory, then another option is to put the writing inside the get_gzip_delayed function, so data doesn't travel between worker and client:
from dask import delayed
from dask.distributed import Lock

@delayed
def get_gzip_delayed(gzip_file):
    with gzip.open(gzip_file) as f:
        reads = f.readlines()
    # create a lock to prevent others from writing at the same time
    with Lock("all_reads.txt"):
        with open("all_reads.txt", "a") as f:  # append mode: be careful, repeated runs keep adding to the file
            f.writelines([read.decode("utf-8") for read in reads])
Note that if memory is a severe constraint, then the above can also be refactored to process the files line-by-line (at the cost of slower IO).
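A sketch of that line-by-line refactoring (not taken from the answer above, just one way it could look, reusing the same all_reads.txt output and Lock name):

import gzip
from dask import delayed
from dask.distributed import Lock

@delayed
def gzip_to_output_linewise(gzip_file):
    with Lock("all_reads.txt"):
        # decode and append one line at a time so only a single line is held in memory
        with gzip.open(gzip_file, "rt", encoding="utf-8") as fin, \
                open("all_reads.txt", "a") as fout:
            for line in fin:
                fout.write(line)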

Python - multiprocessing with large strings slower

I am working on a textual analysis of a large sample of 10-K filings (about 150,000) and am desperately trying to speed up my program with multiprocessing. The relevant function loads the txt files, parses them with some regular expressions and saves them as "clean":
def plain_10k(f):
    input_text = open(ipath + "\\" + f, errors="ignore").read()
    # REGEXP
    output_file = open(opath + "\\" + f, "w", errors="ignore")
    output_file.write(input_text)
    output_file.close()
I try to perform this function over a list of file names as follows:
with Pool(processes=8) as pool, tqdm(total=len(files_10k)) as pbar:
    for d in pool.imap_unordered(plain_10k, files_10k):
        pbar.update()
Unfortunately, the program seems to be stuck: it is not returning anything (i.e. no clean txt files are being saved). Even with a small list of 10 files, nothing happens.
What is the problem here?
If it is relevant: the size of the input txt files ranges from 10 KB to 10 MB, with the majority being smaller than 1 MB.
I am quite new to Python, so the code above is the result of hours of googling and certainly not very good. I am happy about any comments and suggestions.
Thank you very much in advance!
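One common pitfall with code like this: multiprocessing's spawn start method (the default on Windows and macOS) re-imports the main module in every worker, so the Pool setup has to live under an if __name__ == "__main__": guard and plain_10k must be importable at module level. A minimal self-contained sketch under that constraint, with hypothetical ipath/opath directories and a placeholder regex standing in for the real clean-up:

import os
import re
from multiprocessing import Pool
from tqdm import tqdm

ipath = "raw_10k"    # hypothetical input directory
opath = "clean_10k"  # hypothetical output directory

def plain_10k(fname):
    with open(os.path.join(ipath, fname), errors="ignore") as fin:
        text = fin.read()
    text = re.sub(r"<[^>]+>", " ", text)  # placeholder for the real clean-up regexes
    with open(os.path.join(opath, fname), "w", errors="ignore") as fout:
        fout.write(text)

if __name__ == "__main__":
    files_10k = os.listdir(ipath)
    with Pool(processes=8) as pool, tqdm(total=len(files_10k)) as pbar:
        for _ in pool.imap_unordered(plain_10k, files_10k):
            pbar.update()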

How to downsample a .json file

I apologize if this is a very beginner-ish question, but I have a multivariate data set from Reddit (https://files.pushshift.io/reddit/submissions/), and the files are way too big. Is it possible to downsample one of these files down to 20% or less, and either save it as a new file (JSON or CSV) or read it directly as a pandas dataframe? Any help will be very much appreciated!
Here is my attempt thus far
import json
import pandas as pd

def load_json_df(filename, num_bytes=-1):
    '''Load the first `num_bytes` of the filename as a json blob, convert each line into a row in a Pandas data frame.'''
    fs = open(filename, encoding='utf-8')
    df = pd.DataFrame([json.loads(x) for x in fs.readlines(num_bytes)])
    fs.close()
    return df

january_df = load_json_df('RS_2019-01.json')
january_df.sample(frac=0.2)
However, this gave me a memory error while trying to open it. Is there a way to downsample the file without having to open the entire thing?
The problem is that it is not possible to determine exactly what 20% of the data is without first reading the entire file; only then can you get an idea of what a 20% sample would look like.
Reading a large file into memory all at once generally throws this kind of error. You can avoid it by reading the file line by line with the code below:
import json

data = []
counter = 0
with open('file') as f:
    for line in f:
        data.append(json.loads(line))
        counter += 1
You should then be able to do this:
df = pd.DataFrame(data)  # you can take a slice here, e.g. data[:counter // 5], if you only want 20%
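If the 20% should be a random subset rather than simply the first fifth, one option (a sketch using the standard random module) is:

import random

random.seed(0)  # for reproducibility; drop if not needed
sample = random.sample(data, k=len(data) // 5)  # pick 20% of the parsed records at random
df = pd.DataFrame(sample)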
I downloaded the first of the files, i.e. https://files.pushshift.io/reddit/submissions/RS_2011-01.bz2, decompressed it and looked at the contents. As it happens, it is not a single JSON document but rather JSON Lines: a series of JSON objects, one per line (see http://jsonlines.org/). This means you can cut out as many lines as you want, using any tool you like (for example, a text editor). Or you can process the file sequentially in your Python script, taking only every fifth line into account, like this:
import json

with open('RS_2019-01.json', 'r') as infile:
    for i, line in enumerate(infile):
        if i % 5 == 0:
            j = json.loads(line)
            # process the data here
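Building on that, here is a sketch that streams the file and writes roughly 20% of its lines to a new, smaller JSON Lines file without ever holding the full data set in memory (the output filename is made up for illustration):

import random

random.seed(0)  # make the sample reproducible; drop if not needed
with open('RS_2019-01.json', 'r') as infile, \
        open('RS_2019-01_sampled.json', 'w') as outfile:
    for line in infile:
        if random.random() < 0.2:  # keep each line with 20% probability
            outfile.write(line)

The sampled file can then be loaded in one go with pd.read_json('RS_2019-01_sampled.json', lines=True).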

Read Large Gzip Files in Python

I am trying to read a gzip file (around 150 MB in size) using this script (which I know is badly written):
import gzip

f_name = 'file.gz'
a = []

with gzip.open(f_name, 'r') as infile:
    for line in infile:
        a.append(line.split(' '))

new_array1 = []
for l in a:
    for i in l:
        if i.startswith('/bin/movie/tribune'):
            new_array1.append(l)

filtered = []
for q in range(0, len(new_array1)):
    filtered.append(new_array1[q])
# at this point the filtered array can be printed
The problem is that I am able to read files up to 50 MB into an array using this technique, but file sizes of 80 MB and above are not readable. Is there a problem with the technique I am using, or is there a memory constraint? If it is the latter, what would be the best technique to read a large gz file (above 100 MB) into a Python array? Any help will be appreciated.
Note: I am not using NumPy because I ran into some serious issues with the C compilers on my server, which are required for NumPy, so I am not able to have it. Please suggest something that uses a native Python approach (or anything other than NumPy). Thanks.
My guess is that the problem is constructing the list a in your code, as it will undoubtedly contain a massive number of entries if your .gz file is that large. This modification should solve that problem:
import gzip

f_name = 'file.gz'
filtered = []

with gzip.open(f_name, 'r') as infile:
    for line in infile:
        for i in line.split(' '):
            if i.startswith('/bin/movie/tribune'):
                filtered.append(line)
                break  # to avoid duplicates
If your problem is memory consumption (you didn't include the error message...), you can save a lot of memory by avoiding the temporary lists and using generators instead.
For example:
import gzip

f_name = 'file.gz'

def get_lines(infile):
    for line in infile:
        yield line.split()

def filter1(line_tokens):
    return any(token.startswith('/bin/movie/tribune') for token in line_tokens)

def filter2(line_tokens):
    # was there a filter2?
    return True

infile = gzip.open(f_name, 'r')
filtered = (line_tokens for line_tokens in get_lines(infile)
            if filter1(line_tokens) and filter2(line_tokens))

for line in filtered:
    print(line)
In my example filter2 is trivial, because it seems your filtered list is just an (un-filtered) copy of new_array1...
This way, you avoid storing the entire content in memory. Note that since filtered is a generator, you can only iterate over it once. If you do need to store it entirely, do filtered = list(filtered).
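One detail worth noting if you run either version on Python 3: gzip.open(f_name, 'r') yields bytes, so calling startswith with a str prefix raises a TypeError. Opening the file in text mode avoids this; a minimal sketch of the same filter in that style:

import gzip

f_name = 'file.gz'

with gzip.open(f_name, 'rt', encoding='utf-8') as infile:  # 'rt' decodes each line to str
    filtered = [line for line in infile
                if any(tok.startswith('/bin/movie/tribune') for tok in line.split())]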

Writing multiple sound files into a single file in python

I have three sound files, for example a.wav, b.wav and c.wav. I want to write them into a single file, for example all.xmv (the extension could be different too), and when I need one of them I want to extract it and play it (for example, extract a.wav from all.xmv and play it).
How can I do this in Python? I have heard that Delphi has a function named blockwrite that does what I want. Is there a function in Python like Delphi's blockwrite, or how else can I write these files and play them back?
Would standard tar/zip files work for you?
http://docs.python.org/library/zipfile.html
http://docs.python.org/library/tarfile.html
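As an illustration of the archive route, a sketch using the standard zipfile module (file names follow the question; how you play the extracted wav afterwards is up to whatever audio library you use):

import zipfile

# pack the three sound files into one archive
with zipfile.ZipFile('all.xmv', 'w') as archive:
    for name in ('a.wav', 'b.wav', 'c.wav'):
        archive.write(name)

# later: pull a single file back out, then play it with your audio library of choice
with zipfile.ZipFile('all.xmv') as archive:
    archive.extract('a.wav', path='extracted')  # writes extracted/a.wav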
If the archive idea (which is, by the way, the best answer to your question) doesn't suit you, you can fuse the data from several files into one file, e.g. by writing consecutive blocks of binary data (thus creating an uncompressed archive!).
Let paths be a list of the files that should be concatenated:
import io
import os

offsets = []  # cumulative end offsets, kept for later file navigation
last_offset = 0
fout = io.FileIO(out_path, 'w')
for path in paths:
    f = io.FileIO(path)  # stream IO
    fout.write(f.read())
    f.close()
    last_offset += os.path.getsize(path)
    offsets.append(last_offset)
fout.close()
# Pseudo: write the offsets to a separate file, e.g. by pickling
# ...

# Reading the data back, given that the offsets[] list is available
file_ID = 10  # e.g. you need to read the 10th file (1-based)
f = io.FileIO(out_path)
start = offsets[file_ID - 2] if file_ID > 1 else 0  # end of the previous file
f.seek(start)                                       # seek to the required position
read_size = offsets[file_ID - 1] - start            # size of the requested file
data = f.read(read_size)                            # here we are!
f.close()
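The pickling step that is left as pseudo-code above could look like this (a minimal sketch with the standard pickle module; the index filename is made up):

import pickle

# after building the combined file
with open('all_files.index', 'wb') as idx:
    pickle.dump(offsets, idx)

# before reading a single file back
with open('all_files.index', 'rb') as idx:
    offsets = pickle.load(idx)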
