Read Large Gzip Files in Python

I am trying to read a gzip file (around 150 MB) using this script (which I know is badly written):
import gzip

f_name = 'file.gz'
a = []
with gzip.open(f_name, 'r') as infile:
    for line in infile:
        a.append(line.split(' '))

new_array1 = []
for l in a:
    for i in l:
        if i.startswith('/bin/movie/tribune'):
            new_array1.append(l)

filtered = []
for q in range(0, len(new_array1)):
    filtered.append(new_array1[q])
# at this point the filtered array can be printed
The problem is that I am able to read files up to 50 MB into an array with this technique, but files of 80 MB and above are not readable. Is there some problem with the technique I am using, or is there a memory constraint? If it is the second case, what would be the best technique to read a large gz file (above 100 MB) into a Python array? Any help will be appreciated.
Note: I am not using NumPy because I ran into some serious issues with the C compilers required for NumPy on my server, so I am not able to have it. Please suggest something that uses a native Pythonic approach (or anything other than NumPy). Thanks.

My guess is that the problem is constructing the list a in your code, as it will undoubtedly contain a massive number of entries if your .gz file is that large. This modification should solve that problem:
import gzip

f_name = 'file.gz'
filtered = []
with gzip.open(f_name, 'r') as infile:
    for line in infile:
        for i in line.split(' '):
            if i.startswith('/bin/movie/tribune'):
                filtered.append(line)
                break  # to avoid duplicates

If your problem is memory consumption (you didn't include the error message...), you can save a lot of memory by using generators and avoiding the temporary lists altogether.
E.g.
import gzip

f_name = 'file.gz'

def get_lines(infile):
    for line in infile:
        yield line.split()

def filter1(line_tokens):
    return any(token.startswith('/bin/movie/tribune') for token in line_tokens)

def filter2(line_tokens):
    # was there a filter2?
    return True

infile = gzip.open(f_name, 'r')
filtered = (line_tokens for line_tokens in get_lines(infile) if filter1(line_tokens) and filter2(line_tokens))
for line in filtered:
    print line
In my example filter2 is trivial, because it seems your filtered list is just an (unfiltered) copy of new_array1...
This way, you avoid storing the entire content in memory. Note that since filtered is a generator, you can only iterate over it once. If you do need to store it all, do filtered = list(filtered).
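If the end goal is just to persist the matching lines rather than keep them in a Python list at all, a minimal sketch (the output path filtered.txt is a hypothetical name, and 'rt' mode is used so lines come back as text) would stream them straight to disk:

import gzip

f_name = 'file.gz'
out_name = 'filtered.txt'  # hypothetical output file

with gzip.open(f_name, 'rt') as infile, open(out_name, 'w') as outfile:
    for line in infile:
        # write a matching line immediately; nothing is accumulated in memory
        if any(token.startswith('/bin/movie/tribune') for token in line.split()):
            outfile.write(line)

Memory use then stays flat regardless of how large the decompressed file is.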

Related

Reading binary file. Translate matlab to python

I'm trying to translate working MATLAB code that reads a binary file into Python. Is there an equivalent for
% open the file for reading
fid=fopen (filename,'rb','ieee-le');
% first read the signature
tmp=fread(fid,2,'char');
% read sizes
rows=fread(fid,1,'ushort');
cols=fread(fid,1,'ushort');
There's the struct module for that, specifically the unpack function, which accepts a buffer; you'll have to read the required size from the input using struct.calcsize.
import struct

endian = "<"  # little endian
with open(filename, 'rb') as f:
    tmp = struct.unpack(f"{endian}cc", f.read(struct.calcsize("cc")))
    tmp_int = [int.from_bytes(x, byteorder="little") for x in tmp]
    rows = struct.unpack(f"{endian}H", f.read(struct.calcsize("H")))[0]
    cols = struct.unpack(f"{endian}H", f.read(struct.calcsize("H")))[0]
You might want to use the struct.Struct class for reading the rest of the data in chunks, as it is going to be faster than decoding numbers one at a time, i.e.:
data = []
reader = struct.Struct(endian + "i" * cols)  # i for integer
row_size = reader.size
for row_number in range(rows):
    row = reader.unpack(f.read(row_size))
    data.append(row)
Edit: corrected the answer, and added an example for larger chunks.
Edit 2: one more improvement: assuming we are reading a 1 GB file of shorts, storing them as Python ints makes no sense and will most likely give an out-of-memory error (or the system will freeze); the proper way to do it is using numpy:
import numpy as np
data = np.fromfile(f,dtype=endian+'H').reshape(cols,rows) # ushorts
This way it takes the same space in memory as it does on disk.
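If even that is too much to keep resident, a memory-mapped array is another option. This is only a sketch under the assumption that the header is exactly the 2 signature bytes plus the two ushorts read above (a 6-byte offset); adjust it to your real file layout:

import numpy as np

# assumption: 2 signature bytes + 2 ushorts (rows, cols) = 6-byte header
data = np.memmap(filename, dtype=endian + 'H', mode='r',
                 offset=6, shape=(cols, rows))
# pages are read from disk only when slices are actually accessed
print(data[0, :10])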

How to downsample .json file

I apologize if this is a very beginner-ish question, but I have a multivariate data set from reddit (https://files.pushshift.io/reddit/submissions/) and the files are way too big. Is it possible to downsample one of these files to 20% or less, and either save it as a new file (json or csv) or read it directly as a pandas dataframe? Any help will be very appreciated!
Here is my attempt thus far
def load_json_df(filename, num_bytes=-1):
    '''Load the first `num_bytes` of the filename as a json blob, convert each line into a row in a Pandas data frame.'''
    fs = open(filename, encoding='utf-8')
    df = pd.DataFrame([json.loads(x) for x in fs.readlines(num_bytes)])
    fs.close()
    return df

january_df = load_json_df('RS_2019-01.json')
january_df.sample(frac=0.2)
However this gave me a memory error while trying to open it. Is there a way to downsample it without having to open the entire file?
The problem is that it is not possible to determine exactly what 20% of the data is without first reading the entire file; only then can you know what 20% looks like.
Reading a large file into memory all at once generally throws this error. You can avoid it by reading the file line by line with the code below:
data = []
counter = 0
with open('file') as f:
    for line in f:
        data.append(json.loads(line))
        counter += 1
You should then be able to do this
df = pd.DataFrame([x for x in data]) #you can set a range here with counter/5 if you want to get 20%
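For instance, to keep roughly 20% you could slice every fifth record out of the list before building the frame (just a sketch of what the comment above hints at):

import pandas as pd

# keep every fifth parsed record, i.e. roughly 20% of the data
df = pd.DataFrame(data[::5])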
I downloaded the first of the files, i.e. https://files.pushshift.io/reddit/submissions/RS_2011-01.bz2,
decompressed it and looked at the contents. As it happens, it is not proper JSON but rather JSON Lines - a series of JSON objects, one per line (see http://jsonlines.org/). This means you can just cut out as many lines as you want, using any tool you want (for example, a text editor). Or you can process the file sequentially in your Python script, taking every fifth line into account, like this:
with open('RS_2019-01.json', 'r') as infile:
    for i, line in enumerate(infile):
        if i % 5 == 0:
            j = json.loads(line)
            # process the data here
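If you would rather take a random 20% sample than every fifth line, a minimal sketch along the same streaming lines (the output name RS_2019-01_sample.json is hypothetical) could be:

import random

random.seed(0)  # optional: makes the sample reproducible

with open('RS_2019-01.json', 'r') as infile, \
        open('RS_2019-01_sample.json', 'w') as outfile:
    for line in infile:
        # keep each line with probability 0.2; nothing large is held in memory
        if random.random() < 0.2:
            outfile.write(line)

The sampled file is still JSON Lines, so something like pd.read_json('RS_2019-01_sample.json', lines=True) should be able to load it directly.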

What would make this code that combines some flat files run faster?

I'm new to Python and haven't done any optimization work yet. I'm attempting to take a bunch of files that are themselves already pretty large and combine them into one large file that, at my guess, will wind up being close to 50-100 GB. That is more memory than I have, at any rate. I was given the code below and it works great for small files, but when I try to run it over the actual files for my use case, it completely locks up my computer.
I understand that Pandas is fast, and I'm guessing that the data frames are stored in memory. If that is the case, then that is probably what is wrecking things here. Is there any kind of mechanism to spill to disk, or possibly to write to an existing file instead of trying to hold the whole thing in a dataframe before writing to disk? Or possibly another option that I didn't think of?
import pandas as pd
import os

file_masks = ['fhv', 'green', 'yellow']

def combine_files(file_mask):
    csvfiles = []
    for path, directories, files in os.walk('TaxiDriveData/'):
        csvfiles.extend([os.path.join(path, fn) for fn in files if fn.startswith(file_mask)])
    df = pd.concat((pd.read_csv(fn) for fn in csvfiles))
    df.to_csv(os.path.join('TaxiDriveCombinedData', file_mask + '_trip_data.csv'), index=False)

for m in file_masks:
    combine_files(m)
Here's a non-pandas solution that doesn't load everything into memory. I haven't tested it, but it should work.
import os

file_masks = ['fhv', 'green', 'yellow']

def combine_files(file_mask):
    with open(os.path.join('TaxiDriveCombinedData', file_mask + '_trip_data.csv'), 'w') as fout:
        csvfiles = []
        for path, directories, files in os.walk('TaxiDriveData/'):
            csvfiles.extend([os.path.join(path, fn) for fn in files if fn.startswith(file_mask)])
        for in_file in csvfiles:
            with open(in_file, 'r') as fin:
                # next(fin)  # uncomment this to skip the header line of each file
                for line in fin:
                    fout.write(line)

for m in file_masks:
    combine_files(m)
You don't need Python to do that. There are a lot of tools on a Linux system that can join files and are optimized, or have parameters, to do this very efficiently: join, cat, dd...
This is not the most efficient option, but, for example:
cat input/*.csv > output/combined.csv
If you want a high-performance Python version, I recommend reading and writing the files in chunks instead of line by line.
Your biggest problem is I/O, and you can optimize it by reading and writing larger blocks of the hard disk. If you read and write in the optimal block size of your hard drive and filesystem, you will notice the difference.
For example, a common block size for newer HDDs is 4096 bytes (4 KiB).
You can try something like the following:
NEW_LINE = '\n'

def read_in_chunks(f, chunksize=4096):
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            break
        yield chunk

(...)

fout = open('output.csv', 'w')
for fname in files:
    with open(fname) as fin:
        buffer = ''
        for chunk in read_in_chunks(fin):
            buffer += chunk
            if NEW_LINE not in buffer:
                continue  # keep accumulating until a complete line is available
            lines, tmp_buffer = buffer.rsplit(NEW_LINE, 1)
            lines += NEW_LINE  # rsplit removes the last new-line char. I re-add it
            fout.write(lines)
            buffer = tmp_buffer
        if buffer:
            fout.write(buffer)  # flush whatever is left after the final chunk
fout.close()
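If splitting exactly on newline boundaries isn't actually required, a simpler way to get the same chunked copying is shutil.copyfileobj, which copies between file objects in fixed-size blocks. This is just a sketch, with csvfiles assumed to be the list of paths built as in the earlier answer:

import shutil

with open('output.csv', 'w') as fout:
    for fname in csvfiles:
        with open(fname) as fin:
            # next(fin)  # uncomment to drop each file's header line
            shutil.copyfileobj(fin, fout, length=1024 * 1024)  # copy in 1 MiB blocks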

Reading JSON from gigabytes of .txt files and add to the same list

I have 300 txt files (each between 80-100 MB) whose contents I have to put into a list object and then use all at the same time. I already created a working solution, but unfortunately it crashes with a MemoryError when I load more than 3 of the txt files. I'm not sure whether it matters, but I have a lot of RAM, so I could easily load 30 GB into memory if that would solve the problem.
Basically I would like to loop through the 300 txt files inside the same for loop. Is it possible to create a list object that holds 30 GB of content? Or can I achieve this in some different way? I would really appreciate it if somebody could explain the ideal solution or any useful tips.
Here is what I tried; it produces the MemoryError after loading 3 txt files.
def addContentToList(filenm):
    with open(filenm, encoding="ISO-8859-1") as v:
        jsonContentTxt.extend(json.load(v))

def createFilenameList(name):
    for r in range(2, 300):
        file_str = "%s%s.txt" % (name, r,)
        filenames.append(file_str)

filename1 = 'log_1.txt'
filename2 = 'log_'
filenames = []
jsonContentTxt = []

with open(filename1, encoding="ISO-8859-1") as f:
    jsonContentTxt = json.load(f)

createFilenameList(filename2)

for x in filenames:
    addContentToList(x)

json_data = json.dumps(jsonContentTxt)
content_list = json.loads(json_data)
print(content_list)
Put down the chocolate-covered banana and step away from the European currency systems.
Text files are a really bad way to store data like this. You should use a database; I recommend PostgreSQL or SQLite.
Apart from that, your error is probably due to using a 32-bit version of Python (which caps your memory allocation at 2 GB); use 64-bit instead. Even so, I think you'd be better off using a more proper tool for the job rather than allocating 30 GB of memory.
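As a rough illustration of the database route, here is a minimal sketch with the standard library's sqlite3 module; the database name logs.db and the table name records are assumptions, and the log_*.txt naming is carried over from the question:

import json
import sqlite3

conn = sqlite3.connect('logs.db')  # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS records (data TEXT)')

for r in range(1, 300):
    with open('log_%d.txt' % r, encoding="ISO-8859-1") as f:
        items = json.load(f)  # each file holds a JSON array, as in the question
        conn.executemany('INSERT INTO records (data) VALUES (?)',
                         ((json.dumps(item),) for item in items))
    conn.commit()

# later, stream the records back out instead of holding 30 GB in one list
for (data,) in conn.execute('SELECT data FROM records'):
    item = json.loads(data)
    # process item here
conn.close()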

Python- find the unique values from a large json file efficiently

I have a json file data_large of size 150.1 MB. The content inside the file is of the form [{"score": 68},{"score": 78}]. I need to find the list of unique scores from the items.
This is what I'm doing:
import ijson # since json file is large, hence making use of ijson
f = open ('data_large')
content = ijson.items(f, 'item') # json loads quickly here as compared to when json.load(f) is used.
print set(i['score'] for i in content) #this line is actually taking a long time to get processed.
Can I make the print set(i['score'] for i in content) line more efficient? Currently it takes 201 seconds to execute.
This will give you the set of unique score values (only) as ints. You'll need 150 MB of free memory. It uses re.finditer() to parse, which is about three times faster than the json parser (on my computer).
import re
import time

t = time.time()
obj = re.compile('{.*?: (\d*?)}')
with open('datafile.txt', 'r') as f:
    data = f.read()
s = set(m.group(1) for m in obj.finditer(data))
s = set(map(int, s))
print time.time() - t
Using re.findall() also seems to be about three times faster than the json parser; it consumes about 260 MB:
import re

obj = re.compile('{.*?: (\d*?)}')
with open('datafile.txt', 'r') as f:
    data = f.read()
s = set(obj.findall(data))
I don't think there is any way to improve things by much. The slow part is probably just the fact that at some point you need to parse the whole JSON file. Whether you do it all up front (with json.load) or little by little (when consuming the generator from ijson.items), the whole file needs to be processed eventually.
The advantage of using ijson is that you only need to have a small amount of data in memory at any given time. This probably doesn't matter too much for a file with a hundred or so megabytes of data, but it would be a very big deal if your data file grew to gigabytes or more. Of course, this may also depend on the hardware you're running on. If your code is going to run on an embedded system with limited RAM, limiting your memory use is much more important. On the other hand, if it is going to be running on a high-performance server or workstation with lots and lots of RAM available, there may not be any reason to hold back.
So, if you don't expect your data to get too big (relative to your system's RAM capacity), you might try testing to see if using json.load to read the whole file at the start, then getting the unique values with a set is faster. I don't think there are any other obvious shortcuts.
On my system, the straightforward code below handles 10,000,000 scores (139 megabytes) in 18 seconds. Is that too slow?
#!/usr/local/cpython-2.7/bin/python

from __future__ import print_function

import json  # this version uses the plain json module rather than ijson

with open('data_large', 'r') as file_:
    content = json.load(file_)
    print(set(element['score'] for element in content))
Try using a set
set([x['score'] for x in scores])
For example
>>> scores = [{"score" : 78}, {"score": 65} , {"score" : 65}]
>>> set([x['score'] for x in scores])
set([65, 78])
