Loading a 15GB file in Python

I have a 15GB text file containing 25000 lines.
I am creating a multi-level dictionary in Python of the form:
dict1 = {'':int},
dict2 = {'':dict1}.
I have to use this entire dict2 multiple times (about 1000, in a for loop) in my program.
Can anyone please suggest a good way to do this?
The same type of information is stored in the file
(counts of different RGB values for 25000 images, one image per line),
e.g. one line of the file looks like:
image1 : 255,255,255-70 ; 234,221,231-40 ; 112,13,19-28 ;
image2 : 5,25,25-30 ; 34,15,61-20 ; 102,103,109-228 ;
and so on.

The best way to do this is to use chunking.
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open('really_big_file.dat')
for piece in read_in_chunks(f):
    process_data(piece)
As a note: as you start to process large files, moving to a map-reduce idiom may help, since you'll be able to work on separate chunked files independently without pulling the complete data set into memory.

In Python, if you use a file object as an iterator, you can read a file line by line without loading the whole thing into memory.
for line in open("huge_file.txt"):
    do_something_with(line)
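For the nested dictionary described in the question, a minimal sketch of building it in a single line-by-line pass might look like this (the separators ':', ';' and '-' are assumed from the example lines, and the file name is a placeholder):
from collections import defaultdict

# dict2 maps image name -> {'r,g,b': count}, built in a single pass
dict2 = defaultdict(dict)

with open("huge_file.txt") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        image, _, rest = line.partition(':')
        # each entry looks like '255,255,255-70'
        for entry in rest.split(';'):
            entry = entry.strip()
            if not entry:
                continue
            rgb, _, count = entry.rpartition('-')
            dict2[image.strip()][rgb.strip()] = int(count)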

How to downsample .json file

I apologize if this is a very beginner-ish question, but I have a multivariate data set from reddit (https://files.pushshift.io/reddit/submissions/), and the files are way too big. Is it possible to downsample one of these files down to 20% or less, and either save it as a new file (json or csv) or read it directly as a pandas dataframe? Any help will be very appreciated!
Here is my attempt thus far:
import json
import pandas as pd

def load_json_df(filename, num_bytes=-1):
    '''Load the first `num_bytes` of the filename as a json blob, convert each line into a row in a Pandas data frame.'''
    fs = open(filename, encoding='utf-8')
    df = pd.DataFrame([json.loads(x) for x in fs.readlines(num_bytes)])
    fs.close()
    return df

january_df = load_json_df('RS_2019-01.json')
january_df.sample(frac=0.2)
However this gave me a memory error while trying to open it. Is there a way to downsample it without having to open the entire file?
The problem is that it is not possible to determine exactly what 20% of the data is without reading it. To do that, you must first read through the entire length of the file; only then can you get an idea of what 20% would look like.
Reading a large file into memory all at once generally throws this error. You can avoid it by reading the file line by line with the code below:
import json
import pandas as pd

data = []
counter = 0
with open('file') as f:
    for line in f:
        data.append(json.loads(line))
        counter += 1
You should then be able to do this:
df = pd.DataFrame(data)  # you can take a slice here (e.g. data[:counter // 5]) if you want to get 20%
I downloaded the first of the files, i.e. https://files.pushshift.io/reddit/submissions/RS_2011-01.bz2,
decompressed it, and looked at the contents. As it happens, it is not proper JSON but rather JSON Lines - a series of JSON objects, one per line (see http://jsonlines.org/ ). This means you can just cut out as many lines as you want, using any tool you want (for example, a text editor). Or you can process the file sequentially in your Python script, taking only every fifth line, like this:
import json

with open('RS_2019-01.json', 'r') as infile:
    for i, line in enumerate(infile):
        if i % 5 == 0:
            j = json.loads(line)
            # process the data here
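If taking every fifth line is too regular for your data, a hedged alternative sketch is to keep each line with probability 0.2 and build the DataFrame from the kept rows (the file name and fraction are taken from the question):
import json
import random

import pandas as pd

sampled_rows = []
with open('RS_2019-01.json', encoding='utf-8') as infile:
    for line in infile:
        # keep roughly 20% of the lines without loading the whole file
        if random.random() < 0.2:
            sampled_rows.append(json.loads(line))

sampled_df = pd.DataFrame(sampled_rows)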

Printing top few lines of a large JSON file in Python

I have a JSON file whose size is about 5GB. I know neither how the JSON file is structured nor the names of the roots in the file. I'm not able to load the file on my local machine because of its size, so I'll be working on high-computation servers.
I need to load the file in Python and print the first N lines to understand the structure and proceed further with data extraction. Is there a way to load and print the first few lines of the JSON in Python?
If you want to do it in Python, you can do this:
N = 3
with open("data.json") as f:
    for i in range(N):
        print(f.readline(), end='')
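One hedged caveat: if the whole file happens to be a single JSON document on one line, readline() will still try to pull that entire line into memory. A safer sketch for peeking at the structure is to read a fixed number of bytes instead:
# Peek at the first few kilobytes regardless of where the line breaks are.
PEEK_BYTES = 4096  # arbitrary choice; increase if the structure is not visible yet

with open("data.json", encoding="utf-8") as f:
    print(f.read(PEEK_BYTES))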
You can use the head command to display the first N lines of the file, to get a sample of the JSON and see how it is structured.
Then use this sample to work on your data extraction.
Best regards

Counting lines in Azure Data Lake

I have some files in Azure Data Lake and I need to count how many lines they have to make sure they are complete. What would be the best way to do it?
I am using Python:
from azure.datalake.store import core, lib

adl_creds = lib.auth(tenant_id='fake_value', client_secret='fake_another value', client_id='fake key', resource='https://my_web.azure.net/')
adl = core.AzureDLFileSystem(adl_creds, store_name='fake account')

file_path_in_azure = "my/path/to/file.txt"
if adl.exists(file_path_in_azure):
    # blocksize reference: 1 MB = 1048576, 5 MB = 5242880, 100 MB = 104857600, 500 MB = 524288000
    counter = 0
    with adl.open(file_path_in_azure, mode="rb", blocksize=5242880) as f:
        # I tried a comprehension, but memory increased since it builds a list of 1s [1,1,1,...] and then sums them
        # counter1 = sum(1 for line in f)
        for line in f:
            counter = counter + 1
    print(counter)
This works, but it takes hours for files that are 1 or 2 gigabytes. Shouldn't this be faster? Might there be a better way?
Do you need to count lines? Maybe it is enough to get the size of the file?
You have AzureDLFileSystem.stat to get the file size; if you know the average line length, you can estimate the expected line count.
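A rough sketch of that estimate, assuming stat() returns a dict with a 'length' field in bytes (worth verifying against the SDK docs) and using a made-up average line length that you would measure from a small sample:
# Estimate the line count from the file size instead of reading the whole file.
AVERAGE_LINE_BYTES = 120  # hypothetical value; measure it on a downloaded sample

file_info = adl.stat(file_path_in_azure)  # assumed to return a dict containing 'length'
estimated_lines = file_info['length'] // AVERAGE_LINE_BYTES
print(estimated_lines)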
You could try:
for file in adl.walk('path/to/folder'):
    counter += len(adl.cat(file).decode().split('\n'))
I'm not sure if this is actually faster, but it uses the Unix-style built-ins to get the file contents, which might be quicker than explicit line-by-line I/O.
EDIT: The one pitfall of this method is the case where file sizes exceed the RAM of the device you run this on, as cat will pull the entire contents into memory.
The only faster way I found was to actually download the file locally to where the script is running with
adl.get(remote_file, local_file)
and then count line by line without pulling the whole file into memory. Downloading 500 MB takes around 30 seconds and reading 1 million lines around 4 seconds =)
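Another hedged option, relying only on the adl.open() and read() calls already used in the question, is to read the remote file in large binary blocks and count newline bytes instead of iterating line by line:
def count_lines_in_blocks(adl_fs, path, block_size=4 * 1024 * 1024):
    """Count newline bytes by reading fixed-size binary blocks."""
    newline_count = 0
    with adl_fs.open(path, mode="rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            newline_count += block.count(b"\n")
    # add 1 here if the last line has no trailing newline and should be counted
    return newline_count

print(count_lines_in_blocks(adl, file_path_in_azure))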

Consume a single generator by multiple filtered generators

I have a file that contains records which look like this:
>uniqueid#BARCODE1
content content
content content
>uniqueid#BARCODE2
content content
content content
>uniqueid#BARCODE1
content content
content content
...
There are ~10 million records with ~300 unique barcodes, and the order of barcodes is random. My goal is to split the file into 300 files. I'd like to do this passing through the file only once with a generator that yields records.
I have solved this problem in the past by reading through the file many times, or by loading the entire file into memory. I'd like to solve the problem using a generator approach to keep memory usage down and to only read through the file once.
Is it possible to instantiate 300 generators and have some logic that dispatches the records to the correct generator? Then I could simply write out the contents of each generator to a file. Would I need to then have handles open to all 300 files I want to write at once?
Yes, it can be done with generators, but we don't need 300 of them; just one is enough.
As far as I understand, '>uniqueid#BARCODE1' consists of two parts:
the '>uniqueid#' prefix,
and 'BARCODE1', the barcode itself.
So let's start by writing a simple checker:
BAR_CODE_PREFIX = '>uniqueid#'

def is_bar_code(text):
    return text.startswith(BAR_CODE_PREFIX)
Then, for parsing barcode contents, we can write a generator:
def parse_content(lines):
    lines_iterator = iter(lines)
    # we assume that the first line is a barcode
    bar_code = next(lines_iterator)
    contents = []
    for line in lines_iterator:
        if is_bar_code(line):
            # next barcode is found
            yield bar_code, contents
            bar_code = line
            contents = []
        else:
            contents.append(line)
    # yield the last barcode with its contents
    yield bar_code, contents
Then, assuming you want to name the files after the barcodes, we can write:
def split_src(src_path):
    with open(src_path) as src_file:
        for bar_code, content_lines in parse_content(src_file):
            bar_code_index = bar_code[len(BAR_CODE_PREFIX):].rstrip('\n')
            # we use `append` mode because contents of the same barcode
            # appear in different parts of the file
            with open(bar_code_index, mode='a') as dst_file:
                dst_file.writelines(content_lines)
which walks through the whole file exactly once.
Test
Let's create a src.txt file which consists of:
>uniqueid#BARCODE1
content11
>uniqueid#BARCODE2
content21
>uniqueid#BARCODE1
content12
content12
>uniqueid#BARCODE3
content31
>uniqueid#BARCODE2
content22
Then after calling
split_src('src.txt')
the following files will be created:
BARCODE1 with lines
content11
content12
content12
BARCODE2 with lines
content21
content22
BARCODE3 with lines
content31
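Since the question also asks about keeping handles open to all ~300 output files, here is a hedged variant of split_src that opens each destination file only once and caches the handle in a dictionary; contextlib.ExitStack closes everything at the end. With only ~300 files this should stay well below typical OS limits on open file descriptors.
from contextlib import ExitStack

def split_src_keep_handles(src_path):
    with ExitStack() as stack, open(src_path) as src_file:
        dst_files = {}  # barcode -> open file handle
        for bar_code, content_lines in parse_content(src_file):
            bar_code_index = bar_code[len(BAR_CODE_PREFIX):].rstrip('\n')
            dst_file = dst_files.get(bar_code_index)
            if dst_file is None:
                # open once per barcode; 'w' is fine because the handle stays open
                dst_file = stack.enter_context(open(bar_code_index, mode='w'))
                dst_files[bar_code_index] = dst_file
            dst_file.writelines(content_lines)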

fastest method to read big data files in python

I have got some (about 60) huge (>2 GB) CSV files which I want to loop through to make subselections (e.g. each file contains data for 1 month of various financial products; I want to make 60-month time series of each product).
Reading an entire file into memory (e.g. by loading the file in Excel or MATLAB) is unworkable, so my initial search on Stack Overflow made me try Python. My strategy was to loop through each line iteratively and write it out to some folder. This strategy works fine, but it is extremely slow.
From my understanding there is a trade-off between memory usage and computation speed: loading the entire file into memory is one end of the spectrum (the computer crashes), while loading a single line into memory at a time is the other end (computation time is about 5 hours).
So my main question is: *Is there a way to load multiple lines into memory, so as to make this process (100 times?) faster, while not losing functionality?* And if so, how would I implement this? Or am I going about this all wrong? Mind you, the code below is just a simplified version of what I am trying to do (I might want to make subselections in dimensions other than time). Assume that the original data files have no meaningful ordering (other than being split into 60 files, one per month).
The particular method I am trying is:
# Creates a time series per bond
import csv
import linecache

# I have a row of comma-separated bond identifiers in 'allBonds.txt' for each month
# I have 60 large files financialData_&month&year
filedoc = []
months = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']
years = ['08','09','10','11','12']
bonds = []

for j in range(0, 5):
    for i in range(0, 12):
        filedoc.append('financialData_' + str(months[i]) + str(years[j]) + '.txt')

for x in range(0, 60):
    line = linecache.getline('allBonds.txt', x)
    bonds = line.split(',')  # generate the identifiers for this particular month
    with open(filedoc[x]) as text_file:
        for line in text_file:
            temp = line.split(';')
            if temp[2] in bonds:  # checks if the bond of this iteration is among those we search for
                output_file = open('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt', 'a')
                datawriter = csv.writer(output_file, dialect='excel', delimiter='^', quoting=csv.QUOTE_MINIMAL)
                datawriter.writerow(temp)
                output_file.close()
Thanks in advance.
P.S. Just to make sure: the code works at the moment (though any suggestions are of course welcome), but the issue is speed.
I would test pandas.read_csv, mentioned in https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file . It supports reading the file in chunks (the iterator=True and chunksize options).
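A minimal sketch of that idea, assuming the same ';'-separated layout and bond column (index 2) as in the question; the file names and bond identifiers here are made up for illustration:
import pandas as pd

bonds = {'BOND_A', 'BOND_B'}  # hypothetical identifiers for one month

selected = []
for chunk in pd.read_csv('financialData_jan08.txt', sep=';', header=None, chunksize=100_000):
    # keep only rows whose third column is one of the bonds we search for
    selected.append(chunk[chunk[2].isin(bonds)])

result = pd.concat(selected, ignore_index=True)
result.to_csv('monthOutput_jan08.txt', sep='^', index=False, header=False)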
I think this part of your code may cause serious performance problems if the condition is matched frequently.
if temp[2] in bonds:  # checks if the bond of this iteration is among those we search for
    output_file = open('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt', 'a')
    datawriter = csv.writer(output_file, dialect='excel', delimiter='^',
                            quoting=csv.QUOTE_MINIMAL)
    datawriter.writerow(temp)
    output_file.close()
It would be better to avoid opening a file, creating a csv.writer() object, and then closing the file inside a loop.
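As a hedged sketch of one way to do that, you could open each output file once, cache the csv.writer objects in a dictionary keyed by output path, and close all the files after the loop (the names follow the question's code):
import csv

writers = {}     # output path -> csv.writer
open_files = []  # keep the handles so they can be closed at the end

def get_writer(path):
    writer = writers.get(path)
    if writer is None:
        output_file = open(path, 'a', newline='')
        open_files.append(output_file)
        writer = csv.writer(output_file, dialect='excel', delimiter='^',
                            quoting=csv.QUOTE_MINIMAL)
        writers[path] = writer
    return writer

# inside the existing loop:
#     if temp[2] in bonds:
#         get_writer('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt').writerow(temp)

# after all files have been processed:
# for f in open_files:
#     f.close()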
