How to downsample .json file - python

I apologize if this is a very beginner-ish question, but I have a multivariate data set from reddit ( https://files.pushshift.io/reddit/submissions/ ) and the files are way too big. Is it possible to downsample one of these files down to 20% or less, and either save it as a new file (json or csv) or read it directly as a pandas dataframe? Any help would be very much appreciated!
Here is my attempt so far:

import json
import pandas as pd

def load_json_df(filename, num_bytes = -1):
    '''Load the first `num_bytes` of the filename as a json blob, convert each line into a row in a Pandas data frame.'''
    fs = open(filename, encoding='utf-8')
    df = pd.DataFrame([json.loads(x) for x in fs.readlines(num_bytes)])
    fs.close()
    return df

january_df = load_json_df('RS_2019-01.json')
january_df.sample(frac=0.2)
However, this gave me a memory error while trying to open the file. Is there a way to downsample it without having to load the entire file into memory?

The problem is that it is not possible to determine exactly what 20% of the data is without first reading the entire file; only then do you know how many records 20% corresponds to.
Reading a large file into memory all at once generally throws this error. You can avoid it by reading the file line by line with the code below:
import json

data = []
counter = 0
with open('file') as f:
    for line in f:
        data.append(json.loads(line))
        counter += 1

You should then be able to do this:

df = pd.DataFrame(data)  # you can take a slice such as data[:counter // 5] here if you only want 20%
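If you would rather take a random 20% sample than just the first fifth of the records, a minimal sketch (assuming the file is JSON Lines, one object per line) is to keep each line with probability 0.2 while streaming through the file:

import json
import random

import pandas as pd

sampled = []
with open('RS_2019-01.json', encoding='utf-8') as f:
    for line in f:
        # keep roughly 20% of the lines without ever loading the whole file
        if random.random() < 0.2:
            sampled.append(json.loads(line))

df = pd.DataFrame(sampled)

This keeps memory proportional to the sample rather than to the full file.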

I downloaded the first of the files, i.e. https://files.pushshift.io/reddit/submissions/RS_2011-01.bz2
decompressed it and looked at the contents. As it happens, it is not proper JSON but rather JSON Lines - a series of JSON objects, one per line (see http://jsonlines.org/ ). This means you can cut out as many lines as you want, using any tool you like (for example, a text editor). Or you can process the file sequentially in your Python script, taking into account only every fifth line, like this:
import json

with open('RS_2019-01.json', 'r') as infile:
    for i, line in enumerate(infile):
        if i % 5 == 0:
            j = json.loads(line)
            # process the data here
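If what you ultimately want is a pandas dataframe, a minimal sketch along the same lines (collecting every fifth record and then building the frame in one go) could look like this; the output filename is just an example:

import json

import pandas as pd

records = []
with open('RS_2019-01.json', 'r', encoding='utf-8') as infile:
    for i, line in enumerate(infile):
        # keep every fifth line, i.e. roughly 20% of the file
        if i % 5 == 0:
            records.append(json.loads(line))

january_df = pd.DataFrame(records)
january_df.to_csv('RS_2019-01_sample.csv', index=False)  # optionally persist the sample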


Editing of txt files not saving when I concatenate them

I am fairly new to programming, so bear with me!
We have a task at school where we have to clean up three text files ("Balance1", "Saving", and "Withdrawal") and append them together into a new file. These files are just names and sums of money listed downwards, but some of it is jumbled. This is my code for the first file, Balance1:
with open('Balance1.txt', 'r+') as f:
    f_contents = f.readlines()

    # Then I start cleaning up the lines. Here I edit Anna's savings to an integer.
    f_contents[8] = "Anna, 600000"

    # Here I delete the blank lines and edit in the 50000 to Philip.
    del f_contents[3]
    del f_contents[3]
In the original text file Anna's savings is written like this: "Anna, six hundred thousand" and we have to make it look clean, so it's rather "NAME, SUM" (with the sum as an integer). When I print this as a list it looks good, but after I have done this with all three files I try to append them together into a file called "Balance.txt" like this:
filenames = ["Balance1.txt", "Saving.txt", "Withdrawal.txt"]

with open("Balance.txt", "a") as outfile:
    for filename in filenames:
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)
When I check the new text file "Balance", the three files have been appended together, but just as they were in the beginning, not with my edits, so the result is not "cleaned up". Can anyone help me understand why this happens, and what I have to do so that the edited, clean versions are appended?
In the first part, where you do the "editing" of the Balance1.txt file, this is what happens:
You open the file in read mode
You load the data into memory
You edit the in-memory data
And voila.
You never persisted the changes to any file on disk. So when, in the second part, you read the contents of all the files, you read the data that was originally there.
So if you want to concatenate the edited data, you have two choices:
Pre-process the data by creating three final, correct files (editing Balance1.txt and persisting it to another file, say Balance1_fixed.txt) and then, in the second part, concatenating ["Balance1_fixed.txt", "Saving.txt", "Withdrawal.txt"]. Total of 4 data file openings, more IO.
Use only the second loop you have, and correct the contents before writing them to the outfile. You can use readlines() as you did first, edit the specific lines and then use writelines(). Total of 3 data file openings, less IO than the previous option. A sketch of this approach is shown below.
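Here is a minimal sketch of that second option. It assumes, as in the question, that only Balance1.txt needs fixes, and the clean() helper is a hypothetical name:

filenames = ["Balance1.txt", "Saving.txt", "Withdrawal.txt"]

def clean(filename, lines):
    # apply the per-file fixes in memory; only Balance1.txt needs them in this example
    if filename == "Balance1.txt":
        lines[8] = "Anna, 600000\n"
        del lines[3]
        del lines[3]
    return lines

# "w" instead of "a" so re-running the script does not duplicate the data
with open("Balance.txt", "w") as outfile:
    for filename in filenames:
        with open(filename) as infile:
            outfile.writelines(clean(filename, infile.readlines()))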

Read Multiple files one by one and extract the content based on different variables for each file

I'm new to Python. I have a set of pcap files in a directory, and I need to read each file and extract the required data based on a different variable for each file. I'm using pyshark for parsing the pcaps.
I have to take a CSV file column as the input for filtering each file.
For example, I have 4 files each for src- and dst-, so the first file should be filtered by taking only 10.272.726.227, the 2nd file by 10.272.726.228, etc...
See below:
import os

import pandas as pd
import pyshark

files = os.listdir('./Pcap')
csv_file = pd.read_csv('input.csv')
ip_src = csv_file.SRC_privateIp.tolist()
ip_dst = csv_file.DST_privateIp.tolist()

for file in files:
    if file.startswith('src-'):
        cap_src = pyshark.FileCapture(file, only_summaries=True)
        for packet in cap_src:
            line = str(packet)
            formattedline = line.split(' ')
            if formattedline[2] == ip_src and formattedline[3] == ip_dst:
                print(formattedline)
    if file.startswith('dst-'):
        cap_src = pyshark.FileCapture(file, only_summaries=True)
        for packet in cap_src:
            line = str(packet)
            formattedline = line.split(' ')
            if formattedline[2] == ip_dst and formattedline[3] == ip_src:
                print(formattedline)
I tried to open each file and do the processing on each file separately, but it takes all of the files' data as one string. I want to open each file one by one and process it, because each file has different conditions to filter on. The above code gives the error "This event loop is already running". I don't know how to proceed further. Can anybody help me out?
Thanks!
I do not understand exactly what your question is, but I think you can use pandas for CSV reading, subsetting and manipulation. It is the standard library for this kind of task. See for example:
Read csv with pandas or
Subset a dataframe with pandas
A rough sketch of how that could apply here is shown below.
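Since the answer only points at pandas in general, here is a rough sketch of one way to pair each file with its own IP from the CSV. It assumes, purely as a hypothesis, that the rows of input.csv line up with the sorted src- file names; the question does not confirm this:

import os

import pandas as pd

csv_file = pd.read_csv('input.csv')
src_ips = csv_file.SRC_privateIp.tolist()
dst_ips = csv_file.DST_privateIp.tolist()

pcap_dir = './Pcap'
src_files = sorted(f for f in os.listdir(pcap_dir) if f.startswith('src-'))

# zip pairs the i-th file with the i-th IP, so each file gets its own filter values
for filename, src_ip, dst_ip in zip(src_files, src_ips, dst_ips):
    path = os.path.join(pcap_dir, filename)  # listdir returns bare names, so join the directory
    print(f'{path}: filter on src={src_ip}, dst={dst_ip}')
    # open the capture for `path` here (e.g. with pyshark.FileCapture) and compare each
    # packet's source/destination fields against src_ip and dst_ip for this specific file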

Printing top few lines of a large JSON file in Python

I have a JSON file whose size is about 5GB. I neither know how the JSON file is structured nor the names of the roots in the file. I'm not able to load the file on my local machine because of its size, so I'll be working on high-computation servers.
I need to load the file in Python and print the first N lines to understand the structure and proceed further with the data extraction. Is there a way in which we can load and print the first few lines of the JSON in Python?
If you want to do it in Python, you can do this:
N = 3
with open("data.json") as f:
    for i in range(N):
        print(f.readline(), end='')
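As a follow-up to the same idea, a minimal sketch (assuming nothing about the file beyond it being text) is to peek at the first few lines with itertools.islice and then check whether the first line on its own parses as JSON; if it does, the file is probably in JSON Lines format:

import json
from itertools import islice

N = 5
with open('data.json', encoding='utf-8') as f:
    first_lines = list(islice(f, N))  # reads only the first N lines, not the whole 5GB

for line in first_lines:
    print(line.rstrip())

try:
    json.loads(first_lines[0])
    print('First line is a complete JSON object -> likely JSON Lines')
except json.JSONDecodeError:
    print('First line alone is not valid JSON -> likely one big JSON document')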
You can use the command head to display the first N lines of the file, to get a sample of the JSON and see how it is structured.
Then use this sample to work on your data extraction.
Best regards

Read Large Gzip Files in Python

I am trying to read a gzip file (around 150 MB in size) using this script (which I know is badly written):
import gzip

f_name = 'file.gz'
a = []

with gzip.open(f_name, 'r') as infile:
    for line in infile:
        a.append(line.split(' '))

new_array1 = []
for l in a:
    for i in l:
        if i.startswith('/bin/movie/tribune'):
            new_array1.append(l)

filtered = []
for q in range(0, len(new_array1)):
    filtered.append(new_array1[q])

# at this point the filtered array can be printed
The problem is that I am able to read files of up to 50 MB into an array with this technique, but files of 80 MB and above are not readable. Is there some problem with the technique I am using, or is there a memory constraint? If it is the latter, what would be the best technique to read a large gz file (above 100 MB) into a Python array? Any help will be appreciated.
Note: I am not using NumPy because I ran into some serious issues with the C compilers on my server which are required for NumPy, so I am not able to use it. Please suggest something that uses a native Pythonic approach (or anything other than NumPy). Thanks.
My guess is that the problem is constructing the list a in your code, as it will undoubtedly contain a massive number of entries if your .gz file is that large. This modification should solve that problem:
import gzip

f_name = 'file.gz'
filtered = []

with gzip.open(f_name, 'r') as infile:
    for line in infile:
        for i in line.split(' '):
            if i.startswith('/bin/movie/tribune'):
                filtered.append(line)
                break  # to avoid duplicates
If your problem is memory consumption (you didn't include the error message...), you can save a lot of memory by avoiding storing the temporary lists, by using generators.
E.g.
import gzip

f_name = 'file.gz'

def get_lines(infile):
    for line in infile:
        yield line.split()

def filter1(line_tokens):
    return any(token.startswith('/bin/movie/tribune') for token in line_tokens)

def filter2(line_tokens):
    # was there a filter2?
    return True

infile = gzip.open(f_name, 'r')
filtered = (line_tokens for line_tokens in get_lines(infile) if filter1(line_tokens) and filter2(line_tokens))

for line in filtered:
    print(line)
In my example filter2 is trivial, because it seems your filtered list is just an (un-filtered) copy of new_array1...
This way, you avoid storing the entire content in memory. Note that since filtered is a generator, you can only iterate over it once. If you do need to store it all, do filtered = list(filtered).
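To make that concrete, here is a small usage sketch with the same filter, assuming the matching lines should end up in a plain-text output file rather than in memory; the 'rt' mode is used so the gzip stream yields text lines under Python 3:

import gzip

def wanted(line):
    # same filter as above: keep lines with any token starting with the given prefix
    return any(token.startswith('/bin/movie/tribune') for token in line.split())

# the generator expression streams the file, so the whole input is never held in memory
with gzip.open('file.gz', 'rt') as infile, open('filtered.txt', 'w') as outfile:
    for line in (l for l in infile if wanted(l)):
        outfile.write(line)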

Loading 15GB file in Python

I have a 15GB text file containing 25000 lines.
I am creating a multi-level dictionary in Python of the form:
dict1 = {'': int},
dict2 = {'': dict1}.
I have to use this entire dict2 multiple times (about 1000 times, in a for loop) in my program.
Can anyone please suggest a good way to do that?
The same type of information is stored in the file
(counts of the different RGB values of 25000 images, 1 image per line),
e.g. lines of the file look like:
image1 : 255,255,255-70 ; 234,221,231-40 ; 112,13,19-28 ;
image2 : 5,25,25-30 ; 34,15,61-20 ; 102,103,109-228 ;
and so on.
The best way to do this is to use chunking.
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open('really_big_file.dat')
for piece in read_in_chunks(f):
    process_data(piece)
As a note, as you start to process large files, moving to a map-reduce idiom may help, since you'll be able to work on separate chunked files independently without pulling the complete data set into memory.
In Python, if you use a file object as an iterator, you can read a file line by line without loading the whole thing into memory.

for line in open("huge_file.txt"):
    do_something_with(line)
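Building on that, here is a minimal sketch of parsing the line format shown in the question into the nested dictionary (image name -> RGB string -> count). The exact separators are an assumption based on the example lines:

# build dict2 = {image_name: {rgb_string: count}} line by line,
# assuming lines like "image1 : 255,255,255-70 ; 234,221,231-40 ;"
dict2 = {}
with open("huge_file.txt") as f:
    for line in f:
        name, _, rest = line.partition(':')
        counts = {}
        for entry in rest.split(';'):
            entry = entry.strip()
            if not entry:
                continue
            rgb, _, count = entry.rpartition('-')
            counts[rgb.strip()] = int(count)
        dict2[name.strip()] = counts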
