I'm trying to read a very large set of nested JSON files into a pandas DataFrame using the code below. It's a few million records; it's the "review" file from the Yelp academic dataset.
Does anyone know a quicker way to do this?
Is it possible to just load a sample of the json records? I would probably be fine with just a couple hundred thousand records.
Also I probably don't need all the fields from the review.json file, could I just load a subset of them like user_id, business_id, stars? And would that speed things up?
I would post sample data but I can't even get it to finish loading.
Code:
df_review = pd.read_json('dataset/review.json', lines=True)
Update:
Code:
reviews = ''
with open('dataset/review.json', 'r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line
testdf = pd.read_json(reviews, lines=True)
Error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-18-8e4a45990905> in <module>()
5 reviews += line
6
----> 7 testdf = pd.read_json(reviews,lines=True)
/Users/anaconda/lib/python2.7/site-packages/pandas/io/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
273 # commas and put it in a json list to make a valid json object.
274 lines = list(StringIO(json.strip()))
--> 275 json = u'[' + u','.join(lines) + u']'
276
277 obj = None
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 357: ordinal not in range(128)
Update 2:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

reviews = ''
with open('dataset/review.json', 'r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line
testdf = pd.read_json(reviews, lines=True)
If your file has line-separated JSON objects as you imply, this should work: just read the first 1000 lines of the file and then parse them with pandas.
import pandas as pd

reviews = ''
with open('dataset/review.json', 'r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line

pd.read_json(reviews, lines=True)
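If memory is a concern, note that f.readlines() still reads the whole file before slicing. A minimal sketch of the same idea that stops after the first 1,000 lines with itertools.islice, and opens the file as UTF-8 to sidestep the UnicodeDecodeError from the update above (untested against the actual dataset):
import io
import pandas as pd
from itertools import islice

# islice stops after 1,000 lines, so the rest of the multi-gigabyte
# file is never read into memory (readlines() would read it all first).
# Opening with an explicit utf-8 encoding avoids the ascii decode error.
with io.open('dataset/review.json', 'r', encoding='utf-8') as f:
    sample = u''.join(islice(f, 1000))

testdf = pd.read_json(io.StringIO(sample), lines=True)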
Speeding up that one line would be challenging because it's already super optimized.
I would first check whether you can get fewer rows/less data from the provider, as you mentioned.
If you can preprocess the data, I would recommend parsing it beforehand (even trying different JSON parsers, since their performance varies with the dataset structure), saving only the fields you need, and then calling the pandas method on that output.
Here you can find some benchmarks of JSON parsers; keep in mind that you should test on your own data, and that the article is from 2015.
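For example, a minimal sketch of the "save just the data you need" idea, keeping only the three fields mentioned in the question (this assumes each line of review.json is one JSON object and that user_id, business_id and stars are present):
import json
import pandas as pd

# Keep only the fields asked about in the question.
wanted = ('user_id', 'business_id', 'stars')
records = []

with open('dataset/review.json', 'r') as f:
    for line in f:
        obj = json.loads(line)
        # .get() avoids a KeyError if a field is missing in some line
        records.append({k: obj.get(k) for k in wanted})

df_review = pd.DataFrame(records)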
I agree with #Nathan H's proposition, but the real gain will probably lie in parallelization.
import pandas as pd
import multiprocessing as mp

buf_lst = []
df_lst = []
chunk_size = 1000

with open('dataset/review.json', 'r') as f:
    lines = f.readlines()
    buf_lst += [''.join(lines[x:x + chunk_size]) for x in range(0, len(lines), chunk_size)]

def f(buf):
    return pd.read_json(buf, lines=True)

#### single-process
df_lst = map(f, buf_lst)

#### multi-process
pool = mp.Pool(4)
df_lst = pool.map(f, buf_lst)
pool.close()   # close before join, otherwise join() raises ValueError
pool.join()
However, I am not sure how best to combine the resulting pandas DataFrames yet.
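One way (a sketch, continuing from the df_lst built above) is pd.concat:
import pandas as pd

# Stack the per-chunk frames into one; ignore_index renumbers the rows so
# the index does not repeat 0..chunk_size-1 for every chunk.
df_review = pd.concat(list(df_lst), ignore_index=True)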
Related
Hi everyone, I need help opening and reading this file.
Got this txt file - https://yadi.sk/i/1TH7_SYfLss0JQ
It is a dictionary
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}
But it was written into a txt file using json.
#This is how I dump the data into a txt
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
So, the file structure is
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}{"id2":"url2", "id3":"url3", ..., "id4":"url4"}{"id5":"url5", "id6":"url6", ..., "id7":"url7"}
And it is all a string....
I need to open it, check for repeated IDs, delete the duplicates, and save it again.
But I'm getting json.loads ValueError: Extra data.
Tried these:
How to read line-delimited JSON from large file (line by line)
Python json.loads shows ValueError: Extra data
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 190)
But I'm still getting that error, just in a different place.
Right now I got as far as:
with open('111111111.txt', 'r') as log:
    before_log = log.read()

before_log = before_log.replace('}{', ', ').split(', ')

mu_dic = []
for i in before_log:
    mu_dic.append(i)
This eliminates the problem of several {}{}{} dictionaries/JSON objects in a row.
Maybe there is a better way to do this?
P.S. This is how the file is made:
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
Your file size is 9.5 MB, so it'll take you a while to open and debug it manually.
So, using the head and tail tools (normally found in any GNU/Linux distribution), you'll see that:
# You can use Python as well to read chunks from your file
# and see the nature of it and what it's causing a decode problem
# but i prefer head & tail because they're ready to be used :-D
$> head -c 217 111111111.txt
{"1933252590737725178": "https://instagram.fiev2-1.fna.fbcdn.net/vp/094927bbfd432db6101521c180221485/5CC0EBDD/t51.2885-15/e35/46950935_320097112159700_7380137222718265154_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net",
$> tail -c 219 111111111.txt
, "1752899319051523723": "https://instagram.fiev2-1.fna.fbcdn.net/vp/a3f28e0a82a8772c6c64d4b0f264496a/5CCB7236/t51.2885-15/e35/30084016_2051123655168027_7324093741436764160_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net"}
$> head -c 294879 111111111.txt | tail -c 12
net"}{"19332
So the first guess is that your file is a malformed series of JSON data, and the best approach is to separate each }{ with a \n for further manipulation.
So, here is an example of how you can solve your problem using Python:
import json

input_file = '111111111.txt'
output_file = 'new_file.txt'
data = ''

with open(input_file, mode='r', encoding='utf8') as f_file:
    # this with statement part can be replaced by
    # using sed under your OS like this example:
    # sed -i 's/}{/}\n{/g' 111111111.txt
    data = f_file.read()
    data = data.replace('}{', '}\n{')

seen, total_keys, to_write = set(), 0, {}

# split the lines of the in-memory data
for elm in data.split('\n'):
    # convert the line to a valid Python dict
    converted = json.loads(elm)
    # loop over the keys
    for key, value in converted.items():
        total_keys += 1
        # if the key is not seen then add it for further manipulations
        # else ignore it
        if key not in seen:
            seen.add(key)
            to_write.update({key: value})

# write the dict's keys & values into a new file as JSON
with open(output_file, mode='a+', encoding='utf8') as out_file:
    out_file.write(json.dumps(to_write) + '\n')

print(
    'found duplicated key(s): {seen} from {total}'.format(
        seen=total_keys - len(seen),
        total=total_keys
    )
)
Output:
found duplicated key(s): 43836 from 45367
And finally, the output file will be a valid JSON file and the duplicated keys will be removed with their values.
The basic differences between the file structure and actual JSON format are the missing commas between objects and the fact that the content is not enclosed within [ ]. The same result can be achieved with the code snippet below:
import json

with open('json_file.txt') as f:
    # Read the complete file
    a = f.read()

# Convert into a single-line string
b = ''.join(a.splitlines())
# Add , after each object
b = b.replace("}", "},")
# Add opening and closing brackets and drop the last comma added in the previous step
b = '[' + b[:-1] + ']'
x = json.loads(b)
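If you would rather avoid string surgery on '}{' altogether, the standard library's json.JSONDecoder.raw_decode can walk a string of back-to-back objects directly. A sketch under the same file layout as above (the output file name is just illustrative):
import json

def iter_concatenated_json(text):
    """Yield each top-level JSON object from a string of objects
    written back-to-back with no separator."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
        # skip any whitespace between objects, just in case
        while pos < len(text) and text[pos].isspace():
            pos += 1

with open('111111111.txt', mode='r', encoding='utf8') as f:
    merged = {}
    for chunk in iter_concatenated_json(f.read()):
        for key, value in chunk.items():
            merged.setdefault(key, value)   # first occurrence wins

with open('deduplicated.json', mode='w', encoding='utf8') as out:
    json.dump(merged, out)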
Previously, I had been cleaning out data using the code snippet below
import unicodedata, re, io

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))

def rm_control_chars(s):  # see http://www.unicode.org/reports/tr44/#General_Category_Values
    return cc_re.sub('', s)

cleanfile = []
with io.open('filename.txt', 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars(line)
        cleanfile.append(line)
There are newline characters in the file that I want to keep.
The following records the time taken for cc_re.sub('', s) to substitute the first few lines (1st column is the time taken and 2nd column is len(s)):
0.275146961212 251
0.672796010971 614
0.178567171097 163
0.200030088425 180
0.236430883408 215
0.343492984772 313
0.317672967911 290
0.160616159439 142
0.0732028484344 65
0.533437013626 468
0.260229110718 236
0.231380939484 204
0.197766065598 181
0.283867120743 258
0.229172945023 208
As #ashwinichaudhary suggested, I switched to s.translate(dict.fromkeys(control_chars)); the timing log for the same lines is:
0.464188098907 252
0.366552114487 615
0.407374858856 164
0.322507858276 181
0.35142993927 216
0.319973945618 314
0.324357032776 291
0.371646165848 143
0.354818105698 66
0.351796150208 469
0.388131856918 237
0.374715805054 205
0.363368988037 182
0.425950050354 259
0.382766962051 209
But the code is really slow for my 1 GB of text. Is there any other way to clean out control characters?
I found a solution working character by character. I benchmarked it using a 100K file:
import unicodedata, re, io
from time import time

# This is to randomly generate a file to test the script
from string import lowercase
from random import random

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = [c for c in all_chars if unicodedata.category(c)[0] == 'C']
chars = (list(u'%s' % lowercase) * 115117) + control_chars

fnam = 'filename.txt'
out = io.open(fnam, 'w')
for line in range(1000000):
    out.write(u''.join(chars[int(random() * len(chars))] for _ in range(600)) + u'\n')
out.close()

# version proposed by alvas
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))

def rm_control_chars(s):
    return cc_re.sub('', s)

t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars(line)
        cleanfile.append(line)
out = io.open(fnam + '_out1.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0

# using a set and checking character by character
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = set(c for c in all_chars if unicodedata.category(c)[0] == 'C')

def rm_control_chars_1(s):
    return ''.join(c for c in s if not c in control_chars)

t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars_1(line)
        cleanfile.append(line)
out = io.open(fnam + '_out2.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0
The output is:
114.625444174
0.0149750709534
I tried it on a 1 GB file (only the second version) and it took 186 s.
I also wrote this other version of the same script, which is slightly faster (176 s) and more memory efficient (for very large files that do not fit in RAM):
t0 = time()
out = io.open(fnam + '_out5.txt', 'w')
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        out.write(rm_control_chars_1(line))
out.close()
print time() - t0
Since in UTF-8 the ASCII control characters are encoded in 1 byte (compatible with ASCII) and are below 32, I suggest this fast piece of code:
#!/usr/bin/python
import sys

ctrl_chars = [x for x in range(0, 32) if x not in (ord("\r"), ord("\n"), ord("\t"))]
filename = sys.argv[1]

with open(filename, 'rb') as f1:
    with open(filename + '.txt', 'wb') as f2:
        b = f1.read(1)
        while b != '':
            if ord(b) not in ctrl_chars:
                f2.write(b)
            b = f1.read(1)
Is that good enough?
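A variation on the same idea that avoids the one-byte-at-a-time loop: str.translate with a delete table does the filtering in C. This is a sketch for Python 2, as used elsewhere in the thread (under Python 3 the read sentinel would be b'' and the delete table a bytes object):
#!/usr/bin/python
import sys

# Bytes below 32 except \r, \n and \t, the same set as the loop above.
delete_chars = ''.join(chr(x) for x in range(32)
                       if x not in (ord("\r"), ord("\n"), ord("\t")))

filename = sys.argv[1]
with open(filename, 'rb') as f_in, open(filename + '.txt', 'wb') as f_out:
    # translate() with a delete table filters a whole chunk at once
    # instead of doing one Python-level loop iteration per byte.
    for chunk in iter(lambda: f_in.read(1 << 20), ''):
        f_out.write(chunk.translate(None, delete_chars))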
Does this have to be in Python? How about cleaning the file before you read it into Python to start with? Use sed, which will treat it line by line anyway.
See removing control characters using sed.
And if you pipe it out to another file, you can open that. I don't know how fast it would be, though. You can do it in a shell script and test it; according to this page, sed processes about 82M characters per second.
Hope it helps.
If you want it to move really fast, break your input into multiple chunks, wrap up that data-munging code as a function, and use Python's multiprocessing package to parallelize it, writing to some common text file. Going character by character is the easiest method to crunch stuff like this, but it always takes a while.
https://docs.python.org/3/library/multiprocessing.html
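A rough sketch of that idea, reusing the rm_control_chars_1 function defined in an earlier answer; the chunk size and file names are placeholders:
import io
import multiprocessing as mp

def clean_chunk(lines):
    # rm_control_chars_1 is the set-based filter from the earlier answer
    return ''.join(rm_control_chars_1(line) for line in lines)

if __name__ == '__main__':
    chunk_size = 100000  # lines per task; tune for your machine
    with io.open('filename.txt', 'r', encoding='utf8') as fin:
        lines = fin.readlines()
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

    pool = mp.Pool()
    results = pool.map(clean_chunk, chunks)
    pool.close()
    pool.join()

    with io.open('filename_clean.txt', 'w', encoding='utf8') as out:
        out.writelines(results)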
I'm surprised no one has mentioned mmap which might just be the right fit here.
Note: I'll put this in as an answer in case it's useful and apologize that I don't have the time to actually test and compare it right now.
You load the file into memory (kind of) and then you can actually run a re.sub() over the object. This helps eliminate the IO bottleneck and allows you to change the bytes in-place before writing it back at once.
After this, then, you can experiment with str.translate() vs re.sub() and also include any further optimisations like double buffering CPU and IO or using multiple CPU cores/threads.
But it'll look something like this:
import mmap
f = open('test.out', 'r')
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
A nice excerpt from the mmap documentation is:
..You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they’re mutable, you can change a single character by doing obj[index] = 'a',..
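To make that concrete, here is one untested way to finish the sketch: feed the mapped bytes to a compiled bytes regex and write the cleaned result to a new file. The pattern below is an assumption (C0 controls minus \t, \n and \r, plus DEL), and the file names are illustrative:
import mmap
import re

# C0 control bytes except \t (\x09), \n (\x0a) and \r (\x0d), plus DEL.
ctrl_re = re.compile(br'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]+')

with open('test.out', 'rb') as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # re works on the memory-mapped bytes directly; sub() returns a
        # new bytes object with the control characters stripped out.
        cleaned = ctrl_re.sub(b'', m)
    finally:
        m.close()

with open('test_clean.out', 'wb') as out:
    out.write(cleaned)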
A couple of things I would try.
First, do the substitution with a replace all regex.
Second, set up a regex character class with known control-character ranges instead of a class of individual control characters. (This is in case the engine doesn't optimize it into ranges: a range requires two conditionals at the assembly level, as opposed to an individual conditional for each character in the class.)
Third, since you are removing the characters, add a greedy quantifier after the class. This removes the need to enter the substitution subroutine after each single-character match, instead grabbing all adjacent characters as needed.
I don't know Python's syntax for regex constructs off the top of my head, nor all the control codes in Unicode, but the result would look something like this:
[\u0000-\u0009\u000B\u000C\u000E-\u001F\u007F]+
The largest amount of time would be spent copying the results to another string. The smallest amount of time would be spent finding all the control codes, which would be minuscule.
All things being equal, the regex (as described above) is the fastest way to go.
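For reference, a minimal Python version of that approach might look like the following (using the character class above verbatim; whether it beats the set-based version is something you would have to benchmark on your own data):
import io
import re

# The character class from above, with a greedy + so runs of control
# characters are removed in a single match. A non-raw unicode literal is
# used so Python expands the \u escapes before re sees them (Python 2's
# re does not understand \uXXXX escapes in patterns).
cc_run_re = re.compile(u'[\u0000-\u0009\u000B\u000C\u000E-\u001F\u007F]+')

with io.open('filename.txt', 'r', encoding='utf8') as fin, \
     io.open('filename_out.txt', 'w', encoding='utf8') as fout:
    for line in fin:
        fout.write(cc_run_re.sub(u'', line))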
I have code in Python to index a text file that contains Arabic words. I tested the code on an English text and it works well, but it gives me an error when I test an Arabic one.
Note: the text file is saved in Unicode encoding, not in ANSI encoding.
This is my code:
from whoosh import fields, index
import os.path
import csv
import codecs
from whoosh.qparser import QueryParser

# This list associates a name with each position in a row
columns = ["juza", "chapter", "verse", "voc"]

schema = fields.Schema(juza=fields.NUMERIC,
                       chapter=fields.NUMERIC,
                       verse=fields.NUMERIC,
                       voc=fields.TEXT)

# Create the Whoosh index
indexname = "indexdir"
if not os.path.exists(indexname):
    os.mkdir(indexname)
ix = index.create_in(indexname, schema)

# Open a writer for the index
with ix.writer() as writer:
    with open("h.txt", 'r') as txtfile:
        lines = txtfile.readlines()
        # Read each row in the file
        for i in lines:
            # Create a dictionary to hold the document values for this row
            doc = {}
            thisline = i.split()
            u = 0
            # Read the values for the row enumerated like
            # (0, "juza"), (1, "chapter"), etc.
            for w in thisline:
                # Get the field name from the "columns" list
                fieldname = columns[u]
                u += 1
                #if isinstance(w, basestring):
                #    w = unicode(w)
                doc[fieldname] = w
            # Pass the dictionary to the add_document method
            writer.add_document(**doc)

with ix.searcher() as searcher:
    query = QueryParser("voc", ix.schema).parse(u"بسم")
    results = searcher.search(query)
    print(len(results))
    print(results[1])
Then the error is:
Traceback (most recent call last):
File "C:\Python27\yarab.py", line 38, in <module>
fieldname = columns[u]
IndexError: list index out of range
This is a sample of the file:
1 1 1 كتاب
1 1 2 قرأ
1 1 3 لعب
1 1 4 كتاب
While I cannot see anything obviously wrong with that, I would make sure you're designing for errors. Make sure you catch any situation where split() returns more than the expected number of elements and handle it promptly (e.g. print and terminate). It looks like you might be dealing with ill-formatted data.
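For example, a small sketch of that guard applied to the inner loop of your script (here the bad row is skipped with a message, but you could also raise and stop):
# Replaces the inner counter-based loop from the question: skip any row
# that does not have exactly one value per column, instead of letting
# columns[u] run off the end of the list.
for i in lines:
    thisline = i.split()
    if len(thisline) != len(columns):
        print("skipping ill-formatted row: %r" % i)
        continue
    doc = dict(zip(columns, thisline))
    writer.add_document(**doc)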
You missed the encoding header in your script. The first line should be:
# -*- coding: utf-8 -*-
Also, to open a file with Unicode encoding, use:
import codecs
with codecs.open("s.txt", encoding='utf-8') as txtfile:
I am trying to show correlation between two individual lists. Before installing Numpy, I parsed World Bank data for GDP values and the number of internet users and stored them in two separate lists. Here is the snippet of code. This is just for gdp07. I actually have more lists for more years and other data such as unemployment.
import numpy as np

file = open('final_gdpnum.txt', 'r')
gdp07 = []
for line in file:
    fields = line.strip().split()
    gdp07.append(fields[0])

file2 = open('internetnum.txt', 'r')
netnum07 = []
for line in file2:
    fields2 = line.strip().split()
    netnum07.append(fields2[0])

print np.correlate(gdp07, netnum07, "full")
The error I get is this:
Traceback (most recent call last):
  File "Project3.py", line 83, in <module>
    print np.correlate(gdp07, netnum07, "full")
  File "/usr/lib/python2.6/site-packages/numpy/core/numeric.py", line 645, in correlate
    return multiarray.correlate2(a,v,mode)
ValueError: data type must provide an itemsize
Just for the record, I am using Cygwin with Python 2.6 on a Windows computer. I am only using Numpy along with its dependencies and other parts of its build (gcc compiler). Any help would be great. Thx
Perhaps that is the error you get when you try to pass in the data as strings, since according to the Python docs strip() returns a string:
http://docs.python.org/library/stdtypes.html
Try parsing the data to whatever numeric type you want.
As you can see here
In [14]:np.correlate(["3", "2","1"], [0, 1, 0.5])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/home/dog/<ipython-input-14-a0b588b9af44> in <module>()
----> 1 np.correlate(["3", "2","1"], [0, 1, 0.5])
/usr/lib64/python2.7/site-packages/numpy/core/numeric.pyc in correlate(a, v, mode, old_behavior)
643 return multiarray.correlate(a,v,mode)
644 else:
--> 645 return multiarray.correlate2(a,v,mode)
646
647 def convolve(a,v,mode='full'):
ValueError: data type must provide an itemsize
Try parsing the values:
In [15]: np.correlate([int("3"), int("2"),int("1")], [0, 1, 0.5])
Out[15]: array([ 2.5])
import numpy as np

file = open('final_gdpnum.txt', 'r')
gdp07 = []
for line in file:
    fields = line.strip().split()
    gdp07.append(int(fields[0]))

file2 = open('internetnum.txt', 'r')
netnum07 = []
for line in file2:
    fields2 = line.strip().split()
    netnum07.append(int(fields2[0]))

print np.correlate(gdp07, netnum07, "full")
Your other error is a character encoding problem.
I hope this works; I don't think I can reproduce it, since I have a Linux box that supports UTF-8 by default.
I went by the IPython help(codecs) documentation and this guide:
http://code.google.com/edu/languages/google-python-class/dict-files.html
import codecs

f = codecs.open(file, "r", codecs.BOM_UTF8)
for line in f:
    fields = line.strip().split()
    gdp07.append(int(fields[0]))
Try casting the data to float type. It works for me!
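For example, a sketch of the parsing loops from the question with the cast added (float rather than int, since GDP values are unlikely to be whole numbers):
import numpy as np

gdp07 = []
with open('final_gdpnum.txt', 'r') as f:
    for line in f:
        # float() handles values like "1234.56"; int() would fail on them
        gdp07.append(float(line.strip().split()[0]))

netnum07 = []
with open('internetnum.txt', 'r') as f2:
    for line in f2:
        netnum07.append(float(line.strip().split()[0]))

print np.correlate(gdp07, netnum07, "full")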
Basically I have been having real fun with this today. I have this data file called test.csv which is encoded as UTF-8:
"Nguyễn", 0.500
"Trần", 0.250
"Lê", 0.250
Now I am attempting to read it with the code below, and it displays all funny, like this: Trần
I have gone through the Python 2.6 docs (2.6 is the version I use) and I can't get the wrapper to work, along with all the ideas on the internet, which I am assuming are all very correct but just not being applied properly by yours truly. On the plus side, I have learnt that not all fonts will display those characters correctly anyway, something I hadn't even thought of previously, and I have learned a lot about Unicode etc., so it certainly was not wasted time.
If anyone could point out where I went wrong I would be most grateful.
Here is the code, updated as per the request below, which returns this error:
Traceback (most recent call last):
  File "surname_generator.py", line 39, in <module>
    probfamilynames = [(familyname,float(prob)) for familyname,prob in unicode_csv_reader(open(familynamelist))]
  File "surname_generator.py", line 27, in unicode_csv_reader
    for row in csv_reader:
  File "surname_generator.py", line 33, in utf_8_encoder
    yield line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
from random import random
import csv

class ChooseFamilyName(object):
    def __init__(self, probs):
        self._total_prob = 0.
        self._familyname_levels = []
        for familyname, prob in probs:
            self._total_prob += prob
            self._familyname_levels.append((self._total_prob, familyname))
        return

    def pickfamilyname(self):
        pickfamilyname = self._total_prob * random()
        for level, familyname in self._familyname_levels:
            if level >= pickfamilyname:
                return familyname
        print "pickfamilyname error"
        return

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

familynamelist = 'familyname_vietnam.csv'

a = 0
while a < 10:
    a = a + 1
    probfamilynames = [(familyname, float(prob)) for familyname, prob in unicode_csv_reader(open(familynamelist))]
    familynamepicker = ChooseFamilyName(probfamilynames)
    print(familynamepicker.pickfamilyname())
unicode_csv_reader(open(familynamelist)) is trying to pass non-unicode data (byte strings with utf-8 encoding) to a function you wrote expecting unicode data. You could solve the problem with codecs.open (from the standard library module codecs), but that's too roundabout: the codecs would be doing utf8->unicode for you, then your code would be doing unicode->utf8, so what's the point?
Instead, define a function more like this one...:
def encoded_csv_reader_to_unicode(encoded_csv_data,
                                  coding='utf-8',
                                  dialect=csv.excel,
                                  **kwargs):
    csv_reader = csv.reader(encoded_csv_data,
                            dialect=dialect,
                            **kwargs)
    for row in csv_reader:
        yield [unicode(cell, coding) for cell in row]
and use encoded_csv_reader_to_unicode(open(familynamelist)).
Your current problem is that you have been given a bum steer with the unicode_csv_reader thingy. As the name suggests, and as the documentation states explicitly:
"""(unicode_csv_reader() below is a generator that wraps csv.reader to handle Unicode CSV data (a list of Unicode strings). """
You don't have unicode strings, you have str strings encoded in UTF-8.
Suggestion: blow away the unicode_csv_reader stuff. Get each row plainly and simply as though it were encoded in ASCII. Then convert each row to unicode:
unicode_row = [field.decode('utf8') for field in str_row]
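Putting that suggestion together with the rest of your script, the reading step might look something like this (the helper name is mine; decoding with 'utf-8-sig' also drops the UTF-8 BOM that the 0xef byte in your traceback suggests is at the start of the file):
import csv

def read_probfamilynames(path):
    """Read (familyname, probability) pairs from the UTF-8 encoded CSV."""
    probs = []
    with open(path, 'rb') as f:                 # the csv module wants bytes in Python 2
        for str_row in csv.reader(f, skipinitialspace=True):
            # decode each cell; 'utf-8-sig' strips a leading BOM if present
            unicode_row = [field.decode('utf-8-sig') for field in str_row]
            familyname, prob = unicode_row
            probs.append((familyname, float(prob)))
    return probs

probfamilynames = read_probfamilynames(familynamelist)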
Getting back to your original problem:
(1) To get help with fonts etc, you need to say what platform you are running on and what software you are using to display the unicode strings.
(2) If you want platform-independent ways of inspecting your data, look at the repr() built-in function, and the name function in the unicodedata module.
There's the unicode_csv_reader demo in the Python docs:
http://docs.python.org/library/csv.html