I'm stuck on what should be a very simple import operation in MongoDB. I have a 200MB file in JSON format. It's a feeds dump, with the format: {"some-headers":"", "dump":[{"item-id":"item-1"},{"item-id":"item-2"},...]}
The JSON feed also contains words in languages other than English, such as Chinese and Japanese characters.
I tried a plain mongoimport --db testdb --collection testcollection --file dump.json, but possibly because the data is a bit complex, it treats dump as a single field, and the import fails with an error due to the 4MB document size limit.
I then tried a Python script:
import simplejson
import pymongo
conn = pymongo.Connection("localhost",27017)
db = conn.testdb
c = db.testcollection
o = open("dump.json")
s = simplejson.load(o)
for x in s['dump']:
    c.insert(x)
o.close()
The Python process gets killed while running this, possibly because of the very limited resources I'm working with.
I reduced the file size by getting a new JSON dump of 50MB, but now Python is giving me trouble again because of ASCII/encoding issues.
I am looking for options both ways: with mongoimport and with the Python script above. Any further solutions would also be greatly appreciated.
Also, the JSON dump might some day reach several GB, so if there is some other solution I should consider at that scale, please do highlight it.
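For reference, a rough sketch of a streaming variant of the script above (not the original approach), assuming the third-party ijson package and a current PyMongo (MongoClient) are acceptable; it never holds the whole dump in memory:
# Sketch only: stream the "dump" array with ijson instead of loading the whole
# file, and insert in batches. Assumes the ijson package and a modern PyMongo.
import ijson
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
coll = client.testdb.testcollection

batch = []
with open("dump.json", "rb") as f:
    for item in ijson.items(f, "dump.item"):  # yields each element of the "dump" array
        batch.append(item)
        if len(batch) >= 1000:
            coll.insert_many(batch)
            batch = []
if batch:
    coll.insert_many(batch)  # flush the final partial batch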
I am trying to read a JSON file of size ~12GB, downloaded from the AMiner Citation Network dataset V12. This is a sample, where I removed a couple of fields that make the JSON too long.
[{"id":1091,"authors":[{"name":"Makoto Satoh","org":"Shinshu University","id":2312688602},{"name":"Ryo Muramatsu","org":"Shinshu University","id":2482909946},{"name":"Mizue Kayama","org":"Shinshu University","id":2128134587},{"name":"Kazunori Itoh","org":"Shinshu University","id":2101782692},{"name":"Masami Hashimoto","org":"Shinshu University","id":2114054191},{"name":"Makoto Otani","org":"Shinshu University","id":1989208940},{"name":"Michio Shimizu","org":"Nagano Prefectural College","id":2134989941},{"name":"Masahiko Sugimoto","org":"Takushoku University, Hokkaido Junior College","id":2307479915}],"title":"Preliminary Design of a Network Protocol Learning Tool Based on the Comprehension of High School Students: Design by an Empirical Study Using a Simple Mind Map","year":2013,"n_citation":1,"page_start":"89","page_end":"93","doc_type":"Conference","publisher":"Springer, Berlin, Heidelberg","volume":"","issue":"","doi":"10.1007/978-3-642-39476-8_19","references":[2005687710,2018037215],"fos":[{"name":"Telecommunications network","w":0.45139},{"name":"Computer science","w":0.45245},{"name":"Mind map","w":0.5347},{"name":"Humanâcomputer interaction","w":0.47011},{"name":"Multimedia","w":0.46629},{"name":"Empirical research","w":0.49737},{"name":"Comprehension","w":0.47042},{"name":"Communications protocol","w":0.51907}],"venue":{"raw":"International Conference on Human-Computer Interaction","id":1127419992,"type":"C"}}
,{"id":1388,"authors":[{"name":"Pranava K. Jha","id":2718958994}],"title":"Further Results on Independence in Direct-Product Graphs.","year":2000,"n_citation":1,"page_start":"","page_end":"","doc_type":"Journal","publisher":"","volume":"56","issue":"","doi":"","fos":[{"name":"Graph","w":0.0},{"name":"Discrete mathematics","w":0.45872},{"name":"Combinatorics","w":0.4515},{"name":"Direct product","w":0.59104},{"name":"Mathematics","w":0.42784}],"venue":{"raw":"Ars Combinatoria","id":73158690,"type":"J"}}
,{"id":1674,"authors":[{"name":"G. Beale","org":"Archaeological Computing Research Group, University of Southampton, UK#TAB#","id":2103626414},{"name":"G. Earl","org":"Archaeological Computing Research Group, University of Southampton, UK#TAB#","id":2117665592}],"title":"A methodology for the physically accurate visualisation of roman polychrome statuary","year":2011,"n_citation":1,"page_start":"137","page_end":"144","doc_type":"Conference","publisher":"Eurographics Association","volume":"","issue":"","doi":"10.2312/VAST/VAST11/137-144","references":[1535888970,1992876689,1993710814,2035653341,2043970887,2087778487,2094478046,2100468662,2110331535,2120448006,2138624212,2149970020,2150266006,2296384428,2403283736],"fos":[{"name":"Statue","w":0.40216},{"name":"Engineering drawing","w":0.43427},{"name":"Virtual reconstruction","w":0.0},{"name":"Computer science","w":0.42062},{"name":"Visualization","w":0.4595},{"name":"Polychrome","w":0.4474},{"name":"Artificial intelligence","w":0.40496}],"venue":{"raw":"International Conference on Virtual Reality","id":2754954274,"type":"C"}}]
When I try to read the file with Python Dask (I cannot open it like any other file since it's too big and I get a memory limit error)
import json
import dask.bag as db

if __name__ == '__main__':
    b = db.read_text('dblp.v12.json').map(json.loads)
    print(b.take(4))
I get the following error:
JSONDecodeError: Expecting ',' delimiter
I checked the sample above in an online validator and it passes. So I guess it's not an error in the JSON, but something about Dask and how I should read the file.
It seems like dask is passing each line of your file separately to json.loads. I've removed the line breaks from your sample and was able to load the data.
Of course, doing that for your entire file and sending a single JSON object to json.loads would defeat the purpose of using dask.
One possible solution (though I'm not sure how scalable it is for very large files) is to use jq to convert your JSON file into JSON lines -- by writing each element of your root JSON array into a single line in a file:
jq -c '.[]' your.json > your.jsonl
Then you can load it with dask:
import dask.bag as db
import json
json_file = "your.jsonl"
b = db.read_text(json_file).map(json.loads)
print(b.take(4))
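Once the bag is loaded this way, further processing stays lazy and runs in parallel. A small follow-up sketch, using field names from the sample above:
# Operate on the bag lazily before pulling results back
titles = b.pluck("title")                                # extract one field per record
recent = b.filter(lambda r: r.get("year", 0) >= 2010)    # keep only newer papers
print(titles.take(4))
print(recent.count().compute())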
Currently, I am using requests to grab an Avro file from a database and storing the data in requests.text. The file is separated into the schema and the data. How do I merge the schema and data in memory into readable/usable records?
requests.text brings the data down as Unicode, separated with the schema first and the data second. I have been able to use string manipulation to grab just the schema part of the Unicode and set that as a schema variable; however, I am unsure how to handle the data section. I tried encoding the data to UTF-8 and passing it as raw_bytes in my code, with no luck.
#the request text is too large, so I am shortening it down
r.text = u'Obj\x01\x04\x14avro.codec\x08null\x16avro.schema\u02c6\xfa\x05{"namespace": "namespace", "type": "record", "fields" : [{"type": ["float", "null"], "default": " ", "name": "pvib_z_crest_factor"}],
#repeat for x amount of fields
"name": "Telemetry"}\x00\u201d \xe0B\x1a\u2030=\xc0\u01782\n.\u015e\x049\xaa\x12\xf6\u2030\x02\x00\u0131\u201a];\x02\x02\x02\x00\xed\r>;\x02\x02\x00\x01\x02\x00\x00\x02\x00\x00\x00\x00\x00\x02\x02\x00\x00\x00\x1aC\x00\x00\x00\x02C\x02\x00:\x00#2019-02-27 16:38:39.530263-05:00\x02\x02\x00\xaeGa=\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf8\x04\x02\x00\x00\x00\x00\x00\x00\x00\x02\x02\x02\x02\x00\xac\xc5\'7\x00\x00\x00\xe9B\x02\x00\x00\x00\x00\x00\x00\x0e-r#\x00\x00\x00\x00\x00\x02\x02\x00\xfa\xc0\xf5A\x00\x00\x00\xc0#\x00\x00\x00\x00\x02\x00\x02\xc9\xebB\x00\x00\x00\x00\x00\x00\xaa\ufffd\'\x02\x00\x02\xc9\xebB\x02\x02\x00\x00\x00\x00\x00\x02\x00\ufffd\xc2u=\x02\x00\xfc\x18\xd3>\x02\x02\x00\\\ufffdB>\x02\x02\x001\x08,=\x02\x00\x00\x02\x02\x00\x000oE\x00sh!A\x02\x00\x00\xc0uE\x02\x00\xf6(tA\x00\x00\x00\x00\x00\x00-\xb2\ufffd=\x02\x00\x1c \xd1B\x02\x02\x00#2019-02-27 16:38:39.529977-05:00\x02\x00\x080894\x00\u011f\xa7\xc6=\x00\x00\x02\x02\x02\x02\x02\x02\x00\x00\x00\xe0A\x02\x00\x00\x00\u011eA\x00\x00\x00\xb8A\x00\xc3\xf5\xc0#\x00\xd5x\xe9=\x02\x00\x00\x00q=VA\x02\x00\x00\x000B\x02\x00ZV\xfaE\x02\x02\x02\x02\x00\x00\x00!C\x02\x00\x00\x00#C\x00\x00\x00)C\x00\x00\x02\x00\x00\x00\u20ac?\x00\x00\x02\x02\x02\x02\x02\x00\xf8\x04\x02\x00\x00\x00\x00\x00\x02\x00\x00\x00\u20ac?\x00\x02W\x00ff6A\x00\x00\x00\x00\x00\x02\x00\xcc&\x10L\x00\x00\xf7\x7fG\x02\x02\x02\x00\x00\x00\x00\x00\x02\x02\x02\x00\x00\u20ac\xacC\x02\x02\x02\x00\x1c~%A\x00\x1c \xd1B\x00\x01\x02\x02\x02\x00\xfa\xc0\xf5A\x02\x02\x02\x02\x02\x00\x00\x000B\x00\x00\x00\x00\x00\x00\x00\x00?C\x00\xf4-\x1fE\x00\x00\x00\x00\x00\x00\x00\u0131\x7fG\x00\x00\u015f\x7fG\x00\x00\u0131\x7fG\x00\x00\x00\x0bC\x00#2019-05-31 13:00:25.931949+00:00\x00#2019-05-31 09:00:25.931967-04:00\x00\x00\x00\xe0A\x00h\xe8\u0178:\x00=\n%C\x00\x00\x00\x07C\x02\x00\x00\x00\xe0#\x00\x01\x02\x00\x00\x02\x02\x00\x00\u011e\u2020F\x02\x00\x00\u20acDE\x00\xcd\xcc\xcc=\x00#2019-02-27 16:38:39.529620-05:00\x02\x00\x00\x00\xc8B\x00\x00\x00\x06C\x02\x00\x01\x004\u20ac7:\x00\x00\x000B\x02\x02\x02\x02\x02\x02\x0033CA\x02\x00L7\t>\x02\x02\x00\xae\xc7\xa7B\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x02\x02\x00\x00\x00pB\x00\x00\x00`B\x00\x00\x02\x00\x00\x00...
#continues on, too big to put the rest of (feel free to ask questions to see more)
I expect the file in memory to be deserialized into readable data; however, I keep getting errors such as list index out of range or cannot access branch index x.
Thank you for reading
EDIT(6/5/19):
I managed to download the Avro file using Azure Storage Explorer on another device. From there, I ran the following code:
import avro.schema
from avro.io import DatumReader, DatumWriter
from avro.datafile import DataFileReader, DataFileWriter
avro_file = DataFileReader(open("Destination/to/file.avro", "rb"), DatumReader())
data = [x for x in avro_file]
for i in range(len(data)):
    print(len(data))
    print(data[i])
(NOTE: the computer I ran this code on uses Python 3.7, but there are no real syntax differences between the two Python versions.)
This code runs smoothly and shows the data in the appropriate places.
However, I cannot simply pass the same data I'm receiving from the request as an argument to DataFileReader (stating the obvious, but I'm guessing it has something to do with the file being opened with "rb" while request.text is Unicode). Is there any way to modify that request.text so I can pass it as an argument to DataFileReader (replacing open(file, "rb"))?
You want r.content (raw bytes), not r.text (decoded text).
I also think you'll want to try io.BytesIO, which can be used like a file object:
import io
import requests
from avro.datafile import DataFileReader
from avro.io import DatumReader

r = requests.get("http://example.com/file.avro")
inmemoryfile = io.BytesIO(r.content)   # wrap the raw bytes in a file-like object
reader = DataFileReader(inmemoryfile, DatumReader())
records = list(reader)
reader.close()
(code untested)
I have huge textual data from which I need to create a word cloud. I am using a Python library named word_cloud to create it, which is quite configurable. The problem is that my textual data is really huge, so even a high-end computer cannot complete the task, even after many hours.
The data is originally stored in MongoDB. Due to cursor issues while reading the data into a Python list, I exported all of it to a plain text file - simply a .txt file, which is 304 MB.
So the question I am looking to answer is: how can I handle this huge textual data? The word_cloud library needs a single string parameter containing the whole text, separated by ' ', in order to create the word cloud.
p.s. Python version: 3.7.1
p.s. word_cloud is an open source Word Cloud generator for Python which is available on GitHub: https://github.com/amueller/word_cloud
You don't need to load the whole file into memory.
from wordcloud import WordCloud
from collections import Counter
wc = WordCloud()
counts_all = Counter()
with open('path/to/file.txt', 'r') as f:
    for line in f:  # Here you can also use the Cursor
        counts_line = wc.process_text(line)
        counts_all.update(counts_line)
wc.generate_from_frequencies(counts_all)
wc.to_file('/tmp/wc.png')
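If you'd rather skip the intermediate text file, the same streaming pattern works directly on a MongoDB cursor; a rough sketch, where the connection, collection and text field names are hypothetical:
from collections import Counter
from wordcloud import WordCloud
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client.testdb.testcollection   # hypothetical database/collection names

wc = WordCloud()
counts_all = Counter()
# Stream documents from the cursor instead of materializing a Python list
for doc in collection.find({}, {"text": 1}):
    counts_all.update(wc.process_text(doc.get("text", "")))

wc.generate_from_frequencies(counts_all)
wc.to_file('/tmp/wc.png')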
I collected some tweets from the Twitter API and stored them in MongoDB. I exported the data to a JSON file without any issues, but when I tried to write a Python script to read the JSON and convert it to a CSV, I got this traceback error with my code:
json.decoder.JSONDecodeError: Extra data: line 367 column 1 (char 9745)
So, after digging around on the internet, I was pointed toward checking the actual JSON data in an online validator, which I did. It gave me this error:
Multiple JSON root elements
from the site https://jsonformatter.curiousconcept.com/
Now, the problem is that I haven't found anything on the internet about how to handle that error. I'm not sure whether it's a problem with the data I've collected or exported, or whether I just don't know how to work with it.
My end game with these tweets is to make a network graph. I was looking at either Networkx or Gephi, which is why I'd like to get a csv file.
Robert Moskal is right. If you can address the issue at the source and use the --jsonArray flag with mongoexport, that will make the problem easier. If you can't address it at the source, then read the points below.
The code below will extract the individual JSON objects from the given file and convert them to Python dictionaries.
You can then apply your CSV logic to each individual dictionary.
If you are using the csv module, I would suggest the unicodecsv module instead, as it handles the Unicode data in your JSON objects.
import json

with open('path_to_your_json_file', 'r') as infile:
    json_block = []
    for line in infile:
        json_block.append(line)
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            json_block = []
            print(json_dict)
If you want to convert it to CSV using pandas, you can use the code below:
import json
import pandas as pd

with open('path_to_your_json_file', 'r') as infile:
    json_block = []
    dictlist = []
    for line in infile:
        json_block.append(line)
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            dictlist.append(json_dict)
            json_block = []

df = pd.DataFrame(dictlist)
df.to_csv('out.csv', encoding='utf-8')
If you want to flatten out the nested JSON objects, you can use the pandas.io.json.json_normalize() method.
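For example, a minimal sketch continuing from dictlist above (on newer pandas the same function is exposed as pd.json_normalize):
# Flatten nested keys such as "user" into dotted column names like "user.screen_name"
flat_df = pd.json_normalize(dictlist)
flat_df.to_csv('out_flat.csv', encoding='utf-8', index=False)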
Elaborating on #MYGz's suggestion to use --jsonArray:
Your post doesn't show how you exported the data from Mongo. If you use the following via the terminal, you will get valid JSON from MongoDB:
mongoexport --collection=somecollection --db=somedb --jsonArray --out=validfile.json
Replace somecollection, somedb and validfile.json with your target collection, target database, and desired output filename respectively.
The following: mongoexport --collection=somecollection --db=somedb --out=validfile.json ...will NOT give you the results you are looking for, because:
"By default mongoexport writes data using one JSON document for every MongoDB document." (Ref)
A bit of a late reply, and I am not sure this was available at the time the question was posted. Anyway, there is now a simple way to import the mongoexport JSON data:
df = pd.read_json(filename, lines=True)
mongoexport writes each document as a JSON object on its own line, rather than the whole file being a single JSON value.
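From there, getting the CSV the question asks for takes one more line; a small sketch with hypothetical filenames:
import pandas as pd

# Read mongoexport's JSON-lines output and write it back out as CSV
df = pd.read_json("tweets.json", lines=True)
df.to_csv("tweets.csv", index=False, encoding="utf-8")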
Question 1 of 2
I'm trying to import data from a CSV file into Vertica using Python, with Uber's vertica-python package. The problem is that whitespace-only data elements are being loaded into Vertica as NULLs; I want only empty data elements to be loaded as NULLs, and non-empty whitespace data elements to be loaded as whitespace instead.
For example, the following two rows of a CSV file are both loaded into the database as ('1','abc',NULL,NULL), whereas I want the second one to be loaded as ('1','abc',' ',NULL).
1,abc,,^M
1,abc, ,^M
Here is the code:
# import vertica-python package by Uber
# source: https://github.com/uber/vertica-python
import vertica_python
import csv
# write CSV file
filename = 'temp.csv'
data = <list of lists, e.g. [[1,'abc',None,'def'],[2,'b','c','d']]>
with open(filename, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, escapechar='\\', doublequote=False)
    writer.writerows(data)
# define query
q = "copy <table_name> (<column_names>) from stdin "\
"delimiter ',' "\
"enclosed by '\"' "\
"record terminator E'\\r' "
# copy data
conn = vertica_python.connect(host=<host>,
                              port=<port>,
                              user=<user>,
                              password=<password>,
                              database=<database>,
                              charset='utf8')
cur = conn.cursor()
with open(filename, 'rb') as f:
    cur.copy(q, f)
conn.close()
Question 2 of 2
Are there any other issues (e.g. character encoding) I have to watch out for when using this method of loading data into Vertica? Are there any other mistakes in the code? I'm not 100% convinced it will work on all platforms (it currently runs on Linux; there may be record terminator issues on other platforms, for example). Any recommendations to make this code more robust would be greatly appreciated.
In addition, are there alternative methods of bulk inserting data into Vertica from Python, such as loading objects directly from Python instead of having to write them to CSV files first, without sacrificing speed? The data volume is large and the insert job as is takes a couple of hours to run.
Thank you in advance for any help you can provide!
The COPY statement you have should behave the way you want with regard to the spaces. I tested it using a very similar COPY.
Edit: I missed what you were really asking about the copy; I'll leave this part in because it might still be useful for some people:
To fix the whitespace, you can change your copy statement:
copy <table_name> (FIELD1, FIELD2, MYFIELD3 FILLER VARCHAR(50), FIELD4, FIELD3 AS NVL(MYFIELD3,'')) from stdin
By using FILLER, the value is parsed into something like a variable, which you can then assign to your actual table field with AS later in the COPY.
As for any gotchas... I do what you have on Solaris often. The only thing I noticed is that you are setting the record terminator; I'm not sure that's really something you need to do, depending on the environment. I've never had to do it switching between Linux, Windows and Solaris.
Also, one hint: the copy will return a result set that tells you how many rows were loaded. Do a fetchone() and print it out and you'll see it.
The only other thing I can recommend is to use reject tables in case any rows are rejected.
You mentioned that it is a large job. You may need to increase your read timeout by adding 'read_timeout': 7200 (or more) to your connection options. I'm not sure whether None disables the read timeout or not.
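For example, mirroring the connect call from the question (the placeholders are kept as in the original, and the timeout value is only an illustration):
conn = vertica_python.connect(host=<host>,
                              port=<port>,
                              user=<user>,
                              password=<password>,
                              database=<database>,
                              charset='utf8',
                              read_timeout=7200)   # seconds; raise as needed for long loads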
As for a faster way... if the file is accessible directly on the Vertica node itself, you could reference it directly in the COPY instead of doing a copy from stdin, and have the daemon load it directly. It's much faster and opens up a number of optimizations. You could then use apportioned load, and if you have multiple files to load you can reference them all together in a list of files.
It's kind of a long topic, though. If you have any specific questions let me know.
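As a rough sketch of such a server-side load (table name, column list, path and reject-table name are hypothetical, and the file must already exist on the Vertica node):
# Hypothetical sketch: let the Vertica daemon read the file directly from its own disk
cur = conn.cursor()
cur.execute("""
    COPY my_table (col1, col2, col3)
    FROM '/data/temp.csv'
    DELIMITER ','
    ENCLOSED BY '"'
    REJECTED DATA AS TABLE my_table_rejects
""")
print(cur.fetchone())   # row count reported by the COPY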