Mongoexport exporting invalid json files - python

I collected some tweets from the Twitter API and stored them in MongoDB. Exporting the data to a JSON file went fine, but when I wrote a Python script to read the JSON and convert it to a CSV, I got this traceback error:
json.decoder.JSONDecodeError: Extra data: line 367 column 1 (char 9745)
After digging around the internet I was pointed to check the actual JSON data in an online validator, which I did. It reported:
Multiple JSON root elements
from the site https://jsonformatter.curiousconcept.com/
Here are pictures of the beginning and end of the 1st and 2nd objects in the file, and a link to the data here.
Now, the problem is, I haven't found anything on the internet about how to handle that error. I'm not sure if it's a problem with the data I've collected, with the export, or if I just don't know how to work with it.
My end goal with these tweets is to make a network graph. I was looking at either NetworkX or Gephi, which is why I'd like to get a CSV file.

Robert Moskal is right. If you can address the issue at the source and use the --jsonArray flag when you run mongoexport, that will make the problem easier. If you can't address it at the source, then read the points below.
The code below extracts the individual JSON objects from the given file and converts them to Python dictionaries.
You can then apply your CSV logic to each individual dictionary.
If you are using the csv module, I would suggest the unicodecsv module instead, as it handles the Unicode data in your JSON objects.
import json

with open('path_to_your_json_file', 'r') as infile:
    json_block = []
    for line in infile:
        json_block.append(line)
        # A line starting with '}' closes one exported document
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            json_block = []
            print(json_dict)
If you want to convert it to CSV using pandas, you can use the code below:
import json
import pandas as pd

with open('path_to_your_json_file', 'r') as infile:
    json_block = []
    dictlist = []
    for line in infile:
        json_block.append(line)
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            dictlist.append(json_dict)
            json_block = []

df = pd.DataFrame(dictlist)
df.to_csv('out.csv', encoding='utf-8')
If you want to flatten out the JSON objects, you can use the pandas.io.json.json_normalize() method.
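For example, a minimal sketch of json_normalize on tweet-shaped records; the field names below are only illustrative assumptions, not the actual Twitter schema:
import pandas as pd
from pandas.io.json import json_normalize  # pandas.json_normalize in newer pandas versions

# Hypothetical tweet-like records with a nested "user" object
records = [
    {"id": 1, "text": "hello", "user": {"screen_name": "alice", "followers_count": 10}},
    {"id": 2, "text": "world", "user": {"screen_name": "bob", "followers_count": 20}},
]

flat = json_normalize(records)  # nested keys become columns like "user.screen_name"
flat.to_csv('out_flat.csv', encoding='utf-8')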

Elaborating on #MYGz's suggestion to use --jsonArray:
Your post doesn't show how you exported the data from Mongo. If you use the following via the terminal, you will get valid JSON from MongoDB:
mongoexport --collection=somecollection --db=somedb --jsonArray --out=validfile.json
Replace somecollection, somedb and validfile.json with your target collection, target database, and desired output filename respectively.
The following: mongoexport --collection=somecollection --db=somedb --out=validfile.json will NOT give you the results you are looking for, because:
By default mongoexport writes data using one JSON document for every MongoDB document. (Ref)

A bit of a late reply, and I am not sure this was available at the time the question was posted. Anyway, there is now a simple way to import the mongoexport JSON data:
df = pd.read_json(filename, lines=True)
mongoexport writes each document as a JSON object on its own line, rather than the whole file being a single JSON value; lines=True tells pandas to read it that way.
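A minimal sketch of the full round trip; the filenames are just placeholders:
import pandas as pd

# 'tweets.json' stands in for whatever mongoexport produced (one JSON document per line)
df = pd.read_json('tweets.json', lines=True)
df.to_csv('tweets.csv', index=False, encoding='utf-8')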

Related

python: getting npm package data from a couchdb endpoint

I want to fetch the npm package metadata. I found this endpoint which gives me all the metadata needed.
My plan is to select some specific keys and add that data to a database (I could also store it in a JSON file, but the data is huge). I made the following script to fetch the data:
import requests
import json
import sys

db = 'https://replicate.npmjs.com'
r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs": "true"})
for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        print(line)
        decoded_line = line.decode('utf-8')
        print(json.loads(decoded_line))
Notice that I don't even include the docs themselves, but it still gets stuck in what looks like an infinite loop. I think this is because the data is huge.
A look at the head of the output from https://replicate.npmjs.com/_all_docs gives the following:
{"total_rows":1017703,"offset":0,"rows":[
{"id":"0","key":"0","value":{"rev":"1-5fbff37e48e1dd03ce6e7ffd17b98998"}},
{"id":"0-","key":"0-","value":{"rev":"1-420c8f16ec6584c7387b19ef401765a4"}},
{"id":"0----","key":"0----","value":{"rev":"1-55f4221814913f0e8f861b1aa42b02e4"}},
{"id":"0-1-project","key":"0-1-project","value":{"rev":"1-3cc19950252463c69a5e717d9f8f0f39"}},
{"id":"0-100","key":"0-100","value":{"rev":"1-c4f41a37883e1289f469d5de2a7b505a"}},
{"id":"0-24","key":"0-24","value":{"rev":"1-e595ec3444bc1039f10c062dd86912a2"}},
{"id":"0-60","key":"0-60","value":{"rev":"2-32c17752acfe363fa1be7dbd38212b0a"}},
{"id":"0-9","key":"0-9","value":{"rev":"1-898c1d89f7064e58f052ff492e94c753"}},
{"id":"0-_-0","key":"0-_-0","value":{"rev":"1-d47c142e9460c815c19c4ed3355d648d"}},
{"id":"0.","key":"0.","value":{"rev":"1-11c33605f2e3fd88b5416106fcdbb435"}},
{"id":"0.0","key":"0.0","value":{"rev":"1-5e541d4358c255cbcdba501f45a66e82"}},
{"id":"0.0.1","key":"0.0.1","value":{"rev":"1-ce856c27d0e16438a5849a97f8e9671d"}},
{"id":"0.0.168","key":"0.0.168","value":{"rev":"1-96ab3047e57ca1573405d0c89dd7f3f2"}},
{"id":"0.0.250","key":"0.0.250","value":{"rev":"1-c07ad0ffb7e2dc51bfeae2838b8d8bd6"}},
Notice that all the documents start from the second line (i.e. all the documents are part of the "rows" key's value). My question is how to get only the values of the "rows" key (i.e. all the documents). I found this repository for a similar purpose, but I can't use/convert it as I am a total beginner in JavaScript.
If stream=True is not among the arguments of get(), then all of the data will be downloaded into memory before the loop over the lines even starts.
Then there is the problem that the individual lines are not valid JSON on their own. You'll need an incremental JSON parser like ijson for this. ijson in turn wants a file-like object, which isn't easily obtained from a requests.Response, so I will use urllib from the Python standard library here:
#!/usr/bin/env python3
from urllib.request import urlopen

import ijson


def main():
    with urlopen('https://replicate.npmjs.com/_all_docs') as json_file:
        for row in ijson.items(json_file, 'rows.item'):
            print(row)


if __name__ == '__main__':
    main()
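If you prefer to stay with requests, a hedged sketch of the same idea: pass stream=True and hand the raw, file-like response body to ijson (this assumes the same ijson dependency as above):
import ijson
import requests

# stream=True keeps the body from being loaded into memory up front;
# r.raw is a file-like object that ijson can read incrementally
r = requests.get('https://replicate.npmjs.com/_all_docs', stream=True)
r.raw.decode_content = True  # transparently handle gzip/deflate transfer encoding
for row in ijson.items(r.raw, 'rows.item'):
    print(row)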
Is there a reason why you aren't decoding the json before iterating over the lines?
Can you try this:
import requests
import json

r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs": "true"})
decoded_r = r.content.decode('utf-8')
data = json.loads(decoded_r)
for row in data['rows']:
    print(row['key'])

Is there a way to quickly get rid of a lot of excess data with regex searches?

I'm trying to pull a few pieces of data for entry into a server. I've gotten the data from a web API, and it includes a lot of information that, to me, is garbage. I need to get rid of a ton of it, but I'm having trouble working out where to start. The data I need runs up until "abilities", and then starts again at "name":"Contherious". And here's that link. Most of my data processing so far has been attempts at regex searches, and the only distinguishing pattern I can think of is that the names I need, unlike the ones I don't, have a space and lead to an ID directly after them. I'm just unclear how to grab each and every one of these names, and any help would be appreciated.
I've tried
DMG_DONE_FILE = "rawDmgDoneData.txt"
out = []
with open(DMG_DONE_FILE, 'r') as f:
    line = f.readline()
    while line:
        regex_id = search('^+"name":"\s"+(\w+)+"id":', line)
        if regex_id:
            out.append(regex_id.group(1))
        line = f.readline()
and I get errors because I generally don't know what I'm doing with regex searches
import json

# use urllib/requests to fetch from the API;
# the example here reads from a local file for testing
with open('file.json', 'r') as f:
    entries = json.loads(f.read())
Now you have a data structure that you can address easily, e.g. entries['entries'][0]['name'].
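As a sketch of where this leads (the key names mirror the jq filter below and are assumptions about the API's response shape), you could then pick out just the fields you care about and write them to CSV:
import csv
import json

with open('file.json', 'r') as f:
    entries = json.load(f)

# Field names assumed from the jq example below; adjust them to the real API response
wanted = ['name', 'id', 'type', 'itemLevel', 'total']
with open('entries.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=wanted)
    writer.writeheader()
    for entry in entries['entries']:
        writer.writerow({k: entry.get(k) for k in wanted})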
Alternatively, using jq (https://stedolan.github.io/jq/):
cat file.json |jq '.entries[]| {name:.name,id:.id,type:.type,itemLevel:.itemLevel,icon:.icon,total:.total,activeTime:.activeTime,activeTimeReduced:.activeTimeReduced}'

Nltk json data loading error

I'm trying to load a JSON data file in order to analyze it using the nltk framework, but I get an AttributeError: 'list' object has no attribute 'keys'. I have tried deleting the "json" part at the end, since the documentation states that the data type is autodetected from the file extension. I also tried deleting the database assignment at the beginning, to no avail. Any ideas where I might be stumbling?
import json
import nltk
database = nltk.data.load("data.json", "json")
After hours of research, it turns out NLTK does not accept JSON files whose top-level element is a list rather than a dict. In order to access the data, the uppermost structure must be a dictionary with keys.
jsonfile = open('data.json')
jsonstr = jsonfile.read()
jdata = json.loads(jsonstr)[0]
This allows one to access the first element of the list, which contains a dictionary, just like every other element of the list. One solution is to separate the elements of the list and load the dicts one at a time, as shown below. I also suspect that setting sort_keys=True while encoding the JSON might make the uppermost structure a dictionary.
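A minimal sketch of that workaround; the file name and the 'text' key are placeholders:
import json

with open('data.json') as jsonfile:
    records = json.load(jsonfile)  # the top-level element is a list of dicts

# Process the dicts one at a time instead of handing the whole list to nltk.data.load
for record in records:
    print(record.get('text'))  # 'text' is a hypothetical key; use whatever your records contain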

How can I update a specific value on a custom configuration file?

Assuming I have a configuration txt file with this content:
{"Mode":"Classic","Encoding":"UTF-8","Colors":3,"Blue":80,"Red":90,"Green":160,"Shortcuts":[],"protocol":"2.1"}
How can I change a specific value, like "Red":90 to "Red":110, in the file without changing its original format?
I have tried configparser and configobj, but as they are designed for .INI files I couldn't figure out how to make them work with this custom config file. I also tried splitting the lines and searching for the keywords whose values I wanted to change, but I couldn't save the file back the same way it was before. Any ideas how to solve this? (I'm very new to Python.)
This looks like JSON, so you could:
import json

with open("/path/to/jsonfile", "r") as f:
    obj = json.load(f)

obj["Blue"] = 10

with open("/path/to/mynewfile", "w") as f:
    json.dump(obj, f)
But be aware that a JSON object does not have a guaranteed key order, so the order of the elements is not guaranteed when you write the file back (normally that doesn't matter); JSON lists do have an order, though.
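Since the question asks to keep the original single-line, no-spaces format, a small sketch of that: json.dump adds a space after each separator by default, so passing explicit separators reproduces the compact style (paths are placeholders, and on Python 3.7+ the key order is kept as read):
import json

with open('config.txt', 'r') as f:
    cfg = json.load(f)  # on Python 3.7+ the dict keeps the keys in file order

cfg['Red'] = 110

with open('config.txt', 'w') as f:
    # separators=(',', ':') reproduces the compact "key":value style of the original file
    json.dump(cfg, f, separators=(',', ':'))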
Here's how you can do it:
import json

d = {}  # store your data here
with open('config.txt', 'r') as f:
    d = json.loads(f.readline())

d['Red'] = 14
d['Green'] = 15
d['Blue'] = 20

result = "{\"Mode\":\"%s\",\"Encoding\":\"%s\",\"Colors\":%s,\
\"Blue\":%s,\"Red\":%s,\"Green\":%s,\"Shortcuts\":%s,\
\"protocol\":\"%s\"}" % (d['Mode'], d['Encoding'], d['Colors'],
                         d['Blue'], d['Red'], d['Green'],
                         d['Shortcuts'], d['protocol'])

with open('config.txt', 'w') as f:
    f.write(result)

print(result)

Importing JSON in Python and Removing Header

I'm trying to write a simple JSON to CSV converter in Python for Kiva. The JSON file I am working with looks like this:
{"header":{"total":412045,"page":1,"date":"2012-04-11T06:16:43Z","page_size":500},"loans":[{"id":84,"name":"Justine","description":{"languages":["en"], REST OF DATA
The problem is, when I use json.load, I only get the strings "header" and "loans" in data, but not the actual information such as id, name, description, etc. How can I skip over everything until the [? I have a lot of files to process, so I can't manually delete the beginning in each one. My current code is:
import csv
import json

fp = csv.writer(open("test.csv", "wb+"))
f = open("loans/1.json")
data = json.load(f)
f.close()
for item in data:
    fp.writerow([item["name"]] + [item["posted_date"]] + OTHER STUFF)
Instead of
for item in data:
use
for item in data['loans']:
The header is stored in data['header'] and data itself is a dictionary, so you'll have to key into it in order to access the data.
data is a dictionary, so for item in data iterates the keys.
You probably want for loan in data['loans']:
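A minimal sketch of the corrected loop; the name and posted_date fields come from the question, and anything beyond that would need to match the real Kiva schema:
import csv
import json

with open('loans/1.json') as f:
    data = json.load(f)

with open('test.csv', 'w', newline='') as out:
    fp = csv.writer(out)
    for loan in data['loans']:  # iterate the loan records, not the top-level keys
        fp.writerow([loan['name'], loan['posted_date']])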
