python: getting npm package data from a couchdb endpoint

I want to fetch npm package metadata. I found this endpoint, which gives me all the metadata I need.
My plan is to select some specific keys and add that data to a database (I could also store it in a JSON file, but the data is huge). I wrote the following script to fetch the data:
import requests
import json
import sys

db = 'https://replicate.npmjs.com'
r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs": "true"})
for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        print(line)
        decoded_line = line.decode('utf-8')
        print(json.loads(decoded_line))
Notice that I don't even include the docs (include_docs), yet the script gets stuck in what looks like an infinite loop. I think this is because the data is huge.
A look at the head of the output from https://replicate.npmjs.com/_all_docs gives me the following:
{"total_rows":1017703,"offset":0,"rows":[
{"id":"0","key":"0","value":{"rev":"1-5fbff37e48e1dd03ce6e7ffd17b98998"}},
{"id":"0-","key":"0-","value":{"rev":"1-420c8f16ec6584c7387b19ef401765a4"}},
{"id":"0----","key":"0----","value":{"rev":"1-55f4221814913f0e8f861b1aa42b02e4"}},
{"id":"0-1-project","key":"0-1-project","value":{"rev":"1-3cc19950252463c69a5e717d9f8f0f39"}},
{"id":"0-100","key":"0-100","value":{"rev":"1-c4f41a37883e1289f469d5de2a7b505a"}},
{"id":"0-24","key":"0-24","value":{"rev":"1-e595ec3444bc1039f10c062dd86912a2"}},
{"id":"0-60","key":"0-60","value":{"rev":"2-32c17752acfe363fa1be7dbd38212b0a"}},
{"id":"0-9","key":"0-9","value":{"rev":"1-898c1d89f7064e58f052ff492e94c753"}},
{"id":"0-_-0","key":"0-_-0","value":{"rev":"1-d47c142e9460c815c19c4ed3355d648d"}},
{"id":"0.","key":"0.","value":{"rev":"1-11c33605f2e3fd88b5416106fcdbb435"}},
{"id":"0.0","key":"0.0","value":{"rev":"1-5e541d4358c255cbcdba501f45a66e82"}},
{"id":"0.0.1","key":"0.0.1","value":{"rev":"1-ce856c27d0e16438a5849a97f8e9671d"}},
{"id":"0.0.168","key":"0.0.168","value":{"rev":"1-96ab3047e57ca1573405d0c89dd7f3f2"}},
{"id":"0.0.250","key":"0.0.250","value":{"rev":"1-c07ad0ffb7e2dc51bfeae2838b8d8bd6"}},
Notice that all the documents start from the second line (i.e. all the documents are values under the "rows" key). Now, my question is how to get only the values of the "rows" key (i.e. all the documents). I found this repository for a similar purpose, but I can't use or convert it, as I am a total beginner in JavaScript.

If stream=True isn't among the arguments of get(), the whole response is downloaded into memory before the loop over the lines even starts.
Then there is the problem that the individual lines are not valid JSON on their own. You'll need an incremental JSON parser like ijson for this. ijson in turn wants a file-like object, which isn't easily obtained from a requests.Response, so I will use urllib from the Python standard library here:
#!/usr/bin/env python3
from urllib.request import urlopen

import ijson


def main():
    with urlopen('https://replicate.npmjs.com/_all_docs') as json_file:
        for row in ijson.items(json_file, 'rows.item'):
            print(row)


if __name__ == '__main__':
    main()
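If you want to stay with requests, it should also be possible to feed the streamed response to ijson, since r.raw behaves like a binary file object once stream=True is set. A minimal, untested sketch of that variant (treat it as an assumption, not a drop-in answer):
import ijson
import requests

# stream=True stops requests from reading the whole body into memory up front;
# r.raw is the underlying urllib3 response, which ijson can read incrementally.
with requests.get('https://replicate.npmjs.com/_all_docs', stream=True) as r:
    r.raw.decode_content = True  # transparently handle gzip if the server compresses
    for row in ijson.items(r.raw, 'rows.item'):
        print(row['id'], row['value']['rev'])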

Is there a reason why you aren't decoding the json before iterating over the lines?
Can you try this:
import requests
import json
import sys

db = 'https://replicate.npmjs.com'
# include_docs belongs in the query string, not the headers
r = requests.get('https://replicate.npmjs.com/_all_docs', params={"include_docs": "true"})
data = json.loads(r.text)   # a Response has no .decode(); use .text (or simply r.json())
for row in data['rows']:    # the parsed JSON is a dict, so index it by key
    print(row['key'])

Related

How do I process API return without writing to file in Python?

I am trying to slice the results of an API response to process just the first n values in Python, without writing to a file first.
Specifically, I want to do analysis on the "front page" of HN, which is just the first 30 items. However, the API (https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty) gives you the first 500 results.
Right now I'm pulling the top stories and writing them to a file, then importing the file and truncating the string:
import json
import requests

# request JSON data from the HN API
topstories = requests.get('https://hacker-news.firebaseio.com/v0/topstories.json')

# write the response to a .txt file named topstories.txt
with open('topstories.txt', 'w') as fd:
    fd.write(topstories.text)

# truncate the text file to the top 30 stories
f = open('topstories.txt', 'r+')
f.truncate(270)
f.close()
This is inelegant and inefficient. I will have to do this again to extract each 8-digit object ID.
How do I process this API return data as much as possible in memory, without writing to a file?
Suggestion:
User jordanm suggested the code replacement:
fd.write(json.dumps(topstories.json()[:30]))
However, that would just move the needle on when I would need to write/read versus doing anything else I want with it.
What you want is the io library
https://docs.python.org/3/library/io.html
Basically:
import io
f = io.StringIO(topstories.text)
f.truncate(270)
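That said, since the endpoint just returns a JSON array of story IDs, you can also skip files completely by parsing the response and slicing the first 30 entries in memory. A rough sketch (the item endpoint below is the standard HN v0 API; adjust which fields you read to whatever analysis you need):
import requests

# parse the response as JSON (a list of story IDs) and keep only the first 30
top_ids = requests.get('https://hacker-news.firebaseio.com/v0/topstories.json').json()[:30]

# fetch each front-page story individually
for story_id in top_ids:
    item = requests.get(f'https://hacker-news.firebaseio.com/v0/item/{story_id}.json').json()
    print(item['id'], item['title'])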

python trouble de-serializing avro in memory

Currently, I am using requests to grab an avro file from a database and storing the data in requests.text. The file is separated into the schema and the data. How do I merge the schema and data in memory into readable/usable data?
requests.text brings the data down as Unicode, separated with the schema first and the data second. I have been able to use string manipulation to grab just the schema part of the Unicode and set that as a schema variable, but I am unsure how to handle the data section. I tried encoding the data to utf-8 and passing it as raw_bytes in my code, with no luck.
#the request text is too large, so I am shortening it down
r.text = u'Obj\x01\x04\x14avro.codec\x08null\x16avro.schema\u02c6\xfa\x05{"namespace": "namespace", "type": "record", "fields" : [{"type": ["float", "null"], "default": " ", "name": "pvib_z_crest_factor"}],
#repeat for x amount of fields
"name": "Telemetry"}\x00\u201d \xe0B\x1a\u2030=\xc0\u01782\n.\u015e\x049\xaa\x12\xf6\u2030\x02\x00\u0131\u201a];\x02\x02\x02\x00\xed\r>;\x02\x02\x00\x01\x02\x00\x00\x02\x00\x00\x00\x00\x00\x02\x02\x00\x00\x00\x1aC\x00\x00\x00\x02C\x02\x00:\x00#2019-02-27 16:38:39.530263-05:00\x02\x02\x00\xaeGa=\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf8\x04\x02\x00\x00\x00\x00\x00\x00\x00\x02\x02\x02\x02\x00\xac\xc5\'7\x00\x00\x00\xe9B\x02\x00\x00\x00\x00\x00\x00\x0e-r#\x00\x00\x00\x00\x00\x02\x02\x00\xfa\xc0\xf5A\x00\x00\x00\xc0#\x00\x00\x00\x00\x02\x00\x02\xc9\xebB\x00\x00\x00\x00\x00\x00\xaa\ufffd\'\x02\x00\x02\xc9\xebB\x02\x02\x00\x00\x00\x00\x00\x02\x00\ufffd\xc2u=\x02\x00\xfc\x18\xd3>\x02\x02\x00\\\ufffdB>\x02\x02\x001\x08,=\x02\x00\x00\x02\x02\x00\x000oE\x00sh!A\x02\x00\x00\xc0uE\x02\x00\xf6(tA\x00\x00\x00\x00\x00\x00-\xb2\ufffd=\x02\x00\x1c \xd1B\x02\x02\x00#2019-02-27 16:38:39.529977-05:00\x02\x00\x080894\x00\u011f\xa7\xc6=\x00\x00\x02\x02\x02\x02\x02\x02\x00\x00\x00\xe0A\x02\x00\x00\x00\u011eA\x00\x00\x00\xb8A\x00\xc3\xf5\xc0#\x00\xd5x\xe9=\x02\x00\x00\x00q=VA\x02\x00\x00\x000B\x02\x00ZV\xfaE\x02\x02\x02\x02\x00\x00\x00!C\x02\x00\x00\x00#C\x00\x00\x00)C\x00\x00\x02\x00\x00\x00\u20ac?\x00\x00\x02\x02\x02\x02\x02\x00\xf8\x04\x02\x00\x00\x00\x00\x00\x02\x00\x00\x00\u20ac?\x00\x02W\x00ff6A\x00\x00\x00\x00\x00\x02\x00\xcc&\x10L\x00\x00\xf7\x7fG\x02\x02\x02\x00\x00\x00\x00\x00\x02\x02\x02\x00\x00\u20ac\xacC\x02\x02\x02\x00\x1c~%A\x00\x1c \xd1B\x00\x01\x02\x02\x02\x00\xfa\xc0\xf5A\x02\x02\x02\x02\x02\x00\x00\x000B\x00\x00\x00\x00\x00\x00\x00\x00?C\x00\xf4-\x1fE\x00\x00\x00\x00\x00\x00\x00\u0131\x7fG\x00\x00\u015f\x7fG\x00\x00\u0131\x7fG\x00\x00\x00\x0bC\x00#2019-05-31 13:00:25.931949+00:00\x00#2019-05-31 09:00:25.931967-04:00\x00\x00\x00\xe0A\x00h\xe8\u0178:\x00=\n%C\x00\x00\x00\x07C\x02\x00\x00\x00\xe0#\x00\x01\x02\x00\x00\x02\x02\x00\x00\u011e\u2020F\x02\x00\x00\u20acDE\x00\xcd\xcc\xcc=\x00#2019-02-27 16:38:39.529620-05:00\x02\x00\x00\x00\xc8B\x00\x00\x00\x06C\x02\x00\x01\x004\u20ac7:\x00\x00\x000B\x02\x02\x02\x02\x02\x02\x0033CA\x02\x00L7\t>\x02\x02\x00\xae\xc7\xa7B\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x02\x02\x00\x00\x00pB\x00\x00\x00`B\x00\x00\x02\x00\x00\x00...
#continues on, too big to put the rest of (feel free to ask questions to see more)
I expect the in-memory file to be de-serialized into readable data; however, I have been getting constant errors about the list index being out of range or not being able to access branch index x.
Thank you for reading
EDIT(6/5/19):
I managed to download the avro file using Azure Storage Explorer on another device. From there, I ran the following code:
import avro.schema
from avro.io import DatumReader, DatumWriter
from avro.datafile import DataFileReader, DataFileWriter

avro_file = DataFileReader(open("Destination/to/file.avro", "rb"), DatumReader())
avro_file = [x for x in avro_file]
for i in range(len(avro_file)):
    print(len(avro_file))
    print(avro_file[i])
(NOTE: the computer I ran this code on uses Python 3.7, but there are no real syntax changes between the two Python versions.)
This code runs smoothly and shows the data in the appropriate places.
However, I cannot simply pass the same data I'm receiving from the request as an argument to DataFileReader (stating the obvious, but I'm guessing it has something to do with opening the file with "rb" while request.text is Unicode). Is there any way to modify that request.text so I can pass it as an argument to DataFileReader (replacing open(file, "rb"))?
You want content, not text.
I also think you'll want to try BytesIO, which can be used like a file object:
import io
import requests
from avro.datafile import DataFileReader
from avro.io import DatumReader

r = requests.get("http://example.com/file.avro")
inmemoryfile = io.BytesIO(r.content)  # r.content is the raw bytes of the response
reader = DataFileReader(inmemoryfile, DatumReader())
records = list(reader)
reader.close()
(code untested)

Read XML file from URL in Python

I'm using an open-source project called OpenTripPlanner, a tool that I plan to use to simulate a lot of itineraries from one point to another at a given time. So far, I've managed to find the URL where an XML file containing all the information about an itinerary is located. The XML is built per request, so the URL isn't static. The URL looks something like this:
http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK
(You need to have an OpenTripPlanner server running to open it)
Now, I want to read these XML files and do some data analysis using Python 3, but I can't find a way to read the files. I've tried using urllib.request to download the file locally, but the file I get is oddly formed. It looks something like this:
{"requestParameters":{"date":"2017/12/04","mode":"TRANSIT,WALK","fromPlace":"48.40915, -71.04996","toPlace":"48.41428, -71.06996","time":"8:00:00"},"plan":{"date":1512392400000,"from":{"name":"Origin","lon":-71.04996,"lat":48.40915,"orig":"","vertexType":"NORMAL"},"to":{"name":"Destination","lon":-71.06996,"lat":48.41428,"orig":"","vertexType":"NORMAL"},"itineraries":[{"duration":1538,"startTime":1512392809000,"endTime":1512394347000,"walkTime":934,"transitTime":602,"waitingTime":2,"walkDistance":1189.6595112715966,"walkLimitExceeded":false,"elevationLost":0.0,"elevationGained":0.0,"transfers":0,"legs":[{"startTime":1512392809000,"endTime":1512393537000,"departureDelay":0,"arrivalDelay":0,"realTime":false,"distance":926.553,"pathway":false,"mode":"WALK","route":"","agencyTimeZoneOffset":-18000000,"interlineWithPreviousLeg":false,"from":{"name":"Origin","lon":-71.04996,"lat":48.40915,"departure":1512392809000,"orig":"","vertexType":"NORMAL"},"to":{"name":"Roitelets / Martinets","stopId":"1:370","stopCode":"370","lon":-71.047688,"lat":48.401531,"arrival":1512393537000,"departure":1512393538000,"stopIndex":15,"stopSequence":16,"vertexType":"TRANSIT"},"legGeometry":{"points":"s{mfHb{spL|ExBp#sDl#V##lB|#j#FL?j#GbCk#|A]vEsA^KBA|C{#pCeACS~CuA`#Q","length":19},"rentedBike":false,"transitLeg":false,"duration":728.0,"steps":[{"distance":131.991,"relativeDirection":"DEPART","streetName":"Rue D.-V.-Morrier","absoluteDirection":"SOUTH","stayOn":false,"area":false,"bogusName":false,"lon":-71.04961760502248,"lat":48.4090671692228,"elevation":[]},{"distance":72.319,"relativeDirection":"LEFT","streetName":"Rue Lorenzo-Genest","absoluteDirection":"EAST","stayOn":false,"area":false,"bogusName":false,"lon":-71.0502299,"lat":48.4079519,"elevation":[]}
And when I try to open the file in a browser, I get an error that says
XML Parsing Error: not well-formed
Location: http://localhost:63342/XML_reader/file.xml?_ijt=e1d6h53s4mh1ak94sqortejf9v
Line Number 1, Column 1: ...
The script I'm using is very simple; it looks like this:
import urllib.request
testfile = urllib.request.URLopener()
file_name = 'http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK'
testfile.retrieve(file_name, "file.xml")
How can I make the output XML files well-formed? Is there another way besides urllib.request that I may want to try?
Thanks a lot
To import this file as JSON data (not XML), you need the json library:
import urllib.request
import json
from pprint import pprint
testfile = urllib.request.URLopener()
file_name = 'http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK'
testfile.retrieve(file_name, "file.json")
data = json.load(open('file.json'))
pprint(data)
json.load reads the JSON data and converts it into a Python object (https://docs.python.org/2/library/json.html?highlight=json%20load#json.load)
pprint is for "Pretty printing" the JSON data (https://docs.python.org/2/library/pprint.html)
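If you would rather not write the intermediate file at all, a minimal sketch using the requests library (an extra dependency, so this is just one option, assuming your OTP server is running locally) can fetch and parse the response in one step:
import requests
from pprint import pprint

url = ('http://localhost:8080/otp/routers/default/plan'
       '?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996'
       '&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK')

data = requests.get(url).json()          # parse the JSON body straight into a dict
pprint(data['plan']['itineraries'][0])   # e.g. inspect the first itinerary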

Mongoexport exporting invalid json files

I collected some tweets from the Twitter API and stored them in MongoDB. I exported the data to a JSON file without any issues, but when I tried to write a Python script to read the JSON and convert it to a CSV, I got this traceback error:
json.decoder.JSONDecodeError: Extra data: line 367 column 1 (char 9745)
So, after digging around the internet, I was pointed to check the actual JSON data in an online validator, which I did. This gave me the error:
Multiple JSON root elements
from the site https://jsonformatter.curiousconcept.com/
Now, the problem is, I haven't found anything on the internet about how to handle that error. I'm not sure if it's an error in the data I've collected or exported, or if I just don't know how to work with it.
My end goal with these tweets is to make a network graph. I was looking at either NetworkX or Gephi, which is why I'd like to get a CSV file.
Robert Moskal is right. If you can address the issue at the source and use the --jsonArray flag with mongoexport, that will make the problem easier, I guess. If you can't address it at the source, then read the points below.
The code below will extract the individual JSON objects from the given file and convert them to Python dictionaries.
You can then apply your CSV logic to each individual dictionary.
If you are using the csv module, I would suggest the unicodecsv module, as it handles the Unicode data in your JSON objects.
import json

with open('path_to_your_json_file', 'r') as infile:
    json_block = []
    for line in infile:
        json_block.append(line)
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            json_block = []
            print(json_dict)
If you want to convert it to CSV using pandas, you can use the code below:
import json
import pandas as pd

with open('path_to_your_json_file', 'r') as infile:
    json_block = []
    dictlist = []
    for line in infile:
        json_block.append(line)
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            dictlist.append(json_dict)
            json_block = []

df = pd.DataFrame(dictlist)
df.to_csv('out.csv', encoding='utf-8')
If you want to flatten out the JSON objects, you can use the pandas.io.json.json_normalize() method.
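As a small illustration of what json_normalize does (the records below are made-up, tweet-like dictionaries; in recent pandas versions the function is also exposed at the top level as pandas.json_normalize):
import pandas as pd

# hypothetical nested records, similar in shape to tweet documents
records = [
    {"user": {"screen_name": "alice", "followers": 10}, "text": "hello"},
    {"user": {"screen_name": "bob", "followers": 25}, "text": "world"},
]

# nested keys become dotted column names, e.g. "user.screen_name"
flat = pd.json_normalize(records)
flat.to_csv('out_flat.csv', index=False, encoding='utf-8')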
Elaborating on #MYGz's suggestion to use --jsonArray:
Your post doesn't show how you exported the data from mongo. If you use the following via the terminal, you will get valid json from mongodb:
mongoexport --collection=somecollection --db=somedb --jsonArray --out=validfile.json
Replace somecollection, somedb and validfile.json with your target collection, target database, and desired output filename respectively.
The following: mongoexport --collection=somecollection --db=somedb --out=validfile.json ...will NOT give you the results you are looking for, because:
"By default mongoexport writes data using one JSON document for every MongoDB document." (Ref)
A bit of a late reply, and I am not sure this was available at the time the question was posted. Anyway, now there is a simple way to import the mongoexport JSON data:
df = pd.read_json(filename, lines=True)
mongoexport writes each line as a JSON object itself, instead of the whole file as one JSON document.
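Putting that together with the CSV goal, a minimal sketch (the file name and the column names here are only illustrative; pick whichever tweet fields your network graph needs):
import pandas as pd

# each line of the default mongoexport output is one JSON document
df = pd.read_json('tweets.json', lines=True)

# keep just the fields you care about, then write the CSV for NetworkX/Gephi
df[['id_str', 'text']].to_csv('tweets.csv', index=False, encoding='utf-8')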

Hitting URLs stored in a variable in python

The code below grabs a column from a CSV file I pass it, then appends the contents of that column to a URL. I printed those URLs as a test and it works just fine. I stored those URLs in a variable, and now I want to hit them; the responses are not important to me. I tried using urllib2, but that needs to be passed a URL, not a variable. How do I hit the URLs I have stored in the sid variable?
#!/usr/bin/python
import csv
import sys

csvfile = list(csv.reader(open(sys.argv[1])))
for row in csvfile:
    sid = "http://myurlhere.com?sid=" + row[13]
    print "%s" % sid
It doesn't matter whether you pass the parameter as a variable or a hardcoded string.
import csv
import sys
import urllib2

csvfile = list(csv.reader(open(sys.argv[1])))
for row in csvfile:
    sid = "http://myurl.com?sid=" + row[13]
    urllib2.urlopen(sid)
By using urlopen(sid). You really should pay more attention to Python's docs. Quoted from there:
The simplest way to use this module is to call the urlopen function, which accepts a string containing a URL or a Request object (described below). It opens the URL and returns the results as a file-like object.
As your sid variable is decidedly a string, I don't really see what the problem is...
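(If you are on Python 3, where urllib2's urlopen lives in urllib.request, a minimal equivalent sketch would be:)
import csv
import sys
from urllib.request import urlopen

with open(sys.argv[1], newline='') as f:
    for row in csv.reader(f):
        sid = "http://myurl.com?sid=" + row[13]
        urlopen(sid)  # fire the request; the response body is ignored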
