scrapy get data dict from json dictionary

I am trying to get all of the data stored in this JSON as a dictionary that I can load and access. I am still new to writing spiders, but I believe I need something like
response.xpath().extract()
and then json.load().split() to get an element from it.
I am not sure of the exact syntax, though, since there are so many elements in this file.

You can use re_first() to extract the JSON from the JavaScript code and then parse it with json.loads():
import json
d = response.xpath('//script[contains(., "windows.PAGE_MODEL")]/text()').re_first(r'(?s)windows.PAGE_MODEL = (.+?\});')
data = json.loads(d)
property_id = data['propertyData']['id']
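To try the regular expression outside a spider, here is a minimal sketch that uses only the standard library. The sample script text and the helper name extract_page_model are made up for illustration, and the real page will look different. In a spider, re_first() applies the pattern to the selected text in much the same way and returns the captured group.

```python
import json
import re

# Hypothetical script content; the real page will contain far more data.
SAMPLE_SCRIPT = '''
var somethingElse = 1;
windows.PAGE_MODEL = {"propertyData": {"id": 12345,
    "bedrooms": 3}};
'''

# Same idea as the pattern in the answer: (?s) lets "." span newlines,
# and the lazy group stops at the first "}" that is followed by ";".
PATTERN = re.compile(r'(?s)windows\.PAGE_MODEL = (.+?\});')


def extract_page_model(script_text):
    """Return the parsed PAGE_MODEL dict, or None if the script has none."""
    match = PATTERN.search(script_text)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_page_model(SAMPLE_SCRIPT)
property_id = data['propertyData']['id']
print(property_id)
```

Checking for None before calling json.loads() avoids a TypeError on pages where the script tag is missing.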

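You're right, it works pretty much as you suggested in your question.
You can find the script tag that contains 'windows.PAGE_MODEL' with a simple XPath query.
Please try the following code in the callback for your request:
from json import loads
script = response.xpath('//script[text()[contains(., "windows.PAGE_MODEL")]]/text()').get()
# The script text is JavaScript, not bare JSON, so slice out the object literal first.
# This assumes the script holds a single top-level object assigned to PAGE_MODEL.
d = script[script.index('{'):script.rindex('}') + 1]
data = loads(d)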

Related

Issue with handling json api response in Python

I am using the Censys API in Python to programmatically look through hosts and grab information about them. The Censys website says it returns JSON-formatted data, and it looks like JSON-formatted data, but I can't figure out how to turn the API response into a JSON object. However, if I write the response to a JSON file and load it, it works fine. Any ideas?
Update: I figured out that the issue is with the nested JSON that the API returns. I am looking for libraries to flatten it.
Main.py
c = censys.ipv4.CensysIPv4(api_id=UID, api_secret=SECRET)
for result in c.search("autonomous_system.asn:15169 AND tags.raw:iot", max_records=1):
    hostIPS.append(result["ip"])
for host in hostIPS:
    for details in c.view(host):
        # test = json.dumps(details)
        # test = json.load(test)
        # data = json.load(details)
        data = json.loads(details)
        print(data)

You don't need to convert it: the client has already passed the response through json.loads(), so what you get back is a regular Python object. See the implementation here: https://github.com/censys/censys-python/blob/master/censys/base.py
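As a rough illustration of both points (the data is already parsed, and nested records can be flattened by hand), here is a small sketch. The details dict is a made-up, trimmed-down record and flatten is a hypothetical helper, not part of the Censys client.

```python
import json

# Hypothetical, trimmed-down host record; a real response has many more fields.
details = {
    "ip": "8.8.8.8",
    "autonomous_system": {"asn": 15169, "name": "GOOGLE"},
    "tags": ["iot"],
}

# json.loads() expects text, so handing it an already-parsed dict raises TypeError.
try:
    json.loads(details)
except TypeError as exc:
    error = str(exc)


def flatten(obj, parent_key="", sep="."):
    """Collapse nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            # Lists and scalars are kept as they are.
            flat[full_key] = value
    return flat


flat = flatten(details)
print(flat)
```

The flattened dict can then be written out as a single CSV row or fed into whatever tabular tool you prefer.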

Use Python to request an authenticated GitHub URL that returns JSON, then parse the JSON into CSV format

I am new to Python. We are requesting a GitHub URL and the response is JSON. We have to parse the JSON to filter out specific labels and store them in CSV format. We have an authentication token that we use for the request. Could you please provide code for the above scenario?

Your question is very general, and since you didn't include any code, it seems like you are just looking for a ready-made answer.
I can't give you that, but the code below should get you started on converting JSON into a Python object, looping over its labels, and writing them to a CSV file.
import json
x = json.loads(your_json_object)
# Open the file once, then write every label into it.
with open('your_file.csv', 'w') as file:
    for label in x:
        file.write("{}, ".format(label))
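If the response is a list of issues that each carry a list of labels, a slightly fuller sketch with the csv module could look like the following. The response shape, the field names "number", "labels" and "name", and the helper write_labels_csv are assumptions for illustration; check them against the JSON you actually receive.

```python
import csv
import io
import json

# Hypothetical response body: a list of issues, each with a list of labels.
response_text = '''[
  {"number": 1, "labels": [{"name": "bug"}, {"name": "urgent"}]},
  {"number": 2, "labels": [{"name": "docs"}]}
]'''


def write_labels_csv(json_text, out_file, wanted=None):
    """Write one CSV row per (issue number, label name) pair.

    If `wanted` is given, only labels whose name is in that set are kept.
    """
    issues = json.loads(json_text)
    writer = csv.writer(out_file)
    writer.writerow(["number", "label"])
    for issue in issues:
        for label in issue["labels"]:
            if wanted is None or label["name"] in wanted:
                writer.writerow([issue["number"], label["name"]])


# An in-memory buffer stands in for open('your_file.csv', 'w', newline='').
buffer = io.StringIO()
write_labels_csv(response_text, buffer, wanted={"bug", "docs"})
csv_text = buffer.getvalue()
print(csv_text)
```

Using csv.writer instead of manual string formatting takes care of quoting when a label contains a comma.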

How to extract some text from json file without loading it?

Python's lxml can be used to extract text (e.g., with XPath) from XML files without having to fully parse the XML. For example, I can do the following, which is faster than BeautifulSoup, especially for large input. I'd like to have some equivalent code for JSON.
from lxml import etree
tree = etree.XML('<foo><bar>abc</bar></foo>')
print(type(tree))
r = tree.xpath('/foo/bar')
print([x.tag for x in r])
I have seen http://goessner.net/articles/JsonPath/, but I don't see any example Python code that extracts text from a JSON file without having to use json.load(). Could anybody show me an example? Thanks.

I'm assuming you don't want to load the entire JSON for performance reasons.
If that's the case, perhaps ijson is what you need. I have used it to search huge JSON files (>8 GB) and it works well.
However, you will have to implement the search code yourself.

Mongoexport exporting invalid json files

I collected some tweets from the Twitter API and stored them in MongoDB. I exported the data to a JSON file without any issues, but when I wrote a Python script to read the JSON and convert it to CSV, I got this traceback:
json.decoder.JSONDecodeError: Extra data: line 367 column 1 (char 9745)
So, after digging around the internet I was pointed to check the actual JSON data in an online validator, which I did. This gave me the error of:
Multiple JSON root elements
from the site https://jsonformatter.curiousconcept.com/
Here are pictures of the 1st/2nd object beginning/end of the file:
or a link to the data here
Now, the problem is that I haven't found anything on the internet about how to handle that error. I'm not sure whether the error is in the data I collected, in the export, or whether I just don't know how to work with it.
My end game with these tweets is to make a network graph. I was looking at either Networkx or Gephi, which is why I'd like to get a csv file.

Robert Moskal is right. If you can address the issue at the source, use the --jsonArray flag when you run mongoexport; that makes the problem easier. If you can't address it at the source, read on.
The code below extracts the individual JSON objects from the given file and converts them to Python dictionaries.
You can then apply your CSV logic to each individual dictionary.
If you are using the csv module on Python 2, I would suggest the unicodecsv module instead, since it handles the Unicode data in your JSON objects; on Python 3 the built-in csv module handles Unicode itself.
import json
with open('path_to_your_json_file', 'r', encoding='utf-8') as infile:
    json_block = []
    for line in infile:
        json_block.append(line)
        # A line starting with '}' closes the current top-level object.
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            json_block = []
            print(json_dict)
If you want to convert it to CSV using pandas, you can use the code below:
import json
import pandas as pd
dictlist = []
with open('path_to_your_json_file', 'r', encoding='utf-8') as infile:
    json_block = []
    for line in infile:
        json_block.append(line)
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            dictlist.append(json_dict)
            json_block = []
df = pd.DataFrame(dictlist)
df.to_csv('out.csv', encoding='utf-8')
If you want to flatten out the JSON objects, you can use pandas.json_normalize() (pandas.io.json.json_normalize() in pandas versions before 1.0).
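If the objects in your file don't reliably end with a line that starts with '}', another option is to let the standard library find the object boundaries. The sketch below swaps in json.JSONDecoder.raw_decode(), which parses one value and reports where it ended. The function name iter_json_objects and the sample text are made up for illustration, and the whole text is held in memory, so this suits files of moderate size.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode() does not skip leading whitespace, so do it here.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Two pretty-printed objects back to back, i.e. "multiple JSON root elements".
sample = (
    '{\n  "id": 1,\n  "user": {"name": "a"}\n}\n'
    '{\n  "id": 2,\n  "user": {"name": "b"}\n}\n'
)

docs = list(iter_json_objects(sample))
print(docs)
```

Calling json.loads(sample) on the same text raises the "Extra data" error from the question, because the parser stops after the first object and finds more input.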

Elaborating on @MYGz's suggestion to use --jsonArray:
Your post doesn't show how you exported the data from MongoDB. If you run the following in the terminal, you will get valid JSON from MongoDB:
mongoexport --collection=somecollection --db=somedb --jsonArray --out=validfile.json
Replace somecollection, somedb and validfile.json with your target collection, target database, and desired output filename respectively.
The following will NOT give you the results you are looking for:
mongoexport --collection=somecollection --db=somedb --out=validfile.json
because:
By default mongoexport writes data using one JSON document for every MongoDB document.

A bit of a late reply, and I am not sure this was available at the time the question was posted. Anyway, there is now a simple way to import the mongoexport JSON data:
df = pd.read_json(filename, lines=True)
mongoexport writes each document as a JSON object on its own line, rather than writing the whole file as a single JSON value.
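If you would rather not depend on pandas, the same one-object-per-line idea can be sketched with the standard library alone. The sample lines, the field names "id" and "text", and the helper json_lines_to_rows are made up for illustration.

```python
import csv
import io
import json


def json_lines_to_rows(lines):
    """Parse an iterable of lines, each holding one JSON object."""
    return [json.loads(line) for line in lines if line.strip()]


# Stand-in for open(filename, encoding='utf-8'); one JSON object per line.
sample = io.StringIO(
    '{"id": 1, "text": "hello"}\n'
    '{"id": 2, "text": "world"}\n'
)

rows = json_lines_to_rows(sample)

# Write the parsed rows to CSV; an in-memory buffer stands in for a real file.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "text"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

This only works when every object fits on a single line; pretty-printed exports span several lines per object and need a different approach.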

How do I load JSON into Couchbase Headless Server in Python?

I am trying to create a Python script that can take a JSON object and insert it into a headless Couchbase server. I have been able to successfully connect to the server and insert some data. I'd like to be able to specify the path of a JSON object and upsert that.
So far I have this:
from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError
import json
cb = Bucket('couchbase://XXX.XXX.XXX?password=XXXX')
print(cb.server_nodes)
#tempJson = json.loads(open("myData.json","r"))
try:
    result = cb.upsert('healthRec', {'record': 'bob'})
    # result = cb.upsert('healthRec', {'record': tempJson})
except CouchbaseError as e:
    print("Couldn't upsert", e)
    raise
print(cb.get('healthRec').value)
I know that the first commented out line that loads the json is incorrect because it is expecting a string not an actual json... Can anyone help?
Thanks!

Figured it out:
with open('myData.json', 'r') as f:
    data = json.load(f)
try:
    result = cb.upsert('healthRec', {'record': data})
except CouchbaseError as e:
    print("Couldn't upsert", e)
    raise
I am looking into using cbdocloader, but this was my first step getting this to work. Thanks!

I know that you've found a solution that works for you in this instance but I thought I'd correct the issue that you experienced in your initial code snippet.
json.loads() takes a string as input and decodes it into a dictionary (or whatever custom object you produce via object_hook). You were passing it a file handle instead, which is why you saw the error.
There is actually a method json.load() which works as expected, as you have used in your eventual answer.
You would have been able to use it as follows (if you wanted something slightly less verbose than the with statement):
tempJson = json.load(open("myData.json","r"))
As Kirk mentioned, though, if you have a large number of JSON documents to insert, it might be worth taking a look at cbdocloader, as it will handle all of this boilerplate code for you (with appropriate error handling and other functionality).
This readme covers the uses of cbdocloader and how to format your data correctly to allow it to load your documents into Couchbase Server.
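To make the difference between the two functions concrete, here is a small sketch that needs no Couchbase server. The JSON text is made up, and io.StringIO stands in for the open file handle.

```python
import io
import json

# Made-up document; io.StringIO below plays the role of open("myData.json").
text = '{"record": {"name": "bob", "age": 42}}'

# json.loads() parses a string.
from_string = json.loads(text)

# json.load() parses a file-like object by reading it first.
from_file = json.load(io.StringIO(text))

# Passing a file-like object to json.loads() raises TypeError,
# which is the mistake in the commented-out line of the question.
try:
    json.loads(io.StringIO(text))
except TypeError as exc:
    loads_error = type(exc).__name__

print(from_string == from_file, loads_error)
```

Either result can be passed as the value in the upsert call, since both are plain Python dictionaries.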
