I am making a program that scrapes data from a job page, and I end up with this data:
{"job":{"ciphertext":"~01142b81f148312a7c","rid":225177647,"uid":"1416152499115024384","type":2,"access":4,"title":"Need app developers to handle our app upgrades","status":1,"category":{"name":"Mobile Development","urlSlug":"mobile-development"
,"contractorTier":2,"description":"We have an app currently built, we are looking for someone to \n\n1) Manage the app for bugs etc \n2) Provide feature upgrades \n3) Overall Management and optimization \n\nPlease get in touch and i will share more details. ","questions":null,"qualifications":{"type":0,"location":null,"minOdeskHours":0,"groupRecno":0,"shouldHavePortfolio":false,"tests":null,"minHoursWeek":40,"group":null,"prefEnglishSkill":0,"minJobSuccessScore":0,"risingTalent":true,"locationCheckRequired":false,"countries":null,"regions":null,"states":null,"timezones":null,"localMarket":false,"onSiteType":null,"locations":null,"localDescription":null,"localFlexibilityDescription":null,"earnings":null,"languages":null
],"clientActivity":{"lastBuyerActivity":null,"totalApplicants":0,"totalHired":0,"totalInvitedToInterview":0,"unansweredInvites":0,"invitationsSent":0
,"buyer":{"isPaymentMethodVerified":false,"location":{"offsetFromUtcMillis":14400000,"countryTimezone":"United Arab Emirates (UTC+04:00)","city":"Dubai","country":"United Arab Emirates"
,"stats":{"totalAssignments":31,"activeAssignmentsCount":3,"feedbackCount":27,"score":4.9258937139,"totalJobsWithHires":30,"hoursCount":7.16666667,"totalCharges":{"currencyCode":"USD","amount":19695.83
,"jobs":{"postedCount":59,"openCount":2
,"avgHourlyJobsRate":{"amount":19.999534874418824
But the problem is that the only data I need is:
-Title
-Description
-Customer activity (lastBuyerActivity, totalApplicants, totalHired, totalInvitedToInterview, unansweredInvites, invitationsSent)
-Buyer (isPaymentMethodVerified, location (Country))
-stats (All items)
-jobs (all items)
-avgHourlyJobsRate
This kind of data is JSON. Python represents it with the dictionary data type.
Suppose you have your data stored in a string. You might be tempted to use di = eval(myData) to convert the string to a dictionary (exec will not work here, because it always returns None). Then you could access the structured data like di["job"], which returns the job section of the data.
di = eval(myData)
print(di["job"])
However, this is just a hack and it is not recommended: eval runs arbitrary code, and it fails on the JSON literals null, true and false, all of which appear in the data above.
The appropriate way is to use the json library to convert the data to a dictionary. The snippet below shows the idea:
import json
myData = "Put your data Here"
res = json.loads(myData)
print(res["job"])
Convert the data to a dictionary using json.loads.
Then you can easily use the dictionary keys that you want to look up or filter the data.
This seems to be a dictionary, so you can extract something from it by doing dictionary["job"]["uid"], for example. If it is a JSON file, convert the data to a Python dictionary first.
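To pull out only the fields listed in the question, something like the sketch below might work. The sample string is a heavily trimmed, hypothetical stand-in for the scraped data, and its nesting (title, description and clientActivity under job; stats, jobs and avgHourlyJobsRate under buyer) is an assumption, because the data shown in the question is truncated. The dig helper is also just an illustrative name. Adjust the key paths to whatever the real response looks like.

```python
import json

# Trimmed, hypothetical stand-in for the scraped string; the nesting is assumed.
my_data = '''
{"job": {"title": "Need app developers to handle our app upgrades",
         "description": "We have an app currently built",
         "clientActivity": {"lastBuyerActivity": null, "totalApplicants": 0}},
 "buyer": {"isPaymentMethodVerified": false,
           "location": {"city": "Dubai", "country": "United Arab Emirates"},
           "stats": {"totalAssignments": 31, "feedbackCount": 27},
           "jobs": {"postedCount": 59, "openCount": 2},
           "avgHourlyJobsRate": {"amount": 19.999534874418824}}}
'''


def dig(data, *keys):
    """Follow a chain of keys, returning None as soon as one is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


res = json.loads(my_data)

# Keep only the pieces the question asks for.
wanted = {
    "title": dig(res, "job", "title"),
    "description": dig(res, "job", "description"),
    "clientActivity": dig(res, "job", "clientActivity"),
    "isPaymentMethodVerified": dig(res, "buyer", "isPaymentMethodVerified"),
    "country": dig(res, "buyer", "location", "country"),
    "stats": dig(res, "buyer", "stats"),
    "jobs": dig(res, "buyer", "jobs"),
    "avgHourlyJobsRate": dig(res, "buyer", "avgHourlyJobsRate"),
}

print(json.dumps(wanted, indent=2))
```

Returning None for a missing key keeps the script from crashing when a field is absent, at the cost of hiding typos in the key paths.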
I'm going crazy trying to get data through an API call using requests and pandas. It looks like it's nested data, but I can't get the data I need.
https://xorosoft.docs.apiary.io/#reference/sales-orders/get-sales-orders
Above is the API documentation. I'm just trying to keep it simple and get the ItemNumber and QtyRemainingToShip, but I can't even figure out how to access the nested data. I'm trying to use a DataFrame to get it, but am just lost. Any help would be appreciated. I keep getting stuck at the 'Data' level.
type(json['Data'])
df = pd.DataFrame(['Data'])
df.explode('SoEstimateHeader')
df.explode('SoEstimateHeader')
Cell In [64], line 1
df.explode([0:])
^
SyntaxError: invalid syntax
I used the link to grab a sample response from the API documentation page you provided. From the code you provided, it looks like you are already able to get the data, and I'm assuming that you already have it as a dictionary.
From what I can tell, I don't think you should be using pandas, unless it's a downstream requirement of the task you are doing. To get the ItemNumber and QtyRemainingToShip, you can use the code below.
# get the interesting part of the data out of the api response
data_list = json['Data']
#the data_list is only one element long, so grab the first element which is of type dictionary
data = data_list[0]
# the dictionary has two keys at the top level
so_estimate_header = data['SoEstimateHeader']
# similar to the data list the value associated with "SoEstimateItemLineArr" is of type list and has 1 element in it, so we grab the first & only element.
so_estimate_item_line_arr = data['SoEstimateItemLineArr'][0]
# now we can grab the pieces of information we're interested in out of the dictionary
qtyremainingtoship = so_estimate_item_line_arr["QtyRemainingToShip"]
itemnumber = so_estimate_item_line_arr["ItemNumber"]
print("QtyRemainingToShip: ", qtyremainingtoship)
print("ItemNumber: ", itemnumber)
Output
QtyRemainingToShip: 1
ItemNumber: BC
Side Note
As a side note, I wouldn't name any variable json, because that's also the name of a popular Python library for parsing JSON. It will confuse future readers, and it will clash with the module name if you end up having to import the json library.
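If the response holds more than one order, or an order holds more than one item line, the same access pattern can be put in a loop. The sketch below uses a made-up response_data dictionary that only mimics the nesting described above (the name and the second item line are assumptions, and the real API returns many more fields).

```python
# Hypothetical response shaped like the sample discussed above; values are made up.
response_data = {
    "Data": [
        {
            "SoEstimateHeader": {},
            "SoEstimateItemLineArr": [
                {"ItemNumber": "BC", "QtyRemainingToShip": 1},
                {"ItemNumber": "XY", "QtyRemainingToShip": 4},
            ],
        },
    ]
}

# Walk every order, then every item line inside that order.
rows = []
for order in response_data["Data"]:
    for line in order["SoEstimateItemLineArr"]:
        rows.append((line["ItemNumber"], line["QtyRemainingToShip"]))

for item_number, qty in rows:
    print("ItemNumber:", item_number, "QtyRemainingToShip:", qty)
```

If a DataFrame really is needed downstream, a list of tuples like rows can be handed to pd.DataFrame(rows, columns=["ItemNumber", "QtyRemainingToShip"]).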
I currently want to scrape some data from an amazon page and I'm kind of stuck.
For example, let's take this page.
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found opening the source code and searching for 'variationValues'.
There we can see sort of a dictionary containing all the sizes and colors and, below that, in 'asinToDimentionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimentionIndexMap we can see
"B01KWIUH5M":[0,0]
Which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in variationValues size_name section) and the color 'Teal' (same idea as before)
I want to scrape both the variationValues and the asinToDimentionIndexMap, so I can associate the IndexMap numbers with the variationValues ones.
Another person in the site (thanks for the help btw) suggested doing it this way.
script = response.xpath('//script/text()').extract_first()
import re
# capture everything between {}
data = re.findall(r'(\{.+?\})', script)
import json
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of json is not that great and reading some stuff about it didn't help that much.
Is there a way to get, from that data, 2 dictionaries or lists with the variationValues and asinToDimentionIndexMap? (Maybe using some regular expressions in the middle to get some data out of a big string.) Or could someone explain a little bit what happens in the json part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to interpret string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming that where you may be finding things a challenge is when errors come up while accessing a particular "box" inside your json object.
Your code format looks correct, but your access within "each box" may look different.
Eg. If your 'asinToDimentionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimentionIndexMap']
I've hacked and slashed a little bit so you can better understand the structure of your particular json file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another", which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
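If you would rather inspect the structure from Python than from a viewer, a small recursive walk can print the access path to every value. This is only a sketch: list_paths is a made-up helper name, and d here is a tiny stand-in built from the example path above, not the real page data.

```python
def list_paths(node, prefix=""):
    """Return the access path to every leaf value, e.g. d['a'][0]['b']."""
    if isinstance(node, dict):
        items = [(repr(key), value) for key, value in node.items()]
    elif isinstance(node, list):
        items = [(str(index), value) for index, value in enumerate(node)]
    else:
        # Not a container, so this is a leaf: the prefix is its full path.
        return [prefix]
    paths = []
    for label, child in items:
        paths.extend(list_paths(child, "%s[%s]" % (prefix, label)))
    return paths


# Tiny stand-in for the parsed json object.
d = {"updateDivLists": {"full": [{"divToUpdate": "companyCompliancePolicies_feature_div"}]}}

paths = list_paths(d, "d")
for path in paths:
    print(path)
```

Each printed line can be pasted back into your script as-is to reach that value.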
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them to JSON and use or combine them as you wish.
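As a rough illustration of that last step, the sketch below decodes two of the values and joins them so each product code maps to its size and color. Everything in script_text is a hypothetical excerpt (only B01KWIUH5M, '8M US' and 'Teal' come from the question), extract_json_after is a made-up helper, and dimension_order assumes the first index refers to size_name and the second to color_name, as described in the question. Instead of a regex that guesses where the closing brace is, it lets json.JSONDecoder.raw_decode find the end of the value.

```python
import json
import re

# Hypothetical excerpt of the page's script text; the real page is much larger.
script_text = '''
"variationValues" : {"size_name":["8M US","9M US"],"color_name":["Teal","Black"]},
"asinToDimensionIndexMap" : {"B01KWIUH5M":[0,0],"B01KWIUHAM":[1,1]},
'''


def extract_json_after(key, text):
    """Decode the JSON value that follows '"key" :' in text."""
    match = re.search(r'"%s"\s*:\s*' % re.escape(key), text)
    if match is None:
        raise KeyError(key)
    # raw_decode parses one JSON value starting at the given index
    # and ignores whatever comes after it.
    value, _end = json.JSONDecoder().raw_decode(text, match.end())
    return value


variation_values = extract_json_after("variationValues", script_text)
index_map = extract_json_after("asinToDimensionIndexMap", script_text)

# Assumed order of the indexes in each [size, color] pair.
dimension_order = ["size_name", "color_name"]

variants = {
    asin: {dim: variation_values[dim][i] for dim, i in zip(dimension_order, indexes)}
    for asin, indexes in index_map.items()
}

print(variants["B01KWIUH5M"])
```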
I want to filter the JSON for products whose operatingSystem is linux, and I have some problems with it. Part of the JSON is keyed like this
'' : {
and I don't know how a dictionary represents it, and
"DQ578CGN99KG6ECF" : {
how can I match that key with a wildcard? Could anyone help me, please?
import json
import urllib2
response=urllib2.urlopen('https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/index.json')
url=response.read()
urlj=json.loads(url)
filterx=[x for x in urlj if x['??']['??']["attributes"]["operatingSystem"] == 'linux']
I'm not sure about the wildcard representation. I'll look into it and get back to you. Meanwhile, I have already worked with this json before so I can tell you how to access the information you need.
The information you need can be obtained as follows:
for each_product in urlj['products']:
    if urlj['products'][each_product]['attributes']['operatingSystem'] == "linux":
        pass  # your code here
If you need pricing information from the json you need to take the product code string and look into the priceDimensions field for it. Look at the sample json and code accordingly.
https://aws.amazon.com/blogs/aws/new-aws-price-list-api/
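On the wildcard part of the question: you don't need one, because iterating over the dictionary gives you every product key without knowing it in advance. Below is a minimal sketch on a made-up, heavily trimmed stand-in for urlj (the SKU names and values are invented). It assumes that some products may have no operatingSystem attribute and that the capitalisation of the value may differ, so it uses .get and compares in lower case.

```python
# Hypothetical, heavily trimmed stand-in for the parsed price list (urlj).
urlj = {
    "products": {
        "SKU-EXAMPLE-1": {"attributes": {"operatingSystem": "Linux"}},
        "SKU-EXAMPLE-2": {"attributes": {"operatingSystem": "Windows"}},
        "SKU-EXAMPLE-3": {"attributes": {}},
    }
}

# .items() yields every (sku, product) pair, so the SKU acts as the "wildcard".
linux_products = {
    sku: product
    for sku, product in urlj["products"].items()
    if product.get("attributes", {}).get("operatingSystem", "").lower() == "linux"
}

print(sorted(linux_products))
```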
I'm using the Google Visualization Library for python (gviz) to generate chart objects. This works great for generating JSON that can be read by the Google Charts using the DataTable.ToJSon method. What I'm trying to do now, however, is add multiple Google Chart data tables to one JSON dictionary. In other words, what I'm making now is this:
Chart_Data = JSON_Chart_Data_1
and what I want to make is this:
Chart_Data = {'Chart_1' : JSON_Chart_Data_1,
'Chart_2' : JSON_Chart_Data_2,}
Where Chart_Data is converted into a JSON string in both cases.
I'm pretty sure I can do this by converting the JSON string from gviz back into a python dictionary, compile the strings in a container dictionary as necessary, and then convert that container dictionary back into JSON, but that doesn't seem like a very elegant way to do it. Is there a better way? What I'm picturing is a .ToPythonObject method equivalent to .ToJSon, but there doesn't appear to be one in the library.
Thanks a lot,
Alex
I ended up doing my original, inelegant, solution to the problem with this function:
import json

def CombineJson(JSON_List):
    # JSON_List should be a list of tuples, each holding the dictionary key and the json string to go with it.
    # eg: [('json1', 'some json string'), ('json2', 'some other json string')]
    Python_Dict = {}
    for tup in JSON_List:
        parsed_json = json.loads(tup[1])
        Python_Dict[tup[0]] = parsed_json
    BigJson = json.dumps(Python_Dict)
    return BigJson
Thanks guys,
Alex
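For what it's worth, the same round trip can be written more compactly with a dictionary comprehension. This is just a sketch of the same idea, not a different technique: combine_json is a made-up name, and the two JSON strings are placeholders standing in for whatever gviz returns.

```python
import json


def combine_json(named_json):
    """Merge (key, json_string) pairs into one JSON string."""
    return json.dumps({key: json.loads(text) for key, text in named_json})


# Placeholder strings standing in for the output of the chart library.
chart_data = combine_json([
    ("Chart_1", '{"cols": [], "rows": []}'),
    ("Chart_2", '{"cols": [], "rows": [1]}'),
])

print(chart_data)
```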
I am trying to gather weather data from the national weather service and read it into a python script. They offer a JSON return, but they also offer another return which isn't formatted JSON but has more variables (which I want). This set of data looks like it is formatted as a python dictionary. It looks like this:
stations={
KAPC:
{
'id':'KAPC',
'stnid':'92',
'name':'Napa, Napa County Airport',
'elev':'33',
'latitude':'38.20750',
'longitude':'-122.27944',
'distance':'',
'provider':'NWS/FAA',
'link':'http://www.wrh.noaa.gov/mesowest/getobext.php?sid=KAPC',
'Date':'24 Feb 8:54 am',
'Temp':'39',
'TempC':'4',
'Dewp':'29',
'Relh':'67',
'Wind':'NE#6',
'Direction':'50°',
'Winds':'6',
'WindChill':'35',
'Windd':'50',
'SLP':'1027.1',
'Altimeter':'30.36',
'Weather':'',
'Visibility':'10.00',
'Wx':'',
'Clouds':'CLR',
[...]
So, to me, it looks like it's got a defined variable stations equal to a dictionary of dictionaries containing the stations and their variables. My question is how do I access this data. Right now I am trying:
import urllib
response = urllib.urlopen(url)
r = response.read()
If I try to use the JSON module, it clearly fails because this isn't json. And if I just try to read the file, it comes back with a long string of characters. Any suggestions on how to extract this data? If possible, I would just like to get the dictionary as it exists in the url return, ie stations={...} Thanks!
As far as I can infer from the question, you have data in the form of text that is not valid JSON. Given a line like line = "stations={'KAPC':{'id':'KAPC', 'stnid':'92', 'name':'Napa, Napa County Airport'}}", we can extract the dictionary by splitting the line at the first = symbol and passing the right-hand side to eval(), which evaluates it as a Python expression and returns the dictionary. Split only once, because the values themselves (the link field, for example) can contain = characters. Only do this with data from a source you trust, since eval() executes whatever code the text contains.
dictionary_text = line.split("=", 1)[1]
python_dictionary = eval(dictionary_text)
print(python_dictionary)
>>> {'KAPC': {'id': 'KAPC', 'name': 'Napa, Napa County Airport', 'stnid': '92'}}
The python_dictionary now behaves like a Python dictionary with key-value pairs, and you can access any attribute using python_dictionary["KAPC"]["id"]