Retrieve specific data from JSON file in Python

Hello everyone and happy new year! I'm posting here for the first time since I can't find an answer to my problem. I've been searching for two days now and can't figure out what to do, and it's driving me crazy... Here's my problem:
I'm using the Steamspy API to download data for the 100 most played games on Steam in the last two weeks. I can dump the downloaded data into a .json file (dumped_data.json in this case) and import it back into my Python code (I can print it). But while I can print the whole thing, it's not really usable as it's 1900 lines of more or less useful data, so I want to be able to print only specific items of the dumped data. In this project, I would like to print only the name, developer, owners and price of the first 10 games on the list. It's my first Python project, and I can't figure out how to do this...
Here's the Python code:
import steamspypi
import json

# gets data using the steamspy api
data_request = dict()
data_request["request"] = "top100in2weeks"

# dumps the data into a json file
steam_data = steamspypi.download(data_request)
with open("dumped_data.json", "w") as jsonfile:
    json.dump(steam_data, jsonfile, indent=4)

# print values from the dumped file
f = open("dumped_data.json")
data = json.load(f)
print(data)
And here's an example of the first item in the JSON file:
{
    "570": {
        "appid": 570,
        "name": "Dota 2",
        "developer": "Valve",
        "publisher": "Valve",
        "score_rank": "",
        "positive": 1378093,
        "negative": 267290,
        "userscore": 0,
        "owners": "100,000,000 .. 200,000,000",
        "average_forever": 35763,
        "average_2weeks": 1831,
        "median_forever": 1079,
        "median_2weeks": 1020,
        "price": "0",
        "initialprice": "0",
        "discount": "0",
        "ccu": 553462
    },
Thanks in advance to everyone that's willing to help me, it would mean a lot.

The following prints the values you desire from the first 10 games:
for game in list(data.values())[:10]:
    print(game["name"], game["developer"], game["owners"], game["price"])
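Since Python 3.7, dicts preserve insertion order, so data.values() yields the games in the order the API returned them. For completeness, a minimal sketch of the whole read-and-print step, assuming the dumped_data.json file created by the code above:
import json

# Re-open the file produced by json.dump() above
with open("dumped_data.json") as f:
    data = json.load(f)

# Print one line per game for the first 10 entries
for game in list(data.values())[:10]:
    print(game["name"], game["developer"], game["owners"], game["price"])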

Related

Python parse JSON object and ask for selection of values

I am new to programming in general. I am trying to make a Python script that helps with making part numbers. Here, as an example, computer memory modules.
I have a Python script that needs to read the bmt object from a JSON file, ask the user to select one of its keys, and then append the selection to a string.
{"common": {
"bmt": {
"DR1": "DDR",
"DR2": "DDR2",
"DR3": "DDR3",
"DR4": "DDR4",
"DR5": "DDR5",
"DR6": "DDR6",
"FER": "FeRAM",
"GD1": "GDDR",
"GD2": "GDDR2",
"GD3": "GDDR3",
"GD4": "GDDR4",
"GD5": "GDDR5",
"GX5": "GDDR5X",
"GD6": "GDDR6",
"GX6": "GDDR6X",
"LP1": "LPDDR",
"LP2": "LPDDR2",
"LP3": "LPDDR3",
"LP4": "LPDDR4",
"LX4": "LPDDR4X",
"LP5": "LPDDR5",
"MLP": "mDDR",
"OPX": "Intel Optane/Micron 3D XPoint",
"SRM": "SRAM",
"WRM": "WRAM"
},
"modtype": {
"CM": "Custom Module",
"DD": "DIMM",
"MD": "MicroDIMM",
"SD": "SODIMM",
"SM": "SIMM",
"SP": "SIPP",
"UD": "UniDIMM"
}
}}
Example: There is a string called "code" that already has the value "MMD". The script asks the user to select one of the listed values (e.g. "DR1"). If a selection is made (the user enters "DR1"), it appends that value to the string, so the new value would be "MMDDR1".
This code is to print the JSON. This is how far I have gotten
def enc(code):
    memdjson = json.loads("memd.json")  # bug: json.loads parses a string, it does not open a file
    print(memdjson)
How do I do this?
Repo with the rest of the code is here: https://github.com/CrazyblocksTechnologies/PyCIPN
Try:
import json

def enc(code):
    # json.load (not loads) reads from an open file object
    memdjson = json.load(open("memd.json"))["common"]["bmt"]
    selected = input("Select a value from the following: \n{}\n\n".format(' '.join(memdjson.keys())))
    return code + memdjson[selected]
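Note that the question's example ("MMD" plus a selection of "DR1" should give "MMDDR1") appends the selected key itself rather than its value, so depending on the intent the last line may need to be:
return code + selected  # "MMD" + "DR1" -> "MMDDR1"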
import json
import inquirer

with open(path_to_json_file) as f:
    file = json.load(f)

code = "MMD"
questions = [
    inquirer.List('input',
                  message="Select one of the following:",
                  choices=list(file["common"]["bmt"].keys()),
                  ),
]
answer = inquirer.prompt(questions)
code += answer['input']  # inquirer.prompt returns a dict, so index the selection by question name
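A quick check of the result, assuming the user picks "DR1" from the menu:
print(answer)  # {'input': 'DR1'}
print(code)    # 'MMDDR1'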

How to parse tab-delimited text file with 4th column as json and remove certain keys?

I have a text file that is 26 GB. The line format is as follows:
/type/edition /books/OL10000135M 4 2010-04-24T17:54:01.503315 {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}
I'm trying to get only the last column, which is JSON, and from that JSON I'm only trying to save the "title", "isbn_13" and "isbn_10".
I was able to save only the last column with this code:
import sys
import csv

csv.field_size_limit(sys.maxsize)

# File names: to read in from and read out to
input_file = '../inputFile/ol_dump_editions_2019-10-31.txt'
output_file = '../outputFile/output.txt'

## ==================== ##
##  Using module 'csv'  ##
## ==================== ##
with open(input_file) as to_read:
    with open(output_file, "w") as tmp_file:
        reader = csv.reader(to_read, delimiter="\t")
        writer = csv.writer(tmp_file)
        desired_column = [4]  # text column
        for row in reader:  # read one row at a time
            myColumn = list(row[i] for i in desired_column)  # build the output row (process)
            writer.writerow(myColumn)  # write it
but this doesn't return a proper JSON object; instead it returns everything with double quotations next to it. Also, how would I extract certain values from the JSON as a new JSON?
EDIT:
"{""publishers"": [""Bernan Press""], ""physical_format"": ""Hardcover"", ""subtitle"": ""9th November - 3rd December, 1992"", ""key"": ""/books/OL10000135M"", ""title"": ""Parliamentary Debates, House of Lords, Bound Volumes, 1992-93"", ""identifiers"": {""goodreads"": [""6850240""]}, ""isbn_13"": [""9780107805401""], ""languages"": [{""key"": ""/languages/eng""}], ""number_of_pages"": 64, ""isbn_10"": [""0107805405""], ""publish_date"": ""December 1993"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-24T17:54:01.503315""}, ""authors"": [{""key"": ""/authors/OL2645777A""}], ""latest_revision"": 4, ""works"": [{""key"": ""/works/OL7925046W""}], ""type"": {""key"": ""/type/edition""}, ""subjects"": [""Government - Comparative"", ""Politics / Current Events""], ""revision"": 4}"
EDIT 2:
So I'm trying to read this file, which is a tab-separated file with the following columns:
type - type of record (/type/edition, /type/work etc.)
key - unique key of the record. (/books/OL1M etc.)
revision - revision number of the record
last_modified - last modified timestamp
JSON - the complete record in JSON format
I'm trying to read the JSON, and from that JSON I'm only trying to get the "title", "isbn_13" and "isbn_10" as JSON and save it to the file as a row,
so every row should look like the original but with only those keys and values.
Here's a straightforward way of doing it. You would need to repeat this, extracting the desired data from each line of the file as it's read line by line (the default way text-file reading is handled in Python).
import json
line = '/type/edition /books/OL10000135M 4 2010-04-24T17:54:01.503315 {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}'
csv_cols = line.split('\t')
json_data = json.loads(csv_cols[4])
#print(json.dumps(json_data, indent=4))
desired = {key: json_data[key] for key in ("title", "isbn_13", "isbn_10")}
result = json.dumps(desired, indent=4)
print(result)
Output from sample line:
{
    "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93",
    "isbn_13": [
        "9780107805401"
    ],
    "isbn_10": [
        "0107805405"
    ]
}
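A sketch of the same extraction applied to the whole file, streaming it line by line so the 26 GB never has to fit in RAM (file names are the question's; record.get() is used in case a record lacks one of the keys):
import json

input_file = '../inputFile/ol_dump_editions_2019-10-31.txt'
output_file = '../outputFile/output.txt'

with open(input_file) as to_read, open(output_file, "w") as out:
    for line in to_read:
        record = json.loads(line.split('\t')[4])  # the 5th column holds the JSON
        desired = {key: record.get(key) for key in ("title", "isbn_13", "isbn_10")}
        out.write(json.dumps(desired) + "\n")  # one compact JSON object per row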
So given that your current code returns the following:
result = '{""publishers"": [""Bernan Press""], ""physical_format"": ""Hardcover"", ""subtitle"": ""9th November - 3rd December, 1992"", ""key"": ""/books/OL10000135M"", ""title"": ""Parliamentary Debates, House of Lords, Bound Volumes, 1992-93"", ""identifiers"": {""goodreads"": [""6850240""]}, ""isbn_13"": [""9780107805401""], ""languages"": [{""key"": ""/languages/eng""}], ""number_of_pages"": 64, ""isbn_10"": [""0107805405""], ""publish_date"": ""December 1993"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-24T17:54:01.503315""}, ""authors"": [{""key"": ""/authors/OL2645777A""}], ""latest_revision"": 4, ""works"": [{""key"": ""/works/OL7925046W""}], ""type"": {""key"": ""/type/edition""}, ""subjects"": [""Government - Comparative"", ""Politics / Current Events""], ""revision"": 4}'
Looks like what you need to do first is replace those double-double-quotes with regular double quotes, otherwise things are not parseable:
res = result.replace('""','"')
Now res is convertible to a JSON object:
import json
my_json = json.loads(res)
my_json now looks like this:
{'authors': [{'key': '/authors/OL2645777A'}],
 'identifiers': {'goodreads': ['6850240']},
 'isbn_10': ['0107805405'],
 'isbn_13': ['9780107805401'],
 'key': '/books/OL10000135M',
 'languages': [{'key': '/languages/eng'}],
 'last_modified': {'type': '/type/datetime',
                   'value': '2010-04-24T17:54:01.503315'},
 'latest_revision': 4,
 'number_of_pages': 64,
 'physical_format': 'Hardcover',
 'publish_date': 'December 1993',
 'publishers': ['Bernan Press'],
 'revision': 4,
 'subjects': ['Government - Comparative', 'Politics / Current Events'],
 'subtitle': '9th November - 3rd December, 1992',
 'title': 'Parliamentary Debates, House of Lords, Bound Volumes, 1992-93',
 'type': {'key': '/type/edition'},
 'works': [{'key': '/works/OL7925046W'}]}
You can conveniently get any field you want from this object:
my_json['title']
# 'Parliamentary Debates, House of Lords, Bound Volumes, 1992-93'
my_json['isbn_10'][0]
# '0107805405'
Especially because your example is so large, I'd recommend using a specialized library such as pandas, which has a read_csv method, or even dask, which supports out-of-memory operations.
Both of these systems will automatically parse out the quotations for you, and dask will do so in "pieces" directly from disk, so you never have to try to load 26 GB into RAM.
In both libraries, you can then access the columns you want like this:
import pandas as pd

# header=None because the dump has no header row; the names follow the
# column list in EDIT 2 (the same call works with dask.dataframe's read_csv)
data = pd.read_csv(PATH, sep="\t", header=None,
                   names=["type", "key", "revision", "last_modified", "json"])
data["json"]
You can then parse these rows either using json.loads() (import json) or you can use the pandas/dask json implementations. If you can give some more details of what you're expecting, I can help you draft a more specific code example.
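For example, a minimal sketch of that json.loads step on the pandas column above, keeping only the wanted keys (the column names are the assumption from EDIT 2):
import json

def trim(raw):
    record = json.loads(raw)
    return {key: record.get(key) for key in ("title", "isbn_13", "isbn_10")}

trimmed = data["json"].map(trim)  # one small dict per row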
Good luck!
I saved your data to a file to see if I could read just the rows; let me know if this works:
lines = zzread.split('\n')  # zzread holds the file contents as one string
temp = []
for to_read in lines:
    if len(to_read) == 0:
        break
    new_to_read = '{' + to_read.split('{', 1)[1]  # keep only the JSON part of the line
    temp.append(json.loads(new_to_read))
for row in temp:
    print(row['isbn_13'])
If that works this should create a json for you:
lines = zzread.split('\n')
temp = []
for to_read in lines:
    if len(to_read) == 0:
        break
    new_to_read = '{' + to_read.split('{', 1)[1]
    temp.append(json.loads(new_to_read))
new_json = []
for row in temp:
    new_json.append({'title': row['title'], 'isbn_13': row['isbn_13'], 'isbn_10': row['isbn_10']})
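And to persist the result, a one-step sketch (the output file name is an assumption):
import json

with open("trimmed.json", "w") as f:
    json.dump(new_json, f, indent=4)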

Python code breaks when attempting to download larger zipped csv file, works fine on smaller file

While working with small zip files (about 8 MB) containing 25 MB of CSV files, the below code works exactly as it should. As soon as I attempt to download larger files (a 45 MB zip file containing a 180 MB CSV), the code breaks and I get the following error message:
(venv) ufulu#ufulu awr % python get_awr_ranking_data.py
https://api.awrcloud.com/v2/get.php?action=get_topsites&token=REDACTED&project=REDACTED Client+%5Bw%5D&fileName=2017-01-04-2019-10-09
Traceback (most recent call last):
File "get_awr_ranking_data.py", line 101, in <module>
getRankingData(project['name'])
File "get_awr_ranking_data.py", line 67, in getRankingData
processRankingdata(rankDateData['details'])
File "get_awr_ranking_data.py", line 79, in processRankingdata
domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
AttributeError: 'float' object has no attribute 'split'
My goal is to download data for 170 projects and save the data to a SQLite DB.
Please bear with me as I am a novice in the field of programming and Python. I would greatly appreciate any help fixing the code below, as well as any other suggestions and improvements to make the code more robust and pythonic.
Thanks in advance
from dotenv import dotenv_values
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
from sqlalchemy import create_engine

# SQL Alchemy setup
engine = create_engine('sqlite:///rankingdata.sqlite', echo=False)

# Excerpt from the initial API call
data = {'projects': [{'name': 'Client1',
                      'id': '168',
                      'frequency': 'daily',
                      'depth': '5',
                      'kwcount': '80',
                      'last_updated': '2019-10-01',
                      'keywordstamp': 1569941983},
                     {"depth": "5",
                      "frequency": "ondemand",
                      "id": "194",
                      "kwcount": "10",
                      "last_updated": "2019-09-30",
                      "name": "Client2",
                      "timestamp": 1570610327},
                     {"depth": "5",
                      "frequency": "ondemand",
                      "id": "196",
                      "kwcount": "100",
                      "last_updated": "2019-09-30",
                      "name": "Client3",
                      "timestamp": 1570610331}
                     ]}

# setup
api_url = 'https://api.awrcloud.com/v2/get.php?action='
urls = []        # processed URLs
urlbacklog = []  # URLs that didn't return a downloadable file

# API call to receive the URL containing the downloadable zip and csv
def getRankingData(project):
    action = 'get_dates'
    response = requests.get(''.join([api_url, action]),
                            params=dict(token=dotenv_values()['AWR_API'],
                                        project=project))
    response = response.json()
    action2 = 'topsites_export'
    rankDateData = requests.get(''.join([api_url, action2]),
                                params=dict(token=dotenv_values()['AWR_API'],
                                            project=project,
                                            startDate=response['details']['dates'][0]['date'],
                                            stopDate=response['details']['dates'][-1]['date']))
    rankDateData = rankDateData.json()
    print(rankDateData['details'])
    urls.append(rankDateData['details'])
    processRankingdata(rankDateData['details'])

# API call to download and unzip csv data and process it in pandas
def processRankingdata(url):
    content = requests.get(url)
    # {"response_code":25,"message":"Export in progress. Please come back later"}
    if "response_code" not in content:
        f = ZipFile(BytesIO(content.content))
        # print(f.namelist()) to get all filenames in the zip
        with f.open(f.namelist()[0], 'r') as g:
            rankingdatadf = pd.read_csv(g)
        rankingdatadf = rankingdatadf[rankingdatadf['Search Engine'].str.contains("Google")]
        domain = []
        for row in rankingdatadf['URL']:
            domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
        rankingdatadf['Domain'] = domain
        rankingdatadf['Domain'] = rankingdatadf['Domain'].str.replace('www.', '')
        rankingdatadf = rankingdatadf.drop(columns=['Title', 'Meta description', 'Snippet', 'Page'])
        print(rankingdatadf['Search Engine'][0])
        writeData(rankingdatadf)
    else:
        urlbacklog.append(url)
        pass

# Finally write the data to the database
def writeData(rankingdatadf):
    table_name_from_file = project['name']
    check = engine.has_table(table_name_from_file)
    print(check)  # boolean
    if check == False:
        rankingdatadf.to_sql(table_name_from_file, con=engine)
        print(project['name'] + ' ...Done')
    else:
        print(project['name'] + ' ... already in DB')

for project in data['projects']:
    getRankingData(project['name'])
The problem seems to be the split call on a float and not necessarily the download. Try changing line 79
from
domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
to
domain.append(str(str(str(row).split("//")[-1]).split("/")[0]).split('?')[0])
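Why this helps: pandas stores missing values as the float NaN, and floats have no .split() method; wrapping the row in str() avoids the crash, though it yields the literal string 'nan' for those rows rather than a real domain. A tiny illustration:
row = float("nan")    # what pandas stores for a missing URL cell
# row.split("//")     # AttributeError: 'float' object has no attribute 'split'
str(row).split("//")  # ['nan'] -- no crash, but not a useful domain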
It looks like you're trying to parse the network location portion of the URL here; you can use urllib.parse to make this easier instead of chaining all the splits:
from urllib.parse import urlparse
...
for row in rankingdatadf['URL']:
    domain.append(urlparse(row).netloc)
I think a malformed URL is causing you issues; try this to diagnose the issue:
for row in rankingdatadf['URL']:
    try:
        domain.append(urlparse(row).netloc)
    except Exception:
        exit(row)
Looks like you figured it out above: you have a database entry with a NULL value for the URL field. Not sure what your fidelity requirements for this data set are, but you might want to enforce database rules for the URL field, or use pandas to drop rows where the URL is NaN.
rankingdatadf = rankingdatadf.dropna(subset=['URL'])
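Putting the two fixes together, a minimal sketch assuming rankingdatadf as in the question:
from urllib.parse import urlparse

# Drop rows with a missing URL first, then parse out the network location
rankingdatadf = rankingdatadf.dropna(subset=['URL'])
rankingdatadf['Domain'] = rankingdatadf['URL'].map(lambda u: urlparse(u).netloc)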

How to write line by line request output

I am trying to write, line by line, the JSON output from my Python request. I already checked some similar issues on Stack Overflow in the question write to file line by line python, without success.
Here is the code:
myfile = open("data.txt", "a")
for item in pretty_json["geonames"]:
    print(item["geonameId"], item["name"])
    myfile.write("%s\n" % item["geonameId"] + "https://www.geonames.org/" + item["name"])
myfile.close()
Here is the output from my pretty_json["geonames"]:
{
    "adminCode1": "FR",
    "lng": "7.2612",
    "geonameId": 2661847,
    "toponymName": "Aeschlenberg",
    "countryId": "2658434",
    "fcl": "P",
    "population": 0,
    "countryCode": "CH",
    "name": "Aeschlenberg",
    "fclName": "city, village,...",
    "adminCodes1": {
        "ISO3166_2": "FR"
    },
    "countryName": "Switzerland",
    "fcodeName": "populated place",
    "adminName1": "Fribourg",
    "lat": "46.78663",
    "fcode": "PPL"
}
Then, as output saved on my data.txt, I'm having :
11048419
https://www.geonames.org/Aïre2661847
https://www.geonames.org/Aeschlenberg2661880
https://www.geonames.org/Aarberg6295535
The expected result should be something like:
Aïre , https://www.geonames.org/11048419
Aeschlenberg , https://www.geonames.org/2661847
Aarberg , https://www.geonames.org/2661880
Could writing the output in CSV be a solution?
Regards.
Using the csv module.
Ex:
import csv

with open("data.txt", "a") as myfile:
    writer = csv.writer(myfile)  # Create writer object
    for item in pretty_json["geonames"]:  # Iterate list
        writer.writerow([item["name"], "https://www.geonames.org/{}".format(item["geonameId"])])  # Write row
If I understand correctly, you want the same screen output in your file. That's easy. If you are on Python 3, just add the file argument to your print call:
print(item["geonameId"], item["name"], file=myfile)
Just compose a proper printing format for the needed items:
...
for item in pretty_json["geonames"]:
    print("{}, https://www.geonames.org/{}".format(item["name"], item["geonameId"]))
Sample output:
Aeschlenberg, https://www.geonames.org/2661847
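To get the same lines both on screen and into the file, a small sketch reusing the question's variables:
with open("data.txt", "a") as myfile:
    for item in pretty_json["geonames"]:
        line = "{}, https://www.geonames.org/{}".format(item["name"], item["geonameId"])
        print(line)                # screen
        myfile.write(line + "\n")  # file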

Python - Key not found in JSON data, how to parse other elements?

I have data from a Facebook group feed (24,000-odd records in total). E.g.:
{
    "data": [
        {
            "message": "MoneyWise its time to vote for the 2017 winners https://www.moneywise.co.uk/home-finances-survey?",
            "updated_time": "2017-07-27T21:15:52+0000",
            "permalink_url": "https://www.facebook.com/groups/uwpartnersforum/permalink/1745120025791166/",
            "from": {
                "name": "John Oliver",
                "id": "10152744793754666"
            },
            "id": "1452979881671850_1745120025791166"
        },
        {
            "message": "We often think of communicating as figuring out a really good message and leaving it that. But the annoying fact is that unless we pay close attention to how that message is landing on the other person, not much communication will take place - Alan Alda",
            "updated_time": "2017-07-27T21:15:26+0000",
            "permalink_url": "https://www.facebook.com/groups/uwpartnersforum/permalink/1744867295816439/",
            "from": {
                "name": "Adrian Watts",
                "id": "10152461880942242"
            },
            "id": "1452979881671850_1744867295816439"
        }
    ]
}
and I am trying to extract, on the command prompt and in a file, the "message", "permalink_url", "updated_time", "name" and "id" (the one inside "from") of posts by a particular person, say "John Oliver". The following Python script works... mostly:
import json

fhand = open('try1.json')
urlData = fhand.read()
jsonData = json.loads(urlData)
fout = open('output1.txt', 'w')
for i in jsonData["data"]:
    if i["from"]["name"] == "John Oliver":
        print(i["message"], end="|")
        print(i["permalink_url"], end="|")
        print(i["updated_time"], end="|")
        print(i["from"]["name"], end="|")
        print(i["from"]["id"], end="\n")
        print()
        fout.write(str(i["message"]) + "|")
        fout.write(str(i["permalink_url"]) + "|")
        fout.write(str(i["updated_time"]) + "|")
        fout.write(str(i["from"]["name"]) + "|")
        fout.write(str(i["from"]["id"]) + "\n")
fout.close()
But I am facing two issues.
Issue 1. If a record has no message, I get a traceback:
Traceback (most recent call last):
File "facebook_feed.py", line 36, in <module>
main()
File "facebook_feed.py", line 25, in main
print (i["message"], end = "|")
KeyError: 'message'
So, I need some help going through the complete file, extracting all the other details from an object even if it has no message.
Issue 2. And this is a strange one... I have two files: "try1.json" with 500-odd records and "trial1.json" with 24,000-odd records, both with exactly the same structure. When I open "try1.json" in the Atom text editor, the smaller file is colour-highlighted, but the bigger file "trial1.json" is not. On running the above script with try1.json, I get the KeyError for "message" (as shown above), but for "trial1.json" I get this:
Traceback (most recent call last):
File "facebook_feed.py", line 36, in <module>
main()
File "facebook_feed.py", line 20, in main
if i["from"]["name"] == "John Oliver":
KeyError: 'from'
trial1.json is a 17 MB file... is that an issue?
If you're not sure if i["message"] exists, don't just blindly access it. Either use dict.get, e.g. i.get('message', 'No message found'), or check if it's there first:
if "message" in i:
print (i["message"], end = "|")
You can do the same kind of thing with i["from"].
Atom isn't highlighting the big file because it's big. But if you can successfully json.loads something, it's valid JSON.
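Applied to the question's loop, a minimal defensive sketch (every potentially missing key goes through dict.get):
import json

with open('try1.json') as fhand:
    jsonData = json.load(fhand)

with open('output1.txt', 'w') as fout:
    for i in jsonData["data"]:
        frm = i.get("from", {})  # tolerate a missing "from" block
        if frm.get("name") == "John Oliver":
            fields = [str(i.get("message", "No message found")),
                      str(i.get("permalink_url", "")),
                      str(i.get("updated_time", "")),
                      str(frm.get("name", "")),
                      str(frm.get("id", ""))]
            print(" | ".join(fields))
            fout.write("|".join(fields) + "\n")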
