Accessing parts of a json file - python

I have a JSON file named region_descriptions.json, available at http://visualgenome.org/static/data/dataset/region_descriptions.json.zip, which I suggest you download to understand the structure. Since the file is huge it doesn't open properly in most software (in my case Google Chrome did the trick). Inside this JSON file you will find many sentences as the value for the key "phrase". I need to write all the phrases (ONLY the phrases, in the SAME ORDER), one per line, to a .txt file.
I already got a .txt file by running the following code:
import json

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

f = open("text.txt", "w")
for regions_dict in json_data:
    for region in regions_dict["regions"]:
        print(region["phrase"])
        f.write(region["phrase"] + "\n")
But I found that some phrases are printed more than twice in a row, and there are empty lines in between, which seems strange. I cannot open the JSON file to check whether the .txt file I got is correct. Does anyone have a solution?

I am not entirely sure what you mean by "twice in a row." This solution works under the assumption that you mean "duplicate phrases."
import json

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

with open('test.txt', 'w') as f:
    all_phrases = []
    for regions_dict in json_data:
        for region in regions_dict["regions"]:
            all_phrases.append(region['phrase'])
    new_phrases = [phrase for phrase in all_phrases if phrase.strip()]  # all non-empty phrases
    new_phrases_again = [phrase for i, phrase in enumerate(new_phrases) if phrase not in new_phrases[:i]]  # keep a phrase only if it has not appeared earlier in new_phrases
    f.write("\n".join(new_phrases_again))
Sample test.txt output:
the clock is green in colour
shade is along the street
man is wearing sneakers
cars headlights are off
bikes are parked at the far edge
A sign on the facade of the building
A tree trunk on the sidewalk
A man in a red shirt
A brick sidewalk beside the street
The back of a white car

import json

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

for regions_dict in json_data:
    for region in regions_dict["regions"]:
        print(region["phrase"])
This should do the trick. It's just a matter of indexing the keys you want and understanding the structure of the data.
Something helpful can be to do stuff like this:
import sys
import json
import pprint

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

for regions_dict in json_data:
    pprint.pprint(regions_dict["regions"])
    sys.exit()
And you'll get a nicely formatted output so you can better 'see' what the structure looks like. It may be helpful to take a quick online course on lists and dictionaries to get an idea of how these objects hold data. Basically [ ] is a list of data and { } is a dictionary (key and value pairs). Here is where I got started: https://www.codecademy.com/learn/learn-python
The code should be working fine. If there are dupe phrases, that is because the .json has dupe phrases, and empty lines mean some of the phrases are empty. If you'd like a unique list of phrases you can build off the existing code: add each phrase to a list if it doesn't exist in the list already. Like this:
import json

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

phrase_list = []
for regions_dict in json_data:
    for region in regions_dict["regions"]:
        if region["phrase"] not in phrase_list:
            phrase_list.append(region["phrase"])
I would also suggest that in the future, if you can, you use a small sample of data instead of a huge file. It makes it easier to figure out what to do! Good luck!
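As a side note, checking membership in a plain list with not in is a linear scan, which gets slow on a file as big as region_descriptions.json. A sketch of the same idea using a set for the membership test while keeping the original order:

import json

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

seen = set()  # set membership checks are O(1) on average
unique_phrases = []
for regions_dict in json_data:
    for region in regions_dict["regions"]:
        phrase = region["phrase"]
        if phrase.strip() and phrase not in seen:
            seen.add(phrase)
            unique_phrases.append(phrase)

with open("text.txt", "w") as f:
    f.write("\n".join(unique_phrases))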

By the looks of the data, it's a list of dictionaries whose "regions" value is a list of region dictionaries.
CannedScientist beat me to the punch, doh!
my answer was going to look pretty similar, just without the last two list comprehensions
I was going to check for empty strings before appending.

Related

KeyError: 'primaryName' with csv.DictReader

first time poster - long time lurker.
I'm a little rough around the edges when it comes to Python and I'm running into an issue that I imagine has an easy fix.
I have a CSV file that I'm looking to read and perform a semi-advanced lookup.
Inspecting it in VS Code, the CSV in question is not really comma delimited, except for the last "column".
Example (screenshot taken directly from the file):
[screenshot of the raw file contents]
The code that seems to have issues is:
import csv
import sys
from util import Node, StackFrontier, QueueFrontier

names = {}
people = {}
titles = {}

def load_data(directory):
    with open(f"{directory}/file.csv", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            people[row["id"]] = {
                "primaryName": row["primaryName"],
                "birthYear": row["birthYear"],
                "titles": set()
            }
            if row["primaryName"].lower() not in names:
                names[row["primaryName"].lower()] = {row["nconst"]}
            else:
                names[row["primaryName"].lower()].add(row["nconst"])
The error I receive is:
File "C:\GitHub\Project\data-test.py", line 24, in load_data
"primaryName": row["primaryName"],
~~~^^^^^^^^^^^^^^^
KeyError: 'primaryName'
I've tried this with other CSV files that are comma delimited (screenshot example below):
[screenshot of a comma-delimited file]
And that works perfectly fine. I noticed the working CSV file has the names in quotes, which I imagine could be part of the solution.
Ultimately, if I can get it to work with the code above that would be great. Otherwise, is there an easy way to automatically reformat the CSV file to put quotation marks around the names and separate the values by commas, like the CSV that works above?
Thanks in advance for any help.
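One thing worth checking, as a guess from the column names (nconst, primaryName, birthYear look like IMDb's tab-separated name.basics file): csv.DictReader defaults to commas, so on a tab-delimited file the whole header line becomes a single key and every row lookup raises KeyError. A minimal sketch, assuming the file really is tab-separated:

import csv

# Sketch assuming tab-separated data; "file.csv" is the placeholder
# name from the question.
with open("file.csv", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    print(reader.fieldnames)  # verify the header split into separate columns
    for row in reader:
        print(row["primaryName"], row["birthYear"])
        break  # just peek at the first row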

Is there any feasible solution to read WOT battle results .dat files?

I am new here and trying to solve one of my interesting questions about World of Tanks. I heard that every battle's data is stored on the client's disk in the Wargaming.net folder, and I want to do batch data analysis of our clan's battle performance.
[screenshot of the .dat files in the Wargaming.net folder]
It is said that these .dat files are a kind of JSON file, so I tried a couple of lines of Python to read them, but failed.
import json
f = open('ex.dat', 'r', encoding='unicode_escape')
content = f.read()
a = json.loads(content)
print(type(a))
print(a)
f.close()
The code is very simple and obviously fails. Well, could anyone tell me what is really going on with these files?
Added on Feb. 9th, 2022
After trying another snippet in a Jupyter Notebook, it seems something can be read from the .dat files:
import struct
import numpy as np
import matplotlib.pyplot as plt
import io

with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    fbuff = io.BufferedReader(f)
    N = len(fbuff.read())
    print('byte length: ', N)

with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    data = struct.unpack('b' * N, f.read(1 * N))
The result is a tuple of integers, but I have no idea how to deal with it now.
Here's how you can parse some parts of it.
import pickle
import zlib
file = '4402905758116487.dat'
cache_file = open(file, 'rb')  # This could be improved to not keep the file open.
# When loading Python 2 pickles in Python 3, you need to use the "bytes" or "latin1" encoding.
legacyBattleResultVersion, brAllDataRaw = pickle.load(cache_file, encoding='bytes', errors='ignore')
arenaUniqueID, brAccount, brVehicleRaw, brOtherDataRaw = brAllDataRaw
# The data stored inside the pickled file will be a compressed pickle again.
vehicle_data = pickle.loads(zlib.decompress(brVehicleRaw), encoding='latin1')
account_data = pickle.loads(zlib.decompress(brAccount), encoding='latin1')
brCommon, brPlayersInfo, brPlayersVehicle, brPlayersResult = pickle.loads(zlib.decompress(brOtherDataRaw), encoding='latin1')
# Lastly you can print all of these and see a lot of data inside.
The result contains a mixture of more binary data as well as some data captured from the replays.
This is not a complete solution but it's a decent start to parsing these files.
First you can look at the replay file itself in a text editor. But it won't show the code at the beginning of the file that has to be cleaned out. Then there is a ton of info that you have to read in and figure out but it is the stats for each player in the game. THEN it comes to the part that has to do with the actual replay. You don't need that stuff.
You can grab the player IDs and tank IDs from WoT developer area API if you want.
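For what it's worth, a rough sketch of such an API call (hedged: you need your own application_id from developers.wargaming.net, and the parameter names here are from memory, so double-check them against the API docs):

import requests

# Hypothetical sketch: look up tank IDs via the Wargaming public API.
# YOUR_APP_ID is a placeholder for a key from developers.wargaming.net.
resp = requests.get(
    "https://api.worldoftanks.eu/wot/encyclopedia/vehicles/",
    params={"application_id": "YOUR_APP_ID", "fields": "tank_id,name"},
)
print(resp.json().get("data"))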
After loading the pickle files as gabzo mentioned, you will see that it is simply a list of values, and without knowing what each value refers to, it's hard to make sense of it. The identifiers for the values can be extracted from your game installation:
import zipfile

WOT_PKG_PATH = "Your/Game/Path/res/packages/scripts.pkg"
BATTLE_RESULTS_PATH = "scripts/common/battle_results/"

archive = zipfile.ZipFile(WOT_PKG_PATH, 'r')
for file in archive.namelist():
    if file.startswith(BATTLE_RESULTS_PATH):
        archive.extract(file)
You can then decompile the Python files (e.g. with uncompyle6) and go through the code to see the identifiers for the values.
One thing to note is that the list of values for the main pickle objects (like brAccount from gabzo's code) always has a checksum as the first value. You can use this to check whether you have the right order and the correct identifiers for the values. The way these checksums are generated can be seen in the decompiled python files.
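Once you have an identifier list from the decompiled sources, pairing it with the pickled values is a simple zip. A hypothetical sketch (identifiers is a stand-in for whatever ordered field list you recover from the battle_results modules; account_data comes from gabzo's code above):

# 'identifiers' is hypothetical: the ordered field names recovered from
# the decompiled battle_results modules.
identifiers = ['fieldA', 'fieldB', 'fieldC']  # placeholder names

# As noted above, the first value is a checksum, so split it off
# before pairing names with values.
checksum, *values = account_data
fields = dict(zip(identifiers, values))
print(checksum, fields)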
I have been tackling this problem for some time (albeit in Rust): https://github.com/dacite/wot-battle-results-parser/tree/main/datfile_parser.

Is there any way to create space between the words of the output text file

I used the following code to transcribe a YouTube video into text, but the output comes out a little weird. There is no space between some of the words and they are clubbed together.
# import libraries
from youtube_transcript_api import YouTubeTranscriptApi as yta
import re

# select any youtube video
vid_id = 'S4lTtvlFvyk'

# extract text
data = yta.get_transcript(vid_id)

# tidy up the transcript
transcript = ''
for value in data:
    for key, val in value.items():
        if key == 'text':
            transcript += val
l = transcript.splitlines()
final_tra = " ".join(l)

# write the transcript out to the file
file = open(r"C:\Users\user.name\Desktop\python\DATA\Video files\trans.txt", 'w')
file.write(final_tra)
file.close()
And the output file looks like:
check me outthe apple engineers went to the drawingboard to build a bettermask apple actually designed their veryown mask for their employees in store towear they've actually got a coupledifferent versionsbut this is kind of the standard this iswhat most employees will be wearing it'swhat most employeesof apple will have we've got some iphone12 later case news coming at the end ofthis video so stick around for thatwilly doo pulled it off plus someviewers of the lou later show downstairsthat got in touch with him so shout outto them anonymouslythis in front of me is the officialapplemask this is the reusable face mask inmedium largefor more information please visitwelcomeforward.apple.comwhat was crazy to me is on the packagingwhich is all very apple esque as you cantellwe have what looks like a serial numberdefinitely an item number and a lotnumber and production date sojust like everything else appletremendously detailed stuff over hereand an unboxing experience that lookslike it's kind of beyond
Some words are merged with each other with no space in between. Please suggest an appropriate solution.
This may not give you exactly the output format you want but it's more concise and overcomes the word merging issue. If you dump (print) the dictionary returned by get_transcript() you'll get a better idea of what's going on.
from youtube_transcript_api import YouTubeTranscriptApi as yta
import re

# select any youtube video
vid_id = 'S4lTtvlFvyk'

# build the transcript, keeping one entry per caption
transcript = []
for value in yta.get_transcript(vid_id):
    transcript.append(value['text'])
final_tra = ' '.join(transcript)

# write the transcript out to the file
with open(r'C:\Users\user.name\Desktop\python\DATA\Video files\trans.txt', 'w') as outfile:
    outfile.write(final_tra)
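For reference, a quick way to do the dump mentioned above. As far as I know each entry returned by get_transcript() is a dict with a 'text' key plus timing fields ('start' and 'duration'), which is why joining the 'text' values with spaces fixes the merging:

from youtube_transcript_api import YouTubeTranscriptApi as yta

# Print the first few entries to see the structure before joining them.
for entry in yta.get_transcript('S4lTtvlFvyk')[:3]:
    print(entry)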

using python to extract data with duplicate names from a json array

I'll start by apologising if I use the wrong terms here; I am a rank beginner with Python.
I have a JSON array containing 5 sets of data. The corresponding items in each set have duplicate names. I can extract them in Java but not in Python. The items I want are called "summary_polyline". I have tried so many different ways in the last couple of weeks; so far nothing works.
This is the relevant part of my Python:
#!/usr/bin/env python3.6
import os
import sys
from dotenv import load_dotenv, find_dotenv
import polyline
import matplotlib.pyplot as plt
import json

with open('/var/www/vk7krj/running/strava_activities.json', 'rt') as myfile:
    contents = myfile.read()

#print (contents)
#print (contents["summary_polyline"[1]])
activity1 = contents("summary_polyline"[1])
If I un-comment print (contents), it prints the file to the screen OK.
I ran the JSON through an online JSON format checker and it passed OK.
How do I extract the five "summary_polylines" and assign them to "activity1" through "activity5"?
If I understand you right, you need to convert the text data that was read from the file to JSON first.
with open('/var/www/vk7krj/running/strava_activities.json', 'rt') as myfile:
    contents = myfile.read()

# json_contents is a list of dicts now
json_contents = json.loads(contents)

# list with activities
activities = []
for dict_item in json_contents:
    activities.append(dict_item)

# print all activities (whole file)
print(activities)

# print first activity
print(activities[0])

# print second activity
print(activities[1])
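The snippet above stops at whole activities. To get the specific field the question asks for, pull "summary_polyline" out of each activity dict. A sketch, assuming the key sits at the top level of each activity (Strava's API nests it under "map", so adjust the lookup if your file does too):

# Collect every summary_polyline; .get() returns None instead of
# raising if the key is missing from an activity.
polylines = [item.get("summary_polyline") for item in json_contents]

activity1 = polylines[0]
activity2 = polylines[1]
# ...and so on up to activity5, though indexing the list directly is easier.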

Importing JSON in Python and Removing Header

I'm trying to write a simple JSON to CSV converter in Python for Kiva. The JSON file I am working with looks like this:
{"header":{"total":412045,"page":1,"date":"2012-04-11T06:16:43Z","page_size":500},"loans":[{"id":84,"name":"Justine","description":{"languages":["en"], REST OF DATA
The problem is, when I use json.load, I only get the strings "header" and "loans" in data, but not the actual information such as id, name, description, etc. How can I skip over everything until the [? I have a lot of files to process, so I can't manually delete the beginning in each one. My current code is:
import csv
import json

fp = csv.writer(open("test.csv", "wb+"))
f = open("loans/1.json")
data = json.load(f)
f.close()

for item in data:
    fp.writerow([item["name"]] + [item["posted_date"]] + OTHER STUFF)
Instead of
for item in data:
use
for item in data['loans']:
The header is stored in data['header'] and data itself is a dictionary, so you'll have to key into it in order to access the data.
data is a dictionary, so for item in data iterates the keys.
You probably want for loan in data['loans']:
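Putting both suggestions together, a minimal sketch of the converter (using only the name and posted_date columns shown in the question; the elided OTHER STUFF columns would follow the same pattern):

import csv
import json

with open("loans/1.json") as f:
    data = json.load(f)

# data is a dict with "header" and "loans" keys; iterate only the loans.
with open("test.csv", "w", newline="") as out:
    fp = csv.writer(out)
    for loan in data["loans"]:
        fp.writerow([loan["name"], loan["posted_date"]])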
