using python to extract data with duplicate names from a json array - python

I'll start by apologising if I use the wrong terms here; I am a rank beginner with Python.
I have a JSON array containing 5 sets of data. The corresponding items in each set have duplicate names. I can extract them in Java but not in Python. The items I want are called "summary_polyline". I have tried so many different ways in the last couple of weeks; so far nothing works.
This is the relevant part of my Python:
#!/usr/bin/env python3.6
import os
import sys
from dotenv import load_dotenv, find_dotenv
import polyline
import matplotlib.pyplot as plt
import json
with open('/var/www/vk7krj/running/strava_activities.json', 'rt') as myfile:
    contents = myfile.read()
#print (contents)
#print (contents["summary_polyline"[1]])
activity1 = contents("summary_polyline"[1])
If I un-comment "print (contents)", it prints the file to the screen OK.
I ran the JSON through an online JSON format checker and it passed OK.
How do I extract the five "summary_polyline" items and assign them to "activity1" through "activity5"?

If I understand you correctly, you need to convert the text data that was read from the file into JSON.
import json

with open('/var/www/vk7krj/running/strava_activities.json', 'rt') as myfile:
    contents = myfile.read()

# json_contents is a list of dicts now
json_contents = json.loads(contents)

# list with activities
activities = []
for dict_item in json_contents:
    activities.append(dict_item)

# print all activities (whole file)
print(activities)

# print first activity
print(activities[0])

# print second activity
print(activities[1])
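The snippet above stops at printing whole activities. To pull out the polylines specifically, something along the lines of this sketch may work; it assumes each activity dict carries "summary_polyline" either at the top level or nested under a "map" key (as in Strava's API responses), so check your own file:
import json

# Path taken from the question; adjust to your own file.
with open('/var/www/vk7krj/running/strava_activities.json', 'rt') as myfile:
    json_contents = json.load(myfile)

polylines = []
for activity in json_contents:
    # Assumption: the field sits at the top level or under "map".
    summary = activity.get("summary_polyline") or activity.get("map", {}).get("summary_polyline")
    if summary:
        polylines.append(summary)

activity1 = polylines[0]  # activity2 = polylines[1], and so on
print(activity1)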

Related

Is there any feasible solution to read WOT battle results .dat files?

I am new here and trying to solve an interesting question about World of Tanks. I heard that every battle's data is stored on the client's disk in the Wargaming.net folder, and I want to do some batch data analysis of our clan's battle performance.
It is said that these .dat files are a kind of JSON file, so I tried a couple of lines of Python code to read one, but failed.
import json
f = open('ex.dat', 'r', encoding='unicode_escape')
content = f.read()
a = json.loads(content)
print(type(a))
print(a)
f.close()
The code is very simple and obviously fails. Could anyone tell me what is really going on with these files?
Added on Feb. 9th, 2022
After trying another snippet in a Jupyter Notebook, it seems something can be read from the .dat files:
import struct
import numpy as np
import matplotlib.pyplot as plt
import io

with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    fbuff = io.BufferedReader(f)
    N = len(fbuff.read())
    print('byte length: ', N)

with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    data = struct.unpack('b' * N, f.read(1 * N))
The result is a tuple of byte values, but I have no idea how to deal with it now.
Here's how you can parse some parts of it.
import pickle
import zlib

file = '4402905758116487.dat'
cache_file = open(file, 'rb')  # This can be improved to not keep the file opened.

# To load pickles written by Python 2 in Python 3, use the "bytes" or "latin1" encoding.
legacyBattleResultVersion, brAllDataRaw = pickle.load(cache_file, encoding='bytes', errors='ignore')
arenaUniqueID, brAccount, brVehicleRaw, brOtherDataRaw = brAllDataRaw

# The data stored inside the pickled file is a compressed pickle again.
vehicle_data = pickle.loads(zlib.decompress(brVehicleRaw), encoding='latin1')
account_data = pickle.loads(zlib.decompress(brAccount), encoding='latin1')
brCommon, brPlayersInfo, brPlayersVehicle, brPlayersResult = pickle.loads(zlib.decompress(brOtherDataRaw), encoding='latin1')

# Lastly you can print all of these and see a lot of data inside.
The response contains a mixture of more binary files as well as some data captured from the replays.
This is not a complete solution but it's a decent start to parsing these files.
First, you can look at the replay file itself in a text editor, but it won't render the code at the beginning of the file, which has to be cleaned out. Then there is a ton of info that you have to read in and figure out, but it is the stats for each player in the game. Then comes the part that deals with the actual replay; you don't need that stuff.
You can grab the player IDs and tank IDs from WoT developer area API if you want.
After loading the pickle files as gabzo mentioned, you will see that it is simply a list of values, and without knowing what each value refers to, it's hard to make sense of it. The identifiers for the values can be extracted from your game installation:
import zipfile

WOT_PKG_PATH = "Your/Game/Path/res/packages/scripts.pkg"
BATTLE_RESULTS_PATH = "scripts/common/battle_results/"

archive = zipfile.ZipFile(WOT_PKG_PATH, 'r')
for file in archive.namelist():
    if file.startswith(BATTLE_RESULTS_PATH):
        archive.extract(file)
You can then decompile the Python files (with uncompyle6) and go through the code to see the identifiers for the values.
One thing to note is that the list of values for each of the main pickled objects (like brAccount in gabzo's code) always has a checksum as its first value. You can use this to check whether you have the right order and the correct identifiers for the values. The way these checksums are generated can be seen in the decompiled Python files.
I have been tackling this problem for some time (albeit in Rust): https://github.com/dacite/wot-battle-results-parser/tree/main/datfile_parser.

How to make infinite Rest API requests and store the information?

I want to make one request per second and store all the data in a txt file.
However, saving the tuples to the txt file doesn't seem to work:
import bybit
import pandas as pd
import time

# request data
client = bybit.bybit(test=True, api_key="", api_secret="")
data = client.Market.Market_orderbook(symbol="BTCUSD").result()

# create request loop
h = 1
while h > 0:
    time.sleep(1)
    final = data + data
    # save in txt
    with open('orderbookdata.txt', 'w') as f:
        for tup in final:
            f.write(u" ".join(map(unicode, tup)) + u"\n")
PS: the keys given are read-only keys for a testnet.
You could simply save each piece of data to a new line, which requires you to open the file for appending, not writing:
with open('file', 'a')  # open for appending
Appending creates the file if it doesn't exist, but if it does exist, it does not truncate it to an empty file the way writing ('w') does; it adds to the end of the file.
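For illustration, a minimal sketch of the difference (the file name is just an example): run it twice and the file keeps both lines, whereas mode 'w' would leave only the last one.
with open('example.txt', 'a') as f:  # 'a' also creates the file if it doesn't exist
    f.write('one more snapshot\n')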
First of all, an infinite loop should be written as while True:. Second, you are reading the value only once and then saving that same data on every iteration. Also, making a request every second may not be permitted by the site; check that in the docs.
import bybit
import pandas as pd
import time

client = bybit.bybit(test=True, api_key="DzX0ObRek383f7BP4f", api_secret="wjZKH8MKJehLXv4iTplJiSxn1bg8rw49Vlbt")

while True:
    data = client.Market.Market_orderbook(symbol="BTCUSD").result()  # read the data on each run
    with open('orderbookdata.txt', 'a') as f:  # open for appending
        f.write('{}\n'.format(data))  # write the data on a new line
    time.sleep(1)  # sleep

Parsing the Skills Section in a Resume in Python

I am trying to parse the skills section of a resume in Python. I found a library by Omkar Pathak called pyresparser, and I was able to extract a PDF resume's contents into a resume.txt file.
However, I was wondering how I can extract only the skills section from the resume into a list and then write that list, possibly into a query.txt file.
I'm reading the contents of resume.txt into a list and then comparing that to a list called skills, which stores the contents extracted from a skills.csv file. Currently, the skills list is empty, and I was wondering how I can go about storing the skills in that list. Is this the correct approach? Any help is greatly appreciated, thank you!
import string
import csv
import re
import sys
import importlib
import os
import spacy
from pyresparser import ResumeParser
import pandas as pd
import nltk
from spacy.matcher import matcher
import multiprocessing as mp


def main():
    data = ResumeParser("C:/Users/infinitel88p/Downloads/resume.pdf").get_extracted_data()
    print(data)

    # Added encoding utf-8 to prevent unicode error
    with open("C:/Users/infinitel88p/Downloads/resume.txt", "w", encoding='utf-8') as rf:
        rf.truncate()
        rf.write(str(data))
    print("Resume results are getting printed into resume.txt.")

    # Extracting skills
    resume_list = []
    skill_list = []
    data = pd.read_csv("skills.csv")
    skills = list(data.columns.values)
    resume_file = os.path.dirname(__file__) + "/resume.txt"

    with open(resume_file, 'r', encoding='utf-8') as f:
        for line in f:
            resume_list.append(line.strip())

    for token in resume_list:
        if token.lower() in skills:
            skill_list.append(token)
    print(skill_list)


if __name__ == "__main__":
    main()
An easy way (but not an efficient one) to do it:
Have a set of all possible relevant skills in a text file. For the words in the skills section of the resume (or for all the words in the resume), take each word and check whether it matches any word from the text file. If a word matches, that skill is present in the resume. This way you can identify the set of skills present in the resume.
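A minimal sketch of that word-matching idea; "skills.txt" (one skill per line) and "resume.txt" are hypothetical file names:
# Build a set of known skills and intersect it with the words found in the resume.
with open("skills.txt", encoding="utf-8") as f:
    known_skills = {line.strip().lower() for line in f if line.strip()}

with open("resume.txt", encoding="utf-8") as f:
    resume_words = {word.strip(".,;:()").lower() for word in f.read().split()}

found_skills = sorted(known_skills & resume_words)
print(found_skills)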
For further refinement or better identification, you can use naive-Bayes classification or unigram probabilities to extract more relevant skills.

python: getting npm package data from a couchdb endpoint

I want to fetch the npm package metadata. I found this endpoint, which gives me all the metadata needed.
My plan is to select some specific keys and add that data to a database (I could also store it in a JSON file, but the data is huge). I made the following script to fetch the data:
import requests
import json
import sys

db = 'https://replicate.npmjs.com'
r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs": "true"})

for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        print(line)
        decoded_line = line.decode('utf-8')
        print(json.loads(decoded_line))
Notice that I'm not even including the full docs, yet it gets stuck in an infinite loop. I think this is because the data is huge.
A look at the head of the output from https://replicate.npmjs.com/_all_docs gives me the following:
{"total_rows":1017703,"offset":0,"rows":[
{"id":"0","key":"0","value":{"rev":"1-5fbff37e48e1dd03ce6e7ffd17b98998"}},
{"id":"0-","key":"0-","value":{"rev":"1-420c8f16ec6584c7387b19ef401765a4"}},
{"id":"0----","key":"0----","value":{"rev":"1-55f4221814913f0e8f861b1aa42b02e4"}},
{"id":"0-1-project","key":"0-1-project","value":{"rev":"1-3cc19950252463c69a5e717d9f8f0f39"}},
{"id":"0-100","key":"0-100","value":{"rev":"1-c4f41a37883e1289f469d5de2a7b505a"}},
{"id":"0-24","key":"0-24","value":{"rev":"1-e595ec3444bc1039f10c062dd86912a2"}},
{"id":"0-60","key":"0-60","value":{"rev":"2-32c17752acfe363fa1be7dbd38212b0a"}},
{"id":"0-9","key":"0-9","value":{"rev":"1-898c1d89f7064e58f052ff492e94c753"}},
{"id":"0-_-0","key":"0-_-0","value":{"rev":"1-d47c142e9460c815c19c4ed3355d648d"}},
{"id":"0.","key":"0.","value":{"rev":"1-11c33605f2e3fd88b5416106fcdbb435"}},
{"id":"0.0","key":"0.0","value":{"rev":"1-5e541d4358c255cbcdba501f45a66e82"}},
{"id":"0.0.1","key":"0.0.1","value":{"rev":"1-ce856c27d0e16438a5849a97f8e9671d"}},
{"id":"0.0.168","key":"0.0.168","value":{"rev":"1-96ab3047e57ca1573405d0c89dd7f3f2"}},
{"id":"0.0.250","key":"0.0.250","value":{"rev":"1-c07ad0ffb7e2dc51bfeae2838b8d8bd6"}},
Notice that all the documents start from the second line (i.e. all the documents are part of the values of the "rows" key). Now, my question is how to get only the values of the "rows" key (i.e. all the documents). I found this repository for a similar purpose, but I can't use or convert it as I am a total beginner in JavaScript.
If there is no stream=True among the arguments of get(), then the whole response will be downloaded into memory before the loop over the lines even starts.
Then there is the problem that the individual lines are not valid JSON on their own. You'll need an incremental JSON parser like ijson for this. ijson in turn wants a file-like object, which isn't easily obtained from the requests.Response, so I will use urllib from the Python standard library here:
#!/usr/bin/env python3
from urllib.request import urlopen

import ijson


def main():
    with urlopen('https://replicate.npmjs.com/_all_docs') as json_file:
        for row in ijson.items(json_file, 'rows.item'):
            print(row)


if __name__ == '__main__':
    main()
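As a follow-up, if the goal is to keep only specific keys and persist them incrementally, a sketch along these lines may work; the field names mirror the sample output above, and the output file name "npm_docs.jsonl" is made up:
#!/usr/bin/env python3
from urllib.request import urlopen
import json
import ijson

# Stream the rows and keep only the fields of interest, writing one JSON object per line.
with urlopen('https://replicate.npmjs.com/_all_docs') as json_file, \
        open('npm_docs.jsonl', 'w', encoding='utf-8') as out:
    for row in ijson.items(json_file, 'rows.item'):
        record = {'id': row['id'], 'rev': row['value']['rev']}
        out.write(json.dumps(record) + '\n')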
Is there a reason why you aren't decoding the json before iterating over the lines?
Can you try this:
import requests
import json
import sys

db = 'https://replicate.npmjs.com'
r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs": "true"})

decoded_r = r.content.decode('utf-8')
data = json.loads(decoded_r)

for row in data["rows"]:
    print(row["key"])

Accessing parts of a json file

I have a JSON file named region_descriptions.json, which is available at http://visualgenome.org/static/data/dataset/region_descriptions.json.zip; I suggest you download it to understand the structure. Since this file is huge, it doesn't open properly in most software (in my case Google Chrome did the trick). Inside this JSON file you will find many sentences as the value for the key "phrase". I need to write all the phrases (ONLY the phrases, in the SAME ORDER), each on a different line in a .txt file.
I already got a .txt file by running the following code:
import json

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

f = open("text.txt", "w")
for regions_dict in json_data:
    for region in regions_dict["regions"]:
        print(region["phrase"])
        f.write(region["phrase"] + "\n")
But I found out that some phrases are printed more than once in a row, and there are empty lines in between, which seems strange. I cannot open the JSON file to check whether the .txt file I got is correct or not. Does anyone have a solution?
I am not entirely sure what you mean by "twice in a row"; this solution works under the assumption that you mean duplicate phrases.
import json

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

with open('test.txt', 'w') as f:
    all_phrases = []
    for regions_dict in json_data:
        for region in regions_dict["regions"]:
            all_phrases.append(region['phrase'])

    # all non-empty phrases
    new_phrases = [phrase for phrase in all_phrases if phrase.strip()]

    # keep a phrase only if it has not appeared earlier in new_phrases
    new_phrases_again = [phrase for i, phrase in enumerate(new_phrases) if phrase not in new_phrases[:i]]

    f.write("\n".join(new_phrases_again))
Sample test.txt output:
the clock is green in colour
shade is along the street
man is wearing sneakers
cars headlights are off
bikes are parked at the far edge
A sign on the facade of the building
A tree trunk on the sidewalk
A man in a red shirt
A brick sidewalk beside the street
The back of a white car
import json

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

for regions_dict in json_data:
    for region in regions_dict["regions"]:
        print(region["phrase"])
This should do the trick. It's just a matter of quoting the keys you want and understanding the structure of the data.
Something helpful can be to do stuff like this:
import sys
import json
import pprint

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

for regions_dict in json_data:
    pprint.pprint(regions_dict["regions"])
    sys.exit()
And you'll get a nicely formatted output so you can better 'see' what the structure looks like. It may be helpful to take a quick online course on lists and dictionaries to get an idea of how these objects hold data. Basically, [ ] is a list of data and { } is a dictionary (key and value pairs). Here is where I got started: https://www.codecademy.com/learn/learn-python
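For example, here is a tiny hand-made mock of the structure this file appears to have (the coordinate values are invented purely for illustration; the phrases come from the sample output above):
# A tiny mock of region_descriptions.json: a list of dicts,
# where each dict's "regions" key holds a list of region dicts.
json_data = [
    {"id": 1, "regions": [
        {"phrase": "the clock is green in colour", "x": 421, "y": 57},
        {"phrase": "shade is along the street", "x": 194, "y": 339},
    ]},
]

for regions_dict in json_data:              # outer level: list of dicts
    for region in regions_dict["regions"]:  # "regions": list of region dicts
        print(region["phrase"])             # each region dict has a "phrase" key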
The code should be working fine. If there are duplicate phrases, that is because the .json has duplicate phrases, and empty lines mean some of the phrases are empty. If you'd like a unique list of phrases, you can build on the existing code: add each phrase to a list only if it isn't in the list already. Like this:
import sys
import json

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

phrase_list = []
for regions_dict in json_data:
    for region in regions_dict["regions"]:
        if region["phrase"] not in phrase_list:
            phrase_list.append(region["phrase"])
I would also suggest that in the future, if you can, you use a small sample of data instead of a huge file. It makes it easier to figure out what to do! Good luck!
By the looks of the data, it's a list of dictionaries whose "regions" value is a list of region dictionaries.
CannedScientist beat me to the punch, doh!
My answer was going to look pretty similar, just without the last two list comprehensions.
I was going to check for empty strings before appending.
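For completeness, a minimal sketch of that variant, building on the snippet above and assuming the same region_descriptions.json structure:
import json

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

# Same de-duplication idea as above, but empty or whitespace-only phrases are skipped too.
phrase_list = []
for regions_dict in json_data:
    for region in regions_dict["regions"]:
        phrase = region["phrase"].strip()
        if phrase and phrase not in phrase_list:
            phrase_list.append(phrase)

with open("test.txt", "w") as f:
    f.write("\n".join(phrase_list))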
