How do I access specific data in a nested JSON file with Python and Pandas

I am still a newbie with Python and working on my first REST API. I have a JSON file that has a few levels. When I create the data frame with pandas, no matter what I try I cannot access the level I need.
The API is built with Flask and has the correct parameters for the book, chapter and verse.
Below is a small example of the JSON data.
{
    "book": "Python",
    "chapters": [
        {
            "chapter": "1",
            "verses": [
                {
                    "verse": "1",
                    "text": "Testing"
                },
                {
                    "verse": "2",
                    "text": "Testing 2"
                }
            ]
        }
    ]
}
Here is my code:
@app.route("/api/v1/<book>/<chapter>/<verse>/")
def api(book, chapter, verse):
    book = book.replace(" ", "").title()
    df = pd.read_json(f"Python/{book}.json")
    filt = (df['chapters']['chapter'] == chapter) & (df['chapters']['verses']['verse'] == verse)
    text = df.loc[filt].to_json()
    result_dictionary = {'Book': book, 'Chapter': chapter, "Verse": verse, "Text": text}
    return result_dictionary
Here is the error I am getting:
KeyError: 'chapter'
I have tried normalizing the data, using df.loc to filter, and accessing the data directly.
I expect the API endpoint to let the user supply the book, chapter, and verse as arguments, and to return the text at the position given by those parameters.

You can first create a dataframe of the JSON and then query it.
import json
import pandas as pd
def api(book, chapter, verse):
    # Read the JSON file
    with open(f"Python/{book}.json", "r") as f:
        data = json.load(f)
    # Convert it into a DataFrame
    df = pd.json_normalize(data, record_path=["chapters", "verses"], meta=["book", ["chapters", "chapter"]])
    df.columns = ["Verse", "Text", "Book", "Chapter"]  # rename columns
    # Query the required content
    query = f"Book == '{book}' and Chapter == '{chapter}' and Verse == '{verse}'"
    result = df.query(query).to_dict(orient="records")[0]
    return result
Here df would look like this after json_normalize (the sample below assumes the file also contains a second chapter):
  Verse       Text    Book Chapter
0     1    Testing  Python       1
1     2  Testing 2  Python       1
2     1    Testing  Python       2
3     2  Testing 2  Python       2
And result is:
{'Verse': '2', 'Text': 'Testing 2', 'Book': 'Python', 'Chapter': '1'}
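As a side note (not part of the original answer), DataFrame.query can also reference local variables with the @ prefix, which avoids interpolating the URL parameters into the filter string by hand:
# Same filter as above; query() resolves book, chapter and verse from the local scope
result = df.query("Book == @book and Chapter == @chapter and Verse == @verse").to_dict(orient="records")[0]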

You are trying to access a list in a dict with a dict key. Something like
filt = (df['chapters'][0]['chapter'] == chapter) & (df['chapters'][0]['verses'][0]['verse'] == verse)
will get a result. But df.loc[filt] requires a boolean Series to filter with, and the expression above only generates a single True or False, so you can't filter with that.
You can filter like:
pd.DataFrame.from_dict(df['chapters'][0]['verses']).query("verse == '1'")

One of the issues here is that "chapters" is a list:
"chapters": [
This is why ["chapters"]["chapter"] won't work as you intend.
If you're new to this, it may be helpful to "normalize" the data yourself:
import json

with open("book.json") as f:
    book = json.load(f)

for chapter in book["chapters"]:
    for verse in chapter["verses"]:
        row = book["book"], chapter["chapter"], verse["verse"], verse["text"]
        print(repr(row))
('Python', '1', '1', 'Testing')
('Python', '1', '2', 'Testing 2')
It is possible to pass this to pd.DataFrame()
df = pd.DataFrame(
    ([book["book"], chapter["chapter"], verse["verse"], verse["text"]]
     for chapter in book["chapters"]
     for verse in chapter["verses"]),
    columns=["Book", "Chapter", "Verse", "Text"]
)
     Book Chapter Verse       Text
0  Python       1     1    Testing
1  Python       1     2  Testing 2
Although it's not clear if you need a dataframe here at all.
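If a DataFrame isn't needed, a plain lookup over the same nested structure is enough. A minimal sketch, assuming the JSON layout shown in the question (find_text is a hypothetical helper name):

import json

def find_text(book_name, chapter, verse):
    # Walk the nested lists directly; return None when nothing matches.
    with open(f"Python/{book_name}.json") as f:
        data = json.load(f)
    for ch in data["chapters"]:
        if ch["chapter"] == chapter:
            for v in ch["verses"]:
                if v["verse"] == verse:
                    return v["text"]
    return None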

Related

constructing a message format from the fetchall result in python

New to programming.
Question: I need to use the below "Data" (two rows as arrays) queried from SQL to create the message structure below.
Data from SQL using fetchall():
Data = [[100,1,4,5],[101,1,4,6]]
##expected message structure
message = {
    "name": "Tom",
    "Job": "IT",
    "info": [
        {
            "id_1": "100",
            "id_2": "1",
            "id_3": "4",
            "id_4": "5"
        },
        {
            "id_1": "101",
            "id_2": "1",
            "id_3": "4",
            "id_4": "6"
        },
    ]
}
I tried to create the method below to iterate over the rows and fill in the values. It was just a start, but it was not working either:
def create_message(data)
    for row in data:
        {
            "id_1": str(data[0][0],
            "id_2": str(data[0][1],
            "id_3": str(data[0][2],
            "id_4": str(data[0][3],
        }
Latest Code
def create_info(data):
    info = []
    for row in data:
        temp_dict = {"id_1_tom": "", "id_2_hell": "", "id_3_trip": "", "id_4_clap": ""}
        for i in range(0, 1):
            temp_dict["id_1_tom"] = str(row[i])
            temp_dict["id_2_hell"] = str(row[i+1])
            temp_dict["id_3_trip"] = str(row[i+2])
            temp_dict["id_4_clap"] = str(row[i+3])
        info.append(temp_dict)
    return info
Edit: Updated answer based on updates to the question and comment by original poster.
This function might work for the example you've given to get the desired output, based on the attempt you've provided:
def create_info(data):
    info = []
    for row in data:
        temp_dict = {}
        temp_dict['id_1_tom'] = str(row[0])
        temp_dict['id_2_hell'] = str(row[1])
        temp_dict['id_3_trip'] = str(row[2])
        temp_dict['id_4_clap'] = str(row[3])
        info.append(temp_dict)
    return info
For the input:
[[100, 1, 4, 5], [101, 1, 4, 6]]
This function will return a list of dictionaries:
[{"id_1_tom":"100","id_2_hell":"1","id_3_trip":"4","id_4_clap":"5"},
{"id_1_tom":"101","id_2_hell":"1","id_3_trip":"4","id_4_clap":"6"}]
This can serve as the value for the key info in your dictionary message. Note that you would still have to construct the message dictionary.
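For completeness, a minimal sketch of assembling the full message from that output (the "name" and "Job" values are the placeholders from your example):

Data = [[100, 1, 4, 5], [101, 1, 4, 6]]
message = {
    "name": "Tom",
    "Job": "IT",
    "info": create_info(Data),  # the list of dicts built above
}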

Sentiment Analysis data not showing up in csv file

I put data into a csv file (called "Essential Data_posts"). In my main, I extract a particular column from this file (called 'Post Texts') so that I can analyze the post texts for sentiment entity analysis using Google Cloud NLP. I then put this analysis in another csv file (called "SentimentAnalysis"). To do this, I put all of the information pertaining to sentiment entity analysis into an array (one for each piece of information).
The problem I am having is that when I execute my code, nothing shows up in the SentimentAnalysis file other than the headers (e.g. "Representative Name"). When I checked the lengths of the arrays, each had a length of 0, so no information was being added to them.
I am using Ubuntu 21.04 and Google Cloud Natural Language. I am running this all in Terminal, not the Google Cloud Platform. I am also using Python3 and emacs text editor.
from google.cloud import language_v1
import pandas as pd
import csv
import os

#lists we are appending to
representativeName = []
entity = []
salienceScore = []
entitySentimentScore = []
entitySentimentMagnitude = []
metadataNames = []
metadataValues = []
mentionText = []
mentionType = []

def sentiment_entity(postTexts):
    client = language_v1.LanguageServiceClient()
    type_ = language_v1.Document.Type.PLAIN_TEXT
    language = "en"
    document = {"content": post_texts, "type": type_, "language": language}
    encodingType = language_v1.EncodingType.UTF8
    response = client.analyze_entity_sentiment(request = {'document': document, 'encoding type': encodingType})
    #loop through entities returned from the API
    for entity in response.entities:
        representativeName.append(entity.name)
        entity.append(language_v1.Entity.Type(entity.type_).name)
        salienceScore.append(entity.salience)
        entitySentimentScore.append(sentiment.score)
        entitySentimentMagnitude.append(sentiment.magnitude)
    #loop over metadata associated with entity
    for metadata_name, metadata_value in entity.metadata.items():
        metadataNames.append(metadata_name)
        metadataValues.append(metadata_value)
    #loop over the mentions of this entity in the input document
    for mention in entity.mentions:
        mentionText.append(mention.text.content)
        mentionType.append(mention.type_)
    #put the lists into the csv file (using pandas)
    data = {
        "Representative Name": representativeName,
        "Entity": entity,
        "Salience Score": salienceScore,
        "Entity Sentiment Score": entitySentimentScore,
        "Entity Sentiment Magnitude": entitySentimentMagnitude,
        "Metadata Name": metadataNames,
        "Metadata Value": metadataValues,
        "Mention Text": mentionText,
        "Mention Type": mentionType
    }
    df = pd.DataFrame(data)
    df
    df.to_csv("SentimentAnalysis.csv", encoding='utf-8', index=False)

def main():
    import argparse
    #read the csv file containing the post text we need to analyze
    filename = open('Essential Data_posts.csv', 'r')
    #create dictreader object
    file = csv.DictReader(filename)
    postTexts = []
    #iterate over each column and append values to list
    for col in file:
        postTexts.append(col['Post Text'])
    parser = arg.parse.ArgumentParser()
    parser.add_argument("--postTexts", type=str, default=postTexts)
    args = parser.parse_args()
    sentiment_entity(args.postTexts)
I tried running your code and encountered the following errors:
1. You did not use the passed parameter postTexts in sentiment_entity(), so document = {"content": post_texts, "type": type_, "language": language} fails on the undefined name post_texts.
2. A list cannot be passed to "content"; it should be a string. See the Document reference.
3. In the request variable, 'encoding type' should be 'encoding_type'.
4. The loop variable entity should not have the same name as the list entity = []. Python will try to append values to the local variable entity, which is not a list.
5. It should be entity.sentiment.score and entity.sentiment.magnitude instead of sentiment.score and sentiment.magnitude.
6. The metadata and mention loops should be inside the for entity in response.entities: loop.
I edited your code and fixed the errors mentioned above. In your main(), I included a step to convert the postTexts list to a string so it can be used in your sentiment_entity() function. metadataNames and metadataValues are temporarily commented out, since I do not have an example that would populate these values.
from google.cloud import language_v1
import pandas as pd
import csv
import os

#lists we are appending to
representativeName = []
entity_arr = []
salienceScore = []
entitySentimentScore = []
entitySentimentMagnitude = []
metadataNames = []
metadataValues = []
mentionText = []
mentionType = []

def listToString(s):
    """Transform list to string"""
    str1 = " "
    return (str1.join(s))

def sentiment_entity(postTexts):
    client = language_v1.LanguageServiceClient()
    type_ = language_v1.Document.Type.PLAIN_TEXT
    language = "en"
    document = {"content": postTexts, "type_": type_, "language": language}
    encodingType = language_v1.EncodingType.UTF8
    response = client.analyze_entity_sentiment(request = {'document': document, 'encoding_type': encodingType})
    #loop through entities returned from the API
    for entity in response.entities:
        representativeName.append(entity.name)
        entity_arr.append(language_v1.Entity.Type(entity.type_).name)
        salienceScore.append(entity.salience)
        entitySentimentScore.append(entity.sentiment.score)
        entitySentimentMagnitude.append(entity.sentiment.magnitude)
        #loop over the mentions of this entity in the input document
        for mention in entity.mentions:
            mentionText.append(mention.text.content)
            mentionType.append(mention.type_)
        #loop over metadata associated with entity
        for metadata_name, metadata_value in entity.metadata.items():
            metadataNames.append(metadata_name)
            metadataValues.append(metadata_value)
    data = {
        "Representative Name": representativeName,
        "Entity": entity_arr,
        "Salience Score": salienceScore,
        "Entity Sentiment Score": entitySentimentScore,
        "Entity Sentiment Magnitude": entitySentimentMagnitude,
        #"Metadata Name": metadataNames,
        #"Metadata Value": metadataValues,
        "Mention Text": mentionText,
        "Mention Type": mentionType
    }
    df = pd.DataFrame(data)
    df.to_csv("SentimentAnalysis.csv", encoding='utf-8', index=False)

def main():
    #read the csv file containing the post text we need to analyze
    filename = open('test.csv', 'r')
    #create dictreader object
    file = csv.DictReader(filename)
    postTexts = []
    #iterate over each row and append values to list
    for col in file:
        postTexts.append(col['Post Text'])
    content = listToString(postTexts)  #convert list to string
    print(content)
    sentiment_entity(content)

if __name__ == "__main__":
    main()
test.csv:
col_1,Post Text
dummy,Grapes are good.
dummy,Bananas are bad.
When the code is run, it prints the list converted to a string and generates SentimentAnalysis.csv:
SentimentAnalysis.csv:
Representative Name,Entity,Salience Score,Entity Sentiment Score,Entity Sentiment Magnitude,Mention Text,Mention Type
Grapes,OTHER,0.8335162997245789,0.800000011920929,0.800000011920929,Grapes,2
Bananas,OTHER,0.16648370027542114,-0.699999988079071,0.699999988079071,Bananas,2
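One caveat worth noting (an observation beyond the original answer): pd.DataFrame raises "ValueError: All arrays must be of the same length" when the column lists end up with different lengths, which is exactly what happens once an entity has several mentions or metadata entries. A sketch of one way to keep columns aligned, collecting one dict per mention instead of appending to parallel lists:

rows = []
for entity in response.entities:
    for mention in entity.mentions:
        rows.append({
            "Representative Name": entity.name,
            "Entity": language_v1.Entity.Type(entity.type_).name,
            "Salience Score": entity.salience,
            "Entity Sentiment Score": entity.sentiment.score,
            "Entity Sentiment Magnitude": entity.sentiment.magnitude,
            "Mention Text": mention.text.content,
            "Mention Type": mention.type_,
        })
df = pd.DataFrame(rows)  # one row per mention, so columns always align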

Analysis of fields in nested document elasticsearch

I am using a document with a nested structure in it where the content is analyzed in spite of my telling it "not_analyzed". The document is defined as follows:
class SearchDocument(es.DocType):
    # Verblijfsobject specific data
    gebruiksdoel_omschrijving = es.String(index='not_analyzed')
    oppervlakte = es.Integer()
    bouwblok = es.String(index='not_analyzed')
    gebruik = es.String(index='not_analyzed')
    panden = es.String(index='not_analyzed')
    sbi_codes = es.Nested({
        'properties': {
            'sbi_code': es.String(index='not_analyzed'),
            'hcat': es.String(index='not_analyzed'),
            'scat': es.String(index='not_analyzed'),
            'hoofdcategorie': es.String(fields={'raw': es.String(index='not_analyzed')}),
            'subcategorie': es.String(fields={'raw': es.String(index='not_analyzed')}),
            'sub_sub_categorie': es.String(fields={'raw': es.String(index='not_analyzed')}),
            'bedrijfsnaam': es.String(fields={'raw': es.String(index='not_analyzed')}),
            'vestigingsnummer': es.String(index='not_analyzed')
        }
    })
As is clear, it says not_analyzed in the document for most fields. This works OK for the "regular" fields. The problem is in the nested structure: there the hoofdcategorie and other fields are indexed as their separate words instead of the un-analyzed version.
The structure is filled with the following data:
[
    {
        "sbi_code": "74103",
        "sub_sub_categorie": "Interieur- en ruimtelijk ontwerp",
        "vestigingsnummer": "000000002216",
        "bedrijfsnaam": "Flippie Tests",
        "subcategorie": "design",
        "scat": "22279_12_22254_11",
        "hoofdcategorie": "zakelijke dienstverlening",
        "hcat": "22279_12"
    },
    {
        "sbi_code": "9003",
        "sub_sub_categorie": "Schrijven en overige scheppende kunsten",
        "vestigingsnummer": "000000002216",
        "bedrijfsnaam": "Flippie Tests",
        "subcategorie": "kunst",
        "scat": "22281_12_22259_11",
        "hoofdcategorie": "cultuur, sport, recreatie",
        "hcat": "22281_12"
    }
]
Now when I retrieve aggregates, it has split hoofdcategorie into three different words ("cultuur", "sport", "recreatie"). This is not what I want, but as far as I know I have specified it correctly with not_analyzed.
Anyone any ideas?
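Worth noting: hoofdcategorie is declared as a multi-field, so the top-level field stays analyzed while the un-analyzed copy lives under the raw sub-field. A sketch of an aggregation that would target that sub-field instead, assuming the mapping above took effect (nested fields also need a nested aggregation over the sbi_codes path; the client and index names are placeholders):

# Hypothetical aggregation body for the low-level Elasticsearch client
body = {
    "size": 0,
    "aggs": {
        "sbi": {
            "nested": {"path": "sbi_codes"},
            "aggs": {
                "hoofdcategorie": {
                    "terms": {"field": "sbi_codes.hoofdcategorie.raw"}
                }
            }
        }
    }
}
# client.search(index="my_index", body=body)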

What is the data format returned by the AdWords API TargetingIdeaPage service?

When I query the AdWords API to get search volume data and trends through their TargetingIdeaSelector using the Python client library the returned data looks like this:
(TargetingIdeaPage){
   totalNumEntries = 1
   entries[] =
      (TargetingIdea){
         data[] =
            (Type_AttributeMapEntry){
               key = "KEYWORD_TEXT"
               value =
                  (StringAttribute){
                     Attribute.Type = "StringAttribute"
                     value = "keyword phrase"
                  }
            },
            (Type_AttributeMapEntry){
               key = "TARGETED_MONTHLY_SEARCHES"
               value =
                  (MonthlySearchVolumeAttribute){
                     Attribute.Type = "MonthlySearchVolumeAttribute"
                     value[] =
                        (MonthlySearchVolume){
                           year = 2016
                           month = 2
                           count = 2900
                        },
                        ...
                        (MonthlySearchVolume){
                           year = 2015
                           month = 3
                           count = 2900
                        },
                  }
            },
      },
}
This isn't JSON and appears to just be a messy Python list. What's the easiest way to flatten the monthly data into a Pandas dataframe with a structure like this?
Keyword        | Year | Month | Count
keyword phrase | 2016 | 2     | 10
The output is a sudsobject. I found that this code does the trick:
import suds.sudsobject as sudsobject
import pandas as pd
a = [sudsobject.asdict(x) for x in output]
df = pd.DataFrame(a)
Addendum: This was once correct, but newer versions of the API (I tested 201802) now return zeep objects. However, zeep.helpers.serialize_object should do the same trick.
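For those newer versions, a minimal sketch of that substitution (output is again the list of returned entries, as above):

from zeep.helpers import serialize_object
import pandas as pd

# serialize_object recursively converts zeep objects into plain dicts
a = [serialize_object(x) for x in output]
df = pd.DataFrame(a)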
Here's the complete code that I used to query the TargetingIdeaSelector with requestType STATS, and the method I used to parse the data into a usable dataframe. Note the section starting "Parse results to pandas dataframe", as this takes the output given in the question above and converts it to a dataframe. Probably not the fastest or best, but it works! Tested with Python 2.7.
"""This code pulls trends for a set of keywords, and parses into a dataframe.
The LoadFromStorage method is pulling credentials and properties from a
"googleads.yaml" file. By default, it looks for this file in your home
directory. For more information, see the "Caching authentication information"
section of our README.
"""
from googleads import adwords
import pandas as pd
adwords_client = adwords.AdWordsClient.LoadFromStorage()
PAGE_SIZE = 10
# Initialize appropriate service.
targeting_idea_service = adwords_client.GetService(
'TargetingIdeaService', version='v201601')
# Construct selector object and retrieve related keywords.
offset = 0
stats_selector = {
'searchParameters': [
{
'xsi_type': 'RelatedToQuerySearchParameter',
'queries': ['donald trump', 'bernie sanders']
},
{
# Language setting (optional).
# The ID can be found in the documentation:
# https://developers.google.com/adwords/api/docs/appendix/languagecodes
'xsi_type': 'LanguageSearchParameter',
'languages': [{'id': '1000'}],
},
{
# Location setting
'xsi_type': 'LocationSearchParameter',
'locations': [{'id': '1027363'}] # Burlington,Vermont
}
],
'ideaType': 'KEYWORD',
'requestType': 'STATS',
'requestedAttributeTypes': ['KEYWORD_TEXT', 'TARGETED_MONTHLY_SEARCHES'],
'paging': {
'startIndex': str(offset),
'numberResults': str(PAGE_SIZE)
}
}
stats_page = targeting_idea_service.get(stats_selector)
##########################################################################
# Parse results to pandas dataframe
stats_pd = pd.DataFrame()
if 'entries' in stats_page:
    for stats_result in stats_page['entries']:
        stats_attributes = {}
        for stats_attribute in stats_result['data']:
            #print(stats_attribute)
            if stats_attribute['key'] == 'KEYWORD_TEXT':
                kt = stats_attribute['value']['value']
            else:
                for i, val in enumerate(stats_attribute['value'][1]):
                    data = {'keyword': kt,
                            'year': val['year'],
                            'month': val['month'],
                            'count': val['count']}
                    data = pd.DataFrame(data, index=[i])
                    stats_pd = stats_pd.append(data, ignore_index=True)
print(stats_pd)
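As an aside (not from the original answer), appending to a DataFrame row by row re-copies the frame on every iteration; collecting plain dicts and building the frame once is usually faster, and is the direction pandas itself later took when DataFrame.append was deprecated. A minimal reshuffle of the same parsing loop:

rows = []
if 'entries' in stats_page:
    for stats_result in stats_page['entries']:
        kt = None
        monthly = []
        for stats_attribute in stats_result['data']:
            if stats_attribute['key'] == 'KEYWORD_TEXT':
                kt = stats_attribute['value']['value']
            else:
                monthly = stats_attribute['value'][1]
        # one dict per month keeps DataFrame construction to a single call
        for val in monthly:
            rows.append({'keyword': kt, 'year': val['year'],
                         'month': val['month'], 'count': val['count']})
stats_pd = pd.DataFrame(rows, columns=['keyword', 'year', 'month', 'count'])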

JSON Parsing help in Python

I have the below data in JSON format. I started with the code below, which throws a KeyError.
I'm not sure how to get all the data listed in the headers section.
I know I am not doing json_obj['offers'][0]['pkg']['Info'] right, but I'm not sure how to do it correctly.
How can I get to the different nodes like Info, PricingInfo, flt_Info, etc.?
{
    "offerInfo": {
        "siteID": "1",
        "language": "en_US",
        "currency": "USD"
    },
    "offers": {
        "pkg": [
            {
                "offerDateRange": {
                    "StartDate": [2015, 11, 8],
                    "EndDate": [2015, 11, 14]
                },
                "Info": {
                    "Id": "111"
                },
                "PricingInfo": {
                    "BaseRate": 1932.6
                },
                "flt_Info": {
                    "Carrier": "AA"
                }
            }
        ]
    }
}
import os
import json
import csv

f = open('api.csv', 'w')
writer = csv.writer(f, delimiter='~')
headers = ['Id', 'StartDate', 'EndDate', 'Id', 'BaseRate', 'Carrier']
default = ''
writer.writerow(headers)

string = open('data.json').read().decode('utf-8')
json_obj = json.loads(string)

for pkg in json_obj['offers'][0]['pkg']['Info']:
    row = []
    row.append(json_obj['id'])  # just to test, but I need column values listed in header section
    writer.writerow(row)
It looks like you're accessing the JSON incorrectly. After you access json_obj['offers'], you try [0], but there is no array there: json_obj['offers'] gives you another dictionary.
For example, to get PricingInfo like you asked, access it like this:
json_obj['offers']['pkg'][0]['PricingInfo']
or 11 from the StartDate like this:
json_obj['offers']['pkg'][0]['offerDateRange']['StartDate'][1]
And I believe you get the KeyError because you use [0] on the dictionary; since 0 isn't a key, you get the error.
Try substituting this piece of code:
for pkg in json_obj['offers'][0]['pkg']['Info']:
    row = []
    row.append(json_obj['id'])  # just to test, but I need column values listed in header section
    writer.writerow(row)
with this:
for pkg in json_obj['offers']['pkg']:
    row = []
    row.append(pkg['Info']['Id'])
    year = pkg['offerDateRange']['StartDate'][0]
    month = pkg['offerDateRange']['StartDate'][1]
    day = pkg['offerDateRange']['StartDate'][2]
    StartDate = "%d-%d-%d" % (year, month, day)
    print StartDate
    writer.writerow(row)
Try this:
import os
import json
import csv

string = open('data.json').read().decode('utf-8')
json_obj = json.loads(string)

print json_obj["offers"]["pkg"][0]["Info"]["Id"]
print str(json_obj["offers"]["pkg"][0]["offerDateRange"]["StartDate"][0]) + '-' + str(json_obj["offers"]["pkg"][0]["offerDateRange"]["StartDate"][1]) + '-' + str(json_obj["offers"]["pkg"][0]["offerDateRange"]["StartDate"][2])
print str(json_obj["offers"]["pkg"][0]["offerDateRange"]["EndDate"][0]) + '-' + str(json_obj["offers"]["pkg"][0]["offerDateRange"]["EndDate"][1]) + '-' + str(json_obj["offers"]["pkg"][0]["offerDateRange"]["EndDate"][2])
print json_obj["offers"]["pkg"][0]["Info"]["Id"]
print json_obj["offers"]["pkg"][0]["PricingInfo"]["BaseRate"]
print json_obj["offers"]["pkg"][0]["flt_Info"]["Carrier"]
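Building on that, a minimal sketch of a loop that writes one row per pkg entry with all the columns from the question's headers list (Python 2, reusing the writer from the question's code):

for pkg in json_obj['offers']['pkg']:
    start = '-'.join(str(p) for p in pkg['offerDateRange']['StartDate'])
    end = '-'.join(str(p) for p in pkg['offerDateRange']['EndDate'])
    # headers were ['Id', 'StartDate', 'EndDate', 'Id', 'BaseRate', 'Carrier']
    row = [pkg['Info']['Id'], start, end, pkg['Info']['Id'],
           pkg['PricingInfo']['BaseRate'], pkg['flt_Info']['Carrier']]
    writer.writerow(row)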
