I put data into a CSV file (called "Essential Data_posts"). In my main, I extract a particular column from this file (called 'Post Texts') so that I can analyze the post texts for entity sentiment using Google Cloud NLP. I then put this analysis into another CSV file (called "SentimentAnalysis"). To do this, I put each piece of information from the entity sentiment analysis into its own array (one array per piece of information).
The problem I am having is that when I execute my code, nothing shows up in the SentimentAnalysis file other than the headers (e.g. "Representative Name"). When I checked the lengths of all the arrays, I found that each array had a length of 0, so no information was being added to them.
I am using Ubuntu 21.04 and Google Cloud Natural Language. I am running this all in the terminal, not the Google Cloud Platform. I am also using Python 3 and the Emacs text editor.
from google.cloud import language_v1
import pandas as pd
import csv
import os
#lists we are appending to
representativeName = []
entity = []
salienceScore = []
entitySentimentScore = []
entitySentimentMagnitude = []
metadataNames = []
metadataValues = []
mentionText = []
mentionType = []
def sentiment_entity(postTexts):
    client = language_v1.LanguageServiceClient()
    type_ = language_v1.Document.Type.PLAIN_TEXT
    language = "en"
    document = {"content": post_texts, "type": type_, "language": language}
    encodingType = language_v1.EncodingType.UTF8
    response = client.analyze_entity_sentiment(request = {'document': document, 'encoding type': encodingType})
    #loop through entities returned from the API
    for entity in response.entities:
        representativeName.append(entity.name)
        entity.append(language_v1.Entity.Type(entity.type_).name)
        salienceScore.append(entity.salience)
        entitySentimentScore.append(sentiment.score)
        entitySentimentMagnitude.append(sentiment.magnitude)
    #loop over metadata associated with entity
    for metadata_name, metadata_value in entity.metadata.items():
        metadataNames.append(metadata_name)
        metadataValues.append(metadata_value)
    #loop over the mentions of this entity in the input document
    for mention in entity.mentions:
        mentionText.append(mention.text.content)
        mentionType.append(mention.type_)
    #put the lists into the csv file (using pandas)
    data = {
        "Representative Name": representativeName,
        "Entity": entity,
        "Salience Score": salienceScore,
        "Entity Sentiment Score": entitySentimentScore,
        "Entity Sentiment Magnitude": entitySentimentMagnitude,
        "Metadata Name": metadataNames,
        "Metadata Value": metadataValues,
        "Mention Text": mentionText,
        "Mention Type": mentionType
    }
    df = pd.DataFrame(data)
    df
    df.to_csv("SentimentAnalysis.csv", encoding='utf-8', index=False)
def main():
    import argparse
    #read the csv file containing the post text we need to analyze
    filename = open('Essential Data_posts.csv', 'r')
    #create dictreader object
    file = csv.DictReader(filename)
    postTexts = []
    #iterate over each row and append values to list
    for col in file:
        postTexts.append(col['Post Text'])
    parser = argparse.ArgumentParser()
    parser.add_argument("--postTexts", type=str, default=postTexts)
    args = parser.parse_args()
    sentiment_entity(args.postTexts)
I tried running your code and I encountered the following errors:
1. You did not use the passed parameter postTexts in sentiment_entity(), so this will error at document = {"content": post_texts, "type": type_, "language": language}.
2. A list cannot be passed to "content": post_texts; it should be a string. See the Document reference.
3. In the variable request, 'encoding type' should be 'encoding_type'.
4. The local variable entity should not have the same name as entity = []. Python will try to append values to the local variable entity, which is not a list.
5. It should be entity.sentiment.score and entity.sentiment.magnitude instead of sentiment.score and sentiment.magnitude.
6. The loops for metadata and mentions should be under the loop for entity in response.entities:
I edited your code and fixed the errors mentioned above. In your main(), I included a step to convert the list postTexts to a string so it can be used in your sentiment_entity() function. metadataNames and metadataValues are temporarily commented out since I do not have an example that could populate these values.
from google.cloud import language_v1
import pandas as pd
import csv
import os

#lists we are appending to
representativeName = []
entity_arr = []
salienceScore = []
entitySentimentScore = []
entitySentimentMagnitude = []
metadataNames = []
metadataValues = []
mentionText = []
mentionType = []

def listToString(s):
    """ Transform list to string"""
    str1 = " "
    return (str1.join(s))

def sentiment_entity(postTexts):
    client = language_v1.LanguageServiceClient()
    type_ = language_v1.Document.Type.PLAIN_TEXT
    language = "en"
    document = {"content": postTexts, "type_": type_, "language": language}
    encodingType = language_v1.EncodingType.UTF8
    response = client.analyze_entity_sentiment(request = {'document': document, 'encoding_type': encodingType})
    #loop through entities returned from the API
    for entity in response.entities:
        representativeName.append(entity.name)
        entity_arr.append(language_v1.Entity.Type(entity.type_).name)
        salienceScore.append(entity.salience)
        entitySentimentScore.append(entity.sentiment.score)
        entitySentimentMagnitude.append(entity.sentiment.magnitude)
        #loop over the mentions of this entity in the input document
        for mention in entity.mentions:
            mentionText.append(mention.text.content)
            mentionType.append(mention.type_)
        #loop over metadata associated with entity
        for metadata_name, metadata_value in entity.metadata.items():
            metadataNames.append(metadata_name)
            metadataValues.append(metadata_value)
    data = {
        "Representative Name": representativeName,
        "Entity": entity_arr,
        "Salience Score": salienceScore,
        "Entity Sentiment Score": entitySentimentScore,
        "Entity Sentiment Magnitude": entitySentimentMagnitude,
        #"Metadata Name": metadataNames,
        #"Metadata Value": metadataValues,
        "Mention Text": mentionText,
        "Mention Type": mentionType
    }
    df = pd.DataFrame(data)
    df.to_csv("SentimentAnalysis.csv", encoding='utf-8', index=False)

def main():
    #read the csv file containing the post text we need to analyze
    filename = open('test.csv', 'r')
    #create dictreader object
    file = csv.DictReader(filename)
    postTexts = []
    #iterate over each row and append values to list
    for col in file:
        postTexts.append(col['Post Text'])
    content = listToString(postTexts) #convert list to string
    print(content)
    sentiment_entity(content)

if __name__ == "__main__":
    main()
test.csv:
col_1,Post Text
dummy,Grapes are good.
dummy,Bananas are bad.
When the code is run, the list converted to a string is printed and SentimentAnalysis.csv is generated:
SentimentAnalysis.csv:
Representative Name,Entity,Salience Score,Entity Sentiment Score,Entity Sentiment Magnitude,Mention Text,Mention Type
Grapes,OTHER,0.8335162997245789,0.800000011920929,0.800000011920929,Grapes,2
Bananas,OTHER,0.16648370027542114,-0.699999988079071,0.699999988079071,Bananas,2
I am still a newbie with Python and working on my first REST API. I have a JSON file that has a few levels. When I create the data frame with pandas, no matter what I try, I cannot access the level I need.
The API is built with Flask and has the correct parameters for the book, chapter and verse.
Below is a small example of the JSON data.
{
    "book": "Python",
    "chapters": [
        {
            "chapter": "1",
            "verses": [
                {
                    "verse": "1",
                    "text": "Testing"
                },
                {
                    "verse": "2",
                    "text": "Testing 2"
                }
            ]
        }
    ]
}
Here is my code:
@app.route("/api/v1/<book>/<chapter>/<verse>/")
def api(book, chapter, verse):
    book = book.replace(" ", "").title()
    df = pd.read_json(f"Python/{book}.json")
    filt = (df['chapters']['chapter'] == chapter) & (df['chapters']['verses']['verse'] == verse)
    text = df.loc[filt].to_json()
    result_dictionary = {'Book': book, 'Chapter': chapter, "Verse": verse, "Text": text}
    return result_dictionary
Here is the error I am getting:
KeyError: 'chapter'
I have tried normalizing the data, using df.loc to filter and just trying to access the data directly.
Expecting that the API endpoint will allow the user to supply the book, chapter and verse as arguments and then it returns the text for the given position based on those parameters supplied.
You can first create a dataframe of the JSON and then query it.
import json
import pandas as pd

def api(book, chapter, verse):
    # Read the JSON file
    with open(f"Python/{book}.json", "r") as f:
        data = json.load(f)
    # Convert it into a DataFrame
    df = pd.json_normalize(data, record_path=["chapters", "verses"], meta=["book", ["chapters", "chapter"]])
    df.columns = ["Verse", "Text", "Book", "Chapter"]  # rename columns
    # Query the required content
    query = f"Book == '{book}' and Chapter == '{chapter}' and Verse == '{verse}'"
    result = df.query(query).to_dict(orient="records")[0]
    return result
Here df would look like this after json_normalize:
Verse Text Book Chapter
0 1 Testing Python 1
1 2 Testing 2 Python 1
2 1 Testing Python 2
3 2 Testing 2 Python 2
And result is:
{'Verse': '2', 'Text': 'Testing 2', 'Book': 'Python', 'Chapter': '1'}
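Calling the function then returns that record directly. For example, assuming the JSON above is saved as Python/Python.json:

print(api("Python", "1", "2"))
# {'Verse': '2', 'Text': 'Testing 2', 'Book': 'Python', 'Chapter': '1'}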
You are trying to access a list in a dict with a dict key?
filt = (df['chapters'][0]['chapter'] == "chapter") & (df['chapters'][0]['verses'][0]['verse'] == "verse")
will get a result.
But df.loc[filt] requires a list of (boolean) filters, and the above only generates a single False or True, so you can't filter with that.
You can filter like:
df.from_dict(df['chapters'][0]['verses']).query("verse =='1'")
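With the sample JSON above, that line builds a small frame from the verses list and keeps only the matching row, something like:

  verse     text
0     1  Testing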
One of the issues here is that "chapters" is a list:
"chapters": [
This is why ["chapters"]["chapter"] won't work as you intend.
If you're new to this, it may be helpful to "normalize" the data yourself:
import json

with open("book.json") as f:
    book = json.load(f)

for chapter in book["chapters"]:
    for verse in chapter["verses"]:
        row = book["book"], chapter["chapter"], verse["verse"], verse["text"]
        print(repr(row))
('Python', '1', '1', 'Testing')
('Python', '1', '2', 'Testing 2')
It is possible to pass this to pd.DataFrame()
df = pd.DataFrame(
    ([book["book"], chapter["chapter"], verse["verse"], verse["text"]]
     for chapter in book["chapters"]
     for verse in chapter["verses"]),
    columns=["Book", "Chapter", "Verse", "Text"]
)
Book Chapter Verse Text
0 Python 1 1 Testing
1 Python 1 2 Testing 2
Although it's not clear if you need a dataframe here at all.
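If not, a plain nested lookup is enough. Here is a minimal sketch reusing the book dict loaded above (find_text is just an illustrative name):

def find_text(book, chapter, verse):
    # scan chapters, then verses, for the requested position
    for ch in book["chapters"]:
        if ch["chapter"] == chapter:
            for v in ch["verses"]:
                if v["verse"] == verse:
                    return v["text"]
    return None  # not found

print(find_text(book, "1", "2"))  # Testing 2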
I have produced a couple of json files after scraping a few elements. The structure for each file is as follows:
us.json
{'Pres': 'Biden', 'Vice': 'Harris', 'Secretary': 'Blinken'}
uk.json
{'1st Min': 'Johnson', 'Queen':'Elizabeth', 'Prince': 'Charles'}
I'd like to know how I could edit the structure of each dictionary inside the json files to get an output as follows:
[
{"title": "Pres",
"name": "Biden"}
,
{"title": "Vice",
"name": "Harris"}
,
{"title": "Secretary",
"name": "Blinken"}
]
As far as I can think how to do it (I'm a beginner, studying for only a few weeks), I first need to run a loop to open each file, then generate a list of dictionaries, and finally modify the dictionaries to change the structure. This is what I got, NOT WORKING, as it always overwrites with the same keys.
import os
import json

list_of_dicts = []
for filename in os.listdir("DOCS/Countries Data"):
    with open(os.path.join("DOCS/Countries Data", filename), 'r', encoding='utf-8') as f:
        text = f.read()
        country_json = json.loads(text)
        list_of_dicts.append(country_json)

for country in list_of_dicts:
    newdict = country
    lastdict = {}
    for key in newdict:
        lastdict = {'Title': key}
    for value in newdict.values():
        lastdict['Name'] = value
    print(lastdict)
Extra bonus if you could also show me how to generate an ID number for each entry. Thank you very much.
This looks like a task for a list comprehension; I would do it the following way:
import json
us = '{"Pres": "Biden", "Vice": "Harris", "Secretary": "Blinken"}'
data = json.loads(us)
us2 = [{"title":k,"name":v} for k,v in data.items()]
us2json = json.dumps(us2)
print(us2json)
output
[{"title": "Pres", "name": "Biden"}, {"title": "Vice", "name": "Harris"}, {"title": "Secretary", "name": "Blinken"}]
data is a dict; .items() provides key-value pairs, which I unpack into k and v (see tuple unpacking).
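For example, each item is a (key, value) tuple that unpacks directly in the loop header or comprehension:

for k, v in {"Pres": "Biden", "Vice": "Harris"}.items():
    print(k, v)
# Pres Biden
# Vice Harris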
You can do this easily by writing a simple function like the one below:
import uuid
def format_dict(data: dict):
    return [dict(title=title, name=name, id=str(uuid.uuid4())) for title, name in data.items()]
where you split the items into separate objects and add an identifier for each using uuid.
The full code can be modified like this:
import uuid
import os
import json
def format_dict(data: dict):
    return [dict(title=title, name=name, id=str(uuid.uuid4())) for title, name in data.items()]

list_of_dicts = []
for filename in os.listdir("DOCS/Countries Data"):
    with open(os.path.join("DOCS/Countries Data", filename), 'r', encoding='utf-8') as f:
        country_json = json.load(f)
        list_of_dicts.append(format_dict(country_json))
# list_of_dicts contains all file contents
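For the us.json example above, the entry appended for that file would look something like this (the id values are random UUIDs, so yours will differ):

[{'title': 'Pres', 'name': 'Biden', 'id': '4c8e8f2a-...'},
 {'title': 'Vice', 'name': 'Harris', 'id': '9b17d3c0-...'},
 {'title': 'Secretary', 'name': 'Blinken', 'id': 'e5a01d9f-...'}]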
New to programming.
Question: I need to take the "Data" below (two rows as arrays, queried from SQL) and use it to create the message structure shown below.
Data from SQL using fetchall():
Data = [[100,1,4,5],[101,1,4,6]]
##expected message structure
message = {
    "name": "Tom",
    "Job": "IT",
    "info": [
        {
            "id_1": "100",
            "id_2": "1",
            "id_3": "4",
            "id_4": "5"
        },
        {
            "id_1": "101",
            "id_2": "1",
            "id_3": "4",
            "id_4": "6"
        },
    ]
}
I tried to create the below method to iterate over the rows and then fill in the values; this was just a start, but it was also not working:
def create_message(data):
    for row in data:
        {
            "id_1": str(data[0][0]),
            "id_2": str(data[0][1]),
            "id_3": str(data[0][2]),
            "id_4": str(data[0][3]),
        }
Latest Code
def create_info(data):
    info = []
    for row in data:
        temp_dict = {"id_1_tom": "", "id_2_hell": "", "id_3_trip": "", "id_4_clap": ""}
        for i in range(0, 1):
            temp_dict["id_1_tom"] = str(row[i])
            temp_dict["id_2_hell"] = str(row[i+1])
            temp_dict["id_3_trip"] = str(row[i+2])
            temp_dict["id_4_clap"] = str(row[i+3])
        info.append(temp_dict)
    return info
Edit: Updated answer based on updates to the question and comment by original poster.
This function might work for the example you've given to get the desired output, based on the attempt you've provided:
def create_info(data):
    info = []
    for row in data:
        temp_dict = {}
        temp_dict['id_1_tom'] = str(row[0])
        temp_dict['id_2_hell'] = str(row[1])
        temp_dict['id_3_trip'] = str(row[2])
        temp_dict['id_4_clap'] = str(row[3])
        info.append(temp_dict)
    return info
For the input:
[[100, 1, 4, 5],[101,1,4,6]]
This function will return a list of dictionaries:
[{"id_1_tom":"100","id_2_hell":"1","id_3_trip":"4","id_4_clap":"5"},
{"id_1_tom":"101","id_2_hell":"1","id_3_trip":"4","id_4_clap":"6"}]
This can serve as the value for the key info in your dictionary message. Note that you would still have to construct the message dictionary.
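For example, a quick sketch of assembling the full message with the placeholder name and Job values from the question:

Data = [[100, 1, 4, 5], [101, 1, 4, 6]]

message = {
    "name": "Tom",
    "Job": "IT",
    "info": create_info(Data),
}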
While working with small zip files (about 8 MB) containing 25 MB of CSV files, the below code works exactly as it should. As soon as I attempt to download larger files (a 45 MB zip file containing a 180 MB CSV), the code breaks and I get the following error message:
(venv) ufulu#ufulu awr % python get_awr_ranking_data.py
https://api.awrcloud.com/v2/get.php?action=get_topsites&token=REDACTED&project=REDACTED Client+%5Bw%5D&fileName=2017-01-04-2019-10-09
Traceback (most recent call last):
File "get_awr_ranking_data.py", line 101, in <module>
getRankingData(project['name'])
File "get_awr_ranking_data.py", line 67, in getRankingData
processRankingdata(rankDateData['details'])
File "get_awr_ranking_data.py", line 79, in processRankingdata
domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
AttributeError: 'float' object has no attribute 'split'
My goal is to download data for 170 projects and save the data to sqlite DB.
Please bear with me as I am a novice in the field of programming and Python. I would greatly appreciate any help fixing the code below, as well as any other suggestions and improvements to make the code more robust and pythonic.
Thanks in advance
from dotenv import dotenv_values
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
from sqlalchemy import create_engine

# SQL Alchemy setup
engine = create_engine('sqlite:///rankingdata.sqlite', echo=False)

# Excerpt from the initial API call
data = {'projects': [{'name': 'Client1',
                      'id': '168',
                      'frequency': 'daily',
                      'depth': '5',
                      'kwcount': '80',
                      'last_updated': '2019-10-01',
                      'keywordstamp': 1569941983},
                     {
                         "depth": "5",
                         "frequency": "ondemand",
                         "id": "194",
                         "kwcount": "10",
                         "last_updated": "2019-09-30",
                         "name": "Client2",
                         "timestamp": 1570610327
                     },
                     {
                         "depth": "5",
                         "frequency": "ondemand",
                         "id": "196",
                         "kwcount": "100",
                         "last_updated": "2019-09-30",
                         "name": "Client3",
                         "timestamp": 1570610331
                     }
                     ]}

# setup
api_url = 'https://api.awrcloud.com/v2/get.php?action='
urls = []  # processed URLs
urlbacklog = []  # URLs that didn't return a downloadable file

# API call to receive the URL containing the downloadable zip and csv
def getRankingData(project):
    action = 'get_dates'
    response = requests.get(''.join([api_url, action]),
                            params=dict(token=dotenv_values()['AWR_API'],
                                        project=project))
    response = response.json()
    action2 = 'topsites_export'
    rankDateData = requests.get(''.join([api_url, action2]),
                                params=dict(token=dotenv_values()['AWR_API'],
                                            project=project,
                                            startDate=response['details']['dates'][0]['date'],
                                            stopDate=response['details']['dates'][-1]['date']))
    rankDateData = rankDateData.json()
    print(rankDateData['details'])
    urls.append(rankDateData['details'])
    processRankingdata(rankDateData['details'])

# API call to download and unzip csv data and process it in pandas
def processRankingdata(url):
    content = requests.get(url)
    # {"response_code":25,"message":"Export in progress. Please come back later"}
    if "response_code" not in content:
        f = ZipFile(BytesIO(content.content))
        # print(f.namelist()) to get all filenames in the zip
        with f.open(f.namelist()[0], 'r') as g:
            rankingdatadf = pd.read_csv(g)
        rankingdatadf = rankingdatadf[rankingdatadf['Search Engine'].str.contains("Google")]
        domain = []
        for row in rankingdatadf['URL']:
            domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
        rankingdatadf['Domain'] = domain
        rankingdatadf['Domain'] = rankingdatadf['Domain'].str.replace('www.', '')
        rankingdatadf = rankingdatadf.drop(columns=['Title', 'Meta description', 'Snippet', 'Page'])
        print(rankingdatadf['Search Engine'][0])
        writeData(rankingdatadf)
    else:
        urlbacklog.append(url)
        pass

# Finally write the data to the database
def writeData(rankingdatadf):
    table_name_from_file = project['name']
    check = engine.has_table(table_name_from_file)
    print(check)  # boolean
    if check == False:
        rankingdatadf.to_sql(table_name_from_file, con=engine)
        print(project['name'] + ' ...Done')
    else:
        print(project['name'] + ' ... already in DB')

for project in data['projects']:
    getRankingData(project['name'])
The problem seems to be the split call on a float and not necessarily the download. Try changing line 79
from
domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
to
domain.append(str(str(str(row).split("//")[-1]).split("/")[0]).split('?')[0])
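The extra str() matters because pandas represents a missing URL cell as the float nan, which has no split method. A quick illustration:

row = float("nan")            # what pandas stores for an empty URL cell
str(row).split("//")[-1]      # 'nan' -- no AttributeError

Note that only the first str(row) conversion is strictly needed; the outer str() calls are harmless but redundant, since split already returns strings.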
It looks like you're trying to parse the network location portion of the URL here; you can also use urllib.parse to make this easier instead of chaining all the splits:

from urllib.parse import urlparse
...
for row in rankingdatadf['URL']:
    domain.append(urlparse(row).netloc)
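For reference, netloc is the host portion of the URL, so the parse above behaves like this:

from urllib.parse import urlparse

print(urlparse("https://www.example.com/page?q=1").netloc)  # www.example.com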
I think a malformed URL is causing you issues; try this to diagnose the issue:

for row in rankingdatadf['URL']:
    try:
        domain.append(urlparse(row).netloc)
    except Exception:
        exit(row)
Looks like you figured it out above; you have a database entry with a NULL value for the URL field. Not sure what your fidelity requirements for this data set are, but you might want to enforce database rules for the URL field, or use pandas to drop rows where the URL is NaN.
rankingdatadf = rankingdatadf.dropna(subset=['URL'])
I am new to JSON and Python, and I am trying to achieve the below.
I need to parse the below JSON:
{
    "id": "12345abc",
    "codes": [
        "BSVN1FKW3JKKNNMN",
        "HJYYUKJJL999OJR",
        "DFTTHJJJJ0099JUU",
        "FGUUKHKJHJGJJYGJ"
    ],
    "ctr": {
        "source": "xyz",
        "user_id": "1234"
    }
}
Expected output (normalized on the "codes" values):
ID~CODES~USER_ID
12345abc~BSVN1FKW3JKKNNMN~1234
12345abc~HJYYUKJJL999OJR~1234
12345abc~DFTTHJJJJ0099JUU~1234
12345abc~FGUUKHKJHJGJJYGJ~1234
I started with the below, but need help getting to my desired output.
The "codes" block can have any number of values, separated by commas.
The below code is throwing an error: "TypeError: string indices must be integers".
#!/usr/bin/python
import os
import json
import csv

f = open('rspns.csv', 'w')
writer = csv.writer(f, delimiter='~')
headers = ['ID', 'CODES', 'USER_ID']
default = ''
writer.writerow(headers)
string = open('sample.json').read().decode('utf-8')
json_obj = json.loads(string)
#print json_obj['id']
#print json_obj['codes']
#print json_obj['codes'][0]
#print json_obj['codes'][1]
#print json_obj['codes'][2]
#print json_obj['codes'][3]
#print json_obj['ctr']['user_id']
for keyword in json_obj:
    row = []
    row.append(str(keyword['id']))
    row.append(str(keyword['codes']))
    row.append(str(keyword['ctr']['user_id']))
    writer.writerow(row)
If your json_obj looks exactly like that, that is, it is a dictionary, then the issue is that when you do
for keyword in json_obj:
you are iterating over the keys in json_obj, so when you try to access ['id'] on such a key it errors out saying string indices must be integers.
You should first get the id and user_id before looping, then loop over json_obj['codes'] and add the previously computed id and user_id along with the current value from the codes list to the CSV writer as a row.
Example -
import json
import csv

string = open('sample.json').read().decode('utf-8')
json_obj = json.loads(string)
with open('rspns.csv', 'w') as f:
    writer = csv.writer(f, delimiter='~')
    headers = ['ID', 'CODES', 'USER_ID']
    writer.writerow(headers)
    id = json_obj['id']
    user_id = json_obj['ctr']['user_id']
    for code in json_obj['codes']:
        writer.writerow([id, code, user_id])
You don't want to iterate through json_obj, as that is a dictionary and iterating through it will get the keys. The TypeError is caused by trying to index into those keys ('id', 'codes', and 'ctr') -- which are strings -- as if they were dictionaries.
Instead, you want a separate row for each code in json_obj['codes'], using the json_obj dictionary for your lookups:
for code in json_obj['codes']:
    row = []
    row.append(json_obj['id'])
    row.append(code)
    row.append(json_obj['ctr']['user_id'])
    writer.writerow(row)