I am trying to loop over a list containing Twitter data in JSON format. The list is made of several dictionaries, each containing data on a politician. The code works if the input json_response only holds data on one politician. However, when json_response is a list of dictionaries I get an error.
In short, I believe the issue can be isolated to three for-loops in the code: for tweet in json_response['data']:, for dics in json_response['includes']['users']:, and for element in json_response['includes']['media']:.
# Inputs for the request
bearer_token = auth()
headers = create_headers(bearer_token)
keyword = search_query
start_time = "2016-03-01T00:00:00.000Z"
end_time = "2021-03-31T00:00:00.000Z"
max_results = 3000
json_response = [] # empty list that will hold tweet objects
for i in keyword:  # loop through list of politicians in keyword, i.e. the search query, and extract tweets
    url = create_url(i, start_time, end_time, max_results)
    json_response.append(connect_to_endpoint(url[0], headers, url[1]))
I have only pasted the json_response object for 2 out of 30 politicians due to the cap on characters. However, the structure is the same for the remaining 28 politicians.
print(json.dumps(json_response, indent=4, sort_keys=True)) # look at json_response object.
[
{
"data": [
{
"author_id": "2877379617",
"created_at": "2021-03-25T12:11:14.000Z",
"id": "1375057688355336195",
"text": "#prettynobodyco She blocked me in 2015 - for pointing out that Tim Kaine enables sexual assault in the military and the evidence was his killing of the MJIA and publicly stated that Military commanders should remain in charge of military rape cases. She's Tanden level awful. Congrats!"
},
{
"author_id": "1265018154444562440",
"created_at": "2021-03-22T19:48:59.000Z",
"id": "1374085719472361474",
"text": "#MehcatCat #AlasscanIsBack #PattyArquette #timkaine Funny, they blocked me. \ud83e\udd23\ud83e\udd23"
},
{
"author_id": "2378324935",
"created_at": "2021-03-07T21:32:13.000Z",
"id": "1368675879312887810",
"text": "#DrWinarick #KatieOGrady4 I apologize for any drama. Katie O Grady blocked me because we had a disagreement about Tim Kaine on one of your older posts. I guess I can't please everyone haha. :/"
},
{
"author_id": "821870502943817729",
"created_at": "2021-02-12T23:53:59.000Z",
"id": "1360376637385244673",
"text": "She blocked me a long ass time ago when I asked her why we shoulf care about Tim Kaine's personal view on abortion if it didn't impact legislation"
},
{
"attachments": {
"media_keys": [
"16_1341045032732770306"
]
},
"author_id": "17232340",
"created_at": "2020-12-21T15:37:07.000Z",
"id": "1341045038420275205",
"text": "#DSingh4Biden #moomintroll8 #timkaine #GovernorVA That's why I replied to you. She blocked me previously, for what silliness I can't remember. Tough being a troll AND a snowflake!"
}
],
"includes": {
"media": [
{
"media_key": "16_1341045032732770306",
"type": "animated_gif"
}
],
"users": [
{
"created_at": "2014-11-15T02:23:57.000Z",
"description": "",
"id": "2877379617",
"name": "Laura Saylor",
"username": "lauraleesaylor"
},
{
"created_at": "2020-05-25T20:33:36.000Z",
"description": "Weird Writer & Lunatic Linguist\nWicked Witch of the East\nshe/her",
"id": "1265018154444562440",
"name": "Zauberkind",
"username": "Zauberkind2"
},
{
"created_at": "2014-03-08T07:22:31.000Z",
"description": "#Resist, #BLM, #Vaxxed, liberal, autistic, kidney transplant survivor, political nerd, mental health advocate, fighter for equality, truth, justice, etc.",
"id": "2378324935",
"name": "Trevor \"Trev\" McKee Achilles",
"username": "MrTAchilles"
},
{
"created_at": "2017-01-19T00:02:52.000Z",
"description": "statist / Progressive Gun Nut/ Single and hating it\n\n / \n\nstraight????? /\n\npronouns / brain worm survivor\n\n \n",
"id": "821870502943817729",
"name": "Squirrel Dad",
"username": "nihilisticpillo"
},
{
"created_at": "2008-11-07T15:09:46.000Z",
"description": "Liberal-Veteran-Dog Lover | Taste for irony, but in moderation | Humor is reason gone mad. ~Groucho Marx | I follow & unfollow back #VeteransResist #Resist",
"id": "17232340",
"name": "anti-Fascist Jim",
"username": "JimnBL"
}
]
},
"meta": {
"newest_id": "1375057688355336195",
"next_token": "b26v89c19zqg8o3foseug43lzoqdft4ghg78o9sn9ds3h",
"oldest_id": "1341045038420275205",
"result_count": 5
}
},
{
"data": [
{
"author_id": "1248251899884814336",
"created_at": "2021-03-27T13:36:45.000Z",
"id": "1375803982409576450",
"text": "#gavinjeffries0 #steven86026859 #MSNBC #SenBooker Uh Oh our friend Steve blocked me, I guess not being able to answer your simple question and being asked to was too much for him."
},
{
"author_id": "293104735",
"created_at": "2021-02-07T21:45:47.000Z",
"id": "1358532435122683904",
"text": "#slwilliams1101 #annabella313 #CrossConnection #TiffanyDCross #Scaramucci #JoyAnnReid #CapehartJ #MSNBC #SenBooker #AliVelshi I stopped watching #TiffanyDCross as well and only watch #CapehartJ now (even though he blocked me in 2016 because I had a \"strong\" response to something mean he said about Hillary Clinton)."
},
{
"author_id": "380970864",
"created_at": "2021-02-07T20:58:01.000Z",
"id": "1358520416273326081",
"text": "#annabella313 #CrossConnection #TiffanyDCross #Scaramucci #JoyAnnReid #CapehartJ #MSNBC After I criticized #TiffanyDCross she blocked me. #JoyAnnReid called herself petty during and interview with #SenBooker. Why be petty? Be mature and thoughtful so people can learn. Hosts need to learn too. I only watch #AliVelshi #CapehartJ now."
},
{
"attachments": {
"media_keys": [
"3_1358448920632909825"
]
},
"author_id": "793175035322171397",
"created_at": "2021-02-07T16:17:44.000Z",
"id": "1358449876565164034",
"text": "#FinstaManhattan #SenSchumer #SenBooker #RonWyden Lmao he blocked me over that. His bio said he likes to 'debate & that sometimes he's wrong but he can admit that'.\n\nGuess not.\n\nI wasn't rude or mean at all. This is too funny \ud83e\udd23"
},
{
"author_id": "752266160352010241",
"created_at": "2021-02-06T20:34:06.000Z",
"id": "1358152008948195328",
"text": "#fattypinner #tkbone32221 #SenSchumer #SenBooker #RonWyden He blocked me \ud83e\udd23\ud83d\ude2d\ud83e\udd23\ud83e\udd23\ud83e\udd23\ud83d\ude2d"
}
],
"includes": {
"media": [
{
"media_key": "3_1358448920632909825",
"type": "photo",
"url": ""
}
],
"users": [
{
"created_at": "2020-04-09T14:11:04.000Z",
"description": "",
"id": "1248251899884814336",
"name": "Firstcomm",
"username": "Firstcomm1"
},
{
"created_at": "2011-05-04T19:26:22.000Z",
"description": "Cinephile, balletomane, book lover, tennis fan, K-Drama fanatic, Jang Na-ra fangirl, USC School of Cinematic Arts alumna, Hillary Clinton and Nancy Pelosi Dem.",
"id": "293104735",
"name": "Joyce Tyler",
"username": "joyce_tyler"
},
{
"created_at": "2011-09-27T14:50:37.000Z",
"description": "Spelman College, BA, George Washington University MA, University of South Florida Ph.D. in Political Science, proud Ted Kennedy, Obama, Biden/Harris Democrat!",
"id": "380970864",
"name": "Stephanie L. Williams, Ph.D.",
"username": "slwilliams1101"
},
{
"created_at": "2016-10-31T19:37:19.000Z",
"description": "Loves: life, fam, cats, cars, tattoos, reality TV; collector of t-shirts & Volkswagen\u2019s. Hates: Oxford commas. #CombatVet #Medic #BidenHarris2020 #Resist",
"id": "793175035322171397",
"name": "Que Sarah Sarah \ud83d\udda4",
"username": "sarahalli13"
},
{
"created_at": "2016-07-10T22:20:03.000Z",
"description": "3x Hollywood Video Street Fighter 2 Champion",
"id": "752266160352010241",
"name": "Sugarcoder",
"username": "TheSugarCoder"
}
]
},
"meta": {
"newest_id": "1375803982409576450",
"next_token": "b26v89c19zqg8o3fosktkdplqiw2q9kzx2ibm4r4y27wd",
"oldest_id": "1358152008948195328",
"result_count": 5
}
}
...28 other politicians
# Create file
csvFile = open("tweet_sample.csv", "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)
# Create headers for the data I want to save. I only want to save these columns in my dataset
csvWriter.writerow(
['author id', 'created_at', 'id', 'tweet', 'bio', 'image_url'])
csvFile.close()
def append_to_csv(json_response, fileName):
# A counter variable
global created_at, tweet_id, bio, text, author_id
counter = 0
# Open OR create the target CSV file
csvFile = open(fileName, "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)
# Loop through each tweet
for tweet in json_response[0]['data']: # NOTE adding a 0 gives access to the data for the first politician while adding 1 gives access to data for the second politician and so on...
# 1. Author ID
author_id = tweet['author_id']
# 2. Time created
created_at = dateutil.parser.parse(tweet['created_at'])
# 3. Tweet ID
tweet_id = tweet['id']
# 4. Tweet text
text = tweet['text']
for dics in json_response[0]['includes']['users']: # NOTE 0 added
# 5. description. Contained in includes data object
if ('description' in dics):
bio = dics['description']
else:
bio = " "
for element in json_response[0]['includes']['media']: # NOTE 0 added
# 6. image url. Contained in includes data object
if ('url' in element):
image_url = element['url']
else:
image_url = " "
# Assemble all data in a list
res = [author_id, created_at, tweet_id, text, bio, image_url]
# Append the result to the CSV file
csvWriter.writerow(res)
counter += 1
# When done, close the CSV file
csvFile.close()
# Print the number of tweets for this iteration
print("# of Tweets added from this response: ", counter)
append_to_csv(json_response, "tweet_sample.csv") # Save tweet data in a csv file
Error message:
TypeError: list indices must be integers or slices, not str
By adding the [0] in the loop I avoid the TypeError above. However, the output from the function append_to_csv is not ideal, as it only includes the last tweet for the first politician. I guess my loop overwrites data.
The desired output would be a data frame with the columns author_id, created_at, id, tweet, bio, image_url. Not all users have a bio on their profile or an image_url in their tweet, hence the if-else statements in the function above and the bio/no_bio and image_url/no_image_url values in the desired data frame.
pol_df = pd.read_csv("path_to_tweet_sample.csv" )
pol_df.head()
author_id created_at id tweet bio image_url
0 737885223858384896 2021-03-26T21:56:02.000Z 1375567243082338314 tweet_text no_bio no_image_url
1 847612931487416323 2021-03-26T21:55:24.000Z 1375567083791073283 tweet_text no_bio no_image_url
2 18634205 2021-03-08T12:29:00.000Z 1368901564363051010 tweet_text bio image_url
3 27327319 2021-03-02T11:53:16.000Z 1366718245521211393 tweet_text bio no_image_url
4 917634626247647232 2021-02-28T18:16:45.000Z 1366089974907432961 tweet_text bio image_url
I think you are confusing lists with dicts. When you try to access a list like a dict (e.g. data["author_id"]), the TypeError you're getting will be raised. You have to iterate over the list and then access each dict in that list, like [x['author_id'] for x in data], for example. If you want to extract values from the dicts and write them to a CSV file, you might want to do something like this:
import pandas as pd

author_data = []
for data in resp:  # resp is your json_response: one response dict per politician
    for author in data['data']:
        author_id = author['author_id']
        created_at = author['created_at']
        another_id = author['id']
        tweet_text = author['text']
        author_data.append([author_id, created_at, another_id, tweet_text])

author_df = pd.DataFrame(author_data, columns=['author_id', 'created_at', 'id', 'text'])

media_data = []
for data in resp:
    for media in data['includes']['media']:
        url = media.get('url', 'no_url')  # not every media object carries a url
        media_data.append([url])

media_df = pd.DataFrame(media_data, columns=['url'])

bio_data = []
for data in resp:
    for user in data['includes']['users']:
        bio = user['description']
        author_id = user['id']
        bio_data.append([bio, author_id])

bio_df = pd.DataFrame(bio_data, columns=['bio', 'author_id'])

final_df = author_df.merge(bio_df, on="author_id")
print(final_df)
You have to save the different parts of the data in different dataframes and then merge them. The catch is that the media objects do not contain the author_id or any other key shared with the ['data'] part directly; their only link back to a tweet is the media_key, which shows up in a tweet's attachments.media_keys.
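If you do want to attach image URLs to tweets, here is a minimal sketch along those lines, reusing resp and the pandas import from above and assuming each tweet carries at most one media key:
tweet_media = []
for data in resp:
    for tweet in data['data']:
        # 'attachments' is only present when the tweet actually has media
        keys = tweet.get('attachments', {}).get('media_keys', [])
        tweet_media.append([tweet['id'], keys[0] if keys else None])
tweet_media_df = pd.DataFrame(tweet_media, columns=['id', 'media_key'])

media_lookup = []
for data in resp:
    for media in data['includes']['media']:
        media_lookup.append([media['media_key'], media.get('url', 'no_image_url')])
media_lookup_df = pd.DataFrame(media_lookup, columns=['media_key', 'image_url'])

# A left merge keeps tweets without media; their image_url stays empty (NaN)
with_media = tweet_media_df.merge(media_lookup_df, on='media_key', how='left')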
I have a JSON structured like this:
{
"data": [
{
"groups": {
"data": [
{
"group_name": "Wedding planning - places and bands (and others) to recommend!",
"date_joined": "2009-03-12 01:01:08.677427"
},
{
"group_name": "Harry Potter and the Deathly Hollows",
"date_joined": "2009-01-15 01:38:06.822220"
},
{
"group_name": "Xbox , Playstation, Wii - console fans",
"date_joined": "2010-04-02 04:02:58.078934"
}
]
},
"id": "0"
},
{
"groups": {
"data": [
{
"group_name": "Lost&Found (Strzegom)",
"date_joined": "2010-02-01 14:13:34.551920"
},
{
"group_name": "Tennis, Squash, Badminton, table tennis - looking for sparring partner (Strzegom)",
"date_joined": "2008-09-24 17:29:43.356992"
}
]
},
"id": "1"
}
]
}
How does one parse JSONs in this form? Should I try building a class resembling this format? My desired output is a CSV where the index is an "id", the first column holds the most recently joined group, the second column the second most recently joined group, and so on.
Meaning the result of this would be:
most recent second most recent
0 Xbox , Playstation, Wii - console fans Wedding planning - places and bands (and others) to recommend!
1 Lost&Found (Strzegom) Tennis, Squash, Badminton, table tennis - looking for sparring partner (Strzegom)
A solution could look like this:
import json
import time
import pandas as pd

data = json.load(f)  # f is the opened JSON file
result = []
# number of group_name entries per id; for this example [3, 2]
max_element_group_name = [len(data['data'][i]['groups']['data']) for i in range(len(data['data']))]
max_element_group_name.sort()  # the smallest count limits how many columns we can fill
for i in range(len(data['data'])):
    # get the id for each group list
    id = data['data'][i]['id']
    # sort the groups by date_joined, newest first
    sorted_groups_by_date = sorted(data['data'][i]['groups']['data'], key=lambda x: time.strptime(x['date_joined'], '%Y-%m-%d %H:%M:%S.%f'), reverse=True)
    # take group names up to the minimum count; for this example 2
    group_names = [sorted_groups_by_date[j]['group_name'] for j in range(max_element_group_name[0])]
    # add the id and its group names to the result list
    result.append([id] + group_names)
# create a df from the list
df = pd.DataFrame(result, columns=['id', 'most recent', 'second most recent'])
# It could be better.
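Since date_joined is a fixed-width '%Y-%m-%d %H:%M:%S.%f' string, it also sorts correctly as a plain string, so a shorter sketch (assuming, as in the desired output, that only the two most recent groups are wanted) could be:
import pandas as pd

rows = []
for entry in data['data']:
    # newest first; the timestamp strings sort correctly lexicographically
    groups = sorted(entry['groups']['data'], key=lambda g: g['date_joined'], reverse=True)
    names = [g['group_name'] for g in groups[:2]]
    rows.append([entry['id']] + names + [None] * (2 - len(names)))

df = pd.DataFrame(rows, columns=['id', 'most recent', 'second most recent']).set_index('id')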
From the question I asked here I took a JSON response looking similar to this:
(please note: the ids in my sample data below are numeric strings, but some are alphanumeric)
data =
{
"state": "active",
"team_size": 20,
"teams": {
"id": "12345679",
"name": "Good Guys",
"level": 10,
"attacks": 4,
"destruction_percentage": 22.6,
"members": [
{
"id": "1",
"name": "John",
"level": 12
},
{
"id": "2",
"name": "Tom",
"level": 11,
"attacks": [
{
"attackerTag": "2",
"defenderTag": "4",
"damage": 64,
"order": 7
}
]
}
]
},
"opponent": {
"id": "987654321",
"name": "Bad Guys",
"level": 17,
"attacks": 5,
"damage": 20.95,
"members": [
{
"id": "3",
"name": "Betty",
"level": 17,
"attacks": [
{
"attacker_id": "3",
"defender_id": "1",
"damage": 70,
"order": 1
},
{
"attacker_id": "3",
"defender_id": "7",
"damage": 100,
"order": 11
}
],
"opponentAttacks": 0,
"some_useless_data": "Want to ignore, this doesn't show in every record"
},
{
"id": "4",
"name": "Fred",
"level": 9,
"attacks": [
{
"attacker_id": "4",
"defender_id": "9",
"damage": 70,
"order": 4
}
],
"opponentAttacks": 0
}
]
}
}
I loaded this using:
df = json_normalize([data['teams'], data['opponent']],
                    'members',
                    ['id', 'name'],
                    meta_prefix='team.',
                    errors='ignore')
print(df.iloc[3])
attacks [{'damage': 70, 'order': 4, 'defender_id': '9'...
id 4
level 9
name Fred
opponentAttacks 0
some_useless_data NaN
team.name Bad Guys
team.id 987654321
Name: 3, dtype: object
I have a 3-part question in essence.
How do I get a row like the one above using the member tag? I've tried:
member = df[df['id']=="1"].iloc[0]
#Now this works, but am I correctly doing this?
#It just feels weird is all.
How would I retrieve a member's defenses, given that only attacks are recorded and not defenses (even though defender_id is given)? I have tried:
df.where(df['tag']==df['attacks'].str.get('defender_id'), df['attacks'], axis=0)
#This is totally not working.. Where am I going wrong?
Since I am retrieving new data from an API, I need to check it against the old data in my database to see if there are any new attacks. I can then loop through the new attacks and display the attack info to the user.
This I honestly cannot figure out. I've tried looking into this question and this one as well, which I felt came closest to what I needed, and am still having trouble wrapping my brain around the concept. Essentially my logic is as follows:
def get_new_attacks(old_data, new_data):
'''params
old_data: Dataframe loaded from JSON in database
new_data: Dataframe loaded from JSON API response
hopefully having new attacks
returns:
iterator over the new attacks
'''
#calculate a dataframe with new attacks listed
return df.iterrows()
I know the function above shows little to no effort other than the docs I gave (basically to show my desired input/output), but trust me, I've been wracking my brain over this part the most. I've been looking into merging all attacks and then doing reset_index(), but that just raises an error because the attacks are lists. The map() function in the second question I linked above has me stumped.
Referring to your questions in order (code below):
It looks like id is a unique index of the data, so you can use df.set_index('id'), which allows you to access data by player id via df.loc['1'], for example.
As far as I understand your data, all the dictionaries listed in each of the attacks are self-contained, in the sense that the corresponding player id is not needed (attacker_id or defender_id seems to be enough to identify the data). So instead of dealing with rows that contain lists, I recommend swapping that data out into its own data frame, which makes it easily accessible.
Once you store the attacks in their own data frame, you can simply compare indices in order to filter out the old data.
Here's some example code to illustrate the various points:
# Question 1.
df.set_index('id', inplace=True)
print(df.loc['1']) # For example player id 1.
# Question 2 & 3.
attacks = pd.concat(map(
lambda x: pd.DataFrame.from_dict(x).set_index('order'), # Is 'order' the right index?
df['attacks'].dropna()
))
# Question 2.
print(attacks[attacks['defender_id'] == '1']) # For example defender_id 1.
# Question 3.
old_attacks = attacks.iloc[:2] # For example.
new_attacks = attacks[~attacks.index.isin(old_attacks.index)]
print(new_attacks)
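Building on that, a minimal sketch of the get_new_attacks helper from the question, assuming both arguments have already been reshaped into attack frames indexed by order as above:
def get_new_attacks(old_attacks, new_attacks):
    '''Return an iterator over attacks that are in new_attacks but not in old_attacks.'''
    fresh = new_attacks[~new_attacks.index.isin(old_attacks.index)]
    return fresh.iterrows()

for order, attack in get_new_attacks(old_attacks, attacks):
    print(order, attack['damage'])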
I have a JSON file that contains the metadata of 900 articles, and I want to extract the URLs from it. My file starts like this:
[
{
"title": "The histologic phenotypes of …",
"authors": [
{
"name": "JE Armes"
}
],
"publisher": "Wiley Online Library",
"article_url": "https://onlinelibrary.wiley.com/doi/abs/10.1002/(SICI)1097-0142(19981201)83:11%3C2335::AID-CNCR13%3E3.0.CO;2-N",
"cites": 261,
"use": true
},
{
"title": "Comparative epidemiology of pemphigus in ...",
"authors": [
{
"name": "S Bastuji-Garin"
},
{
"name": "R Souissi"
}
],
"year": 1995,
"publisher": "search.ebscohost.com",
"article_url": "http://search.ebscohost.com/login.aspx?direct=true&profile=ehost&scope=site&authtype=crawler&jrnl=0022202X&AN=12612836&h=B9CC58JNdE8SYy4M4RyVS%2FrPdlkoZF%2FM5hifWcv%2FwFvGxUCbEaBxwQghRKlK2vLtwY2WrNNl%2B3z%2BiQawA%2BocoA%3D%3D&crl=c",
"use": true
},
.........
I want to inspect the file with objectpath to create a JSON tree for the extraction of the URLs. This is the code I want to execute:
1. import json
2. import objectpath
3. with open("Data_sample.json") as datafile: data = json.load(datafile)
4. jsonnn_tree = objectpath.Tree(data['name of data'])
5. result_tuple = tuple(jsonnn_tree.execute('$..article_url'))
But in step 4, for the creation of the tree, I have to insert the name of the data, which I think I don't have in my file. How can I replace this line?
You can get all the article URLs using a list comprehension:
import json
with open("Data_sample.json") as fh:
articles = json.load(fh)
article_urls = [article['article_url'] for article in articles]
You can instantiate the tree like this:
import objectpath as op

tobj = op.Tree(your_data)
results = tobj.execute("$.article_url")
And in the end:
results = [x for x in results]
will yield:
["url1", "url2", ...]
Did you try removing the reference and just using:
jsonnn_tree = objectpath.Tree(data)
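Putting it together with the file from the question, a small self-contained sketch (using the file name Data_sample.json and the recursive $..article_url expression from the question, so the nesting depth doesn't matter):
import json
import objectpath

with open("Data_sample.json") as datafile:
    data = json.load(datafile)

tree = objectpath.Tree(data)  # pass the whole list; no 'name of data' key is needed
article_urls = list(tree.execute("$..article_url"))
print(article_urls[:5])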
{
"teachers" : [
{"name": "Lucy", "id": 3, course: "history"},
{"name": "Mark", "id": 6, "course": "maths"},
{"name": "Joan", "id": 20, course: "French"}
]
}
This document is in the "school" collection. I have been trying to access these embedded documents using
db.school.find({teachers:{id:3}})
I also tried
db.school.find({teacher.id:3})
but I understand it isn't working since mongo can't look inside an embedded array.
Therefore I would like to turn these embedded documents into individual documents. That is, remove the embedding and the "teachers" key, creating an individual document for each teacher.
The final "school" collection would be
{"name": "Lucy", "id": 3, "course": "history"},
{"name": "Mark", "id": 6, "course": "maths"},
{"name": "Joan", "id": 20, "course": "French"}
I would like to do this with Python and save the new documents into a collection.
EDIT
this is what i have come up with for now:
import pymongo
import sys
connection = pymongo.Connection("mongodb://localhost", safe=True)
db = connection.hello
shows = db.school
for doc in shows.find():
    for indiv in doc["teachers"]:
        try:
            db.individual.insert(indiv)
        except:
            print "Unexpected error", sys.exc_info()[0]
By the way, MongoDB can find embedded documents that are in arrays:
db.school.find({ 'teachers.id' : 3 });
You can learn more about the dot notation at mongodb documentation.
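The same dot-notation query works from PyMongo as well; a minimal sketch, assuming the connection setup from the question:
import pymongo

connection = pymongo.MongoClient("mongodb://localhost")
db = connection.hello

# dot notation reaches into the embedded teachers array
print(db.school.find_one({"teachers.id": 3}))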
In case the goal is to return only the embedded document you can use an aggregate request:
db.school.aggregate(
{$match: { 'teachers.id' : 3 }},
{$unwind : '$teachers'},
{$project: {
_id: 0,
name: '$teachers.name',
id: '$teachers.id',
course: '$teachers.course'
}},
{$match: {id:3}}
);
school_records = db.school.find()
for i in school_records:
for teacher in i['teachers']:
db.individual.insert(teacher)
What about this?
You can use an aggregate command (it is also available in PyMongo) from Mongo v2.2+:
fagg=db.school.aggregate([{$unwind: "$teachers"},
{$project: {name: "$teachers.name",
id: "$teachers.id", course: "$teachers.course"}}])
fagg.result.forEach(function(o){
db.teachers.insert({_id: o.id, name: o.name, course: o.course})})
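Since the question asks for Python, here is a PyMongo sketch of the same unwind-and-copy, assuming MongoDB 3.4+ for $replaceRoot and that the flattened documents should land in db.individual:
from pymongo import MongoClient

db = MongoClient("mongodb://localhost").hello

pipeline = [
    {"$unwind": "$teachers"},                    # one output document per embedded teacher
    {"$replaceRoot": {"newRoot": "$teachers"}},  # promote the teacher subdocument to the root
]
for teacher in db.school.aggregate(pipeline):
    db.individual.insert_one(teacher)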