I have extracted JSON objects from an API library and wrote them to a text file. I am now stuck on how to take the JSON structure saved in the .txt file and read it back into Python with the pandas library.
There are many resources that walk through how to import a JSON file into pandas, but since this is a text file and I'm new to programming and to working with JSON, I'm not sure how to perform this task efficiently.
There are numerous JSON objects in the text file. I would share an example, but it has a bunch of URL shorteners that are preventing me from posting this question, so unless someone really needs to see the structure I'll hold off. I already tried pd.read_csv() and pd.read_json(), but since this is a JSON structure in a .txt file, neither is working properly so far.
Here is my best guess so far for getting the data back into Python:
import json

data = []
with open('tweet_json.txt') as f:
    for line in f:
        data.append(json.loads(line))
But I got the following error message when I tried that: JSONDecodeError: Extra data: line 1 column 4626 (char 4625)
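For what it's worth, that "Extra data" error means json.loads found one complete JSON value and then hit more text in the same string — typically two objects written back-to-back without a newline between them. A minimal reproduction:

```python
import json

json.loads('{"a": 1}')  # one JSON value per string parses fine

try:
    # two JSON objects glued together in one string
    json.loads('{"a": 1}{"b": 2}')
except json.JSONDecodeError as e:
    print(e)  # Extra data: line 1 column 9 (char 8)
```

So the char offset in the error points at the exact spot where the first JSON value ended and unexpected extra text began.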
Here are two tweets that you can copy and save to a .txt file to replicate:
{'contributors': None,
'coordinates': None,
'created_at': 'Tue Aug 01 16:23:56 +0000 2017',
'display_text_range': [0, 85],
'entities': {'hashtags': [],
'media': [{'display_url': 'pic.twitter.com/MgUWQ76dJU',
'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
'id': 892420639486877696,
'id_str': '892420639486877696',
'indices': [86, 109],
'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
'sizes': {'large': {'h': 528, 'resize': 'fit', 'w': 540},
'medium': {'h': 528, 'resize': 'fit', 'w': 540},
'small': {'h': 528, 'resize': 'fit', 'w': 540},
'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
'type': 'photo',
'url': 'na'}],
'symbols': [],
'urls': [],
'user_mentions': []},
'extended_entities': {'media': [{'display_url': 'pic.twitter.com/MgUWQ76dJU',
'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
'id': 892420639486877696,
'id_str': '892420639486877696',
'indices': [86, 109],
'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
'sizes': {'large': {'h': 528, 'resize': 'fit', 'w': 540},
'medium': {'h': 528, 'resize': 'fit', 'w': 540},
'small': {'h': 528, 'resize': 'fit', 'w': 540},
'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
'type': 'photo',
'url': 'na'}]},
'favorite_count': 39311,
'favorited': False,
'full_text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 na ",
'geo': None,
'id': 892420643555336193,
'id_str': '892420643555336193',
'in_reply_to_screen_name': None,
'in_reply_to_status_id': None,
'in_reply_to_status_id_str': None,
'in_reply_to_user_id': None,
'in_reply_to_user_id_str': None,
'is_quote_status': False,
'lang': 'en',
'place': None,
'possibly_sensitive': False,
'possibly_sensitive_appealable': False,
'retweet_count': 8778,
'retweeted': False,
'source': 'Twitter for iPhone',
'truncated': False,
'user': {'contributors_enabled': False,
'created_at': 'Sun Nov 15 21:41:29 +0000 2015',
'default_profile': False,
'default_profile_image': False,
'description': 'Only Legit Source for Professional Dog Ratings STORE: #ShopWeRateDogs | IG, FB & SC: WeRateDogs | MOBILE APP: #GoodDogsGame Business: dogratingtwitter@gmail.com',
'entities': {'description': {'urls': []},
'url': {'urls': [{'display_url': 'weratedogs.com',
'expanded_url': 'http://weratedogs.com',
'indices': [0, 23],
'url': 'na'}]}},
'favourites_count': 126135,
'follow_request_sent': False,
'followers_count': 4730764,
'following': False,
'friends_count': 109,
'geo_enabled': True,
'has_extended_profile': True,
'id': 4196983835,
'id_str': '4196983835',
'is_translation_enabled': False,
'is_translator': False,
'lang': 'en',
'listed_count': 3700,
'location': 'DM YOUR DOGS. WE WILL RATE',
'name': 'WeRateDogs™',
'notifications': False,
'profile_background_color': '000000',
'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
'profile_background_tile': False,
'profile_banner_url': 'https://pbs.twimg.com/profile_banners/4196983835/1510812288',
'profile_image_url': 'http://pbs.twimg.com/profile_images/936608706107772929/GwbLQRxf_normal.jpg',
'profile_image_url_https': 'https://pbs.twimg.com/profile_images/936608706107772929/GwbLQRxf_normal.jpg',
'profile_link_color': 'F5ABB5',
'profile_sidebar_border_color': '000000',
'profile_sidebar_fill_color': '000000',
'profile_text_color': '000000',
'profile_use_background_image': False,
'protected': False,
'screen_name': 'dog_rates',
'statuses_count': 6301,
'time_zone': None,
'translator_type': 'none',
'url': 'na',
'utc_offset': None,
'verified': True}}
{'contributors': None,
'coordinates': None,
'created_at': 'Tue Aug 01 00:17:27 +0000 2017',
'display_text_range': [0, 138],
'entities': {'hashtags': [],
'media': [{'display_url': 'pic.twitter.com/0Xxu71qeIV',
'expanded_url': 'https://twitter.com/dog_rates/status/892177421306343426/photo/1',
'id': 892177413194625024,
'id_str': '892177413194625024',
'indices': [139, 162],
'media_url': 'http://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg',
'media_url_https': 'https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg',
'sizes': {'large': {'h': 1600, 'resize': 'fit', 'w': 1407},
'medium': {'h': 1200, 'resize': 'fit', 'w': 1055},
'small': {'h': 680, 'resize': 'fit', 'w': 598},
'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
'type': 'photo',
'url': 'na'}],
'symbols': [],
'urls': [],
'user_mentions': []},
'extended_entities': {'media': [{'display_url': 'pic.twitter.com/0Xxu71qeIV',
'expanded_url': 'https://twitter.com/dog_rates/status/892177421306343426/photo/1',
'id': 892177413194625024,
'id_str': '892177413194625024',
'indices': [139, 162],
'media_url': 'http://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg',
'media_url_https': 'https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg',
'sizes': {'large': {'h': 1600, 'resize': 'fit', 'w': 1407},
'medium': {'h': 1200, 'resize': 'fit', 'w': 1055},
'small': {'h': 680, 'resize': 'fit', 'w': 598},
'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
'type': 'photo',
'url': 'na'}]},
'favorite_count': 33662,
'favorited': False,
'full_text': "This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 na",
'geo': None,
'id': 892177421306343426,
'id_str': '892177421306343426',
'in_reply_to_screen_name': None,
'in_reply_to_status_id': None,
'in_reply_to_status_id_str': None,
'in_reply_to_user_id': None,
'in_reply_to_user_id_str': None,
'is_quote_status': False,
'lang': 'en',
'place': None,
'possibly_sensitive': False,
'possibly_sensitive_appealable': False,
'retweet_count': 6431,
'retweeted': False,
'source': 'Twitter for iPhone',
'truncated': False,
'user': {'contributors_enabled': False,
'created_at': 'Sun Nov 15 21:41:29 +0000 2015',
'default_profile': False,
'default_profile_image': False,
'description': 'Only Legit Source for Professional Dog Ratings STORE: #ShopWeRateDogs | IG, FB & SC: WeRateDogs | MOBILE APP: #GoodDogsGame Business: dogratingtwitter@gmail.com',
'entities': {'description': {'urls': []},
'url': {'urls': [{'display_url': 'weratedogs.com',
'expanded_url': 'http://weratedogs.com',
'indices': [0, 23],
'url': 'na'}]}},
'favourites_count': 126135,
'follow_request_sent': False,
'followers_count': 4730865,
'following': False,
'friends_count': 109,
'geo_enabled': True,
'has_extended_profile': True,
'id': 4196983835,
'id_str': '4196983835',
'is_translation_enabled': False,
'is_translator': False,
'lang': 'en',
'listed_count': 3728,
'location': 'DM YOUR DOGS. WE WILL RATE',
'name': 'WeRateDogs™',
'notifications': False,
'profile_background_color': '000000',
'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
'profile_background_tile': False,
'profile_banner_url': 'https://pbs.twimg.com/profile_banners/4196983835/1510812288',
'profile_image_url': 'http://pbs.twimg.com/profile_images/936608706107772929/GwbLQRxf_normal.jpg',
'profile_image_url_https': 'https://pbs.twimg.com/profile_images/936608706107772929/GwbLQRxf_normal.jpg',
'profile_link_color': 'F5ABB5',
'profile_sidebar_border_color': '000000',
'profile_sidebar_fill_color': '000000',
'profile_text_color': '000000',
'profile_use_background_image': False,
'protected': False,
'screen_name': 'dog_rates',
'statuses_count': 6301,
'time_zone': None,
'translator_type': 'none',
'url': 'na',
'utc_offset': None,
'verified': True}}
Update
The following code produces this error: JSONDecodeError: Expecting ',' delimiter: line 1 column 4627 (char 4626)
with open('tweet_json.txt', 'r') as f:
    datastore = json.load(f)
This post is the closest I've found so far to help me solve my issue:
Python json.loads shows ValueError: Expecting , delimiter: line 1
Thanks everyone for the feedback. I had to adjust how I was extracting the data from the API, and after that it was pretty straightforward to get the data into a list of dictionaries.
with open('tweet_json.txt', 'a+', encoding='utf-8') as file:
    for tweet_id in twitter_archive_df['tweet_id']:
        try:
            tweet = api.get_status(id=tweet_id, tweet_mode='extended')
            file.write(json.dumps(tweet))
            file.write('\n')
        except Exception:
            # skip tweets that could not be retrieved
            pass
Then I ran the following code to import the JSON objects from the .txt file into a list of dictionaries:
status = []
with open('tweet_json.txt') as file:
    for line in file:
        status.append(json.loads(line))
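From there, the list of dictionaries drops straight into pandas. A minimal sketch (the two inline records stand in for the contents of tweet_json.txt, trimmed to a few fields from the tweets shown above):

```python
import json
import pandas as pd

# Two line-delimited JSON records standing in for tweet_json.txt
lines = [
    '{"id": 892420643555336193, "retweet_count": 8778, "favorite_count": 39311}',
    '{"id": 892177421306343426, "retweet_count": 6431, "favorite_count": 33662}',
]
status = [json.loads(line) for line in lines]

# pandas turns a list of flat dicts into one row per dict
df = pd.DataFrame(status)
print(df.shape)  # (2, 3)
```

For deeply nested objects, pd.json_normalize(status) flattens nested keys into dotted column names instead of leaving dicts inside cells.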
Related
Good day everyone,
I'm trying to parse Telegram poll data. I have the following:
{'_': 'MessageMediaPoll', 'poll': {'_': 'Poll', 'id': 578954245254551900254, 'question': 'Have you seen it ?! 👀', 'answers': [{'_': 'PollAnswer', 'text': 'Lost', 'option': [48]}, {'_': 'PollAnswer', 'text': 'Am lose', 'option': [49]}, {'_': 'PollAnswer', 'text': 'Have lost', 'option': [50]}, {'_': 'PollAnswer', 'text': 'Am losing', 'option': [51]}], 'closed': False, 'public_voters': False, 'multiple_choice': False, 'quiz': True, 'close_period': None, 'close_date': None}, 'results': {'_': 'PollResults', 'min': False, 'results': [{'_': 'PollAnswerVoters', 'option': [48], 'voters': 2066, 'chosen': False, 'correct': True}, {'_': 'PollAnswerVoters', 'option': [49], 'voters': 471, 'chosen': False, 'correct': False}, {'_': 'PollAnswerVoters', 'option': [50], 'voters': 704, 'chosen': False, 'correct': False}, {'_': 'PollAnswerVoters', 'option': [51], 'voters': 279, 'chosen': True, 'correct': False}], 'total_voters': 3520, 'recent_voters': [], 'solution': None, 'solution_entities': []}}
and I want to print it like this:
Q: Have you seen it ?! 👀
A: Lost|Correct
A: Am lose|Incorrect
A: Have lost|Incorrect
A: Am losing|Incorrect
How can I achieve that in Python? And what is the type of this data? Is it JSON?
data = {'_': 'MessageMediaPoll', 'poll': {'_': 'Poll', 'id': 57894245245450254, 'question': 'Have you seen it ?! 👀', 'answers': [{'_': 'PollAnswer', 'text': 'Lost', 'option': [48]}, {'_': 'PollAnswer', 'text': 'Am lose', 'option': [49]}, {'_': 'PollAnswer', 'text': 'Have lost', 'option': [50]}, {'_': 'PollAnswer', 'text': 'Am losing', 'option': [51]}], 'closed': False, 'public_voters': False, 'multiple_choice': False, 'quiz': True, 'close_period': None, 'close_date': None}, 'results': {'_': 'PollResults', 'min': False, 'results': [{'_': 'PollAnswerVoters', 'option': [48], 'voters': 2066, 'chosen': False, 'correct': True}, {'_': 'PollAnswerVoters', 'option': [49], 'voters': 471, 'chosen': False, 'correct': False}, {'_': 'PollAnswerVoters', 'option': [50], 'voters': 704, 'chosen': False, 'correct': False}, {'_': 'PollAnswerVoters', 'option': [51], 'voters': 279, 'chosen': True, 'correct': False}], 'total_voters': 3520, 'recent_voters': [], 'solution': None, 'solution_entities': []}}
Questionz = data['poll']['question']
Answerz = ""
Truez = ""
fullData = ""

# print(data['poll']['question'])
for answer in data['poll']['answers']:
    # print(answer['text'], answer['option'])
    Answerz = Answerz + answer['text'] + "\n"
    # print(answer['text'])

for results in data['results']['results']:
    # print(results['correct'])
    Truez = Truez + str(results['correct']) + "\n"

TruezLines = Truez.split("\n")

n = 0
for line in Answerz.splitlines():
    print(line, "|||||||||", TruezLines[n])
    fullData = line, "|||||||||", TruezLines[n]
    n += 1
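Since answers and results are parallel lists, zip can pair them directly, which produces exactly the requested output in one loop. A minimal sketch (data is trimmed here to the fields that matter; the full dict from the question works the same way). And on the type question: once parsed into Python this is a plain dict, not JSON — JSON is only the text serialization.

```python
# Trimmed version of the poll dict from the question
data = {
    'poll': {
        'question': 'Have you seen it ?! 👀',
        'answers': [{'text': 'Lost'}, {'text': 'Am lose'},
                    {'text': 'Have lost'}, {'text': 'Am losing'}],
    },
    'results': {
        'results': [{'correct': True}, {'correct': False},
                    {'correct': False}, {'correct': False}],
    },
}

print('Q:', data['poll']['question'])
# pair each answer with its result entry; they are parallel lists
lines = [
    f"A: {answer['text']}|{'Correct' if result['correct'] else 'Incorrect'}"
    for answer, result in zip(data['poll']['answers'],
                              data['results']['results'])
]
print('\n'.join(lines))
```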
I am trying to import a deeply nested JSON file into a pandas dataframe. Here is the structure of the JSON file (this is only the first record, retweets[:1]):
[{'lang': 'en',
'author_id': '1076979440372965377',
'reply_settings': 'everyone',
'entities': {'mentions': [{'start': 3,
'end': 17,
'username': 'Terry81987010',
'url': '',
'location': 'Florida',
'entities': {'description': {'hashtags': [{'start': 29,
'end': 32,
'tag': '2A'}]}},
'created_at': '2019-02-01T23:01:11.000Z',
'protected': False,
'public_metrics': {'followers_count': 520,
'following_count': 567,
'tweet_count': 34376,
'listed_count': 1},
'name': "Terry's Take",
'verified': False,
'id': '1091471553437593605',
'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
{'start': 19,
'end': 32,
'username': 'DineshDSouza',
'location': 'United States',
'entities': {'url': {'urls': [{'start': 0,
'end': 23,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
'description': {'urls': [{'start': 80,
'end': 103,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
'created_at': '2009-11-22T22:32:41.000Z',
'protected': False,
'public_metrics': {'followers_count': 1748832,
'following_count': 5355,
'tweet_count': 65674,
'listed_count': 6966},
'name': "Dinesh D'Souza",
'verified': True,
'pinned_tweet_id': '1393309917239562241',
'id': '91882544',
'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe: ",
'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}]},
'conversation_id': '1253462541881106433',
'created_at': '2020-04-23T23:15:32.000Z',
'id': '1253462541881106433',
'possibly_sensitive': False,
'referenced_tweets': [{'type': 'retweeted',
'id': '1253052684489437184',
'in_reply_to_user_id': '91882544',
'attachments': {'media_keys': ['3_1253052312144293888',
'3_1253052620937277442'],
'media': [{}, {}]},
'entities': {'annotations': [{'start': 126,
'end': 128,
'probability': 0.514,
'type': 'Organization',
'normalized_text': 'CDC'},
{'start': 145,
'end': 146,
'probability': 0.5139,
'type': 'Place',
'normalized_text': 'NY'}],
'mentions': [{'start': 0,
'end': 13,
'username': 'DineshDSouza',
'location': 'United States',
'entities': {'url': {'urls': [{'start': 0,
'end': 23,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
'description': {'urls': [{'start': 80,
'end': 103,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
'created_at': '2009-11-22T22:32:41.000Z',
'protected': False,
'public_metrics': {'followers_count': 1748832,
'following_count': 5355,
'tweet_count': 65674,
'listed_count': 6966},
'name': "Dinesh D'Souza",
'verified': True,
'pinned_tweet_id': '1393309917239562241',
'id': '91882544',
'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe: ",
'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}],
'urls': [{'start': 187,
'end': 210,
'expanded_url': 'https://twitter.com/Terry81987010/status/1253052684489437184/photo/1',
'display_url': 'pic.twitter.com/H4NpN5ZMkW'},
{'start': 187,
'end': 210,
'expanded_url': 'https://twitter.com/Terry81987010/status/1253052684489437184/photo/1',
'display_url': 'pic.twitter.com/H4NpN5ZMkW'}]},
'lang': 'en',
'author_id': '1091471553437593605',
'reply_settings': 'everyone',
'conversation_id': '1253050942716551168',
'created_at': '2020-04-22T20:06:55.000Z',
'possibly_sensitive': False,
'referenced_tweets': [{'type': 'replied_to', 'id': '1253050942716551168'}],
'public_metrics': {'retweet_count': 208,
'reply_count': 57,
'like_count': 402,
'quote_count': 38},
'source': 'Twitter Web App',
'text': "@DineshDSouza Here's some proof of artificially inflating the cv deaths. Noone is dying of pneumonia anymore according to the CDC. And of course NY getting paid for each cv death $60,000",
'context_annotations': [{'domain': {'id': '10',
'name': 'Person',
'description': 'Named people in the world like Nelson Mandela'},
'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}},
{'domain': {'id': '35',
'name': 'Politician',
'description': 'Politicians in the world, like Joe Biden'},
'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}}],
'author': {'url': '',
'username': 'Terry81987010',
'location': 'Florida',
'entities': {'description': {'hashtags': [{'start': 29,
'end': 32,
'tag': '2A'}]}},
'created_at': '2019-02-01T23:01:11.000Z',
'protected': False,
'public_metrics': {'followers_count': 520,
'following_count': 567,
'tweet_count': 34376,
'listed_count': 1},
'name': "Terry's Take",
'verified': False,
'id': '1091471553437593605',
'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
'in_reply_to_user': {'username': 'DineshDSouza',
'location': 'United States',
'entities': {'url': {'urls': [{'start': 0,
'end': 23,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
'description': {'urls': [{'start': 80,
'end': 103,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
'created_at': '2009-11-22T22:32:41.000Z',
'protected': False,
'public_metrics': {'followers_count': 1748832,
'following_count': 5355,
'tweet_count': 65674,
'listed_count': 6966},
'name': "Dinesh D'Souza",
'verified': True,
'pinned_tweet_id': '1393309917239562241',
'id': '91882544',
'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe: ",
'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}}],
'public_metrics': {'retweet_count': 208,
'reply_count': 0,
'like_count': 0,
'quote_count': 0},
'source': 'Twitter for iPhone',
'text': "RT @Terry81987010: @DineshDSouza Here's some proof of artificially inflating the cv deaths. Noone is dying of pneumonia anymore according t…",
'context_annotations': [{'domain': {'id': '10',
'name': 'Person',
'description': 'Named people in the world like Nelson Mandela'},
'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}},
{'domain': {'id': '35',
'name': 'Politician',
'description': 'Politicians in the world, like Joe Biden'},
'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}}],
'author': {'url': '',
'username': 'set1952',
'location': 'Etats-Unis',
'created_at': '2018-12-23T23:14:42.000Z',
'protected': False,
'public_metrics': {'followers_count': 103,
'following_count': 44,
'tweet_count': 44803,
'listed_count': 0},
'name': 'SunSet1952',
'verified': False,
'id': '1076979440372965377',
'description': '',
'profile_image_url': 'https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png'},
'__twarc': {'url': 'https://api.twitter.com/2/tweets/search/all?expansions=author_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id%2Centities.mentions.username%2Cattachments.poll_ids%2Cattachments.media_keys%2Cgeo.place_id&user.fields=created_at%2Cdescription%2Centities%2Cid%2Clocation%2Cname%2Cpinned_tweet_id%2Cprofile_image_url%2Cprotected%2Cpublic_metrics%2Curl%2Cusername%2Cverified%2Cwithheld&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Ctext%2Cpossibly_sensitive%2Creferenced_tweets%2Creply_settings%2Csource%2Cwithheld&media.fields=duration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics&poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&place.fields=contained_within%2Ccountry%2Ccountry_code%2Cfull_name%2Cgeo%2Cid%2Cname%2Cplace_type&max_results=500&query=retweets_of%3ATerry81987010&start_time=2020-03-09T00%3A00%3A00%2B00%3A00&end_time=2020-04-24T00%3A00%3A00%2B00%3A00',
'version': '2.0.8',
'retrieved_at': '2021-05-17T17:13:17+00:00'}},
Here is my code:
retweets = []
with open('Data/usersRetweetsFlatten_sample.json', 'r') as f:
    for line in f:
        retweets.append(json.loads(line))

df = json_normalize(
    retweets, 'referenced_tweets', ['referenced_tweets', 'type'],
    meta_prefix='.',
    errors='ignore'
)
df[['author_id', 'type', '.type', 'id', 'in_reply_to_user_id', 'referenced_tweets']].head()
Here is the resulting dataframe:
As you can see, the column referenced_tweets is not flattened yet (note that there are two different referenced_tweets arrays in my JSON file: one is nested at a deeper level inside the other referenced_tweets). For example, the one at the higher level returns this:
>>> retweets[0]["referenced_tweets"][0]["type"]
"retweeted"
and the one at the deeper level returns this:
>>> retweets[0]["referenced_tweets"][0]["referenced_tweets"][0]["type"]
'replied_to'
QUESTION: I was wondering how I can flatten the deeper referenced_tweets. I want to have two separate columns as referenced_tweets.type and referenced_tweets.id, where the value of the column referenced_tweets.type in the above example should be replied_to.
I think the issue here is that your data is doubly nested: there is a key referenced_tweets within referenced_tweets.
import json
from pandas import json_normalize

with open("flatten.json", "r") as file:
    data = json.load(file)

df = json_normalize(
    data,
    record_path=["referenced_tweets", "referenced_tweets"],
    meta=[
        "author_id",
        # ["author", "username"],  # not possible
        # "author",  # possible but not useful
        ["referenced_tweets", "id"],
        ["referenced_tweets", "type"],
        ["referenced_tweets", "in_reply_to_user_id"],
        ["referenced_tweets", "in_reply_to_user", "username"],
    ],
)
print(df)
See also: https://stackoverflow.com/a/37668569/42659
Note: the above code will fail if the second nested referenced_tweets is missing.
Edit: Alternatively, you could further normalize your data (which the code in your question already partly normalizes) with an additional manual iteration. See the example below. Note: the code is not optimized and may be slow depending on the amount of data.
# load your `data` with `json.load()` or `json.loads()`
import pandas as pd
from pandas import json_normalize

df = json_normalize(
    data,
    record_path="referenced_tweets",
    meta=["referenced_tweets", "type"],
    meta_prefix=".",
    errors="ignore",
)

columns = [*df.columns, "_type", "_id"]
normalized_data = []

def append(row, type, id):
    normalized_row = [*row.to_list(), type, id]
    normalized_data.append(normalized_row)

for _, row in df.iterrows():
    # a list/array is expected
    if isinstance(row["referenced_tweets"], list):
        for tweet in row["referenced_tweets"]:
            append(row, tweet["type"], tweet["id"])
        if not row["referenced_tweets"]:  # empty list
            append(row, None, None)
    else:  # NaN when the API omitted referenced_tweets entirely
        append(row, None, None)

enhanced_df = pd.DataFrame(data=normalized_data, columns=columns)
enhanced_df = enhanced_df.drop(columns="referenced_tweets")
print(enhanced_df)
Edit 2: referenced_tweets should be an array. However, if there is no referenced tweet, the Twitter API seems to omit referenced_tweets completely. In that case, the cell value is NaN (a float) instead of an empty list. I updated the code above to take that into account.
I'm using tweepy for streaming and json.loads to get the data, and I saved it to .txt files.
def on_data(self, data):
    all_data = json.loads(data)
    save_file.write(str(all_data) + "\n")
Now I want to extract several properties from the data, but the problem is that when I use ast.literal_eval() to work around the quotes and comma errors, I get another error.
Traceback (most recent call last):
File "C:\Users\RandomScientist\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-27-ffea75fc7446>", line 3, in <module>
data = ast.literal_eval(data)
File "C:\Users\RandomScientist\Anaconda3\lib\ast.py", line 48, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "C:\Users\RandomScientist\Anaconda3\lib\ast.py", line 35, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "<unknown>", line 2
{'created_at': 'Thu Apr 04 07:00:10 +0000 2019', 'id': 1113697753530392577, 'id_str': '1113697753530392577', 'text': 'Karena kita adalah suratan terbuka kasih-Nya untuk dunia \n#iamthemessenjah (link)', 'source': 'Facebook', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 234355404, 'id_str': '234355404', 'name': 'Messenjah Clothing', 'screen_name': 'MessenjahCloth', 'location': 'YOGYAKARTA', 'url': 'http://www.messenjahclothing.com', 'description': 'THE WORLD CHANGER pages : http://www.facebook.com/messenjahclothingdotcom pin: 578CD443 WA: +6285 727 386 267 IG: #the_messenjah IG product #messenjahstore', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 3405, 'friends_count': 190, 'listed_count': 7, 'favourites_count': 204, 'statuses_count': 58765, 'created_at': 'Wed Jan 05 13:13:10 +0000 2011', 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': '7E808A', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme3/bg.gif', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme3/bg.gif', 'profile_background_tile': False, 'profile_link_color': '0400DB', 'profile_sidebar_border_color': '000000', 'profile_sidebar_fill_color': '252429', 'profile_text_color': '666666', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/882803510932156417/KenYVq-i_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/882803510932156417/KenYVq-i_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/234355404/1523949182', 'default_profile': False, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': 
None, 'place': None, 'contributors': None, 'is_quote_status': False, 'quote_count': 0, 'reply_count': 0, 'retweet_count': 0, 'favorite_count': 0, 'entities': {'hashtags': [{'text': 'iamthemessenjah', 'indices': [60, 76]}], 'urls': [{'url': 'https: (link)', 'expanded_url': 'https://www.facebook.com/messenjahclothingdotcom/posts/2575823355778752', 'display_url': 'facebook.com/messenjahcloth…', 'indices': [77, 100]}], 'user_mentions': [], 'symbols': []}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'in', 'timestamp_ms': '1554361210602'}
^
SyntaxError: invalid syntax
Here's my code
with open('pre-process.txt', 'r') as file:
    data = file.read()
data = ast.literal_eval(data)
print(data)
And I've read several answers, like Python request using ast.literal_eval error Invalid syntax? and Python ast.literal_eval on dictionary string not working (SyntaxError: invalid syntax), but didn't find a suitable solution. Any ideas? Thanks in advance.
It appears that you have one value per line in the file, so you need to read it a line at a time and call ast.literal_eval() on each line, rather than trying to evaluate the entire file at once.
with open('pre-process.txt', 'r') as file:
    for line in file:
        data = ast.literal_eval(line)
        print(data)
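A note on the root cause, in case you control the writing side: str(all_data) writes the Python repr of the dict (single quotes, None, True/False), which is not valid JSON; writing with json.dumps instead lets you read each line back with json.loads and skip ast.literal_eval entirely. A minimal sketch (the record here is made up):

```python
import json

record = {'id': 1113697753530392577, 'text': 'example tweet', 'geo': None}

bad_line = str(record)          # "{'id': ..., 'geo': None}" -- Python repr, not JSON
good_line = json.dumps(record)  # '{"id": ..., "geo": null}' -- valid JSON

# the JSON text round-trips cleanly; the repr would need ast.literal_eval
assert json.loads(good_line) == record
```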
{'contributors': None,
'coordinates': None,
'created_at': 'Tue Aug 02 19:51:58 +0000 2016',
'entities': {'hashtags': [],
'symbols': [],
'urls': [],
'user_mentions': [{'id': 873491544,
'id_str': '873491544',
'indices': [0, 13],
'name': 'Kenel M',
'screen_name': 'KxSweaters13'}]},
'favorite_count': 1,
'favorited': False,
'geo': None,
'id': 760563814450491392,
'id_str': '760563814450491392',
'in_reply_to_screen_name': 'KxSweaters13',
'in_reply_to_status_id': None,
'in_reply_to_status_id_str': None,
'in_reply_to_user_id': 873491544,
'in_reply_to_user_id_str': '873491544',
'is_quote_status': False,
'lang': 'en',
'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
'place': {'attributes': {},
'bounding_box': {'coordinates': [[[-71.813501, 42.4762],
[-71.702186, 42.4762],
[-71.702186, 42.573956],
[-71.813501, 42.573956]]],
'type': 'Polygon'},
'contained_within': [],
'country': 'Australia',
'country_code': 'AUS',
'full_name': 'Melbourne, V',
'id': 'c4f1830ea4b8caaf',
'name': 'Melbourne',
'place_type': 'city',
'url': 'https://api.twitter.com/1.1/geo/id/c4f1830ea4b8caaf.json'},
'retweet_count': 0,
'retweeted': False,
'source': 'Twitter for Android',
'text': '#KxSweaters13 are you the kenelx13 I see owning leominster for team valor?',
'truncated': False,
'user': {'contributors_enabled': False,
'created_at': 'Thu Apr 21 17:09:52 +0000 2011',
'default_profile': False,
'default_profile_image': False,
'description': "Arbys when it's cold. Kimballs when it's warm. #Ally__09 all year. Comp sci classes sometimes.",
'entities': {'description': {'urls': []}},
'favourites_count': 1106,
'follow_request_sent': None,
'followers_count': 167,
'following': None,
'friends_count': 171,
'geo_enabled': True,
'has_extended_profile': False,
'id': 285715182,
'id_str': '285715182',
'is_translation_enabled': False,
'is_translator': False,
'lang': 'en',
'listed_count': 2,
'location': 'MA',
'name': 'Steve',
'notifications': None,
'profile_background_color': '131516',
'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme14/bg.gif',
'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme14/bg.gif',
'profile_background_tile': True,
'profile_banner_url': 'https://pbs.twimg.com/profile_banners/285715182/1462218226',
'profile_image_url': 'http://pbs.twimg.com/profile_images/727223698332200961/bGPjGjHK_normal.jpg',
'profile_image_url_https': 'https://pbs.twimg.com/profile_images/727223698332200961/bGPjGjHK_normal.jpg',
'profile_link_color': '4A913C',
'profile_sidebar_border_color': 'FFFFFF',
'profile_sidebar_fill_color': 'EFEFEF',
'profile_text_color': '333333',
'profile_use_background_image': True,
'protected': False,
'screen_name': 'StephenBurke_',
'statuses_count': 5913,
'time_zone': 'Eastern Time (US & Canada)',
'url': None,
'utc_offset': -14400,
'verified': False}}
I have a JSON file which contains a list of JSON objects (each with a structure like the one above).
So I read it into a dataframe:
df = pd.read_json('data.json')
and then I try to get all the rows of the 'city' place type:
df = df[df['place']['place_type'] == 'city']
but then I got 'TypeError: an integer is required', and during handling of that exception another exception occurred: KeyError: 'place_type'.
Then I tried:
df['place'].head(3)
=>
0 {'id': '01864a8a64df9dc4', 'url': 'https://api...
1 {'id': '01864a8a64df9dc4', 'url': 'https://api...
2 {'id': '0118c71c0ed41109', 'url': 'https://api...
Name: place, dtype: object
So df['place'] returns a Series where the keys are the indexes, and that's why I got the TypeError.
I've also tried to select the place_type of the first row and it works just fine:
df.iloc[0]['place']['place_type']
=>
city
The question is how can I filter out the rows in this case?
Solution:
Okay, so the problem lies in the fact that pd.read_json cannot deal with a nested JSON structure, so what I did was normalize the JSON objects:
import json
import pandas as pd

with open('data.json') as jsonfile:
    data = json.load(jsonfile)
df = pd.io.json.json_normalize(data)  # pd.json_normalize in pandas >= 1.0
df = df[df['place.place_type'] == 'city']
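For reference, on pandas ≥ 1.0 the same call is spelled pd.json_normalize. A minimal sketch of the flatten-then-filter idea on toy records (the field values here are made up, not the real tweet payload):

```python
import pandas as pd

# Toy records standing in for the tweets in data.json (hypothetical shape)
data = [
    {"id": 1, "place": {"place_type": "city", "name": "Boston"}},
    {"id": 2, "place": {"place_type": "admin", "name": "MA"}},
]

# json_normalize flattens nested dicts into dotted column names,
# e.g. the nested place dict becomes 'place.place_type' and 'place.name'
df = pd.json_normalize(data)

city_df = df[df["place.place_type"] == "city"]
print(city_df["id"].tolist())  # [1]
```

Because the nested keys become ordinary columns, the usual boolean-mask filtering works without touching the dicts directly.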
You can use a list comprehension to do the filtering you need. Iterate over the parsed JSON list (iterating over a DataFrame only yields column names):
data = [loc for loc in data if loc['place'] and loc['place']['place_type'] == 'city']
This will give you a list of the objects whose place_type is 'city'.
If you want to stay in the DataFrame instead of the raw list:
"and then I try to get all the rows which are the city type by:"
Since the place column holds dicts (or None), compare inside each one with apply:
df = df[df['place'].apply(lambda p: bool(p) and p.get('place_type') == 'city')]
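A small self-contained sketch of filtering a column of dicts this way (toy data, not the Twitter payload):

```python
import pandas as pd

# Hypothetical frame whose 'place' column holds nested dicts (or None)
df = pd.DataFrame({
    "text": ["a", "b", "c"],
    "place": [{"place_type": "city"}, None, {"place_type": "admin"}],
})

# Compare inside each dict; bool(p) guards against None entries
mask = df["place"].apply(lambda p: bool(p) and p.get("place_type") == "city")
cities = df[mask]
print(cities["text"].tolist())  # ['a']
```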
I'm trying to do some simple JSON parsing using Python 3's built-in JSON module, and from reading a bunch of other questions on SO and googling, it seems this is supposed to be pretty straightforward. However, I think I'm getting a string returned instead of the expected dictionary.
Firstly, here is the JSON I am trying to get values from. It's just some output from Twitter's API
[{'in_reply_to_status_id_str': None, 'in_reply_to_screen_name': None, 'retweeted': False, 'in_reply_to_status_id': None, 'contributors': None, 'favorite_count': 0, 'in_reply_to_user_id': None, 'coordinates': None, 'source': 'Twitter Web Client', 'geo': None, 'retweet_count': 0, 'text': 'Tweeting a url \nhttp://t.co/QDVYv6bV90', 'created_at': 'Mon Sep 01 19:36:25 +0000 2014', 'entities': {'symbols': [], 'user_mentions': [], 'urls': [{'expanded_url': 'http://www.isthereanappthat.com', 'display_url': 'isthereanappthat.com', 'url': 'http://t.co/QDVYv6bV90', 'indices': [16, 38]}], 'hashtags': []}, 'id_str': '506526005943865344', 'in_reply_to_user_id_str': None, 'truncated': False, 'favorited': False, 'lang': 'en', 'possibly_sensitive': False, 'id': 506526005943865344, 'user': {'profile_text_color': '333333', 'time_zone': None, 'entities': {'description': {'urls': []}}, 'url': None, 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'protected': False, 'default_profile_image': True, 'utc_offset': None, 'default_profile': True, 'screen_name': 'KickzWatch', 'follow_request_sent': False, 'following': False, 'profile_background_color': 'C0DEED', 'notifications': False, 'description': '', 'profile_sidebar_border_color': 'C0DEED', 'geo_enabled': False, 'verified': False, 'friends_count': 40, 'created_at': 'Mon Sep 01 16:29:18 +0000 2014', 'is_translator': False, 'profile_sidebar_fill_color': 'DDEEF6', 'statuses_count': 4, 'location': '', 'id_str': '2784389341', 'followers_count': 4, 'favourites_count': 0, 'contributors_enabled': False, 'is_translation_enabled': False, 'lang': 'en', 'profile_image_url': 'http://abs.twimg.com/sticky/default_profile_images/default_profile_6_normal.png', 'profile_image_url_https': 'https://abs.twimg.com/sticky/default_profile_images/default_profile_6_normal.png', 'id': 2784389341, 'profile_use_background_image': True, 
'listed_count': 0, 'profile_background_tile': False, 'name': 'Maktub Destiny', 'profile_link_color': '0084B4'}, 'place': None}]
I assigned this string to a variable named json_string like so:
json_string = json.dumps(output)
jason = json.loads(json_string)
Then, when I try to get a specific key from the "jason" dictionary:
print(jason['hashtags'])
I'm getting an error:
TypeError: string indices must be integers
I want to be able to convert the json output to a dictionary, then use jason[key_name] call to get values using specified keys. Is there something obvious that I'm missing here?
This is my first time working with Python, after coming from Java. I absolutely love the language and think it's very powerful. So, any help on this would be greatly appreciated!
OK, first you should pretty-print your object so that you can read it:
>>> from pprint import pprint
>>> output = [{'in_reply_to_status_id_str': None, 'in_reply_to_screen_name': None, 'retweeted': False, 'in_reply_to_status_id': None, 'contributors': None, 'favorite_count': 0, 'in_reply_to_user_id': None, 'coordinates': None, 'source': 'Twitter Web Client', 'geo': None, 'retweet_count': 0, 'text': 'Tweeting a url \nhttp://t.co/QDVYv6bV90', 'created_at': 'Mon Sep 01 19:36:25 +0000 2014', 'entities': {'symbols': [], 'user_mentions': [], 'urls': [{'expanded_url': 'http://www.isthereanappthat.com', 'display_url': 'isthereanappthat.com', 'url': 'http://t.co/QDVYv6bV90', 'indices': [16, 38]}], 'hashtags': []}, 'id_str': '506526005943865344', 'in_reply_to_user_id_str': None, 'truncated': False, 'favorited': False, 'lang': 'en', 'possibly_sensitive': False, 'id': 506526005943865344, 'user': {'profile_text_color': '333333', 'time_zone': None, 'entities': {'description': {'urls': []}}, 'url': None, 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'protected': False, 'default_profile_image': True, 'utc_offset': None, 'default_profile': True, 'screen_name': 'KickzWatch', 'follow_request_sent': False, 'following': False, 'profile_background_color': 'C0DEED', 'notifications': False, 'description': '', 'profile_sidebar_border_color': 'C0DEED', 'geo_enabled': False, 'verified': False, 'friends_count': 40, 'created_at': 'Mon Sep 01 16:29:18 +0000 2014', 'is_translator': False, 'profile_sidebar_fill_color': 'DDEEF6', 'statuses_count': 4, 'location': '', 'id_str': '2784389341', 'followers_count': 4, 'favourites_count': 0, 'contributors_enabled': False, 'is_translation_enabled': False, 'lang': 'en', 'profile_image_url': 'http://abs.twimg.com/sticky/default_profile_images/default_profile_6_normal.png', 'profile_image_url_https': 'https://abs.twimg.com/sticky/default_profile_images/default_profile_6_normal.png', 'id': 2784389341, 'profile_use_background_image': 
True, 'listed_count': 0, 'profile_background_tile': False, 'name': 'Maktub Destiny', 'profile_link_color': '0084B4'}, 'place': None}]
>>> pprint(output)
[{'contributors': None,
'coordinates': None,
'created_at': 'Mon Sep 01 19:36:25 +0000 2014',
'entities': {'hashtags': [],
'symbols': [],
'urls': [{'display_url': 'isthereanappthat.com',
'expanded_url': 'http://www.isthereanappthat.com',
'indices': [16, 38],
'url': 'http://t.co/QDVYv6bV90'}],
'user_mentions': []},
'favorite_count': 0,
'favorited': False,
'geo': None,
'id': 506526005943865344,
'id_str': '506526005943865344',
'in_reply_to_screen_name': None,
'in_reply_to_status_id': None,
'in_reply_to_status_id_str': None,
'in_reply_to_user_id': None,
'in_reply_to_user_id_str': None,
'lang': 'en',
'place': None,
'possibly_sensitive': False,
'retweet_count': 0,
'retweeted': False,
'source': 'Twitter Web Client',
'text': 'Tweeting a url \nhttp://t.co/QDVYv6bV90',
'truncated': False,
'user': {'contributors_enabled': False,
'created_at': 'Mon Sep 01 16:29:18 +0000 2014',
'default_profile': True,
'default_profile_image': True,
'description': '',
'entities': {'description': {'urls': []}},
'favourites_count': 0,
'follow_request_sent': False,
'followers_count': 4,
'following': False,
'friends_count': 40,
'geo_enabled': False,
'id': 2784389341,
'id_str': '2784389341',
'is_translation_enabled': False,
'is_translator': False,
'lang': 'en',
'listed_count': 0,
'location': '',
'name': 'Maktub Destiny',
'notifications': False,
'profile_background_color': 'C0DEED',
'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
'profile_background_tile': False,
'profile_image_url': 'http://abs.twimg.com/sticky/default_profile_images/default_profile_6_normal.png',
'profile_image_url_https': 'https://abs.twimg.com/sticky/default_profile_images/default_profile_6_normal.png',
'profile_link_color': '0084B4',
'profile_sidebar_border_color': 'C0DEED',
'profile_sidebar_fill_color': 'DDEEF6',
'profile_text_color': '333333',
'profile_use_background_image': True,
'protected': False,
'screen_name': 'KickzWatch',
'statuses_count': 4,
'time_zone': None,
'url': None,
'utc_offset': None,
'verified': False}}]
From looking at this you can see that output is a list which contains a single dict. To access this you need:
>>> first_elem = output[0]
You will also see that the hashtags key in the first_elem is contained in a second level dict under the key entities:
>>> entities = first_elem['entities']
>>> pprint(entities)
{'hashtags': [],
'symbols': [],
'urls': [{'display_url': 'isthereanappthat.com',
'expanded_url': 'http://www.isthereanappthat.com',
'indices': [16, 38],
'url': 'http://t.co/QDVYv6bV90'}],
'user_mentions': []}
Now you are able to access hashtags:
>>> entities['hashtags']
[]
Which just happens to be the empty list.
To convert to JSON, note the comment:
>>> import json
>>> # Make sure output is the list object not a string representing the object
>>> json_string = json.dumps(output)
>>> jason = json.loads(json_string)
>>> jason[0]['entities']['hashtags']
[]
I think your problem is that you made output a string before you passed it to json.dumps, meaning that json.loads will return a string, not a JSON object.
And @Dan's answer is correct, this is not valid JSON. It is however a valid Python dict, and I'm assuming that you got it from Twitter using Python and then printed it.
I did json.loads(json.loads(string)) and was able to get the dictionary. You can check it out: the first loads doesn't just return the same string, it decodes the outer layer (e.g. removes the \" escapes).
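The double-loads trick only works when the text was JSON-encoded twice, so the first loads merely unwraps the outer string. A minimal sketch with a made-up payload:

```python
import json

payload = {"text": "hello", "id": 1}

# Encoding twice wraps the JSON text inside another JSON string,
# which is where the escaped \" characters come from
double = json.dumps(json.dumps(payload))

once = json.loads(double)  # still a str: '{"text": "hello", "id": 1}'
obj = json.loads(once)     # now a dict
print(obj["text"])  # hello
```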
First off, your JSON example is not valid JSON; the Twitter API would not output this, because it would break every conforming JSON consumer.
jsonlint shows the first, obvious syntax error: single-quoted rather than double-quoted strings.
Secondly, you have None where JSON requires null, False instead of false, and True instead of true.
Your alleged "JSON" example appears to have been pre-decoded into Python :). When I use a snippet of real JSON, it works exactly as expected:
import json
json_string = r"""
[{"actual_json_key":"actual_json_value"}]
"""
jason = json.loads(json_string)
print(jason[0]["actual_json_key"])
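If the file really contains Python repr output (single quotes, None, True/False) rather than JSON, the standard library's ast.literal_eval can parse it safely. A minimal sketch on a shortened version of the literal above:

```python
import ast
import json

# A Python-literal dump like the one in the question -- not valid JSON
text = "[{'retweeted': False, 'place': None, 'favorite_count': 0}]"

# literal_eval parses Python literals only; it never executes arbitrary code
records = ast.literal_eval(text)
print(records[0]["favorite_count"])  # 0

# Re-serialize as real JSON if other tools need it
print(json.dumps(records))  # [{"retweeted": false, "place": null, "favorite_count": 0}]
```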