Using json_normalize to flatten nested json - python

I'm trying to flatten a JSON file using json_normalize in Python (pandas), but being new to this I always seem to end up with a KeyError.
What I would like to achieve is a DataFrame with all the plays in a game.
I've tried numerous variants of paths and prefixes, but with no success. I've googled a lot as well, but I'm still falling short.
What I would like to end up with is a DataFrame like:
period, time, type, player1, player2, xcord, ycord
import pandas as pd
import json
from pandas.io.json import json_normalize

with open('PlayByPlay.json') as data_file:
    data = json.load(data_file)

records = json_normalize(data)
plays = records['data.game.plays.play'][0]
plays
Would generate
{'aoi': [8470324, 8473449, 8475158, 8475215, 8477499, 8477933],
'apb': [],
'as': 0,
'asog': 0,
'desc': 'Zack Kassian hit Kyle Okposo',
'eventid': 7,
'formalEventId': 'EDM7',
'hoi': [8471678, 8475178, 8475660, 8476454, 8476457, 8476472],
'hpb': [],
'hs': 0,
'hsog': 0,
'localtime': '5:12 PM',
'p1name': 'Zack Kassian',
'p2name': 'Kyle Okposo',
'p3name': '',
'period': 1,
'pid': 8475178,
'pid1': 8475178,
'pid2': 8473449,
'pid3': '',
'playername': 'Zack Kassian',
'strength': 701,
'sweater': '44',
'teamid': 22,
'time': '00:28',
'type': 'Hit',
'xcoord': 22,
'ycoord': 38}
JSON
{'data': {'game': {'awayteamid': 7,
'awayteamname': 'Buffalo Sabres',
'awayteamnick': 'Sabres',
'hometeamid': 22,
'hometeamname': 'Edmonton Oilers',
'hometeamnick': 'Oilers',
'plays': {'play': [{'aoi': [8470324,
8473449,
8475158,
8475215,
8477499,
8477933],
'apb': [],
'as': 0,
'asog': 0,
'desc': 'Zack Kassian hit Kyle Okposo',
'eventid': 7,
'formalEventId': 'EDM7',
'hoi': [8471678, 8475178, 8475660, 8476454, 8476457, 8476472],
'hpb': [],
'hs': 0,
'hsog': 0,
'localtime': '5:12 PM',
'p1name': 'Zack Kassian',
'p2name': 'Kyle Okposo',
'p3name': '',
'period': 1,
'pid': 8475178,
'pid1': 8475178,
'pid2': 8473449,
'pid3': '',
'playername': 'Zack Kassian',
'strength': 701,
'sweater': '44',
'teamid': 22,
'time': '00:28',
'type': 'Hit',
'xcoord': 22,
'ycoord': 38},
{'aoi': [8471742, 8475179, 8475215, 8475220, 8475235, 8475728],
'apb': [],
'as': 0,
'asog': 0,
'desc': 'Jesse Puljujarvi Tip-In saved by Robin Lehner',
'eventid': 59,
'formalEventId': 'EDM59',
'hoi': [8473468, 8474034, 8475660, 8477498, 8477934, 8479344],
'hpb': [],
'hs': 0,
'hsog': 1,
'localtime': '5:13 PM',
'p1name': 'Jesse Puljujarvi',
'p2name': 'Robin Lehner',
'p3name': '',
'period': 1,
'pid': 8479344,
'pid1': 8479344,
'pid2': 8475215,
'pid3': '',
'playername': 'Jesse Puljujarvi',
'strength': 701,
'sweater': '98',
'teamid': 22,
'time': '01:32',
'type': 'Shot',
'xcoord': 81,
'ycoord': 3}]}},
'refreshInterval': 0}}

If you have only one game, this will create the DataFrame you want:
json_normalize(data['data']['game']['plays']['play'])
Then you just need to extract the columns you're interested in.
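For instance, a sketch using the field names from the sample play above (the JSON's p1name/p2name and xcoord/ycoord map onto the desired player1/player2 and xcord/ycord columns):
plays_df = json_normalize(data['data']['game']['plays']['play'])
# keep only the fields of interest, then rename to the target schema
plays_df = plays_df[['period', 'time', 'type', 'p1name', 'p2name', 'xcoord', 'ycoord']]
plays_df.columns = ['period', 'time', 'type', 'player1', 'player2', 'xcord', 'ycord']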

It might be unintuitive to use this API when the structure becomes complicated, but the key is: json_normalize extracts JSON fields into a table.
In my case I have a table with a single column:
----------
| fact |   // each row is a JSON object {'a': a, 'b': b, ...}
----------
frames = []
for index, row in data.iterrows():
    # flatten the JSON object stored in this row's 'fact' column
    frames.append(json_normalize(row['fact']))
result = pd.concat(frames)
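If each cell in fact already holds a parsed dict rather than a JSON string (an assumption), the loop can likely be replaced with a single call:
result = json_normalize(data['fact'].tolist())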


Trying to open a .bson file and read it into a pandas DataFrame, but getting 'bson.errors.InvalidBSON: objsize too large' (first time using .bson)

# This is my code
import pandas as pd
import bson

FILE = "users_(1).bson"

with open(FILE, 'rb') as f:
    data = bson.decode_all(f.read())

main_df = pd.DataFrame(data)
main_df.describe()
# This is my .bson file
[{'_id': ObjectId('999f24f260f653401b'),
'isV2': False,
'isBeingMigratedToV2': False,
'firstName': 'Jezz',
'lastName': 'Bezos',
'subscription': {'_id': ObjectId('999f24f260f653401b'),
'chargebeeId': 'AzZdd6T847kHQ',
'currencyCode': 'EUR',
'customerId': 'AzZdd6T847kHQ',
'nextBillingAt': datetime.datetime(2022, 7, 7, 10, 14, 6),
'numberOfMonthsPaid': 1,
'planId': 'booster-v3-eur',
'startedAt': datetime.datetime(2022, 6, 7, 10, 14, 6),
'addons': [],
'campaign': None,
'maskedCardNumber': '************1234'},
'email': 'jeffbezos#gmail.com',
'groupName': None,
'username': 'jeffbezy',
'country': 'DE'},
{'_id': ObjectId('999f242660f653401b'),
'isV2': False,
'isBeingMigratedToV2': False,
'firstName': 'Caterina',
'lastName': 'Fake',
'subscription': {'_id': ObjectId('999f242660f653401b'),
'chargebeeId': '16CGLYT846t99',
'currencyCode': 'GBP',
'customerId': '16CGLYT846t99',
'nextBillingAt': datetime.datetime(2022, 7, 7, 10, 10, 41),
'numberOfMonthsPaid': 1,
'planId': 'personal-v3-gbp',
'startedAt': datetime.datetime(2022, 6, 7, 10, 10, 41),
'addons': [],
'campaign': None,
'maskedCardNumber': '************4311'},
'email': 'caty.fake#gmail.com',
'groupName': None,
'username': 'cfake',
'country': 'GB'}]
I get the error
'bson.errors.InvalidBSON: objsize too large'
Is it something to do with the datetime? Is it the structure of the .bson file? I've been at this for hours and can't seem to see the error. I know how to work with JSON and tried converting the data to JSON, but with no success. Any tips would be appreciated.
If the main goal here is to read the data into a pandas DataFrame, you could indeed reformat the data as JSON and use bson.json_util.loads:
import pandas as pd
from bson.json_util import loads

with open(filepath, 'r') as f:
    data = f.read()

mapper = {
    '\'': '"',  # use double quotes
    'False': 'false',
    'None': '"None"',  # double quotes around None
    # turn the ObjectIds and timestamps into quoted strings
    '("': '(',
    '")': ')',
    ')': ')"',
    'ObjectId': '"ObjectId',
    'datetime.datetime': '"datetime.datetime'
}
for k, v in mapper.items():
    data = data.replace(k, v)

data = loads(data)
df = pd.DataFrame(data)
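If the file on disk really is valid BSON (rather than the printed representation shown above), another option worth trying is an untested sketch with bson.decode_file_iter from the same package, which decodes one document at a time and makes a single oversized or corrupt document easier to isolate:
import bson  # comes with the pymongo package
import pandas as pd

FILE = "users_(1).bson"
records = []
with open(FILE, 'rb') as f:
    # yields documents one by one instead of decoding
    # the whole buffer at once like decode_all
    for doc in bson.decode_file_iter(f):
        records.append(doc)

main_df = pd.DataFrame(records)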

Flattening deeply nested JSON into pandas data frame

I am trying to import a deeply nested JSON file into a pandas DataFrame. Here is the structure of the JSON file (this is only the first record, retweets[:1]):
[{'lang': 'en',
'author_id': '1076979440372965377',
'reply_settings': 'everyone',
'entities': {'mentions': [{'start': 3,
'end': 17,
'username': 'Terry81987010',
'url': '',
'location': 'Florida',
'entities': {'description': {'hashtags': [{'start': 29,
'end': 32,
'tag': '2A'}]}},
'created_at': '2019-02-01T23:01:11.000Z',
'protected': False,
'public_metrics': {'followers_count': 520,
'following_count': 567,
'tweet_count': 34376,
'listed_count': 1},
'name': "Terry's Take",
'verified': False,
'id': '1091471553437593605',
'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
{'start': 19,
'end': 32,
'username': 'DineshDSouza',
'location': 'United States',
'entities': {'url': {'urls': [{'start': 0,
'end': 23,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
'description': {'urls': [{'start': 80,
'end': 103,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
'created_at': '2009-11-22T22:32:41.000Z',
'protected': False,
'public_metrics': {'followers_count': 1748832,
'following_count': 5355,
'tweet_count': 65674,
'listed_count': 6966},
'name': "Dinesh D'Souza",
'verified': True,
'pinned_tweet_id': '1393309917239562241',
'id': '91882544',
'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe: ",
'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}]},
'conversation_id': '1253462541881106433',
'created_at': '2020-04-23T23:15:32.000Z',
'id': '1253462541881106433',
'possibly_sensitive': False,
'referenced_tweets': [{'type': 'retweeted',
'id': '1253052684489437184',
'in_reply_to_user_id': '91882544',
'attachments': {'media_keys': ['3_1253052312144293888',
'3_1253052620937277442'],
'media': [{}, {}]},
'entities': {'annotations': [{'start': 126,
'end': 128,
'probability': 0.514,
'type': 'Organization',
'normalized_text': 'CDC'},
{'start': 145,
'end': 146,
'probability': 0.5139,
'type': 'Place',
'normalized_text': 'NY'}],
'mentions': [{'start': 0,
'end': 13,
'username': 'DineshDSouza',
'location': 'United States',
'entities': {'url': {'urls': [{'start': 0,
'end': 23,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
'description': {'urls': [{'start': 80,
'end': 103,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
'created_at': '2009-11-22T22:32:41.000Z',
'protected': False,
'public_metrics': {'followers_count': 1748832,
'following_count': 5355,
'tweet_count': 65674,
'listed_count': 6966},
'name': "Dinesh D'Souza",
'verified': True,
'pinned_tweet_id': '1393309917239562241',
'id': '91882544',
'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe: ",
'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}],
'urls': [{'start': 187,
'end': 210,
'expanded_url': 'https://twitter.com/Terry81987010/status/1253052684489437184/photo/1',
'display_url': 'pic.twitter.com/H4NpN5ZMkW'},
{'start': 187,
'end': 210,
'expanded_url': 'https://twitter.com/Terry81987010/status/1253052684489437184/photo/1',
'display_url': 'pic.twitter.com/H4NpN5ZMkW'}]},
'lang': 'en',
'author_id': '1091471553437593605',
'reply_settings': 'everyone',
'conversation_id': '1253050942716551168',
'created_at': '2020-04-22T20:06:55.000Z',
'possibly_sensitive': False,
'referenced_tweets': [{'type': 'replied_to', 'id': '1253050942716551168'}],
'public_metrics': {'retweet_count': 208,
'reply_count': 57,
'like_count': 402,
'quote_count': 38},
'source': 'Twitter Web App',
'text': "#DineshDSouza Here's some proof of artificially inflating the cv deaths. Noone is dying of pneumonia anymore according to the CDC. And of course NY getting paid for each cv death $60,000",
'context_annotations': [{'domain': {'id': '10',
'name': 'Person',
'description': 'Named people in the world like Nelson Mandela'},
'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}},
{'domain': {'id': '35',
'name': 'Politician',
'description': 'Politicians in the world, like Joe Biden'},
'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}}],
'author': {'url': '',
'username': 'Terry81987010',
'location': 'Florida',
'entities': {'description': {'hashtags': [{'start': 29,
'end': 32,
'tag': '2A'}]}},
'created_at': '2019-02-01T23:01:11.000Z',
'protected': False,
'public_metrics': {'followers_count': 520,
'following_count': 567,
'tweet_count': 34376,
'listed_count': 1},
'name': "Terry's Take",
'verified': False,
'id': '1091471553437593605',
'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
'in_reply_to_user': {'username': 'DineshDSouza',
'location': 'United States',
'entities': {'url': {'urls': [{'start': 0,
'end': 23,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
'description': {'urls': [{'start': 80,
'end': 103,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
'created_at': '2009-11-22T22:32:41.000Z',
'protected': False,
'public_metrics': {'followers_count': 1748832,
'following_count': 5355,
'tweet_count': 65674,
'listed_count': 6966},
'name': "Dinesh D'Souza",
'verified': True,
'pinned_tweet_id': '1393309917239562241',
'id': '91882544',
'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe: ",
'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}}],
'public_metrics': {'retweet_count': 208,
'reply_count': 0,
'like_count': 0,
'quote_count': 0},
'source': 'Twitter for iPhone',
'text': "RT #Terry81987010: #DineshDSouza Here's some proof of artificially inflating the cv deaths. Noone is dying of pneumonia anymore according t…",
'context_annotations': [{'domain': {'id': '10',
'name': 'Person',
'description': 'Named people in the world like Nelson Mandela'},
'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}},
{'domain': {'id': '35',
'name': 'Politician',
'description': 'Politicians in the world, like Joe Biden'},
'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}}],
'author': {'url': '',
'username': 'set1952',
'location': 'Etats-Unis',
'created_at': '2018-12-23T23:14:42.000Z',
'protected': False,
'public_metrics': {'followers_count': 103,
'following_count': 44,
'tweet_count': 44803,
'listed_count': 0},
'name': 'SunSet1952',
'verified': False,
'id': '1076979440372965377',
'description': '',
'profile_image_url': 'https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png'},
'__twarc': {'url': 'https://api.twitter.com/2/tweets/search/all?expansions=author_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id%2Centities.mentions.username%2Cattachments.poll_ids%2Cattachments.media_keys%2Cgeo.place_id&user.fields=created_at%2Cdescription%2Centities%2Cid%2Clocation%2Cname%2Cpinned_tweet_id%2Cprofile_image_url%2Cprotected%2Cpublic_metrics%2Curl%2Cusername%2Cverified%2Cwithheld&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Ctext%2Cpossibly_sensitive%2Creferenced_tweets%2Creply_settings%2Csource%2Cwithheld&media.fields=duration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics&poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&place.fields=contained_within%2Ccountry%2Ccountry_code%2Cfull_name%2Cgeo%2Cid%2Cname%2Cplace_type&max_results=500&query=retweets_of%3ATerry81987010&start_time=2020-03-09T00%3A00%3A00%2B00%3A00&end_time=2020-04-24T00%3A00%3A00%2B00%3A00',
'version': '2.0.8',
'retrieved_at': '2021-05-17T17:13:17+00:00'}},
Here is my code:
retweets = []
for line in open('Data/usersRetweetsFlatten_sample.json', 'r'):
    retweets.append(json.loads(line))

df = json_normalize(
    retweets, 'referenced_tweets', ['referenced_tweets', 'type'],
    meta_prefix='.',
    errors='ignore'
)
df[['author_id', 'type', '.type', 'id', 'in_reply_to_user_id', 'referenced_tweets']].head()
Here is the resulting DataFrame:
As you can see, the column referenced_tweets is not flattened yet (note that there are two different referenced_tweets arrays in my JSON file: one is at a deeper level inside the other referenced_tweets). For example, the one at the higher level returns this:
>>> retweets[0]["referenced_tweets"][0]["type"]
"retweeted"
and the one at the deeper level returns this:
>>> retweets[0]["referenced_tweets"][0]["referenced_tweets"][0]["type"]
'replied_to'
QUESTION: How can I flatten the deeper referenced_tweets? I want two separate columns, referenced_tweets.type and referenced_tweets.id, where the value of referenced_tweets.type in the above example should be replied_to.
I think the issue here is that your data is double nested... there is a key referenced_tweets within referenced_tweets.
import json
from pandas import json_normalize

with open("flatten.json", "r") as file:
    data = json.load(file)

df = json_normalize(
    data,
    record_path=["referenced_tweets", "referenced_tweets"],
    meta=[
        "author_id",
        # ["author", "username"],  # not possible
        # "author",  # possible but not useful
        ["referenced_tweets", "id"],
        ["referenced_tweets", "type"],
        ["referenced_tweets", "in_reply_to_user_id"],
        ["referenced_tweets", "in_reply_to_user", "username"],
    ]
)
print(df)
See also: https://stackoverflow.com/a/37668569/42659
Note: the above code will fail if the second nested referenced_tweets is missing.
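One possible guard (a sketch of my own, not part of the original answer; the meta arguments are omitted for brevity) is to drop records that lack the second nesting level before normalizing:
safe = [
    r for r in data
    # keep a record only if it has a referenced_tweets list and every
    # entry in that list carries its own nested referenced_tweets
    if isinstance(r.get("referenced_tweets"), list)
    and all("referenced_tweets" in t for t in r["referenced_tweets"])
]
df = json_normalize(safe, record_path=["referenced_tweets", "referenced_tweets"])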
Edit: Alternatively, you could take the data you already partly normalized with the code in your question and flatten it further with an additional manual iteration. See the example below. Note: the code is not optimized and may be slow depending on the amount of data.
import pandas as pd
from pandas import json_normalize

# load your `data` with `json.load()` or `json.loads()`
df = json_normalize(
    data,
    record_path="referenced_tweets",
    meta=["referenced_tweets", "type"],
    meta_prefix=".",
    errors="ignore",
)

columns = [*df.columns, "_type", "_id"]
normalized_data = []

def append_row(row, tweet_type, tweet_id):
    normalized_row = [*row.to_list(), tweet_type, tweet_id]
    normalized_data.append(normalized_row)

for _, row in df.iterrows():
    # a list/array is expected
    if isinstance(row["referenced_tweets"], list):
        if row["referenced_tweets"]:
            for tweet in row["referenced_tweets"]:
                append_row(row, tweet["type"], tweet["id"])
        # if the list is empty
        else:
            append_row(row, None, None)
    # NaN (float) when the key was missing entirely
    else:
        append_row(row, None, None)

enhanced_df = pd.DataFrame(data=normalized_data, columns=columns)
enhanced_df = enhanced_df.drop("referenced_tweets", axis=1)
print(enhanced_df)
Edit 2: referenced_tweets should be an array. However, if there is no referenced tweet, the Twitter API seems to omit referenced_tweets completely. In that case, the cell value is NaN (float) instead of an empty list. I updated the code above to take that into account.
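This is why the loop above checks the cell's type first; after normalization a missing key surfaces as a float NaN, as this small check illustrates:
import math

cell = float("nan")  # what an omitted referenced_tweets becomes
print(isinstance(cell, list))                        # False -> fallback branch
print(isinstance(cell, float) and math.isnan(cell))  # True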

Decoding list containing dictionaries

I need to get certain values out of a list of dictionaries, which looks like this and is assigned to the variable 'segment_values':
[{'distance': 114.6,
'duration': 20.6,
'instruction': 'Head north',
'name': '-',
'type': 11,
'way_points': [0, 5]},
{'distance': 288.1,
'duration': 28.5,
'instruction': 'Turn right onto Knaufstraße',
'name': 'Knaufstraße',
'type': 1,
'way_points': [5, 17]},
{'distance': 3626.0,
'duration': 273.0,
'instruction': 'Turn slight right onto B320',
'name': 'B320',
'type': 5,
'way_points': [17, 115]},
{'distance': 54983.9,
'duration': 2679.3,
'instruction': 'Keep right onto Ennstal-Bundesstraße, B320',
'name': 'Ennstal-Bundesstraße, B320',
'type': 13,
'way_points': [115, 675]},
{'distance': 11065.1,
'duration': 531.1,
'instruction': 'Keep left onto Pyhrn Autobahn, A9',
'name': 'Pyhrn Autobahn, A9',
'type': 12,
'way_points': [675, 780]},
{'distance': 800.7,
'duration': 64.1,
'instruction': 'Keep right',
'name': '-',
'type': 13,
'way_points': [780, 804]},
{'distance': 49.6,
'duration': 4.0,
'instruction': 'Keep left',
'name': '-',
'type': 12,
'way_points': [804, 807]},
{'distance': 102057.2,
'duration': 4915.0,
'instruction': 'Keep right',
'name': '-',
'type': 13,
'way_points': [807, 2104]},
{'distance': 56143.4,
'duration': 2784.5,
'instruction': 'Keep left onto S6',
'name': 'S6',
'type': 12,
'way_points': [2104, 2524]},
{'distance': 7580.6,
'duration': 389.8,
'instruction': 'Keep left',
'name': '-',
'type': 12,
'way_points': [2524, 2641]},
{'distance': 789.0,
'duration': 63.1,
'instruction': 'Keep right',
'name': '-',
'type': 13,
'way_points': [2641, 2663]},
{'distance': 815.9,
'duration': 65.3,
'instruction': 'Keep left',
'name': '-',
'type': 12,
'way_points': [2663, 2684]},
{'distance': 682.9,
'duration': 54.6,
'instruction': 'Turn left onto Heinrich-Drimmel-Platz',
'name': 'Heinrich-Drimmel-Platz',
'type': 0,
'way_points': [2684, 2711]},
{'distance': 988.1,
'duration': 79.0,
'instruction': 'Turn left onto Arsenalstraße',
'name': 'Arsenalstraße',
'type': 0,
'way_points': [2711, 2723]},
{'distance': 11.7,
'duration': 2.1,
'instruction': 'Turn left',
'name': '-',
'type': 0,
'way_points': [2723, 2725]},
{'distance': 0.0,
'duration': 0.0,
'instruction': 'Arrive at your destination, on the left',
'name': '-',
'type': 10,
'way_points': [2725, 2725]}]
I need to get the duration values and the waypoint values out of that code segment.
For the duration I tried:
segment_values = data['features'][0]['properties']['segments'][0]['steps']  # gets me the list above
print(segment_values[0:]['duration'])
Shouldn't this print all the dictionaries, and the value at 'duration' in each of them?
I also tried this:
duration = data['features'][0]['properties']['segments'][0]['steps'][0:]['duration']
print(duration)
Both tries give me "TypeError: list indices must be integers or slices, not str".
Where am I going wrong?
Your data is a list of dictionaries.
For this reason you need to iterate over it in order to access the data.
Try this print statement to look at the data more closely:
for item in data_list:
    print(item)
To access the duration of each item you can use similar code:
for item in data_list:
    print(item['duration'])
You can also use list comprehension to achieve the same result:
duration = [item['duration'] for item in data_list]
List comprehensions are a Pythonic way to obtain the same result; you can read more about them here.
The same principle can be applied twice if a key in your data contains a list or another iterable; here's another example:
for item in data:
    print("\nPrinting waypoints for name: " + item['name'])
    for way_point in item['way_points']:
        print(way_point)
duration = [x['duration'] for x in segment_values]
waypoints =[x['way_points'] for x in segment_values]
You might be thinking of higher-level wrappers like pandas, which would let you do
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(3, 2), index=list('abc'), columns=list('xy'))
>>> df
x y
a -0.192041 -0.312067
b -0.595700 0.339085
c -0.524242 0.946350
>>> df.x
a -0.192041
b -0.595700
c -0.524242
Name: x, dtype: float64
>>> df[0:].x
a -0.192041
b -0.595700
c -0.524242
Name: x, dtype: float64
>>> df[1:].y
b 0.339085
c 0.946350
Name: y, dtype: float64
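Applied to the data in the question, the same idea would look roughly like this (a sketch, assuming segment_values holds the list of dicts shown above):
import pandas as pd

df = pd.DataFrame(segment_values)  # one row per step
durations = df['duration']         # all durations as a Series
waypoints = df['way_points']       # all [start, end] pairs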
Another tool for this is glom, which provides helpers for logic like this (pip install glom).
>>> from glom import glom
>>> from pprint import pprint
>>> data = <your data>
>>> pprint(glom(data, [{'wp': 'way_points', 'dist': 'distance'}]))
[{'dist': 114.6, 'wp': [0, 5]},
{'dist': 288.1, 'wp': [5, 17]},
{'dist': 3626.0, 'wp': [17, 115]},
{'dist': 54983.9, 'wp': [115, 675]},
{'dist': 11065.1, 'wp': [675, 780]},
{'dist': 800.7, 'wp': [780, 804]},
{'dist': 49.6, 'wp': [804, 807]},
{'dist': 102057.2, 'wp': [807, 2104]},
{'dist': 56143.4, 'wp': [2104, 2524]},
{'dist': 7580.6, 'wp': [2524, 2641]},
{'dist': 789.0, 'wp': [2641, 2663]},
{'dist': 815.9, 'wp': [2663, 2684]},
{'dist': 682.9, 'wp': [2684, 2711]},
{'dist': 988.1, 'wp': [2711, 2723]},
{'dist': 11.7, 'wp': [2723, 2725]},
{'dist': 0.0, 'wp': [2725, 2725]}]
You can get a feel for how other cases might work from the documentation:
https://glom.readthedocs.io/en/latest/faq.html#how-does-glom-work
def glom(target, spec):
    # if the spec is a string or a Path, perform a deep-get on the target
    if isinstance(spec, (basestring, Path)):
        return _get_path(target, spec)
    # if the spec is callable, call it on the target
    elif callable(spec):
        return spec(target)
    # if the spec is a dict, assign the result of
    # the glom on the right to the field key on the left
    elif isinstance(spec, dict):
        ret = {}
        for field, subspec in spec.items():
            ret[field] = glom(target, subspec)
        return ret
    # if the spec is a list, run the spec inside the list on every
    # element in the list and return the new list
    elif isinstance(spec, list):
        subspec = spec[0]
        iterator = _get_iterator(target)
        return [glom(t, subspec) for t in iterator]
    # if the spec is a tuple of specs, chain the specs by running the
    # first spec on the target, then running the second spec on the
    # result of the first, and so on.
    elif isinstance(spec, tuple):
        res = target
        for subspec in spec:
            res = glom(res, subspec)
        return res
    else:
        raise TypeError('expected one of the above types')

Write JSON response for each request into a file

I wrote code that makes a request to an API and receives the output in JSON. My question is how to write the output of each request to a file; right now my code only keeps the result of the last request.
import requests
import json

with open("query4.txt", "rt") as file:
    data_file = file.read()

for line in data_file.split("\n"):
    drX, drY, fromX, fromY, dist = line.split(",")
    url = "https://api.openrouteservice.org/directions?"
    params = [
        ["api_key", "my_api_key"],
        ["coordinates", "%s,%s|%s,%s" % (drY, drX, fromY, fromX)],
        ["profile", "driving-car"]
    ]
    headers = {
        "Accept": "application/json, application/geo+json,"
                  "application/gpx+xml, img/png; charset=utf-8"}
    response = requests.get(url=url, params=params, headers=headers)
    # print(response.url)
    # print(response.text)
    result = json.loads(response.text)
    # print(result)
    with open("result.txt", "w+") as f_o:
        for rows in result["routes"]:
            f_o.writelines(json.dumps(rows["summary"]["distance"]))  # depending on how you want the result
    print(result["routes"])
I have an output like this:
{'routes': [{'warnings': [{'code': 1, 'message': 'There may be restrictions on some roads'}], 'summary': {'distance': 899.6, 'duration': 102.1}, 'geometry_format': 'encodedpolyline', 'geometry': 'u~uiHir{iFb#SXzADTlAk#JOJ]#_#CWW{AKo#k#eDEYKo#y#{EGc#G]GYCOa#gCc#iCoBsLNGlAm#VK^Sh#Un#tD', 'segments': [{'distance': 899.6, 'duration': 102.1, 'steps': [{'distance': 22.1, 'duration': 5.3, 'type': 11, 'instruction': 'Head south', 'name': '', 'way_points': [0, 1]}, {'distance': 45.4, 'duration': 10.9, 'type': 1, 'instruction': 'Turn right', 'name': '', 'way_points': [1, 3]}, {'distance': 645.5, 'duration': 52.3, 'type': 0, 'instruction': 'Turn left onto Партизанська вулиця', 'name': 'Партизанська вулиця', 'way_points': [3, 21]}, {'distance': 114.4, 'duration': 20.6, 'type': 1, 'instruction': 'Turn right', 'name': '', 'way_points': [21, 26]}, {'distance': 72.1, 'duration': 13, 'type': 1, 'instruction': 'Turn right', 'name': '', 'way_points': [26, 27]}, {'distance': 0, 'duration': 0, 'type': 10, 'instruction': 'Arrive at your destination, on the left', 'name': '', 'way_points': [27, 27]}]}], 'way_points': [0, 27], 'extras': {'roadaccessrestrictions': {'values': [[0, 1, 0], [1, 3, 2], [3, 27, 0]], 'summary': [{'value': 0, 'distance': 854.2, 'amount': 94.95}, {'value': 2, 'distance': 45.4, 'amount': 5.05}]}}, 'bbox': [38.484536, 48.941171, 38.492904, 48.943022]}], 'bbox': [38.484536, 48.941171, 38.492904, 48.943022], 'info': {'attribution': 'openrouteservice.org | OpenStreetMap contributors', 'engine': {'version': '5.0.1', 'build_date': '2019-05-29T14:22:56Z'}, 'service': 'routing', 'timestamp': 1568280549854, 'query': {'profile': 'driving-car', 'preference': 'fastest', 'coordinates': [[38.485115, 48.942059], [38.492073, 48.941676]], 'language': 'en', 'units': 'm', 'geometry': True, 'geometry_format': 'encodedpolyline', 'instructions_format': 'text', 'instructions': True, 'elevation': False}}}
{'routes': [{'summary': {'distance': 2298, 'duration': 365.6}, 'geometry_format': 'encodedpolyline', 'geometry': 'u~a{Gee`zDLIvBvDpClCtA|AXHXCp#m#bBsBvBmC`AmAtIoKNVLXHPb#c#`A_AFENGzAc#XKZCJ?PDLBH#F?T?PC~CcATOt#Sd#QLKBCBAb#]ZG|#OY_DQ}IE{DC_DAg#Eg#q#aFgBuH^GjBFj#
I tried NeverHopeless's answer, but I got the same result:
result = json.loads(response.text)
i = 0
with open(f"result-{i}.txt", "w+") as f_o:
    i += 1
    for rows in result["routes"]:
        f_o.writelines(json.dumps(rows["summary"]["distance"]))  # depending on how you want the result
    print(result["routes"])
The output now looks like this:
899.622982138.832633191.8
I'm expecting to get this:
2298
2138.8
3263
3191.8
Every value is a distance from a different request, so I need each one on a new line.
You need to open the output file before your loop and keep it open:
import requests
import json

with open("query4.txt", "rt") as file:
    data_file = file.read()

with open("result.txt", "w") as f_o:
    for line in data_file.split("\n"):
        drX, drY, fromX, fromY, dist = line.split(",")
        url = "https://api.openrouteservice.org/directions?"
        params = [
            ["api_key", "my_api_key"],
            ["coordinates", "%s,%s|%s,%s" % (drY, drX, fromY, fromX)],
            ["profile", "driving-car"]
        ]
        headers = {
            "Accept": "application/json, application/geo+json,"
                      "application/gpx+xml, img/png; charset=utf-8"}
        response = requests.get(url=url, params=params, headers=headers)
        # print(response.url)
        # print(response.text)
        result = json.loads(response.text)
        # print(result)
        for rows in result["routes"]:
            print(rows["summary"]["distance"], file=f_o)  # depending on how you want the result
        # print(result["routes"])
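As a side note, requests can decode the JSON body for you, so json.loads(response.text) could be shortened to:
result = response.json()  # equivalent to json.loads(response.text)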
I think it's better to write the results to different files with a timestamp; this way you don't overwrite your older files, and you can find them more easily.
import time

# `result` is the parsed response from inside your request loop
current_time = time.strftime("%m_%d_%y %H_%M_%S", time.localtime())
with open(current_time + ".txt", "w+") as f_o:
    for rows in result["routes"]:
        f_o.writelines(json.dumps(rows["summary"]["distance"]))  # depending on how you want the result
    print(result["routes"])
You need to make this filename "result.txt" dynamic. Currently it is overwriting content.
Perhaps like this:
i = 0  # <----- keep it outside your for loop or it will always be reset to zero
with open(f"result-{i}.txt", "w+") as f_o:
    i += 1
Or, instead of integers, you may prefer a timestamp in the filename.
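A minimal runnable sketch of that idea, with hypothetical distances standing in for the API responses:
results = [899.6, 2298.0, 2138.8]  # hypothetical distances, one per request

for i, distance in enumerate(results):
    # one file per request: result-0.txt, result-1.txt, ...
    with open(f"result-{i}.txt", "w") as f_o:
        f_o.write(f"{distance}\n")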

Python 3.5, Pandas and MongoDB - json_normalize: raise TypeError("data argument can't be an iterator")

I am trying to do a data transformation using pandas on Python 3.5.
The data is fetched from MongoDB using MongoClient() and json_normalize.
However, when I execute the code below it throws the error "data argument can't be an iterator". Any pointers will help.
Sample Data :
{'bank_code': 'CID005', 'status': 'Init', 'cpgmid': '7847', 'blaze_transId': 'ZI4YQFFOTGG96ZRUQWZS121111632121509-9173782788741', 'currency': 'INR', 'amount': 7800, 'merchant_trans_id': '121111632121509-9173782788741', 'date_time': datetime.datetime(2016, 11, 11, 14, 1, 14, 44000), 'consumer_mobile': 9999999999.0, 'consumer_email': 'test#test.com', '_id': ObjectId('5825cf2a11eae123023730a9')}
{'bank_code': 'CID001', 'status': 'Init', 'cpgmid': '228', 'blaze_transId': '1rjfeklmg2281610111931334hjlm4j8xwl', 'currency': 'INR', 'amount': 651.4, 'merchant_trans_id': '161111569056', 'date_time': datetime.datetime(2016, 11, 11, 14, 1, 14, 333000), 'consumer_mobile': 9999992399.0, 'consumer_email': 'test#air.com', '_id': ObjectId('5825cf2a11eae123023730af')}
{'bank_code': 'CID001', '_id': ObjectId('5825cf2a097752b55d0f17ac'), 'custom_params': {'suppress_trans': 1}, 'currency': 'INR', 'merchant_trans_id': 'BX819215014788728725757', 'date_time': datetime.datetime(2016, 11, 11, 14, 1, 14, 421000), 'consumer_mobile': 0, 'status': 'Init', 'cpgmid': '1656', 'blaze_transId': '1bygejlxl16561610111931423bkgfe1uxx', 'amount': 577, 'consumer_email': 'p.25#gmail.com'}
Code:
from datetime import datetime, timedelta
from pymongo import MongoClient
from pandas.io.json import json_normalize
import pandas as pd

start_datetime1 = (datetime.now() - timedelta(days=1)).replace(hour=18, minute=30, second=0, microsecond=0)
start_datetime2 = (datetime.now() - timedelta(days=0)).replace(hour=18, minute=29, second=59, microsecond=0)

client = MongoClient(host_val, int(port_val))
db = client.cit
transactions_collection = db.transactions

cursor = json_normalize(transactions_collection.find(
    {'date_time': {'$lt': start_datetime2, '$gte': start_datetime1}},
    {'_id': 1, 'blaze_transId': 1, 'status': 1, 'merchant_trans_id': 1,
     'date_time': 1, 'amount': 1, 'cpgmid': 1, 'currency': 1,
     'status_msg': 1, 'bank_code': 1, 'custom_params.suppress_trans': 1,
     'consumer_email': 1, 'consumer_mobile': 1}))
df_txn = pd.DataFrame(cursor)
Error:
ERROR:root:Exception in fetch
Traceback (most recent call last):
  File "/opt/Analytics-services/ETLservices/transformationService/Blazenet_Txns_Fact.py", line 174, in fetchBlazenetTxnsFromDB
    'consumer_email': 1,'consumer_mobile': 1}))
  File "/usr/local/lib/python3.5/site-packages/pandas/io/json.py", line 717, in json_normalize
    return DataFrame(data)
  File "/usr/local/lib/python3.5/site-packages/pandas/core/frame.py", line 283, in __init__
    raise TypeError("data argument can't be an iterator")
TypeError: data argument can't be an iterator
You need to convert the cursor to a list before passing it to json_normalize.
cursor = transactions_collection.find(
    {'date_time': {'$lt': start_datetime2, '$gte': start_datetime1}},
    {'_id': 1, 'blaze_transId': 1, 'status': 1, 'merchant_trans_id': 1,
     'date_time': 1, 'amount': 1, 'cpgmid': 1, 'currency': 1,
     'status_msg': 1, 'bank_code': 1, 'custom_params.suppress_trans': 1,
     'consumer_email': 1, 'consumer_mobile': 1})
df_txn = pd.DataFrame(json_normalize(list(cursor)))
You may also want to look at monary if you want to avoid converting massive amounts of data to a list.
Along with Steve's answer, I changed the Mongo query to avoid selecting fields that were not required. This was done because custom_params was not getting flattened when I selected it explicitly in the Mongo query.
cursor = transactions_collection.find(
    {"date_time": {'$lt': start_datetime2, '$gte': start_datetime1}},
    {'bankRes': 0, 'rawDV': 0})
df_txn = pd.DataFrame(json_normalize(list(cursor)))
