Parsing deeply (multiple) nested JSON blocks with Python 3.6.8

I've seen several related "nested JSON in Python" questions, but the syntax for this coronavirus JSON data is giving me problems. Here's a sample:
{"recovered":524855,"list":[
{"countrycode":"US","country":"United States of America","state":"South Carolina","latitude":"34.22333378","longitude":"-82.46170658","confirmed":15228,"deaths":568},
{"countrycode":"US","country":"United States of America","state":"Louisiana","latitude":"30.2950649","longitude":"-92.41419698","confirmed":43612,"deaths":2957}
]}
If I just want to get to Louisiana, here's what I was trying:
import json
import requests

url = "https://covid19-data.p.api.com/us"
headers = {
    'x-api-key': "<api-key>",
    'x-api-host': "covid19-data.p.api.com"
}
response = requests.request("GET", url, headers=headers)
coronastats = json.loads(response.text)
la_deaths = coronastats["list"][0]["countrycode"]["US"]["country"]["United States of America"]["state"]["Louisiana"]["deaths"]
print("Value: %s" % la_deaths)
I get: "TypeError: string indices must be integers"
This is obviously a list (I'm a detective and deduced that a variable named "list" might be a list) but the long key-value list is throwing me off (I think).

The problem is that once you take the first element of the list, you're left with a depth-one dictionary; the data isn't as nested as you think it is. coronastats["list"][0]["countrycode"] is already the string 'US', and trying to index that string with another string raises the exception.
In [2]: data
Out[2]:
{'recovered': 524855,
 'list': [{'countrycode': 'US',
           'country': 'United States of America',
           'state': 'South Carolina',
           'latitude': '34.22333378',
           'longitude': '-82.46170658',
           'confirmed': 15228,
           'deaths': 568},
          {'countrycode': 'US',
           'country': 'United States of America',
           'state': 'Louisiana',
           'latitude': '30.2950649',
           'longitude': '-92.41419698',
           'confirmed': 43612,
           'deaths': 2957}]}
In [3]: data["list"][0]
Out[3]:
{'countrycode': 'US',
 'country': 'United States of America',
 'state': 'South Carolina',
 'latitude': '34.22333378',
 'longitude': '-82.46170658',
 'confirmed': 15228,
 'deaths': 568}
In [7]: data["list"][0]["countrycode"]
Out[7]: 'US'
In [8]: type(data["list"][0]["countrycode"])
Out[8]: str
In [9]: data["list"][0]["countrycode"]["asdf"]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-cb9dbc39ef82> in <module>
----> 1 data["list"][0]["countrycode"]["asdf"]
TypeError: string indices must be integers
To get to a specific state, what you want to do is to FIND it in the list, for example with code:
In [14]: [f"{row['state']}: {row['deaths']} deaths. Wear a mask!" for row in data["list"] if row["state"] == "Louisiana"]
Out[14]: ['Louisiana: 2957 deaths. Wear a mask!']
You can also use filter, pandas, and a million other solutions to sort through a table.
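If you only need a single row rather than a list, the built-in next() with a generator expression stops at the first match. A minimal sketch over the sample data from the question:

```python
data = {"recovered": 524855, "list": [
    {"countrycode": "US", "country": "United States of America",
     "state": "South Carolina", "confirmed": 15228, "deaths": 568},
    {"countrycode": "US", "country": "United States of America",
     "state": "Louisiana", "confirmed": 43612, "deaths": 2957},
]}

# next() returns the first matching row, or the default (None) if nothing matches.
louisiana = next((row for row in data["list"] if row["state"] == "Louisiana"), None)
print(louisiana["deaths"])  # 2957
```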

Try this:
coronastats = json.loads(response.text)
[row["deaths"] for row in coronastats["list"] if row["state"] == "Louisiana"]

How to convert dataframe to nested dictionary?

I am running Python 3.7 and Pandas 1.1.3 and have a DataFrame which looks like this:
location = {'city_id': [22000, 25000, 27000, 35000],
            'population': [3971883, 2720546, 667137, 1323],
            'region_name': ['California', 'Illinois', 'Massachusetts', 'Georgia'],
            'city_name': ['Los Angeles', 'Chicago', 'Boston', 'Boston'],
            }
df = pd.DataFrame(location, columns=['city_id', 'population', 'region_name', 'city_name'])
I want to transform this dataframe into a dict that looks like:
{
    'Boston': {'Massachusetts': 27000, 'Georgia': 35000},
    'Chicago': {'Illinois': 25000},
    'Los Angeles': {'California': 22000}
}
And if the same city appears in different regions, the nested dict should be sorted by population (for example, Boston is in both Massachusetts and Georgia; the Massachusetts city is bigger, so we output it first).
My code is:
result = df.groupby(['city_name'])[['region_name', 'city_id']].apply(lambda x: x.set_index('region_name').to_dict()).to_dict()
Output:
{'Boston': {'city_id': {'Massachusetts': 27000, 'Georgia': 35000}},
'Chicago': {'city_id': {'Illinois': 25000}},
'Los Angeles': {'city_id': {'California': 22000}}}
As you can see, an extra "city_id" key level gets added to the dictionary.
Tell me, please, how I should change my code so that it gives the expected result?
Just chain another apply() onto your current solution:
result = (df.groupby(['city_name'])[['region_name', 'city_id']]
            .apply(lambda x: x.set_index('region_name').to_dict())
            .apply(lambda x: list(x.values())[0])
            .to_dict())
Now if you print result you will get your expected output:
{'Boston': {'Massachusetts': 27000, 'Georgia': 35000},
'Chicago': {'Illinois': 25000},
'Los Angeles': {'California': 22000}}
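Note the question also asks that, when the same city appears in several regions, the inner dict be ordered by population. The chain above doesn't guarantee that; one way to get it (a sketch relying on Python 3.7+ dict insertion order) is to sort before grouping:

```python
import pandas as pd

location = {'city_id': [22000, 25000, 27000, 35000],
            'population': [3971883, 2720546, 667137, 1323],
            'region_name': ['California', 'Illinois', 'Massachusetts', 'Georgia'],
            'city_name': ['Los Angeles', 'Chicago', 'Boston', 'Boston']}
df = pd.DataFrame(location)

# Sorting by population first means each group's rows arrive in descending
# population order, and dict(zip(...)) preserves that order.
result = (df.sort_values('population', ascending=False)
            .groupby('city_name')[['region_name', 'city_id']]
            .apply(lambda x: dict(zip(x['region_name'], x['city_id'])))
            .to_dict())
print(result)
```

For Boston this yields {'Massachusetts': 27000, 'Georgia': 35000}, with the more populous region's city first.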

seeking help regarding converting data from nested JSON to flat json in python

I am looking to convert a nested JSON into a flat JSON using Python.
The data comes from an API response; there can be up to 100 keys/columns, and the overall row/element count can reach 100k.
[{"Name":"John", "Location":{"City":"Los Angeles","State":"CA"}},{"Name":"Sam", "Location":{"City":"Chicago","State":"IL"}}]
I did come across this
(Python flatten multilevel JSON)
but it flattens the whole JSON completely, so everything ends up at a single level, which is not what I'm looking for here. I also thought of applying it to the data one array at a time in a loop, but that puts a lot of load on the system.
This is the output I'm after:
[{"Name":"John", "City":"Los Angeles","State":"CA"},{"Name":"Sam", "City":"Chicago","State":"IL"}]
Use unpacking with dict.pop (l here is your input list; note that pop mutates each dict):
[{**d.pop("Location"), **d} for d in l]
Output:
[{'City': 'Los Angeles', 'Name': 'John', 'State': 'CA'},
{'City': 'Chicago', 'Name': 'Sam', 'State': 'IL'}]
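If the nested field isn't always called "Location", or several fields may be dict-valued, a small helper can lift exactly one level without mutating the input (flatten_one_level is a made-up name, not a library function):

```python
def flatten_one_level(record):
    """Merge any dict-valued fields into the top level (one level only)."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            flat.update(value)  # lift the nested keys up one level
        else:
            flat[key] = value
    return flat

rows = [{"Name": "John", "Location": {"City": "Los Angeles", "State": "CA"}},
        {"Name": "Sam", "Location": {"City": "Chicago", "State": "IL"}}]
flat_rows = [flatten_one_level(r) for r in rows]
print(flat_rows)
```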

How to turn a list of a list of dictionaries into a dataframe via loop

I have a list of a list of dictionaries. I managed to access each list-element within the outer list and convert the dictionary via pandas into a data-frame. I then save the DF and later concat it. That's a perfect result. But I need a loop to do that for big data.
Here is my MWE which works fine in principle.
import pandas as pd
mwe = [
[{"name": "Norway", "population": 5223256, "area": 323802.0, "gini": 25.8}],
[{"name": "Switzerland", "population": 8341600, "area": 41284.0, "gini": 33.7}],
[{"name": "Australia", "population": 24117360, "area": 7692024.0, "gini": 30.5}],
]
df0 = pd.DataFrame.from_dict(mwe[0])
df1 = pd.DataFrame.from_dict(mwe[1])
df2 = pd.DataFrame.from_dict(mwe[2])
frames = [df0, df1, df2]
result = pd.concat(frames)
It creates a nice table.
Here is what I tried to create a list of data frames:
for i in range(len(mwe)):
    frame = pd.DataFrame()
    frame = pd.DataFrame.from_dict(mwe[i])
    frames = []
    frames.append(frame)
Addendum: Thanks for all the answers; they work on my MWE. That made me notice that there are some strange entries in my dataset. No solution works for the full dataset, since I have an inner-list element which contains two dictionaries (due to non-unique data retrieval):
....
[{'name': 'United States Minor Outlying Islands', 'population': 300},
{'name': 'United States of America',
'population': 323947000,
'area': 9629091.0,
'gini': 48.0}],
...
How can I drop the entry for "United States Minor Outlying Islands"?
You could get each dict out of the containing list and just have a list of dict:
import pandas as pd
mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
       [{'name': 'Switzerland',
         'population': 8341600,
         'area': 41284.0,
         'gini': 33.7}],
       [{'name': 'Australia',
         'population': 24117360,
         'area': 7692024.0,
         'gini': 30.5}]]
# use x.pop() so that you aren't carrying around copies of the data
# for a "big data" application
df = pd.DataFrame([x.pop() for x in mwe])
df.head()
area gini name population
0 323802.0 25.8 Norway 5223256
1 41284.0 33.7 Switzerland 8341600
2 7692024.0 30.5 Australia 24117360
By bringing the list comprehension into the dataframe declaration, that list is temporary, and you don't have to worry about the cleanup. pop will also consume the dictionaries out of mwe, minimizing the amount of copies you are carrying around in memory
As a note, when doing this, mwe will then look like:
mwe
[[], [], []]
Because the contents of the sub-lists have been popped out
EDIT: New Question Content
If your data contains duplicates, or at least entries you don't want, and the undesired entries don't have matching columns to the rest of the dataset (which appears to be the case), it becomes a bit trickier to avoid copying data as above:
mwe.append([{'name': 'United States Minor Outlying Islands', 'population': 300}, {'name': 'United States of America', 'population': 323947000, 'area': 9629091.0, 'gini': 48.0}])
key_check = {}.fromkeys(["name", "population", "area", "gini"])
# the easy way, but it copies data
df = pd.DataFrame([item for data in mwe
                        for item in data
                        if item.keys() == key_check.keys()])
since you'll still have the data hanging around in mwe. It might be better to use a generator:
def get_filtered_data(mwe):
    for data in mwe:
        while data:  # when data is empty, the while loop will end
            item = data.pop()  # still consumes data out of mwe
            if item.keys() == key_check.keys():
                yield item  # will minimize data copying through lazy evaluation

df = pd.DataFrame([x for x in get_filtered_data(mwe)])
area gini name population
0 323802.0 25.8 Norway 5223256
1 41284.0 33.7 Switzerland 8341600
2 7692024.0 30.5 Australia 24117360
3 9629091.0 48.0 United States of America 323947000
Again, this is under the assumption that non-desired entries have invalid columns, which appears to be the case here, specifically. Otherwise, this will at least flatten out the data structure so you can filter it with pandas later
Create an empty DataFrame and loop over the list using df.append on each loop:
>>> import pandas as pd
mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
       [{'name': 'Switzerland',
         'population': 8341600,
         'area': 41284.0,
         'gini': 33.7}],
       [{'name': 'Australia',
         'population': 24117360,
         'area': 7692024.0,
         'gini': 30.5}]]
>>> df = pd.DataFrame()
>>> for country in mwe:
...     df = df.append(country)
...
>>> df
area gini name population
0 323802.0 25.8 Norway 5223256
0 41284.0 33.7 Switzerland 8341600
0 7692024.0 30.5 Australia 24117360
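A caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above fails on current versions. A sketch that avoids both the deprecation and the row-by-row growth is to flatten the nesting with itertools.chain and build the frame once:

```python
from itertools import chain

import pandas as pd

mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
       [{'name': 'Switzerland', 'population': 8341600, 'area': 41284.0, 'gini': 33.7}],
       [{'name': 'Australia', 'population': 24117360, 'area': 7692024.0, 'gini': 30.5}]]

# chain.from_iterable turns the list of one-element lists into a flat
# stream of dicts, which DataFrame consumes in a single pass.
df = pd.DataFrame(chain.from_iterable(mwe))
print(df)
```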
Try this:
df = pd.DataFrame(columns=['name', 'population', 'area', 'gini'])
for i in range(len(mwe)):
    df.loc[i] = list(mwe[i][0].values())
Output:
name population area gini
0 Norway 5223256 323802.0 25.8
1 Switzerland 8341600 41284.0 33.7
2 Australia 24117360 7692024.0 30.5

Efficient looping through dictionary with keys as tuple

I have a very large dictionary with 200 million keys. The keys are tuples with integers as their individual elements. I want to find the key where a "query integer" lies between the two integers of the tuple.
Currently, I am looping through all dictionary keys and comparing the integer with each element of the tuple to check whether it lies within that range. It works, but each query takes around 1-2 minutes and I need to perform around 1 million such queries. An example of the dictionary and the code I have written follow:
Sample dictionary:
ip_dict = {
    (3547237440, 3547237503): {'state': 'seoul teukbyeolsi', 'country': 'korea (south)', 'country_code': 'kr', 'city': 'seoul'},
    (403044176, 403044235): {'state': 'california', 'country': 'united states', 'country_code': 'us', 'city': 'pleasanton'},
    (3423161600, 3423161615): {'state': 'kansas', 'country': 'united states', 'country_code': 'us', 'city': 'lenexa'},
    (3640467200, 3640467455): {'state': 'california', 'country': 'united states', 'country_code': 'us', 'city': 'san jose'},
    (853650485, 853650485): {'state': 'colorado', 'country': 'united states', 'country_code': 'us', 'city': 'arvada'},
    (2054872064, 2054872319): {'state': 'tainan', 'country': 'taiwan', 'country_code': 'tw', 'city': 'tainan'},
    (1760399104, 1760399193): {'state': 'texas', 'country': 'united states', 'country_code': 'us', 'city': 'dallas'},
    (2904302140, 2904302143): {'state': 'iowa', 'country': 'united states', 'country_code': 'us', 'city': 'hampton'},
    (816078080, 816078335): {'state': 'district of columbia', 'country': 'united states', 'country_code': 'us', 'city': 'washington'},
    (2061589204, 2061589207): {'state': 'zhejiang', 'country': 'china', 'country_code': 'cn', 'city': 'hangzhou'},
}
The code I have written:
import ipaddress

ipint = int(ipaddress.IPv4Address(ip))
for k in ip_dict.keys():
    if ipint >= k[0] and ipint <= k[1]:
        print(ip_dict[k]['country'], ip_dict[k]['country_code'], ip_dict[k]['state'])
where ip is just ipaddress like '192.168.0.1'.
If anyone could provide a hint regarding more efficient way to perform this task, it would be much appreciated.
Thanks
I suggest you use another structure with good query complexity, like a tree.
Maybe you can try this library I just found: https://pypi.org/project/rangetree/
As they say, it is optimized for lookups rather than insertions, so if you insert once and look up many times it should be fine.
Another option is to not use a dict but a list: sort it and build an index over it, then do a dichotomy (binary search) on that index for each query (this can be less optimal if the ranges are irregular, so I prefer the first solution).
Create an index for each of the 2 integers: a sorted list like this:
[(left_int, [list_of_row_ids_that_have_this_left_int]),
 (another_greater_left_int, [...])]
You can then search for all rows that have a left int greater than the searched one in log(n).
A binary search will do here.
Do the same for the right int.
Keep the rest of the data in a list of tuples.
More in detail:
data = [( (3547237440, 3547237503), {'state': 'seoul'} ), ...]
left_idx = [(3547237440, [0,43]), (9547237440, [3])]
# 0, 43, 3 are indices in the data list
# search
min_left_idx = binary_search(left_idx, 3444444)
# now all rows referred to by left_idx[min_left_idx] ... left_idx[-1] will satisfy your criteria
min_right_idx = ...
# between these 2 all referred rows satisfy the range check
# intersect the sets
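The index idea sketched above can be implemented with the stdlib bisect module. Assuming the ranges never overlap (typical for IP-geolocation tables), a single sorted index over the left endpoints is enough, and each query drops from O(n) to O(log n). The data below is a small subset of the question's sample:

```python
import bisect
import ipaddress

ip_dict = {
    (403044176, 403044235): {"state": "california", "country": "united states",
                             "country_code": "us", "city": "pleasanton"},
    (816078080, 816078335): {"state": "district of columbia", "country": "united states",
                             "country_code": "us", "city": "washington"},
    (3547237440, 3547237503): {"state": "seoul teukbyeolsi", "country": "korea (south)",
                               "country_code": "kr", "city": "seoul"},
}

# Build the index once, up front.
intervals = sorted(ip_dict)              # (left, right) tuples, sorted by left endpoint
lefts = [left for left, _ in intervals]  # parallel list of left endpoints

def lookup(ip):
    ipint = int(ipaddress.IPv4Address(ip))
    i = bisect.bisect_right(lefts, ipint) - 1  # rightmost interval with left <= ipint
    if i >= 0 and ipint <= intervals[i][1]:    # still inside that interval?
        return ip_dict[intervals[i]]
    return None

print(lookup("24.5.247.104"))  # falls inside (403044176, 403044235) -> pleasanton
```

The sort is paid once; after that, a million queries cost a million binary searches instead of a million full scans over 200 million keys.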

Find value of dictionary inside a dictionary

I have some data like this:
{'cities': [{'abbrev': 'NY', 'name': 'New York'}, {'abbrev': 'BO', 'name': 'Boston'}]}
From my scarce knowledge of Python this looks like a dictionary within a dictionary.
But either way how can I use "NY" as a key to fetch the value "New York"?
It's a dictionary with one key-value pair. The value is a list of dictionaries.
d = {'cities': [{'abbrev': 'NY', 'name': 'New York'}, {'abbrev': 'BO', 'name': 'Boston'}]}
To find the name for an abbreviation you should iterate over the dictionaries in the list and then compare the abbrev-value for a match:
for city in d['cities']:  # iterate over the inner list
    if city['abbrev'] == 'NY':  # check for a match
        print(city['name'])  # print the matching "name"
Instead of the print you can also save the dictionary containing the abbreviation, or return it.
When a dataset isn't shaped for your needs, instead of using it as-is you can build another dictionary from it: a dictionary comprehension can take its keys and values from the fixed keys of your sub-dictionaries.
d = {'cities': [{'abbrev': 'NY', 'name': 'New York'}, {'abbrev': 'BO', 'name': 'Boston'}]}
newd = {sd["abbrev"]:sd["name"] for sd in d['cities']}
print(newd)
results in:
{'NY': 'New York', 'BO': 'Boston'}
and of course: print(newd['NY']) yields New York
Once the dictionary is built, you can reuse it as many times as you need with great lookup speed. Build other specialized dictionaries from the original dataset whenever needed.
Use next and filter the sub-dictionaries based upon the 'abbrev' key:
d = {'cities': [{'abbrev': 'NY', 'name': 'New York'},
                {'abbrev': 'BO', 'name': 'Boston'}]}
city_name = next(city['name'] for city in d['cities']
                 if city['abbrev'] == 'NY')
print(city_name)
Output:
New York
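One caveat with next(): if no sub-dictionary matches, it raises StopIteration. Passing a default as the second argument avoids that; a small sketch:

```python
d = {'cities': [{'abbrev': 'NY', 'name': 'New York'},
                {'abbrev': 'BO', 'name': 'Boston'}]}

# The second argument to next() is returned when the generator is exhausted.
name = next((c['name'] for c in d['cities'] if c['abbrev'] == 'ZZ'), None)
print(name)  # None -- there is no city with abbrev 'ZZ'
```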
I think I understand your problem.
'NY' is a value, not a key.
Maybe you need something like {'cities': {'NY': 'New York', 'BO': 'Boston'}}, so you could type myvar['cities']['NY'] and it would return 'New York'.
If you have to use x = {'cities': [{'abbrev': 'NY', 'name': 'New York'}, {'abbrev': 'BO', 'name': 'Boston'}]} you could create a function:
def search(abbrev):
    for cities in x['cities']:
        if cities['abbrev'] == abbrev:
            return cities['name']
Output:
>>> search('NY')
'New York'
>>> search('BO')
'Boston'
PS: I use Python 3.6.
With this code you could also search by either the name or the abbreviation:
def search(query):
    for cities in x['cities']:
        if cities['abbrev'] == query or cities['name'] == query:
            return cities['name'], cities['abbrev']
