How to convert dataframe to nested dictionary? - python

I am running Python 3.7 and Pandas 1.1.3 and have a DataFrame which looks like this:
import pandas as pd
location = {'city_id': [22000, 25000, 27000, 35000],
            'population': [3971883, 2720546, 667137, 1323],
            'region_name': ['California', 'Illinois', 'Massachusetts', 'Georgia'],
            'city_name': ['Los Angeles', 'Chicago', 'Boston', 'Boston'],
            }
df = pd.DataFrame(location, columns=['city_id', 'population', 'region_name', 'city_name'])
I want to transform this dataframe into a dict that looks like:
{
'Boston': {'Massachusetts': 27000, 'Georgia': 35000},
'Chicago': {'Illinois': 25000},
'Los Angeles': {'California': 22000}
}
If the same city name appears in several regions, the nested dictionary should be ordered by population, largest first. For example, Boston is in both Massachusetts and Georgia; the city in Massachusetts is bigger, so it is output first.
My code is:
result = df.groupby(['city_name'])[['region_name','city_id']].apply(lambda x: x.set_index('region_name').to_dict()).to_dict()
Output:
{'Boston': {'city_id': {'Massachusetts': 27000, 'Georgia': 35000}},
'Chicago': {'city_id': {'Illinois': 25000}},
'Los Angeles': {'city_id': {'California': 22000}}}
As you can see, an extra key, "city_id", is added to each inner dictionary.
How should I change my code to get the expected result?

Chain one more apply() onto your current solution to unwrap the inner 'city_id' level:
result=df.groupby(['city_name'])[['region_name','city_id']].apply(lambda x: x.set_index('region_name').to_dict()).apply(lambda x:list(x.values())[0]).to_dict()
Now if you print result you will get your expected output:
{'Boston': {'Massachusetts': 27000, 'Georgia': 35000},
'Chicago': {'Illinois': 25000},
'Los Angeles': {'California': 22000}}
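The chained apply() keeps whatever row order the DataFrame already has, so the population ordering asked for in the question is only met by coincidence in the sample data. A minimal sketch that sorts explicitly, assuming "largest population first" is the intended order and relying on groupby preserving row order within each group (the name `ordered` is introduced here for illustration):

```python
import pandas as pd

location = {
    'city_id': [22000, 25000, 27000, 35000],
    'population': [3971883, 2720546, 667137, 1323],
    'region_name': ['California', 'Illinois', 'Massachusetts', 'Georgia'],
    'city_name': ['Los Angeles', 'Chicago', 'Boston', 'Boston'],
}
df = pd.DataFrame(location)

# Sort once by population (largest first); groupby keeps that row order
# inside each group, so the inner dicts are built in population order.
ordered = df.sort_values('population', ascending=False)

result = {
    city: dict(zip(group['region_name'], group['city_id'].tolist()))
    for city, group in ordered.groupby('city_name')
}
print(result)
```

Building the inner dictionaries with zip() also avoids the extra 'city_id' level, so no second apply() is needed.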

Related

Matching part of a string with a value in two pandas dataframes

Given the following df with street names:
df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
And df2, which contains the streets to match and the county each one belongs to:
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'], 'county': ["Utuado", "Utuado", "Bayamon"]})
How can I create a column that tells me the county where each street of df is located, by pairing df (street1) with df2 (street2)? The match does not have to be exact, but at least one word must match.
The following dataframe is an example of what I want to obtain:
desiredoutput = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown'], 'state': ["Utuado", "NA", "NA", "Bayamon"]})
Maybe a naive approach, but it works well.
import pandas as pd
df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'], 'county': ["Utuado", "Utuado", "Bayamon"]})
output = {'street1': [], 'county': []}
streets2 = df2['street2']
county = df2['county']
for street in df['street1']:
    matched = False  # reset for every street, so a miss on the first row cannot raise NameError
    for index, street2 in enumerate(streets2):
        if street2 in street:  # substring match
            output['street1'].append(street)
            output['county'].append(county[index])
            matched = True
    if not matched:
        output['street1'].append(street)
        output['county'].append('NA')
print(output)
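The nested loop compares substrings, so 'Angeles' would also match inside a longer word. If whole-word matching is what is wanted, a lookup dictionary is one option. This is only a sketch: it assumes every `street2` entry is a single word, and the names `county_by_word` and `find_county` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'],
                    'county': ['Utuado', 'Utuado', 'Bayamon']})

# Map each known street word to its county (assumes one word per street2 entry).
county_by_word = dict(zip(df2['street2'], df2['county']))

def find_county(street):
    """Return the county of the first word of `street` found in df2, else 'NA'."""
    for word in street.split():
        if word in county_by_word:
            return county_by_word[word]
    return 'NA'

df['county'] = df['street1'].apply(find_county)
print(df)
```

Each street is split once and every word is a dictionary lookup, so the work no longer grows with the number of rows in df2.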

Python Print Function City name

I'm trying to print 'Phoenix' from this list, whose first element is a dictionary, but I can't extract that specific name.
test = [{'Arizona': 'Phoenix', 'California': 'Sacramento', 'Hawaii': 'Honolulu'}, 1000, 2000, 3000, ['hat', 't-shirt', 'jeans', {'socks1': 'red', 'socks2': 'blue'}]]
print(test[0]) gives me all the city names... How do I display just 'Phoenix'?
With test[0] you're just accessing the first element of the list test, which in this case is the dictionary.
You need to use the key - in this case Arizona:
print(test[0]['Arizona'])
Output is Phoenix.
You should read up a bit on the dictionary data structure. But here you go:
test[0]['Arizona']
Run the snippet below and you will most likely see the answer for yourself.
test = [{'Arizona': 'Phoenix', 'California': 'Sacramento', 'Hawaii': 'Honolulu'}, 1000, 2000, 3000,
        ['hat', 't-shirt', 'jeans', {'socks1': 'red', 'socks2': 'blue'}]]
print(test[0]['Arizona'])
for index in range(len(test)):
    print("test[" + str(index) + "]:")
    print(str(test[index]))
    print()
The output will be like the below:
Phoenix
test[0]:
{'Arizona': 'Phoenix', 'California': 'Sacramento', 'Hawaii': 'Honolulu'}
test[1]:
1000
test[2]:
2000
test[3]:
3000
test[4]:
['hat', 't-shirt', 'jeans', {'socks1': 'red', 'socks2': 'blue'}]
The output shows that test is a list of 5 elements and that element 0 is a dictionary. Thus, to output Phoenix, you have to use test[0]['Arizona'].
If you use just test[0], you can just access the first element in the test list:
{'Arizona': 'Phoenix', 'California': 'Sacramento', 'Hawaii': 'Honolulu'}.
test[0] is a dictionary, and 'Phoenix' is the value stored under the key 'Arizona', so you can use that key to get the value:
test[0]['Arizona']
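The same chaining reaches deeper levels too: each pair of brackets steps one level further into the structure. A small sketch using the list from the question (the variable names `capital` and `sock_colour` are only illustrative):

```python
test = [{'Arizona': 'Phoenix', 'California': 'Sacramento', 'Hawaii': 'Honolulu'},
        1000, 2000, 3000,
        ['hat', 't-shirt', 'jeans', {'socks1': 'red', 'socks2': 'blue'}]]

# list index -> dictionary key
capital = test[0]['Arizona']

# list index -> inner list index -> dictionary key
sock_colour = test[4][3]['socks1']

print(capital)
print(sock_colour)
```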

How to turn a list of a list of dictionaries into a dataframe via loop

I have a list of lists of dictionaries. I managed to access each inner list and convert it into a DataFrame with pandas. I then store each DataFrame and concatenate them at the end. The result is exactly what I want, but I need a loop to do this for a large dataset.
Here is my MWE which works fine in principle.
import pandas as pd
mwe = [
[{"name": "Norway", "population": 5223256, "area": 323802.0, "gini": 25.8}],
[{"name": "Switzerland", "population": 8341600, "area": 41284.0, "gini": 33.7}],
[{"name": "Australia", "population": 24117360, "area": 7692024.0, "gini": 30.5}],
]
df0 = pd.DataFrame.from_dict(mwe[0])
df1 = pd.DataFrame.from_dict(mwe[1])
df2 = pd.DataFrame.from_dict(mwe[2])
frames = [df0, df1, df2]
result = pd.concat(frames)
It creates a nice table.
Here is what I tried to create a list of data frames:
for i in range(len(mwe)):
frame = pd.DataFrame()
frame = pd.DataFrame.from_dict(mwe[i])
frames = []
frames.append(frame)
Addendum: Thanks for all the answers. They are working on my MWE. Which made me notice that there are some strange entries in my dataset. No solution works for my dataset, since I have an inner-list element which contains two dictionaries (due to non unique data retrieval):
....
[{'name': 'United States Minor Outlying Islands', 'population': 300},
{'name': 'United States of America',
'population': 323947000,
'area': 9629091.0,
'gini': 48.0}],
...
How can I drop the entry for "United States Minor Outlying Islands"?
You could get each dict out of the containing list and just have a list of dict:
import pandas as pd
mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
[{'name': 'Switzerland',
'population': 8341600,
'area': 41284.0,
'gini': 33.7}],
[{'name': 'Australia',
'population': 24117360,
'area': 7692024.0,
'gini': 30.5}]]
# use x.pop() so that you aren't carrying around copies of the data
# for a "big data" application
df = pd.DataFrame([x.pop() for x in mwe])
df.head()
area gini name population
0 323802.0 25.8 Norway 5223256
1 41284.0 33.7 Switzerland 8341600
2 7692024.0 30.5 Australia 24117360
By bringing the list comprehension into the DataFrame constructor, that list is temporary, and you don't have to worry about the cleanup. pop will also consume the dictionaries out of mwe, minimizing the number of copies you are carrying around in memory.
As a note, when doing this, mwe will then look like:
mwe
[[], [], []]
Because the contents of the sub-lists have been popped out
EDIT: New Question Content
If your data contains duplicates, or at least entries you don't want, and the undesired entries don't have matching columns to the rest of the dataset (which appears to be the case), it becomes a bit trickier to avoid copying data as above:
mwe.append([{'name': 'United States Minor Outlying Islands', 'population': 300}, {'name': 'United States of America', 'population': 323947000, 'area': 9629091.0, 'gini': 48.0}])
key_check = dict.fromkeys(["name", "population", "area", "gini"])
# the easy way, but it copies data
df = pd.DataFrame([item
                   for data in mwe   # outer loop first: each inner list
                   for item in data  # then each dictionary in that list
                   if item.keys() == key_check.keys()])
With that approach you'll still have the data hanging around in mwe, so it might be better to use a generator:
def get_filtered_data(mwe):
    for data in mwe:
        while data:  # when data is empty, the while loop will end
            item = data.pop()  # still consumes data out of mwe
            if item.keys() == key_check.keys():
                yield item  # will minimize data copying through lazy evaluation
df = pd.DataFrame([x for x in get_filtered_data(mwe)])
area gini name population
0 323802.0 25.8 Norway 5223256
1 41284.0 33.7 Switzerland 8341600
2 7692024.0 30.5 Australia 24117360
3 9629091.0 48.0 United States of America 323947000
Again, this is under the assumption that non-desired entries have invalid columns, which appears to be the case here, specifically. Otherwise, this will at least flatten out the data structure so you can filter it with pandas later
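If memory is not the main concern, the "flatten first, filter with pandas later" route can be sketched as follows. It assumes, as in the addendum, that the unwanted entries are exactly those missing `area` and `gini`; the sample data is shortened for the example.

```python
from itertools import chain

import pandas as pd

mwe = [
    [{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
    [{'name': 'United States Minor Outlying Islands', 'population': 300},
     {'name': 'United States of America', 'population': 323947000,
      'area': 9629091.0, 'gini': 48.0}],
]

# Flatten the list of lists into one list of dictionaries.
df = pd.DataFrame(list(chain.from_iterable(mwe)))

# Missing keys become NaN, so rows lacking 'area' or 'gini' can be dropped.
df = df.dropna(subset=['area', 'gini']).reset_index(drop=True)
print(df)
```

Unlike the pop-based generator, this leaves mwe untouched, at the cost of holding both the original data and the DataFrame in memory.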
Create an empty DataFrame and loop over the list, calling df.append on each iteration:
>>> import pandas as pd
mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
[{'name': 'Switzerland',
'population': 8341600,
'area': 41284.0,
'gini': 33.7}],
[{'name': 'Australia',
'population': 24117360,
'area': 7692024.0,
'gini': 30.5}]]
>>> df = pd.DataFrame()
>>> for country in mwe:
... df = df.append(country)
...
>>> df
area gini name population
0 323802.0 25.8 Norway 5223256
0 41284.0 33.7 Switzerland 8341600
0 7692024.0 30.5 Australia 24117360
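Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop only runs on older versions. A sketch of the same idea that collects the frames in a list and concatenates once with `pd.concat`; `ignore_index=True` is an optional choice made here to avoid the repeated 0 index:

```python
import pandas as pd

mwe = [
    [{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
    [{'name': 'Switzerland', 'population': 8341600, 'area': 41284.0, 'gini': 33.7}],
    [{'name': 'Australia', 'population': 24117360, 'area': 7692024.0, 'gini': 30.5}],
]

# One small DataFrame per inner list, built outside any accumulating loop.
frames = [pd.DataFrame(inner) for inner in mwe]

# A single concat is cheaper than growing a DataFrame row by row.
result = pd.concat(frames, ignore_index=True)
print(result)
```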
Try this:
df = pd.DataFrame(columns = ['name', 'population', 'area', 'gini'])
for i in range(len(mwe)):
    # relies on every dictionary listing its keys in the same order as the columns
    df.loc[i] = list(mwe[i][0].values())
Output:
name population area gini
0 Norway 5223256 323802.0 25.8
1 Switzerland 8341600 41284.0 33.7
2 Australia 24117360 7692024.0 30.5

Find value of dictionary inside a dictionary

I have some data like this:
{'cities': [{'abbrev': 'NY', 'name': 'New York'}, {'abbrev': 'BO', 'name': 'Boston'}]}
From my scarce knowledge of Python this looks like a dictionary within a dictionary.
But either way how can I use "NY" as a key to fetch the value "New York"?
It's a dictionary with one key-value pair. The value is a list of dictionaries.
d = {'cities': [{'abbrev': 'NY', 'name': 'New York'}, {'abbrev': 'BO', 'name': 'Boston'}]}
To find the name for an abbreviation you should iterate over the dictionaries in the list and then compare the abbrev-value for a match:
for city in d['cities']:  # iterate over the inner list
    if city['abbrev'] == 'NY':  # check for a match
        print(city['name'])  # print the matching "name"
Instead of the print you can also save the dictionary containing the abbreviation, or return it.
When a dataset is not shaped for your needs, you don't have to use it as-is. You can build another dictionary from it with a dictionary comprehension, taking the value under 'abbrev' in each sub-dictionary as the new key and the value under 'name' as the new value.
d = {'cities': [{'abbrev': 'NY', 'name': 'New York'}, {'abbrev': 'BO', 'name': 'Boston'}]}
newd = {sd["abbrev"]:sd["name"] for sd in d['cities']}
print(newd)
results in:
{'NY': 'New York', 'BO': 'Boston'}
and of course: print(newd['NY']) yields New York
Once the dictionary is built, you can reuse it as many times as you need with great lookup speed. Build other specialized dictionaries from the original dataset whenever needed.
Use next and filter the sub dictionaries based upon the 'abbrev' key:
d = {'cities': [{'abbrev': 'NY', 'name': 'New York'},
                {'abbrev': 'BO', 'name': 'Boston'}]}
city_name = next(city['name'] for city in d['cities']
                 if city['abbrev'] == 'NY')
print(city_name)
Output:
New York
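One caveat with next(): when no sub-dictionary matches, it raises StopIteration. Passing a second argument gives it a fallback value instead. A minimal sketch, where the helper name `name_for` and the choice of `None` as the fallback are assumptions made for the example:

```python
d = {'cities': [{'abbrev': 'NY', 'name': 'New York'},
                {'abbrev': 'BO', 'name': 'Boston'}]}

def name_for(abbrev, default=None):
    """Return the city name for `abbrev`, or `default` if nothing matches."""
    matches = (city['name'] for city in d['cities'] if city['abbrev'] == abbrev)
    # The second argument to next() is returned when the generator is empty.
    return next(matches, default)

print(name_for('NY'))
print(name_for('XX'))
```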
I think that I understand your problem.
'NY' is a value, not a key.
Maybe you need something like {'cities': {'NY': 'New York', 'BO': 'Boston'}}, so you could type myvar['cities']['NY'] and it would return 'New York'.
If you have to use x = {'cities': [{'abbrev': 'NY', 'name': 'New York'}, {'abbrev': 'BO', 'name': 'Boston'}]} you could create a function:
def search(abbrev):
    for cities in x['cities']:
        if cities['abbrev'] == abbrev:
            return cities['name']
Output:
>>> search('NY')
'New York'
>>> search('BO')
'Boston'
P.S. I use Python 3.6.
With a small change, the same function can also find the abbreviation from the name:
def search(term):
    for city in x['cities']:
        # match on either field, so a name finds its abbreviation and vice versa
        if city['abbrev'] == term or city['name'] == term:
            return city['name'], city['abbrev']

Need help understanding the behavior of a for loop

I am working through a tutorial on sets in Python 2.7, and I have run into a behavior using a for loop that I do not understand, and I am trying to find out what the reason for the difference in outputs might be.
The object of the exercise is to produce a set, cities, from a dictionary that contains keys made up of city pairs of frozen sets using a for loop.
The data comes from the following dictionary:
flight_distances = {
frozenset(['Atlanta', 'Chicago']): 590.0,
frozenset(['Atlanta', 'Dallas']): 720.0,
frozenset(['Atlanta', 'Houston']): 700.0,
frozenset(['Atlanta', 'New York']): 750.0,
frozenset(['Austin', 'Dallas']): 180.0,
frozenset(['Austin', 'Houston']): 150.0,
frozenset(['Boston', 'Chicago']): 850.0,
frozenset(['Boston', 'Miami']): 1260.0,
frozenset(['Boston', 'New York']): 190.0,
frozenset(['Chicago', 'Denver']): 920.0,
frozenset(['Chicago', 'Houston']): 940.0,
frozenset(['Chicago', 'Los Angeles']): 1740.0,
frozenset(['Chicago', 'New York']): 710.0,
frozenset(['Chicago', 'Seattle']): 1730.0,
frozenset(['Dallas', 'Denver']): 660.0,
frozenset(['Dallas', 'Los Angeles']): 1240.0,
frozenset(['Dallas', 'New York']): 1370.0,
frozenset(['Denver', 'Los Angeles']): 830.0,
frozenset(['Denver', 'New York']): 1630.0,
frozenset(['Denver', 'Seattle']): 1020.0,
frozenset(['Houston', 'Los Angeles']): 1370.0,
frozenset(['Houston', 'Miami']): 970.0,
frozenset(['Houston', 'San Francisco']): 1640.0,
frozenset(['Los Angeles', 'New York']): 2450.0,
frozenset(['Los Angeles', 'San Francisco']): 350.0,
frozenset(['Los Angeles', 'Seattle']): 960.0,
frozenset(['Miami', 'New York']): 1090.0,
frozenset(['New York', 'San Francisco']): 2570.0,
frozenset(['San Francisco', 'Seattle']): 680.0,
}
There is also a test list that will create the intended set as a check:
flying_circus_cities = [
'Houston', 'Chicago', 'Miami', 'Boston', 'Dallas', 'Denver',
'New York', 'Los Angeles', 'San Francisco', 'Atlanta',
'Seattle', 'Austin'
]
When the code is written in the following form, the loop produces the intended result.
cities = set()
for pair in flight_distances:
    cities = cities.union(pair)
print cities
print "Check:", cities == set(flying_circus_cities)
Output:
set(['Houston', 'Chicago', 'Miami', 'Boston', 'Dallas', 'Denver', 'New York', 'Los Angeles', 'San Francisco', 'Atlanta', 'Seattle', 'Austin'])
Check: True
However, if I attempt as a comprehension, with either of the following, I get a different result.
cities = set()
cities = {pair for pair in flight_distances}
print cities
print "Check:", cities == set(flying_circus_cities)
or
cities = set()
cities = cities.union(pair for pair in flight_distances)
print cities
print "Check:", cities == set(flying_circus_cities)
Output for both:
set([frozenset(['Atlanta', 'Dallas']), frozenset(['San Francisco', 'New York']), frozenset(['Denver', 'Chicago']), frozenset(['Houston', 'San Francisco']), frozenset(['San Francisco', 'Austin']), frozenset(['Seattle', 'Los Angeles']), frozenset(['Boston', 'New York']), frozenset(['Houston', 'Atlanta']), frozenset(['New York', 'Chicago']), frozenset(['San Francisco', 'Seattle']), frozenset(['Austin', 'Dallas']), frozenset(['New York', 'Dallas']), frozenset(['Houston', 'Chicago']), frozenset(['Seattle', 'Denver']), frozenset(['Seattle', 'Chicago']), frozenset(['Miami', 'New York']), frozenset(['Los Angeles', 'Denver']), frozenset(['Miami', 'Houston']), frozenset(['San Francisco', 'Los Angeles']), frozenset(['New York', 'Denver']), frozenset(['Atlanta', 'Chicago']), frozenset(['Boston', 'Chicago']), frozenset(['Houston', 'Austin']), frozenset(['Houston', 'Los Angeles']), frozenset(['New York', 'Los Angeles']), frozenset(['Atlanta', 'New York']), frozenset(['Denver', 'Dallas']), frozenset(['Los Angeles', 'Dallas']), frozenset(['Los Angeles', 'Chicago'])])
Check: False
I cannot figure out why the for loop in the first example unpacks the pairs as intended so that it produces a set with one instance of each city, while trying to write the loop as a comprehension pulls out the frozenset([city1, city2]) pairs and places them in the set instead.
I do not understand why pair would give the city strings in the first instance but passes the frozenset in the second instance.
Can someone explain the different behavior?
Note: As explained by Holt and donkopotamus, the two versions behave differently because of what is handed to union(). In the comprehension versions, union() receives a single iterable whose items are the frozensets themselves, so each frozenset is added as one element and the result is a set of frozensets. The standard for loop instead passes one pair to union() on each pass, so the city names inside that pair are added to cities.
They further explained that the * operator unpacks the generated sets into separate arguments to union(), which produces the desired behavior.
cities = cities.union(*(set(pair) for pair in flight_distances))
The expression:
cities = set()
cities = cities.union(pair for pair in flight_distances)
will take the union of the empty set {} with another set
{pair_0, pair_1, pair_2, ..., pair_n}
leaving you with a set of sets.
In contrast, the following will give you all of the cities flown to:
>>> set.union(*(set(pair) for pair in flight_distances))
{'Atlanta',
'Austin',
'Boston',
'Chicago',
'Dallas',
'Denver',
'Houston',
'Los Angeles',
'Miami',
'New York',
'San Francisco',
'Seattle'}
Here we transform each of the frozen set keys into a plain set and find the union.
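For readers on Python 3, here is a small sketch of the same contrast. It uses a shortened three-entry version of the dictionary, and the variable names are only illustrative:

```python
# Shortened sample of the question's dictionary.
flight_distances = {
    frozenset(['Atlanta', 'Chicago']): 590.0,
    frozenset(['Atlanta', 'Dallas']): 720.0,
    frozenset(['Austin', 'Dallas']): 180.0,
}

# Each key is added as ONE element: a set of frozensets.
pairs_only = {pair for pair in flight_distances}

# A nested comprehension reaches inside each pair: a set of city names.
cities_nested = {city for pair in flight_distances for city in pair}

# Unpacking the keys makes each pair a separate argument to union().
cities_union = set().union(*flight_distances)

print(pairs_only == cities_nested)
print(sorted(cities_union))
```

Starting from `set()` rather than calling `set.union(*...)` directly also works when the dictionary is empty.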
In the first version, pair is a frozenset on each iteration, so you can take a union with it, while in your version you try to take a union with a set of frozensets.
The first case comes down to (union with a frozenset at each iteration):
cities = set()
cities = cities.union(frozenset(['Atlanta', 'Chicago']))
cities = cities.union(frozenset(['Atlanta', 'Dallas']))
...
So you have (mathematically):
cities = {} # Empty set
cities = {} U {'Atlanta', 'Chicago'} = {'Atlanta', 'Chicago'}
cities = {'Atlanta', 'Chicago'} U {'Atlanta', 'Dallas'} = {'Atlanta', 'Chicago', 'Dallas'}
...
In your (last) case, you are doing the following (one union with a sequence of frozenset):
cities = set()
cities = cities.union([frozenset(['Atlanta', 'Chicago']), frozenset(['Atlanta', 'Dallas']), ...])
So you have:
cities = {}
cities = {} U {{'Atlanta', 'Chicago'}, {'Atlanta', 'Dallas'}, ...}
= {{'Atlanta', 'Chicago'}, {'Atlanta', 'Dallas'}, ...} # Nothing disappears
Since no two pairs are identical, you get a set of all the pairs in your initial dictionary, because you are passing .union() a set of pairs of cities, not a set of cities.
From a more abstract point of view, you are trying to obtain:
S = {} U S1 U S2 U S3 U ... U Sn = (((({} U S1) U S2) U S3) U ...) U Sn
But what you actually compute is:
S = {} U {S1, S2, S3, ..., Sn}
