Matching part of a string with a value in two pandas dataframes - python

Given the following df with street names:
df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
And df2, which contains the streets to match and their corresponding county:
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'], 'county': ["Utuado", "Utuado", "Bayamon"]})
How can I create a column that tells me the state (the county in df2) where each street of df is, by pairing df (street1) with df2 (street2)? The match does not have to be exact; it only has to match at least one word.
The following dataframe is an example of what I want to obtain:
desiredoutput = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown'], 'state': ["Utuado", "NA", "NA", "Bayamon"]})

Maybe a naive approach, but it works well.
import pandas as pd

df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'], 'county': ["Utuado", "Utuado", "Bayamon"]})
output = {'street1': [], 'county': []}
streets1 = df['street1']
streets2 = df2['street2']
county = df2['county']
for street in streets1:
    count = 0  # reset the match flag for every street
    for index, street2 in enumerate(streets2):
        if street2 in street:  # substring match
            output['street1'].append(street)
            output['county'].append(county[index])
            count = 1
    if count == 0:  # no street from df2 matched
        output['street1'].append(street)
        output['county'].append('NA')
print(output)
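If you would rather stay inside pandas, here is a minimal sketch of the same idea using whole-word matching instead of the substring test above. It assumes that a match means an exact, case-sensitive word from street1 equals a value in street2, and it keeps only the first matching county per street; the names words, matched and result are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'],
                    'county': ['Utuado', 'Utuado', 'Bayamon']})

# One row per word of each street; reset_index keeps the original row label
# in a column called 'index'.
words = (df['street1'].str.split()
         .explode()
         .rename('street2')
         .reset_index())

# Keep the words that also appear in df2, then the first county per original row.
matched = (words.merge(df2, on='street2', how='inner')
           .drop_duplicates(subset='index')
           .set_index('index')['county'])

# Left-join back on the row label; streets without a match get 'NA'.
result = df.join(matched).fillna({'county': 'NA'})
print(result)
```

Unlike the loop, this will not match 'Angeles' inside a longer word such as 'LosAngeles', so pick whichever behaviour fits your data.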

Related

Replace list of items in Pandas dataframe by tree leafs

I have a tree, Locations, which has a Continent -> Country -> Location hierarchy. I also have a dataframe in which each row holds a list of entries from this tree.
How can I replace the entries of each row's list with the leaves of the tree beneath them?
My creativity with apply or map, and possibly a lambda function, is lacking.
Minimal example:
import pandas as pd
Locations = {
'Europe':
{'Germany': ['Berlin'],
'France': ['Paris','Bordeaux']},
'Asia':
{'China': ['Hong Kong'],
'Indonesia': ['Jakarta']},
'North America':
{'United States':['New York','Washington']}}
df = pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
['North America','United States','Asia','France'],
['North America','Asia','Europe','Germany']]})
df = df.apply(...)?
df = df.map(...)?
# How to end up with:
pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
['New York','Washington','Hong Kong','Jakarta','Paris','Bordeaux'],
['New York','Washington','Hong Kong','Jakarta','Paris','Bordeaux','Berlin']]})
# Note: the order of the locations doesn't matter, so this is also OK:
pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
['Jakarta','Washington','Hong Kong','Paris','New York','Bordeaux'],
['Jakarta','Berlin','Washington','Hong Kong','Paris','New York','Bordeaux']]})
You do not really need the apply method. You can start by changing the structure of your Locations dictionary in order to map the actual values to your exploded data frame. Then, just combine several explode, drop_duplicates and groupby statements with different aggregation logics to produce your desired result.
Code:
import pandas as pd
from collections import defaultdict
from itertools import chain
Locations = {
'Europe':{'Germany': ['Berlin'], 'France': ['Paris','Bordeaux']},
'Asia': {'China': ['Hong Kong'], 'Indonesia': ['Jakarta']},
'North America': {'United States': ['New York','Washington']}
}
df = pd.DataFrame({'Persons':['A', 'B'], 'Locations': [['North America','United States','Asia','France'], ['North America','Asia','Europe']]})
mapping_locs = defaultdict(list)
for key, val in Locations.items():
    # A continent maps to every leaf of all its countries
    mapping_locs[key] = list(chain.from_iterable(list(val.values())))
    for lkey, lval in val.items():
        # A country maps to its own leaves
        mapping_locs[lkey] = lval
(
df.assign(
mapped_locations=(
df.explode("Locations")["Locations"].map(mapping_locs).reset_index()
.explode("Locations").drop_duplicates(subset=["index", "Locations"])
.groupby(level=0).agg({"index": "first", "Locations": list})
.groupby("index").apply(lambda x: list(chain.from_iterable(x["Locations"])))
)
)
)
Output:
Persons Locations mapped_locations
0 A [North America, United States, Asia, France] [New York, Washington, Hong Kong, Jakarta, Par...
1 B [North America, Asia, Europe] [New York, Washington, Hong Kong, Jakarta, Ber...
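For comparison, here is a sketch of a plain-Python alternative that builds the same kind of flat lookup and then uses a single apply on the column. The helper names leaves, lookup and expand are made up for this example, and it assumes that entries missing from the tree should simply be skipped.

```python
import pandas as pd

Locations = {
    'Europe': {'Germany': ['Berlin'], 'France': ['Paris', 'Bordeaux']},
    'Asia': {'China': ['Hong Kong'], 'Indonesia': ['Jakarta']},
    'North America': {'United States': ['New York', 'Washington']},
}

def leaves(node):
    """Collect all leaf locations under a dict (continent) or a list (country)."""
    if isinstance(node, dict):
        return [leaf for child in node.values() for leaf in leaves(child)]
    return list(node)

# Flat lookup: every continent and every country name -> its leaves
lookup = {}
for continent, countries in Locations.items():
    lookup[continent] = leaves(countries)
    for country, cities in countries.items():
        lookup[country] = leaves(cities)

def expand(entries):
    """Replace each entry by its leaves, dropping duplicates but keeping order."""
    seen = []
    for entry in entries:
        for leaf in lookup.get(entry, []):  # unknown entries are skipped (assumption)
            if leaf not in seen:
                seen.append(leaf)
    return seen

df = pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
    ['North America', 'United States', 'Asia', 'France'],
    ['North America', 'Asia', 'Europe', 'Germany']]})
df['mapped_locations'] = df['Locations'].apply(expand)
print(df['mapped_locations'].tolist())
```

This avoids the explode/groupby round trip, at the cost of a Python-level loop per row, which is usually fine for short lists.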

Aggregating and group by in Pandas considering some conditions

I have an Excel file which, simplified, has the following structure and which I read as a dataframe:
df = pd.DataFrame({'ISIN':['US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US00206R1023'],
'Name':['ALPHABET INC.CL.A DL-,001', 'Alphabet Inc Class A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'Alphabet Inc. Class C', 'Alphabet Inc. Class A', 'AT&T Inc'],
'Country':['United States', 'United States', 'United States', '', 'United States', 'United States', 'United States', 'United States', 'United States'],
'Category':[ '', 'big', 'big', '', 'big', 'test', 'test', 'test', 'average'],
'Category2':['important', '', 'important', '', '', '', '', '', 'irrelevant'],
'Value':[1000, 750, 60, 50, 160, 9, 10, 10, 1]})
I would love to group by ISIN and sum up the values, like:
df1 = df.groupby('ISIN').sum(['Value'])
The problem with this approach is that I don't get the other fields 'Name', 'Country', 'Category', 'Category2'.
My objective is to get the following aggregated dataframe as a result:
df1 = pd.DataFrame({'ISIN':['US02079K3059', 'US00206R1023'],
'Name':['ALPHABET A', 'AT&T Inc'],
'Country':['United States', 'United States'],
'Category':['big', 'average'],
'Category2':['important', 'irrelevant'],
'Value':[2049, 1]})
If you compare df to df1, you will recognize some criteria/conditions I applied:
For every 'ISIN', the most commonly appearing field value should be used, e.g. 'United States' in column 'Country'.
If several field values are equally common, the first appearing of them should be used, e.g. 'big' and 'test' in column 'Category'.
Exception: empty values don't count, e.g. in Category2, even though '' is the most common value, 'important' is used as the final value.
How can I achieve this goal? Anyone who can help me out?
Try converting '' to NaN, then drop the 'Value' column, group by 'ISIN' and calculate the mode. Finally, map the per-'ISIN' sum of 'Value' onto the 'ISIN' column to create the 'Value' column in the final result.
The idea is to convert the empty string '' to NaN so that it doesn't count in the mode. We define a function to handle the case where every value of a column within an 'ISIN' group is NaN: mode() drops NaN by default (dropna=True), so it returns an empty result there.
def f(x):
    try:
        return x.mode().iat[0]
    except IndexError:
        return float('NaN')
Finally:
out=(df.replace('',float('NaN'))
.drop(columns='Value')
.groupby('ISIN',as_index=False).agg(f))
out['Value']=out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
out['Value_perc']=out['Value'].div(out['Value'].sum()).round(5)
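OR
by passing dropna=False to the mode() method in an anonymous function. Note that NaN is then counted like any other value, so a column whose most frequent entry within a group is empty comes out as NaN: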
out=(df.replace('',float('NaN'))
.drop(columns='Value')
.groupby('ISIN',as_index=False).agg(lambda x:x.mode(dropna=False).iat[0]))
out['Value']=out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
out['Value_perc']=out['Value'].div(out['Value'].sum()).round(5)
Now if you print out, you will get your desired output.
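To see how the empty-string handling plays out, here is a small sketch on made-up data (the ISINs 'A1' and 'B2' and the helper name first_mode are hypothetical). It follows the first variant: '' becomes NaN, mode() ignores NaN, and a group with no real value falls back to NaN.

```python
import pandas as pd

# Hypothetical toy data, not the data from the question
df = pd.DataFrame({
    'ISIN': ['A1', 'A1', 'A1', 'B2'],
    'Category2': ['important', '', '', ''],
    'Value': [10, 20, 30, 1],
})

def first_mode(s):
    m = s.mode()  # dropna=True by default, so NaN is never the mode
    return m.iat[0] if not m.empty else float('nan')

cleaned = df.replace('', float('nan'))
out = (cleaned.drop(columns='Value')
       .groupby('ISIN', as_index=False)
       .agg(first_mode))
# Sum 'Value' per ISIN separately and map it back onto the result
out['Value'] = out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
print(out)
```

For 'A1' the two empty strings outnumber 'important', yet 'important' is returned because the empties were turned into NaN first. 'B2' has no real value at all, so it stays NaN.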

How to convert dataframe to nested dictionary?

I am running Python 3.7 and Pandas 1.1.3 and have a DataFrame which looks like this:
location = {'city_id': [22000,25000,27000,35000],
'population': [3971883,2720546,667137,1323],
'region_name': ['California','Illinois','Massachusetts','Georgia'],
'city_name': ['Los Angeles','Chicago','Boston','Boston'],
}
df = pd.DataFrame(location, columns = ['city_id', 'population','region_name', 'city_name'])
I want to transform this dataframe into a dict that looks like:
{
'Boston': {'Massachusetts': 27000, 'Georgia': 35000},
'Chicago': {'Illinois': 25000},
'Los Angeles': {'California': 22000}
}
And if the same city name appears in different regions, the nested dictionary should be sorted by population (for example, Boston is in both Massachusetts and Georgia; the city in Massachusetts is bigger, so it is output first).
My code is:
result = df = df.groupby(['city_name'])[['region_name','city_id']].apply(lambda x: x.set_index('region_name').to_dict()).to_dict()
Output:
{'Boston': {'city_id': {'Massachusetts': 27000, 'Georgia': 35000}},
'Chicago': {'city_id': {'Illinois': 25000}},
'Los Angeles': {'city_id': {'California': 22000}}}
As you can see, an extra key, "city_id", is added to the dictionary.
Could you tell me how I should change my code to get the expected result?
Just chain another apply() call onto your current solution:
result=df.groupby(['city_name'])[['region_name','city_id']].apply(lambda x: x.set_index('region_name').to_dict()).apply(lambda x:list(x.values())[0]).to_dict()
Now if you print result you will get your expected output:
{'Boston': {'Massachusetts': 27000, 'Georgia': 35000},
'Chicago': {'Illinois': 25000},
'Los Angeles': {'California': 22000}}
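The question also asks for the inner dictionaries to be ordered by population. One way to sketch that, assuming Python 3.7+ (where dicts keep insertion order), is to sort before grouping and build the inner dicts with zip. This is an alternative to the chained apply(), not a fix to it.

```python
import pandas as pd

location = {'city_id': [22000, 25000, 27000, 35000],
            'population': [3971883, 2720546, 667137, 1323],
            'region_name': ['California', 'Illinois', 'Massachusetts', 'Georgia'],
            'city_name': ['Los Angeles', 'Chicago', 'Boston', 'Boston']}
df = pd.DataFrame(location)

# Sort once by population; groupby keeps the row order inside each group,
# so the largest region of a city is inserted into its dict first.
ordered = df.sort_values('population', ascending=False)
result = {
    city: dict(zip(group['region_name'], group['city_id']))
    for city, group in ordered.groupby('city_name')
}
print(result)
```

The values are NumPy integers here; wrap them in int() if you need plain Python ints, for example before dumping to JSON.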

Populate new column based on conditions in another column

I'm playing about with Python and pandas.
I have created a dataframe with a column (axis 1) called 'County', but I need to create a column called 'Region' and populate it like this (at least I think):
If County column == 'Suffolk' or 'Norfolk' or 'Essex' then in Region column insert 'East Anglia'
If County column == 'Kent' or 'East Sussex' or 'West Sussex' then in Region Column insert 'South East'
If County column == 'Dorset' or 'Devon' or 'Cornwall' then in Region Column insert 'South West'
and so on...
So far I have this:
myDataFrame['Region'] = np.where(myDataFrame['County']=='Suffolk', 'East Anglia', '')
But I suspect this won't work for any other counties
As I'm sure is obvious I am a beginner. I have tried googling and reading but only could find out about numpy where, which got me this far.
You'll definitely need df.isin and loc based indexing:
df['Region'] = np.nan
df.loc[df.County.isin(['Suffolk','Norfolk', 'Essex']), 'Region'] = 'East Anglia'
df.loc[df.County.isin(['Kent', 'East Sussex', 'West Sussex']), 'Region'] = 'South East'
df.loc[df.County.isin(['Dorset', 'Devon', 'Cornwall']), 'Region'] = 'South West'
You could also create a mapping of sorts and use Series.map or Series.replace:
mapping = { 'Suffolk' : 'East Anglia', 'Norfolk': 'East Anglia', ... 'Kent' :'South East', ..., ... }
df['Region'] = df.County.map(mapping)
I would prefer a map here because it would convert non-matches to NaN, which would be the ideal thing.
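As a rough sketch of the mapping approach, you can write the regions once as region -> counties and invert that into the county -> region dict that map expects. The regions dict below only contains the three regions from the question, and 'Yorkshire' is a made-up row to show what happens to a non-match.

```python
import pandas as pd

# Region -> counties, as listed in the question
regions = {
    'East Anglia': ['Suffolk', 'Norfolk', 'Essex'],
    'South East': ['Kent', 'East Sussex', 'West Sussex'],
    'South West': ['Dorset', 'Devon', 'Cornwall'],
}

# Invert into county -> region
mapping = {county: region
           for region, counties in regions.items()
           for county in counties}

# 'Yorkshire' is not in the mapping, so it ends up as NaN
df = pd.DataFrame({'County': ['Suffolk', 'Kent', 'Devon', 'Yorkshire']})
df['Region'] = df['County'].map(mapping)
print(df)
```

Keeping the data in the region -> counties shape means adding a new region is a one-line change.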

Need help understanding the behavior of a for loop

I am working through a tutorial on sets in Python 2.7, and I have run into a behavior using a for loop that I do not understand, and I am trying to find out what the reason for the difference in outputs might be.
The object of the exercise is to produce a set, cities, from a dictionary that contains keys made up of city pairs of frozen sets using a for loop.
The data comes from the following dictionary:
flight_distances = {
frozenset(['Atlanta', 'Chicago']): 590.0,
frozenset(['Atlanta', 'Dallas']): 720.0,
frozenset(['Atlanta', 'Houston']): 700.0,
frozenset(['Atlanta', 'New York']): 750.0,
frozenset(['Austin', 'Dallas']): 180.0,
frozenset(['Austin', 'Houston']): 150.0,
frozenset(['Boston', 'Chicago']): 850.0,
frozenset(['Boston', 'Miami']): 1260.0,
frozenset(['Boston', 'New York']): 190.0,
frozenset(['Chicago', 'Denver']): 920.0,
frozenset(['Chicago', 'Houston']): 940.0,
frozenset(['Chicago', 'Los Angeles']): 1740.0,
frozenset(['Chicago', 'New York']): 710.0,
frozenset(['Chicago', 'Seattle']): 1730.0,
frozenset(['Dallas', 'Denver']): 660.0,
frozenset(['Dallas', 'Los Angeles']): 1240.0,
frozenset(['Dallas', 'New York']): 1370.0,
frozenset(['Denver', 'Los Angeles']): 830.0,
frozenset(['Denver', 'New York']): 1630.0,
frozenset(['Denver', 'Seattle']): 1020.0,
frozenset(['Houston', 'Los Angeles']): 1370.0,
frozenset(['Houston', 'Miami']): 970.0,
frozenset(['Houston', 'San Francisco']): 1640.0,
frozenset(['Los Angeles', 'New York']): 2450.0,
frozenset(['Los Angeles', 'San Francisco']): 350.0,
frozenset(['Los Angeles', 'Seattle']): 960.0,
frozenset(['Miami', 'New York']): 1090.0,
frozenset(['New York', 'San Francisco']): 2570.0,
frozenset(['San Francisco', 'Seattle']): 680.0,
}
There is also a test list that will create the intended set as a check:
flying_circus_cities = [
'Houston', 'Chicago', 'Miami', 'Boston', 'Dallas', 'Denver',
'New York', 'Los Angeles', 'San Francisco', 'Atlanta',
'Seattle', 'Austin'
]
When the code is written in the following form, the loop produces the intended result.
cities = set()
for pair in flight_distances:
    cities = cities.union(pair)
print cities
print "Check:", cities == set(flying_circus_cities)
Output:
set(['Houston', 'Chicago', 'Miami', 'Boston', 'Dallas', 'Denver', 'New York', 'Los Angeles', 'San Francisco', 'Atlanta', 'Seattle', 'Austin'])
Check: True
However, if I attempt as a comprehension, with either of the following, I get a different result.
cities = set()
cities = {pair for pair in flight_distances}
print cities
print "Check:", cities == set(flying_circus_cities)
or
cities = set()
cities = cities.union(pair for pair in flight_distances)
print cities
print "Check:", cities == set(flying_circus_cities)
Output for both:
set([frozenset(['Atlanta', 'Dallas']), frozenset(['San Francisco', 'New York']), frozenset(['Denver', 'Chicago']), frozenset(['Houston', 'San Francisco']), frozenset(['San Francisco', 'Austin']), frozenset(['Seattle', 'Los Angeles']), frozenset(['Boston', 'New York']), frozenset(['Houston', 'Atlanta']), frozenset(['New York', 'Chicago']), frozenset(['San Francisco', 'Seattle']), frozenset(['Austin', 'Dallas']), frozenset(['New York', 'Dallas']), frozenset(['Houston', 'Chicago']), frozenset(['Seattle', 'Denver']), frozenset(['Seattle', 'Chicago']), frozenset(['Miami', 'New York']), frozenset(['Los Angeles', 'Denver']), frozenset(['Miami', 'Houston']), frozenset(['San Francisco', 'Los Angeles']), frozenset(['New York', 'Denver']), frozenset(['Atlanta', 'Chicago']), frozenset(['Boston', 'Chicago']), frozenset(['Houston', 'Austin']), frozenset(['Houston', 'Los Angeles']), frozenset(['New York', 'Los Angeles']), frozenset(['Atlanta', 'New York']), frozenset(['Denver', 'Dallas']), frozenset(['Los Angeles', 'Dallas']), frozenset(['Los Angeles', 'Chicago'])])
Check: False
I cannot figure out why the for loop in the first example unpacks the pairs as intended so that it produces a set with one instance of each city, while trying to write the loop as a comprehension pulls out the frozenset([city1, city2]) pairs and places them in the set instead.
I do not understand why pair would give the city strings in the first instance but passes the frozenset in the second instance.
Can someone explain the different behavior?
Note: As explained by Holt and donkopotamus, the behavior differs because the comprehension hands every key of the dictionary over in one go (either as the set's elements or as a single argument to union), which creates a set of frozensets, whereas the standard for loop takes the pairs one at a time and calls union on each individual pair, so the cities inside each pair are merged into cities on every pass.
They further explained that the *-operator unpacks the generated sets into separate arguments to union, which produces the desired behavior.
cities = cities.union(*(set(pair) for pair in flight_distances))
The expression:
cities = set()
cities = cities.union(pair for pair in flight_distances)
will take the union of the empty set {} with another set
{pair_0, pair_1, pair_2, ..., pair_n}
leaving you with a set of sets.
In contrast, the following will give you all of the cities flown to:
>>> set.union(*(set(pair) for pair in flight_distances))
{'Atlanta',
'Austin',
'Boston',
'Chicago',
'Dallas',
'Denver',
'Houston',
'Los Angeles',
'Miami',
'New York',
'San Francisco',
'Seattle'}
Here we transform each of the frozen set keys into a plain set and find the union.
In the first version, pair is a frozenset on each iteration, so you can take a union with it, while in your version you try to take a union with a set of frozensets.
The first case comes down to (union with a frozenset at each iteration):
cities = set()
cities = cities.union(frozenset(['Atlanta', 'Chicago']))
cities = cities.union(frozenset(['Atlanta', 'Dallas']))
...
So you have (mathematically):
cities = {} # Empty set
cities = {} U {'Atlanta', 'Chicago'} = {'Atlanta', 'Chicago'}
cities = {'Atlanta', 'Chicago'} U {'Atlanta', 'Dallas'} = {'Atlanta', 'Chicago', 'Dallas'}
...
In your (last) case, you are doing the following (one union with a sequence of frozenset):
cities = set()
cities.union([frozenset(['Atlanta', 'Chicago']), frozenset(['Atlanta', 'Dallas']), ...])
So you have:
cities = {}
cities = {} U {{'Atlanta', 'Chicago'}, {'Atlanta', 'Dallas'}, ...}
= {{'Atlanta', 'Chicago'}, {'Atlanta', 'Dallas'}, ...} # Nothing disappears
Since no two pairs are identical, you get a set of all the pairs in your initial dictionary, because you are passing a set of sets (pairs) of cities to .union(), not a set of cities.
From a more abstract point of view, you are trying to obtain:
S = {} U S1 U S2 U S3 U ... U Sn = (((({} U S1) U S2) U S3) U ...) U Sn
But what you are actually computing is:
S = {} U {S1, S2, S3, ..., Sn}
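To make the difference concrete, here is a small sketch on a three-entry excerpt of the dictionary. It avoids the print statement so it should run on both Python 2.7 and Python 3; the variable names are only illustrative.

```python
# Excerpt of the dictionary from the question
flight_distances = {
    frozenset(['Atlanta', 'Chicago']): 590.0,
    frozenset(['Atlanta', 'Dallas']): 720.0,
    frozenset(['Austin', 'Dallas']): 180.0,
}

# One iterable argument: its elements (the frozenset keys) become the members.
set_of_pairs = set().union(flight_distances)

# Unpacked: every key is its own argument, so the cities inside are merged.
flattened = set().union(*flight_distances)

# The same result as a comprehension, with a second loop over each pair.
nested = {city for pair in flight_distances for city in pair}

print(sorted(flattened))
```

The nested comprehension is the direct equivalent of the original for loop: the outer loop walks the pairs and the inner loop pulls the cities out of each pair.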
