pandas finding inverse of merge - python

I have two pandas dataframes: one is a list of states, cities, and a capital flag with a MultiIndex of (state, city); the other is a non-indexed (or default-indexed, if that's more appropriate) list of states and their capitals. I need to perform an inner join on the two and then also find out which items in the cities df are NOT in the join.
Cities:
capital
state city
Ohio Akron N
Toledo N
Columbus N
Colorado Boulder N
Denver N
States:
state city
0 West Virginia Charleston
1 Ohio Columbus
Inner join to find the capital of Ohio:
pd.merge(cities, states, on=['state', 'city'], how='inner')
state city capital
0 Ohio Columbus N
Now I need to get a df that includes everything in the cities df EXCEPT Columbus, Ohio. I've been looking at variations of .isin(), both with and without reset_index(), but I can't get it to work.
Code to create the cities and states dfs. I have set_index() as a separate call because if I try to do it when I create the df I get ValueError: Shape of passed values is (3, 3), indices imply (2, 3), and I haven't figured out a way around it.
cities = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Colorado', 'Colorado'],
                       'city': ['Akron', 'Toledo', 'Columbus', 'Boulder', 'Denver'],
                       'capital': ['N', 'N', 'N', 'N', 'N']},
                      columns=['state', 'city', 'capital'])
cities = cities.set_index(['state', 'city'])  # pass a list of columns and assign the result back
states = pd.DataFrame({'state': ['West Virginia', 'Ohio'], 'city': ['Charleston', 'Columbus']})

IIUC, you could use merge with how='outer' and indicator='source', and then keep only the rows that are 'left_only':
merge = cities.merge(states, on=['state', 'city'], how='outer', indicator='source')
result = merge[merge.source.eq('left_only')].drop('source', axis=1)
print(result)
Output
state city capital
0 Ohio Akron N
1 Ohio Toledo N
3 Colorado Boulder N
4 Colorado Denver N
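Since only the 'left_only' rows are kept, a left merge works just as well and skips the rows that exist only in states (a minor variant of the same idea):
merge = cities.merge(states, on=['state', 'city'], how='left', indicator='source')
result = merge[merge.source.eq('left_only')].drop('source', axis=1)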
As an alternative you could use isin, in the following way:
mask = ~cities.reset_index().city.isin(states.city)
print(cities[pd.Series(data=mask.values, index=cities.index)])
Output
capital
state city
Ohio Akron N
Toledo N
Colorado Boulder N
Denver N
The idea of the second approach is to create a boolean mask with an index matching the one in cities. A variation on the second approach is the following:
# drop the index
re_indexed = cities.reset_index()
# find the mask
mask = ~re_indexed.city.isin(states.city)
# reindex back
result = re_indexed[mask].set_index(['state', 'city'])
print(result)
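A third option is to anti-join on the (state, city) MultiIndex directly, without resetting it. A minimal sketch, assuming pandas >= 0.24 for pd.MultiIndex.from_frame:
# build the (state, city) pairs to exclude, then keep everything else
drop_idx = pd.MultiIndex.from_frame(states[['state', 'city']])
result = cities[~cities.index.isin(drop_idx)]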

Related

Create a dataframe with columns and their unique values in pandas

I have tried looking for a way to create a dataframe of columns and their unique values. I know this has fewer use cases, but it would be a great way to get an initial idea of the unique values. It would look something like this...
State     County   City
Colorado  Denver   Denver
Colorado  El Paso  Colorado Springs
Colorado  Larimar  Fort Collins
Colorado  Larimar  Loveland
Turns into this...
State     County   City
Colorado  Denver   Denver
          El Paso  Colorado Springs
          Larimar  Fort Collins
                   Loveland
I would use mask with a lambda:
df.mask(cond=df.apply(lambda x: x.duplicated(keep='first')), other='')
State County City
0 Colorado Denver Denver
1 El Paso Colorado Springs
2 Larimar Fort Collins
3 Loveland
Reproducible example (please include one in your future questions to help others answer):
import pandas as pd

df = pd.DataFrame({
    'State': ['Colorado', 'Colorado', 'Colorado', 'Colorado'],
    'County': ['Denver', 'El Paso', 'Larimar', 'Larimar'],
    'City': ['Denver', 'Colorado Springs', 'Fort Collins', 'Loveland']
})
df
State County City
0 Colorado Denver Denver
1 Colorado El Paso Colorado Springs
2 Colorado Larimar Fort Collins
3 Colorado Larimar Loveland
Drop duplicates from each column separately and then concatenate. Fill NaN with empty string.
pd.concat([df[col].drop_duplicates() for col in df], axis=1).fillna('')
State County City
0 Colorado Denver Denver
1 El Paso Colorado Springs
2 Larimar Fort Collins
3 Loveland
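The concat trick works because drop_duplicates keeps the original index labels, so the shortened columns re-align on those labels and the gaps become NaN. A quick way to see it:
print(df['County'].drop_duplicates())
# 0     Denver
# 1    El Paso
# 2    Larimar
# Name: County, dtype: object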
This is the best solution I have come up with; hope it helps others looking for something like it!
def create_unique_df(df) -> pd.DataFrame:
    """Take a dataframe and create a new one containing the unique values of each column.

    Note: it only works for two or more columns.

    :param df: dataframe you want to see unique values for
    :type df: pandas.DataFrame
    :return: dataframe of columns with unique values
    """
    # using list() allows us to combine lists down the line
    data_series = df.apply(lambda x: list(x.unique()))
    list_df = data_series.to_frame()
    # To create a df from lists they all need to be the same length, so we can
    # append null values to the shorter lists. First, find the difference in
    # length between the longest list and the rest.
    list_df['needed_nulls'] = list_df[0].str.len().max() - list_df[0].str.len()
    # Second, create a column holding a one-element [None] list per row.
    list_df['null_list_placeholder'] = [[None] for _ in range(list_df.shape[0])]
    # Third, multiply the [None] list by the difference to get the padding to
    # add to each list of unique values, making all the lists the same length.
    # Example: [None] * 3 == [None, None, None]
    list_df['null_list_needed'] = list_df.null_list_placeholder * list_df.needed_nulls
    list_df['full_list'] = list_df[0] + list_df.null_list_needed
    unique_df = pd.DataFrame(list_df['full_list'].to_dict())
    return unique_df
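For the example df above, a quick check of the helper (the fillna('') is my addition, to match the blank cells in the desired output; without it the padding shows up as None/NaN):
print(create_unique_df(df).fillna(''))
#       State   County              City
# 0  Colorado   Denver            Denver
# 1            El Paso  Colorado Springs
# 2            Larimar      Fort Collins
# 3                             Loveland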

Search and assign multiple strings with numpy.where or numpy.select in Python

I am trying to do a conditional string assignment: if a cell contains known locations, assign the matching geo names to the cell next to it. I tried np.where and np.select, but they assign a single value per cell rather than one value per location. Any suggestion how I can do this with NumPy, or is there an easier way?
import numpy as np
import pandas as pd

Europe = ['London', 'Paris', 'Berlin']
North_America = ['New York', 'Toroto', 'Boston']
Asia = ['Hong Kong', 'Tokyo', 'Singapore']
data = {'location': ["London, Paris", "Hong Kong", "London, New York", "Singapore, Toroto", "Boston"]}
df = pd.DataFrame(data)
location
0 London, Paris
1 Hong Kong
2 London, New York
3 Singapore, Toroto
4 Boston
# np.where approach
df['geo'] = np.where(df['location'].isin(Europe) | df['location'].isin(North_America),
                     'Europe', 'North America')

# np.select approach
conditions = [
    df['location'].isin(Europe),
    df['location'].isin(North_America)
]
choices = ['Europe', 'North America']
df['geo'] = np.select(conditions, choices, default=0)
Expected output:
location geo
0 London, Paris Europe, Europe
1 Hong Kong Asia
2 London, New York Europe, North America
3 Singapore, Toroto Asia, North America
4 Boston North America
Create a mapping of each country -> area, then use explode and map to apply the mapping, and finally use groupby and apply to rebuild the list:
geo = {'Europe': Europe, 'North_America': North_America, 'Asia': Asia}
mapping = {country: area for area, countries in geo.items() for country in countries}
df['geo'] = df['location'].str.split(', ').explode().map(mapping) \
.groupby(level=0).apply(', '.join)
Output:
>>> df
location geo
0 London, Paris Europe, Europe
1 Hong Kong Asia
2 London, New York Europe, North_America
3 Singapore, Toroto Asia, North_America
4 Boston North_America
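One caveat: any city missing from mapping becomes NaN after map, and ', '.join would then raise a TypeError inside apply. A defensive variant of the same chain (the 'Unknown' label is my own placeholder, not from the question):
df['geo'] = (df['location'].str.split(', ')
             .explode()
             .map(mapping)
             .fillna('Unknown')  # placeholder for cities absent from the mapping
             .groupby(level=0)
             .apply(', '.join))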
Using the NumPy library together with plain Python for loops we can get the result. First we combine the per-continent city lists, then create another list named continents whose length is the same as the combined list of cities:
import numpy as np
import pandas as pd
continents = ["Europe"] * len(Europe) + ["North_America"] * len(North_America) + ["Asia"] * len(Asia)
countries = Europe + North_America + Asia
locations = data['location']
Then for each city, including each one inside the comma-separated combinations, we find its index in the combined countries list. We also record, per row, the number of commas in the combination, to be used later when rebuilding the comma-separated output:
corsp = []
comma_nums = []
for i in locations:
    for j, k in enumerate(i.split(', ')):
        corsp.append(np.where(np.array(countries) == k)[0][0])
    comma_nums.append(j)
The continents list is then reordered using the collected indices, the reordered elements are grouped back into sublists mirroring the combinations found in locations, and finally the sublists are joined into the strings needed for the output:
reordered_continents = [continents[i] for i in corsp]
mod_continents = []
iter = 0
f = 1
for i in comma_nums:
    mod_continents.append(reordered_continents[iter:i + f])
    iter = i + f
    f = iter + 1

for i, j in enumerate(mod_continents):
    if len(j) > 1:
        mod_continents[i] = ', '.join(j)
    else:
        mod_continents[i] = ''.join(j)

df['geo'] = mod_continents

Pandas: How to find whether address in one dataframe is from city and state in another dataframe?

I have a dataframe of addresses as below:
main_df =
address
0 3, my_street, Mumbai, Maharashtra
1 Bangalore Karnataka 45th Avenue
2 TelanganaHyderabad some_street, some apartment
And I have a dataframe with city and state as below (note: a few states have cities with the same names too):
city_state_df =
city state
0 Mumbai Maharashtra
1 Ahmednagar Maharashtra
2 Ahmednagar Bihar
3 Bangalore Karnataka
4 Hyderabad Telangana
I want to have a mapping of city and state next to each address. I am able to do so with iterrows() and nested for loops; however, it takes more than an hour for a mere 15k records. What is the optimal way of achieving this, considering that addresses are written in no particular format and multiple states have cities with the same name?
My code below:
main_df = pd.DataFrame({'address': ['3, my_street, Mumbai, Maharashtra',
                                    'Bangalore Karnataka 45th Avenue',
                                    'TelanganaHyderabad some_street, some apartment']})
city_state_df = pd.DataFrame({'city': ['Mumbai', 'Ahmednagar', 'Ahmednagar', 'Bangalore', 'Hyderabad'],
                              'state': ['Maharashtra', 'Maharashtra', 'Bihar', 'Karnataka', 'Telangana']})
main_df['city'] = np.nan
main_df['state'] = np.nan
for i, df_row in main_df.iterrows():
    for j, city_row in city_state_df.iterrows():
        if city_row['city'] in df_row['address']:
            city_filtered = city_state_df[city_state_df['city'] == city_row['city']]
            for k, fil_row in city_filtered.iterrows():
                if fil_row['state'] in df_row['address']:
                    # write back via .loc; assigning to df_row only mutates the iterrows copy
                    main_df.loc[i, 'city'] = fil_row['city']
                    main_df.loc[i, 'state'] = fil_row['state']
                    break
            break
Hello, maybe something like this:
main_df = main_df.reindex(columns=[*main_df.columns.tolist(), 'state', 'city'], fill_value=None)
for i, row in city_state_df.iterrows():
    main_df.loc[(main_df.address.str.contains(row.city)) &
                (main_df.address.str.contains(row.state)),
                ['city', 'state']] = [row.city, row.state]
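One hardening note: str.contains treats its argument as a regular expression by default, so a city or state name containing regex metacharacters could misfire. A sketch of the same loop with literal matching:
for i, row in city_state_df.iterrows():
    hit = (main_df.address.str.contains(row.city, regex=False)
           & main_df.address.str.contains(row.state, regex=False))
    main_df.loc[hit, ['city', 'state']] = [row.city, row.state]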

Removing duplicate elements within a pandas cell and counting the number of elements

I have a dataframe like this:
Destinations
Paris,Oslo, Paris,Milan, Athens,Amsterdam
Boston,New York, Boston,London, Paris,New York
Nice,Paris, Milan,Paris, Nice,Milan
I want to get the following dataframe (without space between the cities):
Destinations_2 no_destinations
Paris,Oslo,Milan,Athens,Amsterdam 5
Boston,New York,London,Paris 4
Nice,Paris,Milan 3
How to remove duplicates within a cell?
You can use a list comprehension, which is faster than using apply() (replace Col with the original column name):
df['no_destinations'] = [len(set(a.strip() for a in i.split(','))) for i in df['Col']]
print(df)
Col no_destinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam 5
1 Boston,New York, Boston,London, Paris,New York 4
2 Nice,Paris, Milan,Paris, Nice,Milan 3
df['no_destinations'] = df.Destinations.str.split(',').apply(set).apply(len)
If there are spaces in between, use:
df.Destinations.str.split(',').apply(lambda x: list(map(str.strip, x))).apply(set).apply(len)
Output
Destinations no_destinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam 5
1 Boston,New York, Boston,London, Paris,New York 4
2 Nice,Paris, Milan,Paris, Nice,Milan 3
# your data:
import pandas as pd

data = {'Destinations': ['Paris,Oslo, Paris,Milan, Athens,Amsterdam',
                         'Boston,New York, Boston,London, Paris,New York',
                         'Nice,Paris, Milan,Paris, Nice,Milan']}
df = pd.DataFrame(data)
>>>
Destinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam
1 Boston,New York, Boston,London, Paris,New York
2 Nice,Paris, Milan,Paris, Nice,Milan
First: make every row of your column a list.
df.Destinations = df.Destinations.apply(lambda x: x.replace(', ', ',').split(','))
>>>
Destinations
0 [Paris, Oslo, Paris, Milan, Athens, Amsterdam]
1 [Boston, New York, Boston, London, Paris, New York]
2 [Nice, Paris, Milan, Paris, Nice, Milan]
Second: remove the dups from the lists.
df.Destinations = df.Destinations.apply(lambda x: list(dict.fromkeys(x)))
# or: df.Destinations = df.Destinations.apply(lambda x: list(set(x)))
>>>
Destinations
0 [Paris, Oslo, Milan, Athens, Amsterdam]
1 [Boston, New York, London, Paris]
2 [Nice, Paris, Milan]
Finally, create the desired columns:
df['no_destinations'] = df.Destinations.apply(lambda x: len(x))
df['Destinations_2'] = df.Destinations.apply(lambda x: ','.join(x))
All steps use apply with lambda functions; you can chain or nest them together if you want, as sketched below.
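For instance, the whole pipeline chained in one pass over the raw strings (a sketch; dict.fromkeys guarantees first-seen order on Python 3.7+):
cleaned = (df['Destinations']
           .apply(lambda x: x.replace(', ', ',').split(','))
           .apply(lambda x: list(dict.fromkeys(x))))
df['Destinations_2'] = cleaned.apply(','.join)
df['no_destinations'] = cleaned.apply(len)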
All the previous answers have addressed only one part of your problem, i.e. showing the unique count (no_destinations). Let me try to answer both of your queries.
The idea below is to apply a method to the Destinations column which returns two series named Destinations_2 and no_destinations, containing the unique elements separated by commas with no spaces, and the count of unique elements, respectively.
import pandas as pd

data = {'Destinations': ['Paris,Oslo, Paris,Milan, Athens,Amsterdam',
                         'Boston,New York, Boston,London, Paris,New York',
                         'Nice,Paris, Milan,Paris, Nice,Milan']}

def remove_dups(x):
    data = set(x.replace(' ', '').split(','))
    return pd.Series([','.join(data), len(data)],
                     index=['Destinations_2', 'no_destinations'])

df = pd.DataFrame.from_dict(data)
df[['Destinations_2', 'no_destinations']] = df['Destinations'].apply(remove_dups)
print(df.head())
Note: As you are not concerned with the order, I have used set above. If you need to maintain the order, you will have to replace set with some other logic to remove duplicates; see the sketch below.
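For example, an order-preserving variant of remove_dups (my sketch, relying on dict.fromkeys keeping insertion order on Python 3.7+):
def remove_dups_ordered(x):
    # dict.fromkeys de-duplicates while keeping first-seen order
    data = list(dict.fromkeys(x.replace(' ', '').split(',')))
    return pd.Series([','.join(data), len(data)],
                     index=['Destinations_2', 'no_destinations'])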

How do I use a mapping variable to re-index a dataframe?

I have the following data frame:
population GDP
country
United Kingdom 4.5m 10m
Spain 3m 8m
France 2m 6m
I also have the following information in a two-column dataframe (happy for this to be made into another data structure if that would be more beneficial, as the plan is for it to be stored in a VARS file).
county code
Spain es
France fr
United Kingdom uk
The 'mapping' datastruct will be stored in a random order, as countries will be added/removed at random times.
What is the best way to re-index the data frame to its country code from its country name?
Is there a smart solution that would also work on other columns? For example, if a data frame were indexed on date but one column was df['country'], could you change df['country'] to its country code? Finally, is there a third option that would add an additional column of either country or code, selecting the right code based on the country name in another column?
I think you can use Series.map, but it works only with a Series, so you need Index.to_series first. Last, rename_axis (new in pandas 0.18.0):
df1.index = df1.index.to_series().map(df2.set_index('county').code)
df1 = df1.rename_axis('county')
# pandas below 0.18.0:
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
It is same as mapping by dict:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.to_series().map(d)
df1 = df1.rename_axis('county')
# pandas below 0.18.0:
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
EDIT:
Another solution with Index.map, so to_series is omitted:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.map(d.get)
df1 = df1.rename_axis('county')
# pandas below 0.18.0:
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
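On current pandas, Index.map also accepts a dict directly, so neither to_series nor d.get is needed (a small sketch; with a plain dict, unmatched labels become NaN):
df1.index = df1.index.map(d)
df1 = df1.rename_axis('county')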
Here are some brief ways to approach your 3 questions. More details below:
1) How to change index based on mapping in separate df
Use df_with_mapping.to_dict("split") to create a dictionary, then use a dict comprehension to change it into {"old1":"new1",...,"oldn":"newn"} form, then use df.index = df.base_column.map(dictionary) to get the changed index.
2) How to change index if the new column is in the same df:
df.index = df["column_you_want"]
3) Creating a new column by mapping on an old column:
df["new_column"] = df["old_column"].map({"old1":"new1",...,"oldn":"newn"})
1) Mapping for the current index exists in a separate dataframe, but you don't have the mapped column in the dataframe yet
This is essentially the same as question 2 with the additional step of creating a dictionary for the mapping you want.
# creating the mapping dictionary in the form current index : future index
df2 = pd.DataFrame([["es"], ["fr"]], index=["spain", "france"])
interm_dict = df2.to_dict("split")  # creates a dictionary split into column labels, index labels and data
# we only want the first column of the data plus the index, so we build a new
# dict with a dict comprehension and zip
mapping_dict = {country: data[0] for country, data in zip(interm_dict["index"], interm_dict["data"])}
df["country"] = df.index  # create a new column if you want to save the index
df.index = pd.Series(df.index).map(mapping_dict)  # change the index
df.index.name = ""  # blanks out the index name
df = df.drop("county code", axis=1)  # drops the county code column to avoid duplicate columns
Before:
county code language
spain es spanish
france fr french
After:
language country
es spanish spain
fr french france
2) Changing the current index to one of the columns already in the dataframe
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "french"])
df["country"] = df.index #if you want to save the original index
df.index = df["county code"] #The only step you actually need
df.index.name = "" #if you want a blank index name
df = df.drop("county code",1) #if you dont want the duplicate column
Before:
county code language
spain es spanish
french fr french
After:
language country
es spanish spain
fr french french
3) Creating an additional column based on another column
This is again essentially the same as step 2, except we create an additional column instead of assigning the created series to .index.
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "france"])
df["city"] = df["county code"].map({"es":"barcelona","fr":"paris"})
Before:
county code language
spain es spanish
france fr french
After:
county code language city
spain es spanish barcelona
france fr french paris
