Let's say I have the following pandas dataframe, called example:
      city state  school_lvl   schl_name   elem_name middle_name highschoo_name
0  Orlando    fl           1  Union Park  Union Park         NaN            NaN
1  Orlando    fl           2      Legacy         NaN      Legacy            NaN
2  Orlando    fl           3    Colonial         NaN         NaN       Colonial
where columns like elem_name were generated using if conditions on school_lvl and schl_name
What I would like instead is:
      city state   elem_name middle_name highschoo_name
0  Orlando    fl  Union Park      Legacy       Colonial
How would I go about doing this? It's not really a groupby since there is no aggregate function, right? I'd greatly appreciate any help.
Use groupby with a lambda function for forward and back filling, then drop_duplicates by the first 2 and last 3 columns:
c = example.columns[:2].tolist() + example.columns[-3:].tolist()
print (c)
['city', 'state', 'elem_name', 'middle_name', 'highschoo_name']
df = example.groupby(['city', 'state']).apply(lambda x: x.ffill().bfill()).drop_duplicates(c)
print (df)
      city state  school_lvl   schl_name   elem_name middle_name highschoo_name
0  Orlando    fl           1  Union Park  Union Park      Legacy       Colonial
If you want to remove those columns, it is simpler to drop them first and then remove duplicates by all columns:
example = example.drop(['school_lvl','schl_name'], axis=1)
df = example.groupby(['city', 'state']).apply(lambda x: x.ffill().bfill()).drop_duplicates()
print (df)
      city state   elem_name middle_name highschoo_name
0  Orlando    fl  Union Park      Legacy       Colonial
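Alternatively, since GroupBy.first returns the first non-null value per column, this can be framed as an aggregation after all. A minimal sketch, assuming the blank cells in example are NaN:

out = (example.drop(columns=['school_lvl', 'schl_name'])
              .groupby(['city', 'state'], as_index=False)
              .first())
print (out)
      city state   elem_name middle_name highschoo_name
0  Orlando    fl  Union Park      Legacy       Colonial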
I have a dataframe (df1) that looks like this:
  first_name last_name             affiliation
0       jean     dulac     University of Texas
1      peter      huta  University of Maryland
I want to match this dataframe to another one that contains several potential matches. Each potential match has a first and last name, plus a list of all the affiliations this person was associated with. I want to use the information in this affiliation column to differentiate between my potential matches and keep only the most likely one.
The second dataframe has the following form:
  first_name last_name                                                                affiliations_all
0       jean     dulac  [{'city_name': 'Kyoto', 'country_name': 'Japan', 'name': 'Kyoto University'}]
1       jean     dulac  [{'city_name': 'Texas', 'country_name': 'USA', 'name': 'University of Texas'}]
The column affiliations_all is apparently saved as a pandas.core.series.Series (and I can't change that since it comes from an API query).
I am thinking that one way to match the 2 dataframes would be to remove words like "university" and "of" from the affiliation column of the first dataframe (that's easy), do the same for the affiliations_all column of the second dataframe (I don't know how to do that), and then run some version of
test.apply(lambda x: str(x.affiliation) in str(x.affiliations_all), axis=1)
adapted to the fact that affiliations_all is a series.Series.
Any idea how to do that?
Thanks!
One possible solution would be to transform df2 (expand the columns) and then merge df1 with df2:
# transform df2
df2 = df2.explode("affiliations_all")
df2 = pd.concat([df2, df2.pop("affiliations_all").apply(pd.Series)], axis=1)
df2 = df2.rename(columns={"name": "affiliation"})
print(df2)
This prints:
  first_name last_name city_name country_name          affiliation
0       jean     dulac     Kyoto        Japan     Kyoto University
1       jean     dulac     Texas          USA  University of Texas
And the second step is to merge df1 with the transformed df2:
df_out = pd.merge(df1, df2, on=["first_name", "last_name", "affiliation"])
print(df_out)
Prints:
  first_name last_name          affiliation city_name country_name
0       jean     dulac  University of Texas     Texas          USA
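As a side note, pd.json_normalize can replace the apply(pd.Series) step. A sketch under the same assumptions (df2 as shown above, with affiliations_all holding lists of dicts):

# explode the list column, then flatten the dicts in one call;
# reset_index so the flattened rows align positionally in concat
df2 = df2.explode("affiliations_all").reset_index(drop=True)
flat = pd.json_normalize(df2.pop("affiliations_all").tolist())
df2 = pd.concat([df2, flat], axis=1).rename(columns={"name": "affiliation"})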
I have a dataframe df structured as follows:
Name Surname Nationality
Joe Tippy Italian
Adam Wesker American
I would like to create a new record based on a dictionary whose keys corresponds to the column names:
new_record = {'Name': 'Jimmy', 'Surname': 'Turner', 'Nationality': 'Australian'}
How can I do that? I tried with a simple:
df = df.append(new_record, ignore_index=True)
but if I have a missing value in my record, the dataframe doesn't get filled with a space; instead, it leaves the last column empty.
If I understand correctly, replace the missing values in the next step (note that DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
new_record = {'Surname': 'Turner', 'Nationality': 'Australian'}
df = pd.concat([df, pd.DataFrame([new_record])], ignore_index=True).fillna('')
print (df)
   Name Surname Nationality
0   Joe   Tippy     Italian
1  Adam  Wesker    American
2        Turner  Australian
Or use DataFrame.reindex:
df = pd.concat([df, pd.DataFrame([new_record]).reindex(df.columns, fill_value='', axis=1)], ignore_index=True)
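For reference, the intermediate reindex pads the missing Name column with an empty string before the concat:

print (pd.DataFrame([new_record]).reindex(df.columns, fill_value='', axis=1))
  Name Surname Nationality
0        Turner  Australian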
A simple way if you have a range index:
df.loc[len(df)] = new_record
Updated dataframe:
    Name Surname Nationality
0    Joe   Tippy     Italian
1   Adam  Wesker    American
2  Jimmy  Turner  Australian
If you have a missing key (for example 'Surname'):
    Name Surname Nationality
0    Joe   Tippy     Italian
1   Adam  Wesker    American
2  Jimmy     NaN  Australian
If you want empty strings:
df.loc[len(df)] = pd.Series(new_record).reindex(df.columns, fill_value='')
Output:
    Name Surname Nationality
0    Joe   Tippy     Italian
1   Adam  Wesker    American
2  Jimmy          Australian
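The same trick scales to several records at once. A small sketch, where records is a hypothetical list of such dicts:

records = [
    {'Name': 'Jimmy', 'Surname': 'Turner', 'Nationality': 'Australian'},
    {'Surname': 'Turner', 'Nationality': 'Australian'},
]
# build one frame, add any missing columns via reindex, then blank out the NaNs
new_rows = pd.DataFrame(records).reindex(columns=df.columns).fillna('')
df = pd.concat([df, new_rows], ignore_index=True)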
I have a dataframe of addresses as below:
main_df =
address
0 3, my_street, Mumbai, Maharashtra
1 Bangalore Karnataka 45th Avenue
2 TelanganaHyderabad some_street, some apartment
And I have a dataframe with city and state as below (note that a few states have cities with the same names too):
city_state_df =
city state
0 Mumbai Maharashtra
1 Ahmednagar Maharashtra
2 Ahmednagar Bihar
3 Bangalore Karnataka
4 Hyderabad Telangana
I want to have a mapping of city and state next to each address. I am able to do so with iterrows() and nested for loops. However, this takes more than an hour for a mere 15k records. What is the optimal way of achieving this, considering addresses are randomly written and multiple states have the same city name?
My code below:
import numpy as np
import pandas as pd

main_df = pd.DataFrame({'address': ['3, my_street, Mumbai, Maharashtra',
                                    'Bangalore Karnataka 45th Avenue',
                                    'TelanganaHyderabad some_street, some apartment']})
city_state_df = pd.DataFrame({'city': ['Mumbai', 'Ahmednagar', 'Ahmednagar', 'Bangalore', 'Hyderabad'],
                              'state': ['Maharashtra', 'Maharashtra', 'Bihar', 'Karnataka', 'Telangana']})

main_df['city'] = np.nan
main_df['state'] = np.nan

for i, df_row in main_df.iterrows():
    for j, city_row in city_state_df.iterrows():
        if city_row['city'] in df_row['address']:
            city_filtered = city_state_df[city_state_df['city'] == city_row['city']]
            for k, fil_row in city_filtered.iterrows():
                if fil_row['state'] in df_row['address']:
                    # assign via .loc so the write actually reaches main_df;
                    # assigning to the iterrows() row copy is silently lost
                    main_df.loc[i, 'city'] = fil_row['city']
                    main_df.loc[i, 'state'] = fil_row['state']
                    break
            break
Hello, maybe something like this:
main_df = main_df.reindex(columns=[*main_df.columns.tolist(), 'state', 'city'], fill_value=None)

for i, row in city_state_df.iterrows():
    main_df.loc[(main_df.address.str.contains(row.city)) &
                (main_df.address.str.contains(row.state)),
                ['city', 'state']] = [row.city, row.state]
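If the iterrows() pass is still too slow, most of the work can be vectorized. A rough sketch, starting from a fresh main_df that has only the address column, and assuming one alternation regex over all known city names is acceptable:

import re

# one combined pattern lets str.extract scan every address in a single pass
pattern = '|'.join(map(re.escape, city_state_df['city'].unique()))
main_df['city'] = main_df['address'].str.extract(f'({pattern})', expand=False)

# merging on city can yield several candidate states (e.g. Ahmednagar);
# keep only candidates whose state name also appears in the address
# (rows with no city match have state NaN -> '' and always survive the filter)
out = main_df.merge(city_state_df, on='city', how='left')
keep = [st in addr for addr, st in zip(out['address'], out['state'].fillna(''))]
out = out[keep]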
I have 3 dataframes as below
df1
     id first_name surname state
      1
     88
    190
   2509
   ....
df2
id given_name surname state street_num
17 John Doe NY 5
88 Tom Murphy CA 423
190 Dave Casey KY 250
....
df3
id first_name family_name state car
1 John Woods NY ford
74 Tom Kite FL vw
2509 Mike Johnson KY toyota
Some id's from df1 are in df2 and others are in df3. There are also id's in df2 and df3 that are not in df1.
EDIT: there are also some id's in df1 that are not in either df2 or df3.
I want to fill the columns in df1 with the values in the dataframe containing the id. However, I do not want all columns (so I think merge is not suitable). I have tried to use the isin function, but that way I could not update records individually and got an error. This was my attempt using isin:
df1.loc[df1.index.isin(df2.index), 'first_name'] = df2.given_name
Is there an easy way to do this without iterating through the dataframes checking if index matches?
I think you first need to rename your columns to align the DataFrames in concat and then reindex to filter by df1.index and df1.columns:
df21 = df2.rename(columns={'given_name':'first_name'})
df31 = df3.rename(columns={'family_name':'surname'})
df = pd.concat([df21, df31]).reindex(index=df1.index, columns=df1.columns)
print (df)
      first_name  surname state
id
1           John    Woods    NY
88           Tom   Murphy    CA
190         Dave    Casey    KY
2509        Mike  Johnson    KY
EDIT: If you need the intersection of indices only:
df4 = pd.concat([df21, df31])
df = df4.reindex(index=df1.index.intersection(df4.index), columns=df1.columns)
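An alternative, assuming id is the index of all three frames, is DataFrame.update, which writes matching cells into df1 in place and ignores ids and columns that df1 does not have:

lookup = pd.concat([df21, df31])
# aligns on index and columns; extra ids (17, 74) and extra columns
# (street_num, car) in lookup are simply ignored
df1.update(lookup)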
I have two pandas dataframes as given below:
df1
Name City Postal_Code State
James Phoenix 85003 AZ
John Scottsdale 85259 AZ
Jeff Phoenix 85003 AZ
Jane Scottsdale 85259 AZ
df2
Postal_Code Income Category
85003 41038 Two
85259 104631 Four
I would like to insert two columns, Income and Category, to df1 by capturing the values for Income and Category from df2 corresponding to the postal_code for each row in df1.
The closest question that I could find on SO was this - Fill DataFrame row values based on another dataframe row's values pandas. But the pd.merge solution does not solve the problem for me. Specifically, I used
pd.merge(df1,df2,on='postal_code',how='outer')
All I got was NaN values in the two new columns. Not sure whether this is because the number of rows in df1 and df2 differ. Any suggestions to solve this problem?
You just have the wrong how; use 'inner' instead, which matches where keys exist in both dataframes. Casting Postal_Code to the same dtype on both sides first ensures the join keys can actually match:
df1.Postal_Code = df1.Postal_Code.astype(int)
df2.Postal_Code = df2.Postal_Code.astype(int)
df1.merge(df2,on='Postal_Code',how='inner')
    Name       City  Postal_Code State  Income Category
0  James    Phoenix        85003    AZ   41038      Two
1   Jeff    Phoenix        85003    AZ   41038      Two
2   John Scottsdale        85259    AZ  104631     Four
3   Jane Scottsdale        85259    AZ  104631     Four
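If you would rather keep every row of df1 even when a postal code has no match in df2 (leaving NaN in the new columns), a left join does that:

df1.merge(df2, on='Postal_Code', how='left')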