Fill pandas dataframe with dictionary elements - python

I have a dataframe df structured as follows:
Name Surname Nationality
Joe Tippy Italian
Adam Wesker American
I would like to create a new record based on a dictionary whose keys correspond to the column names:
new_record = {'Name': 'Jimmy', 'Surname': 'Turner', 'Nationality': 'Australian'}
How can I do that? I tried with a simple:
df = df.append(new_record, ignore_index=True)
but if I have a missing value in my record, the dataframe doesn't get filled with an empty string; instead it leaves the missing column as NaN.

IIUC, replace missing values in the next step:
new_record = {'Surname': 'Turner', 'Nationality': 'Australian'}
df = pd.concat([df, pd.DataFrame([new_record])], ignore_index=True).fillna('')
print (df)
Name Surname Nationality
0 Joe Tippy Italian
1 Adam Wesker American
2 Turner Australian
Or use DataFrame.reindex:
df = pd.concat([df, pd.DataFrame([new_record]).reindex(df.columns, fill_value='', axis=1)], ignore_index=True)

A simple way if you have a range index:
df.loc[len(df)] = new_record
Updated dataframe:
Name Surname Nationality
0 Joe Tippy Italian
1 Adam Wesker American
2 Jimmy Turner Australian
If you have a missing key (for example 'Surname'):
Name Surname Nationality
0 Joe Tippy Italian
1 Adam Wesker American
2 Jimmy NaN Australian
If you want empty strings:
df.loc[len(df)] = pd.Series(new_record).reindex(df.columns, fill_value='')
Output:
Name Surname Nationality
0 Joe Tippy Italian
1 Adam Wesker American
2 Jimmy Australian
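Note that df.loc[len(df)] only appends safely when the frame has a default RangeIndex; on an arbitrary index, len(df) can collide with an existing label and overwrite that row. A minimal hedged sketch for the general case (next_label is a hypothetical helper, not a pandas API):
def next_label(df):
    # assumes a numeric index; an empty frame starts at 0
    return df.index.max() + 1 if len(df) else 0

df.loc[next_label(df)] = pd.Series(new_record).reindex(df.columns, fill_value='')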

Related

Pandas: a Pythonic way to create a hyperlink from a value stored in another column of the dataframe

I have the following toy dataset df:
import pandas as pd
data = {
'id' : [1, 2, 3],
'name' : ['John Smith', 'Sally Jones', 'William Lee']
}
df = pd.DataFrame(data)
df
id name
0 1 John Smith
1 2 Sally Jones
2 3 William Lee
My ultimate goal is to add a column that represents a Google search of the value in the name column.
I do this using:
def create_hyperlink(search_string):
    return f'https://www.google.com/search?q={search_string}'
df['google_search'] = df['name'].apply(create_hyperlink)
df
id name google_search
0 1 John Smith https://www.google.com/search?q=John Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones
2 3 William Lee https://www.google.com/search?q=William Lee
Unfortunately, the newly created google_search column returns a malformed URL: it should have a "+" between the first name and the last name.
The google_search column should return the following:
https://www.google.com/search?q=John+Smith
It's possible to do this using split() and join().
df['foo'] = df['name'].str.split()
df['foo']
0 [John, Smith]
1 [Sally, Jones]
2 [William, Lee]
Name: foo, dtype: object
Now, joining them:
df['bar'] = ['+'.join(map(str, l)) for l in df['foo']]
df
id name google_search foo bar
0 1 John Smith https://www.google.com/search?q=John Smith [John, Smith] John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones [Sally, Jones] Sally+Jones
2 3 William Lee https://www.google.com/search?q=William Lee [William, Lee] William+Lee
Lastly, creating the updated google_search column:
df['google_search'] = df['bar'].apply(create_hyperlink)
df
Is there a more elegant, streamlined, Pythonic way to do this?
Thanks!
Rather than reinvent the wheel and modify your string manually, use a library that's guaranteed to give you the right result:
from urllib.parse import quote_plus
def create_hyperlink(search_string):
    return f"https://www.google.com/search?q={quote_plus(search_string)}"
Use Series.str.replace:
df['google_search'] = 'https://www.google.com/search?q=' + \
    df.name.str.replace(' ', '+')
print(df)
id name google_search
0 1 John Smith https://www.google.com/search?q=John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally+Jones
2 3 William Lee https://www.google.com/search?q=William+Lee

Pivoting without numerical aggregation / a numerical column

I have a dataframe that looks like this
d = {'Name': ['Sally', 'Sally', 'Sally', 'James', 'James', 'James'], 'Sports': ['Tennis', 'Track & field', 'Dance', 'Dance', 'MMA', 'Crosscountry']}
df = pd.DataFrame(data=d)
    Name         Sports
0  Sally         Tennis
1  Sally  Track & field
2  Sally          Dance
3  James          Dance
4  James            MMA
5  James   Crosscountry
It seems that pandas' pivot_table only allows reshaping with numerical aggregation, but I want to reshape it to wide format such that the strings are in the "values":
Name   First_sport  Second_sport   Third_sport
Sally  Tennis       Track & field  Dance
James  Dance        MMA            Crosscountry
Is there a method in pandas that can help me do this? Thanks!
You can do that either with .pivot(), if your index/column pairs are unique, or with .pivot_table() by providing an aggregation function that works on strings too, e.g. 'first'.
>>> df['Sport_num'] = 'Sport ' + df.groupby('Name').cumcount().astype(str)
>>> df
Name Sports Sport_num
0 Sally Tennis Sport 0
1 Sally Track & field Sport 1
2 Sally Dance Sport 2
3 James Dance Sport 0
4 James MMA Sport 1
5 James Crosscountry Sport 2
>>> df.pivot(index='Name', values='Sports', columns='Sport_num')
Sport_num Sport 0 Sport 1 Sport 2
Name
James Dance MMA Crosscountry
Sally Tennis Track & field Dance
>>> df.pivot_table(index='Name', values='Sports', columns='Sport_num', aggfunc='first')
Sport_num Sport 0 Sport 1 Sport 2
Name
James Dance MMA Crosscountry
Sally Tennis Track & field Dance
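To get the exact First_sport / Second_sport / Third_sport headers from the question, one extra hedged step (assumes at most three sports per name):
out = df.pivot(index='Name', values='Sports', columns='Sport_num')
out.columns = ['First_sport', 'Second_sport', 'Third_sport']
out = out.reset_index()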
Another solution:
print(
    df.groupby("Name")
    .agg(list)["Sports"]
    .apply(pd.Series)
    .rename(columns={0: "First", 1: "Second", 2: "Third"})
    .add_suffix("_sport")
    .reset_index()
)
Prints:
Name First_sport Second_sport Third_sport
0 James Dance MMA Crosscountry
1 Sally Tennis Track & field Dance
We can also use groupby cumcount in conjunction with set_index + unstack:
new_df = df.set_index(['Name', df.groupby('Name').cumcount()]).unstack()
new_df:
Sports
0 1 2
Name
James Dance MMA Crosscountry
Sally Tennis Track & field Dance
We can do some additional cleanup by renaming and collapsing the MultiIndex:
new_df = (
    df.set_index(['Name', df.groupby('Name').cumcount()])
    .unstack()
    .rename(columns={0: "First", 1: "Second", 2: "Third", 'Sports': 'Sport'})
)
new_df.columns = new_df.columns.swaplevel().map('_'.join)
new_df = new_df.reset_index()
new_df:
Name First_Sport Second_Sport Third_Sport
0 James Dance MMA Crosscountry
1 Sally Tennis Track & field Dance
If wanting a programmatic conversion from ints to ordinal words we can use something like inflect:
import inflect
new_df = df.set_index([
    'Name', df.groupby('Name').cumcount().add(1)
]).unstack()
# Collapse MultiIndex
p = inflect.engine()
new_df.columns = new_df.columns.map(
    # convert the count to an ordinal word and the column name to a singular noun
    lambda c: f'{p.number_to_words(p.ordinal(c[1])).capitalize()}_'
              f'{p.singular_noun(c[0])}'
)
new_df = new_df.reset_index()
new_df:
Name First_Sport Second_Sport Third_Sport
0 James Dance MMA Crosscountry
1 Sally Tennis Track & field Dance
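For reference, the individual inflect calls used in the lambda behave like this (a quick sketch; values shown as comments):
p = inflect.engine()
p.ordinal(1)                     # '1st'
p.number_to_words(p.ordinal(1))  # 'first'
p.singular_noun('Sports')        # 'Sport'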

How to choose random item from a dictionary to df and exclude one item?

I have a dictionary and a dataframe, for example:
import numpy as np
import pandas as pd

data={'Name': ['Tom', 'Joseph', 'Krish', 'John']}
df=pd.DataFrame(data)
print(df)
city={"New York": "123",
"LA":"456",
"Miami":"789"}
Output:
Name
0 Tom
1 Joseph
2 Krish
3 John
I've created a column called CITY by using the following:
df["CITY"]=np.random.choice(list(city), len(df))
df
Name CITY
0 Tom New York
1 Joseph LA
2 Krish Miami
3 John New York
Now, I would like to generate a new column - CITY2 with a random item from city dictionary, but I would like CITY will be a different item than CITY2, so basically when I'm generating CITY2 I need to exclude CITY item.
It's worth mentioning that my real df is quite large so I need it to be effective as possible.
Thanks in advance.
Continuing with the approach you have used: pd.Series() is used as a convenience to remove the value that has already been used, wrapped in apply() so it runs on each row's value.
data={'Name': ['Tom', 'Joseph', 'Krish', 'John']}
df=pd.DataFrame(data)
city={"New York": "123",
"LA":"456",
"Miami":"789"}
df["CITY"]=np.random.choice(list(city), len(df))
df["CITY2"] = df["CITY"].apply(lambda x: np.random.choice(pd.Series(city).drop(x).index))
     Name      CITY     CITY2
0     Tom     Miami  New York
1  Joseph        LA     Miami
2   Krish  New York     Miami
3    John  New York        LA
You could also first group by "CITY", remove the current city per group from the city dict and then create the new random list of cities.
Maybe this is faster, because you only have to drop one city per group instead of one per row. Note that Series.append was removed in pandas 2.0, so the per-group results are collected in a list and concatenated:
parts = []
for key, group in df.groupby('CITY'):
    cities_subset = np.delete(np.array(list(city)), list(city).index(key))
    parts.append(pd.Series(np.random.choice(cities_subset, len(group)), index=group.index))
df["CITY2"] = pd.concat(parts)
This gives for example:
Name CITY CITY2
0 Tom New York LA
1 Joseph New York Miami
2 Krish LA New York
3 John New York LA
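If speed really matters, here is a fully vectorized sketch (my own variant, not from the answers above): map each CITY to its position in the city list and shift it by a random non-zero offset, which always lands on a different city:
cities = list(city)
n = len(cities)
pos = df['CITY'].map({c: i for i, c in enumerate(cities)}).to_numpy()
offset = np.random.randint(1, n, size=len(df))  # a non-zero shift can never return the current city
df['CITY2'] = np.array(cities)[(pos + offset) % n]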

Matching and Joining Two Inconsistent DataFrames

I have two dataframes that are being queried off two separate databases that share common characteristics, but not always the same characteristics, and I need to find a way to reliably join the two together.
As an example:
import pandas as pd
inp = [{'Name':'Jose', 'Age':12,'Location':'Frankfurt','Occupation':'Student','Mothers Name':'Rosy'}, {'Name':'Katherine','Age':23,'Location':'Maui','Occupation':'Lawyer','Mothers Name':'Amy'}, {'Name':'Larry','Age':22,'Location':'Dallas','Occupation':'Nurse','Mothers Name':'Monica'}]
df = pd.DataFrame(inp)
print (df)
Age Location Mothers Name Name Occupation
0 12 Frankfurt Rosy Jose Student
1 23 Maui Amy Katherine Lawyer
2 22 Dallas Monica Larry Nurse
inp2 = [{'Name': '','Occupation':'Nurse','Favorite Hobby':'Basketball','Mothers Name':'Monica'},{'Name':'Jose','Occupation':'','Favorite Hobby':'Sewing','Mothers Name':'Rosy'},{'Name':'Katherine','Occupation':'Lawyer','Favorite Hobby':'Reading','Mothers Name':''}]
df2 = pd.DataFrame(inp2)
print(df2)
Favorite Hobby Mothers Name Name Occupation
0 Basketball Monica Nurse
1 Sewing Rosy Jose
2 Reading Katherine Lawyer
I need to figure out a way to reliably join these two dataframes without the data always being consistent. To further complicate the problem, the two databases are not always the same length. Any ideas?
You can perform your merge on your possible column combinations and concat those dfs, then merge your new df onto the first (complete) df:
# do your three possible merges on 'Mothers Name', 'Name', and 'Occupation'
# then concat your dataframes
new_df = pd.concat([df.merge(df2, on=['Mothers Name', 'Name']),
                    df.merge(df2, on=['Name', 'Occupation']),
                    df.merge(df2, on=['Mothers Name', 'Occupation'])], sort=False)
# take the first dataframe, which is complete, and merge with your new_df and drop dups
df.merge(new_df[['Age', 'Location', 'Favorite Hobby']], on=['Age', 'Location']).drop_duplicates()
Age Location Mothers Name Name Occupation Favorite Hobby
0 12 Frankfurt Rosy Jose Student Sewing
2 23 Maui Amy Katherine Lawyer Reading
4 22 Dallas Monica Larry Nurse Basketball
This assumes that each row's age and location are unique.

Merging two columns with different information, python

I have a dataframe with one column of last names, and one column of first names. How do I merge these columns so that I have one column with first and last names?
Here is what I have:
First Name  Last Name
John        Smith
Lisa        Brown
Jim         Dandy
This is what I want:
Full Name
John Smith
Lisa Brown
Jim Dandy
Thank you!
Try
df.assign(name = df.apply(' '.join, axis = 1)).drop(['first name', 'last name'], axis = 1)
You get
name
0 bob smith
1 john smith
2 bill smith
Here's a sample df:
df
first name last name
0 bob smith
1 john smith
2 bill smith
You can do the following to combine columns:
df['combined']= df['first name'] + ' ' + df['last name']
df
first name last name combined
0 bob smith bob smith
1 john smith john smith
2 bill smith bill smith
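A slightly more idiomatic variant is Series.str.cat, whose na_rep argument also lets you control what happens when one of the names is missing:
df['combined'] = df['first name'].str.cat(df['last name'], sep=' ')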
