Faster way to query & compute in Pandas [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have two dataframes in Pandas. What I want to achieve is to take every 'Name' from DF1 and get the corresponding 'City' and 'State' from DF2.
For example, 'Dwight' from DF1 should return corresponding values 'Miami' and 'Florida' from DF2.
DF1
Name Age Student
0 Dwight 20 Yes
1 Michael 30 No
2 Pam 55 No
. . . .
70000 Jim 27 Yes
DF1 has approx 70,000 rows with 3 columns
Second Dataframe, DF2 has approx 320,000 rows.
Name City State
0 Dwight Miami Florida
1 Michael Scranton Pennsylvania
2 Pam Austin Texas
. . . . .
325082 Jim Scranton Pennsylvania
Currently I have two functions, which return the values of 'City' and 'State' using a filter.
def read_city(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = (df2[filt]['City'].values[0])
    else:
        field = ""
    return field

def read_state(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = (df2[filt]['State'].values[0])
    else:
        field = ""
    return field
I am using the apply function to process all the values.
df['city_list'] = df['Name'].apply(read_city)
df['State_list'] = df['Name'].apply(read_state)
Computing the result this way takes a long time: roughly 18 minutes to get back df['city_list'] and df['State_list'].
Is there a faster way to compute this? Since I am completely new to pandas, I would like to know if there is a more efficient approach.

I believe you can do a map:
s = df2.groupby('Name')[['City','State']].agg(list)
df['city_list'] = df['Name'].map(s['City'])
df['State_list'] = df['Name'].map(s['State'])
Or a left merge after you got s:
df = df.merge(s.add_suffix('_list'), left_on='Name', right_index=True, how='left')
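If the goal is to reproduce what read_city and read_state return (the first matching row, or an empty string when the name is missing), a minimal sketch could deduplicate df2 once and then map. The 'Kevin' row and the second 'Dwight' row ('Tampa') are made-up data added only to show the missing and duplicate cases.

```python
import pandas as pd

# Small stand-ins for the question's frames; 'Kevin' and the extra
# 'Dwight'/'Tampa' row are hypothetical illustration data.
df = pd.DataFrame({'Name': ['Dwight', 'Michael', 'Pam', 'Jim', 'Kevin'],
                   'Age': [20, 30, 55, 27, 41],
                   'Student': ['Yes', 'No', 'No', 'Yes', 'No']})
df2 = pd.DataFrame({'Name': ['Dwight', 'Michael', 'Pam', 'Jim', 'Dwight'],
                    'City': ['Miami', 'Scranton', 'Austin', 'Scranton', 'Tampa'],
                    'State': ['Florida', 'Pennsylvania', 'Texas',
                              'Pennsylvania', 'Florida']})

# Keep only the first row per name (like .values[0] in the original
# functions) and index by name so it can serve as a lookup table.
lookup = df2.drop_duplicates(subset='Name', keep='first').set_index('Name')

# One vectorized lookup per column; unmatched names become "".
df['city_list'] = df['Name'].map(lookup['City']).fillna('')
df['State_list'] = df['Name'].map(lookup['State']).fillna('')
print(df)
```

The lookup table is built once, so the cost no longer grows with a full scan of df2 per name.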

I think you can do something like this:
# Dataframe DF1 (dummy data)
DF1 = pd.DataFrame(columns=['Name', 'Age', 'Student'], data=[['Dwight', 20, 'Yes'], ['Michael', 30, 'No'], ['Pam', 55, 'No'], ['Jim', 27, 'Yes']])
print("DataFrame DF1")
print(DF1)
# Dataframe DF2 (dummy data)
DF2 = pd.DataFrame(columns=['Name', 'City', 'State'], data=[['Dwight', 'Miami', 'Florida'], ['Michael', 'Scranton', 'Pennsylvania'], ['Pam', 'Austin', 'Texas'], ['Jim', 'Scranton', 'Pennsylvania']])
print("DataFrame DF2")
print(DF2)
# You do a merge on 'Name' column and then, you change the name of columns 'City' and 'State'
df = pd.merge(DF1, DF2, on=['Name']).rename(columns={'City': 'city_list', 'State': 'State_list'})
print("DataFrame final")
print(df)
Output:
DataFrame DF1
Name Age Student
0 Dwight 20 Yes
1 Michael 30 No
2 Pam 55 No
3 Jim 27 Yes
DataFrame DF2
Name City State
0 Dwight Miami Florida
1 Michael Scranton Pennsylvania
2 Pam Austin Texas
3 Jim Scranton Pennsylvania
DataFrame final
Name Age Student city_list State_list
0 Dwight 20 Yes Miami Florida
1 Michael 30 No Scranton Pennsylvania
2 Pam 55 No Austin Texas
3 Jim 27 Yes Scranton Pennsylvania
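Note that pd.merge defaults to an inner join, so names in DF1 that have no match in DF2 would be dropped. A hedged variant, assuming every row of DF1 should be kept, uses how='left' together with indicator=True to see which names found no match. The 'Angela' row is made-up data for illustration.

```python
import pandas as pd

# 'Angela' is a hypothetical name with no counterpart in DF2.
DF1 = pd.DataFrame(columns=['Name', 'Age', 'Student'],
                   data=[['Dwight', 20, 'Yes'], ['Michael', 30, 'No'],
                         ['Pam', 55, 'No'], ['Angela', 35, 'No']])
DF2 = pd.DataFrame(columns=['Name', 'City', 'State'],
                   data=[['Dwight', 'Miami', 'Florida'],
                         ['Michael', 'Scranton', 'Pennsylvania'],
                         ['Pam', 'Austin', 'Texas']])

# how='left' keeps all DF1 rows; indicator=True adds a '_merge' column;
# validate raises if a name appears more than once in DF2.
merged = DF1.merge(DF2, on='Name', how='left',
                   indicator=True, validate='many_to_one')

# Names present in DF1 but absent from DF2.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Name'].tolist()
print(merged)
print(unmatched)
```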

Related

Transform one row to a data frame with multiple rows

I have a data frame containing one row:
df_1D = pd.DataFrame({'Day1':[5],
'Day2':[6],
'Day3':[7],
'ID':['AB12'],
'Country':['US'],
'Destination_A':['Miami'],
'Destination_B':['New York'],
'Destination_C':['Chicago'],
'First_Agent':['Jim'],
'Second_Agent':['Ron'],
'Third_Agent':['Cynthia']},
)
Day1 Day2 Day3 ID ... Destination_C First_Agent Second_Agent Third_Agent
0 5 6 7 AB12 ... Chicago Jim Ron Cynthia
I'm wondering if there's an easy way, to transform it into a dataframe with three rows as shown here:
Day ID Country Destination Agent
0 5 AB12 US Miami Jim
1 6 AB12 US New York Ron
2 7 AB12 US Chicago Cynthia

Have you tried to pivot it with .pivot function? https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html

One option uses reshaping, which only requires knowing the final column names:
# define final columns
cols = ['Day', 'ID', 'Destination', 'Country', 'Agent']
# the part below is automatic
# ------
# extract the keywords
pattern = f"({'|'.join(cols)})"
new = df_1D.columns.str.extract(pattern)[0]
# and reshape
out = (df_1D
.set_axis(pd.MultiIndex.from_arrays([new, new.groupby(new).cumcount()]), axis=1)
.loc[0].unstack(0).ffill()[cols]
)
Output:
Day ID Destination Country Agent
0 5 AB12 Miami US Jim
1 6 AB12 New York US Ron
2 7 AB12 Chicago US Cynthia
An alternative that defines idx and cols separately:
idx = ['ID', 'Country']
cols = ['Day', 'Destination', 'Agent']
df2 = df_1D.set_index(idx)
pattern = f"({'|'.join(cols)})"
new = df2.columns.str.extract(pattern)[0]
out = (df2
.set_axis(pd.MultiIndex.from_arrays([new, new.groupby(new).cumcount().astype(str)],
names=[None, None]),
axis=1)
.stack().reset_index(idx)
)
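Another possible route, sketched here under the assumption that renaming the columns by hand is acceptable, is pd.wide_to_long. It expects each group of columns to share a stub name followed by a numeric suffix, so the renames dictionary below is a hypothetical mapping chosen for this example, and 'n' is just an illustrative name for the suffix column.

```python
import pandas as pd

df_1D = pd.DataFrame({'Day1': [5], 'Day2': [6], 'Day3': [7],
                      'ID': ['AB12'], 'Country': ['US'],
                      'Destination_A': ['Miami'],
                      'Destination_B': ['New York'],
                      'Destination_C': ['Chicago'],
                      'First_Agent': ['Jim'],
                      'Second_Agent': ['Ron'],
                      'Third_Agent': ['Cynthia']})

# Hypothetical manual mapping to the stub + number pattern.
renames = {'Destination_A': 'Destination1', 'Destination_B': 'Destination2',
           'Destination_C': 'Destination3', 'First_Agent': 'Agent1',
           'Second_Agent': 'Agent2', 'Third_Agent': 'Agent3'}

long_df = (pd.wide_to_long(df_1D.rename(columns=renames),
                           stubnames=['Day', 'Destination', 'Agent'],
                           i=['ID', 'Country'],  # columns that identify a row
                           j='n')                # receives the numeric suffix
           .reset_index()
           .sort_values('n')
           .reset_index(drop=True)
           [['Day', 'ID', 'Country', 'Destination', 'Agent']])
print(long_df)
```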

clomuns_day=[col for col in df_1D if col.startswith('Day')]
clomuns_dest=[col for col in df_1D if col.startswith('Destination')]
clomuns_agent=[col for col in df_1D if 'Agent'in col]
new_df=pd.DataFrame()
new_df['Day']=df_1D[clomuns_day].values.tolist()[0]
new_df['ID']= list(df_1D['ID'])*len(new_df)
new_df['Country']= list(df_1D['Country'])*len(new_df)
new_df['Destination']=df_1D[clomuns_dest].values.tolist()[0]
new_df['Agent']=df_1D[clomuns_agent].values.tolist()[0]
Out:
Day ID Country Destination Agent
0 5 AB12 US Miami Jim
1 6 AB12 US New York Ron
2 7 AB12 US Chicago Cynthia
You can use this no matter how many Day, Destination, and Agent columns there are.

One option is with pivot_longer from pyjanitor, where for this case, you pass a list of regexes to names_pattern, and the new column names to names_to:
# pip install pyjanitor
import janitor
import pandas as pd
(df_1D
.pivot_longer(
index=['ID','Country'],
names_to = ['Day','Destination','Agent'],
names_pattern=['Day','Destination','Agent'])
)
ID Country Day Destination Agent
0 AB12 US 5 Miami Jim
1 AB12 US 6 New York Ron
2 AB12 US 7 Chicago Cynthia

I don't think this can be fully automated; it requires some manual manipulation. This is the shortest code that comes to mind, assuming d is the dictionary of lists the dataframe was built from. Feel free to comment:
d = df_1D.to_dict('list')  # assumption: rebuild the dict of lists from the dataframe
d1 = {}
for k in ['Day', 'Destination', 'Agent']:
    d1[k] = [d[i][0] for i in d.keys() if k in i]
for k in ['ID', 'Country']:
    d1[k] = d[k] * len(d1['Day'])
d1 = pd.DataFrame(d1)
Hope this helps.

Replacing values from one dataframe to another

I'm trying to fix discrepancies in a column from one df to a column in another.
The tables are also not sorted the same way.
How can I do this using Python? Example:
df1
Age Name
40 Sid Jones
50 Alex, Bot
32 Tony Jar
65 Fred, Smith
24 Brad, Mans
df2
Age Name
24 Brad Mans
32 Tony Jar
40 Sid Jones
65 Fred Smith
50 Alex Bot
I need to replace the values in df2 to match those in df1. As you can see in my example, the discrepancies are the commas in the names.
Expected outcome for df2:
Age Name
24 Brad, Mans
32 Tony Jar
40 Sid Jones
65 Fred, Smith
50 Alex, Bot
The values in df2 should be changed to match df1's values.

Create a column in df1 with the commas removed from the Name column:
df1['Name_nocomma'] = df1.Name.str.replace(',', '')
Merge df1 into df2 on Name_nocomma and Name to get the corrected Name, creating a new version of df2:
df2_out = df2.merge(df1, left_on='Name', right_on='Name_nocomma', how='left')[['Age_x', 'Name_x', 'Name_y']]
Use combine_first to coalesce Name_y and Name_x into a new column Name:
df2_out['Name'] = df2_out.Name_y.combine_first(df2_out.Name_x)
Drop or rename the intermediate columns:
del df1['Name_nocomma']
del df2_out['Name_x']
del df2_out['Name_y']
df2_out.rename({'Age_x': 'Age'}, axis=1, inplace=True)
df2_out
#outputs:
Age Name
0 24 Brad, Mans
1 32 Tony Jar
2 40 Sid Jones
3 65 Fred, Smith
4 50 Alex, Bot
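A shorter variant of the same idea, offered as a sketch, builds a lookup Series keyed by the comma-free name and maps df2 through it. It assumes that names stay unique once the commas are stripped; corrections is a name introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Age': [40, 50, 32, 65, 24],
                    'Name': ['Sid Jones', 'Alex, Bot', 'Tony Jar',
                             'Fred, Smith', 'Brad, Mans']})
df2 = pd.DataFrame({'Age': [24, 32, 40, 65, 50],
                    'Name': ['Brad Mans', 'Tony Jar', 'Sid Jones',
                             'Fred Smith', 'Alex Bot']})

# Lookup table: comma-free name -> name as spelled in df1.
corrections = pd.Series(
    df1['Name'].values,
    index=df1['Name'].str.replace(',', '', regex=False))

# Replace where a correction exists; otherwise keep df2's own value.
df2['Name'] = df2['Name'].map(corrections).fillna(df2['Name'])
print(df2)
```

Because the lookup goes by name rather than by position, the different row order of the two tables does not matter.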

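You need to sort and concatenate:
df1.sort_values(by=['Age'], inplace=True)  # DataFrame.sort was removed; use sort_values
df2.sort_values(by=['Age'], inplace=True)
result_df = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0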

Matching and Joining Two Inconsistent DataFrames

I have two dataframes that are being queried off two separate databases that share common characteristics, but not always the same characteristics, and I need to find a way to reliably join the two together.
As an example:
import pandas as pd
inp = [{'Name':'Jose', 'Age':12,'Location':'Frankfurt','Occupation':'Student','Mothers Name':'Rosy'}, {'Name':'Katherine','Age':23,'Location':'Maui','Occupation':'Lawyer','Mothers Name':'Amy'}, {'Name':'Larry','Age':22,'Location':'Dallas','Occupation':'Nurse','Mothers Name':'Monica'}]
df = pd.DataFrame(inp)
print (df)
Age Location Mothers Name Name Occupation
0 12 Frankfurt Rosy Jose Student
1 23 Maui Amy Katherine Lawyer
2 22 Dallas Monica Larry Nurse
inp2 = [{'Name': '','Occupation':'Nurse','Favorite Hobby':'Basketball','Mothers Name':'Monica'},{'Name':'Jose','Occupation':'','Favorite Hobby':'Sewing','Mothers Name':'Rosy'},{'Name':'Katherine','Occupation':'Lawyer','Favorite Hobby':'Reading','Mothers Name':''}]
df2 = pd.DataFrame(inp2)
print(df2)
Favorite Hobby Mothers Name Name Occupation
0 Basketball Monica Nurse
1 Sewing Rosy Jose
2 Reading Katherine Lawyer
I need to figure out a way to reliably join these two dataframes without the data always being consistent. To complicate the problem further, the two databases are not always the same length. Any ideas?

You can perform the merge on each possible column combination, concat the resulting dataframes, and then merge that new dataframe onto the first (complete) one:
# do your three possible merges on 'Mothers Name', 'Name', and 'Occupation'
# then concat your dataframes
new_df = pd.concat([df.merge(df2, on=['Mothers Name', 'Name']),
df.merge(df2, on=['Name', 'Occupation']),
df.merge(df2, on=['Mothers Name', 'Occupation'])], sort=False)
# take the first dataframe, which is complete, and merge with your new_df and drop dups
df.merge(new_df[['Age', 'Location', 'Favorite Hobby']], on=['Age', 'Location']).drop_duplicates()
Age Location Mothers Name Name Occupation Favorite Hobby
0 12 Frankfurt Rosy Jose Student Sewing
2 23 Maui Amy Katherine Lawyer Reading
4 22 Dallas Monica Larry Nurse Basketball
This assumes that each row's combination of Age and Location is unique.
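A variant that avoids relying on Age and Location being unique, sketched under the assumption that the complete dataframe df never has blank key values, tries each key pair in turn and fills in only the rows that are still unmatched. The names key_pairs, hobby, left, matched, and found are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    {'Name': 'Jose', 'Age': 12, 'Location': 'Frankfurt',
     'Occupation': 'Student', 'Mothers Name': 'Rosy'},
    {'Name': 'Katherine', 'Age': 23, 'Location': 'Maui',
     'Occupation': 'Lawyer', 'Mothers Name': 'Amy'},
    {'Name': 'Larry', 'Age': 22, 'Location': 'Dallas',
     'Occupation': 'Nurse', 'Mothers Name': 'Monica'}])
df2 = pd.DataFrame([
    {'Name': '', 'Occupation': 'Nurse',
     'Favorite Hobby': 'Basketball', 'Mothers Name': 'Monica'},
    {'Name': 'Jose', 'Occupation': '',
     'Favorite Hobby': 'Sewing', 'Mothers Name': 'Rosy'},
    {'Name': 'Katherine', 'Occupation': 'Lawyer',
     'Favorite Hobby': 'Reading', 'Mothers Name': ''}])

# Key combinations to try, in order of preference.
key_pairs = [['Mothers Name', 'Name'],
             ['Name', 'Occupation'],
             ['Mothers Name', 'Occupation']]

hobby = pd.Series(index=df.index, dtype='object')  # all missing at first
left = df.reset_index()  # original row label is kept in column 'index'
for pair in key_pairs:
    matched = left.merge(df2[pair + ['Favorite Hobby']], on=pair)
    # One hobby per original row, indexed by the original row label.
    found = matched.drop_duplicates('index').set_index('index')['Favorite Hobby']
    hobby = hobby.fillna(found)  # only fills rows that are still missing

df['Favorite Hobby'] = hobby
print(df)
```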

Make Pandas Dataframe column equal to value in another Dataframe based on index

I have 3 dataframes as below
df1
id first_name surname state
1
88
190
2509
....
df2
id given_name surname state street_num
17 John Doe NY 5
88 Tom Murphy CA 423
190 Dave Casey KY 250
....
df3
id first_name family_name state car
1 John Woods NY ford
74 Tom Kite FL vw
2509 Mike Johnson KY toyota
Some id's from df1 are in df2 and others are in df3. There are also id's in df2 and df3 that are not in df1.
EDIT: there are also some ids in df1 that are not in either df2 or df3.
I want to fill the columns in df1 with the values from the dataframe containing the id. However, I do not want all columns (so I think merge is not suitable). I have tried to use the isin function, but that way I could not update records individually and got an error. This was my attempt using isin:
df1.loc[df1.index.isin(df2.index), 'first_name'] = df2.given_name
Is there an easy way to do this without iterating through the dataframes checking if index matches?

I think you first need to rename your columns to align the DataFrames in concat and then reindex to filter by df1.index and df1.columns:
df21 = df2.rename(columns={'given_name':'first_name'})
df31 = df3.rename(columns={'family_name':'surname'})
df = pd.concat([df21, df31]).reindex(index=df1.index, columns=df1.columns)
print (df)
first_name surname state
id
1 John Woods NY
88 Tom Murphy CA
190 Dave Casey KY
2509 Mike Johnson KY
EDIT: If need intersection of indices only:
df4 = pd.concat([df21, df31])
df = df4.reindex(index=df1.index.intersection(df4.index), columns=df1.columns)
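To make the behaviour concrete, here is a small self-contained sketch of the concat and reindex approach. It assumes id is the index of all three dataframes, and the id 9999 is a made-up value standing in for an id that is in neither df2 nor df3.

```python
import pandas as pd

# df1 holds only the ids of interest; 9999 is a hypothetical id that
# is absent from both df2 and df3.
df1 = pd.DataFrame(index=pd.Index([1, 88, 190, 2509, 9999], name='id'),
                   columns=['first_name', 'surname', 'state'])
df2 = pd.DataFrame({'id': [17, 88, 190],
                    'given_name': ['John', 'Tom', 'Dave'],
                    'surname': ['Doe', 'Murphy', 'Casey'],
                    'state': ['NY', 'CA', 'KY'],
                    'street_num': [5, 423, 250]}).set_index('id')
df3 = pd.DataFrame({'id': [1, 74, 2509],
                    'first_name': ['John', 'Tom', 'Mike'],
                    'family_name': ['Woods', 'Kite', 'Johnson'],
                    'state': ['NY', 'FL', 'KY'],
                    'car': ['ford', 'vw', 'toyota']}).set_index('id')

# Align the column names, stack both sources, then keep only the rows
# and columns that df1 asks for.
combined = pd.concat([df2.rename(columns={'given_name': 'first_name'}),
                      df3.rename(columns={'family_name': 'surname'})])
filled = combined.reindex(index=df1.index, columns=df1.columns)
print(filled)
```

Ids that appear in neither source come back as rows of NaN, and extra columns such as street_num and car are dropped. The reindex step requires the ids in the combined frame to be unique.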

add a row at top in pandas dataframe [duplicate]

This question already has answers here:
Insert a row to pandas dataframe
(18 answers)
Closed 4 years ago.
Below is my dataframe
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
I want to insert a new row at the first position
name: dean, age: 45, sex: male
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
What is the best way to do this in pandas?

Probably this is not the most efficient way but:
df.loc[-1] = [45, 'Dean', 'male'] # adding a row; values follow the column order (age, name, sex)
df.index = df.index + 1 # shifting index
df.sort_index(inplace=True)
Output:
age name sex
0 45 Dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male

If it's going to be a frequent operation, then it makes sense (in terms of performance) to gather the data into a list first and then use pd.concat([], ignore_index=True) (similar to @Serenity's solution):
Demo:
data = []
# always inserting new rows at the first position - last row will be always on top
data.insert(0, {'name': 'dean', 'age': 45, 'sex': 'male'})
data.insert(0, {'name': 'joe', 'age': 33, 'sex': 'male'})
#...
pd.concat([pd.DataFrame(data), df], ignore_index=True)
In [56]: pd.concat([pd.DataFrame(data), df], ignore_index=True)
Out[56]:
age name sex
0 33 joe male
1 45 dean male
2 30 jon male
3 25 sam male
4 18 jane female
5 26 bob male
PS I wouldn't call .append(), pd.concat(), .sort_index() too frequently (for each single row) as it's pretty expensive. So the idea is to do it in chunks...

@edyvedy13's solution worked great for me. However, it needs to be updated for the deprecation of pandas' sort method, which is now replaced with sort_index.
df.loc[-1] = [45, 'Dean', 'male'] # adding a row; values follow the column order (age, name, sex)
df.index = df.index + 1 # shifting index
df = df.sort_index() # sorting by index
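The positional list above only works if its values are in the same order as the dataframe's columns, and newer pandas versions keep dictionary insertion order (name, age, sex here) instead of sorting column names. A sketch that sidesteps this builds the new row from a dict and lets pandas align it by column name; new_row and result are names introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['jon', 'sam', 'jane', 'bob'],
                   'age': [30, 25, 18, 26],
                   'sex': ['male', 'male', 'female', 'male']})

# One-row frame built from a dict, so column order does not matter;
# reindex makes its columns line up with df's.
new_row = pd.DataFrame([{'age': 45, 'sex': 'male', 'name': 'dean'}]) \
            .reindex(columns=df.columns)

# Put the new row first and renumber the index from 0.
result = pd.concat([new_row, df], ignore_index=True)
print(result)
```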

Use pandas.concat and reindex new dataframe:
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
# new line
line = pd.DataFrame({'name': 'dean', 'age': 45, 'sex': 'male'}, index=[0])
# concatenate two dataframe
df2 = pd.concat([line, df]).reset_index(drop=True)  # .ix was removed in pandas 1.0
print (df2)
Output:
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male

import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex': ['male','male','female','male']})
df1 = pd.DataFrame({'name': ['dean'], 'age': [45], 'sex':['male']})
df1 = pd.concat([df1, df])  # DataFrame.append was removed in pandas 2.0
df1 = df1.reset_index(drop=True)
That works

This works for me.
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
... 'age': [30,25,18,26],
... 'sex':['male','male','female','male']})
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
>>> df.loc['a']=[45,'dean','male']
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
a 45 dean male
>>> newIndex=['a']+[ind for ind in df.index if ind!='a']
>>> df=df.reindex(index=newIndex)
>>> df
age name sex
a 45 dean male
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
