I've read through the pandas documentation on merging but am still quite confused about how to apply it to my case. I have two dataframes that I'd like to merge on the common column 'Town', and also on the values in the first dataframe's venue columns, which appear as column names in the second dataframe.
The first df summarizes the top 5 most common venues in each town:
The second df summarizes the frequencies of all the venue categories in each town:
The output I want:
Ang Mo Kio | Food Court | Coffee Shop | Dessert Shop | Chinese Restaurant | Japanese Restaurant | Freq of Food Court | Freq of Coffee Shop |...
What I've tried with merge:
newdf = pd.merge(sg_onehot_grouped, sg_venues_sorted, left_on=['Town'], right_on=['1st Most Common Venue'])
#only trying the 1st column because wanted to scale down my code
but I got an empty dataframe with the column names as all the columns from both dataframes.
Appreciate any help. Thanks.
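The empty result is most likely because that call compares 'Town' values in one frame with venue names in the other, and those never match. Below is a minimal sketch of one way to get the desired layout. Since the original tables are not shown, it assumes that sg_venues_sorted has a 'Town' column plus 'Nth Most Common Venue' columns, and that sg_onehot_grouped has a 'Town' column plus one frequency column per venue category. The sample values and the 'Freq of ...' column names are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames described in the question.
sg_venues_sorted = pd.DataFrame({
    "Town": ["Ang Mo Kio", "Bedok"],
    "1st Most Common Venue": ["Food Court", "Coffee Shop"],
    "2nd Most Common Venue": ["Coffee Shop", "Food Court"],
})
sg_onehot_grouped = pd.DataFrame({
    "Town": ["Ang Mo Kio", "Bedok"],
    "Food Court": [0.30, 0.10],
    "Coffee Shop": [0.20, 0.25],
})

# Index the frequency table by town so a (town, venue) pair addresses one cell.
freq = sg_onehot_grouped.set_index("Town")

result = sg_venues_sorted.copy()
venue_cols = [c for c in result.columns if c.endswith("Most Common Venue")]
for col in venue_cols:
    # For every row, look up the frequency of the venue named in this column.
    result["Freq of " + col] = [
        freq.at[town, venue]
        for town, venue in zip(result["Town"], result[col])
    ]

print(result)
```

The frequency columns are named after the rank ('Freq of 1st Most Common Venue') rather than after the venue, because the venue in each rank differs from town to town.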
I have a df, like
Person 1st 2nd 3rd
0 A Park Gym Supermarket
1 B Tea Restaurant Park
2 C Coco Gym Beer
... ... ... ... ...
I want to select the rows that contain 'Park' and get them as a new df.
Desired result:
Person 1st 2nd 3rd
0 A Park Gym Supermarket
1 B Tea Restaurant Park
...
And another new df with the rows that contain 'Gym'.
Desired results:
Person 1st 2nd 3rd
0 A Park Gym Supermarket
2 C Coco Gym Beer
...
How could I do it?
Selecting 'Park' in one column is no problem, df[df['1st'] == 'Park'], but I have trouble selecting across multiple columns (1st, 2nd, 3rd, etc.).
You can perform "or" operations in pandas by using the pipe |, so in this specific case, you could try (note that the comparison is case-sensitive, so the value must be 'Park' as in the data):
df_filtered = df[(df['1st'] == 'Park') | (df['2nd'] == 'Park') | (df['3rd'] == 'Park')]
Alternatively, you could use isin() followed by .any(axis=1), which keeps every row where at least one of the listed columns matches:
df_filtered = df[df[['1st', '2nd', '3rd']].isin(['Park']).any(axis=1)]
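As a self-contained check, here is a small sketch that applies the any(axis=1) approach to the sample rows from the question. The helper name rows_containing is hypothetical and only there to avoid repeating the expression for each search value.

```python
import pandas as pd

df = pd.DataFrame({
    "Person": ["A", "B", "C"],
    "1st": ["Park", "Tea", "Coco"],
    "2nd": ["Gym", "Restaurant", "Gym"],
    "3rd": ["Supermarket", "Park", "Beer"],
})

def rows_containing(frame, value, columns=("1st", "2nd", "3rd")):
    """Return the rows where any of the given columns equals `value`."""
    mask = frame[list(columns)].eq(value).any(axis=1)
    return frame[mask]

park_df = rows_containing(df, "Park")  # rows for persons A and B
gym_df = rows_containing(df, "Gym")    # rows for persons A and C
print(park_df)
print(gym_df)
```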
I have seen a number of similar questions but cannot find a straightforward solution to my issue.
I am working with a pandas dataframe containing contact information for constituent donors to a nonprofit. The data has Households and Individuals. Most Households have member Individuals, but not all Individuals are associated with a Household. There is no data that links a Household to its member Individuals, so I am attempting to match them up based on other data: Home Street Address, Phone Number, Email, etc.
A simplified version of the dataframe looks something like this:
Constituent Id Type Home Street
1234567 Household 123 Main St.
2345678 Individual 123 Main St.
3456789 Individual 123 Main St.
4567890 Individual 433 Elm Rd.
0123456 Household 433 Elm Rd.
1357924 Individual 500 Stack Ln.
1344444 Individual 500 Stack Ln.
I am using groupby in order to group the constituents. In this case, by Home Street. I'm trying to ensure that I only get groupings with more than one record (to exclude Individuals unassociated with a Household). I am using something like:
df1 = df.groupby('Home Street').filter(lambda x: len(x) > 1)
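To illustrate the filtering step on its own, here is a minimal sketch. The row with Constituent Id 9999999 at '1 Lone Way' is a made-up addition, since every address in the sample above already has more than one record. The transform-based variant is an alternative I am suggesting, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Constituent Id": ["1234567", "2345678", "3456789", "9999999"],
    "Type": ["Household", "Individual", "Individual", "Individual"],
    "Home Street": ["123 Main St.", "123 Main St.", "123 Main St.", "1 Lone Way"],
})

# Keep only the groups (addresses) that have more than one record.
df1 = df.groupby("Home Street").filter(lambda g: len(g) > 1)

# Equivalent boolean-mask version: broadcast each group's size back to its rows.
group_size = df.groupby("Home Street")["Home Street"].transform("size")
df1_alt = df[group_size > 1]

print(df1)
```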
What I would like to do is somehow export the grouped dataframe to a new dataframe that includes the Household Constituent Id first, then any Individual Constituent Ids. And in the case that there is no Household in the grouping, place the Individual Constituents in the appropriate locations. The output for my data set above would look like:
Household Individual Individual
1234567 2345678 3456789
0123456 4567890
1357924 1344444
I have toyed with iterating through the groupby object, but I feel like I'm missing some easy way to accomplish my task.
This should do it
df['Type'] = df['Type'] + '_' + (df.groupby(['Home Street','Type']).cumcount().astype(str))
df.pivot_table(index='Home Street', columns='Type', values='Constituent Id', aggfunc=lambda x: ' '.join(x)).reset_index(drop=True)
Output
Type Household_0 Individual_0 Individual_1
0 1234567 2345678 3456789
1 0123456 4567890 NaN
2 NaN 1357924 1344444
IIUC, we can use groupby agg(list) and some re-shaping using .join & explode
s = df.groupby(["Home Street", "Type"]).agg(list).unstack(1).reset_index(
drop=True
).droplevel(level=0, axis=1).explode("Household")
df1 = s.join(pd.DataFrame(s["Individual"].tolist()).add_prefix("Individual_")).drop(
"Individual", axis=1
)
print(df1.fillna(' '))
Household Individual_0 Individual_1
0 1234567 2345678 3456789
1 0123456 4567890
2 1357924 1344444
or we can ditch the join and cast Household to your index.
df1 = pd.DataFrame(s["Individual"].tolist(), index=s["Household"])\
.add_prefix("Individual_")
print(df1)
Individual_0 Individual_1
Household
1234567 2345678 3456789
0123456 4567890 None
NaN 1357924 1344444
Given DF1:
Title | Origin | %
Analyst Referral 3
Analyst University 10
Manager University 1
and DF2:
Title | Referral | University
Analyst
Manager
I'm trying to set the values inside DF2 based on conditions such as:
DF2['Referral'] = np.where((DF1['Title']=='Analyst') & (DF1['Origin']=='Referral')), DF1['%'], '0'
What I'm getting as a result is all the values in DF1['%'], whereas I'm expecting to get only the value in the row where the conditions are met.
Like this:
Title | Referral | University
Analyst 3 10
Manager 1
Also, there is probably a more efficient way of doing this, I'm open to suggestions!
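Just use pivot; no conditional logic is needed:
from io import StringIO
import pandas as pd
s = """Title|Origin|%
Analyst|Referral|3
Analyst|University|10
Manager|University|1"""
df = pd.read_csv(StringIO(s), sep='|')
# keyword arguments are required by newer pandas versions
df.pivot(index='Title', columns='Origin', values='%')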
Origin Referral University
Title
Analyst 3.0 10.0
Manager NaN 1.0
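The question's own attempt used '0' as the fallback value, so zeros may be preferred over NaN for missing combinations. A minimal sketch of that variant uses pivot_table with fill_value. The choice of aggfunc="sum" is my assumption: it leaves the values unchanged as long as each Title/Origin pair occurs only once.

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Analyst", "Analyst", "Manager"],
    "Origin": ["Referral", "University", "University"],
    "%": [3, 10, 1],
})

# One row per Title, one column per Origin; absent pairs become 0 instead of NaN.
wide = df.pivot_table(
    index="Title", columns="Origin", values="%", aggfunc="sum", fill_value=0
)
print(wide)
```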
I have a table (df1) that contains 3 columns - id, Industry, Job. I also have a table 2 (df2), which contains some of the values that are missing from df1.
DF1:
ID | Industry | Job
1 Tech Data Engineer
2 N/A N/A
3 Blah Blah
4 N/A Police Officer
8 Transport N/A
DF2:
ID | Industry | Job
1 Tech Data Engineer
2 Oil Engineer
4 Government Police Officer
10 E-Sports Gamer
I want to transfer the values that are missing in df1 from df2. Note that I DO NOT want to fully replace any values that are in df1 already and I only want to take values from df2 if they are missing in df1. Also note that ID 8 is missing in DF2, so I would want to keep the N/A in df1.
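One possible approach, sketched below, is to index both frames by ID and call fillna with the second frame: pandas aligns on index and column labels, so only cells that are missing in df1 are filled, and IDs that exist only in df2 are not added. This assumes the "N/A" entries are real missing values (NaN). If they are literal strings, they would need to be converted first, for example with df1.replace("N/A", np.nan).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 8],
    "Industry": ["Tech", np.nan, "Blah", np.nan, "Transport"],
    "Job": ["Data Engineer", np.nan, "Blah", "Police Officer", np.nan],
})
df2 = pd.DataFrame({
    "ID": [1, 2, 4, 10],
    "Industry": ["Tech", "Oil", "Government", "E-Sports"],
    "Job": ["Data Engineer", "Engineer", "Police Officer", "Gamer"],
})

# Align on ID; existing df1 values are kept, only NaN cells take df2's value.
filled = (
    df1.set_index("ID")
       .fillna(df2.set_index("ID"))
       .reset_index()
)
print(filled)
```

ID 8 keeps its missing Job because df2 has no row for it, and ID 10 does not appear in the result because it is absent from df1.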
How can I merge two data frames on a common column so that the merged data frame contains only the rows whose value in that column appears in both?
I have 5000 rows in df1, formatted like this:
director_name actor_1_name actor_2_name actor_3_name movie_title
0 James Cameron CCH Pounder Joel David Moore Wes Studi Avatar
1 Gore Verbinski Johnny Depp Orlando Bloom Jack Davenport Pirates of the Caribbean: At World's End
2 Sam Mendes Christoph Waltz Rory Kinnear Stephanie Sigman Spectre
and 10000 rows in df2, formatted like this:
movieId genres movie_title
1 Adventure|Animation|Children|Comedy|Fantasy Toy Story
2 Adventure|Children|Fantasy Jumanji
3 Comedy|Romance Grumpier Old Men
4 Comedy|Drama|Romance Waiting to Exhale
The two data frames share a 'movie_title' column with some values in common. I want to keep all rows where 'movie_title' is the same in both and drop the other rows.
Any help/suggestion would be appreciated.
Note: I already tried
pd.merge(dfinal, df1, on='movie_title')
and the output is a single row of column names:
director_name actor_1_name actor_2_name actor_3_name movie_title movieId title genres
I also tried how="outer", "left" and "right", and still got no rows after dropping NaN, although many common values do exist in that column.
You can use pd.merge:
import pandas as pd
pd.merge(df1, df2, on="movie_title")
By default this is an inner join: only rows whose key is found in both data frames are kept. In case you want to keep all rows from the left data frame and only add values from df2 where a matching key is available, you can use how="left":
pd.merge(df1, df2, on="movie_title", how="left")
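If the merge still returns no rows even though the titles look identical, one possible cause is invisible differences in the key column, such as leading or trailing whitespace. This is an assumption, since the raw data is not shown. The sketch below uses made-up sample rows to show how to normalise the key with str.strip() and how to use indicator=True to see which rows matched.

```python
import pandas as pd

# Hypothetical rows: the titles in df1 carry stray whitespace.
df1 = pd.DataFrame({
    "director_name": ["James Cameron", "Sam Mendes"],
    "movie_title": ["Avatar\xa0", " Spectre"],
})
df2 = pd.DataFrame({
    "movieId": [1, 2],
    "movie_title": ["Avatar", "Toy Story"],
})

# Without cleaning, no key matches exactly, so the inner merge is empty.
raw = pd.merge(df1, df2, on="movie_title")

# Strip surrounding whitespace (including non-breaking spaces) from the keys.
df1_clean = df1.assign(movie_title=df1["movie_title"].str.strip())
df2_clean = df2.assign(movie_title=df2["movie_title"].str.strip())

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
checked = pd.merge(df1_clean, df2_clean, on="movie_title",
                   how="outer", indicator=True)
common = checked[checked["_merge"] == "both"]
print(checked[["movie_title", "_merge"]])
```

Differences in letter case or in punctuation would need their own normalisation step, for example str.lower().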
We can merge two data frames in several ways. The most common way in Python is the merge operation in pandas.
import pandas
dfinal = df1.merge(df2, on="movie_title", how = 'inner')
If the key column has different names in the two dataframes, say 'movie_title' in one and 'movie_name' in the other, you can specify them separately with left_on and right_on:
dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')
If you want to be even more specific, you may read the documentation of pandas merge operation.
If you want a merged DataFrame that contains only the rows whose key values appear in both DataFrames, do an inner merge.
import pandas as pd
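# `on` must be a column name given as a string; a bare `id` would refer to Python's built-in function
merged_Frame = pd.merge(df1, df2, on='movie_title', how='inner')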