Copy values only to a new empty dataframe with column names - Pandas - python

I have two dataframes.
df1= pd.DataFrame({'person_id':[1,2,3],'gender': ['Male','Female','Not disclosed'],'ethnicity': ['Chinese','Indian','European']})
df2 = pd.DataFrame(columns=['pid','gen','ethn'])
As you can see, the second dataframe (df2) is empty, but it may also contain a few rows of data at times.
What I would like to do is copy the values (only) from df1 to df2, while the column names of df2 remain unchanged.
I tried the below, but neither worked:
df2 = df1.copy(deep=False)
df2 = df1.copy(deep=True)
How can I get my output to look like this? Note that I don't want the column names of df1; I only want the data.

Do:
df1.columns = df2.columns.tolist()
df2 = pd.concat([df2, df1])
(df2.append(df1) was equivalent, but DataFrame.append was removed in pandas 2.0.)
Output:
pid gen ethn
0 1 Male Chinese
1 2 Female Indian
2 3 Not disclosed European
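For newer pandas (DataFrame.append was removed in 2.0), the same rename-and-stack idea can be sketched with set_axis and pd.concat, which also avoids mutating df1 in place:

```python
import pandas as pd

df1 = pd.DataFrame({'person_id': [1, 2, 3],
                    'gender': ['Male', 'Female', 'Not disclosed'],
                    'ethnicity': ['Chinese', 'Indian', 'European']})
df2 = pd.DataFrame(columns=['pid', 'gen', 'ethn'])

# Relabel df1's columns to match df2, then stack any existing df2 rows on top
out = pd.concat([df2, df1.set_axis(df2.columns, axis=1)], ignore_index=True)
```

Because df2 may also contain rows at times, concatenating keeps whatever is already there.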
Edit based on the OP's comment describing the nature of the dataframes:
df1= pd.DataFrame({'person_id':[1,2,3],'gender': ['Male','Female','Not disclosed'],'ethn': ['Chinese','Indian','European']})
df2= pd.DataFrame({'pers_id':[4,5,6],'gen': ['Male','Female','Not disclosed'],'ethnicity': ['Chinese','Indian','European']})
df3= pd.DataFrame({'son_id':[7,8,9],'sex': ['Male','Female','Not disclosed'],'ethnici': ['Chinese','Indian','European']})
final_df = pd.DataFrame(columns=['pid','gen','ethn'])
Now do:
frame = [df1, df2, df3]
for i in range(len(frame)):
    frame[i].columns = final_df.columns.tolist()
    final_df = pd.concat([final_df, frame[i]])
print(final_df)
Output:
pid gen ethn
0 1 Male Chinese
1 2 Female Indian
2 3 Not disclosed European
0 4 Male Chinese
1 5 Female Indian
2 6 Not disclosed European
0 7 Male Chinese
1 8 Female Indian
2 9 Not disclosed European
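Since DataFrame.append was removed in pandas 2.0, the loop above can instead relabel each frame and concatenate once; a sketch using the edit's three frames:

```python
import pandas as pd

df1 = pd.DataFrame({'person_id': [1, 2, 3],
                    'gender': ['Male', 'Female', 'Not disclosed'],
                    'ethn': ['Chinese', 'Indian', 'European']})
df2 = pd.DataFrame({'pers_id': [4, 5, 6],
                    'gen': ['Male', 'Female', 'Not disclosed'],
                    'ethnicity': ['Chinese', 'Indian', 'European']})
df3 = pd.DataFrame({'son_id': [7, 8, 9],
                    'sex': ['Male', 'Female', 'Not disclosed'],
                    'ethnici': ['Chinese', 'Indian', 'European']})

cols = ['pid', 'gen', 'ethn']
# Relabel every frame to the target columns, then concatenate in one call
final_df = pd.concat([f.set_axis(cols, axis=1) for f in (df1, df2, df3)])
```

set_axis returns a relabeled copy, so the original frames keep their own column names.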

The cleanest solution, I think, is to rebuild df1 with df2's column names and concatenate it (DataFrame.append was removed in pandas 2.0):
df2 = pd.concat([df2, pd.DataFrame(df1.values, columns=df2.columns)])

Related

Python: merge dataframes and keep all values in cells if not identical

So I'm trying to merge multiple Excel files. Each file has different dimensions, and some files may share column names whose data is NULL, identical, or different. The script I wrote merges files of different dimensions and removes duplicated columns, keeping only the last value in the final cell. Instead, when values are not equal, I want to concatenate them so that users can manually review the duplicated data in Excel.
EXAMPLE:
User 1 has age = 24 in the df table and age = 27 in df1. I'm trying to get both values into that cell in the final consolidated output.
INPUT:
df
user  age  team
1     24   x
2     56   y
3     32   z
df = pd.DataFrame({'user': ['1', '2', '3'],
                   'age': [24, 56, 32],
                   'team': ['x', 'y', 'z']})
df1
user  age  name
1     27   Ronald
2     NaN  Eugene
4     44   Jeff
5     61   Britney
df1 = pd.DataFrame({'user': ['1', '2', '4', '5'],
                    'age': [27, np.nan, 44, 61],
                    'name': ['Ronald', 'Eugene', 'Jeff', 'Britney']})
EXPECTED OUTPUT:
CASES:
two identical values: keep one
one value is NaN: keep non NaN value
two different values: concat with a delimiter so it can be reviewed later. I will highlight it.
user  age    team  name
1     24|27  x     Ronald
2     56     y     Eugene
3     32     z     NaN
4     44     NaN   Jeff
5     61     NaN   Britney
Here's what I have so far. Users drop files into a specified folder, then the script loops through all the Excel files. The first iteration appends data into the df dataframe; every subsequent iteration merges. The issue is that I'm getting values (if not null) from the last loop ONLY.
df = pd.DataFrame()
for excel_files in FILELIST:
    if excel_files.endswith(".xlsx"):
        df1 = pd.read_excel(FILEPATH_INPUT + excel_files, dtype=str)
        print(excel_files)
        if df.empty:
            df = df.append(df1)
        else:
            df = pd.merge(df, df1, on=UNIQUE_KEY, how=JOIN_METHOD, suffixes=('', '_dupe'))
            df.drop([column for column in df.columns if '_dupe' in column], axis=1, inplace=True)
That's what the OUTPUT looks like:
user  age  team  name
1     27   x     Ronald
2     56   y     Eugene
3     32   z     NaN
4     44   NaN   Jeff
5     61   NaN   Britney
I tried looping through the columns and then concatenating. I can see the combined values in df[new_col], but it fails to update the df dataframe and the final output shows NaN.
df = pd.DataFrame()
for excel_files in FILELIST:
    if excel_files.endswith(".xlsx"):
        df1 = pd.read_excel(FILEPATH_INPUT + excel_files, dtype=str)
        #df1.set_index('uid',inplace=True)
        print(excel_files)
        #print(df1)
        #print(df1.dtypes)
        if df.empty:
            df = df.append(df1)
        else:
            df = pd.merge(df, df1, on=UNIQUE_KEY, how=JOIN_METHOD, suffixes=('', '_dupe'))
            #df.drop([column for column in df.columns if '_dupe' in column],axis=1, inplace=True)
            cols_to_remove = df.columns
            for column in cols_to_remove:
                if "_dupe" in column:
                    new_col = str(column).replace('_dupe', '')
                    df[new_col] = df[new_col].str.cat(df[column], sep='||')
                    print('New Values: ', df[new_col])
                    df.pop(column)
Any help will be appreciated. Thanks, Raf
I would merge, then apply a groupby.agg on columns:
merged = df.merge(df1, on='user', how='outer', suffixes=('', '_dupe'))
out = (merged
       .groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
       .agg('last')
)
Output:
user age team name
0 1 27.0 x Ronald
1 2 56.0 y Eugene
2 3 32.0 z None
3 4 44.0 None Jeff
4 5 61.0 None Britney
Alternative output:
out = (merged
       .groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
       .agg(lambda g: g.agg(lambda s: '|'.join(s.dropna().unique().astype(str)), axis=1))
)
Output:
user age team name
0 1 24.0|27.0 x Ronald
1 2 56.0 y Eugene
2 3 32.0 z
3 4 44.0 Jeff
4 5 61.0 Britney
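A note of caution: groupby(..., axis=1) is deprecated in recent pandas. The same three cases (equal, one missing, different) can also be handled with a plain loop over the _dupe columns; a minimal sketch, with toy inputs mirroring the question (column names assumed, strings as read_excel with dtype=str would produce):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': ['1', '2', '3'],
                   'age': ['24', '56', '32'],
                   'team': ['x', 'y', 'z']})
df1 = pd.DataFrame({'user': ['1', '2', '4', '5'],
                    'age': ['27', np.nan, '44', '61'],
                    'name': ['Ronald', 'Eugene', 'Jeff', 'Britney']})

merged = df.merge(df1, on='user', how='outer', suffixes=('', '_dupe'))

for col in [c for c in merged.columns if c.endswith('_dupe')]:
    base = col[:-len('_dupe')]
    a, b = merged[base], merged[col]
    # equal or b missing -> keep a; a missing -> keep b; different -> join with '|'
    merged[base] = np.where(a.isna(), b,
                            np.where(b.isna() | a.eq(b), a, a + '|' + b))
    merged = merged.drop(columns=col)
```

Because each result is written back into the base column before the duplicate is dropped, nothing is lost between iterations, which was the problem in the question's version.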

merging data frames without deleting unique values (Python)

I have what seems like a simple problem, but I can't figure out how to do it...
I have 3 Dataframes.
df1 : 1 column, Product SKU
df2 : 2 Columns, Product SKU, Price(supplier 1)
df3 : 2 Columns, Product SKU, Price( Supplier 2)
I need to create a df4.
df4 : 3 Columns, Product SKU, Supplier 1 Price, Supplier 2 Price
Supplier 1 and 2 have some matching SKUs.
df4 needs to contain all SKUs, and the price from each supplier. When a supplier doesn't have a price for a SKU, it should be 0 or NaN.
Any help would be great; I've tried merge(), join(), concat(), and dropping duplicates, but can't achieve the result I'm looking for.
Many Thanks in advance.
Use: df4 = df2.merge(df3, on="Product_SKU", how='outer')
Code:
I created a random dataframe df4; it contains all unique Product_SKU values, and some rows will contain NaN where the price is not present in df2 or df3.
# initialize list of lists
data2 = [['clutch', 10], ['brake', 15],['tyre',50]]
# Create the pandas DataFrame
df2 = pd.DataFrame(data2, columns=['Product_SKU', 'Price_Supplier_1'])
# initialize list of lists
data3 = [['tyre', 30], ['brake', 25],['gear',100]]
# Create the pandas DataFrame
df3 = pd.DataFrame(data3, columns=['Product_SKU', 'Price_Supplier_2'])
df4 = df2.merge(df3, on="Product_SKU", how = 'outer')
df4
Output-
Product_SKU Price_Supplier_1 Price_Supplier_2
0 clutch 10.0 NaN
1 brake 15.0 25.0
2 tyre 50.0 30.0
3 gear NaN 100.0
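The question also has a master SKU list in df1 and allows 0 for missing prices; a sketch chaining left merges onto that list (frame contents made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'Product_SKU': ['clutch', 'brake', 'tyre', 'gear', 'axle']})
df2 = pd.DataFrame({'Product_SKU': ['clutch', 'brake', 'tyre'],
                    'Price_Supplier_1': [10, 15, 50]})
df3 = pd.DataFrame({'Product_SKU': ['tyre', 'brake', 'gear'],
                    'Price_Supplier_2': [30, 25, 100]})

# Left-merge onto df1 so every SKU in the master list survives,
# then fill the missing prices with 0 as the question allows
df4 = (df1.merge(df2, on='Product_SKU', how='left')
          .merge(df3, on='Product_SKU', how='left')
          .fillna(0))
```

A SKU missing from both suppliers (axle here) ends up with 0 in both price columns.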

Left joining multiple datasets with same column headers

Not sure if this question has been answered already; please help me solve it.
I have tried my best to explain it. Please refer to the images to understand my query.
The query is: I need to left-merge a dataframe with 3 other dataframes.
The tricky part is that all the dataframes have the same column headers, and I want each same-named column to overwrite the preceding one in my output dataframe.
But when I use a left merge in Python, the column headers of all the dataframes are printed with the suffixes "_x" and "_y".
The below are my 4 dataframes:
df1 = pd.DataFrame({"Fruits":['apple','banana','mango','strawberry'],
"Price":[100,50,60,70],
"Count":[1,2,3,4],
"shop_id":['A','A','A','A']})
df2 = pd.DataFrame({"Fruits":['apple','banana','mango','chicku'],
"Price":[10,509,609,1],
"Count":[8,9,10,11],
"shop_id":['B','B','B','B']})
df3 = pd.DataFrame({"Fruits":['apple','banana','chicku'],
"Price":[1000,5090,10],
"Count":[5,6,7],
"shop_id":['C','C','C']})
df4 = pd.DataFrame({"Fruits":['apple','strawberry','mango','chicku'],
"Price":[50,51,52,53],
"Count":[11,12,13,14],
"shop_id":['D','D','D','D']})
Now I want to left join df1 with df2, df3 and df4.
from functools import reduce
data_frames = [df1, df2,df3,df4]
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['Fruits'],
how='left'), data_frames)
But this produces output like below: the same columns are repeated in the output dataset with the suffixes _x and _y.
I want only a single Price, Count and shop_id column, like below:
It looks like what you want is combine_first, not merge:
from functools import reduce
data_frames = [df1, df2,df3,df4]
df_merged = reduce(lambda left,right: right.set_index('Fruits').combine_first(left.set_index('Fruits')).reset_index(),
data_frames)
output:
Fruits Price Count shop_id
0 apple 50 11 D
1 banana 5090 6 C
2 chicku 53 14 D
3 mango 52 13 D
4 strawberry 51 12 D
To filter the output to get only the keys from df1:
df_merged.set_index('Fruits').loc[df1['Fruits']].reset_index()
output:
Fruits Price Count shop_id
0 apple 50 11 D
1 banana 5090 6 C
2 mango 52 13 D
3 strawberry 51 12 D
NB: everything would actually be easier if you set Fruits as the index.
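Following that note, with Fruits as the index the reduce step collapses to combine_first alone; a minimal sketch with two toy shop frames (data made up for illustration):

```python
import pandas as pd
from functools import reduce

# Two-shop toy version of the question's data
df1 = pd.DataFrame({'Fruits': ['apple', 'banana'], 'Price': [100, 50], 'shop_id': ['A', 'A']})
df2 = pd.DataFrame({'Fruits': ['apple', 'chicku'], 'Price': [10, 1], 'shop_id': ['B', 'B']})

frames = [d.set_index('Fruits') for d in (df1, df2)]
# Each later frame's values overwrite the earlier ones where both exist
df_merged = reduce(lambda left, right: right.combine_first(left), frames).reset_index()
```

Indexing once up front avoids the repeated set_index/reset_index round-trip inside the lambda.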

Combining two dataframes with same column

I have two dataframes.
feelingsDF with columns 'feeling', 'count', 'code'.
countryDF with columns 'feeling', 'countryCount'.
How do I make another dataframe that takes the columns from countryDF and combines them with the code column in feelingsDF?
I'm guessing you would need to somehow use the shared feeling column to combine them and make sure the same code matches the same feeling.
I want the three columns to appear as:
[feeling][countryCount][code]
You are joining the two dataframes by the column 'feeling'. Assuming you only want the entries in 'feeling' that are common to both dataframes, you would want to do an inner join.
Here is a similar example with two dfs:
x = pd.DataFrame({'feeling': ['happy', 'sad', 'angry', 'upset', 'wow'], 'col1': [1,2,3,4,5]})
y = pd.DataFrame({'feeling': ['okay', 'happy', 'sad', 'not', 'wow'], 'col2': [20,23,44,10,15]})
x.merge(y,how='inner', on='feeling')
Output:
feeling col1 col2
0 happy 1 23
1 sad 2 44
2 wow 5 15
To drop the 'count' column, select the other columns of feelingsDF, and then sort by the 'countryCount' column. Note that this will leave your index out of order, but you can reindex the combined_df afterwards.
combined_df = feelingsDF[['feeling', 'code']].merge(countryDF, how='inner', on='feeling').sort_values('countryCount')
# To reset the index after sorting:
combined_df = combined_df.reset_index(drop=True)
You can join two dataframes using pd.merge. Assuming that you want to join on the feeling column, you can use:
df= pd.merge(feelingsDF, countryDF, on='feeling', how='left')
See documentation for pd.merge to understand how to use the on and how parameters.
feelingsDF = pd.DataFrame([{'feeling': 1, 'count': 10, 'code': 'X'},
                           {'feeling': 2, 'count': 5, 'code': 'Y'},
                           {'feeling': 3, 'count': 1, 'code': 'Z'}])
feeling count code
0 1 10 X
1 2 5 Y
2 3 1 Z
countryDF = pd.DataFrame([{'feeling':1,'country':'US'},{'feeling':2,'country':'UK'},{'feeling':3,'country':'DE'}])
feeling country
0 1 US
1 2 UK
2 3 DE
df= pd.merge(feelingsDF, countryDF, on='feeling', how='left')
feeling count code country
0 1 10 X US
1 2 5 Y UK
2 3 1 Z DE

Calculate weights for grouped data in pandas

I would like to calculate portfolio weights with a pandas dataframe. Here is some dummy data for an example:
df1 = DataFrame({'name' : ['ann','bob']*3}).sort('name').reset_index(drop=True)
df2 = DataFrame({'stock' : list('ABC')*2})
df3 = DataFrame({'val': np.random.randint(10,100,6)})
df = pd.concat([df1, df2, df3], axis=1)
Each person owns 3 stocks with a value val. We can calculate portfolio weights like this:
df.groupby('name').apply(lambda x: x.val/(x.val).sum())
which gives a Series of per-person weights.
If I want to add a column wgt to df I need to merge this result back to df on name and index. This seems rather clunky.
Is there a way to do this in one step? Or what is the way to do this that best utilizes pandas features?
Use transform, this will return a series with an index aligned to your original df:
In [114]:
df['wgt'] = df.groupby('name')['val'].transform(lambda x: x/x.sum())
df
Out[114]:
name stock val wgt
0 ann A 18 0.131387
1 ann B 43 0.313869
2 ann C 76 0.554745
3 bob A 16 0.142857
4 bob B 44 0.392857
5 bob C 52 0.464286
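The transform above can also be written without a lambda by dividing by the broadcast group sum, which is usually faster; a small sketch with fixed values matching the answer's output:

```python
import pandas as pd

df = pd.DataFrame({'name': ['ann'] * 3 + ['bob'] * 3,
                   'stock': list('ABC') * 2,
                   'val': [18, 43, 76, 16, 44, 52]})

# transform('sum') broadcasts each group's total back onto its rows
df['wgt'] = df['val'] / df.groupby('name')['val'].transform('sum')
```

Each person's weights then sum to 1 by construction.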