Data frame Merge in a specific format - python

I have two dataframes and I am able to merge them, but I want the merged columns in a specific order (interleaved column-wise). Details are below.
>df1
   id   A  B  C
0   1  20  0  1
1   2  23  1  2
>df2
   id   A  B  C
0   1  10  1  1
1   2  20  1  1
Below is my code and output
df = pd.merge(df1,df2,on='id',suffixes=('_Pre', '_Post'))
The output of this is :
   id  A_Pre  B_Pre  C_Pre  A_Post  B_Post  C_Post
0   1     20      0      1      10       1       1
1   2     23      1      2      20       1       1
But the expected output is the following. Can someone help or guide me with this?
   id  A_Pre  A_Post  B_Pre  B_Post  C_Pre  C_Post
0   1     20      10      0       1      1       1
1   2     23      20      1       1      2       1

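If reordering the columns after the merge is acceptable, you can do something like this (it needs import numpy as np; "id" is prepended so that it is not dropped):
df[["id"] + np.array([[x+"_Pre", x+"_Post"] for x in df1.columns.drop("id")]).flatten().tolist()]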

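If you just want to change the order of your columns, you can use reindex. List every column you want to keep, including 'id', because reindex drops any column that is not listed:
df = df.reindex(columns=['id','A_Pre','A_Post','B_Pre','B_Post','C_Pre','C_Post'])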

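You can order the columns in the new dataset using sorted and then add the column "id" back in a second statement. The key x[:3] compares only the first three characters, so this relies on single-letter column names and on sorted being stable, which keeps each _Pre column ahead of its _Post column: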
order_col = sorted(df.columns[1:], key=lambda x:x[:3])
df_final = pd.concat([df['id'],df[order_col]], axis=1)
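
As a sketch that does not depend on the length of the column names, the interleaved order can also be built directly from the columns of df1. The frames below are reconstructed from the question; the names value_cols and ordered are only illustrative.

```python
import pandas as pd

# Sample frames reconstructed from the question
df1 = pd.DataFrame({"id": [1, 2], "A": [20, 23], "B": [0, 1], "C": [1, 2]})
df2 = pd.DataFrame({"id": [1, 2], "A": [10, 20], "B": [1, 1], "C": [1, 1]})

df = pd.merge(df1, df2, on="id", suffixes=("_Pre", "_Post"))

# Keep "id" first, then emit <col>_Pre, <col>_Post for each value column of df1
value_cols = [c for c in df1.columns if c != "id"]
ordered = ["id"] + [f"{c}{s}" for c in value_cols for s in ("_Pre", "_Post")]
df = df[ordered]
print(df)
```

Because the order is derived from df1.columns, the same lines should keep working if more value columns are added, as long as every non-key column exists in both frames.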

Related

Compare all columns value to another one with Pandas

I am having trouble with pandas.
I am trying to compare each value in a row to another value in the same row.
In the attached link you will be able to see a slice of my dataframe.
For each date I have the daily variation of some stocks.
I want to compare each stock variation to the variation of the columns labelled 'CAC 40'.
If the value is greater I want to turn it into 1, and into 0 if it is lower.
This should return a dataframe filled only with 1 or 0 so I can then summarize by columns.
I have tried the apply method, but this doesn't work.
It returns a pandas Series (attached below):
def compare_to_cac(row):
    for i in row:
        if row[i] >= row['CAC 40']:
            return 1
        else:
            return 0
data2 = data.apply(compare_to_cac, axis=1)
Please can someone help me out ?
I worked with this data (column names are not important here, only the CAC 40 one is):
A B CAC 40
0 0 2 9
1 1 3 9
2 2 4 1
3 3 5 2
4 4 7 2
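With just a for loop:
import numpy as np

for column in df.columns:
    # leave the benchmark column untouched
    if column == "CAC 40":
        continue
    condition = [df[column] > df["CAC 40"]]
    value = [1]
    # 1 where the stock beats the CAC 40, 0 otherwise
    df[column] = np.select(condition, value, default=0)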
Which gives me as a result :
A B CAC 40
0 0 0 9
1 0 0 9
2 1 1 1
3 1 1 2
4 1 1 2
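
The same comparison can probably be done without a loop by comparing all stock columns against the benchmark at once. This is a minimal sketch on the sample data above; the names stocks, flags and counts are illustrative, and it leaves the original frame unchanged instead of overwriting it.

```python
import pandas as pd

# Sample data from the answer above
df = pd.DataFrame({"A": [0, 1, 2, 3, 4],
                   "B": [2, 3, 4, 5, 7],
                   "CAC 40": [9, 9, 1, 2, 2]})

stocks = df.drop(columns="CAC 40")

# gt(..., axis=0) aligns the benchmark Series on the row index,
# so every column is compared row by row against 'CAC 40'
flags = stocks.gt(df["CAC 40"], axis=0).astype(int)

# Number of days each stock beat the benchmark
counts = flags.sum()
print(flags)
print(counts)
```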

How to multiply every column in one dataframe with all columns in other dataframe

I have two dataframes X_dummy and X_var, where X_dummy contains dummies and looks like this:
dummy1 dummy2
1 0
0 1
1 0
The X_var dataframe contains variables and looks like this:
var1 var2
4 2
10 5
1 1
Now I want to create a dataframe containing the cellwise product of every column from X_dummy with the complete X_var dataframe. Hence, my resulting dataframe should look like, X_result:
var1dummy1 var2dummy1 var1dummy2 var2dummy2
4 2 0 0
0 0 10 5
1 1 0 0
Does anyone know how to do this without using multiple for loops?
Something like NumPy broadcasting works (here df1 is X_dummy, df2 is X_var, and np is numpy):
new = pd.DataFrame(np.concatenate(df2.T.values * df1.T.values[:,None]).T)
new
Out[161]:
0 1 2 3
0 4 2 0 0
1 0 0 10 5
2 1 1 0 0
##new.columns = pd.MultiIndex.from_product([df1.columns,df2.columns]).map('_'.join)
Try:
pd.concat([(df1[i]*df2[j]).rename(f'{i}{j}') for i in df1 for j in df2], axis=1)
Output:
dummy1var1 dummy1var2 dummy2var1 dummy2var2
0 4 2 0 0
1 0 0 10 5
2 1 1 0 0
You can definitely do it with one loop:
dummies = X_dummy.astype(bool)
pd.concat([X_var.loc[dummies[c]] for c in dummies], axis=1).fillna(0).astype(int)
# var1 var2 var1 var2
#0 4 2 0 0
#1 0 0 10 5
#2 1 1 0 0
Note that because one of your dataframes contains dummies, you do not need multiplication at all.
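
For reference, here is a sketch of the broadcasting idea that also produces the column names requested in the question. The frames are rebuilt from the question; prod and cols are illustrative names, not part of any answer above.

```python
import numpy as np
import pandas as pd

X_dummy = pd.DataFrame({"dummy1": [1, 0, 1], "dummy2": [0, 1, 0]})
X_var = pd.DataFrame({"var1": [4, 10, 1], "var2": [2, 5, 1]})

# (rows, dummies, 1) * (rows, 1, vars) -> (rows, dummies, vars)
prod = X_dummy.to_numpy()[:, :, None] * X_var.to_numpy()[:, None, :]

# Flatten the last two axes; labels follow the same dummy-major order
cols = [f"{v}{d}" for d in X_dummy.columns for v in X_var.columns]
X_result = pd.DataFrame(prod.reshape(len(X_var), -1),
                        columns=cols, index=X_var.index)
print(X_result)
```

The label list is built in the same nesting order as the reshape (dummy outer, variable inner), which is what keeps names and values aligned.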

python pandas - remove duplicates in a column and keep rows according to a complex criteria

Suppose I have this DF:
s1 = pd.Series([1,1,2,2,2,3,3,3,4])
s2 = pd.Series([10,20,10,5,10,7,7,3,10])
s3 = pd.Series([0,0,0,0,1,1,0,2,0])
df = pd.DataFrame([s1,s2,s3]).transpose()
df.columns = ['id','qual','nm']
df
   id  qual  nm
0   1    10   0
1   1    20   0
2   2    10   0
3   2     5   0
4   2    10   1
5   3     7   1
6   3     7   0
7   3     3   2
8   4    10   0
I want to get a new DF in which there are no duplicate ids, so there should be 4 rows with ids 1,2,3,4. The row that should be kept should be chosen based on the following criteria: take the one with smallest nm, if equal, take the one with largest qual, if still equal, just choose one.
I figure that my code should look something like:
df.groupby('id').apply(lambda x: ???)
And it should return:
id qual nm
0 1 20 0
1 2 10 0
2 3 7 0
3 4 10 0
But not sure what my function should take and return.
Or possibly there is an easier way?
Thanks!
Use boolean indexing with GroupBy.transform to keep the rows with the minimum nm per group, then the rows with the maximum qual among those, and finally, if duplicates remain, remove them with DataFrame.drop_duplicates:
#get minimal nm
df1 = df[df['nm'] == df.groupby('id')['nm'].transform('min')]
#get maximal qual
df1 = df1[df1['qual'] == df1.groupby('id')['qual'].transform('max')]
#if still dupes get first id
df1 = df1.drop_duplicates('id')
print (df1)
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
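Use the following. Note that both conditions are evaluated against the whole group, so an id is dropped entirely if its largest qual only occurs on rows that do not have the smallest nm; on the sample data this does not happen:
grouper = df.groupby(['id'])
df.loc[(grouper['nm'].transform('min') == df['nm']) & (grouper['qual'].transform('max') == df['qual']), :].drop_duplicates(subset=['id'])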
Output
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
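
An alternative that seems simpler is to sort by the selection criteria and keep the first row per id. This is a sketch on the data from the question, not one of the original answers; result is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"id":   [1, 1, 2, 2, 2, 3, 3, 3, 4],
                   "qual": [10, 20, 10, 5, 10, 7, 7, 3, 10],
                   "nm":   [0, 0, 0, 0, 1, 1, 0, 2, 0]})

# Within each id: smallest nm first, then largest qual first.
# drop_duplicates keeps the first row it sees for every id.
result = (df.sort_values(["id", "nm", "qual"], ascending=[True, True, False])
            .drop_duplicates("id")
            .reset_index(drop=True))
print(result)
```

Since the sort applies the criteria in priority order, every id keeps exactly one row, even when the largest qual of a group sits on a row with a larger nm.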

Adding non-existing combination

I want to make a table with all available products for every customer. However, I only have a table with the combinations of product and customer that were actually bought. I want to make a new table that also includes the products that were not bought by each customer. The current table looks as follows:
The table I want to end up with is:
Could anyone help me how to do this in pandas?
One way to do this is to use pd.MultiIndex and reindex:
df = pd.DataFrame({'Product':list('ABCDEF'),
                   'Customer':[1,1,2,3,3,3],
                   'Amount':[4,5,3,1,1,2]})
indx = pd.MultiIndex.from_product([df['Product'].unique(),
                                   df['Customer'].unique()],
                                  names=['Product','Customer'])
df.set_index(['Product','Customer'])\
  .reindex(indx, fill_value=0)\
  .reset_index()\
  .sort_values(['Customer','Product'])
Output:
Product Customer Amount
0 A 1 4
3 B 1 5
6 C 1 0
9 D 1 0
12 E 1 0
15 F 1 0
1 A 2 0
4 B 2 0
7 C 2 3
10 D 2 0
13 E 2 0
16 F 2 0
2 A 3 0
5 B 3 0
8 C 3 0
11 D 3 1
14 E 3 1
17 F 3 2
You can also create a pivot to do what you want in one line. Note that the output format is different: pivot returns a wide DataFrame, with one row per product and one column per customer, rather than the long format of the original table. But if you're not especially fussed about that (it depends on how you intend to use the final table), the following code does the job.
df = pd.DataFrame({'Product':['A','B','C','D','E','F'],
                   'Customer':[1,1,2,3,3,3],
                   'Amount':[4,5,3,1,1,2]})
pivot_df = df.pivot(index='Product',
                    columns='Customer',
                    values='Amount').fillna(0).astype('int')
Output:
Customer 1 2 3
Product
A 4 0 0
B 5 0 0
C 0 3 0
D 0 0 1
E 0 0 1
F 0 0 2
df.pivot creates NaN values when there are no corresponding entries in the original df (it creates a NaN value for Product A and Customer 2, for instance). NaNs are float values, so all the 'Amounts' in the pivot are implicitly converted into floats. This is why I use fillna(0) to convert the NaN values into 0s, and then finally change the dtype back to int.
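
If the long layout is needed after pivoting, the wide table can be stacked back into one row per product and customer. A minimal sketch, assuming the same sample data; wide and long_df are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"Product": list("ABCDEF"),
                   "Customer": [1, 1, 2, 3, 3, 3],
                   "Amount": [4, 5, 3, 1, 1, 2]})

# Wide table: missing combinations become NaN, then 0
wide = (df.pivot(index="Product", columns="Customer", values="Amount")
          .fillna(0)
          .astype(int))

# stack() moves the Customer columns back into the index;
# reset_index(name=...) turns the result into ordinary columns
long_df = (wide.stack()
               .reset_index(name="Amount")
               .sort_values(["Customer", "Product"])
               .reset_index(drop=True))
print(long_df)
```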

pandas dataframe apply using additional arguments

With the example below:
df = pd.DataFrame({'signal':[1,0,0,1,0,0,0,0,1,0,0,1,0,0],'product':['A','A','A','A','A','A','A','B','B','B','B','B','B','B'],'price':[1,2,3,4,5,6,7,1,2,3,4,5,6,7],'price2':[1,2,1,2,1,2,1,2,1,2,1,2,1,2]})
I have a function "fill_price" that creates a new column 'Price_B' based on 'signal' and 'price'. Within every 'product' subgroup, Price_B equals price if 'signal' is 1, and equals the previous row's Price_B if 'signal' is 0. If the subgroup starts with a 'signal' of 0, then Price_B stays at 0 until 'signal' turns 1.
Currently I have:
def fill_price(df, signal, price_A):
    # keep the price only where the signal fires, NaN elsewhere
    p = df[price_A].where(df[signal] == 1)
    # carry the last kept price forward; leading gaps become 0
    return p.ffill().fillna(0).astype(df[price_A].dtype)
this is then applied using:
df['Price_B'] = fill_price(df,'signal','price')
However, I want to use df.groupby('product').apply() to apply this fill_price function to each 'product' subgroup separately, and also apply it to both the 'price' and 'price2' columns. Could someone help with that?
I basically want to do:
df.groupby('product', group_keys=False).apply(fill_price, 'signal', 'price2')
IIUC, you can use this syntax:
df['Price_B'] = df.groupby('product').apply(lambda x: fill_price(x,'signal','price2')).reset_index(level=0, drop=True)
Output:
    price  price2 product  signal  Price_B
0       1       1       A       1        1
1       2       2       A       0        1
2       3       1       A       0        1
3       4       2       A       1        2
4       5       1       A       0        2
5       6       2       A       0        2
6       7       1       A       0        2
7       1       2       B       0        0
8       2       1       B       1        1
9       3       2       B       0        1
10      4       1       B       0        1
11      5       2       B       1        2
12      6       1       B       0        2
13      7       2       B       0        2
You can write this much more simply without the extra function.
df['Price_B'] = (df.groupby('product', as_index=False)
                   .apply(lambda x: x['price2'].where(x.signal==1).ffill().fillna(0))
                   .reset_index(level=0, drop=True))
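
To cover both price columns, one option is to skip apply and forward-fill within each product group directly. This is a sketch rather than one of the original answers; the output column names price_B and price2_B are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "signal":  [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
    "product": ["A"] * 7 + ["B"] * 7,
    "price":   [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7],
    "price2":  [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
})

for col in ["price", "price2"]:
    # Keep the price only on rows where the signal fires
    kept = df[col].where(df["signal"].eq(1))
    # Forward-fill inside each product group; leading gaps become 0
    df[f"{col}_B"] = (kept.groupby(df["product"]).ffill()
                          .fillna(0)
                          .astype(df[col].dtype))
print(df)
```

The grouped ffill returns a Series on the original index, so no reset_index step is needed before assigning it back.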
