Python Pandas DataFrame: Matching column names to row index

I have a DataFrame containing my raw data:
          Var1  Var2         Var3
0  3090.032408  18.0  1545.016204
1  3048.781680  18.0  1524.390840
2  3090.032408  18.0  1545.016204
3  3112.086341  18.0  1556.043170
4  3075.100780  16.0  1537.550390
And a DataFrame containing values relating to the variables in my first DataFrame:
      minVal  maxVal
Var1    3045    4000
Var2      15      19
Var3    1500    1583
For every column in DF1, I need to find the corresponding row in DF2 in order to apply standardisation, where I subtract the minVal and divide by the range. Column 1 in DF1 may not correspond to row 1 in DF2 - there are more rows in DF2 than columns in DF1.
How do I loop through my columns and apply standardisation in an efficient way?
Many thanks

Thanks to Pandas' automatic index alignment, expressing this computation is remarkably easy:
(DF1 - DF2['minVal']) / (DF2['maxVal'] - DF2['minVal'])
Since DF2 has more rows than DF1 has columns, first restrict DF2 to the rows matching DF1's columns; otherwise the alignment introduces extra all-NaN columns:
import pandas as pd

DF1 = pd.DataFrame({
    'Var1': [3090.032408, 3048.78168, 3090.032408, 3112.086341, 3075.10078],
    'Var2': [18.0, 18.0, 18.0, 18.0, 16.0],
    'Var3': [1545.016204, 1524.39084, 1545.016204, 1556.04317, 1537.55039]})

DF2 = pd.DataFrame({'maxVal': [4000, 19, 1583, 10], 'minVal': [3045, 15, 1500, 11],
                    'A': [1, 2, 3, 12], 'B': [5, 6, 7, 13]},
                   index=['Var1', 'Var2', 'Var3', 'Var4'])

# keep only the rows of DF2 that correspond to DF1's columns
DF3 = DF2.loc[DF1.columns, :]
result = (DF1 - DF3['minVal']) / (DF3['maxVal'] - DF3['minVal'])
print(result)
yields
       Var1  Var2      Var3
0  0.047154  0.75  0.542364
1  0.003960  0.75  0.293866
2  0.047154  0.75  0.542364
3  0.070247  0.75  0.675219
4  0.031519  0.25  0.452414
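For what it's worth, here is a quick sketch (reusing the DF1, DF2 and result defined above) of why restricting DF2 first matters: subtracting a Series from a DataFrame aligns the Series index against the DataFrame columns and takes their union, so the raw one-liner picks up an all-NaN Var4 column.
# without restricting DF2 first, alignment adds an all-NaN Var4 column
raw = (DF1 - DF2['minVal']) / (DF2['maxVal'] - DF2['minVal'])
print(raw.columns.tolist())  # ['Var1', 'Var2', 'Var3', 'Var4']
# dropping the all-NaN columns recovers the same result
print(raw.dropna(axis=1, how='all').equals(result))  # True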

Here's a simple alternative that calculates the min, max, and range for each column on the fly - note it uses each column's own observed min and max rather than the values stored in DF2:
df2 = (df - df.min()) / (df.max() - df.min())
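As a minimal sketch of this approach applied to the DF1 from the question (the scaled name is mine):
# min-max scale each column by its own observed minimum and maximum
scaled = (DF1 - DF1.min()) / (DF1.max() - DF1.min())
print(scaled)  # every column now spans [0, 1]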

Related

Mean and standard deviation with multiple dataframes

I have multiple dataframes having the same columns and the same number of observations:
For example:
d1 = {'ID': ['A', 'B', 'C', 'D'], 'Amount': [1, 2, 3, 4]}
df1 = pd.DataFrame(data=d1)
d2 = {'ID': ['A', 'B', 'C', 'D'], 'Amount': [6, 0, 1, 5]}
df2 = pd.DataFrame(data=d2)
d3 = {'ID': ['A', 'B', 'C', 'D'], 'Amount': [8, 1, 2, 3]}
df3 = pd.DataFrame(data=d3)
I need to drop one ID (D) and its corresponding values in each of the dataframes, and then, for each remaining ID, calculate the mean and standard deviation.
The expected output should be
avg std
A 5 ...
B ... ...
C ... ...
Generally, for a single dataframe, I would drop the unwanted rows and then compute the average using mean() and the standard deviation using std().
How can I do this in an easy and fast way with multiple dataframes? (I have at least 10 of them).
Use concat, remove D with DataFrame.query, and aggregate with GroupBy.agg using named aggregations:
df = (pd.concat([df1, df2, df3])
        .query('ID != "D"')
        .groupby('ID')
        .agg(avg=('Amount', 'mean'), std=('Amount', 'std')))
print(df)

    avg       std
ID
A     5  3.605551
B     1  1.000000
C     2  1.000000
Or remove D in the last step with DataFrame.drop:
df = (pd.concat([df1, df2, df3])
        .groupby('ID')
        .agg(avg=('Amount', 'mean'), std=('Amount', 'std'))
        .drop('D'))
You can use pivot_table as well:
import numpy as np
pd.concat([df1, df2, df3]).pivot_table(index='ID', aggfunc=[np.mean, np.std]).drop('D')
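One caveat (my note, not part of the original answer): recent pandas versions emit a FutureWarning when raw NumPy callables such as np.mean are passed as aggregators; the string names behave identically and avoid it.
# string aggfunc names give the same result without the warning
pd.concat([df1, df2, df3]).pivot_table(index='ID', aggfunc=['mean', 'std']).drop('D')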

In python, convert pandas "one to many" dataset transposing rows to columns

I have two pandas datasets in "one to many" format, linked by an id column.
df1 = pd.DataFrame({'id': [1, 2],
                    'mandante': ['flamengo', 'botafogo'],
                    'visitante': ['ceara', 'são paulo'],
                    'vencedor': ['mandante', 'visitante']})
df2 = pd.DataFrame({'id': [1, 1, 2, 2],
                    'tipo': ['mandante', 'visitante', 'mandante', 'visitante'],
                    'posse': ['25%', '75%', '50%', '50%'],
                    'pontos': [25, 20, 14, 10]})
And I would like to join these datasets, adding columns to the df1 dataset for every two lines of df2 - creating a new dataset with columns made from each row of the df2 dataset, using the "tipo" column in the names...
thanks a lot!!!
Use DataFrame.pivot, then flatten the MultiIndex columns:
df2 = df2.pivot(index='id', columns='tipo')
# alternative:
# df2 = df2.set_index(['id', 'tipo']).unstack()

# flatten the MultiIndex columns into single strings
df2.columns = df2.columns.map(lambda x: f'{x[0]}_{x[1]}')
print(df2)

   posse_mandante posse_visitante  pontos_mandante  pontos_visitante
id
1             25%             75%               25                20
2             50%             50%               14                10
Then add the result to df1 with DataFrame.join:
df = df1.join(df2, on='id')
print (df)
id mandante visitante vencedor posse_mandante posse_visitante \
0 1 flamengo ceara mandante 25% 75%
1 2 botafogo são paulo visitante 50% 50%
pontos_mandante pontos_visitante
0 25 20
1 14 10
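For reference, a compact variant of the same idea as a sketch (starting from the original, unpivoted df1/df2; the wide name is mine, and a list comprehension replaces map for the flattening):
wide = df2.pivot(index='id', columns='tipo')
wide.columns = [f'{val}_{tipo}' for val, tipo in wide.columns]
df = df1.merge(wide, left_on='id', right_index=True)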

pandas: populate df column with values matching index and column in another df

I am facing a problem that I cannot find a way around.
I also find it very difficult to explain what I am trying to do, so hopefully a small example will help.
I have df1 as such:
Id    product_1  product_2
Date
1     0.1855672  0.8855672
2     0.1356667  0.0356667
3     1.1336686  1.7336686
4     0.9566671  0.6566671
and I have df2 as such:
            product_1  Month
Date
2018-03-30       11.0      3
2018-04-30       18.0      4
2019-01-29       14.0      1
2019-02-28       22.0      2
and what I am trying to achieve is this in df2:
            product_1  Month  seasonal_index
Date
2018-03-30       11.0      3       1.1336686
2018-04-30       18.0      4       0.9566671
2019-01-29       14.0      1       0.1855672
2019-02-28       22.0      2       0.1356667
So what I am trying to do is match the product name in df2 with the corresponding column in df1, and then get the value for each index value that matches the month number in df2.
I have tried things like:
for i in df1:
    df2['seasonal_index'] = df1.loc[df1.iloc[:, i] == df2['Month']]
but with no success. Hopefully someone has a clue on how to unblock the situation.
Here you are, my friend; this produces exactly the output you specified.
import pandas as pd

# replicate df1
data1 = [[0.1855672, 0.8855672],
         [0.1356667, 0.0356667],
         [1.1336686, 1.7336686],
         [0.9566671, 0.6566671]]
index1 = [1, 2, 3, 4]
df = pd.DataFrame(data=data1,
                  index=index1,
                  columns=['product_1', 'product_2'])
df.columns.name = 'Id'
df.index.name = 'Date'

# replicate df2
data2 = [[11.0, 3],
         [18.0, 4],
         [14.0, 1],
         [22.0, 2]]
index2 = [pd.Timestamp('2018-03-30'),
          pd.Timestamp('2018-04-30'),
          pd.Timestamp('2019-01-29'),
          pd.Timestamp('2019-02-28')]
df2 = pd.DataFrame(data=data2, index=index2,
                   columns=['product_1', 'Month'])
df2.index.name = 'Date'

# merge df2's Month column against df's index to pull in the matching value
df3 = pd.merge(left=df2, right=df[['product_1']],
               left_on='Month',
               right_index=True,
               how='outer',
               suffixes=('', '_df2'))
df3 = df3.rename(columns={'product_1_df2': 'seasonal_index'})
print(df3)
If you are interested in learning why this works, take a look at this link explaining the pandas.merge function. Notice specifically that for your dataframes, the key for df2 is one of its columns (so we use the left_on parameter in pd.merge) and the key for df is its index (so we use the right_index parameter in pd.merge).
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
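As a simpler alternative (a sketch of my own, not part of the original answer): Series.map performs the same index-based lookup without a merge, using the df and df2 built above.
# look up each Month in df's index and pull the matching product_1 value
df2['seasonal_index'] = df2['Month'].map(df['product_1'])
print(df2)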

How to get pandas crosstab to sum up values for multiple columns?

Let's assume we have a table like:
id   chr  val1  val2
...    A     2    10
...    B     4    20
...    A     3    30
...and we'd like to have a contingency table like this (grouped by chr, thus using 'A' and 'B' as the row indices and then summing up the values for val1 and val2):
       val1  val2  total
A         5    40     45
B         4    20     24
total     9    60     69
How can we achieve this?
pd.crosstab(index=df.chr, columns=["val1", "val2"]) looked quite promising but it just counts the rows and does not sum up the values.
I have also tried (numerous times) to supply the values manually...
pd.crosstab(
    index=df.chr.unique(),
    columns=["val1", "val2"],
    values=[
        df.groupby("chr")["val1"],
        df.groupby("chr")["val2"]
    ],
    aggfunc=sum
)
...but this always ends up in shape mismatches, and when I tried to reshape via NumPy:
values=np.array([
    df.groupby("chr")["val1"].values,
    df.groupby("chr")["val2"].values
]).reshape(-1, 2)
...crosstab tells me that it expected 1 value instead of the two given for each row.
import pandas as pd

df = pd.DataFrame({'chr': {0: 'A', 1: 'B', 2: 'A'},
                   'val1': {0: 2, 1: 4, 2: 3},
                   'val2': {0: 10, 1: 20, 2: 30}})

# aggregate values by chr
df = df.groupby('chr').sum()
# add a 'total' row (sum of each column)
df.loc['total', :] = df.sum()
# add a 'total' column (sum of each row)
df['total'] = df.sum(axis=1)
Output
       val1  val2  total
chr
A       5.0  40.0   45.0
B       4.0  20.0   24.0
total   9.0  60.0   69.0
What you want is pivot_table:
import numpy as np
table = pd.pivot_table(df, values=['val1', 'val2'], index=['chr'], aggfunc=np.sum)
table['total'] = table['val1'] + table['val2']
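Since the question asked specifically about crosstab: it can produce the full table, totals included, once the value columns are melted into long form; margins=True then adds the total row and column. A sketch on the original df (the intermediate name m is mine):
# long form: one row per (chr, variable, value) triple
m = df.melt(id_vars='chr', value_vars=['val1', 'val2'])
pd.crosstab(index=m['chr'], columns=m['variable'], values=m['value'],
            aggfunc='sum', margins=True, margins_name='total')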

Split pandas dataframe into two dataframes based on another dataframe

I have tried to search on Stackoverflow for the answer to this and while there are similar answers, I have tried to adapt the accepted answers and I'm struggling to achieve the result I want.
I have a dataframe:
df = pd.DataFrame({'Customer': ['A', 'B', 'C', 'D'],
                   'Sales': [100, 200, 300, 400],
                   'Cost': [2.25, 2.50, 2.10, 3.00]})
and another one:
split = pd.DataFrame({'Customer': ['B', 'D']})
I want to create two new dataframes from the original dataframe df: one containing the data in the split dataframe, and the other containing the data not in it. The original structure of df needs to remain in both of the newly created dataframes.
I have explored isin, merge, drop and loops, but there must be an elegant way to achieve what appears to be a simple task?
Use Series.isin with boolean indexing for filtering; ~ inverts the boolean mask:
mask = df['Customer'].isin(split['Customer'])

df1 = df[mask]
print(df1)

  Customer  Sales  Cost
1        B    200   2.5
3        D    400   3.0

df2 = df[~mask]
print(df2)

  Customer  Sales  Cost
0        A    100  2.25
2        C    300  2.10
Another solution, which also works if you need to match on multiple columns, uses DataFrame.merge (with no on parameter it joins on all common columns) with an outer join and the indicator parameter:
df4 = df.merge(split, how='outer', indicator=True)
print(df4)

  Customer  Sales  Cost     _merge
0        A    100  2.25  left_only
1        B    200  2.50       both
2        C    300  2.10  left_only
3        D    400  3.00       both
Then filter on the indicator values:
df11 = df4[df4['_merge'] == 'both']
print(df11)

  Customer  Sales  Cost _merge
1        B    200   2.5   both
3        D    400   3.0   both

df21 = df4[df4['_merge'] == 'left_only']
print(df21)

  Customer  Sales  Cost     _merge
0        A    100  2.25  left_only
2        C    300  2.10  left_only
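If the helper _merge column should not survive into the final frames, drop it after filtering (a small follow-up sketch to the merge solution above):
# same filtering, with the indicator column removed afterwards
df11 = df4[df4['_merge'] == 'both'].drop(columns='_merge')
df21 = df4[df4['_merge'] == 'left_only'].drop(columns='_merge')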
