computing correlation between values of one column - python

I have a huge dataframe that looks like this:
gemeente Partij Perioden Bevolking/Bevolkingssamenstelling op 1 januari/Totale bevolking (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Geslacht/Mannen (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Geslacht/Vrouwen (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Leeftijd/Leeftijdsgroepen/Jonger dan 5 jaar (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Leeftijd/Leeftijdsgroepen/5 tot 10 jaar (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Leeftijd/Leeftijdsgroepen/10 tot 15 jaar (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Leeftijd/Leeftijdsgroepen/15 tot 20 jaar (aantal)
0 's-Hertogenbosch VVD 2007 135648.0 66669.0 68979.0 7986.0 7809.0 7514.0 7612.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 's-Hertogenbosch VVD 2008 136481.0 67047.0 69434.0 7885.0 7853.0 7517.0 7680.0 ... 5.8 8.6 41.3 5.2 4.0 20.0 4.0 5.0 25.0 3.0
2 's-Hertogenbosch VVD 2009 137775.0 67715.0 70060.0 7915.0 7890.0 7497.0 7628.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 's-Hertogenbosch VVD 2010 139607.0 68628.0 70979.0 8127.0 7852.0 7527.0 7752.0 ... 5.6 8.4 40.7 5.4 4.0 20.0 3.0 5.0 24.0 3.0
4 Aa en Hunze PVDA 2007 25563.0 12653.0 12910.0
Partij consists of 6 possible labels and I have 270 columns.
I want to compute the correlation and/or similarity between those 6 labels in Partij with the data from those 270 columns.
I tried pd.groupby, but that only gives me correlations between columns, not between parties.
I also tried a pd.pivot_table with the Partij values as column names, but then I still had all the original column names and couldn't access the Partij names to compute a correlation.

You can make Partij values appear as columns by using the transpose method of pandas' DataFrame:
df = df.transpose()
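Transposing alone still leaves the rows as municipality/year observations, so to compare parties you also need one numeric profile per party. A minimal sketch (an assumption about one reasonable approach, not part of the answer above), assuming the dataframe is called df and aggregating each party's numeric columns by their mean:
# one row per Partij, one column per numeric variable
party_profiles = df.groupby('Partij').mean(numeric_only=True)
# transpose so each party becomes a column, then correlate the parties pairwise
party_corr = party_profiles.transpose().corr()
print(party_corr)  # 6 x 6 party-to-party correlation matrix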

Related

Pandas sum values in a group within a group

First of all, sorry for the bad title. I will illustrate better here. I have a dataframe such as this:
level 1  level 2  qty 2  level 3  qty 3  level 4  qty 4
1980     2302     1.2    nan      nan    nan      nan
1980     7117     2.4    10025    15     2343     11
1980     7117     2.4    1221     1.3    nan      nan
1870     2333     22     nan      nan    nan      nan
1870     7117     2.1    10025    12     nan      nan
1870     7117     2.1    5445     11     nan      nan
It is a flattened hierarchy that describes which components go into a product. Level 1 is the finished good (e.g. pizza) and levels 2, 3 and so on are the ingredients used to make that product. I need roughly the following logic:
df_grouped = df.groupby(by=['level 1'])
levels = [4, 3, 2]
for group in df_grouped:
    for i in levels:
        df[f'qty {i}'] = df[f'qty {i-1}'] * df[f'qty {i}'] / df.groupby(f'level {i-1}')[f'qty {i}'].transform('sum')
Okay, here is what I need to do if we, for instance, look at level 1 = 1980 and level 2 = 7117: I need to take 2.4 * 15/(15 + 1.3). The same goes for the row below: 2.4 * 1.3/(15 + 1.3).
This needs to be done for each level within each level 1 (product); a sketch of one possible implementation is shown after the expected output.
expected output:
level 1  level 2  qty 2  level 3  qty 3          level 4  qty 4
1980     2302     1.2    nan      nan            nan      nan
1980     7117     2.4    10025    2.20858895706  2343     15
1980     7117     2.4    1221     0.19141104294  nan      nan
1870     2333     22     nan      nan            nan      nan
1870     7117     2.1    10025    1.09565217391  nan      nan
1870     7117     2.1    5445     1.00434782609  nan      nan
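A minimal sketch of how the normalization described above could be implemented (this is an assumption about the intended logic, not a posted answer): within each parent group, divide qty i by the group's total and multiply by the parent quantity qty i-1.
import pandas as pd
import numpy as np

# sample data from the question
df = pd.DataFrame({
    'level 1': [1980, 1980, 1980, 1870, 1870, 1870],
    'level 2': [2302, 7117, 7117, 2333, 7117, 7117],
    'qty 2':   [1.2, 2.4, 2.4, 22, 2.1, 2.1],
    'level 3': [np.nan, 10025, 1221, np.nan, 10025, 5445],
    'qty 3':   [np.nan, 15, 1.3, np.nan, 12, 11],
    'level 4': [np.nan, 2343, np.nan, np.nan, np.nan, np.nan],
    'qty 4':   [np.nan, 11, np.nan, np.nan, np.nan, np.nan],
})

# qty i  ->  qty (i-1) * qty i / sum(qty i) within the parent group;
# grouping by all higher levels is an assumption about what the hierarchy means
for i in [3, 4]:
    parents = [f'level {j}' for j in range(1, i)]
    group_sum = df.groupby(parents)[f'qty {i}'].transform('sum')
    df[f'qty {i}'] = df[f'qty {i-1}'] * df[f'qty {i}'] / group_sum

print(df)
For level 1 = 1980 and level 2 = 7117 this reproduces 2.4 * 15/(15 + 1.3) and 2.4 * 1.3/(15 + 1.3) for qty 3, as in the example.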

How would you flip and fold diagonally a matrix with pandas?

I have some data I would like to organize for visualization and statistics, but I don't know how to proceed.
The data are in 3 columns (stimA, stimB and subjectAnswer) and 10 rows (number of pairs), and they come from a pairwise comparison test, in pandas DataFrame format. Example:
stimA  stimB  subjectAnswer
1      2      36
3      1      55
5      3      98
...    ...    ...
My goal is to organize them as a matrix, with each row and column corresponding to one stimulus and the subjectAnswer data grouped on the left side of the matrix's diagonal (in my example, the subjectAnswer 36 corresponding to stimA 1 and stimB 2 should go to index [2][1]), like this:
stimA/stimB    1     2     3     4     5
1            ...
2             36
3             55
4            ...
5            ...   ...    98
I succeeded in pivoting the first table into the matrix, but I couldn't get the data arranged on the left side of the diagonal. Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
session1 = pd.read_csv(filepath, names=['stimA', 'stimB', 'subjectAnswer'])
pivoted = session1.pivot('stimA','stimB','subjectAnswer')
Which gives:
session1 :
stimA stimB subjectAnswer
0 1 3 6
1 4 3 21
2 4 5 26
3 2 3 10
4 1 2 6
5 1 5 6
6 4 1 6
7 5 2 13
8 3 5 15
9 2 4 26
pivoted :
stimB 1 2 3 4 5
stimA
1 NaN 6.0 6.0 NaN 6.0
2 NaN NaN 10.0 26.0 NaN
3 NaN NaN NaN NaN 15.0
4 6.0 NaN 21.0 NaN 26.0
5 NaN 13.0 NaN NaN NaN
The expected output for pivoted :
stimB 1 2 3 4 5
stimA
1 NaN NaN NaN NaN NaN
2 6.0 NaN NaN NaN NaN
3 6.0 10.0 NaN NaN NaN
4 6.0 26.0 21.0 NaN NaN
5 6.0 13.0 15.0 26.0 NaN
Thanks a lot for your help!
If I understand you correctly, the stimuli A and B are interchangeable. So to get the matrix layout you want, you can swap A with B in those rows where A is smaller than B. In other words, you don't use the original A and B for the pivot table, but the maximum and minimum of A and B:
session1['stim_min'] = np.min(session1[['stimA', 'stimB']], axis=1)
session1['stim_max'] = np.max(session1[['stimA', 'stimB']], axis=1)
pivoted = session1.pivot(index='stim_max', columns='stim_min', values='subjectAnswer')
pivoted
stim_min 1 2 3 4
stim_max
2 6.0 NaN NaN NaN
3 6.0 10.0 NaN NaN
4 6.0 26.0 21.0 NaN
5 6.0 13.0 15.0 26.0
Sort the columns stimA and stimB along the column axis and assign the result to two temporary columns, x and y, in the dataframe. Sorting is required to ensure that the resulting matrix is clipped on the upper-right side of the diagonal.
Pivot the dataframe with y as the index, x as the columns and subjectAnswer as the values, then reindex the reshaped frame to ensure that all the available unique stim names are present in both the index and the columns of the matrix:
session1[['x', 'y']] = np.sort(session1[['stimA', 'stimB']], axis=1)
i = np.union1d(session1['x'], session1['y'])
session1.pivot(index='y', columns='x', values='subjectAnswer').reindex(index=i, columns=i)
x 1 2 3 4 5
y
1 NaN NaN NaN NaN NaN
2 6.0 NaN NaN NaN NaN
3 6.0 10.0 NaN NaN NaN
4 6.0 26.0 21.0 NaN NaN
5 6.0 13.0 15.0 26.0 NaN

How to get max of a slice of a dataframe based on column values?

I'm looking to make a new column, MaxPriceBetweenEntries, based on the max() of a slice of the dataframe:
idx Price EntryBar ExitBar
0 10.00 0 1
1 11.00 NaN NaN
2 10.15 2 4
3 12.14 NaN NaN
4 10.30 NaN NaN
turned into
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 10.00 0 1 11.00
1 11.00 NaN NaN NaN
2 10.15 2 4 12.14
3 12.14 NaN NaN NaN
4 10.30 NaN NaN NaN
I can get all the rows with an EntryBar or ExitBar value with df.loc[df["EntryBar"].notnull()] and df.loc[df["ExitBar"].notnull()], but I can't use that to set a new column:
df.loc[df["EntryBar"].notnull(),"MaxPriceBetweenEntries"] = df.loc[df["EntryBar"]:df["ExitBar"]]["Price"].max()
but that's effectively a guess at this point, because nothing I'm trying works. Ideally the solution wouldn't involve a loop directly because there may be millions of rows.
You can groupby the cumulative sum of non-null entries and take the max, using np.where() to apply it only to non-null rows:
df['MaxPriceBetweenEntries'] = np.where(df['EntryBar'].notnull(),
                                        df.groupby(df['EntryBar'].notnull().cumsum())['Price'].transform('max'),
                                        np.nan)
df
Out[1]:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
Let's try groupby() and where:
s = df['EntryBar'].notna()
df['MaxPriceBetweenEntries'] = df.groupby(s.cumsum())['Price'].transform('max').where(s)
Output:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
You can forward fill the null values, group by entry and get the max of that group's Price. Use that as the right side of a left join and you should be in business.
df.merge(df.ffill().groupby('EntryBar')['Price'].max().reset_index(name='MaxPriceBetweenEntries'),
         on='EntryBar',
         how='left')
Try
df.loc[df['ExitBar'].notna(), 'Max'] = df.groupby(df['ExitBar'].ffill()).Price.max().values
df
Out[74]:
idx Price EntryBar ExitBar Max
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN

Joining 3 separate DataFrames based off 3 common column values in Pandas

I'm working with an NFL dataset in pandas containing all offensive player stats for week 1 of the 2019 season. I currently have three DataFrames: one for passing stats, one for rushing stats, and one for receiving stats. I want to combine all three DataFrames into one final DataFrame. The problem is that some players appear in more than one DataFrame. For example, a QB can run and pass the ball, so some QBs appear in both the passing DF and the rushing DF. "Player" is the common index I want to combine them on, but each duplicated row will also share the 'Pos' and 'Tm' values. So I want to combine these three DataFrames on the columns 'Player', 'Tm' and 'Pos'.
I currently have each DataFrame saved to a variable in a list named dfs.
I tried
df = dfs[0].join(dfs[1:])
but that gives me a DataFrame with one row - Julian Edelman - the only player who ran, passed, and caught the ball in week 1 of the 2019 season. Suffice it to say, that's not what I'm looking for.
Copied below is the first five rows of each of the DataFrames.
Pos Tm PassingYds PassingTD Int PassingAtt Cmp
Player
Lamar Jackson QB BAL 324 5 0 20 17
Dak Prescott QB DAL 405 4 0 32 25
Robert Griffin QB BAL 55 1 0 6 6
Patrick Mahomes QB KAN 378 3 0 33 25
Kirk Cousins QB MIN 98 1 0 10 8
--------------------------------------------------------------------------
Pos Tm Rec Tgt ReceivingYds ReceivingTD
Player
Sammy Watkins WR KAN 9 11 198 3
Michael Gallup WR DAL 7 7 158 0
John Ross WR CIN 7 12 158 2
DeSean Jackson WR PHI 8 9 154 2
Marquise Brown WR BAL 4 5 147 2
---------------------------------------------------------------------------
Pos Tm RushingAtt RushingYds RushingTD
Player
Marlon Mack RB IND 25 174 1
Christian McCaffrey RB CAR 19 128 2
Saquon Barkley RB NYG 11 120 0
Dalvin Cook RB MIN 21 111 2
Mark Ingram RB BAL 14 107 2
You're looking for an outer join with Player, Pos and Tm as the index. First, append these to your index, then call your current attempt with a join type of outer:
dfs = [d.set_index(['Pos', 'Tm'], append=True) for d in dfs]
dfs[0].join(dfs[1:], how='outer')
PassingYds PassingTD Int PassingAtt Cmp Rec Tgt ReceivingYds ReceivingTD RushingAtt RushingYds RushingTD
Player Pos Tm
Christian McCaffrey RB CAR NaN NaN NaN NaN NaN NaN NaN NaN NaN 19.0 128.0 2.0
Dak Prescott QB DAL 405.0 4.0 0.0 32.0 25.0 NaN NaN NaN NaN NaN NaN NaN
Dalvin Cook RB MIN NaN NaN NaN NaN NaN NaN NaN NaN NaN 21.0 111.0 2.0
DeSean Jackson WR PHI NaN NaN NaN NaN NaN 8.0 9.0 154.0 2.0 NaN NaN NaN
John Ross WR CIN NaN NaN NaN NaN NaN 7.0 12.0 158.0 2.0 NaN NaN NaN
Kirk Cousins QB MIN 98.0 1.0 0.0 10.0 8.0 NaN NaN NaN NaN NaN NaN NaN
Lamar Jackson QB BAL 324.0 5.0 0.0 20.0 17.0 NaN NaN NaN NaN NaN NaN NaN
Mark Ingram RB BAL NaN NaN NaN NaN NaN NaN NaN NaN NaN 14.0 107.0 2.0
Marlon Mack RB IND NaN NaN NaN NaN NaN NaN NaN NaN NaN 25.0 174.0 1.0
Marquise Brown WR BAL NaN NaN NaN NaN NaN 4.0 5.0 147.0 2.0 NaN NaN NaN
Michael Gallup WR DAL NaN NaN NaN NaN NaN 7.0 7.0 158.0 0.0 NaN NaN NaN
Patrick Mahomes QB KAN 378.0 3.0 0.0 33.0 25.0 NaN NaN NaN NaN NaN NaN NaN
Robert Griffin QB BAL 55.0 1.0 0.0 6.0 6.0 NaN NaN NaN NaN NaN NaN NaN
Sammy Watkins WR KAN NaN NaN NaN NaN NaN 9.0 11.0 198.0 3.0 NaN NaN NaN
Saquon Barkley RB NYG NaN NaN NaN NaN NaN NaN NaN NaN NaN 11.0 120.0 0.0
Alternatively, you could export the data to .CSV format, merge the data there, and later import the result as a DataFrame.
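If the frames are already loaded, here is a sketch of doing the merge directly in pandas instead (assuming, as in the question, that the three frames sit in the list dfs with Player as the index):
from functools import reduce
import pandas as pd

# bring 'Player' back into a regular column so it can be used as a merge key
dfs_reset = [d.reset_index() for d in dfs]
# outer-merge on all three shared key columns so no player is dropped
combined = reduce(
    lambda left, right: pd.merge(left, right, on=['Player', 'Pos', 'Tm'], how='outer'),
    dfs_reset,
)
print(combined.head())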

rolling moving average and std dev by multiple columns dynamically

I have a dataframe like this
import pandas as pd
import numpy as np
raw_data = {'Country': ['UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','US','US','US','US','US','US'],
            'Product': ['A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','D','D','D','D','D','D'],
            'Week': [1,2,3,4,1,2,3,4,5,6,7,8,1,2,3,1,2,3,4,5,6],
            'val': [5,4,3,1,5,6,7,8,9,10,11,12,5,5,5,5,6,7,8,9,10]
            }
df2 = pd.DataFrame(raw_data, columns=['Country', 'Product', 'Week', 'val'])
print(df2)
I want to calculate the moving average and standard deviation of the val column by country and product, e.g. over 3 weeks, 5 weeks, 7 weeks, etc.
Wanted dataframe:
'Country', 'Product', 'Week', 'val', '3wks_avg', '3wks_std', '5wks_avg', '5wks_std', ... etc.
Like WenYoBen suggested, we can create a list of all the window sizes you want, and then dynamically create your wanted columns with GroupBy.rolling:
weeks = [3, 5, 7]
for week in weeks:
    df[[f'{week}wks_avg', f'{week}wks_std']] = (
        df.groupby(['Country', 'Product']).rolling(window=week, on='Week')['val']
          .agg(['mean', 'std']).reset_index(drop=True)
    )
Country Product Week val 3wks_avg 3wks_std 5wks_avg 5wks_std 7wks_avg 7wks_std
0 UK A 1 5 nan nan nan nan nan nan
1 UK A 2 4 nan nan nan nan nan nan
2 UK A 3 3 4.00 1.00 nan nan nan nan
3 UK A 4 1 2.67 1.53 nan nan nan nan
4 UK B 1 5 nan nan nan nan nan nan
5 UK B 2 6 nan nan nan nan nan nan
6 UK B 3 7 6.00 1.00 nan nan nan nan
7 UK B 4 8 7.00 1.00 nan nan nan nan
8 UK B 5 9 8.00 1.00 7.00 1.58 nan nan
9 UK B 6 10 9.00 1.00 8.00 1.58 nan nan
10 UK B 7 11 10.00 1.00 9.00 1.58 8.00 2.16
11 UK B 8 12 11.00 1.00 10.00 1.58 9.00 2.16
12 UK C 1 5 nan nan nan nan nan nan
13 UK C 2 5 nan nan nan nan nan nan
14 UK C 3 5 5.00 0.00 nan nan nan nan
15 US D 1 5 nan nan nan nan nan nan
16 US D 2 6 nan nan nan nan nan nan
17 US D 3 7 6.00 1.00 nan nan nan nan
18 US D 4 8 7.00 1.00 nan nan nan nan
19 US D 5 9 8.00 1.00 7.00 1.58 nan nan
20 US D 6 10 9.00 1.00 8.00 1.58 nan nan
This is how you would get the moving average for 3 weeks:
df['3weeks_avg'] = list(df.groupby(['Country', 'Product']).rolling(3).mean()['val'])
Apply the same principle for the other columns you want to compute.
IIUC, you may try this:
wks = ['Week_3', 'Week_5', 'Week_7']
df_calc = (df2.groupby(['Country', 'Product']).expanding().val
              .agg(['mean', 'std']).rename(lambda x: f'Week_{x+1}', level=-1)
              .query('ilevel_2 in @wks').unstack())
Out[246]:
mean std
Week_3 Week_5 Week_7 Week_3 Week_5 Week_7
Country Product
UK A 4.0 NaN NaN 1.0 NaN NaN
B NaN 5.0 6.0 NaN NaN 1.0
You will want to use a groupby-transform to get the rolling moments of your data. The following should compute what you are looking for:
weeks = [3, 5, 7]  # define weeks
df2 = df2.sort_values('Week')  # order by time
for i in weeks:  # loop through the time intervals you want to compute
    df2['{}wks_avg'.format(i)] = df2.groupby(['Country', 'Product'])['val'].transform(lambda x: x.rolling(i).mean())  # i-week rolling mean
    df2['{}wks_std'.format(i)] = df2.groupby(['Country', 'Product'])['val'].transform(lambda x: x.rolling(i).std())  # i-week rolling std
Here is what the resulting dataframe will look like.
print(df2.dropna().head().to_string())
Country Product Week val 3wks_avg 3wks_std 5wks_avg 5wks_std 7wks_avg 7wks_std
17 US D 3 7 6.0 1.0 6.0 1.0 6.0 1.0
6 UK B 3 7 6.0 1.0 6.0 1.0 6.0 1.0
14 UK C 3 5 5.0 0.0 5.0 0.0 5.0 0.0
2 UK A 3 3 4.0 1.0 4.0 1.0 4.0 1.0
7 UK B 4 8 7.0 1.0 7.0 1.0 7.0 1.0
