I have a pandas pivot table that was previously shifted and now looks like this:
pivot
A B C D E
0 5.3 5.1 3.5 4.2 4.5
1 5.3 4.1 3.5 4.2 NaN
2 4.3 4.1 3.5 NaN NaN
3 4.3 4.1 NaN NaN NaN
4 4.3 NaN NaN NaN NaN
I'm trying to calculate a rolling average with a variable window (in this case 3 and 4 periods) over the inverse diagonal, iterating over every column, and store that value in a new dataframe, which would look like this:
expected_df with a 3-period window
A B C D E
0 4.3 4.1 3.5 4.2 4.5
expected_df with a 4-period window
A B C D E
0 4.5 4.3 3.5 4.2 4.5
So far, I have tried to subset the original pivot table into a different dataframe that only contains the specified window values for each column, and then calculate the average, like this:
subset
A B C D E
0 4.3 4.1 3.5 4.2 4.5
1 4.3 4.1 3.5 4.2 NaN
2 4.3 4.1 3.5 NaN NaN
For this, I tried to build the following for loop:
df2 = pd.DataFrame()
size = pivot.shape[0]
window = 3
for i in range(size):
    df2[i] = pivot.iloc[size-window-i:size-i, i]
This does not work: pivot.iloc[size-window-i:size-i, i] returns the values I need when I pass the indexes manually, but inside the for loop it misses the first value of the second column, and so on:
df2
A B C D E
0 4.3 NaN NaN NaN NaN
1 4.3 4.1 NaN NaN NaN
2 4.3 4.1 3.5 NaN NaN
Does anyone have a good idea on how to calculate the moving average or on how to fix the for loop part? Thanks in advance for your comments.
IIUC:
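For reference, here is a minimal sketch rebuilding your pivot (I call it df below, with the values copied from the question) so the snippets can be run as-is:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5.3, 5.3, 4.3, 4.3, 4.3],
                   'B': [5.1, 4.1, 4.1, 4.1, np.nan],
                   'C': [3.5, 3.5, 3.5, np.nan, np.nan],
                   'D': [4.2, 4.2, np.nan, np.nan, np.nan],
                   'E': [4.5, np.nan, np.nan, np.nan, np.nan]})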
Shift everything back:
shifted = pd.concat([df.iloc[:, i].shift(i) for i in range(df.shape[1])], axis=1)
shifted
A B C D E
0 5.3 NaN NaN NaN NaN
1 5.3 5.1 NaN NaN NaN
2 4.3 4.1 3.5 NaN NaN
3 4.3 4.1 3.5 4.2 NaN
4 4.3 4.1 3.5 4.2 4.5
Then you can get your mean.
# Change this 🡇 to get the last n number of rows
shifted.iloc[-3:].mean()
A 4.3
B 4.1
C 3.5
D 4.2
E 4.5
dtype: float64
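If the window really needs to be variable (3 vs 4 periods), a minimal sketch is to wrap the same idea in a small helper; diag_mean is just an illustrative name, not part of pandas:

def diag_mean(df, window):
    # shift each column back by its position, then average the last `window` rows
    shifted = pd.concat([df.iloc[:, i].shift(i) for i in range(df.shape[1])], axis=1)
    return shifted.iloc[-window:].mean().to_frame().T

diag_mean(df, 3)   # one-row frame, 3-period window
diag_mean(df, 4)   # one-row frame, 4-period window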
Or the rolling mean
# Change this 🡇 to set the rolling window size n
shifted.rolling(3, min_periods=1).mean()
A B C D E
0 5.300000 NaN NaN NaN NaN
1 5.300000 5.100000 NaN NaN NaN
2 4.966667 4.600000 3.5 NaN NaN
3 4.633333 4.433333 3.5 4.2 NaN
4 4.300000 4.100000 3.5 4.2 4.5
Numpy strides
I'll use strides to construct a 3-D array and average over one of the axes. This is faster but confusing as all ...
Also, I wouldn't use this. I just wanted to nail down how to grab diagonal elements via strides. This was more practice for me and I wanted to share.
import numpy as np
from numpy.lib.stride_tricks import as_strided as strided

a = df.values
roll = 3
r_ = roll - 1 # one less than roll
h, w = a.shape
w_ = w - 1 # one less than width
b = np.empty((h + 2 * w_ + r_, w), dtype=a.dtype)
b.fill(np.nan)
b[w_ + r_:-w_] = a
s0, s1 = b.strides
# the last-axis stride s1 - s0 steps one column right AND one row up, tracing the anti-diagonal
a_ = np.nanmean(strided(b, (h + w_, roll, w), (s0, s0, s1 - s0))[w_:], axis=1)
pd.DataFrame(a_, df.index, df.columns)
A B C D E
0 5.300000 NaN NaN NaN NaN
1 5.300000 5.100000 NaN NaN NaN
2 4.966667 4.600000 3.5 NaN NaN
3 4.633333 4.433333 3.5 4.2 NaN
4 4.300000 4.100000 3.5 4.2 4.5
Numba
I feel better about this than I do using strides
import numpy as np
from numba import njit
import warnings
@njit
def dshift(a, roll):
    h, w = a.shape
    b = np.empty((h, roll, w), dtype=np.float64)
    b.fill(np.nan)
    for r in range(roll):
        for i in range(h):
            for j in range(w):
                k = i - j - r
                if k >= 0:
                    b[i, r, j] = a[k, j]
    return b
with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=RuntimeWarning)
    # a is df.values, as in the strides example above
    df_ = pd.DataFrame(np.nanmean(dshift(a, 3), axis=1), df.index, df.columns)
df_
A B C D E
0 5.300000 NaN NaN NaN NaN
1 5.300000 5.100000 NaN NaN NaN
2 4.966667 4.600000 3.5 NaN NaN
3 4.633333 4.433333 3.5 4.2 NaN
4 4.300000 4.100000 3.5 4.2 4.5
Related
I have a pandas dataframe that looks like this:
X Y Z
0 9.5 -2.3 4.13
1 17.5 3.3 0.22
2 NaN NaN -5.67
...
I want to add 2 more columns: Is Invalid and Is Outlier.
Is Invalid will keep track of the number of invalid/NaN values in a given row. So for the row with index 2, Is Invalid will have a value of 2. For rows with only valid entries, Is Invalid will display 0.
Is Outlier will just check whether that given row has outlier data. This will just be True/False.
At the moment, this is my code:
dt = np.fromfile(path, dtype='float')
df = pd.DataFrame(dt.reshape(-1, 3), columns=['X', 'Y', 'Z'])
How can I go about adding these features?
x='''Z,Y,X,W,V,U,T
1,2,3,4,5,6,60
17.5,3.3,.22,22.11,-19,44,0
,,-5.67,,,,
'''
import pandas as pd, io, scipy.stats
df = pd.read_csv(io.StringIO(x))
df
Sample input:
Z Y X W V U T
0 1.0 2.0 3.00 4.00 5.0 6.0 60.0
1 17.5 3.3 0.22 22.11 -19.0 44.0 0.0
2 NaN NaN -5.67 NaN NaN NaN NaN
Transformations:
df['is_invalid'] = df.isna().sum(axis=1)
df['is_outlier'] = (
    df.iloc[:, :-1]
      .apply(lambda r: (r < (r.quantile(0.25) - 1.5 * scipy.stats.iqr(r)))
                     | (r > (r.quantile(0.75) + 1.5 * scipy.stats.iqr(r))),
             axis=1)
      .sum(axis=1)
)
df
Final output:
Z Y X W V U T is_invalid is_outlier
0 1.0 2.0 3.00 4.00 5.0 6.0 60.0 0 1
1 17.5 3.3 0.22 22.11 -19.0 44.0 0.0 0 0
2 NaN NaN -5.67 NaN NaN NaN NaN 6 0
Explanation for outlier:
The valid range is from Q1 - 1.5*IQR to Q3 + 1.5*IQR.
Since it needs to be calculated per row, we used apply and passed each row (r). To count outliers, we flipped the range: anything less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR is counted.
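To make the per-row bounds concrete, here is a minimal sketch for the first row only (the column list is taken from the sample input; scipy is assumed to be installed):

import scipy.stats

r = df[['Z', 'Y', 'X', 'W', 'V', 'U', 'T']].iloc[0]   # first row of the original columns
q1, q3 = r.quantile(0.25), r.quantile(0.75)
iqr = scipy.stats.iqr(r)                              # equivalent to q3 - q1 here
outliers = (r < q1 - 1.5 * iqr) | (r > q3 + 1.5 * iqr)
print(outliers.sum())                                 # counts the single outlier (the value 60 in T)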
I happen to have a dataset that looks like this:
A-B A-B A-B A-B A-B B-A B-A B-A B-A B-A
2 3 2 4 5 3.1 3 2 2.5 2.6
NaN 3.2 3.3 3.5 5.2 NaN 4 2.7 3.2 5
NaN NaN 4.1 4 6 NaN NaN 4 4.1 6
NaN NaN NaN 4.2 5.1 NaN NaN NaN 3.5 5.2
NaN NaN NaN NaN 6 NaN NaN NaN NaN 5.7
It's very bad, I know. But what I would like to obtain is:
A-B B-A
2 3.1
3.2 4
4.1 4
4.2 3.5
6 5.7
These are the values on the "diagonals" of each block.
Is there a way I can get something like this?
You could use groupby and a dictionary comprehension with numpy.diag:
df2 = pd.DataFrame({x: np.diag(g) for x, g in df.groupby(level=0, axis=1)})
output:
A-B B-A
0 2.0 3.1
1 3.2 4.0
2 4.1 4.0
3 4.2 3.5
4 6.0 5.7
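If you want to reproduce this locally, a minimal sketch building a frame with the duplicated column names (values copied from the question):

import numpy as np
import pandas as pd

data = [[2, 3, 2, 4, 5, 3.1, 3, 2, 2.5, 2.6],
        [np.nan, 3.2, 3.3, 3.5, 5.2, np.nan, 4, 2.7, 3.2, 5],
        [np.nan, np.nan, 4.1, 4, 6, np.nan, np.nan, 4, 4.1, 6],
        [np.nan, np.nan, np.nan, 4.2, 5.1, np.nan, np.nan, np.nan, 3.5, 5.2],
        [np.nan, np.nan, np.nan, np.nan, 6, np.nan, np.nan, np.nan, np.nan, 5.7]]
df = pd.DataFrame(data, columns=['A-B'] * 5 + ['B-A'] * 5)

Note that on recent pandas versions axis=1 in groupby is deprecated; grouping the transpose gives the same result, since np.diag of a square block is unchanged by transposition: pd.DataFrame({x: np.diag(g) for x, g in df.T.groupby(level=0)})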
Another option is to convert to long form, and then drop duplicates: this can be achieved with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(
    df
    .pivot_longer(names_to=".value",
                  names_pattern=r"(.+)",
                  ignore_index=False)
    .dropna()
    .loc[lambda df: ~df.index.duplicated()]
)
A-B B-A
0 2.0 3.1
1 3.2 4.0
2 4.1 4.0
3 4.2 3.5
4 6.0 5.7
@mozway's solution should be faster though, as it avoids building a large number of rows only to prune them, which is what this option does.
I am encountering a strange problem with the df.sub function in pandas. I wrote simple code to subtract a reference column from the df columns.
import pandas as pd
def normalize(df, col):
    '''Enter the column value in "col".'''
    return df.sub(df[col], axis=0)
df = pd.read_csv('norm_debug.txt', sep='\t', index_col=0); print(df.head(3))
new = normalize(df,'A'); print(new.head(3))
The output of this code is the following, as expected:
df:
A B C D E
target_id
one 10.0 3 20 10 1
two 10.0 4 30 10 1
three 6.7 5 40 10 1
A B C D E
target_id
one 0.0 -7.0 10.0 0.0 -9.0
two 0.0 -6.0 20.0 0.0 -9.0
three 0.0 -1.7 33.3 3.3 -5.7
But when I put this into an executable script using argparse, I get all NaNs!
import argparse
import platform
import os
import pandas as pd
def normalize(df, col):
    '''Normalize the log table with desired column,
    Enter the column value in "col".'''
    return df.sub(df[col], axis=0)
parser = argparse.ArgumentParser(
    description='''Manipulate tables ''',
    usage='python3 %(prog)s -e input.tsv [-nm col_name] -op output.tsv',
    epilog='''Short prog. desc:\
Pass the expression matrix to filter, log2(val) etc.,''')
parser.add_argument("-e","--expr", metavar='', required=True, help="tab-delimited expression matrix file")
parser.add_argument("-op","--outprefix", metavar='', required=True, help="output file prefix")
parser.add_argument("-nm","--norm", metavar='', required=True, nargs=1, type=str, help="Normalize table based on column chosen")
args=parser.parse_args()
print(args)
if os.path.isfile(args.expr):
    df = pd.read_csv(args.expr, sep='\t', index_col=0); print(df.head(3))
    if args.norm:
        norm_df = normalize(df, args.norm); print(norm_df.head(3))
        outfile = args.outprefix + ".normalized.tsv"
        norm_df.to_csv(outfile, sep='\t'); print("Normalized table written to ", outfile)
    else:
        print("Provide valid option...")
else:
    print("Please provide proper input..")
Output for this execution is:
python norm_debug.py -e norm_debug.txt -nm A -op norm_debug
A B C D E
target_id
one 10.0 3 20 10 1
two 10.0 4 30 10 1
three 6.7 5 40 10 1
A B C D E
target_id
one 0.0 NaN NaN NaN NaN
two 0.0 NaN NaN NaN NaN
three 0.0 NaN NaN NaN NaN
I use Python version 3.6.7 and pandas version 1.1.2. The first (hard-coded) version was executed in a Jupyter notebook, while the argparse version was executed in a standard terminal. What is the issue here?
Thanks in advance.
args.norm has been parsed as a list ['A'], not as a scalar 'A' (because of the option nargs=1). Remove that option.
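A minimal standalone sketch of the difference (the flag names here are just for illustration):

import argparse

p = argparse.ArgumentParser()
p.add_argument('--with-nargs', nargs=1)   # nargs=1 always produces a one-element list
p.add_argument('--without-nargs')         # the default produces a plain string

args = p.parse_args(['--with-nargs', 'A', '--without-nargs', 'A'])
print(args.with_nargs)     # ['A']
print(args.without_nargs)  # 'A'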
The problem is that you subtract a one-column DataFrame, like:
new = normalize(df,['A'])
print (new)
A B C D E
target_id
one 0.0 NaN NaN NaN NaN
two 0.0 NaN NaN NaN NaN
three 0.0 NaN NaN NaN NaN
print (df.sub(df[['A']], axis=0))
A B C D E
target_id
one 0.0 NaN NaN NaN NaN
two 0.0 NaN NaN NaN NaN
three 0.0 NaN NaN NaN NaN
This happens because the col parameter is a one-element list like [col_name], not a string like col_name.
If changing that is not possible, you can adapt the function with DataFrame.squeeze:
def normalize(df, col):
    '''Enter the column value in "col".'''
    return df.sub(df[col].squeeze(), axis=0)
# df = pd.read_csv('norm_debug.txt', sep='\t', index_col=0); print(df.head(3))
new = normalize(df,['A'])
print (new)
A B C D E
target_id
one 0.0 -7.0 10.0 0.0 -9.0
two 0.0 -6.0 20.0 0.0 -9.0
three 0.0 -1.7 33.3 3.3 -5.7
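For context: df[['A']] is a one-column DataFrame, so the subtraction aligns on column labels and every column except A becomes NaN; squeeze() converts it to a Series, so df.sub(..., axis=0) aligns on the index as intended. A quick check (same df as above):

print(type(df[['A']]))            # <class 'pandas.core.frame.DataFrame'>
print(type(df[['A']].squeeze()))  # <class 'pandas.core.series.Series'>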
Or use the solution from @DYZ's answer.
I have two columns whose data overlap for some entries (and are nearly identical when they do).
df = pd.DataFrame(
{'x':[2.1,3.1,5.4,1.9,np.nan,4.3,np.nan,np.nan,np.nan],
'y':[np.nan,np.nan,5.3,1.9,3.2,4.2,9.1,7.8,4.1]
}
)
I want the result to be a column 'xy' that contains the average of x and y when both have values, and x or y when only one of them has a value, like this:
df['xy']=[2.1,3.1,5.35,1.9,3.2,4.25,9.1,7.8,4.1]
Here you go:
Solution
df['xy'] = df[['x','y']].mean(axis=1)
Output
print(df.to_string())
x y xy
0 2.1 NaN 2.10
1 3.1 NaN 3.10
2 5.4 5.3 5.35
3 1.9 1.9 1.90
4 NaN 3.2 3.20
5 4.3 4.2 4.25
6 NaN 9.1 9.10
7 NaN 7.8 7.80
8 NaN 4.1 4.10
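This works because DataFrame.mean skips NaN by default (skipna=True): a row where only one of x or y is present just returns that value, and a row where both are present returns their average. Making the default explicit:

df['xy'] = df[['x', 'y']].mean(axis=1, skipna=True)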
I need to rid myself of all rows with a null value in column C. Here is the code:
infile="C:\****"
df=pd.read_csv(infile)
A B C D
1 1 NaN 3
2 3 7 NaN
4 5 NaN 8
5 NaN 4 9
NaN 1 2 NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is an NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value using both column index number and column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, np.nan, 3],
                   [2, 3, 7, np.nan],
                   [4, 5, np.nan, 8],
                   [5, np.nan, 4, 9],
                   [np.nan, 1, 2, np.nan]],
                  columns=['A', 'B', 'C', 'D'])
df = df[df['C'].notnull()]
df
This is just proof that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
A B C D
0 1.0 1.0 NaN 3.0
1 2.0 3.0 7.0 NaN
2 4.0 5.0 NaN 8.0
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [101]: df.dropna(subset=['C'])
Out[101]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [102]: df[df.C.notnull()]
Out[102]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [103]: df = df[df.C.notnull()]
In [104]: df
Out[104]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN