I happen to have a dataset that looks like this:
A-B A-B A-B A-B A-B B-A B-A B-A B-A B-A
2 3 2 4 5 3.1 3 2 2.5 2.6
NaN 3.2 3.3 3.5 5.2 NaN 4 2.7 3.2 5
NaN NaN 4.1 4 6 NaN NaN 4 4.1 6
NaN NaN NaN 4.2 5.1 NaN NaN NaN 3.5 5.2
NaN NaN NaN NaN 6 NaN NaN NaN NaN 5.7
It's very bad, I know. But what I would like to obtain is:
A-B B-A
2 3.1
3.2 4
4.1 4
4.2 3.5
6 5.7
These are the values on the "diagonals" of each group of columns.
Is there a way I can get something like this?
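For anyone reproducing this, here is a minimal reconstruction of the frame (assuming the repeated headers are duplicate column labels, five 'A-B' followed by five 'B-A'):
import numpy as np
import pandas as pd

# five duplicate 'A-B' labels followed by five duplicate 'B-A' labels
data = [
    [2, 3, 2, 4, 5, 3.1, 3, 2, 2.5, 2.6],
    [np.nan, 3.2, 3.3, 3.5, 5.2, np.nan, 4, 2.7, 3.2, 5],
    [np.nan, np.nan, 4.1, 4, 6, np.nan, np.nan, 4, 4.1, 6],
    [np.nan, np.nan, np.nan, 4.2, 5.1, np.nan, np.nan, np.nan, 3.5, 5.2],
    [np.nan, np.nan, np.nan, np.nan, 6, np.nan, np.nan, np.nan, np.nan, 5.7],
]
df = pd.DataFrame(data, columns=['A-B'] * 5 + ['B-A'] * 5)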
You could use groupby and a dictionary comprehension with numpy.diag:
df2 = pd.DataFrame({x: np.diag(g) for x, g in df.groupby(level=0, axis=1)})
output:
A-B B-A
0 2.0 3.1
1 3.2 4.0
2 4.1 4.0
3 4.2 3.5
4 6.0 5.7
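If your pandas version warns that axis=1 in groupby is deprecated, the same idea works on the transpose, since transposing a square block leaves its diagonal unchanged:
df2 = pd.DataFrame({x: np.diag(g) for x, g in df.T.groupby(level=0)})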
Another option is to convert to long form, and then drop duplicates: this can be achieved with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(
    df
    .pivot_longer(names_to=".value",
                  names_pattern=r"(.+)",
                  ignore_index=False)
    .dropna()
    .loc[lambda df: ~df.index.duplicated()]
)
A-B B-A
0 2.0 3.1
1 3.2 4.0
2 4.1 4.0
3 4.2 3.5
4 6.0 5.7
@mozway's solution should be faster though, as it avoids building a large number of rows only to prune them, which is what this option does.
I've got a dataset with an insanely high sampling rate, and I would like to remove excess data where the column value changes less than a predefined amount from one row to the next. However, some intermediary points need to be kept in order not to lose all the data.
e.g.
t V
0 1.0 1.0
1 2.0 1.2
2 3.0 2.0
3 3.3 3.0
4 3.4 4.0
5 3.7 4.2
6 3.8 4.6
7 4.4 5.4
8 5.1 6.0
9 6.0 7.0
10 7.0 10.0
Now I want to delete all the rows where the change in V from one row to another is less than dV, AND the change in t is below dt, but still keep datapoints such that there is data at roughly every interval dV or dt.
Let's say for dV = 1 and dt = 1, the wanted output would be:
t V
0 1.0 1.0
1 2.0 1.2
2 3.0 2.0
3 3.3 3.0
4 3.4 4.0
7 4.4 5.4
9 6.0 7.0
10 7.0 10.0
Meaning rows 5, 6 and 8 were deleted since their changes were within the thresholds, but row 7 remains since its change exceeds dt and dV in both directions (relative to the kept rows on either side).
The easy solution is iterating over the rows in the dataframe, but a faster (and more proper) solution is wanted.
EDIT:
The question was edited to reflect the point that intermediary points must be kept in order to not delete too much.
Use DataFrame.diff with boolean indexing:
dV = 1
dt = 1
df = df[~(df['t'].diff().lt(dt) & df['V'].diff().lt(dV))]
print (df)
t V
0 1.0 1.0
1 2.0 1.2
2 3.0 2.0
3 3.3 3.0
4 3.4 4.0
9 6.0 7.0
10 7.0 10.0
Or:
dV = 1
dt = 1
df1 = df.diff()
df = df[df1['t'].fillna(dt).ge(dt) | df1['V'].fillna(dV).ge(dV)]
print (df)
t V
0 1.0 1.0
1 2.0 1.2
2 3.0 2.0
3 3.3 3.0
4 3.4 4.0
9 6.0 7.0
10 7.0 10.0
You might want to use the shift() method:
diff_df = df - df.shift()
and then filter rows with loc. Note the parentheses: & binds tighter than the comparison operators, so they are required; also negate the "both changes small" condition and index into df rather than diff_df:
df = df.loc[~((diff_df['t'] < 1.0) & (diff_df['V'] < 1.0))]
You can use loc for boolean indexing, doing the comparison between consecutive rows within each column using shift():
# Thresholds
dv = 1
dt = 1
# Filter out
print(df.loc[~((df.V.sub(df.V.shift()) < dv) & (df.t.sub(df.t.shift()) < dt))])
t V
0 1.0 1.0
1 2.0 1.2
2 3.0 2.0
3 3.3 3.0
4 3.4 4.0
9 6.0 7.0
10 7.0 10.0
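Note that the diff/shift answers above compare each row only with its immediate predecessor in the original frame, so on the sample data they also drop row 7, which the edited question wants kept. Keeping intermediary points means comparing against the last kept row, which is inherently sequential. Here is a minimal sketch of that loop (the function name thin is mine, and it assumes t and V only increase, as in the sample; use abs() on the differences otherwise). It is not vectorized, but it reproduces the wanted output, keeping rows 0-4, 7, 9 and 10:
def thin(df, dt=1.0, dV=1.0):
    # always keep the first row, then keep a row once the accumulated
    # change since the last *kept* row reaches either threshold
    keep = [0]
    last_t, last_V = df['t'].iloc[0], df['V'].iloc[0]
    for i in range(1, len(df)):
        t, V = df['t'].iloc[i], df['V'].iloc[i]
        if (t - last_t) >= dt or (V - last_V) >= dV:
            keep.append(i)
            last_t, last_V = t, V
    return df.iloc[keep]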
I have two columns whose data overlaps for some entries (and is nearly identical when it does).
df = pd.DataFrame(
    {'x': [2.1, 3.1, 5.4, 1.9, np.nan, 4.3, np.nan, np.nan, np.nan],
     'y': [np.nan, np.nan, 5.3, 1.9, 3.2, 4.2, 9.1, 7.8, 4.1]}
)
I want the result to be a column 'xy' that contains the average of x and y when both have values, and x or y when only one of them has a value, like this:
df['xy']=[2.1,3.1,5.35,1.9,3.2,4.25,9.1,7.8,4.1]
Here you go:
Solution
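# mean(axis=1) skips NaN by default (skipna=True), so a row where only
# one of x/y has a value simply gets that value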
df['xy'] = df[['x','y']].mean(axis=1)
Output
print(df.to_string())
x y xy
0 2.1 NaN 2.10
1 3.1 NaN 3.10
2 5.4 5.3 5.35
3 1.9 1.9 1.90
4 NaN 3.2 3.20
5 4.3 4.2 4.25
6 NaN 9.1 9.10
7 NaN 7.8 7.80
8 NaN 4.1 4.10
I have two .csv files, "train_id.csv" and "train_ub.csv", which I want to load as pandas dataframes. Their dimensions are different, but they have only one column in common. Let's say:
train_id:
ID id_01 id_02 id_03 id_04
1 0.0 1.0 5.2 7.1
2 0.5 7.7 3.4 9.8
5 1.5 0.8 1.6 2.5
7 3.0 0.2 3.4 6.3
8 5.5 1.8 7.5 7.0
9 7.2 2.6 9.1 1.1
11 9.5 3.5 2.2 0.3
while train_ub:
ID ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 9.2 8.3
2 1.5 2.7 0.4 4.9
3 2.7 4.8 7.6 3.7
4 4.8 9.2 2.4 5.4
6 6.0 5.8 5.5 0.6
10 9.1 3.6 4.1 2.0
11 7.3 7.5 0.2 9.5
One may see that they share the first column, but each dataframe is missing some of the other's ID values. Is there a way in pandas to merge them column-wise in order to get a dataframe of the form:
ID id_01 id_02 id_03 id_04 ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 5.2 7.1 0.0 1.0 9.2 8.3
2 0.5 7.7 3.4 9.8 1.5 2.7 0.4 4.9
3 NaN NaN NaN NaN 2.7 4.8 7.6 3.7
4 NaN NaN NaN NaN 4.8 9.2 2.4 5.4
5 1.5 0.8 1.6 2.5 NaN NaN NaN NaN
6 NaN NaN NaN NaN 6.0 5.8 5.5 0.6
7 3.0 0.2 3.4 6.3 NaN NaN NaN NaN
8 5.5 1.8 7.5 7.0 NaN NaN NaN NaN
9 7.2 2.6 9.1 1.1 NaN NaN NaN NaN
10 NaN NaN NaN NaN 9.1 3.6 4.1 2.0
11 9.5 3.5 2.2 0.3 9.5 3.5 2.2 0.3
PS: Note that this is an oversimplified example; the real dataframes have the shapes id (144233, 41) and ub (590540, 394).
You could accomplish this using an outer join. Here is the code for it:
train_id = pd.read_csv("train_id.csv")
train_ub = pd.read_csv("train_ub.csv")
train_merged = train_id.merge(train_ub, on="ID", how="outer")
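Note that with the default sort=False the key order of an outer merge is not guaranteed; to get the result ordered by ID as in the desired output, pass sort=True:
train_merged = train_id.merge(train_ub, on="ID", how="outer", sort=True)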
I have a pandas series of keys and would like to create a dataframe by selecting values from other dataframes.
eg.
data_df = pandas.DataFrame({'key' : ['a','b','c','d','e','f'],
'value1': [1.1,2,3,4,5,6],
'value2': [7.1,8,9,10,11,12]
})
keys = pandas.Series(['a','b','a','c','e','f','a','b','c'])
data_df
# key value1 value2
#0 a 1.1 7.1
#1 b 2.0 8.0
#2 c 3.0 9.0
#3 d 4.0 10.0
#4 e 5.0 11.0
#5 f 6.0 12.0
I would like to get the result like this
result
key value1 value2
0 a 1.1 7.1
1 b 2.0 8.0
2 a 1.1 7.1
3 c 3.0 9.0
4 e 5.0 11.0
5 f 6.0 12.0
6 a 1.1 7.1
7 b 2.0 8.0
8 c 3.0 9.0
One way I have successfully done this is by using apply:
def append_to_series(key):
    return data_df[data_df['key'] == key].iloc[0]

result = keys.apply(append_to_series)
However, this function is very slow and not clean. Is there a way to do this more efficiently?
Convert the series into a dataframe with column name key, then use pd.merge() to bring in value1 and value2:
keys = pd.DataFrame(['a','b','a','c','e','f','a','b','c'],columns=['key'])
res = pd.merge(keys,data_df,on=['key'],how='left')
print(res)
key value1 value2
0 a 1.1 7.1
1 b 2.0 8.0
2 a 1.1 7.1
3 c 3.0 9.0
4 e 5.0 11.0
5 f 6.0 12.0
6 a 1.1 7.1
7 b 2.0 8.0
8 c 3.0 9.0
Create an index from the key column and then use DataFrame.reindex or DataFrame.loc:
Note: this requires the values of the original key column to be unique. Also, loc raises a KeyError if any key is missing from the index, while reindex fills such rows with NaN.
df = data_df.set_index('key').reindex(keys.rename('key')).reset_index()
Or:
df = data_df.set_index('key').loc[keys].reset_index()
print (df)
key value1 value2
0 a 1.1 7.1
1 b 2.0 8.0
2 a 1.1 7.1
3 c 3.0 9.0
4 e 5.0 11.0
5 f 6.0 12.0
6 a 1.1 7.1
7 b 2.0 8.0
8 c 3.0 9.0
I have a pandas pivot table that was previously shifted and now looks like this:
pivot
A B C D E
0 5.3 5.1 3.5 4.2 4.5
1 5.3 4.1 3.5 4.2 NaN
2 4.3 4.1 3.5 NaN NaN
3 4.3 4.1 NaN NaN NaN
4 4.3 NaN NaN NaN NaN
I'm trying to calculate a rolling average with a variable window (3 and 4 periods in this case) over the inverse diagonals, iterating over every column, and to store the result in a new dataframe, which would look like this:
expected_df with a 3 periods window
A B C D E
0 4.3 4.1 3.5 4.2 4.5
expected_df with a 4 periods window
A B C D E
0 4.5 4.3 3.5 4.2 4.5
So far, I tried to subset the original pivot table and create a different dataframe that only contains the specified window values for each column, to then calculate the average, like this:
subset
A B C D E
0 4.3 4.1 3.5 4.2 4.5
1 4.3 4.1 3.5 4.2 NaN
2 4.3 4.1 3.5 NaN NaN
For this, I tried to build the following for loop:
df2 = pd.DataFrame()
size = pivot.shape[0]
window = 3
for i in range(size):
    df2[i] = pivot.iloc[size-window-i:size-i, i]
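# (assigning a Series to df2[i] aligns on index labels rather than position,
#  and size-window-i goes negative for the later columns, giving empty slices)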
This does not work: pivot.iloc[size-window-i:size-i, i] returns the values I need when I pass the indexes manually, but inside the loop it misses the first value of the second column, and so on:
df2
A B C D E
0 4.3 NaN NaN NaN NaN
1 4.3 4.1 NaN NaN NaN
2 4.3 4.1 3.5 NaN NaN
Does anyone have a good idea on how to calculate the moving average or on how to fix the for loop part? Thanks in advance for your comments.
IIUC:
shift everything back
shifted = pd.concat([df.iloc[:, i].shift(i) for i in range(df.shape[1])], axis=1)
shifted
A B C D E
0 5.3 NaN NaN NaN NaN
1 5.3 5.1 NaN NaN NaN
2 4.3 4.1 3.5 NaN NaN
3 4.3 4.1 3.5 4.2 NaN
4 4.3 4.1 3.5 4.2 4.5
Then you can get your mean.
# Change this 🡇 to get the last n number of rows
shifted.iloc[-3:].mean()
A 4.3
B 4.1
C 3.5
D 4.2
E 4.5
dtype: float64
Or the rolling mean
# Change this 🡇 to get the last n number of rows
shifted.rolling(3, min_periods=1).mean()
A B C D E
0 5.300000 NaN NaN NaN NaN
1 5.300000 5.100000 NaN NaN NaN
2 4.966667 4.600000 3.5 NaN NaN
3 4.633333 4.433333 3.5 4.2 NaN
4 4.300000 4.100000 3.5 4.2 4.5
Numpy strides
I'll use strides to construct a 3-D array and average over one of the axes. This is faster but confusing as all ...
Also, I wouldn't use this. I just wanted to nail down how to grab diagonal elements via strides. This was more practice for me and I wanted to share.
from numpy.lib.stride_tricks import as_strided as strided
a = df.values
roll = 3
r_ = roll - 1 # one less than roll
h, w = a.shape
w_ = w - 1 # one less than width
b = np.empty((h + 2 * w_ + r_, w), dtype=a.dtype)
b.fill(np.nan)
b[w_ + r_:-w_] = a
s0, s1 = b.strides
a_ = np.nanmean(strided(b, (h + w_, roll, w), (s0, s0, s1 - s0))[w_:], axis=1)
pd.DataFrame(a_, df.index, df.columns)
A B C D E
0 5.300000 NaN NaN NaN NaN
1 5.300000 5.100000 NaN NaN NaN
2 4.966667 4.600000 3.5 NaN NaN
3 4.633333 4.433333 3.5 4.2 NaN
4 4.300000 4.100000 3.5 4.2 4.5
Numba
I feel better about this than I do using strides
import numpy as np
from numba import njit
import warnings
@njit
def dshift(a, roll):
    h, w = a.shape
    b = np.empty((h, roll, w), dtype=np.float64)
    b.fill(np.nan)
    for r in range(roll):
        for i in range(h):
            for j in range(w):
                k = i - j - r
                if k >= 0:
                    b[i, r, j] = a[k, j]
    return b
with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=RuntimeWarning)
    df_ = pd.DataFrame(np.nanmean(dshift(a, 3), axis=1), df.index, df.columns)
df_
A B C D E
0 5.300000 NaN NaN NaN NaN
1 5.300000 5.100000 NaN NaN NaN
2 4.966667 4.600000 3.5 NaN NaN
3 4.633333 4.433333 3.5 4.2 NaN
4 4.300000 4.100000 3.5 4.2 4.5