I've got a dataset with an insanely high sampling rate and would like to remove excess data where the column value changes by less than a predefined amount as you move down through the dataset. However, some intermediary points need to be kept in order not to lose all the data.
e.g.
t V
0 1.0 1.0
1 2.0 1.2
2 3.0 2.0
3 3.3 3.0
4 3.4 4.0
5 3.7 4.2
6 3.8 4.6
7 4.4 5.4
8 5.1 6.0
9 6.0 7.0
10 7.0 10.0
Now I want to delete all the rows where the change in V from one row to another is less than dV, AND the change in t is below dt, but still keep datapoints such that there is data at roughly every interval dV or dt.
Let's say for dV = 1 and dt = 1, the desired output would be:
t V
0 1.0 1.0
1 2.0 1.2
2 3.0 2.0
3 3.3 3.0
4 3.4 4.0
7 4.4 5.4
9 6.0 7.0
10 7.0 10.0
Meaning rows 5, 6 and 8 were deleted since they were within the change thresholds, but row 7 remains since its change is above dt and dV in both directions.
The easy solution is iterating over the rows in the dataframe, but a faster (and more proper) solution is wanted.
EDIT:
The question was edited to reflect the point that intermediary points must be kept in order not to delete too much.
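For reference, here is a minimal sketch of the row-by-row baseline mentioned above. The function name thin_rows and the rule "keep a row once t or V has moved at least dt or dV since the last kept row" are my reading of the requirement, not an established recipe:
import pandas as pd

# Example data from the question
df = pd.DataFrame({'t': [1.0, 2.0, 3.0, 3.3, 3.4, 3.7, 3.8, 4.4, 5.1, 6.0, 7.0],
                   'V': [1.0, 1.2, 2.0, 3.0, 4.0, 4.2, 4.6, 5.4, 6.0, 7.0, 10.0]})

def thin_rows(df, dt=1.0, dV=1.0):
    """Keep a row only when t or V has changed by at least dt/dV since the last kept row."""
    keep = []
    last_t = last_V = None
    for row in df.itertuples():
        if last_t is None or (row.t - last_t) >= dt or (row.V - last_V) >= dV:
            keep.append(row.Index)
            last_t, last_V = row.t, row.V
    return df.loc[keep]

print(thin_rows(df))
On the example data this keeps rows 0-4, 7, 9 and 10, matching the desired output; the vectorized answers below compare each row only to its immediate predecessor instead.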
Use DataFrame.diff with boolean indexing:
dV = 1
dt = 1
df = df[~(df['t'].diff().lt(dt) & df['V'].diff().lt(dV))]
print (df)
t V
0 1.0 1.0
1 2.0 1.2
2 3.0 2.0
3 3.3 3.0
4 3.4 4.0
7 5.0 6.0
8 5.1 8.0
9 6.0 9.0
10 7.0 10.0
Or:
dV = 1
dt = 1
df1 = df.diff()
df = df[df1['t'].fillna(dt).ge(dt) | df1['V'].fillna(dV).ge(dV)]
print (df)
t V
0 1.0 1.0
1 2.0 1.2
2 3.0 2.0
3 3.3 3.0
4 3.4 4.0
7 5.0 6.0
8 5.1 8.0
9 6.0 9.0
10 7.0 10.0
You might want to use the shift() method:
diff_df = df - df.shift()
and then filter rows with loc:
diff_df = diff_df.loc[(diff_df['V'] > 1.0) & (diff_df['t'] > 1.0)]
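To drop rows of the original df rather than of diff_df, the same kind of mask can be applied back to df. A minimal sketch, assuming df is the question's dataframe and using the question's keep/drop rule (the names diffs, small_change and filtered are mine):
diffs = df - df.shift()
# a row is "uninteresting" when both t and V changed by less than 1.0
small_change = (diffs['t'] < 1.0) & (diffs['V'] < 1.0)
filtered = df.loc[~small_change]   # the first row is kept since its diffs are NaN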
You can use loc for boolean indexing and compare values between rows within each column using shift():
# Thresholds
dv = 1
dt = 1
# Filter out
print(df.loc[~((df.V.sub(df.V.shift()) < dv) & (df.t.sub(df.t.shift()) < dt))])
t V
0 1.0 1.0
1 2.0 1.2
2 3.0 2.0
3 3.3 3.0
4 3.4 4.0
7 5.0 6.0
8 5.1 8.0
9 6.0 9.0
10 7.0 10.0
I have a pandas series of keys and would like to create a dataframe by selecting values from another dataframe.
e.g.
import pandas

data_df = pandas.DataFrame({'key': ['a', 'b', 'c', 'd', 'e', 'f'],
                            'value1': [1.1, 2, 3, 4, 5, 6],
                            'value2': [7.1, 8, 9, 10, 11, 12]})
keys = pandas.Series(['a', 'b', 'a', 'c', 'e', 'f', 'a', 'b', 'c'])
data_df
# key value1 value2
#0 a 1.1 7.1
#1 b 2.0 8.0
#2 c 3.0 9.0
#3 d 4.0 10.0
#4 e 5.0 11.0
#5 f 6.0 12.0
I would like to get the result like this
result
key value1 value2
0 a 1.1 7.1
1 b 2.0 8.0
2 a 1.1 7.1
3 c 3.0 9.0
4 e 5.0 11.0
5 f 6.0 12.0
6 a 1.1 7.1
7 b 2.0 8.0
8 c 3.0 9.0
One way I have successfully done this is:
def append_to_series(key):
    new_series = data_df[data_df['key'] == key].iloc[0]
    return new_series

pandas.DataFrame(keys.apply(append_to_series))
However, this function is very slow and not clean. Is there a way to do this more efficiently?
Convert the series into a dataframe with the column name key, then use pd.merge() to merge in value1 and value2:
keys = pd.DataFrame(['a','b','a','c','e','f','a','b','c'],columns=['key'])
res = pd.merge(keys,data_df,on=['key'],how='left')
print(res)
key value1 value2
0 a 1.1 7.1
1 b 2.0 8.0
2 a 1.1 7.1
3 c 3.0 9.0
4 e 5.0 11.0
5 f 6.0 12.0
6 a 1.1 7.1
7 b 2.0 8.0
8 c 3.0 9.0
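As a side note on how='left': any key that does not appear in data_df is still kept in the result, just with NaN values. A small sketch with a hypothetical key 'z' that is not part of the original data:
extra = pd.DataFrame({'key': ['a', 'z']})   # 'z' is a made-up key absent from data_df
print(pd.merge(extra, data_df, on='key', how='left'))
#   key  value1  value2
# 0   a     1.1     7.1
# 1   z     NaN     NaN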
Create an index from the key column and then use DataFrame.reindex or DataFrame.loc:
Note: this requires the values of the original key column to be unique.
df = data_df.set_index('key').reindex(keys.rename('key')).reset_index()
Or:
df = data_df.set_index('key').loc[keys].reset_index()
print (df)
key value1 value2
0 a 1.1 7.1
1 b 2.0 8.0
2 a 1.1 7.1
3 c 3.0 9.0
4 e 5.0 11.0
5 f 6.0 12.0
6 a 1.1 7.1
7 b 2.0 8.0
8 c 3.0 9.0
Taking a Pandas dataframe df, I would like to be able to both negate the value in particular columns for all rows/entries and also add another value. This value to be added is a fixed constant for each of the columns.
I believe I could reproduce df, say dfcopy=df, set all cell values in dfcopy to the particular numbers and then subtract df from dfcopy but am hoping for a simpler way.
I am thinking that I need to somehow modify
df.iloc[:, [0,3,4]]
So, for example, starting from:
A B C D E
0 1.0 3.0 1.0 2.0 7.0
1 2.0 1.0 8.0 5.0 3.0
2 1.0 1.0 1.0 1.0 6.0
Then, negating only those values in columns (0, 3, 4) and adding 10 (for example), we would have:
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Thanks.
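For reference, the example frame above can be rebuilt like this (column names A-E as shown), so the answers below can be run as-is:
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 1.0],
                   'B': [3.0, 1.0, 1.0],
                   'C': [1.0, 8.0, 1.0],
                   'D': [2.0, 5.0, 1.0],
                   'E': [7.0, 3.0, 6.0]})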
You can first multiply by -1 with mul and then add 10 with add for those columns we select with iloc:
df.iloc[:, [0,3,4]] = df.iloc[:, [0,3,4]].mul(-1).add(10)
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Or as anky_91 suggested in the comments:
df.iloc[:, [0,3,4]] = 10-df.iloc[:,[0,3,4]]
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
pandas is very intuitive in letting you perform these operations.
negate:
df.iloc[:, [0,2,7,10,11]] = -df.iloc[:, [0,2,7,10,11]]
add a constant c:
df.iloc[:, [0,2,7,10,11]] = df.iloc[:, [0,2,7,10,11]] + c
or change to a constant value c:
df.iloc[:, [0,2,7,10,11]] = c
and any other arithmetic you can think of.
I would like to fill df's NaN values with the average of adjacent elements.
Consider a dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [1, np.nan, 4, 5, np.nan, 10, 1, 2, 5, np.nan, np.nan, 9]})
val
0 1.0
1 NaN
2 4.0
3 5.0
4 NaN
5 10.0
6 1.0
7 2.0
8 5.0
9 NaN
10 NaN
11 9.0
My desired output is:
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 7.0 <<< deadend
10 7.0 <<< deadend
11 9.0
I've looked into other solutions such as Fill cell containing NaN with average of value before and after, but this won't work in case of two or more consecutive np.nans.
Any help is greatly appreciated!
Use ffill + bfill and divide by 2:
df = (df.ffill()+df.bfill())/2
print(df)
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 7.0
10 7.0
11 9.0
EDIT: If the first or last element contains NaN then use (Dark's suggestion):
df = pd.DataFrame({'val': [np.nan, 1, np.nan, 4, 5, np.nan,
                           10, 1, 2, 5, np.nan, np.nan, 9, np.nan]})
df = (df.ffill()+df.bfill())/2
df = df.bfill().ffill()
print(df)
val
0 1.0
1 1.0
2 2.5
3 4.0
4 5.0
5 7.5
6 10.0
7 1.0
8 2.0
9 5.0
10 7.0
11 7.0
12 9.0
13 9.0
Although in the case of multiple NaNs in a row it doesn't produce the exact output you specified, other users reaching this page may actually prefer the effect of the interpolate() method:
df = df.interpolate()
print(df)
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 6.3
10 7.7
11 9.0
Suppose I have a dataframe that looks like:
df =
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 NaN
2 6.0 NaN NaN
Then it is possible to use df.fillna(method='ffill', axis=1) to obtain:
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 5.0
2 6.0 6.0 6.0
i.e. I forward fill the rows.
However, now I have a dataframe with -1 instead of np.nan. Pandas has the replace function that also has the possibility to use method='ffill', but replace() does not take an axis argument, so to obtain the same result as above, I would need to call df.T.replace(-1, method='ffill').T. Since transposing is quite expensive (especially considering I'm working on a dataframe of multiple gigabytes), this is not an option. How could I achieve the desired result?
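For reference, the -1-marked frame used by the answers below can be built like this (values taken from the printout further down):
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, 3.0],
                   [4.0, 5.0, -1.0],
                   [6.0, -1.0, -1.0]])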
Use mask and ffill
df.mask(df.eq(-1)).ffill(axis=1)
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 5.0
2 6.0 6.0 6.0
You can convert your -1 values to NaN before using pd.DataFrame.ffill:
print(df)
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 -1.0
2 6.0 -1.0 -1.0
res = df.replace(-1, np.nan).ffill(axis=1)
print(res)
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 5.0
2 6.0 6.0 6.0
IIUC, use mask and ffill with axis=1:
Where df1 = df.fillna(-1.0), i.e. the example frame with -1 in place of NaN:
df1.mask(df1 == -1).ffill(axis=1)
Output:
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 5.0
2 6.0 6.0 6.0
Let df be a pandas dataframe of the following form:
n days
1 9.0
2 4.0
3 5.0
4 1.0
5 4.0
6 1.0
7 7.0
8 3.0
For a given N, and for every row i >= N, I want to sum the values in df.days.iloc[i-N+1:i+1] and write the result into a new column, in row i.
The result should look like this (e.g., for N = 3):
n days loc_sum
1 9.0 NaN
2 4.0 NaN
3 5.0 18.0
4 1.0 10.0
5 4.0 10.0
6 1.0 6.0
7 7.0 12.0
8 3.0 11.0
Of course, I could simply loop through all i, and insert df.days.iloc[i-N+1:i+1].sum() for every i.
My question is: Is there a more elegant way, using pandas functionality? Especially for large datasets, looping through the rows seems to be a very slow option.
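For comparison, a minimal sketch of the loop described above (0-based row indices, so the first full window ends at row N-1; the column name loc_sum is taken from the expected output):
import numpy as np
import pandas as pd

df = pd.DataFrame({'n': range(1, 9),
                   'days': [9.0, 4.0, 5.0, 1.0, 4.0, 1.0, 7.0, 3.0]})

N = 3
loc_sum = [np.nan] * len(df)
for i in range(N - 1, len(df)):
    loc_sum[i] = df.days.iloc[i - N + 1:i + 1].sum()
df['loc_sum'] = loc_sum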
Use rolling with a window equal to 3 and the function sum:
df['loc_sum'] = df['days'].rolling(3).sum()
Output:
n days loc_sum
0 1 9.0 NaN
1 2 4.0 NaN
2 3 5.0 18.0
3 4 1.0 10.0
4 5 4.0 10.0
5 6 1.0 6.0
6 7 7.0 12.0
7 8 3.0 11.0
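To use a general N rather than the literal 3, the window size can simply be passed through; min_periods defaults to the window size, which is what produces the leading NaN values:
N = 3
df['loc_sum'] = df['days'].rolling(window=N).sum()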