Normalizing/Adjusting time series dataframe - python

I am fairly new to Python and pandas and have been searching for a solution for a couple of days with no luck... here's the problem:
I have a data set like the one below, and I need to cull the first few values of some rows so that the highest value in each row ends up in column A. In the example below, rows 0 and 3 would drop their column A values, row 4 would drop its column A and B values, and all remaining values would shift to the left.
A B C D
0 11 23 21 14
1 24 18 17 15
2 22 18 15 13
3 10 13 12 10
4 5 7 14 11
Desired
A B C D
0 23 21 14 NaN
1 24 18 17 15
2 22 18 15 13
3 13 12 10 NaN
4 14 11 NaN NaN
I've looked at df.shift(), but I don't see how to get that function to work on a row-by-row basis. Should I instead be using an array and a loop?
Any help is greatly appreciated.

You need to turn all values to the left of each row's max into np.nan and then push the NaNs to the end of the row, as in the solution to this question. I use the one from cs95:
df_final = df[df.eq(df.max(1), axis=0).cummax(1)].apply(lambda x: sorted(x, key=pd.isnull), 1)
A B C D
0 23.0 21.0 14.0 NaN
1 24.0 18.0 17.0 15.0
2 22.0 18.0 15.0 13.0
3 13.0 12.0 10.0 NaN
4 14.0 11.0 NaN NaN
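If it helps to see that one-liner unpacked, here is the same logic broken into steps (a sketch, using the df from the question):
import pandas as pd

# boolean mask: True from each row's max onward, False before it
mask = df.eq(df.max(axis=1), axis=0).cummax(axis=1)

# values before the max become NaN
masked = df[mask]

# push the NaNs to the right of each row: sorted() is stable, and
# non-null values (key False) sort before nulls (key True)
df_final = masked.apply(lambda x: sorted(x, key=pd.isnull), axis=1)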

You can loop over the unique shifts (fewer of these than rows) with a groupby and join the results back:
import pandas as pd

# column position of each row's max
shifts = df.to_numpy().argmax(1)

# shift each group of rows left by its shared amount, then restore row order
pd.concat([gp.shift(-i, axis=1) for i, gp in df.groupby(shifts)]).sort_index()
A B C D
0 23.0 21.0 14.0 NaN
1 24.0 18.0 17.0 15.0
2 22.0 18.0 15.0 13.0
3 13.0 12.0 10.0 NaN
4 14.0 11.0 NaN NaN
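For the sample data there are only three distinct shift amounts, so only three groups need to be processed (a quick check against the df above):
shifts = df.to_numpy().argmax(1)
print(shifts)  # [1 0 0 1 2]: rows 1 and 2 stay put, rows 0 and 3 shift by 1, row 4 by 2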

One approach is to convert each row of the data frame to a list (excluding the index) and append NaN values. Then keep N elements, starting with the max value.
import numpy as np
import pandas as pd

ncols = len(df.columns)
nans = [np.nan] * ncols
new_rows = list()

for row in df.itertuples():
    # convert each row of the data frame to a list,
    # starting at 1 to exclude the index,
    # and append a list of NaNs
    new_list = list(row[1:]) + nans
    # find index of max value (excluding the NaNs we appended)
    k = np.argmax(new_list[:ncols])
    # collect the new row, starting at the max element
    new_rows.append(new_list[k : k + ncols])

# create new data frame
df_new = pd.DataFrame(new_rows, columns=df.columns)
df_new

import numpy as np

# cast to float up front so the NaN padding can be assigned back
df = df.astype(float)

for i in range(df.shape[0]):
    arr = list(df.iloc[i, :])
    c = 0
    while True:
        if arr[0] != max(arr):
            arr.remove(arr[0])
            c += 1
        else:
            break
    # pad with np.nan (not the string "NaN") so the columns stay numeric
    arr.extend([np.nan] * c)
    df.iloc[i, :] = arr
print(df)
I loop over every row, find the max value, remove the values before the max, and pad NaN values at the end so every row keeps the same number of columns.

Related

Replace values in dataframe with difference to last row value by condition

I'm trying to replace every value above 1000 in my dataframe with its difference to the previous row's value.
This is the way I tried with pandas:
data_df.replace(data_df.where(data_df["value"] >= 1000), data_df["value"].diff(), inplace=True)
This does not result in an error, but nothing in the dataframe changes. What am I missing?
import numpy as np
import pandas as pd

d = {'value': [1000, 200002, 50004, 600005]}
data_df = pd.DataFrame(data=d)

data_df["diff"] = data_df["value"].diff()
data_df["value"] = np.where(data_df["value"] > 1000, data_df["diff"], data_df["value"])
data_df.drop(columns='diff', inplace=True)
I introduce one column, "diff", to hold the difference to the previous row. np.where lets you implement the if/else logic. Hope it helps, thanks!
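If you'd rather skip the temporary column, the same logic fits directly into np.where (a sketch, assuming the same data_df):
data_df["value"] = np.where(data_df["value"] > 1000,
                            data_df["value"].diff(),
                            data_df["value"])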
You can set freq to 1000 or whatever interval you want; I have it at 10 to make the sample easier to see. Basically, shift the row, and for every row whose index is evenly divisible by the frequency, subtract the shifted (previous) value; otherwise leave the value as is.
import pandas as pd
import numpy as np

freq = 10
df = pd.DataFrame({'data': list(range(30))})
df['previous'] = df['data'].shift(1)
df['data'] = np.where((df.index % freq == 0) & (df.index > 0), df['data'] - df['previous'], df['data'])
df.drop(columns='previous', inplace=True)
Output
data
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 8.0
9 9.0
10 1.0
11 11.0
12 12.0
13 13.0
14 14.0
15 15.0
16 16.0
17 17.0
18 18.0
19 19.0
20 1.0
21 21.0
22 22.0
23 23.0
24 24.0
25 25.0
26 26.0
27 27.0
28 28.0
29 29.0

Slicing each dataframe row into 3 windows with different slicing ranges

I want to slice each row of my dataframe into 3 windows, with slice indices that are stored in another dataframe and change for each row. Afterwards I want to return a single dataframe containing the windows in the form of a MultiIndex. Rows in each window that are shorter than the longest row in that window should be filled with NaN values.
Since my actual dataframe has around 100,000 rows and 600 columns, I am concerned about an efficient solution.
Consider the following example:
This is the dataframe I want to slice into 3 windows:
>>> df
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
1 8 9 10 11 12 13 14 15
2 16 17 18 19 20 21 22 23
And the second dataframe, containing my slice indices, which has the same number of rows as df:
>>> df_slice
0 1
0 3 5
1 2 6
2 4 7
I've tried slicing the windows, like so:
first_window = df.iloc[:, :df_slice.iloc[:, 0]]
first_window.columns = pd.MultiIndex.from_tuples([("A", c) for c in first_window.columns])
second_window = df.iloc[:, df_slice.iloc[:, 0] : df_slice.iloc[:, 1]]
second_window.columns = pd.MultiIndex.from_tuples([("B", c) for c in second_window.columns])
third_window = df.iloc[:, df_slice.iloc[:, 1]:]
third_window.columns = pd.MultiIndex.from_tuples([("C", c) for c in third_window.columns])
result = pd.concat([first_window,
                    second_window,
                    third_window], axis=1)
Which gives me the following error:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'> with these indexers [0 3
1 2
2 4
Name: 0, dtype: int64] of <class 'pandas.core.series.Series'>
My expected output is something like this:
>>> result
A B C
0 1 2 3 4 5 6 7 8 9 10
0 0 1 2 NaN 3 4 NaN NaN 5 6 7
1 8 9 NaN NaN 10 11 12 13 14 15 NaN
2 16 17 18 19 20 21 22 NaN 23 NaN NaN
Is there an efficient solution for my problem without iterating over each row of my dataframe?
Here's a solution using melt and then pivot_table, plus some logic to:
Identify the three groups 'A', 'B', and 'C'.
Shift the columns to the left, so that NaN only appears at the right side of each window.
Rename the columns to get the expected output.
t = df.reset_index().melt(id_vars="index")
# df_slice has the integer column labels 0 and 1; rename them so they can
# be referenced as t.c_0 and t.c_1 after the merge
t = pd.merge(t, df_slice.rename(columns={0: "c_0", 1: "c_1"}),
             left_on="index", right_index=True)
t.variable = pd.to_numeric(t.variable)
t.loc[t.variable < t.c_0, "group"] = "A"
t.loc[(t.variable >= t.c_0) & (t.variable < t.c_1), "group"] = "B"
t.loc[t.variable >= t.c_1, "group"] = "C"
# shift values to the left: within each group, a row's window starts at its
# smallest variable, so subtracting the group-wide minimum left-aligns all windows
shift_val = t.groupby(["group", "index"]).variable.transform("min") - t.groupby(["group"]).variable.transform("min")
t.variable = t.variable - shift_val
# extract the A, B, and C groups, and create a multi-level index for their columns
df_a = pd.pivot_table(t[t.group == "A"], index="index", columns="variable", values="value")
df_a.columns = pd.MultiIndex.from_product([["a"], df_a.columns])
df_b = pd.pivot_table(t[t.group == "B"], index="index", columns="variable", values="value")
df_b.columns = pd.MultiIndex.from_product([["b"], df_b.columns])
df_c = pd.pivot_table(t[t.group == "C"], index="index", columns="variable", values="value")
df_c.columns = pd.MultiIndex.from_product([["c"], df_c.columns])
res = pd.concat([df_a, df_b, df_c], axis=1)
res.columns = pd.MultiIndex.from_tuples([(c[0], i) for i, c in enumerate(res.columns)])
print(res)
The output is:
a b c
0 1 2 3 4 5 6 7 8 9 10
index
0 0.0 1.0 2.0 NaN 3.0 4.0 NaN NaN 5.0 6.0 7.0
1 8.0 9.0 NaN NaN 10.0 11.0 12.0 13.0 14.0 15.0 NaN
2 16.0 17.0 18.0 19.0 20.0 21.0 22.0 NaN 23.0 NaN NaN

How to loc 5 rows before and 5 rows after value 1 in column

I have a dataframe; I want to loc the 5 rows before and the 5 rows after each row where the flag value is 1.
df = pd.DataFrame({'A': [2, 1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]})
Expected output:
df1_before = pd.DataFrame({'A': [1, 3, 4, 7, 8],
                           'flag': [0, 0, 0, 0, 1]})
df1_after = pd.DataFrame({'A': [8, 11, 1, 15, 20],
                          'flag': [1, 1, 1, 0, 0]})
Do the same for all three rows where flag is 1.
I think one easy way is to loop over the index where the flag is 1 and select the rows you want with loc:
l = len(df)
for idx in df[df.flag.astype(bool)].index:
    dfb = df.loc[max(idx - 4, 0):idx]
    dfa = df.loc[idx:min(idx + 4, l)]
    # do stuff
The min and max functions ensure the boundaries are not overrun in case you have a flag=1 within the first or last 5 rows. Note also that loc slicing is inclusive on both ends, so if you want 5 rows you need to use +/-4 on idx to get the right segment.
That said, depending on what your actual #do stuff is, you might want to change tactic. Say, for example, you want to calculate the difference between the sum of A over the 5 rows after and the 5 rows before; you could use rolling and shift:
df['roll'] = df.rolling(5)['A'].sum()
df.loc[df.flag.astype(bool), 'diff_roll'] = df['roll'].shift(-4) - df['roll']
print (df)
A flag roll diff_roll
0 2 0 NaN NaN
1 1 0 NaN NaN
2 3 0 NaN NaN
3 4 0 NaN NaN
4 7 0 17.0 NaN
5 8 1 23.0 32.0 #=55-23, 55 is the sum of A of df_after and 23 df_before
6 11 1 33.0 29.0
7 1 1 31.0 36.0
8 15 0 42.0 NaN
9 20 0 55.0 NaN
10 15 0 62.0 NaN
11 16 0 67.0 NaN
12 87 0 153.0 NaN
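A quick sanity check of the diff_roll value at index 5, using the same df:
# rows 5-9 form the "after" window for idx 5, rows 1-5 the "before" window
after_sum = df.loc[5:9, 'A'].sum()    # 8 + 11 + 1 + 15 + 20 = 55
before_sum = df.loc[1:5, 'A'].sum()   # 1 + 3 + 4 + 7 + 8 = 23
assert after_sum - before_sum == 32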

Applying values to column and grouping all columns by those values

I have a pandas dataframe as shown here. All lines without a value in ["sente"] contain further information, but they are not yet linked to a ["sente"] value.
id pos value sente
1 a I 21
2 b have 21
3 b a 21
4 a cat 21
5 d ! 21
6 cat N NaN
7 a My 22
8 a cat 22
9 b is 22
10 a cute 22
11 d . 22
12 cat N NaN
13 cute M NaN
Now I want each row without a value in ["sente"] to take its value from the row above. Then I want to group everything by ["sente"] and create a new column containing the content of the rows that had no value in ["sente"].
sente pos value content
21 a,b,b,a,d I have a cat ! 'cat,N'
22 a,a,b,a,d My cat is cute . 'cat,N','cute,M'
This would be my first step:
df.loc[(df['sente'] != df["sente"].shift(-1) & df["sente"] == Nan) , "sente"] = df["sente"].shift(+1)
but it only works for one additional row, not if there are 2 or more.
This groups one column the way I want:
df.groupby(["sente"])['value'].apply(lambda x: " ".join(x))
But for more columns it doesn't work the way I want:
df.groupby(["sente"]).agg(lambda x: ",".join(x))
Is there any way to do this without using stack functions?
Use:
# boolean mask marking the rows where sente is NaN
m = df['sente'].isnull()
# new column joining pos and value, only where the mask is True
df['contant'] = np.where(m, df['pos'] + ',' + df['value'], np.nan)
# replace pos and value with NaN where the mask is True
df[['pos', 'value']] = df[['pos', 'value']].mask(m)
print (df)
id pos value sente contant
0 1 a I 21.0 NaN
1 2 b have 21.0 NaN
2 3 b a 21.0 NaN
3 4 a cat 21.0 NaN
4 5 d ! 21.0 NaN
5 6 NaN NaN NaN cat,N
6 7 a My 22.0 NaN
7 8 a cat 22.0 NaN
8 9 b is 22.0 NaN
9 10 a cute 22.0 NaN
10 11 d . 22.0 NaN
11 12 NaN NaN NaN cat,N
12 13 NaN NaN NaN cute,M
Last, replace the NaNs in sente by forward filling with ffill, then group and join each column with the NaNs removed by dropna:
df1 = df.groupby(df["sente"].ffill()).agg(lambda x: " ".join(x.dropna()))
print (df1)
pos value contant
sente
21.0 a b b a d I have a cat ! cat,N
22.0 a a b a d My cat is cute . cat,N cute,M
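For reference, the same result can be reached in a single chain (a sketch, equivalent to the steps above):
m = df['sente'].isnull()
df1 = (df.assign(contant=np.where(m, df['pos'] + ',' + df['value'], np.nan),
                 pos=df['pos'].mask(m),
                 value=df['value'].mask(m))
         .groupby(df['sente'].ffill())[['pos', 'value', 'contant']]
         .agg(lambda x: ' '.join(x.dropna())))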

Fill NaN based on group

I would like to fill each NaN with the mean of the Temp values in its group.
Example:
Groups Temp
1 5 27
2 5 23
3 5 NaN (will be replaced by 25)
4 1 NaN (will be replaced by the mean of the Temps that are in group 1)
Any suggestions ? Thanks !
Use groupby and transform with a lambda function combining fillna and mean:
df = df.assign(Temp=df.groupby('Groups')['Temp'].transform(lambda x: x.fillna(x.mean())))
print(df)
Output:
   Groups  Temp
0       5  27.0
1       5  23.0
2       5  25.0
3       1   NaN
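Note that row 3 stays NaN because group 1 contains no non-NaN Temp to average. If you want to fall back to the overall mean in that case, you could chain a second fillna (a sketch, not part of the original answer):
df['Temp'] = (df.groupby('Groups')['Temp']
                .transform(lambda x: x.fillna(x.mean()))
                .fillna(df['Temp'].mean()))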
