Filling previous values within a group in Pandas - python

I want to ffill() the values in the S0.0, S1.0 and S2.0 columns within each 'ID' group:
ID Close S0.0 S1.0 S2.0
0 UNITY 11.66 NaN 54 NaN
1 UNITY 11.55 56 NaN NaN
2 UNITY 11.59 NaN NaN 78
3 TRINITY 11.69 47 NaN NaN
4 TRINITY 11.37 NaN 69 NaN
5 TRINITY 11.89 NaN NaN 70
intended result:
ID Close S0.0 S1.0 S2.0
0 UNITY 11.66 NaN 54 NaN
1 UNITY 11.55 56 54 NaN
2 UNITY 11.59 56 54 78
3 TRINITY 11.69 47 NaN NaN
4 TRINITY 11.37 47 69 NaN
5 TRINITY 11.89 47 69 70
Here are my attempts and their undesired outcomes:
Attempt 1:
df[df['S0.0']==""] = np.NaN
df[df['S1.0']==""] = np.NaN
df[df['S2.0']==""] = np.NaN
df['S0.0'].groupby('ID').fillna(method='ffill', inplace = True)
df['S1.0'].groupby('ID').fillna(method='ffill', inplace = True)
df['S2.0'].groupby('ID').fillna(method='ffill', inplace = True)
output:
raise KeyError(gpr)
KeyError: 'ID'
Attempt 2:
df.groupby('ID')[['S0.0', 'S1.0', 'S2.0']].ffill()
#this makes no difference to the data.
#but when I try this:
df = df.groupby('ID')[['S0.0', 'S1.0', 'S2.0']].ffill()
df
Output:
S0.0 S1.0 S2.0
NaN 54 NaN
56 54 NaN
56 54 78
47 NaN NaN
47 69 NaN
47 69 70
which again is not what I wanted, since the ID and Close columns are dropped. A little help would be appreciated.
THANKS!

UPDATE
The second attempt was on the right track! Just don't restrict the groupby selection to the Sx.0 columns.
id = df.ID
df = pd.concat([id,df.groupby('ID').ffill()],axis=1)
output:
ID Close S0.0 S1.0 S2.0
0 UNITY 11.66 NaN 54.0 NaN
1 UNITY 11.55 56.0 54.0 NaN
2 UNITY 11.59 56.0 54.0 78.0
3 TRINITY 11.69 47.0 NaN NaN
4 TRINITY 11.37 47.0 69.0 NaN
5 TRINITY 11.89 47.0 69.0 70.0

Just do:
df[['S0.0', 'S1.0', 'S2.0']] = df.groupby('ID')[['S0.0', 'S1.0', 'S2.0']].ffill()
print(df)
Output:
Close S0.0 S1.0 S2.0
0 11.66 NaN 54.0 NaN
1 11.55 56.0 54.0 NaN
2 11.59 56.0 54.0 78.0
3 11.69 47.0 NaN NaN
4 11.37 47.0 69.0 NaN
5 11.89 47.0 69.0 70.0
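For reference, the update collapses into a short, self-contained sketch (the frame below re-creates the sample data from the question); assigning the group-wise ffill back to the selected columns keeps ID and Close in place:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': ['UNITY'] * 3 + ['TRINITY'] * 3,
    'Close': [11.66, 11.55, 11.59, 11.69, 11.37, 11.89],
    'S0.0': [np.nan, 56, np.nan, 47, np.nan, np.nan],
    'S1.0': [54, np.nan, np.nan, np.nan, 69, np.nan],
    'S2.0': [np.nan, np.nan, 78, np.nan, np.nan, 70],
})

# forward-fill within each ID group, then assign back so ID/Close survive
cols = ['S0.0', 'S1.0', 'S2.0']
df[cols] = df.groupby('ID')[cols].ffill()
print(df)
```

Because the fill is grouped, the last UNITY value never leaks into the first TRINITY row.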

Related

remove certain numbers from two dataframes python

I have two dataframes
dt AAPL AMC AMZN ASO ATH ... SPCE SRNE TH TSLA VIAC WKHS
0 2021-04-12 36 28 6 20 1 ... 5 0 0 50 23 0
1 2021-04-13 46 15 5 16 6 ... 5 0 0 122 12 1
2 2021-04-14 12 4 1 5 2 ... 2 0 0 39 1 0
3 2021-04-15 30 23 3 14 2 ... 15 0 0 101 9 0
dt AAPL AMC AMZN ASO ATH ... SPCE SRNE TH TSLA VIAC WKHS
0 2021-04-12 41 28 4 33 10 ... 5 0 0 56 14 3
1 2021-04-13 76 22 7 12 29 ... 4 0 0 134 8 2
2 2021-04-14 21 15 2 7 16 ... 2 0 0 61 3 0
3 2021-04-15 54 43 9 2 31 ... 16 0 0 83 13 1
I want to remove numbers lower than 10 from both dataframes: if a cell is deleted from one dataframe, the same cell should be removed from the other, and vice versa.
Appreciate your help
Use a mask:
## pre-requisite
df1 = df1.set_index('dt')
df2 = df2.set_index('dt')
## processing
mask = df1.lt(10) | df2.lt(10)
df1 = df1.mask(mask)
df2 = df2.mask(mask)
output:
>>> df1
AAPL AMC AMZN ASO ATH SPCE SRNE TH TSLA VIAC WKHS
dt
2021-04-12 36 28.0 NaN 20.0 NaN NaN NaN NaN 50 23.0 NaN
2021-04-13 46 15.0 NaN 16.0 NaN NaN NaN NaN 122 NaN NaN
2021-04-14 12 NaN NaN NaN NaN NaN NaN NaN 39 NaN NaN
2021-04-15 30 23.0 NaN NaN NaN 15.0 NaN NaN 101 NaN NaN
>>> df2
AAPL AMC AMZN ASO ATH SPCE SRNE TH TSLA VIAC WKHS
dt
2021-04-12 41 28.0 NaN 33.0 NaN NaN NaN NaN 56 14.0 NaN
2021-04-13 76 22.0 NaN 12.0 NaN NaN NaN NaN 134 NaN NaN
2021-04-14 21 NaN NaN NaN NaN NaN NaN NaN 61 NaN NaN
2021-04-15 54 43.0 NaN NaN NaN 16.0 NaN NaN 83 NaN NaN
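A cut-down, runnable sketch of the mask approach (only two hypothetical tickers, so the frames stay small):

```python
import pandas as pd

# small stand-in frames; the real data has many more ticker columns
df1 = pd.DataFrame({'dt': ['2021-04-12', '2021-04-13'],
                    'AAPL': [36, 46], 'AMZN': [6, 5]}).set_index('dt')
df2 = pd.DataFrame({'dt': ['2021-04-12', '2021-04-13'],
                    'AAPL': [41, 76], 'AMZN': [4, 7]}).set_index('dt')

# a cell is blanked in BOTH frames if it is below 10 in EITHER frame
mask = df1.lt(10) | df2.lt(10)
df1 = df1.mask(mask)
df2 = df2.mask(mask)
```

Both frames end up with NaN in exactly the same positions, which is the symmetry the question asks for.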

Is there a way to replace a whole pandas dataframe row using ffill, if one value of a specific column is NaN?

I am trying to sort a dataframe where some rows are all NaN. I want to fill these using ffill. I'm currently trying this, although I feel like it's a mismatch of a few commands:
df.loc[df['A'].isna(), :] = df.fillna(method='ffill')
This gives an error:
AttributeError: 'NoneType' object has no attribute 'fillna'
but I want to filter which NaNs I fill with ffill, based on whether one specific column is NaN, i.e.
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 NaN NaN NaN NaN NaN
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 NaN NaN NaN NaN NaN
So I would only like to fill a row IFF the value of column A is NaN, whilst leaving cells (0, C) and (0, D) as NaN, giving the dataframe below:
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 85 65 11 31 5
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 10 32 5 15 11
So just to clarify: the ONLY rows that get replaced with ffill are 3 and 8, because the value of column A in rows 3 and 8 is NaN.
Thanks
---Update---
When I'm debugging and evaluate the expression : df.loc[df['A'].isna(), :]
I get
3 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
So I assume what's happening here is that I then attempt ffill on a new dataframe containing only rows 3 and 8, and obviously I can't ffill NaNs with NaNs.
Change values only in those rows that start with NaN in column A:
df.loc[df['A'].isna(), :] = df.ffill().loc[df['A'].isna(), :]
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
Try using a mask to identify the relevant rows where column A is null. Then take those same rows from the forward-filled dataframe.
mask = df['A'].isnull()
df.loc[mask, :] = df.ffill().loc[mask, :]
>>> df
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
You just want to fill (DataFrame.ffill) where (DataFrame.where) df['A'] is NaN, and leave the rest as it was before (df):
df=df.ffill().where(df['A'].isna(),df)
print(df)
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
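All three answers do essentially the same thing; here is a self-contained sketch on a trimmed-down frame (three rows, three columns) showing that only the row where A is NaN gets filled, while the other scattered NaNs stay put:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [45, np.nan, 58],
                   'B': [88, np.nan, np.nan],
                   'C': [np.nan, np.nan, 23]})

# fill ONLY the rows where A is NaN; NaNs elsewhere are left untouched
mask = df['A'].isna()
df.loc[mask] = df.ffill().loc[mask]
```

Row 1 is copied from row 0 (including row 0's NaN in C), while row 2's NaN in B survives because its A value is not NaN.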

Merging Two Columns in DataFrame With Variable Column Names

Editing my original post to hopefully simplify my question... I'm merging multiple DataFrames into one, SomeData.DataFrame, which gives me the following:
Key 2019-02-17 2019-02-24_x 2019-02-24_y 2019-03-03
0 A 80 NaN NaN 80
1 B NaN NaN 45 36
2 C 44 NaN 39 NaN
3 D 80 NaN NaN 12
4 E 49 2 NaN NaN
What I'm trying to do now is efficiently merge the columns ending in "_x" and "_y" while keeping everything else in place so that I get:
Key 2019-02-17 2019-02-24 2019-03-03
0 A 80 NaN 80
1 B NaN 45 36
2 C 44 39 NaN
3 D 80 NaN 12
4 E 49 2 NaN
The other issue I'm trying to account for is that the data contained in SomeData.DataFrame changes weekly, so my column headers are unpredictable. Meaning, some weeks I may not have the above issue at all, and other weeks there may be multiple instances, for example:
Key 2019-02-17 2019-02-24_x 2019-02-24_y 2019-03_10_x 2019-03-10_y
0 A 80 NaN NaN 80 NaN
1 B NaN NaN 45 36 NaN
2 C 44 NaN 39 NaN 12
3 D 80 NaN NaN 12 NaN
4 E 49 2 NaN NaN 17
So that again the desired result would be:
Key 2019-02-17 2019-02-24 2019-03_10
0 A 80 NaN 80
1 B NaN 45 36
2 C 44 39 12
3 D 80 NaN 12
4 E 49 2 17
Is what I'm asking reasonable or am I venturing outside the bounds of Pandas' limits? I can't find anyone trying to do anything similar so I'm not sure anymore. Thank you in advance!
Edited answer to updated question:
df = df.set_index('Key')
df.groupby(df.columns.str.split('_').str[0], axis=1).sum()
Output:
2019-02-17 2019-02-24 2019-03-03
Key
A 80.0 0.0 80.0
B 0.0 45.0 36.0
C 44.0 39.0 0.0
D 80.0 0.0 12.0
E 49.0 2.0 0.0
Output for the second dataframe:
df.groupby(df.columns.str.split('_').str[0], axis=1).sum()
Output:
2019-02-17 2019-02-24 2019-03-10
Key
A 80.0 0.0 80.0
B 0.0 45.0 36.0
C 44.0 39.0 12.0
D 80.0 0.0 12.0
E 49.0 2.0 17.0
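Note that .sum() collapses an all-NaN pair to 0.0, as the output above shows, whereas the desired result keeps NaN. A variant using min_count=1 preserves those NaNs (a transpose stands in for the column-axis groupby, which newer pandas versions deprecate); the frame below is a cut-down stand-in for the question's data, and summing assumes at most one non-NaN value per merged pair:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Key': list('ABCDE'),
                   '2019-02-17': [80, np.nan, 44, 80, 49],
                   '2019-02-24_x': [np.nan, np.nan, np.nan, np.nan, 2],
                   '2019-02-24_y': [np.nan, 45, 39, np.nan, np.nan]}).set_index('Key')

# group columns by the prefix before '_'; min_count=1 keeps NaN when
# every merged cell is NaN (a plain .sum() would give 0 instead)
merged = df.T.groupby(df.columns.str.split('_').str[0]).sum(min_count=1).T
```

With min_count=1, Key A's 2019-02-24 cell stays NaN instead of becoming 0.0.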
You could try something like this:
df_t = df.T
df_t.set_index(df_t.groupby(level=0).cumcount(), append=True)\
.unstack().T\
.sort_values(df.columns[0])[df.columns.unique()]\
.reset_index(drop=True)
Output:
val03-20 03-20 val03-24 03-24
0 a 1 d 5
1 b 6 e 7
2 c 4 f 10
3 NaN NaN g 5
4 NaN NaN h 6
5 NaN NaN i 1

Why does df.diff give me NaN in the 4th column?

I have the following code:
# create dataframes for the lists of arrays (df_Avg_R), list of maxima
# (df_peaks) and for the inter-beat-intervals (df_ibi)
df_Avg_R = pd.DataFrame(Avg_R_val)
df_idx_max = pd.DataFrame(idx_of_max)
# delete first and last maxima
df_idx_max.drop([0, 11], axis=1, inplace=True)
df_ibi = df_idx_max.diff(axis=1)
df_idx_max is the following dataframe (only the first rows):
1 2 3 4 5 6 7 8 9 10
0 55 92 132 181.0 218.0 251.0 NaN NaN NaN NaN
1 84 140 198 235.0 251.0 NaN NaN NaN NaN NaN
2 47 64 103 123.0 185.0 251.0 NaN NaN NaN NaN
3 58 102 146 189.0 251.0 NaN NaN NaN NaN NaN
4 53 96 139 182.0 201.0 225.0 251.0 NaN NaN NaN
5 46 89 131 173.0 215.0 251.0 NaN NaN NaN NaN
6 67 121 161 175.0 231.0 251.0 NaN NaN NaN NaN
7 52 109 165 206.0 220.0 251.0 NaN NaN NaN NaN
8 80 135 191 251.0 NaN NaN NaN NaN NaN NaN
9 38 83 139 188.0 251.0 NaN NaN NaN NaN NaN
10 33 73 113 161.0 205.0 251.0 NaN NaN NaN NaN
11 54 81 126 153.0 180.0 204.0 251.0 NaN NaN NaN
12 44 64 116 160.0 206.0 251.0 NaN NaN NaN NaN
13 56 109 165 220.0 251.0 NaN NaN NaN NaN NaN
14 43 100 124 155.0 211.0 251.0 NaN NaN NaN NaN
However, the command df_ibi = df_idx_max.diff(axis=1) gives me NaN in the whole 4th column of df_ibi:
1 2 3 4 5 6 7 8 9 10
0 NaN 37.0 40.0 NaN 37.0 33.0 NaN NaN NaN NaN
1 NaN 56.0 58.0 NaN 16.0 NaN NaN NaN NaN NaN
2 NaN 17.0 39.0 NaN 62.0 66.0 NaN NaN NaN NaN
3 NaN 44.0 44.0 NaN 62.0 NaN NaN NaN NaN NaN
4 NaN 43.0 43.0 NaN 19.0 24.0 26.0 NaN NaN NaN
5 NaN 43.0 42.0 NaN 42.0 36.0 NaN NaN NaN NaN
6 NaN 54.0 40.0 NaN 56.0 20.0 NaN NaN NaN NaN
7 NaN 57.0 56.0 NaN 14.0 31.0 NaN NaN NaN NaN
8 NaN 55.0 56.0 NaN NaN NaN NaN NaN NaN NaN
9 NaN 45.0 56.0 NaN 63.0 NaN NaN NaN NaN NaN
10 NaN 40.0 40.0 NaN 44.0 46.0 NaN NaN NaN NaN
11 NaN 27.0 45.0 NaN 27.0 24.0 47.0 NaN NaN NaN
12 NaN 20.0 52.0 NaN 46.0 45.0 NaN NaN NaN NaN
13 NaN 53.0 56.0 NaN 31.0 NaN NaN NaN NaN NaN
14 NaN 57.0 24.0 NaN 56.0 40.0 NaN NaN NaN NaN
Do you know why this happens? Thanks
If you convert your entire dataframe to floats, it should work without a problem:
df_idx_max = df_idx_max.astype(float, errors='ignore')
df_ibi = df_idx_max.diff(axis=1)
I think it is something like a bug; look at this issue. You can use the following code to temporarily work around the problem:
df.T.diff().T
With your data should be:
df_idx_max.T.diff().T
Let me know if it works.
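The NaN column coincides with the boundary between the int64 and float64 columns (the first three columns have no decimal point in the printout, the rest do), which supports the dtype explanation. A minimal sketch of the float-cast fix, on a hypothetical two-row stand-in for df_idx_max:

```python
import pandas as pd

# mixed dtypes: columns 1-3 are int64, columns 4-5 are float64
df_idx_max = pd.DataFrame({1: [55, 84], 2: [92, 140], 3: [132, 198],
                           4: [181.0, 235.0], 5: [218.0, 251.0]})

# unify dtypes before differencing so no column sits on a dtype boundary
df_idx_max = df_idx_max.astype(float)
df_ibi = df_idx_max.diff(axis=1)
```

After the cast, column 4 holds the expected inter-beat interval (e.g. 181 - 132 = 49 in the first row) instead of NaN.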

Appending variable length columns in Pandas dataframe Python

I have a few csv files which contain a pair of bearings for many locations. I am trying to expand the values to include every number between the bearing pairs for each location and export the variable lengths as a csv in the same format.
Example:
df = pd.read_csv('bearing.csv')
Data structure:
A B C D E
0 0 94 70 67 84
1 120 132 109 152 150
Ideal result is a variable length multidimensional array:
A B C D E
0 0 94 70 67 84
1 1 95 71 68 85
2 3 96 72 69 86
...
n 120 132 109 152 150
I am looping through each column and getting the range of the pair of values, but I am struggling when trying to overwrite the old column with the new range of values.
for col in bear:
    min_val = min(bear[col])
    max_val = max(bear[col])
    range_vals = range(min(bear[col]), max(bear[col]) + 1)
    bear[col] = range_vals
I am getting the following error:
ValueError: Length of values does not match length of index
You can use a dict comprehension with min and max in the DataFrame constructor, but you get a lot of NaN at the end of the shorter columns:
df = pd.DataFrame({col: pd.Series(range(df[col].min(),
                                        df[col].max() + 1)) for col in df.columns})
print (df)
A B C D E
0 0 94.0 70.0 67.0 84.0
1 1 95.0 71.0 68.0 85.0
2 2 96.0 72.0 69.0 86.0
3 3 97.0 73.0 70.0 87.0
4 4 98.0 74.0 71.0 88.0
5 5 99.0 75.0 72.0 89.0
6 6 100.0 76.0 73.0 90.0
7 7 101.0 77.0 74.0 91.0
8 8 102.0 78.0 75.0 92.0
9 9 103.0 79.0 76.0 93.0
10 10 104.0 80.0 77.0 94.0
11 11 105.0 81.0 78.0 95.0
12 12 106.0 82.0 79.0 96.0
13 13 107.0 83.0 80.0 97.0
14 14 108.0 84.0 81.0 98.0
15 15 109.0 85.0 82.0 99.0
16 16 110.0 86.0 83.0 100.0
17 17 111.0 87.0 84.0 101.0
18 18 112.0 88.0 85.0 102.0
19 19 113.0 89.0 86.0 103.0
20 20 114.0 90.0 87.0 104.0
21 21 115.0 91.0 88.0 105.0
22 22 116.0 92.0 89.0 106.0
23 23 117.0 93.0 90.0 107.0
24 24 118.0 94.0 91.0 108.0
25 25 119.0 95.0 92.0 109.0
26 26 120.0 96.0 93.0 110.0
27 27 121.0 97.0 94.0 111.0
28 28 122.0 98.0 95.0 112.0
29 29 123.0 99.0 96.0 113.0
.. ... ... ... ... ...
91 91 NaN NaN NaN NaN
92 92 NaN NaN NaN NaN
93 93 NaN NaN NaN NaN
94 94 NaN NaN NaN NaN
95 95 NaN NaN NaN NaN
96 96 NaN NaN NaN NaN
97 97 NaN NaN NaN NaN
98 98 NaN NaN NaN NaN
99 99 NaN NaN NaN NaN
100 100 NaN NaN NaN NaN
101 101 NaN NaN NaN NaN
102 102 NaN NaN NaN NaN
103 103 NaN NaN NaN NaN
104 104 NaN NaN NaN NaN
105 105 NaN NaN NaN NaN
106 106 NaN NaN NaN NaN
107 107 NaN NaN NaN NaN
108 108 NaN NaN NaN NaN
109 109 NaN NaN NaN NaN
110 110 NaN NaN NaN NaN
111 111 NaN NaN NaN NaN
112 112 NaN NaN NaN NaN
113 113 NaN NaN NaN NaN
114 114 NaN NaN NaN NaN
115 115 NaN NaN NaN NaN
116 116 NaN NaN NaN NaN
117 117 NaN NaN NaN NaN
118 118 NaN NaN NaN NaN
119 119 NaN NaN NaN NaN
120 120 NaN NaN NaN NaN
If you have only a few columns, it is possible to use:
df = pd.DataFrame({'A': pd.Series(range(df.A.min(), df.A.max() + 1)),
'B': pd.Series(range(df.B.min(), df.B.max() + 1))})
EDIT:
If the min value is in the first row and the max in the last, you can use iloc:
df = pd.DataFrame({col: pd.Series(range(df[col].iloc[0],
                                        df[col].iloc[-1] + 1)) for col in df.columns})
Timings:
In [3]: %timeit ( pd.DataFrame({col: pd.Series(range(df[col].iloc[0], df[col].iloc[-1] + 1)) for col in df.columns }) )
1000 loops, best of 3: 1.75 ms per loop
In [4]: %timeit ( pd.DataFrame({col: pd.Series(range(df[col].min(), df[col].max() + 1)) for col in df.columns }) )
The slowest run took 5.50 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.18 ms per loop
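A runnable condensation of the dict-comprehension approach, using a hypothetical three-column version of the bearing data; the shorter ranges are padded with NaN by index alignment in the constructor:

```python
import pandas as pd

# each column holds a pair of bearings: a start and an end value
df = pd.DataFrame({'A': [0, 120], 'B': [94, 132], 'C': [70, 109]})

# expand every column to the full inclusive range between its bearings;
# columns shorter than the longest range are padded with NaN
expanded = pd.DataFrame({col: pd.Series(range(df[col].min(),
                                              df[col].max() + 1))
                         for col in df.columns})
```

Column A spans 0-120 and sets the frame length at 121 rows, so B (39 values) and C (40 values) trail off into NaN, exactly as the answer's long printout shows.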
