Hello, I'm trying to figure out a method that finds the longest common continuous subsequence (in this case, a time interval) without any missing (NaN) values across a set of series. Here is an example DataFrame:
   time  s_1  s_2  s_3
0     1    2    2  NaN
1     2    3  NaN  NaN
2     3    3    2    2
3     4    5    3   10
4     5    8    4    3
5     6  NaN  NaN    7
6     7    5    2  NaN
7     8  NaN    3  NaN
For this small example the "best" time interval would be time 3-5, i.e. index 2-4. The real DataFrame is much bigger and contains more series. Is there an efficient way to solve this problem?
Thank you very much.
I updated this with a bit of setup for a working example:
import pandas as pd
import numpy as np

# np.nan rather than np.NAN: the NAN alias was removed in NumPy 2.0
s1 = [2, 3, 3, 5, 8, np.nan, 5, np.nan, 1]
s2 = [2, np.nan, 2, 3, 4, np.nan, 2, 3, 1]
s3 = [np.nan, np.nan, 2, 10, 3, 7, np.nan, np.nan, 1]
data = {'time': np.arange(1, 9 + 1), 's_1': s1, 's_2': s2, 's_3': s3}
df = pd.DataFrame(data)
print(df)
This creates the DataFrame you posted above, but with an additional entry at the end, so there are two zones of continuous indexes.
I think the best approach from here is to drop all of the rows that are missing data and then find the longest run of consecutive labels in the remaining index. Something like this should do the trick:
# index labels of the rows with no missing values
sequence = np.array(df.dropna(how='any').index)
# split wherever consecutive labels differ by more than 1, keep the longest run
longest_seq = max(np.split(sequence, np.where(np.diff(sequence) != 1)[0] + 1), key=len)
print(df.iloc[longest_seq])
Which will give you:
time s_1 s_2 s_3
2 3 3.0 2.0 2.0
3 4 5.0 3.0 10.0
4 5 8.0 4.0 3.0
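One caveat: dropna preserves the original index labels, and iloc happens to work here only because the default RangeIndex makes labels and positions coincide. With any other index, the label-based lookup df.loc[longest_seq] would be the safer choice.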
dropna first, then use diff with cumsum to build a key that groups the remaining rows by whether their time values are consecutive (i.e. differ by 1):
s = df.dropna()
# rows whose time differs from the previous row by more than 1 start a new
# group; transform('count') attaches each group's length to every row in it
idx = s.time.groupby(s.time.diff().ne(1).cumsum()).transform('count')
idx
0 1
2 3
3 3
4 3
Name: time, dtype: int64
yourmax = s[idx == idx.max()]
yourmax
time s_1 s_2 s_3
2 3 3.0 2.0 2.0
3 4 5.0 3.0 10.0
4 5 8.0 4.0 3.0
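Since the question stresses that the real DataFrame is much bigger, here is a sketch of a purely positional NumPy variant of the same idea (assuming the rows are already sorted by time and consecutive rows represent consecutive timesteps): build a boolean mask of fully observed rows and locate the longest run of True values directly.

import numpy as np

# True where a row has no NaN in any of the series columns
mask = df.drop(columns='time').notna().all(axis=1).to_numpy()

# pad with False so every run of True has a well-defined start and end
edges = np.diff(np.concatenate(([False], mask, [False])).astype(int))
starts = np.flatnonzero(edges == 1)   # run starts (inclusive)
ends = np.flatnonzero(edges == -1)    # run ends (exclusive)

best = np.argmax(ends - starts)       # position of the longest run
print(df.iloc[starts[best]:ends[best]])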
I have this df
import numpy as np
import pandas as pd

d = {}
d['id'] = ['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty'] = [5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
df = pd.DataFrame(d)
I would like to create a column holding the next value of qty that differs from the current one. That is, if qty is 5 and the next row is also 5, I skip ahead until I find the next value not equal to 5; in this case it is 6. All of this should be done per id group.
Here is the desired dataframe.
d['id'] = ['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty'] = [5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
d['qty2'] = [6,6,6,6,6,5,np.nan,np.nan,2,2,3,3,3,5,8,np.nan]
Any help is very much appreciated
You can groupby.shift, mask the identical values, and groupby.bfill:
# shift up per group
s = df.groupby('id')['qty'].shift(-1)
# keep only the different values and bfill per group
df['qty2'] = s.where(df['qty'].ne(s)).groupby(df['id']).bfill()
output:
id qty qty2
0 1 5 6.0
1 1 5 6.0
2 1 5 6.0
3 1 5 6.0
4 1 5 6.0
5 1 6 5.0
6 1 5 NaN
7 1 5 NaN
8 2 1 2.0
9 2 1 2.0
10 2 2 3.0
11 2 2 3.0
12 2 2 3.0
13 2 3 5.0
14 2 5 8.0
15 2 8 NaN
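For comparison, here is a sketch of a run-block variant of the same logic (my own variation, not part of the answer above): label each run of consecutive identical qty values within an id, take the first qty of the following run, and broadcast it back to the rows.

# change-points: a new id, or a qty different from the previous row
chg = df['id'].ne(df['id'].shift()) | df['qty'].ne(df['qty'].shift())
block = chg.cumsum()

firsts = df.groupby(block)['qty'].first()     # first qty of each run
block_id = df.groupby(block)['id'].first()    # id that owns each run

# next run's first qty, kept only when the next run belongs to the same id
nxt = firsts.shift(-1).where(block_id.eq(block_id.shift(-1)))
df['qty2_alt'] = block.map(nxt)               # matches qty2 above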
I am relatively new to Python and I am wondering how I can merge these two tables while preserving both their values.
Consider these two tables:
df = pd.DataFrame([[1, 3], [2, 4],[2.5,1],[5,6],[7,8]], columns=['A', 'B'])
A B
1 3
2 4
2.5 1
5 6
7 8
df2 = pd.DataFrame([[1],[2],[3],[4],[5],[6],[7],[8]], columns=['A'])
A
1
2
...
8
I want to obtain the following result:
A B
1 3
2 4
2.5 1
3 NaN
4 NaN
5 6
6 NaN
7 8
8 NaN
You can see that column A includes all values from both the first and second dataframe in an ordered manner.
I have attempted:
pd.merge(df,df2,how='outer')
pd.merge(df,df2,how='right')
But the former does not result in an ordered dataframe and the latter does not include rows that are unique to df.
Let us do concat, then drop_duplicates (keeping the last occurrence so the B values survive), then sort_values:
out = pd.concat([df2,df]).drop_duplicates('A',keep='last').sort_values('A')
Out[96]:
A B
0 1.0 3.0
1 2.0 4.0
2 2.5 1.0
2 3.0 NaN
3 4.0 NaN
3 5.0 6.0
5 6.0 NaN
4 7.0 8.0
7 8.0 NaN
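Incidentally, the outer merge the asker tried was close: it only lacked sorting. A minimal sketch of that route, assuming ordering by A is all that was missing:

out = pd.merge(df, df2, how='outer').sort_values('A').reset_index(drop=True)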
I have a dataset like this.
A B C A2
1 2 3 4
5 6 7 8
and I want to combine A and A2.
A  B    C
1  2    3
5  6    7
4  NaN  NaN
8  NaN  NaN
How can I combine the two columns? Any help is appreciated. Thank you.
I don't think it is possible directly. But you can do it with a few lines of code:
df = pd.DataFrame({'A': [1, 5], 'B': [2, 6], 'C': [3, 7], 'A2': [4, 8]})

# take A2 as its own frame and rename the column to A
df_A2 = df[['A2']]
df_A2.columns = ['A']

# stack it below the original frame (minus A2)
df = pd.concat([df.drop(['A2'], axis=1), df_A2])
You will get this if you print df:
A B C
0 1 2.0 3.0
1 5 6.0 7.0
0 4 NaN NaN
1 8 NaN NaN
You could append the last column after renaming it (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on modern pandas use pd.concat as in the other answers):
df.append(df[['A2']].set_axis(['A'], axis=1)).drop(columns='A2')
which gives the expected result:
A B C
0 1 2.0 3.0
1 5 6.0 7.0
0 4 NaN NaN
1 8 NaN NaN
If the index is not important to you:
import pandas as pd
pd.concat([df[['A','B','C']], df[['A2']].rename(columns={'A2': 'A'})]).reset_index(drop=True)
I have a multi-column dataframe which holds several numerical values that are the same. It looks like the following:
A B C D
0 1 1 10 1
1 1 1 20 2
2 1 5 30 3
3 2 2 40 4
4 2 3 50 5
This is great; however, I need to make A the index and B the columns. The problem is that pivot_table aggregates, averaging C and D for every identical value of B.
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 5, 2, 3],
                   'C': [10, 20, 30, 40, 50],
                   'D': [1, 2, 3, 4, 5]})
transposed_df = df.pivot_table(index=['A'], columns=['B'])
Instead of keeping both 10 and 20 under B=1, it averages the two to 15.
C D
B 1 2 3 5 1 2 3 5
A
1 15.0 NaN NaN 30.0 1.5 NaN NaN 3.0
2 NaN 40.0 50.0 NaN NaN 4.0 5.0 NaN
Is there any way I can keep column B the same and display every value of C and D using pandas, or am I better off writing my own function to do this? Also, it is very important that the index and columns stay the same, because only one of each number can exist.
EDIT: This is the desired output. I understand that this exact layout probably isn't possible, but it shows that 10 and 20 need to both be in column 1 and index 1.
C D
B 1 2 3 5 1 2 3 5
A
1 10.0,20.0 NaN NaN 30.0 1.0,2.0 NaN NaN 3.0
2 NaN 40.0 50.0 NaN NaN 4.0 5.0 NaN
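One hedged possibility: pivot_table accepts an arbitrary aggregation function, so passing aggfunc=list collects every value into a list per cell instead of averaging. A sketch, assuming list-valued cells are acceptable downstream:

out = df.pivot_table(index='A', columns='B', aggfunc=list)
# the cell (A=1, B=1) now holds [10, 20] under C and [1, 2] under D,
# matching the desired layout above; unmatched (A, B) pairs stay NaN
print(out)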
I have a time series where each observation represents the total amount of something since the last observation, if there is no observation in that timestep then the value is reported as NaN. An example of the format:
Timestep Value
1 10
2 NaN
3 NaN
4 9
5 NaN
6 NaN
7 NaN
8 16
9 NaN
10 NaN
What I would like to do is distribute each observed value across the NaNs before it. For example, a sequence like [5, NaN, NaN, 6] would become [5, 2, 2, 2]: the final observation, 6, is spread evenly over itself and the two preceding NaN values (6 / 3 = 2). Applied to the dataframe above, the desired output would be:
Timestep Value
1 10
2 3
3 3
4 3
5 4
6 4
7 4
8 4
9 NaN
10 NaN
I've tried doing this with some of the pandas backfill and interpolate methods but haven't found anything which quite does what I want.
transform
df.Value.bfill().div(
    # group key: cumsum of notna on the reversed series labels each
    # observation together with the NaNs that precede it
    df.groupby(df.Value.notna()[::-1].cumsum()).Value.transform('size')
)
0 10.0
1 3.0
2 3.0
3 3.0
4 4.0
5 4.0
6 4.0
7 4.0
8 NaN
9 NaN
Name: Value, dtype: float64
np.bincount and pd.factorize
a = df.Value.notna().values
# factorize the same reversed-cumsum key into integer codes f
f, u = pd.factorize(a[::-1].cumsum()[::-1])
# np.bincount(f)[f] is each group's size, broadcast back to every row
df.Value.bfill().div(np.bincount(f)[f])
0 10.0
1 3.0
2 3.0
3 3.0
4 4.0
5 4.0
6 4.0
7 4.0
8 NaN
9 NaN
Name: Value, dtype: float64
Alternative shorter version. This works because cumsum naturally gives me what factorize does.
a = df.Value.notna().values[::-1].cumsum()[::-1]
df.Value.bfill().div(np.bincount(a)[a])
Details
In both options above, we need to identify where the null values are and use cumsum on the reversed series to define groups. In the transform option, I use groupby and size to count the size of those groups.
The second option uses bin counting and slicing to get at the same series.
Thank you @ScottBoston for reminding me to mention the reversed element [::-1].
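To make the grouping concrete, here is the key evaluated on the example frame (values derived from the data above):

groups = df.Value.notna()[::-1].cumsum()[::-1]
print(groups.to_numpy())
# [3 2 2 2 1 1 1 1 0 0]
# each observation shares a label with the NaNs immediately before it;
# the trailing NaNs form their own group (0) and stay NaN after division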
Count the cumulative non-null values (shifted down by one), then update:
# s counts, for each row, the observations strictly before it; rows sharing
# the same s form one "observation plus preceding NaNs" group
s = df.Value.notnull().cumsum().shift(1)
# divide the back-filled values by their group sizes; update only copies
# non-NaN results, so the first value and the trailing NaNs are preserved
df.Value.update(df.Value.bfill() / s.groupby(s).transform('count'))
df
Out[885]:
Timestep Value
0 1 10.0
1 2 3.0
2 3 3.0
3 4 3.0
4 5 4.0
5 6 4.0
6 7 4.0
7 8 4.0
8 9 NaN
9 10 NaN