I have a dataframe and I have to split it into subsets with these conditions:
start of a subset: c = 1
end of a subset: c = -1
Example:
a      b      c
False  False
False  False  -1
False  False
True   False   1    <- start of first subset
False  False
False  False   1
False  False   1
False  False
False  False   1
False  False
False  False   1
False  False
False  True   -1    <- end of first subset
False  False
False  False
False  True   -1
False  False
False  False
True   False   1    <- start of second subset
False  False  -1    <- end of second subset
This could be a solution, although I'm not sure it's the most efficient approach. It basically uses cumsum plus some and/or logic.
import pandas as pd
import numpy as np
df = pd.DataFrame({'c': [np.nan, 1, np.nan, 1, np.nan, np.nan,
-1, np.nan, np.nan, 1, np.nan, np.nan,
1, 1, np.nan, -1, -1, 1, -1]})
c
0 NaN
1 1.0
2 NaN
3 1.0
4 NaN
5 NaN
6 -1.0
7 NaN
8 NaN
9 1.0
10 NaN
11 NaN
12 1.0
13 1.0
14 NaN
15 -1.0
16 -1.0
17 1.0
18 -1.0
(
    df
    .assign(
        # keep only the first marker of each consecutive run of identical 1/-1 values
        start_end=lambda df: df.index.isin(
            df
            .loc[lambda df: df.c.isin([1, -1])]
            .loc[lambda df: df.c.shift(1, fill_value=0) != df.c]
            .index),
        # flag the deduplicated starts (c == 1) and ends (c == -1)
        start=lambda df: np.where(np.logical_and(df.start_end == True, df.c == 1), 1, 0),
        end=lambda df: np.where(np.logical_and(df.start_end == True, df.c == -1), 1, 0),
        # a row gets a label while more subsets have started than have ended before it
        subset=lambda df: np.where(df.start.cumsum() != df.end.shift(1, fill_value=0).cumsum(),
                                   df.start.cumsum(),
                                   0)
    )
    .drop(columns=['start_end', 'start', 'end'])
)
c subset
0 NaN 0
1 1.0 1
2 NaN 1
3 1.0 1
4 NaN 1
5 NaN 1
6 -1.0 1
7 NaN 0
8 NaN 0
9 1.0 2
10 NaN 2
11 NaN 2
12 1.0 2
13 1.0 2
14 NaN 2
15 -1.0 2
16 -1.0 0
17 1.0 3
18 -1.0 3
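For readability, the same rule can also be written as a plain Python loop, which is a handy way to verify the vectorized result (the function name label_subsets and the column subset_loop are only illustrative, not part of the solution above):
import numpy as np
import pandas as pd

def label_subsets(c):
    # 0 means "outside any subset"; a new id is opened at the first 1 seen
    # while outside, and the subset is closed (inclusive) at the next -1.
    labels, open_id, next_id = [], 0, 0
    for v in c:
        if open_id == 0 and v == 1:
            next_id += 1
            open_id = next_id
        labels.append(open_id)
        if open_id != 0 and v == -1:
            open_id = 0
    return labels

df['subset_loop'] = label_subsets(df['c'])   # matches the subset column above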
Related
I have a dataframe similar to the following.
import pandas as pd
data = pd.DataFrame({'ind': [111,222,333,444,555,666,777,888,999,000],
'col1': [1,2,2,2,3,4,5,5,6,7],
'col2': [9,2,2,2,9,9,5,5,9,9],
'col3': [11,2,2,2,11,11,5,5,11,11],
'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})
There is an index ind, a number of columns col1, col2 and col3, and another column with a value val. Within col1, col2 and col3 there are a number of rows that are exactly the same as the previous row; for instance, the rows with index 333 and 444 are the same as 222. My actual data set is larger, but what I need to do is delete all rows whose values in a given set of columns (col1, col2, col3 here) are exactly the same as in the immediately preceding row.
This would give me a dataframe like this, with indices 333/444 and 888 removed:
data_clean = pd.DataFrame({'ind': [111,222,555,666,777,999,000],
'col1': [1,2,3,4,5,6,7],
'col2': [9,2,9,9,5,9,9],
'col3': [11,2,11,11,5,11,11],
'val': ['a', 'b', 'e', 'f', 'g', 'i', 'j']})
What is the best way to go about this for a larger dataframe?
You can use shift and any for boolean indexing:
cols = ['col1', 'col2', 'col3']
out = data[data[cols].ne(data[cols].shift()).any(axis=1)]
# DeMorgan's equivalent:
# out = data[~data[cols].eq(data[cols].shift()).all(axis=1)]
Output:
ind col1 col2 col3 val
0 111 1 9 11 a
1 222 2 2 2 b
4 555 3 9 11 e
5 666 4 9 11 f
6 777 5 5 5 g
8 999 6 9 11 i
9 0 7 9 11 j
Intermediates
# shifted dataset
data[cols].shift()
col1 col2 col3
0 NaN NaN NaN
1 1.0 9.0 11.0
2 2.0 2.0 2.0
3 2.0 2.0 2.0
4 2.0 2.0 2.0
5 3.0 9.0 11.0
6 4.0 9.0 11.0
7 5.0 5.0 5.0
8 5.0 5.0 5.0
9 6.0 9.0 11.0
# comparison
data[cols].ne(data[cols].shift())
col1 col2 col3
0 True True True
1 True True True
2 False False False
3 False False False
4 True True True
5 True False False
6 True True True
7 False False False
8 True True True
9 True False False
# aggregation
data[cols].ne(data[cols].shift()).any(axis=1)
0 True
1 True
2 False
3 False
4 True
5 True
6 True
7 False
8 True
9 True
dtype: bool
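If the list of key columns is long, one hedged variant derives it from the frame itself instead of typing it out (this assumes everything except ind and val should be compared, which may not hold for your real data):
# all comparison columns = every column except the ones to ignore
cols = data.columns.difference(['ind', 'val'])
out = data[data[cols].ne(data[cols].shift()).any(axis=1)]
out = out.reset_index(drop=True)   # optional: renumber the remaining rows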
I'm trying to use np.where to flag runs of consecutive non-NaN values longer than a certain length, as shown below:
e.g. if there are 3 or more consecutive non-NaN values, return True.
Would appreciate any help!
value  consecutive
nan    False
nan    False
1      False
1      False
nan    False
4      True
2      True
3      True
nan    False
nan    False
1      True
3      True
3      True
5      True
The idea is to create group ids with the cumulative sum of the missing-value mask, count the non-NaN rows per group with Series.value_counts, map those counts back onto the rows with Series.map, and keep only the non-NaN rows via the inverted mask ~m:
#convert values to numeric
df['value'] = df['value'].astype(float)
m = df['value'].isna()
s = m.cumsum()
N = 3
df['new'] = s.map(s[~m].value_counts()).ge(N) & ~m
print (df)
value consecutive new
0 NaN False False
1 NaN False False
2 1.0 False False
3 1.0 False False
4 NaN False False
5 4.0 True True
6 2.0 True True
7 3.0 True True
8 NaN False False
9 NaN False False
10 1.0 True True
11 3.0 True True
12 3.0 True True
13 5.0 True True
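An equivalent, hedged sketch using groupby/transform instead of map plus value_counts (the result is written to a hypothetical column new2 just for comparison):
m = df['value'].isna()
s = m.cumsum()                     # group id: increases at every NaN
N = 3
# size of each NaN-delimited group, counting only the non-NaN rows
df['new2'] = (~m).groupby(s).transform('sum').ge(N) & ~m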
I want to fill a column with True and NaN values:
import numpy as np
import pandas as pd
my_list = [1,2,3,4,5]
df = pd.DataFrame({'col1' : [0,1,2,3,4,5,6,7,8,9,10]})
df['col2'] = np.where(df['col1'].isin(my_list), True, np.NaN)
print (df)
It prints:
col1 col2
0 0 NaN
1 1 1.0
2 2 1.0
3 3 1.0
4 4 1.0
5 5 1.0
6 6 NaN
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
But it is very important for me to get the bool value True, not the float number 1.0. This column interacts with other columns, which are bool, so it must be bool too.
I know I can change it with the replace function, but my DataFrame is very large and I don't want to waste time on that. Is there a simple way to do it?
This code will solve your problem. np.where gives you 1.0 instead of True because np.nan is a float, so NumPy upcasts the whole result array to float and True becomes 1.0. Using apply keeps the column as object dtype, so the True values are preserved.
Code
import numpy as np
import pandas as pd
my_list = [1,2,3,4,5]
df = pd.DataFrame({'col1' : [0,1,2,3,4,5,6,7,8,9,10]})
df['col2'] = df['col1'].apply(lambda x: True if x in my_list else np.NaN)
print (df)
Results
col1 col2
0 0 NaN
1 1 True
2 2 True
3 3 True
4 4 True
5 5 True
6 6 NaN
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
Use the nullable Boolean data type ('boolean'):
df['col2'] = pd.Series(np.where(df['col1'].isin(my_list), True, np.NaN), dtype='boolean')
print (df)
col1 col2
0 0 <NA>
1 1 True
2 2 True
3 3 True
4 4 True
5 5 True
6 6 <NA>
7 7 <NA>
8 8 <NA>
9 9 <NA>
10 10 <NA>
You can also just convert the 1.0 values back to True afterwards:
df.col2 = df.col2.apply(lambda x: True if x==1.0 else x)
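Another hedged option, staying entirely in the nullable boolean dtype: build the membership mask and let Series.where turn the False entries into <NA> (works in recent pandas versions):
s = df['col1'].isin(my_list).astype('boolean')
df['col2'] = s.where(s)   # True stays True, False becomes <NA>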
I have a DataFrame that looks like the following:
X Y Date are_equal
0 50.0 10.0 2018-08-19 False
1 NaN 10.0 2018-08-19 False
2 NaN 50.0 2018-08-19 True
3 10.0 NaN 2018-08-21 False
4 1.0 NaN 2018-08-19 False
5 NaN 10.0 2018-08-22 False
6 10.0 NaN 2018-08-21 False
The are_equal column indicates that a value in Y is in X for the same date (in this case 50.0).
I am trying to group by date and find whether X contains a specific value (say 1.0) for a date that contains are_equal True.
My approach was to use df.iterrows() and get the row at next index after meeting the condition df['are_equal'] == True. However, the rows aren't necessarily ordered.
How can I group by Date and check if a date contains True in are_equal and 1.0 in column X for that same date?
The output I'm trying to achieve is a new Boolean column that looks like this:
contains_specific_value
0 False
1 False
2 False
3 False
4 True
5 False
6 False
You can use apply here; it is easy to extend with more conditions, but slow. Also check the other solution below, which uses transform.
df['New']=df.groupby('Date').apply(lambda x : (x['X']==1)&x['are_equal'].any()).reset_index(level=0,drop=True)
df
Out[101]:
X Y Date are_equal New
0 50.0 10.0 2018-08-19 False False
1 NaN 10.0 2018-08-19 False False
2 NaN 50.0 2018-08-19 True False
3 10.0 NaN 2018-08-21 False False
4 1.0 NaN 2018-08-19 False True
5 NaN 10.0 2018-08-22 False False
6 10.0 NaN 2018-08-21 False False
Or transform:
df['X'].eq(1)&(df.groupby('Date')['are_equal'].transform('any'))
Out[102]:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
dtype: bool
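The per-date part of the transform can also be inspected on its own; with the sample data it flags every row whose date contains at least one are_equal == True:
df.groupby('Date')['are_equal'].transform('any')
0     True
1     True
2     True
3    False
4     True
5    False
6    False
Name: are_equal, dtype: bool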
I was looking into this post, which almost solved my problem. However, in my case I want to work based on the 2nd level of the df, while trying not to specify my 1st-level column names explicitly.
Borrowing the original dataframe:
df = pd.DataFrame({('A','a'): [-1,-1,0,10,12],
('A','b'): [0,1,2,3,-1],
('B','a'): [-20,-10,0,10,20],
('B','b'): [-200,-100,0,100,200]})
##df
A B
a b a b
0 -1 0 -20 -200
1 -1 1 -10 -100
2 0 2 0 0
3 10 3 10 100
4 12 -1 20 200
I want to assign NA to both a and b of a 1st-level group in the rows where that group's b is negative. I was selecting the relevant values with df.xs('b',axis=1,level=1) < 0, but then I cannot actually perform the replacement. Moreover, I have varying 1st-level names, so the indexing cannot be based on A and B explicitly, but possibly on df.columns.values?
The desired output would be
##df
A B
a b a b
0 -1 0 NA NA
1 -1 1 NA NA
2 0 2 0 0
3 10 3 10 100
4 NA NA 20 200
I appreciate all tips, thank you in advance.
You can use DataFrame.mask with the boolean mask broadcast back to the original column structure via reindex with level=0, so it gets the same index and columns as the original DataFrame:
mask = df.xs('b',axis=1,level=1) < 0
print (mask)
A B
0 False True
1 False True
2 False False
3 False False
4 True False
print (mask.reindex(columns = df.columns, level=0))
A B
a b a b
0 False False True True
1 False False True True
2 False False False False
3 False False False False
4 True True False False
df = df.mask(mask.reindex(columns = df.columns, level=0))
print (df)
A B
a b a b
0 -1.0 0.0 NaN NaN
1 -1.0 1.0 NaN NaN
2 0.0 2.0 0.0 0.0
3 10.0 3.0 10.0 100.0
4 NaN NaN 20.0 200.0
Edit by OP: I had asked in the comments how to combine multiple conditions (e.g. df.xs('b',axis=1,level=1) < 0 OR df.xs('b',axis=1,level=1) being NA). @jezrael kindly indicated that in this case I should use
mask = (df.xs('b',axis=1,level=1) < 0) | df.xs('b',axis=1,level=1).isnull()
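Putting the OP's extended condition together with the masking step above, a hedged sketch would be:
b = df.xs('b', axis=1, level=1)
mask = (b < 0) | b.isnull()                              # negative or missing b
df = df.mask(mask.reindex(columns=df.columns, level=0))  # broadcast to a and b of each group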