Compare Boolean variables within Time Window of Pandas Dataframe - python

I have a pandas dataframe that looks like so:
      datetime  Online   TEST
61  2018-03-03    True  False
62  2018-03-04    True  False
63  2018-03-05    True  False
64  2018-03-06    True  False
65  2018-03-07    True  False
66  2018-03-08    True  False
67  2018-03-09    True  False
68  2018-03-10    True  False
69  2018-03-11   False  False
70  2018-03-12   False  False
I need to check that for each False in the TEST column, there is a False in the Online column within a date range of 7 days. For example, on 2018-03-03, since TEST is False, I would want to check all dates plus or minus 7 days (ie plus or minus timedelta(days = 7)) for False values in the Online column. Since there are no False Online values within that 7-day window (the nearest False, on 2018-03-11, is 8 days away), we would return False. On the other hand, consider the date 2018-03-09, where Online is True and TEST is False. Since there is a False in Online on 2018-03-11, I need to return a boolean True saying that there was a False within my 7-day time range.
I can achieve this using some slow and ugly looping mechanisms (ie go through each row using DataFrame.iterrows(), check if TEST is False, then pull the time window of plus or minus 7 days and see if Online has a corresponding False value), but I would ideally like to have something snazzier and faster. For a visual, this is what I need my final dataframe to look like:
      datetime  Online   TEST  Check
61  2018-03-03    True  False  False
62  2018-03-04    True  False   True
63  2018-03-05    True  False   True
64  2018-03-06    True  False   True
65  2018-03-07    True  False   True
66  2018-03-08    True  False   True
67  2018-03-09    True  False   True
68  2018-03-10    True  False   True
69  2018-03-11   False  False   True
70  2018-03-12   False  False   True
Any ideas out there? Thanks in advance!

Building upon @piRSquared's great comments (I didn't even know about the rolling method, it seems very useful!), you can use
check = ~(df.TEST + df.Online.rolling(15, center=True, min_periods=1).apply(np.prod).eq(1))
The second summand creates a boolean Series indicating, for each row, that there is no False value in a centered window of 15 rows (the current row plus 7 on each side); this is achieved by multiplying (NumPy's prod function) all the values inside the rolling window, so the product equals 1 only when every value is True.
Adding two boolean Series acts as an element-wise logical OR, so after negating with ~ we get True in the Check series only when both operands are False: TEST is False and the window contains at least one False Online value.
Hope it helps.
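Note that rolling(15) counts rows, not days, so it matches the plus-or-minus-7-days requirement only when the rows are consecutive daily observations. Below is a minimal sketch of a date-based variant; it assumes the datetime column holds real timestamps in sorted order, and it needs pandas 1.3+ for center=True with a time-based window:
import pandas as pd

df['datetime'] = pd.to_datetime(df['datetime'])

# 1 where Online is False, indexed by the timestamps
not_online = (~df.set_index('datetime')['Online']).astype(int)

# any False Online value within +/- 7 days: a centered 15-day window
# spans 7.5 days on each side, i.e. +/- 7 days for daily data
has_false_nearby = not_online.rolling('15D', center=True, min_periods=1).max().eq(1)

# Check is True where TEST is False and some nearby Online value is False
df['Check'] = ~df['TEST'].to_numpy() & has_false_nearby.to_numpy()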


Pandas data precision

I got unexpected behaviour, at least for me, while working with a pandas dataframe. I compare the values of a 2d array against a threshold to check whether they are smaller than it. If I check the entire DataFrame at once, the check is wrong for some values, but if I check the concerned cell explicitly, the result is correct.
print(df.loc[5,397])
out: 14.4 # --> it's actually 14.3999996185
print(df.loc[5,397] < 14.4)
out: True
print(df.loc[4:6,396:398] < 14.4)
out:
     396    397    398
4  False  False  False
5  False  False  False   # [5,397] should be True!
6  False  False  False
However, when I try to reproduce the error, I get the correct result?!
data = numpy.array([[15,15,15], [15,14.3999996185,15], [15,15,15]])
df = pd.DataFrame(data)
print(df.loc[1,1])
out: 14.3999996185
print(df.loc[1,1] < 14.4)
out: True
print(df < 14.4)
out:
       0      1      2
0  False  False  False
1  False   True  False
2  False  False  False
Thank you
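A plausible cause, though it is an assumption since the original frame's dtype isn't shown: the first DataFrame may be float32 while the reproduction above uses float64. The float32 closest to 14.4 is about 14.3999996185, and depending on the NumPy version the frame-wide comparison can be carried out at float32 precision, where the stored value and 14.4 are the same number:
import numpy as np
import pandas as pd

df32 = pd.DataFrame(np.array([[15, 14.4, 15]], dtype=np.float32))
print(df32.iloc[0, 1])         # prints 14.4, actually stored as ~14.3999996185
print(df32.iloc[0, 1] < 14.4)  # True on NumPy < 2.0: the float32 scalar is upcast to float64
print(df32 < 14.4)             # the frame comparison can run in float32, where
                               # float32(14.4) equals the stored value -> False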

Filter Pandas DataFrame using a loop that is applying a dynamic conditional

I have a dataframe which contains two columns.
One column shows distance, the other column contains unique 'trackIds' that are associated with a set of distances.
Example:
trackId  distance
2          17.452
2           8.650
2          10.392
2          11.667
2          23.551
2           9.881
3           6.052
3           7.241
3           8.459
3          22.644
3         126.890
3          12.442
3           5.891
4          44.781
4           7.657
4          36.781
4         224.001
What I am trying to do is eliminate any trackIds that contain a large spike in distance -- a spike that is > 75.
In this example case, trackIds 3 and 4 (and all their associated distances) would be removed from the dataframe because they contain spikes in distance greater than 75; we would be left with a dataframe containing only trackId 2 and all of its associated distance values.
Here is my code:
i = 0
k = 1
length = len(dataframe)
while i < length:
    if (dataframe.distance[k] - dataframe.distance[i]) > 75:
        bad_id = dataframe.trackId[k]
        condition = dataframe.trackid != bad_id
        df2 = dataframe[condition]
    i += 1
I tried to use a while loop to go through all the different trackIds and subtract the distance values to see if the result was > 75; if it was, the program associated that trackId with the variable 'bad_id' and used it as a condition to filter the dataframe so it only contains trackIds not equal to the bad_id(s).
I just keep getting NameErrors because I'm unsure of how to properly structure the loop, and in general I'm not sure this approach works anyway.
We can use diff to compute the difference between consecutive rows, then use groupby transform to check whether any difference within the group is greater than 75, and keep only the groups with no matches:
m = ~(df['distance'].diff().gt(75).groupby(df['trackId']).transform('any'))
filtered_df = df.loc[m, df.columns]
filtered_df:
   trackId  distance
0      2.0    17.452
1      2.0     8.650
2      2.0    10.392
3      2.0    11.667
4      2.0    23.551
5      2.0     9.881
Breakdown of steps as a DataFrame:
breakdown = pd.DataFrame({'diff': df['distance'].diff()})
breakdown['gt 75'] = breakdown['diff'].gt(75)
breakdown['groupby any'] = (
    breakdown['gt 75'].groupby(df['trackId']).transform('any')
)
breakdown['negation'] = ~breakdown['groupby any']
print(breakdown)
breakdown:
        diff  gt 75  groupby any  negation
0        NaN  False        False      True
1     -8.802  False        False      True
2      1.742  False        False      True
3      1.275  False        False      True
4     11.884  False        False      True
5    -13.670  False        False      True
6     -3.829  False         True     False
7      1.189  False         True     False
8      1.218  False         True     False
9     14.185  False         True     False
10   104.246   True         True     False  # Spike of more than 75
11  -114.448  False         True     False
12    -6.551  False         True     False
13    38.890  False         True     False
14   -37.124  False         True     False
15    29.124  False         True     False
16   187.220   True         True     False  # Spike of more than 75
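One caveat worth noting (a refinement, not part of the answer above): df['distance'].diff() also takes a difference across the boundary between two trackIds (row 6 above is 6.052 - 9.881). If a jump between tracks should never count as a spike, the diff can be computed per group instead:
# per-group diff: the first row of each trackId becomes NaN instead of a
# difference against the previous track's last distance
m = ~df.groupby('trackId')['distance'].diff().gt(75).groupby(df['trackId']).transform('any')
filtered_df = df[m]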

Inverting a boolean Series yields -1 for False and -2 for True in Pandas

I have a subset of a Series in a Pandas dataframe populated with the bool values True and False. I am trying to invert the series using ~.
This is the original subset of the Series.
7 True
8 False
14 True
38 False
72 False
...
Name: Status, Length: 197, dtype: object
Now I am using the following code to invert the values.
mask = ~subset_df['Status']
The result I get is
7 -2
8 -1
14 -2
38 -1
72 -1
...
Name: Status, Length: 197, dtype: object
but what I really want is the following output:
7 False
8 True
14 False
38 True
72 True
...
Name: Status, Length: 197, dtype: object
I would really appreciate it if you could let me know how to invert a boolean Series without the values being converted into -1 and -2. Thank you so much.
For some reason, you have a series of object dtype, filled with what are probably ordinary Python bools. Applying ~ to such a series goes through element-wise and applies the ordinary ~ operator to each element, and ordinary Python bools inherit ~ from int: it is the bitwise complement (~x == -x - 1), not logical negation.
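For example, with plain Python bools (newer Python versions, 3.12+, even deprecate ~ on bools for exactly this reason):
print(~True)   # ~1 -> -2
print(~False)  # ~0 -> -1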
You can convert your series to boolean dtype first before applying ~ to get logical negation:
~series.astype(bool)
On a sufficiently recent Pandas version (1.0 and up), you may wish to use the nullable boolean dtype instead, with astype('boolean') rather than astype(bool).
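A quick sketch of the difference between the two, using None to stand in for a missing value:
import pandas as pd

s = pd.Series([True, False, None])    # object dtype, as in the question
print(~s.astype(bool))       # bool(None) is False, so the None becomes True after ~
print(~s.astype('boolean'))  # nullable dtype: the None stays <NA> under negation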
You should also figure out why your series has object dtype in the first place - it's likely that the correct place to address this issue is somewhere earlier in your code, not here. Perhaps you built it wrong, or you tried to use NaN or None to represent missing values.
You can use the replace method of a pandas Series:
subset_df['Status'].replace(to_replace=[True, False], value=[False, True])
This will return another series with the replaced values. But if you want to change the actual dataframe, you can add the parameter inplace=True:
subset_df['Status'].replace(to_replace=[True, False], value=[False, True], inplace=True)
EDIT:
If the Status column is of object/string type, we check whether the Status value is the string 'True' and replace it with the boolean False, and replace the string 'False' with True.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Status': ['True', 'False', 'True', 'False', 'False']
})
df.Status = np.where(df['Status'] == 'True', False, True)
df
Output
Status
0 False
1 True
2 False
3 True
4 True
If the series is of boolean dtype, then the options below can be used.
s = pd.Series([True, False, True, False, False])
Two options
-s
or
np.invert(s)
Output
0 False
1 True
2 False
3 True
4 True
dtype: bool
What you tried works for a DataFrame column as well
df = pd.DataFrame({
    'Status': [True, False, True, False, False]
})
df['Status'] = ~df['Status']
df
Output
Status
0 False
1 True
2 False
3 True
4 True

Isn't taking the mean of a pandas column of boolean values supposed to return the proportion that is True?

So I took the mean of a pandas dataframe column that contains boolean values. I've done this multiple times in the past and understood that it returns the proportion that is True. But when I wrote it in this particular instance, it didn't work: it returns the proportion that is False, and not only that, the denominator it uses doesn't seem to relate to anything. I have no idea where it pulls the denominator from to calculate the proportion. I discovered that it works the way I want when I remove the second line of code (datadf = datadf[1:]).
# get current row value minus previous row value and returns True if > 0
datadf['increase'] = datadf.index.map(lambda x: datadf.loc[x]['price'] - datadf.loc[x-1]['price'] > 0 if x > 0 else None)
# remove first row because it gives 'None'
datadf = datadf[1:]
# calculate proportion that is True
accretionscore = datadf['increase'].mean()
This is the output
         date   price increase
1  2020-09-28  488.51     True
2  2020-09-29  489.33     True
3  2020-09-30  490.43     True
4  2020-10-01  499.51     True
5  2020-10-02  478.99    False
correct value: 0.8
value given: 0.2
When I try adding another sample, things get weirder:
         date   price increase
1  2020-09-27  479.78    False
2  2020-09-28  488.51     True
3  2020-09-29  489.33     True
4  2020-09-30  490.43     True
5  2020-10-01  499.51     True
6  2020-10-02  478.99    False
correct value: 0.6666666666666666
value given: 0.16666666666666666
they don't even add up to 1!
I'm so confused. Can anyone tell me what is going on? How does taking out the second line fix the problem?
Hint: if you want to convert from boolean to int, you can just use:
datadf['increase'] = datadf['increase'].astype(int)
and this way things will work fine.
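For instance, a sketch with the sample values above, after the None row has been dropped:
import pandas as pd

# like datadf['increase'][1:] from the question
inc = pd.Series([True, True, True, True, False], dtype=object)
print(inc.astype(int).mean())  # 0.8 -> the proportion of True values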
If we run your code, you can see that datadf['increase'] has object dtype instead of boolean, so taking the mean of it is most likely converting the values to numbers in some unexpected way, basically something weird:
import pandas as pd
datadf = pd.DataFrame({'price':[470,488.51,489.33,490.43,499.51,478.99]})
datadf['increase'] = datadf.index.map(lambda x: datadf.loc[x]['price'] - datadf.loc[x-1]['price'] > 0 if x > 0 else None)
datadf['increase']
Out[8]:
0 None
1 True
2 True
3 True
4 True
5 False
Name: increase, dtype: object
datadf['increase'].dtype
dtype('O')
From what I can see, you want True/False depending on whether a row is larger than the preceding one, so do:
datadf['increase'] = datadf.price > datadf.price.shift(1)
datadf['increase'].dtype
dtype('bool')
And we just omit the first row (where the comparison against the shifted NaN yields False) by doing:
datadf['increase'][1:].mean()
0.8

Python - isnull().sum() vs isnull().count()

So I'm currently finishing a tutorial with the titanic dataset (https://www.kaggle.com/c/titanic/data).
Now I'm trying a couple of new things that might be related.
The info for it (shown as a screenshot in the original post) reports 891 entries and several columns with NaN values.
When I went to get a little summary of the missing values, I got confused by .sum() & .count():
With train_df.isnull().sum(), the sum increments by one for each instance of a null value, so the output is the number of missing entries for each column in the data frame (which is what I want).
However, if we do .count(), we get 891 for each column, no matter whether we use .isnull().count() or .notnull().count().
So my question(s):
What does .count() mean in this context?
I thought that it would count every instance of the wanted condition (in this case, every null or not-null entry; basically what .sum() did).
Also, is my "definition" of how .sum() is being used correct?
Just print out the data of train_df.isnull() and you will see it.
# data analysis and wrangling
import pandas as pd
import numpy as np
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
train_df = pd.read_csv('train.csv')
print(train_df.isnull())
result:
    PassengerId  Survived  Pclass   Name    Sex    Age  SibSp  Parch  Ticket  \
0         False     False   False  False  False  False  False  False   False
1         False     False   False  False  False  False  False  False   False
2         False     False   False  False  False  False  False  False   False
3         False     False   False  False  False  False  False  False   False
4         False     False   False  False  False  False  False  False   False
..          ...       ...     ...    ...    ...    ...    ...    ...     ...
886       False     False   False  False  False  False  False  False   False
887       False     False   False  False  False  False  False  False   False
888       False     False   False  False  False   True  False  False   False
889       False     False   False  False  False  False  False  False   False
890       False     False   False  False  False  False  False  False   False
It has 891 rows, full of Trues and Falses.
When you use sum(), it returns the sum of every column, which adds the Trues (=1) and Falses (=0) together, like this:
print(False+False+True+True)
2
When you use count(), it just returns the number of non-null rows, and the boolean frame produced by isnull() has no missing values.
That is why you get 891 for each column no matter whether you use .isnull().count() or .notnull().count().
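A tiny sketch of the difference between the two:
import pandas as pd

s = pd.Series([1.0, None, 3.0])
print(s.isnull())          # False, True, False
print(s.isnull().sum())    # 1 -> how many values are missing
print(s.isnull().count())  # 3 -> how many rows there are, missing or not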
data.isnull().count() returns the total number of rows irrespective of missing values; you need to use data.isnull().sum().
