I'm trying to create a boolean mask with multiple conditions on a DatetimeIndex. Here is my example:
df = pd.DataFrame(index=pd.date_range('2020-05-24', '2020-05-26', freq='1H', closed='left'))
mybool = np.logical_and(df.index.weekday < 5, df.index.hour > 7, df.index.hour < 20)
So mybool should be True Monday through Friday for the 12 hours from 8am to 8pm. However, this returns True Mon-Fri from 8am to midnight, so it looks like the first two conditions are picked up but the third is not, and no error is raised.
Chain your conditions with bitwise ands:
(df.index.weekday < 5) & (df.index.hour > 7) & (df.index.hour < 20)
Note that np.logical_and expects only two input arrays, x1 and x2; a third positional argument is interpreted as the out parameter (the array the result is written into). That is why your call raises no error: it computes the first two conditions and writes the result into the third array, which explains the 8am-to-midnight mask you saw. An alternative would be to use np.logical_and.reduce on a list of conditions:
np.logical_and.reduce([df.index.weekday < 5, df.index.hour > 7, df.index.hour < 20])
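As a quick sanity check on your example frame (a sketch; 2020-05-24 is a Sunday, so the expected hits are the Monday hours 08:00 through 19:00):

import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.date_range('2020-05-24', '2020-05-26', freq='1H', closed='left'))  # newer pandas spells this inclusive='left'
mybool = np.logical_and.reduce([df.index.weekday < 5, df.index.hour > 7, df.index.hour < 20])
print(df.index[mybool])  # DatetimeIndex from 2020-05-25 08:00 to 2020-05-25 19:00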
Related
I am trying to chain a variable number of boolean pandas Series very efficiently, to be used as a filter on a DataFrame through boolean indexing.
Normally when dealing with multiple boolean conditions, one chains them like this
condition_1 = (df.A > some_value)
condition_2 = (df.B <= other_value)
condition_3 = (df.C == another_value)
full_indexer = condition_1 & condition_2 & condition_3
but this becomes a problem with a variable number of conditions.
bool_indexers = [
    condition_1,
    condition_2,
    ...,
    condition_N,
]
I have tried out some possible solutions, but I am convinced it can be done more efficiently.
Option 1
Loop over the indexers and apply consecutively.
full_indexer = bool_indexers[0]
for indexer in bool_indexers[1:]:
    full_indexer &= indexer
Option 2
Put into a DataFrame and calculate the row product.
full_indexer = pd.DataFrame(bool_indexers).product(axis=0)
Option 3
Use numpy.prod (like in this answer) and create a new Series out of the result.
full_indexer = pd.Series(np.prod(np.vstack(bool_indexers), axis=0))
All three solutions are somewhat inefficient because they rely on looping or force you to create a new object (which can be slow if repeated many times).
Can it be done more efficiently or is this it?
Use np.logical_and:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0, 1, 2], 'B': [0, 1, 2], 'C': [0, 1, 2]})
m1 = df.A > 0
m2 = df.B <= 1
m3 = df.C == 1
m = np.logical_and.reduce([m1, m2, m3])
# OR m = np.all([m1, m2, m3], axis=0)
out = df[np.logical_and.reduce([m1, m2, m3])]
Output:
>>> pd.concat([m1, m2, m3], axis=1)
       A      B      C
0  False   True  False
1   True   True   True
2   True  False  False
>>> m
array([False, True, False])
>>> out
   A  B  C
1  1  1  1
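One caveat: np.logical_and.reduce returns a plain NumPy array rather than a Series. If you need an indexer that keeps the DataFrame's index (for example for .loc alignment), a sketch of an alternative is:

m = pd.concat([m1, m2, m3], axis=1).all(axis=1)  # boolean Series aligned to df's index
out = df[m]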
I have a meteorological DataFrame, indexed by TimeStamp, and I want to find all the possible periods of 24 hours present in the DataFrame with these conditions:
at least 6 hours of Rainfalls with Temperature > 10°C
a minimum of 6 consecutive hours of Relative Humidity > 90%.
The hours taken into consideration may also overlap (a period with 6 hours of both RH > 90 and Rainfalls > 0 is sufficient).
A sample DataFrame with 48 hours can be created by:
df = pd.DataFrame({'TimeStamp': pd.date_range('1/5/2015 00:00:00', periods=48, freq='H'),
                   'Temperature': np.random.choice([11, 12, 13], 48),
                   'Rainfalls': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.2,0.3,0.3,0.3,0.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
                   'RelativeHumidity': [95,95,95,95,95,95,80,80,80,80,80,80,80,80,85,85,85,85,85,85,85,85,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80]})
df = df.set_index('TimeStamp')
As output I just want the indexes of the TimeStamps at which each period with the mentioned characteristics starts. For the sample df, only the first TimeStamp should be returned.
I have tried to use the df.rolling() function but I managed to find only the 6 hours of consecutive RH > 90.
Thanks in advance for the help.
I hope I've understood your question right. This example finds every group where Temperature > 10 and RH > 90 with a minimum length of 6 hours, and then prints the first index of each group:
x = (df.Temperature > 10).astype(int) + (df.RelativeHumidity > 90).astype(int)
out = (
    x.groupby((x != x.shift(1)).cumsum().values)
     .apply(lambda x: x.index[0] if (x.iat[0] == 2) and len(x) > 5 else np.nan)
     .dropna()
)
print(out)
Prints:
1 2015-01-05
dtype: datetime64[ns]
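The key idiom here is (x != x.shift(1)).cumsum(), which assigns a fresh label to every run of consecutive equal values, so the groupby isolates those runs. A minimal illustration:

import pandas as pd

s = pd.Series([2, 2, 1, 1, 2, 2, 2])
print((s != s.shift(1)).cumsum().tolist())  # [1, 1, 2, 2, 3, 3, 3] -- one label per consecutive run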
I am trying to return values less than 40 or greater than 100 in a column using pandas. The current line of code that I am using only returns the values between 40 and 100 (so basically the opposite of the range that I want).
df = pd.DataFrame(data)
Test = df[(df['QC 1'] >= 40) & (df['QC 1'] <= 100)]
print(Test)
I feel like I'm probably missing something very obvious here but I haven't been able to figure out what that is.
Change >= to <, <= to >, and combine the conditions with | (bitwise OR):
df = pd.DataFrame({'QC 1':[10,50,300],'B':[8,2,0]})
test = df[(df['QC 1'] < 40) | (df['QC 1'] > 100)]
print(test)
QC 1 B
0 10 8
2 300 0
This works the same as an inverted mask of your solution:
Test = df[~((df['QC 1'] >= 40) & (df['QC 1'] <= 100))]
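A quick check on the sample frame that the two forms agree (this is just De Morgan's law, ~(x & y) == ~x | ~y):

a = df[(df['QC 1'] < 40) | (df['QC 1'] > 100)]
b = df[~((df['QC 1'] >= 40) & (df['QC 1'] <= 100))]
print(a.equals(b))  # True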
I am following the suggestions here: pandas create new column based on values from other columns, but I am still getting an error. Basically, my Pandas dataframe has many columns, and I want to group the dataframe based on a new categorical column whose value depends on two existing columns (AMP, Time).
df['Time'] = pd.to_datetime(df['Time'])
# making sure the Time column read from the csv file is a datetime object
import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)
def f(row):
    if (row['AMP'] > 100) & (row['Time'] > day_1):
        val = 'new_positives'
    elif (row['AMP'] > 100) & (day_2 <= row['Time'] <= day_1):
        val = 'rec_positives'
    elif (row['AMP'] > 100 & row['Time'] < day_2):
        val = 'old_positives'
    else:
        val = 'old_negatives'
    return val
df['GRP'] = df.apply(f, axis=1) #this gives the following error:
TypeError: ("Cannot compare type 'Timestamp' with type 'date'", 'occurred at index 0')
df[(df['AMP'] > 100) & (df['Time'] > day_1)] #this works fine
df[(df['AMP'] > 100) & (day_2 <= df['Time'] <= day_1)] #this works fine
df[(df['AMP'] > 100) & (df['Time'] < day_2)] #this works fine
#df = df.groupby('GRP')
I am able to select the proper sub-dataframes based on the conditions specified above, but when I apply the above function on each row, I get the error. What is the correct approach to group the dataframe based on the conditions listed?
EDIT:
Unfortunately, I cannot provide a sample of my dataframe. However, here is a simple dataframe that gives an error of the same type:
import numpy as np
import pandas as pd
mydf = pd.DataFrame({'a': np.arange(10),
                     'b': np.random.rand(10)})

def f1(row):
    if row['a'] < 5 & row['b'] < 0.5:
        value = 'less'
    elif row['a'] < 5 & row['b'] > 0.5:
        value = 'more'
    else:
        value = 'same'
    return value
mydf['GRP'] = mydf.apply(f1, axis=1)
TypeError: ("unsupported operand type(s) for &: 'int' and 'float'", 'occurred at index 0')
EDIT 2:
As suggested below, enclosing each comparison in parentheses did the trick for the cooked-up example. This problem is solved.
However, I am still getting the same error in my real example. By the way, if I were to use the column 'AMP' with perhaps another column in my table, then everything works and I am able to create df['GRP'] by applying the function f to each row. This shows the problem is related to using df['Time']. But then why am I able to select df[(df['AMP'] > 100) & (df['Time'] > day_1)]? Why would this work in this context, but not when the condition appears in a function?
Based on your error message and example, there are two things to fix. One is to adjust parentheses for operator precedence in your final elif statement. The other is to avoid mixing datetime.date and Timestamp objects.
Fix 1: change this:
elif (row['AMP'] > 100 & row['Time'] < day_2):
to this:
elif (row['AMP'] > 100) & (row['Time'] < day_2):
These two lines are different because the bitwise & operator takes precedence over the < and > comparison operators, so Python attempts to evaluate 100 & row['Time']. A full list of Python operator precedence is here: https://docs.python.org/3/reference/expressions.html#operator-precedence
Fix 2: Change these 3 lines:
import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)
to these 2 lines:
day_1 = pd.to_datetime('today')
day_2 = day_1 - pd.DateOffset(days=1)
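Putting both fixes together, a sketch of the corrected function (same column names as the question; normalize() is an assumption added here so day_1 compares like a date at midnight):

import pandas as pd

day_1 = pd.to_datetime('today').normalize()  # midnight today, as a Timestamp (assumption)
day_2 = day_1 - pd.DateOffset(days=1)

def f(row):
    if (row['AMP'] > 100) & (row['Time'] > day_1):
        return 'new_positives'
    elif (row['AMP'] > 100) & (day_2 <= row['Time'] <= day_1):
        return 'rec_positives'
    elif (row['AMP'] > 100) & (row['Time'] < day_2):
        return 'old_positives'
    return 'old_negatives'

df['GRP'] = df.apply(f, axis=1)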
Some parentheses need to be added in the if-statements:
import numpy as np
import pandas as pd
mydf = pd.DataFrame({'a': np.arange(10),
                     'b': np.random.rand(10)})
def f1(row):
    if (row['a'] < 5) & (row['b'] < 0.5):
        value = 'less'
    elif (row['a'] < 5) & (row['b'] > 0.5):
        value = 'more'
    else:
        value = 'same'
    return value
mydf['GRP'] = mydf.apply(f1, axis=1)
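Since apply(..., axis=1) hands the function scalar values, the plain and operator also works here and sidesteps the precedence pitfall entirely; a sketch:

def f1(row):
    if row['a'] < 5 and row['b'] < 0.5:
        return 'less'
    elif row['a'] < 5 and row['b'] > 0.5:
        return 'more'
    return 'same'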
If you don't need to use a custom function, then you can use multiple masks (somewhat similar to this SO post)
For the Time column, I used the code below. It may be that you were comparing Time column values that did not have the required dtype (this is my guess).
import datetime as dt
mydf['Time'] = pd.date_range(start='10/14/2018', end=dt.date.today())
day_1 = pd.to_datetime(dt.date.today())
day_2 = day_1 - pd.DateOffset(days = 1)
Here is the raw data
mydf
   a         b       Time
0  0  0.550149 2018-10-14
1  1  0.889209 2018-10-15
2  2  0.845740 2018-10-16
3  3  0.340310 2018-10-17
4  4  0.613575 2018-10-18
5  5  0.229802 2018-10-19
6  6  0.013724 2018-10-20
7  7  0.810413 2018-10-21
8  8  0.897373 2018-10-22
9  9  0.175050 2018-10-23
One approach involves using masks for columns
# Append new column
mydf['GRP'] = 'same'
# Use masks to change values in new column
mydf.loc[(mydf['a'] < 5) & (mydf['b'] < 0.5) & (mydf['Time'] < day_2), 'GRP'] = 'less'
mydf.loc[(mydf['a'] < 5) & (mydf['b'] > 0.5) & (mydf['Time'] > day_1), 'GRP'] = 'more'
mydf
   a         b       Time   GRP
0  0  0.550149 2018-10-14  same
1  1  0.889209 2018-10-15  same
2  2  0.845740 2018-10-16  same
3  3  0.340310 2018-10-17  less
4  4  0.613575 2018-10-18  same
5  5  0.229802 2018-10-19  same
6  6  0.013724 2018-10-20  same
7  7  0.810413 2018-10-21  same
8  8  0.897373 2018-10-22  same
9  9  0.175050 2018-10-23  same
Another approach is to set a, b and Time as a multi-index and use index-based masks to set values
mydf.set_index(['a','b','Time'], inplace=True)
# Get Index level values
a = mydf.index.get_level_values('a')
b = mydf.index.get_level_values('b')
t = mydf.index.get_level_values('Time')
# Apply index-based masks
mydf['GRP'] = 'same'
mydf.loc[(a < 5) & (b < 0.5) & (t < day_2), 'GRP'] = 'less'
mydf.loc[(a < 5) & (b > 0.5) & (t > day_1), 'GRP'] = 'more'
mydf.reset_index(drop=False, inplace=True)
mydf
   a         b       Time   GRP
0  0  0.550149 2018-10-14  same
1  1  0.889209 2018-10-15  same
2  2  0.845740 2018-10-16  same
3  3  0.340310 2018-10-17  less
4  4  0.613575 2018-10-18  same
5  5  0.229802 2018-10-19  same
6  6  0.013724 2018-10-20  same
7  7  0.810413 2018-10-21  same
8  8  0.897373 2018-10-22  same
9  9  0.175050 2018-10-23  same
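Not part of the original answer, but a compact vectorized variant of the mask approach is np.select, which maps an ordered list of conditions to values with a default; a sketch on the same frame:

import numpy as np

conditions = [
    (mydf['a'] < 5) & (mydf['b'] < 0.5) & (mydf['Time'] < day_2),
    (mydf['a'] < 5) & (mydf['b'] > 0.5) & (mydf['Time'] > day_1),
]
mydf['GRP'] = np.select(conditions, ['less', 'more'], default='same')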
Source used to filter by datetime and create a range of dates.
There is an excellent example here; it is very useful, and you can apply filters after groupby. It is a way to do this without using masks.
def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'
In [6]: grouped = df.groupby(get_letter_type, axis=1)
https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html
I'm trying to create a new column in a pandas dataframe and assign an integer value to it depending on conditions. An example would be:
if ((a > 1) & (a < 5)) give value 10, if ((a >= 5) & (a < 10)) give value 24, if ((a > 10) & (a < 5)) give value 57
where 'a' is another column in the dataframe.
Is there any way to do it with pandas/numpy without creating a function? I tried few different options but none worked.
Using pd.cut
df = pd.DataFrame({'a': [2, 3, 5, 7, 8, 10, 100]})
pd.cut(df.a, bins=[1, 5, 10, np.inf], labels=[10, 24, 57])
Out[282]:
0    10
1    10
2    10
3    24
4    24
5    24
6    57
Name: a, dtype: category
Categories (3, int64): [10 < 24 < 57]
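Note that pd.cut returns a categorical; if the new column should hold plain integers, a sketch is to cast the result:

df['new column'] = pd.cut(df.a, bins=[1, 5, 10, np.inf], labels=[10, 24, 57]).astype(int)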
I think any way of doing this without creating a function would be pretty roundabout, though it's actually not too bad with a function. Additionally, your conditions don't really mesh with each other, but I assume that's a typo. If your conditions are relatively simple, you can define your function on the fly to keep your code compact:
df['new column'] = df['a'].apply(lambda x: 10 if x < 5 else 24 if x < 10 else 57)
That can get a little hairy if your conditions are more complicated; it's easier to manage if you define the function more explicitly:
def f(x):
    if x > 1 and x < 5: return 10
    elif x >= 5 and x < 10: return 24
    else: return 57
df['new column'] = df['a'].apply(f)
If you really want to avoid functions, the best I can think of is creating a new list for your new column, populating it by iterating through your data, and then adding it to your dataframe:
newcol = []
for a in df['a'].values:
    if a > 1 and a < 5: newcol.append(10)
    elif a >= 5 and a < 10: newcol.append(24)
    else: newcol.append(57)
df['newcol'] = newcol