Grouping Pandas dataframe based on conditions? - python

I am following the suggestions here (pandas create new column based on values from other columns) but am still getting an error. Basically, my Pandas dataframe has many columns, and I want to group the dataframe based on a new categorical column whose value depends on two existing columns (AMP, Time).
df
df['Time'] = pd.to_datetime(df['Time'])
#making sure Time column read from the csv file is time object
import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)
def f(row):
    if (row['AMP'] > 100) & (row['Time'] > day_1):
        val = 'new_positives'
    elif (row['AMP'] > 100) & (day_2 <= row['Time'] <= day_1):
        val = 'rec_positives'
    elif (row['AMP'] > 100 & row['Time'] < day_2):
        val = 'old_positives'
    else:
        val = 'old_negatives'
    return val
df['GRP'] = df.apply(f, axis=1) #this gives the following error:
TypeError: ("Cannot compare type 'Timestamp' with type 'date'", 'occurred at index 0')
df[(df['AMP'] > 100) & (df['Time'] > day_1)] #this works fine
df[(df['AMP'] > 100) & (day_2 <= df['Time'] <= day_1)] #this works fine
df[(df['AMP'] > 100) & (df['Time'] < day_2)] #this works fine
#df = df.groupby('GRP')
I am able to select the proper sub-dataframes based on the conditions specified above, but when I apply the above function to each row, I get the error. What is the correct approach to group the dataframe based on the conditions listed?
EDIT:
Unfortunately, I cannot provide a sample of my dataframe. However, here is a simple dataframe that gives an error of the same type:
import numpy as np
import pandas as pd
mydf = pd.DataFrame({'a': np.arange(10),
                     'b': np.random.rand(10)})
def f1(row):
    if row['a'] < 5 & row['b'] < 0.5:
        value = 'less'
    elif row['a'] < 5 & row['b'] > 0.5:
        value = 'more'
    else:
        value = 'same'
    return value
mydf['GRP'] = mydf.apply(f1, axis=1)
TypeError: ("unsupported operand type(s) for &: 'int' and 'float'", 'occurred at index 0')
EDIT 2:
As suggested below, enclosing the comparisons in parentheses did the trick for the cooked-up example. This problem is solved.
However, I am still getting the same error in my real example. By the way, if I were to use the column 'AMP' with perhaps another column in my table, then everything works and I am able to create df['GRP'] by applying the function f to each row. This shows the problem is related to using df['Time']. But then why am I able to select df[(df['AMP'] > 100) & (df['Time'] > day_1)]? Why would this work in this context, but not when the condition appears in a function?

Based on your error message and example, there are two things to fix. One is to adjust parentheses for operator precedence in your final elif statement. The other is to avoid mixing datetime.date and Timestamp objects.
Fix 1: change this:
elif (row['AMP'] > 100 & row['Time'] < day_2):
to this:
elif (row['AMP'] > 100) & (row['Time'] < day_2):
These two lines are different because the bitwise & operator takes precedence over the < and > comparison operators, so Python attempts to evaluate 100 & row['Time']. A full list of Python operator precedence is here: https://docs.python.org/3/reference/expressions.html#operator-precedence
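For illustration, here is a minimal sketch of how that precedence plays out with plain scalars; it mirrors the error in the cooked-up example above, where row['b'] is a float:
a, b = 3, 0.4
# Without parentheses, & binds tighter than <, so the expression
#   a < 5 & b < 0.5   is parsed as   a < (5 & b) < 0.5
# and with b a float this raises:
#   TypeError: unsupported operand type(s) for &: 'int' and 'float'
print((a < 5) & (b < 0.5))  # the intended comparison -> True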
Fix 2: Change these 3 lines:
import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)
to these 2 lines:
day_1 = pd.to_datetime('today')
day_2 = day_1 - pd.DateOffset(days=1)
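Putting both fixes together, a minimal sketch of the corrected function might look like this (assuming df['Time'] has already been converted with pd.to_datetime, as in the question):
import pandas as pd

day_1 = pd.to_datetime('today')   # a Timestamp; depending on the pandas version it may
                                  # include the current time, and .normalize() would drop it
day_2 = day_1 - pd.DateOffset(days=1)

def f(row):
    if (row['AMP'] > 100) & (row['Time'] > day_1):
        val = 'new_positives'
    elif (row['AMP'] > 100) & (day_2 <= row['Time'] <= day_1):
        val = 'rec_positives'
    elif (row['AMP'] > 100) & (row['Time'] < day_2):   # parentheses added (Fix 1)
        val = 'old_positives'
    else:
        val = 'old_negatives'
    return val

df['GRP'] = df.apply(f, axis=1)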

Some parentheses need to be added in the if-statements:
import numpy as np
import pandas as pd
mydf = pd.DataFrame({'a': np.arange(10),
                     'b': np.random.rand(10)})
def f1(row):
    if (row['a'] < 5) & (row['b'] < 0.5):
        value = 'less'
    elif (row['a'] < 5) & (row['b'] > 0.5):
        value = 'more'
    else:
        value = 'same'
    return value
mydf['GRP'] = mydf.apply(f1, axis=1)

If you don't need to use a custom function, then you can use multiple masks (somewhat similar to this SO post)
For the Time column, I used the code below. It may be that the Time column values you were comparing did not have the required dtype (this is just my guess).
import datetime as dt
mydf['Time'] = pd.date_range(start='10/14/2018', end=dt.date.today())
day_1 = pd.to_datetime(dt.date.today())
day_2 = day_1 - pd.DateOffset(days = 1)
Here is the raw data
mydf
a b Time
0 0 0.550149 2018-10-14
1 1 0.889209 2018-10-15
2 2 0.845740 2018-10-16
3 3 0.340310 2018-10-17
4 4 0.613575 2018-10-18
5 5 0.229802 2018-10-19
6 6 0.013724 2018-10-20
7 7 0.810413 2018-10-21
8 8 0.897373 2018-10-22
9 9 0.175050 2018-10-23
One approach involves using masks for columns
# Append new column
mydf['GRP'] = 'same'
# Use masks to change values in new column
mydf.loc[(mydf['a'] < 5) & (mydf['b'] < 0.5) & (mydf['Time'] < day_2), 'GRP'] = 'less'
mydf.loc[(mydf['a'] < 5) & (mydf['b'] > 0.5) & (mydf['Time'] > day_1), 'GRP'] = 'more'
mydf
a b Time GRP
0 0 0.550149 2018-10-14 same
1 1 0.889209 2018-10-15 same
2 2 0.845740 2018-10-16 same
3 3 0.340310 2018-10-17 less
4 4 0.613575 2018-10-18 same
5 5 0.229802 2018-10-19 same
6 6 0.013724 2018-10-20 same
7 7 0.810413 2018-10-21 same
8 8 0.897373 2018-10-22 same
9 9 0.175050 2018-10-23 same
Another approach is to set a, b and Time as a multi-index and use index-based masks to set values
mydf.set_index(['a','b','Time'], inplace=True)
# Get Index level values
a = mydf.index.get_level_values('a')
b = mydf.index.get_level_values('b')
t = mydf.index.get_level_values('Time')
# Apply index-based masks
mydf['GRP'] = 'same'
mydf.loc[(a < 5) & (b < 0.5) & (t < day_2), 'GRP'] = 'less'
mydf.loc[(a < 5) & (b > 0.5) & (t > day_1), 'GRP'] = 'more'
mydf.reset_index(drop=False, inplace=True)
mydf
a b Time GRP
0 0 0.550149 2018-10-14 same
1 1 0.889209 2018-10-15 same
2 2 0.845740 2018-10-16 same
3 3 0.340310 2018-10-17 less
4 4 0.613575 2018-10-18 same
5 5 0.229802 2018-10-19 same
6 6 0.013724 2018-10-20 same
7 7 0.810413 2018-10-21 same
8 8 0.897373 2018-10-22 same
9 9 0.175050 2018-10-23 same
Source to filter by datetime and create a range of dates.

There is an excellent example here; it is very useful, and you could apply filters after groupby. It is a way to do this without using masks.
def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'

In [6]: grouped = df.groupby(get_letter_type, axis=1)
https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html

Related

Filter all Dataframe values >= -1 & <= 1, and add those values to another array

I would like to know how to achieve this. Given a dataframe, I want to create an array containing all values between -1 and 1 - just the values, I don't care about the day or index.
Here is the code:
import pandas as pd
import numpy as np
import random
data = [[round(random.uniform(1,100),2) for i in range(7)] for i in range(10)]
header = ['Lunes', 'Martes', 'Miércoles', 'Jueves', 'Viernes', 'Sábado', 'Domingo']
df = pd.DataFrame(data, columns = header)
mean = df.mean()
std = df.std()
df_normalizado = (df-mean)/std
Lunes Martes Miércoles Jueves Viernes Sábado Domingo
0 -0.250799 1.001706 -0.491738 0.444629 -0.296997 -0.670781 -1.554641
1 -0.868792 -0.100689 -0.359056 1.282681 1.352212 1.176829 -1.374482
2 -0.614918 1.187862 1.398010 1.037513 -1.149555 -0.834707 0.143520
3 -0.319758 1.113691 -0.719597 -1.392089 -0.591716 0.943564 -1.163994
4 -0.718137 -1.300041 1.267097 -0.797168 0.053323 1.187264 0.078008
5 -0.883286 -0.821076 -0.671478 1.268079 0.002583 -0.897651 1.096177
6 1.933040 -0.534570 -1.142057 -0.262689 1.417233 0.851335 0.780141
7 -0.433957 -0.575776 1.406855 0.248020 -1.113399 -0.178332 0.497165
8 1.357213 -1.070254 -0.882708 -1.133679 -0.863344 -1.613941 0.491402
9 0.799394 1.099147 0.194671 -0.695298 1.189661 0.036420 1.006704
To clarify, see the screenshot in the original post (image omitted here). Thank you, community!
Since just an array is needed, grab values from the DataFrame and use normal boolean indexing:
a = df.values
print(a[(-1 <= a) & (a <= 1)])
Output:
[-0.250799 -0.491738 0.444629 -0.296997 -0.670781 -0.868792 -0.100689
-0.359056 -0.614918 -0.834707 0.14352 -0.319758 -0.719597 -0.591716
0.943564 -0.718137 -0.797168 0.053323 0.078008 -0.883286 -0.821076
-0.671478 0.002583 -0.897651 -0.53457 -0.262689 0.851335 0.780141
-0.433957 -0.575776 0.24802 -0.178332 0.497165 -0.882708 -0.863344
0.491402 0.799394 0.194671 -0.695298 0.03642 ]
Pandas also offers the query function. I hope this is helpful for solving your issue:
df.query("Lunes >= -1 and Lunes <= 1 and
Martes >= -1 and Martes <= 1 and
Miércoles >= -1 and Miércoles <= 1 and
Jueves >= -1 and Jueves <= 1 and
Viernes >= -1 and Viernes <= 1 and
Sábado >= -1 and Sábado <= 1 and
Domingo >= -1 and Domingo <=1")
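If all columns share the same bounds, the query string could also be built programmatically instead of written out by hand (a sketch, assuming plain single-level column names):
cols = ['Lunes', 'Martes', 'Miércoles', 'Jueves', 'Viernes', 'Sábado', 'Domingo']
expr = " and ".join(f"{c} >= -1 and {c} <= 1" for c in cols)
filtered = df.query(expr)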

Pandas - New column based on the value of another column N rows back, when N is stored in a column

I have a pandas dataframe with example data:
idx  price  lookback
0    5
1    7      1
2    4      2
3    3      1
4    7      3
5    6      1
Lookback can be positive or negative, but I want to take its absolute value as the number of rows back to take the value from.
I am trying to create a new column that contains the value of price from lookback + 1 rows ago, for example:
idx price lookback lb_price
0 5 NaN NaN
1 7 1 NaN
2 4 2 NaN
3 3 1 7
4 7 3 5
5 6 1 3
I started with what felt like the most obvious way, this did not work:
df['sbc'] = df['price'].shift(dataframe['lb'].abs() + 1)
I then tried using a lambda, this did not work but I probably did it wrong:
sbc = lambda c, x: pd.Series(zip(*[c.shift(x+1)]))
df['sbc'] = sbc(df['price'], df['lb'].abs())
I also tried a loop (which was extremely slow, but worked) but I am sure there is a better way:
lookback = np.nan
for i in range(len(df)):
    if df.loc[i, 'lookback']:
        if not np.isnan(df.loc[i, 'lookback']):
            lookback = abs(int(df.loc[i, 'lookback']))
    if not np.isnan(lookback) and (lookback + 1) < i:
        df.loc[i, 'lb_price'] = df.loc[i - (lookback + 1), 'price']
I have seen examples using lambda, df.apply, and perhaps Series.map but they are not clear to me as I am quite a novice with Python and Pandas.
I am looking for the fastest way I can do this, if there is a way without using a loop.
Also, for what it's worth, I plan to use this computed column to create yet another column, which I can do as follows:
df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']
But if I can combine all of it into one really efficient way of doing it, that would be ideal.
Solution!
Several provided solutions worked great (thank you!) but all needed some small tweaks to deal with my potential for negative numbers and the fact that it is lookback + 1, not - 1, so I felt it was prudent to post my modifications here.
All of them were significantly faster than my original loop which took 5m 26s to process my dataset.
I marked the one I observed to be the fastest as accepted, as improving the speed of my loop was the main objective.
Edited Solutions
From Manas Sambare - 41 seconds
df['lb_price'] = df.apply(
    lambda x: df['price'][x.name - (abs(int(x['lookback'])) + 1)]
    if not np.isnan(x['lookback']) and x.name >= (abs(int(x['lookback'])) + 1)
    else np.nan,
    axis=1)
From mannh - 43 seconds
def get_lb_price(row, df):
    if not np.isnan(row['lookback']):
        lb_idx = row.name - (abs(int(row['lookback'])) + 1)
        if lb_idx >= 0:
            return df.loc[lb_idx, 'price']
        else:
            return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
From Bill - 18 seconds
lookup_idxs = df.index.values - (abs(df['lookback'].values) + 1)
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df['price'].to_numpy()[lookup_idxs[valid_lookups].astype(int)]
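Since all three versions leave NaN wherever there is no valid lookup, the follow-up column mentioned in the question can simply be chained on afterwards:
df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']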
By getting the row's index inside of the df.apply() call using row.name, you can generate the 'lb_price' data relative to which row you are currently on.
%time
df.apply(
    lambda x: df['price'][x.name - int(x['lookback'] + 1)]
    if not np.isnan(x['lookback']) and x.name >= x['lookback'] + 1
    else np.nan,
    axis=1)
# > CPU times: user 2 µs, sys: 0 ns, total: 2 µs
# > Wall time: 4.05 µs
FYI: There is an error in your example as idx[5]'s lb_price should be 3 and not 7.
Here is an example which uses a regular function
def get_lb_price(row, df):
    lb_idx = row.name - abs(row['lookback']) - 1
    if lb_idx >= 0:
        return df.loc[lb_idx, 'price']
    else:
        return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
Here's a vectorized version (i.e. no for loops) using numpy array indexing.
lookup_idxs = df.index.values - df['lookback'].values - 1
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df.price.to_numpy()[lookup_idxs[valid_lookups].astype(int)]
print(df)
Output:
price lookback lb_price
idx
0 5 NaN NaN
1 7 1.0 NaN
2 4 2.0 NaN
3 3 1.0 7.0
4 7 3.0 5.0
5 6 1.0 3.0
This solution loops over the values of the column lookback and calculates the index of the wanted value in the column price, storing the results in an array.
The rule is that the lookback value has to be a number and that the wanted index must not be smaller than 0.
new = np.zeros(df.shape[0])
price = df.price.values
for i, lookback in enumerate(df.lookback.values):
    # lookback has to be a number and the index is not allowed to be less than 0
    # 0 < i - lookback is equivalent to 0 <= i - (lookback + 1)
    if not np.isnan(lookback) and 0 < i - lookback:
        new[i] = price[int(i - (lookback + 1))]
    else:
        new[i] = np.nan
df['lb_price'] = new

Iterate through df rows and sum values of two columns separately until condition is met on one of those columns

I am definitely still learning Python and have tried countless approaches, but can't figure this one out.
I have a dataframe with 2 columns, call them A and B. I need to return a df that sums the row values of each of these two columns independently until the running sum of A exceeds some threshold; for this example, let's say 10. So far I have been trying to use iterrows() and can segment based on whether A >= 10, but can't seem to solve the summation of rows until the threshold is met. The resultant df must be exhaustive even if the final A values do not meet the conditional threshold - see the final row of the desired output.
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
A B
0 20 16
1 10 5
2 3 2
3 1 1
4 12 10
5 9 7
6 6 6
7 5 2
Desired result:
A B
0 20 16
1 10 5
2 16 13
3 15 13
4 5 2
Thank you in advance, much time spent, and assistance is much appreciated!!!
Cheers
I rarely write long loops for pandas, but I didn't see a way to do this with a pandas method. Try this horrible loop :( :
The variable I created, t, essentially tracks the cumulative sum to check whether it exceeds n (which we have set to 10). Then we decide whether to use t, the cumulative sum, or i, the value in the dataframe for any given row (j and u are just there in parallel to do the same thing for column B).
There are a few conditions, so there are some elif statements, and the last row behaves differently the way I have set it up, so I had to add some separate logic for it with the last if -- otherwise the last value wasn't getting appended:
import pandas as pd
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
a, b = [], []
t, u, count = 0, 0, 0
n = 10
for (i, j) in zip(df1['A'], df1['B']):
    count += 1
    if i < n and t >= n:
        a.append(t)
        b.append(u)
        t = i
        u = j
    elif 0 < t < n:
        t += i
        u += j
    elif i < n and t == 0:
        t += i
        u += j
    else:
        t = 0
        u = 0
        a.append(i)
        b.append(j)
    if count == len(df1['A']):
        if t == i or t == 0:
            a.append(i)
            b.append(j)
        elif t > 0 and t != i:
            t += i
            u += j
            a.append(t)
            b.append(u)

df2 = pd.DataFrame({'A': a, 'B': b})
df2
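For the sample df1 above, this should reproduce the desired result from the question:
    A   B
0  20  16
1  10   5
2  16  13
3  15  13
4   5   2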
Here's one that works that's shorter:
import pandas as pd
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df2 = pd.DataFrame()
index = 0
while index < df1.size / 2:
    if df1.iloc[index]['A'] >= 10:
        a = df1.iloc[index]['A']
        b = df1.iloc[index]['B']
        temp_df = pd.DataFrame(data=[[a, b]], columns=['A', 'B'])
        df2 = df2.append(temp_df, ignore_index=True)
        index += 1
    else:
        a_sum = 0
        b_sum = 0
        while a_sum < 10 and index < df1.size / 2:
            a_sum += df1.iloc[index]['A']
            b_sum += df1.iloc[index]['B']
            index += 1
        if a_sum >= 10:
            temp_df = pd.DataFrame(data=[[a_sum, b_sum]], columns=['A', 'B'])
            df2 = df2.append(temp_df, ignore_index=True)
        else:
            a = df1.iloc[index-1]['A']
            b = df1.iloc[index-1]['B']
            temp_df = pd.DataFrame(data=[[a, b]], columns=['A', 'B'])
            df2 = df2.append(temp_df, ignore_index=True)
The key is to keep track of where you are in the DataFrame and track the sums. Don't be afraid to use variables.
In Pandas, use iloc to access each row by index. Make sure you don't go out of the DataFrame by checking the size. df.size returns the number of elements, so it will multiply the rows by the columns. This is why I divided the size by the number of columns, to get the actual number of rows.
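As a small aside, len(df1) (or df1.shape[0]) returns the number of rows directly, so the loop condition could equally be written without the division:
num_rows = len(df1)   # same as df1.size / 2 here, since there are two columns
while index < num_rows:
    ...               # loop body unchanged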

Column operation conditioned on previous row

I presume similar questions exist, but could not locate them. I have Pandas 0.19.2 installed. I have a large dataframe, and for each row value I want to carry over the previous row's value for the same column based on some logical condition.
Below is a brute-force double for loop solution for a small example. What is the most efficient way to implement this? Is it possible to solve this in a vectorised manner?
import pandas as pd
import numpy as np
np.random.seed(10)
df = pd.DataFrame(np.random.uniform(low=-0.2, high=0.2, size=(10,2) ))
print(df)
for col in df.columns:
    prev = None
    for i, r in df.iterrows():
        if prev is not None:
            if (df[col].loc[i] <= prev * 1.5) and (df[col].loc[i] >= prev * 0.5):
                df[col].loc[i] = prev
        prev = df[col].loc[i]
print(df)
Output:
0 1
0 0.108528 -0.191699
1 0.053459 0.099522
2 -0.000597 -0.110081
3 -0.120775 0.104212
4 -0.132356 -0.164664
5 0.074144 0.181357
6 -0.198421 0.004877
7 0.125048 0.045010
8 0.125048 -0.083250
9 0.125048 0.085830
EDIT: Please note one value can be carried over multiple times, so long as it satisfies the logical condition.
prev = df.shift()
replace_mask = (0.5 * prev <= df) & (df <= 1.5 * prev)
df = df.where(~replace_mask, prev)
I came up with this:
keep_going = True
while keep_going:
    df = df.mask((df.diff(1) / df.shift(1) < 0.5) & (df.diff(1) / df.shift(1) > -0.5) & (df.diff(1) / df.shift(1) != 0)).ffill()
    trimming_to_do = ((df.diff(1) / df.shift(1) < 0.5) & (df.diff(1) / df.shift(1) > -0.5) & (df.diff(1) / df.shift(1) != 0)).values.any()
    if not trimming_to_do:
        keep_going = False
which gives the desired result (at least for this case):
print(df)
0 1
0 0.108528 -0.191699
1 0.053459 0.099522
2 -0.000597 -0.110081
3 -0.120775 0.104212
4 -0.120775 -0.164664
5 0.074144 0.181357
6 -0.198421 0.004877
7 0.125048 0.045010
8 0.125048 -0.083250
9 0.125048 0.085830

Pandas DataFrame: Assign integer to a new column if fulfilling multiple conditions

I'm trying to create a new column in a pandas dataframe to then assign an integer value depending on conditional formatting. An example would be:
if ((a > 1) & (a < 5)) give value 10, if ((a >= 5) & (a < 10)) give value 24, if ((a > 10) & (a < 5)) give value 57
where 'a' is another column in the dataframe.
Is there any way to do it with pandas/numpy without creating a function? I tried few different options but none worked.
Using pd.cut
df = pd.DataFrame({'a': [2, 3, 5, 7, 8, 10, 100]})
pd.cut(df.a, bins=[1, 5, 10, np.inf], labels=[10, 24, 57])
Out[282]:
0 10
1 10
2 10
3 24
4 24
5 24
6 57
Name: a, dtype: category
Categories (3, int64): [10 < 24 < 57]
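Note that by default pd.cut treats the bins as right-closed intervals, i.e. (1, 5], (5, 10], (10, inf]. If the boundaries need to be left-closed to match conditions like a >= 5 and a < 10, the right argument can be flipped (a sketch):
pd.cut(df.a, bins=[1, 5, 10, np.inf], labels=[10, 24, 57], right=False)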
I think any way of doing this without creating a function would be pretty roundabout, though it's actually not too bad with a function. Additionally, your conditions don't really mesh with each other, but I assume that's a typo. If your conditions are relatively simple, you can define your function on the fly to keep your code compact:
df['new column'] = df['a'].apply(lambda x: 10 if x < 5 else 24 if x < 10 else 57)
That can get a little hairy if your conditions are more complicated - it's easier to manage if you define the function more explicitly:
def f(x):
    if x > 1 and x < 5: return 10
    elif x >= 5 and x < 10: return 24
    else: return 57

df['new column'] = df['a'].apply(f)
If you really want to avoid functions, the best I can think of is creating a new list for your new column, populating it by iterating through your data, and then adding it to your dataframe:
newcol = []
for a in df['a'].values:
    if a > 1 and a < 5: newcol.append(10)
    elif a >= 5 and a < 10: newcol.append(24)
    else: newcol.append(57)

df['newcol'] = newcol
