Pandas dataframe range check using between and rolling - python

For each row n, I need to check whether the values in rows n+1 to n+3 fall within the range (value of row n) - 0.5 to (value of row n) + 0.5, and then AND (&) the three results together.
A result
0 1.1 1 # 1.2 1.3 and 1.5 are in range of 0.6 to 1.6, ( 1 & 1 & 1)
1 1.2 0 # 1.3 and 1.5 are in range of 0.7 to 1.7, but not 2, hence ( 1 & 0 & 0)
2 1.3 0 # 1.5 and 1 are in range of 0.8 to 1.8, but not 2 ( 1 & 0 & 1)
3 1.5
4 2.0
5 1.0
6 2.5
7 1.8
8 4.0
9 4.2
10 4.5
11 3.9
df = pd.DataFrame( {
'A': [1.1,1.2,1.3,1.9,2,1,2.5,1.8,4,4.2,4.5,3.9]
} )
I have done some research on the site but could not find the exact syntax. I tried using the rolling function to take 3 rows, then the between function to check the range, and then AND the results. Could you please help?
s = pd.Series([1, 2, 3, 4])
s.rolling(2).between(s-1,s+1)
This raises the following error:
AttributeError: 'Rolling' object has no attribute 'between'

You can also get the result without rolling(), while still using .between(), by comparing shifted copies of the column against the range around each row:
df['result'] = (
(df['A'].shift(-1).between(df['A'] - 0.5, df['A'] + 0.5)) &
(df['A'].shift(-2).between(df['A'] - 0.5, df['A'] + 0.5)) &
(df['A'].shift(-3).between(df['A'] - 0.5, df['A'] + 0.5))
).astype(int)
Result:
print(df)
A result
0 1.1 1
1 1.2 0
2 1.3 0
3 1.5 0
4 2.0 0
5 1.0 0
6 2.5 0
7 1.8 0
8 4.0 1
9 4.2 0
10 4.5 0
11 3.9 0
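The three shifted comparisons above can also be written as a loop, which may be handy if the number of look-ahead rows or the tolerance changes. This is only a sketch: the helper name next_rows_in_range and its parameters n_rows and tol are made up for illustration, and the sample values follow the table in the question (1.5 at index 3).

```python
import pandas as pd

df = pd.DataFrame({'A': [1.1, 1.2, 1.3, 1.5, 2.0, 1.0, 2.5, 1.8, 4.0, 4.2, 4.5, 3.9]})

def next_rows_in_range(s, n_rows=3, tol=0.5):
    """Return 1 where the next n_rows values all lie within s +/- tol, else 0."""
    lower, upper = s - tol, s + tol
    ok = pd.Series(True, index=s.index)
    for k in range(1, n_rows + 1):
        # shift(-k) aligns row n+k with row n; missing rows (NaN) count as out of range
        ok &= s.shift(-k).between(lower, upper)
    return ok.astype(int)

df['result'] = next_rows_in_range(df['A'])
print(df)
```

Rows near the end of the frame get 0 because at least one of their look-ahead rows does not exist.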

Rolling windows tend to be quite slow in pandas. One quick solution can be to generate a dataframe with the values of the windows per row:
df_temp = pd.concat([df['A'].shift(i) for i in range(-1, 2)], axis=1)
df_temp
A A A
0 1.2 1.1 NaN
1 1.3 1.2 1.1
2 1.9 1.3 1.2
3 2.0 1.9 1.3
4 1.0 2.0 1.9
5 2.5 1.0 2.0
6 1.8 2.5 1.0
7 4.0 1.8 2.5
8 4.2 4.0 1.8
9 4.5 4.2 4.0
10 3.9 4.5 4.2
11 NaN 3.9 4.5
Then you can check per row if the value is in the desired range:
df['result'] = df_temp.apply(lambda x: (x - x.iloc[0]).between(-0.5, 0.5), axis=1).all(axis=1).astype(int)
A result
0 1.1 0
1 1.2 1
2 1.3 0
3 1.9 0
4 2.0 0
5 1.0 0
6 2.5 0
7 1.8 0
8 4.0 0
9 4.2 1
10 4.5 0
11 3.9 0
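Note that the window above uses offsets -1 to 1 and compares against the next row's value, so its output differs from the one asked for. A hedged adaptation of the same shifted-columns idea to the question's window (rows n+1 to n+3, compared against row n) might look like this. The names offsets, next_values and within are illustrative, the sample values follow the table in the question, and the absolute-difference test could differ from .between() at exact boundaries because of floating-point rounding.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.1, 1.2, 1.3, 1.5, 2.0, 1.0, 2.5, 1.8, 4.0, 4.2, 4.5, 3.9]})

offsets = [1, 2, 3]
# One column per look-ahead offset: column k holds the value of row n+k
next_values = pd.concat([df['A'].shift(-k) for k in offsets], axis=1, keys=offsets)

# Distance of each look-ahead value from the current row; NaN compares as False
within = next_values.sub(df['A'], axis=0).abs().le(0.5)
df['result'] = within.all(axis=1).astype(int)
print(df)
```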

Related

Pandas astype int not removing decimal points from values

I tried converting the values in some columns of a DataFrame of floats to integers by using round then astype. However, the values still contained decimal places. What is wrong with my code?
nums = np.arange(1, 11)
arr = np.array(nums)
arr = arr.reshape((2, 5))
df = pd.DataFrame(arr)
df += 0.1
df
Original df:
0 1 2 3 4
0 1.1 2.1 3.1 4.1 5.1
1 6.1 7.1 8.1 9.1 10.1
Code to round and then convert to int:
df.iloc[:, 2:] = df.iloc[:, 2:].round()
df.iloc[:, 2:] = df.iloc[:, 2:].astype(int)
df
Output:
0 1 2 3 4
0 1.1 2.1 3.0 4.0 5.0
1 6.1 7.1 8.0 9.0 10.0
Expected output:
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
The problem is that assigning through .iloc sets the values in place and does not change the column dtype. Assign to the columns by label instead:
l = df.columns[2:]
df[l] = df[l].astype(int)
df
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
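A minimal sketch of the label-based assignment, rebuilt on the question's sample frame so the resulting dtypes can be inspected. The names cols and dtype_kinds are illustrative only. Whether the .iloc assignment keeps the float dtype may depend on the pandas version, so only the label-based path is shown.

```python
import numpy as np
import pandas as pd

# Same 2x5 frame as in the question: 1.1, 2.1, ..., 10.1
df = pd.DataFrame(np.arange(1, 11).reshape(2, 5) + 0.1)

cols = df.columns[2:]
# Assigning by column label replaces the columns, so the new integer dtype is kept
df[cols] = df[cols].round().astype(int)

# 'f' = float column, 'i' = integer column
dtype_kinds = [dtype.kind for dtype in df.dtypes]
print(df.dtypes)
print(df)
```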
One way to solve that is to use .convert_dtypes()
df.iloc[:, 2:] = df.iloc[:, 2:].round()
df = df.convert_dtypes()
print(df)
output:
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
It will help you to coerce all dtypes of your dataframe to a better fit.
I had the same issue and was able to resolve it by converting the numbers to str and applying a function that cuts off the trailing zeros.
df['converted'] = df['floats'].astype(str)

def cut_zeros(text):
    # Drop a trailing '.0' so that '3.0' becomes '3'; other strings are left unchanged
    if text[-2:] == '.0':
        return text[:-2]
    return text

df['converted'] = df['converted'].apply(cut_zeros)

In pandas, how to assign the result of a groupby aggregate to the next group in the original df?

Using pandas, I would like to use groupby with an aggregate function, e.g. mean,
and then put the results back into the original dataframe, but in the next group rather than in the group itself. How can I do this in a vectorized way?
I have a pandas dataframe like this:
data = {'Group': ['A','A','B','B','B','B', 'C','C', 'D','D'],
'Value': [1.1,1.3,9.1,9.2,9.5,9.4,6.2,6.4,2.2,2.3]
}
df = pd.DataFrame(data, columns = ['Group','Value'])
print (df)
Group Value
0 A 1.1
1 A 1.3
2 B 9.1
3 B 9.2
4 B 9.5
5 B 9.4
6 C 6.2
7 C 6.4
8 D 2.2
9 D 2.3
I would like to get this, where each group has the mean value of the previous group.
Group Value
0 A NaN
1 A NaN
2 B 1.2
3 B 1.2
4 B 1.2
5 B 1.2
6 C 9.3
7 C 9.3
8 D 6.3
9 D 6.3
I tried this, but it gives the mean of each group without the shift to the next group:
df.groupby('Group')['Value'].transform('mean')
Easy, use map on a groupby result:
df['Value'] = df['Group'].map(df.groupby('Group')['Value'].mean().shift())
df
Group Value
0 A NaN
1 A NaN
2 B 1.2
3 B 1.2
4 B 1.2
5 B 1.2
6 C 9.3
7 C 9.3
8 D 6.3
9 D 6.3
How It Works
Get the mean
df.groupby('Group')['Value'].mean()
Group
A 1.20
B 9.30
C 6.30
D 2.25
Name: Value, dtype: float64
Shift it down by 1
df.groupby('Group')['Value'].mean().shift()
Group
A NaN
B 1.2
C 9.3
D 6.3
Name: Value, dtype: float64
Map it back.
df['Group'].map(df.groupby('Group')['Value'].mean().shift())
0 NaN
1 NaN
2 1.2
3 1.2
4 1.2
5 1.2
6 9.3
7 9.3
8 6.3
9 6.3
Name: Group, dtype: float64
You can compute the per-group mean with GroupBy.mean, shift it with pd.Series.shift, and rely on pandas index alignment when assigning it back:
df.set_index('Group').assign(value = df.groupby('Group').mean().shift()).reset_index()
Group Value value
0 A 1.1 NaN
1 A 1.3 NaN
2 B 9.1 1.2
3 B 9.2 1.2
4 B 9.5 1.2
5 B 9.4 1.2
6 C 6.2 9.3
7 C 6.4 9.3
8 D 2.2 6.3
9 D 2.3 6.3
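One caveat worth sketching: groupby sorts the group keys by default, so "previous group" in the answers above means the previous key in sorted order. If the groups should instead follow their order of appearance in the frame, passing sort=False is one option. The data and the column name PrevMean below are invented for illustration.

```python
import pandas as pd

# Groups deliberately appear in non-alphabetical order: B, A, C
df = pd.DataFrame({
    'Group': ['B', 'B', 'A', 'A', 'C'],
    'Value': [1.0, 3.0, 10.0, 20.0, 5.0],
})

# sort=False keeps the groups in order of first appearance before shifting
prev_mean = df.groupby('Group', sort=False)['Value'].mean().shift()

# Map each row's group label to the mean of the group that came before it
df['PrevMean'] = df['Group'].map(prev_mean)
print(df)
```

With the default sort=True, group A would come first and B would receive A's mean instead.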

Using groupby, shift and rolling in Pandas

I am trying to calculate rolling averages within groups. For this task I want a rolling average of the rows above, so I thought the easiest way would be to use shift() and then rolling(). The problem is that shift() brings in data from the previous group, which makes the first row in groups 2 and 3 incorrect. Column 'ma' should have NaN in rows 4 and 7. How can I achieve this?
import pandas as pd
df = pd.DataFrame(
{"Group": [1, 2, 3, 1, 2, 3, 1, 2, 3],
"Value": [2.5, 2.9, 1.6, 9.1, 5.7, 8.2, 4.9, 3.1, 7.5]
})
df = df.sort_values(['Group'])
df.reset_index(inplace=True)
df['ma'] = df.groupby('Group', as_index=False)['Value'].shift(1).rolling(3, min_periods=1).mean()
print(df)
I get this:
index Group Value ma
0 0 1 2.5 NaN
1 3 1 9.1 2.50
2 6 1 4.9 5.80
3 1 2 2.9 5.80
4 4 2 5.7 6.00
5 7 2 3.1 4.30
6 2 3 1.6 4.30
7 5 3 8.2 3.65
8 8 3 7.5 4.90
I tried answers from a couple of similar questions, but nothing seems to work.
If I understand the question correctly, then the solution you require can be achieved in 2 steps using the following:
df['sa'] = df.groupby('Group', as_index=False)['Value'].transform(lambda x: x.shift(1))
df['ma'] = df.groupby('Group', as_index=False)['sa'].transform(lambda x: x.rolling(3, min_periods=1).mean())
I got the below output, where 'ma' is the desired column
index Group Value sa ma
0 0 1 2.5 NaN NaN
1 3 1 9.1 2.5 2.5
2 6 1 4.9 9.1 5.8
3 1 2 2.9 NaN NaN
4 4 2 5.7 2.9 2.9
5 7 2 3.1 5.7 4.3
6 2 3 1.6 NaN NaN
7 5 3 8.2 1.6 1.6
8 8 3 7.5 8.2 4.9
Edit: Example with one groupby
def shift_ma(x):
    return x.shift(1).rolling(3, min_periods=1).mean()

df['ma'] = df.groupby('Group', as_index=False)['Value'].apply(shift_ma).reset_index(drop=True)
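As a possible alternative to apply followed by reset_index, transform returns a result aligned with the original index, so the frame does not need to be sorted by group first. This is a sketch rather than the answerer's method; the helper name shifted_rolling_mean and its window parameter are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "Value": [2.5, 2.9, 1.6, 9.1, 5.7, 8.2, 4.9, 3.1, 7.5],
})

def shifted_rolling_mean(values, window=3):
    # Shift inside the group so no value leaks in from another group,
    # then average over up to `window` previous rows
    return values.shift(1).rolling(window, min_periods=1).mean()

# transform keeps the original row order and index
df["ma"] = df.groupby("Group")["Value"].transform(shifted_rolling_mean)
print(df)
```

The first row of every group is NaN because it has no earlier row within its own group.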

Right way to update the data in a table?

I need to add three columns to a pandas dataframe, computed from existing data.
df
>>
n a b
0 3 1.2 1.4
1 2 2.8 3.8
2 3 2.3 2.0
3 3 1.7 5.7
4 2 6.9 4.9
5 1 3.9 19.0
6 9 2.3 8.3
7 5 8.5 3.1
8 18 6.7 7.0
9 10 5.6 6.4
I have done the following
import pandas
import numpy
def add_tests(add_df):
    new_tests = """
    (a+b)/n
    (a*b)/n
    ((a+b)/n)**-1
    """.split()
    rows = add_df.shape[0]
    cols = len(new_tests)
    U = pandas.DataFrame(numpy.empty([rows, cols]), columns=new_tests)
    add_df = pandas.concat([df, U], axis=1)
    for i, row in add_df.iterrows():
        # 1) good calculation:
        add_df['(a+b)/n'].loc[i] = (add_df['a'].loc[i] + add_df['b'].loc[i])/ add_df['n'].loc[i]
        # 2) good calculation (Both ways):
        add_df['(a*b)/n'].loc[i] = (row['a'] * row['b'])/ row['n']
        # 3) bad calculation
        add_df['((a+b)/n)**-1'].loc[i] = row['(a+b)/n'] ** -1
        pass
    return add_df
I get the following warning message:
df = add_tests(df)
df
>>
C:...\pandas\core\indexing.py:141: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
n a b (a+b)/n (a*b)/n ((a+b)/n)**-1
0 3 1.2 1.4 0.866667 0.560000 0.833333
1 2 2.8 3.8 3.300000 5.320000 0.588235
2 3 2.3 2.0 1.433333 1.533333 0.434783
3 3 1.7 5.7 2.466667 3.230000 0.178571
4 2 6.9 4.9 5.900000 16.905000 0.500000
5 1 3.9 19.0 22.900000 74.100000 0.052632
6 9 2.3 8.3 1.177778 2.121111 0.142857
7 5 8.5 3.1 2.320000 5.270000 0.263158
8 18 6.7 7.0 0.761111 2.605556 0.111111
9 10 5.6 6.4 1.200000 3.584000 0.666667
Obviously step 3 does not work properly ...
How to do it the right way?
Fun with eval
define tuples of temporary column names with formulas
create a \n separated string of formulas to pass to eval
use dictionary to make formulas into column names
ftups = [('aa', '(a+b)/n'), ('bb', '(a*b)/n'), ('cc', '((a+b)/n)**-1')]
forms = '\n'.join([' = '.join(tup) for tup in ftups])
fdict = dict(ftups)
df.eval(forms, inplace=False).rename(columns=fdict)
n a b (a+b)/n (a*b)/n ((a+b)/n)**-1
0 3 1.2 1.4 0.866667 0.560000 1.153846
1 2 2.8 3.8 3.300000 5.320000 0.303030
2 3 2.3 2.0 1.433333 1.533333 0.697674
3 3 1.7 5.7 2.466667 3.230000 0.405405
4 2 6.9 4.9 5.900000 16.905000 0.169492
5 1 3.9 19.0 22.900000 74.100000 0.043668
6 9 2.3 8.3 1.177778 2.121111 0.849057
7 5 8.5 3.1 2.320000 5.270000 0.431034
8 18 6.7 7.0 0.761111 2.605556 1.313869
9 10 5.6 6.4 1.200000 3.584000 0.833333
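A likely reason step 3 in the question misbehaves is that iterrows() hands back each row as a copy taken before step 1 writes its result, so row['(a+b)/n'] still holds whatever numpy.empty left there. Plain column arithmetic avoids both the loop and the chained assignment that triggers the warning. The sketch below uses only the first three rows of the question's data, and the name out is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'n': [3, 2, 3],
    'a': [1.2, 2.8, 2.3],
    'b': [1.4, 3.8, 2.0],
})

# Work on a copy so the original frame stays untouched
out = df.copy()
out['(a+b)/n'] = (out['a'] + out['b']) / out['n']
out['(a*b)/n'] = (out['a'] * out['b']) / out['n']
# The third column reuses the first computed column, which now exists
out['((a+b)/n)**-1'] = out['(a+b)/n'] ** -1
print(out)
```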

DataFrame manipulation function/method

I have a df that looks like
A B
1.2 1
1.3 1
1.1 1
1.0 0
1.0 0
1.5 1
1.6 1
0.7 1
1.1 0
Is there any function or method to calculate the cumsum piece by piece? I mean, for every run of consecutive rows where B is 1, calculate the cumsum of A, and give 0 where B is 0. For the above example the result should be
A B C
1.2 1 1.2
1.3 1 2.5
1.1 1 3.6
1.0 0 0
1.0 0 0
1.5 1 1.5
1.6 1 3.1
0.7 1 3.8
1.1 0 0
many thanks,
from io import StringIO
import pandas as pd
import numpy as np
text = """a b
1.2 1
1.3 1
1.1 1
1.0 0
1.0 0
1.5 1
1.6 1
0.7 1
1.1 0"""
df = pd.read_csv(StringIO(text), sep=r"\s+")  # delim_whitespace=True is deprecated in recent pandas
c = df["a"].cumsum()
mask = ~df["b"].astype(bool)
s = pd.Series(np.nan, index=df.index)
s[mask] = c[mask]
c -= s.ffill().fillna(0)
print(c)
output:
0 1.2
1 2.5
2 3.6
3 0.0
4 0.0
5 1.5
6 3.1
7 3.8
8 0.0
dtype: float64
Another approach (which may be slightly more general) is to groupby the consecutive entries in B.
First we enumerate the groups:
In [11]: (df.B != df.B.shift())
Out[11]:
0 True
1 False
2 False
3 True
4 False
5 True
6 False
7 False
8 True
Name: B, dtype: bool
In [12]: enumerate_B_changes = (df.B != df.B.shift()).astype(int).cumsum()
In [13]: enumerate_B_changes
Out[13]:
0 1
1 1
2 1
3 2
4 2
5 3
6 3
7 3
8 4
dtype: int64
And then we can groupby this Series, and cumsum:
In [14]: df.groupby(enumerate_B_changes)['A'].cumsum()
Out[14]:
0 1.2
1 2.5
2 3.6
3 1.0
4 2.0
5 1.5
6 3.1
7 3.8
8 1.1
dtype: float64
However we have to multiply by df['B'] in this case to account for 0s in column B.
In [15]: df.groupby(enumerate_B_changes)['A'].cumsum() * df['B']
Out[15]:
0 1.2
1 2.5
2 3.6
3 0.0
4 0.0
5 1.5
6 3.1
7 3.8
8 0.0
dtype: float64
If we wanted a different operation for entries that are neither 0 nor 1, we could do something different here.
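The steps above can be collected into one self-contained sketch. Instead of multiplying by B, it uses where() to zero out every row whose B is not 1, which is one way to handle values other than 0 and 1. The names run_id and C are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1.2, 1.3, 1.1, 1.0, 1.0, 1.5, 1.6, 0.7, 1.1],
    'B': [1, 1, 1, 0, 0, 1, 1, 1, 0],
})

# A new run starts whenever B differs from the previous row
run_id = df['B'].ne(df['B'].shift()).cumsum()

# Cumulative sum within each run, kept only where B == 1 and set to 0 elsewhere
df['C'] = df.groupby(run_id)['A'].cumsum().where(df['B'].eq(1), 0.0)
print(df)
```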
I'm not super versed in numpy, but the plain-Python code below should help.
It goes through the rows and, while b is 1, keeps adding to the cumulative sum; otherwise it resets it.
df = [
(1.2, 1),
(1.3, 1),
(1.1, 1),
(1.0, 0),
(1.0, 0),
(1.5, 1),
(1.6, 1),
(0.7, 1),
(1.1, 0)]
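c = []
cumsum = 0
for a, b in df:
    if b == 1:
        # Still inside a run of 1s: keep accumulating
        cumsum += a
        c.append(cumsum)
    else:
        # A 0 ends the run: reset the running total
        cumsum = 0
        c.append(0)
print(c)
And it outputs (Python 3 prints the shortest repr of each float; Python 2 showed values such as 3.6000000000000001 for the same floating-point numbers):
[1.2, 2.5, 3.6, 0, 0, 1.5, 3.1, 3.8, 0]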
