I have a pandas Series and want to compute the mean of each pair of neighbouring elements.
For example, [1 2 3 4 5 6 7 8 9] would give the result [1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5].
Try using rolling for this:
S = pd.Series(range(1,10))
S1 = S.rolling(2).mean().dropna()
Output:
1 1.5
2 2.5
3 3.5
4 4.5
5 5.5
6 6.5
7 7.5
8 8.5
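If you want the result to be 0-indexed like the list in the question, an equivalent variant (a sketch assuming pandas is imported as pd) averages each element with its predecessor via shift() and resets the index:

import pandas as pd

S = pd.Series(range(1, 10))
# average each element with its predecessor, drop the leading NaN, re-number from 0
S1 = ((S + S.shift(1)) / 2).dropna().reset_index(drop=True)
print(S1.tolist())  # [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]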
Related
Assuming you have a conventional pandas DataFrame:
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
I would like to calculate the mean between each pair of consecutive rows.
The above DataFrame looks like the following:
>>> df2
a b c
0 1 2 3
1 4 5 6
2 7 8 9
Is there an elegant solution to calculate the mean between every two consecutive rows in order to get the following output?
>>> df2_mean
a b c
0 1 2 3
1 2.5 3.5 4.5
2 4 5 6
3 5.5 6.5 7.5
4 7 8 9
Use DataFrame.rolling with mean and concat the result to the original:
df = pd.concat([df2.rolling(2).mean().dropna(how='all'), df2]).sort_index(ignore_index=True)
print(df)
a b c
0 1.0 2.0 3.0
1 2.5 3.5 4.5
2 4.0 5.0 6.0
3 5.5 6.5 7.5
4 7.0 8.0 9.0
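For reference, an equivalent sketch that interleaves the originals and the pairwise means explicitly via concat keys (assuming the same df2 and the usual pd/np imports), so the ordering does not depend on how duplicate index labels are sorted:

import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])

# row i of mid holds the mean of rows i and i+1
mid = ((df2 + df2.shift(-1)) / 2).dropna(how='all')

# key 0 marks original rows, key 1 the means; sorting by (row, key)
# places each mean directly after the row it starts from
out = (pd.concat([df2, mid], keys=[0, 1])
         .swaplevel()
         .sort_index()
         .reset_index(drop=True))
print(out)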
I have to consider the nth row and check whether each of rows n+1 to n+3 is in the range (nth row value) - 0.5 to (nth row value) + 0.5, then AND (&) the results of the 3 checks.
A result
0 1.1 1 # 1.2, 1.3 and 1.5 are in the range 0.6 to 1.6, (1 & 1 & 1)
1 1.2 0 # 1.3 and 1.5 are in the range 0.7 to 1.7, but not 2.0, hence (1 & 1 & 0)
2 1.3 0 # 1.5 and 1.0 are in the range 0.8 to 1.8, but not 2.0, (1 & 0 & 1)
3 1.5
4 2.0
5 1.0
6 2.5
7 1.8
8 4.0
9 4.2
10 4.5
11 3.9
df = pd.DataFrame({
    'A': [1.1, 1.2, 1.3, 1.9, 2, 1, 2.5, 1.8, 4, 4.2, 4.5, 3.9]
})
I have done some research on the site, but wasn't able to find the exact syntax. I tried using the rolling function to take 3 rows, then the between function to check the range, and then ANDing the results. Could you please help here?
s = pd.Series([1, 2, 3, 4])
s.rolling(2).between(s-1,s+1)
but I am getting this error:
AttributeError: 'Rolling' object has no attribute 'between'
You can also achieve the result without using rolling() while still using .between(), as follows:
df['result'] = (
    (df['A'].shift(-1).between(df['A'] - 0.5, df['A'] + 0.5)) &
    (df['A'].shift(-2).between(df['A'] - 0.5, df['A'] + 0.5)) &
    (df['A'].shift(-3).between(df['A'] - 0.5, df['A'] + 0.5))
).astype(int)
Result:
print(df)
A result
0 1.1 1
1 1.2 0
2 1.3 0
3 1.5 0
4 2.0 0
5 1.0 0
6 2.5 0
7 1.8 0
8 4.0 1
9 4.2 0
10 4.5 0
11 3.9 0
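If the look-ahead window size might change, the three hard-coded shifts can be generalized. Below is a sketch with a hypothetical helper (within_range is just an illustrative name, not a pandas function):

import numpy as np

# AND together `lookahead` forward-shifted range checks; comparisons against NaN come out False
def within_range(s, lookahead=3, tol=0.5):
    checks = [s.shift(-k).between(s - tol, s + tol) for k in range(1, lookahead + 1)]
    return np.logical_and.reduce(checks).astype(int)

df['result'] = within_range(df['A'])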
Rolling windows tend to be quite slow in pandas. One quick solution is to generate a DataFrame that holds the window values per row:
df_temp = pd.concat([df['A'].shift(i) for i in range(-1, 2)], axis=1)
df_temp
A A A
0 1.2 1.1 NaN
1 1.3 1.2 1.1
2 1.9 1.3 1.2
3 2.0 1.9 1.3
4 1.0 2.0 1.9
5 2.5 1.0 2.0
6 1.8 2.5 1.0
7 4.0 1.8 2.5
8 4.2 4.0 1.8
9 4.5 4.2 4.0
10 3.9 4.5 4.2
11 NaN 3.9 4.5
Then you can check per row if the value is in the desired range:
df['result'] = df_temp.apply(lambda x: (x - x.iloc[0]).between(-0.5, 0.5), axis=1).all(axis=1).astype(int)
A result
0 1.1 0
1 1.2 1
2 1.3 0
3 1.9 0
4 2.0 0
5 1.0 0
6 2.5 0
7 1.8 0
8 4.0 0
9 4.2 1
10 4.5 0
11 3.9 0
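If speed really matters, the forward-looking check the question describes can also be done with a NumPy sliding window (a sketch assuming NumPy >= 1.20 for sliding_window_view), avoiding both rolling() and apply():

import numpy as np

a = df['A'].to_numpy()
# pad the tail with NaN so the last rows still get a 4-wide window
padded = np.concatenate([a, np.full(3, np.nan)])
windows = np.lib.stride_tricks.sliding_window_view(padded, 4)  # row i -> a[i], a[i+1], a[i+2], a[i+3]
inside = np.abs(windows[:, 1:] - windows[:, :1]) <= 0.5        # comparisons against NaN are False
df['result'] = np.all(inside, axis=1).astype(int)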
I am trying to calculate rolling averages within groups. For this task I want a rolling average of the rows above, so I thought the easiest way would be to use shift() and then rolling(). The problem is that shift() shifts in data from the previous group, which makes the first row of groups 2 and 3 incorrect. Column 'ma' should have NaN in the first row of each of groups 2 and 3. How can I achieve this?
import pandas as pd
df = pd.DataFrame(
    {"Group": [1, 2, 3, 1, 2, 3, 1, 2, 3],
     "Value": [2.5, 2.9, 1.6, 9.1, 5.7, 8.2, 4.9, 3.1, 7.5]
    })
df = df.sort_values(['Group'])
df.reset_index(inplace=True)
df['ma'] = df.groupby('Group', as_index=False)['Value'].shift(1).rolling(3, min_periods=1).mean()
print(df)
I get this:
index Group Value ma
0 0 1 2.5 NaN
1 3 1 9.1 2.50
2 6 1 4.9 5.80
3 1 2 2.9 5.80
4 4 2 5.7 6.00
5 7 2 3.1 4.30
6 2 3 1.6 4.30
7 5 3 8.2 3.65
8 8 3 7.5 4.90
I tried answers from a couple of similar questions, but nothing seems to work.
If I understand the question correctly, the solution you require can be achieved in two steps using the following:
df['sa'] = df.groupby('Group', as_index=False)['Value'].transform(lambda x: x.shift(1))
df['ma'] = df.groupby('Group', as_index=False)['sa'].transform(lambda x: x.rolling(3, min_periods=1).mean())
I got the output below, where 'ma' is the desired column:
index Group Value sa ma
0 0 1 2.5 NaN NaN
1 3 1 9.1 2.5 2.5
2 6 1 4.9 9.1 5.8
3 1 2 2.9 NaN NaN
4 4 2 5.7 2.9 2.9
5 7 2 3.1 5.7 4.3
6 2 3 1.6 NaN NaN
7 5 3 8.2 1.6 1.6
8 8 3 7.5 8.2 4.9
Edit: Example with one groupby
def shift_ma(x):
    return x.shift(1).rolling(3, min_periods=1).mean()

df['ma'] = df.groupby('Group', as_index=False)['Value'].apply(shift_ma).reset_index(drop=True)
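A single groupby with transform should also work, since transform keeps the original row order (a sketch of the same idea):

df['ma'] = df.groupby('Group')['Value'].transform(
    lambda x: x.shift(1).rolling(3, min_periods=1).mean()
)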
I need to add three columns to a pandas DataFrame, computed from existing data.
df
>>
n a b
0 3 1.2 1.4
1 2 2.8 3.8
2 3 2.3 2.0
3 3 1.7 5.7
4 2 6.9 4.9
5 1 3.9 19.0
6 9 2.3 8.3
7 5 8.5 3.1
8 18 6.7 7.0
9 10 5.6 6.4
I have done the following:
import pandas
import numpy
def add_tests(add_df):
    new_tests = """
    (a+b)/n
    (a*b)/n
    ((a+b)/n)**-1
    """.split()
    rows = add_df.shape[0]
    cols = len(new_tests)
    U = pandas.DataFrame(numpy.empty([rows, cols]), columns=new_tests)
    add_df = pandas.concat([df, U], axis=1)
    for i, row in add_df.iterrows():
        # 1) good calculation:
        add_df['(a+b)/n'].loc[i] = (add_df['a'].loc[i] + add_df['b'].loc[i]) / add_df['n'].loc[i]
        # 2) good calculation (both ways):
        add_df['(a*b)/n'].loc[i] = (row['a'] * row['b']) / row['n']
        # 3) bad calculation
        add_df['((a+b)/n)**-1'].loc[i] = row['(a+b)/n'] ** -1
        pass
    return add_df
I get the following warning message:
df = add_tests(df)
df
>>
C:...\pandas\core\indexing.py:141: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
n a b (a+b)/n (a*b)/n ((a+b)/n)**-1
0 3 1.2 1.4 0.866667 0.560000 0.833333
1 2 2.8 3.8 3.300000 5.320000 0.588235
2 3 2.3 2.0 1.433333 1.533333 0.434783
3 3 1.7 5.7 2.466667 3.230000 0.178571
4 2 6.9 4.9 5.900000 16.905000 0.500000
5 1 3.9 19.0 22.900000 74.100000 0.052632
6 9 2.3 8.3 1.177778 2.121111 0.142857
7 5 8.5 3.1 2.320000 5.270000 0.263158
8 18 6.7 7.0 0.761111 2.605556 0.111111
9 10 5.6 6.4 1.200000 3.584000 0.666667
Obviously step 3 does not work properly ...
How can I do it the right way?
Fun with eval:
define tuples of (temporary column name, formula)
create a newline-separated string of assignments to pass to eval
use a dictionary to rename the temporary columns to the formula strings
ftups = [('aa', '(a+b)/n'), ('bb', '(a*b)/n'), ('cc', '((a+b)/n)**-1')]
forms = '\n'.join([' = '.join(tup) for tup in ftups])
fdict = dict(ftups)
df.eval(forms, inplace=False).rename(columns=fdict)
n a b (a+b)/n (a*b)/n ((a+b)/n)**-1
0 3 1.2 1.4 0.866667 0.560000 1.153846
1 2 2.8 3.8 3.300000 5.320000 0.303030
2 3 2.3 2.0 1.433333 1.533333 0.697674
3 3 1.7 5.7 2.466667 3.230000 0.405405
4 2 6.9 4.9 5.900000 16.905000 0.169492
5 1 3.9 19.0 22.900000 74.100000 0.043668
6 9 2.3 8.3 1.177778 2.121111 0.849057
7 5 8.5 3.1 2.320000 5.270000 0.431034
8 18 6.7 7.0 0.761111 2.605556 1.313869
9 10 5.6 6.4 1.200000 3.584000 0.833333
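If you do not need eval, plain vectorized column arithmetic produces the same columns and sidesteps both iterrows and the SettingWithCopyWarning (a sketch assuming the same df as in the question):

df['(a+b)/n'] = (df['a'] + df['b']) / df['n']
df['(a*b)/n'] = (df['a'] * df['b']) / df['n']
df['((a+b)/n)**-1'] = 1 / df['(a+b)/n']   # x ** -1 is just 1 / x here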
I would like to find the intersection of 2 pandas DataFrames according to the two columns 'x' and 'y' and combine them into 1 DataFrame. The data are:
df[1]:
x y id fa
0 4 5 9283222 3.1
1 4 5 9283222 3.1
2 10 12 9224221 3.2
3 4 5 9284332 1.2
4 6 1 51249 11.2
df[2]:
x y id fa
0 4 5 19283222 1.1
1 9 3 39224221 5.2
2 10 12 29284332 6.2
3 6 1 51242 5.2
4 6 2 51241 9.2
5 1 1 51241 9.2
The expected output is something like (can ignore index):
x y id fa
0 4 5 9283222 3.1
1 4 5 9283222 3.1
2 10 12 9224221 3.2
3 4 5 9284332 1.2
4 6 1 51249 11.2
0 4 5 19283222 1.1
2 10 12 29284332 6.2
3 6 1 51242 5.2
Thank you very much!
You can find the intersection by joining the x, y columns of df1 and df2; with that, you can filter df1 and df2 via an inner merge, and concatenating the two results with pd.concat gives what you need:
intersection = df1[['x', 'y']].merge(df2[['x', 'y']]).drop_duplicates()
pd.concat([df1.merge(intersection), df2.merge(intersection)])
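An equivalent membership-based sketch builds a MultiIndex from the key columns instead of a helper merge (pd.MultiIndex.from_frame requires pandas >= 0.24):

keys1 = pd.MultiIndex.from_frame(df1[['x', 'y']])
keys2 = pd.MultiIndex.from_frame(df2[['x', 'y']])
common = keys1.intersection(keys2)
result = pd.concat([df1[keys1.isin(list(common))], df2[keys2.isin(list(common))]])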
The simplest solution:
df1.columns.intersection(df2.columns)