How to sum all values with index greater than X? - python

Let's say I have this series:
>>> s = pd.Series({1:10,2:5,3:8,4:12,5:7,6:3})
>>> s
1    10
2     5
3     8
4    12
5     7
6     3
I want to sum all the values for which the index is greater than X. So if e.g. X = 3, I want to get this:
>>> X = 3
>>> s.some_magic(X)
1     10
2      5
3      8
>3    22
I managed to do it in this rather clumsy way:
lt = s[s.index.values <= 3]
gt = s[s.index.values > 3]
gt_s = pd.Series({'>3':sum(gt)})
lt.append(gt_s)
and got the desired result, but I believe there should be an easier and more elegant way... or is there?

import numpy as np

s.groupby(np.where(s.index > 3, '>3', s.index)).sum()
Or,
s.groupby(s.index.to_series().mask(s.index > 3, '>3')).sum()
Out:
1     10
2      5
3      8
>3    22
dtype: int64

Here's a possible solution:
import pandas as pd
s = pd.Series({1: 10, 2: 5, 3: 8, 4: 12, 5: 7, 6: 3})
iv = s.index.values
print(s[iv <= 3].append(pd.Series({'>3': s[iv > 3].sum()})))
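Note that Series.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same idea is written with pd.concat; a minimal sketch:

import pandas as pd

s = pd.Series({1: 10, 2: 5, 3: 8, 4: 12, 5: 7, 6: 3})
X = 3

# keep the rows with index <= X, collapse the rest into a single '>X' row
result = pd.concat([s[s.index <= X], pd.Series({f'>{X}': s[s.index > X].sum()})])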

Related

pandas for loop for running average does not work

I am trying to make a kind of running average: out of 90 rows, every 3 values in column A should produce an average, and that average should be written into column B for those same rows.
For example:
From this:
   A  B
   2  0
   3  0
   4  0
   7  0
   9  0
   8  0
to this:
   A  B
   2  3
   3  3
   4  3
   7  8
   9  8
   8  8
I tried running this code:
x = 0
for i in df['A']:
    if x < 90:
        y = (df['A'][x] + df['A'][x + 1] + df['A'][x + 2]) / 3
        df['B'][x] = y
        df['B'][x + 1] = y
        df['B'][x + 2] = y
        x = x + 3
        print(y)
It does print the correct y, but it does not change B.
I know there is a better way to do it, and if anyone knows one, it would be great if they shared it. But the more important thing for me is to understand why what I wrote has no effect on the df.
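The likely reason B never changes is chained indexing: df['B'][x] = y first evaluates df['B'] and then assigns into that intermediate object, which pandas may treat as a copy of the original data rather than a view (this is what SettingWithCopyWarning is about, and under copy-on-write in newer pandas, chained assignment never propagates back). A single .loc call writes into the frame itself; a minimal sketch of the loop with that one change (assuming the row count is a multiple of 3):

x = 0
while x < len(df):
    y = (df.loc[x, 'A'] + df.loc[x + 1, 'A'] + df.loc[x + 2, 'A']) / 3
    df.loc[x:x + 2, 'B'] = y  # one .loc assignment writes into df itself
    x = x + 3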
You could group by the index divided by 3, then use transform to compute the mean of those values and assign to B:
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0, 0, 0, 0, 0, 0]})
df['B'] = df.groupby(df.index // 3)['A'].transform('mean')
Output:
   A  B
0  2  3
1  3  3
2  4  3
3  7  8
4  9  8
5  8  8
Note that this relies on the index being of the form 0,1,2,3,4,.... If that is not the case, you could either reset the index (df.reset_index(drop=True)) or use np.arange(df.shape[0]) instead, i.e.
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')
Another option is an explicit batch loop with .loc:

i = 0
batch_size = 3
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8, 9, 10], 'B': [-1] * 8})
while i < len(df):
    j = min(i + batch_size - 1, len(df) - 1)
    avg = sum(df.loc[i:j, 'A']) / (j - i + 1)
    df.loc[i:j, 'B'] = [avg] * (j - i + 1)
    i += batch_size
df

In the corner case when len(df) % batch_size != 0, this takes the average of the leftover rows on their own.
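When len(df) is an exact multiple of batch_size, the same block means can also be computed without any loop via a NumPy reshape (a sketch, with my own variable names):

import numpy as np
import pandas as pd

batch_size = 3
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0] * 6})
# reshape into (n_batches, batch_size) and average each block
means = df['A'].to_numpy().reshape(-1, batch_size).mean(axis=1)
df['B'] = np.repeat(means, batch_size)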

Compare two pandas DataFrames in the most efficient way

Let's consider two pandas dataframes:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
I want to do the following:
If df[1] > check_df[1] or df[2] > check_df[1] or df[3] > check_df[1], then we assign 1 to df, and 0 otherwise.
If df[2] > check_df[2] or df[3] > check_df[2] or df[4] > check_df[2], then we assign 1 to df, and 0 otherwise.
We apply the same algorithm until the end of the DataFrame.
My primitive code is the following:
df_copy = df.copy()
for i in range(len(df) - 3):
    moving_df = df.iloc[i:i+3]
    if (moving_df > check_df.iloc[i]).any()[0]:
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
df_copy
    0
0  -1
1   1
2  -1
3   1
4   1
5  -1
6   3
7   6
8   7
Could you please give me advice on whether there is any way to do this without a loop?
IIUC, this is easily done with a rolling max: rolling(N).max() looks at the current and previous N-1 rows, so shifting by 1-N realigns it to cover the current and next N-1 rows (here N = 3, the window size):

N = 3
df['out'] = np.where(df[0].rolling(N, min_periods=1).max().shift(1-N).gt(check_df[0]),
                     1, -1)
output:
   0  out
0  1   -1
1  2    1
2  3   -1
3  2    1
4  5    1
5  4   -1
6  3    1
7  6   -1
8  7   -1
to keep the last items as is:
m = df[0].rolling(N).max().shift(1-N)
df['out'] = np.where(m.gt(check_df[0]), 1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
   0  out
0  1   -1
1  2    1
2  3   -1
3  2    1
4  5    1
5  4   -1
6  3    1
7  6    6
8  7    7
Although @mozway has already provided a very smart solution, I would like to share my approach as well, which was inspired by this post.
You could create your own object that compares a series with a rolling series. The comparison can be performed with the typical operators, i.e. >, < or ==. If at least one comparison within a window holds, the object returns a pre-defined value (given in the list returns_tf, where the first element is returned if the comparison is true and the second if it is false).
Possible Code:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
class RollingComparison:
    def __init__(self, comparing_series: pd.Series, rolling_series: pd.Series, window: int):
        self.comparing_series = comparing_series.values[:-1 * window]
        self.rolling_series = rolling_series.values
        self.window = window

    def rolling_window_mask(self, option: str = "smaller"):
        # build a (len - window + 1, window) strided view over the rolling
        # series, then drop the last window
        shape = self.rolling_series.shape[:-1] + (self.rolling_series.shape[-1] - self.window + 1, self.window)
        strides = self.rolling_series.strides + (self.rolling_series.strides[-1],)
        rolling_window = np.lib.stride_tricks.as_strided(self.rolling_series, shape=shape, strides=strides)[:-1]
        rolling_window_mask = (
            self.comparing_series.reshape(-1, 1) < rolling_window if option == "smaller" else (
                self.comparing_series.reshape(-1, 1) > rolling_window if option == "greater"
                else self.comparing_series.reshape(-1, 1) == rolling_window
            )
        )
        return rolling_window_mask.any(axis=1)

    def assign(self, option: str = "rolling", returns_tf: list = [1, -1]):
        mask = self.rolling_window_mask(option)
        # rows without a full comparison keep the original trailing values
        return np.concatenate((np.where(mask, returns_tf[0], returns_tf[1]), self.rolling_series[-1 * self.window:]))
The assignments can be achieved as follows:
roller = RollingComparison(check_df[0], df[0], 3)
check_df["rolling_smaller_checking"] = roller.assign(option="smaller")
check_df["rolling_greater_checking"] = roller.assign(option="greater")
check_df["rolling_equals_checking"] = roller.assign(option="equal")
Output (the column rolling_smaller_checking equals your desired output):
   0  rolling_smaller_checking  rolling_greater_checking  rolling_equals_checking
0  3                        -1                         1                        1
1  2                         1                        -1                        1
2  5                        -1                         1                        1
3  4                         1                         1                        1
4  3                         1                        -1                        1
5  6                        -1                         1                        1
6  4                         3                         3                        3
7  2                         6                         6                        6
8  1                         7                         7                        7
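On NumPy 1.20+, the manual shape/strides arithmetic in rolling_window_mask can be replaced by sliding_window_view. A minimal sketch of just the "smaller" comparison (variable names here are my own; trailing rows without a full look-ahead window keep their original values, as in the "keep the last items" variant above):

import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])

window = 3
values = df[0].to_numpy()
# row i of `windows` holds values[i:i+window]
windows = sliding_window_view(values, window)
mask = (check_df[0].to_numpy()[:len(windows), None] < windows).any(axis=1)
# rows without a full window keep the original values
out = np.concatenate([np.where(mask, 1, -1), values[len(windows):]])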

pandas create a DataFrame by multiplying every element in a list with every other element

I need to populate a dataframe with a matrix built from a single list, but the math and python syntax are beyond me. I essentially need to perform some math operations as if the same list were both the rows and the columns.
So it should look something like this....
#Input
list = [1,2,3,4]
create a matrix using some math on the list, like matrix[i,j] = list[i] * list[j]
#output
np.matrix([[1,2,3,4], [2,4,6,8], [3,6,9,12], [4,8,12,16]])
df = pd.dataframe[np.matrix]
Broadcasted multiplication will work here:
arr = np.array([1, 2, 3, 4])
pd.DataFrame(arr * arr[:,None])
   0  1   2   3
0  1  2   3   4
1  2  4   6   8
2  3  6   9  12
3  4  8  12  16
Alternatively, most numpy arithmetic ufuncs define an .outer method:
pd.DataFrame(np.multiply.outer(arr, arr))
   0  1   2   3
0  1  2   3   4
1  2  4   6   8
2  3  6   9  12
3  4  8  12  16
data = [1,2,3,4]
Nested for loops would work:

import numpy as np

a = []
for n in data:
    row = []
    for m in data:
        math = some_operation_on(m, n)  # placeholder for the desired operation, e.g. m * n
        row.append(math)
    a.append(row)
a = np.array(a)
For simple operations like your example use numpy.meshgrid.
In [21]: a = [1,2,3,4]

In [22]: x,y = np.meshgrid(a,a)

In [23]: x*y
Out[23]:
array([[ 1,  2,  3,  4],
       [ 2,  4,  6,  8],
       [ 3,  6,  9, 12],
       [ 4,  8, 12, 16]])
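If the list should also label the rows and columns, np.outer plus explicit index/columns does the same multiplication; a small sketch (assuming the usual imports):

import numpy as np
import pandas as pd

data = [1, 2, 3, 4]
df = pd.DataFrame(np.outer(data, data), index=data, columns=data)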

Pandas Sampling Every Time Condition is Met

Given a pandas dataframe, I want to get the indices at which the running sum of colB (over the previous rows and the current row) becomes equal to or greater than n, at which point the sum restarts from zero. So for example, if our dataframe has values:
index  colB
1        10
2        20
3         5
4         5
5        15
6         5
7         7
8         3
and say n=10, then the indices I want are [1, 2, 4, 5, 7], since at each of those rows the running total of colB reaches at least 10.
So far, I can do a for-loop on this dataframe to get the indices I want, but when there are many rows, it is very slow. Therefore, I am seeking help on a faster method. Thanks!
There may be a clever way with some combination of cumsum() but this is a tough problem because the value needs to reset after the sum is greater than n. So it's kind of like a rolling sum with a window no greater than n.
I would probably use a custom function for this.
import pandas as pd

def go(s, n=10):
    r = []   # group label per row
    c = 0    # current group counter
    s = s.tolist()
    cv = 0   # running sum since the last reset
    for v in s:
        if v >= n:
            r.append(c)
            c += 1
        else:
            cv += v
            if cv >= n:
                r.append(c)
                c += 1
                cv = 0
            else:
                r.append(c)
    return r
df = pd.DataFrame.from_dict({
    'colB': {0: 10, 1: 20, 2: 5, 3: 5, 4: 15, 5: 5, 6: 1, 7: 3, 8: 1, 9: 12}})
df['group'] = go(df['colB'], 10)
indices = df['group'].drop_duplicates().index
Note that I modified your example numbers a bit.
If df is this:

   colB
0    10
1    20
2     5
3     5
4    15
5     5
6     1
7     3
8     1
9    12

indices is:
Int64Index([0, 1, 2, 4, 5, 9], dtype='int64')
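As a quick sanity check (my own addition, not part of the original answer), summing colB within each group confirms that every closed group reaches n = 10:

df.groupby('group')['colB'].sum()

which gives 10, 20, 10, 15, 10 and 12 for groups 0 through 5.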

Building a function that divides dataframe into groups

I am interested in creating a function that does the following:
accepts 2 parameters: a DataFrame and an integer.
adds a column to the DF called "group".
gives each row an integer based on its integer position; the number of groups should equal the integer given to the function.
if the number of rows is not divisible by the integer given, the remaining rows should be split as evenly as possible between the groups. This is the part I'm having problems with.
Here is a manual example I made to clarify my intentions.
I would like to get from this DF:
d = {'value': [1,2,3,4,5,6,7,8,9,10,11,12,13],}
df_init = pd.DataFrame(data=d)
By this function:
wanted_function(df_init, 5)
To this final DF:
s = {'value': [1,2,3,4,5,6,7,8,9,10,11,12,13], 'group': [1,1,1,2,2,2,3,3,3,4,4,5,5]}
df_finel = pd.DataFrame(data=s)
If I can make the question any clearer, please tell me how and ill fix it.
Use np.array_split
In [5481]: [i for i, x in enumerate(np.array_split(np.arange(len(df)), 5), 1) for _ in x]
Out[5481]: [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5]
Assign it
In [5487]: df['group'] = [i for i, x in
      ...:                enumerate(np.array_split(np.arange(len(df)), 5), 1) for _ in x]
In [5488]: df
Out[5488]:
    value  group
0       1      1
1       2      1
2       3      1
3       4      2
4       5      2
5       6      2
6       7      3
7       8      3
8       9      3
9      10      4
10     11      4
11     12      5
12     13      5
Details
Original df
In [5491]: df
Out[5491]:
    value
0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
9      10
10     11
11     12
12     13
The act
In [5492]: np.array_split(np.arange(len(df)), 5)
Out[5492]:
[array([0, 1, 2]),
 array([3, 4, 5]),
 array([6, 7, 8]),
 array([ 9, 10]),
 array([11, 12])]
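To package this as the requested function (a sketch; the function name is my own):

import numpy as np
import pandas as pd

def add_group_column(df, n_groups):
    # np.array_split spreads any remainder as evenly as possible across the chunks
    labels = [i for i, chunk in enumerate(np.array_split(np.arange(len(df)), n_groups), 1)
              for _ in chunk]
    out = df.copy()
    out['group'] = labels
    return out

df_finel = add_group_column(df_init, 5)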
