I have a dataframe df as below:
# Import pandas library
import pandas as pd
# initialize list elements
data = [10,-20,30,40,-50,60,12,-12,11,1,90,-20,-10,-5,-4]
# Create the pandas DataFrame with the column name provided explicitly
df = pd.DataFrame(data, columns=['Numbers'])
# print dataframe.
df
I want the sum of the values in the max consecutive positive run and the max consecutive negative run.
I am able to get the count of the max consecutive positive and negative numbers, but I am unable to get the sums using the code below.
my code:
streak = df['Numbers'].to_list()
from collections import defaultdict
from itertools import groupby
counter = defaultdict(list)
for key, val in groupby(streak, lambda ele: "plus" if ele >= 0 else "minus"):
    counter[key].append(len(list(val)))
lst = []
for key in ('plus', 'minus'):
    lst.append(counter[key])
print("Max Pos Count " + str(max(lst[0])))
print("Max Neg Count : " + str(max(lst[1])))
Current Output:
Max Pos Count 3
Max Neg Count : 4
I am struggling to get the sum of the max consecutive positive and negative runs.
Expected Output:
Sum Pos Max Consecutive: 102
Sum Neg Max Consecutive: -39
The logic is a bit unclear; the way I understand it is:
group by successive negative/positive values
get the longest stretch per group
compute the sum
You can use:
m = df['Numbers'].gt(0).map({True: 'positive', False: 'negative'})
df2 = df.groupby([m, m.ne(m.shift()).cumsum()])['Numbers'].agg(['count', 'sum'])
out = df2.loc[df2.groupby(level=0)['count'].idxmax(), 'sum'].droplevel(1)
Output:
Numbers
negative -39
positive 102
Name: sum, dtype: int64
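If you prefer the exact strings from your expected output, you can index this Series by its labels:
print("Sum Pos Max Consecutive: " + str(out['positive']))  # 102
print("Sum Neg Max Consecutive: " + str(out['negative']))  # -39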
Intermediate df2:
count sum
Numbers Numbers
negative 2 1 -20
4 1 -50
6 1 -12
8 4 -39 # longest negative stretch
positive 1 1 10
3 2 70
5 2 72
7 3 102 # longest positive stretch
Related
So, I have data like this
Index c1
sls1 6
sls2 4
sls3 7
sls4 5
sls5 5
I want to find collections of indexes such that, for each collection, the sum of the column c1 values over those indexes is less than or equal to 10, using a loop. Then I save each index set as a list in a new data frame, which is the output.
output = []
output
[sls1, sls2]
[sls3]
[sls4, sls5]
The first row is [sls1, sls2] because the sum of the values at those two indexes equals 10, while the second row contains only sls3 because the value of column c1 at index sls3 is 7, and adding the next index's value would push the total over 10. And so on.
Thank You
There is no vectorized way to compute a cumulative sum that restarts on a threshold, so you'll have to use a loop.
Then combine this with groupby.agg:
def group(s, thresh=10):
    out = []
    g = 0
    curr_sum = 0
    for v in s:
        curr_sum += v
        if curr_sum > thresh:
            g += 1
            curr_sum = v
        out.append(g)
    return pd.Series(out, index=s.index)
out = df.groupby(group(df['c1']))['Index'].agg(list)
Output:
0 [sls1, sls2]
1 [sls3]
2 [sls4, sls5]
Name: Index, dtype: object
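For completeness, here is a minimal way to reproduce this on the example data (assuming Index is a regular column, as the df.groupby(...)['Index'] call above expects):
import pandas as pd

df = pd.DataFrame({'Index': ['sls1', 'sls2', 'sls3', 'sls4', 'sls5'],
                   'c1': [6, 4, 7, 5, 5]})
print(df.groupby(group(df['c1']))['Index'].agg(list))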
Problem
I want to pick a subset of fixed size from a list of items such that the count of the most frequently occurring label among the selected items is minimized. In English, I have a DataFrame consisting of a list of 10000 items, generated as follows.
import random
import pandas as pd
def RandLet():
alphabet = "ABCDEFG"
return alphabet[random.randint(0, len(alphabet) - 1)]
items = pd.DataFrame([{"ID": i, "Label1": RandLet(), "Label2": RandLet(), "Label3": RandLet()} for i in range(0, 10000)])
items.head(3)
Each item has 3 labels. The labels are letters within ABCDEFG, and the order of the labels doesn't matter. An item may be tagged multiple times with the same label.
[Example of the first 3 rows]
ID Label1 Label2 Label3
0 0 G B D
1 1 C B C
2 2 C A B
From this list, I want to pick 1000 items in a way that minimizes the number of occurrences of the most frequently appearing label within those items.
For example, if my DataFrame only consisted of the above 3 items, and I only wanted to pick 2 items, and I picked items with ID #1 and #2, the label 'C' appears 3 times, 'B' appears 2 times, 'A' appears 1 time, and all other labels appear 0 times - The maximum of these is 3. However, I could have done better by picking items #0 and #2, in which label 'B' appears the most frequently, coming in as a count of 2. Since 2 is less than 3, picking items #0 and #2 is better than picking items #1 and #2.
In the case where there are multiple ways to pick 1000 items such that the count of the maximum label occurrence is minimized, returning any of those selections is fine.
What I've got
To me, this feels similar to a knapsack problem in len("ABCDEFG") = 7 dimensions. I want to put 1000 items in the knapsack, and each item's size in the relevant dimension is the sum of the occurrences of the label for that particular item. To that extent, I've built this function to convert my list of items into a list of sizes for the knapsack.
def ReshapeItems(items):
    alphabet = "ABCDEFG"
    item_rebuilder = []
    for i, row in items.iterrows():
        letter_counter = {}
        for letter in alphabet:
            letter_count = sum(row[[c for c in items.columns if "Label" in c]].apply(lambda x: 1 if x == letter else 0))
            letter_counter[letter] = letter_count
        letter_counter["ID"] = row["ID"]
        item_rebuilder.append(letter_counter)
    items2 = pd.DataFrame(item_rebuilder)
    return items2
items2 = ReshapeItems(items)
items2.head(3)
[Example of the first 3 rows of items2]
A B C D E F G ID
0 0 1 0 1 0 0 1 0
1 0 1 2 0 0 0 0 1
2 1 1 1 0 0 0 0 2
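As a side note, the iterrows loop above can be slow for 10,000 rows. A rough vectorized sketch of the same per-row letter counts (variable names here are just illustrative, and the column order differs slightly, with ID first):
label_cols = [c for c in items.columns if "Label" in c]
# one row per (ID, label occurrence), then cross-tabulate letters per ID
counts = (
    items.melt(id_vars="ID", value_vars=label_cols, value_name="letter")
         .pipe(lambda d: pd.crosstab(d["ID"], d["letter"]))
         .reindex(columns=list("ABCDEFG"), fill_value=0)
         .reset_index()
)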
Unfortunately, at that point, I am completely stuck. I think that the point of knapsack problems is to maximize some sort of value while keeping the sum of the selected items' sizes under some limit. However, here my problem is the opposite: I want to minimize the sum of the selected sizes such that my value is at least some amount.
What I'm looking for
Although a function that takes in items or items2 and returns a subset of these items that meets my specifications would be ideal, I'd be happy to accept any sufficiently detailed answer that points me in the right direction.
Using a different approach, here is my take on your interesting question.
def get_best_subset(
    df: pd.DataFrame, n_rows: int, key_cols: list[str], iterations: int = 50_000
) -> tuple[int, pd.DataFrame]:
    """Subset df in such a way that the frequency
    of most frequent values in key columns is minimum.

    Args:
        df: input dataframe.
        n_rows: number of rows in subset.
        key_cols: columns to consider.
        iterations: max number of tries. Defaults to 50_000.

    Returns:
        Minimum frequency, subset of n rows of input dataframe.
    """
    lowest_frequency: int = df.shape[0] * df.shape[1]
    best_df = pd.DataFrame([])
    # Iterate through possible subsets
    for _ in range(iterations):
        sample_df = df.sample(n=n_rows)
        # Count values in each column, concat and sum counts, get max count
        frequency = (
            pd.concat([sample_df[col].value_counts() for col in key_cols])
            .pipe(lambda df_: df_.groupby(df_.index).sum())
            .max()
        )
        if frequency < lowest_frequency:
            lowest_frequency = frequency
            best_df = sample_df
    return lowest_frequency, best_df.sort_values(by=["ID"]).reset_index(drop=True)
And so, with the toy dataframe constructor you provided:
lowest_frequency, best_df = get_best_subset(
    items, 1_000, ["Label1", "Label2", "Label3"]
)
print(lowest_frequency)
# 431
print(best_df)
# Output
ID Label1 Label2 Label3
0 39 F D A
1 46 B G E
2 52 D D B
3 56 D F B
4 72 C D E
.. ... ... ... ...
995 9958 A F E
996 9961 E E E
997 9966 G E C
998 9970 B C B
999 9979 A C G
[1000 rows x 4 columns]
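For context, a quick sanity check on that 431: any selection of 1,000 items carries 1,000 * 3 = 3,000 label slots, so by the pigeonhole principle the most frequent of the 7 letters must appear at least ceil(3000 / 7) = 429 times. The random-sampling result is therefore already very close to the best possible value:
import math
# lower bound on the max label frequency for any 1,000-item subset (3 labels each, 7 letters)
print(math.ceil(1_000 * 3 / 7))  # 429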
I'm a biology student who is fairly new to Python and was hoping someone might be able to help with a problem I have yet to solve.
With some subsequent code I have created a pandas dataframe that looks like the example below:
Distance. No. of values Mean rSquared
1 500 0.6
2 80 0.3
3 40 0.4
4 30 0.2
5 50 0.2
6 30 0.1
I can provide my previous code to create this dataframe, but I didn't think it was particularly relevant.
I need to sum the 'No. of values' column until I reach a value >= 100, and then combine the data of the rows of the adjacent columns, taking the weighted average of the distance and mean r2 values, as seen in the example below
Mean Distance. No. Of values Mean rSquared
1 500 0.6
(80*2+40*3)/120 (80+40) = 120 (80*0.3+40*0.4)/120
(30*4+50*5+30*6)/110 (30+50+30) = 110 (30*0.2+50*0.2+30*0.1)/110
etc...
I know pandas has its .cumsum function, which I might be able to use in a for loop with an if statement that checks the upper limit and resets the sum back to 0 when it is greater than or equal to the upper limit. However, I haven't a clue how to average the adjacent columns.
Any help would be appreciated!
You can use this code snippet to solve your problem.
# First, compute some weighted values
df.loc[:, "weighted_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "weighted_mean_rSquared"] = df["Mean rSquared"] * df["No. of values"]
min_threshold = 100
indexes = []
temp_sum = 0
# placeholder for the final result
final_df = pd.DataFrame()
columns = ["Distance", "No. of values", "Mean rSquared"]
# resetting the index to make 'df' usable in the following loop
df = df.reset_index(drop=True)
# main loop to check and compute the desired output
for index, _ in df.iterrows():
    temp_sum += df.iloc[index]["No. of values"]
    indexes.append(index)
    # if the sum reaches 'min_threshold', compute the weighted averages
    if temp_sum >= min_threshold:
        temp_distance = df.iloc[indexes]["weighted_distance"].sum() / temp_sum
        temp_mean_rSquared = df.iloc[indexes]["weighted_mean_rSquared"].sum() / temp_sum
        # create a temporary dataframe and concatenate it with 'final_df'
        temp_df = pd.DataFrame([[temp_distance, temp_sum, temp_mean_rSquared]], columns=columns)
        final_df = pd.concat([final_df, temp_df])
        # reset the variables
        temp_sum = 0
        indexes = []
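For the example dataframe in the question, final_df should come out roughly like this (the values match the weighted averages worked out in the question and in the answers below; the repeated 0 index comes from concatenating one-row frames, and exact dtypes/formatting may differ):
   Distance  No. of values  Mean rSquared
0  1.000000            500       0.600000
0  2.333333            120       0.333333
0  5.000000            110       0.172727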
NumPy has a function, numpy.frompyfunc, that you can use to get the cumulative value based on a threshold.
Here's how to implement it. With that, you can then figure out the index when the value goes over the threshold. Use that to calculate the Mean Distance and Mean rSquared for the values in your original dataframe.
I also leveraged #sujanay's idea of calculating the weighted values first.
c = ['Distance','No. of values','Mean rSquared']
d = [[1,500,0.6], [2,80,0.3], [3,40,0.4],
[4,30,0.2], [5,50,0.2], [6,30,0.1]]
import pandas as pd
import numpy as np
df = pd.DataFrame(d,columns=c)
#calculate the weighted distance and weighted mean squares first
df.loc[:, "w_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "w_mean_rSqrd"] = df["Mean rSquared"] * df["No. of values"]
#use numpy.frompyfunc to setup the threshold condition
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
#assign value to cumvals based on threshold
df['cumvals'] = sumvals.accumulate(df['No. of values'], dtype=object)
#find out all records that have >= 100 as cumulative values
idx = df.index[df['cumvals'] >= 100].tolist()
#if last row not in idx, then add it to the list
if (len(df)-1) not in idx: idx += [len(df)-1]
#iterate thru the idx for each set and calculate Mean Distance and Mean rSquared
i = 0
for j in idx:
    df.loc[j,'Mean Distance'] = (df.iloc[i:j+1]["w_distance"].sum() / df.loc[j,'cumvals']).round(2)
    df.loc[j,'New Mean rSquared'] = (df.iloc[i:j+1]["w_mean_rSqrd"].sum() / df.loc[j,'cumvals']).round(2)
    i = j+1
print (df)
The output of this will be:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
1 2 80 ... NaN NaN
2 3 40 ... 2.33 0.33
3 4 30 ... NaN NaN
4 5 50 ... NaN NaN
5 6 30 ... 5.00 0.17
If you want to extract only the records that are non NaN, you can do:
final_df = df[df['Mean Distance'].notnull()]
This will result in:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
2 3 40 ... 2.33 0.33
5 6 30 ... 5.00 0.17
I looked up BEN_YO's implementation of numpy.frompyfunc; the original SO post is "Restart cumsum and get index if cumsum more than value".
If you figure out the grouping first, pandas groupby-functionality will do a lot of the remaining work for you. A loop is appropriate to get the grouping (unless somebody has a clever one-liner):
>>> groups = []
>>> group = 0
>>> cumsum = 0
>>> for n in df["No. of values"]:
...     if cumsum >= 100:
...         cumsum = 0
...         group = group + 1
...     cumsum = cumsum + n
...     groups.append(group)
>>>
>>> groups
[0, 1, 1, 2, 2, 2]
Before doing the grouped operations you need to use the No. of values information to get the weighting in:
df[["Distance.", "Mean rSquared"]] = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)
Now get the sums like this:
>>> sums = df.groupby(groups)["No. of values"].sum()
>>> sums
0 500
1 120
2 110
Name: No. of values, dtype: int64
And finally the weighted group averages like this:
>>> df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
Distance. Mean rSquared
0 1.000000 0.600000
1 2.333333 0.333333
2 5.000000 0.172727
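If you also want the group sizes alongside the weighted averages in a single frame, a small follow-up sketch (reusing groups and sums from above; result is just an illustrative name):
>>> result = df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
>>> result.insert(1, "No. of values", sums)  # put the group sizes back next to the averages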
I have a data frame df like this
x
0 8.86
1 1.12
2 0.56
3 5.99
4 3.08
5 4.15
I need to perform some sort of groupby operation on x to aggregate x every time its sum reaches 10. If the index of df were a datetime object, I could use pd.Grouper as below
grouped = df.groupby(pd.Grouper(freq="min"))
grouped["x"].sum()
which would group by the datetime index and then sum x every minute. In my case I don't have a datetime target to use, so df.groupby(pd.Grouper(freq=10)) yields ValueError: Invalid frequency: 10.
The desired output dataframe, after applying groupby() and sum() operations would look like this
y
0 10.54
1 13.22
because elements 0-2 of df sum to 10.54 and elements 3-5 sum to 13.22
How can I group x by its sum, every time the sum reaches 10?
Here's one approach:
# cumulative sum, modulo 10
s = df.x.cumsum().mod(10)
# when the modulo drops, the running sum has just crossed a multiple of 10
m = s.diff().lt(0)
# group by the cumulative count of those drops, shifted so the crossing row closes its group
df.x.groupby(m.cumsum().shift(fill_value=0)).sum()
x
0 10.54
1 13.22
Name: x, dtype: float64
You can do this with a for-loop and rolling sums.
data_slices = []  # store each slice of rows
rollingSum = 0
last_t = 0
for t in range(len(df)):
    rollingSum += df['x'][t]  # add the value at index t to the running sum
    if rollingSum >= 10:
        data_slice = df['x'][last_t:t + 1]  # slice of the x column (including row t) that sums to >= 10
        data_slices.append(data_slice)
        rollingSum = 0  # reset the sum
        last_t = t + 1  # the next slice starts right after row t
grouped_data = pd.concat(data_slices, axis=0)
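To reduce those slices to one sum per group, matching the desired output, something along these lines should work (group_sums is just an illustrative name):
group_sums = pd.Series([s.sum() for s in data_slices], name='y')
# 0    10.54
# 1    13.22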
I'm trying to calculate statistics (min, max, avg...) of streaks of consecutive higher values of a column. I'm rather new to pandas and stats, searched a bit but could not find an answer.
The data is financial data, with OHLC values in column, e.g.
Open High Low Close
Date
2013-10-20 1.36825 1.38315 1.36502 1.38029
2013-10-27 1.38072 1.38167 1.34793 1.34858
2013-11-03 1.34874 1.35466 1.32941 1.33664
2013-11-10 1.33549 1.35045 1.33439 1.34950
....
For example, the average length of a streak of consecutively higher Low values.
LATER EDIT
I think I didn't explain well. An item that was counted in a sequence can't be counted again. So for the sequence:
1,2,3,4,1,2,3,3,2,1
There are 4 streaks: 1,2,3,4 | 1,2,3,3 | 2 | 1
max = 4
min = 1
avg = (4+4+1+1)/4 = 2.5
import pandas as pd
import numpy as np

s = pd.Series([1,2,3,4,1,2,3,3,2,1])

def ascends(s):
    # 1 where the series does not decrease, padded with zeros at both ends
    diff = np.r_[0, (np.diff(s.values) >= 0).astype(int), 0]
    diff2 = np.diff(diff)
    # elements with a decrease on both sides form streaks of length 1
    descends = np.where(np.logical_not(diff)[1:] & np.logical_not(diff)[:-1])[0]
    starts = np.sort(np.r_[np.where(diff2 > 0)[0], descends])
    ends = np.sort(np.r_[np.where(diff2 < 0)[0], descends])
    return ends - starts + 1

b = ascends(s)
print(b)
print(b.max())
print(b.min())
print(b.mean())
Output:
[4 4 1 1]
4
1
2.5
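As a side note, the same streak lengths can be reproduced with plain pandas, assuming (as in the example) that a streak is a maximal run of non-decreasing values; on the OHLC frame you would apply the same idea to df['Low'] instead of the toy series:
s = pd.Series([1, 2, 3, 4, 1, 2, 3, 3, 2, 1])
# a new streak starts whenever the value drops below its predecessor
streaks = s.groupby((s.diff() < 0).cumsum()).size()
print(streaks.tolist())                              # [4, 4, 1, 1]
print(streaks.max(), streaks.min(), streaks.mean())  # 4 1 2.5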