Find a Collection of Indexes Provided that the Value - python

So, I have data like this
Index c1
sls1 6
sls2 4
sls3 7
sls4 5
sls5 5
I want to find collections of consecutive indexes such that the values of column c1 over those indexes sum to less than or equal to 10, using a loop. Then I want to save each index set as a list in a new data frame, which is the output.
output = []
output
[sls1, sls2]
[sls3]
[sls4, sls5]
The first row is [sls1, sls2] because the values at both indexes sum to 10. The second row is only [sls3] because the value of column c1 at index sls3 is 7, and adding the next index's value would bring the sum above 10. And so on.
Thank You

There is no vectorized way to compute a cumulative sum that restarts on a threshold, so you'll have to use a loop.
Then combine this with groupby.agg:
def group(s, thresh=10):
    out = []
    g = 0
    curr_sum = 0
    for v in s:
        curr_sum += v
        if curr_sum > thresh:
            g += 1
            curr_sum = v
        out.append(g)
    return pd.Series(out, index=s.index)
out = df.groupby(group(df['c1']))['Index'].agg(list)
Output:
0 [sls1, sls2]
1 [sls3]
2 [sls4, sls5]
Name: Index, dtype: object
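For readers who only need the lists of labels, the same restart-on-threshold idea can be sketched without pandas. This is a minimal illustration, not the answer's code: the function name split_by_threshold and its arguments are hypothetical, and it assumes a value larger than the threshold should form a group of its own.

```python
def split_by_threshold(labels, values, thresh=10):
    """Group consecutive labels so that each group's values sum to at most thresh."""
    groups, current, running = [], [], 0
    for label, value in zip(labels, values):
        # Close the current group when adding this value would exceed the threshold
        if current and running + value > thresh:
            groups.append(current)
            current, running = [], 0
        current.append(label)
        running += value
    if current:
        # Keep the trailing group even if it never reached the threshold
        groups.append(current)
    return groups


labels = ["sls1", "sls2", "sls3", "sls4", "sls5"]
values = [6, 4, 7, 5, 5]
result = split_by_threshold(labels, values)
print(result)
# [['sls1', 'sls2'], ['sls3'], ['sls4', 'sls5']]
```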

Related

Pick subset of items minimizing the count of the most frequent of the selected item's labels

Problem
I want to pick a subset of fixed size from a list of items such that the count of the most frequent occurrence of the labels of the selected items is minimized. In English, I have a DataFrame consisting of a list of 10000 items, generated as follows.
import random
import pandas as pd
def RandLet():
    alphabet = "ABCDEFG"
    return alphabet[random.randint(0, len(alphabet) - 1)]
items = pd.DataFrame([{"ID": i, "Label1": RandLet(), "Label2": RandLet(), "Label3": RandLet()} for i in range(0, 10000)])
items.head(3)
Each item has 3 labels. The labels are letters within ABCDEFG, and the order of the labels doesn't matter. An item may be tagged multiple times with the same label.
[Example of the first 3 rows]
ID Label1 Label2 Label3
0 0 G B D
1 1 C B C
2 2 C A B
From this list, I want to pick 1000 items in a way that minimizes the number of occurrences of the most frequently appearing label within those items.
For example, if my DataFrame only consisted of the above 3 items, and I only wanted to pick 2 items, and I picked items with ID #1 and #2, the label 'C' appears 3 times, 'B' appears 2 times, 'A' appears 1 time, and all other labels appear 0 times - The maximum of these is 3. However, I could have done better by picking items #0 and #2, in which label 'B' appears the most frequently, coming in as a count of 2. Since 2 is less than 3, picking items #0 and #2 is better than picking items #1 and #2.
In the case where there are multiple ways to pick 1000 items such that the count of the maximum label occurrence is minimized, returning any of those selections is fine.
What I've got
To me, this feels similar to a knapsack problem in len("ABCDEFG") = 7 dimensions. I want to put 1000 items in the knapsack, and each item's size in the relevant dimension is the number of occurrences of the label for that particular item. To that end, I've built this function to convert my list of items into a list of sizes for the knapsack.
def ReshapeItems(items):
    alphabet = "ABCDEFG"
    item_rebuilder = []
    for i, row in items.iterrows():
        letter_counter = {}
        for letter in alphabet:
            letter_count = sum(row[[c for c in items.columns if "Label" in c]].apply(lambda x: 1 if x == letter else 0))
            letter_counter[letter] = letter_count
        letter_counter["ID"] = row["ID"]
        item_rebuilder.append(letter_counter)
    items2 = pd.DataFrame(item_rebuilder)
    return items2
items2 = ReshapeItems(items)
items2.head(3)
[Example of the first 3 rows of items2]
A B C D E F G ID
0 0 1 0 1 0 0 1 0
1 0 1 2 0 0 0 0 1
2 1 1 1 0 0 0 0 2
Unfortunately, at that point, I am completely stuck. I think that the point of knapsack problems is to maximize some sort of value while keeping the sum of the selected items' sizes under some limit. However, my problem is the opposite: I want to minimize the sum of the selected sizes such that my value is at least some amount.
What I'm looking for
Although a function that takes in items or items2 and returns a subset of these items that meets my specifications would be ideal, I'd be happy to accept any sufficiently detailed answer that points me in the right direction.
Using a different approach, here is my take on your interesting question.
def get_best_subset(
    df: pd.DataFrame, n_rows: int, key_cols: list[str], iterations: int = 50_000
) -> tuple[int, pd.DataFrame]:
    """Subset df in such a way that the frequency
    of most frequent values in key columns is minimum.
    Args:
        df: input dataframe.
        n_rows: number of rows in subset.
        key_cols: columns to consider.
        iterations: max number of tries. Defaults to 50_000.
    Returns:
        Minimum frequency, subset of n rows of input dataframe.
    """
    lowest_frequency: int = df.shape[0] * df.shape[1]
    best_df = pd.DataFrame([])
    # Iterate through possible subsets
    for _ in range(iterations):
        sample_df = df.sample(n=n_rows)
        # Count values in each column, concat and sum counts, get max count
        frequency = (
            pd.concat([sample_df[col].value_counts() for col in key_cols])
            .pipe(lambda df_: df_.groupby(df_.index).sum())
            .max()
        )
        if frequency < lowest_frequency:
            lowest_frequency = frequency
            best_df = sample_df
    return lowest_frequency, best_df.sort_values(by=["ID"]).reset_index(drop=True)
And so, with the toy dataframe constructor you provided:
lowest_frequency, best_df = get_best_subset(
    items, 1_000, ["Label1", "Label2", "Label3"]
)
print(lowest_frequency)
# 431
print(best_df)
# Output
ID Label1 Label2 Label3
0 39 F D A
1 46 B G E
2 52 D D B
3 56 D F B
4 72 C D E
.. ... ... ... ...
995 9958 A F E
996 9961 E E E
997 9966 G E C
998 9970 B C B
999 9979 A C G
[1000 rows x 4 columns]
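As a possible alternative to random sampling, here is a rough sketch of a greedy heuristic: at each step it adds the item that keeps the largest label count as low as possible. The function name greedy_subset and its parameters are hypothetical, the result is not guaranteed to be optimal, and ties are simply broken by row order. It is shown on the three-row example from the question.

```python
import numpy as np
import pandas as pd


def greedy_subset(items, n_rows, label_cols, alphabet="ABCDEFG"):
    """Hypothetical greedy heuristic; returns (max label count, chosen rows)."""
    # Per-item label counts as an (n_items, n_labels) integer matrix
    counts = np.stack(
        [items[label_cols].eq(letter).sum(axis=1).to_numpy() for letter in alphabet],
        axis=1,
    ).astype(np.int64)
    totals = np.zeros(len(alphabet), dtype=np.int64)
    available = np.ones(len(items), dtype=bool)
    picked = []
    for _ in range(n_rows):
        # Largest label count that would result from adding each item
        scores = (counts + totals).max(axis=1)
        # Items already picked must not be chosen again
        scores[~available] = np.iinfo(np.int64).max
        best = int(scores.argmin())
        picked.append(best)
        available[best] = False
        totals += counts[best]
    subset = items.iloc[picked].sort_values("ID").reset_index(drop=True)
    return int(totals.max()), subset


toy = pd.DataFrame(
    {
        "ID": [0, 1, 2],
        "Label1": ["G", "C", "C"],
        "Label2": ["B", "B", "A"],
        "Label3": ["D", "C", "B"],
    }
)
best_max, subset = greedy_subset(toy, 2, ["Label1", "Label2", "Label3"])
print(best_max)
# 2
```

On the full 10000-item frame the call would be greedy_subset(items, 1_000, ["Label1", "Label2", "Label3"]); each step is one vectorized pass over the count matrix.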

Get rows before and after from an index in pandas dataframe

I want to get a specific number of rows before and after a specific index. However, when the requested range extends past the first or last index, nothing is returned. In that case, I would like the selection to wrap around and keep collecting rows, as shown below:
df = pd.DataFrame({'column': range(1, 6)})
column
0 1
1 2
2 3
3 4
4 5
index = 2
df.iloc[index]
3
# Now I want to get three values before and after that index.
# Something like this:
def get_before_after_rows(index):
    rows_before = df[(index-1): (index-1)-2]
    rows_after = df[(index+1): (index+1)-2]
    return rows_before, rows_after
rows_before, rows_after = get_before_after_rows(index)
rows_before
column
0 1
1 2
4 5
rows_after
column
0 1
3 4
4 5
You are mixing iloc and loc, which is very dangerous. It only works in your example because the index is numbered sequentially starting from zero, so the two behave identically.
Anyhow, what you want is basically taking rows with wrap-around:
import numpy as np
import pandas as pd

def get_around(df: pd.DataFrame, index: int, n: int) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return n rows before and n rows after the specified positional index"""
    # The modulo makes positions wrap around both ends of the frame
    idx = (index - np.arange(1, n+1)) % len(df)
    before = df.iloc[idx].sort_index()
    idx = (index + np.arange(1, n+1)) % len(df)
    after = df.iloc[idx].sort_index()
    return before, after
# Get 3 rows before and 3 rows after the *positional index* 2
before, after = get_around(df, 2, 3)
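To make the positional point concrete, here is a small sketch of the same wrap-around arithmetic on a frame whose labels are not 0..4. The string index is an assumption added for illustration; on such a frame df.loc[2] would raise a KeyError, while positions still work.

```python
import numpy as np
import pandas as pd

# Same values as in the question, but with string labels (illustrative assumption)
df = pd.DataFrame({"column": range(1, 6)}, index=list("abcde"))
n_rows = len(df)
position, n = 2, 3

# Positions of the n rows before and after, wrapping around both ends
before_pos = (position - np.arange(1, n + 1)) % n_rows  # [1, 0, 4]
after_pos = (position + np.arange(1, n + 1)) % n_rows   # [3, 4, 0]

# Sort the positions so rows come back in their original order
before = df.iloc[np.sort(before_pos)]
after = df.iloc[np.sort(after_pos)]
print(before["column"].tolist())  # [1, 2, 5]
print(after["column"].tolist())   # [1, 4, 5]
```

These are the same rows the question lists as the expected rows_before and rows_after.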

Iterate through df rows and sum values of two columns separately until condition is met on one of those columns

I am definitely still learning python and have tried countless approaches, but can't figure this one out.
I have a dataframe with 2 columns, call them A and B. I need to return a df that sums the row values of each of these two columns independently until the running sum of A reaches some threshold; for this example, let's say 10. So far I am trying to use iterrows() and can segment based on whether A >= 10, but I can't work out how to sum rows until the threshold is met. The resulting df must be exhaustive even if the final A values do not meet the threshold (see the final row of the desired output).
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
A B
0 20 16
1 10 5
2 3 2
3 1 1
4 12 10
5 9 7
6 6 6
7 5 2
Desired result:
A B
0 20 16
1 10 5
2 16 13
3 15 13
4 5 2
Thank you in advance, much time spent, and assistance is much appreciated!!!
Cheers
I rarely write long loops for pandas, but I didn't see a way to do this with a pandas method. Try this horrible loop :( :
The variable t that I created essentially tracks the cumulative sum so that it can be checked against n (which we have set to 10). Then we decide whether to use t, the cumulative sum, or i, the value in the dataframe for any given row (j and u do the same thing in parallel for column B).
There are a few conditions so some elif statements, and there will be different behavior for the last row the way I have set it up, so I had to have some separate logic for that with the last if -- otherwise the last value wasn't getting appended:
import pandas as pd
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
a,b = [],[]
t,u,count = 0,0,0
n=10
for (i,j) in zip(df1['A'], df1['B']):
    count+=1
    if i < n and t >= n:
        a.append(t)
        b.append(u)
        t = i
        u = j
    elif 0 < t < n:
        t += i
        u += j
    elif i < n and t == 0:
        t += i
        u += j
    else:
        t = 0
        u = 0
        a.append(i)
        b.append(j)
    if count == len(df1['A']):
        if t == i or t == 0:
            a.append(i)
            b.append(j)
        elif t > 0 and t != i:
            t += i
            u += j
            a.append(t)
            b.append(u)
df2 = pd.DataFrame({'A' : a, 'B' : b})
df2
Here's one that works that's shorter:
import pandas as pd
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
rows = []  # collected [A, B] pairs (DataFrame.append was removed in pandas 2.0)
index = 0
while index < len(df1):
    if df1.iloc[index]['A'] >= 10:
        # A single row already meets the threshold: keep it as is
        rows.append([df1.iloc[index]['A'], df1.iloc[index]['B']])
        index += 1
    else:
        # Accumulate rows until the sum of A reaches 10 or the data runs out
        a_sum = 0
        b_sum = 0
        while a_sum < 10 and index < len(df1):
            a_sum += df1.iloc[index]['A']
            b_sum += df1.iloc[index]['B']
            index += 1
        # Emit the sums even if the final group stays below the threshold
        rows.append([a_sum, b_sum])
df2 = pd.DataFrame(rows, columns=['A','B'])
The key is to keep track of where you are in the DataFrame and track the sums. Don't be afraid to use variables.
In pandas, use iloc to access each row by position. Make sure you don't run past the end of the DataFrame by checking the number of rows. Note that df.size returns the number of elements (rows multiplied by columns), so use len(df) to get the actual number of rows.
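The position tracking can also be separated from the summing: a short loop assigns a group id to every row, and groupby does the rest. This is only a sketch of that idea; the helper name threshold_group_ids is hypothetical, and it assumes a group closes as soon as its running sum of A reaches the threshold.

```python
import pandas as pd

df1 = pd.DataFrame(
    data=[[20, 16], [10, 5], [3, 2], [1, 1], [12, 10], [9, 7], [6, 6], [5, 2]],
    columns=["A", "B"],
)


def threshold_group_ids(values, thresh=10):
    """Label each value with a group id; a group closes once its sum reaches thresh."""
    ids, group, running = [], 0, 0
    for v in values:
        ids.append(group)
        running += v
        if running >= thresh:
            # Threshold reached: the next row starts a new group
            group += 1
            running = 0
    return pd.Series(ids, index=values.index)


# Sum A and B within each group; the trailing partial group is kept
df2 = df1.groupby(threshold_group_ids(df1["A"])).sum().reset_index(drop=True)
print(df2)
```

On the sample data this gives the five rows listed in the desired result.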

How to groupby a column every time its sum reaches a specified amount?

I have a data frame df like this
x
0 8.86
1 1.12
2 0.56
3 5.99
4 3.08
5 4.15
I need to perform some sort of groupby operation on x to aggregate x every time its sum reaches 10. If the index of df were a datetime object, I could use pd.Grouper as below
grouped = df.groupby(pd.Grouper(freq="min"))
grouped["x"].sum()
which would group by the datetime index and then sum x every minute. In my case I don't have a datetime target to use, so df.groupby(pd.Grouper(freq=10)) yields ValueError: Invalid frequency: 10.
The desired output dataframe, after applying groupby() and sum() operations would look like this
y
0 10.54
1 13.22
because elements 0-2 of df sum to 10.54 and elements 3-5 sum to 13.22
How can I group x by its sum, every time the sum reaches 10?
Here's one approach:
# cumulative sum modulo 10
s = df.x.cumsum().mod(10)
# a drop in the modulo means the cumulative sum has crossed a multiple of 10
m = s.diff().lt(0)
# group by the cumsum of those crossings
df.x.groupby(m.cumsum().shift(fill_value=0)).sum()
x
0 10.54
1 13.22
Name: x, dtype: float64
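One thing that may be worth checking with the modulo approach is that it tracks crossings of multiples of 10 in the overall cumulative sum, so any overshoot carries into the next group instead of the sum restarting at zero. The sketch below compares it with an explicit restart loop on a small made-up series (the values are an assumption chosen to show the difference); on the question's data both give the same groups.

```python
import pandas as pd

x = pd.Series([9, 9, 2, 9])  # illustrative values, not from the question

# Modulo-based grouping, as in the snippet above
s = x.cumsum().mod(10)
m = s.diff().lt(0)
modulo_sums = x.groupby(m.cumsum().shift(fill_value=0)).sum().tolist()

# Explicit loop that resets the running sum to zero after each group
restart_sums, running = [], 0
for v in x.tolist():
    running += v
    if running >= 10:
        restart_sums.append(running)
        running = 0
if running:
    # Keep the trailing partial group, if any
    restart_sums.append(running)

print(modulo_sums)   # [18, 2, 9]
print(restart_sums)  # [18, 11]
```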
You can do this with a for-loop and a running sum.
data_slices = []  # Store each slice whose sum reaches 10
running_sum = 0
last_t = 0
for t in range(len(df)):
    running_sum += df['x'].iloc[t]  # Add the value at position t to the sum
    if running_sum >= 10:
        data_slice = df['x'].iloc[last_t:t + 1]  # Slice of x that sums to at least 10, including position t
        data_slices.append(data_slice)
        running_sum = 0  # Reset the sum
        last_t = t + 1  # The next slice starts right after position t
grouped_data = pd.Series([s.sum() for s in data_slices], name='y')  # One sum per slice

DataFrame.sum returns Series and not a number

My basic task is to take a vector x=[x1,x2,x3,x4] (which in my case is a row of a pandas dataframe, let's say the row with index 1), multiply it by a scalar k, and sum up the results: x1*k + x2*k + x3*k + x4*k.
I did not find a function that would do it in one step (is there such a function or operation?), so I do it in two steps. First I multiply my vector x by the scalar k, and then I sum up the results:
x_by_k = my_df.loc[[1]]*k
sum = x_by_k.sum(axis=1)
One of the problems I have here is that the resulting sum is of Series type, although effectively it is a number.
Is there a way to perform this sum operation with a number as the output?
Can I do the above in one step?
IIUC, select the row in df with loc, then sum and multiply by k:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})
print (df)
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
k = 2
# .ix was removed in pandas 1.0; .loc selects the row by label
total = df.loc[1].sum() * k
print (total)
30
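The Series result in the question comes from the list selector loc[[1]], which keeps a one-row DataFrame. A short sketch of the difference, using the same example frame (the variable names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
k = 2

# loc[[1]] (list selector) keeps a one-row DataFrame, so sum(axis=1) is a Series
as_series = (df.loc[[1]] * k).sum(axis=1)
# loc[1] (scalar selector) returns the row as a Series, so sum() is a scalar
as_scalar = (df.loc[1] * k).sum()
# A one-element Series can also be reduced to a scalar afterwards
from_series = as_series.iloc[0]

print(type(as_series).__name__, as_scalar, from_series)
# Series 30 30
```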
