Custom function + groupby in pandas with different conditions on grouped variables - python

I want to generate some weights using groupby on data that originally looks like this:
V1  V2  MONTH  CHOICES  PRIORITY
X   T1  M1     C1       1
X   T1  M1     C2       0
X   T1  M1     C3       0
X   T2  M1     C1       1
X   T2  M1     C5       0
X   T2  M1     C6       0
X   T2  M1     C2       1
X   T1  M2     C1       1
X   T1  M2     C2       0
X   T1  M2     C3       0
X   T2  M2     C1       0
X   T2  M2     C5       1
X   T2  M2     C6       0
X   T2  M2     C2       1
Basically, when the MONTH is different from M1, I want flagged choices to have a weight equal to double that of any non-flagged choice.
Example: if you have (C1, C2, C3) and C1 is the only one flagged, the weights would be 0.5 / 0.25 / 0.25 (with n choices of which k are flagged, each non-flagged choice gets 1/(n+k) and each flagged choice gets 2/(n+k)).
At the same time, for the first month, I want the weights to be focused solely on the flagged choices; the previous example would become (1 / 0 / 0).
A note about the data: for a given tuple (V1, V2, MONTH), at most two choices are flagged as priorities (no priority at all is also possible).
Here's what I've tried:
def weights_preferences(data):
    if (data.MONTH.values != 'M1'):
        data['WEIGHTS'] = 1/(len(data)+data[data.PRIORITY==1].shape[0])
        data['WEIGHTS'] = data.apply(lambda x : 2*x.WEIGHTS if x.PRIORITY==1 else x.WEIGHTS, axis=1)
    elif data.MONTH.values == 'M1' & data[data.PRIORITY==1].shape[0]==0 :
        data['WEIGHTS'] = 1/(len(data))
    else :
        if data[data.PREFERENCE==1].shape[0]==1 :
            data['WEIGHTS'] = [1 if x[1].PRIORITY==1 else 0 for x in data.iterrows()]
        else :
            data['WEIGHTS'] = [0.5 if x[1].PRIORITY==1 else 0 for x in data.iterrows()]
    return data

tmp = tmp.groupby(['V1','V2','MONTH']).apply(weights_preferences)
tmp = tmp.groupby(['V1','V2','MONTH']).apply(weights_preferences)
The problem is that since I group by 'MONTH', that column no longer seems to be present in the data that weights_preferences is applied to.
P.S.: The output would look like this:
V1  V2  MONTH  CHOICES  PRIORITY  WEIGHTS
X   T1  M1     C1       1         1
X   T1  M1     C2       0         0
X   T1  M1     C3       0         0
X   T2  M1     C1       1         0.5
X   T2  M1     C5       0         0
X   T2  M1     C6       0         0
X   T2  M1     C2       1         0.5
X   T1  M2     C1       1         0.5
X   T1  M2     C2       0         0.25
X   T1  M2     C3       0         0.25
X   T2  M2     C1       0         0.16
X   T2  M2     C5       1         0.33
X   T2  M2     C6       0         0.16
X   T2  M2     C2       1         0.33
Any suggestions are very welcome!
Thanks.
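One way around the missing column (a minimal sketch, not tested beyond the sample data above): inside a groupby-apply, the group key is available as data.name, so MONTH can be read from there even when the column itself is absent. The weight logic can also be vectorized per group:

import pandas as pd

def weights_preferences(data):
    month = data.name[2]                 # group key is the tuple (V1, V2, MONTH)
    flagged = data['PRIORITY'].eq(1)
    n, k = len(data), flagged.sum()
    if month != 'M1':
        base = 1.0 / (n + k)             # each flagged row counts double
        w = base * (1 + flagged)
    elif k == 0:
        w = pd.Series(1.0 / n, index=data.index)  # no priority: uniform weights
    else:
        w = flagged / k                  # first month: all weight on flagged choices
    return data.assign(WEIGHTS=w)

tmp = tmp.groupby(['V1', 'V2', 'MONTH'], group_keys=False).apply(weights_preferences)

This reproduces the weights in the expected output, e.g. 2/6 ≈ 0.33 and 1/6 ≈ 0.16 for the (X, T2, M2) group.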

Related

Negative values with GCD python program

I am making a GCD tool in Python, but for some reason every time I use two negative values I get a negative value back. Because this is for school, I can't use things like abs() or math etc. Can someone help me with that?
from sys import stdin

a = 0
b = 0
a0 = 0
b0 = 0
a1 = 0
b1 = 0
n = 0
na = 0
nb = 0
q = 0
for line in stdin:
    input = line.lstrip().rstrip().split()
    if line == '' or len(input) != 2:
        break
    a, b = [int(x) for x in line.lstrip().rstrip().split()]
    if a > b:
        a, b = b, a
    #
    #        a  |  b
    #      +----+----+
    #    a |  1 |  0 |  "( a0, b0 )"
    #    b |  0 |  1 |  "( a1, b1 )"
    #    n | na | nb |  q
    #      |    |    |
    #
    a0 = 1
    b0 = 0
    a1 = 0
    b1 = 1
    n = a % b
    q = a / b
    na = a0 - q * a1
    nb = b0 - q * b1
    a = b
    a0 = a1
    b0 = b1
    b = n
    a1 = na
    b1 = nb
    while n != 0:
        n = a % b
        q = a // b
        na = a0 + q * a1
        nb = b0 + q * b1
        a = b
        a0 = a1
        b0 = b1
        b = n
        a1 = na
        b1 = nb
    print(a)
I tried messing around with operators. I expect -888 -2 to give 2, not -2. (I need to fix the code, not edit the results.)
Edit 1: Here are some examples of what I need:
Input     Output
7 11      1
888 2     2
905 5     5
-7 11     1
-888 -2   2
905 -5    5
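For what it's worth, a minimal sketch of one fix: since gcd(a, b) = gcd(|a|, |b|), normalizing the signs up front (by hand, since abs() is off-limits) lets a plain Euclidean loop return a positive result. This drops the extended-Euclid bookkeeping kept in the original, which is not needed for the GCD itself:

from sys import stdin

def gcd(a, b):
    # the GCD is unaffected by sign, so flip negatives manually (no abs())
    if a < 0:
        a = -a
    if b < 0:
        b = -b
    while b != 0:
        a, b = b, a % b
    return a

for line in stdin:
    parts = line.split()
    if len(parts) != 2:
        break
    print(gcd(int(parts[0]), int(parts[1])))

With this, -888 -2 prints 2 and 905 -5 prints 5, matching the table above.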

Calculation between python pandas DataFrames, and store the result to dictionary

I'm trying some mapping and calculation between DataFrames.
Are there any examples, or can anyone help show how to do this with Python code?
I have 2 dataframes:
products × components:
    c1  c2  c3  c4
p1   1   0   1   0
p2   0   1   1   0
p3   1   0   0   1
p4   0   1   0   1

components × items cost:
    i1  i2  i3  i4
c1   0  10  30   0
c2  20  10   0   0
c3   0   0  10  15
c4  20   0   0  30
The end result should be a dictionary that contains, for each product, the summed cost of each of its components, keeping the maximum:
{p1: [c1,c3] } -> {p1: [i2+i3,i3+i4] } -> {p1: [40,25] } -> {p1: 40 }
{p2: [c2,c3] } -> {p2: [i1+i2,i3+i4] } -> {p2: [30,25] } -> {p2: 30 }
{p3: [c1,c4] } -> {p3: [i2+i3,i1+i4] } -> {p3: [40,50] } -> {p3: 50 }
{p4: [c2,c4] } -> {p4: [i1+i2,i1+i4] } -> {p4: [30,50] } -> {p4: 50 }
Try (df1 is your first DataFrame, df2 the second):
print(df1.apply(lambda x: df2.loc[x[x.eq(1)].index].sum(axis=1).max(), axis=1))
Prints:
p1    40
p2    30
p3    50
p4    50
dtype: int64
To store the result in a dictionary:
out = dict(
    df1.apply(lambda x: df2.loc[x[x.eq(1)].index].sum(axis=1).max(), axis=1)
)
print(out)
Prints:
{'p1': 40, 'p2': 30, 'p3': 50, 'p4': 50}
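If apply feels too slow, an alternative sketch (assuming the same df1/df2 layout, and relying on the costs being non-negative) precomputes each component's total cost and broadcasts it over the product matrix:

# total cost per component: c1=40, c2=30, c3=25, c4=50
comp_cost = df2.sum(axis=1)
# multiplying zeroes out unused components; the row-wise max is then the answer
out = (df1 * comp_cost).max(axis=1).to_dict()
print(out)  # {'p1': 40, 'p2': 30, 'p3': 50, 'p4': 50}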
You can use itertools.compress() on the products data to find the required components:
prod_df["comps"] = prod_df.apply(lambda x: list(itertools.compress(x.index[1:], x.values[1:])), axis=1)
[Out]:
  product  c1  c2  c3  c4     comps
0      p1   1   0   1   0  [c1, c3]
1      p2   0   1   1   0  [c2, c3]
2      p3   1   0   0   1  [c1, c4]
3      p4   0   1   0   1  [c2, c4]
Then select the respective component rows from the components data, sum each row, and keep the maximum:
prod_df["max_cost"] = prod_df.apply(lambda x: comp_df[comp_df["component"].isin(x["comps"])].iloc[:,1:].sum(axis=1).max(), axis=1)
[Out]:
  product  max_cost
0      p1        40
1      p2        30
2      p3        50
3      p4        50
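Since the question asks for a dictionary, a possible final step (reusing the max_cost column built above) is:

out = dict(zip(prod_df["product"], prod_df["max_cost"]))
print(out)  # {'p1': 40, 'p2': 30, 'p3': 50, 'p4': 50}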
Datasets used:
import itertools
import pandas as pd

prod_data = [
    ("p1", 1, 0, 1, 0),
    ("p2", 0, 1, 1, 0),
    ("p3", 1, 0, 0, 1),
    ("p4", 0, 1, 0, 1),
]
prod_columns = ["product", "c1", "c2", "c3", "c4"]
prod_df = pd.DataFrame(data=prod_data, columns=prod_columns)

comp_data = [
    ("c1", 0, 10, 30, 0),
    ("c2", 20, 10, 0, 0),
    ("c3", 0, 0, 10, 15),
    ("c4", 20, 0, 0, 30),
]
comp_columns = ["component", "i1", "i2", "i3", "i4"]
comp_df = pd.DataFrame(data=comp_data, columns=comp_columns)

Sampling with fixed column ratio in pandas

I have this dataframe:
import pandas as pd

record = {
    'F1': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7'],
    'F2': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7'],
    'Sex': ['F', 'M', 'F', 'M', 'M', 'M', 'F'],
}

# Creating a dataframe
df = pd.DataFrame(record)
I would like to create, for example, 2 samples of this dataframe while keeping a fixed 50-50 ratio on the Sex column.
I tried this:
df_dict = {}
for i in range(2):
    df_dict['df{}'.format(i)] = df.sample(frac=0.50, random_state=123)
But the output I get does not seem to match my expectation:
df_dict["df0"]
# Output:
   F1  F2 Sex
1  x2  a2   M
3  x4  a4   M
4  x5  a5   M
0  x1  a1   F
Any help?
Might not be the best idea, but I believe it might help you to solve your problem somehow:
n = 2
fDf = df[df["Sex"] == "F"].sample(frac=0.5, random_state=123).iloc[:n]
mDf = df[df["Sex"] == "M"].sample(frac=0.5, random_state=123).iloc[:n]
pd.concat([fDf, mDf])  # DataFrame.append is removed in pandas 2.x
Output
   F1  F2 Sex
0  x1  a1   F
2  x3  a3   F
5  x6  a6   M
1  x2  a2   M
This should also work
n = 2
df.groupby('Sex', group_keys=False).apply(lambda x: x.sample(n))
Don't use frac, which will give you a fraction of each group; use n, which will give you a fixed number of rows per group:
df.groupby('Sex').sample(n=2)
example output:
   F1  F2 Sex
2  x3  a3   F
0  x1  a1   F
3  x4  a4   M
4  x5  a5   M
Using a custom ratio:

ratios = {'F': 0.4, 'M': 0.6}  # sum should be 1

# total number desired
total = 4

# note that the exact number in the output depends on the rounding
# method used to convert to int: round should give the correct total,
# but floor/ceil might under/over-sample (see below for an example)
s = pd.Series(ratios) * total

# convert to integer (choose your method: ceil/floor/round...)
s = np.ceil(s).astype(int)

df.groupby('Sex').apply(lambda x: x.sample(n=s[x.name])).droplevel(0)
example output:
   F1  F2 Sex
0  x1  a1   F
6  x7  a7   F
4  x5  a5   M
3  x4  a4   M
1  x2  a2   M
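To illustrate the rounding caveat above, here is a small check of how each method changes the total sample size for these ratios:

import numpy as np
import pandas as pd

s = pd.Series({'F': 0.4, 'M': 0.6}) * 4   # raw counts: F=1.6, M=2.4
print(np.ceil(s).astype(int).sum())       # 5 -> over-samples
print(np.floor(s).astype(int).sum())      # 3 -> under-samples
print(s.round().astype(int).sum())        # 4 -> matches the desired total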

how to select partial data from a pandas dataframe

I have a dataframe, where flag = np.sign(value):

ticker  value  flag  cumsum(flag)
A1       -1.5    -1            -1
B1        1.4     1             0
C3       -2.4    -1            -1
D4       -1.8    -1            -2
E6       -1.6    -1            -3
I have a variable, cutoff = 1 (it is always positive; it's an absolute value).
How can I best select the tickers where abs(cumsum(flag)) <= 1?
The expected output is [A1, B1, C3],
i.e. I want to keep going down the cumsum list until I find the LAST 1 or -1.
I tried a loop:
ticker_list_keep = []
for y in range(0, len(df['cumsum']), 1):
    if abs(df['cumsum'][y]) < abs(cutoff) + 1:
        ticker_list_keep.append(df.index[y])
but this would give me only A1 and C3, and would miss out B1.
Thanks
Per a note in the comments:
@Vaishali - The question is not a duplicate. I want ALL the values in the ticker list, up until we get to the final -1 in the cumsum list. Above, the final abs(value) = 1 occurs at C3, so my list is A1, B1, C3. The solution in the thread you pointed me to gives only A1 and C3. Notice that A1 is not the final -1 in the cumsum list, so A1 alone doesn't suffice; C3 is where the final +/-1 occurs, therefore the required list is A1, B1, C3.
Thanks!!
You can find the last valid index based on your condition and create a slice:
idx = df[df['cumsum(flag)'].abs() <= 1].last_valid_index()
df.loc[:idx, :]
  ticker  value  flag  cumsum(flag)
0     A1   -1.5    -1            -1
1     B1    1.4     1             0
2     C3   -2.4    -1            -1
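To use the question's cutoff variable instead of a hard-coded 1, the same pattern generalizes (a small sketch):

cutoff = 1  # always positive, per the question
idx = df[df['cumsum(flag)'].abs() <= cutoff].last_valid_index()
print(df.loc[:idx, 'ticker'].tolist())  # ['A1', 'B1', 'C3']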

conditional sums for pandas aggregate

I just recently made the switch from R to Python and have been having some trouble getting used to data frames again, as opposed to R's data.table. The problem I've been having is that I'd like to take a list of strings, check for a value, and then sum the count of that string, broken down by user. So I would like to take this data:
   A_id  B        C
1:   a1  "up"     100
2:   a2  "down"   102
3:   a3  "up"     100
3:   a3  "up"     250
4:   a4  "left"   100
5:   a5  "right"  102
And return:
   A_id_grouped  sum_up  sum_down  ...  over_200_up
1:           a1       1         0  ...            0
2:           a2       0         1                 0
3:           a3       2         0  ...            1
4:           a4       0         0                 0
5:           a5       0         0  ...            0
Before, I did it with R code (using data.table):

> DT[, list(A_id_grouped, sum_up = sum(B == "up"),
+           sum_down = sum(B == "down"),
+           ...,
+           over_200_up = sum(B == "up" & C > 200)), by = list(A_id)]
However, all of my recent attempts in Python have failed me:

DT.agg({"D": [np.sum(DT[DT["B"]=="up"]), np.sum(DT[DT["B"]=="up"])], ...
        "C": np.sum(DT[(DT["B"]=="up") & (DT["C"]>200)])
       })
Thank you in advance! It seems like a simple question, but I couldn't find it anywhere.
To complement unutbu's answer, here's an approach using apply on the groupby object.
>>> df.groupby('A_id').apply(lambda x: pd.Series(dict(
...     sum_up=(x.B == 'up').sum(),
...     sum_down=(x.B == 'down').sum(),
...     over_200_up=((x.B == 'up') & (x.C > 200)).sum()
... )))
      over_200_up  sum_down  sum_up
A_id
a1              0         0       1
a2              0         1       0
a3              1         0       2
a4              0         0       0
a5              0         0       0
There might be a better way; I'm pretty new to pandas, but this works:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A_id': 'a1 a2 a3 a3 a4 a5'.split(),
                   'B': 'up down up up left right'.split(),
                   'C': [100, 102, 100, 250, 100, 102]})
df['D'] = (df['B'] == 'up') & (df['C'] > 200)
grouped = df.groupby(['A_id'])

def sum_up(grp):
    return np.sum(grp == 'up')

def sum_down(grp):
    return np.sum(grp == 'down')

def over_200_up(grp):
    return np.sum(grp)

result = grouped.agg({'B': [sum_up, sum_down],
                      'D': [over_200_up]})
result.columns = [col[1] for col in result.columns]
print(result)
yields
      sum_up  sum_down  over_200_up
A_id
a1         1         0            0
a2         0         1            0
a3         2         0            1
a4         0         0            0
a5         0         0            0
An old question, but I feel a better way, avoiding apply, would be to create a new dataframe before grouping and aggregating:
df = df.set_index('A_id')
outcome = {'sum_up': df.B.eq('up'),
           'sum_down': df.B.eq('down'),
           'over_200_up': df.B.eq('up') & df.C.gt(200)}
outcome = pd.DataFrame(outcome).groupby(level=0).sum()
outcome
      sum_up  sum_down  over_200_up
A_id
a1         1         0            0
a2         0         1            0
a3         2         0            1
a4         0         0            0
a5         0         0            0
Another option would be to unstack before grouping; however, I feel it is a longer, unnecessary process:
(df
 .set_index(['A_id', 'B'], append=True)
 .C
 .unstack('B')
 .assign(gt_200=lambda df: df.up.gt(200))
 .groupby(level='A_id')
 .agg(sum_up=('up', 'count'),
      sum_down=('down', 'count'),
      over_200_up=('gt_200', 'sum'))
)
      sum_up  sum_down  over_200_up
A_id
a1         1         0            0
a2         0         1            0
a3         2         0            1
a4         0         0            0
a5         0         0            0
Here is what I recently learned, using DataFrame.assign and NumPy's where:

df3:
   A_id  B        C
1:   a1  "up"     100
2:   a2  "down"   102
3:   a3  "up"     100
3:   a3  "up"     250
4:   a4  "left"   100
5:   a5  "right"  102
df3.assign(sum_up=np.where(df3['B'] == 'up', 1, 0),
           sum_down=np.where(df3['B'] == 'down', 1, 0),
           over_200_up=np.where((df3['B'] == 'up') & (df3['C'] > 200), 1, 0)
          ).groupby('A_id', as_index=False).agg({'sum_up': 'sum', 'sum_down': 'sum', 'over_200_up': 'sum'})
Outcome:

  A_id  sum_up  sum_down  over_200_up
0   a1       1         0            0
1   a2       0         1            0
2   a3       2         0            1
3   a4       0         0            0
4   a5       0         0            0
This also resembles SQL's CASE expression, in case you are familiar with it and want to apply the same logic in pandas:
select a,
       sum(case when B = 'up' then 1 else 0 end) as sum_up,
       ...
from table
group by a
