I want to generate some weights using groupby on data that originally looks like this:
V1 V2 MONTH CHOICES PRIORITY
X T1 M1 C1 1
X T1 M1 C2 0
X T1 M1 C3 0
X T2 M1 C1 1
X T2 M1 C5 0
X T2 M1 C6 0
X T2 M1 C2 1
X T1 M2 C1 1
X T1 M2 C2 0
X T1 M2 C3 0
X T2 M2 C1 0
X T2 M2 C5 1
X T2 M2 C6 0
X T2 M2 C2 1
Basically, when the MONTH is different from M1, I want flagged choices to have weights equal to double those of any non-flagged choice.
Example: if you have (C1, C2, C3) and C1 is the only one flagged, the weights would be 0.5 / 0.25 / 0.25.
At the same time, for the first month, I want the weights to be focused solely on flagged choices. The previous example would become (1 / 0 / 0).
A note about the data:
For a given tuple (V1, V2, MONTH), we can have at most two choices flagged as priorities (no priority at all is also a possibility).
Here's what I've tried:

def weights_preferences(data):
    if (data.MONTH.values != 'M1').all():
        # base weight, counting flagged choices twice
        data['WEIGHTS'] = 1 / (len(data) + data[data.PRIORITY == 1].shape[0])
        data['WEIGHTS'] = data.apply(lambda x: 2 * x.WEIGHTS if x.PRIORITY == 1 else x.WEIGHTS, axis=1)
    elif (data.MONTH.values == 'M1').all() and data[data.PRIORITY == 1].shape[0] == 0:
        data['WEIGHTS'] = 1 / len(data)
    else:
        if data[data.PRIORITY == 1].shape[0] == 1:
            data['WEIGHTS'] = [1 if x[1].PRIORITY == 1 else 0 for x in data.iterrows()]
        else:
            data['WEIGHTS'] = [0.5 if x[1].PRIORITY == 1 else 0 for x in data.iterrows()]
    return data
tmp = tmp.groupby(['V1','V2','MONTH']).apply(weights_preferences)
The problem is that since I group by 'MONTH', that column no longer appears in the data that weights_preferences receives.
P.S.: The output would look like this:
V1 V2 MONTH CHOICES PRIORITY WEIGHTS
X T1 M1 C1 1 1
X T1 M1 C2 0 0
X T1 M1 C3 0 0
X T2 M1 C1 1 0.5
X T2 M1 C5 0 0
X T2 M1 C6 0 0
X T2 M1 C2 1 0.5
X T1 M2 C1 1 0.5
X T1 M2 C2 0 0.25
X T1 M2 C3 0 0.25
X T2 M2 C1 0 0.16
X T2 M2 C5 1 0.33
X T2 M2 C6 0 0.16
X T2 M2 C2 1 0.33
Any suggestions are very welcome!
Thanks.
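One workaround worth noting (a sketch, untested against the real data): inside groupby(...).apply, each group still carries its key as the .name attribute, even when the grouping columns are missing from the frame itself. Something along these lines should reproduce the expected weights:

def weights_preferences(data):
    # data.name holds this group's ('V1', 'V2', 'MONTH') key
    month = data.name[2]
    n_flagged = (data.PRIORITY == 1).sum()
    if month != 'M1':
        # flagged choices weigh twice as much as non-flagged ones
        base = 1 / (len(data) + n_flagged)
        data['WEIGHTS'] = [2 * base if p == 1 else base for p in data.PRIORITY]
    elif n_flagged == 0:
        data['WEIGHTS'] = 1 / len(data)
    else:
        # in M1, the (at most two) flagged choices share all the weight
        data['WEIGHTS'] = [1 / n_flagged if p == 1 else 0 for p in data.PRIORITY]
    return data

tmp = tmp.groupby(['V1', 'V2', 'MONTH']).apply(weights_preferences)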
I am making a GCD tool in Python, but for some reason every time I use two negative values I get a negative value back. Because this is for school I can't use things like abs() or math, etc. Can someone help me with that?
from sys import stdin

a = 0
b = 0
a0 = 0
b0 = 0
a1 = 0
b1 = 0
n = 0
na = 0
nb = 0
q = 0
for line in stdin:
    parts = line.lstrip().rstrip().split()  # avoid shadowing the built-in input()
    if line == '' or len(parts) != 2:
        break
    a, b = [int(x) for x in parts]
    if a > b:
        a, b = b, a
    #
    #     a  |  b
    #   +---------+
    # a |  1 |  0 |  "( a0, b0 )"
    # b |  0 |  1 |  "( a1, b1 )"
    # n | na | nb |  q
    #   |    |    |
    #
    a0 = 1
    b0 = 0
    a1 = 0
    b1 = 1
    n = a % b
    q = a // b  # integer division (a / b would give a float in Python 3)
    na = a0 - q * a1
    nb = b0 - q * b1
    a = b
    a0 = a1
    b0 = b1
    b = n
    a1 = na
    b1 = nb
    while n != 0:
        n = a % b
        q = a // b
        na = a0 + q * a1
        nb = b0 + q * b1
        a = b
        a0 = a1
        b0 = b1
        b = n
        a1 = na
        b1 = nb
    print(a)
I tried messing around with operators. I expect -888 -2 to give 2, not -2. (I need to fix the code, not edit the results.)
Edit 1: Here are some examples of what I need:
Input Output
7 11 1
888 2 2
905 5 5
-7 11 1
-888 -2 2
905 -5 5
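For what it's worth, the sign issue traces back to Python's division semantics: the result of % takes the sign of the divisor, so with negative inputs the loop can terminate with a negative value left in a. A minimal sketch of one way to normalize the inputs without abs() (the rest of the code is assumed unchanged):

    a, b = [int(x) for x in parts]
    # strip the signs by hand, since abs() is off-limits
    if a < 0:
        a = -a
    if b < 0:
        b = -b
    if a > b:
        a, b = b, a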
I'm trying some mapping and calculation between dataframes.
Are there any examples, or can anyone help with how to use Python code to do this?
I have 2 dataframes:
products components
c1 c2 c3 c4
p1 1 0 1 0
p2 0 1 1 0
p3 1 0 0 1
p4 0 1 0 1
items cost
components i1 i2 i3 i4
c1 0 10 30 0
c2 20 10 0 0
c3 0 0 10 15
c4 20 0 0 30
The end result should be a dictionary containing, for each product, the summed cost of each of its components, keeping the maximum:
{p1: [c1,c3] } -> {p1: [i2+i3,i3+i4] } -> {p1: [40,25] } -> {p1: 40 }
{p2: [c2,c3] } -> {p2: [i1+i2,i3+i4] } -> {p2: [30,25] } -> {p2: 30 }
{p3: [c1,c4] } -> {p3: [i2+i3,i1+i4] } -> {p3: [40,50] } -> {p3: 50 }
{p4: [c2,c4] } -> {p4: [i1+i2,i1+i4] } -> {p4: [30,50] } -> {p4: 50 }
Try (df1 is your first DataFrame, df2 the second):
print(df1.apply(lambda x: df2.loc[x[x.eq(1)].index].sum(axis=1).max(), axis=1))
Prints:
p1 40
p2 30
p3 50
p4 50
dtype: int64
To store the result to dictionary:
out = dict(
    df1.apply(lambda x: df2.loc[x[x.eq(1)].index].sum(axis=1).max(), axis=1)
)
print(out)
Prints:
{'p1': 40, 'p2': 30, 'p3': 50, 'p4': 50}
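If the 0/1 layout is guaranteed, a vectorized alternative (a sketch assuming the same df1/df2 as above) avoids the row-wise apply entirely:

comp_cost = df2.sum(axis=1)  # total cost per component: c1=40, c2=30, c3=25, c4=50
out = dict(df1.mul(comp_cost, axis=1).max(axis=1))
print(out)  # {'p1': 40, 'p2': 30, 'p3': 50, 'p4': 50}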
You can use itertools.compress() on the products data to find the components each product requires:

import itertools

prod_df["comps"] = prod_df.apply(lambda x: list(itertools.compress(x.index[1:], x.values[1:])), axis=1)
[Out]:
  product  c1  c2  c3  c4     comps
0      p1   1   0   1   0  [c1, c3]
1      p2   0   1   1   0  [c2, c3]
2      p3   1   0   0   1  [c1, c4]
3      p4   0   1   0   1  [c2, c4]
Then select the respective component rows from the components data, sum each row, and take the maximum:
prod_df["max_cost"] = prod_df.apply(lambda x: comp_df[comp_df["component"].isin(x["comps"])].iloc[:,1:].sum(axis=1).max(), axis=1)
[Out]:
product max_cost
0 p1 40
1 p2 30
2 p3 50
3 p4 50
Datasets used:

import pandas as pd

prod_data = [
    ("p1", 1, 0, 1, 0),
    ("p2", 0, 1, 1, 0),
    ("p3", 1, 0, 0, 1),
    ("p4", 0, 1, 0, 1),
]
prod_columns = ["product", "c1", "c2", "c3", "c4"]
prod_df = pd.DataFrame(data=prod_data, columns=prod_columns)

comp_data = [
    ("c1", 0, 10, 30, 0),
    ("c2", 20, 10, 0, 0),
    ("c3", 0, 0, 10, 15),
    ("c4", 20, 0, 0, 30),
]
comp_columns = ["component", "i1", "i2", "i3", "i4"]
comp_df = pd.DataFrame(data=comp_data, columns=comp_columns)
I have this dataframe:

import pandas as pd

record = {
    'F1': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7'],
    'F2': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7'],
    'Sex': ['F', 'M', 'F', 'M', 'M', 'M', 'F']}

# Creating a dataframe
df = pd.DataFrame(record)
I would like to create, for example, 2 samples of this dataframe while keeping a fixed 50-50 ratio on the Sex column.
I tried this:

df_dict = {}
for i in range(2):
    df_dict['df{}'.format(i)] = df.sample(frac=0.50, random_state=123)
But the output I get does not seem to match my expectation:
df_dict["df0"]
# Output:
F1 F2 Sex
1 x2 a2 M
3 x4 a4 M
4 x5 a5 M
0 x1 a1 F
Any help?
It might not be the best idea, but I believe it can help you solve your problem:

n = 2
fDf = df[df["Sex"] == "F"].sample(frac=0.5, random_state=123).iloc[:n]
mDf = df[df["Sex"] == "M"].sample(frac=0.5, random_state=123).iloc[:n]
pd.concat([fDf, mDf])  # DataFrame.append is deprecated, so concat is used here
Output
F1 F2 Sex
0 x1 a1 F
2 x3 a3 F
5 x6 a6 M
1 x2 a2 M
This should also work
n = 2
df.groupby('Sex', group_keys=False).apply(lambda x: x.sample(n))
Don't use frac, which will give you a fraction of each group; use n, which will give you a fixed number per group:
df.groupby('Sex').sample(n=2)
example output:
F1 F2 Sex
2 x3 a3 F
0 x1 a1 F
3 x4 a4 M
4 x5 a5 M
Using a custom ratio:

import numpy as np
import pandas as pd

ratios = {'F': 0.4, 'M': 0.6}  # the ratios should sum to 1
# total number of rows desired
total = 4
# note that the exact number in the output depends on the rounding
# method used to convert to int: round should give the correct number,
# but floor/ceil might under/over-sample (see below for an example)
s = pd.Series(ratios) * total
# convert to integer (choose your method: ceil/floor/round...)
s = np.ceil(s).astype(int)
df.groupby('Sex').apply(lambda x: x.sample(n=s[x.name])).droplevel(0)
example output:
F1 F2 Sex
0 x1 a1 F
6 x7 a7 F
4 x5 a5 M
3 x4 a4 M
1 x2 a2 M
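To produce the two samples the original question asks for, the same idea can go in a loop (a sketch; the random_state changes per iteration so the two samples differ):

df_dict = {}
for i in range(2):
    df_dict['df{}'.format(i)] = df.groupby('Sex').sample(n=2, random_state=123 + i)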
I have a dataframe, where flag = np.sign(value):
ticker value flag cumsum(flag)
A1 -1.5 -1 -1
B1 1.4 1 0
C3 -2.4 -1 -1
D4 -1.8 -1 -2
E6 -1.6 -1 -3
I have a variable, cutoff = 1 (it is always positive; it is an absolute value).
How can I best select the tickers where abs(cumsum(flag)) <= 1?
i.e., expected output is [A1, B1, C3]
i.e., I want to keep going down the cumsum list until I find the LAST 1 or -1.
I tried a loop:

ticker_list_keep = []
for y in range(0, len(df['cumsum']), 1):
    if abs(df['cumsum'][y]) < abs(cutoff) + 1:
        ticker_list_keep.append(df.index[y])

but this would give me only A1 and C3, and would miss out B1.
Thanks
Per a note in the comments:
@Vaishali - The question is not a duplicate. I want ALL the values in the ticker list, up until we get to the final -1 in the cumsum list.
Above, we reach the final abs(val) = 1 at C3, so my list is A1, B1, C3.
The solution in the thread you pointed me to gives only A1 and C3.
Notice that A1 is not the final -1 in the cumsum list, so A1 alone doesn't suffice. C3 is where the final +/-1 occurs, therefore the required list is A1, B1, C3.
Thanks!!
You can find the last valid index satisfying your condition and create a slice.
idx = df[df['cumsum(flag)'].abs() <= 1].last_valid_index()
df.loc[:idx, :]
ticker value flag cumsum(flag)
0 A1 -1.5 -1 -1
1 B1 1.4 1 0
2 C3 -2.4 -1 -1
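An equivalent mask-based variant (a sketch, assuming the same cumsum(flag) column name) keeps every row up to and including the last row where the condition holds:

mask = df['cumsum(flag)'].abs().le(1)
# reversed cummax marks everything up to the last True
df[mask[::-1].cummax()[::-1]]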
I just recently made the switch from R to Python and have been having some trouble getting used to data frames again, as opposed to using R's data.table. The problem I've been having is that I'd like to take a list of strings, check for a value, then sum the count of that string, broken down by user. So I would like to take this data:
A_id B C
1: a1 "up" 100
2: a2 "down" 102
3: a3 "up" 100
3: a3 "up" 250
4: a4 "left" 100
5: a5 "right" 102
And return:
A_id_grouped sum_up sum_down ... over_200_up
1: a1 1 0 ... 0
2: a2 0 1 0
3: a3 2 0 ... 1
4: a4 0 0 0
5: a5 0 0 ... 0
Previously I did it with R (using data.table):

DT[, list(A_id_grouped,
          sum_up = sum(B == "up"),
          sum_down = sum(B == "down"),
          ...,
          over_200_up = sum(B == "up" & C > 200)), by = list(A_id)]
However all of my recent attempts with Python have failed me:
DT.agg({"D": [np.sum(DT[DT["B"]=="up"]),np.sum(DT[DT["B"]=="up"])], ...
"C": np.sum(DT[(DT["B"]=="up") & (DT["C"]>200)])
})
Thank you in advance! It seems like a simple question; however, I couldn't find it anywhere.
To complement unutbu's answer, here's an approach using apply on the groupby object.
>>> df.groupby('A_id').apply(lambda x: pd.Series(dict(
...     sum_up=(x.B == 'up').sum(),
...     sum_down=(x.B == 'down').sum(),
...     over_200_up=((x.B == 'up') & (x.C > 200)).sum()
... )))
over_200_up sum_down sum_up
A_id
a1 0 0 1
a2 0 1 0
a3 1 0 2
a4 0 0 0
a5 0 0 0
There might be a better way; I'm pretty new to pandas, but this works:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A_id': 'a1 a2 a3 a3 a4 a5'.split(),
                   'B': 'up down up up left right'.split(),
                   'C': [100, 102, 100, 250, 100, 102]})
df['D'] = (df['B'] == 'up') & (df['C'] > 200)
grouped = df.groupby(['A_id'])

def sum_up(grp):
    return np.sum(grp == 'up')

def sum_down(grp):
    return np.sum(grp == 'down')

def over_200_up(grp):
    return np.sum(grp)

result = grouped.agg({'B': [sum_up, sum_down],
                      'D': [over_200_up]})
result.columns = [col[1] for col in result.columns]
print(result)
yields
sum_up sum_down over_200_up
A_id
a1 1 0 0
a2 0 1 0
a3 2 0 1
a4 0 0 0
a5 0 0 0
An old question; I feel a better way, avoiding apply, is to create a new dataframe before grouping and aggregating:

df = df.set_index('A_id')
outcome = {'sum_up': df.B.eq('up'),
           'sum_down': df.B.eq('down'),
           'over_200_up': df.B.eq('up') & df.C.gt(200)}
outcome = pd.DataFrame(outcome).groupby(level=0).sum()
outcome
sum_up sum_down over_200_up
A_id
a1 1 0 0
a2 0 1 0
a3 2 0 1
a4 0 0 0
a5 0 0 0
Another option would be to unstack before grouping; however, I feel it is a longer, unnecessary process:
(df
.set_index(['A_id', 'B'], append = True)
.C
.unstack('B')
.assign(gt_200 = lambda df: df.up.gt(200))
.groupby(level='A_id')
.agg(sum_up=('up', 'count'),
sum_down =('down', 'count'),
over_200_up = ('gt_200', 'sum')
)
)
sum_up sum_down over_200_up
A_id
a1 1 0 0
a2 0 1 0
a3 2 0 1
a4 0 0 0
a5 0 0 0
Here is what I have recently learned, using DataFrame.assign and numpy's where method:
df3=
A_id B C
1: a1 "up" 100
2: a2 "down" 102
3: a3 "up" 100
3: a3 "up" 250
4: a4 "left" 100
5: a5 "right" 102
df3.assign(sum_up=np.where(df3['B'] == 'up', 1, 0),
           sum_down=np.where(df3['B'] == 'down', 1, 0),
           over_200_up=np.where((df3['B'] == 'up') & (df3['C'] > 200), 1, 0)
           ).groupby('A_id', as_index=False).agg({'sum_up': sum, 'sum_down': sum, 'over_200_up': sum})
outcome=
A_id sum_up sum_down over_200_up
0 a1 1 0 0
1 a2 0 1 0
2 a3 2 0 1
3 a4 0 0 0
4 a5 0 0 0
This also resembles SQL, if you are familiar with CASE expressions and want to apply the same logic in pandas:

select a,
       sum(case when B = 'up' then 1 else 0 end) as sum_up,
       ....
from table
group by a