conditional sums for pandas aggregate - python

I recently made the switch from R to Python and have been having some trouble getting used to data frames again, as opposed to R's data.table. I'd like to take a column of strings, check each for a value, and then count the matches, broken down by user. So I would like to take this data:
A_id B C
1: a1 "up" 100
2: a2 "down" 102
3: a3 "up" 100
3: a3 "up" 250
4: a4 "left" 100
5: a5 "right" 102
And return:
A_id_grouped sum_up sum_down ... over_200_up
1: a1 1 0 ... 0
2: a2 0 1 0
3: a3 2 0 ... 1
4: a4 0 0 0
5: a5 0 0 ... 0
Previously I did this in R (using data.table):
> DT[, list(sum_up = sum(B == "up"),
+           sum_down = sum(B == "down"),
+           ...,
+           over_200_up = sum(B == "up" & C > 200)), by = list(A_id)]
However all of my recent attempts with Python have failed me:
DT.agg({"D": [np.sum(DT[DT["B"]=="up"]),np.sum(DT[DT["B"]=="up"])], ...
"C": np.sum(DT[(DT["B"]=="up") & (DT["C"]>200)])
})
Thank you in advance! It seems like a simple question, but I couldn't find an answer anywhere.

To complement unutbu's answer, here's an approach using apply on the groupby object.
>>> df.groupby('A_id').apply(lambda x: pd.Series(dict(
sum_up=(x.B == 'up').sum(),
sum_down=(x.B == 'down').sum(),
over_200_up=((x.B == 'up') & (x.C > 200)).sum()
)))
over_200_up sum_down sum_up
A_id
a1 0 0 1
a2 0 1 0
a3 1 0 2
a4 0 0 0
a5 0 0 0

There might be a better way; I'm pretty new to pandas, but this works:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A_id':'a1 a2 a3 a3 a4 a5'.split(),
'B': 'up down up up left right'.split(),
'C': [100, 102, 100, 250, 100, 102]})
df['D'] = (df['B']=='up') & (df['C'] > 200)
grouped = df.groupby(['A_id'])
def sum_up(grp):
    return np.sum(grp == 'up')
def sum_down(grp):
    return np.sum(grp == 'down')
def over_200_up(grp):
    return np.sum(grp)
result = grouped.agg({'B': [sum_up, sum_down],
                      'D': [over_200_up]})
result.columns = [col[1] for col in result.columns]
print(result)
yields
sum_up sum_down over_200_up
A_id
a1 1 0 0
a2 0 1 0
a3 2 0 1
a4 0 0 0
a5 0 0 0

An old question, but I feel a better way, which avoids apply, is to create a new dataframe before grouping and aggregating:
df = df.set_index('A_id')
outcome = {'sum_up' : df.B.eq('up'),
'sum_down': df.B.eq('down'),
'over_200_up' : df.B.eq('up') & df.C.gt(200)}
outcome = pd.DataFrame(outcome).groupby(level=0).sum()
outcome
sum_up sum_down over_200_up
A_id
a1 1 0 0
a2 0 1 0
a3 2 0 1
a4 0 0 0
a5 0 0 0
Another option would be to unstack before grouping; however, I feel it is a longer, unnecessary process:
(df
.set_index(['A_id', 'B'], append = True)
.C
.unstack('B')
.assign(gt_200 = lambda df: df.up.gt(200))
.groupby(level='A_id')
.agg(sum_up=('up', 'count'),
sum_down =('down', 'count'),
over_200_up = ('gt_200', 'sum')
)
)
sum_up sum_down over_200_up
A_id
a1 1 0 0
a2 0 1 0
a3 2 0 1
a4 0 0 0
a5 0 0 0

Here is an approach I recently learned, using DataFrame.assign and NumPy's where function:
df3=
A_id B C
1: a1 "up" 100
2: a2 "down" 102
3: a3 "up" 100
3: a3 "up" 250
4: a4 "left" 100
5: a5 "right" 102
df3.assign(sum_up=np.where(df3['B'] == 'up', 1, 0),
           sum_down=np.where(df3['B'] == 'down', 1, 0),
           over_200_up=np.where((df3['B'] == 'up') & (df3['C'] > 200), 1, 0)
          ).groupby('A_id', as_index=False).agg({'sum_up': 'sum', 'sum_down': 'sum', 'over_200_up': 'sum'})
outcome=
A_id sum_up sum_down over_200_up
0 a1 1 0 0
1 a2 0 1 0
2 a3 2 0 1
3 a4 0 0 0
4 a5 0 0 0
This resembles SQL's CASE expression, if you are familiar with it and want to apply the same logic in pandas:
select a,
sum(case when B='up' then 1 else 0 end) as sum_up
....
from table
group by a
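For the per-value counts alone (sum_up, sum_down, and so on), a cross-tabulation avoids listing each value of B by hand. This is a minimal sketch that is not taken from any of the answers above; it assumes the question's sample data, and the conditional over_200_up column is still computed separately:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A_id": "a1 a2 a3 a3 a4 a5".split(),
    "B": "up down up up left right".split(),
    "C": [100, 102, 100, 250, 100, 102],
})

# One count column per distinct value of B, one row per A_id.
counts = pd.crosstab(df["A_id"], df["B"]).add_prefix("sum_")

# Conditional count: rows where B is "up" and C exceeds 200, summed per A_id.
over_200_up = ((df["B"] == "up") & (df["C"] > 200)).groupby(df["A_id"]).sum()

# assign aligns the Series on the A_id index.
result = counts.assign(over_200_up=over_200_up)
print(result)
```

The crosstab produces a column for every value that occurs in B (including sum_left and sum_right here), which may or may not be what you want.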

Related

Negative values with GCD python program

I am writing a GCD tool in Python, but for some reason every time I use two negative values I get a negative value back. Because this is for school, I can't use things like abs() or math, etc. Can someone help me with that?
from sys import stdin
a = 0
b = 0
a0 = 0
b0 = 0
a1 = 0
b1 = 0
n = 0
na = 0
nb = 0
q = 0
for line in stdin:
    input = line.lstrip().rstrip().split()
    if line == '' or len(input) != 2:
        break
    a, b = [int(x) for x in line.lstrip().rstrip().split()]
    if a > b:
        a, b = b, a
    #
    #     a | b
    #   +---------+
    # a | 1 | 0 | "( a0, b0 )"
    # b | 0 | 1 | "( a1, b1 )"
    # n | na | nb | q
    #   |    |    |
    #
    #
    a0 = 1
    b0 = 0
    a1 = 0
    b1 = 1
    n = a % b
    q = a / b
    na = a0 - q * a1
    nb = b0 - q * b1
    a = b
    a0 = a1
    b0 = b1
    b = n
    a1 = na
    b1 = nb
    while n != 0:
        n = a % b
        q = a // b
        na = a0 + q * a1
        nb = b0 + q * b1
        a = b
        a0 = a1
        b0 = b1
        b = n
        a1 = na
        b1 = nb
    print(a)
I tried messing around with operators. I expect the input -888 -2 to give 2, not -2 (I need to fix the code, not edit the results).
Edit 1: Here are some examples of what I need:
Input Output
7 11 1
888 2 2
905 5 5
-7 11 1
-888 -2 2
905 -5 5
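In Python the result of % takes the sign of the divisor, so with two negative inputs the Euclidean loop can end on a negative value. One possible fix, sketched here and not taken from the original post, is to flip the sign of negative inputs with a plain comparison before the loop, which avoids abs() and the math module. The function name gcd_nonneg is hypothetical, and the sketch computes only the GCD, not the extra coefficients tracked in the question's code:

```python
def gcd_nonneg(a, b):
    """Euclidean algorithm that returns a non-negative GCD without abs()."""
    # Normalize signs by hand, since abs() is not allowed.
    if a < 0:
        a = -a
    if b < 0:
        b = -b
    # Standard Euclidean loop: replace (a, b) with (b, a mod b) until b is 0.
    while b != 0:
        a, b = b, a % b
    return a


# The example inputs from the question.
for x, y in [(7, 11), (888, 2), (905, 5), (-7, 11), (-888, -2), (905, -5)]:
    print(x, y, gcd_nonneg(x, y))
```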

Calculation between python pandas DataFrames, and store the result to dictionary

I'm trying to do some mapping and calculation between dataframes.
Are there any examples, or can anyone help with how to do this in Python?
I have 2 dataframes:
products components
c1 c2 c3 c4
p1 1 0 1 0
p2 0 1 1 0
p3 1 0 0 1
p4 0 1 0 1
items cost
components i1 i2 i3 i4
c1 0 10 30 0
c2 20 10 0 0
c3 0 0 10 15
c4 20 0 0 30
The end result should be a dictionary that contains, for each product, the maximum of the summed item costs of its components:
{p1: [c1,c3] } -> {p1: [i2+i3,i3+i4] } -> {p1: [40,25] } -> {p1: 40 }
{p2: [c2,c3] } -> {p2: [i1+i2,i3+i4] } -> {p2: [30,25] } -> {p2: 30 }
{p3: [c1,c4] } -> {p3: [i2+i3,i1+i4] } -> {p3: [40,50] } -> {p3: 50 }
{p4: [c2,c4] } -> {p4: [i1+i2,i1+i4] } -> {p4: [30,50] } -> {p4: 50 }
Try (df1 is your first DataFrame, df2 the second):
print(df1.apply(lambda x: df2.loc[x[x.eq(1)].index].sum(axis=1).max(), axis=1))
Prints:
p1 40
p2 30
p3 50
p4 50
dtype: int64
To store the result to dictionary:
out = dict(
df1.apply(lambda x: df2.loc[x[x.eq(1)].index].sum(axis=1).max(), axis=1)
)
print(out)
Prints:
{'p1': 40, 'p2': 30, 'p3': 50, 'p4': 50}
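The same result can also be reached without a row-wise apply, by summing each component's costs once and masking them with the product/component table. This is a sketch under stated assumptions rather than part of the answer above: the frames df1 and df2 are rebuilt here from the question's tables, and the names component_cost and masked are illustrative:

```python
import pandas as pd

# Products x components (1 means the product uses the component).
df1 = pd.DataFrame(
    [[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1]],
    index=["p1", "p2", "p3", "p4"],
    columns=["c1", "c2", "c3", "c4"],
)
# Components x items (cost of each item per component).
df2 = pd.DataFrame(
    [[0, 10, 30, 0], [20, 10, 0, 0], [0, 0, 10, 15], [20, 0, 0, 30]],
    index=["c1", "c2", "c3", "c4"],
    columns=["i1", "i2", "i3", "i4"],
)

# Total cost of each component, indexed by component name.
component_cost = df2.sum(axis=1)

# Keep a component's cost only where the product uses it; other cells become NaN,
# which max() skips.
masked = df1.where(df1.eq(1)).mul(component_cost, axis=1)
out = masked.max(axis=1).astype(int).to_dict()
print(out)
```

Using NaN for unused components, rather than multiplying by 0, keeps the maximum correct even if some component totals were negative.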
You can use itertools.compress() on products data to find components required:
prod_df["comps"] = prod_df.apply(lambda x: list(itertools.compress(x.index[1:], x.values[1:])), axis=1)
[Out]:
product c1 c2 c3 c4 comps
0 p1 1 0 1 0 [c1, c3]
1 p2 0 1 1 0 [c2, c3]
2 p3 1 0 0 1 [c1, c4]
3 p4 0 1 0 1 [c2, c4]
Then select respective component rows from components data and sum each row and filter max row:
prod_df["max_cost"] = prod_df.apply(lambda x: comp_df[comp_df["component"].isin(x["comps"])].iloc[:,1:].sum(axis=1).max(), axis=1)
[Out]:
product max_cost
0 p1 40
1 p2 30
2 p3 50
3 p4 50
Datasets used:
prod_data = [
("p1",1,0,1,0),
("p2",0,1,1,0),
("p3",1,0,0,1),
("p4",0,1,0,1),
]
prod_columns = ["product","c1","c2","c3","c4"]
prod_df = pd.DataFrame(data=prod_data, columns=prod_columns)
comp_data = [
("c1",0,10,30,0),
("c2",20,10,0,0),
("c3",0,0,10,15),
("c4",20,0,0,30),
]
comp_columns = ["component","i1","i2","i3","i4"]
comp_df = pd.DataFrame(data=comp_data, columns=comp_columns)

Summing by string names Pandas

I'm working with a data frame like this, but bigger and with more zones. I am trying to sum the values in each row according to the zone names. The sum of the values for R or C zones goes in the total column, while the sum for M zones goes in total1.
Input (total and total1 are the desired output):
ID Zone1 CHC1 Value1 Zone2 CHC2 Value2 Zone3 CHC3 Value3 total total1
1 R5B 100 10 C2 0 20 R10A 2 5 35 0
1 C2 95 20 M2-6 5 6 R5B 7 3 23 6
3 C2 40 4 C4 60 6 0 6 0 10 0
3 C1 100 8 0 0 0 0 100 0 8 0
5 M1-5 10 6 M2-6 86 15 0 0 0 0 21
You can use filter to select the Zone and Value columns as separate DataFrames:
z = df.filter(like='Zone')
v = df.filter(like='Value')
Then create boolean DataFrames, using str.contains with apply if you want to check for substrings:
m1 = z.apply(lambda x: x.str.contains('R|C'))
m2 = z.apply(lambda x: x.str.contains('M'))
# to check for exact strings instead
#m1 = z == 'R2'
#m2 = z.isin(['C1', 'C4'])
Finally, mask v with where and sum per row:
df['t'] = v.where(m1.values).sum(axis=1).astype(int)
df['t1'] = v.where(m2.values).sum(axis=1).astype(int)
print (df)
ID Zone1 CHC1 Value1 Zone2 CHC2 Value2 Zone3 CHC3 Value3 t t1
0 1 R5B 100 10 C2 0 20 R10A 2 5 35 0
1 1 C2 95 20 M2-6 5 6 R5B 7 3 23 6
2 3 C2 40 4 C4 60 6 0 6 0 10 0
3 3 C1 100 8 0 0 0 0 100 0 8 0
4 5 M1-5 10 6 M2-6 86 15 0 0 0 0 21
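The steps above can be put together in a self-contained form. This sketch rebuilds a reduced version of the question's frame (the CHC columns are left out), assumes the empty zones are stored as the string '0', and passes na=False to str.contains so that any missing value counts as no match; those choices are assumptions, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 3, 3, 5],
    "Zone1": ["R5B", "C2", "C2", "C1", "M1-5"],
    "Value1": [10, 20, 4, 8, 6],
    "Zone2": ["C2", "M2-6", "C4", "0", "M2-6"],
    "Value2": [20, 6, 6, 0, 15],
    "Zone3": ["R10A", "R5B", "0", "0", "0"],
    "Value3": [5, 3, 0, 0, 0],
})

z = df.filter(like="Zone")    # zone names
v = df.filter(like="Value")   # matching values, in the same column order

# Boolean masks: which zone cells belong to each group.
m1 = z.apply(lambda s: s.str.contains("R|C", na=False))
m2 = z.apply(lambda s: s.str.contains("M", na=False))

# Keep values only where the mask is True, then sum across each row.
df["total"] = v.where(m1.to_numpy()).sum(axis=1).astype(int)
df["total1"] = v.where(m2.to_numpy()).sum(axis=1).astype(int)
print(df[["ID", "total", "total1"]])
```

The masks are converted to plain arrays because z and v have different column names; the approach relies on the Zone and Value columns appearing in the same order.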
Solution1 (simpler code but slower and less flexible)
total = []
total1 = []
for i in range(df.shape[0]):
    temp = df.iloc[i].tolist()
    if "R2" in temp:
        total.append(temp[temp.index("R2")+1])
    else:
        total.append(0)
    if ("C1" in temp) & ("C4" in temp):
        total1.append(temp[temp.index("C1")+1] + temp[temp.index("C4")+1])
    else:
        total1.append(0)
df["Total"] = total
df["Total1"] = total1
Solution2 (faster than solution1 and easier to customize but possibly memory intensive)
# columns to use
cols = df.columns.tolist()
zones = [x for x in cols if x.startswith('Zone')]
vals = [x for x in cols if x.startswith('Value')]
# you can customize here
bucket1 = ['R2']
bucket2 = ['C1', 'C4']
thresh = 2 # "OR": 1, "AND": 2
original = df.copy()
# bucket1 check
for zone in zones:
    df.loc[~df[zone].isin(bucket1), cols[cols.index(zone)+1]] = 0
original['Total'] = df[vals].sum(axis=1)
df = original.copy()
# bucket2 check
for zone in zones:
    df.loc[~df[zone].isin(bucket2), cols[cols.index(zone)+1]] = 0
df['Check_Bucket'] = df[zones].stack().reset_index().groupby('level_0')[0].apply(list)
df['Check_Bucket'] = df['Check_Bucket'].apply(lambda x: len([y for y in x if y in bucket2]))
df['Total1'] = df[vals].sum(axis=1)
df.loc[df.Check_Bucket < thresh, 'Total1'] = 0
df.drop('Check_Bucket', axis=1, inplace=True)
When I expanded the original dataframe to 100k rows, solution 1 took 11.4 s ± 82.1 ms per loop, while solution 2 took 3.53 s ± 29.8 ms per loop. The difference arises because solution 2 does not loop over the rows.

change table format of the output

I would like to change the format of my output for the following code.
import pandas as pd
x= pd.read_csv('x.csv')
y= pd.read_csv('y.csv')
z= pd.read_csv('z.csv')
list = pd.merge(x, y, how='left', on=['xx'])
list = pd.merge(list, z, how='left', on=['xx'])
columns_to_keep = ['yy','zz', 'uu']
list = list.set_index(['xx'])
list = list[columns_to_keep]
list = list.sort_index(axis=0, level=None, ascending=True, inplace=False,
sort_remaining=True, by=None)
with open('write.csv','w') as f:
    list.to_csv(f,header=True, index=True, index_label='xx')
from this:
id date user_id user_name
1 8/13/2007 1 a1
2 1/8/2007 2 a2
2 1/8/2007 3 a3
3 12/14/2007 4 a4
4 3/6/2008 5 a5
4 4/14/2009 6 a6
4 5/30/2008 7 a7
4 5/30/2008 8 a8
5 6/17/2007 9 a9
to this:
id date user_id user_name
1 8/13/2007 1 a1
2 1/8/2007 2;3 a2;a3
3 12/14/2007 4 a4
4 3/6/2008 5;6;7;8 a5;a6;a7;a8
5 6/17/2007 9 a9
I think the following should work on the final dataframe (list), though I would suggest not using "list" as a name: it shadows a Python built-in that you might want to use elsewhere. So in my code I will use "df" instead of "list":
ind = df.index.unique()
finaldf = pd.DataFrame(columns=list(df.columns))
for val in ind:
    # the list indexer keeps a DataFrame even when only one row matches
    tempDF = df.loc[[val]]
    print(tempDF)
    for i in range(tempDF.shape[0]):
        for jloc, j in enumerate(list(df.columns)):
            if i != 0 and j != 'date':
                # later rows: append to the existing cell, except for the date
                finaldf.loc[val, j] += (";" + str(tempDF.iloc[i, jloc]))
            elif i == 0:
                # first row of the group: start the cell
                finaldf.loc[val, j] = str(tempDF.iloc[i, jloc])
print(finaldf)
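A shorter route to the same layout is a groupby with a join-based aggregation. This is a sketch of an alternative technique, not the answer's method: it assumes the merged data has the columns shown in the question's sample (id, date, user_id, user_name), keeps the first date of each group as in the desired output, and the helper name join_values is hypothetical:

```python
import pandas as pd

# Sample data in the shape shown in the question.
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 4, 4, 4, 5],
    "date": ["8/13/2007", "1/8/2007", "1/8/2007", "12/14/2007", "3/6/2008",
             "4/14/2009", "5/30/2008", "5/30/2008", "6/17/2007"],
    "user_id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "user_name": ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"],
})


def join_values(s):
    """Concatenate a group's values into one ';'-separated string."""
    return ";".join(s.astype(str))


# One row per id: first date, and all user ids / names joined with ';'.
out = df.groupby("id").agg(
    date=("date", "first"),
    user_id=("user_id", join_values),
    user_name=("user_name", join_values),
)
print(out)
```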

How to multiply every column of one Pandas Dataframe with every column of another Dataframe efficiently?

I'm trying to multiply two pandas dataframes with each other. Specifically, I want to multiply every column with every column of the other df.
The dataframes are one-hot encoded, so they look like this:
col_1, col_2, col_3, ...
0 1 0
1 0 0
0 0 1
...
I could just iterate through each of the columns using a for loop, but in python that is computationally expensive, and I'm hoping there's an easier way.
One of the dataframes has 500 columns, the other has 100 columns.
This is the fastest version that I've been able to write so far:
interact_pd = pd.DataFrame(index=df_1.index)
df1_columns = [column for column in df_1]
for column in df_2:
    col_pd = df_1[df1_columns].multiply(df_2[column], axis="index")
    interact_pd = interact_pd.join(col_pd, lsuffix='_' + column)
I iterate over each column in df_2 and multiply all of df_1 by that column, then I append the result to interact_pd. I would rather not do it using a for loop however, as this is very computationally costly. Is there a faster way of doing it?
EDIT: example
df_1:
1col_1, 1col_2, 1col_3
0 1 0
1 0 0
0 0 1
df_2:
2col_1, 2col_2
0 1
1 0
0 0
interact_pd:
1col_1_2col_1, 1col_2_2col_1,1col_3_2col_1, 1col_1_2col_2, 1col_2_2col_2,1col_3_2col_2
0 0 0 0 1 0
1 0 0 0 0 0
0 0 0 0 0 0
# use numpy to get a pair of indices that map out every
# combination of columns from df_1 and columns of df_2
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
# use pandas MultiIndex to create a nice MultiIndex for
# the final output
lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
names=[df_1.columns.name, df_2.columns.name])
# df_1.values[:, pidx[0]] slices df_1 values for every combination
# like wise with df_2.values[:, pidx[1]]
# finally, I marry up the product of arrays with the MultiIndex
pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
columns=lcol)
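To make the index trick concrete, here is the same technique applied to the small example from the question. The frames are rebuilt by hand and the variable name out is illustrative; note that the column order (all df_2 columns for the first df_1 column, then the next) differs from the order in the question's expected output, although the values match:

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame([[0, 1, 0], [1, 0, 0], [0, 0, 1]],
                    columns=["1col_1", "1col_2", "1col_3"])
df_2 = pd.DataFrame([[0, 1], [1, 0], [0, 0]],
                    columns=["2col_1", "2col_2"])

# pidx[0] repeats each df_1 column position once per df_2 column;
# pidx[1] cycles through the df_2 column positions.
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
print(pidx)

# Column labels in the same (df_1 column, df_2 column) order.
lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns])

# Element-wise product of the two expanded arrays, one column per pair.
out = pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                   columns=lcol)
print(out)
```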
Timing
code
from string import ascii_letters
df_1 = pd.DataFrame(np.random.randint(0, 2, (1000, 26)), columns=list(ascii_letters[:26]))
df_2 = pd.DataFrame(np.random.randint(0, 2, (1000, 52)), columns=list(ascii_letters))
def pir1(df_1, df_2):
    pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
    lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                      names=[df_1.columns.name, df_2.columns.name])
    return pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                        columns=lcol)
def Test2(DA, DB):
    # to_numpy() replaces as_matrix(), which was removed in pandas 1.0
    MA = DA.to_numpy()
    MB = DB.to_numpy()
    MM = np.zeros((len(MA), len(MA[0]) * len(MB[0])))
    Col = []
    for i in range(len(MB[0])):
        for j in range(len(MA[0])):
            MM[:, i * len(MA[0]) + j] = MA[:, j] * MB[:, i]
            Col.append('1col_' + str(i + 1) + '_2col_' + str(j + 1))
    return pd.DataFrame(MM, dtype=int, columns=Col)
results
You can multiply along the index axis your first df with each column of the second df, this is the fastest method for big datasets (see below):
# items() replaces iteritems(), which was removed in pandas 2.0
df = pd.concat([df_1.mul(col[1], axis="index") for col in df_2.items()], axis=1)
# Change the name of the columns
df.columns = ["_".join([i, j]) for j in df_2.columns for i in df_1.columns]
df
1col_1_2col_1 1col_2_2col_1 1col_3_2col_1 1col_1_2col_2 \
0 0 0 0 0
1 1 0 0 0
2 0 0 0 0
1col_2_2col_2 1col_3_2col_2
0 1 0
1 0 0
2 0 0
--> See benchmark for comparisons with other answers to choose the best option for your dataset.
Benchmark
Functions:
def Test2(DA, DB):
    # to_numpy() replaces as_matrix(), which was removed in pandas 1.0
    MA = DA.to_numpy()
    MB = DB.to_numpy()
    MM = np.zeros((len(MA), len(MA[0]) * len(MB[0])))
    Col = []
    for i in range(len(MB[0])):
        for j in range(len(MA[0])):
            MM[:, i * len(MA[0]) + j] = MA[:, j] * MB[:, i]
            Col.append('1col_' + str(i + 1) + '_2col_' + str(j + 1))
    return pd.DataFrame(MM, dtype=int, columns=Col)
def Test3(df_1, df_2):
    # items() replaces iteritems(), which was removed in pandas 2.0
    df = pd.concat([df_1.mul(i[1], axis="index") for i in df_2.items()], axis=1)
    df.columns = ["_".join([i, j]) for j in df_2.columns for i in df_1.columns]
    return df
def Test4(df_1, df_2):
    pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
    lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                      names=[df_1.columns.name, df_2.columns.name])
    return pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                        columns=lcol)
def jeanrjc_imp(df_1, df_2):
    df = pd.concat([df_1.mul(i[1], axis="index") for i in df_2.items()], axis=1, keys=df_2.columns)
    return df
Code:
Sorry for the ugly code; the plot at the end is what matters:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df_1 = pd.DataFrame(np.random.randint(0, 2, (1000, 600)))
df_2 = pd.DataFrame(np.random.randint(0, 2, (1000, 600)))
df_1.columns = ["1col_"+str(i) for i in range(len(df_1.columns))]
df_2.columns = ["2col_"+str(i) for i in range(len(df_2.columns))]
resa = {}
resb = {}
resc = {}
for f, r in zip([Test2, Test3, Test4, jeanrjc_imp], ["T2", "T3", "T4", "T3bis"]):
    resa[r] = []
    resb[r] = []
    resc[r] = []
    for i in [5, 10, 30, 50, 150, 200]:
        a = %timeit -o f(df_1.iloc[:,:i], df_2.iloc[:, :10])
        b = %timeit -o f(df_1.iloc[:,:i], df_2.iloc[:, :50])
        c = %timeit -o f(df_1.iloc[:,:i], df_2.iloc[:, :200])
        resa[r].append(a.best)
        resb[r].append(b.best)
        resc[r].append(c.best)
X = [5, 10, 30, 50, 150, 200]
fig, ax = plt.subplots(1, 3, figsize=[16,5])
for j, (a, r) in enumerate(zip(ax, [resa, resb, resc])):
    for i in r:
        a.plot(X, r[i], label=i)
    a.set_xlabel("df_1 columns #")
    a.set_title("df_2 columns # = {}".format(["10", "50", "200"][j]))
ax[0].set_ylabel("time(s)")
plt.legend(loc=0)
plt.tight_layout()
Here T3bis corresponds to jeanrjc_imp, which is a bit faster than Test3.
Conclusion:
Depending on your dataset size, pick the right function, between Test4 and Test3(b). Given the OP's dataset, Test3 or jeanrjc_imp should be the fastest, and also the shortest to write!
HTH
You can use numpy.
Consider this example code. I modified the variable names, but Test1() is essentially your code. I didn't bother to create the correct column names in that function, though:
import pandas as pd
import numpy as np
A = [[1,0,1,1],[0,1,1,0],[0,1,0,1]]
B = [[0,0,1,0],[1,0,1,0],[1,1,0,0],[1,0,0,1],[1,0,0,0]]
DA = pd.DataFrame(A).T
DB = pd.DataFrame(B).T
def Test1(DA, DB):
    E = pd.DataFrame(index=DA.index)
    DAC = [column for column in DA]
    for column in DB:
        C = DA[DAC].multiply(DB[column], axis="index")
        E = E.join(C, lsuffix='_' + str(column))
    return E
def Test2(DA, DB):
    # to_numpy() replaces as_matrix(), which was removed in pandas 1.0
    MA = DA.to_numpy()
    MB = DB.to_numpy()
    MM = np.zeros((len(MA), len(MA[0]) * len(MB[0])))
    Col = []
    for i in range(len(MB[0])):
        for j in range(len(MA[0])):
            MM[:, i * len(MA[0]) + j] = MA[:, j] * MB[:, i]
            Col.append('1col_' + str(i + 1) + '_2col_' + str(j + 1))
    return pd.DataFrame(MM, dtype=int, columns=Col)
print(Test1(DA, DB))
print(Test2(DA, DB))
Output:
0_1 1_1 2_1 0 1 2 0_3 1_3 2_3 0 1 2 0 1 2
0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0
1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
2 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
1col_1_2col_1 1col_1_2col_2 1col_1_2col_3 1col_2_2col_1 1col_2_2col_2 \
0 0 0 0 1 0
1 0 0 0 0 0
2 1 1 0 1 1
3 0 0 0 0 0
1col_2_2col_3 1col_3_2col_1 1col_3_2col_2 1col_3_2col_3 1col_4_2col_1 \
0 0 1 0 0 1
1 0 0 1 1 0
2 0 0 0 0 0
3 0 0 0 0 1
1col_4_2col_2 1col_4_2col_3 1col_5_2col_1 1col_5_2col_2 1col_5_2col_3
0 0 0 1 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 1 0 0 0
Performance of your function:
%timeit(Test1(DA,DB))
100 loops, best of 3: 11.1 ms per loop
Performance of my function:
%timeit(Test2(DA,DB))
1000 loops, best of 3: 464 µs per loop
It's not beautiful, but it's efficient.
