Related
Consider a dataframe like pivoted below, where replicates of some data are given as lists:
import pandas as pd

d = {'Compound': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
     'Conc': [1, 0.5, 0.1, 1, 0.5, 0.1, 2, 1, 0.5, 0.1],
     'Data': [[100, 90, 80], [50, 40, 30], [10, 9.7, 8],
              [20, 15, 10], [3, 4, 5, 6], [100, 110, 80],
              [30, 40, 50, 20], [10, 5, 9, 3], [2, 1, 2, 2], [1, 1, 0]]}
df = pd.DataFrame(data=d)
pivoted = df.pivot(index='Conc', columns='Compound', values='Data')
This df can be written to an Excel file as follows:
with pd.ExcelWriter('output.xlsx') as writer:
    pivoted.to_excel(writer, sheet_name='Sheet1', index_label='Conc')
How can this instead be written so that the replicate data end up in side-by-side cells in the Excel file?
You need to pivot your data in a slightly different way: first explode the Data column, then deduplicate with groupby.cumcount:
(df.explode('Data')                                    # one row per replicate value
   .assign(n=lambda d: d.groupby(level=0).cumcount())  # replicate number within each original row
   .pivot(index='Conc', columns=['Compound', 'n'], values='Data')
   .droplevel('n', axis=1).rename_axis(columns=None)   # keep only the Compound level
)
Output:
A A A B B B B C C C C
Conc
0.1 10 9.7 8 100 110 80 NaN 1 1 0 NaN
0.5 50 40 30 3 4 5 6 2 1 2 2
1.0 100 90 80 20 15 10 NaN 10 5 9 3
2.0 NaN NaN NaN NaN NaN NaN NaN 30 40 50 20
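The result above can be written to Excel with the same ExcelWriter pattern as in the question; a minimal sketch (the output file name is just a placeholder, and it assumes to_excel is happy writing the duplicated column labels, which it normally is):
side = (df.explode('Data')
          .assign(n=lambda d: d.groupby(level=0).cumcount())
          .pivot(index='Conc', columns=['Compound', 'n'], values='Data')
          .droplevel('n', axis=1).rename_axis(columns=None))

with pd.ExcelWriter('output_wide.xlsx') as writer:
    # one column per replicate, grouped side by side per compound
    side.to_excel(writer, sheet_name='Sheet1', index_label='Conc')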
Besides @mozway's answer, just for the formatting you can use:
piv = (df.explode('Data').assign(col=lambda x: x.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'col'], values='Data')
.rename_axis(None))
piv.columns = pd.Index([i if j == 0 else '' for i, j in piv.columns], name='Conc')
piv.to_excel('file.xlsx')
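This way each compound name appears only once in the Excel header, above its first replicate column, with the repeated labels left blank.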
Let's assume I have the following DataFrame:
dic = {'a' : [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'b' : [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
'c' : ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
'd' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df
Out[10]:
a b c d
0 1 1 f 10
1 1 1 f 20
2 2 1 f 30
3 2 1 e 40
4 2 2 f 50
5 2 2 f 60
6 3 1 f 70
7 3 1 e 80
8 3 2 f 90
9 3 2 f 100
In the following I want to take the values of columns a and b where c == 'e' and use those values to select the respective rows of df (which would select rows 2, 3, 6 and 7). The idea is to create a list of tuples and index df by that list:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new
Out[13]:
c d
a b
1 1 f 10
1 f 20
2 1 f 30
1 e 40
2 f 50
2 f 60
3 1 f 70
1 e 80
2 f 90
2 f 100
list_tup
Out[14]: [(2, 1), (3, 1)]
df_new.loc[list_tup]
This results in a TypeError: unhashable type: 'writeable void-scalar', which I don't understand. Any suggestions? I'm pretty new to Python and pandas, so I assume I'm missing something fundamental.
I believe it's better to use groupby().transform() and boolean indexing in this use case:
valids = (df['c'].eq('e')               # check if `c` is 'e'
          .groupby([df['a'], df['b']])  # group by `a` and `b`
          .transform('any')             # check if True occurs anywhere in the group;
                                        # the same label is used for all rows in the group
         )
# filter with boolean indexing
df[valids]
Output:
a b c d
2 2 1 f 30
3 2 1 e 40
6 3 1 f 70
7 3 1 e 80
A similar idea with groupby().filter(), which is more readable but can be slightly slower:
df.groupby(['a','b']).filter(lambda x: x['c'].eq('e').any())
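Both approaches return the same rows (2, 3, 6 and 7); filter calls the Python function once per group, which is why it tends to be a bit slower than the vectorized transform when there are many groups.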
You could try an inner join.
import pandas as pd
dic = {'a' : [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'b' : [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
'c' : ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
'd' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df.merge(df.loc[df['c']=='e', ['a','b']], on=['a','b'])
Output
a b c d
0 2 1 f 30
1 2 1 e 40
2 3 1 f 70
3 3 1 e 80
Maybe try a MultiIndex:
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Full code:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Outputs:
c d
a b
2 1 f 30
1 e 40
3 1 f 70
1 e 80
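Note that this selects from df_new, which is indexed by a and b; if you want the original flat layout back, you could reset the index afterwards, for example:
df_new.loc[pd.MultiIndex.from_tuples(list_tup)].reset_index()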
Let us try:
m = df[['a', 'b']].apply(tuple, axis=1).isin(
        df.loc[df['c'] == 'e', ['a', 'b']].to_records(index=False).tolist())
out = df[m].copy()
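Here m is a boolean mask over the (a, b) pairs, so out contains the same four rows as in the answers above.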
Suppose I have the following data frame
from pandas import DataFrame
Cars = {'value':   [10, 31, 661, 1, 51, 61, 551],
        'action1': [1, 1, 1, 1, 1, 1, 1],
        'price1':  [12, 0, 15, 3, 0, 12, 0],
        'action2': [2, 2, 2, 2, 2, 2, 2],
        'price2':  [0, 16, 19, 0, 1, 10, 0],
        'action3': [3, 3, 3, 3, 3, 3, 3],
        'price3':  [14, 36, 9, 0, 0, 0, 0]}
df = DataFrame(Cars, columns=['value', 'action1', 'price1', 'action2', 'price2', 'action3', 'price3'])
print(df)
How can I randomly select a value (action and price) from among the 3 column pairs? As a result I want a dataframe that looks something like this one:
RandCars = {'value':  [10, 31, 661, 1, 51, 61, 551],
            'action': [1, 3, 1, 3, 1, 2, 2],
            'price':  [12, 36, 15, 0, 3, 10, 0]}
df2 = DataFrame(RandCars, columns=['value', 'action', 'price'])
print(df2)
Use:
import numpy as np
import pandas as pd

# get the column names that do not start with action or price
cols = df.columns[~df.columns.str.startswith(('action', 'price'))]
print(cols)
Index(['value'], dtype='object')

# convert the filtered columns to 2 numpy arrays
arr1 = df.filter(regex='^action').values
arr2 = df.filter(regex='^price').values
# pandas 0.24+
# arr1 = df.filter(regex='^action').to_numpy()
# arr2 = df.filter(regex='^price').to_numpy()
i, c = arr1.shape
# pick one random column position per row
idx = np.random.choice(np.arange(c), i)
df3 = pd.DataFrame({'action': arr1[np.arange(len(df)), idx],
                    'price':  arr2[np.arange(len(df)), idx]},
                   index=df.index)
print(df3)
action price
0 2 0
1 3 36
2 3 9
3 1 3
4 3 0
5 1 12
6 1 0
# add the remaining columns back with a join
df4 = df[cols].join(df3)
print(df4)
value action price
0 10 2 0
1 31 3 36
2 661 3 9
3 1 1 3
4 51 3 0
5 61 1 12
6 551 1 0
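If you want the random column picks to be reproducible between runs, you can seed NumPy's random generator before drawing the positions; a minimal sketch reusing i and c from above (the seed value is arbitrary):
np.random.seed(42)                       # any fixed seed makes the choice deterministic
idx = np.random.choice(np.arange(c), i)  # same positions on every run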
I would like to reorder the columns in a dataframe, and keep the underlying values in the right columns.
For example, this is the dataframe I have:
import numpy as np
import pandas as pd

cols = [['Three', 'Two'], ['A', 'D', 'C', 'B']]
header = pd.MultiIndex.from_product(cols)
df = pd.DataFrame([[1, 4, 3, 2, 5, 8, 7, 6]]*4, index=np.arange(1, 5), columns=header)
df.loc[:, ('One', 'E')] = 9
df.loc[:, ('One', 'F')] = 10
>>> df
And I would like to change it as follows:
header2 = pd.MultiIndex(levels=[['One', 'Two', 'Three'], ['E', 'F', 'A', 'B', 'C', 'D']],
                        codes=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2], [0, 1, 2, 3, 4, 5, 2, 3, 4, 5]])
df2 = pd.DataFrame([[9, 10, 1, 2, 3, 4, 5, 6, 7, 8]]*4, index=np.arange(1, 5), columns=header2)
>>> df2
First, define a categorical ordering on both column levels. Then call sort_index on the column axis with both levels.
v = pd.Categorical(df.columns.get_level_values(0),
categories=['One', 'Two', 'Three'],
ordered=True)
v2 = pd.Categorical(df.columns.get_level_values(1),
categories=['E', 'F', 'C', 'B', 'A', 'D'],
ordered=True)
df.columns = pd.MultiIndex.from_arrays([v, v2])
df = df.sort_index(axis=1, level=[0, 1])
df
One Two Three
E F C B A D C B A D
1 9 10 7 6 5 8 3 2 1 4
2 9 10 7 6 5 8 3 2 1 4
3 9 10 7 6 5 8 3 2 1 4
4 9 10 7 6 5 8 3 2 1 4
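If you would rather not keep the categorical dtype on the column levels after sorting, you can rebuild a plain MultiIndex from the now-sorted columns, for example:
df.columns = pd.MultiIndex.from_tuples(df.columns)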
What would be a workaround (or a tidier way) to insert a column into a pandas dataframe where some index values are duplicated?
For example, having the following dataframe:
df1 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 2, 3, 4),
1: (51, 51, 74, 29, 39, 3, 14, 16),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F']) })
df1 = df1.set_index([0])
df1
1 2
0
1 51 R
2 51 R
3 74 R
4 29 R
1 39 F
2 3 F
3 14 F
4 16 F
How can I insert the column foo from df2 (below) into df1?
df2 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 3, 4),
'foo': (5, 5, 7, 2, 3, 1, 1),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F']) })
df2 = df2.set_index([0])
df2
foo 2
0
1 5 R
2 5 R
3 7 R
4 2 R
1 3 F
3 1 F
4 1 F
Note that the index 2 is missing from category F.
I would like the result to be something like:
1 foo 2
0
1 51 5 R
2 51 5 R
3 74 7 R
4 29 2 R
1 39 3 F
2 3 NaN F
3 14 1 F
4 16 1 F
I tried the DataFrame.insert method but am getting
df1.insert(2, 'FOO', df2['foo'])
ValueError: cannot reindex from a duplicate axis
Since the index and column 2 uniquely define a row in both data frames, you can do a merge on the two columns (after resetting the index):
df1.reset_index().merge(df2.reset_index(), how='left', on=[0,2]).set_index([0])
# 1 2 foo
#0
#1 51 R 5.0
#2 51 R 5.0
#3 74 R 7.0
#4 29 R 2.0
#1 39 F 3.0
#2 3 F NaN
#3 14 F 1.0
#4 16 F 1.0
df1 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 2, 3, 4),
1: (51, 51, 74, 29, 39, 3, 14, 16),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F']) })
df2 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 3, 4),
'foo': (5, 5, 7, 2, 3, 1, 1),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F']) })
df1 = df1.set_index([0, 2])
df2 = df2.set_index([0, 2])
df1.join(df2, how='left').reset_index(level=2)
2 1 foo
0
1 R 51 5.0
2 R 51 5.0
3 R 74 7.0
4 R 29 2.0
1 F 39 3.0
2 F 3 NaN
3 F 14 1.0
4 F 16 1.0
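If you also want the column order from the desired output (1, foo, 2), you can reorder after the join; a small sketch based on the frames above:
res = df1.join(df2, how='left').reset_index(level=2)
res = res[[1, 'foo', 2]]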
You're very close...
As you already know from your question, you can't do this, for the reason clearly stated in the error: you have a repeated index. If you must have column 0 as the index, then don't set it as the index before your merge; set it after:
df1 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 2, 3, 4),
1: (51, 51, 74, 29, 39, 3, 14, 16),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F']) })
df2 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 3, 4),
'foo': (5, 5, 7, 2, 3, 1, 1),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F']) })
df = df1.merge(df2, how='left')
df.set_index([0])
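Note that with no on= argument, merge joins on the columns common to both frames, here 0 and 2, which is exactly the pairing needed; set_index([0]) then restores 0 as the (duplicated) index.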