Here is the sample data file, and I performed the following operations in an IPython notebook:
!curl -O http://pbpython.com/extras/sales-funnel.xlsx

import numpy as np
import pandas as pd

df = pd.read_excel('./sales-funnel.xlsx')
df['Status'] = df['Status'].astype('category')
df["Status"].cat.set_categories(["won", "pending", "presented", "declined"], inplace=True)
table = pd.pivot_table(df,
                       index=['Manager', 'Status'],
                       values=['Price', 'Quantity'],
                       columns=['Product'],
                       aggfunc={'Price': [np.sum, np.mean], 'Quantity': len},
                       fill_value=0)
This is what the data looks like in table:
I want to select (Manager=="Debra Henley") & (Status=="won") and it works with the query method:
table.query('(Manager=="Debra Henley") & (Status=="won")')
But how do you perform the same selection with loc? I tried this, but it does not work:
table.loc[['Debra Henley', 'won']]
What do you guys usually use when dealing with MultiIndex? What's the best way to do it?
Update: found two solutions so far:
table.xs(('Debra Henley','won'), level=('Manager', 'Status'))
table.loc[[('Debra Henley', 'won')]]
So I guess tuples should be used instead of lists when indexing with MultiIndex?
Your canonical answer is provided by @ScottBoston.
I'll add this for breadth and perspective, in addition to @jezrael's IndexSlice approach.
You can also use pd.DataFrame.xs to take a cross-section:
table.xs(['Debra Henley', 'won'])
Product
Quantity len CPU 1
Maintenance 0
Monitor 0
Software 0
Price mean CPU 65000
Maintenance 0
Monitor 0
Software 0
sum CPU 65000
Maintenance 0
Monitor 0
Software 0
Name: (Debra Henley, won), dtype: int64
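If you would rather get a one-row DataFrame back instead of a Series, xs also accepts drop_level=False to keep the selected index levels. A minimal sketch, based on my reading of the xs API:
# keep the Manager/Status levels instead of dropping them,
# so the result is a one-row DataFrame rather than a Series
table.xs(('Debra Henley', 'won'), level=('Manager', 'Status'), drop_level=False)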
For simpler selections (only the index or only the columns), use the xs approach or select by tuples.
Another, more general solution uses slicers:
idx = pd.IndexSlice
#output is df
print (table.loc[[idx['Debra Henley','won']]])
Quantity Price \
len mean
Product CPU Maintenance Monitor Software CPU Maintenance
Manager Status
Debra Henley won 1 0 0 0 65000 0
sum
Product Monitor Software CPU Maintenance Monitor Software
Manager Status
Debra Henley won 0 0 65000 0 0 0
idx = pd.IndexSlice
#output is series
print (table.loc[idx['Debra Henley','won'],:])
Quantity len CPU 1
Maintenance 0
Monitor 0
Software 0
Price mean CPU 65000
Maintenance 0
Monitor 0
Software 0
sum CPU 65000
Maintenance 0
Monitor 0
Software 0
Name: (Debra Henley, won), dtype: int64
But it is better for more complicated selections - if you need to filter the index and columns together, a single xs call is not enough:
idx = pd.IndexSlice
#select all rows where the first index level is Debra Henley, and
#in the columns the second level is len or sum
print (table.loc[idx['Debra Henley',:], idx[:, ['len', 'sum'], :]])
Quantity Price \
len sum
Product CPU Maintenance Monitor Software CPU
Manager Status
Debra Henley won 1 0 0 0 65000
pending 1 2 0 0 40000
presented 1 0 0 2 30000
declined 2 0 0 0 70000
Product Maintenance Monitor Software
Manager Status
Debra Henley won 0 0 0
pending 10000 0 0
presented 0 0 20000
declined 0 0 0
Yes, you can use:
table.loc[[('Debra Henley', 'won')]]
to return a pandas data frame or you can use:
table.loc[('Debra Henley','won')]
to return a pandas series.
You can refer to the documentation.
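To make the tuple-vs-list distinction concrete, here is a small self-contained sketch (the frame and numbers are made up for illustration):
import pandas as pd

# toy frame with a two-level row index, for illustration only
idx = pd.MultiIndex.from_tuples(
    [('Debra Henley', 'won'), ('Debra Henley', 'pending'), ('Fred Anderson', 'won')],
    names=['Manager', 'Status'])
df = pd.DataFrame({'Price': [65000, 40000, 30000]}, index=idx)

s = df.loc[('Debra Henley', 'won')]    # tuple           -> Series (the single row)
d = df.loc[[('Debra Henley', 'won')]]  # list of tuples  -> DataFrame (MultiIndex kept)
print(type(s), type(d))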
I am working on an object detection project where my task is to calculate exactly how many seconds a particular class was in the frame. I have a csv file of detected classes with their timestamps that looks like this:
I can load this csv into a pandas dataframe and calculate each class's time range as final timestamp - initial timestamp. But here is the catch: suppose one class, let's say HP, made an appearance for 5 seconds. After that, a new class kellogs is introduced, and then HP re-enters the frame.
The final - initial logic above fails here, because there is a time gap before the same class appears again.
How do I deal with this in pandas? I'm aware of .groupby() and .value_counts(), but they can't solve this problem directly.
Example data:
cat time
0 HP 06:35:03
1 HP 06:35:04
2 kellogs 06:35:42
3 kellogs 06:35:43
4 HP 06:35:45
Expected output
cat time
0 HP 00:00:03
1 kellogs 00:00:02
The output above should show the total time each class was present in the frame. So in the example above, HP has 3 seconds and kellogs 2 seconds.
This can be done by creating a new column to group by that takes into account both the categorical and time information. First, make sure the dataframe is ordered by time:
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values('time')
The desired column can be created using shift and cumsum:
df['group'] = (df['cat'].shift(1) != df['cat']).cumsum()
Intermediate result:
cat time group
0 HP 2021-12-21 06:35:03 1
1 HP 2021-12-21 06:35:04 1
2 kellogs 2021-12-21 06:35:42 2
3 kellogs 2021-12-21 06:35:43 2
4 HP 2021-12-21 06:35:45 3
Now, we can use groupby and compute the number of seconds for each group:
df = df.groupby('group').agg({'cat': 'first', 'time': ['first', 'last']})
df.columns = ["_".join(a) for a in df.columns.to_flat_index()]
# +1 second so that a single-timestamp appearance still counts as one second
df['time'] = df['time_last'] - df['time_first'] + pd.Timedelta(seconds=1)
df = df.rename(columns={'cat_first': 'cat'})
Finally, we sum up the number of seconds for each category:
df = df.groupby('cat')['time'].sum().reset_index()
Result:
cat time
0 HP 0 days 00:00:03
1 kellogs 0 days 00:00:02
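For reference, here is the whole pipeline as one runnable sketch using the sample data from the question (assuming the column names cat and time as above):
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'cat':  ['HP', 'HP', 'kellogs', 'kellogs', 'HP'],
    'time': ['06:35:03', '06:35:04', '06:35:42', '06:35:43', '06:35:45'],
})

df['time'] = pd.to_datetime(df['time'])
df = df.sort_values('time')

# new group whenever the category changes between consecutive rows
df['group'] = (df['cat'].shift(1) != df['cat']).cumsum()

# duration of each contiguous appearance, counting the first second as well
agg = df.groupby('group').agg({'cat': 'first', 'time': ['first', 'last']})
agg.columns = ['_'.join(c) for c in agg.columns.to_flat_index()]
agg['time'] = agg['time_last'] - agg['time_first'] + pd.Timedelta(seconds=1)
agg = agg.rename(columns={'cat_first': 'cat'})

# total seconds per category
print(agg.groupby('cat')['time'].sum().reset_index())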
Is it possible to conditionally append data to an existing template dataframe? I'll try to make the data below as simple as possible, since I'm asking more for conceptual help than actual code so I better understand the mindset of solving these kinds of problems in the future (but actual code would be great too).
Example Data
I have a dataframe below that shows 4 dummy products SKUs that a client may order. These SKUs never change. Sometimes a client orders large quantities of each SKU, and sometimes they only order one or two SKUs. Due to reporting, I need to fill unordered SKUs with zeroes (probably use ffill?)
Dummy dataframe DF

product_sku  quantity  total_cost
1234
5678
4321
2468
Problem
Currently, my data only returns the SKUs that customers have ordered (a), but I would like unordered SKUs to be returned, with zeros filled in for quantity and total_cost (b)
(a)

product_sku  quantity  total_cost
1234         10        50.00
5678         3         75.00
(b)

product_sku  quantity  total_cost
1234         10        50.00
5678         3         75.00
4321         0         0
2468         0         0
I'm wondering if there's a way to take that existing dataframe, and simply append any sales that actually occurred, leaving the unordered SKUs as zero or blank (whatever makes more sense).
I just need some help thinking through the steps logically, and wasn't able to find anything like this. I'm still relatively novice at this stuff, so let me know if I'm missing any pertinent information.
Thanks!
One way is to use reindex after putting the product names column into the index with set_index. With your notation it would be something like:
l_products = DF['product_sku'].tolist()  # you may have the list differently
b = (a.set_index('product_sku')
      .reindex(l_products, fill_value=0)
      .reset_index())
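For instance, with the sample frames from the question (values filled in to match (a) and (b) above), this would look roughly like:
import pandas as pd

# (a): only the SKUs that were actually ordered
a = pd.DataFrame({'product_sku': ['1234', '5678'],
                  'quantity': [10, 3],
                  'total_cost': [50.00, 75.00]})

# the full, never-changing SKU list (the "template")
l_products = ['1234', '5678', '4321', '2468']

# reindex against the template, filling unordered SKUs with zeros -> (b)
b = (a.set_index('product_sku')
      .reindex(l_products, fill_value=0)
      .reset_index())
print(b)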
If you know the SKUs a priori, maintain one DataFrame initialized with zeros and update the relevant rows. Then you will always have all SKUs.
For example:
import pandas as pd
# initialization (column order chosen to match the output below)
df = pd.DataFrame(0, index=['1234', '5678', '4321', '2468'],
                  columns=['total_cost', 'quantity'])
print(df)
# updating
df.loc['1234', :] = {'total_cost': 100, 'quantity': 4}
print(df)
# incrementing quantity
df.loc['1234', 'quantity'] += 5
print(df)
total_cost quantity
1234 0 0
5678 0 0
4321 0 0
2468 0 0
total_cost quantity
1234 100 4
5678 0 0
4321 0 0
2468 0 0
total_cost quantity
1234 100 9
5678 0 0
4321 0 0
2468 0 0
I have honestly tried all possible solutions; I think I am nearly there, but something is still not working.
I have a dataframe with coin names and their tags.
coin     tags
bitcoin  [mineable, pow, sha-256, store-of-value, state-channels]
I want to extract the tags into a binary dataframe, like this:
coin      mineable  Sha 256  scrypt
bitcoin   1         1        0
dogecoin  1         0        1
I have prepared a dataframe like this:
coin      mineable  Sha 256  scrypt
bitcoin   mineable  Sha 256  scrypt
dogecoin  mineable  Sha 256  scrypt
The idea was that, when I run the loop, if it finds the tag in the list it changes it to 1, and when it does not it leaves it (or, even better, changes it to 0):
for index_tags, row2 in tag_df2.iterrows():  # final data set to be changed
    for index, row in tags_head.iterrows():  # dataset with the tags and the coin names
        for my_tags in clean_set:  # unique list of tags
            if my_tags in (row['tags']):
                print('-----coin name-------------------->>>>', (row['name']))
                print(my_tags)
                tag_df2.loc[index_tags, my_tags] = 1
Now it seems the loop iterates through everything, but it only finds the first values for bitcoin and copies the same values to all coins. I add a link to my colab notebook too.
When I print, it seems to go through the data with no problem, but when I try to update the dataframe it just copies one coin's values to all coins. I hope someone can help me.
https://colab.research.google.com/drive/1sn5lwqiNicoBy2L00EZNmhLgz_SBxsOg?usp=sharing
You can use get_dummies:
# After you have generated `tags` DataFrame with
# tags = df_new[['name','tags']]
pd.get_dummies(tags.set_index('name')['tags'].explode()).sum(level=0)
Output (only showing the first 3 columns here to illustrate the result):
1confirmation-portfolio a16z-portfolio ai-big-data \
name
bitcoin 1 1 0
ethereum 1 1 0
binance coin 0 0 0
dogecoin 0 0 0
cardano 0 0 0
... ... ... ...
australian dollar token 0 0 0
chia network 0 0 0
safemars 0 0 0
lendhub 0 0 0
3x long bitcoin token 0 0 0
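Note that sum(level=0) has been deprecated in newer pandas releases; groupby(level=0).sum() is the equivalent. A small self-contained sketch with made-up data in the same shape as the tags frame:
import pandas as pd

tags = pd.DataFrame({
    'name': ['bitcoin', 'dogecoin'],
    'tags': [['mineable', 'pow', 'sha-256'], ['mineable', 'scrypt']],
})

# one row per (coin, tag), then dummy-encode and collapse back to one row per coin
exploded = tags.set_index('name')['tags'].explode()
binary = pd.get_dummies(exploded).groupby(level=0).sum()
print(binary)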
I have this type of data, but in real life it has millions of entries. A product id is always specific to a single product, but it occurs several times during that product's lifetime.
date        product id          revenue  estimated lifetime value
2021-04-16  0061M00001AXc5lQAD  970      2000
2021-04-17  0061M00001AXbCiQAL  159      50000
2021-04-18  0061M00001AXb9AQAT  80       3000
2021-04-19  0061M00001AXbIHQA1  1100     8000
2021-04-20  0061M00001AXbY8QAL  90       4000
2021-04-21  0061M00001AXbQ1QAL  29       30000
2021-04-21  0061M00001AXc5lQAD  30       2000
2021-05-02  0061M00001AXc5lQAD  50       2000
2021-05-05  0061M00001AXc5lQAD  50       2000
I'm looking to create a new column in pandas that indicates when a certain product id has generated more revenue than a specific threshold, e.g. $100 or $1000, marking it as a Win (1). A win may occur only once during the lifecycle of a product. In addition, I would like to create another column that indicates the row where a specific product's sales exceed e.g. 10% of the estimated lifetime value.
What would be the most intuitive approach to achieve this in Python / Pandas?
edit:
dw1k_thresh: if the cumulative sales of a specific product id >= 1000, the column takes the value 1, otherwise zero. However, 1 can occur only once, and after that the value is always zero again. Basically it's just an indicator of the date and transaction when a product's sales exceed the critical value of 1000.
dw10perc: if the cumulative sales of one product id >= 10% of the estimated lifetime value, the column takes the value 1, otherwise 0. However, 1 can occur only once, and after that the value is always zero again. Basically it's just an indicator of the date and transaction when a product's sales exceed 10% of the estimated lifetime value.
The threshold value is common for all product id's (I'll just replicate the process with different thresholds at a later stage to determine which is the optimal threshold to predict future revenue).
I'm trying to achieve this:
The code I've written so far is trying to establish the cum_rev and dw1k_thresh columns, but unfortunately it doesn't work.
df_final["dw1k_thresh"] = 0
df_final["cum_rev"]= 0
opp_list =set()
for row in df_final["product id"].iteritems():
opp_list.add(row)
opp_list=list(opp_list)
opp_list=pd.Series(opp_list)
for i in opp_list:
if i == df_final["product id"].any():
df_final.cum_rev = df_final.revenue.cumsum()
for x in df_final.cum_rev:
if x >= 1000 & df_final.dw1k_thresh.sum() == 0:
df_final.dw1k_thresh = 1
else:
df_final.dw1k_thresh = 0
df_final.head(30)
Cumulative Revenue: Can be calculated fairly simply with groupby and cumsum.
dw1k_thresh: We first check whether cum_rev is greater than 1000 and then apply the function that helps us keep the 1 only once, with all other rows set to zero.
dw10_perc: Same approach as dw1k_thresh.
As a first step you would need to remove $ and make sure your columns are of numeric type to perform the comparisons you outlined.
# Imports
import pandas as pd
import numpy as np
# Remove $ sign and convert to numeric
cols = ['revenue','estimated lifetime value']
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True).astype(float)
# Cumulative Revenue
df['cum_rev'] = df.groupby('product id')['revenue'].cumsum()
# Function to be applied on both
def f(df, thresh_col):
    return (df[df[thresh_col] == 1]
            .sort_values(['date', 'product id'], ascending=False)
            .groupby('product id', as_index=False, group_keys=False)
            .apply(lambda x: x.tail(1))
            ).index.tolist()
# dw1k_thresh
df['dw1k_thresh'] = np.where(df['cum_rev'].ge(1000),1,0)
df['dw1k_thresh'] = np.where(df.index.isin(f(df,'dw1k_thresh')),1,0)
# dw10perc
df['dw10_perc'] = np.where(df['cum_rev'] > 0.10 * df.groupby('product id',observed=True)['estimated lifetime value'].transform('sum'),1,0)
df['dw10_perc'] = np.where(df.index.isin(f(df,'dw10_perc')),1,0)
Prints:
>>> df
date product id revenue ... cum_rev dw1k_thresh dw10_perc
0 2021-04-16 0061M00001AXc5lQAD 970 ... 970 0 1
1 2021-04-17 0061M00001AXbCiQAL 159 ... 159 0 0
2 2021-04-18 0061M00001AXb9AQAT 80 ... 80 0 0
3 2021-04-19 0061M00001AXbIHQA1 1100 ... 1100 1 1
4 2021-04-20 0061M00001AXbY8QAL 90 ... 90 0 0
5 2021-04-21 0061M00001AXbQ1QAL 29 ... 29 0 0
6 2021-04-21 0061M00001AXc5lQAD 30 ... 1000 1 0
7 2021-05-02 0061M00001AXc5lQAD 50 ... 1050 0 0
8 2021-05-05 0061M00001AXc5lQAD 50 ... 1100 0 0
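As a side note, the "flag only the first row where the running total crosses the threshold" step can also be written without the helper function, by checking where the within-product count of crossings first becomes 1. A sketch under the same column names (an alternative, not the answer above):
# cum_rev computed as above with groupby + cumsum
df['cum_rev'] = df.groupby('product id')['revenue'].cumsum()

# True only on the first row (per product) where the running total reaches 1000
crossed = df['cum_rev'].ge(1000)
first_cross = crossed & (crossed.astype(int).groupby(df['product id']).cumsum() == 1)
df['dw1k_thresh'] = first_cross.astype(int)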
I am really stuck on how to approach adding columns to pandas dataframes dynamically. I've been trying to search for an answer to work through this; however, I am afraid that when searching I may be using the wrong terminology to summarize what I am attempting to do.
I have a dataframe returned from a query that looks like the following:
department action date
marketing close 09-01-2017
marketing close 07-01-2018
marketing close 06-01-2017
marketing close 10-21-2019
marketing open 08-01-2018
marketing other 07-14-2018
sales open 02-01-2019
sales open 02-01-2017
sales close 02-22-2019
The ultimate goal is I need a count of the types of actions grouped within particular date ranges.
My DESIRED output is something along the lines of:
department 01/01/2017-12/31/2017 01/01/2018-12/31/2018 01/01/2019-12/31/2019
open close other open close other open close other
marketing 0 2 0 1 1 1 0 1 0
sales 1 0 0 0 0 0 1 1 0
"Department" would be my index, then the contents would be filtered by date ranges specified in a list I provide, followed by the action taken (with counts). Being newer to this, I am confused as to what approach I should take - for example should I use Python (should I be looping or iterating), or should the heavy lifting be done in PANDAS. If in PANDAS, I am having difficulty determining what function to use (I've been looking at get_dummy() etc.).
I'd imagine this would be accomplished with either 1. some type of for loop iterating through, 2. adding a column to the dataframe based on the list and then filtering the data underneath based on the value(s), or 3. using a function I am not aware of in pandas.
I have explained more of my thought process in this question, but I am not sure whether the question is unclear, which may be why it is unanswered:
Building a dataframe with dynamic date ranges using filtered results from another dataframe
There are quite a few concepts you need at once here.
First, you don't yet have the count. From your desired output I gather you want it yearly, but you can specify any time frame you want. Then just count with groupby() and count():
In [66]: df2 = df.groupby([pd.to_datetime(df.date).dt.year, "action", "department"]).count().squeeze().rename("count")
Out[66]:
date action department
2017 close marketing 2
open sales 1
2018 close marketing 1
open marketing 1
other marketing 1
2019 close marketing 1
sales 1
open sales 1
Name: count, dtype: int64
The squeeze() and rename() are there because afterwards both the count column and the year would be called date and you get a name conflict. You could equivalently use rename(columns={'date': 'count'}) and not cast to a Series.
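For reference, that rename-based equivalent would look roughly like this (df2_alt is just an illustrative name; the result stays a one-column DataFrame rather than a Series):
# equivalent to the squeeze()/rename("count") line above, but kept as a DataFrame
df2_alt = (df.groupby([pd.to_datetime(df.date).dt.year, "action", "department"])
             .count()
             .rename(columns={"date": "count"}))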
The second step is a pivot_table. This creates column names from values. Because there are combinations of date and action without a corresponding value, you need pivot_table.
In [62]: df2.reset_index().pivot_table(index="department", columns=["date", "action"])
Out[62]:
count
date 2017 2018 2019
action close open close open other close open
department
marketing 2.0 NaN 1.0 1.0 1.0 1.0 NaN
sales NaN 1.0 NaN NaN NaN 1.0 1.0
Because NaN is internally represented as floating point, your counts were also converted to floating point. To fix that, just append fillna and convert back to int.
In [65]: df2.reset_index().pivot_table(index="department", columns=["date", "action"]).fillna(0).astype(int)
Out[65]:
count
date 2017 2018 2019
action close open close open other close open
department
marketing 2 0 1 1 1 1 0
sales 0 1 0 0 0 1 1
To get exactly your output you would need to modify pd.to_datetime(df.date).dt.year. You can do this with strftime (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html). Furthermore, the column ["2017", "other"] was dropped because there was no value. If this creates problems, you need to include the missing values beforehand. After the first step, a reindex and a fillna should do the trick.
EDIT: Yes it does
In [77]: new_index = pd.MultiIndex.from_product([[2017, 2018, 2019], ["close", "open", "other"], ['marketing', 'sales']], names=['date', 'action', 'department'])
...:
In [78]: df3 = df2.reindex(new_index).fillna(0).astype(int).reset_index()
Out[78]:
date action department count
0 2017 close marketing 2
1 2017 close sales 0
2 2017 open marketing 0
3 2017 open sales 1
4 2017 other marketing 0
5 2017 other sales 0
6 2018 close marketing 1
.. ... ... ... ...
11 2018 other sales 0
12 2019 close marketing 1
13 2019 close sales 1
14 2019 open marketing 0
15 2019 open sales 1
16 2019 other marketing 0
17 2019 other sales 0
In [79]: df3.pivot_table(index="department", columns=["date", "action"])
Out[79]:
count
date 2017 2018 2019
action close open other close open other close open other
department
marketing 2 0 0 1 1 1 1 0 0
sales 0 1 0 0 0 0 1 1 0
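And if you want the column labels to be the literal date ranges from the desired output rather than bare years, one option (just a sketch; the label format is illustrative) is to map each year to a range string before grouping instead of using dt.year directly:
# "01/01/2017-12/31/2017"-style labels instead of plain years
year = pd.to_datetime(df.date).dt.year
range_label = year.map(lambda y: f"01/01/{y}-12/31/{y}").rename("range")
df.groupby([range_label, "action", "department"]).count()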