How would I conditionally append data to an existing dataframe? - python

Is it possible to conditionally append data to an existing template dataframe? I'll try to make the data below as simple as possible, since I'm asking more for conceptual help than actual code, so that I better understand the mindset for solving these kinds of problems in the future (but actual code would be great too).
Example Data
I have a dataframe below that shows 4 dummy product SKUs that a client may order. These SKUs never change. Sometimes a client orders large quantities of each SKU, and sometimes they only order one or two SKUs. For reporting, I need to fill unordered SKUs with zeroes (probably use ffill?).
Dummy dataframe DF

product_sku  quantity  total_cost
1234
5678
4321
2468
Problem
Currently, my data only returns the SKUs that customers have ordered (a), but I would like unordered SKUs to be returned as well, with zeros filled in for quantity and total_cost (b).
(a)

product_sku  quantity  total_cost
1234         10        50.00
5678         3         75.00
(b)

product_sku  quantity  total_cost
1234         10        50.00
5678         3         75.00
4321         0         0
2468         0         0
I'm wondering if there's a way to take that existing dataframe, and simply append any sales that actually occurred, leaving the unordered SKUs as zero or blank (whatever makes more sense).
I just need some help thinking through the steps logically, and wasn't able to find anything like this. I'm still relatively novice at this stuff, so let me know if I'm missing any pertinent information.
Thanks!

One way is to use reindex after setting the product column as the index with set_index. With your notation it would be something like:
l_products = DF['product_sku'].tolist()  # you may have the list differently

b = (a.set_index('product_sku')
      .reindex(l_products, fill_value=0)
      .reset_index()
)
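For reference, here is a minimal self-contained sketch of that approach using the sample data from the question (the concrete l_products list and the frame a below are just illustrative):

import pandas as pd

# full SKU list and the orders that actually happened (frame (a))
l_products = ['1234', '5678', '4321', '2468']
a = pd.DataFrame({'product_sku': ['1234', '5678'],
                  'quantity': [10, 3],
                  'total_cost': [50.00, 75.00]})

# reindex on product_sku so missing SKUs appear with zeros (frame (b))
b = (a.set_index('product_sku')
      .reindex(l_products, fill_value=0)
      .reset_index())
print(b)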

If you know the SKUs a priori, maintain one DataFrame initialized with zeros and update the relevant rows. Then you will always have all SKUs.
For example:
import pandas as pd

# initialization: one row per known SKU, all zeros
df = pd.DataFrame(0, index=['1234', '5678', '4321', '2468'],
                  columns=['total_cost', 'quantity'])
print(df)

# updating a row
df.loc['1234', ['total_cost', 'quantity']] = [100, 4]
print(df)

# incrementing quantity
df.loc['1234', 'quantity'] += 5
print(df)
      total_cost  quantity
1234           0         0
5678           0         0
4321           0         0
2468           0         0

      total_cost  quantity
1234         100         4
5678           0         0
4321           0         0
2468           0         0

      total_cost  quantity
1234         100         9
5678           0         0
4321           0         0
2468           0         0
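If the orders that actually occurred arrive as their own frame (shaped like (a) in the question), one way to fold them into the zero template is sketched below; the orders frame here is a made-up example:

import pandas as pd

# zero-filled template keyed by SKU
template = pd.DataFrame(0, index=['1234', '5678', '4321', '2468'],
                        columns=['quantity', 'total_cost'])

# hypothetical orders frame shaped like (a), indexed by SKU
orders = pd.DataFrame({'quantity': [10, 3], 'total_cost': [50.0, 75.0]},
                      index=['1234', '5678'])

# overwrite the matching rows in place; unordered SKUs stay at zero
template.update(orders)
print(template)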

Related

pandas average across dynamic number of columns

I have a dataframe as shown below

customer_id  revenue_m7  revenue_m8  revenue_m9  revenue_m10
          1        1234        1231        1256         1239
          2        5678        3425        3255         2345
I would like to do the below
a) get average of revenue for each customer based on latest two columns (revenue_m9 and revenue_m10)
b) get average of revenue for each customer based on latest four columns (revenue_m7, revenue_m8, revenue_m9 and revenue_m10)
So, I tried the below
df['revenue_mean_2m'] = (df['revenue_m10']+df['revenue_m9'])/2
df['revenue_mean_4m'] = (df['revenue_m10']+df['revenue_m9']+df['revenue_m8']+df['revenue_m7'])/4
df['revenue_mean_4m'] = df.mean(axis=1)  # I also tried this, but how do I restrict it to only two columns (and not all columns)?
But if I wish to compute the average over the past 12 months, it isn't elegant to write it this way. Is there a better or more efficient way, where I can just key in the number of columns to look back and it computes the average based on that input?
I expect my output to be as below:

customer_id  revenue_m7  revenue_m8  revenue_m9  revenue_m10  revenue_mean_2m  revenue_mean_4m
          1        1234        1231        1256         1239           1247.5          1240.00
          2        5678        3425        3255         2345           2800.0          3675.75
Use filter and slicing:
# keep only the "revenue_" columns
df2 = df.filter(like='revenue_')
# or
# df2 = df.filter(regex=r'revenue_m\d+')
# get last 2/4 columns and aggregate as mean
df['revenue_mean_2m'] = df2.iloc[:, -2:].mean(axis=1)
df['revenue_mean_4m'] = df2.iloc[:, -4:].mean(axis=1)
Output:
   customer_id  revenue_m7  revenue_m8  revenue_m9  revenue_m10  revenue_mean_2m  revenue_mean_4m
0            1        1234        1231        1256         1239           1247.5          1240.00
1            2        5678        3425        3255         2345           2800.0          3675.75
If the column order is not guaranteed, sort the columns with natural sorting:
# shuffle the DataFrame columns for demo
df = df.sample(frac=1, axis=1)
# filter and reorder the needed columns
from natsort import natsort_key
df2 = df.filter(regex=r'revenue_m\d+').sort_index(key=natsort_key, axis=1)
You could try something like this, in reference to this post:
n_months = 4  # you could also do this in a loop for all windows, e.g. range(1, 13)
df[f'revenue_mean_{n_months}m'] = df.filter(regex=r'revenue_m\d+').iloc[:, -n_months:].mean(axis=1)
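If you want several look-back windows at once, the same idea works in a loop; a small sketch (the window sizes here are just examples):

# columns like revenue_m7 ... revenue_m10, in their existing order
revenue_cols = df.filter(regex=r'revenue_m\d+').columns

for n_months in (2, 4):
    df[f'revenue_mean_{n_months}m'] = df[revenue_cols[-n_months:]].mean(axis=1)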

Complicated function with groupby and between? Python

Here is a sample dataset.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'VipNo': np.repeat(range(3), 2),
    'Quantity': np.random.randint(200, size=6),
    'OrderDate': np.random.choice(pd.date_range('1/1/2020', periods=365, freq='D'),
                                  6, replace=False)})
print(df)
So I have a couple of steps to do. I want to create a new column named qtywithin1mon/totalqty. First I want to group by VipNo (each number represents an individual), because a person may have made multiple purchases. Then I want to see if the OrderDate is within a certain range (let's say 2020/03/01 - 2020/03/31). If so, I want to use the respective quantity on that day divided by the total quantity this customer purchased. My dataset is big, so a customer may have ordered twice within the time range, and in that case I would want the sum of the two orders divided by the total quantity. How can I achieve this goal? I really have no idea where to start.
Thank you so much!
You can create a new column masking quantity within the given date range, then groupby:
start, end = pd.to_datetime(['2020/03/01', '2020/03/31'])

(df.assign(QuantitySub=df['OrderDate'].between(start, end) * df.Quantity)
   .groupby('VipNo')[['Quantity', 'QuantitySub']]
   .sum()
   .assign(output=lambda x: x['QuantitySub'] / x['Quantity'])
   .drop('QuantitySub', axis=1)
)
With a data frame:
   VipNo  Quantity  OrderDate
0      0       105 2020-01-07
1      0        56 2020-03-04
2      1       167 2020-09-05
3      1        18 2020-05-08
4      2       151 2020-11-01
5      2        14 2020-03-17
The output is:
       Quantity    output
VipNo
0           161  0.347826
1           185  0.000000
2           165  0.084848
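If you would rather keep the result as a new column on the original rows (matching the qtywithin1mon/totalqty name from the question), here is a sketch using transform with the same start and end dates:

# quantity that falls inside the date window, per row
qty_in_range = df['OrderDate'].between(start, end) * df['Quantity']

# per-customer ratio, broadcast back to every row of that customer
df['qtywithin1mon/totalqty'] = (
    qty_in_range.groupby(df['VipNo']).transform('sum')
    / df.groupby('VipNo')['Quantity'].transform('sum')
)
print(df)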

handling various currency strings pandas

I have a few hundred thousand rows of data with many different currency forms, some examples being:
116,319,545 SAR
40,381,846 CNY
57,712,170 CNY
158,073,425 RUB2
0 MYR
0 EUR
USD 110,169,240
These values are read into a DataFrame, and I am unsure what the best way (if there is a prebuilt way?) is to just get the integer value out of all the possible cases. There are probably more currencies in the data.
Currently the best approach I have is:
df1['value'].str.replace(r"[a-zA-Z,]",'').astype(int)
But this obviously fails with entries like 158,073,425 RUB2.
EDIT:
In addition to the working answer, it is also reasonable to expect the currency to be important; to extract it, the regex is ([A-Z]+\d*).
Given this df
df = pd.DataFrame()
df["col"] = ["116,319,545 SAR",
             "40,381,846 CNY",
             "57,712,170 CNY",
             "158,073,425 RUB2",
             "0 MYR",
             "0 EUR",
             "USD 110,169,240"]
You can use the regex (\d+) after removing commas to get
df.col.str.replace(",", "").str.extract(r'(\d+)').astype(int)
           0
0  116319545
1   40381846
2   57712170
3  158073425
4          0
5          0
6  110169240
Another, more manual, solution would be to split and replace:
df.col.str.split(' ').apply(
    lambda d: pd.Series(int(x.replace(",", "")) for x in d if x.replace(",", "").isdigit()).item()
)
0    116319545
1     40381846
2     57712170
3    158073425
4            0
5            0
6    110169240
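To also capture the currency code mentioned in the question's edit, here is a sketch using str.extract with the ([A-Z]+\d*) pattern (the value and currency column names are just illustrative):

df['currency'] = df.col.str.extract(r'([A-Z]+\d*)', expand=False)
df['value'] = (df.col.str.replace(',', '', regex=False)
                     .str.extract(r'(\d+)', expand=False)
                     .astype(int))
print(df)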

Python Pandas: Select Unique from df based on Rule

I have a pandas dataframe (originally generated from a sql query) that looks like:
index  AccountId  ItemID  EntryDate
1      100        1000    1/1/2016
2      100        1000    1/2/2016
3      100        1000    1/3/2016
4      101        1234    9/15/2016
5      101        1234    9/16/2016
etc....
I'd like to get this whittled down to a unique list, returning only the entry with the earliest date available, something like this:
index  AccountId  ItemID  EntryDate
1      100        1000    1/1/2016
4      101        1234    9/15/2016
etc....
Any pointers or direction for a fairly new pandas dev? The unique function doesn't appear to be able to handle these types of rules, and looping through the array and working out which one to drop seems like a lot of trouble for a simple task... Is there a function that I'm missing that does this?
Let's use groupby, idxmin, and .loc, after making sure EntryDate is a real datetime (so the minimum is the earliest date rather than the smallest string):
df2 = df.assign(EntryDate=pd.to_datetime(df['EntryDate']))
df_out = df2.loc[df2.groupby('AccountId')['EntryDate'].idxmin()]
print(df_out)
Output:
       AccountId  ItemID  EntryDate
index
1            100    1000 2016-01-01
4            101    1234 2016-09-15
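An equivalent route, if you find sorting easier to read than idxmin (again assuming EntryDate has been converted to a datetime):

df_out = (df2.sort_values('EntryDate')
             .drop_duplicates('AccountId', keep='first')
             .sort_index())
print(df_out)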

Pandas how to perform conditional selection on MultiIndex

Here is the sample data file, and I performed the following operation in ipython notebook:
!curl -O http://pbpython.com/extras/sales-funnel.xlsx

import numpy as np
import pandas as pd

df = pd.read_excel('./sales-funnel.xlsx')
df["Status"] = df["Status"].astype("category")
df["Status"] = df["Status"].cat.set_categories(["won", "pending", "presented", "declined"])

table = pd.pivot_table(df,
                       index=['Manager', 'Status'],
                       values=['Price', 'Quantity'],
                       columns=['Product'],
                       aggfunc={'Price': [np.sum, np.mean], 'Quantity': len},
                       fill_value=0)
This is what the data looks like in table:
I want to select (Manager=="Debra Henley") & (Status=="won") and it works with the query method:
table.query('(Manager=="Debra Henley") & (Status=="won")')
But how do you perform the same selection with loc? I tried this, but it does not work:
table.loc[['Debra Henley', 'won']]
What do you guys usually use when dealing with MultiIndex? What's the best way to do it?
Update: found two solutions so far:
table.xs(('Debra Henley','won'), level=('Manager', 'Status'))
table.loc[[('Debra Henley', 'won')]]
So I guess tuples should be used instead of lists when indexing with MultiIndex?
Your canonical answer is provided by @ScottBoston.
I'll add this for breadth and perspective, in addition to @jezrael's IndexSlice approach.
You can also use pd.DataFrame.xs to take a cross-section:
table.xs(('Debra Henley', 'won'))
                     Product
Quantity  len   CPU                1
                Maintenance        0
                Monitor            0
                Software           0
Price     mean  CPU            65000
                Maintenance        0
                Monitor            0
                Software           0
          sum   CPU            65000
                Maintenance        0
                Monitor            0
                Software           0
Name: (Debra Henley, won), dtype: int64
For simpler selections (only index or only columns), use the xs approach or select by tuples.
Another more general solution with slicers:
idx = pd.IndexSlice
# output is a DataFrame
print(table.loc[[idx['Debra Henley', 'won']]])
                    Quantity                            Price              \
                         len                             mean
Product                  CPU Maintenance Monitor Software   CPU Maintenance
Manager      Status
Debra Henley won           1           0       0        0 65000           0

                                     sum
Product      Monitor Software        CPU Maintenance Monitor Software
Manager      Status
Debra Henley won   0        0      65000           0       0        0
idx = pd.IndexSlice
# output is a Series
print(table.loc[idx['Debra Henley', 'won'], :])
Quantity  len   CPU                1
                Maintenance        0
                Monitor            0
                Software           0
Price     mean  CPU            65000
                Maintenance        0
                Monitor            0
                Software           0
          sum   CPU            65000
                Maintenance        0
                Monitor            0
                Software           0
Name: (Debra Henley, won), dtype: int64
But it is better for more complicated selections: if you need to filter the index and the columns together, a single xs doesn't work:
idx = pd.IndexSlice
# select all rows where the first index level is Debra Henley, and
# in the columns keep only the second-level labels len and sum
print(table.loc[idx['Debra Henley', :], idx[:, ['len', 'sum'], :]])
                       Quantity                            Price  \
                            len                              sum
Product                     CPU Maintenance Monitor Software   CPU
Manager      Status
Debra Henley won              1           0       0        0 65000
             pending          1           2       0        0 40000
             presented        1           0       0        2 30000
             declined         2           0       0        0 70000

Product                Maintenance Monitor Software
Manager      Status
Debra Henley won                 0       0        0
             pending         10000       0        0
             presented           0       0    20000
             declined            0       0        0
Yes, you can use:
table.loc[[('Debra Henley', 'won')]]
to return a pandas DataFrame, or you can use:
table.loc[('Debra Henley', 'won')]
to return a pandas Series.
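A quick way to see the difference, assuming table was built as above:

print(type(table.loc[[('Debra Henley', 'won')]]))  # <class 'pandas.core.frame.DataFrame'>
print(type(table.loc[('Debra Henley', 'won')]))    # <class 'pandas.core.series.Series'>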
You can refer to the pandas documentation on MultiIndex / advanced indexing.
