I have a pandas dataframe (originally generated from a SQL query) that looks like:
index AccountId ItemID EntryDate
1 100 1000 1/1/2016
2 100 1000 1/2/2016
3 100 1000 1/3/2016
4 101 1234 9/15/2016
5 101 1234 9/16/2016
etc....
I'd like to get this whittled down to a unique list, returning only the entry with the earliest date available, something like this:
index AccountId ItemID EntryDate
1 100 1000 1/1/2016
4 101 1234 9/15/2016
etc....
Any pointers or direction for a fairly new pandas dev? The unique function doesn't appear to be able to handle these types of rules, and looping through the array and working out which one to drop seems like a lot of trouble for a simple task... Is there a function that I'm missing that does this?
Let's use groupby, idxmin, and .loc (with EntryDate parsed as a datetime first):
df2 = df.assign(EntryDate=pd.to_datetime(df['EntryDate']))
df_out = df2.loc[df2.groupby('AccountId')['EntryDate'].idxmin()]
print(df_out)
Output:
AccountId ItemID EntryDate
index
1 100 1000 2016-01-01
4 101 1234 2016-09-15
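An alternative sketch of the same idea, assuming EntryDate has already been parsed as a datetime as above: sort by date, then keep the first row per account.
df_out = (df2.sort_values('EntryDate')
             .drop_duplicates('AccountId', keep='first')
             .sort_index())  # restore the original row order
Both approaches return one row per AccountId, the one with the earliest EntryDate.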
I have a dataframe called data that looks like this:
org_id  commit_date  commit_amt
123     2020-06-01   50000
123     2020-06-01   50000
123     2021-06-01   60000
234     2019-07-01   30000
234     2020-07-01   40000
234     2021-07-01   50000
I want the dataframe to look like this:
org_id  date_1      date_2      date_3      amt_1  amt_2  amt_3
123     2020-06-01  2021-06-01  2022-06-01  50000  50000  60000
234     2019-07-01  2020-07-01  2021-07-01  30000  40000  50000
I've gotten the date columns and org_id column by:
dates = data.groupby('org_id').apply(lambda x: x['commit_date'].unique())  # get all unique commit_date values for the org_id
dates = dates.apply(pd.Series)  # put each unique commit_date into its own column, NaN if the org_id doesn't have enough commit_dates
c_dates = pd.DataFrame()  # create empty dataframe
c_dates['org_id'] = dates.index  # I had to specify each column because the dates df was too hard to work with
c_dates['date_1'] = dates[0].values.tolist()
c_dates['date_2'] = dates[1].values.tolist()
c_dates['date_3'] = dates[2].values.tolist()
I cannot figure out how to get the amt_1, amt_2, and amt_3 columns. I can't just repeat the date-columns code because it would miss the repeated 50000 for org_id 123. And because the c_dates dataframe does not match the length of the original data dataframe, I can't just compare c_dates to data.
EXCITING UPDATE!
I haven't totally solved my problem yet, but I have made a bit of progress:
dates = data.groupby(['org_id','commit_amt']).apply(lambda x: x['commit_date'].unique())  # get all unique commit_date values for each org_id / commit_amt pair
dates = dates.apply(pd.Series)  # put each unique commit_date into its own column, NaN if the group doesn't have enough commit_dates
gives me the data I want; however, it is not formatted how I want. It gives results that look like:
org_id  commit_amt
123     50000       2020-06-01  2021-06-01
123     60000       2022-06-01
234     30000       2019-07-01
234     40000       2020-07-01
234     50000       2021-07-01
I would appreciate any help in getting me to the format I want. I ultimately want to be able to take the difference between amt_1 and amt_2, etc.
Hope this makes sense.
P.S. Thanks to the hero who edited this thereby teaching me how to make tables!
EXCITINGER NEWS!! I HAVE SOLVED MY PROBLEM!!!
Long story short, the function I needed was unstack. I am tired now but tomorrow, I will edit this with the solution! w00t!
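In the meantime, here is a sketch of what an unstack-based approach could look like. The sample data is reconstructed from the question, and numbering each commitment with cumcount is an assumption about the intended layout; unlike the unique-based attempt, it keeps the repeated 50000 row for org_id 123.
import pandas as pd

# hypothetical reconstruction of the question's data
data = pd.DataFrame({
    'org_id': [123, 123, 123, 234, 234, 234],
    'commit_date': ['2020-06-01', '2020-06-01', '2021-06-01',
                    '2019-07-01', '2020-07-01', '2021-07-01'],
    'commit_amt': [50000, 50000, 60000, 30000, 40000, 50000]})

# number each commitment within its org_id, then unstack to one row per org_id
wide = (data.assign(n=data.groupby('org_id').cumcount() + 1)
            .set_index(['org_id', 'n'])
            .unstack('n'))

# flatten the (column, n) MultiIndex into date_1 / amt_1 style names
wide.columns = [f"{'date' if col == 'commit_date' else 'amt'}_{n}"
                for col, n in wide.columns]
wide = wide.reset_index()
The amt_1, amt_2, ... columns can then be differenced directly, e.g. wide['amt_2'] - wide['amt_1'].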
I think you can use pandas.pivot() for reshaping your data, but one problem with using pivot() is that you must not have duplicated values.
So first drop the duplicated rows, then use pivot:
data = data.drop_duplicates()
data.pivot(index='org_id', columns=['commit_amt'], values=['commit_date'])
I have a dataframe like as shown below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10
1 1234 1231 1256 1239
2 5678 3425 3255 2345
I would like to do the below
a) get average of revenue for each customer based on latest two columns (revenue_m9 and revenue_m10)
b) get average of revenue for each customer based on latest four columns (revenue_m7, revenue_m8, revenue_m9 and revenue_m10)
So, I tried the below
df['revenue_mean_2m'] = (df['revenue_m10']+df['revenue_m9'])/2
df['revenue_mean_4m'] = (df['revenue_m10']+df['revenue_m9']+df['revenue_m8']+df['revenue_m7'])/4
df['revenue_mean_4m'] = df.mean(axis=1)  # I also tried this, but how do I do it for only two columns (and not all columns)?
But if I wish to compute the average over the past 12 months, it would not be elegant to write it this way. Is there a better or more efficient way, where I can just key in the number of columns to look back and it computes the average based on that input?
I expect my output to be like as below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 revenue_mean_2m revenue_mean_4m
1 1234 1231 1256 1239 1247.5 1240
2 5678 3425 3255 2345 2800 3675.75
Use filter and slicing:
# keep only the "revenue_" columns
df2 = df.filter(like='revenue_')
# or
# df2 = df.filter(regex=r'revenue_m\d+')
# get last 2/4 columns and aggregate as mean
df['revenue_mean_2m'] = df2.iloc[:, -2:].mean(axis=1)
df['revenue_mean_4m'] = df2.iloc[:, -4:].mean(axis=1)
Output:
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 \
0 1 1234 1231 1256 1239
1 2 5678 3425 3255 2345
revenue_mean_2m revenue_mean_4m
0 1247.5 1240.00
1 2800.0 3675.75
If column order is not guaranteed, sort the columns with natural sorting:
# shuffle the DataFrame columns for demo
df = df.sample(frac=1, axis=1)
# filter and reorder the needed columns
from natsort import natsort_key
df2 = df.filter(regex=r'revenue_m\d+').sort_index(key=natsort_key, axis=1)
You could try something like this, in reference to this post:
n_months = 4  # you could also do this in a loop for all months: range(1, 12)
df[f'revenue_mean_{n_months}m'] = df.iloc[:, -n_months:].mean(axis=1)
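If you do run it in a loop, note that the slice should be taken from the raw revenue columns only, otherwise the newly added mean columns shift the positions. A sketch, assuming the revenue_mXX columns are already in chronological order:
# compute a trailing mean for every look-back window from 1 to 4 months;
# the regex keeps only the raw revenue_mXX columns, not the derived mean columns
for n_months in range(1, 5):
    df[f'revenue_mean_{n_months}m'] = (df.filter(regex=r'revenue_m\d+$')
                                         .iloc[:, -n_months:]
                                         .mean(axis=1))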
Is it possible to conditionally append data to an existing template dataframe? I'll try to make the data below as simple as possible, since I'm asking more for conceptual help than actual code so I better understand the mindset of solving these kinds of problems in the future (but actual code would be great too).
Example Data
I have a dataframe below that shows 4 dummy products SKUs that a client may order. These SKUs never change. Sometimes a client orders large quantities of each SKU, and sometimes they only order one or two SKUs. Due to reporting, I need to fill unordered SKUs with zeroes (probably use ffill?)
Dummy dataframe DF
product_sku  quantity  total_cost
1234
5678
4321
2468
Problem
Currently, my data only returns the SKUs that customers have ordered (a), but I would like unordered SKUs to be returned, with zeros filled in for quantity and total_cost (b)
(a)
product_sku  quantity  total_cost
1234         10        50.00
5678         3         75.00
(b)
product_sku  quantity  total_cost
1234         10        50.00
5678         3         75.00
4321         0         0
2468         0         0
I'm wondering if there's a way to take that existing dataframe, and simply append any sales that actually occurred, leaving the unordered SKUs as zero or blank (whatever makes more sense).
I just need some help thinking through the steps logically, and wasn't able to find anything like this. I'm still relatively novice at this stuff, so let me know if I'm missing any pertinent information.
Thanks!
One way is to use reindex after putting the column with the product SKUs as the index with set_index. With your notation it would be something like:
l_products = DF['product_sku'].tolist() #you may have the list differently
b = (a.set_index('product_sku')
.reindex(l_products, fill_value=0)
.reset_index()
)
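For instance, with the two frames from the question reconstructed here as an assumption (DF holding the full SKU list, a holding the actual orders), this gives:
import pandas as pd

# hypothetical reconstruction of the question's frames
DF = pd.DataFrame({'product_sku': [1234, 5678, 4321, 2468]})
a = pd.DataFrame({'product_sku': [1234, 5678],
                  'quantity': [10, 3],
                  'total_cost': [50.00, 75.00]})

l_products = DF['product_sku'].tolist()
b = (a.set_index('product_sku')
      .reindex(l_products, fill_value=0)
      .reset_index())
print(b)
#    product_sku  quantity  total_cost
# 0         1234        10        50.0
# 1         5678         3        75.0
# 2         4321         0         0.0
# 3         2468         0         0.0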
If you know the SKUs a priori, maintain one DataFrame initialized with zeros and update the relevant rows. Then you will always have all SKUs.
For example:
import pandas as pd
# initialization
df = pd.DataFrame(0, index=['1234', '5678', '4321', '2468'],
                  columns=['total_cost', 'quantity'])
print(df)
# updating
df.loc['1234', :] = {'total_cost': 100, 'quantity': 4}
print(df)
# incrementing quantity
df.loc['1234', 'quantity'] += 5
print(df)
total_cost quantity
1234 0 0
5678 0 0
4321 0 0
2468 0 0
total_cost quantity
1234 100 4
5678 0 0
4321 0 0
2468 0 0
total_cost quantity
1234 100 9
5678 0 0
4321 0 0
2468 0 0
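If new orders arrive as their own small frame, a sketch of one way to fold them into the running totals (assuming the orders are indexed by SKU like the template) is DataFrame.add with fill_value:
# hypothetical batch of new orders, indexed by SKU
new_orders = pd.DataFrame({'total_cost': [25.0, 15.0], 'quantity': [1, 3]},
                          index=['2468', '5678'])

# align on index and columns; SKUs missing from either side count as 0
df = df.add(new_orders, fill_value=0)
print(df)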
I have results from an A/B test that I need to evaluate, but while checking the data I noticed that there were users who appeared in both groups, and I need to drop them so they don't skew the test. My data looks something like this:
transactionId visitorId date revenue group
0 906125958 0 2019-08-16 10.8 B
1 1832336629 1 2019-08-04 25.9 B
2 3698129301 2 2019-08-01 165.7 B
3 4214855558 2 2019-08-07 30.5 A
4 797272108 3 2019-08-23 100.4 A
What I need to do is remove every user that was in both A and B groups while leaving the rest intact. So from the example data I need this output:
transactionId visitorId date revenue group
0 906125958 0 2019-08-16 10.8 B
1 1832336629 1 2019-08-04 25.9 B
4 797272108 3 2019-08-23 100.4 A
I tried to do it in various ways but I can't seem to figure it out, and I couldn't find an answer for it anywhere. I would really appreciate some help here,
thanks in advance
You can get a list of users that are in just one group like this:
group_counts = df.groupby('visitorId').agg({'group': 'nunique'}) ##list of users with number of groups
to_include = group_counts[group_counts['group'] == 1] ##filter for just users in 1 group
And then filter your original data according to which visitors are in that list:
df = df[df['visitorId'].isin(to_include.index)]
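Equivalently, the same logic can be written in a single step with a transform (a compact sketch, no intermediate frame needed):
# keep only rows whose visitorId appears in exactly one group
df = df[df.groupby('visitorId')['group'].transform('nunique') == 1]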
Here is a sample dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'VipNo': np.repeat(range(3), 2),
    'Quantity': np.random.randint(200, size=6),
    'OrderDate': np.random.choice(pd.date_range('1/1/2020', periods=365, freq='D'), 6, replace=False)})
print(df)
So I have a couple of steps to do. I want to create a new column named qtywithin1mon/totalqty. First I want to group the VipNo (each number represents an individual) because a person may have made multiple purchases. Then I want to see if the orderdate is within a certain range (let's say 2020/03/01 - 2020/03/31). If so, I want to use the respective quantity on that day divided by the total quantity this customer purchased. My dataset is big so a customer may have ordered twice within the time range and I would want the sum of the two orders divided by the total quantity in this case. How can I achieve this goal? I really have no idea where to start..
Thank you so much!
You can create a new column masking quantity within the given date range, then groupby:
start, end = pd.to_datetime(['2020/03/01','2020/03/31'])
(df.assign(QuantitySub=df['OrderDate'].between(start,end)*df.Quantity)
.groupby('VipNo')[['Quantity','QuantitySub']]
.sum()
.assign(output=lambda x: x['QuantitySub']/x['Quantity'])
.drop('QuantitySub', axis=1)
)
With a data frame:
VipNo Quantity OrderDate
0 0 105 2020-01-07
1 0 56 2020-03-04
2 1 167 2020-09-05
3 1 18 2020-05-08
4 2 151 2020-11-01
5 2 14 2020-03-17
The output is:
Quantity output
VipNo
0 161 0.347826
1 185 0.000000
2 165 0.084848
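If the goal is instead to attach the ratio back onto every row of the original frame under the column name from the question, a sketch along these lines (assuming the same start/end bounds as above) would work:
# per-VipNo ratio of quantity ordered inside the window to total quantity
ratio = (df.assign(QuantitySub=df['OrderDate'].between(start, end) * df['Quantity'])
           .groupby('VipNo')
           .apply(lambda g: g['QuantitySub'].sum() / g['Quantity'].sum()))

# map it back onto each row
df['qtywithin1mon/totalqty'] = df['VipNo'].map(ratio)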