I have a dataframe like the one shown below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10
1 1234 1231 1256 1239
2 5678 3425 3255 2345
I would like to do the following:
a) get the average revenue for each customer based on the latest two columns (revenue_m9 and revenue_m10)
b) get the average revenue for each customer based on the latest four columns (revenue_m7, revenue_m8, revenue_m9 and revenue_m10)
So, I tried the below:
df['revenue_mean_2m'] = (df['revenue_m10']+df['revenue_m9'])/2
df['revenue_mean_4m'] = (df['revenue_m10']+df['revenue_m9']+df['revenue_m8']+df['revenue_m7'])/4
df['revenue_mean_4m'] = df.mean(axis=1) # I also tried this, but how do I do it for only two columns (and not all columns)?
But if I wish to compute the average over the past 12 months, it is not elegant to write it this way. Is there a better or more efficient way to write this, where I can just key in the number of columns to look back and have the average computed from that input?
I expect my output to be like the below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 revenue_mean_2m revenue_mean_4m
1 1234 1231 1256 1239 1247.5 1240
2 5678 3425 3255 2345 2800 3675.75
Use filter and slicing:
# keep only the "revenue_" columns
df2 = df.filter(like='revenue_')
# or
# df2 = df.filter(regex=r'revenue_m\d+')
# get last 2/4 columns and aggregate as mean
df['revenue_mean_2m'] = df2.iloc[:, -2:].mean(axis=1)
df['revenue_mean_4m'] = df2.iloc[:, -4:].mean(axis=1)
Output:
   customer_id  revenue_m7  revenue_m8  revenue_m9  revenue_m10  revenue_mean_2m  revenue_mean_4m
0            1        1234        1231        1256         1239           1247.5          1240.00
1            2        5678        3425        3255         2345           2800.0          3675.75
If the column order is not guaranteed, sort them with natural sorting:
# shuffle the DataFrame columns for demo
df = df.sample(frac=1, axis=1)
# filter and reorder the needed columns
from natsort import natsort_key
df2 = df.filter(regex=r'revenue_m\d+').sort_index(key=natsort_key, axis=1)
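If you would rather not add the natsort dependency, a rough equivalent using only pandas and the standard library, assuming the columns follow the revenue_m<number> pattern:
df2 = df.filter(regex=r'revenue_m\d+')
df2 = df2[sorted(df2.columns, key=lambda c: int(c.rsplit('m', 1)[-1]))]  # order by the month number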
In reference to the post above, you could try something like this (assuming the revenue columns are the trailing columns of df; otherwise slice the filtered df2 instead):
n_months = 4  # number of trailing months to average; see the loop sketch below for all window sizes
df[f'revenue_mean_{n_months}m'] = df.iloc[:, -n_months:].mean(axis=1)
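For completeness, a rough sketch of the loop idea mentioned in the comment above, assuming df2 is the filtered frame holding only the revenue_m* columns in chronological order (oldest to newest):
# one trailing mean per window size, from 2 months up to all available months
for n_months in range(2, df2.shape[1] + 1):
    df[f'revenue_mean_{n_months}m'] = df2.iloc[:, -n_months:].mean(axis=1)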
I'm working with a dataframe which has 20 columns, but I'm only going to use three of them in my task, named "Price", "Retail" and "Profit". They look like this:
cols = ['Price', 'Retail','Profit']
df3.loc[:,cols]
Price Retail Profit
0 861.5 1315.233051 453.733051
1 901.5 1315.233051 413.733051
2 911.0 1315.233051 404.233051
3 901.5 1315.233051 413.733051
4 901.5 1315.233051 413.733051
... ... ... ...
2678 14574.0 21546.730769 6972.730769
2679 35708.5 52026.764706 16318.264706
2680 35708.5 52026.764706 16318.264706
2681 163276.5 250882.500000 87606.000000
2682 7369.5 11785.729730 4416.229730
2683 rows × 3 columns
My goal is to find the rows where the price is lower than 5000 and sort them by profit, largest first. How can I do that?
You can, for example, use query and sort_values:
df3.query("Price < 5000").sort_values("Profit", ascending=False)
You can do this:
df_profit_less_5000 = df[df['Price'] < 5000]
df_profit_less_5000 = df_profit_less_5000.sort_values('Profit', ascending=False)
You can start by subsetting the dataframe where price is lower than 5000:
df = df[df["Price"] < 5000]
Then sort by Profit descending:
df = df.sort_values(by=["Profit"], ascending=False)
I have a data frame called df:
Date Sales
01/01/2020 812
02/01/2020 981
03/01/2020 923
04/01/2020 1033
05/01/2020 988
... ...
How can I get the first occurrence of 7 consecutive days with sales above 1000?
This is what I am doing to find the rows where sales is above 1000:
In [221]: df.loc[df["Sales"] >= 1000]
Out [221]:
Date Sales
04/01/2020 1033
08/01/2020 1008
09/01/2020 1091
17/01/2020 1080
18/01/2020 1121
19/01/2020 1098
... ...
You can assign a unique identifier to each run of consecutive days, group by it, and take the first rows per group (after first filtering to values > 1000):
df = df.query('Sales > 1000').copy()  # keep only the days above 1000
df['grp_date'] = df.Date.diff().dt.days.fillna(1).ne(1).cumsum()  # new group id whenever the gap between dates is not exactly 1 day
df.groupby('grp_date').head(7).reset_index(drop=True)  # first 7 rows of each run of consecutive days
where you can change the head parameter to take the first n rows from each run of consecutive days.
Note: you may need to use pd.to_datetime(df.Date, format='%d/%m/%Y') to convert dates from strings to pandas datetime, and sort them.
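To answer the "first occurrence" part literally, a minimal sketch building on the same grouping idea; it assumes df is the original frame (before the filtering above), Date is a sorted datetime column, Sales is numeric, and above/long_runs/first_run are helper names used only here:
above = df[df['Sales'] > 1000].copy()  # keep only the qualifying days
above['grp'] = above['Date'].diff().dt.days.fillna(1).ne(1).cumsum()  # id per run of consecutive days
long_runs = above.groupby('grp').filter(lambda g: len(g) >= 7)  # keep only runs of at least 7 days
first_run = long_runs[long_runs['grp'] == long_runs['grp'].min()].head(7)  # earliest such run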
Couldn't you just sort by date and grab head 7?
df = df.sort_values('Date')
df.loc[df["sales"] >= 1000].head(7)
If you need the original, make a copy first.
I have the following dataframe in pandas
code tank noz_sale_cumsum noz_1_sub noz_2_sub noz_1_avg noz_2_avg noz_1_flag noz_2_flag
123 1 1234 12 23 23.23 32.45 short ok
123 2 1200 13 53 33.13 22.45 excess ok
Columns such as noz_1_sub, noz_2_sub, noz_1_avg, noz_2_avg, noz_1_flag and noz_2_flag are generated dynamically.
My desired dataframe would be the following.
code tank noz_no noz_sale_cumsum noz_sub noz_avg noz_flag
123 1 1 1234 12 23.23 short
123 1 2 1234 23 32.45 ok
123 2 1 1200 13 33.13 excess
123 2 2 1200 53 22.45 ok
I am doing the following in pandas.
First, I get all the dynamic columns in separate lists:
import re

cols_sub = [col for col in df.columns if re.search(r'noz_\d+_sub', col)]
cols_avg = [col for col in df.columns if re.search(r'noz_\d+_avg', col)]
cols_flag = [col for col in df.columns if re.search(r'noz_\d+_flag', col)]
final_df = df.pivot_table(index=['code', 'tank', 'noz_sale_cumsum'], columns=[cols_sub, cols_avg, cols_flag], values=[]).reset_index()
I am not sure about the values argument, or how to extract the number from the noz-like columns and put it under the noz_no column. Any help is appreciated.
You can use melt to convert everything to rows, then use pivot_table to convert some rows back to columns:
a = df.melt(id_vars=['code', 'tank', 'noz_sale_cumsum'])  # wide to long: one row per noz column per record
a['noz_no'] = a.variable.map(lambda x: x.split('_')[1])  # '1' from 'noz_1_sub'
a['kpi'] = a.variable.map(lambda x: 'noz_' + x.split('_')[2])  # 'noz_sub' from 'noz_1_sub'
a.pivot_table(
    values='value',
    index=['code', 'tank', 'noz_sale_cumsum', 'noz_no'],
    columns=['kpi'], aggfunc='first'
).reset_index()
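After the pivot, the columns index still carries the name 'kpi' and noz_no comes out as a string; an optional tidy-up, assuming the pivoted result above is assigned to a variable (called out here purely for illustration):
out.columns.name = None  # drop the leftover 'kpi' label on the columns
out['noz_no'] = out['noz_no'].astype(int)  # the string split produced '1', '2', ...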
I have a pandas dataframe (originally generated from a sql query) that looks like:
index AccountId ItemID EntryDate
1 100 1000 1/1/2016
2 100 1000 1/2/2016
3 100 1000 1/3/2016
4 101 1234 9/15/2016
5 101 1234 9/16/2016
etc....
I'd like to get this whittled down to a unique list, returning only the entry with the earliest date available, something like this:
index AccountId ItemID EntryDate
1 100 1000 1/1/2016
4 101 1234 9/15/2016
etc....
Any pointers or direction for a fairly new pandas dev? The unique function doesn't appear to be able to handle these types of rules, and looping through the array and working out which one to drop seems like a lot of trouble for a simple task... Is there a function that I'm missing that does this?
Let's use groupby, idxmin, and .loc:
df_out = df.loc[df.groupby('AccountId')['EntryDate'].idxmin()]
print(df_out)
Output:
AccountId ItemID EntryDate
index
1 100 1000 2016-01-01
4 101 1234 2016-09-15
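One caveat worth noting: idxmin picks the row with the smallest EntryDate, so if the column is still text (as it often is straight from a SQL query), '10/1/2016' would compare lower than '9/15/2016'. A short sketch of converting first, with the %m/%d/%Y format being an assumption based on the sample data:
import pandas as pd

df['EntryDate'] = pd.to_datetime(df['EntryDate'], format='%m/%d/%Y')  # compare real dates, not strings
df_out = df.loc[df.groupby('AccountId')['EntryDate'].idxmin()]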
I have a dataframe where one column has zip codes and the other has property values corresponding to each zip code. I want to sum up the property values for each zip code.
So, for example:
zip value
2210 $5,000
2130 $3,000
2210 $2,100
2345 $1,000
I would then add up the values
$5,000 + $2,100 = $7,100
and get the total property value for zipcode 2210 as $7,100.
Any help in this regard will be appreciated
You need:
df
zip value
0 2210 5000
1 2130 3000
2 2210 2100
3 2345 1000
df2 = df.groupby(['zip'])['value'].sum()
df2
zip
2130    3000
2210    7100
2345    1000
Name: value, dtype: int64
Also, you will need to remove the $ sign and the thousands separator from the column values. For that, you can use something along the lines of the following while reading the dataframe initially:
df = pd.read_csv('zip_value.csv', header=0, converters={'value': lambda x: float(x.replace('$', '').replace(',', ''))})
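If the values are already loaded as strings like '$5,000', a small sketch of cleaning the existing column in place instead of at read time (assuming the column is named value as above):
df['value'] = (
    df['value']
    .astype(str)
    .str.replace(r'[$,]', '', regex=True)  # strip the $ sign and thousands separators
    .astype(float)
)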
To reset the index after the groupby, use:
df2 = df.groupby(['zip'])['value'].sum().reset_index()
Then, to remove the rows with a particular zip value, say 2135, you need:
df3 = df2[df2['zip'] != 2135]