selecting rows from a dataframe - python

I'm working with a dataframe that has 20 columns, but I'm only going to use three of them for my task, named "Price", "Retail" and "Profit":
cols = ['Price', 'Retail','Profit']
df3.loc[:,cols]
Price Retail Profit
0 861.5 1315.233051 453.733051
1 901.5 1315.233051 413.733051
2 911.0 1315.233051 404.233051
3 901.5 1315.233051 413.733051
4 901.5 1315.233051 413.733051
... ... ... ...
2678 14574.0 21546.730769 6972.730769
2679 35708.5 52026.764706 16318.264706
2680 35708.5 52026.764706 16318.264706
2681 163276.5 250882.500000 87606.000000
2682 7369.5 11785.729730 4416.229730
2683 rows × 3 columns
My goal is to find the rows where the price is lower than 5000 and sort them by the largest profit values. How can I do that?

You can, for example, use query and sort_values:
df3.query("Price < 5000").sort_values("Profit", ascending=False)

You can do this:
df_profit_less_5000 = df[df['Price'] < 5000]

You can start by subsetting the dataframe where price is lower than 5000:
df = df[df["Price"] < 5000]
Then sort by Profit descending:
df = df.sort_values(by=["Profit"], ascending=False)
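Putting it together, here is a minimal sketch (assuming the df3 and cols from the question) that chains the filter and the sort; nlargest is an alternative when you only need the top rows:
# filter rows with Price below 5000, then sort by Profit descending
result = df3.loc[df3['Price'] < 5000, cols].sort_values('Profit', ascending=False)
# alternative: the 10 most profitable matching rows, without a full sort
top10 = df3.loc[df3['Price'] < 5000, cols].nlargest(10, 'Profit')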

Related

pandas average across dynamic number of columns

I have a dataframe as shown below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10
1 1234 1231 1256 1239
2 5678 3425 3255 2345
I would like to do the below
a) get average of revenue for each customer based on latest two columns (revenue_m9 and revenue_m10)
b) get average of revenue for each customer based on latest four columns (revenue_m7, revenue_m8, revenue_m9 and revenue_m10)
So, I tried the below
df['revenue_mean_2m'] = (df['revenue_m10']+df['revenue_m9'])/2
df['revenue_mean_4m'] = (df['revenue_m10']+df['revenue_m9']+df['revenue_m8']+df['revenue_m7'])/4
df['revenue_mean_4m'] = df.mean(axis=1) # I also tried this, but how do I do it for only some of the columns (and not all columns)?
But if I want to compute the average over the past 12 months, it is not elegant to write it this way. Is there a better or more efficient way, where I can just key in the number of columns to look back and it computes the average based on that input?
I expect my output to be as below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 revenue_mean_2m revenue_mean_4m
1 1234 1231 1256 1239 1867 1240
2 5678 3425 3255 2345 2800 3675.75
Use filter and slicing:
# keep only the "revenue_" columns
df2 = df.filter(like='revenue_')
# or
# df2 = df.filter(regex=r'revenue_m\d+')
# get last 2/4 columns and aggregate as mean
df['revenue_mean_2m'] = df2.iloc[:, -2:].mean(axis=1)
df['revenue_mean_4m'] = df2.iloc[:, -4:].mean(axis=1)
Output:
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 \
0 1 1234 1231 1256 1239
1 2 5678 3425 3255 2345
revenue_mean_2m revenue_mean_4m
0 1247.5 1240.00
1 2800.0 3675.75
If the column order is not guaranteed, sort the columns with natural sorting:
# shuffle the DataFrame columns for demo
df = df.sample(frac=1, axis=1)
# filter and reorder the needed columns
from natsort import natsort_keygen
# natsort_keygen() produces a key usable by pandas sort functions (natsort >= 8.1)
df2 = df.filter(regex=r'revenue_m\d+').sort_index(key=natsort_keygen(), axis=1)
You could try something like this, in reference to this post:
n_months = 4 # you could also do this in a loop over all months: range(1, 13)
df[f'revenue_mean_{n_months}m'] = df.iloc[:, -n_months:].mean(axis=1)
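To generalize the look-back, here is a minimal sketch (assuming the revenue_m columns from the question are in chronological order) that computes the mean for any set of window sizes; slicing the filtered frame df2 rather than df means newly added mean columns don't shift the window:
df2 = df.filter(regex=r'revenue_m\d+') # only the monthly revenue columns
for n in (2, 4): # key in any look-back windows here, e.g. range(1, 13)
    df[f'revenue_mean_{n}m'] = df2.iloc[:, -n:].mean(axis=1)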

python pandas change order and column name after merge

I have merged two dataframes with multiple overlapping columns. I would like to put the columns side by side.
merge = df1.merge(df2)
For example, Current Output:
YEAR_x,DATE_x,MAX_x,MIN_x,YEAR_y,DATE_y,MAX_y,MIN_y
I want the output to be:
YEAR, YEAR_auto, DATE, DATE_auto, MAX, MAX_auto, MIN, MIN_auto
I have more than 150 columns so I don't want to do it manually. How could I do that?
Use pd.merge with the suffixes parameter:
merge = df1.merge(df2[list(set(df2) & set(df1))], suffixes=('', '_auto'))
To sort your columns to match df1's order:
cols = sorted(merge.columns, key=lambda x: df1.columns.get_loc(x.split('_')[0]))
Example:
>>> merge
YEAR DATE MAX MIN YEAR_auto DATE_auto MAX_auto MIN_auto
0 2021 2021-08-06 100 0 2020 2020-08-06 50 20
>>> merge[cols]
YEAR YEAR_auto DATE DATE_auto MAX MAX_auto MIN MIN_auto
0 2021 2020 2021-08-06 2020-08-06 100 50 0 20
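If the original column names can themselves contain underscores, splitting on '_' picks the wrong prefix; a variant (assuming Python 3.9+ for str.removesuffix) strips only the known suffix instead:
cols = sorted(merge.columns, key=lambda x: df1.columns.get_loc(x.removesuffix('_auto')))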

Pandas how to get rows with consecutive dates and sales more than 1000?

I have a data frame called df:
Date Sales
01/01/2020 812
02/01/2020 981
03/01/2020 923
04/01/2020 1033
05/01/2020 988
... ...
How can I get the first occurrence of 7 consecutive days with sales above 1000?
This is what I am doing to find the rows where sales are above 1000:
In [221]: df.loc[df["Sales"] >= 1000]
Out [221]:
Date Sales
04/01/2020 1033
08/01/2020 1008
09/01/2020 1091
17/01/2020 1080
18/01/2020 1121
19/01/2020 1098
... ...
You can assign a unique identifier to each run of consecutive days, group by it, and take the first rows per group (after first filtering to values above 1000):
df = df.query('Sales > 1000').copy()
df['grp_date'] = df.Date.diff().dt.days.fillna(1).ne(1).cumsum()
df.groupby('grp_date').head(7).reset_index(drop=True)
where you can change the head parameter to take the first n rows of each run of consecutive days.
Note: you may need to use pd.to_datetime(df.Date, format='%d/%m/%Y') to convert dates from strings to pandas datetime, and sort them.
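If you specifically want the first run of at least 7 consecutive qualifying days (rather than the first rows of every run), here is a minimal sketch building on the grp_date column above:
sizes = df.groupby('grp_date').size()
first_id = sizes[sizes >= 7].index[0] # assumes at least one such run exists
first_run = df[df['grp_date'] == first_id].head(7)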
Couldn't you just sort by date and grab the head 7?
df = df.sort_values('Date')
df.loc[df["Sales"] >= 1000].head(7)
If you need the original dataframe afterwards, maybe make a copy first.

Pandas get the Month Ending Values from Series

I need to get the month-end balance from a series of entries.
Sample data:
date contrib totalShrs
0 2009-04-23 5220.00 10000.000
1 2009-04-24 10210.00 20000.000
2 2009-04-27 16710.00 30000.000
3 2009-04-30 22610.00 40000.000
4 2009-05-05 28909.00 50000.000
5 2009-05-20 38409.00 60000.000
6 2009-05-28 46508.00 70000.000
7 2009-05-29 56308.00 80000.000
8 2009-06-01 66108.00 90000.000
9 2009-06-02 78108.00 100000.000
10 2009-06-12 86606.00 110000.000
11 2009-08-03 95606.00 120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple Pandas method?
I don't see how I can do this with something like a groupby?
Or would I have to do something like iterrows, find all the monthly entries, order them by date and pick the last one?
Thanks.
Use Grouper with GroupBy.last, forward fill missing values with ffill, and finish with Series.reset_index:
#if necessary
#df['date'] = pd.to_datetime(df['date'])
df = df.groupby(pd.Grouper(freq='M', key='date'))['totalShrs'].last().ffill().reset_index()
#alternative
#df = df.resample('M', on='date')['totalShrs'].last().ffill().reset_index()
print (df)
date totalShrs
0 2009-04-30 40000.0
1 2009-05-31 80000.0
2 2009-06-30 110000.0
3 2009-07-31 110000.0
4 2009-08-31 120000.0
The following gives you the information you want, i.e. end-of-month values, though the format is not exactly what you asked for:
df['month'] = df['date'].str.split('-', expand=True)[1] # split the date string to get a month column
newdf = pd.DataFrame(columns=df.columns) # create a new dataframe for the output
grouped = df.groupby('month') # group the rows by month
for g in grouped: # for each group, take the last row
    gdf = pd.DataFrame(data=g[1])
    newdf.loc[len(newdf), :] = gdf.iloc[-1, :] # fill the new dataframe with the last row of the group
newdf = newdf.drop('date', axis=1) # drop the date column, since the month column is there
print(newdf)
Output:
contrib totalShrs month
0 22610 40000 04
1 56308 80000 05
2 86606 110000 06
3 95606 120000 08
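Note that grouping on the month alone merges identical months from different years; a sketch that groups by year and month together (assuming df['date'] has been converted with pd.to_datetime) avoids this:
last_per_month = (df.sort_values('date')
                    .groupby(df['date'].dt.to_period('M'))
                    .last())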

Need to add values in a column corresponding to the week (Python/Pandas)

I have a dataframe containing dates and prices. I need to sum all prices belonging to a given week, e.g. 17/12 to 23/12, and put the total next to a new label corresponding to that week.
Date Price
12/17/2015 10
12/18/2015 20
12/19/2015 30
12/21/2015 40
12/24/2015 50
I want the output to be the following
week total
17/12-23/12 100
24/12-30/12 50
I tried various datetime and groupby functions but was not able to get this output. Please help.
What about this approach?
In [19]: df.groupby(df.Date.dt.isocalendar().week)['Price'].sum().rename_axis('week_no').reset_index(name='total')
Out[19]:
week_no total
0 51 60
1 52 90
UPDATE:
In [49]: df.resample(on='Date', rule='7D', origin='start_day').sum().rename_axis('week_from').reset_index()
Out[49]:
week_from Price
0 2015-12-17 100
1 2015-12-24 50
UPDATE2:
x = (df.resample(on='Date', rule='7D', origin='start_day')
       .sum()
       .reset_index()
       .rename(columns={'Price': 'total'}))
x = x.assign(week=x['Date'].dt.strftime('%d/%m')
             + '-'
             + (x.pop('Date') + pd.DateOffset(days=7)).dt.strftime('%d/%m'))
In [127]: x
Out[127]:
total week
0 100 17/12-24/12
1 50 24/12-31/12
Using resample:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').resample('W').sum()
Price
Date
2015-12-20 60
2015-12-27 90
Note that 'W' uses Monday-to-Sunday weeks labeled by the Sunday, which is why these totals differ from the 17/12-23/12 windows in the question; the 7D rule anchored at the first date (see the updates above) matches those windows.
