I want to group a dataframe by a column, then apply a cumsum over the other column, ordered by the first column descending.
df1:
id PRICE DEMAND
0 120 10
1 232 2
2 120 3
3 232 8
4 323 5
5 323 6
6 323 2
df2:
id PRICE DEMAND
0 323 13
1 232 23
2 120 36
I do it in two instructions, but I feel it can be done in a single one:
data = data.groupby('PRICE',as_index=False).agg({'DEMAND': 'sum'}).sort_values(by='PRICE', ascending=False)
data['DEMAND'] = data['DEMAND'].cumsum()
What you have seems perfectly fine to me. But if you want to chain everything together: first sort, then groupby with sort=False so the order doesn't change. Then you can sum within each group and cumsum the resulting Series:
(df.sort_values('PRICE', ascending=False)
   .groupby('PRICE', sort=False)['DEMAND'].sum()
   .cumsum()
   .reset_index())
PRICE DEMAND
0 323 13
1 232 23
2 120 36
Another option would be to sort then cumsum and then drop_duplicates:
(df.sort_values('PRICE', ascending=False)
   .set_index('PRICE')
   .DEMAND.cumsum()
   .reset_index()
   .drop_duplicates('PRICE', keep='last'))
PRICE DEMAND
2 323 13
4 232 23
6 120 36
I have a df:
import numpy as np
import seaborn as sns

df_test = sns.load_dataset("flights")
df_test['cat_2'] = np.random.choice(range(10), df_test.shape[0])
df_test.pivot_table(index='month',
                    columns='year',
                    values=['passengers', 'cat_2'])\
       .swaplevel(0, 1, axis=1)\
       .sort_index(axis=1, level=0)\
       .fillna(0)
I am trying to calculate, for both cat_2 and passengers, the difference of each year compared to the year before.
What is the best way to achieve this?
Desired output would look similar to this:
year 1949 1950 1951
cat_2 passengers % diff cat_2 passengers % diff cat_2 passengers % diff
month
Jan 6 112 0 6 115 115/112 6 90 90/115
Feb 0 118 0 6 126 126/118 6 150 150 / 126
Mar 2 132 0 7 141 7 141
Apr 0 129 0 9 135 9 135
May 5 121 0 4 125 4 125
Jun 1 135 0 3 149 3 149
Jul 6 148 0 5 170 5 170
Aug 5 148 0 2 170 2 170
Sep 1 136 0 4 158 4 158
Oct 5 119 0 5 133 5 133
Nov 0 104 0 1 114 1 114
Dec 7 118 0 1 140 1 140
I only showed the desired calculations for the passengers columns, but the same calculation method applies to cat_2 as well.
As there is nothing to compare the first year against, I filled its values with 0.
You can select the passengers columns with DataFrame.xs, divide them by the column-shifted values using DataFrame.shift, create a MultiIndex and append the result to the original DataFrame:
df_test = sns.load_dataset("flights")
df_test['cat_2'] = np.random.choice(range(10), df_test.shape[0])
df = df_test.pivot_table(index='month',
                         columns='year',
                         values=['passengers', 'cat_2'])\
            .swaplevel(0, 1, axis=1)\
            .sort_index(axis=1, level=0)\
            .fillna(0)
df1 = df.xs('passengers', axis=1, level=1)
df2 = pd.concat({'% diff': df1.div(df1.shift(axis=1)).fillna(0)}, axis=1)
out = (pd.concat([df, df2.swaplevel(axis=1)], axis=1)
         .sort_index(axis=1, level=0, sort_remaining=False))
Another idea is to shift the values by adding 1 to the year; then it is only necessary to remove the last all-NaN column with iloc:
df1 = df.xs('passengers', axis=1, level=1)
df2 = (pd.concat({'% diff': df1.div(df1.rename(columns=lambda x: x+1))}, axis=1)
         .fillna(0)
         .iloc[:, :-1])
out = (pd.concat([df, df2.swaplevel(axis=1)], axis=1)
         .sort_index(axis=1, level=0, sort_remaining=False))
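Both snippets above only build the % diff for passengers, while the question asks for cat_2 as well. A sketch extending the same div/shift idea to both value columns (it reuses the pivoted df from above; the '% diff ...' labels are just one possible naming):
import pandas as pd

ratios = {}
for col in ['passengers', 'cat_2']:
    part = df.xs(col, axis=1, level=1)
    ratios[f'% diff {col}'] = part.div(part.shift(axis=1)).fillna(0)

extra = pd.concat(ratios, axis=1).swaplevel(axis=1)
out = (pd.concat([df, extra], axis=1)
         .sort_index(axis=1, level=0, sort_remaining=False))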
I'm using Python pandas and have a data frame that is pulled from my CSV file:
ID Value
123 10
432 14
213 12
...
214 2
999 43
I want to randomly select some rows with the condition that the sum of the selected values = 30% of the total value.
Please advise how I should write this condition.
You can first shuffle the rows with sample, then filter with loc on the condition that the cumsum is ≤ 30% of the total:
out = df.sample(frac=1).loc[lambda d: d['Value'].cumsum().le(d['Value'].sum()*0.3)]
Example output:
ID Value
0 123 10
3 214 2
2 213 12
Intermediates:
ID Value cumsum ≤30%
0 123 10 10 True
3 214 2 12 True
2 213 12 24 True
1 432 14 38 False
4 999 43 81 False
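For readability the one-liner can be unpacked; random_state below is only added so the example is repeatable:
shuffled = df.sample(frac=1, random_state=0)                          # shuffle the rows
keep = shuffled['Value'].cumsum().le(shuffled['Value'].sum() * 0.3)   # running sum ≤ 30% of total
out = shuffled.loc[keep]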
I have the following problem and do not know how to solve it in a performant way:
Input Pandas DataFrame:
timestep  article  volume
      35        1      20
      37        2       5
     123        2      12
     155        3      10
     178        2      23
     234        1      17
     478        1      28
Output Pandas DataFrame:
timestep  volume
      35      20
      37      25
     123      32
     178      53
     234      50
     478      61
Calculation Example for timestep 478:
28 (last article 1 volume) + 23 (last article 2 volume) + 10 (last article 3 volume) = 61
What is the best way to do this in pandas?
Try with ffill:
#sort if needed
df = df.sort_values("timestep")
df["volume"] = (df["volume"].where(df["article"].eq(1)).ffill().fillna(0) +
                df["volume"].where(df["article"].eq(2)).ffill().fillna(0))
output = df.drop("article", axis=1)
>>> output
timestep volume
0 35 20.0
1 37 25.0
2 123 32.0
3 178 43.0
4 234 40.0
5 478 51.0
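Note that the snippet above hard-codes articles 1 and 2, so article 3 is ignored (hence 43/40/51 instead of the desired 53/50/61). A hedged generalization of the same forward-fill idea that works for any number of articles, assuming timesteps are unique:
# one column per article, forward-fill each article's last seen volume, then sum across articles
wide = (df.sort_values("timestep")
          .pivot(index="timestep", columns="article", values="volume")
          .ffill()
          .fillna(0))
output = wide.sum(axis=1).reset_index(name="volume")
Applied to the input table from the question, this reproduces the desired 53/50/61 and also yields a row for timestep 155 (volume 42), which the desired output happens to skip.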
Group by article, take the last element of each group, and sum:
df.groupby(['article']).tail(1)["volume"].sum()
You can assign a group number to each run of consecutive article values with .cumsum(). Then get the last value of the previous group by .map() with GroupBy.last(). Finally, add volume to this previous last value, as follows:
# Get group number of consecutive `article`
g = df['article'].ne(df['article'].shift()).cumsum()
# Add `volume` to previous group last
df['volume'] += g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
Result:
print(df)
timestep article volume
0 35 1 20
1 37 2 25
2 123 2 32
3 178 2 43
4 234 1 40
5 478 1 51
Breakdown of steps
Previous group last values:
g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
0 0
1 20
2 20
3 20
4 43
5 43
Name: article, dtype: int64
Try:
df["new_volume"] = (
    df.loc[df["article"] != df["article"].shift(-1), "volume"]
    .reindex(df.index, method='ffill')
    .shift()
    + df["volume"]
).fillna(df["volume"])
df
Output:
timestep article volume new_volume
0 35 1 20 20.0
1 37 2 5 25.0
2 123 2 12 32.0
3 178 2 23 43.0
4 234 1 17 40.0
5 478 1 28 51.0
Explained:
Find the last record of each group by comparing 'article' with the next row, then reindex that series to align with the original dataframe, forward-fill it, and shift it down so each row sees the previous group's last 'volume'. Add this to the current row's 'volume', and fill the first value (which has nothing before it) with the original 'volume'.
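Step by step on the six rows shown above (a sketch; the variable names are just for illustration, and the intermediate values follow directly from the code):
last_of_group = df.loc[df["article"] != df["article"].shift(-1), "volume"]
# last row of each consecutive article run -> index [0, 3, 5], values [20, 23, 28]
prev_group_last = last_of_group.reindex(df.index, method='ffill').shift()
# previous run's last volume, aligned to every row -> [NaN, 20, 20, 20, 23, 23]
df["new_volume"] = (prev_group_last + df["volume"]).fillna(df["volume"])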
How to .merge two DataFrames, matching one column against two columns?
The goal is to merge two DataFrames so that, for every campaign id in a reference table, I get the count of matching records in the data.
The issue is that .merge only compares one column with one column.
The data is messed up: for some rows there are id names rather than ids.
It works if I want to merge one column to one column, or two columns to two columns, but NOT one column to two columns.
Reference table
g_spend =
campaignid id_name cost
154 campaign1 15
155 campaign2 12
1566 campaign33 12
158 campaign4 33
Data
cw =
campaignid
154
154
155
campaign1
campaign33
1566
158
campaign1
campaign1
campaign33
campaign4
Desired output
g_spend =
campaignid id_name cost leads
154 campaign1 15 5
155 campaign2 12 0
1566 campaign33 12 3
158 campaign4 33 2
What I have done:
# This only works for one column
cw.head()
grouped_cw = cw.groupby(["campaignid"]).count()
grouped_cw.rename(columns={'reach':'leads'}, inplace=True)
grouped_cw = pd.DataFrame(grouped_cw)
# now merging
g_spend.campaignid = g_spend.campaignid.astype(str)
g_spend = g_spend.merge(grouped_cw, left_on='campaignid', right_index=True)
I would first set id_name as index in g_spend, then do a replace on cw, followed by a value_counts:
s = (cw.campaignid
       .replace(g_spend.set_index('id_name').campaignid)
       .value_counts()
       .to_frame('leads'))
g_spend = g_spend.merge(s, left_on='campaignid', right_index=True)
Output:
campaignid id_name cost leads
0 154 campaign1 15 5
1 155 campaign2 12 1
2 1566 campaign33 12 3
3 158 campaign4 33 2
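A hedged alternative that skips the inner merge, so campaigns with zero matching rows keep a leads count of 0 (variable names below are just for illustration):
# map id_name -> campaignid in the raw data, then count occurrences per campaign
normalized = cw['campaignid'].replace(g_spend.set_index('id_name')['campaignid'])
g_spend['leads'] = (g_spend['campaignid'].astype(str)
                    .map(normalized.astype(str).value_counts())
                    .fillna(0)
                    .astype(int))
Casting both sides to str before counting sidesteps the int/str mismatch coming from the mixed campaignid column.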
I have the following dataframe in pandas:
df
DAY YEAR REGION VALUE
1 2000 A 12
2 2000 A 10
3 2000 A 13
6 2000 A 15
1 2001 A 3
2 2001 A 40
3 2001 A 83
4 2001 A 95
1 2000 B 124
3 2000 B 102
5 2000 B 131
8 2000 B 150
1 2001 B 30
5 2001 B 4
8 2001 B 8
9 2001 B 12
I would like to create a new dataframe such that each row contains a distinct combination of YEAR and REGION. It should also contain a column which sums up VALUE for that YEAR/REGION combination, and another column with the maximum VALUE for that combination. The result should look like:
YEAR REGION SUM_VALUE MAX_VALUE
2000 A 50 15
2001 A 221 95
2000 B 507 150
2001 B 54 30
Here is what I am doing:
new_df = pandas.DataFrame()
for yr in df.YEAR.unique():
    for reg in df.REGION.unique():
        new_df = new_df.append({'YEAR': yr}, ignore_index=True)
        new_df = new_df.append({'REGION': reg}, ignore_index=True)
However, this creates a new row each time, and it is not very pythonic due to the extra for loops. Is there a better way to proceed?
Please note that this is a toy dataframe, the actual dataframe has several VALUE columns. The proposed solution should scale, without having to manually specify the names of the VALUE columns.
groupby on 'YEAR' and 'REGION' and pass a list of funcs to call using agg:
In [9]:
df.groupby(['YEAR','REGION'])['VALUE'].agg(['sum','max']).reset_index()
Out[9]:
YEAR REGION sum max
0 2000 A 50 15
1 2000 B 507 150
2 2001 A 221 95
3 2001 B 54 30
EDIT:
If you want to name the aggregated columns, pass a dict:
In [18]:
df.groupby(['YEAR','REGION'])['VALUE'].agg({'sum_VALUE':'sum','max_VALUE':'max'}).reset_index()
Out[18]:
YEAR REGION max_VALUE sum_VALUE
0 2000 A 15 50
1 2000 B 150 507
2 2001 A 95 221
3 2001 B 30 54
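Note that passing a dict of output-name -> function to agg on a single column (as above) was deprecated in later pandas versions and eventually removed. On pandas 0.25+ the same renaming can be done with named aggregation; a minimal sketch:
out = df.groupby(['YEAR', 'REGION']).agg(
    SUM_VALUE=('VALUE', 'sum'),
    MAX_VALUE=('VALUE', 'max'),
).reset_index()
If there are several VALUE columns and you don't want to list each one, df.groupby(['YEAR','REGION']).agg(['sum','max']) aggregates all remaining columns at once and returns MultiIndex columns that you can flatten afterwards.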