I have the following dataframe in pandas:
df
DAY YEAR REGION VALUE
1 2000 A 12
2 2000 A 10
3 2000 A 13
6 2000 A 15
1 2001 A 3
2 2001 A 40
3 2001 A 83
4 2001 A 95
1 2000 B 124
3 2000 B 102
5 2000 B 131
8 2000 B 150
1 2001 B 30
5 2001 B 4
8 2001 B 8
9 2001 B 12
I would like to create a new data frame such that each row contains a distinct combination of YEAR and REGION. It also contains a column which sums up the VALUE for that YEAR, REGION combination and another column which provides the maximum VALUE for the YEAR, REGION combination. The result should look like:
YEAR REGION SUM_VALUE MAX_VALUE
2000 A 50 15
2001 A 221 95
2000 B 507 150
2001 B 54 30
Here is what I am doing:
new_df = pandas.DataFrame()
for yr in df.YEAR.unique():
    for reg in df.REGION.unique():
        new_df = new_df.append({'YEAR': yr}, ignore_index=True)
        new_df = new_df.append({'REGION': reg}, ignore_index=True)
However, this creates a new row each time, and is not very pythonic due to the extra for loops. Is there a better way to proceed?
Please note that this is a toy dataframe, the actual dataframe has several VALUE columns. The proposed solution should scale, without having to manually specify the names of the VALUE columns.
Use groupby on 'YEAR' and 'REGION' and pass a list of functions to call using agg:
In [9]:
df.groupby(['YEAR','REGION'])['VALUE'].agg(['sum','max']).reset_index()
Out[9]:
   YEAR REGION  sum  max
0  2000      A   50   15
1  2000      B  507  150
2  2001      A  221   95
3  2001      B   54   30
EDIT:
If you want to name the aggregated columns, use named aggregation (passing a dict of output names to agg was deprecated and later removed in pandas 1.0):
In [18]:
df.groupby(['YEAR','REGION'])['VALUE'].agg(sum_VALUE='sum', max_VALUE='max').reset_index()
Out[18]:
   YEAR REGION  sum_VALUE  max_VALUE
0  2000      A         50         15
1  2000      B        507        150
2  2001      A        221         95
3  2001      B         54         30
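Regarding the note about several VALUE columns: if you call agg on the grouped frame rather than on a single column, every selected column gets aggregated, so nothing needs to be named by hand. A minimal sketch (assuming DAY is the only non-VALUE extra column, as in the toy frame):
# pick up every column that is not a key or the DAY column
value_cols = df.columns.difference(['DAY', 'YEAR', 'REGION'])
out = df.groupby(['YEAR', 'REGION'])[list(value_cols)].agg(['sum', 'max'])
# flatten the resulting MultiIndex columns into VALUE_sum, VALUE_max, ...
out.columns = ['_'.join(c) for c in out.columns]
out = out.reset_index()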
Related
I have the following problem and do not know how to solve it in a performant way:
Input Pandas DataFrame:
   timestep  article  volume
0        35        1      20
1        37        2       5
2       123        2      12
3       155        3      10
4       178        2      23
5       234        1      17
6       478        1      28
Output Pandas DataFrame (each timestep gets the sum of the last known volume per article; the timestep 155 row, which works out to 20 + 12 + 10 = 42, was missing from the original post):
   timestep  volume
0        35      20
1        37      25
2       123      32
3       155      42
4       178      53
5       234      50
6       478      61
Calculation Example for timestep 478:
28 (last article 1 volume) + 23 (last article 2 volume) + 10 (last article 3 volume) = 61
What is the best way to do this in pandas?
Try with ffill, generalized over every article so the article-3 row is handled too:
# sort if needed
df = df.sort_values("timestep")
# carry each article's last known volume forward, then sum across articles
df["volume"] = sum(
    df["volume"].where(df["article"].eq(a)).ffill().fillna(0)
    for a in df["article"].unique()
)
output = df.drop("article", axis=1)
>>> output
   timestep  volume
0        35    20.0
1        37    25.0
2       123    32.0
3       155    42.0
4       178    53.0
5       234    50.0
6       478    61.0
Group by article, take the last element of each group, and sum:
df.groupby(['article']).tail(1)["volume"].sum()
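Note this returns only the final total, not the per-timestep column. A quick check on the input above (tail(1) keeps each article's last row):
df.groupby('article').tail(1)
#    timestep  article  volume
# 3       155        3      10
# 4       178        2      23
# 6       478        1      28
df.groupby('article').tail(1)['volume'].sum()
# 28 + 23 + 10 = 61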
You can assign a group number to each run of consecutive article values with .cumsum(), then get the previous group's last value by .map() with GroupBy.last(), and finally add volume to that previous last value, as follows. (Note: this answer and the next were worked on a six-row, two-article frame without the timestep 155 row, and they only look one group back, which is why their results differ from the expected output above.)
# Get group number of consecutive `article`
g = df['article'].ne(df['article'].shift()).cumsum()
# Add `volume` to previous group last
df['volume'] += g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
Result:
print(df)
timestep article volume
0 35 1 20
1 37 2 25
2 123 2 32
3 178 2 43
4 234 1 40
5 478 1 51
Breakdown of steps
Previous group last values:
g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
0     0
1    20
2    20
3    20
4    23
5    23
Name: article, dtype: int64
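For reference, the group numbers g on that six-row frame (articles 1, 2, 2, 2, 1, 1):
print(g.tolist())
# [1, 2, 2, 2, 3, 3]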
Try:
df["new_volume"] = (
df.loc[df["article"] != df["article"].shift(-1), "volume"]
.reindex(df.index, method='ffill')
.shift()
+ df["volume"]
).fillna(df["volume"])
df
Output:
timestep article volume new_volume
0 35 1 20 20.0
1 37 2 5 25.0
2 123 2 12 32.0
3 178 2 23 43.0
4 234 1 17 40.0
5 478 1 28 51.0
Explained:
Find the last record of each run by comparing 'article' with the next row, reindex that series to the original dataframe's index with a forward fill, and shift it down one row so each row sees the previous run's last 'volume'. Add this to the current row's 'volume', and fill the first value (which has no previous run) with the original 'volume'.
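To see the intermediate values on the six-row frame this answer uses:
# last row of each consecutive-article run (indexes 0, 3 and 5 here)
last = df.loc[df["article"] != df["article"].shift(-1), "volume"]
# align to the full index, forward-fill, then shift one row down
print(last.reindex(df.index, method='ffill').shift())
# 0     NaN
# 1    20.0
# 2    20.0
# 3    20.0
# 4    23.0
# 5    23.0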
For the given dataframe df:
Election Yr. Party Region Votes
0 2000 A a 50
1 2000 A b 30
2 2000 B a 40
3 2000 B b 50
4 2000 C a 30
5 2000 C c 40
6 2004 A a 20
7 2004 A b 30
8 2004 B a 40
9 2004 B b 50
10 2004 C a 60
11 2004 C b 40
12 2008 A a 30
13 2008 A c 30
14 2008 B a 80
15 2008 B b 50
16 2008 C a 60
17 2008 C b 40
How can I find the list of regions which have a different winner in every election? The winner is decided by the total votes for a party in a year.
First you need to figure out the winner in every election for each region, essentially the party with the highest vote.
winners = df.groupby(['Election Yr.', 'Region']).apply(lambda x: x.set_index('Party').Votes.idxmax())
Then you can figure out for each region how many different winners there have been:
n_unique_winners = winners.groupby(['Region']).nunique()
You can also figure out how many elections have occurred in each region:
n_elections = winners.groupby(['Region']).size()
Entries with a true value in n_unique_winners == n_elections are the regions you are looking for.
To get a list of these regions, you can do n_unique_winners[n_unique_winners == n_elections].index.values
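Putting the steps together (a sketch on the sample df above; by my count it should return regions 'a' and 'c', since region b was won by party B in all three elections):
# winner per (election, region): the party with the most votes
winners = df.groupby(['Election Yr.', 'Region']).apply(
    lambda x: x.set_index('Party').Votes.idxmax())
n_unique_winners = winners.groupby('Region').nunique()
n_elections = winners.groupby('Region').size()
print(n_unique_winners[n_unique_winners == n_elections].index.values)
# expected: ['a' 'c']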
I want to group a dataframe by one column, then apply a cumsum over the other, ordered by the first column descending.
df1:
id PRICE DEMAND
0 120 10
1 232 2
2 120 3
3 232 8
4 323 5
5 323 6
6 323 2
df2:
id PRICE DEMAND
0 323 13
1 232 23
2 120 36
I do it in two instructions, but I feel it can be done in a single statement:
data = data.groupby('PRICE',as_index=False).agg({'DEMAND': 'sum'}).sort_values(by='PRICE', ascending=False)
data['DEMAND'] = data['DEMAND'].cumsum()
What you have seems perfectly fine to me. But if you want to chain everything together, first sort, then groupby with sort=False so it doesn't change the order. Then you can sum within each group and cumsum the resulting Series:
(df.sort_values('PRICE', ascending=False)
.groupby('PRICE', sort=False)['DEMAND'].sum()
.cumsum()
.reset_index())
PRICE DEMAND
0 323 13
1 232 23
2 120 36
Another option would be to sort then cumsum and then drop_duplicates:
(df.sort_values('PRICE', ascending=False)
.set_index('PRICE')
.DEMAND.cumsum()
.reset_index()
.drop_duplicates('PRICE', keep='last'))
PRICE DEMAND
2 323 13
4 232 23
6 120 36
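Note the surviving index labels 2, 4 and 6 in the second option's output: after the reset_index they are simply the positions of each price's last row within the sorted frame. Chain a final .reset_index(drop=True) if you prefer a clean 0..n index.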
Suppose we have a Pandas DataFrame like the following:
df=pd.DataFrame({'name':['Ind','Chn','SG','US','SG','US','Ind','Chn','Fra','Fra'],'a':[5,6,3,4,7,12,66,78,65,100]})
I would like to sum the values of column 'a' for each distinct value of column 'name'.
I tried this code:
for i in df['name'].unique():
    df['tot'] = df[(df.name == i)]['a'].sum()
In the resulting new column, 'tot' contains only the sum for the last distinct value of 'name' (i.e. 'Fra') in every row, rather than a separate value for each of Ind, US, Fra, etc. I would like each row of the new column 'tot' to hold the sum for that row's 'name', and ultimately I want to sort the whole dataframe df by the sum for each unique value.
I tried using a dictionary:
dc = {}
for i in df['name'].unique():
    dc[i] = dc.get(i, 0) + df[(df.name == i)]['a'].sum()
I get the desired result in the dictionary, but I don't know how to sort df from here based on the values of 'dc':
{'Ind': 71, 'Chn': 84, 'SG': 10, 'US': 16, 'Fra': 165}
Could anybody please explain how to work out this scenario, ideally in several ways? Which would be the most efficient when dealing with huge data? Thanks!
Edit: My expected output is just the dataframe df sorted by the values of the new column 'tot', or something like finding the rows associated with the maximum or minimum values in 'tot'.
You are looking for groupby:
df=pd.DataFrame({'name':['Ind','Chn','SG','US','SG','US','Ind','Chn','Fra','Fra'],'a':[5,6,3,4,7,12,66,78,65,100]})
df.groupby('name').a.sum()
Out[950]:
name
Chn 84
Fra 165
Ind 71
SG 10
US 16
Name: a, dtype: int64
Edit:
df.assign(total=df.name.map(df.groupby('name').a.sum())).sort_values(['name','total'])
Out[964]:
a name total
1 6 Chn 84
7 78 Chn 84
8 65 Fra 165
9 100 Fra 165
0 5 Ind 71
6 66 Ind 71
2 3 SG 10
4 7 SG 10
3 4 US 16
5 12 US 16
EDIT 2:
df.groupby('name').a.sum().sort_values(ascending=True)
Out[1111]:
name
SG 10
US 16
Ind 71
Chn 84
Fra 165
Name: a, dtype: int64
df.groupby('name').a.sum().sort_values(ascending=False)
Out[1112]:
name
Fra 165
Chn 84
Ind 71
US 16
SG 10
Name: a, dtype: int64
(df.groupby('name').a.sum().sort_values(ascending=False)).index.values
Out[1119]: array(['Fra', 'Chn', 'Ind', 'US', 'SG'], dtype=object)
If I understand it correctly, use groupby and transform:
In [3716]: df['total'] = df.groupby('name')['a'].transform('sum')
In [3717]: df
Out[3717]:
a name total
0 5 Ind 71
1 6 Chn 84
2 3 SG 10
3 4 US 16
4 7 SG 10
5 12 US 16
6 66 Ind 71
7 78 Chn 84
8 65 Fra 165
9 100 Fra 165
And, use sort_values
In [3719]: df.sort_values(by='total', ascending=False)
Out[3719]:
a name total
8 65 Fra 165
9 100 Fra 165
1 6 Chn 84
7 78 Chn 84
0 5 Ind 71
6 66 Ind 71
3 4 US 16
5 12 US 16
2 3 SG 10
4 7 SG 10
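For the OP's edit about locating the rows with the maximum or minimum totals, the same transform column can be filtered directly:
df[df['total'] == df['total'].max()]  # the two Fra rows (total 165)
df[df['total'] == df['total'].min()]  # the two SG rows (total 10)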
I have two DataFrames
df1:
mat inv
0 100 23
1 101 35
2 102 110
df2:
mat sale
0 100 45
1 101 100
2 102 90
I merged the DataFrames into df:
mat inv sale
0 100 23 45
1 101 35 100
2 102 110 90
so I could create another column days:
df['days'] = df.inv / df.sale * 30
then I delete the column sale, and get this as the result:
df:
mat inv days
0 100 23 15
1 101 35 10
2 102 110 36
Can I create the days column directly in df1 without first merging the DataFrames? I don't need the columns of df2, just its values to compute days, and I don't really want to merge the frames only to delete the extra column at the end.
You can create the new column directly if you make sure the mat columns align properly:
df1 = df1.set_index('mat')
df2 = df2.set_index('mat')
df2['days'] = df1.inv.div(df2.sale).mul(30)
     sale   days
mat
100    45  15.33
101   100  10.50
102    90  36.67
You can also do it this way:
In [181]: df1['days'] = (df1.inv / df1['mat'].map(df2.set_index('mat')['sale']) * 30).astype(int)
In [182]: df1
Out[182]:
mat inv days
0 100 23 15
1 101 35 10
2 102 110 36
surely df1['days'] = df1['inv'] / df2['sale'] * 30 works?
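It does in this case, but only because both frames happen to share the same default index; the division aligns on index labels, not on mat. A quick check with the question's data:
import pandas as pd

df1 = pd.DataFrame({'mat': [100, 101, 102], 'inv': [23, 35, 110]})
df2 = pd.DataFrame({'mat': [100, 101, 102], 'sale': [45, 100, 90]})

# works because df1 and df2 share the default RangeIndex
df1['days'] = (df1['inv'] / df2['sale'] * 30).astype(int)
print(df1)
#    mat  inv  days
# 0  100   23    15
# 1  101   35    10
# 2  102  110    36
If df2's rows ever arrive in a different order, this silently combines the wrong rows; the set_index('mat') and map approaches above avoid that.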