Pandas sum multi-index columns with same name - python

I know that I can sum two columns by:
df["name1"] + df["name2"]
But how does summing work when the two column names are the same?
Given the following CSV:
,,College 1,,,,,,,,,,,,College 2,,,,,,,,,,,,College 3,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,Music,,,,Geography,,,,Business,,,,Mathematics,,,,Biology,,,,Geography,,,,Business,,,,Biology,,,,Technology,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13,4,9,6,2,0,10,11,14,4,12,12,5
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5,2,12,14,9,10,11,18,20,0,5,7,8
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12,4,9,6,2,1,13,15,18,3,19,8,16
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10,2,12,14,9,0,17,19,19,0,4,6,4
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20,4,9,6,2,8,12,16,13,4,19,18,7
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8,2,12,14,9,9,16,20,13,0,10,5,6
I can clean the file and set up a multi-index with pandas and numpy:
df = pd.read_csv("CollegeGrades2.csv", index_col=[0,1], header=[0,1,2], skiprows=lambda x: x%2 == 1)
df.columns = pd.MultiIndex.from_frame(df.columns.to_frame().apply(lambda x: np.where(x.str.contains('Unnamed'), np.nan, x)).ffill())
df.index = pd.MultiIndex.from_frame(df.index.to_frame().ffill())
df.groupby(level=0, sort=False).sum()
However, my issue is that I want to total each subject across colleges, e.g. College 1 Geography + College 3 Geography, and display the totals in my desired output.
I have tried separating them out into different data frames, summing them, and then concatenating them, but in doing so I lose the headings. For example:
music = df2["College 1", "Music"]
geography = df2["College 1", "Geography"] + df2["College 3", "Geography"]
pd.concat([music,geography], axis=1).groupby(level=0, sort=False).sum()
How do I sum the subjects while maintaining my desired output? Any help would be appreciated.
Thank you.

You can also group by the column levels:
df.groupby(level=[1, 2], axis=1).sum().groupby(level=0).sum()
Result:
1 Biology Business Geography Mathematics Music Technology
2 D F M P D F M P D F M P D F M P D F M P D F M P
0
Evening 47 21 69 52 22 12 40 42 41 7 41 46 36 8 21 35 18 8 18 22 13 4 23 29
Year 1 68 26 63 57 22 12 40 42 34 5 34 45 29 13 30 31 20 0 21 18 13 4 19 17
Year 2 64 12 63 66 22 12 40 42 42 2 21 57 33 17 33 30 21 6 20 21 20 3 14 23
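Note that groupby with axis=1 is deprecated in recent pandas versions (2.1+). The same column-wise aggregation can be done by transposing, grouping on the (now row) index level, and transposing back; a minimal sketch with small made-up data:

```python
import pandas as pd

# Small frame with a (college, subject) column MultiIndex -- made-up values
cols = pd.MultiIndex.from_tuples([
    ("College 1", "Geography"), ("College 1", "Music"),
    ("College 3", "Geography"),
])
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=cols)

# df.groupby(level=1, axis=1).sum() is deprecated; equivalently:
out = df.T.groupby(level=1).sum().T
print(out)
```

Here Geography from both colleges is summed into one column while Music passes through unchanged.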

Related

Replace text in url in pandas dataframe

In my dataframe I have links with UTM parameters:
utm_content=keys_{gbid}|cid|{campaign_id}|aid|{keyword}|{phrase_id}|src&utm_term={keyword}
The dataframe also has several ID columns: CampaignId, AdGroupId, Keyword, Keyword ID.
I need to replace the values in curly brackets in the link with the values from these columns.
For example, {campaign_id} should be replaced with values from the CampaignId column, and so on for each placeholder in the link.
The result should look like this:
utm_content=keys_3745473327|cid|31757442|aid|CRM|38372916231|src&utm_term=CRM
You can try this:
import pandas as pd
import numpy as np
# create some sample data
df = pd.DataFrame(columns=['CampaignId', 'AdGroupId', 'Keyword', 'Keyword ID'],
data=np.random.randint(low=0, high=100, size=(10, 4)))
df['url'] = 'utm_content=keys_{gbid}|cid|{campaign_id}|aid|{keyword}|{phrase_id}|src&utm_term={keyword}'
df
Output:
CampaignId AdGroupId Keyword Keyword ID url
0 21 13 26 41 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
1 28 9 19 3 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
2 11 17 37 43 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
3 25 13 17 54 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
4 32 19 17 48 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
5 26 92 80 90 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
6 25 17 1 54 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
7 81 7 68 85 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
8 75 55 37 56 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
9 14 53 34 84 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
Then write a custom function that fills in your variables via an f-string, and apply it to the dataframe to create a new column (you can also overwrite the url column if you want):
def fill_link(CampaignId, AdGroupId, Keyword, KeywordID, url):
    # bind the template's placeholder names to the column values
    campaign_id = CampaignId
    keyword = Keyword
    gbid = AdGroupId
    phrase_id = KeywordID
    # evaluate the url template as an f-string using the locals above
    return eval("f'" + url + "'")
df['url_filled'] = df.apply(lambda row: fill_link(row['CampaignId'], row['AdGroupId'], row['Keyword'], row['Keyword ID'], row['url']), axis=1)
df
CampaignId AdGroupId Keyword Keyword ID url url_filled
0 21 13 26 41 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_13|cid|21|aid|26|41|src&utm_t...
1 28 9 19 3 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_9|cid|28|aid|19|3|src&utm_ter...
2 11 17 37 43 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_17|cid|11|aid|37|43|src&utm_t...
3 25 13 17 54 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_13|cid|25|aid|17|54|src&utm_t...
4 32 19 17 48 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_19|cid|32|aid|17|48|src&utm_t...
5 26 92 80 90 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_92|cid|26|aid|80|90|src&utm_t...
6 25 17 1 54 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_17|cid|25|aid|1|54|src&utm_te...
7 81 7 68 85 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_7|cid|81|aid|68|85|src&utm_te...
8 75 55 37 56 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_55|cid|75|aid|37|56|src&utm_t...
9 14 53 34 84 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_53|cid|14|aid|34|84|src&utm_t...
I am not sure the variable names are assigned correctly, as they are not exactly the same in your example, but it shouldn't be a problem for you to remap them as you wish.
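As an aside, eval on data-derived strings is risky (a stray quote in a url breaks it, and malicious content would execute). The same substitution works without eval via str.format_map; a sketch under the same assumed column names:

```python
import pandas as pd

df = pd.DataFrame({
    "CampaignId": [31757442],
    "AdGroupId": [3745473327],
    "Keyword": ["CRM"],
    "Keyword ID": [38372916231],
    "url": ["utm_content=keys_{gbid}|cid|{campaign_id}|aid|{keyword}|{phrase_id}|src&utm_term={keyword}"],
})

def fill_link(row):
    # map the template placeholders to column values -- no eval needed
    mapping = {
        "gbid": row["AdGroupId"],
        "campaign_id": row["CampaignId"],
        "keyword": row["Keyword"],
        "phrase_id": row["Keyword ID"],
    }
    return row["url"].format_map(mapping)

df["url_filled"] = df.apply(fill_link, axis=1)
print(df["url_filled"].iloc[0])
```

The placeholder-to-column mapping here is my guess from your example output; adjust it to your real schema.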

How to concatenate rows side by side in pandas

I want to combine every five rows of the same dataset into a single row.
I have 700 rows, so I want to combine each group of five consecutive rows.
A B C D E F G
1 10,11,12,13,14,15,16
2 17,18,19,20,21,22,23
3 24,25,26,27,28,29,30
4 31,32,33,34,35,36,37
5 38,39,40,41,42,43,44
.
.
.
.
.
700
After combining the first five rows, my first row should look like this:
A B C D E F G A B C D E F G A B C D E F G A B C D E F G A B C D E F G
1 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44
If you can guarantee that the total number of rows you have is a multiple of 5, dipping into numpy will be the most efficient way to solve this problem:
import numpy as np
import pandas as pd
data = np.arange(70).reshape(-1, 7)
df = pd.DataFrame(data, columns=[*'ABCDEFG'])
print(df)
A B C D E F G
0 0 1 2 3 4 5 6
1 7 8 9 10 11 12 13
2 14 15 16 17 18 19 20
3 21 22 23 24 25 26 27
4 28 29 30 31 32 33 34
5 35 36 37 38 39 40 41
6 42 43 44 45 46 47 48
7 49 50 51 52 53 54 55
8 56 57 58 59 60 61 62
9 63 64 65 66 67 68 69
out = pd.DataFrame(
df.to_numpy().reshape(-1, df.shape[1] * 5),
columns=[*df.columns] * 5
)
print(out)
A B C D E F G A B C D E F ... B C D E F G A B C D E F G
0 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 22 23 24 25 26 27 28 29 30 31 32 33 34
1 35 36 37 38 39 40 41 42 43 44 45 46 47 ... 57 58 59 60 61 62 63 64 65 66 67 68 69
[2 rows x 35 columns]
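If the row count is not a multiple of 5, the reshape raises a ValueError. One hedged workaround is to pad with NaN rows up to the next multiple first; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(21).reshape(-1, 3), columns=[*'ABC'])  # 7 rows

k = 5
pad = -len(df) % k  # rows needed to reach a multiple of k
arr = np.vstack([df.to_numpy(float), np.full((pad, df.shape[1]), np.nan)])
out = pd.DataFrame(arr.reshape(-1, df.shape[1] * k), columns=[*df.columns] * k)
print(out.shape)
```

The trailing cells of the last output row are NaN, which also forces a float dtype on integer data.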
You can do:
cols = df.columns.tolist() * 5
chunks = [df[i:i+5].reset_index(drop=True) for i in range(0, len(df), 5)]
df2 = pd.concat([pd.DataFrame(chunk.stack()).T for chunk in chunks])
df2.columns = cols
df2.reset_index(drop=True, inplace=True)
See if this helps answer your question.
unstack turns the columns into rows, and once the data are in a single column we just transpose. reset_index turns the resulting series into a dataframe; the original column names become an index level, so after transposing, the columns appear in the order you described.
df.unstack().reset_index().set_index('level_0')[[0]].T
level_0 A A A A A B B B B B ... F F F F F G G G G G
0 10 17 24 31 38 11 18 25 32 39 ... 15 22 29 36 43 16 23 30 37 44
The easiest way is to convert your dataframe to a numpy array, reshape it, then cast it back to a new dataframe.
Edit:
data= # your dataframe
new_dataframe = pd.DataFrame(data.to_numpy().reshape(len(data)//5, -1), columns=np.tile(data.columns, 5))
Stacking and unstacking data in pandas
Data in tables are often presented in multiple ways. Long form ("tidy", or stacked, data) refers to data stacked into a couple of columns, one of which holds categorical indicators for the values. In contrast, wide form (unstacked data) gives each category its own column.
In your example you present data in wide form and want to reshape it. pandas.melt, pandas.groupby, pandas.pivot, pandas.stack, pandas.unstack, and pandas.reset_index are the functions that help convert between these forms.
Start with your original dataframe:
df = pd.DataFrame({
    'A': [10, 17, 24, 31, 38],
    'B': [11, 18, 25, 32, 39],
    'C': [12, 19, 26, 33, 40],
    'D': [13, 20, 27, 34, 41],
    'E': [14, 21, 28, 35, 42],
    'F': [15, 22, 29, 36, 43],
    'G': [16, 23, 30, 37, 44]})
A B C D E F G
0 10 11 12 13 14 15 16
1 17 18 19 20 21 22 23
2 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37
4 38 39 40 41 42 43 44
Use pandas.melt to convert it to long form, then sort to get the data in the order you requested. The ignore_index=False option keeps the original index, which helps us get back to wide form later.
melted_df = df.melt(ignore_index=False).sort_values(by='value')
variable value
0 A 10
0 B 11
0 C 12
0 D 13
0 E 14
0 F 15
0 G 16
1 A 17
1 B 18
...
Use groupby, unstack, and reset_index to convert it back to wide form. This direction is often harder: group by the index and the stacked variable column, aggregate, then unstack and reset the index.
(melted_df
 .reset_index()                    # puts the index values into a column called 'index'
 .groupby(['index', 'variable'])   # groups by the index and the variable
 .value                            # selects the value column of each group
 .mean()                           # one item per group, so this aggregates a single item
 .unstack()                        # moves the inner index level up to columns
 .reset_index()                    # fix the index
 .set_index('index')               # set index
)
A B C D E F G
0 10 11 12 13 14 15 16
1 17 18 19 20 21 22 23
2 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37
4 38 39 40 41 42 43 44
This can be quite difficult and requires trial and error. I recommend making a smaller version of your problem and experimenting with it; that way you can figure out how the functions work.
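As a side note, the groupby/unstack chain above can usually be collapsed into a single pivot call, which may be easier to reason about; a sketch on a small melted frame:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [10, 17], 'B': [11, 18], 'C': [12, 19],
})
melted_df = df.melt(ignore_index=False)

# One pivot call undoes the melt: rows come from the kept index,
# columns from 'variable', cell values from 'value'
wide = (melted_df
        .reset_index()
        .pivot(index='index', columns='variable', values='value'))
print(wide)
```

pivot raises if an (index, variable) pair occurs more than once, in which case pivot_table with an aggregation function is the fallback.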
Try this: use arange() with floor division to form groups of every 5 rows, then create a new df from the groups. This works even if the length of your df is not divisible by 5.
l = 5
(df.groupby(np.arange(len(df.index))//l)
.apply(lambda x: pd.DataFrame([x.to_numpy().ravel()]))
.set_axis(df.columns.tolist() * l,axis=1)
.reset_index(drop=True))
or
(df.groupby(np.arange(len(df.index))//5)
.apply(lambda x: x.reset_index(drop=True).stack())
.unstack(level=[1,2])
.droplevel(0,axis=1))
Output:
A B C D E F G A B C ... E F G A B C D E F G
0 9 0 3 2 6 2 9 1 7 5 ... 2 5 9 5 4 9 7 3 8 9
1 9 5 0 8 1 5 8 7 7 7 ... 6 3 5 5 2 3 9 7 5 6

Calculate mean from multiple columns

I have 12 columns filled with wages. I want to calculate the mean, but my output is 12 different means, one per column; I want a single mean calculated over the whole dataset.
This is how my df looks:
Month 1 Month 2 Month 3 Month 4 ... Month 9 Month 10 Month 11 Month 12
0 1429.97 2816.61 2123.29 2123.29 ... 2816.61 2816.61 1429.97 1776.63
1 3499.53 3326.20 3499.53 2112.89 ... 1939.56 2806.21 2632.88 2459.55
2 2599.95 3119.94 3813.26 3466.60 ... 3466.60 3466.60 2946.61 2946.61
3 2599.95 2946.61 3466.60 2773.28 ... 2253.29 3119.94 1906.63 2773.28
I used this code to calculate the mean:
mean = df.mean()
Do I have to convert these 12 columns into one column, or how else can I calculate one overall mean?
Just call mean again to take the mean of those 12 column means:
df.mean().mean()
Use numpy.mean after converting the values to a 2d array:
mean = np.mean(df.to_numpy())
print (mean)
2914.254166666667
Or use DataFrame.melt:
mean = df.melt()['value'].mean()
print (mean)
2914.254166666666
You can also use stack:
df.stack().mean()
Suppose this dataframe:
>>> df
A B C D E F G H
0 60 1 59 25 8 27 34 43
1 81 48 32 30 60 3 90 22
2 66 15 21 5 23 36 83 46
3 56 42 14 86 41 64 89 56
4 28 53 89 89 52 13 12 39
5 64 7 2 16 91 46 74 35
6 81 81 27 67 26 80 19 35
7 56 8 17 39 63 6 34 26
8 56 25 26 39 37 14 41 27
9 41 56 68 38 57 23 36 8
>>> df.stack().mean()
41.6625
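One caveat worth knowing: df.mean().mean() averages the column means, which equals the overall mean only when every column has the same number of non-missing values. With NaNs present the two can diverge; a quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 3.0], 'B': [5.0, np.nan]})

overall = df.stack().mean()   # mean of the 3 observed values: (1 + 3 + 5) / 3
of_means = df.mean().mean()   # mean of the two column means: (2 + 5) / 2

print(overall, of_means)
```

So on data with missing values, prefer the stack (or melt) approach for a true overall mean.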

Sum row values of all columns where column names meet string match condition

I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values of only those columns whose names match a string (in this case, all columns ending in _CAP), and store the sum in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried to use the solution recommended in this question here:
Summing columns in Dataframe that have matching column headers
However, that solution doesn't work for me: there the summed columns share the exact same name, so a simple groupby accomplishes the result, whereas I am trying to sum columns that match a specific string.
Code to recreate above sample dataset:
data1 = [['1', 33,25,16,50], ['2', 34,22,11,66],
['3', 22,12,15,42],['4', 46,45,66,54],
['5',16,6,23,75], ['6', 21,42,433,50]]
df = pd.DataFrame(data1, columns = ['ID', 'Length','Width','Range_CAP','Capacity_CAP'])
Let us use filter:
df['CAP_SUM'] = df.filter(like='CAP').sum(1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If other column names contain CAP elsewhere, anchor the match with a regex:
df.filter(regex='_CAP$').sum(1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask whose values are True if and only if the column name ends with _CAP. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
    if i.find('_CAP') != -1:
        df['sum'] = df['sum'] + df[i]

Iterating over dataframe and replace with values from another dataframe

I have 2 dataframes, df1 and df2, and df2 holds the min and max values for the corresponding columns.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randint(0,50,size=(10, 5)), columns=list('ABCDE'))
df2 = pd.DataFrame(np.array([[5,3,4,7,2],[30,20,30,40,50]]),columns=list('ABCDE'))
I would like to iterate through df1 and replace the cell values with those of df2 when the df1 cell value is below/above the respective columns' min/max values.
First, don't loop/iterate in pandas if a better, vectorized solution exists, like here.
Use numpy.select with broadcasting to set values by conditions:
np.random.seed(123)
df1 = pd.DataFrame(np.random.randint(0,50,size=(10, 5)), columns=list('ABCDE'))
df2 = pd.DataFrame(np.array([[5,3,4,7,2],[30,20,30,40,50]]),columns=list('ABCDE'))
print (df1)
A B C D E
0 45 2 28 34 38
1 17 19 42 22 33
2 32 49 47 9 32
3 46 32 47 25 19
4 14 36 32 16 4
5 49 3 2 20 39
6 2 20 47 48 7
7 41 35 28 38 33
8 21 30 27 34 33
print (df2)
A B C D E
0 5 3 4 7 2
1 30 20 30 40 50
#for pandas below 0.24 change .to_numpy() to .values
min1 = df2.loc[0].to_numpy()
max1 = df2.loc[1].to_numpy()
arr = df1.to_numpy()
df = pd.DataFrame(np.select([arr < min1, arr > max1], [min1, max1], arr),
index=df1.index,
columns=df1.columns)
print (df)
A B C D E
0 30 3 28 34 38
1 17 19 30 22 33
2 30 20 30 9 32
3 30 20 30 25 19
4 14 20 30 16 4
5 30 3 4 20 39
6 5 20 30 40 7
7 30 20 28 38 33
8 21 20 27 34 33
9 12 20 4 40 5
Another, simpler solution is numpy.clip:
df = pd.DataFrame(np.clip(arr, min1, max1), index=df1.index, columns=df1.columns)
print (df)
A B C D E
0 30 3 28 34 38
1 17 19 30 22 33
2 30 20 30 9 32
3 30 20 30 25 19
4 14 20 30 16 4
5 30 3 4 20 39
6 5 20 30 40 7
7 30 20 28 38 33
8 21 20 27 34 33
9 12 20 4 40 5
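The clipping can also stay entirely in pandas: DataFrame.clip accepts per-column bounds as Series aligned on the columns via axis=1. A sketch on small made-up data:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 45], 'B': [2, 25]})
df2 = pd.DataFrame({'A': [5, 30], 'B': [3, 20]})  # row 0 = min, row 1 = max

# align the per-column bounds on df1's columns
clipped = df1.clip(lower=df2.loc[0], upper=df2.loc[1], axis=1)
print(clipped)
```

This avoids the round trip through to_numpy() while producing the same result as np.clip.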
