What is the best way to do a groupby on a Pandas dataframe, but exclude some columns from that groupby? e.g. I have the following dataframe:
Code  Country      Item_Code  Item   Ele_Code  Unit  Y1961  Y1962  Y1963
2     Afghanistan  15         Wheat  5312      Ha    10     20     30
2     Afghanistan  25         Maize  5312      Ha    10     20     30
4     Angola       15         Wheat  7312      Ha    30     40     50
4     Angola       25         Maize  7312      Ha    30     40     50
I want to group by the columns Country and Item_Code and only compute the sum of the rows falling under the columns Y1961, Y1962 and Y1963. The resulting dataframe should look like this:
Code  Country      Item_Code  Item  Ele_Code  Unit  Y1961  Y1962  Y1963
2     Afghanistan  15         C3    5312      Ha    20     40     60
4     Angola       25         C4    7312      Ha    60     80     100
Right now I am doing this:
df.groupby('Country').sum()
However this adds up the values in the Item_Code column as well. Is there any way I can specify which columns to include in the sum() operation and which ones to exclude?
You can select the columns of a groupby:
In [11]: df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum()
Out[11]:
                       Y1961  Y1962  Y1963
Country     Item_Code
Afghanistan 15            10     20     30
            25            10     20     30
Angola      15            30     40     50
            25            30     40     50
Note that the list passed must be a subset of the columns otherwise you'll see a KeyError.
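For instance, a quick sketch of that failure mode (Y1999 is a made-up column name that is not in the frame):
df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1999"]].sum()  # raises a KeyError because 'Y1999' is not one of df's columns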
The agg function will do this for you. Pass a dict mapping each column to the aggregation(s) you want applied to it:
df.groupby(['Country', 'Item_Code']).agg({'Y1961': 'sum', 'Y1962': ['sum', 'mean']})  # two output columns from a single input column
This will display only the group-by columns and the specified aggregate columns. In this example I included two agg functions applied to 'Y1962'.
To get exactly what you hoped to see, include the other columns in the group by and apply sums to the Y variables in the frame:
df.groupby(['Code', 'Country', 'Item_Code', 'Item', 'Ele_Code', 'Unit']).agg({'Y1961': 'sum', 'Y1962': 'sum', 'Y1963': 'sum'})
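If you prefer flat, single-level column names in the result, pandas also supports named aggregation. A sketch, assuming pandas >= 0.25 and the same df as above (the _total/_mean names are just my choice):
df.groupby(['Country', 'Item_Code']).agg(
    Y1961_total=('Y1961', 'sum'),  # output column = (input column, aggregation)
    Y1962_total=('Y1962', 'sum'),
    Y1962_mean=('Y1962', 'mean'),
)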
If you are looking for a more generalized way to apply this to many columns, you can build a list of column names and use it to select from the grouped dataframe. In your case, for example:
columns = ['Y' + str(year) for year in range(1967, 2011)]
df.groupby('Country')[columns].agg('sum')
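If the year columns are too many to type out, or the exact range is not known in advance, a small sketch (my addition) is to derive the list from the frame itself, assuming every year column name starts with 'Y':
year_cols = [c for c in df.columns if c.startswith('Y')]  # e.g. ['Y1961', 'Y1962', 'Y1963']
df.groupby('Country')[year_cols].sum()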
If you want to add a suffix/prefix to the aggregated column names, use add_suffix() / add_prefix().
df.groupby(["Code", "Country"])[["Y1961", "Y1962", "Y1963"]].sum().add_suffix("_total")
If you want to retain Code and Country as columns after aggregation, set as_index=False in groupby() or use reset_index().
df.groupby(["Code", "Country"], as_index=False)[["Y1961", "Y1962", "Y1963"]].sum()
# df.groupby(["Code", "Country"])[["Y1961", "Y1962", "Y1963"]].sum().reset_index()
Related
I have a pandas dataframe like this:
Name Year Sales
Ann 2010 500
Ann 2011 500
Bob 2010 400
Bob 2011 700
Ed 2010 300
Ed 2011 300
I want to be able to combine the figures in the sales column for each name returning:
Name Sales
Ann 1000
Bob 1100
Ed 600
Perhaps I need a for loop to go through and combine the 2 values for both years and create a new column, but I'm not quite sure. Is there a pandas function that can help me with this?
That's a simple dataframe groupby.
In that case you'll just have to select the two columns you need
df = df[["Name", "Sales"]]
And then apply the groupby
df.groupby(["name"], as_index=False).sum()
By default the groupby will make the grouped-by columns part of the index. If you want to keep them as columns you need to specify as_index=False.
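Putting the two steps together, a minimal end-to-end sketch (the sample frame simply reproduces the data from the question):
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bob", "Bob", "Ed", "Ed"],
    "Year": [2010, 2011, 2010, 2011, 2010, 2011],
    "Sales": [500, 500, 400, 700, 300, 300],
})

# keep only the columns we care about, then sum Sales per Name
result = df[["Name", "Sales"]].groupby("Name", as_index=False).sum()
print(result)
#   Name  Sales
# 0  Ann   1000
# 1  Bob   1100
# 2   Ed    600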
I have a dataframe like this one, with many more rows:
zone  keyword   sales
nyc1  iphone    10
nyc1  smart tv  6
nyc1  iphone    12
nyc2  laptop    22
slc1  iphone    3
slc2  radio     5
la1   iphone    10
la1   tablet    22
la1   tablet    5
How can I get another dataframe where for each zone/keyword I get the sum of the sales column (grouped by zone/keyword) in descending order?
For this example it should look like this (I don't want to reorder based on the other 2 columns, only sales):
zone  keyword   sales
nyc1  iphone    22
nyc1  smart tv  6
nyc2  laptop    22
slc1  iphone    3
slc2  radio     5
la1   tablet    27
la1   iphone    10
I already grouped the columns using
df_sales = df_sales.groupby(['zone','keyword'])['sales'].sum()
But the result is a series with the sum-of-sales column not in order.
Using reset_index and sort_values does order by sales, but it removes the groupby and orders the whole dataframe...
.reset_index().sort_values('sales', ascending=False)
How can I get a dataframe like the one above?
After you complete your groupby, convert the result back to a DataFrame with reset_index and then use sort_values:
df_sales = df_sales.groupby(['zone','keyword'])['sales'].sum()
sorted_df = df_sales.reset_index().sort_values(by=['zone'])
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
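The desired output in the question keeps the zones in their original order and only reorders sales within each zone. One way to sketch that (my addition, assuming pandas is imported as pd) is to make zone an ordered categorical based on first appearance and then sort on both keys:
import pandas as pd

summed = df_sales.groupby(['zone', 'keyword'], sort=False)['sales'].sum().reset_index()
# keep zones in their order of first appearance, then sort sales descending within each zone
summed['zone'] = pd.Categorical(summed['zone'], categories=summed['zone'].unique(), ordered=True)
out = summed.sort_values(['zone', 'sales'], ascending=[True, False]).reset_index(drop=True)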
df_sales.groupby(['zone','keyword'])['sales'].sum().reset_index().sort_values('sales', ascending=False)
reset_index converts the series back to a dataframe, and after that you can sort the values.
Solution 1: using agg('sum')
To get a DataFrame object instead of a Series, use double square brackets around sales.
df_sales = df_sales.groupby(['zone','keyword'])[['sales']].agg('sum').reset_index()
Solution 2: using sum()
df_sales = df_sales.groupby(['zone','keyword'])['sales'].sum().reset_index()
I have two dataframes:
df1
Name Age State Postcode AveAge_State_PC
John 40 PA 1000 35
Janet 40 LV 1050 30
Jake 30 PA 1000 35
Jess 20 LV 1050 30
df2
State Postcode AveAge_State_PC
PA 1000 ???
LV 1050 ???
How do I get the values into the 2nd table? They should all be the same, so I'm happy to take the first value that appears.
I have tried:
df2 = df2.merge(df1[['State', 'Postcode', 'AveAge_State_PC']], how = 'left',
left_on = ['State', 'Postcode'], right_on = ['State', 'Postcode']).drop(columns= ['State', 'Postcode'])
but I'm getting a
ValueError: You are trying to merge on int64 and object columns.
Edit
I'm also getting duplicate rows when merging, rather than keeping the same number of rows as df2.
I assume this is because there are multiple rows with the same values in df1? Any help would be much appreciated, thanks!
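For reference, a minimal sketch of one way to address both symptoms (an illustration, not a drop-in fix; adjust the dtype conversion to whichever side is actually the string one): cast the merge keys to a common dtype and de-duplicate df1 on those keys, so each (State, Postcode) pair contributes exactly one row:
# align key dtypes (assumption: Postcode is int in one frame and a string in the other)
df1['Postcode'] = df1['Postcode'].astype(str)
df2['Postcode'] = df2['Postcode'].astype(str)

# one row per (State, Postcode) so the left merge cannot multiply rows in df2
lookup = df1[['State', 'Postcode', 'AveAge_State_PC']].drop_duplicates(['State', 'Postcode'])

df2 = df2.drop(columns=['AveAge_State_PC']).merge(lookup, on=['State', 'Postcode'], how='left')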
I have a DataFrame and want to extract 3 columns from it, but one of them is an input from the user. I made a list, but I need it to be iterable so I can run a for loop over it.
So far I have managed by making a dictionary with 2 of the columns, making a list of each and zipping them... but I really need the 3 columns...
My code:
Data=pd.read_csv(----------)
selec=input("What month would you want to show?")
NewData=[(Data['Country']),(Data['City']),(Data[selec].astype('int64'))]
#here I try to iterate:
iteration=[i for i in NewData if NewData[i]<=25]
print (iteration)
*TypeError: list indices must be integers or slices, not Series*
My CSV is the following:
I want to be able to choose the month with the variable "selec" and filter the results of the month I've chosen... so the output for selec="Feb" would be:
I also tried with loc/iloc, but no luck at all (unhashable type: 'list').
See the below example for how you can:
select specific columns from a DataFrame by providing a list of columns between the selection brackets (link to tutorial)
select specific rows from a DataFrame by providing a condition between the selection brackets (link to tutorial)
iterate rows of a DataFrame, although I don't suppose you need it - if you'd like to keep working with the DataFrame after filtering it, it's better to use the method mentioned above (you won't have to put the rows back together, and it will likely be more performant because pandas is optimized for bulk operations)
import pandas as pd

# this is just for testing, instead of pd.read_csv(...)
df = pd.DataFrame([
    dict(Country="Spain", City="Madrid", Jan="15", Feb="16", Mar="17", Apr="18", May=""),
    dict(Country="Spain", City="Galicia", Jan="1", Feb="2", Mar="3", Apr="4", May=""),
    dict(Country="France", City="Paris", Jan="0", Feb="2", Mar="3", Apr="4", May=""),
    dict(Country="Algeria", City="Argel", Jan="20", Feb="28", Mar="29", Apr="30", May=""),
])

print("---- Original df:")
print(df)

selec = "Feb"  # let's pretend this comes from input()

print("\n---- Just the 3 columns:")
df = df[["Country", "City", selec]]  # narrow down the df to just the 3 columns
df[selec] = df[selec].astype("int64")  # convert the selec column to proper type
print(df)

print("\n---- Filtered dataframe:")
df1 = df[df[selec] <= 25]
print(df1)

print("\n---- Iterated & filtered rows:")
for row in df.itertuples():
    # we could also use row[3] instead of getattr(...)
    if getattr(row, selec) <= 25:
        print(row)
Output:
---- Original df:
Country City Jan Feb Mar Apr May
0 Spain Madrid 15 16 17 18
1 Spain Galicia 1 2 3 4
2 France Paris 0 2 3 4
3 Algeria Argel 20 28 29 30
---- Just the 3 columns:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
3 Algeria Argel 28
---- Filtered dataframe:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
---- Iterated & filtered rows:
Pandas(Index=0, Country='Spain', City='Madrid', Feb=16)
Pandas(Index=1, Country='Spain', City='Galicia', Feb=2)
Pandas(Index=2, Country='France', City='Paris', Feb=2)
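The filtering can also be written with loc, which the question mentions: a boolean mask selects the rows and a list of labels selects the columns. A small sketch, assuming Data and selec are defined as in the question:
# rows where the chosen month is <= 25, restricted to the three columns of interest
filtered = Data.loc[Data[selec].astype("int64") <= 25, ["Country", "City", selec]]
print(filtered)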
I have two numerical dataframes (df1 and df2), each with a common index but with different column headers. I want to apply a function that, for the ith column of df1 and the jth column of df2, applies the Pearson correlation function (or cosine similarity, or a similar user-defined function) and returns the number.
I want to return the number into a dataframe, df3, where the columns of df1 are the index of df3, the columns of df2 are the columns of df3, and the cells represent the value of the correlation between the two vectors (columns) from df1 and df2.
*Not all of the values are populated. Where they differ, match only on the inner join of the two vectors (this can be done in the user-defined function). Assume df1 and df2 have a different length/number of columns to each other.
Example: I have a dataframe (df1) of male dating profiles, where the columns are the names of the men, and the row index is their interest in a certain topic, between 0 and 100.
I have a second dataframe (df2) of female dating profiles in the same way.
I want to return a matrix of Males along the side, Females across the top, and the number corresponds to the similarity coefficient between the two profiles, for each man/woman pair.
eg:
df1
bob joe carlos
movies 50 45 90
sports 10 NaN 10
walking 20 NaN 50
skiing NaN 80 40
df2
mary anne sally
movies 40 70 NaN
sports 50 0 30
walking 80 10 50
skiing 30 NaN 40
Desired output, df3:
mary anne sally
bob 4.53 19.3 77.4
joe 81.8 75.7 91.0
carlos 45.8 12.2 18.8
I tried this with the classic double for loop, but even I know this is the work of satan in Pandas world. The tables are relatively large, so reasonable efficiency is important (which the below obviously isn't). Thanks in advance.
df3 = pd.DataFrame(index=df1.columns, columns=df2.columns)
for usera in df1:
    for userb in df2:
        df3.loc[usera, userb] = myfunc(df1[usera], df2[userb])
I've experimented with a few alternatives of your code and this one is the fastest as of now:
df3 = pd.DataFrame(([myfunc_np(col_a, col_b) for col_b in df2.values.T] for col_a in df1.values.T),
index=df1.columns, columns=df2.columns)
Here myfunc_np is a numpy version of myfunc that acts on numpy arrays directly rather than pandas series.
Further performance improvement would likely require vectorizing myfunc_np, i.e. having a myfunc_np_vec that takes one column u1 of df1 and the entire df2, and returns a vector of similarity values of u1 with all columns of df2 at the same time.
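For the specific case of Pearson correlation, the whole matrix can be computed in one shot. A minimal sketch (my own illustration, not the answer's myfunc_np), which for simplicity ignores the NaN/inner-join handling mentioned in the question:
import pandas as pd

def pairwise_pearson(df1, df2):
    # standardize each column (zero mean, unit variance), then a single
    # matrix product yields every df1-column x df2-column correlation
    a = (df1 - df1.mean()) / df1.std(ddof=0)
    b = (df2 - df2.mean()) / df2.std(ddof=0)
    corr = a.to_numpy().T @ b.to_numpy() / len(df1)
    return pd.DataFrame(corr, index=df1.columns, columns=df2.columns)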