How can I groupby and aggregate pandas dataframe with many columns - python

I am working with a pandas DataFrame with 168 columns. The first three columns contain the name of the country, latitude, and longitude. The rest of the columns contain numerical data. Each row represents a country, but some countries have multiple rows, and I need to aggregate those rows by summing. I can aggregate the first three columns with the following code:
df = df.groupby('Country', as_index=False).agg({'Lat':'first','Long':'first'})
However, I couldn't find a way to include the remaining 165 columns in that code without explicitly writing out all the column names. In addition, the column names represent dates (5/27/20, 5/28/20, 5/29/20, etc.), so I need to keep them.
How can I do that? Thanks.

Maybe you can generate the dictionary from the column names — 'first' for the coordinates and 'sum' for all the date columns (the grouping column itself should stay out of the dictionary):
agg_dict = {c: ('first' if c in ('Lat', 'Long') else 'sum')
            for c in df.columns if c != 'Country'}
df = df.groupby('Country', as_index=False).agg(agg_dict)
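For example, a minimal sketch with made-up data (the real frame has 165 date columns, but two are enough to show the pattern):

import pandas as pd

# toy frame: two rows for France that should collapse into one
df = pd.DataFrame({
    'Country': ['France', 'France', 'Spain'],
    'Lat':     [46.2, 46.2, 40.4],
    'Long':    [2.2, 2.2, -3.7],
    '5/27/20': [1, 2, 3],
    '5/28/20': [4, 5, 6],
})

agg_dict = {c: ('first' if c in ('Lat', 'Long') else 'sum')
            for c in df.columns if c != 'Country'}
print(df.groupby('Country', as_index=False).agg(agg_dict))
#   Country   Lat  Long  5/27/20  5/28/20
# 0  France  46.2   2.2        3        9
# 1   Spain  40.4  -3.7        3        6

The dictionary comprehension scales to any number of date columns, and the original column names survive untouched.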

Related

pandas | drop_duplicates(subset=['col1', 'col2', 'col3']) not dropping based on given column names

I'm trying to drop duplicated rows in a DataFrame using .drop_duplicates with a subset list, so that it only drops rows that have the same values in the given columns. But for some reason, it didn't drop all of them.
This is the dataframe before dropping...
This is the code that I used to drop the rows...
df_combined.drop_duplicates(subset = ['Anonymized_ID', 'COURSE', 'GRADE'], keep='last', inplace=True)
This is the dataframe after dropping...
I was expecting to see only two rows after dropping, since they have the same values for the specified columns.
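One thing worth checking (an assumption, since the screenshots are not shown here): drop_duplicates compares values exactly, so stray whitespace or mismatched dtypes will make rows that look identical compare as different. A sketch that normalizes the subset columns before dropping:

# hypothetical cleanup: strip whitespace so visually identical values actually match
for col in ['Anonymized_ID', 'COURSE', 'GRADE']:
    df_combined[col] = df_combined[col].astype(str).str.strip()

df_combined.drop_duplicates(subset=['Anonymized_ID', 'COURSE', 'GRADE'],
                            keep='last', inplace=True)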

How to make a new dataframe from an existing dataframe with unique row values of one column and corresponding row values from other columns?

I have a dataframe 'raw' that looks like this -
It has many rows with duplicate values in each column.
I want to make a new dataframe 'new_df' which has the unique customer_code values and the corresponding market_code.
The new_df should look like this -
It sounds like you simply want to create a DataFrame with unique customer_code values which also shows the market_code. Here's a way to do it:
df = df[['customer_code','market_code']].drop_duplicates('customer_code')
Output:
customer_code market_code
0 Cus001 Mark001
1 Cus003 Mark003
3 Cus004 Mark003
4 Cus005 Mark004
The df[['customer_code','market_code']] part gives us a DataFrame containing only the two columns of interest, and drop_duplicates('customer_code') eliminates all but the first occurrence of each duplicated value in the customer_code column (you could instead keep the last occurrence by passing keep='last').
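Since the 'raw' frame is only shown as an image, here is a self-contained sketch with stand-in data (values invented to reproduce the output above):

import pandas as pd

raw = pd.DataFrame({
    'customer_code': ['Cus001', 'Cus003', 'Cus001', 'Cus004', 'Cus005'],
    'market_code':   ['Mark001', 'Mark003', 'Mark001', 'Mark003', 'Mark004'],
})

new_df = raw[['customer_code', 'market_code']].drop_duplicates('customer_code')
print(new_df)
#   customer_code market_code
# 0        Cus001     Mark001
# 1        Cus003     Mark003
# 3        Cus004     Mark003
# 4        Cus005     Mark004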

How can I combine columns in pandas dataframe with duplicate names?

I have got a pandas DataFrame which looks like the following:
There are multiple columns for Australia, one per province, titled Australia, Australia.1, Australia.2 and so on. The same goes for other countries such as the USA, the UK and Canada. I want only one column per country. For example, I want a single column named Australia containing the sum of the values across its provinces, and I want to avoid duplicate column names. How can I do this with pandas in Python?
You can transpose the DataFrame and reset the index. Then strip the .1, .2, … suffixes from the country names and group by that column, which after reset_index is called index:
df = df.T.reset_index()  # country names move into the 'index' column
df['index'] = df['index'].str.replace(r'\.\d+', '', regex=True)  # Australia.1 -> Australia
df.iloc[:, 2:] = df.iloc[:, 2:].astype(float)  # make the value columns numeric (the offset depends on your frame)
df = df.groupby('index').sum()
df
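If you need the original orientation back (one column per country), a final transpose should restore it. A compact sketch with invented columns (the astype step is skipped because this toy data is already numeric):

import pandas as pd

df = pd.DataFrame({'Australia': [1, 2], 'Australia.1': [3, 4], 'US': [5, 6]})

t = df.T.reset_index()
t['index'] = t['index'].str.replace(r'\.\d+', '', regex=True)
out = t.groupby('index').sum().T   # transpose back after summing
print(out)
# index  Australia  US
# 0              4   5
# 1              6   6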

Collapsing values of a Pandas column based on Non-NA value of other column

I have data like this in a CSV file which I am importing into a pandas DataFrame:
I want to collapse the values of the Type column by concatenating its strings into one sentence, placing it in the first row next to the date value, while keeping all other rows and values the same.
As shown below.
You can try ffill + transform:
df1 = df.copy()
# forward-fill the grouping keys so every row knows which (Number, Date) block it belongs to
df1[['Number', 'Date']] = df1[['Number', 'Date']].ffill()
df1.Type = df1.Type.fillna('')
# concatenate the Type strings within each block
s = df1.groupby(['Number', 'Date']).Type.transform(' '.join)
# put the full sentence on the first row of each block and blank out the rest
df.loc[df.Date.notnull(), 'Type'] = s
df.loc[df.Date.isnull(), 'Type'] = ''
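A toy run, with data invented to mimic the screenshot (NaN marks the continuation rows):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Number': [1, np.nan, np.nan, 2],
    'Date':   ['5/1/20', np.nan, np.nan, '5/2/20'],
    'Type':   ['went to', 'the', 'store', 'slept'],
})

df1 = df.copy()
df1[['Number', 'Date']] = df1[['Number', 'Date']].ffill()
df1.Type = df1.Type.fillna('')
s = df1.groupby(['Number', 'Date']).Type.transform(' '.join)
df.loc[df.Date.notnull(), 'Type'] = s
df.loc[df.Date.isnull(), 'Type'] = ''
print(df)
#    Number    Date               Type
# 0     1.0  5/1/20  went to the store
# 1     NaN     NaN
# 2     NaN     NaN
# 3     2.0  5/2/20              slept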

pandas unique values multiple columns different dtypes

Similar to pandas unique values multiple columns, I want to count the number of unique values per column. However, as the dtypes differ, I get the following error:
The data frame looks like
A small[['TARGET', 'title']].apply(pd.Series.describe) gives me the result, but only for the category dtypes, and I am unsure how to filter the index for just the last row, which holds the unique-value count per column.
Use apply and np.unique to grab the unique values in each column and take their size:
import numpy as np

small[['TARGET', 'title']].apply(lambda x: np.unique(x).size)
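Alternatively, pandas has a built-in that gives the same per-column counts directly (note it skips NaN unless you pass dropna=False):

# per-column count of distinct values, no numpy needed
small[['TARGET', 'title']].nunique()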
