Ungrouping a pandas dataframe after aggregation operation - python

I have used the "groupby" method on my dataframe to find the total number of people at each location.
To the right of the "sum" column, I need to add a column that lists all of the people's names at each location (ideally in separate rows, but a list would be fine too).
Is there a way to "ungroup" my dataframe again after having found the sum?
dataframe.groupby(by=['location'], as_index=False)['people'].agg('sum')

You can do two different things:
(1) Create an aggregated DataFrame using groupby.agg, calling the appropriate method for each column. The code below sums people and lists all names for each location:
out = dataframe.groupby(by=['location'], as_index=False).agg({'people':'sum', 'name':list})
(2) Use groupby.transform to add a new column to dataframe that has the sum of people by location in each row:
dataframe['sum'] = dataframe.groupby(by=['location'])['people'].transform('sum')
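For illustration, here is a minimal, self-contained sketch with made-up data (the location/name/people values are invented; the column names follow the question) showing both approaches side by side:

import pandas as pd

dataframe = pd.DataFrame({'location': ['NY', 'NY', 'LA'],
                          'name': ['Alice', 'Bob', 'Cara'],
                          'people': [3, 2, 5]})

# (1) One row per location: summed count plus the list of names
out = dataframe.groupby(by=['location'], as_index=False).agg({'people': 'sum', 'name': list})
print(out)
#   location  people          name
# 0       LA       5        [Cara]
# 1       NY       5  [Alice, Bob]

# (2) Keep every original row and add the per-location sum as a column
dataframe['sum'] = dataframe.groupby(by=['location'])['people'].transform('sum')
print(dataframe)
#   location   name  people  sum
# 0       NY  Alice       3    5
# 1       NY    Bob       2    5
# 2       LA   Cara       5    5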

I think you are looking for transform? It returns a result aligned with the original rows, so you can assign it straight back:
dataframe['sum'] = dataframe.groupby('location')['people'].transform('sum')

Related

EDA for loop on multiple columns of dataframe in Python

Just a random question. Given a dataframe, df, from the Boston Homes dataset, I'm trying to do EDA on a few of the columns, stored in a variable feature_cols, which I could then use to check for NA values. The code in my screenshots throws an error, and after that first step I was hoping to loop over those columns. Any feedback would be greatly appreciated. Thanks in advance.
There are two problems in your screenshots. The first is a KeyError: to access a subset of a dataframe's columns, you need to pass the column names in a list, not a tuple, so the first line should be
feature_cols = df[['RM','ZN','B']]
Note that this returns a dataframe with three columns. Second, the for loop isn't needed at all; pandas can count the missing values in every column in one line:
df.isna().sum()
This prints each column name of the dataframe along with the count of missing values in that column. If you want to check only a subset of columns, replace df with df[list_of_column_names].
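A minimal sketch, using the RM/ZN/B columns from the question with invented values:

import numpy as np
import pandas as pd

# Made-up stand-in for the Boston Homes data
df = pd.DataFrame({'RM': [6.5, np.nan, 5.9],
                   'ZN': [18.0, 0.0, np.nan],
                   'B': [396.9, 396.9, 394.6]})

print(df[['RM', 'ZN', 'B']].isna().sum())
# RM    1
# ZN    1
# B     0
# dtype: int64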
To access multiple columns, you only need to store the column names in a list, for example
feature_cols = ['RM','ZN','B']
and then access them as
x = df[feature_cols]
To iterate over those columns of df, you can use
for column in df[feature_cols]:
    print(df[column])  # or anything else
As per your updated comment, if your end goal is only to see the null counts, you can get them without looping, e.g.
df[feature_cols].info(verbose=True, show_counts=True)  # null_counts= in pandas < 1.2

GroupBy using select columns with apply(list) and retaining other columns of the dataframe

import pandas as pd

data = {'order_num': [123, 234, 356, 123, 234, 356],
        'email': ['abc@gmail.com', 'pqr@hotmail.com', 'xyz@yahoo.com',
                  'abc@gmail.com', 'pqr@hotmail.com', 'xyz@gmail.com'],
        'product_code': ['rdcf1', '6fgxd', '2sdfs', '34fgdf', 'gvwt5', '5ganb']}
df = pd.DataFrame(data, columns=['order_num', 'email', 'product_code'])
For the sake of simplicity I omitted the other columns while making the example. What I need to do is group by the order_num column, apply(list) on product_code, sort the groups based on a timestamp column, and retain columns like email as they are.
I tried doing something like:
df.groupby(['order_num', 'email', 'timestamp'])['product_code'].apply(list).reset_index().sort_values(by='timestamp')
This gives the output I expect, but I do not wish to group by the other columns. Is there any alternative way to perform the list aggregation? I tried using transform, but it threw a size-mismatch error and I don't think it's the right way to go either.
If there are a lot of other columns and you need to group by order_num only, use Series.map to fill a new column with the lists, then remove duplicates with DataFrame.drop_duplicates on order_num, and finally sort if necessary:
df['product_code']=df['order_num'].map(df.groupby('order_num')['product_code'].apply(list))
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
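Applied to the sample data from the question (a made-up timestamp column is added here purely so the final sort has something to work with):

import pandas as pd

data = {'order_num': [123, 234, 356, 123, 234, 356],
        'email': ['abc@gmail.com', 'pqr@hotmail.com', 'xyz@yahoo.com',
                  'abc@gmail.com', 'pqr@hotmail.com', 'xyz@gmail.com'],
        'product_code': ['rdcf1', '6fgxd', '2sdfs', '34fgdf', 'gvwt5', '5ganb']}
df = pd.DataFrame(data)
df['timestamp'] = pd.date_range('2020-01-01', periods=len(df), freq='h')  # invented

df['product_code'] = df['order_num'].map(df.groupby('order_num')['product_code'].apply(list))
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
print(df)
#    order_num            email     product_code           timestamp
# 0        123    abc@gmail.com  [rdcf1, 34fgdf] 2020-01-01 00:00:00
# 1        234  pqr@hotmail.com   [6fgxd, gvwt5] 2020-01-01 01:00:00
# 2        356    xyz@yahoo.com   [2sdfs, 5ganb] 2020-01-01 02:00:00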

Using describe() method to exclude a column

I am new to using Python with data sets and am trying to exclude a column ("id") from being shown in the output. How would I go about this using describe() and its exclude parameter?
describe's include/exclude parameters work on datatypes: you can include or exclude based on dtype, not on column names. If your id column has a dtype that no other column shares, then
df.describe(exclude=[datatype])
will work. Or, if you just want to leave particular column(s) out of describe, try this:
cols = set(df.columns) - {'id'}
df1 = df[list(cols)]
df1.describe()
Ta-da, it's done. For more info, see the pandas documentation on describe.
You can do that by slicing your original DF to drop the 'id' column. One way is through .iloc. Suppose 'id' is the first column of your DF; then you could do this:
df.iloc[:,1:].describe()
The first colon selects the rows, the second the columns.
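A quick sketch, assuming 'id' really is the first column (data invented):

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'height': [1.7, 1.8, 1.6],
                   'weight': [65, 80, 58]})

# All rows, every column except the first ('id')
print(df.iloc[:, 1:].describe())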
Although somebody already responded with an example from the official docs, which is more than enough, I'd just like to add this, since it might help a few people:
If your DataFrame is large (say, hundreds of columns), removing one or two columns might not be enough; instead, create a smaller DataFrame holding only what you're interested in and go from there.
Example of removing 2+ columns:
columns_you_want = set(your_bigger_data_frame.columns) - {'column_1', 'column_2', 'column_3', 'etc'}
your_new_smaller_data_frame = your_bigger_data_frame[list(columns_you_want)]
your_new_smaller_data_frame.describe()
If your DataFrame is small or medium sized, you already know every column, and you only need a few of them, just create a new DataFrame and then apply describe().
Here's an example that reads a .csv file and then keeps only the columns you need:
df = pd.read_csv(r'.\docs\project\file.csv')
df = df[['column_1', 'column_2', 'column_3', 'etc']]
df.describe()
Use output.drop(columns=['id']).describe()

Keep rows from a dataframe whose index name is NOT in a given list

So, I have a list of tuples and a multi-index dataframe. I want to find the rows of the dataframe whose indices are NOT included in the list of tuples and create a new dataframe from those rows. Any help? Thanks!
You can use isin with a negation to explicitly filter your DataFrame:
new_df = df[~df.index.isin(list_of_tuples)]
Alternatively, use drop to remove the tuples you don't want to be included in the new DataFrame.
new_df = df.drop(list_of_tuples)
From a couple of simple tests, using isin appears to be faster, although drop is a bit more readable.
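A minimal sketch with a made-up two-level index, showing both options:

import pandas as pd

df = pd.DataFrame({'value': [10, 20, 30, 40]},
                  index=pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)],
                                                  names=['letter', 'num']))
list_of_tuples = [('a', 2), ('b', 1)]

new_df = df[~df.index.isin(list_of_tuples)]  # keep rows whose index is NOT in the list
same_df = df.drop(list_of_tuples)            # equivalent: drop the listed index entries
print(new_df)
#             value
# letter num
# a      1       10
# b      2       40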

Simple way to create multiple columns with one row from multiple rows sharing the same key in pandas

I'm trying to create a dataframe from vote data that is in the following format:
Name,StateCode,GeoStratum,CountyCode,fips,Precinct,PrecinctReport,TotalVotes,FullName,VoteCount,ElectoralVote,Percent
Hawaii,HI,2,1,15001,43,43,64865,Hillary Clinton,64
Hawaii,HI,2,1,15001,43,43,64865,Donald Trump,27
Hawaii,HI,2,1,15001,43,43,64865,Gary Johnson,4
Hawaii,HI,2,1,15001,43,43,64865,Jill Stein,4
I'd like to convert this data into a format like this:
Name,StateCode,GeoStratum,CountyCode,fips,Precinct,PrecinctReport,TotalVotes,FullName,VoteCount,ElectoralVote,Clinton,Trump,Johnson,Stein
Hawaii,HI,2,1,15001,43,43,64865,64,27,4,4
Is there a simple way to take the fips column as a key and then, wherever 'Hillary Clinton', 'Donald Trump', etc. appear in FullName, use the corresponding VoteCount values to fill the Clinton, Trump, etc. columns?
Of course, a couple of nested loops would do this; I'm hoping to find a nicer way.
Use pivot_table and declare the index, columns, and values you want to get in your pivoted data:
df.pivot_table(index=['Name', 'StateCode', 'GeoStratum', 'CountyCode', 'fips', 'Precinct',
'PrecinctReport', 'TotalVotes'], columns='FullName', values='VoteCount')
Then use reset_index to get the table you need, and drop any useless columns left over from the pivot.
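Applied to the sample rows from the question (trimmed to a few of the key columns for readability):

import pandas as pd

df = pd.DataFrame({'Name': ['Hawaii'] * 4,
                   'fips': [15001] * 4,
                   'TotalVotes': [64865] * 4,
                   'FullName': ['Hillary Clinton', 'Donald Trump', 'Gary Johnson', 'Jill Stein'],
                   'VoteCount': [64, 27, 4, 4]})

wide = df.pivot_table(index=['Name', 'fips', 'TotalVotes'],
                      columns='FullName', values='VoteCount').reset_index()
print(wide)
# FullName    Name   fips  TotalVotes  Donald Trump  Gary Johnson  Hillary Clinton  Jill Stein
# 0         Hawaii  15001       64865            27             4               64           4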
