Aggregating similar rows in Pandas

Aggregating similar rows in Pandas - python

I've got a dataframe that's currently aggregated by zip code, and looks similar to this:
Year Organization State Zip Number_of_people
2021 A NJ 07090 5
2020 B AZ 09876 3
2021 A NJ 01234 2
2021 C VA 23456 7
2019 A NJ 05385 1
I want to aggregate the dataframe and Number_of_People column by state instead, combining identical rows (aside from Number of people) so that the data above instead looks like this:
Year Organization State Number_of_people
2021 A NJ 7
2020 B AZ 3
2021 C VA 7
2019 A NJ 1
In other words, if rows are identical in all columns EXCEPT Number_of_people, I want to combine the rows and add the number_of_people.
I'm stuck on how to approach this problem after deleting the Zip column -- I think I need to group by Year, Organization, and State but not sure what to do after that.

A more pythonic version without zip codes
df.groupby(['Year','Organization','State'], as_index=False)['Number_of_people'].sum()
A more pythonic version with zip codes
df.groupby(['Year','Organization','State'], as_index=False).sum()

You don't have to drop zip first if you don't want, use the syntax below.
data = '''Year Organization State Zip Number_of_people
2021 A NJ 07090 5
2020 B AZ 09876 3
2021 A NJ 01234 2
2021 C VA 23456 7
2019 A NJ 05385 1'''
df = pd.read_csv(io.StringIO(data), sep='\s+', engine='python')
df[['Year','Organization','State', 'Number_of_people']].groupby(['Year','Organization','State']).sum().reset_index()
Output
Year Organization State Number_of_people
0 2019 A NJ 1
1 2020 B AZ 3
2 2021 A NJ 7
3 2021 C VA 7
If you do want to drop the zip code, then use this:
df.groupby(['Year','Organization','State']).sum().reset_index()

Related

I want to filter rows from data frame where the year is 2020 and 2021 using re.search and re.match functions

Data Frame:
Unnamed: 0 date target insult tweet year
0 1 2014-10-09 thomas-frieden fool Can you believe this fool, Dr. Thomas Frieden ... 2014
1 2 2014-10-09 thomas-frieden DOPE Can you believe this fool, Dr. Thomas Frieden ... 2014
2 3 2015-06-16 politicians all talk and no action Big time in U.S. today - MAKE AMERICA GREAT AG... 2015
3 4 2015-06-24 ben-cardin It's politicians like Cardin that have destroy... Politician #SenatorCardin didn't like that I s... 2015
4 5 2015-06-24 neil-young total hypocrite For the nonbeliever, here is a photo of #Neily... 2015
I want the data frame which consists for only year with 2020 and 2021 using search and match methods.

df_filtered = df.loc[df.year.str.contains('2014|2015', regex=True) == True]

Get latest value looked up from other dataframe

My first data frame
product=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107,101],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop','New Watch'],
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0,9898.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore','New York']
})
My 2nd data frame has transactions
customer=pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9],
'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
'age':[20,25,15,10,30,65,35,18,23],
'Product_ID':[101,0,106,0,103,104,0,0,107],
'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
I want Price from 1st data frame to come in the merged dataframe. Common element being 'Product_ID'. Note that against product_ID 101, there are 2 prices - 299.00 and 9898.00. I want the later one to come in the merged data set i.e. 9898.0 (Since this is latest price)
Currently my code is not giving the right answer. It is giving both
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0

There is no explicit timestamp so I assume the index is the order of the dataframe. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Please note keep = 'last' argument since we are keeping only last price registered.
Deduplication should be done before merging if Yuo care about performace or dataset is huge:
product = product.drop_duplicates(subset = ['Product_ID'], keep = 'last')

In your data frame there is no indicator of latest entry, so you might need to first remove the the first entry for id 101 from product dataframe as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')

How to fill dataframe's empty/nan cell with conditional column mean

I am trying to fill the (pandas) dataframe's null/empty value using the mean of that specific column.
The data looks like this:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019
.
.
I am trying to fill that empty cell with the mean of Revenue column where Industry is == 'Construction'.
To get our numerical mean value I did:
df.groupby(['Industry'], as_index = False).mean()
I am trying to do something like this to fill up that empty cell in-place:
(df[df['Industry'] == "Construction"]['Revenue']).fillna("$21212121.01", inplace = True)
..but it is not working. Can anyone tell me how to achieve it! Thanks a lot.
Expected Output:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013 $21212121.01
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019 $21212121.01
.
.

Although the numbers used as averages are different, we have presented two types of averages: the normal average and the average calculated on the number of cases that include NaN.
df['Revenue'] = df['Revenue'].replace({'\$':'', ',':''}, regex=True)
df['Revenue'] = df['Revenue'].astype(float)
df_mean = df.groupby(['Industry'], as_index = False)['Revenue'].mean()
df_mean
Industry Revenue
0 Construction 4.358071e+06
1 Financial Services 8.858420e+06
2 IT Services 1.175702e+07
df_mean_nan = df.groupby(['Industry'], as_index = False)['Revenue'].agg({'Sum':np.sum, 'Size':np.size})
df_mean_nan['Mean_nan'] = df_mean_nan['Sum'] / df_mean_nan['Size']
df_mean_nan
Industry Sum Size Mean_nan
0 Construction 13074212.0 5.0 2614842.4
1 Financial Services 17716840.0 2.0 8858420.0
2 IT Services 11757018.0 1.0 11757018.0
Average taking into account the number of NaNs
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean_nan.loc[df_mean_nan['Industry'] == 'Construction',['Mean_nan']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5387469.0
1 2 Rednimdox Construction 2013 2614842.4
2 3 Lamtone IT Services 2009 11757018.0
3 4 Stripfind Financial Services 2010 12329371.0
4 5 Openjocon Construction 2013 4273207.0
5 6 Villadox Construction 2012 1097353.0
6 7 Sumzoomit Construction 2010 7703652.0
7 8 Abcddd Construction 2019 2614842.4
Normal average: (NaN is excluded)
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean.loc[df_mean['Industry'] == 'Construction',['Revenue']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5.387469e+06
1 2 Rednimdox Construction 2013 4.358071e+06
2 3 Lamtone IT Services 2009 1.175702e+07
3 4 Stripfind Financial Services 2010 1.232937e+07
4 5 Openjocon Construction 2013 4.273207e+06
5 6 Villadox Construction 2012 1.097353e+06
6 7 Sumzoomit Construction 2010 7.703652e+06
7 8 Abcddd Construction 2019 4.358071e+06

Calculating new rows in a Pandas Dataframe on two different columns

So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows on each country where the year adds 20 and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values though, I just want to append the new values.
This is how the dataframe looks:
Preferably I would also want to create a loop that runs the code a certain number of times
Super grateful for any help!
If you need to copy the values from the dataframe as an example you can have it here:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns

If you want to repeat the rows, I'd create a new dataframe, perform any operation in the new dataframe (sum 20 years, multiply the temperature by a constant or an array, etc...) and use then use concat() to append it to the original dataframe:
import pandas as pd
tempChange=1.15
data = {'Country':['Afghanistan','Africa','Albania','Algeria','American Samoa'],'avgTemp':[14,24,13,23,27],'Year':[2012,2012,2012,2012,2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp']*tempChange
df_2['Year'] = df['Year']+20
df = pd.concat([df,df_2]) #ignore_index=True if you wish to not repeat the index value
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032

where df is your data frame name:
df['tempChange'] = df['year']+ 20 * df['avgTemp']
This will add a new column to your df with the logic above. I'm not sure if I understood your logic correct so the math may need some work

I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20,axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp']*tempChange,axis=1)
This is how you apply to each row.

How to extract info from original dataframe after doing some analysis on it?

So I had a dataframe and I had to do some cleansing to minimize the duplicates. In order to do that I created a dataframe that had instead of 40 only 8 of the original columns. Now I have two columns I need for further analysis from the original dataframe but they would mess with the desired outcome if I used them in my previous analysis. Anyone have any idea on how to "extract" these columns based on the new "clean" dataframe I have?

You can merge the new "clean" dataframe with the other two variables by using the indexes. Let me use a pratical example. Suppose the "initial" dataframe, called "df", is:
df
name year reports location
0 Jason 2012 4 Cochice
1 Molly 2012 24 Pima
2 Tina 2013 31 Santa Cruz
3 Jake 2014 2 Maricopa
4 Amy 2014 3 Yuma
while the "clean" dataframe is:
d1
year location
0 2012 Cochice
2 2013 Santa Cruz
3 2014 Maricopa
The remaing columns are saved in dataframe "d2" ( d2 = df[['name','reports']] ):
d2
name reports
0 Jason 4
1 Molly 24
2 Tina 31
3 Jake 2
4 Amy 3
By using the inner join on the indexes d1.merge(d2, how = 'inner' left_index= True, right_index = True) you get the following result:
name year reports location
0 Jason 2012 4 Cochice
2 Tina 2013 31 Santa Cruz
3 Jake 2014 2 Maricopa

You can make a new dataframe with the specified columns;
import pandas
#If your columns are named a,b,c,d etc
df1 = df[['a','b']]
#This will extract columns 0, to 2 based on their index
#[remember that pandas indexes columns from zero!
df2 = df.iloc[:,0:2]
If you could, provide a sample piece of data, that'd make it easier for us to help you.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Aggregating similar rows in Pandas - python

A more pythonic version without zip codes df.groupby(['Year','Organization','State'], as_index=False)['Number_of_people'].sum() A more pythonic version with zip codes df.groupby(['Year','Organization','State'], as_index=False).sum()

Related

I want to filter rows from data frame where the year is 2020 and 2021 using re.search and re.match functions

Get latest value looked up from other dataframe

How to fill dataframe's empty/nan cell with conditional column mean

Calculating new rows in a Pandas Dataframe on two different columns

How to extract info from original dataframe after doing some analysis on it?

Categories

Resources