Python column hierarchy creation for groupby

I have a question I can't get my head around, although I'm sure it's dead simple.
I import an Excel file with sales figures of cars.
I need to be able to report on it by country.
The country is not part of the file, but I know which brands belong to which country. (I can create another DataFrame from it, or a list, or a dict...)
My idea was to create a hierarchy in the columns. I just can't figure out how.
import pandas as pd

german = ['BMW', 'Audi', 'Mercedes', 'Volkswagen']
italian = ['Fiat', 'Ferrari']
toclean = pd.DataFrame(
    [['car', '4', '5', 10, 20, 15, 50, 20, 13, 24]],
    columns=['type', 'wheels', 'seats', 'BMW', 'Audi', 'Mercedes',
             'Volkswagen', 'Fiat', 'Ferrari', 'SEAT'])
  type wheels seats  BMW  Audi  Mercedes  Volkswagen  Fiat  Ferrari  SEAT
0  car      4     5   10    20        15          50    20       13    24

Something like this?
def country(brand):
    if brand in german:
        return 'germany'
    elif brand in italian:
        return 'italy'
    else:
        return None

long_df = toclean.melt()
long_df['country'] = long_df['variable'].map(country)
long_df
     variable value  country
0        type   car     None
1      wheels     4     None
2       seats     5     None
3         BMW    10  germany
4        Audi    20  germany
5    Mercedes    15  germany
6  Volkswagen    50  germany
7        Fiat    20    italy
8     Ferrari    13    italy
9        SEAT    24     None
long_df.groupby('country')['value'].sum()
country
germany 95
italy 33
Name: value, dtype: int64
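For completeness, the column-hierarchy idea the question asks about also works directly: put the country as an outer level of a MultiIndex on the brand columns, then sum per country. A minimal sketch; the 'other' bucket is my own label for brands not in either list (SEAT is actually Spanish):

```python
import pandas as pd

german = ['BMW', 'Audi', 'Mercedes', 'Volkswagen']
italian = ['Fiat', 'Ferrari']

toclean = pd.DataFrame(
    [['car', '4', '5', 10, 20, 15, 50, 20, 13, 24]],
    columns=['type', 'wheels', 'seats', 'BMW', 'Audi', 'Mercedes',
             'Volkswagen', 'Fiat', 'Ferrari', 'SEAT'])

def country(brand):
    if brand in german:
        return 'germany'
    if brand in italian:
        return 'italy'
    return 'other'

# Keep only the brand columns and add the country as an outer column level
brands = toclean[german + italian + ['SEAT']].copy()
brands.columns = pd.MultiIndex.from_tuples(
    [(country(b), b) for b in brands.columns], names=['country', 'brand'])

# Transpose, group on the outer level, and transpose back for per-country totals
totals = brands.T.groupby(level='country').sum().T
```

With the example row this gives germany 95, italy 33, other 24 per car, and you can still drill down to individual brands via `brands['germany']`.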


Flag if a row is duplicated and attach whether it's the 1st, 2nd, etc. duplicated row

I'd like to flag if a row is duplicated, and attach whether it's the 1st, 2nd, 3rd, etc. duplicated row in a Pandas DataFrame.
More visually, I'd like to go from:
id  Country  City
1   France   Paris
2   France   Paris
3   France   Lyon
4   France   Lyon
5   France   Lyon
to
id  Country  City   duplicated_flag
1   France   Paris  1
2   France   Paris  1
3   France   Lyon   2
4   France   Lyon   2
5   France   Lyon   2
Note that id is not taken into account to see if the row is duplicated.
Two options:
First, if you have lots of columns that you need to compare, you can use:
comparison_df = df.drop("id", axis=1)
df["duplicated_flag"] = (comparison_df != comparison_df.shift()).any(axis=1).cumsum()
We drop the columns that aren't needed in the comparison. Then, we check whether each row differs from the one above it using .shift() and .any(). Finally, .cumsum() turns those change points into the running group numbers that become duplicated_flag.
But, if you only have two columns to compare (or if for some reason you have lots of columns that you need to drop), you can find mismatched rows one at a time, and then use .cumsum() to get the value of duplicated_flag for each row. It's a bit more verbose so I'm not super happy with this option, but I'm leaving this here for completeness in case this suits your use case better:
country_comparison = df["Country"].ne(df["Country"].shift())
city_comparison = df["City"].ne(df["City"].shift())
df["duplicated_flag"] = (country_comparison | city_comparison).cumsum()
print(df)
These output:
id Country City duplicated_flag
0 1 France Paris 1
1 2 France Paris 1
2 3 France Lyon 2
3 4 France Lyon 2
4 5 France Lyon 2
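For reference, here is the first option end to end, with the example frame built inline:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Country': ['France'] * 5,
    'City': ['Paris', 'Paris', 'Lyon', 'Lyon', 'Lyon'],
})

# Compare every non-id column with the row above; any change starts a new group
comparison_df = df.drop('id', axis=1)
df['duplicated_flag'] = (comparison_df != comparison_df.shift()).any(axis=1).cumsum()
```

Note that this numbers runs of consecutive identical rows; if identical rows can be scattered through the frame, sort on the compared columns first.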

How to get top 3 sales in data frame after using group by and sorting in python?

I have recently been working with this data set:
import pandas as pd
data = {'Product':['Box','Bottles','Pen','Markers','Bottles','Pen','Markers','Bottles','Box','Markers','Markers','Pen'],
'State':['Alaska','California','Texas','North Carolina','California','Texas','Alaska','Texas','North Carolina','Alaska','California','Texas'],
'Sales':[14,24,31,12,13,7,9,31,18,16,18,14]}
df1=pd.DataFrame(data, columns=['Product','State','Sales'])
df1
I want to find the 3 groups that have the highest sales
grouped_df1 = df1.groupby('State')
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False))
So I have a dataframe like this
Now, I want to find the top 3 State that have the highest sales.
I tried to use
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False)).head(3)
# It gives me the first three rows
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False)).max()
#It only gives me the maximum value
The expected result should be:
Texas: 31
California: 24
North Carolina: 18
Thus, how can I fix it? Sometimes a single State can hold the 3 top sales values; for example, Alaska might have the 3 highest rows. If I simply sort and take the head, the top 3 rows will all be Alaska, and the other 2 groups will be missed.
Many thanks!
You could add a new column called Sales_Max_For_State and then use drop_duplicates and nlargest:
>>> df1['Sales_Max_For_State'] = df1.groupby(['State'])['Sales'].transform('max')
>>> df1
Product State Sales Sales_Max_For_State
0 Box Alaska 14 16
1 Bottles California 24 24
2 Pen Texas 31 31
3 Markers North Carolina 12 18
4 Bottles California 13 24
5 Pen Texas 7 31
6 Markers Alaska 9 16
7 Bottles Texas 31 31
8 Box North Carolina 18 18
9 Markers Alaska 16 16
10 Markers California 18 24
11 Pen Texas 14 31
>>> df2 = df1.drop_duplicates(['Sales_Max_For_State']).nlargest(3, 'Sales_Max_For_State')[['State', 'Sales_Max_For_State']]
>>> df2
State Sales_Max_For_State
2 Texas 31
1 California 24
3 North Carolina 18
I think there are a few ways to do this:
1- df1.groupby('State').agg({'Sales': 'max'}).sort_values(by='Sales', ascending=False).iloc[:3]
2- df1.groupby('State').agg({'Sales': 'max'})['Sales'].nlargest(3)
Sales
State
Texas 31
California 24
North Carolina 18
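The second variant can be shortened further by skipping agg entirely: take the per-group max and feed it straight to nlargest. Using the question's data:

```python
import pandas as pd

data = {'Product': ['Box', 'Bottles', 'Pen', 'Markers', 'Bottles', 'Pen',
                    'Markers', 'Bottles', 'Box', 'Markers', 'Markers', 'Pen'],
        'State': ['Alaska', 'California', 'Texas', 'North Carolina', 'California',
                  'Texas', 'Alaska', 'Texas', 'North Carolina', 'Alaska',
                  'California', 'Texas'],
        'Sales': [14, 24, 31, 12, 13, 7, 9, 31, 18, 16, 18, 14]}
df1 = pd.DataFrame(data)

# Highest sale per State, then the three largest of those one-per-group maxima
top3 = df1.groupby('State')['Sales'].max().nlargest(3)
```

Because the max is taken per group first, each State appears at most once, so a State with several large rows cannot crowd out the other groups.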

Get latest value looked up from other dataframe

My first data frame
product = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107, 101],
    'Product_name': ['Watch', 'Bag', 'Shoes', 'Smartphone', 'Books', 'Oil', 'Laptop', 'New Watch'],
    'Category': ['Fashion', 'Fashion', 'Fashion', 'Electronics', 'Study', 'Grocery', 'Electronics', 'Electronics'],
    'Price': [299.0, 1350.50, 2999.0, 14999.0, 145.0, 110.0, 79999.0, 9898.0],
    'Seller_City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata', 'Delhi', 'Chennai', 'Bengalore', 'New York']
})
My 2nd data frame has transactions
customer = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'name': ['Olivia', 'Aditya', 'Cory', 'Isabell', 'Dominic', 'Tyler', 'Samuel', 'Daniel', 'Jeremy'],
    'age': [20, 25, 15, 10, 30, 65, 35, 18, 23],
    'Product_ID': [101, 0, 106, 0, 103, 104, 0, 0, 107],
    'Purchased_Product': ['Watch', 'NA', 'Oil', 'NA', 'Shoes', 'Smartphone', 'NA', 'NA', 'Laptop'],
    'City': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Chennai', 'Delhi', 'Kolkata', 'Delhi', 'Mumbai']
})
I want Price from the 1st data frame to come into the merged dataframe, the common element being 'Product_ID'. Note that against Product_ID 101 there are 2 prices, 299.00 and 9898.00. I want the latter one, 9898.0, in the merged data set (since this is the latest price).
Currently my code is not giving the right answer. It is giving both:
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp so I assume the index is the order of the dataframe. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Please note the keep='last' argument, since we keep only the last price registered.
Deduplication should be done before merging if you care about performance or the dataset is huge:
product = product.drop_duplicates(subset = ['Product_ID'], keep = 'last')
In your data frame there is no indicator of the latest entry, so you might need to first remove the first entry for ID 101 from the product dataframe as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')
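Putting the two answers together, a self-contained sketch of the dedupe-then-merge approach (the frames are trimmed to the columns and rows relevant to the join):

```python
import pandas as pd

product = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107, 101],
    'Price': [299.0, 1350.50, 2999.0, 14999.0, 145.0, 110.0, 79999.0, 9898.0],
})
customer = pd.DataFrame({
    'id': [1, 2, 3],
    'Product_ID': [101, 0, 106],
})

# Keep only the latest (last) price per Product_ID, then left-merge onto customers
latest = product.drop_duplicates(subset=['Product_ID'], keep='last')
customerpur = pd.merge(customer, latest[['Product_ID', 'Price']],
                       on='Product_ID', how='left')
```

Deduplicating before the merge keeps the result one row per customer, so no rows have to be dropped afterwards; Product_IDs with no match (like 0) simply get a NaN Price.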

Extract country name from text in column to create another column

I have tried different combinations to extract the country names from a column and create a new column with solely the countries. I can do it for selected rows i.e. df.address[9998] but not for the whole column.
import pycountry

Cntr = []
for country in pycountry.countries:
    for country.name in df.address:
        Cntr.append(country.name)
Any ideas what is going wrong here?
edit:
address is an object in the df and
df.address[:10] looks like this
Address
0 Turin, Italy
1 NaN
2 Zurich, Switzerland
3 NaN
4 Glyfada, Greece
5 Frosinone, Italy
6 Dublin, Ireland
7 NaN
8 Turin, Italy
9 Kristiansand, Norway
Name: address, Length: 10, dtype: object
Based on Petar's response, when I run individual queries I get the country correctly, but when I try to create a column with all the countries (or ranges like df.address[:5]) I get an empty Cntr.
import pycountry

Cntr = []
for country in pycountry.countries:
    if country.name in df['address'][1]:
        Cntr.append(country.name)
Cntr
Returns
[Italy]
and df.address[2] returns [ ]
etc.
I have also run
df['address'] = df['address'].astype('str')
to make sure that there are no floats or int in the column.
Sample dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'address': ['Turin, Italy', np.nan, 'Zurich, Switzerland', np.nan, 'Glyfada, greece']})
df[['city', 'country']] = df['address'].str.split(',', expand=True, n=1)
address city country
0 Turin, Italy Turin Italy
1 NaN NaN NaN
2 Zurich, Switzerland Zurich Switzerland
3 NaN NaN NaN
4 Glyfada, greece Glyfada greece
You were really close. We cannot loop like this: for country.name in df.address. Also note that `in` on a Series tests the index, not the values, so we match against the values instead:
import pycountry

Cntr = []
for country in pycountry.countries:
    if df.address.str.contains(country.name, na=False).any():
        Cntr.append(country.name)
If this does not work, please supply more information, because I am unsure what df.address looks like.
You can use the function clean_country() from the library DataPrep. Install it with pip install dataprep.
from dataprep.clean import clean_country
import numpy as np
import pandas as pd

df = pd.DataFrame({"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, Greece"]})
df2 = clean_country(df, "address")
df2
address address_clean
0 Turin, Italy Italy
1 NaN NaN
2 Zurich, Switzerland Switzerland
3 NaN NaN
4 Glyfada, Greece Greece
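If you'd rather avoid extra dependencies, the same idea works with a plain per-row lookup against whatever country list you have. A sketch; the hand-made known_countries set below stands in for the names from pycountry.countries:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'address': ['Turin, Italy', np.nan, 'Zurich, Switzerland',
                               np.nan, 'Glyfada, Greece']})

# Stand-in for the full pycountry name list
known_countries = {'Italy', 'Switzerland', 'Greece', 'Ireland', 'Norway'}

def extract_country(address):
    if pd.isna(address):
        return np.nan
    # Take the part after the last comma and match it against the known names
    candidate = address.split(',')[-1].strip()
    return candidate if candidate in known_countries else np.nan

df['country'] = df['address'].apply(extract_country)
```

This handles the NaN rows that broke the membership checks above, and leaves NaN for addresses whose last segment isn't a recognised country.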

How to perform groupby and mean on categorical columns in Pandas

I'm working on a dataset called gradedata.csv in Python Pandas where I've created a new binned column called 'Status' as 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is the listing of first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do I compute the mean hours of exercise of female students with a 'status' of passing?
I've used the below code, but it isn't working.
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to Pandas. Could anyone please help me solve this?
You are very close. Note that your groupby key must be one of mapping, function, label, or list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))
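To make those two lines concrete, here is a run on a tiny made-up frame standing in for gradedata.csv:

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'female'],
    'status': ['Pass', 'Pass', 'Fail', 'Pass'],
    'exercise': [3, 4, 2, 4],
})

# Group on both labels at once, then pull out the (female, Pass) cell
res = df.groupby(['gender', 'status'])['exercise'].mean()
query = res.get(('female', 'Pass'))  # mean of 3 and 4 -> 3.5
```

Using .get() rather than res[('female', 'Pass')] means a missing combination returns None instead of raising a KeyError.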
