Find percentile in pandas dataframe based on groups - python

Season Name value
2001 arkansas 3.497
2002 arkansas 3.0935
2003 arkansas 3.3625
2015 arkansas 3.766
2001 colorado 2.21925
2002 colorado 1.4795
2010 colorado 2.89175
2011 colorado 2.48825
2012 colorado 2.08475
2013 colorado 1.68125
2014 colorado 2.5555
2015 colorado 2.48825
In the dataframe above, I want to identify the top and bottom 10 percentile values in the value column for each state (arkansas and colorado). How do I do that? I can identify the top and bottom percentiles for the entire value column like so:
np.searchsorted(np.percentile(a, [10, 90]), a)

You can use groupby + quantile:
df.groupby('Name')['value'].quantile([.1, .9])
Name
arkansas 0.1 3.174200
0.9 3.685300
colorado 0.1 1.620725
0.9 2.656375
Name: value, dtype: float64
And then call np.searchsorted.
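Putting the two together, a minimal sketch (assuming the dataframe is named df, with the Name and value columns shown above):
import numpy as np

# 0 = below the 10th percentile, 1 = in between, 2 = above the 90th,
# computed within each state separately
df['bucket'] = df.groupby('Name')['value'].transform(
    lambda s: np.searchsorted(np.percentile(s, [10, 90]), s)
)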
Alternatively, use qcut. Values outside the 10th–90th percentile band come back as NaN, so the NaN rows in the result below are exactly the top and bottom 10% you're after.
df.groupby('Name').apply(lambda x: pd.qcut(x['value'], [.1, .9]))
Name
arkansas 0 (3.173, 3.685]
1 NaN
2 (3.173, 3.685]
3 NaN
colorado 4 (1.62, 2.656]
5 NaN
6 NaN
7 (1.62, 2.656]
8 (1.62, 2.656]
9 (1.62, 2.656]
10 (1.62, 2.656]
11 (1.62, 2.656]
Name: value, dtype: object

If the variable for your dataframe is df, this should work. I'm not sure what you want your output to look like, so I wrote code that builds a dictionary where each key is a state. Also, since you have very few values, I used the option 'nearest' for the interpolation argument (the default is 'linear'). See the numpy.percentile documentation for the other options.
import pandas as pd
import numpy as np

df = pd.read_csv('stacktest.csv')

# array of unique state names from the dataframe
states = np.unique(df['Name'])

# empty dictionary keyed by state
state_data = dict()
for state in states:
    state_data[state] = np.percentile(df[df['Name'] == state]['value'], [10, 90], interpolation='nearest')

print(state_data)


How can I update a different column at each row index with the loc function?

Datatable:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 2017 NaN
1 NISSAN 2017 NaN
2 HYUNDAI 2017 1.0
3 DODGE 2017 NaN
I want to update more than one row and more than one column at those rows with the loc function,
but when I use the loc function, it writes both new values into both rows.
listcolumns = ['VEHICLE_YEAR', 'NUM_PASSENGERS']
listnewvalue = [16000, 28000]
indexlister = [0, 1]
data.loc[indexlister, listcolumns] = listnewvalue
As you can see in the output below, only row 0's 'VEHICLE_YEAR' should be 16000 and row 1's 'NUM_PASSENGERS' should be 28000, but both columns changed in both rows.
How can I fix this and change only the rows and columns I want? Or do you have a different method? Thank you very much.
output:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 16000 28000.0
1 NISSAN 16000 28000.0
In the expected output below I left the other fields empty so that the new entries stand out. For example, I want to assign the value 2005 to row 0 of the column 'VEHICLE_YEAR' and 2005 to row 1 of the column 'NUM_PASSENGERS'.
The output I want is as follows:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 2005 NaN
1 NISSAN NaN 2005
2 HYUNDAI NaN NaN
The list you're setting the values with needs to match the number of rows and number of columns you've selected with loc: a nested list with one inner list per selected row. If loc receives a single flat list, it assigns that same list to every selected row.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ARAC': ['CHEVROLET', 'NISSAN', 'HYUNDAI', 'DODGE'],
    'VEHICLE_YEAR': [2017, 2017, 2017, 2017],
    'NUM_PASSENGERS': [np.nan, np.nan, 1.0, np.nan]
})
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET NaN 2017
1 NISSAN NaN 2017
2 HYUNDAI 1.0 2017
3 DODGE NaN 2017
df.loc[[0, 2], ['NUM_PASSENGERS', 'VEHICLE_YEAR']] = [[1000, 2014], [3000, 2015]]
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET 1000.0 2014
1 NISSAN NaN 2017
2 HYUNDAI 3000.0 2015
3 DODGE NaN 2017
If you only want to change the values in the NUM_PASSENGERS column, select only that and give it a single list/array, the same length as your row indices.
df.loc[[0, 1, 3], 'NUM_PASSENGERS'] = [10, 20, 30]
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET 10.0 2014
1 NISSAN 20.0 2017
2 HYUNDAI 3000.0 2015
3 DODGE 30.0 2017
The docs might be helpful too. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc
If this didn't answer your question, please provide your expected output.
I solved the problem as follows.
I could not describe the problem exactly (I am still working on that), but when I changed the code this way it worked, and now I can set the row and column I want to the value I want.
listcolumns = ['VEHICLE_YEAR', 'NUM_PASSENGERS']
listnewvalue = [16000, 28000]
indexlister = [0, 1]
for i in range(len(indexlister)):
    df.loc[indexlister[i], listcolumns[i]] = listnewvalue[i]
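For what it's worth, the same pairwise update can also be written with the scalar accessor df.at, which is designed for single-cell assignment (a sketch reusing the lists above):
for idx, col, val in zip(indexlister, listcolumns, listnewvalue):
    # pair each row index with its own column and its own value
    df.at[idx, col] = val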

Get latest value looked up from other dataframe

My first data frame
import pandas as pd

product = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107, 101],
    'Product_name': ['Watch', 'Bag', 'Shoes', 'Smartphone', 'Books', 'Oil', 'Laptop', 'New Watch'],
    'Category': ['Fashion', 'Fashion', 'Fashion', 'Electronics', 'Study', 'Grocery', 'Electronics', 'Electronics'],
    'Price': [299.0, 1350.50, 2999.0, 14999.0, 145.0, 110.0, 79999.0, 9898.0],
    'Seller_City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata', 'Delhi', 'Chennai', 'Bengalore', 'New York']
})
My second data frame has the transactions:
customer = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'name': ['Olivia', 'Aditya', 'Cory', 'Isabell', 'Dominic', 'Tyler', 'Samuel', 'Daniel', 'Jeremy'],
    'age': [20, 25, 15, 10, 30, 65, 35, 18, 23],
    'Product_ID': [101, 0, 106, 0, 103, 104, 0, 0, 107],
    'Purchased_Product': ['Watch', 'NA', 'Oil', 'NA', 'Shoes', 'Smartphone', 'NA', 'NA', 'Laptop'],
    'City': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Chennai', 'Delhi', 'Kolkata', 'Delhi', 'Mumbai']
})
I want Price from the first data frame to appear in the merged dataframe, the common key being 'Product_ID'. Note that product_ID 101 has two prices: 299.00 and 9898.00. I want the latter one, 9898.0, in the merged data set (since it is the latest price).
Currently my code is not giving the right answer; it is returning both:
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp so I assume the index is the order of the dataframe. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Please note the keep='last' argument, since we are keeping only the last price registered.
Deduplication should be done before merging if you care about performance or the dataset is huge:
product = product.drop_duplicates(subset=['Product_ID'], keep='last')
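Putting both steps together, a sketch assuming the same dataframe names as in the question:
# keep only the latest price per product, then merge
latest_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
customerpur = pd.merge(customer, latest_product[['Price', 'Product_ID']], on='Product_ID', how='left')
print(customerpur)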
In your data frame there is no indicator of the latest entry, so you might need to first remove the first entry for id 101 from the product dataframe as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')

Python Pandas fill missing zipcode with values from another dataframe based on conditions

I have a dataset in which I add coordinates to cities based on zip-codes but several of these zip-codes are missing. Also, in some cases cities are missing, states are missing, or both are missing. For example:
ca_df[['OWNER_CITY', 'OWNER_STATE', 'OWNER_ZIP']]
OWNER_CITY OWNER_STATE OWNER_ZIP
495 MIAMI SHORE PA
496 SEATTLE
However, a second dataset has city, state & the matching zip-codes. This one is complete without any missing values.
df_coord.head()
OWNER_ZIP CITY STATE
0 71937 Cove AR
1 72044 Edgemont AR
2 56171 Sherburn MN
I want to fill in the missing zip-codes in the first dataframe if:
Zip-code is empty
City is present
State is present
This is an all-or-nothing operation: either all three criteria are met and the zip-code gets filled, or nothing changes.
However, this is a fairly large dataset with > 50 million records so ideally I want to vectorize the operation by working column-wise.
Technically, that would fit np.where, but as far as I know, np.where only takes one condition in the following format:
df1['OWNER_ZIP'] = np.where(df["cond"] ==X, df_coord['OWNER_ZIP'], "")
How do I ensure I only fill missing zip-codes when all conditions are met?
Given ca_df:
OWNER_CITY OWNER_STATE OWNER_ZIP
0 Miami Shore Florida 111
1 Los Angeles California NaN
2 Houston NaN NaN
and df_coord:
OWNER_ZIP CITY STATE
0 111 Miami Shore Florida
1 222 Los Angeles California
2 333 Houston Texas
You can use pd.notna along with the DataFrame index like this:
inferrable_zips_df = pd.notna(ca_df["OWNER_CITY"]) & pd.notna(ca_df["OWNER_STATE"])
is_inferrable_zip = ca_df.index.isin(df_coord[inferrable_zips_df].index)
ca_df.loc[is_inferrable_zip, "OWNER_ZIP"] = df_coord["OWNER_ZIP"]
with ca_df resulting as:
OWNER_CITY OWNER_STATE OWNER_ZIP
0 Miami Shore Florida 111
1 Los Angeles California 222
2 Houston NaN NaN
I've changed the "" to np.nan, but if you still wish to use "" then you just need to change pd.notna(ca_df[...]) to ca_df[...] == "".
You can combine multiple rules inside a single numpy.where call with the element-wise & operator (plain Python and does not combine boolean arrays). This should give you the array of row indices which satisfy all three rules:
np.where((df["OWNER_ZIP"] == X) & (df["CITY"] == Y) & (df["STATE"] == Z))
Use:
print (df_coord)
OWNER_ZIP CITY STATE
0 71937 Cove AR
1 72044 Edgemont AR
2 56171 Sherburn MN
3 123 MIAMI SHORE PA
4 789 SEATTLE AA
print (ca_df)
OWNER_ZIP OWNER_CITY OWNER_STATE
0 NaN NaN NaN
1 72044 Edgemont AR
2 56171 NaN MN
3 NaN MIAMI SHORE PA
4 NaN SEATTLE NaN
First, it is necessary to check that the matching columns have the same dtypes:
#or convert ca_df['OWNER_ZIP'] to integers
df_coord['OWNER_ZIP'] = df_coord['OWNER_ZIP'].astype(str)
print (df_coord.dtypes)
OWNER_ZIP object
CITY object
STATE object
dtype: object
print (ca_df.dtypes)
OWNER_ZIP object
OWNER_CITY object
OWNER_STATE object
dtype: object
Then filter each combination of missing and non-missing values, add the new data by merge, set the merged frame's index to match the filtered rows, and assign back:
mask1 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].isna()
df1 = (ca_df[mask1]
       .drop('OWNER_ZIP', axis=1)
       .merge(df_coord.rename(columns={'CITY': 'OWNER_CITY', 'STATE': 'OWNER_STATE'}))
       .set_index(ca_df.index[mask1]))
ca_df.loc[mask1, ['OWNER_ZIP', 'OWNER_CITY', 'OWNER_STATE']] = df1
mask2 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].isna() & ca_df['OWNER_ZIP'].isna()
df2 = (ca_df[mask2]
       .drop(['OWNER_ZIP', 'OWNER_STATE'], axis=1)
       .merge(df_coord.rename(columns={'CITY': 'OWNER_CITY', 'STATE': 'OWNER_STATE'}))
       .set_index(ca_df.index[mask2]))
ca_df.loc[mask2, ['OWNER_ZIP', 'OWNER_CITY', 'OWNER_STATE']] = df2
mask3 = ca_df['OWNER_CITY'].isna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].notna()
df3 = (ca_df[mask3]
       .drop(['OWNER_CITY'], axis=1)
       .merge(df_coord.rename(columns={'CITY': 'OWNER_CITY', 'STATE': 'OWNER_STATE'}))
       .set_index(ca_df.index[mask3]))
ca_df.loc[mask3, ['OWNER_ZIP', 'OWNER_CITY', 'OWNER_STATE']] = df3
print (ca_df)
OWNER_ZIP OWNER_CITY OWNER_STATE
0 NaN NaN NaN
1 72044 Edgemont AR
2 56171 Sherburn MN
3 123 MIAMI SHORE PA
4 789 SEATTLE AA
You can do a left join on these dataframes, joining on the columns for city and state. That gives you the zip-code corresponding to a city and state whenever both values are non-null in the first dataframe (OWNER_CITY, OWNER_STATE, OWNER_ZIP), and since it is a left join, it also preserves the rows that either don't have a zip-code or have null/empty city and state values.
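A sketch of that left-join approach, assuming the column names from the question; the '_looked_up' suffix is a name introduced here for illustration:
merged = ca_df.merge(
    df_coord.rename(columns={'CITY': 'OWNER_CITY', 'STATE': 'OWNER_STATE'}),
    on=['OWNER_CITY', 'OWNER_STATE'],
    how='left',
    suffixes=('', '_looked_up'),
)
# fill OWNER_ZIP only where it was missing and the lookup found a match
merged['OWNER_ZIP'] = merged['OWNER_ZIP'].fillna(merged['OWNER_ZIP_looked_up'])
merged = merged.drop(columns='OWNER_ZIP_looked_up')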

Calculating new rows in a Pandas Dataframe on two different columns

So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows for each country where the year increases by 20 and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values, though; I just want to append the new values.
This is how the dataframe looks (sample data below):
Preferably I would also want to create a loop that runs the code a certain number of times
Super grateful for any help!
If you need to copy the values from the dataframe as an example you can have it here:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns
If you want to repeat the rows, I'd create a new dataframe, perform any operation on it (add 20 years, multiply the temperature by a constant or an array, etc.), and then use concat() to append it to the original dataframe:
import pandas as pd

tempChange = 1.15
data = {
    'Country': ['Afghanistan', 'Africa', 'Albania', 'Algeria', 'American Samoa'],
    'avgTemp': [14, 24, 13, 23, 27],
    'Year': [2012, 2012, 2012, 2012, 2012],
}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp'] * tempChange
df_2['Year'] = df['Year'] + 20
df = pd.concat([df, df_2])  # ignore_index=True if you wish to not repeat the index values
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032
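To run the projection a certain number of times, as asked, one option is a small loop around the same idea; n_steps is a hypothetical choice here:
n_steps = 3  # hypothetical number of 20-year projections
frames = [pd.DataFrame(data)]  # start again from the original data
for _ in range(n_steps):
    step = frames[-1].copy()
    step['avgTemp'] = step['avgTemp'] * tempChange
    step['Year'] = step['Year'] + 20
    frames.append(step)
result = pd.concat(frames, ignore_index=True)
print(result)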
where df is your data frame name:
df['tempChange'] = df['year'] + 20 * df['avgTemp']
This will add a new column to your df with the logic above. I'm not sure if I understood your logic correctly, so the math may need some work.
I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20, axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp'] * tempChange, axis=1)
This is how you apply a function to each row.
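The same columns can be computed without apply, since whole-column arithmetic is usually faster (a sketch using the same names):
dfName['newYear'] = dfName['year'] + 20
dfName['tempDiff'] = dfName['avgTemp'] * tempChange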

Extract country name from text in column to create another column

I have tried different combinations to extract the country names from a column and create a new column with solely the countries. I can do it for selected rows, e.g. df.address[9998], but not for the whole column.
import pycountry
Cntr = []
for country in pycountry.countries:
    for country.name in df.address:
        Cntr.append(country.name)
Any ideas what is going wrong here?
edit:
address is an object column in the df, and df.address[:10] looks like this:
0 Turin, Italy
1 NaN
2 Zurich, Switzerland
3 NaN
4 Glyfada, Greece
5 Frosinone, Italy
6 Dublin, Ireland
7 NaN
8 Turin, Italy
9 Kristiansand, Norway
Name: address, Length: 10, dtype: object
Based on Petar's response: when I run individual queries I get the country correctly, but when I try to create a column with all the countries (or ranges like df.address[:5]) I get an empty Cntr.
import pycountry
Cntr = []
for country in pycountry.countries:
    if country.name in df['address'][1]:
        Cntr.append(country.name)
Cntr
Returns
['Italy']
while the same check against df.address[2] returns [], and so on.
I have also run
df['address'] = df['address'].astype('str')
to make sure that there are no floats or ints in the column.
Sample dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame({'address': ['Turin, Italy', np.nan, 'Zurich, Switzerland', np.nan, 'Glyfada, greece']})
df[['city', 'country']] = df['address'].str.split(',', expand=True, n=1)
address city country
0 Turin, Italy Turin Italy
1 NaN NaN NaN
2 Zurich, Switzerland Zurich Switzerland
3 NaN NaN NaN
4 Glyfada, greece Glyfada greece
You were really close, but we cannot loop like this: for country.name in df.address. Instead:
import pycountry
Cntr = []
for country in pycountry.countries:
    if country.name in df.address:
        Cntr.append(country.name)
If this does not work, please supply more information because I am unsure what df.address looks like.
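If the goal is substring matching per row (so a NaN address yields None instead of an error), a sketch along those lines:
import pycountry

country_names = [c.name for c in pycountry.countries]

def find_country(address):
    # return the first pycountry name found inside the address, else None
    if not isinstance(address, str):
        return None
    for name in country_names:
        if name in address:
            return name
    return None

df['country'] = df['address'].apply(find_country)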
You can use the function clean_country() from the library DataPrep. Install it with pip install dataprep.
import numpy as np
import pandas as pd
from dataprep.clean import clean_country

df = pd.DataFrame({"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, Greece"]})
df2 = clean_country(df, "address")
df2
address address_clean
0 Turin, Italy Italy
1 NaN NaN
2 Zurich, Switzerland Switzerland
3 NaN NaN
4 Glyfada, Greece Greece
