I have a data frame with a MultiIndex (panel data: county × year), and for each group (county) I would like to divide every row's values by the row for a specific year.
>>> fields
Out[39]: ['emplvl', 'population', 'estab', 'estab_pop', 'emp_pop']
>>> df[fields]
Out[40]:
emplvl population estab estab_pop emp_pop
county year
1001 2003 11134.500000 46800 801.75 0.017131 0.237917
2004 11209.166667 48366 824.00 0.017037 0.231757
2005 11452.166667 49676 870.75 0.017529 0.230537
2006 11259.250000 51328 862.50 0.016804 0.219359
2007 11403.333333 52405 879.25 0.016778 0.217600
2008 11272.833333 53277 890.25 0.016710 0.211589
2009 11003.833333 54135 877.00 0.016200 0.203267
2010 10693.916667 54632 877.00 0.016053 0.195745
2011 10627.000000 NaN 862.00 NaN NaN
2012 10136.916667 NaN 841.75 NaN NaN
1003 2003 51372.250000 151509 4272.00 0.028196 0.339071
2004 53450.583333 156266 4536.25 0.029029 0.342049
2005 56110.250000 162183 4880.50 0.030093 0.345969
2006 59291.000000 168121 5067.50 0.030142 0.352669
2007 62600.083333 172404 5337.25 0.030958 0.363101
2008 62611.500000 175827 5529.25 0.031447 0.356097
2009 58947.666667 179406 5273.75 0.029396 0.328571
2010 58139.583333 183195 5171.25 0.028228 0.317364
2011 59581.000000 NaN 5157.75 NaN NaN
2012 60440.250000 NaN 5171.75 NaN NaN
The rows to divide by:
>>> df[fields].loc[df.index.get_level_values('year') == 2007, fields]
Out[32]:
emplvl population estab estab_pop emp_pop
county year
1001 2007 11403.333333 52405 879.25 0.016778 0.217600
1003 2007 62600.083333 172404 5337.25 0.030958 0.363101
However, both
df[fields].div(df.loc[df.index.get_level_values('year') == 2007, fields], axis=0)
df[fields].div(df.loc[df.index.get_level_values('year') == 2007, fields], axis=1)
give me a data frame full of NaN, probably because pandas is aligning on the full (county, year) index and finds no matching rows to divide by.
To compensate for this, I also tried
df[fields].div(df.loc[df.index.get_level_values('year') == 2007, fields].values)
which gives me ValueError: Shape of passed values is (5, 2), indices imply (5, 20).
I think you can reset_index with df1 and then use div:
fields = ['emplvl', 'population', 'estab', 'estab_pop', 'emp_pop']
df1 = df.loc[df.index.get_level_values('year') == 2007, fields].reset_index(level=1)
print(df1)
year emplvl population estab estab_pop emp_pop
county
1001 2007 11403.333333 52405.0 879.25 0.016778 0.217600
1003 2007 62600.083333 172404.0 5337.25 0.030958 0.363101
print(df.div(df1[fields], axis=0))
emplvl population estab estab_pop emp_pop
county year
1001 2003 0.976425 0.893045 0.911857 1.021039 1.093369
2004 0.982973 0.922927 0.937162 1.015437 1.065060
2005 1.004282 0.947925 0.990333 1.044761 1.059453
2006 0.987365 0.979449 0.980950 1.001550 1.008084
2007 1.000000 1.000000 1.000000 1.000000 1.000000
2008 0.988556 1.016640 1.012511 0.995947 0.972376
2009 0.964966 1.033012 0.997441 0.965550 0.934131
2010 0.937789 1.042496 0.997441 0.956789 0.899563
2011 0.931920 NaN 0.980381 NaN NaN
2012 0.888943 NaN 0.957350 NaN NaN
1003 2003 0.820642 0.878802 0.800412 0.910782 0.933820
2004 0.853842 0.906394 0.849923 0.937690 0.942022
2005 0.896329 0.940715 0.914422 0.972059 0.952818
2006 0.947139 0.975157 0.949459 0.973642 0.971270
2007 1.000000 1.000000 1.000000 1.000000 1.000000
2008 1.000182 1.019855 1.035974 1.015796 0.980711
2009 0.941655 1.040614 0.988102 0.949545 0.904902
2010 0.928746 1.062591 0.968898 0.911816 0.874038
2011 0.951772 NaN 0.966368 NaN NaN
2012 0.965498 NaN 0.968992 NaN NaN
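On newer pandas versions, the same per-county normalisation can be reproduced without relying on index-name alignment during division, by explicitly broadcasting the base-year rows. A sketch on a toy frame with made-up numbers:

```python
import pandas as pd

# Toy panel in the same shape as the question's frame (numbers are made up).
idx = pd.MultiIndex.from_product([[1001, 1003], [2006, 2007, 2008]],
                                 names=['county', 'year'])
df = pd.DataFrame({'emplvl':     [100.0, 200.0, 300.0, 10.0, 20.0, 40.0],
                   'population': [50.0, 55.0, 60.0, 5.0, 8.0, 10.0]},
                  index=idx)

# Base-year rows, indexed by county only.
base = df.xs(2007, level='year')

# Broadcast each county's 2007 row over all of that county's rows,
# then divide elementwise.
aligned = base.reindex(df.index.get_level_values('county'))
aligned.index = df.index
rel = df / aligned

print(rel)
```

This sidesteps any version-dependent behaviour of aligning a flat county index against a MultiIndex during `div`.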
First, I would suggest combining everything into a single dataframe for this operation. Let's assume its name is df.
You first need to select the row by which all the other rows are to be divided: for each county, this is the row whose year is 2007.
In the following code I loop over the index and columns of the dataframe. The index of the row to divide by is built as reference_index, a tuple of the county name and the year.
All other rows are divided by the row at reference_index first; only at the end is each reference row divided by itself to become 1.
for index in df.index:
    county = index[0]
    # index of the reference row to divide the other rows by
    reference_index = (county, 2007)
    if index != reference_index:
        for column in df.columns:
            df.loc[index, column] = df.loc[index, column] / df.loc[reference_index, column]

# The 2007 rows must also be divided by themselves, but only at the end;
# otherwise they would become 1 before the other rows had used them.
for county in df.index.get_level_values('county').unique():
    df.loc[(county, 2007)] = df.loc[(county, 2007)] / df.loc[(county, 2007)]
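A minimal runnable version of this loop, on a toy two-county frame with made-up numbers:

```python
import pandas as pd

# Tiny two-county frame (made-up numbers) to exercise the loop.
idx = pd.MultiIndex.from_tuples(
    [(1001, 2006), (1001, 2007), (1003, 2006), (1003, 2007)],
    names=['county', 'year'])
df = pd.DataFrame({'emplvl': [100.0, 200.0, 10.0, 20.0]}, index=idx)

for index in df.index:
    reference_index = (index[0], 2007)
    if index != reference_index:
        for column in df.columns:
            df.loc[index, column] = df.loc[index, column] / df.loc[reference_index, column]

# Normalise the 2007 reference rows only after every other row is done.
for county in df.index.get_level_values('county').unique():
    df.loc[(county, 2007)] = df.loc[(county, 2007)] / df.loc[(county, 2007)]

print(df)
```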
Consider the following pandas DataFrame:
       ID country  money  other money_add
   832932  France  12131     19     82932
   217#8#     NaN    NaN    NaN       NaN
   1329T2     NaN    NaN    NaN       NaN
   832932  France    NaN     30       NaN
   31728#     NaN    NaN    NaN       NaN
I would like to make the following modifications for each row:
If the ID value contains a '#', the row is left unchanged.
If the ID value contains no '#' and country is NaN, "Other" is written to the country column and 0 to the other column.
Finally, only if the money column is NaN and the other column has a value, we assign money and money_add from the following lookup table, matching other against other_ID:
other_ID  money  money_add
      19   4532     723823
      50   1213     238232
      18   1813     273283
      30   1313      83293
       0   8932       3920
Example of the resulting table:
       ID country  money  other money_add
   832932  France  12131     19     82932
   217#8#     NaN    NaN    NaN       NaN
   1329T2   Other   8932      0      3920
   832932  France   1313     30     83293
   31728#     NaN    NaN    NaN       NaN
First set both columns by list for the rows that match both conditions, then set the lookup key as the index (masking out the '#' rows) and fill only the missing values in the matched rows with DataFrame.update:
m1 = df['ID'].str.contains('#')
m2 = df['country'].isna()
df.loc[~m1 & m2, ['country','other']] = ['Other',0]
df1 = df1.set_index(df1['other_ID'])
df = df.set_index(df['other'].mask(m1))
df.update(df1, overwrite=False)
df = df.reset_index(drop=True)
print (df)
ID country money other money_add
0 832932 France 12131 19.0 82932.0
1  217#8#     NaN     NaN   NaN       NaN
2 1329T2 Other 8932.0 0.0 3920.0
3 832932 France 1313.0 30.0 83293.0
4 31728# NaN NaN NaN NaN
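The whole recipe can be reproduced end to end with the question's data rebuilt inline (a sketch; the NaN placement is inferred from the expected output, and the dtypes are assumptions):

```python
import pandas as pd
import numpy as np

# The question's two frames rebuilt inline.
df = pd.DataFrame({
    'ID':        ['832932', '217#8#', '1329T2', '832932', '31728#'],
    'country':   ['France', np.nan, np.nan, 'France', np.nan],
    'money':     [12131.0, np.nan, np.nan, np.nan, np.nan],
    'other':     [19.0, np.nan, np.nan, 30.0, np.nan],
    'money_add': [82932.0, np.nan, np.nan, np.nan, np.nan],
})
df1 = pd.DataFrame({
    'other_ID':  [19, 50, 18, 30, 0],
    'money':     [4532, 1213, 1813, 1313, 8932],
    'money_add': [723823, 238232, 273283, 83293, 3920],
})

m1 = df['ID'].str.contains('#')   # rows to leave untouched
m2 = df['country'].isna()

# Fill the country/other defaults for '#'-free rows with missing country.
df.loc[~m1 & m2, ['country', 'other']] = ['Other', 0]

# Align on the lookup key and fill only missing money/money_add values.
df1 = df1.set_index(df1['other_ID'])
df = df.set_index(df['other'].mask(m1))
df.update(df1, overwrite=False)
df = df.reset_index(drop=True)
print(df)
```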
Datatable:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 2017 NaN
1 NISSAN 2017 NaN
2 HYUNDAI 2017 1.0
3 DODGE 2017 NaN
I want to update several rows, each in a different column, with the loc function.
But when I use loc, it writes both new values into both rows.
listcolumns = ['VEHICLE_YEAR', 'NUM_PASSENGERS']
listnewvalue = [16000, 28000]
indexlister = [0, 1]
data.loc[indexlister , listcolumns] = listnewvalue
As you can see in the output below, only row 0's 'VEHICLE_YEAR' should be 16000 and row 1's 'NUM_PASSENGERS' should be 28000, but both columns changed in both rows.
How can I change only the columns and rows I want, or is there a different method? Thank you very much.
output:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 16000 28000.0
1 NISSAN 16000 28000.0
In the printout below I left the other fields empty so that the new entries stand out. For example, I want to assign the value 2005 to row 0 of the column 'VEHICLE_YEAR' and 2005 to row 1 of the column 'NUM_PASSENGERS'.
The output I want is as follows:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 2005 NaN
1 NISSAN NaN 2005
2 HYUNDAI NaN NaN
The list you're setting the values with needs to correspond to the number of rows and number of columns you've selected with loc. If it receives a single list, it will assign all selected rows at those columns to that value.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'ARAC': ['CHEVROLET', 'NISSAN', 'HYUNDAI', 'DODGE'],
    'VEHICLE_YEAR': [2017, 2017, 2017, 2017],
    'NUM_PASSENGERS': [np.nan, np.nan, 1.0, np.nan]
})
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET NaN 2017
1 NISSAN NaN 2017
2 HYUNDAI 1.0 2017
3 DODGE NaN 2017
df.loc[[0, 2], ['NUM_PASSENGERS', 'VEHICLE_YEAR']] = [[1000, 2014], [3000, 2015]]
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET 1000.0 2014
1 NISSAN NaN 2017
2 HYUNDAI 3000.0 2015
3 DODGE NaN 2017
If you only want to change the values in the NUM_PASSENGERS column, select only that and give it a single list/array, the same length as your row indices.
df.loc[[0,1,3], ['NUM_PASSENGERS']] = [10, 20, 30]
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET 10.0 2014
1 NISSAN 20.0 2017
2 HYUNDAI 3000.0 2015
3 DODGE 30.0 2017
The docs might be helpful too. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc
If this didn't answer your question, please provide your expected output.
I solved the problem as follows.
I could not describe the problem exactly and am still working on it, but when I changed the code this way it worked, and now I can set any row and column I want to the value I want.
listcolumns = ['VEHICLE_YEAR', 'NUM_PASSENGERS']
listnewvalue = [16000, 28000]
indexlister = [0, 1]
for i in range(len(indexlister)):
    df.loc[indexlister[i], listcolumns[i]] = listnewvalue[i]
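Fixed up into a runnable snippet (with the sample frame from the earlier answer rebuilt inline), the per-cell loop looks like this:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ARAC': ['CHEVROLET', 'NISSAN', 'HYUNDAI', 'DODGE'],
    'VEHICLE_YEAR': [2017, 2017, 2017, 2017],
    'NUM_PASSENGERS': [np.nan, np.nan, 1.0, np.nan]
})

listcolumns = ['VEHICLE_YEAR', 'NUM_PASSENGERS']
listnewvalue = [16000, 28000]
indexlister = [0, 1]

# Pair each row index with its own column and value, one cell at a time.
for i in range(len(indexlister)):
    df.loc[indexlister[i], listcolumns[i]] = listnewvalue[i]

print(df)
```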
I have 2 dataframes, as given below:
import pandas as pd
restaurant = pd.read_excel("C:/Users/Avinash/Desktop/restaurant data.xlsx")
restaurant
Restaurant           StartYear  Capex  inflation_adjusted_capex
Bawarchi Restaurant       1986   6000                       NaN
Ks Baker's                1988   2000                       NaN
Rajesh Restaurant         1989   1050                       NaN
Ahmed Steak House         1990   9000                       NaN
Absolute Barbique         1997   9500                       NaN
inflation = pd.read_excel("C:/Users/Avinash/Desktop/restaurant data.xlsx", sheet_name="Sheet2")
inflation
Years Inflation_Factor
1985 0.111
1986 0.134
1987 0.191
1988 0.2253
1989 0.265
1990 0.304
Aim: fill "inflation_adjusted_capex" with "Capex" divided by the corresponding year's "Inflation_Factor" from the second dataframe.
The code I wrote is:
for i in restaurant["StartYear"]:
    restaurant["inflation_adjusted_capex"] = \
        (restaurant["inflation_adjusted_capex"]) / (inflation[inflation["Years"] == i]["Inflation_Factor"])
print(restaurant["inflation_adjusted_capex"])
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: Inflation adjusted Capex to current year, dtype: float64
Unfortunately this code returns NaN values; kindly help me. Thanks in advance.
There are a couple ways to do this. The first is to join the dataframes so that you have your inflation factors in the first dataframe, and then do the calculation:
# add Inflation_Factor column to the first dataframe
restaurant = restaurant.merge(inflation, left_on='StartYear', right_on='Years')
# do the division
restaurant['inflation_adjusted_capex'] = restaurant['Capex']/restaurant['Inflation_Factor']
The other is to apply a function that behaves like an excel VLOOKUP:
# set the year as the index of inflation so we can look up based on it
inflation = inflation.set_index('Years')
# look up the inflation factor and divide with a lambda function
restaurant['inflation_adjusted_capex'] = restaurant.apply(
    lambda row: row['Capex']/inflation['Inflation_Factor'][row['StartYear']], axis=1)
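Both approaches can be sketched end to end with the question's data rebuilt inline (the restaurant list is truncated to the years the inflation table covers):

```python
import pandas as pd

# Both frames rebuilt inline from the question.
restaurant = pd.DataFrame({
    'Restaurant': ['Bawarchi Restaurant', "Ks Baker's", 'Rajesh Restaurant'],
    'StartYear': [1986, 1988, 1989],
    'Capex': [6000, 2000, 1050],
})
inflation = pd.DataFrame({
    'Years': [1985, 1986, 1987, 1988, 1989, 1990],
    'Inflation_Factor': [0.111, 0.134, 0.191, 0.2253, 0.265, 0.304],
})

# Way 1: merge the factor in, then divide column by column.
merged = restaurant.merge(inflation, left_on='StartYear', right_on='Years')
merged['inflation_adjusted_capex'] = merged['Capex'] / merged['Inflation_Factor']

# Way 2: VLOOKUP-style, index the factors by year and apply row-wise.
factors = inflation.set_index('Years')['Inflation_Factor']
restaurant['inflation_adjusted_capex'] = restaurant.apply(
    lambda row: row['Capex'] / factors[row['StartYear']], axis=1)

print(merged[['Restaurant', 'inflation_adjusted_capex']])
print(restaurant)
```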
So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is compute, for each country, new rows where 20 is added to the year and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values, though; I just want to append the new values.
This is how the dataframe looks:
Preferably I would also want to create a loop that runs the code a certain number of times
Super grateful for any help!
If you need to copy the values from the dataframe as an example you can have it here:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns
If you want to repeat the rows, I'd create a new dataframe, perform any operation on it (add 20 to the years, multiply the temperature by a constant or an array, etc.), and then use concat() to append it to the original dataframe:
import pandas as pd
tempChange=1.15
data = {'Country':['Afghanistan','Africa','Albania','Algeria','American Samoa'],'avgTemp':[14,24,13,23,27],'Year':[2012,2012,2012,2012,2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp']*tempChange
df_2['Year'] = df['Year']+20
df = pd.concat([df,df_2]) #ignore_index=True if you wish to not repeat the index value
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032
where df is your data frame name:
df['tempChange'] = df['year']+ 20 * df['avgTemp']
This will add a new column to your df with the logic above. I'm not sure I understood your logic correctly, so the math may need some work.
I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20,axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp']*tempChange,axis=1)
This is how you apply a function to each row.
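A minimal runnable sketch of the row-wise apply, with a two-row frame and an assumed tempChange of 1.15:

```python
import pandas as pd

tempChange = 1.15  # assumed multiplier, not given in the question

df = pd.DataFrame({'Country': ['Afghanistan', 'Africa'],
                   'avgTemp': [14.481583, 24.725917],
                   'year': [2012, 2012]})

# Apply a function to each row (axis=1) to derive the new columns.
df['newYear'] = df.apply(lambda x: x['year'] + 20, axis=1)
df['tempDiff'] = df.apply(lambda x: x['avgTemp'] * tempChange, axis=1)
print(df)
```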
Season Name value
2001 arkansas 3.497
2002 arkansas 3.0935
2003 arkansas 3.3625
2015 arkansas 3.766
2001 colorado 2.21925
2002 colorado 1.4795
2010 colorado 2.89175
2011 colorado 2.48825
2012 colorado 2.08475
2013 colorado 1.68125
2014 colorado 2.5555
2015 colorado 2.48825
In the dataframe above, I want to identify top and bottom 10 percentile values in column value for each state (arkansas and colorado). How do I do that? I can identify top and bottom percentile for entire value column like so:
np.searchsorted(np.percentile(a, [10, 90]), a)
You can use groupby + quantile:
df.groupby('Name')['value'].quantile([.1, .9])
Name
arkansas 0.1 3.174200
0.9 3.685300
colorado 0.1 1.620725
0.9 2.656375
Name: value, dtype: float64
And then call np.searchsorted.
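Putting the two steps together per group, for instance (a sketch using the question's data; the bucket helper and column name are mine, with codes 0 = below the 10th percentile, 1 = in between, 2 = above the 90th):

```python
import pandas as pd
import numpy as np

# The question's data.
df = pd.DataFrame({
    'Season': [2001, 2002, 2003, 2015, 2001, 2002, 2010, 2011,
               2012, 2013, 2014, 2015],
    'Name': ['arkansas'] * 4 + ['colorado'] * 8,
    'value': [3.497, 3.0935, 3.3625, 3.766, 2.21925, 1.4795, 2.89175,
              2.48825, 2.08475, 1.68125, 2.5555, 2.48825],
})

# Per state: compare each value against that state's 10th/90th percentiles.
def bucket(s):
    edges = s.quantile([.1, .9]).to_numpy()
    return pd.Series(np.searchsorted(edges, s.to_numpy()), index=s.index)

df['bucket'] = df.groupby('Name')['value'].transform(bucket)
print(df)
```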
Alternatively, use qcut.
df.groupby('Name').apply(lambda x: pd.qcut(x['value'], [.1, .9]))
Name
arkansas 0 (3.173, 3.685]
1 NaN
2 (3.173, 3.685]
3 NaN
colorado 4 (1.62, 2.656]
5 NaN
6 NaN
7 (1.62, 2.656]
8 (1.62, 2.656]
9 (1.62, 2.656]
10 (1.62, 2.656]
11 (1.62, 2.656]
Name: value, dtype: object
If the variable for your dataframe is df, this should work. I'm not sure what you want your output to look like, but I created code for a dictionary where each key is a state. Also, since you have very few values, I used the option 'nearest' for the interpolation argument (the default is 'linear'). To see the possible options, check the documentation for np.percentile.
import pandas as pd
import numpy as np
df = pd.read_csv('stacktest.csv')
#array of unique state names from the dataframe
states = np.unique(df['Name'])
#empty dictionary
state_data = dict()
for state in states:
    state_data[state] = np.percentile(df[df['Name'] == state]['value'], [10, 90], interpolation='nearest')
print(state_data)