I have a Pandas DataFrame with a hierarchical index (MultiIndex). I created this DataFrame by grouping values for "cousub" and "year".
annualMed = df.groupby(["cousub", "year"])[["ratio", "sr_val_transfer"]].median().round(2)
print(annualMed.head(8))
ratio sr_val_transfer
cousub year
Allen Park city 2013 0.51 75000.0
2014 0.47 85950.0
2015 0.47 95030.0
2016 0.45 102500.0
Belleville city 2013 0.49 113900.0
2014 0.55 114750.0
2015 0.53 149000.0
2016 0.48 121500.0
I would like to add an "Overall" value in the "year" level that I could then populate with values based on a grouping of "cousub" alone, i.e., excluding "year". I would like the result to look like the following:
ratio sr_val_transfer
cousub year
Allen Park city 2013 0.51 75000.0
2014 0.47 85950.0
2015 0.47 95030.0
2016 0.45 102500.0
Overall 0.50 90000.0
Belleville city 2013 0.49 113900.0
2014 0.55 114750.0
2015 0.53 149000.0
2016 0.48 121500.0
Overall 0.50 135000.0
How can I add this new item to the "year" level of the MultiIndex?
If you just want to add these two rows explicitly, you can specify all the MultiIndex levels with loc:
df.loc[('Allen Park city', 'Overall'), :] = (0.50, 90000.)
df.loc[('Belleville city', 'Overall'), :] = (0.50, 135000.)
If you had a whole list of cities to add this row for, however, this would get tedious. Instead, you could append another DataFrame with the overall values, with a bit of index manipulation:
(df.reset_index()
.append(pd.DataFrame([['Allen Park city', 'Overall', 0.5, 90000.],
['Belleville city', 'Overall', 0.5, 135000.]],
columns=list(df.index.names) + list(df.columns)))
.set_index(df.index.names)
.sort_index())
Demo
Method 1 (smaller case)
>>> df.loc[('Allen Park city', 'Overall'), :] = (0.50, 90000.)
>>> df.loc[('Belleville city', 'Overall'), :] = (0.50, 135000.)
>>> df.sort_index()
ratio sr_val_transfer
cousub year
Allen Park city 2013 0.51 75000.0
2014 0.47 85950.0
2015 0.47 95030.0
2016 0.45 102500.0
Overall 0.50 90000.0
Belleville city 2013 0.49 113900.0
2014 0.55 114750.0
2015 0.53 149000.0
2016 0.48 121500.0
Overall 0.50 135000.0
Method 2 (larger case)
>>> (df.reset_index()
.append(pd.DataFrame([['Allen Park city', 'Overall', 0.5, 90000.],
['Belleville city', 'Overall', 0.5, 135000.]],
columns=list(df.index.names) + list(df.columns)))
.set_index(df.index.names)
.sort_index())
ratio sr_val_transfer
cousub year
Allen Park city 2013 0.51 75000.0
2014 0.47 85950.0
2015 0.47 95030.0
2016 0.45 102500.0
Overall 0.50 90000.0
Belleville city 2013 0.49 113900.0
2014 0.55 114750.0
2015 0.53 149000.0
2016 0.48 121500.0
Overall 0.50 135000.0
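If you don't want to hardcode the overall numbers at all, you could also compute them from the original ungrouped frame and concatenate; a sketch, assuming df is the raw data from the question (not the grouped annualMed):

# Overall medians per cousub, ignoring year
overall = df.groupby("cousub")[["ratio", "sr_val_transfer"]].median().round(2)
overall["year"] = "Overall"
overall = overall.set_index("year", append=True)

# Stack the per-year and overall rows into one MultiIndexed frame.
# If the year level is numeric, mixing in the string "Overall" can make
# sort_index fail under Python 3; cast the years to str first in that case.
result = pd.concat([annualMed, overall]).sort_index()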
Related
I need to group/merge the rows for each city of the same name and calculate its overall percentage, to see which city has the lowest literacy rate (%).
Code:
import pandas as pd
df = pd.DataFrame({'Cities': ["Cape Town", "Cape Town", "Cape Town", "Tokyo", "Cape Town", "Tokyo", "Mumbai", "Belgium", "Belgium" ],
'LiteracyRate': [0.05, 0.35, 0.2, 0.11, 0.15, 0.2, 0.65, 0.35, 0.45]})
print(df)
For example:
Cities LiteracyRate
0 Cape Town 0.05
1 Cape Town 0.35
2 Cape Town 0.2
3 Tokyo 0.11
4 Cape Town 0.15
5 Tokyo 0.2
6 Mumbai 0.65
7 Belgium 0.35
8 Belgium 0.45
I'm expecting this:
Cities LiteracyRate %LiteracyRate
0 Cape Town 0.75 75
1 Tokyo 0.31 31
2 Mumbai 0.65 65
3 Belgium 0.8 80
So I tried the code below, but it's not giving me the desired results: the cities with the same name are still not merged, and the percentages aren't right.
# Calculate the percentage
df["%LiteracyRate"] = (df["LiteracyRate"]/df["LiteracyRate"].sum())*100
# Show the DataFrame
print(df)
You can use groupby() in pandas to join cities with the same name, and sum() to compute the totals:
df = df.groupby('Cities').sum()
Then you can format the results using:
df['%LiteracyRate'] = (df['LiteracyRate']*100).round().astype(int)
df = df.reset_index()
To sort them by literacy rate, you can:
df = df.sort_values(by='%LiteracyRate')
df = df.reset_index()
Hope this helps!
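Putting the pieces together, a minimal end-to-end sketch (sort=False keeps the cities in their original order, matching the expected output):

import pandas as pd

df = pd.DataFrame({'Cities': ["Cape Town", "Cape Town", "Cape Town", "Tokyo", "Cape Town",
                              "Tokyo", "Mumbai", "Belgium", "Belgium"],
                   'LiteracyRate': [0.05, 0.35, 0.2, 0.11, 0.15, 0.2, 0.65, 0.35, 0.45]})

# Sum the rates per city, keeping first-appearance order
out = df.groupby('Cities', sort=False, as_index=False)['LiteracyRate'].sum()
out['%LiteracyRate'] = (out['LiteracyRate'] * 100).round().astype(int)
print(out)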
I have two dataframes of unequal lengths. I want to combine them with a condition.
If two rows of df1 are identical, then they must share the same value from df2 (without changing the order).
import pandas as pd
d = {'country': ['France', 'France','Japan','China', 'China','Canada','Canada','India']}
df1 = pd.DataFrame(data=d)
I={'conc': [0.30, 0.25, 0.21, 0.37, 0.15]}
df2 = pd.DataFrame(data=I)
dfc=pd.concat([df1,df2], axis=1)
My output:
country conc
0 France 0.30
1 France 0.25
2 Japan 0.21
3 China 0.37
4 China 0.15
5 Canada NaN
6 Canada NaN
7 India NaN
Expected output:
country conc
0 France 0.30
1 France 0.30
2 Japan 0.25
3 China 0.21
4 China 0.21
5 Canada 0.37
6 Canada 0.37
7 India 0.15
You need to create a link between the values and the countries first.
df2["country"] = df1["country"].unique()
Then you can use it to merge with your original dataframe:
pd.merge(df1, df2, on="country")
But be aware that this only works as long as the number of values matches the number of unique countries and their order is as expected.
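A minimal end-to-end sketch of this approach, using the data from the question:

import pandas as pd

df1 = pd.DataFrame({'country': ['France', 'France', 'Japan', 'China',
                                'China', 'Canada', 'Canada', 'India']})
df2 = pd.DataFrame({'conc': [0.30, 0.25, 0.21, 0.37, 0.15]})

# One conc value per distinct country, in order of first appearance
df2["country"] = df1["country"].unique()

# Repeat each value for every occurrence of its country
dfc = pd.merge(df1, df2, on="country")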
I'd construct the dataframe directly, without intermediate dfs.
d = {'country': ['France', 'France','Japan','China', 'China','Canada','Canada','India']}
I = {'conc': [0.30, 0.25, 0.21, 0.37, 0.15]}
c = 'country'
dfc = (pd.DataFrame(I, index=pd.Index(pd.unique(d[c]), name=c))
         .reindex(d[c])
         .reset_index())
So I have data like this (see the text data below):
I want to group the rows and sum the values of Month 0 - Month 3; I can achieve that using pandas groupby.
The problem is that the End date column has different values within a group, and I want to take the latest date in the column. For this example, that means I want the End date column to have the value 2020-09-25.
How do I do this with pandas groupby? For your convenience, the variable for the columns names are below:
details_columns = [ "Person Name", "Bill rate", "Project ERP","Status", "Assignment", "Engagement Code", "End date"]
sum_columns = ["Month 0", "Month 1", "Month 2", "Month 3"]
I need the return value to be a DataFrame. Hoping anyone can help, thanks!
Text data:
Person Name Bill rate Project ERP Status Assignment Engagement Code End date Current Month U% Month 1 U% Month 2 U% Month 3 U%
John Doe 3500000 0.58 Chargeable - Standard Project A 21572323 2020-08-22 0 0.5 0.3 0.2
John Doe 3500000 0.58 Chargeable - Standard Project A 21572323 2020-05-22 0.4 0.25 0 0
John Doe 3500000 0.45 Chargeable - Standard Project B 21579528 2020-09-25 0 0.7 0.7 0.7
John Doe 3500000 0.45 Chargeable - Standard Project B 21579528 2020-05-22 0.2 0.12 0 0
John Doe 3500000 0.45 Chargeable - Standard Project B 21579528 2020-04-03 0.1 0 0 0
Create a dictionary d that maps each sum column to 'sum' and the End date column to 'max', then aggregate with GroupBy.agg; finally, DataFrame.reindex restores the same column order as the original DataFrame:
cols = ["Person Name", "Bill rate", "Project ERP","Status", "Assignment","Engagement Code"]
sum_columns = ["Current Month U%", "Month 1 U%", "Month 2 U%","Month 3 U%"]
d = dict.fromkeys(sum_columns, 'sum')
d["End date"] = 'max'
df1 = df.groupby(cols, as_index=False).agg(d).reindex(df.columns, axis=1)
print(df1)
Person Name Bill rate Project ERP Status Assignment \
0 John Doe 3500000 0.45 Chargeable - Standard Project B
1 John Doe 3500000 0.58 Chargeable - Standard Project A
Engagement Code End date Current Month U% Month 1 U% Month 2 U% \
0 21579528 2020-09-25 0.3 0.82 0.7
1 21572323 2020-08-22 0.4 0.75 0.3
Month 3 U%
0 0.7
1 0.2
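One caveat: if End date is stored as strings, 'max' compares them lexicographically. For ISO-formatted dates like these that coincides with chronological order, but parsing to real datetimes first is safer; a small sketch under that assumption:

# Compare real datetimes rather than strings before aggregating
df["End date"] = pd.to_datetime(df["End date"])
df1 = df.groupby(cols, as_index=False).agg(d).reindex(df.columns, axis=1)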
I extracted the data from a webpage but would like to arrange it into a pandas dataframe table.
import requests
from lxml import html

finviz = requests.get('https://finviz.com/screener.ashx?v=152&o=ticker&c=0,1,2,3,4,5,6,7,10,11,12,14,16,17,19,21,22,23,24,25,31,32,33,38,41,48,65,66,67&r=1')
finz = html.fromstring(finviz.content)
col = finz.xpath('//table/tr/td[@class="table-top"]/text()')
data = finz.xpath('//table/tr/td/a[@class="screener-link"]/text()')
col holds the column names for the pandas dataframe, and each group of 28 data points in the data list should become one row: points 1 to 28 in the first row, points 29 to 56 in the second, and so forth. How can I write this code elegantly?
datalist = []
for y in range(28):
    datalist.append(data[y])
>>> datalist
['1', 'Agilent Technologies, Inc.', 'Healthcare', 'Medical Laboratories & Research', 'USA', '23.00B', '29.27', '4.39', '4.53', '18.76', '1.02%', '5.00%', '5.70%', '324.30M', '308.52M', '2.07', '8.30%', '15.70%', '14.60%', '1.09', '1,775,149', '2', 'Alcoa Corporation', 'Basic Materials', 'Aluminum', 'USA', '1.21B', '-']
But the result is not in table form like a dataframe.
Pandas has a function to parse HTML: pd.read_html
You can try the following:
# Modules
import pandas as pd
import requests
# HTML content
finviz = requests.get('https://finviz.com/screener.ashx?v=152&o=ticker&c=0,1,2,3,4,5,6,7,10,11,12,14,16,17,19,21,22,23,24,25,31,32,33,38,41,48,65,66,67&r=1')
# Convert to dataframe
df = pd.read_html(finviz.content)[-2]
# Set 1st row to columns names
df.columns = df.iloc[0]
# Drop 1st row
df = df.drop(df.index[0])
# df = df.set_index('No.')
print(df)
# 0 No. Ticker Company Sector Industry Country ... Debt/Eq Profit M Beta Price Change Volume
# 1 1 A Agilent Technologies, Inc. Healthcare Medical Laboratories & Research USA ... 0.51 14.60 % 1.20 72.47 - 0.28 % 177333
# 2 2 AA Alcoa Corporation Basic Materials Aluminum USA ... 0.44 - 10.80 % 2.03 6.28 3.46 % 3021371
# 3 3 AAAU Perth Mint Physical Gold ETF Financial Exchange Traded Fund USA ... - - - 16.08 - 0.99 % 45991
# 4 4 AACG ATA Creativity Global Services Education & Training Services China ... 0.02 - 2.96 0.95 - 0.26 % 6177
# 5 5 AADR AdvisorShares Dorsey Wright ADR ETF Financial Exchange Traded Fund USA ... - - - 40.80 0.22 % 1605
# 6 6 AAL American Airlines Group Inc. Services Major Airlines USA ... - 3.70 % 1.83 12.81 4.57 % 16736506
# 7 7 AAMC Altisource Asset Management Corporation Financial Asset Management USA ... - -17.90 % 0.78 12.28 0.00 % 0
# 8 8 AAME Atlantic American Corporation Financial Life Insurance USA ... 0.28 - 0.40 % 0.29 2.20 3.29 % 26
# 9 9 AAN Aaron's, Inc. Services Rental & Leasing Services USA ... 0.20 0.80 % 1.23 22.47 - 0.35 % 166203
# 10 10 AAOI Applied Optoelectronics, Inc. Technology Semiconductor - Integrated Circuits USA ... 0.49 - 34.60 % 2.02 7.80 2.63 % 61303
# 11 11 AAON AAON, Inc. Industrial Goods General Building Materials USA ... 0.02 11.40 % 0.88 48.60 0.71 % 20533
# 12 12 AAP Advance Auto Parts, Inc. Services Auto Parts Stores USA ... 0.21 5.00 % 1.04 95.94 - 0.58 % 165445
# 13 13 AAPL Apple Inc. Consumer Goods Electronic Equipment USA ... 1.22 21.50 % 1.19 262.39 2.97 % 11236642
# 14 14 AAT American Assets Trust, Inc. Financial REIT - Retail USA ... 1.03 12.50 % 0.99 25.35 2.78 % 30158
# 15 15 AAU Almaden Minerals Ltd. Basic Materials Gold Canada ... 0.04 - 0.53 0.28 - 1.43 % 34671
# 16 16 AAWW Atlas Air Worldwide Holdings, Inc. Services Air Services, Other USA ... 1.33 - 10.70 % 1.65 22.79 2.70 % 56521
# 17 17 AAXJ iShares MSCI All Country Asia ex Japan ETF Financial Exchange Traded Fund USA ... - - - 60.13 1.18 % 161684
# 18 18 AAXN Axon Enterprise, Inc. Industrial Goods Aerospace/Defense Products & Services USA ... 0.00 0.20 % 0.77 71.11 2.37 % 187899
# 19 19 AB AllianceBernstein Holding L.P. Financial Asset Management USA ... 0.00 89.60 % 1.35 19.15 1.84 % 54588
# 20 20 ABB ABB Ltd Industrial Goods Diversified Machinery Switzerland ... 0.67 5.10 % 1.10 17.44 0.52 % 723739
# [20 rows x 29 columns]
I'll leave it to you to make the data selection more robust in case the HTML page structure changes! The parent div id might be useful.
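As a side note, read_html can promote the first table row to the header itself via its header parameter, which replaces the two manual steps above; a sketch:

# header=0 uses the first row of each parsed table as column names
df = pd.read_html(finviz.content, header=0)[-2]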
Explanation of [-2]: read_html returns a list of DataFrames:
list_df = pd.read_html(finviz.content)
print(type(list_df))
# <class 'list'>
# Element types in the list
print(type(list_df[0]))
# <class 'pandas.core.frame.DataFrame'>
So in order to get the desired dataframe, I select the 2nd element from the end with [-2]. This discussion explains negative indexes.
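Alternatively, sticking with the original XPath extraction, you could reshape the flat data list into rows of len(col) and build the frame directly; a sketch, assuming col and data were parsed as in the question:

# Slice the flat list into consecutive rows, one table-width each
n = len(col)
rows = [data[i:i + n] for i in range(0, len(data), n)]
df = pd.DataFrame(rows, columns=col)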
How do I drop rows whose industry value is unique (appears only once)? They are interfering with groupby and qcut.
df0 = psql.read_frame(sql_query,conn)
df = df0.sort_values(['industry', 'C'], ascending=[False, True])
Here is my dataframe:
id industry C
5 28 other industry 0.22
9 32 Specialty Eateries 0.60
10 33 Restaurants 0.84
1 22 Processed & Packaged Goods 0.07
0 21 Processed & Packaged Goods 0.14
8 31 Processed & Packaged Goods 0.43
11 34 Major Integrated Oil & Gas 0.07
14 37 Major Integrated Oil & Gas 0.50
15 38 Independent Oil & Gas 0.06
18 41 Independent Oil & Gas 0.06
19 42 Independent Oil & Gas 0.13
12 35 Independent Oil & Gas 0.43
16 39 Independent Oil & Gas 0.65
17 40 Independent Oil & Gas 0.91
13 36 Independent Oil & Gas 2.25
2 25 Food - Major Diversified 0.35
3 26 Beverages - Soft Drinks 0.54
4 27 Beverages - Soft Drinks 0.73
6 29 Beverages - Brewers 0.19
7 30 Beverages - Brewers 0.21
And I've used the following pandas and qcut code to rank column 'C', which sadly blew up on me.
df['rank'] = df.groupby(['industry'])['C'].transform(lambda x: pd.qcut(x,5, labels=range(1,6)))
After researching a bit, the reason qcut threw errors is the unique values in the industry column (see reference to error and another ref to err).
Still, I would like to be able to rank without throwing out the unique rows (they should be assigned a rank of 1) if that is possible. But after many tries, I am convinced that qcut can't handle them, so I am willing to settle for dropping the unique rows to keep qcut happy.
But if there is another way, I'm very curious to know. I really appreciate your help.
Just in case anyone still wants to do this: you should be able to do it by selecting only the duplicated rows.
df = df[df['industry'].duplicated(keep=False)]
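If you'd rather keep the unique industries and give them a rank of 1 instead of dropping them, a minimal sketch that works on the sample data above (very small groups can still trip qcut if quantile edges collide):

import pandas as pd

def rank_group(x):
    # qcut cannot bin a single value; assign rank 1 to one-row groups
    if len(x) < 2:
        return pd.Series(1, index=x.index)
    return pd.qcut(x, 5, labels=range(1, 6))

df['rank'] = df.groupby('industry')['C'].transform(rank_group)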