I have a pandas df built from the following input data:
[{'Region/Province': 'PHILIPPINES', 'Commodity': 'Atis [Sugarapple]', '2018 January': '..', '2018 February': '..'}, {'Region/Province': 'PHILIPPINES', 'Commodity': 'Avocado', '2018 January': '..', '2018 February': '..'}, {'Region/Province': 'PHILIPPINES', 'Commodity': 'Banana Bungulan, green', '2018 January': '12.57', '2018 February': '12.48'}, {'Region/Province': 'PHILIPPINES', 'Commodity': 'Banana Cavendish', '2018 January': '9.96', '2018 February': '8.8'}]
The columns after Commodity follow the pattern 2018 January, 2018 February, ... 2018 Annual, and continue that way all the way up to 2021.
But I need it like this: repeated Commodity names, split by year/month, with the Amount as its own column. I've tried pd.wide_to_long() and it's close to what I need, but the years become their own columns.
Any help is much appreciated
Try stack with str.split:
stacked = (
    df.set_index(['Region/Province', 'Commodity'])
      .stack()                     # one row per original (row, month-column) pair
      .reset_index(name='Amount')  # the old column labels land in 'level_2'
)
# split '2018 January' into separate Year and Month columns
stacked[['Year', 'Month']] = stacked['level_2'].str.split(expand=True)
stacked = stacked.drop('level_2', axis=1)
stacked:
Region/Province Commodity Amount Year Month
0 PHILIPPINES Atis [Sugarapple] .. 2018 January
1 PHILIPPINES Atis [Sugarapple] .. 2018 February
2 PHILIPPINES Avocado .. 2018 January
3 PHILIPPINES Avocado .. 2018 February
4 PHILIPPINES Banana Bungulan, green 12.57 2018 January
5 PHILIPPINES Banana Bungulan, green 12.48 2018 February
6 PHILIPPINES Banana Cavendish 9.96 2018 January
7 PHILIPPINES Banana Cavendish 8.8 2018 February
or melt and str.split:
melt = df.melt(['Region/Province', 'Commodity'], value_name='Amount')
# the old column labels land in 'variable'; split them into Year and Month
melt[['Year', 'Month']] = melt['variable'].str.split(expand=True)
melt = melt.drop('variable', axis=1)
melt:
Region/Province Commodity Amount Year Month
0 PHILIPPINES Atis [Sugarapple] .. 2018 January
1 PHILIPPINES Avocado .. 2018 January
2 PHILIPPINES Banana Bungulan, green 12.57 2018 January
3 PHILIPPINES Banana Cavendish 9.96 2018 January
4 PHILIPPINES Atis [Sugarapple] .. 2018 February
5 PHILIPPINES Avocado .. 2018 February
6 PHILIPPINES Banana Bungulan, green 12.48 2018 February
7 PHILIPPINES Banana Cavendish 8.8 2018 February
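In either result, note that the source data uses '..' as a placeholder for missing values, so Amount comes out as strings. A minimal follow-up sketch, assuming '..' (and anything else non-numeric) should become NaN:
# coerce placeholder strings to NaN and get a numeric column (same idea for stacked)
melt['Amount'] = pd.to_numeric(melt['Amount'], errors='coerce')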
Related
I am somewhat new to coding in Pandas and I have what I think is a simple problem that I can't find an answer to. I have a list of students, the college they went to, and what year they entered college.
Name     College    Year
Mary     Princeton  2017
Joe      Harvard    2018
Bill     Princeton  2016
Louise   Princeton  2020
Michael  Harvard    2019
Penny    Yale       2018
Harry    Yale       2015
I need the data ordered by year but grouped by college. If I sort by year alone, the years are in order but the colleges are scattered; if I sort by college alone, the colleges are together but alphabetical and the years are out of order. Sorting by year then college still splits the colleges, and sorting by college then year doesn't guarantee that the group with the most recent year comes first. What I want the table to look like is:
Name     College    Year
Louise   Princeton  2020
Mary     Princeton  2017
Bill     Princeton  2016
Michael  Harvard    2019
Joe      Harvard    2018
Penny    Yale       2018
Harry    Yale       2015
So we see Princeton is first because it has the most recent year (2020), with all the Princeton rows kept together. Then Harvard is next, because its most recent year (2019) is later than Yale's (2018). Yale comes last, since 2020 > 2019 > 2018. I appreciate all your ideas and help! Thank you!
Add a temporary extra column with the max year per group and sort on multiple columns:
out = (df
       # most recent year within each college, broadcast to every row
       .assign(max_year=df.groupby('College')['Year'].transform('max'))
       # newest group first, colleges kept together, newest year first within each group
       .sort_values(by=['max_year', 'College', 'Year'], ascending=[False, True, False])
       .drop(columns='max_year')
)
output:
Name College Year
3 Louise Princeton 2020
0 Mary Princeton 2017
2 Bill Princeton 2016
4 Michael Harvard 2019
1 Joe Harvard 2018
5 Penny Yale 2018
6 Harry Yale 2015
with temporary column:
Name College Year max_year
3 Louise Princeton 2020 2020
0 Mary Princeton 2017 2020
2 Bill Princeton 2016 2020
4 Michael Harvard 2019 2019
1 Joe Harvard 2018 2019
5 Penny Yale 2018 2018
6 Harry Yale 2015 2018
You first sort by "College" and then "Year", then keep the "College" values together using .groupby:
import pandas as pd
data = [
["Mary", "Princeton", 2017],
["Joe", "Harvard", 2018],
["Bill", "Princeton", 2016],
["Louise", "Princeton", 2020],
["Michael", "Harvard", 2019],
["Penny", "Yale", 2018],
["Harry", "Yale", 2015],
]
df = pd.DataFrame(data, columns=["Name", "College", "Year"])
df.sort_values(["College", "Year"], ascending=False).groupby("College").head()
You'd get this output:
Name College Year
Penny Yale 2018
Harry Yale 2015
Louise Princeton 2020
Mary Princeton 2017
Bill Princeton 2016
Michael Harvard 2019
Joe Harvard 2018
You will first have to find the maximum year within each group and set that as a column. You can then sort by that max and Year.
df = pd.read_table('./table.txt')
df["max"] = df.groupby("College")["Year"].transform("max")
df.sort_values(by=["max", "Year"], ascending=False).drop(columns="max").reset_index(drop=True)
Output:
Name College Year
0 Louise Princeton 2020
1 Mary Princeton 2017
2 Bill Princeton 2016
3 Michael Harvard 2019
4 Joe Harvard 2018
5 Penny Yale 2018
6 Harry Yale 2015
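A sketch of one more way to get the same ordering, assuming df is the students frame above: compute the group order once, then select whole groups in that order (.loc with a list of labels on a non-unique index keeps each group's rows together).
# colleges ordered by their most recent year
order = df.groupby('College')['Year'].max().sort_values(ascending=False).index
# sort rows newest-first, then pull out entire college groups in that order
out = (df.sort_values('Year', ascending=False)
         .set_index('College')
         .loc[order]
         .reset_index()[['Name', 'College', 'Year']])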
I want to merge three data frames in Python; the code I have now gives me some wrong outputs.
This is the first data frame
df_1
Year Month X_1 Y_1
0 2021 January $90 $100
1 2021 February NaN $120
2 2021 March $100 $130
3 2021 April $110 $140
4 2021 May NaN $150
5 2021 June $120 $160
This is the second data frame
df_2
Year Month X_2 Y_2
0 2021 January NaN $120
1 2021 February NaN $130
2 2021 March $80 $140
3 2021 April $90 $150
4 2021 May NaN $150
5 2021 June $120 $170
This is the third data frame
df_3
Year Month X_3 Y_3
0 2021 January $110 $150
1 2021 February $140 $160
2 2021 March $97 $170
3 2021 April $90 $180
4 2021 May NaN $190
5 2021 June $120 $200
The idea is to combine them into one data frame like this:
df_combined
Year Month X_1 Y_1 X_2 Y_2 X_3 Y_3
0 2021 January $90 $100 NaN $120 $110 $150
1 2021 February NaN $120 NaN $130 $140 $160
2 2021 March $100 $130 $80 $140 $97 $170
3 2021 April $110 $140 $90 $150 $90 $180
4 2021 May NaN $150 NaN $150 NaN $190
5 2021 June $120 $160 $120 $170 $120 $200
The code I have for now does not give me the correct outcome; only the df_3 columns come through correctly.
# compile the list of data frames you want to merge
from functools import reduce

data_frames = [df_1, df_2, df_3]
# note: this merges on 'Year' only, which is where things go wrong
df_merged = reduce(lambda cross, right: pd.merge(cross, right, on=['Year'], how='outer'),
                   data_frames)
# remove superfluous columns
df_merged.drop(['Month_x', 'Month_y'], axis=1, inplace=True)
You can try merging on both keys:
df_1.merge(df_2, how='left', on=['Year', 'Month']).merge(df_3, how='left', on=['Year', 'Month'])
Merging on both Year and Month is what the original attempt was missing; merging on 'Year' alone multiplies the rows for each year and misaligns the months.
One option of probably many is to do
from functools import reduce
import pandas as pd

idx = ["Year", "Month"]
new_df = reduce(pd.DataFrame.join, (i.set_index(idx) for i in data_frames)).reset_index()
or
reduce(lambda x, y: pd.merge(x, y, how="outer", on=["Year", "Month"]), data_frames)
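Since all three frames cover the same (Year, Month) keys, a concat-based sketch also works, assuming data_frames is the list defined above:
# align every frame on a (Year, Month) index and lay the value columns side by side
combined = pd.concat([d.set_index(['Year', 'Month']) for d in data_frames], axis=1).reset_index()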
My dataframe looks like this, with 3 columns. All I want to do is write a FUNCTION that takes the first two columns as inputs and returns the corresponding third column (GHG Intensity) as output. I want to be able to input any property name and year and get the corresponding GHG Intensity value. I cannot stress enough that this has to be written as a function using def. Please help!
       Property Name                        Data Year  GHG Intensity (kg CO2e/sq ft)
467    GALLERY 37                                2018                           7.50
477    Navy Pier, Inc.                           2016                          22.50
1057   GALLERY 37                                2015                           8.30
1491   Navy Pier, Inc.                           2015                          23.30
1576   GALLERY 37                                2016                           7.40
2469   The Chicago Theatre                       2016                           4.50
3581   Navy Pier, Inc.                           2014                          17.68
4060   Ida Noyes Hall                            2015                          11.20
4231   Chicago Cultural Center                   2015                          13.70
4501   GALLERY 37                                2017                           7.90
5303   Harpo Studios                             2015                          18.70
5450   The Chicago Theatre                       2015                            NaN
5556   Chicago Cultural Center                   2016                          10.30
6275   MARTIN LUTHER KING COMMUNITY CENTER       2015                          14.10
6409   MARTIN LUTHER KING COMMUNITY CENTER       2018                          12.70
6665   Ida Noyes Hall                            2017                           8.30
7621   Ida Noyes Hall                            2018                           8.40
7668   MARTIN LUTHER KING COMMUNITY CENTER       2017                          12.10
7792   The Chicago Theatre                       2018                           4.40
7819   Ida Noyes Hall                            2016                          10.20
8664   MARTIN LUTHER KING COMMUNITY CENTER       2016                          12.90
8701   The Chicago Theatre                       2017                           4.40
9575   Chicago Cultural Center                   2017                           9.30
10066  Chicago Cultural Center                   2018                           7.50
Here is an example, with a different data frame to test:
import pandas as pd
df = pd.DataFrame(data={'x': [3, 5], 'y': [4, 12]})
def func(df, arg1, arg2, arg3):
    '''arg1 and arg2 are input columns; arg3 is the output column.'''
    df = df.copy()
    df[arg3] = df[arg1] ** 2 + df[arg2] ** 2
    return df
Results are:
print(func(df, 'x', 'y', 'z'))
x y z
0 3 4 25
1 5 12 169
You can try this code:
def GHG_Intensity(PropertyName, Year):
    intensity = df[(df['Property Name'] == PropertyName)
                   & (df['Data Year'] == Year)]['GHG Intensity (kg CO2e/sq ft)'].to_list()
    return intensity[0] if len(intensity) else 'GHG Intensity Not Available'
print(GHG_Intensity('Navy Pier, Inc.', 2016))
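If the lookup is done many times, a sketch of an indexed variant, assuming df is the frame above and each (name, year) pair appears at most once:
# build a Series keyed by (Property Name, Data Year) once, then look up by key
lookup = df.set_index(['Property Name', 'Data Year'])['GHG Intensity (kg CO2e/sq ft)']

def ghg_intensity(property_name, year):
    try:
        return lookup.loc[(property_name, year)]
    except KeyError:
        return 'GHG Intensity Not Available'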
I have two dataframes (A and B). I want to remove all the rows in B whose values in the columns Month, Year, Type, and Name exactly match a row in A.
Dataframe A
Name Type Month Year country Amount Expiration Paid
0 EXTRON GOLD March 2019 CA 20000 2019-09-07 yes
0 LEAF SILVER March 2019 PL 4893 2019-02-02 yes
0 JMC GOLD March 2019 IN 7000 2020-01-16 no
Dataframe B
Name Type Month Year country Amount Expiration Paid
0 JONS GOLD March 2018 PL 500 2019-10-17 yes
0 ABBY BRONZE March 2019 AU 60000 2019-02-02 yes
0 BUYT GOLD March 2018 BR 50 2018-03-22 no
0 EXTRON GOLD March 2019 CA 90000 2019-09-07 yes
0 JAYB PURPLE March 2019 PL 9.90 2018-04-20 yes
0 JMC GOLD March 2019 IN 6000 2020-01-16 no
0 JMC GOLD April 2019 IN 1000 2020-01-16 no
Desired Output:
Dataframe B
Name Type Month Year country Amount Expiration Paid
0 JONS GOLD March 2018 PL 500 2019-10-17 yes
0 ABBY BRONZE March 2019 AU 60000 2019-02-02 yes
0 BUYT GOLD March 2018 BR 50 2018-03-22 no
0 JAYB PURPLE March 2019 PL 9.90 2018-04-20 yes
0 JMC GOLD April 2019 IN 1000 2020-01-16 no
We can use merge with an indicator here:
l = ['Month', 'Year', 'Type', 'Name']
B = B.merge(A[l], on=l, indicator=True, how='outer').loc[lambda x: x['_merge'] == 'left_only'].copy()
# you can then drop the helper column: B = B.drop(columns='_merge')
Name Type Month Year country Amount Expiration Paid _merge
0 JONS GOLD March 2018 PL 500.0 2019-10-17 yes left_only
1 ABBY BRONZE March 2019 AU 60000.0 2019-02-02 yes left_only
2 BUYT GOLD March 2018 BR 50.0 2018-03-22 no left_only
4 JAYB PURPLE March 2019 PL 9.9 2018-04-20 yes left_only
6 JMC GOLD April 2019 IN 1000.0 2020-01-16 no left_only
I tried using a MultiIndex for the same:
cols =['Month', 'Year','Type', 'Name']
index1 = pd.MultiIndex.from_arrays([df1[col] for col in cols])
index2 = pd.MultiIndex.from_arrays([df2[col] for col in cols])
df2 = df2.loc[~index2.isin(index1)]
Consider the following sites (site1, site2, site3) which have a number of different tables.
I am using read_html to scrape the tables into a single table as follows:
import multiprocessing
import pandas as pd

links = ['site1.com', 'site2.com', 'site3.com']

def process_url(url):
    return pd.concat(pd.read_html(url), ignore_index=False)

pool = multiprocessing.Pool(processes=2)
df = pd.concat(pool.map(process_url, links), ignore_index=True)
With the above procedure I am getting a single table. Although this is what I expected, it would be helpful to add a flag or a "table counter", just so I don't lose track of which row belongs to which table. So, how do I add the number of the table to each row?
Something like this, the same single table, but with a table_num column:
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date table_num
1 Allied Bank Mulberry AR 91.0 Today's Bank September 23, 2016 October 17, 2016 1
2 The Woodbury Banking Company Woodbury GA 11297.0 United Bank August 19, 2016 October 17, 2016 1
3 First CornerStone Bank King of Prussia PA 35312.0 First-Citizens Bank & Trust Company May 6, 2016 September 6, 2016 1
4 Trust Company Bank Memphis TN 9956.0 The Bank of Fayette County April 29, 2016 September 6, 2016 2
5 North Milwaukee State Bank Milwaukee WI 20364.0 First-Citizens Bank & Trust Company March 11, 2016 June 16, 2016 2
6 Hometown National Bank Longview WA 35156.0 Twin City Bank October 2, 2015 April 13, 2016 3
7 The Bank of Georgia Peachtree City GA 35259.0 Fidelity Bank October 2, 2015 October 24, 2016 3
8 Premier Bank Denver CO 34112.0 United Fidelity Bank, fsb July 10, 2015 August 17, 2016 3
9 Edgebrook Bank Chicago IL 57772.0 Republic Bank of Chicago May 8, 2015 July 12, 2016 3
10 Doral Bank NaN NaN NaN NaN NaN NaN 4
11 En Espanol San Juan PR 32102.0 Banco Popular de Puerto Rico February 27, 2015 May 13, 2015 4
12 Capitol City Bank & Trust Company Atlanta GA 33938.0 First-Citizens Bank & Trust Company February 13, 2015 April 21, 2015 4
13 Valley Bank Fort Lauderdale FL 21793.0 Landmark Bank, National Association June 20, 2014 June 29, 2015 5
14 Valley Bank Moline IL 10450.0 Great Southern Bank June 20, 2014 June 26, 2015 5
15 Slavie Federal Savings Bank Bel Air MD 32368.0 Bay Bank, FSB May 3, 2014 June 15, 2015 5
16 Columbia Savings Bank Cincinnati OH 32284.0 United Fidelity Bank, fsb May 23, 2014 November 10, 2016 6
17 AztecAmerica Bank NaN NaN NaN NaN NaN NaN 6
18 En Espanol Berwyn IL 57866.0 Republic Bank of Chicago May 16, 2014 October 20, 2016 6
For instance, if there are two tables in site1, the function must assign 0 to all the rows of table1 and 1 to all the rows of table2. Then, if site2 also has two tables, the numbering must continue: 2 for all the rows of its table1 and 3 for its table2, and so on for all the tables that live in each site.
Also, is it possible to use assign() or some other method to record the provenance of each row (i.e. the table it came from)?
Try changing your process_url() function as follows:
def process_url(url):
    return pd.concat([x.assign(table_num=i)
                      for i, x in enumerate(pd.read_html(url))],
                     ignore_index=False)
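Note that enumerate here restarts at 0 for every URL, so table_num is only unique within a site. If you also want to record which site each table came from, a module-level counter would not help (every multiprocessing worker gets its own copy of it); one sketch is to pass a site index along with each URL. The names site_num and the tuple-taking process_url(args) below are hypothetical:
def process_url(args):
    site_num, url = args  # each task carries its own site index
    return pd.concat([t.assign(site_num=site_num, table_num=i)
                      for i, t in enumerate(pd.read_html(url))],
                     ignore_index=False)

df = pd.concat(pool.map(process_url, list(enumerate(links))), ignore_index=True)
The (site_num, table_num) pair then identifies each table uniquely across all sites.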