Issue with read_html in pandas - Python

I want to read a table from Wikipedia:
import pandas as pd
caption="Edit section: 2019 inequality-adjusted HDI (IHDI) (2020 report)"
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index',match=caption)
df
But I got this error:
"ValueError: No tables found matching pattern 'Edit section: 2019 inequality-adjusted HDI (IHDI) (2020 report)'"
This method worked for a table like the one below:
caption = "Average daily maximum and minimum temperatures for selected cities in Minnesota"
df = pd.read_html('https://en.wikipedia.org/wiki/Minnesota', match=caption)
df
But this one has me confused. How can I solve this problem?

You have a couple of problems here.
The main one is the match pattern: "Edit section: …" is the label of the section's [edit] link, not text that appears inside the table itself, so read_html finds no table matching it. Fetching the page with requests and passing the HTML text to read_html also sidesteps the trouble some older pandas versions had retrieving https URLs directly.
Try this:
import pandas as pd
import requests
caption = "Table of countries by IHDI"
df = pd.read_html(
    requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index").text,
    match=caption,
)
print(df[0].head())
Output:
Rank Country ... 2019 estimates (2020 report)[4][5][6]
Rank Country ... Overall loss (%) Growth since 2010
0 1 Norway ... 6.1 0.021
1 2 Iceland ... 5.8 0.055
2 3 Switzerland ... 6.9 0.015
3 4 Finland ... 5.3 0.040
4 5 Ireland ... 7.3 0.066
[5 rows x 6 columns]

Alternatively, read all the tables on the page and pick the one you want by position:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index')
df[2]
Or, if you wish to use the match argument:
import pandas as pd
caption="Table of countries by IHDI"
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index',match=caption)
df[0]
This returns the matched table (the same one as above).
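If you are not sure which match string will work, you can inspect the captions that actually exist on the page. A minimal sketch, assuming requests and beautifulsoup4 are installed:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
# print the <caption> text of every table on the page
for table in soup.find_all("table"):
    if table.caption:
        print(table.caption.get_text(strip=True))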

Related

How to randomly assign treatment group in python?

In my research, I employ a regression-based difference-in-differences specification. To conduct a placebo test, I tried to randomly assign a placebo treatment entry year to each treatment group, drawn from a uniform distribution. For example, my original data looks like this:
treatment_group_dummy treated_year group_number
1 1996 1
1 2005 3
1 2001 5
1 2006 5
1 2007 5
1 2002 5
and I want to randomly assign treated years to all treatment groups, drawn from a uniform distribution over 1996 to 2007. For example:
treatment_group_dummy treated_year group_number
1 2007 1
1 1996 3
1 2004 5
1 2005 5
1 2001 5
1 2006 5
Here is my preliminary code, but I think it does not work at all...
import random
import numpy as np
import pandas as pd
import itertools as it
random.seed(0)
numGroups=5
numYears=1996 ~ 2007
data = list(it.product(range(numGroups),range(numMembers)))
df = pd.DataFrame(data=data,columns=['group','years'])
Does anyone have some thoughts about this?
Thanks in advance.
I don't see any initialization of numMembers in your code, so I am not sure about the size of the list you want. But the following is a possible implementation:
import numpy as np
import pandas as pd
# set a random seed
np.random.seed(2021)
numGroups = 5
# number of rows in the dataset
size = 10
data = {
    'group': np.random.randint(1, numGroups+1, size),
    'years': np.random.randint(1996, 2008, size)
}
df = pd.DataFrame(data)
Edit 1: Based on additional explanation from the author, to randomize treated_year only:
df['treated_year'] = np.random.randint(1996, 2008, df.shape[0])
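A self-contained sketch applying that edit to the example data from the question (the seed is set only for reproducibility; the values are copied from the question):
import numpy as np
import pandas as pd
np.random.seed(2021)
df = pd.DataFrame({
    'treatment_group_dummy': [1, 1, 1, 1, 1, 1],
    'treated_year': [1996, 2005, 2001, 2006, 2007, 2002],
    'group_number': [1, 3, 5, 5, 5, 5],
})
# overwrite treated_year with uniform draws from 1996..2007 inclusive
df['treated_year'] = np.random.randint(1996, 2008, df.shape[0])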

How to append NBA player NAMES with STATS?

I am still learning web scraping and could use some help, please. I would like to get NBA team stats for each player by combining 3 different dataframes: playerNames, playerStats_one, and playerStats_two.
Here is an example of the dataframe that I am looking for:
Name GP MIN PTS ...
Anthony Davis 62 34.4 26.1 ...
Lebron James 67 34.6 25.3 ...
Kyle Kuzma 61 25.0 12.8 ...
... ... ... ...
Here is my code so far:
import pandas as pd
import requests
url = 'https://www.espn.com/nba/team/stats/_/name/lal/season/2020/seasontype/2'
df = pd.read_html(url)
#goal 1 get player names
playerNames = df[0]
#goal 2 get stats
playerStats_one = df[1]
playerStats_two = df[3]
#goal 3 append or concat player stats to player name dataframe
new_df = pd.concat([playerNames, playerStats_one, playerStats_two], ignore_index=True, sort=False)
new_df2 = playerNames.append(playerStats_one, ignore_index=True)
I tried pd.concat and the append option, and the output had a bunch of NaN values. Any suggestions would be greatly appreciated. Thanks in advance for any insight you may offer.
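No answer is recorded here, but one likely cause of the NaNs: pd.concat defaults to axis=0, which stacks the frames on top of each other, so columns that do not overlap get filled with NaN. Concatenating along columns lines the frames up side by side instead; a sketch, assuming the three frames share the same row order and length:
# reuse playerNames, playerStats_one, playerStats_two from above
new_df = pd.concat([playerNames, playerStats_one, playerStats_two], axis=1)
print(new_df.head())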

How could I transform the numpy array to pandas dataframe?

I am new to data analysis with Python, and I wonder how I can transform the format of the left table into the right one. My initial thought is to create a nested for loop.
[screenshot: the desired table]
First, I read the required CSV file.
[screenshot: the imported CSV]
Then I count the number of countries in the column 'country' and the length of the new column-names list:
`countries = len(test['country'])`
`columns = len(['Year', 'Values'])`
After that I should write the nested for loop; however, I have no idea how to write the code. What I have come up with so far is:
`for i in countries:`
`for j in columns:`
You can use df.melt here:
In [3575]: df = pd.DataFrame({'country':['Afghanistan', 'Albania'], '1970':[1.36, 6.1], '1971':[1.39, 6.22], '1972':[1.43, 6.34]})
In [3576]: df
Out[3576]:
country 1970 1971 1972
0 Afghanistan 1.36 1.39 1.43
1 Albania 6.10 6.22 6.34
In [3609]: df = df.melt('country', var_name='Year', value_name='Values').sort_values('country')
In [3610]: df
Out[3610]:
country Year Values
0 Afghanistan 1970 1.36
2 Afghanistan 1971 1.39
4 Afghanistan 1972 1.43
1 Albania 1970 6.10
3 Albania 1971 6.22
5 Albania 1972 6.34
Not sure what you want to do, but if you want to transform a column into a numpy array, you can use the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"foo": [1,2,3], "bar": [10,20,30]})
print(df)
foo_array = np.array(df["foo"])
print(foo_array)
and then iterate on foo_array
You can also loop over your data frame using:
for row in df.iterrows():
    print(row)
But it's not recommended, since you can often use a built-in pandas function to do the same job.
Your data frame is also an iterable object, but iterating over it yields only the column names:
for d in df:
    print(d)
# output:
# foo
# bar
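To answer the title question directly, a minimal sketch of going from a numpy array to a DataFrame (the array and column names here are invented for illustration):
import numpy as np
import pandas as pd
arr = np.array([[1, 10], [2, 20], [3, 30]])
# wrap the array in a DataFrame, supplying column labels
df = pd.DataFrame(arr, columns=['foo', 'bar'])
print(df)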

Pandas - Operate with series and delays

I have the following dataframe in pandas. My question is how to operate on series where one of them has a time delay. For example, I would like to calculate the result of dividing the GDP of a period by the population of the next period.
GDP Population
1950 3.31 1
1951 3.5 1
...
2000 15.2 2
To do that you can just use shift. Since you want the next period's population, shift with -1 so each row sees the following row's value:
df['new_col'] = df['GDP'] / df['Population'].shift(-1)
Have you considered using shift?
import pandas as pd
import numpy as np
df = pd.DataFrame({"GDP": np.random.normal(3, 1, 51),
                   "Pop": np.random.randint(1, 10, 51)},
                  index=np.arange(1950, 2001))
# divide each year's GDP by the next year's population
df["res"] = df.GDP / df.Pop.shift(-1)

pandas how to use groupby to group columns by date in the label?

I have a dataframe of 10730 rows × 249 columns, with the following columns:
Index(['RegionID', 'Metro', 'CountyName', 'SizeRank', '1996-04', '1996-05',
'1996-06', '1996-07', '1996-08', '1996-09',
...
'2015-11', '2015-12', '2016-01', '2016-02', '2016-03', '2016-04',
'2016-05', '2016-06', '2016-07', '2016-08'],
dtype='object', length=249)
What I need to do is group the columns by quarter: January to March is Q1, and so on through Q4, using the mean of the values. I know how to group 3 columns, for example, but how do I group all of the columns, since I cannot specify the column names one by one?
This is the dataframe head in csv to use for testing:
'State,RegionName,RegionID,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,1996-07,1996-08,1996-09,1996-10,1996-11,1996-12,1997-01,1997-02,1997-03,1997-04,1997-05,1997-06,1997-07,1997-08,1997-09,1997-10,1997-11,1997-12,1998-01,1998-02,1998-03,1998-04,1998-05,1998-06,1998-07,1998-08,1998-09,1998-10,1998-11,1998-12,1999-01,1999-02,1999-03,1999-04,1999-05,1999-06,1999-07,1999-08,1999-09,1999-10,1999-11,1999-12,2000-01,2000-02,2000-03,2000-04,2000-05,2000-06,2000-07,2000-08,2000-09,2000-10,2000-11,2000-12,2001-01,2001-02,2001-03,2001-04,2001-05,2001-06,2001-07,2001-08,2001-09,2001-10,2001-11,2001-12,2002-01,2002-02,2002-03,2002-04,2002-05,2002-06,2002-07,2002-08,2002-09,2002-10,2002-11,2002-12,2003-01,2003-02,2003-03,2003-04,2003-05,2003-06,2003-07,2003-08,2003-09,2003-10,2003-11,2003-12,2004-01,2004-02,2004-03,2004-04,2004-05,2004-06,2004-07,2004-08,2004-09,2004-10,2004-11,2004-12,2005-01,2005-02,2005-03,2005-04,2005-05,2005-06,2005-07,2005-08,2005-09,2005-10,2005-11,2005-12,2006-01,2006-02,2006-03,2006-04,2006-05,2006-06,2006-07,2006-08,2006-09,2006-10,2006-11,2006-12,2007-01,2007-02,2007-03,2007-04,2007-05,2007-06,2007-07,2007-08,2007-09,2007-10,2007-11,2007-12,2008-01,2008-02,2008-03,2008-04,2008-05,2008-06,2008-07,2008-08,2008-09,2008-10,2008-11,2008-12,2009-01,2009-02,2009-03,2009-04,2009-05,2009-06,2009-07,2009-08,2009-09,2009-10,2009-11,2009-12,2010-01,2010-02,2010-03,2010-04,2010-05,2010-06,2010-07,2010-08,2010-09,2010-10,2010-11,2010-12,2011-01,2011-02,2011-03,2011-04,2011-05,2011-06,2011-07,2011-08,2011-09,2011-10,2011-11,2011-12,2012-01,2012-02,2012-03,2012-04,2012-05,2012-06,2012-07,2012-08,2012-09,2012-10,2012-11,2012-12,2013-01,2013-02,2013-03,2013-04,2013-05,2013-06,2013-07,2013-08,2013-09,2013-10,2013-11,2013-12,2014-01,2014-02,2014-03,2014-04,2014-05,2014-06,2014-07,2014-08,2014-09,2014-10,2014-11,2014-12,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,2015-11,2015-12,2016-01,2016-02,2016-03,2016-04,2016-05,2016-06,2016-07,2016-08\nNY,New York,6181,New York,Queens,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,432600.0,438700.0,440500.0,433900.0,422000.0,415700.0,421200.0,431100.0,435100.0,431900.0,428400.0,430700.0,438800.0,446800.0,455400.0,465500.0,472600.0,478200.0,487600.0,498600.0,508800.0,515300.0,517000.0,517800.0,520800.0,521500.0,523000.0,526300.0,524800.0,519100.0,516200.0,516400.0,516300.0,515500.0,512200.0,509200.0,509800.0,511600.0,512700.0,514000.0,513400.0,510700.0,508100.0,506700.0,505200.0,503700.0,502900.0,502400.0,500500.0,496400.0,491900.0,487500.0,484400.0,481700.0,477900.0,473600.0,469700.0,466100.0,461700.0,457700.0,455300.0,454800.0,456000.0,457800.0,461300.0,466100.0,470200.0,472800.0,475300.0,477100.0,478400.0,479100.0,478900.0,477700.0,476700.0,477100.0,478000.0,478000.0,476800.0,475300.0,473800.0,472000.0,470600.0,469900.0,469500.0,468200.0,465800.0,463500.0,461800.0,460100.0,459700.0,460800.0,461700.0,462500.0,463900.0,466000.0,467500.0,468200.0,468700.0,469400.0,469400,469100.0,468700,469300,470300,472100,474300,477600,481400,485100,488800,492600,495900,499500,503500,506400,509900,515700,520800,522200,522400,523800,526200,528400,529600,530800,532200,533800,536200,540600,545600,551400,557200,563000,568700,573600,576200,578400,582200,588000,592200,592500,590200,588000,586400\nCA,Los Angeles,12447,Los Angeles-Long Beach-Anaheim,Los 
Angeles,2,155000.0,154600.0,154400.0,154200.0,154100.0,154300.0,154300.0,154200.0,154800.0,155900.0,157000.0,157700.0,158200.0,158600.0,158800.0,158900.0,159100.0,159800.0,160700.0,161900.0,163400.0,165400.0,167000.0,168500.0,169900.0,171400.0,172900.0,174300.0,175800.0,177800.0,180100.0,182600.0,184400.0,185600.0,186900.0,188200.0,189600.0,191300.0,193100.0,194700.0,196300.0,197700.0,199100.0,200700.0,202300.0,204400.0,207000.0,209800.0,212300.0,214500.0,216600.0,219000.0,221100.0,222800.0,224300.0,226100.0,228100.0,230600.0,233000.0,235400.0,237300.0,239100.0,240900.0,242900.0,245000.0,247300.0,250100.0,253100.0,255900.0,258800.0,261900.0,265200.0,268600.0,272600.0,276900.0,281800.0,287000.0,292200.0,297000.0,302100.0,307600.0,313400.0,319000.0,324300.0,329600.0,334600.0,339300.0,344500.0,350600.0,356800.0,363400.0,370700.0,378400.0,386500.0,394900.0,404300.0,414600.0,425500.0,436600.0,447400.0,456700.0,464400.0,471200.0,477400.0,483500.0,489100.0,494700.0,501400.0,509700.0,518300.0,527200.0,536100.0,545400.0,555200.0,564500.0,571900.0,576800.0,579700.0,581800.0,583800.0,585300.0,587300.0,589900.0,592200.0,593300.0,593400.0,593100.0,592900.0,591600.0,590900.0,591800.0,592600.0,592100.0,590200.0,586200.0,581600.0,577500.0,572800.0,567600.0,562100.0,554400.0,545000.0,535500.0,525400.0,513600.0,502000.0,491200.0,480200.0,469000.0,459300.0,451200.0,443900.0,436800.0,430900.0,426100.0,421800.0,417800.0,413700.0,410200.0,407900.0,406300.0,404900.0,404200.0,402900.0,405900.0,412000.0,415000.0,413100.0,412100.0,411300.0,410100.0,408400.0,406800.0,405100.0,403300.0,401900.0,401000.0,399200.0,397100.0,395000.0,392700.0,390200.0,387400.0,384700.0,382100.0,379500.0,377200.0,375700.0,373800.0,371500.0,370000.0,370300.0,372100.0,375300.0,378600.0,382100.0,385600.0,389000.0,391800.0,396400.0,401500,405700.0,410700,418200,425500,432700,440400,448100,455200,461900,467800,472300,475700,479400,484000,489400,494200,498100,501800,505600,509000,512600,516000,518900,521700,525100,528900,532400,535300,538200,541000,544000,547200,550600,554200,558200,560800,562800,565600,569700,574000,577800,580600,583000,585100\nIL,Chicago,17426,Chicago,Cook,3,109700.0,109400.0,109300.0,109300.0,109100.0,109000.0,109000.0,109600.0,110200.0,110800.0,111300.0,111700.0,112200.0,112300.0,112100.0,112200.0,113000.0,113700.0,114200.0,114800.0,115500.0,116200.0,117100.0,117600.0,117800.0,118300.0,119200.0,120000.0,120600.0,121500.0,122300.0,122700.0,122900.0,123300.0,123700.0,124500.0,125700.0,127300.0,128800.0,130200.0,131400.0,132600.0,133700.0,134600.0,135500.0,136800.0,138300.0,140100.0,141900.0,143700.0,145300.0,146700.0,147900.0,149000.0,150400.0,152000.0,154000.0,155600.0,157000.0,158200.0,159900.0,161800.0,163700.0,165300.0,166400.0,167500.0,168800.0,170400.0,172100.0,173900.0,175600.0,177000.0,177800.0,177600.0,177300.0,177700.0,178800.0,180400.0,182300.0,183800.0,185000.0,185600.0,186800.0,188900.0,191300.0,194100.0,197500.0,200200.0,202300.0,203700.0,204000.0,204000.0,204400.0,205300.0,206300.0,207000.0,207600.0,208600.0,209600.0,210900.0,212800.0,214600.0,216400.0,218300.0,220300.0,222300.0,224000.0,225400.0,226900.0,228600.0,230100.0,231800.0,233200.0,234500.0,236000.0,237500.0,239000.0,240800.0,242500.0,243900.0,244900.0,245300.0,245400.0,245800.0,245800.0,245500.0,245900.0,246900.0,247300.0,247400.0,247300.0,247000.0,246700.0,246400.0,246100.0,246100.0,246300.0,246400.0,246700.0,247100.0,246700.0,245300.0,243900.0,242000.0,239800.0,237900.0,236000.0,233500.0,231800.0,230700.0,229200.0,226700.0,225200.0,224500.0,223800.0,
223000.0,221900.0,219700.0,217500.0,215600.0,213800.0,212900.0,212300.0,211900.0,210800.0,209300.0,207300.0,205300.0,204200.0,204100.0,203100.0,201100.0,199000.0,196700.0,193800.0,191100.0,189200.0,188100.0,187600.0,186500.0,184400.0,181700.0,178700.0,175900.0,174100.0,172800.0,171400.0,170100.0,169100.0,167900.0,166700.0,166200.0,166400.0,166800.0,167900.0,168900.0,168400.0,167100.0,166900.0,167300.0,167500,167700.0,168300,169100,170400,172400,175100,178200,181000,183200,184600,185800,187200,189100,191100,192500,192600,192400,192900,193900,195600,197800,200100,201700,202000,201200,200500,201500,204000,206500,207600,207700,208100,209100,209000,207800,206900,206200,205800,206200,207300,208200,209100,211000,213000\nPA,Philadelphia,13271,Philadelphia,Philadelphia,4,50000.0,49900.0,49600.0,49400.0,49400.0,49300.0,49300.0,49400.0,49700.0,49600.0,49500.0,49700.0,49800.0,49700.0,49700.0,49800.0,49700.0,49700.0,49800.0,49900.0,49900.0,50000.0,50300.0,50600.0,50800.0,50800.0,50800.0,50800.0,50700.0,50500.0,50500.0,50700.0,50700.0,50800.0,50900.0,51100.0,51200.0,51400.0,51500.0,51400.0,51500.0,51800.0,52100.0,52100.0,52300.0,52700.0,53100.0,53200.0,53400.0,53700.0,53800.0,53800.0,54100.0,54500.0,54700.0,54600.0,54800.0,55100.0,55400.0,55500.0,55400.0,55500.0,55700.0,55900.0,56300.0,56600.0,57000.0,57500.0,58100.0,58600.0,59100.0,59700.0,60300.0,60700.0,61200.0,61800.0,62200.0,62500.0,63000.0,63600.0,63900.0,64200.0,64700.0,65300.0,65700.0,66100.0,66800.0,67700.0,68500.0,69200.0,69800.0,70700.0,71700.0,72800.0,73700.0,74700.0,75700.0,76700.0,77800.0,79100.0,80500.0,82100.0,84000.0,85600.0,87000.0,88200.0,89600.0,91300.0,93000.0,94900.0,96700.0,98400.0,100200.0,101900.0,103400.0,104900.0,106400.0,107500.0,108200.0,109300.0,110800.0,112500.0,113800.0,114800.0,115600.0,116000.0,116400.0,116700.0,116800.0,116900.0,117300.0,117800.0,118200.0,118600.0,119300.0,120200.0,120900.0,121400.0,121300.0,120900.0,120200.0,119600.0,119600.0,119500.0,118800.0,118100.0,117500.0,117100.0,117000.0,116700.0,116300.0,115800.0,115500.0,115900.0,116300.0,116400.0,116400.0,116100.0,116000.0,116200.0,116700.0,117300.0,118000.0,118200.0,119500.0,120900.0,121300.0,121300.0,122100.0,123000.0,123300.0,122300.0,120000.0,118200.0,117600.0,117900.0,117800.0,117400.0,117000.0,116900.0,116700.0,116500.0,115700.0,115300.0,115500.0,115600.0,115200.0,114800.0,114100.0,113500.0,112900.0,111800.0,110800.0,110400.0,110400.0,110200.0,109900.0,109700.0,110000.0,110700.0,111800,112100.0,111900,112000,112200,111800,111200,111000,110900,111100,111800,112700,112900,113100,113900,114200,113600,113500,114100,114900,115500,115500,115400,115600,116000,116100,116100,116400,117000,117900,119000,120100,121300,122300,122700,122300,121600,121800,123300,125200,126400,127000,127400,128300,129100\nAZ,Phoenix,40326,Phoenix,Maricopa,5,87200.0,87700.0,88200.0,88400.0,88500.0,88900.0,89400.0,89700.0,90100.0,90700.0,91400.0,91700.0,91800.0,92000.0,92300.0,92600.0,93000.0,93400.0,94000.0,94600.0,95300.0,96100.0,96800.0,97300.0,97700.0,98400.0,99200.0,100100.0,100500.0,100700.0,100900.0,101700.0,102600.0,103400.0,103900.0,104400.0,105100.0,105900.0,106200.0,106600.0,107400.0,108300.0,109000.0,109700.0,110400.0,111000.0,111700.0,112800.0,113700.0,114300.0,115100.0,115600.0,115900.0,116500.0,117200.0,117400.0,117600.0,118400.0,119700.0,120700.0,121200.0,121500.0,122000.0,122400.0,122700.0,123000.0,123600.0,124300.0,125000.0,125800.0,126600.0,127200.0,127900.0,128400.0,128800.0,129500.0,130500.0,131600.0,132500.0,133200.0,134000.0,134900.0,135700.0,136500.0,137200.0,13
8000.0,138600.0,138900.0,139200.0,139400.0,139600.0,140300.0,141400.0,142500.0,143700.0,144900.0,145900.0,147100.0,148400.0,150300.0,153100.0,156200.0,159400.0,162900.0,166500.0,170000.0,173900.0,178800.0,185000.0,192300.0,200700.0,209400.0,217000.0,223600.0,229800.0,234900.0,238600.0,241300.0,243000.0,244100.0,244800.0,245400.0,245600.0,245600.0,245300.0,244600.0,243800.0,243400.0,243400.0,243600.0,243200.0,242200.0,241300.0,240200.0,238400.0,236400.0,234700.0,233300.0,231600.0,229100.0,226100.0,222800.0,218800.0,214300.0,209500.0,205200.0,201100.0,197300.0,193700.0,190300.0,186700.0,182800.0,180500.0,179600.0,178000.0,175100.0,172100.0,168400.0,164200.0,160000.0,156000.0,151800.0,147600.0,143900.0,138900.0,133400.0,130200.0,129200.0,127700.0,126200.0,124800.0,123100.0,120700.0,118500.0,117000.0,115800.0,114800.0,114100.0,113200.0,111800.0,110100.0,108000.0,105900.0,104100.0,102900.0,102300.0,102400.0,103000.0,104100.0,105800.0,107600.0,109100.0,111200.0,114000.0,117200.0,120400.0,123300.0,125800.0,128300.0,130500.0,132500,134400.0,136200,138400,141600,144700,147400,150500,153600,156100,158100,160000,161600,162700,163300,163700,164100,164200,164500,164700,165200,166200,167200,168400,169900,171000,171500,172100,172900,174100,175500,177100,179100,181000,182400,183800,185300,186600,188000,189100,190200,191300,192800,194500,195900\n'
I changed the column index to dates by dropping the non-date columns: quarter = df.drop(['RegionID','Metro','CountyName','SizeRank'], axis=1).
Then I converted the columns to dates with quarter.columns = pd.to_datetime(quarter.columns), and I would like to do something like quarter = quarter.groupby(pd.TimeGrouper(freq='3M'), axis=1), but it's not working. Afterwards I would merge it back with the non-date columns. Also, with this approach I wouldn't know how to set the right labels, like [2015Q4, 2016Q1, 2016Q2, 2016Q3, 2016Q4].
Here is a vectorized solution which uses pd.PeriodIndex and groupby(..., axis=1):
Data:
In [69]: x
Out[69]:
2016-01 2016-02 2016-03 2016-04 2016-05 2016-06
0 1 0 1 0 0 0
1 2 0 1 0 0 0
2 1 1 2 0 1 0
Solution:
In [70]: x.groupby(pd.PeriodIndex(x.columns, freq='Q'), axis=1).mean()
Out[70]:
2016Q1 2016Q2
0 0.666667 0.000000
1 1.000000 0.000000
2 1.333333 0.333333
Explanation:
In [71]: pd.PeriodIndex(x.columns, freq='Q')
Out[71]: PeriodIndex(['2016Q1', '2016Q1', '2016Q1', '2016Q2', '2016Q2', '2016Q2'], dtype='period[Q-DEC]', freq='Q-DEC')
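Note: on recent pandas versions, groupby(..., axis=1) is deprecated. An equivalent via transposing (a sketch, using the same x as above): group the rows of the transposed frame, then transpose back:
x.T.groupby(pd.PeriodIndex(x.T.index, freq='Q')).mean().T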
It's not pretty, but this is the first thing I thought of. It sounds like the date columns can be manipulated separately. Break those out into a separate dataframe of the form below. If the other fields are kept, the conversion to datetime will throw an error.
import numpy as np
import pandas as pd
csv_df = pd.DataFrame({'2016-01':[1,2,1], '2016-02':[0,0,1], '2016-03':[1,1,2], '2016-04':[0,0,0], '2016-05':[0,0,1], '2016-06':[0,0,0]})
# convert columns into datetime format
csv_df.rename(columns=lambda x: pd.to_datetime(x, format='%Y-%m'), inplace=True)
# now strip out the year and the quarter
csv_df.rename(columns=lambda x: str(x.year) + 'Q' + str(x.quarter), inplace=True)
# @lucarlig improved my suggestion by using groupby as follows
csv_df = csv_df.groupby(csv_df.columns, axis=1).mean()
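For the sample frame above, this yields the same result as the PeriodIndex approach:
print(csv_df)
#      2016Q1    2016Q2
# 0  0.666667  0.000000
# 1  1.000000  0.000000
# 2  1.333333  0.333333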
Consider melting the dataframe from the wide format to long format, parse out the quarter and year using datetime and run a pivot_table() to transform back from long to wide aggregating the values with mean:
import pandas as pd
import datetime as dt
import numpy as np
...
# MELT DATAFRAME
meltdf = pd.melt(df, id_vars=['State','RegionName','RegionID',
                              'Metro','CountyName','SizeRank'],
                 var_name='Date', value_name='Data')
# EXTRACT QUARTER
meltdf['Date'] = pd.to_datetime(meltdf['Date'] + '-01')
meltdf['YearQuarter'] = meltdf['Date'].dt.year.astype(str) + 'Q' + \
                        meltdf['Date'].dt.quarter.astype(str)
# PIVOT DATAFRAME
pivotdf = pd.pivot_table(meltdf, index=['State','RegionName','RegionID',
                                        'Metro','CountyName','SizeRank'],
                         columns=['YearQuarter'], values='Data', aggfunc=np.mean)
Output
print(pivotdf.head())
# State RegionName RegionID Metro CountyName SizeRank 1996Q2 1996Q3 1996Q4 1997Q1 1997Q2 1997Q3 ...
# AZ Phoenix 40326 Phoenix Maricopa 5 87700 88600 89733.33333 91266.66667 92033.33333 93000
# CA Los Angeles 12447 Los Angeles...Los Angeles 2 154666.6667 154200 154433.3333 156866.6667 158533.3333 159266.6667
# IL Chicago 17426 Chicago Cook 3 109466.6667 109133.3333 109600 111266.6667 112200 112966.6667
# NY New York 6181 New York Queens 1
# PA Philadelphia 13271 Philadelphia Philadelphia 4 49833.33333 49366.66667 49466.66667 49600 49733.33333 49733.33333
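If you then want the grouping keys back as ordinary columns rather than a row MultiIndex, one more step:
pivotdf = pivotdf.reset_index()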
