I'm trying to read in an Excel file that has a sub-header. So far, I'm doing the following:
link = 'http://www.bea.gov/industry/xls/io-annual/GDPbyInd_GO_NAICS_1997-2013.xlsx'
xd = pd.read_excel(link, sheet_name='07NAICS_GO_A_Gross Output', skiprows=3)
Unfortunately, the data has a second sub header in row 4 (0-indexed) that only gives the unit of measurement, as follows. Can I somehow cleanly ignore that row?
Table IO Code Description 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
Current-dollar gross output (Millions of dollars)
A 1111A0 Oilseed farming 19973 17241 13259 13646 13721 14258 15672 21290 17910 18325 21425 31559 33027 34592 38524 43203 44948
skiprows can be a list of rows to ignore, so this does what you want:
xd = pd.read_excel(link, sheet_name='07NAICS_GO_A_Gross Output', skiprows=[0, 1, 2, 4])
So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows for each country, where 20 is added to the year and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values, though; I just want to append the new values.
This is how the dataframe looks (sample below):
Preferably, I would also like to create a loop that runs the calculation a certain number of times.
Super grateful for any help!
If you need to copy values from the dataframe as an example, here they are:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns
If you want to repeat the rows, I'd create a new dataframe, perform any operations on it (add 20 to the year, multiply the temperature by a constant or an array, etc.), and then use concat() to append it to the original dataframe:
import pandas as pd

tempChange = 1.15
data = {'Country': ['Afghanistan', 'Africa', 'Albania', 'Algeria', 'American Samoa'],
        'avgTemp': [14, 24, 13, 23, 27],
        'Year': [2012, 2012, 2012, 2012, 2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp'] * tempChange
df_2['Year'] = df['Year'] + 20
df = pd.concat([df, df_2])  # ignore_index=True if you wish to not repeat the index value
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032
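The question also asks for a loop that repeats this a certain number of times. Here is a minimal sketch of that, assuming each step should compound on the previous one (the name n_steps is hypothetical):

import pandas as pd

tempChange = 1.15
n_steps = 3  # hypothetical number of repetitions
data = {'Country': ['Afghanistan', 'Africa'], 'avgTemp': [14, 24], 'Year': [2012, 2012]}
df = pd.DataFrame(data)
frames = [df]
for _ in range(n_steps):
    step = frames[-1].copy()                        # start from the most recent rows
    step['avgTemp'] = step['avgTemp'] * tempChange  # compound the temperature change
    step['Year'] = step['Year'] + 20                # advance the year
    frames.append(step)
df = pd.concat(frames, ignore_index=True)
print(df)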
Where df is your data frame name:
df['tempChange'] = df['year'] + 20 * df['avgTemp']
This will add a new column to your df with the logic above. I'm not sure I understood your logic correctly, so the math may need some work; note that operator precedence makes this year + (20 * avgTemp).
I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20, axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp'] * tempChange, axis=1)
This is how you apply to each row.
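Both lines can also be written as plain vectorized column operations, which avoid the per-row Python calls and are usually faster:
dfName['newYear'] = dfName['year'] + 20
dfName['tempDiff'] = dfName['avgTemp'] * tempChange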
I am trying to create a plot in Excel using xlsxwriter with python3 and pandas. How do I get xlsxwriter to use the pubyear column for the x-axis values?
I can plot successfully using matplotlib, but I am required to produce Excel charts.
My dataframe looks like this:
In [248]: df1.describe
Out[248]:
<bound method NDFrame.describe of africa
pubyear
2018 57371
2017 70838
2016 66250
2015 58572
2014 52453
2013 46733
2012 42521
2011 38851
2010 33463
2009 29603
2008 25947
2007 22573
2006 19188
2005 16701>
This code
writer = pd.ExcelWriter('/tmp/pandas_simple.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
workbook = writer.book
worksheet = writer.sheets['Sheet1']
chart = workbook.add_chart({'type': 'column'})
# Configure the series of the chart from the dataframe data.
chart.add_series({'values': '=Sheet1!$D$1:$D$15'})
chart.set_x_axis({'name': 'Pubyear', 'min': '=Sheet1!$A$2',
                  'max': '=Sheet1!$A$14',
                  'date_axis': '=Sheet1!$A$2:$A15$'})
chart.set_y_axis({'name': 'Output'})
worksheet.insert_chart('I2', chart)
writer.save()
produced a file containing this data in the spreadsheet, where pubyear is in column A and africa in column B:
A B
pubyear africa
2018 57371
2017 70838
2016 66250
2015 58572
2014 52453
2013 46733
2012 42521
2011 38851
2010 33463
2009 29603
2008 25947
2007 22573
2006 19188
2005 16701
The plot shows a bar chart with
x-axis: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
I want pubyear, as in column A, to be the x-axis labels.
When you add the series, you add the categories (pubyear) at the same time:
chart.add_series({'categories': '=Sheet1!$A$2:$A$15',
                  'values': '=Sheet1!$B$2:$B$15'})
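Putting it together, a minimal sketch, assuming the dataframe is written with pubyear as the index so that it lands in column A and africa in column B of Sheet1 (the inline data here is a hypothetical stand-in for the real frame):

import pandas as pd

df = pd.DataFrame({'africa': [57371, 70838, 66250]},
                  index=pd.Index([2018, 2017, 2016], name='pubyear'))

writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')  # pubyear -> column A, africa -> column B

workbook = writer.book
worksheet = writer.sheets['Sheet1']
chart = workbook.add_chart({'type': 'column'})

# Categories supply the x-axis labels; values supply the bar heights.
n = len(df)
chart.add_series({'categories': f'=Sheet1!$A$2:$A${n + 1}',
                  'values': f'=Sheet1!$B$2:$B${n + 1}'})
chart.set_x_axis({'name': 'Pubyear'})
chart.set_y_axis({'name': 'Output'})
worksheet.insert_chart('I2', chart)
writer.save()  # writer.close() in newer pandas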
I have the following df I have filtered from a CSV of financial data for stocks.
ticker comp_name per_fisc_year per_fisc_qtr tot_revnu
47 A AGILENT TECH 2006 4 4973.0
48 A AGILENT TECH 2007 4 5420.0
58 A AGILENT TECH 2006 1 1167.0
59 A AGILENT TECH 2006 2 1239.0
60 A AGILENT TECH 2006 3 1239.0
61 A AGILENT TECH 2006 4 1328.0
62 A AGILENT TECH 2007 1 1280.0
63 A AGILENT TECH 2007 2 1320.0
64 A AGILENT TECH 2007 3 1374.0
65 A AGILENT TECH 2007 4 1446.0
I then need to ADD up all the quarterly data to get annual totals with a pivot table.
mean_rev_table = pd.pivot_table(zacks_df_filter_1, values=['tot_revnu'],
index=['comp_name'],columns=['per_fisc_year'],
aggfunc=np.mean)
mean_rev_table[:5]
which gives me a nicely formatted table
tot_revnu
per_fisc_year 2006 2007
comp_name
1800FLOWERS.COM 390.962667 290.26000
21ST CENTURY IN 550.114800 349.28200
24/7 KID DOC 0.857600 1.09520
24/7 REAL MEDIA 80.097200 57.66300
3COM CORP 409.215333 506.99238
Now I want to calculate the annual growth, or just the delta between 2006 and 2007, but I don't know how to reference the annual totals in the table (2006 and 2007).
I tried
mean_rev_table['rev_growth'] = mean_rev_table['2007'] - mean_rev_table['2006']
but I get a KeyError, because I think it only recognizes tot_revnu as the column. I probably need to recreate the pivot table, but I'm not sure how to. Thanks
You need to remove the [] to avoid a MultiIndex in the columns:
mean_rev_table = zacks_df_filter_1.pivot_table(
    values='tot_revnu',    # no list here, so no MultiIndex
    index='comp_name',
    columns='per_fisc_year',
    aggfunc=np.mean)
Another solution is droplevel:
mean_rev_table.columns = mean_rev_table.columns.droplevel(0)
you can also use groupby() + unstack():
mean_rev_table = (zacks_df_filter_1.groupby(['comp_name', 'per_fisc_year'])['tot_revnu']
                  .sum()
                  .unstack('per_fisc_year')
                  .rename_axis(None, axis=1))
Result:
In [46]: mean_rev_table
Out[46]:
2006 2007
comp_name
AGILENT TECH 9946.0 10840.0
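With flat columns, the growth column from the question can then be computed. Note that the year labels here are integers, not strings, so they are referenced as mean_rev_table[2007], not mean_rev_table['2007']:
mean_rev_table['rev_growth'] = mean_rev_table[2007] - mean_rev_table[2006]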
I am working with a pandas dataframe. From the code:
contracts.groupby(['State','Year'])['$'].mean()
I have a pandas groupby object with two group layers: State and Year.
State / Year / $
NY 2009 5
2010 10
2011 5
2012 15
NJ 2009 2
2012 12
DE 2009 1
2010 2
2011 3
2012 6
I would like to look at only those states for which I have data on all the years (i.e. NY and DE, not NJ as it is missing 2010). Is there a way to suppress those nested groups with less than full rank?
After grouping by State and Year and taking the mean,
means = contracts.groupby(['State', 'Year'])['$'].mean()
you could groupby the State alone, and use filter to keep the desired groups:
result = means.groupby(level='State').filter(lambda x: len(x) >= len(years))
For example,
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 15
states = ['NY','NJ','DE']
years = range(2009, 2013)
contracts = pd.DataFrame({
    'State': np.random.choice(states, size=N),
    'Year': np.random.choice(years, size=N),
    '$': np.random.randint(10, size=N)})
means = contracts.groupby(['State', 'Year'])['$'].mean()
result = means.groupby(level='State').filter(lambda x: len(x) >= len(years))
print(result)
yields
State Year
DE 2009 8
2010 5
2011 3
2012 6
NY 2009 2
2010 1
2011 5
2012 9
Name: $, dtype: int64
Alternatively, you could filter first and then take the mean:
filtered = contracts.groupby(['State']).filter(lambda x: x['Year'].nunique() >= len(years))
result = filtered.groupby(['State', 'Year'])['$'].mean()
but playing with various examples suggests this is typically slower than taking the mean and then filtering.
How do I reorder the LandUse column in this dataframe:
Region North South
LandUse Year
Corn 2005 149102.3744 2078875.0976
2010 201977.2160 2303998.5024
Developed 2005 1248.4416 10225.5552
2010 707.4816 7619.8528
Forests/Wetlands 2005 26511.4304 69629.8624
2010 23433.7600 48124.4288
Open Lands 2005 232290.1056 271714.9568
2010 45845.8112 131696.3200
Other Ag 2005 125527.1808 638010.4192
2010 257439.8848 635332.9024
Soybeans 2005 50799.1232 1791342.1568
2010 66271.2064 1811186.4512
Currently, 'LandUse' is organized alphabetically. I want it to be in the following order:
lst = ['Open Lands','Forests/Wetlands','Developed','Corn','Soybeans','Other Ag']
You could do df = df.loc[lst] to reorder the index.
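A minimal sketch of that on a hypothetical frame with the same (LandUse, Year) MultiIndex; df.reindex(lst, level='LandUse') is an equivalent alternative:

import pandas as pd

lst = ['Open Lands', 'Forests/Wetlands', 'Developed', 'Corn', 'Soybeans', 'Other Ag']

# Hypothetical data mirroring the (LandUse, Year) index in the question.
idx = pd.MultiIndex.from_product([sorted(lst), [2005, 2010]],
                                 names=['LandUse', 'Year'])
df = pd.DataFrame({'North': range(12), 'South': range(12)}, index=idx)

df = df.loc[lst]  # reorder the outer level to match lst
# or: df = df.reindex(lst, level='LandUse')
print(df)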