I am trying to create a plot in Excel using xlsxwriter with Python 3 and pandas. How do I get xlsxwriter to use the pubyear column for the x-axis values?
I can plot successfully using matplotlib, but I am required to produce Excel charts.
This code
In [248]: df1.describe
Out[248]:
<bound method NDFrame.describe of africa
pubyear
2018 57371
2017 70838
2016 66250
2015 58572
2014 52453
2013 46733
2012 42521
2011 38851
2010 33463
2009 29603
2008 25947
2007 22573
2006 19188
2005 16701>
writer = pd.ExcelWriter('/tmp/pandas_simple.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
workbook = writer.book
worksheet = writer.sheets['Sheet1']
chart = workbook.add_chart({'type': 'column'})
# Configure the series of the chart from the dataframe data.
chart.add_series({'values': '=Sheet1!$D$1:$D$15'})
chart.set_x_axis({'name': 'Pubyear', 'min': '=Sheet1!$A$2',
'max': '=Sheet1!$A$14',
"date_axis" : "=Sheet1!$A$2:$A15$" })
chart.set_y_axis({'name': 'Output'})
worksheet.insert_chart('I2', chart)
writer.save()
produced a file with pubyear in column A and africa in column B:
A B
pubyear africa
2018 57371
2017 70838
2016 66250
2015 58572
2014 52453
2013 46733
2012 42521
2011 38851
2010 33463
2009 29603
2008 25947
2007 22573
2006 19188
2005 16701
The plot shows a bar chart with
x-axis: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
I want pubyear as in column A to be the x-labels.
When you add the series, add the categories (pubyear) at the same time:
chart.add_series({'categories': '=Sheet1!$A$2:$A$15',
                  'values': '=Sheet1!$B$2:$B$15'})
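Putting this together, a minimal end-to-end sketch might look like the following (the data and file name are taken from the question; only the first four rows are used here, so the cell ranges are shorter than in the question):

```python
import pandas as pd

# Toy DataFrame shaped like the question's: pubyear index, africa column.
df = pd.DataFrame(
    {"africa": [57371, 70838, 66250, 58572]},
    index=pd.Index([2018, 2017, 2016, 2015], name="pubyear"),
)

with pd.ExcelWriter("pandas_simple.xlsx", engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="Sheet1")

    workbook = writer.book
    worksheet = writer.sheets["Sheet1"]

    chart = workbook.add_chart({"type": "column"})
    # categories supplies the x-axis labels (pubyear, column A);
    # values supplies the bar heights (africa, column B).
    chart.add_series({
        "categories": "=Sheet1!$A$2:$A$5",
        "values": "=Sheet1!$B$2:$B$5",
    })
    chart.set_x_axis({"name": "Pubyear"})
    chart.set_y_axis({"name": "Output"})

    worksheet.insert_chart("I2", chart)
```

Note that newer pandas versions close the file via the context manager (or writer.close()) rather than writer.save().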
Related
I'm downloading data from FRED and summing monthly values to get annual numbers, but I don't want incomplete years. Since the series is monthly, I only want to sum a year if it has 12 observations.
import pandas_datareader.data as web
mnemonic = 'RSFSXMV'
df = web.DataReader(mnemonic, 'fred', 2000, 2020)
df['year'] = df.index.year
new_df = df.groupby(["year"])[mnemonic].sum().reset_index()
print(new_df)
I don't want 2019 to show up.
In your case, use transform with nunique to make sure each year has 12 unique months; if not, drop those rows before doing the groupby sum:
df['Month']=df.index.month
m=df.groupby('year').Month.transform('nunique')==12
new_df = df.loc[m].groupby(["year"])[mnemonic].sum().reset_index()
Or, using isin:
df['Month']=df.index.month
m=df.groupby('year').Month.nunique()
new_df = df.loc[df.year.isin(m.index[m == 12])].groupby(["year"])[mnemonic].sum().reset_index()
You could use the aggregate function count while grouping:
df['year'] = df.index.year
df = df.groupby('year').agg({'RSFSXMV': 'sum', 'year': 'count'})
which will give you:
RSFSXMV year
year
2000 2487790 12
2001 2563218 12
2002 2641870 12
2003 2770397 12
2004 2969282 12
2005 3196141 12
2006 3397323 12
2007 3531906 12
2008 3601512 12
2009 3393753 12
2010 3541327 12
2011 3784014 12
2012 3934506 12
2013 4043037 12
2014 4191342 12
2015 4252113 12
2016 4357528 12
2017 4561833 12
2018 4810502 12
2019 2042147 5
Then simply drop the rows with a year count less than 12.
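That filtering step can be sketched as follows, with toy monthly data standing in for the FRED series (and counting via agg on the value column rather than on the grouping column):

```python
import pandas as pd

# Toy monthly series: 2018 has all 12 months, 2019 only 5,
# so 2019 should be dropped as incomplete.
idx = pd.date_range("2018-01-01", periods=17, freq="MS")
df = pd.DataFrame({"RSFSXMV": range(17)}, index=idx)

df["year"] = df.index.year
agg = df.groupby("year")["RSFSXMV"].agg(["sum", "count"])

# Keep only complete years (12 monthly observations).
complete = agg[agg["count"] == 12]
print(complete)
```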
I have a query regarding merging two dataframes.
For example, I have two dataframes as below:
print(df1)
Year Location
0 2013 america
1 2008 usa
2 2011 asia
print(df2)
Year Location
0 2008 usa
1 2008 usa
2 2009 asia
My expected output:
Year Location
2013 america
2008 usa
2011 asia
Year Location
2008 usa
2008 usa
2009 asia
Output I am getting right now:
Year Location Year Location
2013 america 2008 usa
2008 usa 2008 usa
2011 asia 2009 asia
I have tried using pd.concat and pd.merge with no luck.
Please help me with the above.
Simply concatenate along the row axis (axis=0, which is the default) in pd.concat:
df_merged = pd.concat([df1, df2], axis=0)
pd.concat([df1, df2]) should work. If all the column headings are the same, it will bind the second dataframe's rows below the first. This graphic from a pandas cheat sheet (https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) explains it pretty well:
The columns are the same and in the same order, so you can use df1.append(df2). (Note that DataFrame.append is deprecated in newer pandas; prefer pd.concat.)
output_df = pd.concat([df1, df2], ignore_index=False)
If you set ignore_index=True, you lose the original indexes and get 0..n-1 instead.
This works for a MultiIndex too.
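A small sketch contrasting the two axes, with the data copied from the question:

```python
import pandas as pd

df1 = pd.DataFrame({"Year": [2013, 2008, 2011],
                    "Location": ["america", "usa", "asia"]})
df2 = pd.DataFrame({"Year": [2008, 2008, 2009],
                    "Location": ["usa", "usa", "asia"]})

# axis=0 (the default) stacks rows, which matches the expected output.
stacked = pd.concat([df1, df2], ignore_index=True)

# axis=1 places the frames side by side, which is the unwanted result.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked.shape)       # (6, 2)
print(side_by_side.shape)  # (3, 4)
```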
I am trying to create a stacked bar chart that shows the percentage at which each item occurred in a given year.
The problem is that when I plot these values, not all the bars show. It seems some of the bars are being masked by the ones drawn in front of them.
This is the relevant code:
barWidth = 0.85
plt.bar(list(yearly_timeline.index),yearly_timeline.values, color='#a3acff',edgecolor='white',width=barWidth)
plt.bar(list(yearly_links.index),yearly_links.values, color='#FFD700',edgecolor='white',width=barWidth)
plt.bar(list(yearly_status.index),yearly_status.values, color='#b5ffb9',edgecolor='white',width=barWidth)
plt.bar(list(yearly_posts.index),yearly_posts.values,color='#f9bc86',edgecolor='white',width=barWidth)
plt.bar(list(yearly_shared.index),yearly_shared.values,color='#f9bc86',edgecolor='white',width=barWidth)
plt.xticks(list(yearly_links.index))
fig = plt.gcf()
fig.set_size_inches(20,10)
plt.tick_params(labelsize=20)
plt.show()
This is a sample of the datasets I am plotting:
#yearly posts
year
2009 4.907975
2010 11.656442
2013 11.656442
2014 24.539877
2015 7.975460
2016 12.269939
2017 16.564417
2018 10.429448
dtype: float64
#yearly shared
year
2010 1.273885
2011 0.636943
2012 9.554140
2013 29.936306
2014 28.025478
2015 15.923567
2016 7.643312
2017 4.458599
2018 2.547771
dtype: float64
#yearly timeline
year
2010 4.059041
2011 18.450185
2012 18.819188
2013 12.915129
2014 25.830258
2015 16.236162
2016 2.214022
2017 1.107011
2018 0.369004
dtype: float64
#yearly status
year
2009 6.916192
2010 6.997559
2011 15.296989
2012 22.294548
2013 19.528072
2014 13.913751
2015 10.740439
2016 1.790073
2017 1.464605
2018 1.057771
dtype: float64
#yearly links
year
2009 0.655738
2010 0.218579
2011 8.196721
2012 8.524590
2013 1.530055
2014 7.103825
2015 26.338798
2016 17.595628
2017 25.027322
2018 4.808743
dtype: float64
In your case, you could simplify your code by aggregating all your data in a single DataFrame (I assume they are currently individual Series):
Create dummy data:
import numpy as np
import pandas as pd

my_names = ['timeline','links','status','posts','shared']
my_series = [pd.Series(data=np.random.random(size=(9,)), index=range(2010,2019), name=n) for n in my_names]
Convert the list of Series into a DataFrame:
df = pd.DataFrame(my_series).T
display(df)
timeline links status posts shared
2010 0.534663 0.107604 0.265774 0.849307 0.149886
2011 0.064561 0.354329 0.557265 0.297695 0.563122
2012 0.646828 0.011643 0.608695 0.493709 0.337949
2013 0.170792 0.083039 0.866962 0.278223 0.501074
2014 0.386262 0.979529 0.972009 0.333049 0.505644
2015 0.764539 0.223265 0.365314 0.712091 0.757626
2016 0.012084 0.700645 0.118666 0.118811 0.332993
2017 0.407492 0.480495 0.399464 0.613331 0.655171
2018 0.072698 0.262846 0.763811 0.783575 0.859755
The easy way, using the pandas plot command:
df.plot(kind='bar', stacked=True, width=0.85)
or using matplotlib directly, to give more flexibility:
fig, ax = plt.subplots()
for i,col in enumerate(df.columns):
ax.bar(df.index, height=df[col], bottom=df.iloc[:,:i].sum(axis=1), edgecolor="white", width=0.85)
ax.set_xticks(df.index)
I'm trying to read in an Excel file that has a sub-header. So far, I'm doing the following:
link = 'http://www.bea.gov/industry/xls/io-annual/GDPbyInd_GO_NAICS_1997-2013.xlsx'
xd = pd.read_excel(link, sheetname='07NAICS_GO_A_Gross Output', skiprows=3)
Unfortunately, the data has a second sub header in row 4 (0-indexed) that only gives the unit of measurement, as follows. Can I somehow cleanly ignore that row?
Table IO Code Description 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
Current-dollar gross output (Millions of dollars)
A 1111A0 Oilseed farming 19973 17241 13259 13646 13721 14258 15672 21290 17910 18325 21425 31559 33027 34592 38524 43203 44948
skiprows can also be a list of row numbers to ignore, so this does what you want (note that newer pandas versions spell the argument sheet_name rather than sheetname):
xd = pd.read_excel(link, sheet_name='07NAICS_GO_A_Gross Output', skiprows=[0, 1, 2, 4])
How do I reorder the LandUse column in this dataframe:
Region North South
LandUse Year
Corn 2005 149102.3744 2078875.0976
2010 201977.2160 2303998.5024
Developed 2005 1248.4416 10225.5552
2010 707.4816 7619.8528
Forests/Wetlands 2005 26511.4304 69629.8624
2010 23433.7600 48124.4288
Open Lands 2005 232290.1056 271714.9568
2010 45845.8112 131696.3200
Other Ag 2005 125527.1808 638010.4192
2010 257439.8848 635332.9024
Soybeans 2005 50799.1232 1791342.1568
2010 66271.2064 1811186.4512
Currently, 'LandUse' is sorted alphabetically. I want it in the following order:
lst = ['Open Lands','Forests/Wetlands','Developed','Corn','Soybeans','Other Ag']
You could do df = df.loc[lst] to reorder the outer index level.
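As a sketch with a hypothetical frame shaped like the question's (a (LandUse, Year) MultiIndex with North/South columns, dummy values):

```python
import pandas as pd

# Hypothetical frame shaped like the question's data.
idx = pd.MultiIndex.from_product(
    [["Corn", "Developed", "Open Lands"], [2005, 2010]],
    names=["LandUse", "Year"],
)
df = pd.DataFrame({"North": range(6), "South": range(6)}, index=idx)

lst = ["Open Lands", "Developed", "Corn"]

# .loc with a list reorders the outer index level to match lst;
# all (LandUse, Year) rows are kept, just in the new order.
reordered = df.loc[lst]
print(reordered.index.get_level_values("LandUse").unique().tolist())
```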