Hi, I have a dataframe that looks like this:
ID Date Total_Amount priority
1 2007 4488 High
2 2007 40981 Low
3 2017 450 Medium
4 2008 1000 Low
Each row is a different person (ID), showing how much they spent in a given year (Total_Amount).
I want to create a bar chart with the years on the x-axis and the summed Total_Amount as the bar height, stacked by priority. For example, if 10 people spent money in 2007 and their Total_Amount sums to £100,000, the 2007 bar should have a height of 100,000, split into priority segments (e.g. 5 people High, 4 Low and 1 Medium).
I tried using crosstab with Date as the rows and priority as the columns, but instead of a dataframe of Total_Amount spent I get one with the number of people in each priority.
You can use groupby() and then unstack():
df2 = df.groupby(['Date','priority'])['Total_Amount'].sum().unstack('priority').fillna(0)
df2.plot(kind='bar', stacked=True)
This produces a bar chart with one stacked bar per year, split by priority.
Almost the same, but still using crosstab:
pd.crosstab(index=df.Date, columns=df.priority, values=df.Total_Amount, aggfunc='sum')\
  .fillna(0).plot(kind='bar', stacked=True)
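For completeness, a minimal end-to-end sketch of the groupby approach, rebuilt from the sample rows in the question (matplotlib is assumed for the plot):

import pandas as pd
import matplotlib.pyplot as plt

# Rebuild the sample data from the question
df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Date': [2007, 2007, 2017, 2008],
    'Total_Amount': [4488, 40981, 450, 1000],
    'priority': ['High', 'Low', 'Medium', 'Low'],
})

# Sum spending per year and priority, then pivot the priorities into columns
df2 = (df.groupby(['Date', 'priority'])['Total_Amount']
         .sum()
         .unstack('priority')
         .fillna(0))

# One bar per year, stacked by priority
df2.plot(kind='bar', stacked=True)
plt.ylabel('Total_Amount')
plt.show()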
I have two dataframes. df1 shows annual rainfall over a certain area:
df1:
longitude latitude year
-13.0 8.0 1979 15.449341
1980 21.970507
1981 18.114307
1982 16.881737
1983 24.122467
1984 27.108953
1985 27.401234
1986 18.238272
1987 25.421076
1988 11.796293
1989 17.778618
1990 18.095036
1991 20.414757
and df2 shows the upper limits of each bin:
bin limits
0 16.655970
1 18.204842
2 19.526524
3 20.852657
4 22.336731
5 24.211905
6 27.143820
I'm trying to add a new column to df2 that shows the frequency of rainfall events from df1 in their corresponding bin. For example, in bin 1 I'd be looking for the values in df1 that fall between 16.65 and 18.2.
I've tried the following:
rain = df1['tp1']
for i in range(7):
    limit = df2.iloc[i]
    out4['count'] = rain[rain > limit].count()
However, I get the following message:
ValueError: Can only compare identically-labeled Series objects
I think this is referring to the fact that I'm comparing two dataframes of different sizes? I'm also unsure whether the loop itself is correct.
Any help is much appreciated, thanks!
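For what it's worth, the error comes from the comparison itself: df2.iloc[i] returns a whole row as a Series, and pandas refuses to compare two Series whose labels don't match. A tiny sketch with made-up values reproduces it:

import pandas as pd

rain = pd.Series([15.4, 21.9, 18.1])             # rainfall values, default integer index
limit = pd.Series({'bin': 1, 'limits': 18.2})    # what df2.iloc[i] returns: a whole row

try:
    rain > limit                                 # labels differ, so this raises
except ValueError as err:
    print(err)                                   # "Can only compare identically-labeled Series objects"

# Comparing against the scalar limit works as intended
print((rain > limit['limits']).sum())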
Use pd.cut to assign your rainfall into bins:
import numpy as np
import pandas as pd

# Define the limits for your bins
# Bin 0: (-np.inf , 16.655970]
# Bin 1: (16.655970, 18.204842]
# Bin 2: (18.204842, 19.526524]
# ...
# note that your bins only go up to 27.14 while max rainfall is 27.4 (row 6).
# You may need to add / adjust your limits.
limits = [-np.inf] + df2["limits"].to_list()
# Assign the rainfall to each bin
bins = pd.cut(df1["rainfall"], limits, labels=df2["bin"])
# Count how many values fall into each bin
bins.value_counts(sort=False).rename_axis("bin")
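If the end goal is the extra column on df2 described in the question, one way to attach the counts (a sketch, assuming the frame and column names used above, with df2 already sorted by bin) is:

import numpy as np
import pandas as pd

# df1 and df2 are the frames from the question; "rainfall" is assumed as the value column
limits = [-np.inf] + df2["limits"].to_list()
binned = pd.cut(df1["rainfall"], limits, labels=df2["bin"])

# value_counts on the categorical result keeps one entry per bin, in bin order,
# so it lines up row-for-row with df2
df2["count"] = binned.value_counts(sort=False).to_numpy()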
I have a dataframe like this:
year  count_yes  count_no
1900          5         7
1903          5         3
1915         14         6
1919          6        14
I want to split the years into two bins, independently of the count values themselves.
How can I group those categories and sum its values?
Expected result:
year  count_yes  count_no
1900         10        10
1910         20        20
Logic: group the first two rows (1900 and 1903) and the last two rows (1915 and 1919), and sum the values of each category.
I want to create a stacked percentage column chart, so 1900 would be 50/50% and 1910 would also be 50/50%.
I've already written the function that builds the chart; I just need to bin the dataframe so the distribution and visualization are better.
This is one way to do what you need, if you are OK with using the decades as the index:
df['year'] = (df.year//10)*10
df_group = df.groupby('year').sum()
Output:

>>> df_group
      count_yes  count_no
year
1900         10        10
1910         20        20
You can bin the years with pandas.cut and aggregate with groupby+sum:
bins = list(range(1900, df['year'].max()+10, 10))
group = pd.cut(df['year'], bins=bins, labels=bins[:-1], right=False)
df.drop('year', axis=1).groupby(group).sum().reset_index()
If you only want to specify the number of bins, compute group with:
group = pd.cut(df['year'], bins=2, right=False)
Output:

   year  count_yes  count_no
0  1900         10        10
1  1910         20        20
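For the stacked-percentage chart mentioned in the question, one option (a sketch, starting from the decade-grouped df_group built in the first answer) is to normalise each row to 100% before plotting:

import matplotlib.pyplot as plt

# Each decade's counts as a share of that decade's total
pct = df_group.div(df_group.sum(axis=1), axis=0) * 100
pct.plot.bar(stacked=True)
plt.ylabel('% of events')
plt.show()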
I have a dataframe/series containing hourly sampled data over a couple of years. I'd like to sum the values for each month, then calculate the mean of those monthly totals over all the years.
I can get a multi-index dataframe/series of the totals using:
df.groupby([df.index.year, df.index.month]).sum()
Date & Time Date & Time
2016 3 220.246292
4 736.204574
5 683.240291
6 566.693919
7 948.116766
8 761.214823
9 735.168033
10 771.210572
11 542.314915
12 434.467037
2017 1 728.983901
2 639.787918
3 709.944521
4 704.610437
5 685.729297
6 760.175060
7 856.928659
But I don't know how to then combine the data to get the means.
I might be totally off on the wrong track too. Also not sure I've labelled the question very well.
I think you need the mean per year, i.e. over the first index level:
df.groupby([df.index.year, df.index.month]).sum().groupby(level=0).mean()
You can use groupby twice, once to get the monthly sum, once to get the mean of monthly sum:
(df.groupby(pd.Grouper(freq='M')).sum()
.groupby(pd.Grouper(freq='Y')).mean()
)
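A self-contained sketch of that second approach on synthetic hourly data (the column name value and the date range are made up for illustration; newer pandas versions prefer the 'ME'/'YE' frequency aliases):

import numpy as np
import pandas as pd

# Two years of synthetic hourly data
idx = pd.date_range('2016-01-01', '2017-12-31 23:00', freq='h')
df = pd.DataFrame({'value': np.random.default_rng(0).random(len(idx))}, index=idx)

# Monthly totals, then the mean of those monthly totals within each year
monthly = df.groupby(pd.Grouper(freq='M')).sum()
yearly_mean = monthly.groupby(pd.Grouper(freq='Y')).mean()
print(yearly_mean)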
I have a dataframe that contains the date of a snowstorm and also a ranking of said snowstorm ranging from 1950-2019. I want to create a stacked histogram where the x-axis is decade and the y-axis is counts of snowstorm by category.
An example of what I am trying to create is listed below.
I'm having trouble working out how to aggregate the data so that I can plot something like the example I shared.
For example, here is a snippet of the 1950s dataframe:
Start End Category Year count
1959-03-12 1959-03-14 2 1950 13
1958-03-18 1958-03-23 3 1950 6
1958-02-12 1958-02-18 3 1950 6
1957-12-03 1957-12-05 1 1950 32
1956-03-18 1956-03-20 1 1950 32
I have all of the counts for each category, but how do I turn that into data that is plottable on stacked histogram?
Aggregate your data first, then plot with the argument stacked=True.
pivot_table
df.pivot_table('count', 'Year', 'Category', 'sum').plot.bar(stacked=True)
groupby
df.groupby(['Year', 'Category'])['count'].sum().unstack().plot.bar(stacked=True)
Keep in mind that you can change the aggregation to something else.
df.pivot_table('count', 'Year', 'Category', 'first').plot.bar(stacked=True)
df.groupby(['Year', 'Category'])['count'].first().unstack().plot.bar(stacked=True)
Also, you can drop duplicates beforehand.
(
df.drop_duplicates(['Year', 'Category'])
.pivot_table('count', 'Year', 'Category')
.plot.bar(stacked=True)
)
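If the x-axis should be the decade computed from the storm dates rather than the precomputed Year and count columns, a small sketch (assuming Start parses as a date) is to derive the decade and count rows directly:

import pandas as pd

# Derive the decade from each storm's start date, then count storms per decade and category
df['Start'] = pd.to_datetime(df['Start'])
df['Decade'] = df['Start'].dt.year // 10 * 10

(df.groupby(['Decade', 'Category'])
   .size()
   .unstack('Category', fill_value=0)
   .plot.bar(stacked=True))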
I have a dataset with a MultiIndex and I would like to plot it based on one index level and one of the columns.
I tried selecting the '% Smokers' data based on the index. The two index levels are Age Group and Year.
I want the graph to have 4 lines, for each age group, with Year as the x-axis.
The tail of my dataset looks like:
% Smokers Cigs per Day Smoker Count Total Count
Age Group Year
4.0 2003 9.221673 14.947439 86486.103843 9.378570e+05
1.0 2002 23.668647 7.832528 185319.850343 7.829761e+05
2.0 2002 24.130250 10.379573 616136.073633 2.553376e+06
3.0 2002 23.300126 13.569244 389576.705723 1.671994e+06
4.0 2002 9.892616 12.739635 89247.050214 9.021583e+05
I tried the following code:
fig, ax = plt.subplots(1,2, figsize = (20,10))
ax[0].plot(part1_df["% Smokers"].loc[1.0])
ax[0].plot(part1_df["% Smokers"].loc[2.0])
ax[0].plot(part1_df["% Smokers"].loc[3.0])
ax[0].plot(part1_df["% Smokers"].loc[4.0])
I'm getting a KeyError: '% Smokers'
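A quick thing worth checking, assuming part1_df is the frame shown above: print the exact column labels, since a stray space around '% Smokers' would raise exactly this KeyError. A sketch of that check, followed by one way to draw the four lines:

import matplotlib.pyplot as plt

# Inspect the exact column labels; stray whitespace would explain the KeyError
print(part1_df.columns.tolist())
part1_df.columns = part1_df.columns.str.strip()

# One line per age group, with Year on the x-axis
fig, ax = plt.subplots(1, 2, figsize=(20, 10))
for age_group in [1.0, 2.0, 3.0, 4.0]:
    series = part1_df.loc[age_group, '% Smokers'].sort_index()  # the slice is indexed by Year
    ax[0].plot(series.index, series.values, label=f'Age group {age_group}')
ax[0].legend()
plt.show()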