I have a dataframe like this:
year  count_yes  count_no
1900          5         7
1903          5         3
1915         14         6
1919          6        14
I want to group the rows into two bins, regardless of the values themselves.
How can I group those rows and sum the values in each column?
Expected result:
year  count_yes  count_no
1900         10        10
1910         20        20
Logic: group the first two rows (1900 and 1903) and the last two rows (1915 and 1919), then sum the values of each category.
I want to create a stacked percentage column chart, so 1900 would be 50/50% and 1910 would also be 50/50%.
I've already written the function that builds this chart; I just need to bin the dataframe to get a better distribution and visualization.
This is a way to do what you need, if you are ok using the decades as index:
df['year'] = (df.year//10)*10
df_group = df.groupby('year').sum()
Output:
df_group
      count_yes  count_no
year
1900         10        10
1910         20        20
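For the stacked percentage chart mentioned in the question, the decade sums can also be turned into row-wise shares. Here is a minimal sketch, assuming the sample data from the question; the names `decade_sums` and `shares` are made up for illustration:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "year": [1900, 1903, 1915, 1919],
    "count_yes": [5, 5, 14, 6],
    "count_no": [7, 3, 6, 14],
})

# Floor each year to the start of its decade, then sum per decade
decade = (df["year"] // 10) * 10
decade_sums = df.drop(columns="year").groupby(decade).sum()

# Divide each row by its own total to get percentages per decade
shares = decade_sums.div(decade_sums.sum(axis=1), axis=0) * 100
print(shares)
```

Both decades come out as 50/50 here, which matches the expectation stated in the question.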
You can bin the years with pandas.cut and aggregate with groupby+sum:
bins = list(range(1900, df['year'].max()+10, 10))
group = pd.cut(df['year'], bins=bins, labels=bins[:-1], right=False)
df.drop('year', axis=1).groupby(group).sum().reset_index()
If you only want to specify the number of bins, compute group with:
group = pd.cut(df['year'], bins=2, right=False)
Output (using the explicit decade bins; with bins=2 the year column holds intervals instead):
year count_yes count_no
0 1900 10 10
1 1910 20 20
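As a self-contained check of the bins=2 variant, here is a small sketch built on the question's sample data. The variable name `binned` is hypothetical, and `observed=False` is passed explicitly only to keep every bin in the result:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "year": [1900, 1903, 1915, 1919],
    "count_yes": [5, 5, 14, 6],
    "count_no": [7, 3, 6, 14],
})

# Two equal-width bins over the range of years;
# right=False makes each bin closed on the left, open on the right
group = pd.cut(df["year"], bins=2, right=False)

# Sum the count columns per bin
binned = df.drop(columns="year").groupby(group, observed=False).sum()
print(binned)
```

The index of `binned` holds interval labels rather than 1900 and 1910; pass `labels=` to `pd.cut` if you prefer custom names.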
Related
I have two dataframes. df1 shows annual rainfall over a certain area:
df1:
longitude latitude year
-13.0 8.0 1979 15.449341
1980 21.970507
1981 18.114307
1982 16.881737
1983 24.122467
1984 27.108953
1985 27.401234
1986 18.238272
1987 25.421076
1988 11.796293
1989 17.778618
1990 18.095036
1991 20.414757
and df2 shows the upper limits of each bin:
bin limits
0 16.655970
1 18.204842
2 19.526524
3 20.852657
4 22.336731
5 24.211905
6 27.143820
I'm trying to add a new column to df2 that shows the frequency of rainfall events from df1 in their corresponding bin. For example, in bin 1 I'd be looking for the values in df1 that fall between 16.65 and 18.2.
I've tried the following:
rain = df1['tp1']
for i in range(7):
    limit = df2.iloc[i]
    out4['count'] = rain[rain > limit].count()
However, I get the following message:
ValueError: Can only compare identically-labeled Series objects
Which I think is referring to the fact that I'm comparing two df's that are different sizes? I'm also unsure if that loop is correct or not.
Any help is much appreciated, thanks!
Use pd.cut to assign your rainfall into bins:
import numpy as np
import pandas as pd

# Define the limits for your bins
# Bin 0: (-np.inf , 16.655970]
# Bin 1: (16.655970, 18.204842]
# Bin 2: (18.204842, 19.526524]
# ...
# note that your bins only go up to 27.14 while max rainfall is 27.4 (row 6).
# You may need to add / adjust your limits.
limits = [-np.inf] + df2["limits"].to_list()
# Assign the rainfall to each bin
bins = pd.cut(df1["rainfall"], limits, labels=df2["bin"])
# Count how many values fall into each bin
bins.value_counts(sort=False).rename_axis("bin")
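To see the approach end to end, here is a runnable sketch using the numbers posted in the question. The column names "rainfall", "bin" and "limits" and the new "count" column are assumptions, since the question does not show the real column names:

```python
import numpy as np
import pandas as pd

# Rainfall values copied from df1 in the question (column name assumed)
rain = pd.Series([15.449341, 21.970507, 18.114307, 16.881737, 24.122467,
                  27.108953, 27.401234, 18.238272, 25.421076, 11.796293,
                  17.778618, 18.095036, 20.414757], name="rainfall")

# Upper bin limits copied from df2 in the question (column names assumed)
df2 = pd.DataFrame({
    "bin": range(7),
    "limits": [16.655970, 18.204842, 19.526524, 20.852657,
               22.336731, 24.211905, 27.143820],
})

# Prepend -inf so the first bin catches everything up to the first limit
edges = [-np.inf] + df2["limits"].to_list()
binned = pd.cut(rain, edges, labels=df2["bin"].to_list())

# Count the values per bin, in bin order, and attach them to df2
counts = binned.value_counts(sort=False).sort_index()
df2["count"] = counts.to_numpy()
print(df2)
```

The value 27.401234 lies above the last limit, so it is assigned to no bin and is left out of the counts.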
I have a dataframe where I need to group ages into ranges and then compute the average Tip amount for each group.
My data looks like this:
Tip amount  Age
         3   30
        30   35
         4   60
         1   12
         7   25
         3   45
        15   31
         5    8
I have tried to use pd.cut() with bins to create the grouping, but I can't seem to get the Tip amount average (maybe using mean()) to be in the DataFrame as well.
import pandas as pd
bins= [0,15,30,45,60,85]
labels = ['0-14','15-29','30-44','45-59','60+']
df['Tip amount']=df['Tip amount'].astype(int)
#df = df.groupby('Age')[['Tip amount']].mean()
df = df.groupby(pd.cut(df['Age'], bins=bins, labels=labels, right=False)).size()
This gives the following output:
Age
0-14     2
15-29    1
30-44    3
45-59    1
60+      1
But I would like to have the average Tip amount for the groups as well.
Age         Tip amount
0-14     2  avg
15-29    1  avg
30-44    3  avg
45-59    1  avg
60+      1  avg
Try:
df.groupby(pd.cut(df['Age'], bins=bins, labels=labels, right=False)).agg({'Age': ['size'], 'Tip amount': ['mean']})
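A variant with named aggregation gives flat column names instead of a two-level header. This is a sketch on the question's sample data; the output column names `count` and `avg_tip` are made up for illustration:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Tip amount": [3, 30, 4, 1, 7, 3, 15, 5],
    "Age": [30, 35, 60, 12, 25, 45, 31, 8],
})

bins = [0, 15, 30, 45, 60, 85]
labels = ["0-14", "15-29", "30-44", "45-59", "60+"]

# right=False makes each bin include its left edge, e.g. [15, 30)
age_group = pd.cut(df["Age"], bins=bins, labels=labels, right=False)

# One row per age group: number of rows and mean tip
summary = df.groupby(age_group, observed=False).agg(
    count=("Age", "size"),
    avg_tip=("Tip amount", "mean"),
)
print(summary)
```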
I am counting the number of negative and positive numbers within each year. Ultimately I want the percentage of negative and positive values for each year.
I tried grouping by year and counting the categories, but the new column appears with no name.
df1= df.groupby(['Year','Count of Negative/Positive Margins'])['Count of Negative/Positive Margins'].count()
df1.head()
Out[194]:
Year Count of Negative/Positive Margins
2005 1 4001
2 1373
2006 1 4046
2 1304
2007 1 4156
Name: Count of Negative/Positive Margins, dtype: int64
This is my expected output:
2005 1 74%
2 26%
.
.
.
Use SeriesGroupBy.value_counts, grouping only by the Year column and passing normalize=True, then multiply by 100, round with Series.round, convert to strings and add %:
df = (df.groupby('Year')['Count of Negative/Positive Margins']
        .value_counts(normalize=True)
        .mul(100)
        .round()
        .astype(int)   # drop the trailing ".0" so the result reads "74%", not "74.0%"
        .astype(str)
        .add('%')
        .reset_index(name='percentage')
     )
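If a wide table with one column per category is more convenient, pd.crosstab with normalize="index" is an alternative to the value_counts chain. The sketch below rebuilds the 2005 and 2006 counts shown in the question; the names `shares` and `pct` are hypothetical:

```python
import pandas as pd

col = "Count of Negative/Positive Margins"

# Rebuild rows matching the counts in the question:
# 2005 -> 4001 ones and 1373 twos, 2006 -> 4046 ones and 1304 twos
df = pd.DataFrame({
    "Year": [2005] * 5374 + [2006] * 5350,
    col: [1] * 4001 + [2] * 1373 + [1] * 4046 + [2] * 1304,
})

# Row-wise shares: each year sums to 100
shares = pd.crosstab(df["Year"], df[col], normalize="index") * 100

# Format as whole-number percentage strings
pct = shares.round().astype(int).astype(str) + "%"
print(pct)
```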
Hi, I have a dataframe that looks like this:
ID Date Total_Amount priority
1 2007 4488 High
2 2007 40981 Low
3 2017 450 Medium
4 2008 1000 Low
Each row is a new person (ID), and the rows show how much they spent per year (Total_Amount).
I want to create a bar chart with the years on the x-axis and Total_Amount as the bar height, stacked by priority. For example, if 10 people spent money in 2007 and their Total_Amount sums to £100,000, the bar height needs to be 100,000, stacked by priority (e.g. 5 may have been High, 4 Low and 1 Medium).
I tried using crosstab with Date as rows and priority as columns, but instead of a dataframe of Total_Amount spent I get the number of people in each priority.
You can use groupby() and then unstack():
df2 = df.groupby(['Date','priority'])['Total_Amount'].sum().unstack('priority').fillna(0)
df2.plot(kind='bar', stacked=True)
This produces a stacked bar chart with one bar per year, split by priority.
Almost the same, but still using crosstab:
pd.crosstab(index=df.Date, columns=df.priority, values=df.Total_Amount, aggfunc='sum')\
    .fillna(0).plot(kind='bar', stacked=True)
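The table behind the chart can be inspected before plotting. Here is a sketch with the four sample rows from the question, using pivot_table as a third equivalent way to build it; the name `table` is made up for illustration:

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Date": [2007, 2007, 2017, 2008],
    "Total_Amount": [4488, 40981, 450, 1000],
    "priority": ["High", "Low", "Medium", "Low"],
})

# Rows are years, columns are priorities, cells are summed amounts;
# year/priority combinations with no data are filled with 0
table = df.pivot_table(index="Date", columns="priority",
                       values="Total_Amount", aggfunc="sum", fill_value=0)
print(table)
```

Calling `table.plot(kind="bar", stacked=True)` on this result should then draw one stacked bar per year, assuming matplotlib is installed.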
I have time series data that contains date, year, month and ratings columns. I have grouped by year and month, and then I count the number of ratings in each month of each year. I have done this the following way:
nightlife_ratings_mean = review_data_nightlife.groupby(['year','month'])['stars'].count()
I get the following data frame
year month
2005 8 3
9 4
10 16
11 13
12 7
2006 1 62
2 24
3 13
4 20
5 11
6 13
7 11
8 29
9 33
10 46
I want to plot this so that the x label is the year and the y label is the count, and I want a line plot with 'o' markers.
How can I do this? I am trying this for the first time, so please help.
You can call plot on your DataFrame and include the option style = 'o-':
# plot() returns a matplotlib Axes; naming it ax avoids shadowing matplotlib.pyplot's usual alias plt
ax = nightlife_ratings_mean.plot(x = 'year', y = 'stars', style = 'o-', title = "Stars for each month and year")
ax.set_xlabel("[Year, Month]")
ax.set_ylabel("Stars")
This will plot the counts as a line with a circular marker at each (year, month) point.
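If the goal is an x-axis that reads as years, one option is to flatten the (year, month) index into dates before plotting, or to aggregate to one value per year. This is only a sketch, rebuilt from the counts shown in the question; the names `monthly`, `flat` and `per_year` are hypothetical:

```python
import pandas as pd

# Counts copied from the question: Aug-Dec 2005, then Jan-Oct 2006
counts = [3, 4, 16, 13, 7, 62, 24, 13, 20, 11, 13, 11, 29, 33, 46]
index = pd.MultiIndex.from_tuples(
    [(2005, m) for m in range(8, 13)] + [(2006, m) for m in range(1, 11)],
    names=["year", "month"],
)
monthly = pd.Series(counts, index=index, name="stars")

# Option 1: replace the (year, month) pairs with first-of-month dates
flat = monthly.copy()
flat.index = pd.to_datetime([f"{y}-{m:02d}-01" for y, m in monthly.index])

# Option 2: one total per year
per_year = monthly.groupby(level="year").sum()
print(per_year)
```

Either result can be passed to `.plot(style="o-")` in the same way as the original Series.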