I have two dataframes. df1 shows annual rainfall over a certain area:
df1:
longitude  latitude  year
-13.0      8.0       1979    15.449341
                     1980    21.970507
                     1981    18.114307
                     1982    16.881737
                     1983    24.122467
                     1984    27.108953
                     1985    27.401234
                     1986    18.238272
                     1987    25.421076
                     1988    11.796293
                     1989    17.778618
                     1990    18.095036
                     1991    20.414757
and df2 shows the upper limits of each bin:
bin  limits
0    16.655970
1    18.204842
2    19.526524
3    20.852657
4    22.336731
5    24.211905
6    27.143820
I'm trying to add a new column to df2 that shows the frequency of rainfall events from df1 in their corresponding bin. For example, in bin 1 I'd be looking for the values in df1 that fall between 16.65 and 18.2.
I've tried the following:
rain = df1['tp1']
for i in range(7):
    limit = df2.iloc[i]
    out4['count'] = rain[rain > limit].count()
However, I get the following message:
ValueError: Can only compare identically-labeled Series objects
Which I think is referring to the fact that I'm comparing two df's that are different sizes? I'm also unsure if that loop is correct or not.
Any help is much appreciated, thanks!
Use pd.cut to assign your rainfall into bins:
# Define the limits for your bins
# Bin 0: (-np.inf , 16.655970]
# Bin 1: (16.655970, 18.204842]
# Bin 2: (18.204842, 19.526524]
# ...
# note that your bins only go up to 27.14 while max rainfall is 27.4 (row 6).
# You may need to add / adjust your limits.
import numpy as np

limits = [-np.inf] + df2["limits"].to_list()
# Assign each rainfall value to a bin (your rainfall column was called 'tp1')
bins = pd.cut(df1["tp1"], limits, labels=df2["bin"])
# Count how many values fall into each bin
bins.value_counts(sort=False).rename_axis("bin")
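Putting the whole answer together as a runnable sketch (assuming the rainfall column is named tp1, as in your loop, and using the values from your post):

```python
import numpy as np
import pandas as pd

# Rainfall values from df1 (column name 'tp1' assumed, as in the question's loop)
df1 = pd.DataFrame({"tp1": [15.449341, 21.970507, 18.114307, 16.881737,
                            24.122467, 27.108953, 27.401234, 18.238272,
                            25.421076, 11.796293, 17.778618, 18.095036,
                            20.414757]})

# Upper limits of each bin, from df2
df2 = pd.DataFrame({"bin": range(7),
                    "limits": [16.655970, 18.204842, 19.526524, 20.852657,
                               22.336731, 24.211905, 27.143820]})

# Prepend -inf so values below the first limit land in bin 0
limits = [-np.inf] + df2["limits"].to_list()
binned = pd.cut(df1["tp1"], limits, labels=df2["bin"])

# Attach the per-bin counts as a new column on df2
df2["count"] = binned.value_counts(sort=False).to_numpy()
print(df2)
```

Note that 27.401234 exceeds the last limit, so it ends up uncounted (NaN bin), as flagged in the comments above.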
I have a dataframe like this:

   year  count_yes  count_no
0  1900          5         7
1  1903          5         3
2  1915         14         6
3  1919          6        14
I want to have two bins, independently of the value itself.
How can I group those categories and sum its values?
Expected result:
   year  count_yes  count_no
0  1900         10        10
1  1910         20        20
Logic: group the first two rows (1900 and 1903) and the last two rows (1915 and 1919), summing the values of each category.
I want to create a stacked percentage column graphic, so 1900 would be 50/50% and 1910 would be also 50/50%.
I've already created the function to build this graphic, I just need to adjust the dataframe size into bins to create a better distribution and visualization
This is a way to do what you need, if you are ok using the decades as index:
df['year'] = (df.year//10)*10
df_group = df.groupby('year').sum()
Output>>>
df_group
count_yes count_no
year
1900 10 10
1910 20 20
You can bin the years with pandas.cut and aggregate with groupby+sum:
bins = list(range(1900, df['year'].max()+10, 10))
group = pd.cut(df['year'], bins=bins, labels=bins[:-1], right=False)
df.drop('year', axis=1).groupby(group).sum().reset_index()
If you only want to specify the number of bins, compute group with:
group = pd.cut(df['year'], bins=2, right=False)
output:
year count_yes count_no
0 1900 10 10
1 1910 20 20
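For completeness, a self-contained version of this approach with the sample data (observed=False just silences the categorical-groupby warning in recent pandas):

```python
import pandas as pd

df = pd.DataFrame({"year": [1900, 1903, 1915, 1919],
                   "count_yes": [5, 5, 14, 6],
                   "count_no": [7, 3, 6, 14]})

# Decade edges [1900, 1910, 1920); label each bin by its left edge
bins = list(range(1900, df["year"].max() + 10, 10))
group = pd.cut(df["year"], bins=bins, labels=bins[:-1], right=False)

out = (df.drop("year", axis=1)
         .groupby(group, observed=False)
         .sum()
         .reset_index())
print(out)
```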
Dataframe contains essentially three things.
Date, Count, and Company.
I want to create a program that makes bar charts with Count on the y-axis and Company on the x-axis, but there should be multiple charts for different months. For example, there should be a May chart containing all the companies' counts from that month only.
I've tried using groupby to organise them by company and using .sum() to count up over the whole dataframe per company, but I'm not able to do that specific to a month as well.
I can group them by company but I want to create individual graphs per company and also by month
Metric Count Date
Apple 97 16/01/2019
Samsung 84 06/01/2019
Linux 100 03/02/2019
Microsoft 61 29/01/2019
Blackberry 17 24/02/2019
LG 98 23/02/2019
Panasonic 20 22/02/2019
Apple 100 19/03/2019
Samsung 43 02/01/2019
Linux 21 06/01/2019
Microsoft 72 05/03/2019
Blackberry 75 24/03/2019
LG 82 19/03/2019
Panasonic 42 25/02/2019
Apple 50 12/01/2019
Samsung 74 15/02/2019
Linux 41 09/03/2019
Microsoft 97 12/03/2019
Blackberry 15 28/03/2019
df = pd.read_csv('values.csv', delimiter = ',')
df.head(1)
df = df.query('Metric == "Company"')
df = df.groupby('Company').sum().Count
print(df)
df = df.plot(kind='bar', align='center', title ="entity",figsize=(15,10),legend=True, fontsize=5)
df.set_ylabel("Count",fontsize=12)
df.set_xlabel("Company",fontsize=12)
Try this
data['Date'] = pd.to_datetime(data['Date'], format='%d/%m/%Y')
data['Month'] = data['Date'].dt.strftime('%b')
# Select Count so only the numeric column is summed
df = data.groupby(['Month', 'Metric'])['Count'].sum()
df.plot(kind='bar')
This gives a single chart with one bar per (month, metric) pair.
One plot for each month could be plotted with the code below
data['Date'] = pd.to_datetime(data['Date'], format='%d/%m/%Y')
data['Month'] = data['Date'].dt.strftime('%b')
df = data.groupby(['Month', 'Metric'])['Count'].sum()

months = df.index.levels[0]
for month in months:
    # use a new name so the original frame isn't shadowed
    month_data = df.loc[month]
    month_data.plot(kind='bar', align='center', title=str(month), legend=True)
IIUC:

# Date must be parsed to datetime first for pd.Grouper to work
new_df = (df.groupby([pd.Grouper(key='Date', freq='M'), 'Metric'])
            .Count.sum()
         )
(new_df.reset_index()
       .groupby('Date')
       .plot.bar(x='Metric', y='Count', subplots=True)
)
You could add a 'Month' column and group by month and metric:
import datetime

# New month column (this assumes Date has already been parsed with pd.to_datetime)
month_key = lambda x: datetime.date(x.year, x.month, 1)
df['Month'] = df['Date'].apply(month_key)

# Group by month and metric, summing only the Count column
df = df.groupby(['Month', 'Metric'])['Count'].sum()
# One plot for each month
months = df.index.levels[0]
for month in months:
    data = df.loc[month]
    data.plot(kind='bar', align='center', title=str(month), legend=True)
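All of the answers above hinge on the same month/metric grouping; here is a self-contained sketch with the sample rows from the question (plotting omitted, since the grouped Series is what feeds each chart):

```python
import pandas as pd

df = pd.DataFrame({
    "Metric": ["Apple", "Samsung", "Linux", "Microsoft", "Blackberry", "LG",
               "Panasonic", "Apple", "Samsung", "Linux", "Microsoft",
               "Blackberry", "LG", "Panasonic", "Apple", "Samsung", "Linux",
               "Microsoft", "Blackberry"],
    "Count": [97, 84, 100, 61, 17, 98, 20, 100, 43, 21, 72, 75, 82, 42,
              50, 74, 41, 97, 15],
    "Date": ["16/01/2019", "06/01/2019", "03/02/2019", "29/01/2019",
             "24/02/2019", "23/02/2019", "22/02/2019", "19/03/2019",
             "02/01/2019", "06/01/2019", "05/03/2019", "24/03/2019",
             "19/03/2019", "25/02/2019", "12/01/2019", "15/02/2019",
             "09/03/2019", "12/03/2019", "28/03/2019"],
})

df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Month"] = df["Date"].dt.strftime("%b")

# Sum Count per (month, metric); each df.loc[month] slice feeds one bar chart
monthly = df.groupby(["Month", "Metric"])["Count"].sum()
print(monthly)
```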
Hi, I have a dataframe that looks like this:
ID Date Total_Amount priority
1 2007 4488 High
2 2007 40981 Low
3 2017 450 Medium
4 2008 1000 Low
each row is a new person (ID) and the rows show how much they spent per year (total amount).
I want to create a bar chart with the years on the x-axis and the summed Total_Amount as the bar height, stacked by priority. E.g. if 10 people spent money in 2007 and their Total_Amount sum is £100,000, the height of the 2007 bar should be 100,000, split by priority (e.g. 5 may have been High, 4 Low and 1 Medium).
I tried using crosstab with date as row and priority as columns but I don't get a dataframe for Total_Amount spent, I get one for the number of people in each priority.
You can use groupby() and then unstack():
df2 = df.groupby(['Date','priority'])['Total_Amount'].sum().unstack('priority').fillna(0)
df2.plot(kind='bar', stacked=True)
Produces a stacked bar chart with one bar per year, split by priority.
Almost the same, still using crosstab (note that stacked=True is needed here too):

pd.crosstab(index=df.Date, columns=df.priority, values=df.Total_Amount, aggfunc='sum')\
  .fillna(0).plot(kind='bar', stacked=True)
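Both answers produce the same table of summed amounts; a quick self-contained check with the question's four sample rows:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "Date": [2007, 2007, 2017, 2008],
                   "Total_Amount": [4488, 40981, 450, 1000],
                   "priority": ["High", "Low", "Medium", "Low"]})

# groupby + unstack: one column per priority, one row per year
df2 = (df.groupby(["Date", "priority"])["Total_Amount"]
         .sum()
         .unstack("priority")
         .fillna(0))

# crosstab with a sum aggregation gives the same numbers
ct = pd.crosstab(index=df.Date, columns=df.priority,
                 values=df.Total_Amount, aggfunc="sum").fillna(0)

print(df2)
```

Either frame can then be plotted with .plot(kind='bar', stacked=True).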
I have two columns, categorical and year, that I am trying to plot. I am trying to take the sum total of each categorical per year to create a multi-class time series plot.
ax = data[data.categorical=="cat1"]["categorical"].plot(label='cat1')
data[data.categorical=="cat2"]["categorical"].plot(ax=ax, label='cat3')
data[data.categorical=="cat3"]["categorical"].plot(ax=ax, label='cat3')
plt.xlabel("Year")
plt.ylabel("Number per category")
sns.despine()
But I am getting an error stating there is no numeric data to plot. I am looking for something similar to the above, perhaps with data[data.categorical=="cat3"]["categorical"].lambda x : (1 for x in data.categorical)
I will use the following lists as examples.
categorical = ["cat1","cat1","cat2","cat3","cat2","cat1","cat3","cat2","cat1","cat3","cat3","cat3","cat2","cat1","cat2","cat3","cat2","cat2","cat3","cat1","cat1","cat1","cat3"]
year = [2013,2014,2013,2015,2014,2014,2013,2014,2014,2015,2015,2013,2014,2014,2013,2014,2015,2015,2015,2013,2014,2015,2013]
My goal is to obtain something similar to the following picture
I'm hesitant to call this a "solution", as it's basically just a summary of basic Pandas functionality, which is explained in the same documentation where you found the time series plot you've placed in your post. But seeing as there's some confusion around groupby and plotting, a demo may help clear things up.
We can use two calls to groupby().
The first groupby() gets a count of category appearances per year, using the count aggregation.
The second groupby() is used to plot the time series for each category.
To start, generate a sample data frame:
import pandas as pd
categorical = ["cat1","cat1","cat2","cat3","cat2","cat1","cat3","cat2",
               "cat1","cat3","cat3","cat3","cat2","cat1","cat2","cat3",
               "cat2","cat2","cat3","cat1","cat1","cat1","cat3"]
year = [2013,2014,2013,2015,2014,2014,2013,2014,2014,2015,2015,2013,
        2014,2014,2013,2014,2015,2015,2015,2013,2014,2015,2013]
df = pd.DataFrame({'categorical': categorical,
                   'year': year})
categorical year
0 cat1 2013
1 cat1 2014
...
21 cat1 2015
22 cat3 2013
Now get counts per category, per year:
# reset_index() gives a column for counting, after groupby uses year and category
ctdf = (df.reset_index()
          .groupby(['year','categorical'], as_index=False)
          .count()
          # rename isn't strictly necessary here, it's just for readability
          .rename(columns={'index':'ct'})
       )
year categorical ct
0 2013 cat1 2
1 2013 cat2 2
2 2013 cat3 3
3 2014 cat1 5
4 2014 cat2 3
5 2014 cat3 1
6 2015 cat1 1
7 2015 cat2 2
8 2015 cat3 4
Finally, plot time series for each category, keyed by color:
from matplotlib import pyplot as plt
fig, ax = plt.subplots()
# key gives the group name (i.e. category), data gives the actual values
for key, data in ctdf.groupby('categorical'):
    data.plot(x='year', y='ct', ax=ax, label=key)
Have you tried groupby?

df.groupby(["year", "categorical"]).size()

With only these two columns, size() counts the rows per group; count() would need a separate value column to count.
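With the example lists from the question, the grouped count looks like this (a minimal sketch using size(), since there is no third column to aggregate):

```python
import pandas as pd

categorical = ["cat1","cat1","cat2","cat3","cat2","cat1","cat3","cat2",
               "cat1","cat3","cat3","cat3","cat2","cat1","cat2","cat3",
               "cat2","cat2","cat3","cat1","cat1","cat1","cat3"]
year = [2013,2014,2013,2015,2014,2014,2013,2014,2014,2015,2015,2013,
        2014,2014,2013,2014,2015,2015,2015,2013,2014,2015,2013]

df = pd.DataFrame({"categorical": categorical, "year": year})

# size() counts the rows in each (year, category) group
counts = df.groupby(["year", "categorical"]).size()
print(counts)
```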