I am trying to create two plots from data similar to the df shown below. Each row (Sr) is one publication, so the number of rows per year gives the total number of publications for that year. For example, there are 4 publications in 2022, 2 in 2021, and 6 in 2020.
I want:
In the first plot: 'total number of publications per year' and 'total citations per year'. The x-axis is the year, the left y-axis is the number of publications each year, and the right y-axis is the total citations per year. Bar graph for publications and line/dot graph for citations.
In the second plot: 'total number of publications per year', 'mean total citations per year', and 'mean total citations per article'. The x-axis is the year, the left y-axis is the number of publications / mean total citations per article, and the right y-axis is the mean total citations per year.
Example plots of what I want for this data:
[Images: "pub vs citation per year" and "pub and citation history"]
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("cancer.csv")
Data
Sr Year Cited by
1 2022 5
2 2022 2
3 2022 7
4 2022 3
5 2021 5
6 2021 25
7 2020 23
8 2020 16
9 2020 1
10 2020 3
11 2020 23
12 2020 3
I was trying with groupby command like following:
figure = df.groupby(['Year'])['Cited by'].mean()
But I am not sure how to continue to generate the graphs like in the example above. Any help will be highly appreciated.
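One possible way to continue from the groupby, as a minimal sketch rather than a definitive solution: aggregate count, sum, and mean per year in one step, then draw the bars on the left axis and the line on a second y-axis created with twinx(). The DataFrame below just retypes the sample data from the question in place of cancer.csv, and the names summary, ax_pub, and ax_cit are my own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data retyped from the question (stands in for cancer.csv).
df = pd.DataFrame({
    "Year": [2022] * 4 + [2021] * 2 + [2020] * 6,
    "Cited by": [5, 2, 7, 3, 5, 25, 23, 16, 1, 3, 23, 3],
})

# One row per year: number of publications, total and mean citations.
summary = df.groupby("Year")["Cited by"].agg(
    publications="count",
    total_citations="sum",
    mean_citations="mean",
)

# Use string labels so the years are treated as categories on the x-axis.
years = [str(y) for y in summary.index]

fig, ax_pub = plt.subplots()
ax_pub.bar(years, summary["publications"], color="tab:blue")
ax_pub.set_xlabel("Year")
ax_pub.set_ylabel("Number of publications")

# Second y-axis sharing the same x-axis, used for the citation line.
ax_cit = ax_pub.twinx()
ax_cit.plot(years, summary["total_citations"], color="tab:red", marker="o")
ax_cit.set_ylabel("Total citations")

fig.tight_layout()
# Call plt.show() to display the figure interactively.
```

The second plot should follow the same pattern, for example by plotting summary["mean_citations"] next to the publication bars on the left axis and the other mean on the twinx() axis.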
Related
I am trying to build charts from flight accident information. The csv file contains the airline, the year of the accident, and a bunch of other things. I want one chart that adds up all the incidents by year, and another chart that adds them up by year and by airline:
First chart desirable outcome:
year incidents
2012 11
2013 12
Second chart desirable outcome:
year incidents Airline
2011 23 United
2011 20 Hawaii
2011 30 United
I tried to use dt.year, but it is not working because the year in the csv is stored as a plain number (2018, 2019), not as a full date such as 2018-10-12, so I cannot treat it as date information.
Try:
import matplotlib.pyplot as plt
# Per year
df.value_counts('year').sort_index().plot()  # sort by year so the line runs in chronological order
# Per year, for each company
df.value_counts(['year', 'Airline']).unstack('Airline').plot(kind='bar')
plt.show()
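To see the intermediate tables behind those plots, here is a small sketch that uses groupby(...).size() instead of value_counts (a swapped-in but equivalent way of counting rows). It assumes one row per incident and that year is already an integer column, so dt.year is not needed. The sample rows are made up purely for illustration.

```python
import pandas as pd

# Made-up rows for illustration: one row per incident.
df = pd.DataFrame({
    "year": [2011, 2011, 2011, 2012, 2012, 2013],
    "Airline": ["United", "United", "Hawaii", "United", "Hawaii", "Hawaii"],
})

# Incidents per year; groupby sorts the year index in ascending order.
per_year = df.groupby("year").size()

# Incidents per year and airline, with one column per airline.
per_year_airline = (
    df.groupby(["year", "Airline"]).size().unstack("Airline", fill_value=0)
)

print(per_year)
print(per_year_airline)
```

Calling per_year.plot() or per_year_airline.plot(kind='bar') on these tables should give the same kind of charts as above.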
I have a dataset with sales per customer, per month. I have both a date field (e.g. June 2018) and a "month counter" which gives each month a progressive number (e.g., if data starts in Jan 2018, Jan 2018 is "1", Dec 2018 is "12", and Jan 2019 is "13").
Please see the image; the first 4 columns are a sample of the data I have.
I'd like, for each month and each customer, to sum the sales of the previous 6 months and of the next 6 months, like in the last 2 columns in the attached image.
For instance: for month 1 and customer "John", I'd like to sum sales for month 2,3,4,5,6,7, only looking at "John", this would be "Next 6 months sales" for John in month 1. Reverse logic for the last 6 months sales.
I tried building a for loop and building some functions, but I didn't quite manage to build anything like what I need.
[Image: data]
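A minimal sketch of one way to do this without an explicit loop, assuming every customer has exactly one row per consecutive month (no gaps in the month counter). The column names customer, month, and sales and the sample values are hypothetical, since the real data is only shown as an image. The idea is to shift by one row to exclude the current month, take a rolling sum of 6, and do the same on the reversed series for the "next" direction.

```python
import pandas as pd

# Hypothetical sample: 8 consecutive months for two customers.
df = pd.DataFrame({
    "customer": ["John"] * 8 + ["Mary"] * 8,
    "month": list(range(1, 9)) * 2,
    "sales": [10, 20, 30, 40, 50, 60, 70, 80, 1, 2, 3, 4, 5, 6, 7, 8],
})
df = df.sort_values(["customer", "month"]).reset_index(drop=True)


def prev_sum(s, window=6):
    # shift(1) drops the current month, then sum up to `window` earlier rows.
    return s.shift(1).rolling(window, min_periods=1).sum()


def next_sum(s, window=6):
    # Same logic on the reversed series, then restore the original order.
    reversed_sum = s.iloc[::-1].shift(1).rolling(window, min_periods=1).sum()
    return reversed_sum.iloc[::-1]


# transform applies each helper within one customer at a time,
# so sales never leak from one customer to another.
sales_by_customer = df.groupby("customer")["sales"]
df["prev_6m_sales"] = sales_by_customer.transform(prev_sum)
df["next_6m_sales"] = sales_by_customer.transform(next_sum)

print(df)
```

With min_periods=1 the edges use whatever months are available (and are NaN when there are none); set min_periods=6 if only full 6-month windows should count.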
I have a housing market dataset categorized by U.S. counties, with columns such as total_homes_sold. I'm trying to show a comparison between housing sales YoY (e.g. Jan 2020 vs. Jan 2019) and by county (e.g. Aberdeen Mar 2020 vs. Suffolk Mar 2020). However, I am not sure how to group the dates, as they are not organized by month (Jan, Feb, Mar, etc.) but rather by 4-week intervals: period_begin and period_end.
Intervals between years vary. The period_begin for Aberdeen (around Jan) for 2019 might be 1/7 to 2/3 but 1/6 to 2/2 for 2020 (image shown below).
I tried using count (code below) to label each 4-week period as a number (shown below) thinking I could compare Aberdeen 2017-1 to Aberdeen 2020-1 (1 coded as the first time interval) but realized that some years for some regions have more 4 week periods in a year than others (2017 has 13 whereas 2018 has 14).
df['count'] = df.groupby((everyfourth['region_name'] != df['region_name'].shift(1)).cumsum()).cumcount() + 1
Any ideas on what code I could use to closely categorize these two columns into month-like periods?
Snippet of Dataset here
Let me know if you have any questions. Not sure I made sense! Thanks.
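One possible approach, sketched under assumptions: label each 4-week period with the calendar month that contains its midpoint, then group by region, year, and month. The dates below reuse the Aberdeen intervals mentioned above, but the Suffolk row and all total_homes_sold values are made up, and the midpoint rule itself is just one reasonable convention, not the only one.

```python
import pandas as pd

# Illustrative rows; sales figures and the Suffolk period are made up.
df = pd.DataFrame({
    "region_name": ["Aberdeen", "Aberdeen", "Suffolk"],
    "period_begin": ["1/7/2019", "1/6/2020", "3/2/2020"],
    "period_end": ["2/3/2019", "2/2/2020", "3/29/2020"],
    "total_homes_sold": [100, 120, 300],
})
df["period_begin"] = pd.to_datetime(df["period_begin"], format="%m/%d/%Y")
df["period_end"] = pd.to_datetime(df["period_end"], format="%m/%d/%Y")

# Assign each period to the month that contains its midpoint.
midpoint = df["period_begin"] + (df["period_end"] - df["period_begin"]) / 2
df["year"] = midpoint.dt.year
df["month"] = midpoint.dt.month

# Sum per region/year/month, since a month can receive two periods.
monthly = df.groupby(["region_name", "year", "month"])["total_homes_sold"].sum()

# One column per year makes the year-over-year comparison direct.
yoy = monthly.unstack("year")
print(yoy)
```

Because a year has 13 or 14 four-week periods, some months will get two periods and the sum will be larger for those; averaging instead of summing may be a fairer comparison depending on what you want to show.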
I have this data frame, and I want to draw 3 plots, one per year, with Unspsc Desc on the x-axis and Total_Price on the y-axis. For example, plot one will be specific to the year 2018 and only contain the Unspsc Desc and Total_Price values for 2018.
Material Total_Price Year_Purchase
Gasket 50,000 2018
Washer 6,000 2019
Bolts 7,000 2019
Nut 3,000 2020
Gasket 25,000 2019
Gasket 2500 2020
Washer 33500 2018
Nuts 7000 2019
The code I was using
dw.groupby(['Unspsc Desc', 'Total_Price']).Year_Purchase.sort_values().plot.bar()
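A minimal sketch of one way to get one bar plot per year, assuming the column shown as Material in the table is the one called Unspsc Desc in the code, and that Total_Price was read as text because some values contain thousands separators. The loop filters by year, sums the price per item, and draws each year on its own subplot. The dict per_year is my own name, kept only so the totals can be inspected.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data retyped from the question; "Unspsc Desc" is assumed to be
# the column displayed as "Material".
dw = pd.DataFrame({
    "Unspsc Desc": ["Gasket", "Washer", "Bolts", "Nut",
                    "Gasket", "Gasket", "Washer", "Nuts"],
    "Total_Price": ["50,000", "6,000", "7,000", "3,000",
                    "25,000", "2500", "33500", "7000"],
    "Year_Purchase": [2018, 2019, 2019, 2020, 2019, 2020, 2018, 2019],
})

# Strip thousands separators so the prices become numbers.
dw["Total_Price"] = (
    dw["Total_Price"].str.replace(",", "", regex=False).astype(int)
)

years = sorted(int(y) for y in dw["Year_Purchase"].unique())
fig, axes = plt.subplots(1, len(years), figsize=(4 * len(years), 4),
                         squeeze=False)

per_year = {}
for ax, year in zip(axes[0], years):
    subset = dw[dw["Year_Purchase"] == year]
    # Total price per item within this year only.
    totals = subset.groupby("Unspsc Desc")["Total_Price"].sum()
    per_year[year] = totals
    ax.bar(list(totals.index), totals.to_numpy())
    ax.set_title(str(year))
    ax.set_xlabel("Unspsc Desc")
    ax.set_ylabel("Total_Price")

fig.tight_layout()
# Call plt.show() to display the figure interactively.
```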
I have a movies dataframe that looks like this...
title decade
movie name 1 2000
movie name 2 1990
movie name 3 1990
movie name 4 2000
movie name 5 2010
movie name 6 1980
movie name 7 1980
I want to plot number of movies per decade which I am doing this way
freq = movies['decade'].value_counts()
#freq returns me following
2000 56
1980 41
1990 37
1970 21
2010 9
# as you can see the value_counts() method returns a series sorted by the frequencies
freq = movies['decade'].value_counts(sort=False)
# now the frequencies are not sorted; I want the distribution to be ordered by decade
# and not by frequency, so I do something like this...
movies = movies.sort_values(by='decade', ascending=True)
freq = movies['decade'].value_counts(sort=False)
Now the Series freq should be sorted by decade, but it is not, even though movies is sorted.
Can someone tell me what I am doing wrong? Thanks.
The expected output I am looking for is something like this...
1970 21
1980 41
1990 37
2000 56
2010 9
movies['decade'].value_counts()
returns a Series with the decade as its index, sorted by count in descending order. To sort by decade instead, append .sort_index():
movies['decade'].value_counts().sort_index()
or, for descending decade order:
movies['decade'].value_counts().sort_index(ascending=False)
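As a quick check, here is a small self-contained sketch. The movies frame is rebuilt from the counts shown in the question (the titles are left out, since only the decade column matters here), so it is a stand-in for the real data.

```python
import pandas as pd

# Rebuild a decade column with the counts quoted in the question.
counts = {2000: 56, 1980: 41, 1990: 37, 1970: 21, 2010: 9}
movies = pd.DataFrame({
    "decade": [decade for decade, n in counts.items() for _ in range(n)]
})

# value_counts() orders by frequency; sort_index() reorders by decade.
freq = movies["decade"].value_counts().sort_index()
print(freq)
```

From there, freq.plot(kind="bar") should draw the bars in decade order.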