I have a huge CSV file of data that looks like this:
STAID, SOUID, DATE, TX, Q_TX
162,100522,19010101, -31, 0
162,100522,19010102, -13, 0
TX is the temperature; the data goes on for a few thousand more lines, to give you an idea.
For every year, I want to plot the number of days with a temperature above 25 degrees.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("klimaat.csv")
zomers = data.index[data["TX"] > 250].tolist()
x_values = []
y_values = []
plt.xlabel("Years")
plt.ylabel("Amount of days with TX > 250")
plt.title("Zomerse Dagen Per Jaar")
plt.plot(x_values, y_values)
# save plot
plt.savefig("zomerse_dagen.png")
The x-axis should be the years, say 1900-2010 or so, and the y-axis should be the number of days in that year with a TX value higher than 250.
How do I go about this? >_< I can't quite work out how to extract the number of days from the data and use it in a plot.
You can create the data points separately to make it a little easier to comprehend. Then use pandas.pivot_table to aggregate. Here is a working example that should get you going.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("klimaat.csv", skipinitialspace=True, parse_dates=["DATE"])  # the header fields are padded with spaces
data.sort_values("DATE", inplace=True)
data["above_250"] = data["TX"] > 250  # True for days above the 250 threshold used in the question
data["year"] = data["DATE"].dt.year  # vectorised, much faster than a row-wise apply
plot_df = pd.pivot_table(data, index="year", values="above_250", aggfunc="sum")
years = plot_df.index
y_pos = np.arange(len(years))
values = plot_df.above_250
plt.bar(y_pos, values, align='center', alpha=0.5)
plt.xticks(y_pos, years)
plt.ylabel("Amount of days with TX > 250")
plt.xlabel("Year")
plt.title("Zomerse Dagen Per Jaar")
plt.show()
You can use the datetime module from the Python standard library to parse the dates; in particular, have a look at the strptime function. You can then use the datetime.year attribute to aggregate your data.
You can also use an OrderedDict to keep track of your aggregation before you assign OrderedDict.keys() and OrderedDict.values() to x_values and y_values respectively.
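As a rough sketch of that approach, assuming the file uses the comma-plus-space layout shown in the question (the last three rows of the sample below are made up for illustration), the aggregation could look like this:

```python
import csv
import io
from collections import OrderedDict
from datetime import datetime

# Stand-in for klimaat.csv; the last three rows are invented sample data.
raw = """STAID, SOUID, DATE, TX, Q_TX
162,100522,19010101, -31, 0
162,100522,19010102, -13, 0
162,100522,19010701, 262, 0
162,100522,19020715, 281, 0
162,100522,19020716, 255, 0
"""

days_per_year = OrderedDict()
# skipinitialspace strips the blank that follows each comma
reader = csv.DictReader(io.StringIO(raw), skipinitialspace=True)
for row in reader:
    year = datetime.strptime(row["DATE"], "%Y%m%d").year
    days_per_year.setdefault(year, 0)  # make sure years with no hot days still appear
    if int(row["TX"]) > 250:
        days_per_year[year] += 1

x_values = list(days_per_year.keys())
y_values = list(days_per_year.values())
print(x_values, y_values)
```

With the real file you would replace io.StringIO(raw) with open("klimaat.csv", newline="") and then pass x_values and y_values to plt.plot as in the question.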
I have time series data like the sample below, where the dates consist of a year and a week number. The data runs from the 1st week of 2014 to the 52nd week of 2015.
Below is the line plot of that data.
As you can see, the x-axis labelling is not what I was trying to achieve: the point after 201453 should be 201501, there should be no straight line between them, and the axis should not run up to 201499. How can I scale the x-axis exactly according to the Due_date column? Below is the code:
rand_products = np.random.choice(Op_2['Sp_number'].unique(), 3)
selected_products = Op_2[Op_2['Sp_number'].isin(rand_products)][['Due_date', 'Sp_number', 'Billing']]
plt.figure(figsize=(20,10))
plt.grid(True)
g = sns.lineplot(data=selected_products, x='Due_date', y='Billing', hue='Sp_number', ci=False, legend='full', palette='Set1');
The issue is that 201401 etc. are read as numbers, which is why the line chart has that gap. To fix it, you will need to convert the numbers to dates and plot those.
As the full data is not available, below is a two-column dataframe with Due_date as an integer in YYYYWW form; the Billing column is a set of random numbers. Use the method shown here to convert the integers to dates and plot them. The gap will be removed.
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
Due_date = list(np.arange(201401,201454)) #Year 2014
Due_date.extend(np.arange(201501,201553)) #Year 2015
Billing = random.sample(range(500, 1000), 105) #billing numbers
df = pd.DataFrame({'Due_date': Due_date, 'Billing': Billing})
df.Due_date = df.Due_date.astype(str)
df.Due_date = pd.to_datetime(df['Due_date']+ '-1',format="%Y%W-%w") #Convert to date
plt.figure(figsize=(20,10))
plt.grid(True)
ax = sns.lineplot(data=df, x='Due_date', y='Billing', ci=False, legend='full', palette='Set1')
Output graph
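One thing worth checking before relying on this conversion: %W counts weeks from the first Monday of the year, which is not the same as ISO week numbering. The sketch below uses only the standard library to compare the two. The helper name yearweek_to_date is made up for illustration, and whether your Due_date values are ISO weeks is an assumption you would need to verify against your own data.

```python
from datetime import datetime

def yearweek_to_date(code, iso=False):
    """Return the Monday of a YYYYWW week code (hypothetical helper)."""
    if iso:
        # ISO 8601: %G = ISO year, %V = ISO week, %u = weekday (1 = Monday)
        return datetime.strptime(f"{code}-1", "%G%V-%u").date()
    # %W = week of the year with Monday as the first day; %w = weekday (1 = Monday)
    return datetime.strptime(f"{code}-1", "%Y%W-%w").date()

# Print both interpretations side by side for a couple of week codes
for code in (201452, 201501):
    print(code, yearweek_to_date(code), yearweek_to_date(code, iso=True))
```

If the two columns disagree for your codes, pick the interpretation that matches how the weeks were numbered at the source, and check the converted column for duplicate dates before plotting.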
I have a graph that shows the closing price of a stock throughout a day at each five minute interval. The x axis shows the time and the range of x values is from 9:30 to 4:00 (16:00).
The problem is that the automatic bounds for the x axis go from 9:37 to 16:07 and I really just want it from 9:30 to 16:00.
The code I am currently running is this:
stk = yf.Ticker(ticker)
his = stk.history(interval="5m", start=start, end=end).values.tolist() #open - high - low - close - volume
x = []
y = []
count = 0
five_minutes = datetime.timedelta(minutes = 5)
for bar in his:
    x.append((start + five_minutes * count))#.strftime("%H:%M"))
    count = count + 1
    y.append(bar[3])
plt.clf()
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))
plt.gca().xaxis.set_major_locator(mdates.MinuteLocator(interval=30))
plt.plot(x, y)
plt.gcf().autofmt_xdate()
plt.show()
And it produces this plot (currently a link because I am on a new user account):
I thought I was supposed to use the axis.set_data_interval function, so I called it with datetime objects representing 9:30 and 16:00 as the min and the max. This gave me the error:
TypeError: '<' not supported between instances of 'float' and 'datetime.datetime'
Is there another way for me to adjust the first xtick and still have the rest filled in automatically?
This problem can be fixed by adjusting the way you use the mdates tick locator. Here is an example based on the one shared by r-beginners to make it comparable. Note that I use the pandas plotting function for convenience. The x_compat=True argument is needed for it to work with mdates:
import pandas as pd # 1.1.3
import yfinance as yf # 0.1.54
import matplotlib.dates as mdates # 3.3.2
# Import data
ticker = 'AAPL'
stk = yf.Ticker(ticker)
his = stk.history(period='1D', interval='5m')
# Create pandas plot with appropriately formatted x-axis ticks
ax = his.plot(y='Close', x_compat=True, figsize=(10,5))
ax.xaxis.set_major_locator(mdates.MinuteLocator(byminute=[0, 30]))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M', tz=his.index.tz))
ax.legend(frameon=False)
ax.figure.autofmt_xdate(rotation=0, ha='center')
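If you also want the axis itself to start at 9:30 and end at 16:00, one option is to set the limits explicitly with set_xlim, which accepts datetime objects. The sketch below uses synthetic prices instead of a Yahoo Finance download so that it runs offline; the date, the random walk and the variable names are all made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; drop this for interactive use
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic five-minute bars for one trading session (invented data)
times = pd.date_range("2021-01-04 09:30", "2021-01-04 15:55", freq="5min")
close = 100 + np.cumsum(np.random.default_rng(0).normal(0, 0.1, len(times)))

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(times, close)

# Pin the axis to the session bounds instead of the automatic limits
session_open = pd.Timestamp("2021-01-04 09:30").to_pydatetime()
session_close = pd.Timestamp("2021-01-04 16:00").to_pydatetime()
ax.set_xlim(session_open, session_close)

# Ticks on every full and half hour, labelled as HH:MM
ax.xaxis.set_major_locator(mdates.MinuteLocator(byminute=[0, 30]))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))
```

With real yfinance data the index is timezone-aware, so the limits and the formatter would need the same timezone as the index.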
The sample data is Apple's stock price, obtained from Yahoo Finance. The five-minute tick labels are a list of strings built with pd.date_range, running from the start to the end of the session at five-minute intervals.
The closing price is then plotted against the bar number (0, 1, 2, ...) rather than against the timestamps, and slicing picks which of those positions get a tick and a label, so the x-axis can be set to any interval.
import yfinance as yf
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import numpy as np
ticker = 'AAPL'
stk = yf.Ticker(ticker)
his = stk.history(period='1D',interval="5m")
his.reset_index(inplace=True)
time_rng = pd.date_range('09:30','15:55', freq='5min')
labels = ['{:02}:{:02}'.format(t.hour,t.minute) for t in time_rng]
fig, ax = plt.subplots()
x = np.arange(len(his))
y = his.Close
ax.plot(x,y)
ax.set_xticks(x[::3])
ax.set_xticklabels(labels[::3], rotation=45)
plt.show()
I'm working on an assignment from school, and have run into a snag when it comes to my stacked area chart.
The data is fairly simple: 4 columns that look similar to this:
Series id    Year    Period    Value
LNS140000    1948    M01       3.4
I'm trying to create a stacked area chart using Year as my x and Value as my y and breaking it up over Period.
#Stacked area chart still using unemployment data
x = d.Year
y = d.Value
plt.stackplot(x, y, labels = d['Period'])
plt.legend(d['Period'], loc = 'upper left')
plt.show()
However, when I do it like this it only picks up M01 and there are M01-M12. Any thoughts on how I can make this work?
You need to preprocess your data a little before passing it to the stackplot function. I took a look at this link to work out an example that could suit your case.
Since I've only seen one row of your data, I added some made-up values to the dataset.
import pandas as pd
import matplotlib.pyplot as plt
dd=[[1948,'M01',3.4],[1948,'M02',2.5],[1948,'M03',1.6],
[1949,'M01',4.3],[1949,'M02',6.7],[1949,'M03',7.8]]
d=pd.DataFrame(dd,columns=['Year','Period','Value'])
years=d.Year.unique()
periods=d.Period.unique()
#Now group them per period, but in year sequence
d.sort_values(by='Year',inplace=True) # to ensure entire dataset is ordered
pds=[]
for p in periods:
    pds.append(d[d.Period==p]['Value'].values)
plt.stackplot(years,pds,labels=periods)
plt.legend(loc='upper left')
plt.show()
Is that what you want?
So I was able to use Seaborn to help out. First I did a pivot table
df = d.pivot(index = 'Year',
columns = 'Period',
values = 'Value')
df
Then I set up seaborn
import seaborn as sns
sns.set_theme(style = "ticks")  # set_theme alone is enough here; newer matplotlib releases no longer accept the 'seaborn' style name
df.plot.area(figsize = (20,9))
plt.title("Unemployment by Year and Month\n", fontsize = 22, loc = 'left')
plt.ylabel("Values", fontsize = 22)
plt.xlabel("Year", fontsize = 22)
It seems to me that the problem you are having relates to the formatting of the data. Look at how the values are formatted in this matplotlib example. I would try grouping the data by period, or pivoting it into the correct format, and then graphing it again.
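To make the pivoting suggestion concrete, here is a minimal sketch that goes from the long Year/Period/Value layout to one stacked area per period. The values are invented, and one row is deliberately left out to show why fillna(0) can be useful; stackplot needs a complete rectangular array.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; drop this for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample rows in the same layout as the question (1949/M03 is missing on purpose)
rows = [[1948, "M01", 3.4], [1948, "M02", 2.5], [1948, "M03", 1.6],
        [1949, "M01", 4.3], [1949, "M02", 6.7]]
d = pd.DataFrame(rows, columns=["Year", "Period", "Value"])

# One row per year, one column per period; gaps become 0
wide = d.pivot(index="Year", columns="Period", values="Value").fillna(0)

fig, ax = plt.subplots()
# stackplot expects one sequence per layer, hence the transpose
ax.stackplot(wide.index, wide.T.values, labels=wide.columns)
ax.legend(loc="upper left")
```

Whether a missing month should count as 0 or be interpolated depends on the data, so treat the fillna(0) call as a placeholder choice.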
I have a csv file with 2 columns:
col1- Timestamp data(yyyy-mm-dd hh:mm:ss.ms (8 months data))
col2 : Heat data (continuous variable) .
Since there are almost 50k records, I would like to partition col1 (the timestamp column) into months or weeks and then apply a box plot to the heat data with respect to the timestamp.
I tried it in R, but it takes a long time. I need help doing this in Python; I think I need to use seaborn.boxplot.
Please guide.
Group by Frequency then plot groups
First, read your CSV data into a pandas DataFrame
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# assumes NO header line in csv
df = pd.read_csv(r'\file\path', names=['time','temp'], parse_dates=[0])  # raw string so the backslashes are not read as escapes
I will use some fake data, 30 days of hourly samples.
heat = np.random.random(24*30) * 100
dates = pd.date_range('1/1/2011', periods=24*30, freq='H')
df = pd.DataFrame({'time':dates,'temp':heat})
Set the timestamps as the DataFrame's index
df = df.set_index('time')
Now group by the period you want, seven days for this example
gb = df.groupby(pd.Grouper(freq='7D'))
Now you can plot each group separately
for g, week in gb:
    #week.plot()
    week.boxplot()
    plt.title(f'Week Of {g.date()}')
    plt.show()
    plt.close()
And... I didn't realize you could do this but it is pretty cool
ax = gb.boxplot(subplots=False)
plt.setp(ax.xaxis.get_ticklabels(),rotation=30)
plt.show()
plt.close()
# A larger fake data set for the next part: 300 days of hourly samples
heat = np.random.random(24*300) * 100
dates = pd.date_range('1/1/2011', periods=24*300, freq='H')
df = pd.DataFrame({'time':dates,'temp':heat})
df = df.set_index('time')
To partition the data into five time periods and then get weekly boxplots of each:
Determine the total timespan, divide it by five, create a frequency alias from the result, then group by it
dt = df.index[-1] - df.index[0]
dt = dt/5
alias = f'{dt.total_seconds()}S'
gb = df.groupby(pd.Grouper(freq=alias))
Each group is a DataFrame so iterate over the groups; create weekly groups from each and boxplot them.
for g,d_frame in gb:
    gb_tmp = d_frame.groupby(pd.Grouper(freq='7D'))
    ax = gb_tmp.boxplot(subplots=False)
    plt.setp(ax.xaxis.get_ticklabels(),rotation=90)
    plt.show()
    plt.close()
There might be a better way to do this; if so, I'll post it, or maybe someone will feel free to edit this. It looks like this could lead to the last group not having a full set of data.
If you know that your data is periodic you can just use slices to split it up.
n = len(df) // 5
for tmp_df in (df[i:i+n] for i in range(0, len(df), n)):
    gb_tmp = tmp_df.groupby(pd.Grouper(freq='7D'))
    ax = gb_tmp.boxplot(subplots=False)
    plt.setp(ax.xaxis.get_ticklabels(),rotation=90)
    plt.show()
    plt.close()
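Since the question also mentions months, here is a hedged sketch of one figure with one box per calendar month, using the by argument of DataFrame.boxplot. The data is fake (240 days of hourly samples, roughly eight months), and the month column name is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; drop this for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Fake data: 240 days of hourly samples starting 1 Jan 2011
rng = np.random.default_rng(0)
dates = pd.date_range("2011-01-01", periods=24 * 240, freq="60min")
df = pd.DataFrame({"temp": rng.random(len(dates)) * 100}, index=dates)

# Label every sample with its calendar month, e.g. "2011-03"
df["month"] = df.index.to_period("M").astype(str)

fig, ax = plt.subplots(figsize=(10, 5))
df.boxplot(column="temp", by="month", ax=ax, rot=45)  # one box per month
ax.set_title("Heat by month")
fig.suptitle("")  # remove the automatic "Boxplot grouped by month" title
```

Swapping to_period("M") for to_period("W") would give one box per week instead.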
Frequency aliases
pandas.read_csv()
pandas.Grouper()
I have a csv file with data in the following format -
Issue_Type DateTime
Issue1 03/07/2011 11:20:44
Issue2 01/05/2011 12:30:34
Issue3 01/01/2011 09:44:21
... ...
I'm able to read this CSV file, but what I'm unable to do is plot a graph, or rather a trend, based on the data.
For instance, I'm trying to plot a graph with the DateTime (month only) on the x-axis and the number of issues on the y-axis. So I would show the trend as a line graph with 3 lines, indicating the pattern of issues in each category per month.
I don't have any code for plotting the graph yet, so I can't share any; so far I'm only reading the CSV file. I'm not sure how to proceed further to plot a graph.
PS: I'm not bent on using Python. Since I've parsed CSV files with Python before, I thought of using it, but if there is an easier approach in some other language, I would be open to exploring that as well.
A way to do this is to use dataframes with pandas.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
df = pd.read_csv(r"D:\Programmes Python\Data\Data_csv.txt",sep=";") #Reads the csv (raw string so the backslashes are not read as escapes)
df.index = pd.to_datetime(df["DateTime"]) #Set the index of the dataframe to the DateTime column
del df["DateTime"] #The DateTime column is now useless
fig, ax = plt.subplots()
ax.plot(df.index,df["Issue_Type"])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m')) #This will only show the month number on the graph
This assumes that Issue1/2/3 are integers; I assumed so because I didn't really understand what they were supposed to be.
Edit: This should do the trick then, it's not pretty and can probably be optimised, but it works well:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
df = pd.read_csv(r"D:\Programmes Python\Data\Data_csv.txt",sep=";")
df.index = pd.to_datetime(df["DateTime"])
del df["DateTime"]
issue_numbers=[]  # renamed from "list" to avoid shadowing the built-in
for issue in df["Issue_Type"]:
    issue_numbers.append(int(issue[5:]))  # "Issue1" -> 1
df["Issue_number"]=issue_numbers
fig, ax = plt.subplots()
ax.plot(df.index,df["Issue_number"])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m'))
plt.show()
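If the goal is the number of issues per type per month, as the question describes, a groupby may be closer to what is wanted. This is only a sketch: the sample rows extend the three from the question with one invented row, and both the comma separator and the day-first date format are assumptions.

```python
import io
import pandas as pd

# Stand-in for the CSV file; the last row is invented
raw = """Issue_Type,DateTime
Issue1,03/07/2011 11:20:44
Issue2,01/05/2011 12:30:34
Issue3,01/01/2011 09:44:21
Issue1,15/07/2011 08:00:00
"""

df = pd.read_csv(io.StringIO(raw))
# Assumes day/month/year order; swap %d and %m if the file is month-first
df["DateTime"] = pd.to_datetime(df["DateTime"], format="%d/%m/%Y %H:%M:%S")

# Count rows per (month, issue type), then spread the issue types into columns
monthly = (
    df.groupby([df["DateTime"].dt.to_period("M"), "Issue_Type"])
      .size()
      .unstack(fill_value=0)
)
print(monthly)
```

Calling monthly.plot() afterwards should draw one line per issue type with the months on the x-axis.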
The first thing you need to do is to parse the datetime fields as dates/times. Try using dateutil.parser for that.
Next, you will need to count the number of issues of each type in each month. The naive way to do that would be to maintain a list of counters for each issue type, iterate through each row, see which month and which issue type it belongs to, and then increment the appropriate counter.
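A minimal sketch of that counting step, using only the standard library in place of dateutil. The day-first date format and the comma-separated layout are assumptions, and the last sample row is invented.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime

# Stand-in for the CSV file; the last row is invented
raw = """Issue_Type,DateTime
Issue1,03/07/2011 11:20:44
Issue2,01/05/2011 12:30:34
Issue3,01/01/2011 09:44:21
Issue1,15/07/2011 08:00:00
"""

# counts[issue_type][(year, month)] -> number of issues
counts = defaultdict(Counter)
for row in csv.DictReader(io.StringIO(raw)):
    # Assumes day/month/year order; adjust the format string if needed
    when = datetime.strptime(row["DateTime"], "%d/%m/%Y %H:%M:%S")
    counts[row["Issue_Type"]][(when.year, when.month)] += 1

for issue_type, per_month in sorted(counts.items()):
    print(issue_type, dict(per_month))
```

From here, each issue type's counts can be turned into a list ordered by month for plotting.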
When you have such a frequency count of issues, sorted by issue types, you can simply plot them against dates like this:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
# starting_year, ending_year and issues (one list of monthly counts per issue type) come from the counting step
dates = []
for year in range(starting_year, ending_year):  # ending_year itself is excluded
    for month in range(1, 13):  # months 1 through 12
        dates.append(dt.datetime(year=year, month=month, day=1))
month_formatter = mdates.DateFormatter('%b') # Format dates to only show month names
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(dates, issues[0]) # To plot just issues of type 1
ax.plot(dates, issues[1]) # To plot just issues of type 2
ax.plot(dates, issues[2]) # To plot just issues of type 3
ax.xaxis.set_major_formatter(month_formatter) # Format X tick labels
plt.show()
plt.close()
Honestly, I would just use R. Check this link out on downloading and setting up R and RStudio.
data <- read.csv(file="c:/yourdatafile.csv", header=TRUE, sep=",")
attach(data)
data$Month <- format(as.Date(data$DateTime), "%m")
plot(DateTime, Issue_Type)