Newbie Matplotlib and Pandas Plotting from CSV file - python

I haven't had much training with Matplotlib at all, and this really seems like a basic plotting application, but I'm getting nothing but errors.
Using Python 3, I'm simply trying to plot historical stock price data from a CSV file, using the date as the x axis and prices as the y. The data CSV looks like this:
(only just now noticing to big gap in times, but whatever)
import glob
import pandas as pd
import matplotlib.pyplot as plt
def plot_test():
files = glob.glob('./data/test/*.csv')
for file in files:
df = pd.read_csv(file, header=1, delimiter=',', index_col=1)
df['close'].plot()
plt.show()
plot_test()
I'm using glob for now just to identify any CSV file in that folder, but I've also tried just designating one specific CSV filename and get the same error:
KeyError: 'close'
I've also tried just designating a specific column number to only plot one particular column instead, but I don't know what's going on.
Ideally, I would like to plot it just like real stock data, where everything is on the same graph, volume at the bottom on it's own axis, open high low close on the y axis, and date on the x axis for every row in the file. I've tried a few different solutions but can't seem to figure it out. I know this has probably been asked before but I've tried lots of different solutions from SO and others but mine seems to be hanging up on me. Thanks so much for the newbie help!

Here on pandas documentation you can find that the header kwarg should be 0 for your csv, as the first row contains the column names. What is happening is that the DataFrame you are building doesn't have the column close, as it is taking the headers from the "second" row. It will probably work fine if you take the header kwarg or change it to header=0. It is the same with the other kwargs, no need to define them. A simple df = pd.read_csv(file) will do just fine.

You can prettify this according to your needs
import pandas
import matplotlib.pyplot as plt
def plot_test(file):
df = pandas.read_csv(file)
# convert timestamp
df['timestamp'] = pandas.to_datetime(df['timestamp'], format = '%Y-%m-%d %H:%M')
# plot prices
ax1 = plt.subplot(211)
ax1.plot_date(df['timestamp'], df['open'], '-', label = 'open')
ax1.plot_date(df['timestamp'], df['close'], '-', label = 'close')
ax1.plot_date(df['timestamp'], df['high'], '-', label = 'high')
ax1.plot_date(df['timestamp'], df['low'], '-', label = 'low')
ax1.legend()
# plot volume
ax2 = plt.subplot(212)
# issue: https://github.com/matplotlib/matplotlib/issues/9610
df.set_index('timestamp', inplace = True)
df.index.to_pydatetime()
ax2.bar(df.index, df['volume'], width = 1e-3)
ax2.xaxis_date()
plt.show()

Related

Parsing an csv file and plotting with Python

I'm new to Python development and I have to implement a project on data analysis. I have a data.txt file which has the following values:
ID,name,date,confirmedInfections
DE2,BAYERN,2020-02-24,19
.
.
DE2,BAYERN,2020-02-25,19
DE1,BADEN-WÃœRTTEMBERG,2020-02-24,1
.
.
DE1,BADEN-WÃœRTTEMBERG,2020-02-26,7
.
.(lot of other names and data)
What I'm trying to do?
As you can see in the file above each name represents a city with covid infections. For each city, I need to save a data frame for each city and plot a time series graph which uses the index of date on x-axis and confirmedInfections on y-axis. An example:
Because of the big data file I was given with four columns I think that I'm doing a mistake on parsing that file and selecting the correct values. Here is an example of my code:
# Getting the data fron Bayern city
data = pd.read_csv("data.txt", index_col="name")
first = data.loc["BAYERN"]
print(first)
# Plotting the timeseries
series = read_csv('data.txt' ,header=0, index_col=0, parse_dates=True, squeeze=True)
series.plot()
pyplot.show()
And here is a photo of the result:
As you can see on the x-axis I get all the different IDs that are included on data.txt. From that to exlude the ID and stats of each city.
Thanks for your time.
You need to parse date after reading from CSV
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
# You can limit the columns as below provided
headers = ['ID','name','date','confirmedInfections']
data = pd.read_csv('data.csv',names=headers)
data['Date'] = data['Date'].map(lambda x: datetime.strptime(str(x), '%Y/%m/%d'))
x = data['Date']
y = data['confirmedInfections']
# Plot using pyplotlib
plt.plot(x,y)
# display chart
plt.show()
I haven't tested this particular code.
I hope this will work for you

Reordering heatmap from seaborn using column info from additional text file

I wrote a python script to read in a distance matrix that was provided via a CSV text file. This distance matrix shows the difference between different animal species, and I'm trying to sort them in different ways(diet, family, genus, etc.) using data from another CSV file that just has one row of ordering information. Code used is here:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as mp
dietCols = pd.read_csv("label_diet.txt", header=None)
df = pd.read_csv("distance_matrix.txt", header=None)
ax = sns.heatmap(df)
fig = ax.get_figure()
fig.savefig("fig1.png")
mp.clf()
dfDiet = pd.read_csv("distance_matrix.txt", header=None, names=dietCols)
ax2 = sns.heatmap(dfDiet, linewidths=0)
fig2 = ax2.get_figure()
fig2.savefig("fig2.png")
mp.clf()
When plotting the distance matrix, the original graph looks like this:
However, when the additional naming information is read from the text file, the graph produced only has one column and looks like this:
You can see the matrix data is being used as row labeling, and I'm not sure why that would be. Some of the rows provided have no values so they're listed as "NaN", so I'm not sure if that would be causing a problem. Is there any easy way to order this distance matrix using an exterior file? Any help would be appreciated!

Plotting dates only when frequency changes

I have been trying to plot date against frequency.
This is how my data set looks like:
2017-07-04,13
2018-04-11,13
2017-08-17,13
2017-08-30,13
2018-04-26,12
2018-01-03,12
2017-07-05,11
2017-06-21,11
This is the code I have tried:
with open('test.csv', 'w') as f:
writer = csv.writer(f)
writer.writerows(temp)
### Extract data from CSV ###
with open('test.csv', 'r') as n:
reader = csv.reader(n)
dates = []
freq = []
for row in reader:
dates.append(row[0])
freq.append(row[1])
fig = plt.figure()
graph = fig.add_subplot(111)
# Plot the data as a red line with round markers
graph.plot(dates, freq, 'r-o')
graph.set_xticks(dates)
graph.set_xticklabels(
[dates]
)
plt.show()
This is the result I got:
The xlabels are very cluttered. I want the dates in the labels to be displayed only when there is a change of value.
I don't know how to do that.
Help is appreciated.
Thanks!
Firstly, I would strongly encourage you to use the pandas library and its DataFrame object to handle your data. It has some very useful functions, such as read_csv, which will save you some work.
To have matplotlib space the xticks more sensibly, you'll want to convert your dates to datetime objects (instead of storing your dates as strings).
Here I'll read your data in with pandas, parse the dates and order by date:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Read data
df = pd.read_csv('/path/to/test.csv', names=['date', 'freq'], parse_dates=['date'])
# Sort by date
df.sort_values(by='date', inplace=True)
You can then go ahead and plot the data (you'll need the latest version of pandas to automatically handle the dates):
fig, ax = plt.subplots(1, 1)
# Plot date against frequency
ax.plot(df['date'], df['freq'], 'r-o')
# Rotate the tick labels
ax.tick_params(axis='x', rotation=45)
fig.tight_layout()
If you only wanted to display dates when the frequency changes, the following would work
ax.set_xticks(df.loc[np.diff(df['freq']) != 0, 'date'])
though I wouldn't recommend it (the unequal spacing looks messy)

Marked in Scatter plots, if Unexpected Values shows

I have a dataframe, like this. I want to do scatter plots of it.
I want to do scatter plots of Value1 but whenever value2 is decreased to below 0.6, I want to marked in those scatter plots (Value1) to red color otherwise default color is okay.
Any Suggestions ?
Add another column with color information:
import matplotlib.cm as cm
df['color'] = [int(value < 0.6) for value in df.Value2]
df.plot.scatter(x=df.index, y='Value1',c='color',cmap=cm.jet)
I use seaborn's lmplot (advanced scatterplot) tool for that.
You can make a new column in your spreadsheet file with name "Category". It's very easy to categorize variables in excel or openoffice
(It's something like this -> (if(cell_value<0.6-->low),if(cell_value>0.6-->high)).)
So your test data should look like this:
Than you can import the data in python (I use Anaconda 3.5 with spider: python 3.6) I saved the file in .txt format. but any other format is possible (.csv etc.)
#Import libraries
import seaborn as sns
import pandas as pd
import numpy as np
import os
#Open data.txt which is stored in a repository
os.chdir(r'C:\Users\DarthVader\Desktop\Graph')
f = open('data.txt')
#Get data in a list splitting by semicolon
data = []
for l in f:
v = l.strip().split(';')
data.append(v)
f.close()
#Convert list as dataframe for plot purposes
df = pd.DataFrame(data, columns = ['ID', 'Value', 'Value2','Category'])
#pop out first row with header
df2 = df.iloc[1:]
#Change variables to be plotted as numeric types
df2[['Value','Value2']] = df2[['Value','Value2']].apply(pd.to_numeric)
#Make plot with red color with values below 0.6 and green color with values above 0.6
sns.lmplot( x="Value", y="Value2", data=df2, fit_reg=False, hue='Category', legend=False, palette=dict(high="#2ecc71", low="#e74c3c"))
Your output should look like this.

Reading data from csv and create a graph

I have a csv file with data in the following format -
Issue_Type DateTime
Issue1 03/07/2011 11:20:44
Issue2 01/05/2011 12:30:34
Issue3 01/01/2011 09:44:21
... ...
I'm able to read this csv file, but what I'm unable to achieve is to plot a graph or rather trend based on the data.
For instance - I'm trying to plot a graph with X-axis as Datetime(only Month) and Y-axis as #of Issues. So I would show the trend in line-graphy with 3 lines indicating the pattern of issue under each category for the month.
I really don't have a code for plotting the graph and hence can't share any, but so far I'm only reading the csv file. I'm not sure how to proceed further to plot a graph
PS: I'm not bent on using python - Since I've parsed csv using python earlier I though of using the language, but if there is an easier approach using some other language - I would be open explore that as well.
A way to do this is to use dataframes with pandas.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
df = pd.read_csv("D:\Programmes Python\Data\Data_csv.txt",sep=";") #Reads the csv
df.index = pd.to_datetime(df["DateTime"]) #Set the index of the dataframe to the DateTime column
del df["DateTime"] #The DateTime column is now useless
fig, ax = plt.subplots()
ax.plot(df.index,df["Issue_Type"])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m')) #This will only show the month number on the graph
This assumes that Issue1/2/3 are integers, I assumed they were as I didn't really understand what they were supposed to be.
Edit: This should do the trick then, it's not pretty and can probably be optimised, but it works well:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
df = pd.read_csv("D:\Programmes Python\Data\Data_csv.txt",sep=";")
df.index = pd.to_datetime(df["DateTime"])
del df["DateTime"]
list=[]
for Issue in df["Issue_Type"]:
list.append(int(Issue[5:]))
df["Issue_number"]=list
fig, ax = plt.subplots()
ax.plot(df.index,df["Issue_number"])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m'))
plt.show()
The first thing you need to do is to parse the datetime fields as dates/times. Try using dateutil.parser for that.
Next, you will need to count the number of issues of each type in each month. The naive way to do that would be to maintain lists of lists for each issue type, and just iterate through each column, see which month and which issue type it is, and then increment the appropriate counter.
When you have such a frequency count of issues, sorted by issue types, you can simply plot them against dates like this:
import matplotlib.pyplot as plt
import datetime as dt
dates = []
for year in range(starting_year, ending_year):
for month in range(1, 12):
dates.append(dt.datetime(year=year, month=month, day=1))
formatted_dates = dates.DateFormatter('%b') # Format dates to only show month names
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(issues[0], dates) # To plot just issues of type 1
ax.plot(issues[1], dates) # To plot just issues of type 2
ax.plot(issues[2], dates) # To plot just issues of type 3
ax.xaxis.set_major_formatter(formatted_dates) # Format X tick labels
plt.show()
plt.close()
honestly, I would just use R. check this link out on downloading / setting up R & RStudio.
data <- read.csv(file="c:/yourdatafile.csv", header=TRUE, sep=",")
attach(data)
data$Month <- format(as.Date(data$DateTime), "%m")
plot(DateTime, Issue_Type)

Categories

Resources