Plot csv - define legend labels - python

I created a CSV file (with pandas and the help of a friend) like the one in the picture.
Now I need to plot this file.
The first column is the time and should be used for the x data. The rest is y data.
For the legend, I want only the first row to be used for the labels, e.g. "T_HS_Netz_03" for the second column.
I could not figure out how to do this.
My first attempt:
csv_data = pd.read_csv('file', header=[0, 1], delimiter=';')
ax = csv_data.plot(legend=True)
plt.legend(bbox_to_anchor=(0., 1.0, 1.0, 0.), loc=3, ncol=2, mode="expand")
plt.show()
But this includes the second row in the labels too, and the x ticks do not match the data (0.9 - 3.2).
Second attempt:
csv_data = pd.read_csv('file', header=[0, 1], delimiter=';')
x =csv_data.iloc[1:, [0]]
y = csv_data.iloc[1:, 1:]
plt.legend()
plt.plot(x, y)
This does not show any labels
The resulting plot should be something like
Thanks

You can open your CSV file with numpy, for example. Then you can plot the columns:
import numpy as np
import matplotlib.pyplot as plt
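# Assumes the semicolon delimiter and the two header rows from the question
data = np.genfromtxt("file", delimiter=";", skip_header=2, dtype=[("time", "f4"), ("column1", "f8"), ("column2", "f8")])
# label= is what plt.legend() picks up
figure_1 = plt.plot(data['time'], data['column1'], label='column1')
figure_2 = plt.plot(data['time'], data['column2'], label='column2')
plt.legend(loc='upper right')
plt.xlabel('time')
plt.ylabel('data')
plt.show()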
This should give you the expected result ;)
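An alternative that stays in pandas, as a minimal sketch: read both header rows as in the question, then keep only the first header level as the column names and use the time column as the index. The inline sample data, the column name "Time", the second data column, and the assumption that the second header row holds units are all made up for illustration; only "T_HS_Netz_03" comes from the question.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: two header rows, ';' delimiter
raw = io.StringIO(
    "Time;T_HS_Netz_03;T_HS_Netz_04\n"
    "s;degC;degC\n"
    "0.9;20.1;30.5\n"
    "1.8;20.4;30.9\n"
    "3.2;20.9;31.2\n"
)

csv_data = pd.read_csv(raw, header=[0, 1], delimiter=";")

# Drop the second header row from the labels: keep only level 0 of the MultiIndex
csv_data.columns = csv_data.columns.get_level_values(0)

# Use the first column as x data; everything else is y data
csv_data = csv_data.set_index("Time")

print(csv_data)
```

With the frame in this shape, calling csv_data.plot(legend=True) should label each line with its first-row name and put the time values on the x axis.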

Related

Handling Timestamps Python Panda

I am new to Python and trying to learn as much as I can.
I am trying to create a live graph with Matplotlib by reading from a CSV file.
It seems that I am getting a TypeError, I am guessing because of the timestamp format.
From what I read in the pandas documentation, date_parser should take care of this, but I am unsure how to use it properly.
I would like to use the Timestamp in the 2nd column of the CSV as the X Axis, and then plot the rest of the data as Y values.
The CSV looks like this:
1,11:24:30,null,0,3,4,5,6,7,8,9,10,11,12
1,11:24:33,null,0,3,4,5,6,7,8,9,10,11,12
1,11:24:35,null,0,3,4,5,6,7,8,9,10,11,12
1,11:24:38,null,0,3,4,5,6,7,8,9,10,11,12
1,11:24:41,null,0,3,4,5,6,7,8,9,10,11,12
1,11:24:43,null,0,3,4,5,6,7,8,9,10,11,12
My code is below:
import random
from itertools import count
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
plt.style.use('fivethirtyeight')
x_vals = []
y_vals = []
index = count()
def animate(i):
    data = pd.read_csv('C:/Python/20220124.csv', names=["Pass", "Time", "1", "2"], header=None, parse_dates=True)
    y1 = data['Pass']
    y2 = data['Time']
    y3 = data['1']
    y4 = data['2']
    plt.cla()
    plt.plot(y1, label='Pass/Fail', lw=3, c='c', marker='o', markersize=4, mfc='k')
    plt.plot(y2, label='Time', lw=3, c='c', marker='o', markersize=4, mfc='k')
    plt.plot(y3, label='1', lw=2, ls='--', c='k')
    plt.plot(y4, label='2', lw=2, ls='--', c='k')
    plt.legend(loc='upper left')
    plt.tight_layout()
    ax = plt.gca()
    xlim_low, xlim_high = ax.get_xlim()
    ax.set_xlim(xlim_low, xlim_high)
    y1offset = 1.0
    y1max = (y4.max() + y1offset)
    current_ymax = y1max
    y1min = (y3.min() - y1offset)
    current_ymin = y1min
    ax.set_ylim(current_ymin, current_ymax)
ani = FuncAnimation(plt.gcf(), animate, interval=1000)
plt.tight_layout()
plt.show()
Thanks for any help!
I should say up front that I am not an expert on pandas or matplotlib.
Looking at your code, I think the problem lies in how the columns of the CSV file are defined.
You pass read_csv a names list with 4 fields, but your CSV has many more columns (14 per line).
When I try your code with the excess data removed from the CSV, so that each line has only four fields, the plot is drawn.
As I said, I don't know these libraries, but as I understand it, pandas read_csv loads data into a data structure (the DataFrame, according to the docs).
Something probably goes wrong when the function reads the extra data and tries to parse it as timestamps, producing the TypeError.
But that is a big guess; I think it is better to go back to the pandas docs!
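One way to act on this without editing the file, as a rough sketch: read the file without names, keep only the columns you want by position, and convert the time column explicitly. Which four columns are actually wanted is an assumption here (positions 0, 1, 3 and 4); the inline rows are copied from the question so the snippet runs on its own.

```python
import io
import pandas as pd

# A few rows copied from the question, inlined so the sketch is self-contained
raw = io.StringIO(
    "1,11:24:30,null,0,3,4,5,6,7,8,9,10,11,12\n"
    "1,11:24:33,null,0,3,4,5,6,7,8,9,10,11,12\n"
    "1,11:24:35,null,0,3,4,5,6,7,8,9,10,11,12\n"
)

# Read only four of the 14 columns, selected by position (assumed choice)
data = pd.read_csv(raw, header=None, usecols=[0, 1, 3, 4])
data.columns = ["Pass", "Time", "1", "2"]

# Parse the HH:MM:SS strings explicitly instead of relying on parse_dates
data["Time"] = pd.to_datetime(data["Time"], format="%H:%M:%S")

# With the timestamps as the index, plotting uses them for the x axis
data = data.set_index("Time")
print(data)
```

Because no date is present in the file, to_datetime fills in a default date; only the time of day is meaningful here.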

plotting pandas dataframe date

I have a pandas dataframe with 27 columns of electricity consumption data. The first column holds the date and time over a two-year period, and the other 26 columns hold hourly electricity consumption values for 26 houses over those two years. I am clustering the data with k-means. Whenever I try to plot the date on the x-axis and the electricity consumption values on the y-axis, I get an error saying that x and y must have the same size. I tried to reshape the data, but that did not solve the problem.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import datetime
data_consumption2 = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
data_consumption2['Timestamp'] = pd.to_datetime(data_consumption2['Timestamp'], unit='s')
X=data_consumption2.iloc[: , 1:26].values
X=np.nan_to_num(X)
np.concatenate(X)
date=data_consumption2.iloc[: , 0].values
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
C = kmeans.cluster_centers_
plt.scatter(X, R , s=40, c= kmeans.labels_.astype(float), alpha=0.7)
plt.scatter(C[:,0] , C[:,1] , marker='*' , c='r', s=100)
I always get the same error message: x and y must be the same size, try to reshape your data. Reshaping did not work because the date column is always smaller than the rest of the columns.
I think what you are essentially doing is a time series clustering of all households to find similar electricity usage patterns over time.
For that, each timestamp becomes a 'feature', while each household's usage becomes a data row. This makes it easier to apply sklearn clustering methods, which typically take the form method.fit(x), where x holds the features (pass the data as a 2D array with shape (row, column)). So your data needs to be transposed.
The refactored code is as follows:
# what you have done
import pandas as pd
df = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')
# this is to fill all the NaN values with 0
df.fillna(0,inplace=True)
# transpose the dataframe accordingly
df = df.set_index('Timestamp').transpose()
df.rename(columns=lambda x : x.strftime('%D %H:%M:%S'), inplace=True)
df.reset_index(inplace=True)
df.rename(columns={'index':'house_no'}, inplace=True)
df.columns.rename(None, inplace=True)
df.head()
and you should see something like this (don't mind the data shown, I created some dummy data that is similar to yours).
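To make the reshaping step concrete, here is a small sketch on made-up data (4 timestamps, 3 houses; the column names house_1 to house_3 and all values are hypothetical). It only shows how the transpose turns houses into rows and timestamps into features.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel sheet: one row per timestamp, one column per house
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 01:00",
        "2020-01-01 02:00", "2020-01-01 03:00",
    ]),
    "house_1": [1.0, 2.0, 3.0, 4.0],
    "house_2": [0.5, np.nan, 0.7, 0.8],
    "house_3": [9.0, 8.0, 7.0, 6.0],
})

# Fill missing readings with 0, as in the answer
df = df.fillna(0)

# After the transpose, each row is a house and each column is a timestamp
wide = df.set_index("Timestamp").transpose()

# This 2D array is what a clustering method's fit() would receive
features = wide.to_numpy()
print(wide.index.tolist())
print(features.shape)
```

The shape is (number of houses, number of timestamps), so a clustering algorithm assigns one label per house rather than one per hour.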
Next, for clustering, this is what you can do:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(df.iloc[:,1:])
y_kmeans = kmeans.predict(df.iloc[:,1:])
C = kmeans.cluster_centers_
# add a new column to your dataframe that contains the predicted clusters
df['cluster'] = y_kmeans
Finally, for plotting, you can produce the scatter chart you wanted using the code below:
import matplotlib.pyplot as plt
color = ['red','green','blue']
plt.figure(figsize=(16,4))
for index, row in df.iterrows():
    plt.scatter(x=row.index[1:-1], y=row.iloc[1:-1], c=color[row.iloc[-1]], marker='x', alpha=0.7, s=40)
for index, cluster_center in enumerate(kmeans.cluster_centers_):
    plt.scatter(x=df.columns[1:-1], y=cluster_center, c=color[index], marker='o', s=100)
plt.xticks(rotation='vertical')
plt.ylabel('Electricity Consumption')
plt.title(f'All Clusters - Scatter', fontsize=20)
plt.show()
But I would recommend plotting line plots for the individual clusters, which I find more visually appealing:
plt.figure(figsize=(16,16))
for cluster_index in [0,1,2]:
    plt.subplot(3,1,cluster_index + 1)
    for index, row in df.iterrows():
        if row.iloc[-1] == cluster_index:
            plt.plot(row.iloc[1:-1], c=color[row.iloc[-1]], linestyle='--', marker='x', alpha=0.5)
    plt.plot(kmeans.cluster_centers_[cluster_index], c = color[cluster_index], marker='o', alpha=1)
    plt.xticks(rotation='vertical')
    plt.ylabel('Electricity Consumption')
    plt.title(f'Cluster {cluster_index}', fontsize=20)
plt.tight_layout()
plt.show()
Cheers!

How can I plot slice of certain DataFrame for each row with different color?

I would like to plot certain slices of my pandas DataFrame for each row (based on row indexes) with different colors.
My data look like the following:
I already tried with the help of this tutorial to find a way but I couldn't - probably due to a lack of skills.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv(r"D:\SOF10.csv", header=None)
df.head()
#Slice interested data
C = df.iloc[:, 2::3]
#Plot Temp base on row index colorfully
C.apply(lambda x: plt.scatter(x.index, x, c='g'))
plt.show()
Following is my expected plot:
I was also wondering whether I could display the mean of each row of the sliced data (each row contains 480 values) somewhere in the plot or in the legend beside the plot. Is it feasible (as in the following picture) to calculate the mean and display it in the legend, or next to its own data in the graph using a small font size?
Data sample: data
This gives the plot without legend
C = df.iloc[:,2::3].stack().reset_index()
C.columns = ['level_0', 'level_1', 'Temperature']
fig, ax = plt.subplots(1,1)
C.plot('level_0', 'Temperature',
       ax=ax, kind='scatter',
       c='level_0', colormap='tab20',
       colorbar=False, legend=True)
ax.set_xlabel('Cycles')
plt.show()
Edit to reflect modified question:
stack() transforms your (sliced) dataframe into a series with a two-level index (row, col).
reset_index() turns that two-level index into the columns level_0 (row) and level_1 (col).
set_xlabel sets the label of the x-axis to what you want.
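As a small illustration of what stack() and reset_index() do here, the sketch below runs the same steps on a made-up 2x6 frame (the values are hypothetical, not the question's data):

```python
import pandas as pd

# Hypothetical frame: 2 rows, 6 columns
df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [7, 8, 9, 10, 11, 12]])

# Same slice as in the answer: every third column starting at column 2
sliced = df.iloc[:, 2::3]

# stack() gives a Series indexed by (row, col)
stacked = sliced.stack()

# reset_index() moves both index levels into ordinary columns
long_form = stacked.reset_index()
long_form.columns = ["level_0", "level_1", "Temperature"]
print(long_form)
```

Each original cell becomes one row of long_form, so level_0 (the original row number) can be used both as the x value and as the color of a scatter plot.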
Edit 2: The following produces scatter with legend:
CC = df.iloc[:,2::3]
fig, ax = plt.subplots(1,1, figsize=(16,9))
labels = CC.mean(axis=1)
for i in CC.index:
    ax.scatter([i]*len(CC.columns[1:]), CC.iloc[i,1:], label=labels[i])
ax.legend()
ax.set_xlabel('Cycles')
ax.set_ylabel('Temperature')
plt.show()
This may be an approximate answer. scatter(c=..., cmap=...) can be used for the desired coloring.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import itertools
df = pd.DataFrame({'a':[34,22,1,34]})
fig, subplot_axes = plt.subplots(1, 1, figsize=(20, 10)) # width, height
colors = ['red','green','blue','purple']
cmap=matplotlib.colors.ListedColormap(colors)
for col in df.columns:
    subplot_axes.scatter(df.index, df[col].values, c=df.index, cmap=cmap, alpha=.9)

How to plot pandas dataframe in same figure without destroying their actual orientation?

I was trying to compare the runtime of naive matrix multiplication and Strassen's algorithm. For this, I recorded the runtime for different dimensions of the matrices. Then I tried to plot the results in the same graph for comparison.
But the problem is the plotting is not showing the proper result.
Here is the data...
2 3142
3 3531
4 4756
5 5781
6 8107
The leftmost column is denoting n, the dimension and rightmost column is denoting execution time.
The above data is for Naive method and the data for Strassen is in this pattern too.
I'm inserting this data to a pandas dataframe. And after plotting the data the image looks like this:
Here blue is for Naive and green is for Strassen's
This is certainly not true, as Naive cannot be constant. But my code was correct. So I decided to plot them separately, and these are the results:
Naive
Strassen
As you can see, this might happen because the scaling of the y axis is not the same.
Is this the reason?
The code I'm implementing for plotting is:
fig = plt.figure()
data_naive = pd.read_csv('naive.txt', sep="\t", header=None)
data_naive.columns = ["n", "time"]
plt.plot(data_naive['n'], data_naive['time'], 'g')
data_strassen = pd.read_csv('strassen.txt', sep="\t", header=None)
data_strassen.columns = ["n", "time"]
plt.plot(data_strassen['n'], data_strassen['time'], 'b')
plt.show()
fig.savefig('figure.png')
What I tried to work out?
fig = plt.figure()
data_naive = pd.read_csv('naive.txt', sep="\t", header=None)
data_naive.columns = ["n", "time"]
data_strassen = pd.read_csv('strassen.txt', sep="\t", header=None)
data_strassen.columns = ["n", "time"]
ax = data_naive.plot(x='n', y='time', c='blue', figsize=(20,10))
data_strassen.plot(x='n', y='time', c='green', figsize=(20,10), ax=ax)
plt.savefig('comparison.png')
plt.show()
But no luck!!!
How to plot them in the same figure without altering their actual orientation?
IIUC: Here is a solution using twinx
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randint(10, 100, (12,2)))
df[1] = np.random.dirichlet(np.ones(12)*1000., size=1)[0]
fig, ax1 = plt.subplots()
ax1.plot(df[0], color='r')
#Plot the secondary axis in the right side
ax2 = ax1.twinx()
ax2.plot(df[1], color='k')
fig.tight_layout()
plt.show()
Result produced:
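Applied to the question's setup, the same idea might look like the sketch below. The Naive timings are the ones listed in the question; the Strassen timings are made up, since the question does not show them. The Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Naive timings from the question
data_naive = pd.DataFrame({"n": [2, 3, 4, 5, 6],
                           "time": [3142, 3531, 4756, 5781, 8107]})
# Hypothetical Strassen timings on a much larger scale
data_strassen = pd.DataFrame({"n": [2, 3, 4, 5, 6],
                              "time": [41000, 52000, 67000, 90000, 130000]})

fig, ax1 = plt.subplots()
ax1.plot(data_naive["n"], data_naive["time"], color="b")
ax1.set_xlabel("n")
ax1.set_ylabel("Naive time", color="b")

# Second y axis on the right, sharing the same x axis
ax2 = ax1.twinx()
ax2.plot(data_strassen["n"], data_strassen["time"], color="g")
ax2.set_ylabel("Strassen time", color="g")

fig.tight_layout()
```

Each curve gets its own y scale, so the smaller series no longer looks flat. The trade-off is that the two curves can no longer be compared by height alone; the axis labels have to be read.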

Using pandas/matplotlib/python, I cannot visualize my csv file as clusters

My csv file is,
https://github.com/camenergydatalab/EnergyDataSimulationChallenge/blob/master/challenge2/data/total_watt.csv
I want to visualize this csv file as clusters.
My ideal result would be the following image. (Higher points (red zone) would be higher energy consumption and lower points (blue zone) would be lower energy consumption.)
I want to set x-axis as dates (e.g. 2011-04-18), y-axis as time (e.g. 13:22:00), and z-axis as energy consumption (e.g. 925.840613752523).
I successfully visualized the csv data file as values per 30mins with the following program.
from matplotlib import style
from matplotlib import pylab as plt
import numpy as np
style.use('ggplot')
filename='total_watt.csv'
date=[]
number=[]
import csv
with open(filename, 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in csvreader:
        if len(row) ==2 :
            date.append(row[0])
            number.append(row[1])
number=np.array(number)
import datetime
for ii in range(len(date)):
    date[ii]=datetime.datetime.strptime(date[ii], '%Y-%m-%d %H:%M:%S')
plt.plot(date,number)
plt.title('Example')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.show()
I also managed to visualize the csv data file as values per day with the following program.
from matplotlib import style
from matplotlib import pylab as plt
import numpy as np
import pandas as pd
style.use('ggplot')
filename='total_watt.csv'
date=[]
number=[]
import csv
with open(filename, 'rb') as csvfile:
    df = pd.read_csv('total_watt.csv', parse_dates=[0], index_col=[0])
    df = df.resample('1D', how='sum')
import datetime
for ii in range(len(date)):
    date[ii]=datetime.datetime.strptime(date[ii], '%Y-%m-%d %H:%M:%S')
plt.plot(date,number)
plt.title('Example')
plt.ylabel('Y axis')
plt.xlabel('X axis')
df.plot()
plt.show()
Although I could visualize the csv file as values per 30 minutes and per day, I have no idea how to visualize the csv data as clusters in 3D.
How can I program it?
Your main issue is probably just reshaping your data so that you have date along one dimension and time along the other. Once you do that you can use whatever plotting you like best (here I've used matplotlib's mplot3d, but it has some quirks).
What follows takes your data and reshapes it appropriately so you can then plot a surface that I believe is what you are looking for. The key is using the pivot method, which restructures your data by date and time.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
fname = 'total_watt.csv'
# Read in the data, but I skipped setting the index and made sure no data
# is lost to a nonexistent header
df = pd.read_csv(fname, parse_dates=[0], header=None, names=['datetime', 'watt'])
# We want to separate the date from the time, so create two new columns
df['date'] = [x.date() for x in df['datetime']]
df['time'] = [x.time() for x in df['datetime']]
# Now we want to reshape the data so we have dates and times making the result 2D
pv = df.pivot(index='time', columns='date', values='watt')
# Not every date has every time, so fill in the subsequent NaNs or there will be holes
# in the surface
pv = pv.fillna(0.0)
# Now, we need to construct some arrays that matplotlib will like for X and Y values
xx, yy = np.mgrid[0:len(pv),0:len(pv.columns)]
# We can now plot the values directly in matplotlib using mplot3d
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(xx, yy, pv.values, cmap='jet', rstride=1, cstride=1)
ax.grid(False)
# Now we have to adjust the ticks and ticklabels - so turn the values into strings
dates = [x.strftime('%Y-%m-%d') for x in pv.columns]
times = [str(x) for x in pv.index]
# Setting a tick every fifth element seemed about right
ax.set_xticks(xx[::5,0])
ax.set_xticklabels(times[::5])
ax.set_yticks(yy[0,::5])
ax.set_yticklabels(dates[::5])
plt.show()
This gives me (using your data) the following graph:
Note that I've assumed when plotting and making the ticks that your dates and times are linear (which they are in this case). If you have data with uneven samples, you'll have to do some interpolation before plotting.
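To see what the pivot step does on its own, here is a small sketch with three inline rows. The first timestamp and the format follow the question's example; the shortened values and the other two rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in the same "datetime,watt" layout
raw = io.StringIO(
    "2011-04-18 13:22:00,925.8\n"
    "2011-04-18 13:52:00,483.3\n"
    "2011-04-19 13:22:00,610.0\n"
)

df = pd.read_csv(raw, parse_dates=[0], header=None, names=["datetime", "watt"])

# Split each timestamp into a date part and a time-of-day part
df["date"] = df["datetime"].dt.date
df["time"] = df["datetime"].dt.time

# Rows become times, columns become dates; missing combinations are filled with 0
pv = df.pivot(index="time", columns="date", values="watt").fillna(0.0)
print(pv)
```

The missing 13:52 reading on 2011-04-19 shows up as 0.0, which is the same hole-filling the answer applies before plotting the surface.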
