I am new to Python.
I am trying to plot two variables, Y1 and Y2 (on a secondary y-axis), against the date on the x-axis, reading from a CSV file.
I think my main problem is converting the date column in the CSV.
Also, is it possible to save the three graphs separately according to the ID (A, B, C)?
I have added the CSV file I have and an image of the figure I am looking for.
Thanks a lot for your advice.
ID date Y1 Y2
A 40480 136 83
A 41234 173 23
A 41395 180 29
A 41458 124 60
A 41861 158 27
A 42441 152 26
A 43009 155 51
A 43198 154 38
B 40409 185 71
B 40612 156 36
B 40628 165 39
B 40989 139 77
B 41346 138 20
B 41558 132 85
B 41872 157 58
B 41992 120 91
B 42245 139 43
B 42397 131 34
B 42745 114 68
C 40711 110 68
C 40837 156 38
C 40946 110 63
C 41186 161 46
C 41243 187 20
C 41494 122 55
C 41970 103 19
C 42183 148 78
C 42247 115 33
C 42435 132 92
C 42720 187 43
C 43228 127 28
C 43426 183 45
Try the matplotlib library; if I understood you right, it should work.
from mpl_toolkits import mplot3d  # enables the '3d' projection
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection='3d')

# Data for a three-dimensional line (date, y1, y2 are the columns from your CSV)
zaxis = y1
xaxis = date
yaxis = y2
ax.plot3D(xaxis, yaxis, zaxis, 'red')

# Data for three-dimensional scattered points
zdat = y1
xdat = date
ydat = y2
ax.scatter3D(xdat, ydat, zdat, c=xdat, cmap='Greens')
If I understand you correctly, you are looking for three separate graphs for ID=A, ID=B, ID=C. Here is how you could get that:
import pandas as pd
import pylab as plt

data = pd.read_csv('data.dat', sep='\t')  # read your data file; you might have a different name here

for i, (label, subset) in enumerate(data.groupby('ID')):
    plt.subplot(131 + i)
    plt.plot(subset['date'], subset['Y1'])
    plt.plot(subset['date'], subset['Y2'], 'o')
    plt.title('ID: {}'.format(label))

plt.show()
Note that this treats your dates as integers (same as in the datafile).
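Since those dates are left as integers: their magnitude suggests Excel serial day numbers (days since 1899-12-30; 40480 falls in late 2010). Under that assumption, the sketch below converts them, puts Y2 on a secondary axis via twinx(), and saves one figure per ID. The inline data and the plot_&lt;ID&gt;.png file names are placeholders:

```python
import io
import pandas as pd
import matplotlib
matplotlib.use("Agg")          # render off-screen; we only save files
import matplotlib.pyplot as plt

csv = io.StringIO("""ID\tdate\tY1\tY2
A\t40480\t136\t83
A\t41234\t173\t23
B\t40409\t185\t71
B\t40612\t156\t36
""")
data = pd.read_csv(csv, sep="\t")

# Excel stores dates as days since 1899-12-30
data["date"] = pd.to_datetime(data["date"], unit="D", origin="1899-12-30")

for label, subset in data.groupby("ID"):
    fig, ax1 = plt.subplots()
    ax1.plot(subset["date"], subset["Y1"], color="tab:blue", label="Y1")
    ax2 = ax1.twinx()                       # secondary y-axis for Y2
    ax2.plot(subset["date"], subset["Y2"], "o", color="tab:red", label="Y2")
    ax1.set_title("ID: {}".format(label))
    fig.autofmt_xdate()
    fig.savefig("plot_{}.png".format(label))  # one file per ID
    plt.close(fig)
```

If the numbers turn out not to be Excel serials, swap the `origin` accordingly.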
I have changed the column names and added new columns too.
I have a NumPy array that I have to fill into the respective DataFrame columns.
Filling the DataFrame with the following code is slow:
import pandas as pd
import numpy as np

df = pd.read_csv("sample.csv")
df = df.tail(1000)

DISPLAY_IN_TRAINING = []
Slice_Middle_Piece_X = slice(None, -1, None)
Slice_Middle_Piece_Y = slice(-1, None)
input_slicer = slice(None, None)
output_slice = slice(None, None)
seq_len = 15  # choose sequence length
n_steps = seq_len - 1
Disp_Data = df

def Generate_DataSet(stock, df_clone, seq_len):
    global DISPLAY_IN_TRAINING
    data_raw = stock.values  # convert to numpy array
    data = []
    len_data_raw = data_raw.shape[0]
    for index in range(0, len_data_raw - seq_len + 1):
        data.append(data_raw[index: index + seq_len])
    data = np.array(data)
    test_set_size = int(np.round(30 / 100 * data.shape[0]))
    train_set_size = data.shape[0] - test_set_size
    x_train, y_train = Get_Data_Chopped(data[:train_set_size])
    print("Training Sliced Successful....!")
    df_train_candle = df_clone[n_steps : train_set_size + n_steps]
    if len(DISPLAY_IN_TRAINING) == 0:
        DISPLAY_IN_TRAINING = list(df_clone)
    df_train_candle.columns = DISPLAY_IN_TRAINING
    return [x_train, y_train, df_train_candle]

def Get_Data_Chopped(data_related_to):
    x_values = []
    y_values = []
    for index, iter_values in enumerate(data_related_to):
        x_values.append(iter_values[Slice_Middle_Piece_X, input_slicer])
        y_values.append([item for sublist in iter_values[Slice_Middle_Piece_Y, output_slice] for item in sublist])
    x_values = np.asarray(x_values)
    y_values = np.asarray(y_values)
    return [x_values, y_values]

x_train, y_train, df_train_candle = Generate_DataSet(df, Disp_Data, seq_len)
df_train_candle.reset_index(drop=True, inplace=True)

df_columns = list(df_train_candle)
df_outputs_name = []
OUTPUT_COLUMN = df.columns
for output_column_name in OUTPUT_COLUMN:
    df_outputs_name.append(output_column_name + "_pred")
    for i in range(len(df_columns)):
        if df_columns[i] == output_column_name:
            df_columns[i] = output_column_name + "_orig"
            break

df_train_candle.columns = df_columns
df_pred_names = pd.DataFrame(columns=df_outputs_name)
df_train_candle = df_train_candle.join(df_pred_names, how="outer")

for row_index, row_value in enumerate(y_train):
    for valueindex, output_label in enumerate(OUTPUT_COLUMN):
        df_train_candle.loc[row_index, output_label + "_orig"] = row_value[valueindex]
        df_train_candle.loc[row_index, output_label + "_pred"] = row_value[valueindex]

print(df_train_candle.head())
The shape of my y_train is (195, 24) and the DataFrame shape is (195, 48). Now I am trying to optimize and make the process faster. y_train may change shape to, say, (195, 1) or (195, 5).
So please, can someone suggest a more optimized way of doing the above? I want a general solution that fits any shape without losing data integrity and is faster too.
If the data size increases from 1000 to 2000 rows, the process becomes slow. Please advise how to make it faster.
Sample Data df looks like this with shape (1000, 8)
A B C D E F G H
64272 195 215 239 272 22 11 33 55
64273 196 216 240 273 22 11 33 55
64274 197 217 241 274 22 11 33 55
64275 198 218 242 275 22 11 33 55
64276 199 219 243 276 22 11 33 55
The output looks like this:
A_orig B_orig C_orig D_orig E_orig F_orig G_orig H_orig A_pred B_pred C_pred D_pred E_pred F_pred G_pred H_pred
0 10 30 54 87 22 11 33 55 10 30 54 87 22 11 33 55
1 11 31 55 88 22 11 33 55 11 31 55 88 22 11 33 55
2 12 32 56 89 22 11 33 55 12 32 56 89 22 11 33 55
3 13 33 57 90 22 11 33 55 13 33 57 90 22 11 33 55
4 14 34 58 91 22 11 33 55 14 34 58 91 22 11 33 55
Generate a CSV with 1000 or more lines and you will see the program slow down. I want to make it faster. I hope this is enough to understand the problem.
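One likely fix, as a sketch rather than the exact pipeline: the nested .loc loop at the end assigns one cell at a time, and every such assignment re-indexes the frame. Since the _orig and _pred columns are both plain copies of y_train, they can be built in one shot with the DataFrame constructor and concat. The OUTPUT_COLUMN and y_train below are toy stand-ins for the real variables:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the question's variables (hypothetical sizes)
OUTPUT_COLUMN = list("ABCDEFGH")
y_train = np.arange(195 * 8).reshape(195, 8)

# Build all *_orig and *_pred columns at once instead of cell-by-cell
# .loc assignment; this is vectorized and shape-agnostic.
orig = pd.DataFrame(y_train, columns=[c + "_orig" for c in OUTPUT_COLUMN])
pred = pd.DataFrame(y_train, columns=[c + "_pred" for c in OUTPUT_COLUMN])
df_train_candle = pd.concat([orig, pred], axis=1)

print(df_train_candle.shape)
```

Because the column lists are derived from OUTPUT_COLUMN, the same code works when y_train changes to (195, 1) or (195, 5).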
I have a CSV file where one column holds a unique identifier (a, b, c, ...) and I would like to plot the data separated by this identifier (so a separate line on the same graph for a, b, and so forth).
SSID Time RSSI
0 a 13:14:42 -33
1 a 13:14:46 -30
2 a 13:14:49 -31
3 a 13:14:52 -31
4 a 13:14:55 -35
.. ... ... ...
64 b 13:15:43 -58
65 b 13:15:46 -56
66 b 13:15:50 -65
67 b 13:15:53 -52
68 b 13:15:57 -65
What I've written plots every point together in one line, but how can I plot them on the same graph, but have them separated based on the identifier?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

temp = np.genfromtxt('file.csv', delimiter=',')
plt.figure()
plt.plot(temp)
plt.show()
Thank you!
Reshape so that the SSID values become columns, then a simple pandas plot() does the job:
import io
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(io.StringIO(""" SSID Time RSSI
0 a 13:14:42 -33
1 a 13:14:46 -30
2 a 13:14:49 -31
3 a 13:14:52 -31
4 a 13:14:55 -35
64 b 13:15:43 -58
65 b 13:15:46 -56
66 b 13:15:50 -65
67 b 13:15:53 -52
68 b 13:15:57 -65"""), sep=r"\s+")

fig, ax = plt.subplots(1, figsize=[10, 6])
df.set_index(["SSID", "Time"]).unstack(0).droplevel(0, 1).plot(ax=ax)
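If you prefer to keep the long format instead of unstacking, a groupby loop draws one line per SSID on the same axes. A sketch with a trimmed copy of the sample data inlined:

```python
import io
import pandas as pd
import matplotlib
matplotlib.use("Agg")   # off-screen rendering
import matplotlib.pyplot as plt

df = pd.read_csv(io.StringIO("""SSID Time RSSI
a 13:14:42 -33
a 13:14:46 -30
b 13:15:43 -58
b 13:15:46 -56"""), sep=r"\s+")

fig, ax = plt.subplots(figsize=[10, 6])
for ssid, grp in df.groupby("SSID"):
    ax.plot(grp["Time"], grp["RSSI"], label=ssid)  # one line per identifier
ax.legend(title="SSID")
```

The loop generalizes directly to any number of identifiers in the column.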
I have applied DBSCAN to perform clustering on a dataset consisting of X, Y and Z coordinates of each point in a point cloud. I want to plot only the clusters which have less than 100 points. This is what I have so far:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

clustering = DBSCAN(eps=0.1, min_samples=20, metric='euclidean').fit(only_xy)
plt.scatter(only_xy[:, 0], only_xy[:, 1],
            c=clustering.labels_, cmap='rainbow')
clusters = clustering.components_
# Store the labels
labels = clustering.labels_
# Then get the frequency count of the non-negative labels
counts = np.bincount(labels[labels >= 0])
print(counts)
Output:
[1278 564 208 47 36 30 191 54 24 18 40 915 26 20
24 527 56 677 63 57 61 1544 512 21 45 187 39 132
48 55 160 46 28 18 55 48 35 92 29 88 53 55
24 52 114 49 34 34 38 52 38 53 69]
So I have found the number of points in each cluster, but I'm not sure how to select only the clusters which have less than 100 points.
You may find the indexes of the labels whose counts are less than 100:
ls, cs = np.unique(labels, return_counts=True)
dic = dict(zip(ls, cs))
idx = [i for i, label in enumerate(labels) if dic[label] < 100 and label >= 0]
Then you may apply the resulting index to your DBSCAN results and labels, more or less like this:
plt.scatter(only_xy[idx, 0], only_xy[idx, 1],
c=clustering.labels_[idx], cmap='rainbow')
I think if you run this code, you can get the labels and cluster components of the clusters with more than 100 points:
from collections import Counter

labels_with_morethan100 = [label for (label, count) in Counter(clustering.labels_).items()
                           if count > 100]
# components_ holds only the core samples, so index it with the labels of
# the core samples rather than the labels of all non-noise points
clusters_biggerthan100 = clustering.components_[
    np.isin(clustering.labels_[clustering.core_sample_indices_],
            labels_with_morethan100)]
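A boolean-mask variant of the same idea, using np.bincount and np.isin instead of a dict; the threshold is shrunk to 3 so the toy label array below shows the effect (use 100 with your real labels_):

```python
import numpy as np

# Hypothetical label array standing in for clustering.labels_
# (-1 is DBSCAN's noise label)
labels = np.array([0, 0, 0, 1, 1, -1, 2])

counts = np.bincount(labels[labels >= 0])   # points per cluster id
small = np.flatnonzero(counts < 3)          # cluster ids with < 3 points
mask = np.isin(labels, small)               # True for points in small clusters

print(small, mask)
```

The mask can then be used directly on the coordinate array, e.g. `only_xy[mask]`, and noise points stay excluded because -1 is never in `small`.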
I have a table "tempcc" of values with X, Y geographic coordinates (I don't know how to attach files here; there are 86 rows in my CSV):
X Y Temp
0 35.268 55.618 1.065389
1 35.230 55.682 1.119160
2 35.508 55.690 1.026214
3 35.482 55.652 1.007834
4 35.289 55.664 1.087598
5 35.239 55.655 1.099459
6 35.345 55.662 1.066117
7 35.402 55.649 1.035958
8 35.506 55.643 0.991939
9 35.526 55.688 1.018137
10 35.541 55.695 1.017870
11 35.471 55.682 1.033929
12 35.573 55.668 0.985559
13 35.547 55.651 0.982335
14 35.425 55.671 1.042975
15 35.505 55.675 1.016236
16 35.600 55.681 0.985532
17 35.458 55.717 1.063691
18 35.538 55.720 1.037523
19 35.230 55.726 1.146047
20 35.606 55.707 1.003364
21 35.582 55.700 1.006711
22 35.350 55.696 1.087173
23 35.309 55.677 1.088988
24 35.563 55.687 1.003785
25 35.510 55.764 1.079220
26 35.334 55.736 1.119026
27 35.429 55.745 1.093300
28 35.366 55.752 1.119061
29 35.501 55.745 1.068676
.. ... ... ...
56 35.472 55.800 1.117183
57 35.538 55.855 1.134721
58 35.507 55.834 1.129712
59 35.256 55.845 1.211969
60 35.338 55.823 1.174397
61 35.404 55.835 1.162387
62 35.460 55.826 1.138965
63 35.497 55.831 1.130774
64 35.469 55.844 1.148516
65 35.371 55.510 0.945187
66 35.378 55.545 0.969400
67 35.456 55.502 0.902285
68 35.429 55.517 0.925932
69 35.367 55.710 1.090652
70 35.431 55.490 0.903296
71 35.284 55.606 1.051335
72 35.234 55.634 1.088135
73 35.284 55.591 1.041181
74 35.354 55.587 1.010446
75 35.332 55.581 1.015004
76 35.356 55.606 1.023234
77 35.311 55.545 0.997468
78 35.307 55.575 1.020845
79 35.363 55.645 1.047831
80 35.401 55.628 1.021373
81 35.340 55.629 1.045491
82 35.440 55.643 1.017227
83 35.293 55.630 1.063910
84 35.370 55.623 1.029797
85 35.238 55.601 1.065699
I try to create isolines with:
import numpy as np
from numpy import meshgrid, linspace
from mpl_toolkits.basemap import Basemap

data = tempcc
m = Basemap(lat_0=np.mean(tempcc['Y'].values),
            lon_0=np.mean(tempcc['X'].values),
            llcrnrlon=35, llcrnrlat=55.3,
            urcrnrlon=35.9, urcrnrlat=56.0, resolution='l')

x = linspace(m.llcrnrlon, m.urcrnrlon, data.shape[1])
y = linspace(m.llcrnrlat, m.urcrnrlat, data.shape[0])
xx, yy = meshgrid(x, y)

m.contour(xx, yy, data, latlon=True)
#pt.legend()
m.scatter(tempcc['X'].values, tempcc['Y'].values, latlon=True)
#m.contour(x,y,data,latlon=True)
But I can't get it right, although everything seems fine. As far as I understand, I have to make a 2D matrix of values where i is the latitude index and j the longitude index, but I can't find an example.
The result I get: as you see, the region is correct, but the interpolation is not good.
What's the matter? Which parameter have I forgotten?
You could use a Triangulation and then call tricontour() instead of contour():
import matplotlib.pyplot as plt
from matplotlib.tri import Triangulation
from mpl_toolkits.basemap import Basemap
import numpy as np

m = Basemap(lat_0=np.mean(tempcc['Y'].values),
            lon_0=np.mean(tempcc['X'].values),
            llcrnrlon=35, llcrnrlat=55.3,
            urcrnrlon=35.9, urcrnrlat=56.0, resolution='l')

triMesh = Triangulation(tempcc['X'].values, tempcc['Y'].values)
tctr = m.tricontour(triMesh, tempcc['Temp'].values,
                    levels=np.linspace(min(tempcc['Temp'].values),
                                       max(tempcc['Temp'].values), 7),
                    latlon=True)
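If Basemap is hard to install (it is deprecated in favor of Cartopy), the same triangulation idea works with plain matplotlib tricontour(); the random coordinates below are only stand-ins for the tempcc columns:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # off-screen rendering
import matplotlib.pyplot as plt
from matplotlib.tri import Triangulation

# Synthetic lon/lat/temperature standing in for tempcc's 86 rows
rng = np.random.default_rng(0)
x = rng.uniform(35.2, 35.6, 86)
y = rng.uniform(55.5, 55.9, 86)
temp = 1.0 + 0.5 * (y - 55.5)

tri = Triangulation(x, y)       # Delaunay triangulation of scattered points
fig, ax = plt.subplots()
cs = ax.tricontour(tri, temp,
                   levels=np.linspace(temp.min(), temp.max(), 7))
ax.clabel(cs, inline=True, fontsize=8)   # label the isolines
```

The point is that tricontour() interpolates on the irregular point cloud itself, so no regular 2D matrix of values is needed.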
I have a dataframe which looks something like this:
AgeGroups Factor Cancer Frequency
0 00-05 B Yes 223
1 00-05 A No 108
2 00-05 A Yes 0
3 00-05 B No 6575
4 11-15 B Yes 143
5 11-15 A No 5
6 11-15 A Yes 1
7 11-15 B No 3669
8 16-20 B Yes 395
9 16-20 A No 28
10 16-20 A Yes 1
11 16-20 B No 6174
12 21-25 B Yes 624
13 21-25 A No 80
14 21-25 A Yes 2
15 21-25 B No 8173
16 26-30 B Yes 968
17 26-30 A No 110
18 26-30 A Yes 2
19 26-30 B No 9143
20 31-35 B Yes 1225
21 31-35 A No 171
22 31-35 A Yes 5
23 31-35 B No 9046
24 36-40 B Yes 1475
25 36-40 A No 338
26 36-40 A Yes 21
27 36-40 B No 8883
28 41-45 B Yes 2533
29 41-45 A No 782
.. ... ... ... ...
54 71-75 A Yes 2441
55 71-75 B No 15992
56 76-80 B Yes 4614
57 76-80 A No 5634
58 76-80 A Yes 1525
59 76-80 B No 10531
60 81-85 B Yes 1869
61 81-85 A No 2893
62 81-85 A Yes 702
63 81-85 B No 5692
64 86-90 B Yes 699
65 86-90 A No 1398
66 86-90 A Yes 239
67 86-90 B No 3081
68 91-95 B Yes 157
69 91-95 A No 350
70 91-95 A Yes 47
71 91-95 B No 1107
72 96-100 B Yes 31
73 96-100 A No 35
74 96-100 A Yes 2
75 96-100 B No 230
76 >100 B Yes 5
77 >100 A No 1
78 >100 A Yes 1
79 >100 B No 30
80 06-10 B Yes 112
81 06-10 A No 6
82 06-10 A Yes 0
83 06-10 B No 2191
with the code:
by_factor = counts.groupby(level='Factor')
k = by_factor.ngroups
fig, axes = plt.subplots(1, k, sharex=True, sharey=False, figsize=(15, 8))
for i, (gname, grp) in enumerate(by_factor):
    grp.xs(gname, level='Factor').plot.bar(
        stacked=True, rot=45, ax=axes[i], title=gname)
fig.tight_layout()
I got a beautiful chart, which looks like this:
This actually served what I was looking for until I realized I wanted to readjust the y-axis so that both charts use the same scale. If you look at the right chart 'B', its y-axis goes up to 25000, while chart 'A' only goes up to 10000. Can anyone suggest the best approach to get the same scale on both charts?
I tried:
plt.ylim([0,25000])
which did nothing for chart 'A', because it basically only changes the y-axis of chart 'B'.
I would highly appreciate any suggestion to achieve the same scale for both plots.
Set the ylim min and max values for every axis in a loop:
for ax in axes:
    ax.set_ylim([0, 25000])
You may of course simply call .set_ylim() on the respective axes. The drawback is that you need to know the limits to set beforehand.
The following solutions do not have this requirement:
sharey
In your code you explicitly set sharey=False. If you change it to True, you get a shared y-axis. You can then use plt.ylim([0,25000]) to limit the axes, but you don't have to, since they are shared and will adjust automatically.
Minimal example:
import matplotlib.pyplot as plt
fig, (ax, ax2) = plt.subplots(ncols=2, sharex=True, sharey=True)
ax.plot([1,3,2])
ax2.plot([2,3,1])
plt.show()
As can be seen the ticklabels of the shared axes are hidden, which might be desirable in many cases.
join shared axes
Having two axes, you can make them share the same scale using
import matplotlib.pyplot as plt
fig, (ax, ax2) = plt.subplots(ncols=2, sharex=True, sharey=False)
ax.get_shared_y_axes().join(ax, ax2)
ax.plot([1,3,2])
ax2.plot([2,3,1])
plt.show()
Here the ticklabels stay visible. If you don't want that you can turn them off via ax2.set_yticklabels([]).
Using plt.ylim() will only adjust the axes for the last figure that was plotted. In order to change the y limits for a specific plot you need to use ax.set_ylim().
So in your case it would be
axes[0].set_ylim(0,25000)
axes[1].set_ylim(0,25000)