Plotting data from a list in Python

I need to plot the velocities of some objects (cars).
Each velocity is calculated by a routine and written to a file, roughly like this (I have deleted some lines to simplify):
thefile_v = open('vels.txt', 'w')
for car in cars:
    car.velocities.append(new_velocity)
    if len(car.velocities) > 4:
        try:
            thefile_v.write("%s\n" % car.velocities)  # write vels once we get 5 values
        except:
            print("Unexpected error:", sys.exc_info()[0])
            raise
thefile_v.close()  # close once, after all cars have been written
The result is a text file with one list of velocities per car, something like this:
[0.0, 3.8, 4.5, 4.3, 2.1, 2.2, 0.0]
[0.0, 2.8, 4.0, 4.2, 2.2, 2.1, 0.0]
[0.0, 1.8, 4.2, 4.1, 2.3, 2.2, 0.0]
[0.0, 3.8, 4.4, 4.2, 2.4, 2.4, 0.0]
Then I wanted to plot each velocity list:
with open('vels.txt') as f:
    lst = [line.rstrip() for line in f]
plt.plot(lst[1])  # let's plot the second line
plt.show()
This is what I found: the values are taken as strings and used as y-axis labels.
I got it working through this:
import numpy as np
y = np.fromstring(str(lst[1])[1:-1], dtype=float, sep=',')
plt.plot(y)
plt.show()
What I learnt is that the velocity lists I built previously were read back as lines of text.
I had to convert them to arrays to be able to plot them. However, the brackets [] were getting in the way, so I converted each line to a string and removed the brackets by slicing with [1:-1].
It is working now, but I'm sure there is a better way of doing this.
Any comments?

Say you had the list [0.0, 3.8, 4.5, 4.3, 2.1, 2.2, 0.0]. To graph it, the code would look something like:
import matplotlib.pyplot as plt
ys = [0.0, 3.8, 4.5, 4.3, 2.1, 2.2, 0.0]
xs = [x for x in range(len(ys))]
plt.plot(xs, ys)
plt.show()
# Make sure to close the plt object once done
plt.close()
If you wanted different intervals for the x axis, then:
interval_size = 2.4 #example interval size
xs = [x * interval_size for x in range(len(ys))]
Also, when reading your values from the text file, make sure that you have converted them from strings back to numbers. This may be why your code is treating your input as y-axis labels.
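The string-to-number conversion mentioned above can be sketched as follows. This is a minimal example, assuming each line of the file is a Python-style list literal such as `[0.0, 3.8, 4.5]` (as in the question); `parse_velocity_lines` is a hypothetical helper name, not something from the original code.

```python
import ast

def parse_velocity_lines(lines):
    """Convert lines such as '[0.0, 3.8, 4.5]' into lists of floats."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # literal_eval safely evaluates the list literal, brackets included
        rows.append([float(v) for v in ast.literal_eval(line)])
    return rows

# Two sample lines copied from the question's vels.txt
sample = [
    "[0.0, 3.8, 4.5, 4.3, 2.1, 2.2, 0.0]\n",
    "[0.0, 2.8, 4.0, 4.2, 2.2, 2.1, 0.0]\n",
]
velocities = parse_velocity_lines(sample)
```

With the real file, an open file object can be passed directly (`with open('vels.txt') as f: velocities = parse_velocity_lines(f)`), and `plt.plot(velocities[1])` would then plot the second car.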

The example is not complete, so some assumptions must be made here. In general, use numpy or pandas to store your data.
Suppose car is an object with a velocity attribute. You can collect all velocities in a list, save this list as a text file with numpy, read it back with numpy, and plot it.
import numpy as np
import matplotlib.pyplot as plt
class Car():
    def __init__(self):
        self.velocity = np.random.rand(5)
cars = [Car() for _ in range(5)]
velocities = [car.velocity for car in cars]
np.savetxt("vels.txt", np.array(velocities))
####
vels = np.loadtxt("vels.txt")
plt.plot(vels.T)
## or plot only the first velocity
#plt.plot(vels[0])
plt.show()

Here is one possible easy solution: use the map function. Say the data in your file is stored like this, without the [ and ] characters, which cannot be converted to float:
#file_name: test_example.txt
0.0, 3.8, 4.5, 4.3, 2.1, 2.2, 0.0
0.0, 2.8, 4.0, 4.2, 2.2, 2.1, 0.0
0.0, 1.8, 4.2, 4.1, 2.3, 2.2, 0.0
0.0, 3.8, 4.4, 4.2, 2.4, 2.4, 0.0
Then the next step is:
import matplotlib.pyplot as plt
path = r'VAR_DIRECTORY/test_example.txt' #the full path of the file
with open(path,'rt') as f:
    ltmp = [list(map(float,line.split(','))) for line in f]
plt.plot(ltmp[1],'r-')
plt.show()
Above, I just assume you want to plot the second line, 0.0, 2.8, 4.0, 4.2, 2.2, 2.1, 0.0.

Related

Pyqt step size for lists with a large gap between average and maximum value

I am building a tool in PyQt that has a slider which sorts areas of geometries.
In some cases the data can have an extremely large gap between the average or minimum area value and the maximum.
For example:
areas = [0.5, 1.0, 1.3, 1.7, 2.6, 3.0, 3.5, 3.9, 4.0, 4.1, 1023.8, 3245.4, 3734.3]
When I set the min and max values for the slider, about 95% of the slider bar covers values from 5 to 3245.4. I am looking for a way to make some kind of dynamic step size for the slider so that every value in the areas list can be reached, i.e. something like this:
0________________________________5.0____________________________________3734.3
^ ^ ^
If you have a list of values and don't want something like a log slider, you can subclass QSlider and handle the values based on the indexes in the list.
A quick example:
from PyQt5.QtCore import Qt, pyqtSignal  # assuming PyQt5
from PyQt5.QtWidgets import QApplication, QSlider
class Slider(QSlider):
    # emits the list value that corresponds to the current integer position
    doubleValueChanged = pyqtSignal(float)
    def __init__(self, values, parent=None):
        super().__init__(parent)
        self.values = values
        self.setOrientation(Qt.Horizontal)
        self.setRange(0, len(self.values) - 1)
        self.setTickInterval(1)
        self.valueChanged.connect(lambda index: self.doubleValueChanged.emit(self.values[index]))
if __name__ == '__main__':
    import sys
    app = QApplication(sys.argv)
    slider = Slider([0.5, 1.0, 1.3, 1.7, 2.6, 3.0, 3.5, 3.9, 4.0, 4.1, 1023.8, 3245.4, 3734.3])
    slider.doubleValueChanged.connect(lambda value: print("New value", value))
    slider.show()
    sys.exit(app.exec_())
I created a new signal because QSlider works only with integers.
If you want to add the labels, you can redefine QSlider.paintEvent
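Because the slider above works on list indexes, setting it from an area value needs the reverse lookup. The sketch below shows one way to do that without any Qt dependency, assuming the values list is sorted in ascending order; `nearest_index` is a hypothetical helper, not part of the answer's code or of PyQt.

```python
from bisect import bisect_left

def nearest_index(values, target):
    """Return the index of the entry in sorted `values` closest to `target`."""
    pos = bisect_left(values, target)
    if pos == 0:
        return 0
    if pos == len(values):
        return len(values) - 1
    before, after = values[pos - 1], values[pos]
    # pick whichever neighbour is closer; ties go to the lower index
    return pos if after - target < target - before else pos - 1

areas = [0.5, 1.0, 1.3, 1.7, 2.6, 3.0, 3.5, 3.9, 4.0, 4.1, 1023.8, 3245.4, 3734.3]
index = nearest_index(areas, 1000.0)
```

The resulting index could then be passed to the slider's `setValue`, which moves the handle to the position that represents the closest area.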

Missing first entry when writing data to csv using numpy.savetxt()

I'm trying to write a numpy array to a .csv using numpy.savetxt using a comma delimiter, however it's missing the very first entry (row 1 column 1), and I have no idea why.
I'm fairly new to programming in Python, and this might be simply a problem with the way I'm calling numpy.savetxt or maybe the way I'm defining my array. Anyway here's my code:
import numpy as np
import csv
# preparing csv file
csvfile = open("np_csv_test.csv", "w")
columns = "ymin, ymax, xmin, xmax\n"
csvfile.write(columns)
measurements = np.array([[0.9, 0.3, 0.2, 0.4],
[0.8, 0.5, 0.2, 0.3],
[0.6, 0.7, 0.1, 0.5]])
np.savetxt("np_csv_test.csv", measurements, delimiter = ",")
I expected four columns with 3 rows under the headers ymin, ymax, xmin, and xmax, and I did, but I'm missing 0.9. As in, row 2 column 1 of my .csv is empty, and in Notepad I'm getting:
ymin, ymax, xmin, xmax
,2.999999999999999889e-01,2.000000000000000111e-01,4.000000000000000222e-01
8.000000000000000444e-01,5.000000000000000000e-01,2.000000000000000111e-01,2.999999999999999889e-01
5.999999999999999778e-01,6.999999999999999556e-01,1.000000000000000056e-01,5.000000000000000000e-01
What am I doing wrong?
When you call np.savetxt with a path to the output file, it will try to overwrite any existing file, which is not what you want. Here's how you can write your desired file with column headers:
import numpy as np
# preparing csv file
columns = "ymin, ymax, xmin, xmax"
measurements = np.array([[0.9, 0.3, 0.2, 0.4],
[0.8, 0.5, 0.2, 0.3],
[0.6, 0.7, 0.1, 0.5]])
# comments="" stops savetxt from prefixing the header line with "# "
np.savetxt("np_csv_test.csv", measurements, delimiter = ",", header=columns, comments="")
As pointed out by Andy in the comments, you can get np.savetxt to append to an existing file by passing in a file handle instead of a file name. So another valid way to get the file you want would be:
import numpy as np
import csv
# preparing csv file
csvfile = open("np_csv_test.csv", "w")
columns = "ymin, ymax, xmin, xmax\n"
csvfile.write(columns)
measurements = np.array([[0.9, 0.3, 0.2, 0.4],
[0.8, 0.5, 0.2, 0.3],
[0.6, 0.7, 0.1, 0.5]])
np.savetxt(csvfile, measurements, delimiter = ",")
# have to close the file yourself in this case
csvfile.close()
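The file-handle approach can also be written with a `with` block so the file is closed automatically. This is a minimal sketch under a few assumptions: `write_csv_with_header` is a hypothetical helper name, the output goes to a temporary directory, and `fmt="%.1f"` is chosen only to keep the output short.

```python
import os
import tempfile

import numpy as np

def write_csv_with_header(path, header, data):
    """Write a header line, then the array rows, through one open handle."""
    with open(path, "w") as fh:
        fh.write(header + "\n")
        # passing the handle (not the path) keeps the header already written
        np.savetxt(fh, data, delimiter=",", fmt="%.1f")

measurements = np.array([[0.9, 0.3, 0.2, 0.4],
                         [0.8, 0.5, 0.2, 0.3],
                         [0.6, 0.7, 0.1, 0.5]])
out_path = os.path.join(tempfile.mkdtemp(), "np_csv_test.csv")
write_csv_with_header(out_path, "ymin,ymax,xmin,xmax", measurements)

with open(out_path) as fh:
    lines = fh.read().splitlines()
```

Here `lines` holds the header followed by the three data rows, with the first entry (0.9) intact.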

Display more attributes in the decision tree

I am currently viewing the decision tree using the following code. Is there a way that we can export some calculated fields as output too?
For example, is it possible to display the sum of an input attribute at each node, i.e. sum of feature 1 from 'X' data array in the leafs of the tree.
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:]
y = iris.target
#%%
from sklearn.tree import DecisionTreeClassifier
alg=DecisionTreeClassifier( max_depth=5,min_samples_leaf=2, max_leaf_nodes = 10)
alg.fit(X,y)
#%%
## View tree
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(alg,out_file=None, node_ids = True, proportion = True, class_names = True, filled = True, rounded = True)
graph = graphviz.Source(dot_data)
graph
There is plenty of discussion about decision trees in scikit-learn on the github page. There are answers on this SO question and this scikit-learn documentation page that provide the framework to get you started. With all the links out of the way, here are some functions that allow a user to address the question in a generalizable manner. The functions could be easily modified since I don't know if you mean all the leaves or each leaf individually. My approach is the latter.
The first function uses apply as a cheap way to find the indices of the leaf nodes. It's not necessary to achieve what you're asking, but I included it as a convenience since you mentioned you want to investigate leaf nodes and leaf node indices may be unknown a priori.
def find_leaves(X, clf):
    """A cheap function to find leaves of a DecisionTreeClassifier
    clf must be a fitted DecisionTreeClassifier
    """
    return set(clf.apply(X))
Result on the example:
find_leaves(X, alg)
{1, 7, 8, 9, 10, 11, 12}
The following function will return an array of values that satisfy the conditions of node and feature, where node is the index of the node from the tree that you want values for and feature is the column (or feature) that you want from X.
def node_feature_values(X, clf, node=0, feature=0, require_leaf=False):
    """this function will return an array of values
    from the input array X. Array values will be limited to
    1. samples that passed through <node>
    2. and from the feature <feature>.
    clf must be a fitted DecisionTreeClassifier
    """
    leaf_ids = find_leaves(X, clf)
    if (require_leaf and
            node not in leaf_ids):
        print("<require_leaf> is set, "
              "select one of these nodes:\n{}".format(leaf_ids))
        return
    # a sparse array that contains node assignment by sample
    node_indicator = clf.decision_path(X)
    node_array = node_indicator.toarray()
    # which samples at least passed through the node
    samples_in_node_mask = node_array[:,node]==1
    return X[samples_in_node_mask, feature]
Applied to the example:
values_arr = node_feature_values(X, alg, node=12, feature=0, require_leaf=True)
array([6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7,
5.8, 6.4, 6.5, 7.7, 7.7, 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.1, 6.4,
7.4, 7.9, 6.4, 7.7, 6.3, 6.4, 6.9, 6.7, 6.9, 5.8, 6.8, 6.7, 6.7,
6.3, 6.5, 6.2, 5.9])
Now the user can perform whatever mathematical operation is desired on the subset of samples for a given feature.
i.e. sum of feature 1 from 'X' data array in the leafs of the tree.
print("There are {} total samples in this node, "
"{}% of the total".format(len(values_arr), len(values_arr) / float(len(X))*100))
print("Feature Sum: {}".format(values_arr.sum()))
There are 43 total samples in this node, 28.666666666666668% of the total
Feature Sum: 286.69999999999993
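If the sum is wanted for every leaf at once, the leaf assignments returned by `clf.apply(X)` can be grouped directly. The sketch below is one way to do that, assuming a hypothetical helper `feature_sum_per_leaf` that takes those leaf ids as a plain array; the small arrays are made-up illustration data, not output from the iris tree.

```python
import numpy as np

def feature_sum_per_leaf(leaf_ids, X, feature=0):
    """Sum one feature column of X separately for each leaf id."""
    leaf_ids = np.asarray(leaf_ids)
    X = np.asarray(X, dtype=float)
    return {
        int(leaf): float(X[leaf_ids == leaf, feature].sum())
        for leaf in np.unique(leaf_ids)
    }

# Made-up example: three samples, the first two land in leaf 1, the last in leaf 2
toy_leaf_ids = [1, 1, 2]
toy_X = [[1.0, 5.0], [2.0, 6.0], [4.0, 7.0]]
sums = feature_sum_per_leaf(toy_leaf_ids, toy_X, feature=0)
```

With the fitted tree from the question, the call would be `feature_sum_per_leaf(alg.apply(X), X, feature=0)`.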
Update
After re-reading the question, this is the only solution I can put together quickly that doesn't involve modifying scikit source code for export.py. Code below still relies on previously defined functions. This code modifies the dotstring via pydot and networkx.
# Load the data from `dot_data` variable, which you defined.
import pydot
dot_graph = pydot.graph_from_dot_data(dot_data)[0]
import networkx as nx
MG = nx.nx_pydot.from_pydot(dot_graph)
# Select a `feature` and edit the `dot` string in `networkx`.
feature = 0
for n in find_leaves(X, alg):
    nfv = node_feature_values(X, alg, node=n, feature=feature)
    # `MG.nodes` replaces the older `MG.node` accessor, which networkx 2.4 removed
    MG.nodes[str(n)]['label'] = MG.nodes[str(n)]['label'] + "\nfeature_{} sum: {}".format(feature, nfv.sum())
# Export the `networkx` graph then plot using `graphviz.Source()`
new_dot_data = nx.nx_pydot.to_pydot(MG)
graph = graphviz.Source(new_dot_data.create_dot())
graph
Notice all the leaves have the sum of values from X for feature 0.
I think the best way to accomplish what you're asking would be to modify tree.py and/or export.py to natively support this feature.

Aligning two data sets in Python

I want to develop some python code to align datasets obtained by different instruments recording the same event.
As an example, say I have two sets of measurements:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Define some data
data1 = pd.DataFrame({'TIME':[1.1, 2.4, 3.2, 4.1, 5.3],\
'VALUE':[10.3, 10.5, 11.0, 10.9, 10.7],\
'ERROR':[0.2, 0.1, 0.4, 0.3, 0.2]})
data2 = pd.DataFrame({'TIME':[0.9, 2.1, 2.9, 4.2],\
'VALUE':[18.4, 18.7, 18.9, 18.8],\
'ERROR':[0.3, 0.2, 0.5, 0.4]})
# Plot the data
plt.errorbar(data1.TIME, data1.VALUE, yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE, yerr=data2.ERROR, fmt='bo')
plt.show()
The result is plotted here:
What I would like to do now is to align the second dataset (data2) to the first one (data1). i.e. to get this:
The second dataset must be shifted to match the first one by subtracting a constant (to be determined) from all its values. All I know is that the datasets are correlated since the two instruments are measuring the same event but with different sampling rates.
At this stage I do not want to make any assumptions about what function best describes the data (fitting will be done after alignment).
I am cautious about using means to perform shifts since it may produce bad results, depending on how the data is sampled. I was considering taking each data2[TIME_i] and working out the shortest distance to data1[~TIME_i]. Then minimizing the sum of those. But I am not sure that would work well either.
Does anyone have any suggestions on a good method to use? I looked at mlpy but it seems to only work on 1D arrays.
Thanks.
You can subtract the mean of the difference: data2.VALUE-(data2.VALUE - data1.VALUE).mean()
import pandas as pd
import matplotlib.pyplot as plt
# Define some data
data1 = pd.DataFrame({
'TIME': [1.1, 2.4, 3.2, 4.1, 5.3],
'VALUE': [10.3, 10.5, 11.0, 10.9, 10.7],
'ERROR': [0.2, 0.1, 0.4, 0.3, 0.2],
})
data2 = pd.DataFrame({
'TIME': [0.9, 2.1, 2.9, 4.2],
'VALUE': [18.4, 18.7, 18.9, 18.8],
'ERROR': [0.3, 0.2, 0.5, 0.4],
})
# Plot the data
plt.errorbar(data1.TIME, data1.VALUE, yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE-(data2.VALUE - data1.VALUE).mean(),
yerr=data2.ERROR, fmt='bo')
plt.show()
Another possibility is to subtract the mean of each series.
You can calculate the offset between the averages and subtract it from every value in the second dataset; the two series should then align relatively well. This assumes both datasets look relatively similar, so it might not work well in every case.
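Since the two instruments sample at different times, one variation on the mean-offset idea is to compare the series at matching times. The sketch below uses linear interpolation, which is a swapped-in technique rather than what the answers above do: it interpolates the reference series at the other series' timestamps, restricted to the overlapping time range, and averages the differences. `estimate_offset` is a hypothetical helper, and it assumes the reference times are sorted in ascending order.

```python
import numpy as np

def estimate_offset(t_ref, v_ref, t_other, v_other):
    """Mean difference between `other` and `ref`, compared at `other`'s times."""
    t_ref, v_ref = np.asarray(t_ref, float), np.asarray(v_ref, float)
    t_other, v_other = np.asarray(t_other, float), np.asarray(v_other, float)
    # keep only points inside the reference time range to avoid extrapolation
    overlap = (t_other >= t_ref.min()) & (t_other <= t_ref.max())
    v_ref_at_other = np.interp(t_other[overlap], t_ref, v_ref)
    return float(np.mean(v_other[overlap] - v_ref_at_other))

# Data from the question
t1, v1 = [1.1, 2.4, 3.2, 4.1, 5.3], [10.3, 10.5, 11.0, 10.9, 10.7]
t2, v2 = [0.9, 2.1, 2.9, 4.2], [18.4, 18.7, 18.9, 18.8]
offset = estimate_offset(t1, v1, t2, v2)
aligned_v2 = np.asarray(v2) - offset
```

Linear interpolation does impose a mild assumption about how the signal behaves between samples, so this is only one option to weigh against the plain mean-based shift.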
Although this question is not Matlab related, you might still be interested in this:
Remove unknown DC Offset from a non-periodic discrete time signal

Pure Python faster than Numpy? can I make this numpy code faster?

I need to compute the min, max, and mean from a specific list of faces/vertices. I tried to optimize this computation with NumPy, but without success.
Here is my test case:
#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
'''
Module Started 22 Feb. 2013
#note: test case comparison numpy vs python
#author: Python4D/damien
'''
import numpy as np
import time
def Fnumpy(vertices):
    np_vertices=np.array(vertices)
    _x=np_vertices[:,:,0]
    _y=np_vertices[:,:,1]
    _z=np_vertices[:,:,2]
    _min=[np.min(_x),np.min(_y),np.min(_z)]
    _max=[np.max(_x),np.max(_y),np.max(_z)]
    _mean=[np.mean(_x),np.mean(_y),np.mean(_z)]
    return _mean,_max,_min
def Fpython(vertices):
    list_x=[item[0] for sublist in vertices for item in sublist]
    list_y=[item[1] for sublist in vertices for item in sublist]
    list_z=[item[2] for sublist in vertices for item in sublist]
    taille=len(list_x)
    _mean=[sum(list_x)/taille,sum(list_y)/taille,sum(list_z)/taille]
    _max=[max(list_x),max(list_y),max(list_z)]
    _min=[min(list_x),min(list_y),min(list_z)]
    return _mean,_max,_min
if __name__=="__main__":
    vertices=[[[1.1,2.2,3.3,4.4]]*4]*1000000
    _t=time.clock()
    print ">>NUMPY >>{} for {}s.".format(Fnumpy(vertices),time.clock()-_t)
    _t=time.clock()
    print ">>PYTHON>>{} for {}s.".format(Fpython(vertices),time.clock()-_t)
The results are:
Numpy:
([1.1000000000452519, 2.2000000000905038, 3.3000000001880174], [1.1000000000000001, 2.2000000000000002, 3.2999999999999998], [1.1000000000000001, 2.2000000000000002, 3.2999999999999998]) for 27.327068618s.
Python:
([1.100000000045252, 2.200000000090504, 3.3000000001880174], [1.1, 2.2, 3.3], [1.1, 2.2, 3.3]) for 1.81366938593s.
Pure Python is 15x faster than Numpy!
The reason your Fnumpy is slower is that it contains an additional step not done by Fpython: the creation of a numpy array in memory. If you move the line np_vertices=np.array(vertices) outside of Fnumpy and the timed section, your results will be very different:
>>NUMPY >>([1.1000000000452519, 2.2000000000905038, 3.3000000001880174], [1.1000000000000001, 2.2000000000000002, 3.2999999999999998], [1.1000000000000001, 2.2000000000000002, 3.2999999999999998]) for 0.500802s.
>>PYTHON>>([1.100000000045252, 2.200000000090504, 3.3000000001880174], [1.1, 2.2, 3.3], [1.1, 2.2, 3.3]) for 2.182239s.
You can also speed up the allocation step significantly by providing a datatype hint to numpy when you create it. If you tell Numpy you have an array of floats, then even if you leave the np.array() call in the timing loop it will beat the pure python version.
If I change np_vertices=np.array(vertices) to np_vertices=np.array(vertices, dtype=np.float64) and keep it in Fnumpy, the Fnumpy version will beat Fpython even though it has to do a lot more work:
>>NUMPY >>([1.1000000000452519, 2.2000000000905038, 3.3000000001880174], [1.1000000000000001, 2.2000000000000002, 3.2999999999999998], [1.1000000000000001, 2.2000000000000002, 3.2999999999999998]) for 1.586066s.
>>PYTHON>>([1.100000000045252, 2.200000000090504, 3.3000000001880174], [1.1, 2.2, 3.3], [1.1, 2.2, 3.3]) for 2.196787s.
As already pointed out by others, your problem is the conversion from list to array. By using the appropriate numpy functions for that, you will beat Python. I modified the main part of your program:
if __name__=="__main__":
    _t = time.clock()
    vertices_np = np.resize(np.array([ 1.1, 2.2, 3.3, 4.4 ], dtype=np.float64),
                            (1000000, 4, 4))
    print "Creating numpy vertices: {}".format(time.clock() - _t)
    _t = time.clock()
    vertices=[[[1.1,2.2,3.3,4.4]]*4]*1000000
    print "Creating python vertices: {}".format(time.clock() - _t)
    _t=time.clock()
    print ">>NUMPY >>{} for {}s.".format(Fnumpy(vertices_np),time.clock()-_t)
    _t=time.clock()
    print ">>PYTHON>>{} for {}s.".format(Fpython(vertices),time.clock()-_t)
Running your code with the modified main part results on my machine in:
Creating numpy vertices: 0.6
Creating python vertices: 0.01
>>NUMPY >>([1.1000000000452519, 2.2000000000905038, 3.3000000001880174],
[1.1000000000000001, 2.2000000000000002, 3.2999999999999998], [1.1000000000000001,
2.2000000000000002, 3.2999999999999998]) for 0.5s.
>>PYTHON>>([1.100000000045252, 2.200000000090504, 3.3000000001880174], [1.1, 2.2, 3.3],
[1.1, 2.2, 3.3]) for 1.91s.
Although creating the array with Numpy tools still takes somewhat longer than creating the nested lists with Python's list multiplication operator (0.6s versus 0.01s), you gain a factor of about 4 for the run-time relevant part of your code. If I replace the line:
np_vertices=np.array(vertices)
with
np_vertices = np.asarray(vertices)
to avoid the copying of a big array, the running time of the numpy function even goes down to 0.37s on my machine, which is more than 5 times faster than the pure Python version.
In your real code, if you know the number of vertices in advance, you can preallocate the appropriate array via np.empty(), then fill it with the appropriate data, and pass it to the numpy-version of your function.
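The preallocation idea could look roughly like the sketch below. It assumes the number of faces is known in advance and uses a hypothetical `stats_xyz` helper that reduces along an axis instead of calling `np.min`, `np.max` and `np.mean` once per coordinate; the fill step here simply broadcasts one vertex, where real code would write the actual data.

```python
import numpy as np

def stats_xyz(vertices):
    """Return (mean, max, min) of the x, y, z columns of an (n, 4, 4) array."""
    # flatten faces and vertices into rows of (x, y, z), dropping the 4th component
    xyz = vertices[:, :, :3].reshape(-1, 3)
    return xyz.mean(axis=0), xyz.max(axis=0), xyz.min(axis=0)

n_faces = 1000  # kept small here; the question uses 1000000
vertices = np.empty((n_faces, 4, 4), dtype=np.float64)  # preallocate once
vertices[:] = [1.1, 2.2, 3.3, 4.4]  # fill in place (placeholder data)

mean, vmax, vmin = stats_xyz(vertices)
```

Because the array already has a float dtype when it reaches the function, no list-to-array conversion happens inside the timed part.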
