Struggling to correctly utilize the str.find() function - python

I'm trying to use str.find() and it keeps raising an error. What am I doing wrong?
I have a matrix where the 1st column is numbers and the 2nd is an abbreviation assigned to those numbers. The abbreviations are either ED, LI or NA. I'm trying to find the positions that correspond to those abbreviations so that I can plot a scatter graph that is colour coded to match those 3 groups.
mat=sio.loadmat('PBMC_extract.mat') #loading the data file into python
data=mat['spectra']
data_name=mat['name'] #calling in variable
data_name = pd.DataFrame(data_name) #converting into a readable matrix
pca=PCA(n_components=20) # performs PCA on data with 20 components
pca.fit(data) #fits to data set
datatrans=pca.transform(data) #transforms data using PCA
# plotting the graph that accounts for majority of data and noise
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
fig = plt.figure()
ax1 = Axes3D(fig)
#str.find to find individual positions of anticoagulants
str.find(data_name,'ED')
#renaming data for easiness
x_data=datatrans[0:35,0]
y_data=datatrans[0:35,1]
z_data=datatrans[0:35,2]
x2_data=datatrans[36:82,0]
y2_data=datatrans[36:82,1]
z2_data=datatrans[36:82,2]
x3_data=datatrans[83:97,0]
y3_data=datatrans[83:97,1]
z3_data=datatrans[83:97,2]
# scatter plot of score of PC1,2,3
ax1.scatter(x_data, y_data, z_data,c='b', marker="^")
ax1.scatter(x2_data, y2_data, z2_data,c='r', marker="o")
ax1.scatter(x3_data, y3_data, z3_data,c='g', marker="s")
ax1.set_xlabel('PC 1')
ax1.set_ylabel('PC 2')
ax1.set_zlabel('PC 3')
plt.show()
The error that keeps showing up is the following:
File "/Users/emma/Desktop/Final year project /working example of colouring data", line 49, in <module>
str.find(data_name,'ED')
TypeError: descriptor 'find' requires a 'str' object but received a 'DataFrame'

The error occurs because the find method expects a str object, not a DataFrame. As PiRK mentioned, the problem is that you're replacing the data_name variable here:
data_name = pd.DataFrame(data_name)
I believe it should be:
data = pd.DataFrame(data_name)
Also, although calling str.find(s, 'ED') works when s is a str, the usual way is to call the method on the string itself and pass only the search term, like this:
data_name.find('ED')
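To make the point about str.find() concrete, here is a minimal sketch. The label "sample_ED_01" is a made-up value, not taken from the asker's file. It shows the two equivalent call forms on a real string and the kind of TypeError you get when the first argument is not a str.

```python
name = "sample_ED_01"  # hypothetical label, for illustration only

method_form = name.find("ED")       # usual form: call the method on the string
class_form = str.find(name, "ED")   # same call, written through the class
missing = name.find("LI")           # find() returns -1 when nothing matches

print(method_form, class_form, missing)

# Passing anything that is not a str as the first argument raises TypeError,
# which is what happens when a DataFrame is passed in the question.
try:
    str.find(["ED"], "ED")
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```

Note that find() only works on a single string, so it has to be applied to each label in turn rather than to the whole table at once.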

The proper syntax would be
data_name.find('ED')
Look at the examples here:
https://www.programiz.com/python-programming/methods/string/find
EDIT 1
Though I just noticed data_name is a pandas DataFrame, so that won't work. What exactly are you trying to do?
Your function call also doesn't assign its result to a variable, so it's hard to answer your question.
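If the goal is the row positions of each anticoagulant group, one possible approach is sketched below. It assumes the names have already been pulled out of the loaded file into a plain list of Python strings; the list shown is invented, and the real structure of mat['name'] may need unpacking first.

```python
# Hypothetical labels, one per spectrum, standing in for the real name column.
names = ["ED", "ED", "LI", "NA", "LI", "ED"]

# Collect the row positions belonging to each abbreviation.
groups = {"ED": [], "LI": [], "NA": []}
for row, name in enumerate(names):
    for code in groups:
        if name.find(code) != -1:   # same test as `code in name`
            groups[code].append(row)

print(groups)

# With NumPy arrays, these lists can index the PCA scores directly,
# e.g. datatrans[groups["ED"], 0] for the PC 1 scores of the ED rows.
```

This avoids hard-coding slices such as 0:35, which silently break if the rows are ever reordered.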

Related

Is there a way to use precomputed distances in Seaborn with the colour bar/bar code representation?

I'm trying to use Seaborn to plot my dendrogram because its graphics are much better than the dendrograms I have produced so far, but I need my graphs to show specific information and I'm having trouble getting Seaborn to do everything. I'm using a precomputed distance matrix because I'm using a novel distance measure, which makes things more difficult. I have so far:
Data = read in the data, a dict of pairwise distance measures: {(User1, User2) : 0.0001445, (User1, User3): 0.593983, etc.......}
keys = [sorted(k) for k in Data.keys()]
values = Data.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))
Links = sch.linkage(distances, 'ward')
labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))
Labels = pd.Series(labels)
lut = dict(zip(Labels.unique(), "rbgycm"))
row_colours = Labels.map(lut)
g = sns.clustermap(distances) #gives me a dendrogram and heatmap successfully.
g = sns.clustermap(distances, row_colors=row_colours) #gives me an error message
I want to use the second command to include the barcode-style representation of each 'user' in a cluster, as in the 3rd example in Seaborn's clustermap docs, but I get the error
g = sns.clustermap(Links, row_colors=row_colours)
File "/Users/.../seaborn/_decorators.py", line 46, in inner_f
return f(**kwargs)
File "/Users/.../seaborn/matrix.py", line 1400, in clustermap
colors_ratio=colors_ratio, cbar_pos=cbar_pos)
File "/Users/.../seaborn/matrix.py", line 813, in __init__
self._preprocess_colors(data, row_colors, axis=0)
File "/Users/.../seaborn/matrix.py", line 877, in _preprocess_colors
colors = colors.reindex(data.index)
AttributeError: 'numpy.ndarray' object has no attribute 'index'
In addition, Seaborn gives me the truncated dendrogram and then a heatmap of the 4 main clusters. I can only assume this is something generic, or it chose where to optimally cut the dendrogram. I hope to be able to specify the number of clusters, as I need 6. Is this possible?
This is what I have so far:
But I want to add this colour bar:
Everything I've found on Seaborn uses a preloaded dataset and a classic distance measure, and I'm new to using libraries and packages, so I don't really know what it needs for this to work. I've tried using some wrappers to change the types but haven't got anything to work. I don't understand what difference the colour bar makes, or why it brings a requirement for a new type when the colour-deciding command results in the same types and entries as the Iris example on the Seaborn site. It can manage the dendrogram from my Links variable. The colours are just labels and corresponding colours with indexes. I don't know what I'm missing to make it run.
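The last frame of the traceback calls colors.reindex(data.index), so the data handed to clustermap apparently needs an index, which a tuple or NumPy array does not have. A possible first step, sketched here in plain Python with invented distances, is to turn the pair dictionary into a labelled square matrix. The variable names are illustrative and this is not a tested Seaborn recipe.

```python
# Hypothetical stand-in for the dict of pairwise distances in the question.
pairwise = {
    ("User1", "User2"): 0.1,
    ("User1", "User3"): 0.6,
    ("User2", "User3"): 0.5,
}

# Every label that appears in any pair, in a fixed sorted order.
labels = sorted({name for pair in pairwise for name in pair})
position = {name: i for i, name in enumerate(labels)}

# Symmetric square matrix with zeros on the diagonal.
square = [[0.0] * len(labels) for _ in labels]
for (a, b), dist in pairwise.items():
    i, j = position[a], position[b]
    square[i][j] = square[j][i] = dist

for label, row in zip(labels, square):
    print(label, row)

# Assumption: wrapping this as pd.DataFrame(square, index=labels, columns=labels)
# gives the data an index, and a row_colors Series indexed by the same labels
# can then be aligned against it.
```

clustermap also has row_linkage and col_linkage parameters for passing a precomputed linkage, which may be the way to reuse the Links variable instead of letting Seaborn recluster the matrix.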

Problems with matplotlib and datetime

I am trying to plot the observed and calculated values of a time series parameter using matplotlib. The observed data are stored in an Excel file which I read using openpyxl and convert to a pandas dataframe. The simulated values are read as elapsed days, which I convert to numpy datetime using
delt = get_simulated_time()
t0 = np.datetime64('2004-01-01T00:00:00')
tsim = t0 + np.asarray(delt).astype('timedelta64[D]')
I plot the data using the following code snippet
df = obs_data_df.query("block=='blk-7'")
pobs = df['pres']
tobs = df['date']
tobs = np.array(tobs, dtype='datetime64')
print(type(tobs), np.min(tobs), np.max(tobs))
axs.plot(tobs, pobs, '.', color='g', label='blk-7, obs', markersize=8)
tsim = np.array(curr_sim_obj.tsim, dtype='datetime64')
print(type(tsim), np.min(tsim), np.max(tsim))
axs.plot(tsim, curr_sim_obj.psim[:, 0], '-', color='g', label='blk-7, sim', linewidth=1)
The results of the print statements are:
print(type(tobs), np.min(tobs), np.max(tobs))
... <class 'numpy.ndarray'> 2004-06-01T00:00:00.000000000 2020-06-01T00:00:00.000000000
print(type(tsim), np.min(tsim), np.max(tsim))
... <class 'numpy.ndarray'> 2004-01-01T00:00:00 2020-07-20T00:00:00
These types look OK but I get this error message from matplotlib:
ValueError: view limit minimum -36907.706903627106 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units
I don't understand why I am getting this message since the print statements indicate that the data are consistent. I tried investigating further using
print(np.dtype(tsim), np.min(tobs), np.max(tobs))
but get this error message:
TypeError: data type not understood
This has confused me even further since I set the tobs data type in the preceding statement. I have to say that I am really confused about the differences in the way that python, pandas and numpy handle dates and the various code kludges above reflect workarounds that I have picked up along the way. I would basically like to know how to plot the two different time series on the same plot so all suggestions very welcome. Thank you in advance!
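On the np.dtype(tsim) error specifically: np.dtype() builds a dtype from a type specification and does not inspect an array, so passing it an array raises TypeError. The array's own type is available as its .dtype attribute. A small sketch, using made-up elapsed days in place of get_simulated_time():

```python
import numpy as np

delt = [0, 31, 366]  # hypothetical elapsed days since the start date
t0 = np.datetime64('2004-01-01T00:00:00')
tsim = t0 + np.asarray(delt).astype('timedelta64[D]')

# Inspect the array's type through its attribute.
print(tsim.dtype)
print(np.issubdtype(tsim.dtype, np.datetime64))

# np.dtype() expects a type specification, not an array.
try:
    np.dtype(tsim)
    dtype_call_failed = False
except TypeError as exc:
    dtype_call_failed = True
    print("TypeError:", exc)
```

Checking tobs.dtype and tsim.dtype this way is a quick test of whether both series really are datetime64 before they reach the plot call.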
Update:
While cutting down the code to get a simpler case that reproduced the error I found the following code buried in the plotting routine:
axs.plot(10*np.random.randn(100), 10*np.random.randn(100), 'o')
This was left over from testing the plot routine. Once I removed this the errors disappeared. I guess I need to check my code more carefully ...
The solution was to use the matplotlib.dates.num2date function on both sets of data.

Multiple labels in Matplotlib

I've created a plot and I want the first item that I plot to have a label that is partly a string and partly element 0 of array "t". I tried setting the variable the_initial_state equal to a string:
the_initial_state = str('the initial state')
And then plotting as follows:
plt.figure(5)
fig = plt.figure(figsize=(6,6), dpi=1000)
plt.rc("font", size=10)
plt.title("Time-Dependent Probability Density Function")
plt.xlabel("x")
plt.xlim(-10,10)
plt.ylim(0,0.8)
plt.plot(x,U,'k--')
**plt.plot(x,Pd[0],'r',label= the_initial_state, label =t[0])**
plt.plot(x,Pd[1],'m',label=t[1])
plt.plot(x,Pd[50],'g',label=t[50])
plt.plot(x,Pd[100],'c',label=t[100])
plt.legend(title = "time", bbox_to_anchor=(1.05, 0.9), loc=2, borderaxespad=0.)
But I receive an "invalid syntax" error for the line that is indicated by ** **.
Is there any way to have a label that contains a string and an element of an array to a fixed number of decimal places?
'the initial state' is already a string, so you do not need to cast it again.
The syntax error comes from passing label twice: Python does not allow a keyword argument to be repeated in a call.
Concatenating a string and a float in Python can be done, for example, with the format function.
the_initial_state = 'the initial state {}'.format(t[0])
plt.plot(x,Pd[0],'r',label= the_initial_state)
should work.
There are good external pages explaining the format syntax. For example, to format the float 2.345672 to 2 decimal places, use
"{:.2f}".format(2.345672)
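Putting the two pieces together, here is a short sketch of a label that combines fixed text with a rounded array element. The values in t are invented. The second part checks, without needing matplotlib, that a repeated keyword argument is rejected when the code is compiled.

```python
t = [0.0, 0.123456, 6.172839]  # hypothetical time values

# One label built from fixed text plus t[0] rounded to 2 decimal places.
label = "the initial state, t = {:.2f}".format(t[0])
print(label)

# Later curves can use the same format spec for consistent legend entries.
other = "{:.2f}".format(t[2])
print(other)

# A call that repeats a keyword argument fails at compile time.
message = None
try:
    compile("plot(x, y, label='a', label='b')", "<example>", "eval")
except SyntaxError as exc:
    message = exc.msg
print(message)
```

The resulting string is then passed once, as label=label, to plt.plot.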

Imshow and pcolor throw errors when trying to create test pattern-style bars

I am trying to create an image to use as a test pattern for a new colormap I'm creating. The map is supposed to have nine unique colors with breaks at the integers from 0-8. The colormap itself is fine, but I can't seem to generate the image itself.
I'm using pandas to make the test array like this:
mask=pan.DataFrame(index=np.arange(0,100),columns=np.arange(1,91))
mask.ix[:,1:10]=0.0
mask.ix[:,11:20]=1.0
mask.ix[:,21:30]=2.0
mask.ix[:,31:40]=3.0
mask.ix[:,41:50]=4.0
mask.ix[:,51:60]=5.0
mask.ix[:,61:70]=6.0
mask.ix[:,71:80]=7.0
mask.ix[:,81:90]=8.0
Maybe not the most elegant method, but it creates the array I want.
However, when I try to plot it using either imshow or pcolor I get an error. So:
fig=plt.figure()
ax=fig.add_subplot(111)
image=ax.imshow(mask)
fig.canvas.draw()
yields the error: "TypeError: Image data can not convert to float"
and substituting pcolor for imshow yields this error: "AttributeError: 'float' object has no attribute 'view'"
However, when I replace the values in mask with anything else - say random numbers - it plots just fine:
mask=pan.DataFrame(data=rand(100,90),index=np.arange(0,100),columns=np.arange(1,91))
fig=plt.figure()
ax=fig.add_subplot(111)
image=ax.imshow(mask)
fig.canvas.draw()
yields the standard colored speckle one would expect (no errors).
The problem here is that your dataframe is full of objects, not numbers. You can see it if you do mask.dtypes. If you want to use pandas dataframes, create mask by specifying the data type:
mask=pan.DataFrame(index=np.arange(0,100),columns=np.arange(1,91), dtype='float')
otherwise pandas cannot know which data type you want. After that change your code should work.
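The dtype difference can be checked directly. This is a small sketch on a 3x4 frame rather than the 100x90 one in the question, and it uses .loc for the assignment because .ix has been removed from recent pandas versions.

```python
import numpy as np
import pandas as pd

# Without a dtype, an empty frame holds generic Python objects.
untyped = pd.DataFrame(index=np.arange(0, 3), columns=np.arange(1, 5))
# With a dtype, the columns are numeric from the start.
typed = pd.DataFrame(index=np.arange(0, 3), columns=np.arange(1, 5), dtype="float")

print(untyped.dtypes.unique())
print(typed.dtypes.unique())

# Label-based slices with .loc include both end points.
typed.loc[:, 1:2] = 0.0
typed.loc[:, 3:4] = 1.0
values = typed.to_numpy()
print(values.dtype)
```

A float array like values is what imshow expects, whereas an object array is what triggers the conversion error in the question.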
However, if you want to just test the color maps with integers, then you might be better off using simple numpy arrays:
mask = np.empty((100,90), dtype='int')
mask[:, :10] = 0
mask[:, 10:20] = 1
...
And, of course, there are shorter ways to do that filling, as well. For example:
mask[:] = np.arange(90)[None,:] // 10
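Another way to build the same striped array, shown here as a sketch, is to construct one row with np.repeat and stack it with np.tile. The comparison at the end uses integer division so that the result stays an integer array.

```python
import numpy as np

# One row: 0 ten times, then 1 ten times, and so on up to 8.
stripe = np.repeat(np.arange(9), 10)

# Stack 100 identical rows to get the 100x90 test pattern.
mask = np.tile(stripe, (100, 1))

# Same pattern via integer division of the column index.
same = np.array_equal(mask[0], np.arange(90) // 10)
print(mask.shape, mask.dtype.kind, same)
```

Either array can be passed straight to ax.imshow(mask).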

Parsing a file and creating histogram with python

I have to create a histogram from a source file that I have to parse:
for line in fp:
    data = line.split('__')
    if(len(data)==3 and data[2]!='\n' and data[1]!=''):
        job_info = data[0].split(';')
        [...]
        job_times_req = data[2].split(';')
        if(len(job_times_req)==6):
            cpu_req = job_times_req[3]
The parsing is correct, I have tried it, but now I would like to create a histogram of how many times I have called each CPU. For example, if I have called the first one 10 times, the second 4 times and so on, I would like to see the histogram of this.
I have tried something like:
a.append(cpu_req )
plt.hist(a, 100)
plt.xlabel('CPU N°', fontsize=20)
plt.ylabel('Number of calls', fontsize= 20)
plt.show()
but it is not working. How can I store the data in the correct way to show them in a histogram?
Solved with a simple cast:
a.append(int(cpu_req))
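Since the CPU numbers are discrete, an alternative to binning with plt.hist is to count the calls explicitly. This is a sketch with made-up field values standing in for the parsed cpu_req strings.

```python
from collections import Counter

# Hypothetical parsed fields; split() always yields strings.
cpu_fields = ["1", "1", "2", "1", "3", "2"]

# Cast to int while counting, so CPUs sort numerically rather than as text.
calls = Counter(int(field) for field in cpu_fields)

cpus = sorted(calls)
counts = [calls[cpu] for cpu in cpus]
print(cpus, counts)

# plt.bar(cpus, counts) would then draw one bar per CPU.
```

This gives exactly one bar per CPU number, whereas plt.hist(a, 100) spreads the values over 100 bins.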
