I'm not great at matplotlib, but I need to use it for some work I am doing. I have a set of 9 columns of data, with around 100k lines. I want to produce a scatter plot, and I don't care about the rows; they're meaningless for my purposes. What I need is for the values to be plotted as a scatter against the column they are in, regardless of which row they are a part of.
This is all loaded in from a text file into a simple 2D array using numpy.loadtxt. It's just a set of numbers, so any substitution of random numbers should work. I'm just not sure how to manipulate it in a way the scatter command will like. I often get errors saying I'm giving it too few arguments, or it iterates over the array (or arrays, if I separate them) in ways I do not anticipate.
My first thought is that I could somehow break it down into a set of series by column, but I don't think the scatter command will take that. Any help would be very much appreciated.
The scatter function takes two sequences of the same length. To access a single column n of your numpy array, just use data[:, n]. Since you want to pair every value with its column number, you also need to create a list of the same length as data, with only the column number as elements. To create the plot you want, just do the following:
for i in range(9):
    plt.scatter([i + 1] * len(data), data[:, i])
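A self-contained version of the same idea, in case it helps to see it run end to end; random numbers stand in for the loadtxt result, and the figure is saved rather than shown so it works headless:

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt

# Stand-in for np.loadtxt: 1000 rows x 9 columns of random values
data = np.random.rand(1000, 9)

fig, ax = plt.subplots()
for i in range(9):
    # x is the column number (1-9) repeated once per row; y is that column's values
    ax.scatter([i + 1] * len(data), data[:, i], s=2)

ax.set_xlabel("column")
fig.savefig("columns.png")
```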
When I build a scatterplot of this data, you can see that the one large value (462) completely swamps the plot, so some of the other points can't even be seen.
Does anyone know of a specific way to normalize this data, so that the small dots can still be seen, while maintaining a link between the size of a dot and the value it represents? I'm wondering whether either of these would make sense:
(1) Set a minimum value for the size a dot can be
(2) Do some normalization of the data somehow, but I guess the large data point will always be 462 compared to some of the other points with a value of 1.
Just wondering how other people get around this, so they don't miss points that are actually there on the plot? Or is the most obvious answer simply not to scale the points by size, and instead label each point somehow with its value?
You can clip() the values used for the size param: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.clip.html
Full solution below:
import pandas as pd
import numpy as np
import plotly.express as px
df = pd.DataFrame(
    {"Class": np.linspace(-8, 4, 25), "Values": np.random.randint(1, 40, 25)}
).assign(Class=lambda d: "class_" + d["Class"].astype(str))
df.iloc[7, 1] = 462
px.scatter(df, x="Class", y="Values", size=df["Values"].clip(0, 50))
This isn't really a question about Python directly, but more about plotting style. There are several ways to solve the issue in your case:
Split the data into equally sized categories and assign colorlabels. Your legend would look something like this in this case:
0 - 1: color 1
2 - 20: color 2
...
The way to implement this is to split your data into the sets you want and plot separate scatter plots, each with a new color. See here or here for examples.
The second option that is frequently used is to use the log of the value for the bubble size. You would just have to point that out quite clearly in your legend.
The third option is to limit marker size to an arbitrary value. I personally am not a big fan of this method, since it changes the information shown to a degree that the other alternatives don't, but if you add a data callout, this would still be legitimate.
These options should be fairly easy to implement in code. If you are having difficulties, feel free to post runnable sample code and we could implement an example as well.
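As a starting point, here is a minimal matplotlib sketch of the second option (log-scaled bubble sizes) combined with a data callout; the data and variable names are made up, with one 462 outlier as in the question:

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")
import matplotlib.pyplot as plt

# One outlier (462) among small values, as in the question
values = np.array([1, 2, 5, 8, 3, 462, 7, 4])
x = np.arange(len(values))

# log1p keeps a value of 1 visible (log(1) would be 0, i.e. an invisible dot)
sizes = 40 * np.log1p(values)

fig, ax = plt.subplots()
ax.scatter(x, np.ones_like(x), s=sizes)
for xi, v in zip(x, values):
    ax.annotate(str(v), (xi, 1.02))  # callout preserves the raw value
fig.savefig("bubbles.png")
```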
I have a certain data set called "df" existing of 5000 rows and 32 columns. I want to plot 16 graphs by using a for loop. There are two problems that I cannot overcome:
The plot does not show when using this code:
proef_numbers = [1,2,3,4,5]
def plot_results(df, proef_numbers, title):
    for proef in proef_numbers:
        for test in range(1,2,3,4,5):
            S_data = df[f"S_{proef}_{test}"][1:DATA_END_VALUES[proef-1][test-1]]
            F_data = df[f"F_{proef}_{test}"][1:DATA_END_VALUES[proef-1][test-1]]-F0
            plt.plot(S_data, F_data, label=f"Proef {proef} test {test}")
            plt.xlabel('Time [s]')
            plt.ylabel('Force [N]')
            plt.title(f"Proef {proef}, test {test}")
            plt.legend()
            plt.show()
After this I tried something else and restructured my data set and I wanted to use the following for loop:
for i in range(1,17):
    plt.plot(df[i],df[i+16])
    plt.show()
Then I get the error:
KeyError: 1
For some reason, I cannot even print(df[1]) anymore; it also gives me "KeyError: 1". As you have probably guessed by now, I am very new to Python.
There are a couple of problems with the code.
First, the range function behaves differently from how you use it in the top code block. range is defined as range(start, stop, step), where start is included in the range, stop is not, and step defaults to 1. As written, the top code should not even run: range(1,2,3,4,5) is not a valid call. If you want to make it easier to understand for yourself, you could replace range(1,5) with the list [1,2,3,4], since a for statement iterates over a list just as it does over a range object.
Also, how are you calling the function? In the code example you gave, there is no call to the function, and if you don't call it, the code never executes. If you don't want to use a function, that is okay, but then the code becomes the version below. The function just makes the code more flexible if you want to make different variations of plots.
proef_numbers = [1,2,3,4]
for proef in proef_numbers:
    for test in range(1,5):
        S_data = df[f"S_{proef}_{test}"][1:DATA_END_VALUES[proef-1][test-1]]
        F_data = df[f"F_{proef}_{test}"][1:DATA_END_VALUES[proef-1][test-1]]-F0
        plt.plot(S_data, F_data, label=f"Proef {proef} test {test}")
        plt.xlabel('Time [s]')
        plt.ylabel('Force [N]')
        plt.title(f"Proef {proef}, test {test}")
        plt.legend()
        plt.show()
I tested it with dummy data from your other question, and it seems to work.
For your other question, it seems that you want to index columns by number, right? As this question shows, you can use .iloc on your pandas dataframe to locate by index (instead of by column name). So the second block of code changes to this:
for i in range(1,17):
    plt.plot(df.iloc[:,i],df.iloc[:,i+16])
    plt.show()
For this, df.iloc[:,i] means you are looking at all the rows (used by itself, : means all elements) and the ith column. Keep in mind that Python is zero-indexed, so the first column is 0. In that case, you might want to change range(1,17) to range(0,16), or simply range(16), since range defaults to a start value of 0.
I would highly recommend against locating by index though. If you have good column names, you should use those instead since it is more robust. When you select a column by name, you get exactly what you want. When you select by index, there could be a small chance of error if your columns get shuffled for some strange reason.
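For instance, with hypothetical column names, both forms below select the same column today, but only the name-based one survives a reordering:

```python
import pandas as pd

# Hypothetical two-column frame for illustration
df = pd.DataFrame({"force": [1.0, 2.0], "time": [0.1, 0.2]})

by_name = df["force"]      # robust: still correct if columns get shuffled
by_pos = df.iloc[:, 0]     # fragile: depends on "force" staying first
```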
If you want to see multiple plots at the same time, like a grid of plots, I suggest looking at using subplots:
https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html
For indexing your dataframe you should use the .loc method. Have a look at:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
Since you are new to Python, I would suggest learning to use NumPy arrays. You can convert your dataframe directly to a NumPy array and then plot slices of it.
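A minimal sketch of that suggestion, assuming (as in the question) a frame whose first 16 columns are x-series and last 16 are y-series; the data here is dummy random numbers:

```python
import numpy as np
import pandas as pd
import matplotlib

matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Dummy frame with 32 numeric columns, standing in for the real data
df = pd.DataFrame(np.random.rand(50, 32))

arr = df.to_numpy()  # plain 2-D NumPy array; integer slicing from here on

fig, ax = plt.subplots()
for i in range(16):
    ax.plot(arr[:, i], arr[:, i + 16])
fig.savefig("grid.png")
```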
I am using Pandas and trying to create a histogram for the frequency of float numbers (in bins)
I don't need to set those myself to any special value; I just want a reasonable-looking histogram showing the frequency of the data points, with reasonable binning or a continuous axis.
Trying to use this code:
vals, bins = np.histogram(total["ratio"].tolist(), bins=10)
total[["ratio"]].plot().hist(x="ratio", bins=bins)
I also tried using 10 as the parameter for bins and got the same result.
Taken from the docs.
And I still get an unbinned axis, which creates a large mess.
What am I missing here in order to bin the data?
Update
Using df["ratio"].hist() worked like a charm.
Why did doing seemingly the same thing with df[["ratio"]].plot().hist() not work?
I have built a Python program processing the probability of various datasets. I input various mean values and standard deviations manually, and that works; however, I need to automate it so that I can upload all my data through a text or CSV file. I've got so far, but now have a nested for loop problem, I think with indices. Some background follows...
My code works for a small dataset where I can manually key in 6-8 parameters, but now I need to automate it and upload inputs of unknown size by CSV / text file. I am copying my existing code and amending it where appropriate, but I have run into a problem.
I have a 2_D numpy-array where some probabilities have been reverse sorted. I have a second array which gives me the value of 68.3% of each row, and I want to trim the low value 31.7% data.
I need a solution which can handle an unspecified number of rows.
My pre-existing code, which worked for a single one-dimensional array, was:
prob_combine_sum= np.sum(prob_combine)
#Reverse sort the probabilities
prob_combine_sorted=sorted(prob_combine, reverse=True)
#Calculate 1 SD from peak Prob by multiplying Total Prob by 68.3%
sixty_eight_percent=prob_combine_sum*0.68269
#Loop over the sorted list and append the 1SD data into a list
#onesd_prob_combine
onesd_prob_combine=[]
for i in prob_combine_sorted:
    onesd_prob_combine.append(i)
    if sum(onesd_prob_combine) > sixty_eight_percent:
        break
That worked. However, now I have a multi-dimensional array, and I want to take the 1 standard deviation data from that multi-dimensional array and stick it in another.
There's probably more than one way of doing this but I thought I would stick to the for loop, but now it's more complicated by the indices. I need to preserve the data structure, and I need to be able to handle unlimited numbers of rows in the future.
I simulated some data and if I can get this to work with this, I should be able to put it in my program.
sorted_probabilities=np.asarray([[9,8,7,6,5,4,3,2,1],
    [87,67,54,43,32,22,16,14,2],[100,99,78,65,45,43,39,22,3],
    [67,64,49,45,42,40,28,23,17]])
sd_test=np.asarray([30.7215,230.0699,306.5323,256.0125])
target_array=np.zeros(4).reshape(4,1)
#Task: transfer data from sorted_probabilities to target_array on the
#condition that the value in each target row is less than the value in
#the sd_test array.
#Ignore the problem that the data transferred won't add up to 68.3%. My
#real data-sample is very big. I just need a way of trimming and
#transferring.
for row in sorted_probabilities:
    for element in row:
        target_array[row].append[i]
        if sum(target[row]) > sd_test[row]:
            break
Error: IndexError: index 9 is out of bounds for axis 0 with size 4
I know it's not a very good attempt. My problem is that I need a solution which will work for any 2D array, not just one with 4 rows.
I'd be really grateful for any help.
Thank you
Edit:
Can someone help me out with this? I am struggling.
I think the reason my loop will not work is that the 'index' row I am using is not a number but, in this case, a row. I will have a think about this. In the meantime, does anyone have a solution?
Thanks
I tried the following code after reading the comments:
for counter, value in enumerate(sorted_probabilities):
    for i, element in enumerate(value):
        target_array[counter]=sorted_probabilities[counter][element]
        if target_array[counter] > sd_test[counter]:
            break
I get an error: IndexError: index 9 is out of bounds for axis 0 with size 9
I think it's because I am trying to add to a numpy array of pre-determined dimensions? I am not sure. I am going to try another tack now, as I cannot do this with this approach. It's having to maintain the rows in the target array that makes it difficult. Each row relates to an object, and if I lose the structure it will be pointless.
I recommend you use pandas. You can read directly the csv in a dataframe and do multiple operations on columns and such, clean and neat.
You are mixing numpy arrays with Python lists. Better use only one of these (numpy is preferred). Also try to debug your code, because it has both syntax and logic errors. You don't have a variable i, though you're using it as an index; also, you are using row as an index while it is a numpy array, not an integer.
I strongly recommend you to
0) debug your code (at least with prints)
1) use enumerate to create both of your for loops;
2) replace append with plain assigning, because you've already created an empty vector (target_array). Or initialize your target_array as empty list and append into it.
3) if you want to use your solution for any 2d array, wrap your code into a function
Try this:
sorted_probabilities=np.asarray([[9,8,7,6,5,4,3,2,1],
                                 [87,67,54,43,32,22,16,14,2],
                                 [100,99,78,65,45,43,39,22,3],
                                 [67,64,49,45,42,40,28,23,17]])
sd_test=np.asarray([30.7215,230.0699,306.5323,256.0125])
target_array=np.zeros(4).reshape(4,1)
for counter, value in enumerate(sorted_probabilities):
    for i, element in enumerate(value):
        target_array[counter] = element  # Here I removed the code that produced the error
        if target_array[counter] > sd_test[counter]:
            break
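As an aside, the whole trim can also be done without explicit loops, using a cumulative sum per row. This is just a sketch against the question's sample data; trimmed positions become NaN so each row keeps its place in the structure:

```python
import numpy as np

sorted_probabilities = np.asarray([[9, 8, 7, 6, 5, 4, 3, 2, 1],
                                   [87, 67, 54, 43, 32, 22, 16, 14, 2],
                                   [100, 99, 78, 65, 45, 43, 39, 22, 3],
                                   [67, 64, 49, 45, 42, 40, 28, 23, 17]], dtype=float)
sd_test = np.asarray([30.7215, 230.0699, 306.5323, 256.0125])

# Running total along each row; keep an element while the total of the
# elements *before* it is still at or below the row's threshold (this
# matches the original loop, which includes the element that crosses it)
cumsum = np.cumsum(sorted_probabilities, axis=1)
keep = (cumsum - sorted_probabilities) <= sd_test[:, None]

# Same shape as the input; trimmed values replaced by NaN
trimmed = np.where(keep, sorted_probabilities, np.nan)
```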
I've got a bunch of data that I read in from a telnet machine...It's x, y data, but it's just comma separated like x1,y1,x2,y2,x3,y3,...
Here's a sample:
In[28] data[1:500]
Out[28] b'3.00000000E+007,-5.26880000E+001,+3.09700000E+007,-5.75940000E+001,+3.19400000E+007,-5.56250000E+001,+3.29100000E+007,-5.69380000E+001,+3.38800000E+007,-5.40630000E+001,+3.48500000E+007,-5.36560000E+001,+3.58200000E+007,-5.67190000E+001,+3.67900000E+007,-5.51720000E+001,+3.77600000E+007,-5.99840000E+001,+3.87300000E+007,-5.58910000E+001,+3.97000000E+007,-5.35160000E+001,+4.06700000E+007,-5.48130000E+001,+4.16400000E+007,-5.52810000E+001,+4.26100000E+007,-5.64690000E+001,+4.35800000E+007,-5.3938'
I want to plot this as a line graph with matplotlib. I've tried the struct package for converting this into a list of doubles instead of bytes, but I ran into many problems with that... Then there's the issue of what to do with the scientific notation... I obviously want to preserve the magnitude it's trying to convey, but I can't do that by naively swapping the byte encodings to what they would mean as a double encoding.
I'm trying all sorts of things that I would normally try with C, but I can't help but think there's a better way with Python!
I think I need to get the x's and y's into a numpy array and do so without losing any of the exponential notation...
Any ideas?
First convert your data to numbers, for example (note the bytes separator b',', since data is a bytes object, and the list() call, so the result can be sliced):
data = b'3.00000000E+007,-5.26880000E+001,+3.09700000E+007,-5.75940000E+001'
numbers = list(map(float, data.split(b',')))
Now slice the list to get the x and y data separately and plot it with matplotlib:
x = numbers[::2]
y = numbers[1::2]
import matplotlib.pyplot as plt
plt.plot(x, y)
plt.show()
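NumPy can also do the conversion and the x/y pairing in one go; a sketch using the first two pairs from the sample (float() and NumPy both understand the E-notation natively, so nothing is lost):

```python
import numpy as np

data = b'3.00000000E+007,-5.26880000E+001,+3.09700000E+007,-5.75940000E+001'

# Decode, split on commas, and convert; the E+007 notation parses directly
numbers = np.array(data.decode().split(','), dtype=float)

# Reshape to (n, 2) so column 0 is x and column 1 is y
xy = numbers.reshape(-1, 2)
x, y = xy[:, 0], xy[:, 1]
```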
To read your items two at a time (x1, y1), try this: iterating over two values of a list at a time in python
and then (divide and conquer style), deal with the scientific notation separately.
Once you've got two lists of numbers, then the matplotlib documentation can take it from there.
That may not be a complete answer, but it should get you started.