This is my first post on Stack Overflow. I've never needed to ask before, because every question I had was already answered by browsing here, except this one.
I am trying to plot data to get a schematic representation, but I am not sure what the 'best way' to do it is, or how to achieve what I have in mind.
I figured this is the best way to represent the dataset I have:
So each record has:
an x-axis start
an x-axis end
a y-axis value
a z-axis value
My data are stored in a CSV file like:
start | stop | y-value | z-value
I thought about using a heatmap to do so, but I am not sure:
whether this is the best way to do it (the overlap can be problematic to handle);
whether there is an easy way to do it (should I manually add all the required points between start and stop?);
whether, if I want to highlight some data, I can change the z-color scale for just some of it.
I thought a little help from here might clarify things :)
Cheers,
Little update: the way I was heading is working. I am not sure whether it is the best approach, but at least it is working.
I still need to improve the formatting, but this is more or less what one can get:
So, the way I implemented it is to build a matrix and, for each record, fill each discrete point of the matrix with the highest z-value.
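The matrix approach described above can be sketched roughly like this. The record tuples and grid dimensions are made up for illustration; the real values would come from the CSV:

```python
import numpy as np

# Hypothetical records: (start, stop, y_row, z_value).
# In practice these would be read from the CSV described above.
data = [(0, 4, 0, 1.0), (2, 7, 0, 3.0), (1, 5, 2, 2.0)]

n_rows, n_cols = 3, 10          # assumed y-range and x-range of the grid
grid = np.zeros((n_rows, n_cols))

# For each record, fill every discrete x between start and stop,
# keeping the highest z-value wherever intervals overlap.
for start, stop, y, z in data:
    grid[y, start:stop + 1] = np.maximum(grid[y, start:stop + 1], z)

# The filled matrix can then be shown as a heatmap, e.g.:
# import matplotlib.pyplot as plt
# plt.imshow(grid, aspect="auto", origin="lower")
```

Using `np.maximum` on the slice is what resolves the overlap: a cell covered by two records ends up with the larger of the two z-values.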
I am a novice at Python, so I apologize if this is confusing. I am trying to create a 6-variable Venn diagram. I was trying to use matplotlib-venn; however, the problem I am having is that creating the sets is turning out to be impossible for me. My data is thousands of rows long with a unique index, and each column has boolean values for each category. It looks something like this:
|A|B|C|D|E|F|
|0|0|1|0|1|1|
|1|1|0|0|0|0|
|0|0|0|1|0|0|
Ideally I'd like to make a Venn diagram which would show how many people overlap across categories (e.g. A and B and C). How would I go about doing this? If anyone could point me in the right direction, I'd be really grateful.
I found this person who had a similar problem to mine, and the solution at the end of that thread is what I'd like to end up with, except with 6 variables: https://community.plotly.com/t/how-to-visualize-3-columns-with-boolean-values/36181/4
Thank you for any help!
Perhaps you might try to be more specific about your needs and what you have tried.
Making a six-set Venn diagram is not trivial at all, even more so if you want to make the areas proportional. I made a program in C++ (nVenn) with a translation to R (nVennR) that can do that. I suppose it might be usable from Python, but I have never tried, and I do not know if that is what you want. Also, interpreting six-set Venn diagrams is not easy; you may want to check out UpSet for a different kind of representation. In the meantime, I can point you to a web page I made that explains how nVenn works (link).
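Whichever tool you end up with, the first step is usually turning the boolean-indicator table into counts per intersection, which is exactly what UpSet-style tools consume. A minimal sketch with plain pandas, using a toy version of the table from the question:

```python
import pandas as pd

# Toy version of the boolean-indicator table from the question;
# the real data would be read with pd.read_csv.
df = pd.DataFrame(
    [[0, 0, 1, 0, 1, 1],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 0, 1, 0, 0],
     [1, 1, 0, 0, 0, 0]],
    columns=list("ABCDEF"),
).astype(bool)

# Size of every observed intersection: group the rows by their full
# membership pattern and count how many people fall in each pattern.
intersections = df.groupby(list(df.columns)).size()
print(intersections)
```

The result is a Series indexed by the six boolean flags, e.g. the pattern `A & B` (and nothing else) has count 2 here. That count table is the input a proportional six-set tool needs.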
I am trying to plot the availability of my network per hour. So, I have a massive dataframe containing multiple variables, including the availability and the hour. I can clearly visualise everything I want on the plot when I do the following:
mond_data= mond_data.groupby('Hour')['Availability'].mean()
The only problem is that if I bracket the whole code and plot it (I mean (the code above).plot()), I do not get any values on my x-axis that say 'Hour'. How can I plot this showing the values on my x-axis (Hour)? I should have 24 values, as the code above gives an average for the whole day, from midnight to 11pm.
Here is how I solved it.
mond_avg = mond_data.groupby('Hour')['Availability'].mean()
plt.plot(mond_avg.index, mond_avg.values)
For some reason Python was not plotting the index unless it was passed explicitly. I have not tested many cases, so additional explanation of this problem is still welcome.
The question is simple: how do you read those graphs? I read their explanation and it doesn't make sense to me.
I was reading TensorFlow's newly updated readme file for TensorBoard, and in it they try to explain what a "histogram" is. First it clarifies that it's not really a histogram:
Right now, its name is a bit of a misnomer, as it doesn't show
histograms; instead, it shows some high-level statistics on a
distribution.
I am trying to figure out what their description is actually saying.
Right now I am trying to parse this specific sentence:
Each line on the chart represents a percentile in the distribution
over the data: for example, the bottom line shows how the minimum
value has changed over time, and the line in the middle shows how the
median has changed.
The first question I have is: what do they mean by "each line"? There are horizontal axes, there are lines that make a square grid on the graph, and there are the plotted lines themselves. Consider a screenshot from the TensorBoard example:
What are they referring to with "lines"? In the above example what are the lines and percentiles that they are talking about?
Then the readme file tries to provide more detail with an example:
Reading from top to bottom, the lines have the following meaning:
[maximum, 93%, 84%, 69%, 50%, 31%, 16%, 7%, minimum]
However, it's unclear to me what they are talking about. What are the lines, and what are the percentiles?
It seems that they are planning to replace this in the future, but meanwhile I am stuck with it. Can someone help me understand how to read it?
The lines they are talking about are the plotted curves themselves: reading from top to bottom, each curve traces one percentile of the distribution over time.
As for the meaning of percentile, check out the Wikipedia article.
Basically, the 93rd percentile line means that, at each step, 93% of the values are situated below it.
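To make the percentile idea concrete, here is a small numpy check on a toy distribution (the values are made up for illustration):

```python
import numpy as np

values = np.arange(1, 101)  # toy distribution: the integers 1..100

# The 93rd percentile is the value below which ~93% of the data lies.
p93 = np.percentile(values, 93)
fraction_below = np.mean(values <= p93)
print(p93, fraction_below)
```

TensorBoard plots one such percentile per curve, recomputed at every training step, which is why each curve moves over time as the underlying distribution changes.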
I'm currently generating some histograms with matplotlib. The issue is that, because of one or two outliers, my whole graph is incredibly small and almost impossible to read, due to having two separate histograms being plotted. The solution I am having problems with is dropping the outliers at around the 99th/99.5th percentile. I have tried using:
plt.xlim([np.percentile(df,0), np.percentile(df,99.5)])
plt.xlim([df.min(),np.percentile(df,99.5)])
Seems like it should be a simple fix, but I'm missing some key information to make it happen. Any input would be much appreciated, thanks in advance.
To restrict focus to just the middle 99% of the values, you could do something like this:
trimmed_data = df[(df.Column > df.Column.quantile(0.005)) & (df.Column < df.Column.quantile(0.995))]
Then you could do your histogram on trimmed_data. Exactly how to exclude outliers is more of a stats question than a Python question, but the idea I was suggesting in a comment is this: clean up the dataset using whatever methods you can defend, and then do everything (plots, stats, etc.) only on the cleaned dataset, rather than trying to tweak each individual plot to make it look right while still having the outlier data in there.
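A runnable sketch of the trimming approach, with synthetic data standing in for the real dataframe (the column name `Column` is kept from the answer above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Mostly well-behaved values, plus two extreme outliers that would
# otherwise stretch the histogram's x-axis.
df = pd.DataFrame({"Column": np.concatenate([rng.normal(0, 1, 1000),
                                             [50.0, -60.0]])})

# Keep only the middle 99% of the values.
lo, hi = df["Column"].quantile([0.005, 0.995])
trimmed = df[(df["Column"] > lo) & (df["Column"] < hi)]

# trimmed["Column"].hist(bins=50) would now use the readable range.
print(len(df), len(trimmed))
```

An alternative that keeps all the data but restricts the view is passing `range=(lo, hi)` to `plt.hist`; trimming first is preferable when every downstream plot and statistic should ignore the outliers.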
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I'm trying to process a 1.8 MB txt file. There are a couple of header lines; afterwards it's all space-separated data. I can pull the data in using pandas. What I want to do with the data is:
1) Cut out the non-essential data: roughly the first 1675 lines, and the last 3-10 lines (it varies day to day). I can remove the first lines, kind of. The big problem I'm having with this idea is knowing for sure where the 1675 pointer location is. Using something like
df = df[df.year > 1978]
only moves the initial 'pointer' to 1675. If I try
dataf = df[df.year > 1978]
it just gives me a pure copy of what I would have with the first version. It still keeps the pointer at the same 1675 start point. It won't allow me to access any of the first 1675 rows, but they are still obviously there.
df.year[0]
comes back with an error suggesting row 0 doesn't exist. I have to go searching to find what the first readable row is. Instead of flat-out removing the rows and moving the new pointer up to 0, it just moves the pointer to 1675 and won't allow access to anything lower. I also still haven't found a way to determine the last row number programmatically; through the shell it's easy, but I need to do it in the program so I can set up the loop for point 2.
2) I want to be able to take averages of the data ('x'-day moving averages) and create a new column with the result once I have calculated the moving average. I think I can create the new column with the Series statement... I haven't tried it yet, though, as I haven't been able to get this far.
3) After all this and some more math I want to be able to graph the data with a homemade graph. I think this should be easy once I have everything else completed. I have already created the sample graph and can plot the points/lines on the graph once I have the data to work with.
Is pandas the right lib for the project, or should I be trying to use something else? So far, the more research I do, the more lost I get, as everything I try gets me a little further but sets me even further back at the same time. In a similar question I saw a mention of using something else for doing math on the data block, but there wasn't any indication of what was used.
It sounds like your main trouble is indexing. If you want to refer to the "first" thing in a DataFrame, use df.iloc[0]. But DataFrame indexes are really powerful regardless.
http://pandas.pydata.org/pandas-docs/stable/indexing.html
I think you are headed in the right direction. Pandas gives you nice, high level control over your data so that you can manipulate it much more easily than using traditional logic. It will take some work to learn. Work through their tutorials and you should be fine. But don't gloss over them or you'll miss some important details.
I'm not sure why you are concerned that the lines you want to ignore aren't being deleted; as long as they aren't used in your analysis, it doesn't really matter. Unless you are facing memory constraints, it's probably irrelevant. But if you do find you can't afford to keep them around, I'm sure there is a way to really remove them, even if it's a bit sideways.
Processing a few megabytes worth of data is pretty easy these days and Pandas will handle it without any problems. I believe you can easily pass pandas data to numpy for your statistical calculations. You should double check that, though, before taking my word for it. Also, they mention matplotlib on the pandas website, so I am guessing it will be easy to do basic graphing as well.
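To make the indexing point concrete, here is a small sketch covering points 1) and 2) of the question. The frame and column names are made up; the real data would come from the parsed file:

```python
import pandas as pd

# Toy frame standing in for the parsed file; column names are assumptions.
df = pd.DataFrame({"year": range(1975, 1985),
                   "value": [float(v) for v in range(10)]})

# Filtering keeps the original index labels, which is why df.year[0]
# fails after a filter: label 0 no longer exists.
recent = df[df["year"] > 1978]

# Position 0 must be addressed with iloc ...
first_row = recent.iloc[0]

# ... or the index can be reset so labels start from 0 again.
recent = recent.reset_index(drop=True)

# Point 2: a 3-day moving average as a new column.
recent["value_ma3"] = recent["value"].rolling(window=3).mean()
```

`reset_index(drop=True)` is the "really remove them" step: after it, the filtered-out rows are gone from the labels as well, and `recent.loc[0]` works as expected. `rolling(window=n).mean()` handles the moving average without any manual loop.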