I am plotting some columns of a CSV using Pandas/Matplotlib. The index column is the time in seconds, which is a very large number.
For example:
401287629.8
401287630.8
401287631.7
401287632.8
401287633.8
401287634.8
I need this to be printed as my xticklabel when i plot. But it is changing the number format as shown below:
plt.figure()
ax = dfPlot.plot()
legend = ax.legend(loc='center left', bbox_to_anchor=(1,0.5))
labels = ax.get_xticklabels()
for label in labels:
    label.set_rotation(45)
    label.set_fontsize(10)
I couldn't find a way for the xticklabel to print the exact value rather than a shortened version of it.
This is essentially the same problem as "How to remove relative shift in matplotlib axis".
The solution is to tell the formatter not to use an offset:
ax.get_xaxis().get_major_formatter().set_useOffset(False)
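A minimal sketch of this on values shaped like the sample in the question (the y values are made up for illustration, and plain matplotlib is used here; the axes returned by dfPlot.plot() should accept the same call). One assumption worth flagging: for numbers this large matplotlib may also fall back to a scientific-notation multiplier once the offset is off, so the sketch additionally asks for plain style through ax.ticklabel_format. Treat it as a starting point rather than the definitive fix.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# x values copied from the question; the y values are hypothetical.
times = [401287629.8, 401287630.8, 401287631.7,
         401287632.8, 401287633.8, 401287634.8]
values = [1.0, 2.0, 1.5, 3.0, 2.5, 4.0]

fig, ax = plt.subplots()
ax.plot(times, values)

# Turn off both the additive offset and the scientific multiplier on the x axis.
ax.ticklabel_format(axis="x", style="plain", useOffset=False)

fig.canvas.draw()  # force the tick labels to be computed
labels = [t.get_text() for t in ax.get_xticklabels()]
offset_text = ax.xaxis.get_offset_text().get_text()
```

After drawing, offset_text should be empty and every entry in labels should be a full-length number.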
Also related:
useOffset=False in config file?
https://github.com/matplotlib/matplotlib/issues/2400
https://github.com/matplotlib/matplotlib/pull/2401
If it's not rude of me to point out, you're asking for a great deal of precision from a single chart. Your sample data shows a five-second span between two times that are both over twelve and a half years long.
You have to cut your cloth to your measure on this one. If you want to keep the years, you can't keep the seconds. If you want to keep the seconds, you can't have the years.
Related
I need to reduce or manually set the number of ticks on the x-axis of a Matplotlib line plot. This question has been asked many times here, I've gone through as many of those answers as I can find and through the Matplotlib docs and I haven't found a solution I can get working so I'm hoping for some help.
I have a Python dictionary with two sets of key:value pairs - datetime.datetime and float. There are hundreds of values in each set - but here's a snippet of the first elements just for reference:
ws_kline_dict_01 = {'time': [datetime.datetime(2023, 2, 15, 10, 35, 8)], 'close': [22183.07]}
I've converted that dictionary to a Pandas dataframe so I can see it more easily in Jupyter and also stripped out the year, month and day from 'time' using:
df_kline_dict_01 = pd.DataFrame(ws_kline_dict_01)
df_kline_dict_01['time'] = df_kline_dict_01['time'].dt.strftime('%H:%M:%S')
When I plot this via Matplotlib using 'time' as the x-axis - it prints every value as a tick which is way too cluttered (see 'Plot: Post-Panda format' below).
If I leave the datetime.datetime in its original form - Matplotlib seems to auto-select how many values it displays and it displays "Day Hour:Minutes" instead of "Hour:Minutes:Seconds" - which isn't working for me (see 'Plot: Pre-Panda format' below).
I've tried plt.locator_params(axis='x', nbins=n) - but this gives me a warning and has no effect:
"UserWarning: 'set_params()' not defined for locator of type <class 'matplotlib.category.StrCategoryLocator'>".
For reference - this is the code I'm using to produce the plot:
plt.plot(df_kline_dict_01['time'], df_kline_dict_01['close'], color = 'green', label = 'close')
plt.xticks(rotation=45, ha='right')
plt.show()
How do I (at least) reduce or (ideally) explicitly set the number of values/ticks shown on the x-axis?
Seems like this should be a pretty simple formatting task - but so far it's beating me and I'd appreciate some help getting this sorted.
Plot: Pre-Panda format
Plot: Post-Panda format
Here is a possible solution using the .xaxis.set_major_locator() method. You can adjust the max_xticks variable to suit your use-case.
import matplotlib.ticker as ticker  # needed for MaxNLocator below
...
df_kline_dict_01['time'] = df_kline_dict_01['time'].dt.strftime('%H:%M:%S')
fig, ax = plt.subplots()
ax.plot(df_kline_dict_01['time'], df_kline_dict_01['close'], color='green', label='close')
max_xticks = 6
ax.xaxis.set_major_locator(ticker.MaxNLocator(max_xticks))
plt.xticks(rotation=45, ha='right')
plt.show()
Note: I assigned max_xticks = 6 to make the code easier to follow; you could just as well pass the value directly, as in .MaxNLocator(6), on the next line.
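An alternative sketch, in case it helps: leave the 'time' column as datetimes (skip the strftime step) and let a date locator and formatter do the work, which also gives the "Hour:Minutes:Seconds" labels the question asks for. The data below is a made-up stand-in for the real dictionary, and AutoDateLocator with maxticks is just one locator that could be used here.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for ws_kline_dict_01: one value per second for 5 minutes.
start = datetime.datetime(2023, 2, 15, 10, 35, 8)
df = pd.DataFrame({
    "time": [start + datetime.timedelta(seconds=i) for i in range(300)],
    "close": [22183.07 + 0.1 * i for i in range(300)],
})

fig, ax = plt.subplots()
ax.plot(df["time"], df["close"], color="green", label="close")

max_xticks = 6
# Cap the number of ticks the date locator is allowed to produce.
ax.xaxis.set_major_locator(mdates.AutoDateLocator(maxticks=max_xticks))
# Label each tick as Hour:Minutes:Seconds.
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M:%S"))
fig.autofmt_xdate(rotation=45, ha="right")

fig.canvas.draw()
tick_labels = [t.get_text() for t in ax.get_xticklabels()]
```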
Pass explicit tick locations to plt.xticks, for example:
plt.xticks(np.arange(min, max, step), rotation=45, ha='right')
Fill in min, max and step as you wish.
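For the string-formatted 'time' column in the question, a rough sketch of what this could look like. It assumes the strings are placed on the axis at positions 0, 1, 2, ... in plotting order, so min is 0, max is the number of points, and step picks every n-th label. The data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 120 one-second samples already formatted as strings.
times = [f"10:{35 + i // 60:02d}:{i % 60:02d}" for i in range(120)]
close = [22183.07 + 0.1 * i for i in range(120)]

fig, ax = plt.subplots()
ax.plot(times, close, color="green", label="close")

step = 20  # keep every 20th label
plt.xticks(np.arange(0, len(times), step), rotation=45, ha="right")

fig.canvas.draw()
labels = [t.get_text() for t in ax.get_xticklabels()]
```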
So I am working with some data for a science fair project, and I am extremely new to pandas and matplotlib/pyplot. I am currently trying to make a graph of some data (a bar graph) and have been able to do so fine. I split my DataFrame into two parts: the name and the values themselves:
data = pd.read_csv('results.csv')
data = data.sort_values(by=['Accuracy'], ascending=False)
accuracy = data['Accuracy']
names = data['Name']
This works fine. And when I go to make my graph it also works fine:
plt.bar(names, accuracy)
plt.title('Accuracy Below 97%')
plt.ylabel('Accuracy in Percent')
plt.show()
But the only problem is that when I do this, my names are too long so it ends up as a sort of blur:
I also have around 40 data points, which I understand is probably too many to be able to see the names anyway, but the names are around 30 characters long, so even if I reduced the number of data points in a graph, it still probably would not work.
So I then just assumed that I would remove names from plt.bar(names, accuracy), but this throws the error:
TypeError: bar() missing 1 required positional argument: 'height'
So I realized that I need a width value, and since the number of data points was 42 I then tried:
plt.bar(42, accuracy)
But this creates a weird graph that I am not looking for:
So my question is: how do I remove the names from the graph while keeping the actual graph the same?
Any help is greatly appreciated. Thanks!
If you want to remove the xtick labels from the graph, you can do:
plt.xticks([])
You can also set the x-axis limits explicitly so the bars still fill the width of the plot once the labels are gone:
plt.xticks([])
plt.xlim(-0.5, len(accuracy)-0.5)
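As a side note on the plt.bar(42, accuracy) attempt: the first argument of bar is the x position of each bar, so one position per bar is needed. A small sketch under that reading, with a made-up stand-in for results.csv:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for results.csv.
data = pd.DataFrame({
    "Name": [f"a_rather_long_model_name_number_{i}" for i in range(5)],
    "Accuracy": [91.2, 96.5, 88.0, 93.7, 95.1],
})
data = data.sort_values(by=["Accuracy"], ascending=False)
accuracy = data["Accuracy"]

fig, ax = plt.subplots()
# One x position per bar (0, 1, 2, ...) instead of the long names.
bars = ax.bar(range(len(accuracy)), accuracy)
ax.set_xticks([])  # no ticks and no labels on the x axis
ax.set_title("Accuracy Below 97%")
ax.set_ylabel("Accuracy in Percent")
```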
Here is what you asked for, but rather than hiding the problem you could handle the overlapping labels with the approach in this link:
datetime x-axis matplotlib labels causing uncontrolled overlap
ax = data[['Accuracy','Name']].plot(kind='bar', title='Accuracy Below 97%')
ax.get_xaxis().set_visible(False)
plt.show()
As the title explains, I am trying to reproduce a stacked bar chart where the y-axis scale is linear but the inside fill of the plot (i.e. the stacked bars) is logarithmic and grouped in powers of 10.
I have made this plot before on R-Studio with an in-house package, however I am trying to reproduce the plot with other programs (python) to validate and confirm my analysis.
Quick description of the data w/ more detail:
I have thousands of entries of clonal cell information. They have multiple identifiers, such as "Strain", "Sample", "cloneID", as well as a frequency value ("cloneFraction") for each clone.
This is the .head() of the dataset I am working with to give you an idea of my data
I am trying to reproduce this following plot I made with R-Studio:
this one here
This plot divides the dataset into groups based on frequency: the 10 most frequent clones are grouped in red, followed by the next 100, the next 1000, and so on. The y-axis uses a 0.00-1.00 scale; a 0-100% scale would work just as well, since they mean the same thing in this context.
This is just to get an idea and visualize if I have big clones (the top 10) and how much of the overall dataset they occupy in frequency - i.e. the bigger the red stack the larger clones I have, signifying there has been a significant clonal expansion in my sample of a few selected cells.
What I have done so far:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline
MYDATAFRAME.groupby(['Sample','cloneFraction']).size().groupby(level=0).apply(lambda x: 100 * x / x.sum()).unstack().plot(kind='bar',stacked=True, legend=None)
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()
And I get this plot here
Now, I realize there is no order in the stacked plot, so the most frequent aren't on top - it's just stacking in the order of the entries in my dataset (which I assume I can just fix by sorting my dataframe by the column of interest).
Other than the axis messing up and not giving me a % when I use log scale (which is a secondary issue), I don't know how to group the data entries by frequency as I mentioned above.
I have tried things such as:
temp = X.SOME_IDENTIFIER.value_counts()
temp2 = temp.head(10)
if len(temp) > 10:
    temp2['remaining {0} items'.format(len(temp) - 10)] = sum(temp[10:])
temp2.plot(kind='pie')
Just to see if I could separate them in a correct way but this does not achieve what I would like (other than being a pie chart, but I changed that in my code).
I have also tried using iloc[n:n] to select specific entries, but I can't seem to get that working either, as I get errors when I try adding it to the code I've used above to plot my graph - and if I use it without the other fancy stuff in the code (% scale, etc) it gets confused in the stacked barplot and just plots the top 10 out of all the 4 samples in my data, rather than the top 10 per sample. I also wouldn't know how to get the next 100, 1000, etc.
If you have any suggestions and can help in any way, that would be much appreciated!
Thanks
I fixed what I wanted to do with the following:
I created a new column with the category my samples fall in, based on their value (i.e. whether they're in the top 10 most frequent, the next 100, and so on).
df['category']='10001+'
for sampleref in df.sample_ref.unique().tolist():
    print(f'Setting sample {sampleref}')
    df.loc[df[df.sample_ref == sampleref].nlargest(10000, 'cloneCount')['category'].index,'category']='1001-10000'
    df.loc[df[df.sample_ref == sampleref].nlargest(1000, 'cloneCount')['category'].index,'category']='101-1000'
    df.loc[df[df.sample_ref == sampleref].nlargest(100, 'cloneCount')['category'].index,'category']='11-100'
    df.loc[df[df.sample_ref == sampleref].nlargest(10, 'cloneCount')['category'].index,'category']='top10'
This code starts from the biggest group (10001+) and works down to the smallest, so each later assignment overwrites the label of the clones that also belong to a smaller, higher-ranked group.
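For anyone who prefers to avoid the loop, the same bucketing can probably be sketched with a per-sample rank and pd.cut. This is an alternative I have not run on the real data: the column names "Sample" and "cloneCount" and the random counts below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two samples with 150 clones each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Sample": np.repeat(["A", "B"], 150),
    "cloneCount": rng.integers(1, 10_000, size=300),
})

# Rank clones within each sample; rank 1 is the most frequent clone.
# method="first" breaks ties by row order so every rank is unique.
rank = df.groupby("Sample")["cloneCount"].rank(method="first", ascending=False)

# Bin the ranks: (0, 10] -> top10, (10, 100] -> 11-100, and so on.
bins = [0, 10, 100, 1000, 10000, np.inf]
labels = ["top10", "11-100", "101-1000", "1001-10000", "10001+"]
df["category"] = pd.cut(rank, bins=bins, labels=labels)
```

Because the ranks are computed per sample, each sample gets its own top 10, next 90, and so on, which is what the loop above does one sample at a time.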
Following this, I plotted the samples with the following code:
fig, ax = plt.subplots(figsize=(15,7))
df.groupby(['Sample','category']).sum()['cloneFraction'].unstack().plot(ax=ax, kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='Clonotype',bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
And here are the results:
I hope this helps anyone struggling with the same issue!
I am trying to plot three lines on one figure. I have data for three years for three sites and I am simply trying to plot them with the same x-axis and the same y-axis. The first two lines span all three years of data, while the third dataset is usually more sparse. Using the object-oriented axes matplotlib format, when I try to plot my third set of data, I get points at the end of the graph that are outside the range of my third set of data. My third dataset is structured as tuples of dates and values such as:
data=
[('2019-07-15', 30.6),
('2019-07-16', 20.88),
('2019-07-17', 16.94),
('2019-07-18', 11.99),
('2019-07-19', 13.76),
('2019-07-20', 16.97),
('2019-07-21', 19.9),
('2019-07-22', 25.56),
('2019-07-23', 18.59),
...
('2020-08-11', 8.33),
('2020-08-12', 10.06),
('2020-08-13', 12.21),
('2020-08-15', 6.94),
('2020-08-16', 5.51),
('2020-08-17', 6.98),
('2020-08-18', 6.17)]
where the data ends in August 2020, yet the graph includes points at the end of 2020. This is happening with all my sites, as the first two datasets stay constant knowndf['DATE'] and knowndf['Value'] below.
Here is the problematic graph.
And here is what I have for the plotting:
fig, ax=plt.subplots(1,1,figsize=(15,12))
fig.tight_layout(pad=6)
ax.plot(knowndf['DATE'], knowndf['Value1'],'b',alpha=0.7)
ax.plot(knowndf['DATE'], knowndf['Value2'],color='red',alpha=0.7)
ax.plot(*zip(*data), 'g*', markersize=8) #when i plot this set of data i get nonexistent points
ax.tick_params(axis='x', rotation=45) #rotating for aesthetic
ax.set_xticks(ax.get_xticks()[::30]) #only want every 30th tick instead of every daily tick
I've tried ax.twinx(), but that gives me two y-axes, which doesn't help since I want to use the same x-axis and y-axis for all three sites. I've tried not using the axes approach, but there are things that come with axes that I need to plot with. Please help!
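A possible explanation, offered as an assumption since the dtype of knowndf['DATE'] is not shown: if the dates are plain strings, matplotlib treats them as categories, and any date string in data that is missing from knowndf['DATE'] gets appended at the right-hand end of the axis. Converting both sets of dates to real datetimes before plotting may avoid that. A minimal sketch with made-up stand-ins for knowndf and data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-ins for the real data; dates start out as strings.
knowndf = pd.DataFrame({
    "DATE": pd.date_range("2019-07-01", "2020-12-31", freq="D").strftime("%Y-%m-%d"),
})
knowndf["Value1"] = range(len(knowndf))
data = [("2019-07-15", 30.6), ("2019-07-16", 20.88), ("2020-08-18", 6.17)]

# Convert both date sources to datetimes so they share one time axis.
knowndf["DATE"] = pd.to_datetime(knowndf["DATE"])
dates, values = zip(*data)
dates = pd.to_datetime(list(dates))

fig, ax = plt.subplots(1, 1, figsize=(15, 12))
ax.plot(knowndf["DATE"], knowndf["Value1"], "b", alpha=0.7)
line, = ax.plot(dates, values, "g*", markersize=8)

# One tick per month, in place of slicing every 30th daily tick.
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
ax.tick_params(axis="x", rotation=45)
```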
I would like to remove the flat lines on my graph while keeping the x labels.
I have this code which gives me a picture
dates = df_stock.loc[start_date:end_date].index.values
x_values = np.array([datetime.datetime.strptime(d, "%Y-%m-%d %H:%M:%S") for d in dates])
fig, ax = plt.subplots(figsize=(15,9))
# y values
y_values = np.array(df_stock.loc[start_date:end_date, 'Bid'])
# plotting
_ = ax.plot(x_values, y_values, label='Bid')
# formatting
formatter = mdates.DateFormatter('%m-%d %H:%M')
ax.xaxis.set_major_formatter(formatter)
The flat lines correspond to data that does not exist. I would like to know if it is possible not to display them while keeping the spacing of the x labels.
Thank you so much.
You want to have time on the x-axis, and time is equidistant regardless of whether you have data or not.
You now have several options:
1. don't use time on the x-axis but samples/index
2. do as in 1. but change the ticks & labels to draw time again (but this time not equidistantly)
3. make the value vector equidistant and use NaNs to fill the gaps
Why is this so?
By default, matplotlib produces a line plot, which connects the points with lines in the order in which they are given. In contrast, a scatter plot just draws the individual points, without suggesting any underlying order. It gives the same result as a line plot drawn with markers only and no connecting line.
In general, you have 3-4 options:
1. use the plot command but only plot markers (add linestyle='')
2. use the scatter command
3. use NaNs: plot does not know what to draw there and draws nothing (and also won't connect the non-existing points with lines)
4. use a loop and plot the connected sections as separate lines in the same axes
Options 1 and 2 are the easiest if you want to make almost no changes to your code. Option 3 is the most proper, and option 4 mimics its result.
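A small sketch of option 3, assuming the data sits in a pandas Series with a DatetimeIndex. The two "sessions" below are made up as a stand-in for df_stock, and asfreq is just one way to build the equidistant grid.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: two 5-hour sessions separated by an overnight gap.
idx = pd.date_range("2021-01-04 09:00", periods=5, freq="60min").append(
    pd.date_range("2021-01-05 09:00", periods=5, freq="60min"))
bid = pd.Series(np.linspace(100.0, 109.0, 10), index=idx)

# Put the values on an equidistant hourly grid; missing timestamps become NaN.
regular = bid.asfreq("60min")

fig, ax = plt.subplots(figsize=(15, 9))
# The line is broken wherever the values are NaN, so the gap is not bridged.
line, = ax.plot(regular.index, regular.to_numpy(), label="Bid")
```

The x-axis still covers the whole period, so the tick spacing stays true to time while the flat connecting segment disappears.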