I am trying to draw a bar plot of feature importances for an XGBoost classifier. I expected the code below to work, but it doesn't, even after many attempts. Can you check it and tell me what is wrong?
feat_import=clf.feature_importances_
feat_names=X.columns
sorted_idx=clf.feature_importances_.argsort()[-20:]
plt.barh(feat_names[sorted_idx], clf.feature_importances_[sorted_idx])
It selects the 20 most important features, but it plots them unsorted.
When I use plain numbers instead of column names, I get the sorted bar graph.
plt.barh(range(20),feat_import[sorted_idx])
I couldn't figure out the problem here.
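One way to sidestep any reordering of string labels is to plot against numeric positions and attach the names as tick labels afterwards. The sketch below is only an illustration under assumptions: `plot_top_importances` is a hypothetical helper, and the importances and feature names are made-up stand-ins for `clf.feature_importances_` and `X.columns`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_top_importances(importances, names, top_n=20):
    """Draw the top_n importances as horizontal bars, smallest at the bottom."""
    importances = np.asarray(importances)
    names = np.asarray(names)
    order = importances.argsort()[-top_n:]   # indices of the top_n values, ascending
    positions = np.arange(len(order))        # numeric y positions keep the sorted order
    fig, ax = plt.subplots()
    ax.barh(positions, importances[order])
    ax.set_yticks(positions)
    ax.set_yticklabels(names[order])         # names are attached only as labels
    ax.set_xlabel("Feature importance")
    return ax, order


# Made-up values standing in for clf.feature_importances_ and X.columns
demo_importances = np.array([0.10, 0.40, 0.05, 0.30, 0.15])
demo_names = ["f0", "f1", "f2", "f3", "f4"]
ax, order = plot_top_importances(demo_importances, demo_names, top_n=3)
```

With a fitted classifier, the call would presumably be `plot_top_importances(clf.feature_importances_, X.columns)`.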
I have a DataFrame containing data about different projects. I tried to create a bar chart showing the number of users for each project, colored by whether the project was audited before or after 2014.
My problem is that I would like all of the bars ranked from the largest to the smallest, rather than having one group on one side and the other group on the other side. This is hard to describe, but the following pictures should make it clearer.
I tried the following:
applications = applications.sort_values(by='NombreUtilisateur2017', ascending=False)
fig = px.bar(applications, x='AppCode', y='NombreUtilisateur2017', color='test_avant_2014')
fig.show()
Here is my output:
current output
But, I would like my graph to look like this:
expected output
I am writing a program to plot a bar graph for a probability data set. The data set is not stored, and I would prefer not to store it. I need a bar for every possible outcome, and I want the bars to be dynamic: rather than counting the occurrences of each item in a stored data set, I want the bars to grow as the data is generated.
I tried using Python lists, so that a bar would look something like 36[****************], but I can't see how to update them dynamically. I am left with two options: create 60-120 bars up front (which seems stupid), or store the data (which adds work, execution time, and load). I can't think of anything else, so please suggest something!
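One option that avoids both pre-creating every bar and storing the data is to keep only a running count per outcome and render the bars from those counts. This is a minimal sketch, not a definitive design: `update_counts` and `render_bars` are hypothetical helpers, and the short list of outcomes stands in for whatever generates the data.

```python
from collections import Counter


def update_counts(counts, value):
    """Record one observation; a bar is created the first time a value appears."""
    counts[value] += 1


def render_bars(counts, symbol="*"):
    """Return one text bar per observed outcome, e.g. '36[***]'."""
    return [f"{value}[{symbol * counts[value]}]" for value in sorted(counts)]


counts = Counter()
# Stand-in for the live data source; each value is counted and then discarded
for outcome in [36, 37, 36, 40, 36]:
    update_counts(counts, outcome)

lines = render_bars(counts)
print("\n".join(lines))
```

Only the counts are kept in memory, so the storage grows with the number of distinct outcomes rather than with the number of observations.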
I am trying to plot the availability of my network per hour. I have a massive dataframe containing multiple variables, including the availability and the hour. I get exactly the values I want to plot when I do the following:
mond_data= mond_data.groupby('Hour')['Availability'].mean()
The only problem is that if I wrap the code above in parentheses and call .plot() on it, I do not get any values on my x-axis for 'Hour'. How can I plot this so that the x-axis shows the hours? I should have 24 values, since the code above gives the average for each hour of the day, from midnight to 11pm.
Here is how I solved it.
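# assumes mond_data is still the original DataFrame, not the grouped result
hourly_mean = mond_data.groupby('Hour')['Availability'].mean()
plt.plot(hourly_mean.index, hourly_mean.values)
For some reason, the index was not used for the x-axis unless I passed it explicitly. I have not tested many cases, so further explanation of this problem is still welcome.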
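For reference, here is a small self-contained sketch of the same idea with made-up data: after `groupby('Hour')`, the hours become the index of the resulting Series, so they can be passed as the x values and also used as the tick positions. The numbers are invented; only the column names come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample data with two readings per hour
demo = pd.DataFrame({
    "Hour": [0, 0, 1, 1, 2, 2],
    "Availability": [0.9, 1.0, 0.8, 0.6, 1.0, 1.0],
})

# The grouped mean is a Series whose index holds the hours
hourly_mean = demo.groupby("Hour")["Availability"].mean()

fig, ax = plt.subplots()
ax.plot(hourly_mean.index, hourly_mean.values, marker="o")
ax.set_xticks(hourly_mean.index)   # one tick per hour present in the data
ax.set_xlabel("Hour")
ax.set_ylabel("Mean availability")
```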
I am trying to plot this time series in a chart, but the canvas is empty.
As you can see in the image above, my time series is quite simple. I want to plot DATE in x-axis and PAYEMS in the y-axis.
At first, I was getting an error because my dates were strings, so I converted them in cell 11.
You do not want to use a tsplot to plot a time series. The name is a bit confusing, but as the documentation puts it, tsplot is "intended to be used with data where observations are nested within sampling units that were measured at multiple timepoints". As a rule of thumb: if you understand this sentence, you will know when to use it; if you don't understand this sentence, don't use it. Besides, tsplot will be removed or significantly altered in the future, so its use is deprecated.
But that doesn't matter, because you can directly use pandas to plot the time series.
df.plot(x="Date", y="Payems")
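As a rough end-to-end sketch of that suggestion, including the string-to-date conversion mentioned in the question: the values below are made up, and the column names follow the question's DATE and PAYEMS, so adjust them to whatever the DataFrame actually uses.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Made-up sample; the dates start out as strings, as in the question
df = pd.DataFrame({
    "DATE": ["2020-01-01", "2020-02-01", "2020-03-01"],
    "PAYEMS": [100, 102, 105],
})

# Convert the strings to real datetimes so the x-axis is a proper time axis
df["DATE"] = pd.to_datetime(df["DATE"])

# pandas draws the line directly and returns the matplotlib Axes
ax = df.plot(x="DATE", y="PAYEMS")
```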
I'm currently producing some histograms with matplotlib. The issue is that one or two outliers make the whole graph incredibly small and almost impossible to read, especially since two separate histograms are being plotted. The solution I am having trouble with is dropping the outliers at around the 99th or 99.5th percentile. I have tried using:
plt.xlim([np.percentile(df,0), np.percentile(df,99.5)])
plt.xlim([df.min(),np.percentile(df,99.5)])
Seems like it should be a simple fix, but I'm missing some key information to make it happen. Any input would be much appreciated, thanks in advance.
To restrict focus to just the middle 99% of the values, you could do something like this:
trimmed_data = df[(df.Column > df.Column.quantile(0.005)) & (df.Column < df.Column.quantile(0.995))]
Then you could do your histogram on trimmed_data. Exactly how to exclude outliers is more of a stats question than a Python question, but basically the idea I was suggesting in a comment is to clean up the data set using whatever methods you can defend, and then do everything (plots, stats, etc.) on only the cleaned dataset, rather than trying to tweak each individual plot to make it look right while still having the outlier data in there.
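To make the trimming step concrete, here is a small sketch on made-up data: the values 1 to 199 plus one extreme outlier. The column name `Column` follows the snippet above; everything else is an assumption for illustration.

```python
import pandas as pd

# Made-up data: 1..199 plus a single extreme outlier
df = pd.DataFrame({"Column": list(range(1, 200)) + [10_000]})

# Cut-offs for the middle 99% of the values
lower = df.Column.quantile(0.005)
upper = df.Column.quantile(0.995)

# Keep only the rows strictly between the two cut-offs
trimmed_data = df[(df.Column > lower) & (df.Column < upper)]
```

The histogram, and any other plots or statistics, can then be computed on `trimmed_data` alone, for example with `trimmed_data.Column.hist()`.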