I'm trying to filter some data from my plot. Specifically it is the 'hair' at the top of the cycle that I wish to remove.
The 'hair' is due to some numerical errors and I'm not aware of any method in python or how to write a code that would filter away points that don't occur frequently.
I wish to filter the data as I plot it.
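You can smooth the data with a rolling mean to get rid of the noise:
import pandas as pd
import matplotlib.pyplot as plt
# ts is assumed to be a pandas Series holding your data.
# Plot the raw data first if you want to compare, e.g. ts.plot()
# pd.rolling_mean() has been removed from pandas; use Series.rolling() instead.
smooth_data = ts.rolling(5).mean()
smooth_data.plot(style='k')
plt.show()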
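A minimal, self-contained sketch of the rolling-mean idea, assuming a recent pandas where pd.rolling_mean is no longer available and Series.rolling is used instead. The sine signal with a few injected spikes is made up for illustration and only stands in for the real data; the window of 5 is the value used in the answer.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the real data: a sine wave with a few spikes ("hair").
t = np.linspace(0, 4 * np.pi, 200)
values = np.sin(t)
values[[50, 120, 170]] += 5.0  # injected numerical spikes
ts = pd.Series(values, index=t)

# Rolling mean over 5 samples; the first 4 entries are NaN
# because the window is not yet full there.
smooth = ts.rolling(5).mean()

# Draw the raw and the smoothed series on the same axes for comparison.
fig, ax = plt.subplots()
ts.plot(ax=ax, color="0.7", label="raw")
smooth.plot(ax=ax, color="k", label="rolling mean (window=5)")
ax.legend()
```

Call plt.show() to display the figure. A rolling mean only spreads a spike over the window; ts.rolling(5).median() is an alternative that tends to suppress isolated spikes more strongly.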
I'm working on my school project which asks me to create a bar plot. I'm unable to understand the function, can anyone please help?
def get_barplot(f_dict,title):
    """
    ******* CHANGE 2 (50 points) **********
    Shows and saves the Bar Plot
    """
    #Uncomment and fill the blanks
    freq_df = pd.DataFrame(f_dict._______,columns=['key','value']) #converts the dictionary to a dataframe
    bar_plot = ___.barplot(_________________________)
    bar_plot.set(title=title+'_BarPlot',xlabel='Words', ylabel='Count') #Setting title and labels
    plt.xticks(rotation=45) #Rotating each word because of the length of the words
    plt.show()
    bar_plot.figure.savefig(title+'_barplot.png',bbox_inches='tight') #saving the file
This is the code. Can anyone please let me know what I should write in the given blanks? I've spent the last hour trying to understand it, but I can't.
I tried different methods, but they didn't work.
It is always useful to look at the API documentation when trying to understand the library functions.
Blank 1: In the first line of your code you are trying to create a Pandas data frame from a dictionary. The first argument for pd.DataFrame is the data (see pandas.DataFrame). In this case, the items in your dictionary i.e. f_dict.items(). The columns parameter provides you a clue here as these are "key" and "value" i.e. an item in the dictionary.
Blanks 2 and 3: I assume you are using Seaborn, which has a barplot function (see seaborn.barplot), and that it has been imported with the alias sns. sns.barplot takes the data frame through its data parameter, which in this case is the data frame you created in the first line of your code, and the x and y parameters name the columns to plot, e.g. sns.barplot(x='key', y='value', data=freq_df).
First, you must pass the dictionary's items, not the dictionary itself, to the DataFrame constructor:
freq_df = pd.DataFrame(f_dict.items(),columns=['key','value'])
Next, you need to create a barplot. Pandas has its own, slightly different method for this (.plot.bar()), but your template calls .barplot, which is the function from the seaborn library.
As I understand it, you need a bar plot of the word counts, with the words on the x-axis and the counts on the y-axis (this matches the xlabel and ylabel set in your template). The following code does this:
bar_plot = sns.barplot(x='key', y='value', data=freq_df)
And make sure you import the seaborn library. The abbreviation sns is usually used for it:
import seaborn as sns
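For comparison, the pandas route mentioned above (.plot.bar()) can be sketched as follows. This is only an illustration, not the assignment's expected answer: the word counts in f_dict and the title are invented, and list(...) around items() is a defensive choice for older pandas versions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical word counts standing in for the real f_dict.
f_dict = {"apple": 4, "banana": 2, "cherry": 7}

# Each (key, value) pair of the dictionary becomes one row of the data frame.
freq_df = pd.DataFrame(list(f_dict.items()), columns=["key", "value"])

# pandas' own bar plot: words on the x-axis, counts on the y-axis.
ax = freq_df.plot.bar(x="key", y="value", legend=False, rot=45)
ax.set(title="Example_BarPlot", xlabel="Words", ylabel="Count")
```

The seaborn call in the answers takes the same two column names through its x and y parameters.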
I have an Excel file containing rows of objects with at least two columns of variables: one for year and one for category. There are 22 types in the category variable.
So far, I can read the Excel file into a DataFrame and apply a pivot table to show the count of each category per year. I can also plot these yearly counts by category. However, when I do so, only 4 of the 22 categories are plotted. How do I instruct Matplotlib to show plot lines and labels for each of the 22 categories?
Here is my code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("table_merged.xlsx", sheet_name="records", encoding="utf8")
df.pivot_table(index="year", columns="category", values="y_m_d", aggfunc=np.count_nonzero, fill_value="0").plot(figsize=(10,10))
I checked the matplotlib documentation for plot(). The only argument that seemed remotely related to what I'm trying to accomplish is markevery() but it produced the error "positional argument follows keyword argument", so it doesn't seem right. I was able to use several of the other arguments successfully, like making the lines dashed, etc.
Here is the dataframe
Here is the resulting plot generated by matplotlib
Here are the same data plotted in Excel. I'm trying to make a similar plot using matplotlib
Solution
Change pivot_table(...,fill_value="0") to pivot_table(...,fill_value=0) and all of the categories appear in the figure as coded above. In the original figure, the four displayed categories were the only ones of the 22 that had a count in every year, so nothing was filled in for them and their columns stayed numeric. Any category that was filled with the string "0" became a non-numeric column and was ignored when plotting.
A simpler and better solution is pd.crosstab(df['year'],df['category']) rather than my line 5 above.
The problem comes from the pivot. Most likely you don't need it, since you are just tabulating year against category; the y_m_d column is not needed at all.
Try something like below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'year':np.random.randint(2008,2020,1000),
                   'category':np.random.choice(np.arange(10),size=1000,p=np.arange(10)/sum(np.arange(10))),
                   'y_m_d':np.random.choice(['a','b','c'],1000)})
pd.crosstab(df['year'],df['category']).plot()
And looking at the code you have, the error comes from:
pivot(...,fill_value="0")
You are filling with the string "0", which turns every column that had a missing value into a non-numeric (object) column, and those columns are skipped when plotting. It should be fill_value=0, and then it will work, although the pivot is a needlessly complicated approach.
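The effect of the fill value can be checked on a tiny example. This is a sketch with invented data (three rows, two categories) that only reuses the column names from the question; it compares the column types produced by the string fill, the integer fill, and pd.crosstab.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical miniature of the question's table: category "b" has no row in 2009.
df = pd.DataFrame({
    "year": [2008, 2008, 2009],
    "category": ["a", "b", "a"],
    "y_m_d": ["2008-01-01", "2008-02-01", "2009-01-01"],
})

# Filling the missing (2009, "b") cell with the string "0" ...
bad = df.pivot_table(index="year", columns="category", values="y_m_d",
                     aggfunc="count", fill_value="0")
# ... versus filling it with the integer 0.
good = df.pivot_table(index="year", columns="category", values="y_m_d",
                      aggfunc="count", fill_value=0)
# crosstab counts the combinations directly and fills absent ones with 0.
counts = pd.crosstab(df["year"], df["category"])

# Column "b" is numeric only when the fill value is a number.
print("string fill numeric:", is_numeric_dtype(bad["b"]))
print("integer fill numeric:", is_numeric_dtype(good["b"]))
print(counts)
```

Only numeric columns are drawn by DataFrame.plot(), which is why the categories filled with "0" went missing from the figure.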
I want to add a key so that I'm able to know which color is which column in my data frame. I made this by df.column_name.plot.density() multiple times. I've seen other examples with the key but I haven't been able to locate the code that adds it in.
In matplotlib, the display you're talking about is called a legend. I'm not sure if it's the same in pandas, but it's worth looking at!
Since your example didn't include enough code for me to try it out, I didn't.
Don't plot the variables one by one. Use df.plot.density(), which adds a legend automatically. If you want to plot a subset of variables, use df[var_list].plot.density(). If you need to plot them one by one for some reason, pass the label argument to each plot call and add a legend at the end.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.normal(size = (10,4)),
                  columns = ["Col1", "Col2", "Col3", "Col4"])
df.plot.density()
plt.show()
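If the columns do have to be plotted one at a time, the label-plus-legend route can be sketched like this. The data, the column names and the legend title are invented for illustration, and pandas' density plots need SciPy to be installed.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data frame; any numeric columns would do.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["Col1", "Col2", "Col3"])

# Draw each selected column on the same axes and give it an explicit label.
fig, ax = plt.subplots()
for name in ["Col1", "Col3"]:
    df[name].plot.density(ax=ax, label=name)

# The legend (the "key") is built from the labels collected above.
ax.legend(title="Column")
```

Call plt.show() to display the figure.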
I bumped into a problem when plotting a pandas series.
When plotting the series with a datetime x-axis, x-axis is accordingly relabeled when zooming, i.e. it works fine:
from matplotlib import pyplot as plt
from numpy.random import randn
from pandas import Series,date_range
import numpy as np, pandas as pd
date_index = date_range('1/1/2016', periods=6*24*7, freq='10Min')
ts = Series(randn(len(date_index)), index=date_index)
ts.plot(); plt.show()
However, when I redefine the series index as strings, a strange thing happens: the zoom no longer works properly (the limits do not seem to change).
sindex=np.vectorize(lambda s: s.strftime('%d.%m %H:%M'))(ts.index.to_pydatetime())
ts = Series(randn(len(date_index)), index=sindex)
ts.plot(); plt.show()
Is this a bug, or am I misusing or misunderstanding something? Any advice or help would be very welcome.
I also noticed that plotting with kind='bar' is incredibly slow compared to the default (with longer vectors), and I am not sure what causes that.
When you format your date labels as strings before plotting, you lose all the actual date information; they're just strings now. This means that pandas / matplotlib can't reformat the tick labels when you zoom. See the first paragraph after the plot here.
For your second question, a bar plot draws a tick and a bar for every data point. For large series this gets expensive. At this time pandas bar plots are not hooked into the auto-formatting like plot is. You can do a bar plot directly with matplotlib though, and suppress some of the ticks yourself.
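Both points can be sketched with plain matplotlib. This is an illustration rather than the answerer's own code: it keeps real datetimes on the line plot and changes only the tick format through matplotlib.dates.DateFormatter, and it draws the bars at integer positions while labelling only every step-th one. The step of 144 (one label per day at 10-minute sampling) is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# One week of 10-minute samples, as in the question.
date_index = pd.date_range("2016-01-01", periods=6 * 24 * 7, freq="10min")
rng = np.random.default_rng(0)
ts = pd.Series(rng.standard_normal(len(date_index)), index=date_index)

fig, (ax_line, ax_bar) = plt.subplots(2, 1, figsize=(8, 6))

# Line plot: the x values stay datetimes, only the tick text format changes,
# so the tick locations can still be recomputed when zooming.
ax_line.plot(ts.index, ts.values)
ax_line.xaxis.set_major_formatter(mdates.DateFormatter("%d.%m %H:%M"))

# Bar plot drawn directly with matplotlib: one bar per point,
# but a tick and label only at every `step`-th position.
positions = np.arange(len(ts))
step = 144  # hypothetical choice: one label per day
ax_bar.bar(positions, ts.values, width=1.0)
ax_bar.set_xticks(positions[::step])
ax_bar.set_xticklabels(ts.index[::step].strftime("%d.%m"), rotation=45)
```

Call plt.show() to open the interactive window and try the zoom on the upper axes.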
I've been able to import and plot multiple columns of data against the same x axis (time) with legends, from csv files using genfromtxt as shown in this link:
Matplotlib: Import and plot multiple time series with legends direct from .csv
The above simple example works fine if all cells in the csv file contain data. However some of my cells have missing data, and some of the parameters (columns) only include data points every e.g. second or third time increment.
I want to plot all the parameters on the same time axis as previously; and if one or more data points in a column are missing, I want the plot function to skip the missing data points for that parameter and only draw lines between the points that are available for that parameter.
Further, I'm trying to find a generic solution which will automatically plot in the above style directly from the csv file for any number of columns, time points, missing data points etc., when these are not known in advance.
I've tried using the genfromtxt options missing_values and filling_values, as shown in my non-working example below; however I want to skip the missing data points rather than assign them the value '0'; and in any case with this approach I seem to get "ValueError: could not convert string to float" when missing data points are encountered.
Plotting multiple parameters against time on the same plot, whilst dealing with occasional or regularly skipped values must be a pretty common problem for the scientific community.
I'd be very grateful for any suggestions for an elegant solution using genfromtxt.
Non-working code and demo data below. Many thanks in anticipation.
Demo data: 'DemoData.csv':
Time,Parameter_1,Parameter_2,Parameter_3
0,10,12,11
1,20,,
2,25,23,
3,30,,30
import numpy as np
import matplotlib.pyplot as plt
arr = np.genfromtxt('DemoData.csv', delimiter=',', dtype=None, missing_values='', filling_values = 0)
names = (arr[0])
for n in range (1,len(names)):
    plt.plot (arr[1:,0],arr[1:,n],label=names[n])
plt.legend()
plt.show()
I think if you set usemask=True in your genfromtxt command, it will do what you want. You probably don't want filling_values set either:
arr = np.genfromtxt('DemoData.csv', delimiter=',', dtype=None, missing_values='', usemask=True)
You can then plot using something like this:
for n in range(1, len(names)):
    # keep only the rows where column n is not masked
    plt.plot(arr[1:,0][np.logical_not(arr[1:,n].mask)], arr[1:,n].compressed(), label=names[n])
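A self-contained variant of the same idea, offered as a sketch under different assumptions: instead of usemask, it lets genfromtxt read the header with names=True and relies on empty cells becoming NaN in float columns, then drops the NaN points per column before plotting. The demo data from the question is inlined through io.StringIO so that no file is needed.

```python
import io
import numpy as np
import matplotlib.pyplot as plt

# The demo data from the question, inlined so the sketch runs without a file.
csv_text = """Time,Parameter_1,Parameter_2,Parameter_3
0,10,12,11
1,20,,
2,25,23,
3,30,,30
"""

# names=True takes the field names from the header row;
# with the default float dtype, empty cells are read as NaN.
arr = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)

# The first column is the time axis, every other column is a parameter.
time_name, *param_names = arr.dtype.names

fig, ax = plt.subplots()
for name in param_names:
    present = ~np.isnan(arr[name])  # True where this parameter has a value
    # Lines are drawn only between the points that exist for this parameter.
    ax.plot(arr[time_name][present], arr[name][present], marker="o", label=name)
ax.set_xlabel(time_name)
ax.legend()
```

Because the column names come from the header, the loop works for any number of columns; call plt.show() to display the figure.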