Adding a key on a density graph with Pandas - python

I want to add a key so that I can tell which color corresponds to which column in my data frame. I made the plot by calling df.column_name.plot.density() once per column. I've seen other examples with a key, but I haven't been able to find the code that adds it.

In matplotlib, the display you're talking about is called a legend. I'm not sure if it's the same in pandas, but it's worth looking at!
Since your example didn't include enough code for me to try it out, I didn't.

Don't plot the variables one by one. Use df.plot.density(). If you want to plot a subset of variables, use df[var_list].plot.density(). If you need to plot them one by one for some reason, pass the label argument to each plot call and add a legend at the end.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.normal(size = (10,4)),
columns = ["Col1", "Col2", "Col3", "Col4"])
df.plot.density()
plt.show()
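If the columns do have to be plotted one at a time, the legend has to be requested explicitly. The following is a minimal sketch of that approach, not the asker's actual code: the column names and random data are made up, the Agg backend is chosen only so the script runs without a display, and pandas density plots need SciPy to be installed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data standing in for the asker's DataFrame
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["Col1", "Col2", "Col3"])

fig, ax = plt.subplots()
for col in df.columns:
    # label sets the text that will appear in the legend for this curve
    df[col].plot.density(ax=ax, label=col)

# Series plots do not draw a legend by default, so add it at the end
ax.legend(title="Column")

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```

With df.plot.density() on the whole DataFrame the legend is drawn automatically, so the loop is only worth it when each curve needs separate styling.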

Related

Creating a bar plot in python

I'm working on my school project which asks me to create a bar plot. I'm unable to understand the function, can anyone please help?
def get_barplot(f_dict,title):
    """
    ******* CHANGE 2 (50 points) **********
    Shows and saves the Bar Plot
    """
    #Uncomment and fill the blanks
    freq_df = pd.DataFrame(f_dict._______,columns=['key','value']) #converts the dictionary to a dataframe
    bar_plot = ___.barplot(_________________________)
    bar_plot.set(title=title+'_BarPlot',xlabel='Words', ylabel='Count') #Setting title and labels
    plt.xticks(rotation=45) #Rotating each word because of the length of the words
    plt.show()
    bar_plot.figure.savefig(title+'_barplot.png',bbox_inches='tight') #saving the file
This is the code. Can anyone please tell me what I should write in the blanks? I've spent the last hour trying to understand it, but I can't.
I tried several different methods, but none of them worked.
It is always useful to look at the API documentation when trying to understand the library functions.
Blank 1: In the first line of your code you are trying to create a Pandas data frame from a dictionary. The first argument for pd.DataFrame is the data (see pandas.DataFrame). In this case, the items in your dictionary i.e. f_dict.items(). The columns parameter provides you a clue here as these are "key" and "value" i.e. an item in the dictionary.
Blanks 2 and 3: I assume you are using Seaborn which has a .barplot method (see seaborn.barplot). I also assume that this has been imported with the alias sns. Seaborn's .barplot method takes a data frame as the first argument which in this case would be the data frame you created in the first line of your code i.e. sns.barplot(data=freq_df).
Firstly, you must pass to the dataframe method not just a dictionary, but its items:
freq_df = pd.DataFrame(f_dict.items(),columns=['key','value'])
Next, you need to create a barplot. Pandas has its own method for this (.plot.bar()), but your code calls .barplot, which is the function from the seaborn library.
As I understand it, you need to plot each word against its count, so the 'key' column goes on the x axis and the 'value' column on the y axis. The following code does this:
bar_plot = sns.barplot(x = 'key', y = 'value', data = freq_df)
Also make sure you import the seaborn library. The alias sns is usually used for it:
import seaborn as sns
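To see what the first blank produces, here is a minimal sketch with a made-up word-count dictionary. It uses the pandas .plot.bar() alternative mentioned above instead of seaborn, so only pandas and matplotlib are needed. The dictionary contents, the title, and the Agg backend are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical word frequencies standing in for f_dict
f_dict = {"apple": 4, "banana": 2, "cherry": 7}

# Each (key, value) pair of the dictionary becomes one row
freq_df = pd.DataFrame(list(f_dict.items()), columns=["key", "value"])
print(freq_df)

# pandas equivalent of the seaborn bar plot: words on x, counts on y
ax = freq_df.plot.bar(x="key", y="value", legend=False)
ax.set(title="Demo_BarPlot", xlabel="Words", ylabel="Count")
ax.tick_params(axis="x", rotation=45)
```

The resulting freq_df has one row per word, which is the shape a bar plot with x='key' and y='value' expects.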

How do I display a Grouped Bar Chart for multiple fields? (Altair)

I have the following dataset
I want to display this in some kind of diagram. The parameters confirmed, deaths, and recovered should be on the X axis, and they must be shown for each region_name. The Y axis should be the sum of these values. I read about the melt() method in the official documentation, but I didn't quite understand how to use it.
I need to get something like this, only in the following form.
You have wide-form data; you need to convert it to long-form data. You can either do that in pandas using melt() or a similar method, or you can use Altair's transform_fold. You can read more about this in https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data
For your data, it might look something like this:
import pandas as pd
import altair as alt
data = pd.read_csv('data_from_screenshot.csv')
alt.Chart(data).transform_fold(
    ["confirmed", "deaths", "recovered"],
    as_=["field", "value"]
).mark_bar().encode(
    x="field:N",
    y="sum(value):Q",
    column="region_name:N"
)
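The pandas route mentioned above can be sketched as follows. The region names and numbers are invented, since the original dataset was only shown as a screenshot, and the "field" and "value" column names are chosen to match the transform_fold example.

```python
import pandas as pd

# Hypothetical wide-form data: one row per region, one column per measure
data = pd.DataFrame({
    "region_name": ["North", "South"],
    "confirmed": [120, 80],
    "deaths": [5, 3],
    "recovered": [100, 60],
})

# melt() turns the three measure columns into (field, value) pairs,
# giving one row per region and measure (long form)
long_df = data.melt(
    id_vars="region_name",
    value_vars=["confirmed", "deaths", "recovered"],
    var_name="field",
    value_name="value",
)
print(long_df)

# The per-field sums that the chart's y axis would show
totals = long_df.groupby("field")["value"].sum()
print(totals)
```

A frame shaped like long_df can be passed to alt.Chart directly with the same x, y, and column encodings, without the transform_fold step.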

How to show more categories in a line plot of a pivot table

I have an Excel file containing rows of objects with at least two columns of variables: one for year and one for category. There are 22 types in the category variable.
So far, I can read the Excel file into a DataFrame and apply a pivot table to show the count of each category per year. I can also plot these yearly counts by category. However, when I do so, only 4 of the 22 categories are plotted. How do I instruct Matplotlib to show plot lines and labels for each of the 22 categories?
Here is my code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("table_merged.xlsx", sheet_name="records", encoding="utf8")
df.pivot_table(index="year", columns="category", values="y_m_d", aggfunc=np.count_nonzero, fill_value="0").plot(figsize=(10,10))
I checked the matplotlib documentation for plot(). The only argument that seemed remotely related to what I'm trying to accomplish is markevery() but it produced the error "positional argument follows keyword argument", so it doesn't seem right. I was able to use several of the other arguments successfully, like making the lines dashed, etc.
Here is the dataframe
Here is the resulting plot generated by matplotlib
Here are the same data plotted in Excel. I'm trying to make a similar plot using matplotlib
Solution
Change pivot_table(...,fill_value="0") to pivot_table(...,fill_value=0) and all of the categories appear in the figure as coded above. In the original figure, the four displayed categories were the only ones of the 22 that had a nonzero count in every year, which is why they were displayed. Any category that had been filled with the string "0" was ignored when plotting.
A simpler, and better solution is pd.crosstab(df['year'],df['category']) rather than my line 5 above.
The problem comes from the pivot table. Most likely you don't need it, since you are just tabulating year against category; the y_m_d column is not needed at all.
Try something like below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'year':np.random.randint(2008,2020,1000),
'category':np.random.choice(np.arange(10),size=1000,p=np.arange(10)/sum(np.arange(10))),
'y_m_d':np.random.choice(['a','b','c'],1000)})
pd.crosstab(df['year'],df['category']).plot()
Looking at the code you have, the problem comes from:
pivot_table(...,fill_value="0")
You are filling with the string "0", which turns those columns into non-numeric ones that are ignored when plotting. It should be fill_value=0, and then it will work, though it is a very complicated approach.
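One way to spot this kind of problem is to inspect the column dtypes before plotting. The sketch below builds a small made-up table in which one column contains the string "0", checks which columns count as numeric, and repairs the table with pd.to_numeric. The table contents are hypothetical and are not taken from the asker's Excel file.

```python
import pandas as pd

# Hypothetical yearly counts; category "b" was filled with the string "0"
table = pd.DataFrame(
    {"a": [3, 5, 2], "b": [4, "0", 1]},
    index=[2017, 2018, 2019],
)

# A column mixing integers and strings is stored with object dtype
print(table.dtypes)
numeric_before = list(table.select_dtypes(include="number").columns)

# Converting every column back to numbers makes all of them plottable
repaired = table.apply(pd.to_numeric)
numeric_after = list(repaired.select_dtypes(include="number").columns)

print(numeric_before)
print(numeric_after)
```

If numeric_before is shorter than the full list of columns, some categories would be missing from a line plot of that table.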

Filter data from python plot

I'm trying to filter some data from my plot. Specifically it is the 'hair' at the top of the cycle that I wish to remove.
The 'hair' is due to some numerical errors and I'm not aware of any method in python or how to write a code that would filter away points that don't occur frequently.
I wish to filter the data as I plot it.
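You can smooth the data with a rolling mean to get rid of the noise:
import pandas as pd
import matplotlib.pyplot as plt
# ts is assumed to be your data as a pandas Series
# Plot the raw data first if you want to compare, e.g. ts.plot()
# Rolling mean over a 5-point window (pd.rolling_mean was removed from pandas)
ts.rolling(5).mean().plot(style='k')
plt.show()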
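As a self-contained illustration of the rolling-mean idea, the sketch below builds a synthetic cycle with a few artificial spikes and smooths it. The sine signal, the spike positions, and the window size of 5 are all assumptions, since the asker's data was not posted.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's data: a sine cycle with isolated spikes
x = np.linspace(0, 4 * np.pi, 200)
values = np.sin(x)
values[::25] += 1.0  # artificial 'hair': one spike every 25 samples
ts = pd.Series(values, index=x)

# Centered 5-point rolling mean; the first and last two points become NaN
smooth = ts.rolling(5, center=True).mean()

print(ts.max(), smooth.max())
```

A rolling mean spreads each spike over its neighbours instead of removing it, so a larger window flattens the spikes more but also blurs the real peaks of the cycle.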

Matplotlib/Genfromtxt: Multiple plots against time, skipping missing data points, from .csv

I've been able to import and plot multiple columns of data against the same x axis (time) with legends, from csv files using genfromtxt as shown in this link:
Matplotlib: Import and plot multiple time series with legends direct from .csv
The above simple example works fine if all cells in the csv file contain data. However some of my cells have missing data, and some of the parameters (columns) only include data points every e.g. second or third time increment.
I want to plot all the parameters on the same time axis as previously; and if one or more data points in a column are missing, I want the plot function to skip the missing data points for that parameter and only draw lines between the points that are available for that parameter.
Further, I'm trying to find a generic solution which will automatically plot in the above style directly from the csv file for any number of columns, time points, missing data points etc., when these are not known in advance.
I've tried using the genfromtxt options missing_values and filling_values, as shown in my non-working example below; however I want to skip the missing data points rather than assign them the value '0'; and in any case with this approach I seem to get "ValueError: could not convert string to float" when missing data points are encountered.
Plotting multiple parameters against time on the same plot, whilst dealing with occasional or regularly skipped values must be a pretty common problem for the scientific community.
I'd be very grateful for any suggestions for an elegant solution using genfromtxt.
Non-working code and demo data below. Many thanks in anticipation.
Demo data: 'DemoData.csv':
Time,Parameter_1,Parameter_2,Parameter_3
0,10,12,11
1,20,,
2,25,23,
3,30,,30
import numpy as np
import matplotlib.pyplot as plt
arr = np.genfromtxt('DemoData.csv', delimiter=',', dtype=None, missing_values='', filling_values = 0)
names = (arr[0])
for n in range (1,len(names)):
    plt.plot (arr[1:,0],arr[1:,n],label=names[n])
plt.legend()
plt.show()
I think if you set usemask=True in your genfromtxt command, it will do what you want. You probably don't want filling_values set either:
arr = np.genfromtxt('DemoData.csv', delimiter=',', dtype=None, missing_values='', usemask=True)
You can then plot using something like this:
for n in range(1, len(names)):
    plt.plot(arr[1:,0][np.logical_not(arr[1:,n].mask)], arr[1:,n].compressed())
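A self-contained variant of the masked-array idea is sketched below. It differs from the answer in one assumption: the header row is read with names=True, so each column becomes a named float field instead of a slice of a string array. The demo data is embedded through io.StringIO, and the Agg backend is used only so the script runs without a display.

```python
import io
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# The demo data from the question, embedded so the example needs no file
csv_text = """Time,Parameter_1,Parameter_2,Parameter_3
0,10,12,11
1,20,,
2,25,23,
3,30,,30
"""

# names=True takes field names from the header; usemask=True masks empty cells
arr = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True, usemask=True)

time = np.ma.getdata(arr["Time"])
fig, ax = plt.subplots()
for name in arr.dtype.names[1:]:
    col = arr[name]
    keep = ~np.ma.getmaskarray(col)  # True where a value is present
    # Plot only the available points, so lines join across the gaps
    ax.plot(time[keep], col.compressed(), marker="o", label=name)
ax.legend()
```

Because the loop runs over arr.dtype.names, the number of columns does not need to be known in advance.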
