I have the following dataset
I want to display this in a chart with the parameters confirmed, deaths, and recovered on the X-axis, shown separately for each region_name. The Y-axis should be the sum of these values. I read about the melt() method in the official documentation, but I didn't quite understand how to use it.
I need to get something like this, only in the following form.
You have wide-form data; you need to convert it to long-form data. You can either do that in pandas using melt() or a similar method, or you can use Altair's transform_fold. You can read more about this in https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data
For your data, it might look something like this:
import pandas as pd
import altair as alt
data = pd.read_csv('data_from_screenshot.csv')
alt.Chart(data).transform_fold(
["confirmed", "deaths", "recovered"],
as_=["field", "value"]
).mark_bar().encode(
x="field:N",
y="sum(value):Q",
column="region_name:N"
)
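Since the question asked about melt() specifically, here is a minimal sketch of the pandas route. The DataFrame is a hypothetical stand-in for the screenshot data (the region names and numbers are invented); only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the screenshot data; values are invented.
data = pd.DataFrame({
    "region_name": ["North", "North", "South"],
    "confirmed": [10, 5, 7],
    "deaths": [1, 0, 2],
    "recovered": [4, 3, 1],
})

# Wide-form -> long-form: one row per (region_name, field) observation.
long_form = data.melt(
    id_vars="region_name",
    value_vars=["confirmed", "deaths", "recovered"],
    var_name="field",
    value_name="value",
)

# Optional: pre-compute the sums that y="sum(value):Q" would otherwise aggregate.
totals = long_form.groupby(["region_name", "field"], as_index=False)["value"].sum()
print(totals)
```

The long_form frame can be passed to alt.Chart directly with the same encoding as above, in which case the transform_fold step is no longer needed.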
I'm working on my school project which asks me to create a bar plot. I'm unable to understand the function, can anyone please help?
def get_barplot(f_dict,title):
"""
******* CHANGE 2 (50 points) **********
Shows and saves the Bar Plot
"""
#Uncomment and fill the blanks
freq_df = pd.DataFrame(f_dict._______,columns=['key','value']) #converts the dictionary to a dataframe
bar_plot = ___.barplot(_________________________)
bar_plot.set(title=title+'_BarPlot',xlabel='Words', ylabel='Count') #Setting title and labels
plt.xticks(rotation=45) #Rotating each word because of the length of the words
plt.show()
bar_plot.figure.savefig(title+'_barplot.png',bbox_inches='tight') #saving the file
This is the code. Can anyone please let me know what I should write in the blanks? I've spent the last hour trying to understand it, but I can't.
I tried different methods, but they didn't work.
It is always useful to look at the API documentation when trying to understand the library functions.
Blank 1: In the first line of your code you are trying to create a Pandas data frame from a dictionary. The first argument for pd.DataFrame is the data (see pandas.DataFrame). In this case, the items in your dictionary i.e. f_dict.items(). The columns parameter provides you a clue here as these are "key" and "value" i.e. an item in the dictionary.
Blanks 2 and 3: I assume you are using Seaborn, which has a barplot function (see seaborn.barplot). I also assume that it has been imported with the alias sns. Seaborn's barplot accepts a data frame through its data parameter, which in this case is the data frame you created in the first line of your code, together with the column names to put on each axis, i.e. sns.barplot(x='key', y='value', data=freq_df).
First, you must pass the DataFrame constructor not the dictionary itself but its items:
freq_df = pd.DataFrame(f_dict.items(),columns=['key','value'])
Next, you need to create a barplot. Pandas has its own method for this (.plot.bar()), but your template calls .barplot, which is the function from the seaborn library.
The dictionary already maps each word to its count, so put the 'key' column on the x-axis and the 'value' column on the y-axis. The following code does this:
bar_plot = sns.barplot(x='key', y='value', data=freq_df)
And make sure you import the seaborn library. The abbreviation sns is usually used for it:
import seaborn as sns
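To see what the first blank produces, here is a small sketch of the dictionary-to-DataFrame step on its own. The contents of f_dict are hypothetical word counts, and the sort is an optional extra that is not part of the assignment template.

```python
import pandas as pd

# Hypothetical word counts standing in for the assignment's f_dict.
f_dict = {"apple": 3, "banana": 1, "cherry": 2}

# Each (key, value) pair of the dictionary becomes one row.
freq_df = pd.DataFrame(list(f_dict.items()), columns=["key", "value"])

# Optional: sort so the most frequent words come first in the plot.
freq_df = freq_df.sort_values("value", ascending=False).reset_index(drop=True)
print(freq_df)
```

The resulting freq_df has one row per word, so it can be passed to sns.barplot with x='key' and y='value'.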
I'm working on a Kaggle competition to deepen my data science knowledge, and I've run into an issue with the seaborn library. I'm trying to plot a feature over time, but the relplot function is not able to print the datetime values. In the output, I see a big black box instead of the values.
Here is my plotting code:
rainfall_types = list(auser.iloc[:, 1:])  # column names after the first (Date) column
grid = sns.relplot(x='Date', y=rainfall_types[0], kind="line", data=auser);
grid.fig.autofmt_xdate()
Here is the seaborn.relplot output and the head of my dataset
I found the error. When you use pandas.read_csv(dataset), a column containing dates is not parsed automatically: it is loaded with the object dtype, and each value is a plain Python str. When you plot it, matplotlib treats every distinct string as a separate category label, so the tick labels overlap into a black box.
To avoid this behaviour, tell pandas to parse the column as datetime values:
df = pandas.read_csv(dataset, parse_dates=['Column_Date'])
This tells pandas that the column named 'Column_Date' holds dates and must be converted to datetime objects.
If you want, you can also use the date column as the index of your dataframe, which makes analysis over time easier. To do so, add the argument index_col='Column_Date' to your read_csv call.
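A quick way to check whether the date column was parsed is to inspect its dtype. The sketch below uses a tiny in-memory CSV; the column name 'Column_Date' and the values are placeholders, not the competition data.

```python
import io
import pandas as pd

# Placeholder CSV content; the real file and column names will differ.
csv_text = "Column_Date,rainfall\n2020-01-01,1.5\n2020-01-02,0.0\n"

# Default behaviour: the date column stays as text.
raw = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the listed columns to datetime64.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Column_Date"])

# index_col additionally makes the parsed column the index.
indexed = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["Column_Date"], index_col="Column_Date"
)

# For a frame that is already loaded, convert the column in place instead.
raw["Column_Date"] = pd.to_datetime(raw["Column_Date"], format="%Y-%m-%d")

print(parsed.dtypes)
print(type(indexed.index))
```

If the dates in the file are not in ISO format, passing an explicit format string to pd.to_datetime, as in the last step, avoids ambiguous day/month parsing.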
I hope you will find it helpful.
I have an Excel file containing rows of objects with at least two columns of variables: one for year and one for category. There are 22 types in the category variable.
So far, I can read the Excel file into a DataFrame and apply a pivot table to show the count of each category per year. I can also plot these yearly counts by category. However, when I do so, only 4 of the 22 categories are plotted. How do I instruct Matplotlib to show plot lines and labels for each of the 22 categories?
Here is my code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("table_merged.xlsx", sheet_name="records", encoding="utf8")
df.pivot_table(index="year", columns="category", values="y_m_d", aggfunc=np.count_nonzero, fill_value="0").plot(figsize=(10,10))
I checked the matplotlib documentation for plot(). The only argument that seemed remotely related to what I'm trying to accomplish is markevery() but it produced the error "positional argument follows keyword argument", so it doesn't seem right. I was able to use several of the other arguments successfully, like making the lines dashed, etc.
Here is the dataframe
Here is the resulting plot generated by matplotlib
Here are the same data plotted in Excel. I'm trying to make a similar plot using matplotlib
Solution
Change pivot_table(...,fill_value="0") to pivot_table(...,fill_value=0) and all of the categories appear in the figure as coded above. In the original figure, the four displayed categories were the only ones of the 22 that had a count for every year, so nothing was filled in for them. Any category that received the string "0" as a fill value became a non-numeric column and was left out of the plot.
A simpler, and better solution is pd.crosstab(df['year'],df['category']) rather than my line 5 above.
The problem comes from the pivot. Most likely you don't need it, since you are just tabulating year against category; the y_m_d column isn't needed at all.
Try something like below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'year':np.random.randint(2008,2020,1000),
'category':np.random.choice(np.arange(10),size=1000,p=np.arange(10)/sum(np.arange(10))),
'y_m_d':np.random.choice(['a','b','c'],1000)})
pd.crosstab(df['year'],df['category']).plot()
Looking at the code you have, the error comes from:
pivot_table(...,fill_value="0")
You are filling with the string "0", which turns every affected column into a non-numeric (object) column, and non-numeric columns are skipped when plotting. With fill_value=0 it works, though it remains a needlessly complicated approach.
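The effect of the string fill value can be reproduced without the Excel file. In this sketch the small table is hand-built to mimic a pivot result in which "0" was filled in as text, and select_dtypes is used as a stand-in for the numeric-only filtering that DataFrame.plot applies; the data is invented.

```python
import pandas as pd

# Hand-built stand-in for a pivot result filled with the string "0".
table = pd.DataFrame(
    {"a": [1, 2], "b": [1, "0"], "c": ["0", 1]},
    index=pd.Index([2018, 2019], name="year"),
)

# Only numeric columns survive, which mirrors what ends up on the plot.
plottable = table.select_dtypes(include="number").columns.tolist()
print(plottable)

# Repairing an existing table: convert every column back to numbers.
fixed = table.apply(pd.to_numeric)
print(fixed.dtypes)

# Building the counts with crosstab avoids the problem from the start.
records = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019, 2019],
    "category": ["a", "b", "a", "a", "c"],
})
counts = pd.crosstab(records["year"], records["category"])
print(counts)
```

Checking table.dtypes before plotting is a cheap way to spot this kind of problem: any column reported as object will not be drawn.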
I want to add a key so that I'm able to know which color is which column in my data frame. I made this by df.column_name.plot.density() multiple times. I've seen other examples with the key but I haven't been able to locate the code that adds it in.
In matplotlib, the display you're talking about is called a legend. I'm not sure if it's the same in pandas, but it's worth looking at!
Since your example didn't include enough code for me to try it out, I didn't.
Don't plot the variables one by one. Use df.plot.density(), which draws every column and adds a legend automatically. If you want to plot a subset of variables, use df[var_list].plot.density(). If you want to plot them one by one for some reason, pass the label argument to each plot call and add a legend at the end.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.normal(size = (10,4)),
columns = ["Col1", "Col2", "Col3", "Col4"])
df.plot.density()
plt.show()
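If you do plot the columns one by one, the label-plus-legend pattern looks roughly like this. The sketch draws plain line plots instead of density curves so that it runs without SciPy (which plot.density() needs); the same ax and label arguments can be passed to df[name].plot.density(). The Agg backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 2)), columns=["Col1", "Col2"])

fig, ax = plt.subplots()
for name in df.columns:
    # Draw each column on the same axes and give it a legend label.
    df[name].plot(ax=ax, label=name)

# Series plots do not add a legend by default, so request it explicitly.
ax.legend()
```

Reusing one ax for every call is what keeps the curves on a single figure; without it each plot call may open its own axes.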
I want to use the index of a pandas DataFrame as the x value for a seaborn plot. However, this raises a ValueError.
A small test example:
import pandas as pd
import seaborn as sns
sns.lineplot(x='index',y='test',hue='test2',data=pd.DataFrame({'test':range(9),'test2':range(9)}))
It raises:
ValueError: Could not interpret input 'index'
Is it not possible to use the index as x values? What am I doing wrong?
Python 2.7, seaborn 0.9
I would do it this way: pass the index itself as x. You also need to remove hue; it serves a different purpose that doesn't apply to your current DataFrame, because you have a single line. See the official docs for more info.
df=pd.DataFrame({'test':range(9),'test2':range(9)})
sns.lineplot(x=df.index, y='test', data=df)
You would need to make sure the string you provide to the x argument is actually a column in your dataframe. The easiest solution to achieve that is to reset the index of the dataframe to convert the index to a column.
sns.lineplot(x='index', y='test', data=pd.DataFrame({'test':range(9),'test2':range(9)}).reset_index())
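To see why reset_index() makes the string 'index' valid, it helps to look at the columns it produces. This is a small pandas-only sketch; the name "step" is a hypothetical label chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"test": range(9), "test2": range(9)})

# An unnamed index becomes a regular column called "index".
flat = df.reset_index()
print(flat.columns.tolist())

# Naming the index first gives the new column a more descriptive name.
named = df.rename_axis("step").reset_index()
print(named.columns.tolist())
```

With the named variant, the plotting call would use x='step' instead of x='index'.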
I know it's an old question, and maybe this wasn't around back then, but there's a much simpler way to achieve this:
If you just pass a series from a dataframe as the 'data' parameter, seaborn will automatically use the index as the x values.
sns.lineplot(data=df.column1)