Complex dataframe plotting with Pandas / Matplotlib - python

I'd like to create a single time-series graph from a pandas dataframe that looks like the following:
*sample of a simplified version of my dataframe:*
index to_network count
201401 net_1 100
201401 net_2 200
201401 net_3 150
201402 net_1 300
201402 net_2 250
201403 net_1 175
Ultimately, the final graph should be a time-series line graph (x-axis being the index and the y-axis being 'count') with multiple lines, and each line being a network in the to_network column (e.g., one line should be net_1).
I've been reading the 'Python for Data Analysis' book, but its examples don't appear to be this complex.

Does this work?
df.groupby('to_network')['count'].plot()
If you want to show the date correctly, you can try:
df.index=pd.to_datetime(df.index,format='%Y%m')

The default behavior of plot in pandas is to use the index as an x-axis and plot one line per column. So you want to reshape your data frame to mirror that structure. You can do the following:
df.pivot_table(index='index', columns = 'to_network', values = 'count', aggfunc = 'sum').plot()
This will pivot your df (which is in long format, à la ggplot) into a wide frame from which pandas' default plot behavior produces your desired result: one line per network, with the index as the x-axis and count as the value.
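To make the reshaping concrete, here is a minimal sketch using the sample rows from the question. It assumes the month is an ordinary column named "index", as in the call above; the variable name wide is only for illustration.

```python
import pandas as pd

# Sample rows from the question, in long format (one row per month/network pair).
df = pd.DataFrame({
    "index": [201401, 201401, 201401, 201402, 201402, 201403],
    "to_network": ["net_1", "net_2", "net_3", "net_1", "net_2", "net_1"],
    "count": [100, 200, 150, 300, 250, 175],
})

# Wide format: one row per month, one column per network.
# Month/network pairs that do not occur in the data become NaN.
wide = df.pivot_table(index="index", columns="to_network",
                      values="count", aggfunc="sum")
print(wide)
```

Calling wide.plot() on this frame should then draw one line per column. Lines for networks with missing months will have gaps unless you fill the NaN values first.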

To answer your question, I have checked in a notebook here: http://nbviewer.ipython.org/github/ericmjl/Stack-Overflow-Answers/blob/master/20141020%20Complex%20Pandas%20Plotting/Untitled0.ipynb
The core idea is to do a groupby, and then plot only the column that you're interested in.
Code is also pasted below here:
df = pd.read_csv("data.csv")
df.groupby("to_network")['count'].plot()
Also, be sure to add in Daniele's contribution, where you format the index correctly:
df.index=pd.to_datetime(df.index,format='%Y%m')
For attribution, I have up-voted her answer in addition to citing it here.
I hope this answers the question; if it did, please accept the answer!
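If you also want a legend and control over the axes, a variation on the same groupby idea is to loop over the groups and draw each one on a shared matplotlib Axes. This is only a sketch: it rebuilds the sample data from the question inline instead of reading data.csv, and it casts the index to strings before parsing, which is a defensive choice on my part.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Sample rows from the question, with the month as the dataframe index.
df = pd.DataFrame(
    {"to_network": ["net_1", "net_2", "net_3", "net_1", "net_2", "net_1"],
     "count": [100, 200, 150, 300, 250, 175]},
    index=[201401, 201401, 201401, 201402, 201402, 201403],
)

# Parse the YYYYMM integers into real dates.
df.index = pd.to_datetime(df.index.astype(str), format="%Y%m")

# One line per network, all drawn on the same Axes.
fig, ax = plt.subplots()
for name, group in df.groupby("to_network"):
    ax.plot(group.index, group["count"], marker="o", label=name)

ax.set_xlabel("month")
ax.set_ylabel("count")
ax.legend(title="to_network")
fig.savefig("networks.png")
```

The marker is there because net_3 has a single data point in the sample, which would otherwise be invisible as a line.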

Related

How to plot specific range of values on a histogram in python using pandas?

I've imported data from a .csv. There is a column within the data referring to number of people. The number of people ranges from 1 - 100 for each different input. My goal is to only plot on a histogram the inputs where number of people is less than 50.
I know how to plot the histogram.
df['people'].hist()
But, how do I specify the range of people?
I've tried df[df['people']< 50].hist() but that did not work.
I know this should be easy but I just don't get it! This is using python and pandas.
Try it with query function
df.query("people < 50")['people'].hist()
I used some sample data and tried to plot the histogram with df['people'][df['people']<50].hist() but df[df['people']<50].hist() also seems to work for me.
df = pd.DataFrame(
[1,1,2,3,3,5,7,8,9,10,
10,11,11,13,13,15,16,17,18,18,
18,19,20,21,21,23,24,24,25,25,
25,25,26,26,26,27,27,27,27,27,
29,30,30,31,33,34,34,34,35,36,
36,37,37,38,38,39,40,41,41,42,
43,44,45,45,46,47,48,48,49,50,
51,52,53,54,55,55,56,57,58,60,
61,63,64,65,66,68,70,71,72,74,
75,77,81,83,84,87,89,90,90,91], columns=['people'])
df.head()
df['people'][df['people']<50].hist()
I have attached a screenshot of the histogram.
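For what it's worth, the boolean mask and query select the same rows, so either should feed the histogram equally well. The sketch below uses a small made-up frame (the rooms column is hypothetical) to check that. One possible reason df[df['people'] < 50].hist() looked wrong in the question is that DataFrame.hist draws one histogram per numeric column, while selecting the column first gives a single histogram.

```python
import pandas as pd

# Hypothetical data: 'people' is the column of interest, 'rooms' is an extra numeric column.
df = pd.DataFrame({
    "people": [5, 20, 49, 50, 51, 75, 100],
    "rooms": [1, 2, 3, 4, 5, 6, 7],
})

# Two equivalent ways of keeping only the rows where people < 50.
by_mask = df.loc[df["people"] < 50, "people"]
by_query = df.query("people < 50")["people"]

print(by_mask.tolist())
print(by_mask.equals(by_query))

# by_mask.hist() would draw a single histogram of the filtered column;
# df[df["people"] < 50].hist() would draw one per numeric column (people and rooms).
```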

Z-score calculation/standardisation using pandas

I came across this video and one thing in it bugs me.
Essentially, at 5:50, they calculate the Z-score for the whole data frame with the following snippet:
df_z = (df - df.describe().T['mean'])/df.describe().T['std']
It is a neat and beautiful line.
However, df.describe().T and df are laid out differently.
df.describe().T['mean'] and df.describe().T['std'] are two individual Series indexed by the df column names (df.describe().T has the column names as its index and the summary statistics as its columns), whereas df is an ordinary pd.DataFrame, which has a numerical index and the column names in the usual place.
My question is: how does that line make sense when the shapes do not match at all? In particular, how is it ensured that every observation (x_i) is matched with the mean and std of its own variable?
Thank you.
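A small experiment may help here. When a DataFrame is combined with a Series using arithmetic operators, pandas by default aligns the Series index with the DataFrame's column labels and applies the operation to every row. The sketch below uses made-up columns a and b and checks that the result agrees with computing the mean and std per column directly.

```python
import pandas as pd

# Hypothetical two-column frame.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

stats = df.describe().T
mean = stats["mean"]  # Series indexed by column name: a, b
std = stats["std"]    # Series indexed by column name: a, b

# DataFrame (op) Series matches the Series index against the column labels,
# so column 'a' uses mean['a'] and std['a'], column 'b' uses mean['b'] and std['b'].
df_z = (df - mean) / std

# Same calculation written with the per-column reductions.
expected = (df - df.mean()) / df.std()

# Matching is by label, not by position: reordering the Series changes nothing.
df_z_shuffled = (df - mean[["b", "a"]]) / std[["b", "a"]]

print(df_z)
```

So each x_i meets the statistics of its own variable because the labels line up, not because the shapes do.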

Why the plot is not stacked bar in python(pandas)?

I have melted the data using the pd.melt function in pandas and pivoted the table, keeping the name and year as ids. That gave me the table I want. But when I plot the graph, it is not what I expect. The code below shows the work done so far.
I preferred this method since I have other variables with the same names and years (maybe some other method exists).
But I want a graph with bars representing 'Estimated Number of Pregnancies' for each state (including all India) over the years, as side-by-side bars.
How to achieve this?
Here's a minimal example of what you are doing. Hope this gives you some hint:
# sample data
df = pd.DataFrame({'name': ['a','a','b','b','c','c'],
'class' : [1,2,1,2,1,2],
'vals':[122,1122,3342,4431,4311,1989]})
# use groupby on columns you want to see on x axis
df.groupby(['name','class'])['vals'].sum().unstack().plot(kind='bar')
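It may help to look at the intermediate table before plotting, since its shape decides the layout of the bars. The sketch below reuses the sample data above; the variable name table is only for illustration.

```python
import pandas as pd

# Same sample data as above.
df = pd.DataFrame({'name': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'class': [1, 2, 1, 2, 1, 2],
                   'vals': [122, 1122, 3342, 4431, 4311, 1989]})

# groupby + sum gives a Series with a (name, class) MultiIndex;
# unstack moves the inner level ('class') into the columns.
table = df.groupby(['name', 'class'])['vals'].sum().unstack()
print(table)
```

Each row of table becomes one group on the x-axis and each column becomes one bar within that group, so table.plot(kind='bar') gives side-by-side bars. Passing stacked=True should stack them instead.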

From series to matplotlib-acceptable format

I have grouped some variables using groupby and now I want to plot them and edit the plot using matplotlib. In the code below, I have plotted the data using pandas, which gives me very little room to edit the graph (I think).
a = df_08.groupby('new_time').symbol.count()/len(set(df_08['date']))
a.plot()
The problem with using matplotlib and doing
plt.plot()
is that my data for 'a', after using 'groupby', is a pandas Series, and matplotlib does not seem to accept that.
'a' comes out like this, in Series format:
new_time symbol
09:30 224.2
09:31 133.8
09:32 117.6
09:33 113.5
09:34 108.4
The first column has the name 'Index', but I can't seem to treat it as the column name. I would like the first column to be on the x axis and the second column to be on the y axis.
Anyway, I guess my question is how to transform the data from Series to matplotlib acceptable format.
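One possible approach, sketched below, is to pass the Series index and values to matplotlib separately. The Series here is rebuilt by hand from the five rows shown above, and the y-axis label is my own guess at what the values mean.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Series rebuilt from the rows shown in the question.
a = pd.Series(
    [224.2, 133.8, 117.6, 113.5, 108.4],
    index=pd.Index(["09:30", "09:31", "09:32", "09:33", "09:34"], name="new_time"),
    name="symbol",
)

fig, ax = plt.subplots()
# The index supplies the x values and the Series values supply the y values.
ax.plot(a.index.to_numpy(), a.to_numpy(), marker="o")
ax.set_xlabel(a.index.name)             # "new_time" is the index name, not a column
ax.set_ylabel("average count per day")  # assumed meaning of the values
fig.savefig("symbol_counts.png")

# If you would rather have two ordinary columns, reset_index() returns a DataFrame.
as_frame = a.reset_index()
```

Note also that a.plot() itself returns a matplotlib Axes, so ax = a.plot() followed by the usual ax.set_... calls is another way to keep editing the pandas plot.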

Factorplot with multiindex dataframe

This is the dataframe I am working with:
(Only the first two years don't have data for country 69; I will fix this.) nkill is the number killed in that year, summed from the original long-form dataframe.
I am trying to do something similar to this plot:
However, with the country code as a hue. I know there are similar posts but none have helped me solve this, thank you in advance.
By hue I mean the seaborn usage, as pictured in this third picture. In that example, hue creates a plot for every value of the variable in that column. So if I had two country codes in the country column, it would plot two bars (one for each country) side by side for every year.
Just looking at the data, it should be possible to use the hue argument directly.
But first you would need to turn the index levels of the dataframe into actual columns:
df.reset_index(inplace=True)
Then something like
sns.barplot(x = "year", y="nkill", hue="country", data=df)
should give you the desired plot.
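To illustrate what reset_index does here, the sketch below builds a small made-up frame with a (year, country) MultiIndex. The years, country codes and nkill values are invented, and flat is just an illustrative name; it uses the non-inplace form so the original frame is left untouched.

```python
import pandas as pd

# Hypothetical data shaped like the question: nkill indexed by (year, country).
idx = pd.MultiIndex.from_product([[1970, 1971], [69, 101]],
                                 names=["year", "country"])
df = pd.DataFrame({"nkill": [0, 5, 2, 7]}, index=idx)

# Before: 'year' and 'country' are index levels, so they are not in df.columns.
print(list(df.columns))

# After: both levels are ordinary columns that x= and hue= can refer to by name.
flat = df.reset_index()
print(list(flat.columns))

# With seaborn installed, the call from the answer would then be:
# sns.barplot(x="year", y="nkill", hue="country", data=flat)
```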
