I melted my data with the pd.melt function in pandas and then pivoted the table, keeping the name and year as id columns. That gives me the table I want, but the plot does not come out the way I want. The code below shows the work done so far.
I chose this method because I have other variables with the same names and years (maybe some other method exists).
What I want is a graph with side-by-side bars representing 'Estimated Number of Pregnacies' for each state (including all India) over the years.
How can I achieve this?
Here's a minimal example of what you are doing. Hope this gives you some hint:
import pandas as pd

# sample data
df = pd.DataFrame({'name': ['a','a','b','b','c','c'],
                   'class': [1,2,1,2,1,2],
                   'vals': [122,1122,3342,4431,4311,1989]})
# group by the columns you want on the x axis, then unstack 'class' so each class becomes its own set of bars
df.groupby(['name','class'])['vals'].sum().unstack().plot(kind='bar')
I'm trying to make a line graph from my dataframe with the names of 10 customers on the X axis and the number of purchases they made on the Y axis.
I have over 100 customers in my data frame, so I created a new data frame that is grouped by customer and shows the sum of each customer's orders. I want to display only the top 10 customers on my graph.
I have tried using
TopCustomers.nlargest(10, 'Company', keep='first')
But I run into the error nlargest() got multiple values for argument 'keep', and if I don't pass keep, I am told it's a required argument.
TopCustomers is built as TopCustomers = raw.groupby(raw['Company'])['Orders'].sum()
Sorting is not required at the moment, but it'd be good to know in advance.
On an additional note: the customer names are rather long and, after playing with some dummy data, I see that the labels on the X axis overlap each other. Is there a way to make the plot bigger so that all 10 are clearly visible? And maybe mark a dot where X and Y meet?
We can use sort_values and tail; after an ascending sort, the last 10 rows are the 10 largest:
TopCustomers.sort_values().tail(10)
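As a side note on the error itself: TopCustomers is a Series (that is what groupby(...)['Orders'].sum() returns), and Series.nlargest takes only n and keep, so the column name passed as the second argument is what collides with keep. Below is a minimal sketch of calling it without a column name. The data is made up; only the names raw, Company, Orders and TopCustomers come from the question.

```python
import pandas as pd

# Hypothetical data standing in for `raw`; only the column names come from the question
raw = pd.DataFrame({
    "Company": ["A", "B", "A", "C", "B", "D"],
    "Orders": [5, 3, 7, 1, 4, 9],
})

# groupby(...)[col].sum() returns a Series indexed by Company
TopCustomers = raw.groupby("Company")["Orders"].sum()

# Series.nlargest needs no column name; the result is sorted in descending order
top = TopCustomers.nlargest(3)
print(top)
```

For the plotting part, something like top.plot(marker='o', rot=45) should draw a dot at each point and rotate the tick labels so they no longer overlap, though the exact angle and figure size are a matter of taste.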
I came across this video and it bugs me.
Essentially, at 5:50, they calculate the Z-score for the whole data frame with the following snippet:
df_z = (df - df.describe().T['mean'])/df.describe().T['std']
It is a neat and beautiful line.
However, df.describe().T looks like this and df looks like this
df.describe().T['mean'] and df.describe().T['std'] are two separate Series. df.describe().T has df's column names as its index and the summary statistics as its columns, so each of these Series is indexed by df's column names. df, on the other hand, is an ordinary pd.DataFrame with a numerical index and the column names as columns.
My question is: how does that line make sense when the shapes do not match at all? In particular, how is it ensured that every value (x_i) of a variable is matched with that variable's mean and std?
Thank you.
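One way to see what happens: in arithmetic between a DataFrame and a Series, pandas by default aligns the Series index with the DataFrame's columns and then broadcasts down the rows, so each column is paired with its own mean and std by label, not by position. A small sketch with a made-up frame (the column names x and y are hypothetical):

```python
import pandas as pd

# Hypothetical frame; the column names are made up for illustration
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 60.0]})

stats = df.describe().T          # one row per column of df
mean, std = stats["mean"], stats["std"]   # Series indexed by 'x' and 'y'

# DataFrame - Series matches the Series index against df's column labels,
# then applies the operation to every row
df_z = (df - mean) / std

# The same computation with the alignment axis spelled out
df_z_explicit = df.sub(df.mean(), axis="columns").div(df.std(), axis="columns")

print(df_z)
```

Because the matching is done on labels, reordering the columns of df would not change the result.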
I would like to filter a portion of data I have on a condition. Is it possible with Altair?
I am using the below code to plot a chart.
alt.Chart(deliveries).mark_bar().encode(
    alt.X('batsman', sort=alt.EncodingSortField(field='sum(batsman_runs)', op='count', order='descending')),
    alt.Y('sum(batsman_runs)'),
    tooltip=['batsman', 'sum(batsman_runs)']
).properties(height=600, width=3000).interactive()
But since there is a lot of data, there are many bars in my chart. I would like to restrict the bars by a condition, such as showing only the batsmen who scored more than 4000 runs.
I tried using transform_filter(), but it does not work with aggregate functions (I am using 'sum' here).
alt.Chart(deliveries).mark_bar().encode(
    alt.X('batsman', sort=alt.EncodingSortField(field='sum(batsman_runs)', op='count', order='descending')),
    alt.Y('sum(batsman_runs)'),
    tooltip=['batsman', 'sum(batsman_runs)']
).properties(height=600, width=3000).interactive().transform_filter(datum.sum(batsman_runs) > 4000)
Is there a way to achieve this functionality of filtering required data by giving a condition?
In order to reference an aggregate within a filter transform, it needs to be computed within an aggregate transform rather than in the encoding shorthand.
Something like this should work:
import altair as alt

alt.Chart(deliveries).transform_aggregate(
    total_runs='sum(batsman_runs)',
    groupby=['batsman']
).transform_filter(
    "datum.total_runs > 4000"
).mark_bar().encode(
    # batsman is a name, so it is nominal (:N); after the aggregate there is one row
    # per batsman, so op='sum' orders the bars by total_runs
    alt.X('batsman:N', sort=alt.EncodingSortField(field='total_runs', op='sum', order='descending')),
    alt.Y('total_runs:Q'),
    tooltip=['batsman:N', 'total_runs:Q']
).properties(height=600, width=3000).interactive()
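If the data fits comfortably in memory, a different route (not the transform-based one above) is to aggregate and filter in pandas first and hand Altair the already-reduced table. A rough sketch, assuming deliveries is a DataFrame with the batsman and batsman_runs columns from the question; the rows here are invented:

```python
import pandas as pd

# Hypothetical stand-in for `deliveries`; only the column names come from the question
deliveries = pd.DataFrame({
    "batsman": ["A", "A", "B", "B", "C"],
    "batsman_runs": [3000, 1500, 2000, 1000, 5000],
})

# One row per batsman with the summed runs
totals = (
    deliveries.groupby("batsman", as_index=False)["batsman_runs"].sum()
    .rename(columns={"batsman_runs": "total_runs"})
)

# Keep only batsmen above the threshold, largest first
top = totals[totals["total_runs"] > 4000].sort_values("total_runs", ascending=False)
print(top)
```

The resulting top frame can then be passed to alt.Chart(top) and encoded with 'batsman:N' and 'total_runs:Q', with no aggregate needed in the chart itself.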
This is the dataframe I am working with:
(Only the first two years don't have data for country 69; I will fix this.) nkill is the number killed in that year, summed from the original long-form dataframe.
I am trying to do something similar to this plot:
However, with the country code as a hue. I know there are similar posts, but none have helped me solve this. Thank you in advance.
By hue I mean the seaborn usage pictured in this third picture: hue creates a separate bar for every value in that column. So if I had two country codes in the country column, it would plot two bars side by side for every year (one for each country).
Just looking at the data it should be possible to directly use the hue argument.
But first you would need to turn the index levels into actual columns of the dataframe:
df.reset_index(inplace=True)
Then something like
sns.barplot(x = "year", y="nkill", hue="country", data=df)
should give you the desired plot.
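To illustrate what reset_index does here, the sketch below assumes year and country sit in the index (which is what a groupby sum would produce) and nkill is the only value column. The numbers are invented.

```python
import pandas as pd

# Hypothetical frame: (year, country) in the index, nkill as the value column
df = pd.DataFrame(
    {"nkill": [10, 4, 7, 12]},
    index=pd.MultiIndex.from_tuples(
        [(1970, 4), (1970, 69), (1971, 4), (1971, 69)],
        names=["year", "country"],
    ),
)

# reset_index moves the index levels into ordinary columns,
# which is what the x= and hue= arguments of seaborn look up by name
flat = df.reset_index()
print(flat.columns.tolist())
```

After this step, year and country can be referenced by name in the barplot call.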
I'd like to create a single time-series graph from a pandas dataframe that looks like the following:
*sample of a simplified version of my dataframe:*
index    to_network   count
201401   net_1        100
201401   net_2        200
201401   net_3        150
201402   net_1        300
201402   net_2        250
201403   net_1        175
Ultimately, the final graph should be a time-series line graph (x-axis being the index and the y-axis being 'count') with multiple lines, and each line being a network in the to_network column (e.g., one line should be net_1).
I've been reading the 'Python for Data Analysis' book, but its examples don't appear to be this complex.
Does this work?
df.groupby('to_network')['count'].plot()
If you want to show the date correctly, you can try:
df.index=pd.to_datetime(df.index,format='%Y%m')
The default behavior of plot in pandas is to use the index as an x-axis and plot one line per column. So you want to reshape your data frame to mirror that structure. You can do the following:
df.pivot_table(index='index', columns = 'to_network', values = 'count', aggfunc = 'sum').plot()
This will pivot your df (which is in the long format, à la ggplot) into a frame from which pandas' default plot behavior produces the desired result: one line per network, with index on the x-axis and count as the value.
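Here is that reshape applied to the sample rows from the question, as a sketch. It assumes index is an ordinary column (if it is the real index, call df.reset_index() first), and it converts the period labels to dates after pivoting.

```python
import pandas as pd

# Sample rows from the question; 'index' is assumed to be a regular column here
df = pd.DataFrame({
    "index": [201401, 201401, 201401, 201402, 201402, 201403],
    "to_network": ["net_1", "net_2", "net_3", "net_1", "net_2", "net_1"],
    "count": [100, 200, 150, 300, 250, 175],
})

# Long -> wide: one row per period, one column per network
wide = df.pivot_table(index="index", columns="to_network", values="count", aggfunc="sum")

# Turn 201401-style labels into real dates so the x-axis is spaced correctly
wide.index = pd.to_datetime(wide.index.astype(str), format="%Y%m")

print(wide)
```

Calling wide.plot() on this frame draws one line per network. Networks with no rows for a period show up as NaN, so their lines simply stop there.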
To answer your question, I have checked in a notebook here: http://nbviewer.ipython.org/github/ericmjl/Stack-Overflow-Answers/blob/master/20141020%20Complex%20Pandas%20Plotting/Untitled0.ipynb
The core idea is to do a groupby, and then plot only the column that you're interested in.
Code is also pasted below here:
df = pd.read_csv("data.csv")
df.groupby("to_network")['count'].plot()
Also, be sure to add in Daniele's contribution, where you format the index correctly:
df.index=pd.to_datetime(df.index,format='%Y%m')
For attribution, I have up-voted her answer in addition to citing it here.
I hope this answers the question; if it did, please accept the answer!