Best way to plot horizontal data and see it clearly? - python

I have my dataframe object df which looks like this:
product 7.month 8.month 9.month 10.month 11.month 12.month 1.month 2.month 3.month 4.month 5.month 6.month
0 phone 68 137 202 230 143 220 110 173 187 149 204 90
1 television <same kind of numerical data>
2
3
4
...
I would like to plot this data, but I'm not sure how, because the months are laid out horizontally (as columns) and there are around 20 products (rows) in my dataframe. The plot should be easy for people to read.

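Transpose the dataframe. Set the product column as the index first, so that only the numeric month columns are transposed:
df1 = df.set_index('product').T
and now plot df1, for example with df1.plot()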
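A minimal sketch of the transpose step, assuming product is an ordinary column as in the question's printout. The phone row is copied from the question; the television row is made up for illustration.

```python
import pandas as pd

# Column names as they appear in the question.
months = [f"{m}.month" for m in (7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6)]

df = pd.DataFrame(
    [
        # Row copied from the question.
        ["phone", 68, 137, 202, 230, 143, 220, 110, 173, 187, 149, 204, 90],
        # Hypothetical row, values invented for this sketch.
        ["television", 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160],
    ],
    columns=["product"] + months,
)

# Move the labels into the index so that only numbers are transposed.
# After .T the months are the rows and each product is a column.
df1 = df.set_index("product").T
print(df1.head())

# df1.plot()  # needs matplotlib; draws one line per product over the months
```

With around 20 products a single line chart can get crowded; df1.plot(subplots=True) is one option for giving each product its own panel.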

I agree and recommend Aavesh's approach. However, if it is absolutely necessary to access the data horizontally, then you can use list(df.iloc[index]) where index is the index of the row.
Then plot.

Related

Grouping rows by proximity of floats in a python pandas dataframe

It feels so straightforward, but I haven't found the answer to my question yet. How does one group by proximity, or closeness, of two floats in pandas?
OK, I could do this the loopy way, but my data is big and I hope I can expand my pandas skills with your help and do this elegantly:
I have a column of times in nanoseconds in my DataFrame. I want to group these into small clusters based on the proximity of their values. Most clusters will have two rows, maybe up to five or six. I do not know the number of clusters. There will be a massive number of very small clusters.
I thought I could, e.g., introduce a second index or just an additional column with 1 for all rows of the first cluster, 2 for the second and so forth, so that groupby becomes straightforward thereafter.
Something like:
      t (ns)             cluster
71    1524957248.4375    1
72    1524957265.625     1
699   14624846476.5625   2
700   14624846653.125    2
701   14624846661.287    2
1161  25172864926.5625   3
1160  25172864935.9375   3
Thanks for your help!
Assuming you want to create the "cluster" column from the index based on the proximity of the successive values, you could use:
thresh = 1
df['cluster'] = df.index.to_series().diff().gt(thresh).cumsum().add(1)
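Or, using the "t (ns)" column. The threshold is then in nanoseconds and has to be larger than the biggest gap inside a cluster (about 177 ns in the sample data), so thresh = 1 would put every row in its own cluster:
thresh = 1000
df['cluster'] = df['t (ns)'].diff().gt(thresh).cumsum().add(1)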
output:
t (ns) cluster
71 1.524957e+09 1
72 1.524957e+09 1
699 1.462485e+10 2
700 1.462485e+10 2
701 1.462485e+10 2
1161 2.517286e+10 3
1160 2.517286e+10 3
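A runnable sketch of the diff/cumsum idea on the sample rows from the question. The threshold of 1000 ns is an assumption chosen to fit this sample, and the sort step assumes clusters are defined by neighbours in time rather than by row order.

```python
import pandas as pd

# Sample rows copied from the question (index labels included).
df = pd.DataFrame(
    {
        "t (ns)": [
            1524957248.4375,
            1524957265.625,
            14624846476.5625,
            14624846653.125,
            14624846661.287,
            25172864926.5625,
            25172864935.9375,
        ]
    },
    index=[71, 72, 699, 700, 701, 1161, 1160],
)

thresh = 1000  # assumed maximum gap (ns) between neighbours in one cluster

# Sort so that diff() compares each value with its nearest earlier neighbour.
df = df.sort_values("t (ns)")

# A gap larger than thresh starts a new cluster; cumsum numbers the clusters.
gap = df["t (ns)"].diff()
df["cluster"] = gap.gt(thresh).cumsum().add(1)

# Once the label exists, groupby is straightforward.
summary = df.groupby("cluster")["t (ns)"].agg(["count", "min", "max"])
print(df)
print(summary)
```

The first diff is NaN, which compares as False, so the first row always lands in cluster 1.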
You can 'round' the t (ns) column by floor-dividing it by a threshold value and looking at the differences:
df[['t (ns)']].assign(
    cluster=(df['t (ns)'] // 10E7)
    .diff().gt(0).cumsum().add(1)
)
Or you can experiment with the number of clusters into which you want to organize your data:
bins = 3
df[['t (ns)']].assign(
    bins=pd.cut(df['t (ns)'], bins=bins)
    .cat.rename_categories(range(1, bins + 1))
)

Grouped bar plot with categorical column count

I have a dataframe with many columns; two of them, y and poutcome, are categorical. I want to make a bar plot grouped by y, with one bar per poutcome value inside each group. I created a groupby that gives this result:
y    poutcome
no   failure      427
     other        159
     success       46
     unknown     3368
yes  failure       63
     other         38
     success       83
     unknown      337
Based on that grouped dataframe, I expected a graph that looks like this: the legend and the colored bars would be failure, success, other and unknown, and they would be grouped by yes and no (in the example graph, 4 and 5 would be yes and no). You get the gist.
The groupby is bank.groupby(['y','poutcome'])['poutcome'].count(), but instead of a graph like the one above, mine looks like this:
How do I make it look like the first graph? The bars should represent poutcome and be grouped by y.
You should be able to just unstack(level=1) before plotting:
grouped.unstack(level=1).plot.bar()
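As a rough sketch of what unstack does here, the grouped counts from the question can be rebuilt by hand as a Series with a two-level index. The name grouped matches the answer above; building it from literals is only for illustration.

```python
import pandas as pd

# Counts copied from the question's groupby output.
index = pd.MultiIndex.from_product(
    [["no", "yes"], ["failure", "other", "success", "unknown"]],
    names=["y", "poutcome"],
)
grouped = pd.Series([427, 159, 46, 3368, 63, 38, 83, 337], index=index, name="count")

# Move the inner level (poutcome) into the columns:
# one row per y value, one column per poutcome value.
wide = grouped.unstack(level=1)
print(wide)

# wide.plot.bar()  # needs matplotlib; x-axis groups are y, legend entries are poutcome
```

DataFrame.plot.bar draws one group per row and one colored bar per column, which is why the shape of the frame decides what ends up in the legend.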
This is the dataframe that I used initially:
y poutcome count
0 no failure 427
1 no other 159
2 no success 46
3 no unknown 3368
4 yes failure 63
5 yes other 38
6 yes success 83
7 yes unknown 337
You should be able to get this from your groupby by setting as_index=False.
You can then use DataFrame.pivot to arrange the values for plotting:
df.pivot(index="poutcome", columns="y", values="count")
# y no yes
# poutcome
# failure 427 63
# other 159 38
# success 46 83
# unknown 3368 337
# and plot that:
df.pivot(index="poutcome", columns="y", values="count").plot.bar()
Alternatively to have poutcome in the legend swap the index and columns parameters:
df.pivot(index="y", columns="poutcome", values="count").plot.bar()
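One way to get from raw rows to the long y / poutcome / count frame and then to the pivoted frame is sketched below. The tiny bank frame is invented for illustration, and size().reset_index(name="count") is used here as an alternative to the as_index=False route mentioned above.

```python
import pandas as pd

# Hypothetical raw data: one row per customer, two categorical columns.
bank = pd.DataFrame(
    {
        "y": ["no", "no", "yes", "no", "yes", "yes"],
        "poutcome": ["failure", "unknown", "success", "unknown", "success", "failure"],
    }
)

# Long format: one row per (y, poutcome) pair with its row count.
counts = bank.groupby(["y", "poutcome"]).size().reset_index(name="count")

# Wide format for plotting: y on the x-axis, poutcome in the legend.
# Pairs that never occur come out as NaN, so fill them with 0.
wide = (
    counts.pivot(index="y", columns="poutcome", values="count")
    .fillna(0)
    .astype(int)
)
print(wide)

# wide.plot.bar()  # needs matplotlib
```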

Pandas: how to apply weight column to create a new dataframe with weighted data

I have this dataset from US Census Bureau with weighted data:
Weight Income ......
2 136 72000
5 18 18000
10 21 65000
11 12 57000
23 43 25700
The first person represents 136 people, the second 18 and so on. There are a lot of other columns and I need to do several charts and calculations. It will be too much work to apply the weight every time I need to do a chart, pivot table, etc.
Ideally, I would like to use this:
df2 = df.iloc[np.repeat(df.index.values, df.PERWT)]
To create an unweighted or flat dataframe.
This produces a new large (1.4GB) dataframe:
Weight Wage
0 136 72000
0 136 72000
0 136 72000
0 136 72000
0 136 72000
.....
The thing is that using all the columns of the dataset, my computer runs out of memory.
Any idea on how to use the weights to create a new weighted dataframe?
I've tried this:
df2 = df.sample(frac=1, weights=df['Weight'])
But it seems to produce the same data. Changing frac to 0.5 could be a solution, but I'll lose 50% of the information.
Thanks!
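For reference, a small sketch of the two routes on the five sample rows from the question: expanding the rows with np.repeat, and computing a weighted statistic directly without expanding. The column names Weight and Income follow the first table in the question; using df.loc with repeated labels is an assumption made here because the sample index (2, 5, 10, ...) is not positional.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "Weight": [136, 18, 21, 12, 43],
        "Income": [72000, 18000, 65000, 57000, 25700],
    },
    index=[2, 5, 10, 11, 23],
)

# Route 1: repeat every row Weight times (memory grows with the sum of weights).
labels = np.repeat(df.index.to_numpy(), df["Weight"].to_numpy())
expanded = df.loc[labels]

# Route 2: keep the frame as it is and pass the weights to the calculation.
weighted_mean = np.average(df["Income"], weights=df["Weight"])

# Both routes give the same mean income.
print(len(expanded), expanded["Income"].mean(), weighted_mean)
```

The second route avoids the large intermediate frame, at the cost of having to pass the weights to each calculation.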

How to plot a pandas dataframe?

Hi, I am new to Python and am trying to plot a dataframe.
subject name marks
0 maths ankush 313
1 maths anvesh 474
2 maths amruth 264
3 science ankush 81
4 socail ankush 4
5 maths anirudh 16470
6 science anvesh 568
7 socail anvesh 5
8 science amruth 15
I am looking to plot a bar graph something like the one shown in the figure.
Thank you for your help.
The problem is two-fold.
What format does data need to be in to produce bar chart?
How to get data into that format?
For the chart you want, the names that go on the x-axis need to be in the index of the dataframe, with the subjects as columns.
This requires a pivot:
df.set_index(['name', 'subject']).marks.unstack(fill_value=0)
subject  maths  science  socail
name
amruth     264       15       0
anirudh   1647        0       0
ankush     313       81       4
anvesh     474      568       5
And the subsequent plot
df.set_index(['name', 'subject']).marks.unstack(fill_value=0).plot.bar()
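A self-contained sketch of the reshaping step, with the rows copied from the question's table. The pivot_table call is an alternative added here for comparison, not part of the answer above; it would also cope with duplicate name/subject pairs by summing them.

```python
import pandas as pd

# Rows copied from the question (including its spelling of "socail").
df = pd.DataFrame(
    {
        "subject": ["maths", "maths", "maths", "science", "socail",
                    "maths", "science", "socail", "science"],
        "name": ["ankush", "anvesh", "amruth", "ankush", "ankush",
                 "anirudh", "anvesh", "anvesh", "amruth"],
        "marks": [313, 474, 264, 81, 4, 16470, 568, 5, 15],
    }
)

# The answer's approach: names in the index, subjects as columns, gaps as 0.
wide = df.set_index(["name", "subject"]).marks.unstack(fill_value=0)

# Alternative sketch: pivot_table aggregates duplicates instead of raising.
wide_alt = df.pivot_table(
    index="name", columns="subject", values="marks", aggfunc="sum", fill_value=0
)

print(wide)

# wide.plot.bar()  # needs matplotlib; one group of bars per name
```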
The above is a very good answer. However since you are new to python, pandas, & matplotlib, I thought I would share a blog post I have found really good in showing the basics of matplotlib and how it is combined with pandas.
http://pbpython.com/effective-matplotlib.html?utm_content=buffer76b10&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
I hope you find it useful

pandas - percentage of matr

Afternoon,
I am trying to recreate a table, but replacing the raw numbers with each value's percentage of its column total. For instance, I have:
Code 03/31/2016 12/31/2015 09/30/2015
F55 425 387 369
F554 109 106 106
F508 105 105 106
The desired output is a new dataframe in which each number is replaced by its percentage of the column total (for 03/31/2016 the total is 425+109+105):
Code 03/31/2016 12/31/2015 09/30/2015
F55 66.5% 64.7% 63.5%
F554 17% 17.7% 18.2%
F508 16.4% 17.5% 18.2%
thanks for your help
I'm sure there's a more elegant answer somewhere but this will work:
df['03/31/2016'].apply(lambda x : x/df['03/31/2016'].sum())
or if you want to do this for the entire dataframe:
df.apply(lambda x : x/x.sum(), axis=0)
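A vectorized variant of the same idea, sketched on the table from the question. It assumes Code is moved into the index so that only numeric columns are divided; the rounding to one decimal and the "%" suffix are formatting choices made for this example.

```python
import pandas as pd

# Table copied from the question.
df = pd.DataFrame(
    {
        "Code": ["F55", "F554", "F508"],
        "03/31/2016": [425, 109, 105],
        "12/31/2015": [387, 106, 105],
        "09/30/2015": [369, 106, 106],
    }
).set_index("Code")

# Divide every column by its own total; axis=1 aligns the totals with the columns.
pct = df.div(df.sum(axis=0), axis=1) * 100

# Optional: a display version with one decimal and a percent sign (strings, not numbers).
formatted = pct.round(1).astype(str) + "%"

print(pct)
print(formatted)
```

Keeping pct numeric and formatting only for display means the percentages can still be summed, sorted or plotted.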
