My data (in a pandas DataFrame) looks like this:
a = pd.DataFrame({"age" : [25, 25, 25, 26, 26, 25, 25, 27, 27, 25, 26, 26, 25, 25],
"category" : [1, 2, 2, 3, 1, 3, 1, 1, 2, 3, 2, 2, 1, 2]})
I want a stacked bar plot: each bar should show how many rows are in an age group, and what percentage within that age group belongs to each category. The code
a.pivot_table('category', 'age').plot(kind='bar')
returns only the mean of the category instead of stacking it the way I want it.
Perhaps you mean to do something like the following... You need to change the aggregating function in pivot_table (which defaults to 'mean') and then use stacked=True when you plot:
table = a.pivot_table(index='age', columns='category', aggfunc='size')
table.plot(kind='bar', stacked=True)
This yields a stacked bar chart of counts.
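The question also asks for the percentage of each category within an age group, which means normalizing the counts row-wise before plotting. A minimal sketch, assuming `pd.crosstab` as an alternative way to get the counts (the names `counts`, `shares`, and `plot_shares` are illustrative and not from the original post):

```python
import pandas as pd

a = pd.DataFrame({"age": [25, 25, 25, 26, 26, 25, 25, 27, 27, 25, 26, 26, 25, 25],
                  "category": [1, 2, 2, 3, 1, 3, 1, 1, 2, 3, 2, 2, 1, 2]})

# One row per age, one column per category, cells hold row counts.
counts = pd.crosstab(a["age"], a["category"])

# Divide each row by its total so every age group sums to 100 percent.
shares = counts.div(counts.sum(axis=1), axis=0) * 100


def plot_shares(table):
    """Draw one stacked bar per age group (requires matplotlib)."""
    ax = table.plot(kind="bar", stacked=True)
    ax.set_ylabel("share of age group (%)")
    return ax
```

Calling `plot_shares(shares)` draws bars of equal height split by category; calling `plot_shares(counts)` keeps the bar height equal to the group size instead.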
I'm creating a function to filter many dataframes using groupby. The dataframes look like the one below, although they do not all contain the same number of columns.
df = pd.DataFrame({
'xyz CODE': [1,2,3,3,4, 5,6,7,7,8],
'a': [4, 5, 3, 1, 2, 20, 10, 40, 50, 30],
'b': [20, 10, 40, 50, 30, 4, 5, 3, 1, 2],
'c': [25, 20, 5, 15, 10, 25, 20, 5, 15, 10] })
For each dataframe I always apply groupby to the first column, which is named differently across dataframes. All other columns are named consistently across all dataframes.
My question: Is it possible to run groupby using a combination of column position and column names? How can I do it?
I wrote the following function and got the error TypeError: unhashable type: 'list'
def filter_all_df(df):
    df['max_c'] = df.groupby(df.columns[0])['a'].transform('max')
    newdf = df[df['a'] == df['max_c']].drop(['max_c'], axis=1)
    newdf['max_score'] = newdf.groupby([newdf.columns[0],'a','b'])['c'].transform('max')
    newdf = newdf[newdf['c'] == newdf['max_score']]
    newdf = newdf.sort_values([newdf.columns[0]]).drop_duplicates([newdf.columns[0], 'a','b', 'c'], keep='last')
    newdf.to_csv('newdf_all.csv')
    return newdf
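One way to mix column position and column names is to look up the first column's label once and then use it like any other name. The following is only a sketch under that assumption: it does not diagnose where the reported TypeError came from, it skips the CSV export, and the variable names `key`, `max_a`, and `max_c` are illustrative.

```python
import pandas as pd


def filter_all_df(df):
    # df.columns[0] is a single label, so it can sit in a list next to
    # ordinary column names.
    key = df.columns[0]

    # Keep, per group of the first column, the rows with the largest 'a'.
    max_a = df.groupby(key)['a'].transform('max')
    newdf = df[df['a'] == max_a].copy()

    # Within (first column, 'a', 'b'), keep the rows with the largest 'c'.
    max_c = newdf.groupby([key, 'a', 'b'])['c'].transform('max')
    newdf = newdf[newdf['c'] == max_c]

    return newdf.sort_values(key).drop_duplicates([key, 'a', 'b', 'c'], keep='last')


df = pd.DataFrame({
    'xyz CODE': [1, 2, 3, 3, 4, 5, 6, 7, 7, 8],
    'a': [4, 5, 3, 1, 2, 20, 10, 40, 50, 30],
    'b': [20, 10, 40, 50, 30, 4, 5, 3, 1, 2],
    'c': [25, 20, 5, 15, 10, 25, 20, 5, 15, 10]})
result = filter_all_df(df)
```

Using temporary Series instead of helper columns avoids having to drop them afterwards, and `.copy()` keeps later assignments from touching a view of the input.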
I want to use 3 rows as a multi-level column header in a pandas DataFrame.
A sample DataFrame:
df = pd.DataFrame({'a':['foo_0', 'bar_0', 1, 2, 3], 'b':['foo_0', 'bar_0', 11, 12, 13],
'c':['foo_1', 'bar_1', 21, 22, 23], 'd':['foo_1', 'bar_1', 31, 32, 33]})
In the expected output, the part highlighted in yellow is the multi-level column header.
Thank you,
-Nilesh
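A possible approach, sketched under an assumption about the expected layout (which is not visible here): the existing column names form the top level and the first two data rows form the second and third levels. The names `levels` and `out` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo_0', 'bar_0', 1, 2, 3], 'b': ['foo_0', 'bar_0', 11, 12, 13],
                   'c': ['foo_1', 'bar_1', 21, 22, 23], 'd': ['foo_1', 'bar_1', 31, 32, 33]})

# Assumed level order: current names, then row 0, then row 1.
levels = [df.columns.tolist(), df.iloc[0].tolist(), df.iloc[1].tolist()]

# Drop the two header rows from the data and attach the new header.
out = df.iloc[2:].reset_index(drop=True)
out.columns = pd.MultiIndex.from_arrays(levels)
```

Reordering the entries of `levels` changes which row ends up as the top level.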
I have an array/list/pandas Series:
np.arange(15)
Out[11]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
What I want is:
[[0,1,2,3,4],
[1,2,3,4,5],
[2,3,4,5,6],
...
[10,11,12,13,14]]
That is, recurrently transpose this column into a 5-column matrix.
The reason is that I am doing feature engineering for a column of temperature data. I want to use the last 5 values as features and the next one as the target.
What's the most efficient way to do that? My data is large.
If the array is formatted like this:
arr = np.array([1,2,3,4,5,6,7,8,....])
You could try it like this:
recurr_transpose = np.array([arr[i:i+5] for i in range(len(arr)-4)])
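For large data, a list comprehension copies every window. A sketch of an alternative, assuming NumPy 1.20 or newer, uses `sliding_window_view`, which returns a read-only view instead of copies. The names `windows`, `X`, and `y` are illustrative, and the split into features and target follows the "last 5 values, then the next one" description in the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.arange(15)

# Every run of 5 consecutive values, as a view on arr (no data copied).
windows = sliding_window_view(arr, window_shape=5)

# Features: each window that still has a following value.
X = windows[:-1]
# Target: the value right after each window.
y = arr[5:]
```

Here `X[i]` holds `arr[i:i+5]` and `y[i]` is `arr[i+5]`.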
I would like to make a barchart in python using matplotlib pyplot. The data consists of an index, which is a datetime list, and a number corresponding to that datetime. I have various samples that all belong to the same day. However, when making the bar chart, it only shows the first samples corresponding to a certain datetime, instead of all of them. How can I make a barchart showing every entry?
The index has the following structure:
ind = [datetime.datetime(2017, 3, 1, 0, 0), datetime.datetime(2017, 3, 1, 0, 0),
datetime.datetime(2017, 3, 15, 0, 0), datetime.datetime(2017, 3, 15, 0, 0)]
and the values are just integers:
values = [10, 20, 30, 40]
So when plotting, it only shows the bars for 2017-3-1 with value 10 and 2017-3-15 with value 30. How can I make it show all of them?
You can group by the dates, sum the values, and then plot the bar chart from the resulting dataframe:
df = pd.DataFrame(data=values, index=ind)
df = df.groupby(df.index).sum()
df.plot(kind='bar')
If what you want is all values to appear in the plot, regardless of the date, skip the groupby step and plot the original dataframe directly:
df.plot(kind='bar')
Entries with duplicate dates will then be plotted as separate bars.
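The two options can be put side by side in a self-contained sketch using the sample data from the question. The names `raw`, `summed`, and `plot_bars`, and the column name `value`, are illustrative.

```python
import datetime

import pandas as pd

ind = [datetime.datetime(2017, 3, 1, 0, 0), datetime.datetime(2017, 3, 1, 0, 0),
       datetime.datetime(2017, 3, 15, 0, 0), datetime.datetime(2017, 3, 15, 0, 0)]
values = [10, 20, 30, 40]

# One row per sample; duplicate dates stay as separate rows.
raw = pd.DataFrame({"value": values}, index=ind)

# One row per date, with the samples of that date added up.
summed = raw.groupby(raw.index).sum()


def plot_bars(frame):
    """Draw one bar per row of the frame (requires matplotlib)."""
    return frame.plot(kind="bar")
```

`plot_bars(raw)` draws four bars, one per sample, while `plot_bars(summed)` draws two bars, one per date.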
I am trying to make a column graph where the y-axis is the mean grain size, the x-axis is the distance along the transect, and each series is a date and/or number value (it doesn't really matter).
I have been trying a few different methods in Excel 2010 but I cannot figure it out. My hope is that, let's say, at the first location, 9, there will be three columns, and then at 12 there will be two columns. If it matters at all, let's say the total distance is 50. The result of this data should have 7 sets of columns along the transect/x-axis.
I have tried to do this using Python but my coding knowledge is close to nil. Here is my code so far:
import numpy as np
import matplotlib.pyplot as plt
grainsize = [0.7912, 0.513, 0.4644, 1.0852, 1.8515, 1.812, 6.371, 1.602, 1.0251, 5.6884, 0.4166, 24.8669, 0.5223, 37.387, 0.5159, 0.6727]
series = [2, 3, 4, 1, 4, 2, 3, 4, 1, 4, 1, 4, 1, 4, 1, 4]
distance = [9, 9, 9, 12, 12, 15, 15, 15, 17, 17, 25, 25, 32.5, 32.5, 39.5, 39.5]
If someone happens to know of code to use, it would be very helpful. A recommendation for how to do this in Excel would be awesome too.
There's a plotting library called seaborn, built on top of matplotlib, that does this in one line. Your example:
import numpy as np
import seaborn as sns
from matplotlib.pyplot import show
grainsize = [0.7912, 0.513, 0.4644, 1.0852, 1.8515, 1.812, 6.371,
1.602, 1.0251, 5.6884, 0.4166, 24.8669, 0.5223, 37.387, 0.5159, 0.6727]
series = [2, 3, 4, 1, 4, 2, 3, 4, 1, 4, 1, 4, 1, 4, 1, 4]
distance = [9, 9, 9, 12, 12, 15, 15, 15, 17, 17, 25, 25, 32.5, 32.5, 39.5, 39.5]
ax = sns.barplot(x=distance, y=grainsize, hue=series, palette='muted')
ax.set_xlabel('distance')
ax.set_ylabel('grainsize')
show()
You will be able to do a lot even as a total newbie by editing the many examples in the seaborn gallery. Use them as training wheels: edit only one thing at a time and think about what changes.
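If seaborn is not available, a similar grouped chart can be sketched with pandas alone by reshaping the data so that each series becomes its own column. This is an alternative to the seaborn call, not part of it; the names `long`, `wide`, and `plot_grouped` are illustrative.

```python
import pandas as pd

grainsize = [0.7912, 0.513, 0.4644, 1.0852, 1.8515, 1.812, 6.371,
             1.602, 1.0251, 5.6884, 0.4166, 24.8669, 0.5223, 37.387, 0.5159, 0.6727]
series = [2, 3, 4, 1, 4, 2, 3, 4, 1, 4, 1, 4, 1, 4, 1, 4]
distance = [9, 9, 9, 12, 12, 15, 15, 15, 17, 17, 25, 25, 32.5, 32.5, 39.5, 39.5]

long = pd.DataFrame({"distance": distance, "series": series, "grainsize": grainsize})

# One row per distance, one column per series; missing combinations become NaN.
wide = long.pivot(index="distance", columns="series", values="grainsize")


def plot_grouped(table):
    """Draw one group of bars per distance (requires matplotlib)."""
    ax = table.plot(kind="bar")
    ax.set_xlabel("distance")
    ax.set_ylabel("grainsize")
    return ax
```

`plot_grouped(wide)` draws 7 groups of bars, leaving a gap wherever a series has no sample at that distance.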