I would like to create a stacked bar plot from the following dataframe:
VALUE COUNT RECL_LCC RECL_PI
0 1 15686114 3 1
1 2 27537963 1 1
2 3 23448904 1 2
3 4 1213184 1 3
4 5 14185448 3 2
5 6 13064600 3 3
6 7 27043180 2 2
7 8 11732405 2 1
8 9 14773871 2 3
There would be 2 bars in the plot: one for RECL_LCC and the other for RECL_PI. Each bar would have 3 sections corresponding to the unique values in RECL_LCC and RECL_PI (i.e. 1, 2, 3), with COUNT summed up for each section. So far, I have something like this:
df = df.convert_objects(convert_numeric=True)
sub_df = df.groupby(['RECL_LCC','RECL_PI'])['COUNT'].sum().unstack()
sub_df.plot(kind='bar',stacked=True)
However, I get this plot:
Any idea on how to obtain 2 columns (RECL_LCC and RECL_PI) instead of these 3?
Your problem was that the dtypes were not numeric, so no aggregation function would work on them as strings. You can convert each offending column like so:
df['col'] = df['col'].astype(int)
or just call convert_objects on the df:
df.convert_objects(convert_numeric=True)
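Note that convert_objects was deprecated and later removed from pandas; pd.to_numeric is the modern replacement. A minimal sketch of the whole question (data re-typed from the table above, with COUNT starting out as strings) that produces the two-bar stacked plot by aggregating each column separately:

```python
import pandas as pd

# data re-typed from the question; COUNT starts out as strings
df = pd.DataFrame({
    "COUNT": ["15686114", "27537963", "23448904", "1213184", "14185448",
              "13064600", "27043180", "11732405", "14773871"],
    "RECL_LCC": [3, 1, 1, 1, 3, 3, 2, 2, 2],
    "RECL_PI":  [1, 1, 2, 3, 2, 3, 2, 1, 3],
})
df["COUNT"] = pd.to_numeric(df["COUNT"])  # modern replacement for convert_objects

# one summed Series per column, stacked side by side: index is 1/2/3,
# columns are the two grouping variables
sub = pd.DataFrame({
    "RECL_LCC": df.groupby("RECL_LCC")["COUNT"].sum(),
    "RECL_PI": df.groupby("RECL_PI")["COUNT"].sum(),
})
# sub.T.plot(kind="bar", stacked=True)  # two bars, three stacked sections each
```

Transposing before plotting puts the two grouping variables on the x axis, so each column of sub becomes one stacked segment.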
I have a Pandas dataframe containing some columns, each with different values (see the image).
In col1 the value 1 is more frequent than the others, so I need to transform this column so it holds only the values 1 and "more than 1".
How can I do that?
My goal here is to turn this column into a categorical column, but I have no idea how to do that.
The expected output is something like the next image:
Try the clip function on the column:
df["col1"].clip(upper=2)
0 1
1 2
2 2
3 2
4 1
5 2
6 2
7 1
8 1
9 1
10 1
11 2
12 1
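A runnable sketch of the same idea, using hypothetical col1 values chosen to match the output shown above, with an optional final step that turns the clipped values into labelled categories (since the goal was a categorical column):

```python
import pandas as pd

# hypothetical col1 values consistent with the output shown above
s = pd.Series([1, 3, 2, 5, 1, 2, 4, 1, 1, 1, 1, 2, 1], name="col1")
clipped = s.clip(upper=2)  # every value above 1 collapses to 2

# optional: label the two buckets and make the dtype categorical
cat = clipped.map({1: "1", 2: ">1"}).astype("category")
```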
Say I have a dataframe df and group it by a few columns into dfg, taking the median of one of its columns. How could I then take those median values and expand them out into a new column of the original df, associated with the respective conditions? This will create duplicates, but I will use this column for a subsequent calculation, and having the medians in a column makes that possible.
Example data:
import numpy as np
import pandas as pd
data = {'idx':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
'condition1':[1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4],
'condition2':[1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2],
'values':np.random.normal(0,1,16)}
df = pd.DataFrame(data)
dfg = df.groupby(['idx', 'condition2'], as_index=False)['values'].median()
example of desired result (note duplicates corresponding to correct conditions):
idx condition1 condition2 values medians
0 1 1 1 0.35031 0.656355
1 1 1 2 -0.291736 -0.024304
2 1 2 1 1.593545 0.656355
3 1 2 2 -1.275154 -0.024304
4 1 3 1 0.075259 0.656355
5 1 3 2 1.054481 -0.024304
6 1 4 1 0.9624 0.656355
7 1 4 2 0.243128 -0.024304
8 2 1 1 1.717391 1.155406
9 2 1 2 0.788847 1.006583
10 2 2 1 1.145891 1.155406
11 2 2 2 -0.492063 1.006583
12 2 3 1 -0.157029 1.155406
13 2 3 2 1.224319 1.006583
14 2 4 1 1.164921 1.155406
15 2 4 2 2.042239 1.006583
I believe you need GroupBy.transform with 'median' for the new column; transform returns a result aligned to the original index:
df['medians'] = df.groupby(['idx', 'condition2'])['values'].transform('median')
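A self-contained check of that one-liner (seeded random data, so the values differ from the example above): because transform broadcasts each group's median back onto that group's rows, the result can be assigned straight back as a column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
df = pd.DataFrame({
    "idx": [1] * 8 + [2] * 8,
    "condition1": [1, 1, 2, 2, 3, 3, 4, 4] * 2,
    "condition2": [1, 2] * 8,
    "values": rng.normal(0, 1, 16),
})

# transform broadcasts each group's median back onto that group's rows
df["medians"] = df.groupby(["idx", "condition2"])["values"].transform("median")
```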
I'm trying to pivot data in a way so that the index and columns of the resulting table aren't automatically sorted. An example of the data might be:
X Y Z
1 1 1
3 1 2
2 1 3
4 1 4
1 2 5
3 2 6
2 2 7
4 2 8
The data is interpreted as an X, Y and Z axis. The pivoted result should look like this:
X 1 3 2 4
Y
1 1 2 3 4
2 5 6 7 8
Instead, the result looks like this, where the index and columns are sorted and the data rearranged accordingly:
X 1 2 3 4
Y
1 1 3 2 4
2 5 7 6 8
At this point I have lost the information about the order in which the measurements were taken. For example, say I plot the row at Y=1, with X on the x axis and the data value on the y axis.
This would result in the figures in this picture; on the right is how I would like the data to be plotted. Does anyone have an idea how to prevent pandas from sorting the index and columns when pivoting a table?
One alternative is to restore the ordering afterwards. In this particular data, the Z values in the first Y row happen to follow the original X order, so you can restore the X column ordering like this:
import pandas as pd
# using your sample data
df = pd.read_clipboard()
# keyword arguments to pivot are required in pandas 2.0+
df = df.pivot(index='Y', columns='X', values='Z')
df
X 1 2 3 4
Y
1 1 3 2 4
2 5 7 6 8
# re-order your X columns by the values of first Y, for instance
df = df[df.T[1].values]
df
X 1 3 2 4
Y
1 1 2 3 4
2 5 6 7 8
Not the most robust approach, since it relies on the Z values of the first row matching the desired order, but it achieves what you want here.
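A more general alternative is to remember the first-appearance order of the X values before pivoting and reindex the columns with it afterwards. A sketch with the sample data typed in directly:

```python
import pandas as pd

df = pd.DataFrame({
    "X": [1, 3, 2, 4, 1, 3, 2, 4],
    "Y": [1, 1, 1, 1, 2, 2, 2, 2],
    "Z": [1, 2, 3, 4, 5, 6, 7, 8],
})

order = df["X"].unique()  # order of first appearance: [1, 3, 2, 4]
pivoted = df.pivot(index="Y", columns="X", values="Z").reindex(columns=order)
```

Unlike selecting columns by the Z values of one row, this works regardless of what the data in that row happens to be.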
I can transform the following dataframe:
VALUE COUNT RECL_LCC RECL_PI
0 1 15,686,114 3 1
1 2 27,537,963 1 1
2 3 23,448,904 1 2
3 4 1,213,184 1 3
4 5 14,185,448 3 2
5 6 13,064,600 3 3
6 7 27,043,180 2 2
7 8 11,732,405 2 1
8 9 14,773,871 2 3
into something like this:
RECL_PI 1 2 3
RECL_LCC
1 27,537,963 23,448,904 1,213,184
2 11,732,405 27,043,180 14,773,871
3 15,686,114 14,185,448 13,064,600
by using pandas pivot table:
plot_table = LCC_PI_df.pivot_table(index=['RECL_LCC'], columns='RECL_PI', values='COUNT', aggfunc='sum')
Is there a quick way to create the pivot table with percentage of row totals instead of raw sum of counts?
Following the comments, I think you can do it like the following. Note that I converted the COUNT column to integers first:
#convert strings of the COUNT column to integers
import locale
locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' )
LCC_PI_df.COUNT = LCC_PI_df.COUNT.apply(locale.atoi)
plot_table = LCC_PI_df.pivot_table(index=['RECL_LCC'], columns='RECL_PI', values='COUNT', aggfunc='sum')
#Calculate percentages
plot_table = plot_table.apply(lambda x : x / x.sum(), axis=1)
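A self-contained version (data re-typed from the question) that sidesteps the locale call by stripping the thousands separators with str.replace, then divides each row by its total using div:

```python
import pandas as pd

LCC_PI_df = pd.DataFrame({
    "COUNT": ["15,686,114", "27,537,963", "23,448,904", "1,213,184",
              "14,185,448", "13,064,600", "27,043,180", "11,732,405",
              "14,773,871"],
    "RECL_LCC": [3, 1, 1, 1, 3, 3, 2, 2, 2],
    "RECL_PI":  [1, 1, 2, 3, 2, 3, 2, 1, 3],
})

# strip thousands separators instead of relying on the locale module
LCC_PI_df["COUNT"] = LCC_PI_df["COUNT"].str.replace(",", "").astype(int)

plot_table = LCC_PI_df.pivot_table(index="RECL_LCC", columns="RECL_PI",
                                   values="COUNT", aggfunc="sum")
pct_table = plot_table.div(plot_table.sum(axis=1), axis=0)  # row percentages
```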
I have a pandas dataframe with two columns: participant names and reaction times (note that one participant has more RT measurements than the other).
ID RT
0 foo 1
1 foo 2
2 bar 3
3 bar 4
4 foo 1
5 foo 2
6 bar 3
7 bar 4
8 bar 4
I would like to get a 2d array from this where every row contains the reaction times for one participant.
[[1,2,1,2]
[3,4,3,4,4]]
In case it's not possible to have a ragged shape like that, the following options for obtaining a rectangular a x b shape would be acceptable to me: fill missing elements with NaN; truncate the longer rows to the size of the shorter rows; or fill the shorter rows with repeats of their mean value.
I would go for whatever is easiest to implement.
I have tried to sort this out by using groupby, and I expected it to be very easy to do this but it all gets terribly terribly messy :(
import pandas as pd
import io
# StringIO (not BytesIO) for a str literal on Python 3
data = io.StringIO(""" ID RT
0 foo 1
1 foo 2
2 bar 3
3 bar 4
4 foo 1
5 foo 2
6 bar 3
7 bar 4
8 bar 4""")
df = pd.read_csv(data, sep=r'\s+')
df.groupby("ID").RT.apply(pd.Series.reset_index, drop=True).unstack()
output:
0 1 2 3 4
ID
bar 3 4 3 4 4
foo 1 2 1 2 NaN
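If a ragged list of lists (the first option asked for) is acceptable, apply(list) on the grouped column gets there directly; sort=False keeps the participants in order of first appearance rather than alphabetical order:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["foo", "foo", "bar", "bar", "foo", "foo", "bar", "bar", "bar"],
    "RT": [1, 2, 3, 4, 1, 2, 3, 4, 4],
})

# one Python list per participant, rows in first-appearance order
rows = df.groupby("ID", sort=False)["RT"].apply(list).tolist()
# rows == [[1, 2, 1, 2], [3, 4, 3, 4, 4]]
```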