After slicing, I have a DataFrame with a two-level column header (MultiIndex), indexed by date, obtained like this:
df = df.iloc[:, df.columns.get_level_values(1).isin({'a','b'})]
Date  one     two
      a   b   a   b
2     2   3   3   3
3     2   3   3   3
4     2   3   3   3
5     2   3   3   3
6     2   3   3   3
7     2   3   3   3
What I would like to do is draw this DataFrame as a line plot with Date on the x axis, using the same color for every column that shares a level 0 label and solid/dashed lines to distinguish the level 1 labels.
I have tried unstacking, i.e.
df.unstack(level=0).plot(kind='line')
but with no success. The plot as it is now shows Date on the x axis but treats each combination of level 0 and level 1 headers as a separate, unrelated series.
Here is a picture of the plot obtained:
What I would like to end up with is a two-level legend (color for level 0, line style for level 1).
Code Example:
import numpy as np
import pandas as pd
A = np.random.rand(4,4)
C = pd.DataFrame(A, index=range(4), columns=[np.array(['A','A','B','B']), np.array(['a','b','a','b'])])
C.plot(kind='line')
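One possible way to get that color/line-style split is to skip DataFrame.plot and draw each column yourself with matplotlib. This is only a sketch: the two color names and the two line styles are arbitrary choices of mine, and the mapping assumes there are no more level 0 or level 1 labels than entries in those lists.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
C = pd.DataFrame(
    rng.random((4, 4)),
    index=range(4),
    columns=pd.MultiIndex.from_product([["A", "B"], ["a", "b"]]),
)

# One color per level 0 label, one line style per level 1 label (arbitrary picks).
colors = dict(zip(C.columns.get_level_values(0).unique(), ["tab:blue", "tab:orange"]))
styles = dict(zip(C.columns.get_level_values(1).unique(), ["-", "--"]))

fig, ax = plt.subplots()
for top, sub in C.columns:
    ax.plot(C.index, C[(top, sub)],
            color=colors[top], linestyle=styles[sub], label=f"{top} / {sub}")
ax.set_xlabel("Date")
ax.legend()
```

The legend here is still flat (one entry per column); a true two-level legend would need separate proxy handles for colors and line styles.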
So I have a DataFrame with (amongst others) four columns with numerical values. I want to add a column to the DataFrame that holds the maximum of two sums, each obtained by summing a pair of those columns.
My solution so far is
from pandas import DataFrame
df = DataFrame(data={'text': ['a','b','c'], 'a':[1,2,3],'b':[2,3,4],'c':[5,4,2],'d':[-2,4,1]})
df['sum1'] = df['a'].add(df['b'])
df['sum2'] = df['c'].add(df['d'])
df['maxsum'] = df[['sum1','sum2']].max(axis=1)
which gives the desired result.
I am pretty sure, there is a more concise way to do this...
There is nothing wrong with your approach. In fact, it is the approach I would take, if for no other reason than that it is easy to read and to figure out what you are doing. But if you are looking for another solution, here is one using numpy.ufunc.reduceat
import pandas as pd
import numpy as np
# sample frame
df = pd.DataFrame(data={'text': ['a','b','c'], 'a':[1,2,3],'b':[2,3,4],'c':[5,4,2],'d':[-2,4,1]})
# we skip the first column and convert to an array - df[df.columns[1:]].values
# we specify the indices at which each slice starts - np.arange(len(df.columns[1:]))[::2]
# then find the max of the resulting pairwise sums
df['max'] = np.max(np.add.reduceat(df[df.columns[1:]].values,
                                   np.arange(len(df.columns[1:]))[::2],
                                   axis=1),
                   axis=1)
text a b c d max
0 a 1 2 5 -2 3
1 b 2 3 4 4 8
2 c 3 4 2 1 7
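To see what reduceat is doing in the answer above, here is the same computation on a bare array holding the a, b, c, d values from the sample frame. The variable names are mine; each start index opens a slice that runs up to the next start index (or to the end of the row).

```python
import numpy as np

# columns a, b, c, d from the sample frame
values = np.array([[1, 2, 5, -2],
                   [2, 3, 4, 4],
                   [3, 4, 2, 1]])

# start of each pair of columns: 0 and 2
starts = np.arange(values.shape[1])[::2]

# sums columns 0:2 and 2:4 within each row
pair_sums = np.add.reduceat(values, starts, axis=1)

# larger of the two sums per row
row_max = pair_sums.max(axis=1)
```

Note that this relies on the columns to be summed sitting next to each other in pairs.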
Not that it is much more concise, but instead of your current approach you can use a one-shot assignment:
df = df.assign(sum1=df[['a', 'b']].sum(1), sum2=df[['c', 'd']].sum(1),
maxsum=lambda df: df[['sum1','sum2']].max(1))
text a b c d sum1 sum2 maxsum
0 a 1 2 5 -2 3 3 3
1 b 2 3 4 4 5 8 8
2 c 3 4 2 1 7 3 7
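If the intermediate sum columns are not needed, another option (a sketch, not taken from the answers above) is np.maximum, which takes the element-wise maximum of the two sums directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'text': ['a', 'b', 'c'], 'a': [1, 2, 3], 'b': [2, 3, 4],
                        'c': [5, 4, 2], 'd': [-2, 4, 1]})

# element-wise maximum of the two row sums, without storing sum1/sum2
df['maxsum'] = np.maximum(df['a'] + df['b'], df['c'] + df['d'])
```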
I am trying to use ggplot to plot this DataFrame:
import pandas as pd
from ggplot import *
df = pd.DataFrame()
df['x'] = [1,2,3,4,5,6]
df['y'] = [1,6,7,2,3,6]
df['id'] = ['a','a','b','b','c','c']
I get the output
x y id
0 1 1 a
1 2 6 a
2 3 7 b
3 4 2 b
4 5 3 c
5 6 6 c
I wish to plot 3 segments with different colors distinguished by 'id'.
ggplot(df,aes(x='x',y='y',colour='id')) + geom_line()
The output contains only the first segment 'a'
What is wrong with my code?
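This does not explain what ggplot is doing, but as a fallback the three segments can be drawn with plain matplotlib by plotting each 'id' group separately. It is a sketch that swaps ggplot for matplotlib and leaves the colors to matplotlib's default cycle.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6],
    'y': [1, 6, 7, 2, 3, 6],
    'id': ['a', 'a', 'b', 'b', 'c', 'c'],
})

fig, ax = plt.subplots()
# one line per id; each call picks the next color in the default cycle
for key, group in df.groupby('id'):
    ax.plot(group['x'], group['y'], label=key)
ax.legend(title='id')
```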
I'm trying to pivot data in a way so that the index and columns of the resulting table aren't automatically sorted. An example of the data might be:
X Y Z
1 1 1
3 1 2
2 1 3
4 1 4
1 2 5
3 2 6
2 2 7
4 2 8
The data is interpreted as an X, Y and Z axis. The pivoted result should look like this:
X 1 3 2 4
Y
1 1 2 3 4
2 5 6 7 8
Instead the result looks like this, where the index and columns are sorted, and the data accordingly:
X 1 2 3 4
Y
1 1 3 2 4
2 5 7 6 8
At this point I have lost the information about the order in which the measurements were taken. For example, say that I want to plot the row at Y=1, with X on the x axis and the data value on the y axis.
This would result in the figures in this picture. On the right is how I would like the data to be plotted. Does anyone have an idea how to prevent pandas from sorting the index and columns when pivoting a table?
An alternative is to restore the order after pivoting. Since the ordering is based on the X values relative to the Y values, you can restore the order of the X columns with something like this:
import pandas as pd
# using your sample data
df = pd.read_clipboard()
df = df.pivot(index='Y', columns='X', values='Z')
df
X 1 2 3 4
Y
1 1 3 2 4
2 5 7 6 8
# re-order your X columns by the values in the row Y=1 (this works here only
# because those Z values happen to match the X labels in the desired order)
df = df[df.T[1].values]
df
X 1 3 2 4
Y
1 1 2 3 4
2 5 6 7 8
Not the best approach, but it will achieve what you want for this data.
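A variant that does not depend on the Z values at all is to pivot as usual and then reindex with the labels in their order of first appearance. This is a sketch assuming the original long-format frame is still available; the name `wide` is mine.

```python
import pandas as pd

# long-format sample data from the question
df = pd.DataFrame({
    'X': [1, 3, 2, 4, 1, 3, 2, 4],
    'Y': [1, 1, 1, 1, 2, 2, 2, 2],
    'Z': [1, 2, 3, 4, 5, 6, 7, 8],
})

# unique() keeps order of first appearance, so reindexing undoes the sort
wide = (
    df.pivot(index='Y', columns='X', values='Z')
      .reindex(index=df['Y'].unique(), columns=df['X'].unique())
)
```

Here `wide` has its columns in the order 1, 3, 2, 4, matching the order in which the measurements were taken.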
I would like to create a stacked bar plot from the following dataframe:
VALUE COUNT RECL_LCC RECL_PI
0 1 15686114 3 1
1 2 27537963 1 1
2 3 23448904 1 2
3 4 1213184 1 3
4 5 14185448 3 2
5 6 13064600 3 3
6 7 27043180 2 2
7 8 11732405 2 1
8 9 14773871 2 3
There would be 2 bars in the plot, one for RECL_LCC and the other for RECL_PI. Each bar would have 3 sections corresponding to the unique values in RECL_LCC and RECL_PI, i.e. 1, 2, 3, and each section would show the summed COUNT for that value. So far, I have something like this:
df = df.convert_objects(convert_numeric=True)
sub_df = df.groupby(['RECL_LCC','RECL_PI'])['COUNT'].sum().unstack()
sub_df.plot(kind='bar',stacked=True)
However, I get this plot:
Any idea on how to obtain 2 columns (RECL_LCC and RECL_PI) instead of these 3?
Your problem is that the dtypes are not numeric: the values are strings, so no aggregation function will work on them. You can convert each offending column like so:
df['col'] = df['col'].astype(int)
or convert every column at once (convert_objects has been removed from recent pandas versions, so pd.to_numeric is used here instead):
df = df.apply(pd.to_numeric, errors='coerce')
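Once the columns are numeric, one way to get exactly two bars is to sum COUNT per class separately for each of the two columns and put those totals side by side. This is a sketch; the name `totals` is mine, and it assumes the sample data from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd

df = pd.DataFrame({
    'VALUE': range(1, 10),
    'COUNT': [15686114, 27537963, 23448904, 1213184, 14185448,
              13064600, 27043180, 11732405, 14773871],
    'RECL_LCC': [3, 1, 1, 1, 3, 3, 2, 2, 2],
    'RECL_PI': [1, 1, 2, 3, 2, 3, 2, 1, 3],
})

# rows: class values 1..3, columns: RECL_LCC and RECL_PI
totals = pd.DataFrame({
    col: df.groupby(col)['COUNT'].sum() for col in ['RECL_LCC', 'RECL_PI']
})

# transpose so that each of the two columns becomes one stacked bar
ax = totals.T.plot(kind='bar', stacked=True)
```

Both bars end up the same total height, since each one splits the same overall COUNT in a different way.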
I have a pandas DataFrame with two columns: participant names and reaction times (note that one participant has more RT measurements than the other).
ID RT
0 foo 1
1 foo 2
2 bar 3
3 bar 4
4 foo 1
5 foo 2
6 bar 3
7 bar 4
8 bar 4
I would like to get a 2d array from this where every row contains the reaction times for one participant.
[[1,2,1,2]
[3,4,3,4,4]]
In case it's not possible to have a shape like that, the following options for obtaining a good a x b shape would be acceptable for me: fill missing elements with NaN; truncate the longer rows to the size of the shorter rows; fill the shorter rows with repeats of their mean value.
I would go for whatever is easiest to implement.
I have tried to sort this out by using groupby, and I expected it to be very easy to do this but it all gets terribly terribly messy :(
import pandas as pd
import io
data = io.StringIO(""" ID RT
0 foo 1
1 foo 2
2 bar 3
3 bar 4
4 foo 1
5 foo 2
6 bar 3
7 bar 4
8 bar 4""")
df = pd.read_csv(data, sep=r"\s+")
df.groupby("ID").RT.apply(pd.Series.reset_index, drop=True).unstack()
output:
0 1 2 3 4
ID
bar 3 4 3 4 4
foo 1 2 1 2 NaN
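To go from that table to an actual 2d array, here is an equivalent sketch that numbers the measurements within each participant and pivots on that number. The column name `trial` and the variable names are mine; shorter rows are padded with NaN, which also turns the array into floats.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['foo', 'foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'bar', 'bar'],
    'RT': [1, 2, 3, 4, 1, 2, 3, 4, 4],
})

# running measurement number within each participant: 0, 1, 2, ...
wide = (
    df.assign(trial=df.groupby('ID').cumcount())
      .pivot(index='ID', columns='trial', values='RT')
)

# one row per participant, in the order given by wide.index
arr = wide.to_numpy()
```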