I have a pandas dataframe with two columns: participant names and reaction times (note that one participant has more measurements of his RT).
ID RT
0 foo 1
1 foo 2
2 bar 3
3 bar 4
4 foo 1
5 foo 2
6 bar 3
7 bar 4
8 bar 4
I would like to get a 2d array from this where every row contains the reaction times for one participant.
[[1,2,1,2]
[3,4,3,4,4]]
In case it's not possible to have a shape like that, the following options for obtaining a good a x b shape would be acceptable for me: fill missing elements with NaN; truncate the longer rows to the size of the shorter rows; fill the shorter rows with repeats of their mean value.
I would go for whatever is easiest to implement.
I have tried to sort this out by using groupby, and I expected it to be very easy to do this but it all gets terribly terribly messy :(
import pandas as pd
import io
data = io.StringIO(""" ID RT
0 foo 1
1 foo 2
2 bar 3
3 bar 4
4 foo 1
5 foo 2
6 bar 3
7 bar 4
8 bar 4""")
df = pd.read_csv(data, sep=r"\s+")
df.groupby("ID").RT.apply(pd.Series.reset_index, drop=True).unstack()
output:
0 1 2 3 4
ID
bar 3 4 3 4 4
foo 1 2 1 2 NaN
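One possible way to get from here to a plain 2d array, sketched under a few assumptions: the names position, wide, padded and truncated are mine, and the rows come out in sorted ID order (bar before foo). The idea is to number each measurement within its participant with cumcount, pivot on that number, and then either keep the NaN padding or cut every row to the length of the shortest one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["foo", "foo", "bar", "bar", "foo", "foo", "bar", "bar", "bar"],
    "RT": [1, 2, 3, 4, 1, 2, 3, 4, 4],
})

# Number each measurement within its participant: 0, 1, 2, ...
position = df.groupby("ID").cumcount()

# One row per participant, one column per position; missing cells become NaN.
wide = df.assign(pos=position).pivot(index="ID", columns="pos", values="RT")

# Option 1: NaN-padded 2d array (rows follow wide.index, i.e. bar, foo).
padded = wide.to_numpy(dtype=float)

# Option 2: truncate every row to the length of the shortest participant.
shortest = df.groupby("ID").size().min()
truncated = padded[:, :shortest]
```

Padding with NaN needs a float array, which is why the dtype is forced to float here; wide.index tells you which row belongs to which participant.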
I am creating a dataframe like this.
import numpy as np
import pandas as pd
np.random.seed(2)
df=pd.DataFrame(np.random.randint(1,6,(6,6)))
output:
0 1 1 4 3 4 1
1 3 2 4 3 5 5
2 5 4 5 3 4 4
3 3 2 3 5 4 1
4 5 4 2 3 1 5
5 5 3 5 3 2 1
Splitting the dataframe into 3x3 matrices like below gives 16 matrices.
dfs=[]
for col in range(df.shape[1]-2):
    for row in range(df.shape[0]-2):
        dfs.append(df.iloc[row:row+3,col:col+3])
Let's print:
dfs[0]
1 1 4
3 2 4
5 4 5
dfs[1]
3 2 4
5 4 5
3 2 3
.
.
.
dfs[15]
5 4 1
3 1 5
3 2 1
I am writing a function to set the values at positions [1,0] and [1,2] of each matrix to zero,
so that my output will look like:
dfs[0]
1 1 4
0 2 0
5 4 5
def process(x):
    new=[]
    for d in x:
        d.iloc[1,0]=0
        d.iloc[1,2]=0
        new.append(d)
        print(d)
    return new
dfs=process(dfs.copy())
My expected output is:
dfs[0]
1 1 4
0 2 0
5 4 5
but what my function returns is:
dfs[0]
1 1 4
0 0 0
0 0 0
dfs[1]
0 0 0
0 0 0
0 0 0
It produces more zeros in every matrix. I don't know why it behaves unexpectedly or what I am doing wrong in my process function. Please help. Thanks.
Long story short, you are a victim of chained indexing, which can lead to bad things happening.
When you slice the original DataFrame, you get overlapping views.
Modifying one changes the others too, since the second row of one chunk is the first row of another, and the third row of the first chunk is the first row of yet another, and so on...which is why you see non-zero values only at the "edges", since those are unique to a single chunk.
You can make copies of each slice, like this:
def process(x):
    new = []
    for d in x:
        d = d.copy()  # each one is now a copy
        d.iloc[1, 0] = 0
        d.iloc[1, 2] = 0
        new.append(d)
    return new
Lastly, note that dfs = process(dfs) is actually fine; you don't need to make a copy of the enclosing list.
Change your loop and your call to process as follows to get the required output. I call copy inside the for loop so that each subset is independent of the original DataFrame. In your version every change is written through to the original df, which is why the other entries of the dfs list fill up with zeros:
for col in range(df.shape[1]-2):
    for row in range(df.shape[0]-2):
        dfs.append(df.iloc[row:row+3,col:col+3].copy())
dfs=process(dfs)
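As a quick check of the copy-based fix, here is a small self-contained sketch. It assumes a deterministic 6x6 frame built with np.arange rather than the random one from the question, and the list name windows is mine. Because every window is copied when it is sliced, writing zeros into one window cannot leak into the original frame or into an overlapping window.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame: values 1..36, row by row.
df = pd.DataFrame(np.arange(1, 37).reshape(6, 6))

# Copy each 3x3 window so it shares no memory with df or with its neighbours.
windows = [
    df.iloc[row:row + 3, col:col + 3].copy()
    for col in range(df.shape[1] - 2)
    for row in range(df.shape[0] - 2)
]

# Zero out positions [1, 0] and [1, 2] of every window.
for w in windows:
    w.iloc[1, 0] = 0
    w.iloc[1, 2] = 0

print(windows[0])
```

Only the two targeted cells of each window change, and df itself keeps its original values.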
I have a DataFrame with two columns "A" and "B".
A B
0 foo one
1 bar one
2 foo two
3 bar one
4 foo two
5 bar two
6 foo one
7 foo one
8 xyz one
For each group in "A", I'm trying to get the count of each value of "B", i.e. each sub-group of B, but aggregated on the grouping of "A".
The result should look like this:
A B countOne countTwo
0 foo one 3 2
1 bar one 2 1
2 foo two 3 2
3 bar one 2 1
4 foo two 3 2
5 bar two 2 1
6 foo one 3 2
7 foo one 3 2
8 xyz one 1 0
I have tried several approaches to no avail, so far I'm using this approach:
A_grouped = df.groupby(['A', 'B'])['A'].count()
A_grouped_ones = A_grouped[:,'one']
A_grouped_twos = A_grouped[:,'two']
df['countOne'] = df['A'].map(lambda a: A_grouped_ones[a] if a in A_grouped_ones else 0)
df['countTwo'] = df['A'].map(lambda a: A_grouped_twos[a] if a in A_grouped_twos else 0)
However, this seems horribly inefficient to me. Is there a better solution?
You can use unstack with add_prefix to build a new DataFrame and then join it to the original:
df1 = df.groupby(['A', 'B'])['A'].count().unstack(fill_value=0).add_prefix('count_')
print (df1)
B count_one count_two
A
bar 2 1
foo 3 2
xyz 1 0
df = df.join(df1, on='A')
print (df)
A B count_one count_two
0 foo one 3 2
1 bar one 2 1
2 foo two 3 2
3 bar one 2 1
4 foo two 3 2
5 bar two 2 1
6 foo one 3 2
7 foo one 3 2
8 xyz one 1 0
Another alternative is to use size:
df1 = df.groupby(['A', 'B']).size().unstack(fill_value=0).add_prefix('count_')
The difference is that size includes NaN values, while count does not - check this answer.
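To make the count versus size distinction concrete, here is a tiny sketch on made-up data (the column C and its values are hypothetical, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in group "foo".
df = pd.DataFrame({"A": ["foo", "foo", "bar"], "C": [1.0, np.nan, 2.0]})

counts = df.groupby("A")["C"].count()  # non-NaN values per group
sizes = df.groupby("A")["C"].size()    # rows per group, NaN included

print(counts)
print(sizes)
```

For group foo, count reports 1 while size reports 2. When the counted column has no missing values, as in the question, both give the same numbers.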
How do I draw a sample of rows (say, a random 10%, or alternatively every nth row) from within each group of a dataframe?
E.g., when grouping by 'name':
name a b
foo 1 1
foo 4 1
foo 3 3
bar 2 1
bar 3 7
bar 4 3
bar 1 2
I want to get something like:
name a b
foo 4 1
bar 3 7
bar 1 2
many thanks
You can use groupby to group by your name column and then apply sample to randomly get samples from the subgroups.
First, let's see the dummy data:
print(df)
name a b
0 foo 1 1
1 foo 4 1
2 foo 3 3
3 bar 2 1
4 bar 3 7
5 bar 4 3
6 bar 1 2
fraction sets the share of each group that is sampled at random. It is set to 0.5 here for your small dummy data set:
fraction = 0.5
result = df.groupby("name", group_keys=False).apply(lambda x: x.sample(frac=fraction))
print(result)
name a b
3 bar 2 1
6 bar 1 2
0 foo 1 1
2 foo 3 3
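Two variations, sketched under assumptions: the first relies on the sample method of the groupby object, which as far as I know is available from pandas 1.1 onwards, and the second covers the "every nth row" part of the question with cumcount. The names random_half and every_second, and the random_state value, are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["foo"] * 3 + ["bar"] * 4,
    "a": [1, 4, 3, 2, 3, 4, 1],
    "b": [1, 1, 3, 1, 7, 3, 2],
})

# Random sample within each group, without apply (assumes pandas >= 1.1).
random_half = df.groupby("name").sample(frac=0.5, random_state=0)

# Every 2nd row within each group: cumcount numbers rows 0, 1, 2, ... per group.
every_second = df[df.groupby("name").cumcount() % 2 == 1]

print(random_half)
print(every_second)
```

The cumcount variant is deterministic. On this data it keeps the rows with index 1, 4 and 6, which happen to be the rows shown in the question's desired output.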
After slicing, I have a multi header Dataframe with two levels, indexed by date, obtained like this:
df = df.iloc[:, df.columns.get_level_values(1).isin({'a','b'})]
Date one two
a b a b
2 2 3 3 3
3 2 3 3 3
4 2 3 3 3
5 2 3 3 3
6 2 3 3 3
7 2 3 3 3
What I would like to do is plot this Dataframe as a line plot with the Date on the x axis, one color per level 0 label, and solid/dashed lines to distinguish the level 1 labels.
I have tried unstacking, i.e.
df.unstack(level=0).plot(kind='line')
but with no success. The plot as it is now shows Date on the x axis but treats each combination of level 0 and level 1 headers as a separate entry.
Here is a picture of the plot obtained:
What we would like to implement is a two-level legend (color/shape of line).
Code Example:
import numpy as np
import pandas as pd
A = np.random.rand(4,4)
C = pd.DataFrame(A, index=range(4), columns=[np.array(['A','A','B','B']), np.array(['a','b','a','b'])])
C.plot(kind='line')
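One way this could be approached, as a rough sketch rather than a definitive answer: skip DataFrame.plot and draw each column yourself with matplotlib, looking up the color from the level 0 label and the line style from the level 1 label. The colors and styles dictionaries are hypothetical lookup tables of mine, and the Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
C = pd.DataFrame(
    rng.random((4, 4)),
    columns=pd.MultiIndex.from_product([["A", "B"], ["a", "b"]]),
)

# Hypothetical style tables: one color per level 0 label, one line style per level 1 label.
colors = {"A": "tab:blue", "B": "tab:orange"}
styles = {"a": "-", "b": "--"}

fig, ax = plt.subplots()
for top, sub in C.columns:
    ax.plot(C.index, C[(top, sub)], color=colors[top],
            linestyle=styles[sub], label=f"{top} / {sub}")
ax.legend()
```

This produces a single legend with combined "A / a" style labels. A true two-level legend (one block for colors, one for line styles) would need separate legend handles, which this sketch does not attempt.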
I would like to create a stacked bar plot from the following dataframe:
VALUE COUNT RECL_LCC RECL_PI
0 1 15686114 3 1
1 2 27537963 1 1
2 3 23448904 1 2
3 4 1213184 1 3
4 5 14185448 3 2
5 6 13064600 3 3
6 7 27043180 2 2
7 8 11732405 2 1
8 9 14773871 2 3
There would be 2 bars in the plot: one for RECL_LCC and the other for RECL_PI. Each bar would have 3 sections corresponding to the unique values in RECL_LCC and RECL_PI, i.e. 1, 2, 3, and each section would sum up the COUNT for that value. So far, I have something like this:
df = df.convert_objects(convert_numeric=True)
sub_df = df.groupby(['RECL_LCC','RECL_PI'])['COUNT'].sum().unstack()
sub_df.plot(kind='bar',stacked=True)
However, I get this plot:
Any idea on how to obtain 2 columns (RECL_LCC and RECL_PI) instead of these 3?
Your problem is that the dtypes are not numeric: the values are strings, so no aggregation function will work on them. You can convert each offending column like so:
df['col'] = df['col'].astype(int)
or convert every column at once (convert_objects was deprecated and then removed in pandas 1.0; pd.to_numeric is its replacement):
df = df.apply(pd.to_numeric, errors='coerce')
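For the two-bar layout itself, here is a possible sketch that assumes the columns are already numeric. Instead of grouping by both columns together, it sums COUNT per class for each classification column separately and stacks the two results as rows. The name totals is mine.

```python
import pandas as pd

df = pd.DataFrame({
    "VALUE": range(1, 10),
    "COUNT": [15686114, 27537963, 23448904, 1213184, 14185448,
              13064600, 27043180, 11732405, 14773871],
    "RECL_LCC": [3, 1, 1, 1, 3, 3, 2, 2, 2],
    "RECL_PI": [1, 1, 2, 3, 2, 3, 2, 1, 3],
})

# Sum COUNT per class for each classification column on its own,
# then transpose so that each classification column becomes one row.
totals = pd.DataFrame({
    col: df.groupby(col)["COUNT"].sum() for col in ["RECL_LCC", "RECL_PI"]
}).T

print(totals)
```

totals has two rows (RECL_LCC and RECL_PI) and three columns (classes 1, 2, 3), so totals.plot(kind='bar', stacked=True) should draw two bars with three sections each.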