How do I draw a sample of rows from within each group of a DataFrame (say, 10% at random, or alternatively every nth row)?
e.g., when grouping by 'name':
name a b
foo 1 1
foo 4 1
foo 3 3
bar 2 1
bar 3 7
bar 4 3
bar 1 2
I want to get something like:
name a b
foo 4 1
bar 3 7
bar 1 2
many thanks
You can use groupby to group by your name column and then apply sample to draw random rows from each subgroup.
First, let's see the dummy data:
print(df)
name a b
0 foo 1 1
1 foo 4 1
2 foo 3 3
3 bar 2 1
4 bar 3 7
5 bar 4 3
6 bar 1 2
fraction defines the fraction of rows to sample from each group; it is set to 0.5 here because your dummy data set is so small:
fraction = 0.5
result = df.groupby("name", group_keys=False).apply(lambda x: x.sample(frac=fraction))
print(result)
name a b
3 bar 2 1
6 bar 1 2
0 foo 1 1
2 foo 3 3
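If you instead want every nth row from each group (the other variant in the question), you can slice each group with iloc; a minimal sketch, assuming n = 2:
n = 2  # assumption: keep every 2nd row of each group
result = df.groupby("name", group_keys=False).apply(lambda x: x.iloc[::n])
print(result)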
I have a GroupBy object. I want to remove rows from the current group if the same row exists in the previous group. Let's say this is the (n-1)th group:
A B
0 foo 0
1 baz 1
2 foo 1
3 bar 1
And this is the n-th group:
A B
0 foo 2
1 foo 1
2 baz 1
3 baz 3
After dropping all duplicates, the result for the n-th group should be:
A B
0 foo 2
3 baz 3
EDIT:
I would like to achieve this without a loop, if possible.
I am using merge with the indicator argument here:
yourdf = dfn.merge(df1, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
yourdf
A B _merge
0 foo 2 left_only
3 baz 3 left_only
# yourdf.drop('_merge', axis=1, inplace=True)
Since it is a GroupBy object, you can wrap this in a for loop, applying the code above to each consecutive pair of groups.
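A minimal sketch of that loop, assuming grouped is your existing GroupBy object and each group is compared against the unfiltered previous group (note that merge resets the row index):
previous = None
results = []
for _, group in grouped:
    current = group
    if previous is not None:
        merged = current.merge(previous, indicator=True, how='left')
        current = merged.loc[merged['_merge'] != 'both'].drop(columns='_merge')
    results.append(current)
    previous = group  # compare against the original (n-1)th group, not the filtered one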
I have a DataFrame with two columns "A" and "B".
A B
0 foo one
1 bar one
2 foo two
3 bar one
4 foo two
5 bar two
6 foo one
7 foo one
8 xyz one
For each group in "A", I'm trying to get the count of each value of "B", i.e. each sub-group of B, but aggregated on the grouping of "A".
The result should look like this:
A B countOne countTwo
0 foo one 3 2
1 bar one 2 1
2 foo two 3 2
3 bar one 2 1
4 foo two 3 2
5 bar two 2 1
6 foo one 3 2
7 foo one 3 2
8 xyz one 1 0
I have tried several approaches to no avail, so far I'm using this approach:
A_grouped = df.groupby(['A', 'B'])['A'].count()
A_grouped_ones = A_grouped[:,'one']
A_grouped_twos = A_grouped[:,'two']
df['countOne'] = df['A'].map(lambda a: A_grouped_ones[a] if a in A_grouped_ones else 0)
df['countTwo'] = df['A'].map(lambda a: A_grouped_twos[a] if a in A_grouped_twos else 0)
However, this seems horribly inefficient to me. Is there a better solution?
You can use unstack with add_prefix to build a new DataFrame, then join it back to the original:
df1 = df.groupby(['A', 'B'])['A'].count().unstack(fill_value=0).add_prefix('count_')
print (df1)
B count_one count_two
A
bar 2 1
foo 3 2
xyz 1 0
df = df.join(df1, on='A')
print (df)
A B count_one count_two
0 foo one 3 2
1 bar one 2 1
2 foo two 3 2
3 bar one 2 1
4 foo two 3 2
5 bar two 2 1
6 foo one 3 2
7 foo one 3 2
8 xyz one 1 0
Another alternative is to use size:
df1 = df.groupby(['A', 'B']).size().unstack(fill_value=0).add_prefix('count_')
The difference is that size includes NaN values while count does not; check this answer.
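As an aside, pd.crosstab builds the same counts table in one call; a minimal sketch equivalent to the groupby/unstack line above:
df1 = pd.crosstab(df['A'], df['B']).add_prefix('count_')
df = df.join(df1, on='A')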
In the example below, I am trying to generate a column 'E' that is assigned either 1 or 2 depending on a conditional statement on column A.
I've tried various options, but they throw a slicing error. (Shouldn't it be something like this to assign a value to the new column 'E'?)
df2= df.loc[df['A'] == 'foo']['E'] = 1
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print('Filter the content')
df2= df.loc[df['A'] == 'foo']
print(df2)
# A B C D E
# 0 foo one 0 0 1
# 2 foo two 2 4 1
# 4 foo two 4 8 1
# 6 foo one 6 12 1
# 7 foo three 7 14 1
df3= df.loc[df['A'] == 'bar']
print(df3)
# A B C D E
# 1 bar one 1 2 2
# 3 bar three 3 6 2
# 5 bar two 5 10 2
# Combine df2 and df3 back into df and print df
print(df)
# A B C D E
# 0 foo one 0 0 1
# 1 bar one 1 2 2
# 2 foo two 2 4 1
# 3 bar three 3 6 2
# 4 foo two 4 8 1
# 5 bar two 5 10 2
# 6 foo one 6 12 1
# 7 foo three 7 14 1
What about simply this?
df['E'] = np.where(df['A'] == 'foo', 1, 2)
This does what I think you're trying to do: create a column E in your dataframe that is 1 if A == 'foo' and 2 otherwise.
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
df['E'] = np.ones(df.shape[0]) * 2
df.loc[df.A == 'foo', 'E'] = 1
df.E = df.E.astype(int)
print(df)
Note: Your suggested solution df2= df.loc[df['A'] == 'foo']['E'] = 1 uses chained indexing rather than taking advantage of loc. To slice df rows by the conditional and select the column E in one step, you should instead use df.loc[df['A'] == 'foo', 'E']
Note II: If you have more than one condition, you could also use .replace() and pass in a dictionary, in this case mapping foo to 1, bar to 2, and so on.
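A minimal sketch of that dictionary approach (the foo/bar mapping mirrors the example above):
df['E'] = df['A'].replace({'foo': 1, 'bar': 2})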
for brevity (characters)
df.assign(E=df.A.ne('foo')+1)
A B C D E
0 foo one 0 0 1
1 bar one 1 2 2
2 foo two 2 4 1
3 bar three 3 6 2
4 foo two 4 8 1
5 bar two 5 10 2
6 foo one 6 12 1
7 foo three 7 14 1
for brevity (time)
df.assign(E=(df.A.values != 'foo') + 1)
A B C D E
0 foo one 0 0 1
1 bar one 1 2 2
2 foo two 2 4 1
3 bar three 3 6 2
4 foo two 4 8 1
5 bar two 5 10 2
6 foo one 6 12 1
7 foo three 7 14 1
I would like to create a stacked bar plot from the following dataframe:
VALUE COUNT RECL_LCC RECL_PI
0 1 15686114 3 1
1 2 27537963 1 1
2 3 23448904 1 2
3 4 1213184 1 3
4 5 14185448 3 2
5 6 13064600 3 3
6 7 27043180 2 2
7 8 11732405 2 1
8 9 14773871 2 3
There would be 2 bars in the plot, one for RECL_LCC and the other for RECL_PI. Each bar would have 3 sections corresponding to the unique values in RECL_LCC and RECL_PI (i.e. 1, 2, and 3), and each section would sum up COUNT. So far, I have something like this:
df = df.convert_objects(convert_numeric=True)
sub_df = df.groupby(['RECL_LCC','RECL_PI'])['COUNT'].sum().unstack()
sub_df.plot(kind='bar',stacked=True)
However, I get this plot:
Any idea on how to obtain 2 columns (RECL_LCC and RECL_PI) instead of these 3?
So your problem was that the dtypes were not numeric; no aggregation function will work on strings. You can convert each offending column like so:
df['col'] = df['col'].astype(int)
or just call convert_objects on the df:
df.convert_objects(convert_numeric=True)
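Note that convert_objects has since been deprecated and removed in newer pandas versions; there, the per-column equivalent is pd.to_numeric. A minimal sketch:
df['col'] = pd.to_numeric(df['col'], errors='coerce')  # unparseable values become NaN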
I have a pandas dataframe with two columns: participant names and reaction times (note that one participant has multiple measures of his RT).
ID RT
0 foo 1
1 foo 2
2 bar 3
3 bar 4
4 foo 1
5 foo 2
6 bar 3
7 bar 4
8 bar 4
I would like to get a 2d array from this where every row contains the reaction times for one participant.
[[1,2,1,2]
[3,4,3,4,4]]
In case it's not possible to have a shape like that, the following options for obtaining a proper a x b shape would be acceptable to me: fill missing elements with NaN; truncate the longer rows to the size of the shorter rows; or fill the shorter rows with repeats of their mean value.
I would go for whatever is easiest to implement.
I have tried to sort this out using groupby, and I expected it to be very easy, but it all gets terribly messy :(
import pandas as pd
import io
data = io.StringIO(""" ID RT
0 foo 1
1 foo 2
2 bar 3
3 bar 4
4 foo 1
5 foo 2
6 bar 3
7 bar 4
8 bar 4""")
df = pd.read_csv(data, delim_whitespace=True)
df.groupby("ID").RT.apply(pd.Series.reset_index, drop=True).unstack()
output:
0 1 2 3 4
ID
bar 3 4 3 4 4
foo 1 2 1 2 NaN
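If you want a plain list of lists with unequal row lengths instead of a NaN-padded frame, a minimal sketch:
print(df.groupby("ID").RT.apply(list).tolist())
# [[3, 4, 3, 4, 4], [1, 2, 1, 2]]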