Concatenate groups of multiple dataframes - python

I have a df1:
a b c
1 0 1 4
2 0 2 5
3 1 1 3
and a second df2:
a b c
1 0 1 5
2 0 2 5
3 1 1 4
These dfs have the same groups in a and b. Within each group of 'a' and 'b' I want df2 underneath df1:
a b c
1 0 1 4
2 0 1 5
3 0 2 5
4 0 2 5
5 1 1 3
6 1 1 4
How can I combine groupby() and concat() to get the desired output?

You can do concat then sort_values:
df = pd.concat([df1, df2]).sort_values(['a','b']).reset_index(drop=True)
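For reference, a minimal runnable sketch of that approach (the kind='stable' argument is my addition; the default sort is not stable, so this guarantees df1's rows stay above df2's rows inside each (a, b) group):
import pandas as pd
df1 = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 1], 'c': [4, 5, 3]})
df2 = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 1], 'c': [5, 5, 4]})
# Stack the frames, then sort by the group keys; the stable sort keeps
# df1's rows ahead of df2's rows within each (a, b) group.
out = pd.concat([df1, df2]).sort_values(['a', 'b'], kind='stable').reset_index(drop=True)
print(out)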

Related

Keep only rows if the count of an object is maximum

I have a pandas data frame as follows
A B
1 2
1 2
1 0
1 2
2 3
2 3
2 1
3 0
3 0
3 1
3 2
I would like to get the following output
A B
1 2
1 2
1 2
2 3
2 3
3 0
3 0
This means I need only the rows where, within each group of A, B takes its most frequent value. Is there any solution to this?
Many thanks!
You can combine groupby() with Series.mode():
df_out = df[df.groupby("A")["B"].transform(lambda x: x == x.mode()[0])]
print(df_out)
Prints:
A B
0 1 2
1 1 2
3 1 2
4 2 3
5 2 3
7 3 0
8 3 0
Is this what you're looking for?
df.set_index(['A','B']).loc[df.groupby(['A','B']).size().groupby(level=0).idxmax()].reset_index()
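If it helps, here is that one-liner unpacked into steps (the intermediate names are mine, just for readability):
# Count each (A, B) pair, pick the most frequent pair per A, then select those rows.
pair_counts = df.groupby(['A', 'B']).size()
best_pairs = pair_counts.groupby(level=0).idxmax()
df_out = df.set_index(['A', 'B']).loc[best_pairs].reset_index()
print(df_out)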

use group by to get n smallest values but with duplicates

Suppose I have pandas DataFrame like this:
>>> df = pd.DataFrame({'id':[1,1,1,1,1,2,2,2,2,2,2,3,4],'value':[1,1,1,1,3,1,2,2,3,3,4,1,1]})
>>> df
id value
1 1
1 1
1 1
1 1
1 3
2 1
2 2
2 2
2 3
2 3
2 4
3 1
4 1
I want to get a new DataFrame with the 2 smallest (well, really the n smallest) values for each id, including duplicates, like this:
id value
0 1 1
1 1 1
3 1 1
4 1 1
5 1 3
6 2 1
7 2 2
8 2 2
9 3 1
10 4 1
I've tried using head() and nsmallest() but I think those will not include duplicates. Is there a better way to do this?
Edited to make it clear that I want more than 2 records per group if there are more than 2 duplicates.
Use DataFrame.drop_duplicates first, then take the top values, and finally use DataFrame.merge:
df1 = df.drop_duplicates(['id','value']).sort_values(['id','value']).groupby('id').head(2)
df = df.merge(df1)
print (df)
id value
0 1 1
1 1 1
2 1 1
3 1 1
4 1 3
5 2 1
6 2 2
7 2 2
8 3 1
9 4 1
Or use a custom lambda function with GroupBy.transform and filter with boolean indexing:
df = df[df.groupby('id')['value'].transform(lambda x: x.isin(sorted(set(x))[:2]))]
print (df)
id value
0 1 1
1 1 1
2 1 1
3 1 1
4 1 3
5 2 1
6 2 2
7 2 2
11 3 1
12 4 1
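A further alternative, not from the answers above but worth noting: a dense rank per group also keeps duplicates, so a sketch for the same data could be:
# method='dense' ranks distinct values 1, 2, 3, ... so equal values share a rank;
# keeping rank <= 2 keeps every row whose value is among the 2 smallest distinct values per id.
df_top = df[df.groupby('id')['value'].rank(method='dense') <= 2]
print(df_top)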

Python Counting Same Values For Specific Columns

If I have a dataframe:
A B C D
1 1 2 2 1
2 1 1 2 1
3 3 1 0 1
4 2 4 4 4
I want to add columns B and C together and count how many of the results differ from column D. The desired output is:
A B C B+C D
1 1 2 2 4 1
2 1 1 2 3 1
3 3 1 0 1 1
4 2 4 4 8 4
There are 3 rows where "B+C" and "D" differ.
Could you please help me with this?
You could do something like:
df.B.add(df.C).ne(df.D).sum()
# 3
If you need to add the column:
df['B+C'] = df.B.add(df.C)
diff = df['B+C'].ne(df.D).sum()
print(f'There are {diff} rows where "B+C" differs from "D"')
# There are 3 rows where "B+C" differs from "D"
df.insert(3,'B+C', df['B']+df['C'])
3 is the position at which the new column is inserted
df.head()
A B C B+C D
0 1 2 2 4 1
1 1 1 2 3 1
2 3 1 0 1 1
3 2 4 4 8 4
After that you can follow the steps of @yatu:
df['B+C'].ne(df['D'])
0 True
1 True
2 False
3 True
dtype: bool
df['B+C'].ne(df['D']).sum()
3
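Putting the two answers together, a minimal end-to-end sketch with the sample data might look like this:
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 3, 2], 'B': [2, 1, 1, 4], 'C': [2, 2, 0, 4], 'D': [1, 1, 1, 4]})
df.insert(3, 'B+C', df['B'] + df['C'])   # 3 = position where the new column is inserted
diff = df['B+C'].ne(df['D']).sum()       # number of rows where B+C differs from D
print(f'There are {diff} rows where "B+C" differs from "D"')   # prints 3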

Use groupby and merge to create new column in pandas

So I have a pandas dataframe that looks something like this.
name is_something
0 a 0
1 b 1
2 c 0
3 c 1
4 a 1
5 b 0
6 a 1
7 c 0
8 a 1
Is there a way to use groupby and merge to create a new column that gives the number of times a name appears with an is_something value of 1 in the whole dataframe? The updated dataframe would look like this:
name is_something no_of_times_is_something_is_1
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
I know you can just loop through the dataframe to do this but I'm looking for a more efficient way because the dataset I'm working with is quite large. Thanks in advance!
If there are only 0 and 1 values in the is_something column, just use sum with GroupBy.transform to create a new column filled with the aggregated values:
df['new'] = df.groupby('name')['is_something'].transform('sum')
print (df)
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
If there can be values other than 0 and 1, first compare with 1, convert to integer, and then use transform with sum:
df['new'] = df['is_something'].eq(1).astype('i1').groupby(df['name']).transform('sum')
Or we can just map it:
df['New']=df.name.map(df.query('is_something ==1').groupby('name')['is_something'].sum())
df
name is_something New
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
You could do:
df['new'] = df.groupby('name')['is_something'].transform(lambda xs: xs.eq(1).sum())
print(df)
Output
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
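Since the question asks specifically about groupby plus merge, here is a sketch of that route as well (transform is simpler, but this shows the merge wiring; the fillna line only matters for names that never appear with a 1):
# Count how often each name appears with is_something == 1, then merge the counts back.
counts = df[df['is_something'] == 1].groupby('name').size().rename('no_of_times_is_something_is_1').reset_index()
df = df.merge(counts, on='name', how='left')
df['no_of_times_is_something_is_1'] = df['no_of_times_is_something_is_1'].fillna(0).astype(int)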

Is it possible to obtain groupby style counts without collapsing Pandas DataFrame?

I have a DataFrame with 9 columns, and I'm trying to add a column of counts of unique values based on the first 3 columns (e.g. columns A, B, and C must match to count as a unique value, but the remaining columns can vary). I attempted to do this with groupby:
df = pd.DataFrame(resultsFile500.groupby(['chr','start','end']).size().reset_index().rename(columns={0:'count'}))
This returns a DataFrame with 5 columns, and the counts are what I want. However, I also need values from the original data frame, so what I have been trying to do is somehow get those values of counts as a column in the original df. So, this would mean that if two rows in columns chr, start, and end, had identical values, the counts column would be 2 in both rows, but they would not be collapsed to one row. Is there an easy solution here that I'm missing, or do I need to hack something together?
You can use .transform to get non-collapsing behavior:
>>> df
a b c d e
0 3 4 1 3 0
1 3 1 4 3 0
2 4 3 3 2 1
3 3 4 1 4 0
4 0 4 3 3 2
5 1 2 0 4 1
6 3 1 4 2 1
7 0 4 3 4 0
8 1 3 0 1 1
9 3 4 1 2 1
>>> df.groupby(['a','b','c']).transform('count')
d e
0 3 3
1 2 2
2 1 1
3 3 3
4 2 2
5 1 1
6 2 2
7 2 2
8 1 1
9 3 3
>>>
Note, I'll have to choose an arbitrary column from the .transform result, but then just do:
>>> df['unique_count'] = df.groupby(['a','b','c']).transform('count')['d']
>>> df
a b c d e unique_count
0 3 4 1 3 0 3
1 3 1 4 3 0 2
2 4 3 3 2 1 1
3 3 4 1 4 0 3
4 0 4 3 3 2 2
5 1 2 0 4 1 1
6 3 1 4 2 1 2
7 0 4 3 4 0 2
8 1 3 0 1 1 1
9 3 4 1 2 1 3
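A small variation on the same idea that avoids having to pick an arbitrary column afterwards: transform a single column with 'size' (which counts rows per group, NaNs included):
df['unique_count'] = df.groupby(['a', 'b', 'c'])['d'].transform('size')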
