Concatenate groups of multiple dataframes - python

I have a df1:
a b c
1 0 1 4
2 0 2 5
3 1 1 3
and a second df2:
a b c
1 0 1 5
2 0 2 5
3 1 1 4
These dfs have the same groups in a and b. Within each group of 'a' and 'b' I want df2 underneath df1:
a b c
1 0 1 4
2 0 1 5
3 0 2 5
4 0 2 5
5 1 1 3
6 1 1 4
How can I combine groupby() and concat() to get the desired output?

You can do concat then sort_values:
df = pd.concat([df1, df2]).sort_values(['a','b']).reset_index(drop=True)
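For reference, a minimal runnable sketch of that approach (the kind='stable' argument is my addition; the default sort is not stable, so this guarantees df1's rows stay above df2's rows inside each (a, b) group):
import pandas as pd
df1 = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 1], 'c': [4, 5, 3]})
df2 = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 1], 'c': [5, 5, 4]})
# Stack the frames, then sort by the group keys; the stable sort keeps
# df1's rows ahead of df2's rows within each (a, b) group.
out = pd.concat([df1, df2]).sort_values(['a', 'b'], kind='stable').reset_index(drop=True)
print(out)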

Related

Keep only rows if the count of an object is maximum

I have a pandas data frame as follows
A B
1 2
1 2
1 0
1 2
2 3
2 3
2 1
3 0
3 0
3 1
3 2
I would like to get the following output
A B
1 2
1 2
1 2
2 3
2 3
3 0
3 0
This means I need only the rows where, within each group of A, B takes its most frequent value. Is there any solution to this?
Many thanks!
You can combine groupby() with Series.mode():
df_out = df[df.groupby("A")["B"].transform(lambda x: x == x.mode()[0])]
print(df_out)
Prints:
A B
0 1 2
1 1 2
3 1 2
4 2 3
5 2 3
7 3 0
8 3 0
Is this what you're looking for?
df.set_index(['A','B']).loc[df.groupby(['A','B']).size().groupby(level=0).idxmax()].reset_index()
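If it helps, here is that one-liner unpacked into steps (the intermediate names are mine, just for readability):
# Count each (A, B) pair, pick the most frequent pair per A, then select those rows.
pair_counts = df.groupby(['A', 'B']).size()
best_pairs = pair_counts.groupby(level=0).idxmax()
df_out = df.set_index(['A', 'B']).loc[best_pairs].reset_index()
print(df_out)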

use group by to get n smallest values but with duplicates

Suppose I have pandas DataFrame like this:
>>> df = pd.DataFrame({'id':[1,1,1,1,1,2,2,2,2,2,2,3,4],'value':[1,1,1,1,3,1,2,2,3,3,4,1,1]})
>>> df
id value
1 1
1 1
1 1
1 1
1 3
2 1
2 2
2 2
2 3
2 3
2 4
3 1
4 1
I want to get a new DataFrame with the 2 smallest (well, really the n smallest) values for each id, including duplicates, like this:
id value
0 1 1
1 1 1
3 1 1
4 1 1
5 1 3
6 2 1
7 2 2
8 2 2
9 3 1
10 4 1
I've tried using head() and nsmallest() but I think those will not include duplicates. Is there a better way to do this?
Edited to make it clear that I want more than 2 records per group if there are more than 2 duplicates.
Use DataFrame.drop_duplicates first, then take the top values, and finally use DataFrame.merge:
df1 = df.drop_duplicates(['id','value']).sort_values(['id','value']).groupby('id').head(2)
df = df.merge(df1)
print (df)
id value
0 1 1
1 1 1
2 1 1
3 1 1
4 1 3
5 2 1
6 2 2
7 2 2
8 3 1
9 4 1
Or use a custom lambda function with GroupBy.transform and filter with boolean indexing:
df = df[df.groupby('id')['value'].transform(lambda x: x.isin(sorted(set(x))[:2]))]
print (df)
id value
0 1 1
1 1 1
2 1 1
3 1 1
4 1 3
5 2 1
6 2 2
7 2 2
11 3 1
12 4 1
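A further alternative, not from the answers above but worth noting: a dense rank per group also keeps duplicates, so a sketch for the same data could be:
# method='dense' ranks distinct values 1, 2, 3, ... so equal values share a rank;
# keeping rank <= 2 keeps every row whose value is among the 2 smallest distinct values per id.
df_top = df[df.groupby('id')['value'].rank(method='dense') <= 2]
print(df_top)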

Python Counting Same Values For Specific Columns

If I have a dataframe:
A B C D
1 1 2 2 1
2 1 1 2 1
3 3 1 0 1
4 2 4 4 4
I want to add columns B and C together and count how many of the results differ from column D. The desired output is:
A B C B+C D
1 1 2 2 4 1
2 1 1 2 3 1
3 3 1 0 1 1
4 2 4 4 8 4
There are 3 rows where "B+C" and "D" differ.
Could you please help me with this?
You could do something like:
df.B.add(df.C).ne(df.D).sum()
# 3
If you need to add the column:
df['B+C'] = df.B.add(df.C)
diff = df['B+C'].ne(df.D).sum()
print(f'There are {diff} rows where "B+C" differs from "D"')
# There are 3 rows where "B+C" differs from "D"
df.insert(3,'B+C', df['B']+df['C'])
3 is the position at which the new column is inserted
df.head()
A B C B+C D
0 1 2 2 4 1
1 1 1 2 3 1
2 3 1 0 1 1
3 2 4 4 8 4
After that you can follow the steps of @yatu:
df['B+C'].ne(df['D'])
0 True
1 True
2 False
3 True
dtype: bool
df['B+C'].ne(df['D']).sum()
3
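Putting the two answers together, a minimal end-to-end sketch with the sample data might look like this:
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 3, 2], 'B': [2, 1, 1, 4], 'C': [2, 2, 0, 4], 'D': [1, 1, 1, 4]})
df.insert(3, 'B+C', df['B'] + df['C'])   # 3 = position where the new column is inserted
diff = df['B+C'].ne(df['D']).sum()       # number of rows where B+C differs from D
print(f'There are {diff} rows where "B+C" differs from "D"')   # prints 3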

Use groupby and merge to create new column in pandas

So I have a pandas dataframe that looks something like this.
name is_something
0 a 0
1 b 1
2 c 0
3 c 1
4 a 1
5 b 0
6 a 1
7 c 0
8 a 1
Is there a way to use groupby and merge to create a new column that gives the number of times a name appears with an is_something value of 1 in the whole dataframe? The updated dataframe would look like this:
name is_something no_of_times_is_something_is_1
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
I know you can just loop through the dataframe to do this but I'm looking for a more efficient way because the dataset I'm working with is quite large. Thanks in advance!
If there are only 0 and 1 values in the is_something column, just use sum with GroupBy.transform to create a new column filled with the aggregated values:
df['new'] = df.groupby('name')['is_something'].transform('sum')
print (df)
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
If there can be values other than 0 and 1, first compare with 1, convert to integer, and then use transform with sum:
df['new'] = df['is_something'].eq(1).astype('i1').groupby(df['name']).transform('sum')
Or we can just map it:
df['New']=df.name.map(df.query('is_something ==1').groupby('name')['is_something'].sum())
df
name is_something New
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
You could do:
df['new'] = df.groupby('name')['is_something'].transform(lambda xs: xs.eq(1).sum())
print(df)
Output
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
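Since the question asks specifically about groupby plus merge, here is a sketch of that route as well (transform is simpler, but this shows the merge wiring; the fillna line only matters for names that never appear with a 1):
# Count how often each name appears with is_something == 1, then merge the counts back.
counts = df[df['is_something'] == 1].groupby('name').size().rename('no_of_times_is_something_is_1').reset_index()
df = df.merge(counts, on='name', how='left')
df['no_of_times_is_something_is_1'] = df['no_of_times_is_something_is_1'].fillna(0).astype(int)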

Is it possible to obtain groupby style counts without collapsing Pandas DataFrame?

I have a DataFrame with 9 columns, and I'm trying to add a column of counts of unique values based on the first 3 columns (e.g. columns A, B, and C must match to count as a unique value, but the remaining columns can vary). I attempted to do this with groupby:
df = pd.DataFrame(resultsFile500.groupby(['chr','start','end']).size().reset_index().rename(columns={0:'count'}))
This returns a DataFrame with 5 columns, and the counts are what I want. However, I also need values from the original data frame, so what I have been trying to do is somehow get those values of counts as a column in the original df. So, this would mean that if two rows in columns chr, start, and end, had identical values, the counts column would be 2 in both rows, but they would not be collapsed to one row. Is there an easy solution here that I'm missing, or do I need to hack something together?
You can use .transform to get non-collapsing behavior:
>>> df
a b c d e
0 3 4 1 3 0
1 3 1 4 3 0
2 4 3 3 2 1
3 3 4 1 4 0
4 0 4 3 3 2
5 1 2 0 4 1
6 3 1 4 2 1
7 0 4 3 4 0
8 1 3 0 1 1
9 3 4 1 2 1
>>> df.groupby(['a','b','c']).transform('count')
d e
0 3 3
1 2 2
2 1 1
3 3 3
4 2 2
5 1 1
6 2 2
7 2 2
8 1 1
9 3 3
>>>
Note, I'll have to choose an arbitrary column from the .transform result, but then just do:
>>> df['unique_count'] = df.groupby(['a','b','c']).transform('count')['d']
>>> df
a b c d e unique_count
0 3 4 1 3 0 3
1 3 1 4 3 0 2
2 4 3 3 2 1 1
3 3 4 1 4 0 3
4 0 4 3 3 2 2
5 1 2 0 4 1 1
6 3 1 4 2 1 2
7 0 4 3 4 0 2
8 1 3 0 1 1 1
9 3 4 1 2 1 3
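A small variation on the same idea that avoids having to pick an arbitrary column afterwards: transform a single column with 'size' (which counts rows per group, NaNs included):
df['unique_count'] = df.groupby(['a', 'b', 'c'])['d'].transform('size')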
