Grouping and printing the maximum in a dataframe in Python

A dataframe has 3 Columns
A B C
^0hand(%s)leg$ 27;30 42;54
^-(%s)hand0leg 39;30 47;57
^0hand(%s)leg$ 24;33 39;54
Column A holds regex patterns. When two rows have the same pattern (here rows 1 and 3), they should be merged into one row that keeps only the maximum of each value, as below:
Output:
A B C
^0hand(%s)leg$ 27;33 42;54
^-(%s)hand0leg 39;30 47;57
Any leads would be helpful.

You could use:
(df.set_index('A').stack()
   .str.extract(r'(\d+);(\d+)').astype(int)        # split "x;y" into two integer columns
   .groupby(level=[0, 1]).agg('max').astype(str)   # max per pattern and per original column
   .assign(s=lambda d: d[0] + ';' + d[1])['s']     # OR # .apply(';'.join, axis=1)
   .unstack(1)
   .loc[df['A'].unique()]   ## only if the order of rows matters
   .reset_index()
)
output:
A B C
0 ^0hand(%s)leg$ 27;33 42;54
1 ^-(%s)hand0leg 39;30 47;57
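For comparison, here is a minimal self-contained sketch of the same idea written as a plain groupby with a small helper. The helper name max_per_part is made up for illustration, and the sketch assumes every cell has the form "x;y" with integer parts, as in the sample data.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": ["^0hand(%s)leg$", "^-(%s)hand0leg", "^0hand(%s)leg$"],
    "B": ["27;30", "39;30", "24;33"],
    "C": ["42;54", "47;57", "39;54"],
})

def max_per_part(values):
    """Hypothetical helper: take 'x;y' strings, return 'max(x);max(y)'."""
    parts = [[int(p) for p in v.split(";")] for v in values]
    # zip(*parts) regroups the numbers by position (all x's, then all y's).
    return ";".join(str(max(col)) for col in zip(*parts))

# sort=False keeps the groups in order of first appearance.
result = df.groupby("A", sort=False)[["B", "C"]].agg(max_per_part).reset_index()
print(result)
```

This trades the vectorised str.extract for a Python-level loop, which is easier to read but likely slower on large frames.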

Related

add/combine columns after searching in a DataFrame

I'm trying to copy data from different columns to a particular column in the same DataFrame.
Index col1A col2A colB list CT CW CH
0 1 :
1 b
2 2
3 3d
But before that, I want to check whether those columns (col1A, col2A, colB) exist in the DataFrame, group the ones that are present, and move the grouped data to the relevant columns (CT, CH, etc.), like this:
CH CW CT
0 1 1
1 b b
2 2 2
3 3d 3d
I did:
col_list1 = ['ColA','ColB','ColC']
test1 = any([ i in df.columns for i in col_list1 ])
if test1==True:
    df['CH'] = df['Col1A'] +df['Col2A']
    df['CT'] = df['ColB']
This code throws a KeyError.
I want it to ignore columns that are not present and add only those that are present.
IIUC, you can use a Python set or Index.isin to find the common columns:
cols = list(set(col_list1) & set(df.columns))
# or
cols = df.columns[df.columns.isin(col_list1)]
df['CH'] = df[cols].sum(axis=1)
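Instead of just concatenating the columns with +, collect the ones that exist into a list and sum them element-wise. np.sum stacks the list into a 2-D array with one row per column, so the sum has to run over axis=0:
import numpy as np
df['CH'] = np.sum([df[c] for c in col_list1 if c in df], axis=0)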
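The "ignore whatever is missing" idea can be sketched end to end as below. The frame and its values are made up for illustration, and the sketch assumes numeric columns; col2A is deliberately absent so the filter has something to drop.

```python
import pandas as pd

# Hypothetical frame: 'col2A' is deliberately missing.
df = pd.DataFrame({"col1A": [1, 2], "colB": [10, 20]})
wanted = ["col1A", "col2A", "colB"]

# Keep only the requested names that exist, in the requested order.
present = [c for c in wanted if c in df.columns]

# Row-wise sum over the columns that were found; no KeyError for 'col2A'.
df["CH"] = df[present].sum(axis=1)
print(df)
```

A list comprehension is used instead of a set intersection so that the column order stays predictable.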

Search string in dataframe column that contains lists of string and return complete dataframe

I have a dataframe df with 4 columns: 'A', 'B', 'C', 'D'.
I have to search for a substring in each column and return the complete matching rows in their original order. For example, if the substring is found in column B in rows 3, 4 and 5, the final df should have 3 rows. For this I am using df[df['A'].str.contains('string_to_search')] and it works fine, but in one of the columns, column B, each element is a list of strings:
A B C D
0 asdfg [asdfgh, cvb] asdfg nbcjsh
1 fghjk [ertyu] fghhjk yrewf
2 xcvb [qwerr, hjklk, bnm] cvbvb gjfsjgf
3 ertyu [qwert] ertyhhu ertkkk
So the same str.contains approach does not work for column B. Please suggest how I can search in this column while keeping the row order of the complete dataframe.
Column B contains lists, so use the in operator on each list:
df1 = df[df['B'].apply(lambda x: 'cvb' in x)]
print (df1)
A B C D
0 asdfg [asdfgh, cvb] asdfg nbcjsh
If you want to use str.contains, you can call str.join first, which also makes it possible to match substrings:
df1 = df[df['B'].str.join(' ').str.contains('er')]
print (df1)
A B C D
1 fghjk [ertyu] fghhjk yrewf
2 xcvb [qwerr, hjklk, bnm] cvbvb gjfsjgf
3 ertyu [qwert] ertyhhu ertkkk
If you want to search in all columns:
df2 = (df[df.assign(B = df['B'].str.join(' '))
.apply(' '.join, axis=1)
.str.contains('g')]
)
print (df2)
A B C D
0 asdfg [asdfgh, cvb] asdfg nbcjsh
1 fghjk [ertyu] fghhjk yrewf
2 xcvb [qwerr, hjklk, bnm] cvbvb gjfsjgf
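If a substring match should be checked element by element rather than on the joined string, something like the sketch below may help. The function name rows_with_substring is made up for illustration; the data is the sample from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": ["asdfg", "fghjk", "xcvb", "ertyu"],
    "B": [["asdfgh", "cvb"], ["ertyu"], ["qwerr", "hjklk", "bnm"], ["qwert"]],
    "C": ["asdfg", "fghhjk", "cvbvb", "ertyhhu"],
    "D": ["nbcjsh", "yrewf", "gjfsjgf", "ertkkk"],
})

def rows_with_substring(frame, column, needle):
    """Hypothetical helper: rows where any list element contains `needle`."""
    mask = frame[column].apply(lambda items: any(needle in s for s in items))
    return frame[mask]  # boolean indexing keeps the original row order

hits = rows_with_substring(df, "B", "er")
print(hits)
```

Unlike joining with a space first, this cannot produce a match that spans two neighbouring list elements.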

Get only matching rows for groups in Pandas groupby

I have the following df:
d = {"Col1":['a','d','b','c','a','d','b','c'],
"Col2":['x','y','x','z','x','y','z','y'],
"Col3":['n','m','m','l','m','m','l','l'],
"Col4":[1,4,2,2,1,4,2,2]}
df = pd.DataFrame(d)
When I groupby on three fields, I get the result:
gb = df.groupby(['Col1', 'Col2', 'Col3'])['Col4'].agg(['sum', 'mean'])
How can I extract only the groups and rows where a row of one group matches at least one row of another group on the grouped columns? Please see the picture below; I want to get the highlighted rows.
I want to get the rows in red, based on the ones in blue and black that match each other.
Apologies if my statement is ambiguous. Any help would be appreciated.
You can use reset_index, then use duplicated with boolean indexing to filter your dataframe:
gb = gb.reset_index()
gb[gb.duplicated(subset=['Col2','Col3'], keep=False)]
Output:
Col1 Col2 Col3 sum mean
0 a x m 1 1
2 b x m 2 2
3 b z l 2 2
5 c z l 2 2
Make a table with all allowed combinations and then inner join it with this dataframe.
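A rough sketch of that join-based suggestion is below. It assumes the "allowed combinations" are the (Col2, Col3) pairs that occur under more than one Col1 value; the names pair_counts, n_groups, allowed and matched are made up for illustration.

```python
import pandas as pd

# Data copied from the question.
d = {"Col1": ['a', 'd', 'b', 'c', 'a', 'd', 'b', 'c'],
     "Col2": ['x', 'y', 'x', 'z', 'x', 'y', 'z', 'y'],
     "Col3": ['n', 'm', 'm', 'l', 'm', 'm', 'l', 'l'],
     "Col4": [1, 4, 2, 2, 1, 4, 2, 2]}
df = pd.DataFrame(d)
gb = (df.groupby(['Col1', 'Col2', 'Col3'])['Col4']
        .agg(['sum', 'mean'])
        .reset_index())

# Count how many groups share each (Col2, Col3) pair.
pair_counts = gb.groupby(['Col2', 'Col3']).size().reset_index(name='n_groups')

# Assumption: a pair is "allowed" when more than one group has it.
allowed = pair_counts.loc[pair_counts['n_groups'] > 1, ['Col2', 'Col3']]

# The inner join keeps only the rows of gb whose pair is in the allowed table.
matched = gb.merge(allowed, on=['Col2', 'Col3'], how='inner')
print(matched)
```

The allowed table could equally be written by hand if the valid combinations are known in advance.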

Matching the column names of two pandas data-frames in python

I have two pandas dataframes named df1 and df2:
df1:
a b c d
1 2 3 4
5 6 7 8
and
df2:
b c
12 13
I want the result to be:
result:
b c
2 3
6 7
Note that a, b, c, d are the column names. The two dataframes differ in shape and values. I want to match the column names of df2 against those of df1 and select, with all their rows, the columns of df1 whose names also appear in df2. df2 is only used to pick the columns of df1. I tried the code below, but it gives me an empty index.
df1.columns.intersection(df2.columns)
This does not give me the result I want, because it returns only the column headers with no values. I want code that takes the two dataframes as input and compares their column headers for the selection, without hard-coding column names.
I believe you need:
df = df1[df1.columns.intersection(df2.columns)]
Or, as @Zero pointed out in the comments:
df = df1[df1.columns & df2.columns]  # set-style & works on older pandas only; prefer .intersection() on recent versions
Or, use reindex
In [594]: df1.reindex(columns=df2.columns)
Out[594]:
b c
0 2 3
1 6 7
Or equivalently:
In [595]: df1.reindex(df2.columns, axis=1)
Out[595]:
b c
0 2 3
1 6 7
As an alternative to intersection, use a boolean column mask with .loc (a bare df1[mask] would filter rows, not columns):
df = df1.loc[:, df1.columns.isin(df2.columns)]
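To wrap this up as something that takes two dataframes as input, a small sketch follows. The function name select_shared_columns and the extra frame df3 are made up for illustration; df3 shows how intersection and reindex differ when the second frame has a column the first one lacks.

```python
import pandas as pd

# Data from the question.
df1 = pd.DataFrame({"a": [1, 5], "b": [2, 6], "c": [3, 7], "d": [4, 8]})
df2 = pd.DataFrame({"b": [12], "c": [13]})

def select_shared_columns(left, right):
    """Hypothetical helper: columns of `left` whose names also occur in `right`."""
    shared = left.columns.intersection(right.columns)
    return left[shared]

result = select_shared_columns(df1, df2)
print(result)

# Hypothetical frame with a column ('z') that df1 does not have.
df3 = pd.DataFrame({"b": [0], "z": [0]})
only_shared = select_shared_columns(df1, df3)    # just 'b'
with_missing = df1.reindex(columns=df3.columns)  # 'b' plus an all-NaN 'z'
```

So intersection silently drops unknown names, while reindex keeps them as empty columns; which one is preferable depends on whether a missing column should be visible.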

How to compare column values of pandas groupby object and summarize them in a new column row

I have the following problem: I want to create a column in a dataframe that summarizes all values in a row. Then I want to compare the rows of that column to create a single row containing all the values from all columns, with each value present only once. As an example, I have the following dataframe:
df1:
Column1 Column2
0 a 1,2,3
1 a 1,4,5
2 b 7,1,5
3 c 8,9
4 b 7,3,5
the desired output would now be:
df1_new:
Column1 Column2
0 a 1,2,3,4,5
1 b 1,3,5,7
2 c 8,9
What I am currently trying is result = df1.groupby('Column1'), but then I don't know how to compare the values in the rows of the grouped objects, remove the duplicates, and write the result to the new column. I read through the pandas documentation on Group By: split-apply-combine but could not figure out a way to do it. I also wonder whether, once I have my desired output, there is a way to check in how many rows of the grouped object each value in Column2 of df1_new appeared. Any help on this would be greatly appreciated!
One way to do this is to apply a function to the grouped DataFrame.
For each group, the function converts the series to a list, splits each string on ',', chains the pieces into a single iterable with itertools.chain.from_iterable, converts that to a set so that only unique values remain, sorts it, and joins it back into a string with str.join. Example:
from itertools import chain
def applyfunc(x):
    # flatten the comma-separated values of every row in the group
    ch = chain.from_iterable(y.split(',') for y in x.tolist())
    # de-duplicate, sort, and join back into one string
    return ','.join(sorted(set(ch)))
df1_new = df1.groupby('Column1')['Column2'].apply(applyfunc).reset_index()
Demo -
In [46]: df
Out[46]:
Column1 Column2
0 a 1,2,3
1 a 1,4,5
2 b 7,1,5
3 c 8,9
4 b 7,3,5
In [47]: from itertools import chain
In [48]: def applyfunc(x):
   ....:     ch = chain.from_iterable(y.split(',') for y in x.tolist())
   ....:     return ','.join(sorted(set(ch)))
   ....:
In [49]: df.groupby('Column1')['Column2'].apply(applyfunc).reset_index()
Out[49]:
Column1 Column2
0 a 1,2,3,4,5
1 b 1,3,5,7
2 c 8,9
What about this:
df1
Column1 Column2
0 a 1,2,3
1 a 1,4,5
2 b 7,1,5
3 c 8,9
4 b 7,3,5
# assumes: import numpy as np
df1.groupby('Column1').\
    agg(lambda x: ','.join(x).split(','))['Column2'].\
    apply(lambda x: ','.join(np.unique(x))).reset_index()
Column1 Column2
0 a 1,2,3,4,5
1 b 1,3,5,7
2 c 8,9
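For the follow-up about how many rows each value appeared in, one possible sketch is to reshape to one row per single value first. The variable names long and counts are made up, and the count assumes that a value is not repeated within the same original row.

```python
import pandas as pd

# Data copied from the question.
df1 = pd.DataFrame({
    "Column1": ["a", "a", "b", "c", "b"],
    "Column2": ["1,2,3", "1,4,5", "7,1,5", "8,9", "7,3,5"],
})

# One row per (Column1, single value).
long = df1.assign(Column2=df1["Column2"].str.split(",")).explode("Column2")

# Union of the values per group, sorted as strings.
df1_new = (long.groupby("Column1")["Column2"]
               .agg(lambda s: ",".join(sorted(set(s))))
               .reset_index())

# Number of original rows in each group that contained each value
# (assumes no value is repeated inside a single row).
counts = long.groupby(["Column1", "Column2"]).size()

print(df1_new)
print(counts)
```

Note that the values are sorted as strings here, as in the answers above, so "10" would sort before "2".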
