add/combine columns after searching in a DataFrame - python

I'm trying to copy data from different columns to a particular column in the same DataFrame.
Index  col1A  col2A  colB  list  CT  CW  CH
0  1  :  1  b  2  2  3  3d
But prior to that, I wanted to check whether those columns (col1A, col2A, colB) exist in the DataFrame, group the ones that are present, and move the grouped data to the relevant columns (CT, CH, etc.), like:
Index  CH  CW  CT
0      1       1
1      b       b
2      2       2
3      3d      3d
I did:
col_list1 = ['ColA', 'ColB', 'ColC']
test1 = any([i in df.columns for i in col_list1])
if test1 == True:
    df['CH'] = df['Col1A'] + df['Col2A']
    df['CT'] = df['ColB']
This code throws a KeyError. I want it to ignore the columns that are not present and add only those that are present.

IIUC, you can use a Python set or Index.isin to find the common columns:
cols = list(set(col_list1) & set(df.columns))
# or
cols = df.columns[df.columns.isin(col_list1)]
df['CH'] = df[cols].sum(axis=1)

Instead of just concatenating the columns with +, collect the present columns into a list and sum them element-wise (np.sum stacks the Series as rows, so the element-wise reduction is along axis=0):
import numpy as np
df['CH'] = np.sum([df[c] for c in col_list1 if c in df], axis=0)
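As a runnable illustration of the guard, here is a minimal sketch with a small hypothetical frame (the values are made up, and 'colC' is deliberately absent to show that no KeyError is raised):

```python
import pandas as pd

# Hypothetical data; 'colC' from the wanted list does not exist in the frame.
df = pd.DataFrame({'col1A': [1, 2], 'col2A': [10, 20], 'colB': [5, 6]})
wanted = ['col1A', 'col2A', 'colC']

cols = [c for c in wanted if c in df.columns]  # keep only columns that exist
df['CH'] = df[cols].sum(axis=1)                # no KeyError for 'colC'
print(df['CH'].tolist())  # [11, 22]
```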

Related

Pandas: select column with most unique values

I have a pandas DataFrame and want to select the column with the most unique values.
I already counted the unique values with nunique(). How can I now choose the column with the highest nunique()?
This is my code so far:
numeric_columns = df.select_dtypes(include=(int or float))
unique = []
for column in numeric_columns:
    unique.append(numeric_columns[column].nunique())
I later need to filter all the columns of my DataFrame depending on this column (the one with the most unique values).
Use DataFrame.select_dtypes with np.number, then find the column with the maximal DataFrame.nunique value via Series.idxmax:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4],'b':[1,2,2,2], 'c':list('abcd')})
print (df)
a b c
0 1 1 a
1 2 2 b
2 3 2 c
3 4 2 d
numeric = df.select_dtypes(include = np.number)
nu = numeric.nunique().idxmax()
print (nu)
a
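The downstream filtering step is not specified in the question, so here is a hedged sketch of one possible use: find the most diverse numeric column, then filter rows on it (the median threshold is purely an assumption for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 2, 2], 'c': list('abcd')})

numeric = df.select_dtypes(include=np.number)
best = numeric.nunique().idxmax()  # column with the most unique values: 'a'

# Example downstream use: keep rows where that column exceeds its median.
filtered = df[df[best] > df[best].median()]
print(best, len(filtered))  # a 2
```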

grouping and printing the maximum in a dataframe in python

A dataframe has 3 Columns
A B C
^0hand(%s)leg$ 27;30 42;54
^-(%s)hand0leg 39;30 47;57
^0hand(%s)leg$ 24;33 39;54
Column A holds regex patterns like these. When two patterns are identical (here, rows 1 and 3), the rows have to be merged, keeping only the element-wise maximum of each value pair, as below:
Output:
A B C
^0hand(%s)leg$ 27;33 42;54
^-(%s)hand0leg 39;30 47;57
Any leads will be helpful
You could use:
(df.set_index('A').stack()
 .str.extract(r'(\d+);(\d+)').astype(int)
.groupby(level=[0,1]).agg(max).astype(str)
.assign(s=lambda d: d[0]+';'+d[1])['s'] # OR # .apply(';'.join, axis=1)
.unstack(1)
.loc[df['A'].unique()] ## only if the order of rows matters
.reset_index()
)
output:
A B C
0 ^0hand(%s)leg$ 27;33 42;54
1 ^-(%s)hand0leg 39;30 47;57
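A more explicit (and roughly equivalent) take on the same idea, splitting each 'x;y' cell and taking the per-position maxima within each group; the frame below is rebuilt from the question's sample:

```python
import pandas as pd

df = pd.DataFrame({'A': ['^0hand(%s)leg$', '^-(%s)hand0leg', '^0hand(%s)leg$'],
                   'B': ['27;30', '39;30', '24;33'],
                   'C': ['42;54', '47;57', '39;54']})

def col_max(s):
    # Split 'x;y' into two int columns and take the maximum of each position.
    parts = s.str.split(';', expand=True).astype(int)
    return ';'.join(parts.max().astype(str))

# sort=False keeps the groups in their order of first appearance.
out = df.groupby('A', sort=False)[['B', 'C']].agg(col_max).reset_index()
print(out)
```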

Pandas-iterate through a dataframe column and concatenate corresponding row values that contain a list

I have a DataFrame with column1 containing string values and column2 containing lists of string values.
I want to iterate through column1 and concatenate the column1 values with their corresponding row values into a new DataFrame.
Say, my input is
`dfd = {'TRAINSET':['101','102','103', '104'], 'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}`
after the operation my data will look like this
dfd2 = {'TRAINSET':['101a1','101x1','101b2', '102a1','102b3','102b2','103d3', '103g5','103x2','104x1','104b2', '104a1']}
what i tried is:
dg = pd.concat([g['TRAINSET'].map(g['unique']).apply(pd.Series)], axis = 1)
but I get KeyError: 'TRAINSET', as this is probably not the proper syntax.
Also, I would like to remove the NaN values from the lists.
It is possible to use a list comprehension that flattens the lists (filtering out missing values), joins the values with +, and passes the result to the DataFrame constructor:
#if necessary
#df = df.reset_index()
#flatten values with filter out missing values
L = [(str(a) + x) for a, b in df[['TRAINSET','unique']].values for x in b if pd.notna(x)]
df1 = pd.DataFrame({'TRAINSET': L})
print (df1)
TRAINSET
0 101a1
1 101x1
2 101b2
3 102a1
4 102b3
5 102b2
6 103d3
7 103g5
8 103x2
9 104x1
10 104b2
11 104a1
Or use DataFrame.explode (pandas 0.25+), create a default index, remove missing values with DataFrame.dropna, and join the columns with +, using Series.to_frame for a one-column DataFrame:
df = df.explode('unique').dropna(subset=['unique']).reset_index(drop=True)
df1 = (df['TRAINSET'].astype(str) + df['unique']).to_frame('TRAINSET')
print (df1)
TRAINSET
0 101a1
1 101x1
2 101b2
3 102a1
4 102b3
5 102b2
6 103d3
7 103g5
8 103x2
9 104x1
10 104b2
11 104a1
Coming from your original data, you can do the below using explode (new in pandas 0.25+) and agg:
Input:
dfd = {'TRAINSET':['101','102','103', '104'],
'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}
Solution:
df = pd.DataFrame(dfd)
df.explode('unique').astype(str).agg(''.join,1).to_frame('TRAINSET').to_dict('list')
{'TRAINSET': ['101a1',
'101x1',
'101b2',
'102a1',
'102b3',
'102b2',
'103d3',
'103g5',
'103x2',
'104x1',
'104b2',
'104a1']}
Another solution, just to give you some choice...
import pandas as pd
_dfd = {'TRAINSET':['101','102','103', '104'], 'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}
dfd = pd.DataFrame.from_dict(_dfd)
dfd.set_index("TRAINSET", inplace=True)
print(dfd)
dfd2 = dfd.reset_index()
def refactor(row):
    # Note: 'unique' already holds a list; wrapping it in str() would make
    # the comprehension iterate over characters, so keep it as-is.
    key, l = str(row["TRAINSET"]), row["unique"]
    res = [key + i for i in l]
    return res
dfd2['TRAINSET'] = dfd2.apply(refactor, axis=1)
dfd2.set_index("TRAINSET", inplace=True)
dfd2.drop("unique", inplace=True, axis=1)
print(dfd2)

How to select columns which contain non-duplicate from a pandas data frame

I want to select columns which contain non-duplicate from a pandas data frame and use these columns to make up a subset data frame. For example, I have a data frame like this:
x y z
a 1 2 3
b 1 2 2
c 1 2 3
d 4 2 3
The columns "x" and "z" have non-duplicate values, so I want to pick them out and create a new data frame like:
x z
a 1 3
b 1 2
c 1 3
d 4 3
This can be realized by the following code:
import pandas as pd
df = pd.DataFrame([[1,2,3],[1,2,2],[1,2,3],[4,2,3]], index=['a','b','c','d'], columns=['x','y','z'])
df0 = pd.DataFrame()
for i in range(df.shape[1]):
    if df.iloc[:, i].nunique() > 1:
        df1 = df.iloc[:, i].T
        df0 = pd.concat([df0, df1], axis=1, sort=False)
However, there must be more simple and direct methods. What are they?
Best regards
Maybe you can try this one-liner:
df[df.columns[(df.nunique() != 1).values]]
Apply nunique, then remove columns where nunique is 1:
nunique = df.apply(pd.Series.nunique)
cols_to_drop = nunique[nunique == 1].index
df = df.drop(cols_to_drop, axis=1)
df = df[df.columns[df.nunique() > 1]]
assuming that columns whose values all repeat give nunique == 1, while the others give more than 1. df.columns[df.nunique() > 1] gives all the column names that fulfill the purpose.
A simple one-liner:
df0 = df.loc[:, (df.max() - df.min()) != 0]
or, even better:
df0 = df.loc[:, df.max() != df.min()]
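All of the variants above reduce to the same test; here is a runnable sketch on the question's data (note the max/min trick assumes orderable, e.g. numeric, columns, while nunique works for any dtype):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 2, 2], [1, 2, 3], [4, 2, 3]],
                  index=list('abcd'), columns=['x', 'y', 'z'])

# Keep only columns with more than one distinct value.
df0 = df.loc[:, df.nunique() > 1]
print(list(df0.columns))  # ['x', 'z']
```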

drop duplicate data in a data frame under special condition using pandas (python)

I have the following data frame:
I want to remove duplicate data in the WD column when rows share the same drug_id.
For example, there are two "crying" entries in the WD column with the same drug_id = 32, so I want to remove one of the rows that has "crying".
How can I do that? I know how to drop duplicate rows, but I do not know how to add this condition to this code:
df = df.apply(lambda x: x.drop_duplicates())
You can use drop_duplicates with subset parameter which optionally considers certain columns for duplicates:
df.drop_duplicates(subset = ["drug_id", "WD"])
If the upper/lower cases are important for considering duplicates, you could try:
df[~df[['drug_id', 'WD']].apply(lambda x: x.str.lower()).duplicated()]
Where you can convert both drug_id and WD columns to lower case, use duplicated() method to identify duplicated rows and then use the generated logical series to filter out duplicated rows.
Example:
df = pd.DataFrame({"A": [1,1,2,2], "B":[1,2,3,4], "C":[1,1,2,3]})
df
# A B C
#0 1 1 1
#1 1 2 1
#2 2 3 2
#3 2 4 3
df.drop_duplicates(subset=['A', 'C'])
# A B C
#0 1 1 1
#2 2 3 2
#3 2 4 3
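The question's frame isn't reproduced above, so here is a small hypothetical stand-in demonstrating the case-insensitive variant, lowercasing only the text column (so it also works when drug_id is numeric):

```python
import pandas as pd

# Hypothetical stand-in for the question's data.
df = pd.DataFrame({'drug_id': [32, 32, 40],
                   'WD': ['crying', 'Crying', 'nausea']})

# Case-insensitive de-duplication on the (drug_id, WD) pair:
# lowercase a copy of WD, mark duplicates, keep the rest.
mask = ~df.assign(WD=df['WD'].str.lower()).duplicated(subset=['drug_id', 'WD'])
print(df[mask])
```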
