I have a list of values, some of which are empty (0 or ''). Each non-empty value should become a filter condition on the column at the same position; if the value at a position is empty, no filter should be applied for that column.
# df.columns = ['a','b','c','d','e']
# Case 1
l = ['A', 'B', '','' , 123]
## DESIRED FILTERING
df[ (df.a=='A') & (df.b=='B') & (df.e == 123)]
# Case 2
l = ['z','' ,'' ,'', 123]
## DESIRED FILTERING
df[ (df.a=='z') & (df.e == 123) ]
This is my attempt, but it fails because (df.col_name == 'something') returns a Series, so filt ends up as a list of Series rather than a single boolean mask.
#Case 1 for example
check_null = [ i!='' for i in l ] # -> [True, True, False, False, True]
conditions = [ (df.a==l[0]),(df.b==l[1]),(df.c==l[2]), (df.d==l[3]), (df.e==l[4])]
filt = [conditions[i] for i in range(len(check_null)) if check_null[i]]
df[filt]
How can I get this to work?
Build a dictionary of the non-empty values, convert it to a Series, and use it for filtering with boolean indexing:
df = pd.DataFrame(columns = ['a','b','c','d','e'])
df.loc[0] = ['A', 'B','g' ,'h' , 123]
df.loc[1] = ['A', 'B','g' ,'h' , 52]
l = ['A', 'B','' ,'' , 123]
s = pd.Series(dict(zip(df.columns, l))).loc[lambda x: x != '']
# assign to a new name so df stays intact for the next example
out = df[df[s.index].eq(s).all(axis=1)]
print (out)
a b c d e
0 A B g h 123
l = ['A', 'B', '','', '']
s = pd.Series(dict(zip(df.columns, l))).loc[lambda x: x != '']
out = df[df[s.index].eq(s).all(axis=1)]
print (out)
a b c d e
0 A B g h 123
1 A B g h 52
You can use a Series for comparison.
Either ensure that the value matches the Series (df.eq(s)), or (|) that the Series contains an empty string (s.eq('')). Broadcasting magic will do the rest ;)
s = pd.Series(l, index=df.columns)
df2 = df[(df.eq(s)|s.eq('')).all(1)]
Example with ['A', 'B', '', '', 123]:
# input
a b c d e
0 A B C D 123
1 A X C D 456
# output
a b c d e
0 A B C D 123
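The list-of-conditions idea from the question can also be made to work by reducing the individual masks into one. The following is a minimal sketch, assuming the empty marker is '' and that the values are given in column order; filter_non_empty is a hypothetical helper name, not part of pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["A", "B", "C", "D", 123],
     ["A", "X", "C", "D", 456],
     ["z", "B", "C", "D", 123]],
    columns=["a", "b", "c", "d", "e"],
)

def filter_non_empty(df, values, empty=""):
    # Hypothetical helper: build one boolean mask per non-empty value
    masks = [df[col] == val
             for col, val in zip(df.columns, values)
             if val != empty]
    if not masks:
        # Nothing to filter on, so return the frame unchanged
        return df
    # AND all masks together into a single boolean array
    return df[np.logical_and.reduce(masks)]

case1 = filter_non_empty(df, ["A", "B", "", "", 123])
case2 = filter_non_empty(df, ["z", "", "", "", 123])
print(case1)
print(case2)
```

Compared with the Series-based answers, this keeps the conditions explicit, which may be easier to extend if some columns later need comparisons other than equality.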
I have two lists to start with:
delta = ['1','5']
taxa = ['2','3','4']
My dataframe will look like :
data = { 'id': [101,102,103,104,105],
'1_srcA': ['a', 'b','c', 'd', 'g'] ,
'1_srcB': ['a', 'b','c', 'd', 'e'] ,
'2_srcA': ['g', 'b','f', 'd', 'e'] ,
'2_srcB': ['a', 'b','c', 'd', 'e'] ,
'3_srcA': ['a', 'b','c', 'd', 'e'] ,
'3_srcB': ['a', 'b','1', 'd', 'm'] ,
'4_srcA': ['a', 'b','c', 'd', 'e'] ,
'4_srcB': ['a', 'b','c', 'd', 'e'] ,
'5_srcA': ['a', 'b','c', 'd', 'e'] ,
'5_srcB': ['m', 'b','c', 'd', 'e'] }
df = pd.DataFrame(data)
df
I have to do two types of checks on this dataframe. Say, Delta check and Taxa checks.
For the delta checks, based on the list delta = ['1','5'], I have to compare 1_srcA vs 1_srcB and 5_srcA vs 5_srcB, since '1' appears in 1_srcA and 1_srcB and '5' appears in 5_srcA and 5_srcB. If the values differ, I have to populate 2. For the taxa checks (based on the values in the taxa list), a difference should populate 1. If there is no difference, the result is 0.
This comparison has to happen on all rows. df is generated by merging two dataframes, so there will be exactly two columns containing '1', two columns containing '2', and so on.
Conditions I have to check:
1. Check whether the columns matching values from the delta list differ. If yes, populate 2.
2. Check whether the columns matching values from the taxa list differ. If yes, populate 1.
3. If both condition 1 and condition 2 are satisfied, populate 2.
4. If none of the conditions is satisfied, populate 0.
So, my output should look like:
The code I tried:
df_cols_ = df.columns.tolist()[1:]
conditions = []
res = {}
for i,col in enumerate(df_cols_):
    if (i == 0) or (i%2 == 0) :
        continue
    var = 'cond_'+str(i)
    for del_col in delta:
        if del_col in col:
            var = var + '_F'
            break
    print (var)
    cond = f"df.iloc[:, {i}] != df.iloc[:, {i+1}]"
    res[var] = cond
    conditions.append(cond)
The res dict will look like the below. But how can I use the conditions to populate the result column?
Is there a better way to derive the resulting dataframe? Thanks.
Create a helper function that selects the matching columns with DataFrame.filter and compares them for inequality, then use np.logical_or.reduce to combine each list of boolean masks into a single mask, and pass the two masks to numpy.select:
delta = ['1','5']
taxa = ['2','3','4']
def f(x):
    df1 = df.filter(like=x)
    return df1.iloc[:, 0].ne(df1.iloc[:, 1])
d = np.logical_or.reduce([f(x) for x in delta])
print (d)
[ True False False False True]
t = np.logical_or.reduce([f(x) for x in taxa])
print (t)
[ True False True False True]
df['res'] = np.select([d, t], [2, 1], default=0)
print (df)
id 1_srcA 1_srcB 2_srcA 2_srcB 3_srcA 3_srcB 4_srcA 4_srcB 5_srcA 5_srcB \
0 101 a a g a a a a a a m
1 102 b b b b b b b b b b
2 103 c c f c c 1 c c c c
3 104 d d d d d d d d d d
4 105 g e e e e m e e e e
res
0 2
1 0
2 1
3 0
4 2
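One thing to watch with filter(like=x) is that it matches the substring anywhere in the column name, so '1' would also select a column such as 11_srcA if one existed. Below is a minimal sketch that anchors the match to the prefix with a regular expression; the 11_ columns and the helper name differs are assumptions made up for illustration. Because np.select takes the first matching condition, putting the delta mask first gives it priority when both checks fire.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the 11_* columns are hypothetical
df = pd.DataFrame({
    "id": [101, 102, 103],
    "1_srcA": ["a", "b", "c"], "1_srcB": ["a", "b", "x"],
    "2_srcA": ["a", "q", "c"], "2_srcB": ["a", "b", "c"],
    "11_srcA": ["a", "b", "c"], "11_srcB": ["z", "b", "c"],
})
delta = ["1"]
taxa = ["2", "11"]

def differs(prefix):
    # Select only columns whose name starts with "<prefix>_"
    pair = df.filter(regex=f"^{prefix}_")
    return pair.iloc[:, 0].ne(pair.iloc[:, 1])

d = np.logical_or.reduce([differs(p) for p in delta])
t = np.logical_or.reduce([differs(p) for p in taxa])

# The first matching condition wins, so delta (2) takes priority over taxa (1)
df["res"] = np.select([d, t], [2, 1], default=0)
print(df[["id", "res"]])
```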
I have a dataframe with words as index and a corresponding sentiment score in another column. Then, I have another dataframe which has one column with list of words (token list) with multiple rows. So each row will have a column with different lists. I want to find the average of sentiment score for a particular list. This has to be done for a huge number of rows, and hence efficiency is important.
One method I have in mind is given below:
import pandas as pd
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
'''
df
tokens
0 [a, b, c]
1 [hi, this, is, a, sample]
'''
def find_score(tokenlist, ref_df):
    # ref_df contains two cols, 'tokens' and 'sentiment_score'
    temp_df = pd.DataFrame()
    temp_df['tokens'] = tokenlist
    return temp_df.merge(ref_df, on='tokens', how='inner')['sentiment_score'].mean(axis=0)
# this should return score
# Series.apply takes no axis argument, and args must be a tuple
df['score'] = df['tokens'].apply(find_score, args=(ref_df,))
# each input for find_score will be a list
Is there any more efficient way to do it without creating dataframe for each list?
You can create a dictionary for mapping from the reference dataframe ref_df and then use .map() on each token list on each row of dataframe df, as follows:
ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict.keys()]))
Demo
Test Data Construction
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
ref_df = pd.DataFrame({'tokens': ['a', 'b', 'c', 'd', 'hi', 'this', 'is', 'sample', 'example'],
'sentiment_score': [1, 2, 3, 4, 11, 12, 13, 14, 15]})
print(df)
tokens
0 [a, b, c]
1 [hi, this, is, a, sample]
print(ref_df)
tokens sentiment_score
0 a 1
1 b 2
2 c 3
3 d 4
4 hi 11
5 this 12
6 is 13
7 sample 14
8 example 15
Run New Code
ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict.keys()]))
Output
print(df)
tokens score
0 [a, b, c] 2.0
1 [hi, this, is, a, sample] 10.2
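If a token list contains no known words at all, taking the mean of an empty list yields NaN together with a runtime warning. A minimal sketch of the same dictionary lookup with an explicit fallback follows; mean_score and its default parameter are hypothetical names, and the test data is made up for illustration:

```python
import numpy as np
import pandas as pd

ref_df = pd.DataFrame({"tokens": ["a", "b", "c"],
                       "sentiment_score": [1, 2, 3]})
df = pd.DataFrame({"tokens": [["a", "b", "c"], ["a", "zzz"], ["zzz"]]})

# Plain dict lookup avoids building a DataFrame per row
ref_dict = dict(zip(ref_df["tokens"], ref_df["sentiment_score"]))

def mean_score(tokens, default=np.nan):
    # Keep only tokens that have a score; unknown tokens are skipped
    scores = [ref_dict[t] for t in tokens if t in ref_dict]
    # Fall back to `default` when no token is known
    return sum(scores) / len(scores) if scores else default

df["score"] = df["tokens"].map(mean_score)
print(df)
```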
Let's try explode, merge, and agg:
import pandas as pd
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
ref_df = pd.DataFrame({'sentiment_score': {'a': 1, 'b': 2,
'c': 3, 'hi': 4,
'this': 5, 'is': 6,
'sample': 7}})
# Explode Tokens into rows (Preserve original index)
new_df = df.explode('tokens').reset_index()
# Merge sentiment_scores
new_df = new_df.merge(ref_df, left_on='tokens',
right_index=True,
how='inner')
# Group By Original Index and agg back to lists and take mean
new_df = new_df.groupby('index') \
.agg({'tokens': list, 'sentiment_score': 'mean'}) \
.reset_index(drop=True)
print(new_df)
Output:
tokens sentiment_score
0 [a, b, c] 2.0
1 [a, hi, this, is, sample] 4.6
After Explode:
index tokens
0 0 a
1 0 b
2 0 c
3 1 hi
4 1 this
5 1 is
6 1 a
7 1 sample
After Merge
index tokens sentiment_score
0 0 a 1
1 1 a 1
2 0 b 2
3 0 c 3
4 1 hi 4
5 1 this 5
6 1 is 6
7 1 sample 7
(The one-liner)
new_df = df.explode('tokens') \
.reset_index() \
.merge(ref_df, left_on='tokens',
right_index=True,
how='inner') \
.groupby('index') \
.agg({'tokens': list, 'sentiment_score': 'mean'}) \
.reset_index(drop=True)
If the order of the tokens in the list matters, the scores can be calculated and merged back to the original df instead of using list aggregation:
mean_scores = df.explode('tokens') \
.reset_index() \
.merge(ref_df, left_on='tokens',
right_index=True,
how='inner') \
.groupby('index')[['sentiment_score']].mean() \
.reset_index(drop=True)
new_df = df.merge(mean_scores,
left_index=True,
right_index=True)
print(new_df)
Output:
tokens sentiment_score
0 [a, b, c] 2.0
1 [hi, this, is, a, sample] 4.6
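The explode idea can also be combined with a lookup instead of a merge, which leaves the original token lists untouched. This is a minimal sketch, assuming the scores are available as a Series indexed by word (here called scores, a name introduced for illustration):

```python
import pandas as pd

df = pd.DataFrame({"tokens": [["a", "b", "c"],
                              ["hi", "this", "is", "a", "sample"]]})
# Assumed lookup table: word -> sentiment score
scores = pd.Series({"a": 1, "b": 2, "c": 3, "hi": 4,
                    "this": 5, "is": 6, "sample": 7})

# explode() repeats the original row index for every token, so
# grouping on index level 0 brings the scores back to one value per row
df["score"] = (df["tokens"].explode()
                           .map(scores)
                           .groupby(level=0)
                           .mean())
print(df)
```

Tokens missing from scores map to NaN, which mean() skips by default, and the assignment aligns on the original index.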
I have to replace values from one dataframe with values from another dataframe.
The example below works, but it takes extra steps to replace the values in the "first" column with the values from the "new" column and then drop the "new" column.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([['A', 'X'],
...: ['B', 'X'],
...: ['C', 'X'],
...: ['A', 'Y'],
...: ['B', 'Y'],
...: ['C', 'Y'],
...: ], columns=['first', 'second'])
In [3]: df
Out[3]:
first second
0 A X
1 B X
2 C X
3 A Y
4 B Y
5 C Y
In [4]: df_tt = pd.DataFrame([['A', 'E'],
...: ['B', 'F'],
...: ], columns=['orig', 'new'])
In [5]: df_tt
Out[5]:
orig new
0 A E
1 B F
In [6]: df = df.merge(df_tt, left_on='first', right_on='orig')
In [7]: df
Out[7]:
first second orig new
0 A X A E
1 A Y A E
2 B X B F
3 B Y B F
In [8]: df['first'] = df['new']
In [9]: df
Out[9]:
first second orig new
0 E X A E
1 E Y A E
2 F X B F
3 F Y B F
In [10]: df.drop(columns=['orig', 'new'])
Out[10]:
first second
0 E X
1 E Y
2 F X
3 F Y
I would like to replace values with no extra steps.
Another solution is using replace:
# Restrict to common entries
df = df[df['first'].isin(df_tt['orig'])]
# Use df_tt as a mapping to replace values in df
df['first'] = df['first'].replace(df_tt.set_index('orig').to_dict()['new'])
This solution is very similar to @jezrael's, but I like the idea of explicitly using replace, because this is actually what you are doing: replacing values in one dataframe based on another dataframe.
Use isin for filtering with boolean indexing and then map:
df = (df[df['first'].isin(df_tt['orig'])]
.assign(first=lambda x: x['first'].map(df_tt.set_index('orig')['new'])))
print (df)
first second
0 E X
1 F X
3 E Y
4 F Y
Alternative:
df = df[df['first'].isin(df_tt['orig'])]
df['first'] = df['first'].map(df_tt.set_index('orig')['new'])
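The choice between the two approaches matters for rows without a mapping: map turns unmatched values into NaN, which is why the rows are filtered with isin first. A minimal sketch of both behaviours follows, assuming one might instead want to keep unmatched rows unchanged (the question itself drops them); the names mapping, kept and dropped are introduced for illustration:

```python
import pandas as pd

df = pd.DataFrame([["A", "X"], ["B", "X"], ["C", "X"],
                   ["A", "Y"], ["B", "Y"], ["C", "Y"]],
                  columns=["first", "second"])
df_tt = pd.DataFrame([["A", "E"], ["B", "F"]], columns=["orig", "new"])

# Lookup Series: orig -> new
mapping = df_tt.set_index("orig")["new"]

# Variant 1: drop rows that have no mapping (as in the question)
dropped = (df[df["first"].isin(mapping.index)]
           .assign(first=lambda x: x["first"].map(mapping)))

# Variant 2: keep unmatched rows, leaving their value as it was
kept = df.assign(first=df["first"].map(mapping).fillna(df["first"]))

print(dropped)
print(kept)
```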
I have a dataframe and want to eliminate duplicate rows that have the same values, but in different columns:
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
df
Out[8]:
a b c d
1 x y e f
2 e f x y
3 w v s t
Rows [1],[2] have the values {x,y,e,f}, but they are arranged in a cross - i.e. if you would exchange columns c,d with a,b in row [2] you would have a duplicate.
I want to drop these lines and only keep one, to have the final output:
df_new
Out[20]:
a b c d
1 x y e f
3 w v s t
How can I efficiently achieve that?
I think you need to filter by boolean indexing with a mask created by numpy.sort and duplicated; to invert the mask, use ~:
df = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
print (df)
a b c d
1 x y e f
3 w v s t
Detail:
print (np.sort(df, axis=1))
[['e' 'f' 'x' 'y']
['e' 'f' 'x' 'y']
['s' 't' 'v' 'w']]
print (pd.DataFrame(np.sort(df, axis=1), index=df.index))
0 1 2 3
1 e f x y
2 e f x y
3 s t v w
print (pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1 False
2 True
3 False
dtype: bool
print (~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1 True
2 False
3 True
dtype: bool
Here's another solution, with a for loop:
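# as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
data = df.to_numpy()
new = []
for row in data:
    if not new:
        new.append(row)
    else:
        # keep the row only if none of its values appears in a kept row
        if not any([c in nrow for nrow in new for c in row]):
            new.append(row)
new_df = pd.DataFrame(new, columns=df.columns)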
Sort each row (np.sort) and then find the duplicates (.duplicated()) among the sorted rows.
Then use those duplicates to drop (df.drop) the corresponding index labels:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
# Rows whose sorted values repeat an earlier row
df_duplicated = pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()
# Index labels of those rows
index_to_drop = df.index[df_duplicated]
df.drop(index_to_drop)
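The sort-then-duplicated idea can be wrapped in a small function, which also makes it easy to choose which of the permuted rows survives. A minimal sketch, where drop_permuted_duplicates is a hypothetical helper name and keep is passed straight through to DataFrame.duplicated:

```python
import numpy as np
import pandas as pd

def drop_permuted_duplicates(df, keep="first"):
    # Sort the values within each row so permutations become identical
    sorted_rows = pd.DataFrame(np.sort(df.to_numpy(), axis=1),
                               index=df.index)
    # duplicated() flags repeats; ~ keeps everything else
    return df[~sorted_rows.duplicated(keep=keep)]

df = pd.DataFrame([list("xyef"), list("efxy"), list("wvst")],
                  columns=list("abcd"), index=["1", "2", "3"])

first = drop_permuted_duplicates(df)
last = drop_permuted_duplicates(df, keep="last")
print(first)
print(last)
```

Note that this treats a row as a duplicate only when it holds exactly the same values as another row, unlike the loop-based answer, which skips a row as soon as any of its values appears in a kept row.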