How to discard multiple elements from a set? - python

I am trying to discard elements with length less than 10, but it doesn't work.
a = {'ab', 'z x c v b n m k l j h g f f d s a a', 'q w e r t y u i o p'}
a.discard(x for x in a if len(x.split())<9) # discard elements with length<10
print(a)
I got this output:
{'z x c v b n m k l j h g f f d s a a', 'q w e r t y u i o p', 'ab'}
'ab' doesn't match the condition, I don't know why it's still here?
And my desired output is:
{'z x c v b n m k l j h g f f d s a a', 'q w e r t y u i o p'}

You need to call discard on individual items, not on the generator of items to be discarded:
for x in [x for x in a if len(x.split()) < 9]:
    a.discard(x)
Be mindful that you can't discard items while iterating over the same set. Doing so raises RuntimeError ("Set changed size during iteration"), so this will not work:
for x in a:
    if len(x.split()) < 9:
        a.discard(x)
Although this is beyond your question, I'd like to add that there are better ways to do what you want through set comprehension or set subtraction as suggested in another answer and comments.
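As a minimal sketch of the point above (assuming CPython 3, where changing a set's size during iteration raises RuntimeError), the failing loop can be made safe by iterating over a snapshot of the set. The names broken and raised are only illustrative.

```python
a = {'ab', 'z x c v b n m k l j h g f f d s a a', 'q w e r t y u i o p'}

# Removing from a set while iterating over that same set fails once the
# set's size has changed.
broken = set(a)
try:
    for x in broken:
        if len(x.split()) < 9:
            broken.discard(x)
    raised = False
except RuntimeError:
    raised = True

# Iterating over a snapshot (a list copy) leaves the set free to change.
for x in list(a):
    if len(x.split()) < 9:
        a.discard(x)

print(raised)
print(a)
```

The list copy costs one extra pass over the set, which is negligible for small sets like this one.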

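You are calling discard with the wrong kind of argument. discard removes a single element from the set, and only if it is present. Passing a generator expression asks it to remove the generator object itself, which is not in the set, so nothing happens and no error is raised. To remove elements based on a condition, you need a different approach. Here's a corrected version of the code: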
a = {'ab', 'z x c v b n m k l j h g f f d s a a', 'q w e r t y u i o p'}
a = {x for x in a if len(x.split()) >= 9}
print(a)
This code creates a new set with only the elements that meet the condition, and then assigns it back to a. The desired output is achieved:
{'z x c v b n m k l j h g f f d s a a', 'q w e r t y u i o p'}
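One design point worth noting: the comprehension builds a new set and rebinds the name a, so any other reference to the original set is left unchanged. If the set has to be modified in place, a sketch using the built-in difference_update method could look like this (the names alias and too_short are only illustrative):

```python
a = {'ab', 'z x c v b n m k l j h g f f d s a a', 'q w e r t y u i o p'}
alias = a  # a second reference to the same set object

# Collect the elements to drop first, then remove them all in one call.
too_short = {x for x in a if len(x.split()) < 9}
a.difference_update(too_short)  # equivalent to: a -= too_short

print(alias is a)  # the same object was modified, not replaced
print(sorted(a))
```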

Related

Is it possible to split a column value and add a new column at the same time for dataframe?

I have a dataframe with some columns delimited with '|', and I need to flatten this dataframe. Example:
name type
a l
b m
c|d|e n
For this df, I want to flatten it to:
name type
a l
b m
c n
d n
e n
To do this, I used this command:
df = df.assign(name=df.name.str.split('|')).explode('name').drop_duplicates()
Now, I want to do one more thing besides the flatten operation above:
name type co_occur
a l
b m
c n d
c n e
d n e
That is, not only split 'c|d|e' into separate rows, but also create a new column which contains a 'co_occur' relationship, in which 'c', 'd' and 'e' co-occur with each other.
I don't see an easy way to do this by modifying:
df = df.assign(name=df.name.str.split('|')).explode('name').drop_duplicates()
I think this is what you want. Use itertools.combinations to build the pairs, then piece everything together:
from itertools import combinations
import io
import pandas as pd
data = '''name type
a l
b m
c|d|e n
j|k o
f|g|h|i p
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+', engine='python')
# hold the new dataframes as you iterate via apply()
df_hold = []
def explode_combos(x):
    # every unordered pair of names found in this row
    combos = list(combinations(x['name'].split('|'), 2))
    df_hold.append(pd.DataFrame([{'name': c[0], 'type': x['type'], 'co_cur': c[1]} for c in combos]))
    return
# only apply() to those rows that need to be exploded
dft = df[df['name'].str.contains(r'\|')].apply(explode_combos, axis=1)
# concatenate the result
dfn = pd.concat(df_hold)
# add back to rows that weren't operated on (see the ~)
df_final = pd.concat([df[~df['name'].str.contains(r'\|')], dfn]).fillna('')
  name type co_cur
0    a    l
1    b    m
0    c    n      d
1    c    n      e
2    d    n      e
0    j    o      k
0    f    p      g
1    f    p      h
2    f    p      i
3    g    p      h
4    g    p      i
5    h    p      i
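The core of the approach, pairing each name with every name that follows it in the same cell, can be checked without pandas. The following is a minimal sketch using only itertools.combinations; the helper name co_occurrence_rows and the tuple layout are assumptions made for illustration, not part of the answer above.

```python
from itertools import combinations

def co_occurrence_rows(name, type_):
    """Return (name, type, co_occur) tuples for one 'name' cell."""
    parts = name.split('|')
    if len(parts) == 1:
        # Nothing to pair with: keep the row and leave co_occur empty.
        return [(name, type_, '')]
    # combinations() yields each unordered pair once, in input order.
    return [(left, type_, right) for left, right in combinations(parts, 2)]

rows = []
for name, type_ in [('a', 'l'), ('b', 'm'), ('c|d|e', 'n')]:
    rows.extend(co_occurrence_rows(name, type_))

for row in rows:
    print(row)
```

A list of tuples like this could then be handed to pd.DataFrame(rows, columns=['name', 'type', 'co_occur']), which avoids the module-level df_hold list.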

Extracting multiple words from pandas dataframe column into same column

Suppose a dataframe consists of two columns A={1,2,3} B={'a b c d', 'e f g h', 'i j k l'}. For A = 2, I would like to change the corresponding entry in column B to 'e f h' (i.e. keep the first, second and last words, which is not the same as simply dropping the third word).
It is easy to extract single words using df.loc[df['colA']==2,'colB'].str.split().str[x], where x = 0, 1 and -1, but I'm having difficulty joining the three words back into one string efficiently. The most efficient way I can think of is provided below. Is there a better way of achieving what I'm trying to do? Thanks.
y = lambda x : df.loc[df['colA']==2,'colB'].str.split().str[x]
df.loc[df['colA']==2,'colB'] = y(0) + ' ' + y(1) + ' ' + y(-1)
Expected and actual result:
A B
1 a b c d
2 e f h
3 i j k l
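You were pretty close to the solution. The only problem is that str[x] returns a value wrapped in a Series object. You could fix this by extracting the value from the Series as shown. Note that .values[0] takes the first matching row only, so this assumes exactly one row has colA equal to 2: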
y = lambda x : df.loc[df['colA']==2,'colB'].str.split().str[x].values[0]
df.loc[df['colA']==2,'colB'] = y(0) + ' ' + y(1) + ' ' + y(-1)
You can also achieve the same by making use of the apply function
df.loc[df['colA']==2, 'colB'] = df.loc[df['colA']==2,'colB'].apply(lambda x: ' '.join(x.split()[0:2] + [x.split()[-1]]))
How about this:
import pandas as pd
df = pd.DataFrame(data = {'A': [1,2,3],
                          'B': ['a b c d', 'e f g h', 'i j k l']})
# character slices: this works here because every word is a single character
y = lambda x : df.loc[df['A']==2,'B'].str[0:2*x+2] + df.loc[df['A']==2,'B'].str[-1]
df.loc[df['A']==2,'B'] = y(1)
Then df is the desired result:
A B
0 1 a b c d
1 2 e f h
2 3 i j k l
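Because the last answer slices characters rather than words, it relies on every word being a single character. A word-based helper avoids that assumption. The sketch below is plain Python; the function name first_two_and_last is hypothetical, and it could be passed to Series.apply in place of the lambda used in the apply-based answer.

```python
def first_two_and_last(text):
    """Keep the first, second and last whitespace-separated words."""
    words = text.split()
    if len(words) <= 3:
        # Too short to drop anything; return the words unchanged.
        return ' '.join(words)
    return ' '.join(words[:2] + words[-1:])

print(first_two_and_last('e f g h'))
print(first_two_and_last('alpha beta gamma delta'))
```

The length check matters for short strings: without it, a two-word input would have its last word repeated.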

Count number of substrings found in string

I've got the following code, which checks whether any of the strings in b are present in a. This works fine.
a = "a b c d c"
b = ["a", "c", "e"]
if any(x in a for x in b):
    print(True)
else:
    print(False)
I would like to modify it to tell me how many of the strings in b were found in a, which in this case is 2 (a and c). Although c is found twice, it shouldn't make a difference.
How can I do this?
Just change any to sum:
print(sum(x in a for x in b)) # prints 2
Here's how it works:
>>> [x in a for x in b]
[True, True, False]
>>> t = [x in a for x in b]
>>> sum(t) # sum() is summing the True values here
2
This can be done with sum(map(lambda x: 1 if x in a else 0, b)) or sum([1 if x in a else 0 for x in b]).
This will do what you want:
def anycount(it):
    return len([e for e in it if e])
a = "a b c d c"
b = ["a", "c", "e"]
print(anycount(x in a for x in b))
2
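Note that x in a is a substring test on the string a, so a needle would also match inside a longer word. If whole-word matches are what is wanted (an assumption; the question does not say), a sketch using set intersection is shown below. The names substring_count and word_count are only illustrative.

```python
a = "a b c d c"
b = ["a", "c", "e"]

# Substring test, as in the answers above: counts needles found anywhere in a.
substring_count = sum(x in a for x in b)

# Whole-word test: split a into words and intersect with the needles.
# Using sets also means duplicates in either input are counted once.
word_count = len(set(a.split()) & set(b))

print(substring_count, word_count)

# The two differ when a needle appears only inside a longer word.
inside_substring = sum(x in "banana" for x in ["a", "na"])
inside_word = len(set("banana".split()) & {"a", "na"})
print(inside_substring, inside_word)
```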

How to merge columns containing list values when some column list values are empty?

I want to merge columns from data frames in such a way:
list_A      list_B
A, T, G     G, C
            B, K
C, L, AG    L, AG, K
F, K
The output should be:
list_A      list_B      list_A&B
A, T, G     G, C        A, T, G, C
            B, K        B, K
C, L, AG    L, AG, K    C, L, AG, K
F, K                    F, K
I did:
df['list_A&B'] = df['list_A'].astype(list) + ', ' + df['list_B'].astype(list)
And, I am getting:
   list_A      list_B      list_A&B
0  A, T, G     G, C        A, T, G, G, C
1  NaN         B, K        NaN
2  C, L, AG    L, AG, K    C, L, AG, L, AG, K
3  F, K        NaN         NaN
Here the union of the lists fails when one of the lists is empty. But why?
I then tried combining them as strings, but now the added nan cannot be dropped:
df['list_A&B'] = df['list_A'].astype(str) + ', ' + df['list_B'].astype(str)
which gives:
   list_A      list_B      list_A&B
0  A, T, G     G, C        A, T, G, G, C
1  NaN         B, K        nan, B, K
2  C, L, AG    L, AG, K    C, L, AG, L, AG, K
3  F, K        NaN         F, K, nan
With this output, I am having difficulty removing 'nan', since these are now ordinary strings and dropna() and fillna() don't work on them.
Any suggestions?
- K
Looks like those are simply string values going by my interpretation of the data you've provided so far.
Steps:
Concatenate them row-wise using str.cat with sep=',' and na_rep='', which takes care of the NaN values interspersed with the other strings by treating each as an empty string.
Then remove the unwanted whitespace, split on ',', take the unique elements using set, and eliminate empty strings using filter(None, ...).
Finally, join the resulting list to make it a string representation of the list, excluding the brackets.
import re
df['list_A&B'] = df['list_A'].str.cat(df['list_B'], sep=',', na_rep='') \
    .apply(lambda x: ', '.join(filter(None, set(re.sub(r"\s+", "", x).split(',')))))
df # df.fillna('') to get back your desired output
Starting DF used:
import numpy as np
import pandas as pd
df = pd.DataFrame({'list_A': ['A, T, G', np.nan, 'C, L, AG ', 'F, K'],
                   'list_B': ['G, C', 'B, K', 'L, AG, K', np.nan]})
df
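Since set does not preserve order, the merged items may come out in a different order from the desired output. A sketch of an order-preserving alternative, using dict.fromkeys to drop duplicates, is given below. The helper name merge_lists is hypothetical, and treating any non-string value (such as NaN) as an empty cell is an assumption.

```python
def merge_lists(*cells):
    """Merge comma-separated strings, dropping duplicates but keeping order."""
    items = []
    for cell in cells:
        if not isinstance(cell, str):
            # NaN / None: treat a missing cell as an empty list.
            continue
        items.extend(part.strip() for part in cell.split(','))
    # dict keys keep insertion order (Python 3.7+), so this de-duplicates
    # without reordering; empty strings are filtered out.
    return ', '.join(dict.fromkeys(item for item in items if item))

nan = float('nan')
print(merge_lists('A, T, G', 'G, C'))
print(merge_lists(nan, 'B, K'))
print(merge_lists('C, L, AG ', 'L, AG, K'))
print(merge_lists('F, K', nan))
```

With a DataFrame, this could be applied row-wise, for example df.apply(lambda r: merge_lists(r['list_A'], r['list_B']), axis=1).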

conditional column output for pandas dataframe

I have a pandas DataFrame looking like this:
nameA statusA nameB statusB
a Q x X
b Q y X
c X z Q
d X o Q
e Q p X
f Q r Q
I want to print the rows of this dataframe based on the following rule: output column nameA if statusA is Q, otherwise output column nameB if statusB is Q. If statusA and statusB are both Q, both columns nameA and nameB should be output.
Is there a one-liner for this?
UPDATE:
expected output:
a,Q
b,Q
z,Q
o,Q
e,Q
f,Q,r,Q
> data['con'] = data['statusA'] + data['statusB']
> data.apply(lambda v: v['nameA'] if v['con'] == 'QX' else v['nameB'] if v['con'] == 'XQ' else v['nameA']+ ','+ v['nameB'], axis=1)
0 a
1 b
2 z
3 o
4 e
5 f,r
dtype: object
You can use string concatenation to produce the exact result.
Something like:
> data.apply(lambda v: v['nameA']+',Q' if v['con'] == 'QX' else v['nameB'] + ',Q' if v['con'] == 'XQ' else v['nameA']+ ',Q,' + v['nameB'] + ',Q', axis=1)
0 a,Q
1 b,Q
2 z,Q
3 o,Q
4 e,Q
5 f,Q,r,Q
dtype: object
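The lambda above assumes each status is either Q or X, and it would also emit both names for a row where neither status is Q. A sketch that tests each status directly is shown below; format_row is a hypothetical helper that accepts any mapping, so it can be tried on plain dicts or passed to DataFrame.apply with axis=1.

```python
def format_row(row):
    """Return 'name,status' for every side of the row whose status is Q."""
    parts = []
    for name_col, status_col in (('nameA', 'statusA'), ('nameB', 'statusB')):
        if row[status_col] == 'Q':
            parts.extend([row[name_col], row[status_col]])
    return ','.join(parts)

rows = [
    {'nameA': 'a', 'statusA': 'Q', 'nameB': 'x', 'statusB': 'X'},
    {'nameA': 'c', 'statusA': 'X', 'nameB': 'z', 'statusB': 'Q'},
    {'nameA': 'f', 'statusA': 'Q', 'nameB': 'r', 'statusB': 'Q'},
]
results = [format_row(row) for row in rows]
for line in results:
    print(line)
```

A row with no Q status yields an empty string, which can be filtered out before printing if such rows should be skipped.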
