Count number of substrings found in string - python

I've got the following function which checks to see if any of the strings in b is present in a. This works fine.
a = "a b c d c"
b = ["a", "c", "e"]
if any(x in a for x in b):
    print(True)
else:
    print(False)
I would like to modify it to tell me how many of the strings in b were found in a, which in this case is 2: a and c. Although c is found twice, it should only be counted once.
How can I do this?

Just change any to sum
print(sum(x in a for x in b)) # prints 2
Here's how it is working:
>>> [x in a for x in b]
[True, True, False]
>>> t = [x in a for x in b]
>>> sum(t) # sum() is summing the True values here
2
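A minimal sketch of why this works, assuming only the standard library: bool is a subclass of int, so every True adds 1 to the sum and every False adds 0. The helper name count_found is hypothetical and does not come from the question.

```python
def count_found(haystack, needles):
    """Count how many needles occur at least once as a substring of haystack."""
    # Each `needle in haystack` test yields a bool; True adds 1, False adds 0.
    return sum(needle in haystack for needle in needles)


a = "a b c d c"
b = ["a", "c", "e"]
print(count_found(a, b))  # 2: "a" and "c" are found, "e" is not
```

Note that `in` on a string is a substring test, so "a" would also be counted for a = "banana". If only whole space-separated tokens should match, testing against a.split() instead is one option.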

This can be done with sum(map(lambda x: 1 if x in a else 0, b)) or sum([1 if x in a else 0 for x in b])

This will do what you want:
def anycount(it):
    return len([e for e in it if e])
a = "a b c d c"
b = ["a", "c", "e"]
print(anycount(x in a for x in b))
2

Related

Pandas: Determine if a string in one column is a substring of a string in another column

Consider these series:
>>> a = pd.Series('abc a abc c'.split())
>>> b = pd.Series('a abc abc a'.split())
>>> pd.concat((a, b), axis=1)
     0    1
0  abc    a
1    a  abc
2  abc  abc
3    c    a
>>> unknown_operation(a, b)
0    False
1     True
2     True
3    False
The desired logic is to determine if the string in the left column is a substring of the string in the right column. pd.Series.str.contains does not accept another Series, and pd.Series.isin checks if the value exists in the other series (not in the same row specifically). I'm interested to know if there's a vectorized solution (not using .apply or a loop), but it may be that there isn't one.
Let us try NumPy's np.char.find (the public name for defchararray.find), which is vectorized:
import numpy as np
df = pd.concat((a, b), axis=1)  # columns are labelled 0 and 1
np.char.find(df[1].values.astype(str), df[0].values.astype(str)) != -1
Out[740]: array([False, True, True, False])
IIUC,
df[1].str.split('', expand=True).eq(df[0], axis=0).any(axis=1) | df[1].eq(df[0])
Output:
0    False
1     True
2     True
3    False
dtype: bool
I tested various functions on a randomly generated DataFrame of 1,000,000 five-character entries.
On my machine, the average run times over 3 tests were, from fastest to slowest:
zip      0.21s
v_find   0.79s
to_list  1s
any      3.55s
apply    8.6s
Hence, I would recommend using zip:
[x[0] in x[1] for x in zip(df['A'], df['B'])]
or vectorized find (as proposed by BENY)
np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
My test setup:
import random
import string
import numpy as np
import pandas as pd
n = 1_000_000  # number of rows, as stated above
def generate_string(length):
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))
A = [generate_string(5) for x in range(n)]
B = [generate_string(5) for y in range(n)]
df = pd.DataFrame({"A": A, "B": B})
# Result names carry a _res suffix so that the built-ins any and zip are not shadowed.
to_list_res = pd.Series([a in b for a, b in df[['A', 'B']].values.tolist()])
apply_res = df.apply(lambda s: s["A"] in s["B"], axis=1)
v_find_res = np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
any_res = df["B"].str.split('', expand=True).eq(df["A"], axis=0).any(axis=1) | df["B"].eq(df["A"])
zip_res = [x[0] in x[1] for x in zip(df['A'], df['B'])]
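The zip approach does not depend on pandas itself, since iterating over a Series yields its values. A minimal sketch with plain lists standing in for the two columns (the function name is_substring_rowwise is made up for illustration):

```python
def is_substring_rowwise(left, right):
    """For each row, report whether the left string occurs inside the right string."""
    return [l in r for l, r in zip(left, right)]


left = "abc a abc c".split()
right = "a abc abc a".split()
print(is_substring_rowwise(left, right))  # [False, True, True, False]
```

With a DataFrame, the same function could presumably be called with the two columns directly, and the resulting list wrapped in a Series that reuses the DataFrame's index if row labels need to be kept.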

Python - Multiple Assignment [duplicate]

Recently I was reading through the official Python documentation when I came across the example on how to code the Fibonacci series as follows:
a, b = 0, 1
while a < 10:
    print(a)
    a, b = b, a + b
which outputs 0, 1, 1, 2, 3, 5, 8
Since I've never used multiple assignment myself, I decided to hop into Visual Studio to figure out how it worked. I noticed that if I changed the notation to...
a = 0
b = 1
while a < 10:
    print(a)
    a, b = b, a + b
... the output remains the same.
However, if I change the notation to...
a = 0
b = 1
while a < 10:
    print(a)
    a = b
    b = a + b
... the output changes to 0, 1, 2, 4, 8
The way I understand multiple assignment is that it condenses what would otherwise take two lines into one. But obviously, this reasoning must be flawed if I can't apply it to the statements under the print(a) call.
It would be much appreciated if someone could explain why this is/what is wrong with my reasoning.
a = 0
b = 1
while a < 10:
    print(a)
    a = b
    b = a + b
In this case, a becomes b and then b becomes the changed a + b.
a, b = 0, 1
while a < 10:
    print(a)
    a, b = b, a + b
In this case, a becomes b and at the same time b becomes the original a + b.
That's why, in your case, b becomes the new a + b, which, since a = b, basically means b = b + b. That's why the value of b doubles every time.
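The difference can be checked side by side. This is a small sketch, with function names invented for illustration, that collects the values each loop would print:

```python
def fib_simultaneous(limit):
    """Loop using tuple assignment: both new values come from the old a and b."""
    out = []
    a, b = 0, 1
    while a < limit:
        out.append(a)
        a, b = b, a + b
    return out


def fib_sequential(limit):
    """Loop using two separate statements: b is computed from the already updated a."""
    out = []
    a, b = 0, 1
    while a < limit:
        out.append(a)
        a = b      # a is overwritten first...
        b = a + b  # ...so this is really b = b + b
    return out


print(fib_simultaneous(10))  # [0, 1, 1, 2, 3, 5, 8]
print(fib_sequential(10))    # [0, 1, 2, 4, 8]
```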
When you do a, b = d, e, the whole right-hand side is evaluated first, and only then are the targets assigned, from left to right. So when you do a, b = b, a + b, what you are effectively writing is,
temp = (b, a + b)
a = temp[0]
b = temp[1]
Hence the difference.
You can verify this by doing
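a = 0
b = 1
while a < 10:
    a, b = b, a + b
    print(a, b)
The first output is 1 1, and the second is 1 2. Both right-hand values are computed from the old a and b before either name is rebound, so on the second pass b becomes 1 + 1 = 2 while a takes the old b, 1. Running b = a + b and then a = b as separate statements would print 2 2 on the second pass instead.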
If you want more details on how this works, you can check out this question.
In a multiple assignment, the right side is always computed first.
In effect,
a, b = b, a + b
is the same as:
temp = a + b
a = b
b = temp

Explain how multiple variable assignment in single line works? (Example: a, b = b, a+b)

Say I have this Python code
def fib2(n):  # return Fibonacci series up to n
    result = []
    a, b = 0, 1
    while b < n:
        result.append(b)
        a, b = b, a+b
    return result
For n=1000 this prints:
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
But I don't understand why it's 1 1 2 3.
The issue is with this line:
a, b = b, a+b
What is the order of execution?
The two options I see are:
1:
a = b
b = a+b
2:
b = a+b
a = b
But neither gives me the correct result when I try it manually.
What am I missing?
Neither of the two options you shared describes how this line works:
a, b = b, a+b
This assigns a the value of b, and assigns b the value of a+b computed with the old value of a. You can think of it as equivalent to:
>>> temp_a, temp_b = a, b
>>> a = temp_b
>>> b = temp_a + temp_b
Example: Dual variable assignment in one line:
>>> a, b = 3, 5
>>> a, b = b, a+b
>>> a
5
>>> b
8
Equivalent Explicit Logic:
>>> a, b = 3, 5
>>> temp_a, temp_b = a, b
>>> a = temp_b
>>> b = temp_a + temp_b
>>> a
5
>>> b
8
The order of operations in a, b = b, a+b is that the tuple (b, a+b) is constructed, and then that tuple is assigned to the variables (a, b). In other words, the right side of the assignment is entirely evaluated before the left side.
(Actually, starting with Python 2.6, no tuple is actually constructed in cases like this with up to 3 variables - a more efficient series of bytecode operations gets substituted. But this is, by design, not a change that has any observable differences.)
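The evaluation order can be made visible with a little logging. The sketch below is only an illustration: the helper value and the class Target are hypothetical names, and subscript targets stand in for the plain names a and b so that each assignment can be recorded.

```python
log = []


def value(name, v):
    """Return v, recording the moment this right-hand expression is evaluated."""
    log.append(f"eval {name}")
    return v


class Target(dict):
    """A dict that records each item assignment."""

    def __setitem__(self, key, val):
        log.append(f"assign {key}")
        super().__setitem__(key, val)


t = Target()
t["a"], t["b"] = value("first", 1), value("second", 2)
print(log)
# ['eval first', 'eval second', 'assign a', 'assign b']
```

Both right-hand expressions are logged before either assignment, which matches the description above.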
This is the standard Python way to swap two variables. Here is a working example to clear up your doubt:
Python evaluates expressions from left to right. Notice that while
evaluating an assignment, the right-hand side is evaluated before the
left-hand side.
http://docs.python.org/3/reference/expressions.html#evaluation-order
a = [1, 2, 3, 4, 5]
for i, j in enumerate(a):
    if i == 1:
        a[i-1], a[i] = a[i], a[i-1]
print(a)
output:
[2, 1, 3, 4, 5]
For more info, read this tutorial

Python: join string list if else oneliner

Working with Pandas, I have to rewrite queries implemented as a dict:
query = {"height": 175}
The key is the attribute for the query and the value could be a scalar or iterable.
In the first part I check if the value is not NaN and scalar.
If this condition holds I write the query expression with the == symbol, but else if the value is Iterable I would need to write the expression with the in keyword.
This is the actual code that I need to fix in order to work also with Iterables.
import numpy as np
from collections.abc import Iterable
def query_dict_to_expr(query: dict) -> str:
    expr = " and ".join(["{} == {}"
                         .format(k, v) for k, v in query.items()
                         if (not np.isnan(v)
                             and np.isscalar(v))
                         else "{} in #v".format(k) if isinstance(v, Iterable)
                         ]
                        )
    return expr
but I get a syntax error at the else clause.
If I understand correctly, you don't need to check the type:
In [47]: query
Out[47]: {'height': 175, 'lst_col': [1, 2, 3]}
In [48]: ' and '.join(['{} == {}'.format(k,v) for k,v in query.items()])
Out[48]: 'height == 175 and lst_col == [1, 2, 3]'
Demo:
In [53]: df = pd.DataFrame(np.random.randint(5, size=(5,3)), columns=list('abc'))
In [54]: df
Out[54]:
   a  b  c
0  0  0  3
1  4  2  4
2  2  2  3
3  0  1  0
4  0  4  1
In [55]: query = {"a": 0, 'b':[0,4]}
In [56]: q = ' and '.join(['{} == {}'.format(k,v) for k,v in query.items()])
In [57]: q
Out[57]: 'a == 0 and b == [0, 4]'
In [58]: df.query(q)
Out[58]:
   a  b  c
0  0  0  3
4  0  4  1
You misplaced the if/else in the comprehension. If you put the if after the for, as in f(x) for x in iterable if g(x), it filters the elements of the iterable (and cannot be combined with an else). Instead, you want to keep all the elements, i.e. use f(x) for x in iterable where f(x) just happens to be a ternary expression of the form a(x) if c(x) else b(x).
Instead, try like this (simplified non-numpy example):
>>> query = {"foo": 42, "bar": [1,2,3]}
>>> " and ".join(["{} == {}".format(k, v)
...               if not isinstance(v, list)
...               else "{} in {}".format(k, v)
...               for k, v in query.items()])
'foo == 42 and bar in [1, 2, 3]'
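Applied to the function from the question, the same pattern might look like the sketch below. It is an assumption-laden illustration rather than a drop-in fix: the NaN check is left out, strings are treated as scalars even though they are iterable, and the inner helper is_collection is a made-up name.

```python
from collections.abc import Iterable


def query_dict_to_expr(query):
    """Build a query string: '==' for scalars, 'in' for non-string iterables."""
    def is_collection(v):
        # Strings are iterable too, but are treated as scalars here (an assumption).
        return isinstance(v, Iterable) and not isinstance(v, (str, bytes))

    # The conditional expression sits before the `for`, so every item is kept.
    return " and ".join(
        "{} in {}".format(k, list(v)) if is_collection(v) else "{} == {}".format(k, v)
        for k, v in query.items()
    )


print(query_dict_to_expr({"height": 175, "ids": [1, 2, 3]}))
# height == 175 and ids in [1, 2, 3]
```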

Pythonic way to do conditionally assign variables

Any suggestions on how to do that in Python?
if x():
    a = 20
    b = 10
else:
    a = 10
    b = 20
I can swap them as below, but it's not as clear (nor very pythonic IMO)
a = 10
b = 20
if x():
    [a, b] = [b, a]
(a,b) = (20,10) if x() else (10,20)
Swapping values with a, b = b, a is considered idiomatic in Python.
a, b = 10, 20
if x(): a, b = b, a
One nice thing about this is that you do not repeat the 10 and 20, so it is a little DRY-er.
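Both suggestions can be compared directly. In this sketch the boolean parameter flag stands in for the result of x(), and the two function names are invented for illustration:

```python
def assign_pair(flag):
    # A conditional expression picks one of two tuples, which is then unpacked.
    a, b = (20, 10) if flag else (10, 20)
    return a, b


def assign_then_swap(flag):
    # Assign the defaults once, then swap only when the condition holds.
    a, b = 10, 20
    if flag:
        a, b = b, a
    return a, b


print(assign_pair(True), assign_pair(False))            # (20, 10) (10, 20)
print(assign_then_swap(True), assign_then_swap(False))  # (20, 10) (10, 20)
```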
