How to use an "in" condition with many defined texts in Python - python

I want to get the correct result from my condition. The text values come from a database (looping over the rows with a for loop). Here are my defined texts:
#define
country = ('america', 'indonesia', 'england', 'france')
city = ('new york', 'jakarta', 'london', 'paris')

c1 = "Country"
c2 = "City"
c3 = "<blank>"
And here is my condition:
if str(text) in str(country):
    stat = c1
elif str(text) in str(city):
    stat = c2
else:
    stat = c3
I get the wrong result from this condition. Any solution to make this code work? It works when "in" is used against just one text, but here there are many defined texts to test against.

If I understood you correctly, you need:
text = "i was born in paris"
country = ('america', 'indonesia', 'england', 'france')
city = ('new york', 'jakarta', 'london', 'paris')

def check(text):
    for i in country:
        if i in text.lower():
            return "Country"
    for i in city:
        if i in text.lower():
            return "City"
    return "<blank>"

print(check(text))
print(check("I dnt like vacation in america"))
Output:
City
Country
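Note that the substring test in check() above can fire on partial words: 'paris' occurs inside 'comparison', for example. If that matters, a word-boundary variant is possible. A sketch reusing the same tuples (this is my own variation, not part of the original answer):

```python
import re

country = ('america', 'indonesia', 'england', 'france')
city = ('new york', 'jakarta', 'london', 'paris')

def check(text):
    # \b word boundaries stop 'paris' from matching inside 'comparison'
    for name in country:
        if re.search(r'\b' + re.escape(name) + r'\b', text.lower()):
            return "Country"
    for name in city:
        if re.search(r'\b' + re.escape(name) + r'\b', text.lower()):
            return "City"
    return "<blank>"

print(check("a comparison of cars"))  # <blank>, not City
print(check("i was born in paris"))   # City
```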

You could be better off using dictionaries. I assume that text is a list:
dict1 = {
    "countries": ['america', 'indonesia', 'england', 'france'],
    "city": ['new york', 'jakarta', 'london', 'paris']
}
for x in text:
    for y in dict1['countries']:
        if y in x:
            print('country: ' + x)
    for z in dict1['city']:
        if z in x:
            print('city: ' + x)

First of all, check what you are testing.
>>> country = ('america','indonesia', 'england', 'france')
>>> city = ('new york', 'jakarta', 'london', 'paris')
>>>
>>> c1="Country"
>>> c2="City"
>>> c3="<blank>"
Same as your setup. So, you are testing for the presence of a substring.
>>> str(country)
"('america', 'indonesia', 'england', 'france')"
Let's see if we can find a country.
>>> 'america' in str(country)
True
Yes! Unfortunately a simple string test such as the one above, besides involving an unnecessary conversion of the list to a string, also finds things that aren't countries.
>>> "ca', 'in" in str(country)
True
The in test for strings is true if the string to the right contains the substring on the left. The in test for lists is different, however, and is true when the tested list contains the value on the left as an element.
>>> 'america' in country
True
Nice! Have we got rid of the "weird other matches" bug?
>>> "ca', 'in" in country
False
It would appear so. However, using the list inclusion test you need to check every word in the input string rather than the whole string.
>>> "I don't like to vacation in america" in country
False
The above is similar to what you are doing now, but testing list elements rather than the list as a string. This expression generates a list of words in the input.
>>> [word for word in "I don't like to vacation in america".split()]
['I', "don't", 'like', 'to', 'vacation', 'in', 'america']
Note that you may have to be more careful than I have been in splitting the input. In the example above, "america, steve" when split would give ['america,', 'steve'] and neither word would match.
The any function iterates over a sequence of expressions, returning True at the first true member of the sequence (and False if no such element is found). (Here I use a generator expression instead of a list, but the same iterable sequence is generated).
>>> any(word in country for word in "I don't like to vacation in america".split())
True
For extra marks (and this is left as an exercise for the reader) you could write
a function that takes two arguments, a sentence and a list of possible matches,
and returns True if any of the words in the sentence are present in the list. Then you could use two different calls to that function to handle the countries and the
cities.
You could speed things up somewhat by using sets rather than lists, but the principles are the same.
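The exercise suggested above might be sketched as follows; contains_any and classify are names of my own invention, and the logic is just the any() pattern already shown (note that multi-word names such as 'new york' will not match a word-by-word test):

```python
country = ('america', 'indonesia', 'england', 'france')
city = ('new york', 'jakarta', 'london', 'paris')

def contains_any(sentence, candidates):
    # True if any word of the sentence appears in the candidate sequence
    return any(word in candidates for word in sentence.lower().split())

def classify(sentence):
    if contains_any(sentence, country):
        return "Country"
    if contains_any(sentence, city):
        return "City"
    return "<blank>"

print(classify("I don't like to vacation in america"))  # Country
print(classify("jakarta is humid"))                     # City
```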

Related

Sort values for both str and int by ranking appearance in a string

I have to sort keywords and values in a string.
This is my attempt:
import re
phrase = '$1000 is the price of the car, it is 10 years old. And this sandwish cost me 10.34£'
list1 = re.findall(r'\d*\.?\d+', phrase)  # find all the numbers in the phrase (1000, 10, 10.34)
list2 = ['car', 'year', 'sandwish']  # the keywords in the phrase that I need to find
joinedlist = list1 + list2  # combination of the two lists, int and str, that are in my sentence (the key elements)
filter1 = sorted(joinedlist, key=phrase.find)  # find all the key elements in my phrase and sort them by order of appearance
print(filter1)
Unfortunately, in some cases, because the sorted function works by lexical sorting, the integers would be printed in the wrong order. This means that in some cases like this one, the output will be:
['1000', '10', 'car', 'year', 'sandwish', '10.34']
instead of:
['1000', 'car', '10', 'year', 'sandwish', '10.34']
as the car appears before 10 in the initial phrase.
Lexical sorting has nothing to do with it, because your sorting key is the position in the original phrase; all the sorting is done by numeric values (the indices returned by find). The reason that the '10' is appearing "out of order" is that phrase.find returns the first occurrence of it, which is inside the 1000 part of the string!
Rather than breaking the sentence apart into two lists and then trying to reassemble them with a sort, why not just use a single regex that selects the different kinds of things you want to keep? That way you don't need to re-sort them at all:
>>> re.findall(r'\d*\.?\d+|car|year|sandwish', phrase)
['1000', 'car', '10', 'year', 'sandwish', '10.34']
The issue is that 10 and 1000 produce the same value from phrase.find: both are found at the start of the string, since 10 is a substring of 1000.
You can implement a regex lookup into phrase, using \b word boundaries so that 10 only matches a standalone 10 in your string:
def finder(s):
    if m := re.search(rf'\b{s}\b', phrase):
        return m.span()[0]
    elif m := re.search(rf'\b{s}', phrase):
        return m.span()[0]
    return -1
Test it:
>>> sorted(joinedlist, key=finder)
['1000', 'car', '10', 'year', 'sandwish', '10.34']
However, it is easier if you turn phrase into a lookup list of your keywords. You will need some treatment for year as a keyword vs years in phrase; you can use the regex r'\d+\.\d+|\w+' to find the words and then str.startswith() to test whether a match is close enough:
pl = re.findall(r'\d+\.\d+|\w+', phrase)

def finder2(s):
    try:  # first try an exact match
        return pl.index(s)
    except ValueError:
        pass  # not found; now try .startswith()
    try:
        return next(i for i, w in enumerate(pl) if w.startswith(s))
    except StopIteration:
        return -1
>>> sorted(joinedlist, key=finder2)
['1000', 'car', '10', 'year', 'sandwish', '10.34']
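As a variation on the single-regex answer above, the alternation need not be hardcoded; it can be assembled from the keyword list with re.escape. A sketch, assuming the keywords contain no regex syntax of their own worth preserving:

```python
import re

phrase = '$1000 is the price of the car, it is 10 years old. And this sandwish cost me 10.34£'
keywords = ['car', 'year', 'sandwish']

# number pattern first, then the escaped keywords as alternatives
pattern = r'\d*\.?\d+|' + '|'.join(map(re.escape, keywords))
print(re.findall(pattern, phrase))  # ['1000', 'car', '10', 'year', 'sandwish', '10.34']
```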

Extract first element from the list that occurs after a particular word

I have a string and a list as follow:
text = 'Sherlock Holmes. PARIS. Address: 221B Baker Street, london. Solving case in Madrid.'
city = ['Paris', 'London', 'Madrid']
I want to extract the first element from the list that occurs after the word Address.
Here's my approach to the problem using nltk:
import nltk

loc = None
flag = False
for word in nltk.word_tokenize(text):
    if word == 'Address':
        flag = True
    if flag:
        if word.capitalize() in city:
            loc = word
            break
print(loc)
I am getting the expected result from the above, which is london.
But in the real scenario my text is very large, and so is the list of cities. Is there a better way to do this?
The lowest-hanging fruit I see is that you can turn city into a set for constant-time membership checks. Besides that, consider using next with a default argument to return the first matching city.
city = {'Paris', 'London', 'Madrid'}
while text:
    text = text.partition('Address')[-1].strip()
    print(
        next((w for w in nltk.word_tokenize(text) if w.capitalize() in city), None))
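If nltk is heavier than needed, the same partition-then-scan idea works with a plain regex tokenizer. A sketch (first_city_after is my own name for it, and it assumes each city is a single token):

```python
import re

city = {'Paris', 'London', 'Madrid'}
text = 'Sherlock Holmes. PARIS. Address: 221B Baker Street, london. Solving case in Madrid.'

def first_city_after(text, marker='Address'):
    # only look at the part of the string after the marker
    tail = text.partition(marker)[-1]
    return next((w for w in re.findall(r'\w+', tail) if w.capitalize() in city), None)

print(first_city_after(text))  # london
```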

How to match multiple words (per string) in an array with a string containing multiple words

This has been resolved. See bottom of this post for a solution
I'm trying to filter a continuous loop that has a constant feed of strings coming in (from an API).
Here's an example of the code I'm using.
I have a filter set up with an array like so:
filter_a = ['apples and oranges', 'a test', 'bananas']
and a function I found on Stack Overflow:
def words_in_string(word_list, a_string):
    return set(word_list).intersection(a_string.split())

title = 'bananas'
# (this is a continuously looping thing, so sometimes it
#  might be, for example, 'apples and oranges')
And my if statement:
if words_in_string(filter_a, str(title.lower())):
    print(title.lower())
For some reason it detects 'bananas' but not 'apples and oranges'; it skips right over strings with multiple words. I'm guessing it's because of the split(), but I'm not sure.
Edit:
Here's another example of what I meant:
Match this and make it successful:
title = 'this is 1'
word_list = ['this is', 'a test']
if title in word_list:
    print("successful")
else:
    print("unsuccessful")
Edit 2:
Solution
title = 'this is 1'
word_list = ['this is', 'a test']
if any(item in title for item in word_list):
    print("successful")
else:
    print("unsuccessful")
I don't think your code makes sense. Let's analyze what words_in_string does.
word_list is the list of words you want to keep, and set(word_list) transforms this list into a set, which only contains unique elements. In your example, ['apples and oranges', 'a test', 'bananas'] becomes the set {'apples and oranges', 'a test', 'bananas'}.
Next, a_string.split() splits a_string into a list, and the set's intersection method computes the intersection of the set and the list that a_string.split() created.
Finally, it returns the result.
To be clearer: given a list of words, this function returns the words in a_string that are also contained in the list.
For example:
given ["banana", "apple", "orange"] and a_string = "I like banana and apple", it will return {"banana", "apple"}.
But if you change the list to ["bananas", "apple", "orange"], it will only return {"apple"}, as "banana" doesn't equal "bananas".
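Building on the Edit 2 solution from the question, the intersection-based helper can be replaced by a substring test so that multi-word entries match too. A sketch (phrases_in_string is a made-up name):

```python
filter_a = ['apples and oranges', 'a test', 'bananas']

def phrases_in_string(phrase_list, a_string):
    # substring test, so multi-word entries like 'apples and oranges' can match
    s = a_string.lower()
    return [p for p in phrase_list if p in s]

print(phrases_in_string(filter_a, 'Apples and Oranges are on sale'))  # ['apples and oranges']
print(phrases_in_string(filter_a, 'bananas!'))                        # ['bananas']
```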

Keeping a count of words in a list without using any count method in python?

I need one list of the words that appear exactly once in a list, and another list of the words that appear twice, without using any count method. I tried using a set, but it removes only the duplicate, not the original. Is there any way to keep the words appearing once in one list and the words appearing twice in another?
The sample file is text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'Andy Gosling\n'], so technically the two occurrences of Andy would go in one list, and the rest in the other.
Using dictionaries is not allowed :/
for word in text:
    clean = clean_up(word)
    for words in clean.split():
        clean2 = clean_up(words)
        l = clean_list.append(clean2)
        if clean2 not in clean_list:
            clean_list.append(clean2)
print(clean_list)
This is a very bad, unPythonic way of doing things; but once you disallow Counter and dict, this is about all that's left. (Edit: except for sets, d'oh!)
text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'Andy Gosling\n']
once_words = []
more_than_once_words = []
for sentence in text:
    for word in sentence.split():
        if word in more_than_once_words:
            pass  # do nothing
        elif word in once_words:
            once_words.remove(word)
            more_than_once_words.append(word)
        else:
            once_words.append(word)
which results in
# once_words
['Fennimore', 'Cooper', 'Peter,', 'Paul,', 'and', 'Mary', 'Gosling']
# more_than_once_words
['Andy']
It is a silly problem that forbids key data structures or loops or whatever. Why not just program in C then? Tell your teacher to get a job...
Editorial aside, here is a solution:
>>> text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n','Andy Gosling\n']
>>> data=' '.join(e.strip('\n,.') for e in ''.join(text).split()).split()
>>> data
['Andy', 'Fennimore', 'Cooper', 'Peter', 'Paul', 'and', 'Mary', 'Andy', 'Gosling']
>>> [e for e in data if data.count(e)==1]
['Fennimore', 'Cooper', 'Peter', 'Paul', 'and', 'Mary', 'Gosling']
>>> list({e for e in data if data.count(e)==2})
['Andy']
If you can use a set (I wouldn't use it either, if you're not allowed to use dictionaries), then you can use the set to keep track of what words you have 'seen'... and another one for the words that appear more than once. Eg:
seen = set()
duplicate = set()
Then, each time you get a word, test if it is on seen. If it is not, add it to seen. If it is in seen, add it to duplicate.
At the end, you'd have a set of seen words, containing all the words, and a duplicate set, with all those that appear more than once.
Then you only need to subtract duplicate from seen, and the result is the words that have no duplicates (i.e. the ones that appear only once).
This can also be implemented using only lists (which would be more honest to your homework, if a bit more laborious).
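A minimal sketch of the two-set approach just described, using the sample data from the question (the punctuation clean-up is my own addition):

```python
text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'Andy Gosling\n']

seen = set()
duplicate = set()
for line in text:
    for word in line.split():
        word = word.strip(',.\n')  # light punctuation clean-up
        if word in seen:
            duplicate.add(word)
        else:
            seen.add(word)

once = seen - duplicate  # words that appear exactly once
print(sorted(once))
print(sorted(duplicate))  # ['Andy']
```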
from itertools import groupby
from operator import itemgetter
text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'Andy Gosling\n']
one, two = [list(group) for key, group in groupby(
    sorted(
        ((key, len(list(group))) for key, group in groupby(sorted(' '.join(text).split()))),
        key=itemgetter(1)
    ),
    key=itemgetter(1)
)]

How to remove list of words from a list of strings

Sorry if the question is a bit confusing. This is similar to this question, and I think that question is close to what I want, but in Clojure.
There is another question that is also close, but instead of the single '[br]' in that question, I have a list of strings that need to be searched for and removed.
Hope I made myself clear.
I have a list of noise words that need to be removed from a list of strings:
noise_words_list = ['of', 'the', 'in', 'for', 'at']
for place in places:
    stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]
If I use the list comprehension, I end up searching the same string again and again, so only "of" gets removed and not "the". My modified list therefore looks like this:
places = ['New York', 'the New York City', 'at Moscow']  # and many more
I think this is due to the fact that strings in Python are immutable. I would like to know what mistake I'm making.
Without regexp you could do like this:
places = ['of New York', 'of the New York']
noise_words_set = {'of', 'the', 'at', 'for', 'in'}
stuff = [' '.join(w for w in place.split() if w.lower() not in noise_words_set)
         for place in places]
print(stuff)
Here is my stab at it. This uses regular expressions.
import re
pattern = re.compile(r"(of|the|in|for|at)\W", re.I)
phrases = ['of New York', 'of the New York']
list(map(lambda phrase: pattern.sub("", phrase), phrases))  # ['New York', 'New York']
Sans lambda:
[pattern.sub("", phrase) for phrase in phrases]
Update
Fix for the bug pointed out by gnibbler (thanks!):
pattern = re.compile(r"\b(of|the|in|for|at)\W", re.I)
phrases = ['of New York', 'of the New York', 'Spain has rain']
[pattern.sub("", phrase) for phrase in phrases]  # ['New York', 'New York', 'Spain has rain']
#prabhu: the above change avoids snipping off the trailing "in" from "Spain". To verify run both versions of the regular expressions against the phrase "Spain has rain".
>>> import re
>>> noise_words_list = ['of', 'the', 'in', 'for', 'at']
>>> phrases = ['of New York', 'of the New York']
>>> noise_re = re.compile('\\b(%s)\\W'%('|'.join(map(re.escape,noise_words_list))),re.I)
>>> [noise_re.sub('',p) for p in phrases]
['New York', 'New York']
Since you would like to know what you are doing wrong, this line:
stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]
takes place, and then begins to loop over the words. First it checks for "of". Your place (e.g. "of the New York") is checked to see if it starts with "of". It is transformed (by the calls to replace and strip) and added to the result list. The crucial thing here is that the transformed place is never examined again. For every word you iterate over in the comprehension, a new result is added to the result list. So the next word is "the", and your original place ("of the New York") doesn't start with "the", so no new result is added.
I assume the result you eventually got is the concatenation of your place variables. A simpler-to-read-and-understand procedural version would be (untested):
results = []
for place in places:
    for word in words:
        if place.startswith(word):
            place = place.replace(word, "").strip()
    results.append(place)
Keep in mind that replace() will remove the word anywhere in the string, even if it occurs as a simple substring. You can avoid this by using regexes with a pattern something like ^the\b.
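To make that caveat concrete, a small sketch comparing replace() with a word-boundary regex:

```python
import re

s = 'the theatre in Spain'
print(s.replace('the', '').strip())  # 'atre in Spain' -- 'theatre' got mangled
print(re.sub(r'\bthe\b\s*', '', s))  # 'theatre in Spain' -- only the whole word removed
```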
