Remove numbers from list but not those in a string - python

I have a list of strings as follows:
list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
'2 days ago I saw it', 'every day is a blessing',
' 345000 people have eaten here at the beach']
I want to remove 3, but not 5th or 5x35omega44. All the solutions I have searched for and tried end up removing numbers in an alphanumeric string, but I want those to remain as is. I want my list to look as follows:
list_1 = ['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
'days ago I saw it', 'every day is a blessing',
' people have eaten here at the beach']
I am trying the following:
[' '.join(s for s in words.split() if not any(c.isdigit() for c in s)) for words in list_1]

Use lookarounds to check that the digits are not preceded or followed by a letter, digit, or underscore:
import re
list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
'2 days ago I saw it', 'every day is a blessing',
' 345000 people have eaten here at the beach']
for l in list_1:
    print(re.sub(r'(?<!\w)\d+(?!\w)', '', l))
Output:
what are you guys doing there on 5th avenue
my password is 5x35omega44
days ago I saw it
every day is a blessing
people have eaten here at the beach

One approach would be to use try and except:
def is_intable(x):
    try:
        int(x)
        return True
    except ValueError:
        return False

[' '.join([word for word in sentence.split() if not is_intable(word)]) for sentence in list_1]
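Applied to list_1 from the question, a quick check could look like this (note that split()/join() also normalizes the extra whitespace):
>>> [' '.join(w for w in s.split() if not is_intable(w)) for s in list_1]
['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
'days ago I saw it', 'every day is a blessing', 'people have eaten here at the beach']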

It sounds like you should be using regex. This will match numbers separated by word boundaries:
\b(\d+)\b
Some Python code may look like this:
import re
for item in list_1:
    new_item = re.sub(r'\b(\d+)\b', ' ', item)
    print(new_item)
I am not sure what the best way to handle spaces would be for your project. You may want to put \s at the end of the expression, making it \b(\d+)\b\s or you may wish to handle this some other way.
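For example, a sketch of that variant (with two assumptions on my part: \s? instead of \s so a number at the very end of a string is still removed, and an empty-string replacement instead of ' '):
import re
for item in list_1:
    # \s? also consumes the space after the removed number, avoiding double spaces
    print(re.sub(r'\b(\d+)\b\s?', '', item))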

Since split() always yields strings, isinstance(word, int) would never be True here; a shorter way is to use str.isdigit() to drop the purely numeric tokens:
>>> [' '.join([word for word in expression.split() if not word.isdigit()]) for expression in list_1]
['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
'days ago I saw it', 'every day is a blessing', 'people have eaten here at the beach']

Combining the very helpful regex solutions provided, in a list comprehension format that I wanted, I was able to arrive at the following:
[' '.join([re.sub(r'\b(\d+)\b', '', item) for item in expression.split()]) for expression in list_1]
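A possible further simplification (not from the original answers): since re.sub already operates on the whole sentence, the inner split()/join() could arguably be dropped. A sketch of that shorter variant, where \s* consumes the space left behind by the removed number, reproduces the desired list from the question:
import re
cleaned = [re.sub(r'\b\d+\b\s*', '', s) for s in list_1]
# ['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
#  'days ago I saw it', 'every day is a blessing',
#  ' people have eaten here at the beach']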

Related

Extract text with multiple regex patterns in Python

I have a list with address information
The placement of words in the list can be random.
address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']
I want to extract each item of the list into a new string.
r = re.compile(".region")
region = list(filter(r.match, address))
It works, but there is more than one pattern for "region". For example, there can be "South reg." or "South r-n".
How can I combine multiple patterns?
Also, the digit 4 in the list is the building number. It can be only digits, or something like 4k1.
How can I extract the building number?
Hopefully I understood the requirement correctly.
For extracting the region, I chose to get it by the first word, but if you know which regions are accepted, it would be better to construct the regex from those valid values rather than from the first word.
Also, for the building extraction, I am not sure which characters you want to keep versus remove. In this case I chose to keep only alphanumeric characters, meaning that everything else is stripped.
CODE
import re
list1 = [' South region', ' district KTS', ' -4k-1.', ' app. 106', ' ent. 1', ' st. 15']

def GetFirstWord(list2, column):
    return re.search(r'\w+', list2[column].strip()).group()

def KeepAlpha(list2, column):
    return re.sub(r'[^A-Za-z0-9 ]+', '', list2[column].strip())

print(GetFirstWord(list1, 0))
print(KeepAlpha(list1, 2))
OUTPUT
South
4k1
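If the accepted region spellings are known, here is a sketch of the alternation-based approach mentioned above (the variants 'region', 'reg.' and 'r-n' are taken from the question; extend the list as needed):
import re
address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']
# join the accepted spellings into one alternation
region_variants = [r'region', r'reg\.', r'r-n']
region_re = re.compile(r'\w+\s+(?:' + '|'.join(region_variants) + r')')
regions = [m.group() for m in (region_re.search(s) for s in address) if m]
print(regions)  # ['South region']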

regex match word and what comes after it

I need some help with a regex I am writing. I have a list of words that I want to match, along with any words that might come after them (where "words" means [A-Za-z/\s]+), i.e. no parentheses, symbols, or numbers.
words = ['qtr','hard','quarter'] # keywords that must exist
test=['id:12345 cli hard/qtr Mix',
'id:12345 cli qtr 90%',
'id:12345 cli hard (red)',
'id:12345 cli hard work','Hello world']
expected output is:
['hard/qtr Mix', 'qtr', 'hard', 'hard work', None]
What I have tried so far
re.search(r'((hard|qtr|quarter)(?:[[A-Za-z/\s]+]))',x,re.I)
The problem with the pattern you have, i.e. '((hard|qtr|quarter)(?:[[A-Za-z/\s]+]))', is the extra pair of square brackets: the character class is actually [[A-Za-z/\s] (which also matches a literal [), and the final ] then has to be matched literally, so the pattern only matches when a ] follows. Instead, use a single character class with a plain space character.
You can join all the words in the words list with | to create the pattern '((qtr|hard|quarter)([a-zA-Z/ ]*))', then search for the pattern in each of the strings in the list; if a match is found, take group 0 and append it to the resulting list, else append None:
pattern = re.compile('((' + '|'.join(words) + ')([a-zA-Z/ ]*))')
result = []
for x in test:
    groups = pattern.search(x)
    if groups:
        result.append(groups.group(0))
    else:
        result.append(None)
OUTPUT:
result
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
And since you are including the space characters, you may end up with some values that has space at the end, you can just strip off the white space characters later.
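For example, a small follow-up pass (a sketch) could be:
# strip the stray whitespace from the matches, keeping the None entries
result = [m.strip() if m is not None else None for m in result]
# ['hard/qtr Mix', 'qtr', 'hard', 'hard work', None]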
Idea extracted from the existing answer and made shorter:
>>> pattern = re.compile('((' + '|'.join(words) + ')([a-zA-Z/ ]*))')
>>> [pattern.search(x).group(0) if pattern.search(x) else None for x in test]
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
As mentioned in a comment: it is quite inefficient, because it needs to search for the same pattern twice, once for pattern.search(x).group(0) and once for if pattern.search(x), and a list comprehension is not the best way to go about it in such scenarios.
We can try this to overcome that issue :
>>> [v.group(0) if v else None for v in (pattern.search(x) for x in test)]
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
You can put all the needed words into an OR expression and put your word definition after it:
import re
words = ['qtr','hard','quarter']
regex = r"(" + "|".join(words) + ")[A-Za-z\/\s]+"
p = re.compile(regex)
test=['id:12345 cli hard/qtr Mix(qtr',
'id:12345 cli qtr 90%',
'id:12345 cli hard (red)',
'id:12345 cli hard work','Hello world']
for string in test:
result = p.search(string)
if result is not None:
print(p.search(string).group(0))
else:
print(result)
Output:
hard/qtr Mix
qtr
hard
hard work
None

How can I detect multiple keywords in python string?

I am looking for a way to create several lists, and for the keywords in those lists to be extracted and matched with a response.
User Input: This is a good day I am heading out for a jog.
List 1 : Keywords : good day, great day, awesome day, best day.
List 2 : Keywords : a run, a swim, a game.
But for a huge database of words, can this be linked to just the lists? Or does it need to be specific words?
Also would you recommend Python for a huge database of keywords?
The first thing to do is to break the input string up into tokens. A token is just a piece of the string that you want to match. In your case, it looks like your token size is 2 words (but it doesn't have to be). You might also want to strip all punctuation from the input string.
Then for your input, your tokens are
['This is', 'is a', 'a good', 'good day', 'day I', 'I am', 'am heading', 'heading out', 'out for', 'for a', 'a jog']
Then you can iterate over the tokens and check to see if they're contained in each one of the lists. Might look like this:
# keyword lists taken from the question
list1 = ['good day', 'great day', 'awesome day', 'best day']
list2 = ['a run', 'a swim', 'a game']

input = 'This is a good day I am heading out for a jog'
words = input.split(' ')
tokens = [' '.join(words[i:i+2]) for i in range(len(words) - 1)]
for token in tokens:
    if token in list1:
        print('{} is in list1'.format(token))
    if token in list2:
        print('{} is in list2'.format(token))
One thing you will likely want to do to optimize this is to use sets for list1 and list2, instead of lists.
set1 = set(list1)
sets offer O(1) lookups, as opposed to O(n) for lists, which is critical if your keyword lists are large.
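A minimal sketch of that variant, assuming list1 and list2 hold the keyword phrases from the question:
set1 = set(list1)
set2 = set(list2)
for token in tokens:
    if token in set1:
        print('{} is in list1'.format(token))
    if token in set2:
        print('{} is in list2'.format(token))
# good day is in list1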

Finding words in phrases using regular expression

I want to use a regular expression to find phrases that contain:
1 - one of the N words (any)
2 - all of the N words (all)
>>> import re
>>> reg = re.compile(r'.country.|.place')
>>> phrases = ["This is an place", "France is a European country, and a wonderful place to visit", "Paris is a place, it s the capital of the country.side"]
>>> for phrase in phrases:
... found = re.findall(reg,phrase)
... print found
...
Result:
[' place']
[' country,', ' place']
[' place', ' country.']
It seems that I am getting something wrong; I need to specify that I want to match a whole word, not just part of a word, in both cases.
Can anyone help by pointing out the issue?
Because you are trying to match entire words, use \b to match word boundaries:
reg = re.compile(r'\bcountry\b|\bplace\b')
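Applied to the phrases from the question, that could look like:
import re
reg = re.compile(r'\bcountry\b|\bplace\b')
phrases = ["This is an place",
           "France is a European country, and a wonderful place to visit",
           "Paris is a place, it s the capital of the country.side"]
for phrase in phrases:
    print(re.findall(reg, phrase))
# ['place']
# ['country', 'place']
# ['place', 'country']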

how to find longest match of a string including a focus word in python

New to Python/programming, so not quite sure how to phrase this...
What I want to do is this: input a sentence, find all matches between the input sentence and a set of stored sentences/strings, and return the longest combination of matched strings.
I think the answer will have something to do with regex, but I haven't started on those yet and didn't want to if I didn't need to.
My question: is regex the way to go about this? Or is there a way to do this without importing anything?
If it helps you understand my question/idea, here's pseudocode for what I'm trying to do:
input = 'i play soccer and eat pizza on the weekends'
focus_word = 'and'
ss = [
'i play soccer and baseball',
'i eat pizza and apples',
'every day i walk to school and eat pizza for lunch',
'i play soccer but eat pizza on the weekend',
]
match = MatchingFunction(input, focus_word, ss)
# input should match with all except ss[3]
# ss[0] match = 'i play soccer and'
# ss[1] match = 'and'
# ss[2] match = 'and eat pizza'
# the returned value `match` should be 'i play soccer and eat pizza'
It sounds like you want to find the longest common substring between your input string and each string in your database. Assuming you have a function LCS that will find the longest common substring of two strings, you could do something like:
> [LCS(input, s) for s in ss]
['i play soccer and ',
' eat pizza ',
' and eat pizza ',
' eat pizza on the weekend']
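If you do not already have such a helper, a minimal LCS sketch using difflib.SequenceMatcher (an assumption on my part; any longest-common-substring implementation would do) could be:
from difflib import SequenceMatcher

def LCS(a, b):
    # longest common contiguous substring of a and b
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]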
Then, it sounds like you're looking for the most-repeated substring within your list of strings. (Correct me if I'm wrong, but I'm not quite sure what you're looking for in the general case!) From the array output above, what combination of strings would you use to create your output string?
Based on your comments, I think this should do the trick:
> parts = [s for s in [LCS(input, s) for s in ss] if s.find(focus_word) > -1]
> parts
['i play soccer and ', ' and eat pizza ']
Then, to get rid of the duplicate words in this example:
> "".join([parts[0]] + [p.replace(focus_word, "").strip() for p in parts[1:]])
'i play soccer and eat pizza'
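Putting the pieces together, here is a sketch of the MatchingFunction from the pseudocode (the name and arguments are taken from the question; LCS as sketched above):
def MatchingFunction(sentence, focus_word, candidates):
    # keep only the common substrings that contain the focus word
    parts = [s for s in (LCS(sentence, c) for c in candidates) if focus_word in s]
    if not parts:
        return ''
    # stitch the parts together, dropping the duplicated focus word
    return "".join([parts[0]] + [p.replace(focus_word, "").strip() for p in parts[1:]])

match = MatchingFunction('i play soccer and eat pizza on the weekends', 'and', ss)
print(match)  # i play soccer and eat pizza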
