python extract capitalized words using regex - python

I want to extract the capitalized word that occurs 3 or 4 words before the word "cell" or "cells".
Example:
Briefly, MCF-7 idential cells grown as described above were treated with a range of LTX-diol or iso-LTX-diol.
I would like to extract MCF-7 from above example.
I tried [A-Z0-9-]+cells, but it returns cells instead of MCF-7.

This answer assumes that you want to match a word beginning with a capital letter, which in turn is followed by 1 to 4 other words, followed then by cell or cells. We can try matching using the following pattern:
([A-Z][^ ]*)(?=\s+(?:[^A-Z]\S*\s+){1,4}cells?)
The positive lookahead at the end of the pattern asserts the requirement for 1 to 4 words occurring before cell or cells.
import re
text = "Briefly, MCF-7 idential cells grown as described above were treated with a range of LTX-diol or iso-LTX-diol."
r1 = re.findall(r"([A-Z][^ ]*)(?=\s+(?:[^A-Z]\S*\s+){1,4}cells?)", text)
print(r1)
['MCF-7']
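As a quick check of where that pattern stops matching, here is a minimal sketch. The helper name find_before_cells and the second sentence are made up for illustration. Because the lookahead demands at least one intervening word, a capitalized word immediately followed by cells is not returned.

```python
import re

# Pattern from the answer above: a capitalized token, then 1 to 4 tokens
# that do not start with a capital letter, then "cell" or "cells".
PATTERN = re.compile(r"([A-Z][^ ]*)(?=\s+(?:[^A-Z]\S*\s+){1,4}cells?)")

def find_before_cells(sentence):
    """Return capitalized tokens found 1 to 4 words before cell(s)."""
    return PATTERN.findall(sentence)

# One intervening word ("idential"): the token is found.
print(find_before_cells("Briefly, MCF-7 idential cells grown as described above."))
# No intervening word: nothing is returned with the {1,4} quantifier.
print(find_before_cells("HeLa cells were used."))
```

If directly adjacent words should also count, loosening the quantifier to {0,4} would be one option to try.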

Related

Match sequence of words with regex

I have a list of strings and I want to extract from it only the item name, with spaces, if there are.
The strings stay in column named 0, and index is just for reference.
For example, from each index line I want the following results:
Index - Expected result
0 - BOV BCONTRA
1 - BF PARAROLE C
2 - CUBINHOS DACE
... and so on.
Notice that in line 25 the desired result is not separated from the preceding numbers by a space.
There can be a dot . between the words, like in index line 30.
I've tried re.findall(r"\n\d{1,2} \d+(\b\w+\b)") with no success.
Also re.findall(r"\n\d{1,2} \d+( ?\w+)") brings me only the first word, and I want all the words, not only the first one.
The lines start with a \n char that is not printed in the list.
So basically you need all the uppercase strings in the text.
Try this expression, which captures all the uppercase text, with or without spaces:
re.findall('[A-Z]+[ A-Z]*', text)
It seems you want [A-Z .]+, not "words" (represented by r'\w'), bordered by integers. \w maps to [a-zA-Z0-9_].
The regex to use is r'\d+ \d+([A-Z .]+)\d+'.
I don't know what you mean by a newline preceding each line. If you have a string with lines in it, it's perhaps better to split the input into lines with str.splitlines(), then do a regex match on each relevant line (re.match, so the regex only matches from the start of the line).
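The split-then-match approach can be sketched as follows. The question's actual data is not shown, so the rows below are hypothetical (an index, a numeric code, the item name, a trailing number), and item_names is an illustrative helper name.

```python
import re

# Hypothetical sample rows; the real column is not shown in the question.
# Row 25 has no space between the code and the name and contains a dot.
raw = (
    "\n0 7891 BOV BCONTRA 12"
    "\n1 4456 BF PARAROLE C 3"
    "\n2 3310 CUBINHOS DACE 5"
    "\n25 1020PAO Q. MINAS 7"
)

# Regex from the answer above: uppercase letters, spaces and dots
# bordered by integers.
ITEM = re.compile(r"\d+ \d+([A-Z .]+)\d+")

def item_names(text):
    """Return the item name of every line that matches ITEM."""
    names = []
    for line in text.splitlines():
        m = ITEM.match(line)  # re.match anchors at the start of the line
        if m:
            # The group may carry surrounding spaces, so strip them.
            names.append(m.group(1).strip())
    return names

print(item_names(raw))
```

The empty first line produced by the leading \n simply fails to match and is skipped.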

Python regex for multiple and single dots

I'm currently trying to clean a 1-gram file. Some of the words are as follows:
word - basic word, classical case
word. - basic word but with a dot
w.s.f.w. - (word stands for word) - correct acronym
w.s.f.w - incorrect acronym (missing the last dot)
My current implementation considers two different RegExes because I haven't succeeded in combining them in one. The first RegEx recognises basic words:
find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)
The second one is used in order to recognise acronyms:
find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)
Let's say that I have an input_word as a sequence of characters. The output is obtained with:
"".join(re.findall(pattern, input_word))
Then I choose which output to use based on the length: the longer the output the better. My strategy works well with case no. 1 where both patterns return the same length.
Case no. 2 is problematic because my approach produces word. (with dot) but I need it to return word (without dot). Currently the case is decided in favour of find_acronym_pattern that produces longer sequence.
The case no. 3 works as expected.
The case no. 4: find_acronym_pattern misses the last character meaning that it produces w.s.f. whereas find_word_pattern produces wsfw.
I'm looking for a RegEx (preferably one instead of two that are currently used) that:
given word returns word
given word. returns word
given w.s.f.w. returns w.s.f.w.
given w.s.f.w returns w.s.f.w.
given m.in returns m.in.
A regular expression will never return what is not there, so you can forget about requirement 5. What you can do is always drop the final period, and add it back if the result contains embedded periods. That will give you the result you want, and it's pretty straightforward:
found = re.findall(r"\w+(?:\.\w+)*", input_word)[0]
if "." in found:
    found += "."
As you see I match a word plus any number of ".part" suffixes. Like your version, this matches not only single letter acronyms but longer abbreviations like Ph.D., Prof.Dr., or whatever.
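Wrapped in a function and run against the five cases from the question, the drop-then-restore idea looks roughly like this. The name normalize_token is an assumption made for the example, and the function raises IndexError if the input has no word characters at all.

```python
import re

def normalize_token(input_word):
    """Drop the final period, then restore it for dotted abbreviations."""
    # A word plus any number of ".part" suffixes; a trailing dot is left out.
    found = re.findall(r"\w+(?:\.\w+)*", input_word)[0]
    if "." in found:
        # Embedded periods mean an acronym, so the final period goes back on.
        found += "."
    return found

for token in ["word", "word.", "w.s.f.w.", "w.s.f.w", "m.in"]:
    print(token, "->", normalize_token(token))
```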
If you want one regex, you can use something like this:
((?:[A-Za-z](\.))*[A-Za-z]+)\.?
And substitute with:
\1\2
Regex demo.
Python 3 example:
import re
regex = r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?"
test_str = ("word\n" "word.\n" "w.s.f.w.\n" "w.s.f.w\n" "m.in")
subst = "\\1\\2"
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
Output:
word
word
w.s.f.w.
w.s.f.w.
m.in.
Python demo.

Removing the subsequent words after identified key words

I am trying out regular expressions and would like to find out if there is any way to remove immediate subsequent words after the word I have identified.
For example,
text = "This is the full sentence that I wish to apply regex on."
If I want to remove the word "full", I understand that I can do the following:
result = re.sub(r"full", "", text)
which would give me
This is the sentence that I wish to apply regex on.
Is there any way to get the following statement from my above text? (e.g. keep parts 5 words after the word "full" or remove the first 5 words after "full" )
apply regex on.
Match everything from the beginning of the string up to the word full, then repeat a group of \s\S+ 5 times to match 5 words, and replace the whole match with the empty string:
^.*?full(?:\s\S+){5}\s
https://regex101.com/r/H6tdCH/1
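In Python, that pattern can be applied with re.sub. The sketch below generalizes it slightly. The function name drop_through_keyword and its parameters are made up for illustration, and the keyword is passed through re.escape so it is taken literally.

```python
import re

def drop_through_keyword(text, keyword, n_words):
    """Remove everything up to keyword plus the n_words words after it."""
    pattern = (
        r"^.*?" + re.escape(keyword)           # lazily match up to the keyword
        + r"(?:\s\S+){" + str(n_words) + r"}"  # then exactly n_words words
        + r"\s"                                # and the whitespace after them
    )
    return re.sub(pattern, "", text)

text = "This is the full sentence that I wish to apply regex on."
print(drop_through_keyword(text, "full", 5))
```

If the keyword is absent, or fewer than n_words words follow it, nothing matches and the text comes back unchanged.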

Find sub-words in a string using regex

In trying to solve this challenge (which I pasted at the bottom of this question) using Python 3, the first of the two proposed solutions below passes all test cases, while the second one doesn't. Since, in my eyes, they do pretty much the same thing, this leaves me very confused. Why doesn't the second block of code work?
It must be something very obvious because the second one fails most test cases, but having worked through custom-inputs, I still can't figure it out.
Working solution:
import re
import sys
lines = sys.stdin.readlines()
n=int(lines[0])
q=int(lines[n+1])
N=lines[1:n+1]
S=lines[n+2:]
text = "\n".join(N)
for s in S:
    print(len(re.findall(r"(?<!\W)(?="+s.strip()+r"\w)", text)))
Broken "solution":
import re
import sys
lines = sys.stdin.readlines()
n=int(lines[0])
q=int(lines[n+1])
N=lines[1:n+1]
S=lines[n+2:]
for s in S:
    total=0
    for string in N:
        total += len(re.findall("(?<!\W)(?="+s.strip()+"\w)", string))
    print(total)
We define a word character to be any of the following:
An English alphabetic letter (i.e., a-z and A-Z).
A decimal digit (i.e., 0-9).
An underscore (i.e., _, which corresponds to ASCII value 95).
We define a word to be a contiguous sequence of one or more word characters that is preceded and succeeded by one or more occurrences of non-word-characters or line terminators. For example, in the string I l0ve-cheese_?, the words are I, l0ve, and cheese_.
We define a sub-word as follows:
A sequence of word characters (i.e., English alphabetic letters,
digits, and/or underscores) that occur in the same exact order (i.e.,
as a contiguous sequence) inside another word.
It is preceded and succeeded by word characters only.
Given n sentences consisting of one or more words separated by non-word characters, process q queries where each query consists of a single string. To process each query, count the number of occurrences of the query string as a sub-word in all sentences, then print the number of occurrences on a new line.
Input Format
The first line contains an integer, n, denoting the number of sentences.
Each of the n subsequent lines contains a sentence consisting of words separated by non-word characters.
The next line contains an integer, q, denoting the number of queries.
Each of the q subsequent lines contains a string to check.
Constraints
1 ≤ n ≤ 100
1 ≤ q ≤ 10
Output Format
For each query string, print the total number of times it occurs as a sub-word within all words in all sentences.
Sample Input
1
existing pessimist optimist this is
1
is
Sample Output
3
Explanation
We must count the number of times "is" occurs as a sub-word in our input sentence(s):
It occurs 1 time as a sub-word of existing.
It occurs 1 time as a sub-word of pessimist.
It occurs 1 time as a sub-word of optimist.
While "is" is a substring of the word this, it's followed by a blank space; because a blank space is non-alphabetic, non-numeric, and not an underscore, we do not count it as a sub-word occurrence.
While "is" is a substring of the word is in the sentence, we do not count it as a match because it is preceded and succeeded by non-word characters (i.e., blank spaces) in the sentence. This means it doesn't count as a sub-word occurrence.
Next, we sum the occurrences of "is" as a sub-word of all our words as 1+1+1+0+0=3. Thus, we print 3 on a new line.
Without raw strings, Python processes backslash escapes in the pattern before the regex engine sees it. \w and \W happen to survive because Python does not recognize them as escapes, but recent Python versions warn about such sequences, so write the pattern pieces as raw strings.
The miscount itself comes from the lookbehind. Since you are no longer looking inside one multiline string, the start of each line has no preceding character, so the negative lookbehind (?<!\W) succeeds there and a query at the start of a line is wrongly counted. Change the negative lookbehind to a positive one: (?<=\w)
As Wiktor mentions in his comment, it would be a good idea to escape s.strip() so that any chars that could be treated as regex metachars are taken literally. You can use re.escape(s.strip()) for that.
Your code will work with this change:
total += len(re.findall(r"(?<=\w)(?=" + re.escape(s.strip()) + r"\w)", string))
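To see why the lookbehind matters when each sentence is searched on its own, here is a small sketch. Both function names are hypothetical, and the "island is" sentence is an invented test case, not part of the challenge.

```python
import re

def count_subwords(sentences, query):
    """Count sub-word occurrences using a positive lookbehind."""
    pattern = re.compile(r"(?<=\w)(?=" + re.escape(query) + r"\w)")
    return sum(len(pattern.findall(sentence)) for sentence in sentences)

def count_with_negative_lookbehind(sentences, query):
    """Same count with (?<!\\W), which also succeeds at the start of a string."""
    pattern = re.compile(r"(?<!\W)(?=" + re.escape(query) + r"\w)")
    return sum(len(pattern.findall(sentence)) for sentence in sentences)

# Sample input from the challenge.
print(count_subwords(["existing pessimist optimist this is"], "is"))

# Invented sentence that starts with the query: only the negative
# lookbehind counts the leading "is" of "island".
print(count_subwords(["island is"], "is"))
print(count_with_negative_lookbehind(["island is"], "is"))
```

Both patterns are zero-width, so findall returns one empty string per matching position and len gives the count.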

Using regex to detect a sku

I'm new to regex and I have some trouble detecting the SKU (unique ID) of a product in a column.
My skus can take any form: all they have in common basically is:
to be words made of a combination of letters and numbers
to have 6 characters
Here is an example of what I have in my column:
LX0051
N41554
shoes
handbag
1B1F25
1V1F8L
store near me
M90947
M90844
How can I identify the rows that contain a sku using regex?
If I understand correctly, you mean that it must have at least one digit and at least one letter, and be exactly 6 characters long. Try:
^(?=.*\d)(?=.*[a-z])[a-z\d]{6}$
It uses two look-aheads to ensure there's at least one digit and one letter in the string. Then it simply matches 6 characters. (Remember the i flag if both lowercase and uppercase letters should be allowed.)
See it here at regex101.
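Applied to the sample column from the question in Python, this could look like the sketch below. The i flag becomes re.IGNORECASE, and the variable names are assumptions made for the example.

```python
import re

# Two lookaheads require a digit and a letter; the body allows exactly
# 6 letters or digits. re.IGNORECASE plays the role of the i flag.
SKU = re.compile(r"^(?=.*\d)(?=.*[a-z])[a-z\d]{6}$", re.IGNORECASE)

rows = [
    "LX0051", "N41554", "shoes", "handbag", "1B1F25",
    "1V1F8L", "store near me", "M90947", "M90844",
]

# Keep only the rows that look like a SKU.
skus = [row for row in rows if SKU.match(row)]
print(skus)
```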
