I'm currently trying to clean a 1-gram file. Some of the words are as follows:
word - basic word, classical case
word. - basic word but with a dot
w.s.f.w. - (word stands for word) - correct acronym
w.s.f.w - incorrect acronym (missing the last dot)
My current implementation uses two different RegExes because I haven't succeeded in combining them into one. The first RegEx recognises basic words:
find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)
The second one is used in order to recognise acronyms:
find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)
Let's say that I have an input_word as a sequence of characters. The output is obtained with:
"".join(re.findall(pattern, input_word))
Then I choose which output to use based on the length: the longer the output the better. My strategy works well with case no. 1 where both patterns return the same length.
Case no. 2 is problematic: my approach produces word. (with the dot), but I need it to return word (without the dot). Currently this case is decided in favour of find_acronym_pattern, which produces the longer sequence.
Case no. 3 works as expected.
Case no. 4: find_acronym_pattern misses the last character, producing w.s.f., whereas find_word_pattern produces wsfw.
I'm looking for a RegEx (preferably one instead of the two currently used) that:
given word returns word
given word. returns word
given w.s.f.w. returns w.s.f.w.
given w.s.f.w returns w.s.f.w.
given m.in returns m.in.
A regular expression will never return what is not there, so you can forget about requirement 5. What you can do is always drop the final period, and add it back if the result contains embedded periods. That will give you the result you want, and it's pretty straightforward:
found = re.findall(r"\w+(?:\.\w+)*", input_word)[0]
if "." in found:
found += "."
As you see I match a word plus any number of ".part" suffixes. Like your version, this matches not only single letter acronyms but longer abbreviations like Ph.D., Prof.Dr., or whatever.
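For completeness, here is a hedged sketch that wraps the snippet above in a helper (the clean name is mine) and runs it over the question's five test words:

import re

def clean(input_word):
    # Match a word plus any number of ".part" suffixes.
    found = re.findall(r"\w+(?:\.\w+)*", input_word)[0]
    # Restore the trailing period only for results with embedded periods.
    if "." in found:
        found += "."
    return found

for w in ["word", "word.", "w.s.f.w.", "w.s.f.w", "m.in"]:
    print(w, "->", clean(w))
# word -> word, word. -> word, w.s.f.w. -> w.s.f.w.,
# w.s.f.w -> w.s.f.w., m.in -> m.in.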
If you want one regex, you can use something like this:
((?:[A-Za-z](\.))*[A-Za-z]+)\.?
And substitute with:
\1\2
Python 3 example:
import re
regex = r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?"
test_str = ("word\n" "word.\n" "w.s.f.w.\n" "w.s.f.w\n" "m.in")
subst = "\\1\\2"
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
    print(result)
Output:
word
word
w.s.f.w.
w.s.f.w.
m.in.
I scraped some text from PDFs, and accents/umlauts on characters get scraped after their letter, e.g. "Jos´e" and "Mu¨ller". Because there are just a few of these characters, I would like to fix them to e.g. "José" and "Müller".
I am trying to adapt the pattern from Regex to match words with hyphens and/or apostrophes.
pattern="(?=\S*[´])([a-zA-Z´]+)"
ms = re.finditer(pattern, "Jos´e Vald´ez")
for m in ms:
m.group() #returns "Jos´e" and "Vald´ez"
m.start() #returns 0 and 6, but I want 3 and 10
In the example above, what pattern can I use to get the position of the '´' character? Then I can check the subsequent letter and replace the text accordingly.
My texts are scraped from scientific papers and could contain those characters elsewhere, for example in code. That is the reason why I am using regex instead of .replace or text normalization with e.g. unicodedata: I want to make sure I am replacing "words" (more precisely the authors' first and last names).
EDIT: I can relax these conditions and simply replace those characters everywhere because, if they appear in non-words such as "F=m⋅x¨", I will discard non-words anyway. Therefore, I can use a simple replace approach.
I suggest using
import re
d = {'´e': 'é', 'u¨' : 'ü'}
pattern = "|".join([x for x in d])
print( re.sub(pattern, lambda m: d[m.group()], "Jos´e Vald´ez") )
# => José Valdéz
If you need to make sure there are word boundaries, you may consider using
pattern = r"\b´e|u¨\b"
The \b before ´ and after u will make sure there are other word chars before/after them.
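A quick sanity check of that boundary-aware variant against the same dictionary (the input string is just the example from the question):

import re

d = {'´e': 'é', 'u¨': 'ü'}
pattern = r"\b´e|u¨\b"  # \b is zero-width, so m.group() still equals a dict key
print(re.sub(pattern, lambda m: d[m.group()], "Jos´e Mu¨ller"))
# => José Müller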
A quick fix on the pattern returns the indexes which you are looking for. Instead of matching the whole word, the group will catch the apostrophe characters only.
import re
pattern = "(?=\S*[´])[a-zA-Z]+([´]+)[a-zA-Z]+"
ms = re.finditer(pattern, "Jos´e Vald´ez")
for m in ms:
print(m.group()) # returns "Jos´e" and "Vald´ez"
print(m.start(1)) # returns 3 and 10
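If you then want to use that index to repair the text, one possible sketch (the fixes mapping below is an assumption; extend it with whatever accent pairs occur in your data):

import re

fixes = {'e': 'é', 'a': 'á', 'o': 'ó'}  # hypothetical accent mapping
pattern = r"(?=\S*[´])[a-zA-Z]+([´]+)[a-zA-Z]+"
text = "Jos´e Vald´ez"
out, last = [], 0
for m in re.finditer(pattern, text):
    i = m.start(1)                  # index of the '´' character
    out.append(text[last:i])        # text up to the accent mark
    out.append(fixes[text[i + 1]])  # replace '´' + letter with the accented letter
    last = i + 2
out.append(text[last:])
print("".join(out))
# => José Valdéz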
I need to write code that, given the string "LEMONLEMONLEMO",
finds the repeated word and returns "LEMON".
Given "APPLLEAPL", it should return "APLLE".
It's given that the string is built from repetitions of the same word.
I'm just starting with Python, which makes it harder for me to think about how to address the problem.
We can try matching on the following regex pattern:
(.*).*\1
This says to match and capture some number of characters, so long as the same group appears later in the input.
input = "LEMONLEMO"
result = re.match(r'(.*).*\1', input)
match = result.group(1)
print(match)
LEMO
res = re.match(match + '.*' + '(?=' + match + ')', input)
output = res.group(0)
print(output)
LEMON
The (.*) is greedy, so it should, by default, find the longest substring which also happens to repeat later on.
Edit:
To take into account your full requirements, after finding LEMO, we then need to take the full substring from the first match up to, but not including, the repeat occurrence of LEMO. I use this regex pattern for that:
LEMO.*(?=LEMO)
The code appears a bit rough, because the above pattern needs to be built on the fly.
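Wrapped up, a rough sketch of the two-step approach (the repeated_word name is mine; it assumes the input really is a truncated repetition of one word, as in the question):

import re

def repeated_word(s):
    # Step 1: longest leading substring that occurs again later in the string.
    stem = re.match(r'(.*).*\1', s).group(1)
    # Step 2: everything from the start up to, but not including, the next
    # occurrence of that substring (re.escape guards against metacharacters).
    return re.match(re.escape(stem) + '.*(?=' + re.escape(stem) + ')', s).group(0)

print(repeated_word("LEMONLEMO"))  # LEMON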
I have a script that gives me sentences that contain one of a specified list of key words. A sentence is defined as anything between 2 periods.
Now I want it to select the whole of a sentence like 'Put 1.5 grams of powder in', so that if powder is a key word it gets the entire sentence and not just '5 grams of powder'.
I am trying to figure out how to express that a sentence lies between two occurrences of a period followed by a space. My new filter is:
from itertools import ifilter, imap
from re import finditer
def iterphrases(text):
    return ifilter(None, imap(lambda m: m.group(1), finditer(r'([^\.\s]+)', text)))
However now I no longer print any sentences just pieces/phrases of words (including my key word). I am very confused as to what I am doing wrong.
If you don't HAVE to use an iterator, re.split would be a bit simpler for your use case (a custom definition of a sentence):
re.split(r'\.\s', text)
Note that the last sentence will include the . or will be empty (if the text ends with whitespace after the last period); to fix that:
re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
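For example, on a made-up snippet resembling the question's text:

import re

text = "Put 1.5 grams of powder in the bowl. Stir well. "
print(re.split(r'\.\s', re.sub(r'\.\s*$', '', text)))
# ['Put 1.5 grams of powder in the bowl', 'Stir well']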
also have a look at a bit more general case in the answer for Python - RegEx for splitting text into sentences (sentence-tokenizing)
and for a completely general solution you would need a proper sentence tokenizer, such as nltk.tokenize
nltk.tokenize.sent_tokenize(text)
Here you get it as an iterator. Works with my testcases. It considers a sentence to be anything (non-greedy) until a period, which is followed by either a space or the end of the line.
import re
sentence = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)

def iterphrases(text):
    return (match.group(0) for match in sentence.finditer(text))
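A quick check against the problem sentence from the question (the surrounding text is invented for illustration; this reuses the iterphrases defined above):

text = "Put 1.5 grams of powder in the beaker. Then heat it."
print(list(iterphrases(text)))
# ['Put 1.5 grams of powder in the beaker.', 'Then heat it.']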
If you are sure that . is used for nothing besides sentence delimiters and that every relevant sentence ends with a period, then the following may be useful:
matches = re.finditer(r'([^.]*?(powder|keyword2|keyword3).*?)\.', text)
result = [m.group() for m in matches]
I am using Python to write a program that counts how many times a word appears. But, in order to count, the program needs to look at the beginning of a sentence and only count words in sentences that start with %. For example,
%act: <dur> pours peanut on plate
and I want to count the word peanut. The program should return 1. While,
*CHI: peanut.
would return 0 because it starts with *
So I used findall()
findall('\%.*?' + "peanut", website_html)
But, if a sentence has two "peanut"'s, the pattern matching would only return 1. For example
%act: <bef> gives peanut . eats . <dur> gives peanut . <aft> gives raisin
would only return 1.
How can I make it return 2?
Thanks
I'd recommend breaking it down into two parts. I.e., something like:
num_peanuts = 0
for sentence in re.findall(r'(?m)^%.*', website_html):
    num_peanuts += len(re.findall(r'\bpeanut\b', sentence))
I'm not sure what the right regexp would be for selecting "a sentence that begins with "%" -- here I assume that it's a line whose first character is % (note that by default . does not match newlines; also, the (?m) puts the regexp in multiline mode; and the ^ is a zero-width assertion that matches the beginning of a line.).
I'll also note that the \b's in my peanut-related regexp are to make sure that the word peanut is not a substring of some larger word (eg peanuts). You may or may not want them, depending on the details of your task.
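Run against the two sample lines from the question (website_html here is just a stand-in string):

import re

website_html = (
    "%act: <bef> gives peanut . eats . <dur> gives peanut . <aft> gives raisin\n"
    "*CHI: peanut.\n"
)
num_peanuts = 0
for sentence in re.findall(r'(?m)^%.*', website_html):
    num_peanuts += len(re.findall(r'\bpeanut\b', sentence))
print(num_peanuts)  # 2 -- both hits are on the %act line, none on the *CHI line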
Hopefully this post goes better..
So I am stuck on this feature of this program that will return the whole word where a certain keyword is specified.
i.e., if I tell it to look for the word "I=" in the string "blah blah blah blah I=1mV blah blah etc", it returns the whole word where it is found; so in this case, it would return I=1mV.
I have tried a bunch of different approaches, such as,
text = "One of the values, I=1mV is used"
print(re.split('I=', text))
However, this returns the pieces of the string with the I= removed, so it would return
['One of the values, ', '1mV is used']
If I try regex solutions, I run into the problem that the number could be more than one digit, so this bottom piece of code only works if the number is one digit. If I=10mV were the value, it would only return one digit, but if I put [/0-9] in twice, the code no longer works with single-digit values.
text = "One of the values, I=1mV is used"
print(re.findall("I=[/0-9]", text))
['I=1']
When I tried using re.search,
text = "One of the values, I=1mV is used"
print(re.search("I=", text))
<_sre.SRE_Match object at 0x02408BF0>
What is a good way to retrieve the word (In this case, I want to retrieve I=1mV) and cut out the rest of the string?
A better way would be to split the text into words first:
>>> text = "One of the values, I=1mV is used"
>>> words = text.split()
>>> words
['One', 'of', 'the', 'values,', 'I=1mV', 'is', 'used']
And then filter the words to find the one you need:
>>> [w for w in words if 'I=' in w]
['I=1mV']
This returns a list of all words with I= in them. We can then just take the first element found:
>>> [w for w in words if 'I=' in w][0]
'I=1mV'
Done! What we can do to clean this up a bit is to just look for the first match, rather then checking every word. We can use a generator expression for that:
>>> next(w for w in words if 'I=' in w)
'I=1mV'
Of course you could adapt the if condition to fit your needs better; you could, for example, use str.startswith() to check if the word starts with a certain string or re.match() to check if the word matches a pattern.
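For instance, the str.startswith() variant mentioned there would look like this (same words list as before):
>>> next(w for w in words if w.startswith('I='))
'I=1mV'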
Using string methods
For the record, your attempt to split the string in two halves, using I= as the separator, was nearly correct. Instead of using str.split(), which discards the separator, you could have used str.partition(), which keeps it.
>>> my_text = "Loadflow current was I=30.63kA"
>>> my_text.partition("I=")
('Loadflow current was ', 'I=', '30.63kA')
Using regular expressions
A more flexible and robust solution is to use a regular expression:
>>> import re
>>> pattern = r"""
... I= # specific string "I="
... \s* # Possible whitespace
... -? # possible minus sign
... \s* # possible whitespace
... \d+ # at least one digit
... (\.\d+)? # possible decimal part
... """
>>> m = re.search(pattern, my_text, re.VERBOSE)
>>> m
<_sre.SRE_Match object at 0x044CCFA0>
>>> m.group()
'I=30.63'
This accounts for a lot more possibilities (negative numbers, integer or decimal numbers).
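For instance, against an invented string with a spaced, negative value:
>>> re.search(pattern, "Current spiked to I= -0.25mA briefly", re.VERBOSE).group()
'I= -0.25'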
Note the use of:
Quantifiers to say how many of each thing you want.
a* - zero or more a's
a+ - at least one a
a? - "optional" - one or zero a's
Verbose regular expression (re.VERBOSE flag) with comments - much easier to understand the pattern above than the non-verbose equivalent, I=\s*-?\s*\d+(\.\d+)?.
Raw strings for regexp patterns, r"..." instead of plain strings "..." - means that literal backslashes don't have to be escaped. Not required here because our pattern doesn't use backslashes, but one day you'll need to match C:\Program Files\... and on that day you will need raw strings.
Exercises
Exercise 1: How do you extend this so that it can match the unit as well? And how do you extend this so that it can match the unit as either mA, A, or kA? Hint: "Alternation operator".
Exercise 2: How do you extend this so that it can match numbers in engineering notation, i.e. "1.00e3", or "-3.141e-4"?
import re
text = "One of the values, I=1mV is used"
l = re.split('I=', text)
print(l[1].split(' ')[0])  # '1mV'
If you have more than one I=, do the above for each odd index in l, since index 0 is the part before the first one.
This is a good way since one can also write "One of the values, I= 1mV is used",
and I guess you want to get that I is 1mV.
BTW, I is current and its unit is amperes, not volts :)
With your re.findall attempt you would want to add a + which means one or more.
Here are some examples:
import re
test = "This is a test with I=1mV, I=1.414mv, I=10mv and I=1.618mv."
result = re.findall(r'I=[\d\.]+m[vV]', test)
print(result)
test = "One of the values, I=1mV is used"
result = re.search(r'I=([\d\.]+m[vV])', test)
print(result.group(1))
The first print is: ['I=1mV', 'I=1.414mv', 'I=10mv', 'I=1.618mv']
I've grouped everything other than I= in the re.search example,
so the second print is: 1mV,
in case you are interested in extracting just that.