Replacing hyphenated words in a sentence created by new lines - python

reso- lution
sug- gest
evolu- tion
are all words that contain hyphens due to limited space in a line of text, e.g.
Analysis of two high reso- lution nucleosome maps revealed strong
signals that even though they do not constitute a definite proof are
at least consistent with such a view. Taken together, all these
findings sug- gest the intriguing possibility that nucleosome
positions are the product of a mechanical evolu- tion of DNA
molecules.
I would like to replace with their natural forms i.e.
resolution
suggest
evolution
How can I do this in a text with python?

Make sure there is a lowercase letter before the - and a lowercase letter after the - + space; capture the letters and use backreferences to put them back in the replacement:
([a-z])- ([a-z])
See regex demo (replace with the \1\2 backreference sequence). Note that you may adjust the number of spaces with a {1,max} quantifier (say, if there can be one or two spaces between the parts of the word, use ([a-z])- {1,2}([a-z])). If there can be any whitespace, use \s rather than a literal space.
Python code:
import re
s = 'Analysis of two high reso- lution nucleosome maps revealed strong signals that even though they do not constitute a definite proof are at least consistent with such a view. Taken together, all these findings sug- gest the intriguing possibility that nucleosome positions are the product of a mechanical evolu- tion of DNA molecules.'
s = re.sub(r'([a-z])- ([a-z])', r'\1\2', s)
print(s)
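Since the hyphenated breaks come from line wrapping, the break may be an actual newline rather than a space; the \s variant mentioned above covers that case too. A minimal sketch:

```python
import re

# Hyphenated break followed by a real newline instead of a space
s = 'Analysis of two high reso-\nlution nucleosome maps'

# \s+ matches any run of whitespace (spaces, newlines, tabs)
s = re.sub(r'([a-z])-\s+([a-z])', r'\1\2', s)
print(s)
# → Analysis of two high resolution nucleosome maps
```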

Use str.replace() to replace "- " with "". For example:
>>> my_text = 'reso- lution'
>>> my_text = my_text.replace('- ', '')
>>> my_text # Updated value without "- "
'resolution'
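Note that the plain replace is blunter than the regex above: it also collapses dashes used as punctuation, which the letter checks in the regex avoid. A small illustration (sample text invented):

```python
import re

s = 'high reso- lution scans - they look great'

# str.replace eats every "- ", including the punctuation dash
blunt = s.replace('- ', '')
print(blunt)  # → high resolution scans they look great

# the regex keeps the dash: no lowercase letter directly precedes it
safe = re.sub(r'([a-z])- ([a-z])', r'\1\2', s)
print(safe)   # → high resolution scans - they look great
```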


Is there a way to tell if a newline character is splitting two distinct words in Python?

Using the below code, I imported a few .csv files with sentences like the following into Python:
df = pd.concat((pd.read_csv(f) for f in path), ignore_index=True)
Sample sentence:
I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS. \n
While I have no problem removing the newline characters surrounded by spaces, in the middle of words, or at the end of the string, I don't know what to do with the newline characters separating words.
The output I want is as follows:
Goal sentence:
I WANT TO UNDERSTAND WHERE THERE ARE SOME NEW RESTAURANTS.
Is there a way for me to indicate in my code that the newline character is surrounded by two distinct words? Or is this classic garbage in, garbage out?
df = df[~df['Sentence'].str.contains("\n")]
After doing some digging, I came up with two solutions.
1. The textwrap package: Though it seems that the textwrap package is normally used for visual formatting (i.e. telling a UI when to show "..." to signify a long string), it successfully identified the \n patterns I was having issues with. Though it's still necessary to remove extra whitespace of other kinds, this package got me 90% of the way there.
import textwrap
sample = 'I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS. \n'
sample_wrap = textwrap.wrap(sample)
print(sample_wrap)
['I WANT TO UNDERSTAND WHERE THERE ARE SOME NEW RESTAURANTS. ']
2. Function to ID different \n appearance patterns: The 'boil the ocean' solution I came up with before learning about textwrap, and it doesn't work as well. This function finds matches defined as a newline character surrounded by two word (alphanumeric) characters. For all matches, the function searches NLTK's words.words() list for each string surrounding the newline character. If at least one of the two strings is a word in that list, it's considered to be two separate words.
This doesn't take into consideration domain-specific words, which have to be added to the wordlist, or words like "about", which would be incorrectly categorized by this function if the newline character appeared as "ab\nout". I'd recommend textwrap for this reason, but still thought I'd share.
import re
from nltk.corpus import words  # requires a one-time nltk.download('words')

wordlist = set(words.words())  # set membership is much faster than the raw list

carriage = re.compile(r'(\n+)')
wordword = re.compile(r'((\w+)\n+(\w+))')

def carriage_return(sentence):
    if carriage.search(sentence):
        if not wordword.search(sentence):
            # No newline splits a word in two: just drop the newlines
            sentence = re.sub(carriage, '', sentence)
        else:
            matches = re.findall(wordword, sentence)
            for match in matches:
                word1 = match[1].lower()
                word2 = match[2].lower()
                if word1 in wordlist or word2 in wordlist or word1.isdigit() or word2.isdigit():
                    # At least one half is a real word: treat them as two words
                    sentence = sentence.replace(match[0], word1 + ' ' + word2)
                else:
                    # Neither half is a word: assume one word was split
                    sentence = sentence.replace(match[0], word1 + word2)
            sentence = re.sub(carriage, '', sentence)
    display(sentence)  # IPython-only; use print(sentence) in plain scripts
    return sentence

Regex to match single dots but not numbers or reticences

I'm working on a sentencizer and tokenizer for a tutorial. This means splitting a document string into sentences and sentences into words. Examples:
#Sentencizing
"This is a sentence. This is another sentence! A third..."=>["This is a sentence.", "This is another sentence!", "A third..."]
#Tokenization
"Tokens are 'individual' bits of a sentence."=>["Tokens", "are", "'individual'", "bits", "of", "a", "sentence", "."]
As seen, there's a need for something more than just a string.split(). I'm using re.sub(), appending a 'special' tag to each match (and later splitting on this tag), first for sentences and then for tokens.
So far it works great, but there's a problem: how to make a regex that can split at dots, but not at (...) or at numbers (3.14)?
I've been working with these options with lookahead (I need to match the group and then be able to recall it for appending), but none works:
#Do a negative look behind for preceding numbers or dots, central capture group is a dot, do the same as first for a look ahead.
(?![\d\.])(\.)(?<![\d\.])
The application is:
sentence = re.sub(pattern, '\g<0>'+special_tag, raw_sentence)
I used the following to find the periods that looked relevant:
import re
m = re.compile(r'[0-9]\.[^0-9.]|[^0-9]\.[^0-9.]|[!?]')
st = "This is a sentence. This is another sentence! A third... Pi is 3.14. This is 1984. Hello?"
m.findall(st)
# if you want to use lookahead, you can use something like this:
m = re.compile(r'(?<=[0-9])\.(?=[^0-9.])|(?<=[^0-9])\.(?=[^0-9.])|[!?]')
It's not particularly elegant, but I also tried to deal with the case of "We have a .1% chance of success."
Good luck!
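To plug the lookaround version into the tag-and-split approach the question describes, something like this works (the tag string is invented for illustration):

```python
import re

SPECIAL_TAG = '<SPLIT>'  # hypothetical sentence-boundary marker
pattern = re.compile(r'(?<=[0-9])\.(?=[^0-9.])|(?<=[^0-9])\.(?=[^0-9.])|[!?]')

st = "This is a sentence. This is another sentence! A third... Pi is 3.14. This is 1984. Hello?"

# Append the tag after every matched terminator, then split on the tag
tagged = pattern.sub(lambda m: m.group(0) + SPECIAL_TAG, st)
sentences = [s.strip() for s in tagged.split(SPECIAL_TAG) if s.strip()]
print(sentences)
# → ['This is a sentence.', 'This is another sentence!', 'A third...',
#    'Pi is 3.14.', 'This is 1984.', 'Hello?']
```

The ellipsis works out because only its last dot is followed by a non-dot character, so the tag lands after the full "...".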
This might be overkill, or need a bit of cleanup, but here is the best regex I could come up with:
((([^\.\n ]+|(\.+\d+))\b[^\.]? ?)+)([\.?!\)\"]+)
To break it down:
[^\.\n ]+ // Matches 1+ times any char that isn't a dot, newline or space.
(\.+\d+) // Captures the special case of decimal numbers
\b[^\.]? ? // \b is a word boundary. This may be optionally
// followed by any non-dot character, and optionally a space.
All these previous parts are matched 1+ times. In order to determine that a sentence is finished, we use the following:
[\.?!\)\"]+ // Matches any of the common sentence terminators 1+ times
Try it out!

Python regex or other solution to extract text items from a string?

I have a string that looks like this:
\nInhaltse / techn. Angaben*\n\nAQUA • COCO-GLUCOSIDE • COCOSULFATE • SODIUM\n\n
And I need to get a list of the items between dots, as follows:
AQUA COCO-GLUCOSIDE COCOSULFATE SODIUM
I have tried with regex and other tools but I can't find the right, flexible* answer.
*flexible = the list might have something between 1 and N elements
You should define a little better what the possibilities are, and which rules you want to apply.
I think that a rule like 'any word of at least 2 uppercase characters or dashes, preceded and followed by a space or \n' may work for you. If that's the case, here's your RegEx:
import re
my_string = "\nInhaltse / techn. Angaben*\n\nAQUA • COCO-GLUCOSIDE • COCOSULFATE • SODIUM\n\n"
print(re.findall(r"(?<=\n|\s)[A-Z-]{2,}(?=\n|\s)", my_string))
Output:
['AQUA', 'COCO-GLUCOSIDE', 'COCOSULFATE', 'SODIUM']
and here's how you read the RegEx:
(?<=\n|\s) means preceded by (?<=) a new line(\n) or (|) a space (\s)
[A-Z-]{2,} means at least two ({2,}) uppercase letters or dashes ([A-Z-])
(?=\n|\s) means followed by (?=) a new line(\n) or (|) a space (\s)
or for fitting better your request:
get a list of the items between dots
you may use:
r"(?<=\n\n|\•\s)[A-Z-\s]{2,}(?=\n\n|\s\•)"
which means:
at least 2 uppercase letters, dashes or spaces, preceded by two newlines or a bullet dot (•) and a space, and followed by two newlines or a space and a bullet dot
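If regex feels fragile here, a regex-free alternative is to grab the line that contains the bullets and split on • directly (this assumes the items always sit together on one bullet-separated line):

```python
my_string = "\nInhaltse / techn. Angaben*\n\nAQUA • COCO-GLUCOSIDE • COCOSULFATE • SODIUM\n\n"

# Find the first line that contains a bullet, then split on the bullet
line = next(l for l in my_string.splitlines() if '•' in l)
items = [part.strip() for part in line.split('•')]
print(items)
# → ['AQUA', 'COCO-GLUCOSIDE', 'COCOSULFATE', 'SODIUM']
```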

Chunking sentences using the word 'but' with RegEx

I am attempting to chunk sentences using RegEx at the word 'but' (or any other coordinating conjunction words). It's not working...
sentence = nltk.pos_tag(word_tokenize("There are no large collections present but there is spinal canal stenosis."))
result = nltk.RegexpParser(grammar).parse(sentence)
DigDug = nltk.RegexpParser(r'CHUNK: {.*<CC>.*}')
for subtree in DigDug.parse(sentence).subtrees():
    if subtree.label() == 'CHUNK': print(subtree.node())
I need to split the sentence "There are no large collections present but there is spinal canal stenosis." into two:
1. "There are no large collections present"
2. "there is spinal canal stenosis."
I also wish to use the same code to split sentences at 'and' and other coordinating conjunction (CC) words. But my code isn't working. Please help.
I think you can simply do
import re
result = re.split(r"\s+(?:but|and)\s+", sentence)
where
`\s+` Match one or more whitespace characters (spaces, tabs, line breaks, etc.), as many as possible (greedy)
`(?:` Start of a non-capturing group
`but` Match the characters "but" literally
`|` Or match the alternative that follows (the group fails if neither matches)
`and` Match the characters "and" literally
`)` End of the non-capturing group
`\s+` Match one or more whitespace characters again
You can add more conjunction words in there separated by a pipe-character |.
Take care though that these words do not contain characters that have special meaning in regex. If in doubt, escape them first with re.escape(word)
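Putting the escaping advice together, the conjunction list can be built and escaped programmatically before assembling the pattern:

```python
import re

conjunctions = ["but", "and", "or"]  # extend with any CC words you need
# re.escape guards against regex metacharacters in the words
pattern = r"\s+(?:" + "|".join(re.escape(w) for w in conjunctions) + r")\s+"

sentence = "There are no large collections present but there is spinal canal stenosis."
parts = re.split(pattern, sentence)
print(parts)
# → ['There are no large collections present', 'there is spinal canal stenosis.']
```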
If you want to avoid hardcoding conjunction words like 'but' and 'and', try chinking along with chunking:
import nltk
Digdug = nltk.RegexpParser(r"""
    CHUNK_AND_CHINK:
    {<.*>+}    # Chunk everything
    }<CC>+{    # Chink sequences of CC
    """)
sentence = nltk.pos_tag(nltk.word_tokenize("There are no large collections present but there is spinal canal stenosis."))
result = Digdug.parse(sentence)
for subtree in result.subtrees(filter=lambda t: t.label() == 'CHUNK_AND_CHINK'):
    print(subtree)
Chinking basically excludes what we don't need from a chunk phrase - 'but' in this case.
For more details , see: http://www.nltk.org/book/ch07.html

Regex to find&replace movie names python

I've been working on tweets about different movies (using the Twitter Search API) and now I wanted to replace the match by a fixed string.
I've been struggling with "XMen Apocalypse" because there are many ways to find this on tweets.
I looked for "XMen Apocalypse", "X-Men Apocalypse", "X Men Apocalypse", "XMen", "X-Men", "X Men" and it retrieved matches that also include "#xmenmovie", "#xmen", "x-men: apocalypse", etc...
This is the regex that I have:
xmen_regex = re.compile("(((#)x[\-]?men:?(apocalypse)?)|(x[\-]? ?men[:]?[ ]?(apocalypse)?))")
def re_place_moviename(text, compiled_regex):
    return re.sub(compiled_regex, "MOVIE_NAME", text.lower())
I have tested with RegExr, but it still isn't accurate on some edge cases like: '#xmen blabla' -> replace -> '#MOVIE_NAME blabla' or 'MOVIE_NAMEblabla'.
So, is there a better way to do this? Maybe compiling different regexes (in increasing length order?) and applying them separately?
edit
Constraints (or summary):
I want to find "x-men", "x men", "xmen"
All of 1 + " apocalypse"
All of 1 + ": apocalypse"
Also: "#xmen", "#x-men", "#xmenapocalypse", "#x-menapocalypse"
None of these may be a substring ("#xmenmovie" or "lovexmen perfect"); there must be at least 1 space at the beginning and end of the expression.
PS: Other movies are easier, but for xmen and others like Rogue One there are many ways to express them, and we want to catch most of them.
PS1: I know that \b can help, but I couldn't understand how it works.
This one should do the job:
(?:^|\s)#x[ -]?men:?\s?apocalypse\b
In case of replacement, if you want to keep the space before, use a capture group and put it in the replacement part:
(^|\s)#x[ -]?men:?\s?apocalypse\b
Explanation:
(?:^|\s) : non capture group, begining of string or a space
# : #
x : x
[ -]? : optional space or dash
men : men
:? : optional colon
\s? : optional space
apocalypse : apocalypse
\b : word boundary
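A quick check of the capture-group variant (sample tweet invented; note the question lowercases tweets before replacing, which is why the pattern has no ignore-case flag):

```python
import re

pattern = re.compile(r'(^|\s)#x[ -]?men:?\s?apocalypse\b')

# \1 restores the captured leading space in the replacement
result = pattern.sub(r'\1MOVIE_NAME', 'see #x-men apocalypse tonight')
print(result)
# → see MOVIE_NAME tonight
```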
This should work per your (vague) constraints:
(?i)(?<!#)x[- ]?men(?!:)( apocalypse)?
(?i) -- ignore case flag
(?<!#) -- no # before 'xmen'
[- ]? -- optional dash or space
(?!:) -- no colon after 'xmen'
( apocalypse)? -- optional apocalypse string
Edit: Instead of requiring a space in front/behind, I think having a boundary (\b) would be more fitting, i.e. (?i)\b(?<!#)(x[- ]?men:?\s?(?:apocalypse)?)\b as 'xmen' may start the sentence.
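The edited pattern behaves like this on a made-up tweet (hashtag and wording invented for illustration):

```python
import re

pattern = re.compile(r'(?i)\b(?<!#)(x[- ]?men:?\s?(?:apocalypse)?)\b')

text = 'I love X-Men Apocalypse but #xmen fans disagree'
# The lookbehind skips '#xmen'; the word boundary blocks 'lovexmen'-style substrings
replaced = pattern.sub('MOVIE_NAME', text)
print(replaced)
# → I love MOVIE_NAME but #xmen fans disagree
```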
