Regular Expression for HTML artifacts - python

I have some text with HTML artifacts where the < and > of tags got dropped, so now I need something that will match a lowercase p followed by a capital letter, like
pThe next day they....
I also need something that will catch the trailing /p, which is easier. These need to be stripped, i.e. replaced with "" in Python.
What RE would I use for that? Thanks!
Stephan.

Try this:
re.sub(r"(/?p)(?=[A-Z]|$)", r"<\1>", str)
You might want to extend the boundary assertion (here (?=[A-Z]|$)) with additional characters like whitespace.

I got it. You use backreferences:
import re
smallBig = re.compile(r'[a-z]([A-Z])')
...
cleanedString = smallBig.sub(r'\1', dirtyString)
This removes the lowercase letter but keeps the capital letter in cases where the '<' and '>' of HTML tags were stripped and you are left with text like
pSome new paragraph text /p
Quick and dirty but it works in my case.
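For stripping both artifacts in one pass, a narrower pattern may be safer than removing any lowercase letter before a capital. The following is only a sketch: the name strip_p_artifacts and the sample sentence are made up for illustration, and it assumes the leftover p always starts a word and the leftover /p always ends one.

```python
import re

# "/p" at the end of a word, or a word-initial "p" directly before a capital.
# Both rules are assumptions about how the tags were damaged.
ARTIFACT = re.compile(r"/p\b|\bp(?=[A-Z])")

def strip_p_artifacts(text):
    """Remove leftover 'p' and '/p' markers from text whose < and > were lost."""
    return ARTIFACT.sub("", text)

print(strip_p_artifacts("pThe next day they left. /p"))
```

Words such as "help" or paths such as "a/path" are left alone because the p there is not at a word boundary in the required position.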

Related

Regex to split text file in python

I am trying to find a way to parse a string of a transcript into speaker segments (as a list). Speaker labels are denoted by the upper-casing of the speaker's name followed by a colon. The problem I am having is some names have a number of non upper-case characters. Examples might include the following:
OBAMA: said something
O'MALLEY: said something else
GOV. HICKENLOOPER: said something else entirely
I have written the following regex, but I am struggling to get it to work:
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)', mystring)
What I think I have written (and ideally what I want to do) is a command to split the string based on:
1. Find a newline
2. Use positive look-ahead for one or more uppercase characters
3. If upper-case characters are found look for optional characters from the list of periods, apostrophes, single spaces, and digits
4. If these optional characters are found, look for additional uppercase characters.
5. Crucially, find a colon symbol at the end of this sequence.
EDIT: In many cases, the content of the speech will have newline characters contained within it, and possibly colon symbols. As such, the only thing separating the speaker label from the content of speech is the sequence mentioned above.
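Just change (\ |\.|\'|\d) to [\ .\'\d] or (?:\ |\.|\'|\d). When the pattern contains a capturing group, re.split also returns the text of that group in the result list, which is what breaks your output; a character class or a non-capturing group avoids that.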
import re
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+[\ \.\'\d]*[A-Z]*:)', mystring)
print(parse_turns)
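To see the difference the capturing group makes, the two patterns can be run side by side on the string from the question. This is a small illustrative sketch; the variable names with_group and without_group are not from the original posts.

```python
import re

mystring = ("OBAMA: said something \nO'MALLEY: said something else \n"
            "GOV. HICKENLOOPER: said something else entirely")

# Capturing group: re.split inserts the last text captured by the group
# between the pieces.
with_group = re.split(r"\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)", mystring)

# Non-capturing group: only the pieces themselves are returned.
without_group = re.split(r"\n(?=[A-Z]+(?:\ |\.|\'|\d)*[A-Z]*:)", mystring)

print(len(with_group), with_group)
print(len(without_group), without_group)
```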
If it's true that the speaker's name and what they said are separated by a colon, then it might be simpler to move away from regex to do your splitting.
list_of_things = []
mystring = "OBAMA: Hi\nO'MALLEY: True Dat\nHUCK FINN: Sure thing\n"
lines = mystring.split("\n")  # first split the string into lines on the \n character
for line in lines:
    if not line.strip():
        continue  # skip blank lines, such as the one after the trailing \n
    colon_pos = line.find(":")  # position of the first colon in the line
    speaker, utterance = line[:colon_pos].strip(), line[colon_pos + 1:].strip()
    list_of_things.append((speaker, utterance))
At the end, you should have a neat list of tuples containing speakers, and the things they said.
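The question's edit says that a speech may itself contain newlines and colons, so a purely line-by-line split would cut such turns apart. One possible way to combine the two answers is sketched below: split on the speaker-label lookahead, then separate the label from the speech at the first colon. The function name parse_turns and the sample transcript are hypothetical.

```python
import re

# Split only at newlines that are followed by something shaped like a speaker label.
SPEAKER_BREAK = re.compile(r"\n(?=[A-Z]+[ .'\d]*[A-Z]*:)")

def parse_turns(transcript):
    """Return (speaker, speech) tuples; speech may span lines and contain colons."""
    turns = []
    for chunk in SPEAKER_BREAK.split(transcript):
        speaker, _, speech = chunk.partition(":")  # split at the first colon only
        turns.append((speaker.strip(), speech.strip()))
    return turns

sample = "OBAMA: first line\nsecond line: with a colon\nGOV. HICKENLOOPER: reply"
print(parse_turns(sample))
```

A continuation line that happens to start with an upper-case word followed by a colon would still be mistaken for a new speaker, so this only works as well as the label pattern does.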

regex: replace hyphens with en-dashes with re.sub

I am using a small function to loop over files so that any hyphens - get replaced by en-dashes – (alt + 0150).
The function I use adds some regex flavor to a solution in a related problem (how to replace a character INSIDE the text content of many files automatically?)
def mychanger(fileName):
    with open(fileName,'r') as file:
        str = file.read()
        str = str.decode("utf-8")
        str = re.sub(r"[^{]{1,4}(-)","–", str).encode("utf-8")
    with open(fileName,'wb') as file:
        file.write(str)
I used the regular expression [^{]{1,4}(-) because the search is actually performed on latex regression tables and I only want to replace the hyphens that occur around numbers.
To be clear: I want to replace all hyphens EXCEPT in cases where we have genuine latex code such as \cmidrule(lr){2-4}.
In this case there is a { close (within 3-4 characters max) to the hyphen and to the left of it. Of course, this hyphen should not be changed into an en-dash otherwise the latex code will break.
I think the condition on the left side is what matters for writing the correct exception in regex. Indeed, in a regression table you can have things like -0.062\sym{***} (that is, a { close to the right of the hyphen) and in that case I do want to replace the hyphen.
A typical line in my table is
variable & -2.061\sym{***}& 4.032\sym{**} & 1.236 \\
& (-2.32) & (-2.02) & (-0.14)
However, my regex does not appear to be correct. For instance, a (-1.2) will be replaced as –1.2, dropping the parenthesis.
What is the problem here?
Thanks!
I can offer the following two step replacement:
import re
text = r"-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
# "$" marks each replaced hyphen here; use "–" to insert a real en-dash
text = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])", r"\1$\2", text)
text = re.sub(r"(^|[^0-9])-(\d+)", r"\1$\2", text)
print(text)
The first replacement targets all ranges which are not of the LaTeX form {1-9}, i.e. are not contained within curly braces. The second replacement targets all numbers preceded by a non-digit or by the start of the string.
re.sub replaces the entire match. In this case that includes the non-{ characters preceding your -. You can wrap that bit in parentheses to create a \1 group and include it in your substitution (you also don't need the parentheses around the - in your pattern):
re.sub(r"([^{]{1,4})-",r"\1–", str)
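Note that [^{]{1,4} can match just the single digit in front of the hyphen, so a pattern of this shape still rewrites the hyphen inside {2-4}. A different technique, not taken from either answer, is to match the brace ranges explicitly and put them back unchanged while replacing every other hyphen. The sketch below assumes the only hyphens to protect look like {digits-digits}; the function name and sample line are made up for illustration.

```python
import re

# Alternative 1 captures a brace range such as {2-4}; alternative 2 is any other hyphen.
BRACE_RANGE_OR_HYPHEN = re.compile(r"(\{\d+-\d+\})|-")

def hyphens_to_en_dashes(text):
    """Replace hyphens with en-dashes, leaving {m-n} column ranges untouched."""
    # group(1) is set only when a brace range matched; otherwise emit an en-dash.
    return BRACE_RANGE_OR_HYPHEN.sub(lambda m: m.group(1) or "\u2013", text)

line = r"\cmidrule(lr){2-4} & (-2.32) & -2.061\sym{***}"
print(hyphens_to_en_dashes(line))
```

Because the replacement is computed per match, nothing before the hyphen is consumed, so the parenthesis in (-2.32) is kept.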

Python, eliminating lines within angle brackets with regex

I'm writing a python script to assign grammatical categories to words in several text files. In each text file, I have file headers within angle brackets <>. Throughout the texts there are also additional lines with information such as time stamps, page numbers, and questions from the transcriber. I want to remove these lines. This is basically what the text files look like:
<title Titipuru Supay>
<speaker name>
<sex female>
<dialect Pastaza>
<register narrative>
<contributor name>
chan; payguna serenkya man chiga;
<ima?>
payguna kirina man, chiga, mana
shayachira; ninagunan shi tujsirani nira:
illaparani nira shi illapay
<173>
pasasha, ima shi kasna nin, nisha,
Even though there are the same number of headers in each file, the other <> material varies, so I can't just eliminate specific lines. So I thought I'd try something simple like a re.sub statement that removes everything in between <>, including the brackets.
with open(file, encoding='utf-8') as file_in:
    text = file_in.read()
    re.sub(r"<.*>", " ", text)
I tried <.*> on pythex.org and regex101, and it worked in both places with a test string, but not in my script (yes, I have import re). I also tried other solutions like: \<.*\>
Am I just not getting the regex right, or is there something deeper here?
From what I understand, you may have several <...> on the same line. In this case, you are much safer with a negated character class solution:
text = re.sub(r"<[^>]*>", " ", text)
The text variable, of course, should be updated as Python strings are immutable, and the regex is now matching <, then zero or more characters other than >, and then >.
Strings are immutable, meaning they cannot be modified, only reassigned. The re.sub(...) is working, but it's returning a new string. Try this:
text = re.sub(r"<.*>", " ", text)
If this still doesn't work, please give us more information about your problem.
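Both points (reassigning the result, and greedy versus negated character class) can be checked on a short string. The sample below is a made-up line loosely based on the transcript in the question, with two bracketed items on the same line.

```python
import re

text = "<title Titipuru Supay>\nchan; payguna <ima?> serenkya <173> man"

# re.sub returns a new string; the result has to be assigned to be kept.
greedy = re.sub(r"<.*>", " ", text)      # runs from the first < to the last > on a line
negated = re.sub(r"<[^>]*>", " ", text)  # stops at the first > after each <

print(repr(greedy))
print(repr(negated))
```

With the greedy pattern the word between the two bracketed items on the second line is lost; with the negated class it is kept.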

Regex between quotes look ahead - Python

I have this:
myText = str(^123"I like to"^456&U"play video games and"$"eat cereal")
I want to extract everything in between (and including) quotation marks, split everything before and after the $ sign, and append them into a nested list. E.g.
myTextList = [["I like to","play video games and"],["eat cereal"]]
This is what I tried:
tempTextList = []
for text in re.findall('(?<=\$)"[^"]*"(?<!\^)',myText,re.DOTALL):
    tempTextList.append(text)
myTextList.append(tempTextList)
I used the website https://www.regex101.com/#python and tried almost everything I could think of...
(?!\$)"(?!\^\00\+\-\&)[^"].*"
etc...
The re.findall part doesn't really work the way I want it to.
Can someone point me in the right direction?
Thanks
You can use "[^"]*" regex with re.findall:
import re
s = 'myText = str(^123"I like to"^456&U"play video games and"$"eat cereal")'
print(re.findall(r'"[^"]*"', s))
It matches the double-quoted substrings you need, including the double quotes: ['"I like to"', '"play video games and"', '"eat cereal"'].
Note that "[^"]*" matches " followed by zero or more characters other than ", followed by ".
If you need to get the contents inside "..." without the double quotes, you can use a capturing group:
r'"([^"]*)"'
re.findall will then return only the captures from Group 1.
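To get the nested list from the question, one possibility is to split the text on $ first and then run the capturing pattern on each part. This is a sketch that assumes a $ never occurs inside a quoted string; the variable names are illustrative.

```python
import re

s = '^123"I like to"^456&U"play video games and"$"eat cereal"'

# One inner list per $-separated part, holding the quoted contents without the quotes.
my_text_list = [re.findall(r'"([^"]*)"', part) for part in s.split("$")]

print(my_text_list)
```

Dropping the parentheses from the pattern keeps the surrounding quotation marks in each item instead.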

Removing TAGS in a document

I need to find all the tags in a .txt file (an SEC filing) and remove them from the filing.
Well, as a beginner of Python, I used the following code to find the tags, but it returns None, None, ... and I don't know how to remove all the tags. My question is how to find all the tags <....> and remove all the tags so that the document contains everything but tags.
import re
tags = [re.search(r'<.+>', line) for line in mylist]
# mylist is the list of lines returned by open(filename, 'rU').readlines()
Thanks for your time.
Use something like this:
re.sub(r'<[^>]+>', '', open(filename, 'r').read())
Your current code is getting a None for each line that does not include angle-bracketed tags.
You probably want to use [^>] to make sure it matches only up to the first >.
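The None values and the fix can be seen on a few lines. The list below is a hypothetical stand-in for the lines of a filing, not real data.

```python
import re

TAG = re.compile(r"<[^>]+>")

# Hypothetical sample lines standing in for readlines() output.
lines = ["<DOCUMENT>\n", "<TYPE>10-K\n", "Annual report text\n"]

# search() returns a match object, or None when the line has no tag.
matches = [TAG.search(line) for line in lines]

# sub() returns each line with its tags removed; untagged lines come back unchanged.
cleaned = [TAG.sub("", line) for line in lines]

print(matches)
print(cleaned)
```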
re.sub(r'<.*?>', '', line)
Use re.sub with the <.*?> expression.
Well, for starters, you're going to need a different regex. The one you have will select everything between the first '<' and the last '>'. So the string:
I can type in <b>BOLD</b>
would render the match:
<b>BOLD</b>
The way to fix this is to use a lazy quantifier, so you should be using
<.+?>
to match HTML tags. And ultimately, you should be substituting, so:
re.sub(r'<.+?>', '', line)
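The difference between the greedy and the lazy quantifier is easy to check on the example string. This is just a short illustration; the variable names are not from the original answer.

```python
import re

s = "I can type in <b>BOLD</b>"

greedy = re.findall(r"<.+>", s)    # one match spanning from the first < to the last >
lazy = re.findall(r"<.+?>", s)     # one match per tag
stripped = re.sub(r"<.+?>", "", s)

print(greedy)
print(lazy)
print(stripped)
```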
Though, I suspect what you'd actually like to match is between the tags. Here's where a good lookahead can do wonders!
(?<=>).+?(?=<)
Looks crazy, but it breaks down pretty easy. Let's start with what you know:
.+?
matches a string of arbitrary length. The ? means it will match the shortest string possible (the laziness we added before).
(?<=...)
is a lookbehind. It looks behind the current position without including what it finds in the match.
(?=...)
is a lookahead. It works the same way as a lookbehind, but looks forward instead. Then with a little findall:
re.findall(r'(?<=>).+?(?=<)', line)
Now you can iterate over the resulting list and trim any unnecessary spaces that got left behind, which makes for some really nice output! Or, if you'd really like to use a substitution method (I know I would):
re.sub(r'\s*(?:<.+?>\s*)+', ' ', line)
the
\s*
will match any amount of whitespace attached to a tag, which you can then replace with one space, whittling down those unnerving double and triple spaces that often result from overly careful tagging. As a bonus, the
(?: ... )
is known as a non-capturing group (it won't give you smaller sub-matches in your result). It's not really necessary in this situation for your purposes, but groups are always useful things to think about, and it's good practice to only capture the ones you need. Tacking a + onto the end of that (as I did) will match as many tags as are right next to each other, collapsing them into a single space. So if the file has
This is <b> <i> overemphasized </b> </i>!
you'd get
This is overemphasized !
instead of
This is   overemphasized  !
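The three approaches can be compared on the sample line. The sketch below uses <[^>]+> for the tag itself, which is a swapped-in equivalent of the lazy pattern here rather than the answer's exact expression; the names TAG, TAG_RUN and between are illustrative.

```python
import re

line = "This is <b> <i> overemphasized </b> </i>!"

TAG = re.compile(r"<[^>]+>")
TAG_RUN = re.compile(r"\s*(?:<[^>]+>\s*)+")  # a run of tags plus the whitespace around them

plain = TAG.sub("", line)           # leaves the spaces that sat between the tags
collapsed = TAG_RUN.sub(" ", line)  # each run of tags becomes a single space

# Lookaround version: collect the text between a > and the next <, dropping blanks.
between = [part.strip()
           for part in re.findall(r"(?<=>).+?(?=<)", line)
           if part.strip()]

print(repr(plain))
print(repr(collapsed))
print(between)
```

Note that the lookaround version only returns text that sits between two tags, so the leading "This is" and the trailing "!" do not appear in its result.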
