Regex, Python and HTML: looking for dots outside markup? - python

I have been testing and searching in books and forums for hours without finding the answer, so here is the tricky question.
I am parsing an HTML file, and BeautifulSoup gives me text and HTML versions of the content.
Now I want to split the text into sentences (treating [?!. ]* as the end of a sentence), so I have:
sentences_txt = re.compile("[^?!.]+?[?!. ]*").findall(txt) # this works: returns a list of sentences
and I want to make a list of the same number of sentences, but for their HTML counterparts, like:
sentences_html = re.compile("[^?!.]+?[?!. ]*").findall(html) # this doesn't work
It doesn't work because when there is markup, it splits in the middle of the markup as soon as it finds one of the characters [?!.].
==> How can I split an HTML text according to [?!.] when they are not inside markup?
I tried some things using lookarounds:
sentences_html = re.compile("(?:<.*>)*[^?!.]+?[?!. ]*").findall(html) # doesn't work
sentences_html = re.compile("(?<!<)[^?!.]+?(?!>)[?!. ]*").findall(html) # doesn't work
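One approach is to treat a whole tag <...> as an atomic unit in the sentence body, so punctuation inside markup can never terminate a sentence. A minimal sketch (the sample string is made up for illustration):
import re

html = 'Hello <b>world</b>. Second <i>sentence!</i> Third one?'

# a tag <...> is consumed as a single unit, so ? ! . inside markup
# never end a sentence
sentence_re = re.compile(r'(?:<[^>]*>|[^<?!.])+[?!. ]*')
print(sentence_re.findall(html))
# ['Hello <b>world</b>. ', 'Second <i>sentence!', '</i> Third one?']
One limitation: a closing tag that follows the end-of-sentence punctuation is carried over to the start of the next chunk, and bare < or > characters in the text will confuse it.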

Python regular expression match start and end string and must contain specific word

I need some guidance with refining my regex. I have the source of a webpage and would like to extract the hrefs from it. The table doesn't have any IDs or classes, so I decided to use a regex; however, my expression seems to match more than I want.
I have tried the following:
http:\/\/(.*?)(?=.*showuri)(.*?)responseType=xml\">\/lnc\/
My start is http://, the end is responseType=xml">/lnc/, and I need the middle bit to contain the word showuri.
I am using Python 3.
The method I used for this is as follows (note that the XPath attribute selector is @href, not #href):
from lxml import html
import pandas as pd

doc = html.fromstring(text)
tr_elements = doc.xpath('//a/@href')  # all href attribute values
df = pd.DataFrame(tr_elements)
df.columns = ['URL']
From this point, I will drop the rows that do not contain "showuri".
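That last step can be done directly in pandas; a minimal sketch, assuming the DataFrame built above:
df = df[df['URL'].str.contains('showuri')]
print(df)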

How can i extract URLs from docx file using python?

Packages like python-docx seem ineffective in this case, as they are aimed at creating and updating .docx files.
Even if I just get the full text, I can write some algorithm to extract the links from it.
Need help!
If all of your links start with http:// or www., you could use a regular expression. From this post, said regular expression would be \b(?:https?://|www\.)\S+\b
If you are using Python 3, you might try:
import re

doc = '...'  # use python-docx to put the text in here
# note the raw string: in a plain string, \b is a backspace character
matches = re.findall(r'\b(?:https?://|www\.)\S+\b', doc)
if matches:
    print(matches)
Source: Python documentation
If this is correct, this will locate all text within doc that starts with http://, https://, or www. and print it.
Update: whoops, wrong solution
From the python-docx documentation, here is a working solution:
from docx import Document

document = Document("foobar.docx")
doc = ''  # only needed if you want the entire document
for paragraph in document.paragraphs:
    text = paragraph.text
    # with text, run your algorithm on it, paragraph by paragraph.
    # if you want the whole thing:
    doc += text
# now run your algorithm on doc
My Python is a bit rusty, so I might have made an error.
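Putting the two parts together, a minimal sketch of the whole task (the filename foobar.docx is made up, and the regex assumes every link starts with http://, https:// or www.):
import re
from docx import Document

URL_RE = re.compile(r'\b(?:https?://|www\.)\S+\b')

document = Document('foobar.docx')  # hypothetical file
urls = []
for paragraph in document.paragraphs:
    urls.extend(URL_RE.findall(paragraph.text))
print(urls)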

Need help finding the correct regex pattern for my string pattern

I'm terrible with RegEx patterns, and I'm writing a simple python program that requires splitting lines of a file into a 'content' part and a 'tags' part, and then further splitting the tags parts into individual tags. Here's a simple example of what one line of my file might look like:
The Beatles <music,rock,60s,70s>
I've opened my file and begun reading lines like this:
def Load(self, filename):
    file = open(filename, 'r')
    for line in file:
        # ignore comments and empty lines...
        if not line.startswith('#') and line.strip():
            # ...
Forgive my likely terrible Python, it's my first few days with the language. Anyway, next I was thinking it would be useful to use a regex to break my string into sections - with a variable to store the 'content' (for example, "The Beatles"), and a list/set to store each of the tags. As such, I need a regex (or two?) that can:
Split the raw part from the <> part.
And split the tags part into a list based on the commas.
Finally, I want to make sure that the content part retains its capitalization and inner spacing. But I want to make sure the tags are all lower-case and without white space.
I'm wondering if any of the regex experts out there can help me find the correct pattern(s) to achieve my goals here?
This is a solution that gets around the problem without regular expressions, by relying on multiple splits instead.
# This separates the string into the content and the remainder
content, tagStr = line.split('<')
# This splits tagStr into individual tags; [:-1] removes the trailing '>'
tags = tagStr[:-1].split(',')
print(content)
print(tags)
The problem with this is that it leaves trailing whitespace after the content.
You can remove this with:
content = content[:-1]
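If you do want a single regex, here is a minimal sketch along the lines the question asks for (the sample line is the one from the question; group 1 keeps the content's case and inner spacing, and the tags are stripped and lower-cased afterwards):
import re

line = 'The Beatles <music,rock,60s,70s>'

m = re.match(r'\s*(.*?)\s*<([^>]*)>', line)
if m:
    content = m.group(1)  # capitalization and inner spacing kept
    tags = [t.strip().lower() for t in m.group(2).split(',')]
    print(content)  # The Beatles
    print(tags)     # ['music', 'rock', '60s', '70s']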

Python string.replace does not replace

I have the following code.
def render_markdown(markdown):
    "Replaces markdown links with plain text"
    # non-greedy
    # also includes images
    RE_ANCHOR = re.compile(r"\[[^\[]*?\]\(.+?\)")
    # create a mapping
    mapping = load_mapping()
    anchors = RE_ANCHOR.findall(markdown)
    counter = -1
    while len(anchors) != 0:
        for link in anchors:
            counter += 1
            text, href = link.split('](')[:2]
            text = '-=-' + text[1:] + '-=-'
            text = text.replace(' ', '_') + '_' + str(counter)
            href = href[:-1]
            mapping[text] = href
            markdown = markdown.replace(link, text)
        anchors = RE_ANCHOR.findall(markdown)
    return markdown, mapping
However, the function does not replace all the links; almost none are replaced. I looked around on SO and found a lot of questions pertaining to this. The problems found were of the type:
abc.replace(x, y) instead of abc = abc.replace(x, y)
I am doing that, but the string is still not being replaced.
It looks like the cause is that your regex isn't matching the text you expected, so the loop body never runs.
Try posting some sample markdown, run your code, and add print statements so that you can see all the intermediate results (especially the anchors list). With that in hand, debugging will be much easier :-)
I don't understand why you are using replace when you are already using a regex. The re library gives you the tools to do what you want without needing to locate the string twice (once with the regex and once with replace).
For example, a MatchObject contains the start and end positions of the matched text, so you could use slicing to do your own string substitutions. But even that is unnecessary, as you can use re.sub and have the re library do the substitution for you when a match is found. You just need to define a callable which accepts the MatchObject and returns the text to replace it with.
def render_markdown(markdown):
    "Replaces markdown links with plain text"
    RE_ANCHOR = re.compile(r"\[[^\[]*?\]\(.+?\)")
    mapping = load_mapping()
    def replace_link(m):
        # process your link here...
        mapping[text] = href
        return text
    return RE_ANCHOR.sub(replace_link, markdown)
And if you wanted to make a few additions to your regular expression, you could have the regex break up your link into parts which would be accessible as groups on the match object. For example:
RE_ANCHOR = re.compile(r"\[([^\[]*?)\]\((.+?)\)")
# ...
text = m.group(1)
link = m.group(2)
All I did was add a set of parentheses around each of the text and link parts (inside the brackets). That said, I expect your regex is not sophisticated enough to match all possible links found within Markdown documents; for example, the Python-Markdown library permits at least six levels of nested brackets inside the "text" portion of a link. And don't forget about titles defined in a link (as (url "title")). But that is just scratching the surface, and it would be a separate question anyway.
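Putting those pieces together, a minimal runnable sketch (the plain dict and the sample string stand in for load_mapping() and real input from the question):
import itertools
import re

RE_ANCHOR = re.compile(r"\[([^\[]*?)\]\((.+?)\)")

def render_markdown(markdown):
    mapping = {}  # stands in for load_mapping()
    counter = itertools.count()
    def replace_link(m):
        text, href = m.group(1), m.group(2)
        key = ('-=-' + text + '-=-').replace(' ', '_') + '_' + str(next(counter))
        mapping[key] = href
        return key
    return RE_ANCHOR.sub(replace_link, markdown), mapping

md, links = render_markdown('See [the docs](https://example.com) for details.')
print(md)     # See -=-the_docs-=-_0 for details.
print(links)  # {'-=-the_docs-=-_0': 'https://example.com'}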

Python - regular expressions - find every word except in tags

How do I find all words except the ones in tags using the re module?
I know how to find something, but how do I do the opposite: search for every word except everything inside tags, and the tags themselves?
So far I have managed this:
f = open(filename, 'r')
data = re.findall(r"<.+?>", f.read())
Well, it finds everything inside <> tags, but how do I make it find every word except what's inside those tags?
I tried using ^ at the start of a pattern inside [], but then symbols such as . are treated literally, without their special meaning.
I also managed to solve this by splitting the string on '''\= <>"''', checking the whole string for words that appear inside <> tags (like align, right, td, etc.), and appending the words that are not inside <> tags to another list. But that is a bit of an ugly solution.
Is there some simple way to search for every word except anything inside <> and the tags themselves?
So let's say the string is 'hello 123 <b>Bold</b> <p>end</p>';
with re.findall, it would return:
['hello', '123', 'Bold', 'end']
Using a regex for this kind of task is not the best idea, as you cannot make it work for every case.
One solution that should catch most such words is the regex pattern
\b\w+\b(?![^<]*>)
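A quick check against the example string from the question (just an illustration; the negative lookahead rejects any word that is followed by a > before the next <, i.e. a word sitting inside a tag):
import re

s = 'hello 123 <b>Bold</b> <p>end</p>'
print(re.findall(r'\b\w+\b(?![^<]*>)', s))
# ['hello', '123', 'Bold', 'end']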
If you want to avoid using a regular expression, BeautifulSoup makes it very easy to get just the text from an HTML document (the import shown is for the current bs4 package):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_string, 'html.parser')
text = "".join(soup.find_all(text=True))
From there, you can get the list of words with split:
words = text.split()
Something like re.compile(r'<[^>]+>').sub('', string).split() should do the trick.
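Applied to the question's example string (substituting a space rather than the empty string avoids gluing adjacent words together when a tag sits between them):
import re

s = 'hello 123 <b>Bold</b> <p>end</p>'
print(re.compile(r'<[^>]+>').sub(' ', s).split())
# ['hello', '123', 'Bold', 'end']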
You might want to read this post about processing context-free languages using regular expressions.
Strip out all the tags (using your original regex), then match words.
The only weakness is if there are stray < characters in the string other than as tag delimiters, or if the HTML is not well formed. In that case it is better to use an HTML parser.
