Packages like python-docx seemed ineffective in this case, as they are used for creating and updating .docx files.
Even if I just get the full text, I can write some algorithm to extract the links from it.
I need help!
If all of your links start with http:// or www., you could use a regular expression. From this post, said regular expression would be \b(?:https?://|www\.)\S+\b
If you are using Python 3, you might try:
import re

doc = '...'  # use python-docx to put the document text in here
matches = re.findall(r'\b(?:https?://|www\.)\S+\b', doc)  # raw string so \b stays a word boundary
if matches:
    print(matches)
Source: Python Documentation
If this is correct, it will locate every piece of text within doc that starts with http://, https://, or www. and print the list of matches.
Update: whoops, wrong solution
From the python-docx documentation, here is a working solution:
from docx import Document

document = Document("foobar.docx")
doc = ''  # only use if you want the entire document
for paragraph in document.paragraphs:
    text = paragraph.text
    # with text, run your algorithm on it, paragraph by paragraph. if you want the whole thing:
    doc += text + '\n'  # the newline keeps paragraphs from running together
# now run your algorithm on doc
My Python is a bit rusty, so I might have made an error.
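Putting the two halves together, a minimal end-to-end sketch (the file name foobar.docx is just a placeholder):

import re
from docx import Document

document = Document("foobar.docx")
full_text = '\n'.join(paragraph.text for paragraph in document.paragraphs)

# links that start with http://, https:// or www.
links = re.findall(r'\b(?:https?://|www\.)\S+\b', full_text)
print(links)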
I want to find a specific regex in a docx document. I installed python-docx and I can find strings in my text. However, I want to use regular expressions.
So far my code is:
import re
from docx import Document

doc = Document('categoriemanzoni.docx')
match = re.search(r"\[(['prima']+(?!\S))", doc)
for paragraph in doc.paragraphs:
    paragraph_text = paragraph.text
    if match in paragraph.text:
        print('ok')
It also seems to me that it doesn't read all the paragraphs. How can I fix this?
Your code applies the regex (which is itself faulty) in the wrong place: re.search expects a string, not a Document object, and it runs once before the loop instead of against each paragraph. You probably want something like this:
import re
from docx import Document

doc = Document('categoriemanzoni.docx')
regex = re.compile(r"\[prima(?!\S)")
for paragraph in doc.paragraphs:
    if regex.search(paragraph.text):
        print('ok')
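If matches still seem to be missing, note that doc.paragraphs only covers top-level paragraphs; text in tables is stored separately. A sketch of how you might scan table cells as well (same file and regex as above):

for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            if regex.search(cell.text):
                print('ok')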
import re
import docx2txt

test_doc = docx2txt.process('story.docx')
docu_Regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')  # e.g. phone numbers such as 415-555-1234
mo = docu_Regex.findall(test_doc)
print(mo)
I used this as an example. It worked the way I needed it to.
Let's use the word technology for my example.
I want to search all the text on a webpage. For each piece of text, I want to find every element tag containing a string with the word "technology" and print only the contents of those element tags. Please help me figure this out.
words = soup.body.get_text()
for word in words:
    i = word.soup.find_all("technology")
    print(i)
You should search by text, which can be accomplished using the text argument (renamed to string in modern BeautifulSoup versions), either via a function with a substring-in-string check:
for element in soup.find_all(text=lambda text: text and "technology" in text):
    print(element.get_text())
Or, via a regular expression pattern:
import re

for element in soup.find_all(text=re.compile("technology")):
    print(element.get_text())
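A self-contained sketch of the same idea (the sample HTML is made up; .parent shows which tag contains each hit):

from bs4 import BeautifulSoup

html = '''
<html><body>
  <p>Modern technology changes quickly.</p>
  <p>An unrelated paragraph.</p>
  <li>Educational technology tools</li>
</body></html>
'''

soup = BeautifulSoup(html, "html.parser")

# find_all(string=...) returns the matching text nodes;
# .parent is the tag that directly contains each one
for text_node in soup.find_all(string=lambda s: s and "technology" in s):
    print(text_node.parent.name, "->", text_node.strip())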
Since you are looking for data inside of an 'HTML structure' and not a typical data structure, you are going to have to nearly write an HTML parser for this job. Python doesn't normally know that "some string here" relates to another string wrapped in brackets somewhere else.
There may be a library for this, but I have a feeling that there isn't :(
I have been testing and looking in books and forums for hours without finding the answer, so here is the tricky question.
I am parsing an HTML file, and BeautifulSoup gives me both a text version and an HTML version of the content.
Now I want to split the text into sentences (treating [?!. ]* as the end of a sentence), so I have:
sentences_txt = re.compile("[^?!.]+?[?!. ]*").findall(txt) # this works: returns a list of sentences
and I want to make a list of the same sentences, but for their HTML counterparts, like:
sentences_html = re.compile("[^?!.]+?[?!. ]*").findall(html) # this doesn't work
It doesn't work because, when there is markup, it will split in the middle of a tag as soon as it finds one of the characters [?!.].
==> How can I split an HTML text on [?!.] when those characters are not inside a tag?
I tried some things using (?...) lookarounds:
sentences_html = re.compile("(?:<.*>)*[^?!.]+?[?!. ]*").findall(html) # doesn't work
sentences_html = re.compile("(?<!<)[^?!.]+?(?!>)[?!. ]*").findall(html) # doesn't work
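One way to approach this (a sketch, not from the original thread): treat each complete tag as an opaque unit, so the terminators only count outside <...>:

import re

html = '<p>First sentence. Second one! See <a href="http://ex.com/a.b">this</a> now?</p>'

# a sentence is a run of complete tags or non-terminator characters,
# followed by optional terminators; [?!.] inside <...> is consumed by the tag branch
sentence_re = re.compile(r'(?:<[^>]*>|[^<?!.])+[?!. ]*')
print(sentence_re.findall(html))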
I'm really a noob with regular expressions; I tried to do this on my own, but I couldn't understand from the manuals how to approach it. I'm trying to find all img tags in a given piece of content. I wrote the below, but it's returning None:
content = i.content[0].value
prog = re.compile(r'^<img')
result = prog.match(content)
print result
Any suggestions?
Multipurpose solution:
import re

image_re = re.compile(r"""
    (?P<img_tag><img)\s+     # tag starts
    [^>]*?                   # other attributes
    src=                     # start of src attribute
    (?P<quote>["'])?         # optional opening quote
    (?P<image>[^"'>]+)       # image file name
    (?(quote)(?P=quote))     # closing quote, if one was opened
    [^>]*?                   # other attributes
    >                        # end of tag
    """, re.IGNORECASE | re.VERBOSE)  # re.VERBOSE allows a readable regex with comments
image_tags = []
for match in image_re.finditer(content):
    image_tags.append(match.group("img_tag"))

# print found image_tags
for image_tag in image_tags:
    print image_tag
As you can see, the regex definition contains
(?P<group_name>regex)
This lets you access the matched groups by group_name rather than by number, which helps readability. So, if you want to collect all the src attributes of the img tags instead, just write:
for match in image_re.finditer(content):
    image_tags.append(match.group("image"))
After this, the image_tags list will contain the src values of the image tags.
Also, if you need to parse HTML, there are tools designed exactly for this purpose, for example lxml, which uses XPath expressions.
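For instance, a minimal lxml sketch (assuming content holds the HTML string):

from lxml import html

tree = html.fromstring(content)   # parse the HTML string
for img in tree.xpath('//img'):   # every <img> element in the document
    print(img.get('src'))         # its src attribute, or None if absent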
I don't know Python but assuming it uses normal Perl compatible regular expressions...
You probably want to look for "<img[^>]+>" which is: "<img", followed by anything that is not ">", followed by ">". Each match should give you a complete image tag.
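In Python, that suggestion translates to something like this (the sample content string is made up):

import re

content = '<p>Hello <img src="a.png" alt="A"> world <img src="b.jpg"></p>'
for tag in re.findall(r'<img[^>]+>', content, re.IGNORECASE):
    print(tag)  # prints each complete <img ...> tag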
Using Python I want to replace all URLs in a body of text with links to those URLs, like what Gmail does.
Can this be done in a one liner regular expression?
Edit: by body of text I just meant plain text - no HTML
You can load the document with a DOM/HTML parsing library (see html5lib), grab all the text nodes, match them against a regular expression, and replace each text node with a regex replacement of the URI with an anchor around it, using a PCRE such as:
/(https?:[;\/?\\#&=+$,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%][\;\/\?\:\#\&\=\+\$\,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%#]*|[KZ]:\\*.*\w+)/g
I'm quite sure you can scrounge around and find some sort of utility that does this; I can't think of any off the top of my head, though.
Edit: Try using the answers here: How do I get python-markdown to additionally "urlify" links when formatting plain text?
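Here is a sketch of the text-node approach using BeautifulSoup (the simple https?://\S+ pattern is an assumption, not the full PCRE above):

import re
from bs4 import BeautifulSoup

URL_RE = re.compile(r'(https?://\S+)')

def linkify_html(markup):
    soup = BeautifulSoup(markup, "html.parser")
    # a regex passed as string= matches text nodes containing a URL
    for node in soup.find_all(string=URL_RE):
        if node.parent.name == 'a':
            continue  # already inside a link
        replacement = URL_RE.sub(r'<a href="\1">\1</a>', node)
        node.replace_with(BeautifulSoup(replacement, "html.parser"))
    return str(soup)

print(linkify_html('<p>Visit http://example.com today</p>'))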
import re

# bare IP, scheme://, or a www./ftp. prefix, then host, optional port and path
urlfinder = re.compile(
    r'(([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
    r'|(?:news|telnet|nntp|file|http|ftp|https)://'
    r'|(?:www|ftp)[-A-Za-z0-9]*\.)'
    r'[-A-Za-z0-9.]+(?::[0-9]+)?'
    r'(?:/[-A-Za-z0-9_$.+!*(),;:@&=?/~#%]*)?)'
)

def urlify2(value):
    return urlfinder.sub(r'<a href="\1">\1</a>', value)

Call urlify2 on a string, and I think that's it if you aren't dealing with a DOM object.
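For example:

print(urlify2('see http://example.com/docs for details'))
# -> see <a href="http://example.com/docs">http://example.com/docs</a> for details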
I hunted around a lot, tried these solutions and was not happy with their readability or features, so I rolled the following:
import re

_urlfinderregex = re.compile(r'http([^\.\s]+\.[^\.\s]*)+[^\.\s]{2,}')

def linkify(text, maxlinklength):
    def replacewithlink(matchobj):
        url = matchobj.group(0)
        text = unicode(url)
        if text.startswith('http://'):
            text = text.replace('http://', '', 1)
        elif text.startswith('https://'):
            text = text.replace('https://', '', 1)
        if text.startswith('www.'):
            text = text.replace('www.', '', 1)
        if len(text) > maxlinklength:
            halflength = maxlinklength / 2
            text = text[0:halflength] + '...' + text[len(text) - halflength:]
        return '<a class="comurl" href="' + url + '" target="_blank" rel="nofollow">' + text + '<img class="imglink" src="/images/linkout.png"></a>'

    if text != None and text != '':
        return _urlfinderregex.sub(replacewithlink, text)
    else:
        return ''
You'll need to supply a link-out image, but that's pretty easy. This is aimed specifically at user-submitted text like comments, which I assume is usually what people are dealing with.
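A quick usage example (Python 2, matching the unicode call above; the URL is made up):

print linkify(u'See http://example.com/some/very/long/path for details', 20)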
/\w+:\/\/[^\s]+/
When you say "body of text" do you mean a plain text file, or body text in an HTML document? If you want the HTML document, you will want to use Beautiful Soup to parse it; then, search through the body text and insert the tags.
Matching the actual URLs is probably best done with the urlparse module. Full discussion here: How do you validate a URL with a regular expression in Python?
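A small sketch of that idea (urlparse lives in urllib.parse on Python 3; on Python 2 it is the urlparse module):

from urllib.parse import urlparse

def looks_like_url(candidate):
    parts = urlparse(candidate)
    # a scheme plus a network location is a reasonable minimum
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

print(looks_like_url('http://example.com/x'))  # True
print(looks_like_url('notaurl'))               # False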
Gmail is a lot more liberal when it comes to URLs, but it is not always right either: e.g. it will make www.a.b into a hyperlink, as well as http://a.b, but it often fails because of wrapped text and uncommon (but valid) URL characters.
See Appendix A, Collected BNF for URI, for the syntax, and use that to build a reasonable regular expression that also considers what surrounds the URL. You'd be well advised to think about a couple of scenarios for where URLs might end up (trailing punctuation at the end of a sentence, for example).