How to use regular expressions with python docx?


I want to find text matching a specific regex in a docx document. I installed python-docx and I can find plain strings in my text, but I want to use regular expressions.
So far my code is:
import re
from docx import Document
doc = Document('categoriemanzoni.docx')
match = re.search(r"\[(['prima']+(?!\S))", doc)
for paragraph in doc.paragraphs:
    paragraph_text = paragraph.text
    if match in paragraph.text:
        print('ok')
It also seems to me that it doesn't read all paragraphs. How can I fix this?

Your code applies the regex (which is itself faulty) in the wrong place: re.search needs a string to search, not the Document object, and its result is a match object, not something you can test with in. You probably want something like this:
import re
from docx import Document
doc = Document('categoriemanzoni.docx')
regex = re.compile(r"\[prima(?!\S)")
for paragraph in doc.paragraphs:
    if regex.search(paragraph.text):
        print('ok')
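
On the remark that not all paragraphs seem to be read: in python-docx, doc.paragraphs only returns the paragraphs of the document body, so text inside tables is not included. The following is a minimal sketch that also walks table cells. The helper names iter_docx_texts and find_matches are made up for illustration, and headers, footers, nested tables and text boxes are still not covered.

```python
import re


def iter_docx_texts(path):
    """Yield the text of body paragraphs, then of paragraphs inside tables."""
    # python-docx is imported here so the rest of the sketch runs without it
    from docx import Document

    doc = Document(path)
    for paragraph in doc.paragraphs:
        yield paragraph.text
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    yield paragraph.text


def find_matches(texts, pattern):
    """Return (index, matched text) pairs for every regex hit in texts."""
    regex = re.compile(pattern)
    return [
        (index, match.group(0))
        for index, text in enumerate(texts)
        for match in regex.finditer(text)
    ]


# Usage with the file name from the question:
# print(find_matches(iter_docx_texts('categoriemanzoni.docx'), r"\[prima(?!\S)"))
```

Keeping the regex step separate from the file reading makes it easy to check the pattern against a plain list of strings first.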

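import re
import docx2txt
# docx2txt.process returns the document's text as one string
test_doc = docx2txt.process('story.docx')
# matches phone-number-like patterns such as 123-456-7890
docu_Regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = docu_Regex.findall(test_doc)
print(mo)
I used this as an example, and it worked the way I needed it to.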

Related

Regex extract paragraph based on 2 regex match

I am working on a Python automation script in which I want to extract a specific paragraph based on regex matches, but I am stuck on how to extract the paragraph. The following example shows my case:
Solution : (Consistent Pattern)
The paragraph I want to extract (Inconsistent Pattern)
Remote value: x (Consistent Pattern)
The following is the program I am currently working on. It would be great if anyone could enlighten me!
import re
test= 'Solution\s:'
test1='Remote'
with open('<filepath>', 'r') as extract:
    lines=extract.readlines()
    for line in lines:
        x = re.search(test, line)
        y = re.search(test1, line)
        if x is not y:
            f4.write(line)
            print('good')
        else:
            print('stop')

This can be done with a single regular expression, for example:
import re
text = """
Solution :
The paragraph I
want to extract
Remote
Some useless text here
Solution :
Another paragraph
I want to
extract
Remote
"""
m = re.findall(r"Solution\s:(.*?)Remote", text, re.DOTALL | re.IGNORECASE)
print(m)
Here text stands for the text of interest (read in from a file, for example), from which we extract every portion between the sentinel patterns Solution\s: and Remote. re.DOTALL lets . match newlines, so a paragraph can span several lines, and the non-greedy .*? stops at the first Remote after each Solution. re.IGNORECASE makes the sentinels match regardless of capitalization.
The above code outputs:
['\nThe paragraph I\nwant to extract\n', '\nAnother paragraph\nI want to\nextract\n']
Read the Python re library documentation at https://docs.python.org/3/library/re.html for more details.
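
For the exact layout shown in the question, the sentinels can also be anchored to line starts so that a stray "Remote" inside a paragraph does not end the match early. This is only a sketch: it assumes every block starts with a line beginning "Solution :" and ends at a line beginning "Remote value:", and the function name extract_paragraphs is made up for illustration.

```python
import re

# ^ matches at the start of each line because of re.MULTILINE;
# re.DOTALL lets (.*?) run across line breaks.
SECTION = re.compile(
    r"^Solution\s*:[^\n]*\n(.*?)^Remote value:",
    re.DOTALL | re.MULTILINE,
)


def extract_paragraphs(text):
    """Return the text between each Solution line and the next Remote value line."""
    return [body.strip() for body in SECTION.findall(text)]


sample = """Solution :
The paragraph I
want to extract
Remote value: x
Unrelated line
Solution :
Another paragraph
Remote value: y
"""

print(extract_paragraphs(sample))
# ['The paragraph I\nwant to extract', 'Another paragraph']
```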

discord.py Clean Content for emojis?

How can I handle custom emojis and clean them, for example turn <a:load:742504529278402560> into just :load:?
There doesn't seem to be a built-in way to do this in the library.

Here is a way:
import re
def cleanemojis(string):
    return re.sub(r"<a?:([a-zA-Z0-9_-]{1,32}):[0-9]{17,21}>", r":\1:", string)
>>> cleanemojis("Loading <a:load:742504529278402560>")
'Loading :load:'

You would probably want to use a regex. Try this one:
import re
pattern = r":\w*:"
# NEXT IS JUST A TEST
string = "<a:load:742504529278402560>"
result = re.search(pattern, string)
print(result.group())

As Arie Chertkov said, that would be the ideal way to do it. As per your request, I've written it into a function.
import re
pattern = r":\w*:"
def clean(string):
    # returns the first :name: found; raises AttributeError if there is no match
    result = re.search(pattern, string)
    return result.group()
print(clean("<a:load:742504529278402560>"))
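
Note that the two re.search answers return only the first :name: they find and drop the rest of the message, while the re.sub approach from the first answer rewrites every emoji tag in place. A short sketch of the latter on a message with two emojis follows. The pattern is reused from that answer, while the function name clean_emojis, the sample message and the second emoji ID are made up for illustration.

```python
import re

# Pattern reused from the first answer: optional "a" (animated emoji),
# the emoji name, then its numeric ID.
CUSTOM_EMOJI = re.compile(r"<a?:([a-zA-Z0-9_-]{1,32}):[0-9]{17,21}>")


def clean_emojis(text):
    """Replace every custom emoji tag in text with its :name: form."""
    return CUSTOM_EMOJI.sub(r":\1:", text)


message = "Loading <a:load:742504529278402560> done <:ok_hand:123456789012345678>"
print(clean_emojis(message))
# Loading :load: done :ok_hand:
```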

Regex to find text after linebreak in URL

I want to use a regex to get part of a string: I want to drop Kerberos:// and the host part that follows it, and keep only the username.
import re
text = 'Kerberos://DME.DMS.WORLD.DMSHEN/Username'
reg1 = re.compile(r"^((Kerberos?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$",text)
print(reg1)
Desired output:
Username
I am new to regex and tried this one, but it doesn't seem to work.

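Your pattern itself matches the string. What fails is the call: the second argument of re.compile is for flags, not for the text to search, so either compile the pattern and then call match on the text, or pass both to re.match. I am also assuming you would like to make most of the groups non-capturing (you can do that by adding ?: at the start of each group) and to name the one you need.
That gives you the following:
re.match(r"^(?:(?:Kerberos?|ftp):\/)?\/?(?:[^:\/\s]+)(?:(?:\/\w+)*\/)(?P<u>[\w\-\.]+[^#?\s]+)(?:.*)?(?:#[\w\-]+)?$", text).group('u')
Also, for future reference, try https://regex101.com/: it offers an easy way to test your regex and explains each part of it.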

How about this simple one:
import re
text = 'Kerberos://DME.DMS.WORLD.DMSHEN/Username'
reg1 = re.findall(r"//.*/(.*)", text)
print(''.join(reg1))
# Username

If you want, you can use split instead of a regex:
text = 'Kerberos://DME.DMS.WORLD.DMSHEN/Username'
m = text.split('/')[-1]
print(m)
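
Two further variants, sketched under the assumption that the username is always the last path segment and that the string has no trailing slash, query or fragment. The first is a much shorter regex; the second lets urllib.parse from the standard library split the URL. The variable names username_re and username_url are made up for illustration.

```python
import re
from urllib.parse import urlsplit

text = 'Kerberos://DME.DMS.WORLD.DMSHEN/Username'

# Option 1: the last run of non-slash characters at the end of the string.
match = re.search(r"[^/]+$", text)
username_re = match.group(0) if match else None

# Option 2: split the URL into its parts, then take the last path segment.
username_url = urlsplit(text).path.rsplit("/", 1)[-1]

print(username_re, username_url)
# Username Username
```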

I'm having trouble with web scraping to Python

I'm very new to coding and I've tried to write code that fetches the current price of Litecoin from CoinMarketCap. However, I can't get it to work: it prints an empty list.
import urllib
import re
htmlfile = urllib.urlopen('https://coinmarketcap.com/currencies/litecoin/')
htmltext = htmlfile.read()
regex = 'span class="text-large2" data-currency-value="">$304.08</span>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print(price)
The output is "[]". The problem is probably minor, but I would appreciate any help.

Regular expressions are generally not the best tool for processing HTML. I suggest looking at something like BeautifulSoup.
For example:
import urllib.request  # Python 3; the question's urllib.urlopen is the Python 2 spelling
import bs4
f = urllib.request.urlopen("https://coinmarketcap.com/currencies/litecoin/")
soup = bs4.BeautifulSoup(f, "html.parser")
# find the first tag, whatever its name, that has a data-currency-value attribute
print(soup.find(attrs={"data-currency-value": True}).text)
At the time of writing, this printed "299.97".
This probably does not perform as well as a regex for this simple case. However, see "Using regular expressions to parse HTML: why not?".

You need to change your regex and add a group in parentheses to capture the value. Note also that the unescaped $ in your pattern is an end-of-string anchor, not a literal dollar sign, so the pattern as written can never match.
To match something like <span class="text-large2" data-currency-value>300.59</span>, you need this regex:
regex = 'span class="text-large2" data-currency-value>(.*?)</span>'
The (.*?) group captures the number.
You get:
['300.59']
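
To make the role of the capture group and of the escaped dollar sign concrete, here is a small sketch that runs on a fixed snippet instead of the live page. The snippet is the markup quoted in the question; the real page may look different today, so treat the pattern as an illustration only.

```python
import re

# Markup as quoted in the question, used here as static test input.
html = '<span class="text-large2" data-currency-value="">$304.08</span>'

# \$ matches a literal dollar sign; ([\d.,]+) captures the number after it.
pattern = re.compile(
    r'<span class="text-large2" data-currency-value="">\$([\d.,]+)</span>'
)

match = pattern.search(html)
price = match.group(1) if match else None
print(price)
# 304.08
```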

How can I extract URLs from a docx file using Python?

Packages like python-docx seem ineffective in this case, since they are meant for creating and updating docx files.
If I could at least get the full text, I could write an algorithm to extract the links from it.
Any help is appreciated!

If all of your links start with http://, https://, or www., you could use a regular expression. From this post, said regular expression would be \b(?:https?://|www\.)\S+\b
If you are using Python 3, you might try:
import re
doc = '...' # use python-docx to put the text in here
# raw string, so that \b is a word boundary rather than a backspace character
matches = re.findall(r'\b(?:https?://|www\.)\S+\b', doc)
for match in matches:
    print(match)
Source: Python Documentation
This locates every piece of text within doc that starts with http://, https://, or www. and prints each one.
Update: whoops, wrong solution
From the python-docx documentation, here is a working solution:
from docx import Document
document = Document("foobar.docx")
doc = ''  # only needed if you want the entire document as one string
for paragraph in document.paragraphs:
    text = paragraph.text
    # run your algorithm on text, paragraph by paragraph; if you want the whole thing:
    doc += text + '\n'  # the newline keeps a URL from merging with the next paragraph
# now run your algorithm on doc
My Python is a bit rusty, so I might have made an error.
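
One limitation of searching the paragraph text is that it never contains the target of a link whose visible text is something else, such as "click here". A .docx file is a ZIP archive, and the targets of hyperlinks in the main body are listed in the part word/_rels/document.xml.rels. Below is a minimal standard-library sketch, assuming that part exists in the file. The function name hyperlink_targets is made up for illustration, and links in headers, footers and footnotes live in other .rels parts that this does not read.

```python
import zipfile
import xml.etree.ElementTree as ET

RELS_PATH = "word/_rels/document.xml.rels"
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def hyperlink_targets(docx_file):
    """Return the targets of all hyperlink relationships of the main document part.

    docx_file may be a path or a binary file-like object.
    Raises KeyError if the archive has no document.xml.rels part.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read(RELS_PATH))
    return [
        rel.get("Target")
        for rel in root.iter(REL_NS + "Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    ]


# Usage with the file name from the answer above:
# print(hyperlink_targets("foobar.docx"))
```

The regex approach and this one complement each other: the regex finds URLs typed out in the text, while the relationships list finds the targets of real hyperlinks.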
