Need to convert numbers to words in a text document - Python

I want to change all numbers in a document to words. The following two functions detect numbers in a string with a pattern and convert them to words using the num2words library.
import num2words
from re import sub
def _conv_num(match):
    word = num2words(match)
    return word
def change_to_word(text):
    normalized_text = sub(r'[^\s]*\d+[^\s]*', lambda m: _conv_num(m.group()), text)
    return normalized_text
When I use these two functions with the following code
txt = "there are 3 books"
change_to_word(txt)
Python raises this error:
TypeError: 'module' object is not callable
I tried to find a similar post, but it seems nobody has had the same issue, or I didn't search properly, so kindly help me with a solution or a link about it.
Regards

I would do it like this:
import re
from num2words import num2words
def _conv_num(match):
    return num2words(match.group())
def numbers_to_words(text):
    return re.sub(r'\b\d+\b', _conv_num, text)
Import the num2words function from the num2words module. A plain "import num2words" binds the module itself, and calling a module is what raises the TypeError.
For clarity, import the whole regular expression library and use re.sub() instead of just sub.
There is no need for a lambda if your conversion function takes a match instead of a string.
Use word boundary matchers (\b) in the regular expression.
Use a more descriptive name for the main function.
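A minimal sketch of the approach above, assuming a hypothetical spell_digit helper that stands in for num2words.num2words and only covers 0 through 9, so that it runs without the third-party package. It also reproduces the TypeError by calling a module object (re here) directly, which is the same mistake as calling the num2words module.

```python
import re

# Hypothetical stand-in for num2words.num2words, limited to single digits.
_SMALL = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def spell_digit(digits):
    """Return the English word for a one-digit string such as "3"."""
    return _SMALL[int(digits)]


def _conv_num(match):
    # re.sub passes a Match object to the callback, so extract the text first.
    return spell_digit(match.group())


def numbers_to_words(text):
    # \b[0-9]\b only matches a digit that stands alone as a token.
    return re.sub(r"\b[0-9]\b", _conv_num, text)


def call_module():
    """Call a module object directly and return the resulting error message."""
    try:
        re("x")  # a module is not a function, so this raises TypeError
    except TypeError as exc:
        return str(exc)


print(numbers_to_words("there are 3 books"))
print(call_module())
```

With the real library, the stand-in would presumably be replaced by "from num2words import num2words" and the pattern widened to \b\d+\b.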

Related

Python Regex re.sub not working as intended with dictionary lookup function?

I have one Py file working as a lookup like (constants.py):
constants = {'[%IP%]':'0.0.0.0'}
From my main file, I'm trying to replace a part of my string using the lookup.
from constants import constants
import re
path = r"[%IP%]\file.exe"
pattern = r'\[%.*?\%]'
def replace(match):
    return constants.get(match)
print(replace("[%IP%]")) # this bit works fine
print(re.sub(pattern, replace, path)) # this results in the matched pattern just vanishing
When I directly access the 'replace' function in my main file, it correctly outputs the matched result from the constants file dictionary.
But when I use the re.sub to substitute the pattern directly using the replace function with the lookup, it results in a string without the matched pattern at all.
So this is essentially the output for this (which isn't correct):
0.0.0.0
\file.exe
The desired output for this is:
0.0.0.0
0.0.0.0\file.exe
Please help me fix this.
The callback you pass to re.sub will be called with a Match object, not with a plain string. So the callback function needs to extract the string from that Match object.
Can be done like this:
def replace(match):
    return constants.get(match[0])
The solution that I'll mark as accepted is definitely one way to do it.
Another way is to use a lambda function and pass the matched string like this:
re.sub(pattern, lambda x: replace(x.group()), path)
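A small self-contained sketch of the dictionary lookup, with the constants dictionary inlined rather than imported from constants.py. Falling back to the matched text when a key is missing is an assumption added here, not part of the original question.

```python
import re

# Inlined here so the sketch runs on its own; the question keeps it in constants.py.
constants = {"[%IP%]": "0.0.0.0"}
pattern = r"\[%.*?%\]"


def replace(match):
    # The callback receives a Match object, so take the matched text first.
    key = match.group(0)
    # Assumption: leave unknown placeholders unchanged instead of deleting them.
    return constants.get(key, key)


path = r"[%IP%]\file.exe"
result = re.sub(pattern, replace, path)
print(result)
```

Without the fallback, constants.get returns None for an unknown key, and re.sub then substitutes an empty string, which is the vanishing behaviour described in the question.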

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file, but everything is on a single line (it could be a MongoDB file). Could someone please point me towards how I could extract values using a Python regex?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content":
I want to get the "fileAssetId" and "filename" values. I've tried to load the file with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But I get the following: 034b9317, 60d9, 45c2, b6d6, 0f24b59e1991
I'm not sure how to get the data as it is displayed.
How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get all matches, use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator over match objects.
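As a rough sketch of how the two lookaround patterns above could be used with re.findall and re.finditer (the sample text is a shortened, hypothetical version of the input from the question):

```python
import re

# Shortened, hypothetical sample of the single-line input.
text = ('{"d":{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
        '"filename":"Reports.pdf"},"createdBy":1531}')

asset_id_pattern = r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")'
filename_pattern = r'(?<="filename":").+?(?=")'

# findall returns the matched strings directly.
asset_ids = re.findall(asset_id_pattern, text)

# finditer yields Match objects; call .group() to get the text.
filenames = [m.group() for m in re.finditer(filename_pattern, text)]

print(asset_ids)
print(filenames)
```

Because the lookbehind and lookahead are not part of the match, the results contain only the values, without the key or the surrounding quotes.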
You can use python's walk method and check each entry with re.match.
If the string you have cannot be converted to a Python dict, you can use just a regex:
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))
Solution for your example:
import re
example_string = r'{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991
Try adding \n to the string that you are entering in to the file (\n means new line)
Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex
json_pattern = (
    r'(?(DEFINE)'
    r'(?P<whitespace>( |\n|\r|\t)*)'
    r'(?P<boolean>true|false)'
    r'(?P<number>-?(0|([1-9]\d*))(\.\d*[1-9])?([eE][+-]?\d+)?)'
    r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
    r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
    r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
    r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
    r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
    r'(?P<document>(?&object)|(?&array))'
    r')'
    r'(?&document)'
)
json_regex = regex.compile(json_pattern)
# json_document_text is the string holding the JSON text to check
match = json_regex.match(json_document_text)
You can change the last line of json_pattern to match individual objects rather than a whole document by replacing (?&document) with (?&object). The regex turned out simpler than I expected. I have not run exhaustive tests, but it has worked on the hundreds of files I tried. I will try to improve my answer if I find any issue when running it.

Replacing multiple links in a string in one line of code in python

I am new to the regular expression module. I am trying to remove all the links in a given exampleSentence in one line of code:
exampleSentence = exampleSentence.replace(link for link in re.findall(r'http://*',exampleSentence),'')
But I am getting this syntax error:
SyntaxError: Generator expression must be parenthesized if not sole argument
How to proceed with this?
You have several issues.
First, str.replace() replaces one substring with another in a given string; it does not take generators.
Example:
print('example'.replace('e', 'E'))
Next, if you want to remove text that matches a pattern, there is re.sub():
data = re.sub(
    r'[A-Za-z]+://[A-Za-z0-9-_]+.[A-Za-z0-9-_:%&;\?#/.=]+',  # the URI
    '',  # the replacement (nothing here)
    input_data
)
The URI regex was copied from #miko-trueman answer.
If all you want to do is remove all links from a string, you don't need a generator. The following will work.
import re
exampleString = "http://google.com is my personal library. I am not one for http://facebook.com, but I am in love with http://stackoverflow.com"
exampleString = re.sub(r"(?:\#|https?\://)\S+", '', exampleString)
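A small sketch of the same idea wrapped in a function. The name strip_links and the whitespace clean-up step are assumptions added for illustration. The pattern here is narrower than the one above and only targets http:// and https:// links.

```python
import re


def strip_links(text):
    """Remove http:// and https:// links, then tidy the leftover spaces."""
    # \S+ consumes everything up to the next whitespace character,
    # so trailing punctuation attached to a link is removed with it.
    without_links = re.sub(r"https?://\S+", "", text)
    # Collapse the runs of whitespace left behind where links used to be.
    return re.sub(r"\s{2,}", " ", without_links).strip()


example = ("http://google.com is my personal library. I am not one for "
           "http://facebook.com, but I am in love with http://stackoverflow.com")
print(strip_links(example))
```

The whole removal is still a single re.sub call; the second call only normalizes spacing and can be dropped if the original spacing should be kept.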

Using REGEX over split string

I'm trying to split a string into substrings, splitting on the 'AND' term, and after that clean each substring of "garbage".
The following code raises the error:
AttributeError: 'NoneType' object has no attribute 'group'
import re
def fun(self, str):
    for subStr in str.split('AND'):
        p = re.compile('[^"()]+')
        m = p.match(subStr)
        print (m.group())
It means the match is not found, and it returned None.
Note that you might want to use re.search here instead of re.match. re.match matches only at the beginning of the string while re.search can search anywhere in the string.
From the docs:
Python offers two different primitive operations based on regular
expressions: re.match() checks for a match only at the beginning of
the string, while re.search() checks for a match anywhere in the
string (this is what Perl does by default).
If you already know that, you can handle the None case using:
if m:
    print (m.group())
else:
    pass  # do something else
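To make the difference concrete, here is a minimal sketch using the pattern from the question. The helper name first_clean_run is hypothetical.

```python
import re

pattern = re.compile(r'[^"()]+')


def first_clean_run(text):
    """Return the first run of characters other than quotes and parentheses, or None."""
    # search scans the whole string; match would only try position 0.
    m = pattern.search(text)
    return m.group() if m else None


sample = '("abc")'
at_start = pattern.match(sample)      # None, because the string starts with "("
anywhere = first_clean_run(sample)    # finds the text inside the quotes
nothing = first_clean_run('()')       # no allowed character at all, so None

print(at_start, anywhere, nothing)
```

Checking the result before calling .group() avoids the AttributeError in both cases.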
If the code above is what you really want to do, wouldn't it be easier to remove the garbage first using str.translate? Something like:
def clean_and_split(x):
    # Delete the characters ", ( and ), then split on "AND"
    return x.translate(str.maketrans('', '', '"()')).split("AND")

Perform simple math on regular expression output? (Python)

Is it possible to perform simple math on the output from Python regular expressions?
I have a large file where I need to divide numbers following a ")" by 100. For instance, I would convert the following line containing )75 and )2:
((words:0.23)75:0.55(morewords:0.1)2:0.55);
to )0.75 and )0.02:
((words:0.23)0.75:0.55(morewords:0.1)0.02:0.55);
My first thought was to use re.sub using the search expression "\)\d+", but I don't know how to divide the integer following the parenthesis by 100, or if this is even possible using re.
Any thoughts on how to solve this? Thanks for your help!
You can do it by providing a function as the replacement:
import re
s = "((words:0.23)75:0.55(morewords:0.1)2:0.55);"
s = re.sub(r"\)(\d+)", lambda m: ")" + str(float(m.group(1)) / 100), s)
print(s)
# ((words:0.23)0.75:0.55(morewords:0.1)0.02:0.55);
Incidentally, if you wanted to do it using BioPython's Newick tree parser instead, it would look like this:
from io import StringIO
from Bio import Phylo
# assuming you want to read from a string rather than a file
tree = Phylo.read(StringIO(s), "newick")
for c in tree.get_nonterminals():
    if c.confidence is not None:
        c.confidence = c.confidence / 100
print(tree.format("newick"))
(while this particular operation takes more lines than the regex version, other operations involving trees might be made much easier with it).
The replacement expression for re.sub can be a function. Write a function that takes the matched text, converts it to a number, divides it by 100, and then returns the string form of the result.
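A short sketch of that description with a named function instead of a lambda. The function name divide_by_100 is only illustrative.

```python
import re


def divide_by_100(match):
    # Group 1 holds the digits that follow the closing parenthesis.
    value = int(match.group(1)) / 100
    # Put the parenthesis back, since it is part of the matched text.
    return ")" + str(value)


line = "((words:0.23)75:0.55(morewords:0.1)2:0.55);"
converted = re.sub(r"\)(\d+)", divide_by_100, line)
print(converted)
```

A named function is easier to extend than a lambda if the output needs fixed formatting, for example a set number of decimal places.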
