Trying to find a regex (or different method), that removes whitespace in a string only if it occurs between two lowercase letters. I'm doing this because I'm cleaning noisy text from scans where whitespace was mistakenly added inside of words.
For example, I'd like to turn the string noisy = "Hel lo, my na me is Mark." into clean= "Hello, my name is Mark."
I've tried to capture the group in a regex (see below) but don't know how to then replace only whitespace in between two lowercase letters. Same issue with re.sub.
This is what I've tried, but it doesn't work because it removes all the whitespace from the string:
import re
noisy = "Hel lo my name is Mark"
finder = re.compile("[a-z](\s)[a-z]")
whitesp = finder.search(noisy).group(1)
clean = noisy.replace(whitesp,"")
print(clean)
Any ideas are appreciated thanks!
EDIT 1:
My use case is for Swedish words and sentences that I have OCR'd from scanned documents.
To correct an entire string, you could try symspellpy.
First, install it using pip:
python -m pip install -U symspellpy
Then, import the required packages, and load dictionaries. Dictionary files shipped with symspellpy can be accessed using pkg_resources. You can pass your string through the lookup_compound function, which will return a list of spelling suggestions (SuggestItem objects). Words that require no change will still be included in this list. max_edit_distance refers to the maximum edit distance for doing lookups (per single word, not entire string). You can maintain casing by setting transfer_casing to True. To get the clean string, a simple join statement with a little list comprehension does the trick.
import pkg_resources
from symspellpy import SymSpell
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy",
    "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy",
    "frequency_bigramdictionary_en_243_342.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)
my_str = "Hel lo, my na me is Mark."
sugs = sym_spell.lookup_compound(
    my_str,
    max_edit_distance=2,
    transfer_casing=True
)
print(" ".join([sug.term for sug in sugs]))
Output:
Hello my name is Mark
Check out their documentation for other examples and use cases.
Is this what you want:
In [3]: finder = re.compile("([a-z])\s([a-z])")
In [4]: clean = finder.sub(r'\1\2', noisy, 1)
In [5]: clean
Out[5]: 'Hello my name is Mark'
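Note that the count argument (the final 1) is what keeps the other words apart here. As a rough illustration of why a regex alone may not be enough, the sketch below uses lookarounds (my own variation, not from the answer above) to remove every whitespace character that sits between two lowercase letters:

```python
import re

noisy = "Hel lo my name is Mark"
# Lookbehind/lookahead test the neighbouring letters without consuming them,
# so every space between two lowercase letters is removed.
glued = re.sub(r'(?<=[a-z])\s(?=[a-z])', '', noisy)
print(glued)  # Hellomynameis Mark
```

The pattern cannot tell a scan error from a real word boundary, which is why the dictionary-based suggestions in the other answers may be a better fit.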
I think you need a Python module that contains a word list (like an Oxford dictionary), so you can check whether joining the pieces on either side of a space gives a valid word. For example, break the string into a list with string.split(), then loop over the list starting at index 1, range(1, len(your_list)), joining the previous and current items, your_list[index - 1] + your_list[index], into a single token. Check that token against the set of words you have collected: if it is a valid word, append the token to a temporary list; if not, append just the previous word. Once the loop is done, join the temporary list back into a string.
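A minimal sketch of that idea, assuming a tiny hand-made vocabulary set (a real one would come from a word list or spell-checking library). The function name merge_split_words is hypothetical, and the loop skips ahead after a merge instead of using range(1, ...), so it is a variation on the steps described rather than a literal transcription:

```python
def merge_split_words(text, vocabulary):
    """Join adjacent tokens when their concatenation is a known word."""
    tokens = text.split()
    merged = []
    i = 0
    while i < len(tokens):
        # Try gluing the current token to the next one.
        if i + 1 < len(tokens) and (tokens[i] + tokens[i + 1]).lower() in vocabulary:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # both tokens are consumed
        else:
            merged.append(tokens[i])
            i += 1
    return " ".join(merged)

# Assumed toy vocabulary, for illustration only.
vocabulary = {"hello", "my", "name", "is", "mark"}
print(merge_split_words("Hel lo my na me is Mark", vocabulary))  # Hello my name is Mark
```

Punctuation attached to a token (such as "lo,") would need to be stripped before the lookup; that step is left out here.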
You can try Python spelling checker pyenchant, Python grammar checker language-check, or even using NLTK Corpora to build your own checker.
Related
I'm trying to check if a subString exists in a string using regular expression.
RE : re_string_literal = '^"[a-zA-Z0-9_ ]+"$'
The thing is, I don't want to match any substring. I'm reading data from a file:
Now one of the lines have this text:
cout<<"Hello"<<endl;
I just want to check if there's a string inside the line and if yes, store it in a list.
I have tried the re.match method but it only works if we have to match a pattern, but in this case, I just want to check if a string exists or not, if yes, store it somewhere.
re_string_lit = '^"[a-zA-Z0-9_ ]+"$'
text = 'cout<<"Hello World!"<<endl;'
re.match(re_string_lit,text)
It doesn't output anything.
In simple words,
I just want to extract everything inside ""
If you just want to extract everything inside "" then string splitting would be much simpler way of doing things.
>>> a = 'something<<"actualString">>something,else'
>>> b = a.split('"')[1]
>>> b
'actualString'
The above example only handles a single quoted string (one pair of double quotes). For more than one, you could iterate over the pieces returned by split (the odd-indexed ones are the quoted strings), or apply a much simpler regular expression.
This worked for me:
re.search('"(.+?)"', 'cout<<"Hello"<<endl')
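To collect every quoted string on a line into a list, as the question asks, re.findall may be a better fit than re.search, which returns only the first match (as a match object). A minimal sketch; the sample line is made up for illustration:

```python
import re

line = 'cout<<"Hello"<<" World"<<endl;'
# With one capturing group, findall returns the group contents for each match.
literals = re.findall(r'"([^"]*)"', line)
print(literals)  # ['Hello', ' World']
```

An empty list means the line contains no string literal. Escaped quotes inside a literal are not handled by this pattern.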
I want to compare a list of strings and if a certain sequence of characters match, I want to put those matching strings into a new_list, like so:
string_list1 = ['CE.1.FXZ', 'CE.1.FXX', 'CE.1.FXY', 'CE.4.FXZ', 'CE.4.FXX', 'CE.4.FXY']
new_list = ['CE.1.FXZ', 'CE.1.FXX', 'CE.1.FXY']
As you can see, the common character in each is either 1 or 4.
My question is how can I separate strings based on a common character, if I do not know the common character beforehand? For example, I would like to parse the string_list1 into a function and have the function automatically identify the common characters and then separate based on that. Any help would be great! Thanks.
You can isolate the "common character" in your example with python built-in str.split() method (more info at https://docs.python.org/fr/2.7/library/stdtypes.html#str.split) like so :
for i in string_list1:
    common_character = i.split(".")[1]
Next step would be creating a list each time you see a novel "common_character" or adding your element to an existing list using the list.append() method (one by one).
Best of luck !
If the common char is always the second token (when split on the .), you can use a defaultdict where each key is the common char and each value is the list of strings that share it.
from collections import defaultdict
string_list1 = ['CE.1.FXZ', 'CE.1.FXX', 'CE.1.FXY', 'CE.4.FXZ', 'CE.4.FXX', 'CE.4.FXY']
common_chars = defaultdict(list)
for s in string_list1:  # "s" rather than "str", to avoid shadowing the built-in
    common_chars[s.split('.')[1]].append(s)
for common_group in common_chars.values():
    print(common_group)
Outputs:
['CE.1.FXZ', 'CE.1.FXX', 'CE.1.FXY']
['CE.4.FXZ', 'CE.4.FXX', 'CE.4.FXY']
I am working on a Python script to interactively replace pin numbers (part of the string) in between common features within a set of web service links (imagine as the entire string). See below as a case:
The entire string:
http://www.adamscountyarcserver.com/adamscountyarcserver/rest/services/Adams_County_Basemap_Complete/MapServer/14/query?where=**PIN%3D%27010059400200%27**&text=&objectIds=&time=&geometry=&geometryType=esriGeometryEnvelope&inSR=&spatialRel=esriSpatialRelIntersects&relationParam=&outFields=&returnGeometry=true&returnTrueCurves=false&maxAllowableOffset=&geometryPrecision=&outSR=4326&returnIdsOnly=false&returnCountOnly=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&returnZ=false&returnM=false&gdbVersion=&returnDistinctValues=false&resultOffset=&resultRecordCount=&queryByDistance=&returnExtentsOnly=false&datumTransformation=&parameterValues=&rangeValues=&f=pjson
The part of the string I want to replace/pin number:
010059400200
The common features that are surrounded in the start and in the end of the pin number:
PIN%3D%27 and %27
I tried Python built-in functions such as replace, substitute and partition, but it seemed like all of them required me to specify the word itself that is to be replaced, rather than its location relative to the rest of the string.
Any solutions or ideas?
There are a few ways to do this, but I think re.sub might be the easiest:
>>> import re
>>> newpin = '1234567890'
>>> re.sub(r'PIN%3D%27\d+%27', 'PIN%3D%27' + newpin + '%27', text)
'http://www.adamscountyarcserver.com/adamscountyarcserver/rest/services/Adams_County_Basemap_Complete/MapServer/14/query?where=PIN%3D%271234567890%27&text=&objectIds=&time=&geometry=&geometryType=esriGeometryEnvelope&inSR=&spatialRel=esriSpatialRelIntersects&relationParam=&outFields=&returnGeometry=true&returnTrueCurves=false&maxAllowableOffset=&geometryPrecision=&outSR=4326&returnIdsOnly=false&returnCountOnly=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&returnZ=false&returnM=false&gdbVersion=&returnDistinctValues=false&resultOffset=&resultRecordCount=&queryByDistance=&returnExtentsOnly=false&datumTransformation=&parameterValues=&rangeValues=&f=pjson'
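The same substitution can be wrapped in a small helper. This is only a sketch: the function name replace_pin and the shortened example.com URL are made up for illustration. It captures the fixed prefix and suffix as groups and rebuilds the match in a callback, so the new pin is never interpreted as part of a backreference:

```python
import re

def replace_pin(url, new_pin):
    """Swap the digits between 'PIN%3D%27' and '%27' for new_pin."""
    return re.sub(
        r'(PIN%3D%27)\d+(%27)',
        lambda m: m.group(1) + new_pin + m.group(2),
        url,
    )

# Hypothetical shortened URL standing in for the full query string.
url = "http://example.com/query?where=PIN%3D%27010059400200%27&f=pjson"
print(replace_pin(url, "1234567890"))
# http://example.com/query?where=PIN%3D%271234567890%27&f=pjson
```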
I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Let's say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)  # transform the string into a list of characters
for i in l:
    if "E" in l:
        p = l.index("E")
        if isinstance(p + 1, int):
            if isinstance(p + 2, int):
                delp = p + 3
                a = p - 3
                del l[delp:]
new = "".join(l)
new = new.replace(".", " ")
print(new)
The idea is to get the index where "E" is and check whether there are 2 integers after the "E".
Then delete everything after the second integer.
However, this will not work if there is an "E" anywhere else.
at the moment the result I get is:
this is tEst
because it is finding the index of the first "E" in the list and deleting everything after index+3.
I guess my question is how do I get the index in the list only if a combination of characters exists,
but I can't seem to find how.
Thanks for everyone's answers.
I was going in another direction, but it is also not working.
If someone could see why, it would be awesome. It is much better to learn by doing than just copying what others write :)
This is what I came up with:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
Can anyone tell me why this isn't working? I get an error.
Thank you so much
Have you ever heard of a Regular Expression?
Check out python's re module. Link to the Docs.
Basically, you can define a "regex" that would match "E and then two integers" and give you the index of it.
After that, I'd just use python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods for str.replace to swap the periods for spaces, and str.title to put them in Title Case
An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match(r'(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() method of the re module. Regex patterns tend to use a lot of backslashes, so it's common in Python to prefix the pattern string with 'r' (a raw string), which tells the Python interpreter not to treat backslashes as the start of escape sequences. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
    # The '0' refers to group 0, i.e. the whole match; end() is the index just past it
    index_of_Exx = match_object.end(0)
    truncated_filename = filename[:index_of_Exx]
    # Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
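Putting the pieces together, a minimal sketch might look like the following. The function name clean_filename and its ignore_case parameter are my own for illustration, and the replace/title steps are borrowed from the other answers. If no "E plus two digits" is found, the name is returned unchanged, which is an assumption about the desired behaviour:

```python
import re

def clean_filename(filename, ignore_case=False):
    """Cut the name after the first E + two digits, then tidy it up."""
    flags = re.I if ignore_case else 0
    match_object = re.search(r'E\d\d', filename, flags)
    if match_object is None:
        return filename  # assumed fallback: leave the name untouched
    truncated = filename[:match_object.end()]
    # Swap the periods for spaces and put the result in Title Case.
    return truncated.replace('.', ' ').title()

print(clean_filename('this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'))
# This Is Test3 E00
```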
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
The reason why this isn't working:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
...is because 'i' contains a character from the list 'l', not an integer index. You compare it with 'E' (which works), but then try to add 1 to it, and adding an int to a str raises a TypeError.
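If you would like to stay with the loop rather than switch to a regex, one possible fix is to iterate over positions with enumerate and inspect the two characters that follow each "E". A sketch only; the function name find_e_digits is hypothetical:

```python
def find_e_digits(chars):
    """Return the index of the first 'E' followed by two digits, or -1."""
    for idx, ch in enumerate(chars):
        following = chars[idx + 1:idx + 3]  # the next two characters, if any
        if ch == "E" and len(following) == 2 and all(c.isdigit() for c in following):
            return idx
    return -1

l = list("this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv")
p = find_e_digits(l)
print(p)                     # 14
print("".join(l[:p + 3]))    # this.is.tEst3.E00
```

Here idx is the integer position and ch is the character, so idx + 1 is valid arithmetic while ch is only ever compared with "E".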
from glob import glob
pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)
def countwords(fp):
    with open(fp) as fh:
        return len(fh.read().split())
print "There are" ,sum(map(countwords, filelist)), "words in the files. " "From directory",pattern
import os
uniquewords = set([])
for root, dirs, files in os.walk("D:\\report\\shakeall"):
    for name in files:
        [uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]
print "There are" ,len(uniquewords), "unique words in the files." "From directory", pattern
So far my code is this. This counts the number of unique words and total words from D:\report\shakeall\*.txt
The problem is that this code treats, for example, code, code. and code! as different words, so it can't give an exact number of unique words.
I'd like to either remove the special characters from the 42 text files using a Windows text editor,
or make an exception rule that solves this problem.
If using the latter, how should I write my code?
Should it modify the text files directly, or add an exception so that special characters aren't counted?
import re
string = open('a.txt').read()
new_str = re.sub(r'[^a-zA-Z0-9\n\.]', ' ', string)
open('b.txt', 'w').write(new_str)
It will change every character that is not a letter, a digit, a newline or a period to white space.
I'm pretty new and I doubt this is very elegant at all, but one option would be to take your string(s) after reading them in and run them through string.translate() to strip out the punctuation. Here is the Python documentation for it for version 2.7 (which I think you're using).
As far as the actual code, it might be something like this (but maybe someone better than me can confirm/improve on it):
fileString.translate(None, string.punctuation)
where "fileString" is the string that your open(fp) read in. "None" is provided in place of a translation table (which would normally be used to actually change some characters into others), and the second parameter, string.punctuation (a Python string constant containing all the punctuation symbols) is a set of characters that will be deleted from your string.
In the event that the above doesn't work, you could modify it as follows:
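import string
inChars = string.punctuation
outChars = ' ' * len(inChars)  # maketrans needs two strings of equal length, so each symbol maps to a space
translateTable = string.maketrans(inChars, outChars)
fileString.translate(translateTable)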
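On Python 3 the two calls above look a little different: str.translate accepts only a table, and the characters to delete go in the third argument of str.maketrans. A short sketch, with a made-up sample string:

```python
import string

# Third argument: characters that should be deleted outright.
table = str.maketrans('', '', string.punctuation)
cleaned = "code! code. code".translate(table)
print(cleaned)  # code code code
```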
There are a couple of other answers to similar questions I found via a quick search. I'll link them here, too, in case you can get more from them.
Removing Punctuation From Python List Items
Remove all special characters, punctuation and spaces from string
Strip Specific Punctuation in Python 2.x
Finally, if what I've said is completely wrong, please comment and I'll remove it so that others don't try what I've said and become frustrated.
import re
Then replace
[uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]
By
[uniquewords.add(re.sub('[^a-zA-Z0-9]*$', '', x)) for x in open(os.path.join(root,name)).read().split()]
This will strip all trailing non-alphanumeric characters from each word before adding it to the set.
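For comparison, here is a small self-contained sketch of the same idea in Python 3. It is a variation, not the answer's exact method: it uses str.strip with string.punctuation, which removes punctuation from both ends of each word rather than only the trailing end. The function name unique_words and the sample text are made up:

```python
import string

def unique_words(text):
    """Return the set of words in text, ignoring surrounding punctuation."""
    stripped = (w.strip(string.punctuation) for w in text.split())
    return {w for w in stripped if w}  # drop tokens that were only punctuation

sample = "code code. code! Code"
print(sorted(unique_words(sample)))  # ['Code', 'code']
```

Case is left alone here, so Code and code still count as two words; lowercasing each word first would merge them.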
When working in Linux, some system files under /proc contain characters with ASCII value 0.
full_file_path = 'test.txt'
result = []
with open(full_file_path, encoding='utf-8') as f:
    line = f.readline()
    for c in line:
        if ord(c) == 0:
            result.append(' ')
        else:
            result.append(c)
print (''.join(result))
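The character-by-character loop can probably be shortened with str.replace, since the NUL character can be written as '\x00'. A minimal sketch; the function name replace_nul and the sample string are made up for illustration:

```python
def replace_nul(text, replacement=' '):
    """Replace every NUL character (ASCII value 0) in text."""
    return text.replace('\x00', replacement)

sample = 'cmd\x00--flag\x00'
print(repr(replace_nul(sample)))  # 'cmd --flag '
```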