How to remove prefixes from strings? - python

I'm trying to do some text preprocessing so that I can do some string matching activities.
I have a set of strings, and I want to check whether the first word in each string starts with the prefix "1/". If it does, I want to remove this prefix but keep the rest of the word/string.
I've come up with the following, but it's just removing everything after the first word and not necessarily removing the prefix "1/":
prefixes = (r'1/')

# remove prefixes from string
def prefix_removal(text):
    for word in str(text).split():
        if word.startswith(prefixes):
            return word[len(prefixes):]
        else:
            return word
Any help would be appreciated!
Thank you!

Assuming you only want to remove the prefix from the first word and leave the rest alone, I see no reason to use a for loop. Instead, I would recommend this:
def prefix_removal(text):
    first_word = text.split()[0]
    if first_word.startswith(prefixes):
        return text[len(prefixes):]
    return text
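For example, with the same prefixes value as in the question (the sample strings here are made up for illustration):
prefixes = '1/'

print(prefix_removal("1/foo bar"))  # foo bar
print(prefix_removal("baz qux"))    # baz qux (no prefix, returned unchanged)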
Hopefully this answers your question, good luck!

Starting with Python 3.9 you can use str.removeprefix:
word = word.removeprefix(prefix)
For other versions of Python you can use:
if word.startswith(prefix):
    word = word[len(prefix):]
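Applied to the question's "1/" prefix, for example (sample strings made up for illustration):
word = "1/apple"
print(word.removeprefix("1/"))     # apple
print("apple".removeprefix("1/"))  # apple (unchanged when the prefix is absent)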

Related

Remove whitespace between two lowercase letters

I'm trying to find a regex (or a different method) that removes whitespace in a string only if it occurs between two lowercase letters. I'm doing this because I'm cleaning noisy text from scans where whitespace was mistakenly added inside words.
For example, I'd like to turn the string noisy = "Hel lo, my na me is Mark." into clean = "Hello, my name is Mark."
I've tried to capture the group in a regex (see below) but don't know how to then replace only whitespace in between two lowercase letters. Same issue with re.sub.
This is what I've tried, but it doesn't work because it removes all the whitespace from the string:
import re
noisy = "Hel lo my name is Mark"
finder = re.compile("[a-z](\s)[a-z]")
whitesp = finder.search(noisy).group(1)
clean = noisy.replace(whitesp,"")
print(clean)
Any ideas are appreciated thanks!
EDIT 1:
My use case is for Swedish words and sentences that I have OCR'd from scanned documents.
To correct an entire string, you could try symspellpy.
First, install it using pip:
python -m pip install -U symspellpy
Then, import the required packages and load the dictionaries. Dictionary files shipped with symspellpy can be accessed using pkg_resources.
You can pass your string through the lookup_compound function, which will return a list of spelling suggestions (SuggestItem objects). Words that require no change will still be included in this list. max_edit_distance refers to the maximum edit distance for doing lookups (per single word, not the entire string), and you can maintain casing by setting transfer_casing to True. To get the clean string, a simple join statement with a little list comprehension does the trick.
import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

my_str = "Hel lo, my na me is Mark."
sugs = sym_spell.lookup_compound(
    my_str, max_edit_distance=2, transfer_casing=True
)
print(" ".join([sug.term for sug in sugs]))
Output:
Hello my name is Mark
Check out their documentation for other examples and use cases.
Is this what you want:
In [3]: finder = re.compile(r"([a-z])\s([a-z])")
In [4]: clean = finder.sub(r'\1\2', noisy, 1)
In [5]: clean
Out[5]: 'Hello my name is Mark'
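One caveat, not part of the original answer: the count argument of 1 in the sub call is doing real work here. Dropping it removes every space flanked by two lowercase letters, including legitimate word breaks, as this quick demonstration on the question's original string shows:
import re

finder = re.compile(r"([a-z])\s([a-z])")
noisy = "Hel lo, my na me is Mark."
# Without count=1, legitimate gaps like "my na" are joined too:
print(finder.sub(r"\1\2", noisy))  # Hello, mynameis Mark.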
I think you need a Python module that contains a word list (like an Oxford dictionary) so that you can check for valid words in the string by testing the characters that have a space between them. For example: break the string into a list with string.split(), then loop over the list starting at index 1 (range(1, len(your_list))), joining the current and previous items (your_list[index - 1] + your_list[index]) into a candidate token (i.e., word). Check this token against the set of words you have collected: if it is a valid word, append the token to a temporary list; if not, just append the previous word. Once the loop is done, you can join the temporary list back into a string (a rough sketch of this idea is shown below).
You can try Python spelling checker pyenchant, Python grammar checker language-check, or even using NLTK Corpora to build your own checker.
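A rough sketch of the merge-adjacent-tokens idea described above; the known_words set and the function name are made up for illustration, and a real word list would stand in for the set:
known_words = {"hello", "my", "name", "is", "mark"}

def merge_split_words(text):
    tokens = text.split()
    result = []
    for token in tokens:
        # If the previous token plus this one forms a known word, merge them.
        if result and (result[-1] + token).lower() in known_words:
            result[-1] = result[-1] + token
        else:
            result.append(token)
    return " ".join(result)

print(merge_split_words("Hel lo my na me is Mark"))  # Hello my name is Mark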

Python regex to check if a substring is at the beginning or at the end of a bigger path to look for

I have a string containing words in the form word1_word2, word3_word4, word5_word1 (so a word can appear at the left or at the right). I want a regex that looks for all the occurrences of a specific word, and returns the "super word" containing it. So if I'm looking for word1, I expect my regex to return word1_word2, word5_word1. Since the word can appear on the left or on the right, I wrote this:
re.findall("( {}_)?[\u0061-\u007a\u00e0-\u00e1\u00e8-\u00e9\u00ec\u00ed\u00f2-\u00f3\u00f9\u00fa]*(_{} )?".format("w1", "w1"), string)
With the optional blocks at the beginning and at the end of the pattern. However, it takes forever to execute, and I think something is not correct, because when I tried removing the optional blocks and writing two separate regexes (one looking at the beginning, one at the end) they were much faster (but I don't want to use two regexes). Am I missing something, or is this normal?
This would be the regex solution to your problem:
re.findall(rf'\b({yourWord}_\w+?|\w+?_{yourWord})\b', yourString)
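For example, a quick check with the sample data from the question:
import re

yourWord = "word1"
yourString = "word1_word2, word3_word4, word5_word1"
print(re.findall(rf'\b({yourWord}_\w+?|\w+?_{yourWord})\b', yourString))
# ['word1_word2', 'word5_word1']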
Python provides some methods to do this
a=['word1_word2', 'word3_word4', 'word5_word1']
b = [x for x in a if x.startswith("word1") or x.endswith('word1')]
print(b) # ['word1_word2', 'word5_word1']
import re

s = 'word1_word2, word3_word4, word5_word1'
matches = re.finditer(r'(\w+_word1)|(word1_\w+)', s)
result = list(map(lambda x: x.group(), matches))
print(result)  # ['word1_word2', 'word5_word1']
This is one method, but after seeing @Carl's answer I voted for his, which is a faster and cleaner method. I will just leave this here as one of many regex options.
this regex will do the job for word1 (with non-capturing groups, so that findall returns the full matches rather than the groups):
import re

regex = r'(?:word\d_)*word1(?:_word\d)*'
re.findall(regex, string)
you can also use this:
re.findall(rf'\b(word{number}_\w+?|\w+?_word{number})\b', string)
Try the following regex.
In the following, replace word1 with the word you're looking for. This assumes that the word you are looking for consists only of alphanumeric characters.
([a-zA-Z0-9]*_word1)|(word1_[a-zA-Z0-9]*)

how to find substring from a single line string

Suppose I have a string, s = "panpanIpanAMpanJOEpan". From this I want to find the word "pan" and replace it with spaces, so that I get the output string "I AM JOE". How can I do it?
Actually, I also don't know how to find a certain substring in a long string without spaces, such as the one above.
It would be great if someone helped me learn about this.
If you don't know "pan" in advance, you can exploit the fact that the letters you want to find are all uppercase.
fillword = min(set("".join(i if i.islower() else ' ' for i in s).split(' '))-set(['']),key=len)
This works by first replacing all upper case letters with space, then splitting on space and finding the minimal nonempty word.
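Unpacked into steps, the one-liner is equivalent to this sketch:
s = "panpanIpanAMpanJOEpan"
masked = "".join(i if i.islower() else ' ' for i in s)   # 'panpan pan  pan   pan'
chunks = set(masked.split(' ')) - set([''])              # {'panpan', 'pan'}
fillword = min(chunks, key=len)                          # 'pan'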
Use replace to replace it with a space, and then strip to remove the excess spacing.
s = "panpanIpanAMpanJOEpan"
s.replace(fillword, ' ').strip()
gives:
'I AM JOE'
s="panpanIpanAMpanJOEpan"
print(s.replace("pan"," ").strip())
use replace
Output:
I AM JOE
As DarrylG and others mentioned, .replace will do what you asked for, where you define what you want to replace ("pan") and what you want to replace it with (" ").
To find a certain string in a longer string you can use .find(), which takes a string you are looking for and optionally where to start and stop looking for it (as integers) as arguments.
If you want to find all of the occurrences of a string in a bigger string, there are two options:
1. Find the string with find(), then cut the string so it no longer contains your search term, and repeat until the .find() method returns -1 (which means the search term is no longer found in the string), or
2. use the regex module's .finditer method to find all occurrences of your string (there is an answer on Stack Overflow explaining exactly that); see the sketch after this answer.
Edit: If you don't know what you are searching for, it becomes a bit more tricky, but you can write a regex expression that extracts this data as well, using the same regex module. This is easy if you know what the end result is supposed to be ("I AM JOE" in your case). If you don't, it becomes more complicated, and we would need additional information to help with this.
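To illustrate the .finditer option from the list above (using the question's string):
import re

s = "panpanIpanAMpanJOEpan"
print([m.start() for m in re.finditer("pan", s)])  # [0, 3, 7, 12, 18]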
You can use replace to replace all occurrences of a substring at once.
In case you want to find the substrings yourself, you can do it manually:
s = "panpanIpanAMpanJOEpan"
while True:
panPosition = s.find('pan') # -1 == 'pan' not found!
if panPosition == -1:
s = s.strip()
break
# Cut out pan from s and replace it with a blanc.
s = s[:panPosition] + ' ' + s[panPosition + 3:]
print(s)
Out:
I AM JOE

How to edit items in a list

This is a follow on from a previous question.
I've loaded a word list in Python, but there is a problem. For example, when I access the 21st item in wordlist I should get "ABACK". Instead I get:
wordlist[21]
"'ABACK\\n',"
So I need to trim "'" off the front and "\\n'," off the back of every string in wordlist. I've tried different string methods but haven't found one that works yet.
wordlist = [e[1:-4] for e in wordlist]
That does the trick. Courtesy of whoever commented above and a similar post.
Since there is a backslash before \n, doing a simple strip won't work if you want to remove it. You can do this:
wordlist = [word.strip("'").strip().split("\\n")[0] for word in wordlist]
The additional strip() is in case you actually have a \n to get rid of. Or you can do word[1:-4] as @jonrsharpe suggested.
If this is generic to every word in the list, you can use the lstrip() and rstrip() functions. Note that strings are immutable, so you have to keep the return values (and that lstrip/rstrip remove any of the given characters, not an exact prefix/suffix):
wordlist = [word.lstrip("'").rstrip("\\n',") for word in wordlist]

How to remove special characters from txt files using Python

from glob import glob

pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)

def countwords(fp):
    with open(fp) as fh:
        return len(fh.read().split())

print "There are", sum(map(countwords, filelist)), "words in the files. " "From directory", pattern

import os
uniquewords = set([])
for root, dirs, files in os.walk("D:\\report\\shakeall"):
    for name in files:
        [uniquewords.add(x) for x in open(os.path.join(root, name)).read().split()]
print "There are", len(uniquewords), "unique words in the files." "From directory", pattern
So far my code is this. It counts the number of total and unique words in the files under D:\report\shakeall\*.txt.
The problem is that, for example, this code treats code, code. and code! as different words. So this can't give an exact number of unique words.
I'd like to either remove the special characters from the 42 text files (using a Windows text editor), or add a rule to the code that works around the problem.
If the latter, how should I change my code? Should it directly modify the text files, or just ignore special characters when counting?
import re

string = open('a.txt').read()
new_str = re.sub(r'[^a-zA-Z0-9\n.]', ' ', string)
open('b.txt', 'w').write(new_str)
It will change every non-alphanumeric character (apart from newlines and periods) to whitespace.
I'm pretty new and I doubt this is very elegant at all, but one option would be to take your string(s) after reading them in and run them through string.translate() to strip out the punctuation. See the Python documentation for it for version 2.7 (which I think you're using).
As far as the actual code goes, it might be something like this (but maybe someone better than me can confirm/improve on it):
import string
fileString = fileString.translate(None, string.punctuation)
where fileString is the string that your open(fp) read in. None is provided in place of a translation table (which would normally be used to map some characters to others), and the second parameter, string.punctuation (a Python string constant containing all the punctuation symbols), is the set of characters that will be deleted from your string. Note that translate returns a new string rather than modifying it in place, so the result has to be assigned back.
In the event that the above doesn't work, you could modify it to use an actual translation table, mapping each punctuation character to a space:
import string

inChars = string.punctuation
outChars = ' ' * len(inChars)  # must be the same length as inChars
translateTable = string.maketrans(inChars, outChars)
fileString = fileString.translate(translateTable)
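For reference, since the snippets above are Python 2: in Python 3, str.translate takes a table built with str.maketrans, whose third argument lists the characters to delete. A small self-contained sketch:
import string

fileString = "code, code. code!"
# Map nothing to nothing, and delete every punctuation character.
cleaned = fileString.translate(str.maketrans("", "", string.punctuation))
print(cleaned)  # code code code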
There are a couple of other answers to similar questions i found via a quick search. I'll link them here, too, in case you can get more from them.
Removing Punctuation From Python List Items
Remove all special characters, punctuation and spaces from string
Strip Specific Punctuation in Python 2.x
Finally, if what I've said is completely wrong, please comment and I'll remove it so that others don't try what I've said and become frustrated.
import re
Then replace
[uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]
with
[uniquewords.add(re.sub('[^a-zA-Z0-9]*$', '', x)) for x in open(os.path.join(root, name)).read().split()]
This will strip all trailing non-alphanumeric characters from each word before adding it to the set.
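For example, words differing only in trailing punctuation collapse to the same form:
import re

for w in ["code", "code.", "code!"]:
    print(re.sub('[^a-zA-Z0-9]*$', '', w))  # prints "code" three times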
When working in Linux, some system files under /proc contain characters with ASCII value 0 (NUL).
full_file_path = 'test.txt'
result = []
with open(full_file_path, encoding='utf-8') as f:
    line = f.readline()
    for c in line:
        if ord(c) == 0:
            result.append(' ')
        else:
            result.append(c)
print(''.join(result))
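The same substitution can also be written with str.replace, since a character with ord 0 is '\x00':
with open('test.txt', encoding='utf-8') as f:
    print(f.readline().replace('\x00', ' '))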
