I scraped a few PDFs, and text set in some thick (bold) fonts comes out with every letter doubled, as in this example:
text='and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
instead of
"and assesses our reformed teaching in the classroom"
How can I fix this? I am trying with a regex:
pattern=r'([a-z])(?=\1)'
re.sub(pattern,'',text)
#"and aseses our reformed teaching in the clasrom"
This also collapses the legitimate double letters in "assesses" and "classroom". I am thinking of grouping the two groups above and adding word boundaries.
EDIT: this one fixes words with an even number of letters:
pattern=r'([a-z])\1([a-z])\2'
re.sub(pattern,r'\1\2',text)
#"and assesses ourr reformed teaching in the classroom"
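If letters are duplicated, you can try something like this:
for w in text.split():
    if len(w) % 2 != 0:
        # an odd-length word cannot consist only of doubled letters
        print(w)
        continue
    if w[0::2] == w[1::2]:
        # every letter is doubled: keep every second character
        print(w[0::2])
        continue
    print(w)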
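The per-word check above can also be run through re.sub with a callable, so that punctuation and spacing are left untouched. This is only a sketch: the function names undouble and fix_doubled_words are made up for illustration, and it assumes words consist of ASCII letters.

```python
import re

def undouble(match):
    """Return the word with doubling removed if every letter is doubled."""
    w = match.group(0)
    # A fully doubled word has even length and identical even/odd slices.
    if len(w) % 2 == 0 and w[0::2] == w[1::2]:
        return w[0::2]
    return w

def fix_doubled_words(text):
    # Apply the check to each run of letters; everything else is kept as-is.
    return re.sub(r"[A-Za-z]+", undouble, text)

text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
print(fix_doubled_words(text))
# and assesses our reformed teaching in the classroom
```

Because the test is done on whole words, "assesses" and "classroom" are not touched.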
I am using a mixed approach: build the pattern and the substitution in a for loop, then apply the regex. The regexes go from words of 9 letters (9x2=18 scraped characters) down to 2.
import re
text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
wrd_len = [9,8,7,6,5,4,3,2]
for l in wrd_len:
    # sub is '\1\2...\l', pattern is '([a-z])\1([a-z])\2...([a-z])\l'
    sub = '\\' + '\\'.join(map(str,range(1,l+1)))
    pattern = '([a-z])\\' + '([a-z])\\'.join(map(str,range(1,l+1)))
    text = re.sub(pattern, sub , text)
print(text)
#and assesses our reformed teaching in the classroom
For example, the regex for 3-letter words becomes:
re.sub(r'([a-z])\1([a-z])\2([a-z])\3', r'\1\2\3', text)
As a side note, I could not get those backslashes right with raw strings when building the patterns in the loop, and I am actually going to use [a-zA-Z].
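One way the loop might be written with raw-string pieces and [a-zA-Z] is sketched below. The function name collapse_doubled_runs and its parameters are hypothetical, and %-formatting is just one of several ways to drop the group number after the backslash.

```python
import re

def collapse_doubled_runs(text, max_len=9, min_len=2):
    """Collapse runs of n doubled letters, trying the longest runs first."""
    for n in range(max_len, min_len - 1, -1):
        # For n=3 this builds ([a-zA-Z])\1([a-zA-Z])\2([a-zA-Z])\3
        pattern = ''.join(r'([a-zA-Z])\%d' % i for i in range(1, n + 1))
        # ... and the replacement \1\2\3
        repl = ''.join(r'\%d' % i for i in range(1, n + 1))
        text = re.sub(pattern, repl, text)
    return text

text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
print(collapse_doubled_runs(text))
# and assesses our reformed teaching in the classroom
```

Note that any genuine word containing two or more consecutive double letters (for example the "ookkee" in "bookkeeper") would also be collapsed by this approach.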
I found a solution in JavaScript that works fine:
([a-z])\1(?:(?=([a-z])\2)|(?<=\3([a-z])\1\1))
but it doesn't work in Python, because a lookbehind there can't contain group references. So I came up with another solution that works for this example:
([a-z])\1(?:(?=([a-z])\2)|(?=[^a-z]))
try it here
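A minimal sketch of how the Python-compatible pattern could be applied, replacing each matched pair with its first letter. The name PAIR is made up here; the pattern itself is the one given above.

```python
import re

# A doubled letter that is followed either by another doubled letter
# or by a non-letter character.
PAIR = re.compile(r'([a-z])\1(?:(?=([a-z])\2)|(?=[^a-z]))')

text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
fixed = PAIR.sub(r'\1', text)
print(fixed)
# and assesses our reformed teaching in the classroom
```

One caveat: a genuine double letter directly before a space or punctuation mark also matches, so "too " would become "to ". A double letter at the very end of the string is never matched, since the second lookahead needs a following character.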
I am reading a badly formatted text, and often there are unwanted spaces inside a single word. For example, "int ernational trade is not good for economies" and so forth. Is there any efficient tool that can cope with this? (There are a couple of other answers like here, which do not work in a sentence.)
Edit: About the impossibility mentioned, I agree. One option is to preserve all possible options. In my case this edited text will be matched with another database that has the original (clean) text. This way, any wrong removal of spaces just gets tossed away.
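The "preserve all possible options" idea could be sketched as follows, assuming the phrases are short (the number of variants doubles with every space) and that the clean text is available as a set of strings. The name space_variants and the clean set are hypothetical.

```python
from itertools import product

def space_variants(text):
    """Return every version of text with each space either kept or removed."""
    words = text.split()
    if not words:
        return [""]
    variants = []
    # One keep/remove choice per gap between consecutive words.
    for seps in product((" ", ""), repeat=len(words) - 1):
        parts = [words[0]]
        for sep, word in zip(seps, words[1:]):
            parts.append(sep + word)
        variants.append("".join(parts))
    return variants

# Hypothetical clean database; wrong joins simply never match anything.
clean = {"international trade"}
matches = [v for v in space_variants("int ernational trade") if v in clean]
print(matches)
# ['international trade']
```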
You could use the PyEnchant package to get a list of English words. I will assume words that do not have meaning on their own but do together are a word, and use the following code to find words that are split by a single space:
import enchant
text = "int ernational trade is not good for economies"
fixed_text = []
d = enchant.Dict("en_US")
for i in range(len(words := text.split())):
    # merge the current word into the previous one if only the combination is a valid word
    if fixed_text and not d.check(words[i]) and d.check(compound_word := ''.join([fixed_text[-1], words[i]])):
        fixed_text[-1] = compound_word
    else:
        fixed_text.append(words[i])
print(' '.join(fixed_text))
This will split the text on spaces and append words to fixed_text. When it finds that the current word is not in the dictionary, but joining it to the previously added word does make a valid word, it sticks those two words together.
This should help sanitize most of the invalid words, but as the comments mentioned it is sometimes impossible to find out if two words belong together without performing some sort of lexical analysis.
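To try the merging logic without installing PyEnchant, the dictionary lookup can be swapped for any callable. The sketch below uses a small hand-written word set as a stand-in; merge_split_words and vocab are made-up names, and a real dictionary will behave differently on words outside this toy set.

```python
def merge_split_words(text, is_word):
    """Join a word to the previous one when only the combination is valid."""
    fixed = []
    for word in text.split():
        if fixed and not is_word(word) and is_word(fixed[-1] + word):
            fixed[-1] += word
        else:
            fixed.append(word)
    return " ".join(fixed)

# Stand-in for enchant.Dict("en_US").check (an assumption for this sketch).
vocab = {"international", "trade", "is", "not", "good", "for", "economies"}
print(merge_split_words("int ernational trade is not good for economies",
                        vocab.__contains__))
# international trade is not good for economies
```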
As suggested by Pranav Hosangadi, here is a modified (and a little more involved) version which can remove multiple spaces in words by compounding previously added words which are not in the dictionary. However, since a lot of smaller words are valid in the English language, many spaced out words don't correctly concatenate.
import enchant
text = "inte rnatio nal trade is not good for ec onom ies"
fixed_text = []
d = enchant.Dict("en_US")
for i in range(len(words := text.split())):
    if fixed_text and not d.check(compound_word := words[i]):
        # walk backwards over already-added words, prepending invalid ones
        for j, pending_word in enumerate(fixed_text[::-1], 1):
            if not d.check(pending_word) and d.check(compound_word := ''.join([pending_word, compound_word])):
                del fixed_text[-j:]
                fixed_text.append(compound_word)
                break
        else:
            # no valid compound found: keep the word as it is
            fixed_text.append(words[i])
    else:
        fixed_text.append(words[i])
print(' '.join(fixed_text))
I'm writing a function that needs to return an abbreviated version of a str. The returned str must contain the first letter, the number of characters removed, and the last letter. It must be abbreviated per word and not per sentence, and after that I need to join every word again in the same format, including the special characters. I tried using re.findall(), but it drops the special characters, so " ".join() returns a string without them.
Here's my code:
import re
def abbreviate(wrd):
    return " ".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.findall(r"[\w']+", wrd)])
print(abbreviate("elephant-rides are really fun!"))
The output would be:
e6t r3s are r4y fun
But the output should be:
e6t-r3s are r4y fun!
No need for str.join. Might as well take full advantage of what the re module has to offer.
re.sub accepts a string or a callable object (like a function or lambda), which takes the current match as an input and must return a string with which to replace the current match.
import re
pattern = "\\b[a-z]([a-z]{2,})[a-z]\\b"
string = "elephant-rides are really fun!"
def replace(match):
    # first letter + number of inner letters + last letter
    return f"{match.group(0)[0]}{len(match.group(1))}{match.group(0)[-1]}"
abbreviated = re.sub(pattern, replace, string)
print(abbreviated)
Output:
e6t-r3s are r4y fun!
Maybe someone else can improve upon this answer with a cuter pattern, or any other suggestions. The way the pattern is written now, it assumes that you're only dealing with lowercase letters, so that's something to keep in mind - but it should be pretty straightforward to modify it to suit your needs. I'm not really a fan of the repetition of [a-z], but that's just the quickest way I could think of for capturing the "inner" characters of a word in a separate capturing group. You may also want to consider what should happen with words/contractions like "don't" or "shouldn't".
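As a rough illustration of the modification mentioned above, the pattern can be made case-insensitive with re.IGNORECASE, and capturing the first and last letters in their own groups avoids indexing into group(0). This is a sketch, not the only way to do it; PATTERN and abbreviate are illustrative names.

```python
import re

# first letter, two or more inner letters, last letter
PATTERN = re.compile(r"\b([a-z])([a-z]{2,})([a-z])\b", re.IGNORECASE)

def abbreviate(text):
    return PATTERN.sub(
        lambda m: f"{m.group(1)}{len(m.group(2))}{m.group(3)}", text
    )

print(abbreviate("Elephant-rides are REALLY fun!"))
# E6t-r3s are R4Y fun!
print(abbreviate("you shouldn't"))
# you s5n't
```

With this pattern the apostrophe acts as a word boundary, so a contraction is abbreviated only up to the apostrophe. Whether that is the desired behaviour depends on the requirements.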
Thank you for viewing my question. After a few more searches and some trial and error, I finally found a way to make my code work without changing it too much. I simply replaced re.findall(r"[\w']+", wrd) with re.split(r'([\W\d\_])', wrd), and changed " ".join() to "".join(), since the separators are now part of the split result.
import re
def abbreviate(wrd):
    return "".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.split(r'([\W\d\_])', wrd)])
print(abbreviate("elephant-rides are not fun!"))
Output:
e6t-r3s are not fun!
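The reason this works is that re.split keeps the separators in the result when the pattern is wrapped in a capturing group. A small sketch showing the intermediate list:

```python
import re

text = "elephant-rides are not fun!"
# The parentheses make re.split return the separators as list items too.
tokens = re.split(r"([\W\d_])", text)
print(tokens)
# ['elephant', '-', 'rides', ' ', 'are', ' ', 'not', ' ', 'fun', '!', '']

# Joining with an empty string therefore reproduces the original text.
print("".join(tokens) == text)
# True
```

The trailing empty string appears because the text ends with a separator; it is harmless when joining.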
I'm trying to solve a problem where I'm given a set of strings and have to count how many times a certain word, like 'code', appears within a string. The program should also count any variant where the 'd' is replaced by another character, like 'coze', but something like 'coz' doesn't count. This is what I made:
def count(word):
    count=0
    for i in range(len(word)):
        lo=word[i:i+4]
        if lo=='co': # this is what gives me trouble
            count+=1
    return count
Test whether the first two characters match co and the fourth character matches e.
def count(word):
    count=0
    for i in range(len(word)-3):
        # two-character slice for 'co', then the 4th character of the window
        if word[i:i+2] == 'co' and word[i+3] == 'e':
            count+=1
    return count
The loop only goes up to len(word)-3 so that word[i+3] won't go out of range.
You could use regex for this, through the re module.
import re
string = 'this is a string containing the words code, coze, and coz'
re.findall(r'co.e', string)
['code', 'coze']
From there you could write a function such as:
def count(string, word):
    return len(re.findall(word, string))
Regex is the answer to your question, as mentioned above, but what you need is a more refined pattern. Since you are looking for occurrences of a whole word, you need to add word boundaries. So your pattern should be something like this:
pattern = r'\bco.e\b'
This way your search will not match inside words like testcodetest or cozetest. It only matches standalone words such as code, coze, or coke, with no leading or trailing word characters.
If you are going to search many times, it's better to use a compiled pattern, so the pattern is compiled once and reused.
In [1]: import re
In [2]: string = 'this is a string containing the codeorg testcozetest words code, coze, and coz'
In [3]: pattern = re.compile(r'\bco.e\b')
In [4]: pattern.findall(string)
Out[4]: ['code', 'coze']
Hope that helps.
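To see the difference the word boundaries make, here is a small sketch that counts with and without them. The function count_matches and its whole_words flag are hypothetical names for this illustration.

```python
import re

def count_matches(text, pattern=r'co.e', whole_words=False):
    """Count non-overlapping matches, optionally requiring word boundaries."""
    if whole_words:
        pattern = rf'\b(?:{pattern})\b'
    return len(re.findall(pattern, text))

string = 'this is a string containing the codeorg testcozetest words code, coze, and coz'
print(count_matches(string))                    # 4
print(count_matches(string, whole_words=True))  # 2
```

Without boundaries, the 'code' in codeorg and the 'coze' in testcozetest are counted as well.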
I am working on a function that retains symbols inside a word (a word can consist of a-zA-Z, 0-9 and _), but removes every other symbol outside the word.
For example:
Input String - hell_o ? my name _ i's <hel'lo/>
Output - ['hell_o', 'my', 'name', '_', "i's", "hel'lo"]
The function I am using:
import re
from string import punctuation
l = ' '.join(filter(None,(word.strip(punctuation.replace("_","")) for word in input_String.split())))
l = re.sub(r'\s+'," ",l)
t = str.split(l.lower())
I know this is not the best or most efficient way. Can anyone recommend alternatives that I can try? Probably a regex to do this?
I tried using:
negative look around and look behinds: \W+(?!\S*[a-z])|(?<!\S)\W+
s.strip(punctuation)
re.sub('[^\w]', ' ', doc.strip(' ').lower()) - This Removes punctuation inside the word too
You can match any character other than a-zA-Z, 0-9 and _, as you mention, between two letters with (?<=[a-z])\W(?=[a-z]) and replace it with nothing to remove it.
In the end you will have a very dangerous algorithm. For instance, in the sentence I'm fine.And you? there is no space after the dot, so it will end up as I'm fineAnd you?, which may not be what you want.
[EDIT] After your comments.
OK, I misunderstood your question.
Here is a regex that selects 'hell_o', 'my', 'name', "i's", "hel'lo":
(?<![a-z])[a-z][^\s]*[a-z](?![a-z])
You can see it working here: https://regex101.com/r/EAEelq/3 (don't forget the i and g flags).
[EDIT] As you also want to match the _ outside a word:
If you want the underscores to be matched as well, update it as follows:
(?<![a-z_])[a-z_][^\s]*[a-z_](?![a-z_])|(?<= )[a-z_](?= )
See it working here: https://regex101.com/r/EAEelq/4
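In Python, the i and g flags from regex101 correspond roughly to re.IGNORECASE and re.findall. A minimal sketch using the final pattern above (TOKEN is a made-up name):

```python
import re

TOKEN = re.compile(
    r"(?<![a-z_])[a-z_][^\s]*[a-z_](?![a-z_])|(?<= )[a-z_](?= )",
    re.IGNORECASE,
)

s = "hell_o ? my name _ i's <hel'lo/>"
print(TOKEN.findall(s))
# ['hell_o', 'my', 'name', '_', "i's", "hel'lo"]
```

Note that the second alternative only picks up a single-character word when it has a space on both sides, so a lone letter at the very start or end of the string is skipped. Digits are also not part of the character classes used here.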
I have a massive string of letters all jumbled up, 1.2k lines long.
I'm trying to find a lowercase letter that has EXACTLY three capital letters on either side of it.
This is what I have so far
def scramble(sentence):
    try:
        for i,v in enumerate(sentence):
            if v.islower():
                if sentence[i-4].islower() and sentence[i+4].islower():
                    ....
                    ....
    except IndexError:
        print() #Trying to deal with the problem of reaching the end of the list
# This section checks whether the fourth letters before and after i are
# lowercase, to ensure the central lowercase letter has exactly three
# uppercase letters around it
But now I am stuck with the next step. What I would like to achieve is create a for-loop in range of (-3,4) and check that each of these letters is uppercase. If in fact there are three uppercase letters either side of the lowercase letter then print this out.
For example
for j in range(-3,4):
    if j != 0:
        #Some code to check if the letters in this range are uppercase
        #if j != 0 is there because we already know it is lowercase
        #because of the previous if v.islower(): statement.
If this doesn't make sense, this would be an example output if the code worked as expected
scramble("abcdEFGhIJKlmnop")
OUTPUT
EFGhIJK
One lowercase letter with three uppercase letters either side of it.
Here is a way to do it "Pythonically" without regular expressions (the range stops at len(s) - 6 so that the last 7-character window is included):
s = 'abcdEFGhIJKlmnop'
words = [s[i:i+7] for i in range(len(s) - 6) if s[i:i+3].isupper() and s[i+3].islower() and s[i+4:i+7].isupper()]
print(words)
And the output is:
['EFGhIJK']
And here is a way to do it with regular expressions, which is, well, also Pythonic :-)
import re
words = re.findall(r'[A-Z]{3}[a-z][A-Z]{3}', s)
If you can't use regular expressions, maybe this for loop can do the trick:
if v.islower():
    if sentence[i-4].islower() and sentence[i+4].islower():
        for k in range(1,4):
            if sentence[i-k].islower() or sentence[i+k].islower():
                break
            if k == 3:
                return i
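A self-contained version of the loop idea might look like the sketch below. It avoids the IndexError by checking bounds explicitly, and it treats the start or end of the string as a valid edge, which is an assumption the question does not spell out. The name find_guarded_lowercase is made up.

```python
def find_guarded_lowercase(sentence):
    """Return each lowercase letter with exactly three capitals on both sides."""
    results = []
    n = len(sentence)
    for i, ch in enumerate(sentence):
        # Need three characters available on each side of a lowercase letter.
        if not ch.islower() or i < 3 or i + 3 >= n:
            continue
        neighbours = sentence[i - 3:i] + sentence[i + 1:i + 4]
        if not all(c.isupper() for c in neighbours):
            continue
        # "Exactly three": the fourth character out must not be uppercase.
        before_ok = i - 4 < 0 or not sentence[i - 4].isupper()
        after_ok = i + 4 >= n or not sentence[i + 4].isupper()
        if before_ok and after_ok:
            results.append(sentence[i - 3:i + 4])
    return results

print(find_guarded_lowercase("abcdEFGhIJKlmnop"))
# ['EFGhIJK']
```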
Regex is probably the easiest. Using a modified version of @Israel Unterman's answer to account for the outside edges and non-uppercase surroundings, the full regex might be:
s = 'abcdEFGhIJKlmnopABCdEFGGIddFFansTBDgRRQ'
import re
words = re.findall(r'(?:^|[^A-Z])([A-Z]{3}[a-z][A-Z]{3})(?:[^A-Z]|$)', s)
# words is ['EFGhIJK', 'TBDgRRQ']
Using non-capturing (?:...) groups keeps the check for the beginning of the line or a non-uppercase character out of the match groups, leaving only the desired tokens in the result list. This should account for all conditions listed by the OP.
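One thing to be aware of is that the non-capturing groups still consume the boundary character, so two hits separated by a single character cannot both be found. A variant with lookarounds, sketched here as an alternative rather than as the answer's own method, does not consume the boundaries:

```python
import re

consuming = r'(?:^|[^A-Z])([A-Z]{3}[a-z][A-Z]{3})(?:[^A-Z]|$)'
lookaround = r'(?<![A-Z])[A-Z]{3}[a-z][A-Z]{3}(?![A-Z])'

s = 'abcdEFGhIJKlmnopABCdEFGGIddFFansTBDgRRQ'
print(re.findall(lookaround, s))
# ['EFGhIJK', 'TBDgRRQ']

# Two hits that share the single boundary character 'y':
t = 'xABCdEFGyHIJkLMNz'
print(re.findall(consuming, t))
# ['ABCdEFG']
print(re.findall(lookaround, t))
# ['ABCdEFG', 'HIJkLMN']
```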
(removed all my prior code as it was generally *bad*)