I'm trying to remove the dots from a list of abbreviations so that they will not confuse the sentence tokenizer. This should be very straightforward, but I don't know why my code is not working.
Below please find my code:
abbrevs = [
"No.", "U.S.", "Mses.", "B.S.", "B.A.", "D.C.", "B.Tech.", "Pte.", "Mr.", "O.E.M.",
"I.R.S", "sq.", "Reg.", "S-K."
]
import re

def replace_abbrev(abbrs, text):
    re_abbrs = [r"\b" + re.escape(a) + r"\b" for a in abbrs]
    abbr_no_dot = [a.replace(".", "") for a in abbrs]
    pattern_zip = zip(re_abbrs, abbr_no_dot)
    for p in pattern_zip:
        text = re.sub(p[0], p[1], text)
    return text
text = "Test No. U.S. Mses. B.S. Test"
text = replace_abbrev(abbrevs, text)
print(text)
Here is the result. Nothing happened. What was wrong? Thanks.
Test No. U.S. Mses. B.S. Test
re_abbrs = [r"\b" + re.escape(a) for a in abbrs]
You need to use this: there is no \b after the final . here, since "." is not a word character and a trailing \b only matches when a word character follows, which is why the original patterns never matched. This gives the correct output.
Test No US Mses BS Test
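For reference, a minimal runnable sketch of the fixed function (only the pattern building changes from the question's code):

import re

def replace_abbrev(abbrs, text):
    # no trailing \b: "." is not a word character, so \b after it would block the match
    re_abbrs = [r"\b" + re.escape(a) for a in abbrs]
    abbr_no_dot = [a.replace(".", "") for a in abbrs]
    for pattern, repl in zip(re_abbrs, abbr_no_dot):
        text = re.sub(pattern, repl, text)
    return text

print(replace_abbrev(["No.", "U.S.", "Mses.", "B.S."], "Test No. U.S. Mses. B.S. Test"))
# Test No US Mses BS Test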
You could use map and operator.methodcaller; no need for re, even though it's a great library.
from operator import methodcaller
' '.join(map(methodcaller('replace', '.', ''), abbrevs))
#No US Mses BS BA DC BTech Pte Mr OEM IRS sq Reg S-K
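If you also want to apply those dot-free forms back to the text without re, here is a small sketch (my own addition, not part of the answer above): pair each abbreviation with its stripped form and use plain str.replace. Note this has no word-boundary protection, unlike the regex approach.

from operator import methodcaller

abbrevs = ["No.", "U.S.", "Mses.", "B.S."]
text = "Test No. U.S. Mses. B.S. Test"

# pair each abbreviation with its dot-free form and replace it in the text
for abbr, plain in zip(abbrevs, map(methodcaller('replace', '.', ''), abbrevs)):
    text = text.replace(abbr, plain)

print(text)  # Test No US Mses BS Test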
Can anyone help fix the issue here? I am trying to extract GSTIN/UIN numbers from text.
import re

# None of these work:
#GSTIN_REG = re.compile(r'^\d{2}([a-z?A-Z?0-9]){5}([a-z?A-Z?0-9]){4}([a-z?A-Z?0-9]){1}?[Z]{1}[A-Z\d]{1}$')
#GSTIN_REG = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[A-Z0-9]{1}Z{1}[A-Z0-9]{1}')
#GSTIN_REG = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[A-Z0-9]{1}[Z]{1}[A-Z0-9]{1}$')
GSTIN_REG = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[1-9A-Z]{1}Z[0-9A-Z]{1}$')
#GSTIN_REG = re.compile(r'19AISPJ4698P1ZX') #This works
#GSTIN_REG = re.compile(r'06AACCE2308Q1ZK') #This works
def extract_gstin(text):
    return re.findall(GSTIN_REG, text)
text = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(extract_gstin(text))
Your second pattern in the commented-out part works, and you can omit {1} as repeating once is the default.
What you might do to make it a bit more specific is add word boundaries \b to the left and right to prevent a partial word match.
If it should be after GSTIN : you can use a capture group as well.
Example with the commented pattern:
import re
GSTIN_REG = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9]')
def extract_gstin(s):
    return re.findall(GSTIN_REG, s)
s = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(extract_gstin(s))
Output
['06AACCE2308Q1ZK']
A bit more specific pattern (which has the same output, since re.findall returns the value of the capture group):
\bGSTIN : ([0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9])\b
Regex demo
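A sketch of that more specific pattern in use, assuming the label literally reads GSTIN : as in the sample string:

import re

# re.findall returns only the capture group, i.e. the GSTIN itself
GSTIN_REG = re.compile(r'\bGSTIN : ([0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9])\b')

s = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(re.findall(GSTIN_REG, s))  # ['06AACCE2308Q1ZK']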
My problem is the following: I want to do a sentiment analysis on Italian tweets, and I would like to tokenise and lemmatise my Italian text in order to find new analysis dimensions for my thesis. The problem is that I would like to tokenise my hashtags, splitting also the composed ones. For example, if I have #nogreenpass, I would like to have it also without the # symbol, because the sentiment of the phrase would be better understood with all the words of the text. How could I do this? I tried with spaCy, but I got no results. I created a function to clean my text, but I can't get the hashtags the way I would like. I'm using this code:
import re
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('it_core_news_lg')
# Clean_text function
def clean_text(text):
    text = str(text).lower()
    doc = nlp(text)
    text = re.sub(r'#[a-z0-9]+', str(' '.join(t in nlp(doc))), str(text))
    text = re.sub(r'\n', ' ', str(text))  # Remove \n
    text = re.sub(r'@[A-Za-z0-9]+', '<user>', str(text))  # Remove and replace @mention
    text = re.sub(r'RT[\s]+', '', str(text))  # Remove RT
    text = re.sub(r'https?:\/\/\S+', '<url>', str(text))  # Remove and replace links
    return text
For example, here I don't know how to add the first < and the last > when replacing the # symbol, and the tokenisation process doesn't work the way I would like. Thank you for the time spent on me and for your patience. I hope to become stronger in Jupyter analysis and Python coding so I can also help with your problems. Thank you guys!
You can tweak your current clean_text code with:
def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'#(\w+)', r'<\1>', text)
    text = re.sub(r'\n', ' ', text)  # Remove \n
    text = re.sub(r'@[A-Za-z0-9]+', '<user>', text)  # Remove and replace @mention
    text = re.sub(r'RT\s+', '', text)  # Remove RT
    text = re.sub(r'https?://\S+\b/?', '<url>', text)  # Remove and replace links
    return text
See the Python demo online.
The following line of code:
print(clean_text("@Marcorossi hanno ragione I #novax htt"+"p://www.asfag.com/"))
will yield
<user> hanno ragione i <novax> <url>
Note there is no easy way to split a glued string into its constituent words. See How to split text without spaces into list of words for ideas on how to do that.
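Just as an illustration of the dictionary-based idea from that link, here is a toy splitter; the vocabulary below is made up for the example, and a real Italian word list would be needed in practice:

def split_glued(word, vocab):
    # recursively try the longest known prefix; return None if no full split exists
    if not word:
        return []
    for i in range(len(word), 0, -1):
        if word[:i] in vocab:
            rest = split_glued(word[i:], vocab)
            if rest is not None:
                return [word[:i]] + rest
    return None

vocab = {"no", "green", "pass"}  # toy vocabulary, just for the example
print(split_glued("nogreenpass", vocab))  # ['no', 'green', 'pass']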
I have the following code
import re
oldstr="HR Director, LearningÂ"
newstr = re.sub(r"[-()\"#/@;:<>{}`+=&~|.!?,^]", " ", oldstr)
print(newstr)
The above code does not work.
Current result
"HR Director, LearningÂ"
Expected result
"HR Director, Learning"
How can I achieve this?
Converting my comment to an answer so that the solution is easy to find for future visitors.
You may use:
import re
oldstr="HR Director, LearningÂ"
newstr = re.sub(r'[^\x00-\x7f]+|[-()"#/@;:<>{}`+=&~|.!?,^]+', "", oldstr)
print(newstr)
Output:
HR Director Learning
[^\x00-\x7f] will match all non-ASCII characters.
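As a side note of mine (not part of the answer above), if you only want to drop the non-ASCII characters and keep the punctuation, an encode/decode round-trip does it without a regex; this gives the question's expected result with the comma kept:

oldstr = "HR Director, LearningÂ"

# errors='ignore' silently drops anything that cannot be encoded as ASCII
newstr = oldstr.encode('ascii', errors='ignore').decode('ascii')
print(newstr)  # HR Director, Learning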
You can use this method too:
def _removeNonAscii(s):
    return "".join(i for i in s if ord(i)<128)
Here's what my piece of code outputs:
s = "HR Director, LearningÂ"
def _removeNonAscii(s):
    return "".join(i for i in s if ord(i)<128)
print(_removeNonAscii(s))
Output:
HR Director, Learning
I'm trying to identify all sentences that contain in-text citations in a journal article in PDF format.
I converted the .pdf to .txt and wanted to find all sentences that contain a citation, possibly in one of the following formats:
Smith (1990) stated that....
An agreement was made on... (Smith, 1990).
An agreement was made on... (April, 2005; Smith, 1990)
Mixtures of the above
I first tokenized the txt into sentences:
import nltk
from nltk.tokenize import sent_tokenize
ss = sent_tokenize(text)
This makes type(ss) a list, so I converted the list into a str to use re.findall:
def listtostring(s):
    str1 = ' '
    return str1.join(s)
ee = listtostring(ss)
Then, my idea was to identify the sentences that contained a four digit number:
import re
for sentence in ee:
    zz = re.findall(r'\d{4}', ee)
    if zz:
        print(zz)
However, this extracts only the years but not the sentences that contained the years.
Using regex, a pattern (try it out) that can have decent recall while avoiding inappropriate matches (\d{4} may give you a few) is:
\(([^)]+)?(?:19|20)\d{2}?([^)]+)?\)
A python example (using spaCy instead of NLTK) would then be
import re
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("One statement. Then according to (Smith, 1990) everything will be all right. Or maybe not.")
l = [sent.text for sent in doc.sents]
for sentence in l:
    if re.findall(r'\(([^)]+)?(?:19|20)\d{2}?([^)]+)?\)', sentence):
        print(sentence)
import re
l = ['This is 1234','Hello','Also 1234']
for sentence in l:
    if re.findall(r'\d{4}', sentence):
        print(sentence)
Output
This is 1234
Also 1234
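Tying this back to the original flow, keep the sent_tokenize result as a list and test each sentence. Here is a sketch, assuming NLTK's punkt data is installed (the sample text is borrowed from the spaCy example above):

import re
from nltk.tokenize import sent_tokenize

text = "One statement. Then according to (Smith, 1990) everything will be all right. Or maybe not."

for sentence in sent_tokenize(text):  # iterate over sentences, not over one long string
    if re.findall(r'\d{4}', sentence):
        print(sentence)
# Then according to (Smith, 1990) everything will be all right.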
I want to separate some prefixes that have become fused into words, wherever the word "di" is directly followed by letters.
sentence1 = "dipermudah diperlancar"
sentence2 = "di permudah di perlancar"
I expect output like this:
output1 = "di permudah di perlancar"
output2 = "di permudah di perlancar"
Demo
This expression might work to some extent:
(di)(\S+)
if our data looks as simple as it does in the question. Otherwise, we would add more boundaries to our expression.
Test
import re
regex = r"(di)(\S+)"
test_str = "dipermudah diperlancar"
subst = "\\1 \\2"
print(re.sub(regex, subst, test_str))
The expression is explained in the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link you can watch how it matches against some sample inputs, if you like.
Here is one way to do this using re.sub:
import re

sentence1 = "adi dipermudah diperlancar"
output = re.sub(r'(?<=\bdi)(?=\w)', ' ', sentence1)
print(output)
Output:
adi di permudah di perlancar
The idea here is to insert a space whenever what immediately precedes is the prefix di, and what also follows is some other word character.
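For what it's worth, the same call leaves text that is already separated untouched, because the (?=\w) lookahead requires a word character right after di:

import re

sentence2 = "di permudah di perlancar"
# "di" here is followed by a space, so the lookahead never matches and nothing changes
print(re.sub(r'(?<=\bdi)(?=\w)', ' ', sentence2))
# di permudah di perlancar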