Removing punctuation marks in NLTK tokenization with a dataframe (Python)

I have some text that I have already cleaned of stop words, links, emoticons, and so on. After tokenizing my dataframe, though, the result is not great: many stray punctuation marks are identified as separate words and appear in the processed text.
For this I use the following command:
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['tweet_without_stopwords'].apply(nltk.word_tokenize)
As you can see, there are many characters like dashes, colons, etc. The obvious question is why not remove punctuation before tokenization. The reason is that the text contains decimal values that I need; removing punctuation marks before tokenization splits them into two words, which is not correct.
An example of what happens when you remove punctuation marks before tokenization:
import texthero as hero
from texthero import preprocessing

custom_pipeline2 = [preprocessing.remove_punctuation]
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['tweet_without_stopwords'].pipe(hero.clean, custom_pipeline2)
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['clean_custom_content_tokenize'].apply(nltk.word_tokenize)
I have found a couple of examples of how to solve this punctuation problem, but only for plain strings, not for a dataframe. Can NLTK's tokenization be customized somehow? Or can some kind of regular expression be used to post-process the resulting lists?

Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['clean_custom_content_tokenize'].apply(lambda x: re.sub(r"(, '[\W\.]')",r"", str(x)))
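One approach that should work here is NLTK's RegexpTokenizer, which lets you define what counts as a token instead of stripping punctuation afterwards. A minimal sketch, assuming the decimals look like 3.5 and reusing the column names from the question; the exact token pattern is my assumption and may need adjusting:

import nltk
from nltk.tokenize import RegexpTokenizer

# A token is either a decimal number (e.g. 3.5) or a run of word
# characters; stray dashes, colons, commas, etc. are simply skipped.
tokenizer = RegexpTokenizer(r"\d+(?:\.\d+)?|\w+")

Data_preprocessing['clean_custom_content_tokenize'] = (
    Data_preprocessing['tweet_without_stopwords'].apply(tokenizer.tokenize)
)

Alternatively, keep nltk.word_tokenize and filter the token lists afterwards, e.g. .apply(lambda toks: [t for t in toks if any(ch.isalnum() for ch in t)]), which drops tokens made purely of punctuation while keeping numbers like 3.5.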

Related

Regular expression to search for specific twitter username

I have a project where I'm trying to analyze a database of tweets. I need to write a python regex expression that pulls tweets mentioning specific twitter users. Here is an example tweet I'd like to capture.
"That #A_Person is a real jerk."
The regex that I've been trying is
([^.?!]*)(\b([#]A_Person)\b)([^.?!]*)
But it's not working and I've tried lots of variations. Any advice would be appreciated!
\b matches a word boundary, but # is not a word character, so if it occurs after a space, the match will fail. Try removing that word boundary and the extra groups, and add a character class [.?!] at the end to include the final punctuation, and you get:
[^.?!]*#A_Person\b.*?[^.?!]*[.?!]
You also might consider including a check for the start of the string or the end of the last sentence; otherwise the engine will waste a lot of steps scanning regions without any matches. Perhaps use
(?:^|(?<=[.?!])\s*)
which will match the start of the string, or will lookbehind for [.?!] possibly followed by spaces. Put those together and you get
(?:^|(?<=[.?!])\s*)([^.?!]*#A_Person\b.*?[^.?!]*[.?!])
where the string you want is in the first group (no leading spaces).
https://regex101.com/r/447KsF/3
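As a quick sanity check in Python (the surrounding sentences here are invented for illustration):

import re

pattern = re.compile(r"(?:^|(?<=[.?!])\s*)([^.?!]*#A_Person\b.*?[^.?!]*[.?!])")
text = "Some other sentence. That #A_Person is a real jerk. More text."
match = pattern.search(text)
if match:
    print(match.group(1))  # That #A_Person is a real jerk.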

Python regex ignore punctuation when using re.sub

Let's say I want to convert the word center to centre, theater to theatre, etc. In order to do so, I have written a regex like the one below:
s = "center ce..nnnnnnnnteeeerrrr mmmmeeeeet.eeerrr liiiiIIiter l1t3r"
regex = re.compile(r'(?:((?:(?:[l1]+\W*[i!1]+|m+\W*[e3]+|c+\W*[e3]+\W*n+)\W*t+|t+\W*h+\W*[e3]+\W*a+\W*t+|m+\W*a+\W*n+\W*[e3]+\W*u+\W*v+)\W*)([e3]+)(\W*)(r+))', re.I)
print(regex.sub(r'\1\4\3\2',s)
#prints "centre ce..nnnnnnnntrrrreeee mmmmeeeeet.rrreee liiiiIIitre l1tr3"
In order to account for loopholes like c.e.nn.ttteee,/rr (basically repeated characters and added punctuation), I have been forced to add \W* between each character.
However, people are still able to use strings like c.c.e.e.n.n.t.t.e.e.r.r, which don't match because there is punctuation between repeated letters, not just between different letters.
I was wondering whether there is a smarter method of doing this, where I can use re.sub without removing whitespace/punctuation but nonetheless have it match.
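No answer is recorded here, but one way to close that particular loophole is to allow punctuation after every letter, including repeats, by writing (?:c\W*)+ instead of c+\W*. A minimal sketch covering only the center/centre branch (the other words would be extended the same way); the pattern is my own variation, not the asker's:

import re

s = "center c.c.e.e.n.n.t.t.e.e.r.r ce..nnnnnnnnteeeerrrr"

# (?:c\W*)+ tolerates punctuation between repeated letters as well,
# which c+\W* does not; the final r-group must not end in \W*, or it
# would swallow the whitespace that follows the word.
regex = re.compile(r'((?:c\W*)+(?:[e3]\W*)+(?:n\W*)+(?:t\W*)+)((?:[e3]\W*)+)((?:r\W*)*r+)', re.I)
print(regex.sub(r'\1\3\2', s))
# prints "centre c.c.e.e.n.n.t.t.r.re.e. ce..nnnnnnnntrrrreeee"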

How to handle words which have a space between characters?

I am using nltk.word_tokenize on text in the Dari language. The problem is that some single words contain a space.
For example, the word "زنده گی", which means life; and there are many other words like it. All words ending in the character "ه" must be followed by a space, otherwise the letters join together, as in "زندهگی".
Can anyone help me, with regex or any other way, so that tokenization does not split words where one part ends with "ه" and the next part begins with "گ"?
To resolve this problem in Persian we have a character called the zero-width non-joiner (نیم‌فاصله in Persian, also known as a half space), which has two code points. One is standard and the other is non-standard but widely used:
\u200C : http://en.wikipedia.org/wiki/Zero-width_non-joiner
\u200F : Right-to-left mark (http://unicode-table.com/en/#200F)
As far as I know, Dari is very similar to Persian. So first of all you should correct all words like زنده گی to زنده‌گی, converting the wrong spaces to half spaces; then you can simply use this regex to match all the words of a sentence:
[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+
Online demo (the black bullet in the test string is the half space, which regex101 does not render recognizably, but if you check the match information and look at Match 5 you will see that it is correct).
For converting the wrong spaces of a large text to half spaces, there is an add-in for Microsoft Word called Virastyar, which is free and open source. You can install it and refine your whole text. But keep in mind that this add-in was created for Persian, not Dari. For example, in Persian we write زنده‌گی as زندگی, so it cannot correct this particular word for you, but other words like می شود are easily corrected and converted to می‌شود. You can also add custom words to the database.
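A minimal Python sketch of the normalization step described above, assuming the only pattern to fix is a plain space between a part ending in "ه" and a part beginning with "گ" (real text will need a larger list of such patterns):

import re

text = "زنده گی"

# Replace the plain space with a zero-width non-joiner (U+200C) so the
# two parts remain a single word for display and for tokenization.
normalized = re.sub('(?<=ه) (?=گ)', '\u200c', text)

# The character class suggested above then matches the whole word as
# one token, because U+200C is included in the class.
tokens = re.findall('[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+', normalized)
print(tokens)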

How to use a regex to find combined words with period delimiters?

I am trying to find all instances of words separated by period delimiters.
So for example, these would be valid:
word1.word2
word1.word2.word3
word1.word2.word3.word4
And so on. Valid characters for the words are a-z, A-Z, 0-9, and the hyphen.
I tried [\w.]* but this does not appear to be accurate.
You can use the following:
[a-zA-Z0-9]\w+(?:\.\w+)+
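A quick check of that pattern with re.findall (the test string is invented for illustration):

import re

pattern = re.compile(r'[a-zA-Z0-9]\w+(?:\.\w+)+')
text = "word1.word2 plain word1.word2.word3.word4 end."
print(pattern.findall(text))
# ['word1.word2', 'word1.word2.word3.word4']

Note that \w does not cover the hyphen the question allows; replacing \w with [\w-] would, as long as the - stays at the end of the character class.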

How to tokenize continuous words with no whitespace delimiters?

I'm using Python with NLTK. I need to process some English text that contains no whitespace at all, and NLTK's word_tokenize function can't deal with input like this. How can I tokenize text without any whitespace? Are there any tools for this in Python?
I am not aware of such tools, but the solution to your problem depends on the language.
For Turkish, you can scan the input text letter by letter and accumulate letters into a word. When you are sure that the accumulated letters form a valid dictionary word, save it as a separate token, clear the buffer, and continue the process.
You can try this for English as well, but expect situations where the ending of one word is also the beginning of another dictionary word, which can cause problems.
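A minimal sketch of the dictionary-scan idea above, using greedy longest-match so that, for example, "there" wins over "the" when both fit at the same position; the segment function and the tiny vocabulary are hypothetical stand-ins for a real dictionary:

def segment(text, vocabulary):
    # Greedy longest-match: at each position take the longest dictionary
    # word that starts there; fall back to a single character otherwise.
    tokens, i = [], 0
    while i < len(text):
        match = next(
            (text[i:j] for j in range(len(text), i, -1) if text[i:j] in vocabulary),
            text[i],  # unknown character: emit it on its own
        )
        tokens.append(match)
        i += len(match)
    return tokens

vocab = {"the", "there", "is", "a", "cat"}
print(segment("thereisacat", vocab))  # ['there', 'is', 'a', 'cat']

As the answer warns, greedy matching can still mis-segment when word endings overlap; a dynamic-programming pass over all possible splits is the usual remedy.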
