how to format my text dataset for training? - python

I'm new to Python and machine learning, and I'm working on training a chatbot.
I collected (or wrote) a large number of possible inputs in an Excel file (.xlsx). I will train my dataset using an LSTM and IOBES labeling, following the same approach as here:
https://www.depends-on-the-definition.com/guide-sequence-tagging-neural-networks-python/
The link shows a snapshot of the dataset, and I want to format my dataset like it.
My questions are:
1- Is there a way to split a sentence into words so I can do the tagging per word? (There is a tool in Excel; I tried it, but it is very tedious.)
2- I tried to convert my file to .csv, but I faced a lot of problems (the file is UTF-8, because my dataset is not in English). Is there another format I could use?
I really appreciate your help and advice.
Thank you

You can use the pandas function pd.read_excel('your_file.xlsx') to read the spreadsheet directly and avoid converting your file into CSV.
To split a sentence into words, you can use a Natural Language Processing (NLP) Python package like nltk with an English vocabulary. This will take punctuation, quotes, etc. into account.
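For example, here is a minimal sketch combining both suggestions; the file name and the 'Sentence' column are placeholders for your own data, and nltk.word_tokenize needs the 'punkt' model downloaded once:
import pandas as pd
import nltk

nltk.download('punkt')  # one-time download of the tokenizer model

# Read the spreadsheet directly; no CSV conversion needed.
df = pd.read_excel('your_file.xlsx')

# Split each sentence into a list of word tokens ('Sentence' is a hypothetical column name).
df['Tokens'] = df['Sentence'].astype(str).apply(nltk.word_tokenize)
print(df.head())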

I'm using openpyxl to load the Excel file directly into memory. For instance,
from openpyxl import load_workbook

trainingFile = './InputForTraining/1.labelled.Data.V2.xlsx'
trainingSheet = 'sheet1'

TrainingFile = load_workbook(trainingFile)   # load the workbook into memory
sheet = TrainingFile[trainingSheet]          # select the worksheet to read from
Then you don't have to convert the Excel file to CSV. Sometimes, if the data structure is very complicated, the conversion is not that simple, and you still have to write some code to build the structure yourself, for example by iterating over the rows as sketched below.
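A short continuation of the snippet above (the column layout and header row are assumptions about your file), showing how you might iterate over the rows to build your own word/tag structure:
# 'sheet' is the worksheet object loaded above.
rows = []
for row in sheet.iter_rows(min_row=2, values_only=True):   # min_row=2 skips a header row
    rows.append(row)
print(rows[:3])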
Splitting a sentence is quite easy if your text is clean: Python strings have a split() method that splits a string into a list of words on whitespace. For instance,
wordsList = yourString.split()
But you will need to be careful about punctuation, which usually follows right after a word. You can use a regex to split punctuation off a word. For instance,
import re

pat = re.compile(r"([.,;:()/&])")
return_text = pat.sub(r" \1 ", return_text)   # pad each matched punctuation mark with spaces
wordList = return_text.split()
So each of the characters [.,;:()/&] will be split off from the word it touches.
Or you can just remove the punctuation from the sentences entirely if you don't need it at all, replacing it with spaces. For instance,
return_text = re.sub(r"[^a-zA-Z0-9\s]+", ' ', text).strip()
Then only letters, digits, and whitespace will remain; .strip() removes any leading and trailing spaces.

Related

Removing punctuation marks in tokenization nltk with dataframe (python)

I have some text that I have already cleaned of stop words, links, emoticons, etc. After tokenizing my dataframe, the result is not great: a lot of extra punctuation marks are identified as separate words and appear in the processed text.
For this I use the following command:
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['tweet_without_stopwords'].apply(nltk.word_tokenize)
As you can see, there are many characters like dashes, colons, etc. The question immediately pops up: why not remove the punctuation before tokenization? The point is that there are decimal values in the text that I need, and removing punctuation marks before tokenization splits each of them into two words, which is not correct.
An example of what happens when you remove punctuation marks before tokenization:
custom_pipeline2 = [preprocessing.remove_punctuation]
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['tweet_without_stopwords'].pipe(hero.clean, custom_pipeline2)
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['clean_custom_content_tokenize'].apply(nltk.word_tokenize)
I have found a couple of examples of how to solve this punctuation problem, but only for the case where the data is a plain string rather than a DataFrame. Can I somehow customize the nltk tokenization? Or use some kind of regular expression to post-process the resulting list, as in my attempt below?
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['clean_custom_content_tokenize'].apply(lambda x: re.sub(r"(, '[\W\.]')",r"", str(x)))
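One possible approach, sketched under the assumption that the decimal values are the only punctuation you need to keep: nltk's RegexpTokenizer lets you define the token pattern yourself, so numbers like 3.5 stay intact while stray punctuation is dropped. The DataFrame and column names are the ones from the question above:
from nltk.tokenize import RegexpTokenizer

# Keep decimals (e.g. 3.5 or 3,5) as single tokens; otherwise take runs of word characters only.
tokenizer = RegexpTokenizer(r'\d+(?:[.,]\d+)?|\w+')

print(tokenizer.tokenize("Growth was 3.5 % - even better than 2020 , frankly"))
# -> ['Growth', 'was', '3.5', 'even', 'better', 'than', '2020', 'frankly']

Data_preprocessing['clean_custom_content_tokenize'] = (
    Data_preprocessing['tweet_without_stopwords'].astype(str).apply(tokenizer.tokenize))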

Filtering out Non English sentences in a list in Python Pandas

So there is an Excel file which I have read through pandas and stored in a dataframe 'df'. That Excel file contains 24 columns of 'questions' and 631 rows of 'responses/answers'.
So I converted one such question into a list so that I can tokenize it and apply further NLP-related tasks to it.
df_lst = df['Q8 Why do you say so ?'].values.tolist()
Now, this gives me a list that contains 631 sentences, some of which are non-English. I want to filter out the non-English sentences so that in the end I am left with a list containing only English sentences.
What I have:
df_lst = ["The executive should be able to understand the customer's problem", "Customers should get correct responses to their queries", "This text is in a random non-English language", ...]
Output (what I want):
english_words = ["The executive should be able to understand the customer's problem", "Customers should get correct responses to their queries", ...]
Also, I read about a Python library named pyenchant which should be able to do this, but it's not compatible with Windows 64-bit and Python 3. Is there any other way this can be done?
Thanks!
There is another library (closely related to nltk), TextBlob.
It is mainly known for sentiment analysis, but you can still use it for translation and language detection; see the docs here: https://textblob.readthedocs.io/en/dev/quickstart.html
Section "Translation and Language Detection".
Good luck!
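If detect_language() is not available in your TextBlob version (it has been deprecated because it relied on an online API), the standalone langdetect package is one alternative; a minimal sketch:
from langdetect import detect   # pip install langdetect

sentences = ["The executive should be able to understand the customer's problem",
             'Customers should get correct responses to their queries',
             'Dies ist ein Satz in einer anderen Sprache']

english_sentences = []
for sentence in sentences:
    try:
        if detect(sentence) == 'en':   # langdetect returns an ISO 639-1 language code
            english_sentences.append(sentence)
    except Exception:                  # langdetect raises on empty or undecidable strings
        pass

print(english_sentences)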
Have you considered taking advantage of the number of English "stopwords" in your sentences? Have a look at the nltk package. You can inspect the English stopwords with the following code:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # only needed once, after installing the package
set(stopwords.words('english'))
You could add a new column indicating the number of English stopwords present in each of your sentences. The presence of stopwords can be used as a predictor of English language.
Another thing that could work: if you know for a fact that most answers are in English to begin with, build a frequency ranking of words (possibly for each question in your data). In your example, the word "customer" shows up quite consistently for the question under study, so you could engineer a variable that indicates the presence of very frequent words in an answer. That could also serve as a predictor. Do not forget to make all words lowercase (or uppercase) and to deal with plurals and possessives, so you don't rank "customer", "Customer", "customers", "Customers", "customer's" and "customers'" as different words.
After engineering the variables above, you can set a threshold above which you consider a sentence to be written in English, or you can go for something a bit fancier in terms of unsupervised learning.
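A minimal sketch of the stopword-count heuristic; the column name and the threshold of 2 are only illustrative:
import nltk
import pandas as pd
from nltk.corpus import stopwords

nltk.download('stopwords')
english_stopwords = set(stopwords.words('english'))

df = pd.DataFrame({'answer': ['Customers should get correct responses to their queries',
                              'Dies ist kein englischer Satz']})

def count_english_stopwords(sentence):
    # Count how many lowercased tokens of the sentence are English stopwords.
    return sum(1 for w in str(sentence).lower().split() if w in english_stopwords)

df['n_stopwords'] = df['answer'].map(count_english_stopwords)
english_only = df[df['n_stopwords'] >= 2]   # keep rows with at least two English stopwords
print(english_only)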

How do I remove non-English words from a file?

I am trying to process a file with two columns: text and categories. From the text column, I need to remove non-English words. I am new to Python, so I would appreciate any suggestions on how to do this. My file has 60,000 rows of instances.
I can get to the point below but need help on how to move forward.
If you want to remove tokens that contain punctuation, symbols, or digits, you can use the isalpha() method of str (note that isalpha() is Unicode-aware, so it will not by itself filter out alphabetic words from other scripts).
words = [word.lower() for word in words if word.isalpha()]
To remove meaningless English words you can proceed with @Infinity's suggestion, but a dictionary of 20,000 words will not cover all the scenarios.
As this question is tagged text-mining, you can select a source similar to the corpus you are using, collect all the words in that source, and then proceed with @Infinity's approach.
This code should do the trick.
import pandas
import requests
import string

# The following link contains a text file with the 20,000
# most frequent words in english, one in each line.
DICTIONARY_URL = 'https://raw.githubusercontent.com/first20hours/' \
                 'google-10000-english/master/20k.txt'

PATH = r"C:\path\to\file.csv"
FILTER_COLUMN_NAME = 'username'
PRINTABLES_SET = set(string.printable)


def is_english_printable(word):
    return PRINTABLES_SET >= set(word)


def prepare_dictionary(url):
    return set(requests.get(url).text.splitlines())


DICTIONARY = prepare_dictionary(DICTIONARY_URL)

df = pandas.read_csv(PATH, encoding='ISO-8859-1')
df = df[df[FILTER_COLUMN_NAME].map(is_english_printable) &
        df[FILTER_COLUMN_NAME].map(str.lower).isin(DICTIONARY)]

Regex Python: Adding . after every 15 terms

I have a text file containing cleaned tweets, and after every 15th term I need to insert a period.
In Python, how do I add a character after a specific word using regex? Right now I am parsing the line word by word, and I don't understand regex well enough to write the code.
Basically, the goal is for each line to become its own string after a period.
Or is there an alternative way to split a paragraph into individual sentences?
Splitting paragraphs into sentences can be achieved with functions in the nltk package; please refer to this answer: Python split text on sentences.
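A minimal sketch of both steps; the sample text is a placeholder, and nltk.sent_tokenize needs the 'punkt' model downloaded once:
import nltk
nltk.download('punkt')

text = "clean tweet words flow on and on with no punctuation so we chunk them here and keep going a bit longer"

# Step 1: insert a period after every 15th word (a plain split/join is simpler than a regex here).
words = text.split()
chunks = [' '.join(words[i:i + 15]) for i in range(0, len(words), 15)]
text_with_periods = '. '.join(chunks) + '.'

# Step 2: once periods exist, split the paragraph into sentences.
sentences = nltk.sent_tokenize(text_with_periods)
print(sentences)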

Creating a corpus from data in a custom format

I have hundreds of files containing text I want to use with NLTK. Here is one such file:
বে,বচা ইয়াণ্ঠা,র্চা ঢার্বিত তোখাটহ নতুন, অ প্রবঃাশিত।
তবে ' এ বং মুশায়েরা ' পত্রিব্যায় প্রকাশিত তিনটি লেখাই বইযে
সংব্যজান ব্যরার জনা বিশেষভাবে পরিবর্ধিত। পাচ দাপনিকেব
ড:বন নিয়ে এই বই তৈরি বাবার পরিব্যল্পনাও ম্ভ্রাসুনতন
সামন্তেরই। তার আর তার সহকারীদেব নিষ্ঠা ছাডা অল্প সময়ে
এই বই প্রব্যাশিত হতে পারত না।,তাঁদের সকলকে আমাধ
নমস্কার জানাই।
বতাব্যাতা শ্রাবন্তা জ্জাণ্ণিক
জানুয়ারি ২ ণ্ট ণ্ট ৮
Total characters: 378
Note that each line does not contain a new sentence. Rather, the sentence terminator - the equivalent of the period in English - is the '।' symbol.
Could someone please help me create my corpus? If imported into a variable MyData, I would need to access MyData.words() and MyData.sents(). Also, the last line should not appear in the corpus (it merely contains a character count).
Please note that I will need to run operations on data from all the files at once.
Thanks in advance!
You don't need to read in the files yourself or to provide the words and sents methods.
Read in your corpus with PlaintextCorpusReader, and it will provide those for you.
The corpus reader constructor accepts arguments for the path and filename pattern of the files, and for the input encoding (be sure to specify it).
The constructor also has optional arguments for the sentence and word tokenization functions, so you can pass it your own methods to break the text into sentences and words. If word and sentence detection is really simple, i.e., if the '।' character has no other uses, you can configure a tokenization function from nltk's RegexpTokenizer family, or you can write your own from scratch. (Before you write your own, study the docs and code, or write a stub to find out what kind of input it's called with.)
If recognizing sentence boundaries is non-trivial, you can later figure out how to train nltk's PunktSentenceTokenizer, which uses an unsupervised statistical algorithm to learn which uses of the sentence terminator actually end a sentence.
If the configuration of your corpus reader is fairly complex, you may find it useful to create a class that specializes PlaintextCorpusReader. But much of the time that's not necessary. Take a look at the NLTK code to see how the gutenberg corpus is implemented: it's just a PlaintextCorpusReader instance with appropriate arguments for the constructor.
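Here is a minimal sketch of such a reader; the corpus directory and file pattern are placeholders, and the regex tokenizers assume '।' is only ever used as a sentence terminator (stripping the character-count line is covered in the next answer):
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

corpus_root = './my_corpus'                     # hypothetical directory holding the text files

sent_tok = RegexpTokenizer('।', gaps=True)      # sentences are the spans between '।' marks
word_tok = RegexpTokenizer(r'[^\s।]+')          # words are runs of non-space, non-'।' characters

MyData = PlaintextCorpusReader(corpus_root, r'.*\.txt',
                               word_tokenizer=word_tok,
                               sent_tokenizer=sent_tok,
                               encoding='utf-8')

print(MyData.words()[:20])
print(MyData.sents()[:2])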
1) Getting rid of the last line is rather straightforward:
f = open('corpus.txt', 'r', encoding='utf-8')
for l in f.readlines()[:-1]:
    ...  # process each line here
The [:-1] in the for loop skips the last line for you.
2) The built-in readlines() method of a file object breaks the content of the file into lines using the newline character as a delimiter. So you need to write some code that caches lines until a '।' is seen; when a '।' is encountered, treat the cached text as one single sentence and put it in your MyData structure, for example as sketched below.
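A minimal sketch of that caching idea, assuming the file is UTF-8 and its last line is the character count:
sentences = []
buffer = ''
with open('corpus.txt', encoding='utf-8') as f:
    for line in f.readlines()[:-1]:        # skip the character-count line
        buffer += line.strip() + ' '
        while '।' in buffer:               # at least one complete sentence is cached
            sent, buffer = buffer.split('।', 1)
            sentences.append(sent.strip() + ' ।')

print(len(sentences))
print(sentences[:2])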
