I am trying to process a file with two columns, text and category. From the text column, I need to remove non-English words. I am new to Python, so I would appreciate any suggestions on how to do this. My file has 60,000 rows (instances).
I can get to the point below but need help on how to move forward.
If you want to remove non-English characters, such as punctuation, symbols or script of any other language, you can use the isalpha() method of the str type.
words=[word.lower() for word in words if word.isalpha()]
To remove meaningless English words you can proceed with #Infinity's suggestion, but a dictionary of 20,000 words will not cover every scenario.
Since this question is tagged text-mining, you can select a source similar to the corpus you are using, collect all the words from that source, and then proceed with #Infinity's approach.
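For example, a rough sketch of building your own word set from a reference corpus could look like this (the file name reference_corpus.txt is made up; use whatever source matches your domain), reusing the words list from above:
import re

with open('reference_corpus.txt', encoding='utf-8') as f:
    # collect every alphabetic token in the reference corpus
    corpus_words = set(re.findall(r'[a-z]+', f.read().lower()))

words = [word.lower() for word in words
         if word.isalpha() and word.lower() in corpus_words]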
This code should do the trick.
import pandas
import requests
import string
# The following link contains a text file with the 20,000
# most frequent words in english, one in each line.
DICTIONARY_URL = 'https://raw.githubusercontent.com/first20hours/' \
                 'google-10000-english/master/20k.txt'
PATH = r"C:\path\to\file.csv"
FILTER_COLUMN_NAME = 'username'
PRINTABLES_SET = set(string.printable)
def is_english_printable(word):
    return PRINTABLES_SET >= set(word)

def prepare_dictionary(url):
    return set(requests.get(url).text.splitlines())

DICTIONARY = prepare_dictionary(DICTIONARY_URL)

df = pandas.read_csv(PATH, encoding='ISO-8859-1')
df = df[df[FILTER_COLUMN_NAME].map(is_english_printable) &
        df[FILTER_COLUMN_NAME].map(str.lower).isin(DICTIONARY)]
I'm trying to find a regex (or a different method) that removes whitespace in a string only if it occurs between two lowercase letters. I'm doing this because I'm cleaning noisy text from scans where whitespace was mistakenly added inside words.
For example, I'd like to turn the string noisy = "Hel lo, my na me is Mark." into clean = "Hello, my name is Mark."
I've tried to capture the group in a regex (see below) but don't know how to then replace only the whitespace between two lowercase letters. Same issue with re.sub.
This is what I've tried, but it doesn't work because it removes all the whitespace from the string:
import re
noisy = "Hel lo my name is Mark"
finder = re.compile("[a-z](\s)[a-z]")
whitesp = finder.search(noisy).group(1)
clean = noisy.replace(whitesp,"")
print(clean)
Any ideas are appreciated thanks!
EDIT 1:
My use case is for Swedish words and sentences that I have OCR'd from scanned documents.
To correct an entire string, you could try symspellpy.
First, install it using pip:
python -m pip install -U symspellpy
Then, import the required packages and load the dictionaries. Dictionary files shipped with symspellpy can be accessed using pkg_resources.

You can pass your string through the lookup_compound function, which will return a list of spelling suggestions (SuggestItem objects). Words that require no change will still be included in this list. max_edit_distance refers to the maximum edit distance for doing lookups (per single word, not the entire string). You can maintain casing by setting transfer_casing to True.

To get the clean string, a simple join statement with a little list comprehension does the trick.
import pkg_resources
from symspellpy import SymSpell
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy",
    "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy",
    "frequency_bigramdictionary_en_243_342.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)
my_str = "Hel lo, my na me is Mark."
sugs = sym_spell.lookup_compound(
    my_str,
    max_edit_distance=2,
    transfer_casing=True
)
print(" ".join([sug.term for sug in sugs]))
Output:
Hello my name is Mark
Check out their documentation for other examples and use cases.
Is this what you want:
In [3]: finder = re.compile(r"([a-z])\s([a-z])")
In [4]: clean = finder.sub(r'\1\2', noisy, 1)
In [5]: clean
Out[5]: 'Hello my name is Mark'
I think you need a Python module that contains a word list (like an Oxford dictionary) so you can check whether joining two tokens that have a space between them produces a valid word. For example: break the string into a list with string.split(), then loop over the list starting at index 1 (range(1, len(your_list))), joining the previous and current items (your_list[index - 1] + your_list[index]) into a single token (i.e., a candidate word). Check that token against the set of words you have collected; if it is a valid word, append the token to a temporary list, otherwise append the previous word. Once the loop is done, join the temporary list back into a string.
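A rough sketch of that idea (not a complete solution; the WORDS set below is a tiny stand-in for whatever word list you collect, and punctuation is ignored):
WORDS = {"hello", "my", "name", "is", "mark"}

def merge_broken_words(text):
    tokens = text.split()
    result = []
    i = 0
    while i < len(tokens):
        # join the current token with the next one and keep the pair
        # only if the combination is a known word
        if i + 1 < len(tokens) and (tokens[i] + tokens[i + 1]).lower() in WORDS:
            result.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return " ".join(result)

print(merge_broken_words("Hel lo my na me is Mark"))  # -> "Hello my name is Mark"
Note that this simple rule cannot tell a genuinely broken word apart from two valid words that happen to combine into another valid word, so a real word list will produce some false merges.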
You can try the Python spelling checker pyenchant, the Python grammar checker language-check, or even use the NLTK corpora to build your own checker.
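For instance, a minimal pyenchant check (assuming the package and an English dictionary such as en_GB are installed on your system):
import enchant

d = enchant.Dict("en_GB")
print(d.check("hello"))   # True
print(d.check("helo"))    # False; d.suggest("helo") offers corrections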
I'm trying to do some text preprocessing so that I can do some string matching activities.
I have a set of strings, and I want to check if the first word in the string starts with the "1/" prefix. If it does, I want to remove this prefix but keep the rest of the word/string.
I've come up with the following, but it's just removing everything after the first word and not necessarily removing the "1/" prefix:
prefixes = (r'1/')
#remove prefixes from string
def prefix_removal(text):
    for word in str(text).split():
        if word.startswith(prefixes):
            return word[len(prefixes):]
        else:
            return word
Any help would be appreciated!
Thank you!
Assuming you only want to remove the prefix from the first word and leave the rest alone, I see no reason to use a for loop. Instead, I would recommend this:
def prefix_removal(text):
    first_word = text.split()[0]
    if first_word.startswith(prefixes):
        return text[len(prefixes):]
    return text
Hopefully this answers your question, good luck!
Starting with Python 3.9 you can use str.removeprefix:
word = word.removeprefix(prefix)
For other versions of Python you can use:
if word.startswith(prefix):
    word = word[len(prefix):]
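Applied to the question's case, a minimal sketch on Python 3.9+ could look like this (the sample strings are made up):
def prefix_removal(text, prefix="1/"):
    # removeprefix only strips the prefix when it is present,
    # so no startswith check is needed
    return text.removeprefix(prefix)

print(prefix_removal("1/apple banana"))  # -> "apple banana"
print(prefix_removal("apple banana"))    # unchanged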
I'm new to Python and machine learning.
I'm working on training a chatbot.
I collected (or wrote) a large number of possible inputs in an Excel file (.xlsx). I will train my dataset using an LSTM and IOBES labeling, doing the same as here:
https://www.depends-on-the-definition.com/guide-sequence-tagging-neural-networks-python/
In the link you can see a snapshot of the dataset; I want to make my dataset look like it.
My questions are:
1- Is there a way to split a sentence into words so I can tag the words? (There is a tool in Excel; I tried it, but it is very tedious.)
2- I tried to convert my file to .csv, but I faced a lot of problems (related to UTF-8, because my dataset is not in English). Is there another format I can use?
I really appreciate your help and advice.
Thank you
You can use the pandas function pd.read_excel('your_file.xlsx') to read the Excel file directly and avoid converting it to CSV.
To split a sentence into words you can use a Natural Language Processing (NLP) Python package like nltk and its English tokenizer models. This will take punctuation, quotes, etc. into account.
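A small sketch (assuming nltk is installed and its tokenizer data has been downloaded; the sample sentence is made up):
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # only needed once; newer nltk versions may also ask for 'punkt_tab'
print(word_tokenize("Hello there, how are you?"))
# -> ['Hello', 'there', ',', 'how', 'are', 'you', '?']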
I'm using openpyxl to load the Excel file directly into memory. For instance:
from openpyxl import load_workbook
trainingFile = './InputForTraining/1.labelled.Data.V2.xlsx'
trainingSheet = 'sheet1'
TrainingFile = load_workbook(trainingFile)
sheet = TrainingFile[trainingSheet]
Then you don't have to convert the Excel file to CSV. Sometimes, if the data structure is very complicated, the conversion is not that simple and you still have to write some code to reproduce the structure.
Splitting a sentence is quite easy if the text is clean: Python strings have a split() method that splits a string into a list of words on whitespace. For instance:
wordsList = yourString.split()
But you will need to be careful about punctuation, which usually follows right after a word. You can use a regex to split the punctuation off a word. For instance:
import re

pat = re.compile(r"([.,;:()/&])")
return_text = pat.sub(" \\1 ", return_text)  # return_text is the sentence being cleaned
wordList = return_text.split()
So each of the characters . , ; : ( ) / & will be split off from the word it is attached to.
Or you can simply remove punctuation from the sentences if you don't need it at all, replacing it with a space. For instance:
return_text = re.sub(r"[^a-zA-Z\s1234567890]+", ' ', text).strip().rstrip()
Then only letters, numbers and spaces will remain.
.strip() removes the extra leading and trailing spaces (the additional .rstrip() is redundant, since .strip() already strips the right-hand side).
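Tying the pieces together, a rough sketch (the single-column layout is an assumption; adjust it to your file) that reads the first column of the sheet loaded above and splits each cell into words:
import re

pat = re.compile(r"([.,;:()/&])")

tokenized_rows = []
for (cell,) in sheet.iter_rows(min_col=1, max_col=1, values_only=True):
    if cell is None:
        continue
    text = pat.sub(r" \1 ", str(cell))  # pad punctuation with spaces
    tokenized_rows.append(text.split())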
I have a string of sentences that I want to separate into individual sentences. The string has a lot of subtleties that are difficult to capture and split. I cannot use the nltk library either. My current regex does the best job among all the ones I have tried, but misses some sentences that start on a new line (implying a new paragraph). I was wondering if there is an easy way to modify the current expression to also split when there is a new line.
import re
file = open('data.txt','r')
text = file.read()
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
The current regexp is as follows:
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
I would essentially need to modify the expression to also split when there is a new line.
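One possible modification (a sketch, untested against the actual data) is to add an alternative so the pattern also splits on runs of newlines, then drop any empty pieces:
import re

with open('data.txt', 'r') as file:
    text = file.read()

# original rule, plus '|\n+' so a newline always starts a new sentence
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s|\n+', text)
sentences = [s.strip() for s in sentences if s.strip()]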
I have exported a PDF file as a .txt and noticed that many words were broken into two parts due to line breaks. So, in this program, I want to join the words that were split in the text while keeping the words that are already correct. In the end, I want to get a final .txt file (or at least a list of tokens) with all words properly spelt. Can anyone help me?
My current text is like this:
I need your help be cause I am not a good progra mmer.
result I need:
I need your help because I am not a good programmer.
from collections import defaultdict
import re
import string
import enchant
document_text=open('test-list.txt','r')
text_string=document_text.read().lower()
lst=[]
errors=[]
dic=enchant.Dict('en_UK')
d=defaultdict(int)
match_pattern = re.findall(r'\b[a-zA-Z0-9_]{1,15}\b', text_string)
for w in match_pattern:
    lst.append(w)

for i in lst:
    if dic.check(i) is True:
        continue
    else:
        a = list(map(''.join, zip(*([iter(lst)] * 2))))
        if dic.check(a) is True:
            continue
        else:
            errors.append(a)

print(lst)
You have a bigger problem - how will your program know that:
be
cause
... should be treated as one word?
If you really wanted to, you could remove the newline characters (replace them with empty strings):
import re
document_text = """
i need your help be
cause i am not a good programmer
""".lower().replace("\n", '')
print([w for w in re.findall(r'\b[a-zA-Z0-9_]{1,15}\b', document_text)])
This will correctly rejoin be and cause into because, but will fail in cases like:
Hello! My name is
Foo.
... because isFoo is not a word.