I want to find the stems of Persian verbs. To do that, I first made a file containing some common and exceptional stems. I want my code to search the file first: if the stem is there, it returns it; if not, it goes through the rest of the code and returns the stem by deleting suffixes and prefixes. The problems are: 1) it pays no attention to the file and ignores it; it just goes through the rest of the code and outputs a wrong stem, because the exceptions are in the file. 2) Because I used "for", the suffixes and prefixes of some verbs affect other verbs and remove those verbs' suffixes and prefixes, which sometimes outputs a wrong stem. How should I change the code so that each "for" loop works independently and doesn't affect the others? (I have to write just one function and call only it.)
I reduced some suffixes and prefixes.
def stemmer(verb, file):
    with open(file, encoding="utf-8") as f:
        f = f.read().split()
    for i in f:
        if i in verb:
            return i
        else:
            for i in suffix1:
                if verb.endswith(i):
                    verb = verb[:-len(i)]
                    return verb
You don't have to put all of your code, sara. We are only concerned with the snippet that causes the problem.
My guess is that the problematic part is the check if i in verb, which might fail most of the time because of trailing characters left on the tokens after splitting. Normally, when you split out the tokens, you also need to trim trailing whitespace with the strip() method:
>>> 'who\n'.strip() in 'who'
True
Conditionals like:
>>> "word\n" in "word"
False
>>> 'who ' in 'who'
False
will always fail and that's why the program doesn't check the exceptions at all.
I found the answer. The problem is caused by "else:"; there is no need for it.
def stemmer(verb, file):
    with open(file, encoding="utf-8") as f:
        f = f.read().split()
    for i in f:
        if i in verb:
            return i
    for i in suffix1:  # remote past
        if verb.endswith(i):
            verb = verb[:-len(i)]
            break
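For completeness, a minimal sketch of the whole function with both fixes applied. The prefix1 list here is hypothetical, standing in for whatever prefix list you keep alongside suffix1:
def stemmer(verb, file):
    # Check the exceptions file first and return immediately on a match.
    with open(file, encoding="utf-8") as f:
        for stem in f.read().split():
            if stem in verb:
                return stem
    # No exception matched: strip at most one suffix and one prefix.
    # The break keeps each loop from running on after the first match.
    for s in suffix1:  # your existing suffix list
        if verb.endswith(s):
            verb = verb[:-len(s)]
            break
    for p in prefix1:  # hypothetical prefix list
        if verb.startswith(p):
            verb = verb[len(p):]
            break
    return verb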
I am trying to make an AutoHotKey script that removes the letter 'e' from most words you type. To do this, I am going to put a list of common words in a text file and have a python script add the proper syntax to the AHK file for each word. For testing purposes, my word list file 'words.txt' contains this:
apple
dog
tree
I want the output in the file 'wordsOut.txt' (which I will turn into the AHK script) to end up like this after I run the python script:
::apple::appl
::tree::tr
As you can see, it will exclude words without the letter 'e' and removes 'e' from everything else. But when I run my script which looks like this...
f = open('C:\\Users\\jpyth\\Desktop\\words.txt', 'r')
while True:
    word = f.readline()
    if not word: break
    if 'e' in word:
        sp_word = word.strip('e')
        outString = '::{}::{}'.format(word, sp_word)
        p = open('C:\\Users\\jpyth\\Desktop\\wordsOut.txt', 'a+')
        p.write(outString)
        p.close()
f.close()
The output text file ends up like this:
::apple
::apple
::tree::tr
The weirdest part is that, while it never gets it right, the text in the output file can change depending on the number of lines in the input file.
I'm making this an official answer and not a comment because it's worth pointing out how strip works, and to be wary of hidden characters like newline characters.
f.readline() returns each line, including the '\n'. Because strip() only removes the character from the beginning and end of the string, not from the middle, it's not actually removing anything from most words with 'e'. In fact, even a word that ends in 'e' doesn't get that 'e' removed, since there is a newline character to the right of it. That also explains why ::apple is printed over two lines.
'hello\n'.strip('o') outputs 'hello\n'
whereas 'hello'.strip('o') outputs 'hell'
As pointed out in the comments, just do sp_word = word.strip().replace('\n', '').replace('e', '')
lhay's answer is right about the behavior of strip() but I'm not convinced that list comprehension really qualifies as "simple".
I would instead go with replace():
>>> 'elemental'.replace('e', '')
'lmntal'
(Also side note: for word in f: does the same thing as the first three lines of your code.)
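Putting both points together, a minimal corrected version of the script might look like this (a sketch, keeping the paths from the question):
f = open('C:\\Users\\jpyth\\Desktop\\words.txt', 'r')
p = open('C:\\Users\\jpyth\\Desktop\\wordsOut.txt', 'a+')
for word in f:
    word = word.strip()  # drop the trailing newline first
    if 'e' in word:
        # replace() removes every 'e', not just leading/trailing ones
        p.write('::{}::{}\n'.format(word, word.replace('e', '')))
p.close()
f.close()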
I have used Python to extract a route table from a router and am trying to
strip out superfluous text, and
replace the destination of each route with a text string to match a different customer grouping.
At the moment I have:
infile = "routes.txt"
outfile = "output.txt"
delete_text = ["ROUTER1# sh run | i route", "ip route"]
client_list = ["CUST_A", "CUST_B"]
subnet_list = ["1.2.3.4", "5.6.7.8"]
fin = open(infile)
fout = open(outfile, "w+")
for line in fin:
    for word in delete_text:
        line = line.replace(word, "")
    for word in subnet_list:
        line = line.replace("1.2.3.4", "CUST_A")
    for word in subnet_list:
        line = line.replace("5.6.7.8", "CUST_B")
    fout.write(line)
fin.close()
fout.close()
f = open('output.txt', 'r')
file_contents = f.read()
print(file_contents)
f.close()
This works to an extent, but when it searches and replaces e.g. 5.6.7.8, it also picks up that string within other IP addresses, e.g. 5.6.7.88, and replaces them too, which I don't want to happen.
What I am after is an exact match only to be found and replaced.
You could use re.sub() with explicit word boundaries (\b):
>>> re.sub(r'\b5.6.7.8\b', 'CUST_B', 'test 5.6.7.8 test 5.6.7.88 test')
'test CUST_B test 5.6.7.88 test'
As you found out, your approach is bad because it results in false positives (i.e., undesirable matches). You should parse the lines into tokens and then match the individual tokens. That might be as simple as first doing tokens = line.split() to split on whitespace. That, however, may not work if the line contains quoted strings. Consider the result of this statement: "ab 'cd ef' gh".split(). So you might need a more sophisticated parser.
You could use the re module to perform substitutions using the \b meta sequence to ensure the matches begin and end on a "word" boundary. But that has its own, unique, failure modes. For example, consider that the . (period) character matches any character. So doing re.sub(r'\b5.6.7.8\b', ...) as @NPE suggested will actually match not just the literal word 5.6.7.8 but also 5x6.7y8. That may not be a concern given the inputs you expect, but it is something most people don't consider and is therefore another source of bugs. Regular expressions are seldom the correct tool for a problem like this one.
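As a sketch of that token-based approach, using the names from the question (and assuming the fields are whitespace-separated, so rejoining with single spaces is acceptable):
delete_text = ["ROUTER1# sh run | i route", "ip route"]
mapping = {"1.2.3.4": "CUST_A", "5.6.7.8": "CUST_B"}

with open("routes.txt") as fin, open("output.txt", "w") as fout:
    for line in fin:
        for word in delete_text:
            line = line.replace(word, "")
        # Replace a token only on an exact match, so 5.6.7.88 is untouched.
        tokens = [mapping.get(token, token) for token in line.split()]
        fout.write(" ".join(tokens) + "\n")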
thanks guys, I've been testing with this and the re.sub function just seems to print out the below string in a loop : CUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTB .
I have amended the code snippet above to:
for word in subnet_list:
    line = re.sub(r'\b5.6.7.8\b', 'CUST_B', '5.6.7.88')
Ideally, I would like the string element to be replaced in all occurrences in the list, while preserving the list structure.
New to programming, using Python 3.0.
I have to write a program that takes an input file name for a .txt file, reads the file but only reads certain words, ignoring the floats and integers and any words that don't match anything in another list.
Basically, I have wordsList and messages.txt.
This program has to read through messages.txt (example of text:
[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with the love of my life ... ARREIC)
It then has to ignore all the numbers and search for whether or not any of the words in the message match the words in the wordsList and then match those words to a value(int) in the hvList.
What I have so far (the wordsList and hvList are in another part of the code that I don't think is necessary for understanding what I'm trying to do; let me know if you do need it):
def tweetsvalues():
    tweetwList = []
    tinputfile = open(tweetsinputfile, "r")
    for line in tinputfile:
        entries = line.split()
The last line, entries = line.split() is the one that needs to be changed I'm guessing.
Yes, split is your best friend here. You can also look up the documentation for the is methods. Although the full code is beyond the normal scope of StackOverflow, your central work will look something like
words = sentence.split()
good_words = [word for word in words if word.isalpha()]
This can be done in a "Pythonic" fashion using filter, removing punctuation, etc. However, from the way you write in your posting, I suspect you can take it from here.
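To tie that back to the question, here is a minimal sketch, assuming wordsList and hvList are parallel lists (the i-th word maps to the i-th value) and passing them in as parameters rather than using them as globals:
def tweetsvalues(tweetsinputfile, wordsList, hvList):
    # word -> happiness value lookup (assumes the two lists are parallel)
    values = dict(zip(wordsList, hvList))
    tweetwList = []
    tinputfile = open(tweetsinputfile, "r")
    for line in tinputfile:
        for entry in line.split():
            # Keep only alphabetic tokens that appear in wordsList;
            # numbers like 41.298669 and timestamps are skipped.
            if entry.isalpha() and entry in values:
                tweetwList.append((entry, values[entry]))
    tinputfile.close()
    return tweetwList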
This is some code that I wrote earlier today as a really basic spellchecker. I would answer your question more specifically but I do not have the time right now. This should accomplish what you are looking for. The hard-coded .txt file that I am opening contains a lot of English words that are spelled correctly. Feel free to fold my ideas into your work as needed, but make sure you understand all of the code you are using; otherwise I would only be hindering your learning by giving it to you. In your case, you would likely want to output all of the words regardless of their spelling; in my code, I am only outputting incorrectly spelled words. Feel free to ask any questions.
#---------------------------------------------------------
# The "spell_check" function determines whether the input
# from the inputFile is a correctly spelled word, and if not
# it will return the word, which is later written to a file
# containing misspelled words
#---------------------------------------------------------
def spell_check(word, english):
    if word in english:
        return None
    else:
        return word

#---------------------------------------------------------
# The main function will include all of the code that will
# perform actions that are not contained within our other
# functions, and will generally call on those other functions
# to perform required tasks
#---------------------------------------------------------
def main():
    # Grabbing user input
    inputFile = input('Enter the name of the file to input from: ')
    outputFile = input('Enter the name of the file to output to: ')
    english = {}  # Will contain all available correctly spelled words.
    wrong = []    # Will contain all incorrectly spelled words.
    num = 0       # Used for line counter.

    # Opening, closing, and adding words to the spell check dictionary
    with open('wordlist.txt', 'r') as c:
        for line in c:
            key = line.strip()
            english[key] = ''

    # Opening, closing, checking words, and adding wrong ones to the wrong list
    with open(inputFile, 'r') as i:
        for line in i:
            line = line.strip()
            fun = spell_check(line, english)
            if fun is not None:
                wrong.append(fun)

    # Opening, closing, and writing to the output file
    with open(outputFile, 'w') as o:
        for i in wrong:
            o.write('%d %s\n' % (num, i))
            num += 1

main()
I'm trying to figure out a simple way to sort words from a file; however, the newlines ("\n") are always returned when I print the words.
How could I improve this code to make it work properly? I'm using Python 2.7.
Thanks in advance.
def sorting(self):
    filename = ("food.txt")
    file_handle = open(filename, "r")
    for word in file_handle:
        word = word.split()
        print sorted(file_handle)
    file_handle.close()
You actually have two problems here.
The big one is that print sorted(file_handle) reads and sorts the whole rest of the file and prints that out. You're doing that once per line. So, what happens is that you read the first line, split it, ignore the result, sort and print all the lines after the first, and then you're done.
What you want to do is accumulate all the words as you go along, then sort and print that. Like this:
def sorting(self):
    filename = ("food.txt")
    file_handle = open(filename, "r")
    words = []
    for line in file_handle:
        words += line.split()
    file_handle.close()
    print sorted(words)
Or, if you want to print the sorted list one line at a time, instead of as a giant list, change the last line to:
print '\n'.join(sorted(words))
For the second, more minor problem, the one you asked about, you just need to strip off the newlines. So, change the words += line.split() line to this:
words += line.strip().split()
However, if you had solved the first problem, you wouldn't even have noticed this one. If you have a line like "one two three\n", and you call split() on it, you will get back ["one", "two", "three"], with no \n to worry about. So, you don't actually even need to solve this one.
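You can check that in the interpreter:
>>> "one two three\n".split()
['one', 'two', 'three']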
While we're at it, there are a few other improvements you could make here:
Use a with statement to close the file instead of doing it manually.
Make this function return the list of words (so you can do various different things with it, instead of just printing it and returning nothing).
Take the filename as a parameter instead of hardcoding it (for similar flexibility).
Maybe turn the loop into a comprehension—but that would require an extra "flattening" step, so I'm not sure it's worth it.
If you don't want duplicate words, use a set rather than a list.
Depending on the use case, you often want to use rstrip() or rstrip('\n') to remove just the trailing newline, while leaving, say, paragraph indentation tabs or spaces. If you're looking for individual words, however, you probably don't want that.
You might want to filter out and/or split on non-alphabetical characters, so you don't get "that." as a word. Doing even this basic kind of natural-language processing is non-trivial, so I won't show an example here. (For example, you probably want "John's" to be a word, you may or may not want "jack-o-lantern" to be one word instead of three; you almost certainly don't want "two-three" to be one word…)
The self parameter is only needed in methods of classes. This doesn't appear to be in any class. (If it is, it's not doing anything with self, so there's no visible reason for it to be in a class. You might have some reason which would be visible in your larger program, of course.)
So, anyway:
def sorting(filename):
    words = []
    with open(filename) as file_handle:
        for line in file_handle:
            words += line.split()
    return sorted(words)

print '\n'.join(sorting('food.txt'))
Basically all you have to do is strip that newline (and all other whitespace because you probably don't want it):
def sorting(self):
    filename = ("food.txt")
    file_handle = open(filename, "r")
    words = []
    for line in file_handle:
        words += line.strip().split()
    print sorted(words)
    file_handle.close()
Otherwise you can just remove the last character with line[:-1].split()
Use .strip(). It will remove whitespace by default. You can also pass other characters (like "\n") for it to strip. This will leave just the words.
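For example:
>>> ' pear\n'.strip()
'pear'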
Try this:
def sorting(self):
    words = []
    with open("food.txt") as f:
        for line in f:
            words.extend(line.split())
    return sorted(words, key=lambda word: word.lower())
To avoid printing the newlines, just put a comma at the end:
print sorted(file_handle),
In your code, I don't see you sorting the whole file, just the line. Use a list to save all the words, and after you read the file, sort them all.
from glob import glob

pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)

def countwords(fp):
    with open(fp) as fh:
        return len(fh.read().split())

print "There are", sum(map(countwords, filelist)), "words in the files. From directory", pattern

import os

uniquewords = set([])
for root, dirs, files in os.walk("D:\\report\\shakeall"):
    for name in files:
        [uniquewords.add(x) for x in open(os.path.join(root, name)).read().split()]
print "There are", len(uniquewords), "unique words in the files. From directory", pattern
So far my code is this. This counts the number of unique words and total words from D:\report\shakeall\*.txt
The problem is that, for example, this code recognizes code, code. and code! as different words, so it can't give an exact number of unique words.
I'd like to remove the special characters from the 42 text files using a Windows text editor,
or make an exception rule that solves this problem.
If the latter, how should I write my code?
Should it modify the text files directly, or make an exception so that special characters aren't counted?
import re
string = open('a.txt').read()
new_str = re.sub('[^a-zA-Z0-9\n\.]', ' ', string)
open('b.txt', 'w').write(new_str)
It will change every non-alphanumeric character (apart from newlines and periods, which the pattern keeps) to whitespace.
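For example:
>>> import re
>>> re.sub('[^a-zA-Z0-9\n\.]', ' ', 'code! code.')
'code  code.'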
I'm pretty new and I doubt this is very elegant at all, but one option would be to take your string(s) after reading them in and run them through string.translate() to strip out the punctuation. Here is the Python documentation for it for version 2.7 (which I think you're using).
As far as the actual code, it might be something like this (but maybe someone better than me can confirm/improve on it):
fileString.translate(None, string.punctuation)
where "fileString" is the string that your open(fp) read in. "None" is provided in place of a translation table (which would normally be used to actually change some characters into others), and the second parameter, string.punctuation (a Python string constant containing all the punctuation symbols) is a set of characters that will be deleted from your string.
In the event that the above doesn't work, you could modify it as follows:
import string
inChars = string.punctuation
outChars = ' ' * len(inChars)  # maketrans needs two equal-length strings, so this maps punctuation to spaces
translateTable = string.maketrans(inChars, outChars)
fileString.translate(translateTable)
There are a couple of other answers to similar questions I found via a quick search. I'll link them here, too, in case you can get more from them.
Removing Punctuation From Python List Items
Remove all special characters, punctuation and spaces from string
Strip Specific Punctuation in Python 2.x
Finally, if what I've said is completely wrong, please comment and I'll remove it so that others don't try what I've said and become frustrated.
import re
Then replace
[uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]
By
[uniquewords.add(re.sub('[^a-zA-Z0-9]*$', '', x)) for x in open(os.path.join(root, name)).read().split()]
This will strip all trailing non-alphanumeric characters from each word before adding it to the set.
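For example:
>>> import re
>>> re.sub('[^a-zA-Z0-9]*$', '', 'code!')
'code'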
When working in Linux, some system files under /proc contain characters with ASCII value 0.
full_file_path = 'test.txt'
result = []
with open(full_file_path, encoding='utf-8') as f:
    line = f.readline()
    for c in line:
        if ord(c) == 0:
            result.append(' ')
        else:
            result.append(c)
print(''.join(result))
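Since this is a plain one-for-one character substitution, the same thing can be written more compactly with str.replace; a minimal sketch, assuming the file otherwise decodes as UTF-8:
full_file_path = 'test.txt'
with open(full_file_path, encoding='utf-8') as f:
    line = f.readline()
# NUL characters (ASCII 0) become spaces
print(line.replace('\x00', ' '))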