how to search char from file in python - python

Actually i am coming from C++ and i am new here as well, I am having iteration problem.I am using python 2.7.8 and unable to solve which is what i am wanting. I have a file name called "foo.txt". Through code i am trying to find using how many "a e i o u" are in the file. I have created array: vowel[] = {'a','e','i','o',u} and my code should shd give me the combine count of all vowels. But i am facing
error:
TypeError: list indices must be integers, not str
file foo.txt
Chronobiology might sound a little futuristic – like something from a science fiction novel, perhaps – but it’s actually a field of study that concerns one of the oldest processes life on this planet has ever known: short-term rhythms of time and their effect on flora and fauna.
This can take many forms. Marine life, for example, is influenced by tidal patterns. Animals tend to be active or inactive depending on the position of the sun or moon. Numerous creatures, humans included, are largely diurnal – that is, they like to come out during the hours of sunlight. Nocturnal animals, such as bats and possums, prefer to forage by night. A third group are known as crepuscular: they thrive in the low-light of dawn and dusk and remain inactive at other hours.
When it comes to humans, chronobiologists are interested in what is known as the circadian rhythm. This is the complete cycle our bodies are naturally geared to undergo within the passage of a twenty-four hour day. Aside from sleeping at night and waking during the day, each cycle involves many other factors such as changes in blood pressure and body temperature. Not everyone has an identical circadian rhythm. ‘Night people’, for example, often describe how they find it very hard to operate during the morning, but become alert and focused by evening. This is a benign variation within circadian rhythms known as a chronotype.
my code:
fo = open("foo.txt", "r")
count = 0
for i in fo:
word = i
vowels = ['a','e','i','o','u','y']
word = word.lower().strip(".:;?!")
#print word
for j in word: # wanting that loop shd iterate till the end of file
for k in vowels: # wanting to index string array until **vowels.length()**
if (vowels[k] == word[j]):
count +=1
#print word[0]
print count

Python has a wonderful module called collections with a function Counter. You can use it like this:
import collections
with open('foo.txt') as f:
letters = collections.Counter(f.read())
vowels = ['a','e','i','o','u','y']
## you just want the sum
print(sum(letters[vowel] for vowel in vowels))
You can also do it without collections.Counter():
import itertools
vowels = {'a','e','i','o','u','y'}
with open("foo.txt") as f:
print(sum(1 for char in itertools.chain.from_iterable(f) if char in vowels))
Please note that the time complexity of a set {} lookup is O(1), whereas the time complexity for a list [] lookup is O(n) according to this page on wiki.python.org.
I tested both methods with the module timeit and as expected the first method using collections.Counter() is slightly faster:
0.13573385099880397
0.16710168996360153

Do in range(len()) instead, because if you use for k in vowels , k will be 'a' then 'b' then 'c'... etc. However, the syntax for getting objects via indexes is vowels[index_number], not vowels[content]. So, you have to iterate over the length of the array, and use vowels[0] to get 'a', then vowels[1]' to get 'b' etc.
fo = open("foo.txt", "r")
count = 0
for i in fo:
word = i
vowels = ['a','e','i','o','u','y']
word = word.lower().strip(".:;?!")
#print word
for j in range(len(word)): # wanting that loop shd iterate till the end of file
if (word[j] in vowels):
count +=1
#print word[0]
print count

Python prides itself on its abstraction and standard library data structures. Check out collections.Counter. It takes an iterable and returns a dict of value -> frequency.
with open('foo.txt') as f:
string = f.read()
counter = collections.Counter(string) # a string is an iterable of characters
vowel_counts = {vowel: counter[vowel] for vowel in "aeiou"}

Related

NLP how to speed up spelling correction on 147k rows filled with short messages

Trying to speed up spelling check on large dataset with 147k rows. The following function has been running for an entire afternoon and is still running. Is there a way to speed up the spelling check? The messages has already been case treated, punctuations removed, lemmatized and they are all in string format.
import autocorrect
from autocorrect import Speller
spell = Speller()
def spell_check(x):
correct_word = []
mispelled_word = x.split()
for word in mispelled_word:
correct_word.append(spell(word))
return ' '.join(correct_word)
df['clean'] = df['old'].apply(spell_check)
The autocorrect library is not very efficient, and is not made for tasks such as you present it. What it does is generate all the possible candidates with one or two typos, and checks which of them are valid words — and does it in plain Python.
Take a six-letter word like "source":
from autocorrect.typos import Word
print(sum(1 for c in Word('source').typos()))
# => 349
print(sum(1 for c in Word('source').double_typos()))
# => 131305
autocorrect generates as many as 131654 candidates to test, just for this word. What if it is longer? Let's try "transcompilation":
print(sum(1 for c in Word('').typos()))
# => 889
print(sum(1 for c in Word('').double_typos()))
# => 813325
That's 814214 candidates, just for one word! And note that numpy can't speed it up, as the values are Python strings, and you're invoking a Python function on every row. The only way to speed this up is to change the method you are using for spell-checking: for example, using aspell-python-py3 library instead (a wrapper for aspell, AFAIK the best free spellchecker for Unix).
Additionally to what #Amadan said and is definitely true (autocorrect does the correction in a very ineffective way):
You treat each word in the giant dataset as if all words in it are looked up for the first time, because you call spell() on each word. In reality (at least after a while) almost all words were previously looked up, so storing these results and loading them would be much more efficient.
Here is one way to do it:
import autocorrect
from autocorrect import Speller
spell = Speller()
# get all unique words in the data as a set (first split each row into words, then put them all in a flat set)
unique_words = {word for words in df["old"].apply(str.split) for word in words}
# get the corrected version of each unique word and put this mapping in a dictionary
corrected_words = {word: spell(word) for word in unique_words}
# write the cleaned row by looking up the corrected version of each unique word
df['clean'] = [" ".join([corrected_words[word] for word in row.split()]) for row in df["old"]]

Matching two-word variants with each other if they don't match alphabetically

I'm doing a NLP project with my university, collecting data on words in Icelandic that exist both spelled with an i and with a y (they sound the same in Icelandic fyi) where the variants are both actual words but do not mean the same thing. Examples of this would include leyti (an approximation in time) and leiti (a grassy hill), or kirkja (church) and kyrkja (choke). I have a dataset of 2 million words. I have already collected two wordlists, one of which includes words spelled with a y and one includes the same words spelled with a i (although they don't seem to match up completely, as the y-list is a bit longer, but that's a separate issue). My problem is that I want to end up with pairs of words like leyti - leiti, kyrkja - kirkja, etc. But, as y is much later in the alphabet than i, it's no good just sorting the lists and pairing them up that way. I also tried zipping the lists while checking the first few letters to see if I can find a match but that leaves out all words that have y or i as the first letter. Do you have a suggestion on how I might implement this?
Try something like this:
s = "trydfydfgfay"
l = list(s)
candidateWords = []
for idx, c in enumerate(l):
if c=='y':
newList = l.copy()
newList[idx] = "i"
candidateWord = "".join(newList)
candidateWords.append(candidateWord)
print(candidateWords)
#['tridfydfgfay', 'trydfidfgfay', 'trydfydfgfai']
#look up these words to see if they are real words
I do not think this is a programming challenge, but looks more like an NLP challenge itself. The spelling variations are often a hurdle that one would face during preprocessing.
I would suggest that you use an Edit-distance based approaches to identify word pairs that allow for some variations. Specifically for the kind of problem that you have described above, I would recommend "Jaro Winkler Distance". This method allows to give higher similarity score between word pairs that shows variations between specific character pairs, say y and i.
All these approaches are implemented in Jellyfish library.
You may also have a look at the fuzzywuzzy package as well. Hope this helps.
So this accomplishes my task, kind of an easy not-that-pretty solution I suppose but it works:
wordlist = open("data.txt", "r", encoding='utf-8')
y_words = open("y_wordlist.txt", "w+", encoding='utf-8')
all_words = []
y_words = []
for word in wordlist:
word = word.lower()
all_words.append(word)
for word in all_words:
if "y" in word:
y_words.append(word)
word_dict = {}
for word in y_words:
newwith1y = word.replace("y", "i",1)
newwith2y = word.replace("y", "i",2)
newyback = word[::-1].replace("y", "i",1)
newyback = newyback[::-1]
word_dict[word] = newwith1y
word_dict[word] = newwith2y
word_dict[word] = newyback
for key, value in word_dict.items():
if value in all_words:
y_wordlist.write(key)
y_wordlist.write(" - ")
y_wordlist.write(value)
y_wordlist.write("\n")

auto-correct the words from the list in python

I want to auto-correct the words which are in my list.
Say I have a list
kw = ['tiger','lion','elephant','black cat','dog']
I want to check if these words appeared in my sentence. If they are wrongly spelled I want to correct them. I don't intend to touch other words except from the given list.
Now I have list of str
s = ["I saw a tyger","There are 2 lyons","I mispelled Kat","bulldogs"]
Expected output:
['tiger','lion',None,'dog']
My Efforts:
import difflib
op = [difflib.get_close_matches(i,kw,cutoff=0.5) for i in s]
print(op)
My Output:
[[], [], [], ['dog']]
The problem with above code is I want to compare entire sentence and my kw list can have more than 1 word(upto 4-5 words).
If I lower the cutoff value it starts returning the words which is should not.
So even if I plan to create bigrams, trigrams from given sentence it would consume a lot of time.
So is there way to implement this?
I have explored few more libraries like autocorrect, hunspell etc. but no success.
You could implement something based of levenshtein distance.
It's interesting to note elasticsearch's implementation: https://www.elastic.co/guide/en/elasticsearch/guide/master/fuzziness.html
Clearly, bieber is a long way from beaver—they are too far apart to be
considered a simple misspelling. Damerau observed that 80% of human
misspellings have an edit distance of 1. In other words, 80% of
misspellings could be corrected with a single edit to the original
string.
Elasticsearch supports a maximum edit distance, specified with the
fuzziness parameter, of 2.
Of course, the impact that a single edit has on a string depends on
the length of the string. Two edits to the word hat can produce mad,
so allowing two edits on a string of length 3 is overkill. The
fuzziness parameter can be set to AUTO, which results in the following
maximum edit distances:
0 for strings of one or two characters
1 for strings of three, four, or five characters
2 for strings of more than five characters
I like to use pyxDamerauLevenshtein myself.
pip install pyxDamerauLevenshtein
So you could do a simple implementation like:
keywords = ['tiger','lion','elephant','black cat','dog']
from pyxdameraulevenshtein import damerau_levenshtein_distance
def correct_sentence(sentence):
new_sentence = []
for word in sentence.split():
budget = 2
n = len(word)
if n < 3:
budget = 0
elif 3 <= n < 6:
budget = 1
if budget:
for keyword in keywords:
if damerau_levenshtein_distance(word, keyword) <= budget:
new_sentence.append(keyword)
break
else:
new_sentence.append(word)
else:
new_sentence.append(word)
return " ".join(new_sentence)
Just make sure you use a better tokenizer or this will get messy, but you get the point. Also note that this is unoptimized, and will be really slow with a lot of keywords. You should implement some kind of bucketing to not match all words with all keywords.
Here is one way using difflib.SequenceMatcher. The SequenceMatcher class allows you to measure sentence similarity with its ratio method, you only need to provide a suitable threshold in order to keep words with a ratio that falls above the given threshold:
def find_similar_word(s, kw, thr=0.5):
from difflib import SequenceMatcher
out = []
for i in s:
f = False
for j in i.split():
for k in kw:
if SequenceMatcher(a=j, b=k).ratio() > thr:
out.append(k)
f = True
if f:
break
if f:
break
else:
out.append(None)
return out
Output
find_similar_word(s, kw)
['tiger', 'lion', None, 'dog']
Although this is slightly different from your expected output (it is a list of list instead of a list of string) I thing it is a step in the right direction. The reason I chose this method, is so that you can have multiple corrections per sentence. That is why I added another example sentence.
import difflib
import itertools
kw = ['tiger','lion','elephant','black cat','dog']
s = ["I saw a tyger","There are 2 lyons","I mispelled Kat","bulldogs", "A tyger is different from a doog"]
op = [[difflib.get_close_matches(j,kw,cutoff=0.5) for j in i.split()] for i in s]
op = [list(itertools.chain(*o)) for o in op]
print(op)
The output is generate is:
[['tiger'], ['lion'], [], ['dog'], ['tiger', 'dog']]
The trick is to split all the sentences along the whitespaces.

NER naive algorithm

I never really dealt with NLP but had an idea about NER which should NOT have worked and somehow DOES exceptionally well in one case. I do not understand why it works, why doesn't it work or weather it can be extended.
The idea was to extract names of the main characters in a story through:
Building a dictionary for each word
Filling for each word a list with the words that appear right next to it in the text
Finding for each word a word with the max correlation of lists (meaning that the words are used similarly in the text)
Given that one name of a character in the story, the words that are used like it, should be as well (Bogus, that is what should not work but since I never dealt with NLP until this morning I started the day naive)
I ran the overly simple code (attached below) on Alice in Wonderland, which for "Alice" returns:
21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']
Though it filters for upper case words (and receives "Alice" as the word to cluster around), originally there are ~500 upper case words, and it's still pretty spot on as far as main characters goes.
It does not work that well with other characters and in other stories, though gives interesting results.
Any idea if this idea is usable, extendable or why does it work at all in this story for "Alice" ?
Thanks!
#English Name recognition
import re
import sys
import random
from string import upper
def mimic_dict(filename):
dict = {}
f = open(filename)
text = f.read()
f.close()
prev = ""
words = text.split()
for word in words:
m = re.search("\w+",word)
if m == None:
continue
word = m.group()
if not prev in dict:
dict[prev] = [word]
else :
dict[prev] = dict[prev] + [word]
prev = word
return dict
def main():
if len(sys.argv) != 2:
print 'usage: ./main.py file-to-read'
sys.exit(1)
dict = mimic_dict(sys.argv[1])
upper = []
for e in dict.keys():
if len(e) > 1 and e[0].isupper():
upper.append(e)
print len(upper),upper
exclude = ["ME","Yes","English","Which","When","WOULD","ONE","THAT","That","Here","and","And","it","It","me"]
exclude = [ x for x in exclude if dict.has_key(x)]
for s in exclude :
del dict[s]
scores = {}
for key1 in dict.keys():
max = 0
for key2 in dict.keys():
if key1 == key2 : continue
a = dict[key1]
k = dict[key2]
diff = []
for ia in a:
if ia in k and ia not in diff:
diff.append( ia)
if len(diff) > max:
max = len(diff)
scores[key1]=(key2,max)
dictscores = {}
names = []
for e in scores.keys():
if scores[e][0]=="Alice" and e[0].isupper():
names.append(e)
print len(names), names
if __name__ == '__main__':
main()
From the looks of your program and previous experience with NER, I'd say this "works" because you're not doing a proper evaluation. You've found "Hare" where you should have found "March Hare".
The difficulty in NER (at least for English) is not finding the names; it's detecting their full extent (the "March Hare" example); detecting them even at the start of a sentence, where all words are capitalized; classifying them as person/organisation/location/etc.
Also, Alice in Wonderland, being a children's novel, is a rather easy text to process. Newswire phrases like "Microsoft CEO Steve Ballmer" pose a much harder problem; here, you'd want to detect
[ORG Microsoft] CEO [PER Steve Ballmer]
What you are doing is building a distributional thesaurus-- finding words which are distributionally similar to a query (e.g. Alice), i.e. words that appear in similar contexts. This does not automatically make them synonyms, but means they are in a way similar to the query. The fact that your query is a named entity does not on its own guarantee that the similar words that you retrieve will be named entities. However, since Alice, the Hare and the Queen tend to appear is similar context because they share some characteristics (e.g. they all speak, walk, cry, etc-- the details of Alice in wonderland escape me) they are more likely to be retrieved. It turns out whether a word is capitalised or not is a very useful piece of information when working out if something is a named entity. If you do not filter out the non-capitalised words, you will see many other neighbours that are not named entities.
Have a look at the following papers to get an idea of what people do with distributional semantics:
Lin 1998
Grefenstette 1994
Schuetze 1998
To put your idea in the terminology used in these papers, Step 2 is building a context vector for the word with from a window of size 1. Step 3 resembles several well-known similarity measures in distributional semantics (most notably the so-called Jaccard coefficient).
As larsmans pointed out, this seems to work so well because you are not doing a proper evaluation. If you ran this against a hand-annotated corpus you will find it is very bad at identifying the boundaries of names entities and it does not even attempt to guess if they are people or places or organisations... Nevertheless, it is a great first attempt at NLP, keep it up!

How can I make this code work faster ? (searching large corpus of text for large words)

in Python, I have created a text generator that acts on certain parameters but my code is -at most of the time- slow and performs below my expectations. I expect one sentence per every 3-4 minutes but it fails to comply if the database it works on is large -I use the project Gutenberg's 18-book corpus and I will create my custom corpus and add further books so performance is vital.- The algorithm and the implementation is below:
ALGORITHM
1- Enter the trigger sentence -only once, at the beginning of the program-
2- Get the longest word in the trigger sentence
3- Find all the sentences of the corpus that contain the word at step2
4- Randomly select one of those sentences
5- Get the sentence (named sentA to resolve the ambiguity in description) that follows the sentence picked at step4 -so long as sentA is longer than 40 characters-
6- Go to step 2, now the trigger sentence is the sentA of step5
IMPLEMENTATION
from nltk.corpus import gutenberg
from random import choice
triggerSentence = raw_input("Please enter the trigger sentence:")#get input sentence from user
previousLongestWord = ""
listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = [] #all sentences in the related corpus
sentenceAppender = ""
longestWord = ""
#this function is not mine, code courtesy of Dave Kirby, found on the internet about sorting list without duplication speed tricks
def arraySorter(seq):
seen = set()
return [x for x in seq if x not in seen and not seen.add(x)]
def findLongestWord(longestWord):
if(listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper()):
longestWord = sortedSetOfValidWords[-2]
if(listOfWords.count(longestWord) == 1):
longestWord = sortedSetOfValidWords[-3]
doappend = corpusSentences.append
def appending():
for mysentence in listOfSents: #sentences are organized into array so they can actually be read word by word.
sentenceAppender = " ".join(mysentence)
doappend(sentenceAppender)
appending()
sentencesContainingLongestWord = []
def getSentence(longestWord, sentencesContainingLongestWord):
for sentence in corpusSentences:
if sentence.count(longestWord):#if the sentence contains the longest target string, push it into the sentencesContainingLongestWord list
sentencesContainingLongestWord.append(sentence)
def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):
while(len(corpusSentences[sentenceIndex + 1]) < 40):#in case the next sentence is shorter than 40 characters, pick another trigger sentence
sentencesContainingLongestWord.remove(triggerSentence)
triggerSentence = choice(sentencesContainingLongestWord)
sentenceIndex = corpusSentences.index(triggerSentence)
while len(triggerSentence) > 0: #run the loop as long as you get a trigger sentence
sentencesContainingLongestWord = []#all the sentences that include the longest word are to be inserted into this set
setOfValidWords = [] #set for words in a sentence that exists in a corpus
split_str = triggerSentence.split()#split the sentence into words
setOfValidWords = [word for word in split_str if listOfWords.count(word)]
sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key = len))
longestWord = sortedSetOfValidWords[-1]
findLongestWord(longestWord)
previousLongestWord = longestWord
getSentence(longestWord, sentencesContainingLongestWord)
triggerSentence = choice(sentencesContainingLongestWord)
sentenceIndex = corpusSentences.index(triggerSentence)
lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)
triggerSentence = corpusSentences[sentenceIndex + 1]#get the sentence that is next to the previous trigger sentence
print triggerSentence
print "\n"
corpusSentences.remove(triggerSentence)#in order to view the sentence index numbers, you can remove this one so index numbers are concurrent with actual gutenberg numbers
print "End of session, please rerun the program"
#initiated once the while loop exits, so that the program ends without errors
The computer I run the code on is a bit old, dual-core CPU was bought in Feb. 2006 and 2x512 RAM was bought in Sept. 2004 so I'm not sure if my implementation is bad or the hardware is a reason for the slow runtime. Any ideas on how I can rescue this from its hazardous form ? Thanks in advance.
I think my first advice must be: Think carefully about what your routines do, and make sure the name describes that. Currently you have things like:
arraySorter which neither deals with arrays nor sorts (it's an implementation of nub)
findLongestWord which counts things or selects words by criteria not present in the algorithm description, yet ends up doing nothing at all because longestWord is a local variable (argument, as it were)
getSentence which appends an arbitrary number of sentences onto a list
appending which sounds like it might be a state checker, but operates only through side effects
considerable confusion between local and global variables, for instance the global variable sentenceAppender is never used, nor is it an actor (for instance, a function) like the name suggests
For the task itself, what you really need are indices. It might be overkill to index every word - technically you should only need index entries for words that occur as the longest word of a sentence. Dictionaries are your primary tool here, and the second tool is lists. Once you have those indices, looking up a random sentence containing any given word takes only a dictionary lookup, a random.choice, and a list lookup. Perhaps a few list lookups, given the sentence length restriction.
This example should prove a good object lesson that modern hardware or optimizers like Psyco do not solve algorithmic problems.
Maybe Psyco speeds up the execution?

Categories

Resources