Is there a better way to tokenize some strings?

I was trying to write some code to tokenize strings in Python for an NLP task and came up with this:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s = []
a = 0
for line in str:
    s.append([])
    s[a].append(line.split())
    a += 1
print(s)
The output was:
[[['I', 'am', 'Batman.']], [['I', 'loved', 'the', 'tea.']], [['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]]
As you can see, the list now has an extra dimension. For example, if I want the word 'Batman.', I have to type s[0][0][2] instead of s[0][2]. So I changed the code to:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s = []
a = 0
m = []
for line in str:
    s.append([])
    m = line.split()
    for word in m:
        s[a].append(word)
    a += 1
print(s)
This gave me the correct output:
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
But I have a feeling this could be done with a single loop. The dataset I will be importing is pretty large, and a complexity of n would be a lot better than n^2. Is there a better way to do this, or a way to do it with one loop?

Your original code is so nearly there.
>>> str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> s=[]
>>> for line in str:
...     s.append(line.split())
...
>>> print(s)
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
The line.split() gives you a list, so append that in your loop.
Or go straight for a comprehension:
[line.split() for line in str]
When you say s.append([]), you have an empty list at index 'a', like this:
L = []
If you append the result of the split to that, as in L.append([1]), you end up with a list inside this list: [[1]]
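The difference described above can be sketched with a shortened version of the question's data. The names nested, flat, and direct are illustrative only and are not from the original posts:

```python
sentences = ['I am Batman.', 'I loved the tea.']

# append() adds the whole token list as ONE element, creating the extra level.
nested = []
for line in sentences:
    nested.append([])
    nested[-1].append(line.split())

# extend() adds each token individually, so no extra level appears.
flat = []
for line in sentences:
    flat.append([])
    flat[-1].extend(line.split())

# Appending the split result directly gives the same structure in one step.
direct = [line.split() for line in sentences]

print(nested)  # three levels of nesting
print(flat)    # two levels of nesting
```

Note that even the two-loop version in the question visits each word only once, so its running time grows with the total number of words rather than quadratically.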

You should call split() on each string in the loop.
Example with a list comprehension:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
[s.split() for s in str]
[['I', 'am', 'Batman.'],
['I', 'loved', 'the', 'tea.'],
['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]

See this:
>>> list1 = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> # split() with no arguments splits on runs of whitespace and returns a list
>>> [i.split() for i in list1]
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
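Because split() only breaks on whitespace, punctuation stays attached to words ('Batman.', 'again!'). If that matters for the NLP task, one possible sketch uses the standard re module instead. The pattern and the tokenize helper below are assumptions for illustration, not something the answers above propose:

```python
import re

def tokenize(line):
    # \w+ matches a run of word characters; [^\w\s] matches a single
    # punctuation character, so punctuation becomes its own token.
    return re.findall(r"\w+|[^\w\s]", line)

sentences = ['I am Batman.', 'I loved the tea.',
             'I will never go to that mall again!']
tokens = [tokenize(line) for line in sentences]
print(tokens[0])
```

This simple pattern also splits contractions such as "don't" into three tokens, so a dedicated tokenizer may be a better fit for real text.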

Related

Convert list input to string out

I have a list as shown below. How do I convert it to string output?
Input
A = [['I', 'love', 'apple','.'],
['Today', 'is', 'Sunday', '.'],
['How', 'are', 'you'],
['What', 'are', 'you','doing']]
Output
I love apple.
Today is Sunday.
How are you
What are you doing

You can simply use a for loop to iterate through each list nested in the list A.
A = [
['I', 'love', 'apple.'],
['Today', 'is', 'Sunday.'],
['How', 'are', 'you'],
['What', 'are', 'you', 'doing']
]
for row in A:
    print(' '.join(row))
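The answer above attaches the period to the last word in its input. If the punctuation has to stay as a separate token, as in the question's input, a plain ' '.join leaves a space before it. A minimal sketch of one way around that follows; the detokenize helper and its punctuation set are assumptions for illustration:

```python
import re

A = [['I', 'love', 'apple', '.'],
     ['Today', 'is', 'Sunday', '.'],
     ['How', 'are', 'you'],
     ['What', 'are', 'you', 'doing']]

def detokenize(tokens):
    text = ' '.join(tokens)
    # Remove the space that join() puts in front of common punctuation marks.
    return re.sub(r'\s+([.,!?;:])', r'\1', text)

lines = [detokenize(row) for row in A]
for line in lines:
    print(line)
```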

Please use:
for row in A:
    print(' '.join(row))

finding gappy sublists within a larger list

Let's say I have a list like this:
[['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
Now I have a list like this:
['she', 'is', 'student']
I want to query the larger list with this one, and return all the lists that contain the words within the query list in the same order. There might be gaps, but the order should be the same. How can I do that? I tried using the in operator but I don't get the desired output.

If all you care about is that the words appear in order somewhere in the list, you can use a collections.deque and popleft to step through the query, and if the deque is emptied, you have found a valid match:
from collections import deque
def find_gappy(arr, m):
    dq = deque(m)
    for word in arr:
        if word == dq[0]:
            dq.popleft()
            if not dq:
                return True
    return False
By comparing each word in arr with the first element of dq, we know that when we find a match, it has been found in the correct order, and then we popleft, so we now are comparing with the next element in the deque.
To filter your initial list, you can use a simple list comprehension that filters based on the result of find_gappy:
matches = ['she', 'is', 'student']
x = [i for i in x if find_gappy(i, matches)]
# [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student'], ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]

You can compare two lists, with a function like this one. The way it works is it loops through your shorter list, and every time it finds the next word in the long list, cuts off the first part of the longer list at that point. If it can't find the word it returns false.
def is_sub_sequence(long_list, short_list):
    for word in short_list:
        if word in long_list:
            i = long_list.index(word)
            long_list = long_list[i+1:]
        else:
            return False
    return True
Now you have a function to tell you if the list is the desired type, you can filter out all the lists you need from the 'list of lists' using a list comprehension like the following:
a = [['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
b = ['she', 'is', 'student']
filtered = [x for x in a if is_sub_sequence(x,b)]
The list filtered will include only the lists of the desired type.
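Both answers above test whether the query is a subsequence of each sentence. A compact alternative, shown here as a sketch and not as either author's method, relies on the fact that the in operator consumes an iterator up to the first match. The function name is_subsequence is hypothetical:

```python
def is_subsequence(query, sentence):
    it = iter(sentence)
    # Each "word in it" resumes the search where the previous one stopped,
    # so the words must appear in order, with any gaps allowed in between.
    return all(word in it for word in query)

a = [['she', 'is', 'a', 'student'],
     ['she', 'is', 'a', 'lawer'],
     ['she', 'is', 'a', 'great', 'student'],
     ['i', 'am', 'a', 'teacher'],
     ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
b = ['she', 'is', 'student']

filtered = [x for x in a if is_subsequence(b, x)]
print(filtered)
```

Unlike the slicing version, this does not copy the list on every match, and it returns True for an empty query.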

Finding duplicates in a list of a list, and adding their values

I'm trying to find the top 50 words that occur within three texts of Shakespeare and the ratio of each word's occurrence in macbeth.txt, allswell.txt, and othello.txt. Here is my code so far:
def byFreq(pair):
    return pair[1]
def shakespeare():
    counts = {}
    A = []
    for words in ['macbeth.txt','allswell.txt','othello.txt']:
        text = open(words, 'r').read()
        test = text.lower()
        for ch in '!"$%&()*+,-./:;<=>?#[\\]^_`{|}~':
            text = text.replace(ch, ' ')
        words = text.split()
        for w in words:
            counts[w] = counts.get(w, 0) + 1
        items = list(counts.items())
        items.sort()
        items.sort(key=byFreq, reverse = True)
        for i in range(50):
            word, count = items[i]
            count = count / float(len(counts))
            A += [[word, count]]
    print A
And its output:
>>> shakespeare()
[['the', 0.12929982922664066], ['and', 0.09148572822639668], ['I', 0.08075140278116613], ['of', 0.07684801171017322], ['to', 0.07562820200048792], ['a', 0.05220785557453037], ['you', 0.04415711149060746], ['in', 0.041717492071236886], ['And', 0.04147353012929983], ['my', 0.04147353012929983], ['is', 0.03927787265186631], ['not', 0.03781410100024396], ['that', 0.0358624054647475], ['it', 0.03366674798731398], ['Macb', 0.03342278604537692], ['with', 0.03269090021956575], ['his', 0.03147109050988046], ['be', 0.03025128080019517], ['The', 0.028787509148572824], ['haue', 0.028543547206635766], ['me', 0.027079775555013418], ['your', 0.02683581361307636], ['our', 0.025128080019516955], ['him', 0.021956574774335203], ['Enter', 0.019516955354964626], ['That', 0.019516955354964626], ['for', 0.01927299341302757], ['this', 0.01927299341302757], ['he', 0.018541107587216395], ['To', 0.01780922176140522], ['so', 0.017077335935594046], ['all', 0.0156135642839717], ['What', 0.015369602342034643], ['are', 0.015369602342034643], ['thou', 0.015369602342034643], ['will', 0.015125640400097584], ['Macbeth', 0.014881678458160527], ['thee', 0.014881678458160527], ['But', 0.014637716516223469], ['but', 0.014637716516223469], ['Macd', 0.014149792632349353], ['they', 0.014149792632349353], ['their', 0.013905830690412296], ['we', 0.013905830690412296], ['as', 0.01341790680653818], ['vs', 0.01341790680653818], ['King', 0.013173944864601122], ['on', 0.013173944864601122], ['yet', 0.012198097096852892], ['Rosse', 0.011954135154915833], ['the', 0.15813168261114238], ['I', 0.14279684862127182], ['and', 0.1231007315700619], ['to', 0.10875070343275182], ['of', 0.10481148002250985], ['a', 0.08581879572312887], ['you', 0.08581879572312887], ['my', 0.06992121553179516], ['in', 0.061902082160945414], ['is', 0.05852560495216657], ['not', 0.05486775464265616], ['it', 0.05472706809229038], ['that', 0.05472706809229038], ['his', 0.04727068092290377], ['your', 0.04389420371412493], ['me', 
0.043753517163759144], ['be', 0.04305008441193022], ['And', 0.04037703995498031], ['with', 0.038266741699493526], ['him', 0.037703995498030385], ['for', 0.03601575689364097], ['he', 0.03404614518851998], ['The', 0.03137310073157006], ['this', 0.030810354530106922], ['her', 0.029262802476083285], ['will', 0.0291221159257175], ['so', 0.027011817670230726], ['have', 0.02687113111986494], ['our', 0.02687113111986494], ['but', 0.024760832864378166], ['That', 0.02293190770962296], ['PAROLLES', 0.022791221159257174], ['To', 0.021384355655599326], ['all', 0.021384355655599326], ['shall', 0.021102982554867755], ['are', 0.02096229600450197], ['as', 0.02096229600450197], ['thou', 0.02039954980303883], ['Macb', 0.019274057400112548], ['thee', 0.019274057400112548], ['no', 0.01871131119864941], ['But', 0.01842993809791784], ['Enter', 0.01814856499718627], ['BERTRAM', 0.01758581879572313], ['HELENA', 0.01730444569499156], ['we', 0.01730444569499156], ['do', 0.017163759144625774], ['thy', 0.017163759144625774], ['was', 0.01674169949352842], ['haue', 0.016460326392796848], ['I', 0.19463784682531435], ['the', 0.17894627455055595], ['and', 0.1472513769094877], ['to', 0.12989712147978802], ['of', 0.12002494024732412], ['you', 0.1079704873739998], ['a', 0.10339810869791126], ['my', 0.0909279850358516], ['in', 0.07627558973293151], ['not', 0.07159929335965914], ['is', 0.0697287748103502], ['it', 0.0676504208666736], ['that', 0.06733866777512211], ['me', 0.06099968824690845], ['your', 0.0543489556271433], ['And', 0.053205860958121166], ['be', 0.05310194326093734], ['his', 0.05154317780317988], ['with', 0.04769822300737816], ['him', 0.04665904603553985], ['her', 0.04364543281720877], ['for', 0.04322976202847345], ['he', 0.042190585056635144], ['this', 0.04187883196508366], ['will', 0.035332017042502335], ['Iago', 0.03522809934531851], ['so', 0.03356541619037722], ['The', 0.03325366309882573], ['haue', 0.031902733035435935], ['do', 0.03138314454951678], ['but', 0.030240049880494647], 
['That', 0.02857736672555336], ['thou', 0.027642107450898887], ['as', 0.027434272056531227], ['To', 0.026810765873428243], ['our', 0.02504416502130313], ['are', 0.024628494232567806], ['But', 0.024420658838200146], ['all', 0.024316741141016316], ['What', 0.024212823443832486], ['shall', 0.024004988049464823], ['on', 0.02265405798607503], ['thee', 0.022134469500155875], ['Enter', 0.021822716408604385], ['thy', 0.021199210225501402], ['no', 0.020783539436766082], ['she', 0.02026395095084693], ['am', 0.02005611555647927], ['by', 0.019848280162111608], ['have', 0.019848280162111608]]
Instead of outputting the top 50 words of all three texts combined, it outputs the top 50 words of each text, 150 words in total. I'm struggling to delete the duplicates while adding their ratios together. For example, in macbeth.txt the word 'the' has a ratio of 0.12929982922664066, allswell.txt has a ratio of 0.15813168261114238, and othello.txt has a ratio of 0.17894627455055595. I want to combine the ratios of all three of them. I'm pretty sure I have to use a for loop, but I'm struggling to loop through a list within a list. I am more of a Java guy, so any help would be appreciated!

You can use a list comprehension and the Counter class:
from collections import Counter
c = Counter([word for file in ['macbeth.txt','allswell.txt','othello.txt']
             for word in open(file).read().split()])
This gives you a dict-like object that maps words to their counts. You can sort them by count like this:
sorted([(count, word) for word, count in c.items()])
If you want the relative quantities, first calculate the total number of words:
numWords = sum(c.values())
and then rebuild the mapping with a dict comprehension (float() avoids integer division on Python 2):
c = {word: count / float(numWords) for word, count in c.items()}

You're summarizing the count inside your loop over files. Move the summary code outside your for loop.
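Moving the summary out of the file loop could look roughly like the sketch below, written for Python 3. It takes already-loaded strings so that it runs without the Shakespeare files. The function name top_word_ratios, the use of string.punctuation, lowercasing, and dividing by the total word count (the question divides by the number of distinct words) are all assumptions:

```python
from collections import Counter
import string

def top_word_ratios(texts, n=50):
    counts = Counter()
    # Map every punctuation character to a space before splitting.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    for text in texts:
        # Only counting happens inside the loop over texts.
        counts.update(text.lower().translate(table).split())
    # The summary runs once, after all texts have been counted.
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(n)]

sample = ['The cat. The dog!', 'the bird']
result = top_word_ratios(sample, n=2)
print(result)
```

To use it with the files from the question, the caller would read each file into a string first and pass the list of strings in.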

How to join a list while preserving previous structure?

I am having trouble joining a pre-split string after modification while preserving the previous structure.
Say I have a string like this:
string = """
This is a nice piece of string isn't it?
I assume it is so. I have to keep typing
to use up the space. La-di-da-di-da.
This   is    a    spaced   out   sentence
Bonjour.
"""
I have to run some tests on that string, such as finding specific words and characters within those words, and then replace them accordingly. To accomplish that I had to break it up using
string.split()
The problem is that split also gets rid of the \n and the extra spaces, immediately ruining the integrity of the previous structure.
Are there extra options in split that will allow me to accomplish this, or should I seek an alternative route?
Thank you.

The split method takes an optional argument to specify the delimiter. If you only want to split words using space (' ') characters, you can pass that as an argument:
>>> string = """
...
... This is a nice piece of string isn't it?
... I assume it is so. I have to keep typing
... to use up the space. La-di-da-di-da.
...
... Bonjour.
... """
>>>
>>> string.split()
['This', 'is', 'a', 'nice', 'piece', 'of', 'string', "isn't", 'it?', 'I', 'assume', 'it', 'is', 'so.', 'I', 'have', 'to', 'keep', 'typing', 'to', 'use', 'up', 'the', 'space.', 'La-di-da-di-da.', 'Bonjour.']
>>> string.split(' ')
['\n\nThis', 'is', 'a', 'nice', 'piece', 'of', 'string', "isn't", 'it?\nI', 'assume', 'it', 'is', 'so.', 'I', 'have', 'to', 'keep', 'typing\nto', 'use', 'up', 'the', 'space.', 'La-di-da-di-da.\n\nBonjour.\n']
>>>

The split method splits your string on all whitespace by default. If you want to split the lines separately, you can first split the string on newlines and then split each line on whitespace:
>>> [line.split() for line in string.strip().split('\n')]
[['This', 'is', 'a', 'nice', 'piece', 'of', 'string', "isn't", 'it?'], ['I', 'assume', 'it', 'is', 'so.', 'I', 'have', 'to', 'keep', 'typing'], ['to', 'use', 'up', 'the', 'space.', 'La-di-da-di-da.'], [], ['Bonjour.']]

Just split with a delimiter:
>>> a = string.split(' ')
>>> a
['\n\nThis', 'is', 'a', 'nice', 'piece', 'of', 'string', "isn't", 'it?\nI', 'assume', 'it', 'is', 'so.', 'I', 'have', 'to', 'keep', 'typing\nto', 'use', 'up', 'the', 'space.', 'La-di-da-di-da.\n\nThis', '', '', 'is', '', '', '', 'a', '', '', '', 'spaced', '', '', 'out', '', '', 'sentence\n\nBonjour.\n']
And to get it back:
>>> print(' '.join(a))
This is a nice piece of string isn't it?
I assume it is so. I have to keep typing
to use up the space. La-di-da-di-da.
This   is    a    spaced   out   sentence
Bonjour.

Just use string.split(' ') (note the space argument to the split method).
This keeps your precious newlines inside the strings that go into the resulting list.
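A small round trip shows why this works and where it falls short. The sample text and the replacement rule below are made up for illustration:

```python
text = "to  be\nor   not"

# Splitting on a single space yields '' for each extra space and keeps '\n'.
parts = text.split(' ')
print(parts)

# Joining with the same delimiter restores the original exactly.
assert ' '.join(parts) == text

# Modify selected tokens, then join again; the spacing is preserved.
changed = [p.upper() if p == 'not' else p for p in parts]
result = ' '.join(changed)
print(repr(result))
```

The caveat is that words separated only by a newline stay glued together ('be\nor' is a single token here), so an exact match on 'be' or 'or' would miss them.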

You can save the spaces in another list; then, after modifying the list of words, you join them back together.
In [1]: from nltk.tokenize import RegexpTokenizer
In [2]: spacestokenizer = RegexpTokenizer(r'\s+', gaps=False)
In [3]: wordtokenizer = RegexpTokenizer(r'\s+', gaps=True)
In [4]: string = """
...:
...: This is a nice piece of string isn't it?
...: I assume it is so. I have to keep typing
...: to use up the space. La-di-da-di-da.
...:
...: This   is    a    spaced   out   sentence
...:
...: Bonjour.
...: """
In [5]: spaces = spacestokenizer.tokenize(string)
In [6]: words = wordtokenizer.tokenize(string)
In [7]: print(''.join([s+w for s, w in zip(spaces, words)]))
This is a nice piece of string isn't it?
I assume it is so. I have to keep typing
to use up the space. La-di-da-di-da.
This   is    a    spaced   out   sentence
Bonjour.
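The same idea of keeping the whitespace runs alongside the words can be sketched with only the standard library. This is a swapped-in technique using re.split with a capturing group, not the NLTK approach above, and the transform function is a hypothetical placeholder for whatever word-level edit is needed:

```python
import re

text = "\nThis   is  spaced\n\nBonjour.\n"

# A capturing group makes re.split keep the separators in the result, so
# even indexes hold words (possibly '') and odd indexes hold whitespace runs.
parts = re.split(r'(\s+)', text)
assert ''.join(parts) == text  # nothing is lost

def transform(word):
    # Placeholder edit: upper-case one specific word.
    return word.upper() if word == 'spaced' else word

rebuilt = ''.join(transform(p) if i % 2 == 0 else p
                  for i, p in enumerate(parts))
print(repr(rebuilt))
```

Because every whitespace run is stored verbatim, newlines and repeated spaces survive the round trip, and each word is a clean token with no '\n' attached.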

How do I select the first elements of each list in a list of lists?

I am trying to isolate the first words in a series of sentences using Python/NLTK.
I created an unimportant series of sentences (the_text), and while I am able to divide it into tokenized sentences, I cannot separate just the first word of each sentence into a list (first_words).
[['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
import nltk
the_text="Here is some text. There is a a person on the lawn. I am confused. "
the_text= (the_text + "There is more. Here is some more. I don't know anything. ")
the_text= (the_text + "I should add more. Look, here is more text. How great is that?")
sents_tok=nltk.sent_tokenize(the_text)
sents_words=[nltk.word_tokenize(sent) for sent in sents_tok]
number_sents=len(sents_words)
print (number_sents)
print(sents_words)
for i in sents_words:
    first_words=[]
    first_words.append(sents_words (i,0))
print(first_words)
Thanks for the help!

There are three problems with your code, and you have to fix all three to make it work:
for i in sents_words:
    first_words=[]
    first_words.append(sents_words (i,0))
First, you're erasing first_words each time through the loop: move the first_words=[] outside the loop.
Second, you're mixing up function calling syntax (parentheses) with indexing syntax (brackets): you want sents_words[i][0].
Third, for i in sents_words: iterates over the elements of sents_words, not the indices. So you just want i[0]. (Or, alternatively, for i in range(len(sents_words)), but there's no reason to do that.)
So, putting it together:
first_words=[]
for i in sents_words:
    first_words.append(i[0])
If you know anything about comprehensions, you may recognize that this pattern (start with an empty list, iterate over something, appending some expression to the list) is exactly what a list comprehension does:
first_words = [i[0] for i in sents_words]
If you don't, then either now is a good time to learn about comprehensions, or don't worry about this part. :)
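The loop and the comprehension can be checked against each other without NLTK by starting from already-tokenized data. The empty sentence in the sample and the "if sent" guard that skips it are additions for illustration, since indexing [0] on an empty list would raise IndexError:

```python
sents_words = [['Here', 'is', 'some', 'text', '.'],
               ['I', 'am', 'confused', '.'],
               [],  # an empty sentence, added to show the guard
               ['How', 'great', 'is', 'that', '?']]

# Explicit loop: initialise once, outside the loop, then append.
first_words_loop = []
for sent in sents_words:
    if sent:
        first_words_loop.append(sent[0])

# Equivalent list comprehension.
first_words = [sent[0] for sent in sents_words if sent]

print(first_words)
```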

>>> sents_words = [['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
You can use a loop to append to a list you've initialized previously:
>>> first_words = []
>>> for i in sents_words:
...     first_words.append(i[0])
...
>>> print(*first_words)
Here There I There Here I I Look How
or a comprehension (replace those square brackets with parentheses to create a generator instead):
>>> first_words = [i[0] for i in sents_words]
>>> print(*first_words)
Here There I There Here I I Look How
or if you don't need to save it for later use, you can directly print the items:
>>> print(*(i[0] for i in sents_words))
Here There I There Here I I Look How

Here's an example of how to access items in lists and list of lists:
>>> fruits = ['apple','orange', 'banana']
>>> fruits[0]
'apple'
>>> fruits[1]
'orange'
>>> cars = ['audi', 'ford', 'toyota']
>>> cars[0]
'audi'
>>> cars[1]
'ford'
>>> things = [fruits, cars]
>>> things[0]
['apple', 'orange', 'banana']
>>> things[1]
['audi', 'ford', 'toyota']
>>> things[0][0]
'apple'
>>> things[0][1]
'orange'
For your problem:
>>> from nltk import sent_tokenize, word_tokenize
>>>
>>> the_text="Here is some text. There is a a person on the lawn. I am confused. There is more. Here is some more. I don't know anything. I should add more. Look, here is more text. How great is that?"
>>>
>>> tokenized_text = [word_tokenize(s) for s in sent_tokenize(the_text)]
>>>
>>> first_words = []
>>> # Iterate through the sentences.
... for sent in tokenized_text:
...     print(sent)
...
['Here', 'is', 'some', 'text', '.']
['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.']
['I', 'am', 'confused', '.']
['There', 'is', 'more', '.']
['Here', 'is', 'some', 'more', '.']
['I', 'do', "n't", 'know', 'anything', '.']
['I', 'should', 'add', 'more', '.']
['Look', ',', 'here', 'is', 'more', 'text', '.']
['How', 'great', 'is', 'that', '?']
>>> # First word in each sentence.
... for sent in tokenized_text:
...     word0 = sent[0]
...     first_words.append(word0)
...     print(word0)
...
Here
There
I
There
Here
I
I
Look
How
>>> print(first_words)
['Here', 'There', 'I', 'There', 'Here', 'I', 'I', 'Look', 'How']
As a one-liner with a list comprehension:
# From the_text, you extract the first word directly
first_words = [word_tokenize(s)[0] for s in sent_tokenize(the_text)]
# From tokenized_text
tokenized_text = [word_tokenize(s) for s in sent_tokenize(the_text)]
first_words = [s[0] for s in tokenized_text]

Another alternative, although it's pretty much similar to abarnert's suggestion:
first_words = []
for i in range(number_sents):
    first_words.append(sents_words[i][0])
