I have a number of sentences which I would like to split on specific words (e.g. 'and'). However, a sentence sometimes contains two or more of the words I'd like to split on.
Example sentences:
['i', 'am', 'just', 'hoping', 'for', 'strength', 'and', 'guidance', 'because', 'i', 'have', 'no', 'idea', 'why']
['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home', 'and', 'tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was', 'because', 'he', 'is', 'been', 'told', 'not', 'to', 'talk']
So I have written some code to split a sentence:
split_on_word = []
no_splitting = []
indexPosList = [i for i in range(len(kth)) if kth[i] == 'and']  # check if word is in sentence

for e in example:
    kth = e.split()  # split strings into list so it looks like example sentence
    for n in indexPosList:
        if n > 4:  # only split when the word's position is 4 or more
            h = e.split("and")
            for i in h:
                split_on_word.append(i)  # append split sentences
        else:
            no_splitting.append(kth)  # append sentences that don't need to be split
However, you can see that when using this code more than once (e.g. replacing the word to split on with another), I create duplicates or partial duplicates of the sentences that I append to a new list.
Is there any way to check for multiple conditions, so that if a sentence contains both words (or other combinations of them) I can split the sentence in one go?
The output from the examples should then look like this:
['i', 'am', 'just', 'hoping', 'for', 'strength']
['guidance', 'because']
['i', 'have', 'no', 'idea', 'why']
['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home']
['tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was']
['he', 'is', 'been', 'told', 'not', 'to', 'talk']
You can use itertools.groupby with a function that checks whether a word is a split-word:
In [10]: import itertools as it

In [11]: split_words = {'and', 'because'}

In [12]: [list(g) for k, g in it.groupby(example, key=lambda x: x not in split_words) if k]
Out[12]:
[['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home'],
 ['tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was'],
 ['he', 'is', 'been', 'told', 'not', 'to', 'talk']]
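For what it's worth, here is a minimal sketch applying the same groupby idea to both tokenized sentences from the question in one pass (the name sentences is introduced here just to hold them):

import itertools as it

split_words = {'and', 'because'}

def split_sentence(tokens):
    # Keep runs of consecutive tokens that are not split-words;
    # the split-words themselves are dropped.
    return [list(g) for k, g in it.groupby(tokens, key=lambda w: w not in split_words) if k]

sentences = [
    ['i', 'am', 'just', 'hoping', 'for', 'strength', 'and', 'guidance',
     'because', 'i', 'have', 'no', 'idea', 'why'],
    ['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home',
     'and', 'tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work',
     'was', 'because', 'he', 'is', 'been', 'told', 'not', 'to', 'talk'],
]

for tokens in sentences:
    for chunk in split_sentence(tokens):
        print(chunk)

Note that this drops the split-words themselves, which differs slightly from the expected output shown above (where 'because' survives in one chunk); adjust the key function if you need to keep them.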
Below is the list.
List=[['hello', 'how'], ['are', 'you', 'hope'], ['you,are,fine', 'thank', 'you']]
I want the output list to be:
List=[['hello', 'how'], ['are', 'you', 'hope'], ['you', 'are', 'fine', 'thank', 'you']]
Try the following nested comprehension, which rebuilds the list in a single pass while splitting the tokens:
>>> [[token for el in sub for token in el.split(',')] for sub in List]
[['hello', 'how'], ['are', 'you', 'hope'], ['you', 'are', 'fine', 'thank', 'you']]
Using a simple iteration and str.split
Ex:
lst = [['hello', 'how'], ['are', 'you', 'hope'], ['you,are,fine', 'thank', 'you']]
result = []
for i in lst:
    temp = []
    for j in i:
        temp += j.split(",")
    result.append(temp)
print(result)
Output:
[['hello', 'how'],
 ['are', 'you', 'hope'],
 ['you', 'are', 'fine', 'thank', 'you']]
You can use this (note the order of the for clauses: they read left to right, outer loop first):
[[z for x in y for z in x.split(',')] for y in List]
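For clarity, here is a quick sketch showing that this comprehension is equivalent to explicit nested loops (the names flat and flat_loops are introduced here just for the comparison):

List = [['hello', 'how'], ['are', 'you', 'hope'], ['you,are,fine', 'thank', 'you']]

# Comprehension form: outer loop first, then the inner split loop.
flat = [[z for x in y for z in x.split(',')] for y in List]

# The same thing written as explicit loops.
flat_loops = []
for y in List:
    inner = []
    for x in y:
        for z in x.split(','):
            inner.append(z)
    flat_loops.append(inner)

assert flat == flat_loops
print(flat)  # [['hello', 'how'], ['are', 'you', 'hope'], ['you', 'are', 'fine', 'thank', 'you']]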
I was trying to write a code for tokenization of strings in python for some NLP and came up with this code:
str = ['I am Batman.', 'I loved the tea.', 'I will never go to that mall again!']
s = []
a = 0
for line in str:
    s.append([])
    s[a].append(line.split())
    a += 1
print(s)
the output came out to be:
[[['I', 'am', 'Batman.']], [['I', 'loved', 'the', 'tea.']], [['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]]
As you can see, the list now has an extra dimension; for example, if I want the word 'Batman', I have to type s[0][0][2] instead of s[0][2]. So I changed the code to:
str = ['I am Batman.', 'I loved the tea.', 'I will never go to that mall again!']
s = []
a = 0
m = []
for line in str:
    s.append([])
    m = line.split()
    for word in m:
        s[a].append(word)
    a += 1
print(s)
which got me the correct output:
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
But I have a feeling that this could work with a single loop, because the dataset that I will be importing is pretty large and a complexity of n would be a lot better than n^2. So, is there a better way to do this / a way to do this with one loop?
Your original code is so nearly there.
>>> str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> s = []
>>> for line in str:
...     s.append(line.split())
...
>>> print(s)
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
The line.split() gives you a list, so append that in your loop.
Or go straight for a comprehension:
[line.split() for line in str]
When you say s.append([]), you have an empty list at index a, like this:
L = []
If you then append the result of the split to that, like L.append([1]), you end up with a list inside that list: [[1]]
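A minimal sketch of the two shapes, using a single hypothetical input line:

line = 'I am Batman.'

# Appending the split result directly gives a list of lists.
s1 = []
s1.append(line.split())
print(s1)        # [['I', 'am', 'Batman.']]
print(s1[0][2])  # 'Batman.'

# Appending an empty list first and then appending into it adds a dimension.
s2 = []
s2.append([])
s2[0].append(line.split())
print(s2)           # [[['I', 'am', 'Batman.']]]
print(s2[0][0][2])  # 'Batman.'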
You should call split() on every string in the loop.
Example with list comprehension:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
[s.split() for s in str]
[['I', 'am', 'Batman.'],
 ['I', 'loved', 'the', 'tea.'],
 ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
See this:
>>> list1 = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> [i.split() for i in list1]  # split() splits on whitespace by default and returns a list
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
I have this nested list of strings which is in its final stage of cleaning. I want to replace the non-letters in the nested list with spaces, or create a new list without the non-letters. Here is my list:
list = [['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?'], ['the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.'], ['the', 'sky', 'is', 'pinkish-blue', '.'], ['you', 'should', "n't", 'eat', 'cardboard', '.']]
And this is the pattern that I want to use in order to clean it all:
pattern = re.compile(r'\W+')
newlist = list(filter(pattern.search, list))
print(newlist)
The code doesn't work, and this is the error that I get:
Traceback (most recent call last):
File "/Users/art/Desktop/TxtProcessing/regexp", line 28, in <module>
newlist = [list(filter(pattern.search, list))]
TypeError: expected string or bytes-like object
I understand that list is not a string but a list of lists of strings, so how do I fix it?
Any help will be very much appreciated!
You need to step deeper into your list:
import re

list_ = [['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?'], ['the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.'], ['the', 'sky', 'is', 'pinkish-blue', '.'], ['you', 'should', "n't", 'eat', 'cardboard', '.']]
pattern = re.compile(r'\W+')
newlist_ = [item
            for sublist_ in list_
            for item in sublist_
            if pattern.search(item)]
print(newlist_)
# ['mr.', ',', '?', ',', '.', 'pinkish-blue', '.', "n't", '.']
Additionally, you should not name your variables list, since that shadows the built-in type.
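A quick sketch of why the shadowing matters: once the name list is rebound, the built-in type is no longer reachable through it.

list = ['a', 'b', 'c']   # rebinds the built-in name
try:
    list('abc')          # fails: the name now refers to our list
except TypeError as e:
    print(e)             # 'list' object is not callable

del list                 # remove the shadowing binding
print(list('abc'))       # ['a', 'b', 'c'] - the built-in works again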
You are attempting to pass a list to pattern.search(); however, only strings (or bytes) are allowed, since pattern matching operates on text. Try looping over the list instead:
import re
l = [['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?'], ['the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.'], ['the', 'sky', 'is', 'pinkish-blue', '.'], ['you', 'should', "n't", 'eat', 'cardboard', '.']]
new_l = [[b for b in i if re.findall(r'^\w+$', b)] for i in l]
Also, note that your original variable name, list, shadows the built-in list type, so any later call to list() would fail.
First of all, shadowing a built-in name like list may lead to all sorts of troubles - choose your variable names carefully.
You don't actually need a regular expression here - there is a built-in isalpha() string method:
Return true if all characters in the string are alphabetic and there is at least one character, false otherwise.
In [1]: l = [['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?'], ['the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.'], ['the', 'sky', 'is', 'pinkish-blue', '.'], ['you', 'should', "n't", 'eat', 'cardboard', '.']]
In [2]: [[item for item in sublist if item.isalpha()] for sublist in l]
Out[2]:
[['hello', 'smith', 'how', 'are', 'you', 'doing', 'today'],
 ['the', 'weather', 'is', 'great', 'and', 'python', 'is', 'awesome'],
 ['the', 'sky', 'is'],
 ['you', 'should', 'eat', 'cardboard']]
Here is how you can apply the same filtering logic but using map and filter (you would need the help of functools.partial() as well):
In [4]: from functools import partial
In [5]: for item in map(partial(filter, str.isalpha), l):
   ...:     print(list(item))
['hello', 'smith', 'how', 'are', 'you', 'doing', 'today']
['the', 'weather', 'is', 'great', 'and', 'python', 'is', 'awesome']
['the', 'sky', 'is']
['you', 'should', 'eat', 'cardboard']
I am trying to isolate the first words in a series of sentences using Python/NLTK.
I created an unimportant series of sentences (the_text), and while I am able to divide it into tokenized sentences, I cannot successfully separate just the first words of each sentence into a list (first_words).
[['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
the_text="Here is some text. There is a a person on the lawn. I am confused. "
the_text= (the_text + "There is more. Here is some more. I don't know anything. ")
the_text= (the_text + "I should add more. Look, here is more text. How great is that?")
sents_tok=nltk.sent_tokenize(the_text)
sents_words=[nltk.word_tokenize(sent) for sent in sents_tok]
number_sents=len(sents_words)
print (number_sents)
print(sents_words)
for i in sents_words:
first_words=[]
first_words.append(sents_words (i,0))
print(first_words)
Thanks for the help!
There are three problems with your code, and you have to fix all three to make it work:
for i in sents_words:
    first_words=[]
    first_words.append(sents_words (i,0))
First, you're erasing first_words each time through the loop: move the first_words=[] outside the loop.
Second, you're mixing up function calling syntax (parentheses) with indexing syntax (brackets): you want sents_words[i][0].
Third, for i in sents_words: iterates over the elements of sents_words, not the indices. So you just want i[0]. (Or, alternatively, for i in range(len(sents_words)), but there's no reason to do that.)
So, putting it together:
first_words=[]
for i in sents_words:
    first_words.append(i[0])
If you know anything about comprehensions, you may recognize that this pattern (start with an empty list, iterate over something, appending some expression to the list) is exactly what a list comprehension does:
first_words = [i[0] for i in sents_words]
If you don't, then either now is a good time to learn about comprehensions, or don't worry about this part. :)
>>> sents_words = [['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
You can use a loop to append to a list you've initialized previously:
>>> first_words = []
>>> for i in sents_words:
...     first_words.append(i[0])
...
>>> print(*first_words)
Here There I There Here I I Look How
or a comprehension (replace those square brackets with parentheses to create a generator instead):
>>> first_words = [i[0] for i in sents_words]
>>> print(*first_words)
Here There I There Here I I Look How
or if you don't need to save it for later use, you can directly print the items:
>>> print(*(i[0] for i in sents_words))
Here There I There Here I I Look How
Here's an example of how to access items in lists and list of lists:
>>> fruits = ['apple','orange', 'banana']
>>> fruits[0]
'apple'
>>> fruits[1]
'orange'
>>> cars = ['audi', 'ford', 'toyota']
>>> cars[0]
'audi'
>>> cars[1]
'ford'
>>> things = [fruits, cars]
>>> things[0]
['apple', 'orange', 'banana']
>>> things[1]
['audi', 'ford', 'toyota']
>>> things[0][0]
'apple'
>>> things[0][1]
'orange'
For your problem:
>>> from nltk import sent_tokenize, word_tokenize
>>>
>>> the_text="Here is some text. There is a a person on the lawn. I am confused. There is more. Here is some more. I don't know anything. I should add more. Look, here is more text. How great is that?"
>>>
>>> tokenized_text = [word_tokenize(s) for s in sent_tokenize(the_text)]
>>>
>>> first_words = []
>>> # Iterate through the sentences.
... for sent in tokenized_text:
...     print(sent)
...
['Here', 'is', 'some', 'text', '.']
['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.']
['I', 'am', 'confused', '.']
['There', 'is', 'more', '.']
['Here', 'is', 'some', 'more', '.']
['I', 'do', "n't", 'know', 'anything', '.']
['I', 'should', 'add', 'more', '.']
['Look', ',', 'here', 'is', 'more', 'text', '.']
['How', 'great', 'is', 'that', '?']
>>> # First words in each sentence.
... for sent in tokenized_text:
...     word0 = sent[0]
...     first_words.append(word0)
...     print(word0)
...
Here
There
I
There
Here
I
I
Look
How
>>> print(first_words)
['Here', 'There', 'I', 'There', 'Here', 'I', 'I', 'Look', 'How']
As one-liners with list comprehensions:
# From the_text, you extract the first word directly:
first_words = [word_tokenize(s)[0] for s in sent_tokenize(the_text)]

# Or from tokenized_text:
tokenized_text = [word_tokenize(s) for s in sent_tokenize(the_text)]
first_words = [s[0] for s in tokenized_text]
Another alternative, although it's pretty much similar to abarnert's suggestion:
first_words = []
for i in range(number_sents):
    first_words.append(sents_words[i][0])