nltk how to pass multiple separate sentences - python

I have a list of sentences (each sentence is a list of words) in English and I would like to extract ngrams.
For example:
sentences = [['this', 'is', 'sentence', 'one'], ['hello', 'again']]
In order to run nltk.util.ngrams I need to flatten the list to:
sentences = ['this', 'is', 'sentence', 'one', 'hello', 'again']
But then I get a spurious bigram, ('one', 'hello'), that spans the sentence boundary.
What is the best way to deal with this?
Thanks!

Try this:
from itertools import chain
sentences = list(chain(*sentences))
chain() returns a chain object whose .__next__() method returns elements from the first iterable until it is exhausted, then elements from the next iterable, until all of the iterables are exhausted.
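For example, chaining the two example sentences together:
>>> from itertools import chain
>>> list(chain(['this', 'is', 'sentence', 'one'], ['hello', 'again']))
['this', 'is', 'sentence', 'one', 'hello', 'again']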
Or you can do:
sentences = [i for s in sentences for i in s]

You can also build the flat list with extend(); a plain loop is clearer here than a list comprehension used only for its side effects:
f = []
for _l in sentences:
    f.extend(_l)
# f == ['this', 'is', 'sentence', 'one', 'hello', 'again']
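If the goal is to avoid the spurious ('one', 'hello') bigram altogether, another option (a sketch, not taken from the answers above) is to build the ngrams per sentence and only then chain the per-sentence results, so no ngram ever crosses a sentence boundary:
>>> from itertools import chain
>>> from nltk.util import ngrams
>>> sentences = [['this', 'is', 'sentence', 'one'], ['hello', 'again']]
>>> list(chain.from_iterable(ngrams(s, 2) for s in sentences))
[('this', 'is'), ('is', 'sentence'), ('sentence', 'one'), ('hello', 'again')]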

Related

Creating a bigrams list not as a list of tuples but as a list of strings of both words combined

I am using this code to create bigrams from a list of titles (headlines):
import nltk
from nltk.util import ngrams

def bigram_creator(headlines):
    bigrams = []
    for line in headlines:
        bigrm = nltk.bigrams(line.split())
        bigrams.extend(bigrm)
    return bigrams
However, the code is giving me a list of tuples:
e.g. [('opinion', 'one'), ('one', 'good')]
and I would like it to output a list of strings with both words joined:
e.g. ['opinion one', 'one good']
Does anybody know what I have to change in my code to get that?
Thank you in advance
You can use ' '.join() on each tuple. The result will no longer contain tuples, which I'd say are unnecessary here anyway.
bad_output = [('opinion', 'one'), ('one', 'good')]
good_output = [' '.join(tup) for tup in bad_output]
Output:
['opinion one', 'one good']
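Folded back into the question's function, the join can happen as the bigrams are collected; a minimal sketch, assuming nltk is imported as in the question:
import nltk

def bigram_creator(headlines):
    bigrams = []
    for line in headlines:
        # join each bigram tuple into one space-separated string as we go
        bigrams.extend(' '.join(pair) for pair in nltk.bigrams(line.split()))
    return bigrams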

Splitting a string based on a certain set of words

I have a list of strings like this:
['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
Given a keyword list like ['for', 'or', 'and'], I want to parse the list into another list where, if a keyword occurs in a string, that string is split into multiple parts.
For example, the above list would be split into
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
Currently I split each inner string on underscores and use a for loop to look for the index of a keyword, then recombine the strings with underscores. Is there a quicker way to do this?
>>> [re.split(r"_(?:f?or|and)_", s) for s in l]
[['happy_feet'],
['happy_hats', 'cats'],
['sad_fox', 'mad_banana'],
['sad_pandas', 'happy_cats', 'people']]
(The f?or in the pattern matches both "for" and "or" with a single alternative.)
To combine them into a single list, you can use
result = []
for s in l:
    result.extend(re.split(r"_(?:f?or|and)_", s))
>>> pat = re.compile("_(?:%s)_" % "|".join(sorted(split_list, key=len, reverse=True)))
>>> list(itertools.chain.from_iterable(pat.split(line) for line in data))
will give you the desired output for the example dataset provided (sorting longest-first so longer keywords are tried before any shorter keyword that is a prefix of them).
Actually, with the _ delimiters you don't really need to sort by length, so you could just do
>>> pat = re.compile("_(?:%s)_" % "|".join(split_list))
>>> list(itertools.chain.from_iterable(pat.split(line) for line in data))
You could use a regular expression:
from itertools import chain
import re
pattern = re.compile(r'_(?:{})_'.format('|'.join([re.escape(w) for w in keywords])))
result = list(chain.from_iterable(pattern.split(w) for w in input_list))
The pattern is dynamically created from your list of keywords. The string 'happy_hats_for_cats' is split on '_for_':
>>> re.split(r'_for_', 'happy_hats_for_cats')
['happy_hats', 'cats']
but because we actually produced a set of alternatives (using the | metacharacter) you get to split on any of the keywords:
>>> re.split(r'_(?:for|or|and)_', 'sad_pandas_and_happy_cats_for_people')
['sad_pandas', 'happy_cats', 'people']
Each split result gives you a list of strings (just one if there was nothing to split on); using itertools.chain.from_iterable() lets us treat all those lists as one long iterable.
Demo:
>>> from itertools import chain
>>> import re
>>> keywords = ['for', 'or', 'and']
>>> input_list = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
>>> pattern = re.compile(r'_(?:{})_'.format('|'.join([re.escape(w) for w in keywords])))
>>> list(chain.from_iterable(pattern.split(w) for w in input_list))
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
Another way of doing this, using only built-in methods, is to replace every occurrence of the keywords in ['for', 'or', 'and'] in each string with a placeholder string, say _1_ (it could be any string), and then at the end of each iteration split on this placeholder:
l = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana', 'sad_pandas_and_happy_cats_for_people']
replacement_s = '_1_'
lookup = ['for', 'or', 'and']
lookup = [x.join('_' * 2) for x in lookup]  # changing to: ['_for_', '_or_', '_and_']
results = []
for i, item in enumerate(l):
    for s in lookup:
        if s in item:
            l[i] = l[i].replace(s, replacement_s)
    results.extend(l[i].split(replacement_s))
Output:
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']

What does "word for word" syntax mean in Python?

I see the following script snippet from the gensim tutorial page.
What's the syntax of word for word in below Python script?
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
...          for document in documents]
This is a list comprehension. The code you posted loops through every element in document.lower().split() and creates a new list that contains only the elements that meet the if condition. It does this for each document in documents.
Try it out...
elems = [1, 2, 3, 4]
squares = [e*e for e in elems] # square each element
big = [e for e in elems if e > 2] # keep elements bigger than 2
As you can see from your example, list comprehensions can be nested.
That is a list comprehension. An easier example might be:
evens = [num for num in range(100) if num % 2 == 0]
I'm quite sure I've seen that line in some NLP applications.
This list comprehension:
[[word for word in document.lower().split() if word not in stoplist] for document in documents]
is the same as
ending_list = []  # often known as a document stream in NLP
for document in documents:  # loop through the list of documents
    internal_list = []  # often known as a list of tokens
    for word in document.lower().split():
        if word not in stoplist:
            internal_list.append(word)  # this is the inner [word for word ...] part
    ending_list.append(internal_list)
Basically you want a list of documents, each of which is a list of tokens. So by looping through the documents,
for document in documents:
you then split each document into tokens,
list_of_tokens = []
for word in document.lower().split():
and then make a list of these tokens:
    list_of_tokens.append(word)
For example:
>>> doc = "This is a foo bar sentence ."
>>> [word for word in doc.lower().split()]
['this', 'is', 'a', 'foo', 'bar', 'sentence', '.']
It's the same as:
>>> doc = "This is a foo bar sentence ."
>>> list_of_tokens = []
>>> for word in doc.lower().split():
...     list_of_tokens.append(word)
...
>>> list_of_tokens
['this', 'is', 'a', 'foo', 'bar', 'sentence', '.']
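And with a stoplist in play, here is a quick run of the full nested comprehension (the two documents and the stoplist are made up for illustration):
>>> documents = ["Human machine interface", "A survey of user opinion"]
>>> stoplist = set('a of the'.split())
>>> [[word for word in document.lower().split() if word not in stoplist]
...  for document in documents]
[['human', 'machine', 'interface'], ['survey', 'user', 'opinion']]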

using an array of words to filter words from a second array

I am comparing two arrays in Python.
The 1st array is a list of words from a query string. The second array is the list of words to be excluded from the query.
I have to compare these arrays and exclude words from the first array which are contained in the second array.
I tried to solve this by comparing each word from the first array to the whole of second array and continuing until all the words from the first array are exhausted:
for i in q_str:
    if q_str[i] in stop_arr:
        continue
    else:
        sans_arr[j] = q_arr[i]
        j = j + 1
Where q_str is the query array, stop_arr contains the words to be excluded, and
sans_arr is a new array with the words excluded.
This code generates an error:
TypeError: list indices must be integers, not str
Use sets instead of lists, which gives easy access to set operations, such as subtraction:
set1 = set(q_str)
set2 = set(stop_arr)
set3 = set1 - set2 # things which are in set1, but not in set2
# or
set4 = set1.difference(set2) # things which are in set1, but not in set2
Here's an example:
>>> u = set([1,2,3,4])
>>> v = set([3,4,5,6])
>>> u - v
set([1, 2])
>>> u.difference(v)
set([1, 2])
>>> v.difference(u)
set([5, 6])
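Keep in mind that converting to sets discards word order and duplicate words, which may matter for a query string; a quick illustration with made-up words:
>>> q_str = ['the', 'cat', 'the', 'hat']
>>> stop_arr = ['the']
>>> set(q_str) - set(stop_arr)  # order and duplicates are lost
set(['cat', 'hat'])
>>> [w for w in q_str if w not in stop_arr]  # order preserved
['cat', 'hat']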
It is not entirely clear whether you wish to preserve the ordering of words in q_str. If you do:
import re
q_str = 'I am comparing 2 arrays in python. both are character arrays. the 1st array is a list of words from a query string. the second array is the list of words to be excluded from the query.'
q_arr = re.split(r'[\s.,;]+', q_str)
stop_arr = set(['a', 'the', 'of', 'is', 'in', 'to', 'be', 'am', 'are', ''])
print [w for w in q_arr if w not in stop_arr]
This produces:
['I', 'comparing', '2', 'arrays', 'python', 'both', 'character', 'arrays', '1st',
'array', 'list', 'words', 'from', 'query', 'string', 'second', 'array', 'list',
'words', 'excluded', 'from', 'query']
This code generates a new list with all elements of q_str that do not exist in stop_arr:
sans_arr = [x for x in q_str if x not in stop_arr]
Disclaimer: I don't know whether q_str is actually a list of strings, because you talk about a query array.
When you are iterating over a list with a for loop, you will get the elements of the list, not indices. This means that i will actually be the strings from q_str, so instead of doing if q_str[i] in stop_arr you can check if i in stop_arr. This also means that you want to add i to sans_arr instead of q_arr[i].
Also, unless sans_arr has already been created with a certain length, you probably want to do sans_arr.append(i) instead of your current approach of setting the element at a specific index and then incrementing your current index.
And since i makes more sense for an index than a word, I have renamed i in the loop to word:
for word in q_str:
    if word in stop_arr:
        continue
    else:
        sans_arr.append(word)
Solution for filtering query-string key-value pairs
I assume q_str is a dictionary of key-value pairs from the query string, stop_arr is a list of keys you do not want, and sans_arr is the filtered q_str, without the keys existing in stop_arr.
Under the above assumptions, the solution would look like this:
sans_arr = {x: q_str[x] for x in q_str if x not in stop_arr}
Test
This is how it works:
>>> q_str = {
...     'test1': 'val1',
...     'test2': 'val2',
...     'test3': 'val3'
... }
>>> stop_arr = ['test3','test4']
>>> sans_arr = {x: q_str[x] for x in q_str if x not in stop_arr}
>>> sans_arr
{'test1': 'val1', 'test2': 'val2'}
for i in q_str iterates over the list in your loop, returning a string each time, so I would lose the [i] indexing in your loop:
for word in q_str:
    if word in stop_arr:
        continue
    else:
        sans_arr[j] = word
        j = j + 1

Return list of words from a list of lines with regexp

I'm running the following code on a list of strings to return a list of its words:
words = [re.split('\\s+', line) for line in lines]
However, I end up getting something like:
[['import', 're', ''], ['', ''], ['def', 'word_count(filename):', ''], ...]
As opposed to the desired:
['import', 're', '', '', '', 'def', 'word_count(filename):', '', ...]
How can I unpack the lists re.split('\\s+', line) produces in the above list comprehension? Naïvely, I tried using * but that doesn't work.
(I'm looking for a simple and Pythonic way of doing; I was tempted to write a function but I'm sure the language accommodates for this issue.)
>>> import re
>>> from itertools import chain
>>> lines = ["hello world", "second line", "third line"]
>>> words = chain(*[re.split(r'\s+', line) for line in lines])
This will give you an iterator that can be used for looping through all words:
>>> for word in words:
...     print(word)
...
hello
world
second
line
third
line
Creating a list instead of an iterator is just a matter of wrapping the iterator in a list call:
>>> words = list(chain(*[re.split(r'\s+', line) for line in lines]))
The reason why you get a list of lists is that re.split() returns a list, which is then appended to the list comprehension's output.
It's unclear why you are doing it that way (or maybe it's just a bad example), but if you can get the full content (all lines) as a single string, you can just do
words = re.split(r'\s+', lines)
if lines is the product of:
open('filename').readlines()
use
open('filename').read()
instead.
You can always do this:
words = []
for line in lines:
    words.extend(re.split(r'\s+', line))
It's not nearly as elegant as a one-liner list comprehension, but it gets the job done.
Just stumbled across this old question, and I think I have a better solution. Normally, if you want to nest a list comprehension ("appending" each inner list), you have to think backwards (in an un-for-loop-like order). This is not what you want:
>>> import re
>>> lines = ["hello world", "second line", "third line"]
>>> [[word for word in re.split(r'\s+', line)] for line in lines]
[['hello', 'world'], ['second', 'line'], ['third', 'line']]
However if you want to "extend" instead of "append" the lists you're generating, just leave out the extra set of square brackets and reverse your for-loops (putting them back in the "right" order).
>>> [word for line in lines for word in re.split(r'\s+', line)]
['hello', 'world', 'second', 'line', 'third', 'line']
This seems like a more Pythonic solution to me, since it is based on list-processing logic rather than some random built-in function. Every programmer should know how to do this (especially ones trying to learn Lisp!)
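One more note, not from the answers above: if the empty strings in the output are not actually needed, plain str.split() with no argument splits on runs of whitespace and drops empty strings entirely, so no regex is required:
>>> lines = ["hello world", "", " second line "]
>>> [word for line in lines for word in line.split()]
['hello', 'world', 'second', 'line']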
