Python: connect composed keywords in texts

So, I have a list of lowercase keywords, say
keywords = ['machine learning', 'data science', 'artificial intelligence']
and a list of lowercase texts, say
texts = [
    'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
    'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]
I need to transform the texts into:
[[['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is',
   'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial_intelligence'],
  ['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']],
 [['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different',
   'fields', 'although', 'they', 'are', 'interconnected'],
  ['scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed',
   'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page']]]
What I do right now is check whether each keyword appears in a text and replace it with the same keyword joined by underscores (a rough sketch is shown below). But this is of complexity m*n, and it is really slow when, as in my case, you have 700 long texts and 2M keywords.
I was trying to use gensim's Phraser, but I can't manage to build one from only my keywords.
Could someone suggest a more optimized way of doing this?
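For reference, this is roughly the current approach (a minimal sketch, assuming a plain str.replace per keyword):
underscored = []
for text in texts:                       # n texts
    for kw in keywords:                  # m keywords checked against every text
        if kw in text:
            text = text.replace(kw, kw.replace(' ', '_'))
    underscored.append(text)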

The Phrases/Phraser classes of gensim are designed to use their internal, statistically-derived records of what word pairs should be promoted to phrases – not user-supplied pairings. (You could probably poke & prod a Phraser to do what you want, by synthesizing scores/thresholds, but that would be somewhat awkward & kludgey.)
You could, however, mimic their general approach: (1) operate on lists-of-tokens rather than raw strings; (2) learn & remember token-pairs that should be combined; & (3) perform combination in a single pass. That should work far more efficiently than anything based on doing repeated search-and-replace on a string – which it sounds like you've already tried and found wanting.
For example, let's first create a dictionary, where the keys are tuples of word-pairs that should be combined, and the values are tuples that include both their designated combination-token, and a 2nd item that's just an empty-tuple. (The reason for this will become clear later.)
keywords = ['machine learning', 'data science', 'artificial intelligence']
texts = [
    'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
    'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]
combinations_dict = {tuple(kwsplit): ('_'.join(kwsplit), ())
                     for kwsplit in [kwstr.split() for kwstr in keywords]}
combinations_dict
After this step, combinations_dict is:
{('machine', 'learning'): ('machine_learning', ()),
('data', 'science'): ('data_science', ()),
('artificial', 'intelligence'): ('artificial_intelligence', ())}
Now, we can use a Python generator function to create an iterable transformation of any other sequence-of-tokens, that takes original tokens one-by-one – but before emitting any, adds the next to a buffered candidate pair-of-tokens. If that pair is one that should be combined, a single combined token is yielded – but if not, just the 1st token is emitted, leaving the 2nd to be combined with the next token in a new candidate pair.
For example:
def combining_generator(tokens, comb_dict):
    buff = ()  # start with empty buffer
    for in_tok in tokens:
        buff += (in_tok,)  # add latest to buffer
        if len(buff) < 2:  # grow buffer to 2 tokens if possible
            continue
        # lookup what to do for current pair...
        # ...defaulting to emit-[0]-item, keep-[1]-item in new buff
        out_tok, buff = comb_dict.get(buff, (buff[0], (buff[1],)))
        yield out_tok
    if buff:
        yield buff[0]  # last solo token if any
Here we see the reason for the earlier () empty-tuples: that's the preferred state of the buff after a successful replacement. And driving the result & next-state this way helps us use the form of dict.get(key, default) that supplies a specific value to be used if the key isn't found.
Now designated combinations can be applied via:
tokenized_texts = [text.split() for text in texts]
retokenized_texts = [list(combining_generator(tokens, combinations_dict)) for tokens in tokenized_texts]
retokenized_texts
...which reports retokenized_texts as:
[
['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial', 'intelligence.', 'it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking'],
['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields,', 'although', 'they', 'are', 'interconnected.', 'scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page.']
]
Note that the tokens ('artificial', 'intelligence.') aren't combined here, as the dirt-simple .split() tokenization used has left the punctuation attached, preventing an exact match to the rule.
Real projects will want to use a more sophisticated tokenization that might either strip the punctuation, retain punctuation as tokens, or do other preprocessing - and as a result would properly pass 'artificial' as a token without the attached '.'. For example, a simple tokenization that just retains runs of word characters, discarding punctuation, would be:
import re
tokenized_texts = [re.findall(r'\w+', text) for text in texts]
tokenized_texts
Another that also keeps any stray non-word/non-space characters (punctuation) as standalone tokens would be:
tokenized_texts = [re.findall(r'\w+|(?:[^\w\s])', text) for text in texts]
tokenized_texts
Either of these alternatives to a simple .split() would ensure your 1st text presents the necessary ('artificial', 'intelligence') pair for combination.
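Putting the pieces together, a minimal sketch (reusing combinations_dict and combining_generator from above) with the punctuation-stripping tokenization would be:
import re

tokenized_texts = [re.findall(r'\w+', text) for text in texts]
retokenized_texts = [list(combining_generator(tokens, combinations_dict))
                     for tokens in tokenized_texts]
# 'artificial', 'intelligence' now match the rule and come out as 'artificial_intelligence'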

This is probably not the most Pythonic way to do it, but it works in 3 steps.
keywords = ['machine learning', 'data science', 'artificial intelligence']
texts = ['the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking', 'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.']
import re

#Add underscore
for idx, text in enumerate(texts):
    for keyword in keywords:
        reload_text = texts[idx]
        if keyword in text:
            texts[idx] = reload_text.replace(keyword, keyword.replace(" ", "_"))

#Split text for each "." encountered
for idx, text in enumerate(texts):
    texts[idx] = list(filter(None, text.split(".")))
print(texts)

#Split text to get each word
for idx, text in enumerate(texts):
    for idx_s, sentence in enumerate(text):
        texts[idx][idx_s] = list(map(lambda x: re.sub(r"[,\.!?]", "", x), sentence.split()))  # map to delete every undesired character
print(texts)
Output
[
[
['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial_intelligence'],
['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']
],
[
['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields', 'although', 'they', 'are', 'interconnected'],
['scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page']
]
]

Related

Tokenize text containing digits

I want to create a text classifier. The input to the model contains digits along with text that carries important information (I don't think I can just throw away the digits). Is there a way to tokenize this kind of input?
The input looks like this:
input:
-------
Please have a look at case#345
injector 1 and injector 3 is not responding for model 8
Car has been running for 2345 km, try to do this procedure
.....
.....
This helps:
from keras.preprocessing.text import text_to_word_sequence
text = 'Please have a look at case#345. injector1 and injector3 is not responding for model8. Car has been running for 2345 km, try to do this procedure .'
# tokenize the document
result = text_to_word_sequence(text)
print(result)
Output:
['please', 'have', 'a', 'look', 'at', 'case', '345', 'injector1', 'and', 'injector3', 'is', 'not', 'responding', 'for', 'model8', 'car', 'has', 'been', 'running', 'for', '2345', 'km', 'try', 'to', 'do', 'this', 'procedure']
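If you also need tokens like case#345 kept in one piece, one option (an assumption on my part, not part of the original answer) is to pass a custom filters string to text_to_word_sequence that omits the '#' character:
from keras.preprocessing.text import text_to_word_sequence

text = 'Please have a look at case#345. injector1 and injector3 is not responding for model8.'
# the default filters include '#'; dropping it keeps 'case#345' as a single token
custom_filters = '!"$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
print(text_to_word_sequence(text, filters=custom_filters))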

Str into Dict, len of each str as k and list of words with len as v [duplicate]

This question already has answers here:
python create dict using list of strings with length of strings as values
(6 answers)
I have a string here:
str_files_txt = "A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. There are for most text files a need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records.
'Text file' refers to a type of container, while plain text refers to a type of content.
At a generic level of description, there are two kinds of computer files: text files and binary files"
I am supposed to create a dictionary where the keys are the length of the words and the
values are all the words with the same length. And use a list to store all those words.
This is what I have tried; it works, but I'm not sure how to do this efficiently with a loop. Can anyone please share the answer?
files_dict_values = {}
files_list = list(set(str_files_txt.split()))
values_1 = []
values_2 = []
values_3 = []
values_4 = []
values_5 = []
values_6 = []
values_7 = []
values_8 = []
values_9 = []
values_10 = []
values_11 = []
for ele in files_list:
    if len(ele) == 1:
        values_1.append(ele)
        files_dict_values.update({len(ele): values_1})
    elif len(ele) == 2:
        values_2.append(ele)
        files_dict_values.update({len(ele): values_2})
    elif len(ele) == 3:
        values_3.append(ele)
        files_dict_values.update({len(ele): values_3})
    elif len(ele) == 4:
        values_4.append(ele)
        files_dict_values.update({len(ele): values_4})
    elif len(ele) == 5:
        values_5.append(ele)
        files_dict_values.update({len(ele): values_5})
    elif len(ele) == 6:
        values_6.append(ele)
        files_dict_values.update({len(ele): values_6})
    elif len(ele) == 7:
        values_7.append(ele)
        files_dict_values.update({len(ele): values_7})
    elif len(ele) == 8:
        values_8.append(ele)
        files_dict_values.update({len(ele): values_8})
    elif len(ele) == 9:
        values_9.append(ele)
        files_dict_values.update({len(ele): values_9})
    elif len(ele) == 10:
        values_10.append(ele)
        files_dict_values.update({len(ele): values_10})
print(files_dict_values)
Here is the output I got:
{6: ['modern', 'bytes,', 'stored', 'within', 'exists', 'bytes.', 'system', 'binary', 'length', 'files:', 'refers'], 8: ['sequence', 'content.', 'variable', 'records.', 'systems,', 'computer'], 10: ['container,', 'electronic', 'delimiters', 'structured', '(sometimes', 'character,'], 1: ['A', 'a'], 4: ['will', 'line', 'data', 'done', 'last', 'more', 'kind', 'such', 'text', 'Some', 'size', 'need', 'ways', 'have', 'file', 'CP/M', 'with', 'that', 'most', 'name', 'type', 'keep', 'does'], 5: ['store', 'after', 'files', 'while', 'file"', 'known', 'those', 'plain', 'there', 'fixed', 'which', '"Text', 'file.', 'level', 'where', 'track', 'lines', 'kinds', 'text.', 'There'], 9: ['depending', 'Unix-like', 'primarily', 'textfile;', 'separated', 'Microsoft', 'flatfile)', 'operating', 'different'], 3: ['EOF', 'may', 'one', 'and', 'use', 'are', 'two', 'new', 'the', 'end', 'any', 'for', 'few', 'old', 'not'], 7: ['systems', 'denoted', 'Windows', 'because', 'spelled', 'marker,', 'padding', 'special', 'MS-DOS,', 'generic', 'contain', 'system.', 'placing'], 2: ['At', 'do', 'of', 'on', 'as', 'in', 'an', 'or', 'is', 'In', 'On', 'by', 'to']}
How about using loops and letting the dict create keys on its own:
str_files_txt = "A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. There are for most text files a need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records. 'Text file' refers to a type of container, while plain text refers to a type of content. At a generic level of description, there are two kinds of computer files: text files and binary files"
op = {}
for items in str_files_txt.split():
    if len(items) not in op:
        op[len(items)] = []
    op[len(items)].append(items)
for items in op:
    op[items] = list(set(op[items]))
answer = {}
for word in str_files_txt.split():  # loop over all the words
    # use setdefault to create an empty set if the key doesn't exist;
    # the set will handle deduping
    answer.setdefault(len(word), set()).add(word)  # add the word to the set
# turn those sets into lists
for k, v in answer.items():
    answer[k] = list(v)
str_files_txt = "A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. There are for most text files a need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records. 'Text file' refers to a type of container, while plain text refers to a type of content. At a generic level of description, there are two kinds of computer files: text files and binary files"
lengthWordDict = {}
for word in str_files_txt.split(' '):
    wordWithoutSpecialChars = ''.join([char for char in word if char.isalpha()])
    wordWithoutSpecialCharsLength = len(wordWithoutSpecialChars)
    if(wordWithoutSpecialCharsLength in lengthWordDict.keys()):
        lengthWordDict[wordWithoutSpecialCharsLength].append(word)
    else:
        lengthWordDict[wordWithoutSpecialCharsLength] = [word]
print(lengthWordDict)
This is my solution; it keys on the length of the word without special characters (e.g. punctuation).
To key on the absolute length of the word (with punctuation), replace wordWithoutSpecialChars with word.
Output:
{1: ['A', 'a', 'a', 'A', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'], 4: ['text', 'file', 'name', 'kind', 'file', 'that', 'text.', 'text', 'file', 'data', 'file', 'such', 'does', 'keep', 'file', 'size', 'text', 'file', 'more', 'last', 'line', 'text', 'file.', 'such', 'text', 'file', 'keep', 'file', 'size', 'most', 'text', 'need', 'have', 'done', 'ways', 'Some', 'with', 'file', 'line', 'will', 'text', 'with', "'Text", "file'", 'type', 'text', 'type', 'text'], 9: ['(sometimes', 'operating', 'operating', 'end-of-file', 'operating', 'Microsoft', 'character,', 'operating', 'end-of-line', 'different', 'depending', 'operating', 'operating', 'primarily', 'separated', 'container,'], 7: ['spelled', 'systems', 'denoted', 'placing', 'special', 'padding', 'systems', 'Windows', 'systems,', 'contain', 'special', 'because', 'systems', 'systems', 'systems', 'systems', 'records.', 'content.', 'generic'], 8: ['textfile;', 'flatfile)', 'computer', 'sequence', 'computer', 'Unix-like', 'variable', 'computer'], 2: ['an', 'is', 'is', 'of', 'is', 'as', 'of', 'of', 'as', 'In', 'as', 'of', 'in', 'of', 'is', 'by', 'or', 'as', 'an', 'as', 'in', 'On', 'as', 'do', 'on', 'of', 'in', 'to', 'in', 'on', 'as', 'or', 'to', 'of', 'to', 'of', 'At', 'of', 'of'], 3: ['old', 'CP/M', 'and', 'the', 'not', 'the', 'the', 'end', 'one', 'the', 'and', 'not', 'any', 'EOF', 'the', 'are', 'for', 'are', 'few', 'may', 'not', 'use', 'new', 'and', 'are', 'two', 'and'], 11: ['alternative', 'description,'], 10: ['structured', 'electronic', 'characters,', 'delimiters,', 'delimiters'], 5: ['lines', 'MS-DOS,', 'where', 'track', 'bytes,', 'known', 'after', 'files', 'those', 'track', 'bytes.', 'There', 'files', 'which', 'store', 'files', 'lines', 'fixed', 'while', 'plain', 'level', 'there', 'kinds', 'files:', 'files', 'files'], 6: ['exists', 'stored', 'within', 'system.', 'system', 'marker,', 'modern', 'system.', 'length', 'refers', 'refers', 'binary'], 16: ['record-orientated']}
You can directly add the strings to the dictionary at the right position as follows:
res = {}
for ele in list(set(str_files_txt.split())):
    if len(ele) in res:
        res[len(ele)].append(ele)
    else:
        res[len(ele)] = [ele]
print(res)
You have two problems: cleaning your data and creating the dictionary.
Use a defaultdict(list) after stripping the characters that don't belong to your words. (This is similar to the dupe's answer.)
from collections import defaultdict
d = defaultdict(list)
text = """A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. There are for most text files a need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records.
'Text file' refers to a type of container, while plain text refers to a type of content.
At a generic level of description, there are two kinds of computer files: text files and binary files"
"""
# remove the characters ,.!;:-"' from begin/end of all space splitted words
words = [w.strip(",.!;:- \"'") for w in text.split()]
# add words to list in dict, automatically creates list if needed
# your code uses a set as well
for w in set(words):
    d[len(w)].append(w)

# output
for k in sorted(d):
    print(k, d[k])
Output:
1 ['A', 'a']
2 ['to', 'an', 'At', 'do', 'on', 'In', 'On', 'as', 'by', 'or', 'of', 'in', 'is']
3 ['use', 'the', 'one', 'and', 'few', 'not', 'EOF', 'may', 'any', 'for', 'are', 'two', 'end', 'new', 'old']
4 ['have', 'that', 'such', 'type', 'need', 'text', 'more', 'done', 'kind', 'Some', 'does', 'most', 'file', 'with', 'line', 'ways', 'keep', 'CP/M', 'name', 'will', 'Text', 'data', 'last', 'size']
5 ['track', 'those', 'bytes', 'fixed', 'known', 'where', 'which', 'there', 'while', 'There', 'lines', 'kinds', 'store', 'files', 'plain', 'after', 'level']
6 ['exists', 'modern', 'MS-DOS', 'system', 'within', 'refers', 'length', 'marker', 'stored', 'binary']
7 ['because', 'placing', 'content', 'Windows', 'padding', 'systems', 'records', 'contain', 'special', 'generic', 'denoted', 'spelled']
8 ['computer', 'sequence', 'textfile', 'variable']
9 ['Microsoft', 'depending', 'different', 'Unix-like', 'flatfile)', 'primarily', 'container', 'character', 'separated', 'operating']
10 ['delimiters', 'characters', 'electronic', '(sometimes', 'structured']
11 ['end-of-file', 'alternative', 'end-of-line', 'description']
17 ['record-orientated']

Python: gensim: RuntimeError: you must first build vocabulary before training the model

I know that this question has been asked already, but I was still not able to find a solution for it.
I would like to use gensim's word2vec on a custom data set, but now I'm still figuring out in what format the dataset has to be. I had a look at this post where the input is basically a list of lists (one big list containing other lists that are tokenized sentences from the NLTK Brown corpus). So I thought that this is the input format I have to use for the command word2vec.Word2Vec(). However, it won't work with my little test set and I don't understand why.
What I have tried:
This worked:
from gensim.models import word2vec
from nltk.corpus import brown
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
brown_vecs = word2vec.Word2Vec(brown.sents())
This didn't work:
sentences = [ "the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep"]
vocab = [s.encode('utf-8').split() for s in sentences]
voc_vec = word2vec.Word2Vec(vocab)
I don't understand why it doesn't work with the "mock" data, even though it has the same data structure as the sentences from the Brown corpus:
vocab:
[['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'], ['yoyoyo', 'you', 'go', 'home', 'now', 'to', 'sleep']]
brown.sents(): (the beginning of it)
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
Can anyone please tell me what I'm doing wrong?
Default min_count in gensim's Word2Vec is set to 5. If there is no word in your vocab with frequency greater than 4, your vocab will be empty and hence the error. Try
voc_vec = word2vec.Word2Vec(vocab, min_count=1)
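A minimal sketch of the fix on the "mock" data (assuming gensim >= 1.0, where the trained vectors live under model.wv):
from gensim.models import word2vec

sentences = [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'],
             ['yoyoyo', 'you', 'go', 'home', 'now', 'to', 'sleep']]
model = word2vec.Word2Vec(sentences, min_count=1)  # min_count=1 keeps every word in the vocab
print(model.wv['fox'])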
Input to gensim's Word2Vec can be a list of sentences, a list of words, or a list of lists of sentences.
E.g.
1. sentences = ['I love ice-cream', 'he loves ice-cream', 'you love ice cream']
2. words = ['i','love','ice - cream', 'like', 'ice-cream']
3. sentences = [['i love ice-cream'], ['he loves ice-cream'], ['you love ice cream']]
Build the vocab before training:
model.build_vocab(sentences, update=False)
just check out the link for detailed info
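For completeness, a minimal sketch of the explicit build-then-train flow mentioned above (assuming the gensim 3.x/4.x API):
from gensim.models import Word2Vec

sentences = [['the', 'quick', 'brown', 'fox'], ['you', 'go', 'home', 'now']]
model = Word2Vec(min_count=1)                       # no corpus passed, so nothing is trained yet
model.build_vocab(sentences, update=False)          # build the vocabulary first
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)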

Easiest way to count the number of occurrences of a word in a string of a paragraph

For example, how to count the word "paragraph" in the paragraph below:
A paragraph in Word is any text that ends with a hard return. You
insert a hard return anytime you press the Enter key. Paragraph
formatting lets you control the appearance if individual paragraphs.
For example, you can change the alignment of text from left to center
or the spacing between lines form single to double. You can indent
paragraphs, number them, or add borders and shading to them.
Paragraph formatting is applied to an entire paragraph. All formatting
for a paragraph is stored in the paragraph mark and carried to the
next paragraph when you press the Enter key. You can copy paragraph
formats from paragraph to paragraph and view formats through task
panes.
You want to use the count method on the input string, passing "paragraph" as the argument.
>>> text = """A paragraph in Word is any text that ends with a hard return. You insert a hard return anytime you press the Enter key. Paragraph formatting lets you control the appearance if individual paragraphs. For example, you can change the alignment of text from left to center or the spacing between lines form single to double. You can indent paragraphs, number them, or add borders and shading to them.
Paragraph formatting is applied to an entire paragraph. All formatting for a paragraph is stored in the paragraph mark and carried to the next paragraph when you press the Enter key. You can copy paragraph formats from paragraph to paragraph and view formats through task panes."""
>>> text.count('paragraph') # case sensitive
10
>>> text.lower().count('paragraph') # case insensitive
12
As mentioned in the comments, you can use lower() to transform the text to be all lowercase. This will include instances of "paragraph" and "Paragraph" in the count.
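Note that str.count matches substrings, so words like 'paragraphs' are counted too. If you only want whole-word matches, one alternative (not part of the original answer) is a word-boundary regex:
import re

# count only whole-word, case-insensitive occurrences of 'paragraph'
print(len(re.findall(r'\bparagraph\b', text, flags=re.IGNORECASE)))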
I would do the following:
Split into a list of words (although not totally necessary)
Lowercase all the words
Use count to count the number of instances
>>> s
'A paragraph in Word is any text that ends with a hard return. You insert a hard return anytime you press the Enter key. Paragraph formatting lets you control the appearance if individual paragraphs. For example, you can change the alignment of text from left to center or the spacing between lines form single to double. You can indent paragraphs, number them, or add borders and shading to them.\n\n Paragraph formatting is applied to an entire paragraph. All formatting for a paragraph is stored in the paragraph mark and carried to the next paragraph when you press the Enter key. You can copy paragraph formats from paragraph to paragraph and view formats through task panes.'
>>> s.split()
['A', 'paragraph', 'in', 'Word', 'is', 'any', 'text', 'that', 'ends', 'with', 'a', 'hard', 'return.', 'You', 'insert', 'a', 'hard', 'return', 'anytime', 'you', 'press', 'the', 'Enter', 'key.', 'Paragraph', 'formatting', 'lets', 'you', 'control', 'the', 'appearance', 'if', 'individual', 'paragraphs.', 'For', 'example,', 'you', 'can', 'change', 'the', 'alignment', 'of', 'text', 'from', 'left', 'to', 'center', 'or', 'the', 'spacing', 'between', 'lines', 'form', 'single', 'to', 'double.', 'You', 'can', 'indent', 'paragraphs,', 'number', 'them,', 'or', 'add', 'borders', 'and', 'shading', 'to', 'them.', 'Paragraph', 'formatting', 'is', 'applied', 'to', 'an', 'entire', 'paragraph.', 'All', 'formatting', 'for', 'a', 'paragraph', 'is', 'stored', 'in', 'the', 'paragraph', 'mark', 'and', 'carried', 'to', 'the', 'next', 'paragraph', 'when', 'you', 'press', 'the', 'Enter', 'key.', 'You', 'can', 'copy', 'paragraph', 'formats', 'from', 'paragraph', 'to', 'paragraph', 'and', 'view', 'formats', 'through', 'task', 'panes.']
>>> [word.lower() for word in s.split()].count("paragraph")
9
Here's another example of splitting the paragraph into words and then looping through the word list and incrementing a counter when the target word is found.
paragraph = '''insert paragraph here'''
wordlist = paragraph.split(" ")
count = 0
for word in wordlist:
    if word == "paragraph":
        count += 1
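For comparison, a short sketch using collections.Counter does the lowercasing, splitting, and counting in one pass (reusing the paragraph variable from above):
from collections import Counter

word_counts = Counter(word.lower().strip('.,') for word in paragraph.split())
print(word_counts['paragraph'])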

Split a large string into multiple substrings containing 'n' number of words via python

Source text: United States Declaration of Independence
How can one split the above source text into a number of sub-strings containing 'n' words each?
I use split(' ') to extract each word; however, I do not know how to do this with multiple words in one operation.
I could run through the list of words that I have and create another by gluing together words from the first list (whilst adding spaces), but my method isn't very Pythonic.
text = """
When in the course of human Events, it becomes necessary for one People to dissolve the Political Bands which have connected them with another, and to assume among the Powers of the Earth, the separate and equal Station to which the Laws of Nature and of Nature?s God entitle them, a decent Respect to the Opinions of Mankind requires that they should declare the causes which impel them to the Separation.
We hold these Truths to be self-evident, that all Men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty, and the pursuit of Happiness?-That to secure these Rights, Governments are instituted among Men, deriving their just Powers from the Consent of the Governed, that whenever any Form of Government becomes destructive of these Ends, it is the Right of the People to alter or abolish it, and to institute a new Government, laying its Foundation on such Principles, and organizing its Powers in such Form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient Causes; and accordingly all Experience hath shewn, that Mankind are more disposed to suffer, while Evils are sufferable, than to right themselves by abolishing the Forms to which they are accustomed. But when a long Train of Abuses and Usurpations, pursuing invariably the same Object, evinces a Design to reduce them under absolute Despotism, it is their Right, it is their Duty, to throw off such Government, and to provide new Guards for their future Security. Such has been the patient Sufferance of these Colonies; and such is now the Necessity which constrains them to alter their former Systems of Government. The History of the Present King of Great-Britain is a History of repeated Injuries and Usurpations, all having in direct Object the Establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid World.
"""
words = text.split()
subs = []
n = 4
for i in range(0, len(words), n):
    subs.append(" ".join(words[i:i+n]))
print subs[:10]
prints:
['When in the course', 'of human Events, it', 'becomes necessary for one', 'People to dissolve the', 'Political Bands which have', 'connected them with another,', 'and to assume among', 'the Powers of the', 'Earth, the separate and', 'equal Station to which']
or, as a list comprehension:
subs = [" ".join(words[i:i+n]) for i in range(0, len(words), n)]
You're trying to create n-grams? Here's how I do it, using the NLTK.
import re
import nltk

punct = re.compile(r'^[^A-Za-z0-9]+|[^a-zA-Z0-9]+$')
is_word = re.compile(r'[a-z]', re.IGNORECASE)
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
word_tokenizer = nltk.tokenize.punkt.PunktWordTokenizer()

def get_words(sentence):
    return [punct.sub('', word) for word in word_tokenizer.tokenize(sentence) if is_word.search(word)]

def ngrams(text, n):
    for sentence in sentence_tokenizer.tokenize(text.lower()):
        words = get_words(sentence)
        for i in range(len(words) - (n - 1)):
            yield ' '.join(words[i:i+n])
Then
for ngram in ngrams(sometext, 3):
    print ngram
For a large string, an iterator is recommended for speed and a low memory footprint.
import re, itertools
# Original text
text = "When in the course of human Events, it becomes necessary for one People to dissolve the Political Bands which have connected them with another, and to assume among the Powers of the Earth, the separate and equal Station to which the Laws of Nature and of Nature?s God entitle them, a decent Respect to the Opinions of Mankind requires that they should declare the causes which impel them to the Separation."
n = 10
# An iterator which will extract words one by one from text when needed
words = itertools.imap(lambda m:m.group(), re.finditer(r'\w+', text))
# The final iterator that combines words into n-length groups
word_groups = itertools.izip_longest(*(words,)*n)
for g in word_groups: print g
will get the following result:
('When', 'in', 'the', 'course', 'of', 'human', 'Events', 'it', 'becomes', 'necessary')
('for', 'one', 'People', 'to', 'dissolve', 'the', 'Political', 'Bands', 'which', 'have')
('connected', 'them', 'with', 'another', 'and', 'to', 'assume', 'among', 'the', 'Powers')
('of', 'the', 'Earth', 'the', 'separate', 'and', 'equal', 'Station', 'to', 'which')
('the', 'Laws', 'of', 'Nature', 'and', 'of', 'Nature', 's', 'God', 'entitle')
('them', 'a', 'decent', 'Respect', 'to', 'the', 'Opinions', 'of', 'Mankind', 'requires')
('that', 'they', 'should', 'declare', 'the', 'causes', 'which', 'impel', 'them', 'to')
('the', 'Separation', None, None, None, None, None, None, None, None)
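If you are on Python 3, where itertools.imap and izip_longest no longer exist, an equivalent sketch uses a generator expression and itertools.zip_longest (the text is the same as above):
import re
from itertools import zip_longest

text = "When in the course of human Events, it becomes necessary for one People to dissolve the Political Bands which have connected them with another, and to assume among the Powers of the Earth, the separate and equal Station to which the Laws of Nature and of Nature?s God entitle them, a decent Respect to the Opinions of Mankind requires that they should declare the causes which impel them to the Separation."
n = 10
# a generator expression replaces itertools.imap; zip_longest replaces izip_longest
words = (m.group() for m in re.finditer(r'\w+', text))
word_groups = zip_longest(*(words,) * n)
for g in word_groups:
    print(g)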
