This question already has answers here:
python create dict using list of strings with length of strings as values
(6 answers)
Closed 2 years ago.
I have a string here:
str_files_txt = "A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. There are for most text files a need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records.
'Text file' refers to a type of container, while plain text refers to a type of content.
At a generic level of description, there are two kinds of computer files: text files and binary files"
I am supposed to create a dictionary where the keys are the lengths of the words and the
values are lists of all the words with that length.
This is what I have tried. It works, but I'm not sure how to do this efficiently with a loop; can anyone please share a better approach?
files_dict_values = {}
files_list = list(set(str_files_txt.split()))
values_1 = []
values_2 = []
values_3 = []
values_4 = []
values_5 = []
values_6 = []
values_7 = []
values_8 = []
values_9 = []
values_10 = []
values_11 = []
for ele in files_list:
    if len(ele) == 1:
        values_1.append(ele)
        files_dict_values.update({len(ele): values_1})
    elif len(ele) == 2:
        values_2.append(ele)
        files_dict_values.update({len(ele): values_2})
    elif len(ele) == 3:
        values_3.append(ele)
        files_dict_values.update({len(ele): values_3})
    elif len(ele) == 4:
        values_4.append(ele)
        files_dict_values.update({len(ele): values_4})
    elif len(ele) == 5:
        values_5.append(ele)
        files_dict_values.update({len(ele): values_5})
    elif len(ele) == 6:
        values_6.append(ele)
        files_dict_values.update({len(ele): values_6})
    elif len(ele) == 7:
        values_7.append(ele)
        files_dict_values.update({len(ele): values_7})
    elif len(ele) == 8:
        values_8.append(ele)
        files_dict_values.update({len(ele): values_8})
    elif len(ele) == 9:
        values_9.append(ele)
        files_dict_values.update({len(ele): values_9})
    elif len(ele) == 10:
        values_10.append(ele)
        files_dict_values.update({len(ele): values_10})
print(files_dict_values)
Here is the output I got:
{6: ['modern', 'bytes,', 'stored', 'within', 'exists', 'bytes.', 'system', 'binary', 'length', 'files:', 'refers'], 8: ['sequence', 'content.', 'variable', 'records.', 'systems,', 'computer'], 10: ['container,', 'electronic', 'delimiters', 'structured', '(sometimes', 'character,'], 1: ['A', 'a'], 4: ['will', 'line', 'data', 'done', 'last', 'more', 'kind', 'such', 'text', 'Some', 'size', 'need', 'ways', 'have', 'file', 'CP/M', 'with', 'that', 'most', 'name', 'type', 'keep', 'does'], 5: ['store', 'after', 'files', 'while', 'file"', 'known', 'those', 'plain', 'there', 'fixed', 'which', '"Text', 'file.', 'level', 'where', 'track', 'lines', 'kinds', 'text.', 'There'], 9: ['depending', 'Unix-like', 'primarily', 'textfile;', 'separated', 'Microsoft', 'flatfile)', 'operating', 'different'], 3: ['EOF', 'may', 'one', 'and', 'use', 'are', 'two', 'new', 'the', 'end', 'any', 'for', 'few', 'old', 'not'], 7: ['systems', 'denoted', 'Windows', 'because', 'spelled', 'marker,', 'padding', 'special', 'MS-DOS,', 'generic', 'contain', 'system.', 'placing'], 2: ['At', 'do', 'of', 'on', 'as', 'in', 'an', 'or', 'is', 'In', 'On', 'by', 'to']}
How about using a loop and letting the dictionary create keys on its own?
str_files_txt = "A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. There are for most text files a need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records. 'Text file' refers to a type of container, while plain text refers to a type of content. At a generic level of description, there are two kinds of computer files: text files and binary files"
op = {}
for items in str_files_txt.split():
    if len(items) not in op:
        op[len(items)] = []
    op[len(items)].append(items)
for items in op:
    op[items] = list(set(op[items]))
answer = {}
for word in str_files_txt.split():  # loop over all the words
    # use setdefault to create an empty set if the key doesn't exist
    answer.setdefault(len(word), set()).add(word)  # add the word to the set
    # the set will handle deduping

# turn those sets into lists
for k, v in answer.items():
    answer[k] = list(v)
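On a tiny made-up sample (just to illustrate the setdefault pattern and the deduping):

sample = "a bb cc a"
answer = {}
for word in sample.split():
    answer.setdefault(len(word), set()).add(word)
# answer is now {1: {'a'}, 2: {'bb', 'cc'}} -- the repeated 'a' was deduped by the set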
str_files_txt = "A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. There are for most text files a need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records. 'Text file' refers to a type of container, while plain text refers to a type of content. At a generic level of description, there are two kinds of computer files: text files and binary files"
lengthWordDict = {}
for word in str_files_txt.split(' '):
    wordWithoutSpecialChars = ''.join([char for char in word if char.isalpha()])
    wordWithoutSpecialCharsLength = len(wordWithoutSpecialChars)
    if wordWithoutSpecialCharsLength in lengthWordDict:
        lengthWordDict[wordWithoutSpecialCharsLength].append(word)
    else:
        lengthWordDict[wordWithoutSpecialCharsLength] = [word]
print(lengthWordDict)
This is my solution. It keys each word on its length without special characters (e.g. punctuation).
To get the absolute length of the word (with punctuation), replace wordWithoutSpecialChars with word.
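That is, only the length computation changes (a one-line sketch of that edit):

wordWithoutSpecialCharsLength = len(word)  # count punctuation too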
Output:
{1: ['A', 'a', 'a', 'A', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'], 4: ['text', 'file', 'name', 'kind', 'file', 'that', 'text.', 'text', 'file', 'data', 'file', 'such', 'does', 'keep', 'file', 'size', 'text', 'file', 'more', 'last', 'line', 'text', 'file.', 'such', 'text', 'file', 'keep', 'file', 'size', 'most', 'text', 'need', 'have', 'done', 'ways', 'Some', 'with', 'file', 'line', 'will', 'text', 'with', "'Text", "file'", 'type', 'text', 'type', 'text'], 9: ['(sometimes', 'operating', 'operating', 'end-of-file', 'operating', 'Microsoft', 'character,', 'operating', 'end-of-line', 'different', 'depending', 'operating', 'operating', 'primarily', 'separated', 'container,'], 7: ['spelled', 'systems', 'denoted', 'placing', 'special', 'padding', 'systems', 'Windows', 'systems,', 'contain', 'special', 'because', 'systems', 'systems', 'systems', 'systems', 'records.', 'content.', 'generic'], 8: ['textfile;', 'flatfile)', 'computer', 'sequence', 'computer', 'Unix-like', 'variable', 'computer'], 2: ['an', 'is', 'is', 'of', 'is', 'as', 'of', 'of', 'as', 'In', 'as', 'of', 'in', 'of', 'is', 'by', 'or', 'as', 'an', 'as', 'in', 'On', 'as', 'do', 'on', 'of', 'in', 'to', 'in', 'on', 'as', 'or', 'to', 'of', 'to', 'of', 'At', 'of', 'of'], 3: ['old', 'CP/M', 'and', 'the', 'not', 'the', 'the', 'end', 'one', 'the', 'and', 'not', 'any', 'EOF', 'the', 'are', 'for', 'are', 'few', 'may', 'not', 'use', 'new', 'and', 'are', 'two', 'and'], 11: ['alternative', 'description,'], 10: ['structured', 'electronic', 'characters,', 'delimiters,', 'delimiters'], 5: ['lines', 'MS-DOS,', 'where', 'track', 'bytes,', 'known', 'after', 'files', 'those', 'track', 'bytes.', 'There', 'files', 'which', 'store', 'files', 'lines', 'fixed', 'while', 'plain', 'level', 'there', 'kinds', 'files:', 'files', 'files'], 6: ['exists', 'stored', 'within', 'system.', 'system', 'marker,', 'modern', 'system.', 'length', 'refers', 'refers', 'binary'], 16: ['record-orientated']}
You can directly add the strings to the dictionary at the right position as follows:
res = {}
for ele in list(set(str_files_txt.split())):
    if len(ele) in res:
        res[len(ele)].append(ele)
    else:
        res[len(ele)] = [ele]
print(res)
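The if/else can also be collapsed with dict.setdefault; this sketch has the same behavior:

res = {}
for ele in list(set(str_files_txt.split())):
    res.setdefault(len(ele), []).append(ele)
print(res)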
You have two problems: cleaning your data and creating the dictionary.
Use a defaultdict(list) after cleaning your words of characters that do not belong to them. (This is similar to the dupe's answer.)
from collections import defaultdict
d = defaultdict(list)
text = """A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. There are for most text files a need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records.
'Text file' refers to a type of container, while plain text refers to a type of content.
At a generic level of description, there are two kinds of computer files: text files and binary files"
"""
# remove the characters ,.!;:-"' from the begin/end of all space-split words
words = [w.strip(",.!;:- \"'") for w in text.split()]
# add words to list in dict, automatically creates list if needed
# your code uses a set as well
for w in set(words):
    d[len(w)].append(w)

# output
for k in sorted(d):
    print(k, d[k])
Output:
1 ['A', 'a']
2 ['to', 'an', 'At', 'do', 'on', 'In', 'On', 'as', 'by', 'or', 'of', 'in', 'is']
3 ['use', 'the', 'one', 'and', 'few', 'not', 'EOF', 'may', 'any', 'for', 'are', 'two', 'end', 'new', 'old']
4 ['have', 'that', 'such', 'type', 'need', 'text', 'more', 'done', 'kind', 'Some', 'does', 'most', 'file', 'with', 'line', 'ways', 'keep', 'CP/M', 'name', 'will', 'Text', 'data', 'last', 'size']
5 ['track', 'those', 'bytes', 'fixed', 'known', 'where', 'which', 'there', 'while', 'There', 'lines', 'kinds', 'store', 'files', 'plain', 'after', 'level']
6 ['exists', 'modern', 'MS-DOS', 'system', 'within', 'refers', 'length', 'marker', 'stored', 'binary']
7 ['because', 'placing', 'content', 'Windows', 'padding', 'systems', 'records', 'contain', 'special', 'generic', 'denoted', 'spelled']
8 ['computer', 'sequence', 'textfile', 'variable']
9 ['Microsoft', 'depending', 'different', 'Unix-like', 'flatfile)', 'primarily', 'container', 'character', 'separated', 'operating']
10 ['delimiters', 'characters', 'electronic', '(sometimes', 'structured']
11 ['end-of-file', 'alternative', 'end-of-line', 'description']
17 ['record-orientated']
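Note that '(sometimes' and 'flatfile)' in the output above keep their parentheses because ( and ) are not in the strip set; if you want those removed as well, extending the set is a small change (a sketch):

words = [w.strip(",.!;:-()\"' ") for w in text.split()]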
I can only get the stop-word step to work: it reads the document and creates a new file with the stop words removed. I cannot get word_tokenize, PorterStemmer, or sent_tokenize to run.
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
file1 = open("data/hw1datasets/100554newsML.txt")
This is the part I cannot get to execute and write into the new txt file:
text = fileObj.read()
stokens = nltk.sent_tokenize(text)
wtokens = nltk.word_tokenize(text)
This part creates the new file:
line = file1.read()
words = line.split()
for r in words:
    if r not in stop_words:
        appendFile = open('h1doc1.txt', 'a')
        appendFile.write(" " + r)
Not entirely sure of your problem. I think your code is close; maybe some file input/output is your issue. You should not call open() in a loop, as it will repeatedly open the file. Just open it once, and be sure to .close() your file at the end.
import io
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

file1 = open(r"./100554newsML.txt")
text = file1.read()
stokens = nltk.sent_tokenize(text)
wtokens = nltk.word_tokenize(text)
words = text.split()

appendFile = open(r'h1doc1.txt', 'w+')
for r in wtokens:
    if r not in stop_words:
        appendFile.write(" " + r)
appendFile.close()
Using print on stokens & wtokens works fine.
Output of print(stokens):
['Channel tunnel operator Eurotunnel on Monday announced details of a deal giving bank creditors 45.5 percent of the company in return for wiping out 1.0 billion pounds ($1.6 billion) of its massive debts.', 'The long-awaited but highly complex restructuring of nearly nearly nine billion pounds of debt and unpaid interest throws the company a lifeline which could secure what is still likely to be a difficult future.', 'The deal, announced simultaneously in Paris and London, brings the company back from the brink of bankruptcy but leaves current shareholders, who have already seen their investment dwindle, owning only 54.5 percent of the company.', '"We have fixed and capped the interest payments and arranged only to pay what is available in cash," Eurotunnel co-chairman Alastair Morton told reporters at a news conference.', '"Avoiding having to do this again is the name of the game."', 'Morton said the plan provides the Anglo-French company with the medium term financial stability to consolidate its commercial position and develop its operations, adding that the firm was now making a profit before interest.', "Although shareholders will see their holdings diluted, they were offered the prospect of a brighter future and urged to be patient after months of uncertainty while Eurotunnel wrestled to reduce the crippling interest payments negotiated during the tunnel's construction.", 'Eurotunnel, which has taken around half of the market in the busiest cross-Channel route from the European ferry companies, said a strong operating performance could allow it to pay its first dividend within the next 10 years.', 'French co-chairman Patrick Ponsolle told reporters at a Paris news conference that the dividend could come as early as 2004 if the company performed "very well".', 'Eurotunnel and the banks have come up with an ingenious formula to help the company get over the early years of the deal when, despite the swaps of debt for equity and bonds, it will still not be able to afford the annual interest bill of 400 million pounds.', 'If its revenue, after costs and depreciation, is less than 400 million pounds, then the company will issue "Stabilisation notes" to a maximum of 1.85 billion pounds to the banks.', 'Eurotunnel would not pay interest on these notes (which would constitute a debt issue) for ten years.', "Analysts said that under the deal, Eurotunnel's ability to finance its debt would become sustainable, at least for a few years.", '"If you look at the current cash flow of between 150 and 200 million pounds a year, what they can\'t find (to meet the bill) they will roll forward into the stabilisation notes, and they can keep that going for seven, eight, nine years," said an analyst at one major investment bank.', '"So they are here for that time," he added.', 'The company said in a statement there was still considerable work to be done to finalise and agree the details of the plan before it can be submitted to shareholders and the bank group for approval, probably early in the Spring of 1997.', 'Eurotunnel said the debt-for-equity swap would be at 130 pence, or 10.40 francs, per share -- considerably below the level of 160 pence widely reported in the run up to the deal\nThe company said a further 3.7 billion pounds of debt would be converted into new financial instruments and existing shareholders would be able to participate in this issue.', "If they choose not to take up free warrants entitling them to subscribe to this, Eurotunnel said shareholders' interests may be reduced further to 
just over 39 percent of the company by the end of December 2003.", "Eurotunnel's shares, which were suspended last week at 113.5 pence ahead of Monday's announcement, will resume trading on Tuesday.", 'Shareholders and all 225 creditor banks have to agree the deal.', '"I\'m hopeful but I\'m not taking it (approval) for granted," Morton admitted, "Shareholders are pretty angry in France."', 'Asked what would happen if the banks reject the deal, Morton said, "Nobody wants a collapse, nobody wants a doomsday scenario."', '($1=.6393 Pound)']
Output of print(wtokens):
['ï', '»', '¿Channel', 'tunnel', 'operator', 'Eurotunnel', 'on', 'Monday', 'announced', 'details', 'of', 'a', 'deal', 'giving', 'bank', 'creditors', '45.5', 'percent', 'of', 'the', 'company', 'in', 'return', 'for', 'wiping', 'out', '1.0', 'billion', 'pounds', '(', '$', '1.6', 'billion', ')', 'of', 'its', 'massive', 'debts', '.', 'The', 'long-awaited', 'but', 'highly', 'complex', 'restructuring', 'of', 'nearly', 'nearly', 'nine', 'billion', 'pounds', 'of', 'debt', 'and', 'unpaid', 'interest', 'throws', 'the', 'company', 'a', 'lifeline', 'which', 'could', 'secure', 'what', 'is', 'still', 'likely', 'to', 'be', 'a', 'difficult', 'future', '.', 'The', 'deal', ',', 'announced', 'simultaneously', 'in', 'Paris', 'and', 'London', ',', 'brings', 'the', 'company', 'back', 'from', 'the', 'brink', 'of', 'bankruptcy', 'but', 'leaves', 'current', 'shareholders', ',', 'who', 'have', 'already', 'seen', 'their', 'investment', 'dwindle', ',', 'owning', 'only', '54.5', 'percent', 'of', 'the', 'company', '.', '``', 'We', 'have', 'fixed', 'and', 'capped', 'the', 'interest', 'payments', 'and', 'arranged', 'only', 'to', 'pay', 'what', 'is', 'available', 'in', 'cash', ',', "''", 'Eurotunnel', 'co-chairman', 'Alastair', 'Morton', 'told', 'reporters', 'at', 'a', 'news', 'conference', '.', '``', 'Avoiding', 'having', 'to', 'do', 'this', 'again', 'is', 'the', 'name', 'of', 'the', 'game', '.', "''", 'Morton', 'said', 'the', 'plan', 'provides', 'the', 'Anglo-French', 'company', 'with', 'the', 'medium', 'term', 'financial', 'stability', 'to', 'consolidate', 'its', 'commercial', 'position', 'and', 'develop', 'its', 'operations', ',', 'adding', 'that', 'the', 'firm', 'was', 'now', 'making', 'a', 'profit', 'before', 'interest', '.', 'Although', 'shareholders', 'will', 'see', 'their', 'holdings', 'diluted', ',', 'they', 'were', 'offered', 'the', 'prospect', 'of', 'a', 'brighter', 'future', 'and', 'urged', 'to', 'be', 'patient', 'after', 'months', 'of', 'uncertainty', 'while', 'Eurotunnel', 'wrestled', 'to', 'reduce', 'the', 'crippling', 'interest', 'payments', 'negotiated', 'during', 'the', 'tunnel', "'s", 'construction', '.', 'Eurotunnel', ',', 'which', 'has', 'taken', 'around', 'half', 'of', 'the', 'market', 'in', 'the', 'busiest', 'cross-Channel', 'route', 'from', 'the', 'European', 'ferry', 'companies', ',', 'said', 'a', 'strong', 'operating', 'performance', 'could', 'allow', 'it', 'to', 'pay', 'its', 'first', 'dividend', 'within', 'the', 'next', '10', 'years', '.', 'French', 'co-chairman', 'Patrick', 'Ponsolle', 'told', 'reporters', 'at', 'a', 'Paris', 'news', 'conference', 'that', 'the', 'dividend', 'could', 'come', 'as', 'early', 'as', '2004', 'if', 'the', 'company', 'performed', '``', 'very', 'well', "''", '.', 'Eurotunnel', 'and', 'the', 'banks', 'have', 'come', 'up', 'with', 'an', 'ingenious', 'formula', 'to', 'help', 'the', 'company', 'get', 'over', 'the', 'early', 'years', 'of', 'the', 'deal', 'when', ',', 'despite', 'the', 'swaps', 'of', 'debt', 'for', 'equity', 'and', 'bonds', ',', 'it', 'will', 'still', 'not', 'be', 'able', 'to', 'afford', 'the', 'annual', 'interest', 'bill', 'of', '400', 'million', 'pounds', '.', 'If', 'its', 'revenue', ',', 'after', 'costs', 'and', 'depreciation', ',', 'is', 'less', 'than', '400', 'million', 'pounds', ',', 'then', 'the', 'company', 'will', 'issue', '``', 'Stabilisation', 'notes', "''", 'to', 'a', 'maximum', 'of', '1.85', 'billion', 'pounds', 'to', 'the', 'banks', '.', 'Eurotunnel', 'would', 'not', 'pay', 'interest', 'on', 'these', 'notes', '(', 'which', 'would', 
'constitute', 'a', 'debt', 'issue', ')', 'for', 'ten', 'years', '.', 'Analysts', 'said', 'that', 'under', 'the', 'deal', ',', 'Eurotunnel', "'s", 'ability', 'to', 'finance', 'its', 'debt', 'would', 'become', 'sustainable', ',', 'at', 'least', 'for', 'a', 'few', 'years', '.', '``', 'If', 'you', 'look', 'at', 'the', 'current', 'cash', 'flow', 'of', 'between', '150', 'and', '200', 'million', 'pounds', 'a', 'year', ',', 'what', 'they', 'ca', "n't", 'find', '(', 'to', 'meet', 'the', 'bill', ')', 'they', 'will', 'roll', 'forward', 'into', 'the', 'stabilisation', 'notes', ',', 'and', 'they', 'can', 'keep', 'that', 'going', 'for', 'seven', ',', 'eight', ',', 'nine', 'years', ',', "''", 'said', 'an', 'analyst', 'at', 'one', 'major', 'investment', 'bank', '.', '``', 'So', 'they', 'are', 'here', 'for', 'that', 'time', ',', "''", 'he', 'added', '.', 'The', 'company', 'said', 'in', 'a', 'statement', 'there', 'was', 'still', 'considerable', 'work', 'to', 'be', 'done', 'to', 'finalise', 'and', 'agree', 'the', 'details', 'of', 'the', 'plan', 'before', 'it', 'can', 'be', 'submitted', 'to', 'shareholders', 'and', 'the', 'bank', 'group', 'for', 'approval', ',', 'probably', 'early', 'in', 'the', 'Spring', 'of', '1997', '.', 'Eurotunnel', 'said', 'the', 'debt-for-equity', 'swap', 'would', 'be', 'at', '130', 'pence', ',', 'or', '10.40', 'francs', ',', 'per', 'share', '--', 'considerably', 'below', 'the', 'level', 'of', '160', 'pence', 'widely', 'reported', 'in', 'the', 'run', 'up', 'to', 'the', 'deal', 'The', 'company', 'said', 'a', 'further', '3.7', 'billion', 'pounds', 'of', 'debt', 'would', 'be', 'converted', 'into', 'new', 'financial', 'instruments', 'and', 'existing', 'shareholders', 'would', 'be', 'able', 'to', 'participate', 'in', 'this', 'issue', '.', 'If', 'they', 'choose', 'not', 'to', 'take', 'up', 'free', 'warrants', 'entitling', 'them', 'to', 'subscribe', 'to', 'this', ',', 'Eurotunnel', 'said', 'shareholders', "'", 'interests', 'may', 'be', 'reduced', 'further', 'to', 'just', 'over', '39', 'percent', 'of', 'the', 'company', 'by', 'the', 'end', 'of', 'December', '2003', '.', 'Eurotunnel', "'s", 'shares', ',', 'which', 'were', 'suspended', 'last', 'week', 'at', '113.5', 'pence', 'ahead', 'of', 'Monday', "'s", 'announcement', ',', 'will', 'resume', 'trading', 'on', 'Tuesday', '.', 'Shareholders', 'and', 'all', '225', 'creditor', 'banks', 'have', 'to', 'agree', 'the', 'deal', '.', '``', 'I', "'m", 'hopeful', 'but', 'I', "'m", 'not', 'taking', 'it', '(', 'approval', ')', 'for', 'granted', ',', "''", 'Morton', 'admitted', ',', '``', 'Shareholders', 'are', 'pretty', 'angry', 'in', 'France', '.', "''", 'Asked', 'what', 'would', 'happen', 'if', 'the', 'banks', 'reject', 'the', 'deal', ',', 'Morton', 'said', ',', '``', 'Nobody', 'wants', 'a', 'collapse', ',', 'nobody', 'wants', 'a', 'doomsday', 'scenario', '.', "''", '(', '$', '1=.6393', 'Pound', ')']
Output of h1doc1.txt
ï » ¿Channel tunnel operator Eurotunnel Monday announced details deal giving bank creditors 45.5 percent company return wiping 1.0 billion pounds ( $ 1.6 billion ) massive debts . The long-awaited highly complex restructuring nearly nearly nine billion pounds debt unpaid interest throws company lifeline could secure still likely difficult future . The deal , announced simultaneously Paris London , brings company back brink bankruptcy leaves current shareholders , already seen investment dwindle , owning 54.5 percent company . `` We fixed capped interest payments arranged pay available cash , '' Eurotunnel co-chairman Alastair Morton told reporters news conference . `` Avoiding name game . '' Morton said plan provides Anglo-French company medium term financial stability consolidate commercial position develop operations , adding firm making profit interest . Although shareholders see holdings diluted , offered prospect brighter future urged patient months uncertainty Eurotunnel wrestled reduce crippling interest payments negotiated tunnel 's construction . Eurotunnel , taken around half market busiest cross-Channel route European ferry companies , said strong operating performance could allow pay first dividend within next 10 years . French co-chairman Patrick Ponsolle told reporters Paris news conference dividend could come early 2004 company performed `` well '' . Eurotunnel banks come ingenious formula help company get early years deal , despite swaps debt equity bonds , still able afford annual interest bill 400 million pounds . If revenue , costs depreciation , less 400 million pounds , company issue `` Stabilisation notes '' maximum 1.85 billion pounds banks . Eurotunnel would pay interest notes ( would constitute debt issue ) ten years . Analysts said deal , Eurotunnel 's ability finance debt would become sustainable , least years . `` If look current cash flow 150 200 million pounds year , ca n't find ( meet bill ) roll forward stabilisation notes , keep going seven , eight , nine years , '' said analyst one major investment bank . `` So time , '' added . The company said statement still considerable work done finalise agree details plan submitted shareholders bank group approval , probably early Spring 1997 . Eurotunnel said debt-for-equity swap would 130 pence , 10.40 francs , per share -- considerably level 160 pence widely reported run deal The company said 3.7 billion pounds debt would converted new financial instruments existing shareholders would able participate issue . If choose take free warrants entitling subscribe , Eurotunnel said shareholders ' interests may reduced 39 percent company end December 2003 . Eurotunnel 's shares , suspended last week 113.5 pence ahead Monday 's announcement , resume trading Tuesday . Shareholders 225 creditor banks agree deal . `` I 'm hopeful I 'm taking ( approval ) granted , '' Morton admitted , `` Shareholders pretty angry France . '' Asked would happen banks reject deal , Morton said , `` Nobody wants collapse , nobody wants doomsday scenario . '' ( $ 1=.6393 Pound )
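As an aside, a with block closes the files for you automatically; a minimal sketch of the same write step, assuming the same file names:

import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
with open(r"./100554newsML.txt") as src, open(r'h1doc1.txt', 'w+') as out:
    wtokens = nltk.word_tokenize(src.read())
    out.write(" ".join(w for w in wtokens if w not in stop_words))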
There's a nifty command-line tool from https://github.com/nltk/nltk/blob/develop/nltk/cli.py
Install NLTK with CLI:
pip install -U nltk[cli]
To use, in terminal / command prompt, call nltk tokenize:
$ nltk tokenize --help
Usage: nltk tokenize [OPTIONS]
This command tokenizes text stream using nltk.word_tokenize
Options:
-l, --language TEXT The language for the Punkt sentence tokenization.
-l, --preserve-line An option to preserve the sentence and not
sentence tokenize it.
-j, --processes INTEGER No. of processes.
-e, --encoding TEXT Specify encoding of file.
-d, --delimiter TEXT Specify delimiter to join the tokens.
-h, --help Show this message and exit.
Example usage:
nltk tokenize -l en -j 4 --preserve-line -d " " -e utf8 < 100554newsML.txt > h1doc1.txt
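If you'd rather stay in Python, a rough single-process equivalent of that command (my sketch, without the -j parallelism) is:

from nltk.tokenize import word_tokenize

with open("100554newsML.txt", encoding="utf8") as src, open("h1doc1.txt", "w", encoding="utf8") as out:
    for line in src:
        # preserve_line=True mirrors --preserve-line: tokenize words without sentence-splitting
        out.write(" ".join(word_tokenize(line, preserve_line=True)) + "\n")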
So, I have a keyword list in lowercase. Let's say
keywords = ['machine learning', 'data science', 'artificial intelligence']
and a list of texts in lowercase. Let's say
texts = [
'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]
I need to transform the texts into:
[[['the', 'new',
'machine_learning',
'model',
'built',
'by',
'google',
'is',
'revolutionary',
'for',
'the',
'current',
'state',
'of',
'artificial_intelligence'],
['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']],
[['data_science',
'and',
'artificial_intelligence',
'are',
'two',
'different',
'fields',
'although',
'they',
'are',
'interconnected'],
['scientists',
'from',
'harvard',
'are',
'explaining',
'it',
'in',
'a',
'detailed',
'presentation',
'that',
'could',
'be',
'found',
'on',
'our',
'page']]]
What I do right now is check whether each keyword is in a text and, if so, replace it with the keyword with its spaces replaced by _. But this is of complexity m*n, and it is really slow when you have 700 long texts and 2M keywords, as in my case.
I was trying to use Phraser, but I can't manage to build one with only my keywords.
Could someone suggest a more optimized way of doing it?
The Phrases/Phraser classes of gensim are designed to use their internal, statistically-derived records of what word pairs should be promoted to phrases – not user-supplied pairings. (You could probably poke & prod a Phraser to do what you want, by synthesizing scores/thresholds, but that would be somewhat awkward & kludgey.)
You could mimic their general approach: (1) operate on lists-of-tokens rather than raw strings; (2) learn & remember token-pairs that should be combined; & (3) perform combination in a single pass. That should work far more efficiently than anything based on doing repeated search-and-replace on a string – which it sounds like you've already tried and found wanting.
For example, let's first create a dictionary, where the keys are tuples of word-pairs that should be combined, and the values are tuples that include both their designated combination-token, and a 2nd item that's just an empty-tuple. (The reason for this will become clear later.)
keywords = ['machine learning', 'data science', 'artificial intelligence']
texts = [
'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]
combinations_dict = {tuple(kwsplit): ('_'.join(kwsplit), ())
                     for kwsplit in [kwstr.split() for kwstr in keywords]}
combinations_dict
After this step, combinations_dict is:
{('machine', 'learning'): ('machine_learning', ()),
('data', 'science'): ('data_science', ()),
('artificial', 'intelligence'): ('artificial_intelligence', ())}
Now, we can use a Python generator function to create an iterable transformation of any other sequence-of-tokens, that takes original tokens one-by-one – but before emitting any, adds the next to a buffered candidate pair-of-tokens. If that pair is one that should be combined, a single combined token is yielded – but if not, just the 1st token is emitted, leaving the 2nd to be combined with the next token in a new candidate pair.
For example:
def combining_generator(tokens, comb_dict):
    buff = ()  # start with empty buffer
    for in_tok in tokens:
        buff += (in_tok,)  # add latest to buffer
        if len(buff) < 2:  # grow buffer to 2 tokens if possible
            continue
        # look up what to do for the current pair...
        # ...defaulting to emit-[0]-item, keep-[1]-item in new buff
        out_tok, buff = comb_dict.get(buff, (buff[0], (buff[1],)))
        yield out_tok
    if buff:
        yield buff[0]  # last solo token if any
Here we see the reason for the earlier () empty-tuples: that's the preferred state of the buff after a successful replacement. And driving the result & next-state this way helps us use the form of dict.get(key, default) that supplies a specific value to be used if the key isn't found.
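A toy trace of that dict.get pattern, using one of the pairs above:

# pair found: emit the combined token, reset buffer to ()
out_tok, buff = combinations_dict.get(('machine', 'learning'), ('machine', ('learning',)))
# out_tok == 'machine_learning', buff == ()

# pair not found: emit the 1st token, keep the 2nd buffered
out_tok, buff = combinations_dict.get(('machine', 'model'), ('machine', ('model',)))
# out_tok == 'machine', buff == ('model',)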
Now designated combinations can be applied via:
tokenized_texts = [text.split() for text in texts]
retokenized_texts = [list(combining_generator(tokens, combinations_dict)) for tokens in tokenized_texts]
retokenized_texts
...which reports retokenized_texts as:
[
['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial', 'intelligence.', 'it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking'],
['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields,', 'although', 'they', 'are', 'interconnected.', 'scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page.']
]
Note that the tokens ('artificial', 'intelligence.') aren't combined here, as the dirt-simple .split() tokenization used has left the punctuation attached, preventing an exact match to the rule.
Real projects will want to use a more sophisticated tokenization that might either strip the punctuation, retain punctuation as tokens, or do other preprocessing, and as a result would properly pass 'artificial' as a token without the attached '.'. For example, a simple tokenization that just retains runs of word characters, discarding punctuation, would be:
import re
tokenized_texts = [re.findall(r'\w+', text) for text in texts]
tokenized_texts
Another that also keeps any stray non-word/non-space characters (punctuation) as standalone tokens would be:
tokenized_texts = [re.findall(r'\w+|(?:[^\w\s])', text) for text in texts]
tokenized_texts
Either of these alternatives to a simple .split() would ensure your 1st text presents the necessary ('artificial', 'intelligence') pair for combination.
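As a quick check, reusing combinations_dict and combining_generator from above on the regex-tokenized texts:

import re
tokenized_texts = [re.findall(r'\w+', text) for text in texts]
retokenized_texts = [list(combining_generator(tokens, combinations_dict)) for tokens in tokenized_texts]
# the 1st list now contains ..., 'state', 'of', 'artificial_intelligence', 'it', 'may', ...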
This is probably not the most Pythonic way to do it, but it works, in 3 steps.
keywords = ['machine learning', 'data science', 'artificial intelligence']
texts = ['the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking', 'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.']
import re

# Add underscore
for idx, text in enumerate(texts):
    for keyword in keywords:
        reload_text = texts[idx]
        if keyword in text:
            texts[idx] = reload_text.replace(keyword, keyword.replace(" ", "_"))

# Split text at each "." encountered
for idx, text in enumerate(texts):
    texts[idx] = list(filter(None, text.split(".")))
print(texts)

# Split text to get each word
for idx, text in enumerate(texts):
    for idx_s, sentence in enumerate(text):
        texts[idx][idx_s] = list(map(lambda x: re.sub(r"[,\.!?]", "", x), sentence.split()))  # map to delete every undesired character
print(texts)
Output:
[
[
['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial_intelligence'],
['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']
],
[
['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields', 'although', 'they', 'are', 'interconnected'],
['scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page']
]
]