How to tokenize words and input them into another file? - python

I can only get the stop-word removal to work: the script reads the document and creates a new file with the stop words removed. I cannot get word_tokenize, PorterStemmer, or sent_tokenize to run.
import io
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
file1 = open("data/hw1datasets/100554newsML.txt")
This is the part I cannot get to run (the tokens never make it into the new txt file):
text = fileObj.read()
stokens = nltk.sent_tokenize(text)
wtokens = nltk.word_tokenize(text)
This part creates the new file:
line = file1.read()
words = line.split()
for r in words:
    if r not in stop_words:
        appendFile = open('h1doc1.txt', 'a')
        appendFile.write(" " + r)

Not entirely sure of your problem. I think your code is close, but maybe file input/output is your issue. You should not call open() inside a loop, as that will repeatedly reopen the file. Open it once, and be sure to close() your file at the end.
import io
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

file1 = open(r"./100554newsML.txt")
text = file1.read()
file1.close()

stokens = nltk.sent_tokenize(text)
wtokens = nltk.word_tokenize(text)

# open the output file once, write every non-stop-word token, then close it
appendFile = open(r'h1doc1.txt', 'w+')
for r in wtokens:
    if r not in stop_words:
        appendFile.write(" " + r)
appendFile.close()
Using print on stokens & wtokens works fine.
Output of print(stokens):
['Channel tunnel operator Eurotunnel on Monday announced details of a deal giving bank creditors 45.5 percent of the company in return for wiping out 1.0 billion pounds ($1.6 billion) of its massive debts.', 'The long-awaited but highly complex restructuring of nearly nearly nine billion pounds of debt and unpaid interest throws the company a lifeline which could secure what is still likely to be a difficult future.', 'The deal, announced simultaneously in Paris and London, brings the company back from the brink of bankruptcy but leaves current shareholders, who have already seen their investment dwindle, owning only 54.5 percent of the company.', '"We have fixed and capped the interest payments and arranged only to pay what is available in cash," Eurotunnel co-chairman Alastair Morton told reporters at a news conference.', '"Avoiding having to do this again is the name of the game."', 'Morton said the plan provides the Anglo-French company with the medium term financial stability to consolidate its commercial position and develop its operations, adding that the firm was now making a profit before interest.', "Although shareholders will see their holdings diluted, they were offered the prospect of a brighter future and urged to be patient after months of uncertainty while Eurotunnel wrestled to reduce the crippling interest payments negotiated during the tunnel's construction.", 'Eurotunnel, which has taken around half of the market in the busiest cross-Channel route from the European ferry companies, said a strong operating performance could allow it to pay its first dividend within the next 10 years.', 'French co-chairman Patrick Ponsolle told reporters at a Paris news conference that the dividend could come as early as 2004 if the company performed "very well".', 'Eurotunnel and the banks have come up with an ingenious formula to help the company get over the early years of the deal when, despite the swaps of debt for equity and bonds, it will still not be able to afford the annual interest bill of 400 million pounds.', 'If its revenue, after costs and depreciation, is less than 400 million pounds, then the company will issue "Stabilisation notes" to a maximum of 1.85 billion pounds to the banks.', 'Eurotunnel would not pay interest on these notes (which would constitute a debt issue) for ten years.', "Analysts said that under the deal, Eurotunnel's ability to finance its debt would become sustainable, at least for a few years.", '"If you look at the current cash flow of between 150 and 200 million pounds a year, what they can\'t find (to meet the bill) they will roll forward into the stabilisation notes, and they can keep that going for seven, eight, nine years," said an analyst at one major investment bank.', '"So they are here for that time," he added.', 'The company said in a statement there was still considerable work to be done to finalise and agree the details of the plan before it can be submitted to shareholders and the bank group for approval, probably early in the Spring of 1997.', 'Eurotunnel said the debt-for-equity swap would be at 130 pence, or 10.40 francs, per share -- considerably below the level of 160 pence widely reported in the run up to the deal\nThe company said a further 3.7 billion pounds of debt would be converted into new financial instruments and existing shareholders would be able to participate in this issue.', "If they choose not to take up free warrants entitling them to subscribe to this, Eurotunnel said shareholders' interests may be reduced further to 
just over 39 percent of the company by the end of December 2003.", "Eurotunnel's shares, which were suspended last week at 113.5 pence ahead of Monday's announcement, will resume trading on Tuesday.", 'Shareholders and all 225 creditor banks have to agree the deal.', '"I\'m hopeful but I\'m not taking it (approval) for granted," Morton admitted, "Shareholders are pretty angry in France."', 'Asked what would happen if the banks reject the deal, Morton said, "Nobody wants a collapse, nobody wants a doomsday scenario."', '($1=.6393 Pound)']
Output of print(wtokens):
['ï', '»', '¿Channel', 'tunnel', 'operator', 'Eurotunnel', 'on', 'Monday', 'announced', 'details', 'of', 'a', 'deal', 'giving', 'bank', 'creditors', '45.5', 'percent', 'of', 'the', 'company', 'in', 'return', 'for', 'wiping', 'out', '1.0', 'billion', 'pounds', '(', '$', '1.6', 'billion', ')', 'of', 'its', 'massive', 'debts', '.', 'The', 'long-awaited', 'but', 'highly', 'complex', 'restructuring', 'of', 'nearly', 'nearly', 'nine', 'billion', 'pounds', 'of', 'debt', 'and', 'unpaid', 'interest', 'throws', 'the', 'company', 'a', 'lifeline', 'which', 'could', 'secure', 'what', 'is', 'still', 'likely', 'to', 'be', 'a', 'difficult', 'future', '.', 'The', 'deal', ',', 'announced', 'simultaneously', 'in', 'Paris', 'and', 'London', ',', 'brings', 'the', 'company', 'back', 'from', 'the', 'brink', 'of', 'bankruptcy', 'but', 'leaves', 'current', 'shareholders', ',', 'who', 'have', 'already', 'seen', 'their', 'investment', 'dwindle', ',', 'owning', 'only', '54.5', 'percent', 'of', 'the', 'company', '.', '``', 'We', 'have', 'fixed', 'and', 'capped', 'the', 'interest', 'payments', 'and', 'arranged', 'only', 'to', 'pay', 'what', 'is', 'available', 'in', 'cash', ',', "''", 'Eurotunnel', 'co-chairman', 'Alastair', 'Morton', 'told', 'reporters', 'at', 'a', 'news', 'conference', '.', '``', 'Avoiding', 'having', 'to', 'do', 'this', 'again', 'is', 'the', 'name', 'of', 'the', 'game', '.', "''", 'Morton', 'said', 'the', 'plan', 'provides', 'the', 'Anglo-French', 'company', 'with', 'the', 'medium', 'term', 'financial', 'stability', 'to', 'consolidate', 'its', 'commercial', 'position', 'and', 'develop', 'its', 'operations', ',', 'adding', 'that', 'the', 'firm', 'was', 'now', 'making', 'a', 'profit', 'before', 'interest', '.', 'Although', 'shareholders', 'will', 'see', 'their', 'holdings', 'diluted', ',', 'they', 'were', 'offered', 'the', 'prospect', 'of', 'a', 'brighter', 'future', 'and', 'urged', 'to', 'be', 'patient', 'after', 'months', 'of', 'uncertainty', 'while', 'Eurotunnel', 'wrestled', 'to', 'reduce', 'the', 'crippling', 'interest', 'payments', 'negotiated', 'during', 'the', 'tunnel', "'s", 'construction', '.', 'Eurotunnel', ',', 'which', 'has', 'taken', 'around', 'half', 'of', 'the', 'market', 'in', 'the', 'busiest', 'cross-Channel', 'route', 'from', 'the', 'European', 'ferry', 'companies', ',', 'said', 'a', 'strong', 'operating', 'performance', 'could', 'allow', 'it', 'to', 'pay', 'its', 'first', 'dividend', 'within', 'the', 'next', '10', 'years', '.', 'French', 'co-chairman', 'Patrick', 'Ponsolle', 'told', 'reporters', 'at', 'a', 'Paris', 'news', 'conference', 'that', 'the', 'dividend', 'could', 'come', 'as', 'early', 'as', '2004', 'if', 'the', 'company', 'performed', '``', 'very', 'well', "''", '.', 'Eurotunnel', 'and', 'the', 'banks', 'have', 'come', 'up', 'with', 'an', 'ingenious', 'formula', 'to', 'help', 'the', 'company', 'get', 'over', 'the', 'early', 'years', 'of', 'the', 'deal', 'when', ',', 'despite', 'the', 'swaps', 'of', 'debt', 'for', 'equity', 'and', 'bonds', ',', 'it', 'will', 'still', 'not', 'be', 'able', 'to', 'afford', 'the', 'annual', 'interest', 'bill', 'of', '400', 'million', 'pounds', '.', 'If', 'its', 'revenue', ',', 'after', 'costs', 'and', 'depreciation', ',', 'is', 'less', 'than', '400', 'million', 'pounds', ',', 'then', 'the', 'company', 'will', 'issue', '``', 'Stabilisation', 'notes', "''", 'to', 'a', 'maximum', 'of', '1.85', 'billion', 'pounds', 'to', 'the', 'banks', '.', 'Eurotunnel', 'would', 'not', 'pay', 'interest', 'on', 'these', 'notes', '(', 'which', 'would', 
'constitute', 'a', 'debt', 'issue', ')', 'for', 'ten', 'years', '.', 'Analysts', 'said', 'that', 'under', 'the', 'deal', ',', 'Eurotunnel', "'s", 'ability', 'to', 'finance', 'its', 'debt', 'would', 'become', 'sustainable', ',', 'at', 'least', 'for', 'a', 'few', 'years', '.', '``', 'If', 'you', 'look', 'at', 'the', 'current', 'cash', 'flow', 'of', 'between', '150', 'and', '200', 'million', 'pounds', 'a', 'year', ',', 'what', 'they', 'ca', "n't", 'find', '(', 'to', 'meet', 'the', 'bill', ')', 'they', 'will', 'roll', 'forward', 'into', 'the', 'stabilisation', 'notes', ',', 'and', 'they', 'can', 'keep', 'that', 'going', 'for', 'seven', ',', 'eight', ',', 'nine', 'years', ',', "''", 'said', 'an', 'analyst', 'at', 'one', 'major', 'investment', 'bank', '.', '``', 'So', 'they', 'are', 'here', 'for', 'that', 'time', ',', "''", 'he', 'added', '.', 'The', 'company', 'said', 'in', 'a', 'statement', 'there', 'was', 'still', 'considerable', 'work', 'to', 'be', 'done', 'to', 'finalise', 'and', 'agree', 'the', 'details', 'of', 'the', 'plan', 'before', 'it', 'can', 'be', 'submitted', 'to', 'shareholders', 'and', 'the', 'bank', 'group', 'for', 'approval', ',', 'probably', 'early', 'in', 'the', 'Spring', 'of', '1997', '.', 'Eurotunnel', 'said', 'the', 'debt-for-equity', 'swap', 'would', 'be', 'at', '130', 'pence', ',', 'or', '10.40', 'francs', ',', 'per', 'share', '--', 'considerably', 'below', 'the', 'level', 'of', '160', 'pence', 'widely', 'reported', 'in', 'the', 'run', 'up', 'to', 'the', 'deal', 'The', 'company', 'said', 'a', 'further', '3.7', 'billion', 'pounds', 'of', 'debt', 'would', 'be', 'converted', 'into', 'new', 'financial', 'instruments', 'and', 'existing', 'shareholders', 'would', 'be', 'able', 'to', 'participate', 'in', 'this', 'issue', '.', 'If', 'they', 'choose', 'not', 'to', 'take', 'up', 'free', 'warrants', 'entitling', 'them', 'to', 'subscribe', 'to', 'this', ',', 'Eurotunnel', 'said', 'shareholders', "'", 'interests', 'may', 'be', 'reduced', 'further', 'to', 'just', 'over', '39', 'percent', 'of', 'the', 'company', 'by', 'the', 'end', 'of', 'December', '2003', '.', 'Eurotunnel', "'s", 'shares', ',', 'which', 'were', 'suspended', 'last', 'week', 'at', '113.5', 'pence', 'ahead', 'of', 'Monday', "'s", 'announcement', ',', 'will', 'resume', 'trading', 'on', 'Tuesday', '.', 'Shareholders', 'and', 'all', '225', 'creditor', 'banks', 'have', 'to', 'agree', 'the', 'deal', '.', '``', 'I', "'m", 'hopeful', 'but', 'I', "'m", 'not', 'taking', 'it', '(', 'approval', ')', 'for', 'granted', ',', "''", 'Morton', 'admitted', ',', '``', 'Shareholders', 'are', 'pretty', 'angry', 'in', 'France', '.', "''", 'Asked', 'what', 'would', 'happen', 'if', 'the', 'banks', 'reject', 'the', 'deal', ',', 'Morton', 'said', ',', '``', 'Nobody', 'wants', 'a', 'collapse', ',', 'nobody', 'wants', 'a', 'doomsday', 'scenario', '.', "''", '(', '$', '1=.6393', 'Pound', ')']
Output of h1doc1.txt:
ï » ¿Channel tunnel operator Eurotunnel Monday announced details deal giving bank creditors 45.5 percent company return wiping 1.0 billion pounds ( $ 1.6 billion ) massive debts . The long-awaited highly complex restructuring nearly nearly nine billion pounds debt unpaid interest throws company lifeline could secure still likely difficult future . The deal , announced simultaneously Paris London , brings company back brink bankruptcy leaves current shareholders , already seen investment dwindle , owning 54.5 percent company . `` We fixed capped interest payments arranged pay available cash , '' Eurotunnel co-chairman Alastair Morton told reporters news conference . `` Avoiding name game . '' Morton said plan provides Anglo-French company medium term financial stability consolidate commercial position develop operations , adding firm making profit interest . Although shareholders see holdings diluted , offered prospect brighter future urged patient months uncertainty Eurotunnel wrestled reduce crippling interest payments negotiated tunnel 's construction . Eurotunnel , taken around half market busiest cross-Channel route European ferry companies , said strong operating performance could allow pay first dividend within next 10 years . French co-chairman Patrick Ponsolle told reporters Paris news conference dividend could come early 2004 company performed `` well '' . Eurotunnel banks come ingenious formula help company get early years deal , despite swaps debt equity bonds , still able afford annual interest bill 400 million pounds . If revenue , costs depreciation , less 400 million pounds , company issue `` Stabilisation notes '' maximum 1.85 billion pounds banks . Eurotunnel would pay interest notes ( would constitute debt issue ) ten years . Analysts said deal , Eurotunnel 's ability finance debt would become sustainable , least years . `` If look current cash flow 150 200 million pounds year , ca n't find ( meet bill ) roll forward stabilisation notes , keep going seven , eight , nine years , '' said analyst one major investment bank . `` So time , '' added . The company said statement still considerable work done finalise agree details plan submitted shareholders bank group approval , probably early Spring 1997 . Eurotunnel said debt-for-equity swap would 130 pence , 10.40 francs , per share -- considerably level 160 pence widely reported run deal The company said 3.7 billion pounds debt would converted new financial instruments existing shareholders would able participate issue . If choose take free warrants entitling subscribe , Eurotunnel said shareholders ' interests may reduced 39 percent company end December 2003 . Eurotunnel 's shares , suspended last week 113.5 pence ahead Monday 's announcement , resume trading Tuesday . Shareholders 225 creditor banks agree deal . `` I 'm hopeful I 'm taking ( approval ) granted , '' Morton admitted , `` Shareholders pretty angry France . '' Asked would happen banks reject deal , Morton said , `` Nobody wants collapse , nobody wants doomsday scenario . '' ( $ 1=.6393 Pound )
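The question also defines a PorterStemmer (ps) that the code above never uses. A minimal sketch of stemming the filtered tokens, reusing the wtokens and stop_words from above (the second output file name here is made up):

stemmed = [ps.stem(r) for r in wtokens if r not in stop_words]
with open('h1doc1_stemmed.txt', 'w') as out:  # hypothetical second output file
    out.write(" ".join(stemmed))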

There's a nifty command line tool from https://github.com/nltk/nltk/blob/develop/nltk/cli.py
Install NLTK with CLI:
pip install -U nltk[cli]
To use, in terminal / command prompt, call nltk tokenize:
$ nltk tokenize --help
Usage: nltk tokenize [OPTIONS]

  This command tokenizes text stream using nltk.word_tokenize

Options:
  -l, --language TEXT      The language for the Punkt sentence tokenization.
  -l, --preserve-line      An option to keep the preserve the sentence and not
                           sentence tokenize it.
  -j, --processes INTEGER  No. of processes.
  -e, --encoding TEXT      Specify encoding of file.
  -d, --delimiter TEXT     Specify delimiter to join the tokens.
  -h, --help               Show this message and exit.
Example usage:
nltk tokenize -l en -j 4 --preserve-line -d " " -e utf8 < 100554newsML.txt > h1doc1.txt
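If you would rather stay inside Python, a rough equivalent of that pipeline uses word_tokenize's preserve_line flag (available in recent NLTK versions); a sketch with the file names from the question:

import nltk

with open('100554newsML.txt', encoding='utf8') as src, \
        open('h1doc1.txt', 'w', encoding='utf8') as dst:
    for line in src:
        # preserve_line=True skips sentence tokenization, like the CLI flag
        dst.write(" ".join(nltk.word_tokenize(line, preserve_line=True)) + "\n")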

Related

Tokenize text containing digits

I want to create a text classifier. The input to the model contains digits along with the text, and those digits carry important information (I don't think I can just throw them away). Is there a way to tokenize this kind of input?
The input looks like this:
input:
-------
Please have a look at case#345
injector 1 and injector 3 is not responding for model 8
Car has been running for 2345 km, try to do this procedure
.....
.....
This helps:
from keras.preprocessing.text import text_to_word_sequence
text = 'Please have a look at case#345. injector1 and injector3 is not responding for model8. Car has been running for 2345 km, try to do this procedure .'
# tokenize the document
result = text_to_word_sequence(text)
print(result)
Output:
['please', 'have', 'a', 'look', 'at', 'case', '345', 'injector1', 'and', 'injector3', 'is', 'not', 'responding', 'for', 'model8', 'car', 'has', 'been', 'running', 'for', '2345', 'km', 'try', 'to', 'do', 'this', 'procedure']
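Note that case#345 splits into two tokens because the default filters argument of text_to_word_sequence strips '#'. If such references must stay intact, pass a filters string without the characters you want to keep; a sketch with '#' dropped from the documented defaults:

from keras.preprocessing.text import text_to_word_sequence

keep_hash = '!"$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'  # default filters minus '#'
print(text_to_word_sequence('Please have a look at case#345', filters=keep_hash))
# ['please', 'have', 'a', 'look', 'at', 'case#345']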

Str into Dict, len of each str as k and list of words with len as v [duplicate]

This question already has answers here:
python create dict using list of strings with length of strings as values
(6 answers)
Closed 2 years ago.
I have a string here:
str_files_txt = "A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. There are for most text files a need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records.
'Text file' refers to a type of container, while plain text refers to a type of content.
At a generic level of description, there are two kinds of computer files: text files and binary files"
I am supposed to create a dictionary where the keys are word lengths and the values are lists of all the words with that length.
This is what I have tried. It works, but I'm not sure how to do this more efficiently with a loop; can anyone please share a better approach?
files_dict_values = {}
files_list = list(set(str_files_txt.split()))
values_1 = []
values_2 = []
values_3 = []
values_4 = []
values_5 = []
values_6 = []
values_7 = []
values_8 = []
values_9 = []
values_10 = []
values_11 = []
for ele in files_list:
    if len(ele) == 1:
        values_1.append(ele)
        files_dict_values.update({len(ele): values_1})
    elif len(ele) == 2:
        values_2.append(ele)
        files_dict_values.update({len(ele): values_2})
    elif len(ele) == 3:
        values_3.append(ele)
        files_dict_values.update({len(ele): values_3})
    elif len(ele) == 4:
        values_4.append(ele)
        files_dict_values.update({len(ele): values_4})
    elif len(ele) == 5:
        values_5.append(ele)
        files_dict_values.update({len(ele): values_5})
    elif len(ele) == 6:
        values_6.append(ele)
        files_dict_values.update({len(ele): values_6})
    elif len(ele) == 7:
        values_7.append(ele)
        files_dict_values.update({len(ele): values_7})
    elif len(ele) == 8:
        values_8.append(ele)
        files_dict_values.update({len(ele): values_8})
    elif len(ele) == 9:
        values_9.append(ele)
        files_dict_values.update({len(ele): values_9})
    elif len(ele) == 10:
        values_10.append(ele)
        files_dict_values.update({len(ele): values_10})
print(files_dict_values)
Here is the output I got:
{6: ['modern', 'bytes,', 'stored', 'within', 'exists', 'bytes.', 'system', 'binary', 'length', 'files:', 'refers'], 8: ['sequence', 'content.', 'variable', 'records.', 'systems,', 'computer'], 10: ['container,', 'electronic', 'delimiters', 'structured', '(sometimes', 'character,'], 1: ['A', 'a'], 4: ['will', 'line', 'data', 'done', 'last', 'more', 'kind', 'such', 'text', 'Some', 'size', 'need', 'ways', 'have', 'file', 'CP/M', 'with', 'that', 'most', 'name', 'type', 'keep', 'does'], 5: ['store', 'after', 'files', 'while', 'file"', 'known', 'those', 'plain', 'there', 'fixed', 'which', '"Text', 'file.', 'level', 'where', 'track', 'lines', 'kinds', 'text.', 'There'], 9: ['depending', 'Unix-like', 'primarily', 'textfile;', 'separated', 'Microsoft', 'flatfile)', 'operating', 'different'], 3: ['EOF', 'may', 'one', 'and', 'use', 'are', 'two', 'new', 'the', 'end', 'any', 'for', 'few', 'old', 'not'], 7: ['systems', 'denoted', 'Windows', 'because', 'spelled', 'marker,', 'padding', 'special', 'MS-DOS,', 'generic', 'contain', 'system.', 'placing'], 2: ['At', 'do', 'of', 'on', 'as', 'in', 'an', 'or', 'is', 'In', 'On', 'by', 'to']}
How about using a loop and letting the dict create keys on its own:
str_files_txt = "A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. There are for most text files a need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records. 'Text file' refers to a type of container, while plain text refers to a type of content. At a generic level of description, there are two kinds of computer files: text files and binary files"
op = {}
for items in str_files_txt.split():
    if len(items) not in op:
        op[len(items)] = []
    op[len(items)].append(items)
for items in op:
    op[items] = list(set(op[items]))
answer = {}
for word in str_files_txt.split():  # loop over all the words
    # use setdefault to create an empty set if the key doesn't exist;
    # the set will handle deduping
    answer.setdefault(len(word), set()).add(word)
# turn those sets into lists
for k, v in answer.items():
    answer[k] = list(v)
str_files_txt = "A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. There are for most text files a need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records. 'Text file' refers to a type of container, while plain text refers to a type of content. At a generic level of description, there are two kinds of computer files: text files and binary files"
lengthWordDict = {}
for word in str_files_txt.split(' '):
    wordWithoutSpecialChars = ''.join([char for char in word if char.isalpha()])
    wordWithoutSpecialCharsLength = len(wordWithoutSpecialChars)
    if wordWithoutSpecialCharsLength in lengthWordDict.keys():
        lengthWordDict[wordWithoutSpecialCharsLength].append(word)
    else:
        lengthWordDict[wordWithoutSpecialCharsLength] = [word]
print(lengthWordDict)
This is my solution; it keys on the length of the word without special characters (e.g. punctuation).
To key on the absolute length of the word (with punctuation), replace wordWithoutSpecialChars with word.
Output:
{1: ['A', 'a', 'a', 'A', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'], 4: ['text', 'file', 'name', 'kind', 'file', 'that', 'text.', 'text', 'file', 'data', 'file', 'such', 'does', 'keep', 'file', 'size', 'text', 'file', 'more', 'last', 'line', 'text', 'file.', 'such', 'text', 'file', 'keep', 'file', 'size', 'most', 'text', 'need', 'have', 'done', 'ways', 'Some', 'with', 'file', 'line', 'will', 'text', 'with', "'Text", "file'", 'type', 'text', 'type', 'text'], 9: ['(sometimes', 'operating', 'operating', 'end-of-file', 'operating', 'Microsoft', 'character,', 'operating', 'end-of-line', 'different', 'depending', 'operating', 'operating', 'primarily', 'separated', 'container,'], 7: ['spelled', 'systems', 'denoted', 'placing', 'special', 'padding', 'systems', 'Windows', 'systems,', 'contain', 'special', 'because', 'systems', 'systems', 'systems', 'systems', 'records.', 'content.', 'generic'], 8: ['textfile;', 'flatfile)', 'computer', 'sequence', 'computer', 'Unix-like', 'variable', 'computer'], 2: ['an', 'is', 'is', 'of', 'is', 'as', 'of', 'of', 'as', 'In', 'as', 'of', 'in', 'of', 'is', 'by', 'or', 'as', 'an', 'as', 'in', 'On', 'as', 'do', 'on', 'of', 'in', 'to', 'in', 'on', 'as', 'or', 'to', 'of', 'to', 'of', 'At', 'of', 'of'], 3: ['old', 'CP/M', 'and', 'the', 'not', 'the', 'the', 'end', 'one', 'the', 'and', 'not', 'any', 'EOF', 'the', 'are', 'for', 'are', 'few', 'may', 'not', 'use', 'new', 'and', 'are', 'two', 'and'], 11: ['alternative', 'description,'], 10: ['structured', 'electronic', 'characters,', 'delimiters,', 'delimiters'], 5: ['lines', 'MS-DOS,', 'where', 'track', 'bytes,', 'known', 'after', 'files', 'those', 'track', 'bytes.', 'There', 'files', 'which', 'store', 'files', 'lines', 'fixed', 'while', 'plain', 'level', 'there', 'kinds', 'files:', 'files', 'files'], 6: ['exists', 'stored', 'within', 'system.', 'system', 'marker,', 'modern', 'system.', 'length', 'refers', 'refers', 'binary'], 16: ['record-orientated']}
You can directly add the strings to the dictionary at the right position as follows:
res = {}
for ele in list(set(str_files_txt.split())):
    if len(ele) in res:
        res[len(ele)].append(ele)
    else:
        res[len(ele)] = [ele]
print(res)
You have two problems: cleaning your data and creating the dictionary.
Use a defaultdict(list) after stripping your words of characters that do not belong to them. (This is similar to the duplicate's answer.)
from collections import defaultdict
d = defaultdict(list)
text = """A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. There are for most text files a need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records.
'Text file' refers to a type of container, while plain text refers to a type of content.
At a generic level of description, there are two kinds of computer files: text files and binary files"
"""
# remove the characters ,.!;:-"' from begin/end of all space splitted words
words = [w.strip(",.!;:- \"'") for w in text.split()]
# add words to list in dict, automatically creates list if needed
# your code uses a set as well
for w in set(words):
    d[len(w)].append(w)
# output
for k in sorted(d):
    print(k, d[k])
Output:
1 ['A', 'a']
2 ['to', 'an', 'At', 'do', 'on', 'In', 'On', 'as', 'by', 'or', 'of', 'in', 'is']
3 ['use', 'the', 'one', 'and', 'few', 'not', 'EOF', 'may', 'any', 'for', 'are', 'two', 'end', 'new', 'old']
4 ['have', 'that', 'such', 'type', 'need', 'text', 'more', 'done', 'kind', 'Some', 'does', 'most', 'file', 'with', 'line', 'ways', 'keep', 'CP/M', 'name', 'will', 'Text', 'data', 'last', 'size']
5 ['track', 'those', 'bytes', 'fixed', 'known', 'where', 'which', 'there', 'while', 'There', 'lines', 'kinds', 'store', 'files', 'plain', 'after', 'level']
6 ['exists', 'modern', 'MS-DOS', 'system', 'within', 'refers', 'length', 'marker', 'stored', 'binary']
7 ['because', 'placing', 'content', 'Windows', 'padding', 'systems', 'records', 'contain', 'special', 'generic', 'denoted', 'spelled']
8 ['computer', 'sequence', 'textfile', 'variable']
9 ['Microsoft', 'depending', 'different', 'Unix-like', 'flatfile)', 'primarily', 'container', 'character', 'separated', 'operating']
10 ['delimiters', 'characters', 'electronic', '(sometimes', 'structured']
11 ['end-of-file', 'alternative', 'end-of-line', 'description']
17 ['record-orientated']
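For completeness, the same grouping (ignoring the cleaning step) can also be written with sorted plus itertools.groupby; a sketch:

from itertools import groupby

words = set(str_files_txt.split())
by_len = {k: list(g) for k, g in groupby(sorted(words, key=len), key=len)}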

Transformers and BERT: dealing with possessives and apostrophes when encode

Let's consider two sentences:
"why isn't Alex's text tokenizing? The house on the left is the Smiths' house"
Now let's tokenize and decode:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenizer.decode(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("why isn't Alex's text tokenizing? The house on the left is the Smiths' house")))
We get:
"why isn't alex's text tokenizing? the house on the left is the smiths'house"
My question is: how do I deal with the missing space in some possessives, like smiths'house?
To me, it seems that tokenization in Transformers is not done right. Let's consider the output of
tokenizer.tokenize("why isn't Alex's text tokenizing? The house on the left is the Smiths' house")
we get:
['why', 'isn', "'", 't', 'alex', "'", 's', 'text', 'token', '##izing', '?', 'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "'", 'house']
So at this step we have already lost important information about the last apostrophe. It would be much better if tokenization were done another way:
['why', 'isn', "##'", '##t', 'alex', "##'", '##s', 'text', 'token', '##izing', '?', 'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "##'", 'house']
In this way, tokenization keeps all information about apostrophes, and we will not have problems with possessives.
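For what it's worth, the glued smiths'house appears to come from decode's cleanup step rather than from tokenization itself: by default, decode tidies spaces around punctuation, and that includes replacing " ' " with "'". A sketch, assuming a transformers version whose decode accepts the clean_up_tokenization_spaces flag:

ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(
    "why isn't Alex's text tokenizing? The house on the left is the Smiths' house"))
tokenizer.decode(ids, clean_up_tokenization_spaces=False)
# -> "why isn ' t alex ' s text tokenizing ? the house on the left is the smiths ' house"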

read multiple txt files python

I have 6000 txt files to read in Python. I am trying to read them, but all the txt files come out line by line.
Subject: key dates and impact of upcoming sap implementation
over the next few weeks , project apollo and beyond will conduct its final sap
implementation ) this implementation will impact approximately 12 , 000 new
users plus all existing system users . sap brings a new dynamic to enron ,
enhancing the timely flow and sharing of specific project , human resources ,
procurement , and financial information across business units and across
continents .
this final implementation will retire multiple , disparate systems and replace
them with a common , integrated system encompassing many processes including
payroll , timekeeping ...
So Python separates each file into rows when I read the files one by one (I know that's ridiculous); in the end, one mail is divided across multiple rows. I tried read_csv on all the txt files, but Python gives the error ValueError: stat: path too long for Windows. I don't know what I should do from here.
I tried this:
import glob
import errno
path =r'C:\Users\frknk\OneDrive\Masaüstü\enron6\emails\*.txt'
files = glob.glob(path)
for name in files:
    try:
        with open(name) as f:
            for line in f:
                print(line.split())
    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise
['Subject:', 'key', 'dates', 'and', 'impact', 'of', 'upcoming', 'sap', 'implementation']
['over', 'the', 'next', 'few', 'weeks', ',', 'project', 'apollo', 'and', 'beyond', 'will', 'conduct', 'its', 'final', 'sap']
I need this email by email, but it is separated line by line. What I want is one email per row.
You can read the whole text file into a variable and then manipulate it however you want: just replace for line in f with data = f.read(). Below, I read each txt file into a data variable and then split it to get the words separated by " ". Hope this helps.
for name in files:
    try:
        with open(name) as f:
            data = f.read().replace("\n", "")
            print(data.split())
    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise
Output would look like this:
['Subject:', 'key', 'dates', 'and', 'impact', 'of', 'upcoming', 'sap', 'implementationover', 'the', 'next', 'few', 'weeks', ',', 'project', 'apollo', 'and', 'beyond', 'will', 'conduct', 'its', 'final', 'sapimplementation', ')', 'this', 'implementation', 'will', 'impact', 'approximately', '12', ',', '000', 'newusers', 'plus', 'all', 'existing', 'system', 'users', '.', 'sap', 'brings', 'a', 'new', 'dynamic', 'to', 'enron', ',enhancing', 'the', 'timely', 'flow', 'and', 'sharing', 'of', 'specific', 'project', ',', 'human', 'resources', ',procurement', ',', 'and', 'financial', 'information', 'across', 'business', 'units', 'and', 'acrosscontinents', '.this', 'final', 'implementation', 'will', 'retire', 'multiple', ',', 'disparate', 'systems', 'and', 'replacethem', 'with', 'a', 'common', ',', 'integrated', 'system', 'encompassing', 'many', 'processes', 'includingpayroll', ',', 'timekeeping', '...']
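If the goal is one element per email rather than printed word lists, collect each whole file into a list instead; a minimal sketch using the same glob path, keeping a space where each newline was so that words at line breaks are not glued together:

import glob
from pathlib import Path

path = r'C:\Users\frknk\OneDrive\Masaüstü\enron6\emails\*.txt'
emails = [Path(name).read_text(errors='ignore').replace('\n', ' ')
          for name in glob.glob(path)]
print(len(emails))     # number of emails read
print(emails[0][:80])  # start of the first email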

Python: gensim: RuntimeError: you must first build vocabulary before training the model

I know that this question has been asked already, but I was still not able to find a solution for it.
I would like to use gensim's word2vec on a custom data set, but I'm still figuring out what format the dataset has to be in. I had a look at this post, where the input is basically a list of lists (one big list containing other lists that are tokenized sentences from the NLTK Brown corpus). So I thought that this is the input format I have to use for word2vec.Word2Vec(). However, it won't work with my little test set, and I don't understand why.
What I have tried:
This worked:
from gensim.models import word2vec
from nltk.corpus import brown
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
brown_vecs = word2vec.Word2Vec(brown.sents())
This didn't work:
sentences = ["the quick brown fox jumps over the lazy dogs", "yoyoyo you go home now to sleep"]
vocab = [s.encode('utf-8').split() for s in sentences]
voc_vec = word2vec.Word2Vec(vocab)
I don't understand why it doesn't work with the "mock" data, even though it has the same data structure as the sentences from the Brown corpus:
vocab:
[['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'], ['yoyoyo', 'you', 'go', 'home', 'now', 'to', 'sleep']]
brown.sents() (the beginning of it):
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
Can anyone please tell me what I'm doing wrong?
Default min_count in gensim's Word2Vec is set to 5. If there is no word in your vocab with frequency greater than 4, your vocab will be empty and hence the error. Try
voc_vec = word2vec.Word2Vec(vocab, min_count=1)
Input to gensim's Word2Vec can be a list of sentences, a list of words, or a list of lists of sentences.
E.g.
1. sentences = ['I love ice-cream', 'he loves ice-cream', 'you love ice cream']
2. words = ['i','love','ice - cream', 'like', 'ice-cream']
3. sentences = [['i love ice-cream'], ['he loves ice-cream'], ['you love ice cream']]
Build the vocab before training:
model.build_vocab(sentences, update=False)
Just check out the link for detailed info.
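Putting the two answers together, a minimal end-to-end sketch (assuming a gensim version with the model.wv attribute): pass a list of token lists, and lower min_count so a tiny vocabulary survives:

from gensim.models import word2vec

# a list of lists of tokens, the format Word2Vec trains on
sentences = [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'],
             ['yoyoyo', 'you', 'go', 'home', 'now', 'to', 'sleep']]
model = word2vec.Word2Vec(sentences, min_count=1)  # min_count=1 keeps every word
print(model.wv['fox'])  # the learned vector for 'fox'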
