IndexError: cannot fit 'int' into an index-sized integer - python

So I'm trying to make my program print out the index of each word and punctuation mark, in the order it occurs, from a text file. I have done that part. But the problem comes when I try to recreate the original text, with punctuation, using those index positions. Here is my code:
with open('newfiles.txt') as f:
    s = f.read()

import re
# Splitting string into a list using regex and a capturing group:
matches = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['', ' ']]
print(matches)

d = {}
i = 1
list_with_positions = []
# the dictionary entries:
for match in matches:
    if match not in d.keys():
        d[match] = i
        i += 1
    list_with_positions.append(d[match])
print(list_with_positions)

file = open("newfiletwo.txt", "w")
file.write(''.join(str(e) for e in list_with_positions))
file.close()

file = open("newfilethree.txt", "w")
file.write(''.join(matches))
file.close()

word_base = None
with open('newfilethree.txt', 'rt') as f_base:
    word_base = [None] + [z.strip() for z in f_base.read().split()]

sentence_seq = None
with open('newfiletwo.txt', 'rt') as f_select:
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
print(' '.join(sentence_seq))
As I said, the first part works fine, but then I get the error:
Traceback (most recent call last):
File "E:\Python\Indexes.py", line 33, in <module>
sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
File "E:\Python\Indexes.py", line 33, in <listcomp>
sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
IndexError: cannot fit 'int' into an index-sized integer
This error occurs when the program runs the 'sentence_seq' list comprehension towards the bottom of the code.
newfiles.txt is the original text file - a random article with more than one sentence and punctuation.
list_with_positions is the list with the actual positions of where each word occurs within the original text.
matches is the separated DIFFERENT words - if words repeat in the file (which they do), matches should contain only the distinct words.
Does anyone know why I get the error?

The issue with your approach is using ''.join(), as this joins everything with no spaces. So the immediate issue is that you then attempt to split() what is effectively one long run of digits with no spaces; what you get back is a single value with 100+ digits. That gigantic number is far too large to be used as an index, hence the error. An even deeper issue is that indices can go into double digits and beyond; how did you expect split() to separate them when the numbers are joined without spaces?
Beyond that, you don't treat punctuation properly. ' '.join() is equally invalid when trying to reconstruct a sentence, because commas, full stops etc. end up with whitespace on either side.
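To see the first problem in isolation, here is a minimal sketch (made-up positions, not the question's data):
positions = [1, 2, 3, 12, 1]

# Joining with no separator produces one long run of digits:
joined = ''.join(str(e) for e in positions)
print(joined.split())   # ['123121'] - a single huge "index"

# Joining with spaces keeps the individual numbers recoverable:
joined = ' '.join(str(e) for e in positions)
print(joined.split())   # ['1', '2', '3', '12', '1']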
I tried my best to stick with your current code/approach (I don't think there's huge value in changing the entire approach when trying to understand where an issue comes from), but it still feels shaky to me. I dropped the regex; perhaps it was needed. I'm not immediately aware of a library for doing this kind of thing, but almost certainly there must be a better way.
import string

punctuation_list = set(string.punctuation)  # Punctuation has to be treated differently

word_base = []
index_dict = {}

with open('newfiles.txt', 'r') as infile:
    raw_data = infile.read().split()

for index, item in enumerate(raw_data):
    index_dict[item] = index
    word_base.append(item)

with open('newfiletwo.txt', 'w') as outfile1, open('newfilethree.txt', 'w') as outfile2:
    for item in word_base:
        outfile1.write(str(item) + ' ')              # the words, space-separated
        outfile2.write(str(index_dict[item]) + ' ')  # the indices, space-separated

with open('newfiletwo.txt', 'r') as infile1, open('newfilethree.txt', 'r') as infile2:
    words = infile1.read().split()    # newfiletwo.txt holds the words
    indices = infile2.read().split()  # newfilethree.txt holds the indices

# No space before punctuation, one space before every word:
reconstructed = ''.join(item if item in punctuation_list else ' ' + item for item in words).strip()
print(reconstructed)
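For completeness, here is a minimal sketch of the round trip the question was actually aiming for, keeping the question's regex tokenisation and 1-based ids. It relies on dicts preserving insertion order (Python 3.7+), and punctuation still gets spaces around it when rejoining, as discussed above:
import re

with open('newfiles.txt') as f:
    s = f.read()

# Same tokenisation as the question: words and punctuation become separate tokens
tokens = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['', ' ']]

# Assign each distinct token a 1-based id and record the sequence of ids
ids = {}
sequence = []
for tok in tokens:
    if tok not in ids:
        ids[tok] = len(ids) + 1
    sequence.append(ids[tok])

with open('newfiletwo.txt', 'w') as f:
    f.write(' '.join(str(n) for n in sequence))  # space-separated: the crucial fix
with open('newfilethree.txt', 'w') as f:
    f.write(' '.join(ids))                       # distinct tokens, in first-seen order

# Round trip: the ids are 1-based, so pad position 0
with open('newfilethree.txt') as f:
    word_base = [None] + f.read().split()
with open('newfiletwo.txt') as f:
    print(' '.join(word_base[int(n)] for n in f.read().split()))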

Related

Deleting punctuation from a large text file (TASK: counting words, deleting common words, deleting punctuation)

I am new to programming.
I need some help to understand the code for removing punctuation from a large text file.
I came across few solutions and tried to code as per below:
import string

fname = input("Enter file name: ")
# if len(fname) < 1: fname = "98-0.txt"  # Can Enter without typing, but not working
#                                        # on shell
fh = open(fname)
# 1. Read in each word from the file,
# 1a. Making it lower case
# 1b. Removing punctuation. (Optionally, skip common words).
# 1c. For each remaining word, add the word to the data structure or
#     update your count for the word
counts = dict()
for line in fh:
    line = line.strip()  # 1
    line = line.lower()  # 1a.
    line = line.split()
    # print(string.punctuation)  # Provides all the different punctuation marks that might
    #                            # exist in a text
    print(line.translate(line.maketrans(" ", " ", string.punctuation)))
    # print(words)
But, I am getting a Traceback:
Traceback (most recent call last):
File "wcloud.py", line 29, in <module>
print(line.translate(line.maketrans(" ", " ", string.punctuation)))
AttributeError: 'list' object has no attribute 'translate'
I tried to update Atom with the latest Python (I hope I did it correctly.. I'm not sure).
As has already been noted, line = line.split() transforms your original string into a list of strings - i.e. it splits your line into words. So, since translate and maketrans are string methods, you will need to either loop over the items of the list:
for word in line:
    word.translate(word.maketrans(" ", " ", string.punctuation))
or, better, remove punctuation before splitting the line:
line = line.translate(line.maketrans(" ", " ", string.punctuation))
line = line.split()
In the latter case you will still need to loop over the words in line to add them to counts. Or, you may have a look at collections.Counter, which can do the job for you :)
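For instance, a minimal sketch of the whole counting task with collections.Counter (the file name is a placeholder):
import string
from collections import Counter

counts = Counter()
with open("98-0.txt") as fh:
    for line in fh:
        # Lower-case and strip punctuation before splitting into words
        line = line.lower().translate(str.maketrans("", "", string.punctuation))
        counts.update(line.split())

print(counts.most_common(10))  # the ten most frequent words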

find end of line after another in text file in Python

Issue
Hello all,
In a text file I need to replace an unknown string with another. To find it, I first need to find the line before it, 'name Blur2', since there are many lines beginning with 'xpos':
name Blur2
xpos 12279  # 12279 is the end-of-line value to find and put in a variable
Code to get unknow string:
#string to find:
keyString = ' name Blur2'
f2 = open("output_file.txt", 'w+')
with open("input_file.txt", 'r+') as f1:
lines = f1.readlines()
for i in range(0, len(lines)):
line = lines[i]
if keyString in line:
nextLine = lines[i + 1]
print ' nextLine: ',nextLine #result: nextLine: xpos 12279
number = nextLine.rsplit(' xpos ', 1)[1]
print ' number: ',number #result: number: 12279
#convert string to float:
newString = '{0}\n'.format(int(number)+ 10)
print ' newString: ',newString #result: newString: 12289
f2.write("".join([nextLine.replace(number, str(newString))])) #this line isn't working
f1.close()
f2.close()
So I had completely changed my method, but the last line (f2.write...) isn't working as expected. Does someone know why?
Thanks again for your help :)
regex seems like it would help, https://regex101.com/.
Regex searches a string using a small language that defines a pattern. I've listed the most important parts of this pattern below; regex is sometimes a better alternative than Python's native string manipulation.
You first describe the pattern you will be using, then actually compile it. I defined the pattern as a raw string using r''. This means backslashes don't have to be escaped inside the string (for example, a single literal \ is written '\\' in a normal string but just \ in a raw one).
There are a couple of parts to this regex.
\s matches whitespace (characters like space, ' ').
\n and \r match newline and carriage return; [^...] defines which characters not to look for (so [^\n\r] matches anything that isn't a newline or carriage return), the * means 0 or more of the preceding item, and $ matches at the end of the line.
So the pattern searches for 'name Blur2' specifically, followed by any number of whitespace characters and a newline. The parentheses make this group 1 (explained later). The second part, ([^\n\r]*$), captures any number of characters that aren't a newline or carriage return, up to the end of that line.
Groups correspond to the parentheses, so (name Blur2\s*\n) is group 1, and the line you want replaced, ([^\n\r]*$), is group 2. checkre.sub should replace the whole match with group 1 plus the new string, so it keeps the first line as-is and replaces the second line with your new string:
import re
check = r'(name Blur2\s*\n)([^\n\r]*$)'
checkre = re.compile(check, re.MULTILINE)
newfilestring = checkre.sub(r'\g<1>' + newstring, filestring)
You need to set re.MULTILINE since you're checking multiple lines. If the '\n' isn't matched, you could instead allow the match to end at a newline, a carriage return, or the absolute end of the string (\Z in Python).
rioV8's comment works, but you could also use '.{5}$', which matches any 5 characters before the end of the line. It could be helpful within a regex.
It should be possible to get the old string with
oldstring = checkre.search(filestring).group(2)
I have not played with span yet, but
stringmatch = checkre.search(filestring)
oldstring = stringmatch.group(2)
newfilestring = filestring[0:stringmatch.span()[0]] + stringmatch.group(1) + newstring + filestring[stringmatch.span()[1]:]
should be pretty close to what you're looking for, although the splice may not be exactly correct.
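Putting that together, a minimal runnable sketch (the file names and the replacement text are placeholders):
import re

newstring = ' xpos 12289'  # hypothetical new second line

with open('input_file.txt') as f:
    filestring = f.read()

# Group 1: the 'name Blur2' line (kept); group 2: the following line (replaced)
checkre = re.compile(r'(name Blur2\s*\n)([^\n\r]*$)', re.MULTILINE)

with open('output_file.txt', 'w') as f:
    f.write(checkre.sub(r'\g<1>' + newstring, filestring))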
The initial program was pretty close. I edited it a little to tweak the few things that were wrong.
You weren't writing out the lines that didn't need to be replaced, and I'm not sure why you needed the join; replacing the number directly seems to work. Python doesn't let you change the loop variable i inside a for loop, and you need to skip one line so the old line isn't written to the file, so I changed it to a while loop. Anyway, ask any questions you have, but the code below seems to work.
# string to find:
keyString = ' name Blur2'
f2 = open("output_file.txt", 'w+')
with open("test.txt", 'r+') as f1:
    lines = f1.readlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if keyString in line:
            f2.write(line)
            nextLine = lines[i + 1]
            # end of the necessary 'i' uses; increment i so the replaced line isn't written again
            i += 1
            print(' nextLine: ', nextLine)  # result: nextLine: xpos 12279
            number = nextLine.rsplit(' xpos ', 1)[1]
            # as was said in a comment, this could also be number = nextLine[-5:]
            print(' number: ', number)  # result: number: 12279
            # convert string to int and add 10:
            newString = '{0}\n'.format(int(number) + 10)
            print(' newString: ', newString)  # result: newString: 12289
            f2.write(nextLine.replace(number, str(newString)))  # write the updated line
        else:
            f2.write(line)
        i += 1
f2.close()

Take tokens from a text file, calculate their frequency, and return them in a new text file in Python

After a long time researching and asking friends, I am still a dumb-dumb and don't know how to solve this.
So, for homework, we are supposed to define a function which accesses two files, the first of which is a text file with the following sentence, from which we are to calculate the word frequencies:
In a Berlin divided by the Berlin Wall , two angels , Damiel and Cassiel , watch the city , unseen and unheard by its human inhabitants .
We are also to include commas and periods: each single item has already been tokenised (individual items are surrounded by whitespace, including the commas and periods). Then the word frequencies must be written into a new txt file as "word:count", in the order in which the words appear, i.e.:
In:1
a:1
Berlin:2
divided:1
etc.
I have tried the following:
def find_token_frequency(x, y):
    with open(x, encoding='utf-8') as fobj_1:
        with open(y, 'w', encoding='utf-8') as fobj_2:
            fobj_1list = fobj_1.split()
            unique_string = []
            for i in fobj_1list:
                if i not in unique_string:
                    unique_string.append(i)
            for i in range(0, len(unique_string)):
                fobj_2.write("{}: {}".format(unique_string[i], fobj_1list.count(unique_string[i])))
I am not sure I need to actually use .split() at all, but I don't know what else to do, and it does not work anyway, since it tells me I cannot split that object.
I am told:
Traceback (most recent call last):
[...]
fobj_1list = fobj_1.split()
AttributeError: '_io.TextIOWrapper' object has no attribute 'split'
When I remove the .split(), the displayed error is:
fobj_2.write("{}: {}".format(unique_string[i], fobj_1list.count(unique_string[i])))
AttributeError: '_io.TextIOWrapper' object has no attribute 'count'
Let's divide your problem into smaller problems so we can more easily solve this.
First we need to read a file, so let's do so and save it into a variable:
with open("myfile.txt") as fobj_1:
sentences = fobj_1.read()
OK, so now we have your file as a string stored in sentences. Let's turn it into a list and count the occurrences of each word:
words = sentences.split(" ")
frequency = {word: words.count(word) for word in set(words)}
Here frequency is a dictionary where each word in the sentences is a key, with the value being how many times it appears in the sentence. Note the usage of set(words). A set does not have repeated elements, which is why we iterate over the set of words and not the word list.
Finally, we can save the word frequencies into a file
with open("results.txt", 'w') as fobj_2:
for word in frequency: fobj_2.write(f"{word}:{frequency[word]}\n")
Here we use f-strings to format each line into the desired output. Note that f-strings are available in Python 3.6+.
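Assembled into one function, a sketch of the above (note that iterating over a set does not preserve the first-appearance order the task asks for, so dict.fromkeys is used here instead to keep order while dropping duplicates):
def find_token_frequency(infile_name, outfile_name):
    with open(infile_name, encoding='utf-8') as fobj_1:
        words = fobj_1.read().split()
    # dict.fromkeys keeps first-appearance order and drops duplicates
    frequency = {word: words.count(word) for word in dict.fromkeys(words)}
    with open(outfile_name, 'w', encoding='utf-8') as fobj_2:
        for word, count in frequency.items():
            fobj_2.write(f"{word}:{count}\n")

find_token_frequency("myfile.txt", "results.txt")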
I'm unable to comment as I don't have the required reputation, but the reason split() isn't working is that you're calling it on the file object itself, not a string. Try calling:
fobj_1list = fobj_1.readline().split()
instead. Also, when I ran this locally, I got an error saying that TypeError: 'encoding' is an invalid keyword argument for this function. You may want to remove the encoding argument from your function calls.
I think that should be enough to get you going.
The following script should do what you want.
#!/usr/local/bin/python3

def find_token_frequency(inputFileName, outputFileName):
    # wordOrderList to maintain order
    # dict to keep track of count
    wordOrderList = []
    wordCountDict = dict()

    # read the file
    inputFile = open(inputFileName, encoding='utf-8')
    lines = inputFile.readlines()
    inputFile.close()

    # iterate over all lines in the file
    for line in lines:
        # and split them into words
        words = line.split()
        # now, iterate over all words
        for word in words:
            # and add them to the list and dict
            if word not in wordOrderList:
                wordOrderList.append(word)
                wordCountDict[word] = 1
            else:
                # or increment their count
                wordCountDict[word] = wordCountDict[word] + 1

    # store result in outputFile
    outputFile = open(outputFileName, 'w', encoding='utf-8')
    for index in range(0, len(wordOrderList)):
        word = wordOrderList[index]
        outputFile.write(f'{word}:{wordCountDict[word]}\n')
    outputFile.close()

find_token_frequency("input.txt", "output.txt")
I changed your variable names a bit to make the code more readable.

Read each line of a text file and then split each line by spaces in python

I am having a silly issue whereby I have a text file with user inputs structured as follows:
x = variable1
y = variable2
and so on. I want to grab the variables. To do this I was going to just import the text file and then grab out UserInputs[2], UserInputs[5], etc. I have spent a lot of time reading up on how to do this; the closest I got was with the csv package, but that resulted in just getting the '=' signs when I printed it, so I went back to just using the open command and readlines, and then trying to iterate through the lines and split them by ' '.
So far I have the following code:
Text_File_Import = open('USER_INPUTS.txt', 'r')
Text_lines = Text_File_Import.readlines()
for line in Text_lines:
    User_Inputs = line.split(' ')
    print User_Inputs
However, this only outputs the first line from my text file (i.e. I get 'x', '=', 'variable1') but it never moves on to the next line. How would I iterate this code through the whole imported text file?
I have bodged it a bit for the time being and rearranged the text file to be variable1 = x and so on. This way I can still import the variables, and the x has the \n after it if I just import them with the following code:
def ReadTextFile(textfilename):
    Text_File_Import = open(textfilename, 'r')
    Text_lines = Text_File_Import.readlines()
    User_Inputs = Text_lines[1].split(' ')
    User_Inputs_clength = User_Inputs[0]
    # print User_Inputs[2] + User_Inputs_clength
    User_Inputs = Text_lines[2].split(' ')
    User_Inputs_cradius = User_Inputs[0]
    # print User_Inputs[2], ' ', User_Inputs_cradius
    return User_Inputs_clength, User_Inputs_cradius
Thanks
I don't quite understand the question. If you want to store the variables:
As long as the values in the text file are valid Python syntax (e.g. strings surrounded by quotes), here is an easy but very insecure method:
file=open('file.txt')
exec(file.read())
It will store all the variables, with their names.
If you want to split a text file between the spaces:
file=open('file.txt')
output=file.read().split(' ')
And if you want to replace newlines with spaces:
file=open('file.txt')
output=file.read().replace('\n', ' ')
You have a lot of indentation issues. To read lines and split them by spaces, the snippet below should help.
Demo
with open('USER_INPUTS.txt', 'r') as infile:
    data = infile.readlines()
    for i in data:
        print(i.split())
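If the goal is to pull out the value on the right of each '=', a minimal sketch (assuming every line has the form name = value, as in the question):
user_inputs = {}
with open('USER_INPUTS.txt') as infile:
    for line in infile:
        if '=' in line:
            name, value = line.split('=', 1)
            user_inputs[name.strip()] = value.strip()

print(user_inputs)  # e.g. {'x': 'variable1', 'y': 'variable2'}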

Print full sequence not just first line | Python 3.3 | Print from specific line to end (")

I am attempting to pull out multiple (50-100) sequences from a large .txt file separated by new lines ('\n'). The sequence is a few lines long but not always the same length, so I can't just print lines x-y. The sequences end with " and the next line always starts with the same word, so maybe that could be used as a keyword.
I am writing using python 3.3
This is what I have so far:
searchfile = open('filename.txt', 'r')
cache = []
for line in searchfile:
    cache.append(line)

for line in range(len(cache)):
    if "keyword1" in cache[line].lower():
        print(cache[line + 5])
This pulls out the starting line (which is always 5 lines below the keyword line); however, it only pulls out this one line.
How do I print the whole sequence?
Thank you for your help.
EDIT 1:
Current output = ABCDCECECCECECE ...
Desired output = ABCBDBEBSOSO ...
ABCBDBDBDBDD ...
continued until " or new line
Edit 2
Text file looks like this:
Name (keyword):
Date
Address1
Address2
Sex
Response"................................"
Y/N
The sequence between the " and " is what I need
TL;DR - How do I print from line + 5 to end when end = keyword
Not sure if I understand your sequence data, but if you're searching for each 'keyword' and then the next " character, the following should work:
keyword_pos = []
endseq_pos = []
for line in range(len(cache)):
    if 'keyword1' in cache[line].lower():
        keyword_pos.append(line)
    if '"' in cache[line]:
        endseq_pos.append(line)

for key in keyword_pos:
    for endseq in endseq_pos:
        if endseq > key:
            print(cache[key:endseq])
            break
This simply compiles a list of all the positions of all the keywords and " characters and then matches the two and prints all the lines in between.
Hope that helps.
I agree with @Michal Frystacky that regex is the way forward. However, as I now understand the problem, we need two searches: one for the 'keyword', then another 5 lines on to find the 'sequence'.
This should work but may need the regex to be tweaked:
import re

with open('yourfile.txt') as f:
    lines = f.readlines()

for i, line in enumerate(lines):
    # first search for the keyword
    key_match = re.search(r'\((keyword)', line)
    if key_match:
        # if successful, search 5 lines on for the string between the quotation marks
        seq_match = re.search(r'"([A-Z]*)"', lines[i + 5])
        if seq_match:
            print(key_match.group(1) + ' ' + seq_match.group(1))
This can be done rather simply with regex:
import re

lines = 'Name (keyword):', 'Date', 'Address1', 'Address2', 'Sex', 'Response"................................" '
for line in lines:
    match = re.search(r'.*?"(.*?)"', line)
    if match:
        print(match.group(1))
Eventually, to use this sample code, we would take lines = f.readlines() from the dataset. It's important to note that we only catch things between one " and another "; if there is no " mark at the end, we will miss that data, but accounting for that isn't too difficult.
