Reading file multiple ways in Python - python

I am trying to set up a system for running various statistics on a text file. In this endeavor I need to open a file in Python (v2.7.10) and read it both as lines, and as a string, for the statistical functions to work.
So far I have this:
import csv, json, re
from textstat.textstat import textstat
file = "Data/Test.txt"
data = open(file, "r")
string = data.read().replace('\n', '')
lines = 0
blanklines = 0
word_list = []
cf_dict = {}
word_dict = {}
punctuations = [",", ".", "!", "?", ";", ":"]
sentences = 0
This sets up the file and the preliminary variables. At this point, print textstat.syllable_count(string) returns a number. Further, I have:
for line in data:
    lines += 1
    if line.startswith('\n'):
        blanklines += 1
    word_list.extend(line.split())
    for char in line.lower():
        cf_dict[char] = cf_dict.get(char, 0) + 1
for word in word_list:
    lastchar = word[-1]
    if lastchar in punctuations:
        word = word.rstrip(lastchar)
    word = word.lower()
    word_dict[word] = word_dict.get(word, 0) + 1
for key in cf_dict.keys():
    if key in '.!?':
        sentences += cf_dict[key]
number_words = len(word_list)
num = float(number_words)
avg_wordsize = len(''.join([k*v for k, v in word_dict.items()]))/num
mcw = sorted([(v, k) for k, v in word_dict.items()], reverse=True)
print( "Total lines: %d" % lines )
print( "Blank lines: %d" % blanklines )
print( "Sentences: %d" % sentences )
print( "Words: %d" % number_words )
print('-' * 30)
print( "Average word length: %0.2f" % avg_wordsize )
print( "30 most common words: %s" % mcw[:30] )
But this fails: the line avg_wordsize = len(''.join([k*v for k, v in word_dict.items()]))/num raises ZeroDivisionError: float division by zero. However, if I comment out string = data.read().replace('\n', '') in the first piece of code, the second piece runs without problems and gives the expected output.
Basically, how do I set this up so that I can run the second piece of code on data, as well as textstat on string?

The call to data.read() leaves the file pointer at the end of the file, so there is nothing more to read at that point. You either have to close and reopen the file or, more simply, reset the pointer to the beginning with data.seek(0).
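To see this in isolation, here is a minimal sketch. It uses an in-memory io.StringIO object as a stand-in for the opened file, and the sample text is invented for illustration:

```python
import io

# Stand-in for open("Data/Test.txt"); the contents are invented for the demo.
data = io.StringIO(u"first line\n\nthird line\n")

string = data.read().replace("\n", "")  # consumes the whole stream
leftover = [line for line in data]      # pointer is at the end: nothing left

data.seek(0)                            # rewind to the beginning
lines = [line for line in data]         # iteration now sees every line

print(leftover)    # []
print(len(lines))  # 3
```

A real file object returned by open() behaves the same way with respect to read(), iteration and seek(0).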

First, look at this line:
string = data.read().replace('\n', '')
This reads all of data once, so the file cursor is now at the end of the file.
Then look at this line:
for line in data:
Here you try to read the file again, but nothing is left to read because the cursor is already at the end. The loop body never runs, so word_list stays empty and len(word_list) returns 0.
You then divide by that value and get the error:
ZeroDivisionError: float division by zero
When you comment out the read() line, the file is read only once, by the loop, so the second part of your code works.
Clear now?
So, what to do now?
Call data.seek(0) after data.read() to move the cursor back to the start.
Demo:
>>> a = open('file.txt')
>>> a.read()
#output
>>> a.read()
#nothing
>>> a.seek(0)
>>> a.read()
#output again

Here is a simple fix. Replace the line for line in data: with:
data.seek(0)
for line in data.readlines():
    ...
This moves the file pointer back to the beginning of the file so that the file can be read again, line by line.
While this should work, you may want to simplify the code and read the file only once. Something like:
with open(file, "r") as fin:
    lines = fin.readlines()
string = ''.join(lines).replace('\n', '')
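Building on the read-once idea, the sketch below computes the question's counts from the list of lines, so no second pass over the file is needed. The helper name text_stats and the sample lines are made up for illustration, and trailing punctuation is stripped a little more aggressively than in the question's loop:

```python
from collections import Counter

def text_stats(lines):
    """Compute basic counts from a list of lines (hypothetical helper)."""
    words = [w for line in lines for w in line.split()]
    # Strip trailing punctuation and lowercase each word before counting.
    cleaned = [w.rstrip(",.!?;:").lower() for w in words]
    cleaned = [w for w in cleaned if w]
    text = "".join(lines)
    return {
        "lines": len(lines),
        "blank_lines": sum(1 for line in lines if line.strip() == ""),
        "sentences": sum(text.count(c) for c in ".!?"),
        "words": len(words),
        "most_common": Counter(cleaned).most_common(30),
    }

# Invented sample standing in for fin.readlines().
sample = ["Hello world.\n", "\n", "Hello again!\n"]
stats = text_stats(sample)
one_string = "".join(sample).replace("\n", "")  # the form textstat needs

print(stats["lines"], stats["blank_lines"], stats["sentences"], stats["words"])
```

Both the list of lines and the joined string come from a single read, so the file pointer never needs to be reset.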

Related

What is the simplest way to take the first 1000 or defined number of words from .txt file with Python?

Here is the context for the question: I have a .txt file that contains verses of scripture, line by line. Each line contains a different number of words. Is there a way to take the first 1000 words of the file, write them to a separate file (for instance, Block 1), then create another file with the next 1000 words, and so on, while also ignoring the chapter numbers?
A response would be greatly appreciated, since I am doing this for a personal statistical project.
This should work:
from string import ascii_letters
with open('scripture.txt') as fin:
    text = fin.read()
# Keep only letters and whitespace, which drops chapter numbers and punctuation
valid_characters = ascii_letters + '\n\t '
text = ''.join(t for t in text if t in valid_characters)
text = text.split()
# Round up so the final, shorter block of words is written too
for i in range((len(text) + 999) // 1000):
    with open('part_%03d.txt' % i, 'w') as fout:
        thousand_words = text[i*1000:(i+1)*1000]
        fout.write(' '.join(thousand_words))
with open('scripture_verses.txt') as f:
    words = []
    i = 0
    for line in f:
        for word in line.split():
            words.append(word)
            i += 1
            if i % 1000 == 0:
                with open('out{}.txt'.format(i // 1000), 'w') as out:
                    print(' '.join(words), file=out)
                words = []
    else:
        with open('out{}.txt'.format(i // 1000 + 1), 'w') as out:
            print(' '.join(words), file=out)
        words = []
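The same chunking idea can be sketched with a small generator that also handles the final, shorter block. The function names chunk_words and write_blocks, the block_NNN.txt file names, and the assumption that chapter numbers appear as standalone digit tokens are all illustrative rather than taken from the question:

```python
import os
import tempfile

def chunk_words(words, size=1000):
    """Yield successive lists of at most `size` words."""
    for start in range(0, len(words), size):
        yield words[start:start + size]

def write_blocks(text, out_dir, size=1000):
    """Write each chunk to out_dir/block_NNN.txt and return the paths."""
    # Assumption: chapter numbers are tokens made only of digits.
    words = [w for w in text.split() if not w.isdigit()]
    paths = []
    for number, chunk in enumerate(chunk_words(words, size), start=1):
        path = os.path.join(out_dir, "block_%03d.txt" % number)
        with open(path, "w") as fout:
            fout.write(" ".join(chunk))
        paths.append(path)
    return paths

# Demo with made-up text: 2500 words give blocks of 1000, 1000 and 500.
demo_dir = tempfile.mkdtemp()
demo_paths = write_blocks(" ".join(["word"] * 2500), demo_dir)
print(len(demo_paths))  # 3
```

Stepping through the word list with range(0, len(words), size) avoids separate bookkeeping for the leftover words at the end.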

Increment Number In a String + 1

I put together a Python script that reads the string BatchSequence="NUMBER INCREMENT HERE" and returns just the integers. How can I find a certain integer and increment the rest by one while leaving the integers before it the same? The sequence jumps from 3 to 5. I want it to go 3, 4, 5.
Also,
once I have figured this script out, how can I replace the numbers in the original text file with the new numbers? Would I have to write to a new file?
I have tried incrementing the numbers by one, but that starts from the beginning.
Code that I tried:
import re
file = '\\\MyDataNEE\\user$\\bxt058y\\Desktop\\75736.oxi.error'
counter = 0
for line in open(file):
    match = re.search('BatchSequence="(\d+)"', line)
    if match:
        print(int(match.group(1)) + 1)
Original Code:
import re
file = 'FILENAME HERE'
counter = 0
for line in open(file):
    match = re.search('BatchSequence="(\d+)"', line)
    if match:
        print(match.group(1))
Currently:
BatchSequence="1"
BatchSequence="2"
BatchSequence="3"
BatchSequence="5"
BatchSequence="6"
BatchSequence="7"
BatchSequence="8"
New output should be:
BatchSequence="1"
BatchSequence="2"
BatchSequence="3"
BatchSequence="4"
BatchSequence="5"
BatchSequence="6"
BatchSequence="7"
My take on the problem:
txt = '''BatchSequence="1"
BatchSequence="2"
BatchSequence="3"
BatchSequence="5"
BatchSequence="6"
BatchSequence="7"
BatchSequence="8"'''
import re
def fn(my_number):
    val = yield
    while True:
        val = yield str(val) if val < my_number else str(val-1)
f = fn(4)
next(f)
s = re.sub(r'BatchSequence="(\d+)"', lambda g: 'BatchSequence="' + f.send(int(g.group(1))) + '"', txt)
print(s)
Prints:
BatchSequence="1"
BatchSequence="2"
BatchSequence="3"
BatchSequence="4"
BatchSequence="5"
BatchSequence="6"
BatchSequence="7"
The generator fn(my_number) yields values unchanged while they are below my_number; from my_number onward, each value is decremented by one.
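If the goal is simply a gap-free sequence, a different technique is to ignore the old numbers and renumber every match in order. The sketch below does that with itertools.count; the function name renumber and the sample text are made up for illustration:

```python
import itertools
import re

def renumber(text, start=1):
    """Rewrite every BatchSequence="N" so the values run start, start+1, ..."""
    counter = itertools.count(start)
    return re.sub(
        r'BatchSequence="\d+"',
        lambda match: 'BatchSequence="%d"' % next(counter),
        text,
    )

# Same values as in the question: 4 is missing.
sample = "\n".join('BatchSequence="%d"' % n for n in (1, 2, 3, 5, 6, 7, 8))
print(renumber(sample))
```

To update the file itself, one option is to read its whole text, pass it through renumber, and write the result to a new file; writing to a new file keeps the original intact if something goes wrong.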

Appending into a dictionary within a loop - strange behavior

This might sound banal, but it has been a pain.
I wrote code that parses lines. The .txt file has a line that matches my re.match and a line that doesn't.
cat file.txt
00.00.00 : Blabla
x
In this case I handle the second line by checking for the first letter "x".
def parser():
    path = "file.txt"
    with open (path, 'r+') as file:
        msg = {}
        list = []
        start = 0
        lines = file.readlines()
        for i in range (0,len(lines)):
            line = lines[i]
            if re.match('MY RULES', line) is not None:
                field['date'] = line[:8]
                msg['msg'] = line[start + 2:]
                print msg
            if line.startswith('x'):
                msg['msg'] += line
            list.append(msg)
        print chat
OUTPUT for 2 lines
{'date': '0.0.00', 'msg': 'BlaBla'}
{'msg': 'x'}
The problem is that I can't append the second message, message['msg'], to the previous one when the line starts with "x".
The expected output is:
{'date': '0.0.00', 'msg': 'BlaBlax'}
I also tried this variant, to change the last appended chat entry:
else:
list[len(list) - 1]['msg'] += + line
but then I get the error:
IndexError: list index out of range
I also tried using next(infile) to look ahead at the next line, but then it outputs only every other line.
How would you trick a nested loop into appending to a dict entry?
Cheers
First of all, do not use list as a variable name: it is a built-in type, and you are shadowing it.
Secondly, if I understand correctly, you would like to append to the last result.
Here:
if re.match('MY RULES', line) is not None:
    field['date'] = line[:8]
    msg['msg'] = line[start + 2:]
    print msg
if line.startswith('x'):
    msg['msg'] += line
both checks run against the same line, and in the next iteration msg['msg'] = line[start + 2:] overwrites the msg key of the msg dictionary, discarding the previous value. So this code
field['date'] = line[:8]
msg['msg'] = line[start + 2:]
print msg
always gets executed, even for a plain x line in your input file, and clears the previous value stored under the key msg.
To make it work you need an if/else, although I would recommend storing intermediate values in some other way than in a locally scoped variable.
Full example with the fix:
def parser():
    path = "file.txt"
    with open(path, 'r+') as file:
        msg = {}
        chat = []
        start = 0
        lines = file.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if True:
                if line.startswith('x'):
                    msg['msg'] += line
                else:
                    msg['date'] = line[:8]
                    msg['msg'] = line[12:]
        chat.append(msg)
        print(chat)
parser()
Result:
[{'date': '00.00.00', 'msg': 'Blabla\nx'}]
This assumes that the check if re.match('MY RULES', line) is not None: is True for all the lines in the file, that is:
00.00.00 : Blabla
x
How about this:
path = "file.txt"
with open (path, 'r') as f:
    msg = dict()
    for line in f.readlines():
        if line[0].isdigit():
            tmp = line.split(':')
            date = tmp[0].strip()
            msg[date] = ' '.join(*[x.split() for x in tmp[1:]])
        else:
            msg[date] += ' ' + ' '.join(*[line.split()])
We go line by line. If the first character of the line is a digit, we assume it is a date and add it to our dict; otherwise we append the string to the last dict entry we made. str.split() makes sure you get rid of all the different whitespace characters.
You can certainly replace the if statement in the for loop with your regex. The issue I see with your implementation in general is that as soon as the input varies slightly (e.g. more whitespace characters than intended), your solution produces faulty results. Basic Python string manipulations are really powerful ;)
Update
This should produce the right output:
*file.txt*
00.00.00 : Blabla
x
00.00.00 : Blabla2
x2
path = "file.txt"
with open (path, 'r') as f:
    lst = list()
    for line in f.readlines():
        if line[0].isdigit():
            tmp = line.split(':')
            date = tmp[0].strip()
            msg = {date: ' '.join(*[x.split() for x in tmp[1:]])}
            lst.append(msg)
        else:
            msg[date] += ' ' + ' '.join(*[line.split()])
print(lst)
>>> [{'00.00.00': 'Blabla x'}, {'00.00.00': 'Blabla2 x2'}]
I missed the part that you want to store each pair separately in a dict and append it to a list.
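A regex-based variant of the same idea is sketched below: each header line starts a fresh dict, and any other line is appended to the most recent one. The pattern for the "00.00.00 : text" format, the function name parse_chat, and the choice to ignore continuation lines that appear before any header are all assumptions made for illustration:

```python
import re

# Assumed header format: "00.00.00 : text". Anything else continues
# the previous message.
HEADER = re.compile(r'^(\d{2}\.\d{2}\.\d{2})\s*:\s*(.*)$')

def parse_chat(lines):
    chat = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = HEADER.match(line)
        if match:
            # A new message starts: create a fresh dict each time so that
            # earlier entries in the list are not overwritten.
            chat.append({"date": match.group(1), "msg": match.group(2)})
        elif chat:
            # Continuation line: extend the most recent message.
            chat[-1]["msg"] += line
        # Continuation lines before any header are silently ignored.
    return chat

print(parse_chat(["00.00.00 : Blabla\n", "x\n"]))
# [{'date': '00.00.00', 'msg': 'Blablax'}]
```

Guarding the continuation branch with elif chat is what avoids the IndexError from the question, which occurs when the list is still empty.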

How can I print the lines of a file that are specified by a list of numbers in Python?

I want to open a dictionary file and pull specific lines from it. The lines are specified by a list of numbers, and at the end I need to print a complete sentence on one line.
The dictionary file has one word per line.
I want to print the selected words as a sentence on one line, with a space between the words:
N = ['19','85','45','14']
file = open("DICTIONARY", "r")
my_sentence = #?????????
print my_sentence
If your DICTIONARY is not too big (i.e. it fits in memory):
N = [19,85,45,14]
with open("DICTIONARY", "r") as f:
    words = f.readlines()
my_sentence = " ".join([words[i].strip() for i in N])
EDIT: A small clarification, the original post didn't use space to join the words, I've changed the code to include it. You can also use ",".join(...) if you need to separate the words by a comma, or any other separator you might need. Also, keep in mind that this code uses zero-based line index so the first line of your DICTIONARY would be 0, the second would be 1, etc.
UPDATE:: If your dictionary is too big for your memory, or you just want to consume as little memory as possible (if that's the case, why would you go for Python in the first place? ;)) you can only 'extract' the words you're interested in:
N = [19, 85, 45, 14]
words = {}
word_indexes = set(N)
counter = 0
with open("DICTIONARY", "r") as f:
    for line in f:
        if counter in word_indexes:
            words[counter] = line.strip()
        counter += 1
my_sentence = " ".join([words[i] for i in N])
You can use linecache.getline to fetch the specific line numbers you want (line numbers are 1-based and should be integers):
import linecache
sentence = []
for line_number in N:
    word = linecache.getline('DICTIONARY', line_number)
    sentence.append(word.strip('\n'))
sentence = " ".join(sentence)
Here's a simple one with a more basic approach:
n = ['2','4','7','11']
file = open("DICTIONARY")
counter = 1  # use 1 to count lines in DICTIONARY from 1, otherwise use 0
output = ""
for line in file:
    line = line.rstrip()  # remove the trailing \n; otherwise every word prints on a new line
    if str(counter) in n:
        output += line + " "
    counter += 1
print output[:-1]  # the slice drops the trailing space after the last word (optional)
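The answers above differ in whether line numbers start at 0 or 1 and in whether the words come out in list order or file order. The sketch below fixes both choices explicitly: it uses 1-based numbers, accepts them as strings or integers, and keeps the order of the list. The function name sentence_from_lines and the throwaway demo file are made up for illustration:

```python
import os
import tempfile

def sentence_from_lines(path, numbers):
    """Join the words on the given 1-based line numbers, in the order listed."""
    numbers = [int(n) for n in numbers]  # accept '19' as well as 19
    wanted = set(numbers)
    found = {}
    with open(path) as f:
        for number, line in enumerate(f, start=1):
            if number in wanted:
                found[number] = line.strip()
    # A number past the end of the file raises KeyError here.
    return " ".join(found[n] for n in numbers)

# Demo with a throwaway four-word dictionary file.
handle, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(handle, "w") as f:
    f.write("alpha\nbeta\ngamma\ndelta\n")

demo_sentence = sentence_from_lines(demo_path, ["3", "1", "4"])
print(demo_sentence)  # gamma alpha delta
```

Only the requested lines are kept in memory, so this also works for a dictionary file that is too large to load whole.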

Python Beginning Program Dictionary and List Issue

Write a program that reads the contents of a random text file. The program should create a dictionary in which the keys are individual words found in the file and the values are the number of times each word appears.
How would I go about doing this?
def main():
    c = 0
    dic = {}
    words = set()
    inFile = open('text2', 'r')
    for line in inFile:
        line = line.strip()
        line = line.replace('.', '')
        line = line.replace(',', '')
        line = line.replace("'", '') #strips the punctuation
        line = line.replace('"', '')
        line = line.replace(';', '')
        line = line.replace('?', '')
        line = line.replace(':', '')
        words = line.split()
        for x in words:
            for y in words:
                if x == y:
                    c += 1
                    dic[x] = c
    print(dic)
    print(words)
    inFile.close()
main()
Sorry for the vague question. Never asked any questions here before. This is what I have so far. Also, this is the first ever programming I've done so I expect it to be pretty terrible.
with open('path/to/file') as infile:
    # code goes here
That's how you open a file.
for line in infile:
    # code goes here
That's how you read a file line-by-line.
line.strip().split()
That's how you split a line into (white-space separated) words.
some_dictionary['abcd']
That's how you access the key 'abcd' in some_dictionary.
Questions for you:
What does it mean if you can't access the key in a dictionary?
What error does that give you? Can you catch it with a try/except block?
How do you increment a value?
Is there some function that GETS a default value from a dict if the key doesn't exist?
For what it's worth, there's also a function that does almost exactly this, but since this is pretty obviously homework it won't fulfill your assignment requirements anyway. It's in the collections module. If you're interested, try and figure out what it is :)
There are at least three different approaches to adding a new word to the dictionary and counting the number of occurrences in the file.
def add_element_check1(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 1
        else:
            my_dict[e] += 1
def add_element_check2(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 0
        my_dict[e] += 1
def add_element_except(my_dict, elements):
    for e in elements:
        try:
            my_dict[e] += 1
        except KeyError:
            my_dict[e] = 1
my_words = {}
with open('pathtomyfile.txt', 'r') as in_file:
    for line in in_file:
        words = [word.strip().lower() for word in line.strip().split()]
        add_element_check1(my_words, words)
        #or add_element_check2(my_words, words)
        #or add_element_except(my_words, words)
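If you are wondering which is the fastest, the answer is: it depends on how often a given word occurs in the file. The try/except version is cheap when the key already exists and comparatively expensive when KeyError is raised, so it tends to pay off when most words repeat many times; when most words occur only once or twice, the explicit membership check is usually the safer choice.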
I have done some simple benchmarks here
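A rough way to measure this yourself is sketched below with timeit. The two workloads are invented (one with heavy repetition, one where every word is new), the function names are illustrative, and the timings will vary by machine and Python version, so no results are quoted:

```python
import timeit

def count_with_check(words):
    counts = {}
    for w in words:
        if w not in counts:
            counts[w] = 0
        counts[w] += 1
    return counts

def count_with_except(words):
    counts = {}
    for w in words:
        try:
            counts[w] += 1
        except KeyError:
            counts[w] = 1
    return counts

# Two made-up workloads: few distinct words vs. every word distinct.
repeated = ["spam", "eggs"] * 5000
distinct = ["w%d" % i for i in range(10000)]

for label, words in (("repeated", repeated), ("distinct", distinct)):
    for fn in (count_with_check, count_with_except):
        # timeit calls the lambda immediately, so fn and words are the
        # current loop values.
        seconds = timeit.timeit(lambda: fn(words), number=20)
        print("%-8s %-18s %.4f s" % (label, fn.__name__, seconds))
```

Both functions produce the same dictionary; only the cost of handling a missing key differs.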
This is a perfect job for Python's built-in collections module. From it, you can import Counter, a dictionary subclass made for just this.
How you process your data is up to you. One way to do it would be something like this:
from collections import Counter
# Open your file and read its contents
with open("yourfile.txt", "r") as infile:
    textData = infile.read()
# Replace characters you don't want with empty strings
textData = textData.replace(".", "")
textData = textData.replace(",", "")
# Split on any run of whitespace, including newlines
textList = textData.split()
# Put your data into the Counter container datatype
dic = Counter(textList)
# Print out the results
for key, value in dic.items():
    print("Word: %s\n Count: %d\n" % (key, value))
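As a variation, the punctuation handling can be folded into a regular expression, and Counter.most_common gives the top words directly. This is only a sketch: the function name word_counts, the sample sentence, and the decision to treat letters, digits and apostrophes as word characters are assumptions:

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercase words; anything that is not a letter, digit or
    apostrophe acts as a separator."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

counts = word_counts("The cat saw the other cat. The end!")
print(counts.most_common(2))  # [('the', 3), ('cat', 2)]
```

Lowercasing first means "The" and "the" are counted as the same word, which the replace-based version does not do.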
Hope this helps!
Matt
