Reading a file by word without using split in Python

I have a one line file that I want to read word by word, i.e., with space separating words. Is there a way to do this without loading the data into the memory and using split? The file is too large.

You can read the file character by character and yield a word each time you hit a space. Below is a simple solution for a file whose words are separated by single spaces; you should refine it for more complex cases (tabs, multiple spaces, etc.).
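def read_words(filename):
    with open(filename) as f:
        out = ''
        while True:
            c = f.read(1)  # read a single character
            if not c:
                break  # end of file
            elif c == ' ':
                yield out
                out = ''
            else:
                out += c
        if out:
            yield out  # emit the last word, which has no trailing space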
Example:
for i in read_words("test"):
    print(i)
It uses a generator to avoid having to allocate a big chunk of memory.
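One possible refinement for the complex cases mentioned above (tabs, newlines, runs of spaces) is to read the file in fixed-size chunks and carry any unfinished word over to the next chunk. The sketch below is an illustration, not part of the original answer: the name iter_words and the chunk_size parameter are made up here, and it does call str.split, but only on one small chunk at a time, so memory use stays bounded as long as no single word is huge.

```python
def iter_words(path, chunk_size=64 * 1024):
    """Yield whitespace-separated words while reading the file in chunks."""
    tail = ''  # unfinished word carried over from the previous chunk
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file
            chunk = tail + chunk
            parts = chunk.split()  # splits on any run of whitespace
            if chunk[-1].isspace():
                tail = ''  # the chunk ended exactly on a word boundary
            else:
                # the last piece may continue in the next chunk
                tail = parts.pop() if parts else ''
            yield from parts
    if tail:
        yield tail  # final word when the file does not end in whitespace
```

It is used the same way as read_words, for example: for word in iter_words("test"): print(word).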

Try this little function:
def readword(file):
    word = ''
    while True:
        c = file.read(1)
        # stop at a separator or at end of file (read returns '' there)
        if c in ('', ' ', '\n'):
            return word
        word += c
Then to use it, you can do something like:
f = open('file.ext', 'r')
print(readword(f))
This will read the first word in the file, so if your file is like this:
12 22 word x yy
another word
...
then the output should be 12.
Next time you call this function, it will read the next word, and so on...

Related

How can I go to the next line in .txt file?

How can I read only the first symbol in each line without reading the whole line, using Python?
For example, if I have file like:
apple
pear
watermelon
In each iteration I must store only one (the first) letter of the line.
The result of the program should be ["a","p","w"]. I tried to use file.seek(), but how can I move it to the next line?
ti7's answer is great, but if the lines might be too long to hold in memory, you might wish to read char by char so that a whole line is never stored:
from pathlib import Path
from typing import Iterator
NEWLINE_CHAR = {'\n', '\r'}
def first_chars(file_path: Path) -> Iterator[str]:
    with open(file_path) as fh:
        new_line = True
        while c := fh.read(1):
            if c in NEWLINE_CHAR:
                new_line = True
            elif new_line:
                yield c
                new_line = False
Test:
path = Path('/some/path/a.py')
easy_first_chars = [l[0] for l in path.read_text().splitlines() if l]
smart_first_chars = list(first_chars(path))
assert smart_first_chars == easy_first_chars
File-like objects are iterable, so you can use them directly like this:
collection = []
with open("input.txt") as fh:
    for line in fh:  # iterate by-lines over file-like
        line = line.rstrip("\n")  # a blank line is "\n", not "", so strip it first
        try:
            collection.append(line[0])  # get the first char in the line
        except IndexError:  # line has no chars
            pass  # consider other handling
# work with collection
You may also consider enumerate() if you care about which line a particular value was on, or yielding line[0] to form a generator (which may allow a more efficient process if it can halt before reading the entire file):
def my_generator():
    with open("input.txt") as fh:
        for lineno, line in enumerate(fh, 1):  # lines are commonly 1-indexed
            line = line.rstrip("\n")  # a blank line is "\n", not ""
            try:
                yield lineno, line[0]  # first char in the line
            except IndexError:  # line has no chars
                pass  # consider other handling
for lineno, first_letter in my_generator():
    # work with lineno and first_letter here and break when done
    print(lineno, first_letter)
You can read one letter at a time with file.read(1):
letters = []
# Initialized to '\n' so that the first letter of the file is stored
previous = '\n'
with open(filepath, "r") as file:
    while True:
        # Read only one letter
        letter = file.read(1)
        if letter == '':
            break
        elif previous == '\n' and letter != '\n':
            # Store the first letter after a line break '\n', skipping blank lines
            letters.append(letter)
        previous = letter
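A middle ground between reading whole lines and reading one character at a time is readline with a size limit, which returns at most that many characters and stops early at a newline. The sketch below is only an illustration under that assumption; the function name first_chars_bounded and the chunk parameter are hypothetical and do not come from any of the answers above.

```python
def first_chars_bounded(path, chunk=4096):
    """Yield the first character of each non-blank line, holding at most
    `chunk` characters of any line in memory at a time."""
    with open(path) as fh:
        while True:
            piece = fh.readline(chunk)  # at most `chunk` chars of the line
            if not piece:
                return  # end of file
            if piece != '\n':
                yield piece[0]  # first character of a non-blank line
            # discard the rest of a line longer than `chunk` characters
            while not piece.endswith('\n'):
                piece = fh.readline(chunk)
                if not piece:
                    return  # last line had no trailing newline
```

For the apple / pear / watermelon file from the question, list(first_chars_bounded(path)) gives the three first letters without ever holding a full long line.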

find words in txt files Python 3

I'd like to create a program in Python 3 that finds how many times specific words appear in txt files and then builds an Excel table with these values.
I wrote this function, but when I call it with my input, the program doesn't work and this message appears: unindent does not match any outer indentation level
def wordcount(filename, listwords):
    try:
        file = open(filename, "r")
        read = file.readlines()
        file.close()
        for x in listwords:
            y = x.lower()
            counter = 0
            for z in read:
                line = z.split()
                for ss in line:
                    l = ss.lower()
                    if y == l:
                        counter += 1
            print(y, counter)
    except OSError as err:  # a try block needs a matching except clause
        print(err)
Now I try to call the function with a txt file and the word to find:
wordcount("aaa.txt" , 'word' )
As output I'd like to see:
word 4
Thanks to everybody!
Here is an example you can use to find the number of times a specific word occurs in a text file:
def searching(filename, word):
    counter = 0
    with open(filename) as f:
        for line in f:
            hits = line.split().count(word)  # whole-word matches on this line
            if hits:
                print(word)
                counter += hits
    return counter
x = searching("filename", "wordtofind")
print(x)
The output will be the word you are looking for, printed once per matching line, followed by the number of times it occurs.
As short as possible:
def wordcount(filename, listwords):
    with open(filename) as file_object:
        file_text = file_object.read()
    # note: str.count counts substring matches, so 'a' also matches inside 'cat'
    return {word: file_text.count(word) for word in listwords}
for word, count in wordcount('aaa.txt', ['a', 'list', 'of', 'words']).items():
    print("Count of {}: {}".format(word, count))
Getting back to mij's comment about passing listwords as an actual list: if you pass a string to code that expects a list, Python will iterate over the string as a sequence of characters, which can be confusing if this behaviour is unfamiliar.
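If whole-word, case-insensitive counts are what you are after (as in the question's own loop), one way to get them without loading the file at once is to stream it line by line into a collections.Counter. This is a sketch under assumptions: the function name count_words is made up for illustration, and words are taken to be whitespace-separated tokens, so a token such as "word." with trailing punctuation is not counted as "word".

```python
from collections import Counter

def count_words(filename, listwords):
    """Return {word: occurrences} using case-insensitive whole-token matches."""
    wanted = {w.lower() for w in listwords}
    counts = Counter()
    with open(filename) as fh:
        for line in fh:  # one line at a time, never the whole file
            counts.update(tok for tok in line.lower().split() if tok in wanted)
    # a Counter returns 0 for words that never appeared
    return {w: counts[w.lower()] for w in listwords}
```

Note that listwords must be a real list such as ['word']; passing the bare string 'word' would count the letters w, o, r and d separately.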

Printing character percent in a text file

I just wrote a function which prints the percentage of each character in a text file. However, I have a problem: my program counts uppercase characters as different characters and also counts spaces, so the result is wrong. How can I fix this?
def count_char(text, char):
    count = 0
    for character in text:
        if character == char:
            count += 1
    return count
filename = input("Enter the file name: ")
with open(filename) as file:
    text = file.read()
for char in "abcdefghijklmnopqrstuvwxyz":
    perc = 100 * count_char(text, char) / len(text)
    print("{0} - {1}%".format(char, round(perc, 2)))
You should try making the text lower case using text.lower() and then to avoid spaces being counted you should split the string into a list using: text.lower().split(). This should do:
def count_char(text, char):
    count = 0
    for word in text.lower().split():  # this iterates returning every word in the text
        for character in word:  # this iterates returning every character in each word
            if character == char:
                count += 1
    return count
filename = input("Enter the file name: ")
with open(filename) as file:
    text = file.read()
totalChars = sum([len(i) for i in text.lower().split()])
for char in "abcdefghijklmnopqrstuvwxyz":
    perc = 100 * count_char(text, char) / totalChars
    print("{0} - {1}%".format(char, round(perc, 2)))
Notice the change in the definition of perc: sum([len(i) for i in text.lower().split()]) returns the number of characters in a list of words, whereas len(text) also counts spaces.
You can use a counter and a generator expression to count all letters like so:
from collections import Counter
with open(fn) as f:
    c=Counter(c.lower() for line in f for c in line if c.isalpha())
Explanation of the generator expression, piece by piece:
Counter(...)       create a counter
c.lower()          take each character and make it lower case
for line in f      read one line at a time from the file
for c in line      iterate over the line one character at a time
if c.isalpha()     only add the character if it is a letter (isalpha is also true for non-ASCII letters)
Then get the total letter counts:
total_letters=float(sum(c.values()))
Then the total percent of any letter is c[letter] / total_letters * 100
Note that the Counter c only has letters -- not spaces. So the calculated percent of each letter is the percent of that letter of all letters.
The advantages here:
You are reading the entire file anyway to get the count of the character in question and the total of all characters, so you might as well count the frequency of every character as you read it;
You do not need to read the entire file into memory. Reading it all at once is fine for smaller files but not for larger ones;
A Counter will correctly return 0 for letters not in the file;
Idiomatic Python.
So your entire program becomes:
from collections import Counter
with open(fn) as f:
    c=Counter(c.lower() for line in f for c in line if c.isalpha())
total_letters=float(sum(c.values()))
for char in "abcdefghijklmnopqrstuvwxyz":
    print("{} - {:.2%}".format(char, c[char] / total_letters))
You want to make the text lower case before counting the char:
def count_char(text, char):
    count = 0
    for character in text.lower():
        if character == char:
            count += 1
    return count
You can use the built in .count function to count the characters after converting everything to lowercase via .lower. Additionally, your current program doesn't work properly as it doesn't exclude spaces and punctuation when calling the len function.
import string
filename = input("Enter the file name: ")
with open(filename) as file:
    text = file.read().lower()
chars = {char:text.count(char) for char in string.ascii_lowercase}
allLetters = float(sum(chars.values()))
for char in chars:
    print("{} - {}%".format(char, round(chars[char]/allLetters*100, 2)))
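The snippets above divide by the total number of letters, which fails with a ZeroDivisionError when the text contains no letters at all. The sketch below wraps the same counting idea in a function that guards against that case. It is an illustration only: the name letter_percentages is hypothetical, and it assumes that only the ASCII letters a to z should be counted.

```python
import string
from collections import Counter

def letter_percentages(text):
    """Map each letter a-z to its share (in percent) of all a-z letters."""
    letters = set(string.ascii_lowercase)
    counts = Counter(ch for ch in text.lower() if ch in letters)
    total = sum(counts.values())
    if total == 0:
        # no letters at all: report 0.0 everywhere instead of dividing by zero
        return {letter: 0.0 for letter in string.ascii_lowercase}
    return {letter: 100 * counts[letter] / total
            for letter in string.ascii_lowercase}
```

Each value can then be printed with the same formatting as above, for example "{} - {}%".format(char, round(result[char], 2)).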

How can I print the lines of a file that are specified by a list of numbers in Python?

I open a dictionary file and pull specific lines from it. The lines are specified by a list, and at the end I need to print a complete sentence on one line.
I want to open a dictionary that has one word on each line,
then print a sentence on one line with a space between the words:
N = ['19','85','45','14']
file = open("DICTIONARY", "r")
my_sentence = #?????????
print my_sentence
If your DICTIONARY is not too big (i.e. can fit your memory):
N = [19,85,45,14]
with open("DICTIONARY", "r") as f:
    words = f.readlines()
my_sentence = " ".join([words[i].strip() for i in N])
EDIT: A small clarification, the original post didn't use space to join the words, I've changed the code to include it. You can also use ",".join(...) if you need to separate the words by a comma, or any other separator you might need. Also, keep in mind that this code uses zero-based line index so the first line of your DICTIONARY would be 0, the second would be 1, etc.
UPDATE: If your dictionary is too big for your memory, or you just want to consume as little memory as possible (if that's the case, why would you go for Python in the first place? ;)), you can 'extract' only the words you're interested in:
N = [19, 85, 45, 14]
words = {}
word_indexes = set(N)
counter = 0
with open("DICTIONARY", "r") as f:
    for line in f:
        if counter in word_indexes:
            words[counter] = line.strip()
        counter += 1
my_sentence = " ".join([words[i] for i in N])
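You can use linecache.getline to fetch the specific line numbers you want (note that linecache counts lines from 1):
import linecache
sentence = []
for line_number in N:
    # getline expects an integer line number, so convert in case N holds strings
    word = linecache.getline('DICTIONARY', int(line_number))
    sentence.append(word.strip('\n'))
sentence = " ".join(sentence)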
Here's a simple one with a more basic approach:
n = ['2','4','7','11']
file = open("DICTIONARY")
counter = 1  # 1 if you're gonna count lines in DICTIONARY from 1, else 0 is used
output = ""
for line in file:
    line = line.rstrip()  # rstrip() deletes the \n character; without it, every word is printed on a new line
    if str(counter) in n:
        output += line + " "
    counter += 1
file.close()
print(output[:-1])  # slicing deletes the white space after the last word in the string (optional)
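The streaming approaches above keep reading to the end of the file even after the last requested line has been found. The sketch below stops early and keeps the words in the order given by the list. It rests on assumptions that are not in the answers: the function name words_at_lines is made up, line numbers are taken to be 1-based integers, and numbers beyond the end of the file are silently skipped.

```python
def words_at_lines(path, line_numbers):
    """Join the words found at the given 1-based line numbers, in list order."""
    wanted = set(line_numbers)
    last = max(wanted, default=0)  # nothing to read past this line
    found = {}
    with open(path) as fh:
        for lineno, line in enumerate(fh, 1):
            if lineno in wanted:
                found[lineno] = line.strip()
            if lineno >= last:
                break  # stop as soon as the highest requested line is read
    return " ".join(found[n] for n in line_numbers if n in found)
```

With the list from the question converted to integers, the call would look like my_sentence = words_at_lines("DICTIONARY", [19, 85, 45, 14]).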

counting letters in a text file in python

So I'm trying to do this problem
Write a program that reads a file named text.txt and prints the following to the screen:
- The number of characters in that file
- The number of letters in that file
- The number of uppercase letters in that file
- The number of vowels in that file
This is what I have so far, but I am stuck on step 2.
file = open('text.txt', 'r')
lineC = 0
chC = 0
lowC = 0
vowC = 0
capsC = 0
for line in file:
    for ch in line:
        words = line.split()
        lineC += 1
        chC += len(ch)
for letters in file:
    for ch in line:
        pass  # stuck here
print("Character Count = " + str(chC))
print("Letter Count = " + str(num))
You can do this using regular expressions: find all occurrences of your pattern as a list, and then take the length of that list.
import re
with open('text.txt') as f:
    text = f.read()
characters = len(re.findall(r'\S', text))  # raw string avoids an invalid-escape warning
letters = len(re.findall('[A-Za-z]', text))
uppercase = len(re.findall('[A-Z]', text))
vowels = len(re.findall('[AEIOUYaeiouy]', text))
The answer above uses regular expressions, which are very useful and worth learning about if you haven't used them before. Bunji's code is also more efficient, as looping through characters in a string in Python is relatively slow.
However, if you want to try doing this using just Python, take a look at the code below. A couple of points: First, wrap your open() inside a with statement, which will automatically call close() on the file when you are finished. Next, notice that Python lets you use the in keyword in all kinds of interesting ways. Anything that is a sequence can be "in-ed", including strings. You could replace all of the string.xxx lines with your own string if you would like.
import string
chars = []
with open("notes.txt", "r") as f:
    for c in f.read():
        chars.append(c)
num_chars = len(chars)
num_upper = 0
num_vowels = 0
num_letters = 0
vowels = "aeiouAEIOU"
for c in chars:
    if c in vowels:
        num_vowels += 1
    if c in string.ascii_uppercase:
        num_upper += 1
    if c in string.ascii_letters:
        num_letters += 1
print(num_chars)
print(num_letters)
print(num_upper)
print(num_vowels)
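Both answers read the whole file into memory first. The same four counts can also be collected in a single pass, line by line, as in the sketch below. It is an illustration under stated assumptions: the name text_stats is hypothetical, characters include spaces and newlines, vowels are a, e, i, o, u only (y is not counted), and letters are whatever str.isalpha accepts, which includes non-ASCII letters.

```python
def text_stats(path):
    """Count characters, letters, uppercase letters and vowels in one pass."""
    vowels = set("aeiouAEIOU")  # assumption: y is not treated as a vowel
    stats = {"characters": 0, "letters": 0, "uppercase": 0, "vowels": 0}
    with open(path) as fh:
        for line in fh:  # only one line is held in memory at a time
            stats["characters"] += len(line)  # includes spaces and newlines
            for ch in line:
                if ch.isalpha():
                    stats["letters"] += 1
                if ch.isupper():
                    stats["uppercase"] += 1
                if ch in vowels:
                    stats["vowels"] += 1
    return stats
```

Calling text_stats('text.txt') returns a dictionary, so each count can be printed on its own line as the exercise asks.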
