I have multiple files, each containing a single line with, say, ~10M numbers. I want to check each file and print a 0 for each file that has repeated numbers and a 1 for each that doesn't.
I am using a list for counting frequency. Because of the large number of values per line, I want to update the frequency after accepting each number and break as soon as I find a repeated one. While this is simple in C, I have no idea how to do this in Python.
How do I input a line word by word without storing (or taking as input) the whole line?
EDIT: I also need a way to do this from live input rather than a file.
Read the line, split the line, copy the resulting list into a set. If the size of the set is less than the size of the list, the file contains repeated elements.
with open('filename', 'r') as f:
    for line in f:
        # Here is where you do what I said above
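A minimal sketch of that comparison, assuming each file holds one line of whitespace-separated numbers:
def has_repeats(line):
    nums = line.split()
    # a set drops duplicates, so a smaller set means repeats exist
    return len(set(nums)) < len(nums)

with open('filename', 'r') as f:
    for line in f:
        print(0 if has_repeats(line) else 1)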
To read the file word by word, try this:
import itertools

def readWords(file_object):
    word = ""
    # read one character at a time until read(1) returns '' at EOF
    for ch in itertools.takewhile(bool, map(file_object.read, itertools.repeat(1))):
        if ch.isspace():
            if word:  # in case of multiple spaces
                yield word
                word = ""
            continue
        word += ch
    if word:
        yield word  # handles last word before EOF
Then you can do:
with open('filename', 'r') as f:
    for num in map(int, readWords(f)):
        # Store the numbers in a set, and use the set to check if the number already exists
This method should also work for streams because it only reads one character at a time and yields whitespace-delimited words from the input stream.
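For the live-input case from the question's edit, the same generator can be pointed at sys.stdin. A minimal sketch of the early-exit duplicate check:
import sys

seen = set()
# stop at the first repeated number instead of reading the whole stream
for num in map(int, readWords(sys.stdin)):
    if num in seen:
        print(0)  # repeated number found
        break
    seen.add(num)
else:
    print(1)  # the loop finished without finding a repeat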
After giving this answer, I've updated this method quite a bit. Have a look
https://gist.github.com/smac89/bddb27d975c59a5f053256c893630cdc
What you are asking for is not possible, I guess. You can't read word by word as such in Python. Something like this can be done:
f = open('words.txt')
for word in f.read().split():
    print(word)
I'm pretty new to this and I was trying to write a program which counts the words in .txt files. There is probably a better way of doing this, but this was the idea I came up with, so I wanted to go through with it. I just don't understand why i, or any other variable, doesn't work as an index for the string of the page that I'm counting on...
Do you guys have a solution, or should I just take a different approach?
page = open("venv\harrry_potter.txt", "r")
alphabet = "qwertzuiopüasdfghjklöäyxcvbnmßQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM"

# Counting the characters
list_of_lines = page.readlines()
characternum = 0
textstr = ""  # to convert the .txt file to string
for line in list_of_lines:
    for character in line:
        characternum += 1
        textstr += character

# Counting the words
i = 0
wordnum = 1
while i <= characternum:
    if textstr[i] not in alphabet and textstr[i+1] in alphabet:
        wordnum += 1
    i += 1
print(wordnum)
page.close()
Counting the characters and converting the .txt file to a string is done a bit weirdly, because I thought the other way could be the source of the problem...
Can you help me, please?
Typically you want to use split for simple word counting. The way you are doing it, you will count right-minded as two words, or don't as two words. If you can rely on spaces alone, then you can just use split like this:
book = "Hello, my name is Inigo Montoya, you killed my father, prepare to die."
words = book.split()
print(f'word count = {len(words)}')
You can also pass arguments to split for more options if the default behavior doesn't suit you.
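For example, a small demo of those arguments:
line = "alpha,beta,,gamma"
print(line.split(','))      # ['alpha', 'beta', '', 'gamma'] - empty strings are kept
print(line.split(',', 1))   # maxsplit=1 -> ['alpha', 'beta,,gamma']
print("  a  b ".split())    # no arguments: runs of whitespace collapse -> ['a', 'b']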
https://pythonexamples.org/python-count-number-of-words-in-text-file/
You want to get the word count of a text file
The shortest code is this (that I could come up with):
with open('lorem.txt', 'r') as file:
    print(len(file.read().split()))
For smaller files this is fine, but it loads all of the data into memory, so it's not great for large files. Also, use a context manager (with); it helps with error handling and other things. What happens here: file.read() reads the whole file and returns it as a string, .split() splits that string on whitespace and returns a list of the words, and you print the length of that list.
A better approach would be this:
word_count = 0
with open('lorem.txt', 'r') as file:
    for line in file:
        word_count += len(line.split())
print(word_count)
Here the whole file is not loaded into memory: you read each line separately, and each new line replaces the previous one in memory. Again, each line is split on whitespace, the length of the returned list is added to the running word count, and at the end the total is printed.
Useful sources:
about with
Context Managers - Efficiently Managing Resources (to learn how they work a bit in detail) by Corey Schafer
.split() "docs"
I have the following text file:
abstract 233:1 253:1 329:2 1087:2 1272:1
game 64:1 99:1 206:1 595:1
direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1
Each key pair records how many times each string appears in a document, in the form [docID]:[stringFq].
How could you calculate the number of key pairs in this text file?
Your regex approach works fine. Here is an iterative approach. If you uncomment the print statements, you will see some intermediate results.
Given
%%file foo.txt
abstract 233:1 253:1 329:2 1087:2 1272:1
game 64:1 99:1 206:1 595:1
direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1
Code
import itertools as it

with open("foo.txt") as f:
    lines = f.readlines()
    #print(lines)

pred = lambda x: x.isalpha()

count = 0
for line in lines:
    line = line.strip("\n")
    line = "".join(it.dropwhile(pred, line))
    pairs = line.strip().split(" ")
    #print(pairs)
    count += len(pairs)

count
# 15
Details
First we use a with statement, which is an idiom for safely opening and closing files. We then split the file into lines via readlines(). We define a conditional function (or predicate) that we will use later. The lambda expression is used for convenience and is equivalent to the following function:
def pred(x):
    return x.isalpha()
We initialize a count variable and start iterating each line. Every line may have a trailing newline character \n, so we first strip() them away before feeding the line to dropwhile.
dropwhile is a special itertools iterator. As it iterates a line, it will discard any leading characters that satisfy the predicate until it reaches the first character that fails the predicate. In other words, all letters at the start will be dropped until the first non-letter is found (which happens to be a space). We clean the new line again, stripping the leading space, and the remaining string is split() into a list of pairs.
Finally the length of each line of pairs is incrementally added to count. The final count is the sum of all lengths of pairs.
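To see dropwhile in isolation, here is a small demo on the first line of the sample file:
import itertools as it

line = "abstract 233:1 253:1 329:2 1087:2 1272:1"
rest = "".join(it.dropwhile(str.isalpha, line))
print(repr(rest))  # ' 233:1 253:1 329:2 1087:2 1272:1' - the leading word is dropped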
Summary
The code above shows how to tackle basic file handling with simple, iterative steps:
open the file
split the file into lines
while iterating each line, clean and process data
output a result
import re

with open('input.txt', 'r') as f:
    text = f.read()

# finds all numbers (ints and decimals) in the text file
numbers = re.findall(r"[-+]?\d*\.\d+|\d+", text)

# counts all numbers; I needed to count pairs, so I just divided by 2
numLen = len(numbers) // 2
print(numLen)
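An alternative sketch that matches the [docID]:[stringFq] pairs directly, so no division by 2 is needed:
import re

with open('input.txt', 'r') as f:
    text = f.read()

# match each docID:stringFq pair as a single unit
pairs = re.findall(r"\d+:\d+", text)
print(len(pairs))  # 15 for the sample file above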
I need to insert unique words into a text file: one word per line, and each word must be unique in the whole file.
Whenever a new word arrives in the variable word, I need a way to check whether it already exists in the file. If it exists, another word will be picked and checked again, and so on until a unique word comes up.
How can I do that?
By the way, I was doing:
import mmap

newword = "learn"
f = open('wordlist.txt')
s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
while s.find(newword) != -1:
    return newword
else:
    return newword
Thanks in advance.
Assuming the file isn't too large, you could just read it all to memory, and check if the word is there:
def word_in_file(filename, word):
    with open(filename) as f:
        words = f.read().splitlines()
    return word in words
If you're going to do this more than once, you'd be better off keeping the words list in memory and appending each word you add to the file, instead of reading the file multiple times.
Additionally, creating a set from words should improve the search time and may be worth the "effort" if you're adding multiple words to the file.
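A sketch of that combined idea, using the wordlist.txt filename from the question and a hypothetical helper add_if_unique:
# load the file once into a set for fast membership checks
with open('wordlist.txt') as f:
    existing = set(f.read().split())

def add_if_unique(word, filename='wordlist.txt'):
    # append word to the file only if it has not been seen before
    if word in existing:
        return False
    existing.add(word)
    with open(filename, 'a') as f:
        f.write(word + '\n')
    return True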
Basically, I want to be able to count the number of characters in a txt file (with user input of file name). I can get it to display how many lines are in the file, but not how many characters. I am not using the len function and this is what I have:
def length(n):
    value = 0
    for char in n:
        value += 1
    return value

filename = input('Enter the name of the file: ')
f = open(filename)
for data in f:
    data = length(f)
print(data)
All you need to do is sum the number of characters in each line (data):
total = 0
for line in f:
    data = length(line)
    total += data
print(total)
There are two problems.
First, for each line in the file, you're passing f itself (that is, a sequence of lines) to length. That's why it's printing the number of lines in the file: the length of that sequence of lines is the number of lines in the file.
To fix this, you want to pass each line, data, which is a sequence of characters. So:
for data in f:
    print(length(data))
Next, while that will properly calculate the length of each line, you have to add them all up to get the length of the whole file. So:
total_length = 0
for data in f:
    total_length += length(data)
print(total_length)
However, there's another way to tackle this that's a lot simpler. If you read() the file, you will get one giant string, instead of a sequence of separate lines. So you can just call length once:
data = f.read()
print(length(data))
The problem with this is that you have to have enough memory to store the whole file at once. Sometimes that's not appropriate. But sometimes it is.
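When the file is too big to read all at once, a middle ground is to read fixed-size chunks and sum their lengths; a sketch reusing the question's length function:
total_length = 0
with open(filename) as f:
    while True:
        chunk = f.read(64 * 1024)  # read 64 KiB at a time
        if not chunk:              # read() returns '' at end of file
            break
        total_length += length(chunk)
print(total_length)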
When you iterate over a file (opened in text mode) you are iterating over its lines.
for data in f: could be rewritten as for line in f: and it is easier to see what it is doing.
Your length function looks like it should work but you are sending the open file to it instead of each line.
This is supposed to be a compression function. We're supposed to read in a text file with just words and sort them by frequency and by the number of words, handling upper and lower case. I don't want an answer that solves it, I just want help.
for each word in the input list
    if the word is new to the list of unique words
        insert it into the list and set its frequency to 1
    otherwise
        increase its frequency by 1
sort unique word list by frequencies (function)
open input file again
open output file with appropriate name based on input filename (.cmp)
write out to the output file all the words in the unique words list
for each line in the file (delimited by newlines only!)
    split the line into words
    for each word in the line
        look up each word in the unique words list
        and write its location out to the output file
        don't output a newline character
    output a newline character after the line is finished
close both files
tell user compression is done
This is my next step:
def compression():
    for line in infile:
        words = line.split()

def get_file():
    opened = False
    fn = input("enter a filename ")
    while not opened:
        try:
            infile = open(fn, "r")
            opended = True
        except:
            print("Won't open")
            fn = input("enter a filename ")
    return infile

def compression():
    get_file()
    data = infile.readlines()
    infile.close()
    for line in infile:
        words = line.split()

words = []
word_frequencies = []

def main():
    input("Do you want to compress or decompress? Enter 'C' or 'D' ")

main()
So I'm not going to do an entire assignment for you, but I can try my best and walk you through it one by one.
It would seem the first step is to create an array of all the words in the text file (assuming you know file reading methods). For this, you should look into Python's split function (regular expressions can be used for more complex variations of this). Then you need to store each word somewhere and pair it with the number of times it appears. Sounds like a job for a dictionary, to me. That should get you off on the right track.
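A sketch of the dictionary idea, with a hypothetical input filename:
frequencies = {}
with open('words.txt') as f:  # hypothetical filename
    for line in f:
        for word in line.split():
            frequencies[word] = frequencies.get(word, 0) + 1

# sort by frequency, most common first
for word, count in sorted(frequencies.items(), key=lambda kv: kv[1], reverse=True):
    print(word, count)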
Thanks to the pseudocode I can more or less figure out what this should look like. I have to do some guesswork since you say you're restricted to what you've covered in class, and we have no way to know what that includes and what it doesn't.
I am assuming you can handle opening the input file and splitting it into words. That's pretty basic stuff - if your class is asking you to handle input files it must have covered it. If not, the main thing to look back to is that you can iterate over a file and get its lines:
with open('example.txt') as infile:
    for line in infile:
        words = line.split()
Now, for each word you need to keep track of two things - the word itself and its frequency. Your problem requires you to use lists to store your information. This means you have to either use two different lists, one storing the words and one storing their frequencies, or use one list that stores two facts for each of its positions. There are disadvantages either way - lists are not the best data structure to use for this, but your problem definition puts the better tools off limits for now.
I would probably use two different lists, one containing the words and one containing the frequency counts for the word in the same position in the word list. The advantage is that this allows using the index method on the word list to find the position of a given word, and then increment the matching frequency count. That will be much faster than searching a list that stores both the word and its frequency count using a for loop. The downside is that sorting is harder: you have to find a way to retain the list position of each frequency when you sort, or to combine the words and their frequencies before sorting. In this approach, you probably need to build a list that stores two pieces of information, the frequency count and then either the word list index or the word itself, and then sort that list by the frequency count.
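That combine-and-sort step might look like this, assuming words and word_frequencies are the two parallel lists:
# pair each frequency with its word, then sort with the highest frequency first
combined = sorted(zip(word_frequencies, words), reverse=True)
for frequency, word in combined:
    print(word, frequency)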
I hope that part of the point of this exercise is to help drive home how useful the better data structures are when you're allowed to use them.
Edit: the inner loop is going to look something like this:
words = []
word_frequencies = []

for line in infile:
    for word in line.split():
        try:
            word_position = words.index(word)
        except ValueError:
            # word is not in words
            words.append(word)
            # what do you think should happen to word_frequencies here?
        else:
            # now word_position gives the position of word in both words and word_frequencies
            # what do you think should happen to word_frequencies here?
            ...