I have the following text file:
abstract 233:1 253:1 329:2 1087:2 1272:1
game 64:1 99:1 206:1 595:1
direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1
Each key pair records how many times the string appears in a document, as [docID]:[stringFq].
How could you calculate the number of key pairs in this text file?
Your regex approach works fine. Here is an iterative approach. If you uncomment the print statements, you will see some intermediate results.
Given
%%file foo.txt
abstract 233:1 253:1 329:2 1087:2 1272:1
game 64:1 99:1 206:1 595:1
direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1
Code
import itertools as it
with open("foo.txt") as f:
    lines = f.readlines()
    #print(lines)

pred = lambda x: x.isalpha()

count = 0
for line in lines:
    line = line.strip("\n")
    line = "".join(it.dropwhile(pred, line))
    pairs = line.strip().split(" ")
    #print(pairs)
    count += len(pairs)

count
# 15
Details
First we use a with statement, which is an idiom for safely opening and closing files. We then split the file into lines via readlines(). We define a conditional function (or predicate) that we will use later. The lambda expression is used for convenience and is equivalent to the following function:
def pred(x):
    return x.isalpha()
We initialize a count variable and start iterating over the lines. Every line may have a trailing newline character \n, so we first strip it away before feeding the line to dropwhile.
dropwhile is a special itertools iterator. As it iterates over a line, it discards any leading characters that satisfy the predicate, until it reaches the first character that fails the predicate. In other words, all letters at the start are dropped until the first non-letter is found (which happens to be a space). We clean the resulting line again, stripping the leading space, and the remaining string is split() into a list of pairs.
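For instance, here is a small sketch of what dropwhile does to the first line of foo.txt (using the sample data above):

import itertools as it

line = "abstract 233:1 253:1 329:2 1087:2 1272:1"
# leading letters are dropped until the first non-letter (the space)
rest = "".join(it.dropwhile(str.isalpha, line))
print(rest)    # ' 233:1 253:1 329:2 1087:2 1272:1'
print(rest.strip().split(" "))
# ['233:1', '253:1', '329:2', '1087:2', '1272:1'] -> 5 pairs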
Finally, the length of each line's list of pairs is added to count. The final count is the sum of the lengths of all the pairs lists.
Summary
The code above shows how to tackle basic file handling with simple, iterative steps:
open the file
split the file into lines
while iterating each line, clean and process data
output a result
import re

with open('input.txt', 'r') as f:
    text = f.read()

numbers = re.findall(r"[-+]?\d*\.\d+|\d+", text)
# finds all numbers in the text file
numLen = len(numbers) // 2
# counts all numbers; I needed to count pairs, so I just divided by 2
print(numLen)
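For reference, here is a quick check of what that regex matches on one line of the sample input:

import re

line = "game 64:1 99:1 206:1 595:1"
print(re.findall(r"[-+]?\d*\.\d+|\d+", line))
# ['64', '1', '99', '1', '206', '1', '595', '1'] -> 8 numbers = 4 pairs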
Related
Hi, I made this little exercise for myself: I want to pull out the last number in each line of this text file, which has 5 lines with 6 numbers per line, separated by spaces. I made a loop to get all the remaining characters of the selected line starting from the 5th space. It works for every line (print(findtext(0)) through print(findtext(3))), except the last line when the last number has fewer than 3 characters... What is wrong? I can't figure it out.
text = open("text", "r")
lines = text.readlines()

def findtext(c):
    count = 0
    count2 = 0
    while count < len(lines[c]) and count2 < 5:
        if lines[c][count] == " ":
            count2 = count2 + 1
        count = count + 1
    return float(lines[c][count:len(lines[c])-1])

print(findtext(0))
Your proposed solution doesn't seem very Pythonic to me.
with open('your_file') as lines:
    for line in lines:
        # Exhaust the iterator
        pass

# Split by whitespace and get the last element
*_, last = line.split()
print(last)
Several things:
Access files within context managers, as this guarantees resources are released correctly
Don't keep track of indexes if you don't need to; it makes the code harder to read
Use split instead of counting literal space characters (the sketch after this list combines these points)
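Putting those points together, a minimal sketch that prints the last number of every line (the file name is a placeholder):

with open('your_file') as lines:
    for line in lines:
        # split on whitespace and take the last element of each line
        *_, last = line.split()
        print(float(last))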
with open('file') as f:
    numbers = f.readlines()

last_nums = [line.split()[-1] for line in numbers]
line.split() will split the string into a list of elements using whitespace as the separator (if you pass no arguments),
[-1] will get the last element of that list for you
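For instance, with a made-up line of numbers:

>>> "50 1 69 1 1100 3160".split()
['50', '1', '69', '1', '1100', '3160']
>>> "50 1 69 1 1100 3160".split()[-1]
'3160'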
I have an array containing strings.
I have a text file.
I want to loop through the text file line by line.
And check whether each element of my array is present or not.
(they must be whole words and not substrings)
I am stuck because my script only checks for the presence of the first array element.
However, I would like it to return results with each array element and a note as to whether this array element is present in the entire file or not.
#!/usr/bin/python
with open("/home/all_genera.txt") as file:
    generaA = []
    for line in file:
        line = line.strip('\n')
        generaA.append(line)

with open("/home/config/config2.cnf") as config_file:
    counter = 0
    for line in config_file:
        line = line.strip('\n')
        for part in line.split():
            if generaA[counter] in part:
                print(generaA[counter], "is -----> PRESENT")
            else:
                continue
counter += 1
If I understand correctly, you want a sequence of words that are in both files. If yes, set is your friend:
def parse(f):
    return set(word for line in f for word in line.strip().split())

with open("path/to/genera/file") as f:
    source = parse(f)

with open("path/to/conf/file") as f:
    conf = parse(f)

# elements that are common to both sets
common = conf & source
print(common)

# elements that are in `source` but not in `conf`
print(source - conf)

# elements that are in `conf` but not in `source`
print(conf - source)
So to answer "I would like it to return results with each array element and a note as to whether this array element is present in the entire file or not", you can use either common elements or the source - conf difference to annotate your source list:
# using common elements
common = conf & source
result = [(word, word in common) for word in source]
print(result)
# using difference
diff = source - conf
result = [(word, word not in diff) for word in source]
Both will yield the same result, and since set lookup is O(1), performance should be similar too, so I suggest the first solution (positive assertions are easier on the brain than negative ones).
You can of course apply further cleaning / normalisation when building the sets, e.g. if you want a case-insensitive search:
def parse(f):
    return set(word.lower() for line in f for word in line.strip().split())
from collections import Counter
import re

# first normalize the text (lowercase everything and remove punctuation (anything not alphanumeric))
normalized_text = re.sub("[^a-z0-9 ]", "", open("some.txt").read().lower())
# note that this normalization is subject to the rules of the language/alphabet/dialect you are using, and english ascii may not cover it

# Counter will collect all the words into a dictionary of [word]:count
words = Counter(normalized_text.split())

# create a new set of all the words in both the text and our word_list_array
set(my_word_list_array).intersection(words.keys())
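As a quick usage sketch (my_word_list_array is the OP's own list; the values below are made up for illustration):

my_word_list_array = ["server", "game", "abstract"]  # hypothetical contents
# only the words that actually occur in some.txt survive the intersection
present = set(my_word_list_array).intersection(words.keys())
print(present)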
The counter is not increasing because it's outside the for loops.
with open("/home/all_genera.txt") as myfile: # don't use 'file' as variable, is a reserved word! use myfile instead
generaA=[]
for line in myfile: # use .readlines() if you want a list of lines!
generaA.append(line)
# if you just need to know if string are present in your file, you can use .read():
with open("/home/config/config2.cnf") as config_file:
mytext = config_file.read()
for mystring in generaA:
if mystring in mytext:
print mystring, "is -----> PRESENT"
# if you want to check if your string in line N is present in your file in the same line, you can go with:
with open("/home/config/config2.cnf") as config_file:
for N, line in enumerate(config):
if generaA[N] in line:
print "{0} is -----> PRESENT in line {1}".format(generaA[N], N)
I hope that everything is clear.
This code could be improved in many ways, but I tried to keep it as similar to yours as possible so it will be easier to understand.
I am writing a program that is supposed to return the minimum sequence alignment score (smaller = better). It worked with the Coursera sample inputs, but for the dataset we're given I can't manually input the sequences, so I have to resort to using a text file. There are a few things I found weird.
First things first,
pattern = 'AAA'
DNA = open('practice_data.txt')
empty = []
for lines in DNA:
    line = lines.strip().strip('\n')
    empty.append(line)
print(empty)
print(smallest_distance(pattern, DNA))
If I run this, my program outputs 0. If I comment out the for loop, my program outputs 2. I didn't change DNA, so why does my program behave differently? Also, my strip('\n') is working (and for some reason strip('n') works just as well), but my strip() is not working. Once I figure this out, I can test out empty in my smallest_distance function.
Here is what my data looks like:
ACTAG
CTTAGTATCACTCTGAAAAGAGATTCCGTATCGATGACCGCCAGTTAATACGTGCGAGAAGTGGACACGGCCGCCGACGGCTTCTACACGCTATTACGATG AACCAACAATTGCTCGAATCCTTCCTCAAAATCGCACACGTCTCTCTGGTCGTAGCACGGATCGGCGACCCACGCGTGACAGCCATCACCTATGATTGCCG
TTAAGGTACTGCTTCATTGATCAACACCCCTCAGCCGGCAATCACTCTGGGTGCGGGCTGGGTTTACAGGGGTATACGGAAACCGCTGCTTGCCCAATAAT
etc...
Solution:
pattern = 'AAA'

with open('practice_data.txt') as f_dna:
    dna_list = [sequence for line in f_dna for sequence in line.split()]

print(smallest_distance(pattern, dna_list))
Explanation:
You were close to the solution, but you needed to replace strip() with split().
-> strip() removes the extra characters, so your strip('\n') was a good guess.
But since \n is at the end of the line, split() will automatically get rid of it, because it counts as a delimiter.
e.g.
>>> 'test\ntest'.split()
['test', 'test']
>>> 'test\n'.split()
['test']
Now you have to replace .append() with a simple list concatenation, since split() returns a list.
DNA = open('practice_data.txt')
empty = []
for lines in DNA:
    line = lines.split()
    empty += line
But there are still some problems in your code:
It is better to use a with statement when opening a file, because it automatically handles exceptions and closes the file descriptor at the end:
empty = []
with open('practice_data.txt') as DNA:
    for lines in DNA:
        line = lines.split()
        empty += line
Your code is now fine; you can still refactor it using a list comprehension (very common in Python):
with open('practice_data.txt') as DNA:
    empty = [sequence for line in DNA for sequence in line.split()]
If you struggle to understand this, try to decompose it into for loops:
empty = []
with open('practice_data.txt') as DNA:
    for line in DNA:
        for sequence in line.split():
            empty.append(sequence)
Note: @MrGeek's solution works, but it has two major defects:
since it does not use a with statement, the file is never closed explicitly,
using .read().splitlines() loads ALL the content of the file into memory, which could lead to a MemoryError exception if the file is too big.
Go further: handle huge files
Now imagine that you have a 1 GB file filled with DNA sequences. Even if you don't load the whole file into memory, you still build a huge list; a better practice is to create another file for the result and process your DNA on the fly:
e.g.
pattern = 'AAA'

with open('practice_data.txt') as f_dna, open('result.txt', 'w') as f_result:
    for line in f_dna:
        for sequence in line.split():
            result = smallest_distance(pattern, sequence)
            f_result.write(str(result) + '\n')  # write() expects a string
Warning: you will have to make sure your function smallest_distance accepts a string rather than a list.
If that is not possible, you may need to process batches instead, but since that is a little more complicated I will not cover it here.
Now you can refactor a bit, for example using a generator function to improve readability:
def extract_sequence(file, pattern):
    for line in file:
        for sequence in line.split():
            yield smallest_distance(pattern, sequence)

pattern = 'AAA'

with open('practice_data.txt') as f_dna, open('result.txt', 'w') as f_result:
    for result in extract_sequence(f_dna, pattern):
        f_result.write(str(result) + '\n')  # write() expects a string
Potential errors:
print(smallest_distance(pattern, DNA))
DNA is a file object, not a string array, because DNA = open('practice_data.txt').
The for loop consumes DNA, so if smallest_distance iterates over it again with for lines in DNA:, it won't work.
Update:
In this case, the for loop goes from the beginning of the file to the end; it does not go back to the start again like a list would, unless you call DNA.close() and re-open the file with DNA = open('practice_data.txt').
A simple example you can try:
DNA = open('text.txt')
for lines in DNA:
    line = lines.strip().strip('\n')
    print(line)  # prints everything in the file here
print('try again')
for lines in DNA:
    line = lines.strip().strip('\n')
    print(line)  # will not print anything at all
print('done')
Read For loop not working twice on the same file descriptor for more detail
Write:
pattern = 'AAA'
DNA = open('practice_data.txt').read().splitlines()
newDNA = []
for line in DNA:
    newDNA += line.split()  # split the line into a list of strings, then concatenate it with the newDNA list
print(smallest_distance(pattern, newDNA))
I have multiple files, each containing a single line with, say, ~10M numbers. I want to check each file and print a 0 for each file that has repeated numbers and a 1 for each that doesn't.
I am using a list for counting frequency. Because of the large number of values per line, I want to update the frequency after accepting each number and break as soon as I find a repeated one. While this is simple in C, I have no idea how to do it in Python.
How do I read a line word by word without storing (or taking as input) the whole line?
EDIT: I also need a way to do this from live input rather than a file.
Read the line, split it, and copy the resulting list into a set. If the size of the set is less than the size of the list, the file contains repeated elements:
with open('filename', 'r') as f:
    for line in f:
        # a set drops duplicates, so a size mismatch means the line has repeats
        nums = line.split()
        print(0 if len(set(nums)) < len(nums) else 1)
To read the file word by word, try this
import itertools

def readWords(file_object):
    word = ""
    for ch in itertools.takewhile(lambda c: bool(c), itertools.imap(file_object.read, itertools.repeat(1))):
        if ch.isspace():
            if word:  # In case of multiple spaces
                yield word
                word = ""
            continue
        word += ch
    if word:
        yield word  # Handles last word before EOF
Then you can do:
with open('filename', 'r') as f:
    seen = set()  # store the numbers in a set to check whether a number already exists
    for num in itertools.imap(int, readWords(f)):
        if num in seen:
            break
        seen.add(num)
This method should also work for streams, because it only reads one byte at a time and yields space-delimited words from the input stream.
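For the live-input part of the question, the same readWords generator can be pointed at standard input instead of a file (a sketch, in the same Python 2 style as the code above):

import sys

seen = set()
for num in itertools.imap(int, readWords(sys.stdin)):
    if num in seen:  # a repeated number was found
        print 0
        break
    seen.add(num)
else:
    print 1  # the loop finished without finding a repeat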
After giving this answer, I've updated this method quite a bit. Have a look
https://gist.github.com/smac89/bddb27d975c59a5f053256c893630cdc
The way you are asking for it is not possible, I guess. You can't read word by word as such in Python. Something like this can be done:
f = open('words.txt')
for word in f.read().split():
    print(word)
I have appended Excel sheet values to a list using xlrd. I called the list a_master. I have a text file with words whose occurrences in this list I want to count (I called this file dictionary, and there's 1 word per line). Here is the code:
with open("dictionary.txt", "r") as f:
    for line in f:
        print "Count " + line + str((a_master).count(line))
For some reason, though, the count comes back as zero for every word that exists in the text file. If I write out the count for one of these words myself:
print str((a_master).count("server"))
it counts the occurrences no problem. I have also tried
print line
in order to check that it is reading the words in the dictionary.txt file correctly, and it is.
Lines read from a file are terminated by a newline character, and there may also be whitespace at the end. It is better to strip any whitespace before doing a lookup:
with open("dictionary.txt", "r") as f:
    for line in f:
        print "Count " + line + str((a_master).count(line.strip()))
Note: searching a list is linear and may not be optimal in most cases. I think collections.Counter is suitable for the situation you describe.
Re-interpret your list as a dictionary where the key is the item and the value is its occurrence count, by passing it through collections.Counter as shown below,
a_master = collections.Counter(a_master)
and you can re-write your code as
from itertools import imap

with open("dictionary.txt", "r") as f:
    for line in imap(str.strip, f):
        print "Count {} {}".format(line, a_master[line])
Use collections.Counter():
import re
import collections
words = re.findall(r'\w+', open('dictionary.txt').read().lower())
collections.Counter(words)
Why is this question tagged xlrd by the way?