from collections import Counter

char1 = "P"
length = 5
f = open("wl.txt", 'r')
for line in f:
    if len(line) == length and line.rstrip() == char1:
        z = Counter(line)
        print z
I want to output only the lines whose length is 5 and which contain the character 'p'. So far:
f = open("wl.txt", 'r')
for line in f:
    if len(line) == length:  # This one only works with the length
        z = Counter(line)
        print z
Any guesses, anyone?
Your problem is:
if len(line)==length and line.rstrip() == char1:
If a line is 5 characters long, then after removing trailing whitespace you're comparing it to a string of length 1: 'abcde' is never going to equal 'p', for instance. And if the line actually is just 'p', the check never runs, because it isn't 5 characters long.
I'm not sure what you're trying to do with Counter.
Corrected code is:
# note: in capitals to indicate 'constants'
LENGTH = 5
CHAR = 'p'

with open('wl.txt') as fin:
    for line in fin:
        # Check length *after* any trailing whitespace has been removed
        # and that CHAR appears anywhere **in** the line
        if len(line.rstrip()) == LENGTH and CHAR in line:
            print 'match:', line
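If the Counter in the original code was meant to tally the letters of each matching line (an assumption; the question doesn't say), the corrected check can feed it the stripped line. A minimal sketch (the helper name and sample lines are mine):

```python
from collections import Counter

LENGTH = 5
CHAR = 'p'

def letter_counts(lines):
    # Yield a Counter of letters for each line of the right
    # length that contains CHAR (hypothetical helper)
    for line in lines:
        stripped = line.rstrip()
        if len(stripped) == LENGTH and CHAR in stripped:
            yield Counter(stripped)

# works on any iterable of lines, e.g. an open file object
results = list(letter_counts(["apple\n", "grape\n", "melon\n"]))
# "apple" and "grape" match; "melon" has no 'p'
```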
Related
How can I read only the first symbol in each line, without reading the whole line, using Python?
For example, if I have file like:
apple
pear
watermelon
In each iteration I must store only one (the first) letter of line.
The result of the program should be ["a","p","w"]. I tried to use file.seek(), but how can I move it to the next line?
ti7's answer is great, but if the lines might be too long to hold in memory, you might wish to read char-by-char to avoid storing the whole line in memory:
from pathlib import Path
from typing import Iterator

NEWLINE_CHAR = {'\n', '\r'}

def first_chars(file_path: Path) -> Iterator[str]:
    with open(file_path) as fh:
        new_line = True
        while c := fh.read(1):
            if c in NEWLINE_CHAR:
                new_line = True
            elif new_line:
                yield c
                new_line = False
Test:
path = Path('/some/path/a.py')
easy_first_chars = [l[0] for l in path.read_text().splitlines() if l]
smart_first_chars = list(first_chars(path))
assert smart_first_chars == easy_first_chars
File-like objects are iterable, so you can use them directly, like this:
collection = []
with open("input.txt") as fh:
    for line in fh:  # iterate by-lines over file-like
        try:
            collection.append(line[0])  # get the first char in the line
        except IndexError:  # line has no chars
            pass  # consider other handling
# work with collection
You may also consider enumerate() if you care about which line a particular value was on, or yielding line[0] to form a generator (which may allow a more efficient process if it can halt before reading the entire file):
def my_generator():
    with open("input.txt") as fh:
        for lineno, line in enumerate(fh, 1):  # lines are commonly 1-indexed
            try:
                yield lineno, line[0]  # first char in the line
            except IndexError:  # line has no chars
                pass  # consider other handling

for lineno, first_letter in my_generator():
    pass  # work with lineno and first_letter here and break when done
You can read one letter at a time with file.read(1):
letters = []
with open(filepath, "r") as file:
    # Initialized to '\n' so the very first letter is stored
    previous = '\n'
    while True:
        # Read only one letter
        letter = file.read(1)
        if letter == '':
            break
        elif previous == '\n':
            # Store the letter that follows a newline '\n'
            letters.append(letter)
        previous = letter
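A quick check of the loop above, run against an in-memory stream instead of a real file (io.StringIO is a stand-in here, purely for illustration):

```python
import io

# The same read-one-character loop, applied to an in-memory "file"
file = io.StringIO("apple\npear\nwatermelon\n")
letters = []
previous = '\n'  # so the first character is stored
while True:
    letter = file.read(1)
    if letter == '':
        break
    elif previous == '\n':
        letters.append(letter)
    previous = letter

print(letters)  # → ['a', 'p', 'w']
```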
I'm trying to write a program that counts the number of N's at the end of a string.
I have a file containing a many lines of unique sequences and I want to measure how often the sequence ends with N, and how long the series of N's are. For example, the file input will look like this:
NTGTGTAATAGATTTTACTTTTGCCTTTAAGCCCAAGGTCCTGGACTTGAAACATCCAAGGGATGGAAAATGCCGTATAACNN
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGN
NTACGTTTAGTAGAGACAGTGTCTTGCTATGTTGCCCAGGCTGGTCTCAAACTCCTGAGCTCTAGCAAGCCTTCCACCTCNNN
NTAATCCAACTAACTAAAAATAAAAAGATTCAAATAGGTACAGAAAACAATGAAGGTGTAGAGGTGAGAAATCAACAGGANNN
Ideally, the code will read through the file, line by line and count how often a line ends with 'N'.
Then, if a line ends with N, it should read each character backwards to see how long the string of N's is. This information will be used to calculate the percentage of lines ending in N, as well as the mean, mode, median and range of N strings.
Here is what I have so far.
filename = 'N_strings_test.txt'
n_strings = 0
n_string_len = []

with open(filename, 'r') as in_f_obj:
    line_count = 0
    for line in in_f_obj:
        line_count += 1
        base_seq = line.rstrip()
        if base_seq[-1] == 'N':
            n_strings += 1
            if base_seq[-2] == 'N':
                n_string_len.append(int(2))
            else:
                n_string_len.append(int(1))

print(line_count)
print(n_strings)
print(n_string_len)
All I'm getting is an index out of range error, but I don't understand why. Also, what I have so far is limited to only 2 characters.
I want to try and write this for myself, so I don't want to import any modules.
Thanks.
You will probably get the IndexError because your file has empty lines!
Two sound approaches. First the generic one: iterate the line in reverse using reversed():
line = line.rstrip()
count = 0
for c in reversed(line):
    if c != 'N':
        break
    count += 1
# count will now contain the number of N characters from the end
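Wrapped in a small helper (the function name is mine, just for the demonstration), the loop behaves like this:

```python
def count_trailing_ns(line):
    # Count 'N' characters from the end of the stripped line
    count = 0
    for c in reversed(line.rstrip()):
        if c != 'N':
            break
        count += 1
    return count

print(count_trailing_ns("ACGTANN\n"))  # → 2
print(count_trailing_ns("ACGTA\n"))    # → 0
```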
Another, even easier approach: rstrip() all whitespace, get the length, and then rstrip() all Ns. The number of trailing Ns is the difference in lengths (strings are immutable, so both calls return new strings and the original is untouched):
without_whitespace = line.rstrip()
without_ns = without_whitespace.rstrip('N')
count = len(without_whitespace) - len(without_ns)
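For example, on a line with three trailing Ns (the sample string is mine):

```python
line = "ACGTGGANNN\n"

without_whitespace = line.rstrip()           # "ACGTGGANNN"
without_ns = without_whitespace.rstrip('N')  # "ACGTGGA"
count = len(without_whitespace) - len(without_ns)

print(count)  # → 3
```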
This code is:
Reading line by line
Reversing the string and lstrip()-ing it. Reversing is not necessary, but it makes things feel more natural.
Reading the last character; if it is 'N', incrementing the count
Continuing along that line while the stream of N's lasts
n_string_count, n_string_len, line_count = 0, [], 0

with open('file.txt', 'r') as input_file:
    for line in input_file:
        line_count += 1
        line = line[::-1].lstrip()
        if line and line[0] == 'N':
            n_string_count += 1
            consecutive_n = 1
            while consecutive_n < len(line) and line[consecutive_n] == 'N':
                consecutive_n += 1
            n_string_len.append(consecutive_n)

print(line_count)
print(n_string_count)
print(n_string_len)
I'm reading a text file and identifying specific characteristics in the text. Everything turns out fine until it reaches the spaces part, where it reports 15 spaces instead of 6.
The text file is
Hello
Do school units regularly
Attend seminars
Study 4 tests
Bye
and the script is
def main():
    lower_case = 0
    upper_case = 0
    numbers = 0
    whitespace = 0
    with open("text.txt", "r") as in_file:
        for line in in_file:
            lower_case += sum(1 for x in line if x.islower())
            upper_case += sum(1 for x in line if x.isupper())
            numbers += sum(1 for x in line if x.isdigit())
            whitespace += sum(1 for x in line if x.isspace())
    print 'Lower case Letters: %s' % lower_case
    print 'Upper case Letters: %s' % upper_case
    print 'Numbers: %s' % numbers
    print 'Spaces: %s' % whitespace

main()
Is there anything that should be changed so the number of spaces will turn up as 6?
The reason this happens is that line breaks are also considered whitespace. The file you are opening was probably created on Windows, and on Windows a line break is two characters (the line feed and a carriage return). Since you have five lines, you get 10 extra whitespace characters, totalling 16 (one gets lost somewhere; I can only guess that one of the lines has a different line break at the end, which lacks a carriage return).
To fix it, just strip the line when you count whitespaces.
whitespace += sum(1 for x in line.strip() if x.isspace())
This will, however, also strip out any leading and trailing spaces which are not line breaks. To strip only line breaks from the end, you can do
whitespace += sum(1 for x in line.rstrip("\r\n") if x.isspace())
Another possibility is to not use isspace() but rather check for the characters you want, e.g.
whitespace += line.count(' ') + line.count('\t')
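A minimal self-contained check of the three counting variants on the sample text (the text is inlined here instead of read from text.txt, purely for demonstration):

```python
text = "Hello\nDo school units regularly\nAttend seminars\nStudy 4 tests\nBye\n"

naive = strip_newlines = count_only = 0
for line in text.splitlines(keepends=True):
    naive += sum(1 for x in line if x.isspace())                    # counts '\n' too
    strip_newlines += sum(1 for x in line.rstrip("\r\n") if x.isspace())
    count_only += line.count(' ') + line.count('\t')

print(naive, strip_newlines, count_only)  # → 11 6 6
```

With Windows-style "\r\n" line endings the naive count would be higher still, which is exactly the effect described above.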
I have a file that has between 1 and 4 words per line, and I want to sort by the third word in the line. The code below does not work if there isn't a word in the s[2] slot. Is there anything I can do to still sort everything? Thanks
with open('myfile.txt') as fin:
    lines = [line.split() for line in fin]
lines.sort(key=lambda s: s[2])
You may want to try using the slice syntax
with open('myfile.txt') as fin:
    lines = [line.split() for line in fin]
lines.sort(key=lambda s: s[2:3])  # will give empty list if there is no 3rd word
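The reason this works: slicing past the end of a list yields an empty list rather than raising IndexError, and an empty list sorts before any non-empty one. A quick demonstration (the sample data is mine):

```python
lines = [["beta", "x", "zulu"], ["alpha"], ["gamma", "y", "alpha"]]
lines.sort(key=lambda s: s[2:3])  # key is [] for short lines, ["word"] otherwise

print(lines)
# → [['alpha'], ['gamma', 'y', 'alpha'], ['beta', 'x', 'zulu']]
```

Lines with no third word therefore all sort to the front, and the rest are ordered by their third word.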
Try this:
x.sort(key=lambda s: s[2] if len(s) > 2 else '\uffff')
That way if there is no s[2] it returns a string that sorts after any ordinary word (note the fallback must be a string, not a number like ord('z')+1, or Python 3 will refuse to compare the mixed key types). Feel free to substitute some other "large" sentinel string.
def sortFileByLastWords(fIn, fOut):
    with open(fIn) as fin:
        lines = [line.split() for line in fin]
    lines.sort(key=lambda s: s[-1])  # make the last word of each line be the key
    with open(fOut, "w") as fout:
        for words in lines:
            fout.write(' '.join(words) + '\n')
I have got this Python program which reads through a wordlist file and checks for the suffix endings given in another file, using the endswith() method.
The suffixes to check for are saved in the list suffixList[].
The counts are kept in suffixCount[].
The following is my code:
fd = open(filename, 'r')
print 'Suffixes: '
x = len(suffixList)
for line in fd:
    for wordp in range(0, x):
        if word.endswith(suffixList[wordp]):
            suffixCount[wordp] = suffixCount[wordp] + 1

for output in range(0, x):
    print "%-6s %10i" % (suffixList[output], suffixCount[output])
fd.close()
The output is this :
Suffixes:
able 0
ible 0
ation 0
The program never gets inside this branch:
if word.endswith(suffixList[wordp]):
You need to strip the newline:
word = ln.rstrip().lower()
The words are coming from a file, so each line ends with a newline character. You are then trying to use endswith, which always fails, as none of your suffixes end with a newline.
I would also change the function to return the values you want:
def store_roots(start, end):
    with open("rootsPrefixesSuffixes.txt") as fs:
        lst = [line.split()[0] for line in map(str.strip, fs)
               if '#' not in line and line]
    return lst, dict.fromkeys(lst[start:end], 0)

lst, sfx_dict = store_roots(22, 30)  # List, SuffixList
Then slice from the end and see if the substring is in the dict:
with open('longWordList.txt') as fd:
    print('Suffixes: ')
    # lengths of the longest and shortest suffix
    mx = len(max(sfx_dict, key=len))
    mn = len(min(sfx_dict, key=len))
    for ln in map(str.rstrip, fd):
        suf = ln[-mx:]
        for i in range(mx - 1, mn - 2, -1):
            if suf in sfx_dict:
                sfx_dict[suf] += 1
            suf = suf[-i:]

for k, v in sfx_dict.items():
    print("Suffix = {} Count = {}".format(k, v))
Slicing the end of the string incrementally should be faster than checking every suffix string against the word, especially if many suffixes share the same length. At most it does mx - mn + 1 dict lookups per word: only one substring of a given length can match at a time, so if you had 20 four-character suffixes, a single slice and lookup covers all 20 at once.
You could use a Counter to count the occurrences of suffix:
from collections import Counter
with open("rootsPrefixesSuffixes.txt") as fp:
    suffix_list = [line.strip() for line in fp if line.strip() and '#' not in line]
suffixes = suffix_list[22:30]  # ?

with open('longWordList.txt') as fp:
    c = Counter(s for word in fp for s in suffixes
                if word.rstrip().lower().endswith(s))
print(c)
Note: add .split()[0] if there can be more than one word per line and you want to keep only the first; otherwise it is unnecessary.
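A quick illustration of the Counter approach with inlined sample data (the words and suffixes here are my own stand-ins for the two input files):

```python
from collections import Counter

words = ["portable\n", "flexible\n", "creation\n", "workable\n"]
suffixes = ["able", "ible", "ation"]

# Count, for each suffix, how many words end with it
c = Counter(s for word in words for s in suffixes
            if word.rstrip().lower().endswith(s))

print(c)  # → Counter({'able': 2, 'ible': 1, 'ation': 1})
```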