I wrote a .txt file containing the operators = and ==. I wrote code to count the number of = and ==, but I don't get the correct numbers.
lexicalClass = file.readlines()
for lex in lexicalClass:
    newList = re.findall('\S+', lex)
    for element in newList:
        if len(re.findall('[a-z]+|[0-9]+', element)):
            identifiers.append(re.findall('[a-z]+|[0-9]+', element))
            num = len(re.findall('\=', element))
            if int(num):
                if int(num) % 2 == 1:
                    for i in range(int((num-1)/2)):
                        equal.append('==')
                    assignment.append('=')
                else:
                    for i in range(int(num/2)):
                        equal.append('==')
print(str(len(equal)))
print(str(len(assignment)))
My .txt file contains: a == b a = b c = d
As you can see, my output should be 1 and 2, but I'm getting 0 for both.
You could probably do this with lookahead and lookbehind assertions:
import re
one_equals = r"(?<!=)=(?!=)"   # a "=" not preceded or followed by a "="
two_equals = r"(?<!=)==(?!=)"  # a "==" not preceded or followed by a "="
assignment = 0
equals = 0
with open("yourfilename.txt") as f:
    for line in f:
        assignment += len(re.findall(one_equals, line))
        equals += len(re.findall(two_equals, line))
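As a quick check of the lookaround patterns, here is a minimal sketch run against the sample line from the question. The helper name `count_operators` is made up for illustration.

```python
import re

# "=" that is neither preceded nor followed by another "="
ONE_EQUALS = re.compile(r"(?<!=)=(?!=)")
# "==" that is neither preceded nor followed by another "="
TWO_EQUALS = re.compile(r"(?<!=)==(?!=)")


def count_operators(text):
    """Return (number of '==', number of lone '=') found in text."""
    equals = len(TWO_EQUALS.findall(text))
    assignments = len(ONE_EQUALS.findall(text))
    return equals, assignments


print(count_operators("a == b a = b c = d"))  # (1, 2)
```

A run of three or more `=` characters matches neither pattern, so it is not counted at all.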
If this is Python source code, the correct way to do this is with the ast module, using ast.walk() and counting instances of the ast.Assign and ast.Eq nodes:
import ast
with open("yourfilename.txt") as f:
    parsed_source = ast.parse(f.read())
nodes = list(ast.walk(parsed_source))
equals = sum(isinstance(n, ast.Eq) for n in nodes)
assignments = sum(isinstance(n, ast.Assign) for n in nodes)
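Note that the ast approach only works when the file parses as Python. The sample line from the question (a == b a = b c = d) does not, so the sketch below uses a small made-up snippet in its place.

```python
import ast

# Hypothetical source text standing in for the contents of the file
source = "a = 1\nb = 2\nif a == b:\n    c = a\n"

nodes = list(ast.walk(ast.parse(source)))
equals = sum(isinstance(n, ast.Eq) for n in nodes)
assignments = sum(isinstance(n, ast.Assign) for n in nodes)

print(equals, assignments)  # 1 3
```

`ast.Assign` covers plain assignments only. Augmented assignments such as `x += 1` are `ast.AugAssign` nodes and annotated ones such as `x: int = 1` are `ast.AnnAssign`, so they would need to be counted separately if they matter.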
If you don't particularly care about efficiency, this is a fairly simple solution:
total_double_eq_count = 0
total_single_eq_count = 0
with open("asd.txt") as file:
    # iterate over the lines of the file
    for line in file:
        # number of '=='s in the line
        double_eq_count = line.count("==")
        # number of '='s that are not part of an '=='
        single_eq_count = line.count("=") - 2*double_eq_count
        total_double_eq_count += double_eq_count
        total_single_eq_count += single_eq_count
print(total_double_eq_count)
print(total_single_eq_count)
Even so, this is relatively fast compared to equivalent pure-Python code, since it uses built-in string methods. At least on small inputs.
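One caveat applies to both the regex and the count approach: if the text also contains !=, <= or >=, the trailing = of those operators is counted as an assignment. A minimal sketch, assuming those three operators should be excluded (the pattern and the function name `count_eq` are illustrative, not from the answers):

```python
import re

# "=" not preceded by "=", "!", "<" or ">" and not followed by "="
SINGLE_EQ = re.compile(r"(?<![=!<>])=(?!=)")
# "==" that is not part of a longer run of "="
DOUBLE_EQ = re.compile(r"(?<!=)==(?!=)")


def count_eq(line):
    """Return (count of '==', count of lone '=') in one line of text."""
    return len(DOUBLE_EQ.findall(line)), len(SINGLE_EQ.findall(line))


line = "a != b c <= d e = f g == h"
print(line.count("=") - 2 * line.count("=="))  # 3: also counts "!=" and "<="
print(count_eq(line))                          # (1, 1)
```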
How do I sum the numbers attached to words in a text file (without separating them into digits) in Python? (Example: "a23 B55", answer = 78)
That's what I did, but it's not quite right:
def rixum(file_name):
    f = open(file_name,'r')
    line = f.readline()
    temp = line.split()
    res = []
    for word in temp:
        i = 0
        while i < len(word)-1:
            if word[i].isdigit():
                res.append(int(word[i:]))
    print(sum(res))
    f.close()
    return sum(res)
This worked for me:
import re
string = 'F43 n90 i625'
def summ_numbers(string):
    # raw string so that \d reaches the regex engine unchanged
    return sum(int(num) for num in re.findall(r'\d+', string))
print(summ_numbers(string))
Output:
758
You don't really need to build a list - you can simply accumulate the values as you go along (line by line):
def rixum(filename):
    with open(filename) as data:
        for line in data:
            total = 0
            for token in line.split():
                for i, c in enumerate(token):
                    if c.isdigit():
                        total += int(token[i:])
                        break
            print(total)
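The two answers can also be combined. Here is a minimal sketch that reads a file line by line and sums every run of digits it finds. `sum_numbers_in_file` is a name chosen here for illustration, and the temporary file only stands in for real input.

```python
import os
import re
import tempfile


def sum_numbers_in_file(filename):
    """Sum every run of digits in the file, e.g. 'a23 B55' gives 78."""
    total = 0
    with open(filename) as data:
        for line in data:
            total += sum(int(num) for num in re.findall(r"\d+", line))
    return total


# Demo on a throwaway file containing the example from the question
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a23 B55\n")

demo_total = sum_numbers_in_file(tmp.name)
os.remove(tmp.name)
print(demo_total)  # 78
```

One difference in behaviour: for a token such as a2b3 this adds 2 and 3, whereas int(token[i:]) would raise a ValueError on the non-digit character.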
This is my code to split each string in a list at the colon. I'm including all of it in case the extra context helps with the question:
my_file = open("Accounts.txt", "r")
rawAccounts = my_file.read()
Accounts = []
b = 0
j = 0
x = 0
size = 0
dummy= "c"
lessrawAccounts = rawAccounts.split("\n")
while x != 100000:
    size = len(lessrawAccounts[j])
    if lessrawAccounts[j[b]] != ":":
        Accounts[j[b]] = lessrawAccounts[j[b]]
        b = b + 1
    else:
        j = j + 1
        while b <= size:
            Accounts[j[b]] = lessrawAccounts[j[b]]
            b = b + 1
If you want to pull the emails (and passwords) out of your list by splitting at the colon, you can use this:
lessrawAccounts = ['JohnDoe#gmail.com:userpass']
Accounts = []
passwords = []
for line in lessrawAccounts:
    Accounts.append(line.split(":")[0])
    passwords.append(line.split(":")[1])
print(Accounts,passwords)
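If a password may itself contain a colon, splitting only once per line keeps it intact. A minimal sketch, assuming every line has the form account:password (the second sample line is made up for the demonstration):

```python
# Hypothetical sample lines in "account:password" form
raw_lines = ["JohnDoe#gmail.com:userpass", "jane#example.com:pa:ss"]

accounts = []
passwords = []
for line in raw_lines:
    # maxsplit=1 splits at the first colon only, so a colon inside
    # the password stays part of the password
    account, password = line.split(":", 1)
    accounts.append(account)
    passwords.append(password)

print(accounts)   # ['JohnDoe#gmail.com', 'jane#example.com']
print(passwords)  # ['userpass', 'pa:ss']
```

A line with no colon at all would raise a ValueError at the unpacking step, so real input may need a check for that first.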
It would be clearer if you gave examples of the strings you want to split.
As it stands, to answer your "question" the reader has to parse your code to work out what you are trying to do.
Your question is titled, more or less, "how to split a string at the : character".
before,_,after = "before:after".partition(":")
The partition method splits a string at the first occurrence of a separator string (which can be more than one character). It returns three values; I have discarded the middle one, since it is the separator itself.
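A few small cases show how partition behaves, including when the separator is missing. The input strings are made up for the demonstration.

```python
# str.partition always returns a 3-tuple: (before, separator, after)
before, sep, after = "before:after".partition(":")
print(before, after)  # before after

# Only the first occurrence of the separator is used
print("user:pa:ss".partition(":"))  # ('user', ':', 'pa:ss')

# If the separator is absent, the whole string comes back first
# and the other two items are empty strings
print("nocolon".partition(":"))  # ('nocolon', '', '')
```

Because the result always has three items, the unpacking never fails, which is convenient when some lines may lack a colon.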
The substring has to be 6 characters long. The number I'm getting is smaller than it should be.
First I wrote code to read the sequences from a file and put them in a dictionary. Then I wrote 3 nested for loops. The first iterates over the dictionary and gets one sequence per iteration. The second takes each sequence and extracts a 6-character substring from it, moving the start index along the sequence by 1 in each iteration. The third takes each substring from the second loop and counts how many times it appears in each sequence.
I tried rewriting the code many times, and I think I got very close. I checked that the loops actually do their iterations, and they do. I even checked manually whether the counts for a substring in random sequences match what the program gives, and they do. Any ideas? Maybe a different approach? What debugger do you use for Python?
I added a file with 3 shortened sequences for testing. Maybe try a smaller substring, say 3 characters instead of 6: rep_len = 3
The code
matches = []
count = 0
final_count = 0
rep_len = 6
repeat = ''
pos = 0
seq_count = 0
seqs = {}
f = open(r"file.fasta")
# inserting each sequences from the file into a dictionary
for line in f:
    line = line.rstrip()
    if line[0] == '>':
        seq_count += 1
        name = seq_count
        seqs[name] = ''
    else:
        seqs[name] += line
for key, seq in seqs.items(): # getting one sequence in each iteration
    for pos in range(len(seq)): # setting an index and increasing it by 1 in each iteration
        if pos <= len(seq) - rep_len: # checking no substring from the end of the sequence are selected
            repeat = seq[pos:pos + rep_len] # setting a substring
            if repeat not in matches: # checking if the substring was already scanned
                matches.append(repeat) # adding the substring to previously checked substrings' list
                for key1, seq2 in seqs.items(): # iterating over each sequence
                    count += seq2.count(repeat) # counting the substring's repetitions
                if count > final_count: # if the count is greater than the previously saved greatest number
                    final_count = count # the new value is saved
                count = 0
print('repetitions: ', final_count) # printing
sequences.fasta
The code is not very clear, so it is a bit difficult to debug. I suggest rewriting.
Anyway, so far I have only noticed one small mistake:
if pos < len(seq) - rep_len:
Should be
if pos <= len(seq) - rep_len:
With <, the final window is skipped, so the last character of each sequence is ignored.
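Another possible reason for a count that is too small: str.count counts non-overlapping occurrences only. Whether overlapping repeats should count is an assumption here, since the question does not say. A minimal sketch of the difference, with `count_overlapping` as an illustrative helper:

```python
def count_overlapping(seq, sub):
    """Count occurrences of sub in seq, allowing overlaps."""
    count = 0
    start = seq.find(sub)
    while start != -1:
        count += 1
        # Resume one character after the last hit, not after its end
        start = seq.find(sub, start + 1)
    return count


print("AAAA".count("AA"))               # 2 (non-overlapping)
print(count_overlapping("AAAA", "AA"))  # 3 (positions 0, 1 and 2)
```

Repetitive DNA such as GCGCGC is exactly where the two counts diverge.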
EDIT:
Here is a rewrite of your code that is clearer and might help you investigate the errors:
rep_len = 6
seq_count = 0
seqs = {}
filename = "dna2.txt"
# Extract the data into a dictionary
with open(filename, "r") as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>'):
            seq_count += 1
            name = seq_count
            seqs[name] = ''
        else:
            seqs[name] += line
# Store all the information, so that you can reuse it later
counter = {}
for key, seq in seqs.items():
    # + 1 so that the window ending at the last character is included
    for pos in range(len(seq) - rep_len + 1):
        repeat = seq[pos:pos + rep_len]
        if repeat in counter:
            counter[repeat] += 1
        else:
            counter[repeat] = 1
# Sort the counter to have max occurrences first
sorted_counter = sorted(counter.items(), key = lambda item:item[1], reverse=True )
# Display the 5 max occurrences
for key, rep in sorted_counter[:5]:
    print("{} -> {}".format(key, rep))
# GCGCGC -> 11
# CCGCCG -> 11
# CGCCGA -> 10
# CGCGCG -> 9
# CGTCGA -> 9
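The same sliding-window count can be written with collections.Counter, whose most_common method replaces the manual sort. A minimal sketch; the two short sequences are made-up test data, not taken from the question's file:

```python
from collections import Counter

rep_len = 3
# Hypothetical sequences standing in for the parsed FASTA records
seqs = {1: "ACGACGT", 2: "CGACG"}

counter = Counter()
for seq in seqs.values():
    # + 1 so that the window ending at the last character is included
    for pos in range(len(seq) - rep_len + 1):
        counter[seq[pos:pos + rep_len]] += 1

# Print the most frequent substring and its count
for repeat, n in counter.most_common(1):
    print("{} -> {}".format(repeat, n))  # ACG -> 3
```

Like the loop version, this counts every window position, so overlapping occurrences are included.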
It might be easier to use Counter from the collections module in Python. Also check out the NLTK library.
An example:
from collections import Counter
from nltk.util import ngrams
sequence = "cggttgcaatgagcgtcttgcacggaccgtcatgtaagaccgctacgcttcgatcaacgctattacgcaagccaccgaatgcccggctcgtcccaacctg"
def reps(substr):
    "Counts repeats in a substring"
    return sum([i for i in Counter(substr).values() if i>1])
def make_grams(sent, n=6):
    "splits a sentence into n-grams"
    return ["".join(seq) for seq in (ngrams(sent,n))]
grams = make_grams(sequence) # splits string into substrings
max_length = max(list(map(reps, grams))) # gets maximum repeat count
result = [dna for dna in grams if reps(dna) == max_length]
print(result)
Output: ['gcgtct', 'cacgga', 'acggac', 'tgtaag', 'agaccg', 'gcttcg', 'cgcaag', 'gcaagc', 'gcccgg', 'cccggc', 'gctcgt', 'cccaac', 'ccaacc']
And if the question is to look for the string with the most repeated character:
repeat_count = [max(Counter(a).values()) for a in result] # highest character repeat count
result_dict = {dna:ct for (dna,ct) in zip(result, repeat_count)}
another_result = [dna for dna in result_dict.keys() if result_dict[dna] == max(repeat_count)]
print(another_result)
Output: ['cccggc', 'cccaac', 'ccaacc']
I have a .txt file of 7000+ lines, each containing a description and a numbered path to an image. Example:
abnormal /Users/alex/Documents/X-ray-classification/data/images/1.png
abnormal /Users/alex/Documents/X-ray-classification/data/images/2.png
normal /Users/alex/Documents/X-ray-classification/data/images/3.png
normal /Users/alex/Documents/X-ray-classification/data/images/4.png
Some lines are missing. I want to automate the search for the missing lines. Intuitively, I wrote:
f = open("data.txt", 'r')
lines = f.readlines()
num = 1
for line in lines:
    if num in line:
        continue
    else:
        print (line)
    num+=1
But of course it didn't work, since lines are strings.
Is there any elegant way to sort this out? Using regex maybe?
Thanks in advance!
The following should hopefully work: it grabs the number out of the filename, checks whether it is more than 1 higher than the previous number, and if so, works out all the in-between numbers and prints them. Printing the number (and reconstructing the filename from it) is necessary because no line in the file will ever contain the name of a missing file.
# Set this to the first number in the series -1
num = lastnum = 0
with open("data.txt", 'r') as f:
    for line in f:
        # Pick the digit out of the filename
        num = int(''.join(x for x in line if x.isdigit()))
        if num - lastnum > 1:
            for i in range(lastnum+1, num):
                print("Missing: {}.png".format(str(i)))
        lastnum = num
The main advantage of working this way is that as long as your files are sorted in the list, it can handle starting at numbers other than 1, and also reports more than one missing number in the sequence.
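If the lines might not be sorted, or the path itself might contain digits, a regex plus a set difference is one alternative. A minimal sketch, assuming every file name ends in a number followed by .png (the sample lines and the pattern are illustrative):

```python
import re

# Hypothetical lines in the same format as the question's data.txt
lines = [
    "abnormal /Users/alex/Documents/X-ray-classification/data/images/1.png",
    "normal /Users/alex/Documents/X-ray-classification/data/images/3.png",
    "normal /Users/alex/Documents/X-ray-classification/data/images/6.png",
]

# Capture only the digits that form the file name just before ".png"
number_re = re.compile(r"(\d+)\.png\s*$")

present = set()
for line in lines:
    match = number_re.search(line)
    if match:
        present.add(int(match.group(1)))

# Every number between the smallest and largest seen that never appeared
missing = sorted(set(range(min(present), max(present) + 1)) - present)
print(missing)  # [2, 4, 5]
```

This only reports gaps between the smallest and largest numbers present, so files missing from the very start or end of the series need the expected range to be supplied separately.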
You can try this:
lines = [
    "abnormal /Users/alex/Documents/X-ray-classification/data/images/1.png",
    "normal /Users/alex/Documents/X-ray-classification/data/images/3.png",
    "normal /Users/alex/Documents/X-ray-classification/data/images/4.png",
]
maxvalue = 4 # or any other maximum value
missing = []
i = 0
for num in range(1, maxvalue+1):
    # past the end of the list, every remaining number is missing
    if i >= len(lines) or str(num) not in lines[i]:
        missing.append(num)
    else:
        i += 1
print(missing)
Or if you want to check for the line ending with XXX.png:
lines = [
    "abnormal /Users/alex/Documents/X-ray-classification/data/images/1.png",
    "normal /Users/alex/Documents/X-ray-classification/data/images/3.png",
    "normal /Users/alex/Documents/X-ray-classification/data/images/4.png",
]
maxvalue = 4 # or any other maximum value
missing = []
i = 0
for num in range(1, maxvalue+1):
    # the leading "/" keeps 1.png from matching 11.png
    if i >= len(lines) or not lines[i].endswith("/" + str(num) + ".png"):
        missing.append(num)
    else:
        i += 1
print(missing)
This is a pretty straightforward attempt. I haven't been using Python for long. It seems to work, but I am sure I have much to learn, so let me know if I am way off here. It needs to find lines matching a pattern, keep the first matching line, replace the remaining consecutive matching lines with a summary message, and return the modified string.
Just to be clear, the regex .*Dog.* would take
Cat
Dog
My Dog
Her Dog
Mouse
and return
Cat
Dog
::::: Pattern .*Dog.* repeats 2 more times.
Mouse
#!/usr/bin/env python
#
import re
def remove_repeats (l_string, l_regex):
    """Take a string, remove similar lines and replace with a summary message.
    l_regex accepts strings and tuples.
    """
    # Convert string to tuple.
    if isinstance(l_regex, str):
        l_regex = l_regex,
    for t in l_regex:
        r = ''
        p = ''
        m = 0 # Number of consecutive matching lines seen so far.
        for l in l_string.splitlines(True):
            if l.startswith('::::: Pattern'):
                r = r + l
            else:
                if re.search(t, l): # If line matches regex.
                    m += 1
                    if m == 1: # If this is first match in a set of lines add line to file.
                        r = r + l
                    elif m > 1: # Else update the message string.
                        p = "::::: Pattern '" + t + "' repeats " + str(m-1) + ' more times.\n'
                else:
                    if p: # Write the message string if it has value.
                        r = r + p
                        p = ''
                    m = 0
                    r = r + l
        if p: # Write the message if loop ended in a pattern.
            r = r + p
            p = ''
        l_string = r # Reset string to modified string.
    return l_string
The rematcher function seems to do what you want:
def rematcher(re_str, iterable):
    matcher= re.compile(re_str)
    in_match= 0
    for item in iterable:
        if matcher.match(item):
            if in_match == 0:
                yield item
            in_match+= 1
        else:
            if in_match > 1:
                yield "%s repeats %d more times\n" % (re_str, in_match-1)
            in_match= 0
            yield item
    if in_match > 1:
        yield "%s repeats %d more times\n" % (re_str, in_match-1)
import sys, re
for line in rematcher(".*Dog.*", sys.stdin):
    sys.stdout.write(line)
EDIT
In your case, the final string should be:
# splitlines(True) keeps the line endings, matching the "\n" that ends each summary line
final_string= ''.join(rematcher(".*Dog.*", your_initial_string.splitlines(True)))
Updated your code to be a bit more efficient
#!/usr/bin/env python
#
import re
def remove_repeats (l_string, l_regex):
    """Take a string, remove similar lines and replace with a summary message.
    l_regex accepts strings/patterns or tuples of strings/patterns.
    """
    # Convert a single string/pattern to a tuple.
    if not isinstance(l_regex, (tuple, list)):
        l_regex = l_regex,
    ret = []
    last_regex = None
    count = 0
    for line in l_string.splitlines(True):
        if last_regex:
            # Previous line matched one of the regexes
            if re.match(last_regex, line):
                # This one does too
                count += 1
                continue # skip to next line
            elif count > 1:
                ret.append("::::: Pattern %r repeats %d more times.\n" % (last_regex, count-1))
            count = 0
            last_regex = None
        ret.append(line)
        # Look for other patterns that could match
        for regex in l_regex:
            if re.match(regex, line):
                # Found one
                last_regex = regex
                count = 1
                break # exit inner loop
    if count > 1:
        # Input ended in the middle of a run of matching lines
        ret.append("::::: Pattern %r repeats %d more times.\n" % (last_regex, count-1))
    return ''.join(ret)
First, your regular expression will match more slowly than if you had left off the greedy wildcards. When used with re.search,
.*Dog.*
is equivalent to
Dog
but the latter matches more quickly because no backtracking is involved. The longer the strings, the more likely it is that "Dog" appears multiple times, and thus the more backtracking work the regex engine has to do. As it is, ".*D" virtually guarantees backtracking.
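The equivalence holds for re.search only. A minimal sketch on the sample lines from the question shows where re.match, which anchors at the start of the string, behaves differently:

```python
import re

lines = ["Cat", "Dog", "My Dog", "Her Dog", "Mouse"]

# With re.search, the plain pattern and the padded one select the same lines
plain = [l for l in lines if re.search("Dog", l)]
padded = [l for l in lines if re.search(".*Dog.*", l)]
print(plain == padded)  # True

# With re.match, which anchors at the start of the string, they differ
print([l for l in lines if re.match("Dog", l)])      # ['Dog']
print([l for l in lines if re.match(".*Dog.*", l)])  # ['Dog', 'My Dog', 'Her Dog']
```

So dropping the wildcards is only safe in code that calls re.search; the versions above that call re.match rely on the leading .* to find "Dog" in the middle of a line.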
That said, how about:
#! /usr/bin/env python
import re # regular expressions
import fileinput # read from STDIN or file
my_regex = '.*Dog.*'
my_matches = 0
for line in fileinput.input():
    line = line.strip()
    if re.search(my_regex, line):
        if my_matches == 0:
            print(line)
        my_matches = my_matches + 1
    else:
        if my_matches != 0:
            print('::::: Pattern %s repeats %i more times.' % (my_regex, my_matches - 1))
        print(line)
        my_matches = 0
It's not clear what should happen with non-neighboring matches.
It's also not clear what should happen with single-line matches surrounded by non-matching lines. Append "Doggy" and "Hula" to the input file and you'll get a message saying the pattern repeats 0 more times.
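One way to settle both edge cases is to treat each run of consecutive matching lines as a group. A minimal sketch using itertools.groupby, assuming that non-neighboring runs are summarized separately and that a run of one line gets no message (`collapse_repeats` is a name invented here):

```python
import re
from itertools import groupby


def collapse_repeats(lines, pattern):
    """Keep the first line of each run of matches and summarize the rest."""
    regex = re.compile(pattern)
    out = []
    # groupby splits the input into runs of consecutive lines that
    # either all match the pattern or all do not
    for is_match, run in groupby(lines, key=lambda l: bool(regex.search(l))):
        run = list(run)
        if not is_match:
            out.extend(run)
            continue
        out.append(run[0])
        if len(run) > 1:
            out.append("::::: Pattern %s repeats %d more times."
                       % (pattern, len(run) - 1))
    return out


sample = ["Cat", "Dog", "My Dog", "Her Dog", "Mouse", "Doggy", "Hula"]
for line in collapse_repeats(sample, ".*Dog.*"):
    print(line)
```

Because each run is handled as a whole, a run that ends at the last line of the input is summarized too, without a separate check after the loop.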