I put together a Python script which reads the string BatchSequence="NUMBER" and returns just the integers. How can I find a certain integer and shift all the numbers after it by one, while leaving the integers before it unchanged? The sequence currently skips from 3 to 5; I want it to go 3, 4, 5.
Also, once I have figured this script out: how can I replace the numbers in the original text file with the new numbers? Would I have to write to a new file?
I have tried incrementing the numbers by one, but it starts from the beginning.
Code that I tried:
import re
file = '\\\MyDataNEE\\user$\\bxt058y\\Desktop\\75736.oxi.error'
counter = 0
for line in open(file):
    match = re.search('BatchSequence="(\d+)"', line)
    if match:
        print(int(match.group(1)) + 1)
Original Code:
import re
file = 'FILENAME HERE'
counter = 0
for line in open(file):
    match = re.search('BatchSequence="(\d+)"', line)
    if match:
        print(match.group(1))
Currently:
BatchSequence="1"
BatchSequence="2"
BatchSequence="3"
BatchSequence="5"
BatchSequence="6"
BatchSequence="7"
BatchSequence="8"
New output should be:
BatchSequence="1"
BatchSequence="2"
BatchSequence="3"
BatchSequence="4"
BatchSequence="5"
BatchSequence="6"
BatchSequence="7"
My take on the problem:
txt = '''BatchSequence="1"
BatchSequence="2"
BatchSequence="3"
BatchSequence="5"
BatchSequence="6"
BatchSequence="7"
BatchSequence="8"'''
import re
def fn(my_number):
    val = yield
    while True:
        val = yield str(val) if val < my_number else str(val-1)
f = fn(4)
next(f)
s = re.sub(r'BatchSequence="(\d+)"', lambda g: 'BatchSequence="' + f.send(int(g.group(1))) + '"', txt)
print(s)
Prints:
BatchSequence="1"
BatchSequence="2"
BatchSequence="3"
BatchSequence="4"
BatchSequence="5"
BatchSequence="6"
BatchSequence="7"
The generator fn(my_number) passes values through unchanged until it reaches my_number; from that point on, each value is decremented by one.
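As for the second part of the question: you don't strictly have to write to a new file, but it is the safest option. A rough sketch of one way to do it, reusing the same generator and regex (the file names below are only placeholders):

import re

def fn(my_number):
    # same generator as above: values below my_number pass through,
    # values from my_number upwards are shifted down by one
    val = yield
    while True:
        val = yield str(val) if val < my_number else str(val - 1)

in_path = 'original.txt'      # placeholder for the original file
out_path = 'renumbered.txt'   # placeholder for the corrected copy

f = fn(4)
next(f)

with open(in_path) as src:
    text = src.read()

fixed = re.sub(r'BatchSequence="(\d+)"',
               lambda g: 'BatchSequence="' + f.send(int(g.group(1))) + '"',
               text)

with open(out_path, 'w') as dst:
    dst.write(fixed)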
I'd like to create a program in Python 3 to find how many times specific words appear in txt files, and then build an Excel table with these values.
I made this function, but when I call it with my input the program doesn't work; it reports: unindent does not match any outer indentation level
def wordcount(filename, listwords):
    try:
        file = open(filename, "r")
        read = file.readlines()
        file.close()
        for x in listwords:
            y = x.lower()
            counter = 0
            for z in read:
                line = z.split()
                for ss in line:
                    l = ss.lower()
                    if y == l:
                        counter += 1
            print(y, counter)
Now I call the function with a txt file and the word to find:
wordcount("aaa.txt", 'word')
As output I'd like to see:
word 4
Thanks to everybody!
Here is an example you can use to find the number of times a specific word appears in a text file:
def searching(filename, word):
    counter = 0
    with open(filename) as f:
        for line in f:
            if word in line:
                print(word)
                counter += 1
    return counter
x = searching("filename","wordtofind")
print(x)
The output will be the word you are trying to find and the number of times it occurs (counted once per line that contains it).
As short as possible:
def wordcount(filename, listwords):
    with open(filename) as file_object:
        file_text = file_object.read()
    return {word: file_text.count(word) for word in listwords}

for word, count in wordcount('aaa.txt', ['a', 'list', 'of', 'words']).items():
    print("Count of {}: {}".format(word, count))
Getting back to mij's comment about passing the list of words as an actual list: if you pass a string to code that expects a list, Python will iterate over the string as a sequence of characters, which can be confusing if this behaviour is unfamiliar.
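For example, with the wordcount function above (a hypothetical call, using the file name from the question):

print(wordcount('aaa.txt', 'word'))    # the string is iterated character by character: counts 'w', 'o', 'r', 'd'
print(wordcount('aaa.txt', ['word']))  # a one-element list: counts the word 'word', as intended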
I'm trying to find the dinucleotide counts and frequencies from a sequence in a text file, but my code is only outputting single-nucleotide counts.
e = "ecoli.txt"
ecnt = {}
with open(e) as seq:
    for line in seq:
        for word in line.split():
            for i in range(len(seqr)):
                dinuc = (seqr[i] + seqr[i:i+2])
            for dinuc in seqr:
                if dinuc in ecnt:
                    ecnt[dinuc] += 1
                else:
                    ecnt[dinuc] = 1
for x, y in ecnt.items():
    print(x, y)
Sample input: "AAATTTCGTCGTTGCCC"
Sample output:
AA:2
TT:3
TC:2
CG:2
GT:2
GC:1
CC:2
Right now, I'm only getting single nucleotides in my output:
C 83550600
A 60342100
T 88192300
G 92834000
For nucleotides that repeat, e.g. "AAA", the count has to include all overlapping occurrences of 'AA', so the output should be 2 rather than 1. It doesn't matter what order the dinucleotides are listed in; I just need all combinations, and for the code to return the correct count for the repeated nucleotides. I asked my TA and she said my only problem was getting my 'for' loop to add the dinucleotides to my dictionary, and I think my range may be wrong. The file is a really big one, so the sequence is split across lines.
Thank you so much in advance!!!
I took a look at your code and found several things that you might want to take a look at.
For testing my solution, since I did not have ecoli.txt, I generated one of my own with random nucleotides with the following function:
import random
def write_random_sequence():
    out_file = open("ecoli.txt", "w")
    num_nts = 500
    nts_per_line = 80
    nts = []
    for i in range(num_nts):
        nt = random.choice(["A", "T", "C", "G"])
        nts.append(nt)
    lines = [nts[i:i+nts_per_line] for i in range(0, len(nts), nts_per_line)]
    for line in lines:
        out_file.write("".join(line) + "\n")
    out_file.close()
write_random_sequence()
Notice that this file has a single sequence of 500 nucleotides separated into lines of 80 nucleotides each. In order to count dinucleotides where you have the first nucleotide at the end of one line and the second nucleotide at the start of the next line, we need to merge all of these separate lines into a single string, without spaces. Let's do that first:
seq = ""
with open("ecoli.txt", "r") as seq_data:
    for line in seq_data:
        seq += line.strip()
Try printing out "seq" and notice that it should be one giant string containing all of the nucleotides. Next, we need to find the dinucleotides in the sequence string. We can do this using slicing, which I see you tried. So for each position in the string, we look at both the current nucleotide and the one after it.
for i in range(len(seq)-1):  # note the -1
    dinuc = seq[i:i+2]
We can then count the dinucleotides and store them in a dictionary "ecnt", very much like you had. The final code looks like this:
ecnt = {}
seq = ""
with open("ecoli.txt", "r") as seq_data:
    for line in seq_data:
        seq += line.strip()
for i in range(len(seq)-1):
    dinuc = seq[i:i+2]
    if dinuc in ecnt:
        ecnt[dinuc] += 1
    else:
        ecnt[dinuc] = 1
print(ecnt)
A perfect opportunity to use a defaultdict:
from collections import defaultdict

file_name = "ecoli.txt"
dinucleotide_counts = defaultdict(int)
sequence = ""

with open(file_name) as file:
    for line in file:
        sequence += line.strip()

for i in range(len(sequence) - 1):
    dinucleotide_counts[sequence[i:i + 2]] += 1

for key, value in sorted(dinucleotide_counts.items()):
    print(key, value)
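As a quick sanity check (not taken from either answer), running the same sliding-window count directly on the sample string from the question gives the expected overlapping counts:

from collections import defaultdict

sample = "AAATTTCGTCGTTGCCC"           # sample input from the question
counts = defaultdict(int)
for i in range(len(sample) - 1):
    counts[sample[i:i + 2]] += 1       # overlapping windows, so "AAA" contributes two "AA"s
print(dict(counts))
# {'AA': 2, 'AT': 1, 'TT': 3, 'TC': 2, 'CG': 2, 'GT': 2, 'TG': 1, 'GC': 1, 'CC': 2}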
I am just learning Python and need some help with my class assignment.
I have a file with text and numbers in it. Some lines have from one to three numbers and others have no numbers at all.
I need to:
Extract numbers only from the file using regex
Find the sum of all the numbers
I used regex to extract all the numbers. I am trying to get the total sum of all the numbers, but I am only getting the sum of each line that has numbers. I have been battling with different ways to do this assignment and this is the closest I have gotten to getting it right.
I know I am missing some key parts but I am not sure what I am doing wrong.
Here is my code:
import re
text = open('text_numbers.txt')
for line in text:
    line = line.strip()
    y = re.findall('([0-9]+)', line)
    if len(y) > 0:
        print sum(map(int, y))
The result I get is something like this
(each is a sum of a line):
14151
8107
16997
18305
3866
And it needs to be one sum like this (sum of all numbers):
134058
import re
import numpy as np

text = open('text_numbers.txt')
final = []
for line in text:
    line = line.strip()
    y = re.findall('([0-9]+)', line)
    if len(y) > 0:
        lineVal = sum(map(int, y))
        final.append(lineVal)
        print "line sum = {0}".format(lineVal)

print "Final sum = {0}".format(np.sum(final))
Is that what you're looking for?
I don't know much Python, but I can give a simple solution. Try this:
import re
hand = open('text_numbers.txt')
x=list()
for line in hand:
    y = re.findall('[0-9]+', line)
    x = x + y
sum = 0
for i in x:
    sum = sum + int(i)
print sum
This is my first attempt at an answer using regular expressions; reading other people's code is a great skill to practise.
import re # import regular expressions
chuck_text = open("regex_sum_286723.txt")
numbers = []
Total = 0
for line in chuck_text:
    nmbrs = re.findall('[0-9]+', line)
    numbers = numbers + nmbrs
for n in numbers:
    Total = Total + float(n)
print "Total = ", Total
And thanks to Beer for the list-comprehension one-liner. His 'r' prefix (a raw string, which keeps backslashes from being treated as escape sequences) is not actually needed for a pattern like '[0-9]+'. But the one-liner reads beautifully; I get more confused reading two sets of loops like in my own answer:
import re
print sum([int(i) for i in re.findall('[0-9]+',open("regex_sum_286723.txt").read())])
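A quick aside on that 'r' prefix: it makes no difference for '[0-9]+', but it does matter for patterns such as '\b\d+\b', where '\b' in a normal string is read as a backspace character. An illustration with made-up text:

import re

text = 'There are 12 cats and 34 dogs'
print(re.findall('\b\d+\b', text))     # [] because '\b' became a backspace character
print(re.findall(r'\b\d+\b', text))    # ['12', '34'] because r'\b' stays a word boundary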
import re
text = open('text_numbers.txt')
data=text.read()
print sum(map(int,re.findall(r"\b\d+\b",data)))
Use .read() to get the content as a single string.
import re
sample = open('text_numbers.txt')
total = 0
dignum = 0
for line in sample:
    line = line.rstrip()
    dig = re.findall('[0-9]+', line)
    if len(dig) > 0:
        dignum += len(dig)
        linetotal = sum(map(int, dig))
        total += linetotal
print 'The number of digits are: '
print dignum
print 'The sum is: '
print total
print 'The sum ends with: '
print total % 1000
import re
print sum([int(i) for i in re.findall('[0-9]+',open(raw_input('What is the file you want to analyze?\n'),'r').read())])
You can compact it into one line, but this is only for fun!
Here is my solution to this problem.
import re
file = open('text_numbers.txt')
sum = 0
for line in file:
    line = line.rstrip()
    line = re.findall('([0-9]+)', line)
    for i in line:
        i = int(i)
        sum += i
print(sum)
After re.findall, the line variable in the first for loop is a list of matched strings, and I used the second for loop to convert its elements from strings to integers so I can sum them.
import re
fl = open('regex_sum_7469.txt')
ls = []
for x in fl:                     # go through the file line by line
    x = x.rstrip()
    print x
    t = re.findall('[0-9]+', x)  # all the numbers on this line
    for d in t:                  # t is empty for lines with no numbers, so those are skipped
        ls.append(int(d))
print (sum(ls))
Here is my code:
import re

f = open('regex_sum_text.txt', 'r').read().strip()
y = re.findall('[0-9]+', f)
l = [int(s) for s in y]
s = sum(l)
print(s)
Another, shorter way is:
with open('regex_sum_text.txt', 'r') as f:
    total = sum(map(int, re.findall(r'[0-9]+', f.read())))
print(total)
import re
print(sum(int(value) for value in re.findall('[0-9]+', open('regex_sum_1128122.txt').read())))
I have got this Python program which reads through a wordlist file and, using the endswith() method, checks for the suffix endings that are given in another file.
The suffixes to check for are saved in the list suffixList[].
The counts are kept in suffixCount[].
The following is my code:
fd = open(filename, 'r')
print 'Suffixes: '
x = len(suffixList)
for line in fd:
    for wordp in range(0, x):
        if word.endswith(suffixList[wordp]):
            suffixCount[wordp] = suffixCount[wordp] + 1
for output in range(0, x):
    print "%-6s %10i" % (prefixList[output], prefixCount[output])
fd.close()
The output is this :
Suffixes:
able 0
ible 0
ation 0
The program never seems to reach this branch:
if word.endswith(suffixList[wordp]):
You need to strip the newline:
word = ln.rstrip().lower()
The words are coming from a file so each line ends with a newline character. You are then trying to use endswith which always fails as none of your suffixes end with a newline.
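A quick illustration of why the unstripped newline makes every check fail (the word here is made up):

word = 'comfortable\n'                          # a line read from a file keeps its newline
print(word.endswith('able'))                    # False: the string actually ends with '\n'
print(word.rstrip().lower().endswith('able'))   # True once the newline is stripped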
I would also change the function to return the values you want:
def store_roots(start, end):
    with open("rootsPrefixesSuffixes.txt") as fs:
        lst = [line.split()[0] for line in map(str.strip, fs)
               if '#' not in line and line]
    return lst, dict.fromkeys(lst[start:end], 0)
lst, sfx_dict = store_roots(22, 30) # List, SuffixList
Then slice from the end and see if the substring is in the dict:
with open('longWordList.txt') as fd:
    print('Suffixes: ')
    mx = len(max(sfx_dict, key=len))   # length of the longest suffix
    mn = len(min(sfx_dict, key=len))   # length of the shortest suffix
    for ln in map(str.rstrip, fd):
        suf = ln[-mx:]                 # start with the longest possible ending
        for i in range(mx - 1, mn - 2, -1):
            if suf in sfx_dict:
                sfx_dict[suf] += 1
            suf = suf[-i:]             # drop one character for the next, shorter check
    for k, v in sfx_dict.items():
        print("Suffix = {} Count = {}".format(k, v))
Slicing the end of the word and shortening it incrementally should be faster than checking the word against every suffix, especially if you have numerous suffixes of the same length. At most it does mx - mn + 1 iterations per word, so if you had 20 four-character suffixes you would only need a single dict lookup per word: only one substring of a given length can match at a time, so one slice and one lookup cover all the suffixes of that length.
You could use a Counter to count the occurrences of each suffix:
from collections import Counter

with open("rootsPrefixesSuffixes.txt") as fp:
    List = [line.strip() for line in fp if line and '#' not in line]
suffixes = List[22:30]  # ?

with open('longWordList.txt') as fp:
    c = Counter(s for word in fp for s in suffixes if word.rstrip().lower().endswith(s))

print(c)
Note: add .split()[0] if there is more than one word per line and you want to ignore the extra ones; otherwise this is unnecessary.
I am trying to set up a system for running various statistics on a text file. In this endeavor I need to open a file in Python (v2.7.10) and read it both as lines, and as a string, for the statistical functions to work.
So far I have this:
import csv, json, re
from textstat.textstat import textstat
file = "Data/Test.txt"
data = open(file, "r")
string = data.read().replace('\n', '')
lines = 0
blanklines = 0
word_list = []
cf_dict = {}
word_dict = {}
punctuations = [",", ".", "!", "?", ";", ":"]
sentences = 0
This sets up the file and the preliminary variables. At this point, print textstat.syllable_count(string) returns a number. Further, I have:
for line in data:
    lines += 1
    if line.startswith('\n'):
        blanklines += 1
    word_list.extend(line.split())
    for char in line.lower():
        cf_dict[char] = cf_dict.get(char, 0) + 1

for word in word_list:
    lastchar = word[-1]
    if lastchar in punctuations:
        word = word.rstrip(lastchar)
    word = word.lower()
    word_dict[word] = word_dict.get(word, 0) + 1

for key in cf_dict.keys():
    if key in '.!?':
        sentences += cf_dict[key]

number_words = len(word_list)
num = float(number_words)
avg_wordsize = len(''.join([k*v for k, v in word_dict.items()]))/num
mcw = sorted([(v, k) for k, v in word_dict.items()], reverse=True)
print( "Total lines: %d" % lines )
print( "Blank lines: %d" % blanklines )
print( "Sentences: %d" % sentences )
print( "Words: %d" % number_words )
print('-' * 30)
print( "Average word length: %0.2f" % avg_wordsize )
print( "30 most common words: %s" % mcw[:30] )
But this fails: the line avg_wordsize = len(''.join([k*v for k, v in word_dict.items()]))/num raises ZeroDivisionError: float division by zero. However, if I comment out string = data.read().replace('\n', '') in the first piece of code, I can run the second piece without problems and get the expected output.
Basically, how do I set this up so that I can run the second piece of code on data, as well as textstat on string?
The call to data.read() places the file pointer at the end of the file, so you don't have anything left to read at that point. You either have to close and reopen the file, or more simply reset the pointer to the beginning using data.seek(0).
First see the line:
string = data.read().replace('\n', '')
You are reading from data once. Now the cursor is at the end of data.
Then see the line:
for line in data:
You are trying to read it again, but you can't, because there is nothing left in data; you are at the end of it. So len(word_list) ends up being 0.
You are dividing by it and getting the error.
ZeroDivisionError: float division by zero.
But when you comment that line out, you are reading the file only once, which is valid, so the second portion of your code works.
Clear now?
So, what to do now?
Use data.seek(0) after data.read().
Demo:
>>> a = open('file.txt')
>>> a.read()
#output
>>>a.read()
#nothing
>>> a.seek(0)
>>> a.read()
#output again
Here is a simple fix. Replace the line for line in data: with:
data.seek(0)
for line in data.readlines():
    ...
It basically points back to the beginning of the file and reads it again line by line.
While this should work, you may want to simplify the code and read the file only once. Something like:
with open(file, "r") as fin:
    lines = fin.readlines()
    string = ''.join(lines).replace('\n', '')
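With that in place, the line-based statistics can loop over the saved list instead of the exhausted file object, and textstat still gets the joined string. A rough sketch reusing the names from the question (the list is called text_lines here, since the question already uses lines as a counter):

from textstat.textstat import textstat

with open("Data/Test.txt", "r") as fin:
    text_lines = fin.readlines()                   # keep the lines for the per-line statistics
string = ''.join(text_lines).replace('\n', '')     # and one big string for textstat

print(textstat.syllable_count(string))             # the string-based statistics still work

lines = blanklines = 0
for line in text_lines:                            # loop over the saved list, not the file object
    lines += 1
    if line.startswith('\n'):
        blanklines += 1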