How to create an adaptable counter in Python

So I am trying to count the number of times a certain character occurs in a score (.txt) file. The file contains something like [$$$$$#$$#$$$].
I'm trying to create a counter that counts the number of times $ occurs but resets every time a # appears.
This is all I have come up with so far, but it doesn't account for the restart:
with open(score, 'r') as f:
    scoreLines = f.read().splitlines()
y = str(scoreLines)
dollar_count = y.count('$')  # $_count is not a valid Python identifier; count() already returns an int
The count is reflected in another part of the program which is outputting a wave. So every time a # occurs the wave needs to stop and start again. Any help would be appreciated!

You can simply use split:
for sequence in y.split('#'):
    wave_outputting_function(..., len(sequence), ...)
(assuming y contains your entire file as a string without linebreaks)
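For example, splitting the sample score string from the question gives one segment per run of $, and the length of each segment is the count your wave code needs:
y = "$$$$$#$$#$$$"
print([len(seq) for seq in y.split('#')])  # prints [5, 2, 3]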

Try this using iterators:
#!/usr/bin/env python3

def handle_count(count):
    print("count", count)

def main():
    with open("splitter.txt", "rt") as fhandle:
        for line in (d.strip() for d in fhandle):
            for d in (i.count('$') for i in line.split('#')):
                handle_count(d)

if __name__ == '__main__':
    main()

Related

How do I place specific characters in a specific place within a file?

I want to be able to add a specific character into my file, using Python code.
I have attempted using read functions and lists, but those come up with the error "TypeError: 'builtin_function_or_method'".
I believe this means that Python cannot write a character into a specific place using the list approach.
Incorrect way:
file: 1)5
Code:
while True:
    with open('file', 'w') as f:
        f.writeline[0]=+1
    with open('file', 'r') as f:
        fc = f.read()
    print(fc)
Expected output:
5,6,7,8,9,10,11,12,13,14,15....
I assumed that this line of code would keep increasing the five until I stopped the program, but instead it raised the error described earlier. Is there a way to write the code so that it works as expected?
Basically, you would need to build a function to update a single character. I think this would work, but I literally wrote it in like three minutes, so take caution...
def write_at_index(filename, y_pos, x_pos, character):
    """Write character 'character' at the given index."""
    lines = []  # begin lines outside the scope of the with statement
    with open(filename, "r") as file:
        lines = file.readlines()
    if len(lines) <= y_pos:
        raise Exception('y_pos out of bounds')
    if len(lines[y_pos]) <= x_pos:
        raise Exception('x_pos out of bounds')
    # strings are immutable, so rebuild the line around the new character
    lines[y_pos] = lines[y_pos][:x_pos] + character + lines[y_pos][x_pos + 1:]
    with open(filename, "w") as file:
        file.writelines(lines)
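A quick usage sketch (positions chosen to match the question's example file, which contains 1)5):
write_at_index('file', 0, 2, '6')  # overwrites row 0, column 2: "1)5" becomes "1)6"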
First, your code has an infinite loop: while True: never ends because there is no check variable or break condition.
Second, I don't think this line can work: f.writeline[0]=+1. File objects have no writeline method, and subscripting a method instead of calling it is what produces a TypeError like the one you quoted.
I'm not sure my recommended code fixes your issue, but if it doesn't match your idea, please leave a comment.
check = True
# the file initially contains: 1)5
add_val = 5
while check:
    with open('file', 'a') as f:  # append mode, so earlier values are kept
        f.write(str(add_val) + ",")  # write() needs a string, not an int
    add_val += 1
    if add_val > 20:  # condition to break the while loop
        check = False
with open('file', 'r') as f:
    print(f.read())

reading a text file in python and printing occurrence of a phrase

I'm struggling to get my code to work. I've been supplied with some variables to work with: the variable P is my file path, designated as content/textfiles/empty.txt, while the variable S is my string, designated as parrot. The end goal is to find out how many times Parrot (S) appears in my text file (P). Below is the information I have been supplied with, immediately followed by my crude attempt at the task:
import sys

P = sys.argv[1]
S = sys.argv[2]

file = open(P, 'r')
data = file.read()
count = 0
if S in P:
    count += 1
print(count)
file.close()
The primary issue is that I am supposed to get 3 as output, but my code only ever counts 1 occurrence, and I have no idea why.
[Answer previously posted on the question]:
The test if S in P: looks at the path string P, not at the file contents in data, and an in test can only ever add 1 to count. Use str.count on the contents instead:
file = open(P, 'r')
data = file.read()
count = data.count(S)
print(count)
file.close()
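Note that str.count is case sensitive; if Parrot and parrot should both match, a common approach is to normalize case first (a small sketch, assuming that behaviour is what you want):
count = data.lower().count(S.lower())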

How to use multiprocessing module to iterate a list and match it with a key in dictionary?

I have a list named master_lst created from a CSV file using the following code:
infile = open(sys.argv[1], "r")
lines = infile.readlines()[1:]
master_lst = ["read"]
for line in lines:
    line = line.strip().split(',')
    fourth_field = line[3]
    master_lst.append(fourth_field)
This master list has the unique set of sequences. Now I have to loop over 30 collapsed FASTA files and count the number of occurrences of each of these sequences in each file. The format of the 30 files is as follows:
>AAAAAAAAAAAAAAA
7451
>AAAAAAAAAAAAAAAA
4133
>AAAAAAAAAAAAAAAAA
2783
For counting the number of occurrences, I looped through each of the 30 files and created a dictionary with sequences as keys and numbers of occurrences as values. Then I iterated over each element of master_lst and matched it against the keys in the dictionary created in the previous step. If there is a match, I append the value for that key to a new list (ind_lst); if not, I append 0 to ind_lst. The code for that is as follows:
for file in files:
    ind_lst = []
    if file.endswith('.fa'):
        first = file.split(".")
        first_field = first[0]
        ind_lst.append(first_field)
        fasta = open(file)
        individual_dict = {}
        for line in fasta:
            line = line.strip()
            if line == '':
                continue
            if line.startswith('>'):
                header = line.lstrip('>')
                individual_dict[header] = ''
            else:
                individual_dict[header] += line
        for key in master_lst[1:]:
            a = 0
            if key in individual_dict.keys():
                a = individual_dict[key]
            else:
                a = 0
            ind_lst.append(a)
Then I write master_lst and ind_lst to a CSV file using the code explained here: How to append a new list to an existing CSV file?
The final output should look like this:
Read                file1  file2  ... file30
AAAAAAAAAAAAAAA     7451   4456
AAAAAAAAAAAAAAAA    4133   3624
AAAAAAAAAAAAAAAAA   2783   7012
This code works perfectly fine when I use a smaller master_lst, but as the size of master_lst increases, the execution time increases too much. The master_lst I am working with right now has 35,718,501 sequences (elements). When I subset 50 sequences and run the code, the script takes 2 hours to execute, so for 35,718,501 sequences it will take forever to complete.
Now I don't know how to speed the script up. I am not quite sure whether there are improvements that would make it execute in a shorter time. I am running the script on a Linux server which has 16 CPU cores. When I use the command top, I can see that the script uses only one CPU. I am not an expert in Python and I don't know how to make it run on all available CPU cores using the multiprocessing module. I checked this webpage: Learning Python's Multiprocessing Module.
But I wasn't quite sure what should go under def and if __name__ == '__main__':. I am also not quite sure what arguments I should pass to the function. I was getting an error when I tried the first code from Douglas without passing any arguments:
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
I have been working on this for the last few days and I haven't been successful in generating my desired output. If anyone can suggest alternative code that runs faster, or how to run this code on multiple CPUs, that would be awesome. Any help to resolve this issue would be much appreciated.
Here's a multiprocessing version. It uses a slightly different approach than your code does, one which does away with the need to create ind_lst.
The essence of the difference is that it first produces a transpose of the desired data, and then transposes that into the desired result.
In other words, instead of creating this directly:
Read,file1,file2
AAAAAAAAAAAAAAA,7451,4456
AAAAAAAAAAAAAAAA,4133,3624
AAAAAAAAAAAAAAAAA,2783,7012
It first produces:
Read,AAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAAA
file1,7451,4133,2783
file2,4456,3624,7012
...and then transposes that with the built-in zip() function to obtain the desired format.
Besides not needing to create ind_lst, it also allows the creation of one row of data per file rather than one column (which is easier and can be done more efficiently with less effort).
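To illustrate that zip() transpose step with the sample values above (Python 2, where zip() returns a list):
data = [['Read', 'AAAAAAAAAAAAAAA'], ['file1', 7451], ['file2', 4456]]
print(zip(*data))  # [('Read', 'file1', 'file2'), ('AAAAAAAAAAAAAAA', 7451, 4456)]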
Here's the code:
from __future__ import print_function
import csv
from functools import partial
from glob import glob
import operator
import os
from multiprocessing import cpu_count, Pool

def get_master_list(filename):
    with open(filename, "rb") as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # ignore header row
        sequence_getter = operator.itemgetter(3)  # retrieves fourth column of each row
        return map(sequence_getter, reader)

def process_fa_file(master_list, filename):
    fa_dict = {}
    with open(filename) as fa_file:
        for line in fa_file:
            if line and line[0] != '>':
                fa_dict[sequence] = int(line)
            elif line:
                sequence = line[1:-1]
    get = fa_dict.get  # local var to expedite access
    basename = os.path.basename(os.path.splitext(filename)[0])
    return [basename] + [get(key, 0) for key in master_list]

def process_fa_files(master_list, filenames):
    # "processes" is the number of worker processes to use; if it is None,
    # the number returned by cpu_count() is used instead.
    pool = Pool(processes=4)
    # Only one argument can be passed to the target function using Pool.map(),
    # so create a partial to bind the first argument, which doesn't vary.
    results = pool.map(partial(process_fa_file, master_list), filenames)
    header_row = ['Read'] + master_list
    return [header_row] + results

if __name__ == '__main__':
    master_list = get_master_list('master_list.csv')
    fa_files_dir = '.'  # current directory
    filenames = glob(os.path.join(fa_files_dir, '*.fa'))
    data = process_fa_files(master_list, filenames)
    rows = zip(*data)  # transpose
    with open('output.csv', 'wb') as outfile:
        writer = csv.writer(outfile)
        writer.writerows(rows)
    # show data written to file
    for row in rows:
        print(','.join(map(str, row)))
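A side note: the answer above targets Python 2 (hence the "rb"/"wb" modes for the CSV files). On Python 3 you would open those files in text mode with newline='' and materialize the lazy map() and zip() results before reusing them, roughly:
return list(map(sequence_getter, reader))  # map() returns an iterator in Python 3
rows = list(zip(*data))  # zip() does too, and rows is used twice (written and printed)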

Getting a random word from a text file

I am trying to return a word from a text file (so that I can eventually make a game from the word), but right now I get the error
IndexError: string index out of range
this is what my text file looks like
yellow
awesome
barking
happy
dancing
laughing
and this is the code I currently have:
import random

def generate_the_word(infile):
    for line in infile.readlines():
        random_line = random.randrange(line[0], line[len(line) + 1])
        print(random_line)

def main():
    infile = open("words.txt", "r")
    generate_the_word(infile)
    infile.close

main()
Do I have the wrong idea about indexing?
import random

def generate_the_word(infile):
    # splitlines() avoids picking the empty string after a trailing newline
    random_line = random.choice(open(infile).read().splitlines())
    return random_line

def main():
    infile = "words.txt"
    print(generate_the_word(infile))

main()
Your for loop is iterating over every line in the file and indexing into that line. You should also take advantage of Python's context managers, which take care of opening and closing files for you. What you want is to load all the lines:
with open(infile) as f:
    contents_of_file = f.read()
Your second problem is that you aren't properly indexing into those lines with randrange. You want a range between 0 and the number of lines:
lines = contents_of_file.splitlines()
line_number = random.randrange(0, len(lines))
return lines[line_number]
You also need to import the random module before this will run.
Your whole program would look like:
import random

def generate_the_word(infile):
    with open(infile) as f:
        contents_of_file = f.read()
    lines = contents_of_file.splitlines()
    line_number = random.randrange(0, len(lines))
    return lines[line_number]

def main():
    print(generate_the_word("Filename.txt"))

main()
You should also note that reading the file every time is inefficient; perhaps read it once and then pick lines from that. You could, for instance, read it in the main function and pass its already-read values to the generate_the_word function.
When you use readlines(), you get a list of lines. The random module has a handy method for picking a random element from such a sequence, which eliminates the need to deal with indexing "manually": random.choice(seq).
All you need is this (no for loop necessary):
def generate_the_word(infile):
    return random.choice(infile.readlines())

def main():
    infile = open("words.txt", "r")
    print(generate_the_word(infile))
    infile.close()  # note the parentheses: without them the file is never closed

main()
To avoid the costly operation of reading the file every time you want a single random word, you could also read the file in main and pass the list to generate_the_word instead:
import random

def generate_the_word(word_list):
    return random.choice(word_list)

def main():
    infile = open("words.txt", "r")
    lines = infile.readlines()
    infile.close()
    print(generate_the_word(lines))

main()
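One detail to watch for: readlines() keeps the trailing newline on each element, so you may want to strip the chosen word before using it in the game (a small sketch):
word = generate_the_word(lines).strip()  # e.g. 'yellow\n' becomes 'yellow'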

Python program to compare two files for showing the difference

I have the following code to compare two files. I would like this program to run when I point it at files as big as 4 or 5 MB. When I do that, the prompt cursor in the Python console just blinks, and no output is shown. Once, I ran it for the whole night and the next morning it was still blinking. What can I change in this code?
import difflib
file1 = open('/home/michel/Documents/first.csv', 'r')
file2 = open('/home/michel/Documents/second.csv', 'r')
diff = difflib.ndiff(file1.readlines(), file2.readlines())
delta = ''.join(diff)
print delta
If you use a Linux-based system, you can call the external command diff and use its result. I tried it with the diff command for two files of 14 MB and 9.3 MB; it takes 1.3 seconds:
real 0m1.295s
user 0m0.056s
sys 0m0.192s
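A minimal sketch of calling diff from Python with the standard-library subprocess module (using the file paths from the question):
import subprocess

proc = subprocess.Popen(
    ['diff', '/home/michel/Documents/first.csv', '/home/michel/Documents/second.csv'],
    stdout=subprocess.PIPE)
delta, _ = proc.communicate()  # diff exits with status 1 when the files differ
print delta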
When I tried to use difflib your way, I had the same issue: for big files difflib buffers the whole file in memory before comparing. As a workaround you can compare the two files in chunks; here I am diffing every 100 lines:
import difflib

file1 = open('1.csv', 'r')
file2 = open('2.csv', 'r')

lines_file1 = []
lines_file2 = []

# i: line number; pair: (line from file1, line from file2)
# note that zip() stops at the end of the shorter file
for i, pair in enumerate(zip(file1, file2)):
    lines_file1.append(pair[0])
    lines_file2.append(pair[1])
    # after buffering 100 lines, show the diff for this chunk and reset
    if i % 100 == 99:
        diff = difflib.ndiff(lines_file1, lines_file2)
        print ''.join(diff)
        lines_file1 = []
        lines_file2 = []

# show the diff for any lines left over
diff = difflib.ndiff(lines_file1, lines_file2)
print ''.join(diff)

file1.close()
file2.close()
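If the full ndiff markup is more than you need, the same chunked loop works with difflib.unified_diff, which emits only the changed lines plus a little context (a sketch using the same chunk lists):
diff = difflib.unified_diff(lines_file1, lines_file2)
print ''.join(diff)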
Hope it helps.
