I have a txt file with thousands of lines.
Each line starts with '#integer', for example '#100'.
I read the txt file sequentially (line #1, #2, #3, ...) and build an array of line numbers, where each sub-array holds a line number together with the other line numbers connected to it.
The array has the form:
[ ['#355', '#354', '#357', '#356'], ['#10043', '#10047', '#10045'], ['#1221', '#1220', '#1223', '#1222', '#1224'], [...] ]
It can contain hundreds of numbers.
(This is because each sub-array holds a number plus the 'children' associated with it.)
Before calling the function below, I read my txt file and extract the numbers. I then pass them as an array to the extended_strings function, which replaces each number with the actual line for that number from the txt file.
import re
def extended_strings(matrix, base_txt):
    string_matrix = matrix  # note: this is the same object as matrix, so matrix is modified in place
    for numset in string_matrix:
        for num in numset:
            for line in base_txt:
                results = re.findall(r'^#\d+', line)  # find the line # at start of string
                if len(results) > 0 and results[0] == num:  # the line's # matches the # in the numset
                    index = numset.index(num)  # find index of line # in the numset
                    numset[index] = line  # replace the line # with the actual string from the txt
    return string_matrix
I am trying to make this process shorter and more efficient. For example, with 150,000 strings in the txt, the file is scanned millions of times by the for line in base_txt loop.
Any suggestions?
I haven't timed this, but I'm confident it will help, because each number is looked up in a dictionary instead of by rescanning the file.
That said, there is still plenty of room for improvement.
text.txt:
#1 This is line #00001
#2 This is line #00002
#30 This is line #00030
#35 This is line #00035
#77 This is line #00077
#101 This is line #00101
#145 This is line #00145
#1010 This is line #01010
#8888 This is line #08888
#13331 This is line #13331
#65422 This is line #65422
code:
import re
# reo = re.compile(r'^(#\d+)\s+(.*)\n$') # exclude line numbers in "string_matrix"
reo = re.compile(r'^((#\d+)\s+.*)\n$') # include line numbers in "string_matrix"
def file_to_dict(file_name):
    file_dict = {}
    with open(file_name) as f:
        for line in f:
            mo = reo.fullmatch(line)
            # file_dict[mo.group(1)] = mo.group(2) # exclude line numbers in "string_matrix"
            file_dict[mo.group(2)] = mo.group(1) # include line numbers in "string_matrix"
    return file_dict
def extended_strings(matrix, file_dict):
    string_matrix = []
    for numset in matrix:
        new_numset = []
        for num in numset:
            new_numset.append(file_dict[num])
        string_matrix.append(new_numset)
    return string_matrix
matrix = [['#1010', '#35', '#2', '#145', '#8888'], ['#30', '#2'], ['#65422', '#1', '#13331', '#77', '#101', '#8888']]
file_dict = file_to_dict('text.txt')
string_matrix = extended_strings(matrix, file_dict)
for list_ in string_matrix:
    for line in list_:
        print(line)
    print()
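For what it's worth, the same dictionary idea can be written more compactly. This is only a sketch: the names build_index and expand are made up for illustration, it assumes every useful line starts with '#<digits>', and it leaves unknown numbers unchanged instead of raising KeyError, which may or may not be what you want.

```python
import re

LINE_NO = re.compile(r'^#\d+')  # line number at the very start of a line


def build_index(lines):
    """Map '#123' -> full line (without the trailing newline)."""
    index = {}
    for line in lines:
        match = LINE_NO.match(line)
        if match:  # lines without a leading '#<digits>' are skipped
            index[match.group(0)] = line.rstrip('\n')
    return index


def expand(matrix, index):
    """Replace each '#123' by its line; unknown numbers stay as they are."""
    return [[index.get(num, num) for num in numset] for numset in matrix]


sample = ['#1 This is line #00001\n', '#2 This is line #00002\n']
print(expand([['#2', '#1'], ['#7']], build_index(sample)))
```

The file is read once to build the index, and every lookup afterwards is a dictionary access, so the cost no longer grows with the number of lines per lookup.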
Thanks for the help Werner Wenzel,
I've found the solution that works for me and would like to share it here:
import re
def file_to_dict(file_name):
    file_dict = {}
    with open(file_name) as f:
        for line in f:
            stg = re.findall("(.+)", line)
            stgNum = re.findall(r"#\d{1,10}", line)  # raw string, so \d is not treated as a string escape
            file_dict[stgNum[0]] = stg[0]
    return file_dict
def extended_strings(matrix, file_dict):
    string_matrix = []
    for numset in matrix:
        new_numset = []
        for num in numset:
            new_numset.append(file_dict[num])
        string_matrix.append(new_numset)
    return string_matrix
matrix = [['#1010', '#35', '#2', '#145', '#8888'], ['#30', '#2'], ['#65422', '#1', '#13331', '#77', '#101', '#8888']]
file_dict = file_to_dict('text.txt')
string_matrix = extended_strings(matrix, file_dict)
for list_ in string_matrix:
    for line in list_:
        print(line)
print("done")
I have a list called position that should store line numbers found with find(). The input is abc.py, a Python file (included below).
Question: how do I find the positions of all def statements?
Objective: capture the line number of the first def, the second def, the third, and so on, and store them in position.
abc.py:
#!C:\Python\Python39\python.exe
# print ('Content-type: text/html\n\n')

def hello_step0():
    create()
    login()

def hello2_step1():
    delete()

def hello2_step2():
    delete()
What I did to find the first occurrence of def:
position = []
with open("abc.py","r") as file:
    line = file.readlines()
    line = str(line)
    print(line.strip().find("def")) # output 102
    position.append(line.strip().find("def"))
Try str.startswith:
position = []
with open("abc.py") as f:
    for index, line in enumerate(f.readlines()):
        if line.strip().startswith('def'):
            position.append(index)
print(position) # [3, 7, 10] --> line numbers start at 0
# 0 --> #!C:\Python\Python39\python.exe
# 1 --> # print ('Content-type: text/html\n\n')
# 2 -->
# 3 --> def hello_step0():
#           create()
To start counting at 1, replace the loop header with:
    for index, line in enumerate(f.readlines(), 1):
The code below appends the line numbers of lines that start with def to the position list.
position = []
with open('abc.py', 'r') as f:
    for i, val in enumerate(f.readlines()):
        if val.startswith('def'):
            position.append(i)
Output:
position = [3,7,10]
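One caveat with a plain startswith('def') test is that it also matches lines such as default_timeout = 3. As a hedged sketch (the function name def_line_numbers and the sample lines are made up here), a regular expression can require def, whitespace, a name and an opening parenthesis. It also accepts indented methods, but it does not handle async def or a def that appears inside a string.

```python
import re

# 'def', whitespace, an identifier, then '(' ; leading indentation allowed
DEF_LINE = re.compile(r'^\s*def\s+\w+\s*\(')


def def_line_numbers(lines, start=0):
    """Return the numbers of all lines that look like a function definition."""
    return [n for n, line in enumerate(lines, start) if DEF_LINE.match(line)]


source = [
    "# comment\n",
    "default_timeout = 3\n",
    "def hello_step0():\n",
    "    create()\n",
    "class A:\n",
    "    def method(self):\n",
    "        pass\n",
]
print(def_line_numbers(source))           # 0-based numbering
print(def_line_numbers(source, start=1))  # 1-based numbering
```

With a real file you would pass the open file object instead of the list, since enumerate works on any iterable of lines.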
I have data that is set up as follows:
//Name_1 * *
>a xyzxyzyxyzyxzzxy
>b xyxyxyzxyyxzyxyz
>c xyzyxzyxyzyxyzxy
//Name_2
>a xyzxyzyxyzxzyxyx
>b zxyzxyzxyyzxyxzx
>c zxyzxyzxyxyzyzxy
//Name_3 * *
>a xyzxyzyxyzxzyxyz
>b zxyzxyzxzyyzyxyx
>c zxyzxyzxyxyzyzxy
...
The //-line refers to an ID for the following group of sequences until the next //-line is reached.
I have been working on a program that reads the positions of the asterisks and prints the characters at those positions for each sequence.
To simplify things for myself, I have been working on a subset of my data containing only one group of sequences, e.g.:
//Name_1 * *
>a xyzxyzyxyzyxzzxy
>b xyxyxyzxyyxzyxyz
>c xyzyxzyxyzyxyzxy
My program does what I want on this subset.
import sys
import csv
datafile = open(sys.argv[1], 'r')
outfile = open(sys.argv[1]+"_FGT_Data", 'w')
csv_out = csv.writer(outfile, delimiter=',')
csv_out.writerow(['Locus', 'Individual', 'Nucleotide', 'Position'])
with (datafile) as searchfile:
    var_line = [line for line in searchfile if '*' in line]
    LocusID = [line[2:13].strip() for line in var_line]
    poslist = [i for line in var_line for i, x in enumerate(line) if x =='*']
datafile = open(sys.argv[1], 'r')
with (datafile) as getsnps:
    lines = [line for line in getsnps.readlines() if line.startswith('>')]
    for pos in poslist:
        for line in lines:
            snp = line[pos]
            individual = line[0:7]
            indistr = individual.strip()
            csv_out.writerow((LocusID[0], indistr, line[pos], str(pos)))
datafile.close()
outfile.close()
However, I am now trying to modify it to work on the full dataset, and I am having trouble finding the correct way to iterate over the data.
I need to search through the file, and when a line containing '*' is reached, I need to do as in the above code for the sequences belonging to that line, and then continue to the next line containing a '*'. Do I need to split up my data at the //-lines, or what is the best approach?
I have uploaded a sample of my data to dropbox:
Data_Sample.txt contains several groups, and is the kind of data, I am trying to get the program to work on.
Data_One_Group.txt contains only one group, and is the data I have gotten the program to work on so far.
https://www.dropbox.com/sh/3j4i04s2rg6b63h/AADkWG3OcsutTiSsyTl8L2Vda?dl=0
--------EDIT---------
I am trying to implement the suggestion by @Julien Spronck below.
However, I am having trouble processing the produced block. How can I search through the block line by line? For example, why does the code below not work as intended? It prints only the asterisks and not the line itself.
block =''
with open('onelocus.txt', 'r') as searchfile:
    for line in searchfile:
        if line.startswith('//'):
            #print line
            if block:
                for line in block:
                    if '*' in line:
                        print line
            block = line
        else:
            block += line
---------EDIT 2----------
I am getting closer. I understand that I need to split the string into lines to be able to search through them. The code below works on one group, but when I try to iterate over several, it prints the information for the first group only, repeated once for each group. I have tried clearing LocusID and poslist before the next iteration, but this does not seem to be the solution.
block =''
with (datafile) as searchfile:
    for line in searchfile:
        if line.startswith('//'):
            if block:
                var_line = [line for line in block.splitlines() if '*' in line]
                LocusID = [line[2:13].strip() for line in var_line]
                print LocusID
                poslist = [i for line in var_line for i, x in enumerate(line) if x == '*']
                print poslist
            block = line
        else:
            block += line
Can't you do something like:
block =''
with open(filename, 'r') as fil:
    for line in fil:
        if line.startswith('//'):
            if block:
                do_something_with(block)
            block = line
        else:
            block += line
if block:
    do_something_with(block)
In this code, I just append the lines of the file to a variable block. Once I find a line that starts with //, I process the previous block and reinitialize the block for the next iteration.
The last two lines will take care of processing the last block, which would not be processed otherwise.
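The same accumulate-and-flush pattern can also be packaged as a generator, so the grouping logic is separate from whatever is done with each block. This is just a sketch, not part of the original answer: iter_blocks is a made-up name, and it yields lists of lines rather than one concatenated string. Lines that appear before the first //-line are yielded as a block of their own.

```python
def iter_blocks(lines, marker='//'):
    """Yield lists of lines; a new list starts at each line beginning with marker."""
    block = []
    for line in lines:
        if line.startswith(marker) and block:
            yield block  # the previous block is complete
            block = []
        block.append(line)
    if block:  # the final block has no marker line after it
        yield block


data = ["//Name_1 * *\n", ">a xyz\n", ">b zzz\n", "//Name_2\n", ">a yyy\n"]
for block in iter_blocks(data):
    # first line is the ID line, the rest are its sequences
    print(block[0].strip(), len(block) - 1)
```

Because each block is already a list of lines, a line-by-line search such as `[l for l in block if '*' in l]` works directly, without calling splitlines() first.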
do_something_with(block) could be something like this:
def do_something_with(block):
    lines = block.splitlines()
    j = 0
    first_line = lines[j]
    while first_line.strip() == '':
        j += 1
        first_line = lines[j]
    pos = []
    position = first_line.find('*')
    while position != -1:
        pos.append(position)
        position = first_line.find('*', position+1)
    for k, line in enumerate(lines):
        if k > j:
            for p in pos:
                print line[p],
            print
## prints
## z y
## x z
## z y
I have created a way to make this work with the data you provided.
Run it with two file paths: the first is your input.txt and the second is your output.csv.
explanation
First we create a dictionary with the locus line as key and its sequences as values.
We iterate over this dictionary, find the * positions in the locus line, and append them to a list, indexes.
We iterate over the values belonging to this key and extract each sequence.
For each sequence we iterate over indexes to gather the SNPs.
For each SNP we append a row to the CSV file.
We empty the indexes list so we can go on to the next key.
Keep in mind
This method depends heavily on the number of spaces in your input.txt.
It is not the fastest way to do this, but it gets the job done.
I hope this helps. If you have any questions, feel free to ask, and I will happily try to answer them when I have time.
script
import sys
import csv
dic = {}
indexes = []
datafile = sys.argv[1]
outfile = sys.argv[2]
with open(datafile,'r') as snp_file:
    lines = snp_file.readlines()
sequences = []
for i in range(0,len(lines)):
    if lines[i].startswith("//"):
        sequences = []  # fresh list per locus; reusing one list would make every key share the same sequences
        dic[lines[i].rstrip()] = sequences
    if lines[i].startswith(">"):
        sequences.append(lines[i].rstrip())
for key in dic:
    locus = key.split(" ")[0].replace("//","")
    for i, x in enumerate(key):
        if x == '*':
            indexes.append(i-11)
    for sequence in dic[key]:
        seq = sequence.split(" ")[1]
        seq_id = sequence.split(" ")[0].replace(">","")
        for z in indexes:
            position = z+1
            nucleotide = seq[z]
            with open(outfile,'a')as handle:
                csv_out = csv.writer(handle, delimiter=',')
                csv_out.writerow([locus,seq_id,position,nucleotide])
    del indexes[:]
input.txt
//Locus_1 * *
>Safr01 AATCCGTTTTAAACCAGNTCYAT
>Safr02 TTAATCCGTTTTAAACCAGNTCY
//Locus_2 * *
>Safr01 AATCCGTTTTAAACCAGNTCYAT
>Safr02 TTAATCCGTTTTAAACCAGNTCY
output.csv
Locus_1,Safr01,1,A
Locus_1,Safr01,22,A
Locus_1,Safr02,1,T
Locus_1,Safr02,22,C
Locus_2,Safr01,5,C
Locus_2,Safr01,19,T
Locus_2,Safr02,5,T
Locus_2,Safr02,19,G
This is how I ended up solving the problem:
def do_something_with(block):
    lines = block.splitlines()
    for line in lines:
        if '*' in line:
            hit = line
            LocusID = hit[2:13].strip()
            for i, x in enumerate(hit):
                if x=='*':
                    poslist.append(i)
    for pos in poslist:
        for line in lines:
            if line.startswith('>'):
                individual = line[0:7].strip()
                snp = line[pos]
                print LocusID, individual, snp, pos,
                csv_out.writerow((LocusID, individual, snp, pos))
# datafile and csv_out are opened as in the script from the question
block = ''
poslist = []
with (datafile) as searchfile:
    for line in searchfile:
        if line.startswith('//'):
            if block:
                do_something_with(block)
                poslist = list()
            block = line
        else:
            block += line
if block:
    do_something_with(block)
I want to pull all data from a text file from a specified line number until the end of a file. This is how I've tried:
def extract_values(f):
    line_offset = []
    offset = 0
    last_line_of_heading = False
    if not last_line_of_heading:
        for line in f:
            line_offset.append(offset)
            offset += len(line)
            if whatever_condition:
                last_line_of_heading = True
    f.seek(0)
    # non-functioning pseudocode follows
    data = f[offset:] # read from current offset to end of file into this variable
There is actually a blank line between the header and the data I want, so ideally I could skip this also.
Do you know the line number in advance? If so,
def extract_values(f, line_number):
    # line_number: 0-based index of the first line to keep
    data = f.readlines()[line_number:]
    return data
If not, and you need to determine the line number based on the content of the file itself,
def extract_values(f):
    lines = f.readlines()
    data = []
    for line_number, line in enumerate(lines):
        if some_condition(line):
            data = lines[line_number:]
            break
    return data
This will not be ideal if your files are enormous (since the lines of the file are loaded into memory); in that case, you might want to do it in two passes, only storing the file data on the second pass.
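As an alternative to two passes, here is a single-pass sketch that relies on a file object being an iterator: the header lines are consumed and thrown away, then the blank separator is skipped, and only the data lines are kept. The function name lines_after_header and the predicate argument are assumptions made for this example; what marks the last header line depends on your file.

```python
import io
from itertools import dropwhile


def lines_after_header(f, is_last_header_line):
    """Return the lines after the header, skipping blank lines right after it."""
    it = iter(f)
    for line in it:  # consume the header without storing it
        if is_last_header_line(line):
            break
    # drop the blank separator line(s); keep everything that follows
    return list(dropwhile(lambda line: not line.strip(), it))


sample = io.StringIO("title\nEND OF HEADER\n\nrow 1\nrow 2\n")
print(lines_after_header(sample, lambda line: line.startswith("END")))
```

If the header marker is never found, the iterator is exhausted and an empty list comes back, so the caller may want to treat that case as an error.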
Your if clause is in the wrong position. It belongs inside the loop:
for line in f:
    if not last_line_of_heading:
Consider this code:
def extract_values(f):
    rows = []
    last_line_of_heading = False
    for line in f:
        if last_line_of_heading:
            rows.append(line)
        elif whatever_condition:
            last_line_of_heading = True
    # if you want a string instead of a list of lines
    # (each line still ends with its own newline, so join on ""):
    data = "".join(rows)
    return data
You can use enumerate:
f = open('your_file')
for i, x in enumerate(f):
    if i >= your_line:
        pass  # do your stuff
Here i holds the line number, starting from 0, and x holds the line.
Using a list comprehension:
[x for i, x in enumerate(f) if i >= your_line]
gives you the list of lines from the specified line onward.
Using a dictionary comprehension:
{i: x for i, x in enumerate(f) if i >= your_line}
gives you the line number as key and the line as value, from the specified line number onward.
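If only the lines themselves are needed, itertools.islice from the standard library expresses "skip the first n lines" directly. A small sketch, with lines_from as a made-up helper name and an in-memory file standing in for the real one:

```python
import io
from itertools import islice


def lines_from(f, first_line):
    """Return the lines of f starting at 0-based index first_line."""
    return list(islice(f, first_line, None))


sample = io.StringIO("a\nb\nc\nd\n")
print(lines_from(sample, 2))
```

Dropping the list() call gives a lazy iterator instead, which avoids holding the remaining lines in memory.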
Try this small Python program, LastLines.py:
import sys
def main():
    firstLine = int(sys.argv[1])
    lines = sys.stdin.read().splitlines()[firstLine:]
    for curLine in lines:
        print(curLine)
if __name__ == "__main__":
    main()
Example input, test1.txt:
a
b
c
d
Example usage:
python LastLines.py 2 < test1.txt
Example output:
c
d
This program assumes that the first line in a file is the 0th line.
I'm trying to take information from a file and turn it into a 2D list. My text file has this in it:
000001,375.99
000002,212.89
000003,175.12
000002,543.23
000003,1000.01
000001,10.0
000002,23.56
000003,5.65
000009,2.79
000009,1.79
000009,0.79
000008,3.79
000008,10.0
000008,11.1
My code can read the file but I get an error:
ValueError: could not convert string to float: '000001,375.99'
How do I exclude the commas when the code is reading it?
This is my code:
def loadExpensesData():
    exp = open('expense.dat','r')
    data = []
    for line in exp:
        num_strings = line.split()
        num = [float(n) for n in num_strings]
        data.append(numbers)
    exp.close()
    print(data)
loadExpensesData()
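line.split() with no argument splits on whitespace, so the comma stays inside the string. Split on the comma instead.
Change your lines
num_strings = line.split()
num = [float(n) for n in num_strings]
to:
num = [float(n) for n in line.split(',')]
Full code:
def loadExpensesData():
    exp = open('new.txt','r')
    data = []
    for line in exp:
        # a list comprehension, so each row is a real list and not a lazy map object
        data.append([float(n) for n in line.split(',')])
    exp.close()
    print(data)
loadExpensesData()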
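Another option is the standard csv module, which does the comma splitting for you. The sketch below makes two assumptions that are not in the question: the first column is kept as a string so that leading zeros such as 000001 survive, and blank lines are skipped. The function name load_expenses is made up for the example.

```python
import csv
import io


def load_expenses(f):
    """Parse 'id,amount' rows into [id, amount] pairs."""
    data = []
    for row in csv.reader(f):
        if not row:  # csv.reader returns an empty list for a blank line
            continue
        data.append([row[0], float(row[1])])
    return data


sample = io.StringIO("000001,375.99\n000002,212.89\n\n000009,0.79\n")
print(load_expenses(sample))
```

With a real file, open it with `open('expense.dat', newline='')` and pass the file object in, as the csv documentation recommends.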
def line_count(filename):
    for filename in os.walk(os.path.abspath('my directory filename')):
        lines = 0
        with open(filename) as file:
            lines = len([line for line in file.readlines() if line.strip() != ''])
            print lines
def find_big_files(files):
    file_sizes = [(line_count(file), file) for file in files]
    print sorted(file_sizes, key = lambda file_size: file_size[0], reverse = True)
sorted_files = find_big_files(file)
does not work.
Since you're looking for the LONGEST files, not the BIGGEST files, do this:
def get_length(file):
    len_ = 0
    with open(file,'r') as f:
        for line in f: len_+=1
    return len_
files = [file for file in however_you_build_your_list]
files = sorted(files, key=get_length, reverse=True)  # reverse=True puts the longest first
# files[0] is now the longest
# files[-1] is now the shortest
Are you counting empty lines as lines?
If so, the following gives you the number of raw lines in a file:
def line_count(filename):
    lines = 0
    with open(filename) as file:
        lines = len(file.readlines())
    return lines
If not, change the lines = ... line to:
        lines = len([line for line in file.readlines() if line.strip() != ''])
So, the rest of the code would look like the following:
def find_big_files(files):
    largest = (0, None)
    second_largest = (0, None)
    for file in files:
        size = line_count(file)
        if size > largest[0]:
            second_largest = largest
            largest = (size, file)
        elif size > second_largest[0]:
            # smaller than the largest, but larger than the current runner-up
            second_largest = (size, file)
    return largest, second_largest
Note that this is inefficient because it has to open every file and iterate over all of its lines, so the cost grows with the total number of lines across all files. But if line count is what you care about, there is no good way around that, at least for generic .txt files or similar.
If you want the whole list from most lines to least lines:
def find_big_files(files):
    file_sizes = [(line_count(file), file) for file in files]
    return sorted(file_sizes, key = lambda file_size: file_size[0])
A list of (line_count, file_name) tuples will be returned, and list[-1] will be the largest, list[-2] will be the second largest, and so on.
EDIT:
OP asked me to post the whole code in one block that solves the problem, so here it is:
def line_count(filename):
    lines = 0
    with open(filename) as file:
        lines = len([line for line in file.readlines() if line.strip() != ''])
    return lines
def find_big_files(files):
    file_sizes = [(line_count(file), file) for file in files]
    return sorted(file_sizes, key = lambda file_size: file_size[0], reverse = True)
The return value of result = find_big_files(files) will be [(count, filename), ...] from largest to smallest, so result[0] will be the largest, result[1] will be the second largest, etc. Ties will be in the original order they were in the input list of file paths.
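The question's code passes the result of os.walk straight to open(), which cannot work, because os.walk yields (dirpath, dirnames, filenames) tuples rather than file paths. Below is a rough sketch of how the list of paths could be built and then ranked. The names list_files, count_nonblank_lines and files_by_line_count are made up for this example, and errors='replace' is an assumption so that files in unexpected encodings do not abort the run.

```python
import os


def list_files(root):
    """Collect the full path of every file below root."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


def count_nonblank_lines(path):
    """Count lines that contain something other than whitespace."""
    with open(path, errors='replace') as f:
        return sum(1 for line in f if line.strip())


def files_by_line_count(root):
    """Return (count, path) tuples, most lines first."""
    counts = [(count_nonblank_lines(p), p) for p in list_files(root)]
    return sorted(counts, key=lambda item: item[0], reverse=True)
```

Calling files_by_line_count('some/directory') with a directory of your choice then gives the same kind of list as find_big_files above, without having to build the file list by hand.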