reading data from multiple lines as a single item - python

I have a set of data from a file as such
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"="gotwastedatthehouse"
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
[mattplayhouse\wherecanwego\tothepoolhall]
How can I read/reference the text for each "johnnyboy"=splice(23) entry as a single line, like this:
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
I am currently matching the regex based on splice(23): with a search as follows:
import re

re_johnny = re.compile('splice')
with open("file.txt", 'r') as file:
    read = file.readlines()
    for line in read:
        if re_johnny.match(line):
            print(line)
I think I need to remove the backslashes and the spaces to merge the lines, but I am unfamiliar with how to do that without picking up the blank lines or the line that does not match my regex. When I tried the first solution attempt, my last row was pulled in inappropriately. Any assistance would be great.

Input file: fin
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"="gotwastedatthehouse"
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
[mattplayhouse\wherecanwego\tothepoolhall]
Adding to tigerhawk's suggestion, you can try something like this:
Code:
import re

with open('fin', 'r') as f:
    for l in [''.join([b.strip('\\') for b in a.split()]) for a in f.read().split('\n\n')]:
        if 'splice' in l:
            print(l)
Output:
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00

With regex you have multiplied your problems. Instead, keep it simple:
If a line starts with ", it begins a record.
Else, append it to the previous record.
You can implement parsing for such a scheme in just a few lines in Python. And you don't need regex.
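A minimal sketch of that record-based idea (assuming the sample file.txt above; the trailing backslash is treated as the continuation marker so that the bracketed [mattplayhouse...] line is not glued onto the last record):
records = []
continuing = False
with open("file.txt", "r") as f:
    for raw in f:
        line = raw.rstrip("\n")
        if continuing and records:
            records[-1] += line.rstrip("\\")    # glue a continuation onto the previous record
        elif line.startswith('"'):
            records.append(line.rstrip("\\"))   # a line starting with " begins a new record
        continuing = line.endswith("\\")        # a trailing backslash means the record continues

for record in records:
    if "splice" in record:
        print(record)
With the sample data this prints the two splice(23) records as single lines, and no regex is needed.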

Related

How to split lines in python

I am looking for a simple way to split lines in Python from a .txt file and then just read out the names and compare them to another file.
I had code that split the lines successfully, but I couldn't find a way to read out just the names; unfortunately, the code that did the splitting was lost.
this is what the .txt file looks like.
Id;Name;Job;
1;James;IT;
2;Adam;Director;
3;Clare;Assisiant;
Example of the code I currently have (it doesn't output anything):
my_file = open("HP_liki.txt", "r")
flag = index = 0
x1 = ""
for line in my_file:
    line.strip().split('\n')
    index += 1
content = my_file.read()
list = []
lines_to_read = [index-1]
for position, line1 in enumerate(x1):
    if position in lines_to_read:
        list = line1
x1 = list.split(";")
print(x1[1])
I need a solution that doesn't import pandas or csv.
The first part of your code confuses me as to your purpose.
for line in my_file:
    line.strip().split('\n')
    index += 1
content = my_file.read()
Your for loop iterates through the file and strips each line. Then it splits on a newline, which cannot exist at that point: the loop already yields one line at a time, and strip() has removed the trailing newline.
In addition, once you've stripped the line you ignore the result, increment index, and move on to the next line. As a result, all this loop accomplishes is to count the lines in the file.
The line after the loop reads from a file object that has already been exhausted, so my_file.read() simply returns an empty string.
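A quick way to see this for yourself (using the same HP_liki.txt):
with open("HP_liki.txt", "r") as my_file:
    for line in my_file:
        pass                         # the loop consumes every line
    print(repr(my_file.read()))      # prints '' -- the file is already exhausted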
If you want the names from the file, then use the built-in file read to iterate through the file, split each line, and extract the second field:
name_list = [line.split(';')[1] for line in open("HP_liki.txt", "r")]
name_list also includes the header "Name", which you can easily delete.
Does that handle your problem?
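If you want to drop that header entry, a minimal sketch (assuming the header row is always the first line) is to slice it off:
name_list = [line.split(';')[1] for line in open("HP_liki.txt", "r")]
names_only = name_list[1:]   # drop the leading "Name" header entry
print(names_only)            # e.g. ['James', 'Adam', 'Clare']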
Without using any external library, you can use simple file I/O and then generalize it according to your needs.
readfile.py
file = open('datafile.txt', 'r')
for line in file:
    line_split = line.split(';')
    if line_split[0].isdigit():
        print(line_split[1])
file.close()
datafile.txt
Id;Name;Job;
1;James;IT;
2;Adam;Director;
3;Clare;Assisiant;
If you run this, you'll get the output:
James
Adam
Clare
You can change the if condition according to your needs.
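For example, a small variation (a sketch against the same datafile.txt) that prints only names whose Job column is IT:
file = open('datafile.txt', 'r')
for line in file:
    line_split = line.split(';')
    if line_split[0].isdigit() and line_split[2] == 'IT':
        print(line_split[1])   # prints only: James
file.close()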
I have my dataf.txt file:
Id;Name;Job;
1;James;IT;
2;Adam;Director;
3;Clare;Assisiant;
I have written this to extract information:
with open('dataf.txt', 'r') as fl:
    data = fl.readlines()

a = [i.replace('\n', '').split(';')[:-1] for i in data]
print(a[1:])
Outputs:
[['1', 'James', 'IT'], ['2', 'Adam', 'Director'], ['3', 'Clare', 'Assisiant']]

Concatenate lines with previous line based on number of letters in first column

New to coding and trying to figure out how to fix a broken csv file so that I can work with it properly.
So the file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that occasional notes have newlines in them, and the export tool does not wrap the note in quotation marks to mark it as a single field.
see below example:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3,4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re

with open('Rapp.txt', 'r') as f:
    for line in f:
        previous = line  # keep current line in variable to join next line
        if not re.match(r'^[A-Za-z]{3}', line):  # regex to match 3 letters
            print(previous.join(line))
The script shows no output, it just finishes silently. Any thoughts?
I think I would go a slightly different way:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            line = re.sub("\n", "", line)
        all_the_data = "".join([all_the_data, line])
print(all_the_data)
There are several ways to do this, each with pros and cons, but I think this keeps it simple.
Loop over the file as you have done, and if the line doesn't end in a date and a semicolon, take off the newline and stuff it into all_the_data. That way you don't have to play with looking back 'up' the file. Again, there are lots of ways to do this. If you would rather use the logic of "starts with 3 letters and a ;" and look back, this works:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"^[A-Za-z]{3};", line):
            all_the_data = re.sub(r"\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])
print("results:")
print(all_the_data)
Pretty much what was asked for. The logic being if the current line doesn't start right, take out the previous line's carriage return from all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com
The regex in your code matches every line in the file (each line happens to start with three letters), so the if condition is never true and hence nothing prints.
with open('./Rapp.txt', 'r') as f:
    join_words = []
    for line in f:
        line = line.strip()
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            print(';'.join(join_words))
            join_words = []
            join_words.append(line)
        else:
            join_words.append(line)
    print(";".join(join_words))
I've tried not to use regex here, to keep it a little clearer if possible. But regex is a better option.
A simple way would be to use a generator that acts as a filter on the original file. That filter would concatenate a line to the previous one if it does not have a semicolon (;) as its 4th character. Code could be:
def preprocess(fd):
    previous = next(fd)
    for line in fd:
        if line[3] == ';':
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous  # don't forget the last line!
You could then use:
import csv

with open('test.txt') as fd:
    rd = csv.DictReader(preprocess(fd), delimiter=';')  # the sample data is semicolon-delimited
    for row in rd:
        ...
The trick here is that the csv module only requires an object that returns a line each time the next() function is applied to it, so a generator is appropriate.
But this is only a workaround and the correct way would be that the previous step directly produces a correct CSV file.

Print out lines that begin with two different string outputs?

I am trying to scan an input file and print out parts of lines that begin with a certain string. The text file is 10000+ lines, but I am only concerned with the beginning of each such line, and more specifically the data within it. For clarification, here are two lines from the file that show what I mean:
inst "N69" "IOB",placed BIOB_X11Y0 R8 ,
inst "n0975" "SLICEX",placed CLEXL_X20Y5 SLICE_X32Y5 ,
Here is the code that I have gotten to so far:
searchfile = open(r"C:\PATH\TO\FILE.txt", "r")
for line in searchfile:
    if "inst " in line:
        print line
searchfile.close()
Now this is great if I am looking for all lines that start with "inst", but I am specifically looking for lines where the quoted name after inst starts with N or n. From there, I want to extract just that string.
My idea was to first extract those lines (as shown above) to a new .txt file, then run another script to get only the portions of the lines that have N or n. In the example above, I am only concerned with N69 and n0975. Is there an easier method of doing this?
Yes, with the re module:
re.finditer(r'^inst\s+"n(\d+)"', the_whole_file, re.I | re.M)
will return an iterator of all the matches (re.M is needed so that ^ matches at the start of every line, not just at the start of the file, and re.I makes the N/n case-insensitive).
For each match you then call .group(1) to get the numbers you wanted.
Notice that you don't need to filter the file first using this method. You can do this for the whole file.
The output in your case will be:
69
0975
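A runnable sketch of that approach, assuming the sample lines are saved in file.txt:
import re

with open("file.txt", "r") as f:
    the_whole_file = f.read()

for m in re.finditer(r'^inst\s+"n(\d+)"', the_whole_file, re.I | re.M):
    print(m.group(1))   # prints 69, then 0975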
With re.search() function:
Sample file.txt content:
inst "N69" "IOB",placed BIOB_X11Y0 R8 ,
some text
inst "n0975" "SLICEX",placed CLEXL_X20Y5 SLICE_X32Y5 ,
text
another text
import re

with open('file.txt', 'r') as f:
    for l in f.read().splitlines():
        m = re.search(r'^inst "([Nn][^"]+)"', l)
        if m:
            print(m.group(1))
The output:
N69
n0975
Here is one solution:
with open('nfile.txt', 'r') as f:
    for line in f:
        if line.startswith('inst "n') or line.startswith('inst "N'):
            print line.split()[1]
For each line in the file, the startswith part checks whether the line starts with one of your target patterns. If yes, it splits the line using split() and prints the second component, which is the part with n or N.
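Note that the second component still includes the surrounding double quotes (e.g. "N69"); if you want just N69 or n0975, you can strip them off, as in this small variation:
with open('nfile.txt', 'r') as f:
    for line in f:
        if line.startswith('inst "n') or line.startswith('inst "N'):
            print(line.split()[1].strip('"'))   # "N69" -> N69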

Python, Extracting 3 lines before and after a match

I am trying to figure out how to extract 3 lines before and after a matched word.
At the moment, my word is found. I wrote up some text to test my code. And, I figured out how to print three lines after my match.
But, I am having difficulty trying to figure out how to print three lines before the word, "secure".
Here is what I have so far:
from itertools import islice

with open("testdoc.txt", "r") as f:
    for line in f:
        if "secure" in line:
            print("".join(line))
            print("".join(islice(f, 3)))
Here is the text I created for testing:
----------------------------
This is a test to see
if i can extract information
using this code
I hope, I try,
maybe secure shell will save thee
Im adding extra lines to see my output
hoping that it comes out correctly
boy im tired, sleep is nice
until then, time will suffice
You need to buffer your lines so you can recall them. The simplest way is to just load all the lines into a list:
with open("testdoc.txt", "r") as f:
lines = f.readlines() # read all lines into a list
for index, line in enumerate(lines): # enumerate the list and loop through it
if "secure" in line: # check if the current line has your substring
print(line.rstrip()) # print the current line (stripped off whitespace)
print("".join(lines[max(0,index-3):index])) # print three lines preceeding it
But if you need maximum storage efficiency you can use a buffer to store the last 3 lines as you loop over the file line by line. A collections.deque is ideal for that.
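For reference, a sketch of that deque-based approach (same testdoc.txt, printing up to 3 lines before and after each match):
from collections import deque

with open("testdoc.txt", "r") as f:
    before = deque(maxlen=3)                 # holds at most the 3 previous lines
    for line in f:
        if "secure" in line:
            print("".join(before), end="")   # up to 3 lines before the match
            print(line, end="")              # the matching line itself
            for _ in range(3):               # and up to 3 lines after it
                nxt = next(f, None)
                if nxt is None:
                    break
                print(nxt, end="")
        before.append(line)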
I came up with this solution: just appending the previous lines to a list, and deleting the first one once the list holds 4 elements.
from itertools import islice

with open("testdoc.txt", "r") as f:
    linesBefore = list()
    for line in f:
        linesBefore.append(line.rstrip())
        if len(linesBefore) > 4:  # keep at most 4 lines (the current one plus 3 before)
            linesBefore.pop(0)
        if "secure" in line:
            if len(linesBefore) == 4:  # there are at least 3 lines before the match
                for i in range(3):
                    print(linesBefore[i])
            else:  # there are fewer than 3 lines before the match
                print('\n'.join(linesBefore[:-1]))  # everything buffered before the current line
            print(line.rstrip())
            print("".join(islice(f, 3)))

Referring to a list of names using Python

I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
datafile - a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The file is opened:
cont_list = open('dataT.txt','r')
a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a = genome.readline()
        s = line + a
        data_out = open('output.txt', 'a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "Contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you a tuple: (contig, genome):
def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}

with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        name = p[0].strip()       # file lines keep their trailing newline
        if name in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.discard(name)  # remove it from the set of contigs we still need
I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for
from itertools import izip_longest

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip())  # rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    for name, seq in izip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional. remove if you only want the seq
            outfile.write(seq)
Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = next(cont_list).strip()
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('dataT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches what you want (a desired contig), it grabs the next line and appends it to sequences after stripping out the unwanted whitespace characters.
From there, writing the grabbed sequences to an output file is pretty straightforward (see the sketch after the example output below).
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
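A minimal sketch of that last step, reusing get_sequences() from the snippet above (the output file name is just an example):
valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
sequences = get_sequences('dataT.txt', valid_contigs)

with open('sequences_out.txt', 'w') as out:
    for seq in sequences:
        out.write(seq + '\n')   # one sequence per line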
