How to split lines in Python
I am looking for a simple way to split lines in Python from a .txt file and then read out just the names and compare them to another file.
I had code that split the lines successfully, but I couldn't find a way to read out just the names; unfortunately, that working code was lost.
This is what the .txt file looks like:
Id;Name;Job;
1;James;IT;
2;Adam;Director;
3;Clare;Assisiant;
Example of the code I currently have (it doesn't output anything):
my_file = open("HP_liki.txt","r")
flag = index = 0
x1=""
for line in my_file:
    line.strip().split('\n')
    index+=1
content = my_file.read()
list=[]
lines_to_read = [index-1]
for position, line1 in enumerate(x1):
    if position in lines_to_read:
        list=line1
x1=list.split(";")
print(x1[1])
I need a solution that doesn't import pandas or csv.
The first part of your code confuses me as to your purpose.
for line in my_file:
    line.strip().split('\n')
    index+=1
content = my_file.read()
Your for loop iterates through the file and strips each line. It then splits on a newline, which cannot exist at that point: the for loop already yields the file one line at a time, so no line inside this loop contains an embedded newline to split on.
In addition, the result of that strip-and-split expression is ignored; the loop body only increments index. As a result, all this loop accomplishes is counting the lines in the file.
The read() after the loop runs on a file whose position is already at end-of-file, so it simply returns an empty string (no exception is raised) and content stays empty.
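A quick way to see this is a minimal sketch, using io.StringIO as a stand-in for the real file:

```python
import io

# An in-memory stand-in for the opened HP_liki.txt
my_file = io.StringIO("Id;Name;Job;\n1;James;IT;\n")

# Iterating consumes the file, leaving the position at end-of-file
for line in my_file:
    pass

# read() at EOF returns an empty string; no exception is raised
print(repr(my_file.read()))  # ''
```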
If you want the names from the file, then use the built-in file read to iterate through the file, split each line, and extract the second field:
name_list = [line.split(';')[1]
             for line in open("HP_liki.txt","r")]
name_list also includes the header "Name", which you can easily delete.
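For instance, a minimal sketch of dropping that header entry by slicing (an in-memory list of lines stands in for HP_liki.txt here):

```python
# Sample rows standing in for the contents of HP_liki.txt
rows = ["Id;Name;Job;", "1;James;IT;", "2;Adam;Director;", "3;Clare;Assisiant;"]

name_list = [line.split(';')[1] for line in rows]
print(name_list)      # ['Name', 'James', 'Adam', 'Clare']
print(name_list[1:])  # ['James', 'Adam', 'Clare'] -- header removed
```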
Does that handle your problem?
Without using any external library, you can use simple file I/O and then generalize according to your needs.
readfile.py
file = open('datafile.txt','r')
for line in file:
    line_split = line.split(';')
    if line_split[0].isdigit():
        print(line_split[1])
file.close()
datafile.txt
Id;Name;Job;
1;James;IT;
2;Adam;Director;
3;Clare;Assisiant;
If you run this, you'll get the output:
James
Adam
Clare
You can change the if condition according to your needs.
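For instance, a hypothetical variant of that condition which also checks the Job column (an in-memory list stands in for datafile.txt here):

```python
# Rows standing in for the contents of datafile.txt
lines = ["Id;Name;Job;", "1;James;IT;", "2;Adam;Director;", "3;Clare;Assisiant;"]

for line in lines:
    line_split = line.split(';')
    # keep only data rows (numeric Id) whose Job column is 'IT'
    if line_split[0].isdigit() and line_split[2] == 'IT':
        print(line_split[1])  # prints: James
```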
I have my dataf.txt file:
Id;Name;Job;
1;James;IT;
2;Adam;Director;
3;Clare;Assisiant;
I have written this to extract information:
with open('dataf.txt','r') as fl:
    data = fl.readlines()

a = [i.replace('\n','').split(';')[:-1] for i in data]
print(a[1:])
Outputs:
[['1', 'James', 'IT'], ['2', 'Adam', 'Director'], ['3', 'Clare', 'Assisiant']]
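If you later want to look up the columns by name, the same parsed rows can be zipped with the header into dictionaries (a sketch based on the output above):

```python
# Parsed rows as produced above, header row included
rows = [['Id', 'Name', 'Job'],
        ['1', 'James', 'IT'],
        ['2', 'Adam', 'Director'],
        ['3', 'Clare', 'Assisiant']]

header = rows[0]
records = [dict(zip(header, row)) for row in rows[1:]]
print(records[0]['Name'])  # James
print(records[1]['Job'])   # Director
```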
Related
Using Regex to search a plaintext file line by line and cherry pick lines based on matches
I'm trying to read a plaintext file line by line, cherry-pick lines that begin with a pattern of any six digits, pass those to a list, and then write that list row by row to a .csv file. Here's an example of a line I'm trying to match in the file:

000003 ANW2248_08_DESOLATE-WASTELAND-3. A9 C 00:55:25:17 00:55:47:12 10:00:00:00 10:00:21:20

And here is a link to two images, one showing the above line in context of the rest of the file and the expected result: https://imgur.com/a/XHjt9e1

import re
import csv

identifier = re.compile(r'^(\d\d\d\d\d\d)')
matched_line = []

with open('file.edl', 'r') as file:
    reader = csv.reader(file)
    for line in reader:
        line = str(line)
        if identifier.search(line) == True:
            matched_line.append(line)
        else:
            continue

with open('file.csv', 'w') as outputEDL:
    print('Copying EDL contents into .csv file for reformatting...')
    outputEDL.write(str(matched_line))

Expected result: the reader gets to a line, searches using the regex, and if the search finds the series of six numbers at the beginning, it appends that entire line to the matched_line list. What I'm actually getting, once I write what the reader has read to a .csv file, is just [], so the regex search obviously isn't functioning properly in the way I've written this code. Any tips on how to better form it to achieve what I'm trying to do would be greatly appreciated. Thank you.
Some more examples of expected input/output would better help with solving this problem, but from what I can see, you are trying to write each line within a text file that contains a timestamp to a csv. In that case, here is some pseudo-code that might help you solve your problem, as well as a separate regex match function to make your code more readable:

import re

def match_time(line):
    pattern = re.compile(r'(?:\d+[:]\d+[:]\d+[:]\d+)+')
    result = pattern.findall(line)
    return " ".join(result)

This will return a string of the entire timecode if a match is found.

lines = []
with open('yourfile.txt', 'r') as txtfile:
    with open('yourfile.csv', 'w') as csvfile:
        for line in txtfile:
            res = match_time(line)  # alternatively you can test if res in line, which might be better
            if res != "":
                lines.append(line)
        for item in lines:
            csvfile.write(item)

Opens a text file for reading; if a line contains a timecode, appends the line to a list, then iterates that list and writes each line to the csv.
reading data from multiple lines as a single item
I have a set of data from a file as such:

"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00

"johnnyboy"="gotwastedatthehouse"

"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00

[mattplayhouse\wherecanwego\tothepoolhall]

How can I read/reference the text per "johnnyboy"=splice(23) as a single line, as such:

"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00

I am currently matching the regex based on splice(23): with a search as follows:

import re

re_johnny = re.compile('splice')
with open("file.txt", 'r') as file:
    read = file.readlines()
    for line in read:
        if re_johnny.match(line):
            print(line)

I think I need to remove the backslashes and the spaces to merge the lines, but I am unfamiliar with how to do that without picking up the blank lines or the new line that is not like my regex. When trying the first solution attempt, my last row was pulled inappropriately. Any assistance would be great.
Input file: fin

"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00

"johnnyboy"="gotwastedatthehouse"

"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00

[mattplayhouse\wherecanwego\tothepoolhall]

Adding to tigerhawk's suggestion, you can try something like this:

Code:

with open('fin', 'r') as f:
    for l in [''.join([b.strip('\\') for b in a.split()]) for a in f.read().split('\n\n')]:
        if 'splice' in l:
            print(l)

Output:

"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
With regex you have multiplied your problems. Instead, keep it simple: if a line starts with ", it begins a record; otherwise, append it to the previous record. You can implement parsing for such a scheme in just a few lines of Python, and you don't need regex.
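A minimal sketch of that scheme, assuming the layout shown above (the sample lines here are abbreviated stand-ins for the real file):

```python
# Abbreviated sample: a continuation line ends with a backslash,
# and a new record starts with a double quote
sample = [
    '"johnnyboy"=splice(23):15,00,30,00,\\',
    '31,00,32,02',
    '"johnnyboy"="gotwastedatthehouse"',
]

records = []
for raw in sample:
    line = raw.strip().rstrip('\\')   # drop whitespace and the trailing continuation backslash
    if line.startswith('"'):
        records.append(line)          # a quoted key begins a new record
    elif records:
        records[-1] += line           # otherwise glue onto the previous record

print(records[0])  # "johnnyboy"=splice(23):15,00,30,00,31,00,32,02
```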
Read in every line that starts with a certain character from a file
I am trying to read in every line in a file that starts with an 'X:'. I don't want to read the 'X:' itself, just the rest of the line that follows.

with open("hnr1.abc","r") as file:
    f = file.read()
    id = []
    for line in f:
        if line.startswith("X:"):
            id.append(f.line[2:])
print(id)

It doesn't have any errors, but it doesn't print anything out.
try this:

with open("hnr1.abc","r") as fi:
    id = []
    for ln in fi:
        if ln.startswith("X:"):
            id.append(ln[2:])
print(id)

Don't use names like file or line. Note the append just uses the item name, not as an attribute of the file. By pre-reading the file into memory, your for loop was accessing the data by character, not by line.
for line in f:
    search = line.split()
    if search[0] == "X":
        storagearray.extend(search)

That should give you an array of all the lines you want, but they'll be split into separate words. Also, you'll need to have defined storagearray before we call it in the above block of code. It's an inelegant solution, as I'm a learner myself, but it should do the job!

edit: If you want to output the lines, simply use Python's inbuilt print function:

print(storagearray)
- Read every line in the file (for loop)
- Select lines that contain X:
- Slice the line from index 0, with the starting chars/string X: = ln[0:]
- Print lines that begin with X:

for ln in input_file:
    if ln.startswith('X:'):
        X_ln = ln[0:]
        print(X_ln)
Refering to a list of names using Python
I am new to Python, so please bear with me. I can't get this little script to work properly:

genome = open('refT.txt','r')

datafile - a reference genome with a bunch (2 million) of contigs:

Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA

The file is opened:

cont_list = open('dataT.txt','r')

a list of contigs that I want to extract from the dataset listed above:

Contig_01
Contig_02
Contig_03
Contig_05

My hopeless script:

for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a=genome.readline()
        s=line+a
        data_out = open ('output.txt','a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')

The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "Contig_04", which is not in the list, and move on to "Contig_05". I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you a tuple (contig, genome):

def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)

Now, I would use that to get the desired elements:

wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}

with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        name = p[0].strip()
        if name in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.discard(name)
I would recommend several things:

- Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
- Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.

Here's some example code that might do what you're looking for:

from itertools import izip_longest

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip())  # rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    for name, seq in izip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional. remove if you only want the seq
            outfile.write(seq)
Here's a pretty compact approach to get the sequences you'd like:

def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = next(cont_list).strip()
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('dataT.txt', valid_contigs)
    print(sequences)

This utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches what you want (a desired contig), it will grab the next line and append it to sequences after stripping out the unwanted whitespace characters. From there, writing the grabbed sequences to an output file is pretty straightforward.

Example output:

['TGCAGGTAAAAAACTGTCACCTGCTGGT', 'TGCAGGTCTTCCCACTTTATGATCCCTTA', 'TGCAGTGTGTCACTGGCCAAGCCCAGCGC', 'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
Elegant way to skip first line when using python fileinput module?
Is there an elegant way of skipping the first line of a file when using the Python fileinput module? I have a data file with nicely formatted data, but the first line is a header. Using fileinput, I would have to include a check and discard the line if it does not seem to contain data; the problem is that the same check would then be applied to every remaining line. With read() you can open the file, read the first line, then loop over the rest of the file. Is there a similar trick with fileinput? Is there an elegant way to skip processing of the first line?

Example code:

import fileinput

# how to skip the first line elegantly?
for line in fileinput.input(["file.dat"]):
    data = process_line(line)
    output(data)
lines = iter(fileinput.input(["file.dat"]))
next(lines)  # extract and discard first line
for line in lines:
    data = process_line(line)
    output(data)

or use the itertools.islice way if you prefer:

import itertools

finput = fileinput.input(["file.dat"])
lines = itertools.islice(finput, 1, None)  # cuts off first line

dataset = (process_line(line) for line in lines)
results = [output(data) for data in dataset]

Since everything used is a generator or iterator, no intermediate list will be built.
The fileinput module contains a bunch of handy functions, one of which seems to do exactly what you're looking for:

for line in fileinput.input(["file.dat"]):
    if not fileinput.isfirstline():
        data = process_line(line)
        output(data)

fileinput module documentation
It's right in the docs: http://docs.python.org/library/fileinput.html#fileinput.isfirstline
One option is to use openhook:

The openhook, when given, must be a function that takes two arguments, filename and mode, and returns an accordingly opened file-like object. You cannot use inplace and openhook together.

One could create a helper function skip_header and use it as openhook, something like:

import fileinput

files = ['file_1', 'file_2']

def skip_header(filename, mode):
    f = open(filename, mode)
    next(f)
    return f

for line in fileinput.input(files=files, openhook=skip_header):
    # do something
    pass
Do two loops, where the first one calls break immediately:

with fileinput.input(files=files, mode='rU', inplace=True) as f:
    for line in f:
        # add print() here if you only want to empty the line
        break
    for line in f:
        process(line)

Let's say you want to remove or empty all of the first 5 lines:

with fileinput.input(files=files, mode='rU', inplace=True) as f:
    for line in f:
        # add print() here if you only want to empty the first 5 lines
        if f.filelineno() == 5:
            break
    for line in f:
        process(line)

But if you only want to get rid of the first line, just use next before the loop to remove it:

with fileinput.input(files=files, mode='rU', inplace=True) as f:
    next(f)
    for line in f:
        process(line)
with open(file) as j:            # open file as j
    for i in j.readlines()[1:]:  # start reading j from the second line
        ...