Splitting / Slicing Text File with Python - python

I'm learning Python; I've been trying to split this txt file into multiple files, grouped by a sliced string at the beginning of each line.
Currently I have two issues:
1 - The string can have 5 or 6 chars and is marked by a space at the end (as in WSON33 and JHSF3, etc.).
Here is an example of the file I would like to split (the first line is a header):
H24/06/202000003TORDISTD
BWSON33 0803805000000000016400000003250C000002980002415324C1 0000000000000000
BJHSF3 0804608800000000003500000000715V000020280000031810C1 0000000000000000
2 - I've come up with a lot of code, but I'm not able to put everything together so this can work.
I adapted the code below from another post, and it kind of works, breaking the input into multiple files, but it requires sorting the lines before I start writing files, and I also need to copy the header into each file rather than isolating it in its own file.
import itertools

with open('tordist.txt', 'r') as fin:
    # group each line in the input file by the first part of the split
    for i, (k, g) in enumerate(itertools.groupby(fin, lambda l: l.split()[0]), 1):
        # create a file to write to, suffixed with the group number (start = 1)
        with open('{0} tordist.txt'.format(i), 'w') as fout:
            # write each line in the group to the file
            for line in g:
                fout.write(line.strip() + '\n')

So from what I can gather, you have a text file with many lines, where every line begins with a short string of five or six characters. It sounds like you want all the lines that begin with the same string to go into the same file, so that after the code is run you have as many new files as there are unique starting strings. Is that accurate?
Like you, I'm fairly new to Python, so I'm sure there are more compact ways to do this. The code below loops through the file a number of times and creates the new files in the same folder as your text and Python files.
# Code which separates the lines in a file by an identifier,
# and makes a new file for each identifier group.
filename = input('type filename')
if len(filename) < 1:
    filename = "mk_newfiles.txt"
filehandle = open(filename)

# This chunk loops through the file, looking at the beginning of each line,
# and adding it to a list of identifiers if it is not on the list already.
Unique = list()
for line in filehandle:
    # like Lalit said, split() is a simple way to separate a longer string
    line = line.split()
    if line[0] not in Unique:
        Unique.append(line[0])

# For each item in the list of identifiers, this code goes through
# the file, and if a line starts with that identifier it is
# added to a new file.
for item in Unique:
    # this 'if' skips the header, which has a '/' in it
    if '/' not in item:
        # .seek(0) 'rewinds' the file object, which is needed
        # when looping through a file multiple times
        filehandle.seek(0)
        # make a new file
        newfile = open(str(item) + ".txt", "w")
        # insert the header, then go to the next line
        newfile.write(Unique[0])
        newfile.write('\n')
        # go through the old file, and add the relevant lines to the new file
        for line in filehandle:
            split_line = line.split()
            if item == split_line[0]:
                newfile.write(line)
        newfile.close()
print(Unique)
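For what it's worth, a single-pass version is also possible: keep a dictionary of open output files, one per identifier, and write each line as you read it. This is only a sketch (split_by_prefix is a made-up name); it assumes the first line is the header and that every other line starts with the identifier followed by a space:

```python
def split_by_prefix(path):
    """Split `path` into one file per line prefix, copying the header into each.

    Assumes the first line is the header and every other line starts with
    an identifier followed by a space.
    """
    outputs = {}
    with open(path) as fin:
        header = next(fin)
        for line in fin:
            key = line.split()[0]
            if key not in outputs:
                # first time this identifier appears: open its file, write the header
                outputs[key] = open(key + '.txt', 'w')
                outputs[key].write(header)
            outputs[key].write(line)
    for f in outputs.values():
        f.close()
```

This reads the input once instead of once per identifier, and names each output file after its identifier rather than a group number.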

Related

script to cat every other (even) line in a set of files together while leaving the odd lines unchanged

I have a set of three .fasta files of standardized format. Each one begins with a string that acts as a header on line 1, followed by a long string of nucleotides on line 2, where the header string denotes the animal the nucleotide sequence came from. There are 14 such header/sequence pairs altogether, for a total of 28 lines, and each of the three files has the headers in the same order. A snippet of one of the files is included below as an example, with the sequences shortened for clarity.
anas-crecca-crecca_KSW4951-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
anas-crecca-crecca_KGM021-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
anas-crecca-crecca_KGM020-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
What I would like to do is write a script or program that cats each of the strings of nucleotides together, but keeps them in the same position. My knowledge, however, is limited to rudimentary python, and I'd appreciate any help or tips someone could give me.
Try this:
data = ""
with open('filename.fasta') as f:
    i = 0
    for line in f:
        i = i + 1
        if i % 2 == 0:
            data = data + line[:-1]
# Copy and paste the block above for each file,
# replacing filename with the actual name.
print(data)
Remember to replace "filename.fasta" with your actual file name!
How it works
Variable i acts as a line counter; when it is even, i % 2 is zero and the line is concatenated to the data string. This way, the odd (header) lines are ignored.
The [:-1] on the data line removes the line break, allowing all sequences to be joined on a single line.
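If you would rather not copy and paste the block once per file, the files can be read in lockstep with zip. This is only a sketch (cat_sequences is a made-up name), and it assumes every file alternates header and sequence lines in the same order:

```python
def cat_sequences(filenames):
    """Return header/concatenated-sequence lines from aligned fasta files.

    Assumes every file alternates header and sequence lines, with the
    headers in the same order across all files.
    """
    handles = [open(name) for name in filenames]
    result = []
    # zip yields the same-numbered line from every file, in lockstep
    for i, lines in enumerate(zip(*handles)):
        if i % 2 == 0:
            # header lines: the files share the order, so keep the first file's
            result.append(lines[0].rstrip('\n'))
        else:
            # sequence lines: join them end to end, dropping the line breaks
            result.append(''.join(l.rstrip('\n') for l in lines))
    for h in handles:
        h.close()
    return result
```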

For loop repeats the same line when writing to output file

I have a folder of 295 text files, each containing a couple of rows of data that I need to condense. I want to open one file, grab two lines from it, combine those two lines into another text file I've created, then close the data file and repeat for the next.
I currently have a for loop that mostly does that, but the problem is that it copies the same text from the first file 295 times. How do I get it to move on to the next item in the file list? This is my first program in Python, so I'm pretty new.
My code:
import os

filelist = os.listdir('/colors')
colorlines = []
for x in filelist:
    with open('/colors/' + x, 'rt') as colorfile:  # opens a single txt file for reading text
        for colorline in colorfile:  # puts each line of the txt file into the list
            colorlines.append(colorline.rstrip('\n'))  # removes the trailing newline
    tup = (colorlines[1], colorlines[3])  # combines the second and fourth lines into a tuple
    str = ''.join(tup)  # joins the tuple into a string with no space between the two
    print(str)
    newtext = open("colorcode_rework.txt", "a")  # opens the output file for the reworked data
    newtext.write(str + '\n')  # writes the string and appends a newline
    newtext.close()
You need to reset the colorlines list for each file. As you are indexing specific items in the list (1 and 3), you always get the same ones, even though more items have been appended.
To reset the colorlines list for each file:
for x in filelist:
    colorlines = []
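Putting the fix back into the loop from the question, a sketch (rework_colors is a made-up name; the line indices 1 and 3 come from the original code, and the folder path is a parameter here so it isn't hard-coded):

```python
import os

def rework_colors(folder, outpath):
    """Append the 2nd and 4th line of every file in `folder` to `outpath`."""
    with open(outpath, 'a') as newtext:
        for x in sorted(os.listdir(folder)):
            colorlines = []  # reset for every file, so the indices stay correct
            with open(os.path.join(folder, x), 'rt') as colorfile:
                for colorline in colorfile:
                    colorlines.append(colorline.rstrip('\n'))
            newtext.write(colorlines[1] + colorlines[3] + '\n')
```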

python merge files by rules

I need to write a script in Python that accepts and merges 2 files into a new file according to the following rules:
1) Take 1 word from the 1st file followed by 2 words from the second file.
2) When we reach the end of one file, copy the rest of the other file to the merged file without change.
I wrote a script, but I only managed to read 1 word from each file.
A complete script would be nice, but I really want to understand, in words, how I can do this on my own.
This is what I wrote:
def exercise3(file1, file2):
    lstFile1 = readFile(file1)
    lstFile2 = readFile(file2)
    with open("mergedFile", 'w') as outfile:
        merged = [j for i in zip(lstFile1, lstFile2) for j in i]
        for word in merged:
            outfile.write(word)

def readFile(filename):
    lines = []
    with open(filename) as file:
        for line in file:
            line = line.strip()
            for word in line.split():
                lines.append(word)
    return lines
Your immediate problem is that zip alternates items from the iterables you give it: in short, it's a 1:1 mapping, where you need 1:2. Try this:
lstFile2a = lstFile2[0::2]
lstFile2b = lstFile2[1::2]
... zip(lstFile1, lstFile2a, lstFile2b)
This is a bit inefficient, but gets the job done.
Another way is to zip up pairs (2-tuples) in lstFile2 before zipping it with lstFile1. A third way is to forget zipping altogether, and run your own indexing:
for i in range(min(len(lstFile1), len(lstFile2) // 2)):
    outfile.write(lstFile1[i])
    outfile.write(lstFile2[2 * i])
    outfile.write(lstFile2[2 * i + 1])
However, this leaves you with the leftovers of the longer file to handle.
These aren't particularly elegant, but they should get you moving.
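For completeness, here is one way to fold the leftover handling in as well. merge_words is a made-up helper name; it assumes the inputs are flat lists of words, as produced by readFile above:

```python
def merge_words(words1, words2):
    """Interleave one word from words1 with two from words2; when either
    list runs out, append the rest of the other unchanged."""
    merged = []
    i = j = 0
    while i < len(words1) and j + 1 < len(words2):
        merged.append(words1[i])
        merged.append(words2[j])
        merged.append(words2[j + 1])
        i += 1
        j += 2
    # one of the lists ran out: copy the leftovers without change
    merged.extend(words1[i:])
    merged.extend(words2[j:])
    return merged
```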

Read first not empty line from file and store in XML

I need to create a script that collects data about the files in a certain folder and stores it in an XML file, which is created during the process.
I am stuck at the point where the first sentence of each file should be stored in the firstsentence element.
The files have one sentence per line, but some start with empty lines; for those files, the first non-empty line should be stored.
This is my attempt:
first_sent = et.SubElement(file_corpus, 'firstsentence')
text = open(filename, 'rU')
first_sent.text = text.readline()  # this line was before the if!
if text.readline() != '':
    print text.readline()
    first_sent.text = text.readline()
Currently it only stores some (random) sentence, and only for very few files.
You're calling text.readline() again instead of checking the value previously read, so every call consumes another line. You also need a loop in order to skip all the blank lines. Note that a blank line reads as '\n', not '', so it has to be stripped before the comparison; readline() returns '' only at end of file.
Something resembling this should work:
first_sent = et.SubElement(file_corpus, 'firstsentence')
text = open(filename, 'rU')
line = text.readline()
while line != '' and line.strip() == '':
    line = text.readline()
first_sent.text = line
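The skip can also be written with a generator expression and next(), which returns the first line that is not blank, or a default when the file contains none (first_nonblank_line is a made-up helper name):

```python
def first_nonblank_line(path):
    """Return the first non-blank line of the file, or None if there is none."""
    with open(path) as f:
        # the generator yields only non-blank lines; next() takes the first,
        # and None is the default when the generator is exhausted
        return next((line.strip() for line in f if line.strip()), None)
```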

Deleting Relative Lines with Regex

Using pdftotext, a text file was created that includes footers from the source pdf. The footers get in the way of other parsing that needs to be done. The format of the footer is as follows:
This is important text.
9
Title 2012 and 2013
\fCompany
Important text begins again.
The line for Company is the only one that does not recur elsewhere in the file. It appears as \x0cCompany\n. I would like to search for this line and remove it and the preceding three lines (the page number, title, and a blank line) based on where the \x0cCompany\n appears. This is what I have so far:
import re

report = open('file.txt').readlines()
data = range(len(report))
name = []
for line_i in data:
    line = report[line_i]
    if re.match('.*\\x0cCompany', line):
        name.append(report[line_i])
print name
This allows me to make a list storing which line numbers have this occurrence, but I do not understand how to delete these lines as well as the three preceding lines. It seems I need to create some other loop based on this loop but I cannot make it work.
Instead of iterating through and collecting the indices of the lines you want to delete, iterate through your lines and append only the ones that you want to keep.
It would also be more efficient to iterate over the file object itself, rather than reading it all into one list first:
import re

keeplines = []
with open('file.txt') as b:
    for line in b:
        if re.match('.*\\x0cCompany', line):
            keeplines = keeplines[:-3]  # shave off the three preceding lines
        else:
            keeplines.append(line)

with open('file.txt', 'w') as outfile:
    for line in keeplines:
        outfile.write(line)
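Alternatively, the whole removal can be done in one pass with re.sub on the full text, using a pattern that matches the footer line together with the three lines before it. This is a sketch (strip_footers is a made-up name; \x0c is the form-feed character that pdftotext inserts at page breaks):

```python
import re

def strip_footers(text):
    """Remove every '\x0cCompany' line and the three lines preceding it."""
    # (?:[^\n]*\n){3} matches the three preceding lines (blank line,
    # page number, title); \x0cCompany\n matches the footer line itself.
    return re.sub(r'(?:[^\n]*\n){3}\x0cCompany\n', '', text)
```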
