python merge files by rules

I need to write a script in Python that accepts two files and merges them into a new file according to the following rules:
1) Take 1 word from the first file followed by 2 words from the second file.
2) When we reach the end of one file, copy the rest of the other file to the merged file unchanged.
I wrote the script below, but I only managed to read 1 word from each file at a time.
A complete script would be nice, but I really want to understand, in words, how I can do this on my own.
This is what I wrote:
def exercise3(file1, file2):
    lstFile1 = readFile(file1)
    lstFile2 = readFile(file2)
    with open("mergedFile", 'w') as outfile:
        merged = [j for i in zip(lstFile1, lstFile2) for j in i]
        for word in merged:
            outfile.write(word)

def readFile(filename):
    lines = []
    with open(filename) as file:
        for line in file:
            line = line.strip()
            for word in line.split():
                lines.append(word)
    return lines

Your immediate problem is that zip alternates items from the iterables you give it: in short, it's a 1:1 mapping, where you need 1:2. Try this:
lstFile2a = lstFile2[0::2]
lstFile2b = lstFile2[1::2]
... zip(lstFile1, lstFile2a, lstFile2b)
This is a bit inefficient, but gets the job done.
Another way is to zip up pairs (2-tuples) in lstFile2 before zipping it with lstFile1. A third way is to forget zipping altogether, and run your own indexing:
for i in range(min(len(lstFile1), len(lstFile2) // 2)):
    outfile.write(lstFile1[i])
    outfile.write(lstFile2[2 * i])
    outfile.write(lstFile2[2 * i + 1])
However, this leaves you with the leftovers of the longer file to handle.
These aren't particularly elegant, but they should get you moving.
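If it helps to see the pieces put together, here is a rough sketch of the 1:2 interleave with the leftovers copied at the end. It reuses the readFile helper from your post and assumes the output words should be separated by single spaces (your original wrote them with no separator), so treat it as a starting point rather than the definitive answer:
import itertools

def merge_1_to_2(file1, file2):
    words1 = iter(readFile(file1))      # readFile as defined in the question
    words2 = iter(readFile(file2))
    merged = []
    while True:
        take1 = list(itertools.islice(words1, 1))   # 1 word from the first file
        take2 = list(itertools.islice(words2, 2))   # 2 words from the second file
        merged.extend(take1 + take2)
        if not take1 or len(take2) < 2:
            # one file ran out: copy whatever remains of the other file unchanged
            merged.extend(words1)
            merged.extend(words2)
            break
    with open("mergedFile", 'w') as outfile:
        outfile.write(" ".join(merged))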

Related

Splitting / Slicing Text File with Python

I'm learning Python; I've been trying to split this txt file into multiple files, grouped by a sliced string at the beginning of each line.
Currently I have two issues:
1 - The string can have 5 or 6 chars and is marked by a space at the end (as in WSON33 and JHSF3 etc...).
Here is an example of the file I would like to split (the first line is a header):
H24/06/202000003TORDISTD
BWSON33 0803805000000000016400000003250C000002980002415324C1 0000000000000000
BJHSF3 0804608800000000003500000000715V000020280000031810C1 0000000000000000
2 - I've come up with a lot of code, but I'm not able to put everything together so this can work:
This code here I adapted from another post, and it kind of works, breaking the input into multiple files, but it requires sorting the lines before I start writing files; I also need to copy the header into each file and not isolate it in its own file.
import itertools

with open('tordist.txt', 'r') as fin:
    # group each line in input file by first part of split
    for i, (k, g) in enumerate(itertools.groupby(fin, lambda l: l.split()[0]), 1):
        # create file to write to suffixed with group number - start = 1
        with open('{0} tordist.txt'.format(i), 'w') as fout:
            # for each line in group write it to file
            for line in g:
                fout.write(line.strip() + '\n')
So from what I can gather, you have a text file with many lines, where every line begins with a short string of 5 or 6 characters. It sounds like you want all the lines that begin with the same string to go into the same file, so that after the code is run you have as many new files as there are unique starting strings. Is that accurate?
Like you, I'm fairly new to Python, so I'm sure there are more compact ways to do this. The code below loops through the file a number of times and makes new files in the same folder as your text and Python files.
# code which separates lines in a file by an identifier,
# and makes new files for each identifier group
filename = input('type filename')
if len(filename) < 1:
    filename = "mk_newfiles.txt"
filehandle = open(filename)

# This chunk loops through the file, looking at the beginning of each line,
# and adding it to a list of identifiers if it is not on the list already.
Unique = list()
for line in filehandle:
    # like Lalit said, split is a simple way to separate a longer string
    line = line.split()
    if line[0] not in Unique:
        Unique.append(line[0])

# For each item in the list of identifiers, this code goes through
# the file, and if a line starts with that identifier then it is
# added to a new file.
for item in Unique:
    # this 'if' skips the header, which has a '/' in it
    if '/' not in item:
        # the .seek(0) 'rewinds' the file object, which is apparently
        # needed if looping through a file multiple times
        filehandle.seek(0)
        # makes new file
        newfile = open(str(item) + ".txt", "w+")
        # inserts header, and goes to next line
        newfile.write(Unique[0])
        newfile.write('\n')
        # goes through old file, and adds relevant lines to new file
        for line in filehandle:
            split_line = line.split()
            if item == split_line[0]:
                newfile.write(line)
print(Unique)
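As an aside, a single-pass alternative is to keep a dictionary of open output files keyed by the identifier, writing each line as soon as it is read. This is only a sketch, not tested against your real data; it assumes every data line starts with the identifier as its first whitespace-separated token, and the output filenames are made up:
# single-pass sketch: one output file per identifier, header copied into each
output_files = {}
with open('tordist.txt', 'r') as fin:
    header = next(fin)                       # first line is the header
    for line in fin:
        key = line.split()[0]                # identifier at the start of the line
        if key not in output_files:
            out = open(key + '.txt', 'w')    # hypothetical naming scheme
            out.write(header)
            output_files[key] = out
        output_files[key].write(line)
for out in output_files.values():
    out.close()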

Replacing string with id using dictionary in python

I have a dictionary file that contains a word in each line.
titles-sorted.txt
a&a
a&b
a&c_bus
a&e
a&f
a&m
....
For each word, its line number is the word's id.
Then I have another file that contains a set of words separated by tab in each line.
a.txt
a_15 a_15_highway_(sri_lanka) a_15_motorway a_15_motorway_(germany) a_15_road_(sri_lanka)
I'd like to replace all of the words by their ids if they exist in the dictionary, so that the output looks like:
3454 2345 123 5436 322 ....
So I wrote the following Python code to do this:
f = open("titles-sorted.txt")
lines = f.readlines()
titlemap = {}
nr = 1
for l in lines:
l = l.replace("\n", "")
titlemap[l.lower()] = nr
nr+=1
fw = open("a.index", "w")
f = open("a.txt")
lines = f.readlines()
for l in lines:
tokens = l.split("\t")
if tokens[0] in titlemap.keys():
fw.write(str(titlemap[tokens[0]]) + "\t")
for t in tokens[1:]:
if t in titlemap.keys():
fw.write(str(titlemap[t]) + "\t")
fw.write("\n")
fw.close()
f.close()
But this code is ridiculously slow, which makes me wonder whether I have done everything right.
Is this an efficient way to do this?
The write loop contains a lot of calls to write, and many small writes are usually inefficient. You can probably speed things up by writing only once per line (or once per file if the file is small enough):
tokens = l.split("\t")
fw.write('\t'.join(fw.write(str(titlemap[t])) for t in tokens if t in titlemap)
fw.write("\n")
or even:
lines = []
for l in f:
    lines.append('\t'.join(str(titlemap[t]) for t in l.split('\t') if t in titlemap))
fw.write('\n'.join(lines))
Also, if your tokens are used more than once, you can save time by converting them to strings when you read them:
titlemap = {l.strip().lower(): str(index) for index, l in enumerate(f, start=1)}
So, I suspect this differs based on the operating system you're running on and the specific Python implementation (someone wiser than I may be able to provide some clarity here), but I have a suspicion about what is going on:
Every time you call write, some amount of your desired write request gets written to a buffer, and once the buffer is full, this information is written to file. The file needs to be fetched from your hard disk (as it doesn't exist in main memory), so your computer pauses while it waits for the several milliseconds it takes to fetch the block from the hard disk and write to it. On the other hand, the parsing of the string and the lookup in your hashmap take only a couple of nanoseconds, so you spend a lot of time waiting for the write request to finish!
Instead of writing immediately, what if you kept a list of the lines that you wanted to write and then wrote them all at the end, in a row? Or, if you're handling a huge file that will exceed the capacity of your main memory, write out what you have accumulated once you have parsed a certain number of lines.
This allows the writing to disk to be optimized, as you can write multiple blocks at a time (again, this depends on how Python and the operating system handle the write call).
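As a rough illustration of that batching idea (this is not the poster's code; the batch size is arbitrary and it reuses the titlemap dictionary built earlier):
BATCH_SIZE = 10000                 # flush after this many lines; tune as needed
buffered = []
with open("a.txt") as f, open("a.index", "w") as fw:
    for line in f:
        tokens = line.rstrip("\n").split("\t")
        # str() in case titlemap still holds integer ids
        ids = [str(titlemap[t]) for t in tokens if t in titlemap]
        buffered.append("\t".join(ids))
        if len(buffered) >= BATCH_SIZE:
            fw.write("\n".join(buffered) + "\n")
            buffered = []
    if buffered:
        fw.write("\n".join(buffered) + "\n")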
If we apply the suggestions so far and clean up your code some more (e.g. remove unnecessary .keys() calls), is the following still too slow for your needs?
title_map = {}
token_file = open("titles-sorted.txt")
for number, line in enumerate(token_file):
title_map[line.rstrip().lower()] = str(number + 1)
token_file.close()
input_file = open("a.txt")
output_file = open("a.index", "w")
for line in input_file:
tokens = line.split("\t")
if tokens[0] in title_map:
output_list = [title_map[tokens[0]]]
output_list.extend(title_map[token] for token in tokens[1:] if token in title_map)
output_file.write("\t".join(output_list) + "\n")
output_file.close()
input_file.close()
If it's still too slow, give us slightly more data to work with including an estimate of the number of lines in each of your two input files.

Read multiple files with fileinput at a certain line

I have multiple files which I need to open and read (I thought it might be easier with fileinput.input()). Those files contain irrelevant information at the very beginning; what I need is all the information below this specific line: ID[tab]NAME[tab]GEO[tab]FEATURE (sometimes from line 32, but unfortunately sometimes at any other line). I then want to store those rows in a list ("entries").
ID[tab]NAME[tab]GEO[tab]FEATURE
1 aa us A1
2 bb ko B1
3 cc ve C1
.
.
.
Now, instead of always reading from line 32 (see code below), I would like to start reading from the header line shown above. Is it possible to do this with fileinput, or am I going about it the wrong way? Is there a simpler way to do this? Here is my code so far:
import fileinput

entries = list()
for line in fileinput.input():
    if fileinput.filelineno() > 32:
        entries.append(line.strip().split("\t"))
I'm trying to implement this idea with Python 3.2
UPDATE:
Here is how my code looks now, but I still get an index-out-of-range error. I need to add some of the entries to a dictionary. Am I missing something?
import collections
import fileinput

filelist = fileinput.input()
entries = []
for fn in filelist:
    for line in fn:
        if line.strip() == "ID\tNAME\tGEO\tFEATURE":
            break
    entries.extend(line.strip().split("\t") for line in fn)

dic = collections.defaultdict(set)
for e in entries:
    dic[e[1]].add(e[3])
Error:
dic[e[1]].add(e[3])
IndexError: list index out of range
Just iterate through the file looking for the marker line and add everything after that to the list.
EDIT: Your second problem happens because not all of the lines in the original file split to at least 4 fields. A blank line, for instance, results in an empty list, so e[1] is invalid. I've updated the example with a nested generator expression that filters out lines that are not the right size. You may want to do something different (maybe strip empty lines but otherwise assert that the remaining lines must split to exactly 4 columns), but you get the idea.
entries = []
for fn in filelist:          # filelist is assumed to be a list of filenames, e.g. sys.argv[1:]
    with open(fn) as fp:
        for line in fp:
            if line.strip() == 'ID\tNAME\tGEO\tFEATURE':
                break
        #entries.extend(line.strip().split('\t') for line in fp)
        entries.extend(items for items in (line.strip().split('\t') for line in fp) if len(items) >= 4)
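With entries built that way, the dictionary step from the question should then work; a minimal sketch assuming the four-column layout ID, NAME, GEO, FEATURE shown above:
import collections

dic = collections.defaultdict(set)
for e in entries:
    # e is [ID, NAME, GEO, FEATURE]: map each NAME to the set of its FEATUREs
    dic[e[1]].add(e[3])

# with the sample data this would give something like {'aa': {'A1'}, 'bb': {'B1'}, 'cc': {'C1'}}
print(dict(dic))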

Referring to a list of names using Python

I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
datafile - a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The file is opened:
cont_list = open('dataT.txt','r')
a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a = genome.readline()
        s = line + a
        data_out = open('output.txt', 'a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "Contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you a tuple (contig name, sequence):
def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}
with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        if p[0].strip() in wanted:        # strip the trailing newline before comparing
            # write to output file, store in a list, or dict, ...
            wanted.discard(p[0].strip())  # sets have discard/remove, not forget
I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for
from itertools import izip_longest  # Python 2; on Python 3 use itertools.zip_longest

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip())  # rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    for name, seq in izip_longest(*[genomefile] * 2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional; remove if you only want the seq
            outfile.write(seq)
Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = next(cont_list).strip()   # next() works on Python 2 and 3
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('dataT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches one of the desired contigs, it grabs the next line and appends it to sequences after stripping out the unwanted whitespace characters.
From there, writing the sequences grabbed to an output file is pretty straightforward.
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
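And for completeness, writing the grabbed sequences out to a file might look something like this (the output filename is just an example):
with open('output.txt', 'w') as data_out:
    for seq in sequences:
        data_out.write(seq + '\n')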

Removing unwanted characters in each line of a file then matching what is left to another file in Python

I would like to write a python script that addresses the following problem:
I have two tab-separated files. One has just one column with a variety of words. The other file has one column that contains similar words, as well as columns of other information. However, within the first file, some lines contain multiple words, separated by " /// ". The other file has a similar setup, but the separator is " | ".
File #1
RED
BLUE /// GREEN
YELLOW /// PINK /// PURPLE
ORANGE
BROWN /// BLACK
File #2 (Which contains additional columns of other measurements)
RED|PINK
ORANGE
BROWN|BLACK|GREEN|PURPLE
YELLOW|MAGENTA
I want to parse through each file and match the words that are the same, and then append the columns of additional measurements too. But I want to ignore the /// in the first file, and the | in the second, so that each word will be compared to the other list on its own. The output file should have just one column of any words that appear in both lists, and then the appended additional information from file 2. Any help??
Additional info / update:
Here are 8 lines of File #1. I used color names above to keep it simple, but this is what the words really are; these are the "symbols":
ANKRD38
ANKRD57
ANKRD57
ANXA8 /// ANXA8L1 /// ANXA8L2
AOF1
AOF2
AP1GBP1
APOBEC3F /// APOBEC3G
Here is one line of file #2. What I need to do is take each symbol from file1 and see if it matches any one of the "synonyms" found in column 5 of file2 (here the synonyms are A1B|ABG|GAB|HYST2477). If any symbol from file1 matches ANY of the synonyms from column 5 of file2, then I need to append the additional information (the other columns in file2) onto the symbol from file1 and create one big output file.
9606 '\t' 1 '\t' A1BG '\t' - '\t' A1B|ABG|GAB|HYST2477'\t' HGNC:5|MIM:138670|Ensembl:ENSG00000121410|HPRD:00726 '\t' 19 '\t' 19q13.4'\t' alpha-1-B glycoprotein '\t' protein-coding '\t' A1BG'\t' alpha-1-B glycoprotein'\t' O '\t' alpha-1B-glycoprotein '\t' 20120726
File2 is 22,000 KB; file1 is much smaller. I have thought of creating a dict much like what has been suggested, but I keep getting held up by the different separators in each of the files. Thank you all for the questions and help so far.
EDIT
After your comments below, I think this is what you want to do. I've left the original post below in case anything in that was useful to you.
So, I think you want to do the following. Firstly, this code will read every separate synonym from file1 into a set - this is a useful structure because it will automatically remove any duplicates, and is very fast to look things up. It's like a dictionary but with only keys, no values. If you don't want to remove duplicates, we'll need to change things slightly.
file1_data = set()
with open("file1.txt", "r") as fd:
    for line in fd:
        file1_data.update(i.strip() for i in line.split("///") if i.strip())
Then you want to run through file2 looking for matches:
with open("file2.txt", "r") as in_fd:
with open("output.txt", "w") as out_fd:
for line in in_fd:
items = line.split("\t")
if len(items) < 5:
# This is so we don't crash if we find a line that's too short
continue
synonyms = set(i.strip() for i in items[4].split("|"))
overlap = synonyms & file1_data
if overlap:
# Build string of columns from file2, stripping out 5th column.
output_str = "\t".join(items[:4] + items[5:])
for item in overlap:
out_fd.write("\t".join((item, output_str)))
So what this does is open file2 and an output file. It goes through each line in file2, and first checks it has enough columns to at least have a column 5 - if not, it ignores that line (you might want to print an error).
Then it splits column 5 by | and builds a set from that list (called synonyms). The set is useful because we can find the intersection of this with the previous set of all the synonyms from file1 very fast - this intersection is stored in overlap.
What we do then is check if there was any overlap - if not, we ignore this line because no synonym was found in file1. This check is mostly for speed, so we don't bother building the output string if we're not going to use it for this line.
If there was an overlap, we build a string which is the full list of columns we're going to append to the synonym - we can build this as a string once even if there's multiple matches because it's the same for each match, because it all comes from the line in file2. This is faster than building it as a string each time.
Then, for each synonym that matched in file1, we write to the output a line which is the synonym, then a tab, then the rest of the line from file2. Because we split by tabs we have to put them back in with "\t".join(...). This is assuming I am correct you want to remove column 5 - if you do not want to remove it, then it's even easier because you can just use the line from file2 having stripped off the newline at the end.
Hopefully that's closer to what you need?
ORIGINAL POST
You don't give any indication of the size of the files, but I'm going to assume they're small enough to fit into memory - if not, your problem becomes slightly trickier.
So, the first step is probably to open file #2 and read in the data. You can do it with code something like this:
file2_data = {}
with open("file2.txt", "r") as fd:
for line in fd:
items = line.split("\t")
file2_data[frozenset(i.strip() for i in items[0].split("|"))] = items[1:]
This will create file2_data as a dictionary which maps the set of words from the first column on to a list of the remaining items on that line. You should also consider whether words can repeat and how you wish to handle that, as I mentioned in my earlier comment.
After this, you can then read the first file and attach the data to each word in that file:
with open("file1.txt", "r") as fd:
with open("output.txt", "w") as fd_out:
for line in fd:
words = set(i.strip() for i in line.split("///"))
for file2_words, file2_cols in file2_data.iteritems():
overlap = file2_words & words
if overlap:
fd_out.write("///".join(overlap) + "\t" + "\t".join(file2_cols))
What you should end up with is each row in output.txt being one where the list of words in the two files had at least one word in common and the first item is the words in common separated by ///. The other columns in that output file will be the other columns from the matched row in file #2.
If that's not what you want, you'll need to be a little more specific.
As an aside, there are probably more efficient ways to do this than the O(N^2) approach I outlined above (i.e. it runs across one entire file as many times as there are rows in the other), but that requires more detailed information on how you want to match the lines.
For example, you could construct a dictionary mapping a word to a list of the rows in which that word occurs - this makes it a lot faster to check for matching rows than the complete scan performed above. This is rendered slightly fiddly by the fact you seem to want the overlaps between the rows, however, so I thought the simple approach outlined above would be sufficient without more specifics.
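A rough sketch of that word-to-rows index, in case it is useful; it assumes the layout from the code above, where the |-separated words are in the first column of file2 and the remaining columns should be carried along (names are illustrative):
from collections import defaultdict

# Build an index: each individual word -> list of column-lists from file2
word_to_rows = defaultdict(list)
with open("file2.txt", "r") as fd:
    for line in fd:
        items = line.rstrip("\n").split("\t")
        for word in (i.strip() for i in items[0].split("|")):
            word_to_rows[word].append(items[1:])

# Each word from file1 is now a single dictionary lookup instead of a full scan
with open("file1.txt", "r") as fd, open("output.txt", "w") as fd_out:
    for line in fd:
        for word in (i.strip() for i in line.split("///")):
            for cols in word_to_rows.get(word, []):
                fd_out.write(word + "\t" + "\t".join(cols) + "\n")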
Look at http://docs.python.org/2/tutorial/inputoutput.html for file i/o
Loop through each line in each file
file1set = set(file1line.split(' /// '))
file2set = set(file2line.split('|'))
wordsineach = list(file1set & file2set)
split will create an array of the color names
set() turns it into a set so we can easily compare differences in each line
Loop over 'wordsineach' and write to your new file
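Put together, that outline might look roughly like the following; pairing lines from the two files one-to-one is an assumption here, so adjust the looping to however you actually want the lines compared:
with open('file1.txt') as f1, open('file2.txt') as f2, open('matches.txt', 'w') as out:
    for file1line, file2line in zip(f1, f2):
        file1set = set(file1line.strip().split(' /// '))
        file2set = set(file2line.strip().split('|'))
        wordsineach = list(file1set & file2set)
        for word in wordsineach:
            out.write(word + '\n')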
Use the str.replace function
with open('file1.txt', 'r') as f1:
    content1 = f1.read()
    content1 = content1.replace(' /// ', '\n').split('\n')
with open('file2.txt', 'r') as f2:
    content2 = f2.read()
    content2 = content2.replace('|', '\n').split('\n')
Then use a list comprehension
common_words = [i for i in content1 if i in content2]
However, if you already know that no word is repeated within the same file, you can use a set intersection to make life easier:
common_words = list(set(content1) & set(content2))
Then to output the remainder to another file:
common_words = [i + '\n' for i in common_words]  # so that we print each word on a new line
with open('common_words.txt', 'w') as f:
    f.writelines(common_words)
As to your 'additional information', I cannot help you unless you tell us how it is formatted, etc.
