Combining two files' data into one list - python

I am new to python and have only started working with files. I am wondering how to combine the data of two files into one list using list comprehension to read and combine them.
#for instance, line 1 of galaxies = I
#line 1 of cycles = 0
#output = ['I0'] (list)
This is what I have so far. Thanks in advance!
comlist =[line in open('galaxies.txt') and line in open('cycles.txt')]
Update:
comlist = [mylist.append(gline[i]+cline[i]) for i in range(r)]
However, this only returns a list of None values, because list.append returns None.

Like this:
# from itertools import chain
def chainer(*iterables):
    # chain('ABC', 'DEF') --> A B C D E F
    for it in iterables:
        for element in it:
            yield element

comlist = list(chainer(open('galaxies.txt'), open('cycles.txt')))
print(comlist)
Although leaving files open like that isn't generally considered a good practice.

You can use zip to combine iterables
https://docs.python.org/3/library/functions.html#zip
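For example, a minimal sketch using the filenames from the question, assuming you want each pair of lines concatenated as in the expected ['I0'] output:

with open('galaxies.txt') as f1, open('cycles.txt') as f2:
    # zip pairs up line 1 with line 1, line 2 with line 2, and so on
    comlist = [gline.strip() + cline.strip() for gline, cline in zip(f1, f2)]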

If it's only 2 files, why do you want to use a comprehension at all? Something like this would be easier:
[l for l in open('galaxies.txt')]+[l for l in open('cycles.txt')]
The question is, what if you had n files? Let's say in a list: fileList = ['f1.txt', 'f2.txt', ..., 'fn.txt']. Then you may consider itertools.chain:
import itertools as it

filePointers = [open(f) for f in fileList]
lines = list(it.chain(*filePointers))  # chain takes the iterables unpacked
for fp in filePointers:
    fp.close()
I haven't tested it, but this should work ...

f1 = open('galaxies.txt')
f2 = open('cycles.txt')
If you want to combine them by alternating the lines, use zip and comprehension:
comlist = [line for two_lines in zip(f1, f2) for line in two_lines]
You need two iterations here because the return value from zip is itself an iterable, in this case consisting of two lines, one from f1 and one from f2. You can combine two iterations in a single comprehension as shown.
If you want to combine them one after the other, use "+" for concatenation:
comlist = [line for line in f1] + [line for line in f2]
In both cases, it's a good practice to close each file:
f1.close()
f2.close()
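Or, to have the files closed for you automatically, the same zip comprehension works inside a with block (a sketch with the same filenames):

with open('galaxies.txt') as f1, open('cycles.txt') as f2:
    # both files are closed automatically when the block exits
    comlist = [line for two_lines in zip(f1, f2) for line in two_lines]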

You can achieve your task with lambda and map:
I assume the data in in_file (the first file) looks like this:
1 2
3 4
5 6
7 8
And the data in in_file2 (the second file) looks like this:
hello there!
And with this piece of code:
# file 1
a = "in_file"
# file 2
b = "in_file2"
f = lambda x, y: (open(x, 'r'), open(y, 'r'))
# replacing "\n" with an empty string
data = [k for k in map(lambda x: x.read().replace("\n", ""), f(a, b))]
print(data)
The output will be:
['1 23 45 67 8', 'hello there!']
However, it's not good practice to leave files open like this.

Using only list comprehensions:
[line for file in (open('galaxies.txt'), open('cycles.txt')) for line in file]
However, it is bad practice to leave files open and hope the GC cleans them up. You should really do something like this:
import fileinput

with fileinput.input(files=('galaxies.txt', 'cycles.txt')) as f:
    comlist = f.readlines()
If you want to strip end of line characters a good way is with line.rstrip('\r\n').
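For example, to combine the files while stripping line endings as you go (same filenames as above):

import fileinput

with fileinput.input(files=('galaxies.txt', 'cycles.txt')) as f:
    # rstrip('\r\n') removes the end-of-line characters from each line
    comlist = [line.rstrip('\r\n') for line in f]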

Related

Python - Combine 2 rows into 1

I have a csv file which is a list as seen here:
A
B
C
D
E
F
And I would like to transform it into a list with pair like this:
AB
CD
EF
What is the simplest way to achieve this?
An alternative approach using itertools.islice, it will avoid reading the whole file at once:
import csv
from itertools import islice

CHUNK = 2

def chunks():
    with open("test.csv", newline="") as f:
        reader = csv.reader(f)
        while chunk := tuple(islice(reader, CHUNK)):
            yield "".join(*zip(*chunk))

def main():
    print(list(chunks()))

if __name__ == "__main__":
    main()
Note:
The walrus operator (:=) is available since Python 3.8; in previous versions you'll need something like this:
chunk = tuple(islice(reader, CHUNK))
while chunk:
    yield "".join(*zip(*chunk))
    chunk = tuple(islice(reader, CHUNK))
The easiest way is probably to put each line of your file in a list and then build a list of half the size containing your pairs. Since your .csv file appears to have only one column, the file format doesn't really matter.
Now, I assume that you have the file eggs.csv in the same directory as your Python file:
A
B
C
D
E
F
The following code produces the expected output:
output_lines = []
with open('eggs.csv', 'r') as file:
    for first, second in zip(file, file):
        output_lines.append(f'{first.strip()}{second.strip()}')
If you execute this code and print output_lines, you will get
['AB', 'CD', 'EF']
This works because both arguments to zip are the same file iterator, so each pair consumes two consecutive lines. Note that if the number of lines is odd, the last line will simply be ignored. I don't know the desired behavior, so I just assumed this, but you can easily change the code.
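If you instead wanted to keep an odd trailing line rather than drop it, itertools.zip_longest is one option (a sketch under that assumption):

from itertools import zip_longest

with open('eggs.csv') as file:
    # both arguments are the same iterator, so each tuple holds two consecutive
    # lines; fillvalue='' pads the final tuple when the line count is odd
    output_lines = [f'{a.strip()}{b.strip()}'
                    for a, b in zip_longest(file, file, fillvalue='')]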

Concatenate row values using python

I am new to Python and have been stuck on this topic for 2 days. I tried looking for a basic answer but couldn't find one, so I finally decided to come up with my question.
I want to concatenate the values of only the first two rows of my csv file (if possible with the help of built-in modules).
Any kind of help would be appreciated. Thanks in advance
Below is my sample csv file without headers:
1,Suraj,Bangalore
2,Ahuja,Karnataka
3,Rishabh,Bangalore
Desired Output:
1 2,Suraj Ahuja,Bangalore Karnataka
3,Rishabh,Bangalore
Just create a csv.reader object (and a csv.writer object). Then use next() on the first 2 rows and zip them together (using a list comprehension) to match up the items.
Then process the rest of the file normally.
import csv

with open("file.csv") as fr, open("output.csv", "w", newline='') as fw:
    cr = csv.reader(fr)
    cw = csv.writer(fw)
    title_row = [" ".join(z) for z in zip(next(cr), next(cr))]
    cw.writerow(title_row)
    # dump the rest as-is
    cw.writerows(cr)
(you'll get an exception if the file has only 1 row of course)
You can use zip() for your first 2 lines like below:
with open('f.csv') as f:
    lines = f.readlines()

res = ""
for i, j in zip(lines[0].strip().split(','), lines[1].strip().split(',')):
    res += "{} {},".format(i, j)
print(res.rstrip(','))

for line in lines[2:]:
    print(line.strip())
Output:
1 2,Suraj Ahuja,Bangalore Karnataka
3,Rishabh,Bangalore
with open('file2', 'r') as f, open('file2_new', 'w') as f2:
    lines = [a.split(',') for a in f.read().splitlines() if a.strip()]
    newl2 = [[' '.join(x) for x in zip(lines[0], lines[1])]] + lines[2:]
    for a in newl2:
        f2.write(', '.join(a) + '\n')

How to sort line by line in multi files in Python?

I have some text files that I want to read file by file and line by line, then sort and write to one file in Python. For example:
file 1:
C
D
E
file 2:
1
2
3
4
file 3:
#
$
*
File 4,.......
The result should be like this sequence in one file:
C
1
#
D
2
$
E
3
*
C
4
#
D
1
#
You can use a list of iterators over your files. You then need to constantly cycle through these iterators until one of the files has been consumed. You could use a while loop, or, as shown here, itertools.cycle:
import glob
import itertools

fs = glob.glob("./*.txt")  # use the glob module to get files matching a pattern
fits = [open(i, "r") for i in fs]  # create a list of file iterators

with open("blah", "w") as out:
    for f in itertools.cycle(fits):  # loop over the list until one file is consumed
        try:
            l = next(f).split(" ")
            s = sorted(l)
            out.write(" ".join(s) + "\n")
            print(s)
        except StopIteration:  # once a file is exhausted, next(f) raises StopIteration and stops the loop
            break
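If all the files are known to have the same number of lines, a simpler sketch with zip also interleaves them (filenames here are placeholders):

with open('f1.txt') as a, open('f2.txt') as b, open('f3.txt') as c, open('out.txt', 'w') as out:
    for lines in zip(a, b, c):  # one line from each file per iteration
        for line in lines:
            out.write(line)

Note that zip stops at the shortest file, much like the cycle-based loop above stops when one file is consumed.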
This looks related to another question you asked (and then deleted).
I'm assuming you want to be able to read a file, create generators, combine generators, sort the output of generators, then write to a file.
Using yield to form your generator makes life a lot easier.
Keep in mind, to sort every line like this, you will have to store it in memory. If dealing with very large files, you will need to handle this in a more memory-conscious way.
First, let's make your generator that opens a file and reads line-by-line:
def line_gen(file_name):
    with open(file_name, 'r') as f:
        for line in f.readlines():
            yield line
Then let's "merge" the generators, by creating a generator which will iterate through each one in order.
def merge_gens(*gens):
    for gen in gens:
        for x in gen:
            yield x
Then we can create our generators:
gen1 = line_gen('f1.txt')
gen2 = line_gen('f2.txt')
Combine them:
comb_gen = merge_gens(gen1, gen2)
Create a list from the generator. (This is the potentially-memory-intensive step.):
itered_list = [x for x in comb_gen]
Sort the list:
sorted_list = sorted(itered_list)
Write to the file:
with open('f3.txt', 'w') as f:
    for line in sorted_list:
        f.write(line)  # lines from line_gen already end with '\n'
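As the memory caveat above suggests, this approach holds every line in memory at once. If the individual input files were already sorted, heapq.merge could merge them lazily instead (a sketch, with placeholder filenames):

import heapq

with open('f1.txt') as a, open('f2.txt') as b, open('f3.txt', 'w') as out:
    # heapq.merge lazily yields lines from two already-sorted iterables in order
    out.writelines(heapq.merge(a, b))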

get non-matching line numbers python

Hi, I wrote some simple code in Python to do the following:
I have two files summarizing genomic data. The first file has the names of loci I want to get rid of; it looks something like this:
File_1:
R000002
R000003
R000006
The second file has the names and positions of all my loci and looks like this:
File_2:
R000001 1
R000001 2
R000001 3
R000002 10
R000002 2
R000002 3
R000003 20
R000003 3
R000004 1
R000004 20
R000004 4
R000005 2
R000005 3
R000006 10
R000006 11
R000006 123
What I wish to do is get all the corresponding line numbers of loci from File2 that are not in File1, so the end result should look like this:
Result:
1
2
3
9
10
11
12
13
I wrote the following simple code and it gets the job done
#!/usr/bin/env python
import sys

File1 = sys.argv[1]
File2 = sys.argv[2]

F1 = open(File1).readlines()
F2 = open(File2).readlines()
F3 = open(File2 + '.np', 'w')

Loci = []
for line in F1:
    Loci.append(line.strip())

for x, y in enumerate(F2):
    y2 = y.strip().split()
    if y2[0] not in Loci:
        F3.write(str(x + 1) + '\n')
However, when I run this on my real data set, where the first file has 58470 lines and the second file has 12881010 lines, it seems to take forever. I am guessing that the bottleneck is the
if y2[0] not in Loci:
check, where the code has to search through the whole Loci list for every line of File_2, but I have not been able to find a speedier solution.
Can anybody help me out and show a more Pythonic way of doing things?
Thanks in advance
Here's some slightly more Pythonic code that doesn't care if your files are ordered. I'd prefer to just print everything out and redirect it to a file ./myscript.py > outfile.txt, but you could also pass in another filename and write to that.
#!/usr/bin/env python
import sys

ignore_f = sys.argv[1]
loci_f = sys.argv[2]

with open(ignore_f) as f:
    ignore = set(x.strip() for x in f)

with open(loci_f) as f:
    for n, line in enumerate(f, start=1):
        if line.split()[0] not in ignore:
            print(n)
Searching for something in a list is O(n), while it takes only O(1) on average for a set. If order doesn't matter and the items are unique and hashable, use a set over a list. While this isn't optimal, it should be O(n) instead of O(n × m) like your code.
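As a quick illustration of the difference (sizes taken from the question, names hypothetical):

import timeit

loci_list = ['R%06d' % n for n in range(58470)]
loci_set = set(loci_list)

print(timeit.timeit(lambda: 'R999999' in loci_list, number=1000))  # scans the whole list each time
print(timeit.timeit(lambda: 'R999999' in loci_set, number=1000))   # single hash lookup each time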
You're also not closing your files, which isn't that big of a deal when reading, but matters when writing. I use context managers (with) so Python does that for me.
Style-wise, use descriptive variable names and avoid UpperCase names; those are typically used for classes (see PEP 8).
If your files are ordered, you can step through them together, ignoring lines where the loci names are the same, then when they differ, take another step in your ignore file and recheck.
To make the searching for matches more efficient you can simply use a set instead of list:
Loci = set()
for line in F1:
    Loci.add(line.strip())
The rest should work the same, but faster.
Even more efficient would be to walk down the files in a sort of lockstep, since they're both sorted, but that will require more code and maybe isn't necessary.
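For illustration, here is a rough sketch of that lockstep idea, assuming both files are sorted by locus name (the function and variable names are hypothetical, and boundary handling is simplified):

def non_matching_line_numbers(ignore_path, loci_path):
    with open(ignore_path) as ig, open(loci_path) as loci:
        ignore_iter = iter(ig)
        current = next(ignore_iter, None)
        current = current.strip() if current is not None else None
        for n, line in enumerate(loci, start=1):
            name = line.split()[0]
            # advance the ignore file until it catches up with this locus
            while current is not None and current < name:
                nxt = next(ignore_iter, None)
                current = nxt.strip() if nxt is not None else None
            if name != current:
                yield n

This only ever holds one line from each file in memory, but it relies entirely on both files being sorted the same way.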

python: quickest way to split a file into two files randomly

What is the quickest way to split a file into two files, each having half of the number of lines of the original file, such that the lines in each of the two files are in random order?
for example: if the file is
1
2
3
4
5
6
7
8
9
10
it could be split into:
3
2
10
9
1
4
6
8
5
7
This sort of operation is often called "partition". Although there isn't a built-in partition function, I found this article: Partition in Python.
Given that definition, you can do this:
import random

def partition(l, pred):
    yes, no = [], []
    for e in l:
        if pred(e):
            yes.append(e)
        else:
            no.append(e)
    return yes, no

lines = open("file.txt").readlines()
lines1, lines2 = partition(lines, lambda x: random.random() < 0.5)
Note that this won't necessarily exactly split the file in two, but it will on average.
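If an exact 50/50 split is required, one sketch is to sample half of the line indices instead (untested, same filename):

import random

lines = open('file.txt').readlines()
# pick exactly half of the line indices at random
chosen = set(random.sample(range(len(lines)), len(lines) // 2))
lines1 = [line for i, line in enumerate(lines) if i in chosen]
lines2 = [line for i, line in enumerate(lines) if i not in chosen]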
You can just load the file, call random.shuffle on the resulting list, and then split it into two files (untested code):
def shuffle_split(infilename, outfilename1, outfilename2):
    from random import shuffle

    with open(infilename, 'r') as f:
        lines = f.readlines()

    # append a newline in case the last line didn't end with one
    lines[-1] = lines[-1].rstrip('\n') + '\n'

    shuffle(lines)

    with open(outfilename1, 'w') as f:
        f.writelines(lines[:len(lines) // 2])
    with open(outfilename2, 'w') as f:
        f.writelines(lines[len(lines) // 2:])
random.shuffle shuffles lines in-place, and pretty much does all the work here. Python's sequence indexing system (e.g. lines[len(lines) // 2:]) makes things really convenient.
I'm assuming that the file isn't huge, i.e. that it will fit comfortably in memory. If that's not the case, you'll need to do something a bit more fancy, probably using the linecache module to read random line numbers from your input file. I think probably you would want to generate two lists of line numbers, using a similar technique to what's shown above.
Update: changed / to // to avoid issues when __future__.division is enabled.
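For what it's worth, a rough sketch of the linecache idea mentioned above (untested; it assumes the line count is known or counted in a cheap first pass, and note that linecache itself caches file contents in memory):

import linecache
import random

def shuffle_split_big(infilename, outfilename1, outfilename2, num_lines):
    order = list(range(1, num_lines + 1))  # linecache line numbers are 1-based
    random.shuffle(order)
    half = num_lines // 2
    with open(outfilename1, 'w') as f:
        for n in order[:half]:
            f.write(linecache.getline(infilename, n))
    with open(outfilename2, 'w') as f:
        for n in order[half:]:
            f.write(linecache.getline(infilename, n))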
import random

data = open("file").readlines()
random.shuffle(data)
c = 1
f = open("test." + str(c), "w")
for n, i in enumerate(data):
    if n == len(data) // 2:
        c += 1
        f.close()
        f = open("test." + str(c), "w")
    f.write(i)
f.close()
Another version:
from random import shuffle

def shuffle_split(infilename, outfilename1, outfilename2):
    with open(infilename, 'r') as f:
        lines = f.read().splitlines()

    shuffle(lines)
    half_lines = len(lines) // 2

    with open(outfilename1, 'w') as f:
        f.write('\n'.join(lines.pop() for count in range(half_lines)))
    with open(outfilename2, 'w') as f:
        f.writelines('\n'.join(lines))
