Read multiple files with fileinput at a certain line - python

I have multiple files which I need to open and read (I thought it might be easier with fileinput.input()). The files start with irrelevant information; what I need is everything below the specific line ID[tab]NAME[tab]GEO[tab]FEATURE (sometimes from line 32, but unfortunately sometimes at any other line), which I then want to store in a list ("entries"):
ID[tab]NAME[tab]GEO[tab]FEATURE
1 aa us A1
2 bb ko B1
3 cc ve C1
.
.
.
Now, instead of reading from line 32 (see code below), I would like to read from that marker line onwards. Is it possible to do this with fileinput, or am I going about it the wrong way? Is there another, simpler way to do this? Here is my code so far:
import fileinput

entries = list()
for line in fileinput.input():
    if fileinput.filelineno() > 32:
        entries.append(line.strip().split("\t"))
I'm trying to implement this idea with Python 3.2
UPDATE:
Here is how my code looks now, but I still get an index-out-of-range error. I need to add some of the entries to a dictionary. Am I missing something?
import collections
import fileinput

filelist = fileinput.input()
entries = []
for fn in filelist:
    for line in fn:
        if line.strip() == "ID\tNAME\tGEO\tFEATURE":
            break
    entries.extend(line.strip().split("\t") for line in fn)
dic = collections.defaultdict(set)
for e in entries:
    dic[e[1]].add(e[3])
Error:
dic[e[1]].add(e[3])
IndexError: list index out of range

Just iterate through the file looking for the marker line and add everything after that to the list.
EDIT Your second problem happens because not all of the lines in the original file split into at least 4 fields. A blank line, for instance, results in an empty list, so e[1] is invalid. I've updated the example with a nested generator expression that filters out lines that are not the right size. You may want to do something different (maybe strip empty lines but otherwise assert that the remaining lines must split into exactly 4 columns), but you get the idea.
entries = []
for fn in filelist:
    with open(fn) as fp:
        for line in fp:
            if line.strip() == 'ID\tNAME\tGEO\tFEATURE':
                break
        #entries.extend(line.strip().split('\t') for line in fp)
        entries.extend(items for items in (line.strip().split('\t') for line in fp) if len(items) >= 4)
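To answer the original fileinput question: the same marker scan works there too. Here is a minimal sketch, assuming the files are given on the command line and that every file contains the marker line; fileinput.isfirstline() is used to reset the state at each file boundary:
import fileinput

entries = []
in_data = False
for line in fileinput.input():
    if fileinput.isfirstline():
        in_data = False               # reset at the start of each new file
    if not in_data:
        if line.strip() == 'ID\tNAME\tGEO\tFEATURE':
            in_data = True            # the data starts on the next line
        continue
    items = line.strip().split('\t')
    if len(items) >= 4:               # same size filter as above
        entries.append(items)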

Related

Splitting / Slicing Text File with Python

I'm learning Python; I've been trying to split this txt file into multiple files, grouped by a sliced string at the beginning of each line.
Currently I have two issues:
1 - The string can have 5 or 6 chars and is marked by a space at the end (as in WSON33 and JHSF3, etc.).
Here is an example of the file I would like to split (the first line is a header):
H24/06/202000003TORDISTD
BWSON33 0803805000000000016400000003250C000002980002415324C1 0000000000000000
BJHSF3 0804608800000000003500000000715V000020280000031810C1 0000000000000000
2 - I've come up with a lot of code, but I'm not able to put everything together so that it works:
I adapted the code below from another post, and it kind of works, breaking the input into multiple files, but it requires the lines to be sorted before I start writing files. I also need to copy the header into each file rather than isolating it in one file.
import itertools

with open('tordist.txt', 'r') as fin:
    # group each line in the input file by the first part of the split
    for i, (k, g) in enumerate(itertools.groupby(fin, lambda l: l.split()[0]), 1):
        # create a file to write to, suffixed with the group number (starting at 1)
        with open('{0} tordist.txt'.format(i), 'w') as fout:
            # write each line in the group to the file
            for line in g:
                fout.write(line.strip() + '\n')
So from what I can gather, you have a text file with many lines, where every line begins with a short string of 5 or 6 characters. It sounds like you want all the lines that begin with the same string to go into the same file, so that after the code is run you have as many new files as there are unique starting strings. Is that accurate?
Like you, I'm fairly new to Python, so I'm sure there are more compact ways to do this. The code below loops through the file a number of times, and makes the new files in the same folder as your text and Python files.
# Code which separates lines in a file by an identifier,
# and makes a new file for each identifier group.
filename = input('type filename')
if len(filename) < 1:
    filename = "mk_newfiles.txt"
filehandle = open(filename)

# This chunk loops through the file, looking at the beginning of each line,
# and adding it to a list of identifiers if it is not on the list already.
Unique = list()
for line in filehandle:
    # Like Lalit said, split is a simple way to separate a longer string.
    line = line.split()
    if line[0] not in Unique:
        Unique.append(line[0])

# For each item in the list of identifiers, this code goes through
# the file, and if a line starts with that identifier then it is
# added to a new file.
for item in Unique:
    # This 'if' skips the header, which has a '/' in it.
    if '/' not in item:
        # The .seek(0) 'rewinds' the file handle, which is apparently
        # needed when looping through a file multiple times.
        filehandle.seek(0)
        # Make the new file.
        newfile = open(str(item) + ".txt", "w+")
        # Insert the header, and go to the next line.
        newfile.write(Unique[0])
        newfile.write('\n')
        # Go through the old file, and add the relevant lines to the new file.
        for line in filehandle:
            split_line = line.split()
            if item == split_line[0]:
                newfile.write(line)
        newfile.close()
print(Unique)
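For comparison, here is a more compact sketch of the sort-then-group approach from the question, adjusted for the two issues raised there: the header line is read first and copied into every output file, and the remaining lines are sorted so that itertools.groupby can collect each identifier's lines together. The input filename and the output naming scheme are assumptions based on the sample above:
import itertools

with open('tordist.txt') as fin:
    header = next(fin)  # the first line is the header
    body = sorted(fin, key=lambda l: l.split()[0])

for key, group in itertools.groupby(body, key=lambda l: l.split()[0]):
    with open(key + '.txt', 'w') as fout:
        fout.write(header)      # copy the header into each file
        fout.writelines(group)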

python merge files by rules

I need to write a script in Python that accepts and merges 2 files into a new file according to the following rules:
1) Take 1 word from the 1st file followed by 2 words from the second file.
2) When we reach the end of one file, I'll need to copy the rest of the other file to the merged file without change.
I wrote the script below, but I only managed to read 1 word from each file.
A complete script would be nice, but I really want to understand in words how I can do this on my own.
This is what I wrote:
def exercise3(file1, file2):
    lstFile1 = readFile(file1)
    lstFile2 = readFile(file2)
    with open("mergedFile", 'w') as outfile:
        merged = [j for i in zip(lstFile1, lstFile2) for j in i]
        for word in merged:
            outfile.write(word)

def readFile(filename):
    lines = []
    with open(filename) as file:
        for line in file:
            line = line.strip()
            for word in line.split():
                lines.append(word)
    return lines
Your immediate problem is that zip alternates items from the iterables you give it: in short, it's a 1:1 mapping, where you need 1:2. Try this:
lstFile2a = lstFile2[0::2]
lstFile2b = lstFile2[1::2]
... zip(lstFile1, lstFile2a, lstFile2b)
This is a bit inefficient, but gets the job done.
Another way is to zip up pairs (2-tuples) in lstFile2 before zipping it with lstFile1. A third way is to forget zipping altogether, and run your own indexing:
for i in range(min(len(lstFile1), len(lstFile2) // 2)):
    outfile.write(lstFile1[i])
    outfile.write(lstFile2[2*i])
    outfile.write(lstFile2[2*i + 1])
However, this leaves you with the leftovers of the longer file to handle.
These aren't particularly elegant, but they should get you moving.
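Putting it together, here is a minimal sketch of the 1:2 merge that also copies the leftovers of whichever file is longer. It reuses the readFile helper from the question; writing the words separated by single spaces is an assumption, since the question doesn't say how the output should be delimited:
def merge_1_2(words1, words2):
    out = []
    i = j = 0
    # Take 1 word from the first list, then 2 from the second.
    while i < len(words1) and j + 1 < len(words2):
        out.append(words1[i])
        i += 1
        out.extend(words2[j:j + 2])
        j += 2
    # One list is exhausted: copy the rest of the other without change.
    out.extend(words1[i:])
    out.extend(words2[j:])
    return out

def exercise3(file1, file2):
    with open("mergedFile", 'w') as outfile:
        outfile.write(' '.join(merge_1_2(readFile(file1), readFile(file2))))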

Too many values to unpack in python: Caused by the file format

I have two files, which have two columns as following:
file 1
------
main 46
tag 23
bear 15
moon 2
file 2
------
main 20
rocky 6
zoo 4
bear 2
I am trying to compare the first 2 rows of each file, and where the words are the same, I will sum up the numbers and write them to a new file.
I read the files and used a for loop to go through each line, but it returns a ValueError: too many values to unpack.
import os
from itertools import islice

DIR = r'dir'
for filename in os.listdir(DIR):
    with open(os.path.sep.join([DIR, filename]), 'r') as f:
        for i in range(2):
            line = f.readline().strip()
            word, freq = line.split()
            print(word)
            print(freq)
It turned out there was an extra empty line after each line of text in the file. I searched for \n, but nothing was there. I then removed the empty lines manually, and it worked.
If you don't know how many items you have in the line, then you can't use the nice unpack facility. You'll need to split and check how many you got. For instance:
with open(os.path.sep.join([DIR, filename]), 'r') as f:
    for line in f:
        data = line.split()
        if len(data) >= 2:
            word, count = data[:2]
This will get you the first two fields of any line containing at least that many. Since you haven't specified what to do with other lines or extra fields, I'll leave that (any else part) up to you. I've also left out the strip call: splitting on whitespace already discards the newlines and surrounding spaces.
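For the original goal (summing the numbers of words that appear in both files), a dictionary keyed by word does the comparison in one pass over each file. A minimal sketch, with hypothetical file names file1.txt, file2.txt, and sums.txt, assuming two-column lines like the samples above:
def read_counts(path):
    counts = {}
    with open(path) as f:
        for line in f:
            data = line.split()
            if len(data) >= 2:   # skip blank or malformed lines
                counts[data[0]] = int(data[1])
    return counts

counts1 = read_counts('file1.txt')
counts2 = read_counts('file2.txt')
with open('sums.txt', 'w') as out:
    for word in counts1:
        if word in counts2:  # the word appears in both files: sum the numbers
            out.write('%s %d\n' % (word, counts1[word] + counts2[word]))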

Connecting similar lines from two files

I have two files, both of which are very big. The information is mixed up between them, and I need to compare the two files and connect the lines that intersect.
An example would be:
1st file has
var1:var2:var3
2nd would have
var2:var3:var4
I need to connect these in a third file with output: var1:var2:var3:var4.
Please note that the lines do not line up: var4, which should go with var1 (since var2 and var3 are common to both), could be far away in these huge files.
I need to find a way to compare each line and connect it to the matching one in the 2nd file. I can't seem to think of an adequate loop. Any ideas?
Try the following (assuming var2:var3 is always a unique key in both files):
1) Iterate over all lines in the first file.
2) Add each entry to a dictionary with the value var2:var3 as the key (and var1 as the value).
3) Iterate over all lines in the second file.
4) Look up whether the dictionary from step 2 contains an entry for the key var2:var3; if it does, output var1:var2:var3:var4 into the output file and delete the entry from the dictionary.
This approach can use a very large amount of memory and therefore should probably not be used for very large files.
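A minimal sketch of those steps, assuming every line has exactly three colon-separated fields, that var2:var3 really is unique, and placeholder file names:
index = {}
with open('file1') as f1:
    for line in f1:
        var1, var2, var3 = line.rstrip('\n').split(':')
        index[(var2, var3)] = var1

with open('file2') as f2, open('output_file', 'w') as out:
    for line in f2:
        var2, var3, var4 = line.rstrip('\n').split(':')
        var1 = index.pop((var2, var3), None)  # remove the entry as it is matched
        if var1 is not None:
            out.write(':'.join((var1, var2, var3, var4)) + '\n')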
Based on the specific fields you said that you want to match (2 & 3 from file 1, 1 & 2 from file 2):
#!/usr/bin/python3
# Iterate over every line in file1.
# Iterate over every line in file2.
# If the lines intersect, print the combined line.
with open('file1') as file1:
    for line1 in file1:
        u1, h1, s1 = line1.rstrip().split(':')
        with open('file2') as file2:
            for line2 in file2:
                h2, s2, p2 = line2.rstrip().split(':')
                if h1 == h2 and s1 == s2:
                    print(':'.join((u1, h1, s2, p2)))
This is horrendously slow (in theory), but uses a minimum of RAM. If the files aren't absolutely huge, it might not perform too badly.
If memory isn't a problem, use a dictionary where the key is the same as the value:
#!/usr/bin/python
out_dict = {}
with open('file1', 'r') as file_in:
    lines = file_in.readlines()
    for line in lines:
        out_dict[line] = line
with open('file2', 'r') as file_in:
    lines = file_in.readlines()
    for line in lines:
        out_dict[line] = line
with open('output_file', 'w') as file_out:
    for key in out_dict:
        file_out.write(key)

Add numbers from a list to an existing file using python

I have a text file with, say, 14 lines, and I would like to add items from a list to the end of each of these lines, starting with the 5th line. Can anyone help me out, please?
e.g
I have this text file called test.txt:
a b
12
1
four
/users/path/apple
/users/path/banana
..
..
and I have the following list
cycle=[21,22,23,.....]
My question is how I can add these list items to the end of the lines such that I get this:
a b
12
1
four
/users/path/apple 21
/users/path/banana 22
..
..
I am not very good at Python, and this seems like a simple problem. Any help would be appreciated.
In general, you cannot modify a file except to append things at the end (after the last line).
In your case, you want to:
1) Read the file, line by line.
2) Optionally append something to the line.
3) Write that line back.
You can do it in several ways. Load -> Write back modified string would be the simplest:
with open("path/to/my/file/test.txt", 'rb') as f:
# Strip will remove the newlines, spaces, etc...
lines = [line.strip() for line in f.readlines()]
numbers = [21, 22, 23]
itr = iter(numbers)
with open("path/to/my/file/test.txt", 'wb') as f:
for line in lines:
if '/' in line:
f.write('%s %s\n' % (line, itr.next()))
else:
f.write('%s\n' % line)
The issue with this method is that if you make a mistake in your processing, you ruin the original file. Other methods would be to:
1) Do all the modifications on the list of lines, check them, and write back the whole list.
2) Write into a newly created file, possibly renaming it at the end.
As always, the Python docs are definitely a good read to discover new features and patterns.
Something like this:
for line in file.readlines():
    print(line.rstrip(), cycle.pop(0))
cycle = [21, 22, 23]
i = 0
with open('myfile.txt', 'r') as fh:
    with open('new_myfile.txt', 'w') as fh_new:
        for line in fh:
            addon = ''
            if i < len(cycle):
                addon = ' ' + str(cycle[i])
            fh_new.write(line.strip() + addon + '\n')
            i += 1
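The last two snippets add numbers starting from the first line rather than the 5th. Here is a minimal sketch that starts at a fixed line, writing to a new file (following the earlier advice about not ruining the original); the zero-based start index and the output name test_out.txt are assumptions:
cycle = [21, 22, 23]

with open('test.txt') as f:
    lines = [line.rstrip('\n') for line in f]

start = 4  # the 5th line, zero-based
for offset, number in enumerate(cycle):
    if start + offset < len(lines):
        lines[start + offset] += ' %s' % number

with open('test_out.txt', 'w') as f:
    f.write('\n'.join(lines) + '\n')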
