Python Detect Missing Line in a Txt File - python

Let us say a txt file consists of these lines:
A
1
2
3
B
1
2
3
C
1
3
Clearly, 2 is missing from C in the txt file. What is the idea to detect the missing 2 and output it? I need to read the txt file line by line.
Thank you !

Probably you want something like this:
line_sets = []
file_names = ['a', 'b', 'c']
# read content of files, if you need to remember file names, use
# dict instead of list
for file_name in file_names:
with open(file_name, 'rb') as f:
line_sets.append(set(f.readlines()))
# find missing lines
missing_lines = set()
for first_set, second_set in itertools.combinations(line_sets, 2):
missing_lines.add(first_set - second_set)
print 'Missing lines: %s' % missing_lines

Ok I think the question is not clear to most of you. Anyway here is my solution:
I append the values from each section into a list inside a for loop. For example, in section A, the list will contains 1,2 and 3. The len of the list in section C will be only 2. Thus, we know that a value is missing from section C. From here, we can print out section C. Sorry for the misunderstandings. This question is officially close. Thanks for the view anyway!

Related

Trouble with turning .txt file into lists

In a project I am doing I am storing the lists in a .txt file in order to save space in my actual code. I can turn each line into a separate list but I need to turn multiple lines into one list where each letter is a different element. I would appreciate any help I can get. The program I wrote is as follows:
lst = open('listvariables.txt', 'r')
data = lst.readlines()
for line in data:
words = line.split()
print(words)
Here is a part of the .txt file I am using:
; 1
wwwwwwwwwww
wwwwswwwwww
wwwsswwwssw
wwskssssssw
wssspkswssw
wwwskwwwssw
wwwsswggssw
wwwswwgwsww
wwsssssswww
wwssssswwww
wwwwwwwwwww
; 2
wwwwwwwww
wwwwwsgsw
wwwwskgsw
wwwsksssw
wwskpswww
wsksswwww
wggswwwww
wssswwwww
wwwwwwwww
If someone could make the program print out two lists that would be great.
You can load the whole file and turn it into a char-list like:
with open('listvariables.txt', 'r') as f
your_list = list(f.read())
I'm not sure why you want to do it, tho. You can iterate over string the same way you can iterate over a list - the only advantage is that list is a mutable object but you wouldn't want to do complex changes to it, anyway.
If you want each character in a string to be an element of the final list you should use
myList = list(myString)
If I understand you correctly this should work:
with open('listvariables.txt', 'r') as my_file:
my_list = [list(line) for line in my_file.read().splitlines()]

Python multiple pairs replace in txt file

I've got 2 .txt files, the first one is organized like this:
1:NAME1
2:NAME2
3:NAME3
...
and the second one like this:
1
1
1
2
2
2
3
What I want to do is to substitute every line in .txt 2 according to pairs in .txt 1, like this:
NAME1
NAME1
NAME1
NAME2
NAME2
NAME2
NAME3
Is there a way to do this? I was thinking to organize the first txt deleting the 1: 2: 3: and read it as an array, then make a loop for i in range(1, number of lines in txt 1) and then in the txt 2 find lines containing "i" and substituting with the i-element of the array. But of course I've no idea how to do this.
As Rodrigo commented. There are many ways to implement it, but storing the names in a dictionary is probably the way to go.
# Read the names
with open('names.txt') as f_names:
names = dict(line.strip().split(':') for line in f_names)
# Read the numbers
with open('numbers.txt') as f_numbers:
numbers = list(line.strip() for line in f_numbers)
# Replace numbers with names
with open('numbers.txt', 'w') as f_output:
for n in numbers:
f_output.write(names[n] + '\n')
This should do the trick. It reads the first file and stores the k, v pairs in a dict. That dict is the used to output the v for every k you find in the second file.
But.... if you want to prevent these downvotes it is better to post a code snippet of your own to show what you have tried... Right now your question is a red blanket for the horde of SO'ers that downvote everything that has no code in it. Hell, they even downvote answers because there is no code in the question...
lookup = {}
with open("first.txt") as fifile:
for line in fifile:
lookup[line.split[":"][0]] = line.split[":"][1]
with open("second.txt") as sifile:
with open("output.txt", "w") as ofile:
for line in sifile:
ofile.write("{}\n".format(lookup[line])

Read multiple files with fileinput at a certain line

I have multiple files which I need to open and read (I thought may it will be easier with fileinput.input()). Those files contain at the very beginning non-relevant information, what I need is all the information below this specific line ID[tab]NAME[tab]GEO[tab]FEATURE (some times from line 32, but unfortunately some times at any other line), then I want to store them in a list ("entries")
ID[tab]NAME[tab]GEO[tab]FEATURE
1 aa us A1
2 bb ko B1
3 cc ve C1
.
.
.
Now, before reading from line 32 (see code below), I will like to read from the above line. Is it possible to do this with fileinput? or am I going the wrong way. Is there another mor simple way to do this? Here is my code until now:
entries = list()
for line in fileinput.input():
if fileinput.filelineno() > 32:
entries.append(line.strip().split("\t"))
I'm trying to implement this idea with Python 3.2
UPDATE:
Here is how my code looks now, but still out of range. I need to add some of the entries to a dictionary. Am I missing something?
filelist = fileinput.input()
entries = []
for fn in filelist:
for line in fn:
if line.strip() == "ID\tNAME\tGEO\tFEATURE":
break
entries.extend(line.strip().split("\t")for line in fn)
dic = collections.defaultdict(set)
for e in entries:
dic[e[1]].add(e[3])
Error:
dic[e[1]].add(e[3])
IndexError: list index out of range
Just iterate through the file looking for the marker line and add everything after that to the list.
EDIT Your second problem happens because not all of the lines in the original file split to at least 3 fields. A blank line, for instance, results in an empty list so e[1] is invalid. I've updated the example with a nested iterator that filters out lines that are not the right size. You may want to do something different (maybe strip empty lines but otherwise assert that the remaining lines need to split to exactly 3 columns), but you get the idea
entries = []
for fn in filelist:
with open('fn') as fp:
for line in fp:
if line.strip() == 'ID\tNAME\tGEO\tFEATURE':
break
#entries.extend(line.strip().split('\t') for line in fp)
entries.extend(items for items in (line.strip().split('\t') for line in fp) if len(items) >= 3)

get non-matching line numbers python

Hi I wrote a simple code in python to do the following:
I have two files summarizing genomic data. The first file has the names of loci I want to get rid of, it looks something like this
File_1:
R000002
R000003
R000006
The second file has the names and position of all my loci and looks like this:
File_2:
R000001 1
R000001 2
R000001 3
R000002 10
R000002 2
R000002 3
R000003 20
R000003 3
R000004 1
R000004 20
R000004 4
R000005 2
R000005 3
R000006 10
R000006 11
R000006 123
What I wish to do is get all the corresponding line numbers of loci from File2 that are not in File1, so the end result should look like this:
Result:
1
2
3
9
10
11
12
13
I wrote the following simple code and it gets the job done
#!/usr/bin/env python
import sys
File1 = sys.argv[1]
File2 = sys.argv[2]
F1 = open(File1).readlines()
F2 = open(File2).readlines()
F3 = open(File2 + '.np', 'w')
Loci = []
for line in F1:
Loci.append(line.strip())
for x, y in enumerate(F2):
y2 = y.strip().split()
if y2[0] not in Loci:
F3.write(str(x+1) + '\n')
However when I run this on my real data set where the first file has 58470 lines and the second file has 12881010 lines it seems to take forever. I am guessing that the bottleneck is in the
if y2[0] not in Loci:
part where the code has to search through the whole of File_2 repeatedly but I have not been able to find a speedier solution.
Can anybody help me out and show a more pythonic way of doing things.
Thanks in advance
Here's some slightly more Pythonic code that doesn't care if your files are ordered. I'd prefer to just print everything out and redirect it to a file ./myscript.py > outfile.txt, but you could also pass in another filename and write to that.
#!/usr/bin/env python
import sys
ignore_f = sys.argv[1]
loci_f = sys.argv[2]
with open(ignore_f) as f:
ignore = set(x.strip() for x in f)
with open(loci_f) as f:
for n, line in enumerate(f, start=1):
if line.split()[0] not in ignore:
print n
Searching for something in a list is O(n), while it takes only O(1) for a set. If order doesn't matter and you have unique things, use a set over a list. While this isn't optimal, it should be O(n) instead of O(n × m) like your code.
You're also not closing your files, which when reading from isn't that big of a deal, but when writing it is. I use context managers (with) so Python does that for me.
Style-wise, use descriptive variable names. and avoid UpperCase names, those are typically used for classes (see PEP-8).
If your files are ordered, you can step through them together, ignoring lines where the loci names are the same, then when they differ, take another step in your ignore file and recheck.
To make the searching for matches more efficient you can simply use a set instead of list:
Loci = set()
for line in F1:
Loci.add(line.strip())
The rest should work the same, but faster.
Even more efficient would be to walk down the files in a sort of lockstep, since they're both sorted, but that will require more code and maybe isn't necessary.

Python String Formatting, need to continuously add to a string

I just started python, any suggestions would be greatly appreciated. I'm reading a table generated by another program and pulling out 2 numbers from each line, I'll call them a and b. (they are saved as flux and observed in my program) I need to take these two numbers from each line and format them like this-
(a,b),(a,b),(a,b) ect.
Each consecutive parenthesis is from the consecutive line, first a,b is from line 1, second a,b is from line 2, etc. I need to read the entire table however, the table length will vary.
This is what I have so far. It can read the table and pull out the numbers I need, however, I don't know how to put the numbers into the proper format. I want to say something recursive would be most efficient but I'm unsure of how to do that. Thank you in advance.
#!/usr/bin/python
file = open("test_m.rdb")
while 1:
line = file.readline()
i = line.split()
flux = i[2]
observed = i[4]
if not line:
break
with open("test_m.rdb") as inf:
results = [(i[2],i[4]) for i in (line.split() for line in inf)]
result_string = ",".join(str(res) for res in results)
or a more general formatter:
result_string = ", ".join("('{2}', '{4}')".format(*res) for res in results)
Very simple solution:
with open('data') as data:
print ', '.join('(%s, %s)' % (x.split()[2], x.split()[4])
for x in data.readlines())
Just use readlines to iterate over the lines in the file.
assuming that two values you got are str1 and str2
//inside a loop which iterates through your values
strTable = strTable + ['('+str1+')','('+str2+')']
hope oit will work, if it dont, comment , i will solve it.

Categories

Resources