Python multiple pairs replace in txt file

I've got two .txt files; the first one is organized like this:
1:NAME1
2:NAME2
3:NAME3
...
and the second one like this:
1
1
1
2
2
2
3
What I want to do is to substitute every line in .txt 2 according to pairs in .txt 1, like this:
NAME1
NAME1
NAME1
NAME2
NAME2
NAME2
NAME3
Is there a way to do this? I was thinking of organizing the first txt by deleting the 1:, 2:, 3: prefixes and reading it as an array, then making a loop for i in range(1, number of lines in txt 1), and then finding the lines in txt 2 containing i and substituting the i-th element of the array. But of course I have no idea how to do this.

As Rodrigo commented, there are many ways to implement it, but storing the names in a dictionary is probably the way to go.
# Read the names
with open('names.txt') as f_names:
    names = dict(line.strip().split(':') for line in f_names)

# Read the numbers
with open('numbers.txt') as f_numbers:
    numbers = list(line.strip() for line in f_numbers)

# Replace numbers with names
with open('numbers.txt', 'w') as f_output:
    for n in numbers:
        f_output.write(names[n] + '\n')

This should do the trick. It reads the first file and stores the k, v pairs in a dict. That dict is then used to output the v for every k you find in the second file.
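If the second file could ever contain a number with no matching name, a small variation avoids a KeyError. A minimal sketch, assuming unknown numbers should pass through unchanged (that fallback is my assumption, not part of the question):
# a sketch; names is the dict built above, unknown keys fall back to the raw number
with open('numbers.txt') as f_numbers, open('output.txt', 'w') as f_output:
    for line in f_numbers:
        n = line.strip()
        # dict.get returns the second argument when the key is missing
        f_output.write(names.get(n, n) + '\n')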
But... if you want to prevent these downvotes, it is better to post a code snippet of your own to show what you have tried. Right now your question is a red rag to the horde of SO'ers that downvote everything that has no code in it. Hell, they even downvote answers because there is no code in the question...
lookup = {}
with open("first.txt") as fifile:
    for line in fifile:
        # split() is a method call, and the value needs its newline stripped
        key, value = line.strip().split(":", 1)
        lookup[key] = value

with open("second.txt") as sifile:
    with open("output.txt", "w") as ofile:
        for line in sifile:
            # strip the newline so the key matches the dict
            ofile.write("{}\n".format(lookup[line.strip()]))

Related

Need to read csv files (when csv file is multiple input files) in Python

I have a school assignment that is asking me to write a program that first reads in the name of an input file and then reads the file using the csv.reader() method. The file contains a list of words separated by commas. The program should output the words and their frequencies (the number of times each word appears in the file) without any duplicates.
I have been able to figure out how to do this somewhat for one specific input file, but the program needs to be able to read multiple input files. This is what I have so far:
with open('input1.csv', 'r') as input1file:
    csv_reader = csv.reader(input1file, delimiter=',')
    for row in csv_reader:
        new_row = set(row)
        for m in new_row:
            count = row.count(m)
            print(m, count)
This is what I get:
woman 1
man 2
Cat 1
Hello 1
boy 2
cat 2
dog 2
hey 2
hello 1
This works (almost) for the input1 file, except it changes the order each time I run it.
And I also need it to work for two other input files.
sample CSV
hello,cat,man,hey,dog,boy,Hello,man,cat,woman,dog,Cat,hey,boy
See the code below for an example; I've commented it so you understand what it does and why.
As for the order being different each run: that is due to the use of set. A set is by definition unordered.
Also note that your implementation passes over each row twice: once to turn it into a set, and once more to count. Besides this, if the file contains more than one row, your logic would fail, as the counting part only gets reached when the last line of the file is read.
import csv

def count_things(filename):
    with open(filename) as infile:
        csv_reader = csv.reader(infile, delimiter=',')
        result = {}
        for row in csv_reader:
            # go over the row element by element
            for element in row:
                # does it exist already?
                if element in result:
                    # if yes, increase count
                    result[element] += 1
                else:
                    # if no, add and set count to 1
                    result[element] = 1
        # sorting, explained in detail here:
        # https://stackoverflow.com/a/613218/9267296
        return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True)}
        # you could just return the unsorted result by using:
        # return result

for key, value in count_things("input1.csv").items():
    # iterate over items() by key/value pairs
    # see this link:
    # https://www.w3schools.com/python/python_dictionaries_access.asp
    print(key, value)
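For reference, the standard library already implements this counting pattern. A minimal sketch using collections.Counter; its most_common() ordering matches the sorted-by-count output above:
import csv
from collections import Counter

def count_things(filename):
    with open(filename) as infile:
        counts = Counter()
        for row in csv.reader(infile, delimiter=','):
            # Counter.update tallies every element of the iterable
            counts.update(row)
        # most_common() yields (element, count) pairs sorted by count, descending
        return dict(counts.most_common())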

reading a specific line from one file using a value from another

I have two files. One file contains lines of numbers. The other file contains lines of text. I want to look up specific lines of text from the list of numbers. Currently my code looks like this.
a_file = open("numbers.txt")
b_file = open("keywords.txt")

for position, line in enumerate(b_file):
    lines_to_read = [a_file]
    if position in lines_to_read:
        print(line)
The values in numbers.txt look like this:
26
13
122
234
41
The values in keywords.txt look like this (example):
this is an apple
this is a pear
this is a banana
this is a pineapple
...
...
...
I can manually write out the values like this
lines_to_read = [26,13,122,234,41]
but that defeats the point of using a_file to look up the values in b_file. I have tried using strings and other variables but nothing seems to work.
[a_file] is a list with one single element, which is the file object a_file. What you want is a list containing the lines, which you can get with a_file.readlines() or list(a_file). But you do not want the text value of the lines, you want their integer value, and you will be searching the container often, meaning that a set would be better. In the end, I would write:
lines_to_read = set(int(line) for line in a_file)
This is now fine:
for position, line in enumerate(b_file):
    if position in lines_to_read:
        print(line)
You need to read the contents of a_file to get the numbers out.
Something like this should work:
lines_to_read = [int(num.strip()) for num in a_file.readlines()]
This will give you a list of the numbers in the file, assuming each line contains a single line number to look up.
Also, you wouldn't need to put this inside the loop. It should go outside the loop, i.e. before it: these numbers are fixed once read in from the file, so there's no need to process them again in each iteration.
socal_nerdtastic helped me find this solution. Thanks so much!
# first, read the numbers file into a list of numbers
with open("numbers.txt") as f:
    lines_to_read = [int(line) for line in f]

# next, read the keywords file into a list of lines
with open("keywords.txt") as f:
    keyword_lines = f.read().splitlines()

# last, use one to print the other
for num in lines_to_read:
    print(keyword_lines[num])
I would just do this...
a_file = open("numbers.txt")
b_file = open("keywords.txt")

keywords_file = b_file.readlines()
for x in a_file:
    print(keywords_file[int(x) - 1])
This reads all lines of the keywords file to get the data as a list, then iterates through the numbers file to get the line numbers, and uses those numbers as indexes into the list (1-based here, hence the -1).
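Note that the two answers above disagree on whether the numbers are 0-based or 1-based indexes. A guarded sketch, assuming 1-based numbering (that assumption is mine; the file names come from the question), which reports out-of-range numbers instead of raising IndexError:
# a sketch, assuming line numbers in numbers.txt are 1-based
with open("numbers.txt") as nums, open("keywords.txt") as kws:
    keyword_lines = kws.read().splitlines()
    for raw in nums:
        n = int(raw)
        if 1 <= n <= len(keyword_lines):
            print(keyword_lines[n - 1])
        else:
            # the number points outside the keywords file
            print("line {} is out of range".format(n))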

Read multiple files with fileinput at a certain line

I have multiple files which I need to open and read (I thought it might be easier with fileinput.input()). Those files start with irrelevant information; what I need is everything below a specific line, ID[tab]NAME[tab]GEO[tab]FEATURE (sometimes from line 32, but unfortunately sometimes at any other line). I then want to store the rows in a list ("entries"):
ID[tab]NAME[tab]GEO[tab]FEATURE
1 aa us A1
2 bb ko B1
3 cc ve C1
.
.
.
Now, instead of hard-coding line 32 (see the code below), I would like to start reading from the header line itself, wherever it appears. Is it possible to do this with fileinput, or am I going the wrong way? Is there another, simpler way to do this? Here is my code until now:
entries = list()
for line in fileinput.input():
    if fileinput.filelineno() > 32:
        entries.append(line.strip().split("\t"))
I'm trying to implement this idea with Python 3.2
UPDATE:
Here is how my code looks now, but it still goes out of range. I need to add some of the entries to a dictionary. Am I missing something?
filelist = fileinput.input()
entries = []
for fn in filelist:
    for line in fn:
        if line.strip() == "ID\tNAME\tGEO\tFEATURE":
            break
    entries.extend(line.strip().split("\t") for line in fn)

dic = collections.defaultdict(set)
for e in entries:
    dic[e[1]].add(e[3])
Error:
dic[e[1]].add(e[3])
IndexError: list index out of range
Just iterate through the file looking for the marker line and add everything after that to the list.
EDIT Your second problem happens because not all of the lines in the original file split into enough fields. A blank line, for instance, results in an empty list, so e[1] is invalid. I've updated the example with a nested iterator that filters out lines that are not the right size. You may want to do something different (maybe strip empty lines but otherwise assert that the remaining lines split to exactly 4 columns, since e[3] needs a fourth field), but you get the idea.
entries = []
for fn in filelist:
    with open(fn) as fp:
        for line in fp:
            if line.strip() == 'ID\tNAME\tGEO\tFEATURE':
                break
        # entries.extend(line.strip().split('\t') for line in fp)
        entries.extend(items for items in (line.strip().split('\t') for line in fp) if len(items) >= 4)
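Since the question asks about fileinput specifically, the same marker-line approach works there too. A sketch under the assumption that the header line is exactly the tab-separated string shown; fileinput.isfirstline() resets the flag whenever a new file starts:
import collections
import fileinput

entries = []
in_data = False
for line in fileinput.input():
    if fileinput.isfirstline():
        # a new file has started; skip lines again until its header is seen
        in_data = False
    if not in_data:
        if line.strip() == "ID\tNAME\tGEO\tFEATURE":
            in_data = True
        continue
    items = line.strip().split("\t")
    if len(items) >= 4:
        entries.append(items)

dic = collections.defaultdict(set)
for e in entries:
    dic[e[1]].add(e[3])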

get non-matching line numbers python

Hi, I wrote a simple script in Python to do the following:
I have two files summarizing genomic data. The first file has the names of loci I want to get rid of; it looks something like this:
File_1:
R000002
R000003
R000006
The second file has the names and position of all my loci and looks like this:
File_2:
R000001 1
R000001 2
R000001 3
R000002 10
R000002 2
R000002 3
R000003 20
R000003 3
R000004 1
R000004 20
R000004 4
R000005 2
R000005 3
R000006 10
R000006 11
R000006 123
What I wish to do is get the corresponding line numbers of all loci from File_2 that are not in File_1, so the end result should look like this:
Result:
1
2
3
9
10
11
12
13
I wrote the following simple code and it gets the job done
#!/usr/bin/env python
import sys

File1 = sys.argv[1]
File2 = sys.argv[2]

F1 = open(File1).readlines()
F2 = open(File2).readlines()
F3 = open(File2 + '.np', 'w')

Loci = []
for line in F1:
    Loci.append(line.strip())

for x, y in enumerate(F2):
    y2 = y.strip().split()
    if y2[0] not in Loci:
        F3.write(str(x + 1) + '\n')
However, when I run this on my real data set, where the first file has 58470 lines and the second file has 12881010 lines, it seems to take forever. I am guessing that the bottleneck is the
if y2[0] not in Loci:
part, where the code has to search through the whole Loci list (all of File_1) for every line of File_2, but I have not been able to find a speedier solution.
Can anybody help me out and show a more Pythonic way of doing things?
Thanks in advance
Here's some slightly more Pythonic code that doesn't care whether your files are ordered. I'd prefer to just print everything out and redirect it to a file (./myscript.py > outfile.txt), but you could also pass in another filename and write to that.
#!/usr/bin/env python
import sys

ignore_f = sys.argv[1]
loci_f = sys.argv[2]

with open(ignore_f) as f:
    ignore = set(x.strip() for x in f)

with open(loci_f) as f:
    for n, line in enumerate(f, start=1):
        if line.split()[0] not in ignore:
            print(n)
Searching for something in a list is O(n), while it takes only O(1) on average for a set. If order doesn't matter and the items are unique, use a set over a list. While this isn't optimal, it should be O(n) instead of O(n × m) like your code. The snippet below makes the difference concrete.
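An illustrative micro-benchmark of list versus set membership (the sizes are arbitrary and the timings will vary by machine):
import timeit

setup = "data = list(range(100000)); s = set(data)"
# membership in a list scans the elements one by one: O(n)
print(timeit.timeit("99999 in data", setup=setup, number=100))
# membership in a set is a hash lookup: O(1) on average
print(timeit.timeit("99999 in s", setup=setup, number=100))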
You're also not closing your files, which isn't a big deal when reading, but when writing it is. I use context managers (with) so Python does that for me.
Style-wise, use descriptive variable names, and avoid UpperCase names; those are typically used for classes (see PEP 8).
If your files are ordered, you can step through them together, ignoring lines where the loci names are the same; when they differ, take another step in your ignore file and recheck. See the sketch below.
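A minimal sketch of that lockstep walk, assuming both files are sorted ascending by locus name (the question's samples are):
# a sketch, assuming both files are sorted ascending by locus name
def unmatched_line_numbers(ignore_path, loci_path):
    with open(ignore_path) as ig, open(loci_path) as lf:
        skip = next(ig, None)
        skip = skip.strip() if skip is not None else None
        for n, line in enumerate(lf, start=1):
            name = line.split()[0]
            # advance the ignore file until it catches up with the current name
            while skip is not None and skip < name:
                skip = next(ig, None)
                skip = skip.strip() if skip is not None else None
            if name != skip:
                print(n)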
To make the search for matches more efficient you can simply use a set instead of a list:
Loci = set()
for line in F1:
    Loci.add(line.strip())
The rest should work the same, but faster.
Even more efficient would be to walk down the files in a sort of lockstep, since they're both sorted, but that will require more code and maybe isn't necessary.

Python Detect Missing Line in a Txt File

Let us say a txt file consists of these lines:
A
1
2
3
B
1
2
3
C
1
3
Clearly, 2 is missing from C in the txt file. What is a good way to detect the missing 2 and output it? I need to read the txt file line by line.
Thank you!
Probably you want something like this:
import itertools

line_sets = []
file_names = ['a', 'b', 'c']

# read the content of the files; if you need to remember file names,
# use a dict instead of a list
for file_name in file_names:
    with open(file_name) as f:
        line_sets.append(set(f.readlines()))

# find missing lines
missing_lines = set()
for first_set, second_set in itertools.combinations(line_sets, 2):
    # the set difference holds lines in the first set but not the second;
    # update() merges them in (add() would fail, since a set is unhashable)
    missing_lines.update(first_set - second_set)

print('Missing lines: %s' % missing_lines)
OK, I think the question was not clear to most of you. Anyway, here is my solution:
I append the values from each section to a list inside a for loop. For example, in section A the list will contain 1, 2 and 3. The len of the list for section C will be only 2, so we know that a value is missing from section C. From there, we can print out section C. Sorry for the misunderstandings. This question is officially closed. Thanks for the views anyway!
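A sketch of that section-grouping idea, assuming headers are alphabetic lines, values are the remaining non-empty lines, and the expected values are the union across all sections (all three are assumptions, as the question leaves the format loose):
sections = {}
current = None
with open("data.txt") as f:  # hypothetical file name
    for line in f:
        line = line.strip()
        if line.isalpha():
            # an alphabetic line starts a new section
            current = line
            sections[current] = set()
        elif line:
            sections[current].add(line)

# the expected values are everything seen in any section
expected = set().union(*sections.values())
for name, values in sections.items():
    for missing in sorted(expected - values):
        print("{} is missing from section {}".format(missing, name))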
