I have a file with contents as given below,
to-56 Olive 850.00 10 10
to-78 Sauce 950.00 25 20
to-65 Green 100.00 6 10
If the 4th column of data is less than or equal to the 5th column, the data should be written to a second file.
I tried the following code, but only 'to-56 Olive' is saved in the second file. I can't figure out what I'm doing wrong here.
file1=open("inventory.txt","r")
file2=open("purchasing.txt","w")
data=file1.readline()
for line in file1:
items=data.strip()
item=items.split()
qty=int(item[3])
reorder=int(item[4])
if qty<=reorder:
file2.write(item[0]+"\t"+item[1]+"\n")
file1.close()
file2.close()
You're only ever reading one line of input: data = file1.readline() grabs the first line, and inside the loop you keep splitting data instead of line. So at most one distinct line can ever appear in the output.
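A minimal fix, keeping your structure, is to drop the readline() call and split the loop variable itself (a sketch):

file1=open("inventory.txt","r")
file2=open("purchasing.txt","w")
for line in file1:
    item=line.strip().split()
    qty=int(item[3])
    reorder=int(item[4])
    if qty<=reorder:
        file2.write(item[0]+"\t"+item[1]+"\n")
file1.close()
file2.close()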
I see that your code is a bit "old school". Here's a more "modern" and Pythonic version.
# Modern way to open files. The closing is handled cleanly
with open('inventory.txt', mode='r') as in_file, \
        open('purchasing.txt', mode='w') as out_file:
    # A file is iterable
    # We can read each line with a simple for loop
    for line in in_file:
        # Tuple unpacking is more Pythonic and readable
        # than using indices
        ref, name, price, quantity, reorder = line.split()
        # Turn strings into integers
        quantity, reorder = int(quantity), int(reorder)
        if quantity <= reorder:
            # Use f-strings (Python 3) instead of concatenation
            out_file.write(f'{ref}\t{name}\n')
I've changed your code a tiny bit, all you need to do is iterate over lines in your file - like this:
file1=open("inventory.txt","r")
file2=open("purchasing.txt","w")
# Iterate over each line in the file
for line in file1.readlines():
# Separate each item in the line
items=line.split()
# Retrieve important bits
qty=int(items[3])
reorder=int(items[4])
# Write to the file if conditions are met
if qty<=reorder:
file2.write(items[0]+"\t"+items[1]+"\n")
# Release used resources
file1.close()
file2.close()
Here is the output in purchasing.txt:
to-56 Olive
to-65 Green
I have two files. One file contains lines of numbers. The other file contains lines of text. I want to look up specific lines of text from the list of numbers. Currently my code looks like this.
a_file = open("numbers.txt")
b_file = open("keywords.txt")
for position, line in enumerate(b_file):
    lines_to_read = [a_file]
    if position in lines_to_read:
        print(line)
The values in numbers look like this..
26
13
122
234
41
The values in keywords looks like (example)
this is an apple
this is a pear
this is a banana
this is a pineapple
...
...
...
I can manually write out the values like this
lines_to_read = [26,13,122,234,41]
but that defeats the point of using a_file to look up the values in b_file. I have tried using strings and other variables but nothing seems to work.
[a_file] is a list with a single element, the file object a_file itself. What you want is a list containing the lines, which you can get with a_file.readlines() or list(a_file). But you do not want the text value of the lines, you want their integer value, and since you will search the container repeatedly, a set is a better fit. In the end, I would write:
lines_to_read = set(int(line) for line in a_file)
This is now fine:
for position, line in enumerate(b_file):
    if position in lines_to_read:
        print(line)
You need to read the contents of a_file to get the numbers out.
Something like this should work:
lines_to_read = [int(num.strip()) for num in a_file.readlines()]
This will give you a list of the numbers in the file - assuming each line contains a single line number to lookup.
Also, don't put this inside the loop; it should go before it. The numbers are fixed once they are read from the file, so there is no need to rebuild the list on every iteration.
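Putting it together, a sketch (reusing your variable names and assuming the numbers are 0-based line positions, as in your original loop):

a_file = open("numbers.txt")
b_file = open("keywords.txt")
# Build the lookup list once, before the loop
lines_to_read = [int(num.strip()) for num in a_file.readlines()]
for position, line in enumerate(b_file):
    if position in lines_to_read:
        print(line)
a_file.close()
b_file.close()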
socal_nerdtastic helped me find this solution. Thanks so much!
# first, read the numbers file into a list of numbers
with open("numbers.txt") as f:
    lines_to_read = [int(line) for line in f]
# next, read the keywords file into a list of lines
with open("keywords.txt") as f:
    keyword_lines = f.read().splitlines()
# last, use one to print the other
for num in lines_to_read:
    print(keyword_lines[num])
I would just do this...
a_file = open("numbers.txt")
b_file = open("keywords.txt")
keywords_file = b_file.readlines()
for x in a_file:
    print(keywords_file[int(x)-1])
This reads all lines of the keywords file into a list, then iterates through your numbers file to get the line numbers, and uses those line numbers (minus one, since the list is 0-indexed) as indices into the list.
I have a file containing a block of introductory text for an unknown number of lines, then the rest of the file contains data. Before the data block begins, there are column titles and I want to skip those also. So the file looks something like this:
this is an introduction..
blah blah blah...
...
UniqueString
Time Position Count
0 35 12
1 48 6
2 96 8
...
1000 82 37
I want to record the Time, Position, and Count data to a separate file. The Time, Position, and Count data appear only after UniqueString.
Is this what you're looking for?
from functools import reduce  # reduce is a builtin in Python 2, but needs this import in Python 3
reduce(lambda x, line: (x and (outfile.write(line) or x)) or line == 'UniqueString\n', infile, False)
How it works:
files are iterable, so reduce can consume infile line by line directly
in the and part, we rely on short-circuit evaluation: outfile.write(line) is not called if the first operand of and is False
in the or part, we set the trigger once the UniqueString line is found, so write() fires for the next and all subsequent lines
the third argument gives reduce the initial value False, so nothing is written before the marker has been seen
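If the one-liner feels too clever, an equivalent explicit loop might look like this (a sketch; like the reduce version, it writes everything after the UniqueString line, including the column-title line):

with open('data.txt') as infile, open('output.txt', 'w') as outfile:
    seen_marker = False
    for line in infile:
        if seen_marker:
            outfile.write(line)
        elif line == 'UniqueString\n':
            seen_marker = True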
You could extract and write the data to another file like this:
with open("data.txt", "r") as infile:
x = infile.readlines()
x = [i.strip() for i in x[x.index('UniqueString\n') + 1:] if i != '\n' ]
with open("output.txt", "w") as outfile:
for i in x[1:]:
outfile.write(i+"\n")
It is pretty straightforward, I think: the file is opened and all lines are read, a list comprehension slices the list starting just after the UniqueString marker (x[1:] then skips the column-title line), and the remaining lines are written to the output file.
You could create a generator function (and more info here) that filtered the file for you.
It operates incrementally so doesn't require reading the entire file into memory at one time.
def extract_lines_following(file, marker=None):
    """Generator yielding all lines in file following the line following the marker.
    """
    marker_seen = False
    for line in file:  # iterate instead of calling file.next() so this also works in Python 3
        if marker_seen:
            yield line
        elif line.strip() == marker:
            marker_seen = True
            next(file, None)  # skip the following (column-title) line, too
# sample usage
with open('test_data.txt', 'r') as infile, open('cleaned_data.txt', 'w') as outfile:
    outfile.writelines(extract_lines_following(infile, 'UniqueString'))
This could be optimized a little if you're using Python 3, but the basic idea would be the same.
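For example, in Python 3 the same filtering can be written compactly with itertools (a sketch; it assumes the marker and the column-title line each occupy exactly one line):

from itertools import dropwhile, islice

with open('test_data.txt') as infile, open('cleaned_data.txt', 'w') as outfile:
    # drop everything before the marker line...
    remaining = dropwhile(lambda line: line.strip() != 'UniqueString', infile)
    # ...then skip the marker itself and the column-title line that follows it
    outfile.writelines(islice(remaining, 2, None))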
If I have a text file that has a bunch of random text before I get to the stuff I actually want, how do I move the file pointer there?
Say for example my text file looks like this:
#foeijfoijeoijoijfoiej ijfoiejoi jfeoijfoifj i jfoei joi jo ijf eoij oie jojf
#feoijfoiejf ioj oij oi jo ij i joi jo ij oij #### oijroijf 3## # o
#foeijfoiej i jo i iojf 3 ## #io joi joij oi j## io joi joi j3# 3i ojoi joij
# The stuff I care about
(The hashtags are a part of the actual text file)
How do I move the file pointer to the line of stuff I care about, and then how would I get python to tell me the number of the line, and start the reading of the file there?
I've tried doing a loop to find the line that the last hashtag is in, and then reading from there, but I still need to get rid of the hashtag, and need the line number.
Try using the readlines function. This will return a list containing each line. You can use a for loop to parse through each line, searching for what you need, then obtain the number of the line via its index in the list. For instance:
with open('some_file_path.txt') as f:
    contents = f.readlines()
target = '#the line I am looking for'
for line in contents:
    if target in line:
        line_num = contents.index(line)
To get rid of the pound sign, just use the replace function, e.g. new_line = line.replace('#','').
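Putting those pieces together, a small sketch (the search string is just an example):

with open('some_file_path.txt') as f:
    contents = f.readlines()
target = '# The stuff I care about'
for line_num, line in enumerate(contents):
    if target in line:
        # keep everything from this line on, dropping the pound signs
        wanted = [l.replace('#', '') for l in contents[line_num:]]
        break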
You can't seek to it directly without knowing the size of the junk data or scanning through the junk data. But it's not too hard to wrap the file in itertools.dropwhile to discard lines until you see the "good" data, after which it iterates through all remaining lines:
import itertools
# Or def a regular function that returns True until you see the line
# delimiting the beginning of the "good" data
not_good = '# The stuff I care about\n'.__ne__
with open(filename) as f:
    for line in itertools.dropwhile(not_good, f):
        ... You'll iterate the lines at and after the good line ...
If you actually need the file descriptor positioned appropriately, not just the lines, this variant should work:
import io
with open(filename) as f:
    # Get first good line
    good_start = next(itertools.dropwhile(not_good, f))
    # Seek back to undo the read of the first good line:
    f.seek(-len(good_start), io.SEEK_CUR)
    # f is now positioned at the beginning of the line that begins the good data
You can tweak this to get the actual line number if you really need it (rather than just needing the offset). It's a little less readable though, so explicit iteration via enumerate may make more sense if you need it (left as an exercise). One way to make Python do the counting for you is:
from future_builtins import map  # Py2 only
from operator import itemgetter
with open(filename) as f:
    linectr = itertools.count()
    # Get first good line
    # Pair each line with a 0-up number to advance the count generator, but
    # strip it immediately so not_good only processes lines, not line nums
    good_start = next(itertools.dropwhile(not_good, map(itemgetter(0), zip(f, linectr))))
    good_lineno = next(linectr)  # Keeps the 1-up line number by advancing once
    # Seek back to undo the read of the first good line:
    f.seek(-len(good_start), io.SEEK_CUR)
    # f is now positioned at the beginning of the line that begins the good data
I have a BED Interval file that I'm trying to work with using the Galaxy online tool. Currently, every line in the file begins with a number (which stands for chromosome number). In order to upload it properly, I need every line to begin with "chr" and then the number. So for example lines that start with "2L", I need to change so that they will start with "chr2L", and do the same for every other line that start with a number (not just 2L, there are many different numbers). I was thinking if I could just add a "chr" to the start of every line, without affecting the other columns, that would be great, but I have no idea how to do that (very new to python)
Can you please help me out?
Thanks.
http://docs.python.org/2/library/stdtypes.html#file.writelines
with open('bed-interval') as f1, open('bed-interval-modified', 'w') as f2:
    f2.writelines('chr' + line for line in f1)
Step 1: open the file
file = open("somefile.txt")
Step 2: get the lines
lines = file.readlines()
file.close()
Step 3: use a list comprehension
new_lines = ["chr"+line for line in lines]
Step 4: write the new lines back to the file
with open("somefile.txt","w") as f:
    f.writelines(new_lines)
If you don't want to store all the lines in memory:
file1 = open("some.txt")
file2 = open("output.txt","w")
for line in file1:
    file2.write("chr" + line)
file1.close()
file2.close()
then just copy output.txt to your original filename
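If you want Python to do that last step too, os.replace can swap the rewritten file over the original (a sketch):

import os
os.replace("output.txt", "some.txt")  # replace the original file with the modified copy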
I have a large file (100 million lines of tab separated values - about 1.5GB in size). What is the fastest known way to sort this based on one of the fields?
I have tried Hive. I would like to see if this can be done faster using Python.
Have you considered using the *nix sort program? In raw terms, it'll probably be faster than most Python scripts.
Use -t $'\t' to specify that it's tab-separated, -k n to specify the field, where n is the field number, and -o outputfile if you want to output the result to a new file.
Example:
sort -t $'\t' -k 4 -o sorted.txt input.txt
Will sort input.txt on its 4th field, and output the result to sorted.txt
You want to build an in-memory index for the file:
create an empty list
open the file
read it line by line (using f.readline()), and store in the list a tuple consisting of the value you want to sort on (extracted with line.split('\t')[col].strip()) and the offset of the line in the file (which you can get by calling f.tell() before calling f.readline())
close the file
sort the list
Then to print the sorted file, reopen the file and for each element of your list, use f.seek(offset) to move the file pointer to the beginning of the line, f.readline() to read the line and print the line.
Optimization: you may want to store the length of the line in the list, so that you can use f.read(length) in the printing phase.
Sample code (optimized for readability, not speed):
def build_index(filename, sort_col):
    index = []
    f = open(filename)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        length = len(line)
        col = line.split('\t')[sort_col].strip()
        index.append((col, offset, length))
    f.close()
    index.sort()
    return index

def print_sorted(filename, col_sort):
    index = build_index(filename, col_sort)
    f = open(filename)
    for col, offset, length in index:
        f.seek(offset)
        print(f.read(length).rstrip('\n'))
    f.close()

if __name__ == '__main__':
    filename = 'somefile.txt'
    sort_col = 2
    print_sorted(filename, sort_col)
Split the input into files small enough to be sorted in memory, sort each one, then merge the resulting files.
Merge by reading a portion of each of the sorted files, the same amount from each, leaving enough room in memory for the merged result. Save each merged block, appending the blocks to the output file as you go.
This minimises file I/O and seeking around the file on disk.
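A rough sketch of that approach, assuming we sort on the 4th tab-separated field; heapq.merge does the k-way merge lazily, so only one line per chunk is held in memory during the merge:

import heapq
import itertools
import tempfile

def sort_key(line):
    return line.split('\t')[3]  # the 4th tab-separated field

def external_sort(in_path, out_path, chunk_lines=1_000_000):
    chunk_files = []
    with open(in_path) as f:
        while True:
            # read and sort one chunk at a time
            chunk = list(itertools.islice(f, chunk_lines))
            if not chunk:
                break
            chunk.sort(key=sort_key)
            tmp = tempfile.TemporaryFile(mode='w+')
            tmp.writelines(chunk)
            tmp.seek(0)
            chunk_files.append(tmp)
    with open(out_path, 'w') as out:
        # merge the already-sorted chunk files
        out.writelines(heapq.merge(*chunk_files, key=sort_key))
    for tmp in chunk_files:
        tmp.close()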
I would store the file in a good relational database, index it on the field you are interested in, and then read the items in order.
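For example, with the sqlite3 module from the standard library (a sketch; it assumes five tab-separated columns and sorts on the 4th):

import csv
import sqlite3

conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE rows (c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT, c5 TEXT)')
with open('input.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    conn.executemany('INSERT INTO rows VALUES (?, ?, ?, ?, ?)', reader)
conn.execute('CREATE INDEX idx_c4 ON rows (c4)')
conn.commit()
with open('sorted.txt', 'w') as out:
    for row in conn.execute('SELECT * FROM rows ORDER BY c4'):
        out.write('\t'.join(row) + '\n')
conn.close()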