Python: Skipping lines from file - python

I have a data file with 100 lines. I want to skip the first two lines and then create a dictionary with enumerated keys and the remaining lines as values.
myfile = open(infile, 'r')
d = {}
with myfile as f:
    next(f)
    next(f)
    for line in f:
This is what I have so far. I don't know how to use iteritems(), enumerate(), or itervalues(), but I suspect I need one of them. Can anybody help?

You could do something like:
from itertools import islice
with open(infile, 'r') as myfile:
    d = dict(enumerate(islice(myfile, 2, None)))
But I wish I understood why you want to skip the first two lines – are you sure you don't want linecache?

This is off the top of my head, so there will certainly be room for improvement.
myfile = open(infile, 'r')  # open the file
d = {}  # initialize the dict
counter = 0  # initialize the counter once, before the loop
for line in myfile:  # iterate over lines in the file
    if counter <= 1:  # still within the first 2 lines
        counter += 1  # count the line and do nothing else
    else:  # past the first 2 lines
        d[counter - 2] = line  # key is the line count, offset by the 2 skipped lines
        counter += 1  # increase the counter by 1 before continuing
I can't remember off the top of my head where in the code it would be best to close the file, but do some experimentation and read this and this. Next time, a good place to start would be Google and the Python docs in general.
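For the question of where to close the file, here is a minimal sketch that combines the ideas above: a `with` block closes the file automatically, `next()` skips the header lines, and `enumerate()` supplies the keys. The function name `lines_to_dict` and its `skip` parameter are made up for illustration, and `io.StringIO` stands in for a real file.

```python
import io

def lines_to_dict(f, skip=2):
    """Return {0: line, 1: line, ...} for the lines of f after the first `skip`."""
    for _ in range(skip):
        next(f, None)  # the None default avoids StopIteration on short files
    return dict(enumerate(f))

# io.StringIO behaves like an open text file; use open(infile) for real data
with io.StringIO("header 1\nheader 2\nfirst\nsecond\n") as f:
    d = lines_to_dict(f)

print(d)  # {0: 'first\n', 1: 'second\n'}
```

With a real file, `with open(infile) as f: d = lines_to_dict(f)` would behave the same way, and the file is closed as soon as the block ends.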

Related

Removing duplicates from text file using python

I have this text file and let's say it contains 10 lines.
Bye
Hi
2
3
4
5
Hi
Bye
7
Hi
Every time it says "Hi" and "Bye" I want it to be removed except for the first time it was said.
My current code is (yes filename is actually pointing towards a file, I just didn't place it in this one)
text_file = open(filename)
for i, line in enumerate(text_file):
    if i == 0:
        var_Line1 = line
    if i == 1:
        var_Line2 = line
    if i > 1:
        if line == var_Line2:
            del line
text_file.close()
It does detect the duplicates, but it takes a very long time given the number of lines, and I'm not sure how to delete them and save the result.
You could use dict.fromkeys to remove duplicates and preserve order efficiently:
with open(filename, "r") as f:
    lines = dict.fromkeys(f.readlines())
with open(filename, "w") as f:
    f.writelines(lines)
Idea from Raymond Hettinger
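To see the effect without touching a file, here is a small sketch that applies `dict.fromkeys` to the sample lines from the question. It relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 onward; the variable names are illustrative.

```python
lines = ["Bye\n", "Hi\n", "2\n", "3\n", "4\n", "5\n", "Hi\n", "Bye\n", "7\n", "Hi\n"]

# Dictionary keys are unique, and each key keeps the position
# of its first occurrence, so later duplicates simply vanish.
deduped = list(dict.fromkeys(lines))

print("".join(deduped), end="")
```

Only the first "Bye" and the first "Hi" survive, and the remaining lines keep their original order.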
Using a set & some basic filtering logic:
with open('test.txt') as f:
    seen = set()  # keep track of the lines already seen
    deduped = []
    for line in f:
        line = line.rstrip()
        if line not in seen:  # if not seen already, add the line to the result
            deduped.append(line)
            seen.add(line)
# re-write the file with the de-duplicated lines
with open('test.txt', 'w') as f:
    f.writelines([l + '\n' for l in deduped])
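If the file is large, a variation on the set approach is to stream the kept lines into a temporary file and then swap it into place, so only the set of distinct lines is held in memory. This is a sketch under assumptions: the function name `dedupe_file` is hypothetical, and lines are compared exactly, so a final line without a trailing newline would not match an earlier copy that has one.

```python
import os
import tempfile

def dedupe_file(path):
    """Rewrite the file at path, keeping only the first occurrence of each line."""
    seen = set()
    directory = os.path.dirname(os.path.abspath(path))
    # Write the temporary file next to the original so os.replace
    # stays on the same filesystem.
    with open(path) as src, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False
    ) as tmp:
        for line in src:
            if line not in seen:
                seen.add(line)
                tmp.write(line)
    os.replace(tmp.name, path)  # swap the cleaned copy into place
```

Calling `dedupe_file('test.txt')` would then replace the original file with its de-duplicated version.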

How to skip over a certain index position for txt file

I have an assignment that requires me to analyze data for presidential job creation without using dictionaries. I have to open a text file and average the data that applies to the Democratic and Republican presidents. I am having trouble understanding how to skip over certain parts of the file (in my case I don't want to include the first line, which holds the month names, or index position 0 of each row, which holds the year). This is what I have so far and a bit of the input file:
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1979,14090,14135,14152,14191,14221,14239,14288,14328,14422,14484,14532,14559
1980,14624,14747,14754,14795,14827,14784,14861,14870,14824,14900,14903,14946
1981,14969,14981,14987,14985,14971,14963,14993,15007,14971,15028,15073,15075
1982,15056,15056,15050,15075,15132,15207,15299,15328,15403,15463,15515,15538
def g_avg():
    infile = open("government_employment_Windows.txt", 'r')
    lines = []
    for line in infile:
        print(line)
        lines.append(line)
    infile.close()
    print(lines)
    mean = 0
    for line in lines:
        number = float(line)
        mean = mean + number
    mean = mean / len(lines)
    print(mean)
Here is a fairly Pythonic way to calculate this:
with open('filename') as f:
    lines = f.readlines()  # read all lines of the text file
total = 0
n = 0
for line in lines[1:]:  # [1:] skips the first row with the month names
    row = line.split(',')  # split the line into a list on commas
    for element in row[1:]:  # [1:] skips the year
        total += float(element)
        n += 1
mean = total / n
This also looks like a csv file, in which case you can use the built in csv module
import csv
total = 0
count = 0
with open("government_employment_Windows.txt", 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skips the headers
    for line in reader:
        for item in line[1:]:
            count += 1
            total += float(item)
print(total)
print(count)
print('average: ', total/count)
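Since the assignment rules out dictionaries, a per-year average can be kept in a plain list of tuples. The sketch below uses two rows from the question's sample, with `io.StringIO` standing in for the real file; the names `sample` and `averages` are illustrative.

```python
import csv
import io

# Two rows from the question's file; io.StringIO stands in for open(...)
sample = io.StringIO(
    "Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec\n"
    "1979,14090,14135,14152,14191,14221,14239,14288,14328,14422,14484,14532,14559\n"
    "1980,14624,14747,14754,14795,14827,14784,14861,14870,14824,14900,14903,14946\n"
)

reader = csv.reader(sample)
next(reader)  # skip the header row

averages = []  # list of (year, mean) tuples, no dictionary needed
for row in reader:
    year, months = row[0], row[1:]  # index 0 is the year, the rest are data
    values = [float(m) for m in months]
    averages.append((year, sum(values) / len(values)))

for year, mean in averages:
    print(year, round(mean, 1))
```

Grouping the years by party would then be a matter of filtering `averages` by year, which the question leaves to the assignment.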
Use a slice to skip over the first line, i.e. file.readlines()[1:]
# foo.txt
line, that, you, want, to, skip
important, stuff
other, important, stuff
with open('foo.txt') as file:
    for line in file.readlines()[1:]:
        print(line)
important, stuff
other, important, stuff
Note that since I have used with to open a file, Python will close it for me automatically. If you have just done file = open(..) you will have to also do file.close()

How to define length of lines to read in from a file

I am reading lines from a file in Python. Here is my code:
with open('words','rb') as f:
    for line in f:
Is there a way to define the number of lines I want to use? Say, for example, the first 1000 lines in the file?
You can use enumerate():
with open('words','rb') as f:
    for i, line in enumerate(f):
        if i >= 1000:
            break
        # do work for first 1000 lines
Make a counter variable; I have used i in the example below. It is incremented on each iteration, and the work is done only while it is below 1000, that is, for the first 1000 lines:
i = 0
with open('words','rb') as f:
    for line in f:
        if i < 1000:
            pass  # do stuff
            i = i + 1
        else:
            break  # stop reading once 1000 lines are done
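Another option is `itertools.islice`, which stops after a given number of items without a manual counter. A minimal sketch, with `io.StringIO` standing in for the `words` file and made-up sample content:

```python
import io
from itertools import islice

# Stand-in for open('words'): 5000 generated lines
f = io.StringIO("".join("word%d\n" % i for i in range(5000)))

# islice(f, 1000) yields at most the first 1000 lines, then stops
first = [line.rstrip("\n") for line in islice(f, 1000)]

print(len(first))
```

If the file has fewer than 1000 lines, `islice` simply ends early rather than raising an error.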

Deleting n number of lines after specific line of file in python

I am trying to remove a specific number of lines from a file. These lines always occur after a specific comment line. Anyways, talk is cheap, here is an example of what I have.
FILE: --
randomstuff
randomstuff2
randomstuff3
# my comment
extrastuff
randomstuff2
extrastuff2
#some other comment
randomstuff4
So, I am trying to remove the section after # my comment. Perhaps there is some way to delete a line in r+ mode?
Here is what I have so far
with open(file_name, 'a+') as f:
    for line in f:
        if line == my_comment_text:
            f.seek(len(my_comment_text)*-1, 1) # move cursor back to beginning of line
            counter = 4
        if counter > 0:
            del(line) # is there a way to do this?
I'm not exactly sure how to do this. How do I remove a specific line? I have looked at this possible dup and can't quite figure out how to do it that way either. The answer recommends that you read the file and then rewrite it. The problem is that they check for a specific line when they write. I can't do exactly that, and I don't like the idea of storing the entire file's contents in memory. That would eat up a lot of memory with a large file (since every line has to be stored, rather than one at a time).
Any ideas?
You can use the fileinput module for this and open the file in inplace=True mode to allow in-place modification:
import fileinput
counter = 0
for line in fileinput.input('inp.txt', inplace=True):
    if not counter:
        if line.startswith('# my comment'):
            counter = 4
        else:
            print(line, end='')  # stdout is redirected into the file
    else:
        counter -= 1
Edit per your comment "Or until a blank line is found":
import fileinput
ignore = False
for line in fileinput.input('inp.txt', inplace=True):
    if not ignore:
        if line.startswith('# my comment'):
            ignore = True
        else:
            print(line, end='')  # stdout is redirected into the file
    if ignore and line.isspace():
        ignore = False
You can make a small modification to your code and stream the content from one file to the other very easily.
with open(file_name, 'r') as f:
    with open(second_file_name, 'w') as w:
        counter = 0
        for line in f:
            if line == my_comment_text:
                counter = 3
            elif counter > 0:
                counter -= 1
            else:
                w.write(line)
I like the answer from @Ashwini. I was also working on a solution, and something like this should work if you are OK with writing a new file containing the filtered lines:
def rewriteByRemovingSomeLines(inputFile, outputFile):
    unDesiredLines = []
    count = 0
    skipping = False
    fhIn = open(inputFile, 'r')
    line = fhIn.readline()
    while(line):
        if line.startswith('#I'):
            unDesiredLines.append(count)
            skipping = True
            while (skipping):
                line = fhIn.readline()
                count = count + 1
                # stop at a blank line, the next comment, or end of file
                if (not line or line == '\n' or line.startswith('#')):
                    skipping=False
                else:
                    unDesiredLines.append(count)
        count = count + 1
        line = fhIn.readline()
    fhIn.close()
    fhIn = open(inputFile, 'r')
    count = 0
    #Write the desired lines to a new file
    fhOut = open(outputFile, 'w')
    for line in fhIn:
        if not (count in unDesiredLines):
            fhOut.write(line)
        count = count + 1
    fhIn.close()
    fhOut.close()
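The streaming idea in these answers can also be written as a generator that works on any iterable of lines, so only one line is held in memory at a time. This is a sketch under assumptions: the name `skip_after` and its parameters are hypothetical, and unlike some of the answers above it keeps the marker line itself and drops only the `n` lines that follow it.

```python
def skip_after(lines, marker, n):
    """Yield lines, dropping the n lines that follow each line starting with marker."""
    to_skip = 0
    for line in lines:
        if to_skip:
            to_skip -= 1  # still inside the block to remove
            continue
        if line.startswith(marker):
            to_skip = n  # drop the next n lines, keep the marker itself
        yield line

# The sample file from the question, as a list of lines
sample = [
    "randomstuff\n", "randomstuff2\n", "randomstuff3\n",
    "# my comment\n",
    "extrastuff\n", "randomstuff2\n", "extrastuff2\n",
    "#some other comment\n", "randomstuff4\n",
]

result = list(skip_after(sample, "# my comment", 3))
print("".join(result), end="")
```

For real files, the same generator can feed `writelines` on a second file opened for writing, which avoids loading the whole input at once.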

Skip first couple of lines while reading lines in Python file

I want to skip the first 17 lines while reading a text file.
Let's say the file looks like:
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
good stuff
I just want the good stuff. What I'm doing is a lot more complicated, but this is the part I'm having trouble with.
Use a slice, like below:
with open('yourfile.txt') as f:
    lines_after_17 = f.readlines()[17:]
If the file is too big to load in memory:
with open('yourfile.txt') as f:
    for _ in range(17):
        next(f)
    for line in f:
        pass  # do stuff
Use itertools.islice, starting at index 17. It will automatically skip the first 17 lines.
import itertools
with open('file.txt') as f:
    for line in itertools.islice(f, 17, None):  # start=17, stop=None
        pass  # process lines
for line in dropwhile(isBadLine, lines):
    pass  # process as you see fit
Full demo:
from itertools import dropwhile
def isBadLine(line):
    return line.strip() == '0'  # strip the trailing newline before comparing
with open(...) as f:
    for line in dropwhile(isBadLine, f):
        pass  # process as you see fit
Advantages: This is easily extensible to cases where your prefix lines are more complicated than "0" (but not interdependent).
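One point worth seeing in action is that `dropwhile` only removes the leading run of matching lines; once a line fails the test, everything after it is passed through, including later lines that would have matched. A small sketch with made-up in-memory lines (the name `is_bad_line` is illustrative):

```python
from itertools import dropwhile

lines = ["0\n", "0\n", "good stuff\n", "0\n", "more good stuff\n"]

def is_bad_line(line):
    # Compare without the trailing newline
    return line.strip() == "0"

kept = list(dropwhile(is_bad_line, lines))
print(kept)  # ['good stuff\n', '0\n', 'more good stuff\n']
```

If matching lines should be removed everywhere rather than just at the start, a plain filter such as a list comprehension would be the better fit.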
Here are the timeit results for the top 2 answers. Note that "file.txt" is a text file containing 100,000+ lines of random string with a file size of 1MB+.
Using itertools:
import itertools
from timeit import timeit
timeit("""with open("file.txt", "r") as fo:
    for line in itertools.islice(fo, 90000, None):
        line.strip()""", globals=globals(), number=100)
>>> 1.604976346003241
Using two for loops:
from timeit import timeit
timeit("""with open("file.txt", "r") as fo:
    for i in range(90000):
        next(fo)
    for j in fo:
        j.strip()""", number=100)
>>> 2.427317383000627
In this test the itertools method was faster, which suggests it is the more efficient choice for large files.
If you don't want to read the whole file into memory at once, you can use a few tricks:
With next(iterator) you can advance to the next line:
with open("filename.txt") as f:
    next(f)
    next(f)
    next(f)
    for line in f:
        print(line)
Of course, this is slightly ugly, so itertools has a better way of doing this:
from itertools import islice
with open("filename.txt") as f:
    # start at index 17 and never stop (None), until the end
    for line in islice(f, 17, None):
        print(line)
This solution helped me skip ahead to the line number given by the linetostart variable.
You get the index (int) and the line (string) if you want to keep track of those too.
In your case, substitute linetostart with 18, or assign 18 to the linetostart variable.
with open("file.txt", 'r') as f:
    for i, line in enumerate(f, 1):  # number lines from 1
        if i < linetostart:
            continue  # enumerate's start argument only offsets i; it does not skip lines
        pass  # your code
If it's a table:
import pandas as pd
pd.read_table("path/to/file", sep="\t", index_col=0, skiprows=17)
You can use a list comprehension to make it a one-liner (fl is the open file object):
[fl.readline() for i in range(17)]
More about list comprehension in PEP 202 and in the Python documentation.
Here is a method to get lines between two line numbers in a file:
import sys
def file_line(name, start=1, end=sys.maxsize):
    lc = 0
    with open(name) as f:
        for line in f:
            lc += 1
            if lc >= start and lc <= end:
                yield line
s = '/usr/share/dict/words'
l1 = list(file_line(s, 235880))
l2 = list(file_line(s, 1, 10))
print(l1)
print(l2)
Output:
['Zyrian\n', 'Zyryan\n', 'zythem\n', 'Zythia\n', 'zythum\n', 'Zyzomys\n', 'Zyzzogeton\n']
['A\n', 'a\n', 'aa\n', 'aal\n', 'aalii\n', 'aam\n', 'Aani\n', 'aardvark\n', 'aardwolf\n', 'Aaron\n']
Pass only the start argument to get everything from line n to EOF.
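The same line-range behaviour can be sketched with `itertools.islice`, which also stops reading as soon as the end line is reached instead of scanning the rest of the file. The function name `line_range` is hypothetical, and `io.StringIO` stands in for a real file.

```python
import io
from itertools import islice

def line_range(f, start=1, end=None):
    """Return an iterator over lines start..end (1-based, inclusive) of f."""
    # islice uses 0-based indices with an exclusive stop, so line number
    # `start` is index start - 1 and line number `end` is the stop value.
    return islice(f, start - 1, end)

f = io.StringIO("".join("line %d\n" % i for i in range(1, 11)))
print(list(line_range(f, 3, 5)))  # ['line 3\n', 'line 4\n', 'line 5\n']
```

Leaving `end` as `None` reads from the start line through to the end of the file.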
