Python: remove "many" lines from a file

I am trying to remove specific line numbers from a file in Python, invoked like this:
./foo.py filename.txt 4 5 2919
where 4, 5 and 2919 are the line numbers to remove.
What I am trying to do is:
for i in range(len(sys.argv)):
    if i > 1:  # avoiding sys.argv[0] and sys.argv[1]
        newlist.append(int(sys.argv[i]))
Then:
count = 0
while True:  # pseudocode: some loop over the file
    bar = file.readline()
    count += 1
    if count not in newlist:
        print bar
It prints all the lines of the original file (with blank lines in between).

You can use enumerate to determine the line number:
import sys
exclude = set(map(int, sys.argv[2:]))
with open(sys.argv[1]) as f:
    for num, line in enumerate(f, start=1):
        if num not in exclude:
            sys.stdout.write(line)
You can remove start=1 if you start counting at 0. In the above code, the line numbering starts with 1, so running the script on itself and excluding lines 2, 4 and 5 prints what is left:
$ python3 so-linenumber.py so-linenumber.py 2 4 5
import sys
with open(sys.argv[1]) as f:
            sys.stdout.write(line)
If you want to write the content to the file itself, write it to a temporary file instead of sys.stdout, and then rename that to the original file name (or use sponge on the command line), like this:
import os
import sys
from tempfile import NamedTemporaryFile

exclude = set(map(int, sys.argv[2:]))
with NamedTemporaryFile('w', delete=False) as outf:
    with open(sys.argv[1]) as inf:
        outf.writelines(line for n, line in enumerate(inf, 1) if n not in exclude)
os.rename(outf.name, sys.argv[1])

You can try something like this:
import sys
import os

filename = sys.argv[1]
lines = [int(x) for x in sys.argv[2:]]
# open two files, one for reading and one for writing
with open(filename) as f, open("newfile", "w") as f2:
    # use enumerate to get the line as well as the line number; enumerate(f, 1) starts the index from 1
    for i, line in enumerate(f, 1):
        if i not in lines:  # `if i not in lines` is clearer than `if not i in lines`
            f2.write(line)
os.rename("newfile", filename)  # rename the new file to the original one
Note that for generating temporary files it's better to use the tempfile module.
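For example, a minimal sketch of the same answer adapted to NamedTemporaryFile; dir= is used so the temporary file lives next to the target and the final os.rename() never has to cross filesystems:
import os
import sys
from tempfile import NamedTemporaryFile

filename = sys.argv[1]
lines = set(int(x) for x in sys.argv[2:])
# create the temporary file in the target's directory; delete=False because we rename it ourselves
with open(filename) as f, NamedTemporaryFile('w', dir=os.path.dirname(os.path.abspath(filename)), delete=False) as f2:
    for i, line in enumerate(f, 1):
        if i not in lines:
            f2.write(line)
os.rename(f2.name, filename)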

import sys

# assumes line numbering starts with 1
# enumerate() starts with zero, so we subtract 1 from each line argument
omitlines = set(int(arg) - 1 for arg in sys.argv[2:] if int(arg) > 0)
with open(sys.argv[1]) as fp:
    filteredlines = (line for n, line in enumerate(fp) if n not in omitlines)
    sys.stdout.writelines(filteredlines)

The fileinput module has an inplace=True option that redirects stdout to a temporary file, which is automatically renamed over the original for you afterwards.
import fileinput
import sys

exclude = set(map(int, sys.argv[2:]))
for i, line in enumerate(fileinput.input('filename.txt', inplace=True), start=1):
    if i not in exclude:
        print line,  # fileinput inplace=True redirects stdout into the replacement file
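On Python 3, where print is a function, the same idea looks like this (a sketch, assuming the same filename.txt and command-line arguments):
import fileinput
import sys

exclude = set(map(int, sys.argv[2:]))
for i, line in enumerate(fileinput.input('filename.txt', inplace=True), start=1):
    if i not in exclude:
        print(line, end='')  # stdout is redirected into the replacement file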

Related

Adding a comma to the end of the first row of CSV files within a directory using Python

I've got some code that lets me open all the CSV files in a directory and run through them, removing the top two lines of each file. Ideally, during this process I would also like it to add a single comma at the end of the new first line (what would originally have been line 3).
Another possible approach could be to remove the trailing commas on all the other rows that appear in each of the CSVs.
Any thoughts or approaches would be gratefully received.
import glob
path='P:\pytest'
for filename in glob.iglob(path+'/*.csv'):
    with open(filename, 'r') as f:
        lines = f.read().split("\n")
        f.close()
    if len(lines) >= 1:
        lines = lines[2:]
        o = open(filename, 'w')
        for line in lines:
            o.write(line+'\n')
        o.close()
Adding a counter in there can solve this:
import glob
path=r'C:/Users/dsqallihoussaini/Desktop/dev_projects/stack_over_flow'
for filename in glob.iglob(path+'/*.csv'):
    with open(filename, 'r') as f:
        lines = f.read().split("\n")
        print(lines)
        f.close()
    if len(lines) >= 1:
        lines = lines[2:]
        o = open(filename, 'w')
        counter=0
        for line in lines:
            counter=counter+1
            if counter==1:
                o.write(line+',\n')
            else:
                o.write(line+'\n')
        o.close()
One possible problem with your code is that you are reading the whole file into memory, which might be fine for small files. If you are reading larger files, then you want to process the file line by line.
The easiest way to do that is to use the fileinput module: https://docs.python.org/3/library/fileinput.html
Something like the following should work:
#!/usr/bin/env python3
import glob
import fileinput

# inplace makes a backup of the file, then any output to stdout is written
# to the current file.
# Change the glob as needed; below is just an example.
#
# Iterate through each file in the glob.iglob() results
with fileinput.input(files=glob.iglob('*.csv'), inplace=True) as f:
    for line in f:  # iterate over each line of the current file
        if f.filelineno() > 2:  # skip the first two lines
            # Note: 'line' still has the newline in it.
            # Insert the comma if this is line 3 of the file, otherwise output the original line
            print(line[:-1] + ',') if f.filelineno() == 3 else print(line, end="")
I've added an encoding as well, since mine was throwing an error and specifying the encoding fixed that up nicely:
import glob
path=r'C:/whateveryourfolderis'
for filename in glob.iglob(path+'/*.csv'):
    with open(filename, 'r',encoding='utf-8') as f:
        lines = f.read().split("\n")
        #print(lines)
        f.close()
    if len(lines) >= 1:
        lines = lines[2:]
        o = open(filename, 'w',encoding='utf-8')
        counter=0
        for line in lines:
            counter=counter+1
            if counter==1:
                o.write(line+',\n')
            else:
                o.write(line+'\n')
        o.close()
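The question also mentions the alternative of removing the trailing commas from all the other rows instead; a minimal sketch of that variant (assuming the same folder layout as above and that the spurious commas only appear at the end of a row):
import glob

path = r'C:/whateveryourfolderis'
for filename in glob.iglob(path + '/*.csv'):
    with open(filename, 'r', encoding='utf-8') as f:
        lines = f.read().split("\n")
    lines = lines[2:]  # drop the first two lines as before
    with open(filename, 'w', encoding='utf-8') as o:
        for counter, line in enumerate(lines, 1):
            if counter > 1:
                line = line.rstrip(',')  # strip trailing commas from every row except the new first one
            o.write(line + '\n')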

Python string matching in if else condition

I am currently trying to find lines based on the same pattern. If a line matches the pattern, I want to print that line to an output file.
Here is an example of the lines in "in.txt":
in_file [0:2] declk
out_file [0:1] subclk
The script that I currently have, with the help of #gilch:
#!/usr/bin/python
import re
with open("in.txt", "r+") as f:
    with open("out.txt", "w+") as fo:
        for line in f:
            if "\S*\s*[\d:\d]\s*\S*" in line:
                fo.write(line)  # need to fix this line
But then, is it possible to make the output look like the example below?
e.g. output in "out.txt":
in_file [0] declk
in_file [1] declk
in_file [2] declk
out_file [0] subclk
out_file [1] subclk
You'll need to import the re module to use regex.
import re

with open("out.txt", "w+") as fo:
    for line in f:
        if re.match(r"\S*\s*\[-?\d*:?-?\d*\]\s*\S*", line):
            fo.write(line)
Also, indentation is part of Python's syntax. The colon isn't enough.
This also assumes that f is already some iterable containing your lines. (The above code never assigns to it.)
Try this:
import re

with open("out.txt", "w+") as fo:
    for line in f:
        if re.match(r"\w+\s\[\d\:\d\]\s\w+", line):
            fo.write(line)
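Neither answer expands the bracketed range into separate lines as the desired output shows; a minimal sketch of that extra step (assuming the field is always [low:high] with low <= high, and reading from in.txt as in the question):
import re

with open("in.txt") as f, open("out.txt", "w") as fo:
    for line in f:
        m = re.match(r"(\S+)\s+\[(\d+):(\d+)\]\s+(\S+)", line)
        if m:
            name, lo, hi, clk = m.groups()
            for i in range(int(lo), int(hi) + 1):
                fo.write("%s [%d] %s\n" % (name, i, clk))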

How to force every line to have the same # of tabs as the maximum length line

I have a tab-delimited txt file, and I want to make every row have the same number of tabs as the row with the largest number of tabs.
For example,
A\tB\tC\tD
E\t
F\tG\t
input file : https://drive.google.com/file/d/0B1sEqo7wNB1-bmpKaWdrSmUtcUE/edit?usp=sharing
will become
A\tB\tC\tD
E\t\t\t
F\tG\t\t
I am trying this:
import sys
from itertools import izip_longest
import codecs

inputf = sys.argv[1]
outputf = sys.argv[2]
with open(inputf) as f:
    data = izip_longest(*(x.split('\t') for x in f), fillvalue='\t')
    for line in zip(*data):
        print line,
ofile = codecs.open(outputf, "w")
But the output file has nothing in it, although the program prints things in the command window.
I would prefer that the program not print these in the command window (it seems to take much time), and that the output file contain the correct output.
Try using the csv module, like this:
#!/usr/bin/env python
import sys
import csv
from itertools import izip_longest

def read_rows(inputfile):
    with open(inputfile, 'rb') as h:
        reader = csv.reader(h, dialect='excel-tab')
        return list(reader)

def write_rows(outputfile, rows):
    with open(outputfile, 'wb') as h:
        writer = csv.writer(h, dialect='excel-tab')
        for row in rows:
            writer.writerow(row)

def show_file(outputfile):
    with open(outputfile, 'r') as h:
        print h.read().splitlines()

def main(inputfile, outputfile):
    rows = read_rows(inputfile)
    rows = zip(*(izip_longest(*rows, fillvalue='')))
    write_rows(outputfile, rows)
    show_file(outputfile)

if __name__ == '__main__':
    inputfile = sys.argv[1]
    outputfile = sys.argv[2]
    main(inputfile, outputfile)
With your input file:
./normalize.py ~/Downloads/input.txt ~/Downloads/output.txt
['A\tB\tC\tD', 'E\t\t\t', 'F\tG\t\t']
You're seeing the output in the command window because you're printing out what's in data (which consumes the iterator returned by izip_longest()). Nothing ends up in the file because no data is ever written to it; you only opened it for writing.
I believe the following will do (only) what you want:
import sys
from itertools import izip_longest
import codecs

inputf = sys.argv[1]
outputf = sys.argv[2]
with open(inputf) as f:
    data = izip_longest(*(x.strip().split('\t') for x in f), fillvalue='')
    with codecs.open(outputf, "w") as ofile:
        ofile.write('\n'.join('\t'.join(items) for items in zip(*data)) + '\n')
"But the output file has nothing in it, although it prints things in the command window."
This is because you are not writing the data to the file.
Change your program as follows:
with open(inputf) as fin, open(outputf, "w") as fout:
    data = izip_longest(*(x.split('\t') for x in fin), fillvalue='\t')
    fout.write('\n'.join(map(''.join, zip(*data))))
Note that your program may not give the desired output, because the newline character is part of the elements you are zipping. You need to strip the newline off the lines as they are read:
data = izip_longest(*(x.strip().split('\t') for x in f), fillvalue='\t')
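On Python 3, izip_longest is spelled itertools.zip_longest; a minimal Python 3 sketch of the same pad-by-transposing idea:
import sys
from itertools import zip_longest

inputf, outputf = sys.argv[1], sys.argv[2]
with open(inputf) as f:
    # transpose, pad short columns with '', then transpose back so every row has the same width
    rows = list(zip(*zip_longest(*(line.rstrip('\n').split('\t') for line in f), fillvalue='')))
with open(outputf, 'w') as ofile:
    ofile.write('\n'.join('\t'.join(row) for row in rows) + '\n')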

create standard compliant file

I have a comma delimited file. The lines look like this...
1,2,3,4,5
6,7,8
9,10
11,12,13,14,15
I need to have exactly 5 columns across all lines. So the new file will be...
1,2,3,4,5
6,7,8,,
9,10,,,
11,12,13,14,15
In other words, if there are fewer than 4 commas in a line, add the required number to the end. I was told that there is a Python module that will do exactly this. Where can I find such a module? Is awk better suited for this type of task?
The module you are looking for is the csv module. You'd still need to ensure that your lists meet your minimum length requirement:
import csv

with open('output.csv', 'wb') as output:
    input = csv.reader(open('faultyfile.csv', 'rb'))
    output = csv.writer(output, dialect=input.dialect)
    for line in input:
        if len(line) < 5:
            line.extend([''] * (5 - len(line)))
        output.writerow(line)
If you don't mind using awk, then it is easy:
$ cat data.txt
1,2,3,4,5
6,7,8
9,10
11,12,13,14,15
$ awk -F, 'BEGIN {OFS=","} {print $1,$2,$3,$4,$5}' data.txt
1,2,3,4,5
6,7,8,,
9,10,,,
11,12,13,14,15
with open('somefile.txt') as f:
    rows = []
    for line in f:
        rows.append(line.rstrip("\n").split(","))
max_cols = len(max(rows, key=len))
for row in rows:
    row.extend([''] * (max_cols - len(row)))
print "\n".join(",".join(r) for r in rows)
If you are sure that every line should be n items long (in this case 5) and you always know n before opening the file, it is more memory efficient to do something like this:
with open("f1", "r") as f1:
    with open("f2", "w") as f2:
        for line in f1:
            f2.write(line.rstrip("\n") + ("," * (4 - line.count(","))) + "\n")
def correct_file(fname):
    with open(fname) as f:
        data = [line[:-1] + (4 - line.count(',')) * ',' + '\n' for line in f]
    with open(fname, 'w') as f:
        f.writelines(data)
As noted in the comments, this reads the entire file into memory when you really don't need to. To do it not all in one go:
import shutil
def correct_file(fname):
    with open(fname, 'r') as fin, open('temp', 'w') as fout:
        for line in fin:
            new = line[:-1] + (4 - line.count(',')) * ',' + '\n'
            fout.write(new)
    shutil.move('temp', fname)
This will make any file named temp disappear in the current directory. Of course, you can always use the tempfile module to get around that ...
And for the slightly more verbose, but bullet-proof (?) version:
import shutil
import tempfile
import atexit
import os

def try_delete(fname):
    try:
        os.unlink(fname)
    except OSError:
        if os.path.exists(fname):
            print "Couldn't delete existing file", fname

def correct_file(fname):
    with open(fname, 'r') as fin, tempfile.NamedTemporaryFile('w', delete=False) as fout:
        atexit.register(lambda f=fout.name: try_delete(f))  # need a closure here ...
        for line in fin:
            new = line[:-1] + (4 - line.count(',')) * ',' + '\n'
            fout.write(new)
    shutil.move(fout.name, fname)  # this should get rid of the temporary file ...
This might work for you (GNU sed):
sed ':a;s/,/&/4;t;s/$/,/;ta' file
(The :a loop keeps appending a comma to the end of the line until the substitution s/,/&/4 finds a fourth comma and the t command ends the cycle.)

Skip first couple of lines while reading lines in Python file

I want to skip the first 17 lines while reading a text file.
Let's say the file looks like:
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
good stuff
I just want the good stuff. What I'm doing is a lot more complicated, but this is the part I'm having trouble with.
Use a slice, like below:
with open('yourfile.txt') as f:
    lines_after_17 = f.readlines()[17:]
If the file is too big to load in memory:
with open('yourfile.txt') as f:
    for _ in range(17):
        next(f)
    for line in f:
        # do stuff
Use itertools.islice, starting at index 17. It will automatically skip the first 17 lines.
import itertools

with open('file.txt') as f:
    for line in itertools.islice(f, 17, None):  # start=17, stop=None
        # process lines
for line in dropwhile(isBadLine, lines):
    # process as you see fit
Full demo:
from itertools import *

def isBadLine(line):
    return line.strip() == '0'

with open(...) as f:
    for line in dropwhile(isBadLine, f):
        # process as you see fit
Advantages: This is easily extensible to cases where your prefix lines are more complicated than "0" (but not interdependent).
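For example, a sketch with a slightly more involved predicate (a hypothetical header format: skip blank lines and lines starting with '#'):
from itertools import dropwhile

def is_header(line):
    stripped = line.strip()
    return stripped == '' or stripped.startswith('#')

with open('data.txt') as f:  # 'data.txt' is just an example name
    for line in dropwhile(is_header, f):
        print(line, end='')  # process as you see fit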
Here are the timeit results for the top 2 answers. Note that "file.txt" is a text file containing 100,000+ lines of random string with a file size of 1MB+.
Using itertools:
import itertools
from timeit import timeit

timeit("""with open("file.txt", "r") as fo:
    for line in itertools.islice(fo, 90000, None):
        line.strip()""", setup="import itertools", number=100)
>>> 1.604976346003241
Using two for loops:
from timeit import timeit
timeit("""with open("file.txt", "r") as fo:
for i in range(90000):
next(fo)
for j in fo:
j.strip()""", number=100)
>>> 2.427317383000627
Clearly, the itertools method is more efficient when dealing with large files.
If you don't want to read the whole file into memory at once, you can use a few tricks:
With next(iterator) you can advance to the next line:
with open("filename.txt") as f:
next(f)
next(f)
next(f)
for line in f:
print(f)
Of course, this is slightly ugly, so itertools has a better way of doing this:
from itertools import islice

with open("filename.txt") as f:
    # start at line 17 and never stop (None), until the end
    for line in islice(f, 17, None):
        print(line)
This approach helped me to skip the number of lines specified by the linetostart variable.
You get the index (int) and the line (string) if you want to keep track of those too.
In your case, assign 18 to linetostart.
linetostart = 18
f = open("file.txt", 'r')
for i, line in enumerate(f, start=1):
    if i < linetostart:
        continue  # skip everything before linetostart
    # Your code
If it's a table:
import pandas as pd
pd.read_table("path/to/file", sep="\t", index_col=0, skiprows=17)
You can use a list comprehension to make it a one-liner:
[fl.readline() for i in xrange(17)]
More about list comprehensions in PEP 202 and in the Python documentation.
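For large skips, the itertools "consume" recipe does the same thing without building a throw-away list, by feeding an islice into a zero-length deque; a sketch:
from collections import deque
from itertools import islice

with open('yourfile.txt') as fl:
    deque(islice(fl, 17), maxlen=0)  # advance the file iterator past the first 17 lines, storing nothing
    for line in fl:
        print(line, end='')  # only the good stuff from line 18 onward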
Here is a method to get lines between two line numbers in a file:
import sys

def file_line(name, start=1, end=sys.maxint):
    lc = 0
    with open(name) as f:
        for line in f:
            lc += 1
            if lc >= start and lc <= end:
                yield line

s = '/usr/share/dict/words'
l1 = list(file_line(s, 235880))
l2 = list(file_line(s, 1, 10))
print l1
print l2
Output:
['Zyrian\n', 'Zyryan\n', 'zythem\n', 'Zythia\n', 'zythum\n', 'Zyzomys\n', 'Zyzzogeton\n']
['A\n', 'a\n', 'aa\n', 'aal\n', 'aalii\n', 'aam\n', 'Aani\n', 'aardvark\n', 'aardwolf\n', 'Aaron\n']
Call it with just the start argument (as with l1 above) to get everything from line n to EOF.
