Python Function whose arguments specify the lines to retrieve from CSV

Hello there community!
I've been struggling with this function for a while and I cannot seem to make it work.
I need a function that reads a csv file and takes as arguments: the csv file, and a range of first to last line to be read. I've looked into many threads but none seem to work in my case.
My current function is based on the answer to this post: How to read specific lines of a large csv file
It looks like this:
def lines(file, first, last):
    lines_set = range(first, last, 1)
    result = []
    with open("./2020.csv", "r") as csvfile:
        for line_number, row in enumerate(csvfile):
            if line_number in lines_set:
                result.append(extract_data(csvfile.readline()))  # extract_data is a previously created function that fetches specific columns from the csv
    return result
What's happening is that it skips a row every time: instead of reading, for example, lines 1 to 4, it reads lines 1, 3, 5 and 7.
Additional question: the CSV file has a header line. How can I use next() so that the header is never included?
Thank you very much for all your help!

I recommend you use the csv reader; it will save you from getting incomplete rows, since a single row can span several lines.
That said, your basic problem is that you are manually advancing the reader (the csvfile.readline() call) inside the for loop, which is already iterating it automatically. Each readline() consumes a line on top of the one the loop just read, which is why every other row is skipped:
import csv

def lines(file, first, last):
    lines_set = range(first, last, 1)
    result = []
    with open("./2020.csv", "r") as csv_file:
        reader = csv.reader(csv_file)
        next(reader)  # only if the CSV has a header AND it should not be counted
        for line_number, row in enumerate(reader):
            if line_number in lines_set:
                result.append(extract_data(row))
    return result
Also, keep in mind that enumerate() starts counting at 0 by default; pass start=1 (or whatever start value you need) to count the way you expect. Then, range() has a non-inclusive end, so you might want last + 1.
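Putting both adjustments together, a minimal sketch (assuming first and last are meant to be 1-based and inclusive; the asker's extract_data is replaced by a plain append):

import csv

def lines(path, first, last):
    # first and last are 1-based and inclusive here
    lines_set = range(first, last + 1)  # range() excludes its end, hence last + 1
    result = []
    with open(path, newline="") as csv_file:
        reader = csv.reader(csv_file)
        next(reader)  # drop the header so data rows are numbered from 1
        for line_number, row in enumerate(reader, start=1):
            if line_number in lines_set:
                result.append(row)  # the asker would call extract_data(row) here
    return result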
Or... do away with range() entirely since < and > are sufficient, and maybe clearer:
import csv

start = 3
end = 6
with open('input.csv') as file:
    reader = csv.reader(file)
    next(reader)  # Skip header row
    for i, row in enumerate(reader, start=1):
        if i < start or i > end:
            continue
        print(row)
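Another option worth mentioning (not from the answer above, just a common idiom): itertools.islice can slice the reader directly, which avoids the per-row comparison. A sketch under the same assumptions:

import csv
from itertools import islice

start = 3
end = 6
with open('input.csv', newline='') as file:
    reader = csv.reader(file)
    next(reader)  # skip the header row
    # islice positions are 0-based with an exclusive stop,
    # so rows start..end (1-based) are islice(reader, start - 1, end)
    for row in islice(reader, start - 1, end):
        print(row)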

Related

CSV parsing in Python

I want to parse a csv file which is in the following format:
Test Environment INFO for 1 line.
Test,TestName1,
TestAttribute1-1,TestAttribute1-2,TestAttribute1-3
TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3
Test,TestName2,
TestAttribute2-1,TestAttribute2-2,TestAttribute2-3
TestAttributeValue2-1,TestAttributeValue2-2,TestAttributeValue2-3
Test,TestName3,
TestAttribute3-1,TestAttribute3-2,TestAttribute3-3
TestAttributeValue3-1,TestAttributeValue3-2,TestAttributeValue3-3
Test,TestName4,
TestAttribute4-1,TestAttribute4-2,TestAttribute4-3
TestAttributeValue4-1-1,TestAttributeValue4-1-2,TestAttributeValue4-1-3
TestAttributeValue4-2-1,TestAttributeValue4-2-2,TestAttributeValue4-2-3
TestAttributeValue4-3-1,TestAttributeValue4-3-2,TestAttributeValue4-3-3
and would like to turn it into a tab-separated format like the following:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
The number of TestAttributes varies from test to test: some tests have only 3 values, others 7, and so on. Also, as in the TestName4 example, some tests are executed more than once, so each execution has its own TestAttributeValue line (in the example, TestName4 is executed 3 times, hence the 3 value lines).
I am new to Python and do not have much experience, but I would like to parse this CSV file with it. I checked Python's csv library and could not tell whether it will be enough or whether I should write my own string parser. Could you please help me?
Best
I'd go with a solution based on the itertools.groupby function and the csv module. Have a close look at the itertools documentation; you can use it more often than you think!
I've used blank lines to differentiate the datasets, and this approach uses lazy evaluation, storing only one dataset in memory at a time:
import csv
from itertools import groupby

with open('my_data.csv', newline='') as ifile, open('my_out_data.csv', 'w', newline='') as ofile:
    # Use the csv module to handle reading and writing of delimited files.
    reader = csv.reader(ifile)
    writer = csv.writer(ofile, delimiter='\t')
    # Skip the info line
    next(reader)
    # Group rows by whether they are empty (len(row) > 0),
    # then keep only the non-empty groups
    for group in (v for k, v in groupby(reader, lambda x: bool(len(x))) if k):
        test_data = list(group)
        # Write header
        writer.writerow([test_data[0][1]])
        # Write transposed data
        writer.writerows(zip(*test_data[1:]))
        # Write blank line
        writer.writerow([])
Output, given that the supplied data is stored in my_data.csv:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
The following does what you want, and only reads up to one section at a time (saves memory for a large file). Replace in_path and out_path with the input and output file paths respectively:
import csv

def print_section(section, f_out):
    if len(section) > 0:
        # find the length of the longest row in the section
        max_len = max(len(col) for col in section)
        # build and print each output row (transposing the section)
        for i in range(max_len):
            f_out.write('\t'.join(col[i] if len(col) > i else '' for col in section) + '\n')
        f_out.write('\n')

with open(in_path, 'r', newline='') as in_file, open(out_path, 'w') as f_out:
    # csv.reader is not a context manager, so open the file first
    f_in = csv.reader(in_file)
    next(f_in)  # skip the info line
    section = []
    for line in f_in:
        # test for a new "Test" section
        if len(line) == 3 and line[0] == 'Test' and line[2] == '':
            # write the previous section's data
            print_section(section, f_out)
            # reset section
            section = []
            # write the new section header
            f_out.write(line[1] + '\n')
        else:
            # add line to section
            section.append(line)
    # print the last section
    print_section(section, f_out)
Note that you'll want to change 'Test' in the line[0] == 'Test' test to whatever word actually marks a section header line.
The basic idea here is that we read each section into a list of lists, then write it back out using a list comprehension to transpose it (filling in blank elements when the columns are uneven).
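For reference, that pad-and-transpose step can also be spelled with itertools.zip_longest; a quick sketch on toy data:

from itertools import zip_longest

# one section: an attribute row plus a value row of uneven length
section = [['attr1', 'attr2', 'attr3'],
           ['val1', 'val2']]
for out_row in zip_longest(*section, fillvalue=''):
    print('\t'.join(out_row))
# attr1   val1
# attr2   val2
# attr3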

python array indexing through function

I'm writing a function that will read a file given the number of header lines to skip and the number of footer lines to skip.
def LoadText(file, HeaderLinesToSkip, FooterLinesToSkip):
    fin = open(file)
    text = []
    for line in fin.readlines()[HeaderLinesToSkip:-FooterLinesToSkip]:
        text.append(line.strip())
    return text
My problem is that this function works properly only if FooterLinesToSkip is at least 1. If FooterLinesToSkip = 0, the function returns []. I can solve this with an if statement, but is there a simpler form?
Edit: I actually simplified my problem; the lines read from the file contain columns separated by a semicolon. The real function includes .split(delimiter_character) and should store only column 1.
def LoadText(file, HeaderLinesToSkip, FooterLinesToSkip):
    fin = open(file)
    text = []
    for line in fin.readlines()[HeaderLinesToSkip:-FooterLinesToSkip]:
        text.append(line.strip().split(';')[1])
    return text
Set FooterLinesToSkip to None instead, so the slice defaults to the list length:
def LoadText(file, HeaderLinesToSkip, FooterLinesToSkip):
    with open(file) as fin:
        FooterLinesToSkip = -FooterLinesToSkip if FooterLinesToSkip else None
        text = []
        for line in fin.readlines()[HeaderLinesToSkip:FooterLinesToSkip]:
            text.append(line.strip().split(';')[1])
        return text
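The None trick works because a slice end of None means "up to the end of the list", while -0 is just 0 and produces an empty slice; a quick illustration:

lines = ['header', 'a', 'b', 'c', 'footer']
print(lines[1:-1])    # ['a', 'b', 'c']   skip 1 header and 1 footer line
print(lines[1:None])  # ['a', 'b', 'c', 'footer']   same as lines[1:]
print(lines[1:-0])    # []   -0 == 0, which is why FooterLinesToSkip = 0 failed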
Let me offer you an improvement, which does not require you to read the whole file into memory:
from collections import deque
from itertools import islice

def skip_headers_and_footers(fh, header_skip, footer_skip):
    # skip header_skip lines, then prime a bounded buffer with the next
    # footer_skip lines (the second deque argument is maxlen)
    buffer = deque(islice(fh, header_skip, header_skip + footer_skip), footer_skip)
    for line in fh:
        yield buffer.popleft()
        buffer.append(line)
This reads lines one by one, after skipping header_skip lines and keeping footer_skip lines in a buffer. By the time we have looped over all lines in the file, footer_skip lines remain in the buffer and are never yielded.
This is a generator function, so it'll yield lines in a loop:
with open(filename) as open_file:
    for line in skip_headers_and_footers(open_file, 2, 2):
        # do something with this line.
        line = line.strip()
I moved the file opening out of the function so that it can be used for other iterables too, not just files.
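A quick way to convince yourself of that, using a plain iterator over a list (toy values) instead of a file:

data = iter(['head1', 'head2', 'a', 'b', 'c', 'foot1', 'foot2'])
# pass an iterator, not the list itself, so islice() inside the
# generator actually advances it past the header lines
print(list(skip_headers_and_footers(data, 2, 2)))
# ['a', 'b', 'c']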
Now you can use the csv module to handle the column splitting and stripping:
import csv

with open(filename, newline='') as open_file:
    reader = csv.reader(open_file, delimiter=';')
    for row in skip_headers_and_footers(reader, 2, 2):
        column = row[1]
and the skip_headers_and_footers() generator has skipped the first two rows for you and will never yield the last two rows either.

StopIteration Error in pythoncode while reading csv data

I am writing a program to read a CSV file. I have created a reader object, and calling next() on it gives me the header row. But when I call it again it gives a StopIteration error, although there are rows left in the csv file. When I do file.seek(0) first, it works fine. Can anyone explain this to me? A snapshot of the code is given below:
with open(file, 'r') as f:
    reader = csv.reader(f)
    header = next(reader)
    result = []
    for colname in header[2:]:
        col_index = header.index(colname)
        # f.seek(0)
        next(reader)
You're calling next once for each column (except the first two). So, if you have, say, 10 columns, it's going to try to read 8 rows.
If you have 20 rows, that's not going to raise an exception, but you'll be ignoring the last 12 rows, which you probably don't want. On the other hand, if you have only 5 rows, it's going to raise when trying to read the 6th row.
The reason the f.seek(0) prevents the exception is that it resets the file back to the start before each next, so you just read the header row over and over, ignoring everything else in the file. It doesn't raise anything, but it's not doing anything useful.
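As an aside, next() accepts a default as its second argument, which lets you probe an iterator without risking StopIteration; a small illustration:

import csv
import io

reader = csv.reader(io.StringIO("a,b\n1,2\n"))
print(next(reader))        # ['a', 'b']
print(next(reader))        # ['1', '2']
print(next(reader, None))  # None: exhausted, but no StopIteration raised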
What you probably wanted is something like this:
with open(file, 'r') as f:
    reader = csv.reader(f)
    header = next(reader)
    result = []
    for row in reader:
        for col_index, colname in enumerate(header[2:], start=2):
            value = row[col_index]
            result.append(do_something_with(value, colname))
This reads every row exactly once, and does something with each column but the first two of each row.
From a comment, what you actually want to do is find the maximum value for each column. So you do need to iterate over the columns, and then, within each column, you need to iterate over the rows.
A csv.reader is an iterator, which means you can only iterate over it once. So, if you just do this the obvious way, it won't work:
import operator

maxes = {}
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    for col_index, colname in enumerate(header[2:], start=2):
        maxes[colname] = max(reader, key=operator.itemgetter(col_index))
The first column will read whatever's left after reading the header, which is good. The next column will read whatever's left after reading the whole file, which is nothing.
So, how can you fix this?
One way is to re-create the iterator each time through the outer loop:
maxes = {}
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    for col_index, colname in enumerate(header[2:], start=2):
        with open(file) as f:
            reader = csv.reader(f)
            next(reader)
            maxes[colname] = max(reader, key=lambda row: float(row[col_index]))
The problem with this is that you're reading the whole file N times, and reading the file off disk is probably by far the slowest thing your program does.
What you were attempting to do with f.seek(0) is a trick that depends on how file objects and csv.reader objects work. While file objects are iterators, they're special, in that they have a way to reset them to the beginning (or to save a position and return to it later). And csv.reader objects are basically simple wrappers around file objects, so if you reset the file, you also reset the reader. (It's not clear that this is guaranteed to work, but if you know how csv works, you can probably convince yourself that in practice it's safe.) So:
maxes = {}
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    for col_index, colname in enumerate(header[2:], start=2):
        f.seek(0)
        next(reader)
        maxes[colname] = max(reader, key=lambda row: float(row[col_index]))
This saves you the cost of closing and opening the file each time, but that's not the expensive part; you're still doing the disk reads over and over. And now anyone reading your code has to understand the trick with using file objects as iterators but resetting them, or they won't know how your code works.
So, how can you avoid that?
In general, whenever you need to make multiple passes over an iterator, there are two options. The simple solution is to copy the iterator into a reusable iterable, like a list:
maxes = {}
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)
    for col_index, colname in enumerate(header[2:], start=2):
        maxes[colname] = max(rows, key=lambda row: float(row[col_index]))
This is not only much simpler than the earlier code, it's also much faster. Unless the file is huge. By storing all of the rows in a list, you're reading the whole file into memory at once. If it's too big to fit, your program will fail. Or, worse, if it fits, but only by using virtual memory, your program will swap parts of it in and out of memory every time you go through the loop, thrashing your swapfile and making everything slow to a crawl.
The other alternative is to reorganize things so you only have to make one pass. This means you have to put the loop over the rows on the outside, and the loop over the columns on the inside. It requires rethinking the design a bit, and it means you can't just use the simple max function, but the tradeoff is probably worth it:
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    maxes = {colname: float('-inf') for colname in header[2:]}
    for row in reader:
        for col_index, colname in enumerate(header[2:], start=2):
            maxes[colname] = max(maxes[colname], float(row[col_index]))
You can simplify this even further—e.g., use a Counter instead of a plain dict, and a DictReader instead of a plain reader—but it's already simple, readable, and efficient as-is.
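For what it's worth, a sketch of the DictReader variant (here keeping a plain dict with a get() default rather than a Counter, and assuming every column after the first two holds a number):

import csv

maxes = {}
with open(file) as f:
    reader = csv.DictReader(f)  # each row becomes a {colname: value} dict
    value_cols = reader.fieldnames[2:]
    for row in reader:
        for colname in value_cols:
            value = float(row[colname])
            maxes[colname] = max(maxes.get(colname, float('-inf')), value)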
In my case, I had accidentally removed the data from the .csv file (it was empty), so I got this message.
Make sure your .csv file actually contains data.
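A defensive version of the header read, as a sketch (the filename is just a placeholder):

import csv

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader, None)  # None instead of StopIteration on an empty file
    if header is None:
        raise ValueError("data.csv is empty")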
Why didn't you write header = next(reader) in the last line as well? I don't know if this is your problem, but I would start there.

Consolidate several lines of a CSV file with firewall rules, in order to parse them easier?

I have a CSV file, which I created using an HTML export from a Check Point firewall policy.
Each rule is represented as several lines, in some cases. That occurs when a rule has several address sources, destinations or services.
I need the output to have each rule described in only one line.
It's easy to distinguish when each rule begins. In the first column, there's the rule ID, which is a number.
Here's an example; the strings that should be moved are marked in green in this screenshot: http://i.imgur.com/i785sDi.jpg
And here is the same data as text:
NO.;NAME;SOURCE;DESTINATION;SERVICE;ACTION;
1;;fwgcluster;mcast_vrrp;vrrp;accept;
;;;;igmp;;
2;Testing;fwgcluster;fwgcluster;FireWall;accept;
;;fwmgmpe;fwmgmpe;ssh;;
;;fwmgm;fwmgm;;;
What I need, explained in pseudocode, is this:
Read the first column of the next line. If there's a number:
    Evaluate the first column of the line after it. If there's no number there,
    concatenate (separating with a comma) the strings in the columns of this line
    with the last one, and eliminate the text in the current one.
The output should be something like this:
NO.;NAME;SOURCE;DESTINATION;SERVICE;ACTION;
1;;fwgcluster;mcast_vrrp;vrrp-igmp;accept;
;;;;;;
2;Testing;fwgcluster-fwmgmpe-fwmgm;fwgcluster-fwmgmpe-fwmgm;FireWall-ssh;accept;
;;;;;;
The empty lines are there only for clarity; I don't actually need them.
Thanks!
This should get you started:
import csv

with open('data.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=';')
    for r in reader:
        print(r)
EDIT: Given your required output, this should get you nearly there. It's a bit crude, but it does the majority of what you need. It checks the 'NO.' key: if it has a value, it starts a new record; if not, it joins the row's data onto the equivalent fields of the current record. When a new record starts, the old one is appended to the result; the same happens at the end to catch the last item.
import csv

result, record = [], None
with open('data2.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=';', lineterminator='\n')
    for r in reader:
        if r['NO.']:
            if record:
                result.append(record)
            record = r
        else:
            for key in r.keys():
                if r[key]:
                    record[key] = '-'.join([record[key], r[key]])
    if record:
        result.append(record)
print(result)
Graeme, thanks again, just before your edit I solved it with the following code.
But you got me looking in the right direction!
If anyone needs it, here it is:
import csv

# adjust these 3 lines
WRITE_EMPTIES = False
INFILE = "input.csv"
OUTFILE = "output.csv"

with open(INFILE, "r") as in_file:
    r = csv.reader(in_file, delimiter=";")
    with open(OUTFILE, "w", newline="") as out_file:
        previous = None
        empties_to_write = 0
        out_writer = csv.writer(out_file, delimiter=";")
        for i, row in enumerate(r):
            first_val = row[0].strip()
            if first_val:
                if previous:
                    out_writer.writerow(previous)
                    if WRITE_EMPTIES and empties_to_write:
                        out_writer.writerows(
                            [["" for _ in previous]] * empties_to_write
                        )
                        empties_to_write = 0
                previous = row
            else:  # append sub-portions to each other
                previous = [
                    "|".join(
                        subitem
                        for subitem in existing.split(",") + [new]
                        if subitem
                    )
                    for existing, new in zip(previous, row)
                ]
                empties_to_write += 1
        if previous:  # take care of the last row
            out_writer.writerow(previous)
            if WRITE_EMPTIES and empties_to_write:
                out_writer.writerows(
                    [["" for _ in previous]] * empties_to_write
                )

Have csv.reader tell when it is on the last line

Apparently some csv output implementation somewhere truncates field separators from the right on the last row (and only the last row in the file) when the fields are null.
Example input csv, fields 'c' and 'd' are nullable:
a|b|c|d
1|2||
1|2|3|4
3|4||
2|3
In something like the script below, how can I tell whether I am on the last line so I know how to handle it appropriately?
import csv

reader = csv.reader(open('somefile.csv'), delimiter='|', quotechar=None)
header = next(reader)
for line_num, row in enumerate(reader):
    assert len(row) == len(header)
    ....
Basically you only know you've run out after you've run out. So you could wrap the reader iterator, e.g. as follows:
def isLast(itr):
    old = next(itr)
    for new in itr:
        yield False, old
        old = new
    yield True, old
and change your code to:
for line_num, (is_last, row) in enumerate(isLast(reader)):
    if not is_last:
        assert len(row) == len(header)
etc.
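A quick sanity check of the wrapper on a plain iterator (toy values):

print(list(isLast(iter([10, 20, 30]))))
# [(False, 10), (False, 20), (True, 30)]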
I am aware it is an old question, but I came up with a different answer from the ones presented. The reader object already increments its line_num attribute as you iterate through it. I first get the total number of lines with row_count, then compare it with line_num.
import csv

def row_count(filename):
    with open(filename) as in_file:
        return sum(1 for _ in in_file)

in_filename = 'somefile.csv'
reader = csv.reader(open(in_filename), delimiter='|')
last_line_number = row_count(in_filename)
for row in reader:
    if last_line_number == reader.line_num:
        print("It is the last line: %s" % row)
If you have an expectation of a fixed number of columns in each row, then you should be defensive against:
(1) ANY row being shorter -- e.g. a writer (SQL Server / Query Analyzer IIRC) may omit trailing NULLs at random; users may fiddle with the file using a text editor, including leaving blank lines.
(2) ANY row being longer -- e.g. commas not quoted properly.
You don't need any fancy tricks. Just an old-fashioned if-test in your row-reading loop:
for row in csv.reader(...):
    ncols = len(row)
    if ncols != expected_cols:
        appropriate_action()
If you want to get exactly the last row, try this code:
with open("\\".join([myPath, files]), 'r') as f:
    print(f.readlines()[-1])  # or your own manipulations
If you want to continue working with values from that row, do the following:
f.readlines()[-1].split(",")[0]  # this lets you get columns by their index
Just extend the row to the length of the header:
for line_num, row in enumerate(reader):
    while len(row) < len(header):
        row.append('')
    ...
Could you not just catch the error when the csv reader reads the last line, with something like:
try:
    ...  # do your stuff here
except StopIteration:
    pass
See the following Python code on Stack Overflow for an example of how to use try/except: Python CSV DictReader/Writer issues
If you use for row in reader:, it will just stop the loop after the last item has been read.
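Right; the for statement calls next() for you and treats StopIteration as the end of the loop rather than an error. A minimal illustration:

import csv
import io

reader = csv.reader(io.StringIO("a|b\n1|2\n"), delimiter='|')
for row in reader:  # the loop absorbs the final StopIteration
    print(row)
print("done")  # reached normally, no exception after the last row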
