How to read a CSV in Python with "#" as the line separator

I need to read a CSV in Python, and the text file that I have has this structure:
"114555","CM13","0004","0","C/U"#"99172","CM13","0001","0","C/U"#"178672","CM13","0001","0","C/U"
delimiter: ,
newline: #
My code so far:
import csv
data = []
with open('stock.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', lineterminator='#')
    for row in reader:
        data.append({'MATERIAL': row[0], 'CENTRO': row[1], 'ALMACEN': row[2], 'STOCK_VALORIZADO': row[3], 'STOCK_UMB': row[4]})
print(data)  # this prints just one row
This code prints only one row, because it does not recognize # as a newline,
and it prints the # together with the quotes:
[{'MATERIAL': '114555', 'CENTRO': 'CM13', 'ALMACEN': '0004', 'STOCK_VALORIZADO': '0', 'STOCK_UMB': 'C/U#"99172"'}]

According to https://docs.python.org/2/library/csv.html :
"The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future." Hence for now, providing the argument lineterminator='#' will not work.
I think the best option is to read the entire file into a variable and replace every '#' character with a newline. You can do this as follows:
with open("stock.csv", "r") as myfile:
    data = myfile.read().replace('#', '\n')
Now adjust your code so that you pass the variable data to csv.reader (instead of the file stock.csv). According to the Python docs:
"The "iterable" argument can be any object that returns a line of input for each iteration, such as a file object or a list. [...]"
Hence you can pass data.splitlines() to csv.reader.
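Putting the two steps together, a minimal sketch might look like the following. The helper name read_hash_separated is made up for illustration, and the sample string is the one from the question. The sketch assumes '#' never appears inside a field value, because the blanket replace would split such a field.

```python
import csv

def read_hash_separated(text):
    """Parse CSV text whose records are separated by '#' instead of newlines."""
    # Turn every '#' into a real newline, then hand csv.reader a list of lines.
    lines = text.replace('#', '\n').splitlines()
    fields = ['MATERIAL', 'CENTRO', 'ALMACEN', 'STOCK_VALORIZADO', 'STOCK_UMB']
    return [dict(zip(fields, row)) for row in csv.reader(lines, delimiter=',')]

sample = '"114555","CM13","0004","0","C/U"#"99172","CM13","0001","0","C/U"#"178672","CM13","0001","0","C/U"'
data = read_hash_separated(sample)
print(len(data))
print(data[0])
```

For a real file, the text would come from myfile.read() as in the answer above.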

I was struggling with CRLF ('\r\n') line endings using csv.reader. I was able to get it working using the newline parameter in open:
with open(local_file, 'r', newline='\r\n') as f:
    reader = csv.reader(f)
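As a related sketch: the csv module documentation recommends opening files with newline='', which passes line endings through untranslated and lets csv.reader handle them itself. The temporary file name and sample data below are made up for illustration.

```python
import csv
import os
import tempfile

# Write a small file with explicit CRLF line endings (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), 'crlf_sample.csv')
with open(path, 'w', newline='') as f:
    f.write('a,b\r\n1,2\r\n')

# newline='' disables newline translation; csv.reader splits the rows itself.
with open(path, 'r', newline='') as f:
    rows = list(csv.reader(f))

print(rows)
```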

Related

Replace newlines in the middle of .csv columns

I have a CSV file which can look like this:
Col1,"Col2","Col31
Col32
Col33"
That means if I use this code:
with open(inpPath, "r+") as csvFile:
    wb_obj = list(csv.reader(csvFile, delimiter=','))
    for row in wb_obj:
        print(row)
The output looks like this:
[Col1,"Col2","Col31\nCol32\nCol33"]
So I am trying to replace the \n characters with spaces so the CSV file would be rewritten like this: Col1,"Col2","Col31 Col32 Col33"
I have written this short function but it results in Error: Cannot open mapping csv, Exception thrown I/O operation on closed file.
def processCSV(fileName):
    with open(fileName, "rU") as csvFile:
        filtered = (line.replace('\n', ' ') for line in csvFile)
        wb_obj = csv.reader(filtered, delimiter=",")
    return wb_obj
How could I fix that? Thank you very much for any help
Your processCSV function returns an iterable based on the file object csvFile, but by the time the caller consumes that iterable, the context manager has already closed csvFile because execution has left the with statement; hence the error. The common pattern is to make such a function accept a file object as a parameter and let the caller open the file, so that the file stays open while the caller iterates.
You also should not replace all newlines with spaces to begin with, since you really only want to replace newlines if they are within double quotes, which would be parsed by csv.reader as part of a column value rather than row separators. Instead, you should let csv.reader do its parsing first, and then replace newline characters with spaces in all column values:
def processCSV(file):
    return ([col.replace('\n', ' ') for col in row] for row in csv.reader(file))
# the caller
with open(filename) as file:
    for row in processCSV(file):
        print(*row, sep=',')
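Since the question asks for the CSV to be rewritten, here is a small sketch of the same idea that writes the cleaned rows back out with csv.writer. The helper name flatten_newlines is invented for this example, and io.StringIO objects stand in for the real input and output files.

```python
import csv
import io

def flatten_newlines(rows):
    """Yield each row with embedded newlines in its fields replaced by spaces."""
    for row in rows:
        yield [col.replace('\n', ' ') for col in row]

# In-memory stand-ins for the input and output files (hypothetical data).
source = io.StringIO('Col1,"Col2","Col31\nCol32\nCol33"\n', newline='')
target = io.StringIO(newline='')

# Parse first, clean each field, then let csv.writer handle quoting on output.
csv.writer(target, lineterminator='\n').writerows(flatten_newlines(csv.reader(source)))
print(target.getvalue())
```

With real files, both would be opened with newline='' and kept open until the writing finishes.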
This comes down to using a generator expression inside of the file context. See a longer explanation here: https://stackoverflow.com/a/39656712/15981783.
You can change the generator expression to a list comprehension and it will work:
import csv
def processCSV(fileName):
    with open(fileName, "r") as csvFile:
        # list comprehension used here instead
        filtered = [line.replace('\n', ' ') for line in csvFile]
    wb_obj = csv.reader(filtered, delimiter=",")
    return wb_obj
print(list(processCSV("tmp.csv")))
Returns:
[['Col1', 'Col2', 'Col31 Col32 Col33']]

How to read csv data, strip spaces/tabs and write to new csv file?

I have a large (1.6million rows+) .csv file that has some data with leading spaces, tabs, and trailing spaces and maybe even trailing tabs. I need to read the data in, strip all of that whitespace, and then spit the rows back out into a new .csv file preferably with the most efficient code possible and using only built-in modules in python 3.7
Here is what I have that is currently working, except it only spits out the header over and over and over and doesn't seem to take care of trailing tabs (not a huge deal though on trailing tabs):
def new_stripper(self, input_filename: str, output_filename: str):
    """
    new_stripper(self, filename: str):
    :param self: no idea what this does
    :param filename: name of file to be stripped, must have .csv at end of file
    :return: for now, it doesn't return anything...
    -still doesn't remove trailing tabs?? But it can remove trailing spaces
    -removes leading tabs and spaces
    -still needs to write to new .csv file
    """
    import csv
    csv.register_dialect('strip', skipinitialspace=True)
    reader = csv.DictReader(open(input_filename), dialect='strip')
    reader = (dict((k, v.strip()) for k, v in row.items() if v) for row in reader)
    for row in reader:
        with open(output_filename, 'w', newline='') as out_file:
            writer = csv.writer(out_file, delimiter=',')
            writer.writerow(row)
input_filename = 'testFile.csv'
output_filename = 'output_testFile.csv'
new_stripper(self='', input_filename=input_filename, output_filename=output_filename)
As written above, the code just prints the headers over and over in a single line. I've played around with the arrangement and indenting of the last four lines of the def with some different results, but the closest I've gotten is getting it to print the header row again and again on new lines each time:
...
# headers and headers for days
with open(output_filename, 'w', newline='') as out_file:
    writer = csv.writer(out_file, delimiter=',')
    for row in reader:
        writer.writerow(row)
EDIT1: Here's the result from the non-stripping correctly thing. Some of them have leading spaces that weren't stripped, some have trailing spaces that weren't stripped. It seems like the left-most column was properly stripped of leading spaces, but not trailing spaces; same with header row.
Update: Here's the solution I was looking for:
def get_data(self, input_filename: str, output_filename: str):
    import csv
    with open(input_filename, 'r', newline='') as in_file, open(output_filename, 'w', newline='') as out_file:
        r = csv.reader(in_file, delimiter=',')
        w = csv.writer(out_file, delimiter=',')
        for line in r:
            trim = (field.strip() for field in line)
            w.writerow(trim)
input_filename = 'testFile.csv'
output_filename = 'output_testFile.csv'
get_data(self='', input_filename=input_filename, output_filename=output_filename)
Don't make life complicated for yourself, "CSV" files are simple plain text files, and can be handled in a generic way:
with open('input.csv', 'r') as inf, open('output.csv', 'w') as of:
    for line in inf:
        trim = (field.strip() for field in line.split(','))
        of.write(','.join(trim)+'\n')
Alternatively, using the csv module:
import csv
with open('input.csv', 'r', newline='') as inf, open('output.csv', 'w', newline='') as of:
    r = csv.reader(inf, delimiter=',')
    w = csv.writer(of, delimiter=',')
    for line in r:
        trim = (field.strip() for field in line)
        w.writerow(trim)
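One caveat about the plain split(',') approach is worth illustrating: it treats every comma as a field boundary, so a quoted field that contains a comma gets cut in two, whereas csv.reader keeps it whole. The sample line below is invented for this sketch.

```python
import csv
import io

line = ' 1 ,"Smith, John", 42 \n'

# Naive splitting cuts the quoted field in two at its embedded comma.
naive = [field.strip() for field in line.split(',')]

# csv.reader respects the quotes and keeps the field whole.
parsed = [field.strip() for field in next(csv.reader(io.StringIO(line)))]

print(naive)
print(parsed)
```

If the data is known to contain no quoted fields, the simpler split-based version should behave the same.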
Unfortunately I cannot comment, but I believe you might want to strip every entry in csv of the white space (not just the line). If that is the case, then, based on Jan's answer, this might do the trick:
with open('file.csv', 'r') as inf, open('output.csv', 'w') as of:
    for line in inf:
        of.write(','.join(list(map(str.strip, line.split(',')))) + '\n')
What it does is it splits each line by comma resulting in a list of values, then strips every element from whitespace to later join them back up and save to output file.
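Your final reader variable is a generator of dicts, but csv.writer.writerow expects a sequence of field values. Given a dict, it iterates over the keys, which is why only the header names are written.
You can either use csv.DictWriter (and write the header row with writer.writeheader()), or collect the processed values (v) in a list first and write that list with csv.writer.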
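A minimal sketch of the csv.DictWriter route, assuming the goal is to strip both the header names and the values. The column names and sample data are made up, and io.StringIO objects stand in for the real files.

```python
import csv
import io

# In-memory stand-ins for the input and output files (hypothetical data).
in_file = io.StringIO('name , city\n  Ann\t, Oslo  \n Bob ,\tRome\n', newline='')
out_file = io.StringIO(newline='')

reader = csv.DictReader(in_file)
# Strip the header names as well, so keys and output columns are clean.
fieldnames = [name.strip() for name in reader.fieldnames]
writer = csv.DictWriter(out_file, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()
for row in reader:
    # str.strip() removes leading and trailing spaces and tabs alike.
    writer.writerow({key.strip(): value.strip() for key, value in row.items()})

print(out_file.getvalue())
```

Note that the output file is opened once, before the loop, so earlier rows are not overwritten.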

how can I use csv tools for zip text file?

Update: my file.txt.zp is tab-delimited and looks roughly like this:
[image: file.txt.zp]
I want to split the first col by : _ /
original post:
I have a very large zipped tab delimited file.
I want to open it, scan it one row at a time, split some of the col, and write it to a new file.
I got various errors (every time I fix one, another pops up).
This is my code:
import csv
import re
import gzip
f = gzip.open('file.txt.gz')
original = f.readlines()
f.close()
original_l = csv.reader(original)
for row in original_l:
    file_l = re.split('_|:|/',row)
    with open ('newfile.gz', 'w', newline='') as final:
        finalfile = csv.writer(final,delimiter = ' ')
        finalfile.writerow(file_l)
Thanks!
For this code I got the error:
    for row in original_l:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
So, based on what I found here, I added this after f.close():
original = original.decode('utf8')
and then got the error:
    original = original.decode('utf8')
AttributeError: 'list' object has no attribute 'decode'
Update 2
This code should produce the output that you're after.
import csv
import gzip
import re
with gzip.open('file.txt.gz', mode='rt') as f, \
        open('newfile.gz', 'w', newline='') as final:
    writer = csv.writer(final, delimiter=' ')
    reader = csv.reader(f, delimiter='\t')
    _ = next(reader)  # skip header row
    for row in reader:
        writer.writerow(re.split(r'_|:|/', row[0]))
Update
Open the gzip file in text mode because str objects are required by the CSV module in Python 3.
f = gzip.open('file.txt.gz', 'rt')
Also specify the delimiter when creating the csv.reader.
original_l = csv.reader(original, delimiter='\t')
This will get you past the first hurdle.
Now you need to explain what the data is, which columns you wish to extract, and what the output should look like.
Original answer follows...
One obvious problem is that the output file is constantly being overwritten by the next row of input. This is because the output file is opened in (over)write mode ('w') once per row.
It would be better to open the output file once outside of the loop.
Also, the CSV file delimiter is not specified when creating the reader. You said that the file is tab delimited so specify that:
original_l = csv.reader(original, delimiter='\t')
On the other hand, your code attempts to split each row using other delimiters, however, the rows coming from the csv.reader are represented as a list, not a string as the re.split() code would require.
Another problem is that the output file is not zipped as the name suggests.
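To address that last point, a sketch along these lines writes the output through gzip.open in text mode so the result really is compressed. The file names, the column layout and the sample values (such as chr1:100_A/T) are assumptions made for illustration, since the real data was not shown.

```python
import csv
import gzip
import os
import re
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'file.txt.gz')      # hypothetical input name
dst = os.path.join(workdir, 'newfile.txt.gz')   # hypothetical output name

# Build a tiny tab-delimited, gzip-compressed sample with a header row.
with gzip.open(src, 'wt', newline='') as f:
    f.write('id\tvalue\nchr1:100_A/T\t5\nchr2:200_G/C\t7\n')

# Read and write in text mode so the csv module sees str, not bytes.
with gzip.open(src, 'rt', newline='') as fin, gzip.open(dst, 'wt', newline='') as fout:
    reader = csv.reader(fin, delimiter='\t')
    writer = csv.writer(fout, delimiter=' ', lineterminator='\n')
    next(reader)  # skip the header row
    for row in reader:
        # Split the first column on '_', ':' or '/' and keep the other columns.
        writer.writerow(re.split(r'[_:/]', row[0]) + row[1:])

with gzip.open(dst, 'rt') as f:
    result = f.read()
print(result)
```

Both files are processed one row at a time, which matters for a very large input.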

Read CSV with comma as linebreak

I have a file saved as .csv
"400":0.1,"401":0.2,"402":0.3
Ultimately I want to save the data in a proper format in a csv file for further processing. The problem is that there are no line breaks in the file.
pathname = r"C:\pathtofile\file.csv"
with open(pathname, newline='') as file:
    reader = file.read().replace(',', '\n')
    print(reader)
with open(r"C:\pathtofile\filenew.csv", 'w') as new_file:
    csv_writer = csv.writer(new_file)
    csv_writer.writerow(reader)
The print reader output looks exactly how I want (or at least it's a format I can further process).
"400":0.1
"401":0.2
"402":0.3
And now I want to save that to a new csv file. However the output looks like
"""",4,0,0,"""",:,0,.,1,"
","""",4,0,1,"""",:,0,.,2,"
","""",4,0,2,"""",:,0,.,3
I'm sure it would be intelligent to convert the format to
400,0.1
401,0.2
402,0.3
at this stage instead of doing later with another script.
The main problem is that my current code
with open(pathname, newline='') as file:
    reader = file.read().replace(',', '\n')
    reader = csv.reader(reader,delimiter=':')
    x = []
    y = []
    print(reader)
    for row in reader:
        x.append( float(row[0]) )
        y.append( float(row[1]) )
print(x)
print(y)
works fine for the type of csv files I currently have, but doesn't work for these mentioned above:
y.append( float(row[1]) )
IndexError: list index out of range
So I'm trying to find a way to work with them too. I think I'm missing something obvious as I imagine that it can't be too hard to properly define the linebreak character and delimiter of a file.
with open(pathname, newline=',') as file:
yields
ValueError: illegal newline value: ,
The right way with csv module, without replacing and casting to float:
import csv
with open('file.csv', 'r') as f, open('filenew.csv', 'w', newline='') as out:
    reader = csv.reader(f)
    writer = csv.writer(out, quotechar=None)
    for r in reader:
        for i in r:
            writer.writerow(i.split(':'))
The resulting filenew.csv contents (according to your "intelligent" condition):
400,0.1
401,0.2
402,0.3
Nuances:
csv.reader and csv.writer objects treat comma , as default delimiter (no need to file.read().replace(',', '\n'))
quotechar=None is specified for csv.writer object to eliminate double quotes around the values being saved
You need to split the values to form a list to represent a row. Presently the code is splitting the string into individual characters to represent the row.
import csv
pathname = r"C:\pathtofile\file.csv"
with open(pathname) as old_file:
    with open(r"C:\pathtofile\filenew.csv", 'w', newline='') as new_file:
        csv_writer = csv.writer(new_file, delimiter=',')
        text_rows = old_file.read().split(",")
        for row in text_rows:
            items = row.split(":")
            # strip the surrounding double quotes before converting the key to int
            csv_writer.writerow([int(items[0].strip().strip('"')), items[1].strip()])
If you look at the documentation for writerow, it says:
"Write the row parameter to the writer's file object, formatted according to the current dialect."
But in your code you are passing an entire string:
csv_writer.writerow(reader)
because reader is a string at this point.
Now, the format you want to use in your CSV file is not clearly mentioned in the question. But as you said, if you can do some preprocessing to create a list of lists and pass each sublist to writerow(), you should be able to produce the required file format.
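As a small sketch of that preprocessing idea, assuming the target format is the key,value layout shown in the question: the input string is the sample from the question, and io.StringIO stands in for the output file.

```python
import csv
import io

raw = '"400":0.1,"401":0.2,"402":0.3'

# Preprocess: one sublist per record, e.g. ['400', '0.1'].
rows = [[part.strip('"') for part in record.split(':')] for record in raw.split(',')]

out = io.StringIO(newline='')   # stands in for the output file
writer = csv.writer(out, lineterminator='\n')
for row in rows:
    # Each call receives a list of fields, not a single string.
    writer.writerow(row)

print(out.getvalue())
```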

Removing last row in csv

I'm trying to remove the last row in a CSV, but I am getting an error: _csv.Error: string with NUL byte
This is what I have so far:
dcsv = open('PnL.csv' , 'a+r+b')
cWriter = csv.writer(dcsv, delimiter=' ')
cReader = csv.reader(dcsv)
for row in cReader:
    cWriter.writerow(row[:-1])
I can't figure out why I keep getting errors.
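I would just read in the whole file with readlines(), drop the last row, and then write the rest back with the csv module:
import csv
with open("summary.csv", "r", newline="") as f:
    lines = f.readlines()
lines = lines[:-1]  # drop the last row
with open("summary.csv", "w", newline="") as f:
    cWriter = csv.writer(f, delimiter=',')
    # parse each line into a list of fields; writerow() on a raw string would split it into characters
    for row in csv.reader(lines):
        cWriter.writerow(row)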
This should work:
import csv
with open('Pnl.csv', "r") as f:
    lines = f.readlines()
lines.pop()  # drop the last row
with open('Pnl.csv', "w") as f:
    f.writelines(lines)
I'm not sure what you're doing with the 'a+r+b' file mode and reading and writing to the same file, so won't provide a complete code snippet, but here's a simple method to skip any lines that contains a NUL byte in them in a file you're reading, whether it's the last, first, or one in the middle being read.
The trick is to realize that the docs say the csvfile argument to csv.reader() "can be any object which supports the iterator protocol and returns a string each time its next() method is called." This means that you can replace the file argument in the call with a simple filter generator function defined this way:
def filter_nul_byte_lines(a_file):
    for line in a_file:
        if '\x00' not in line:
            yield line
and use it in a way similar to this:
dcsv = open('Pnl.csv', 'r', newline='')
cReader = csv.reader(filter_nul_byte_lines(dcsv))
for row in cReader:
    print(row)
This will cause any lines with a NUL byte in them to be ignored while reading the file. Also this technique works on-the-fly as each line is read, so it does not require reading the entire file into memory at once or preprocessing it ahead of time.
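A self-contained sketch of the same filter in Python 3, where a list of strings stands in for an open text file. The sample lines are made up for illustration.

```python
import csv

def filter_nul_byte_lines(lines):
    """Yield only the lines that do not contain a NUL character."""
    for line in lines:
        if '\x00' not in line:
            yield line

# A list of strings stands in for an open text file (hypothetical data).
sample_lines = ['a,b\n', '1,2\n', 'bad\x00,row\n', '3,4\n']
rows = list(csv.reader(filter_nul_byte_lines(sample_lines)))
print(rows)
```

Because the filter is a generator, each line is checked only when csv.reader asks for it.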
