Find delimiter in txt to convert to csv using Python

I have to convert some txt files to csv (and perform some operations during the conversion).
I use the csv.Sniffer() class to detect which delimiter is used in the txt file.
This code:
with open(filename_input, 'r') as f1, open(filename_output, 'wb') as f2:
    dialect = csv.Sniffer().sniff(f1.read(1024))  # detect delimiters
    f1.seek(0)
    r = csv.reader(f1, delimiter=dialect)
    writer = csv.writer(f2, delimiter=';')
fails with:
Error: Could not determine delimiter
This works:
with open(filename_input, 'r') as f1, open(filename_output, 'wb') as f2:
    # dialect = csv.Sniffer().sniff(f1.read(1024))  # detect delimiters
    # f1.seek(0)
    r = csv.reader(f1, delimiter='\t')
    writer = csv.writer(f2, delimiter=';')
or
with open(filename_input, 'r') as f1, open(filename_output, 'wb') as f2:
    # dialect = csv.Sniffer().sniff(f1.read(1024))  # detect delimiters
    # f1.seek(0)
    r = csv.reader(f1, dialect="excel-tab")
    writer = csv.writer(f2, delimiter=';')
This is an example row from the txt file (10 fields delimited by tabs):
166 14908941 sa_s NOVA i 7.05 DEa 7.17 Ncava - Deo mo 7161 4,97
Why doesn't the csv.Sniffer() class work?
The problem was that I passed only the first 1024 bytes of the txt file to the sniffer (maybe this is not enough to detect the delimiter).
Now this code works with no other changes:
with open(filename_input, 'r') as f1, open(filename_output, 'wb') as f2:
    dialect = csv.Sniffer().sniff(f1.read())  # error with csv.Sniffer().sniff(f1.read(1024))
    f1.seek(0)
    r = csv.reader(f1, delimiter=dialect)
    writer = csv.writer(f2, delimiter=';')

You have to use dialect.delimiter instead of just dialect, because sniff() returns a Dialect class and the delimiter argument needs its delimiter attribute:
rows = csv.reader(f1, delimiter=dialect.delimiter)
The modified code is below:
import csv
filename_input = 'filein.txt'
filename_output = 'fileout.csv'
with open(filename_input, 'r') as f1, open(filename_output, 'wb') as f2:
    dialect = csv.Sniffer().sniff(f1.read(1024), "\t")  # detect delimiters
    f1.seek(0)
    print(dialect.delimiter)
    rows = csv.reader(f1, delimiter=dialect.delimiter)
    writer = csv.writer(f2, delimiter=';')
    writer.writerows(rows)
Output:
C:\pyp>python.exe txttocsv.py
,
C:\pyp>
Also note that from doc:
sniff(sample, delimiters=None)
Analyze the given sample and return a Dialect subclass reflecting
the parameters found. If the optional delimiters parameter is given,
it is interpreted as a string containing possible valid delimiter
characters.
Hence, if the delimiter you want to detect in your text file is something like # rather than , or ; you should pass it to sniff() as the second parameter, like this:
dialect = csv.Sniffer().sniff(f1.read(1024), '#')
Update: To sniff the whole file rather than the first 1024 characters, use:
dialect = csv.Sniffer().sniff(f1.read())
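The points above can be combined into a small Python 3 sketch: restrict the sniffer to a set of plausible delimiters and fall back to a default when it raises csv.Error ("Could not determine delimiter"). The helper name detect_delimiter, the candidate string and the tab fallback are assumptions made for this example; only csv.Sniffer().sniff() and csv.Error come from the standard library.

```python
import csv

def detect_delimiter(sample, candidates=",;\t|#", default="\t"):
    """Return the delimiter sniffed from sample, or default if sniffing fails.

    detect_delimiter, candidates and default are hypothetical names chosen
    for this sketch; they are not part of the csv module.
    """
    try:
        # Restricting the candidates keeps the sniffer from choosing a
        # character that merely happens to occur regularly in the data.
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Raised when none of the candidates fits the sample.
        return default

sample = "id\tname\tprice\n1\tfoo\t7.05\n2\tbar\t7.17\n"
print(repr(detect_delimiter(sample)))
```

The returned character can then be passed as delimiter= to csv.reader, as in the answer above.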

The code works, but in the generated CSV there is a blank line after each record.
The code I used:
import csv
filename_input = r'filepath.txt'
filename_output = r'filepath.csv'
with open(filename_input, 'r') as tmp, open(filename_output, 'w') as tmp2:
    dialect = csv.Sniffer().sniff(tmp.read(1024), ";")  # detect delimiters
    tmp.seek(0)
    print(dialect.delimiter)
    rows = csv.reader(tmp, delimiter=dialect.delimiter)
    writer = csv.writer(tmp2, delimiter=',')
    writer.writerows(rows)
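A likely cause of the blank line after each record, assuming Python 3 on Windows, is that the output file is opened without newline=''. csv.writer ends each row with "\r\n" by default, and text mode then turns the "\n" into another "\r\n". A minimal sketch of the same conversion with newline='' on both files (convert_delimiter and its parameter names are made up for this example):

```python
import csv

def convert_delimiter(src_path, dst_path, src_delimiter=";", dst_delimiter=","):
    """Copy src_path to dst_path, changing only the field delimiter."""
    # newline='' stops open() from translating the "\r\n" written by
    # csv.writer into "\r\r\n" on Windows, which shows up as empty rows.
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=src_delimiter)
        writer = csv.writer(dst, delimiter=dst_delimiter)
        writer.writerows(reader)
```

Calling convert_delimiter('filepath.txt', 'filepath.csv') would then write one row per record.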

Related

CSV Writer (Python) with CRLF instead of LF

Hi, I am trying to use the csv library to convert my CSV file into a new one.
The code that I wrote is the following:
import csv
import re
file_read = r'C:\Users\Comarch\Desktop\Test.csv'
file_write = r'C:\Users\Comarch\Desktop\Test_new.csv'
def find_txt_in_parentheses(cell_txt):
    pattern = r'\(.+\)'
    return set(re.findall(pattern, cell_txt))
with open(file_write, 'w', encoding='utf-8-sig') as file_w:
    csv_writer = csv.writer(file_w, lineterminator="\n")
    with open(file_read, 'r', encoding='utf-8-sig') as file_r:
        csv_reader = csv.reader(file_r)
        for row in csv_reader:
            cell_txt = row[0]
            txt_in_parentheses = find_txt_in_parentheses(cell_txt)
            if len(txt_in_parentheses) == 1:
                txt_in_parentheses = txt_in_parentheses.pop()
                cell_txt_new = cell_txt.replace(' ' + txt_in_parentheses, '')
                cell_txt_new = txt_in_parentheses + '\n' + cell_txt_new
                row[0] = cell_txt_new
            csv_writer.writerow(row)
The only problem is that in the resulting file (Test_new.csv) I get CRLF instead of LF.
(Screenshot: the read file on the left, the written file on the right.)
As a result, when I copy the csv column into a Google Docs spreadsheet, I get a blank line after each row that ends in CRLF.
Is it possible, using the csv library, to write the file so that an LF is left inside a cell instead of CRLF?

From the documentation of csv.reader:
If csvfile is a file object, it should be opened with newline='' [1].
[...]
Footnote [1]: If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n line endings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
This is precisely the issue you're seeing. So...
with open(file_read, 'r', encoding='utf-8-sig', newline='') as file_r, \
        open(file_write, 'w', encoding='utf-8-sig', newline='') as file_w:
    csv_reader = csv.reader(file_r, dialect='excel')
    csv_writer = csv.writer(file_w, dialect='excel')
    # ...
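To see the effect of newline='' in isolation, here is a small self-contained sketch. The file name demo.csv, the helper write_rows and the sample rows are made up for this example. It writes a cell containing an embedded LF and prints the raw bytes, so any "\r" would be visible.

```python
import csv
import os
import tempfile

def write_rows(path, rows):
    """Write rows with LF row endings, leaving LF inside cells untouched."""
    # newline='' disables newline translation in the text layer, so the
    # bytes on disk are exactly what csv.writer produced.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(rows)

path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_rows(path, [["(note)\ncell", "x"], ["plain", "y"]])

with open(path, "rb") as f:
    raw = f.read()
print(raw)
```

With lineterminator="\n" the row endings are LF as well. With the default excel dialect the rows would end in "\r\n", but the LF inside the quoted cell would still be written unchanged.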

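You are on Windows, and you open the file in text mode 'w', which translates every \n you write into a Windows-style \r\n line ending. In Python 2, opening the file with mode 'wb' gives the preferred behaviour. In Python 3 (which this code uses, given the encoding argument) the csv module needs a text-mode file, so pass newline='' to open() instead.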

How to read csv data, strip spaces/tabs and write to new csv file?

I have a large (1.6 million+ rows) .csv file that has some data with leading spaces, tabs, trailing spaces and maybe even trailing tabs. I need to read the data in, strip all of that whitespace, and then write the rows back out to a new .csv file, preferably with the most efficient code possible and using only built-in modules in Python 3.7.
Here is what I have so far. It runs, except that it only writes out the header over and over and doesn't seem to take care of trailing tabs (not a huge deal on the trailing tabs, though):
def new_stripper(self, input_filename: str, output_filename: str):
    """
    new_stripper(self, filename: str):
    :param self: no idea what this does
    :param filename: name of file to be stripped, must have .csv at end of file
    :return: for now, it doesn't return anything...
    -still doesn't remove trailing tabs?? But it can remove trailing spaces
    -removes leading tabs and spaces
    -still needs to write to new .csv file
    """
    import csv
    csv.register_dialect('strip', skipinitialspace=True)
    reader = csv.DictReader(open(input_filename), dialect='strip')
    reader = (dict((k, v.strip()) for k, v in row.items() if v) for row in reader)
    for row in reader:
        with open(output_filename, 'w', newline='') as out_file:
            writer = csv.writer(out_file, delimiter=',')
            writer.writerow(row)
input_filename = 'testFile.csv'
output_filename = 'output_testFile.csv'
new_stripper(self='', input_filename=input_filename, output_filename=output_filename)
As written above, the code just prints the headers over and over in a single line. I've played around with the arrangement and indenting of the last four lines of the def with some different results, but the closest I've gotten is getting it to print the header row again and again on new lines each time:
...
# headers and headers for days
with open(output_filename, 'w', newline='') as out_file:
    writer = csv.writer(out_file, delimiter=',')
    for row in reader:
        writer.writerow(row)
EDIT1: Here's the result when the stripping doesn't work correctly. Some fields have leading spaces that weren't stripped, some have trailing spaces that weren't stripped. It seems like the left-most column was properly stripped of leading spaces, but not trailing spaces; same with the header row.
Update: Here's the solution I was looking for:
def get_data(self, input_filename: str, output_filename: str):
    import csv
    with open(input_filename, 'r', newline='') as in_file, open(output_filename, 'w', newline='') as out_file:
        r = csv.reader(in_file, delimiter=',')
        w = csv.writer(out_file, delimiter=',')
        for line in r:
            trim = (field.strip() for field in line)
            w.writerow(trim)
input_filename = 'testFile.csv'
output_filename = 'output_testFile.csv'
get_data(self='', input_filename=input_filename, output_filename=output_filename)

Don't make life complicated for yourself, "CSV" files are simple plain text files, and can be handled in a generic way:
with open('input.csv', 'r') as inf, open('output.csv', 'w') as of:
    for line in inf:
        trim = (field.strip() for field in line.split(','))
        of.write(','.join(trim) + '\n')
Alternatively, using the csv module:
import csv
# newline='' lets the csv module handle line endings itself
with open('input.csv', 'r', newline='') as inf, open('output.csv', 'w', newline='') as of:
    r = csv.reader(inf, delimiter=',')
    w = csv.writer(of, delimiter=',')
    for line in r:
        trim = (field.strip() for field in line)
        w.writerow(trim)

Unfortunately I cannot comment, but I believe you might want to strip every entry in csv of the white space (not just the line). If that is the case, then, based on Jan's answer, this might do the trick:
with open('file.csv', 'r') as inf, open('output.csv', 'w') as of:
    for line in inf:
        of.write(','.join(list(map(str.strip, line.split(',')))) + '\n')
It splits each line on commas into a list of values, strips the whitespace from every element, then joins them back together and writes the result to the output file.
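One caveat with the plain split(',') approach: it only works when no field contains a quoted comma. A short sketch of the difference, using a made-up sample line:

```python
import csv

line = '"Smith, John",  42 \t'

# Naive approach: the comma inside the quoted name is treated as a separator.
naive = [field.strip() for field in line.split(",")]

# csv.reader accepts any iterable of lines and honours the quoting.
parsed = [field.strip() for field in next(csv.reader([line]))]

print(naive)   # three pieces, the name is split in two
print(parsed)  # two fields, the name stays intact
```

If the data is known to contain no quoted fields, the simpler split is fine. Otherwise the csv module is the safer choice.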

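Your final reader variable yields dicts, but csv.writer's writerow() expects a sequence of field values. Iterating over a dict gives only its keys, which is why the header names are written over and over.
You can either use csv.DictWriter (and write the header row with writer.writeheader()), or store the processed values (v) in a list first and write that list with csv.writer.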
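A minimal sketch of the csv.DictWriter route, assuming every row has at most as many fields as the header. The function name strip_csv and the in-memory sample data are made up for this example; with real files, the two io.StringIO objects would be replaced by files opened with newline=''.

```python
import csv
import io

def strip_csv(in_file, out_file):
    """Copy CSV data from in_file to out_file with whitespace stripped."""
    reader = csv.DictReader(in_file)
    # Header names can carry stray whitespace too, so clean them first.
    fieldnames = [name.strip() for name in reader.fieldnames]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Missing trailing fields come back as None, so guard the strip.
        writer.writerow({k.strip(): (v or "").strip() for k, v in row.items()})

source = io.StringIO("name , city\n alice\t, paris \n bob ,\tlondon\n")
target = io.StringIO()
strip_csv(source, target)
print(target.getvalue())
```

str.strip() with no arguments removes tabs as well as spaces, which covers the trailing-tab case mentioned in the question.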

Read CSV with comma as linebreak

I have a file saved as .csv
"400":0.1,"401":0.2,"402":0.3
Ultimately I want to save the data in a proper format in a csv file for further processing. The problem is that there are no line breaks in the file.
pathname = r"C:\pathtofile\file.csv"
with open(pathname, newline='') as file:
    reader = file.read().replace(',', '\n')
    print(reader)
with open(r"C:\pathtofile\filenew.csv", 'w') as new_file:
    csv_writer = csv.writer(new_file)
    csv_writer.writerow(reader)
The print reader output looks exactly how I want (or at least it's a format I can further process).
"400":0.1
"401":0.2
"402":0.3
And now I want to save that to a new csv file. However the output looks like
"""",4,0,0,"""",:,0,.,1,"
","""",4,0,1,"""",:,0,.,2,"
","""",4,0,2,"""",:,0,.,3
I'm sure it would be intelligent to convert the format to
400,0.1
401,0.2
402,0.3
at this stage instead of doing later with another script.
The main problem is that my current code
with open(pathname, newline='') as file:
    reader = file.read().replace(',', '\n')
    reader = csv.reader(reader, delimiter=':')
    x = []
    y = []
    print(reader)
    for row in reader:
        x.append(float(row[0]))
        y.append(float(row[1]))
    print(x)
    print(y)
works fine for the type of csv files I currently have, but doesn't work for these mentioned above:
y.append( float(row[1]) )
IndexError: list index out of range
So I'm trying to find a way to work with them too. I think I'm missing something obvious as I imagine that it can't be too hard to properly define the linebreak character and delimiter of a file.
with open(pathname, newline=',') as file:
yields
ValueError: illegal newline value: ,

The right way with csv module, without replacing and casting to float:
import csv
with open('file.csv', 'r') as f, open('filenew.csv', 'w', newline='') as out:
    reader = csv.reader(f)
    writer = csv.writer(out, quotechar=None)
    for r in reader:
        for i in r:
            writer.writerow(i.split(':'))
The resulting filenew.csv contents (according to your "intelligent" condition):
400,0.1
401,0.2
402,0.3
Nuances:
csv.reader and csv.writer objects treat the comma as the default delimiter, so there is no need for file.read().replace(',', '\n')
quotechar=None is passed to csv.writer so that no double quotes are written around the saved values

You need to split the values to form a list to represent a row. Presently the code is splitting the string into individual characters to represent the row.
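pathname = r"C:\pathtofile\file.csv"
with open(pathname) as old_file:
    with open(r"C:\pathtofile\filenew.csv", 'w', newline='') as new_file:
        csv_writer = csv.writer(new_file, delimiter=',')
        # drop the trailing newline, then split into "key":value items
        text_rows = old_file.read().strip().split(",")
        for row in text_rows:
            items = row.split(":")
            # remove the surrounding double quotes before converting the key
            csv_writer.writerow([int(items[0].strip('"')), items[1]])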

If you look at the documentation for writerow(), it says:
Write the row parameter to the writer’s file
object, formatted according to the current dialect.
But, you are writing an entire string in your code
csv_writer.writerow(reader)
because reader is a string at this point.
Now, the format you want to use in your CSV file is not clearly mentioned in the question. But as you said, if you can do some preprocessing to create a list of lists and pass each sublist to writerow(), you should be able to produce the required file format.
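A sketch of that preprocessing, assuming the target format is the 400,0.1 layout from the question and that neither keys nor values contain commas or colons. The helper name pairs_to_rows is made up for this example, and io.StringIO stands in for a real output file.

```python
import csv
import io

def pairs_to_rows(text):
    """Turn '"400":0.1,"401":0.2' into [['400', '0.1'], ['401', '0.2']]."""
    rows = []
    for item in text.strip().split(","):
        key, value = item.split(":", 1)
        # Drop whitespace and the double quotes around the key.
        rows.append([key.strip().strip('"'), value.strip()])
    return rows

rows = pairs_to_rows('"400":0.1,"401":0.2,"402":0.3\n')

out = io.StringIO()  # stands in for open(path, 'w', newline='')
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue())
```

Each sublist is one row, so writerows() (or writerow() in a loop) produces one key,value pair per line.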

Pipe delimiter file, but no pipe inside data

Problem
I need to re-format a text from comma (,) separated values to pipe (|) separated values. Pipe characters within the values of the original (comma separated) text shall be replaced by a space for representation in the (pipe separated) result text.
The pipe separated result text shall be written back to the same file from which the original comma separated text has been read.
I am using Python 2.6.
Possible Solution
I could read the file first, replace all pipes in it with spaces, and then replace (,) with (|).
Is there a better way to achieve this?

Don't reinvent the value-separated file parsing wheel. Use the csv module to do the parsing and the writing for you.
The csv module will add "..." quotes around values that contain the separator, so in principle you don't need to replace the | pipe symbols in the values. To replace the original file, write to a new (temporary) outputfile then move that back into place.
import csv
import os
outputfile = inputfile + '.tmp'
with open(inputfile, 'rb') as inf, open(outputfile, 'wb') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf, delimiter='|')
    writer.writerows(reader)
os.remove(inputfile)
os.rename(outputfile, inputfile)
For an input file containing:
foo,bar|baz,spam
this produces
foo|"bar|baz"|spam
Note that the middle column is wrapped in quotes.
If you do need to replace the | characters in the values, you can do so as you copy the rows:
outputfile = inputfile + '.tmp'
with open(inputfile, 'rb') as inf, open(outputfile, 'wb') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf, delimiter='|')
    for row in reader:
        writer.writerow([col.replace('|', ' ') for col in row])
os.remove(inputfile)
os.rename(outputfile, inputfile)
Now the output for my example becomes:
foo|bar baz|spam

Sounds like you're trying to work with a variation of CSV. In that case, Python's csv library may well be what you need. You can use it with custom delimiters and it will handle quoting for you (this example was taken from the manual and modified):
import csv
with open('eggs.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter='|')
    spamwriter.writerow(['One', 'Two', 'Three'])
There are also ways to modify quoting and escaping and other options. Reading works similarly.

You can create a temporary file from the original that has the pipe characters replaced, and then replace the original file with it when the processing is done:
import csv
import tempfile
import os
filepath = 'C:/Path/InputFile.csv'
with open(filepath, 'rb') as fin:
    reader = csv.DictReader(fin)
    # create the temporary file in the same directory as the original
    fout = tempfile.NamedTemporaryFile(dir=os.path.dirname(filepath),
                                       delete=False)
    temp_filepath = fout.name
    writer = csv.DictWriter(fout, reader.fieldnames, delimiter='|')
    # writer.writeheader()  # requires Python 2.7
    header = dict(zip(reader.fieldnames, reader.fieldnames))
    writer.writerow(header)
    for row in reader:
        for k, v in row.items():
            row[k] = v.replace('|', ' ')
        writer.writerow(row)
    fout.close()
os.remove(filepath)
os.rename(temp_filepath, filepath)
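For readers on Python 3 rather than the 2.6 of the question, the same temporary-file idea can be sketched with text-mode files and os.replace(), which overwrites the destination in one call (available from Python 3.3). The function name commas_to_pipes_in_place is made up for this example.

```python
import csv
import os
import tempfile

def commas_to_pipes_in_place(path):
    """Rewrite a comma-separated file as pipe-separated, in place."""
    # Keep the temporary file next to the original so the final
    # os.replace() does not have to cross filesystems.
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "r", newline="") as src, \
            tempfile.NamedTemporaryFile("w", newline="", dir=directory,
                                        delete=False) as dst:
        writer = csv.writer(dst, delimiter="|")
        for row in csv.reader(src):
            # Pipes inside values become spaces, as the question asks.
            writer.writerow([col.replace("|", " ") for col in row])
    # Both files are closed here, so the original can be replaced.
    os.replace(dst.name, path)
```

For the example input foo,bar|baz,spam this leaves foo|bar baz|spam in the file.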

Python Enclosing Words With Quotes In A String

In Python, I'm opening a csv file that looks like this:
jamie,london,uk,600087
matt,paris,fr,80092
john,newyork,ny,80071
How do I enclose the words with quotes in the csv file so it appears like:
"jamie","london","uk","600087"
etc...
What I have right now is just the basic stuff:
filename = "data.csv"
file = open(filename, "r")
I'm not sure what to do next.

If you are just trying to convert the file, use the QUOTE_ALL constant from the csv module, like this:
import csv
with open('data.csv') as input, open('out.csv', 'w') as output:
    reader = csv.reader(input)
    writer = csv.writer(output, delimiter=',', quoting=csv.QUOTE_ALL)
    for line in reader:
        writer.writerow(line)
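To check the effect of QUOTE_ALL without touching any files, here is a small sketch that runs the same conversion on in-memory data (io.StringIO stands in for the two files, and lineterminator="\n" is chosen only to keep the printed output simple):

```python
import csv
import io

source = io.StringIO("jamie,london,uk,600087\nmatt,paris,fr,80092\n")
target = io.StringIO()

# QUOTE_ALL wraps every field in quotes, including the numeric-looking ones.
writer = csv.writer(target, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(csv.reader(source))

print(target.getvalue())
# "jamie","london","uk","600087"
# "matt","paris","fr","80092"
```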
