Insert whitespace after delimiter with CSV writer - python

import csv

f = open("file1.csv", "r")
g = open("file2.csv", "w")
a = csv.reader(f, delimiter=";", skipinitialspace=True)
b = csv.writer(g, delimiter=";")
for line in a:
    b.writerow(line)
In the above code, I try to load file1.csv using the csv module in Python 2.7, and then write it to file2.csv using a csv.writer.
My issue comes from existing whitespace (a single space character) after each delimiter in the input file. I need to remove it in order to do some data manipulation later on, so I used the skipinitialspace=True argument for the reader. However, I cannot get the writer to print the space character after the delimiter, which disturbs any subsequent diffing of the two files.
I tried to use the Sniffer class to auto-generate a Dialect but I guess my input files (coming from a large complex legacy system, with dozens of fields and poor quoting and escaping) are proving to be too complex for this.
In more simple terms I'm looking for the answers to the following questions:
How can I insert a space character after each delimiter in the writer?
Incidentally, what are the reasons to prohibit the use of multi-character strings as delimiters? delimiter="; " would've solved my problem.

You can wrap your file objects in proxies that add the whitespace:
>>> class DelimitedFile(file):
...     def write(self, value):
...         super(DelimitedFile, self).write(value.replace(";", "; "))
...
>>> f = DelimitedFile("foo", "w")
>>> f.write("hello;world")
>>> f.close()
>>> open("foo").read()
'hello; world'

If you leave the whitespace you want written in place (removing and restoring it only during processing), or put it back after processing but before writing, that would take care of it.
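For example, a minimal sketch of the put-it-back-before-writing approach (the file names match the question; the single-space-after-delimiter layout is an assumption):

import csv

with open("file1.csv") as f, open("file2.csv", "w") as g:
    reader = csv.reader(f, delimiter=";", skipinitialspace=True)
    writer = csv.writer(g, delimiter=";")
    for row in reader:
        if not row:
            continue
        # ... data manipulation on the stripped fields goes here ...
        # restore the single space that follows each delimiter before writing
        writer.writerow([row[0]] + [" " + field for field in row[1:]])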

One solution would be to write to a StringIO object, and then to replace the semicolons with '; ', or to do so during processing of the lines, if you do any other processing.
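A minimal sketch of the StringIO variant, assuming Python 2.7 as in the question (note that the replace also touches any semicolons inside quoted fields):

import csv
import StringIO  # use io.StringIO on Python 3

buf = StringIO.StringIO()
writer = csv.writer(buf, delimiter=";")
writer.writerow(["hello", "world"])
with open("file2.csv", "w") as g:
    g.write(buf.getvalue().replace(";", "; "))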

As for the first, I would probably do something like this:
for k, line in enumerate(a):
    if k == 0:
        b.writerow(line)
    else:
        b.writerow(' ' + line)  # assuming line is always a string; if not, just use str() on it
As for the second, I have no idea.

Related

Pandas to_csv with multiple separators

I want to convert a pandas dataframe to csv with multiple separators. Is there a way?
dataframe.to_csv("file.csv", sep="%%")
Error: delimiter must be 1-character string
The easiest way might be to use a unique single-character separator first, then replace it:
tsv = dataframe.to_csv(sep='\t')  # use '\1' if your data contains tabs
psv = tsv.replace('\t', '%%')
with open('file.csv', 'w') as outfile:
    outfile.write(psv)
P.S.: Consider using an extension other than .csv since it's not comma separated.
I think there might be some bugs with replace, as John says, because it can't guarantee that the replaced character only occurs as the separator.
Besides, since to_csv returns the whole file as a single string, big data might lead to a memory error.
Here is another feasible solution.
with open('test_pandas.txt', 'w') as f:
    for index, row in dataframe.iterrows():
        l = map(str, row.values.tolist())
        line = '%%'.join(l)
        f.write(line + '\n')

Writing a string to CSV using line escapes in python 3

Working in Python 3.7.
I'm currently pulling data from an API (Qualys's API, fetching a report) to be specific. It returns a string with all the report data in a CSV format with each new line designated with a '\r\n' escape.
(i.e. 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n')
The problem I'm having is writing this string properly to a CSV file. Every iteration of code I've tried writes the data cell by cell when viewed in Excel, with the \r\n appended wherever it was in the string, all on one row rather than on a new line.
(i.e |foo|bar|stuff\r\n|more stuff|data|report\r\n|etc|etc|etc\r\n|)
I'm just making the switch from 2 to 3, so I'm almost positive it's a syntactical error or an error in my understanding of how Python 3 handles newline delimiters or something along those lines, but even after reviewing documentation here and blog posts, I either can't get my head around it or I'm consistently missing something.
current code:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res))  # returns string
    #input('pause')
    f_csv = open(title, 'w', newline='\r\n')
    f_csv.write(res)
    f_csv.close()  # note: the original was missing the parentheses, so close was never called
but i've also tried:
with open(title, 'w', newline='\r\n') as f:
    writer = csv.writer(f, <tried encoding here, no luck>)
    writer.writerows(res)
    # anyone else looking at this: this didn't work because of the difference
    # between writerow() and writerows()
and I've also tried various ways to declare newline, such as:
newline=''
newline='\n'
etc...
and various other iterations along these lines. Any suggestions or guidance or... anything at this point would be awesome.
edit:
Ok, I've continued to work on it, and this kinda works:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res))  # returns string
    reader = csv.reader(res.split(r'\r\n'), delimiter=',')
    with open(title, 'w') as outfile:
        writer = csv.writer(outfile, delimiter='\n')
        writer.writerow(reader)
But it's ugly, and it does create errors in the output CSV (some rows (less than 1%) don't parse as a CSV row, probably a formatting error somewhere), but more concerning is that it behaves oddly when a "\" appears in the data.
I would really be interested in a solution that works... better? More pythonic? more consistently would be nice...
Any ideas?
Based on your comments, the data you're being served doesn't actually include carriage returns or newlines, it includes the text representing the escapes for carriage returns and newlines (so it really has a backslash, r, backslash, n in the data). It's otherwise already in the form you want, so you don't need to involve the csv module at all, just interpret the escapes to their correct value, then write the data directly.
This is relatively simple using the unicode-escape codec (which also handles ASCII escapes):
import codecs  # needed for text->text decoding

# ... retrieve data here, store to res ...

# converts backslash followed by r to carriage return, backslash followed
# by n to newline, and so on for other escapes
decoded = codecs.decode(res, 'unicode-escape')

# newline='' means don't perform line-ending conversions, so you keep \r\n
# on all systems: no adding, no removing characters.
# You may want to explicitly specify an encoding like UTF-8, rather than
# relying on the system default, so your code is portable across locales.
with open(title, 'w', newline='') as f:
    f.write(decoded)
If the strings you receive are actually wrapped in quotes (so print(repr(s)) includes quotes on either end), it's possible they're intended to be interpreted as JSON strings. In that case, just replace the import and creation of decoded with:
import json
decoded = json.loads(res)
If I understand your question correctly, can't you just replace the string?
with open(title, 'w') as f:
    f.write(res.replace("\r\n", "\n"))
Check out this answer:
Python csv string to array
According to the csv.reader documentation, it expects \r\n as the line terminator by default. Your string should work fine with it. If you load the string into a csv.reader object, then you should be able to export it the standard way.
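A minimal sketch of that approach, assuming res already contains real \r\n characters rather than literal backslash escapes:

import csv

reader = csv.reader(res.splitlines())  # csv.reader accepts any iterable of lines
with open(title, 'w', newline='') as f:
    csv.writer(f).writerows(reader)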
Python strings use the single \n newline character. Normally, a \r\n is converted to \n when a file is read, and on write the newline is converted to \n or \r\n depending on your system default and the newline= parameter.
In your case, the \r wasn't removed when you read the data from the web interface. When you opened the file with newline='\r\n', Python expanded the \n as it was supposed to, but the \r passed through, and now your newline is \r\r\n. You can see that by rereading the text file in binary mode:
>>> res = 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
>>> open('test', 'w', newline='\r\n').write(res)
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\r\n,more stuff,data,report\r\r\n,etc,etc,etc\r\r\n'
Since you already have the line endings you want, just write in binary mode and skip the conversions:
>>> open('test', 'wb').write(res.encode())
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
Notice I used the system default encoding, but you likely want to standardize on an encoding.
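For example, a minimal variant that pins the encoding explicitly (UTF-8 here is an assumption; use whatever your consumers expect):

# write bytes with an explicit encoding instead of the system default
with open('test', 'wb') as f:
    f.write(res.encode('utf-8'))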

How can I successfully capture all possible cases to create a python list from a text file

This public gist creates a simple scenario where you can turn a text file into a python list line by line.
with open('test.txt', 'r') as listFile:
    lines = listFile.read().split("\n")

out = []
for item in lines:
    if '"' in item:
        out.append('("""' + item + '"""),')
    else:
        out.append('("' + item + '"),')

with open('out.py', 'a') as outFile:
    outFile.write("out = [\n")
    for item in out:
        outFile.write("\t" + item + "\n")
    outFile.write("]")
In test.txt the sixth and seventh lines
'"""'
""
are the ones that produce invalid output. Perhaps you can think of some other examples that would fail to work.
EDIT:
Valid output would look something like this:
out = [
    "line1",
    "line2",
    """ line 3 has """ and "" and " in it """,  # but it is a valid string
    "last line",
]
The ( and ) characters were an oversight by me; they are not needed or wanted...
EDIT: Oh god I'm getting overwhelmed. I'm going to take 5 minutes and post the question again in a better form.
Using a newline character besides \n would also cause the program to fail. On Windows it's common to use \r\n, and classic Mac OS used \r.
#abarnert's comment shows a better way to read lines.
A text file is already an iterable of lines.
As with any other iterable, you can convert it to a list by just passing it to the list constructor:
with open('text.txt') as f:
    lines = list(f)
Or, if you don't want the newlines on the end of each line:
with open('text.txt') as f:
    lines = [line.rstrip('\n') for line in f]
If you want to handle classic Mac and Windows line endings as well as Unix, open the file in universal-newlines mode:
with open('text.txt', 'rU') as f:
… or use the Python 3-style io classes (but note that this will give you unicode strings, not byte strings, which will repr with u prefixes—they're still valid Python literals that way, but they won't look as pretty):
import io
with io.open('text.txt') as f:
Now, it's hard to tell from code that doesn't work and no explanation of what's wrong with it, but it looks like you're trying to figure out how to write that list out as a Python-source-format list display, wrapping it in brackets, adding quotes, escaping any internal quotes, etc. But there's a much easier way to do that too:
with open('out.py', 'a') as f:
    f.write(repr(lines))
If you're trying to pretty-print it, there's a pprint module in the stdlib for exactly that purpose, and various bigger/better alternatives on PyPI. Here's an example of the output of pprint.pprint(lines, width=60) with (what I think is) the same input you used for your desired output:
['line1',
 'line2',
 ' line 3 has """ and "" and " in it ',
 'last line']
Not exactly the same as your desired output—but, unlike your output, it's a valid Python list display that evaluates to the original input, and it looks pretty readable to me.

Writelines writes lines without newline, Just fills the file

I have a program that writes a list to a file.
The list is a list of pipe delimited lines and the lines should be written to the file like this:
123|GSV|Weather_Mean|hello|joe|43.45
122|GEV|temp_Mean|hello|joe|23.45
124|GSI|Weather_Mean|hello|Mike|47.45
BUT it wrote them like this, ahhhh:
123|GSV|Weather_Mean|hello|joe|43.45122|GEV|temp_Mean|hello|joe|23.45124|GSI|Weather_Mean|hello|Mike|47.45
This program wrote all the lines into like one line without any line breaks. This hurts me a lot and I gotta figure out how to reverse this, but anyway, where is my program wrong here? I thought writelines should write lines down the file rather than just write everything to one line...
fr = open(sys.argv[1], 'r')  # source file
fw = open(sys.argv[2] + "/masked_" + sys.argv[1], 'w')  # target directory location
for line in fr:
    line = line.strip()
    if line == "":
        continue
    columns = line.strip().split('|')
    if columns[0].find("#") > 1:
        looking_for = columns[0]  # this is what we need to search
    else:
        looking_for = "Dummy#dummy.com"
    if looking_for in d:
        # by default, iterating over a dictionary will return keys
        new_line = d[looking_for] + '|' + '|'.join(columns[1:])
        line_list.append(new_line)
    else:
        new_idx = str(len(d) + 1)
        d[looking_for] = new_idx
        kv = open(sys.argv[3], 'a')
        kv.write(looking_for + " " + new_idx + '\n')
        kv.close()
        new_line = d[looking_for] + '|' + '|'.join(columns[1:])
        line_list.append(new_line)
fw.writelines(line_list)
This is actually a pretty common problem for newcomers to Python—especially since, across the standard library and popular third-party libraries, some reading functions strip out newlines, but almost no writing functions (except the log-related stuff) add them.
So, there's a lot of Python code out there that does things like:
fw.write('\n'.join(line_list) + '\n')
(writing a single string) or
fw.writelines(line + '\n' for line in line_list)
Either one is correct, and of course you could even write your own writelinesWithNewlines function that wraps it up…
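A minimal sketch of such a wrapper, assuming '\n' is the separator you want (the name writelinesWithNewlines comes from the sentence above; the body is an assumption):

def writelinesWithNewlines(f, lines):
    # append the separator to each line as it is written
    f.writelines(line + '\n' for line in lines)

writelinesWithNewlines(fw, line_list)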
But you should only do this if you can't avoid it.
It's better if you can create/keep the newlines in the first place—as in Greg Hewgill's suggestions:
line_list.append(new_line + "\n")
And it's even better if you can work at a higher level than raw lines of text, e.g., by using the csv module in the standard library, as esuaro suggests.
For example, right after defining fw, you might do this:
cw = csv.writer(fw, delimiter='|')
Then, instead of this:
new_line = d[looking_for]+'|'+'|'.join(columns[1:])
line_list.append(new_line)
You do this:
row_list.append([d[looking_for]] + columns[1:])
And at the end, instead of this:
fw.writelines(line_list)
You do this:
cw.writerows(row_list)
Finally, your design is "open a file, then build up a list of lines to add to the file, then write them all at once". If you're going to open the file up top, why not just write the lines one by one? Whether you're using simple writes or a csv.writer, it'll make your life simpler, and your code easier to read. (Sometimes there can be simplicity, efficiency, or correctness reasons to write a file all at once—but once you've moved the open all the way to the opposite end of the program from the write, you've pretty much lost any benefits of all-at-once.)
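As a hedged sketch of that write-as-you-go shape (simplifying the question's lookup with dict.setdefault and dropping the key-value audit file):

import csv
import sys

d = {}  # email -> masked index, as in the question
with open(sys.argv[1]) as fr, open(sys.argv[2], 'w') as fw:
    cw = csv.writer(fw, delimiter='|')
    for line in fr:
        columns = line.strip().split('|')
        if columns == ['']:
            continue
        key = columns[0] if "#" in columns[0] else "Dummy#dummy.com"
        idx = d.setdefault(key, str(len(d) + 1))
        cw.writerow([idx] + columns[1:])  # write each row immediately, no line_list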
The documentation for writelines() states:
writelines() does not add line separators
So you'll need to add them yourself. For example:
line_list.append(new_line + "\n")
whenever you append a new item to line_list.
As others have noted, writelines is a misnomer (it ridiculously does not add newlines to the end of each line).
To do that, explicitly add it to each line:
with open(dst_filename, 'w') as f:
    f.writelines(s + '\n' for s in lines)
writelines() does not add line separators. You can alter the list of strings by using map() to add a new \n (line break) at the end of each string.
items = ['abc', '123', '!##']
items = map(lambda x: x + '\n', items)
w.writelines(items)
As others have mentioned, and counter to what the method name would imply, writelines does not add line separators. This is a textbook case for a generator. Here is a contrived example:
def item_generator(things):
    for item in things:
        yield item
        yield '\n'

def write_things_to_file(things):
    with open('path_to_file.txt', 'w') as f:  # text mode: the generator yields str, not bytes
        f.writelines(item_generator(things))
Benefits: adds newlines explicitly without modifying the input or output values or doing any messy string concatenation. And, critically, does not create any new data structures in memory. IO (writing to a file) is when that kind of thing tends to actually matter. Hope this helps someone!
Credits to Brent Faust.
Python >= 3.6 with format string:
with open(dst_filename, 'w') as f:
    f.writelines(f'{s}\n' for s in lines)
lines can be a set.
If you are oldschool (like me) you may add f.write('\n') below the second line.
As we have well established here, writelines does not append the newlines for you. But, what everyone seems to be missing is that it doesn't have to when used as a direct "counterpart" for readlines(), since the initial read preserved the newlines!
When you open a file for reading in binary mode (via 'rb'), then use readlines() to fetch the file contents into memory, split by line, the newlines remain attached to the end of your lines! So, if you then subsequently write them back, you don't likely want writelines to append anything!
So if, you do something like:
with open('test.txt', 'rb') as f:
    lines = f.readlines()
with open('test.txt', 'wb') as f:
    f.writelines(lines)
You should end up with the same file content you started with.
Since we only want to separate lines, and writelines does not support adding a separator between lines, I have written the simple code below, which best suits this problem:
sep = "\n" # defining the separator
new_lines = sep.join(lines) # lines as an iterator containing line strings
and finally:
with open("file_name", 'w') as file:
file.writelines(new_lines)
and you are done.

How to convert tab separated, pipe separated to CSV file format in Python

I have a text file (.txt) which could be in tab-separated or pipe-separated format, and I need to convert it into CSV file format. I am using Python 2.6. Can anyone suggest how to identify the delimiter in a text file, read the data, and then convert it into a comma-separated file?
Thanks in advance
I fear that you can't identify the delimiter without knowing what it is. The problem with CSV is, that, quoting ESR:
the Microsoft version of CSV is a textbook example of how not to design a textual file format.
The delimiter needs to be escaped in some way if it can appear in fields. Without knowing how the escaping is done, automatically identifying the delimiter is difficult. Escaping could be done the UNIX way, using a backslash '\', or the Microsoft way, using quotes which then must be escaped, too. This is not a trivial task.
So my suggestion is to get full documentation from whoever generates the file you want to convert. Then you can use one of the approaches suggested in the other answers or some variant.
Edit:
Python provides csv.Sniffer that can help you deduce the format of your DSV. If your input looks like this (note the quoted delimiter in the first field of the second row):
a|b|c
"a|b"|c|d
foo|"bar|baz"|qux
You can do this:
import csv

csvfile = open("csvfile.csv")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.DictReader(csvfile, dialect=dialect)
for row in reader:
    print row,
    # => {'a': 'a|b', 'c': 'd', 'b': 'c'} {'a': 'foo', 'c': 'qux', 'b': 'bar|baz'}

# write records using other dialect
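A hedged sketch of that last step, re-reading the input and writing it back out with the default comma dialect (the output file name is an assumption; this is Python 2, so the output is opened in binary mode):

csvfile.seek(0)
reader = csv.DictReader(csvfile, dialect=dialect)
with open("out.csv", "wb") as out:
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)  # default comma dialect
    writer.writeheader()
    writer.writerows(reader)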
Your strategy could be the following:
parse the file with BOTH a tab-separated csv reader and a pipe-separated csv reader
calculate some statistics on resulting rows to decide which resultset is the one you want to write. An idea could be counting the total number of fields in the two recordset (expecting that tab and pipe are not so common). Another one (if your data is strongly structured and you expect the same number of fields in each line) could be measuring the standard deviation of number of fields per line and take the record set with the smallest standard deviation.
In the following example you find the simpler statistic (total number of fields)
import csv

piperows = []
tabrows = []

# parsing | delimiter
f = open("file", "rb")
readerpipe = csv.reader(f, delimiter="|")
for row in readerpipe:
    piperows.append(row)
f.close()

# parsing TAB delimiter
f = open("file", "rb")
readertab = csv.reader(f, delimiter="\t")
for row in readertab:  # the original looped over readerpipe here by mistake
    tabrows.append(row)
f.close()

# in this example, we use the total number of fields as the indicator
# (but it's not guaranteed to work! it depends on the nature of your data)
totfieldspipe = reduce(lambda x, y: x + y, [len(f) for f in piperows])
totfieldstab = reduce(lambda x, y: x + y, [len(f) for f in tabrows])

if totfieldspipe > totfieldstab:
    yourrows = piperows
else:
    yourrows = tabrows

# the var yourrows contains the rows; now just write them in any format you like
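And a hedged sketch of the standard-deviation statistic mentioned above, in pure Python since this is Python 2.6 (smaller spread in fields per line suggests the right delimiter):

def stddev_of_field_counts(rows):
    # standard deviation of the number of fields per row
    counts = [len(r) for r in rows]
    mean = float(sum(counts)) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return variance ** 0.5

if stddev_of_field_counts(piperows) < stddev_of_field_counts(tabrows):
    yourrows = piperows
else:
    yourrows = tabrows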
Like this
from __future__ import with_statement
import csv
import re
with open(input, "r") as source:
    with open(output, "wb") as destination:
        writer = csv.writer(destination)
        for line in source:  # the original looped over input (the file name) by mistake
            writer.writerow(re.split('[\t|]', line))
I would suggest taking some of the example code from the existing answers, or perhaps better use the csv module from python and change it to first assume tab separated, then pipe separated, and produce two output files which are comma separated. Then you visually examine both files to determine which one you want and pick that.
If you actually have lots of files, then you need to try to find a way to detect which file is which.
One of the examples has this:
if "|" in line:
This may be enough: if the first line of a file contains a pipe, then maybe the whole file is pipe separated, else assume a tab separated file.
Alternatively fix the file to contain a key field in the first line which is easily identified - or maybe the first line contains column headers which can be detected.
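For example, a minimal sketch of header-based detection (the expected column names are hypothetical):

EXPECTED_HEADER = ["id", "name", "value"]  # hypothetical known column names

def detect_delimiter(path, candidates=("\t", "|")):
    # pick the candidate that splits the first line into the expected header
    first_line = open(path).readline().rstrip("\r\n")
    for sep in candidates:
        if first_line.split(sep) == EXPECTED_HEADER:
            return sep
    return None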
for line in open("file"):
line=line.strip()
if "|" in line:
print ','.join(line.split("|"))
else:
print ','.join(line.split("\t"))
