Writing a string to CSV using line escapes in python 3 - python

Working in Python 3.7.
I'm currently pulling data from an API (Qualys's API, fetching a report) to be specific. It returns a string with all the report data in a CSV format with each new line designated with a '\r\n' escape.
(i.e. 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n')
The problem I'm having is writing this string properly to a CSV file. Every iteration of code I've tried writes the data cell by cell when viewed in Excel, all on one row, with the \r\n left wherever it was in the string rather than starting a new line.
(i.e |foo|bar|stuff\r\n|more stuff|data|report\r\n|etc|etc|etc\r\n|)
I'm just making the switch from 2 to 3, so I'm almost positive it's a syntax error or a gap in my understanding of how Python 3 handles newline delimiters, but even after reviewing the documentation, answers here, and blog posts, I either can't get my head around it or I'm consistently missing something.
current code:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res))  # returns string
    # input('pause')
    f_csv = open(title, 'w', newline='\r\n')
    f_csv.write(res)
    f_csv.close()
but i've also tried:
with open(title, 'w', newline='\r\n') as f:
    writer = csv.writer(f,<tried encoding here, no luck>)
    writer.writerows(res)
    #anyone else looking at this, this didn't work because of the difference
    #between writerow() and writerows()
and I've also tried various ways to declare newline, such as:
newline=''
newline='\n'
etc...
and various other iterations along these lines. Any suggestions or guidance or... anything at this point would be awesome.
edit:
Ok, I've continued to work on it, and this kinda works:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res))  # returns string
    reader = csv.reader(res.split(r'\r\n'), delimiter=',')
    with open(title, 'w') as outfile:
        writer = csv.writer(outfile, delimiter= '\n')
        writer.writerow(reader)
But it's ugly, and it does create errors in the output CSV (fewer than 1% of rows don't parse as a CSV row, probably a formatting error somewhere). More concerning is that it behaves oddly when a "\" appears in the data.
I would really be interested in a solution that works... better? More pythonic? more consistently would be nice...
Any ideas?

Based on your comments, the data you're being served doesn't actually include carriage returns or newlines, it includes the text representing the escapes for carriage returns and newlines (so it really has a backslash, r, backslash, n in the data). It's otherwise already in the form you want, so you don't need to involve the csv module at all, just interpret the escapes to their correct value, then write the data directly.
This is relatively simple using the unicode-escape codec (which also handles ASCII escapes):
import codecs # Needed for text->text decoding
# ... retrieve data here, store to res ...
# Converts backslash followed by r to carriage return, by n to newline,
# and so on for other escapes
decoded = codecs.decode(res, 'unicode-escape')
# newline='' means don't perform line ending conversions, so you keep \r\n
# on all systems, no adding, no removing characters
# You may want to explicitly specify an encoding like UTF-8, rather than
# relying on the system default, so your code is portable across locales
with open(title, 'w', newline='') as f:
    f.write(decoded)
If the strings you receive are actually wrapped in quotes (so print(repr(s)) includes quotes on either end), it's possible they're intended to be interpreted as JSON strings. In that case, just replace the import and creation of decoded with:
import json
decoded = json.loads(res)
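As a rough end-to-end check of the unicode-escape approach, the sketch below decodes a made-up payload containing literal backslash escapes and confirms which bytes land on disk. The helper name write_escaped_csv and the sample string are illustrative assumptions, not part of the Qualys API. One caveat worth testing on real data: unicode-escape decoding can mangle non-ASCII characters, so this sketch assumes an ASCII payload.

```python
import codecs
import os
import tempfile


def write_escaped_csv(res, path):
    """Hypothetical helper: turn literal backslash escapes into real line endings and save."""
    # Interpret backslash escapes (backslash + r, backslash + n, ...) as real characters.
    decoded = codecs.decode(res, 'unicode-escape')
    # newline='' disables line-ending translation, so CR LF is written unchanged.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        f.write(decoded)
    return decoded


# Made-up sample: the backslashes are literal characters here, not line breaks.
sample = 'foo,bar,stuff\\r\\nmore stuff,data,report\\r\\n'

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'report.csv')
    write_escaped_csv(sample, out_path)
    # Read back in binary mode to see exactly what was written.
    with open(out_path, 'rb') as f:
        raw = f.read()

print(raw)
```

Reading the file back in binary mode is the quickest way to confirm that each row ends in a single CR LF pair.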

If I understand your question correctly, can't you just replace the string?
with open(title, 'w') as f: f.write(res.replace("\r\n","\n"))

Check out this answer:
Python csv string to array
According to CSVReader's documentation, it expects \r\n as the line delimiter by default. Your string should work fine with it. If you load the string into the CSVReader object, then you should be able to check for the standard way to export it.

Python strings use the single \n newline character. Normally, \r\n is converted to \n when a file is read, and on write \n is converted to \n or \r\n depending on your system default and the newline= parameter.
In your case, the \r wasn't removed when you read the data from the web interface. When you opened the file with newline='\r\n', Python expanded the \n as it was supposed to, but the \r passed through, so your newline is now \r\r\n. You can see that by rereading the text file in binary mode:
>>> res = 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
>>> open('test', 'w', newline='\r\n').write(res)
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\r\n,more stuff,data,report\r\r\n,etc,etc,etc\r\r\n'
Since you already have the line endings you want, just write in binary mode and skip the conversions:
>>> open('test', 'wb').write(res.encode())
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
Note that str.encode() with no argument uses UTF-8 in Python 3 rather than the system default, so pass the encoding explicitly if whatever reads the file expects a different one.
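To make the translation rules above concrete, here is a small sketch that writes the same string with two different newline= settings and compares the bytes on disk. The helper bytes_written and the sample data are made up for illustration.

```python
import os
import tempfile


def bytes_written(text, newline):
    """Write text with the given newline= setting and return the raw bytes on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'test.csv')
        with open(path, 'w', newline=newline, encoding='utf-8') as f:
            f.write(text)
        with open(path, 'rb') as f:
            return f.read()


res = 'foo,bar\r\nbaz,qux\r\n'

# newline='\r\n': every '\n' is expanded to CR LF, and the existing '\r' stays put.
doubled = bytes_written(res, '\r\n')
# newline='': no translation at all, so the data is written exactly as given.
untouched = bytes_written(res, '')

print(doubled)
print(untouched)
```

With newline='' the text-mode write produces the same bytes as the binary-mode write, while keeping the encoding explicit in the open() call.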

Related

Odd characters showing when writing the text obtained through web scraping into a csv file [duplicate]

I'm attempting to extract article information using the python newspaper3k package and then write to a CSV file. While the info is downloaded correctly, I'm having issues with the output to CSV. I don't think I fully understand unicode, despite my efforts to read about it.
from newspaper import Article, Source
import csv
first_article = Article(url="http://www.bloomberg.com/news/articles/2016-09-07/asian-stock-futures-deviate-as-s-p-500-ends-flat-crude-tops-46")
first_article.download()
if first_article.is_downloaded:
    first_article.parse()
    first_article.nlp()  # nlp is a method; without the parentheses it is never called
article_array = []
collate = {}
collate['title'] = first_article.title
collate['content'] = first_article.text
collate['keywords'] = first_article.keywords
collate['url'] = first_article.url
collate['summary'] = first_article.summary
print(collate['content'])
article_array.append(collate)
keys = article_array[0].keys()
# The with block closes the file on exit, so no explicit close() is needed
with open('bloombergtest.csv', 'w') as output_file:
    csv_writer = csv.DictWriter(output_file, keys)
    csv_writer.writeheader()
    csv_writer.writerows(article_array)
When I print collate['content'], which is first_article.text, the console outputs the article's content just fine. Everything shows up correctly, apostrophes and all. When I write to the CSV, the content cell text has odd characters in it. For example:
“At the end of the day, Europe’s economy isn’t in great shape, inflation doesn’t look exciting and there are a bunch of political risks to reckon with.
So far I have tried:
with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file:
to no avail. I also tried utf-16 instead of utf-8, but that just resulted in the cells being written in an odd order. It didn't create the cells correctly in the CSV, although the output looked correct. I've also tried calling .encode('utf-8') on various variables, but nothing has worked.
What's going on? Why would the console print the text correctly, while the CSV file has odd characters? How can I fix this?
Add encoding='utf-8-sig' to open(). Excel requires the UTF-8-encoded BOM code point (Byte Order Mark, U+FEFF) signature to interpret a file as UTF-8; otherwise, it assumes the default localized encoding.
Changing with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file: to with open('bloombergtest.csv', 'w', encoding='utf-8-sig') as output_file:, worked, as recommended by Leon and Mark Tolonen.
That's most probably a problem with the software that you use to open or print the CSV file: it doesn't "understand" that the CSV is encoded in UTF-8 and assumes ASCII, Latin-1 (ISO-8859-1), or a similar encoding for it.
You can aid that software in recognizing the CSV file's encoding by placing a BOM sequence in the beginning of your file (which, in general, is not recommended for UTF-8).
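A minimal sketch of the utf-8-sig suggestion follows. It writes a row containing curly apostrophes and checks that the file starts with the UTF-8 BOM. The helper name write_excel_friendly_csv and the sample row are assumptions for illustration; only the encoding='utf-8-sig' argument comes from the answers above.

```python
import codecs
import csv
import os
import tempfile


def write_excel_friendly_csv(path, rows, fieldnames):
    """Write dict rows as UTF-8 with a leading BOM so Excel can detect the encoding."""
    # 'utf-8-sig' prepends the BOM on write; newline='' is what the csv docs recommend.
    with open(path, 'w', encoding='utf-8-sig', newline='') as output_file:
        writer = csv.DictWriter(output_file, fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Made-up row using the right single quotation mark (U+2019) as an apostrophe.
rows = [{'title': 'Markets', 'content': 'Europe\u2019s economy isn\u2019t in great shape'}]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'bloombergtest.csv')
    write_excel_friendly_csv(csv_path, rows, ['title', 'content'])
    with open(csv_path, 'rb') as f:
        raw = f.read()

# True when the file begins with the three BOM bytes EF BB BF.
has_bom = raw.startswith(codecs.BOM_UTF8)
print(has_bom)
```

When reading such a file back in Python, opening it with encoding='utf-8-sig' strips the BOM again so it does not end up in the first header name.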

remove double quotes in each row (csv writer)

I'm writing API results to CSV file in python 3.7. Problem is it adds double quotes ("") to each row when it writes to file.
I'm passing format as csv to API call, so that I get results in csv format and then I'm writing it to csv file, store to specific location.
Please suggest if there is any better way to do this.
Here is the sample code..
with open(target_file_path, 'w', encoding='utf8') as csvFile:
    writer = csv.writer(csvFile, quoting=csv.QUOTE_NONE, escapechar='\"')
    for line in rec.split('\r\n'):
        writer.writerow([line])
When I use escapechar='\"' it adds (") at the end of every column value.
here is sample records..
2264855868",42.38454",-71.01367",07/15/2019 00:00:00",07/14/2019 20:00:00"
2264855868",42.38454",-71.01367",07/15/2019 01:00:00",07/14/2019 21:00:00"
The API gives you a string/bytes payload that you can write directly to a file.
data = requests.get(..).content
open(filename, 'wb').write(data)
With csv.writer you would have to convert the string/bytes to Python data using csv.reader and then convert it back to string/bytes with csv.writer, so there is no point in doing that.
The same method should work if the API sends any kind of file: JSON, CSV, XML, PDF, images, audio, etc.
For bigger files you could use chunk/stream in requests. Doc: requests - Advanced Usage
Have you tried removing the backward-slash from escapechar='\"'? It shouldn't be necessary, since you are using single quotes for the string.
EDIT: From the documentation:
A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False. On reading, the escapechar removes any special meaning from the following character.
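And the delimiter:
A one-character string used to separate fields. It defaults to ','
So it is going to escape the delimiter (,) with whatever you set as the escapechar, in this case ", which is why every comma in your output is preceded by a double quote.
If you don't want any escaping, try leaving escapechar unset. Note, though, that with quoting=csv.QUOTE_NONE the writer raises csv.Error when a field contains the delimiter and no escapechar is set, so each line would also need to be split into real fields first.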
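The escaping behaviour described above can be reproduced in memory. This is a sketch under a few assumptions: write_row is a made-up helper, the sample line is shortened from the question, and a backslash is used as the escape character (instead of the question's double quote) to keep it distinct from the default quotechar.

```python
import csv
import io


def write_row(fields, **fmt):
    """Write one row with csv.writer into memory and return the resulting text."""
    buf = io.StringIO()
    csv.writer(buf, **fmt).writerow(fields)
    return buf.getvalue()


line = '2264855868,42.38454,-71.01367'

# The whole line is passed as ONE field, so each comma inside it gets escaped.
escaped = write_row([line], quoting=csv.QUOTE_NONE, escapechar='\\')

# Without an escapechar, QUOTE_NONE has no way to protect the commas.
try:
    write_row([line], quoting=csv.QUOTE_NONE)
    failed = False
except csv.Error:
    failed = True

# Splitting into real fields first means nothing needs escaping.
clean = write_row(line.split(','), quoting=csv.QUOTE_NONE)

print(repr(escaped), failed, repr(clean))
```

Since the API already returns CSV text, writing the payload straight to the file remains the simpler route; the split-then-write variant is only useful if the rows need processing on the way through.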
Try:
import codecs
def find_replace(file, search_characters, replace_with):
    text = codecs.open(file, "r", "utf-8-sig")
    text = ''.join([i for i in text]).replace(
        search_characters, replace_with)
    x = codecs.open(file, "w", "utf-8-sig")
    x.writelines(text)
    x.close()
if __name__ == '__main__':
    file = "target_file_path"
    search_characters = '"'
    replace_with = ''
    find_replace(file, search_characters, replace_with)
output:
2264855868,42.38454,-71.01367,07/15/2019 00:00:00,07/14/2019 20:00:00
2264855868,42.38454,-71.01367,07/15/2019 01:00:00,07/14/2019 21:00:00

Converting .tsv file to .txt creates unintended characters, possible fix?

I need to process a .tsv file that has 1 million lines and then save the result as a .txt file. I am able to do that this way:
import csv
with open("data.tsv") as fd, open('pre_processed_data.txt', 'wb') as csvout:
    rd = csv.reader(fd, delimiter="\t", quotechar='"')
    csvout = csv.writer(csvout,delimiter='\t')
    for row in rd:
        csvout.writerow([row[1],row[2],row[3]])
However, beyond a certain point, unintended special characters creep in along with the tabs.
The first column should contain only numeric values between 0 and 1, but special characters appear in between.
What is possibly causing this and how to effectively resolve this?
These extra characters exist in the input file. As you have no control over the file, the easiest thing to do is to remove them as you process the data. The re module's sub function can do this:
>>> import re
>>> s = '1#'
>>> re.sub(r'\D+', '', s)
'1'
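The r'\D+' pattern matches any run of non-digit characters and removes it from the provided string. Note that \D also matches the decimal point, so for values such as 0.75 a character class like r'[^0-9.]+' is a safer choice.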
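Because the first column is expected to hold values between 0 and 1, a pattern that keeps the decimal point may be closer to what is needed. The sketch below is one possible variation, not the answerer's code: clean_score and clean_row are hypothetical helpers, and the rule "keep digits and dots only" is an assumption about what counts as valid data.

```python
import re

# Keep digits and the decimal point; drop everything else (assumed cleaning rule).
NON_NUMERIC = re.compile(r'[^0-9.]+')


def clean_score(value):
    """Hypothetical helper: strip stray characters from a numeric field."""
    return NON_NUMERIC.sub('', value)


def clean_row(row):
    """Clean only the first column and leave the remaining columns untouched."""
    return [clean_score(row[0])] + list(row[1:])


print(clean_score('0.75#'))
print(clean_row(['1#', 'some text', 'more text']))
```

In the conversion loop, clean_row(...) would be applied to each selected row just before it is passed to writerow().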

Why is 'rblabla' a valid csv file mode?

I have a csv file and a function to read it.
I can open it in many ways, most of these modes produce similar results.
def read(mode):
    with open("file.csv", mode) as inf:
        reader = csv.reader(inf)
        for row in reader:
            print row
read('r') #prints \r\n characters
read('rb') #prints \r\n characters
read('rU') #prints \n characters but not \r characters
read('rblabla') #WAT.
I am wondering why the last example is allowed. It produces the same results as normal read mode.
Is there any reason why it works this way?
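The mode is not for the csv reader but for Python's built-in file handling. Python 2 only requires the mode to begin with 'r', 'w' or 'a', after stripping U. This is documented here, and applies to Python 2.5 and later 2.x releases. Python 3's open() validates the whole mode string and raises ValueError for a mode like 'rblabla'.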
The mode is an attribute of the file handler, and may be used by other applications, hence it may contain more characters.
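The question's code is Python 2. For comparison, the following sketch checks how Python 3's built-in open() treats the same mode strings; mode_is_accepted is a made-up helper that creates a small throwaway file for the test.

```python
import os
import tempfile


def mode_is_accepted(mode):
    """Return True if open() accepts the mode string, False if it raises ValueError."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'file.csv')
        # Create a small file so that read modes have something to open.
        with open(path, 'w', newline='') as f:
            f.write('a,b\r\n')
        try:
            with open(path, mode):
                return True
        except ValueError:
            return False


print(mode_is_accepted('r'))
print(mode_is_accepted('rb'))
print(mode_is_accepted('rblabla'))
```

Under Python 3 the last call reports that the mode is rejected, so the surprise in the question is specific to Python 2.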

using txt file as input for python

I have a python program that requires the user to paste texts into it to process them to the various tasks. Like this:
line=(input("Paste text here: ")).lower()
The pasted text comes from a .txt file. To avoid any issues with the code (since the text contains multiple quotation marks), the user has to do the following: type 3 quotation marks, paste the text, and type 3 quotation marks again.
Can all of the above be avoided by having python read the .txt? and if so, how?
Please let me know if the question makes sense.
In Python2, just use raw_input to receive input as a string. No extra quotation marks on the part of the user are necessary.
line=(raw_input("Paste text here: ")).lower()
Note that input is equivalent to
eval(raw_input(prompt))
and applying eval to user input is dangerous, since it allows the user to evaluate arbitrary Python expressions. A malicious user could delete files or even run arbitrary functions so never use input in Python2!
In Python3, input behaves like raw_input, so there your code would have been fine.
If instead you'd like the user to type the name of the file, then
filename = raw_input("Text filename: ")
with open(filename, 'r') as f:
    line = f.read()
Troubleshooting:
Ah, you are using Python3 I see. When you open a file in r mode, Python tries to decode the bytes in the file into a str. If no encoding is specified, it uses locale.getpreferredencoding(False) as the default encoding. Apparently that is not the right encoding for your file. If you know what encoding your file is using, it is best to supply it with the encoding parameter:
open(filename, 'r', encoding=...)
Alternatively, a hackish approach which is not nearly as satisfying is to ignore decoding errors:
open(filename, 'r', errors='ignore')
A third option would be to read the file as bytes:
open(filename, 'rb')
Of course, this has the obvious drawback that you'd then be dealing with bytes like \x9d rather than characters like ·.
Finally, if you'd like some help guessing the right encoding for your file, run
with open(filename, 'rb') as f:
    contents = f.read()
print(repr(contents))
and post the output.
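The options above can be compared side by side. This sketch writes a made-up Windows-1252 file and reads it back with the correct encoding, with errors='ignore', and with a mismatched strict decoder; read_text is a hypothetical helper and cp1252 is only an example encoding.

```python
import os
import tempfile


def read_text(path, encoding, errors='strict'):
    """Read a whole file as text using an explicit encoding and error policy."""
    with open(path, 'r', encoding=encoding, errors=errors) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    # Made-up sample saved as Windows-1252, where e-acute is the single byte 0xE9.
    with open(path, 'wb') as f:
        f.write('caf\u00e9'.encode('cp1252'))

    right = read_text(path, 'cp1252')                   # correct encoding
    lossy = read_text(path, 'utf-8', errors='ignore')   # undecodable byte is dropped
    try:
        read_text(path, 'utf-8')                        # wrong encoding, strict mode
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

print(ascii(right), lossy, strict_failed)
```

The errors='ignore' route silently loses the accented character, which is why supplying the real encoding is the better fix when it is known.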
You can use the following:
with open("file.txt") as fl:
    file_contents = [x.rstrip() for x in fl]
This will result in the variable file_contents being a list, where each element of the list is a line of your file with the newline character stripped off the end.
If you want to iterate over each line of the file, you can do this:
with open("file.txt") as fl:
    for line in fl:
        # Do something
        pass
The rstrip() method gets rid of whitespace at the end of a string, and it is useful for getting rid of the newline character.
