I have a gz file with the first 5 columns delimited by a backslash and the next 5 delimited by a comma. I'm reading in the file as follows:
with gzip.open(myfile, 'r') as fin:
    for line in fin:
        print line
The data looks like this:
a\b\c\d\e,f,g,h,i,j
How can I convert the backslashes into commas, so that it looks like this?
a,b,c,d,e,f,g,h,i,j
I've tried:
>>> g = 'a\b\c\d\e,f,g,h,i,j'
>>> g2 = g.replace('\\', ',')
>>> g2
'a\x08,c,d,e,f,g,h,i,j'
Reading the string in its raw format solves the problem:
>>> g = r'a\b\c\d\e,f,g,h,j,k'
>>> g.replace('\\', ',')
'a,b,c,d,e,f,g,h,j,k'
But how would I read lines from a gzip'd file as raw strings?
Just read it like you're already reading it. Reading from files doesn't apply string literal escape processing. String literal escape processing only applies to string literals.
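For example, a minimal sketch (Python 3; the file name is a stand-in): each backslash read from the file is already a single literal character, so a plain replace works:
import gzip

with gzip.open('myfile.gz', 'rt') as fin:  # 'rt' decodes bytes to text lines
    for line in fin:
        # '\\' in source code is one backslash character; no raw strings needed
        print(line.rstrip('\n').replace('\\', ','))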
There are also the .read() and .readlines() methods, though I don't know whether they work on gz files; you could try those. Raw strings like r'content' only apply to literals in source code. The real problem in your experiment is that \b, \t, \n, \u2003, and others are escape sequences in string literals, and .replace('\', ',') doesn't even parse, because the lone backslash escapes the closing quote.
I need to prepend a comma-containing string to a CSV file using Python. Some say enclosing the string in double quotes escapes the commas within. This does not work for me. How do I write this string without the commas being recognized as separators?
string = "WORD;WORD 45,90;WORD 45,90;END;"
with open('doc.csv') as f:
    prepended = string + '\n' + f.read()

with open('doc.csv', 'w') as f:
    f.write(prepended)
So as you point out, you can typically quote the string as below. Is the system that reads these files not recognizing that syntax? If you use python's csv module it will handle the proper escaping:
import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)  # quoting is set on the writer
    writer.writerows(myIterable)
The quoted strings would look like:
"string1","string 2, with, commas"
Note if you have a quote character within your string it will be written as "" (two quote chars in a row):
"string1","string 2, with, commas, and "" a quote"
I need to process a .tsv file that has 1 million lines and then save it as a .txt file. I was able to do that successfully this way:
import csv
with open("data.tsv") as fd, open('pre_processed_data.txt', 'wb') as csvout:
rd = csv.reader(fd, delimiter="\t", quotechar='"')
csvout = csv.writer(csvout,delimiter='\t')
for row in rd:
csvout.writerow([row[1],row[2],row[3]])
However, beyond a certain point, unintended special characters crawl in along with the tabs. The first column should contain only numeric values between 0 and 1, yet special characters show up in between. What could be causing this, and how can I resolve it effectively?
These extra characters exist in the input file. As you have no control over the file, the easiest thing to do is to remove them as you process the data. The re module's sub function can do this:
>>> import re
>>> s = '1#'
>>> re.sub(r'\D+', '', s)
'1'
The r'\D+' pattern matches any run of non-digit characters, which re.sub removes from the provided string.
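Applied to your loop, a sketch (Python 3 file handling; note that r'\D+' would also strip the decimal point from values like 0.4533, so the pattern here is widened to keep digits and dots, an assumption about your data):
import csv
import re

with open("data.tsv") as fd, open('pre_processed_data.txt', 'w', newline='') as out:
    rd = csv.reader(fd, delimiter="\t", quotechar='"')
    wr = csv.writer(out, delimiter='\t')
    for row in rd:
        # keep digits and '.', drop stray characters such as '#'
        cleaned = re.sub(r'[^\d.]+', '', row[1])
        wr.writerow([cleaned, row[2], row[3]])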
Working in Python 3.7.
I'm currently pulling data from an API (Qualys's API, fetching a report) to be specific. It returns a string with all the report data in a CSV format with each new line designated with a '\r\n' escape.
(i.e. 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n')
The problem I'm having is writing this string properly to a CSV file. Every iteration of code I've tried writes the data cell by cell (when viewed in Excel) with the \r\n appended wherever it was in the string, all on one row rather than on new lines.
(i.e |foo|bar|stuff\r\n|more stuff|data|report\r\n|etc|etc|etc\r\n|)
I'm just making the switch from 2 to 3, so I'm almost positive it's a syntax error or a gap in my understanding of how Python 3 handles newline delimiters, but even after reviewing the documentation, questions here, and blog posts, I either can't get my head around it or I'm consistently missing something.
current code:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res))  # returns string
    #input('pause')
    f_csv = open(title, 'w', newline='\r\n')
    f_csv.write(res)
    f_csv.close()  # was f_csv.close, which never actually closes the file
but I've also tried:
with open(title, 'w', newline='\r\n') as f:
    writer = csv.writer(f, <tried encoding here, no luck>)
    writer.writerows(res)
    # anyone else looking at this: this didn't work because of the difference
    # between writerow() and writerows()
and I've also tried various ways to declare newline, such as:
newline=''
newline='\n'
etc...
and various other iterations along these lines. Any suggestions or guidance or... anything at this point would be awesome.
edit:
Ok, I've continued to work on it, and this kinda works:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res))  # returns string
    reader = csv.reader(res.split(r'\r\n'), delimiter=',')
    with open(title, 'w') as outfile:
        writer = csv.writer(outfile, delimiter='\n')
        writer.writerow(reader)
But it's ugly, and it creates errors in the output CSV (some rows, fewer than 1%, don't parse as a CSV row, probably a formatting error somewhere), and more concerning, it behaves erratically when a "\" appears in the data.
I would really be interested in a solution that works... better? More pythonic? more consistently would be nice...
Any ideas?
Based on your comments, the data you're being served doesn't actually include carriage returns or newlines, it includes the text representing the escapes for carriage returns and newlines (so it really has a backslash, r, backslash, n in the data). It's otherwise already in the form you want, so you don't need to involve the csv module at all, just interpret the escapes to their correct value, then write the data directly.
This is relatively simple using the unicode-escape codec (which also handles ASCII escapes):
import codecs # Needed for text->text decoding
# ... retrieve data here, store to res ...
# Converts backslash followed by r to carriage return, by n to newline,
# and so on for other escapes
decoded = codecs.decode(res, 'unicode-escape')
# newline='' means don't perform line ending conversions, so you keep \r\n
# on all systems, no adding, no removing characters
# You may want to explicitly specify an encoding like UTF-8, rather than
# relying on the system default, so your code is portable across locales
with open(title, 'w', newline='') as f:
    f.write(decoded)
If the strings you receive are actually wrapped in quotes (so print(repr(s)) includes quotes on either end), it's possible they're intended to be interpreted as JSON strings. In that case, just replace the import and creation of decoded with:
import json
decoded = json.loads(res)
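One quick way to tell which case applies is to inspect the repr of the raw response before decoding (a diagnostic step, not part of either fix):
print(repr(res))
# Literal backslash-r-backslash-n text with no quote characters around the data
#   -> use the unicode-escape decoding above
# The data itself begins and ends with a quote character
#   -> it is probably a JSON string; use json.loads instead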
If I understand your question correctly, can't you just replace the string?
with open(title, 'w') as f: f.write(res.replace("\r\n", "\n"))
Check out this answer:
Python csv string to array
According to the csv.reader documentation, it handles \r\n line endings by default, so your string should work fine with it. Load the string into a csv.reader object and you can then export it the standard way.
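A minimal sketch of that suggestion, assuming the \r\n sequences in res are real line endings (the accepted answer above argues they are literal escape text, in which case decode first):
import csv

rows = csv.reader(res.splitlines())  # splitlines() handles \r\n endings
with open(title, 'w', newline='') as f:
    csv.writer(f).writerows(rows)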
Python strings use the single \n newline character. Normally, \r\n is converted to \n when a file is read, and on write \n is converted back to \n or \r\n, depending on your system default and the newline= parameter. In your case, the \r wasn't removed when you read the data from the web interface. When you opened the file with newline='\r\n', Python expanded the \n as it was supposed to, but the \r passed through, so your newline is now \r\r\n. You can see that by rereading the text file in binary mode:
>>> res = 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
>>> open('test', 'w', newline='\r\n').write(res)
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\r\n,more stuff,data,report\r\r\n,etc,etc,etc\r\r\n'
Since you already have the line endings you want, just write in binary mode and skip the conversions:
>>> open('test', 'wb').write(res.encode())
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
Notice I used the system default encoding, but you likely want to standardize on an encoding.
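For example, to pin the encoding explicitly (utf-8 chosen here as an assumption about your data):
with open('test', 'wb') as f:
    f.write(res.encode('utf-8'))  # explicit encoding instead of the system default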
I have a text file in this format:
b'Chapter 1 \xe2\x80\x93 BlaBla'
b'Boy\xe2\x80\x99s Dead.'
And I want to read those lines and convert them to
Chapter 1 - BlaBla
Boy's Dead.
and replace them on the same file.
I tried encoding and decoding already with print(line.encode("UTF-8", "replace")) and that didn't work
strings = [
    b'Chapter 1 \xe2\x80\x93 BlaBla',
    b'Boy\xe2\x80\x99s Dead.',
]

for string in strings:
    print(string.decode('utf-8', 'ignore'))
--output:--
Chapter 1 – BlaBla
Boy’s Dead.
and replace them on the same file.
There is no computer programming language in the world that can do that in place. You have to write the output to a new file, delete the old file, and rename the new file to the old file's name. However, Python's fileinput module can perform that process for you:
import fileinput as fi
import sys

with open('data.txt', 'wb') as f:
    f.write(b'Chapter 1 \xe2\x80\x93 BlaBla\n')
    f.write(b'Boy\xe2\x80\x99s Dead.\n')

with open('data.txt', 'rb') as f:
    for line in f:
        print(line)

with fi.input(files='data.txt',
              inplace=True,
              backup='.bak',
              mode='rb') as f:
    for line in f:
        string = line.decode('utf-8', 'ignore')
        print(string, end="")
~/python_programs$ python3.4 prog.py
b'Chapter 1 \xe2\x80\x93 BlaBla\n'
b'Boy\xe2\x80\x99s Dead.\n'
~/python_programs$ cat data.txt
Chapter 1 – BlaBla
Boy’s Dead.
Edit:
import fileinput as fi
import re

pattern = r"""
    \\          # Match a literal backslash...
    x           # ...followed by an x...
    [a-f0-9]{2} # ...followed by two hex digits
"""
repl = ''

with open('data.txt', 'w') as f:
    print(r"b'Chapter 1 \xe2\x80\x93 BlaBla'", file=f)
    print(r"b'Boy\xe2\x80\x99s Dead.'", file=f)

with open('data.txt') as f:
    for line in f:
        print(line.rstrip())  # Output goes to terminal window

with fi.input(files='data.txt',
              inplace=True,
              backup='.bak') as f:
    for line in f:
        line = line.rstrip()[2:-1]
        new_line = re.sub(pattern, repl, line, flags=re.X)
        print(new_line)  # Writes to file, not your terminal window
~/python_programs$ python3.4 prog.py
b'Chapter 1 \xe2\x80\x93 BlaBla'
b'Boy\xe2\x80\x99s Dead.'
~/python_programs$ cat data.txt
Chapter 1 BlaBla
Boys Dead.
Your file does not contain binary data, so you can read it (or write it) in text mode. It's just a matter of escaping things correctly.
Here is the first part:
print(r"b'Chapter 1 \xe2\x80\x93 BlaBla'", file=f)
Python converts certain backslash escape sequences inside a string literal into something else. One such sequence has the format:
\xNN #=> e.g. \xe2
In the source it is four characters long, but Python converts it into a single character.
However, I need each of the four characters to be written to the sample file I created. To keep python from converting the backslash escape sequence into one character, you can escape the beginning '\' with another '\':
\\xNN
But being lazy, I didn't want to go through your strings and escape each backslash escape sequence by hand, so I used:
r"...."
In an r string, backslashes are not treated as escape characters; they pass through literally. As a result, Python writes all four characters of the \xNN sequence to the file.
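A quick demonstration of the difference:
s = "\xe2"   # escape processed: one character
r = r"\xe2"  # raw string: four characters, backslash included
print(len(s), len(r))  # prints: 1 4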
The next problem is replacing a backslash in a string using a regex, which I think was your problem to begin with. When a file contains a \, Python reads it into a string as a single literal backslash, which a string literal would spell \\. As a result, if the file contains the four characters:
\xe2
Python reads them into the string:
"\\xe2"
which when printed looks like:
\xe2
The bottom line: if you can see a '\' when you print a string, it is a genuine backslash character in that string (one that a string literal would escape as '\\'). To see what's really inside a string, you should always use repr().
string = "\\xe2"
print(string)
print(repr(string))
--output:--
\xe2
'\\xe2'
Note that if the output has quotes around it, then you are seeing everything in the string. If the output doesn't have quotes around it, then you can't be sure exactly what's in the string.
To construct a regex pattern that matches a literal backslash in a string, the short answer is: you need double the number of backslashes you would think. With the string:
"\\xe2"
you would think that the pattern would be:
pattern = "\\x"
but based on the doubling rule, you actually need:
pattern = "\\\\x"
And remember r strings? If you use an r string for the pattern, you can write what seems natural, and the raw string's literal backslashes give you the doubling:
pattern = r"\\x"  #=> equivalent to "\\\\x"
I have an Excel file that I converted to a text file with a list of numbers.
test = 'filelocation.txt'
in_file = open(test, 'r')

for line in in_file:
    print line
1.026106236
1.660274766
2.686381002
4.346655769
7.033036771
1.137969254
a = []
for line in in_file:
    a.append(line)

print a
'1.026106236\r1.660274766\r2.686381002\r4.346655769\r7.033036771\r1.137969254'
I wanted to assign each value (each line) to its own element of the list. Instead it creates a single element with the values separated by \r. I'm not sure what \r is or why it is being put there. I think I know a way to strip the \r from the string, but I want to fix the problem at its source.
To accept any of \r, \n, \r\n as a newline, you can use the 'U' (universal newline) file mode:
>>> open('test_newlines.txt', 'rb').read()
'a\rb\nc\r\nd'
>>> list(open('test_newlines.txt'))
['a\rb\n', 'c\r\n', 'd']
>>> list(open('test_newlines.txt', 'U'))
['a\n', 'b\n', 'c\n', 'd']
>>> open('test_newlines.txt').readlines()
['a\rb\n', 'c\r\n', 'd']
>>> open('test_newlines.txt', 'U').readlines()
['a\n', 'b\n', 'c\n', 'd']
>>> open('test_newlines.txt').read().split()
['a', 'b', 'c', 'd']
If you want to get a numeric (float) array from the file, see Reading file string into an array (In a pythonic way).
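For instance, a short sketch using 'U' mode as above (file name from your question):
values = [float(line) for line in open('filelocation.txt', 'U') if line.strip()]
print(values)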
Use rstrip(), or rstrip('\r') if you're sure that the last character is always \r:
for line in in_file:
    print line.rstrip()
help on str.rstrip():
S.rstrip([chars]) -> string or unicode
Return a copy of the string S with trailing whitespace removed.
If chars is given and not None, remove characters in chars instead.
If chars is unicode, S will be converted to unicode before stripping
str.strip() removes both leading and trailing whitespace.
You can strip the carriage returns and newlines from the line by using strip()
line.strip()
i.e.
for line in in_file:
    a.append(line.strip())

print a
To fix this do:
for line in in_file:
    a.append(line.strip())
.strip() the lines to remove the whitespace that you don't need:
lines = []

with open('filelocation.txt', 'r') as handle:
    for line in handle:
        line = line.strip()
        lines.append(line)
        print line

print lines
Also, I'd advise that you use the with ... notation to open a file. It's cleaner and closes the file automatically.
First, I generally like @J.F. Sebastian's answer, but my use case is closer to Python 2.7.1: How to Open, Edit and Close a CSV file, since my string came from a text file that was output from Excel as a CSV and was furthermore read in using the csv module. As indicated at that question:
as for the 'rU' vs 'rb' vs ..., csv files really should be binary so
use 'rb'. However, its not uncommon to have csv files from someone who
copied it into notepad on windows and later it was joined with some
other file so you have funky line endings. How you deal with that
depends on your file and your preference. – #kalhartt Jan 23 at 3:57
I'm going to stick with reading as 'rb', as recommended in the Python docs. For now, I know that the \r inside a cell is a result of quirks in how I'm using Excel, so I'll just create a global option for replacing '\r' with something else, which for now will be '\n' but could later become '' (an empty string, not a double quote) with a simple json change.
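A sketch of that global option (Python 2, matching the 'rb' advice quoted above; the names here are mine, hypothetical, not an existing config):
import csv

CARRIAGE_REPL = '\n'  # global option; a simple json change could later make it ''

with open('input.csv', 'rb') as f:  # 'rb', per the quoted recommendation
    rows = [[cell.replace('\r', CARRIAGE_REPL) for cell in row]
            for row in csv.reader(f)]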