How can I convert this tuple
text = ('John', '"n"', '"ABC 123\nDEF, 456GH\nijKl"\r\n', '"Johny\nIs\nHere"')
to this CSV-formatted string
out = '"John", """n""", """ABC 123\\nDEF, 456\\nijKL\\r\\n""", """Johny\\nIs\\nHere"""'
or even one that omits the special chars at the end
out = '"John", """n""", """ABC 123\\nDEF, 456\\nijKL""", """Johny\\nIs\\nHere"""'
I came up with this monster:
out1 = ','.join(f'""{t}""' if t.startswith('"') and t.endswith('"')
                else f'"{t}"' for t in text)
out2 = out1.replace('\n', '\\n').replace('\r', '\\r')
You can get pretty close to what you want with the csv and io modules from the standard library:
use csv to correctly encode the delimiters and handle the quoting rules; it only writes to a file handle
use io.StringIO for that file handle to get the resulting CSV as a string
import csv
import io
f = io.StringIO()
text = ("John", '"n"', '"ABC 123\nDEF, 456GH\nijKl"\r\n', '"Johny\nIs\nHere"')
writer = csv.writer(f)
writer.writerow(text)
csv_str = f.getvalue()
csv_repr = repr(csv_str)
print("CSV_STR")
print("=======")
print(csv_str)
print("CSV_REPR")
print("========")
print(csv_repr)
and that prints:
CSV_STR
=======
John,"""n""","""ABC 123
DEF, 456GH
ijKl""
","""Johny
Is
Here"""
CSV_REPR
========
'John,"""n""","""ABC 123\nDEF, 456GH\nijKl""\r\n","""Johny\nIs\nHere"""\r\n'
csv_str is what you'd see if you wrote directly to a file opened for writing; it is true CSV.
csv_repr is roughly what you asked for when you showed us out, but not quite: your example included "doubly escaped" newlines (\\n) and carriage returns (\\r\\n). CSV doesn't need to escape those characters, because the entire field is quoted. If you need that, you'll have to do it yourself with something like:
csv_repr.replace(r"\r", r"\\r").replace(r"\n", r"\\n")
but again, that's not necessary for valid CSV.
Also, I don't know how to make the writer include an initial space before every field after the first field, like the spaces you show between "John" and "n" and then after "n" in:
out = 'John, """n""", ...'
The reader can be configured to expect and ignore an initial space, with Dialect.skipinitialspace, but I don't see any options for the writer.
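For completeness, a minimal sketch of the reader side (the sample input line is invented for illustration):

```python
import csv
import io

# skipinitialspace=True tells the reader to drop the single space that
# follows each delimiter before parsing the field (quoted or not).
data = 'John, """n""", "last"\r\n'
row = next(csv.reader(io.StringIO(data), skipinitialspace=True))
print(row)  # ['John', '"n"', 'last']
```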
I have a data file in which fields are enclosed in double quotes, with a field separator like below:
field enclosure = "<field_value>"
sep = ||####
Some of the field values contain quoted text with 'LF' and 'CR LF' line separators, which cause the rest of the value to continue on a new line. That may be misinterpreted as a new record, when in reality it is part of a single record that happens to be broken across lines.
example:
3||####14||####"2016-01-13 19:59:27"||####"2016-01-15 23:09:19"||####1162||####822||####1237||####\N||####"VHiujdfYshv"||####"---<LF>
...LF
"||####\N||####"2016-01-15 23:09:18"||####0||####1||####0||####0||####3||####1788||####\N||####205||####\N||####0||####\N||####\N||####\N||####\N||####\N||####\N||####1||####\N||####"251 Bgegf BHVcvytd Street<CR LF>
JHbsdbfh, RF 35214<CR LF>
<CR LF>
xyz#gmail.dhg.com<CR LF>
<CR LF>
####1788<LF>
4||####14||####"2016-01-25 22:08:53"||####"2016-02-15 20:32:08"||####1097||####933||####1262||####\N||####"VHiujdfYshv"||####"--- <LF>
...<LF>
Please note that the LF and CR LF actually show up without the angle brackets; I spell them out here for absolute clarity. Also note that my data uses '||####' as the field separator, with '\N' for the na_values.
Below is how I am reading this file so far. I tried to use the 'quotechar' and 'quoting' params of pd.read_csv, but those rely on the C parser, while my multi-character separator requires the Python parser, so the Python parser takes over and the quoting options don't help. How do I read this file? Should I process it before reading it as a CSV, or use some regex while reading it? Please help.
df = pd.read_csv(z.open(filename),
                 encoding='utf8',
                 header=None,
                 sep='\|\|####',
                 na_values='\\N',
                 engine='python')
You can use
import re

rx_lbr = r'[\r\n\x0B\x0C\u0085\u2028\u2029]+'
with open(filepath, 'r', newline="\n", encoding="utf-8") as fr:
    with open(savefilepath, 'w', newline="\n", encoding="utf-8") as fw:
        fw.write(re.sub(r'"[^"]*"', lambda x: re.sub(rx_lbr, '', x.group()), fr.read()))
The "[^"]*" regex with re.sub matches all non-overlapping substrings between double quotes, and the lambda x: re.sub(rx_lbr, '', x.group()) replacement removes all vertical-whitespace Unicode chars from the match only. All other line breaks remain unchanged.
See the Python demo:
import re
content = r"""3||####14||####"2016-01-13 19:59:27"||####"2016-01-15 23:09:19"||####1162||####822||####1237||####\N||####"VHiujdfYshv"||####"---
...
"||####\N||####"2016-01-15 23:09:18"||####0||####1||####0||####0||####3||####1788||####\N||####205||####\N||####0||####\N||####\N||####\N||####\N||####\N||####\N||####1||####\N||####"251 Bgegf BHVcvytd Street
JHbsdbfh, RF 35214
xyz#gmail.dhg.com
####1788
4||####14||####"2016-01-25 22:08:53"||####"2016-02-15 20:32:08"||####1097||####933||####1262||####\N||####"VHiujdfYshv"||####"---
"""
print( re.sub(r'"[^"]*"', lambda x: re.sub(r'[\r\n\x0B\x0C\u0085\u2028\u2029]+', '', x.group()), content) )
I'm trying to implement Vigenere's Cipher. I want to be able to obfuscate every single character in a file, not just alphabetic characters.
I think I'm missing something with the different types of encoding. I have made some test cases and some characters are getting replaced badly in the final result.
This is one test case:
,.-´`1234678abcde^*{}"¿?!"·$%&/\º
end
And this is the result I'm getting:
).-4`1234678abcde^*{}"??!"7$%&/:
end
As you can see, ',' is being replaced badly with ')' as well as some other characters.
My guess is that the others (for example, '¿' being replaced with '?') come from the original character not being in the range [0, 127], so it's normal that those change. But I don't understand why ',' is failing.
My intent is to obfuscate CSV files, so the ',' problem is the one I'm mainly concerned about.
In the code below, I'm using modulus 128, but I'm not sure if that's correct. To execute it, put a file named "OriginalFile.txt" in the same folder with the content to cipher and run the script. Two files will be generated, Ciphered.txt and Deciphered.txt.
"""
Attempt to implement Vigenere cipher in Python.
"""
import os

key = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
fileOriginal = "OriginalFile.txt"
fileCiphered = "Ciphered.txt"
fileDeciphered = "Deciphered.txt"

# CIPHER PHASE
if os.path.isfile(fileCiphered):
    os.remove(fileCiphered)
keyToUse = 0
with open(fileOriginal, "r") as original:
    with open(fileCiphered, "a") as ciphered:
        while True:
            c = original.read(1)  # read char
            if not c:
                break
            k = key[keyToUse]
            protected = chr((ord(c) + ord(k)) % 128)
            ciphered.write(protected)
            keyToUse = (keyToUse + 1) % len(key)
print("Cipher successful")

# DECIPHER PHASE
if os.path.isfile(fileDeciphered):
    os.remove(fileDeciphered)
keyToUse = 0
with open(fileCiphered, "r") as ciphered:
    with open(fileDeciphered, "a") as deciphered:
        while True:
            c = ciphered.read(1)  # read char
            if not c:
                break
            k = key[keyToUse]
            unprotected = chr((128 + ord(c) - ord(k)) % 128)  # +128 so that we don't get into negative numbers
            deciphered.write(unprotected)
            keyToUse = (keyToUse + 1) % len(key)
print("Decipher successful")
Assumption: you're trying to produce a new, valid CSV with the contents of cells enciphered via Vigenere, not to encipher the whole file.
In that case, you should check out the csv module, which will properly handle reading and writing CSV files for you (including cells whose value contains commas, which, as you've seen, can happen after you encipher a cell's contents). Very briefly, you can do something like:
import csv

with open("...", "r") as fpin, open("...", "w") as fpout:
    reader = csv.reader(fpin)
    writer = csv.writer(fpout)
    for row in reader:
        # row will be a list of strings, one per column in the row
        ciphered = [encipher(cell) for cell in row]
        writer.writerow(ciphered)
When using the csv module you should be aware of the notion of "dialects" -- ways that different programs (usually spreadsheet-like things, think Excel) handle CSV data. csv.reader() usually does a fine job of inferring the dialect you have in the input file, but you might need to tell csv.writer() what dialect you want for the output file. You can get the list of built-in dialects with csv.list_dialects() or you can make your own by creating a custom Dialect object.
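A short sketch of both options (the "pipes" dialect name and its parameters are made up for illustration):

```python
import csv
import io

# Built-in dialects shipped with the csv module.
print(csv.list_dialects())  # e.g. ['excel', 'excel-tab', 'unix']

# Register a custom dialect; unspecified parameters keep their defaults.
csv.register_dialect("pipes", delimiter="|", quoting=csv.QUOTE_ALL)

buf = io.StringIO()
csv.writer(buf, dialect="pipes").writerow(["a", "b,c"])
print(repr(buf.getvalue()))  # '"a"|"b,c"\r\n'
```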
I have created a dictionary whose keys are words encoded in UTF-8:
import os.path
import codecs
import pickle
from collections import Counter

wordDict = {}

def pathFilesList():
    source = 'StemmedDataset'
    retList = []
    for r, d, f in os.walk(source):
        for files in f:
            retList.append(os.path.join(r, files))
    return retList

# Parses a corpus: counts the frequency of each word and the date of
# the data (the date is the file name), then saves words as keys of
# the dictionary and a list of (date, freq) tuples as the value of
# each key.
def parsing():
    fileList = pathFilesList()
    for f in fileList:
        date_stamp = f[15:-4]
        print "Processing file: " + str(f)
        fileWordList = []
        fileWordSet = set()
        # One word per line, strip space. No empty lines.
        fw = codecs.open(f, mode='r', encoding='utf-8')
        fileWords = Counter(w for w in fw.read().split())
        # For each unique word, count occurrence and store in dict.
        for stemWord, stemFreq in fileWords.items():
            if stemWord not in wordDict:
                wordDict[stemWord] = [(date_stamp, stemFreq)]
            else:
                wordDict[stemWord].append((date_stamp, stemFreq))
        # Close file and do next.
        fw.close()

if __name__ == "__main__":
    # Parse all files and store in wordDict.
    parsing()
    output = open('data.pkl', 'wb')
    print "Dumping wordDict of size {0}".format(len(wordDict))
    pickle.dump(wordDict, output)
    output.close()
When I unpickle the pickled data and query the dictionary, lookups for alphabetical words always return False, even for words I'm sure are in the dictionary, yet numeric queries work fine. Here is how I unpickle the data and query it:
pkl_file = codecs.open('data.pkl', 'rb')
wd = pickle.load(pkl_file)
pprint.pprint(wd)  # to make sure the wd is correct and it has been created
print type(wd)  # making sure of the type of data structure
pkl_file.close()
# tried lots of other ways to query, like wd.has_key('some_encoded_word')
value = None
inputList = ['اندیمشک', '16', 'درحوزه']
for i in inputList:
    if i in wd:
        value = wd[i]
        print value
    else:
        print 'False'
Here is my output:
pa#pa:~/Desktop$ python unpickle.py
False
[('2000-05-07', 5), ('2000-07-05', 2)]
False
So I'm quite sure there's something wrong with the encoded words.
Your problem is that you're using codecs.open. That function is specifically for opening a text file and automatically decoding the data to Unicode. (As a side note, you usually want io.open, not codecs.open, even for that case.)
To open a binary file to be passed as bytes to something like pickle.load, just use the builtin open function, not codecs.open.
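A minimal sketch of the corrected round trip, using the file name and a sample word from the question (Python 3 syntax; the shape of the dictionary is illustrative):

```python
import pickle

# Pickle files are binary: plain open() in 'wb'/'rb' mode, no codecs.open.
wordDict = {u'اندیمشک': [('2000-05-07', 5)]}

with open('data.pkl', 'wb') as output:
    pickle.dump(wordDict, output)

with open('data.pkl', 'rb') as pkl_file:
    wd = pickle.load(pkl_file)

print(wd == wordDict)  # True
```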
Meanwhile, it usually makes things simpler to use Unicode strings throughout your program, and only use byte strings at the edges, decoding input as soon as you get it and encoding output as late as possible.
Also, you have literal Unicode characters in non-Unicode string literals. Never, ever, ever do this. That's the surest way to create invisible mojibake strings. Always use Unicode literals (like u'abc' instead of 'abc') when you have non-ASCII characters.
Plus, if you use non-ASCII characters in the source, always use a coding declaration (and of course make sure your editor is using the same encoding you put in the coding declaration).
Also, keep in mind that == (and dict lookup) may not do what you want for Unicode. If you have a word stored in NFC, and look up the same word in NFD, the characters won't be identical even though they represent the same string. Without knowing the exact strings you're using and how they're represented it's hard to know if this is a problem in your code, but it's pretty common with, e.g., Mac programs that get strings out of filenames or Cocoa GUI apps.
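A quick illustration of that pitfall with unicodedata.normalize, using 'é', which has both a precomposed and a decomposed form:

```python
import unicodedata

# Same rendered text, different code points: == and dict lookup both miss.
nfc = '\u00e9'     # single code point LATIN SMALL LETTER E WITH ACUTE
nfd = 'e\u0301'    # 'e' followed by COMBINING ACUTE ACCENT
print(nfc == nfd)  # False

d = {nfc: 1}
print(nfd in d)                                 # False
print(unicodedata.normalize('NFC', nfd) in d)   # True
```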
This is the Python script:
f = open('csvdata.csv', 'rb')
fo = open('out6.csv', 'wb')
for line in f:
    bits = line.split(',')
    bits[1] = '"input"'
    fo.write(','.join(bits))
f.close()
fo.close()
I have a CSV file and I'm replacing the content of the 2nd column with the string "input". However, I need to grab some information from that column content first.
The content might look like this:
failurelog_wl","inputfile/source/XXXXXXXX"; "**X_CORD2**"; "Invoice_2M";
"**Y_CORD42**"; "SIZE_ID37""
It has weird type of data as you can see, especially that it has 2 double quotes at the end of the line instead of just one that you would expect.
I need to extract the XCORD and YCORD information, like XCORD = 2 and YCORD = 42, before replacing the column value. I then want to insert an extra column, named X_Y, which represents (2_42).
How can I modify my script to do that?
If I understand your question correctly, you can use a simple regular expression to pull out the numbers you want:
import re

f = open('csvdata.csv', 'rb')
fo = open('out6.csv', 'wb')
for line in f:
    # Strip the newline first so the appended column stays on the same line.
    bits = line.rstrip('\r\n').split(',')
    x_y_matches = re.match('.*X_CORD(\d+).*Y_CORD(\d+).*', bits[1])
    assert x_y_matches is not None, 'Line had unexpected format: {0}'.format(bits[1])
    x_y = '({0}_{1})'.format(x_y_matches.group(1), x_y_matches.group(2))
    bits[1] = '"input"'
    bits.append(x_y)
    fo.write(','.join(bits) + '\n')
f.close()
fo.close()
Note that this will only work if column 2 always says 'X_CORD' and 'Y_CORD' immediately before the numbers. If it is sometimes a slightly different format, you'll need to adjust the regular expression to allow for that. I added the assert to give a more useful error message if that happens.
You mentioned wanting the column to be named X_Y. Your script appears to assume that there is no header, and my modified version definitely makes this assumption. Again, you'd need to adjust for that if there is a header line.
And, yes, I agree with the other commenters that using the csv module would be cleaner, in general, for reading and writing csv files.
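For what it's worth, a hedged sketch of the same job with the csv module (the sample input line is invented, and the writer re-quotes fields only where needed, so 'input' is written without literal quotes):

```python
import csv
import re

# Invented sample input for illustration only.
with open('csvdata.csv', 'w') as f:
    f.write('log1,stuff X_CORD2 more Y_CORD42 end,other\n')

with open('csvdata.csv', newline='') as fin, open('out6.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    for bits in csv.reader(fin):
        m = re.match(r'.*X_CORD(\d+).*Y_CORD(\d+).*', bits[1])
        x_y = '({0}_{1})'.format(m.group(1), m.group(2)) if m else ''
        bits[1] = 'input'
        writer.writerow(bits + [x_y])

result = open('out6.csv').read()
print(result)  # log1,input,other,(2_42)
```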
I am completely new to Python and struggling to make a simple thing work.
I am reading a bunch of information from a Web service, parsing the results, and I want to write it out into a flat-file. Most of my items are single line items, but one of the things I get back from my Web service is a paragraph. The paragraph will contain newlines, quotes, and any random characters.
I was going to use Python's csv module, but I'm unsure which parameters to use and how to escape my string so that the paragraph ends up on a single line and all characters (especially the delimiter) are guaranteed to be properly escaped.
The default csv.writer setup should handle this properly. Here's a simple example:
import csv
myparagraph = """
this is a long paragraph, with "quotes" and stuff.
"""
mycsv = csv.writer(open('foo.csv', 'wb'))
mycsv.writerow([myparagraph, 'word1'])
mycsv.writerow(['word2', 'word3'])
This yields the following csv file:
"
this is a long paragraph, with ""quotes"" and stuff.
",word1
word2,word3
Which should load into your favorite csv opening tool with no problems, as having two rows and two columns.
You don't have to do anything special. The CSV module takes care of the quoting for you.
>>> import csv
>>> from StringIO import StringIO
>>> s = StringIO()
>>> w = csv.writer(s)
>>> w.writerow(['the\nquick\t\r\nbrown,fox\\', 32])
>>> s.getvalue()
'"the\nquick\t\r\nbrown,fox\\",32\r\n'
>>> s.seek(0)
>>> r = csv.reader(s)
>>> next(r)
['the\nquick\t\r\nbrown,fox\\', '32']
To help with setting your expectations, the following is executable pseudocode for how the quoting etc works in the de-facto standard CSV output:
>>> def csv_output_record(input_row):
... delimiter = ','
... q = '"' # quotechar
... quotables = set([delimiter, '\r', '\n'])
... return delimiter.join(
... q + value.replace(q, q + q) + q if q in value
... else q + value + q if any(c in quotables for c in value)
... else value
... for value in input_row
... ) + '\r\n'
...
>>> csv_output_record(['foo', 'x,y,z', 'Jack "Ripper" Jones', 'top\nmid\nbot'])
'foo,"x,y,z","Jack ""Ripper"" Jones","top\nmid\nbot"\r\n'
Note that there is no escaping, only quoting, and hence if the quotechar appears in the field, it must be doubled.
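To see the difference in the csv module itself: the default dialect quotes, while QUOTE_NONE with an escapechar switches the writer from quoting to escaping.

```python
import csv
import io

row = ['a,b', 'say "hi"']

# Default dialect: fields containing the delimiter or quotechar get quoted,
# and embedded quotechars are doubled.
buf1 = io.StringIO()
csv.writer(buf1).writerow(row)
out1 = buf1.getvalue()
print(repr(out1))  # '"a,b","say ""hi"""\r\n'

# QUOTE_NONE + escapechar: nothing is quoted; special characters are
# escaped with the escapechar instead.
buf2 = io.StringIO()
csv.writer(buf2, quoting=csv.QUOTE_NONE, escapechar='\\').writerow(row)
out2 = buf2.getvalue()
print(repr(out2))
```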