I want to use either R or Python to convert a .csv.gz file to UTF-8 encoding. How can I do this directly? I am not able to find any comprehensive guide on how to do this.
My best attempt was to read .csv.gz file with csv.reader in python:
csvFile = gzip.open('pracodawcy_20190611_5.csv.gz', 'rt', newline='')
reader = csv.reader(csvFile)
But how do I then save it as a csv file in UTF-8?
Very easily; this reads the file into a list of lines:
import gzip
### assuming the file is separated as you said
### 'rt' decodes with the platform default encoding; pass encoding=... if the source uses another one
with gzip.open('input_file.csv.gz', 'rt', newline='\n') as f:
    content = f.readlines()
### to print the list content
for v in content:
    print(v)
### to write to .csv.gz as UTF-8
with gzip.open('output.csv.gz', 'wb') as f:
    for v in content:
        f.write(v.encode('utf-8'))
If the file is too big, you can also read it lazily, line by line, by iterating over the file object with a for loop (or by calling read() with a size argument) instead of readlines(). There are a lot of examples here and on the web.
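For a file that is too large to hold in memory, the conversion can also be streamed in chunks. What follows is only a minimal sketch: the function name reencode_gzip is made up for illustration, and the source encoding is an assumption you have to supply yourself, since the question does not say which encoding the input uses.

```python
import gzip
import shutil


def reencode_gzip(src_path, dst_path, source_encoding, chunk_size=1024 * 1024):
    """Stream a gzip-compressed text file into a new gzip file encoded as UTF-8.

    source_encoding is an assumption supplied by the caller; it is not detected.
    """
    # newline='' disables newline translation on both sides, so the line
    # endings of the original CSV pass through unchanged.
    with gzip.open(src_path, 'rt', encoding=source_encoding, newline='') as src, \
            gzip.open(dst_path, 'wt', encoding='utf-8', newline='') as dst:
        # Copy decoded text in chunks instead of loading the whole file.
        shutil.copyfileobj(src, dst, chunk_size)
```

If the input happened to be in cp1250, for example, this could be called as reencode_gzip('pracodawcy_20190611_5.csv.gz', 'output.csv.gz', 'cp1250').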
The raw ECG data that I have is in csv format. I need to convert it into a .txt file that contains only the ECG data. I need Python code for this. Can I get some help?
csv_file = 'ECG_data_125Hz_Simulator_Patch_Normal_Sinus.csv'
txt_file = 'ECG_data_125Hz_Simulator_Patch_Normal_Sinus.txt'
import csv
with open(txt_file, "w") as my_output_file:
    with open(csv_file, "r") as my_input_file:
        pass  # need to write data to the output file
    my_output_file.close()
The input ECG data looks like this:
Raw_ECG_data
What worked for me:
import csv
csv_file = 'FL_insurance_sample.csv'
txt_file = 'ECG_data_125Hz_Simulator_Patch_Normal_Sinus.txt'
with open(txt_file, "w") as my_output_file:
    with open(csv_file, "r", newline="") as my_input_file:
        # write each csv row as one space-separated line
        for row in csv.reader(my_input_file):
            my_output_file.write(" ".join(row) + '\n')
A few things:
You can open multiple files with the same context manager (with statement):
with open(csv_file, 'r') as input_file, open(txt_file, 'w') as output_file:
    ...
When using a context manager to handle files, there's no need to close the file yourself; that's what the with statement does. It says "with the file open, do the following", so once the block ends, the file is closed.
You could do something like:
with open(csv_file, 'r') as input_file, open(txt_file, 'w') as output_file:
    for line in input_file:
        output_file.write(line)
... But as @MEdwin says, a csv file can just be renamed: the commas will no longer be treated as separators, and it will just become a normal .txt file. You can rename a file in Python using os.rename():
import os
os.rename('file.csv', 'file.txt')
Finally, if you want to remove certain columns from the csv when writing to the txt file, you can use .split(). This lets you pick a separator such as a comma and split the line on that separator into a list of strings. For example:
>>> "Hello, this is a test".split(',')
['Hello', ' this is a test']
You can then just write certain indices from the list to the new file.
For more info on deleting columns en masse, see this post
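As a rough sketch of writing only certain columns, the version below uses csv.reader in place of str.split(), because the reader also copes with commas inside quoted fields. The function name csv_columns_to_txt and its parameters are hypothetical, and the assumption that the first row is a header may not hold for your file.

```python
import csv


def csv_columns_to_txt(csv_path, txt_path, keep, skip_header=True):
    """Write the columns whose indices are listed in `keep` to a text file.

    Assumes every non-empty row has at least max(keep) + 1 fields.
    """
    with open(csv_path, newline='') as src, open(txt_path, 'w') as dst:
        reader = csv.reader(src)
        if skip_header:
            next(reader, None)  # drop the first row, assumed to be a header
        for row in reader:
            if not row:
                continue  # skip blank lines
            dst.write(' '.join(row[i] for i in keep) + '\n')
```

For a file whose second column holds the ECG samples, a call might look like csv_columns_to_txt(csv_file, txt_file, keep=[1]).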
I have read the documentation and a few additional posts on SO and other various places, but I can't quite figure out this concept:
When you call csvFilename = gzip.open(filename, 'rb') then reader = csv.reader(open(csvFilename)), is that reader not a valid csv file?
I am trying to solve the problem outlined below, and am getting a "coercing to Unicode: need string or buffer, GzipFile found" error on lines 41 and 7 (marked with comments below), leading me to believe that gzip.open and csv.reader do not work as I had previously thought.
Problem I am trying to solve
I am trying to take a results.csv.gz and convert it to a results.csv so that I can turn the results.csv into a python dictionary and then combine it with another python dictionary.
File 1:
alertFile = payload.get('results_file')
alertDataCSV = rh.dataToDict(alertFile) # LINE 41
alertDataTotal = rh.mergeTwoDicts(splunkParams, alertDataCSV)
Calls File 2:
import gzip
import csv
def dataToDict(filename):
    csvFilename = gzip.open(filename, 'rb')
    reader = csv.reader(open(csvFilename)) # LINE 7
    alertData={}
    for row in reader:
        alertData[row[0]]=row[1:]
    return alertData
def mergeTwoDicts(dictA, dictB):
    dictC = dictA.copy()
    dictC.update(dictB)
    return dictC
*edit: also forgive my non-PEP style of naming in Python
gzip.open returns a file-like object (same as what plain open returns), not the name of the decompressed file. Simply pass the result directly to csv.reader and it will work (the csv.reader will receive the decompressed lines). csv does expect text though, so on Python 3 you need to open it to read as text (on Python 2 'rb' is fine, the module doesn't deal with encodings, but then, neither does the csv module). Simply change:
csvFilename = gzip.open(filename, 'rb')
reader = csv.reader(open(csvFilename))
to:
# Python 2
csvFile = gzip.open(filename, 'rb')
reader = csv.reader(csvFile) # No reopening involved
# Python 3
csvFile = gzip.open(filename, 'rt', newline='') # Open in text mode, not binary, no line ending translation
reader = csv.reader(csvFile) # No reopening involved
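Applied to the dataToDict function from the question, the Python 3 variant might look like the sketch below. The snake_case name data_to_dict, the encoding parameter, and its UTF-8 default are assumptions added for illustration; they are not part of the original code.

```python
import csv
import gzip


def data_to_dict(filename, encoding='utf-8'):
    """Map the first column of a gzip-compressed csv to the rest of each row."""
    alert_data = {}
    # Text mode with newline='' hands the decompressed lines to csv.reader
    # without any newline translation; no second open() call is needed.
    with gzip.open(filename, 'rt', newline='', encoding=encoding) as csv_file:
        for row in csv.reader(csv_file):
            if row:  # ignore blank lines
                alert_data[row[0]] = row[1:]
    return alert_data
```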
The following worked for me with Python 3.7.9:
import gzip
my_filename = 'my_compressed_file.csv.gz'
with gzip.open(my_filename, 'rt') as gz_file:
    data = gz_file.read() # read decompressed data
    with open(my_filename[:-3], 'wt') as out_file:
        out_file.write(data) # write decompressed data
my_filename[:-3] strips the trailing ".gz" to recover the original filename, so that the output does not get a random filename.
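If the decompressed data is too large to read in one go, the same idea can be done in binary mode with chunked copying, which also avoids any decoding step. This is a sketch only; the helper name gunzip_file is hypothetical.

```python
import gzip
import shutil


def gunzip_file(gz_path, chunk_size=1024 * 1024):
    """Decompress gz_path next to itself and return the output path."""
    if not gz_path.endswith('.gz'):
        raise ValueError('expected a path ending in .gz')
    out_path = gz_path[:-3]  # strip the ".gz" suffix
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        # Copy the decompressed bytes chunk by chunk.
        shutil.copyfileobj(src, dst, chunk_size)
    return out_path
```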
My program reads a csv file but recently the input file was changed to be base64 encoded. So currently the read code is:
with open(uploadFile, 'rb') as csvfile:
    spreadSheet = csv.reader(csvfile, delimiter=',')
I know csvfile is a file object and this can't be done, but I want to do something like:
import base64
with open(uploadFile, 'rb') as csvfile:
    spreadSheet = csv.reader(base64.decode(csvfile), delimiter=',')
That is, the file input would be base64-decoded as though in a pipe and then parsed as a csv file.
I could read the file, decode it, write the result to another file, and then read that file with the csv reader, but it seems there should be a way to do it as a pipe sequence.
Try the following:
import base64
import csv
with open(uploadFile, 'rb') as csvfile:
    decoded = base64.standard_b64decode(csvfile.read()).decode('utf-8')
spreadSheet = csv.reader(decoded.splitlines(), delimiter=',')
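A variant that stays closer to the "pipe" idea, still without a temporary file on disk, is to decode into an in-memory buffer and wrap that buffer as a text stream. This is a sketch under assumptions: the function name read_base64_csv is made up, the decoded content is assumed to be UTF-8, and the whole decoded file still has to fit in memory.

```python
import base64
import csv
import io


def read_base64_csv(path, encoding='utf-8', delimiter=','):
    """Decode a base64-encoded csv file and return its rows as lists."""
    decoded = io.BytesIO()
    with open(path, 'rb') as encoded:
        # base64.decode reads the binary input line by line and writes
        # the decoded bytes to the output file object.
        base64.decode(encoded, decoded)
    decoded.seek(0)
    # newline='' leaves line endings untouched so the csv module handles them.
    text = io.TextIOWrapper(decoded, encoding=encoding, newline='')
    return list(csv.reader(text, delimiter=delimiter))
```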
I want to open a file using Python on Windows, perform some regex operations, optionally alter the content and then write the result back to a file.
I can create an example file which looks right (based on the comments on using binary mode in other posts on SO and within the documentation). What I can't see is how I convert the 'binary' data to a usable form without introducing '\r' characters.
An example:
import re
# Create an example file which represents the one I'm actually working on (a Jenkins config file if you're interested).
testFileName = 'testFile.txt'
with open(testFileName, 'wb') as output_file:
    output_file.write(b'this\nis\na\ntest')
# Try and read the file in as I would in the script I was trying to write.
content = ""
with open(testFileName, 'rb') as content_file:
    content = content_file.read()
# Do something to the content
exampleRegex = re.compile("a\\ntest")
content = exampleRegex.sub("a\\nworking\\ntest", content) # <-- Fails because it won't operate on 'binary data'
# Write the file back to disk and then realise, frustratingly that something in this process has introduced carriage returns onto every line.
outputFilename = 'output_'+testFileName
with open(outputFilename, 'wb') as output_file:
    output_file.write(content)
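I presume you mean that your text file has carriage returns and you don't want them included in the text.
If you open the file in text mode and leave newline at its default (None), as in
with open(fileName, 'r', encoding="utf-8", errors="ignore") as content_file:
then every "\r\n" in the input is translated to "\n" on reading, which consumes the carriage returns at the line ends. Setting newline="\r\n" would not do this: it only changes where lines are split and returns the line endings untranslated.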
Edit: Or if you want to operate only on \n then this working example should do it.
import re
testFileName = 'testFile.txt'
with open(testFileName, 'w', newline='\n') as output_file:
    output_file.write('this\nis\na\ntest')
content = ""
with open(testFileName, 'r', newline='\n') as content_file:
    content = content_file.read()
exampleRegex = re.compile("a\\ntest")
content = exampleRegex.sub("a\\nworking\\ntest", content)
outputFilename = 'output_'+testFileName
with open(outputFilename, 'w', newline='\n') as output_file:
    output_file.write(content)
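How the newline argument affects reading can be checked with a small experiment. The helper read_with_newline below is hypothetical and exists only for this demonstration; it writes raw bytes to a temporary file and reads them back in text mode.

```python
import os
import tempfile


def read_with_newline(data, newline):
    """Write raw bytes to a temp file and read them back in text mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        with open(path, 'r', encoding='utf-8', newline=newline) as f:
            return f.read()
    finally:
        os.remove(path)


raw = b'this\r\nis\r\na\r\ntest'
print(repr(read_with_newline(raw, None)))    # 'this\nis\na\ntest'
print(repr(read_with_newline(raw, '')))      # 'this\r\nis\r\na\r\ntest'
print(repr(read_with_newline(raw, '\n')))    # 'this\r\nis\r\na\r\ntest'
print(repr(read_with_newline(raw, '\r\n')))  # 'this\r\nis\r\na\r\ntest'
```

Only the default, newline=None, translates "\r\n" to "\n" on input; every other value returns the line endings as they are stored in the file.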
If I interpreted the question correctly, I first decoded the bytes to string, then did the regex sub. Next, I encoded the string into bytes to be written into the output file.
import re
testFileName = 'testFile.txt'
with open(testFileName, 'wb') as output_file:
    output_file.write(b'this\nis\na\ntest')
content = ""
with open(testFileName, 'rb') as content_file:
    content = content_file.read().decode('utf-8')
exampleRegex = re.compile("a\\ntest")
content = exampleRegex.sub("a\\nworking\\ntest", content)
outputFilename = 'output_'+testFileName
with open(outputFilename, 'wb') as output_file:
    output_file.write(content.encode('utf-8'))
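Another option, sketched here as an alternative rather than as what the answers above do, is to skip decoding entirely: the re module accepts bytes patterns, so the substitution can run directly on the binary content as long as both pattern and replacement are bytes. The function name replace_in_binary is made up for this example.

```python
import re


def replace_in_binary(src_path, dst_path, pattern, replacement):
    """Apply a bytes regex substitution to a file without newline translation."""
    with open(src_path, 'rb') as src:
        content = src.read()
    # Pattern and replacement must both be bytes when the content is bytes.
    new_content = re.sub(pattern, replacement, content)
    with open(dst_path, 'wb') as dst:
        dst.write(new_content)
```

For the test file from the question this could be called as replace_in_binary('testFile.txt', 'output_testFile.txt', rb'a\ntest', b'a\nworking\ntest'). Because both files are opened in binary mode, no carriage returns are added on Windows.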
I have the following code:
import re
#open the xml file for reading:
file = open('path/test.xml','r+')
#convert to string:
data = file.read()
file.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
file.close()
where I'd like to replace the old content that's in the file with the new content. However, when I execute my code, the new content is appended to the file "test.xml", i.e. I have the old content followed by the new "replaced" content. What can I do in order to delete the old stuff and only keep the new?
You need to seek to the beginning of the file before writing and then call file.truncate() if you want to replace the content in place:
import re
myfile = "path/test.xml"
with open(myfile, "r+") as f:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()
The other way is to read the file then open it again with open(myfile, 'w'):
with open(myfile, "r") as f:
    data = f.read()
with open(myfile, "w") as f:
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
Neither truncate nor open(..., 'w') will change the inode number of the file (I tested twice, once with Ubuntu 12.04 NFS and once with ext4).
By the way, this is not really related to Python. The interpreter calls the corresponding low level API. The method truncate() works the same in the C programming language: See http://man7.org/linux/man-pages/man2/truncate.2.html
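The inode observation can be reproduced with a few lines. This is a minimal sketch; the helper names rewrite_in_place and inode_unchanged_after_rewrite are hypothetical, and the result may depend on the platform and file system.

```python
import os


def rewrite_in_place(path, new_text):
    """Overwrite a file's content through seek() and truncate()."""
    with open(path, 'r+') as f:
        f.seek(0)
        f.write(new_text)
        f.truncate()  # cut off whatever is left of the old content


def inode_unchanged_after_rewrite(path, new_text):
    """Return True if the file keeps its inode number across the rewrite."""
    before = os.stat(path).st_ino
    rewrite_in_place(path, new_text)
    return os.stat(path).st_ino == before
```

Without the truncate() call, a new text shorter than the old one would leave the tail of the old content in the file.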
file='path/test.xml'
with open(file, 'w') as filetowrite:
    filetowrite.write('new content')
Opening the file in 'w' mode truncates it, so whatever you write replaces its current text and the file is saved with the new contents.
Using truncate(), the solution could be
import re
#open the xml file for reading:
with open('path/test.xml','r+') as f:
    #convert to string:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
    f.truncate()
import os  # must import this library
if os.path.exists('TwitterDB.csv'):
    os.remove('TwitterDB.csv')  # this deletes the file
else:
    print("The file does not exist")  # add this to prevent errors
I had a similar problem, and instead of overwriting my existing file using the different 'modes', I just deleted the file before using it again, so that it would be as if I was appending to a new file on each run of my code.
See "How to Replace String in File" for a simple answer that works with replace():
fin = open("data.txt", "rt")
fout = open("out.txt", "wt")
for line in fin:
    fout.write(line.replace('pyton', 'python'))
fin.close()
fout.close()
In my case the following code did the trick:
import json
with open("output.json", "w+") as outfile: # "w+" creates the file if it does not exist and truncates any existing content
    json.dump(result_plot, outfile)
Using python3 pathlib library:
import re
from pathlib import Path
import shutil
shutil.copy2("/tmp/test.xml", "/tmp/test.xml.bak") # create backup
filepath = Path("/tmp/test.xml")
content = filepath.read_text()
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
A similar method using a different approach to backups:
import re
from pathlib import Path
filepath = Path("/tmp/test.xml")
backup = filepath.with_suffix('.bak')
filepath.rename(backup) # move the original aside; it now serves as the backup
content = backup.read_text() # the original path no longer exists, so read from the backup
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))