How to ETL PostgreSQL columns which contain HTML code via CSV COPY - Python

I need to ETL data between two PostgreSQL databases. Several of the columns that need to be transferred contain HTML code. There are several million rows, and I need to move them via CSV COPY for speed. While using csv.writer and then copy_from on the CSV file, the characters in the HTML code cause errors during the transfer, making it seemingly impossible to choose a delimiter or handle quoting.
I am running this ETL job via Python and have used .replace() on the columns so that ';' works as a delimiter.
However, I am running into issues with line breaks and paragraphs in the HTML code (<br> and <p> specifically) and with quoting fields. When I encounter these I receive the error 'missing data for column'.
I have tried setting 'doublequote' and changing the escape character to '\'. I have also tried changing the quotechar to '|'.
The code I am using to create the csv is:
filename = 'transfer.csv'
with open(filename, 'w') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=';', quotechar='|',
                           quoting=csv.QUOTE_MINIMAL)
The code I am using to load the CSV is:
f = open(file_path, "r")
print('File opened')
cur.copy_from(f, stage_table, sep=';', null="")
As mentioned above, the error message that I receive when I try to import the CSV is: "Error: missing data for column".
I would love to be able to format my csv.writer and copy_from code in such a way that I do not have to use dozens or more nested replace() statements and transforms to ETL this data, and can have an automated script run it on a schedule.
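One approach, sketched below rather than tested against this data, is to keep the standard double-quote CSV quoting on the write side and switch the load side from copy_from to copy_expert: copy_from speaks PostgreSQL's plain text format, which has no notion of quoted fields, while COPY ... WITH (FORMAT csv) understands quotes and embedded newlines, so the HTML can pass through untouched. The connection string, stage_table and rows_to_transfer below are placeholders for your own connection, target table and extract query.
import csv
import psycopg2

conn = psycopg2.connect("dbname=target")  # placeholder connection
cur = conn.cursor()

# Write a normal CSV: the default quotechar '"' handles ';', quotes and
# newlines inside the HTML, so no replace() chains are needed.
filename = 'transfer.csv'
with open(filename, 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in rows_to_transfer:  # rows_to_transfer: rows pulled from the source database
        csvwriter.writerow(row)

# Load with COPY in csv mode, which parses the quoting that copy_from ignores.
with open(filename, 'r') as f:
    cur.copy_expert(
        "COPY stage_table FROM STDIN WITH (FORMAT csv, DELIMITER ';', NULL '')",
        f,
    )
conn.commit()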

Related

Multiline CSV read using Python3

Every day we get a CSV file from a vendor, and we need to parse it and insert it into a database. We use a single Python 3 program for all the tasks.
The problem happens with multiline CSV files, where the content on the continuation lines is skipped.
48.11363;11.53402;81369;München;"";1.0;1962;I would need
help from
Stackoverflow;"";"";"";289500.0;true;""
Here the field "I would need help from Stackoverflow" is spread across 3 lines.
The problem is that Python 3 only considers "I would need" as the record and skips the rest.
At present I am using the options below to read the file:
with open(file_path, newline='', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        {MY LOGIC}
Is there any way to read a multiline CSV entry as a single record?
I understand that in PySpark there is option("multiline", True), but we don't want to use PySpark in the first place.
Looking for options.
Thanks in advance.
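For what it's worth, the standard library csv.reader already returns a quoted field that spans several physical lines as one logical record, provided the file is opened with newline='' and the delimiter matches the data (the sample above is ';'-separated, not ','-separated). A minimal sketch, assuming the vendor quotes any field that contains newlines:
import csv

with open(file_path, newline='', encoding='utf-8') as f:
    # newline='' plus a matching delimiter lets the reader keep a quoted,
    # multi-line field together as a single record.
    reader = csv.reader(f, delimiter=';', quotechar='"')
    for row in reader:
        print(len(row), row)  # one row per logical record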

store JSON string in csv file for neo4j import

I need to store a JSON string in one of the fields of a CSV file that is going to be used to create a neo4j database with neo4j-admin import. After I generate all of the necessary CSV files and try to create the database, it tells me that there are no valid --nodes files. I suspect this is a quoting issue, specifically in the CSVs that have JSON strings stored in them. Here is the code I am using to generate the CSV files:
with open(cl_file,'w') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(title_list)
    for row in unique_cl_data:
        writer.writerow([row[0], row[1], row[2], row[3], 'Cluster', dataset_name])
The JSON string is stored in the row[3] value and looks like this:
'{"mature_neuron":0.493694929,"intermediate_progenitor_cell":0.0982259823,"immature_neuron":0.1773570713,"glutamatergic_neuron":0.6074802751,"gabaergic_neuron":0.2685863644,"dopaminergic_neuron":0.0234599396,"serotonergic_neruon":0.001022236,"cholinergic_neuron":0.0273108961,"neuroepithelial_cell":0.2173953827,"radial_glia":0.2758471756,"microglia":0.0282818013,"macrophage":0.0,"astrocyte":0.3250249223,"oligodendrocyte_precursor_cell":0.4788073089,"mature_oligodendrocyte":0.3684283806,"schwann_cell_precursor":0.2158159088,"myelinating_schwann_cell":0.3282158992,"nonmyelinating_schwann_cell":0.4526564331,"endothelial_cell":0.7830818309,"mural_cell":0.0756233339}'
The generated CSV looks like this:
"clusterId:ID","chartType","clusterName","assign",":LABEL","DATASET"
"scid_engram_fear_traned_tsne_1","tsne","1","{""mature_neuron"":0.793159869,""intermediate_progenitor_cell"":0.000454013,""immature_neuron"":0.0548508584,""glutamatergic_neuron"":1.0792403847,""gabaergic_neuron"":0.3181778459,""dopaminergic_neuron"":0.150589103,""serotonergic_neruon"":0.0096765336,""cholinergic_neuron"":0.0251700647,""neuroepithelial_cell"":0.0594110346,""radial_glia"":0.1539441058,""microglia"":0.0224593362,""macrophage"":0.0300658893,""astrocyte"":0.0996221719,""oligodendrocyte_precursor_cell"":0.0051255739,""mature_oligodendrocyte"":0.0223153229,""schwann_cell_precursor"":0.029507684,""myelinating_schwann_cell"":0.0360644031,""nonmyelinating_schwann_cell"":0.4626932582,""endothelial_cell"":0.0006433937,""mural_cell"":0.0}","Cluster","scid_engram_fear_traned"
As can be seen, the quotation marks around the keys of the JSON string come out doubled. I suspect that is the issue, but I am not sure, and I do not know how to prevent this quoting if it is the cause of the failed import. csv.QUOTE_ALL has always worked for me (before I tried to store a JSON string).
Ultimately I just substituted some characters in row[3] which worked out fine (same QUOTE_ALL) using:
row[3].replace('"', '\\"').replace('\n', '\\n')
When reading in the fields on the front end I needed to substitute the characters back:
JSON.parse(jsonStr.replace(/\\"/g, '"'))
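For completeness, a sketch of how that workaround slots into the original writer loop, with cl_file, title_list, unique_cl_data and dataset_name as in the question's code:
import csv

with open(cl_file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(title_list)
    for row in unique_cl_data:
        # Escape embedded quotes and newlines in the JSON column before
        # writing; the front end reverses the substitution after import.
        escaped_json = row[3].replace('"', '\\"').replace('\n', '\\n')
        writer.writerow([row[0], row[1], row[2], escaped_json, 'Cluster', dataset_name])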

Cleaning unicode characters while writing to csv

I am using a certain REST API to get data and then attempting to write it to a CSV using Python 2.7.
In the CSV, every item within a tuple has u'' around it. For example, with the 'tags' field I am retrieving, I am getting [u'01d/02d/major/--', u'45m/04h/12h/24h', u'internal', u'net', u'premium_custom', u'priority_fields_swapped', u'priority_saved', u'problem', u'urgent', u'urgent_priority_issue']. However, if I print the data in the program prior to it being written to the CSV, the data looks fine, i.e. ('01d/02d/major--', '45m/04h/12h/24h', etc.). So I am assuming I have to modify something in the csv write command or within the csv writer object itself. My question is how to write the data into the CSV properly so that the u'' markers do not appear.
In Python 3:
Just define the encoding when opening the CSV file for writing.
If a row contains non-ASCII characters, you will get a UnicodeEncodeError.
import csv

row = [u'01d/02d/major/--', u'45m/04h/12h/24h', u'internal', u'net', u'premium_custom', u'priority_fields_swapped', u'priority_saved', u'problem', u'urgent', u'urgent_priority_issue']

with open('output.csv', 'w', newline='', encoding='ascii') as f:
    writer = csv.writer(f)
    writer.writerow(row)
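In Python 2.7, which the question is actually using, the csv module works on byte strings, so one option is to encode each field before writing. The utf-8 encoding below is an assumption; use whatever encoding the consumer of the file expects.
# -*- coding: utf-8 -*-
import csv

row = [u'01d/02d/major/--', u'45m/04h/12h/24h', u'internal', u'net',
       u'premium_custom', u'priority_fields_swapped', u'priority_saved',
       u'problem', u'urgent', u'urgent_priority_issue']

# Python 2: open the file in binary mode and hand csv.writer encoded byte
# strings, so no u'...' reprs end up in the output.
with open('output.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow([field.encode('utf-8') for field in row])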

Exporting tables to CSV on postgres without having to use 'Text to columns'

I've been using the following script to export tables from redshift and postgres:
#export on;
#export set filename="D:\Users\files\filename.csv" CsvColumnDelimiter=";";
SELECT * FROM schemaname.tablename;
#export off;
This works well, but to get the data into separate columns I have to use the "Text to Columns" function in Excel. I am looking for a script that will automate the "Text to Columns" step, as I have over 700 tables. I've been searching for SQL and Python scripts that will do this, but haven't found anything so far.
Excel expects a comma delimiter in the CSVs. You need to either change your delimiter or change the default list separator Excel uses, which you can do in your Region & Language settings.
Python has a built-in csv module.
import csv

with open('in.csv', 'r') as infp:
    with open('out.csv', 'w') as outfp:
        reader = csv.reader(infp, delimiter=';')
        writer = csv.writer(outfp, delimiter=',')
        for row in reader:
            writer.writerow(row)
This should translate the delimiter. Look at the csv module documentation for full details.
There's probably a better way of doing this. On Linux there are sed, awk, tr, etc. that can do this in one arcane line. See "8 examples to change the delimiter of a file in Linux". But I don't know the Windows equivalents, if there are any.

Using Python's CSV library to print an array as a csv file

I have a Python list as such:
[['a','b','c'],['d','e','f'],['g','h','i']]
I am trying to get it into CSV format so I can load it into Excel:
a,b,c
d,e,f
g,h,i
Using this, I am trying to write the array to a CSV file:
with open('tables.csv','w') as f:
    f.write(each_table)
However, it prints out this:
[
[
'
a
'
,
...
...
So then I tried putting it into an array (again) and then printing it.
each_table_array = [each_table]
with open('tables.csv','w') as f:
    f.write(each_table_array)
Now when I open up the CSV file, it's a bunch of unknown characters, and when I load it into Excel, I get a single character in every cell.
Not too sure if it's me using the csv library wrong, or the array portion.
I just figured out that the table I am pulling data from has another table within one of its cells; this expands out and messes up the whole formatting.
You need to use the csv library for your job:
import csv

each_table = [['a','b','c'],['d','e','f'],['g','h','i']]

with open('tables.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    for row in each_table:
        writer.writerow(row)
As a more flexible and Pythonic way, use the csv module for dealing with CSV files. Note that in Python 3 you need the newline='' argument in your open() call (in Python 2 you would open the file in binary mode, 'wb', instead). Then you can use csv.writer to write your CSV file:
import csv

with open('file_name.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerows(main_list)
From the Python documentation: if newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n line endings on write, an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
