store JSON string in csv file for neo4j import - python

I need to store a JSON string in one of the fields of a CSV file that will be used to create a neo4j database with neo4j-admin import. After I generate all of the necessary CSV files and try to create the database, it reports that there are no valid --nodes files. I suspect this is a quoting issue, specifically in the CSVs that store JSON strings. Here is the code I am using to generate the CSV files:
with open(cl_file, 'w') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(title_list)
    for row in unique_cl_data:
        writer.writerow([row[0], row[1], row[2], row[3], 'Cluster', dataset_name])
The JSON string is stored in the row[3] value and looks like this:
'{"mature_neuron":0.493694929,"intermediate_progenitor_cell":0.0982259823,"immature_neuron":0.1773570713,"glutamatergic_neuron":0.6074802751,"gabaergic_neuron":0.2685863644,"dopaminergic_neuron":0.0234599396,"serotonergic_neruon":0.001022236,"cholinergic_neuron":0.0273108961,"neuroepithelial_cell":0.2173953827,"radial_glia":0.2758471756,"microglia":0.0282818013,"macrophage":0.0,"astrocyte":0.3250249223,"oligodendrocyte_precursor_cell":0.4788073089,"mature_oligodendrocyte":0.3684283806,"schwann_cell_precursor":0.2158159088,"myelinating_schwann_cell":0.3282158992,"nonmyelinating_schwann_cell":0.4526564331,"endothelial_cell":0.7830818309,"mural_cell":0.0756233339}'
The generated CSV looks like this:
"clusterId:ID","chartType","clusterName","assign",":LABEL","DATASET"
"scid_engram_fear_traned_tsne_1","tsne","1","{""mature_neuron"":0.793159869,""intermediate_progenitor_cell"":0.000454013,""immature_neuron"":0.0548508584,""glutamatergic_neuron"":1.0792403847,""gabaergic_neuron"":0.3181778459,""dopaminergic_neuron"":0.150589103,""serotonergic_neruon"":0.0096765336,""cholinergic_neuron"":0.0251700647,""neuroepithelial_cell"":0.0594110346,""radial_glia"":0.1539441058,""microglia"":0.0224593362,""macrophage"":0.0300658893,""astrocyte"":0.0996221719,""oligodendrocyte_precursor_cell"":0.0051255739,""mature_oligodendrocyte"":0.0223153229,""schwann_cell_precursor"":0.029507684,""myelinating_schwann_cell"":0.0360644031,""nonmyelinating_schwann_cell"":0.4626932582,""endothelial_cell"":0.0006433937,""mural_cell"":0.0}","Cluster","scid_engram_fear_traned"
As can be seen, there are doubled quotation marks around the keys of the JSON string. I suspect that is the issue, but I am not sure, and I do not know how to prevent this quoting if it is indeed the cause of the failed import. csv.QUOTE_ALL has always worked for me (until I tried to store a JSON string).

Ultimately I just substituted some characters in row[3], which worked out fine (with the same QUOTE_ALL), using:
row[3].replace('"', '\\"').replace('\n', '\\n')
When reading in the fields on the front end I needed to substitute the characters back:
JSON.parse(jsonStr.replace(/\\"/g, '"'))
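For completeness, here is a minimal sketch of the whole write loop with that substitution applied (same variables as above; the newline='' argument is my addition to keep the csv module in control of line endings):
import csv

with open(cl_file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(title_list)
    for row in unique_cl_data:
        # escape embedded quotes and newlines so the JSON survives the neo4j-admin import
        assign = row[3].replace('"', '\\"').replace('\n', '\\n')
        writer.writerow([row[0], row[1], row[2], assign, 'Cluster', dataset_name])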

Related

Why is my csv file separated by " \t " instead of commas (" , ")?

I downloaded data from the internet and saved it as a csv (comma delimited) file. The image shows what the file looks like in Excel.
Using csv.reader in Python, I printed each row. I have shown my code below along with the output in Spyder.
import csv

with open('p_dat.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
I am very confused as to why my values are not comma separated. Any help will be greatly appreciated.
As pointed out in the comments, technically this is a TSV (tab-separated values) file, which is actually perfectly valid.
In practice, of course, not all libraries will make a "hard" distinction between a TSV and CSV file. The way you parse a TSV file is basically the same as the way you parse a CSV file, except that the delimiter is different.
There are actually multiple valid delimiters for this kind of file, such as tabs, commas, and semicolons. Which one you choose is honestly a matter of preference, not a "hard" technical limit.
See the specification for CSVs. There are many options for the delimiter in the file. In this case you have a tab, \t.
The option is important. Suppose your data had commas in it; then a , as a delimiter would not be a good choice.
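A small illustration of that point (my own example data, not from the question): if a field itself contains a comma, splitting on commas breaks the field apart, while splitting on the tab delimiter keeps it intact.
line = "Doe, Jane\t42"
print(line.split(','))    # ['Doe', ' Jane\t42']  -- the name is torn in two
print(line.split('\t'))   # ['Doe, Jane', '42']   -- fields survive intact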
Even though they're named comma-separated values, they're sometimes separated by different symbols (like the tab character that you have currently).
If you want to use Python to view this as a comma-separated file, you can try something like:
import csv
...
with open('p_dat.csv', 'r') as file:
    reader = csv.reader(file, delimiter='\t')   # parse the tab-separated file
    for row in reader:
        commarow = ','.join(row)                # re-join the fields with commas
        print(commarow)

How to ETL postgresql columns which contain HTML code via csv copy

I need to ETL data between two PostgreSQL databases. Several of the columns that need to be transferred contain HTML code. There are several million rows, so I need to move the data via CSV COPY for speed. While trying to csv.writer and then copy_from the CSV file, the characters in the HTML code are causing errors with the transfer, making it seemingly impossible to set a delimiter or handle quoting.
I am running this ETL job via Python and have used .replace() on the columns to make ';' work as a delimiter.
However, I am running into issues with the line-break and paragraph tags in the HTML code and with quoting fields. When I encounter these I receive the error 'missing data for column'.
I have tried setting 'doublequote' and changing the escape character to '\'. I have also tried changing the quotechar to '|'.
The code I am using to create the CSV is:
filename = 'transfer.csv'
with open(filename, 'w') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=';', quotechar='|',
                           quoting=csv.QUOTE_MINIMAL)
The code I am using to load the CSV is:
f = open(file_path, "r")
print('File opened')
cur.copy_from(f, stage_table, sep=';', null="")
As mentioned above, the error message that I receive when I try to import the CSV is: "Error: missing data for column"
I would love to be able to format my csv.writer and copy_from code in such a way that I do not have to use dozens of nested replace() statements and transforms to ETL this data, and can have an automated script run it on a schedule.
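For what it's worth, one way to sidestep the manual replacements entirely (a sketch only, assuming psycopg2; source_rows stands in for the rows fetched from the source database and my_stage_table for the staging table name) is to write a standard CSV with the default dialect and let PostgreSQL's CSV mode handle the quoting on import via copy_expert:
import csv

filename = 'transfer.csv'
with open(filename, 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)       # default dialect: ',' delimiter, '"' quoting
    csvwriter.writerows(source_rows)      # source_rows: rows fetched from the source DB (assumed)

with open(filename, 'r') as f:
    # CSV mode copes with embedded quotes, semicolons, and newlines in the HTML fields
    cur.copy_expert("COPY my_stage_table FROM STDIN WITH (FORMAT csv, NULL '')", f)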

Writing results from SQL query to CSV and avoiding extra line-breaks

I have to extract data from several different database engines. After this data is exported, I send the data to AWS S3 and copy that data to Redshift using a COPY command. Some of the tables contain lots of text, with line breaks and other characters present in the column fields. When I run the following code:
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()
with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter='|', quoting=csv.QUOTE_ALL, quotechar='"', doublequote=True, lineterminator='\n')
    a.writerows(rows)
Some of the columns that have carriage returns/linebreaks will create new lines:
"2017-01-05 17:06:32.802700"|"SampleJob"|""|"Date"|"error"|"Job.py"|"syntax error at or near ""from"" LINE 34: select *, SYSDATE, from staging_tops.tkabsences;
^
-<class 'psycopg2.ProgrammingError'>"
which causes the import process to fail. I can work around this by hard-coding for exceptions:
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()
with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter='|', quoting=csv.QUOTE_ALL, quotechar='"', doublequote=True, lineterminator='\n')
    for row in rows:
        list_of_rows = []
        for c in row:
            if isinstance(c, str):
                c = c.replace("\n", "\\n")
                c = c.replace("|", "\|")
                c = c.replace("\\", "\\\\")
                list_of_rows.append(c)
            else:
                list_of_rows.append(c)
        a.writerow([x.encode('utf-8') if isinstance(x, str) else x for x in list_of_rows])
But this takes a long time to process larger files, and seems like bad practice in general. Is there a faster way to export data from a SQL cursor to CSV that will not break when faced with text columns that contain carriage returns/line breaks?
If you're doing SELECT * FROM table without a WHERE clause, you could use COPY table TO STDOUT instead, with the right options:
copy_command = """COPY some_schema.some_message_log TO STDOUT
CSV QUOTE '"' DELIMITER '|' FORCE QUOTE *"""
with open('data.csv', 'w', newline='') as fp:
    cursor.copy_expert(copy_command, fp)
This, in my testing, results in a literal '\n' instead of an actual newline, whereas writing the output through the csv writer gives broken lines.
If you do need a WHERE clause in production you could create a temporary table and copy it instead:
cursor.execute("""CREATE TEMPORARY TABLE copy_me AS
SELECT this, that, the_other FROM table_name WHERE conditions""")
(Edit) Looking at your question again, I see you mention "several different database engines". The above works with psycopg2 and PostgreSQL, but could probably be adapted for other databases or libraries.
I suspect the issue is as simple as making sure the Python CSV export library and Redshift's COPY import speak a common interface. In short, check your delimiters and quoting characters and make sure both the Python output and the Redshift COPY command agree.
In slightly more detail: the DB drivers will have already done the hard work of getting the data into Python in a well-understood form. That is, each row from the DB is a list (or tuple, generator, etc.), and each cell is individually accessible. At the point where you have a list-like structure, Python's CSV exporter can do the rest of the work and, crucially, Redshift will be able to COPY FROM the output, embedded newlines and all. In particular, you should not need to do any manual escaping; the .writerow() or .writerows() functions should be all you need.
Redshift's COPY implementation understands the most common dialect of CSV by default, which is to
delimit cells by a comma (,),
quote cells with double quotes ("),
and to escape any embedded double quotes by doubling (" → "").
To back that up with documentation from Redshift FORMAT AS CSV:
... The default quote character is a double quotation mark ( " ). When the quote character is used within a field, escape the character with an additional quote character. ...
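As a quick illustration (my own snippet, not part of the original answer), Python's default csv dialect produces exactly that escaping, doubling embedded quotes and keeping newlines inside the quoted field:
import csv
import sys

w = csv.writer(sys.stdout)                # default dialect: ',' delimiter, '"' quotechar
w.writerow(['syntax error at or near "from"', 'line1\nline2'])
# prints: "syntax error at or near ""from""","line1
#         line2"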
However, your Python CSV export code uses a pipe (|) as the delimiter and sets the quotechar to a double quote ("). That, too, can work, but why stray from the defaults? I suggest using CSV's namesake delimiter and keeping your code simpler in the process:
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()
with open('data.csv', 'w', newline='') as fp:
    csvw = csv.writer(fp)
    csvw.writerows(rows)
From there, tell COPY to use the CSV format (again with no need for non-default specifications):
COPY your_table FROM your_csv_file auth_code FORMAT AS CSV;
That should do it.
Why write to the file after every row?
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()
with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter='|', quoting=csv.QUOTE_ALL, quotechar='"', doublequote=True, lineterminator='\n')
    list_of_rows = []
    for row in rows:
        cleaned = []                      # build a new row with the escaped values
        for c in row:
            if isinstance(c, str):
                c = c.replace("\\", "\\\\")
                c = c.replace("\n", "\\n")
                c = c.replace("|", "\\|")
            cleaned.append(c)
        list_of_rows.append(cleaned)
    a.writerows(list_of_rows)             # one write call for all rows
The problem is that you are using the Redshift COPY command with its default parameters, which use a pipe as a delimiter (see here and here) and require escaping of newlines and pipes within text fields (see here and here). However, the Python csv writer only knows how to do the standard thing with embedded newlines, which is to leave them as-is, inside a quoted string.
Fortunately, the Redshift COPY command can also use the standard CSV format. Adding the CSV option to your COPY command gives you this behavior:
Enables use of CSV format in the input data. To automatically escape delimiters, newline characters, and carriage returns, enclose the field in the character specified by the QUOTE parameter. The default quote character is a double quotation mark ( " ). When the quote character is used within a field, escape the character with an additional quote character."
This is exactly the approach used by the Python CSV writer, so it should take care of your problems. So my advice would be to create a standard csv file using code like this:
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()
with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp)  # no need for special settings
    a.writerows(rows)
Then in Redshift, change your COPY command to something like this (note the added CSV tag):
COPY logdata
FROM 's3://mybucket/data/data.csv'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
CSV;
Alternatively, you could continue manually converting your fields to match the default settings for Redshift's COPY command. Python's csv.writer won't do this for you on its own, but you may be able to speed up your code a bit, especially for big files, like this:
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()
with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(
        fp,
        delimiter='|', quoting=csv.QUOTE_ALL,
        quotechar='"', doublequote=True, lineterminator='\n'
    )
    a.writerows(
        [c.replace("\\", "\\\\").replace("\n", "\\\n").replace("|", "\\|")
         if isinstance(c, str)
         else c
         for c in row]
        for row in rows
    )
As another alternative, you could experiment with importing the query data into a pandas DataFrame with pandas.read_sql, doing the replacements in the DataFrame (a whole column at a time), then writing the table out with .to_csv. Pandas has incredibly fast CSV code, so this may give you a significant speedup.
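A rough sketch of that route (assuming an open DB-API connection named conn; read_sql and to_csv are the relevant pandas calls):
import pandas as pd

df = pd.read_sql('SELECT * FROM some_schema.some_message_log', conn)
df.to_csv('data.csv', index=False)   # defaults: ',' delimiter, '"' quoting, embedded quotes doubled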
Update: I just noticed that in the end I basically duplicated @hunteke's answer. The key point (which I missed the first time through) is that you probably haven't been using the CSV argument in your current Redshift COPY command; if you add that, this should get easy.

Cleaning unicode characters while writing to csv

I am using a certain REST API to get data, and then attempting to write it to a CSV using Python 2.7.
In the CSV, every item within a tuple has u'' around it. For example, for the 'tags' field I am retrieving, I am getting [u'01d/02d/major/--', u'45m/04h/12h/24h', u'internal', u'net', u'premium_custom', u'priority_fields_swapped', u'priority_saved', u'problem', u'urgent', u'urgent_priority_issue']. However, if I print the data in the program prior to it being written to the CSV, the data looks fine, i.e. ('01d/02d/major--', '45m/04h/12h/24h', etc.). So I am assuming I have to modify something in the csv write command or within the csv writer object itself. My question is how to write the data into the CSV properly so that there are no unicode markers.
In Python 3:
Just define the encoding when opening the CSV file to write to.
If the row contains non-ASCII characters, you will get a UnicodeEncodeError:
import csv

row = [u'01d/02d/major/--', u'45m/04h/12h/24h', u'internal', u'net', u'premium_custom', u'priority_fields_swapped', u'priority_saved', u'problem', u'urgent', u'urgent_priority_issue']

with open('output.csv', 'w', newline='', encoding='ascii') as f:
    writer = csv.writer(f)
    writer.writerow(row)
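One possible cause worth noting (an assumption on my part, since the question doesn't show the writing code): if the whole tag list is written as a single cell, Python 2 stores its repr(), u'' markers included; joining the items into one string first avoids that.
import csv
import sys

tags = [u'internal', u'net', u'urgent']
writer = csv.writer(sys.stdout)
writer.writerow([str(tags)])         # cell contains: [u'internal', u'net', u'urgent']  (Python 2 repr)
writer.writerow([u','.join(tags)])   # cell contains: internal,net,urgent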

Python csv writer : "Unknown Dialect" Error

I have a very large string in CSV format that will be written to a CSV file.
I try to write it to CSV using the simplest of Python scripts:
results=""" "2013-12-03 23:59:52","/core/log","79.223.39.000","logging-4.0",iPad,Unknown,"1.0.1.59-266060",NA,NA,NA,NA,3,"1385593191.865",true,ERROR,"app_error","iPad/Unknown/webkit/537.51.1",NA,"Does+not",false
"2013-12-03 23:58:41","/core/log","217.7.59.000","logging-4.0",Win32,Unknown,"1.0.1.59-266060",NA,NA,NA,NA,4,"1385593120.68",true,ERROR,"app_error","Win32/Unknown/msie/9.0",NA,"Does+not,false
"2013-12-03 23:58:19","/core/client_log","79.240.195.000","logging-4.0",Win32,"5.1","1.0.1.59-266060",NA,NA,NA,NA,6,"1385593099.001",true,ERROR,"app_error","Win32/5.1/mozilla/25.0",NA,"Could+not:+{"url":"/all.json?status=ongoing,scheduled,conflict","code":0,"data":"","success":false,"error":true,"cached":false,"jqXhr":{"readyState":0,"responseText":"","status":0,"statusText":"error"}}",false"""
resultArray = results.split('\n')

with open(csvfile, 'wb') as f:
    writer = csv.writer(f)
    for row in resultArray:
        writer.writerows(row)
The code returns an "Unknown Dialect" error.
Is the error because of the script or is it due to the string that is being written?
EDIT
If the problem is bad input, how do I sanitize it so that it can be used by the csv.writer() method?
You need to specify the format of your string:
with open(csvfile, 'wb') as f:
    writer = csv.writer(f, delimiter=',', quotechar="'", quoting=csv.QUOTE_ALL)
You might also want to re-visit your writing loop; the way you have it written you will get one column in your file, and each row will be one character from the results string.
To really exploit the module, try this:
import csv

lines = ["'A','bunch+of','multiline','CSV,LIKE,STRING'"]
reader = csv.reader(lines, quotechar="'")

with open('out.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(list(reader))
out.csv will have:
A,bunch+of,multiline,"CSV,LIKE,STRING"
If you want to quote all the column values, then add quoting=csv.QUOTE_ALL to the writer object; then your file will have:
"A","bunch+of","multiline","CSV,LIKE,STRING"
To change the quotes to ', add quotechar="'" to the writer object.
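Putting both options together (a quick self-contained illustration, written Python 3 style):
import csv

lines = ["'A','bunch+of','multiline','CSV,LIKE,STRING'"]
rows = list(csv.reader(lines, quotechar="'"))

with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL, quotechar="'")
    writer.writerows(rows)
# out.csv now contains: 'A','bunch+of','multiline','CSV,LIKE,STRING'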
The code in the question does not give csv.writer.writerows the input that it expects. Specifically:
resultArray = results.split('\n')
This creates a list of strings. Then, you pass each string to your writer and tell it to writerows with it:
for row in resultArray:
    writer.writerows(row)
But writerows does not expect a single string. From the docs:
csvwriter.writerows(rows)
Write all the rows parameters (a list of row objects as described above) to the writer’s file object, formatted according to the current dialect.
So you're passing a string to a method that expects its argument to be a list of row objects, where a row object is itself expected to be a sequence of strings or numbers:
A row must be a sequence of strings or numbers for Writer objects
Are you sure your listed example code accurately reflects your attempt? While it certainly won't work, I would expect the exception produced to be different.
For a possible fix - if all you are trying to do is to write a big string to a file, you don't need the csv library at all. You can just write the string directly. Even splitting on newlines is unnecessary unless you need to do something like replacing Unix-style linefeeds with DOS-style linefeeds.
If you need to use the csv module after all, you need to give your writer something it understands - in this example, that would be something like writer.writerow(['A','bunch+of','multiline','CSV,LIKE,STRING']). Note that that's a true Python list of strings. If you need to turn your raw string "'A','bunch+of','multiline','CSV,LIKE,STRING'" into such a list, I think you'll find the csv library useful as a reader - no need to reinvent the wheel to handle the quoted commas in the substring 'CSV,LIKE,STRING'. And in that case you would need to care about your dialect.
You can use csv.register_dialect, for example for escaped formatting:
csv.register_dialect('escaped', escapechar='\\', doublequote=True, quoting=csv.QUOTE_ALL)
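A brief usage sketch for that registered dialect (my own example values):
import csv

csv.register_dialect('escaped', escapechar='\\', doublequote=True, quoting=csv.QUOTE_ALL)

with open('out_escaped.csv', 'w', newline='') as f:
    writer = csv.writer(f, dialect='escaped')
    writer.writerow(['A', 'bunch+of', 'multi\nline', 'CSV,LIKE,STRING'])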
