I have some large (~500 MB) .db3 files with tables in them that I'd like to extract, run some checks on, and write to a .csv file. The following code shows how I open a .db3 file given its filepath, select data from MY_TABLE and return the output list:
import sqlite3

def extract_db3_table(filepath):
    output = []
    with sqlite3.connect(filepath) as connection:
        cursor = connection.cursor()
        connection.text_factory = bytes
        cursor.execute('''SELECT * FROM MY_TABLE''')
        for row_index, row in enumerate(cursor):
            output.append(row)
    return output
However, the following error happens at the final iteration of the cursor:
File "C:/Users/.../Documents/Python/extract_test.py", line 10, in extract_db3_table
for row_index, row in enumerate(cursor):
DatabaseError: database disk image is malformed
The code runs for every iteration up until the last, when this error is thrown. I have tried different .db3 files but the problem persists. The same error is also thrown if I use the cursor.fetchall() method instead of the cursor for loop in my example above.
Do I have to use a try...except block within the loop somehow and break the loop once this error is thrown, or is there another way to solve this? Any insight is very much appreciated!
The reason I use bytes as the connection.text_factory is that some TEXT values in my table cannot be converted to str, so in my full code I use a try...except block when decoding the TEXT values to UTF-8 and skip the rows where the decoding fails.
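For reference, here is a rough, untested sketch of the try...except workaround I have in mind: fetch rows one at a time and stop at the first row that raises the error, keeping whatever was read up to that point.

import sqlite3

def extract_db3_table_partial(filepath):
    # Sketch only: read rows until the corrupted page is hit, then stop.
    output = []
    connection = sqlite3.connect(filepath)
    connection.text_factory = bytes
    cursor = connection.cursor()
    cursor.execute('''SELECT * FROM MY_TABLE''')
    while True:
        try:
            row = cursor.fetchone()
        except sqlite3.DatabaseError:
            # "database disk image is malformed" - keep the rows read so far
            break
        if row is None:
            break
        output.append(row)
    connection.close()
    return output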
I'm trying to read a huge PostgreSQL table (~3 million rows of jsonb data, ~30GB size) to do some ETL in Python. I use psycopg2 for working with the database. I want to execute a Python function for each row of the PostgreSQL table and save the results in a .csv file.
The problem is that I need to select the whole 30GB table, and the query runs for a very long time without any possibility to monitor progress. I have found that there is a cursor attribute called itersize, which determines the number of rows to be buffered on the client.
So I have written the following code:
import psycopg2

conn = psycopg2.connect("host=... port=... dbname=... user=... password=...")
cur = conn.cursor()
cur.itersize = 1000

sql_statement = """
select * from <HUGE TABLE>
"""

cur.execute(sql_statement)

for row in cur:
    print(row)

cur.close()
conn.close()
Since the cursor buffers 1000 rows at a time on the client, I expect the following behavior:
1. The Python script buffers the first 1000 rows.
2. We enter the for loop and print the buffered 1000 rows in the console.
3. We reach the point where the next 1000 rows have to be buffered.
4. The Python script buffers the next 1000 rows.
5. GOTO 2.
However, the code just hangs on the cur.execute() statement and no output is printed in the console. Why? Could you please explain what exactly is happening under the hood?
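In case it is useful, here is a minimal sketch of the named (server-side) cursor variant I have read about. The assumption is that itersize only takes effect with a named cursor, while a regular client-side cursor downloads the entire result set during execute(); the cursor name below is arbitrary and the connection placeholders mirror the snippet above.

import psycopg2

conn = psycopg2.connect("host=... port=... dbname=... user=... password=...")

# A named cursor is declared on the server, so rows are streamed to the
# client in batches of itersize instead of all at once during execute().
cur = conn.cursor(name="huge_table_cursor")
cur.itersize = 1000

cur.execute("select * from <HUGE TABLE>")

for row in cur:
    print(row)

cur.close()
conn.close()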
I would appreciate your help with my issue.
I want to append my query results to an existing CSV file. I implemented this following this example, but for some unknown reason the output file stays empty.
This is a minimal snippet of my code:
import psycopg2
import numpy as np
# Connect to an existing database
conn = psycopg2.connect(dbname="...", user="...", password="...", host="...")
# Open a cursor to perform database operations
cur = conn.cursor()
f = open('file_name.csv', 'ab') # "ab" for appending
cur.execute("""select * from table limit 10""") # I have another query here but it isn't relevant.
cur_out = np.asarray(cur.fetchall())
Up to this point it works perfectly: when I print(cur_out), I get the desired output. But after the next step:
np.savetxt(f, cur_out, delimiter=",", fmt='%s')
The file stays empty, and I can't find the reason for that.
Can you please help me?
Thanks to all helpers.
I don't know quite how to tell you this, but your code works perfectly for me.
My suggestions:
If you have existing files with this name, try deleting them and running your code again.
Try changing the batch size.
Try changing the file extension to .txt or .dat.
Or pass the file path to np.savetxt directly instead of an open file object:
np.savetxt('file_name.csv', cur_out, delimiter=",", fmt='%s')
Best wishes.
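One more thing that may be worth checking (an assumption, since the snippet doesn't show it): if the file object f is never flushed or closed, the appended rows can remain in the write buffer and the file will look empty. A minimal sketch of the same steps with a context manager, so the buffer is flushed when the block exits:

import numpy as np
import psycopg2

conn = psycopg2.connect(dbname="...", user="...", password="...", host="...")
cur = conn.cursor()
cur.execute("""select * from table limit 10""")
cur_out = np.asarray(cur.fetchall())

# The with-block closes the file on exit, which also flushes the buffer.
with open('file_name.csv', 'ab') as f:  # "ab" for appending
    np.savetxt(f, cur_out, delimiter=",", fmt='%s')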
I am trying to load a CSV file into Snowflake which has the data below.
But I am getting an "End of record reached while expected to parse column" error.
customer_key,product,customer_id,first_name,last_name,res_version,updated_at
1234,desk,10977,Harry,Western \,1,20-04-1994
I have put the code I tried below. Can someone help me solve this error?
cs.execute("PUT file://"+cleaned_path+"file_name.csv #%file_name")
cs.execute("""copy into file_namefile_format=(type=csv skip_header=1 FIELD_OPTIONALLY_ENCLOSED_BY = '"' EMPTY_FIELD_AS_NULL = TRUE)""")
Try one of these options:
ESCAPE_UNENCLOSED_FIELD = None
or something like this:
COPY INTO table
FROM ( select replace(t.$1,'\') from @table/test.txt.gz t)
FILE_FORMAT=(TYPE=CSV FIELD_DELIMITER='\x01')
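For what it's worth, here is a sketch of how the first option might look when issued through the Python connector. The file_name table and its table stage come from the question's snippet, and the connection parameters are placeholders, so treat all of it as an assumption rather than a ready-to-run script.

import snowflake.connector

conn = snowflake.connector.connect(user="...", password="...", account="...")
cs = conn.cursor()

# Re-run the load with backslash escaping disabled, so the trailing
# backslash in "Western \" is read as a literal character instead of an
# escape that swallows the field delimiter.
cs.execute("""
    copy into file_name
    file_format = (
        type = csv
        skip_header = 1
        field_optionally_enclosed_by = '"'
        empty_field_as_null = true
        escape_unenclosed_field = none
    )
""")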
I need to write Python code that converts all the entries of a CSV file into MySQL INSERT INTO statements through a loop. I have CSV files with about 6 million entries.
The code below can probably read a row, but it has some syntax errors. I can't really pinpoint them as I don't have a background in coding.
file = open('physician_data.csv','r')

for row in file:
    header_string = row
    header_list = list(header_string.split(','))
    number_of_columns = len(header_list)
    insert_into_query= INSERT INTO physician_data (%s)
    for i in range(number_of_columns):
        if i != number_of_columns-1:
            insert_into_query+="%s," %(header_list[i])
        else:
            # no comma after last column
            insert_into_query += "%s)" %(header_list[i])
    print(insert_into_query)

file.close
Can someone tell me how to make this work?
Please include error messages when you describe a problem (https://stackoverflow.com/help/mcve).
You may find the documentation for the CSV library quite helpful.
Use quotes where appropriate, e.g. insert_into_query = "INSERT..."
Call the close function like this: file.close()
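Putting those suggestions together, here is a small sketch of how the statement could be built with the csv module. The physician_data table name comes from the question; the executemany call at the end is commented out because it depends on whichever MySQL driver you use, so treat that part as an assumption.

import csv

with open('physician_data.csv', 'r') as f:
    reader = csv.reader(f)
    header_list = next(reader)  # first row holds the column names

    # One %s placeholder per column, e.g.
    # "INSERT INTO physician_data (a, b) VALUES (%s, %s)"
    placeholders = ", ".join(["%s"] * len(header_list))
    insert_into_query = "INSERT INTO physician_data ({}) VALUES ({})".format(
        ", ".join(header_list), placeholders)
    print(insert_into_query)

    # With a MySQL driver (e.g. mysql-connector-python or PyMySQL):
    # cursor.executemany(insert_into_query, list(reader))
    # connection.commit()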
The code shown below connects to two separate URLs and prints the data retrieved from them. item and item2 are the selected dictionaries I need to deal with.
import urllib2
import json

# assign the url
url = "http://ewodr.wodr.poznan.pl/doradztwo/swd/swd_api.php?dane={%22token%22:%22pcss%22,%22operacja%22:%22stacje%22}"
# open the url
json_obj = urllib2.urlopen(str(url))
output = json.load(json_obj)
station_res = output['data']['features']

with open('file1.txt', 'w') as f:
    for item in station_res:
        station = item['id']
        #for station in station_res:
        url1 = "http://ewodr.wodr.poznan.pl/doradztwo/swd/meteo_api.php?dane={%22token%22:%22pcss%22,%22id%22:" + str(station) + "}"
        json_obj2 = urllib2.urlopen(str(url1))
        output2 = json.load(json_obj2)
        try:
            for item2 in output2['data']:
                print item['id'], item2['rok'], item2['miesiac'], item2['dzien'], item2['godzina'], item2['minuta'], item2['temperatura'], item2['wilgotnosc'], item2['opad'], item2['kierunek_wiatru'], item2['predkosc_wiatru'], item2['dewpoint'], item2['cisnienie']
                json.dump((item['id'], item2['rok'], item2['miesiac'], item2['dzien'], item2['godzina'], item2['minuta'], item2['temperatura'], item2['wilgotnosc'], item2['opad'], item2['kierunek_wiatru'], item2['predkosc_wiatru'], item2['dewpoint'], item2['cisnienie']), f)
        except KeyError:
            pass
Now I have a PostgreSQL database, and in one of its schemas there is a table whose column structure exactly matches what the code above prints, and the goal is to insert the data retrieved from the APIs into that table.
The catch is that for each value of item['id'] I need to call url1, and all the data from url1 (item2 together with its fields, e.g. item2['rok'], item2['miesiac'], ...) needs to go into that table, along with the corresponding item['id'] from the first URL each time.
On second thought, I just want to know whether this whole thing can be done more easily by writing the data in the required format (as the code prints it) to a .txt or .csv file and then doing a bulk insert into the database. I think that might make the task easier than looping over an INSERT every time.
The code above also writes a .txt file containing the data to be inserted into the table, which lives in a schema inside the database. I'm using psycopg2 on Python 2.7; please suggest code for the best approach, which I think is the bulk insert from the text file. Thanks.
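A minimal sketch of the bulk-load route with psycopg2's copy_expert, assuming the loop above is changed to write one tab-separated line per row (instead of json.dump) and that schema_name.table_name stands in for the actual table; both names and the connection parameters are placeholders, not taken from the question.

import psycopg2

conn = psycopg2.connect(dbname="...", user="...", password="...", host="...")
cur = conn.cursor()

# Assumes file1.txt holds one row per line, columns separated by tabs,
# in the same order as the table's columns.
with open('file1.txt', 'r') as f:
    cur.copy_expert("COPY schema_name.table_name FROM STDIN", f)

conn.commit()
cur.close()
conn.close()

In text format, COPY expects tab-separated columns and \N for NULLs by default, which is why the intermediate file would need to be written that way rather than with json.dump.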