I am still learning Python, and as a little project I wrote a script that takes the values I have in a text file and inserts them into a sqlite3 database. But some of the names contain weird letters (I guess you would call them non-ASCII), and they generate an error when they come up. Here is my little script (and please tell me if there is any way it could be more Pythonic):
import sqlite3
f = open('complete', 'r')
fList = f.readlines()
conn = sqlite3.connect('tpb')
cur = conn.cursor()
for i in fList:
    exploaded = i.split('|')
    eList = (
        (exploaded[1], exploaded[5])
    )
    cur.execute('INSERT INTO magnets VALUES(?, ?)', eList)
    conn.commit()
cur.close()
And it generates this error:
Traceback (most recent call last):
  File "C:\Users\Admin\Desktop\sortinghat.py", line 13, in <module>
    cur.execute('INSERT INTO magnets VALUES(?, ?)', eList)
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
To get the file contents into Unicode, you need to decode them from whichever encoding the file is in.
It looks like you're on Windows so a good bet is cp1252.
If you got the file from somewhere else all bets are off.
Once you have the encoding sorted, an easy way to decode is to use the codecs module, e.g.:
import codecs
# ...
with codecs.open('complete', encoding='cp1252') as fin: # or utf-8 or whatever
    for line in fin:
        fields = line.split('|')  # split once, then pick the two columns
        to_insert = (fields[1], fields[5])
        cur.execute('INSERT INTO magnets VALUES (?,?)', to_insert)
    conn.commit()  # one commit after all rows are inserted
# ...
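For comparison, here is a minimal self-contained sketch of the same idea on Python 3, where the built-in open() accepts an encoding argument directly. The sample rows, the helper name load_magnets, and the two-column magnets schema are made up for illustration; only the '|' delimiter and the field positions 1 and 5 come from the question.

```python
import os
import sqlite3
import tempfile


def load_magnets(path, conn, encoding="cp1252"):
    """Decode a '|'-delimited file and insert fields 1 and 5 into magnets."""
    rows = []
    with open(path, encoding=encoding) as fin:
        for line in fin:
            fields = line.rstrip("\n").split("|")
            rows.append((fields[1], fields[5]))
    with conn:  # commits once on success, rolls back on error
        conn.executemany("INSERT INTO magnets VALUES (?, ?)", rows)
    return len(rows)


# Hypothetical sample data, written as cp1252 bytes so it contains a non-ASCII byte.
sample = "1|Café del Mar|x|y|z|magnet:?xt=abc\n2|Plain name|x|y|z|magnet:?xt=def\n"
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(sample.encode("cp1252"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE magnets (name TEXT, link TEXT)")  # assumed schema
inserted = load_magnets(sample_path, conn)
names = [row[0] for row in conn.execute("SELECT name FROM magnets ORDER BY rowid")]
os.remove(sample_path)
```

If the guessed encoding is wrong, the decode either fails loudly or silently produces the wrong letters, so it is worth checking a few known names after the import.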
Hello I have the following code:
from __future__ import print_function
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import pandas as pd
import re
import threading
import pickle
import sqlite3
#from treetagger import TreeTagger
conn = sqlite3.connect('Telcel.db')
cursor = conn.cursor()
cursor.execute('select id_comment from Tweets')
id_comment = [i for i in cursor]
cursor.execute('select id_author from Tweets')
id_author = [i for i in cursor]
cursor.execute('select comment_message from Tweets')
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
cursor.execute('select comment_published from Tweets')
comment_published = [i for i in cursor]
That works well in Python 2.7.12; output:
~/data$ python DBtoList.py
8003
8003
8003
8003
However, when I run the same code with python3 as follows, I get:
~/data$ python3 DBtoList.py
Traceback (most recent call last):
File "DBtoList.py", line 21, in <module>
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
File "DBtoList.py", line 21, in <listcomp>
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
sqlite3.OperationalError: Could not decode to UTF-8 column 'comment_message' with text 'dancing music ������'
I searched for this line and I found:
"dancing music 😜"
I am not sure why the code works in Python 2; it seems that Python 3.5.2 is not able to decode this character at this line:
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
I would appreciate suggestions to fix this problem. Thanks for the support.
Python 3 has no issue with the string itself if you store it using the Python sqlite3 API. I've set utf-8 as my default encoding everywhere.
import sqlite3
conn = sqlite3.connect(':memory:')
conn.execute('create table Tweets (comment_message text)')
conn.execute('insert into Tweets values ("dancing music 😜")')
[(tweet,) ] = conn.execute('select comment_message from tweets')
tweet
output:
'dancing music 😜'
Now, let's see the type:
>>> type(tweet)
str
So everything is fine if you work with Python str from the start.
Now, as an aside, the thing you are trying to do (encode utf-8, decode latin-1) makes very little sense, especially if you have things like emojis in the string. Look what happens to your tweet:
>>> tweet.encode('utf-8').decode('latin-1')
'dancing music ð\x9f\x98\x9c'
But now to your problem: You have stored strings (byte sequences) in your database using an encoding different from utf-8. The error you are seeing is caused by the sqlite3 library attempting to decode these byte sequences and failing because the bytes are not valid utf-8 sequences. The only way to solve this problem is:
1. Find out what encoding was used to encode the strings in the database.
2. Use that encoding to decode the strings by setting conn.text_factory = lambda x: str(x, 'latin-1'). This assumes you've stored the strings using latin1.
I would then suggest that you run through the database and update the values so that they now are encoded using utf-8 which is the default behaviour.
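The decode-then-rewrite approach could look roughly like the sketch below. It assumes the legacy values really are Latin-1 and uses an in-memory table with a single made-up row; the CAST in the first INSERT is only there to simulate bytes that were stored as TEXT without being valid UTF-8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tweets (comment_message TEXT)")
# Simulate legacy data: CAST stores the raw Latin-1 bytes as TEXT without
# checking that they are valid UTF-8.
conn.execute(
    "INSERT INTO Tweets VALUES (CAST(? AS TEXT))",
    ("café".encode("latin-1"),),
)

# Decode TEXT values with the encoding they were actually written in
# (Latin-1 is an assumption here).
conn.text_factory = lambda raw: raw.decode("latin-1")
rows = conn.execute("SELECT rowid, comment_message FROM Tweets").fetchall()

# Write the decoded str values back; sqlite3 stores Python str as UTF-8.
with conn:
    conn.executemany(
        "UPDATE Tweets SET comment_message = ? WHERE rowid = ?",
        [(message, rowid) for rowid, message in rows],
    )

# With the data rewritten, the default text factory can read it again.
conn.text_factory = str
repaired = [row[0] for row in conn.execute("SELECT comment_message FROM Tweets")]
```

On a real database it would be safer to run this against a copy first, since decoding with the wrong codec and writing the result back makes the damage permanent.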
See also this question.
I also highly recommend that you read this article about how encodings work.
I'm struggling with this ValueError. It happens when I define an init_db function that resets the database and adds some data from a local file (hardwarelist.txt). The code was:
def init_db():
    """Initializes the database."""
    db = get_db()
    with app.open_resource('schema.sql', mode='r') as f:
        db.cursor().executescript(f.read())
    with open('hardwarelist.txt') as fl:
        for eachline in fl:
            (model,sn,user,status)=eachline.split(',')
            db.execute('insert into entries (model,sn,user,status) values (?, ?, ?, ?)',
                       (model,sn,user,status))
        fl.close()
    db.commit()
And the error was:
File "/home/ziyma/Heroku_pro/flaskr/flaskr/flaskr.py", line 48, in init_db
(model,sn,user,status)=eachline.split(',')
ValueError: need more than 3 values to unpack
What should I do?
One of my mentors told me "If half your code is error handling, you aren't doing enough error handling." But we can leverage python's exception handling to make the job easier. Here, I've reworked your example so that if an error is detected, a message is displayed, and nothing is committed to the database.
When you hit the bad line, it's printed, and you can figure out what's wrong from there.
from __future__ import print_function  # needed on Python 2 for print(..., file=...)
import sys

def init_db():
    """Initializes the database."""
    db = get_db()
    with app.open_resource('schema.sql', mode='r') as f:
        db.cursor().executescript(f.read())
    with open('hardwarelist.txt') as fl:
        try:
            # start at 1 so the reported line number matches a text editor
            for index, eachline in enumerate(fl, 1):
                (model,sn,user,status)=eachline.strip().split(',')
                db.execute('insert into entries (model,sn,user,status) values (?, ?, ?, ?)',
                           (model,sn,user,status))
            db.commit()
        except ValueError as e:
            print("Failed parsing {} line {}: {} ({})".format('hardwarelist.txt',
                  index, eachline.strip(), e), file=sys.stderr)
            # TODO: Your code should have its own exception class
            # that is raised. Your users would catch that exception
            # with a higher-level summary of what went wrong.
            raise
You should expand that exception handler to catch exceptions from your database code so that you can catch more errors.
As a side note, you need to strip the line before splitting to remove the \n newline character.
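One way the TODO in the handler might be filled in is sketched below: check the field count explicitly and raise a dedicated exception that carries the line number. The names HardwareListError, parse_hardware_line and parse_hardware_list are hypothetical, the sample lines are made up, and skipping blank lines is an assumption about what the file may contain.

```python
class HardwareListError(Exception):
    """Raised when a line of the hardware list cannot be parsed (hypothetical name)."""


def parse_hardware_line(line, lineno, expected=4):
    """Split one comma-separated line into exactly `expected` fields."""
    fields = line.strip().split(",")
    if len(fields) != expected:
        raise HardwareListError(
            "line {}: expected {} fields, got {}: {!r}".format(
                lineno, expected, len(fields), line.strip()
            )
        )
    return tuple(fields)


def parse_hardware_list(lines):
    """Parse every line up front, so nothing is inserted if any line is bad."""
    return [
        parse_hardware_line(line, lineno)
        for lineno, line in enumerate(lines, 1)
        if line.strip()  # assumption: blank lines (e.g. at end of file) are skipped
    ]


good = parse_hardware_list(["T420,SN1,alice,in use\n", "\n", "X220,SN2,bob,spare\n"])

try:
    parse_hardware_list(["T420,SN1,alice,in use\n", "X220,SN2,bob\n"])
    error_message = None
except HardwareListError as exc:
    error_message = str(exc)
```

The parsed tuples can then be handed to db.execute in a second loop, which keeps the parsing errors separate from any database errors.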
UPDATE
From the comments, here's an example of splitting on multiple forms of the comma. In this case it's Unicode FULLWIDTH COMMA U+FF0C. Whether you can enter Unicode directly into your Python scripts depends on your text editor, among other things, but that comma could be represented by "\uff0c" or "，". Anyway, you can use a regular expression to split on multiple characters.
I create the text using Unicode escapes
>>> import re
>>> text='a,b,c\uff0cd\n'
>>> print(text)
a,b,c，d
and I can write the regex with escapes
>>> re.split('[,\uff0c]', text.strip())
['a', 'b', 'c', 'd']
or by copy/paste of the alternate comma character
>>> re.split('[,，]', text.strip())
['a', 'b', 'c', 'd']
I need a Python script to generate a CSV file from my database XXXX. I wrote this script, but something is wrong:
import mysql.connector
import csv
filename=open('test.csv','wb')
c=csv.writer(filename)
cnx = mysql.connector.connect(user='XXXXXXX', password='XXXXX',
host='localhost',
database='XXXXX')
cursor = cnx.cursor()
query = ("SELECT `Id_Vendeur`, `Nom`, `Prenom`, `email`, `Num_magasin`, `Nom_de_magasin`, `Identifiant_Filiale`, `Groupe_DV`, `drt_Cartes`.`gain` as 'gain', `Date_Distribution`, `Status_Grattage`, `Date_Grattage` FROM `drt_Cartes_Distribuer`,`drt_Agent`,`drt_Magasin`,`drt_Cartes` where `drt_Cartes_Distribuer`.`Id_Vendeur` = `drt_Agent`.`id_agent` AND `Num_magasin` = `drt_Magasin`.`Numero_de_magasin` AND `drt_Cartes_Distribuer`.`Id_Carte` = `drt_Cartes`.`num_carte`")
cursor.execute(query)
for Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage in cursor:
    c.writerow([Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage] )
cursor.close()
filename.close()
cnx.close()
When I execute the query in phpMyAdmin it seems to work very well, but from my shell I get this message:
# python test.py
Traceback (most recent call last):
File "test.py", line 18, in <module>
c.writerow([Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage] )
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 5: ordinal not in range(128)
It looks like you are using the csv module on Python 2.7. Quoting the docs:
Note This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
Options; choose one of them:
Follow the doc link, go to the Examples section, and modify your code accordingly.
Use a csv package with Unicode support, like https://pypi.python.org/pypi/unicodecsv
Your data from the database contains more than ASCII characters. I suggest you use the 'unicodecsv' Python module, as suggested in the answer to this question: How to write UTF-8 in a CSV file
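If moving to Python 3 is an option, the csv module there works on str directly, so it is enough to open the output file in text mode with an explicit encoding. The sketch below is written under that assumption and is not a fix for Python 2.7; the rows are made-up stand-ins for the cursor results, and only a few of the column names are reused from the question.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for what the cursor would yield.
rows = [
    (1, "Noël", "Zoë", "zoe@example.com"),
    (2, "Martin", "Paul", "paul@example.com"),
]

fd, csv_path = tempfile.mkstemp(suffix=".csv")
os.close(fd)

# newline="" lets the csv module control line endings itself.
with open(csv_path, "w", encoding="utf-8", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Id_Vendeur", "Nom", "Prenom", "email"])
    writer.writerows(rows)

# Read the file back to confirm the non-ASCII names survived the round trip.
with open(csv_path, encoding="utf-8", newline="") as src:
    read_back = list(csv.reader(src))
os.remove(csv_path)
```

Note that csv.reader returns every field as a string, so the numeric id comes back as "1" rather than 1.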
I have the following code. I use Python 2.7.
import csv
import sqlite3
conn = sqlite3.connect('torrents.db')
c = conn.cursor()
# Create table
c.execute('''DROP TABLE torrents''')
c.execute('''CREATE TABLE IF NOT EXISTS torrents
(name text, size long, info_hash text, downloads_count long,
category_id text, seeders long, leechers long)''')
with open('torrents_mini.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter='|')
    for row in spamreader:
        name = unicode(row[0])
        size = row[1]
        info_hash = unicode(row[2])
        downloads_count = row[3]
        category_id = unicode(row[4])
        seeders = row[5]
        leechers = row[6]
        # triple quotes, because a single-quoted string cannot span lines
        c.execute('''INSERT INTO torrents (name, size, info_hash, downloads_count,
                     category_id, seeders, leechers) VALUES (?,?,?,?,?,?,?)''',
                  (name, size, info_hash, downloads_count, category_id, seeders, leechers))
conn.commit()
conn.close()
The error message I receive is
Traceback (most recent call last):
File "db.py", line 15, in <module>
name = unicode(row[0])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
If I don't convert to unicode, the error I get is
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
Adding name = row[0].decode('UTF-8') gives me another error
Traceback (most recent call last):
File "db.py", line 27, in <module>
for row in spamreader:
_csv.Error: line contains NULL byte
The data contained in the CSV file is in the following format
Tha Twilight New Moon DVDrip 2009 XviD-AMiABLE|694554360|2cae2fc76d110f35917d5d069282afd8335bc306|0|movies|0|1
Edit: I finally dropped the attempt and accomplished the task using the sqlite3 command-line tool (it was quite easy).
I do not yet know what caused the errors, but when sqlite3 was importing the said CSV file, it kept printing warnings about an "unescaped character", the character being the double quote (").
Thanks to everyone who tried to help.
Your data is not encoded as ASCII. Use the correct codec for your data.
You can tell Python what codec to use with:
unicode(row[0], correct_codec)
or use the str.decode() method:
row[0].decode(correct_codec)
What that correct codec is, we cannot tell you. You'll have to consult whatever you got the file from.
If you cannot figure out what encoding was used, you could use a package like chardet to make an educated guess, but take into account that such a library is not fail-proof.
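A standard-library alternative to a detector like chardet is to try a short list of candidate codecs in order and keep the first one that decodes without error. This is a different and cruder technique than statistical detection; the function name decode_with_fallback and the candidate list are assumptions for illustration. The strict codec (UTF-8) goes first, because cp1252 accepts almost any byte sequence and would otherwise hide a UTF-8 file.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Return (text, codec) for the first candidate codec that decodes `raw`."""
    for codec in candidates:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate codecs could decode the data")


# The same name encoded two different ways, as it might appear in two files.
utf8_result = decode_with_fallback("Amélie".encode("utf-8"))
cp1252_result = decode_with_fallback("Amélie".encode("cp1252"))
```

A successful decode only shows that the bytes are valid in that codec, not that the guess is right, so the result still needs a visual check against names you know.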
Thanks to bobince for solving the first bugs!
How can you use pg.escape_bytea or pg.escape_string in the following?
#1 With both pg.escape_string and pg.escape_bytea
con1.query(
    "INSERT INTO files (file, file_name) VALUES ('%s', '%s')" %
    (pg.escape_bytea(pg.espace_string(f.read())), pg.espace_string(pg.escape_bytea(f.name)))
)
I get the error
AttributeError: 'module' object has no attribute 'espace_string'
I tested the two escapes in the reverse order unsuccessfully too.
#2 Without pg.escape_string()
con1.query(
"INSERT INTO files (file, file_name) VALUES ('%s', '%s')" %
(pg.escape_bytea(f.read()), pg.escape_bytea(f.name))
)
I get
WARNING: nonstandard use of \\ in a string literal
LINE 1: INSERT INTO files (file, file_name) VALUES ('%PDF-1.4\\012%\...
^
HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
------------------------
-- Putting pdf files in
#3 With only pg.escape_string
I get the following error
------------------------
-- Putting pdf files in
------------------------
Traceback (most recent call last):
File "<stdin>", line 30, in <module>
File "<stdin>", line 27, in put_pdf_files_in
File "/usr/lib/python2.6/dist-packages/pg.py", line 313, in query
return self.db.query(qstr)
pg.ProgrammingError: ERROR: invalid byte sequence for encoding "UTF8": 0xc7ec
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
INSERT INTO files('binf','file_name') VALUES(file,file_name)
You've got the (...) sections the wrong way round, you're trying to insert the columns (file, filename) into the string literals ('binf', 'file_name'). You're also not actually inserting the contents of the variables binf and file_name into the query.
The pg module's query call does not support parameterisation. You would have to make the string yourself:
con1.query(
"INSERT INTO files (file, file_name) VALUES ('%s', '%s')" %
(pg.escape_string(f.read()), pg.escape_string(f.name))
)
This is assuming f is a file object; I'm not sure where file is coming from in the code above or what .read(binf) is supposed to mean. If you are using a bytea column to hold your file data you must use escape_bytea instead of escape_string.
Better than creating your own queries is letting pg do it for you with the insert method:
con1.insert('files', file= f.read(), file_name= f.name)
Alternatively, consider using the pgdb interface or one of the other DB-API-compliant interfaces that is not PostgreSQL-specific, if you ever want to consider running your app on a different database. DB-API gives you parameterisation in the execute method:
cursor.execute(
'INSERT INTO files (file, file_name) VALUES (%(content)s, %(name)s)',
{'content': f.read(), 'name': f.name }
)
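To show what DB-API parameterisation buys you without needing a PostgreSQL server, here is a small sketch that uses the standard-library sqlite3 module as a stand-in. That substitution is an assumption made so the example can run anywhere: sqlite3 uses the named (:name) placeholder style rather than the %(name)s style shown above, and the table, file contents and file name are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (file BLOB, file_name TEXT)")

# Bytes and a name that would both need escaping if pasted into SQL by hand.
content = b"%PDF-1.4\n\\012 it's \x00\xc7\xec binary"
name = "report's final.pdf"

# The driver passes the values separately from the SQL text, so no manual
# escaping of quotes, backslashes or binary data is needed.
with conn:
    conn.execute(
        "INSERT INTO files (file, file_name) VALUES (:content, :name)",
        {"content": content, "name": name},
    )

stored_content, stored_name = conn.execute(
    "SELECT file, file_name FROM files"
).fetchone()
```

With a PostgreSQL DB-API driver the structure stays the same; only the connect call and the placeholder style change.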