Python SQLite: enforce UTF-8 encoding

I'm developing a cross-platform Python (3.7+) application, and I need to rely on the sort order of TEXT columns in SQLite, meaning the comparison of TEXT values must be based on their UTF-8 bytes, even if the system encoding (sys.getdefaultencoding()) is not utf-8.
But in the documentation of the sqlite3 module I can't find an encoding option for sqlite3.connect.
And I read that using sys.setdefaultencoding("utf-8") is an ugly hack and highly discouraged (that's why we need to reload(sys) before calling it).
So what's the solution?

Looking at Python's _sqlite/connection.c code, either sqlite3_open_v2 or sqlite3_open is called (depending on a compile flag). According to the SQLite docs, both use UTF-8 as the default database encoding. I'm still not sure what "default" means here, since the docs for those functions don't mention any way to override it. But it doesn't look like Python can open a database with another encoding.
#ifdef SQLITE_OPEN_URI
Py_BEGIN_ALLOW_THREADS
rc = sqlite3_open_v2(database, &self->db,
SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE |
(uri ? SQLITE_OPEN_URI : 0), NULL);
#else
if (uri) {
PyErr_SetString(pysqlite_NotSupportedError, "URIs not supported");
return -1;
}
Py_BEGIN_ALLOW_THREADS
rc = sqlite3_open(database, &self->db);
#endif
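A minimal sketch for checking this from Python, assuming a fresh in-memory database and the default BINARY collation (no COLLATE clause). It asks SQLite for the database encoding with PRAGMA encoding and compares the ORDER BY result with a sort by UTF-8 bytes done in Python. The helper name text_sort_matches_utf8_bytes and the sample strings are made up for illustration.

```python
import sqlite3


def text_sort_matches_utf8_bytes(values):
    """Return (encoding, sqlite_order, byte_order) for the given strings."""
    conn = sqlite3.connect(":memory:")
    try:
        # Ask SQLite which text encoding this database uses.
        encoding = conn.execute("PRAGMA encoding").fetchone()[0]
        conn.execute("CREATE TABLE t (s TEXT)")
        conn.executemany("INSERT INTO t (s) VALUES (?)", [(v,) for v in values])
        # No COLLATE clause, so the built-in BINARY collation is used.
        sqlite_order = [row[0] for row in conn.execute("SELECT s FROM t ORDER BY s")]
    finally:
        conn.close()
    # Reference ordering: compare the UTF-8 encoded bytes in Python.
    byte_order = sorted(values, key=lambda v: v.encode("utf-8"))
    return encoding, sqlite_order, byte_order


encoding, sqlite_order, byte_order = text_sort_matches_utf8_bytes(
    ["é", "z", "a", "Z", "中", "😀"]
)
print(encoding)
print(sqlite_order == byte_order)
```

Running the same PRAGMA against an existing database file is a cheap way to detect a file that was created elsewhere with a different encoding.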

Related

Python bug: null byte in input prompt

I've found that
input('some\x00 text')
will prompt for some instead of some text.
From sources, I've figured out that this function uses C function PyOS_Readline, which ignores everything in prompt after NULL byte.
From PyOS_StdioReadline(FILE *sys_stdin, FILE *sys_stdout, const char *prompt):
fprintf(stderr, "%s", prompt);
https://github.com/python/cpython/blob/3.6/Python/bltinmodule.c#L1989
https://github.com/python/cpython/blob/3.6/Parser/myreadline.c#L251
Is this a bug, or is there a reason for it?
Issue: http://bugs.python.org/issue30431

The function signature pretty much requires a NUL-terminated C string, PyOS_StdioReadline(FILE *sys_stdin, FILE *sys_stdout, const char *prompt), so there isn't much that can be done about this without changing the API and breaking interoperability with GNU readline.
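The truncation can be reproduced from Python with ctypes, which reads a buffer the way C code reads a const char *: up to the first NUL byte. This is only an illustration of the C-string behaviour, not the code path input() takes; as_c_string and strip_nul are hypothetical helper names.

```python
import ctypes


def as_c_string(data: bytes) -> bytes:
    """Return what C code sees when `data` is read as a NUL-terminated string."""
    return ctypes.c_char_p(data).value


def strip_nul(prompt: str) -> str:
    """Hypothetical workaround: drop NUL characters before calling input()."""
    return prompt.replace("\x00", "")


print(as_c_string(b"some\x00 text"))
print(strip_nul("some\x00 text"))
```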

How can I create a file with `/` in its file name? [duplicate]

I know that this is not something that should ever be done, but is there a way to use the slash character that normally separates directories within a filename in Linux?

The answer is that you can't, unless your filesystem has a bug. Here's why:
There is a system call for renaming your file defined in fs/namei.c called renameat:
SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
int, newdfd, const char __user *, newname)
When the system call gets invoked, it does a path lookup (do_path_lookup) on the name. Keep tracing this, and we get to link_path_walk which has this:
static int link_path_walk(const char *name, struct nameidata *nd)
{
struct path next;
int err;
unsigned int lookup_flags = nd->flags;
while (*name=='/')
name++;
if (!*name)
return 0;
...
This code applies to any file system. What does this mean? It means that if you try to pass a name containing an actual '/' character using traditional means, it will not do what you want. There is no way to escape the character. If a filesystem "supports" this, it's because it either:
Uses a Unicode character or something that looks like a slash but isn't.
Has a bug.
Furthermore, if you did go in and edit the bytes to add a slash character into a file name, bad things would happen. That's because you could never refer to this file by name :( since any time you did, Linux would assume you were referring to a nonexistent directory. Using the 'rm *' technique would not work either, since bash simply expands that to the filename. Even rm -rf wouldn't work, as a simple strace reveals how things go on under the hood (shortened):
$ ls testdir
myfile2 out
$ strace -vf rm -rf testdir
...
unlinkat(3, "myfile2", 0) = 0
unlinkat(3, "out", 0) = 0
fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
close(3) = 0
unlinkat(AT_FDCWD, "testdir", AT_REMOVEDIR) = 0
...
Notice that each unlinkat call refers to the file by name, so these calls would fail for a file whose name contained a '/'.
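The path-walk behaviour is easy to observe from user space. A small sketch, assuming a POSIX system and a scratch directory: a name containing '/' is treated as "directory foo, file bar", so creation fails because the directory foo does not exist. The try_create helper is made up for this example.

```python
import os
import tempfile


def try_create(directory, name):
    """Try to create `name` inside `directory`.

    Returns the exception class name on failure, or None on success.
    """
    try:
        with open(os.path.join(directory, name), "w"):
            pass
    except OSError as exc:
        return type(exc).__name__
    return None


with tempfile.TemporaryDirectory() as tmp:
    result_plain = try_create(tmp, "foobar")
    # The kernel splits this at the slash and looks for a directory "foo".
    result_slash = try_create(tmp, "foo/bar")
    listing = sorted(os.listdir(tmp))

print(result_plain, result_slash, listing)
```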

You could use a Unicode character that displays as / (for example the fraction slash), assuming your filesystem supports it.
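A short sketch of that idea, assuming a filesystem and locale that accept UTF-8 file names. U+2044 FRACTION SLASH is a different code point from '/', so the kernel does not treat it as a separator; the file name foo⁄bar is only an example.

```python
import os
import tempfile
import unicodedata

FRACTION_SLASH = "\u2044"  # looks like a slash, but is not the path separator
name = "foo" + FRACTION_SLASH + "bar"

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, name), "w"):
        pass
    listing = os.listdir(tmp)

print(unicodedata.name(FRACTION_SLASH))
print(listing == [name])
```

Anything that later compares this name against a string typed with a real '/' will not match, which is the usual downside of look-alike characters.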

It depends on what filesystem you are using. Of some of the more popular ones:
ext3: No
ext4: No
jfs: Yes
reiserfs: No
xfs: No

Only with an agreed-upon encoding. For example, you could agree that % will be encoded as %% and that %2F will mean a /. All the software that accessed this file would have to understand the encoding.

The short answer is: No, you can't. It's a necessary prohibition because of how the directory structure is defined.
And, as mentioned, you can display a unicode character that "looks like" a slash, but that's as far as you get.

In general it's a bad idea to try to use "bad" characters in a file name at all; even if you somehow manage it, it tends to make it hard to use the file later. The filesystem separator is flat-out not going to work at all, so you're going to need to pick an alternative method.
Have you considered URL-encoding the URL then using that as the filename? The result should be fine as a filename, and it's easy to reconstruct the name from the encoded version.
Another option is to create an index - create the output filename using whatever method you like - sequentially-numbered names, SHA1 hashes, whatever - then write a file with the generated filename/URL pair. You can save that into a hash and use it to do a URL-to-filename lookup or vice-versa with the reversed version of the hash, and you can write it out and reload it later if needed.
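Both suggestions can be sketched with the standard library. This assumes the names being stored are URLs, as in the answer above; the function names and the example URL are placeholders. Note that quote() leaves '/' unescaped by default, so safe="" is needed here, and that very long URLs can still exceed a filesystem's name length limit, which is where the hash-based index helps.

```python
import hashlib
from urllib.parse import quote, unquote


def url_to_filename(url):
    """Percent-encode a URL so it contains no '/' and can be used as a file name."""
    return quote(url, safe="")


def filename_to_url(filename):
    """Reverse url_to_filename()."""
    return unquote(filename)


def hashed_name(url):
    """Alternative: a fixed-length name, to be stored in an index next to the URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


url = "http://example.com/a/b?x=1"
encoded = url_to_filename(url)
index = {hashed_name(url): url}  # filename -> URL lookup table

print(encoded)
print(filename_to_url(encoded) == url)
```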

The short answer is: you must not. The long answer is: you probably can, depending on where you are viewing it from and which layer you are working with.
Since the question has the Unix tag, I am going to answer for Unix.
As mentioned in other answers, you must not use forward slashes in a filename.
However, in macOS you can create a file whose name shows a forward slash / by:
# avoid doing it at all cost
touch 'foo:bar'
Now, when you look at this filename from the terminal, you will see it as foo:bar.
But if you look at it from Finder, you will see that Finder displays it as foo/bar.
The same thing works the other way round: if you create a file from Finder with a forward slash in it, like /foobar, a conversion is done in the background. As a result, you will see :foobar in the terminal but /foobar in Finder.
So : is valid in the Unix layer, but it is translated to or from / in the Mac layers such as the Finder window and the GUI. The colon : is used as the separator in HFS paths, and the slash / is used as the separator in POSIX paths.
So there is a two-way translation happening, depending on which "layer" you are working with.
See more details here: https://apple.stackexchange.com/a/283095/323181

You can have a filename with a / in Linux and Unix. This is a very old question, but surprisingly nobody has said it in almost 10 years since the question was asked.
Every Unix and Linux system has the root directory named /. A directory is just a special kind of file. Symbolic links, character devices, etc are also special kinds of files. See here for an in depth discussion.
You can't create any other files with a /, but you certainly have one -- and a very important one at that.

Curses library in python

I have C code which draws a vertical & a horizontal line in the center of screen as below:
#include<stdio.h>
#define HLINE for(i=0;i<79;i++)\
printf("%c",196);
#define VLINE(X,Y) {\
gotoxy(X,Y);\
printf("%c",179);\
}
int main()
{
int i,j;
clrscr();
gotoxy(1,12);
HLINE
for(y=1;y<25;y++)
VLINE(39,y)
return 0;
}
I am trying to convert it literally in python version 2.7.6:
import curses
def HLINE():
    for i in range(0,79):
        print "%c" % 45
def VLINE(X,Y):
    curses.setsyx(Y,X)
    print "%c" % 124
curses.setsyx(12,1)
HLINE()
for y in range(1,25):
    VLINE(39,y)
My questions:
1. Do we have to swap the positions of x and y in the setsyx function, i.e., is gotoxy(1,12) the same as setsyx(12,1)?
2. Is the curses module only available for Unix and not for Windows? If yes, then what about Windows (Python 2.7.6)?
3. Why are the characters with values 179 and 196 shown as � in Python, but as | and - respectively in C?
4. Is the above Python code a correct literal translation, or does it need improvement?

Yes, you will have to swap the argument positions: setsyx(y, x) versus gotoxy(x, y).
There are Windows libraries available. I find the most useful binaries here: link
This most likely has to do with character encoding. What you could try is adding the following line to the top of your Python file (after the #!/usr/bin/python line); it declares the encoding of the source file as UTF-8, so non-ASCII literals in the file are read as UTF-8:
# -*- coding: utf-8 -*-
Your Python code looks acceptable enough to me; I wouldn't worry about it.

1. Yes.
2. Duplicate of Curses alternative for windows
3. Presumably you are using Python 2.x, thus your characters are bytes and therefore encoding-dependent. The meaning of a particular numeric value is determined by the encoding used. Most likely you are using utf8 on Linux and something non-utf8 in your Windows program, so you cannot compare the values. In curses you should use curses.ACS_HLINE and curses.ACS_VLINE.
4. You cannot mix print and curses functions, it will mess up the display. Use curses.addch or variants instead.
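A minimal sketch of the question's program using only curses calls, assuming Python 3 and a Unix terminal. It uses the window methods hline() and vline() with the ACS constants instead of print; draw_cross and main are names invented for this example, and the default 25x80 size mirrors the question's C code.

```python
import curses


def draw_cross(win, hline_ch, vline_ch, rows=25, cols=80):
    """Draw a horizontal and a vertical line through the middle of `win`."""
    # Both methods take (y, x, character, length); note y comes first.
    win.hline(rows // 2, 0, hline_ch, cols - 1)
    win.vline(0, cols // 2, vline_ch, rows - 1)


def main(stdscr):
    rows, cols = stdscr.getmaxyx()
    stdscr.clear()
    # ACS_* constants exist only after initscr(), which wrapper() has called.
    draw_cross(stdscr, curses.ACS_HLINE, curses.ACS_VLINE, rows, cols)
    stdscr.refresh()
    stdscr.getch()  # wait for a key press


# To run in a real terminal: curses.wrapper(main)
```

curses.wrapper() also restores the terminal state if the program raises, which a bare initscr()/endwin() pair does not.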

Are PyArg_ParseTuple() "s" format specifiers useful in Python 3.x C API?

I'm trying to write a Python C extension that processes byte strings, and I have something basically working for Python 2.x and Python 3.x.
For the Python 2.x code, near the start of my function, I currently have a line:
if (!PyArg_ParseTuple(args, "s#:in_bytes", &src_ptr, &src_len))
...
I notice that the s# format specifier accepts both Unicode strings and byte strings. I really just want it to accept byte strings and reject Unicode. For Python 2.x, this might be "good enough"--the standard hashlib seems to do the same, accepting Unicode as well as byte strings. However, Python 3.x is meant to clean up the Unicode/byte string mess and not let the two be interchangeable.
So, I'm surprised to find that in Python 3.x, the s format specifiers for PyArg_ParseTuple() still seem to accept Unicode and provide a "default encoded string version" of the Unicode. This seems to go against the principles of Python 3.x, making the s format specifiers unusable in practice. Is my analysis correct, or am I missing something?
Looking at the implementation for hashlib for Python 3.x (e.g. see md5module.c, function MD5_update() and its use of GET_BUFFER_VIEW_OR_ERROUT() macro) I see that it avoids the s format specifiers, and just takes a generic object (O specifier) and then does various explicit type checks using the GET_BUFFER_VIEW_OR_ERROUT() macro. Is this what we have to do?

I agree with you -- it's one of several spots where the C API migration of Python 3 was clearly not designed as carefully and thoroughly as the Python coder-visible parts. I do also agree that probably the best workaround for now is focusing on "buffer views", per that macro -- until and unless something better gets designed into a future Python C API (don't hold your breath waiting for that to happen, though;-).
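The "buffer view" behaviour can be seen from the Python side without writing any C. This is only a Python-level analogue of what the hashlib macro does, not the C code itself: memoryview() accepts any object that exports the buffer protocol and rejects str, and hashlib.md5() rejects str the same way. The helpers as_bytes_view and rejects_str are hypothetical.

```python
import hashlib


def as_bytes_view(obj):
    """Return a memoryview of a bytes-like object, rejecting str explicitly."""
    if isinstance(obj, str):
        raise TypeError("expected a bytes-like object, not str")
    return memoryview(obj)


def rejects_str(func):
    """Return True if calling func with a str raises TypeError."""
    try:
        func("text")
    except TypeError:
        return True
    return False


print(as_bytes_view(b"abc").tobytes())
print(rejects_str(hashlib.md5), rejects_str(memoryview))
```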

How do I use extended characters in Python's curses library?

I've been reading tutorials about Curses programming in Python, and many refer to an ability to use extended characters, such as line-drawing symbols. They're characters > 255, and the curses library knows how to display them in the current terminal font.
Some of the tutorials say you use it like this:
c = ACS_ULCORNER
...and some say you use it like this:
c = curses.ACS_ULCORNER
(That's supposed to be the upper-left corner of a box, like an L flipped vertically)
Anyway, regardless of which method I use, the name is not defined and the program thus fails. I tried "import curses" and "from curses import *", and neither works.
Curses' window() function makes use of these characters, so I even tried poking around on my box for the source to see how it does it, but I can't find it anywhere.

You have to set your locale (an empty string selects the user's default locale), then encode your output as UTF-8, as follows:
import curses
import locale
locale.setlocale(locale.LC_ALL, '') # use the user's default locale
scr = curses.initscr()
scr.clear()
scr.addstr(0, 0, u'\u3042'.encode('utf-8'))
scr.refresh()
scr.getch() # wait for a key press before quitting
curses.endwin() # endwin() is a module-level function, not a window method
output:
あ

From curses/__init__.py:
Some constants, most notably the ACS_* ones, are only added to the C _curses module's dictionary after initscr() is called. (Some versions of SGI's curses don't define values for those constants until initscr() has been called.) This wrapper function calls the underlying C initscr(), and then copies the constants from the _curses module to the curses package's dictionary. Don't do 'from curses import *' if you'll be needing the ACS_* constants.
In other words:
>>> import curses
>>> curses.ACS_ULCORNER
exception
>>> curses.initscr()
>>> curses.ACS_ULCORNER
4194412
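The same point as a script rather than an interactive session. A small sketch, assuming a Unix build of Python with the curses module: the constant is looked up lazily inside main(), which curses.wrapper() only calls after initscr(). The function names acs_ready and main are made up for this example.

```python
import curses


def acs_ready():
    """True once curses.initscr() has populated the ACS_* constants."""
    return hasattr(curses, "ACS_ULCORNER")


def main(stdscr):
    # curses.wrapper() has already called initscr(), so ACS_* exist here.
    stdscr.addch(0, 0, curses.ACS_ULCORNER)
    stdscr.refresh()
    stdscr.getch()


print(acs_ready())  # checked right after import, before any initscr() call
# To run in a real terminal: curses.wrapper(main)
```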

I believe the below is appropriately related, to be posted under this question. Here I'll be using utfinfo.pl (see also on Super User).
First of all, for standard ASCII character set, the Unicode code point and the byte encoding is the same:
$ echo 'a' | perl utfinfo.pl
Char: 'a' u: 97 [0x0061] b: 97 [0x61] n: LATIN SMALL LETTER A [Basic Latin]
So we can do in Python's curses:
window.addch('a')
window.border('a')
... and it works as intended
However, if a character is above basic ASCII, then there are differences, which addch docs don't necessarily make explicit. First, I can do:
window.addch(curses.ACS_PI)
window.border(curses.ACS_PI)
... in which case, in my gnome-terminal, the Unicode character 'π' is rendered. However, if you inspect ACS_PI, you'll see it's an integer number, with a value of 4194427 (0x40007b); so the following will also render the same character (or rather, glyph?) 'π':
window.addch(0x40007b)
window.border(0x40007b)
To see what's going on, I grepped through the ncurses source, and found the following:
#define ACS_PI NCURSES_ACS('{') /* Pi */
#define NCURSES_ACS(c) (acs_map[NCURSES_CAST(unsigned char,c)])
#define NCURSES_CAST(type,value) static_cast<type>(value)
#lib_acs.c: NCURSES_EXPORT_VAR(chtype *) _nc_acs_map(void): MyBuffer = typeCalloc(chtype, ACS_LEN);
#define typeCalloc(type,elts) (type *)calloc((elts),sizeof(type))
#./widechar/lib_wacs.c: { '{', { '*', 0x03c0 }}, /* greek pi */
Note here:
$ echo '{π' | perl utfinfo.pl
Got 2 uchars
Char: '{' u: 123 [0x007B] b: 123 [0x7B] n: LEFT CURLY BRACKET [Basic Latin]
Char: 'π' u: 960 [0x03C0] b: 207,128 [0xCF,0x80] n: GREEK SMALL LETTER PI [Greek and Coptic]
... neither of which relates to the value of 4194427 (0x40007b) for ACS_PI.
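One way to read the number, which the grep output above does not state outright, is as an attribute bit combined with a plain character in the low byte. The sketch below assumes that 0x400000 is ncurses' A_ALTCHARSET bit, as on typical builds; it is plain integer arithmetic and does not call curses. split_acs is a hypothetical helper.

```python
A_ALTCHARSET_BIT = 0x400000  # assumption: value of A_ALTCHARSET on a typical ncurses build


def split_acs(value):
    """Split an ACS_* integer into (attribute bits, low-byte character)."""
    return value & ~0xFF, chr(value & 0xFF)


attrs_pi, char_pi = split_acs(0x40007B)   # the ACS_PI value quoted above
attrs_ul, char_ul = split_acs(4194412)    # an ACS_ULCORNER value seen on one system

print(hex(attrs_pi), char_pi)
print(hex(attrs_ul), char_ul)
```

Under that assumption the low byte of ACS_PI is the '{' used in the NCURSES_ACS('{') definition, with the alternate-character-set bit set on top of it.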
Thus, when addch and/or border see a character above ASCII (basically an unsigned int, as opposed to an unsigned char), they (at least in this instance) use that number neither as a Unicode code point nor as UTF-8 encoded bytes. Instead, they use it as a look-up index for the acs_map mapping (which ultimately, however, would return the Unicode code point, even if it emulates VT-100). That is why the following specification:
window.addch('π')
window.border('π')
will fail in Python 2.7 with "argument 1 or 3 must be a ch or an int", and in Python 3.2 would render simply a space instead of a character. When we specify 'π', we've actually specified the UTF-8 encoding [0xCF,0x80] - but even if we specify the Unicode code point:
window.addch(0x03C0)
window.border(0x03C0)
... it simply renders nothing (space) in both Python 2.7 and 3.2.
That being said - the function addstr does accept UTF-8 encoded strings, and works fine:
window.addstr('π')
... but for borders - since border() apparently handles characters in the same way addch() does - we're apparently out of luck, for anything not explicitly specified as an ACS constant (and there's not that many of them, either).
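Since addstr does accept such strings, one possible workaround is to draw the frame by hand with addstr instead of border(). This is a sketch under assumptions, not something the curses documentation prescribes: it assumes Python 3, a UTF-8 locale set via locale.setlocale(locale.LC_ALL, ''), and a terminal font with box-drawing glyphs. draw_unicode_box is a made-up helper name.

```python
def draw_unicode_box(win, top, left, height, width):
    """Draw a box outline with addstr() and Unicode box-drawing characters."""
    horizontal = "\u2500" * (width - 2)                                # ─
    win.addstr(top, left, "\u250c" + horizontal + "\u2510")            # ┌──┐
    for row in range(top + 1, top + height - 1):
        win.addstr(row, left, "\u2502")                                # │ left edge
        win.addstr(row, left + width - 1, "\u2502")                    # │ right edge
    win.addstr(top + height - 1, left, "\u2514" + horizontal + "\u2518")  # └──┘


# Usage inside a curses program, e.g. from a function passed to curses.wrapper():
#     draw_unicode_box(stdscr, 0, 0, 5, 20)
```

Keep the box away from the window's bottom-right cell: writing a character there with addstr() raises curses.error because the cursor cannot advance past it.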
Hope this helps someone,
Cheers!
