After researching how different people slugify titles, I've noticed that the discussion often skips how to deal with non-English titles.
URL encoding is very restrictive. See http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
So, for example, how do folks build title slugs for things like
"Una lágrima cayó en la arena"
One can come up with a reasonable table for Indo-European languages, i.e. things that can be encoded via ISO-8859-1. For example, such a conversion table would translate 'á' => 'a', so the slug would be
"una-lagrima-cayo-en-la-arena"
However, I'm using Unicode (in particular the UTF-8 encoding), so there are no guarantees about what sort of code points I'm going to get (I have to be prepared for things that can't be ISO-8859-1 encoded).
In a nutshell: how do I deal with this? Should I come up with a conversion table for characters in the ISO-8859-1 range (< 255) and drop everything else?
EDIT: To give a bit more context: a priori, I don't really expect to slugify data in non-Indo-European languages, but I'd like to have a plan in case I encounter such data.
A conversion table for the extended ASCII range would be nice. Any pointers?
Also, since people are asking: I'm using Python, running on Google App Engine.
A nearly complete transliteration table (for the Latin, Greek and Cyrillic character sets) can be found in the slughifi library. It is geared towards Django, but can easily be modified to fit general needs (I use it with a Werkzeug-based app on App Engine).
I simply use UTF-8 for URL paths. As long as the domain is non-IDN, FF3 and IE work fine with this. Google reads and displays them correctly. The IRI RFC allows Unicode. Just make sure you parse the incoming URLs correctly.
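For example, in Python the slug can be kept as Unicode text and only percent-encoded when the URL is built. This is a minimal sketch with the standard library; urllib.parse is Python 3, so under the Python 2 / App Engine setup from the question you would UTF-8-encode the string first and use urllib.quote/unquote instead.
from urllib.parse import quote, unquote
slug = "una-lágrima-cayó-en-la-arena"
path = "/posts/" + quote(slug)   # '/posts/una-l%C3%A1grima-cay%C3%B3-en-la-arena'
print(path)
print(unquote(path))             # back to '/posts/una-lágrima-cayó-en-la-arena'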
In general this is going to depend on the language you expect to get. If your primary userbase is Japanese, dropping everything but ISO-8859-1 characters is unlikely to go over well.
That said, one option might be to use transliteration mode, if your character set conversion library supports it. For example, with GNU iconv, one can do:
] echo Una lágrima cayó en la arena|iconv -f utf8 -t ascii//TRANSLIT
Una lagrima cayo en la arena
As you can see, the accented characters were automatically converted to something in the ASCII range. How to translate this to code will of course depend on the language you're using, but if your language relies on GNU iconv for charset conversion (and on Linux it probably does), this trick can be applied directly by simply specifying "ascii//TRANSLIT" as the target character set.
One thing to note with this, however, is that it's only effective with characters that "look like" something in ASCII. For example:
] echo 我輩は猫である。名前はまだない。|iconv -f utf8 -t ascii//TRANSLIT
????????????????
As you can see, it's not much help for Japanese, and needs further processing afterward to remove characters not suitable for URLs.
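Since the question mentions Python, here is a rough stdlib-only equivalent of the transliterate-then-clean-up idea. It is only a sketch: it assumes ASCII-only slugs are acceptable, it handles accents that decompose into combining marks (like 'á' or 'ö'), and, just like iconv's table, it does nothing useful for CJK text.
import re
import unicodedata
def slugify_ascii(title):
    # Decompose 'á' into 'a' + combining accent, then drop anything non-ASCII.
    ascii_text = unicodedata.normalize('NFKD', title).encode('ascii', 'ignore').decode('ascii')
    # Lower-case, collapse runs of anything else into single hyphens, trim the ends.
    return re.sub(r'[^a-z0-9]+', '-', ascii_text.lower()).strip('-')
print(slugify_ascii("Una lágrima cayó en la arena"))  # una-lagrima-cayo-en-la-arena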
If all else fails, you could use a conversion table, but there might be a better-performing solution available. What server-side language are you using?
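If you do go the conversion-table route, str.translate with a small hand-maintained mapping covers letters that the decomposition trick above misses ('ß', 'ø' and friends). The fragment below is purely illustrative and uses Python 3 semantics; on Python 2 you would build a dict keyed by ord() and call unicode.translate instead.
# Illustrative fragment only; extend it as you encounter new data.
TRANSLIT_TABLE = str.maketrans({'ß': 'ss', 'ø': 'o', 'Ø': 'O', 'æ': 'ae', 'Æ': 'AE', 'đ': 'd', 'ł': 'l'})
def transliterate(text):
    return text.translate(TRANSLIT_TABLE)
print(transliterate("smørrebrød"))  # smorrebrod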
Related
I downloaded the page source (HTML) of some websites with Selenium (Python), and I wish to find all base64-encoded strings in the HTML files.
Is there a known structure to all base64-encoded strings in HTML? From my observations, it seems like they start with ;base64, followed by the encoded characters and finally a closing bracket ). Is that accurate?
From Wikipedia, the encoded string must also be composed of the following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/. Can someone also confirm that?
Thanks a lot in advance!
* Edit 1 *
Thanks a lot Tris! The link you provided is very helpful! However, from that it seems like there is no specific format for the end of a base64 string. If I want to detect its end, what advice would you give other than looking for )?
I mainly want to track changes to a bunch of websites, and the base64 encodings contain a lot of data that is not relevant for my use. To save storage, I therefore intend to remove them. An example is www.amd.com, which has the following data:image/png;base64,... (after being rendered by the browser).
Since there are many different websites, I don't know all of their formats. Here are some other examples of the base64 strings that I found and are not useful to me:
data:font/truetype;base64,AAEAAA...
data:image/png;base64,iVBORw0KG...
Several of the examples that I saw all ended with a closing bracket ). May I ask under what scenarios they end with ) and when they don't?
Thanks again!
Not all base64-encoded strings will include a ;base64 at the beginning -- this is typically specific to data URLs. If you are specifically looking for base64-encoded images or other inline elements that would otherwise be referred to with an HTTP URL, this might be fine. The closing bracket is not typically relevant; I haven't seen it required on data URLs or other base64-encoded strings.
Typically, base64-encoded strings use the alphabet you've mentioned -- ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/. If the input length is not a multiple of 3 bytes, the output is padded with an appropriate number of = characters at the end.
There is another commonly used base64 format on the web -- the URL-safe base64 format. In this encoding, + and / are typically replaced with - and _ so they can be included in URLs safely, hence the name.
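Putting that together for your use case (stripping the payloads to save storage), a sketch over the downloaded HTML might look like the following. The pattern, the helper name and the '[stripped]' placeholder are assumptions about your pages, not a general-purpose base64 detector.
import re
# Matches "data:<mime>;base64,<payload>" where the payload uses the standard or
# URL-safe alphabet plus optional '=' padding; the MIME prefix is kept and only
# the payload is dropped.
DATA_URL_B64 = re.compile(r'(data:[\w.+-]+/[\w.+-]+;base64,)[A-Za-z0-9+/\-_]+={0,2}')
def strip_base64_payloads(html):
    return DATA_URL_B64.sub(r'\1[stripped]', html)
html = '<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg==">'
print(strip_base64_payloads(html))  # <img src="data:image/png;base64,[stripped]">
Note that the payload simply ends where the base64 alphabet ends (a quote, whitespace, a bracket, and so on); the ) you keep seeing most likely comes from CSS url(...) wrappers around a data URL rather than from base64 itself.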
This information may be irrelevant if you know more about the structure of the websites you are trying to parse, aside from just "they contain base64-encoded string data."
I am indexing data in Elasticsearch using the official Python library for this: elasticsearch-py. The data is taken directly from Oracle using the cx_Oracle Python library, cast into a document format and sent to Elasticsearch for indexing. For the most part this works great, but sometimes I encounter problems with characters like ö. Sometimes this character is indexed as \xc3\xb8 and sometimes as ö. This happens even within the same database entry: one field can have the ö indexed correctly while another does not.
Does anyone have an idea what might cause this?
Thanks in advance.
If your "ö" is sometimes right - and sometimes not, the data must be corrupted in your database. This is not a problem of Elasticsearch. (I had the exact same problem one month ago!)
Strings with various encodings are likely put in your database without being all converted to a single format before.
text = "ö"
asUtf=text.encode('UTF-8')
print(asUtf)
print(asUtf.decode())
Result:
b'\xc3\xb6'
ö
This problem could be solved before the insertion into Elasticsearch. Find the text sequences matching '\xXX\xXX', treat them as UTF-8 and decode them to Unicode. Try to sanitize your database and fix the way information is put into it.
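As a sketch of that pre-insertion cleanup, assuming (as \xc3\xb8-style artefacts usually indicate) that the bad values are UTF-8 byte sequences that were decoded as Latin-1 somewhere along the way:
def repair_mojibake(value):
    # If re-encoding as Latin-1 yields valid UTF-8, the string was very likely
    # UTF-8 bytes mis-decoded as Latin-1; otherwise leave it untouched.
    try:
        return value.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value
print(repair_mojibake('\xc3\xb6'))        # 'ö'
print(repair_mojibake('already fine ö'))  # unchanged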
PS: a better way to move information from a database to Elasticsearch is to use rivers, or to write a script that sends the data directly to Elasticsearch without saving it to a file first.
2016 edit: rivers are deprecated now, so you should find an alternative such as Logstash.
I'm facing a fairly simple problem: the web font we are using on our website does not have a glyph for 'a' followed by a combining umlaut (i.e. ä as two Unicode code points), but it does have the precomposed a-with-umlaut character.
As we cannot really replace the web font, and as suggested by Mozilla and the W3C, we should convert those two characters into a single one (fortunately Python already has unicodedata.normalize).
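The conversion itself is a one-liner with the standard library; the open question is only where to hook it in. A small illustration, not tied to Plone or z3c.form:
import unicodedata
def to_precomposed(value):
    # NFC collapses 'a' + U+0308 (combining diaeresis) into the single
    # precomposed code point U+00E4 ('ä') wherever a composed form exists.
    return unicodedata.normalize('NFC', value)
decomposed = 'a\u0308'  # the two-code-point form the web font cannot display
print(len(decomposed), len(to_precomposed(decomposed)))  # 2 1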
So the question is: how do I make sure that any text input field is normalized? My current approach is to use an event handler on creation and edit that goes through all text fields and normalizes them, but that feels error-prone (new content types, new fields...).
Is there something like collective.dexterityindexer (as in an adapter for which you can create custom converters for specific types of widgets) that can be made global (i.e. for all content types)?
Is that even possible with z3c.form (fortunately we are only using z3c.form-based forms)?
I apologize for asking yet another character-encoding question, since I know you folks get many every day, but I couldn't figure out my problem, so I'm asking anyway.
Here is what we are doing:
Take Data from an Oracle DB using Python and cx_Oracle.
Write the data to a file using Python.
Ingest the file into Postgres using Python and psycopg2.
Here are the important Oracle settings:
SQL> select * from NLS_DATABASE_PARAMETERS;
PARAMETER VALUE
------------------------------ ----------------------------------------
NLS_LANGUAGE AMERICAN
NLS_TERRITORY AMERICA
NLS_CURRENCY $
NLS_ISO_CURRENCY AMERICA
NLS_NUMERIC_CHARACTERS .,
NLS_CHARACTERSET US7ASCII
According to this NLS_LANG FAQ, you are meant to set NLS_LANG according to what your client OS is using.
Running locale gives us: LANG=en_US.UTF-8 (all of the other fields were also en_US.UTF-8).
So, in our Python script, we set it like this:
os.environ["NLS_LANG"] = "AMERICAN_AMERICA.AL32UTF8"
Then we import the data and write it to a file.
row = cur.fetchall()
fil.write(row[0][0]) #For this test, I am only writing one row and one field.
We ingest that file into our UTF-8 Postgres DB.
Unfortunately, for some reason, we get this symbol: � in our file and in the subsequent PG table as well. If my understanding is correct, this is the replacement character, which I believe is shown when a byte sequence cannot be interpreted as a valid character.
(In some text editors, the symbol shows up as �).
What I don't understand is why is this happening? I thought UTF-8 was backwards compatible with 7-bit ASCII?
And even if regional code pages are involved, shouldn't it still work, since the client is using US and the Oracle server is using AMERICAN?
How can I check if the data is imported correctly and if it isn't correct, how can I fix it so future imports are?
Note: The Oracle field is a CHAR field and not a NCHAR field.
Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.
Thanks for your time, I hope you have a good day.
Unfortunately, for some reason, we get this symbol: � in our file and in the subsequent PG table as well. If my understanding is correct, this is the replacement character, which I believe is shown when a byte sequence cannot be interpreted as a valid character.
Mostly right, but not quite. PostgreSQL will refuse to insert byte sequences that are not valid UTF-8 when the database encoding is UTF8 (do a search on Stack Overflow for "Invalid UTF8 postgresql"). Most likely the character you are seeing is a valid UTF-8 character that is not recognized by your font, so the replacement glyph is displayed. If the symbol in your Oracle db is actually the replacement character there, then what do you want to replace it with? In that case, the information is already missing.
What I don't understand is why is this happening? I thought UTF-8 was backwards compatible with 7-bit ASCII?
It is.
How can I check if the data is imported correctly and if it isn't correct, how can I fix it so future imports are?
Most likely your problem is upstream of the Oracle db. I would find out what is actually inserting problem data into the Oracle db and fix it there. If you can check the data in Pg against the data in Oracle, you should be able to determine if the data is character for character the same (and flag any differences). That's how to check your current import.
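A quick way to flag suspect rows on the Pg side is to search for the replacement character directly. This is a sketch with psycopg2; the connection string, table and column names are placeholders.
import psycopg2
conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection details
cur = conn.cursor()
# U+FFFD is the replacement character; if it is stored, the original bytes were
# already lost before or during the import.
cur.execute("SELECT id, my_column FROM my_table WHERE my_column LIKE %s", (u'%\ufffd%',))
for row_id, value in cur.fetchall():
    print("%s: %r" % (row_id, value))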
Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.
That's another possibility. Personally, for file transformations I prefer Perl because of its integrated regular expressions and absolutely top-rate PostgreSQL support. However, I recognize your import routine may not be readily convertible at this point. I am a little more familiar with troubleshooting UTF-8 conversion issues in Perl than in Python. I do wonder, however, if you can check the data that is coming out, in binary form, for such symbols.
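In Python that check can be done right at the fetch, before the data ever touches a file or PostgreSQL. A sketch, where cur is the cx_Oracle cursor from the question:
row = cur.fetchall()
value = row[0][0]
# repr() shows exactly what cx_Oracle handed back; if the replacement character
# (or its UTF-8 bytes \xef\xbf\xbd) already appears here, the problem is on the
# Oracle side rather than in the file write or in PostgreSQL.
print(repr(value))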
As part of my Django app, I have to get the contents of a text file that a user uploads (which could be in any charset) and save it to my DB. I keep running into issues (like having to remove UTF-8's BOM manually, having to figure out how to account for non-printable characters, having to figure out how to make all Unicode characters work and not just Latin ones, etc.), and each of these issues requires its own hack.
Is there a robust way to do this that doesn't require each of these case-by-case fixes? Right now I'm just using file.read() to get the contents, then doing all of those workarounds to clean the contents, and then using .save() to save it to the DB (I have a model for this).
What else can I be doing?
It causes some overhead, but you could base64-encode the entire contents before persisting to the db. Then no escaping is required.
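A minimal sketch of that approach with the standard library, where the literal bytes stand in for uploaded_file.read():
import base64
raw = 'héllo wörld'.encode('utf-8')             # stand-in for the uploaded file's bytes
stored = base64.b64encode(raw).decode('ascii')  # ASCII-only text, safe in any text column
restored = base64.b64decode(stored)             # the original bytes, byte for byte
assert restored == raw
Keep in mind that base64 inflates the stored size by roughly a third and the column can no longer be searched as text.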
If you want to explicitly steer away from any encoding issues and just treat the files as bunches of binary data (not strings of text in a specific encoding), you might want to use your database's binary column types.
For MySQL this is BINARY and VARBINARY: http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html
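In Django terms that usually means a BinaryField rather than a TextField, so the uploaded bytes are stored without any charset interpretation. A sketch; the model and field names are invented.
from django.db import models
class UploadedDocument(models.Model):
    # BinaryField maps to the database's binary type (bytea on PostgreSQL, a BLOB
    # type on MySQL), so no encoding or decoding happens on the way in or out.
    name = models.CharField(max_length=255)
    contents = models.BinaryField()
# In the view (illustrative): UploadedDocument(name=f.name, contents=f.read()).save()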
For a deeper understanding of Unicode and UTF-8 issues (recommended), this is a nice read on the subject:
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF