Processing delimiters with Python

I'm currently trying to parse an Apache log in a format I can't handle normally (I tried using goaccess).
In Sublime the delimiters show up as ENQ, SOH, and ETX, which to my understanding are "|", space, and a superscript L. I'm trying to use re.split to separate the individual components of the log, but I'm not sure how to deal with the superscript L.
In Sublime a line shows up as 3286d68255beaf010000543a000012f1/Madonna_Home_1.jpgENQx628a135bENQZ1e5ENQAB50632SOHA50.134.214.130SOHC98.138.19.91SOHD42857ENQwwww.newprophecy.net...
with the ENQs as '|' and SOH as ' ' when I open the file in a plain text editor (like Notepad).
I just need to parse out the IP addresses so the rest of the line is mostly irrelevant.
Currently I have
pkts = re.split(r"\s|\|", line)
But I don't know what to do for the L.

Those 3-letter codes are ASCII control codes - ASCII characters that come before 32 (the space character) in the ASCII character set. You can find a full list online.
These characters do not correspond to anything printable, so you're incorrect in assuming they correspond to those characters. You can refer to them as literals in several languages using \x00 notation - for example, the control code SOH corresponds to \x01, ETX to \x03 and ENQ to \x05 (see the reference linked above). You can use these to split strings or anything else.
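For example, a minimal sketch of splitting on those control codes directly (Python 3; "access.log" is a placeholder file name, and latin-1 is used only because it decodes any byte):
import re

with open("access.log", encoding="latin-1") as f:
    for line in f:
        # SOH = \x01, ETX = \x03, ENQ = \x05
        fields = re.split("[\x01\x03\x05]", line)
        # if only the IP addresses matter, match them directly instead
        ips = re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", line)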
This is the literal answer to your question, but all that aside, I find it quite unlikely that you actually need to split your Apache log file by control codes. At a guess, what's actually happened is that some Unicode characters have crept into your log file somehow, perhaps with UTF-8 encoding. An encoding is a way of representing characters beyond the 0-255 range of a single byte by encoding extended characters as multiple bytes.
There are several encodings, but UTF-8 is one of the most popular. UTF-8 has the property that standard ASCII characters appear as normal (so you might never even realise UTF-8 is being used), but if you view the file in an editor which isn't UTF-8 aware (or which incorrectly identifies the file as plain ASCII) then you'll see these odd control codes. These are places where the code and the character(s) before or after it should really be interpreted together as a single unit.
I'm not sure that this is the reason - it's just an educated guess - but if you haven't already considered it then it's important to figure out the encoding of your file, since it affects how you interpret the entire content. I suggest loading the file into an editor that understands encodings (something as popular as Sublime surely does, with proper configuration), forcing the encoding to UTF-8 and seeing whether that makes the content look more sensible.

Related

String change from Latin to ASCII

I have tried to change the format of strings from latin1 to ascii, and most of the strings were changed well except for some characters: æ, ø, Æ, and Ø.
I checked that the characters were converted correctly when using an R package (stringi::stri_trans_general(loc1, "latin-ascii")), but Python's unicodedata package did not work well.
Is there any way to convert them correctly in Python? I guess it may need an additional dictionary.
For information, I have applied the following function to change the format:
unicodedata.normalize('NFKD', "Latin strings...").encode('latin1', 'ignore').decode('ascii')
It's important to understand a) what encodings and decodings are; b) how text works; and c) what unicode normalization does.
Strings do not have a "format" in the sense that you describe, so talking about converting from latin1 to ascii format does not make sense. The string has representations (what it looks like when you print it out; or what the code looks like when you create it directly in your code; etc.), and it can be encoded. latin1, ascii etc. are encodings - that means, rules that explain how to store your string as a raw sequence of bytes.
So if you have a string, it is not "in latin1 format" just because the source data was in latin1 encoding - it is not in any format, because that concept doesn't apply. It's just a string.
Similarly, we cannot ask for a string "in ascii format" that we convert to. We can ask for an ascii encoding of the string - which is a sequence of bytes, and not text. (That "not" is one of the most important "not"s in all of computer science, because many people, tools and programs will lie to you about this.)
Of course, the problem here is that ascii cannot represent all possible text. There are over a million "code points" that can theoretically be used as elements of a string (this includes a lot of really weird things like emoji). The latin-1 and ascii encodings both use a single byte per code point in the string. Obviously, this means they can't represent everything. Latin-1 represents only the first 256 possible code points, and ascii represents only the first 128. So if we have data that comes from a latin-1 source, we can get a string with those characters like Æ in it, which cause a problem in our encoding step.
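For example (Python 3, a quick illustration rather than the question's own data):
'Æ'.encode('latin-1')   # b'\xc6' - code point 0xC6 fits in one byte
'Æ'.encode('ascii')     # raises UnicodeEncodeError - no ascii representation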
The 'ignore' option for .encode makes the encoder skip things that can't be handled by the encoding. So if you have the string 'barentsøya', since the ø cannot be represented in ascii, it gets skipped and you get the bytes b'barentsya' (using the unfortunately misleading way that Python displays bytes objects back to you).
When you normalize a string, you convert its code points into some standard form that's easier to work with and that treats distinct ways of writing a character - or distinct ways of writing very similar characters - the same way. There are a few different normalization schemes. NFKD chooses decomposed representations for accented characters - that is, instead of using a single symbol to represent a letter with an accent, it uses two symbols: one for the plain letter and one for the "combining" version of the accent. That might seem useful - for example, it would turn an accented A into a plain A plus an accent character, so you could encode the result as ascii, let the accent characters be ignored, and get what you want. However, it turns out that this is not enough for your characters: æ, ø, Æ and Ø are letters in their own right, not accented forms of a and o, so NFKD does not decompose them and the ascii encoder simply drops them.
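A small sketch of why NFKD alone is not enough here (Python 3, made-up sample text):
import unicodedata

s = 'café Æble ørn'
decomposed = unicodedata.normalize('NFKD', s)
# 'é' decomposes into 'e' plus a combining accent, so only the accent is dropped;
# 'Æ' and 'ø' have no decomposition, so 'ignore' throws the whole letter away.
print(decomposed.encode('ascii', 'ignore').decode('ascii'))   # -> 'cafe ble rn'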
Unfortunately, I think the best you can do is either use a third-party library (and please note that library recommendations are off-topic for Stack Overflow) or build the look-up table yourself and translate each character (have a look at the built-in string methods str.translate and str.maketrans for help with this).
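A rough sketch of that look-up-table approach (Python 3; the mapping below covers only the four characters mentioned in the question):
import unicodedata

table = str.maketrans({'æ': 'ae', 'Æ': 'AE', 'ø': 'o', 'Ø': 'O'})

def to_ascii(s):
    # translate the letters NFKD cannot decompose, then strip combining accents
    s = unicodedata.normalize('NFKD', s.translate(table))
    return s.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('Ærø og Østfold'))   # -> 'AEro og Ostfold'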

ASCII as default encoding in python instead of utf-8

I only code in English, but I have to deal with Python unicode all the time.
Sometimes it's hard to remove unicode characters from a dict.
How can I change Python's default character encoding to ASCII?
That would be the wrong thing to do - as in, very wrong. To start with, it would only give you a UnicodeDecodeError instead of removing the characters. Learn proper encoding and decoding to/from unicode so that you can filter out the values using rules like errors="ignore".
You can't just ignore characters that are part of your data just because you "dislike" them. It is text, and in an interconnected world, text is not composed of only 26 glyphs.
I'd suggest you get started by reading this document: http://www.joelonsoftware.com/articles/Unicode.html
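If, after weighing that up, filtering really is what you need, here is a minimal sketch of the decode-then-encode approach (Python 3 syntax, made-up data):
raw = 'naïve café'.encode('utf-8')       # bytes from some source
text = raw.decode('utf-8')               # decode properly first
ascii_only = text.encode('ascii', errors='ignore').decode('ascii')
print(ascii_only)                        # -> 'nave caf'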

Do UTF-8 characters cover all encodings of ISO8859-xx and windows-12xx?

I am trying to write a generic document indexer for a bunch of documents with different encodings in Python. I would like to know whether it is possible to read all of my documents (encoded with UTF-8, ISO8859-xx and windows-12xx) as UTF-8 without character loss.
The reading part is as follows:
fin = codecs.open(doc_name, "r", "utf-8")
doc_content=fin.read()
I'm going to rephrase your question slightly. I believe you are asking, "can I open a document and read it as if it were UTF-8, provided that it is actually intended to be ISO8859-xx or Windows-12xx, without loss?". This is what the Python code you've posted attempts to do.
The answer to that question is no. The Python code you posted will mangle the documents if they contain any characters above ordinal 127. This is because the "codepages" use the numbers from 128 to 255 to represent one character each, where UTF-8 uses that number range to proxy multibyte characters. So, each character in your document which is not in ASCII will be either interpreted as an invalid string or will be combined with the succeeding byte(s) to form a single UTF-8 codepoint, if you incorrectly parse the file as UTF-8.
As a concrete example, say your document is in Windows-1252. It contains the byte sequence 0xC3 0xAE, which displays as "Ã®" (A-tilde followed by a registered trademark sign). In UTF-8, that same byte sequence represents the single character "î" (small 'i' with circumflex). In Windows-874, that same sequence would be "รฎ". These are rather different strings - a moral insult could become an invitation to play chess, or vice versa. Meaning is lost.
Now, for a slightly different question - "can I losslessly convert my files from their current encoding to UTF-8?" or, "can I represent all the data from the current files as a UTF-8 bytestream?". The answer to these questions is (modulo a few fuzzy bits) yes. Unicode is designed to have a codepoint for every ideoglyph in any previously existing codepage, and by and large has succeeded in this goal. There are a few rough edges, but you will likely be well-served by using Unicode as your common interchange format (and UTF-8 is a good choice for a representation thereof).
However, to effect the conversion, you must already know and state the format in which the files exist as they are being read. Otherwise Python will incorrectly deal with non-ASCII characters and you will badly damage your text (irreparably, in fact, if you discard either the invalid-in-UTF8 sequences or the origin of a particular wrongly-converted byte range).
In the event that the text is all, 100% ASCII, you can open it as UTF-8 without a problem, as the first 128 code points (0-127) are shared between the two representations.
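For example, a sketch of converting a file whose encoding you already know, using codecs.open as in the question (the file names and the source encoding are placeholders):
import codecs

with codecs.open('input.txt', 'r', 'windows-1252') as src, \
     codecs.open('output.txt', 'w', 'utf-8') as dst:
    dst.write(src.read())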
UTF-8 covers everything in Unicode. I don't know for sure whether ISO-8859-xx and Windows-12xx are entirely covered by Unicode, but I strongly suspect they are.
I believe there are some encodings which include characters which aren't in Unicode, but I would be fairly surprised if you came across those characters. Covering the whole of Unicode is "good enough" for almost everything - that's the purpose of Unicode, after all. It's meant to cover everything we could possibly need (which is why it's grown :)
EDIT: As noted, you have to know the encoding of the file yourself, and state it - you can't just expect files to magically be read correctly. But once you do know the encoding, you could convert everything to UTF-8.
You'll need to have some way of determining which character set the document uses. You can't just open each one as "utf-8" and expect it to get magically converted. Open it with the proper character set, then convert.
The best way to be sure would be to convert a large set of documents, then convert them back and do a comparison.
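A sketch of that round-trip check (you still have to supply the source encoding from somewhere, e.g. the document's metadata):
def roundtrip_ok(raw_bytes, source_encoding):
    # source bytes -> unicode -> UTF-8 bytes -> unicode -> source bytes
    utf8_bytes = raw_bytes.decode(source_encoding).encode('utf-8')
    return utf8_bytes.decode('utf-8').encode(source_encoding) == raw_bytes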

How do I better handle encoding and decoding involving unicode characters and going back and forth from ascii

I am working on a program (Python 2.7) that reads xls files (in MHTML format). One of the problems I have is that the files contain symbols/characters that are not ascii. My initial solution was to read the files in using unicode.
Here is how I am reading in a file:
theString=unicode(open(excelFile).read(),'UTF-8','replace')
I am then using lxml to do some processing. These files have many tables, and the first step of my processing requires that I find the right table. I can find the table based on words that are in the first cell of the first row. This is where it gets tricky. I had hoped to use a regular expression to test the text_content() of the cell, but discovered that there were too many variants of the words (in a test run of 3,200 files I found 91 different ways that the concept defining just one of the tables was expressed). Therefore I decided to dump all of the text_content() values of the particular cell out and use some algorithms in Excel to strictly identify all of the variants.
The code I used to write the text_content() was
headerDict['header_' + str(column + 1)] = string.encode('Latin-1', 'replace')
I did this based on previous answers to questions similar to mine here, where the consensus seemed to be to read the file in using unicode and then encode it just before it is written out.
So I processed the labels/words in Excel - converted them all to lower case, got rid of the spaces and saved the output as a text file.
The text file has a column of all of the unique ways the table I am looking for is labeled.
I then read the file in - and the first time I did, I read it in using
labels=set([label for label in unicode(open('C:\\balsheetstrings-1.txt').read(),'UTF-8','replace').split('\n')])
I ran my program and discovered that some matches did not occur; investigating, I found that unicode replaced certain characters with \ufffd, as in the example below:
u'unauditedcondensedstatementsoffinancialcondition(usd\ufffd$)inthousands'
More research turned up that the replacement happens when unicode does not have a mapping for the character (probably not the exact explanation, but that was my interpretation).
So then I tried (after thinking, what do I have to lose?) reading in my list of labels without using unicode. I read it in using this code:
labels=set(open('C:\\balsheetstrings-1.txt').readlines())
Now, looking at the same label in the interpreter, I see
'unauditedcondensedstatementsoffinancialcondition(usd\xa0$)inthousands'
I then try to use this set of labels to match and I get this warning:
Warning (from warnings module):
File "C:\FunctionsForExcel.py", line 128
if tableHeader in testSet:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Now the frustrating thing is that the value of tableHeader is NOT in the test set. When I asked for the value of tableHeader after it broke, I received this:
'fairvaluemeasurements:'
And to add insult to injury, when I type the test into IDLE,
tableHeader in testSet
it correctly returns false
I understand that the code '\xa0' is code for a non-breaking space, and so, apparently, does Python when I read the file in without using unicode. I thought I had gotten rid of all the spaces in Excel, but to handle these I split the labels and then joined them:
labels = [''.join(label.split()) for label in labels]
I still have not gotten to a question yet - sorry, I am still trying to get my head around this. It seems to me that I am dealing with inconsistent behavior here. When I read the strings in originally, using unicode and UTF-8, all the characters were preserved/transportable, if you will. I encoded them to write them out and they displayed fine in Excel; I then saved them as a txt file and they looked okay. But something is going on and I can't seem to figure out where.
If I could avoid writing the strings out to identify the correct labels, I have a feeling my problem would go away, but there are 20,000 or more labels. I can use a regular expression to cut my potential list down significantly, but some of it just requires inspection.
As an aside, I will note that the source files all specify charset='UTF-8'.
Recap: when I read the source document and the list of labels in using unicode, I fail to make some matches because the labels have some characters replaced by \ufffd; and when I read the source document in using unicode and the list of labels in without any special handling, I get the warning.
I would like to understand what is going on so I can fix it, but I have exhausted all the places I can think to look.
You read (and write) encoded files like this:
import codecs
# read a utf8 encoded file and return the data as unicode
data = codecs.open(excelFile, 'rb', 'UTF-8').read()
The encoding you use does not matter as long as you do all the comparisons in unicode.
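For example, a sketch of keeping both sides unicode before comparing (Python 2; raw_header below is a made-up stand-in for a value pulled out with lxml):
import codecs

labels = set(codecs.open('C:\\balsheetstrings-1.txt', 'rb', 'UTF-8').read().split('\n'))
raw_header = 'fairvaluemeasurements:'   # stand-in for a header byte string from lxml
header = raw_header.decode('UTF-8')     # decode it, so unicode is compared with unicode
print(header in labels)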
I understand that the code '\xa0' is code for a non-breaking space.
In a byte string, \xA0 is a byte representing non-breaking space in a few encodings; the most likely of those would be Windows code page 1252 (Western European). But it's certainly not UTF-8, where byte \xA0 on its own is invalid.
Use .decode('cp1252') to turn that byte string into Unicode instead of 'utf-8'. In general if you want to know what encoding an HTML file is in, look for the charset parameter in the <meta http-equiv="Content-Type"> tag; it is likely to differ depending on what exported it.
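To see the difference (Python 2, using the byte string from the question):
raw = 'unauditedcondensedstatementsoffinancialcondition(usd\xa0$)inthousands'
print repr(raw.decode('cp1252'))    # works: \xa0 maps to U+00A0 NO-BREAK SPACE
raw.decode('utf-8')                 # raises UnicodeDecodeError: a lone \xa0 is not valid UTF-8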
Not exactly a solution, but something like xlrd would probably make a lot more sense than jumping through all those hoops.

converting UTF-16 special characters to UTF-8

I'm working in Django and Python and I'm having issues with saving utf-16 characters in PostgreSQL. Is there any method to convert utf-16 to utf-8 before saving?
I'm using Python 2.6. Here is my code snippet:
sample_data="This is the time of year when Travel & Leisure, TripAdvisor and other travel media trot out their “Best†lists, so I thought I might share my own list of outstanding hotels I’ve had the good fortune to visit over the years."
The above data contains some Latin special characters, but they are not showing correctly; I just want to show those special characters in the appropriate format.
There are no such things as "utf-16 characters". You should show your data by using print repr(data), and tell us which pieces of your data you are having trouble with. Show us the essence of your data e.g. the repr() of "Leisure “Best†lists I’ve had"
What you actually have is a string of bytes containing text encoded in UTF-8. Here is its repr():
'Leisure \xe2\x80\x9cBest\xe2\x80\x9d lists I\xe2\x80\x99ve had'
You'll notice 3 clumps of guff in what you showed. These correspond to the 3 clumps of \xhh in the repr.
Clump 1 (\xe2\x80\x9c) decodes to U+201C LEFT DOUBLE QUOTATION MARK.
Clump 2 is \xe2\x80\x9d. Note that only the first 2 "latin special characters" aka "guff" showed up in your display. That is because your terminal's encoding is cp1252, which doesn't map \x9d; it just ignored it. The Unicode character is U+201D RIGHT DOUBLE QUOTATION MARK.
Clump 3 (\xe2\x80\x99) decodes to U+2019 RIGHT SINGLE QUOTATION MARK (being used as an apostrophe).
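A quick way to confirm this yourself (Python 2, using the repr above):
import unicodedata

guff = 'Leisure \xe2\x80\x9cBest\xe2\x80\x9d lists I\xe2\x80\x99ve had'
text = guff.decode('utf-8')
print repr(text)                 # u'Leisure \u201cBest\u201d lists I\u2019ve had'
for ch in u'\u201c\u201d\u2019':
    print unicodedata.name(ch)   # names of the three characters in the clumps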
As you have UTF-8-encoded bytes, you should be having no trouble with PostgreSQL. If you are getting errors, show your code, the full error message and the full traceback.
If you really need to display the guff to your Windows terminal, print guff.decode('utf8').encode('cp1252') ... just be prepared for unicode characters that are not supported by cp1252.
Update in response to the comment "I dont have any issue with saving data, problem is while displaying it is showing weired characters, so what iam thinking is convert those data before saving am i right?":
Make up your mind. (1) In your question you say "I'm having issues with saving utf-16 characters in PostgreSQL". (2) Now you say "I dont have any issue with saving data,problem is while displaying it is showing weired characters"
Summary: Your sample data is encoded in UTF-8. If UTF-8 is not acceptable to PostgreSQL, decode it to Unicode. If you are having display problems, first try displaying the corresponding Unicode; if that doesn't work, try an encoding that your terminal will support (presumably one of the cp125X family).
This works for me to convert strings: sample_data.decode('mbcs').encode('utf-8')
