I downloaded the page source (HTML) of websites with Selenium (Python), and I wish to find all base64-encoded strings in the HTML files.
Is there a known structure to all base64-encoded strings in HTML? From my observations, it seems like they start with ;base64, followed by the encoded string, and finally a closing parenthesis ). Is that accurate?
From Wikipedia, the encoded string must also be composed only of the following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/. Can someone confirm that as well?
Thanks a lot in advance!
* Edit 1 *
Thanks a lot Tris! The link you provided is very helpful! However, from that, it seems like there is no specific format for the end of a base64 string. If I want to detect its end, what advice would you give other than looking for )?
I mainly want to track the changes of a bunch of websites, and the base64 payloads contain a lot of data that is not relevant for my use. To save storage, I therefore intend to remove them. An example is www.amd.com, which contains data:image/png;base64,... (after being rendered by the browser).
Since there are many different websites, I don't know all of their formats. Here are some other examples of the base64 strings that I found and that are not useful to me:
data:font/truetype;base64,AAEAAA...
data:image/png;base64,iVBORw0KG...
Several of the examples that I saw ended with a closing parenthesis ). May I ask under what scenario they would end with ) and when they would not?
Thanks again!
Not all base64-encoded strings will include a ;base64 at the beginning -- this is typically specific to data URLs. If you are specifically looking for base64-encoded images or other inline elements that would otherwise be referred to with an HTTP URL, this might be fine. The closing bracket is not typically relevant; I haven't seen it required on data URLs or other base64-encoded strings.
Typically, base64-encoded strings use the alphabet you've mentioned -- ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/. If the length of the input data is not a multiple of 3 bytes, the encoded output is padded with an appropriate number of = characters at the end.
There is another commonly used base64 format on the web -- the URL-safe base64 format. In this encoding, + and / are typically replaced with - and _ so they can be included in URLs safely, hence the name.
This information may be irrelevant if you know more about the structure of the websites you are trying to parse, aside from just "they contain base64-encoded string data."
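As a rough sketch (my own assumption, not a spec), you could look for data: URLs with base64 payloads using a regular expression; the payload simply ends at the first character outside the base64 alphabet, which in practice is often a quote, whitespace, or the ) closing a CSS url(...) -- which would explain the closing parenthesis you keep seeing. The name DATA_URL_B64 is made up for illustration:
import re
# Matches data:<type>/<subtype>;base64,<payload>, where the payload uses the
# standard or URL-safe alphabet plus optional '=' padding.
DATA_URL_B64 = re.compile(r'data:[\w.+-]+/[\w.+-]+;base64,([A-Za-z0-9+/_-]+={0,2})')
html = '<img src="data:image/png;base64,iVBORw0KGgo=">'
for match in DATA_URL_B64.finditer(html):
    print(match.group(1))
# For the storage-saving use case: drop the payloads entirely.
stripped = DATA_URL_B64.sub('data:removed', html)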
Related
I am cleaning the textual data that I scraped from multiple URLs. How can I remove non-English words/symbols from the data in a CSV file?
I saved the data and read it back using the following code:
To save the data as a CSV file:
df.to_csv("blogdata.csv", encoding = "utf-8")
After saving the data, the CSV file shows as follows, including non-English words and symbols (e.g., '\n\t\t\t', m’, etc.):
The symbols did not show up in the original data, and some of them even appear in data that is in English. Take 'Ross Parker' in the 7th row as an example.
The data saved in the csv file says: ['\n\t\t\t', 'It’s about time I wrote an update on what we’ve been up to over the past few months. We’re about to...
Whereas in the original data scraped from the URL, it shows as follows:
Can anybody explain why this happens and help me solve this issue and clean the non-English data from the file?
Thank you so much in advance!
It looks like pilot error: the data is correct but you are looking at it in a tool which is configured or hard-coded to display the text as Latin-1 (or Windows code page 1252?) even though you saved it as UTF-8.
Some tools - especially on Windows - will do whimsical things with UTF-8 which doesn't have a BOM. Maybe try adding one (and maybe file a bug report if this really helps; the tool should at the very least let you override its default encoding without modifying the input data).
In other words, if the screenshot with the broken data is from Excel, you probably selected a DOS code page (or the horribly mislabelled "ANSI") instead of UTF-8 when it asked how to import this CSV file. Perhaps the best fix would be to devise a workflow which does not involve a spreadsheet.
Or perhaps you used a tool which didn't ask you anything, and tried to "sniff" the data to determine its encoding, and it guessed wrong. Adding an invisible byte sequence called a BOM which is unique to UTF-8 should hopefully allow it to guess right; but this is buggy behavior - you should not be a hostage to its clearly imperfect heuristics. (See also "Bush hid the facts" for a related story.)
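If you do want to try the BOM route, a minimal sketch (assuming the pandas save step shown above) is to write with the "utf-8-sig" codec, which prepends a UTF-8 BOM that nudges Excel and similar tools into decoding the file as UTF-8; the sample row here is made up:
import pandas as pd
# "utf-8-sig" = UTF-8 with a leading BOM
df = pd.DataFrame({"text": ["It’s about time I wrote an update"]})
df.to_csv("blogdata.csv", index=False, encoding="utf-8-sig")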
I am indexing data on Elasticsearch using the official Python library for this: elasticsearch-py. The data is taken directly from Oracle using the cx_oracle Python library, cast into a document format, and sent for indexing to Elasticsearch. For the most part this works great, but sometimes I encounter problems with characters like ö. Sometimes this character is indexed as \xc3\xb8 and sometimes as ö. This happens even within the same database entry: one variable can have the ö indexed correctly while for another variable this is not the case.
Does anyone have an idea what might cause this?
Thanks in advance
If your "ö" is sometimes right - and sometimes not, the data must be corrupted in your database. This is not a problem of Elasticsearch. (I had the exact same problem one month ago!)
Strings with various encodings are likely put in your database without being all converted to a single format before.
text = "ö"
asUtf=text.encode('UTF-8')
print(asUtf)
print(asUtf.decode())
Result:
b'\xc3\xb6'
ö
This problem could be solved before the insertion into Elasticsearch: find the text sequences matching '\xXX\xXX', treat them as UTF-8 and decode them to Unicode. Try to sanitize your database and fix the way you put information into it.
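A minimal sketch of that repair, assuming the stray bytes are really UTF-8 that either was never decoded or was misread as Latin-1 somewhere along the way:
raw = b"\xc3\xb6"
print(raw.decode("utf-8"))                        # -> ö
mangled = "Ã¶"                                    # UTF-8 bytes misread as Latin-1
print(mangled.encode("latin-1").decode("utf-8"))  # -> ö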
PS: a better practice for moving information from a database to Elasticsearch is to use rivers, or to make a script that sends the data directly to Elasticsearch without saving it to a file first.
2016 edit: the rivers are deprecated now, so you should find an alternative like logstash.
After crawling many websites, I receive broken-encoding data from some of them. I can't do anything with such data; I just need to detect it. For example, text like:
·ç¼wÃdª«¦Ê³f
or
ãà³n³¾å¢
How can I recognize text like that? The text can be in any language, so searching for non-English text is not an option. The only option I can think of is the guess-language module.
There's NLTK, which has a guess_encoding function that takes a byte string and tries all of the available encodings; would this serve your purpose?
Take a look at https://github.com/LuminosoInsight/python-ftfy
If I understand correctly, it will attempt to 'repair' incorrectly encoded/decoded text.
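A rough sketch of how that could be used for detection (the sample string is the one from the question): ftfy.fix_text repairs common mojibake and leaves already-correct text alone, so a string that changes after fixing is a reasonable signal that its encoding was broken. Note that ftfy mainly targets UTF-8/Latin-1 style mix-ups, so it may not catch every kind of mangling.
import ftfy   # pip install ftfy
sample = "·ç¼wÃdª«¦Ê³f"
repaired = ftfy.fix_text(sample)
if repaired != sample:
    print("looks like broken encoding:", repr(repaired))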
As part of my Django app, I have to get the contents of a text file which a user uploads (which could be in any charset) and save it to my DB. I keep running into issues (like having to remove UTF-8's BOM manually, or having to figure out how to account for non-printable characters, or having to figure out how to make all Unicode characters work - not just Latin ones, etc.), and each of these issues requires its own hack.
Is there a robust way to do this that doesn't require each of these case-by-case fixes? Right now I'm just using file.read() to get the contents, then doing all of those workarounds to clean the contents, and then using .save() to save it to the DB (I have a model for this).
What else can I be doing?
It causes some overhead, but you could base64-encode the entire contents before persisting them to the DB. Then no escaping is required.
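A minimal sketch of that approach (the raw bytes here are a made-up stand-in for uploaded_file.read()):
import base64
# Base64 output is plain ASCII, so it survives any text column; storage grows
# by roughly a third.
raw = b"\xef\xbb\xbfcaf\xc3\xa9 \x00non-printable bytes included"
stored = base64.b64encode(raw).decode("ascii")
assert base64.b64decode(stored) == raw   # round-trips exactly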
If you want to explicitly steer away from any issues with encoding and just see files as bunches of binary data (not strings of text in a specific encoding) you might want to use your database's binary format.
For MySQL this is BINARY and VARBINARY: http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html
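In Django terms, a hedged sketch of the binary-column idea: recent Django versions provide BinaryField, which maps to BLOB/VARBINARY-style columns on MySQL; the model and field names below are made up for illustration.
from django.db import models

class UploadedDocument(models.Model):
    name = models.CharField(max_length=255)
    contents = models.BinaryField()   # raw bytes, stored without any decoding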
For a deeper understanding of unicode & utf-8 issues (recommended) this is a nice read on the subject:
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
When you open an MS Word document (or, for that matter, most Windows file formats) as plain text, you will see gibberish as given below, broken intermittently by the actual text. I need to extract the text that goes in between and ignore the gibberish, which looks something like the sample below. How do I extract only the text that matters and ignore the rest? Please advise.
Here's a sample of open("sample.doc", "r").read() of a Word doc. Thanks
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 [... long run of \x00 bytes ...] In an Interesting news,his is the first time we polled Indian channel community for their preferred memory supplier. Transcend came a close second, was seen to be more popular among class A city based resellers, was also the most recalled memory brand among customers according to resellers. However Transcend channels complained of parallel imports and constant unavailability of the products in grey [... long run of \x00 bytes ...]
The tool that seems the most viable, particularly if you need an all-Python solution, is OleFileIO.
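A rough sketch with the olefile package (the current name of OleFileIO_PL on PyPI). Note that this only gets you at the raw OLE streams; pulling clean text out of the WordDocument stream still means parsing the binary Word format:
import olefile   # pip install olefile

ole = olefile.OleFileIO("sample.doc")
print(ole.listdir())                          # e.g. [['WordDocument'], ['1Table'], ...]
data = ole.openstream("WordDocument").read()  # raw bytes of the main stream
ole.close()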
.doc is a binary format; it's not a markup language or anything like that.
Specs: http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
There is no generic way to extract information from every file format. You need to know the format to know how to extract the information.
Just wanted to state that first. So what you should look for is libraries and software that can convert/extract the information you want. And as mentioned by Ofir, Microsoft has tools for that for their formats.
But if you cannot do this and want to take the chance that there is readable text in the file, you could do a normal read and look for sequences of bytes that form text. Then comes the question: what languages/charsets should I support in my hunt for text? Is it multi-byte text?
An easy start is to loop through the data and look for sequences matching [a-zA-Z0-9_- ] to find the text. But Word probably stores text as multi-byte characters, so you should scan double-byte sequences as one character, as in the sketch below.
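A rough sketch of that scan (runs of at least 8 characters, checking both plain single-byte ASCII and the little-endian UTF-16 "double byte" form that classic Word files often use; the thresholds are arbitrary):
import re

ASCII_RUN = re.compile(rb"[A-Za-z0-9_\- ]{8,}")           # plain single-byte runs
UTF16_RUN = re.compile(rb"(?:[A-Za-z0-9_\- ]\x00){8,}")   # same characters as UTF-16LE

data = open("sample.doc", "rb").read()
for match in ASCII_RUN.finditer(data):
    print(match.group().decode("ascii"))
for match in UTF16_RUN.finditer(data):
    print(match.group().decode("utf-16-le"))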
Note: some of the newer formats like OpenOffice and .docx are multiple files in a compressed container, so you need to decompress the file first and then scan the XML documents for the text you are looking for.
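For the .docx case, a minimal sketch (the body text lives in word/document.xml inside the ZIP container; the tag stripping here is deliberately crude):
import re
import zipfile

with zipfile.ZipFile("sample.docx") as z:
    xml = z.read("word/document.xml").decode("utf-8")
text = re.sub(r"<[^>]+>", " ", xml)   # strip XML tags, keep the character data
print(" ".join(text.split())[:200])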
A Word .docx file is a compressed format. You need to uncompress it first to get the real data (try opening a .docx file in a program like WinRAR, and you'll see it contains multiple files).
The contents are even XML, so reading the format should not be that hard, although I'm not sure if you get all the data this way. (Note that the older binary .doc format, as in the question, is not compressed like this.)
I had a similar problem, needing to query hundreds of Word documents. I converted the Word files to text files and used normal text parsing tools. Worked well.
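One way to script that kind of conversion, assuming the antiword command-line tool is installed (it prints the plain text of a .doc file; any similar converter would do):
import subprocess

text = subprocess.run(
    ["antiword", "sample.doc"],
    capture_output=True, text=True, check=True,
).stdout
print(text[:200])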