I'm trying to scrape a website, but it gives me an error.
I'm using the following code:
import urllib.request
from bs4 import BeautifulSoup
get = urllib.request.urlopen("https://www.website.com/")
html = get.read()
soup = BeautifulSoup(html)
print(soup)
And I'm getting the following error:
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>
What can I do to fix this?
I was getting the same UnicodeEncodeError when saving scraped web content to a file. To fix it I replaced this code:
with open(fname, "w") as f:
f.write(html)
with this:
with open(fname, "w", encoding="utf-8") as f:
f.write(html)
If you need to support Python 2, then use this:
import io
with io.open(fname, "w", encoding="utf-8") as f:
    f.write(html)
If you want to use an encoding other than UTF-8, specify your actual encoding as the encoding parameter.
I fixed it by adding .encode("utf-8") to soup.
That means that print(soup) becomes print(soup.encode("utf-8")).
On Python 3.7 running on Windows 10, this worked (I am not sure whether it will work on other platforms and/or other Python versions).
Replacing this line:
with open('filename', 'w') as f:
with this:
with open('filename', 'w', encoding='utf-8') as f:
The reason this works is that the file is now written using UTF-8, which can represent every character in the document, instead of raising an error whenever a character is encountered that is not supported by the platform's default encoding.
set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8
You may or may not need to set that second environment variable PYTHONLEGACYWINDOWSSTDIO.
Alternatively, this can be done in code (although it seems that doing it through env vars is recommended):
import sys

sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')
Additionally: Reproducing this error was a bit of a pain, so leaving this here too in case you need to reproduce it on your machine:
set PYTHONIOENCODING=windows-1252
set PYTHONLEGACYWINDOWSSTDIO=windows-1252
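To see which encodings your interpreter is actually using (before or after setting these variables), a quick check from within Python using only the standard library:

import sys
import locale

print(sys.stdout.encoding)            # console encoding, e.g. 'utf-8' or 'cp1252'
print(locale.getpreferredencoding())  # what open() defaults to when no encoding= is given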
The same error was thrown by Python 3.7 on Windows 10 while saving the response of a GET request. The response received from the URL was UTF-8 encoded, so it is always recommended to check the encoding of the response so that the same encoding can be passed when writing the file; such trivial issues kill a lot of time in production.
import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open('NiftyList.txt', 'w') as f:
    f.write(resp.text)
When I added encoding="utf-8" to the open() call, it saved the file with the correct content:
with open('NiftyList.txt', 'w', encoding="utf-8") as f:
    f.write(resp.text)
I faced the same encoding issue; it occurs when you try to print the data, read/write it, or open it. As others mentioned above, adding .encode("utf-8") will help if you are trying to print it:
soup.encode("utf-8")
If you are trying to open the scraped data and perhaps write it into a file, then open the file with (......, encoding="utf-8"):
with open(filename_csv, 'w', newline='', encoding="utf-8") as csv_file:
For those still getting this error, adding .encode("utf-8") to soup will also fix it.
soup = BeautifulSoup(html_doc, 'html.parser').encode("utf-8")
print(soup)
There are multiple aspects to this problem. The fundamental question is which character set you want to output into. You may also have to figure out the input character set.
Printing (with either print or write) into a file with an explicit encoding="..." will translate Python's internal Unicode representation into that encoding. If the output contains characters which are not supported by that encoding, you will get a UnicodeEncodeError. For example, you can't write Russian or Chinese or Indic or Hebrew or Arabic or emoji or ... anything except a restricted set of some 200+ Western characters to a file whose encoding is "cp1252" because this limited 8-bit character set has no way to represent these characters.
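For illustration, a minimal reproduction of the error; 'é' and 'ö' exist in cp1252, while the emoji does not:

>>> "Héllö".encode("cp1252")       # fine: every character exists in cp1252
b'H\xe9ll\xf6'
>>> "Héllö 😊".encode("cp1252")    # raises UnicodeEncodeError: 'charmap' codec can't encode character ...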
Basically the same problem will occur with any 8-bit character set, including nearly all the legacy Windows code pages (437, 850, 1250, 1251, etc etc), though some of them support some additional script in addition to or instead of English (1251 supports Cyrillic, for example, so you can write Russian, Ukrainian, Serbian, Bulgarian, etc). An 8-bit encoding has only a maximum of 256 character codes and no way to represent a character which isn't among them.
Perhaps now would be a good time to read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
On platforms where the terminal is not capable of printing Unicode (only Windows these days really, though if you're into retrocomputing, this problem was also prevalent on other platforms in the previous millennium) attempting to print Unicode strings can also produce this error, or output mojibake. If you see something like HÃ©llÃ¶ instead of Héllö, this is your issue.
In short, then, you need to know:
What is the character set of the page you scraped, or the data you received? Was it correctly scraped? Did the originator correctly identify its encoding, or are you able to otherwise obtain this information (or guess it)? Some web sites incorrectly declare a different character set than the page actually contains, some sites have incorrectly configured the connection between the web server and a back-end database. See e.g. scrape with correct character encoding (python requests + beautifulsoup) for a more detailed example with some solutions.
What is the character set you want to write? If printing to the screen, is your terminal correctly configured, and is your Python interpreter configured identically?
Perhaps see also How to display utf-8 in windows console
If you are here, probably the answer to one of these questions is not "UTF-8". This is increasingly becoming the prevalent encoding for web pages, too, though the former standard was ISO-8859-1 (aka Latin-1) and more recently Windows code page 1252.
Going forward, you basically want all your textual data to be Unicode, outside of a few fringe use cases. Generally, that means UTF-8, though on Windows (or if you need Java compatibility), UTF-16 is also vaguely viable, albeit somewhat cumbersome. (There are several other Unicode serialization formats, which may be useful in specialized circumstances. UTF-32 is technically trivial, but takes up a lot more memory; UTF-7 is used in a few network protocols where 7-bit ASCII is required for transport.)
Perhaps see also https://utf8everywhere.org/
Naturally, if you are printing to a file, you also need to examine that file using a tool which can correctly display it. A common pilot error is to open the file using a tool which only displays the currently selected system encoding, or one which tries to guess the encoding, but guesses wrong. Again, a common symptom when viewing UTF-8 text using Windows code page 1252 would be, for example, Héllö displaying as HÃ©llÃ¶.
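You can reproduce this symptom in a couple of lines; encoding to UTF-8 and then mistakenly decoding the bytes as cp1252 produces exactly this kind of garbage:

text = "Héllö"
mojibake = text.encode("utf-8").decode("cp1252")  # UTF-8 bytes misread as cp1252
print(mojibake)  # HÃ©llÃ¶
# The damage is reversible as long as no information has been lost:
print(mojibake.encode("cp1252").decode("utf-8"))  # Héllö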
If the encoding of character data is unknown, there is no simple way to automatically establish it. If you know what the text is supposed to represent, you can perhaps infer it, but this is typically a manual process with some guesswork involved. (Automatic tools like chardet and ftfy can help, but they get it wrong some of the time, too.)
To establish which encoding you are looking at, it can be helpful if you can identify the individual bytes in a character which isn't displayed correctly. For example, if you are looking at H\x8ell\x9a but expect it to represent Héllö, you can look up the bytes in a translation table. I have published one such table at https://tripleee.github.io/8bit where you can see that in this example, it's probably one of the legacy Mac 8-bit character sets; with more data points, perhaps you can narrow it down to just one of them (and if not, any one of them will do in practice, since all the code points you care about map to the same Unicode characters).
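As a small sketch of that lookup process, you can simply try decoding the problem bytes with a handful of candidate legacy encodings and see which one produces the text you expect:

sample = b"H\x8ell\x9a"  # the garbled bytes from the example above
for codec in ("cp1252", "latin-1", "mac_roman", "cp437", "cp850"):
    print(codec, "->", sample.decode(codec, errors="replace"))
# mac_roman -> Héllö, matching the expected text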
Python 3 on most platforms defaults to UTF-8 for all input and output, but on Windows, this is commonly not the case. It will then instead default to the system's default encoding (still misleadingly called "ANSI code page" in some Microsoft documentation), which depends on a number of factors. On Western systems, the default encoding out of the box is commonly Windows code page 1252.
(Earlier Python versions had somewhat different expectations, and in Python 2, the internal string representation was not Unicode.)
If you are on Windows and write UTF-8 to a text file, maybe specify encoding="utf-8-sig" which adds a BOM sequence at the beginning of the file. This is strictly speaking not necessary or correct, but some Windows tools need it to correctly identify the encoding.
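For example (the file name here is arbitrary):

with open("out.txt", "w", encoding="utf-8-sig") as f:
    f.write("Héllö")
with open("out.txt", "rb") as f:
    print(f.read()[:3])  # b'\xef\xbb\xbf' -- the UTF-8 BOM
# Reading back with encoding="utf-8-sig" strips the BOM again:
with open("out.txt", encoding="utf-8-sig") as f:
    print(f.read())  # Héllö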
Several of the earlier answers here suggest blindly applying some encoding, but hopefully this should help you understand how that's not generally the correct approach, and how to figure out - rather than guess - which encoding to use.
From Python 3.7 onwards, set the environment variable PYTHONUTF8 to 1.
The following script also includes other useful variables which set system environment variables.
setx /m PYTHONUTF8 1
:: In CMD, a Python file can then be executed without its extension.
setx PATHEXT "%PATHEXT%;.PY"
:: Set the default Python version for the py launcher.
setx /m PY_PYTHON 3.10
Source
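After setting the variable (in a new console session), you can verify that UTF-8 mode is active from within Python:

import sys

print(sys.flags.utf8_mode)  # 1 when UTF-8 mode is enabled (Python 3.7+)
print(sys.stdout.encoding)  # should report 'utf-8'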
I got the same error, so I used encoding="utf-8" and it solved the error. This generally happens when the text data contains a symbol or pattern that the encoder does not understand.
with open("text.txt", "w", encoding='utf-8') as f:
f.write(data)
This will solve your problem.
If you are using Windows, try passing encoding='latin1', encoding='iso-8859-1', or encoding='cp1252'.
example:
csv_data = pd.read_csv(csvpath, encoding='iso-8859-1')
print(soup.encode('iso-8859-1'))
I'm new to StackOverflow so I'm really excited! 😊
I'm having the following issue with my code: I am trying to store a 'png' image into an FTP server (a screenshot of a website).
I'm using ftplib and selenium (with webdriver):
driver.get(<someURL>)
screenshot = driver.save_screenshot(driver.title + '.png')
ftp.storbinary("STOR <PathToServer>" + driver.title + '.png', open(driver.title + '.png', 'rb'))
This method works when the website is written with Latin characters, the problem is that the image can be the screenshot of a website located in Thailand, China, or Egypt, for instance. In that case, the line with:
open(driver.title + '.png', 'rb')
returns the infamous error:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u1ec7' in position 60: ordinal not in range(256)
I understand that storbinary accepts only binary data (as the name of that method implies). However, what I don't understand is how I can "encode" the png so that it will not lead to that error, and so that it can be stored successfully on the FTP server.
Thank you so very much! Any help, or comment, or insight, would be deeply appreciated.
Best!
The fact is that text as represented in a running computer program and text as stored in files, filenames, and databases are two different things. One can think of the former as a set of characters, without minding how they are represented internally. On the other hand, to store this text in a filesystem or a DB, or to transmit it over a network, the text has to be represented as bytes. This process of transforming the "pure" text you have when the program is running into a byte representation is called "encoding". For a better understanding, I suggest reading this article.
Python 3, both the core language and the libraries, try to automatically select the proper text encoding when doing any I/O with text. In your case, it picked the "latin1" codec for filenames in the target server.
Latin1 is limited to a little over 200 valid characters and can't represent a lot of characters or glyphs - any non-Western language character, and even some Western ones, such as Ĺ, Ṕ, ŵ, can't be represented with it.
The suggestion is to perform a manual encoding of the name before letting Python do so, because then we can control how to handle characters that don't exist in the target encoding. Since the library method (.storbinary) seems to expect the filename as a string, we then "decode" the name back, keeping the replacements for invalid characters we got when first encoding, and pass the result of this round trip to the library.
So, to keep the information about your characters that do not exist in latin1, I'd suggest using an escape encoding - other options would be to replace them with a "?" or to ignore them, simply suppressing all such characters:
filename = driver.title.encode("latin1", errors="xmlcharrefreplace").decode("latin1") + ".png"
screenshot = driver.save_screenshot(filename)
ftp.storbinary("STOR <PathToServer>" + filename, open(filename, 'rb'))
I executed the following code in Jupyter, trying to deal with a binary file called bfile. However, when I opened the file in Jupyter, I got the error message shown below. And when I opened the file using Notepad, I got a bunch of messy characters. Can anyone help (please use plain language; I'm not familiar with encoding issues like UTF and Unicode in my daily work)?
bdata = bytes(range(0, 256))
with open('bfile', 'wb') as fin:
    fin.write(bdata)
with open('bfile', 'rb') as fin:
    fin.seek(-1, 2)
    fin.tell()
Error message
Error! Y:\Desktop\bfile is not UTF-8 encoded
Saving disabled.
See Console for more details.
(A screenshot of the messy characters shown when opening the file in Notepad is omitted here.)
You appear to be trying to open, in Jupyter, the binary file you just wrote (containing just the binary values 0 to 255 inclusive). That's not how Jupyter works: what is it supposed to do with that (arbitrary) binary file?
Jupyter can open e.g. notebook files. It will likely also open plain text files, or even image files, because it reads them as text or passes the handling of the file (images) to the file browser.
Jupyter has no idea what to do with your binary file: it is not a known specification.
If, instead, you want to use the binary data inside Python, then read the file within Python. In fact, your code sample already shows that you're doing that, so I'm not sure why you're also trying to open your binary file inside Jupyter or Notepad. (Note that Notepad tries the best it can with your binary file: it assumes some encoding, probably not ASCII but some Windows ASCII-like encoding, and translates the byte values 0 to 255 to characters according to that encoding, as far as it can. Hence you see ranges of known characters; it won't show you the integer values 0 to 255.)
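If the goal is just to look at the raw bytes, do that in Python rather than in an editor; for the file written above:

with open('bfile', 'rb') as fin:
    data = fin.read()
print(len(data))  # 256
print(data[:8])   # b'\x00\x01\x02\x03\x04\x05\x06\x07'
print(data[-1])   # 255 -- indexing a bytes object yields ints in Python 3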
tl;dr - While trying to reverse engineer a proprietary database file, I found that Wordpad was able to automagically decode some of the data into a legible format. I'm trying to implement that decoding in python. Now, even the Wordpad voodoo is not repeatable.
Ready for a brain teaser?
I'm trying to crack a bit of a strange problem. I have a data file, it is the database of a program for a scientific instrument (Mettler DSC / STARe software), and I'm trying to grab sample information from experiments. From my digging around in the file, it appears to consist of plaintext, unencrypted information about the experiments run, along with data. It's a .t00 file, over 40 mb in size (it stores essentially all the data of the runs), and I know very little about the encoding (other than it's seemingly arbitrary. It's not meant to be a text file). I can open this file in Wordpad and can see the information I'm looking for (sample names, timestamp, experiment parameters), surrounded by experimental run data (as expected, this looks like lots of gobbledygook, e.g. ¶+ú#”‹ø#ðßö#¨...). It seems like I basically got lucky with it being able to make some sense of the contents, and I'm trying to replicate that.
I can read the file into python with a basic file handler and use regex to get some of the pieces of info I want. 'r' vs 'rb' doesn't seem to help.
def textOpenLines(filename, mode='rb'):
    with open(filename, mode) as content_file:
        return [line for line in content_file]
I'm able to take that list and search it for relevant strings and get the sample name from it. BUT from looking at the file in Wordpad, I found that the sample name is listed twice, the second time it has the datestamp following it (e.g. 'Dibenzoylperoxid 120 C 03.05.1994 14:24:30'). In python, I can't find this string. I can't find even the timestamp by itself. When I look at the line where it is supposed to occur, I get a bunch of random bytes. Opening in Notepad looks like the python output.
I suspect it's an encoding issue. I've tried reading the file in as Unicode, I've tried taking snippets of lines and reading those in, but I can't crack it. I'm stumped.
Any thoughts on how to read this in so that it decodes right? Wordpad got it right (though now subsequently trying to open it, it looks like the Notepad output).
Thanks!!
Edit:
I don't know who changed the title, but of course it 'looks like random bytes in Python/Notepad'. It's mostly data.
It's not meant to be a text file. I sorta got lucky with the Wordpad opening
It's not corrupted. The DSC instrument program reads it just fine. It's just proprietary so I have no idea how it ticks.
I've tried using 'r', 'rb', and 'U' flags.
I've tried codecs.open using utf8, 16 and 32, but it gives UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 49: invalid continuation byte. I don't think it has a BOM, because I don't think it's meant to be human readable.
First 32 bytes (f.read(32)) reads
'\x10 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\x10\x00\x00'
I don't know much about BOMs, but from reading the Wiki page, that doesn't look like any of the valid UTF markings.
The start of the file, when first automagically decoded in Wordpad, looks like this:
121 22Dibenzoylperoxid 120 C 03.05.1994 14:24:30 1 0 4096 ESTimeAI–#£®#nôÂ#49Õ#kÉå#FÞò#`sþ#N5A2A®"A"A—¥A¿ÝA¡zA"ÓAÿãAÐÅAäHA‚œAÑÌAŸäA¤ÆAE–AFNATöAÐ|AõAº^A(ÄAèAýqA¹AÖûAº8A¬uAK«AgÜAüAÞAo4A>N
AfAB
The start of the file, when opened in Notepad, Python, and now Wordpad, looks like this:
(empty bytes x00...)](x00...)eß(x00...)NvN(x00)... etc
Your file is not composed of ASCII characters, but it is being interpreted as such by applications that open it. The same thing would happen if you opened a .jpg image in Wordpad - you would get a bunch of binary data along with some ASCII characters that are printable and recognizable to the human eye.
This is the reason why you can't do a plain-text search for your timestamp, for example.
Here is an example in code to demonstrate the issue. In your binary file you have the following bytes:
\x44\x69\x62\x65\x6e\x7a\x6f\x79\x6c\x70\x65\x72\x6f\x78\x69\x64\x20\x31\x32\x30\x20\x43\x20\x30\x33\x2e\x30\x35\x2e\x31\x39\x39\x34\x20\x31\x34\x3a\x32\x34\x3a\x33\x30
If you were to open this inside of a text editor like Wordpad, it would render the following:
Dibenzoylperoxid 120 C 03.05.1994 14:24:30
Here is a code snippet in Python:
>>> c = '\x44\x69\x62\x65\x6e\x7a\x6f\x79\x6c\x70\x65\x72\x6f\x78\x69\x64\x20\x31\x32\x30\x20\x43\x20\x30\x33\x2e\x30\x35\x2e\x31\x39\x39\x34\x20\x31\x34\x3a\x32\x34\x3a\x33\x30'
>>> print c
Dibenzoylperoxid 120 C 03.05.1994 14:24:30
These bytes are shown here in hexadecimal escape notation, which is why you can't search for them with a plain-text query.
The reason for this is because the binary file is following a very particular structure (protocol, specification) so that the program that reads it can parse it correctly. If you take a jpeg image as an example you will find that the first bytes and the last bytes of the image are always the same (depending on the format used) - FF D8 will be the first two bytes of a jpeg and FF D9 will be the last two bytes of a jpeg to identify it as such. An image editing program will now know to start parsing this binary data as a jpeg and it will "walk" the structures inside the file to render the image. Here is a link to a resource that helps you identify files based on "signatures" or "headers" - the first two bytes of your file 10 00 do not show up in that database so you are likely dealing with a proprietary format so you won't be able to find the specs online very easily. This is where reverse engineering comes in handy.
I would recommend you open your file up in a hexeditor - it will give you both the hexadecimal output as well as the ascii output so that you can start to analyze the file format. I personally use the Hackman Hexeditor found here (it's free and has a lot of features).
But for now - to give you something useful for searching the file for the data you are interested in - here is a quick method to convert your search queries to binary before you start the search.
import struct

#binary_data = open("your_binary_file.bin","rb").read()
#your binary data would show up as a big string like this one when you .read()
binary_data = '\x44\x69\x62\x65\x6e\x7a\x6f\x79\x6c\x70\x65\x72\x6f\x78\x69\x64\x20\x31\x32\x30\x20\x43\x20\x30\x33\x2e\x30\x35\x2e\x31\x39\x39\x34\x20\x31\x34\x3a\x32\x34\x3a\x33\x30'

def search(text):
    #convert the text to binary first
    s = ""
    for c in text:
        s += struct.pack("b", ord(c))
    results = binary_data.find(s)
    if results == -1:
        print "no results found"
    else:
        print "the string [%s] is found at position %s in the binary data" % (text, results)

search("Dibenzoylperoxid")
search("03.05.1994")
The results of the above script are:
the string [Dibenzoylperoxid] is found at position 0 in the binary data
the string [03.05.1994] is found at position 25 in the binary data
This should get you started.
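(Note that the snippet above is Python 2. On Python 3, where read() in 'rb' mode returns bytes, the same idea is simpler because you can search the bytes directly with a bytes needle; a sketch under the same assumptions, reusing the placeholder file name from the commented-out line:)

with open("your_binary_file.bin", "rb") as f:
    binary_data = f.read()

def search(text):
    needle = text.encode("ascii")  # bytes needle for a bytes haystack
    pos = binary_data.find(needle)
    if pos == -1:
        print("no results found")
    else:
        print("the string [%s] is found at position %s in the binary data" % (text, pos))

search("Dibenzoylperoxid")
search("03.05.1994")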
It's FutureMe.
You probably got lucky with the Wordpad thing. I don't know for sure, because that data is long gone, but I am guessing Wordpad made a valiant effort to try to decode the file as UTF-8 (or maybe UTF-16 or CP1252). The reason this seemed to work was that in most binary protocols, strings are probably encoded as UTF-8, so for the ASCII character set, they will look the same in the dump. However, everything else is going to be binary encoded.
You had the right idea with open(fn, 'rb') but you should have just read the whole blob in, rather than readlines, which tries to split on \n. Since the db file isn't \n delimited, that just won't work.
What would have been a better approach is a histogram of the bytes, and trying to infer what the field/row separators are, if they even exist. Look for TLV (type-length-value) encoded fields. Since you know the list of sample names, you could take a list of the starting strings, use that to find slice points in the blob, and determine how regular the field sizes are.
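A rough sketch of that approach, assuming the file has been read in binary mode (the file name is hypothetical):

from collections import Counter

with open("experiments.t00", "rb") as f:  # hypothetical file name
    data = f.read()

# Byte histogram: text-heavy regions are dominated by printable ASCII,
# and long runs of 0x00 often indicate padding or fixed-width fields.
for byte, count in Counter(data).most_common(10):
    print(hex(byte), count)

# Find every occurrence of a known sample name; the gaps between
# offsets hint at record sizes and how regular the layout is.
needle = b"Dibenzoylperoxid"
offsets = []
pos = data.find(needle)
while pos != -1:
    offsets.append(pos)
    pos = data.find(needle, pos + 1)
print(needle, offsets)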
Also, buy bitcoin.