Using RegEx and Reading files inside of a .EGG File?

Using RegEx and Reading files inside of a .EGG File? - python

newemail = 'test#gmail.com'
import zipfile
import re
egg = zipfile.ZipFile('C:\\Users\\myname\\Desktop\\TEST\\Tool\\scraper-1.11-py3.6.egg')
file = egg.open('scraping_tool/settings.py')
text = file.read().decode('utf8')
emailregex = re.compile(r'[A-Za-z0-9-.]+#[A-Za-z0-9-.]+')
newtext = emailregex.sub(newemail,text)
newtext = newtext.encode('utf8')
file.close()
egg.close()
egg = zipfile.ZipFile('C:\\Users\\myname\\Desktop\\TEST\\Tool\\scraper-1.11-py3.6.egg', 'w')
file = egg.open('scaping_tool/settings.py', 'w')
file.write(newtext)
file.close()
egg.close()
I'm a week into programming so let me know if anything I say doesn't make sense. The objective I'm trying to achieve right now is getting the email out of a .py file in a egg file.
In the interactive shell I was able to successfully to retrieve the txt = file.read() but once I start getting the match objects and regEX involved I start getting errors like "can not get strings from byte objects"
Tried reading stackoverflow questions about the errors but still too new to decipher what they are talking about and might need it dumbed down a bit more. I understand the zip file is messing up how regex works with strings but not sure how to fix it.
EDIT: Bonus Question about encoding

When you .read() an entry from a zipfile, you will get bytes. There is no automatism that detects whether a zip entry is a binary file or a text file, you have to make that decision yourself.
In order to convert bytes into string, you must decode them, which requires knowledge of the text encoding these files have been saved in. If you don't know for sure, UTF-8 (utf8) and Windows-1252 (cp1252) are common encodings to try. You'll know that you've picked the right encoding when special/accented characters look right in the result:
txt = file.read().decode('utf8')
print(txt)
Once you're working with strings, the "can not get strings from byte objects" error won't occur anymore.

Related

How to write some non-ascii characters in iso-8859-1 format

I have a file, which is not created by me, that contains some non ascii characters. The name of file is foo.in. In there there is a turkish word rahatlattı. the last character is non-ascii. The file format is foo.in: text/html; charset=iso-8859-1. When I open the file with less command I see the word is written as rahatlatt<FD> Its not a problem for me. But when I try to create a similar file, I couldn't make it in the same way. Up to know, I try the following:
import codecs
outputFile = codecs.open("foo2.in","w","ISO-8859-1")
p = unicode('rahatlatt\u0131n')
outputFile.write(p)
outputFile.close()
Unfortunatelly, when I checked the content and format of the file I see that format is set as foo2.in: text/plain; charset=us-ascii and content is written as rahatlatt\u0131n
So, I wonder how can I create a file that is similar to foo.in. The reason behind my request is that, another program takes these files as input and I guess the format of the file is important for it and I cannot run the program with another file format.

Reading a binary file as plain text using Python

A friend of mine has written simple poetry using C's fprintf function. It was written using the 'wb' option so the generated file is in binary. I'd like to use Python to show the poetry in plain text.
What I'm currently getting are lots of strings like this: ��������
The code I am using:
with open("read-me-if-you-can.bin", "rb") as f:
print f.read()
f.close()

The thing is, when dealing with text written to a file, you have to know (or correctly guess) the character encoding used when writing said file. If the program reading the file is assuming the wrong encoding here, you will end up with strange characters in the text if you're lucky and with utter garbage if you're unlucky.
Don't try to guess, try to know: you need to ask your friend in what character encoding he or she wrote the poetry text to the file. You then have to open the file in Python specifying that character encoding. Let's say his/her answer is "UTF-16-LE" (for sake of example), you then write:
with open("poetry.bin", encoding="utf-16-le") as f:
print(f.read())
It seems you're on Python 2 still though, so there you write:
import io
with io.open("poetry.bin", encoding="utf-16-le") as f:
print f.read()
You could start by trying UTF-8 first though, that is an often used encoding.

Python - failing to read correctly the first line of a text file to a list

I'm having a problem understanding why my python program does what it does when reading (first) lines from files and adding the lines into a list. For some reason the first line needs to be empty or it'll not read the first line correctly. If the first line is empty, it's not empty (at least not according to python).
The thing is, I have two types of files:
First file is in the form:
text:more text
another text:and more
and the second file in the form:
text_file.txt
anothertext_file.txt
Both files are UTF-8 encoded text files. The first line of both files that gets added to a list in my program, is "text" and "text_file.txt" but any code that for example tries to say
if something == "text":
...
will not get executed even if the "something" is the same as the "text".
So I'm assuming that my problem is that somewhere in the machine code (or something), my computer writes some invisible code in the beginning of the text file and that makes the first line not what it is. Maybe? I have actually found a solution for the problem simply by adding an empty line and an if clause when reading the file line by line:
if not "." in line:
...
and in the other filetype:
if not ":" in line:
...
Those if clauses work and my program does what it's supposed to (as long as I always add an empty line to the beginning of the file), but I haven't been able to find a real reason for why my program is behaving as it is. Also, I would like to not have to do this kind of a workaround if there's an easier solution that doesn't involve me editing all my files and adding an if clauses to my code.
Would appreciate any help understanding what's happening here!
Edit: as you people have been asking for my code, here it is:
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
for line in f:
filelist.append(line.rstrip("\n"))
This does not work properly. Also I tried it like mxds said,
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
lines = f.readlines()
for line in lines:
filelist.append(line.rstrip("\n"))
and this does not work either. It is only a problem in the files in the first character of the first line.
Edit2:
It seems the problem is having a Byte order mark in the beginning of my text files. After a quick googling I didn't find a solution as to how I could remove it. I'm creating my files with just windows notepad.
Final edit:
Apparently notepad is not a real text editor. I guess I'll just swap over from notepad to notepad++ to avoid this problem. However, just in case I'll have to handle my files in notepad: If I open a textfile in notepad and add some text in it, will it add a BOM or should it do that only in the creating of the file?

Looks like you've already done the legwork on this, but according to How to make Notepad to save text in UTF-8 without BOM?, the best answer is not to use Notepad (but Notepad++ is ok). :)
Alternatively, you can strip the BOM in Python with:
line = line.decode("utf-8-sig").encode("utf-8")
See https://docs.python.org/3/library/codecs.html:
To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
"utf-8-sig") for its Notepad program: Before any of the Unicode
characters is written to the file, a UTF-8 encoded BOM (which looks
like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
...
On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.

A classic approach to reading text files in Python is:
with open(fname, 'r') as f:
lines = f.readlines()
After which you can process the lines like this:
for line in lines:
# do something with line...
As other comments have hinted, you may want to make sure this works first. It would help if you post your current code for review.

I just had similar issue: python readlines() reports invalid chars heading the first line, something like ï»¿. I have tried all suggestions i can google, with no luck.
I came up with a simple trick: skip the line with
add a blank line as the first line in the text file
if len(line[i]) > len(line[0]):
do things
else:
skipping
in my case, the len(line[0] = 4, all other lines are longer than 4

Python failed to parse txt file but the file is confirmed to be 'txt' file

I have a piece of python code that reads from a txt file properly, but my colleague gave me another set of files that appears to be of type txt file as well. But when I ran the same python code, each line is read incorrectly.
For the new files, if the line is 240,022414114120,-500,Bauer_HS5,0
It would be read as str:2[]4[]0 []0[]2[]2[]4..... All those little rectangles between each character and the leading question mark characters are all invalid characters.
And it will further get converted to something like this:
[['\xff\xfe2\x004\x000\x00', '\x000\x002\x002\x004\x001\x004\x001\x001\x004\x001\x002\x000\x00', '\x00-\x005\x000\x000\x00',......
However, if I manually create a normal text file and copy/paste the content from the input file, the parsr was able to read each line correctly. So I am thinking the input files are of different type of the normal text file. But the files' suffix are indeed 'txt'.
The files come from a device that regularly sends files to our server. This parser works fine for another device that does the same thing. And the files from both devices are all of type 'txt'.
Each line is read as {{{ for line in self._infile.xreadlines(): }}}
I am very confused why it would behave this way.
My python code is following.
def __init__(self, infile=sys.stdin, outfile=sys.stdout):
if isinstance(infile, basestring):
infile = open(infile)
if isinstance(outfile, basestring):
outfile = open(outfile, "w")
self._infile = infile
self._outfile = outfile
def sort(self):
lines = []
last_second = None
for line in self._infile.xreadlines():
line = line.replace('\r\n', '')
fields = line.split(',')
if len(fields) < 2:
continue
second = fields[1]
if last_second and second != last_second:
lines = sorted(lines, self._sort_lines)
self._outfile.write("".join([','.join(x) for x in lines]))
#self._outfile.write("\r\n")
lines = []
last_second = second
lines.append(fields)
if lines:
lines = sorted(lines, self._sort_lines)
self._outfile.write("".join([','.join(x) for x in lines]))
#self._outfile.write("\r\n")
self._infile.close()
self._outfile.close()

The start of the file you described as coming from your colleague is "\xff\xfe". These two characters make up a "byte order mark" that indicates that the file is encoded with the "UTF-16-LE" encoding (that is, 16-bit Unicode with the lower byte first). Your Python script is reading with an 8-bit encoding (probably whatever your system's default encoding is), so you're seeing lots of extra null characters (the high bytes of the 16-bit characters).
I can't speak to how the file got a different encoding. Windows text editors (like notepad.exe) are somewhat notorious for silently reencoding files in unhelpful ways if you're not careful with them, so it may be that your colleague previewed the file in an editor and then saved it before forwarding it on to you.
Anyway, the simplest fix is probably to reencode the file. There are various utilities to do this on various OSs, or you could write your own easily enough. Here's a quick and dirty function to reencode a file in Python (which will hopefully raise an exception if the encoding parameters are wrong, but perhaps not always):
def renecode_file(filename, from_encoding="UTF-16-LE", to_encoding="ascii"):
with open(filename, "rb") as f:
in_bytes = f.read() # read bytes
text = in_bytes.decode(from_encoding) # decode to unicode
out_bytes = text.encode(to_encoding) # reencode to new encoding
with open(filename, "wb") as f:
f.write(out_bytes) # write back to the file
If the file you get is going to always be encoded in UTF-16, you could change your regular script to decode it automatically. In Python 2.7, I'd suggest using the io module's open function for this (it is the same code that the regular open uses in Python 3). Note however that the file object returned won't support the xreadlines method which has been deprecated for a long time (just iterate over the file directly instead).

python opens text file with a space between every character

Whenever I try to open a .csv file with the python command
fread = open('input.csv', 'r')
it always opens the file with spaces between every single character. I'm guessing it's something wrong with the text file because I can open other text files with the same command and they are loaded correctly. Does anyone know why a text file would load like this in python?
Thanks.
Update
Ok, I got it with the help of Jarret Hardie's post
this is the code that I used to convert the file to ascii
fread = open('input.csv', 'rb').read()
mytext = fread.decode('utf-16')
mytext = mytext.encode('ascii', 'ignore')
fwrite = open('input-ascii.csv', 'wb')
fwrite.write(mytext)
Thanks!

The post by recursive is probably right... the contents of the file are likely encoded with a multi-byte charset. If this is, in fact, the case you can likely read the file in python itself without having to convert it first outside of python.
Try something like:
fread = open('input.csv', 'rb').read()
mytext = fread.decode('utf-16')
The 'b' flag ensures the file is read as binary data. You'll need to know (or guess) the original encoding... in this example, I've used utf-16, but YMMV. This will convert the file to unicode. If you truly have a file with multi-byte chars, I don't recommend converting it to ascii as you may end up losing a lot of the characters in the process.
EDIT: Thanks for uploading the file. There are two bytes at the front of the file which indicates that it does, indeed, use a wide charset. If you're curious, open the file in a hex editor as some have suggested... you'll see something in the text version like 'I.D.|.' (etc). The dot is the extra byte for each char.
The code snippet above seems to work on my machine with that file.

The file is encoded in some unicode encoding, but you are reading it as ascii. Try to convert the file to ascii before using it in python.

Isn't csv a simple txt file with values separated with comma.
Just try to open it with a text editor to see if the file is correctly formed.

To read an encoded file, you can simply replace open with codecs.open.
fread = codecs.open('input.csv', 'r', 'utf-16')

It did never ocurred to me, but as truppo said, it must be something wrong with the file.
Try to open the file in Excel/BrOffice Calc and Save As the file as Csv again.
If the problem persists, try a subset of the data: fist 10/last 10/intermediate 10 lines of the file.

Ok, I got it with the help of Jarret Hardie's post
this is the code that I used to convert the file to ascii
fread = open('input.csv', 'rb').read()
mytext = fread.decode('utf-16')
mytext = mytext.encode('ascii', 'ignore')
fwrite = open('input-ascii.csv', 'wb')
fwrite.write(mytext)
Thanks!

Open the file in binary mode, 'rb'. Check it in a HEX Editor and check for null padding '00'. Open the file in something like Scintilla Text Editor to check the characters present in the file.

Here's the quick and easy way, esp if python won't parse the input correctly
sed 's/ \(.\)/\1/g'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.