How to get raw representation of existing string or escaping backslash - python

Problem
I'm running dataflow job where I have steps - reading txt file from cloud storage using dataflow/beam - apache_beam.io.textio.ReadFromText() which has StrUtf8Coder (utf-8) by default and after that loading it into postgres using StringIteratorIO with copy_from.
data coming from pcollection element by element, there are some elements which will look like this:
line = "some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
After that, I need to download it to postgres (the delimiter here is "|"), but the problem is these kinds of elements because postgres try to encode it(and I'm getting: 'invalid byte sequence for encoding "UTF8"'):
from F\226 we are getting this -> F\x96
This slash is not visible so I can not just replace it like this:
line.replace("\\", "\\\\")
Using python 3.8.
Have tried repr() or encode("unicode_escape").decode()
Also in every line we have different elements so let's say in the next one can be r\456
I'm able to catch and change it with regex only if I will use a raw string, but not sure how to represent a regular string as a raw if we already have it in a variable.
import re
line = r"some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
updated = re.sub("([a-zA-Z])\\\\(\\d*)", "\\1\\\\\\\\\\2",string)
print(updated)
$ some information|more information S\\\\\H, F\\226|DIST|local|app\\\\\\lock\\|
Goal
Have an extra backslash if after backslash we have some element, so the line need to look like this:
line = "some information|more information S\\\\\H, F\\226|DIST|local|app\\\\\\lock\\|"
Thank's for any help!

If you're able to read the file in binary or select the encoding, you could get a better starting point. This is how to do it in binary:
>>> line = b"some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
>>> line.decode('cp1252')
'some information|more information S\\\\H, F–|DIST|local|app\\\\\\lock\\|'
This is how to decode the whole file:
f = open('file.txt', encoding='cp1252')
f.read()
The encoding CP-1252 is the legacy Microsoft latin-1 encoding.

Related

i tried reading a .mat file through scipy.io but the path i copied is displaying error. I don't know what to do

The error it has been displaying
I want my file to run and display the datasets
Your filename includes the characters \U. That is an escape sequence, and Python expects a Unicode code point to follow it.
Add an r in front of the filename, e.g. loadmat(r'C:\Users\IFE\Desktop\CMP'), to disable escape sequences in the string.
Of course, you'll also want to assign the return value to a variable, e.g.
mat = loadmat(r'C:\Users\IFE\Desktop\CMP')

How to save a dataframe as a csv file with '/' in the file name

I want to save a dataframe to a .csv file with the name '123/123', but it will split it in to two strings if I just type like df.to_csv('123/123.csv').
Anyone knows how to keep the slash in the name of the file?
You can't use a slash in the name of the file.
But you can use Unicode character that looks like slash if your file system support it http://www.fileformat.info/info/unicode/char/2215/index.htm
... "/" is used for indicating a path to a child folder, so your filename says the file 123.csv is in a folder "123"
however, that does not make it entirely impossible, just very difficult see this question:
https://superuser.com/questions/187469/how-would-i-go-about-creating-a-filename-with-invalid-characters-such-as
and that a charmap can find a character that looks like it, which is legal. In this case a division character
You can not use any of these chars in the file name ;
/:*?\"|
You can use a similar unicode character as the "Fraction slash - Unicode hexadecimal: 0x2044"
Example:
df.to_csv("123{0}123".format(u'\u2044'.encode('utf-8')))
It gives you the filename that you asked.

Open URL encoded filenames in Unix

I'm a python n00b. I have downloaded URL encoded file and I want to work with it on my unix system(Ubuntu 14).
When I try and run some operations on my file, the system says that the file doesn't exist. How do I change my filename to a unix recognizable format?
Some of the files I have download have spaces in them so they would have to be presented with a backslash and then a space. Below is a snippet of my code
link = "http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3"
output = open(link.split('/')[-1],'wb')
output.write(site.read())
output.close()
shutil.copy(link.split('/')[-1], tmp_dir)
The "link" you have actually is a URL. URLs are special and are not allowed to contain certain characters, such as spaces. These special characters can still be represented, but in an encoded form. The translation from special characters to this encoded form happens via a certain rule set, often known as "URL encoding". If interested, have a read over here: http://en.wikipedia.org/wiki/Percent-encoding
The encoding operation can be inverted, which is called decoding. The tool set with which you downloaded the files you mentioned most probably did the decoding already, for you. In your link example, there is only one special character in the URL, "%20", and this encodes a space. Your download tool set probably decoded this, and saved the file to your file system with the actual space character in the file name. That is, most likely you have a file in the file system with the following basename:
Scheherezade Theme.mp3
So, when you want to open that file from within Python, and all you have is the link, you first need to get the decoded variant of it. Python can decode URL-encoded strings with built-in tools. This is what you need:
>>> import urllib.parse
>>> url = "http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3"
>>> urllib.parse.unquote(url)
'http://www.stephaniequinn.com/Music/Scheherezade Theme.mp3'
>>>
This assumes that you are using Python 3, and that your link object is a unicode object (type str in Python 3).
Starting off with the decoded URL, you can derive the filename. Your link.split('/')[-1] method might work in many cases, but J.F. Sebastian's answer provides a more reliable method.
To extract a filename from an url:
#!/usr/bin/env python2
import os
import posixpath
import urllib
import urlparse
def url2filename(url):
"""Return basename corresponding to url.
>>> url2filename('http://example.com/path/to/file?opt=1')
'file'
"""
urlpath = urlparse.urlsplit(url).path # pylint: disable=E1103
basename = posixpath.basename(urllib.unquote(urlpath))
if os.path.basename(basename) != basename:
raise ValueError # refuse 'dir%5Cbasename.ext' on Windows
return basename
Example:
>>> url2filename("http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3")
'Scheherezade Theme.mp3'
You do not need to escape the space in the filename if you use it inside a Python script.
See complete code example on how to download a file using Python (with a progress report).

How can I fix encoding errors in a string in python

I have a python script as a subversion pre-commit hook, and I encounter some problems with UTF-8 encoded text in the submit messages. For example, if the input character is "å" the output is "?\195?\165". What would be the easiest way to replace those character parts with the corresponding byte values? Regexp doesn't work as I need to do processing on each element and merge them back together.
code sample:
infoCmd = ["/usr/bin/svnlook", "info", sys.argv[1], "-t", sys.argv[2]]
info = subprocess.Popen(infoCmd, stdout=subprocess.PIPE).communicate()[0]
info = info.replace("?\\195?\\166", "æ")
I do the same things in my code and you should be able to use:
...
u_changed_path = unicode(changed_path, 'utf-8')
...
When using the approach above, I've only run into issues with characters like line feeds and such. If you post some code, it could help.

How to separate content from a file that is a container for binary and other forms of content

I am trying to parse some .txt files. These files serve as containers for a variable number of 'children' files that are set off or identified within the container with SGML tags. With python I can easily separate the children files. However I am having trouble writing the binary content back out as a binary file (say a gif or jpg). In the simplest case the container might have an embedded html file followed by a graphic that is called by the html. I am assuming that my problem is because I am reading the original .txt file using open(filename,'r'). But that seems the only option to find the sgml tags to split the file.
I would appreciate any help to identify some relevant reading material.
I appreciate the suggestions but I am still struggling with the most basic questions. For example when I open the file with wordpad and scroll down to the section tagged as a gif I see this:
<FILENAME>h65803h6580301.gif
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 h65803h6580301.gif
M1TE&.#EA(P)I`=4#`("`#,#`P$!`0+^_OW]_?_#P\*"#H.##X-#0T&!#8!`0
M$+"PL"`#('!P<)"0D#`P,%!04#\_/^_O[Y^?GZ^OK]_?WX^/C\_/SV]O;U]?
I can handle finding the section easily enough but where does the gif file begin. Does the header start with 644, the blanks after the word begin or the line beginning with MITE?
Next, when the file is read into python does it do anything to the binary code that has to be undone when it is read back out?
I can find the lines where the graphics begin:
filerefbin=file('myfile.txt','rb')
wholeFile=filerefbin.read()
import re
graphicReg=re.compile('<DESCRIPTION>GRAPHIC')
locationGraphics=graphicReg.finditer(wholeFile)
graphicsTags=[]
for match in locationGraphics:
graphicsTags.append(match.span())
I can easily use the same process to get to the word begin, or to identify the filename and get to the end of the filename in the 'first' line. I have also successefully gotten to the end of the embedded gif file. But I can't seem to write out the correct combination of things so when I double click on h65803h6580301.gif when it has been isolated and saved I get to see the graphic.
Interestingly, when I open the file in rb, the line endings appear to still be present even though they don't seem to have any effect in notebpad. So that is clearly one of my problems I might need to readlines and join the lines together after stripping out the \n
I love this site and I love PYTHON
This was too easy once I read bendin's post. I just had to snip the section that began with the word begin and save that in a txt file and then run the following command:
import uu
uu.decode(r'c:\test2.txt',r'c:\test.gif')
I have to work with some other stuff for the rest of the day but I will post more here as I look at this more closely. The first thing I need to discover is how to use something other than a file, that is since I read the whole .txt file into memory and clipped out the section that has the image I need to work with the clipped section instead of writing it out to test2.txt. I am sure that can be done its just figuring out how to do it.
What you're looking at isn't "binary", it's uuencoded. Python's standard library includes the module uu, to handle uuencoded data.
The module uu requires the use of temporary files for encoding and decoding. You can accomplish this without resorting to temporary files by using Python's codecs module like this:
import codecs
data = "Let's just pretend that this is binary data, ok?"
uuencode = codecs.getencoder("uu")
data_uu, n = uuencode(data)
uudecode = codecs.getdecoder("uu")
decoded, m = uudecode(data_uu)
print """* The initial input:
%(data)s
* Encoding these %(n)d bytes produces:
%(data_uu)s
* When we decode these %(m)d bytes, we get the original data back:
%(decoded)s""" % globals()
You definitely need to be reading in binary mode if the content includes JPEG images.
As well, Python includes an SGML parser, http://docs.python.org/library/sgmllib.html .
There is no example there, but all you need to do is setup do_ methods to handle the sgml tags you wish.
You need to open(filename,'rb') to open the file in binary mode. Be aware that this will cause python to give You confusing, two-byte line endings on some operating systems.

Categories

Resources