How can I fix encoding errors in a string in python

I have a Python script as a Subversion pre-commit hook, and I encounter some problems with UTF-8 encoded text in the commit messages. For example, if the input character is "å" the output is "?\195?\165". What would be the easiest way to replace those character parts with the corresponding byte values? A plain regexp replacement doesn't work, as I need to process each element and merge the results back together.
Code sample:
infoCmd = ["/usr/bin/svnlook", "info", sys.argv[1], "-t", sys.argv[2]]
info = subprocess.Popen(infoCmd, stdout=subprocess.PIPE).communicate()[0]
info = info.replace("?\\195?\\166", "æ")

I do the same thing in my code, and in Python 2 you should be able to use:
...
u_changed_path = unicode(changed_path, 'utf-8')
...
When using the approach above, I've only run into issues with characters like line feeds and such. If you post some code, it could help.
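If the escapes are already in the string, one possible way to turn them back into characters is a regex substitution with a callback, which processes each run of escapes and merges the result back in place. This is a minimal Python 3 sketch, assuming the escapes always have the form ?\ddd with a three-digit decimal byte value, as in the question; the helper name unescape_svnlook is hypothetical.

```python
import re

# Matches one or more consecutive escapes such as ?\195?\165
# (assumed format: "?", backslash, three decimal digits).
_ESCAPE_RUN = re.compile(r"(?:\?\\\d{3})+")


def unescape_svnlook(text):
    """Replace runs of ?\\ddd escapes with the UTF-8 characters they encode."""

    def _decode_run(match):
        # Collect the decimal byte values of the whole run and decode them together,
        # because one character may span several bytes.
        values = [int(n) for n in re.findall(r"\d{3}", match.group(0))]
        try:
            return bytes(values).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            # Not valid UTF-8 (or a value above 255): leave the text untouched.
            return match.group(0)

    return _ESCAPE_RUN.sub(_decode_run, text)


print(unescape_svnlook("Bl?\\195?\\165b?\\195?\\166r"))
```

Runs that do not decode as UTF-8 are returned unchanged, so the function is safe to apply to a whole commit message.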

Related

How to get raw representation of existing string or escaping backslash

Problem
I'm running a Dataflow job with two steps: reading a text file from Cloud Storage with apache_beam.io.textio.ReadFromText(), which uses StrUtf8Coder (UTF-8) by default, and then loading it into Postgres using StringIteratorIO with copy_from.
The data comes from the PCollection element by element, and some elements look like this:
line = "some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
After that, I need to load it into Postgres (the delimiter here is "|"), but elements like this cause a problem because Postgres tries to decode them, and I get 'invalid byte sequence for encoding "UTF8"':
from F\226 we get F\x96
The backslash is not visible in the string, so I cannot simply replace it like this:
line.replace("\\", "\\\\")
I'm using Python 3.8.
I have tried repr() and encode("unicode_escape").decode().
Every line also contains different elements, so the next one might contain r\456, for example.
I can catch and change it with a regex only if I use a raw string, but I'm not sure how to treat a regular string as raw when it is already stored in a variable.
import re
line = r"some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
updated = re.sub("([a-zA-Z])\\\\(\\d*)", "\\1\\\\\\\\\\2", line)
print(updated)
$ some information|more information S\\\\\H, F\\226|DIST|local|app\\\\\\lock\\|
Goal
Add an extra backslash wherever a backslash is followed by some element, so the line needs to look like this:
line = "some information|more information S\\\\\H, F\\226|DIST|local|app\\\\\\lock\\|"
Thanks for any help!

If you're able to read the file in binary or select the encoding, you could get a better starting point. This is how to do it in binary:
>>> line = b"some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
>>> line.decode('cp1252')
'some information|more information S\\\\H, F–|DIST|local|app\\\\\\lock\\|'
This is how to decode the whole file:
f = open('file.txt', encoding='cp1252')
f.read()
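CP-1252 (Windows-1252) is the legacy Microsoft Western European encoding. It is closely related to Latin-1 (ISO-8859-1) but maps bytes 0x80-0x9F to printable characters, which is why the byte \226 (0x96) decodes to an en dash.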
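The decoding step above can be combined with backslash escaping before the load. This is only a sketch: the sample bytes are modelled on the question, and it assumes the data is loaded through COPY in text format, where a literal backslash has to be doubled.

```python
# Sample bytes modelled on the question; \226 is the single byte 0x96.
raw = b"more information S\\\\H, F\226|DIST|local|app\\lock\\|"

# Decode with the encoding the file was actually written in.
text = raw.decode("cp1252")  # 0x96 becomes U+2013 (en dash)

# Assumption: COPY text format, where backslash is the escape character,
# so every literal backslash is doubled before loading.
escaped = text.replace("\\", "\\\\")

print(escaped)
```

Note that a few byte values are undefined in Python's cp1252 codec, so decoding arbitrary input may still raise UnicodeDecodeError.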

stdout captured from command has twice as many newlines when passed to selenium

I have a bit of code with which I am trying to capture stdout:
def MediaInfo():
    cmd = ['MediaInfo.exe', 'videofile.mkv']
    test = subprocess.run(cmd, capture_output=True)
    info = test.stdout.decode("utf-8")
    print(info)
When using print or writing it to file, it looks fine. But when I use selenium to fill it into a message box:
techinfo = driver.find_element(By.NAME, "techinfo").send_keys(info)
there is an additional empty line between every line. Originally I had an issue where the stdout was a byte literal. It looked like b"This is the first line.\r\nThis is the second line.\r\n" Adding .decode("utf-8") is what fixed that but I am wondering if in certain instances something is interpreting \r\n as creating two lines. I'm just not sure if it is an issue with Selenium or subprocess or something else. The webpage element Selenium is writing to doesn't seem to have an issue. It looks correct if I copy and paste it from the text file. Meaning, it's not just the way it's displayed, there are actually twice as many line feeds. Any ideas? I don't want to just loop through and delete the extra lines. Too kludgy. I'm guessing this is an issue with Python 3, from what I've read.

send_keys() sends each key individually, which means "\r\n" is sent as two key presses. Replacing "\r\n" with "\n" before sending the text to the element should do the trick.
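A minimal sketch of that replacement, with a hypothetical helper name normalize_newlines; the Selenium call is shown only as a comment because it needs a live driver.

```python
def normalize_newlines(text):
    """Convert Windows (\\r\\n) and bare \\r line endings to \\n."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


info = "This is the first line.\r\nThis is the second line.\r\n"
clean = normalize_newlines(info)
print(repr(clean))

# With the driver from the question, the cleaned text would then be sent as:
# driver.find_element(By.NAME, "techinfo").send_keys(clean)
```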

Concatenating strings retrieved by regex from a subprocess STDERR results in disorder

I have an audio file, Sample.flac. The title and length can be read with ffprobe to result in the output being sent to STDERR.
I want to run ffprobe through subprocess, and have done so successfully. I then retrieve the output (piped to subprocess.PIPE) with *.communicate()[1].decode(), as the Python docs indicate.
communicate() returns a tuple, (stdout, stderr), with the output from the Popen() object. The proper index for stderr is then accessed and decoded from a byte string into a Python 3 UTF-8 string.
This decoded output is then parsed with a multiline regex pattern matching the format of the ffprobe metadata output. The match groups are then placed appropriately into a dictionary, with each first group converted to lowercase, and used as the key for the second group (value).
Here is an example of the output and the working regex.
The data can be accessed through the dictionary keys as expected. But upon concatenating the values together (all are strings), the output appears mangled.
This is the output I would expect:
Believer (Kaskade Remix) 190
Instead, this is what I get:
190ever (Kaskade Remix)
I don't understand why the strings appear to "overlap" each other and result in a mangled form. Can anyone explain this and what I have done wrong?
Below is the complete code that was run to produce the results above. It is a reduced section of my full project.
#! /usr/bin/env python3
# -*- coding: utf-8 -*-
import os
from re import findall, MULTILINE
from subprocess import Popen, PIPE
def media_metadata(file_path):
    """Use FFPROBE to get information about a media file."""
    stderr = Popen(("ffprobe", file_path), shell=True, stderr=PIPE).communicate()[1].decode()
    metadata = {}
    for match in findall(r"(\w+)\s+:\s(.+)$", stderr, MULTILINE):
        metadata[match[0].lower()] = match[1]
    return metadata
if __name__ == "__main__":
    meta = media_metadata("C:/Users/spike/Music/Sample.flac")
    print(meta["title"], meta["length"])
    # The above and below have the same result in the console
    # print(meta["title"] + " " + meta["length"])
    # print("{title} {length}".format(**meta))
Can anyone explain this unpredictable output?
I have asked this question here earlier, but I don't think it was very clear. In the raw output, when this is run on multiple files, you can see that towards the end the strings become so unpredictable that part of the title value is not printed at all.
Thanks.

You are capturing the "\r" character. When it is printed, the cursor returns to the beginning of the line, so the text that follows overwrites the first part. Stripping whitespace (which also removes the trailing "\r") should solve the problem:
metadata[match[0].lower()] = match[1].strip()
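The effect can be checked without ffprobe. The sketch below uses a made-up two-line sample with Windows line endings (the exact ffprobe layout is an assumption) and the regex from the question.

```python
from re import findall, MULTILINE

# Hypothetical sample of ffprobe-style metadata with \r\n line endings.
stderr = "    TITLE           : Believer (Kaskade Remix)\r\n    LENGTH          : 190\r\n"

raw = {}
stripped = {}
for key, value in findall(r"(\w+)\s+:\s(.+)$", stderr, MULTILINE):
    raw[key.lower()] = value               # still ends with "\r"
    stripped[key.lower()] = value.strip()  # trailing "\r" removed

print(repr(raw["title"]))
print(stripped["title"], stripped["length"])
```

Printing with repr() makes the stray "\r" visible, which is a quick way to diagnose this kind of overwritten output.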

Reproduce:
print('Believer (Kaskade Remix)\r 190')
Output:
190ever (Kaskade Remix)
Issue:
End-Of-Line is \r\n. re $ matches \n. \r remains in the matching group.
Fix:
Insert \r before $ in your re pattern. i.e. (\w+)\s+:\s(.+)\r$
Or use universal_newlines=True as a Popen argument and remove .decode()
as the output will be text with \n instead of \r\n.
Or stderr = stderr.replace('\r', '') before re processing.
Alternative:
ffprobe can output a json string. Use json module which loads the string
and returns a dictionary.
i.e. command
['ffprobe', '-show_format', '-of', 'json', file_path]
The json string will be the stdout stream.
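A rough sketch of the JSON alternative follows. The -v quiet flag (which silences the banner on stderr) and the assumption that the title and length appear under the format-level tags are additions that are not confirmed by the question, and the sample JSON is hypothetical.

```python
import json
import subprocess


def parse_format_tags(json_text):
    """Return the format-level tags from ffprobe JSON output, keys lowercased."""
    data = json.loads(json_text)
    tags = data.get("format", {}).get("tags", {})
    return {key.lower(): value for key, value in tags.items()}


def media_metadata(file_path):
    """Run ffprobe and return the container tags as a dictionary."""
    cmd = ["ffprobe", "-v", "quiet", "-show_format", "-of", "json", file_path]
    # text=True decodes stdout and normalizes line endings to \n.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_format_tags(result.stdout)


# Hypothetical sample of the JSON shape, used only to exercise the parser.
sample = '{"format": {"tags": {"TITLE": "Believer (Kaskade Remix)", "LENGTH": "190"}}}'
meta = parse_format_tags(sample)
print(meta["title"], meta["length"])
```

Parsing JSON avoids the regex entirely, so line endings in the output no longer matter.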

Twitter Search Program

I'm using this program, and because the tweets are in Arabic, all the tweets that I'm getting look like this:
"text": "\\u0637\\u0627\\u0644\\u0628\\u0629 \\u062c\\u0633\\u0645\\u0647\\u0627 \\u062c\\u0628\\u0627\\u0631 \\u062a\\u062a\\u062e\\u062f \\u0645\\u0646 \\u0627\\u0644\\u0634\\u0627\\u0631\\u0639 \\u0648 \\u062a\\u062a\\u0646\\u0627\\u0643..\\n\\n\\u0633\\u0643\\u0633_\\u0627\\u062c\\u0646\\u0628\\u064a\\n\\u0645
I asked a question about it and got the answer here.
My question is: where can I use ensure_ascii=False in the program so that it reads the Arabic tweets correctly? I don't know where it needs to go.

You need to modify twitter_search.py.
Replace every
json.dump(<something>,fd)
with
json.dump(<something>,fd,ensure_ascii=False)
You'll also need to open every <file_descriptor> with UTF-8 encoding:
import codecs
...
...
fd = codecs.open("/tmp/lol", "w", "utf-8")
If you're processing the results with Python, another approach is to unescape the ASCII string:
s='\\u0637\\u0627\\u0644\\u0628\\u0629...'
print(s.encode("utf-8").decode('unicode_escape'))
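For reference, here is a small Python 3 sketch of the same idea, where the built-in open() accepts an encoding argument and codecs.open is not needed. The sample text and the temporary file path are made up for illustration.

```python
import json
import os
import tempfile

# Made-up sample record with Arabic text.
tweet = {"text": "مرحبا بالعالم"}

escaped = json.dumps(tweet)                       # non-ASCII becomes \uXXXX escapes
readable = json.dumps(tweet, ensure_ascii=False)  # characters are kept as-is

# Write with an explicit UTF-8 encoding so the characters survive on disk.
path = os.path.join(tempfile.mkdtemp(), "tweets.json")
with open(path, "w", encoding="utf-8") as fd:
    json.dump(tweet, fd, ensure_ascii=False)

with open(path, encoding="utf-8") as fd:
    restored = json.load(fd)

print(escaped)
print(readable)
```

Both forms are valid JSON and load back to the same data; ensure_ascii=False only changes how the file looks to a human reader.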

cURL, cat, python and missing parts from a web page

I have downloaded a web page(charset=iso-8859-1) using curl
curl "webpage_URL" > site.txt
The encoding of my terminal is utf-8. Here I try to see the encoding of this file:
file -i site.txt
site.txt: regular file
Now the strange thing: if I open the file with nano, I find all the words that are visible in a normal browser. BUT when I use:
cat site.txt
some words are missing. This made me curious, and after some hours of research I still couldn't figure out why.
Python doesn't find all the words either:
def function(url):
    p = subprocess.Popen(["curl", url], stdout=subprocess.PIPE)
    output, err = p.communicate()
    print output
    soup=BeautifulSoup(output)
    return soup.body.find_all(text=re.compile('common_word'))
I also tried to use urllib2 but I had no success.
What am I doing wrong?

In case somebody faces the same problem:
The root of my problem was a number of carriage return characters (\r) in the web page. The terminal does not display them as text. That alone wouldn't be a big problem, but the whole line that contains a \r appears to be skipped.
So, to see the content of the entire file, these characters should be made visible with the -v or -e option:
cat -v site.txt
(thanks to MendiuSolves, who suggested using the cat command options)
To solve part of the Python problem, I changed the return value from soup.body.find_all(text=re.compile('common_word')) to soup.find_all(text=re.compile('common_word'))
Obviously, if the word you search for is on one of the lines containing a \r and you print it, you will not see the result. The solution could be either to filter out the character or to write the content to a file.
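Filtering the character could look like the sketch below. It assumes the page bytes are ISO-8859-1, as stated in the question; the helper name clean_page and the sample bytes are hypothetical.

```python
def clean_page(raw_bytes, encoding="iso-8859-1"):
    """Decode downloaded page bytes and turn carriage returns into newlines."""
    text = raw_bytes.decode(encoding)
    # Handle \r\n first so it does not become two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Made-up sample: an ISO-8859-1 byte (0xE9) and a bare carriage return.
sample = b"caf\xe9 menu\rcommon_word here\r\nlast line"
cleaned = clean_page(sample)
print(cleaned)
```

After this step every line prints in full, because nothing moves the cursor back to the start of the line.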
