Error when trying to encode text in Python

I’ve been trying for hours to get this text to display correctly. I have tried several approaches with UTF-8 and nothing works. Can anybody help me?
I believe this is not a duplicate question, because I have tried everything I could find and it all failed.
Here is an example of the code I used:
string_old = u"\u00c2\u00bfQu\u00c3\u00a9 le pasar\u00c3\u00a1 a quien desobedezca los mandamientos? "
print(string_old.encode("utf-8"))
Result:
>>> b'\xc3\x82\xc2\xbfQu\xc3\x83\xc2\xa9 le pasar\xc3\x83\xc2\xa1 a quien desobedezca los mandamientos? '
I expect the following result:
>>> "¿Qué le pasará a quien desobedezca los mandamientos? "

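The string is UTF-8 data that was wrongly decoded as Latin-1 (or cp1252). Encoding it back to Latin-1 recovers the original bytes, which can then be decoded as UTF-8: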
string_old.encode('latin1').decode('utf8')
# '¿Qué le pasará a quien desobedezca los mandamientos? '
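The same round trip can be wrapped in a small helper that leaves the text alone when the repair does not apply. This is a minimal sketch; the name fix_mojibake and its wrong_encoding parameter are made up for illustration and are not part of any library:

```python
def fix_mojibake(text, wrong_encoding="latin-1"):
    """Undo UTF-8 text that was mistakenly decoded with `wrong_encoding`."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not look like this kind of mojibake; return it unchanged.
        return text


string_old = "\u00c2\u00bfQu\u00c3\u00a9 le pasar\u00c3\u00a1 a quien desobedezca los mandamientos? "
print(fix_mojibake(string_old))
```

If the text was mis-decoded as cp1252 rather than Latin-1, passing wrong_encoding="cp1252" may be needed instead.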

Related

How to read files (with special characters) with Pandas?

I have an encoding/decoding problem.
Although I used "utf-8" to read the file into a DataFrame with the code shown below, the characters look very different in the output. The text is in French. I would be very grateful for any help, thank you in advance.
The first line of the data:
b"Sur la #route des stations ou de la maison\xf0\x9f\x9a\x98\xe2\x9d\x84\xef\xb8\x8f?\nCet apr\xc3\xa8s-midi, les #gendarmes veilleront sur vous, comme dans l'#Yonne, o\xc3\xb9 les exc\xc3\xa8s de #vitesse & les comportements dangereux des usagers de l'#A6 seront verbalis\xc3\xa9s\xe2\x9a\xa0\xef\xb8\x8f\nAlors prudence, \xc3\xa9quipez-vous & n'oubliez-pas la r\xc3\xa8gle des 3\xf0\x9f\x85\xbf\xef\xb8\x8f !"
import pandas as pd
data = pd.read_csv('C:\\Users\\Lenovo\\Desktop\\gendarmerie_tweets.csv', delimiter=";", encoding="utf-8")
data.head()
Output:
text
0 b"Sur la #route des stations ou de la maison\x...
1 b"#Guyane Soutien \xc3\xa0 nos 10 #gendarmes e...
2 b'#CoupDeCoeur \xf0\x9f\x92\x99 Journ\xc3\xa9e...
3 b'RT #servicepublicfr: \xf0\x9f\x97\xb3\xef\xb...
4 b"\xe2\x9c\x85 7 personnes interpell\xc3\xa9es...
In cases like this you can try a different encoding. The encoding parameter that might solve this issue is 'ISO-8859-1':
data = pd.read_csv('C:\\Users\\Lenovo\\Desktop\\gendarmerie_tweets.csv', delimiter=";", encoding='iso-8859-1')
Edit:
Given the output of reading the file:
<_io.TextIOWrapper name='C:\\Users\\Lenovo\\Desktop\\gendarmerie_tweets.csv' mode='r' encoding='cp1254'>
According to Python's codec list, cp1254 (alias windows-1254) is the Turkish code page, so I also suggested trying latin5 and windows-1254, but none of these options seems to help.
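One possible explanation, judging only from the output shown in the question, is that each cell holds the printed form of a Python bytes object (the text starts with b" or b'), in which case no encoding argument will help. Assuming that is what happened, a sketch like the following could turn one such cell back into text. The helper name decode_bytes_repr is hypothetical:

```python
import ast


def decode_bytes_repr(cell, encoding="utf-8"):
    """Turn the text of a bytes literal, e.g. "b'caf\\xc3\\xa9'", back into a str."""
    if isinstance(cell, str) and cell[:2] in ("b'", 'b"'):
        try:
            value = ast.literal_eval(cell)  # safely parse the literal
        except (ValueError, SyntaxError):
            return cell  # truncated or malformed literal: leave it as it is
        if isinstance(value, bytes):
            return value.decode(encoding)
    return cell


# Shortened sample modelled on the first row shown in the question.
sample = r'''b"Cet apr\xc3\xa8s-midi, les #gendarmes veilleront sur vous\xe2\x9a\xa0\xef\xb8\x8f"'''
print(decode_bytes_repr(sample))
```

With pandas, this could presumably be applied to the column with something like data["text"].apply(decode_bytes_repr), where "text" is the column name shown in the output above.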

Python ValueError: shape mismatch: objects cannot be broadcast to a single shape

My dataset file looks like
__label__ita Adesso datemi le chiavi.
__label__ara ياله من طفل محبب! يييي!
__label__eng You're a really bad bartender.
__label__epo En kiu hotelo vi restados?
__label__spa Él dijo haber perdido su vigor a los cuarenta.
__label__tat Сиңа булышмакчы идем.
__label__heb את מה פותח המפתח הזה?
__label__eng I caught a glimpse of him from the bus.
__label__eng I advise you to do that today.
__label__jpn この歌の歌い方を教えてくれますか。
__label__deu Ich habe gewusst, dass ihr Tom nicht vergessen würdet.
I'm using this function to parse the first column labels
def parse_labels(path):
    with open(path, 'r') as f:
        return np.array( list(map(lambda x: x[9:], f.read().decode('utf-8').split() )) )
so I split each row and take the label after the __label__ prefix (for example, ita from __label__ita), but it breaks for some reason
test_labels = parse_labels(args.test)
print("Test labels:%d (sample)\n%s" % (len(test_labels),test_labels[:1]) )
print("labels:%s" % test_labels)
and I get
Test labels:71828 (sample)
[u'ita']
labels:[u'ita' u'' u'' ... u'' u'' u'']
while I should have had
[u'ita',u'ara',u'eng',...]
The title of your question does not seem to match the content, so I am answering the question posed in the body. I made your code a little more modular and tested it. It returns the desired list that you have at the end of the question ([u'ita',u'ara',u'eng',...]):
def parse_labels(path):
    test_labels = []
    with open(path, 'rb') as f:
        for line in f:
            # '__label__' is 9 characters long, so the label starts at index 9
            test_labels.append(line.decode('utf-8').split(' ')[0][9:])
    return [x for x in test_labels if x]  # removes empty strings
parse_labels(args.test)
Since the language codes are at fixed offsets in each line, this can be processed more simply with a list comprehension. data.txt is the UTF-8-encoded input data. This code will work in Python 2 and 3:
from __future__ import print_function
import io
def parse_labels(path):
    with io.open(path, encoding='utf8') as f:
        return [line[9:12] for line in f]
print(parse_labels('data.txt'))
Output (Python 3):
['ita', 'ara', 'eng', 'epo', 'spa', 'tat', 'heb', 'eng', 'eng', 'jpn', 'deu']
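The original function returned mostly empty strings because split() with no arguments splits on every whitespace, so each word of each sentence was sliced with [9:] as well. A variant that looks only at the first token of each line and does not assume three-letter codes might look like this. It is a sketch in Python 3; the name parse_labels_from_lines is made up, and the sample data is copied from the question:

```python
import io

PREFIX = "__label__"


def parse_labels_from_lines(lines):
    """Collect the label from the first token of every line that has one."""
    labels = []
    for line in lines:
        first = line.split(" ", 1)[0].strip()
        if first.startswith(PREFIX):
            labels.append(first[len(PREFIX):])
    return labels


# In-memory stand-in for the dataset file.
sample = io.StringIO(
    "__label__ita Adesso datemi le chiavi.\n"
    "__label__ara ياله من طفل محبب! يييي!\n"
    "__label__eng You're a really bad bartender.\n"
)
print(parse_labels_from_lines(sample))
```

Any iterable of text lines works here, including a file opened with an explicit encoding="utf-8".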

UnicodeDecodeError: ('unknown', u'\xe0', 0, 1, '')

temp = "à la Carte"
print type(temp)
utemp = unicode(temp)
The code above results in an error.
My goal is to process the temp string and use a find to check if it contains specific string in it but cannot process due to the error:
UnicodeDecodeError: ('unknown', u'\xe0', 0, 1, '')
You need to specify the encoding: otherwise unicode() doesn't know what \xe0 means, because that is encoding-specific.
>>> temp = "à la Carte"
>>> utemp = unicode(temp,encoding="Windows-1252")
>>> utemp
u'\xe0 la Carte'
>>> print utemp
à la Carte
In Python 2, an ordinary string literal is a byte string, so a non-ASCII character in it is stored as encoded bytes, which unicode() cannot decode with its default ASCII codec. That is why the unicode literal type exists. To make it work, first declare the encoding of the Python source file, and second, use a unicode literal, like this:
# -*- coding: utf-8 -*-
temp = u"à la Carte"
print type(temp)
utemp = unicode(temp)
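For comparison, in Python 3 every str is already Unicode, so the literal needs no conversion, and only raw bytes have to be decoded with an explicit encoding. A short sketch, assuming the bytes are cp1252-encoded as in the answer above:

```python
# The bytes a cp1252-encoded source would contain for "à la Carte".
raw = b"\xe0 la Carte"

# Decode with an explicit encoding instead of relying on a default codec.
text = raw.decode("cp1252")
print(text)

# str.find works on the decoded text; it returns -1 when nothing is found.
print(text.find("Carte"))
```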

How to use multiline flag in python regex?

I want to use a regex to transform chunks of text into a database of single-line entries. But I don't know why the regex group isn't recognized.
Maybe because the multiline flag isn't properly set.
I am a beginner at python.
import re
with open("a-j-0101.txt", encoding="cp1252") as f:
    start=1
    ecx=r"(?P<entrcnt>[0-9]{1,3}) célébrités ou évènements"
    ec1=""
    nmx=r"(?P<ename>.+)\r\nAfficher le.*"
    nm1=""
    for line in f:
        if start == 1:
            out = open('AST0101.txt' + ".txt", "w", encoding="cp1252") #utf8 cp1252
            ec1 = re.search(ecx,line)
            out.write(ec1.group("entrcnt"))
            start=0
        out.write(r"\r\n")
        nm1 = re.search(nmx,line, re.M)
        out.write(str(nm1.group("ename")).rstrip('\r\n'))
    out.close()
But I get the error:
File "C:\work-python\transform-asth-b.py", line 16, in <module>
out.write(str(nm1.group("ename")).rstrip('\r\n'))
builtins.AttributeError: 'NoneType' object has no attribute 'group'
here is the input:
210 célébrités ou évènements ont été trouvés pour la date du 1er janvier.
Création de l'euro
Afficher le...
...
...
...
expected output:
210
Création de l'euro ;...
... ;...
... ;...
EDIT: I tried changing nmx to match \n or \r\n, but with no result:
nmx=r"(?P<ename>.+)(\n|\r\n)Afficher le"
best regards
In this statement:
nm1 = re.search(nmx,line, re.M)
you get None (nm1 = None) because no match was found. So investigate the nmx pattern further to find out why it produces no match.
By the way, whenever a search can return None, you can guard against it before calling group():
if nm1 is not None:
    out.write(str(nm1.group("ename")).rstrip('\r\n'))
else:
    pass  # handle the no-match case here
If you are reading a single line at a time, there is no way for a regex to match on a previous line you have read and then forgotten.
If you read a group of lines, you can apply a regex to the collection of lines, and the multiline flag will do something useful. But your current code should probably simply search for r'^Afficher le\.\.\.' and use the state machine (start == 0 or start == 1) to do this in the right context.
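As an illustration of the second option, the whole text can be searched at once so that the multiline flag has more than one line to work with. This is only a sketch: the sample text is modelled on the input in the question, and it assumes line endings are "\n", which is what Python normally produces when a file is read in text mode:

```python
import re

# Stand-in for f.read() on the whole input file.
text = (
    "210 célébrités ou évènements ont été trouvés pour la date du 1er janvier.\n"
    "Création de l'euro\n"
    "Afficher le...\n"
)

# The entry count appears once, near the top of the text.
count = re.search(r"(?P<entrcnt>[0-9]{1,3}) célébrités ou évènements", text)

# With re.M, ^ matches at the start of every line, so this captures each
# line that is immediately followed by a line starting with "Afficher le".
names = re.findall(r"^(.+)\nAfficher le", text, flags=re.M)

if count is not None:
    print(count.group("entrcnt"))
print(names)
```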

Error in the coding of the characters in reading a PDF

I need to read this PDF.
I am using the following code:
from PyPDF2 import PdfFileReader
f = open('myfile.pdf', 'rb')
reader = PdfFileReader(f)
content = reader.getPage(0).extractText()
f.close()
content = ' '.join(content.replace('\xa0', ' ').strip().split())
print(content)
However, the encoding is incorrect, it prints:
Resultado da Prova de Sele“‰o do...
But I expected
Resultado da Prova de Seleção do...
How to solve it?
I'm using Python 3
The PyPDF2 extractText method returns Unicode, so you may just need to encode it explicitly, for example into UTF-8.
# -*- coding: utf-8 -*-
correct = u'Resultado da Prova de Seleção do...'
print(correct.encode(encoding='utf-8'))
You're on Python 3, so you have Unicode under the hood, and Python 3 defaults to UTF-8. But I wonder if you need to specify a different encoding based on your locale.
# Show installed locales
import locale
from pprint import pprint
pprint(locale.locale_alias)
If that's not the quick fix, since you're getting Unicode back from PyPDF, you could take a look at the code points for those two characters. It's possible that PyPDF wasn't able to determine the correct encoding and gave you the wrong characters.
For example, a quick and dirty comparison of the good and bad strings you posted:
# -*- coding: utf-8 -*-
# Python 3.4
incorrect = 'Resultado da Prova de Sele“‰o do'
correct = 'Resultado da Prova de Seleção do...'
print("Incorrect String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in incorrect:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )
print("\n" * 2)
print("Correct String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in correct:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )
Relevant Output:
b'\xe2\x80\x9c' 8220
b'\xe2\x80\xb0' 8240
b'\xc3\xa7' 231
b'\xc3\xa3' 227
If you're getting code points 8220 and 8240 where 231 (hex(231) is '0xe7') and 227 are expected, then you're getting bad data back from PyPDF.
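A more compact way to inspect suspicious characters is the standard unicodedata module, which gives each character's official name alongside its code point. A small sketch; the helper name describe is made up for illustration:

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# The two wrong characters from the extracted text, then the two expected ones.
for row in describe("“‰") + describe("çã"):
    print(row)
```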
What I have tried is replacing the specific " ' " character with "’", which solved this issue for me. Please let me know if you still fail to generate the PDF with this approach.
text = text.replace("'", "’")
