I'm trying to convert a shapefile into GeoJSON format using Python, and this is my code:
import shapefile
import json
path = "shapefiles/WTL_VALV_PS"
sf = shapefile.Reader(path)
fields = sf.fields[1:]
field_names = [field[0] for field in fields]
buffer = []
for sr in sf.shapeRecords():
    atr = dict(zip(field_names, sr.record))
    geom = sr.shape.__geo_interface__
    buffer.append(dict(type="Feature", geometry=geom, properties=atr))
geojson = open("test4.geojson", "w", encoding='utf-8')
geojson.write(json.dumps({"type": "FeatureCollection", "features": buffer}, indent=2, ensure_ascii=False))
geojson.close()
but I got this error
Traceback (most recent call last):
File "C:/Users/user/PycharmProjects/ConvertGeoJSON/geoJSON.py", line 16, in <module>
for sr in sf.shapeRecords():
File "C:\Users\user\PycharmProjects\ConvertGeoJSON\venv\lib\site-packages\shapefile.py", line 1039, in shapeRecords
for rec in zip(self.shapes(), self.records())])
File "C:\Users\user\PycharmProjects\ConvertGeoJSON\venv\lib\site-packages\shapefile.py", line 1012, in records
r = self.__record(oid=i)
File "C:\Users\user\PycharmProjects\ConvertGeoJSON\venv\lib\site-packages\shapefile.py", line 987, in __record
value = u(value, self.encoding, self.encodingErrors)
File "C:\Users\user\PycharmProjects\ConvertGeoJSON\venv\lib\site-packages\shapefile.py", line 104, in u
return v.decode(encoding, encodingErrors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 1: invalid start byte
I thought it was because the shapefile contains Korean characters, but I could convert a file with Arabic characters like 'أفغانستان', and of course I could convert files with English text.
I'm lost and don't know where to start.
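The traceback shows pyshp decoding the DBF attribute values as UTF-8, so one possible fix is to tell the Reader which encoding the attribute table actually uses. A minimal sketch, assuming the Korean shapefile's DBF is encoded in cp949 (EUC-KR) - check the .cpg sidecar file or the data provider to confirm:
import shapefile
# pyshp lets you override the default utf-8 decoding of DBF records
sf = shapefile.Reader("shapefiles/WTL_VALV_PS", encoding="cp949")
# or, if the exact codec is unknown, replace undecodable bytes instead of failing:
# sf = shapefile.Reader("shapefiles/WTL_VALV_PS", encodingErrors="replace")
The rest of the conversion loop can stay exactly as written above.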
I am trying to figure out how to make the following function work. What I'm trying to achieve with this function is to create a CSV file from a DataFrame, encode it, and then decode it for download:
def filedownload(df):
    csv = df.to_csv(index=False, encoding='utf-8')
    # strings <-> bytes conversion
    encoded = base64.b64encode(csv)
    decoded = base64.b64decode(encoded)
    href = f'Download Predictions'
    return href
However, when running the entire program, I get the following error:
File "/app/.heroku/python/lib/python3.9/site-packages/streamlit/script_runner.py", line 354, in _run_script
exec(code, module.__dict__)
File "/app/bioactivity_app.py", line 97, in <module>
build_model(desc_subset)
File "/app/bioactivity_app.py", line 35, in build_model
load_model = pickle.load(open('sars_cov_proteinase_model.pkl'))
File "/app/.heroku/python/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
This is an example of what the input CSV file looks like: https://ufile.io/sbl163ty
This is part of the code I'm using to generate this file:
prediction_output = pd.Series(Y_pred, name = 'pIC50')
molecule_name = pd.Series(load_data['molecular_id'], name = 'Molecule Name')
df2 = pd.concat([molecule_name, prediction_output], axis=1)
csv = df2.to_csv('example_generated.csv')
I believe it has something to do with how the file is getting encoded but am not sure. Any help would be appreciated!
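Note that the traceback points at pickle.load(open('sars_cov_proteinase_model.pkl')) rather than at the CSV code: opening a pickle in text mode makes Python try to decode its binary contents as UTF-8, which matches the error shown. A minimal sketch of the likely fixes; the anchor markup and the predictions.csv name below are assumptions (the href appears stripped in the post), not taken from it:
import base64
import pickle

# pickles are binary; open with 'rb' so no text decoding is attempted
with open('sars_cov_proteinase_model.pkl', 'rb') as fh:
    load_model = pickle.load(fh)

def filedownload(df):
    csv = df.to_csv(index=False)
    # b64encode expects bytes, so encode the CSV text first
    b64 = base64.b64encode(csv.encode('utf-8')).decode()
    return f'<a href="data:file/csv;base64,{b64}" download="predictions.csv">Download Predictions</a>'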
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
fn='test.pdf'
with open(fn, mode='rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    item = {}
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        item[name] = value
Hello, I need help with this code, as it is giving me a Unicode error on some characters:
Traceback (most recent call last):
File "<stdin>", line 7, in <module>
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdftypes.py", line 80, in resolve1
x = x.resolve(default=default)
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdftypes.py", line 67, in resolve
return self.doc.getobj(self.objid)
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 673, in getobj
stream = stream_value(self.getobj(strmid))
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 676, in getobj
obj = self._getobj_parse(index, objid)
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 648, in _getobj_parse
raise PDFSyntaxError('objid mismatch: %r=%r' % (objid1, objid))
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 85, in __repr__
return self.name.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
Is there anything I can add so it ignores the characters it's not able to decode, or at least returns the name with a blank value in name, value = field.get('T'), field.get('V')?
Any help is appreciated.
Here is one way you can fix it
nano "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/psparser.py"
then in line 85
def __repr__(self):
    return self.name.decode('ascii', 'ignore') # this fixes it
I don't believe it's recommended to edit an installed package's source, though; you should also open an issue on GitHub.
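If you'd rather not patch the installed package, a workaround is to catch the decode error per field and skip (or blank out) the offending entry. This is only a sketch; it assumes the error surfaces at the resolve1(i) call inside the loop, as the traceback above suggests:
item = {}
for i in fields:
    try:
        field = resolve1(i)
    except UnicodeDecodeError:
        # this field's object can't be decoded; skip it (or store a blank entry instead)
        continue
    name, value = field.get('T'), field.get('V')
    item[name] = value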
I am trying to extract a city name from a text, but it is giving an error.
Here is my code:
import geograpy
text = 'I am from Delhi'
places = geograpy.get_place_context(text=text)
print(places.cities)
ERROR:
Traceback (most recent call last):
File "C:/Users/M.B.C. Kadawatha/PycharmProjects/NewsFeed/NLP.py", line 17, in <module>
places = geograpy.get_place_context(text=text)
File "C:\Users\M.B.C. Kadawatha\PycharmProjects\NewsFeed\venv\lib\site-packages\geograpy\__init__.py", line 11, in get_place_context
pc.set_cities()
File "C:\Users\M.B.C. Kadawatha\PycharmProjects\NewsFeed\venv\lib\site-packages\geograpy\places.py", line 137, in set_cities
self.populate_db()
File "C:\Users\M.B.C. Kadawatha\PycharmProjects\NewsFeed\venv\lib\site-packages\geograpy\places.py", line 30, in populate_db
for row in reader:
File "C:\Users\M.B.C. Kadawatha\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 274: character maps to <undefined>
This appears to be a bug in the geograpy package.
In geograpy/places.py, revise:
with open(cur_dir+"/data/GeoLite2-City-Locations.csv", "rb") as info:
to
with open(cur_dir+"/data/GeoLite2-City-Locations.csv", "r", encoding="utf-8") as info:
and the encoding problem should go away.
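For context, the traceback shows the CSV being decoded with the Windows default codec (cp1252), in which byte 0x8d is unassigned, while the bundled GeoLite2 data is UTF-8. A minimal, self-contained illustration of that mismatch (unrelated to geograpy itself):
# 0x8d is a valid continuation byte in UTF-8 but is undefined in cp1252
raw = u'\u020d'.encode('utf-8')   # b'\xc8\x8d'
print(raw.decode('utf-8'))        # works
print(raw.decode('cp1252'))       # UnicodeDecodeError: character maps to <undefined>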
I have pulled JSON files from an API using the command line. I am hit with decode errors when attempting to load the file using the JSON module. As far as I can tell the files are definitely JSON.
When I copy-paste the JSON into a new text file and save it as a JSON file, I can parse it with no problem. I would like to avoid manually copying, pasting, and resaving all these files every time I need to run my script.
Sample of the JSON:
{
"ReservedDBInstancesOfferings": [
{
"MultiAZ": true,
"OfferingType": "Partial Upfront",
"FixedPrice": 280.0,
"UsagePrice": 0.0,
"ReservedDBInstancesOfferingId": "001b899a-be28-489b-9a71-4ff7406d2107",
"RecurringCharges": [
{
"RecurringChargeAmount": 0.032,
"RecurringChargeFrequency": "Hourly"
}
],
"ProductDescription": "sqlserver-se(byol)",
"Duration": 31536000,
"DBInstanceClass": "db.t2.small",
"CurrencyCode": "USD"
},
How I am opening it:
with open('C:\\Users\\xxx\\PycharmProjects\\Pricing_File\\CLI Files\\'+inputFile+'-cli.json', 'r') as f:
    rawData = json.load(f)
Error:
Traceback (most recent call last):
File "C:/Users/xxx/PycharmProjects/Pricing_File/RDS_CLI_JSON_parse_script.py", line 4, in <module>
data = json.load(f)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36-32\lib\json\__init__.py", line 299, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36-32\lib\json\__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36-32\lib\json\decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36-32\lib\json\decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
As a reminder, if I literally copy-paste the JSON into a new file, it works.
I think I've got it. Most likely, the original file is saved in "Unicode" - Windows' name for UTF-16, also referred to as [Wikipedia]: Wide character (wchar_t - check [MSDN]: Working with Strings for more details) in the Windows world.
To demonstrate, I created 3 files containing the JSON below, each saved in a different way:
{
"dummy": 0
}
ux.jsn - Ux style eoln (\n or 0x0A)
win.jsn - Win style eoln (\r\n or 0x0D 0x0A)
uc.jsn - The Wide character format that I talked about (the eolns are Win style, but that's irrelevant)
Then, some code that, for each of the files:
Opens it in binary (or raw) mode, gets its content and displays it (using [Python]: repr(object)) on screen
Opens it in normal mode and tries to load its json content (and display it on screen as well)
code.py:
import json

FILE_NAMES = [
    "ux.jsn",
    "win.jsn",
    "uc.jsn",
]

def main():
    for file_name in FILE_NAMES:
        print("\n{}:".format(file_name))
        with open(file_name, "rb") as f:
            raw_content = f.read()
            print(" Size: {} - [{}]".format(len(raw_content), repr(raw_content)))
        with open(file_name, "r") as f:
            print(" {}".format(json.load(f)))

if __name__ == "__main__":
    main()
Output (running Python3.6.2x86 on W10x64):
E:\Work\Dev\StackOverflow\q48194775>"c:\Install\x86\Python\Python\3.6\python.exe" code.py
ux.jsn:
Size: 19 - [b'{\n "dummy": 0\n}\n']
{'dummy': 0}
win.jsn:
Size: 22 - [b'{\r\n "dummy": 0\r\n}\r\n']
{'dummy': 0}
uc.jsn:
Size: 44 - [b'{\x00\r\x00\n\x00 \x00 \x00 \x00 \x00"\x00d\x00u\x00m\x00m\x00y\x00"\x00:\x00 \x000\x00\r\x00\n\x00}\x00\r\x00\n\x00\r\x00\n\x00']
Traceback (most recent call last):
File "code.py", line 23, in <module>
main()
File "code.py", line 19, in main
print(" {}".format(json.load(f)))
File "c:\Install\x86\Python\Python\3.6\lib\json\__init__.py", line 299, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "c:\Install\x86\Python\Python\3.6\lib\json\__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "c:\Install\x86\Python\Python\3.6\lib\json\decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "c:\Install\x86\Python\Python\3.6\lib\json\decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
Conclusions:
Although the files apparently have the same content, that's not the case. The faulty file is twice the size because each character is represented on 2 bytes (and since they are all regular ASCII characters, the high-order byte of each pair is \0 or 0x00, as visible in the dump above)
For some reason, json does not like that and throws an error when trying to parse the content (although the errors are a little bit different)
When you copy/paste the file, Win automatically converts it to a normal format and therefore json is able to parse it afterwards
As a fix, I converted it using a dummy method (note that if the JSON content contains legitimate \0s, they will be removed as well), although I'm sure there are tools that do this properly:
E:\Work\Dev\StackOverflow\q48194775>"c:\Install\x86\Python\Python\3.6\python.exe"
Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:14:34) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("uc.jsn", "rb") as f:
... raw_content = f.read()
...
>>> with open("uc0.jsn", "wb") as f:
... f.write(raw_content.replace(b"\0", b""))
...
22
>>> import json
>>> with open("uc0.jsn", "r") as f:
... print(" {}".format(json.load(f)))
...
{'dummy': 0}
>>> with open("uc.jsn", "r") as f:
... print(" {}".format(json.load(f)))
...
Traceback (most recent call last):
# The rest of the traceback
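As hinted above, a cleaner approach than stripping the \0 bytes is to tell Python the real encoding when opening the file. A sketch, assuming the file is UTF-16-LE without a BOM like the dump above (use plain "utf-16" if it does carry a BOM):
import json

with open("uc.jsn", "r", encoding="utf-16-le") as f:
    data = json.load(f)   # the wide characters are decoded transparently
print(data)               # {'dummy': 0}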
I'm trying to read a YAML file and convert it into a dictionary, and I'm seeing an issue while loading the file into the dict variable.
I tried to search for similar issues. One of the replies on Stack Overflow was to replace each occurrence of '\\xa0' with ' ', so I tried line = line.replace('\\xa0',' '). This doesn't work on Python 2.7; when I tried Python 3, it works fine.
import yaml
import sys
yaml_dir = "/root/tools/test_case/"
#file_name = "TC_CFD_SR.yml"
file_name = "TC_QB.yml"
tc_file_name = yaml_dir + file_name
def write(file, content):
    file = open(file, 'a')
    file.write(content)
    file.close()

def verifyYmlFile(yml_file):
    data = {}
    with open(yml_file, 'r') as fin:
        for line in fin:
            line = line.replace('\\xa0', ' ')
            write('anand-yaml.yml', line)
    with open('anand-yaml.yml', 'r') as fin:
        data = yaml.load(fin)
    return data

if __name__ == '__main__':
    data = {}
    print "verifying yaml"
    data = verifyYmlFile(tc_file_name)
Error:
[root@anand-harness test_case]# python verify_yaml.py
verifying yaml
Traceback (most recent call last):
File "verify_yaml.py", line 29, in <module>
data= verifyYmlFile(tc_file_name)
File "verify_yaml.py", line 23, in verifyYmlFile
data = yaml.load(fin)
File "/usr/lib64/python2.6/site-packages/yaml/__init__.py", line 71, in load
return loader.get_single_data()
File "/usr/lib64/python2.6/site-packages/yaml/constructor.py", line 37, in get_single_data
node = self.get_single_node()
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 36, in get_single_node
document = self.compose_document()
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 82, in compose_node
node = self.compose_sequence_node(anchor)
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 111, in compose_sequence_node
node.value.append(self.compose_node(node, index))
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 133, in compose_mapping_node
item_value = self.compose_node(node, item_key)
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 64, in compose_node
if self.check_event(AliasEvent):
File "/usr/lib64/python2.6/site-packages/yaml/parser.py", line 98, in check_event
self.current_event = self.state()
File "/usr/lib64/python2.6/site-packages/yaml/parser.py", line 449, in parse_block_mapping_value
if not self.check_token(KeyToken, ValueToken, BlockEndToken):
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 116, in check_token
self.fetch_more_tokens()
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 244, in fetch_more_tokens
return self.fetch_single()
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 653, in fetch_single
self.fetch_flow_scalar(style='\'')
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 667, in fetch_flow_scalar
self.tokens.append(self.scan_flow_scalar(style))
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 1156, in scan_flow_scalar
chunks.extend(self.scan_flow_scalar_non_spaces(double, start_mark))
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 1196, in scan_flow_scalar_non_spaces
while self.peek(length) not in u'\'\"\\\0 \t\r\n\x85\u2028\u2029':
File "/usr/lib64/python2.6/site-packages/yaml/reader.py", line 91, in peek
self.update(index+1)
File "/usr/lib64/python2.6/site-packages/yaml/reader.py", line 165, in update
exc.encoding, exc.reason)
yaml.reader.ReaderError: 'utf8' codec can't decode byte #xa0: invalid start byte
in "anand-yaml.yml", position 3246
What am I missing?
The character sequence "\\xa0" (a literal backslash followed by x, a, 0) is not the problem that you see in the message; the problem is the sequence "\xa0" (note that the backslash is not escaped).
Your replacement line should be:
line = line.replace('\xa0',' ')
to circumvent the problem.
If you know what the format is, you can do the correct conversion yourself, but that should not be necessary, and neither that nor the above patching is a structural solution. It would be best if the YAML file were generated in a correct way (YAML defaults to UTF-8, so the file should contain valid UTF-8). It could also be that the file is UTF-16 without the appropriate BOM (which the yaml library would otherwise interpret, IIRC). The following Python 2 snippet shows the difference between the escaped and unescaped sequences:
s1 = 'abc\\xa0xyz'
print(repr(s1))
u1 = s1.decode('utf-8') # this works fine: '\\xa0' is just four ASCII characters
s = 'abc\xa0xyz'
print(repr(s))
u = s.decode('utf-8') # this throws an error: the single byte 0xa0 is not valid UTF-8
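If regenerating the file correctly isn't an option, a more structural workaround than patching line by line is to decode the file yourself with its real encoding and hand the resulting text to the YAML loader. A sketch, assuming the data is actually cp1252/latin-1 (where 0xa0 is a non-breaking space); it runs on both Python 2 and 3:
import io
import yaml

# decode with the file's real encoding instead of letting PyYAML assume UTF-8;
# cp1252 is an assumption here -- confirm it against how the file was produced
with io.open("/root/tools/test_case/TC_QB.yml", "r", encoding="cp1252") as fin:
    data = yaml.safe_load(fin.read())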