I have a CSV file with roughly 200+ columns and 1 million+ rows. I am converting it from CSV to Parquet in Python with the following code:
import argparse

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = 'bigcut.csv'
chunksize = 100_000
parquet_file = 'output.parquet'

parser = argparse.ArgumentParser(description='Process Arguments')
parser.add_argument("--fname", action="store", default="", help="specify <run/update>")
args = parser.parse_args()
csv_file = args.fname  # overrides the default above

csv_stream = pd.read_csv(csv_file, encoding='utf-8', sep=',', chunksize=chunksize, low_memory=False)

parquet_writer = None
for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Infer the Parquet schema from the first chunk and open the writer once
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
When I run it, it produces the following error:
File "pyconv.py", line 25, in <module>
table = pa.Table.from_pandas(chunk, schema=parquet_schema)
File "pyarrow/table.pxi", line 1217, in pyarrow.lib.Table.from_pandas
File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 387, in dataframe_to_arrays
convert_types))
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 376, in convert_column
raise e
File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 370, in convert_column
return pa.array(col, type=ty, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 169, in pyarrow.lib.array
File "pyarrow/array.pxi", line 69, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ("'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)", 'Conversion failed for column agent_number__c with type float64')
I am new to pandas/pyarrow/Python; any recommendation on what to do next to debug this would be appreciated.
'utf-32-le' codec can't decode bytes in position 0-3
It looks like a library is trying to decode your data as UTF-32-LE, whereas you read the CSV data as UTF-8.
So you'll somehow have to tell that reader (pyarrow.lib) to read it as UTF-8 (I don't know Python/Parquet, so I can't provide the exact code to do this).
The CSV has around 3 million records. I managed to catch one potential problem: one of the columns is string/text. Most of its values are numeric, for example 1000, 230, 400, etc., but a few were entered as text like 5k, 100k, 29k.
So it seems the code did not like that, having tried to treat the column as a number/int throughout.
Can you advise?
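For reference, a minimal sketch of one way to work around such a mixed column: force pandas to read it as a string in every chunk, so the schema inferred from the first chunk (string rather than float64) stays valid for later chunks. The column name agent_number__c is taken from the error message above; whether keeping it as text is acceptable depends on how the data is used downstream.

import pandas as pd

# Read the problematic column as text in every chunk, so the Parquet schema
# inferred from chunk 0 matches the chunks that contain values like "5k".
csv_stream = pd.read_csv(csv_file, encoding='utf-8', sep=',',
                         chunksize=chunksize, low_memory=False,
                         dtype={'agent_number__c': str})

Values like 5k or 100k would still need to be converted explicitly if the column is meant to be numeric.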
I am writing a program that reads all of my S3 buckets. As I have a lot of them, I want to run it on an EC2 instance. The program works fine in PyCharm, but as soon as I try to run it on my Ubuntu instance I get this error:
File "/home/ubuntu/DataRecap/main.py", line 72, in <module>
create_table()
File "/home/ubuntu/DataRecap/main.py", line 43, in create_table
small_column = get_column()
File "/home/ubuntu/DataRecap/main.py", line 32, in get_column
df = pd.read_parquet(buffer)
File "/home/ubuntu/.local/lib/python3.10/site-packages/pandas/io/parquet.py", line 493, in read_parquet
return impl.read(
File "/home/ubuntu/.local/lib/python3.10/site-packages/pandas/io/parquet.py", line 347, in read
result = parquet_file.to_pandas(columns=columns, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/fastparquet/api.py", line 751, in to_pandas
self.read_row_group_file(rg, columns, categories, index,
File "/home/ubuntu/.local/lib/python3.10/site-packages/fastparquet/api.py", line 361, in read_row_group_file
core.read_row_group(
File "/home/ubuntu/.local/lib/python3.10/site-packages/fastparquet/core.py", line 608, in read_row_group
read_row_group_arrays(file, rg, columns, categories, schema_helper,
File "/home/ubuntu/.local/lib/python3.10/site-packages/fastparquet/core.py", line 580, in read_row_group_arrays
read_col(column, schema_helper, file, use_cat=name+'-catdef' in out,
File "/home/ubuntu/.local/lib/python3.10/site-packages/fastparquet/core.py", line 466, in read_col
dic2 = convert(dic2, se)
File "/home/ubuntu/.local/lib/python3.10/site-packages/fastparquet/converted_types.py", line 249, in convert
parquet_thrift.ConvertedType._VALUES_TO_NAMES[ctype]) # pylint:disable=protected-access
KeyError: 24
I have no idea why it does not work. Here is my code:
import io
import pandas as pd

# s3 (a boto3 S3 resource), bucket and parquet_name are defined earlier in the script
buffer = io.BytesIO()
object = s3.Object(bucket, parquet_name)
object.download_fileobj(buffer)
df1 = pd.read_parquet(buffer)
Any idea?
Thank you very much in advance.
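For reference, the KeyError: 24 in the traceback above is raised inside fastparquet's converted-type handling, so one thing worth trying (assuming pyarrow is installed; this is a guess, not a confirmed fix) is forcing pandas to use the pyarrow engine instead of fastparquet:

import pandas as pd

# Ask pandas to use the pyarrow engine rather than falling back to fastparquet
df1 = pd.read_parquet(buffer, engine='pyarrow')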
I am getting a JSON response from the server and storing it in a text file.
Later I parse certain data from the text file, but while doing so I get the error shown below. The JSON format seems okay to me. Can you please help me here? Thanks. Also, I do not have any control over the response I receive from the server; for now I am just using json.dump to write it to a text file.
Code for json dump and load
balance = exchange.fetch_balance()
#print(balance)

with open('balance.txt', 'w') as json_file:
    json.dump(balance, json_file)
print("Balance file created")

with open('balance.txt') as json_file:
    data = json.load(json_file)

freeusdt = float(data['info']['balances'][11]['free'])
Qty = float(df['close'][last_row_index])
Qty = Qty + 300
buyqty = freeusdt / Qty
My JSON file is here (as a .txt file, since Stack Overflow won't let me post it in full because it exceeds the character limit):
https://drive.google.com/file/d/1kpBMl6JbzlH9OQvvteLl0Ieb2PaBU_kr/view?usp=sharing
The full error is
Traceback (most recent call last):
File "C:\Users\pavan.alur\Desktop\SUPERTRENDWITHKAMA\supertrendwithkama.py", line 232, in <module>
schedule.run_pending()
File "C:\Users\pavan.alur\Miniconda3\lib\site-packages\schedule\__init__.py", line 780, in run_pending
default_scheduler.run_pending()
File "C:\Users\pavan.alur\Miniconda3\lib\site-packages\schedule\__init__.py", line 100, in run_pending
self._run_job(job)
File "C:\Users\pavan.alur\Miniconda3\lib\site-packages\schedule\__init__.py", line 172, in _run_job
ret = job.run()
File "C:\Users\pavan.alur\Miniconda3\lib\site-packages\schedule\__init__.py", line 661, in run
ret = self.job_func()
File "C:\Users\pavan.alur\Desktop\SUPERTRENDWITHKAMA\supertrendwithkama.py", line 223, in run_bot
check_buy_sell_signals(supertrend_data)
File "C:\Users\pavan.alur\Desktop\SUPERTRENDWITHKAMA\supertrendwithkama.py", line 121, in check_buy_sell_signals
data1 = json.load(json_file1)
File "C:\Users\pavan.alur\Miniconda3\lib\json\__init__.py", line 293, in load
return loads(fp.read(),
File "C:\Users\pavan.alur\Miniconda3\lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "C:\Users\pavan.alur\Miniconda3\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\pavan.alur\Miniconda3\lib\json\decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ':' delimiter: line 1 column 57358 (char 57357)
Fixing the indentation problem mentioned by @MattDMo solved it:
balance = exchange.fetch_balance()

with open('balance.txt', 'w') as json_file:
    json.dump(balance, json_file)
print("Balance file created")

with open('balance.txt') as json_file:
    data = json.load(json_file)

freeusdt = float(data['info']['balances'][11]['free'])
Qty = float(df['close'][last_row_index])
Qty = Qty + 300
buyqty = freeusdt / Qty
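As a side note (purely illustrative, not part of the original fix): since the traceback comes from a scheduled job reading a file that the same script writes, a hypothetical guard around the load makes a partially written file easier to diagnose:

import json

try:
    with open('balance.txt') as json_file:
        data = json.load(json_file)
except json.JSONDecodeError as exc:
    # The file was empty or only partially written when it was read
    print("balance.txt is not valid JSON yet:", exc)
    data = None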
I'm trying to read a YAML file and convert it into a dictionary. I'm seeing an issue while loading the file into the dict variable.
I searched for similar issues. One of the replies on Stack Overflow was to replace each character '\\xa0' with ' ', so I tried that with line = line.replace('\\xa0',' '). The program doesn't work on Python 2.7; when I tried Python 3 it works fine.
import yaml
import sys

yaml_dir = "/root/tools/test_case/"
#file_name = "TC_CFD_SR.yml"
file_name = "TC_QB.yml"
tc_file_name = yaml_dir + file_name

def write(file, content):
    file = open(file, 'a')
    file.write(content)
    file.close()

def verifyYmlFile(yml_file):
    data = {}
    with open(yml_file, 'r') as fin:
        for line in fin:
            line = line.replace('\\xa0', ' ')
            write('anand-yaml.yml', line)
    with open('anand-yaml.yml', 'r') as fin:
        data = yaml.load(fin)
    return data

if __name__ == '__main__':
    data = {}
    print "verifying yaml"
    data = verifyYmlFile(tc_file_name)
Error:
[root@anand-harness test_case]# python verify_yaml.py
verifying yaml
Traceback (most recent call last):
File "verify_yaml.py", line 29, in <module>
data= verifyYmlFile(tc_file_name)
File "verify_yaml.py", line 23, in verifyYmlFile
data = yaml.load(fin)
File "/usr/lib64/python2.6/site-packages/yaml/__init__.py", line 71, in load
return loader.get_single_data()
File "/usr/lib64/python2.6/site-packages/yaml/constructor.py", line 37, in get_single_data
node = self.get_single_node()
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 36, in get_single_node
document = self.compose_document()
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 82, in compose_node
node = self.compose_sequence_node(anchor)
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 111, in compose_sequence_node
node.value.append(self.compose_node(node, index))
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 133, in compose_mapping_node
item_value = self.compose_node(node, item_key)
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 64, in compose_node
if self.check_event(AliasEvent):
File "/usr/lib64/python2.6/site-packages/yaml/parser.py", line 98, in check_event
self.current_event = self.state()
File "/usr/lib64/python2.6/site-packages/yaml/parser.py", line 449, in parse_block_mapping_value
if not self.check_token(KeyToken, ValueToken, BlockEndToken):
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 116, in check_token
self.fetch_more_tokens()
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 244, in fetch_more_tokens
return self.fetch_single()
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 653, in fetch_single
self.fetch_flow_scalar(style='\'')
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 667, in fetch_flow_scalar
self.tokens.append(self.scan_flow_scalar(style))
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 1156, in scan_flow_scalar
chunks.extend(self.scan_flow_scalar_non_spaces(double, start_mark))
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 1196, in scan_flow_scalar_non_spaces
while self.peek(length) not in u'\'\"\\\0 \t\r\n\x85\u2028\u2029':
File "/usr/lib64/python2.6/site-packages/yaml/reader.py", line 91, in peek
self.update(index+1)
File "/usr/lib64/python2.6/site-packages/yaml/reader.py", line 165, in update
exc.encoding, exc.reason)
yaml.reader.ReaderError: 'utf8' codec can't decode byte #xa0: invalid start byte
in "anand-yaml.yml", position 3246
What am I missing?
The character sequence "\\xa0" (a backslash followed by x, a, 0) is not the problem you see in the message; the problem is the sequence "\xa0" (note that the backslash is not escaped).
Your replacement line should be:
line = line.replace('\xa0', ' ')
to circumvent the problem.
If you know what the format is, you can do the correct conversion yourself, but that should not be necessary, and neither that nor the above patching is a structural solution. It would be best if the YAML file were generated in a correct way (YAML files default to UTF-8, so it should contain correct UTF-8). It could be UTF-16 without the appropriate BOM (which the yaml library can interpret, IIRC).
s1 = 'abc\\xa0xyz'
print(repr(s1))
u1 = s1.decode('utf-8') # this works fine
s = 'abc\xa0xyz'
print(repr(s))
u = s.decode('utf-8') # this throws an error
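If the actual encoding of the file is known, a minimal Python 2 sketch of doing the conversion explicitly when reading (latin-1 here is only an assumption; substitute whatever encoding the file really uses):

import io
import yaml

# Decode explicitly instead of letting the reader guess; in latin-1 the byte
# 0xa0 maps to U+00A0 (non-breaking space), so decoding no longer fails.
with io.open('anand-yaml.yml', 'r', encoding='latin-1') as fin:
    data = yaml.safe_load(fin)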
I am creating and filling a PyTables Carray the following way:
# a, b = scipy.sparse.csr_matrix (built earlier in the script)
import tables as tb

l = a.shape[0]
n = b.shape[1]  # assumed: the output of a.dot(b) has b.shape[1] columns

f = tb.open_file('../data/pickle/dot2.h5', 'w')
filters = tb.Filters(complevel=1, complib='blosc')
out = f.create_carray(f.root, 'out', tb.Atom.from_dtype(a.dtype),
                      shape=(l, n), filters=filters)

bl = 2048
for i in range(0, l, bl):
    out[:, i:min(i + bl, l)] = (a.dot(b[:, i:min(i + bl, l)])).toarray()
The script was running fine for nearly two days (I estimated that it would need at least 4 days more).
However, suddenly I received this error stack trace:
File "prepare_data.py", line 168, in _tables_dot
out[:,i:min(i+bl, l)] = (a.dot(b[:,i:min(i+bl, l)])).toarray()
File "/home/psinger/venv/local/lib/python2.7/site-packages/tables/array.py", line 719, in __setitem__
self._write_slice(startl, stopl, stepl, shape, nparr)
File "/home/psinger/venv/local/lib/python2.7/site-packages/tables/array.py", line 809, in _write_slice
self._g_write_slice(startl, stepl, countl, nparr)
File "hdf5extension.pyx", line 1678, in tables.hdf5extension.Array._g_write_slice (tables/hdf5extension.c:16287)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "../../../src/H5Dio.c", line 266, in H5Dwrite
can't write data
File "../../../src/H5Dio.c", line 671, in H5D_write
can't write data
File "../../../src/H5Dchunk.c", line 1840, in H5D_chunk_write
error looking up chunk address
File "../../../src/H5Dchunk.c", line 2299, in H5D_chunk_lookup
can't query chunk address
File "../../../src/H5Dbtree.c", line 998, in H5D_btree_idx_get_addr
can't get chunk info
File "../../../src/H5B.c", line 362, in H5B_find
can't lookup key in subtree
File "../../../src/H5B.c", line 340, in H5B_find
unable to load B-tree node
File "../../../src/H5AC.c", line 1322, in H5AC_protect
H5C_protect() failed.
File "../../../src/H5C.c", line 3567, in H5C_protect
can't load entry
File "../../../src/H5C.c", line 7957, in H5C_load_entry
unable to load entry
File "../../../src/H5Bcache.c", line 143, in H5B_load
wrong B-tree signature
End of HDF5 error back trace
Internal error modifying the elements (H5ARRAYwrite_records returned errorcode -6)
I am really clueless as to what the problem is, as it was running fine for about a quarter of the dataset. Disk space is available.
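For reference, a minimal sketch of flushing the HDF5 file periodically during the long write (f.flush() is the standard PyTables call; whether it would have prevented this particular corruption is only a guess):

for i in range(0, l, bl):
    out[:, i:min(i + bl, l)] = (a.dot(b[:, i:min(i + bl, l)])).toarray()
    if (i // bl) % 100 == 0:
        f.flush()  # push buffered data and metadata to disk every 100 blocks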
I am creating a huge graph database with over 1.4 million nodes and 160 million relationships. My code looks as follows:
from py2neo import neo4j

# first we create all the nodes
batch = neo4j.WriteBatch(graph_db)
nodedata = []
for index, i in enumerate(words):  # words is predefined
    batch.create({"term": i})
    if index % 5000 == 0:  # so as not to exceed the batch restrictions
        results = batch.submit()
        for x in results:
            nodedata.append(x)
        batch = neo4j.WriteBatch(graph_db)
results = batch.submit()
for x in results:
    nodedata.append(x)

# nodedata contains all the node instances now
# time to create relationships
batch = neo4j.WriteBatch(graph_db)
for iindex, i in enumerate(weightdata):  # weightdata is predefined
    batch.create((nodedata[iindex], "rel", nodedata[-iindex], {"weight": i}))  # there is a different way how I decide the indexes of nodedata, but just as an example I put iindex and -iindex
    if iindex % 5000 == 0:  # again batch constraints
        batch.submit()  # this is the line that shows the error
        batch = neo4j.WriteBatch(graph_db)
batch.submit()
I am getting the following error:
Traceback (most recent call last):
File "test.py", line 53, in <module>
batch.submit()
File "/usr/lib/python2.6/site-packages/py2neo/neo4j.py", line 2116, in submit
for response in self._submit()
File "/usr/lib/python2.6/site-packages/py2neo/neo4j.py", line 2085, in _submit
for id_, request in enumerate(self.requests)
File "/usr/lib/python2.6/site-packages/py2neo/rest.py", line 427, in _send
return self._client().send(request)
File "/usr/lib/python2.6/site-packages/py2neo/rest.py", line 351, in send
rs = self._send_request(request.method, request.uri, request.body, request.$
File "/usr/lib/python2.6/site-packages/py2neo/rest.py", line 326, in _send_re$
data = json.dumps(data, separators=(",", ":"))
File "/usr/lib64/python2.6/json/__init__.py", line 237, in dumps
**kw).encode(obj)
File "/usr/lib64/python2.6/json/encoder.py", line 367, in encode
chunks = list(self.iterencode(o))
File "/usr/lib64/python2.6/json/encoder.py", line 306, in _iterencode
for chunk in self._iterencode_list(o, markers):
File "/usr/lib64/python2.6/json/encoder.py", line 204, in _iterencode_list
for chunk in self._iterencode(value, markers):
File "/usr/lib64/python2.6/json/encoder.py", line 309, in _iterencode
for chunk in self._iterencode_dict(o, markers):
File "/usr/lib64/python2.6/json/encoder.py", line 275, in _iterencode_dict
for chunk in self._iterencode(value, markers):
File "/usr/lib64/python2.6/json/encoder.py", line 317, in _iterencode
for chunk in self._iterencode_default(o, markers):
File "/usr/lib64/python2.6/json/encoder.py", line 323, in _iterencode_default
newobj = self.default(o)
File "/usr/lib64/python2.6/json/encoder.py", line 344, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 3448 is not JSON serializable
Could anybody please suggest what exactly is happening here and how I can overcome it? Any kind of help would be appreciated. Thanks in advance! :)
It's hard to tell without being able to run your code with the same data set, but this is likely caused by the type of the items in weightdata.
Step through your code or print the data type as you go to determine what the type of i is within the {"weight": i} portion of the relationship descriptor. You may find that this is not an int - which would be required for JSON number serialisation. If this theory is correct, you will need to find a way to cast or otherwise convert that property value into an int before using it in a property set.
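For illustration, a minimal sketch of that cast, assuming weightdata comes from a NumPy array (NumPy integer scalars are rejected by the standard json module, which matches the "3448 is not JSON serializable" message):

for iindex, i in enumerate(weightdata):
    weight = int(i)  # convert a NumPy scalar to a plain Python int before JSON encoding
    batch.create((nodedata[iindex], "rel", nodedata[-iindex], {"weight": weight}))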
I've never used py2neo, but looking at the documentation:
This:
batch.create((nodedata[iindex], "rel", nodedata[-iindex], {"weight": i}))
is missing the rel() part:
batch.create(rel(nodedata[iindex], "rel", nodedata[-iindex], {"weight": i}))
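Note that, depending on the py2neo version, rel may also need to be imported explicitly; in the 1.x-era API it is exposed at the package top level, roughly:

from py2neo import neo4j, rel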