OverflowError in Python using DataFrames

I get the error below while trying to fetch rows from an Excel file into a data frame. Some of the columns have very big values like 1405668170987000000, while others are timestamp columns with values like 11:46:00.180630.
I did convert the format of those columns to text. However, I'm still getting the error below for a simple select statement (select * from df limit 5):
OverflowError: Python int too large to convert to SQLite INTEGER

SQLite INTEGERs are 64-bit signed, meaning the maximum value is 9,223,372,036,854,775,807.
It looks like some of your values are larger than that, so they will not fit into the SQLite INTEGER type. You could try converting them to text so SQLite can store and return them.

SQLite integer values have an upper bound of 2**63 - 1. Note that the example value 1405668170987000000 is actually below that bound, so the overflow is presumably triggered by other, larger values in your data.
Try converting the affected columns into strings and then perform the required operation, as in the sketch below.
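A minimal sketch of that workaround, assuming a pandas DataFrame written to SQLite with to_sql (the column name epoch_ns and the in-memory database are just for illustration; the same overflow occurs however the insert happens):
import sqlite3
import pandas as pd

# Hypothetical data: the second value exceeds 2**63 - 1, which would raise
# OverflowError when sqlite3 tries to bind it as an INTEGER.
df = pd.DataFrame({'epoch_ns': [1405668170987000000, 9223372036854775808]})

# Cast the oversized column to str so SQLite stores it as TEXT.
df['epoch_ns'] = df['epoch_ns'].astype(str)

con = sqlite3.connect(':memory:')
df.to_sql('df', con, index=False)  # no overflow: the values are bound as TEXT
print(con.execute('select * from df limit 5').fetchall())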

I have a CSV generated from CICFlowMeter and I'm unable to generate a correlation matrix. It either generates an empty data frame or fails with the error below.

This is the code I'm using. I have also tried converting the datatype of my columns from object to float, but I got this error:
df = pd.read_csv('DDOSping.csv')
pearsoncorr = df.corr(method='pearson')
ValueError: could not convert string to float:
'172.27.224.251-172.27.224.250-56003-502-6'
Somewhere in your CSV the string value '172.27.224.251-172.27.224.250-56003-502-6' exists. Do you know why it's there? What does it represent? It looks like a flow identifier (IPs, ports, and protocol joined by dashes), and it shouldn't be in the data you include in your correlation matrix calculation.
The df.corr method is trying to convert the string value to a float, but that's obviously not possible, because it's a big complicated string with various characters, not a regular number.
You should clean your CSV of unnecessary data (or make a copy and clean that, so you don't lose anything important). Remove anything, like metadata, that isn't the exact data df.corr needs, including the string in the error message.
If it's just a few values you need to clean, then open the file in Excel or a text editor to do the cleaning. If it's a lot, and all the irrelevant data is in specific rows and/or columns, you could instead drop them from your DataFrame before calling df.corr, as in the sketch below, rather than cleaning the file itself.
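A minimal sketch of the in-DataFrame route, assuming the non-numeric columns (such as the flow ID in the error message) should simply be excluded from the correlation:
import pandas as pd

df = pd.read_csv('DDOSping.csv')

# Keep only numeric columns; string columns such as flow IDs, IP addresses
# and timestamps are dropped rather than coerced to float.
numeric_df = df.select_dtypes(include='number')

pearsoncorr = numeric_df.corr(method='pearson')
print(pearsoncorr)
If some genuinely numeric columns were read in as object dtype, convert them first with pd.to_numeric(df[col], errors='coerce') so they survive the select_dtypes filter.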

astype('float') changes data, not just data type

I download a bunch of CSV files from an AWS S3 bucket and put them in a dataframe. Before uploading the dataframe to SQL Server I would like to change the columns of the dataframe to have the right datatypes.
When I run astype('float64') on a column I want to change, it not only changes the datatype but also the data.
Code:
df['testcol'] = df['lineId'].astype('float64')
[PyCharm screenshot of the result]
I attached a picture to visualize the error. As you can see, the data in the third column (testcol) differs from the data in the second column (lineId), even though only the datatype should have changed.
A pl_id can have multiple lineIds; that's why I added and sorted by pl_id in the picture.
Am I using astype() wrong or is this a pandas bug?
Basically, it seems that float64 does not have enough precision to hold that long an integer exactly:
np.float64('211052094743748628')
Out[135]: 2.1105209474374864e+17
"The max precision a float 64 can reach is close to 10-16 (unit in the last place (ULP), see en.wikipedia.org/wiki/Floating-point_arithmetic) so the idea of an exact decimal value with significantly more than 16 digits for a floating point is misleading."
Numpy float64 vs Python float
Consider using int64 instead, which is more suitable for the size of the IDs in your dataset:
np.int64('211052094743748628')
Out[150]: 211052094743748628
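A minimal sketch of that fix, assuming the IDs arrive as strings (e.g. from read_csv) and contain no missing values; if missing values are possible, pandas' nullable 'Int64' dtype (capital I) is the usual alternative, since plain int64 cannot represent NaN:
import pandas as pd

# Hypothetical data: the ID as a string, as it might come out of a CSV.
df = pd.DataFrame({'lineId': ['211052094743748628']})

# int64 holds the value exactly; float64 would round it to 2.1105209474374864e+17.
df['testcol'] = df['lineId'].astype('int64')
print(df['testcol'][0])  # 211052094743748628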

U-SQL + Python: returning a dataframe with an unknown number of columns

If my Python script is pivoting and I cannot predict how many columns will be output, can this be done with the U-SQL REDUCE statement?
e.g.
@pythonOutput =
REDUCE @filteredBets ON [BetDetailID]
PRODUCE [BetDetailID] string, EventID float
USING new Extension.Python.Reducer(pyScript:@myScript);
There could be multiple columns, so I can't hard-code the names in the PRODUCE part.
Any ideas?
If you have a way to produce a SqlMap<string,string> value from within Python (I am not sure whether that is supported right now; you can do it with a C# reducer :)), then you could use the map for the dynamic schema part.
If it is not supported in Python, please file a feature request at http://aka.ms/adlfeedback.
The only way right now is to serialize all the columns into a single column, either as a byte[] or a string, in your Python script, as sketched below. SqlMap/SqlArray are not supported yet as output columns.
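A minimal sketch of that serialization inside the Python reducer (usqlml_main is the entry point the U-SQL Python extension invokes; the pivot columns used here are hypothetical):
import json
import pandas as pd

def usqlml_main(df):
    # Pivot however the job requires; the resulting column set is unknown
    # in advance.
    pivoted = df.pivot_table(index='BetDetailID', columns='EventID',
                             values='Stake', aggfunc='sum')  # hypothetical columns
    # Collapse the unpredictable columns into one JSON string per row, so
    # the PRODUCE clause only ever needs two fixed columns.
    return pd.DataFrame({
        'BetDetailID': pivoted.index.astype(str),
        'Payload': [json.dumps({str(k): float(v) for k, v in row.dropna().items()})
                    for _, row in pivoted.iterrows()],
    })
The matching PRODUCE clause is then fixed: PRODUCE BetDetailID string, Payload string.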

pandas read_sql_query returns negative and incorrect values for Oracle Database number field containing positive values

I'm running pandas read_sql_query with cx_Oracle 6.0b2 to retrieve data from an Oracle database I've inherited into a DataFrame.
A field in many Oracle tables has data type NUMBER(15, 0) with unsigned values. When I retrieve data from this field, the DataFrame reports the dtype as int64, but the values have 9 or fewer digits and are all negative. All the values have changed, so I assume an integer overflow is happening somewhere.
If I convert the database values using to_char in the SQL query and then use pandas to_numeric on the DataFrame, the values are type int64 and correct.
I'm using Python 3.6.1 x64 and pandas 0.20.1. _USE_BOTTLENECK is False.
How can I retrieve the correct values from the tables without using to_char?
Removing pandas and just using cx_Oracle still resulted in an integer overflow, so in the SQL query I'm using:
CAST(field AS NUMBER(19))
At the moment I can only guess that any field between NUMBER(11) and NUMBER(18) will require an explicit CAST to NUMBER(19) to avoid the overflow.
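A minimal sketch of that workaround with read_sql_query (the connection string, table, and column names are hypothetical):
import cx_Oracle
import pandas as pd

conn = cx_Oracle.connect('user/password@host:1521/service')  # hypothetical DSN

# Casting to NUMBER(19) sidesteps the overflow seen when the
# NUMBER(15, 0) column is fetched directly.
query = 'SELECT CAST(account_id AS NUMBER(19)) AS account_id FROM accounts'
df = pd.read_sql_query(query, conn)
print(df.dtypes)  # account_id should now be int64 with the correct values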

Simple SQLite question

When I use:
for i in Selection:
    Q = "SELECT columnA FROM DB WHERE wbcode='" + i + "' and commodity='1'"
    cursor.execute(Q)
    ydata[i] = cursor.fetchall()
I get:
ydata = {'GBR': [(u'695022',), (u'774291',), (u'791499',)... ]}
How can I change my code to get:
ydata = {'GBR': [695022, 774291, 791499,...]}
Thank you very much.
Obs.: this is just a simplified example; please refrain from making recommendations about SQL injection.
ydata[i] = [int(x[0]) for x in cursor.fetchall()]
Based on this and another question of yours, you need to understand SQLite's type affinity and how you are populating the database. Other databases require that the values stored in a column are all of the same type - e.g. all strings or all integers. SQLite allows you to store anything, so the type in each row can be different.
To a first approximation, if you put in a string for a row then you'll get a string out; put in an integer and you'll get an integer out. In your case you are getting strings out because you put strings in instead of integers.
However, you can declare a column affinity and SQLite will try to convert when you insert data. For example, if a column has INTEGER affinity and what you insert can be safely and exactly converted to an integer, SQLite will do so: the string "1" will be stored as the integer 1, while "1 1" will be stored as the string "1 1". The snippet after the link demonstrates this.
Read this page to understand the details; you'll find it a lot easier to get data out if you put it in using the correct types.
http://www.sqlite.org/datatype3.html
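A minimal demonstration of that affinity behaviour with the standard sqlite3 module (in-memory database just for illustration):
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE t (a TEXT, b INTEGER)')

# The same string goes into both columns: the TEXT column keeps it as a
# string, while the INTEGER-affinity column converts it to an integer.
con.execute('INSERT INTO t VALUES (?, ?)', ('695022', '695022'))

print(con.execute('SELECT a, b, typeof(a), typeof(b) FROM t').fetchone())
# ('695022', 695022, 'text', 'integer')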
If you are importing CSV data, then start the APSW shell and use ".help import" to get some suggestions on how to deal with this.
