Why does pandas DataFrame allow setting a column using a too-large Series? - python

Is there a reason why pandas raises a ValueError when setting a DataFrame column using a list, but doesn't do the same when using a Series? The result is that superfluous Series values are silently ignored (e.g. the 7 in the example below).
>>> import pandas as pd
>>> df = pd.DataFrame([[1],[2]])
>>> df
0
0 1
1 2
>>> df[0] = [5,6,7]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 3655, in __setitem__
self._set_item(key, value)
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 3832, in _set_item
value = self._sanitize_column(value)
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 4529, in _sanitize_column
com.require_length_match(value, self.index)
File "D:\Python310\lib\site-packages\pandas\core\common.py", line 557, in require_length_match
raise ValueError(
ValueError: Length of values (3) does not match length of index (2)
>>>
>>> df[0] = pd.Series([5,6,7])
>>> df
0
0 5
1 6
Tested using Python 3.10.6 and pandas 1.5.3 on Windows 10.

You are right that the behaviour is different between a list and a Series, but it's expected.
If you take a look at the source code in the frame.py module, you will see that when the value is a list its length is checked against the index, whereas a Series is not length-checked but aligned (reindexed) to the DataFrame's index, so when the Series is longer the extra values are dropped, as you observed.
NOTE: the truncation happens inside the _sanitize_column step that appears in the traceback above, which aligns the Series to the existing index.
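A small demonstration of the alignment (a sketch built on the question's frame): assignment with a Series matches on index labels rather than position, so extra labels are dropped and labels missing from the Series become NaN.
import pandas as pd
df = pd.DataFrame([[1], [2]])
# The Series is aligned on index labels, not assigned positionally:
# labels 0 and 1 match rows of df; label 2 has no row and is dropped.
df[0] = pd.Series([5, 6, 7])
print(df)
# Conversely, a label missing from the Series yields NaN:
df[0] = pd.Series({1: 60, 5: 99})
print(df)   # row 0 becomes NaN, row 1 becomes 60, the 99 is dropped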

Related

Python3 Pandas - handle overflow when casting to number greater than data type int64

I am writing a standard script where I fetch data from a database, do some manipulation, and insert the data back into another table.
I am facing an overflow issue while converting a column's type in a DataFrame.
Here's an example :
import numpy as np
import pandas as pd
d = {'col1': ['66666666666666666666666666666']}
df = pd.DataFrame(data=d)
df['col1'] = df['col1'].astype('int64')
print(df)
Error :
Traceback (most recent call last):
File "HelloWorld.py", line 6, in <module>
df['col1'] = df['col1'].astype('int64')
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5548, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 604, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 409, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py", line 595, in astype
values = astype_nansafe(vals1d, dtype, copy=True)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/cast.py", line 974, in astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas/_libs/lib.pyx", line 615, in pandas._libs.lib.astype_intsafe
OverflowError: Python int too large to convert to C long
I cannot control the values inside d['col1'] because in the actual code it is generated by another function.
How can I solve this problem if I want to keep the final data type as 'int64'?
I was thinking of catching the exception and then assigning the largest int64 value to the whole column, but then the rows of the column which are not overflowing would also give inconsistent results.
Can you advise me on some elegant solutions here?
Building on your idea, you can use np.iinfo:
ii64 = np.iinfo(np.int64)
df['col1'] = df['col1'].astype('float128').clip(ii64.min, ii64.max).astype('int64')
print(df)
# Output
col1
0 9223372036854775807
Take care with the limits of float128 too :-D
>>> np.finfo(np.float128)
finfo(resolution=1e-18, min=-1.189731495357231765e+4932, max=1.189731495357231765e+4932, dtype=float128)
>>> np.iinfo('int64')
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
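If float128 is not available on your platform (e.g. most Windows builds of NumPy), a sketch that clips with Python's arbitrary-precision integers instead, avoiding floating point entirely:
import numpy as np
import pandas as pd
ii64 = np.iinfo(np.int64)
d = {'col1': ['66666666666666666666666666666', '42']}
df = pd.DataFrame(data=d)
# Clip each value as a Python int before the cast, so oversized values
# saturate at the int64 bounds instead of raising OverflowError.
df['col1'] = df['col1'].map(lambda s: min(max(int(s), ii64.min), ii64.max)).astype('int64')
print(df)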

Getting an error while passing the same number of column names and column data

a=["ExpNCCIFactor","Requestid","EffDate","TransresposnseDate","QuoteEffDate","ApplicationID","PortUrl","UQuestion","DescriptionofOperations","Error"]
d = [ExpNCCIFactor,Requestid,EffDate,TransresposnseDate,QuoteEffDate,ApplicationID,PortUrl,UQuestion,DescriptionofOperations,Error]
df2 = pd.DataFrame(data = d , columns = a)
Got this error:
Traceback (most recent call last):
File "C:\Users\praka\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\internals\construction.py", line 982, in _finalize_columns_and_data
columns = _validate_or_indexify_columns(contents, columns)
File "C:\Users\praka\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\internals\construction.py", line 1030, in _validate_or_indexify_columns
raise AssertionError(
AssertionError: 10 columns passed, passed data had 11 columns
You are creating a DataFrame from a list.
In your case, the data argument should be a list of lists:
a=["ExpNCCIFactor","Requestid","EffDate","TransresposnseDate","QuoteEffDate","ApplicationID","PortUrl","UQuestion","DescriptionofOperations","Error"]
d = [[ExpNCCIFactor,Requestid,EffDate,TransresposnseDate,QuoteEffDate,ApplicationID,PortUrl,UQuestion,DescriptionofOperations,Error]]
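A minimal runnable version of that fix (with hypothetical placeholder values, since the original variables are not defined in the question):
import pandas as pd
a = ["ExpNCCIFactor", "Requestid", "EffDate"]
# Wrapping the values in an outer list makes pandas treat them as one row
# with len(a) columns, rather than as len(d) single-column rows.
d = [[1.05, "R-001", "2023-01-01"]]
df2 = pd.DataFrame(data=d, columns=a)
print(df2)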
It seems like one of the items in your variable d has a different shape, probably a nested sequence. Try this code to iterate over them and print their shapes:
import numpy as np
c = 0
for i in d:
    print('Shape of item {} is {}'.format(c, np.shape(i)))
    c += 1

python - Getting an error while taking the difference between two dates in columns

This is my code; I am trying to get the number of business days between two dates, saved in a new column 'nd'.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(pd.date_range('2020-01-01', periods=26, freq='D'), columns=['A'])
df2 = pd.DataFrame(pd.date_range('2020-02-01', periods=26, freq='D'), columns=['B'])
df = pd.concat([df1, df2], axis=1)
# Iterate over each row of the DataFrame
for index, row in df.iterrows():
    bc = np.busday_count(row['A'], row['B'])
    df['nd'] = bc
I am getting this error.
Traceback (most recent call last):
File "<input>", line 35, in <module>
File "<__array_function__ internals>", line 5, in busday_count
TypeError: Iterator operand 0 dtype could not be cast from dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'
Is there a way to fix it or another way to get the solution?
busday_count only accepts dates, not datetimes. Change
bc = np.busday_count(row['A'], row['B'])
to
bc = np.busday_count(row['A'].date(), row['B'].date())
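Note also that df['nd'] = bc inside the loop overwrites the whole column on every iteration, so only the last count survives. A vectorized sketch that sidesteps both problems by casting the full columns to datetime64[D]:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(pd.date_range('2020-01-01', periods=26, freq='D'), columns=['A'])
df2 = pd.DataFrame(pd.date_range('2020-02-01', periods=26, freq='D'), columns=['B'])
df = pd.concat([df1, df2], axis=1)
# datetime64[D] satisfies busday_count's 'safe' casting rule, and passing
# whole arrays avoids the Python-level loop entirely.
df['nd'] = np.busday_count(df['A'].values.astype('datetime64[D]'),
                           df['B'].values.astype('datetime64[D]'))
print(df.head())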

Pandas IndexingError

I'm following a tutorial about bitcoin and pandas where I'm receiving data from a websocket and storing it in a DataFrame. Everything is working fine, but my script randomly throws an error:
Traceback (most recent call last):
File "/home/user/Desktop/BTC/price.py", line 89, in <module>
df = df.loc[df.date >= start_time]
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 879, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 1090, in _getitem_axis
return self._getbool_axis(key, axis=axis)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 896, in _getbool_axis
key = check_bool_indexer(labels, key)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 2183, in check_bool_indexer
"Unalignable boolean Series provided as "
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
This is how my code snippet looks:
df = price['BTCGBP']
start_time = df.date.iloc[-1] - pd.Timedelta(minutes=5)
df = df.loc[df.date >= start_time]
max_price = df.price.max()
I think this is related to the websocket data because it is totally random.
I have changed from 5 minutes to 1 minute, and the result of this comparison is:
print(df.loc[df.date >= start_time])
date price
0 2021-01-19 18:50:51.724977 27078.59
until
15 2021-01-19 18:51:51.723815 27113.82
df.date >= start_time
This comparison returns a boolean Series, indexed like the DataFrame it was built from, and df.loc[] does accept such a mask. But the mask's index must align with the index of the DataFrame being filtered: the "Unalignable boolean Series" error means the two indexes no longer match, which can happen when df gains rows (e.g. from the websocket handler) between building the mask and applying it.
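A minimal sketch reproducing the error on hypothetical data, showing why the mask's index matters:
import pandas as pd
df = pd.DataFrame({'date': pd.date_range('2021-01-19 18:50', periods=3, freq='min'),
                   'price': [27078.59, 27090.00, 27113.82]})
start_time = df.date.iloc[-1] - pd.Timedelta(minutes=5)
mask = df.date >= start_time          # mask has index 0..2
# Simulate the websocket handler appending a row after the mask was built:
df.loc[len(df)] = [pd.Timestamp('2021-01-19 18:53'), 27120.00]
try:
    print(df.loc[mask])               # label 3 is missing from the mask
except Exception as e:                # pandas raises IndexingError here
    print(type(e).__name__, e)
# Building and applying the mask in a single step keeps the indexes aligned:
print(df.loc[df.date >= start_time])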

Python pandas duplicate values error

I have a large tab-delimited data file, and I want to read it in Python using the pandas read_csv or read_table function. When I read this large file it shows me the following error, even after turning off the index_col value.
>>> read_csv("test_data.txt", sep = "\t", header=0, index_col=None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/parsers.py", line 187, in read_csv
return _read(TextParser, filepath_or_buffer, kwds)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/parsers.py", line 160, in _read
return parser.get_chunk()
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/parsers.py", line 613, in get_chunk
raise Exception(err_msg)
Exception: Implicit index (columns 0) have duplicate values [372, 1325, 1497, 1636, 2486, 2679, 3032, 3125, 4261, 4669, 5215, 5416, 5569, 5783, 5821, 6053, 6597, 6835, 7485, 7629, 7684, 7827, 8590, 9361, 10194, 11199, 11707, 11782, 12397, 15134, 15299, 15457, 15637, 16147, 17448, 17659, 18146, 18153, 18398, 18469, 19128, 19433, 19702, 19830, 19940, 20284, 21724, 22764, 23514, 25095, 25195, 25258, 25336, 27011, 28059, 28418, 28637, 30213, 30221, 30574, 30611, 30871, 31471, .......
I thought I might have duplicate values in my data and thus used grep to redirect some of these values into a file.
grep "9996744\|9965107\|740645\|9999752" test_data.txt > delnow.txt
Now, when I read this file, it is read correctly as you can see below.
>>> read_table("delnow.txt", sep = "\t", header=0, index_col=None)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns:
0740645 20 non-null values
M 20 non-null values
BLACK/CAPE VERDEAN 20 non-null values
What is going on here? I am struggling to find a solution, but to no avail.
I also tried the 'uniq' command in Unix to see if duplicate lines exist, but could not find any.
Does it have something to do with chunk size?
I am using the following version of pandas
>>> pandas.__version__
'0.7.3'
>>>
I installed the latest version of pandas and am able to read the file now.
>>> import pandas
>>> pandas.__version__
'0.8.1'
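For reference, a sanity check to run after upgrading (a sketch using the questioner's file name; in newer pandas, duplicate values in a data column are harmless when no index is inferred):
import pandas as pd
# index_col=None keeps pandas from treating the first column as an index,
# so duplicate values in it do not matter.
df = pd.read_csv("test_data.txt", sep="\t", header=0, index_col=None)
print(df.shape)
print(df[df.duplicated()])   # inspect any fully duplicated rows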
