Subtract consecutive rows in one column with pandas - python

I ran into the problem below. When executing the following code on Google Colab, it works normally:
df['temps'] = df['temps'].view(int).div(1e9).diff().fillna(0).abs()
print(df)
but when using Jupyter Notebook locally, the error below appears:
ValueError Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 df3['rebounds'] = pd.Series(df3['temps'].view(int).div(1e9).diff().fillna(0))
File C:\Python310\lib\site-packages\pandas\core\series.py:818, in Series.view(self, dtype)
815 # self.array instead of self._values so we piggyback on PandasArray
816 # implementation
817 res_values = self.array.view(dtype)
--> 818 res_ser = self._constructor(res_values, index=self.index)
819 return res_ser.__finalize__(self, method="view")
File C:\Python310\lib\site-packages\pandas\core\series.py:442, in Series.__init__(self, data, index, dtype, name, copy, fastpath)
440 index = default_index(len(data))
441 elif is_list_like(data):
--> 442 com.require_length_match(data, index)
444 # create/copy the manager
445 if isinstance(data, (SingleBlockManager, SingleArrayManager)):
File C:\Python310\lib\site-packages\pandas\core\common.py:557, in require_length_match(data, index)
553 """
554 Check the length of data matches the length of the index.
555 """
556 if len(data) != len(index):
--> 557 raise ValueError(
558 "Length of values "
559 f"({len(data)}) "
560 "does not match length of index "
561 f"({len(index)})"
562 )
ValueError: Length of values (830) does not match length of index (415)
Any suggestions to resolve this?

Here are two ways to get this to work:
df3['rebounds'] = pd.Series(df3['temps'].view('int64').diff().fillna(0).div(1e9))
... or:
df3['rebounds'] = pd.Series(df3['temps'].astype('int64').diff().fillna(0).div(1e9))
For the following sample input:
df3.dtypes:
temps datetime64[ns]
dtype: object
df3:
temps
0 2022-01-01
1 2022-01-02
2 2022-01-03
... both of the above code samples give this output:
df3.dtypes:
temps datetime64[ns]
rebounds float64
dtype: object
df3:
temps rebounds
0 2022-01-01 0.0
1 2022-01-02 86400.0
2 2022-01-03 86400.0
The issue is probably that view() essentially reinterprets the raw data of the existing series as a different data type. For this to work, according to the Series.view() docs (see also the numpy.ndarray.view() docs), the two data types must have the same number of bytes. The original data is datetime64[ns], which is 8 bytes wide, so passing int to view() does not meet this requirement if int resolves to a 4-byte integer: reinterpreting each 8-byte value as two 4-byte ones is exactly what turns your 415 values into 830 and triggers the length mismatch in the error. Explicitly specifying int64 meets the requirement. Alternatively, using astype() instead of view() with int64 will also work.
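A minimal numpy sketch (hypothetical dates) of the same byte-size rule:
import numpy as np
# datetime64[ns] is 8 bytes wide: an int64 view keeps the length,
# while an int32 view splits each value in two, doubling the length.
a = np.array(['2022-01-01', '2022-01-02'], dtype='datetime64[ns]')
print(a.view('int64').shape)  # (2,) - same width, same length
print(a.view('int32').shape)  # (4,) - half the width, twice the length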
As to why this works in Colab and not in your local Jupyter Notebook: most likely because numpy maps the plain int dtype to the platform's C long, which is 32 bits on Windows but 64 bits on the 64-bit Linux machines Colab runs on, so the same code sees different integer widths.
I do know that in my environment, if I try the following:
df3['rebounds'] = pd.Series(df3['temps'].astype('int').diff().fillna(0).div(1e9))
... then I get this error:
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [int32]
This suggests that int means int32 in my (Windows) environment. It would be interesting to see if this works on Colab.
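As an aside (not from the original answers), the integer round-trip can be avoided entirely by diffing the datetime column first and converting the resulting timedeltas to seconds:
# Sketch of an alternative: diff() on a datetime64 column yields timedeltas,
# which the .dt accessor converts to seconds directly.
df3['rebounds'] = df3['temps'].diff().dt.total_seconds().fillna(0)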

Related

ValueError: Incompatible indexer with Series while adding date to Date to Data Frame

I am new to python and I can't figure out why I get this error: ValueError: Incompatible indexer with Series.
I am trying to add a date to my data frame.
The date I am trying to add:
date = (chec[(chec['Día_Sem']=='Thursday') & (chec['ID']==2011957)]['Entrada'])
date
Date output:
56 1900-01-01 07:34:00
Name: Entrada, dtype: datetime64[ns]
Then I try to add 'date' to my data frame using loc:
rep.loc[2039838,'Thursday'] = date
rep
And I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-347-3e0678b0fdbf> in <module>
----> 1 rep.loc[2039838,'Thursday'] = date
2 rep
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
188 key = com.apply_if_callable(key, self.obj)
189 indexer = self._get_setitem_indexer(key)
--> 190 self._setitem_with_indexer(indexer, value)
191
192 def _validate_key(self, key, axis):
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
640 # setting for extensionarrays that store dicts. Need to decide
641 # if it's worth supporting that.
--> 642 value = self._align_series(indexer, Series(value))
643
644 elif isinstance(value, ABCDataFrame):
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _align_series(self, indexer, ser, multiindex_indexer)
781 return ser.reindex(ax)._values
782
--> 783 raise ValueError('Incompatible indexer with Series')
784
785 def _align_frame(self, indexer, df):
ValueError: Incompatible indexer with Series
I was also facing a similar issue, but in a different scenario. I came across threads about duplicate indices, but of course that was not the case for me. What worked for me was to use .at in place of .loc, so you can try it and see if it works:
rep['Thursday'].at[2039838] = date.values[0]
Try date.iloc[0] instead of date:
rep.loc[2039838,'Thursday'] = date.iloc[0]
Because date is actually a Series (essentially a list/array of values), and .iloc[0] selects the scalar value from it.
You are using loc to set a single cell, but date is a Series (or DataFrame), so the two shapes cannot match. You can instead assign the scalar value inside date to rep.loc[2039838,'Thursday']. For example, if date is a non-empty Series, you can do this:
rep.loc[2039838,'Thursday'] = date.values[0]
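A minimal, self-contained sketch (with hypothetical frames) of the failure and the fix; on the older pandas versions from the question, the commented line raises the ValueError above:
import pandas as pd
# One-row target frame and a one-element datetime Series, mimicking the question.
rep = pd.DataFrame({'Thursday': pd.Series(dtype='datetime64[ns]')}, index=[2039838])
date = pd.Series(pd.to_datetime(['1900-01-01 07:34:00']), index=[56], name='Entrada')
# rep.loc[2039838, 'Thursday'] = date        # ValueError: Incompatible indexer with Series
rep.loc[2039838, 'Thursday'] = date.iloc[0]  # assigning the extracted scalar works
print(rep)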

ValueError: Cannot convert non-finite values (NA or inf) to integer

df.dtypes
name object
rating object
genre object
year int64
released object
score float64
votes float64
director object
writer object
star object
country object
budget float64
gross float64
company object
runtime float64
dtype: object
Then when I try to convert using:
df['budget'] = df['budget'].astype("int64")
it says:
ValueError Traceback (most recent call last)
<ipython-input-23-6ced5964af60> in <module>
1 # Change Datatype for Columns
----> 2 df['budget'] = df['budget'].astype("int64")
3
4 #df['column_name'].astype(np.float).astype("Int32")
5 #df['gross'] = df['gross'].astype('int64')
~\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors)
5696 else:
5697 # else, only a single dtype is given
-> 5698 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
5699 return self._constructor(new_data).__finalize__(self)
5700
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, copy, errors)
580
581 def astype(self, dtype, copy: bool = False, errors: str = "raise"):
--> 582 return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
583
584 def convert(self, **kwargs):
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, filter, **kwargs)
440 applied = b.apply(f, **kwargs)
441 else:
--> 442 applied = getattr(b, f)(**kwargs)
443 result_blocks = _extend_blocks(applied, result_blocks)
444
~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors)
623 vals1d = values.ravel()
624 try:
--> 625 values = astype_nansafe(vals1d, dtype, copy=True)
626 except (ValueError, TypeError):
627 # e.g. astype_nansafe can fail on object-dtype of strings
~\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
866
867 if not np.isfinite(arr).all():
--> 868 raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
869
870 elif is_object_dtype(arr):
ValueError: Cannot convert non-finite values (NA or inf) to integer
Assuming that budget does not contain infinite values, the problem is likely that you have NaN values. NaN is representable in float columns but not in numpy's integer dtypes.
You can:
Drop NaN values before converting
Or, if you still want the NaN values and have a recent version of pandas, you can convert to a nullable integer type that accepts NaN values (note the capital I):
df['budget'] = df['budget'].astype("Int64")
Try this; notice the capital "I" in Int64:
df['budget'] = df['budget'].astype("Int64")
You might have some NaN values in this column, which might be the reason for this issue.
From pandas docs:
Changed in version 1.0.0: Now uses pandas.NA as the missing value rather than numpy.nan
Follow the link to find out more:
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
Or you could fill the NaN/NA values with 0 and then do .astype("int64"):
df['budget'] = df['budget'].fillna(0).astype("int64")
Check for any null values present in the column. If there are no null values, try using apply() instead of astype():
df['budget'] = df['budget'].apply(int)

Memory error during dataframe creation in Python

Hey guys,
I had a problem during the creation of my dataset with Python.
I'm doing this:
userTab = pd.read_csv('C:\\Users\\anto-\\Desktop\\Ex. Resource\\mapping_user_id.tsv', delimiter="\t", names=["User", "Sequence"])
wordTab = pd.read_csv('C:\\Users\\anto-\\Desktop\\Ex. Resource\\mapping_word_id.tsv', delimiter="\t", names=["Word", "Sequence"])
df = pd.DataFrame(data=data, index= userTab.User, columns=wordTab.Word)
I'm trying to create a dataset from two elements: userTab.User provides the rows and wordTab.Word the columns.
Maybe the shape is too big to compute this way.
I printed the shapes of my elements, because at first I thought I had gotten the dimensions wrong:
((603668,), (37419,), (603668, 37419))
After that I printed the types: my user and word are Series elements, and data is a scipy.sparse.csc.csc_matrix.
Maybe I need to process this shape in chunks, but I looked at the pandas.DataFrame reference and there is no such attribute.
I have 8 GB of RAM and 64-bit Python. The sparse matrix is in an npz file (about 300 MB).
The error is a generic error:
MemoryError                               Traceback (most recent call last)
<ipython-input-26-ad363966ef6a> in <module>()
     10 type(sparse_matrix)
     11
---> 12 df = pd.DataFrame(data=sparse_matrix, index=np.array(userTab.User), columns=np.array(wordTab.Word))

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    416         if arr.ndim == 0 and index is not None and columns is not None:
    417             values = cast_scalar_to_array((len(index), len(columns)),
--> 418                                           data, dtype=dtype)
    419             mgr = self._init_ndarray(values, index, columns,
    420                                      dtype=values.dtype, copy=False)

~\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in cast_scalar_to_array(shape, value, dtype)
   1164         fill_value = value
   1165
-> 1166     values = np.empty(shape, dtype=dtype)
   1167     values.fill(fill_value)
   1168

MemoryError:
Maybe the problem could be this: I have a sort of ID, so when I try to access the User column, the ID remains in userTab.User.
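The traceback shows what goes wrong: pandas does not recognize the csc matrix as array data, treats it as a scalar, and tries to allocate a dense (603668, 37419) array in cast_scalar_to_array. A dense float64 array of that shape alone needs far more than 8 GB:
print(603668 * 37419 * 8 / 2**30)  # ~168.3 GiB for a dense float64 array
A hedged sketch of one way to keep the data sparse instead, using the pandas sparse accessor (available in pandas >= 0.25); the small random matrix here stands in for the real one:
import pandas as pd
from scipy import sparse
# Hypothetical small stand-in for the real 603668 x 37419 csc matrix.
sparse_matrix = sparse.random(5, 4, density=0.2, format='csc')
df = pd.DataFrame.sparse.from_spmatrix(sparse_matrix,
                                       index=['u%d' % i for i in range(5)],
                                       columns=['w%d' % j for j in range(4)])
print(df.dtypes)  # Sparse[float64, 0] columns, built without densifying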

Type error on first steps with Apache Parquet

Rather confused by running into this type error while trying out the Apache Parquet file format for the first time. Shouldn't Parquet support all the data types that Pandas does? What am I missing?
import pandas
import pyarrow
import numpy
data = pandas.read_csv("data/BigData.csv", sep="|", encoding="iso-8859-1")
data_parquet = pyarrow.Table.from_pandas(data)
raises:
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-9-90533507bcf2> in <module>()
----> 1 data_parquet = pyarrow.Table.from_pandas(data)
table.pxi in pyarrow.lib.Table.from_pandas()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads)
354 arrays = list(executor.map(convert_column,
355 columns_to_convert,
--> 356 convert_types))
357
358 types = [x.type for x in arrays]
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result_iterator()
584 # Careful not to keep a reference to the popped future
585 if timeout is None:
--> 586 yield fs.pop().result()
587 else:
588 yield fs.pop().result(end_time - time.time())
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
423 raise CancelledError()
424 elif self._state == FINISHED:
--> 425 return self.__get_result()
426
427 self._condition.wait(timeout)
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\thread.py in run(self)
54
55 try:
---> 56 result = self.fn(*self.args, **self.kwargs)
57 except BaseException as exc:
58 self.future.set_exception(exc)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in convert_column(col, ty)
343
344 def convert_column(col, ty):
--> 345 return pa.array(col, from_pandas=True, type=ty)
346
347 if nthreads == 1:
array.pxi in pyarrow.lib.array()
array.pxi in pyarrow.lib._ndarray_to_array()
error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer
data.dtypes is:
0 object
1 object
2 object
3 object
4 object
5 float64
6 float64
7 object
8 object
9 object
10 object
11 object
12 object
13 float64
14 object
15 float64
16 object
17 float64
...
In Apache Arrow, table columns must be homogeneous in their data types. pandas supports Python object columns where values can be different types. So you will need to do some data scrubbing before writing to Parquet format.
We've handled some rudimentary cases (like both bytes and unicode in a column) in the Arrow-Python bindings but we don't hazard any guesses about how to handle bad data. I opened the JIRA https://issues.apache.org/jira/browse/ARROW-2098 about adding an option to coerce unexpected values to null in situations like this, which might help in the future.
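To see the failure mode in isolation, a minimal sketch (with hypothetical data) of a mixed object column; the exact exception type may vary across pyarrow versions:
import pandas as pd
import pyarrow as pa
# An object column mixing ints and a str: Arrow infers int64, then hits the str.
df = pd.DataFrame({'mixed': [1, 2, 'three']})
try:
    pa.Table.from_pandas(df)
except (pa.ArrowInvalid, pa.ArrowTypeError) as e:
    print(e)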
I had this same issue, and it took me a while to figure out a way to find the offending column. Here is what I came up with to find the mixed-type column, although I know there must be a more efficient way.
The last column printed before the exception occurs is the mixed-type column.
# method1: try saving the parquet file by removing 1 column at a time to
# isolate the mixed type column.
cat_cols = df.select_dtypes('object').columns
for col in cat_cols:
    drop = set(cat_cols) - set([col])
    print(col)
    df.drop(drop, axis=1).reset_index(drop=True).to_parquet("c:/temp/df.pq")
Another attempt - list the columns and each type based on the unique values.
# method2: list all columns and the types within
def col_types(col):
    types = set([type(x) for x in col.unique()])
    return types

df.select_dtypes("object").apply(col_types, axis=0)
I faced a similar situation. If possible, you can first convert all columns to the required datatype and then try to convert to Parquet. Example:
import pandas as pd

column_list = df.columns
for col in column_list:
    df[col] = df[col].astype(str)
df.to_parquet('df.parquet.gzip', compression='gzip')

object of type 'float' has no len() when using to_stata

I have three columns in my dataset that I'm trying to save as a Stata .dta file. These are the last three lines I run after I clean the data:
macro1=macro1.rename(columns={'index':'year', 'Price Index, PCE':'pce','Unemployment Rate':'urate'})
macro1.convert_objects(convert_numeric=True).dtypes
macro1[['year','pce','urate']].to_stata('file path\file name.dta', write_index=False)
These are the data types of these variables:
year float64
pce float64
urate float64
dtype: object
The problem is, when I try to convert these columns to .dta I get an error message:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-69-a2069ee823e7> in <module>()
     36 macro1=macro1.rename(columns={'index':'year', 'Price Index, PCE':'pce','Unemployment Rate':'urate'})
     37 macro1.convert_objects(convert_numeric=True).dtypes
---> 38 macro1[['pce']].to_stata('file path\file name.dta', write_index=False)
     39 #macro1

C:\Users\chungk\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in to_stata(self, fname, convert_dates, write_index, encoding, byteorder, time_stamp, data_label)
   1262             time_stamp=time_stamp, data_label=data_label,
   1263             write_index=write_index)
-> 1264         writer.write_file()
   1265
   1266

C:\Users\chungk\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\stata.pyc in write_file(self)
   1245         self._write(_pad_bytes("", 5))
   1246         if self._convert_dates is None:
-> 1247             self._write_data_nodates()
   1248         else:
   1249             self._write_data_dates()

C:\Users\chungk\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\stata.pyc in _write_data_nodates(self)
   1327                 if var is None or var == np.nan:
   1328                     var = _pad_bytes('', typ)
-> 1329                 if len(var) < typ:
   1330                     var = _pad_bytes(var, typ)
   1331                 if compat.PY3:

TypeError: object of type 'float' has no len()
The problem is with both urate and pce, because when I try saving only year, it works.
I'm not sure where the problem lies. Any help would be much appreciated.
convert_objects does not convert the dtypes in place, so you need to assign the result:
macro1 = macro1.convert_objects(convert_numeric=True)
See the docs.
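Note that convert_objects has since been deprecated and removed from pandas; on recent versions the equivalent reassignment would use pd.to_numeric (a sketch, assuming the three columns from the question):
import pandas as pd
# to_numeric returns a new Series, so assign it back to each column.
for col in ['year', 'pce', 'urate']:
    macro1[col] = pd.to_numeric(macro1[col], errors='coerce')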
