Python: "in" does not recognize values in DataFrame column - python

I have an excerpt from a DataFrame "IRAData" and a Column called 'Labels':
380 u'itator-Research'
381 u'itator-OnSystem'
382 u'itator-QueryClient'
383 u'itator-OnSystem'
384 u'itator-OnSystem'
385 u'itator-OnSystem'
386 u'itator-OnSystem'
387 u'itator-OnSystem'
388 u'itator-OnSystem'
Name: Labels, dtype: object
But when I run the following code, I get "False" output:
print(u'itator-QueryClient' in IRAData['Labels'])
Same goes for the other values in the column and when I remove the unicode 'u'.
Anyone have an idea as to why?
EDIT: The solution that I placed in a comment below worked. Did not need to attempt the answer to the suggested duplicate question.

I think the best way to avoid this problem is to import the data correctly.
You are storing "u'itator-QueryClient'", where the u prefix is just the marker of a unicode string,
when 'itator-QueryClient' is the value you actually want to store.
For example, from this HTML page, just select and copy lines 381 to 384 and invoke:
In [498]: import ast
In [499]: pd.read_clipboard(names=['value'],index_col=0,header=None,\
converters={'value': ast.literal_eval})
Out[499]:
value
381 itator-OnSystem
382 itator-QueryClient
383 itator-OnSystem
384 itator-OnSystem
Then 'itator-QueryClient' in IRAData['value'].values will evaluate to True. Note that in applied to a Series tests membership in the index, not the values, so the .values accessor (or Series.isin) is needed here even after the clean import; that is also why the original check printed False.
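For reference, here is a minimal sketch (with illustrative data) of how membership tests behave on a Series:
import pandas as pd

s = pd.Series(['itator-Research', 'itator-QueryClient'], index=[380, 382])

print('itator-QueryClient' in s)             # False: `in` tests the index
print(382 in s)                              # True: 382 is an index label
print('itator-QueryClient' in s.values)      # True: tests the values
print(s.isin(['itator-QueryClient']).any())  # True: idiomatic value test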

Related

subtract each two rows in one column with pandas

I ran into the problem below. When executing the following code on Google Colab, it works normally:
df['temps'] = df['temps'].view(int).div(1e9).diff().fillna(0).abs()
print(df)
but when using Jupyter Notebook locally, the error below appears:
ValueError Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 df3['rebounds'] = pd.Series(df3['temps'].view(int).div(1e9).diff().fillna(0))
File C:\Python310\lib\site-packages\pandas\core\series.py:818, in Series.view(self, dtype)
815 # self.array instead of self._values so we piggyback on PandasArray
816 # implementation
817 res_values = self.array.view(dtype)
--> 818 res_ser = self._constructor(res_values, index=self.index)
819 return res_ser.__finalize__(self, method="view")
File C:\Python310\lib\site-packages\pandas\core\series.py:442, in Series.__init__(self, data, index, dtype, name, copy, fastpath)
440 index = default_index(len(data))
441 elif is_list_like(data):
--> 442 com.require_length_match(data, index)
444 # create/copy the manager
445 if isinstance(data, (SingleBlockManager, SingleArrayManager)):
File C:\Python310\lib\site-packages\pandas\core\common.py:557, in require_length_match(data, index)
553 """
554 Check the length of data matches the length of the index.
555 """
556 if len(data) != len(index):
--> 557 raise ValueError(
558 "Length of values "
559 f"({len(data)}) "
560 "does not match length of index "
561 f"({len(index)})"
562 )
ValueError: Length of values (830) does not match length of index (415)
Any suggestions to resolve this?
Here are two ways to get this to work:
df3['rebounds'] = pd.Series(df3['temps'].view('int64').diff().fillna(0).div(1e9))
... or:
df3['rebounds'] = pd.Series(df3['temps'].astype('int64').diff().fillna(0).div(1e9))
For the following sample input:
df3.dtypes:
temps datetime64[ns]
dtype: object
df3:
temps
0 2022-01-01
1 2022-01-02
2 2022-01-03
... both of the above code samples give this output:
df3.dtypes:
temps datetime64[ns]
rebounds float64
dtype: object
df3:
temps rebounds
0 2022-01-01 0.0
1 2022-01-02 86400.0
2 2022-01-03 86400.0
The issue is probably that view() essentially reinterprets the raw data of the existing series as a different data type. For this to work, according to the Series.view() docs (see also the numpy.ndarray.view() docs), the data types must have the same number of bytes. The original data is datetime64[ns], which is 8 bytes per value; if int resolves to a 4-byte integer, the view yields twice as many elements, which is exactly the mismatch in your traceback (830 = 2 × 415). Explicitly specifying int64 (also 8 bytes) meets the requirement. Or, using astype() instead of view() with int64 will also work.
As to why this works in Colab and not in your local Jupyter Notebook: most likely because plain int resolves to different widths on the two platforms. numpy maps int to int64 on Linux (which Colab runs), while on Windows it has historically mapped to int32 (the C long), which matches the C:\Python310 paths in your traceback.
I do know that in my environment, if I try the following:
df3['rebounds'] = pd.Series(df3['temps'].astype('int').diff().fillna(0).div(1e9))
... then I get this error:
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [int32]
This suggests that int means int32. It would be interesting to see if this works on colab.
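To see the mechanism outside pandas, here is a small numpy sketch (illustrative dates): viewing an 8-byte dtype as a 4-byte one doubles the element count, which is the same 830-versus-415 mismatch reported in the traceback.
import numpy as np

a = np.array(['2022-01-01', '2022-01-02'], dtype='datetime64[ns]')

print(a.view('int64'))        # OK: int64 has the same 8-byte itemsize
print(a.view('int32').shape)  # (4,): each 8-byte value splits into two 4-byte ints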

Finding the bug in a function with while loops

I have this function:
def same_price(df=df):
    df = df.sort_values(by='Ticket')
    nucleus = dict()
    k = 0
    while df.shape[0] >= 2:
        if df.Price.iloc[0] == df.Price.iloc[1]:
            value = df.Price.iloc[0]
            n = 0
            nucleus[k] = []
            while df.Price.iloc[n] == value:
                nucleus[k].append(df.index[n])
                n += 1
                if n > df.shape[0]:
                    df.drop(nucleus[k], axis=0, inplace=True)
                    break
            else:
                df.drop(nucleus[k], axis=0, inplace=True)
            k += 1
        else:
            if df.shape[0] >= 3:
                df.drop(df.index[0], axis=0, inplace=True)
            else:
                break
    return nucleus
The objective of the function is to go through the ordered dataframe and group together the people who paid the same price GIVEN the sequence of the 'Ticket' id. (I do not just want to group ALL the people who paid the same price, regardless of the sequence!)
The dataframe:
Price Ticket
Id
521 93.5000 12749
821 93.5000 12749
584 40.1250 13049
648 35.5000 13213
633 30.5000 13214
276 77.9583 13502
628 77.9583 13502
766 77.9583 13502
435 55.9000 13507
578 55.9000 13507
457 26.5500 13509
588 79.2000 13567
540 49.5000 13568
48 7.7500 14311
574 7.7500 14312
369 7.7500 14313
When I test it:
same_price(df[:11]) works just fine, and the output is: {0: [521, 821], 1: [276, 628, 766], 2: [435, 578]}
but same_price(df[:10]) throws: IndexError: single positional indexer is out-of-bounds.
I'd like to know what is wrong with this function guys.
Thx
I found what's wrong, if anyone is interested...
df.iloc[n] gets the (n+1)-th row of the dataframe, but df.shape[0] == n means the dataframe has only n rows, so the while condition df.Price.iloc[n] == value is evaluated one position past the end before the guard can fire.
Hence we use if n+1 > df.shape[0]: instead of if n > df.shape[0]:
Cheers :)
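For clarity, here is a sketch of the inner loop with that fix applied; the guard now breaks before iloc[n] can run past the end:
while df.Price.iloc[n] == value:
    nucleus[k].append(df.index[n])
    n += 1
    if n + 1 > df.shape[0]:  # i.e. n >= df.shape[0]: iloc[n] would be out of bounds
        df.drop(nucleus[k], axis=0, inplace=True)
        break
else:
    df.drop(nucleus[k], axis=0, inplace=True)
k += 1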

Pandas - Iterating over an index in a loop

I have a weird interaction that I need help with. Basically:
1)
I have created a pandas dataframe that contains 1179 rows x 6 columns. One column is street names, and the same value will have several duplicates (because each line represents a point, and each point is associated with a street).
2)
I also have a list of all the streets in this pandas dataframe.
3) If I run this line, I get an output of all the rows matching that street name:
print(sub_df[sub_df.AQROUTES_3=='AvenueMermoz'])
Result:
FID AQROUTES_3 ... BEARING E_ID
983 983 AvenueMermoz ... 288.058014
984 984 AvenueMermoz ... 288.058014
992 992 AvenueMermoz ... 288.058014
1005 1005 AvenueMermoz ... 288.058014
1038 1038 AvenueMermoz ... 288.058014
1019 1019 AvenueMermoz ... 288.058014
However, if I run this command in a loop, with the strings of my list as the street names, it returns an empty dataframe:
x = ()
for names in pd_streetlist:
    print(names)
    x = names
    print(sub_df[sub_df.AQROUTES_3 == "'" + str(x) + "'"])
    x = ()
Returns:
RangSaint_Joseph
Empty DataFrame
Columns: [FID, AQROUTES_3, X, Y, BEARING, E_ID]
Index: []
AvenueAugustin
Empty DataFrame
Columns: [FID, AQROUTES_3, X, Y, BEARING, E_ID]
Index: []
and so on...
I can't figure out why. Does anybody have an idea?
Thanks
I believe the issue is in this line:
print(sub_df[sub_df.AQROUTES_3 =="'"+str(x)+"'"])
To each name you unnecessarily add quote characters at the beginning and at the end, so that each valid street name ('AvenueMermoz' in your example) turns into "'AvenueMermoz'" (double quotes used here to enclose the single-quoted string).
As @busybear has commented, there is no need to cast to str either. So the corrected line would be:
print(sub_df[sub_df.AQROUTES_3 == x])
So you're adding quotation marks to the filter, which you shouldn't: you're now filtering on 'AvenueMermoz' (with quotes) while you just want to filter on AvenueMermoz.
so
print(sub_df[sub_df.AQROUTES_3 =="'"+str(x)+"'"])
should become
print(sub_df[sub_df.AQROUTES_3 ==str(x)])
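As a side note, if the goal is simply to pull every row whose street appears in the list, pandas' isin does this in one shot, without the loop (a sketch reusing the question's names):
# Select all rows whose AQROUTES_3 value is one of the streets in the list
matches = sub_df[sub_df.AQROUTES_3.isin(pd_streetlist)]
print(matches)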

Type error on first steps with Apache Parquet

Rather confused by running into this type error while trying out the Apache Parquet file format for the first time. Shouldn't Parquet support all the data types that Pandas does? What am I missing?
import pandas
import pyarrow
import numpy
data = pandas.read_csv("data/BigData.csv", sep="|", encoding="iso-8859-1")
data_parquet = pyarrow.Table.from_pandas(data)
raises:
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-9-90533507bcf2> in <module>()
----> 1 data_parquet = pyarrow.Table.from_pandas(data)
table.pxi in pyarrow.lib.Table.from_pandas()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads)
354 arrays = list(executor.map(convert_column,
355 columns_to_convert,
--> 356 convert_types))
357
358 types = [x.type for x in arrays]
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result_iterator()
584 # Careful not to keep a reference to the popped future
585 if timeout is None:
--> 586 yield fs.pop().result()
587 else:
588 yield fs.pop().result(end_time - time.time())
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
423 raise CancelledError()
424 elif self._state == FINISHED:
--> 425 return self.__get_result()
426
427 self._condition.wait(timeout)
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\thread.py in run(self)
54
55 try:
---> 56 result = self.fn(*self.args, **self.kwargs)
57 except BaseException as exc:
58 self.future.set_exception(exc)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in convert_column(col, ty)
343
344 def convert_column(col, ty):
--> 345 return pa.array(col, from_pandas=True, type=ty)
346
347 if nthreads == 1:
array.pxi in pyarrow.lib.array()
array.pxi in pyarrow.lib._ndarray_to_array()
error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer
data.dtypes is:
0 object
1 object
2 object
3 object
4 object
5 float64
6 float64
7 object
8 object
9 object
10 object
11 object
12 object
13 float64
14 object
15 float64
16 object
17 float64
...
In Apache Arrow, table columns must be homogeneous in their data types. pandas supports Python object columns where values can be different types. So you will need to do some data scrubbing before writing to Parquet format.
We've handled some rudimentary cases (like both bytes and unicode in a column) in the Arrow-Python bindings but we don't hazard any guesses about how to handle bad data. I opened the JIRA https://issues.apache.org/jira/browse/ARROW-2098 about adding an option to coerce unexpected values to null in situations like this, which might help in the future.
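For what it's worth, here is a minimal sketch that reproduces this failure mode (hypothetical data; the exact error message varies with the pyarrow version):
import pandas as pd
import pyarrow as pa

# An object column holding both ints and strings has no single Arrow type
df = pd.DataFrame({"mixed": [1, 2, "three"]})
pa.Table.from_pandas(df)  # raises an ArrowInvalid conversion error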
I had this same issue, and it took me a while to figure out a way to find the offending column. Here is what I came up with to find the mixed-type column, although I know there must be a more efficient way.
The last column printed before the exception is raised is the mixed-type column.
# method1: try saving the parquet file by removing 1 column at a time to
# isolate the mixed type column.
cat_cols = df.select_dtypes('object').columns
for col in cat_cols:
    drop = set(cat_cols) - set([col])
    print(col)
    df.drop(drop, axis=1).reset_index(drop=True).to_parquet("c:/temp/df.pq")
Another attempt - list the columns and each type based on the unique values.
# method2: list all columns and the types within
def col_types(col):
    types = set([type(x) for x in col.unique()])
    return types

df.select_dtypes("object").apply(col_types, axis=0)
I faced a similar situation. If possible, you can first convert all columns to the required datatype and then try to convert to parquet. For example:
import pandas as pd

column_list = df.columns
for col in column_list:
    df[col] = df[col].astype(str)

df.to_parquet('df.parquet.gzip', compression='gzip')

Use two indexers to access values in a pandas DataFrame

I have a DataFrame (df) with many columns and rows.
What I'd like to do is access the values in one column for which the values in two other columns match my indexer.
This is what my code looks like now:
df.loc[df.delays == curr_d, df.prev_delay == prev_d, 'd_stim']
In case it isn't clear, my goal is to select the values in the column 'd_stim' for which other values in the same row are curr_d (in the 'delays' column) and prev_d (in the 'prev_delay' column).
This use of loc does not work. It raises the following error:
/home/despo/dbliss/dopa_net/behavioral_experiments/analysis_code/behavior_analysis.py in plot_prev_curr_interaction(data_frames, labels)
2061 for k, prev_d in enumerate(delays):
2062 diff = np.array(df.loc[df.delays == curr_d,
-> 2063 df.prev_delay == prev_d, 'd_stim'])
2064 ind = ~np.isnan(diff)
2065 diff_rad = np.deg2rad(diff[ind])
/usr/local/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1292
1293 if type(key) is tuple:
-> 1294 return self._getitem_tuple(key)
1295 else:
1296 return self._getitem_axis(key, axis=0)
/usr/local/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
787
788 # no multi-index, so validate all of the indexers
--> 789 self._has_valid_tuple(tup)
790
791 # ugly hack for GH #836
/usr/local/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
139 for i, k in enumerate(key):
140 if i >= self.obj.ndim:
--> 141 raise IndexingError('Too many indexers')
142 if not self._has_valid_type(k, i):
143 raise ValueError("Location based indexing can only have [%s] "
IndexingError: Too many indexers
What is the appropriate way to access the data I need?
Your indexing isn't working for two reasons. First, pandas doesn't know what to do with comma-separated conditions:
df.delays == curr_d, df.prev_delay == prev_d
.loc treats each comma-separated element as an indexer for a separate axis, and three indexers are too many for a two-dimensional frame (hence the Too many indexers error). Second, assuming you meant and, you need to wrap each condition in parentheses and join them with &. This is @MaxU's solution from the comments and should work unless you haven't given us everything:
df.loc[(df.delays == curr_d) & (df.prev_delay == prev_d), 'd_stim']
However, I think this looks prettier.
df.query('delays == @curr_d and prev_delay == @prev_d').d_stim
If this works, then so should've @MaxU's. If neither works, I suggest you post some sample data, because most folk don't like guessing what your data is.
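To make the two working forms concrete, here is a quick sketch on toy data (values are illustrative only; note that query() references local variables with the @ prefix):
import pandas as pd

df = pd.DataFrame({'delays': [1, 1, 2],
                   'prev_delay': [0, 1, 0],
                   'd_stim': [10.0, 20.0, 30.0]})
curr_d, prev_d = 1, 0

# Boolean masks joined with &, then column selection via .loc
print(df.loc[(df.delays == curr_d) & (df.prev_delay == prev_d), 'd_stim'])

# The same selection with query(); @ pulls curr_d/prev_d from the local scope
print(df.query('delays == @curr_d and prev_delay == @prev_d').d_stim)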
