Conditional delete in pandas dataframe - python

I want to delete every row that contains a specific string in a dataframe column.
Specifically, I want to delete rows with an abnormal email address (one ending in .jpg).
Here's my code. What's wrong with it?
df = pd.DataFrame({'email':['abc#gmail.com', 'cde#gmail.com', 'ghe#ss.jpg', 'sldkslk#sss.com']})
df
email
0 abc#gmail.com
1 cde#gmail.com
2 ghe#ss.jpg
3 sldkslk#sss.com
for i, r in df.iterrows():
    if df.loc[i,'email'][-3:] == 'com':
        df.drop(df.index[i], inplace=True)
Traceback (most recent call last):
File "<ipython-input-84-4f12d22e5e4c>", line 2, in <module>
if df.loc[i,'email'][-3:] == 'com':
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1472, in __getitem__
return self._getitem_tuple(key)
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 870, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 998, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1911, in _getitem_axis
self._validate_key(key, axis)
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1798, in _validate_key
error()
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1785, in error
axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [2] is not in the [index]'

IIUC, you can do this rather than iterating through your frame with iterrows:
df = df[df.email.str.endswith('.com')]
which returns:
>>> df
email
0 abc#gmail.com
1 cde#gmail.com
3 sldkslk#sss.com
Or, for larger dataframes, it's sometimes faster to skip the str methods provided by pandas and use a plain list comprehension with Python's built-in string methods:
df = df[[i.endswith('.com') for i in df.email]]
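Note that both lines above keep the rows ending in '.com', which gives the right result for this sample. If the goal is literally "drop the rows ending in .jpg", a minimal sketch (the names is_jpg, cleaned and dropped are made up for illustration) would negate the mask instead. It also avoids the problem in the original loop: once a row is dropped inside the loop, df.index[i] no longer lines up with label i, so a later df.loc[i, ...] can hit a label that is already gone.

```python
import pandas as pd

df = pd.DataFrame({'email': ['abc#gmail.com', 'cde#gmail.com',
                             'ghe#ss.jpg', 'sldkslk#sss.com']})

# Boolean mask: True for the rows we want to delete
is_jpg = df['email'].str.endswith('.jpg')

# Keep everything that is NOT flagged; ~ negates the mask
cleaned = df[~is_jpg]

# Same result with drop: pass the index labels of the flagged rows
dropped = df.drop(df[is_jpg].index)

print(cleaned)
```

Either form builds the result in one step, so nothing is deleted while the frame is being iterated.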

Related

KeyError: 'None of [['col label 1', 'col label 2']] are in the [columns]'

I am attempting to slice a pandas dataframe by column labels using .loc. Based on Pandas documentation, https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html, .loc seems like the right indexer for the use case.
Original pandas DataFrame and confirmation that the columns with these labels exist:
The column labels are constructed dynamically and passed as a list to slice the dataframe.
# Create dictionaries
prop_dict = dict(zip(df_list.id, df_list.Company))
city_dict = dict(zip(df_list.id, df_list.city))
# Lookup keys (property ids) from prop_dict
propKeys = getKeysByValue(prop_dict, landlord)
cityKeys = getKeysByValue(city_dict, market)
prop_list = list(set(propKeys) & set(cityKeys))
print(prop_list)
[19, 27]
# Slice dataframe
df_temp = df_t.loc[:, prop_list]
However, this throws an error KeyError: 'None of [[19, 27]] are in the [columns]'
Full traceback here:
Traceback (most recent call last):
File "/Platform/Deploy/tabs/market.py", line 279, in render_table
result = top_leads(company, market)
File "/Platform/Deploy/return_leads.py", line 86, in top_leads
df_temp = df_matrix.loc[:, prop_list]
File "/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1472, in __getitem__
return self._getitem_tuple(key)
File "/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 890, in _getitem_tuple
retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
File "/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1901, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1143, in _getitem_iterable
self._validate_read_indexer(key, indexer, axis)
File "/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1206, in _validate_read_indexer
key=key, axis=self.obj._get_axis_name(axis)))
KeyError: 'None of [[19, 27]] are in the [columns]'
Is it possible that the columns are actually labelled with the strings '19' and '27', and that they merely happen to sit at the 19th and 27th positions? That would explain why it gave you the appropriate result the first time: the integer values of the names matched the positions, not the labels. If the labels are strings, the names you pass in the list need quotes around them, meaning it should be ['19','27'] instead of [19,27].
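A quick way to check this guess is to compare the type of the column labels with the type of the keys. The sketch below uses a made-up frame (df_t here is a stand-in with string labels, not the asker's data) and shows one way to convert the keys before slicing.

```python
import pandas as pd

# Hypothetical frame whose column labels are strings, not integers
df_t = pd.DataFrame([[1, 2, 3]], columns=['5', '19', '27'])
prop_list = [19, 27]

# Inspect the labels: repr shows whether they are '19' or 19
print([repr(c) for c in df_t.columns])

# Find keys that do not match any label as-is
missing = [c for c in prop_list if c not in df_t.columns]
print(missing)

# Convert the keys to the label type before slicing
df_temp = df_t.loc[:, [str(c) for c in prop_list]]
print(df_temp)
```

If the labels turn out to be integers after all, the mismatch is somewhere else (for example the ids live in the row index rather than the columns), and the same membership check on df_t.index would show that.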

Can use dataframe ix for assignment, but not retrieval

I am looping through rows of a pandas df, loop index i.
I am able to assign several columns using the ix function with the loop index as first parameter, column name as second.
However, when I try to retrieve/print using this method,
print(df.ix[i,"Run"])
I get the following TypeError: 'str' object cannot be interpreted as an integer,
which is somehow related to KeyError: 'Run'.
Not quite sure why this is occurring, as Run is indeed a column in the dataframe.
Any suggestions?
Traceback (most recent call last):
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexes\base.py", line 3124, in get_value
return libindex.get_value_box(s, key)
File "pandas\_libs\index.pyx", line 55, in pandas._libs.index.get_value_box
File "pandas\_libs\index.pyx", line 63, in pandas._libs.index.get_value_box
TypeError: 'str' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\...", line 365, in <module>
print(df.ix[i,"Run"])
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 116, in __getitem__
return self._getitem_tuple(key)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 870, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 1027, in _getitem_lowerdim
return getattr(section, self.name)[new_key]
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 122, in __getitem__
return self._getitem_axis(key, axis=axis)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 1116, in _getitem_axis
return self._get_label(key, axis=axis)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 136, in _get_label
return self.obj[label]
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\series.py", line 767, in __getitem__
result = self.index.get_value(self, key)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexes\base.py", line 3132, in get_value
raise e1
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexes\base.py", line 3118, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 114, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Run'
Upon changing the name of the column I print to any other column, it does work correctly. Earlier in the code, I "compressed" the rows, which had multiple rows per unique string in 'Run' column, using the following.
df=df.groupby('Run').max()
Did this last line somehow remove the column/column name from the table?
ix has been deprecated because it has always been ambiguous: does ix[10] refer to the row with the label 10, or the row at position 10?
Use loc or iloc instead:
df.loc[i,"Run"] = ... # by label
df.iloc[i, df.columns.get_loc("Run")] = ... # by position (df.iloc[i]["Run"] = ... is chained indexing and may assign to a copy)
As for the groupby removing Run: it moves Run to the index of the data frame. To get it back as a column, call reset_index:
df=df.groupby('Run').max().reset_index()
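To see the effect of groupby on the Run column, here is a small sketch with made-up data (the score column and its values are only for illustration). It also shows as_index=False, which keeps the grouping key as a column from the start.

```python
import pandas as pd

# Hypothetical data: several rows per unique string in 'Run'
df = pd.DataFrame({'Run': ['r1', 'r1', 'r2'], 'score': [1, 5, 3]})

grouped = df.groupby('Run').max()
print('Run' in grouped.columns)  # the key is no longer a column
print(grouped.index.name)        # it became the index instead

# Option 1: move the index back into a column
restored = grouped.reset_index()

# Option 2: never move it in the first place
flat = df.groupby('Run', as_index=False).max()

print(restored)
```

After either option, a lookup such as restored.loc[0, 'Run'] finds the column again.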
Differences between indexing by label and position:
Suppose you have a series like this:
s = pd.Series(['a', 'b', 'c', 'd', 'e'], index=np.arange(0,9,2))
0 a
2 b
4 c
6 d
8 e
The first column is the labels (aka the index). The second column is the values of the series.
Label based indexing:
s.loc[2] --> b
s.loc[3] --> error. The label doesn't exist
Position based indexing:
s.iloc[2] --> c. since `a` has position 0, `b` has position 1, and so on
s.iloc[3] --> d
According to the documentation, s.ix[3] would have returned d since it first searches for the label 3. When that fails, it falls back to the position 3. On my machine (Pandas 0.24.2), it returns an error, along with a deprecation warning, so I guess the developers changed it to behave like loc.
If you want to use mixed indexing, you have to be explicit about that:
s.loc[3] if 3 in s.index else s.iloc[3]
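The same comparison as a runnable sketch, using the series from above. get_mixed is a hypothetical helper name that just wraps the explicit fallback expression.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'e'], index=np.arange(0, 9, 2))

by_label = s.loc[2]      # looks up the label 2
by_position = s.iloc[2]  # looks up the third element

# A label that does not exist raises KeyError with loc
try:
    s.loc[3]
except KeyError:
    missing_label = True
else:
    missing_label = False


def get_mixed(series, key):
    """Hypothetical helper: try the label first, then fall back to position."""
    return series.loc[key] if key in series.index else series.iloc[key]


print(by_label, by_position, missing_label)
print(get_mixed(s, 3), get_mixed(s, 4))
```

Writing the fallback out like this keeps the ambiguity visible at the call site, which is the point of dropping ix.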

Pandas dataframe apply to get a list throws error

I am running Python 3.6.x and pandas version 0.19.2. I am trying to create a list for each entry in a dataframe as below. This example works.
df = pd.DataFrame({'names':['a', 'b', 'c'], 'year_min':[2001, 2010, 2005], 'year_max':[2018, 2019, 2017]})
start_year = 2017
df['years'] = df.apply(lambda x: list(range(max(x['year_min'],start_year), x['year_max']+1)), axis=1)
df
Out[37]:
names year_max year_min years
0 a 2018 2001 [2017, 2018]
1 b 2019 2010 [2017, 2018, 2019]
2 c 2017 2005 [2017]
Unfortunately, when I try the same line of code for the dataframe in this pickle file, I get an error, despite the dtypes of the two columns still being int64. Without doubt, I have messed up some bit of this dataframe, but I have no clue what the problem is (!). Any ideas?
players = pd.read_pickle("players_2017_2019.p")
start_year = 2017
players['years']= players.apply(lambda x: list(range(max(x['year_min'],start_year), x['year_max']+1)), axis=1)
Traceback (most recent call last):
File "...\python36\win64\431\lib\site-packages\pandas\core\internals.py", line 4262, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "...\python36\win64\431\lib\site-packages\pandas\core\internals.py", line 4339, in form_blocks
int_blocks = _multi_blockify(int_items)
File "...\python36\win64\431\lib\site-packages\pandas\core\internals.py", line 4408, in _multi_blockify
values, placement = _stack_arrays(list(tup_block), dtype)
File "...\python36\win64\431\lib\site-packages\pandas\core\internals.py", line 4453, in _stack_arrays
stacked[i] = _asarray_compat(arr)
ValueError: could not broadcast input array from shape (2) into shape (3)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "...\python36\win64\431\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-7fc8712b01b0>", line 1, in <module>
players.apply(lambda x: list(range(max(x['year_min'],start_year), x['year_max']+1)), axis=1)
File "...\python36\win64\431\lib\site-packages\pandas\core\frame.py", line 4152, in apply
return self._apply_standard(f, axis, reduce=reduce)
File "...\python36\win64\431\lib\site-packages\pandas\core\frame.py", line 4265, in _apply_standard
result = self._constructor(data=results, index=index)
File "...\python36\win64\431\lib\site-packages\pandas\core\frame.py", line 266, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "...\python36\win64\431\lib\site-packages\pandas\core\frame.py", line 402, in _init_dict
return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "...\python36\win64\431\lib\site-packages\pandas\core\frame.py", line 5408, in _arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "...\python36\win64\431\lib\site-packages\pandas\core\internals.py", line 4267, in create_block_manager_from_arrays
construction_error(len(arrays), arrays[0].shape, axes, e)
File "...\python36\win64\431\lib\site-packages\pandas\core\internals.py", line 4231, in construction_error
raise ValueError("Empty data passed with indices specified.")
ValueError: Empty data passed with indices specified.
EDIT:
The issue was solved when I updated my pandas to 0.23.0
Also, the issue is linked to https://github.com/pandas-dev/pandas/issues/17892.
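For anyone who cannot upgrade, one possible workaround is to build the lists without apply at all. This is only a sketch on the small example frame from the question, not something tested against the pickle file, and it is not the fix described in the linked issue.

```python
import pandas as pd

df = pd.DataFrame({'names': ['a', 'b', 'c'],
                   'year_min': [2001, 2010, 2005],
                   'year_max': [2018, 2019, 2017]})
start_year = 2017

# Build one list per row directly; pandas stores them as objects
# and never tries to stack them into a 2-D result
df['years'] = [
    list(range(max(lo, start_year), hi + 1))
    for lo, hi in zip(df['year_min'], df['year_max'])
]

print(df)
```

Because the column is assigned from a plain Python list of lists, the lengths of the inner lists do not need to match.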

Pandas: Nondeterministic broadcast failure in variable assignment

This is a cut down version of a program that converts trade data into OHLCV format.
import pandas as pd
data = pd.DataFrame({ 'time' : [pd.Timestamp('2017-12-26 16:01:04.628431600')], 'price': [100.0], 'size': [0.06] })
data.set_index('time', inplace=True)
data = data.resample('1s').apply({ 'price' : 'ohlc', 'size': 'sum' })
I'm getting the following error
Traceback (most recent call last):
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/common.py", line 1404, in _asarray_tuplesafe
result[:] = values
ValueError: could not broadcast input array from shape (4) into shape (1)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/tmp.py", line 5, in <module>
data = data.resample('1s').apply({ 'price' : 'ohlc', 'size': 'sum' })
File "/home/jun/.local/lib/python3.5/site-packages/pandas/tseries/resample.py", line 293, in aggregate
result, how = self._aggregate(arg, *args, **kwargs)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/base.py", line 560, in _aggregate
result = DataFrame(result)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 224, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 360, in _init_dict
return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 5236, in _arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 5546, in _homogenize
raise_cast_failure=False)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/series.py", line 2922, in _sanitize_array
subarr = _asarray_tuplesafe(data, dtype=dtype)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/common.py", line 1407, in _asarray_tuplesafe
result[:] = [tuple(x) for x in values]
ValueError: cannot copy sequence with size 4 to array axis with dimension 1
This doesn't make sense. IIUC, the assignment on line 5 is just assigning a new DataFrame to a variable, so there's nothing to broadcast. Even stranger, this failure is non-deterministic: running the script sometimes results in the error, but sometimes it doesn't.
Am I doing something wrong, or is this a bug in pandas? Where's the source of non-determinism in this program? I'm using Python 3.5.3 with pandas 0.18.1.

Pandas hierarchical indexing - not working for dataframe?

I'm having trouble addressing values in a DataFrame, but I don't seem to have any problems with the Series object.
>>> df=DataFrame([0.5,1.5,2.5,3.5,4.5], index=[['a','a','b','b','b'],[1,2,1,2,3]])
>>> series=Series([0.5,1.5,2.5,3.5,4.5], index=[['a','a','b','b','b'],[1,2,1,2,3]])
>>> series['a']
1 0.5
2 1.5
dtype: float64
>>> df['a']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 2003, in __getitem__
return self._get_item_cache(key)
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 667, in _get_item_cache
values = self._data.get(item)
File "C:\Anaconda\lib\site-packages\pandas\core\internals.py", line 1655, in get
_, block = self._find_block(item)
File "C:\Anaconda\lib\site-packages\pandas\core\internals.py", line 1935, in _find_block
self._check_have(item)
File "C:\Anaconda\lib\site-packages\pandas\core\internals.py", line 1942, in _check_have
raise KeyError('no item named %s' % com.pprint_thing(item))
KeyError: u'no item named a'
I'm definitely misunderstanding something, if someone could help me out it would be very much appreciated!
You are trying to select a column, and there is indeed no column named 'a'. Try df.loc['a'] instead.
I recommend looking at the basic indexing docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
In summary:
series[label] selects element in series at index label
dataframe[label] selects column with name label
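A short sketch of the difference, using the same data as the question (the variable names rows_a, single and col are made up for illustration):

```python
import pandas as pd

index = [['a', 'a', 'b', 'b', 'b'], [1, 2, 1, 2, 3]]
df = pd.DataFrame([0.5, 1.5, 2.5, 3.5, 4.5], index=index)
series = pd.Series([0.5, 1.5, 2.5, 3.5, 4.5], index=index)

# Rows whose outer index level is 'a'
rows_a = df.loc['a']

# A single cell: row ('a', 2), column 0 (the only column here)
single = df.loc[('a', 2), 0]

# Plain brackets on a DataFrame select a column by name
col = df[0]

print(rows_a)
print(single)
```

With the Series, series['a'] keeps working because plain brackets on a Series look in the index.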
