Applying a model and modifying a Pandas dataframe - python

I have the following problem.
I have a DataFrame with a season variable that I have one-hot encoded,
so I now have 4 Boolean columns with 1's and 0's. These were used to build a model from some known good data, and I now need to use this model to find the correct season in some bad data.
So I built a simple test case that tries to hard-code summer:
def season_model(row1):
    row1 = row1.iloc[:]
    row1.loc[:, 'Summer'] = 1
    row1.loc[:, 'Winter'] = 0
    row1.loc[:, 'Spring'] = 0
    row1.loc[:, 'Autumn'] = 0
    predictions = model.predict(row1)
    cur_pred = predictions[0][0]
    return cur_pred
This worked when I manually subset a row, as shown below:
row1 = prediction_data[3:4]
row1 =row1.iloc[:,:-1]
However, when I try to do the same using the apply() function on a DataFrame, like below:
oos_df['s_predictions'] = oos_df[["Summer", "Winter", "Spring","Autumn"]].apply(lambda x: season_model(x),axis=1)
I run into the following error. I have been trying to resolve this for a while but keep coming up blank:
<ipython-input-254-241c900a588c> in season_model(row1)
5 # for season in season_encode:
6 #encode = season_encode[season]
----> 7 row1.loc[:,'Summer'] =1
8 row1.loc[:,'Winter'] =0
9 row1.loc[:,'Spring'] =0
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in __setitem__(self, key, value)
667 else:
668 key = com.apply_if_callable(key, self.obj)
--> 669 indexer = self._get_setitem_indexer(key)
670 self._setitem_with_indexer(indexer, value)
671
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _get_setitem_indexer(self, key)
660 if "cannot do" in str(e):
661 raise
--> 662 raise IndexingError(key)
663
664 def __setitem__(self, key, value):
IndexingError: (slice(None, None, None), 'Summer')

When you do:
oos_df[["Summer", "Winter", "Spring","Autumn"]].apply(lambda x: season_model(x),axis=1)
You are sending the row values of the oos_df columns to your function season_model, where x represents a row; with axis=1 the function is applied to each row (in your case the values of Summer, Winter, Spring and Autumn).
Take a look at your function: the row1 argument you receive is a single row (a Series indexed by "Summer", "Winter", "Spring", "Autumn"), so there's no need to do row1 = row1.iloc[:], and two-dimensional indexing like row1.loc[:, 'Summer'] no longer applies.
When you do this:
row1 = prediction_data[3:4]
row1 = row1.iloc[:,:-1]
This works because row1 is still a one-row DataFrame (two-dimensional), and .iloc[:, :-1] simply drops its last column. When you send a row to a function with apply and a lambda, you're saying: hey function, here's a row (a Series) from my DataFrame, do something with it.
Put differently, use the function below, look at what it prints, and then you can rebuild season_model correctly.
def season_model(row1):
    print(row1)

oos_df[["Summer", "Winter", "Spring", "Autumn"]].apply(lambda x: season_model(x), axis=1)
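Not part of the original answer, but here is a sketch of how season_model might be rebuilt to accept the Series that apply passes in (it assumes, as in the question, that model is already fitted and expects exactly these four season columns):

def season_model(row1):
    row1 = row1.copy()     # work on a copy of the incoming Series
    row1['Summer'] = 1     # plain label indexing works on a Series
    row1['Winter'] = 0
    row1['Spring'] = 0
    row1['Autumn'] = 0
    # predict() expects 2-D input, so turn the Series into a one-row DataFrame
    predictions = model.predict(row1.to_frame().T)
    return predictions[0]  # use predictions[0][0] if your model returns a 2-D array

oos_df['s_predictions'] = oos_df[["Summer", "Winter", "Spring", "Autumn"]].apply(season_model, axis=1)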

Related

How do I add a list to a column in pandas?

I'm trying to merge the columns kw1, kw2, and kw3 and have them in one separate column called keywords. This is what I tried:
df['keywords'] = list((df['kw1'], df['kw2'], df['kw3']))
df
but I'm getting this error:
ValueError Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 df['keywords'] = list((df['kw1'], df['kw2'], df['kw3']))
2 df
File /lib/python3.10/site-packages/pandas/core/frame.py:3655, in DataFrame.__setitem__(self, key, value)
3652 self._setitem_array([key], value)
3653 else:
3654 # set column
-> 3655 self._set_item(key, value)
File /lib/python3.10/site-packages/pandas/core/frame.py:3832, in DataFrame._set_item(self, key, value)
3822 def _set_item(self, key, value) -> None:
3823 """
3824 Add series to DataFrame in specified column.
3825
(...)
3830 ensure homogeneity.
3831 """
-> 3832 value = self._sanitize_column(value)
3834 if (
3835 key in self.columns
3836 and value.ndim == 1
3837 and not is_extension_array_dtype(value)
3838 ):
3839 # broadcast across multiple columns if necessary
3840 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
File /lib/python3.10/site-packages/pandas/core/frame.py:4535, in DataFrame._sanitize_column(self, value)
4532 return _reindex_for_setitem(value, self.index)
4534 if is_list_like(value):
-> 4535 com.require_length_match(value, self.index)
4536 return sanitize_array(value, self.index, copy=True, allow_2d=True)
File /lib/python3.10/site-packages/pandas/core/common.py:557, in require_length_match(data, index)
553 """
554 Check the length of data matches the length of the index.
555 """
556 if len(data) != len(index):
--> 557 raise ValueError(
558 "Length of values "
559 f"({len(data)}) "
560 "does not match length of index "
561 f"({len(index)})"
562 )
ValueError: Length of values (3) does not match length of index (141)
Is there a way to turn each row into a list like this: [{value of kw1}, {value of kw2}, {value of kw3}]?
You can do it like this
df['keywords'] = np.stack([df['kw1'], df['kw2'], df['kw3']], axis=1).tolist()
Pandas treats each element in the outermost list as a single value, so it complains that you only have three values (your three Series) while it needs 141 values for the new column, since your original frame has 141 rows.
Stacking the underlying numpy arrays of the three Series along the last dimension gives you an array of shape (141, 3), and converting it to a list gives you a list of length 141, with each element being another list of length 3.
A more concise way is to extract the three columns as another DataFrame and let pandas do the stacking for you:
df['keywords'] = df[['kw1', 'kw2', 'kw3']].values.tolist()
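As a quick sanity check, here is a minimal runnable version of both approaches on toy data (not the asker's 141-row frame):

import numpy as np
import pandas as pd

df = pd.DataFrame({'kw1': ['a', 'b'], 'kw2': ['c', 'd'], 'kw3': ['e', 'f']})

# numpy route: stack the three columns side by side, then turn each row into a list
df['keywords'] = np.stack([df['kw1'], df['kw2'], df['kw3']], axis=1).tolist()

# pandas route: select the columns and let .values.tolist() do the row-wise stacking
df['keywords'] = df[['kw1', 'kw2', 'kw3']].values.tolist()

print(df)
#   kw1 kw2 kw3   keywords
# 0   a   c   e  [a, c, e]
# 1   b   d   f  [b, d, f]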

Summing up multiple values in single row

Given a DataFrame such as this, is it possible to add up each country's specific value even if there are multiple countries in one row? For example, in the 1st row both Japan and USA are present, so I would want the value to be Japan=1, USA=1.
import pandas as pd
import numpy as np
countries=["Europe","USA","Japan"]
data = {'Employees': [1, 2, 3, 4],
        'Country': ['Japan;USA', 'USA;Europe', 'Japan', 'Europe;Japan']}
df=pd.DataFrame(data)
print(df)
patt = '(' + '|'.join(countries) + ')'
grp = df.Country.str.extractall(pat=patt).values
new_df = df.groupby(grp).agg({'Employees': sum})
print(new_df)
I have tried this, but it returns a "Grouper and axis must be same length" error. Is this the correct way to do it?
ValueError Traceback (most recent call last)
<ipython-input-81-53e8e9f0f301> in <module>()
10 patt = '(' + '|'.join(countries) + ')'
11 grp = df.Country.str.extractall(pat=patt).values
---> 12 new_df = df.groupby(grp).agg({'Employees': sum})
13 print(new_df)
4 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/grouper.py in _convert_grouper(axis, grouper)
842 elif isinstance(grouper, (list, Series, Index, np.ndarray)):
843 if len(grouper) != len(axis):
--> 844 raise ValueError("Grouper and axis must be same length")
845 return grouper
846 else:
Thus, I would like the end result to be:
Japan: 8
Europe: 6
USA: 3
Thanks
Could you please try the following, written and tested with the shown samples, using the split, explode, and groupby functions of Pandas.
df['Country'] = df['Country'].str.split(';')
df.explode('Country').groupby('Country')['Employees'].sum()
Output will be as follows:
Country
Europe 6
Japan 8
USA 3
Name: Employees, dtype: int64
Explanation: a simple explanation would be:
First, split the Country column of the DataFrame on ; and save the result back into the same column.
Then use explode on the Country column, group by Country, and apply sum to the Employees column to get each country's total.
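If you would rather not overwrite the Country column on the original frame, the same idea works as one chain via assign (a sketch using the question's sample data):

import pandas as pd

df = pd.DataFrame({'Employees': [1, 2, 3, 4],
                   'Country': ['Japan;USA', 'USA;Europe', 'Japan', 'Europe;Japan']})

result = (
    df.assign(Country=df['Country'].str.split(';'))  # split into lists without mutating df
      .explode('Country')                            # one row per (employee count, country) pair
      .groupby('Country')['Employees']
      .sum()
)
print(result)
# Country
# Europe    6
# Japan     8
# USA       3
# Name: Employees, dtype: int64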

How to fix Numpy 'otypes' within Pandas dataframe?

Objective: to run association rules on a binary values dataset
d = {'col1': [0, 0,1], 'col2': [1, 0,0], 'col3': [0,1,1]}
df = pd.DataFrame(data=d)
This produces a data frame with 0's and 1's for corresponding column values.
The problem is when I make use of code like the following:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
frequent_itemsets = apriori(pattern_dataset, min_support=0.50,use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules
Typically this runs just fine, but in running it this time I have encountered an error.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-61-46ec6f572255> in <module>()
4 frequent_itemsets = apriori(pattern_dataset, min_support=0.50,use_colnames=True)
5 frequent_itemsets
----> 6 rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
7 rules
D:\AnaConda\lib\site-packages\mlxtend\frequent_patterns\association_rules.py in association_rules(df, metric, min_threshold, support_only)
127 values = df['support'].values
128 frozenset_vect = np.vectorize(lambda x: frozenset(x))
--> 129 frequent_items_dict = dict(zip(frozenset_vect(keys), values))
130
131 # prepare buckets to collect frequent rules
D:\AnaConda\lib\site-packages\numpy\lib\function_base.py in __call__(self, *args, **kwargs)
1970 vargs.extend([kwargs[_n] for _n in names])
1971
-> 1972 return self._vectorize_call(func=func, args=vargs)
1973
1974 def _get_ufunc_and_otypes(self, func, args):
D:\AnaConda\lib\site-packages\numpy\lib\function_base.py in _vectorize_call(self, func, args)
2040 res = func()
2041 else:
-> 2042 ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args)
2043
2044 # Convert args to object arrays first
D:\AnaConda\lib\site-packages\numpy\lib\function_base.py in _get_ufunc_and_otypes(self, func, args)
1996 args = [asarray(arg) for arg in args]
1997 if builtins.any(arg.size == 0 for arg in args):
-> 1998 raise ValueError('cannot call `vectorize` on size 0 inputs '
1999 'unless `otypes` is set')
2000
ValueError: cannot call `vectorize` on size 0 inputs unless `otypes` is set
This is what I have for dtypes in Pandas; any help would be appreciated.
col1 int64
col2 int64
col3 int64
dtype: object
128 frozenset_vect = np.vectorize(lambda x: frozenset(x))
--> 129 frequent_items_dict = dict(zip(frozenset_vect(keys), values))
Here np.vectorize wraps the frozenset(x) function in code that can take an array or list (keys) and pass each element to it for evaluation. It's a kind of numpy iteration (convenient, but not fast). But to determine what kind (dtype) of array it should return, it performs a test run with the first element of keys. An alternative to this test run is to use the otypes parameter.
Anyway, in this particular run keys is evidently empty, a size-0 array or list. vectorize could return an equivalently shaped result array, but it still has to set a dtype. Hence the error.
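A tiny standalone demonstration of that np.vectorize behaviour (plain numpy, not mlxtend code):

import numpy as np

# np.vectorize decides its output dtype by test-calling the function on the
# first element, so it fails on an empty input unless otypes is given
vect = np.vectorize(lambda x: frozenset(x))

keys = np.empty(2, dtype=object)
keys[:] = [('a',), ('a', 'b')]
print(vect(keys))        # works: [frozenset({'a'}) frozenset({'a', 'b'})]

empty = np.empty(0, dtype=object)
# vect(empty)            # ValueError: cannot call `vectorize` on size 0 inputs ...

vect_ok = np.vectorize(lambda x: frozenset(x), otypes=[object])
print(vect_ok(empty))    # empty object array, no error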
Evidently the code writer never anticipated the case where keys is empty, so you need to tackle the question of why it is empty.
We need to look at the association_rules code to see how keys is set. Its use in line 129 suggests that it has the same number of elements as values, which is derived from the df with:
values = df['support'].values
If keys has 0 elements, then values does as well, and df has 0 'rows'.
What is the size of frequent_itemsets?
I added an mlxtend tag because the error arises during the use of its code. You/we need to examine that code or its documentation to determine why this DataFrame is empty.
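One way to confirm this directly (a sketch; pattern_dataset and the 0.50 support threshold are taken from the question) is to look at frequent_itemsets before handing it to association_rules:

from mlxtend.frequent_patterns import apriori, association_rules

frequent_itemsets = apriori(pattern_dataset, min_support=0.50, use_colnames=True)
print(len(frequent_itemsets))  # 0 rows here is exactly what triggers the vectorize error

if frequent_itemsets.empty:
    # nothing meets the support threshold: lower min_support or revisit the input data
    print("no frequent itemsets at this support level")
else:
    rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)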
Workaround:
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

yourdataset_sets = yourdataset.applymap(encode_units)
frequent_itemsets = apriori(yourdataset_sets, min_support=0.001, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
Credit: saeedesmaili

Pandas Calculate Median of Group over Columns

I am trying to calculate the median of groups over columns. I found a very clear example at
Pandas: Calculate Median of Group over Columns
That question and answer is exactly what I needed, and I recreated the exact example posted to work through the details on my own:
import pandas
import numpy
data_3 = [2,3,4,5,4,2]
data_4 = [0,1,2,3,4,2]
df = pandas.DataFrame({'COL1': ['A', 'A', 'A', 'A', 'B', 'B'],
                       'COL2': ['AA', 'AA', 'BB', 'BB', 'BB', 'BB'],
                       'COL3': data_3,
                       'COL4': data_4})
m = df.groupby(['COL1', 'COL2'])[['COL3','COL4']].apply(numpy.median)
When I try to calculate the median of groups over columns, I encounter the error:
TypeError: Series.name must be a hashable type
If I run the exact same code with the only difference being that I replace median with a different statistic (mean, min, max, std), everything works just fine.
I don't understand the cause of this error and why it only occurs for median, which is what I really need to calculate.
Thanks in advance for your help,
Bob
Here is the full error message. I am using python 3.5.2
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-af0ef7da3347> in <module>()
----> 1 m = df.groupby(['COL1', 'COL2'])[['COL3','COL4']].apply(numpy.median)
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py in apply(self, func, *args, **kwargs)
649 # ignore SettingWithCopy here in case the user mutates
650 with option_context('mode.chained_assignment', None):
--> 651 return self._python_apply_general(f)
652
653 def _python_apply_general(self, f):
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py in _python_apply_general(self, f)
658 keys,
659 values,
--> 660 not_indexed_same=mutated or self.mutated)
661
662 def _iterate_slices(self):
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py in _wrap_applied_output(self, keys, values, not_indexed_same)
3373 coerce = True if any([isinstance(x, Timestamp)
3374 for x in values]) else False
-> 3375 return (Series(values, index=key_index, name=self.name)
3376 ._convert(datetime=True,
3377 coerce=coerce))
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
231 generic.NDFrame.__init__(self, data, fastpath=True)
232
--> 233 self.name = name
234 self._set_axis(0, index, fastpath=True)
235
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
2692 object.__setattr__(self, name, value)
2693 elif name in self._metadata:
-> 2694 object.__setattr__(self, name, value)
2695 else:
2696 try:
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/series.py in name(self, value)
307 def name(self, value):
308 if value is not None and not com.is_hashable(value):
--> 309 raise TypeError('Series.name must be a hashable type')
310 object.__setattr__(self, '_name', value)
311
TypeError: Series.name must be a hashable type
Somehow the series name at this stage is being interpreted as un-hashable, despite supposedly being a tuple. I think it may be the same bug as the one fixed and closed:
Apply on selected columns of a groupby object - stopped working with 0.18.1 #13568
Basically, single scalar values in groups (as you have in your example) were causing the name of the Series to not be passed through. It is fixed in 0.19.2.
In any case, it shouldn't be a practical concern since you can (and should) call mean, median, etc. on GroupBy objects directly.
>>> df.groupby(['COL1', 'COL2'])[['COL3', 'COL4']].median()
COL3 COL4
COL1 COL2
A AA 2.5 0.5
BB 4.5 2.5
B BB 3.0 3.0
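For completeness, a runnable sketch of the recommended approach on the question's own data (the medians match the table above; exact display formatting may vary by pandas version):

import pandas

df = pandas.DataFrame({'COL1': ['A', 'A', 'A', 'A', 'B', 'B'],
                       'COL2': ['AA', 'AA', 'BB', 'BB', 'BB', 'BB'],
                       'COL3': [2, 3, 4, 5, 4, 2],
                       'COL4': [0, 1, 2, 3, 4, 2]})

# call the GroupBy method directly instead of apply(numpy.median)
medians = df.groupby(['COL1', 'COL2'])[['COL3', 'COL4']].median()

# agg with string names also works, and lets you compute several statistics at once
stats = df.groupby(['COL1', 'COL2'])[['COL3', 'COL4']].agg(['median', 'mean'])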

Use two indexers to access values in a pandas DataFrame

I have a DataFrame (df) with many columns and rows.
What I'd like to do is access the values in one column for which the values in two other columns match my indexer.
This is what my code looks like now:
df.loc[df.delays == curr_d, df.prev_delay == prev_d, 'd_stim']
In case it isn't clear, my goal is to select the values in the column 'd_stim' for which other values in the same row are curr_d (in the 'delays' column) and prev_d (in the 'prev_delay' column).
This use of loc does not work. It raises the following error:
/home/despo/dbliss/dopa_net/behavioral_experiments/analysis_code/behavior_analysis.py in plot_prev_curr_interaction(data_frames, labels)
2061 for k, prev_d in enumerate(delays):
2062 diff = np.array(df.loc[df.delays == curr_d,
-> 2063 df.prev_delay == prev_d, 'd_stim'])
2064 ind = ~np.isnan(diff)
2065 diff_rad = np.deg2rad(diff[ind])
/usr/local/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1292
1293 if type(key) is tuple:
-> 1294 return self._getitem_tuple(key)
1295 else:
1296 return self._getitem_axis(key, axis=0)
/usr/local/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
787
788 # no multi-index, so validate all of the indexers
--> 789 self._has_valid_tuple(tup)
790
791 # ugly hack for GH #836
/usr/local/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
139 for i, k in enumerate(key):
140 if i >= self.obj.ndim:
--> 141 raise IndexingError('Too many indexers')
142 if not self._has_valid_type(k, i):
143 raise ValueError("Location based indexing can only have [%s] "
IndexingError: Too many indexers
What is the appropriate way to access the data I need?
Your logic isn't working because pandas doesn't know what to do with comma-separated conditions:
df.delays == curr_d, df.prev_delay == prev_d
Assuming you meant and, you need to wrap each condition in parentheses and join them with &. This is @MaxU's solution from the comments and should work unless you haven't given us everything.
df.loc[(df.delays == curr_d) & (df.prev_delay == prev_d), 'd_stim']
However, I think this looks prettier.
df.query('delays == @curr_d and prev_delay == @prev_d').d_stim
If this works, then so should @MaxU's. If neither works, I suggest you post some sample data, because most folks don't like guessing what your data is.
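A minimal self-contained sketch of both forms, with made-up data since the question does not include any:

import pandas as pd

df = pd.DataFrame({
    'delays': [1, 2, 1, 2],
    'prev_delay': [0, 0, 1, 1],
    'd_stim': [10.0, 20.0, 30.0, 40.0],
})
curr_d, prev_d = 1, 1

# boolean-mask version: combine the two conditions with & inside parentheses
loc_result = df.loc[(df.delays == curr_d) & (df.prev_delay == prev_d), 'd_stim']

# query version: @ refers to local Python variables inside the expression
query_result = df.query('delays == @curr_d and prev_delay == @prev_d').d_stim

print(loc_result.tolist())    # [30.0]
print(query_result.tolist())  # [30.0]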
