tensorflow error: composing crossed columns from categorical ones - python

I want to explicitly see the input tensor for the crossed categorical feature columns, but I get this error:
ValueError: Items of feature_columns must be a _DenseColumn. You can wrap a categorical column with an embedding_column or indicator_column. Given: _VocabularyListCategoricalColumn(key='education', vocabulary_list=('7th', '8th', '10th', '6th', '1th'), dtype=tf.string, default_value=-1, num_oov_buckets=0)
Code:
import tensorflow as tf
import tensorflow.feature_column as fc
import numpy as np

tf.enable_eager_execution()

x = {'education': ['7th', '8th', '10th', '10th', '6th', '1th'],
     'occupation': ['pro', 'sport', 'sci', 'model', 'pro', 'tech']}
y = {'output': [1, 2, 3, 4, 5, 6]}

education = fc.categorical_column_with_vocabulary_list(
    'education',
    ['7th', '8th', '10th', '6th', '1th'])
occupation = fc.categorical_column_with_vocabulary_list(
    'occupation',
    ['pro', 'sport', 'sci', 'model', 'tech'])
education_x_occupation = tf.feature_column.crossed_column(
    ['education', 'occupation'], 30)

feat_cols = [
    education,
    occupation,
    education_x_occupation]

fc.input_layer(x, feat_cols)  # output
What is the correct implementation?
UPD:
I changed the last five lines to

feat_cols = [
    fc.indicator_column(education),
    fc.indicator_column(occupation),
    fc.indicator_column(education_x_occupation)]
example_data = fc.input_layer(x, feat_cols)  # output
print(example_data.numpy())

and I got the following errors; I am not sure whether they come from TensorFlow or from Python itself. Should I handle the overflow first?
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
OverflowError: Python int too large to convert to C long
During handling of the above exception, another exception occurred:
SystemError Traceback (most recent call last)
<ipython-input-125-db0c0525e02b> in <module>()
4 fc.indicator_column(education_x_occupation)]
5
----> 6 example_data = fc.input_layer(x, feat_cols) # output
7 print(example_data.numpy())
...
~\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_sparse_ops.py in sparse_cross(indices, values, shapes, dense_inputs, hashed_output, num_buckets, hash_key, out_type, internal_type, name)
1327 dense_inputs, "hashed_output", hashed_output, "num_buckets",
1328 num_buckets, "hash_key", hash_key, "out_type", out_type,
-> 1329 "internal_type", internal_type)
1330 _result = _SparseCrossOutput._make(_result)
1331 return _result
SystemError: <built-in function TFE_Py_FastPathExecute> returned a result with an error set

You need to pass your categorical columns through fc.indicator_column() before you can view them.
Try amending your last few lines to this:

feat_cols = [
    fc.indicator_column(education),
    fc.indicator_column(occupation),
    fc.indicator_column(education_x_occupation)]
example_data = fc.input_layer(x, feat_cols)  # output
print(example_data.numpy())

Is that what you were hoping to see?
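For reference, a minimal sketch of what the corrected snippet should produce (the shape follows from the column definitions above: two 5-value vocabularies plus a cross hashed into 30 buckets, so each of the 6 examples becomes a 40-dimensional multi-hot row):

example_data = fc.input_layer(x, feat_cols)
print(example_data.shape)  # expected: (6, 40), i.e. 5 + 5 + 30 indicator columns

As for the OverflowError in the update: the traceback ends inside the sparse_cross op on a Windows Anaconda install, and "Python int too large to convert to C long" there most likely means the cross's 64-bit default hash key overflows the 32-bit C long that Windows builds use in the eager fast path; that points to a platform issue rather than something to fix in your code.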

Related

How do I convert "WWWDDLLW" into a numerical value

I am training a model built with scikit-learn (decision trees); my goal is to convert a team's form into a numerical value.
I tried using the map method to map the values of the "Home Team Form" column in a DataFrame X to corresponding numerical values: it maps the string values "W" (win), "D" (draw), and "L" (loss) to the integers 3, 2, and 1, respectively, so the "Home Team Form" column should end up containing these mapped integers rather than the original strings.
X = df[['Home Team Form', 'Away Team Form']]
y = df['Score Line']
X['Home Team Form'] = X['Home Team Form'].map({'W': 3, 'D': 2, 'L': 1})
X['Away Team Form'] = X['Away Team Form'].map({'W': 3, 'D': 2, 'L': 1})
Sample football_data.csv
Home Team,Away Team,Home Team Form,Away Team Form,Score Line
Chelsea,Arsenal,WWDDL,WWWWW,3-2
Manchester United,Liverpool,LLDWD,WDWWW,1-2
Error Message:
[Running] python -u "/home/abjay/Desktop/PythonCrush/new/machineLearning.py"
Traceback (most recent call last):
File "/home/abjay/Desktop/PythonCrush/new/machineLearning.py", line 9, in <module>
X = df[['Home Team Form', 'Away Team Form']]
File "/home/abjay/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py", line 3511, in __getitem__
indexer = self.columns._get_indexer_strict(key, "columns")[1]
File "/home/abjay/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 5782, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/home/abjay/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 5842, in _raise_if_missing
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['Home Team Form', 'Away Team Form'], dtype='object')] are in the [columns]"

Infer multivalent features with tfdv from pandas dataframe

I want to infer a schema with tensorflow data validation (tfdv) based on a pandas dataframe of the training data. The dataframe contains a column with a multivalent feature, where multiple values (or None) of the feature can be present at the same time.
Given the following dataframe:
df = pd.DataFrame([{'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
                   {'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
                   {'feat_1': 7, 'feat_2': None, 'feat_3': None}])
Inferring and displaying the schema shows that tfdv treats each 'feat_2' value as a single string (e.g. 'AA, BB') instead of splitting it at the ',' to produce a domain of 'AA', 'BB'.
If I save the values of the feature as a list, e.g. ['AA', 'BB'], the schema inference throws an error:
ArrowTypeError: ("Expected bytes, got a 'list' object", 'Conversion failed for column feat_2 with type object')
Is there any way to achieve this with tfdv?
A string will be interpreted as a single string. Regarding your issue with the list, it is likely related to this limitation: currently only pandas columns of primitive types are supported. I could not find anything more recent. Here is a workaround:
import pandas as pd
import tensorflow_data_validation as tfdv

df = pd.DataFrame([{'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
                   {'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
                   {'feat_1': 7, 'feat_2': None, 'feat_3': None}])

# Split the multivalent column and give each value its own row.
df['feat_2'] = df['feat_2'].str.split(',')
df = df.explode('feat_2').reset_index(drop=True)
# Strip the space left over from splitting 'AA, BB' on ',' so that
# ' BB' and 'BB' do not become two distinct domain values.
df['feat_2'] = df['feat_2'].str.strip()

train_stats = tfdv.generate_statistics_from_dataframe(df)
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
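For reference, after the split/explode/strip the frame has one row per value, which is what lets tfdv pick up 'AA' and 'BB' as separate domain values (output sketched from the data above):

print(df)
#    feat_1 feat_2 feat_3
# 0      13     AA      X
# 1      13     BB      X
# 2       7     AA      Y
# 3       7    NaN   None

Note that the inferred statistics now count one row per value, so multivalent rows contribute once per value they contain.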

Python ValueError from np.where create flag based on one condition

If a city is mentioned in cities_specific, I would like to create a flag in the cities_all data. This is just a minimal example; in reality I want to create several such flags based on several data frames, which is why I tried to solve it with isin instead of a join.
However, I am running into ValueError: Length of values (3) does not match length of index (7).
# import packages
import pandas as pd
import numpy as np

# create minimal data
cities_specific = pd.DataFrame({'city': ['Melbourne', 'Cairns', 'Sydney'],
                                'n': [10, 4, 8]})
cities_all = pd.DataFrame({'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns',
                                    'Berlin', 'Sydney'],
                           'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000]})

# get value error
# how can this be solved differently?
cities_all.assign(in_cities_specific=np.where(cities_specific.city.isin(cities_all.city), '1', '0'))

# that's the solution I would like to get
expected_solution = pd.DataFrame({'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns',
                                           'Berlin', 'Sydney'],
                                  'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000],
                                  'in_cities': [0, 1, 0, 0, 1, 0, 1]})
I think you have the operands of the condition swapped: cities_specific.city.isin(cities_all.city) returns one boolean per row of cities_specific (3 values), which cannot be assigned to the 7-row cities_all, hence the length mismatch.
Here are some alternatives:
cities_all.assign(
    in_cities_specific=np.where(cities_all.city.isin(cities_specific.city), '1', '0')
)

or

cities_all["in_cities_specific"] = cities_all["city"].isin(cities_specific["city"]).astype(int).astype(str)

or

condlist = [cities_all["city"].isin(cities_specific["city"])]
choicelist = ["1"]
cities_all["in_cities_specific"] = np.select(condlist, choicelist, default="0")

Pandas DataFrame Groupby Named Aggregation Using Lambda Results in a KeyError, Dict-of-Dicts Approach Works Fine

GroupBy.agg() with a dict-of-dicts argument to name the resulting columns is deprecated in favor of the new named aggregation approach. However, I am having trouble applying lambda functions that worked fine previously (using a dict-of-dicts).
I'm using Python 3.7.4, NumPy 1.16.4, Pandas 0.25.0
import numpy as np
import pandas as pd

data = [['tom', 10, 'blue', 1000, 'a'], ['nick', 15, 'blue', 2000, 'b'],
        ['julie', 14, 'green', 3000, 'a'], ['bob', 11, 'green', 4000, 'a'],
        ['cindy', 16, 'red', 5000, 'b']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Color', 'Num', 'Letter'])

# Dict-style renaming seems to work fine:
df.groupby(by='Color').agg({'Num': {'SumNum': np.sum,
                                    'SumNumIfLetterA': lambda x: x[df.iloc[x.index].Letter == 'a'].sum()}})
C:\Users\AppData\Local\Continuum\anaconda3\Lib\site-packages\pandas\core\groupby\generic.py:1455: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version.
For column-specific groupby renaming, use named aggregation
    df.groupby(...).agg(name=('column', aggfunc))
  return super().aggregate(arg, *args, **kwargs)
Out[4]:
         Num
       SumNum SumNumIfLetterA
Color
blue     3000            1000
green    7000            7000
red      5000               0
# Named aggregation throws a KeyError:
df.groupby(by='Color').agg(SumNum=('Num', np.sum),
                           SumNumIfLetterA=('Num', lambda x: x[df.iloc[x.index].Letter == 'a'].sum()))
Traceback (most recent call last):
File "<ipython-input-5-9be7b560a3f5>", line 2, in <module>
df.groupby(by='Color').agg(SumNum = ('Num', np.sum), SumNumIfLetterA = ('Num', lambda x: x[df.iloc[x.index].Letter=='a'].sum()))
File "C:\Users\AppData\Local\Continuum\anaconda3\Lib\site-packages\pandas\core\groupby\generic.py", line 1455, in aggregate
return super().aggregate(arg, *args, **kwargs)
File "C:\Users\AppData\Local\Continuum\anaconda3\Lib\site-packages\pandas\core\groupby\generic.py", line 264, in aggregate
result = result[order]
File "C:\Users\AppData\Local\Continuum\anaconda3\Lib\site-packages\pandas\core\frame.py", line 2981, in __getitem__
indexer = self.loc._convert_to_indexer(key, axis=1, raise_missing=True)
File "C:\Users\AppData\Local\Continuum\anaconda3\Lib\site-packages\pandas\core\indexing.py", line 1271, in _convert_to_indexer
return self._get_listlike_indexer(obj, axis, **kwargs)[1]
File "C:\Users\AppData\Local\Continuum\anaconda3\Lib\site-packages\pandas\core\indexing.py", line 1078, in _get_listlike_indexer
keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
File "C:\Users\AppData\Local\Continuum\anaconda3\Lib\site-packages\pandas\core\indexing.py", line 1171, in _validate_read_indexer
raise KeyError("{} not in index".format(not_found))
KeyError: "[('Num', '<lambda>')] not in index"
I had a very similar problem. After digging a little deeper into GitHub, I found a workaround: create a dummy column in the main data frame. If you do the following in your code, it should work:
data = [['tom', 10, 'blue', 1000, 'a'], ['nick', 15, 'blue', 2000, 'b'],
        ['julie', 14, 'green', 3000, 'a'], ['bob', 11, 'green', 4000, 'a'],
        ['cindy', 16, 'red', 5000, 'b']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Color', 'Num', 'Letter'])

# Dummy column
df['Num1'] = df['Num']

# Now your groupby with named aggregation on Num and Num1
df.groupby(by='Color').agg(SumNum=('Num', np.sum),
                           SumNumIfLetterA=('Num1', lambda x: x[df.iloc[x.index].Letter == 'a'].sum()))
Output from the IPython console:
Out[46]:
       SumNum SumNumIfLetterA
Color
blue     3000            1000
green    7000            7000
red      5000               0
Hope this works!
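A hedged alternative, in case you would rather avoid the dummy column: the KeyError appears to come from pandas 0.25.0 keying the lambda's output as ('Num', '<lambda>') and then failing to find it when reordering the result, so giving the aggregation a named function instead of a lambda sidesteps the collision (newer pandas versions rename multiple lambdas to '<lambda_0>', '<lambda_1>', ... themselves). A sketch under that assumption:

def sum_if_letter_a(x):
    # Sum 'Num' only for the rows of the group whose Letter is 'a'.
    return x[df.iloc[x.index].Letter == 'a'].sum()

df.groupby(by='Color').agg(SumNum=('Num', np.sum),
                           SumNumIfLetterA=('Num', sum_if_letter_a))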

Apply CountVectorizer to column with list of words in rows in Python

I built a preprocessing step for text analysis, and after removing stopwords and stemming like this:

test[col] = test[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x)
               if ps.stem(item) not in stop_words])
train[col] = train[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x)
               if ps.stem(item) not in stop_words])
I've got a column with list of "cleaned words". Here are 3 rows in a column:
['size']
['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers', 'lined', 'bubble', 'wrap', 'protection', 'self', 'sealing', 'peelandseal', 'adhesive', 'keeps', 'contents', 'secure', 'tamper', 'proof', 'durable', 'lightweight', 'kraft', 'material', 'helps', 'save', 'postage', 'approved', 'ups', 'fedex', 'usps']
['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']
I now want to apply CountVectorizer to this column:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500, analyzer='word', lowercase=False)  # keeps only the 1500 most frequent terms
X_train = cv.fit_transform(train[col])
But I got an error:
TypeError: expected string or bytes-like object
It would be a bit strange to build a string from the list only for CountVectorizer to split it apart again.
To apply CountVectorizer to a list of words, you should replace the built-in analyzer with one that passes the tokens through unchanged:

x = [['ab', 'cd'], ['ab', 'de']]
vectorizer = CountVectorizer(analyzer=lambda doc: doc)
vectorizer.fit_transform(x).toarray()

Out:
array([[1, 1, 0],
       [1, 0, 1]], dtype=int64)
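Applied to the question's data, a minimal sketch (train and col as defined in the question; note that the built-in lowercasing and tokenization are bypassed when analyzer is a callable):

cv = CountVectorizer(max_features=1500, analyzer=lambda doc: doc)
X_train = cv.fit_transform(train[col])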
As I found no other way to avoid the error, I joined the lists in the column:

train[col] = train[col].apply(lambda x: " ".join(x))
test[col] = test[col].apply(lambda x: " ".join(x))

Only after that did I get a result:

X_train = cv.fit_transform(train[col])
X_train = pd.DataFrame(X_train.toarray(), columns=cv.get_feature_names())
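One version note on that last line (not from the original answer, but worth flagging): scikit-learn 1.0 deprecated get_feature_names() in favor of get_feature_names_out(), and 1.2 removed it, so on recent versions the equivalent is:

X_train = pd.DataFrame(X_train.toarray(), columns=cv.get_feature_names_out())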
When you use fit_transform, the params passed in have to be an iterable of strings or bytes-like objects. Looks like you should be applying that over your column instead.
X_train = train[col].apply(lambda x: cv.fit_transform(x))
You can read the docs for fit_transform here.
Your input should be a list of strings or bytes; in this case you seem to be providing a list of lists.
It looks like you have already tokenized your strings into separate lists of tokens. What you can do is a hack like the one below:
inp = [['size'],
       ['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers', 'lined', 'bubble', 'wrap',
        'protection', 'self', 'sealing', 'peelandseal', 'adhesive', 'keeps',
        'contents', 'secure', 'tamper', 'proof', 'durable', 'lightweight', 'kraft',
        'material', 'helps', 'save', 'postage', 'approved', 'ups', 'fedex', 'usps'],
       ['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']]

# Join each token list with a separator that cannot occur inside a token,
# then split on that same separator at tokenization time.
inp = ["<some_space>".join(x) for x in inp]
vectorizer = CountVectorizer(tokenizer=lambda x: x.split("<some_space>"), analyzer="word")
vectorizer.fit_transform(inp)
