How do I convert "WWWDDLLW" into a numerical value - python

I am training a model built with scikit-learn (decision trees).
My goal is to convert each team's form into a numerical value.
I tried using the map method to map the values of the "Home Team Form" column in a DataFrame X to corresponding numerical values. Specifically, it maps the string values "W" (for win), "D" (for draw), and "L" (for loss) to the integers 3, 2, and 1, respectively. The intended result is that the "Home Team Form" column in the DataFrame contains these mapped integer values rather than the original string values.
X = df[['Home Team Form', 'Away Team Form']]
y = df['Score Line']
X['Home Team Form'] = X['Home Team Form'].map({'W': 3, 'D': 2, 'L': 1})
X['Away Team Form'] = X['Away Team Form'].map({'W': 3, 'D': 2, 'L': 1})
Sample football_data.csv
Home Team,Away Team,Home Team Form,Away Team Form,Score Line
Chelsea,Arsenal,WWDDL,WWWWW,3-2
Manchester United,Liverpool,LLDWD,WDWWW,1-2
Error Message:
[Running] python -u "/home/abjay/Desktop/PythonCrush/new/machineLearning.py"
Traceback (most recent call last):
File "/home/abjay/Desktop/PythonCrush/new/machineLearning.py", line 9, in <module>
X = df[['Home Team Form', 'Away Team Form']]
File "/home/abjay/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py", line 3511, in __getitem__
indexer = self.columns._get_indexer_strict(key, "columns")[1]
File "/home/abjay/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 5782, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/home/abjay/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 5842, in _raise_if_missing
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['Home Team Form', 'Away Team Form'], dtype='object')] are in the [columns]"
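One point the code above does not cover: map looks up each whole cell value, and a cell such as "WWDDL" is not a key in {'W': 3, 'D': 2, 'L': 1}. A minimal sketch of one possible conversion follows, assuming the form should be scored by summing the points of each character. The helper name form_to_points is hypothetical, and stripping whitespace from the header row is only a guess at one common cause of this kind of KeyError, not a confirmed diagnosis.

```python
import io

import pandas as pd

# Sample rows copied from the question's football_data.csv.
CSV_TEXT = """Home Team,Away Team,Home Team Form,Away Team Form,Score Line
Chelsea,Arsenal,WWDDL,WWWWW,3-2
Manchester United,Liverpool,LLDWD,WDWWW,1-2
"""

# Points per result, using the same values as the mapping in the question.
POINTS = {"W": 3, "D": 2, "L": 1}


def form_to_points(form):
    """Hypothetical helper: sum the points of every result in a form string."""
    return sum(POINTS[result] for result in form)


df = pd.read_csv(io.StringIO(CSV_TEXT))
# Assumption: stray whitespace in the header names is one possible cause of
# the KeyError, so the names are normalised before selecting columns.
df.columns = df.columns.str.strip()

# .copy() avoids modifying a view of df when the columns are overwritten.
X = df[["Home Team Form", "Away Team Form"]].copy()
for column in X.columns:
    X[column] = X[column].map(form_to_points)

print(X)
```

Summing is only one choice. If the order of results matters to the model, splitting each form string into one column per match would keep that information.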

Related

vaex apply does not work when using dataframe columns

I am trying to tokenize natural language for the first sentence in wikipedia in order to find 'is a' patterns. n-grams of the tokens and left over text would be the next step. "Wellington is a town in the UK." becomes "town is a attr_root in the country." Then find common patterns using n-grams.
For this I need to replace string values in a string column using other string columns in the dataframe. In Pandas I can do this using
df['Test'] = df.apply(lambda x: x['Name'].replace(x['Rep'], x['Sub']), axis=1)
but I cannot find the equivalent vaex method. This issue led me to believe that this should be possible in vaex based on Maarten Breddels' example code, however when trying it I get the below error.
import pandas as pd
import vaex
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Rep": ["Braund", "Henry", "Miss."],
        "Sub": ["<surname>", "<name>", "<title>"],
    }
)
dfv = vaex.from_pandas(df)
def func(x, y, z):
    return x.replace(y, z)
dfv['Test'] = dfv.apply(func, arguments=[df.Name.astype('str'), df.Rep.astype('str'), df.Sub.astype('str')])
Gives
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\dataframe.py", line 455, in apply
arguments = _ensure_strings_from_expressions(arguments)
File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 780, in _ensure_strings_from_expressions
return [_ensure_strings_from_expressions(k) for k in expressions]
File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 780, in <listcomp>
return [_ensure_strings_from_expressions(k) for k in expressions]
File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 782, in _ensure_strings_from_expressions
return _ensure_string_from_expression(expressions)
File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 775, in _ensure_string_from_expression
raise ValueError('%r is not of string or Expression type, but %r' % (expression, type(expression)))
ValueError: 0 Braund, Mr. Owen Harris
1 Allen, Mr. William Henry
2 Bonnell, Miss. Elizabeth
Name: Name, dtype: object is not of string or Expression type, but <class 'pandas.core.series.Series'>
How can I accomplish this in vaex?
It turns out I had a bug: the arguments in the call to apply needed to come from dfv (the vaex DataFrame) instead of df (the pandas DataFrame).
I also got this faster method from the nice people at vaex.
import pyarrow as pa
import pandas as pd
import vaex
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Rep": ["Braund", "Henry", "Miss."],
        "Sub": ["<surname>", "<name>", "<title>"],
    }
)
dfv = vaex.from_pandas(df)
# Register the function so it becomes available as dfv.func.replacer
@vaex.register_function()
def replacer(x, y, z):
    res = []
    # Each argument arrives as an array; replace element by element
    for i, j, k in zip(x.tolist(), y.tolist(), z.tolist()):
        res.append(i.replace(j, k))
    return pa.array(res)
dfv['Test'] = dfv.func.replacer(dfv['Name'], dfv['Rep'], dfv['Sub'])

I'm not sure why I am getting a KeyError: in the sample code below

The CSV file I am importing has a column named 'MED EXP DATE'. I reference this same column earlier in the code with no problem. When I reference it later, I get a key error.
import pandas as pd
df = pd.read_csv(r"C:\Users\Desktop\Air_Rally_Marketing\PILOT_BASIC.csv", low_memory=False, dtype={'UNIQUE ID': str, 'FIRST NAME': str,'LAST NAME': str, 'STREET 1': str, 'STREET 2': str, 'CITY': str, 'STATE': str, 'ZIP CODE': str, 'MED DATE': str, 'MED EXP DATE': str})
df.dropna(inplace = True)
df['EXP DATE LEN'] = df['MED EXP DATE'].apply(len) #creates a column to store the length of the column MED EXP DATE
print(df.head())
This is the error I receive:
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 90, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 135, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index_class_helper.pxi", line 109, in pandas._libs.index.Int64Engine._check_type
KeyError: 'MED EXP DATE'
When I search the meaning of this error, my understanding is that it means I am referencing a key that cannot be found. I'm confused by this because I reference "MED EXP DATE" in a prior line and do not get the key error there.
This line:
df = Right = df['MED EXP DATE'].str[-4:]
turns your df variable into a Series instead of a pandas DataFrame. By the time execution reaches the apply statement, df no longer has a 'MED EXP DATE' column to look up, which is why the KeyError is raised.
SOLUTION: Use double brackets to ensure df remains a pandas DataFrame.
Setting your df to a Series:
df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 1, 'b': 3}])
df = df['a']
type(df)
<class 'pandas.core.series.Series'>
Retaining your df as a DataFrame
df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 1, 'b': 3}])
df = df[['a']]
type(df)
<class 'pandas.core.frame.DataFrame'>
I think the error is in this line:
df = Right = df['MED EXP DATE'].str[-4:]
Why are there two assignment operators (=) here?
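Both answers point at the chained assignment. The sketch below reproduces the effect on a small made-up DataFrame (the two date strings and the column name EXP YEAR are hypothetical, chosen only for illustration) and then shows one way to keep df as a DataFrame by storing the slice in a new column instead.

```python
import pandas as pd

# Made-up sample values; only the column name comes from the question.
df = pd.DataFrame({"MED EXP DATE": ["01/2020", "12/2021"]})

# Chained assignment binds BOTH names to the same Series,
# so df is no longer a DataFrame after this line.
df = right = df["MED EXP DATE"].str[-4:]
print(type(df).__name__)

caught = None
try:
    # The Series index holds row labels (0, 1), not column names.
    df["MED EXP DATE"]
except KeyError as exc:
    caught = exc
print("KeyError raised:", caught is not None)

# One possible fix: keep the DataFrame and add the slice as a new column.
df2 = pd.DataFrame({"MED EXP DATE": ["01/2020", "12/2021"]})
df2["EXP YEAR"] = df2["MED EXP DATE"].str[-4:]  # hypothetical column name
df2["EXP DATE LEN"] = df2["MED EXP DATE"].apply(len)
print(df2)
```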

python IndexError: List Index out of range or TypeError: list indices must be integers or slices, not str

I am using the Fitbit Python Library to connect to the fitbit api: https://github.com/orcasgit/python-fitbit
I am not very familiar with Fitbit, but I believe I am on the right path for what I am trying to do.
I have data that looks like this:
{u'activities': [],
u'goals':
{u'activeMinutes': 30, u'distance': 5, u'caloriesOut': 2364, u'steps': 10000},
u'summary':
{u'distances':
[{u'distance': 3.49, u'activity': u'total'},
{u'distance': 3.49, u'activity': u'tracker'},
{u'distance': 0, u'activity': u'loggedActivities'},
{u'distance': 1.27, u'activity': u'veryActive'},
{u'distance': 0.22, u'activity': u'moderatelyActive'},
{u'distance': 2, u'activity': u'lightlyActive'},
{u'distance': 0, u'activity': u'sedentaryActive'}],
u'sedentaryMinutes': 394,
u'lightlyActiveMinutes': 153,
u'caloriesOut': 1547,
u'caloriesBMR': 942,
u'marginalCalories': 414,
u'fairlyActiveMinutes': 8,
u'veryActiveMinutes': 29,
u'activityCalories': 750,
u'steps': 8277,
u'activeScore': -1}}
Note: it is normally all on one line, but I put each item on its own line to make it easier to read.
I am trying to write only a few of these fields as columns in a CSV file that would look like this:
Here is the code I have, most of it is pulled from this website with me modifying it to pull activity instead of sleep summary: https://towardsdatascience.com/collect-your-own-fitbit-data-with-python-ff145fa10873
import fitbit
import gather_keys_oauth2 as Oauth2
import pandas as pd
import datetime
import csv
CLIENT_ID = '22CZ94'
CLIENT_SECRET = '06a52bc5d8239790f630ffdd19377ba2'
server = Oauth2.OAuth2Server(CLIENT_ID, CLIENT_SECRET)
server.browser_authorize()
ACCESS_TOKEN = str(server.fitbit.client.session.token['access_token'])
REFRESH_TOKEN = str(server.fitbit.client.session.token['refresh_token'])
auth2_client = fitbit.Fitbit(CLIENT_ID, CLIENT_SECRET, access_token='eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiI2V0gyTlAiLCJhdWQiOiIyMkNaOTQiLCJpc3MiOiJGaXRiaXQiLCJ0eXAiOiJhY2Nlc3NfdG9rZW4iLCJzY29wZXMiOiJyc29jIHJzZXQgcmFjdCBybG9jIHJ3ZWkgcmhyIHJwcm8gcm51dCByc2xlIiwiZXhwIjoxNTY5Mjc5OTAxLCJpYXQiOjE1Mzc3NDM5MDF9.1StrKUUJwidejZ2pbCZzkIBG8FztQiLMvBql6fgEpaY', refresh_token=REFRESH_TOKEN)
fit_statsSum = auth2_client.activities(date='2018-09-25')['activities'][0]
actsummarypdf = pd.DataFrame({'SedentaryMinutes':fit_statsSum[u'sedentaryMinutes'],
'lightlyActiveMinutes':fit_statsSum['lightlyActiveMinutes'],
'fairlyActiveMinutes':fit_statsSum['fairlyActiveMinutes'],
'veryActiveMinutes':fit_statsSum['veryActiveMinutes'],
'steps':fit_statsSum['steps']
})
actsummarypdf.to_csv(r'c:\python-fitbit-master\Activities' + '2018-09-25' + '.csv')  # raw string keeps the backslashes literal
With the code like that I get:
Traceback (most recent call last):
File ".\autho2_activity_summary.py", line 28, in <module>
fit_statsSum = auth2_client.activities(date='2018-09-25')['activities'][0]
IndexError: list index out of range
If I remove the [0], I get:
Traceback (most recent call last):
File ".\autho2_activity_summary.py", line 30, in <module>
actsummarypdf = pd.DataFrame({'SedentaryMinutes':fit_statsSum['sedentaryMinutes'],
TypeError: list indices must be integers or slices, not str
I've also tried using u'sedentaryMinutes' and "u'sedentaryMinutes'", but nothing changed.
Any help on what I am missing would be truly appreciated.
The IndexError means that you are trying to access an item in a list at an index that doesn't exist.
This will presumably return an object that looks like the data at the beginning of your question.
fit_statsSum = auth2_client.activities(date='2018-09-25')
And the value of the key activities is an empty list.
{u'activities': []
So when you try to access the item at index 0 (the first item), you will get an error. There is no first item in an empty list.
['activities'][0]
That's what the exception message means. But I can't tell you how to proceed, because you haven't told us anything about what you are trying to do.
If your fitbit activity data set is empty, you probably need to go for a run?
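For what it's worth, the fields named in the question's DataFrame code appear under the 'summary' key of the sample data, not under 'activities'. Below is a minimal sketch of reading them from there, assuming the real response has the same shape as the data shown in the question. The response dict is a trimmed stand-in for the result of auth2_client.activities(date='2018-09-25'), so no Fitbit call is made here.

```python
import pandas as pd

# Stand-in for the API response, trimmed from the data in the question.
response = {
    "activities": [],
    "summary": {
        "sedentaryMinutes": 394,
        "lightlyActiveMinutes": 153,
        "fairlyActiveMinutes": 8,
        "veryActiveMinutes": 29,
        "steps": 8277,
    },
}

# 'summary' is a dict, so it is indexed by key; 'activities' is a list
# (empty here), which is why [0] and string keys both failed on it.
summary = response["summary"]

# Output column name -> key in the summary dict (names as in the question).
columns = {
    "SedentaryMinutes": "sedentaryMinutes",
    "lightlyActiveMinutes": "lightlyActiveMinutes",
    "fairlyActiveMinutes": "fairlyActiveMinutes",
    "veryActiveMinutes": "veryActiveMinutes",
    "steps": "steps",
}
row = {name: summary[key] for name, key in columns.items()}

# Wrapping the dict in a list yields a one-row DataFrame; passing a dict
# of plain scalars directly would require an explicit index.
actsummarypdf = pd.DataFrame([row])

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = actsummarypdf.to_csv(index=False)
print(csv_text)
```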

tensorflow error: composing crossed columns from categorical ones

I want to see explicitly the input tensor for the categorical crossed feature columns but get the error:
ValueError: Items of feature_columns must be a _DenseColumn. You can wrap a categorical column with an embedding_column or indicator_column. Given: _VocabularyListCategoricalColumn(key='education', vocabulary_list=('7th', '8th', '10th', '6th', '1th'), dtype=tf.string, default_value=-1, num_oov_buckets=0)
Code:
import tensorflow as tf
import tensorflow.feature_column as fc
import numpy as np
tf.enable_eager_execution()
x = {'education': ['7th', '8th', '10th', '10th', '6th', '1th'],
     'occupation': ['pro', 'sport', 'sci', 'model', 'pro', 'tech'],}
y = {'output': [1, 2, 3, 4, 5, 6]}
education = fc.categorical_column_with_vocabulary_list(
    'education',
    ['7th', '8th', '10th', '6th', '1th'])
occupation = fc.categorical_column_with_vocabulary_list(
    'occupation',
    ['pro', 'sport', 'sci', 'model', 'tech'])
education_x_occupation = tf.feature_column.crossed_column(
    ['education', 'occupation'], 30)
feat_cols = [
    education,
    occupation,
    education_x_occupation]
fc.input_layer(x, feat_cols) # output
What is the correct implementation?
UPD:
I changed the last 5 lines to
feat_cols = [
fc.indicator_column(education),
fc.indicator_column(occupation),
fc.indicator_column(education_x_occupation)]
example_data = fc.input_layer(x, feat_cols) # output
print(example_data.numpy())
and I got the following errors. I am not sure whether they come from TensorFlow or from Python itself. Should I handle the overflow first?
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
OverflowError: Python int too large to convert to C long
During handling of the above exception, another exception occurred:
SystemError Traceback (most recent call last)
<ipython-input-125-db0c0525e02b> in <module>()
4 fc.indicator_column(education_x_occupation)]
5
----> 6 example_data = fc.input_layer(x, feat_cols) # output
7 print(example_data.numpy())
...
~\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_sparse_ops.py in sparse_cross(indices, values, shapes, dense_inputs, hashed_output, num_buckets, hash_key, out_type, internal_type, name)
1327 dense_inputs, "hashed_output", hashed_output, "num_buckets",
1328 num_buckets, "hash_key", hash_key, "out_type", out_type,
-> 1329 "internal_type", internal_type)
1330 _result = _SparseCrossOutput._make(_result)
1331 return _result
SystemError: <built-in function TFE_Py_FastPathExecute> returned a result with an error set
You need to pass your categorical columns through fc.indicator_column() before you can view them.
Try amending your last few lines to this:
feat_cols = [
fc.indicator_column(education),
fc.indicator_column(occupation),
fc.indicator_column(education_x_occupation)]
example_data = fc.input_layer(x, feat_cols) # output
print(example_data.numpy())
Is that what you were hoping to see?

How do I use Django's ORM to return a group of rows only when 3 or more rows will be GROUPed?

Given the following Model:
class Enquiry(models.Model):
    enquiryparent = models.ForeignKey('self',default=None, null=True, blank=True)
    type = models.SmallIntegerField()
    name = models.CharField(max_length=200)
    mobile = models.CharField(max_length=40,blank=True,null=True)
    email = models.EmailField(blank=True,null=True)
    message = models.TextField(blank=True,null=True)
    registered = models.DateTimeField(auto_now_add=True)
How can I write the following query in Django:
Count for a particular type say 'x' where mobile is unique across three days (using registered date) for the complete set...
example:
id, type, mobile, registered
1, 2, 988, 01/11/2011
1, 2, 988, 02/11/2011
1, 2, 988, 03/11/2011
1, 4, 988, 04/11/2011
1, 2, 988, 05/11/2011
1, 2, 988, 06/11/2011
1, 2, 988, 07/11/2011
1, 2, 555, 07/11/2011
The result should be:
id, type, mobile, registered
1, 2, 988, 03/11/2011
1, 2, 988, 05/11/2011
1, 2, 555, 07/11/2011
Count total = 3.
If I'm understanding you correctly it seems like this should be doable, but I'm unfortunately unfamiliar with Django's ORM. I think the basic idea would probably look like:
Select by id (id == 1 in your example), a static integer column that always equals 1, and a dynamic column like sum(static1)
Filter by type (like type == 2 in your example)
Sort by registered
Group by mobile
Filter where sum(static1) is 3 or greater
I know that this is possible in raw SQL, but I don't know how to do it in Django's ORM. Your other option is to simply write a for loop that accumulates the results.
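The for-loop option can be sketched in plain Python. This is only an illustration of the steps listed above (filter by type, group by mobile, keep groups with 3 or more rows), run on the example rows from the question. It does not try to reproduce the asker's expected output, whose three-day rule is not fully specified. The function name groups_with_min_rows is hypothetical, and the dates are assumed to be in dd/mm/yyyy format.

```python
from collections import defaultdict
from datetime import datetime

# (id, type, mobile, registered) rows copied from the question's example.
ROWS = [
    (1, 2, 988, "01/11/2011"),
    (1, 2, 988, "02/11/2011"),
    (1, 2, 988, "03/11/2011"),
    (1, 4, 988, "04/11/2011"),
    (1, 2, 988, "05/11/2011"),
    (1, 2, 988, "06/11/2011"),
    (1, 2, 988, "07/11/2011"),
    (1, 2, 555, "07/11/2011"),
]


def groups_with_min_rows(rows, wanted_type, min_rows=3):
    """Group rows of one type by mobile and keep groups with enough rows."""
    grouped = defaultdict(list)
    for _id, row_type, mobile, registered in rows:
        if row_type == wanted_type:
            # Assumption: dates are written day/month/year.
            grouped[mobile].append(datetime.strptime(registered, "%d/%m/%Y").date())
    # Equivalent of HAVING COUNT(*) >= min_rows, with dates sorted per group.
    return {
        mobile: sorted(dates)
        for mobile, dates in grouped.items()
        if len(dates) >= min_rows
    }


result = groups_with_min_rows(ROWS, wanted_type=2)
for mobile, dates in result.items():
    print(mobile, len(dates), dates[0], dates[-1])
```

In the ORM itself, the same group-and-filter shape is usually written as Enquiry.objects.filter(type=2).values('mobile').annotate(n=Count('id')).filter(n__gte=3), with Count imported from django.db.models. Here n is just an illustrative annotation name.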
