Python/pandas KeyError issue with user-defined functions

Python/pandas KeyError issue with user-defined functions - python

Why do these two functions work properly and run with no errors?
def avgSalespWpY(df, weekStart_col_name, sku_col_name, order_col_name):
df['SKU_WEEK'] = pd.DatetimeIndex(df[weekStart_col_name]).week
grouped_data = df.groupby(['SKU_WEEK']).mean().reset_index()
return grouped_data[['SKU_WEEK', 'R_ORDER_QT']]
def avg_sales(df, prediction_window):
earliest_date = prediction_window['WEEK_START_DT'].min()
filter_date = df[df['WEEK_START_DT'] < earliest_date]
avg_sales_df = avgSalespWpY(filter_date, 'WEEK_START_DT', 'SKU', 'R_ORDER_QT')
prediction_window['SKU_WEEK'] = pd.DatetimeIndex(prediction_window['WEEK_START_DT']).week
avg_sales_df.columns = ['SKU_WEEK', 'OA_SPW']
avg_sales_total = avg_sales_df['OA_SPW'].mean()
prediction_window = pd.merge(prediction_window, avg_sales_df, how = 'left', on = 'SKU_WEEK')
prediction_window = prediction_window.loc[:, prediction_window.columns != 'SKU_WEEK']
prediction_window['OA_SPW'] = prediction_window['OA_SPW'].fillna(avg_sales_total)
return prediction_window
But these two do not work and run an error?
def avgSalespMpY(df, weekStart_col_name, order_col_name):
df['SKU_MONTH'] = pd.DatetimeIndex(df[weekStart_col_name]).month
grouped_data = df.groupby(['SKU_MONTH'])[order_col_name].mean()
return grouped_data
def avg_salesMY(df, prediction_window):
earliest_date = prediction_window['WEEK_START_DT'].min()
filter_date = df[df['WEEK_START_DT'] < earliest_date]
avg_sales_df = avgSalespMpY(filter_date, 'WEEK_START_DT', 'R_ORDER_QT')
prediction_window['SKU_MONTH'] = pd.DatetimeIndex(prediction_window['WEEK_START_DT']).month
avg_sales_df.columns = ['SKU_MONTH', 'OA_SPW']
avg_sales_total = avg_sales_df['OA_SPW'].mean()
prediction_window = pd.merge(prediction_window, avg_sales_df, how = 'left', on = 'SKU_MONTH')
prediction_window = prediction_window.loc[:, prediction_window.columns != 'SKU_MONTH']
prediction_window['OA_SPW'] = prediction_window['OA_SPW'].fillna(avg_sales_total)
print(prediction_window['OA_SPW'])
return prediction_window
This is the error I am getting when I run the second function (avg_salesMY):
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
5 frames
pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()
pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()
KeyError: 'OA_SPW'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'OA_SPW'
The output for the first function (avgSalespMpY) is what we expect and is as follows:
SKU_MONTH
1 535.687921
2 577.925649
3 611.837803
4 678.377140
5 496.411170
6 601.806244
7 688.770197
8 574.510967
9 636.203457
10 876.896305
11 719.614757
12 553.642329
Name: R_ORDER_QT, dtype: float64
I am very confused what is going on because the two segments are almost identical. There is an issue happening accessing the OA_SPW column but the one code is above the other, so why does it all of a sudden not want to run properly?
The first segment is supposed to take in data and find the overall average sales per week of year in the first segment and the second is supposed to take in prediction data and add in the output of the previous data. The second segment is supposed to take in data and find the overall average sales per month in the first segment and the second is supposed to take in prediction data and add in the output of the previous data. The biggest difference is that one's results is week by week and the other is based on month.

Related

Python Tried everything. Still getting this error: TypeError(f"Invalid comparison between dtype={left.dtype} and {typ}")

I hope you can help me with this issue, I've been having for a while. I keep getting this error no matter what i try:
This is the type as I know it:
tweets['post_date'] = pd.to_datetime(tweets['post_date'], unit='s')
tweets['date'] = pd.to_datetime(tweets['post_date'].apply(lambda date: date.date()))
tweets.head()
Output:
post_date body ticker_symbol date
19 2015-01-01 00:11:17 $UNP $ORCL $QCOM $MSFT $AAPL Top scoring mega ... MSFT 2015-01-01
43 2015-01-01 00:55:58 http://StockAviator.com....Top penny stocks, N... MSFT 2015-01-01
TypeError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/core/arrays/datetimelike.py in _validate_comparison_value(self, other)
539 try:
--> 540 self._check_compatible_with(other)
541 except (TypeError, IncompatibleFrequency) as err:
13 frames
TypeError: Cannot compare tz-naive and tz-aware datetime-like objects.
The above exception was the direct cause of the following exception:
InvalidComparison Traceback (most recent call last)
InvalidComparison: 2015-01-01 12:00:00-05:00
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/core/ops/invalid.py in invalid_comparison(left, right, op)
32 else:
33 typ = type(right).__name__
---> 34 raise TypeError(f"Invalid comparison between dtype={left.dtype} and {typ}")
35 return res_values
36
TypeError: Invalid comparison between dtype=datetime64[ns] and Timestamp
This error comes from this column in my code:
#market opens 14:30 closes 21:00
def getAvgPerPrice (tweets,stockk):
stock = stockk.copy()
result = pd.DataFrame([])
for i in range(0,len(stock)-1):
d = stock.index[i]
next_d = stock.index[i+1]
wanted_tweets = tweets[((tweets.post_date - timedelta(hours = 3)) >=( d + timedelta(hours = h))) & ((tweets.post_date - timedelta(hours = 3)) < (next_d + timedelta(hours = h)))]
result.at[i,'date'] = d
result.at[i,'close'] = stock.iloc[i].Close
result.at[i,'avgScore'] = wanted_tweets['score'].mean()
I would really appreciate if anyone could help me find the issue. I have tried many things already but no luck. Thank you in advance

Apply a binomial function to every row in a pandas Dataframe

I am using the following code on that dataframe:
Here picture of the dataframe used
not_rescued = df["not_rescued"]
rescued = df["rescued"]
breed = df["breed"]
def get_attribute(breed, attribute):
if breed in df.breed.unique():
if attribute in df.columns:
return df[df["breed"] == breed][attribute]
else:
raise NameError('Attribute {} does not exist.'.format(attribute))
else:
raise NameError('Breed {} does not exist.'.format(breed))
def get_not_rescued(breed):
return get_attribute(breed, 'not_rescued')
def get_rescued(breed):
return get_attribute(breed, 'rescued')
def get_total(breed):
return get_attribute(breed, 'total')
from scipy.stats import binom_test
binomial_test = binom_test(get_rescued(breed), get_total(breed), 0.08)
print(binomial_test)
I would like to apply the binomial function without having to call the function each time for each dog.. How could I use a list without having the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [18], in <cell line: 2>()
1 from scipy.stats import binom_test
----> 2 binomial_test = binom_test(get_rescued(breed), get_total(breed), 0.08)
3 print(binomial_test)
File ~/opt/anaconda3/lib/python3.9/site-packages/scipy/stats/morestats.py:2667, in binom_test(x, n, p, alternative)
2665 n = np.int_(n)
2666 else:
-> 2667 raise ValueError("Incorrect length for x.")
2669 if (p > 1.0) or (p < 0.0):
2670 raise ValueError("p must be in range [0,1]")
ValueError: Incorrect length for x.

ValueError: "Columns must be same length as key" > Python

I'm using Colab to run the code but I get this error and I can't fix it. Could you help me out?
I don't know what to do in order to fix it because I have tried to change upper case or lower case.
#Inativos: ajustar nomes das colunas
dfInativos = dfInativos.rename(columns={'userId': 'id'})
dfInativos = dfInativos.rename(columns={'classId': 'ClasseId'})
dfInativos[['id','ClasseId','lastActivityDate','inactivityDaysCount','sevenDayInactiveStatus']] = dfInativos
#dfInativos['id'] = dfInativos['id'].astype(int, errors = 'ignore')
#dfInativos['ClasseId'] = dfInativos['ClasseId'].astype(int, errors = 'ignore')
dfInativos['id'] = pd.to_numeric(dfInativos['id'],errors = 'coerce')
dfInativos['ClasseId'] = pd.to_numeric(dfInativos['ClasseId'],errors = 'coerce')
#dfInativos.dropna(subset = ['lastActivityDate'], inplace=True)
dfInativos.drop_duplicates(subset = ['id','ClasseId'], inplace=True)
dfInativos['seven DayInactiveStatus'] = dfInativos['sevenDayInactiveStatus'].replace(0,'')
#Add Inactive data to main data frame
df = df.merge(dfInativos, on=['id','ClasseId'], how='left')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-79-10fe94c48d1f> in <module>()
2 dfInativos = dfInativos.rename(columns={'userId': 'id'})
3 dfInativos = dfInativos.rename(columns={'classId': 'ClasseId'})
----> 4 dfInativos[['id','ClasseId','lastActivityDate','inactivityDaysCount','sevenDayInactiveStatus']] = dfInativos
5
6
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexers.py in check_key_length(columns, key, value)
426 if columns.is_unique:
427 if len(value.columns) != len(key):
--> 428 raise ValueError("Columns must be same length as key")
429 else:
430 # Missing keys in columns are represented as -1
ValueError: Columns must be same length as key

More efficient way to deal with KeyError from dictionary

I want to find the country names for a data frame columns with top level domains such as 'de', 'it', 'us'.. by using the iso3166 package.
There are domains in the dataset that does not exist in iso3166, therefore, Value Error got raised.
I tried to solve the value error by letting the code return Boolean values but it runs for a really long time. Will be great to know how to speed it up.
Sample data: df['country']
0 an
1 de
2 it
My code (Note the code does not raise KeyError error. My question is how to make it faster)
df['country_name'] = df['country'].apply(lambda x: countries.get(x)[0] if \
df['country'].str.find(x).any() == countries.get(x)[1].lower() else 'unknown')
df['country] is the data frame column. countries.get() is for getting country names from iso3166
df['country'].str.find(x).any() finds top level domains and countries.get(x)[1].lower()returns top level domains. If they are the same then I use countries.get(x)[0] to return the country name
Expected output
country country_name
an unknown
de Germany
it Italy
Error if I run df['country_name'] = df['country'].apply(lambda x: countries.get(x)[0]) (I renamed the dataframe so it's different from the error message)
KeyError Traceback (most recent call last)
<ipython-input-110-d51176ce2978> in <module>
----> 1 bf['country_name'] = bf['country'].apply(lambda x: countries.get(x)[0])
/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
3846 else:
3847 values = self.astype(object).values
-> 3848 mapped = lib.map_infer(values, f, convert=convert_dtype)
3849
3850 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-110-d51176ce2978> in <lambda>(x)
----> 1 bf['country_name'] = bf['country'].apply(lambda x: countries.get(x)[0])
/opt/anaconda3/lib/python3.8/site-packages/iso3166/__init__.py in get(self, key, default)
358
359 if r == NOT_FOUND:
--> 360 raise KeyError(key)
361
362 return r
KeyError: 'an'```

A little error handling and defining your logic outside of the apply() method should get you where you want to go. Something like:
def get_country_name(x):
try:
return countries.get(x)[0]
except:
return 'unknown'
df['country_name'] = df['country'].apply(lambda x: get_country_name(x))

This James Tollefson's answer, narrowed down to its core, didn't want to change his answer too much, here's the implementation:
def get_country_name(x: str) -> str:
return countries[x][0] if countries.get(x) else 'unknown'
df['country_name'] = df['country'].apply(get_country_name)

Why I have this problem with index range?why does it not work?

I have got this error when try split my one column to few columns. But it split on just on one or two columns.If you wanna split on 3,4,5 columns it writes:
ValueError Traceback (most recent call last)
/usr/local/Cellar/jupyterlab/2.1.5/libexec/lib/python3.8/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
349 try:
--> 350 return self._range.index(new_key)
351 except ValueError:
ValueError: 2 is not in range
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-19-d4e6a4d03e69> in <module>
22 data_old[Col_1_Label] = newz[0]
23 data_old[Col_2_Label] = newz[1]
---> 24 data_old[Col_3_Label] = newz[2]
25 #data_old[Col_4_Label] = newz[3]
26 #data_old[Col_5_Label] = newz[4]
/usr/local/Cellar/jupyterlab/2.1.5/libexec/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
2798 if self.columns.nlevels > 1:
2799 return self._getitem_multilevel(key)
-> 2800 indexer = self.columns.get_loc(key)
2801 if is_integer(indexer):
2802 indexer = [indexer]
/usr/local/Cellar/jupyterlab/2.1.5/libexec/lib/python3.8/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
350 return self._range.index(new_key)
351 except ValueError:
--> 352 raise KeyError(key)
353 return super().get_loc(key, method=method, tolerance=tolerance)
354
KeyError: 2
There is my code.I have csv file.And when pandas read it - create one column with value 'Контракт'.Then. I split it on another columns. But it split till two columns.I wanna 7 columns!Help please to understand this logic!
import pandas as pd
from pandas import Series, DataFrame
import re
dframe1 = pd.read_csv('po.csv')
columns = ['Контракт']
data_old = pd.read_csv('po.csv', header=None, names=columns)
data_old
# The thing you want to split the column on
SplitOn = ':'
# Name of Column you want to split
Split_Col = 'Контракт'
newz = data_old[Split_Col].str.split(pat=SplitOn, n=-1, expand=True)
# Column Labels (you can add more if you will have more)
Col_1_Label = 'Номер телефону'
Col_2_Label = 'Тарифний пакет'
Col_3_Label = 'Вихідні дзвінки з України за кордон'
Col_4_Label = 'ВАРТІСТЬ ПАКЕТА/ЩОМІСЯЧНА ПЛАТА'
Col_5_Label = 'ЗАМОВЛЕНІ ДОДАТКОВІ ПОСЛУГИ ЗА МЕЖАМИ ПАКЕТА'
Col_6_Label = 'Вартість послуги "Корпоративна мережа'
Col_7_Label = 'ЗАГАЛОМ ЗА КОНТРАКТОМ (БЕЗ ПДВ ТА ПФ)'
data_old[Col_1_Label] = newz[0]
data_old[Col_2_Label] = newz[1]
data_old[Col_3_Label] = newz[2]
#data_old[Col_4_Label] = newz[3]
#data_old[Col_5_Label] = newz[4]
#data_old[Col_6_Label] = newz[5]
#data_old[Col_7_Label] = newz[6]
data_old

Pandas does not support "unstructured text", you should convert it to a standard format or python objects and then create a dataframe from it
Imagine that you have a file with this text named data.txt:
Contract № 12345679 Number of phone: +7984563774
Total price for month : 00.00000
Total price: 10.0000
You can load an process it with Python like this:
with open('data.txt') as f:
content = list(data.readlines())
# First line contains the contract number and phone information
contract, phone = content[0].split(':')
# find contract number using regex
contract = re.findall('\d+', contract)[0]
# The phone is strightforward
phone = phone.strip()
# Second line and third line for prices
total_price = float(content[1].split(':')[1].strip())
total_month_price = float(content[2].split(':')[1].strip())
Then with these variables you can create a dataframe
df = pd.DataFrame([dict(N_of_contract=contract, total_price=total_price, total_month_price =total_month_price )])
Repeat the same for all files.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python/pandas KeyError issue with user-defined functions - python

Related

Python Tried everything. Still getting this error: TypeError(f"Invalid comparison between dtype={left.dtype} and {typ}")

Apply a binomial function to every row in a pandas Dataframe

ValueError: "Columns must be same length as key" > Python

More efficient way to deal with KeyError from dictionary

Why I have this problem with index range?why does it not work?

Categories

Resources