Hi, I'm new to Python and I have no clue how to fix the following error:
I have a data frame of store data with around 2 million records and 20 columns. I am grouping the stores by State and trying to run dedupe_dataframe on each state after training it on one.
Here is how my code looks (np is numpy, pd is pandas, dp is pandas_dedupe):
#Read Store Data
stores = pd.read_csv("storefile.csv", sep=",", encoding='latin1', dtype=str)
#There was \t in the first column so removing that
stores = stores.replace('\t', '', regex=True)
stores = stores.replace(np.nan, '', regex=True)
#Getting a lowercase state list
states = list(stores.State.str.lower().unique())
#Grouping Data by States
state_data = {state: stores[stores.State.str.lower() == state] for state in states}
#Running De-Dupe for state Ohio ('oh')
dp.dedupe_dataframe(state_data['oh'], ['StoreBannerName','Address','City','State'])
I'm getting the following error:
importing data ...
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-37-e2ed10256338> in <module>
----> 1 dp.dedupe_dataframe(state_data['oh'], ['StoreBannerName','Address','City','State'])
~\anaconda3\lib\site-packages\pandas_dedupe\dedupe_dataframe.py in
dedupe_dataframe(df, field_properties, canonicalize, config_name,
recall_weight, sample_size)
211 # train or load the model
212 deduper = _train(settings_file, training_file, data_d, field_properties,
--> 213 sample_size)
214
215 # ## Set threshold
~\anaconda3\lib\site-packages\pandas_dedupe\dedupe_dataframe.py in
_train(settings_file, training_file, data, field_properties, sample_size)
58 # To train dedupe, we feed it a sample of records.
59 sample_num = math.floor(len(data) * sample_size)
---> 60 deduper.sample(data, sample_num)
61
62 # If we have training data saved from a previous run of dedupe,
~\anaconda3\lib\site-packages\dedupe\api.py in sample(self, data,
sample_size, blocked_proportion, original_length)
836 sample_size,
837 original_length,
--> 838 index_include=examples)
839
840 self.active_learner.mark(examples, y)
~\anaconda3\lib\site-packages\dedupe\labeler.py in __init__(self,
data_model, data, blocked_proportion, sample_size, original_length,
index_include)
401 data = core.index(data)
402
--> 403 self.candidates = super().sample(data, blocked_proportion, sample_size)
404
405 random_pair = random.choice(self.candidates)
~\anaconda3\lib\site-packages\dedupe\labeler.py in sample(self, data,
blocked_proportion, sample_size)
50 return [(data[k1], data[k2])
51 for k1, k2
---> 52 in blocked_sample_keys | random_sample_keys]
53
54
~\anaconda3\lib\site-packages\dedupe\labeler.py in <listcomp>(.0)
49
50 return [(data[k1], data[k2])
---> 51 for k1, k2
52 in blocked_sample_keys | random_sample_keys]
53
KeyError: 2147550487
Solution
Swap the following line:
#Running De-Dupe for state Ohio ('oh')
dp.dedupe_dataframe(state_data['oh'], ['StoreBannerName','Address','City','State'])
WITH
#Running De-Dupe for state Ohio ('oh')
state_data['oh'].drop_duplicates(subset=['StoreBannerName','Address','City','State'], keep='first')
Reference
pandas.DataFrame.drop_duplicates()
I got a KeyError as well.
I used this code example:
https://github.com/dedupeio/dedupe-examples/tree/master/record_linkage_example
I had changed some of the code and had forgotten to remove the data_matching_learned_settings file before running it again.
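A minimal sketch of that cleanup (the file names below are the ones used by the record linkage example and may differ in your setup):
import os

# Delete the stale model artifacts so dedupe retrains from scratch
# instead of loading settings learned from an earlier run
for stale in ("data_matching_learned_settings", "data_matching_training.json"):
    if os.path.exists(stale):
        os.remove(stale)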
Another reason you might get this error, especially with the first column, is that your file could contain a BOM (Byte Order Mark) as its first character.
See this for how to remove a BOM:
VS Code: How to save UTF-8 without BOM in Mac?
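Alternatively, pandas can strip the BOM at read time. A minimal sketch, assuming the file is UTF-8 with a BOM rather than latin1:
import pandas as pd

# 'utf-8-sig' silently consumes a leading byte order mark, so the first
# column name is not polluted with '\ufeff'
stores = pd.read_csv("storefile.csv", sep=",", encoding="utf-8-sig", dtype=str)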
Related
I am working with an anndata object gleaned from analyzing single-cell RNA-seq data using scanpy to obtain clusters. This is far along in the process (nearly complete), and I am now trying to obtain a list of the average expression of certain marker genes in the leiden clusters from my data. I am getting an error at the following point.
# Backbone imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
# Single Cell imports
import anndata
import scanpy as sc
markers = ["MS4A1", "CD72", "CD37", "CD79A", "CD79B","CD19"]
grouping_column = "leiden"
df = sc.get.obs_df(hy_bc, markers + [grouping_column])
mean_expression = df.loc[:, ~df.columns.isin([grouping_column])].mean(axis=0)
mean_expression:
MS4A1 1.594015
CD72 0.421510
CD37 1.858241
CD79A 1.801162
CD79B 1.180483
CD19 0.430246
dtype: float32
df, mean_expression = df.align(mean_expression, axis=1, copy=False)
The error happens here:
g = (df > mean_expression).groupby(grouping_column)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [88], in <cell line: 1>()
----> 1 g = (df > mean_expression).groupby(grouping_column)
File C:\ProgramData\Anaconda3\envs\JHH216-hT246\lib\site-packages\pandas\core\ops\common.py:70, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
66 return NotImplemented
68 other = item_from_zerodim(other)
---> 70 return method(self, other)
File C:\ProgramData\Anaconda3\envs\JHH216-hT246\lib\site-packages\pandas\core\arraylike.py:56, in OpsMixin.__gt__(self, other)
54 @unpack_zerodim_and_defer("__gt__")
55 def __gt__(self, other):
---> 56 return self._cmp_method(other, operator.gt)
File C:\ProgramData\Anaconda3\envs\JHH216-hT246\lib\site-packages\pandas\core\frame.py:6934, in DataFrame._cmp_method(self, other, op)
6931 self, other = ops.align_method_FRAME(self, other, axis, flex=False, level=None)
6933 # See GH#4537 for discussion of scalar op behavior
-> 6934 new_data = self._dispatch_frame_op(other, op, axis=axis)
6935 return self._construct_result(new_data)
File C:\ProgramData\Anaconda3\envs\JHH216-hT246\lib\site-packages\pandas\core\frame.py:6985, in DataFrame._dispatch_frame_op(self, right, func, axis)
6979 # TODO: The previous assertion `assert right._indexed_same(self)`
6980 # fails in cases with empty columns reached via
6981 # _frame_arith_method_with_reindex
6982
6983 # TODO operate_blockwise expects a manager of the same type
6984 with np.errstate(all="ignore"):
-> 6985 bm = self._mgr.operate_blockwise(
6986 # error: Argument 1 to "operate_blockwise" of "ArrayManager" has
6987 # incompatible type "Union[ArrayManager, BlockManager]"; expected
6988 # "ArrayManager"
6989 # error: Argument 1 to "operate_blockwise" of "BlockManager" has
6990 # incompatible type "Union[ArrayManager, BlockManager]"; expected
6991 # "BlockManager"
6992 right._mgr, # type: ignore[arg-type]
6993 array_op,
6994 )
6995 return self._constructor(bm)
6997 elif isinstance(right, Series) and axis == 1:
6998 # axis=1 means we want to operate row-by-row
File C:\ProgramData\Anaconda3\envs\JHH216-hT246\lib\site-packages\pandas\core\internals\managers.py:1409, in BlockManager.operate_blockwise(self, other, array_op)
1405 def operate_blockwise(self, other: BlockManager, array_op) -> BlockManager:
1406 """
1407 Apply array_op blockwise with another (aligned) BlockManager.
1408 """
-> 1409 return operate_blockwise(self, other, array_op)
File C:\ProgramData\Anaconda3\envs\JHH216-hT246\lib\site-packages\pandas\core\internals\ops.py:63, in operate_blockwise(left, right, array_op)
61 res_blks: list[Block] = []
62 for lvals, rvals, locs, left_ea, right_ea, rblk in _iter_block_pairs(left, right):
---> 63 res_values = array_op(lvals, rvals)
64 if left_ea and not right_ea and hasattr(res_values, "reshape"):
65 res_values = res_values.reshape(1, -1)
File C:\ProgramData\Anaconda3\envs\JHH216-hT246\lib\site-packages\pandas\core\ops\array_ops.py:269, in comparison_op(left, right, op)
260 raise ValueError(
261 "Lengths must match to compare", lvalues.shape, rvalues.shape
262 )
264 if should_extension_dispatch(lvalues, rvalues) or (
265 (isinstance(rvalues, (Timedelta, BaseOffset, Timestamp)) or right is NaT)
266 and not is_object_dtype(lvalues.dtype)
267 ):
268 # Call the method on lvalues
--> 269 res_values = op(lvalues, rvalues)
271 elif is_scalar(rvalues) and isna(rvalues): # TODO: but not pd.NA?
272 # numpy does not like comparisons vs None
273 if op is operator.ne:
File C:\ProgramData\Anaconda3\envs\JHH216-hT246\lib\site-packages\pandas\core\ops\common.py:70, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
66 return NotImplemented
68 other = item_from_zerodim(other)
---> 70 return method(self, other)
File C:\ProgramData\Anaconda3\envs\JHH216-hT246\lib\site-packages\pandas\core\arrays\categorical.py:141, in _cat_compare_op.<locals>.func(self, other)
139 if not self.ordered:
140 if opname in ["__lt__", "__gt__", "__le__", "__ge__"]:
--> 141 raise TypeError(
142 "Unordered Categoricals can only compare equality or not"
143 )
144 if isinstance(other, Categorical):
145 # Two Categoricals can only be compared if the categories are
146 # the same (maybe up to ordering, depending on ordered)
148 msg = "Categoricals can only be compared if 'categories' are the same."
TypeError: Unordered Categoricals can only compare equality or not
Code I have, but have not run yet because of the error:
frac = lambda z: sum(z) / z.shape[0]
frac.__name__ = "pos_frac"
g.aggregate([sum, frac])
It seems that your grouping column is a categorical column and not float or int. Try adding this line after the instantiation of the dataframe:
df = sc.get.obs_df(hy_bc, markers + [grouping_column])
df[grouping_column] = df[grouping_column].astype('int64')
Another issue I noticed: the expression df > mean_expression will produce all False values in the leiden column, because leiden is NaN in mean_expression. Therefore, when you use groupby, you will only have one group, the value False. One group defeats the purpose of groupby. I am not sure what you are trying to do, but I wanted to point that out.
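If the intent is to keep leiden as categorical labels, a minimal sketch (my suggestion, untested on the asker's data, using df, markers, and grouping_column as defined above) that leaves the grouping column out of the comparison entirely:
# Compare only the marker columns against their means; the categorical
# leiden column never enters the > comparison, so no ordering is needed
above_mean = df[markers] > mean_expression[markers]
# A Series aligned on the index is a valid groupby key, so the original
# leiden labels still define the groups
g = above_mean.groupby(df[grouping_column])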
I think I have a question that is not that difficult, but I cannot find an answer to it online. :(
So I hope anyone can help me here.
I am writing a function that calculates the number of customers who buy a product within a category. For example, how many customers buy Cola within Drinks?
It should be possible to pass multiple columns as the article input, for example article number and article name. But when I pass only one column, it returns the following error:
TypeError Traceback (most recent call last)
<command-107291202997552> in <module>
----> 1 raf=penetration_test(df_source,"customer_hash", ["article_name"])
2 display(raf)
<command-1757898160611556> in penetration_test(input_dataframe, customer_column, product_columns, penetration_category, attach_penetration_columns)
64 else:
65 penetration_dataframe = input_dataframe.select(
---> 66 array(col_list)
67 ).distinct()
68 return penetration_dataframe
/databricks/spark/python/pyspark/sql/functions.py in array(*cols)
2998 if len(cols) == 1 and isinstance(cols[0], (list, set)):
2999 cols = cols[0]
-> 3000 jc = sc._jvm.functions.array(_to_seq(sc, cols, _to_java_column))
3001 return Column(jc)
3002
/databricks/spark/python/pyspark/sql/column.py in _to_seq(sc, cols, converter)
60 """
61 if converter:
---> 62 cols = [converter(c) for c in cols]
63 return sc._jvm.PythonUtils.toSeq(cols)
64
/databricks/spark/python/pyspark/sql/column.py in <listcomp>(.0)
60 """
61 if converter:
---> 62 cols = [converter(c) for c in cols]
63 return sc._jvm.PythonUtils.toSeq(cols)
64
/databricks/spark/python/pyspark/sql/column.py in _to_java_column(col)
44 jcol = _create_column_from_name(col)
45 else:
---> 46 raise TypeError(
47 "Invalid argument, not a string or column: "
48 "{0} of type {1}. "
TypeError: Invalid argument, not a string or column: ['article_name'] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
For this I wrote the following function (I will only give the lines that are relevant to my question):
def penetration_test(input_dataframe, customer_column="customer_hash",
                     product_columns=["hope_number", "article_number"],
                     penetration_category="category_number", attach_penetration_columns=False,
                     ):
    col_list = []
    col_list.append("category_number")
    if len(product_columns) > 1:
        for item in product_columns:
            col_list.append(item)
    else:
        col_list.append(product_columns)
    #display penetration
    penetration_dataframe = input_dataframe.select(
        col_list
    ).distinct()
    return penetration_dataframe
This is the code that I execute, which returns the error:
df_res = penetration_test(df_source,"customer_hash", ["article_name"], "category_name", False)
display(df_res)
Can someone help me to solve the problem?
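A minimal sketch of one possible fix (my suggestion, not from the thread): extend() keeps col_list flat even when a single product column is given, and PySpark's select() accepts a plain list of column names, so no array() wrapper is needed.
def penetration_test(input_dataframe, customer_column="customer_hash",
                     product_columns=["hope_number", "article_number"],
                     penetration_category="category_number",
                     attach_penetration_columns=False):
    # Always build a flat list of column-name strings
    col_list = ["category_number"]
    # extend() appends each name individually, even when product_columns
    # holds just one column, so col_list never contains a nested list
    col_list.extend(product_columns)
    # DataFrame.select() takes the list of names directly
    return input_dataframe.select(col_list).distinct()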
I have a CSV file I'm trying to run RCF (Random Cut Forest) on. If I put a date or string in the CSV, then I get an error like the one below. If I limit it to just the integer and float fields, the script runs fine. Is there some way to process dates and strings? I see the taxi example from AWS has dates which appear the same as mine.
eventData = pd.read_csv(data_location, delimiter=",", header=None, parse_dates=True)
print('Starting RCF Training')
# specify general training job information
rcf = RandomCutForest(role=sagemaker.get_execution_role(),
                      instance_count=1,
                      instance_type='ml.m4.xlarge',
                      data_location=data_location,
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      base_job_name="ad-rcf",
                      num_samples_per_tree=512,
                      num_trees=50)
rcf.fit(rcf.record_set(eventData.values))
CSV Data that fails
392507,1613744,1/2/2020 19:11,1577238693,2469,3.30E+01,-9.67E+01
691381,1888551,12/10/2019 9:22,1575641745,3460,2.37E+01,9.04E+01
392507,1613744,1/2/2020 19:20,1577236815,1797,3.30E+01,-9.67E+01
392507,1613744,1/29/2020 19:04,1577264188,1797,3.30E+01,-9.67E+01
Error output
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-35-ba19bf5d66a2> in <module>
---> 21 rcf.fit(rcf.record_set(eventData.values))
22
23 print('Done RCF Training')
/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/amazon_estimator.py in record_set(self, train, labels, channel, encrypt)
281 logger.debug("Uploading to bucket %s and key_prefix %s", bucket, key_prefix)
282 manifest_s3_file = upload_numpy_to_s3_shards(
--> 283 self.instance_count, s3, bucket, key_prefix, train, labels, encrypt
284 )
285 logger.debug("Created manifest file %s", manifest_s3_file)
/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/amazon_estimator.py in upload_numpy_to_s3_shards(num_shards, s3, bucket, key_prefix, array, labels, encrypt)
443 s3.Object(bucket, key_prefix + file).delete()
444 finally:
--> 445 raise ex
446
447
/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/amazon_estimator.py in upload_numpy_to_s3_shards(num_shards, s3, bucket, key_prefix, array, labels, encrypt)
424 write_numpy_to_dense_tensor(file, shard, label_shards[shard_index])
425 else:
--> 426 write_numpy_to_dense_tensor(file, shard)
427 file.seek(0)
428 shard_index_string = str(shard_index).zfill(len(str(len(shards))))
/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
154 )
155 resolved_label_type = _resolve_type(labels.dtype)
--> 156 resolved_type = _resolve_type(array.dtype)
157
158 # Write each vector in array into a Record in the file object
/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/common.py in _resolve_type(dtype)
288 if dtype == np.dtype("float32"):
289 return "Float32"
--> 290 raise ValueError("Unsupported dtype {} on array".format(dtype))
291
292
ValueError: Unsupported dtype object on array
Figured out my issue: RCF can't handle dates and strings. There's this page for the Kinesis offering from AWS that covers the same Random Cut Forest algorithm: https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sqlrf-random-cut-forest.html It says the algorithm only accepts "the DOUBLE, INTEGER, FLOAT, TINYINT, SMALLINT, REAL, and BIGINT data types."
The gotcha in AWS's NYC Taxi example is that they use .value, which refers to only the value column of the data. They are basically dropping the dates as a feature before the RCF. It doesn't help that .values on the whole DataFrame also works and looks very similar to .value.
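A minimal sketch of the same workaround for the CSV above (my illustration, assuming the date and string columns should simply be dropped):
import numpy as np

# RCF only accepts numeric features, so keep the int/float columns and
# drop the date/string columns before building the record set
numeric_data = eventData.select_dtypes(include=[np.number]).astype("float32")
rcf.fit(rcf.record_set(numeric_data.values))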
I have data that looks like this:
account  DV   month  year  yearmonth  pre/post  group
1        121  oct    2019  oct 2019   pre       control
1        124  Nov    2019  nov 2019   post      control
2        120  oct    2019  oct 2019   pre       treatment
2        118  nov    2019  nov 2019   post      treatment
I run a difference-in-difference regression:
results2 = smf.ols("DV ~ C(group, Treatment('control')) * C(pre_period, Treatment(True)) + month + C(year)",
df99).fit(cov_type='HAC-Panel', cov_kwds={'groups':df99['account'], 'time':df99['yearmonth'], 'maxlags':35})
print(results2.summary())
And I get the error message below.
I do the same thing with a different dataset that is more or less the same (different experiment) but do not encounter the problem. My data cleaning process is essentially identical. Furthermore, just a few days ago, this was working fine. But it has now suddenly thrown up this error (I did reverse any changes I had made).
I can't make sense of this error at all. Even searching for this error message in "sandwich_covariance.py" doesn't reveal anything.
This person has had a similar problem but no solution was proposed: Python statsmodels robust cov_type='hac-panel' issue
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-43-0dcbaef1325b> in <module>
----> 1 results2 = smf.ols("dv ~ C(group, Treatment('control')) * C(pre_period, Treatment(True)) + month + C(year)",
2 df99).fit(cov_type='HAC-Panel', cov_kwds={'groups':df99['account'], 'time':df99['yearmonth'], 'maxlags':35})
3 print(results2.summary())
~/opt/anaconda3/envs/pyr/lib/python3.8/site-packages/statsmodels/regression/linear_model.py in fit(self, method, cov_type, cov_kwds, use_t, **kwargs)
340
341 if isinstance(self, OLS):
--> 342 lfit = OLSResults(
343 self, beta,
344 normalized_cov_params=self.normalized_cov_params,
~/opt/anaconda3/envs/pyr/lib/python3.8/site-packages/statsmodels/regression/linear_model.py in __init__(self, model, params, normalized_cov_params, scale, cov_type, cov_kwds, use_t, **kwargs)
1584 use_t = use_t_2
1585 # TODO: warn or not?
-> 1586 self.get_robustcov_results(cov_type=cov_type, use_self=True,
1587 use_t=use_t, **cov_kwds)
1588 for key in kwargs:
~/opt/anaconda3/envs/pyr/lib/python3.8/site-packages/statsmodels/regression/linear_model.py in get_robustcov_results(self, cov_type, use_t, **kwargs)
2530 groupidx = lzip([0] + tt, tt + [nobs_])
2531 self.n_groups = n_groups = len(groupidx)
-> 2532 res.cov_params_default = sw.cov_nw_panel(self, maxlags, groupidx,
2533 weights_func=weights_func,
2534 use_correction=use_correction)
~/opt/anaconda3/envs/pyr/lib/python3.8/site-packages/statsmodels/stats/sandwich_covariance.py in cov_nw_panel(results, nlags, groupidx, weights_func, use_correction)
785 xu, hessian_inv = _get_sandwich_arrays(results)
786
--> 787 S_hac = S_nw_panel(xu, weights, groupidx)
788 cov_hac = _HCCM2(hessian_inv, S_hac)
789 if use_correction:
~/opt/anaconda3/envs/pyr/lib/python3.8/site-packages/statsmodels/stats/sandwich_covariance.py in S_nw_panel(xw, weights, groupidx)
723 S = weights[0] * np.dot(xw.T, xw) #weights just for completeness
724 for lag in range(1, nlags+1):
--> 725 xw0, xwlag = lagged_groups(xw, lag, groupidx)
726 s = np.dot(xw0.T, xwlag)
727 S += weights[lag] * (s + s.T)
~/opt/anaconda3/envs/pyr/lib/python3.8/site-packages/statsmodels/stats/sandwich_covariance.py in lagged_groups(x, lag, groupidx)
706
707 if out0 == []:
--> 708 raise ValueError('all groups are empty taking lags')
709 #return out0, out_lagged
710 return np.vstack(out0), np.vstack(out_lagged)
ValueError: all groups are empty taking lags
As our statsmodels saviour Josef has very rightly pointed out, the problem was simply that the data was not sorted in the way that I had shown in my example.
Data should be sorted by the group you are clustering on (in my case account), then by time. This is as per the documentation:
‘hac-panel’ heteroscedasticity and autocorrelation robust standard errors in panel data. The data needs to be sorted in this case, the time series for each panel unit or cluster need to be stacked. The membership to a timeseries of an individual or group can be either specified by group indicators or by increasing time periods.
If for whatever reason you are not able to sort the data in that way then, as Josef points out, you can use HAC-groupsum instead, which works even if the data is not sorted. The results are of course slightly different.
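A minimal sketch of the sort (my illustration; the string yearmonth column will not order chronologically by itself, so a parsed date is built first):
import pandas as pd

# Build a sortable datetime from strings like "oct 2019" (%b matches
# abbreviated month names case-insensitively), then stack each
# account's time series contiguously, as hac-panel requires
df99["date"] = pd.to_datetime(df99["yearmonth"], format="%b %Y")
df99 = df99.sort_values(["account", "date"]).reset_index(drop=True)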
The following Python code works for a few stocks but throws errors for a few other stocks. Can someone please help me figure out what is going wrong?
import pandas_datareader as web
tickers=["AMA.AX","AMC.AX"]
mkt_data = web.get_quote_yahoo(tickers)['marketCap']
for stock, mkt in zip(tickers, mkt_data):
    print(stock, mkt)
The above produces the expected output.
However, the moment I add more tickers to the list as follows:
tickers=["AMA.AX","AMC.AX","ANA.AX","AAP.AX"]
It throws the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-18-255a3bb5b917> in <module>
4
5 # Get market cap
----> 6 market_cap_data = web.get_quote_yahoo(tickers)['marketCap']
7
8 # print stock, price and mktcap and prc
C:\ProgramData\Anaconda3\lib\site-packages\pandas_datareader\data.py in get_quote_yahoo(*args, **kwargs)
100
101 def get_quote_yahoo(*args, **kwargs):
--> 102 return YahooQuotesReader(*args, **kwargs).read()
103
104
C:\ProgramData\Anaconda3\lib\site-packages\pandas_datareader\yahoo\quotes.py in read(self)
28 data = OrderedDict()
29 for symbol in self.symbols:
---> 30 data[symbol] = self._read_one_data(self.url, self.params(symbol)).loc[
31 symbol
32 ]
C:\ProgramData\Anaconda3\lib\site-packages\pandas_datareader\base.py in _read_one_data(self, url, params)
110 else:
111 raise NotImplementedError(self._format)
--> 112 return self._read_lines(out)
113
114 def _read_url_as_StringIO(self, url, params=None):
C:\ProgramData\Anaconda3\lib\site-packages\pandas_datareader\yahoo\quotes.py in _read_lines(self, out)
41
42 def _read_lines(self, out):
---> 43 data = json.loads(out.read())["quoteResponse"]["result"][0]
44 idx = data.pop("symbol")
45 data["price"] = data["regularMarketPrice"]
IndexError: list index out of range
I am aware that this error is occurring because the third (added) ticker "ANA.AX" is delisted and has no data available.
Right now, the program just stops at the first instance of a missing/unavailable-data error.
What I want to achieve is to skip this ticker (ANA.AX) and continue printing the data for the rest of the tickers after it ("AAP.AX", the fourth one).
Any idea how to achieve that?
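One way to achieve that (a sketch, not from the thread): request each ticker separately and catch the failure, so a delisted symbol is skipped instead of aborting the whole batch.
import pandas_datareader as web

tickers = ["AMA.AX", "AMC.AX", "ANA.AX", "AAP.AX"]
for stock in tickers:
    try:
        # One request per ticker, so a single bad symbol cannot
        # take down the others
        mkt = web.get_quote_yahoo(stock)["marketCap"].iloc[0]
        print(stock, mkt)
    except (IndexError, KeyError):
        # Delisted/unknown symbols raise IndexError inside the reader
        # (and 'marketCap' may be absent); skip and continue
        print(stock, "no data available - skipping")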