Python pandas-datareader - skip erroneous or unavailable ticker values - python

The following Python code works for a few stocks but randomly throws errors for a few others. Can someone please help me figure out what is going wrong?
import pandas_datareader as web

tickers = ["AMA.AX", "AMC.AX"]
mkt_data = web.get_quote_yahoo(tickers)['marketCap']
for stock, mkt in zip(tickers, mkt_data):
    print(stock, mkt)
The above produces the expected output.
However, the moment I add more tickers to the list, as follows:
tickers=["AMA.AX","AMC.AX","ANA.AX","AAP.AX"]
It throws the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-18-255a3bb5b917> in <module>
4
5 # Get market cap
----> 6 market_cap_data = web.get_quote_yahoo(tickers)['marketCap']
7
8 # print stock, price and mktcap and prc
C:\ProgramData\Anaconda3\lib\site-packages\pandas_datareader\data.py in get_quote_yahoo(*args, **kwargs)
100
101 def get_quote_yahoo(*args, **kwargs):
--> 102 return YahooQuotesReader(*args, **kwargs).read()
103
104
C:\ProgramData\Anaconda3\lib\site-packages\pandas_datareader\yahoo\quotes.py in read(self)
28 data = OrderedDict()
29 for symbol in self.symbols:
---> 30 data[symbol] = self._read_one_data(self.url, self.params(symbol)).loc[
31 symbol
32 ]
C:\ProgramData\Anaconda3\lib\site-packages\pandas_datareader\base.py in _read_one_data(self, url, params)
110 else:
111 raise NotImplementedError(self._format)
--> 112 return self._read_lines(out)
113
114 def _read_url_as_StringIO(self, url, params=None):
C:\ProgramData\Anaconda3\lib\site-packages\pandas_datareader\yahoo\quotes.py in _read_lines(self, out)
41
42 def _read_lines(self, out):
---> 43 data = json.loads(out.read())["quoteResponse"]["result"][0]
44 idx = data.pop("symbol")
45 data["price"] = data["regularMarketPrice"]
IndexError: list index out of range
I am aware that this error is occurring because the third (added) ticker "ANA.AX" is delisted and has no data available.
Right now, the program just stops at the first missing/unavailable-data error.
What I want to achieve is to skip this ticker (ANA.AX) and continue printing the data for the rest of the tickers after it ("AAP.AX", the fourth one).
Any idea how to achieve that?
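One way to achieve this (a sketch of my own, assuming Yahoo still answers per-symbol requests the same way) is to query the tickers one at a time and catch the IndexError that the reader raises for a delisted symbol, so a single bad ticker cannot abort the whole batch:
import pandas_datareader as web

tickers = ["AMA.AX", "AMC.AX", "ANA.AX", "AAP.AX"]

market_caps = {}
for ticker in tickers:
    try:
        # One request per symbol, so a failing symbol only skips itself
        quote = web.get_quote_yahoo(ticker)
        market_caps[ticker] = quote['marketCap'].iloc[0]
    except (IndexError, KeyError):
        # Delisted/unknown tickers raise IndexError inside the reader;
        # symbols quoted without a marketCap field would raise KeyError
        print(ticker, "has no data available, skipping")

for stock, mkt in market_caps.items():
    print(stock, mkt)
This trades one batched request for one request per ticker, which is slower but isolates failures per symbol.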

Related

TypeError: Invalid argument, not a string or column: ['article_name'] of type <class 'list'>

I think I have a question that is not that difficult, but I cannot find an answer to it online. :(
So I hope anyone can help me here.
I am writing a function that calculates the number of customers who buy a product within a category. For example, how many customers buy Cola within the Drinks category?
It should be possible to pass multiple columns as the input value for the article columns, for example article number and article name. But when I pass only one column as the input value, it returns the following error:
TypeError Traceback (most recent call last)
<command-107291202997552> in <module>
----> 1 raf=penetration_test(df_source,"customer_hash", ["article_name"])
2 display(raf)
<command-1757898160611556> in penetration_test(input_dataframe, customer_column, product_columns, penetration_category, attach_penetration_columns)
64 else:
65 penetration_dataframe = input_dataframe.select(
---> 66 array(col_list)
67 ).distinct()
68 return penetration_dataframe
/databricks/spark/python/pyspark/sql/functions.py in array(*cols)
2998 if len(cols) == 1 and isinstance(cols[0], (list, set)):
2999 cols = cols[0]
-> 3000 jc = sc._jvm.functions.array(_to_seq(sc, cols, _to_java_column))
3001 return Column(jc)
3002
/databricks/spark/python/pyspark/sql/column.py in _to_seq(sc, cols, converter)
60 """
61 if converter:
---> 62 cols = [converter(c) for c in cols]
63 return sc._jvm.PythonUtils.toSeq(cols)
64
/databricks/spark/python/pyspark/sql/column.py in <listcomp>(.0)
60 """
61 if converter:
---> 62 cols = [converter(c) for c in cols]
63 return sc._jvm.PythonUtils.toSeq(cols)
64
/databricks/spark/python/pyspark/sql/column.py in _to_java_column(col)
44 jcol = _create_column_from_name(col)
45 else:
---> 46 raise TypeError(
47 "Invalid argument, not a string or column: "
48 "{0} of type {1}. "
TypeError: Invalid argument, not a string or column: ['article_name'] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
For this I wrote the following function (I will only show the lines that are relevant to my question):
def penetration_test(input_dataframe, customer_column="customer_hash",
                     product_columns=["hope_number", "article_number"],
                     penetration_category="category_number",
                     attach_penetration_columns=False):
    col_list = []
    col_list.append("category_number")
    if len(product_columns) > 1:
        for item in product_columns:
            col_list.append(item)
    else:
        col_list.append(product_columns)
    # display penetration
    penetration_dataframe = input_dataframe.select(
        col_list
    ).distinct()
    return penetration_dataframe
This is the code I execute that returns the error:
df_res = penetration_test(df_source,"customer_hash", ["article_name"], "category_name", False)
display(df_res)
Can someone help me solve the problem?
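The error message itself points at the cause: the else branch appends the list product_columns itself, so col_list ends up containing a nested list (['article_name']), which Spark's column converter rejects. A minimal fix sketch (assuming the intent is always a flat list of column-name strings) replaces the if/else with extend():
def penetration_test(input_dataframe, customer_column="customer_hash",
                     product_columns=["hope_number", "article_number"],
                     penetration_category="category_number",
                     attach_penetration_columns=False):
    col_list = ["category_number"]
    # extend() flattens product_columns whether it holds one name
    # or several, so col_list stays a flat list of strings
    col_list.extend(product_columns)
    penetration_dataframe = input_dataframe.select(col_list).distinct()
    return penetration_dataframe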

Watson caption converter IndexError: list index out of range

I'm using Watson caption converter and I'm getting an IndexError: list index out of range in a Jupyter notebook. Is there any way to fix this error?
Here is the error:
IndexError Traceback (most recent call last)
<ipython-input-2-c505fadd11ee> in <module>
131
132 # formats time for SRT format
--> 133 st_time = format_time(x[2], "srt")
134 en_time = format_time(x[3], "srt")
135
<ipython-input-2-c505fadd11ee> in format_time(time, format)
15 # function to format time for SRT file
16 def format_time(time, format):
---> 17 ms = (str(time).split("."))[1]
18 sec = time
19 min = sec / 60
IndexError: list index out of range
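The traceback shows format_time assuming str(time) always contains a decimal point; for a whole-number timestamp such as the int 2, str(time).split(".") yields a single element and indexing [1] raises IndexError. A guard sketch for the failing lines (the rest of format_time, not shown in the traceback, is assumed unchanged):
def format_time(time, format):
    # str(2.5) splits into ["2", "5"], but str(2) has no "." and
    # splits into ["2"], so default the milliseconds part to "0"
    parts = str(time).split(".")
    ms = parts[1] if len(parts) > 1 else "0"
    sec = time
    min = sec / 60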

Getting a KeyError when trying to run De-dupe

Hi I'm new to Python and I've no clue how to fix the following error:
I have a data frame with around 2 million records and 20 columns of store data. I am grouping the stores by State and trying to run dedupe_dataframe on each state after training it on one.
Here is how my code looks (np is numpy, dp is pandas_dedupe):
#Read Store Data
stores = pd.read_csv("storefile.csv", sep=",", encoding='latin1', dtype=str)
#There was \t in the first column so removing that
stores = stores.replace('\t', '', regex=True)
stores = stores.replace(np.nan, '', regex=True)
#Getting a lowercase state list
states = list(stores.State.str.lower().unique())
#Grouping Data by States
state_data = {state: stores[stores.State.str.lower()==state] for state in states}
#Running De-Dupe for state Ohio ('oh')
dp.dedupe_dataframe(state_data['oh'], ['StoreBannerName','Address','City','State'])
I'm getting the following error:
importing data ...
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-37-e2ed10256338> in <module>
----> 1 dp.dedupe_dataframe(state_data['oh'], ['StoreBannerName','Address','City','State'])
~\anaconda3\lib\site-packages\pandas_dedupe\dedupe_dataframe.py in dedupe_dataframe(df, field_properties, canonicalize, config_name, recall_weight, sample_size)
211 # train or load the model
212 deduper = _train(settings_file, training_file, data_d, field_properties,
--> 213 sample_size)
214
215 # ## Set threshold
~\anaconda3\lib\site-packages\pandas_dedupe\dedupe_dataframe.py in _train(settings_file, training_file, data, field_properties, sample_size)
58 # To train dedupe, we feed it a sample of records.
59 sample_num = math.floor(len(data) * sample_size)
--> 60 deduper.sample(data, sample_num)
61
62 # If we have training data saved from a previous run of dedupe,
~\anaconda3\lib\site-packages\dedupe\api.py in sample(self, data, sample_size, blocked_proportion, original_length)
836 sample_size,
837 original_length,
--> 838 index_include=examples)
839
840 self.active_learner.mark(examples, y)
~\anaconda3\lib\site-packages\dedupe\labeler.py in __init__(self, data_model, data, blocked_proportion, sample_size, original_length, index_include)
401 data = core.index(data)
402
--> 403 self.candidates = super().sample(data, blocked_proportion, sample_size)
404
405 random_pair = random.choice(self.candidates)
~\anaconda3\lib\site-packages\dedupe\labeler.py in sample(self, data, blocked_proportion, sample_size)
50 return [(data[k1], data[k2])
51 for k1, k2
--> 52 in blocked_sample_keys | random_sample_keys]
53
54
~\anaconda3\lib\site-packages\dedupe\labeler.py in <listcomp>(.0)
49
50 return [(data[k1], data[k2])
--> 51 for k1, k2
52 in blocked_sample_keys | random_sample_keys]
53
KeyError: 2147550487
Solution
Swap the following line:
#Running De-Dupe for state Ohio ('oh')
dp.dedupe_dataframe(state_data['oh'], ['StoreBannerName','Address','City','State'])
WITH
#Dropping duplicates for state Ohio ('oh')
state_data['oh'].drop_duplicates(subset=['StoreBannerName','Address','City','State'], keep='first')
Reference
pandas.DataFrame.drop_duplicates()
I got a KeyError as well.
I used this code example:
https://github.com/dedupeio/dedupe-examples/tree/master/record_linkage_example
I had changed some of the code and had forgotten to remove the data_matching_learned_settings file before running it again.
Another reason you might get this error, especially with the first column, is that your file could contain a BOM (Byte Order Mark) in the first character.
See this for how to remove a BOM:
VS Code: How to save UTF-8 without BOM in Mac?
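If a BOM is the culprit, a pandas-side fix (a sketch, assuming the file is actually UTF-8 with a BOM rather than latin1) is to read it with the utf-8-sig codec, which strips a leading BOM automatically:
import pandas as pd

# 'utf-8-sig' removes a leading byte order mark, so the first column
# name comes back as 'StoreBannerName' rather than '\ufeffStoreBannerName'
stores = pd.read_csv("storefile.csv", sep=",", encoding="utf-8-sig", dtype=str)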

How to read binary compressed SAS files in chunks using panda.read_sas and save as feather

I am trying to use pandas.read_sas() to read binary compressed SAS files in chunks and save each chunk as a separate feather file.
This is my code:
import feather as fr
import pandas as pd

pdi = pd.read_sas("C:/data/test.sas7bdat", chunksize=100000, iterator=True)
i = 1
for pdj in pdi:
    fr.write_dataframe(pdj, 'C:/data/test' + str(i) + '.feather')
    i = i + 1
However, I get the following error:
ValueError Traceback (most recent call last)
in ()
1 i = 1
2 for pdj in pdi:
----> 3 fr.write_dataframe(pdj, 'C:/test' + str(i) + '.feather')
4 i = i + 1
5
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\feather.py
in write_feather(df, dest)
116 writer = FeatherWriter(dest)
117 try:
--> 118 writer.write(df)
119 except:
120 # Try to make sure the resource is closed
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\feather.py
in write(self, df)
94
95 elif inferred_type not in ['unicode', 'string']:
---> 96 raise ValueError(msg)
97
98 if not isinstance(name, six.string_types):
ValueError: cannot serialize column 0 named SOME_ID with dtype bytes
I am using Windows 7 and Python 3.6. When I inspect the data, most of the columns' cells are wrapped in b'cell_value', which I assume means that the columns are in binary (bytes) format.
I am a complete Python beginner, so I don't understand what the issue is.
Edit: looks like this was a bug patched in a recent version:
https://issues.apache.org/jira/browse/ARROW-1672
https://github.com/apache/arrow/commit/238881fae8530a1ae994eb0e283e4783d3dd2855
Are the column names strings? Are you sure pdj is of type pd.DataFrame?
Limitations
Some features of pandas are not supported in Feather:
Non-string column names
Row indexes
Object-type columns with non-homogeneous data
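On older pyarrow versions without the patch above, one workaround (my own sketch, not from the thread) is to decode the bytes columns to strings before writing each chunk:
import pandas as pd

def decode_bytes_columns(df):
    # Feather cannot serialize bytes values, so decode every
    # object column that actually holds bytes into str
    for col in df.select_dtypes(include='object').columns:
        if df[col].map(lambda v: isinstance(v, bytes)).any():
            df[col] = df[col].str.decode('utf-8')
    return df
Each chunk from read_sas would then be passed through decode_bytes_columns(pdj) before calling fr.write_dataframe.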

seasonal_decompose raises error: TypeError: PeriodIndex given. Check the `freq` attribute instead of using infer_freq

I'm trying to run a basic seasonal_decompose on the oft-used airline passengers data set, which begins with these rows:
Month
1949-02 4.770685
1949-03 4.882802
1949-04 4.859812
1949-05 4.795791
1949-06 4.905275
1949-07 4.997212
1949-08 4.997212
1949-09 4.912655
1949-10 4.779123
1949-11 4.644391
1949-12 4.770685
1950-01 4.744932
1950-02 4.836282
1950-03 4.948760
1950-04 4.905275
1950-05 4.828314
1950-06 5.003946
1950-07 5.135798
1950-08 5.135798
Freq: M, Name: Passengers, dtype: float64
My index type is:
pandas.tseries.period.PeriodIndex
I try to run some very simple code:
from statsmodels.tsa.seasonal import seasonal_decompose
log_passengers.interpolate(inplace = True)
decomposition = seasonal_decompose(log_passengers)
Here is the full output of the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-113-bf122d457673> in <module>()
1 from statsmodels.tsa.seasonal import seasonal_decompose
2 log_passengers.interpolate(inplace = True)
----> 3 decomposition = seasonal_decompose(log_passengers)
/Users/ann/anaconda/lib/python3.5/site-packages/statsmodels/tsa/seasonal.py in seasonal_decompose(x, model, filt, freq)
56 statsmodels.tsa.filters.convolution_filter
57 """
---> 58 _pandas_wrapper, pfreq = _maybe_get_pandas_wrapper_freq(x)
59 x = np.asanyarray(x).squeeze()
60 nobs = len(x)
/Users/ann/anaconda/lib/python3.5/site-packages/statsmodels/tsa/filters/_utils.py in _maybe_get_pandas_wrapper_freq(X, trim)
44 index = X.index
45 func = _get_pandas_wrapper(X, trim)
---> 46 freq = index.inferred_freq
47 return func, freq
48 else:
pandas/src/properties.pyx in pandas.lib.cache_readonly.__get__ (pandas/lib.c:44097)()
/Users/ann/anaconda/lib/python3.5/site-packages/pandas/tseries/base.py in inferred_freq(self)
233 """
234 try:
--> 235 return frequencies.infer_freq(self)
236 except ValueError:
237 return None
/Users/ann/anaconda/lib/python3.5/site-packages/pandas/tseries/frequencies.py in infer_freq(index, warn)
854
855 if com.is_period_arraylike(index):
--> 856 raise TypeError("PeriodIndex given. Check the `freq` attribute "
857 "instead of using infer_freq.")
858 elif isinstance(index, pd.TimedeltaIndex):
TypeError: PeriodIndex given. Check the `freq` attribute instead of using infer_freq.
Here's what I've tried:
Use decomposition = seasonal_decompose(log_passengers, infer_freq=True), which produces the error: TypeError: seasonal_decompose() got an unexpected keyword argument 'infer_freq'
Use decomposition = seasonal_decompose(log_passengers, freq='M'), which results in the error: TypeError: PeriodIndex given. Check the `freq` attribute instead of using infer_freq.
I also verified that every period in my PeriodIndex has the same frequency with this line of code: set([x.freq for x in log_passengers.index]), which indeed produced a set with only one frequency: {<MonthEnd>}
I see some talk of this in various GitHub issues (https://github.com/pydata/pandas/issues/6771), but none of what's discussed seems to help. Any suggestions on how to troubleshoot this, or what I'm doing wrong in this simple seasonal_decompose?
seasonal_decompose does not accept a PeriodIndex; a workaround is to convert the index to a DatetimeIndex using the to_timestamp method:
from statsmodels.tsa.seasonal import seasonal_decompose
log_passengers.interpolate(inplace = True)
log_passengers.index = log_passengers.index.to_timestamp()
decomposition = seasonal_decompose(log_passengers)
