So I just tried making a basic 10-line movie recommendation system with a big ML project in mind. But all I'm getting is errors when I run this:
import pandas as pd

movies = pd.read_csv('movies.csv')
users = pd.read_csv('users.csv')
recommendations = {}

def recommend(users, movies):
    for f in users['favouritegenre']:
        genre = movies.query(f)['genre']
        movie = movies.query(genre)['movie']
        userid = users.query(f)['userid']
        recommendations[userid] = movie
    print(recommendations)

recommend(users, movies)
My movies.csv file:
movie,ratings,genre
'toy story','8','family'
'john wick','7','action'
'a quite place','8','horror'
My users.csv file:
userid,age,favouritegenre
'1','16','family'
'2','49','action'
'3','10','horror'
and I get the error:
Traceback (most recent call last):
File "main.py", line 15, in <module>
recommend(users,movies)
File "main.py", line 9, in recommend
genre = movies.query(f)['genre']
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\frame.py", line 4114, in query
result = self.loc[res]
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\indexing.py", line 967, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\indexing.py", line 1202, in _getitem_axis
return self._get_label(key, axis=axis)
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\indexing.py", line 1153, in _get_label
return self.obj.xs(label, axis=axis)
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\generic.py", line 3864, in xs
loc = index.get_loc(key)
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\indexes\range.py", line 389, in get_loc
raise KeyError(key)
KeyError: 'family'
IIUC, you can do
recommendations = (users.assign(movies=users['favouritegenre'].map(movies.groupby('genre')['movie'].agg(list)))
                        .set_index('userid')['movies'].to_dict())
print(recommendations)
{1: ['toy story'], 2: ['john wick'], 3: ['a quite place']}
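For context on why the original code fails: DataFrame.query expects a boolean expression string such as "genre == 'family'", so movies.query('family') tries to resolve family as a column or index label and raises the KeyError. A minimal sketch of the same recommendation logic using explicit boolean masks instead (inline frames stand in for the CSV files, with the quote characters stripped from the values):

```python
import pandas as pd

# Inline stand-ins for movies.csv and users.csv (quote characters stripped)
movies = pd.DataFrame({
    "movie": ["toy story", "john wick", "a quite place"],
    "genre": ["family", "action", "horror"],
})
users = pd.DataFrame({
    "userid": [1, 2, 3],
    "favouritegenre": ["family", "action", "horror"],
})

recommendations = {}
for _, user in users.iterrows():
    # Boolean mask instead of query(): pick the movies in this user's genre
    match = movies.loc[movies["genre"] == user["favouritegenre"], "movie"]
    recommendations[user["userid"]] = match.tolist()

print(recommendations)
# {1: ['toy story'], 2: ['john wick'], 3: ['a quite place']}
```

The groupby/map one-liner does the same thing without the Python-level loop.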
I'm trying to download and then open an Excel file (a report) generated by a marketplace with openpyxl.
import requests
import config
import openpyxl

link = 'https://api.telegram.org/file/bot' + config.TOKEN + '/documents/file_66.xlsx'

def save_open(link):
    filename = link.split('/')[-1]
    r = requests.get(link)
    with open(filename, 'wb') as new_file:
        new_file.write(r.content)
    wb = openpyxl.open('file_66.xlsx')
    ws = wb.active
    cell = ws['B2'].value
    print(cell)

save_open(link)
After running this code I got the following:
Traceback (most recent call last):
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\base.py", line 55, in _convert
value = expected_type(value)
TypeError: Fill() takes no arguments
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Home\Documents\myPython\bot_WB\main.py", line 20, in <module>
save_open(link)
File "C:\Users\Home\Documents\myPython\bot_WB\main.py", line 14, in save_open
wb = openpyxl.open ('file_66.xlsx')
File "C:\Python 3.9\lib\site-packages\openpyxl\reader\excel.py", line 317, in load_workbook
reader.read()
File "C:\Python 3.9\lib\site-packages\openpyxl\reader\excel.py", line 281, in read
apply_stylesheet(self.archive, self.wb)
File "C:\Python 3.9\lib\site-packages\openpyxl\styles\stylesheet.py", line 198, in apply_stylesheet
stylesheet = Stylesheet.from_tree(node)
File "C:\Python 3.9\lib\site-packages\openpyxl\styles\stylesheet.py", line 103, in from_tree
return super(Stylesheet, cls).from_tree(node)
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\serialisable.py", line 103, in from_tree
return cls(**attrib)
File "C:\Python 3.9\lib\site-packages\openpyxl\styles\stylesheet.py", line 74, in __init__
self.fills = fills
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\sequence.py", line 26, in __set__
seq = [_convert(self.expected_type, value) for value in seq]
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\sequence.py", line 26, in <listcomp>
seq = [_convert(self.expected_type, value) for value in seq]
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\base.py", line 57, in _convert
raise TypeError('expected ' + str(expected_type))
TypeError: expected <class 'openpyxl.styles.fills.Fill'>
If you open the file's properties/details, you can see that this file was generated by "Go Excelize" (author: xuri). To process it I had to separate the code into two parts. First: download the file. Then manually open it with MS Excel, save it and close it (after this the generator shown switches from "Go Excelize" to "Microsoft Excel"). Only after that can the second part of the code run correctly with no errors. Can anyone help me handle this problem?
I had the same problem, "TypeError('expected ' + str(expected_type))", using pandas.read_excel, which uses openpyxl. If I opened the file, saved and closed it, it would work with both pandas and openpyxl.
On further attempts I could open the file by passing read_only=True to openpyxl, but while iterating over the rows I would still get the error, though only after all the rows had been read, at the end of the file.
I believe it could be something in the EOF (end of file) that openpyxl has no way of handling.
Here is the code that I used to test and worked for me:
import openpyxl

wb = openpyxl.load_workbook(my_file_name, read_only=True)
ws = wb.worksheets[0]

lis = []
try:
    for row in ws.iter_rows():
        lis.append([cell.value for cell in row])
except TypeError:
    print('Skip error in EOF')
Used openpyxl==3.0.10
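One way to automate the manual open-and-resave step (a sketch only; rebuild_values is a made-up helper name, and I have not tested this against actual Excelize output) is to copy just the cell values into a fresh workbook written by openpyxl itself, combining read_only=True with the try/except shown above:

```python
import openpyxl

def rebuild_values(src_path, dst_path):
    # read_only skips most style parsing, dodging the Fill TypeError
    src = openpyxl.load_workbook(src_path, read_only=True, data_only=True)
    dst = openpyxl.Workbook()
    dst_ws = dst.active
    try:
        for row in src.worksheets[0].iter_rows(values_only=True):
            dst_ws.append(row)
    except TypeError:
        pass  # tolerate a malformed end-of-file, as described above
    src.close()
    dst.save(dst_path)

# Hypothetical usage:
# rebuild_values('file_66.xlsx', 'file_66_clean.xlsx')
```

The rebuilt copy loses all formatting, but it can then be opened normally by openpyxl or pandas.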
So I want to get the monthly sum with my script, but I always get an AttributeError, which I don't understand. The column Timestamp does indeed exist in my combined_csv. I know for sure that this line is causing the problem, since I tested all of my other code beforehand.
AttributeError: 'DataFrame' object has no attribute 'Timestamp'
I'd appreciate any help I can get - thanks!
import os
import glob
import pandas as pd
# set working directory
os.chdir("Path to CSVs")
# find all csv files in the folder
# use glob pattern matching -> extension = 'csv'
# save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# print(all_filenames)
# combine all files in the list
combined_csv = pd.concat([pd.read_csv(f, sep=';') for f in all_filenames])
# Format CSV
# Transform Timestamp column into datetime
combined_csv['Timestamp'] = pd.to_datetime(combined_csv.Timestamp)
# Read out first entry of every day of every month
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first']))
# To get the yield of day i have to subtract day 2 HtmDht_Energy - day 1 HtmDht_Energy
combined_csv["dailyYield"] = combined_csv["first"] - combined_csv["first"].shift()
# combined_csv.reset_index()
# combined_csv.index.set_names(["year", "month"], inplace=True)
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv.Timestamp.dt.year, combined_csv.Timestamp.dt.month]).sum()
Output of combined_csv.columns
Index(['Timestamp', 'teHst0101', 'teHst0102', 'teHst0103', 'teHst0104',
'teHst0105', 'teHst0106', 'teHst0107', 'teHst0201', 'teHst0202',
'teHst0203', 'teHst0204', 'teHst0301', 'teHst0302', 'teHst0303',
'teHst0304', 'teAmb', 'teSolFloHexHst', 'teSolRetHexHst',
'teSolCol0501', 'teSolCol1001', 'teSolCol1501', 'vfSol', 'prSolRetSuc',
'rdGlobalColAngle', 'gSolPump01_roActual', 'gSolPump02_roActual',
'gHstPump03_roActual', 'gHstPump04_roActual', 'gDhtPump06_roActual',
'gMB01_isOpened', 'gMB02_isOpened', 'gCV01_posActual',
'gCV02_posActual', 'HtmDht_Energy', 'HtmDht_Flow', 'HtmDht_Power',
'HtmDht_Volume', 'HtmDht_teFlow', 'HtmDht_teReturn', 'HtmHst_Energy',
'HtmHst_Flow', 'HtmHst_Power', 'HtmHst_Volume', 'HtmHst_teFlow',
'HtmHst_teReturn', 'teSolColDes', 'teHstFloDes'],
dtype='object')
Traceback when I select it with:

combined_csv["monthlySum"] = combined_csv.groupby([combined_csv['Timestamp'].dt.year, combined_csv['Timestamp'].dt.month]).sum()
Traceback (most recent call last):
File "D:\Users\wink\PycharmProjects\csvToExcel\main.py", line 28, in <module>
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv['Timestamp'].dt.year, combined_csv['Timestamp'].dt.month]).sum()
File "D:\Users\wink\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "D:\Users\wink\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'Timestamp'
Traceback with mustafa's solution:
Traceback (most recent call last):
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3862, in reindexer
value = value.reindex(self.index)._values
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\util\_decorators.py", line 312, in wrapper
return func(*args, **kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4176, in reindex
return super().reindex(**kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\generic.py", line 4811, in reindex
return self._reindex_axes(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4022, in _reindex_axes
frame = frame._reindex_index(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4038, in _reindex_index
new_index, indexer = self.index.reindex(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 2492, in reindex
target = MultiIndex.from_tuples(target)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 175, in new_meth
return meth(self_or_cls, *args, **kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 531, in from_tuples
arrays = list(lib.tuples_to_object_array(tuples).T)
File "pandas\_libs\lib.pyx", line 2527, in pandas._libs.lib.tuples_to_object_array
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\winklerm\PycharmProjects\csvToExcel\main.py", line 28, in <module>
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv.Timestamp.dt.year, combined_csv.Timestamp.dt.month]).sum()
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3163, in __setitem__
self._set_item(key, value)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3242, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3888, in _sanitize_column
value = reindexer(value).T
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3870, in reindexer
raise TypeError(
TypeError: incompatible index of inserted column with frame index
This line makes the Timestamp column the index of the combined_csv:
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first']))
and therefore you get an error when you try to access .Timestamp.
The remedy is to reset_index; so instead of the line above, you can try this:
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first'])).reset_index()
which moves the Timestamp column back out of the index into a normal column, so you can access it again.
Side note:
combined_csv["dailyYield"] = combined_csv["first"] - combined_csv["first"].shift()
is equivalent to
combined_csv["dailyYield"] = combined_csv["first"].diff()
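Putting the pieces together, a minimal sketch with synthetic data (the column names follow the question, but the values and the date range are invented):

```python
import pandas as pd

# Two readings per day over three days; energy is a cumulative counter
combined_csv = pd.DataFrame({
    "Timestamp": pd.date_range("2021-01-01", periods=6, freq="12h"),
    "HtmDht_Energy": [100, 101, 103, 106, 110, 115],
})

# First entry of every day, with Timestamp restored as a normal column
daily = (combined_csv.resample("D", on="Timestamp")["HtmDht_Energy"]
         .agg(["first"])
         .round()
         .reset_index())
daily["dailyYield"] = daily["first"].diff()

# Keep the monthly sum as its own object; assigning it straight back into
# `daily` would fail, because its (year, month) index does not line up
# with daily's row index
monthly = daily.groupby([daily.Timestamp.dt.year,
                         daily.Timestamp.dt.month])["dailyYield"].sum()
print(monthly)
```

Keeping monthly separate also sidesteps the second traceback ("incompatible index of inserted column with frame index"); if the per-month value is needed on every row, a groupby transform('sum') broadcasts it back instead.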
I run my Python scripts on RHEL Linux, and I get the following error:
Traceback (most recent call last):
File "main.py", line 162, in <module>
find_deltas(logging, snapshot_id)
File "/ariel/python_scripts/ariel_deltas/deltas.py", line 71, in find_deltas
data = prepare_frames(logging, file_extracts)
File "/ariel/python_scripts/ariel_deltas/deltas.py", line 606, in prepare_frames
logging.info("df_old has %d records", len(df_old))
File "/ariel/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 1041, in __len__
return len(self.index)
File "/ariel/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
File "pandas/_libs/properties.pyx", line 63, in pandas._libs.properties.AxisProperty.__get__
File "/ariel/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute '_data'
I am effectively reading a dataframe in from Oracle, writing it to a pickle file, then reading that pickle file back in, along with yesterday's pickle file, and then doing a join on the primary key.
Why on earth would Linux generate an error about a missing "_data" attribute, when the code runs fine on the exact same data set in Windows?!
Reading in the pickle file in Linux, the columns are as expected.
>>> df.columns
Index(['AS_OF_DT', 'VARIATION_REQUEST_ID', 'LU_NUMBER', 'LU_TITLE', 'COUNTRY',
'ARCHIVED', 'APPLIED', 'LU_DESCRIPTION', 'HA_LU_REF_NO', 'REMARKS',
'LU_CATEGORY', 'VARIATION_TYPE', 'INSERT_UPDATE_TIME',
'INSERT_UPDATE_USER', 'MERGED', 'REVISION_NUMBER', 'VERSION_SEQ',
'RECORD_ID', 'IMPLEMENTED_SEQ', 'RMS_VERSION_SEQ',
'REASON_FOR_LOCAL_UPDATE', 'C_ECTD_SEQUENCE_NO', 'INSERT_TIME',
'ARCHIVED_DATE', 'REASON_FOR_MERGE', 'SCRN_NO'],
dtype='object')
>>>
The function generating the issue is below:
def prepare_frames(logging, file_extracts):
    # file_extracts is a tuple of dictionaries:
    #   old_file
    #   new_file
    #   file_info
    # file_info is a dict describing the file master record, including the join keys:
    #   {"file_id": file_id, "file_desc": r.FILE_DESC, "file_prefix": r.FILE_PREFIX, "compare_col": r.COMPARE_COL}
    # The old_file and new_file dictionaries describe the snapshot files to be compared:
    #   old_file["new_old"] = "old"
    #   old_file["extract_id"] = extract_id
    #   old_file["file_id"] = file_id
    #   old_file["file_name"] = file_name
    #   old_file["snapshot_id"] = snapshot_id
    #   old_file["num_records"] = num_records
    # Strip columns which we know will be different, to remove false positives such as AS_OF_DT.
    logging.info("Start: Reading in DataFrames for analysis from pickle files.")
    data = []
    for extract in file_extracts:
        old_file = extract[0]
        new_file = extract[1]
        file_info = extract[2]  # the dictionary
        old_file_name = old_file["file_name"]
        new_file_name = new_file["file_name"]
        logging.info("Reading in old snapshot from pickle file: %s", old_file_name)
        df_old = pd.read_pickle('snapshots/' + old_file_name)
        logging.info("Reading in new snapshot from pickle file: %s", new_file_name)
        df_new = pd.read_pickle('snapshots/' + new_file_name)
        logging.info("df_old has %d records", len(df_old))
        logging.info("df_new has %d records", len(df_new))
        # Before we do any comparisons we need to remove AS_OF_DT-type values, as these produce false deltas
        #if "AS_OF_DT" in df_new.columns:
        #    del df_new["AS_OF_DT"]
        #    del df_old["AS_OF_DT"]
        #if "AS_OF_DATE" in df_new.columns:
        #    del df_new["AS_OF_DATE"]
        #    del df_old["AS_OF_DATE"]
        data.append((df_old, df_new, old_file, new_file, file_info))
    logging.info("End: Reading in DataFrames for analysis from pickle files.")
    return data
Line 606 is this one:
logging.info("df_old has %d records", len(df_old))
df_old and df_new are basically pickle files read into dataframes. I copy the same pickle files to Windows, and there is no issue at all.
UPDATE: It looks like it was a logic error; the dataframe was actually empty!
I had the same issue, using pandas=1.0.4 within a conda environment. Updating pandas to 1.1.0 solved my problem.
Hope that helps.
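For what it's worth, this AttributeError is commonly a symptom of unpickling a frame across pandas versions: the internal block-manager attribute was renamed from _data to _mgr around pandas 1.0/1.1, so a pickle written by one version can blow up under another, which matches the "works on Windows, fails on Linux" pattern if the two machines run different pandas. A defensive sketch (write_snapshot and read_snapshot are hypothetical helper names, not pandas API):

```python
import pathlib
import pandas as pd

def write_snapshot(df, path):
    p = pathlib.Path(path)
    df.to_pickle(p)
    # Record which pandas produced this pickle, next to the file itself
    p.with_suffix(".version").write_text(pd.__version__)

def read_snapshot(path):
    p = pathlib.Path(path)
    written = p.with_suffix(".version").read_text()
    # Internals changed between minor versions, so refuse to unpickle
    # across a major/minor mismatch rather than fail obscurely later
    if written.split(".")[:2] != pd.__version__.split(".")[:2]:
        raise RuntimeError(
            f"snapshot written by pandas {written}, reading with {pd.__version__}"
        )
    return pd.read_pickle(p)
```

A version-neutral on-disk format (CSV, Parquet) avoids the problem entirely, at the cost of dtype round-tripping.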
Is there a way to force pandas to write an empty DataFrame to an HDF file?
import pandas as pd
df = pd.DataFrame(columns=['x','y'])
df.to_hdf('temp.h5', 'xxx')
df2 = pd.read_hdf('temp.h5', 'xxx')
Output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 389, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 740, in select
return it.get_result()
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1518, in get_result
results = self.func(self.start, self.stop, where)
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 733, in func
columns=columns)
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 2986, in read
idx=i), start=_start, stop=_stop)
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 2575, in read_index
_, index = self.read_index_node(getattr(self.group, key), **kwargs)
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 2676, in read_index_node
data = node[start:stop]
File ".../Python-3.6.3/lib/python3.6/site-packages/tables/vlarray.py", line 675, in __getitem__
return self.read(start, stop, step)
File ".../Python-3.6.3/lib/python3.6/site-packages/tables/vlarray.py", line 811, in read
listarr = self._read_array(start, stop, step)
File "tables/hdf5extension.pyx", line 2106, in tables.hdf5extension.VLArray._read_array (tables/hdf5extension.c:24649)
ValueError: cannot set WRITEABLE flag to True of this array
Writing with format='table':
import pandas as pd
df = pd.DataFrame(columns=['x','y'])
df.to_hdf('temp.h5', 'xxx', format='table')
df2 = pd.read_hdf('temp.h5', 'xxx')
Output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 389, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File ".../Python-3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 722, in select
raise KeyError('No object named {key} in the file'.format(key=key))
KeyError: 'No object named xxx in the file'
Pandas version: 0.24.2
Thank you for your help!
Putting an empty DataFrame into an HDFStore in fixed format should work (though you may need to check the versions of other packages, e.g. tables):
import pandas as pd
import tables

# Versions
pd.__version__
tables.__version__

# DF
df = pd.DataFrame(columns=['x', 'y'])
df

# Dump in fixed format
with pd.HDFStore('temp.h5') as store:
    store.put('df', df, format='f')
    print('Read:')
    store.select('df')
>>> '0.24.2'
>>> '3.5.1'
>>> x y
>>>
>>> Read:
>>> x y
PyTables really does forbid this (at least it did), but for fixed format pandas has its own workaround.
As discussed in the same GitHub issue, some effort has been made to fix this behavior for the table format as well, but that solution still seems to be up in the air; at least it was at the end of March.
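If the table format has to be used, one defensive pattern is to guard the read side instead, since an empty frame written with format='table' simply never creates the key. A sketch (read_or_empty and its arguments are my own invention, not pandas API):

```python
import pandas as pd

def read_or_empty(path, key, columns):
    # mode="a" creates the file if it does not exist yet
    with pd.HDFStore(path, mode="a") as store:
        if key in store:
            return store.select(key)
    # Key absent: fall back to an empty frame with the expected schema
    return pd.DataFrame(columns=columns)
```

For example, read_or_empty('temp.h5', 'xxx', ['x', 'y']) returns the stored frame when present and an empty two-column frame otherwise.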