Pandas failure in Linux, not occurring in Windows - missing _data attribute - python

I run my Python scripts on RHEL Linux, and I get the following error:
Traceback (most recent call last):
File "main.py", line 162, in <module>
find_deltas(logging, snapshot_id)
File "/ariel/python_scripts/ariel_deltas/deltas.py", line 71, in find_deltas
data = prepare_frames(logging, file_extracts)
File "/ariel/python_scripts/ariel_deltas/deltas.py", line 606, in prepare_frames
logging.info("df_old has %d records", len(df_old))
File "/ariel/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 1041, in __len__
return len(self.index)
File "/ariel/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
File "pandas/_libs/properties.pyx", line 63, in pandas._libs.properties.AxisProperty.__get__
File "/ariel/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute '_data'
I am effectively reading a DataFrame in from Oracle, writing it to a pickle file, then reading that pickle file back in along with yesterday's pickle file, and doing a join on the primary key.
Why on earth would Linux generate an error about a missing "_data" attribute, when the code runs fine on the exact same data set in Windows?!
Reading in the pickle file in Linux, the columns are as expected.
>>> df.columns
Index(['AS_OF_DT', 'VARIATION_REQUEST_ID', 'LU_NUMBER', 'LU_TITLE', 'COUNTRY',
'ARCHIVED', 'APPLIED', 'LU_DESCRIPTION', 'HA_LU_REF_NO', 'REMARKS',
'LU_CATEGORY', 'VARIATION_TYPE', 'INSERT_UPDATE_TIME',
'INSERT_UPDATE_USER', 'MERGED', 'REVISION_NUMBER', 'VERSION_SEQ',
'RECORD_ID', 'IMPLEMENTED_SEQ', 'RMS_VERSION_SEQ',
'REASON_FOR_LOCAL_UPDATE', 'C_ECTD_SEQUENCE_NO', 'INSERT_TIME',
'ARCHIVED_DATE', 'REASON_FOR_MERGE', 'SCRN_NO'],
dtype='object')
>>>
The function generating the issue is below:
def prepare_frames(logging, file_extracts):
    # file_extracts is a tuple of dictionaries:
    #   old_file
    #   new_file
    #   file_info
    # file_info is a dict describing the file master record, including the join keys:
    #   {"file_id": file_id, "file_desc": r.FILE_DESC, "file_prefix": r.FILE_PREFIX, "compare_col": r.COMPARE_COL}
    # The old_file and new_file dictionaries describe the file names of the snapshot files to be compared:
    #   old_file["new_old"] = "old"
    #   old_file["extract_id"] = extract_id
    #   old_file["file_id"] = file_id
    #   old_file["file_name"] = file_name
    #   old_file["snapshot_id"] = snapshot_id
    #   old_file["num_records"] = num_records
    # Strip columns which we know will be different, to remove false positives such as AS_OF_DT.
    logging.info("Start: Reading in DataFrames for analysis from pickle files.")
    data = []
    for extract in file_extracts:
        old_file = extract[0]
        new_file = extract[1]
        file_info = extract[2]  # the dictionary
        old_file_name = old_file["file_name"]
        new_file_name = new_file["file_name"]
        logging.info("Reading in old snapshot from pickle file: %s", old_file_name)
        df_old = pd.read_pickle('snapshots/' + old_file_name)
        logging.info("Reading in new snapshot from pickle file: %s", new_file_name)
        df_new = pd.read_pickle('snapshots/' + new_file_name)
        logging.info("df_old has %d records", len(df_old))
        logging.info("df_new has %d records", len(df_new))
        # Before we do any comparisons we need to remove AS_OF_DT type values, as these will produce false deltas.
        #if "AS_OF_DT" in df_new.columns:
        #    del df_new["AS_OF_DT"]
        #    del df_old["AS_OF_DT"]
        #if "AS_OF_DATE" in df_new.columns:
        #    del df_new["AS_OF_DATE"]
        #    del df_old["AS_OF_DATE"]
        data.append((df_old, df_new, old_file, new_file, file_info))
    logging.info("End: Reading in DataFrames for analysis from pickle files.")
    return data
Line 606 is this one:
logging.info("df_old has %d records", len(df_old))
df_old and df_new are basically pickle files read into DataFrames. I copy the same pickle files to Windows, and there is no issue at all.
UPDATE: It looks like it was a logic error; the DataFrame was actually empty!

I had the same issue. I was using pandas=1.0.4 within a conda environment. Updating pandas to 1.1.0 solved my problem.
Hope that works.
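For what it's worth, a plausible mechanism for why the upgrade helps: pandas 1.1 renamed the internal block-manager attribute from _data to _mgr, so a pickle written by a newer pandas and read back by an older one trips exactly this AttributeError. A minimal check to compare the two machines:
import pandas as pd

# Compare this output between the Linux and Windows hosts. A pickle written
# by pandas >= 1.1 stores its internals under _mgr, while pandas < 1.1 code
# still looks up _data, producing this AttributeError on load.
print(pd.__version__)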

Related

TypeError: expected <class 'openpyxl.styles.fills.Fill'>

I'm trying to download and then open an Excel file (a report generated by a marketplace) with openpyxl.
import requests
import config
import openpyxl

link = 'https://api.telegram.org/file/bot' + config.TOKEN + '/documents/file_66.xlsx'

def save_open(link):
    filename = link.split('/')[-1]
    r = requests.get(link)
    with open(filename, 'wb') as new_file:
        new_file.write(r.content)
    wb = openpyxl.open('file_66.xlsx')
    ws = wb.active
    cell = ws['B2'].value
    print(cell)

save_open(link)
After running this code I got the following traceback:
Traceback (most recent call last):
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\base.py", line 55, in _convert
value = expected_type(value)
TypeError: Fill() takes no arguments
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Home\Documents\myPython\bot_WB\main.py", line 20, in <module>
save_open(link)
File "C:\Users\Home\Documents\myPython\bot_WB\main.py", line 14, in save_open
wb = openpyxl.open ('file_66.xlsx')
File "C:\Python 3.9\lib\site-packages\openpyxl\reader\excel.py", line 317, in load_workbook
reader.read()
File "C:\Python 3.9\lib\site-packages\openpyxl\reader\excel.py", line 281, in read
apply_stylesheet(self.archive, self.wb)
File "C:\Python 3.9\lib\site-packages\openpyxl\styles\stylesheet.py", line 198, in apply_stylesheet
stylesheet = Stylesheet.from_tree(node)
File "C:\Python 3.9\lib\site-packages\openpyxl\styles\stylesheet.py", line 103, in from_tree
return super(Stylesheet, cls).from_tree(node)
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\serialisable.py", line 103, in from_tree
return cls(**attrib)
File "C:\Python 3.9\lib\site-packages\openpyxl\styles\stylesheet.py", line 74, in __init__
self.fills = fills
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\sequence.py", line 26, in __set__
seq = [_convert(self.expected_type, value) for value in seq]
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\sequence.py", line 26, in <listcomp>
seq = [_convert(self.expected_type, value) for value in seq]
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\base.py", line 57, in _convert
raise TypeError('expected ' + str(expected_type))
TypeError: expected <class 'openpyxl.styles.fills.Fill'>
[Finished in 1.6s]
If you look at the file's properties/details you can see that it was generated by "Go Excelize" (author: xuri). To make this file readable I have to split the code in two parts: first, download the file; then manually open it with MS Excel, save it and close it (after this the authoring application switches from "Go Excelize" to "Microsoft Excel"). Only after that can I run the second part of the code correctly, with no errors. Can anyone help me handle this problem?
I had the same problem, "TypeError('expected ' + str(expected_type))", using pandas.read_excel, which uses openpyxl. If I open the file, save it and close it, it will work with both pandas and openpyxl.
On further attempts I could open the file using read_only=True in openpyxl, but while iterating over the rows I would still get the error, though only after all the rows had been read, at the end of the file.
I believe it could be something in the EOF (end of file) that openpyxl has no way of handling.
Here is the code that I used to test and worked for me:
import openpyxl

wb = openpyxl.load_workbook(my_file_name, read_only=True)
ws = wb.worksheets[0]
lis = []
try:
    for row in ws.iter_rows():
        lis.append([cell.value for cell in row])
except TypeError:
    print('Skip error in EOF')
Used openpyxl==3.0.10
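If you ultimately want the data in pandas anyway, the rows salvaged above can be handed straight to a DataFrame (a minimal sketch, assuming lis was filled by the loop and its first row holds the column headers):
import pandas as pd

# Hypothetical follow-up to the loop above: build a frame from the rows
# collected before the end-of-file TypeError was skipped.
df = pd.DataFrame(lis[1:], columns=lis[0])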

Pandas AttributeError: 'DataFrame' object has no attribute 'Timestamp'

So I want to get the monthly sum with my script, but I always get an AttributeError, which I don't understand. The column Timestamp does indeed exist in my combined_csv. I know for sure that this line is causing the problem, since I tested all of my other code before.
AttributeError: 'DataFrame' object has no attribute 'Timestamp'
I'd appreciate every kind of help I can get. Thanks!
import os
import glob
import pandas as pd
# set working directory
os.chdir("Path to CSVs")
# find all csv files in the folder
# use glob pattern matching -> extension = 'csv'
# save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# print(all_filenames)
# combine all files in the list
combined_csv = pd.concat([pd.read_csv(f, sep=';') for f in all_filenames])
# Format CSV
# Transform Timestamp column into datetime
combined_csv['Timestamp'] = pd.to_datetime(combined_csv.Timestamp)
# Read out first entry of every day of every month
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first']))
# To get the yield of day i have to subtract day 2 HtmDht_Energy - day 1 HtmDht_Energy
combined_csv["dailyYield"] = combined_csv["first"] - combined_csv["first"].shift()
# combined_csv.reset_index()
# combined_csv.index.set_names(["year", "month"], inplace=True)
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv.Timestamp.dt.year, combined_csv.Timestamp.dt.month]).sum()
Output of combined_csv.columns
Index(['Timestamp', 'teHst0101', 'teHst0102', 'teHst0103', 'teHst0104',
'teHst0105', 'teHst0106', 'teHst0107', 'teHst0201', 'teHst0202',
'teHst0203', 'teHst0204', 'teHst0301', 'teHst0302', 'teHst0303',
'teHst0304', 'teAmb', 'teSolFloHexHst', 'teSolRetHexHst',
'teSolCol0501', 'teSolCol1001', 'teSolCol1501', 'vfSol', 'prSolRetSuc',
'rdGlobalColAngle', 'gSolPump01_roActual', 'gSolPump02_roActual',
'gHstPump03_roActual', 'gHstPump04_roActual', 'gDhtPump06_roActual',
'gMB01_isOpened', 'gMB02_isOpened', 'gCV01_posActual',
'gCV02_posActual', 'HtmDht_Energy', 'HtmDht_Flow', 'HtmDht_Power',
'HtmDht_Volume', 'HtmDht_teFlow', 'HtmDht_teReturn', 'HtmHst_Energy',
'HtmHst_Flow', 'HtmHst_Power', 'HtmHst_Volume', 'HtmHst_teFlow',
'HtmHst_teReturn', 'teSolColDes', 'teHstFloDes'],
dtype='object')
Traceback:
When I select it with
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv['Timestamp'].dt.year, combined_csv['Timestamp'].dt.month]).sum()
Traceback (most recent call last):
File "D:\Users\wink\PycharmProjects\csvToExcel\main.py", line 28, in <module>
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv['Timestamp'].dt.year, combined_csv['Timestamp'].dt.month]).sum()
File "D:\Users\wink\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "D:\Users\wink\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'Timestamp'
Traceback with Mustafa's solution:
Traceback (most recent call last):
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3862, in reindexer
value = value.reindex(self.index)._values
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\util\_decorators.py", line 312, in wrapper
return func(*args, **kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4176, in reindex
return super().reindex(**kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\generic.py", line 4811, in reindex
return self._reindex_axes(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4022, in _reindex_axes
frame = frame._reindex_index(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4038, in _reindex_index
new_index, indexer = self.index.reindex(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 2492, in reindex
target = MultiIndex.from_tuples(target)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 175, in new_meth
return meth(self_or_cls, *args, **kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 531, in from_tuples
arrays = list(lib.tuples_to_object_array(tuples).T)
File "pandas\_libs\lib.pyx", line 2527, in pandas._libs.lib.tuples_to_object_array
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\winklerm\PycharmProjects\csvToExcel\main.py", line 28, in <module>
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv.Timestamp.dt.year, combined_csv.Timestamp.dt.month]).sum()
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3163, in __setitem__
self._set_item(key, value)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3242, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3888, in _sanitize_column
value = reindexer(value).T
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3870, in reindexer
raise TypeError(
TypeError: incompatible index of inserted column with frame index
This line makes the Timestamp column the index of the combined_csv:
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first']))
and therefore you get an error when you try to access .Timestamp.
The remedy is to reset_index, so instead of the above line, you can try this:
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first'])).reset_index()
which will move the Timestamp column from the index back into the normal columns, so you can access it again.
Side note:
combined_csv["dailyYield"] = combined_csv["first"] - combined_csv["first"].shift()
is equivalent to
combined_csv["dailyYield"] = combined_csv["first"].diff()

Converting a dataframe into a config file

def load_config_report(config_file_path):
    config = configparser.ConfigParser()
    pharmacy_settings = pd.read_excel(config_file_path,
                                      sheet_name='pharmacy_settings')
    for each in pharmacy_settings['facility_name']:
        config[each]['facility_alias'] = pharmacy_settings['facility_alias']
        config[each]['facility_group_id'] = pharmacy_settings['facility_group_id']
        config[each]['invoice_num'] = pharmacy_settings['invoice_num']
    with open('X:\\Reports\\Invoices\\config.ini', 'w') as configfile:
        config.write(configfile)
I'm trying to convert the contents of an Excel file into a .ini file. The first column is the [section]; the remaining columns are variables in that section. Currently I'm getting a KeyError due to how I'm iterating/slicing the dataframe. Is this a good approach to achieve this?
Traceback (most recent call last):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.752.0_x64__qbz5n2kfra8p0\lib\tkinter\__init__.py", line 1892, in __call__
return self.func(*args)
File "X:\Python Dev\REFACTOR\invoicerefactor.py", line 41, in read_config
options.load_config_report(config_file_path.get())
File "X:\Python Dev\REFACTOR\options.py", line 10, in load_config_report
config[each]['facility_alias'] = pharmacy_settings['facility_alias']
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.752.0_x64__qbz5n2kfra8p0\lib\configparser.py", line 960, in __getitem__
raise KeyError(key)
KeyError: 'ALL CARE HEALTH SOLUTIONS'
You need to initialize config[each] as an empty dictionary before populating it with data.
for each in pharmacy_settings['facility_name']:
    config[each] = {}
    config[each]['facility_alias'] = pharmacy_settings['facility_alias']
    #...
That's how the examples in the docs do it.
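One more wrinkle to watch for: even with the section initialized, each assignment above stores a whole pandas Series (the entire column) as the value rather than the single cell for that facility. Iterating row-wise sidesteps both problems (a sketch, assuming one spreadsheet row per facility; configparser values must be strings, hence the str() casts):
import configparser
import pandas as pd

def load_config_report(config_file_path):
    config = configparser.ConfigParser()
    pharmacy_settings = pd.read_excel(config_file_path,
                                      sheet_name='pharmacy_settings')
    # One section per row; pull scalar cell values instead of whole columns.
    for _, row in pharmacy_settings.iterrows():
        config[row['facility_name']] = {
            'facility_alias': str(row['facility_alias']),
            'facility_group_id': str(row['facility_group_id']),
            'invoice_num': str(row['invoice_num']),
        }
    with open('X:\\Reports\\Invoices\\config.ini', 'w') as configfile:
        config.write(configfile)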

using replace for a column of a pandas df TypeError: Cannot compare types 'ndarray(dtype=int64)' and 'str'

How should I fix this?
import pandas as pd
csv_file = 'sample.csv'
count = 1
my_filtered_csv = pd.read_csv(csv_file, usecols=['subDirectory_filePath', 'expression'])
emotion_map = { '0':'6', '1':'3', '2':'4', '3':'5', '4':'2', '5':'1', '6':'0'}
my_filtered_csv['expression'] = my_filtered_csv['expression'].replace(emotion_map)
print(my_filtered_csv)
Error is:
Traceback (most recent call last):
File "/Users/mona/CS585/project/affnet/emotion_map.py", line 11, in <module>
my_filtered_csv['expression'] = my_filtered_csv['expression'].replace(emotion_map)
File "/Users/mona/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 3836, in replace
limit=limit, regex=regex)
File "/Users/mona/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 3885, in replace
regex=regex)
File "/Users/mona/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3259, in replace_list
masks = [comp(s) for i, s in enumerate(src_list)]
File "/Users/mona/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3259, in <listcomp>
masks = [comp(s) for i, s in enumerate(src_list)]
File "/Users/mona/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3247, in comp
return _maybe_compare(values, getattr(s, 'asm8', s), operator.eq)
File "/Users/mona/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 4619, in _maybe_compare
raise TypeError("Cannot compare types %r and %r" % tuple(type_names))
TypeError: Cannot compare types 'ndarray(dtype=int64)' and 'str'
Process finished with exit code 1
A few lines of the csv file look like:
,subDirectory_filePath,expression
0,689/737db2483489148d783ef278f43f486c0a97e140fc4b6b61b84363ca.jpg,1
1,392/c4db2f9b7e4b422d14b6e038f0cdc3ecee239b55326e9181ee4520f9.jpg,0
2,468/21772b68dc8c2a11678c8739eca33adb6ccc658600e4da2224080603.jpg,0
3,944/06e9ae8d3b240eb68fa60534783eacafce2def60a86042f9b7d59544.jpg,1
4,993/02e06ee5521958b4042dd73abb444220609d96f57b1689abbe87c024.jpg,8
5,979/f675c6a88cdef99a6d8b0261741217a0319387fcf1571a174f99ac81.jpg,6
6,637/94b769d8e880cbbea8eaa1350cb8c094a03d27f9fef44e1f4c0fb2ae.jpg,9
7,997/b81f843f08ce3bb0c48b270dc58d2ab8bf5bea3e2262e50bbcadbec2.jpg,6
8,358/21a32dd1c1ecd57d3e8964621c911df1c0b3348a4ae5203b4a243230.JPG,9
Changing the emotion_map to the following fixed the problem:
emotion_map = { 0:6, 1:3, 2:4, 3:5, 4:2, 5:1, 6:0}
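If you would rather keep the string mapping, an alternative sketch is to cast the column to str first so the keys match the column's dtype (note the replaced values then stay strings):
# Cast to str so the string keys in emotion_map can match the int64 data.
my_filtered_csv['expression'] = my_filtered_csv['expression'].astype(str).replace(emotion_map)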
Another possibility which can also produce this error: you have already run this code once, and the data has already been replaced. To solve this, go back and load the data set again.

PySpark loading CSV AttributeError: 'RDD' object has no attribute '_get_object_id'

I'm trying to load a CSV file into a spark DataFrame. This is what I have done so far:
from pyspark import SparkConf, SparkContext
from pyspark import sql

appName = "testSpark"
master = "local"
# sc is a SparkContext.
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)

# csv path
text_file = sc.textFile("hdfs:///path/to/sensordata20171008223515.csv")
df = sqlContext.load(source="com.databricks.spark.csv", header='true', path=text_file)
print df.schema()
Here's the trace:
Traceback (most recent call last):
File "/home/centos/main.py", line 16, in <module>
df = sc.textFile(text_file).map(lambda line: (line.split(';')[0], line.split(';')[1])).collect()
File "/usr/hdp/2.5.6.0-40/spark/python/lib/pyspark.zip/pyspark/context.py", line 474, in textFile
File "/usr/hdp/2.5.6.0-40/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 804, in __call__
File "/usr/hdp/2.5.6.0-40/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 278, in get_command_part
AttributeError: 'RDD' object has no attribute '_get_object_id'
I'm new to Spark, so if anyone could tell me what I've done wrong, that would be very helpful.
You cannot pass an RDD to the CSV reader. You should use the path directly:
df = sqlContext.load(source="com.databricks.spark.csv",
                     header='true',
                     path="hdfs:///path/to/sensordata20171008223515.csv")
Only a limited number of formats (notably JSON) support an RDD as an input argument.
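If you are on Spark 2.0 or later, the CSV source is built in and the spark-csv package is unnecessary. A minimal sketch with the SparkSession API (assuming the same HDFS path, and a semicolon delimiter as your map call hinted):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("testSpark").master("local").getOrCreate()

# Built-in CSV reader; pass the path itself, never an RDD.
df = spark.read.csv("hdfs:///path/to/sensordata20171008223515.csv",
                    header=True, sep=";")
df.printSchema()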
