I'm trying to convert this JSON file into a flat data frame including all the columns.
However, I keep receiving this error:
Traceback (most recent call last):
File "./readjsonfile.py", line 19, in <module>
print(data[fields])
File "../anaconda/envs/myenv/lib/python3.9/site-packages/pandas/core/frame.py", line 3030, in __getitem__
indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
File "../anaconda/envs/myenv/lib/python3.9/site-packages/pandas/core/indexing.py", line 1266, in _get_listlike_indexer
self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
File "../anaconda/envs/myenv/lib/python3.9/site-packages/pandas/core/indexing.py", line 1316, in _validate_read_indexer
raise KeyError(f"{not_found} not in index")
KeyError: "['utterance', 'turns.frames.actions.act'] not in index"
I reviewed this link to learn how to do that, and this is my code:
f = open('./dialogues_001.json')
jsondata = json.load(f)
fields = ['dialogue_id', 'services', 'turns.frames.actions.act', 'turns.utterance']
data = pd.json_normalize(jsondata)
print(data[fields])
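For what it's worth, pd.json_normalize only flattens nested dicts; list-valued fields such as turns stay behind as single columns, which is why dotted names like 'turns.utterance' never appear. A minimal sketch of exploding one list level with record_path and meta, using made-up data that only assumes a shape like the dialogue file's:

```python
import pandas as pd

# Made-up data mirroring the dialogue JSON's shape: "turns" is a *list*,
# and pd.json_normalize only flattens nested dicts, not lists.
jsondata = [
    {
        "dialogue_id": "1_00000",
        "services": ["restaurant"],
        "turns": [
            {"utterance": "Book a table",
             "frames": [{"actions": [{"act": "INFORM"}]}]},
        ],
    }
]

# One row per turn, carrying the dialogue-level fields along via `meta`
turns = pd.json_normalize(jsondata, record_path="turns",
                          meta=["dialogue_id", "services"])
print(turns.columns.tolist())
```

Deeper list levels (frames, actions) need a further record_path pass or an explode; the exact shape of the real file is an assumption here.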
I want to get the monthly sum with my script, but I always get an AttributeError, which I don't understand. The column Timestamp does indeed exist in my combined_csv. I know for sure that this line is causing the problem, since I tested all of my other code before.
AttributeError: 'DataFrame' object has no attribute 'Timestamp'
I'll appreciate any help I can get - thanks
import os
import glob
import pandas as pd
# set working directory
os.chdir("Path to CSVs")
# find all csv files in the folder
# use glob pattern matching -> extension = 'csv'
# save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# print(all_filenames)
# combine all files in the list
combined_csv = pd.concat([pd.read_csv(f, sep=';') for f in all_filenames])
# Format CSV
# Transform Timestamp column into datetime
combined_csv['Timestamp'] = pd.to_datetime(combined_csv.Timestamp)
# Read out first entry of every day of every month
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first']))
# To get the yield of day i have to subtract day 2 HtmDht_Energy - day 1 HtmDht_Energy
combined_csv["dailyYield"] = combined_csv["first"] - combined_csv["first"].shift()
# combined_csv.reset_index()
# combined_csv.index.set_names(["year", "month"], inplace=True)
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv.Timestamp.dt.year, combined_csv.Timestamp.dt.month]).sum()
Output of combined_csv.columns
Index(['Timestamp', 'teHst0101', 'teHst0102', 'teHst0103', 'teHst0104',
'teHst0105', 'teHst0106', 'teHst0107', 'teHst0201', 'teHst0202',
'teHst0203', 'teHst0204', 'teHst0301', 'teHst0302', 'teHst0303',
'teHst0304', 'teAmb', 'teSolFloHexHst', 'teSolRetHexHst',
'teSolCol0501', 'teSolCol1001', 'teSolCol1501', 'vfSol', 'prSolRetSuc',
'rdGlobalColAngle', 'gSolPump01_roActual', 'gSolPump02_roActual',
'gHstPump03_roActual', 'gHstPump04_roActual', 'gDhtPump06_roActual',
'gMB01_isOpened', 'gMB02_isOpened', 'gCV01_posActual',
'gCV02_posActual', 'HtmDht_Energy', 'HtmDht_Flow', 'HtmDht_Power',
'HtmDht_Volume', 'HtmDht_teFlow', 'HtmDht_teReturn', 'HtmHst_Energy',
'HtmHst_Flow', 'HtmHst_Power', 'HtmHst_Volume', 'HtmHst_teFlow',
'HtmHst_teReturn', 'teSolColDes', 'teHstFloDes'],
dtype='object')
Traceback:
When I select it with
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv['Timestamp'].dt.year, combined_csv['Timestamp'].dt.month]).sum()
Traceback (most recent call last):
File "D:\Users\wink\PycharmProjects\csvToExcel\main.py", line 28, in <module>
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv['Timestamp'].dt.year, combined_csv['Timestamp'].dt.month]).sum()
File "D:\Users\wink\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "D:\Users\wink\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'Timestamp'
Traceback with mustafa's solution:
Traceback (most recent call last):
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3862, in reindexer
value = value.reindex(self.index)._values
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\util\_decorators.py", line 312, in wrapper
return func(*args, **kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4176, in reindex
return super().reindex(**kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\generic.py", line 4811, in reindex
return self._reindex_axes(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4022, in _reindex_axes
frame = frame._reindex_index(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4038, in _reindex_index
new_index, indexer = self.index.reindex(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 2492, in reindex
target = MultiIndex.from_tuples(target)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 175, in new_meth
return meth(self_or_cls, *args, **kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 531, in from_tuples
arrays = list(lib.tuples_to_object_array(tuples).T)
File "pandas\_libs\lib.pyx", line 2527, in pandas._libs.lib.tuples_to_object_array
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\winklerm\PycharmProjects\csvToExcel\main.py", line 28, in <module>
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv.Timestamp.dt.year, combined_csv.Timestamp.dt.month]).sum()
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3163, in __setitem__
self._set_item(key, value)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3242, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3888, in _sanitize_column
value = reindexer(value).T
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3870, in reindexer
raise TypeError(
TypeError: incompatible index of inserted column with frame index
This line makes the Timestamp column the index of the combined_csv:
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first']))
and therefore you get an error when you try to access .Timestamp.
The remedy is to reset_index, so instead of the above line, you can try this:
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first'])).reset_index()
which moves Timestamp from the index back into a normal column, so you can then access it.
Side note:
combined_csv["dailyYield"] = combined_csv["first"] - combined_csv["first"].shift()
is equivalent to
combined_csv["dailyYield"] = combined_csv["first"].diff()
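As a further sketch of the monthly-sum step: assigning the result of groupby(...).sum() fails because its (year, month) MultiIndex does not align with the frame's own index, whereas groupby(...).transform('sum') returns a result aligned row-for-row, so the assignment succeeds. Toy data with made-up values:

```python
import pandas as pd

# Toy daily data standing in for the resampled combined_csv
df = pd.DataFrame({
    "Timestamp": pd.date_range("2021-01-30", periods=4, freq="D"),
    "dailyYield": [1.0, 2.0, 3.0, 4.0],
})

# transform('sum') broadcasts each month's sum back to every row,
# keeping the original index, unlike sum() whose MultiIndex clashes.
df["monthlySum"] = df.groupby(
    [df["Timestamp"].dt.year, df["Timestamp"].dt.month]
)["dailyYield"].transform("sum")

print(df["monthlySum"].tolist())  # -> [3.0, 3.0, 7.0, 7.0]
```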
I have a DataFrame of 320000 rows and 18 columns.
Two of the columns are the project start date and project end date.
I simply want to add a column with the duration of the project in days.
df['proj_duration'] = df['END_FORMATED'] - df['START_FORMATED']
The data is imported from a SQL Server.
The dates are formatted (yyyy-mm-dd).
When I run the code above, I get this error:
Traceback (most recent call last):
File "pandas\_libs\tslibs\timedeltas.pyx", line 234, in pandas._libs.tslibs.timedeltas.array_to_timedelta64
TypeError: Expected unicode, got datetime.timedelta
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "", line 1, in <module>
df['proj_duration'] = df['END_FORMATED'] - df['START_FORMATED']
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\ops\common.py", line 64, in new_method
return method(self, other)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\ops\__init__.py", line 502, in wrapper
return _construct_result(left, result, index=left.index, name=res_name)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\ops\__init__.py", line 475, in _construct_result
out = left._constructor(result, index=index)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\series.py", line 305, in __init__
data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\construction.py", line 424, in sanitize_array
subarr = _try_cast(data, dtype, copy, raise_cast_failure)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\construction.py", line 537, in _try_cast
subarr = maybe_cast_to_datetime(arr, dtype)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 1346, in maybe_cast_to_datetime
value = maybe_infer_to_datetimelike(value)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 1198, in maybe_infer_to_datetimelike
value = try_timedelta(v)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 1187, in try_timedelta
return to_timedelta(v)._ndarray_values.reshape(shape)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\tools\timedeltas.py", line 102, in to_timedelta
return _convert_listlike(arg, unit=unit, errors=errors)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\tools\timedeltas.py", line 140, in _convert_listlike
value = sequence_to_td64ns(arg, unit=unit, errors=errors, copy=False)[0]
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\arrays\timedeltas.py", line 943, in sequence_to_td64ns
data = objects_to_td64ns(data, unit=unit, errors=errors)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\arrays\timedeltas.py", line 1052, in objects_to_td64ns
result = array_to_timedelta64(values, unit=unit, errors=errors)
File "pandas\_libs\tslibs\timedeltas.pyx", line 239, in pandas._libs.tslibs.timedeltas.array_to_timedelta64
File "pandas\_libs\tslibs\timedeltas.pyx", line 198, in pandas._libs.tslibs.timedeltas.convert_to_timedelta64
File "pandas\_libs\tslibs\timedeltas.pyx", line 143, in pandas._libs.tslibs.timedeltas.delta_to_nanoseconds
OverflowError: int too big to convert
I suspect that there is a problem in the formatting of the dates. I tried:
a = df.head(50000)['END_FORMATED']
b = df.head(50000)['START_FORMATED']
c = a-b
and got the same error. However, when I ran it for the last 50000 rows, it worked perfectly:
x = df.tail(50000)['END_FORMATED']
y = df.tail(50000)['START_FORMATED']
z = x-y
This shows that the problem does not exist in all of the dataset and only in some of the rows.
Any idea how I can solve the problem?
Thanks!
It seems like you have a date in your SQL dataset set as 1009-01-06. pandas only understands dates between 1677-09-21 and 2262-04-11, as per the official documentation.
Try casting each Series to datetime to catch entries that are not in the expected range, with infer_datetime_format=True and errors='coerce', as follows:
import pandas as pd

df = pd.DataFrame()
df['START_FORMATED'] = ['2020-05-05', '2020-05-06', '2020-05-07', '1009-01-06']
df['END_FORMATED'] = ['2020-06-05', '2020-06-06', '2020-06-07', '2020-06-08']
df['proj_duration'] = pd.to_datetime(df['END_FORMATED'], infer_datetime_format=True, errors='coerce') - pd.to_datetime(df['START_FORMATED'], infer_datetime_format=True, errors='coerce')
This sets NaT wherever pd.to_datetime() cannot produce a value, which results in this df:
START_FORMATED END_FORMATED proj_duration
0 2020-05-05 2020-06-05 31 days
1 2020-05-06 2020-06-06 31 days
2 2020-05-07 2020-06-07 31 days
3 1009-01-06 2020-06-08 NaT
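To locate the offending rows instead of only coercing them, the NaT values produced by errors='coerce' can serve as a mask. A small sketch with a made-up out-of-range date:

```python
import pandas as pd

# Made-up column with one date outside pandas' representable range
df = pd.DataFrame({"START_FORMATED": ["2020-05-05", "1009-01-06", "2020-05-07"]})

parsed = pd.to_datetime(df["START_FORMATED"], errors="coerce")
bad = df[parsed.isna()]  # rows that could not be represented as Timestamps
print(bad.index.tolist())  # -> [1]
```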
I need to import a csv file using pandas that has a date field in the format 'year.decimal day', such as '1980.042', which corresponds to 'DD/MM/YYYY', i.e. '11/02/1980'.
File sample:
data
1980.042
1980.125
1980.208
1980.292
1980.375
1980.458
1980.542
1980.625
1980.708
Using pd.to_datetime I can transform it like this:
d = '1980.042'
print(pd.to_datetime(d, format = '%Y.%j'))
Output:
1980-02-11 00:00:00
My first attempt was to read the file and convert the dataframe column:
import pandas as pd
df = pd.read_csv('datas.csv')
print(df.dtypes, '\n\n', df.head())
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
Output:
data float64
dtype: object
data
0 1980.042
1 1980.125
2 1980.208
3 1980.292
4 1980.375
Traceback (most recent call last):
File "datas.py", line 4, in <module>
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 451, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 368, in _convert_listlike
require_iso8601=require_iso8601
File "pandas/_libs/tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 513, in pandas._libs.tslib.array_to_datetime
AssertionError
The second attempt was to transform the column into a str and then a date:
import pandas as pd
df = pd.read_csv('datas.csv')
print(df.dtypes, '\n\n', df.head())
df['data'] = df['data'].astype(str)
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
Output:
data float64
dtype: object
data
0 1980.042
1 1980.125
2 1980.208
3 1980.292
4 1980.375
Traceback (most recent call last):
File "datas.py", line 6, in <module>
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 451, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 368, in _convert_listlike
require_iso8601=require_iso8601
File "pandas/_libs/tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 513, in pandas._libs.tslib.array_to_datetime
AssertionError
Then I realized that, due to an internal floating-point issue, the data was getting more than three decimal places, so I rounded it to three decimal places before converting:
import pandas as pd
df = pd.read_csv('datas.csv')
print(df.dtypes, '\n\n', df.head())
df['data'] = df['data'].round(3).astype(str)
print(df.dtypes, '\n\n', df.head())
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
Output:
data float64
dtype: object
data
0 1980.042
1 1980.125
2 1980.208
3 1980.292
4 1980.375
data object
dtype: object
data
0 1980.042
1 1980.125
2 1980.208
3 1980.292
4 1980.375
Traceback (most recent call last):
File "datas.py", line 8, in <module>
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 451, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 368, in _convert_listlike
require_iso8601=require_iso8601
File "pandas/_libs/tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 513, in pandas._libs.tslib.array_to_datetime
AssertionError
Finally, looking at the pandas documentation and at some forums, I saw that I could define the data type when reading the file and also apply a lambda function:
import pandas as pd
date_parser = lambda col: pd.to_datetime(str(col), format = '%Y.%j')
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
print(df.dtypes, '\n\n', df.head())
Output:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 377, in _convert_listlike
values, tz = conversion.datetime_to_datetime64(arg)
File "pandas/_libs/tslibs/conversion.pyx", line 188, in pandas._libs.tslibs.conversion.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "datas.py", line 5, in <module>
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 446, in _read
data = parser.read(nrows)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1921, in read
names, data = self._do_date_conversions(names, data)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1675, in _do_date_conversions
self.index_names, names, keep_date_col=self.keep_date_col)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 3066, in _process_date_conversion
data_dict[colspec] = converter(data_dict[colspec])
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 3033, in converter
return generic_parser(date_parser, *date_cols)
File "/usr/lib/python3/dist-packages/pandas/io/date_converters.py", line 39, in generic_parser
results[i] = parse_func(*args)
File "datas.py", line 3, in <lambda>
date_parser = lambda col: pd.to_datetime(str(col), format = '%Y.%j')
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 469, in to_datetime
result = _convert_listlike(np.array([arg]), box, format)[0]
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 380, in _convert_listlike
raise e
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 347, in _convert_listlike
errors=errors)
File "pandas/_libs/tslibs/strptime.pyx", line 163, in pandas._libs.tslibs.strptime.array_strptime
ValueError: unconverted data remains: 5
Anyway, nothing works. Has anyone run into this? Any suggestions for reading the file with the correct data type, or for converting the column on the dataframe?
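One likely culprit in the failing snippets is that '%Y.%j' is passed positionally: pd.to_datetime's second positional parameter is errors, not format. Passing format= as a keyword on a string-formatted column works in this sketch (toy data standing in for the CSV):

```python
import pandas as pd

# Toy column mimicking the CSV's float-typed 'data' field
df = pd.DataFrame({"data": [1980.042, 1980.125, 1980.208]})

# Render with exactly three decimals so '%Y.%j' always matches, then
# pass format= as a keyword (positionally it would bind to `errors`).
df["data"] = pd.to_datetime(df["data"].map("{:.3f}".format), format="%Y.%j")
print(df["data"].dt.strftime("%Y-%m-%d").tolist())
# -> ['1980-02-11', '1980-05-04', '1980-07-26']
```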
I really hadn't noticed the problem with the data.
After removing the rows with decimal parts greater than 365, I tested Tuhin Sharma's idea.
Unfortunately, it returns the value of the first line for all of the dataframe's rows.
But I used the datetime module, as suggested by Tuhin Sharma, in a lambda function when reading the file, as follows:
Sample file:
data
1980.042
1980.125
1980.208
1980.292
Code:
import pandas as pd
import datetime
date_parser = lambda col: datetime.datetime.strptime(col, '%Y.%j')
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
print(df)
Output:
data
0 1980-02-11
1 1980-05-04
2 1980-07-26
3 1980-10-18
You could try using the datetime module, with the following code:
import pandas as pd
import datetime

df = pd.read_csv('datas.csv', dtype=str)
df["data"] = df["data"].map(lambda x: datetime.datetime.strptime(x, '%Y.%j'))
However, this code will fail, because your data has a problem:
1980.375
1980.458
1980.542
1980.625
1980.708
For these values, the day-of-year part (the three decimal places) is greater than 366, which is not possible, and that's why it will throw an error.
Hope this helps!!
You can also try the following code, which is a lot cleaner:
import pandas as pd
import datetime
date_parser = lambda x: datetime.datetime.strptime(x, '%Y.%j')
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
print(df)
I have a short script where I want to be able to change values in a CSV.
My Code is the following:
import pandas as pd
import os

if os.path.exists('annotation.csv'):
    df = pd.read_csv('annotation.csv')
    label = 'unknown'
    img_number = 99
    if str(df.at[img_number, 'label_2']) == 'nan':
        df.at[img_number, 'label_2'] = label
    else:
        pass
My file looks like this:
label_1,label_2
,
,
,
,
,
(and so on)
I am able to change 'label_1' via df.at[img_number, 'label_1'],
but if I try the same for 'label_2' I get the following error.
Traceback (most recent call last):
File "D:/develop/mbuchwald/machinelearning/Organ_Annotation/test.py", line 11, in <module>
df.at[img_number, 'label_2'] = label
File "C:\Users\mbuchwald\AppData\Local\Continuum\anaconda3\envs\newEnv\lib\site-packages\pandas\core\indexing.py", line 2159, in __setitem__
self.obj._set_value(*key, takeable=self._takeable)
File "C:\Users\mbuchwald\AppData\Local\Continuum\anaconda3\envs\newEnv\lib\site-packages\pandas\core\frame.py", line 2582, in _set_value
engine.set_value(series._values, index, value)
File "pandas\_libs\index.pyx", line 124, in pandas._libs.index.IndexEngine.set_value
File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.set_value
File "pandas/_libs/src\util.pxd", line 150, in util.set_value_at
File "pandas/_libs/src\util.pxd", line 142, in util.set_value_at_unsafe
ValueError: could not convert string to float: 'unknown'
Does anyone have a clue? I can't figure it out. Thank you!
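This error typically means the all-empty label_2 column was read back as float64 (all NaN), so this pandas version refuses to place a string into it via .at. Casting the column to object first is one possible remedy; a sketch with made-up data:

```python
import numpy as np
import pandas as pd

# An all-empty CSV column such as 'label_2' comes back as float64
# (all NaN); assigning a string into it via .at can then raise
# "could not convert string to float" on older pandas versions.
df = pd.DataFrame({"label_1": [np.nan] * 3, "label_2": [np.nan] * 3})

df["label_2"] = df["label_2"].astype(object)  # allow string values
df.at[1, "label_2"] = "unknown"
print(df.at[1, "label_2"])  # -> unknown
```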