Cannot concatenate object when adding to a DataFrame - python

I am trying to add a sentence along with a coin (which acts like a label in this case, I guess) to a DataFrame. However, I keep getting this error:
Traceback (most recent call last):
File "c:\Users\gjohn\Documents\code\machineLearning\trading_bot\filter.py", line 132, in <module>
df = df.append({'coin': coin, 'review': sentence}, ignore_index=True)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py", line 2877, in append
return concat(
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\reshape\concat.py", line 294, in concat
op = _Concatenator(
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\reshape\concat.py", line 384, in __init__
raise TypeError(msg)
TypeError: cannot concatenate object of type '<class 'dict'>'; only Series and DataFrame objs are valid
Here is the code:
data = pd.read_csv('C:\\Users\\gjohn\\Documents\\code\\machineLearning\\trading_bot\\testreviews.csv')
df = data['review']  # Create a dataframe of the reviews.
classes = data['class']  # Create a dataframe of the classes.

for sentence in sentences:
    coin = find_coin(common_words, sentence)
    if len(sentence) > 0 and coin is not None:
        df = df.append({'coin': coin, 'review': sentence}, ignore_index=True)
I can't figure out how to fix this and I really need help; it would be great if you could help me out. Thanks!
Also sorry for the messy code :D

What is the sentence you use to construct the dictionary?
Perhaps you should check if the dictionary is constructed correctly?
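A likely cause, though not confirmed in the original post: data['review'] is a Series, and Series.append only accepts Series or DataFrame objects, not a plain dict. A minimal sketch of one workaround, collecting the rows first and building a DataFrame in a single step (sentences, find_coin, and common_words are taken from the question):

import pandas as pd

rows = []  # collect each new row as a dict
for sentence in sentences:
    coin = find_coin(common_words, sentence)
    if len(sentence) > 0 and coin is not None:
        rows.append({'coin': coin, 'review': sentence})

# Build the labelled reviews in one go instead of appending row by row.
df = pd.DataFrame(rows, columns=['coin', 'review'])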

Related

Pandas – ValueError: Cannot setitem on a Categorical with a new category, set the categories first

I've been searching for a solution to this for the past few hours now. Relevant pandas documentation is unhelpful and this solution gives me the same error.
I am trying to order my dataframe using a categorical in the following manner:
from pandas.api.types import CategoricalDtype

metabolites_order = CategoricalDtype(['Header', 'Metabolite', 'Unknown'], ordered=True)
df2['Feature type'] = df2['Feature type'].astype(metabolites_order)
df2 = df2.sort_values('Feature type')
The "Feature type" column is populated with the categories correctly. This code runs perfectly in Jupyter Notebooks, but when I run it in Pycharm, I get the following error:
Traceback (most recent call last):
File "/Users/wasim.sandhu/Documents/MSDIALPostProcessor/postprocessor.py", line 138, in process_alignment_file
df2.loc[4] = list(df2.columns)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexing.py", line 692, in __setitem__
iloc._setitem_with_indexer(indexer, value, self.name)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexing.py", line 1635, in _setitem_with_indexer
self._setitem_with_indexer_split_path(indexer, value, name)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexing.py", line 1700, in _setitem_with_indexer_split_path
self._setitem_single_column(loc, v, pi)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexing.py", line 1813, in _setitem_single_column
ser._mgr = ser._mgr.setitem(indexer=(pi,), value=value)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 568, in setitem
return self.apply("setitem", indexer=indexer, value=value)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 427, in apply
applied = getattr(b, f)(**kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 1846, in setitem
self.values[indexer] = value
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/arrays/_mixins.py", line 211, in __setitem__
value = self._validate_setitem_value(value)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/arrays/categorical.py", line 1898, in _validate_setitem_value
raise ValueError(
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
What could be causing this? I believe that I've set the categories correctly...
I'd suggest just mapping these categories to integers and then sorting on that column instead:
categories = ['Header', 'Metabolite', 'Unknown']
feature_map = {cat: i for i, cat in enumerate(categories)}  # category -> sort order
df['Feature order'] = df['Feature type'].map(feature_map)
df = df.sort_values('Feature order')
Figured it out literally minutes after I posted the question. The header row in this dataset is written into the 5th row (df2.loc[4] = list(df2.columns)), so the string "Feature type" itself appears as one of the values in the "Feature type" column, and since it is not one of the defined categories, it threw this error.
Solved by adding the column header name to the categories:
metabolites_order = CategoricalDtype(['Header', 'Feature type', 'Metabolite', 'Unknown'], ordered=True)
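An equivalent approach, assuming you would rather keep the original dtype and just tolerate the stray header string, is to add it as a category on the existing column before the assignment (a sketch using df2 from the question, not from the original answer):

# Allow the stray "Feature type" value without redefining the whole dtype.
df2['Feature type'] = df2['Feature type'].cat.add_categories(['Feature type'])
df2.loc[4] = list(df2.columns)  # no longer raises the ValueError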

pandas.DataFrame.agg does not work with np.std?

I am trying to use the pandas.DataFrame.agg function on the first column of a dataframe, with numpy.std as the aggregation function.
I don't know why it works with numpy.mean but not numpy.std.
Can someone tell me in what circumstances this happens?
This is very strange.
The following describes what I am facing.
My source is like this:
print(type(dataframe))
print(dataframe.head(5))
first_col = dataframe.columns.values[0]
agg_df = dataframe.agg({first_col: [np.mean]})
print(agg_df)
Then it shows the result like this:
<class 'pandas.core.frame.DataFrame'>
ax
0 -98.06
1 -97.81
2 -96.00
3 -93.44
4 -92.94
ax
mean -98.06
Now I change the function from np.mean to np.std (without changing anything else):
print(type(dataframe))
print(dataframe.head(5))
first_col = dataframe.columns.values[0]
agg_df = dataframe.agg({first_col: [np.std]})
print(agg_df)
It shows this error:
Traceback (most recent call last):
File "C:\prediction_framework_django\predictions\predictor.py", line 112, in pre_aggregated_unseen_data
agg_df = dataframe.agg({axis: [np.std]})
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\frame.py", line 7578, in aggregate
result, how = self._aggregate(func, axis, *args, **kwargs)
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\frame.py", line 7609, in _aggregate
return aggregate(self, arg, *args, **kwargs)
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 582, in aggregate
return agg_dict_like(obj, arg, _axis), True
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 768, in agg_dict_like
results = {key: obj._gotitem(key, ndim=1).agg(how) for key, how in arg.items()}
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 768, in <dictcomp>
results = {key: obj._gotitem(key, ndim=1).agg(how) for key, how in arg.items()}
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\series.py", line 3974, in aggregate
result, how = aggregate(self, func, *args, **kwargs)
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 586, in aggregate
return agg_list_like(obj, arg, _axis=_axis), None
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 672, in agg_list_like
raise ValueError("no results")
ValueError: no results
So the error is
in agg_list_like raise ValueError("no results") ValueError: no results
Thank you for your time and help.
Simply use the pandas builtin:
# Note the use of string to denote the function here
df.agg({first_col: ['mean', 'std']})
# You can also simply use the following
df[first_col].mean()
df[first_col].std()
[EDIT]: The error you are getting probably results from mixed types. You can check that all dtypes are float by looking at df.dtypes. If one of them is object, convert the problematic values (probably empty strings) into whatever you need, and both np.std and pandas' builtin std should work.
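A minimal sketch of that check and conversion, assuming the offending column simply contains strings that should be numeric (dataframe and first_col are the names from the question):

import pandas as pd

print(dataframe.dtypes)  # look for 'object' where you expect 'float64'

# Coerce non-numeric entries (e.g. empty strings) to NaN, then aggregate.
dataframe[first_col] = pd.to_numeric(dataframe[first_col], errors='coerce')
print(dataframe.agg({first_col: ['mean', 'std']}))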

Object of type 'float' has no len() error when slicing pandas dataframe json column

I have data that looks like this. In each column there are values/keys of varying lengths. Some rows are also NaN.
like match
0 [{'timestamp', 'type'}] [{'timestamp', 'type'}]
1 [{'timestamp', 'comment', 'type'}] [{'timestamp', 'type'}]
2 NaN NaN
I want to split these lists into their own columns. I want to keep all the data (and make it NaN if it is missing). I've tried following this tutorial and doing this:
df1 = pd.DataFrame(df['like'].values.tolist())
df1.columns = 'like_'+ df1.columns
df2 = pd.DataFrame(df['match'].values.tolist())
df2.columns = 'match_'+ df2.columns
col = df.columns.difference(['like','match'])
df = pd.concat([df[col], df1, df2],axis=1)
I get this error.
Traceback (most recent call last):
File "link to my file", line 12, in <module>
df1 = pd.DataFrame(df['like'].values.tolist())
File "/usr/local/lib/python3.9/site-packages/pandas/core/frame.py", line 509, in __init__
arrays, columns = to_arrays(data, columns, dtype=dtype)
File "/usr/local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 524, in to_arrays
return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
File "/usr/local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 561, in _list_to_arrays
content = list(lib.to_object_array(data).T)
File "pandas/_libs/lib.pyx", line 2448, in pandas._libs.lib.to_object_array
TypeError: object of type 'float' has no len()
Can someone help me understand what I'm doing wrong?
You can't call values.tolist() on NaN values. If you drop that row of NaNs, you can get past this issue, but then your prefix line fails, because the expanded columns are integers and 'like_' + df1.columns cannot concatenate a string with an integer index. Use add_prefix for the prefixes instead:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.add_prefix.html
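A minimal sketch of the combined fix, assuming the NaN rows can simply be dropped before expanding (column names are taken from the question):

# Drop missing rows, expand each list into its own columns, and prefix them.
like_clean = df['like'].dropna()
match_clean = df['match'].dropna()

df1 = pd.DataFrame(like_clean.tolist(), index=like_clean.index).add_prefix('like_')
df2 = pd.DataFrame(match_clean.tolist(), index=match_clean.index).add_prefix('match_')

col = df.columns.difference(['like', 'match'])
df = pd.concat([df[col], df1, df2], axis=1)  # rows that were NaN stay NaN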

Issue when computing/merging dask dataframe(s) when index is categorical

I'm trying to use dask to process a dataset which does not fit into memory. It's time series data for various "IDs". After reading dask documentation, I chose to use the "parquet" file format and partitioning by "ID".
However, while reading from parquet and setting the index, I encountered a "TypeError: to union ordered Categoricals, all categories must be the same" which I did not manage to solve by myself.
This code replicates the issue I'm having:
import dask.dataframe as dd
import numpy as np
import pandas as pd
import traceback
# create ids
ids = ["AAA", "BBB", "CCC", "DDD"]
# create data
df = pd.DataFrame(index=np.random.choice(ids, 50), data=np.random.rand(50, 1), columns=["FOO"]).reset_index().rename(columns={"index": "ID"})
# serialize to parquet
f = r"C:/temp/foo.pq"
df.to_parquet(f, compression='gzip', engine='fastparquet', partition_cols=["ID"])
# read with dask
df = dd.read_parquet(f)
try:
    df = df.set_index("ID")
except Exception as ee:
    print(traceback.format_exc())
At this point I get the following error:
~\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\arrays\categorical.py in check_for_ordered(self, op)
1492 if not self.ordered:
1493 raise TypeError(
-> 1494 f"Categorical is not ordered for operation {op}\n"
1495 "you can use .as_ordered() to change the "
1496 "Categorical to an ordered one\n"
TypeError: Categorical is not ordered for operation max
you can use .as_ordered() to change the Categorical to an ordered one
I then did:
# we order the categorical
df.ID = df.ID.cat.as_ordered()
df = df.set_index("ID")
And when I try to use df.compute(scheduler="processes"), I get the TypeError I mentioned before:
try:
    schd_str = 'processes'
    aa = df.compute(scheduler=schd_str)
    print(f"{schd_str}: OK")
except:
    print(f"{schd_str}: KO")
    print(traceback.format_exc())
gives:
Traceback (most recent call last):
File "<ipython-input-6-e15c4e86fee2>", line 3, in <module>
aa = df.compute(scheduler=schd_str)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 166, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in compute
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in <listcomp>
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 103, in finalize
return _concat(results)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 98, in _concat
else methods.concat(args2, uniform=True, ignore_index=ignore_index)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 431, in concat_pandas
ind = concat([df.index for df in dfs])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 400, in concat_pandas
return pd.CategoricalIndex(union_categoricals(dfs), name=dfs[0].name)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\dtypes\concat.py", line 352, in union_categoricals
raise TypeError("Categorical.ordered must be the same")
TypeError: Categorical.ordered must be the same
Surprisingly enough, using df.compute(scheduler="threads"), df.compute(scheduler="synchronous"), or not setting the index at all works properly.
However, that does not seem like something I should do, since I'm actually trying to merge several of these datasets and thought that setting the index would result in a speed-up over not setting one. (I get the exact same error when trying to merge two dataframes indexed this way.)
I tried to inspect df._meta, and it turns out my categories are "known", as they should be (see dask-categoricals).
I also read this github post about something that looks similar but somehow did not find a solution.
Thanks for your help,
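One possible workaround, an assumption on my part rather than something from the original post: cast the partition column back to a plain string dtype before setting the index, which sidesteps the categorical union entirely at the cost of losing the categorical dtype:

import dask.dataframe as dd

df = dd.read_parquet(r"C:/temp/foo.pq")
# Drop the categorical dtype produced by the partitioned parquet read.
df["ID"] = df["ID"].astype(str)
df = df.set_index("ID")
aa = df.compute(scheduler="processes")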

Filter pandas df multiple columns from a pandas series

I have a dataframe from which I need to retrieve the unique values in order to create some partitioning. I have that part working and can get a small dataframe with each row representing a certain partition. The challenge is that I then need to filter the original dataframe down to only the matching data (without modifying the original frame, so I can filter it again for each partition) so I can send it to S3.
I am having trouble filtering the dataframe based on the series from the small dataframe.
Here is my code:
df_partitions = df.groupby(['grid_id', 'case_id', 'snapshot_year', 'snapshot_month', 'snapshot_day']).size().reset_index()
df_parts = df_partitions[['grid_id', 'case_id', 'snapshot_year', 'snapshot_month', 'snapshot_day']]

for index, row in df_parts.iterrows():
    dest_key_name = '/rec/{}/{}/{}/{}/{}/{}/{}'.format(row['grid_id'], row['case_id'],
                                                       row['snapshot_year'], row['snapshot_month'],
                                                       row['snapshot_day'], file_partition_time,
                                                       'df.csv')
    df_test = df
    filter_df = df_test[(df_test['grid_id'] == row['grid_id'] &
                         df_test['case_id'] == row['case_id'] &
                         df_test['snapshot_year'] == row['snapshot_year'] &
                         df_test['snapshot_month'] == row['snapshot_month'] &
                         df_test['snapshot_day'] == row['snapshot_day'])]
    print(filter_df)
Here is the error:
Traceback (most recent call last):
File "<input>", line 8, in <module>
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/ops.py", line 954, in wrapper
na_op(self.values, other),
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/ops.py", line 924, in na_op
raise TypeError(msg)
TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]
I also tried
filters_df = df[row]
Here is the error:
KeyError: "['pjm' 'base' 2020 2 21] not in index"
and
df_test = df
i1 = df_test.set_index(row).index
i2 = df_parts.set_index(row).index
filter_df = df_test[~i1.isin(i2)]
Here is the error:
Traceback (most recent call last):
File "<input>", line 7, in <module>
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/frame.py", line 3164, in set_index
frame.index = index
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/generic.py", line 3627, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/generic.py", line 559, in _set_axis
self._data.set_axis(axis, labels)
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/internals.py", line 3074, in set_axis
(old_len, new_len))
ValueError: Length mismatch: Expected axis has 130 elements, new values have 5 elements
Very simple solution here. The format for filtering on multiple criteria is df[(...)&(...)], while you are trying df[(... & ... )]. Close out those parentheses where you're setting filter_df.
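Applied to the loop above, the corrected filter would look roughly like this (a sketch using the column names from the question; each comparison needs its own parentheses because & binds more tightly than ==):

filter_df = df_test[(df_test['grid_id'] == row['grid_id']) &
                    (df_test['case_id'] == row['case_id']) &
                    (df_test['snapshot_year'] == row['snapshot_year']) &
                    (df_test['snapshot_month'] == row['snapshot_month']) &
                    (df_test['snapshot_day'] == row['snapshot_day'])]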
