DataFrame with DateTimeIndex from Django model - python

My goal is to create a pandas DataFrame with a DatetimeIndex from a Django model. I am using the django-pandas package for this purpose, specifically the to_timeseries() method.
First, I used the .values() method on my queryset. This still returns a queryset, but it contains dictionaries. I then used to_timeseries() to create my dataframe. Everything here worked as expected: the pivot, the values, etc. But my index is just a list of strings, and I don't know why.
I have been able to find a great many manipulations in the pandas documentation, including how to turn a column or series into datetime objects. However, my index is not a Series, it is an Index, and none of these methods work. How do I make this happen? Thank you.
df = mclv.to_timeseries(index='day',
pivot_columns='medicine',
values='takentoday',
storage='wide')
df = df['day'].astype(Timestamp)
raise TypeError(f"dtype '{dtype}' not understood")
TypeError: dtype '<class 'pandas._libs.tslibs.timestamps.Timestamp'>' not understood
AttributeError: 'DataFrame' object has no attribute 'DateTimeIndex'
df = pd.DatetimeIndex(df, inplace=True)
TypeError: __new__() got an unexpected keyword argument 'inplace'
TypeError: Cannot cast DatetimeIndex to dtype
etc...

Correction & update: django-pandas did work as its authors expected. The problem was my misunderstanding of what it was doing, and how.
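For readers who land here with the same symptom (an index made of date strings), here is a minimal sketch of one common way to convert it. It does not use django-pandas; the small frame below is made-up data standing in for the result of to_timeseries(), and the column names are hypothetical.

```python
import pandas as pd

# Made-up stand-in for the pivoted frame; the index holds date strings.
df = pd.DataFrame(
    {"aspirin": [1, 0, 1], "ibuprofen": [0, 1, 1]},
    index=["2021-01-01", "2021-01-02", "2021-01-03"],
)
df.index.name = "day"

# pd.to_datetime accepts an Index and returns a DatetimeIndex.
# Assign the result back; there is no inplace option for this.
df.index = pd.to_datetime(df.index)

print(type(df.index).__name__)
```

Once the index is a DatetimeIndex, date-based selection such as df.loc["2021-01"] becomes available.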

Related

Joining two dataframes of one column generated with spark

I'm working with pyspark and pandas in Databricks. I'm generating the following two dataframes:
start_date = datetime.today() - timedelta(days=60)
end_date = datetime.today()
date_list = pd.date_range(start=start_date,end=end_date).strftime('%Y-%m-%d').tolist()
date_df = spark.createDataFrame(date_list, 'string').toDF("date")
and
random_list = np.random.normal(loc=50, scale=10, size=61)
random_list = [round(i) for i in random_list]
integer_df = spark.createDataFrame(random_list, 'integer').toDF("value")
so I have two dataframes of one column each ("date" and "value") of the same length and I'd like to "merge" them into one dataframe.
I've tried this:
integer_df=pd.concat(date_df)
which returns the following error: first argument must be an iterable of pandas-on-Spark objects, you passed an object of type "DataFrame"
and this
test_df = pd.concat([integer_df, date_df], axis=1, join='inner')
which returns the following error: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
Mostly I'd like to understand these errors.
From what I can see, you are mixing object types from different libraries: for example, you are passing a Spark DataFrame to a function that expects pandas (or pandas-on-Spark) objects.
first argument must be an iterable of pandas-on-Spark objects, you passed an object of type "DataFrame"
This one was caused because you passed the wrong type of object to concat: a single Spark DataFrame instead of an iterable of pandas-on-Spark objects. Use either pandas-on-Spark objects or plain pandas objects, if you are going to use pandas.
So to fix your first error, I would just follow the convention: work with the objects of the library whose function you are calling.
Something like this (or maybe just use pd.Series() or pd.DataFrame() directly):
date_df = spark.createDataFrame(date_list, 'string').toDF("date").toPandas()
# toDF("date") renames the default column ("value"); toPandas() converts the Spark DataFrame to a pandas DataFrame
integer_df = spark.createDataFrame(random_list, 'integer').toDF("value").toPandas()
After that, try pd.concat([...]) with the resulting DataFrames.
Your second error has the same root cause: concat only accepts Series and DataFrame objects from its own library. Since you are passing PySpark DataFrames, my guess is that it does not recognize them and reports the type of the container (the list) instead.
So to fix it, again use the correct object type for the library, or convert the data to NumPy arrays if you want something more efficient.
Hope this helps.
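As a minimal sketch of the plain-pandas route suggested above, the two one-column frames can be built directly in pandas and joined side by side. This assumes pd is the regular pandas package (not pyspark.pandas) and that the data fits on the driver. The fixed end date and the seeded generator are assumptions made only so the example is reproducible.

```python
import numpy as np
import pandas as pd

# 61 consecutive days ending on an arbitrary fixed date (assumption).
dates = pd.date_range(end="2024-03-01", periods=61).strftime("%Y-%m-%d")
date_df = pd.DataFrame({"date": dates})

# 61 rounded draws from a normal distribution, as in the question.
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=61).round().astype(int)
integer_df = pd.DataFrame({"value": values})

# concat takes a list of pandas objects; axis=1 places them side by side,
# aligned on the row index.
merged = pd.concat([date_df, integer_df], axis=1)

print(merged.shape)
```

If a Spark DataFrame is needed afterwards, spark.createDataFrame(merged) converts the result back.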

Why does dask throw an error when setting a String column as an index?

I'm reading a large CSV with dask, setting the dtypes as string and then setting it as an index:
dataframe = dd.read_csv(file_path, dtype={"colName": "string"}, blocksize=100e6)
dataframe.set_index("colName")
and it throws the following error:
TypeError: Cannot interpret 'StringDtype' as a data type
Why does this happen? How can I solve it?
As stated in the bug report here for an unrelated issue: https://github.com/dask/dask/issues/7206#issuecomment-797221227
When constructing the dask Array's meta object, we're currently assuming the underlying array type is a NumPy array, when in this case, it's actually going to be a pandas StringArray. But unlike pandas, NumPy doesn't know how to handle a StringDtype.
Currently, changing the column type to object from string solves the issue, but it's unclear if this is a bug or an expected behavior:
dataframe = dd.read_csv(file_path, dtype={"colName": "object"}, blocksize=100e6)
dataframe.set_index("colName")

Filtering pandas dataframe based on attribute of object in a column

This is something that I can do with a roundabout measure, but I'm wondering if Pandas offers something that makes this possible and that I might be missing.
So I have a column, "object", which contains objects that have attributes. One of those attributes is something called "key". I'm trying to filter my dataframe to only include objects whose key belongs in a certain list:
df2 = df[df["object"].key.isin(list_of_keys)]
The error this returns is
AttributeError: 'Series' object has no attribute 'key'
I tried something like this too, but it didn't work:
df2 = df[df["object"].map(lambda x: x.key).isin(list_of_keys)]
This returns an even more inscrutable error:
TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Try comparing directly within the lambda function:
df2 = df[df["object"].map(lambda x: x.key in list_of_keys)]
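A small self-contained sketch of that approach. The Item class and the sample data are hypothetical and only stand in for whatever objects the column really holds.

```python
import pandas as pd


class Item:
    """Hypothetical stand-in for the objects stored in the column."""

    def __init__(self, key):
        self.key = key


df = pd.DataFrame(
    {"object": [Item("a"), Item("b"), Item("c")], "n": [1, 2, 3]}
)
list_of_keys = ["a", "c"]

# map calls the lambda once per element and returns a boolean Series,
# which is then used as a row mask.
mask = df["object"].map(lambda x: x.key in list_of_keys)
df2 = df[mask]

print(df2["n"].tolist())
```

For a long list_of_keys, converting it to a set first makes each membership check cheaper.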

Pandas merge not working due to a wrong type

I'm trying to merge two dataframes using
grouped_data = pd.merge(grouped_data, df['Pattern'].str[7:11]
,how='left',left_on='Calc_DRILLING_Holes',
right_on='Calc_DRILLING_Holes')
But I get an error saying: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
What could be the issue here? The original dataframe that I'm trying to merge to was created from a much larger dataset with the following code:
import pandas as pd
raw_data = pd.read_csv(r"C:\Users\cherp2\Desktop\test.csv")
data_drill = raw_data.query('Activity =="DRILL"')
grouped_data = data_drill.groupby([data_drill[
'PeriodStartDate'].str[:10], 'Blast'])[
'Calc_DRILLING_Holes'].sum().reset_index(
).sort_values('PeriodStartDate')
What do I need to change here to make it a regular normal dataframe?
If I try to convert either of them to a dataframe using .to_frame(), I get an error saying that 'DataFrame' object has no attribute 'to_frame'
I'm so confused as to what kind of data type it is.
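Both objects in a call to pd.merge need to be DataFrame objects. Is grouped_data a Series? If so, try promoting it to a DataFrame by passing pd.DataFrame(grouped_data) instead of just grouped_data. Note that the error message names a Series, and grouped_data comes out of reset_index(), so it may already be a DataFrame (which would explain why .to_frame() fails on it). In that case the Series is probably the second argument, df['Pattern'].str[7:11], and that is the object to promote.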
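A minimal sketch of promoting a Series to a DataFrame before merging. The data and the hole_id and holes columns are made up, and joining on the row index is an assumption chosen to keep the example short. The original call merges on 'Calc_DRILLING_Holes', which would need to exist as a column on both sides.

```python
import pandas as pd

# Made-up stand-ins for the two objects in the question.
grouped = pd.DataFrame({"hole_id": [1, 2, 3], "holes": [10, 20, 30]})
pattern = pd.Series(
    ["PATTERN0001", "PATTERN0002", "PATTERN0003"], name="Pattern"
)

# .str[7:11] still returns a Series; to_frame() turns it into a
# one-column DataFrame that keeps the Series name as the column name.
pattern_code = pattern.str[7:11].to_frame()

# Join on the row index here (an assumption for this sketch).
merged = grouped.merge(
    pattern_code, left_index=True, right_index=True, how="left"
)

print(merged.columns.tolist())
```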

Adding an np.array as a column in a pandas.DataFrame

I have a pandas data frame and a numpy nd array with one dimension. Effectively it is a list.
How do I add a new column to the DataFrame with the values from the array?
test['preds'] = preds gives SettingWithCopyWarning
And a warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
And when I try pd.DataFrame({test,preds}) I get TypeError: unhashable type: 'list'
Thanks to EdChum the problem was this
test= DataFrame(test)
test['preds']=preds
It works!
This is not a pandas error. It occurs because you are trying to instantiate a set with two lists.
{test,preds}
#TypeError: unhashable type: 'list'
A set is a container which needs all its content to be hashable, since sets may not contain the same element twice.
That being said, handing pandas a set will not work for your desired result.
Handing pandas a dict however, will work, like this:
pd.DataFrame({"test":test,"preds":preds})
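For the original goal of adding the array as a new column, here is a minimal sketch. It assumes test was produced by slicing another DataFrame, which is a typical source of SettingWithCopyWarning. The frame named full and its columns are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up parent frame; "test" is a slice of it.
full = pd.DataFrame(
    {"x": [1, 2, 3, 4], "split": ["train", "test", "test", "train"]}
)

# Taking an explicit copy makes "test" an independent DataFrame, so
# assigning a new column to it is unambiguous.
test = full[full["split"] == "test"].copy()

preds = np.array([0.2, 0.9])  # one value per row of "test"
test["preds"] = preds

print(test["preds"].tolist())
```

The array length has to match the number of rows in test, otherwise pandas raises a ValueError.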
