Adding an np.array as a column in a pandas.DataFrame - python

I have a pandas DataFrame and a one-dimensional NumPy array (effectively a list).
How do I add a new column to the DataFrame containing the values from the array?
test['preds'] = preds raises a SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
And when I try pd.DataFrame({test, preds}) I get TypeError: unhashable type: 'list'.

Thanks to EdChum, the fix was to make sure test is a real DataFrame first:
test = DataFrame(test)
test['preds'] = preds
It works!

This is not a pandas error: it occurs because you are trying to instantiate a set from two lists.
{test, preds}
#TypeError: unhashable type: 'list'
A set is a container whose elements must all be hashable, since a set may not contain the same element twice.
That said, handing pandas a set will not produce the result you want.
Handing pandas a dict, however, will work, like this:
pd.DataFrame({"test":test,"preds":preds})
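A minimal sketch of both working approaches, using hypothetical stand-ins for the asker's test frame and preds array:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the asker's "test" frame and "preds" array
test = pd.DataFrame({"feature": [1.0, 2.0, 3.0]})
preds = np.array([0.1, 0.9, 0.4])

# Direct assignment works when "test" is a real DataFrame, not a slice/view
test["preds"] = preds

# Building a fresh frame from a dict also works (a set does not)
combined = pd.DataFrame({"feature": test["feature"], "preds": preds})

print(test["preds"].tolist())   # [0.1, 0.9, 0.4]
print(list(combined.columns))   # ['feature', 'preds']
```

If the warning persists, the frame was likely a slice of another DataFrame; calling .copy() on it first gives you an independent object to assign into.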

Related

DataFrame with DateTimeIndex from Django model

My goal is to create a pandas DataFrame with a DatetimeIndex from a Django model. I am using the django-pandas package for this, specifically its to_timeseries() method.
First, I called .values() on my queryset. This still returns a queryset, but one containing dictionaries. I then used to_timeseries() to create my DataFrame. Everything here worked as expected: the pivot, the values, etc. But my index is just a list of strings, and I don't know why.
I have found many manipulations in the pandas documentation, including how to turn a column or Series into datetime objects. However, my index is not a Series but an Index, and none of those methods work. How do I make this happen? Thank you.
df = mclv.to_timeseries(index='day',
                        pivot_columns='medicine',
                        values='takentoday',
                        storage='wide')

df = df['day'].astype(Timestamp)
# TypeError: dtype '<class 'pandas._libs.tslibs.timestamps.Timestamp'>' not understood
# AttributeError: 'DataFrame' object has no attribute 'DateTimeIndex'

df = pd.DatetimeIndex(df, inplace=True)
# TypeError: __new__() got an unexpected keyword argument 'inplace'
# TypeError: Cannot cast DatetimeIndex to dtype

etc...
Correction & update: django-pandas did work as its authors intended. The problem was my misunderstanding of what it was doing, and how.
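For anyone landing here with a string index: pd.to_datetime accepts an Index and returns a DatetimeIndex, so reassigning df.index is the usual fix. A minimal sketch with hypothetical data:

```python
import pandas as pd

# A frame whose index arrived as strings, as in the question
df = pd.DataFrame({"takentoday": [2, 3]}, index=["2021-01-01", "2021-01-02"])

# pd.to_datetime accepts an Index and returns a DatetimeIndex;
# reassign it directly (there is no inplace option here)
df.index = pd.to_datetime(df.index)

print(type(df.index).__name__)  # DatetimeIndex
```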

Why does dask throw an error when setting a String column as an index?

I'm reading a large CSV with dask, setting the column's dtype to string, and then setting that column as the index:
dataframe = dd.read_csv(file_path, dtype={"colName": "string"}, blocksize=100e6)
dataframe.set_index("colName")
and it throws the following error:
TypeError: Cannot interpret 'StringDtype' as a data type
Why does this happen? How can I solve it?
As stated in this comment on an unrelated dask issue (https://github.com/dask/dask/issues/7206#issuecomment-797221227):
When constructing the dask Array's meta object, we're currently assuming the underlying array type is a NumPy array, when in this case, it's actually going to be a pandas StringArray. But unlike pandas, NumPy doesn't know how to handle a StringDtype.
Currently, changing the column's dtype from string to object works around the issue, though it's unclear whether this is a bug or expected behavior:
dataframe = dd.read_csv(file_path, dtype={"colName": "object"}, blocksize=100e6)
dataframe.set_index("colName")
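A minimal illustration of the underlying mismatch, outside dask: "string" is a pandas extension dtype (backed by a StringArray) that NumPy's np.dtype cannot interpret, while "object" is a plain NumPy dtype:

```python
import numpy as np
import pandas as pd

# "string" is a pandas extension dtype; "object" is a native NumPy dtype
s = pd.Series(["a", "b"], dtype="string")
o = pd.Series(["a", "b"], dtype="object")

# NumPy cannot interpret the pandas extension dtype, which is the
# same TypeError the question hits inside dask's meta machinery
caught = False
try:
    np.dtype(s.dtype)
except TypeError:
    caught = True

print(s.dtype, o.dtype, caught)
```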

Getting an error when converting to float to get top 10 largest values

I am trying to use the nlargest function to return the top 10 values with the code below:
df['roi'].astype(float).nlargest(3, 'roi')
but I get this error:
ValueError: keep must be either "first", "last" or "all"
The roi column is an object dtype, which is why I cast it with astype(float), but I still get an error.
When I pass keep='all', keep='first', or keep='last' to nlargest, I get TypeError: nlargest() got multiple values for argument 'keep'.
Thanks!
To use the method the way you want, change your code to:
df.astype(float).nlargest(3, 'roi')
since that two-argument syntax only exists for pandas.DataFrame. If you select the column by its key, as with a dictionary, you are working with a pandas.Series, and the correct syntax is
df['roi'].astype(float).nlargest(3)
(In your original call, the Series method binds 'roi' to its keep parameter, which is why keep complains.) The docs for both methods are here, for DataFrames, and here, for Series.
For a one-liner, you'll need to convert "roi" to a float type first and then call nlargest:
Passing a dictionary to .astype returns the entire DataFrame with only the specified columns' dtypes changed, so we can then call .nlargest on that returned DataFrame (instead of on a Series).
df.astype({"roi": float}).nlargest(3, columns="roi")
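A quick sketch of both forms side by side, using a hypothetical frame where "roi" arrives as strings:

```python
import pandas as pd

# Hypothetical frame where "roi" arrives as strings (object dtype)
df = pd.DataFrame({"roi": ["1.5", "3.2", "0.7", "2.8"]})

# Series form: no column argument
top_series = df["roi"].astype(float).nlargest(3)

# DataFrame form: cast just the one column, then pass the column name
top_frame = df.astype({"roi": float}).nlargest(3, columns="roi")

print(top_series.tolist())        # [3.2, 2.8, 1.5]
print(top_frame["roi"].tolist())  # [3.2, 2.8, 1.5]
```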

using .columns in Pandas

Hi, I am using the .columns attribute in pandas and the output starts with Index. Can someone please explain why Index appears at the beginning?
This was answered before in "Index objects in pandas — why .columns returns an Index rather than a list"; see also the official documentation for Index:
Immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all pandas objects
If you just want the values, you have to go one step further: .columns returns an Index, .columns.values returns an array, and the helper .tolist() returns a list of column names.
car_sales.columns.values.tolist()
You can also use car_sales.columns.tolist(), though it may be slightly slower on very large DataFrames.
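A short sketch (car_sales here is a hypothetical two-column frame matching the question's variable name):

```python
import pandas as pd

car_sales = pd.DataFrame({"make": ["Toyota"], "price": [20000]})

# .columns is an Index object, which is why the repr starts with "Index("
cols = car_sales.columns
print(type(cols).__name__)          # Index

# Converting to a plain Python list
print(car_sales.columns.tolist())   # ['make', 'price']
```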

Equivalent of R's rbindlist function in Python

First I create empty lists based on the length of num_vars and store the output of each loop iteration in one list.
After that I want to combine all the outputs and convert the result into a pandas DataFrame.
In R we can simply use rbindlist to combine list objects.
For this I used the following Python code:
ests_list = [[] for i in range(num_vars)]
for i in range(num_vars):
    for j in range(1, num_vars + 1):
        ests_list[i] = pd.merge(df1,
                                df2,
                                how='left',
                                on=eval('combine%s' % j + '_lvl'))
pd.concat(ests_list)
When I tried the above code, it threw the following error:
TypeError: cannot concatenate object of type "<class 'list'>"; only
pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
Can anyone help me solve this issue?
Thanks in advance.
I found a solution to my problem:
ests_list = []
for i in list(range(1, num_vars)):
    ests_list.append(df1.merge(df2, how='left', on=eval("combine%s" % i + "_lvl")))
pd.concat(ests_list)
I create an empty list and append each loop iteration's output to it.
Then I combine all the DataFrames with pd.concat, which gives me the output as a pandas DataFrame.
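The append-then-concat pattern can be sketched self-contained, with small hypothetical frames standing in for the merge results:

```python
import pandas as pd

# rbindlist-style pattern: collect DataFrames in a list, then stack them.
# The frames below are hypothetical stand-ins for the loop's merge outputs.
pieces = []
for i in range(3):
    pieces.append(pd.DataFrame({"var": [i], "est": [i * 0.5]}))

# pd.concat requires a list of DataFrames/Series, not a list of lists,
# which is exactly what the original code passed it
result = pd.concat(pieces, ignore_index=True)
print(result.shape)  # (3, 2)
```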
