dask dataframe drop duplicate index values - python

I am using dask dataframe with python 2.7 and want to drop duplicated index values from my df.
When using pandas I would use
df = df[~df.index.duplicated(keep = "first")]
and it works.
When trying to do the same with a dask dataframe I get
AttributeError: 'Index' object has no attribute 'duplicated'
I could reset the index and then use the column that was the index to drop duplicates, but I would like to avoid that if possible.
I could use df.compute() and then drop the duplicated index values, but this df is too big for memory.
How can I drop the duplicated index values from my dataframe using dask?

I think you need to convert the index to a Series with to_series; keep='first' can be omitted because it is the default value of that parameter in duplicated:
df = df[~df.index.to_series().duplicated()]
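For context, a minimal runnable sketch of the suggested approach (the toy frame and its duplicated index values are made up for illustration; whether Index.to_series and Series.duplicated are available may depend on your dask version):

import pandas as pd
import dask.dataframe as dd

# toy frame with a duplicated index value (10 appears twice)
pdf = pd.DataFrame({'a': [1, 2, 3]}, index=[10, 10, 20])
df = dd.from_pandas(pdf, npartitions=1)

# keep only the first row for each index value
df = df[~df.index.to_series().duplicated()]
print(df.compute())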

Related

Pandas: Select Dataframe Rows if Column Value Exists in another Dataframe

I have a database which has two columns with unique numbers. This is my reference dataframe (df_reference). In another dataframe (df_data) I want to get the rows whose column values exist in this reference dataframe. I tried stuff like:
df_new = df_data[df_data['ID'].isin(df_reference)]
However, this way I don't get any results. What am I doing wrong here?
From what I see, you are passing the whole dataframe to the .isin() method.
Try:
df_new = df_data[df_data['ID'].isin(df_reference['ID'])]
Convert the ID column to the index of the df_data data frame. Then you could do
matching_index = df_reference['ID']
df_new = df_data.loc[matching_index, :]
This should solve the issue.
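A small self-contained example of the isin() fix (the sample IDs and values are made up for illustration):

import pandas as pd

df_reference = pd.DataFrame({'ID': [1, 3, 5]})
df_data = pd.DataFrame({'ID': [1, 2, 3, 4], 'value': ['a', 'b', 'c', 'd']})

# keep only the rows of df_data whose ID appears in df_reference
df_new = df_data[df_data['ID'].isin(df_reference['ID'])]
print(df_new)  # rows with ID 1 and 3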

Count values in a column depending on the value of another column

I have this issue: I have to sum all the values present in a dataframe column based on the value in another column. Specifically, I have a column "App" and a column "n_of_Installs".
What I need is to count all "n_of_Installs" for each App.
I tried this code: dataframe.groupby('App').sum()['n_of_Installs'] but it doesn't work.
The following line will do what you want:
dataframe.groupby('App')['n_of_Installs'].sum() # returns a pandas Series
Note that the above code returns a pandas Series. If you wish to get a pandas DataFrame instead, use the as_index=False option of groupby:
dataframe.groupby('App', as_index=False)['n_of_Installs'].sum() # returns a pandas DataFrame
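A short worked example of both variants (the app names and install counts are made up):

import pandas as pd

dataframe = pd.DataFrame({
    'App': ['A', 'A', 'B'],
    'n_of_Installs': [100, 50, 200],
})

# total installs per app, returned as a Series indexed by App
totals = dataframe.groupby('App')['n_of_Installs'].sum()

# same aggregation, returned as a DataFrame with App as a regular column
totals_df = dataframe.groupby('App', as_index=False)['n_of_Installs'].sum()
print(totals)
print(totals_df)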

How to use an existing column as index in Spark's Dataframe

I am 'translating' Python code to PySpark. I would like to use an existing column as the index for a dataframe. I did this in Python using pandas; the small piece of code below explains what I did. Thanks for helping.
df.set_index('colx',drop=False,inplace=True)
# Sort the index
df.sort_index(inplace=True)
I expect the result to be a dataframe with 'colx' as index.
Add an index to the PySpark dataframe as a column and use it:
rdd_df = df.rdd.zipWithIndex()
df_index = rdd_df.toDF()
# and extract the original columns from the Row stored in '_1'
df_index = df_index.withColumn('colA', df_index['_1'].getItem('colA'))
df_index = df_index.withColumn('colB', df_index['_1'].getItem('colB'))
# the generated row number lives in '_2'
df_index = df_index.withColumn('index', df_index['_2'])
This is not how it works with Spark: there is no concept of a row index.
One can add a column with zipWithIndex by converting the DataFrame to an RDD and back, as above, but that is a new column, so it is not the same thing.
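If the goal is only to reproduce the effect of set_index('colx') followed by sort_index(), one hedged sketch is simply to sort by that column (the sample rows and the column name 'other' are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 'colx' plays the role of the pandas index column
df = spark.createDataFrame([(2, 'b'), (1, 'a')], ['colx', 'other'])

# closest equivalent of df.sort_index() after set_index('colx', drop=False)
df_sorted = df.orderBy('colx')
df_sorted.show()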

Dask: subset (or drop) rows from Dataframe by index

I'd like to take a subset of rows of a Dask dataframe based on a set of index keys. (Specifically, I want to find rows of ddf1 whose index is not in the index of ddf2.)
Both cache.drop([overlap_list]) and diff = cache[should_keep_bool_array] either throw a NotImplementedError or otherwise don't work.
What is the best way to do this?
I'm not sure this is the "best" way, but here's how I ended up doing it:
Create a pandas DataFrame whose index is the series of index keys I want to keep (e.g., pd.DataFrame(index=overlap_list))
Inner join the Dask DataFrame to it, as sketched below.
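A minimal sketch of that join approach, assuming a small list of index keys to keep (the toy frame and keep_keys are made up for illustration):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': range(5)}, index=[0, 1, 2, 3, 4])
ddf1 = dd.from_pandas(pdf, npartitions=2)

keep_keys = [1, 3]  # index keys that should survive the subset
keep_df = pd.DataFrame(index=keep_keys)

# an inner join on the index drops every row whose key is not in keep_df
subset = ddf1.join(keep_df, how='inner')
print(subset.compute())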
Another possibility is:
df_index = df.reset_index()
df_index = df_index.drop_duplicates()
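A sketch of that reset-then-dedupe route; note that the subset= argument and the final set_index are additions here, to restrict deduplication to the former index column and to restore it as the index (the toy data is made up):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': [1, 2, 3]}, index=pd.Index([0, 0, 1], name='idx'))
ddf = dd.from_pandas(pdf, npartitions=1)

# reset the index, drop duplicates on the former index column, restore it
deduped = (
    ddf.reset_index()
       .drop_duplicates(subset='idx')
       .set_index('idx')
)
print(deduped.compute())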

Add pandas Series to a DataFrame, preserving index

I have been having some problems adding the contents of a pandas Series to a pandas DataFrame. I start with an empty DataFrame, initialised with several columns (corresponding to consecutive dates).
I would like to then sequentially fill the DataFrame using different pandas Series, each one corresponding to a different date. However, each Series has a (potentially) different index.
I would like the resulting DataFrame to have an index that is essentially the union of each of the Series indices.
I have been doing this so far:
for date in dates:
    df[date] = series_for_date
However, my df index corresponds to that of the first Series and so any data in successive Series that correspond to an index 'key' not in the first Series are lost.
Any help would be much appreciated!
Ben
If I understand correctly, you can use concat:
pd.concat([series1, series2, series3], axis=1)
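A small example showing how concat preserves the union of the indices (the three Series and their index labels are made up):

import pandas as pd

series1 = pd.Series([1, 2], index=['a', 'b'])
series2 = pd.Series([3, 4], index=['b', 'c'])
series3 = pd.Series([5, 6], index=['a', 'c'])

# axis=1 puts each Series in its own column; the resulting index is the
# union of all three indices, with NaN where a Series has no value
df = pd.concat([series1, series2, series3], axis=1)
print(df)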
