I am 'translating' Python code to PySpark. I would like to use an existing column as the index of a DataFrame, which I did in Python with pandas. The small piece of code below shows what I did. Thanks for helping.
df.set_index('colx',drop=False,inplace=True)
# Sort the index
df.sort_index(inplace=True)
I expect the result to be a dataframe with 'colx' as index.
Add an index to the PySpark DataFrame as a column and use it:
rdd_df = df.rdd.zipWithIndex()
df_index = rdd_df.toDF()
# and extract the columns
df_index = df_index.withColumn('colA', df_index['_1'].getItem('colA'))
df_index = df_index.withColumn('colB', df_index['_1'].getItem('colB'))
df_index = df_index.withColumnRenamed('_2', 'index')  # the index produced by zipWithIndex
That is not how it works with Spark: no such index concept exists. You can add an index column with zipWithIndex by converting the DataFrame to an RDD and back, but that is a new column, not the same thing as a pandas index.
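If the goal of the original pandas snippet is just to have the data ordered by colx, the closest Spark equivalent is usually to keep colx as an ordinary column and sort by it. A minimal sketch, assuming df is a Spark DataFrame that already has a colx column:
from pyspark.sql import functions as F

# Spark has no row index; the set_index + sort_index pattern becomes a plain sort.
df_sorted = df.orderBy('colx')

# If a unique row id is still needed, this adds one (monotonically increasing,
# but not guaranteed to be consecutive).
df_with_id = df_sorted.withColumn('row_id', F.monotonically_increasing_id())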
Related
I have a database which has two columns with unique numbers. This is my reference dataframe (df_reference). In another dataframe (df_data) I want to get the rows whose column values exist in this reference dataframe. I tried stuff like:
df_new = df_data[df_data['ID'].isin(df_reference)]
However, like this I can't get any results. What am I doing wrong here?
From what I see, you are passing the whole dataframe to the .isin() method.
Try:
df_new = df_data[df_data['ID'].isin(df_reference['ID'])]
Convert the ID column into the index of the df_data data frame. Then you could do
df_data = df_data.set_index('ID')
matching_index = df_reference['ID']
df_new = df_data.loc[matching_index, :]  # assumes every ID in df_reference exists in df_data
This should solve the issue.
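A small self-contained check of the .isin() version, with made-up data (the IDs and values below are only illustrative):
import pandas as pd

df_reference = pd.DataFrame({'ID': [1, 2, 3]})
df_data = pd.DataFrame({'ID': [2, 3, 4, 5], 'value': ['a', 'b', 'c', 'd']})

df_new = df_data[df_data['ID'].isin(df_reference['ID'])]
# df_new now holds only the rows with ID 2 and 3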
I am trying to create a dataframe where the column lengths are not equal. How can I do this?
I was trying to use groupby, but I think this is not the right way.
import pandas as pd
data = {'filename':['file1','file1'], 'variables':['a','b']}
df = pd.DataFrame(data)
grouped = df.groupby('filename')
print(grouped.get_group('file1'))
Above is my sample code, the output of which is:
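  filename variables
0    file1         a
1    file1         b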
What can I do to just have one entry of 'file1' under 'filename'?
Eventually I need to write this to a csv file.
Thank you
If you only have one entry in a column, the other cells will be NaN, so you could just filter out the NaNs by doing something like df = df[df["filename"].notnull()]
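If the goal is one row per filename in the CSV, a different common pattern (not what the answer above does, just one option) is to aggregate the variables per filename before writing; a sketch using the sample data, with the output file name being a placeholder:
import pandas as pd

data = {'filename': ['file1', 'file1'], 'variables': ['a', 'b']}
df = pd.DataFrame(data)

# One row per filename, with all of its variables joined into a single cell.
out = df.groupby('filename')['variables'].apply(','.join).reset_index()
out.to_csv('variables_by_file.csv', index=False)
# -> 'file1' appears once under filename; the variables column holds 'a,b'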
I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in all the rows of the dataframe.
P.S. I am new to Python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive; is there a better way to do the same?
Thanks in Advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[ df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can combine several filters with the logical AND (&) and OR (|) operators.
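For example, assuming two hypothetical columns Column1 and Column2, several conditions are combined with & and |, each wrapped in parentheses:
df2 = df[(df["Column1"] == "ValueToFind") & (df["Column2"] > 0)]
df3 = df[(df["Column1"] == "ValueToFind") | (df["Column2"] == "OtherValue")]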
You can try
for i in uniqueArray:
    if newDF['MKT'].str.contains(i, regex=False).any():
        # do your task
You can use isin() method of pd.Series object.
Assuming you have a data frame named df, you want to check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows whose MKT value is contained in uniqueArray.
Now do your things on new_df, and join/merge/concat to the former df as you wish.
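A vectorized alternative to the nested loops, assuming (as one reading of the question) that the goal is to know whether every element of uniqueArray appears somewhere in the MKT column; uniqueArray and newDF are the names from the question:
import pandas as pd

present = pd.Series(uniqueArray).isin(newDF['MKT'])
all_present = present.all()                 # True if every element occurs in MKT
missing = pd.Series(uniqueArray)[~present]  # the elements that never appear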
I'd like to take a subset of rows of a Dask dataframe based on a set of index keys. (Specifically, I want to find rows of ddf1 whose index is not in the index of ddf2.)
Both cache.drop([overlap_list]) and diff = cache[should_keep_bool_array] either throw a NotImplementedError or otherwise don't work.
What is the best way to do this?
I'm not sure this is the "best" way, but here's how I ended up doing it:
Create a pandas DataFrame whose index is the series of index keys I want to keep (e.g., pd.DataFrame(index=overlap_list))
Inner join the Dask DataFrame with it
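A minimal sketch of those two steps, assuming ddf1 is the existing Dask DataFrame and keep_keys holds the index keys to keep (both names are placeholders):
import pandas as pd

# A pandas frame whose only content is the index keys to keep; an inner join
# with it keeps exactly the rows of ddf1 whose index is in keep_keys.
keep_df = pd.DataFrame(index=keep_keys)
subset = ddf1.join(keep_df, how='inner')
result = subset.compute()  # materialize when needed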
Another possibility is:
df_index = df.reset_index()
df_index = df_index.drop_duplicates()
I have a pandas DataFrame that has a number of columns (about 20) containing string objects. I'm looking for a simple method to add all of the columns together into one new column, but have so far been unsuccessful, e.g.:
for i in df.columns:
    df['newcolumn'] = df['newcolumn'] + '/' + df.ix[:, i]
This results in an empty DataFrame column 'newcolumn' instead of the concatenated column.
I'm new to pandas, so any help would be much appreciated.
df['newcolumn'] = df.apply(''.join, axis=1)
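If the '/' separator from the question is wanted, the same idea works with a non-empty separator (casting to str first in case some columns are not strings):
df['newcolumn'] = df.astype(str).apply('/'.join, axis=1)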