Issue replacing null values in a PySpark dataframe - python

I am having an issue replacing null values with 0 in a PySpark dataframe.
Let df1 and df2 be two dataframes. After a join on col1, I get a dataframe df that contains two columns with the same name (possibly with different values), inherited from df1 and df2; call them df1.dup_col and df2.dup_col. I have null values in each of them, and I want to replace them with 0 in df1.dup_col.
So, first I drop the df2.dup_col column, then I call
df.fillna({"df1.dup_col":'0'})
but I still get the null values. So I tried,
df.select("df1.dup_col").na.fill(0)
with the same result. So I tried
df = df.withColumn("df1.dup_col",
                   when(df["df1.dup_col"].isNull(), 0).otherwise(df["df1.dup_col"]))
with no better result.
Am I missing something?

You should do something like this (note that after dropping df2's copy, the remaining column is addressed as plain dup_col, not df1.dup_col):
df = df.fillna("0", subset = ["dup_col"]) # This is the string 0
df = df.fillna(0, subset = ["dup_col"]) # This is the number 0

df = df.fillna({'colName':'value_to_replace'})
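Putting the pieces together, a minimal runnable sketch of the drop-then-fill flow (the dataframe contents here are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, None), (2, 5)], ["col1", "dup_col"])
df2 = spark.createDataFrame([(1, 7), (2, None)], ["col1", "dup_col"])

# Join on col1; both inputs contribute a dup_col column
df = df1.join(df2, on="col1")

# Drop df2's copy so only one dup_col remains, then fill its nulls with 0
df = df.drop(df2["dup_col"])
df = df.fillna(0, subset=["dup_col"])
df.show()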

Related

Pandas: Sort Dataframe if Column Value Exists in another Dataframe

I have a dataframe which has two columns with unique numbers. This is my reference dataframe (df_reference). In another dataframe (df_data) I want to get the rows whose column values exist in this reference dataframe. I tried things like:
df_new = df_data[df_data['ID'].isin(df_reference)]
However, like this I don't get any results. What am I doing wrong here?
From what I see, you are passing the whole dataframe to the .isin() method.
Try:
df_new = df_data[df_data['ID'].isin(df_reference['ID'])]
Alternatively, convert the ID column to the index of the df_data dataframe. Then you could do:
# assumes df_data = df_data.set_index('ID') has been done first
matching_index = df_reference['ID']
df_new = df_data.loc[matching_index, :]
This should solve the issue.
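A quick self-contained demonstration of the .isin() approach (toy values, for illustration only):
import pandas as pd

df_reference = pd.DataFrame({'ID': [1, 3]})
df_data = pd.DataFrame({'ID': [1, 2, 3, 4], 'val': ['a', 'b', 'c', 'd']})

# Keep only the rows of df_data whose ID appears in df_reference
df_new = df_data[df_data['ID'].isin(df_reference['ID'])]
print(df_new)  # rows with ID 1 and 3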

How to check if two pandas dataframes have same values and concatenate those rows?

I have a DF called "df" with 4 numerical columns [frame, id, x, y].
I made a loop that creates two dataframes called df1 and df2. Both df1 and df2 are subsets of the original dataframe.
What I want to do (and am not understanding how to do) is this: I want to CHECK whether df1 and df2 have the same VALUES in the column called "id". If they do, I want to concatenate those rows of df2 (the ones with the same id values) to df1.
For example: if df1 has rows with id values (1,6,4,8) and df2 has id values (12,7,8,10), I want to concatenate the df2 rows that have id value 8 to df1. That is all I need.
This is my code:
for i in range(0, max(df['frame']), 30):
    df1 = df[df['frame'].between(i, i+30)]
    df2 = df[df['frame'].between(i-30, i)]
There are several ways to accomplish what you need.
The simplest one is to get the slice of df2 that contains the values you need with .isin() and concatenate it with df1 in one line.
df3 = pd.concat([df1, df2[df2.id.isin(df1.id)]], axis = 0)
To gain more control and avoid any errors that might stem from df1 and df2 being updated elsewhere, you may want to take this one-liner apart.
look_for_vals = set(df1['id'].tolist())
# do some stuff
need_ix = df2[df2["id"].isin(look_for_vals)].index
# do more stuff
df3 = pd.concat([df1, df2.loc[need_ix,:]], axis=0)
Instead of set() you may also use df1['id'].unique()
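For example, with toy frames built from the id values in the question (the x column is made up):
import pandas as pd

df1 = pd.DataFrame({'id': [1, 6, 4, 8], 'x': [0, 1, 2, 3]})
df2 = pd.DataFrame({'id': [12, 7, 8, 10], 'x': [4, 5, 6, 7]})

# Append the df2 rows whose id also occurs in df1 (here only id == 8)
df3 = pd.concat([df1, df2[df2.id.isin(df1.id)]], axis=0)
print(df3)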

How to merge 2 dataframe rows in a new dataframe row with pandas?

I have 2 variables (dataframes); one is 47 columns wide and the other is 87. They are df1 and df2.
Then I have a variable (dataframe) called full_data. df1 and df2 are two different subsets of data that I want to merge together once I find that 2 rows are equal.
I am doing everything I want so far except appending the right values to the new dataframe.
Below is the line of code I have been playing around with:
full_data = full_data.append(pd.concat([df1[i:i+1].copy(), df2[j:j+1].copy()], axis=1), ignore_index=True)
Once I find that the rows in both df1 and df2 are equal, I am trying to read both those rows and put them one after the other as a single row in the variable full_data. What is happening right now is that the line of code is writing 2 rows instead of one as I want.
What I want is full_data.append(df1 df2), and right now I am getting:
full_data(i) = df1
full_data(i+1) = df2
Any help would be appreciated.
EM
In the end I solved my problem. Probably I was not clear enough in my question, but what was happening when concatenating is that I was getting duplicated or multiple rows when the expected result was a single-row concatenation.
The issue was found to be with the indexing: the indexes had to be reset because pandas aligns rows on the index when concatenating.
I found an example and explanation here
My solution here:
df3 = df2[j:j+1].copy()
df4 = df1[i:i+1].copy()
full_data = full_data.append(
    pd.concat([df4.reset_index(drop=True), df3.reset_index(drop=True)], axis=1),
    ignore_index=True
)
I first created a copy of my variables and then reset the indexes.
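As a side note, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same single-row concatenation can be written with pd.concat alone (reusing df3, df4 and full_data from above):
row = pd.concat([df4.reset_index(drop=True), df3.reset_index(drop=True)], axis=1)
full_data = pd.concat([full_data, row], ignore_index=True)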

Pandas: column1 values, column2 names. Is there a way to group and rearrange the data so that column2 becomes the row header?

Data is a unique value; id is repeated multiple times in an Excel file. Data is column 1 and the ids are column 2. I would like to group the unique data values by id without losing any. Then set the column header as the first id and paste the data values associated with it below. Then do the same with the second id, pasting that id's values below it, one column over from the first id's column. Could anyone help me rearrange the data into such a layout?
You can't have varying length columns in a dataframe. So NaNs are unavoidable.
import pandas as pd

df = pd.DataFrame({'col1': [2, 3, 3, 4, 2, 1, 3, 4], 'col2': [1, 1, 1, 1, 2, 2, 2, 3]})

# First approach: pivot, then compact each column by dropping the NaNs
df2 = df.pivot(columns='col2')["col1"]
df2 = df2.apply(lambda x: pd.Series(x.dropna().values))
print(df2)

# Second approach: group by id and expand the collected lists into columns
def concat(s):
    return s.tolist()

df3 = df.groupby('col2').agg(concat)["col1"].apply(pd.Series)
print(df3)
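For the toy frame above, df2 from the first approach should come out roughly like this (ids 1, 2, 3 as column headers, shorter columns padded with NaN):
col2    1    2    3
0     2.0  2.0  4.0
1     3.0  1.0  NaN
2     3.0  3.0  NaN
3     4.0  NaN  NaN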

Select pandas dataframe column that has NaN or NULL values and fill it with 0's

I have a dataframe that has some missing data (i.e. NaN or NULL values) in a few different columns. How do I write a function that identifies the columns with missing data and fills them with 0's?
I currently have this for inputting specific columns where I already know there is missing data; however, I'm trying to come up with a function that finds columns with missing data on its own.
def fill_blanks(dataframe, column):
    dataframe[column] = dataframe[column].fillna(0)
You can just use .fillna():
df = df.fillna(0)
or
df.fillna(0, inplace=True)
You can use fillna(0) on entire dataframe:
dataframe = dataframe.fillna(0)
or:
dataframe.fillna(0, inplace=True)
Is this what you are trying to do?
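If you specifically want a function that first identifies the columns containing missing data, as the question asks, a small sketch along these lines should work (the function name is just illustrative):
import pandas as pd

def fill_blanks(dataframe):
    # Find the columns that contain at least one NaN/NULL value
    cols_with_missing = dataframe.columns[dataframe.isna().any()]
    # Fill only those columns with 0, leaving the rest untouched
    dataframe[cols_with_missing] = dataframe[cols_with_missing].fillna(0)
    return dataframe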
