How to find the number of null elements in a pandas DataFrame - python

I want a way to find the total number of null elements in a DataFrame that gives just one number, not another Series or anything like that.

You can simply get all null values from the DataFrame and count them per column:
df.isnull().sum()
Or you can use an individual column as well:
df['col_name'].isnull().sum()
Note that the first form returns a Series of per-column counts; chain another .sum() to reduce it to a single number, as in the sketch below.
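A minimal sketch of the difference, using a small made-up frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, np.nan], 'b': [np.nan, 2]})
print(df.isnull().sum())        # per-column counts: a -> 1, b -> 1
print(df.isnull().sum().sum())  # single grand total: 2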

You could use pd.isnull() and sum over the underlying array:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 1, 1], [2, 2, np.nan], [3, np.nan, np.nan]])
pd.isnull(df).values.sum()
which gives: 3

This code chunk will help:
# df is your dataframe
print(df.isnull().sum())

Related

Pandas Dataframe: How to drop_duplicates() based on index subset?

Wondering if someone could please help me on this:
Have a pandas df with a rather large number of columns (over 50). I'd like to remove duplicates based on a subset (columns 2 to 50).
I've been trying df.drop_duplicates(subset=["col1","col2",....]), but I'm wondering if there is a way to pass the column indices instead, so I don't have to write out all the column headers to consider for the drop and can instead do something along the lines of df.drop_duplicates(subset = [2:])
Thanks upfront
You can slice df.columns like:
df.drop_duplicates(subset = df.columns[2:])
Or:
df.drop_duplicates(subset = df.columns[2:].tolist())
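A quick sketch of the idea on a made-up frame (the column names here are just for illustration):
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3],
                   'ts': ['t1', 't2', 't3'],
                   'a': [10, 10, 20],
                   'b': ['x', 'x', 'y']})
# deduplicate on every column from index 2 onwards ('a' and 'b' here)
deduped = df.drop_duplicates(subset=df.columns[2:])
# rows 0 and 1 share ('a', 'b') == (10, 'x'), so row 1 is dropped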

Splitting values from column into several columns

I have a pandas dataframe where I want the digits in column C to be added together to create a new column D.
For example
Thanks in advance.
Use Series.str.extractall to get each digit separately, convert to integers, and then sum per first level of the MultiIndex (i.e. per original row):
# extractall returns one row per matched digit with a MultiIndex; group by level 0 to sum per original row
df['D'] = df['C'].str.extractall(r'(\d)')[0].astype(int).groupby(level=0).sum()
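A minimal sketch on made-up data (the column name C and the sample values are assumptions):
import pandas as pd
df = pd.DataFrame({'C': ['1-2-3', '4+5', '7']})
df['D'] = df['C'].str.extractall(r'(\d)')[0].astype(int).groupby(level=0).sum()
# D becomes 6, 9 and 7 respectively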

Issue replacing null values in a pyspark dataframe

I am having an issue replacing null values with 0 in a PySpark dataframe.
Let df1 and df2 be two dataframes. After a join on col1, I get a dataframe df which contains two columns with the same name (possibly holding different values) inherited from df1 and df2, say df1.dup_col and df2.dup_col. Both contain null values, and I want to replace them with 0 in df1.dup_col.
So, first I drop the df2.dup_col column, then I call
df.fillna({"df1.dup_col":'0'})
but I still get the null values. So I tried
df.select("df1.dup_col").na.fill(0)
with the same result. So I tried
df = df.withColumn("df1.dup_col",
                   when(df["df1.dup_col"].isNull(), 0).otherwise(df["df1.dup_col"]))
with no better result.
Am I missing something?
You should do something like:
df = df.fillna("0", subset=["dup_col"])  # fills string-typed columns with the string "0"
df = df.fillna(0, subset=["dup_col"])    # fills numeric columns with the number 0
df = df.fillna({'colName': 'value_to_replace'})
Refer to the column by its plain name (dup_col): fillna matches on column names, and it skips columns whose type does not match the type of the fill value, so filling a numeric column with the string '0' appears to do nothing.
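A minimal, self-contained sketch of the fix on a toy dataframe (the column names and values are made up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# pretend this is the joined dataframe after dropping df2's copy of dup_col
df = spark.createDataFrame([(1, None), (2, 5)], ["col1", "dup_col"])
df = df.fillna(0, subset=["dup_col"])  # numeric column, so fill with the number 0
df.show()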

How to find if a value exists in all rows of a dataframe?

I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in every row of the dataframe.
P.S. - I am new to Python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive; is there a better way to do the same?
Thanks in Advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[ df["Column1"] == "ValueToFind" ]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can chain several filters together with AND/OR logical operators.
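A short sketch of chaining filters, assuming made-up column names; note the parentheses and the & / | operators pandas uses for AND / OR:
import pandas as pd
df = pd.DataFrame({'Column1': ['a', 'b', 'a'], 'Column2': [1, 2, 3]})
both = df[(df['Column1'] == 'a') & (df['Column2'] > 1)]    # AND
either = df[(df['Column1'] == 'b') | (df['Column2'] > 2)]  # OR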
You can try
for i in uniqueArray:
    # assumes 'MKT' holds strings; .all() is True only when i appears in every row
    if newDF['MKT'].str.contains(i, regex=False).all():
        # do your task
You can use isin() method of pd.Series object.
Assuming you have a data frame named df, you can check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT is contained in uniqueArray.
Now do your things on new_df, and join/merge/concat to the former df as you wish.
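For the "in every row" check specifically, here is a vectorized sketch (assuming MKT holds strings and uniqueArray holds substrings to look for):
import pandas as pd
newDF = pd.DataFrame({'MKT': ['US-EU', 'US-ASIA', 'US-EU-ASIA']})
uniqueArray = ['US', 'EU']
in_all_rows = {i: newDF['MKT'].str.contains(i, regex=False).all() for i in uniqueArray}
# {'US': True, 'EU': False}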

Total zero count across all columns in a pyspark dataframe

I need to find the percentage of zero across all columns in a pyspark dataframe. How to find the count of zero across each columns in the dataframe?
P.S: I have tried converting the dataframe into a pandas dataframe and used value_counts, but inspecting its output is not practical for a large dataset.
"How to find the count of zero across each columns in the dataframe?"
First:
import pyspark.sql.functions as F
df_zero = df.select([F.count(F.when(df[c] == 0, c)).alias(c) for c in df.columns])
Second, you can then inspect the counts (compared to .show(), this gives a better view, and the speed is not much different):
df_zero.limit(2).toPandas().head()
Enjoy! :)
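Since the question asks for percentages, one possible extension is to divide each zero count by the total number of rows (a sketch; total and df_zero_pct are names introduced here):
import pyspark.sql.functions as F
total = df.count()
df_zero_pct = df.select([
    (F.count(F.when(df[c] == 0, c)) / F.lit(total) * 100).alias(c)
    for c in df.columns
])
df_zero_pct.limit(1).toPandas().head()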
Use this code to find the number of zeros in a single column of a table.
Just replace Tablename and "column name" with the appropriate values:
from pyspark.sql.functions import col
Tablename.filter(col("column name") == 0).count()
