Total zero count across all columns in a pyspark dataframe - python

I need to find the percentage of zeros across all columns in a PySpark DataFrame. How can I find the count of zeros in each column of the dataframe?
P.S.: I have tried converting the dataframe into a pandas dataframe and using value_counts, but interpreting its output is not feasible for a large dataset.

"How to find the count of zero across each columns in the dataframe?"
First:
import pyspark.sql.functions as F
df_zero = df.select([F.count(F.when(df[c] == 0, c)).alias(c) for c in df.columns])
Second: you can then inspect the counts (compared to .show(), this gives you a better view, and the speed is not much different):
df_zero.limit(2).toPandas().head()
Enjoy! :)
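Since the question actually asks for the percentage of zeros, you can divide the per-column zero counts by the total row count. A minimal sketch, assuming df is the PySpark DataFrame from the question:
import pyspark.sql.functions as F
# total number of rows, used as the denominator
total = df.count()
# percentage of zeros per column
df_zero_pct = df.select([
    (F.count(F.when(df[c] == 0, c)) / total * 100).alias(c)
    for c in df.columns
])
df_zero_pct.show()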

Use this code to find the number of zeros in a single column of a table.
Just replace Tablename and "column name" with the appropriate values (col comes from pyspark.sql.functions):
from pyspark.sql.functions import col
Tablename.filter(col("column name") == 0).count()

Related

How to filter a Pandas dataframe to keep entire rows/columns if a criterion is fulfilled?

I am learning Python Pandas and I am having some trouble with data filtering. I have gone through multiple examples and I cannot seem to find an approach that fits my particular need:
In a dataframe with numerical values, I would like to filter rows and columns by the following criterion:
"If ANY value in a row is above a threshold, include the WHOLE row (including values that are below the threshold). Else, discard the row."
This should apply to all rows. Subsequently, I would repeat for columns.
Any help is highly appreciated.
Use:
value = 123
df[df.gt(value).any(axis=1)]
For columns, this would be:
value = 123
df.loc[:, df.gt(value).any(axis=0)]
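A small self-contained demonstration with made-up data (the dataframe and threshold below are hypothetical):
import pandas as pd
df = pd.DataFrame({'a': [1, 200, 3], 'b': [4, 5, 6]})
value = 100
# keep rows where ANY value exceeds the threshold
print(df[df.gt(value).any(axis=1)])         # only the row containing 200
# keep columns where ANY value exceeds the threshold
print(df.loc[:, df.gt(value).any(axis=0)])  # only column 'a'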

How to find the number of null elements in a pandas DataFrame

I want a way to find the number of null elements in a DataFrame that gives just one number, not another Series or anything like that.
You can simply get all null values from the dataframe and count them per column:
df.isnull().sum()
Or you can use individual column as well:
df['col_name'].isnull().sum()
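Note that df.isnull().sum() returns a per-column Series; since the question asks for a single number, chain another sum to collapse it to a scalar:
df.isnull().sum().sum()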
You could use pd.isnull() and sum:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 1, 1], [2, 2, np.nan], [3, np.nan, np.nan]])
pd.isnull(df).values.sum()
which gives: 3
This code chunk will help:
# df is your dataframe
print(df.isnull().sum())

pandas max function results in inoperable DataFrame

I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or better yet, does anyone have a better way to generate this new DataFrame?
It is a Series; for a one-column DataFrame use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
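A quick check that the named column is now operable, using a hypothetical df1:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 5], 'b': [4, 2], 'c': [3, 3], 'd': [0, 9]})
df2 = df1.max(axis=1).to_frame('maximum')
print(df2['maximum'].dtype)                       # the dtype can be checked
df2 = df2.rename(columns={'maximum': 'row_max'})  # and the column renamed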

Sort DataFrame columns according to value in [last row, first column]

I have a df and I am trying to sort the columns in descending order based on the value in the last row (basically the value in the bottom-left cell of the df).
How can I do that in Python?
I have tried to search for a similar problem to mine but I have not managed to find a solution.
Thank you.
The resulting df should have the greatest value in the row in cell df[-1,1], the second greatest in df[-1,2], and so on.
You could use argsort() on the negated values of the relevant row for this:
import pandas as pd
import numpy as np
df.iloc[:,np.argsort(-df.iloc[-1,:])]
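A runnable sketch with made-up data, where the last row drives the column order:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, 10], 'b': [3, 4, 30], 'c': [5, 6, 20]})
# argsort on the negated last row gives column positions in
# descending order of the last row's values
df_sorted = df.iloc[:, np.argsort(-df.iloc[-1, :])]
print(df_sorted.columns.tolist())  # ['b', 'c', 'a']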
My understanding is that you're trying to sort the data based on the last column. I imagine this last column would be dynamic, so you can't just use the column's name. For sorting any dataframe in descending order, you can use
df.sort_values(column_name, ascending=False)
For finding the last column, you can use
df.columns[-1]
Putting that together, I think you want something like this:
df.sort_values(df.columns[-1], ascending=False)
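Note that this sorts the rows by the last column. To sort the columns by the value in the last row, as the question asks, sort_values also accepts axis=1 with a row label:
df.sort_values(df.index[-1], axis=1, ascending=False)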

Analyze data of One Column Dataframe based on criteria

I have a DataFrame with ten columns and more than 1,000 rows of data. I am cycling through each of the headers and calculating the difference between two headers. This results in a dataframe that is one column and N rows. I would like to then run statistics based on certain criteria, for example statistics for the group of data that is greater than zero.
so the sample is something like this.
TempDF = df[Header] - df[SecondHeader]
if (TempDF.median() > TempDF.mean()):
    print(df(TempDF[] > 0).describe())
This generates a KeyError: True and doesn't show me anything. Please help; I am trying to generate statistics on the resultant dataframe based on certain criteria, and I want to know how to accomplish that. Thank you.
You are not filtering your Series object correctly; a boolean mask goes in square brackets on the Series itself. Here is an example of how to do it:
from pandas import DataFrame
df = DataFrame([[1111, 22, 33], [140, 25, 36], [47, 58, 69]])
df.columns = ['Header', 'SecondHeader', 'ThirdHeader']
TempDF = df['Header'] - df['SecondHeader']
if TempDF.median() < TempDF.mean():
    print(TempDF[TempDF > 0].describe())
