Analyze data of One Column Dataframe based on criteria - python

I have a DataFrame with ten columns and more than 1,000 rows of data. I am cycling through the headers and calculating the difference between two of them. This results in a DataFrame that is one column and N rows. I would then like to run statistics on it based on certain criteria, e.g. statistics for the group of values that are greater than zero.
So the sample is something like this:
TempDF = df[Header] - df[SecondHeader]
if (TempDF.median() > TempDF.mean()):
    print (df(TempDF[]>0).describe())
This generates a KeyError: True and doesn't show me anything. Please help. I am trying to generate statistics on the resulting DataFrame based on certain criteria.
I want to know how to accomplish that. Thank you.

You are not filtering your Series object correctly. Here is an example of how to do it:
from pandas import DataFrame
df = DataFrame([[1111,22,33],[140,25,36],[47,58,69]])
df.columns=['Header','SecondHeader','ThirdHeader']
TempDF = df['Header'] - df['SecondHeader']
if TempDF.median() < TempDF.mean():
    print(TempDF[TempDF > 0].describe())
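If you need more than one criterion, the same boolean-indexing pattern extends with & and |, with each condition in its own parentheses. A minimal sketch, assuming the TempDF from above; the 1000 cutoff is made up purely for illustration:
# statistics for differences that are greater than 0 but below 1000
print(TempDF[(TempDF > 0) & (TempDF < 1000)].describe())  # with the sample data, only 115 survives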

Related

How to filter a Pandas dataframe to keep entire rows/columns if a criterion is fulfilled?

I am learning Python Pandas and I am having some trouble with data filtering. I have gone through multiple examples and I cannot seem to find an approach that fits my particular need:
In a dataframe with numerical values, I would like to filter rows and columns by the following criterion:
"If ANY value in a row is above a threshold, include the WHOLE row (including values that are below the threshold). Else, discard the row."
This should apply to all rows. Subsequently, I would repeat for columns.
Any help is highly appreciated.
Use:
value = 123
df[df.gt(value).any(axis=1)]
For columns, this would be:
value = 123
df.loc[:, df.gt(value).any(axis=0)]
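A minimal end-to-end sketch; the data and the threshold are made up for illustration:
import pandas as pd

df = pd.DataFrame({'a': [1, 200, 3], 'b': [4, 5, 6], 'c': [7, 8, 900]})
value = 123

# keep whole rows where ANY value exceeds the threshold
rows_kept = df[df.gt(value).any(axis=1)]           # rows 1 and 2 survive

# keep whole columns where ANY value exceeds the threshold
cols_kept = df.loc[:, df.gt(value).any(axis=0)]    # columns 'a' and 'c' survive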

How can I compare two different datasets and return values from a column based on specific criteria?

I am trying to create a new column in the "schedule" dataframe after comparing two separate columns in the "other" dataframe. Here is how my code looks now:
import pandas as pd
import numpy as np
schedule = pd.read_excel('schedule.xlsx')
other = pd.read_excel('other.xlsx')
other['New Column'] = np.where(other['Termination Date'] >= schedule['Beginning'] & (other['Termination Date'] <= schedule['End'], schedule['Pay Date']))
but it is returning this error:
ValueError: Can only compare identically-labeled Series objects
Here is what a typical example would look like in this scenario:
If other['Termination Date'] is "5/22/2021", then it would return "6/11/2021", because it would look at schedule['Beginning'] and schedule['End'] to meet the criteria.
Note that the two data frames do not have any similar data to merge on. Basically, I just need to compare from one data frame and return values on another. Let me know if you have any questions and thank you all in advance!
[screenshots of the schedule and other dataframes]
It looks like pandas will only compare values between Series that carry the same labels; here you are comparing columns from two dataframes whose indexes do not line up.
Perhaps try an if/else-style lookup in place of the &; a rough sketch follows.
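This sketch uses a pandas IntervalIndex rather than np.where; the column names follow the question, but the dates are invented so that 5/22/2021 lands in the period paid on 6/11/2021:
import pandas as pd

schedule = pd.DataFrame({
    'Beginning': pd.to_datetime(['2021-05-16', '2021-05-30']),
    'End':       pd.to_datetime(['2021-05-29', '2021-06-12']),
    'Pay Date':  pd.to_datetime(['2021-06-11', '2021-06-25']),
})
other = pd.DataFrame({'Termination Date': pd.to_datetime(['2021-05-22'])})

# one interval per pay period, then find which period each termination date falls in
periods = pd.IntervalIndex.from_arrays(schedule['Beginning'], schedule['End'], closed='both')
pos = periods.get_indexer(other['Termination Date'])     # -1 where no period matches
other['New Column'] = schedule['Pay Date'].to_numpy()[pos]
other.loc[pos == -1, 'New Column'] = pd.NaT              # guard rows outside every period
print(other)                                             # New Column -> 2021-06-11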

pandas max function results in inoperable DataFrame

I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or better yet, has a better way to generate this new DataFrame?
It is a Series; for a one-column DataFrame use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
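A small sketch with made-up numbers, showing that the named column can then be inspected and renamed like any other:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 5], 'b': [4, 2], 'c': [3, 3], 'd': [0, 9]})
df2 = df1.max(axis=1).to_frame('maximum')
print(df2['maximum'].dtype)                        # int64
df2 = df2.rename(columns={'maximum': 'row_max'})   # renaming now works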

Total zero count across all columns in a pyspark dataframe

I need to find the percentage of zeros across all columns in a pyspark dataframe. How do I find the count of zeros in each column of the dataframe?
P.S.: I have tried converting the dataframe into a pandas dataframe and using value_counts, but interpreting its output is not practical for a large dataset.
"How to find the count of zero across each columns in the dataframe?"
First:
import pyspark.sql.functions as F
df_zero = df.select([F.count(F.when(df[c] == 0, c)).alias(c) for c in df.columns])
Second, you can then look at the counts (compared to .show(), this gives a better view, and the speed is not much different):
df_zero.limit(2).toPandas().head()
Enjoy! :)
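For the percentage part of the question, a rough sketch building on df_zero above; it assumes df.count() gives the total row count (standard PySpark):
total = df.count()
df_zero_pct = df_zero.select([(F.col(c) / total * 100).alias(c) for c in df.columns])
df_zero_pct.limit(1).toPandas().head()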
Use this code to find the number of zeros in a single column of a table.
Just replace Tablename and "column name" with the appropriate values:
from pyspark.sql.functions import col
Tablename.filter(col("column name") == 0).count()

Python pandas: fill a dataframe with data from another

I have an empty pandas dataframe, as displayed in the first picture:
[screenshot of the first dataframe]
So: many, many Pfam IDs as columns and many different gene IDs as indices. Then I have a second dataframe like this:
[screenshot of the second dataframe]
Now what I would like to do is get the data from the second into the first: write a 0 in each Pfam column that has no entry for a particular gene ID, and a 1 wherever a gene has that Pfam.
Any help would be highly appreciated.
Assume the first dataframe is named d1 and the second is d2:
d1.fillna(d2.groupby([d2.index, 'Pfam']).size().clip(upper=1).unstack(fill_value=0)).fillna(0)
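An alternative sketch with made-up gene/Pfam data: pd.crosstab builds the 0/1 presence matrix directly, and reindex (commented out, since d1 is not defined in this snippet) would align it to d1's full row/column grid:
import pandas as pd

d2 = pd.DataFrame({'Pfam': ['PF00001', 'PF00002', 'PF00001']},
                  index=['geneA', 'geneA', 'geneB'])
presence = pd.crosstab(d2.index, d2['Pfam']).clip(upper=1)   # 1 if the gene has that Pfam
# presence.reindex(index=d1.index, columns=d1.columns, fill_value=0)  # align to d1's shape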
