How to count the columns that have missing values - python

I have a large dataset that I partitioned into training and test sets.
I found the missing values of the independent variables.
I want to count the number of columns that have a missing value; in this case, I should get 12. So far I have only been able to sum all the missing values together.
Here is my attempt:
finding_missing_values = data.train.isnull().sum()
finding_missing_values
finding_missing_values.sum()
Is there a way I can count the number of columns that have a missing value?

Convert the counts to a list and then count the non-zero values as follows.
finding_missing_values = data.train.isnull().sum().to_list()
number_of_missing_value_columns = sum(k > 0 for k in finding_missing_values)
print(number_of_missing_value_columns)
This should give:
12

You wrote
finding_missing_values.sum()
You were looking for
(finding_missing_values > 0).values.sum()
From .values we get a numpy array.
The comparison gives us False / True values,
which conveniently are treated as 0 / 1 by .sum().
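As a minimal one-line alternative (a sketch, assuming the same data.train frame as above): .any(axis=0) marks each column that contains at least one NaN, and summing those booleans counts the marked columns.
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for data.train
train = pd.DataFrame({"a": [1, np.nan], "b": [1, 2], "c": [np.nan, np.nan]})

# True per column if the column has at least one missing value, then count the Trues
n_cols_with_missing = train.isnull().any(axis=0).sum()
print(n_cols_with_missing)  # 2 for this toy frame; 12 for the question's data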

Related

Python Pandas: Get the two largest values from a set of numbers but ensure they are x amount of values apart

Is it possible to use .nlargest to get the two highest numbers in a set of numbers, but ensure that they are at least x rows apart?
For example, in the following code I would want to find the two largest values but ensure that they are more than 5 rows apart from each other. Is there an easy way to do this?
data = {'Pressure' : [100,112,114,120,123,420,1222,132,123,333,123,1230,132,1,23,13,13,13,123,13,123,3,222,2303,1233,1233,1,1,30,20,40,401,10,40,12,122,1,12,333],
}
If I understand the question correctly, you need to output the largest value, and then the next largest value that's at least X rows apart from it (based on the index).
The first value is just df.Pressure.max(), and its index is df.Pressure.idxmax().
The second value is either before or after the first value's index:
max_before = df.Pressure.loc[:df.Pressure.idxmax() - X].max()
max_after = df.Pressure.loc[df.Pressure.idxmax() + X:].max()
second_value = max(max_before, max_after)
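A minimal runnable sketch of that approach, assuming the dictionary above is wrapped in a DataFrame and X is the required row gap:
import pandas as pd

data = {'Pressure': [100, 112, 114, 120, 123, 420, 1222, 132, 123, 333, 123, 1230,
                     132, 1, 23, 13, 13, 13, 123, 13, 123, 3, 222, 2303, 1233, 1233,
                     1, 1, 30, 20, 40, 401, 10, 40, 12, 122, 1, 12, 333]}
df = pd.DataFrame(data)
X = 5  # required distance in rows

first_value = df.Pressure.max()
first_idx = df.Pressure.idxmax()

# only consider rows at least X positions before or after the first maximum
max_before = df.Pressure.loc[:first_idx - X].max()
max_after = df.Pressure.loc[first_idx + X:].max()
second_value = max(max_before, max_after)

print(first_value, second_value)  # 2303 and 1230 for this data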

Most efficient counting with numpy

I have a huge dataset, where I'm trying to reduce the dimensionality by removing the variables that fulfill these two conditions:
Count of unique values in a feature / sample size < 10%
Count of most common value / Count of second most common value > 20 times
The first condition is no problem; the second condition is where I'm stuck. I'm trying to be as efficient as possible because of the size of the dataset, so I'm using numpy, which I understand is faster than pandas. A possible solution was numpy-most-efficient-frequency-counts-for-unique-values-in-an-array, but I'm having trouble getting the counts of the two most common values.
My attempt:
n = df.shape[0] / 10
variable = []
condition_1 = []
condition_2 = []
for i in df:
    variable.append(i)
    condition_1.append(df[i].unique().shape[0] < n)
    condition_2.append(most_common_value_count / second_most_common_value_count > 20)

result = pd.DataFrame({"Variables": variable,
                       "Condition_1": condition_1,
                       "Condition_2": condition_2})
The dataset df contains positive and negative values (so I can't use np.bincount), and also categorical variables, objects, datetimes, dates, and NaN variables/values.
Any suggestions? Remember that it's critical to minimize the number of steps in order to maximize efficiency.
As noted in the comments, you may want to use np.unique (or pd.unique). You can set return_counts=True to get the value counts. These will be the second item in the tuple returned by np.unique, hence the [1] index below. After sorting them, the most common count will be the last value, and the second most common count will be the next to last value, so you can get them both by indexing with [-2:].
You could then construct a Boolean list indicating which columns meet your condition #2 (or rather the opposite). This list can then be used as a mask to reduce the dataframe:
def counts_ratio(s):
    """Take a pandas series s and return the count of its most common value
    divided by the count of its second most common value."""
    counts = np.sort(np.unique(s, return_counts=True)[1])[-2:]
    return counts[1] / counts[0]

condition2 = [counts_ratio(df[col]) <= 20 for col in df.columns]
df_reduced = df[df.columns[condition2]]
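A sketch of how the two conditions could be combined to reduce a frame, assuming (as the question states) that a variable is removed only when it meets both conditions; the toy DataFrame is purely illustrative:
import numpy as np
import pandas as pd

def counts_ratio(s):
    # repeats the helper above: most common count / second most common count
    counts = np.sort(np.unique(s, return_counts=True)[1])[-2:]
    return counts[1] / counts[0]

# hypothetical toy data: 'b' has few unique values and one dominant value
df = pd.DataFrame({"a": np.arange(100),
                   "b": [0] * 99 + [1]})

n = df.shape[0] / 10
remove = [(df[col].nunique() < n) and (counts_ratio(df[col]) > 20)
          for col in df.columns]
df_reduced = df.loc[:, [not r for r in remove]]
print(df_reduced.columns.tolist())  # ['a']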

I am trying to rescale numeric columns in a dataframe in a way that is slightly different from the MinMaxScaler approach

I have a dataframe called df that contains 6 numeric columns. What I want to do is rescale each attribute by subtracting the max value of a given column from the column value and then dividing that by the difference of the max and min value for the given column. I'm also including some multiplication and addition in this rescaling. The rescaling is slightly different from the MinMaxScaler class.
numeric_attribs = list(df)
analyMaxColValues = df.max()
analMinColValues = df.min()
df_max = analyMaxColValues.to_frame()
df_min = analMinColValues.to_frame()
for i, j, k in zip(numeric_attribs, df_max.index, df_min.index):
    df[i] = (.999 * ((df[i] - df_max[j].values) / (df_max[j].values - df_min[k].values))) + .0005
You mentioned:
"What I want to do is rescale each attribute by subtracting the max value of a given column from the column value and then divide that by the difference of the max and min value for the given column. I'm also including some multiplication and addition to this rescaling."
Based on this, and correct me if I'm misunderstanding, the steps you're trying to accomplish seem to be:
subtract max value of a column from all rows in that column
divide the result of step 1 by the difference of the max and min value in each respective column
multiply 0.999 and add 0.0005 to the result of step 2 (from your code)
This can be easily accomplished by the following two lines of code:
for i in df.columns:
    df[i] = 0.999 * (df[i].values - df[i].values.max()) / (df[i].values.max() - df[i].values.min()) + 0.0005
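A small self-contained check of that loop on a hypothetical two-column frame (any numeric columns would do):
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 40.0]})

for i in df.columns:
    # (value - max) / (max - min), scaled by 0.999 and shifted by 0.0005
    df[i] = 0.999 * (df[i] - df[i].max()) / (df[i].max() - df[i].min()) + 0.0005

print(df)  # each column now runs from -0.9985 (old min) to 0.0005 (old max)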

Trying to divide a dataframe column by a float yields NaN

Background
I deal with a csv datasheet that prints out columns of numbers. I am working on a program that will take the first column, ask the user for a time as a float (i.e. 45 and a half hours = 45.5), and then subtract that number from the first column. I have been successful in that regard. Now I need to find the row index of the "zero" time point. I use min to find that index and then use it to look up the value in the following column, A1.1. I need the reading at time 0 so that I can normalize A1.1 to it, so that on a graph the reading in column A1.1 at the 0 time point is 1 (and eventually all subsequent columns, but baby steps for me).
time_zero = float(input("Which time would you like to be set to 0?"))
df['A1']= df['A1']-time_zero
This works fine so far to set the zero time.
zero_location_series = df[df['A1'] == df['A1'].min()]
r1 = zero_location_series[' A1.1']
df[' A1.1'] = df[' A1.1']/r1
Here's where I run into trouble. The first line correctly identifies a series that I can pull from for all my other columns. Next, r1 correctly identifies the proper A1.1 value, and this value is a float when I use type(r1).
However, when I divide df[' A1.1'] by r1, it yields only one correct value, at the row where r1/r1 = 1. All other values come out NaN.
My Questions:
How do I divide a column by a float? Why am I getting NaN?
Is there a faster way to do this, as I need to do it for 16 columns (i.e. A2/r2, A3/r3, etc.)?
Do I need to use inplace=True anywhere to make the operations stick prior to resaving the data, or is that only for adding/deleting rows?
Example
A dataframe that looks like this:
http://i.imgur.com/ObUzY7p.png
The zero time sets properly (image not shown).
After dividing the column:
http://i.imgur.com/TpLUiyE.png
This should work:
df['A1.1']=df['A1.1']/df['A1.1'].min()
I think the reason df[' A1.1'] = df[' A1.1']/r1 did not work is that r1 is a Series. Try r1? instead of type(r1) and pandas will tell you that r1 is a Series, not an individual float.
To do it for every column in one go, iterate over the columns like this:
for c in df:
    df[c] = df[c] / df[c].min()
If you want to divide every value in the column by r1, it's best to use apply, for example:
import pandas as pd
df = pd.DataFrame([1, 2, 3, 4, 5])
# apply an anonymous function to the first column ([0]): divide every value
# in the column by 3
df = df[0].apply(lambda x: x / 3.0)
print(df)
So you'd probably want something like this:
df = df["A1.1"].apply(lambda x: x / r1)
This really only answers part 2 of your question. apply is probably your best bet for running a function on multiple rows and columns quickly. As for why you're getting NaNs when dividing by a float, is it possible the values in your columns are anything other than floats or integers?
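To make the index-alignment issue concrete, here is a minimal sketch with a hypothetical stand-in for the csv data; pulling the scalar out of the one-row Series (for example with .iloc[0]) avoids the NaNs:
import pandas as pd

# hypothetical stand-in: 'A1' is the already-shifted time, 'A1.1' the reading
df = pd.DataFrame({"A1": [0.0, 1.0, 2.0, 3.0], "A1.1": [5.0, 8.0, 10.0, 12.0]})

zero_location_series = df[df["A1"] == df["A1"].min()]

r1_series = zero_location_series["A1.1"]   # a one-row Series, index [0]
print(df["A1.1"] / r1_series)              # aligns on the index: NaN everywhere except row 0

r1 = zero_location_series["A1.1"].iloc[0]  # extract the scalar instead
print(df["A1.1"] / r1)                     # 1.0, 1.6, 2.0, 2.4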

In Python, how do I select the columns of a dataframe satisfying a condition on the number of NaN?

I hope someone can help me. I'm new to Python, and I have a dataframe with 111 columns and over 40,000 rows. All the columns contain NaN values (some columns contain more NaNs than others), so I want to drop those columns that have at least 80% NaN values. How can I do this?
To solve my problem, I tried the following code
df1=df.apply(lambda x : x.isnull().sum()/len(x) < 0.8, axis=0)
The expression x.isnull().sum()/len(x) divides the number of NaNs in the column x by the length of x, and the part < 0.8 selects those columns containing less than 80% NaN.
The problem is that when I run this code, I only get the names of the columns together with the boolean True, but I want the entire columns, not just the names. What should I do?
You could do this:
filt = df.isnull().sum()/len(df) < 0.8
df1 = df.loc[:, filt]
You want to achieve two things. First, you have to find all columns which contain less than 80% NaNs. Second, you want to keep only those in your DataFrame.
To get a pandas Series indicating which columns to keep, you can do:
df1 = df.isnull().sum(axis=0) < 0.8*df.shape[0]
(Btw. you have a typo in your question. You should drop the ==True as it always tests whether 0.5==True)
This will give True for all column indices to keep, as .isnull() gives True (or 1) if an element is NaN and False (or 0) for a valid number. Then .sum(axis=0) sums along the columns, giving the number of NaNs in each column. The comparison then checks whether that number is smaller than 80% of the number of rows.
For the second task, you can use this Series to index your columns:
df = df[df.columns[df1]]
or as suggested in the comments by doing:
df.drop(df.columns[df1==False], axis=1, inplace=True)
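As an aside not taken from the answers above, pandas' dropna can express the same filter directly through its thresh argument (the required number of non-NaN values per column); a sketch, where the exact threshold arithmetic around the 80% boundary may need adjusting:
import numpy as np
import pandas as pd

# hypothetical toy frame: 'b' is entirely NaN, 'a' is fully valid
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "b": [np.nan] * 5})

# keep only columns with more than 20% non-NaN values
min_valid = int(0.2 * len(df)) + 1
df1 = df.dropna(axis=1, thresh=min_valid)
print(df1.columns.tolist())  # ['a']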
