How can I keep NaN values while extracting a subset? - python

I have a dataframe and there is a column called 'budget'.
There are some NaN values in this column and I'd like to keep them.
However, when I try to extract a sub-frame in which the budget value is greater than or equal to 38750, I lose the NaN values in the new data frame!
df2 = df[df.budget >= 38750]

I would use a double condition, where the second condition checks whether the values in the budget column are missing:
inds = (df['budget'] >= 38750) | (df['budget'].isnull())
df2 = df.loc[inds, :]
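For example, here is a quick sketch on toy data (the column name budget is taken from the question, the values are made up) showing that the rows with a missing budget are kept:
import numpy as np
import pandas as pd

df = pd.DataFrame({'budget': [10000, 50000, np.nan]})  # toy data with one missing value
inds = (df['budget'] >= 38750) | (df['budget'].isnull())
df2 = df.loc[inds, :]
print(df2)
#     budget
# 1  50000.0
# 2      NaN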

Related

Fill nan values in one column based on other columns

I am working on a dataset of average age of marriage and am currently cleaning the data. While doing so, I came across the location column, which contains NaN values that I need to fill. The problem is that this column has many unique values, and I'm not sure how to fill the missing ones. I would appreciate some suggestions on how to fill NaN values in a column with many unique values.
I have attached the dataset for reference, DataSet
I suggest doing it in 3 steps:
Fill in the missing values of location with either the most common location or with a separate value "Unknown";
Fill in the missing values of "age_of_marriage" with a mean value of this feature by location;
If there are any missing values of "age_of_marriage" left, fill them in with the average value.
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/atharva07/Age-of-marriage/main/age_of_marriage_data.csv', sep=',')
df['location'] = df['location'].fillna('Unknown')
# fill missing ages with the median age of the same location
df['age_of_marriage'] = df['age_of_marriage'].fillna(
    df.groupby('location')['age_of_marriage'].transform('median'))
df['age_of_marriage'] = df['age_of_marriage'].fillna(df['age_of_marriage'].mean())
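As a quick sanity check (a small addition, not part of the original answer), you can confirm that no missing values remain:
print(df[['location', 'age_of_marriage']].isna().sum())  # both counts should be 0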

How to change DataFrame column values so that mean is modified accordingly?

I have a Pandas DataFrame extracted from Estespark Weather for the dates between Sep-2009 and Oct-2018, and the mean of the Average windspeed column is 4.65. I am taking a challenge that has a sanity check requiring the mean of this column to be 4.64. How can I modify the values of this column so that its mean becomes 4.64? Is there a code solution for this, or do we have to do it manually?
I can see two solutions:
1. Subtract 0.01 (4.65 - 4.64) from every value of that column:
df['AvgWS'] -= 0.01
2. If you don't want to alter all rows: find which rows you can remove to give you the desired mean (if there are any):
import numpy as np

current_mean = 4.65
desired_mean = 4.64
n_rows = len(df['AvgWS'])
# mean of the remaining rows if a given row were removed
df['can_remove'] = df['AvgWS'].map(
    lambda x: np.isclose((current_mean * n_rows - x) / (n_rows - 1), desired_mean))
This creates a new boolean column in your dataframe with True in the rows that, if removed, would make the remaining column's mean equal 4.64. If there is more than one such row, you can analyse them, choose the one that seems least important, and remove it.
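As a small follow-up sketch (hypothetical, reusing the AvgWS and can_remove names from above), you might pick one of the flagged rows and drop it like this:
# indices of rows whose removal would give the desired mean
candidates = df.index[df['can_remove']]
if len(candidates) > 0:
    # inspect the candidates and drop the one you consider least important
    # (here simply the first one)
    df = df.drop(candidates[0]).drop(columns=['can_remove'])
    print(df['AvgWS'].mean())  # should now be close to 4.64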

Fetch previous rows based on if condition and Shift function - Python dataframe

I have data as shown below. I would like to select rows based on two conditions.
1) rows that start with digits (1,2,3 etc)
2) the row immediately preceding each record that satisfies the 1st condition
Please find below how the input data looks.
Please find below how I expect the output to be.
I tried using the shift(-1) function but it seems to be throwing an error. I am sure I messed up the logic/syntax. Please find below the code that I tried:
# get the index of all records that start with a number
s = df1.loc[df1['VARIABLE'].str.contains(r'^\d') == True].index
# now I need to get the previous record of each group, but this is incorrect
df1.loc[((df1['VARIABLE'].shift(-1).str.contains(r'^\d') == False) &
         (df1['VARIABLE'].str.contains(r'^\d') == True))].index
Use:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'VARIABLE':['studyid',np.nan,'age_interview','Gender','1.Male',
                                '2.Female',np.nan, 'dob', 'eth',
                                'Ethnicity','1.Chinese','2.Indian','3.Malay']})

# first remove missing rows by column VARIABLE
df1 = df1.dropna(subset=['VARIABLE'])
# test values starting with a number
s = df1['VARIABLE'].str.contains(r'^\d')
# chain with the shifted mask by | (OR) to also keep the previous row
mask = s | s.shift(-1)
# filter by boolean indexing
df1 = df1[mask]
print(df1)
VARIABLE
3 Gender
4 1.Male
5 2.Female
9 Ethnicity
10 1.Chinese
11 2.Indian
12 3.Malay

Pandas filter pivot table by value

This might be a duplicate of Pandas: Filtering pivot table rows where count is fewer than specified value, but I keep getting a NaN error.
I have a data frame (df) of orders, order values, customer Id and dates:
id, date, order_count, daily_order_value
I want to view the total spend of guests who order more than one, three, and ten times over the duration.
Pnon_merch = pd.pivot_table(dffilter, index=["guest_id"],
                            values=['ct_order', 'order_value'],
                            aggfunc={'ct_order': np.sum,
                                     'order_value': [np.sum, np.mean]})
Printing Pnon_merch:
          ct_order  order_value
               sum         mean      sum
guest_id
4813             1    2020.6400  2020.64
Produces a table, but when I try:
Pnon_merch_is1 = Pnon_merch[Pnon_merch["ct_order"]==1]
I get a table of NaN values:
          ct_order  order_value
               sum         mean  sum
guest_id
4813           NaN          NaN  NaN
truefalse = [Pnon_merch["ct_order"]==1]
Gives a table of True / False values:
           sum
guest_id
4813      True
6517      True
7876     False
Why does the True/False mask end up returning NaN?
This example Filtering based on the "rows" data after creating a pivot table in python pandas seems to only filter on the index NOT the values.
(groupby level = 0 does not yield the correct results either)
First, I would rename the columns (after aggregation) like this:
Pnon_merch.columns = ['ct_order_sum','order_value_mean','order_value_sum']
now you can simply do this:
Pnon_merch_is1 = Pnon_merch[Pnon_merch["ct_order_sum"]==1]
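Alternatively (a hedged suggestion, not part of the original answer), you can keep the MultiIndex columns and select the aggregated column by its tuple key, which avoids the column misalignment that produces the NaN values:
# select the ('ct_order', 'sum') column directly from the MultiIndex columns
Pnon_merch_is1 = Pnon_merch[Pnon_merch[('ct_order', 'sum')] == 1]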

Pandas - Delete Rows with only NaN values

I have a DataFrame containing many NaN values. I want to delete rows that contain too many NaN values; specifically: 7 or more.
I tried using the dropna function several ways but it seems clear that it greedily deletes columns or rows that contain any NaN values.
This question (Slice Pandas DataFrame by Row) shows me that if I can just compile a list of the rows that have too many NaN values, I can delete them all with a simple
df.drop(rows)
I know I can count non-null values using the count function, which I could then subtract from the total to get the NaN count that way (Is there a direct way to count NaN values in a row?). But even so, I am not sure how to write a loop that goes through a DataFrame row-by-row.
Here's some pseudo-code that I think is on the right track:
# loop over each row and drop the rows with too many NaN values
total = df.shape[1]              # number of columns
for idx, row in df.iterrows():
    m = total - row.count()      # NaN count in this row
    if m > 7:
        df = df.drop(idx)
I am still new to Pandas, so I'm very open to other ways of solving this problem, whether they're simpler or more complex.
Basically the way to do this is to determine the number of columns, set the minimum number of non-NaN values, and drop the rows that don't meet this criterion:
df.dropna(thresh=(len(df.columns) - 7))
See the docs
The optional thresh argument of df.dropna lets you give it the minimum number of non-NA values in order to keep the row.
df.dropna(thresh=df.shape[1]-7)
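Here is a minimal sketch on toy data (made up for illustration, not from the question) showing the effect of thresh:
import numpy as np
import pandas as pd

# 10 columns: the first row is complete, the second row has 8 NaN values
df = pd.DataFrame([list(range(10)), [0, 1] + [np.nan] * 8], columns=list('abcdefghij'))
# keep rows with at least 10 - 7 = 3 non-NaN values, i.e. drop rows with more than 7 NaN
print(df.dropna(thresh=df.shape[1] - 7))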
