Data preprocessing: handling null values using fillna - python

Recently, while preprocessing a raw CSV file, I replaced the missing values with the mean of the values in that column. The code snippet is below (assume df is the DataFrame and time is one of its columns, holding float values that represent hours):
df.time = df.time.fillna(df.time.mean())
After that, the null values were successfully replaced with the mean. Now I want to print only the rows that were affected by the command instead of displaying the entire DataFrame. How can I do that?

You can't get these rows after you have performed the operation, as that information is lost.
But you can record a boolean mask of the rows that have null values first, and then print those rows afterwards:
na_rows = df.time.isnull()
df.time = df.time.fillna(df.time.mean())
print(df[na_rows])
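For a self-contained illustration of that mask-then-fill pattern (the sample hours below are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [1.5, np.nan, 3.0, np.nan, 2.5]})

na_rows = df.time.isnull()                # boolean mask of the missing rows
df.time = df.time.fillna(df.time.mean())
print(df[na_rows])                        # prints only the two rows that were filled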

Related

Pandas .sort_values() function returning data frame with scattered values

I'm using pandas to load a short_desc.csv with the following columns: ["report_id", "when", "what"]
with
#read csv
shortDesc = pd.read_csv('short_desc.csv')
#get all numerical and nonnull values
shortDesc = shortDesc[shortDesc['report_id'].str.isdigit().notnull()]
#convert 'when' from UNIX timestamp to datetime
shortDesc['when'] = pd.to_datetime(shortDesc['when'],unit='s')
which results in the following:
I'm trying to remove rows that have duplicate 'report_id's by sorting by date and keeping the row with the newest date for each 'report_id', using the following:
shortDesc = shortDesc.sort_values(by='when').drop_duplicates(['report_id'], keep='last')
The problem is that when I use .sort_values() on this particular dataframe, the values of 'what' come out scattered across all columns and the 'report_id' values disappear:
shortDesc = shortDesc.sort_values(by=['when'], inplace=False)
I'm not sure why this is happening in this particular instance, since I was able to achieve the correct results with another dataframe of the same shape using the same code (P.S. it's not a mistake; I dropped the 'what' column in the second picture):
[screenshot: similar-shape dataframe]
[screenshot: desired result with the similar-shape DF]
I found out that:
#get all numerical and nonnull values
shortDesc = shortDesc[shortDesc['report_id'].str.isdigit().notnull()]
was only checking whether the result of str.isdigit() was not null, which effectively discarded the digit check itself, so non-numeric 'report_id' values were never dropped. I changed this to two separate lines:
shortDesc = shortDesc[shortDesc['report_id'].notnull()]
shortDesc = shortDesc[shortDesc['report_id'].str.isnumeric()]
which allowed
shortDesc.sort_values(by='when', inplace=True)
to work as intended. I am still confused as to why .sort_values(by='when') was affected by the 'report_id' column, so if anyone knows, please enlighten me.
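In the meantime, here is a minimal sketch of the working pipeline described above, with made-up sample rows:
import pandas as pd

shortDesc = pd.DataFrame({
    'report_id': ['101', '101', 'abc', None],
    'when': [1609459200, 1612137600, 1612137601, 1612137602],
    'what': ['first', 'second', 'junk', 'junk'],
})

shortDesc = shortDesc[shortDesc['report_id'].notnull()]        # drop nulls first
shortDesc = shortDesc[shortDesc['report_id'].str.isnumeric()]  # then drop non-numeric ids
shortDesc['when'] = pd.to_datetime(shortDesc['when'], unit='s')

# sort by date and keep only the newest row per report_id
shortDesc = shortDesc.sort_values(by='when').drop_duplicates(['report_id'], keep='last')
print(shortDesc)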

Python Compare the last two non-null values in dataframe column

I have a dataframe whose columns are mostly blank (nulls/NaNs set to 0) with sporadic numeric values.
I am trying to compare the last two non-zero values in a dataframe column.
Something like:
df['column_c'] = df['column_a'].last_non_zero_value > df['column_a'].second_to_last_non_zero_value
This is what the columns look like in Excel:
You could drop all the rows with missing data using DataFrame.dropna(), then take the column's remaining values as an array, in which the last two elements are easy to find.
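As a concrete sketch of that idea: since the nulls here were already set to 0, filtering out the zeros (rather than calling dropna) leaves the non-zero values, whose last two entries can then be compared. The column names are taken from the question:
import pandas as pd

# made-up sample; zeros stand in for the missing values as in the question
df = pd.DataFrame({'column_a': [0, 3, 0, 7, 0, 5]})

nonzero = df.loc[df['column_a'] != 0, 'column_a']
# scalar comparison of the last non-zero value against the one before it
df['column_c'] = nonzero.iloc[-1] > nonzero.iloc[-2]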

Collapsing values of a Pandas column based on Non-NA value of other column

I have data like this in a CSV file, which I am importing into a pandas DataFrame.
I want to collapse the values of the Type column by concatenating its strings into one sentence and keeping it in the first row next to the Date value, while keeping all the other rows and values the same.
As shown below.
Edit:
You can try ffill + transform:
df1 = df.copy()
# forward-fill the grouping keys so every row carries its Number/Date
df1[['Number', 'Date']] = df1[['Number', 'Date']].ffill()
df1.Type = df1.Type.fillna('')
# join the Type strings within each (Number, Date) group
s = df1.groupby(['Number', 'Date']).Type.transform(' '.join)
# keep the joined sentence on the first row of each group and blank out the rest
df.loc[df.Date.notnull(), 'Type'] = s
df.loc[df.Date.isnull(), 'Type'] = ''
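To see it end to end, here is a made-up input in the layout the question appears to describe (Number/Date present only on the first row of each group):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Number': [1.0, np.nan, np.nan, 2.0, np.nan],
    'Date': ['2021-01-01', np.nan, np.nan, '2021-01-02', np.nan],
    'Type': ['quick', 'brown', 'fox', 'lazy', 'dog'],
})
# after running the snippet above, row 0's Type becomes 'quick brown fox',
# row 3's becomes 'lazy dog', and the remaining Type cells are ''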

pandas max function results in inoperable DataFrame

I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or better yet, does anyone have a better way to generate this new DataFrame?
It is a Series; for a one-column DataFrame use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
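Once it has a named column, the usual operations work again, e.g. (assuming df1 holds numeric columns):
df2 = df1.max(axis=1).to_frame('maximum')
print(df2['maximum'].dtype)                       # dtype check now works
df2 = df2.rename(columns={'maximum': 'row_max'})  # and so does renaming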

set index to Date column residing in level 3 multiindex dataframe

For a dataframe which looks like this:
I want to simply set the index to be the Date column, which you see as the first column.
The dataframe comes from an API whose data I save into a CSV:
data.to_csv('stocks.csv', header=True, sep=',', mode='a')
data = pd.read_csv('stocks.csv',header=[0,1,2])
data
Preferably I would also like to get rid of the "Unnamed:.." labels you see in the picture.
Thanks.
I solved it by specifying header=[0,1], index_col=0 in the read_csv call and then converting the dataframe to numeric, since the datatypes got distorted (although I believe that is not always necessary):
data = pd.read_csv('stocks.csv', header=[0,1], index_col=0)
data = data.apply(pd.to_numeric, errors='coerce')
# optionally, drop rows that failed the numeric conversion:
data = data.dropna()
In this fashion I get exactly what I want; namely, I can write e.g.
data['AGN.AS']['High']
and get the high values for a specific stock.
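Equivalently, with the two-level column index the same selection can be done in one step with a tuple, which avoids chained indexing:
high = data.loc[:, ('AGN.AS', 'High')]  # same result as data['AGN.AS']['High']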
