Find dropped rows in Pandas - python

I have a Pandas DataFrame of roughly 64,000 rows. It looks something like this:
              values
asn   country
12345 US         ...
12345 MX         ...
I was running into an error saying that the MultiIndex could not contain non-unique values. This led me to suspect that I had some NaN values in my index, so I tried the following to verify:
df = ...  # my data frame
rows = df.shape[0]
df = df.reindex(df.index.dropna())
if df.shape[0] < rows:
    print("Dropped %s NaN rows!" % (rows - df.shape[0]))
As expected, this printed out "Dropped 10 NaN rows!"... although now I'd like to find out which rows were dropped so I can investigate how they got into my DataFrame in the first place.
How can I do this? I've looked through the Pandas docs for something like df.index.isna() (no dice), and I've tried taking the "before" and "after" DataFrames and computing their difference, but I wasn't sure how to do that and my attempts led to indexing errors.

You can use MultiIndex.to_frame to get a DataFrame equivalent to your index, then combine isna and any to determine the null rows:
idxr = df.index.to_frame().isna().any(axis=1)
You can now use this to filter your DataFrame via df[idxr] to restrict to rows with a null value in the MultiIndex.
Note: with older versions of pandas you will need to use isnull instead of isna.
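For example, here is a minimal sketch with a made-up frame (since the original data isn't shown) that reproduces the setup and filters out the offending rows:
import numpy as np
import pandas as pd

# Hypothetical frame with a NaN in the 'country' level of the index
idx = pd.MultiIndex.from_tuples(
    [(12345, "US"), (12345, "MX"), (12345, np.nan)],
    names=["asn", "country"],
)
df = pd.DataFrame({"values": [1, 2, 3]}, index=idx)

idxr = df.index.to_frame().isna().any(axis=1)
print(df[idxr])  # prints only the row whose index contains NaN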

Related

How to select rows with no missing data in python?

I can only find questions on here that are for selecting rows with missing data using pandas in python.
How can I select rows that are complete and have no missing values?
I am trying to use data.notnull(), which gives me element-wise True/False values, but I don't know how to select only the rows where all values are True (i.e. no NAs). I'm also unsure whether notnull() treats zeros as False; I would accept a zero in a row as a valid value. I am just looking for rows with no NAs.
Without seeing your data: if it's in a dataframe df and you want to drop rows with any missing values, try
newdf = df.dropna(how='any')
Dropping rows with any missing value is what pandas does by default, so this should be the same as
newdf = df.dropna()
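If you'd rather build on your notnull() attempt, an equivalent boolean-mask version looks like this (a sketch with made-up data; zeros are kept, only NaNs are dropped):
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": [1, 0, np.nan], "b": [4.0, 5.0, 6.0]})

# Keep only rows where every value is non-null; zeros still count as values
complete = data[data.notnull().all(axis=1)]
print(complete)  # rows 0 and 1 survive; the row with NaN is dropped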

Issue replacing null values in a PySpark dataframe

I am having trouble replacing null values with 0 in a PySpark dataframe.
Let df1 and df2 be two dataframes. After a join on col1, I get a dataframe df that contains two columns with the same name (possibly with different values), inherited from df1 and df2; call them df1.dup_col and df2.dup_col. I have null values in each of them, and I want to replace the nulls in df1.dup_col with 0.
So, first I drop the df2.dup_col columns, then I call
df.fillna({"df1.dup_col":'0'})
but I still get the null values. So I tried,
df.select("df1.dup_col").na.fill(0)
with the same result. So I tried
df = df.withColumn("df1.dup_col",
                   when(df["df1.dup_col"].isNull(), 0).otherwise(df["df1.dup_col"]))
with no better result.
Am I missing something?
You should do something like:
df = df.fillna("0", subset = ["dup_col"]) # This is the string 0
df = df.fillna(0, subset = ["dup_col"]) # This is the number 0
df = df.fillna({'colName':'value_to_replace'})
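If the dotted names keep getting in the way, one option is to rename the duplicate columns before the join so fillna can target a plain name. A rough sketch with made-up frames standing in for df1 and df2 (the names dup_col_1/dup_col_2 are just illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up frames standing in for df1 and df2
df1 = spark.createDataFrame([(1, None), (2, 5)], ["col1", "dup_col"])
df2 = spark.createDataFrame([(1, 7), (2, None)], ["col1", "dup_col"])

# Rename so each column has an unambiguous name after the join
df = df1.withColumnRenamed("dup_col", "dup_col_1").join(
    df2.withColumnRenamed("dup_col", "dup_col_2"), on="col1")

df = df.fillna(0, subset=["dup_col_1"])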

How do I dynamically update a column in pandas with the value of the column to its left?

I have a dataframe with a series of boolean columns, one for each month of the year (2019.03_flag, 2019.04_flag, 2019.05_flag, and so on).
[snippet of the DataFrame omitted]
I'm trying to update the 2019.04_flag, 2019.05_flag, etc. columns with the last valid value. I know that I can use df['2019.04_flag'].fillna(df['2019.03_flag']), but I don't want to write 11 fillna lines. Is there a way to update the values dynamically? I've tried fillna with the ffill method, but it doesn't propagate across the row.
I would look into the pandas fillna method (see its documentation). It has different methods for filling NaN; I think "ffill" would suit your needs, since it fills each NaN with the last valid entry. Try the following:
df = df.fillna(method = "ffill", axis = 1)
Setting axis = 1 will perform the imputation across the columns, the axis I believe you want (a single row across columns).
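As a quick sanity check, here is a minimal sketch with a made-up frame (note that newer pandas versions deprecate fillna(method=...) in favor of the equivalent df.ffill):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "2019.03_flag": [True, False],
    "2019.04_flag": [np.nan, True],
    "2019.05_flag": [np.nan, np.nan],
})

# Carry the last valid value rightward across each row
print(df.ffill(axis=1))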

What's the best way to replace NaN values (in a Pandas DataFrame) with values from a separate Pandas Series?

I started with a Pandas DataFrame which has a column with many NaN values.
I split this Pandas DataFrame into two DataFrames: non-NaN and NaN.
I estimated a linear regression model to try to fill in the NaN values (as a function of the other columns).
So I now have a separate Pandas Series that has the estimated values. Its length is the same length as the NaN DataFrame.
I now want to put these estimated values back into the NaN DataFrame, so that I can then ultimately pd.concat() these two DataFrames into one DataFrame that I can then use for my analysis.
I cannot figure out a way to put these values back into the NaN DataFrame in the correct rows. Every time I try, only some of the NaNs get filled (and probably in the wrong order). It seems to have something to do with the way they're indexed.
df_nan["Column"] = y_predicted
This is the way I've tried to do it, but it only fills in some of the rows, and incorrectly. Something to do with indices maybe?
I think one way of doing this is the following: keep your raw dataframe and use apply on the column you want to impute.
df['imputed_column'] = df.apply(lambda x: x.Column if pd.notnull(x.Column) else y_predicted[x.name], axis=1)
This takes the estimated value when the column is null (with x.name being the index of the row); otherwise, it keeps the existing value.
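The misalignment you saw likely comes from y_predicted carrying a fresh 0..n-1 index while df_nan keeps the original row labels, so plain assignment matches the wrong rows. A sketch of one way to repair that, with made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({"Column": [1.0, np.nan, 3.0, np.nan]})
df_nan = df[df["Column"].isna()]

# Suppose the model returned predictions with a fresh 0..n-1 index
y_predicted = pd.Series([2.0, 4.0])

# Re-attach the NaN rows' original labels, then let fillna align by index
y_predicted.index = df_nan.index
df["Column"] = df["Column"].fillna(y_predicted)
print(df)  # all NaNs filled, in the right rows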

pandas dataframe: Changing from single index to multi-column index

In Python pandas I have a dataframe df_aaa:
date      data  otherdata  symbol
2015/1/1  11    12         aaa
2015/2/1  21    22         aaa
2015/3/1  31    31         aaa
and a dataframe df_all with the same columns:
date      data  otherdata  symbol
2015/1/1  31    31         bbb
Currently the index of both is by date.
I want to append df_aaa to df_all, and have them with a composite index of both symbol and date.
How do I do that?
Basically, the following are all one question: how do I set a multi-index and use it when appending? Can I do it with a different column order? Do I need to refresh? Etc.:
I'm not sure if a multi-index is an index that has multiple 'columns' (or rows), or the ability to have more than one index (any of which could be for multiple columns or rows). Or are both correct?
Must I first set the index of both dataframes to a multi-index so the append will work? (Otherwise I'll have duplicates for different symbols.)
Do I have to "drop" the existing index before creating the new one?
Is there such a thing as a dataframe with data but no index?
Must a (single) index be of unique values?
When do I use which of the following dataframe methods: set_index(), reindex(), reset_index(), set_level, reset_level?
And what is the default when I give these methods an array? The pandas docs are daunting, and I can't make heads or tails of them. Some good examples would help...
Do I have to add anything (like axis=1) when setting the index?
How do I set the index to be the data in a column? (And why does using ['symbol', 'date'] as a parameter sometimes give me a new column with those two values, instead of setting the index on the existing values of the columns with those two names?)
After I append, and assuming the old index is correct, do I need to 'update' the index (perhaps using reindex?), or, since I told the dataframe that the index is in a certain column, is my data correctly indexed?
And since my dataframes (will) have indices on the same column name, can I append df_aaa to df_all even if df_all was defined with the columns originally in a different order (say ['symbol', 'date', 'data', 'otherdata'], with symbol as the first column)?
You can just concatenate them and then set the index.
df_aaa = df_aaa.reset_index()
df_all = df_all.reset_index()
df = df_aaa.append(df_all).set_index(['symbol', 'date'])
Note that this will work only if your dataframes have the same columns.
If you must perform multiple appends in the future, the best thing to do would be to get one of them into the shape of the other, perform the concatenation, and reset the index as needed.
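Note that DataFrame.append was removed in pandas 2.0; on current versions the same recipe with pd.concat would look roughly like this (small made-up frames included so it runs standalone):
import pandas as pd

# Small stand-ins for df_aaa and df_all from the question
df_aaa = pd.DataFrame({"date": ["2015/1/1"], "data": [11],
                       "otherdata": [12], "symbol": ["aaa"]}).set_index("date")
df_all = pd.DataFrame({"date": ["2015/1/1"], "data": [31],
                       "otherdata": [31], "symbol": ["bbb"]}).set_index("date")

# pd.concat replaces the removed DataFrame.append
df = pd.concat([df_aaa.reset_index(), df_all.reset_index()])
df = df.set_index(["symbol", "date"])
print(df)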
I'll answer all your questions one by one.
I'm not sure if a multi-index is an index that has multiple 'columns' (or rows), or is it the ability to have more than one index (and any of them could be for multiple columns or rows). Or are both correct?
It depends on which axis you're referring to. Along the rows (0th axis), you have two or more columns forming a MultiIndex; similarly, along the columns (1st axis), two or more rows of column labels form one.
Must I first set the index of both dataframes to a multi-index, so the append will work? (Otherwise I'll have duplicates for different symbols.)
No need. Although you could, not doing so would be simpler in this case.
Do I have to "drop" the existing index before creating the new one?
No, just that the columns must align (column name and number of columns should be the same).
Is there such a thing as a dataframe with data but no index?
No. All rows are indexed. Even if no column is set as the index, the index is a monotonically increasing number (a default RangeIndex). The model here is similar to that of an RDBMS.
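For instance, a frame built without an explicit index gets a default RangeIndex:
import pandas as pd

df = pd.DataFrame({"a": [10, 20]})
print(df.index)  # RangeIndex(start=0, stop=2, step=1)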
Must a (single) index be of unique values?
In general, they must, so rows can be uniquely identified (strictly speaking, pandas tolerates duplicate index values, but many operations, such as reindex, then raise errors). If you have a MultiIndex, each combination of values that makes up the index must be unique.
When do I use which of the following dataframe methods: set_index(), reindex(), reset_index(), set_level, reset_level?
This is a broad question. It depends: when do you want to operate on the index, and what do you want to do with it? Look at the documentation for each one carefully.
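As a rough orientation (a minimal sketch; note that set_level/reset_level are not actual DataFrame methods, the closest thing being MultiIndex.set_levels):
import pandas as pd

df = pd.DataFrame({"symbol": ["aaa", "bbb"], "data": [11, 31]})

df2 = df.set_index("symbol")        # promote a column to the index
df3 = df2.reset_index()             # demote the index back into a column
df4 = df2.reindex(["bbb", "ccc"])   # conform to new labels; 'ccc' gets NaN
print(df4)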
Just append the df's and reset_index(), so you can then set_index() with the keys argument. Here's a one-liner:
df_all = df_all.append(df_aaa).reset_index().set_index(keys=['symbol', 'date'])
And here is a full working sample.
In [1]: import pandas as pd
...: from io import StringIO
...:
In [2]: df_aaa = pd.read_csv(StringIO("""date data otherdata symbol
...: 2015/1/1 11 12 aaa
...: 2015/2/1 21 22 aaa
...: 2015/3/1 31 31 aaa
...: """), sep="\s+", index_col='date')
...:
In [3]: df_all = pd.read_csv(StringIO("""date data otherdata symbol
...: 2015/1/1 31 31 bbb"""), sep="\s+", index_col='date')
...:
In [4]: df_all.append(df_aaa).reset_index().set_index(keys=['symbol', 'date'])
Out[4]:
data otherdata
symbol date
bbb 2015/1/1 31 31
aaa 2015/1/1 11 12
2015/2/1 21 22
2015/3/1 31 31
Here is what I gather from the answers and from digging through the docs:
There is a "default index", which is a row number for each row and is not part of any of the columns.
When merging on that index, there seems to be no need to re-index.
But if I want to change the index after it was made "non-standard", I have to reset_index(), turning it back to the default, and from there I can create the new multi-index (as explained in the revised answer below).
A multi-index is one that has more than one key (i.e., if indexing the rows, then more than one column will be used).
I'm still not sure whether you have to re-index a column after a merge, but it seems you get an automatically generated new "default index", so you have to save the old one, reset the index before the merge (reset_index), and set it again when done.
As for the other question, about the index replacing a column, I'll check and get back here.
This is a follow-up.
