I have a DataFrame with 259399 rows and one column, called hfreq. A single row contains a NaN value and I want to find it. I thought this would be easy and tried hfreq[hfreq.isnull()], but as you can see it doesn't help:
What am I doing wrong and how is it done correctly?
Edit: For clarity, this is what my DataFrame looks like:
There is only one NaN value hidden somewhere in the middle, and I want to find out where it is, i.e. get its index.
Use the following code:
hfreq.loc[hfreq['value'].isnull()]
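A minimal sketch of that lookup, on a made-up single-column frame (the column name 'value' is taken from the answer, the data is invented):

```python
import numpy as np
import pandas as pd

# A small single-column frame with one hidden NaN, standing in for hfreq
hfreq = pd.DataFrame({"value": [1.2, 3.4, np.nan, 5.6]})

# Boolean-mask the rows where 'value' is null; the surviving index is the answer
missing = hfreq.loc[hfreq["value"].isnull()]
print(missing.index.tolist())  # [2]
```

The key point is to call isnull() on the column (a Series), not on the whole DataFrame, so the boolean mask selects rows rather than masking every cell.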
I'm learning Python 3, and a small goal of mine has been to make a script that tells me if there are duplicates between 3 columns with 171 rows. Basically, I want it to catch "Apple" if it appears more than once in any of the rows/columns. The data in the cells are strings, which I want to put in a list.
The following code tells me Empty DataFrame and only returns my columns. I am able to see my rows when I use df.head(x), and I can see a Boolean False returned for all of the strings. I think it's checking for duplicates per row across the 3 columns, but what I need is to return any string that appears more than once in any column.
import pandas as pd
df = pd.read_excel(r"C:\Repo\Repo\My_Excel_File.xlsx")
dups = df[df.duplicated()]
print(dups)
If anyone is able to give me a hint toward the right direction to figuring this out, it would be really appreciated; I don't necessarily want an answer.
I will report back with progress if anyone is interested.
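One possible direction, sketched on a toy stand-in for the spreadsheet (the column names and fruit values are made up): df.duplicated() only flags whole duplicate rows, so flatten all three columns into one Series first and count occurrences.

```python
import pandas as pd

# Toy stand-in for the 3-column spreadsheet (column names are assumptions)
df = pd.DataFrame({
    "col1": ["Apple", "Pear", "Plum"],
    "col2": ["Cherry", "Apple", "Grape"],
    "col3": ["Fig", "Kiwi", "Pear"],
})

# stack() flattens every cell into one Series; value_counts() tallies each string
counts = df.stack().value_counts()

# Keep only the strings that occur more than once, as a list
dups = counts[counts > 1].index.tolist()
print(sorted(dups))  # ['Apple', 'Pear']
```

This catches "Apple" no matter which column or row it appears in, which df.duplicated() (row-wise comparison) cannot do.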
I am doing some work with pandas in Python and came across a dataset that needs to be cleaned.
It looks like this:
I need to merge the rows at index 0, 1, 2, and 3 into a single row, skipping NaN values. After doing this, I also want to remove the default header and make the newly made row the column names.
I tried using a groupby operation in pandas, but nothing happened.
Any idea on this?
Thanks
Akhi
pd.DataFrame([df.aggregate(lambda x: " ".join(str(e).strip() for e in x if not pd.isna(e))).tolist()], columns=[<add_new_list here>])
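A minimal sketch of what that aggregate step does, using an invented frame whose header pieces are split across the first rows (the column values here are assumptions, not the asker's data):

```python
import numpy as np
import pandas as pd

# Toy stand-in: the real header is split across the first rows, with NaN gaps
df = pd.DataFrame({
    0: ["Sales", np.nan, "Total", np.nan],
    1: [np.nan, "Region", np.nan, "Name"],
})

# For each column, join the non-NaN pieces into one string -> the new header row
new_header = df.aggregate(
    lambda x: " ".join(str(e).strip() for e in x if not pd.isna(e))
)
print(new_header.tolist())  # ['Sales Total', 'Region Name']
```

The resulting list can then be passed as the columns argument when rebuilding the frame, as the answer's one-liner does.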
I want to check whether either column contains a NaN; if so, the value in my new column should be "na", else 1.
I was able to get my code to work when checking a single column using .isnull(), but I am unsure how to combine two. I tried using or, as seen below, but it did not work. I know I can make the code a bit messy by testing one condition, then checking the next, and producing the outcome I want from that, but I was hoping to keep the code cleaner by using some sort of any/or function; I just don't know how to do that.
temp_df["xl"] = np.where(temp_df["x"].isnull() or temp_df["y"].isnull(), "na",1)
You can test both columns at once with any along the columns; see the documentation link for more detail:
temp_df["xl"] = np.where(temp_df[['x','y']].isnull().any(axis=1), "na", 1)
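A small runnable sketch of that line, with made-up data (the column names 'x', 'y', and 'xl' are from the question):

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN in each of the two columns
temp_df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, np.nan]})

# .any(axis=1) is True for a row if *either* selected column is null there;
# np.where then maps True -> "na" and False -> 1
temp_df["xl"] = np.where(temp_df[["x", "y"]].isnull().any(axis=1), "na", 1)
print(temp_df["xl"].tolist())  # ['1', 'na', 'na']
```

Note that Python's or cannot combine two boolean Series (it raises "truth value is ambiguous"); the Series-level any(axis=1), or the element-wise | operator, is the right tool here. Also note np.where promotes the mixed "na"/1 results to strings.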
I am looking to delete rows in a DataFrame that is imported into Python by pandas.
If you see the sheet below, the first column has the same name multiple times. The condition is: if a value in the first column re-appears in a later row, delete that row; otherwise keep the row in the DataFrame.
My final output should look like the following:
Presently I am doing it by converting each column into a list and deleting entries by index values. I am hoping there is an easier way than this workaround.
df.drop_duplicates([df.columns[0]])
should do the trick.
Try the following code:
df.drop_duplicates(subset='columnName', keep='first', inplace=True)
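A quick sketch of both answers on invented data (the column names and values are assumptions; the real sheet's first column is whatever df.columns[0] resolves to):

```python
import pandas as pd

# Toy frame: the first column repeats a name on a later row
df = pd.DataFrame({
    "name": ["John", "John", "Mary"],
    "score": [10, 20, 30],
})

# Keep only the first row for each value of the first column, drop the repeats
deduped = df.drop_duplicates(subset=df.columns[0], keep="first")
print(deduped["name"].tolist())   # ['John', 'Mary']
print(deduped["score"].tolist())  # [10, 30]
```

keep="first" (the default) retains the earliest occurrence, which matches the asker's "if the value re-appears in a later row, delete that row" condition.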
I have a dataframe as below.
I want p-value of Mann-whitney u test by comparing each column.
As an example, I tried below.
from scipy.stats import mannwhitneyu
mannwhitneyu(df['A'], df['B'])
This results in the following values.
MannwhitneyuResult(statistic=3.5, pvalue=1.8224273379076809e-05)
I wondered whether NaN values affected the result, so I made the following df2 and df3 DataFrames as described in the figure and tried:
mannwhitneyu(df2, df3)
This resulted in
MannwhitneyuResult(statistic=3.5, pvalue=0.00025322465545184154)
So I think NaN values affected the result.
Does anyone know how to ignore NaN values in the dataframe?
You can use df.dropna(); you can find extensive documentation here: dropna.
As per your example, the syntax would go something like this:
mannwhitneyu(df['A'].dropna(), df['B'])
As you can see, there is no argument in the mannwhitneyu function allowing you to specify its behavior when it encounters NaN values, but if you inspect its source code, you can see that it doesn't take NaN values into account when calculating some of the key values (n1, n2, ranked, etc.). This makes me suspicious of any results that you'd get when some of the input values are missing. If you don't feel like implementing the function yourself with NaN-ignoring capabilities, probably the best thing to do is to either create new arrays without missing values as you've done, or use df['A'].dropna() as suggested in the other answer.
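Putting the two answers together, a minimal sketch with invented data (the real df is not shown, so the values below are assumptions; dropping NaN from each column independently is the conservative choice, since mannwhitneyu itself does not skip missing values):

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# Toy frame with a NaN hidden in column 'A' (values are made up)
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 5.0],
    "B": [2.0, 3.0, 4.0, 5.0, 6.0],
})

# Remove missing values from each sample before the test; the two samples
# need not have equal lengths for the Mann-Whitney U test
result = mannwhitneyu(df["A"].dropna(), df["B"].dropna())
print(result.statistic, result.pvalue)
```

Because the samples are passed separately, dropping NaN from one column does not disturb the other, unlike a frame-wide df.dropna(), which would discard the whole row.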