I have a pandas DataFrame and want to add an extra column that holds the ranks of one of the original columns. However, that column has empty cells, and the ranks for those empty cells should be empty as well.
When I use
df['RRanked'] = df['R'].rank(ascending=1,na_option='keep')
it still produces a rank for the empty cells; in this case, the empty cells get the highest ranks.
How can I produce empty ranks for those empty cells?
Thx!
I would coerce the column to numeric; then you can use rank with na_option='keep', which will not rank NaNs.
r = pd.to_numeric(df.R, errors='coerce')
rnk = r.rank(na_option='keep')
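Putting it together, a minimal sketch (assuming the blanks are empty strings and the column is named 'R' as in the question):
import pandas as pd

df = pd.DataFrame({'R': ['3', '1', '', '2', '']})         # hypothetical data
r = pd.to_numeric(df['R'], errors='coerce')                # blanks become NaN
df['RRanked'] = r.rank(ascending=True, na_option='keep')   # NaNs get no rank
df['RRanked'] = df['RRanked'].fillna('')                   # show blanks instead of NaN
print(df)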
Well, I solved it in a not-so-clean way. I managed to replace all those cells with NaN, then used the kind answer by Yefet: df['R'].apply(lambda x: pd.NA if x in ["NaN"] else x).rank(ascending=1). Afterwards, I just replaced the NaNs in the ranks with "". That works.
I have three DataFrames with shapes df1=(176, 5766), df2=(8, 5766), and df3=(16, 5766). Despite the different column names, the columns hold the same data values (the columns in all three are equal). But when I use, for example,
df1.T.drop_duplicates().T
df2.T.drop_duplicates().T
df3.T.drop_duplicates().T
it should produce the same output columns for all three, yet the shapes become
df1=(176, 581), df2=(8, 632), df3=(16, 622).
How can I get rid of this?
Syntax: df.drop_duplicates(subset=None, keep='first', inplace=False)
Parameters:
subset:
subset takes a column label or a list of column labels. Its default value is None. After passing columns, only those columns are considered when looking for duplicates.
keep:
keep controls how duplicate values are treated. It accepts three distinct values: 'first', 'last', and False; the default is 'first'. With 'first', the first occurrence is treated as unique and the remaining identical rows as duplicates.
inplace:
Boolean; if True, duplicate rows are removed from the DataFrame in place rather than returning a new one.
# dropping duplicate values
df1.drop_duplicates(keep=False,inplace=True)
df2.drop_duplicates(keep=False,inplace=True)
df3.drop_duplicates(keep=False,inplace=True)
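For illustration, a small hypothetical frame showing how the three keep values behave:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

print(df.drop_duplicates(keep='first'))  # keeps the first of each duplicate pair
print(df.drop_duplicates(keep='last'))   # keeps the last of each duplicate pair
print(df.drop_duplicates(keep=False))    # drops every row that has a duplicate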
I have a df that has some blanks in it. Whenever there is a blank I want to be able to substitute it with "0000". However, I can't find a way to replace the blanks in my array. Some of the values in the array are numbers, but sometimes the element has multiple numbers delimited by ";". Can anyone help?
If by 'blank' you mean 'NaN' and by '0000' you mean a string of '0000' and not the integer '0', the following should work:
df.loc[df['Numbers'].isnull(), 'Numbers'] = '0000'
What's going on here is the following:
.isnull() is a built-in pandas function that will detect if an entry is a NaN.
df['Numbers'].isnull() will return True only at the locations in the DataFrame where the 'Numbers' column holds a NaN.
df.loc[index, column] will select all the rows (index) that meet the condition -- in this case isnull() == True -- and then replace the values in the column in those rows with whatever is specified, in this case the string '0000'.
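Put together, a minimal sketch (the 'Numbers' column name and the sample values are just assumptions for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Numbers': ['1;2;3', np.nan, '42', np.nan]})
df.loc[df['Numbers'].isnull(), 'Numbers'] = '0000'
print(df)  # the NaN rows now hold the string '0000'; existing values are untouched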
Welcome to Pandas!
So I'm new to pandas and this is my first notebook. I needed to join some columns of my dataframe and after that, I wanted to separate the values so it would be better to visualize them.
To join the columns I used df['Q7'] = df[['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5', 'Q7_Part_6', 'Q7_OTHER']].apply(lambda x: '_'.join(x.dropna().astype(str)), axis=1) and it worked well, but I still needed to separate the values, so I used explode() like df.Q7 = df.Q7.str.split('_').explode('Q7'), and that gave me some empty cells in the DataFrame, like:
[screenshot: the resulting DataFrame with empty cells]
And when I try to visualize the values, the empty ones show up like this:
[screenshot: value counts showing the empty cells]
What could I do to not show these empty cells on the viz?
Edit 1: By the way, they do not appear as null or NaN cells when I run df.isnull().sum() or df.isna().sum().
c = ['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', \
'Q7_Part_5','Q7_Part_6','Q7_OTHER']
df['Q7'] = df[c].apply(lambda x : '_'.join(x.astype(str)), axis=1)
I am not able to replicate your issue, but my best guess is that if you do the above (without dropna()), the dimensions will remain intact and you will get the string 'nan' instead of empty strings.
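If the goal is just to keep those entries out of the counts or the plot, one option is to filter them out first. A minimal sketch, assuming the empty cells are empty strings left over from the split:
q7 = df['Q7'][df['Q7'] != '']        # drop the empty strings before counting
q7.value_counts().plot(kind='bar')   # only the real answers show up in the viz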
I have a dataset with information on cities in the United States and I want to give it a two-level index with the state and the city. I've been trying to use the MultiIndex approach in the documentation that goes something like this.
lists = [list(df['state']), list(df['city'])]
tuples = list(zip(*lists))
index = pd.MultiIndex.from_tuples(tuples)
new_df = pd.DataFrame(df,index=index)
The output is a new DataFrame with the correct index but it's full of np.nan values. Any idea what's going on?
When you reindex a DataFrame with a new index, Pandas operates roughly the following way:
It iterates over the current index and checks whether each index value occurs in the new index.
From the "old" (existing) rows, it keeps only those whose index values are present in the new index.
Rows can be reordered to align with the order of the new index.
If the new index contains values absent from the DataFrame, the corresponding rows hold only NaN values.
Maybe your DataFrame initially has a "standard" index (a sequence of integers starting from 0)? In that case no item of the old index is present in the new index (actually a MultiIndex), so the resulting DataFrame has all rows full of NaNs.
Maybe you should instead set the index to the two columns of interest, i.e. run:
df.set_index(['state', 'city'], inplace=True)
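A minimal sketch of the difference, using a tiny hypothetical frame:
import pandas as pd

df = pd.DataFrame({'state': ['TX', 'TX', 'CA'],
                   'city': ['Austin', 'Dallas', 'LA'],
                   'pop': [950, 1300, 3900]})

# Building a new MultiIndex and passing it to the constructor reindexes:
# the old RangeIndex values (0, 1, 2) are not in the new index, so every row is NaN.
idx = pd.MultiIndex.from_frame(df[['state', 'city']])
print(pd.DataFrame(df, index=idx))

# Setting the index from the existing columns keeps the data.
print(df.set_index(['state', 'city']))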
I have a very large DataFrame with many columns. I want to check all the columns and remove any row containing any instance of the string 'MU'. Some columns have 'MU#1' or 'MU#2', and they sometimes switch places (e.g. 'MU#1' might be in column 1 at index 0 while 'MU#2' is in column 1 at index 1). Initially I tried removing them with this, but it becomes far too cumbersome when I try to do it for both strings:
df_slice = df[(df.phase_2 != 'MU#1') & (df.phase_3 != 'MU#1') & (df.phase_1 != 'MU#1') & (df.phase_4 != 'MU#1') ]
This may work, but I have to repeat this slice a few times with other dataframes and I imagine there is a much simpler route. I also have more columns than what is shown above, but that is just a snippet.
Simply put, all columns need to be checked for 'MU' and the rows with 'MU' need to be removed. Thanks!
You could also try .str.contains() and apply it to the DataFrame. This avoids hardcoding the column names, just in case.
df[df.apply(lambda x: (~x.str.contains('MU', case=True, regex=True)))].dropna()
or
df[~df.stack().str.contains('MU').any(level=0)]
How it works
Option 1
x.str.contains('MU', case=True, regex=True), when used inside df.apply(), acts as a wildcard over every column of the DataFrame, flagging any cell that contains 'MU' (case-sensitive, interpreted as a regular expression).
~ reverses the condition, hence you end up with the rows that do not contain 'MU'.
The resulting DataFrame holds NaN where the condition is not met; .dropna() therefore eliminates those rows.
Option 2
df.stack()  # stacks the DataFrame into a Series with a (row, column) MultiIndex
df.stack().str.contains('MU')  # boolean Series: True where a cell contains the string 'MU'
df.stack().str.contains('MU').any(level=0)  # aggregates per original row (index level 0): True if any cell in that row contains 'MU'
~df.stack().str.contains('MU').any(level=0)  # reverses the selection, keeping only the rows without the string 'MU'
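A small runnable sketch of Option 2 on a hypothetical frame (note: in newer pandas versions any() no longer accepts a level= argument, so groupby(level=0).any() is used here as the equivalent):
import pandas as pd

df = pd.DataFrame({'phase_1': ['MU#1', 'A', 'B'],
                   'phase_2': ['C', 'MU#2', 'D'],
                   'phase_3': ['E', 'F', 'G']})

# For each original row (level 0 of the stacked index), check whether
# any cell contains 'MU', then keep only the rows where none do.
has_mu = df.stack().str.contains('MU').groupby(level=0).any()
print(df[~has_mu])  # only the last row survives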
What we do here is check all the phase columns at once with all():
df = df[df[['phase_1','phase_2','phase_3','phase_4']].ne('MU#1').all(1)]
Update
df = df[(~df[['phase_1','phase_2','phase_3','phase_4']].isin(['MU#1','MU#2'])).all(1)]
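For example, on a hypothetical frame with the four phase columns:
import pandas as pd

df = pd.DataFrame({'phase_1': ['MU#1', 'A', 'X'],
                   'phase_2': ['B', 'C', 'Y'],
                   'phase_3': ['D', 'MU#2', 'Z'],
                   'phase_4': ['E', 'F', 'W']})

cols = ['phase_1', 'phase_2', 'phase_3', 'phase_4']
mask = (~df[cols].isin(['MU#1', 'MU#2'])).all(axis=1)
print(df[mask])  # only the rows where none of the four columns hold MU#1 or MU#2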
This works fine for me.
df[~df.stack().str.contains('Any String').any(level=0)]
Even when searching for a specific string in the DataFrame:
df[df.stack().str.contains('Any String').any(level=0)]
Thanks.