I have a pandas DataFrame and want to add an extra column that holds the ranks of one of the original columns. However, that column has empty cells, and the ranks for those empty cells should be empty as well.
When I use
df['RRanked'] = df['R'].rank(ascending=1,na_option='keep')
it still produces a rank for the empty cells; in this case, the empty cells get the highest ranks.
How can I produce empty ranks for those empty cells?
Thx!
I would coerce the column to numeric; then you can use rank with na_option='keep', which will not rank NaNs.
r = pd.to_numeric(df.R, errors='coerce')
rnk = r.rank(na_option='keep')
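Putting it together, a minimal sketch (assuming the blanks are empty strings and the column is named 'R' as in the question):
import pandas as pd

df = pd.DataFrame({'R': ['3', '1', '', '2', '']})         # hypothetical data
r = pd.to_numeric(df['R'], errors='coerce')                # blanks become NaN
df['RRanked'] = r.rank(ascending=True, na_option='keep')   # NaNs get no rank
df['RRanked'] = df['RRanked'].fillna('')                   # show blanks instead of NaN
print(df)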
Well, I solved it in a not-so-clean way. I managed to replace all those cells with NaN, then used the kind answer by Yefet: df['R'].apply(lambda x: pd.NA if x in ["NaN"] else x).rank(ascending=1). Afterwards, I just replaced the NaNs in the ranks with "". That works.
I have three DataFrames with shapes df1=(176, 5766), df2=(8, 5766), and df3=(16, 5766). Despite the different column names, the columns hold the same data values (the columns in all three are equal). But when I use, for example,
df1.T.drop_duplicates().T
df2.T.drop_duplicates().T
df3.T.drop_duplicates().T
it should produce the same output columns for all three, yet the shapes become
df1=(176, 581), df2=(8, 632), df3=(16, 622).
How can I get rid of this?
Syntax: df.drop_duplicates(subset=None, keep='first', inplace=False)
Parameters:
subset:
subset takes a column label or a list of column labels. Its default value is None. After passing columns, only those columns are considered when looking for duplicates.
keep:
keep controls how duplicate values are treated. It accepts three distinct values: 'first', 'last', and False; the default is 'first'. With 'first', the first occurrence is treated as unique and the remaining identical rows as duplicates.
inplace:
Boolean; if True, duplicate rows are removed from the DataFrame in place rather than returning a new one.
# dropping duplicate values
df1.drop_duplicates(keep=False,inplace=True)
df2.drop_duplicates(keep=False,inplace=True)
df3.drop_duplicates(keep=False,inplace=True)
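For illustration, a small hypothetical frame showing how the three keep values behave:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

print(df.drop_duplicates(keep='first'))  # keeps the first of each duplicate pair
print(df.drop_duplicates(keep='last'))   # keeps the last of each duplicate pair
print(df.drop_duplicates(keep=False))    # drops every row that has a duplicate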
I have a df that has some blanks in it. Whenever there is a blank I want to be able to substitute it with "0000". However, I can't find a way to replace the blanks in my array. Some of the values in the array are numbers, but sometimes the element has multiple numbers delimited by ";". Can anyone help?
If by 'blank' you mean 'NaN' and by '0000' you mean a string of '0000' and not the integer '0', the following should work:
df.loc[df['Numbers'].isnull(), 'Numbers'] = '0000'
What's going on here is the following:
.isnull() is a built-in pandas function that will detect if an entry is a NaN.
df['Numbers'].isnull() will return True only at the locations in the DataFrame where the 'Numbers' column holds a NaN.
df.loc[index, column] will select all the rows (index) that meet the condition -- in this case isnull() == True -- and then replace the values in the column in those rows with whatever is specified, in this case the string '0000'.
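Put together, a minimal sketch (the 'Numbers' column name and the sample values are just assumptions for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Numbers': ['1;2;3', np.nan, '42', np.nan]})
df.loc[df['Numbers'].isnull(), 'Numbers'] = '0000'
print(df)  # the NaN rows now hold the string '0000'; existing values are untouched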
Welcome to Pandas!
So I'm new to pandas and this is my first notebook. I needed to join some columns of my dataframe and after that, I wanted to separate the values so it would be better to visualize them.
To join the columns I used df['Q7'] = df[['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5', 'Q7_Part_6', 'Q7_OTHER']].apply(lambda x: '_'.join(x.dropna().astype(str)), axis=1) and it worked well, but I still needed to separate the values, so I used explode() like df.Q7 = df.Q7.str.split('_').explode('Q7'), and that gave me some empty cells in the DataFrame, like:
[screenshot: the resulting DataFrame with empty cells]
And when I try to visualize the values, the empty ones show up like this:
[screenshot: value counts showing the empty cells]
What could I do to not show these empty cells on the viz?
Edit 1: By the way, they do not appear as null or NaN cells when I run df.isnull().sum() or df.isna().sum().
c = ['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', \
'Q7_Part_5','Q7_Part_6','Q7_OTHER']
df['Q7'] = df[c].apply(lambda x : '_'.join(x.astype(str)), axis=1)
I am not able to replicate your issue, but my best guess is that if you do the above (without dropna()), the dimensions will remain intact and you will get the string 'nan' instead of empty strings.
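If the goal is just to keep those entries out of the counts or the plot, one option is to filter them out first. A minimal sketch, assuming the empty cells are empty strings left over from the split:
q7 = df['Q7'][df['Q7'] != '']        # drop the empty strings before counting
q7.value_counts().plot(kind='bar')   # only the real answers show up in the viz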
I have a dataset with information on cities in the United States and I want to give it a two-level index with the state and the city. I've been trying to use the MultiIndex approach in the documentation that goes something like this.
lists = [list(df['state']), list(df['city'])]
tuples = list(zip(*lists))
index = pd.MultiIndex.from_tuples(tuples)
new_df = pd.DataFrame(df,index=index)
The output is a new DataFrame with the correct index but it's full of np.nan values. Any idea what's going on?
When you reindex a DataFrame with a new index, Pandas operates roughly the following way:
It iterates over the current index and checks whether each index value occurs in the new index.
From the "old" (existing) rows, it keeps only those whose index values are present in the new index.
Rows can be reordered to align with the order of the new index.
If the new index contains values absent from the DataFrame, the corresponding rows hold only NaN values.
Maybe your DataFrame initially has a "standard" index (a sequence of integers starting from 0)? In that case no item of the old index is present in the new index (actually a MultiIndex), so the resulting DataFrame has all rows full of NaNs.
Maybe you should instead set the index to the two columns of interest, i.e. run:
df.set_index(['state', 'city'], inplace=True)
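A minimal sketch of the difference, using a tiny hypothetical frame:
import pandas as pd

df = pd.DataFrame({'state': ['TX', 'TX', 'CA'],
                   'city': ['Austin', 'Dallas', 'LA'],
                   'pop': [950, 1300, 3900]})

# Building a new MultiIndex and passing it to the constructor reindexes:
# the old RangeIndex values (0, 1, 2) are not in the new index, so every row is NaN.
idx = pd.MultiIndex.from_frame(df[['state', 'city']])
print(pd.DataFrame(df, index=idx))

# Setting the index from the existing columns keeps the data.
print(df.set_index(['state', 'city']))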
I have a very large DataFrame with many columns. I want to check all the columns and remove any row containing any instance of the string 'MU'. Some columns have 'MU#1' or 'MU#2', and they sometimes switch places (e.g. 'MU#1' might be in column 1 at index 0 while 'MU#2' is in column 1 at index 1). Initially I tried removing them with this, but it becomes far too cumbersome when I try to do it for both strings:
df_slice = df[(df.phase_2 != 'MU#1') & (df.phase_3 != 'MU#1') & (df.phase_1 != 'MU#1') & (df.phase_4 != 'MU#1') ]
This may work, but I have to repeat this slice a few times with other dataframes and I imagine there is a much simpler route. I also have more columns than what is shown above, but that is just a snippet.
Simply put, all columns need to be checked for 'MU' and the rows with 'MU' need to be removed. Thanks!
You could also try .str.contains() and apply it to the DataFrame. This avoids hardcoding the column names, just in case.
df[df.apply(lambda x: (~x.str.contains('MU', case=True, regex=True)))].dropna()
or
df[~df.stack().str.contains('MU').any(level=0)]
How it works
Option 1
x.str.contains('MU', case=True, regex=True), when used inside df.apply(), acts as a wildcard over every column of the DataFrame, flagging any cell that contains 'MU' (case-sensitive, interpreted as a regular expression).
~ reverses the condition, hence you end up with the rows that do not contain 'MU'.
The resulting DataFrame holds NaN where the condition is not met; .dropna() therefore eliminates those rows.
Option 2
df.stack()  # stacks the DataFrame into a Series with a (row, column) MultiIndex
df.stack().str.contains('MU')  # boolean Series: True where a cell contains the string 'MU'
df.stack().str.contains('MU').any(level=0)  # aggregates per original row (index level 0): True if any cell in that row contains 'MU'
~df.stack().str.contains('MU').any(level=0)  # reverses the selection, keeping only the rows without the string 'MU'
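A small runnable sketch of Option 2 on a hypothetical frame (note: in newer pandas versions any() no longer accepts a level= argument, so groupby(level=0).any() is used here as the equivalent):
import pandas as pd

df = pd.DataFrame({'phase_1': ['MU#1', 'A', 'B'],
                   'phase_2': ['C', 'MU#2', 'D'],
                   'phase_3': ['E', 'F', 'G']})

# For each original row (level 0 of the stacked index), check whether
# any cell contains 'MU', then keep only the rows where none do.
has_mu = df.stack().str.contains('MU').groupby(level=0).any()
print(df[~has_mu])  # only the last row survives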
What we do here is check all the phase columns at once with all():
df = df[df[['phase_1','phase_2','phase_3','phase_4']].ne('MU#1').all(1)]
Update
df = df[(~df[['phase_1','phase_2','phase_3','phase_4']].isin(['MU#1','MU#2'])).all(1)]
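For example, on a hypothetical frame with the four phase columns:
import pandas as pd

df = pd.DataFrame({'phase_1': ['MU#1', 'A', 'X'],
                   'phase_2': ['B', 'C', 'Y'],
                   'phase_3': ['D', 'MU#2', 'Z'],
                   'phase_4': ['E', 'F', 'W']})

cols = ['phase_1', 'phase_2', 'phase_3', 'phase_4']
mask = (~df[cols].isin(['MU#1', 'MU#2'])).all(axis=1)
print(df[mask])  # only the rows where none of the four columns hold MU#1 or MU#2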
This works fine for me.
df[~df.stack().str.contains('Any String').any(level=0)]
Even when searching for a specific string in the DataFrame:
df[df.stack().str.contains('Any String').any(level=0)]
Thanks.