Selecting first row with groupby and NaN columns - python

I'm trying to select the first row of each group of a data frame.
import pandas as pd
import numpy as np
x = [{'id':"a",'val':np.nan, 'val2':-1},{'id':"a",'val':'TREE','val2':15}]
df = pd.DataFrame(x)
#   id   val  val2
# 0  a   NaN    -1
# 1  a  TREE    15
When I try to do this with groupby, I get
df.groupby('id', as_index=False).first()
#   id   val  val2
# 0  a  TREE    -1
The row returned to me is nowhere in the original data frame. Do I need to do something special with NaN values in columns other than the groupby columns?

I found the following on the pandas GitHub site, which appears to be a workaround. It uses the nth() method instead of first():
df.groupby('id', as_index=False).nth(0,dropna=False)
I didn't dig into it much. It seems odd that first() would take the val from a different row, but I haven't found documentation for first() that says whether this is by design.
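For what it's worth, GroupBy.first() returns the first non-null value of each column within each group, which is why the result can mix values from different rows. A minimal sketch contrasting the two calls on the example frame above (exact output formatting may vary between pandas versions):

df.groupby('id', as_index=False).first()   # first non-null value per column
#   id   val  val2
# 0  a  TREE    -1

df.groupby('id', as_index=False).nth(0)    # whole first row of each group, NaN included
#   id  val  val2
# 0  a  NaN    -1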

How do I replace items in one DataFrame with items from another DataFrame?

I have two DataFrames:
df = pd.DataFrame({'ID': ['bumgm001', 'lestj001', 'tanam001',
                          'hellj001', 'chacj001']})
df1 = pd.DataFrame({'playerID': ['bumgama01', 'lestejo01', 'tanakama01',
                                 'hellije01', 'chacijh01'],
                    'retroID': ['bumgm001', 'lestj001', 'tanam001',
                                'hellj001', 'chacj001']})
OR
df            df1
ID            playerID      retroID
'bumgm001'    'bumgama01'   'bumgm001'
'lestj001'    'lestejo01'   'lestj001'
'tanam001'    'tanakama01'  'tanam001'
'hellj001'    'hellije01'   'hellj001'
'chacj001'    'chacijh01'   'chacj001'
Now, my actual DataFrames are a little more complicated than this, but I simplified it here so it's clearer what I'm trying to do.
I would like to take all of the IDs in df and replace them with the corresponding playerIDs from df1.
My final df should look like this:
df
ID
'bumgama01'
'lestejo01'
'tanakama01'
'hellije01'
'chacijh01'
I have tried to do it using the following method:
for row in df.itertuples():  # row[1] == the retroID column
    playerID = df1.loc[df1['retroID'] == row[1], 'playerID']
    df.loc[df['ID'] == row[1], 'ID'].replace(to_replace=df.loc[df['ID'] == row[1], 'ID'],
                                             value=playerID)
The code seems to run just fine, but the retroIDs in df get changed to NaN rather than to the proper playerIDs.
This strikes me as a datatype problem, but I'm not familiar enough with Pandas to diagnose it any further.
EDIT:
Unfortunately, I made my example too simplistic. I have edited it to better represent the issue I'm having. I'm trying to look up an item from the first DataFrame in a second DataFrame, and then replace the item in the first DataFrame with an item from the corresponding row of the second DataFrame. The columns DO NOT have the same name.
You can use the second DataFrame as a dictionary for replacement:
to_replace = df1.set_index('retroID')['playerID'].to_dict()
df['ID'].replace(to_replace, inplace=True)
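A quick check with the example frames from the question, written without inplace to sidestep chained-assignment warnings in newer pandas (output approximately):

to_replace = df1.set_index('retroID')['playerID'].to_dict()
df['ID'] = df['ID'].replace(to_replace)
df['ID']
# 0     bumgama01
# 1     lestejo01
# 2    tanakama01
# 3     hellije01
# 4     chacijh01
# Name: ID, dtype: object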
According to your example, this is what you want:
df['ID'] = df1['playerID']
If the data is not in the same order (row 1 of df does not correspond to row 1 of df1), then use
df['ID'] = df1.set_index('retroID').reindex(df['ID'])['playerID'].values
Credit to Wen for the second approach.
Output
           ID
0   bumgama01
1   lestejo01
2  tanakama01
3   hellije01
4   chacijh01
Let me know if it's correct
OK, I've figured out a solution. As it turns out, my problem was a type problem. I updated my code from:
for row in df.itertuples():  # row[1] == the retroID column
    playerID = df1.loc[df1['retroID'] == row[1], 'playerID']
    df.loc[df['ID'] == row[1], 'ID'].replace(to_replace=df.loc[df['ID'] == row[1], 'ID'],
                                             value=playerID)
to:
for row in df.itertuples():  # row[1] == the retroID column
    playerID = df1.loc[df1['retroID'] == row[1], 'playerID'].values[0]
    df.loc[df['ID'] == row[1], 'ID'].replace(to_replace=df.loc[df['ID'] == row[1], 'ID'],
                                             value=playerID)
This works because playerID is now a scalar value (thanks to .values[0]) rather than a Series, which is not a compatible replacement value for the DataFrame.
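For reference, the row-by-row loop can also be replaced by a single vectorized lookup; here is a minimal sketch using Series.map, assuming the same column names as above (IDs with no match in df1 would become NaN):

mapping = df1.set_index('retroID')['playerID']   # Series: retroID -> playerID
df['ID'] = df['ID'].map(mapping)                 # look up every ID at once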

Python, Pandas: Using isin()-like functionality but do not ignore duplicates in the input list

I am trying to filter an input dataframe (df_in) against a list of indices. The indices list contains duplicates and I want my output df_out to contain all occurrences of a particular index. As expected, isin() gives me only a single entry for every index.
How do I try and not ignore duplicates and get output similar to df_out_desired?
import pandas as pd
import numpy as np
df_in = pd.DataFrame(index=np.arange(4), data={'A':[1,2,3,4],'B':[10,20,30,40]})
indices_needed_list = pd.Series([1,2,3,3,3])
# In the output df, I do not particularly care about the 'index' from the df_in
df_out = df_in[df_in.index.isin(indices_needed_list)].reset_index()
# With isin, as expected, I only get a single entry for each occurrence of an index in indices_needed_list
# What I am trying to get is an output df that repeats each df_in index as many times as it occurs in indices_needed_list
temp = df_out[df_out['index'] == 3]
# This is what I would like to try and get
df_out_desired = pd.concat([df_out, df_out[df_out['index']==3], df_out[df_out['index']==3]])
Thanks!
Check reindex
df_out_desired = df_in.reindex(indices_needed_list)
df_out_desired
Out[177]:
   A   B
1  2  20
2  3  30
3  4  40
3  4  40
3  4  40
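Since the question notes that the original index from df_in does not matter, a reset_index on the reindexed frame (a small follow-up sketch) gives the same layout as df_out in the question:

df_out_desired = df_in.reindex(indices_needed_list).reset_index()
#    index  A   B
# 0      1  2  20
# 1      2  3  30
# 2      3  4  40
# 3      3  4  40
# 4      3  4  40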

How to add a prefix to values of a column if a condition is met

I have a DataFrame with certain columns and rows, and I need to add a prefix to the values of one of the columns if they meet a certain condition.
df = pd.DataFrame({'col':['a',0,2,3,5],'col2':['PFD_1','PFD_2','PFD_3','PFD_4','PFD_5']})
Samples=pd.DataFrame({'Sam':['PFD_1','PFD_5']})
I need to add a prefix to df.col2 based on the values in the Samples DataFrame, and I tried it with np.where as follows:
df['col2'] = np.where(df.col2.isin(samples.Sam),'Yes' + df.col2, 'Non_'+ df.col2)
which throws an error:
TypeError: can only perform ops with scalar values
It doesn't return what I am asking for and throws errors.
In the end, the DataFrame should look like this:
>>> df.head()
col       col2
  a  Yes_PFD_1
  0   no_PFD_2
  2   no_PFD_3
  3   no_PFD_4
  5  Yes_PFD_5
Your code worked fine for me once I changed the capitalization of 'samples' to 'Samples':
import pandas as pd
import numpy as np
df = pd.DataFrame({'col':['a',0,2,3,5],'col2': ['PFD_1','PFD_2','PFD_3','PFD_4','PFD_5']})
Samples=pd.DataFrame({'Sam':['PFD_1','PFD_5']})
df['col2'] = np.where(df.col2.isin(Samples.Sam),'Yes' + df.col2, 'Non_'+ df.col2)
df['col2']
Outputs:
0     YesPFD_1
1    Non_PFD_2
2    Non_PFD_3
3    Non_PFD_4
4     YesPFD_5
Name: col2, dtype: object
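One small follow-up: the desired output in the question uses 'Yes_' with an underscore. Running the same np.where call on the original df with 'Yes_' instead of 'Yes' (a sketch) produces that form:

df['col2'] = np.where(df.col2.isin(Samples.Sam), 'Yes_' + df.col2, 'Non_' + df.col2)
# 0    Yes_PFD_1
# 1    Non_PFD_2
# 2    Non_PFD_3
# 3    Non_PFD_4
# 4    Yes_PFD_5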

Data selection using pandas

I have a file where the separator (delimiter) is ';'. I read that file into a pandas DataFrame df. Now I want to select some rows from df using a criterion from column c in df. The format of the data in column c is as follows:
[0]science|time|boot
[1]history|abc|red
and so on...
I have another list of words L, which has values such as
[history, geography,....]
Now, if I split the text in column c on '|', then I want to select those rows from df, where the first word does not belong to L.
Therefore, in this example, I will select df[0] but will not choose df[1], since history is present in L and science is not.
I know I can write a for loop and iterate over each object in the dataframe, but I was wondering if I could do something in a more compact and efficient way.
For example, we can do:
df.loc[df['column_name'].isin(some_values)]
I have this:
df = pd.read_csv(path, sep=';', header=None, error_bad_lines=False, warn_bad_lines=False)
dat = df.ix[:, c].str.split('|')
But, I do not know how to index 'dat'. 'dat' is a Pandas Series, as follows:
0 [science, time, boot]
1 [history, abc, red]
....
I tried indexing dat as follows:
dat.iloc[:][0]
But, it gives the entire series instead of just the first element.
Any help would be appreciated.
Thank you in advance.
Here is an approach:
Data
df = pd.DataFrame({'c':['history|science','science|chemistry','geography|science','biology|IT'],'col2':range(4)})
Out[433]:
                   c  col2
0    history|science     0
1  science|chemistry     1
2  geography|science     2
3         biology|IT     3
lst = ['geography', 'biology', 'IT']
Resolution
You can use a list comprehension:
df.loc[pd.Series([x.split('|')[0] not in lst for x in df.c.tolist()])]
Out[444]:
                   c  col2
0    history|science     0
1  science|chemistry     1
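A vectorized alternative (a sketch using the same df and lst) relies on the .str accessor, so no Python-level list comprehension is needed:

mask = ~df['c'].str.split('|').str[0].isin(lst)   # True where the first word is not in lst
df.loc[mask]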

How can you check if a column in a DataFrame is stale?

What is the fastest way to query for staleness (unvarying data) in a DataFrame column, so that it would return the 'Stale' column?
As example:
from pandas import DataFrame
from numpy.random import randn
df = DataFrame(randn(50, 5))
df['Stale'] = 100.0
will yield a df similar to the following:
          0         1         2         3         4  Stale
0 -0.064293  1.226319 -1.162909 -0.574240 -0.547402  100.0
1  0.529428  0.587148  0.367549  0.066041 -0.071709  100.0
2 -0.112633  0.217315  0.810061 -0.610718  0.179225  100.0
3  0.513706 -2.300195 -0.895974  0.853926 -1.604018  100.0
4  0.410546  0.641980  0.611272  1.121002 -1.082460  100.0
And I'd like to get the 'Stale' column returned. Right now I am doing:
df.columns[df.std() == 0.0], which works but is probably not very efficient.
This:
df.columns[df.std() == 0.0]
returns the 'Stale' index because the standard deviation of the stale column would be zero.
If you define "staleness" as unvarying data, df.var() == 0 is slightly faster (probably because you don't need to take the square root). It also occurred to me to check df.max() == df.min() but that's actually slower.
To return the column using this information, do this:
df[df.columns[df.var() == 0.0]]
How about:
if 'Stale' in df.columns:  # test if you have a column named 'Stale'
    _df = df.ix[:, df.columns != 'Stale']
    # do something on the DataFrame without the 'Stale' column
else:
    _df = df
    # do something with the DataFrame directly
You have the following options that I can think of:
df.ix[:,df.columns!='Stale'] will return a view of the DataFrame without the 'Stale' column and
df.ix[:,df.columns=='Stale'] will return 'Stale' column as a DataFrame, if it is in the dataframe. An empty DataFrame otherwise.
df.get('Stale') returns the 'Stale' column as a Series; if the column is not there, it returns None.
You can't just do df['Stale'], because if the column is not there, a KeyError will be raised.
I suggest using the shift method of the pandas DataFrame:
df == df.shift()
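To turn that elementwise comparison into the actual column selection (a small sketch; the first row is skipped because shifting makes it compare against NaN):

stale_cols = df.columns[(df == df.shift()).iloc[1:].all()]
# Index(['Stale'], dtype='object')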
Note: I almost never comment on Stack Overflow.
