Can't evaluate column for empty values - python

I have read 20+ threads on this, and am still coming up empty (no pun intended).
I have a pandas dataframe df_s, which has a column that contains dates at iloc[:,8]. I am trying to add a new column to the dataframe with a value (yes/no) based on whether there is a value in the other column or not.
This is what I have been trying:
CDRFormUp = []
for row in df_s.iloc[:, 8]:
    if row == "":
        CDRFormUp.append('No')
    else:
        CDRFormUp.append('Yes')
df_s['CDR Form Up'] = CDRFormUp
CDRFormUp would be the new column. I'm looping over every row in the dataframe and checking whether the value in the column is anything.
I have tried...
if row <>"":
if row == "":
if row is None:
if row:
if row>0:
Nothing is working. The column contains dates and empty cells and text. For example, the value in this column in the first row is "CDF Form", in the second row it is blank, in the third row it is "4865" or something like that.
If I set the iloc to a different column that just contains country names, and set the condition to row == "Italy", it properly adds the "Yes" or "No" to the new column for each row...so it's not a wrong iloc or something else.
Any help would be incredibly appreciated.
Thanks!

You can use np.where with Pandas dataframes.
import numpy as np
import pandas as pd
df_s = pd.DataFrame(np.random.randint(1, 10, (5, 10)))
df_s.iloc[1, 8] = ''
df_s.iloc[3, 8] = np.nan
# 'No' where column 8 is an empty string or NaN, 'Yes' otherwise
df_s['CDRFormUp'] = np.where(df_s.iloc[:, 8].mask(df_s.iloc[:, 8].str.len() == 0).isnull(), 'No', 'Yes')
print(df_s)
Output:
   0  1  2  3  4  5  6  7    8  9 CDRFormUp
0  6  5  5  5  9  3  3  5    3  9       Yes
1  5  4  7  3  9  6  8  9       9        No
2  5  2  2  7  7  6  3  2    5  2       Yes
3  8  2  1  9  7  3  7  8  NaN  8        No
4  4  4  1  5  3  5  9  4    4  9       Yes
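Applied back to the original frame, a minimal sketch (assuming the blank cells could be either empty strings or NaN, which the question doesn't confirm):
col = df_s.iloc[:, 8]
# 'No' when the cell is missing or only whitespace, 'Yes' otherwise
df_s['CDR Form Up'] = np.where(col.isna() | col.astype(str).str.strip().eq(''), 'No', 'Yes')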

I suspect you have elements with white space.
Consider the dataframe df_s
df_s = pd.DataFrame([
    [1, 'a', 'Yes'],
    [2, '', 'No'],
    [3, ' ', 'No']
])
df_s
   0  1    2
0  1  a  Yes
1  2      No
2  3      No
Both rows 1 and 2 in column 1 have what look like blank strings. But they aren't:
df_s.iloc[:, 1] == ''
0    False
1     True
2    False
Name: 1, dtype: bool
You may want to consider seeing if the entire thing is white space or stripping white space first.
Option 1
all white space
df_s.iloc[:, 1].str.match(r'^\s*$')
0    False
1     True
2     True
Name: 1, dtype: bool
Which we can convert to yes/no with
df_s.iloc[:, 1].str.match(r'^\s*$').map({True: 'no', False: 'yes'})
0    yes
1     no
2     no
Name: 1, dtype: object
Add a new column
df_s.assign(
    CDRFormUp=df_s.iloc[:, 1].str.match(r'^\s*$').map({True: 'no', False: 'yes'})
)
   0  1    2 CDRFormUp
0  1  a  Yes       yes
1  2      No        no
2  3      No        no
Option 2
strip white space then check if empty
df_s.iloc[:, 1].str.strip() == ''
0    False
1     True
2     True
Name: 1, dtype: bool
Add new column
df_s.assign(
    CDRFormUp=df_s.iloc[:, 1].str.strip().eq('').map({True: 'no', False: 'yes'})
)
   0  1    2 CDRFormUp
0  1  a  Yes       yes
1  2      No        no
2  3      No        no
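One caveat (an assumption about the original data, not something shown in the question): if the blank cells are actually NaN rather than strings, .str.strip() leaves them as NaN, which does not compare equal to '', so fill them first:
df_s.iloc[:, 1].fillna('').str.strip().eq('').map({True: 'no', False: 'yes'})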

Related

pandas ascend sort multiple columns but reverse sort one column

I have a pandas DataFrame that has a little over 100 columns.
There are about 50 columns that I want to sort ascending and then there is one column (a date_time column) that I want to reverse sort.
How do I go about achieving this? I know I can do something like...
df = df.sort_values(by = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time'], ascending=[True, True, True, True,... False])
... but I am trying to avoid having to type 'True' 50 times.
Just wondering if there is a quick hand way of doing this.
Thanks.
Dan
You can use:
cols = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time']
df.sort_values(by=cols, ascending=[True]*49+[False])
Or, for a programmatic variant for which you don't need to know the position of the False, using numpy:
cols = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time']
df.sort_values(by=cols, ascending=np.array(cols)!='date_time')
It should go something like this.
to_be_reserved = "COLUMN_TO_BE_RESERVED"
df = df.sort_values(by=[col for col in df.columns if col != to_be_reserved], ignore_index=True)
# note: this sorts the reserved column on its own, independently of the other columns
df[to_be_reserved] = df[to_be_reserved].sort_values(ascending=False, ignore_index=True)
You can also use filter if your 49 columns have a regular pattern:
# if you have a column name pattern
cols = df.filter(regex='^(column_|date_time)').columns.tolist()
ascending_false = ['date_time']
ascending = [c not in ascending_false for c in cols]
df.sort_values(by=cols, ascending=ascending)
Example:
>>> df
   column_0  column_1  date_time  value  other_value  another_value
0         4         2          6      6            1              1
1         4         4          0      6            0              2
2         3         2          6      9            0              7
3         9         2          1      7            4              7
4         6         9          2      4            4              1
>>> df.sort_values(by=cols, ascending=ascending)
   column_0  column_1  date_time  value  other_value  another_value
2         3         2          6      9            0              7
0         4         2          6      6            1              1
1         4         4          0      6            0              2
4         6         9          2      4            4              1
3         9         2          1      7            4              7

Duplicate a single row at index?

In the past hour I was searching here and couldn't find a very simple thing I need to do: duplicate a single row at index x and insert it at index x+1.
df
a b
0 3 8
1 2 4
2 9 0
3 5 1
copy index 2 and insert it as is in the next row:
a b
0 3 8
1 2 4
2 9 0
3 9 0 # new row
4 5 1
What I tried is concat (with my own column names), which made a mess.
line = pd.DataFrame({"date": date, "event": None}, index=[index+1])
return pd.concat([df.iloc[:index], line, df.iloc[index:]]).reset_index(drop=True)
How do I simply duplicate a full row at a given index?
You can use repeat(). Fill the dictionary with the index as the key and the number of extra rows you would like to add as the value. This can work for multiple values.
d = {2:1}
df.loc[df.index.repeat(df.index.map(d).fillna(0)+1)].reset_index()
Output:
index a b
0 0 3 8
1 1 2 4
2 2 9 0
3 2 9 0
4 3 5 1
Got it.
df.loc[index+0.5] = df.loc[index].values
return df.sort_index().reset_index(drop=True)
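For reference, a concat-based variant of what the question originally attempted, as a sketch (the helper name duplicate_row is just for illustration, and index is assumed to be the positional row to copy):
def duplicate_row(df, index):
    # keep rows up to and including index, repeat that row once, then append the rest
    return pd.concat([df.iloc[:index + 1], df.iloc[[index]], df.iloc[index + 1:]]).reset_index(drop=True)
duplicate_row(df, 2)
   a  b
0  3  8
1  2  4
2  9  0
3  9  0
4  5  1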

How to search subset of a pandas dataframe for the row in which a value occurs

I have two dataframes, e.g.
import pandas as pd
import numpy as np
from random import shuffle
df_data = pd.DataFrame(data=np.random.randint(low=0, high=10, size=(10,3)), columns=['A', 'B', 'C'])
keys = np.arange(0, 10)
shuffle(keys)
df_data['keys'] = keys
key_data = pd.DataFrame(data=np.reshape(np.arange(1,10), (3,3)), columns=['Key_col1', 'Key_col2', 'Key_col3'])
key_data['Timestamp'], key_data['Info'] = ['Mon', 'Wed', 'Fri'], [13, 2, 47]
Which returns, something like this:
A B C keys
0 3 9 2 5
1 7 9 4 7
2 9 6 6 0
3 9 9 0 9
4 8 5 8 6
5 2 5 7 3
6 5 1 2 4
7 3 9 6 2
8 4 2 3 8
9 6 5 5 1
and this:
Key_col1 Key_col2 Key_col3 Timestamp Info
0 1 2 3 Mon 13
1 4 5 6 Wed 2
2 7 8 9 Fri 47
I'd like to use the 'keys' column in the first dataframe to search only the Key columns in the second dataframe (i.e. Key_col1, Key_col2, Key_col3), because the 'Info' column may contain values that match keys.
I'll then add the columns Timestamp and Info to the row in which there is a match for the key.
Expected output for row 0 would be this:
A B C keys Timestamp Info
0 3 9 2 5 Wed 2
My approach is to first search a subset of my key_data for a value:
key_data.iloc[:, 0:3] == 2
OUT
Key_col1 Key_col2 Key_col3
0 False True False
1 False False False
2 False False False
In my next step I try to return only the row where the value True occurs using df.loc
key_data.loc[:, key_data.iloc[:, 0:3] == 2]
But this results in the error ValueError: Cannot index with multidimensional key
Can somebody help me to return the row in which the value True occurs so that I can use this index for selecting where to append my data?
Thanks
EDIT: The keys are unique and all of them are present in exactly 1 of the 3 key columns.
This works for you, just rename the columns:
new_df = pd.merge(df_data, key_data, how='right', left_on=['keys', 'keys', 'keys'], right_on=['Key_col1', 'Key_col2', 'Key_col3'])
new_df = new_df.dropna(axis=1, how='all')
Can somebody help me to return the row in which the value True occurs so that I can use this index for selecting where to append my data?
The answer to this question is key_data.loc[(key_data.iloc[:, 0:3] == 2).any(axis=1)], but for your larger goal, doing something with merge as Rahul Agarwal suggests would be better.
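For completeness, a minimal sketch of that merge-based idea (it uses melt, which is not in the original answers, and relies on the edit saying each key appears in exactly one of the three key columns):
# reshape key_data so each key value gets its own row, then join on it
long_keys = key_data.melt(id_vars=['Timestamp', 'Info'],
                          value_vars=['Key_col1', 'Key_col2', 'Key_col3'],
                          value_name='keys').drop(columns='variable')
new_df = df_data.merge(long_keys, on='keys', how='left')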

Python Pandas: Create Column That Acts As A Conditional Running Variable

I'm trying to create a new dataframe column that acts as a running variable that resets to zero or "passes" under certain conditions. Below is a simplified example of what I'm looking to accomplish. Let's say I'm trying to quit drinking coffee and I'm tracking the number of days in a row I've gone without drinking any. On days where I forgot to note whether I drank coffee, I put "forgot", and my tally is not affected.
Below is how I'm currently accomplishing this, though I suspect there's a much more efficient way of going about it.
Thanks in advance!
import pandas as pd
Day = [1,2,3,4,5,6,7,8,9,10,11]
DrankCoffee = ['no','no','forgot','yes','no','no','no','no','no','yes','no']
df = pd.DataFrame(list(zip(Day,DrankCoffee)), columns=['Day','DrankCoffee'])
df['Streak'] = 0
s = 0
for index, row in df.iterrows():
    if row['DrankCoffee'] == 'no':
        s += 1
    if row['DrankCoffee'] == 'yes':
        s = 0
    else:
        pass
    df.loc[index, 'Streak'] = s
You can use groupby.transform.
For each streak, what you're looking for is something like this:
def my_func(group):
    return (group == 'no').cumsum()
You can split the data into the separate streaks with a simple comparison and cumsum:
streak = (df['DrankCoffee'] == 'yes').cumsum()
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 2
10 2
then apply the transform
df['Streak'] = df.groupby(streak)['DrankCoffee'].transform(my_func)
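Putting those pieces together, the whole answer reads as below (the Streak values in the comment are what this should yield for the example data):
def my_func(group):
    return (group == 'no').cumsum()
streak = (df['DrankCoffee'] == 'yes').cumsum()   # a new group starts at each 'yes'
df['Streak'] = df.groupby(streak)['DrankCoffee'].transform(my_func)
# df['Streak'] -> [1, 2, 2, 0, 1, 2, 3, 4, 5, 0, 1]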
You first need to map DrankCoffee to [0, 1] (based on my understanding, 'yes' and 'forgot' should be 0 and 'no' should be 1), then do a groupby cumsum. The cumsum of the 'yes' flags creates the group key: each 'yes' starts a new round of counting.
df.DrankCoffee.replace({'no':1,'forgot':0,'yes':0}).groupby((df.DrankCoffee=='yes').cumsum()).cumsum()
Out[111]:
0 1
1 2
2 2
3 0
4 1
5 2
6 3
7 4
8 5
9 0
10 1
Name: DrankCoffee, dtype: int64
Use:
df['Streak'] = df.assign(streak=df['DrankCoffee'].eq('no'))\
                 .groupby(df['DrankCoffee'].eq('yes').cumsum())['streak'].cumsum().astype(int)
Output:
Day DrankCoffee Streak
0 1 no 1
1 2 no 2
2 3 forgot 2
3 4 yes 0
4 5 no 1
5 6 no 2
6 7 no 3
7 8 no 4
8 9 no 5
9 10 yes 0
10 11 no 1
First, create the streak increment: True when DrankCoffee is 'no'.
Next, create the streak groups: each 'yes' starts a new streak, via cumsum().
Lastly, use cumsum() within each streak group to count the increments.

Select row in DataFrame based on values in multiple rows

I've got a DataFrame and I'd like to select rows where in one column they have a certain value, AND the row above has a certain value in another column. How do I do this without a for loop?
For example:
df = pd.DataFrame({'one': [1,2,3,4,1,2,3,4], 'two': [1,2,3,4,5,6,7,8]})
Where I'd like to find the row where df.one on that row equals 1, and df.two on the row above equals 4, so in the example row number 4 with values [1,5].
You can try shift with boolean indexing:
print(df)
one two
0 1 1
1 2 2
2 3 3
3 4 4
4 1 5
5 2 6
6 3 7
7 4 8
print((df.one == 1) & (df.two.shift() == 4))
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
dtype: bool
print(df[(df.one == 1) & (df.two.shift() == 4)])
one two
4 1 5
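If only the index of the matching row is needed rather than the full row (a small add-on, not part of the original answer), the same mask works:
idx = df.index[(df.one == 1) & (df.two.shift() == 4)]
# idx contains [4] for the example above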
