Removing blank rows - python

I have input data as below:

Case ID    Name
1
1          rohit
1          Sakshi
2
2
2
So basically the input data has two types of Case IDs: one where a Case ID has both blank and non-blank rows, and another where the Case ID has only blank rows.
I am trying to get the below output:

Case ID    Name
1          rohit
1          Sakshi
2

That is, if a Case ID has both blank and non-blank values, show only the non-blank rows; if all of its values are blank, keep a single row with a blank value in the 'Name' column.

One way (not efficient, but flexible) is to use the split-apply-combine approach with a custom function:

def drop_empty(df0):
    df0 = df0.copy()  # avoid the "A value is trying to be set on a copy of a slice" warning
    if df0['Name'].count() != 0:
        # group has at least one non-blank name: keep only complete rows
        df0.dropna(thresh=2, inplace=True)
    else:
        # group is all blank: collapse it to a single row
        df0.drop_duplicates(inplace=True)
    return df0[['Name']]

df.groupby('Case ID').apply(drop_empty).reset_index()[['Case ID', 'Name']]
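For larger frames, a fully vectorized sketch of the same rule is possible (assuming the blanks are NaN; the frame construction below is illustrative):

import numpy as np
import pandas as pd

# sample input, assuming the blanks are NaN
df = pd.DataFrame({'Case ID': [1, 1, 1, 2, 2, 2],
                   'Name': [np.nan, 'rohit', 'Sakshi', np.nan, np.nan, np.nan]})

# True for rows whose Case ID has at least one non-blank Name
has_name = df.groupby('Case ID')['Name'].transform('count') > 0

# keep non-blank rows of mixed groups, plus the first row of all-blank groups
keep = (has_name & df['Name'].notna()) | (~has_name & ~df.duplicated('Case ID'))
print(df[keep])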

You can also try something like this. Suppose your input dataframe looks like:

   Case ID    Name
0        1     NaN
1        1   rohit
2        1  Sakshi
3        2     NaN
4        2     NaN
5        2     NaN

indx = df.groupby('Case ID')['Name'].apply(lambda x: x.dropna() if x.count() else x.head(1))
df = df.loc[indx.index.get_level_values(1)]

>>> df
   Case ID    Name
1        1   rohit
2        1  Sakshi
3        2     NaN

Pandas: How to delete rows where 2 conditions in 2 different columns need to be met

Let's say I have a data frame that looks like this. I want to delete everything with a certain ID if all of its Name values are empty. In this example, every Name value is missing in the rows where ID is 2. Even if I had 100 rows with ID 3 and only one Name value present, I would want to keep them.
ID  Name
1   NaN
1   Banana
1   NaN
2   NaN
2   NaN
2   NaN
3   Apple
3   NaN
So the desired output looks like this:
ID  Name
1   NaN
1   Banana
1   NaN
3   Apple
3   NaN
Everything I tried so far was wrong. In this attempt, I tried to count every NaN value that belongs to an ID, but it still returns too many rows. This is the closest I got to my desired outcome.
df = df[(df['ID']) & (df['Name'].isna().sum()) != 0]
You want to exclude rows from IDs that have as many NaNs as they have rows. Therefore, you can group by ID and count their number of rows and number of NaNs.
Based on this result, you can get the IDs from people whose row count equals their NaN count and exclude them from your original dataframe.
# Declare column that indicates if `Name` is NaN
df['isna'] = df['Name'].isna().astype(int)
# Declare a dataframe that counts the rows and NaNs per `ID`
counter = df.groupby('ID').agg({'Name':'size', 'isna':'sum'})
# Get IDs of people who have as many NaNs as they have rows
exclude = counter[counter['Name'] == counter['isna']].index.values
# Exclude these IDs from your data
df = df[~df['ID'].isin(exclude)]
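A more compact variant of the same idea, as a sketch (the frame construction mirrors the sample above): count ignores NaNs, so a groupby/transform count per ID is zero exactly for the all-NaN groups.

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 3, 3],
                   'Name': [np.nan, 'Banana', np.nan,
                            np.nan, np.nan, np.nan,
                            'Apple', np.nan]})

# keep rows whose ID has at least one non-NaN Name
df = df[df.groupby('ID')['Name'].transform('count') > 0]
print(df)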
Using .groupby and .query
ids = df.groupby(["ID", "Name"]).agg(Count=("Name", "count")).reset_index()["ID"].tolist()
df = df.query("ID.isin(@ids)").reset_index(drop=True)
print(df)
Output:
   ID    Name
0   1     NaN
1   1  Banana
2   1     NaN
3   3   Apple
4   3     NaN

Filter rows of one column which is alphabet, numbers or hyphen in Pandas

Given a dataframe as follows, I need to check the room column:

   id    room
0   1   A-102
1   2     201
2   3    B309
3   4   C·102
4   5  E_1089
The correct format for this column is letters, numbers, or hyphens; otherwise, the check column should be filled with incorrect.
The expected result is like this:
   id    room      check
0   1   A-102        NaN
1   2     201        NaN
2   3    B309        NaN
3   4   C·102  incorrect
4   5  E_1089  incorrect
Here informal syntax can be:
df.loc[<filter1> | (<filter2>) | (<filter3>), 'check'] = 'incorrect'
Thanks for your help in advance.
Use str.match to validate all characters:
df['check'] = np.where(df.room.str.match(r'^[a-zA-Z\d\-]*$'), np.nan, 'incorrect')
Or str.contains with a negated character class:
df['check'] = np.where(df.room.str.contains(r'[^a-zA-Z\d\-]'), 'incorrect', np.nan)
Output:
   id    room      check
0   1   A-102        nan
1   2     201        nan
2   3    B309        nan
3   4   C·102  incorrect
4   5  E_1089  incorrect
If you want to update the existing check column, use loc access. For example:
df.loc[df.room.str.contains(r'[^a-zA-Z\d\-]'), 'check'] = 'incorrect'
# or, safer when NaN is present:
# df.loc[df.room.str.contains(r'[^a-zA-Z\d\-]') == True, 'check'] = 'incorrect'
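For reference, a self-contained run of the loc approach on the sample data (a sketch; unlike np.where above, which coerces NaN to the string 'nan', this keeps true NaN in the untouched rows):

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'room': ['A-102', '201', 'B309', 'C·102', 'E_1089']})

# rows whose room contains anything other than letters, digits, or hyphens
bad = df['room'].str.contains(r'[^a-zA-Z\d\-]')
df.loc[bad, 'check'] = 'incorrect'  # rows that pass stay NaN
print(df)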

comparing two columns and replace NaN with numbers

for i in range(len(df1)-1):
    if (df1['overall_rating'][i] == np.nan) and (df1['recommended'][i] == 0):
        df1['overall_rating'] = df1['overall_rating'][i].replace(np.nan, 1)
    else:
        df1['overall_rating']
print(df1['overall_rating'])
I am comparing the overall_rating and recommended columns in a pandas dataframe. If both conditions happen to be true, then I should replace NaN in the rating column with 1. But I am getting neither a result nor an error. Can anyone please let me know where I am going wrong?
Use DataFrame.loc to set 1 by two conditions; to test for missing values, use the Series.isna function:
df1 = pd.DataFrame({'overall_rating': [np.nan, 2, 4, np.nan],
                    'recommended': [0, 0, 1, 1]})
df1.loc[df1['overall_rating'].isna() & (df1['recommended'] == 0), 'overall_rating'] = 1
print(df1)
   overall_rating  recommended
0             1.0            0
1             2.0            0
2             4.0            1
3             NaN            1
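The reason the original loop never fires: NaN compares unequal to everything, including itself, so df1['overall_rating'][i] == np.nan is always False. A quick check:

import numpy as np

print(np.nan == np.nan)  # False: NaN is never equal to anything
print(np.isnan(np.nan))  # True: use isnan / Series.isna to test for NaN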

accessing Groupby Sum results [duplicate]

I have a dataframe with 2 index levels:
                   value
Trial measurement
1     0               13
      1                3
      2                4
2     0              NaN
      1               12
3     0               34
Which I want to turn into this:
Trial  measurement  value
1      0               13
1      1                3
1      2                4
2      0              NaN
2      1               12
3      0               34
How can I best do this?
I need this because I want to aggregate the data as instructed here, but I can't select my columns like that if they are in use as indices.
The reset_index() is a pandas DataFrame method that will transfer index values into the DataFrame as columns. The default setting for the parameter is drop=False (which will keep the index values as columns).
All you have to do is call .reset_index() after the name of the DataFrame:
df = df.reset_index()
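A minimal sketch reproducing the example above (the MultiIndex construction is illustrative):

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (3, 0)],
                                names=['Trial', 'measurement'])
df = pd.DataFrame({'value': [13, 3, 4, np.nan, 12, 34]}, index=idx)

# moves both index levels into ordinary columns
print(df.reset_index())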
This doesn't really apply to your case, but could be helpful for others (like myself 5 minutes ago) to know. If one's MultiIndex levels have the same name, like this:
             value
Trial Trial
1     0         13
      1          3
      2          4
2     0        NaN
      1         12
3     0         34
df.reset_index(inplace=True) will fail, because the columns that are created cannot have the same name.
So then you need to rename the MultiIndex with df.index = df.index.set_names(['Trial', 'measurement']) to get:
                   value
Trial measurement
1     0               13
      1                3
      2                4
2     0              NaN
      1               12
3     0               34
And then df.reset_index(inplace=True) will work like a charm.
I encountered this problem after grouping by year and month on a datetime column (not the index) called live_date, which meant that both year and month were named live_date.
There may be situations when df.reset_index() cannot be used (e.g., when you need the index, too). In this case, use index.get_level_values() to access index values directly:
df['Trial'] = df.index.get_level_values(0)
df['measurement'] = df.index.get_level_values(1)
This will assign index values to individual columns and keep the index.
See the docs for further info.
As #cs95 mentioned in a comment, to drop only one level, use:
df.reset_index(level=[...])
This avoids having to redefine your desired index after reset.
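For the example above, a sketch of dropping a single level might look like this, keeping Trial in the index:

# move only the 'measurement' level into a column; 'Trial' stays in the index
df = df.reset_index(level=['measurement'])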
I ran into Karl's issue as well. I just found myself renaming the aggregated column then resetting the index.
df = pd.DataFrame(df.groupby(['arms', 'success'])['success'].sum()).rename(columns={'success':'sum'})
df = df.reset_index()
Short and simple
df2 = pd.DataFrame({'test_col': df['test_col'].describe()})
df2 = df2.reset_index()
A solution that might be helpful in cases when not every column has multiple index levels:
df.columns = df.columns.map(''.join)
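A quick illustration of what that does (the frame and column names here are hypothetical), assuming a two-level column index produced by an aggregation:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
agg = df.groupby('a').agg(['sum', 'mean'])  # two-level columns: ('b', 'sum'), ('b', 'mean')
agg.columns = agg.columns.map(''.join)      # flattened to 'bsum', 'bmean'
print(agg)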
Similar to Alex's solution, in a more generalized form. It keeps the index untouched and adds each index level as a new column with its name:
for i in df.index.names:
    df[i] = df.index.get_level_values(i)
which gives
                   value  Trial  measurement
Trial measurement
1     0               13      1            0
      1                3      1            1
      2                4      1            2
...

Pandas - group by id and drop duplicate with threshold

I have the following data:
userid  itemid
1       1
1       1
1       3
1       4
2       1
2       2
2       3
I want to drop userids that have viewed the same itemid two or more times.
For example, userid=1 has viewed itemid=1 twice, and thus I want to drop the entire record of userid=1. However, since userid=2 hasn't viewed the same item twice, I will leave userid=2 as it is.
So I want my data to be like the following:
userid  itemid
2       1
2       2
2       3
Can someone help me?
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})
You can use duplicated to determine the row-level duplicates, then perform a groupby on 'userid' to determine 'userid'-level duplicates, then drop accordingly.
To drop without a threshold:
df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]
To drop with a threshold, use keep=False in duplicated, and sum over the Boolean column and compare against your threshold. For example, with a threshold of 3:
df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]
The resulting output for no threshold:
   userid  itemid
4       2       1
5       2       2
6       2       3
filter was made for this. You can pass a function that returns a boolean that determines whether the group passes the filter.

filter and value_counts (most generalizable and intuitive):
df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)

filter and is_unique (a special case when looking for n < 2):
df.groupby('userid').filter(lambda x: x.itemid.is_unique)
   userid  itemid
4       2       1
5       2       2
6       2       3
Group the dataframe by users and items:
views = df.groupby(['userid', 'itemid'])['itemid'].count()
#userid  itemid
#1       1         2   <=== the offending row
#        3         1
#        4         1
#2       1         1
#        2         1
#        3         1
#Name: itemid, dtype: int64
Find out who saw any item only once:
THRESHOLD = 2
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)
#userid
#1 False
#2 True
#dtype: bool
Combine the results and keep the 'good' rows:
combined = df.merge(pd.DataFrame(viewed).reset_index())
combined[combined[0]][['userid','itemid']]
# userid itemid
#4 2 1
#5 2 2
#6 2 3
# group by userid and itemid and get a count
df2 = df.groupby(by=['userid', 'itemid']).apply(lambda x: len(x)).reset_index()
# extract rows where the max userid-itemid count is less than 2
df2 = df2[~df2.userid.isin(df2[df2.iloc[:, -1] > 1]['userid'])][df.columns]
print(df2)
   itemid  userid
3       1       2
4       2       2
5       3       2
If you want to drop at a certain threshold, just use df2.iloc[:, -1] > threshold instead.
I do not know whether there is a function available in Pandas to do this task. However, I tried to make a workaround to deal with your problem.
Here is the full code.
import pandas as pd

dictionary = {'userid': [1, 1, 1, 1, 2, 2, 2],
              'itemid': [1, 1, 3, 4, 1, 2, 3]}
df = pd.DataFrame(dictionary, columns=['userid', 'itemid'])

selected_user = []
for user in df['userid'].drop_duplicates().tolist():
    items = df.loc[df['userid'] == user]['itemid'].tolist()
    # a user who repeated an item has fewer unique items than rows
    if len(items) != len(set(items)):
        continue
    else:
        selected_user.append(user)

result = df.loc[df['userid'].isin(selected_user)]
This code produces the following outcome:
   userid  itemid
4       2       1
5       2       2
6       2       3
Hope it helps.
