I have a pandas dataframe that looks something like this:
id  col1    col2      value1  value2  value3
1   123456  1234ABC   1       2       nan
1   123456  1234567   1       2       nan
1   124567  1234568   1       2       nan
1   124567  2345678   nan     2       nan
2   123456  1234564   nan     2       nan
2   123456  2132534   nan     2       nan
2   543210  10580701  nan     2       nan
I want to make a function that runs through the whole set and cleans it with these conditions:
For every unique id, do the following steps:
If col1 has a 6-digit code and col2 has a number-and-letter combination, then keep the row.
If col1 has a 6-digit code and col2 has something other than a number-and-letter combination, then keep only the first row with that same 6-digit code in col1.
So in this table example, after running the function, these rows would still be in the dataset:
id  col1    col2      value1  value2  value3
1   123456  1234ABC   1       2       nan
1   123456  1234567   1       2       nan
1   124567  2345678   nan     2       nan
2   123456  1234564   nan     2       nan
2   543210  10580701  nan     2       nan
At first I tried something like this:
def process_df(df):
    # Sort the dataframe by column 1 and column 2
    df = df.sort_values(by=['col1', 'col2'])
    # Create a new column that indicates whether a row has a letter in column 2
    df['has_letter'] = df['col2'].str.contains('[a-zA-Z]')
    # Group the dataframe by column 1 and apply the following function to each group
    def group_func(group):
        # If there are any rows with a letter in column 2, keep all of them
        if group['has_letter'].any():
            return group
        # If there are no rows with a letter in column 2, keep the first row
        else:
            return group.iloc[0:1]
    df = df.groupby('col1').apply(group_func)
    # Drop the has_letter column
    df = df.drop(columns=['has_letter'])
    df = df.reset_index(drop=True)
    return df
But it didn't work, since different ids can have rows where the 6-digit code in col1 is the same as some other id's col1 code.
So somehow I have to make the function apply this logic to every id separately for it to work.
EDIT:
I edited the
df = df.groupby('col1').apply(group_func)
row to
df = df.groupby(['id', 'col1']).apply(group_func)
This seems to do the job.
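In full, the edited function might look like the sketch below (an assumption-laden sketch: col2 is taken to be stored as strings, and na=False makes rows with a missing col2 count as having no letter):

def process_df(df):
    # Sort the dataframe by column 1 and column 2
    df = df.sort_values(by=['col1', 'col2'])
    # Flag rows whose col2 contains at least one letter; missing col2 counts as no letter
    df['has_letter'] = df['col2'].str.contains('[a-zA-Z]', na=False)
    def group_func(group):
        # Keep the whole group if it has any letter row, otherwise keep only its first row
        return group if group['has_letter'].any() else group.iloc[0:1]
    # Grouping by id as well keeps ids with identical col1 codes separate
    df = df.groupby(['id', 'col1'], group_keys=False).apply(group_func)
    return df.drop(columns=['has_letter']).reset_index(drop=True)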
First I grouped by id and extracted the rows that have letters in col2. Iterating over the groups, I further grouped by col1 and extracted the first occurrence of each col1 value.
I hope this helps.
import pandas as pd

# Sort the dataframe by column 1 and column 2
df4 = df4.sort_values(by=['col1', 'col2'])
# Create a new column that indicates whether a row has a letter in column 2
df4['has_letter'] = df4['col2'].str.contains('[a-zA-Z]')
# Fill NaN (rows where col2 is missing count as "no letter")
df4['has_letter'] = df4['has_letter'].fillna(False)

grouped_id = df4.groupby('id')
output_df = pd.DataFrame()
# Iterate over the id groups
for name, group_id in grouped_id:
    # Keep every row that has a letter in col2
    output_df = pd.concat([output_df, group_id[group_id['has_letter']]])
    # For the remaining rows, keep the first row of each col1 code
    no_col2_letter = group_id[group_id['has_letter'] == False]
    grouped_col2 = no_col2_letter.groupby('col1')
    for name, group_col2 in grouped_col2:
        output_df = pd.concat([output_df, group_col2[:1]])

output_df
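A loop-free alternative, sketched here under the assumption that "keep the first row per (id, col1) code" is the desired behaviour for the rows without letters:

# Rows whose col2 contains a letter are always kept
letters = df4['col2'].str.contains('[a-zA-Z]', na=False)
# For the remaining rows, keep only the first occurrence of each (id, col1) pair
first_per_code = df4[~letters].drop_duplicates(subset=['id', 'col1'])
output_df = pd.concat([df4[letters], first_per_code]).sort_index()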
Related
Let's say I have a data frame that looks like this. I want to delete everything with a certain ID if all of its Name values are empty. Like in this example, every Name value is missing in the rows where ID is 2. Even if I have 100 rows with ID 3 and only one Name value is present, I want to keep them.
ID  Name
1   NaN
1   Banana
1   NaN
2   NaN
2   NaN
2   NaN
3   Apple
3   NaN
So the desired output looks like this:
ID  Name
1   NaN
1   Banana
1   NaN
3   Apple
3   NaN
Everything I tried so far was wrong. In this attempt, I tried to count every NaN value that belongs to an ID, but it still returns too many rows. This is the closest I got to my desired outcome.
df = df[(df['ID']) & (df['Name'].isna().sum()) != 0]
You want to exclude the rows of IDs that have as many NaNs as they have rows. Therefore, you can group by ID and count each group's number of rows and number of NaNs.
Based on this result, you can get the IDs whose row count equals their NaN count and exclude them from your original dataframe.
# Declare column that indicates if `Name` is NaN
df['isna'] = df['Name'].isna().astype(int)
# Declare a dataframe that counts the rows and NaNs per `ID`
counter = df.groupby('ID').agg({'Name':'size', 'isna':'sum'})
# Get IDs that have as many NaNs as they have rows
exclude = counter[counter['Name'] == counter['isna']].index.values
# Exclude these IDs from your data
df = df[~df['ID'].isin(exclude)]
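A shorter equivalent of the same idea, as a sketch, filters on a per-ID check instead of building a counter frame:

# Keep only rows whose ID has at least one non-null Name
df = df[df.groupby('ID')['Name'].transform(lambda s: s.notna().any())]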
Using .groupby and .query
ids = df.groupby(["ID", "Name"]).agg(Count=("Name", "count")).reset_index()["ID"].tolist()
df = df.query("ID in @ids").reset_index(drop=True)
print(df)
Output:
ID Name
0 1 NaN
1 1 Banana
2 1 NaN
3 3 Apple
4 3 NaN
Assume the table below
Index  Col1  Col2  Col3
0      10.5  2.5   nan
1            s
2      2.9   3.2   a
3      #VAL  nan   2
4      3     5.6   4
Now what I'm trying to get is a summary dataframe which gives me a count of the different datatypes/conditions, as below:
Index          Col1  Col2  Col3
Integer/Float  3     3     2
Blank          1     0     1
Nan            0     1     1
Text           1     1     1
I come from Excel, where this kind of conditioning would be pretty simple:
Integer/Float: I would use ISNUMBER to create an array of True and False values and sum the True ones.
Blank: I would simply use COUNTIF(Column, "").
Text: similar to ISNUMBER, I would use ISTEXT.
I have tried searching on Stack Overflow, but the best I've found is:
pd.DataFrame(df["Col1"].apply(type).value_counts())
This does not, however, give me the exact output.
I also wanted to check whether it is possible to filter the values based on the above conditions and get the matching cells,
e.g. df[Col1==ISTEXT]
Use a custom function to count each type separately:
def f(x):
    a = pd.to_numeric(x, errors='coerce').notna().sum()
    b = x.eq('').sum()
    c = x.isna().sum()
    d = len(x) - (a + b + c)
    return pd.Series([a, b, c, d], ['Integer/Float', 'Blank', 'Nan', 'Text'])

df = df.apply(f)
print(df)
Col1 Col2 Col3
Integer/Float 3 3 2
Blank 1 0 1
Nan 0 1 1
Text 1 1 1
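For the follow-up about filtering rather than counting, a possible sketch; it assumes the raw table is still around under the hypothetical name df_original, since df has been overwritten by the summary above:

col = df_original['Col1']
# Text-like cells: not parseable as a number, not blank, not NaN
is_text = pd.to_numeric(col, errors='coerce').isna() & col.notna() & col.ne('')
text_rows = df_original[is_text]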
I have input data as below:
Case ID  Name
1
1        rohit
1        Sakshi
2
2
2
So basically the input data has two types of Case IDs: one where there are both blank and non-blank values (rows) for a Case ID, and another where there are only blank values (rows) for the case.
I am trying to get the below output:
Case ID  Name
1        rohit
1        Sakshi
2
i.e., if a case has both blank and non-blank values, then for that Case ID show just the non-blank values; for a case where all values are blank, keep a single row/record with a blank value in the 'Name' column.
One way (not efficient, but flexible) is to use the split-apply-combine approach with a custom function:
def drop_empty(df0):
    df0 = df0.copy()  # avoid a "value is trying to be set on a copy of a slice" warning
    if df0['Name'].count() != 0:
        df0.dropna(thresh=2, inplace=True)
    else:
        df0.drop_duplicates(inplace=True)
    return df0[['Name']]

df.groupby('Case ID').apply(drop_empty).reset_index()[['Case ID', 'Name']]
you can also try something like this:
indx = df.groupby('Case ID')['Name'].apply(lambda x: x.dropna() if x.count() else x.head(1))
df = df.loc[indx.index.get_level_values(1)]
>>> df
Case ID Name
1 1 rohit
2 1 Sakshi
3 2 NaN
suppose your input dataframe looks like:
Case ID Name
0 1 NaN
1 1 rohit
2 1 Sakshi
3 2 NaN
4 2 NaN
5 2 NaN
Let's say I have 3 different columns
Column1 Column2 Column3
0 a 1 NaN
1 NaN 3 4
2 b 6 7
3 NaN NaN 7
and I want to create one final column that takes the first value that isn't NA, resulting in:
Column1
0 a
1 3
2 b
3 7
I would usually do this with a custom apply function:
df.apply(lambda x: ...)
I need to do this for many different cases with millions of rows and this becomes very slow. Are there any operations that would take advantage of vectorization to make this faster?
Back-fill the missing values along the rows and select the first column, using [] for a one-column DataFrame or without it for a Series:
df1 = df.bfill(axis=1).iloc[:, [0]]
s = df.bfill(axis=1).iloc[:, 0]
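For example, to attach the coalesced values back to the frame under a new column (the name first_value is just illustrative):

# First non-NA value in each row, scanning columns left to right
df['first_value'] = df.bfill(axis=1).iloc[:, 0]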
You can use fillna() for this, as below:
df['Column1'].fillna(df['Column2']).fillna(df['Column3'])
output:
0 a
1 3
2 b
3 7
For more than 3 columns, this can be placed in a for loop as below, with new_col being your output:
new_col = df['Column1']
for col in df.columns:
    new_col = new_col.fillna(df[col])
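The same loop can also be written as a fold; this is only a sketch of an equivalent formulation:

import functools

# Start from the first column and fill its gaps from each following column in turn
new_col = functools.reduce(lambda acc, col: acc.fillna(df[col]),
                           df.columns[1:], df[df.columns[0]])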
Take a line like:
df.dropna(thresh=2)
Instead of dropping the column based on the number of empty values, I want to fill that entire column with zeroes.
Thus, columns that only have one empty value will be untouched, while columns with 2+ empty values will be totally replaced with zeroes.
Iterate over the columns, and if the number of NA values is 2 or more, replace the column with zeroes; otherwise keep the column as is:
for col in df.columns:
    df[col] = 0 if df[col].isna().sum() >= 2 else df[col]
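A loop-free sketch of the same rule, assuming it is fine to overwrite the affected columns in place:

# Columns with 2 or more NaNs
mask = df.isna().sum() >= 2
df[df.columns[mask]] = 0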
Try this:
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 7],
                   "col2": [1, 2, 3, 4, np.nan, 2],
                   "col3": [1, 2, 3, np.nan, np.nan, np.nan]})
# df != df is True only for NaN cells, so this flags columns with 2 or more NaNs
temp = (df != df).sum() >= 2
temp1 = df.columns
for x in range(len(temp)):
    if temp.iloc[x]:
        df[temp1[x]] = 0
df
output:
col1 col2 col3
0 1 1.0 0
1 2 2.0 0
2 3 3.0 0
3 4 4.0 0
4 5 NaN 0
5 7 2.0 0