I have a dataframe with 2 columns, 'age' and 'name', which looks like this (when opened in Notepad):
,age,name
0,18,Bill
1,22,Harry
2,Nan,Bill
4,5,William
(the first column is an index)
I need to drop any rows with Nan in the age column, and also drop any other rows that share a name with such a row. For example, in the snippet of my dataframe I would want to drop both rows with Bill, because one of them has Nan as the age.
Currently I have this:
df_no_dups = dp[np.isfinite(dp['age'])]
This handles the first part, but I am stuck on removing the other rows that share a name with the row containing Nan.
Any help would be great
Filter by boolean indexing with a boolean mask created by transform, testing whether all values in each group have no missing value:
df1 = df[df['age'].notnull().groupby(df['name']).transform('all')]
Or check for missing values, test whether at least one is True per group, and finally invert the boolean mask with ~:
df1 = df[~df['age'].isnull().groupby(df['name']).transform('any')]
print (df1)
age name
1 22.0 Harry
3 5.0 William
Detail:
print (df['age'].notnull())
0 True
1 True
2 False
3 True
Name: age, dtype: bool
print (df['age'].notnull().groupby(df['name']).transform('all'))
0 False
1 True
2 False
3 True
Name: age, dtype: bool
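A roughly equivalent formulation (a sketch of my own, typically slower on large frames) keeps whole name groups whose ages are all present:
# Keep only the name groups where every age is non-missing.
df1 = df.groupby('name').filter(lambda g: g['age'].notnull().all())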
Try this:
df = df.drop_duplicates(subset=['name'], keep=False)
df = df[df['age'].notnull()]  # or df = df[df['age'] != 'Nan'] (as your input contains Nan as a string)
Explanation:
First remove the duplicates, passing keep=False so that every row with a duplicated name is dropped. Then filter out the NaN rows.
Output:
age name
1 22 Harry
4 5 William
This works for me:
import pandas as pd
df = pd.read_excel('test.xlsx')
df = df.drop_duplicates(subset='name', keep=False)
df = df.dropna(subset=['age'])
Edit: this works for null values; if Nan is a string, as pointed out by @Mohamed, then use the answer he provided.
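If the 'Nan' really arrives as a string, one more option (a sketch on my part, not from either answer) is to coerce the column to numeric first so that dropna applies:
# Unparseable entries such as the string 'Nan' become real NaN after coercion.
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df = df.drop_duplicates(subset='name', keep=False)
df = df.dropna(subset=['age'])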
I have a dataframe like this:
id name emails
1  a    a#e.com,b#e.com,c#e.com,d#e.com
2  f    f#gmail.com
And I need to iterate over the emails: if there is more than one, create additional rows in the dataframe for the extra emails (the ones not corresponding to the name), so it should end up like this:
id name emails
1  a    a#e.com
2  f    f#gmail.com
3  NaN  b#e.com
4  NaN  c#e.com
5  NaN  d#e.com
What is the best way to do it, apart from iterrows with append or concat? Is it OK to modify the iterated dataframe during iteration?
Thanks.
Use DataFrame.explode on the values split by Series.str.split first, then compare the part before # with the name; if there is no match, set a missing value. Finally, sort so the missing values end up at the end of the DataFrame and assign a new range to the id column:
import numpy as np

df = df.assign(emails = df['emails'].str.split(',')).explode('emails')
mask = df['name'].eq(df['emails'].str.split('#').str[0])
df['name'] = np.where(mask, df['name'], np.nan)
df = df.sort_values('name', key=lambda x: x.isna(), ignore_index=True)
df['id'] = range(1, len(df) + 1)
print (df)
id name emails
0 1 a a#e.com
1 2 f f#gmail.com
2 3 NaN b#e.com
3 4 NaN c#e.com
4 5 NaN d#e.com
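The np.where step can also be written without the numpy import, using Series.where (a minor variation of my own, not the answer's exact code):
# Series.where keeps values where the condition holds and fills NaN elsewhere.
df['name'] = df['name'].where(mask)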
I have two data frames, df (with 15000 rows) and df1 (with 20000 rows),
where df looks like:
Number Color Code Quantity
1 Red 12380 2
2 Bleu 14440 3
3 Red 15601 1
and df1 has two columns, Code and Quantity, where I want to fill the Quantity column under certain conditions using Python, in order to obtain this:
Code Quantity
12380 2
15601 1
15640 1
14400 0
The conditions that I want to take into consideration are:
If the last two characters of the Code column of df1 are both equal to zero, I want to have 0 in the Quantity column of df1.
If I don't find the Code in df, in that case I put 1 in the Quantity column of df1.
Otherwise I take the quantity value from df.
Let us try:
mask = df1['Code'].astype(str).str[-2:].eq('00')
mapped = df1['Code'].map(df.set_index('Code')['Quantity'])
df1['Quantity'] = mapped.mask(mask, 0).fillna(1)
Details:
Create a boolean mask specifying the condition where the last two characters of Code are both 0:
>>> mask
0 False
1 False
2 False
3 True
Name: Code, dtype: bool
Using Series.map, map the values in the Code column of df1 to the Quantity column of df based on the matching Code:
>>> mapped
0 2.0
1 1.0
2 NaN
3 NaN
Name: Code, dtype: float64
Mask the values in the mapped column above where the boolean mask is True, and lastly fill the NaN values with 1:
>>> df1
Code Quantity
0 12380 2.0
1 15601 1.0
2 15640 1.0
3 14400 0.0
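If whole numbers are preferred over the float output shown above, the column can be cast afterwards (a small follow-up of mine, safe once every value is filled):
df1['Quantity'] = df1['Quantity'].astype(int)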
I've got 2 data frames: one with 1 column, currentWorkspaceGuid (workspacesDF), and another with 4 columns, currentWorkspaceGuid, modelGuid, memoryUsage, lastModified (extrasDF). I'm trying to use isin to get a dataframe that shows the values from the second dataframe only if the workspaceGuid exists in workspacesDF. It's giving me an empty dataframe when I use the following code:
import pandas as pd
extrasDF = pd.read_csv("~/downloads/Extras.csv")
workspacesDF = pd.read_csv("~/downloads/workspaces.csv")
not_in_workspaces = extrasDF[(extrasDF.currentWorkspaceGuid.isin(workspacesDF))]
print(not_in_workspaces)
I tried adding print statements to verify that the column matches when it should and doesn't when it shouldn't, but it's still returning nothing.
Once I can get this to work correctly, my end goal is to return a list of the items that don't exist in workspacesDF, which I think I can do just by adding ~ to the front of the isin statement; that's why I'm not doing a join or merge.
EDIT:
Adding example data from both files for clarification:
from workspaces.csv:
currentWorkspaceGuid
8a81b09c56cdf89c0157345759d75644
8a81948240d60b1901417a266a536462
402882f738cf7433013b612dc5f60bbd
8a8194884c860a53014ca1f6596d54e9
8a8194884a34d3ff014a4f31bea3705a
from Extras.csv:
currentWorkspaceGuid,modelGuid,memoryUsage,lastModified
8a81b09c56cdf89c0157345759d75644,635D5FAAC46D4856AAFD21AC6386DDCA,1191785,"2018-08-08 17:57:45"
8a81948240d60b1901417a266a536462,4076B1A8B1E34D549FFFE9F5FFE4538A,5400000,"2016-09-13 18:32:50"
402882f738cf7433013b612dc5f60bbd,4CA3CDC12CD349ABA8658365480073CA,550000,"2017-11-23 16:26:10"
8a8194884c860a53014ca1f6596d54e9,15E3E6B6087A4CA6838616A418E9657A,830000,"2018-05-22 17:35:50"
8a8194884a34d3ff014a4f31bea3705a,C47D186A479140BFAB24AF8D24E8B2BA,816686,"2018-07-31 09:39:16"
I think you need to compare the columns (Series):
mask = extrasDF['currentWorkspaceGuid'].isin(workspacesDF['currentWorkspaceGuid'])
in_workspaces = extrasDF[mask]
print (in_workspaces)
currentWorkspaceGuid modelGuid \
0 8a81b09c56cdf89c0157345759d75644 635D5FAAC46D4856AAFD21AC6386DDCA
1 8a81948240d60b1901417a266a536462 4076B1A8B1E34D549FFFE9F5FFE4538A
2 402882f738cf7433013b612dc5f60bbd 4CA3CDC12CD349ABA8658365480073CA
3 8a8194884c860a53014ca1f6596d54e9 15E3E6B6087A4CA6838616A418E9657A
4 8a8194884a34d3ff014a4f31bea3705a C47D186A479140BFAB24AF8D24E8B2BA
memoryUsage lastModified
0 1191785 2018-08-08 17:57:45
1 5400000 2016-09-13 18:32:50
2 550000 2017-11-23 16:26:10
3 830000 2018-05-22 17:35:50
4 816686 2018-07-31 09:39:16
To filter the non-matched values, add ~ to invert the boolean mask:
not_in_workspaces = extrasDF[~mask]
print (not_in_workspaces)
Empty DataFrame
Columns: [currentWorkspaceGuid, modelGuid, memoryUsage, lastModified]
Index: []
Details:
print (mask)
0 True
1 True
2 True
3 True
4 True
Name: currentWorkspaceGuid, dtype: bool
print (~mask)
0 False
1 False
2 False
3 False
4 False
Name: currentWorkspaceGuid, dtype: bool
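As a side note (my reading, not stated in the answer): passing the whole DataFrame to isin likely explains the empty result, because iterating a DataFrame yields its column labels rather than its values:
print(list(workspacesDF))  # ['currentWorkspaceGuid']
# The original call likely tests each GUID against the column label, giving all False.
extrasDF['currentWorkspaceGuid'].isin(workspacesDF)
# Passing the column (a Series) compares against the actual GUID values instead.
extrasDF['currentWorkspaceGuid'].isin(workspacesDF['currentWorkspaceGuid'])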
I have a dataframe where some columns (not rows) look like ["","","",""].
I would like to delete the columns with that characteristic.
Is there an efficient way of doing that?
In pandas it would be del df['columnname'].
To delete columns where all values are empty, you first need to detect which columns contain only empty values.
So I made an example dataframe like this:
empty full nanvalues notempty
0 3 NaN 1
1 4 NaN 2
We can compare the entire frame to the empty string and then aggregate down each column with the .all() method.
empties = (df.astype(str) == "").all()
empties
empty True
full False
nanvalues False
notempty False
dtype: bool
Now we can drop these columns
empty_mask = empties.index[empties]
df.drop(empty_mask, axis=1)
full nanvalues notempty
0 3 NaN 1
1 4 NaN 2
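The detection and the drop can also be collapsed into a single boolean-indexing step (same idea as above, just a condensed sketch):
# Keep only the columns that are not made up entirely of empty strings.
df = df.loc[:, ~(df.astype(str) == "").all()]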
I have a DataFrame in pandas and several of the columns have all null values. Is there a built-in function which will let me remove those columns?
Yes, dropna. See http://pandas.pydata.org/pandas-docs/stable/missing_data.html and the DataFrame.dropna docstring:
Definition: DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None)
Docstring:
Return object with labels on given axis omitted where alternately any
or all of the data are missing
Parameters
----------
axis : {0, 1}
how : {'any', 'all'}
any : if any NA values are present, drop that label
all : if all values are NA, drop that label
thresh : int, default None
int value : require that many non-NA values
subset : array-like
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include
Returns
-------
dropped : DataFrame
The specific command to run would be:
df=df.dropna(axis=1,how='all')
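A quick self-contained check of that call (the toy frame below is made up for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2], 'b': [np.nan, np.nan], 'c': [3, np.nan]})
print(df.dropna(axis=1, how='all'))  # 'b' (all NaN) is dropped, 'c' is kept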
Another solution would be to create a boolean dataframe with True values at not-null positions and then take the columns having at least one True value. This removes columns with all NaN values.
df = df.loc[:,df.notna().any(axis=0)]
If you want to remove columns having at least one missing (NaN) value;
df = df.loc[:,df.notna().all(axis=0)]
This approach is particularly useful in removing columns containing empty strings, zeros or basically any given value. For example;
df = df.loc[:,(df!='').all(axis=0)]
removes columns having at least one empty string.
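For instance, the same pattern with a zero check (my own illustration of the point above):
# Removes columns containing at least one zero.
df = df.loc[:, (df != 0).all(axis=0)]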
Here is a simple function which you can use directly by passing a dataframe and a threshold:
df
'''
pets location owner id
0 cat San_Diego Champ 123.0
1 dog NaN Ron NaN
2 cat NaN Brick NaN
3 monkey NaN Champ NaN
4 monkey NaN Veronica NaN
5 dog NaN John NaN
'''
def rmissingvaluecol(dff, threshold):
    # percentage of missing values in each column
    pct_missing = 100 * dff.isnull().sum() / len(dff.index)
    # keep the columns whose missing percentage is below the threshold
    l = list(pct_missing[pct_missing < threshold].index)
    print("# Columns having more than %s percent missing values:" % threshold, (dff.shape[1] - len(l)))
    print("Columns:\n", list(set(dff.columns.values) - set(l)))
    return l
rmissingvaluecol(df, 1)  # here the threshold is 1%, i.e. drop columns with 1% or more missing values
#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
['id', 'location']
'''
Now create a new dataframe excluding these columns:
l = rmissingvaluecol(df,1)
df1 = df[l]
PS: You can change the threshold as per your requirement.
Bonus step
You can find the percentage of missing values for each column (optional)
def missing(dff):
    print(round((dff.isnull().sum() * 100 / len(dff)), 2).sort_values(ascending=False))
missing(df)
#output
'''
id 83.33
location 83.33
owner 0.00
pets 0.00
dtype: float64
'''
Function for removing all null columns from the data frame:
def Remove_Null_Columns(df):
    dff = pd.DataFrame()
    for cl in df.columns:  # iterate over every column of the input frame
        if df[cl].isnull().sum() == len(df[cl]):
            pass  # the column is entirely null, so skip it
        else:
            dff[cl] = df[cl]
    return dff
This function will remove all null columns from df.
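A usage sketch (assuming pandas is imported as pd and df is your frame):
cleaned = Remove_Null_Columns(df)
print(cleaned.columns.tolist())  # the entirely-null columns are gone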