I have a dataframe where some columns (not rows) contain nothing but empty strings, like ["", "", "", ""].
I would like to delete the columns with that characteristic.
Is there an efficient way of doing that?
In pandas it would be del df['columnname'].
To delete columns where all values are empty, you first need to detect what columns contain only empty values.
So I made an example dataframe like this:
  empty  full  nanvalues  notempty
0            3        NaN         1
1            4        NaN         2
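A minimal sketch of how such an example frame could be built (the column names come from the output above; the cell values are assumed here for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "empty": ["", ""],            # only empty strings -> should be dropped
    "full": [3, 4],
    "nanvalues": [np.nan, np.nan],
    "notempty": [1, 2],
})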
We can compare every cell to the empty string and then aggregate each column down with the .all() method.
empties = (df.astype(str) == "").all()
empties
empty True
full False
nanvalues False
notempty False
dtype: bool
Now we can drop these columns
empty_mask = empties.index[empties]
df.drop(empty_mask, axis=1)
   full  nanvalues  notempty
0     3        NaN         1
1     4        NaN         2
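Note that drop returns a new frame rather than modifying df. As a small alternative sketch using the same empties mask, boolean column selection with .loc gets the same result in one step:
df = df.loc[:, ~empties]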
I created a df from the data of my Excel sheet, and in a specific column I have a lot of values that are the same, but some of them are different. What I want to do is find the rows where these different values are and associate each one with another value from the same row. I will give an example:
ColA ColB
'Ship' 5
'Ship' 5
'Car' 3
'Ship' 5
'Plane' 2
Following the example, is there a way to find where the values different from 5 are, with the code giving me the respective value from ColA? In this case it would mean finding 3 and 2 and returning 'Car' and 'Plane', respectively.
Any help is welcome! :)
It depends on exactly what you want to do, but you could use:
a boolean comparison - to flag the rows that match the value you seek.
.where - to keep values only where that flag is False (and NaN elsewhere).
Given the above dataframe the following would work:
df['different'] = df['ColB']==5
df['type'] = df['ColA'].where(df['different']==False)
print(df)
Which returns this:
ColA ColB different type
0 Ship 5 True NaN
1 Ship 5 True NaN
2 Car 3 False Car
3 Ship 5 True NaN
4 Plane 2 False Plane
The 4th column has what you seek...
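If you only need the ColA values themselves rather than the helper columns, a plain boolean-indexing sketch (assuming the same df as above) gets there directly:
matches = df.loc[df['ColB'] != 5, 'ColA']
print(matches.tolist())  # expected: ['Car', 'Plane']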
I have a list of 3 DataFrames x, where each DataFrame has 3 columns. It looks like
1 2 T/F
4 7 False
4 11 True
4 20 False
4 25 True
4 40 False
What I want to do is set every row in the 'T/F' column to False for each DataFrame in the list x.
I attempted to do this with the following code:
rang = list(range(len(x))) # rang=[0,1,2]
for i in rang:
x[i].iloc[:len(x), 'T/F'] = False
The code compiled, but it didn't appear to work.
Much simpler: just iterate over the actual DataFrames in the list and update the column directly:
for df in x:
    df['T/F'] = False
Also note that DataFrame.iloc is an integer-location based indexer, so it will not accept the column label 'T/F'. If you want to index using column names, use .loc.
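If you would rather keep the index-based loop from the question, a .loc version along these lines should also work, since .loc accepts column labels:
for i in range(len(x)):
    x[i].loc[:, 'T/F'] = False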
I have a dataframe with 2 columns, 'age' and 'name', which looks like this (when opened in Notepad):
,age,name
0,18,Bill
1,22,Harry
2,Nan,Bill
4,5,William
(the first column is an index)
I need to drop any rows with Nan in the age column and also drop any rows which have the same name in the name column. For example, in the snippet of my data frame I would want to drop both rows with Bill in as one of the ages contains Nan
Currently i have this:
df_no_dups = dp[dp.isfinite(dp['age'])]
This covers the first part, but I am stuck on removing the other rows that share a name with the row containing Nan.
Any help would be great
Filter by boolean indexing, with the mask created by transform('all') to test whether all values per group are non-missing:
df1 = df[df['age'].notnull().groupby(df['name']).transform('all')]
Or check for missing values, test whether at least one is True per group, and finally invert the boolean mask with ~:
df1 = df[~df['age'].isnull().groupby(df['name']).transform('any')]
print (df1)
age name
1 22.0 Harry
3 5.0 William
Detail:
print (df['age'].notnull())
0 True
1 True
2 False
3 True
Name: age, dtype: bool
print (df['age'].notnull().groupby(df['name']).transform('all'))
0 False
1 True
2 False
3 True
Name: age, dtype: bool
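For the second approach, the intermediate mask before the ~ inversion would look like this for the same data (it is simply the negation of the mask above):
print (df['age'].isnull().groupby(df['name']).transform('any'))
0     True
1    False
2     True
3    False
Name: age, dtype: bool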
Try this:
df = df.drop_duplicates(subset=['name'], keep=False)
df = df[df['age'].notnull()]  # or df[df['age'] != 'Nan'] (since your input contains Nan as a string)
Explanation:
First remove the duplicates, passing keep=False so that every occurrence of a duplicated name is dropped. Then filter out the NaN rows.
Output:
age name
1 22 Harry
4 5 William
This works for me:
import pandas as pd
df = pd.read_excel('test.xlsx')
df = df.drop_duplicates(subset='name', keep=False)
df = df.dropna(subset=['age'])
Edit: this works for real null values; if Nan is stored as a string, as pointed out by #Mohamed, then use the answer provided by him.
I've got 2 data frames: one with a single column currentWorkspaceGuid (workspacesDF), and another with 4 columns currentWorkspaceGuid, modelGuid, memoryUsage, lastModified (extrasDF). I'm trying to use isin to get a dataframe with the rows from the second dataframe whose currentWorkspaceGuid exists in workspacesDF. The following code gives me an empty dataframe:
import pandas as pd
extrasDF = pd.read_csv("~/downloads/Extras.csv")
workspacesDF = pd.read_csv("~/downloads/workspaces.csv")
not_in_workspaces = extrasDF[(extrasDF.currentWorkspaceGuid.isin(workspacesDF))]
print(not_in_workspaces)
I tried adding in print statements to verify the column matches when it should and doesn't when it shouldn't but it's still returning nothing.
Once I can get this to work correctly, my end goal is to return a list of the items that don't exist in workspacesDF, which I think I can do just by adding ~ in front of the isin statement; that is why I'm not doing a join or merge.
EDIT:
Adding example data from both files for clarification:
from workspaces.csv:
currentWorkspaceGuid
8a81b09c56cdf89c0157345759d75644
8a81948240d60b1901417a266a536462
402882f738cf7433013b612dc5f60bbd
8a8194884c860a53014ca1f6596d54e9
8a8194884a34d3ff014a4f31bea3705a
from Extras.csv:
currentWorkspaceGuid,modelGuid,memoryUsage,lastModified
8a81b09c56cdf89c0157345759d75644,635D5FAAC46D4856AAFD21AC6386DDCA,1191785,"2018-08-08 17:57:45"
8a81948240d60b1901417a266a536462,4076B1A8B1E34D549FFFE9F5FFE4538A,5400000,"2016-09-13 18:32:50"
402882f738cf7433013b612dc5f60bbd,4CA3CDC12CD349ABA8658365480073CA,550000,"2017-11-23 16:26:10"
8a8194884c860a53014ca1f6596d54e9,15E3E6B6087A4CA6838616A418E9657A,830000,"2018-05-22 17:35:50"
8a8194884a34d3ff014a4f31bea3705a,C47D186A479140BFAB24AF8D24E8B2BA,816686,"2018-07-31 09:39:16"
I think you need to compare the columns (Series) themselves. Passing the whole workspacesDF to isin makes pandas iterate over its column labels rather than its values, so nothing matches and the result is empty. Compare against the currentWorkspaceGuid column instead:
mask = extrasDF['currentWorkspaceGuid'].isin(workspacesDF['currentWorkspaceGuid'])
in_workspaces = extrasDF[mask]
print (in_workspaces)
currentWorkspaceGuid modelGuid \
0 8a81b09c56cdf89c0157345759d75644 635D5FAAC46D4856AAFD21AC6386DDCA
1 8a81948240d60b1901417a266a536462 4076B1A8B1E34D549FFFE9F5FFE4538A
2 402882f738cf7433013b612dc5f60bbd 4CA3CDC12CD349ABA8658365480073CA
3 8a8194884c860a53014ca1f6596d54e9 15E3E6B6087A4CA6838616A418E9657A
4 8a8194884a34d3ff014a4f31bea3705a C47D186A479140BFAB24AF8D24E8B2BA
memoryUsage lastModified
0 1191785 2018-08-08 17:57:45
1 5400000 2016-09-13 18:32:50
2 550000 2017-11-23 16:26:10
3 830000 2018-05-22 17:35:50
4 816686 2018-07-31 09:39:16
To keep the non-matching values instead, invert the boolean mask with ~:
not_in_workspaces = extrasDF[~mask]
print (not_in_workspaces)
Empty DataFrame
Columns: [currentWorkspaceGuid, modelGuid, memoryUsage, lastModified]
Index: []
Details:
print (mask)
0 True
1 True
2 True
3 True
4 True
Name: currentWorkspaceGuid, dtype: bool
print (~mask)
0 False
1 False
2 False
3 False
4 False
Name: currentWorkspaceGuid, dtype: bool
What is a graceful way to fail when I want to access a value from a dataframe based on multiple conditions:
#Select from DataFrame using criteria from multiple columns
newdf = df[(df['column_one']>2004) & (df['column_two']==9)]
If no value satisfying the above condition exists, then pandas returns a KeyError. How do I instead just store a NaN value in newdf?
If, instead of dropping rows where the condition is not met, you want pandas to return a dataframe with rows of NaN where the condition is False and the original values otherwise, you can do the following.
You can assign a list of booleans with length equal to the number of rows of the dataframe to a view on all rows of the dataframe. This will get you NaN on the rows which are False and the original values for rows which correspond to True. If the entire list is False, you just get a dataframe full of NaN.
P.S. One of your column names is probably off. Even if everything is False, it should just return an empty dataframe instead of a KeyError.
Input:
print(df1)
df1[:] = df1[((df1["a"] > 2) & (df1["b"] > 1)).tolist()]
print(df1)
Output:
a b c
0 1 2 3
1 2 2 3
2 3 2 3
a b c
0 NaN NaN NaN
1 NaN NaN NaN
2 3.0 2.0 3.0
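An equivalent sketch that avoids assigning back into the original frame: filter with the mask, then reindex against the original index so the rows that fail the condition come back as NaN (assuming the same df1 as above):
mask = (df1["a"] > 2) & (df1["b"] > 1)
newdf = df1[mask].reindex(df1.index)  # dropped rows return as all-NaN rows
print(newdf)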