Removing intersection between data frames based on multiple columns - python

I have these 2 data frames:
df_test
  dimension1_id  dimension2_id  dimension3_id  dimension4_id dimension5_id  \
0            -1             -1             -1             -1            -1
1    1177314888      238198786     5770904146      133207291         Exact
2    1177314888      238198786     5770904266    18395155770         Exact
3    1177314888      238198786     5770904266    19338210057         Exact
4    1177314888      238198786     5770904266    30907903234         Exact
and
df_merge
  dimension1_id  dimension2_id  dimension3_id  dimension4_id dimension5_id  \
0            -1             -1             -1             -1            -1
1    1177314888      238198786     5770904146      133207291         Exact
I want to remove everything that is inside df_merge from df_test, based on the combinations of dimension1_id, dimension2_id, dimension3_id, dimension4_id and dimension5_id.
This is my code:
df_test = df_test[
    (df_test['dimension5_id'].isin(df_merge.dimension5_id) == False) &
    (df_test['dimension4_id'].isin(df_merge.dimension4_id) == False) &
    (df_test['dimension3_id'].isin(df_merge.dimension3_id) == False) &
    (df_test['dimension2_id'].isin(df_merge.dimension2_id) == False) &
    (df_test['dimension1_id'].isin(df_merge.dimension1_id) == False)
]
But this code returns an empty data frame. How can I remove just the first and second rows from df_test?

You can use logical indexing to mask the rows you want by applying a direct comparison. In this case, you can check which values in df_test are also in df_merge:
df_test.isin(df_merge)
The resulting logical index acts as a mask:
  dimension1_id  dimension2_id  dimension3_id  dimension4_id  dimension5_id  \
0           True           True           True           True           True
1           True           True           True           True           True
2          False          False          False          False          False
3          False          False          False          False          False
4          False          False          False          False          False
True values mark the cells that match df_merge, so we can reduce the mask to rows where every column matches and negate it with ~ to keep only the rows of df_test that are not in df_merge:
df_test[~df_test.isin(df_merge).all(axis=1)]
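An alternative worth knowing (a sketch, not part of the original answer) is an anti-join via merge with indicator=True, which does not rely on the two frames sharing an index; the column names are the ones from the question:
cols = ['dimension1_id', 'dimension2_id', 'dimension3_id',
        'dimension4_id', 'dimension5_id']
# keep only the rows of df_test whose key combination never appears in df_merge
result = (df_test.merge(df_merge[cols], on=cols, how='left', indicator=True)
                 .query("_merge == 'left_only'")
                 .drop(columns='_merge'))
Note that merge resets the index of the result, so use the first approach if you need to keep the original row labels.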

Related

How to compare just the date or just date time ignoring seconds in a Python Pandas dataframe column of mixed data types?

In a pandas dataframe, I have a column of mixed data types, such as text, integers and datetimes. I need to find rows where the datetime matches: (1) an exact value in some cases, (2) only the date (ignoring the time), or (3) the date and time, but ignoring seconds.
In the following code example with a mixed data type dataframe column, there are three datetimes of varying precision. Mapping the conditions into a separate dataframe works for an exact value.
import pandas as pd
import numpy as np
# example data frame
inp = [{'Id': 0, 'mixCol': np.nan},
       {'Id': 1, 'mixCol': "text"},
       {'Id': 2, 'mixCol': 43831},
       {'Id': 3, 'mixCol': pd.to_datetime("2020-01-01 00:00:00")},
       {'Id': 4, 'mixCol': pd.to_datetime("2020-01-01 01:01:00")},
       {'Id': 5, 'mixCol': pd.to_datetime("2020-01-01 01:01:01")}]
df = pd.DataFrame(inp)
print(df.dtypes)
myMap = pd.DataFrame()
myMap["Exact"] = df["mixCol"] == pd.to_datetime("2020-01-01 01:01:01")
0    False
1    False
2    False
3    False
4    False
5     True
The output I need should be:
Id  Exact  DateOnly  NoSeconds
0   False  False     False
1   False  False     False
2   False  False     False
3   False  True      False
4   False  True      True
5   True   True      True
BUT, mapping just the date, without time, maps as if the date had a time of 00:00:00.
myMap["DateOnly"] = df["mixCol"] == pd.to_datetime("2020-01-01")
Id  Exact  DateOnly
0   False  False
1   False  False
2   False  False
3   False  True
4   False  False
5   True   False
Trying to convert values in the mixed column throws an AttributeError: 'Series' object has no attribute 'date'; and trying to use ">" and "<" to define the relevant range throws a TypeError: '>=' not supported between instances of 'str' and 'Timestamp':
myMap["DateOnly"] = df["mixCol"].date == pd.to_datetime("2020-01-01")
myMap["NoSeconds"] = (df["mixCol"] >= pd.to_datetime("2020-01-01 01:01:00")) & (df["mixCol"] < pd.to_datetime("2020-01-01 01:02:00"))
If I try to follow the solution for mixed columns in pandas proposed here, both the np.nan and the text value map to True as dates.
df["IsDate"] = df.apply(pd.to_datetime, errors='coerce',axis=1).nunique(1).eq(1).map({True:True ,False:False})
I'm not sure how to proceed in this situation.
Use Series.dt.normalize to compare datetimes with the time removed (set to 00:00:00), or Series.dt.floor by days or minutes to remove the seconds:
#convert column to all datetimes with NaT
d = pd.to_datetime(df["mixCol"], errors='coerce')
myMap["DateOnly"] = d.dt.normalize() == pd.to_datetime("2020-01-01")
myMap["DateOnly"] = d.dt.floor('D') == pd.to_datetime("2020-01-01")
#alternative with dates
myMap["DateOnly"] = d.dt.date == pd.to_datetime("2020-01-01").date()
myMap['NoSeconds'] = d.dt.floor('Min') == pd.to_datetime("2020-01-01 01:01:00")
print (myMap)
Exact DateOnly NoSeconds
0 False False False
1 False False False
2 False False False
3 False True False
4 False True True
5 True True True
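For reference, a self-contained version combining the question's setup with the approach above (same data, only reorganized; the Exact column is computed on the coerced series, which gives the same result):
import pandas as pd
import numpy as np

inp = [{'Id': 0, 'mixCol': np.nan},
       {'Id': 1, 'mixCol': "text"},
       {'Id': 2, 'mixCol': 43831},
       {'Id': 3, 'mixCol': pd.to_datetime("2020-01-01 00:00:00")},
       {'Id': 4, 'mixCol': pd.to_datetime("2020-01-01 01:01:00")},
       {'Id': 5, 'mixCol': pd.to_datetime("2020-01-01 01:01:01")}]
df = pd.DataFrame(inp)

# parse the mixed column; NaN and the "text" string become NaT (errors='coerce')
d = pd.to_datetime(df["mixCol"], errors='coerce')

myMap = pd.DataFrame()
myMap["Exact"] = d == pd.to_datetime("2020-01-01 01:01:01")
myMap["DateOnly"] = d.dt.normalize() == pd.to_datetime("2020-01-01")
myMap["NoSeconds"] = d.dt.floor('min') == pd.to_datetime("2020-01-01 01:01:00")
print(myMap)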

Pandas data precision

I got unexpected behaviour, at least for me, while working with a pandas DataFrame. I check whether the values in a 2D DataFrame are smaller than a given value. If I check the entire DataFrame at once, the check is wrong for some values, but if I explicitly check the cell concerned, the result is correct.
print(df.loc[5,397])
out: 14.4  # --> it's actually 14.3999996185
print(df.loc[5,397] < 14.4)
out: True
print(df.loc[4:6,396:398] < 14.4)
out:
396 397 398
4 False False False
5 False False False #[5,397] should be True!
6 False False False
However, when I try to reproduce the error, I get the correct result:
import numpy
import pandas as pd

data = numpy.array([[15, 15, 15], [15, 14.3999996185, 15], [15, 15, 15]])
df = pd.DataFrame(data)
print(df.loc[1,1])
out: 14.3999996185
print(df.loc[1,1] < 14.4)
out: True
print(df < 14.4)
out:
0 1 2
0 False False False
1 False True False
2 False False False
Thank you
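One plausible cause, which the question does not confirm, is that the original frame holds float32 values: 14.4 has no exact float32 representation, the default repr rounds the stored value back to 14.4, and an element-wise "< 14.4" on a float32 frame is evaluated at float32 precision. A small sketch of how that would reproduce the mismatch:
import numpy as np
import pandas as pd

# the nearest float32 to 14.4 is 14.3999996185..., but its repr rounds back to 14.4
print(np.float32(14.4))                # 14.4
print(float(np.float32(14.4)))         # 14.399999618530273

df32 = pd.DataFrame(np.array([[15, 14.3999996185, 15]], dtype=np.float32))
print((df32 < 14.4).loc[0, 1])         # False: the whole-frame comparison runs at float32 precision
print(float(df32.loc[0, 1]) < 14.4)    # True: the single value converted to a Python float compares at float64 precision
In older NumPy versions the bare scalar comparison df.loc[5, 397] < 14.4 is promoted to float64 automatically, which would explain the True seen in the question.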

Filter Pandas DataFrame using a loop that is applying a dynamic conditional

I have a dataframe which contains two columns.
One column shows distance; the other contains 'trackIds', each of which is associated with a set of distances.
Example:
trackId  distance
2        17.452
2        8.650
2        10.392
2        11.667
2        23.551
2        9.881
3        6.052
3        7.241
3        8.459
3        22.644
3        126.890
3        12.442
3        5.891
4        44.781
4        7.657
4        36.781
4        224.001
What I am trying to do is eliminate any trackIds that contain a large spike in distance -- a spike that is > 75.
In this example case, track Ids 3 and 4 (and all their associated distances) would be removed from the dataframe because we see spikes in distance greater than 75, thus we would just be left with a dataframe containing trackId 2 and all of its associated distance values.
Here is my code:
i = 0
k = 1
length = len(dataframe)
while i < length:
    if (dataframe.distance[k] - dataframe.distance[i]) > 75:
        bad_id = dataframe.trackId[k]
        condition = dataframe.trackid != bad_id
        df2 = dataframe[condition]
    i += 1
I tried to use a while loop to go through all the different trackIds, subtract consecutive distance values, and check whether the result was > 75. If it was, the loop assigned that trackId to the variable 'bad_id' and used it as a condition to filter the dataframe down to trackIds that are not equal to the bad_id(s).
I just keep getting NameErrors because I'm unsure how to properly structure the loop, and I'm not sure this approach works anyway.
We can use diff to compute the difference between consecutive rows, then use a groupby transform to check whether any difference within a group is greater than 75, and keep only the groups with no matches:
m = ~(df['distance'].diff().gt(75).groupby(df['trackId']).transform('any'))
filtered_df = df.loc[m, df.columns]
filtered_df:
trackId distance
0 2.0 17.452
1 2.0 8.650
2 2.0 10.392
3 2.0 11.667
4 2.0 23.551
5 2.0 9.881
Breakdown of steps as a DataFrame:
breakdown = pd.DataFrame({'diff': df['distance'].diff()})
breakdown['gt 75'] = breakdown['diff'].gt(75)
breakdown['groupby any'] = (
breakdown['gt 75'].groupby(df['trackId']).transform('any')
)
breakdown['negation'] = ~breakdown['groupby any']
print(breakdown)
breakdown:
diff gt 75 groupby any negation
0 NaN False False True
1 -8.802 False False True
2 1.742 False False True
3 1.275 False False True
4 11.884 False False True
5 -13.670 False False True
6 -3.829 False True False
7 1.189 False True False
8 1.218 False True False
9 14.185 False True False
10 104.246 True True False # Spike of more than 75
11 -114.448 False True False
12 -6.551 False True False
13 38.890 False True False
14 -37.124 False True False
15 29.124 False True False
16 187.220 True True False # Spike of more than 75
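A small refinement worth noting (not part of the answer above): a plain diff also subtracts across the boundary between two trackIds, e.g. row 6 compares the first distance of trackId 3 with the last distance of trackId 2. Computing the differences per group avoids any spurious boundary spikes:
# variant: restrict the differences to each trackId before checking for spikes
per_group_diff = df.groupby('trackId')['distance'].diff()
m = ~per_group_diff.gt(75).groupby(df['trackId']).transform('any')
filtered_df = df[m]
For this particular data the result is the same as above, since the boundary differences never exceed 75.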

Python - isnull().sum() vs isnull().count()

So I'm currently finishing a tutorial with the titanic dataset (https://www.kaggle.com/c/titanic/data).
Now I'm trying a couple of new things that might be related.
The output of train_df.info() shows that there are 891 entries and that several columns contain NaN values.
When I went to find a little summary of the missing values, I got confused by .sum() and .count():
Using .isnull().sum() increments by one for each instance of a null value, so it seems that the output is how many missing entries there are for each column in the data frame (which is what I want).
However, if we use .count() we get 891 for each column, no matter whether we call .isnull().count() or .notnull().count().
So my questions are:
What does .count() mean in this context?
I thought that it would count every instance of the wanted method (in this case every instance of a null or not-null entry; basically what .sum() did).
Also, is my "definition" of how .sum() is being used correct?
Just print out the data of train_df.isnull() and you will see it.
# data analysis and wrangling
import pandas as pd
import numpy as np
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
train_df = pd.read_csv('train.csv')
print(train_df.isnull())
result:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \
0 False False False False False False False False False
1 False False False False False False False False False
2 False False False False False False False False False
3 False False False False False False False False False
4 False False False False False False False False False
.. ... ... ... ... ... ... ... ... ...
886 False False False False False False False False False
887 False False False False False False False False False
888 False False False False False True False False False
889 False False False False False False False False False
890 False False False False False False False False False
It has 891 rows full of True and False values.
When you use sum(), it returns the sum of every column, adding the Trues (= 1) and Falses (= 0) together, like this:
print(False + False + True + True)
2
When you use count(), it just returns the number of rows.
So of course you get 891 for each column, no matter whether you use .isnull().count() or .notnull().count().
data.isnull().count() returns the total number of rows irrespective of missing values. You need to use data.isnull().sum().
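A minimal illustration of the difference (made-up mini data frame, not the Titanic CSV):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22.0, np.nan, 38.0], 'Cabin': [np.nan, np.nan, 'C123']})

print(df.isnull().sum())    # Age 1, Cabin 2  -> number of missing values per column
print(df.isnull().count())  # Age 3, Cabin 3  -> number of rows, missing or not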

Python | count number of False statements in 3 rows

Columns L, M, N of my dataframe are populated with 'true' and 'false' statements (1000 rows). I would like to create a new column 'count_false' that returns the number of times a 'false' statement occurs in columns L, M and N.
Any tips appreciated!
Thank you.
You can negate your dataframe and sum over axis=1:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, (5, 3)), columns=list('LMN')).astype(bool)
df['Falses'] = (~df).sum(1)
print(df)
L M N Falses
0 True False True 1
1 True False False 2
2 True True True 0
3 False True False 2
4 False False True 2
If you have additional columns, you can filter accordingly:
df['Falses'] = (~df[list('LMN')]).sum(1)
Try this: df[df==False].count()
As explained in the Stack Overflow question "True and False equal to 1 and 0 in Python", something like line three of the following example should solve your problem:
import pandas as pd
df = pd.DataFrame([[True, False, True],[False, False, False],[True, False, True],[False, False, True]], columns=['L','M','N'])
df['count_false'] = 3 - (df['L']*1 + df['M']*1 + df['N']*1)
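Equivalently, and perhaps more explicit than the arithmetic above, you can count the False values per row directly (a variant sketch, not from the answers above):
df['count_false'] = (df[['L', 'M', 'N']] == False).sum(axis=1)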
