Filtering data-frame on multiple criteria - python

I have a dataframe df which has a head that looks like:
Shop Opening date
0 London NaT
22 Brighton 01/03/2016
27 Manchester 01/31/2017
54 Bristol 03/31/2017
69 Glasgow 04/09/2017
I also have a variable startPeriod which is set to the date 1/04/2017, and an endPeriod variable that has a value of 30/06/17.
I am trying to create a new dataframe based on df that filters out any rows that do not have a date (so removing any rows with an Opening date of NaT) and also filters out any rows with an opening date between the startPeriod and endPeriod. So in the above example I would be left with the following new dataframe:
Shop Opening date
22 Brighton 01/03/2016
69 Glasgow 04/09/2017
I have tried to filter out the 'NaT' using the following:
df1 = df['Opening date '] != 'NaT'
but I am unsure how to also filter out any Opening date values that are inside the startPeriod/endPeriod range.

You can use between with boolean indexing:
df['date'] = pd.to_datetime(df['date'])
df = df[df['date'].between('2016-03-01', '2017-04-05')]
print (df)
Shop Opening date
2 27 Manchester 2017-01-31
3 54 Bristol 2017-03-31
Filtering out the NaT values is not necessary here, because between treats them as False, but if you need it, chain a new condition:
df = df[df['date'].between('2016-03-01', '2017-04-05') & df['date'].notnull()]
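Note that the question asks to remove rows whose date falls inside the start/end period; under that reading you can invert the same between mask with ~. One subtlety: NaT makes between return False, so after inverting you must still filter out NaT explicitly (a sketch, assuming the column was already converted with to_datetime):
mask = df['date'].between('2017-04-01', '2017-06-30')
df = df[~mask & df['date'].notna()]  # outside the period, and an actual date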

First of all, be careful with the trailing space in df['Opening date '].
Try this solution (note that comparing against the string 'NaT' only works if the column holds strings, not real NaT values):
df1 = df[df['Opening date'] != 'NaT']
It would be much better to create a copy of the subset you're making:
df1 = df[df['Opening date'] != 'NaT'].copy()
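If the column holds actual datetime values rather than strings, the missing entries are real NaT objects and the string comparison won't match them; a sketch using notna() instead (dayfirst=True is an assumption based on the dd/mm style dates in the question):
# parse the column, turning unparseable entries into NaT
df['Opening date'] = pd.to_datetime(df['Opening date'], dayfirst=True, errors='coerce')
# keep only rows that actually have a date
df1 = df[df['Opening date'].notna()].copy()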


count values of groups by consecutive days

I have data with 3 columns: date, id, sales.
My first task was filtering sales above 100, which I did.
The second task is grouping id by consecutive days.
index  date        id  sales
0      01/01/2018  03  101
1      01/01/2018  07  178
2      02/01/2018  03  120
3      03/01/2018  03  150
4      05/01/2018  07  205
The result should be:
index  id  count
0      03  3
1      07  1
2      07  1
I need to do this task without using pandas/dataframes, but right now I can't imagine from which side to attack this problem.
To show some effort, I tried the suggested solution here: count consecutive days python dataframe,
but the ids are not grouped.
Here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that new_frame has a count column, because afterwards I need to count ids by ranges of those consecutive-day counts, e.g. the count of ids in the range of 0-7 days, 7-12 days, etc., but that's not part of my question.
Thanks a lot.
Your code is close, but needs some fine-tuning, as follows:
data = df[df['sales'] >= 100].copy()  # .copy() avoids SettingWithCopyWarning on the next line
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
# consecutive days have a diff of 1; each break starts a new group number via cumsum
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
Summary of changes:
As your dates are in dd/mm/yyyy instead of the default mm/dd/yyyy, you have to specify the parameter dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 will be regarded as 2018-02-01 instead of 2018-01-02 as expected and the day diff with adjacent entries will be around 30 as opposed to 1.
We added a sort step to sort by columns id and date to simplify the later grouping during the creation of the series s.
In the last groupby(), the code reset_index(level=0, drop=True) should drop level=1 instead, since level=0 is the id field, which we want to keep.
In the last groupby(), we do an extra .reset_index(name='count') to convert the Pandas Series back to a DataFrame and name the new column count.
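Since the question mentions needing to do this without pandas eventually, here is a minimal pure-Python sketch of the same run-length logic, with hypothetical (date, id, sales) tuples standing in for the real data:
from datetime import date, timedelta
from itertools import groupby

# (date, id, sales) tuples standing in for the real data
rows = [(date(2018, 1, 1), 3, 101), (date(2018, 1, 1), 7, 178),
        (date(2018, 1, 2), 3, 120), (date(2018, 1, 3), 3, 150),
        (date(2018, 1, 5), 7, 205)]

# filter sales >= 100, then sort by id and date so runs are adjacent
rows = sorted((r for r in rows if r[2] >= 100), key=lambda r: (r[1], r[0]))

counts = []  # (id, length of consecutive-day run) pairs
for id_, grp in groupby(rows, key=lambda r: r[1]):
    dates = [r[0] for r in grp]
    run = 1
    for prev, cur in zip(dates, dates[1:]):
        if cur - prev == timedelta(days=1):
            run += 1              # still consecutive: extend the run
        else:
            counts.append((id_, run))
            run = 1               # gap: close the run and start a new one
    counts.append((id_, run))

print(counts)  # [(3, 3), (7, 1), (7, 1)]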

How do I delete specific dataframe rows based on a column's value?

I have a pandas dataframe with 2 columns ("Date" and "Gross Margin"). I want to delete rows based on the value in the "Date" column. This is my dataframe:
Date Gross Margin
0 2021-03-31 44.79%
1 2020-12-31 44.53%
2 2020-09-30 44.47%
3 2020-06-30 44.36%
4 2020-03-31 43.69%
.. ... ...
57 2006-12-31 49.65%
58 2006-09-30 52.56%
59 2006-06-30 49.86%
60 2006-03-31 46.20%
61 2005-12-31 40.88%
I want to delete every row where the "Date" value doesn't end with "12-31". I read some similar posts on this and the pandas.drop() function seemed to be the solution, but I haven't figured out how to use it for this specific case.
Please leave any suggestions as to what I should do.
You can try the following code, which matches on the month and day:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df = df[df['Date'].dt.strftime('%m-%d') == '12-31']
Assuming you have the date formatted as year-month-day and stored as strings, keep only the rows that end with "12-31":
df = df[df['Date'].str.endswith('12-31')]
If the dates use a consistent format, you can also do it like this:
df = df[df['Date'].str.contains("12-31", regex=False)]
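A variant that avoids string matching entirely, as a sketch assuming the column parses cleanly as dates: compare the datetime accessors directly.
df['Date'] = pd.to_datetime(df['Date'])
# keep only year-end rows: month 12, day 31
df = df[(df['Date'].dt.month == 12) & (df['Date'].dt.day == 31)]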

Python pandas column filtering substring

I have a dataframe in python3 using pandas which has a column containing a string with a date.
This is the subset of the column
ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
"2020-04-08"
"2020-04-12"
I would like to remove the rows that have the same month and day twice and keep the one with the newest year.
This would be what I would expect as a result from this subset
ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
The last two rows were removed because their month-day pairs already appear earlier in the column (04-08 in 2021, and 04-12 in the first 2020 row).
I thought of doing this with an apply and lambda but my real dataframe has hundreds of rows and tens of columns so it would not be efficient. Is there a more efficient way of doing this?
There are a couple of ways you can do this. One of them would be to extract the year, sort descending by year, and drop rows with duplicate month-day pairs.
# separate year and month-day pairs
df['year'] = df['ColA'].apply(lambda x: x[:4])
df['mo-day'] = df['ColA'].apply(lambda x: x[5:])
# sort newest year first so that keep='first' below keeps the newest entry
df.sort_values('year', ascending=False, inplace=True)
print(df)
This is what it would look like after separation and sorting:
ColA year mo-day
0 2021-04-03 2021 04-03
1 2021-04-08 2021 04-08
2 2020-04-12 2020 04-12
3 2020-04-08 2020 04-08
4 2020-04-12 2020 04-12
Afterwards, we can simply drop the duplicates and remove the additional columns:
# drop duplicate month-day pairs
df.drop_duplicates('mo-day', keep='first', inplace=True)
# get rid of the two columns
df.drop(['year','mo-day'], axis=1, inplace=True)
# since we dropped duplicate, reset the index
df.reset_index(drop=True, inplace=True)
print(df)
Final result:
ColA
0 2021-04-03
1 2021-04-08
2 2020-04-12
This would be much faster than if you were to convert the entire column to datetime and extract dates, as you're working with the string as is.
I'm not sure you can get away from using an 'apply' to extract the relevant part of the date for grouping, but this is much easier if you first convert that column to a pandas datetime type:
df = pd.DataFrame({'colA': ["2021-04-03",
                            "2021-04-08",
                            "2020-04-12",
                            "2020-04-08",
                            "2020-04-12"]})
df['colA'] = df.colA.apply(pd.to_datetime)
Then you can group by the (day, month) and keep the highest value like so:
df.groupby(df.colA.apply(lambda x: (x.day, x.month))).max()
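For what it's worth, a vectorized sketch of the same idea without per-row lambdas: group on a month-day key and keep the row holding the newest date in each group.
df['colA'] = pd.to_datetime(df['colA'])
key = df['colA'].dt.strftime('%m-%d')           # month-day string, e.g. '04-08'
out = df.loc[df.groupby(key)['colA'].idxmax()]  # row index of the newest date per group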

How to filter a dataframe for specific dates and names in Python?

I want to filter my dataframe by 2 columns: one is for the date and the other one is for a name.
How can I filter out data from the previous month only? So if I run the code today, it will filter out data for the previous month.
The date column contains values in the form (year, month, day): [202006, 202005, 202007, 202107, 20200601, 20200630], etc. (Note that in some, the day is absent.)
And while filtering this, I also want to filter the 2nd column, in which I only want to take those names which contain specific keywords.
Example:
Data = [[202006, Fuel oil], [202007, crude oil], [20200601, palm oil], [20200805, crude oil], [202007, Marine fuel]]
If I run the code, it will automatically give me the previous month's data and the names which contain the word "oil".
First convert the dates to datetimes. Here the two date formats are parsed by to_datetime with different format strings and errors='coerce', and the missing values from one parse are filled from the other by Series.fillna:
df = pd.DataFrame({'date': [202006, 202005, 202007, 202107, 20200601, 20200630],
                   'fuel': ['Fuel oil', 'crude oil', 'fuel oil',
                            'castor oil', 'crude oil', 'fuel']})
d1 = pd.to_datetime(df['date'], format='%Y%m', errors='coerce')
d2 = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df['date'] = d1.fillna(d2)
print (df)
date fuel
0 2020-06-01 Fuel oil
1 2020-05-01 crude oil
2 2020-07-01 fuel oil
3 2021-07-01 castor oil
4 2020-06-01 crude oil
5 2020-06-30 fuel
Then the values are filtered by month periods: Series.dt.to_period is compared with today's month minus one for the first condition, chained by & (bitwise AND) with a second condition from Series.str.contains, and applied by boolean indexing:
now = pd.Timestamp('now').to_period('M')
df = df[df['date'].dt.to_period('M').eq(now - 1) & df['fuel'].str.contains('oil')]
print (df)
date fuel
0 2020-06-01 Fuel oil
4 2020-06-01 crude oil
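For reference, Period arithmetic handles the year boundary automatically, so no special-casing of January is needed; a quick check:
import pandas as pd
print(pd.Period('2021-01', freq='M') - 1)  # 2020-12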
Assuming your dataframe is
df = pd.DataFrame({
    'date': [202006, 202005, 202007, 202107, 20200601, 20200630],
    'fuel': ['Fuel oil', 'crude oil', 'fuel oil', 'castor oil', 'crude oil', 'fuel']})
Then you can do the following code to filter it:
import time

# finding previous month and year
current_year = time.gmtime().tm_year
current_month = time.gmtime().tm_mon
# Adding a check if the current month is January
if current_month != 1:
    prev_month = current_month - 1
else:
    prev_month = 12
    current_year -= 1
# extracting month/year info from the date column by converting it into strings
df[df.date.apply(lambda x: int(str(x)[4:6]) == prev_month and int(str(x)[:4]) == current_year) & df.fuel.apply(lambda x: 'oil' in x)]
Note:
df.date.apply(lambda x: int(str(x)[4:6]) == prev_month ...) extracts the month info, which I use to compare with the previous month and filter.
df.fuel.apply(lambda x: 'oil' in x) checks which elements have the word oil in them.
Assuming the dataframe is called 'dataframe', the date column is the first column, and the 'name' column is the second,
you can use this simple for loop to filter all the items and add them to a new dataframe.
dataframe_filter = pd.DataFrame()
month = 202006    # filter by this month
key_word = 'oil'  # filter by this keyword

for i in range(0, len(dataframe)):
    # match a date that doesn't include a day, or one that does (// 100 drops the day)
    if dataframe.iloc[i, 0] == month or dataframe.iloc[i, 0] // 100 == month:
        if key_word in dataframe.iloc[i, 1]:
            # rows are collected as columns; transpose at the end to restore the format
            dataframe_filter[i] = dataframe.iloc[i]

dataframe_filter = dataframe_filter.transpose()  # transpose back to the original row format
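A sketch of a vectorized alternative to the loop, under the same column-layout assumption: cast the date column to strings and match on the year-month prefix, which covers both the YYYYMM and YYYYMMDD forms.
s = dataframe.iloc[:, 0].astype(str)
mask = s.str.startswith(str(month)) & dataframe.iloc[:, 1].str.contains(key_word)
dataframe_filter = dataframe[mask]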

Filter data-frame for rows with dates outside a date range

I have a data-frame df where the head looks like:
identifier department organisation status change date
1 14 Finance Accounts 19/09/2018
2 19 Marketing Advertising 19/09/2016
22 288 Production IT 03/01/2017
27 352 Facilities Kitchen 31/01/2017
54 790 Relations Sales 31/03/2017
df has several thousand records in it. I also have 2 date variables - the start date and end date of a reference period as strings (arguments from the command line) called:
referencePeriodStartDate and referencePeriodEndDate
which currently equal:
referencePeriodStartDate = 01/01/2017
referencePeriodEndDate = 30/03/2017
I am trying to return any records from df which have a status change date that falls outside the reference period set up by referencePeriodStartDate and referencePeriodEndDate.
In the example above, the records with identifiers 14 and 19 would be returned, as their status change dates, 19/09/2018 and 19/09/2016, are after and before the reference window respectively.
Example output
identifier department organisation status change date
1 14 Finance Accounts 19/09/2018
2 19 Marketing Advertising 19/09/2016
I have tried the following
resultdf = (df['status change date'].dt.date > referencePeriodEndDate.dt.date) & (df['status change date'].dt.date < referencePeriodStartDate.dt.date)
Where I convert the string dates to type date and try to apply the logic: if the status change date is smaller than referencePeriodStartDate and the status change date is greater than referencePeriodEndDate, then return the row.
My problem is that nothing is returned. Have I converted to type date incorrectly?
If you want to compare the dates from the column with a scalar date, you need date():
df['status change date'] = pd.to_datetime(df['status change date'])
referencePeriodStartDate = pd.to_datetime('01/01/2017')
referencePeriodEndDate = pd.to_datetime('30/03/2017')
resultdf = df[(df['status change date'].dt.date > referencePeriodEndDate.date()) |
(df['status change date'].dt.date < referencePeriodStartDate.date())]
print (resultdf)
identifier department organisation status change date
1 14 Finance Accounts 2018-09-19
2 19 Marketing Advertising 2016-09-19
54 790 Relations Sales 2017-03-31
Or to compare datetimes, simply remove the date conversions, or use between with the condition inverted by ~:
df['status change date'] = pd.to_datetime(df['status change date'])
referencePeriodStartDate = '01/01/2017'
referencePeriodEndDate = '30/03/2017'
resultdf = df[(df['status change date'] > referencePeriodEndDate) |
(df['status change date'] < referencePeriodStartDate)]
print (resultdf)
identifier department organisation status change date
1 14 Finance Accounts 2018-09-19
2 19 Marketing Advertising 2016-09-19
54 790 Relations Sales 2017-03-31
mask = ~df['status change date'].between(referencePeriodStartDate, referencePeriodEndDate)
resultdf = df[mask]
print (resultdf)
identifier department organisation status change date
1 14 Finance Accounts 2018-09-19
2 19 Marketing Advertising 2016-09-19
54 790 Relations Sales 2017-03-31
As the code from Jezrael shows, you're slicing using '&'. Your dates cannot be after x and at the same time before y, so the mask matches nothing. Convert the strings to a date type and then use 'or', i.e. '|'.
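Putting that together, a minimal sketch of the corrected comparison (dayfirst=True is an assumption based on the dd/mm/yyyy dates shown in the question):
df['status change date'] = pd.to_datetime(df['status change date'], dayfirst=True)
start = pd.to_datetime(referencePeriodStartDate, dayfirst=True)  # '01/01/2017'
end = pd.to_datetime(referencePeriodEndDate, dayfirst=True)      # '30/03/2017'
resultdf = df[(df['status change date'] < start) | (df['status change date'] > end)]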
