Delete entire dataframe rows based on condition and stack remaining rows - python

I have the following dataframe:
Date        Name    Grade  Hobby
01/01/2005  Albert  4      Drawing
08/04/1996  Martha  6      Horseback riding
03/03/2003  Jack    5      Singing
07/01/2001  Millie  5      Netflix
24/09/2000  Julie   7      Sleeping
...
I want to filter the df to contain only the rows with repeated dates, i.e. where df['Date'].value_counts() >= 2.
And then group by date, sorted in chronological order, so that I get something like:
Date        Name    Grade  Hobby
08/04/1996  Martha  6      Horseback riding
            Matt    4      Sleeping
            Paul    5      Cooking
24/09/2000  Julie   7      Sleeping
            Simone  4      Sleeping
...
I have tried some code, but I get stuck on the first step. I tried something like:
same = df['Date'].value_counts()
same = same.loc[lambda x: x >= 2]
mult = same.index.to_list()
for i in df['Date']:
    if i not in mult:
        df.drop(df[df['Date'==i]].index)
I also tried
new = df.loc[df['Date'].isin(mult)]
plot = pd.pivot_table(new, index=['Date'], columns=['Name'])
But this only keeps one row per repeated date instead of all the rows that share the same date.

I think this should do the job:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df_new = df[df['Date'].duplicated(keep=False)].sort_values('Date')

Convert Date to datetimes with to_datetime, then filter the rows with boolean indexing, and finally sort with DataFrame.sort_values:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
same = df['Date'].value_counts()
df1 = df[df['Date'].map(same) >= 2].sort_values(['Date', 'Grade'], ascending=[True, False])
Or use Series.duplicated with keep=False, which flags every row whose date occurs two or more times, i.e. all duplicates:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = df[df['Date'].duplicated(keep=False)].sort_values(['Date','Grade'], ascending=[True, False])
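If you also want the stacked display from the question, where each repeated date appears only once, one option is to blank the duplicates for printing (a display-only sketch, reusing df1 from above):
# show each date once; blank the repeats (df1 is already sorted by date)
out = df1.copy()
out['Date'] = out['Date'].dt.strftime('%d/%m/%Y')
out.loc[out['Date'].duplicated(), 'Date'] = ''
print(out.to_string(index=False))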

Related

How to get calendar years as column names and month and day as index for one timeseries

I have looked for solutions but found none that point me in the right direction; hopefully someone here can help. I have a stock price data set with a frequency of Month Start. I am trying to get an output where the calendar years are the column names and the day and month form the index (there will only be 12 rows since the data is monthly). The rows will be filled with the stock prices corresponding to the year and month. Unfortunately I have no code to show: I have looked at for loops, groupby, etc., but can't figure this one out.
You might want to split the date into month and year and apply a pivot:
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, month=s.month)
       .pivot_table(index='month', columns='year', values='Close', fill_value=0)
)
output:
year   2003  2004
month
1         0     2
2         0     3
3         0     4
12        1     0
Used input:
df = pd.DataFrame({'Close': [1, 2, 3, 4]},
                  index=['2003-12-01', '2004-01-01', '2004-02-01', '2004-03-01'])
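If the index should show the day and month (e.g. '12-01') rather than the month number, one variation is to pivot on a formatted string instead (a sketch, assuming the same df; 'monthday' is just an illustrative column name):
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, monthday=s.strftime('%m-%d'))  # 'MM-DD' strings as row labels
       .pivot_table(index='monthday', columns='year', values='Close', fill_value=0)
)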
You need multiple steps to do that.
First split your column into the right format.
Then convert this column into two separate columns.
Then pivot the table accordingly.
import pandas as pd

# Test dataframe
df = pd.DataFrame({'Date': ['2003-12-01', '2004-01-01', '2004-02-01', '2004-12-01'],
                   'Close': [6.661, 7.053, 6.625, 8.999]})
# Split each date string into a list of the form [year, month-day]
df = df.assign(Date=df.Date.str.split(pat='-', n=1))
# Separate the date-list column into two columns
df = pd.DataFrame(df.Date.to_list(), columns=['Year', 'Date'], index=df.index).join(df.Close)
# Pivot the table
df = df.pivot(columns='Year', index='Date')
df
Output:
       Close
Year    2003   2004
Date
01-01    NaN  7.053
02-01    NaN  6.625
12-01  6.661  8.999
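A note on the design choice: plain pivot is enough here because each (Date, Year) pair occurs at most once in monthly data. If duplicates were possible, pivot would raise an error, and pivot_table with an aggregation function would be the safer variant, e.g.:
df = df.pivot_table(columns='Year', index='Date', aggfunc='first')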

Grouping dates together by year in Pandas

I have a dataset of property prices, currently listed by 'SALE_DATE'. I'd like to be able to count them by year. The dataset looks like this -
   SALE_DATE    COUNTY  SALE_PRICE
0 2010-01-01    Dublin    343000.0
1 2010-01-03     Laois    185000.0
2 2010-01-04    Dublin    438500.0
3 2010-01-04     Meath    400000.0
4 2010-01-04  Kilkenny    160000.0
This is the code I've tried -
by_year = property_prices['SALE_DATE'] = pd.to_datetime(property_prices['SALE_DATE'])
print(by_year)
I think I'm close, but as a complete noob it's quite frustrating!
Thank you for any help you can provide; this site has been awesome so far in finding little tips and tricks to make my life easier.
You are close. As you did, you can use pd.to_datetime to convert your sale date to a datetime column. Then group by the year, using dt.year, which extracts the year of each datetime, and call size() on the result, which computes the size of each group, here the number of sales per year.
property_prices['SALE_DATE'] = pd.to_datetime(property_prices['SALE_DATE'])
property_prices.groupby(property_prices.SALE_DATE.dt.year).size()
Which prints:
SALE_DATE
2010 5
dtype: int64
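An equivalent one-liner is value_counts on the year; note that value_counts sorts by count by default, hence the sort_index:
property_prices['SALE_DATE'].dt.year.value_counts().sort_index()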
import pandas as pd

sample_dict = {'Date': ['2010-01-11', '2020-01-22', '2010-03-12'], 'Price': [1000, 2000, 3500]}
df = pd.DataFrame(sample_dict)
# Create a 'year' column from the Date column
df['year'] = df.apply(lambda row: row.Date.split('-')[0], axis=1)
# Group by the new column (note: the column name is case-sensitive)
df1 = df.groupby('year')
# Print the first value in each group
df1.first()
Output:
            Date  Price
year
2010  2010-01-11   1000
2020  2020-01-22   2000
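As a side note, the year can be extracted without apply via the vectorized .str accessor (a sketch on the same sample df):
df['year'] = df['Date'].str[:4]  # vectorized string slice instead of row-wise apply
df.groupby('year').first()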

Python pandas column filtering substring

I have a dataframe in python3 using pandas which has a column containing a string with a date.
This is a subset of the column:
ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
"2020-04-08"
"2020-04-12"
I would like to remove the rows that have the same month and day twice and keep the one with the newest year.
This would be what I would expect as a result from this subset
ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
The last two rows were removed: 2020-04-08 because its month and day already appear in 2021, and the second 2020-04-12 because it is a duplicate.
I thought of doing this with an apply and lambda, but my real dataframe has hundreds of rows and tens of columns, so it would not be efficient. Is there a more efficient way of doing this?
There are a couple of ways you can do this. One of them would be to extract the year, sort by it in descending order so the newest year comes first, and then drop rows with a duplicate month-day pair.
# separate year and month-day pairs
df['year'] = df['ColA'].apply(lambda x: x[:4])
df['mo-day'] = df['ColA'].apply(lambda x: x[5:])
# newest year first, so keep='first' later keeps the most recent date
df.sort_values('year', ascending=False, inplace=True)
print(df)
This is what it would look like after separation and sorting:
         ColA  year mo-day
0  2021-04-03  2021  04-03
1  2021-04-08  2021  04-08
2  2020-04-12  2020  04-12
3  2020-04-08  2020  04-08
4  2020-04-12  2020  04-12
Afterwards, we can simply drop the duplicates and remove the helper columns:
# drop duplicate month-day pairs, keeping the newest year
df.drop_duplicates('mo-day', keep='first', inplace=True)
# get rid of the two helper columns
df.drop(['year', 'mo-day'], axis=1, inplace=True)
# since we dropped rows, reset the index
df.reset_index(drop=True, inplace=True)
print(df)
Final result:
         ColA
0  2021-04-03
1  2021-04-08
2  2020-04-12
This would be much faster than if you were to convert the entire column to datetime and extract dates, as you're working with the string as is.
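The apply calls can also be replaced with the vectorized .str accessor, which keeps the same logic but avoids per-row Python calls (a sketch, assuming ColA holds plain 'YYYY-MM-DD' strings; 'moday' is just an illustrative helper name):
out = (df.assign(year=df['ColA'].str[:4], moday=df['ColA'].str[5:])
         .sort_values('year', ascending=False)   # newest year first
         .drop_duplicates('moday')               # keep first = newest per month-day
         .drop(columns=['year', 'moday'])
         .reset_index(drop=True))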
I'm not sure you can get away from using an 'apply' to extract the relevant part of the date for grouping, but this is much easier if you first convert that column to a pandas datetime type:
df = pd.DataFrame({'colA':
    ["2021-04-03",
     "2021-04-08",
     "2020-04-12",
     "2020-04-08",
     "2020-04-12"]})
df['colA'] = df.colA.apply(pd.to_datetime)
Then you can group by the (day, month) and keep the highest value like so:
df.groupby(df.colA.apply(lambda x: (x.day, x.month))).max()
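The apply in the groupby can likewise be avoided with the .dt accessor once the column is datetime (a sketch of the same grouping):
# group on (month, day) extracted vectorized; max keeps the newest date per pair
df.groupby([df.colA.dt.month, df.colA.dt.day]).max()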

How to filter a dataframe for specific dates and names in Python?

I want to filter my dataframe by 2 columns, one for the date and the other for a name.
How can I filter out data from the previous month only? So if I run the code today, it should return only the previous month's data.
The date column contains values of the form (year, month, day): [202006, 202005, 202007, 202107, 20200601, 20200630], etc. (Note that in some values the day is absent.)
And while filtering this, I also want to filter the second column, keeping only the names that contain specific keywords.
Example:
Data = [[202006, 'Fuel oil'], [202007, 'crude oil'], [20200601, 'palm oil'], [20200805, 'crude oil'], [202007, 'Marine fuel']]
If I run the code, it should automatically give me the previous month's data with only the names that contain the word "oil".
First convert the dates to datetimes. Here two date formats are present, so call to_datetime twice with different formats and errors='coerce', then fill the missing values of one result from the other with Series.fillna:
df = pd.DataFrame({'date': [202006, 202005, 202007, 202107, 20200601, 20200630],
                   'fuel': ['Fuel oil', 'crude oil', 'fuel oil',
                            'castor oil', 'crude oil', 'fuel']})

d1 = pd.to_datetime(df['date'], format='%Y%m', errors='coerce')
d2 = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df['date'] = d1.fillna(d2)
print(df)
        date        fuel
0 2020-06-01    Fuel oil
1 2020-05-01   crude oil
2 2020-07-01    fuel oil
3 2021-07-01  castor oil
4 2020-06-01   crude oil
5 2020-06-30        fuel
Then filter by month periods: convert with Series.dt.to_period and compare against today's month minus one for the first condition, chain the second condition, Series.str.contains, with & (bitwise AND), and select the matching rows by boolean indexing:
now = pd.Timestamp('now').to_period('M')
df = df[df['date'].dt.to_period('M').eq(now - 1) & df['fuel'].str.contains('oil')]
print (df)
        date       fuel
0 2020-06-01   Fuel oil
4 2020-06-01  crude oil
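One caveat: str.contains('oil') is case-sensitive, so a value like 'OIL blend' would be missed; pass case=False if the match should ignore case:
df['fuel'].str.contains('oil', case=False)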
Assuming your dataframe is
df = pd.DataFrame({
    'date': [202006, 202005, 202007, 202107, 20200601, 20200630],
    'fuel': ['Fuel oil', 'crude oil', 'fuel oil', 'castor oil', 'crude oil', 'fuel']})
Then you can use the following code to filter it:
import time

# finding previous month and year
current_year = time.gmtime().tm_year
current_month = time.gmtime().tm_mon

# adding a check in case the current month is January
if current_month != 1:
    prev_month = current_month - 1
else:
    prev_month = 12
    current_year -= 1

# extract month and year info from the date column by converting it to strings
df[df.date.apply(lambda x: int(str(x)[4:6]) == prev_month and int(str(x)[:4]) == current_year)
   & df.fuel.apply(lambda x: 'oil' in x)]
Note:
df.date.apply(lambda x: int(str(x)[4:6])) extracts the month info, which is compared with the previous month for filtering.
df.fuel.apply(lambda x: 'oil' in x) checks which elements contain the word oil.
Assuming the dataframe is called 'dataframe', the date column is the first column, and the 'name' column is the second, you can use this simple for loop to filter the items and add them to a new dataframe.
dataframe_filter = pd.DataFrame()
month = 202006    # filter by this month
key_word = 'oil'  # filter by this keyword

for i in range(0, len(dataframe)):
    # date that doesn't include a day, or date that includes a day
    if dataframe.iloc[i, 0] == month or dataframe.iloc[i, 0] // 100 == month:
        if key_word in dataframe.iloc[i, 1]:
            # set as a new column in dataframe_filter (transposed back to the correct format below)
            dataframe_filter[i] = dataframe.iloc[i]

dataframe_filter = dataframe_filter.transpose()  # transpose dataframe
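The same filter can be written without the loop, using boolean masks (a sketch, assuming the same positional columns and an integer-typed date column):
s = dataframe.iloc[:, 0]
mask = (s.eq(month) | s.floordiv(100).eq(month)) & dataframe.iloc[:, 1].str.contains(key_word)
dataframe_filter = dataframe[mask]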

Finding number of months between overlapping periods - pandas

I have a data set of customers and their policies, and I am trying to find the number of months each customer has been with us (tenure).
df
cust_no  poly_no  start_date    end_date
      1        1  2016-06-01  2016-08-31
      1        2  2017-05-01  2018-05-31
      1        3  2016-11-01  2018-05-31
output should look like,
cust_no  no_of_months
      1            22
So basically, it should get rid of the months where there is no policy and count any overlapping period once, not twice. I have to do this for every customer, so group by cust_no. How can I do this?
Thanks.
One way to do this is to create a date range for each record, then use stack to collect all the months. Next, take only the unique values so that each month is counted once:
s = df.apply(lambda x: pd.Series(pd.date_range(x.start_date, x.end_date, freq='M').values), axis=1)
ss = s.stack().unique()
ss.shape[0]
Output:
22
For multiple customers you can use groupby. Continuing with @ScottBoston's answer:
df_range = df.apply(lambda r: pd.Series(
    pd.date_range(start=r.start_date, end=r.end_date, freq='M').values), axis=1)
# group by the customer column of the original frame, since df_range itself only holds dates
df_range.groupby(df['cust_no']).apply(lambda x: x.stack().unique().shape[0])
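An alternative sketch uses calendar Periods instead of month-end dates, which also counts a month that a policy only partially covers (assumes the same df with dates already parsed by pd.to_datetime):
# one PeriodIndex of covered months per policy row
months = df.apply(lambda r: pd.period_range(r.start_date, r.end_date, freq='M'), axis=1)
# per customer: size of the union of covered months, so overlaps count once
tenure = months.groupby(df['cust_no']).apply(lambda g: len(set().union(*g)))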
