I have a dataframe in Python 3 using pandas with a column containing dates as strings.
This is a subset of the column:
ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
"2020-04-08"
"2020-04-12"
I would like to remove rows that repeat the same month and day, keeping the one with the newest year.
This is what I would expect as a result from this subset:
ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
The last two rows were removed: 2020-04-08 because its month-day pair already appears in 2021, and the second 2020-04-12 because it is an exact duplicate.
I thought of doing this with an apply and lambda but my real dataframe has hundreds of rows and tens of columns so it would not be efficient. Is there a more efficient way of doing this?
There are a couple of ways you can do this. One would be to extract the year and the month-day pair, sort by year, and for each duplicate month-day pair keep only the row with the newest year.
# separate year and month-day pairs (vectorized string slicing is faster than apply)
df['year'] = df['ColA'].str[:4]
df['mo-day'] = df['ColA'].str[5:]
df.sort_values('year', inplace=True)
print(df)
This is what it would look like after separation and sorting:
ColA year mo-day
2 2020-04-12 2020 04-12
3 2020-04-08 2020 04-08
4 2020-04-12 2020 04-12
0 2021-04-03 2021 04-03
1 2021-04-08 2021 04-08
Afterwards, we can drop the duplicates and remove the helper columns. Note that after the ascending sort by year, the *last* occurrence of each month-day pair is the one with the newest year, so we need keep='last':
# drop duplicate month-day pairs, keeping the newest year (last after the ascending sort)
df.drop_duplicates('mo-day', keep='last', inplace=True)
# get rid of the two helper columns
df.drop(['year', 'mo-day'], axis=1, inplace=True)
# since we dropped rows, reset the index
df.reset_index(drop=True, inplace=True)
print(df)
Final result:
ColA
0 2020-04-12
1 2021-04-03
2 2021-04-08
This would be much faster than if you were to convert the entire column to datetime and extract dates, as you're working with the string as is.
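Since ISO-format date strings sort chronologically, the same idea can also be written as one chained expression without any datetime conversion. A sketch (the column name `moday` is just an illustrative helper):

```python
import pandas as pd

df = pd.DataFrame({'ColA': ["2021-04-03", "2021-04-08", "2020-04-12",
                            "2020-04-08", "2020-04-12"]})

# ISO strings sort lexicographically == chronologically, so a descending sort
# puts the newest year first for each month-day pair
result = (df.assign(moday=df['ColA'].str[5:])
            .sort_values('ColA', ascending=False)
            .drop_duplicates('moday')
            .drop(columns='moday')
            .reset_index(drop=True))
print(result)
```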
I'm not sure you can get away from using an 'apply' to extract the relevant part of the date for grouping, but this is much easier if you first convert that column to a pandas datetime type:
df = pd.DataFrame({'colA':
["2021-04-03",
"2021-04-08",
"2020-04-12",
"2020-04-08",
"2020-04-12"]})
df['colA'] = pd.to_datetime(df.colA)  # vectorized; equivalent to, but faster than, apply(pd.to_datetime)
Then you can group by the (day, month) and keep the highest value like so:
df.groupby(df.colA.apply(lambda x: (x.day, x.month))).max()
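That said, you can avoid the row-wise apply entirely: once the column is a datetime type, the vectorized .dt accessors give the same grouping. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'colA': ["2021-04-03", "2021-04-08", "2020-04-12",
                            "2020-04-08", "2020-04-12"]})
# vectorized conversion instead of row-wise apply
df['colA'] = pd.to_datetime(df['colA'])

# group by (month, day) without apply, keep the newest date in each group
result = df.groupby([df.colA.dt.month, df.colA.dt.day])['colA'].max().reset_index(drop=True)
print(result)
```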
I have looked for solutions but seem to find none that point me in the right direction, hopefully, someone on here can help. I have a stock price data set, with a frequency of Month Start. I am trying to get an output where the calendar years are the column names, and the day and month will be the index (there will only be 12 rows since it is monthly data). The rows will be filled with the stock prices corresponding to the year and month. I, unfortunately, have no code since I have looked at for loops, groupby, etc but can't seem to figure this one out.
You might want to split the date into month and year and apply a pivot:
s = pd.to_datetime(df.index)
out = (df
.assign(year=s.year, month=s.month)
.pivot_table(index='month', columns='year', values='Close', fill_value=0)
)
output:
year 2003 2004
month
1 0 2
2 0 3
3 0 4
12 1 0
Used input:
df = pd.DataFrame({'Close': [1,2,3,4]},
index=['2003-12-01', '2004-01-01', '2004-02-01', '2004-03-01'])
You need multiple steps to do that.
First split your column into the right format.
Then convert this column into two separate columns.
Then pivot the table accordingly.
import pandas as pd
# Test Dataframe
df = pd.DataFrame({'Date': ['2003-12-01', '2004-01-01', '2004-02-01', '2004-12-01'],
'Close': [6.661, 7.053, 6.625, 8.999]})
# Split datestring into list of form [year, month-day]
df = df.assign(Date=df.Date.str.split(pat='-', n=1))
# Separate date-list column into two columns
df = pd.DataFrame(df.Date.to_list(), columns=['Year', 'Date'], index=df.index).join(df.Close)
# Pivot the table
df = df.pivot(columns='Year', index='Date')
df
Output:
Close
Year 2003 2004
Date
01-01 NaN 7.053
02-01 NaN 6.625
12-01 6.661 8.999
I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates with a corresponding NaN in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However, the bfill there replaces all my NaNs, and removing it only produces an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code, so that it fills my missing dates. However, it is part of a programme, which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex() or maybe the assignment of today's date needs changing?
Pandas has an asfreq function for a DatetimeIndex; it is basically a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
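Regarding the follow-up error in the question: asfreq raises "ValueError: cannot reindex from a duplicate axis" when the same date appears twice in the index. One way to guard against that (a sketch, keeping the most recent value for each date; the sample values are made up) is to drop duplicate dates before calling asfreq:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2021-05-01', '2021-05-05', '2021-05-05'],
                   'Portfoliovalue': [50000.0, 52304.0, 52500.0]})
df['Date'] = pd.to_datetime(df['Date'])

# keep only the latest entry per date so asfreq's reindex sees a unique index
df = (df.drop_duplicates('Date', keep='last')
        .set_index('Date')
        .asfreq('D')
        .reset_index())
print(df)
```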
Pandas has a reindex method: given a list of indices, it keeps only the indices from that list.
In your case, you can create all the dates you want (with date_range, for example) and pass them to reindex. You might need a simple set_index and reset_index around it, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we reindex with the full list of dates (given by date_range from the minimal to the maximal date in 'Date', with daily frequency) as the new index. This produces NaNs in the places that had no former value.
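Put together with the sample data from the question, the whole thing looks like this (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2021-05-01', '2021-05-05'],
                   'Portfoliovalue': [50000.0, 52304.0]})
df['Date'] = pd.to_datetime(df['Date'])

# build the full daily range and reindex onto it; missing dates become NaN rows
full_range = pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')
out = (df.set_index('Date')
         .reindex(full_range)
         .rename_axis('Date')
         .reset_index())
print(out)
```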
I want to filter my DataFrame by 2 columns: one for a date and one for a name.
How can I filter out data from the previous month only? So if I run the code today, it will filter out data for the previous month.
The date column contains values
of the form (year, month, day): [202006, 202005, 202007, 202107, 20200601, 20200630],
etc. (Note that in some, the day part is absent.)
And while filtering this, I also want to filter the 2nd column, in which I only want to take the names that contain specific keywords.
Example:
Data = [[202006, 'Fuel oil'], [202007, 'crude oil'], [20200601, 'palm oil'], [20200805, 'crude oil'], [202007, 'Marine fuel']]
If I run the code, it should automatically give me the previous month's data and the names that contain the word "oil".
First convert the dates to datetimes. Two date formats are handled by calling to_datetime twice with different format strings and errors='coerce'; the missing values from the first pass are then filled from the second with Series.fillna:
df= pd.DataFrame({'date':[202006, 202005, 202007,202107,20200601, 20200630 ],
'fuel':['Fuel oil','crude oil','fuel oil',
'castor oil','crude oil', 'fuel']})
d1 = pd.to_datetime(df['date'], format='%Y%m', errors='coerce')
d2 = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df['date'] = d1.fillna(d2)
print (df)
date fuel
0 2020-06-01 Fuel oil
1 2020-05-01 crude oil
2 2020-07-01 fuel oil
3 2021-07-01 castor oil
4 2020-06-01 crude oil
5 2020-06-30 fuel
Then the values are filtered by month periods: for the first condition, Series.dt.to_period is compared with the current month minus one; it is chained by & (bitwise AND) with the second condition from Series.str.contains, and the result is used for boolean indexing:
now = pd.Timestamp('now').to_period('M')
df = df[df['date'].dt.to_period('M').eq(now - 1) & df['fuel'].str.contains('oil')]
print (df)
date fuel
0 2020-06-01 Fuel oil
4 2020-06-01 crude oil
Assuming your dataframe is
df = pd.DataFrame({
'date':[202006, 202005, 202007, 202107, 20200601, 20200630],
'fuel':['Fuel oil', 'crude oil', 'fuel oil', 'castor oil', 'crude oil', 'fuel']})
Then you can do the following code to filter it:
import time

# finding the previous month and year
current_year = time.gmtime().tm_year
current_month = time.gmtime().tm_mon

# adding a check for when the current month is January
if current_month != 1:
    prev_month = current_month - 1
else:
    prev_month = 12
    current_year -= 1

# extracting month/year info from the date column by converting it to strings
df[df.date.apply(lambda x: int(str(x)[4:6]) == prev_month and int(str(x)[:4]) == current_year) & df.fuel.apply(lambda x: 'oil' in x)]
Note:
df.date.apply(lambda x: int(str(x)[4:6])) extracts the month info, which I use to compare with the previous month and filter.
df.fuel.apply(lambda x: 'oil' in x) checks which elements contain the word oil.
Assuming the dataframe is called 'dataframe', the date column is the first column, and the 'name' column is the second,
you can use this simple for loop to filter all the items and add them to a new dataframe.
dataframe_filter = pd.DataFrame()
month = 202006    # filter by this month
key_word = 'oil'  # filter by this keyword

for i in range(0, len(dataframe)):
    if dataframe.iloc[i, 0] == month or dataframe.iloc[i, 0] // 100 == month:  # date without or with a day part
        if key_word in dataframe.iloc[i, 1]:
            dataframe_filter[i] = dataframe.iloc[i]  # set as a new column in dataframe_filter

dataframe_filter = dataframe_filter.transpose()  # transpose back to the original format
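For larger frames, the same filter can be written without an explicit Python loop. A sketch (the `//100` trick strips the day part from the longer date format, as in the loop above; column names are assumed from the question's example):

```python
import pandas as pd

dataframe = pd.DataFrame({'date': [202006, 202007, 20200601, 20200805],
                          'name': ['Fuel oil', 'crude oil', 'palm oil', 'crude oil']})
month = 202006
key_word = 'oil'

# vectorized equivalent of the loop: match the month with or without a day part,
# then require the keyword in the name column
mask = (dataframe['date'].eq(month) | dataframe['date'].floordiv(100).eq(month)) \
       & dataframe['name'].str.contains(key_word)
dataframe_filter = dataframe[mask]
print(dataframe_filter)
```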
I was trying out time series analysis with pandas DataFrames and found that there are easy ways to select specific rows, like all the rows of a year, the rows between two dates, etc.
For example, consider
ind = pd.date_range('2004-01-01', '2019-08-13')
data = np.random.randn(len(ind))
df = pd.DataFrame(data, index=ind)
Here, we can select all the rows between and including the dates '2014-01-23' and '2014-06-18' with
df['2014-01-23':'2014-06-18']
and all the rows of the year '2015' with just
df['2015']
Is there a similar way to select all the rows belonging to a specific month but for all years?
I found ways to get all the rows of a particular month and a particular year with syntax like
df['01-2015'] #all rows of January 2015
I was hoping pandas would have a way with simple syntax to get all rows of a month irrespective of the year. Does such a way exist?
Use DatetimeIndex.month, compare, and filter with boolean indexing:
print (df[df.index.month == 1])
0
2004-01-01 2.398676
2004-01-02 2.074744
2004-01-03 0.106972
2004-01-04 0.294587
2004-01-05 0.243768
...
2019-01-27 -1.623171
2019-01-28 -0.043810
2019-01-29 -0.999764
2019-01-30 -0.928471
2019-01-31 -0.304730
[496 rows x 1 columns]
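In case it helps, the same accessor also makes it easy to select several months at once. A sketch:

```python
import pandas as pd
import numpy as np

ind = pd.date_range('2004-01-01', '2019-08-13')
df = pd.DataFrame(np.random.randn(len(ind)), index=ind)

# all rows from January or February, across every year
jan_feb = df[df.index.month.isin([1, 2])]
print(jan_feb.shape)
```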
I found many questions similar to mine, but none of them answer it exactly (this one comes closest, but it focuses on Ruby).
I have a pandas DataFrame like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': pd.date_range('2014-10-03', '2015-10-02', freq='1D'), 'Variable': np.random.randn(365)})
df.head()
Out[272]:
Date Variable
0 2014-10-03 0.637167
1 2014-10-04 0.562135
2 2014-10-05 -1.069769
3 2014-10-06 0.556997
4 2014-10-07 0.253468
I want to sort the data from January 1st to December 31st, ignoring the year component of the Date column. The background is that I want to track changes in Variable over the year, but my period starts and ends in October.
I thought of creating a separate column for month and year and then sorting by those, but I am unsure how to do this in a "correct" and concise way.
Expected output:
Date Variable
0 01-01 0.637167 # (Placeholder-values)
1 01-02 0.562135
2 01-03 -1.069769
3 01-04 0.556997
4 01-05 0.253468
One way, using argsort:
yourdf=df.loc[df.Date.dt.strftime('%m%d').astype(int).argsort()]
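A self-contained version of that one-liner, using the question's sample data (a sketch; this relies on the default RangeIndex, so .loc and positional order coincide):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Date': pd.date_range('2014-10-03', '2015-10-02', freq='1D'),
                   'Variable': np.random.randn(365)})

# argsort over the zero-padded 'MMDD' number orders Jan 1 .. Dec 31, ignoring the year
yourdf = df.loc[df.Date.dt.strftime('%m%d').astype(int).argsort()]
print(yourdf.head())
```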
You can create the day and month columns by simply doing the following:
df = pd.DataFrame(data=pd.date_range('2014-10-03', '2015-10-02', freq='1D'), columns=['date'])
df['day'] = df['date'].dt.day
df['month'] = df['date'].dt.month
You could make it even more compact, but for a quick analysis the above works.
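With those columns in place, the sort the question asks for is then just a sort_values over them. A sketch:

```python
import pandas as pd

df = pd.DataFrame(data=pd.date_range('2014-10-03', '2015-10-02', freq='1D'), columns=['date'])
df['day'] = df['date'].dt.day
df['month'] = df['date'].dt.month

# order rows January 1st .. December 31st, ignoring the year
df_sorted = df.sort_values(['month', 'day']).reset_index(drop=True)
print(df_sorted.head())
```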