How do I delete specific dataframe rows based on a columns value? - python

I have a pandas dataframe with 2 columns ("Date" and "Gross Margin). I want to delete rows based on what the value in the "Date" column is. This is my dataframe:
Date Gross Margin
0 2021-03-31 44.79%
1 2020-12-31 44.53%
2 2020-09-30 44.47%
3 2020-06-30 44.36%
4 2020-03-31 43.69%
.. ... ...
57 2006-12-31 49.65%
58 2006-09-30 52.56%
59 2006-06-30 49.86%
60 2006-03-31 46.20%
61 2005-12-31 40.88%
I want to delete every row where the "Date" value doesn't end with "12-31". I read some similar posts on this and the pandas.drop() function seemed to be the solution, but I haven't figured out how to use it for this specific case.
Please leave any suggestions as to what I should do.

you can try the following code, where you match the day and month.
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df = df[df['Date'].dt.strftime('%m-%d') == '12-31']

Assuming you have the date formatted as year-month-day
df = df[~df['Date'].str.endswith('12-31')]

If the dates are using a consistent format, you can do it like this:
df = df[df['Date'].str.contains("12-31", regex=False)]

Related

How to change str to date when year data inconsistent?

I've got a dataframe with a column names birthdates, they are all strings, most are saved as %d.%m.%Y, some are saved as %d.%m.%y.
How can I make this work?
df["birthdates_clean"] = pd.to_datetime(df["birthdates"], format = "%d.%m.%Y")
If this can't work, do I need to filter the rows? How would I do it?
Thanks for taking time to answer!
I am not sure what is the expected output, but you can let to_datetime parse automatically the dates:
df = pd.DataFrame({"birthdates": ['01.01.2000', '01.02.00', '02.03.99',
'02.03.22', '01.01.71', '01.01.72']})
# as datetime
df["birthdates_clean"] = pd.to_datetime(df["birthdates"], dayfirst=True)
# as custom string
df["birthdates_clean2"] = (pd.to_datetime(df["birthdates"], dayfirst=True)
.dt.strftime('%d.%m.%Y')
)
NB. the shift point is currently at 71/72. 71 gets evaluated as 2071 and 72 as 1972
output:
birthdates birthdates_clean birthdates_clean2
0 01.01.2000 2000-01-01 01.01.2000
1 01.02.00 2000-02-01 01.02.2000
2 02.03.99 1999-03-02 02.03.1999
3 02.03.22 2022-03-02 02.03.2022
4 01.01.71 2071-01-01 01.01.2071
5 01.01.72 1972-01-01 01.01.1972

How to fill missing dates with corresponding NaN in other columns

I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill the missing dates and have a corresponding NaN value in the Portfoliovalue column with NaN. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However the bfill replaces all my NaN's and removing it only returns an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code, so that it fills my missing dates. However, it is part of a programme, which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis”
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex() or maybe the assignment of today's date needs changing?
Pandas has asfreq function for datetimeIndex, this is basically just a thin, but convenient wrapper around reindex() which generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
Pandas has reindex method: given a list of indices, it remains only indices from list.
In your case, you can create all the dates you want, by date_range for example, and then give it to reindex. you might needed a simple set_index and reset_index, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
On first we set 'Date' column as index. Then we use reindex, it full list of dates (given by date_range from minimal date to maximal date in 'Date' column, with daily frequency) as new index. It result nans in places without former value.

How can I filter for rows one hour before and after a set timestamp in Python?

I am trying to filter a DataFrame to only show values 1-hour before and 1-hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see regarding masking by date mostly cover the case of masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows based around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame that is the initial DataFrame, but only rows for 1-hour before and 1-hour after a specified timestamp, and so only rows within this specified 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame, but now only contains rows that are within 1-hour before 2011-07-14 06:15:00, and 1-hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter where I get rid of all values before 2011-07-14 05:15:00 and 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do with:
import pandas as pd
import datetime as dt
data = {"date": ["2011-01-15 03:10:00","2011-01-15 03:40:00","2011-01-15 04:10:00","2011-01-15 04:40:00","2011-01-15 05:10:00","2011-01-15 07:10:00"],
"value":[1,2,3,4,5,6]}
df=pd.DataFrame(data)
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S', errors='ignore')
date_search= dt.datetime.strptime("2011-01-15 05:20:00",'%Y-%m-%d %H:%M:%S')
mask = (df['date'] > date_search-dt.timedelta(hours = 1)) & (df['date'] <= date_search+dt.timedelta(hours = 1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5

pandas extra a new header from a already exist header?

df = DataFrame({'DATE' : ['2017-01-01','2017-01-02'],'Sexuality/us' :['femle','male'],'Height/us' :[190,195]})
DATE Sexuality/us Height/us
0 2017-01-01 female 190
1 2017-01-02 male 195
As you see this is a pandas DataFrame data
I want to transfer this DataFrame to csv
When I use df.to_csv('demo.csv') as blow
What I really want to get is like this(extract the us to another row, of course, there may be many countries, I want to extract countries to an row as header):
Any one can help me? Thanks very much
If you put DATE into the index then you can split your columns by / and create a multiindex.
df = df.set_index('DATE')
df.columns = df.columns.str.split('/', expand=True)
df.reset_index().swaplevel(axis=1)
us
DATE Height Sexuality
0 2017-01-01 190 femle
1 2017-01-02 195 male

Filling date gaps in pandas dataframe

I have Pandas DataFrame (loaded from .csv) with Date-time as index.. where there is/have-to-be one entry per day.
The problem is that I have gaps i.e. there is days for which I have no data at all.
What is the easiest way to insert rows (days) in the gaps ? Also is there a way to control what is inserted in the columns as data ! Say 0 OR copy the prev day info OR to fill sliding increasing/decreasing values in the range from prev-date toward next-date data-values.
thanks
Here is example 01-03 and 01-04 are missing :
In [60]: df['2015-01-06':'2015-01-01']
Out[60]:
Rate High (est) Low (est)
Date
2015-01-06 1.19643 0.0000 0.0000
2015-01-05 1.20368 1.2186 1.1889
2015-01-02 1.21163 1.2254 1.1980
2015-01-01 1.21469 1.2282 1.2014
Still experimenting but this seems to solve the problem :
df.set_index(pd.DatetimeIndex(df.Date),inplace=True)
and then resample... the reason being that importing the .csv with header-col-name Date, is not actually creating date-time-index, but Frozen-list whatever that means.
resample() is expecting : if isinstance(ax, DatetimeIndex): .....
Here is my final solution :
#make dates the index
self.df.set_index(pd.DatetimeIndex(self.df.Date), inplace=True)
#fill the gaps
self.df = self.df.resample('D',fill_method='pad')
#fix the Date column
self.df.Date = self.df.index.values
I had to fix the Date column, because resample() just allow you to pad-it.
It fixes the index correctly though, so I could use it to fix the Date column.
Here is snipped of the data after correction :
2015-01-29 2015-01-29 1.13262 0.0000 0.0000
2015-01-30 2015-01-30 1.13161 1.1450 1.1184
2015-01-31 2015-01-31 1.13161 1.1450 1.1184
2015-02-01 2015-02-01 1.13161 1.1450 1.1184
01-30, 01-31 are the new generated data.
You'll could resample by day e.g. using mean if there are multiple entries per day:
df.resample('D', how='mean')
You can then ffill to replace NaNs with the previous days result.
See up and down sampling in the docs.

Categories

Resources