subset the pandas dataframe - python

I have a pandas dataframe
date Speed
1986-01-01 0.3
....
2017-03-01 0.4
where date is the index of the dataframe. I want to create a dataframe containing only the data for 1986, 2000 and 2017, without the date index, like
index date speed
1 1986 0.3
....
13 2000 0.5

Assuming your 'date' index is already a datetime dtype:
df.reset_index(inplace=True)
df['date'] = df['date'].dt.year
df = df[df['date'].isin([1986,2000,2017])]
...and if not, add df['date'] = pd.to_datetime(df['date']) after the reset_index
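Put together as a runnable sketch (the sample values here are made up to match the question's shape):

```python
import pandas as pd

# Made-up sample data matching the question's shape
df = pd.DataFrame(
    {"Speed": [0.3, 0.7, 0.5, 0.4]},
    index=pd.to_datetime(["1986-01-01", "1995-06-01", "2000-02-01", "2017-03-01"]),
)
df.index.name = "date"

df.reset_index(inplace=True)          # move the date index into a column
df["date"] = df["date"].dt.year       # keep only the year
df = df[df["date"].isin([1986, 2000, 2017])]
print(df)
```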

Use the following steps:
# Step one - move the date index into a regular column
df = df.reset_index()
# Step two - convert date column to date type
df['date'] = pd.to_datetime(df['date'])
# Step three - create year column using the date object
df['year'] = df.date.dt.year
# Step four - select target years
df[df.year.isin([1986,2000,2017])]

Related

Python Date Time - Create New Column if Past Date change to Current Date MMM-YY

I need your help creating a new column.
create new column
if Date < current month and year: change the value to the current month and year (MMM-YY)
if Date >= current month and year: no change
Here is an example: all the red dates are in the past, hence in the new column the new date is this month (Aug-22).
I tried below but no luck:
string_input_with_date = df["Old Column"]
past = datetime.strptime(string_input_with_date, "%b-%y")
present = datetime.now()
past.date() < present.date()
Try this:
import pandas as pd
import datetime
df = pd.DataFrame({'Old Column': ['Jan-20', 'Feb-20', 'Jan-20', 'Feb-21', 'Dec-23', 'Aug-22', 'Mar-22', 'Nov-22', 'Oct-22']})
df['tmp'] = '01-' + df['Old Column']
df['tmp'] = pd.to_datetime(df['tmp'], format="%d-%b-%y", errors='coerce')
present = datetime.datetime.now()
condition = df['tmp'].dt.date < datetime.date(present.year, present.month, 1)
df['New Column'] = df['Old Column']
df.loc[condition, 'New Column'] = present.strftime('%b-%y')
df['New Column']
I added a tmp column, which is then converted with to_datetime.
And I used df.loc to replace all rows matching the condition with present.strftime('%b-%y').
This is the output (assuming the code is run in August 2022):
0 Aug-22
1 Aug-22
2 Aug-22
3 Aug-22
4 Dec-23
5 Aug-22
6 Aug-22
7 Nov-22
8 Oct-22
Name: New Column, dtype: object

pandas increment row based on how many times a date is in a dataframe

I have this list, for example dates = ["2020-2-1", "2020-2-3", "2020-5-8"]. Now I want to make a dataframe which contains only the month and year, plus a count of how many times each appeared. The output should be like:
Date     Count
2020-2   2
2020-5   1
Shorter code:
# assumes df['dates'] is already datetime; otherwise convert first:
# df['dates'] = pd.to_datetime(df['dates'])
df['month_year'] = df['dates'].dt.to_period('M')
df1 = df.groupby('month_year')['dates'].count().reset_index(name="count")
print(df1)
month_year count
0 2020-02 2
1 2020-05 1
import pandas as pd
dates = ["2020-2-1", "2020-2-3", "2020-5-8"]
df = pd.DataFrame({'Date':dates})
df['Date'] = df['Date'].str.slice(0,6)
df['Count'] = 1
df = df.groupby('Date').sum().reset_index()
Note: you might want to use the format "2020-02-01" with padded zeros so that the first 7 characters are always the year and month
This will give you "Month" and "Year" columns with the count for each year/month pair.
If you want, you can combine the month/year columns afterwards; this gives the results you expect, if not exactly cleaned up.
df = pd.DataFrame({'Column1' : ["2020-2-1", "2020-2-3", "2020-5-8"]})
df['Month'] = pd.to_datetime(df['Column1']).dt.month
df['Year'] = pd.to_datetime(df['Column1']).dt.year
df.groupby(['Month', 'Year']).agg('count').reset_index()
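As a compact alternative (a sketch, not from the answers above): parse the strings once, bucket by month period, and count with value_counts.

```python
import pandas as pd

dates = ["2020-2-1", "2020-2-3", "2020-5-8"]

# Parse once, bucket by calendar month, and count occurrences
counts = (
    pd.to_datetime(pd.Series(dates))
      .dt.to_period("M")
      .value_counts()
      .sort_index()
      .rename_axis("Date")
      .reset_index(name="Count")
)
print(counts)
```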

Add missing dates to datetime column in Pandas using last value

I've already checked out Add missing dates to pandas dataframe, but I don't want to fill in the new dates with a generic value.
My dataframe looks more or less like this:
date (dd/mm/yyyy)   value
01/01/2000          a
02/01/2000          b
03/01/2000          c
06/01/2000          d
So in this example, days 04/01/2000 and 05/01/2000 are missing. What I want to do is to insert them before the 6th, with a value of c, the last value before the missing days. So the "correct" df should look like:
date (dd/mm/yyyy)   value
01/01/2000          a
02/01/2000          b
03/01/2000          c
04/01/2000          c
05/01/2000          c
06/01/2000          d
There are multiple instances of missing dates, and it's a large df (~9000 rows).
Thanks for your time! :)
try this:
# If your date format is dayfirst, then use the following code
df['date (dd/mm/yyyy)'] = pd.to_datetime(df['date (dd/mm/yyyy)'], dayfirst=True)
out = df.set_index('date (dd/mm/yyyy)').asfreq('D', method='ffill').reset_index()
print(out)
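End to end, with sample data reproducing the question (the frame construction here is an assumption to make the snippet self-contained):

```python
import pandas as pd

# Sample data matching the question
df = pd.DataFrame({
    "date (dd/mm/yyyy)": ["01/01/2000", "02/01/2000", "03/01/2000", "06/01/2000"],
    "value": ["a", "b", "c", "d"],
})

df["date (dd/mm/yyyy)"] = pd.to_datetime(df["date (dd/mm/yyyy)"], dayfirst=True)
# asfreq('D') inserts the missing daily rows; method='ffill' fills
# them with the last value seen before the gap
out = df.set_index("date (dd/mm/yyyy)").asfreq("D", method="ffill").reset_index()
print(out)
```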
Assuming your dates fall on a regular (here daily) frequency, you can generate a DatetimeIndex with date_range, filter out the dates already in your date column, create a dataframe to concatenate with NaN in the value column, and fill the gaps using forward fill.
import numpy as np
import pandas as pd

# assuming your dataframe is df and df.date is already datetime:
all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq='D')
known_dates = set(df.date.to_list())  # set membership checks are fast compared with a list
unknown_dates = all_dates[~all_dates.isin(known_dates)]
df2 = pd.DataFrame({'date': unknown_dates})
df2['value'] = np.nan
df = pd.concat([df, df2])
df = df.sort_values('date').fillna(method='ffill')

How to pivot a pandas df where each column header is an hour and each row is a date

So I have a pandas df that looks like this
where each column is an hour of the day noted in the date column. I would like to pivot this df so each hour of the day is its own row. Similar to this
where there would be 24 rows for every hour of each date.
I've tried to used pd.melt using the following
hourly_value = ['00:00','01:00','02:00','03:00','04:00','05:00','06:00','07:00','08:00','09:00','10:00','11:00','12:00']
df = df.melt(id_vars = ['DATE'], var_name = hourly_value, value_name = ('Hourly Precip'))
but keep getting the error "IndexError: Too many levels: Index has only 1 level, not 2". I've also looked into using df.pivot, but I'm starting to think my df is in a much different format than most of the examples.
One way to get what you want is to:
Use .set_index('DATE') to turn the DATE column into the index.
Use .stack() to bring the columns into the index as well, creating a MultiIndex where the row for each date gets inserted as a second level in the index.
Use .reset_index() to turn all index levels back into rows.
The following snippet illustrates:
import numpy as np
import pandas as pd
dates = [f"1/{i}/2020" for i in range(1, 21)]
cols = ["DATE"] + [str(i) + ":00" for i in range(24)]
zeros = np.zeros((len(dates), len(cols) - 1))
data = list([[x] + list(y) for x, y in zip(dates, zeros)])
df = pd.DataFrame(data=data, columns=cols)
df2 = (
    df.set_index("DATE")  # makes the DATE column the index
      .stack()            # stacks the hour columns into the index
      .reset_index()
      .rename(columns={"level_1": "Time", 0: "Value"})
)
print(df2.head())
Which outputs:
DATE Time Value
0 1/1/2020 0:00 0.0
1 1/1/2020 1:00 0.0
2 1/1/2020 2:00 0.0
3 1/1/2020 3:00 0.0
4 1/1/2020 4:00 0.0
Try this:
pd.melt( df.reset_index(), id_vars=['DATE'], var_name='hour', value_name='Hourly Precip')
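As a quick check with a made-up two-hour frame (assuming DATE is already a column; the reset_index in the answer handles the case where it is the index), melt produces one row per (date, hour):

```python
import pandas as pd

# Hypothetical wide dataframe: one column per hour
df = pd.DataFrame({
    "DATE": ["1/1/2020", "1/2/2020"],
    "00:00": [0.0, 0.1],
    "01:00": [0.2, 0.3],
})

# Each hour column becomes a row labelled by its former column name
long_df = pd.melt(df, id_vars=["DATE"], var_name="hour", value_name="Hourly Precip")
print(long_df)
```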

Python Pandas : pandas.to_datetime() is switching day & month when day is less than 13

I wrote code that reads multiple files; however, on some of my files the datetime conversion swaps day and month whenever the day is less than 13, while any day from 13 upwards (e.g. 13/06/11) remains correct (DD/MM/YY).
I tried to fix it, but it doesn't work.
My data frame looks like this:
The actual datetime is from 12june2015 to 13june2015
When I read my datetime column as a string, the dates remain correct (dd/mm/yyyy):
tmp p1 p2
11/06/2015 00:56:55.060 0 1
11/06/2015 04:16:38.060 0 1
12/06/2015 16:13:30.060 0 1
12/06/2015 21:24:03.060 0 1
13/06/2015 02:31:44.060 0 1
13/06/2015 02:37:49.060 0 1
but when I convert the column to datetime, it swaps the day and month for every date where the day is less than 13.
output:
print(df)
tmp p1 p2
06/11/2015 00:56:55 0 1
06/11/2015 04:16:38 0 1
06/12/2015 16:13:30 0 1
06/12/2015 21:24:03 0 1
13/06/2015 02:31:44 0 1
13/06/2015 02:37:49 0 1
Here is my code :
I loop through files :
df = pd.read_csv(PATH+file, header = None,error_bad_lines=False , sep = '\t')
Then, when my code finishes reading all my files, I concatenate them. The problem is that my datetime column needs to be a datetime type, so when I convert it with pd.to_datetime() it swaps the day and month when the day is less than 13.
Before converting, the dates in my datetime column are correct (string type):
print(tmp) # as a result I get 11.06.2015 12:56:05 (11june2015)
But when I change the column type I get this:
tmp = pd.to_datetime(tmp, unit = "ns")
tmp = tmp.apply(lambda x: x.replace(microsecond=0))
print(tmp) # I get 06-11-2015 12:56:05 (6 November 2015, which is not the right date)
The question is: what command should I use or change to stop the day and month swapping when the day is less than 13?
UPDATE
This command swaps all the days and months of my column
tmp = pd.to_datetime(tmp, unit='s').dt.strftime('%#m/%#d/%Y %H:%M:%S')
So in order to swap only the incorrect dates, I wrote a condition:
for t in tmp:
    if t.day < 13:
        t = datetime(year=t.year, month=t.day, day=t.month, hour=t.hour, minute=t.minute, second=t.second)
But it doesn't work either
You can use the dayfirst parameter in pd.to_datetime.
pd.to_datetime(df.tmp, dayfirst=True)
Output:
0 2015-06-11 00:56:55
1 2015-06-11 04:16:38
2 2015-06-12 16:13:30
3 2015-06-12 21:24:03
4 2015-06-13 02:31:44
5 2015-06-13 02:37:49
Name: tmp, dtype: datetime64[ns]
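If the raw strings have a known fixed layout, passing an explicit format string removes the ambiguity entirely rather than relying on the dayfirst heuristic; a minimal sketch with two of the question's values:

```python
import pandas as pd

# Two sample strings in the question's dd/mm/yyyy layout
tmp = pd.Series(["11/06/2015 00:56:55.060", "13/06/2015 02:31:44.060"])

# An explicit format is unambiguous: %d is always the day, %m the month
parsed = pd.to_datetime(tmp, format="%d/%m/%Y %H:%M:%S.%f")
print(parsed)
```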
Well, I solved my problem, but in a memory-consuming way: I first split my tmp column into date and time columns, then re-split the date column into day, month and year. That way I could look for the days that are less than 13 and swap them with the corresponding month.
df['tmp'] = pd.to_datetime(df['tmp'], unit='ns')
df['tmp'] = df['tmp'].apply(lambda x: x.replace(microsecond=0))
df['date'] = [d.date() for d in df['tmp']]
df['time'] = [d.time() for d in df['tmp']]
df[['year','month','day']] = df['date'].apply(lambda x: pd.Series(x.strftime("%Y-%m-%d").split("-")))
df['day'] = pd.to_numeric(df['day'], errors='coerce')
df['month'] = pd.to_numeric(df['month'], errors='coerce')
df['year'] = pd.to_numeric(df['year'], errors='coerce')
# Loop to look for days less than 13 and swap the day and month
for index, d in enumerate(df['day']):
    if d < 13:
        df.loc[index, 'day'], df.loc[index, 'month'] = df.loc[index, 'month'], df.loc[index, 'day']
# convert series to string type in order to merge them
df['day'] = df['day'].astype(str)
df['month'] = df['month'].astype(str)
df['year'] = df['year'].astype(str)
df['date']= pd.to_datetime(df[['year', 'month', 'day']])
df['date'] = df['date'].astype(str)
df['time'] = df['time'].astype(str)
# merge time and date and place result in our column
df['tmp'] =pd.to_datetime(df['date']+ ' '+df['time'])
# drop the added columns
df.drop(df[['date','year', 'month', 'day','time']], axis=1, inplace = True)
I ran into the same issue. In my case the dates were the index column (called "Date"). The above-mentioned solution of applying to_datetime() directly to the dataframe with index column "Date" didn't work for me. I had to use read_csv() without setting the index to "Date", then apply to_datetime(), and only then set the index to "Date".
df = pd.read_csv(file)
df.Date = pd.to_datetime(df.Date, dayfirst=True)
df = df.set_index('Date')
I got the same problem: the day and month were switching for days below 13. This works for me: basically I reorder the date through its string form with a conditional and use to_datetime.
def calendario(fecha):
    if fecha.day < 13:
        dia_real = fecha.month
        mes_real = fecha.day
        # zero-pad both parts so the string always matches %d%m%Y
        nfecha = str(dia_real).zfill(2) + str(mes_real).zfill(2) + str(fecha.year)
        nfecha = pd.to_datetime(nfecha, format='%d%m%Y', errors='ignore')
    else:
        nfecha = fecha
    return nfecha

df['Nueva_fecha'] = df['Fecha'].apply(calendario)
The output is as expected.
