I am relatively new to working with Python and pandas, and I'm trying to get the value of a cell in an Excel sheet. To make matters worse, the Excel sheet I'm working with doesn't have proper column names.
Here's what the dataframe looks like:
Sign Name 2020-09-05 2020-09-06 2020-09-07
JD John Doe A A B
MP Max Power B B A
What I want to do is to print the value of the "cell" where the column header is the current date and the sign is "MP".
What I've tried so far is this:
import pandas as pd
from datetime import datetime
time=datetime.now()
relevant_sheet = time.strftime("%B" " %y")
current_day = time.strftime("%Y-%m-%d")
excel_file = pd.ExcelFile('theexcelfile.xlsx')
df = pd.read_excel(excel_file, relevant_sheet, skiprows=[0,1,2,3]) # I don't need these
relevant_value = df.loc[df['Sign'] == "MP", df[current_day]]
This gives me a key error for current_day:
KeyError: '2020-09-07'
To fully disclose any possible issue with the real dataframe I'm working with: If I just print the dataframe, I get columns that look like this:
2020-09-01 00:00:00
Which is why I also tried:
current_day = time.strftime("%Y-%m-%d 00:00:00")
Of course I also "manually" tried all kinds of date formats, but to no avail. Am I going about this entirely wrong? Is Excel screwing with me?
If the column names are datetimes, use Timestamp.floor to remove the time part (set it to 00:00:00):
current_day = pd.to_datetime('now').floor('d')
print (current_day)
2020-09-07 00:00:00
relevant_value = df.loc[df['Sign'] == "MP", current_day]
If the column names are dates in string format, keep current_day as the string from your question and use:
relevant_value = df.loc[df['Sign'] == "MP", current_day]
If the column names are Python date objects:
current_day = pd.to_datetime('now').date()
print (current_day)
2020-09-07
relevant_value = df.loc[df['Sign'] == "MP", current_day]
You need to pass only the column name, not df[col_name].
See the .loc[] documentation for details.
df.loc[df['Sign'] == "MP", current_day]
Use df.loc to select the relevant cell.
Get the relevant column label by taking today's date and converting it to a string.
Then filter Sign for MP (this assumes the column names are strings):
import datetime as dt
df.loc[df['Sign']=='MP', dt.date.today().strftime('%Y-%m-%d')]
Minor changes to how you are doing things will get you the result.
Step 1: strip out the 00:00:00 (if you want just the date value)
Step 2: your condition had an extra df[]
#strip the time part of the column names if the column name starts with 2020
#(str() lets this work whether the labels are strings or datetimes)
df.rename(columns=lambda x: str(x)[:10] if str(x)[:4] == '2020' else x, inplace=True)
current_day = datetime.now().strftime("%Y-%m-%d")
relevant_value = df.loc[df['Sign'] == 'MP', current_day] #no df[...] needed around current_day
print(relevant_value)
Since you are already using pandas, you don't need to import datetime. You can use this to get the date in yyyy-mm-dd format:
current_day = pd.to_datetime('now').strftime("%Y-%m-%d")
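Putting the answers above together, here is a minimal self-contained sketch. It assumes the date headers come out of read_excel as datetime-like objects (as the printed 2020-09-01 00:00:00 suggests). The small DataFrame and the fixed current_day are stand-ins for the real sheet and for today's date.

```python
from datetime import date

import pandas as pd

# Stand-in for the sheet: date headers are datetime-like, not strings (assumption)
df = pd.DataFrame({
    "Sign": ["JD", "MP"],
    "Name": ["John Doe", "Max Power"],
    pd.Timestamp("2020-09-05"): ["A", "B"],
    pd.Timestamp("2020-09-06"): ["A", "B"],
    pd.Timestamp("2020-09-07"): ["B", "A"],
})

# Normalize every date-like label to a 'YYYY-MM-DD' string; leave the rest alone.
# pd.Timestamp and datetime.datetime are both subclasses of datetime.date.
df.columns = [
    c.strftime("%Y-%m-%d") if isinstance(c, date) else c
    for c in df.columns
]

current_day = "2020-09-07"  # in real use: pd.Timestamp.now().strftime("%Y-%m-%d")

# Pass the column label itself, not df[current_day]
relevant_value = df.loc[df["Sign"] == "MP", current_day].iloc[0]
print(relevant_value)
```

Normalizing the labels once means the lookup no longer depends on whether Excel delivered strings, dates or datetimes.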
Related
I have a column (Created AT) in my DataFrame which has timestamps like those shown below:
Created AT
1) 2021-04-19T09:14:10.526Z
2) 2021-04-19T09:13:06.809Z
3) 2021-04-19T09:13:06.821Z
I want to extract only the time from the column above. It should look like:
9:14:8 etc
How can I extract this?
If your date column is a string, you need to convert it to datetime and then take a substring of the time:
df = pd.DataFrame(data = {"Created At":["2021-04-19T09:14:10.526Z","2021-04-19T09:14:10.526Z"]})
df['Created At'] = pd.to_datetime(df['Created At'])
df['Created At'] = df['Created At'].dt.time.astype(str).str[:8]
df['time'] = pd.to_datetime(df['Created AT'])
print(df['time'].dt.time)
The first line converts the strings to datetime objects and stores them in a new column.
The second line extracts the time from those datetime objects.
There are several ways to solve this; here is one using plain string methods and datetime.
You can get the time string using:
import datetime
s = '2021-04-19T09:14:10.526Z'
t = s.split('T')[1].split('.')[0]
print(t)
To parse it into a datetime object, add one more line (the date part defaults to 1900-01-01):
print(datetime.datetime.strptime(t,"%H:%M:%S"))
Convert to datetime and use strftime to format exactly as you like it.
data = ['2021-04-19T09:14:10.526Z',
'2021-04-19T09:13:06.809Z',
'2021-04-19T09:13:06.821Z']
df = pd.DataFrame(data=data, columns=['Created AT'])
df['Created AT'] = pd.to_datetime(df['Created AT']).dt.strftime('%H:%M:%S')
print(df)
Created AT
0 09:14:10
1 09:13:06
2 09:13:06
First convert the column to datetime format if not already in that format:
df['Created AT'] = pd.to_datetime(df['Created AT'])
Then add the new column time, formatted with .dt.strftime() as follows (if you don't want the fractional-second part):
df['time'] = df['Created AT'].dt.strftime('%H:%M:%S')
print(df)
Created AT time
0 2021-04-19 09:14:10.526000+00:00 09:14:10
1 2021-04-19 09:13:06.809000+00:00 09:13:06
2 2021-04-19 09:13:06.821000+00:00 09:13:06
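To compare the options from the answers above side by side, here is a small sketch on the sample values from the question. The column names time_obj, time_str and time_unpadded are made up for illustration. The unpadded variant is only a guess at what the "9:14:8" style in the question is asking for.

```python
import pandas as pd

df = pd.DataFrame({"Created AT": [
    "2021-04-19T09:14:10.526Z",
    "2021-04-19T09:13:06.809Z",
    "2021-04-19T09:13:06.821Z",
]})

# The trailing 'Z' makes these timezone-aware (UTC) datetimes
stamps = pd.to_datetime(df["Created AT"])

# datetime.time objects; the milliseconds are kept
df["time_obj"] = stamps.dt.time

# Zero-padded strings truncated to whole seconds
df["time_str"] = stamps.dt.strftime("%H:%M:%S")

# Strings without zero padding, built from the individual components
df["time_unpadded"] = (
    stamps.dt.hour.astype(str)
    + ":" + stamps.dt.minute.astype(str)
    + ":" + stamps.dt.second.astype(str)
)

print(df[["time_str", "time_unpadded"]])
```

Use the string columns for display and keep the datetime column itself if you still need to sort or compute with the values.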
I have a dataset of 70000+ data points (see picture)
As you can see, in the 'date' column half of the values use a different (messier) format than the other half (cleaner). How can I convert the whole column to the format of the second half?
I know how to do it manually, but it will take ages!
Thanks in advance!
EDIT
df['date'] = df['date'].apply(lambda x: dt.datetime.fromtimestamp(int(str(x)) / 1000).strftime('%Y-%m-%d %H:%M:%S') if str(x).isdigit() else x)
Date is in a strange format
EDIT 2
two data formats:
2012-01-01 00:00:00
2020-07-21T22:45:00+00:00
I've tried the code below and it works. Note that it relies on two key assumptions:
1- Your date format follows one and ONLY ONE of the TWO formats in your example!
2- The final output is a string!
If so, this should do the trick; otherwise, it's a starting point that can be altered to produce what you want:
import pandas as pd
import datetime
#data sample
d = {'date':['20090602123000', '20090602124500', '2020-07-22 18:45:00+00:00', '2020-07-22 19:00:00+00:00']}
#create dataframe
df = pd.DataFrame(data = d)
print(df)
date
0 20090602123000
1 20090602124500
2 2020-07-22 18:45:00+00:00
3 2020-07-22 19:00:00+00:00
#loop over records
for i, row in df.iterrows():
    #get date
    dateString = df.at[i,'date']
    #check if it's the undesired format or the desired format
    #NOTE: I'm using the '+' substring to identify that; this relies on my first assumption above that you only have two formats
    if '+' not in dateString:
        #reformat datetime
        #NOTE: this relies on my second assumption: the result is built as a string so that '+00:00' can be appended
        #df.at writes the cell directly; the chained df['date'].loc[...] = ... may not write back to df
        df.at[i,'date'] = str(datetime.datetime.strptime(dateString, '%Y%m%d%H%M%S')) + '+00:00'
    else:
        continue
print(df)
date
0 2009-06-02 12:30:00+00:00
1 2009-06-02 12:45:00+00:00
2 2020-07-22 18:45:00+00:00
3 2020-07-22 19:00:00+00:00
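With 70000+ rows, the row loop above can be replaced by a vectorized version. The following is a sketch under the same first assumption (only the two formats occur) and on the same sample data. Unlike the loop, it produces real timezone-aware datetimes rather than strings, and it treats the compact values as UTC, which is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"date": [
    "20090602123000",
    "20090602124500",
    "2020-07-22 18:45:00+00:00",
    "2020-07-22 19:00:00+00:00",
]})

# True for the compact 14-digit values, False for the ISO-like ones
is_compact = df["date"].str.fullmatch(r"\d{14}")

# Parse each group with the parser that fits it; both results are UTC-aware
compact = pd.to_datetime(df.loc[is_compact, "date"], format="%Y%m%d%H%M%S", utc=True)
iso = pd.to_datetime(df.loc[~is_compact, "date"], utc=True)

# Stitch the two pieces back together in the original row order
df["date"] = pd.concat([compact, iso]).sort_index()
print(df)
```

If you need strings in the end, df["date"].dt.strftime(...) can format the whole column in one call.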
You can reformat the first part of your dataframe:
import datetime as dt
df['date'] = df['date'].apply(lambda x: dt.datetime.fromtimestamp(int(str(x)) / 1000).strftime('%Y-%m-%d %H:%M:%S') if str(x).isdigit() else x)
This checks whether all characters of the value are digits and, if so, formats the date like the second part.
EDIT
The timestamps seem to be in milliseconds while they should be in seconds, hence the / 1000.
date['Maturity_date'] = data.apply(lambda data: relativedelta(months=int(data['TRM_LNTH_MO'])) + data['POL_EFF_DT'], axis=1)
Tried this also:
date['Maturity_date'] = date['POL_EFF_DT'] + date['TRM_LNTH_MO'].values.astype("timedelta64[M]")
TypeError: 'type' object does not support item assignment
import pandas as pd
import datetime
#Convert the date column to date format
date['date_format'] = pd.to_datetime(date['Maturity_date'])
#Add a month column
date['Month'] = date['date_format'].apply(lambda x: x.strftime('%b'))
If you are using pandas, you can use what it calls "frequency aliases", which work out of the box:
# With periods=2 and freq='M' (month end; newer pandas versions spell it 'ME'), the range holds two consecutive month ends: the first one on or after the existing date, then the following one.
import pandas as pd
_new_period = pd.date_range(_existing_date, periods=2, freq='M')
Now you can get the date you want as the second element returned:
# The index you want is 1. Index 0 is the first month end on or after the existing date.
_new_period.strftime('%Y-%m-%d')[1]
# You can format it in different ways: only year, month or day, whatever you need.
Consult this link for further information
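For the Maturity_date question itself, the TypeError ('type' object does not support item assignment) suggests that the name date refers to a class, most likely datetime.date, rather than to a DataFrame. Below is a minimal sketch of the month addition with made-up sample values. The DataFrame name policies is hypothetical; the column names are taken from the question.

```python
import pandas as pd

# 'policies' is a hypothetical name; calling the DataFrame 'date' risks
# clashing with datetime.date
policies = pd.DataFrame({
    "POL_EFF_DT": pd.to_datetime(["2020-01-15", "2020-03-31"]),
    "TRM_LNTH_MO": [12, 6],
})

# Add a per-row number of calendar months; DateOffset clips to the last
# valid day of the target month (Mar 31 + 6 months -> Sep 30)
policies["Maturity_date"] = [
    start + pd.DateOffset(months=int(months))
    for start, months in zip(policies["POL_EFF_DT"], policies["TRM_LNTH_MO"])
]

# Abbreviated month name of the maturity date
policies["Month"] = policies["Maturity_date"].dt.strftime("%b")
print(policies)
```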
I have a dataframe with the date and month_diff variable. I would like to get a new date (name it as Target_Date) based on the following logic:
For example, the date is 2/13/2019, month_diff is 3, then the target date should be the month-end of the original date plus 3 months, which is 5/31/2019
I tried the following method to get the target date first:
df["Target_Date"] = df["Date"] + pd.DateOffset(months = df["month_diff"])
But it failed; as far as I know, the parameter of DateOffset must be a single variable or a fixed number.
I also tried:
df["Target_Date"] = df["Date"] + relativedelta(months = df["month_diff"])
It fails too.
Can anyone help? Thank you.
edit:
This is a large dataset with millions of rows.
You could try this
import pandas as pd
from dateutil.relativedelta import relativedelta
df = pd.DataFrame({'Date': [pd.Timestamp(2019,1,1), pd.Timestamp(2019,2,1)], 'month_diff': [1,2]})  # pd.datetime was removed from pandas
df.apply(lambda row: row.Date + relativedelta(months=row.month_diff), axis=1)
Or list comprehension
[date + relativedelta(months=month_diff) for date, month_diff in df[['Date', 'month_diff']].values]
I would compute your "target_date" in the following steps:
1. Apply the target month offset (in your case +3 months), using your pd.DateOffset.
2. Get the last day of that target month (using for example calendar.monthrange, see also "Get last day of the month"). This gives you the "flexible" part of the date offset.
3. Apply the flexible day offset, i.e. the difference between the results of step 1 and step 2. This could be a new pd.DateOffset.
A solution could look something like this:
import calendar
from dateutil.relativedelta import relativedelta
for ii in df.index:
    new_ = df.at[ii, 'start_date'] + relativedelta(months=df.at[ii, 'month_diff'])
    max_date = calendar.monthrange(new_.year, new_.month)[1]
    end_ = new_ + relativedelta(days=max_date - new_.day)
    print(end_)
Further "cleaning" into a function and / or list comprehension will probably make it much faster
import pandas as pd
from datetime import datetime
from datetime import timedelta
This is my approach in solving your issue.
However for some reason I am getting a semantic error in my output even though I am sure it is the correct way. Please everyone correct me if you notice something wrong.
today = datetime.now()
today = today.strftime("%d/%m/%Y")
month_diff =[30,5,7]
n = 30
for i in month_diff:
    b = {'Date': today, 'month_diff':month_diff,"Target_Date": datetime.now()+timedelta(days=i*n)}
df = pd.DataFrame(data=b)
Output:
For some reason the i is not getting updated.
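A likely explanation for the code above: the dictionary b is rebuilt on every pass with a single scalar Target_Date, so only the last iteration (i = 7) survives and that one value is broadcast to all rows. A sketch of a corrected version follows. It keeps the rough 30-days-per-month approximation from the code above and uses a fixed start date (an assumption, so that the result is reproducible).

```python
from datetime import datetime, timedelta

import pandas as pd

start = datetime(2019, 2, 13)  # fixed stand-in for datetime.now()
month_diff = [30, 5, 7]
days_per_month = 30  # rough approximation carried over from the code above

# Build one Target_Date per entry instead of overwriting a single scalar
df = pd.DataFrame({
    "Date": [start] * len(month_diff),
    "month_diff": month_diff,
    "Target_Date": [start + timedelta(days=m * days_per_month) for m in month_diff],
})
print(df)
```

Note that 30-day steps drift away from real calendar months; relativedelta or pd.DateOffset(months=...) avoid that.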
I was looking for a solution I could write in a single line, and apply does the job. By default, however, apply works on each column, so remember to specify the correct axis: axis=1.
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Create a new column with date adjusted by number of months from 'month_diff' column and later adjust to the last day of month
df['Target_Date'] = df.apply(lambda row: row.Date # to current date
+ relativedelta(months=row.month_diff) # add month_diff
+ relativedelta(day=+31) # and adjust to the last day of month
, axis=1) # 1 or ‘columns’: apply function to each row.
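Since the question mentions millions of rows, a row-wise apply may be slow. One possible alternative is sketched below: it loops over the distinct month_diff values instead of over rows, so each step is a vectorized addition. The sample rows are made up; the column names follow the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-02-13", "2019-01-31", "2020-12-15"]),
    "month_diff": [3, 1, 3],
})

parts = []
for months, group in df.groupby("month_diff"):
    # Shift the whole group by the same number of calendar months ...
    shifted = group["Date"] + pd.DateOffset(months=int(months))
    # ... then roll forward to the month end (dates already at month end stay put)
    parts.append(shifted + pd.offsets.MonthEnd(0))

# The pieces keep their original index, so the assignment aligns row by row
df["Target_Date"] = pd.concat(parts)
print(df)
```

This pays off when month_diff takes few distinct values compared with the number of rows; with many distinct values the gain shrinks.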
I have a csv with a date column with dates listed as MM/DD/YY but I want to change the years from 00,02,03 to 1900, 1902, 1903 so that they are instead listed as MM/DD/YYYY
This is what works for me:
df2['Date'] = df2['Date'].str.replace(r'00', '1900')
but I'd have to do this for every year up until 68 (aka repeat this 68 times). I'm not sure how to create a loop to do the code above for every year in that range. I tried this:
ogyear=00
newyear=1900
while ogyear <= 68:
    df2['date']=df2['Date'].str.replace(r'ogyear','newyear')
    ogyear += 1
    newyear += 1
but this returns an empty data set. Is there another way to do this?
I can't use datetime because it assumes that 02 refers to 2002 instead of 1902 and when I try to edit that as a date I get an error message from python saying that dates are immutable and that they must be changed in the original data set. For this reason I need to keep the dates as strings. I also attached the csv here in case thats helpful.
I would do it like this:
# create a data frame
d = pd.DataFrame({'date': ['20/01/00','20/01/20','20/01/50']})
# create year column
d['year'] = d['date'].str.split('/').str[2].astype(int) + 1900
# add new year into old date by replacing old year
d['new_data'] = d['date'].str.replace(r'[0-9]*.$', '', regex=True) + d['year'].astype(str)  # regex=True is required in recent pandas
date year new_data
0 20/01/00 1900 20/01/1900
1 20/01/20 1920 20/01/1920
2 20/01/50 1950 20/01/1950
I'd do it the following way:
from datetime import datetime
# create a data frame with dates in format month/day/shortened year
d = pd.DataFrame({'dates': ['2/01/10','5/01/20','6/01/30']})
#loop through the dates in the dates column and add them
#to a list in the desired form using the datetime library,
#then replace the dataframe dates column with the new ordered list
#note: %y maps 00-68 to 2000-2068, so '2/01/10' becomes 02/01/2010, not 1910
new_dates = []
for date in list(d['dates']):
    dat = datetime.date(datetime.strptime(date, '%m/%d/%y'))
    dat = dat.strftime("%m/%d/%Y")
    new_dates.append(dat)
new_dates
d['dates'] = pd.Series(new_dates)
d
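The while loop in the question fails because r'ogyear' and 'newyear' are literal strings, not the variables of the same name. Since the dates have to stay strings and every year belongs to the 1900s, a single regex replacement can prefix the two-digit year with 19. Below is a sketch with made-up sample values in the MM/DD/YY layout described in the question.

```python
import pandas as pd

df2 = pd.DataFrame({"Date": ["01/20/00", "02/01/02", "12/31/68"]})

# Capture the trailing two-digit year and put '19' in front of it.
# The values stay strings, so no datetime century guessing is involved.
df2["Date"] = df2["Date"].str.replace(r"(\d{2})$", r"19\1", regex=True)
print(df2)
```

The pattern is anchored with $, so only the year at the end of each string is touched, not a day or month that happens to be 00-68.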