Extracting date time from a mixed letter and numeric column pandas - python

I have a column in a pandas DataFrame that contains two types of information: 1) date and time, 2) company name. I have to split the column into two (date_time, full_company_name). First I tried to split the column based on character count (the first 19 characters into one column, the rest into the other), but then I realized that sometimes the date is missing, so the split might not work. Then I tried using regex, but I can't seem to extract it correctly.
The column:
The desired output:

If the dates are all properly formatted, maybe you don't have to use regex
df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "2021-01-01 06:00:00Acme LLC"]})
df["date"] = pd.to_datetime(df.A.str[:19])
df["company"] = df.A.str[19:]
df
#                                     A                date          company
# 0  2021-01-01 05:00:00Acme Industries 2021-01-01 05:00:00  Acme Industries
# 1         2021-01-01 06:00:00Acme LLC 2021-01-01 06:00:00         Acme LLC
OR
df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*)")

Note:
If you have an option to avoid concatenating those strings to begin with, please do so. This is not a healthy habit.
Solution (not that pretty, but gets the job done):
import pandas as pd
from datetime import datetime
import re

df = pd.DataFrame()
# creating a list of companies
companies = ['Google', 'Apple', 'Microsoft', 'Facebook', 'Amazon', 'IBM',
             'Oracle', 'Intel', 'Yahoo', 'Alphabet']
# creating a list of sample datetime objects
dates = [datetime(year=2000 + i, month=1, day=1) for i in range(10)]
# creating the column named 'date_time/full_company_name'
df['date_time/full_company_name'] = [f'{str(dates[i])}{companies[i]}' for i in range(len(companies))]
# Before:
# date_time/full_company_name
# 2000-01-01 00:00:00Google
# 2001-01-01 00:00:00Apple
# 2002-01-01 00:00:00Microsoft
# 2003-01-01 00:00:00Facebook
# 2004-01-01 00:00:00Amazon
# 2005-01-01 00:00:00IBM
# 2006-01-01 00:00:00Oracle
# 2007-01-01 00:00:00Intel
# 2008-01-01 00:00:00Yahoo
# 2009-01-01 00:00:00Alphabet
new_rows = []
for row in df['date_time/full_company_name']:
    # extract the date_time from the row using regex
    date_time = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', row)
    # handle the case of a missing date_time
    date_time = date_time.group() if date_time else ''
    # extract the company name from where the date_time ends
    company_name = row[len(date_time):]
    # collect the extracted date_time and company_name
    new_rows.append([date_time, company_name])
# drop the column 'date_time/full_company_name'
df = df.drop(columns=['date_time/full_company_name'])
# add the new columns to the dataframe: 'date_time' and 'company_name'
df['date_time'] = [row[0] for row in new_rows]
df['company_name'] = [row[1] for row in new_rows]
# After:
# date_time            company_name
# 2000-01-01 00:00:00  Google
# 2001-01-01 00:00:00  Apple
# 2002-01-01 00:00:00  Microsoft
# 2003-01-01 00:00:00  Facebook
# 2004-01-01 00:00:00  Amazon
# 2005-01-01 00:00:00  IBM
# 2006-01-01 00:00:00  Oracle
# 2007-01-01 00:00:00  Intel
# 2008-01-01 00:00:00  Yahoo
# 2009-01-01 00:00:00  Alphabet

Alternatively, make the date group optional by appending ?, and leave the remainder uncaptured by using a bare .* instead of (.*):
df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "2021-01-01 06:00:00Acme LLC"]})
df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})?.*")
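Building on the answers above, one way to handle rows where the date is genuinely missing (the OP's concern) is to keep both capture groups but make the date group optional. A sketch; the second sample row is an assumed date-less example:

```python
import pandas as pd

df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "Acme LLC"]})  # second row deliberately has no date

# The trailing `?` makes the whole datetime group optional, so rows
# without a date still yield the company name in the second group.
out = df.A.str.extract(r"^(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})?(.*)$")
out.columns = ["date_time", "full_company_name"]
out["date_time"] = pd.to_datetime(out["date_time"])  # missing dates become NaT
```

Rows without a timestamp end up with NaT in date_time rather than corrupting the company name.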

Related

How to tackle a dataset that has multiple same date values

I have a large data set from which I'm trying to produce a time series using ARIMA. However, some rows in the date column share the same date.
The dates were entered this way because the exact date of the event was not known; unknown dates were entered as the first of that month (biased). Known dates have been entered correctly in the data set.
2016-01-01 10035
2015-01-01 5397
2013-01-01 4567
2014-01-01 4343
2017-01-01 3981
2011-01-01 2049
Ideally I want to randomise the dates within the month so they are not all the same. I have the code to randomise a date within a range, but I cannot find a way to replace the existing dates with the randomised ones.
import random
import time

def str_time_prop(start, end, time_format, prop):
    stime = time.mktime(time.strptime(start, time_format))
    etime = time.mktime(time.strptime(end, time_format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(time_format, time.localtime(ptime))

def random_date(start, end, prop):
    return str_time_prop(start, end, '%Y-%m-%d', prop)

# check that the random function works
print(random_date("2021-01-02", "2021-01-11", random.random()))
I use the code above to generate a random date within a date range, but I'm struggling to find a way to replace the dates in the dataframe.
Any help/guidance would be great.
Thanks
With the following toy dataframe:
import random
import time
import pandas as pd
df = pd.DataFrame(
    {
        "date": [
            "2016-01-01",
            "2015-01-01",
            "2013-01-01",
            "2014-01-01",
            "2017-01-01",
            "2011-01-01",
        ],
        "value": [10035, 5397, 4567, 4343, 3981, 2049],
    }
)
print(df)
# Output
date value
0 2016-01-01 10035
1 2015-01-01 5397
2 2013-01-01 4567
3 2014-01-01 4343
4 2017-01-01 3981
5 2011-01-01 2049
Here is one way to do it:
df["date"] = [
    random_date("2011-01-01", "2022-04-17", random.random()) for _ in range(df.shape[0])
]
print(df)
# Output
date value
0 2013-12-30 10035
1 2016-06-17 5397
2 2018-01-26 4567
3 2012-02-14 4343
4 2014-06-26 3981
5 2019-07-03 2049
Since the date column has multiple rows with the same date, and you want to randomize the dates within the month, you could group by year and month and select only the rows whose day equals 1. Then use calendar.monthrange to find the last day of the month for that particular year, and use that information when replacing the timestamp's day. Change the FIRST_DAY and last_day values to match your desired range.
import pandas as pd
import calendar
import numpy as np

np.random.seed(42)
df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'])
# group rows by year, month, and whether the day equals 1
grouped = df.groupby([df['date'].dt.year, df['date'].dt.month, df['date'].dt.day == 1])
FIRST_DAY = 2  # set for the desired range
df_list = []
for n, g in grouped:
    last_day = calendar.monthrange(n[0], n[1])[1]  # last day for this month and year
    g['New_Date'] = g['date'].apply(lambda d:
        d.replace(day=np.random.randint(FIRST_DAY, last_day + 1))
    )
    df_list.append(g)
new_df = pd.concat(df_list)
print(new_df)
Output from new_df
date num New_Date
2 2013-01-01 4567 2013-01-08
3 2014-01-01 4343 2014-01-21
1 2015-01-01 5397 2015-01-30
0 2016-01-01 10035 2016-01-16
4 2017-01-01 3981 2017-01-12
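For larger frames, the same idea can be sketched without an explicit loop by using the `dt.days_in_month` accessor to bound the random day per row. This assumes, as in the question, that only first-of-month dates are placeholders that need randomizing:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01", "2015-01-01", "2013-01-01",
                            "2014-01-01", "2017-01-01", "2011-01-01"]),
    "value": [10035, 5397, 4567, 4343, 3981, 2049],
})

# Only first-of-month dates are placeholders that need randomizing.
mask = df["date"].dt.day == 1
# Random day offset per row: 1 .. (days in that month - 1), so the
# result stays inside the same month and never lands back on day 1.
days_in_month = df.loc[mask, "date"].dt.days_in_month.to_numpy()
offsets = rng.integers(1, days_in_month)
df.loc[mask, "date"] = df.loc[mask, "date"] + pd.to_timedelta(offsets, unit="D")
```

Known (non-placeholder) dates are untouched because the mask excludes them.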

Create a dataframe from a date range in python

Given an interval defined by two dates, which will be Python Timestamps:
create_interval('2022-01-12', '2022-01-17', 'Holidays')
Create the following dataframe:
date                 interval_name
2022-01-12 00:00:00  Holidays
2022-01-13 00:00:00  Holidays
2022-01-14 00:00:00  Holidays
2022-01-15 00:00:00  Holidays
2022-01-16 00:00:00  Holidays
2022-01-17 00:00:00  Holidays
If it can be in a few lines of code I would appreciate it. Thank you very much for your help.
If you're open to using Pandas, this should accomplish what you've requested
import pandas as pd

def create_interval(start, end, field_val):
    # set up the date range index
    idx = pd.date_range(start, end)
    # create the dataframe using the index above, with an empty 'interval_name' column
    df = pd.DataFrame(index=idx, columns=['interval_name'])
    # set the index name
    df.index.names = ['date']
    # fill all rows of 'interval_name' with the field_val parameter
    df.interval_name = field_val
    return df

create_interval('2022-01-12', '2022-01-17', 'holiday')
I hope I coded exactly what you need.
import pandas as pd

def create_interval(ts1, ts2, interval_name):
    ts_list_dt = pd.date_range(start=ts1, end=ts2).to_pydatetime().tolist()
    ts_list = [str(x) for x in ts_list_dt]
    d = {'date': ts_list, 'interval_name': [interval_name] * len(ts_list)}
    df = pd.DataFrame(data=d)
    return df

df = create_interval('2022-01-12', '2022-01-17', 'Holidays')
print(df)
output:
date interval_name
0 2022-01-12 00:00:00 Holidays
1 2022-01-13 00:00:00 Holidays
2 2022-01-14 00:00:00 Holidays
3 2022-01-15 00:00:00 Holidays
4 2022-01-16 00:00:00 Holidays
5 2022-01-17 00:00:00 Holidays
If you want the DataFrame without the default integer index, use df = df.set_index('date') after creating the DataFrame with df = pd.DataFrame(data=d). Then you will get:
date interval_name
2022-01-12 00:00:00 Holidays
2022-01-13 00:00:00 Holidays
2022-01-14 00:00:00 Holidays
2022-01-15 00:00:00 Holidays
2022-01-16 00:00:00 Holidays
2022-01-17 00:00:00 Holidays
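Since the OP asked for as few lines as possible, here is a sketch that builds the same frame in essentially one expression from the `DatetimeIndex` itself, using the standard `to_frame` and `assign` methods:

```python
import pandas as pd

def create_interval(start, end, name):
    # date_range builds the index; to_frame turns it into a 'date' column,
    # and assign adds the constant interval_name column.
    return (pd.date_range(start, end)
              .to_frame(index=False, name="date")
              .assign(interval_name=name))

df = create_interval("2022-01-12", "2022-01-17", "Holidays")
```

Passing index=False keeps the dates as a regular column; drop it to get them as the index instead.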

CSV / Dataset with hourly prices in columns for each day in the rows

I have a dataset / CSV with hourly prices in columns for each day in the rows. It looks something like this:
Date        01:00:00  02:00:00  ...  00:00:00
01.01.2019  348,87    340,83    ...  343,38
02.01.2019  ...       ...       ...  ...
I would need the dataset to be like this:
Date                 Price
01.01.2019 01:00:00  348,87
01.01.2019 02:00:00  340,83
...                  ...
02.01.2019 00:00:00  ...
And all the way to 01.01.2022.
I'm using pandas dataframes in Python. Could anyone help me with this?
RE:
df1 = pd.read_csv('Hourly_prices1.csv', delimiter = ';', index_col = ['Date'])
So basically, I want the index to contain the price for each hour of each day, going chronologically from 01.01.2019 01:00:00 until 01.01.2022 00:00:00.
I need this for a time series analysis and, among other things, to plot daily changes in the price at each hour of each day.
Assuming your dataset is df, Please try this:
hours_ls = df.columns[1:]
df['Date'] = df['Date'].astype(str)
frames = []
for date in df['Date'].values:
    price_ls = []
    date_hr = []
    for hour in hours_ls:
        date_hr.append(date + ' ' + str(hour))
        price_ls.append(df[df['Date'] == date][hour].iloc[0])
    frames.append(pd.DataFrame(data={'Date': date_hr, 'Price': price_ls}))
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concat
df_new = pd.concat(frames, ignore_index=True)
df_new will be the formatted dataframe required
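The same wide-to-long reshaping can be sketched without loops using `pd.melt`, the idiomatic pandas tool for this; the second-row prices below are made-up values matching the layout above:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01.01.2019", "02.01.2019"],
    "01:00:00": ["348,87", "355,10"],   # second-row values are illustrative
    "02:00:00": ["340,83", "351,02"],
})

# melt turns each hour column into rows; then combine date + hour
# into one datetime column and sort chronologically.
long_df = df.melt(id_vars="Date", var_name="Hour", value_name="Price")
long_df["Date"] = pd.to_datetime(long_df["Date"] + " " + long_df["Hour"],
                                 format="%d.%m.%Y %H:%M:%S")
long_df = (long_df.drop(columns="Hour")
                  .sort_values("Date")
                  .reset_index(drop=True))
# European decimal commas can be converted to floats afterwards:
long_df["Price"] = long_df["Price"].str.replace(",", ".").astype(float)
```

This scales to any number of hour columns without touching the loop logic.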

How to read a csv and aggregate data by groups?

We have a csv file and wrote the code below to group by name, take the max value, and create an output file. But when reading the final output file with pandas read_csv, it shows as empty.
Input file:
Manoj,2020-01-01 01:00:00
Manoj,2020-02-01 01:00:00
Manoj,2020-03-01 01:00:00
Rajesh,2020-01-01 01:00:00
Rajesh,2020-05-01 01:00:00
Suresh,2020-04-01 01:00:00
Final output file:
Manoj,2020-03-01 01:00:00
Rajesh,2020-05-01 01:00:00
Suresh,2020-04-01 01:00:00
And then, when I try to read the above final output file using pd.read_csv, it shows the dataframe as empty.
import os
import re
import pandas as pd

z = open('outfile.csv', 'w')
fin = []
k = open('j.csv', 'r')
for m in k:
    d = m.split(',')[0]
    if d not in fin:
        fin.append(d.strip())
for p in fin:
    gg = []
    g = re.compile(r'{0}'.format(p))
    y = open('j.csv', 'r')
    for b in y:
        if re.search(g, b):
            gg.append(b)
    z.write(gg[-1].strip())
    z.write('\n')
df = pd.read_csv("outfile.csv", delimiter=',', names=['Col1', 'Col2'], header=0)
print(df)
Final output: Empty DataFrame, Index: []
Is there anything I missed? Any suggestions would be appreciated.
It's not necessary to use the for-loop to process the file; the data aggregation is more easily completed in pandas. (As an aside, outfile.csv reads back as empty because the handle z is never closed, so the buffered writes are not flushed to disk before read_csv runs.)
Your csv is shown without headers, so read the file in with pandas.read_csv, header=None, and use parse_dates to correctly format the datetime column.
The column with datetimes is shown at column index 1, therefore parse_dates=[1].
This assumes the data begins on row 0 in the file and has no headers, as shown in the OP.
Create headers for the columns.
As per a comment, the date component of 'datetime' can be accessed with the .dt accessor.
.groupby on 'name' and aggregate with .max().
import pandas as pd
# read the file j.csv
df = pd.read_csv('j.csv', header=None, parse_dates=[1])
# add headers
df.columns = ['name', 'datetime']
# select only the date component of datetime
df.datetime = df.datetime.dt.date
# display(df)
name datetime
0 Manoj 2020-01-01
1 Manoj 2020-02-01
2 Manoj 2020-03-01
3 Rajesh 2020-01-01
4 Rajesh 2020-05-01
5 Suresh 2020-04-01
# groupby
dfg = df.groupby('name')['datetime'].max().reset_index()
# display(dfg)
name datetime
0 Manoj 2020-03-01
1 Rajesh 2020-05-01
2 Suresh 2020-04-01
# save the file. If the headers aren't wanted, use `header=False`
dfg.to_csv('outfile.csv', index=False)
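If the goal is to keep the entire row for each name's latest timestamp (useful once more columns are involved), `idxmax` is an alternative to aggregating a single column. A sketch using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Manoj", "Manoj", "Manoj", "Rajesh", "Rajesh", "Suresh"],
    "datetime": pd.to_datetime([
        "2020-01-01 01:00:00", "2020-02-01 01:00:00", "2020-03-01 01:00:00",
        "2020-01-01 01:00:00", "2020-05-01 01:00:00", "2020-04-01 01:00:00",
    ]),
})

# idxmax returns the row label of the max datetime per group,
# so .loc keeps the whole row, not just the aggregated column.
latest = df.loc[df.groupby("name")["datetime"].idxmax()].reset_index(drop=True)
```

The result has one row per name, carrying along any other columns the frame might have.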
Create dataframe
import pandas as pd

df = pd.DataFrame(zip(
    ['Manoj', 'Manoj', 'Manoj', 'Rajesh', 'Rajesh', 'Suresh'],
    ['2020-01-01', '2020-02-01', '2020-03-01', '2020-01-01', '2020-05-01', '2020-04-01'],
    ['01:00:00', '01:00:00', '01:00:00', '01:00:00', '01:00:00', '01:00:00']),
    columns=['name', 'date', 'time'])
Convert date and time from strings to date and time objects:
df['date'] = pd.to_datetime(df['date']).dt.date
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S').dt.time
Take groupby
out=df.groupby(by=['name','time']).max().reset_index()
You can save and load it again
out.to_csv('out.csv',index=False)
df1=pd.read_csv('out.csv')
result
name time date
0 Manoj 01:00:00 2020-03-01
1 Rajesh 01:00:00 2020-05-01
2 Suresh 01:00:00 2020-04-01
Sorry, I created two separate columns for date and time, but I hope you can understand it

How to use Pandas to get date_range from some timestamp?

I need to split a year into enumerated 20-minute chunks and then find the sequence number of the chunk that each of a set of timestamps, randomly distributed over the year, falls into, for further processing.
I tried to use pandas for this, but I can't find a way to look up a timestamp in the date_range:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pandas as pd
from datetime import timedelta

if __name__ == '__main__':
    date_start = pd.to_datetime('2018-01-01')
    date_end = date_start + timedelta(days=365)
    index = pd.date_range(start=date_start, end=date_end, freq='20min')
    data = range(len(index))
    df = pd.DataFrame(data, index=index, columns=['A'])
    print(df)
    event_ts = pd.to_datetime('2018-10-14 02:17:43')
    # How to find the corresponding df['A'] for event_ts?
    # print(df.loc[event_ts])
Output:
A
2018-01-01 00:00:00 0
2018-01-01 00:20:00 1
2018-01-01 00:40:00 2
2018-01-01 01:00:00 3
2018-01-01 01:20:00 4
... ...
2018-12-31 22:40:00 26276
2018-12-31 23:00:00 26277
2018-12-31 23:20:00 26278
2018-12-31 23:40:00 26279
2019-01-01 00:00:00 26280
[26281 rows x 1 columns]
What is the best practice for doing this in python? I can imagine how to find the range "by hand" by converting the date_range to integers and comparing, but maybe there are some elegant pandas/pythonic ways to do it?
First of all, I've worked with a small interval, one week:
date_end = date_start + timedelta(days=7)
Then I've followed your steps, and got a portion of your dataframe.
My event_ts is this:
event_ts = pd.to_datetime('2018-01-04 02:17:43')
And I've chosen to reset the index, to get a dataframe that's easy to manipulate:
df = df.reset_index()
With this code I found the last bin start at or before event_ts (run must be initialised first):
run = []
for i in df['index']:
    if i <= event_ts:
        run.append(i)
print(max(run))
# 2018-01-04 02:00:00
or:
top = max(run)
Finally:
df.loc[df['index'] == top].index[0]
222
event_ts belongs to row 222, i.e. df.iloc[222].
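For a vectorized alternative that avoids the Python loop entirely, `Index.get_indexer` with `method='ffill'` returns the position of the last bin edge at or before the timestamp. A sketch using a shortened version of the question's setup:

```python
import pandas as pd

index = pd.date_range("2018-01-01", periods=12, freq="20min")
df = pd.DataFrame({"A": range(len(index))}, index=index)

event_ts = pd.to_datetime("2018-01-01 02:17:43")
# 'ffill' snaps the timestamp back to the most recent 20-minute boundary
# and returns its integer position in the index.
pos = df.index.get_indexer([event_ts], method="ffill")[0]
# equivalently: pos = df.index.searchsorted(event_ts, side="right") - 1
chunk_number = df["A"].iloc[pos]
```

Both forms work on any sorted DatetimeIndex and handle many timestamps at once (pass a list to get_indexer).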
