date formatting for time series data - python

I'm trying to format the start_date and end_date columns of this data frame. With other time-series data I don't lose any observations when I format both dates using the following commands:
df1['start_date'] = pd.to_datetime(df2['start_date']).dt.date
df1['end_date'] = pd.to_datetime(df2['end_date']).dt.date
df2['start_date'] = pd.to_datetime(df2['start_date']).dt.date
df2['end_date'] = pd.to_datetime(df2['end_date']).dt.date
But when I do this for the uploaded Excel file, I lose half of the observations, even though the data format is the same. I then tried the following code, but in vain, since it returns an error:
#formatting the date column correctly
df4.start_date=df4.start_date.apply(lambda x:datetime.datetime.strptime(x, '%Y-%m-%dT%H::%M::%S.%f'))
Error message : ValueError: time data '2014-01-02T00:00:00.000Z' does not match format '%Y-%m-%dT%H::%M::%S.%f'
start_date,end_date,temp,company_id
2014-01-02T00:00:00.000Z,2014-01-02T00:00:00.000Z,24.7076052756763,93
2014-01-11T00:00:00.000Z,2014-01-11T00:00:00.000Z,26.4211755123126,93
2014-01-15T00:00:00.000Z,2014-01-15T00:00:00.000Z,24.305641594482,93
2014-01-20T00:00:00.000Z,2014-01-20T00:00:00.000Z,25.1427441225934,93
2014-01-23T00:00:00.000Z,2014-01-23T00:00:00.000Z,23.6531733860261,93
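The error above comes from the format string: the sample values use single colons and end in a literal Z, while the pattern uses `::` and has no Z. Here is a minimal sketch of two ways to parse them, assuming the columns really hold ISO 8601 strings like the sample rows (the inline CSV only stands in for the uploaded file):

```python
import datetime
import io

import pandas as pd

# Two of the sample rows from the question, standing in for the real file.
csv_text = """start_date,end_date,temp,company_id
2014-01-02T00:00:00.000Z,2014-01-02T00:00:00.000Z,24.7076052756763,93
2014-01-11T00:00:00.000Z,2014-01-11T00:00:00.000Z,26.4211755123126,93
"""
df4 = pd.read_csv(io.StringIO(csv_text))

# Option 1: strptime with single colons and a literal trailing Z.
parsed = datetime.datetime.strptime(
    "2014-01-02T00:00:00.000Z", "%Y-%m-%dT%H:%M:%S.%fZ"
)

# Option 2: let pandas parse the ISO 8601 strings, then keep only the date part.
for col in ["start_date", "end_date"]:
    df4[col] = pd.to_datetime(df4[col], utc=True).dt.date

print(parsed)
print(df4)
```

As for the lost observations, it may be worth checking the first two commands: assigning a column computed from df2 into df1 aligns on the index, so rows of df1 whose index labels are not in df2 end up empty.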

Related

How do I split the 'time_taken' column of a dataframe?

I am trying to split the time_taken attribute (eg., 02h 10m) into only numbers using the below code.
I have checked earlier posts and this code seemed to work fine for some of you but it is not working for me.
t=pd.to_timedelta(df3['time_taken'])
df3['hours']=t.dt.components['hours']
df3['minutes']=t.dt.components['minutes']
df3.head()
I am getting the following error:
ValueError: invalid unit abbreviation: hm
I am unable to understand the error. Can anyone help me split the column into hours and mins? It would be of great help. Thanks in advance.
You can try the code below. Since you mentioned that your time_taken attribute looks like 02h 10m, I have written some example code that you can try out.
import pandas as pd
# initializing example time data
time_taken = ['1h 10m', '2h 20m', '3h 30m', '4h 40m', '5h 50m']
#inserting the time data into a pandas DataFrame
data = pd.DataFrame(time_taken, columns = ['time_taken'])
# see how the data looks like
print(data)
# initializing "Hours" and "Minutes" columns"
# and assigning the value 0 to both for now.
data['Hours'] = 0
data['Minutes'] = 0
# when I ran this code, the data type for the elements
# in time_taken column was numpy.int64
# so we convert it into string type
data['time_taken'] = data['time_taken'].apply(str)
# loop through the elements to split into Hours and minutes
for i in range(len(data)):
    temp = data.iat[i, 0]
    hours, minutes = temp.split()  # use python .split() function for strings
    # strip the unit letter and store the value as an integer
    data.iat[i, 1] = int(hours.translate({ord('h'): None}))
    data.iat[i, 2] = int(minutes.translate({ord('m'): None}))
# the correct data is here
print(data)
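A vectorized alternative to the loop, as a rough sketch: it assumes every value has the shape "<digits>h <digits>m", and the sample values and the column names hours and minutes are only illustrative.

```python
import pandas as pd

# Illustrative values in the "02h 10m" shape described in the question.
df3 = pd.DataFrame({"time_taken": ["02h 10m", "1h 5m", "12h 00m"]})

# Named groups become the column names of the extracted frame.
parts = df3["time_taken"].str.extract(r"(?P<hours>\d+)h\s*(?P<minutes>\d+)m")

# The captured pieces are strings, so convert them to integers before joining.
df3 = df3.join(parts.astype(int))
print(df3)
```

Rows that do not match the pattern would come back as NaN from str.extract, in which case the astype(int) call fails, so that step would need adjusting for messier data.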

How to transform invalid date to valid date using Pandas?

I have a dataframe like as shown below
df = pd.DataFrame({'d1' :['2/26/2019 03:31','10241-2-19 0:0:0','31/03/2016 16:00'],
'd2' :['2/29/2019 05:21','10241-2-29 0:0:0','03/04/2016 12:00']})
As you can see, there are some invalid date values, meaning records with a year like 10241.
On the other hand, valid dates can be in either format, mdy_dm or dmy_dm.
When I try the below, I get an error message saying the date is out of range.
df['d1'] = pd.to_datetime(df.d1)
df['d1'].dt.strftime('%m/%d/%Y hh:ss')
Is there any way to fix this?
I expect my output to be as shown below
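One possible way to approach this, sketched under assumptions rather than taken from an answer in the thread: try a short list of candidate formats per value and turn anything that fits none of them into NaT. The format list, the helper name parse_or_nat, and the choice of NaT for bad values are all assumptions.

```python
import datetime

import pandas as pd

# Assumed candidate formats, tried in order. Ambiguous values such as
# 03/04/2016 are read with the first format that fits (month first here).
FORMATS = ["%m/%d/%Y %H:%M", "%d/%m/%Y %H:%M"]


def parse_or_nat(text):
    """Return a datetime for the first matching format, or NaT if none match."""
    for fmt in FORMATS:
        try:
            return datetime.datetime.strptime(text, fmt)
        except ValueError:
            continue
    return pd.NaT


df = pd.DataFrame({'d1': ['2/26/2019 03:31', '10241-2-19 0:0:0', '31/03/2016 16:00'],
                   'd2': ['2/29/2019 05:21', '10241-2-29 0:0:0', '03/04/2016 12:00']})

for col in ['d1', 'd2']:
    df[col] = pd.to_datetime(df[col].apply(parse_or_nat))

print(df)
```

Note that 2/29/2019 also ends up as NaT, since 2019 was not a leap year, and whether 03/04/2016 means March 4 or April 3 cannot be decided from the string alone.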

How do I make this function iterable (getting IndexError)

I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (~300).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib
import datetime
def get_data(issue_date, stock_ticker):
    df = pd.read_csv (r'D:\Project\Data\Short_Interest\exampledata.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    d = df
    df = pd.DataFrame(d)
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index of the row of interest
    ix = short[short.Date.eq(issue_date)].index[0]
    # get the item row for that row's index
    iloc_ix = short.index.get_loc(ix)
    # get the +/-10 iloc rows (+11 because the slice end is exclusive), basically +10 and -10 trading days
    short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    return [short_data]
I want to create a script that iterates over a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as follows:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row, iterate over the csv file, pass each row's values to the function, and accumulate the results in a list.
I modified your function a bit: removed the DataFrame creation so it is only done once and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like they are day/month/year so I converted them for that format.
import pandas as pd
import datetime, csv
def get_data(df, issue_date, stock_ticker):
    '''Return the rows for the ticker centered on the issue date.
    '''
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index of the row of interest
    try:
        ix = short[short.Date.eq(issue_date)].index[0]
        # get the item row for that row's index
        iloc_ix = short.index.get_loc(ix)
        # get the +/-10 iloc rows (+11 because the slice end is exclusive), basically +10 and -10 trading days
        # max(..., 0) keeps the start from going negative near the top of the data
        short_data = short.iloc[max(iloc_ix-10, 0): iloc_ix+11]
    except IndexError:
        msg = f'no data for {stock_ticker} on {issue_date}'
        #log.info(msg)
        print(msg)
        short_data = None
    return short_data
df = pd.read_csv (datafile)  # datafile: path to the data csv
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
results = []
with open('issues.csv') as issues:
    for ticker,date in csv.reader(issues):
        # the issue dates are day/month/year; parse them to a datetime so they
        # compare reliably with the datetime64 values in df['Date']
        date = datetime.datetime.strptime(date, r'%d/%m/%Y')
        s = get_data(df,date,ticker)
        results.append(s)
        # print(s)
Creating a single DataFrame or table for all that info may be problematic especially since the date ranges are all different. Probably should ask a separate question regarding that. Its mcve should probably just include a few minimal Pandas Series with a couple of different date ranges and tickers.
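If one table is still wanted, a rough sketch of one option is to drop the None entries and stack the rest with pd.concat, labelling each window so the differing date ranges stay distinguishable. The two small frames, their values, and the label strings below are made up for illustration and only stand in for what get_data returns.

```python
import pandas as pd

# Made-up stand-ins for the windows returned by get_data (values are illustrative).
window_a = pd.DataFrame({
    "Date": pd.to_datetime(["2016-09-13", "2016-09-14"]),
    "Symbol": ["AMD", "AMD"],
    "ShortVolume": [100, 200],
})
window_b = pd.DataFrame({
    "Date": pd.to_datetime(["2016-05-23", "2016-05-24"]),
    "Symbol": ["ATI", "ATI"],
    "ShortVolume": [300, 400],
})

# None marks a ticker/date pair for which no data was found.
results = [window_a, None, window_b]
labels = ["AMD 14/09/2016", "ACETQ 16/11/2015", "ATI 24/05/2016"]

# Keep only the pairs that produced data, then stack them under their labels.
found = [(label, frame) for label, frame in zip(labels, results) if frame is not None]
combined = pd.concat(
    [frame for _, frame in found],
    keys=[label for label, _ in found],
    names=["issue", "row"],
)
print(combined)
```

The outer index level then says which issue each row belongs to, and combined.loc["AMD 14/09/2016"] gives back a single window.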

Extracting columns from CSV using Pandas

I am trying to extract the Start Station from a csv file, example data below.
Start Time,End Time,Trip Duration,Start Station,End Station,User Type,Gender,Birth Year
1423854,2017-06-23 15:09:32,2017-06-23 15:14:53,321,Wood St & Hubbard St,Damen Ave & Chicago Ave,Subscriber,Male,1992.0
The problem I am having is when I try to extract the data I receive the following error message:
AttributeError: 'Series' object has no attribute 'start'
def load_data(city, month, day):
    # load data file into a dataframe
    df = pd.read_csv(CITY_DATA[city])
I believe my problem stems from converting the Start Station, but can't seem to figure why.
    # convert the Start Station column to dataframe
    df['Start Station'] = pd.DataFrame(df['Start Station'])
    # extract street names from Start Station and End Station to create new columns
    df['start'] = df['Start Station'].start
def station_stats(df):
    """Displays statistics on the most popular stations and trip."""
    # TO DO: display most commonly used start station
    popular_start_station = df['start']
    print(popular_start_station)
Your code is confusing. Just try this:
df = pd.read_csv(CITY_DATA[city]) # load data file into one df
start_data = df[['Start Station']] # create a DataFrame with the column(s) of interest
You can add more columns to the list on the second line to your liking. For further reading, refer to this post.
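For the "most commonly used start station" part, here is a small sketch using Series.mode. The inline CSV is only a stand-in for the real file: the first row is the sample from the question (without its leading index value) and the other rows are made up.

```python
import io

import pandas as pd

# Stand-in data: first row from the question, remaining rows invented.
csv_text = """Start Time,End Time,Trip Duration,Start Station,End Station
2017-06-23 15:09:32,2017-06-23 15:14:53,321,Wood St & Hubbard St,Damen Ave & Chicago Ave
2017-06-24 09:00:00,2017-06-24 09:10:00,600,Damen Ave & Chicago Ave,Wood St & Hubbard St
2017-06-25 18:30:00,2017-06-25 18:42:00,720,Wood St & Hubbard St,Damen Ave & Chicago Ave
"""
df = pd.read_csv(io.StringIO(csv_text))

# Selecting with a single label gives a Series; no conversion is needed.
start_station = df["Start Station"]

# mode() returns the most frequent value(s); take the first one.
popular_start_station = start_station.mode()[0]
print(popular_start_station)
```

The original AttributeError arises because a Series has no attribute called start; selecting the column by its name, as above, is enough.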

Trying to convert the time stamps in my dictionary to dates

all_data = {}
for ticker in ['TWTR', 'SNAP', 'FB']:
    all_data[ticker] = np.array(pd.read_csv('https://www.google.com/finance/getprices?i=60&p=10d&f=d,o,h,l,c,v&df=cpct&q={}'.format(ticker), skiprows=7, header=None))
date = []
for i in np.arange(0, len(all_data['SNAP'])):
    if all_data['SNAP'][i][0][0] == 'a':
        t = datetime.datetime.fromtimestamp(int(all_data['SNAP'][i][0].replace('a','')))
        date.append(t)
    else:
        date.append(t+ datetime.timedelta(minutes= int(all_data['SNAP'][i][0])))
Hi, this code creates a dictionary (all_data) and fills it with intraday data for Twitter, Snapchat and Facebook from the URL. The dates are in epoch time format, so the second for loop converts them to datetimes.
I was only able to do this for one of the tickers (SNAP), and I was wondering if anyone knew how to iterate over all the data to do the same.
With pandas, you normally convert a timestamp to datetime using:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit="s")
Note:
Your script seems to contain other errors, which are outside the scope of the question.
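To repeat the conversion for every ticker, one option is to move the loop body into a function and call it once per dictionary entry. This is only a sketch: the helper name convert_column and the tiny hand-written first columns are assumptions, and it uses UTC rather than local time so the result does not depend on the machine.

```python
import datetime


def convert_column(raw_values):
    """Turn 'a<epoch>' markers and minute offsets into datetimes."""
    base = None
    converted = []
    for value in raw_values:
        value = str(value)
        if value.startswith('a'):
            # An 'a' prefix marks an absolute epoch timestamp in seconds.
            base = datetime.datetime.fromtimestamp(
                int(value[1:]), tz=datetime.timezone.utc
            )
            converted.append(base)
        elif base is None:
            raise ValueError("offset found before any 'a' timestamp")
        else:
            # Other rows are offsets, in minutes, from the last marker.
            converted.append(base + datetime.timedelta(minutes=int(value)))
    return converted


# Hand-written stand-ins for the first column of each ticker's data.
first_columns = {
    'TWTR': ['a1500000000', '1', '2'],
    'SNAP': ['a1500000000', '1', 'a1500086400', '1'],
    'FB': ['a1500000000', '5'],
}

dates = {ticker: convert_column(col) for ticker, col in first_columns.items()}
print(dates['SNAP'])
```

With the real data, the dictionary comprehension would run over all_data and pass the first column of each array instead of first_columns.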
