Find a certain date inside a Timestamp vector - Python

I have a timestamp vector and I need to find the position index of a given date inside it. Let's say I want to find the position index of 2017-01-01 in this vector.
Below is the basic code that creates the timestamp vector:
import numpy as np
import pandas as pd

ts_vec = []
t = pd.Timestamp('2016-03-03 00:00:00')  # pd.Timestamp is the public alias for pd._libs.tslibs.timestamps.Timestamp
for i in range(1000):
    ts_vec.append(t)
    t = t + pd.Timedelta(days=1)
ts_vec = np.array(ts_vec)
How should I do this? Thank you!

np.where returns the matching positions:
outp = np.where(ts_vec == pd.Timestamp('2017-01-01 00:00:00'))
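An alternative sketch (an editorial addition, assuming the same daily spacing as above): build the vector as a pandas DatetimeIndex instead, which supports direct lookups:
idx = pd.date_range('2016-03-03', periods=1000, freq='D')
pos = idx.get_loc(pd.Timestamp('2017-01-01'))  # -> 304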

Related

Creating an array of timestamps between two timestamps in pyspark

I have two timestamp columns in my pyspark dataframe. I want to create a third column which holds the array of hourly timestamps between the two.
This is the code I wrote for that:
# Creating udf function
def getBetweenStamps(st_date, dc_date):
    from datetime import timedelta  # import not shown in the original snippet
    import numpy as np
    hr = 0
    date_list = []
    running_date = st_date
    while dc_date > running_date:
        running_date = st_date + timedelta(hours=hr)
        date_list.append(running_date)
        hr += 1
    dates = np.array(date_list)
    return dates
udf_betweens = F.udf(getBetweenStamps, ArrayType(DateType()))
# Using udf function
orders.withColumn('date_array', udf_betweens(F.col('start_date'), F.col('ICUDischargeDate'))).show()
However this is showing the error
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I think the inputs to the function are going in as two arrays rather than two datetimes, causing the error. Is there any way around this? Any other way of solving this problem?
Thank you very much.
You are getting the error because you return a numpy array from your udf. Simply return date_list instead and it will work:
def getBetweenStamps(st_date, dc_date):
    from datetime import timedelta
    hr = 0
    date_list = []
    running_date = st_date
    while dc_date > running_date:
        running_date = st_date + timedelta(hours=hr)
        date_list.append(running_date)
        hr += 1
    return date_list
udf_betweens = F.udf(getBetweenStamps, ArrayType(DateType()))
To test the above function:
from pyspark.sql.functions import col, expr  # needed for the snippet below

df = spark.sql("select current_timestamp() as t1").withColumn("t2", col("t1") + expr("INTERVAL 1 DAYS"))
df.withColumn('date_array', udf_betweens(F.col('t1'), F.col('t2'))).show()
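As an aside (an editorial sketch, Spark 2.4+ only): the built-in sequence SQL function can generate the hourly array without a Python udf at all:
df.withColumn('date_array', expr("sequence(t1, t2, interval 1 hour)")).show()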

Creating new pandas columns with original value plus random number in error range

I have a pandas dataframe which has a column 'INTENSITY' and a numpy array of same length containing the error for each intensity. I would like to generate columns with randomly generated numbers in the error range.
So far I use two nested for loops to create the new columns but I feel like this is inefficient:
from math import sqrt
import random
import numpy as np
from pandas import Series

theor_err = [sqrt(abs(x)) for x in theor_df[str(INTENSITY)]]
theor_err = np.asarray(theor_err)
for nr_sample in range(2):
    sample = np.zeros(len(theor_df[str(INTENSITY)]))
    for i, error in enumerate(theor_err):
        sample[i] = theor_df[str(INTENSITY)][i] + random.uniform(-error, error)
    theor_df['gen_{}'.format(nr_sample)] = Series(sample, index=theor_df.index)
theor_df.head()
Is there a more efficient way of approaching a problem like this?
NumPy can handle whole arrays for you, so you can do it like this:
import pandas as pd
import numpy as np

a = pd.DataFrame([10, 20, 15, 30], columns=['INTENSITY'])
a['theor_err'] = np.sqrt(np.abs(a.INTENSITY))
a['sample'] = np.random.uniform(-a['theor_err'], a['theor_err'])
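To get several generated columns as in the question (a small editorial sketch building on the same idea; the gen_ names mirror the question's code):
for nr_sample in range(2):
    a['gen_{}'.format(nr_sample)] = np.random.uniform(-a['theor_err'], a['theor_err'])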
Suppose you want to generate 6 samples. You can try the code below, and tune the number of samples by setting the value of k.
df = pd.DataFrame([[1], [2], [3], [4], [-5]], columns=["intensity"])
k = 6
sample_names = ["sample" + str(i + 1) for i in range(k)]
df["err"] = np.sqrt(np.abs(df["intensity"]))
df[sample_names] = pd.DataFrame(
    df["err"].map(lambda x: np.random.uniform(-x, x, k)).values.tolist())
df.loc[:, sample_names] = df.loc[:, sample_names].add(df.intensity, axis=0)
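A fully vectorized variant of the same idea (an editorial sketch assuming df, k, and sample_names as above; .to_numpy() needs pandas >= 0.24, use .values on older versions). Scaling uniform(-1, 1) draws by the per-row error gives uniform(-err, err) samples:
noise = np.random.uniform(-1.0, 1.0, size=(len(df), k)) * df["err"].to_numpy()[:, None]
df[sample_names] = noise + df["intensity"].to_numpy()[:, None]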

Storing Datetime in a matrix to be used to define points of interest (Python)

I have a bunch of CSV files that contain rows of dates corresponding to data, with column headers. Using pandas, I have been able to import the CSV files. Now, I have made a CSV file that labels the points of interest by datetime, and I have imported this file with pandas as well.
I need to store the start time and end time in a matrix/array/something that I can call later to parse against my data, which is labeled with these dates. Using pd.to_datetime I have been able to convert the strings in my CSVs to datetime, but I have no idea how to store this.
This is my third day using Python, so I apologize for the newbie question; I am a relatively advanced user of Matlab. I will provide my code, but I cannot provide the data in question as it is not owned by me. Thanks guys!
NUMBER_OF_CLASSES = 4
SUBSPACE_DIMENSION = 3

from datetime import datetime
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as sio

PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()
# Pdata format:
# Pdata{hull, engine, 1}(:)   - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corresponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:)   - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')
[list_m, list_n] = PeriodList.shape
Pdata_1 = Pdata_1.as_matrix()

startdatetime = []
enddatetime = []
# Up to line 27 done in the MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d, 0]
    Engine = PeriodList[d, 1]
    startdatetime[d] = pd.to_datetime(PeriodList[d, 2])
    enddatetime[d] = pd.to_datetime(PeriodList[d, 3])
    #startdatetime = pd.to_datetime(PeriodList[d, 2])
Instead of iterating through the dataframe, you can store the start and end dates in a new dataframe, convert the columns to datetime, and then access the data with the iloc method:
dates = PeriodList[['START', 'END']]
dates['START'] = pd.to_datetime(dates['START'])
dates['END'] = pd.to_datetime(dates['END'])

# You can access the dates by row position using iloc
dates.iloc[3]
# If you want only the start date, you can use the column name
dates.iloc[3]['START']
In case you want to store them in an existing data structure, you can use a dictionary with the index as keys and the dataframe rows as values:
start_end = dict(zip(dates.index, dates.values))
If you are looking for the difference between the end date and the start date, you can simply subtract the columns:
dates['Difference'] = dates['END'] - dates['START']
I suggest going through the pandas indexing documentation for more info about accessing the data.
Edit:
You can also use dictionaries in your code:
startdatetime = {}
enddatetime = {}
# Up to line 27 done in the MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d, 0]
    Engine = PeriodList[d, 1]
    startdatetime[d] = pd.to_datetime(PeriodList[d, 2])
    enddatetime[d] = pd.to_datetime(PeriodList[d, 3])
Hope this helps
Figured out a solution: preallocate lists of empty strings, so the loop can store a value at each index on every iteration. Since the placeholders are empty strings, there is no "cannot convert to float" error. Thanks for the help @Bharath Shetty
Code:
PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()
# Pdata format:
# Pdata{hull, engine, 1}(:)   - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corresponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:)   - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')
[list_m, list_n] = PeriodList.shape
#Pdata_1 = Pdata_1.as_matrix()

startdatetime = ['' for x in range(list_m)]
enddatetime = ['' for x in range(list_m)]
# Up to line 27 done in the MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d, 0]
    Engine = PeriodList[d, 1]
    startdatetime[d] = pd.to_datetime(PeriodList[d, 2])
    enddatetime[d] = pd.to_datetime(PeriodList[d, 3])
    #startdatetime = pd.to_datetime(PeriodList[d, 2])
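A note for current pandas versions (an editorial aside, not part of the original answers): DataFrame.as_matrix() was deprecated and later removed; .to_numpy() is the replacement:
PeriodList = pd.read_csv('IP_List.csv').to_numpy()  # as_matrix() no longer exists in pandas >= 1.0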

Python sklearn.datasets.dump_svmlight_file fails to output the right column index

I want to execute SVM light and SVM rank, so I need to convert my data into the SVM light format.
But I have hit a big problem. My Python code is below:
import pandas as pd
import numpy as np
from sklearn.datasets import dump_svmlight_file

self.df = pd.DataFrame()
self.df['patent_id'] = patent_id_list
self.df['Target'] = class_list
self.df['backward_citation'] = backward_citation_list
self.df['uspc_originality'] = uspc_originality_list
self.df['science_linkage'] = science_linkage_list
self.df['sim_bc_structure'] = sim_bc_structure_list
self.df['claim_num'] = claim_num_list
self.qid = dataset_list

X = self.df[np.setdiff1d(self.df.columns, ['patent_id', 'Target'])]
y = self.df.Target
dump_svmlight_file(X, y, 'test.dat', zero_based=False, query_id=self.qid, multilabel=False)
The output file "test.dat" does not match the real data (the question's screenshots are omitted here).
Take the first instance for example: the value of column 1 is 7, the values of columns 2-4 are zeros, and the value of column 5 is 2, so my expected output line is:
1 qid:1 1:7 5:2
But the column indices in the output file are totally wrong, and I cannot figure out where the problem occurs. I have been stuck on this for a long time.
Thank you for your help!
I changed the data structure and used np.array to produce the array-like input. Finally, I succeeded!
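A plausible explanation for the original column shuffle (an editorial reading, not confirmed in the thread): np.setdiff1d returns its result sorted, so the feature columns were reordered alphabetically before being dumped. Selecting the columns in their original order avoids that:
# Keep the DataFrame's own column order; np.setdiff1d sorts its result.
feature_cols = [c for c in self.df.columns if c not in ('patent_id', 'Target')]
X = self.df[feature_cols]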
If you're interested in loading the data into a numpy array, try:
X = clicks_train[:, 0:2]
y = clicks_train[:, 2]
where 2 is the index of the target column.

Pandas DataFrame has incorrect indices

I'm pulling data from Yahoo finance, but the DataFrame I'm creating won't load because my indices are incorrect.
I know what I need to fix, I just don't know how :/
Here is my code and error:
from scipy import stats
import scipy as sp
import numpy as np
import pandas as pd
import datetime as datetime
from matplotlib.finance import quotes_historical_yahoo

ticker = 'IBM'
begdate = (1962, 1, 1)
enddate = (2013, 11, 22)
x = quotes_historical_yahoo(ticker, begdate, enddate, asobject=True, adjusted=True)
logret = np.log(x.aclose[1:] / x.aclose[:-1])

date = []
d0 = x.date
print len(logret)
for i in range(0, len(logret)):
    t1 = ''.join([d0[i].strftime("%Y"), d0[i].strftime("%m"), "01"])
    date.append(datetime.datetime.strptime(t1, "%Y%m%d"))
    y = pd.DataFrame(logret, date)

retM = y.groupby(y.index).sum()
ret_Jan = retM[retM.index.month == 1]
ret_others = retM[retM.index.month != 1]
print sp.stats.bartlett(ret_Jan.values, ret_others.values)
The error comes from this line:
y = pd.DataFrame(logret, date)
And produces this:
ValueError: Shape of passed values is (1, 13064), indices imply (1, 1)
I believe I need to change logret into a list? ... or a tuple?
But my efforts to convert, using tuple(logret) or creating an empty list and populating it, have not worked so far.
Suggestions?
ValueError: Shape of passed values is (1, 13064), indices imply (1, 1)
means that you've given pd.DataFrame a series of length 13064 and an index of length 1, and asked it to use that index for the series. Indeed, that is what you've done: date starts off as [], and you have appended only one value to it by the time the DataFrame is first created, so the index you're passing is just a singleton list.
I think you probably didn't mean to create the DataFrame inside the loop.
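A minimal sketch of that fix, reusing the question's own variables: finish building date first, then construct the DataFrame once, outside the loop:
date = []
for i in range(0, len(logret)):
    t1 = ''.join([d0[i].strftime("%Y"), d0[i].strftime("%m"), "01"])
    date.append(datetime.datetime.strptime(t1, "%Y%m%d"))
y = pd.DataFrame(logret, date)  # now len(date) == len(logret)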
I think you're making this a lot harder by going in and out of Pandas objects. If you just stay in Pandas this is pretty simple. I think all you need to do is:
import datetime
import numpy as np
import scipy as sp
from scipy import stats
import pandas.io.data as web

start = datetime.datetime(1962, 1, 1)
end = datetime.datetime(2013, 11, 22)
f = web.DataReader("IBM", 'yahoo', start, end)
f['returns'] = np.log(f.Close / f.Close.shift(1))  # the original's bare log needs to be np.log
ret_Jan = f.returns[f.index.month == 1]
ret_others = f.returns[f.index.month != 1]
print sp.stats.bartlett(ret_Jan, ret_others)
(122.77708966467267, 1.5602965581388475e-28)
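A present-day note (editorial, not part of the original answer): pandas.io.data was later split out of pandas into the separate pandas-datareader package, so on modern versions only the import changes (assuming that package is installed; the Yahoo endpoint itself has been unreliable in recent years):
import pandas_datareader.data as web  # replaces the removed pandas.io.data
f = web.DataReader("IBM", 'yahoo', start, end)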
