I'm trying to get my code to take times in 24-hour format (such as 0930, i.e. 09:30 AM, or 20h45, i.e. 08:45 PM) and output them as 09:30 and 20:45, respectively.
I tried using datetime and strftime, etc., but I can't use them on NumPy arrays.
I've also tried formatting as NumPy datetime64, but I can't seem to get the HH:MM format.
This is my array (example):
[0845 0925 1046 2042 2153]
and I would like to output this:
[08:45 09:25 10:46 20:42 21:53]
Thank you all in advance.
Although I'm not entirely sure what you are trying to accomplish, I think this is the desired output.
For parsing dates you should first use strptime to get a datetime object and then strftime to turn it back into the desired string.
You say you have NumPy arrays, but your example has leading zeroes, so I assume it is an np array with a string dtype.
Custom functions can be vectorized to work on NumPy arrays:
import numpy as np
from datetime import datetime

a = np.array(["0845", "0925", "1046", "2042", "2153"], dtype=str)

def fun(x):
    x = datetime.strptime(x, "%H%M")
    return datetime.strftime(x, "%H:%M")

vfunc = np.vectorize(fun)
result = vfunc(a)
print(result)
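Note that np.vectorize is essentially a Python-level loop, so a plain list comprehension (shown here as an alternative sketch of the same parse/format round-trip) gives the same result:

```python
import numpy as np
from datetime import datetime

a = np.array(["0845", "0925", "1046", "2042", "2153"], dtype=str)

# Same strptime/strftime round-trip as fun(), done in a list comprehension
result = np.array([datetime.strptime(x, "%H%M").strftime("%H:%M") for x in a])
print(result)  # ['08:45' '09:25' '10:46' '20:42' '21:53']
```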
You can leverage pandas.to_datetime():
import numpy as np
import pandas as pd
x = np.array(["0845","0925","1046","2042","2153"])
y = pd.to_datetime(x, format="%H%M").to_numpy()
Outputs:
>>> x
['0845' '0925' '1046' '2042' '2153']
>>> y
['1900-01-01T08:45:00.000000000' '1900-01-01T09:25:00.000000000'
'1900-01-01T10:46:00.000000000' '1900-01-01T20:42:00.000000000'
'1900-01-01T21:53:00.000000000']
More info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
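If you want the HH:MM strings themselves rather than full timestamps, the DatetimeIndex returned by pd.to_datetime() can be formatted back with strftime (a small extension of the snippet above):

```python
import numpy as np
import pandas as pd

x = np.array(["0845", "0925", "1046", "2042", "2153"])
# strftime on the DatetimeIndex yields the formatted strings directly
y = pd.to_datetime(x, format="%H%M").strftime("%H:%M").to_numpy()
print(y)  # ['08:45' '09:25' '10:46' '20:42' '21:53']
```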
Assuming that the initial array includes only valid values (e.g. values like '08453' or '2500' are not valid), this is a simple solution that does not require any modules beyond NumPy:
import numpy

arr = numpy.array(["0845", "0925", "1046", "2042", "2153"])
new_arr = []
for x in arr:
    elem = x[:2] + ":" + x[2:]
    new_arr.append(elem)
print(new_arr)
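If you want to avoid the Python loop entirely, one possible trick (a sketch, assuming every element is exactly four characters) is to view the '<U4' array as pairs of 2-character fields and rejoin them with np.char.add:

```python
import numpy as np

arr = np.array(["0845", "0925", "1046", "2042", "2153"])
# Viewing a '<U4' array as '<U2' splits each element into two 2-char fields
parts = arr.view('<U2').reshape(-1, 2)
# Element-wise string concatenation: HH + ':' + MM
result = np.char.add(np.char.add(parts[:, 0], ':'), parts[:, 1])
print(result)  # ['08:45' '09:25' '10:46' '20:42' '21:53']
```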
I have an array A with shape (2,4,1). I want to calculate the mean of A[0] and A[1] and store both the means in A_mean. I present the current and expected outputs.
import numpy as np
A = np.array([[[1.7],
               [2.8],
               [3.9],
               [5.2]],
              [[2.1],
               [8.7],
               [6.9],
               [4.9]]])

for i in range(0, len(A)):
    A_mean = np.mean(A[i])
print(A_mean)
The current output is
5.65
The expected output is
[3.4,5.65]
The for loop is not necessary because NumPy already knows how to operate on whole vectors/matrices.
The solution is to remove the loop and specify the axis:
A_mean=np.mean(A, axis=1)
print(A_mean)
Outputs:
[[3.4 ]
[5.65]]
Now you can also flatten the result to remove the nested brackets and get [3.4 5.65]:
print(A_mean.ravel())
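Alternatively, since np.mean accepts a tuple of axes, you can reduce over both trailing axes at once and skip the ravel() step:

```python
import numpy as np

A = np.array([[[1.7], [2.8], [3.9], [5.2]],
              [[2.1], [8.7], [6.9], [4.9]]])
# Averaging over axes 1 and 2 collapses each (4, 1) block to a scalar,
# yielding a flat array of shape (2,)
A_mean = np.mean(A, axis=(1, 2))
print(A_mean)
```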
Try this.
import numpy as np
A = np.array([[[1.7],
               [2.8],
               [3.9],
               [5.2]],
              [[2.1],
               [8.7],
               [6.9],
               [4.9]]])
A_mean = []
for i in range(0, len(A)):
    A_mean.append(np.mean(A[i]))
print(A_mean)
I am using pd.Grouper to group my time series at a frequency of 3 days. To retrieve the time array, I use date = df.index.values, which returns an array of times that looks like this:
array(['2010-01-31T00:00:00.000000000', '2010-02-03T00:00:00.000000000',
'2017-05-12T00:00:00.000000000', '2017-05-15T00:00:00.000000000',
'2017-05-18T00:00:00.000000000', '2017-05-21T00:00:00.000000000',
'2017-05-24T00:00:00.000000000', '2017-05-27T00:00:00.000000000',
'2017-05-30T00:00:00.000000000', '2017-06-02T00:00:00.000000000',
'2017-06-05T00:00:00.000000000', '2017-06-08T00:00:00.000000000',
'2017-06-11T00:00:00.000000000', '2017-06-14T00:00:00.000000000',
'2017-06-17T00:00:00.000000000', '2017-06-20T00:00:00.000000000',
'2017-06-23T00:00:00.000000000', '2017-06-26T00:00:00.000000000',
'2017-06-29T00:00:00.000000000', '2017-07-02T00:00:00.000000000',
'2017-07-05T00:00:00.000000000', '2017-07-08T00:00:00.000000000',
'2017-07-11T00:00:00.000000000', '2017-07-14T00:00:00.000000000',
'2017-07-17T00:00:00.000000000', '2017-07-20T00:00:00.000000000',
'2017-07-23T00:00:00.000000000', '2017-07-26T00:00:00.000000000',
'2017-07-29T00:00:00.000000000', '2017-08-01T00:00:00.000000000',
'2017-08-04T00:00:00.000000000', '2017-08-07T00:00:00.000000000'],
dtype='datetime64[ns]')
I have been trying to get just the date (and eventually the MJD) out of it. It works when I copy 1-2 elements of this array and do this:
times =['2010-02-03T00:00:00.000000000','2010-02-03T00:00:00.000000000']
t = Time(times, format='isot', scale='utc')
print(t.mjd)
>>[55230. 55230.]
However, I am not able to use the same type of code for the entire array:
from astropy.time import Time
t = Time(date, format='isot', scale='utc')
print(t.mjd)
it gives me the error "Input values did not match the format class isot". So I guessed that Time requires a list rather than an array, but converting date to a list doesn't fix the problem. I can't work it out: the example above is a list of 2 strings and it works fine. What am I doing wrong here? I have also tried a few other ways using pandas and looping over elements. Thanks for the help.
Since astropy 3.1 there is built-in support for datetime64, so you can simply do this:
In [2]: dates = np.array(['2010-01-31T00:00:00', '2010-02-03T00:00:00'],
   ...:                  dtype='datetime64[ns]')
In [3]: tm = Time(dates)
In [4]: tm.mjd
Out[4]: array([55227., 55230.])
Found a way to do this, after looking at this link
from astropy.time import Time

date = df.index.values
a = []
for i in date:
    ts = pd.to_datetime(str(i))
    d = ts.strftime('%Y-%m-%d')
    a.append(d)
    print(d)
grouped_date = Time(a, format='iso', out_subfmt='date')
grouped_date_mjd = grouped_date.mjd
print(a[0:3], grouped_date_mjd[0:3])
>> ['2010-01-31', '2010-02-03', '2010-02-06'] [55227. 55230. 55233.]
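As a cross-check that avoids both the loop and astropy: MJD is simply the number of days since 1858-11-17 00:00 UTC, so plain NumPy datetime arithmetic on the datetime64 array reproduces the same numbers:

```python
import numpy as np

date = np.array(['2010-01-31T00:00:00', '2010-02-03T00:00:00'],
                dtype='datetime64[ns]')
# MJD epoch is 1858-11-17; dividing the timedelta by one day gives float days
mjd = (date - np.datetime64('1858-11-17')) / np.timedelta64(1, 'D')
print(mjd)  # [55227. 55230.]
```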
I have a pandas dataframe in the following format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43
I have written the following to create two new fields indicating 30-day windows:
import numpy as np
import pandas as pd

start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')

def find_window_start_date(x):
    window_start_date_idx = np.argmax(x < start_date_period.end_time)
    return start_date_period[window_start_date_idx]

df['window_start_dt'] = df['transaction_dt'].apply(find_window_start_date)

def find_window_end_date(x):
    window_end_date_idx = np.argmin(x > end_date_period.start_time)
    return end_date_period[window_end_date_idx]

df['window_end_dt'] = df['transaction_dt'].apply(find_window_end_date)
Unfortunately, this is far too slow doing the row-wise apply for my application. I would greatly appreciate any tips on vectorizing these functions if possible.
EDIT:
The resultant dataframe should have this layout:
'customer_id','transaction_dt','product','price','units','window_start_dt','window_end_dt'
It does not need to be resampled or windowed in the formal sense. It just needs 'window_start_dt' and 'window_end_dt' columns added. The current code works; it just needs to be vectorized if possible.
EDIT 2: pandas has a built-in function for this, pandas.cut:
tt = [[1, '2004-01-02', 0.1, 25, 47],
      [1, '2004-01-17', 0.2, 150, 8],
      [2, '2004-01-29', 0.2, 150, 25],
      [3, '2017-07-15', 0.3, 55, 17],
      [3, '2016-05-12', 0.3, 55, 47],
      [4, '2012-02-23', 0.2, 150, 22],
      [4, '2009-10-10', 0.1, 25, 12],
      [4, '2014-04-04', 0.2, 150, 2],
      [5, '2008-07-09', 0.2, 150, 43]]

start_date_period = pd.date_range('2004-01-01', '12-01-2017', freq='MS')
end_date_period = pd.date_range('2004-01-30', '12-31-2017', freq='M')

df = pd.DataFrame(tt, columns=['customer_id', 'transaction_dt', 'product', 'price', 'units'])
df['transaction_dt'] = pd.Series([pd.to_datetime(sub_t[1], format='%Y-%m-%d') for sub_t in tt])

the_cut = pd.cut(df['transaction_dt'], bins=start_date_period, right=True, labels=False, include_lowest=True)
df['win_start_test'] = pd.Series([start_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
df['win_end_test'] = pd.Series([end_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
print(df.head())
win_start_test and win_end_test should be equal to their counterparts computed using your function.
The ValueError was coming from not casting x to int in the relevant line. I also added a NaN check, though it wasn't needed for this toy example.
Note the change to pd.date_range and the use of the month-start and month-end flags MS and M, as well as the conversion of the date strings into datetime.
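For the original fixed 30-day windows, another vectorized option (a sketch, using hypothetical sample dates) is np.searchsorted, which maps every transaction date to its bin index in a single call instead of a row-wise apply:

```python
import numpy as np
import pandas as pd

# 30-day window start dates, matching the question's setup
starts = pd.date_range('2004-01-01', '2017-12-31', freq='30D')
dates = pd.to_datetime(['2004-01-02', '2004-01-17', '2004-02-01'])

# For each date, find the last window start that is <= the date
idx = np.searchsorted(starts.values, dates.values, side='right') - 1
window_start = starts[idx]
window_end = starts[idx] + pd.Timedelta(days=29)
print(idx)  # [0 0 1]
```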
I want to print a numpy array without truncation. I have seen other solutions but those don't seem to work.
Here is the code snippet:
total_list = np.array(total_list)
np.set_printoptions(threshold=np.inf)
print(total_list)
And this is what the output looks like:
22 A
23 G
24 C
25 T
26 A
27 A
28 A
29 G
..
232272 G
232273 T
232274 G
232275 C
232276 T
232277 C
232278 G
232279 T
This is the entire code. I might be making a mistake in type casting.
import csv
import pandas as pd
import numpy as np
seqs = pd.read_csv('BAP_GBS_BTXv2_imp801.hmp.csv')
plts = pd.read_csv('BAP16_PlotPlan.csv')
required_rows = np.array([7,11,14,19,22,31,35,47,50,55,58,63,66,72,74,79,82,87,90,93,99])
total_list = []
for i in range(len(required_rows)):
    curr_row = required_rows[i]
    print(curr_row)
    for j in range(len(plts.RW)):
        if curr_row == plts.RW[j]:
            curr_plt = plts.PI[j]
            curr_range = plts.RA1[j]
            curr_plt = curr_plt.replace("_", "").lower()
            if curr_plt in seqs.columns:
                new_item = [curr_row, curr_range, seqs[curr_plt]]
                total_list.append(new_item)
                print(seqs[curr_plt])
total_list = np.array(total_list)
'''
np.savetxt("foo.csv", total_list[:,2], delimiter=',', fmt='%s')
total_list[:,2].tofile('seqs.csv', sep=',', format='%s')
'''
np.set_printoptions(threshold='nan')
print(total_list)
Use the following snippet to print the full array with no ellipsis:
import numpy
import sys
numpy.set_printoptions(threshold=sys.maxsize)
EDIT:
If you have a pandas.DataFrame use the following snippet to print your array:
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
Or you can use the pandas.DataFrame.to_string() method to get the desired result.
EDIT 2:
An earlier version of this post suggested the option below:
numpy.set_printoptions(threshold='nan')
Technically, this might work; however, the numpy documentation specifies int and None as the allowed types. Reference: https://docs.scipy.org/doc/numpy/reference/generated/numpy.set_printoptions.html.
You can get around the odd NumPy repr/print behavior by converting it to a list:
print(list(total_list))
should print out your list of 2-element np arrays.
You are not printing numpy arrays.
Add the following line after the imports:
pd.set_option('display.max_rows', 100000)
# for a 2d array
def print_full(x):
    dim = x.shape
    pd.set_option('display.max_rows', dim[0])      # dim[0] = len(x)
    pd.set_option('display.max_columns', dim[1])
    print(x)
    pd.reset_option('display.max_rows')
    pd.reset_option('display.max_columns')
It appears that as of Python 3, the threshold can no longer be unlimited.
Therefore, the recommended option is:
import numpy
import sys
numpy.set_printoptions(threshold=sys.maxsize)
I recently (1 week ago) decided to migrate my work from matlab to Python. Since I am used to matlab, I sometimes find it difficult to get the exact equivalent of what I want in Python.
Here's my problem:
I have a set of csv files that I want to process. So far, I have succeeded in loading them into groups. Each column has a size of more than 600000 x 1. One of the columns in the csv file is the time, which has the format 'mm/dd/yy HH:MM:SS'. I want to convert the time column to a number, and I am using date2num from matplotlib for that. Is there a 'matrix' way of doing it? The command in matlab for doing that is datenum(time, 'mm/dd/yyyy HH:MM:SS'), where time is a 600000 x 1 matrix.
Thanks
Here is an example of the code that I am talking about:
import csv
import time
import datetime
from datetime import date
from matplotlib.dates import date2num

time = []
otherColumns = []
for d in csv.DictReader(open('MyFile.csv')):
    time.append(str(d['time']))
    otherColumns.append(float(d['otherColumns']))
timeNumeric = date2num(datetime.datetime.strptime(time, "%d/%m/%y %H:%M:%S"))
you could use a generator:
def pre_process(dict_sequence):
    for d in dict_sequence:
        d['time'] = date2num(datetime.datetime.strptime(d['time'], "%d/%m/%y %H:%M:%S"))
        yield d
now you can process your csv:
for d in pre_process(csv.DictReader(open('MyFile.csv'))):
    process(d)
The advantage of this solution is that it doesn't copy sequences that are potentially large.
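Here is a minimal, self-contained sketch of the generator approach using an in-memory CSV (the column names and date format are just assumptions for illustration; date2num from matplotlib would wrap the strptime call in the real code):

```python
import csv
import io
from datetime import datetime

# Hypothetical CSV content standing in for 'MyFile.csv'
csv_text = "time,otherColumns\n01/02/10 09:30:00,1.5\n03/02/10 20:45:00,2.5\n"

def pre_process(dict_sequence):
    for d in dict_sequence:
        # parse the time column in place, one row at a time
        d['time'] = datetime.strptime(d['time'], "%d/%m/%y %H:%M:%S")
        yield d

rows = list(pre_process(csv.DictReader(io.StringIO(csv_text))))
print(rows[0]['time'])  # 2010-02-01 09:30:00
```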
Edit:
So you want the contents of the file in a numpy array?
reader = csv.DictReader(open('MyFile.csv'))
#you might want to get rid of the intermediate list if the file is really big.
data = numpy.array(list(d.values() for d in pre_process(reader)))
Now you have a nice big array that allows all kinds of operations. You want only the first column to get your 600000x1 matrix:
data[:,0] # assuming time is the first column
The closest thing in Python to MATLAB's matrix/vector operations is the list comprehension. If you would like to apply a Python function to each item in a list you can do:
new_list = [date2num(data) for data in old_list]
or:
new_list = list(map(date2num, old_list))
(In Python 3, map returns an iterator, so wrap it in list() if you need an actual list.)