I am trying to write a script that will create an Orange data table with just a single column containing a custom time stamp.
Use case: I need a complete time stamp so I can merge some other CSV files later on. I'm working in the Orange GUI, by the way, and not in the actual Python shell or any other IDE (in case this information makes any difference).
Here's what I have come up with so far:
from Orange.data import Domain, Table, TimeVariable
import numpy as np
domain = Domain([TimeVariable("Timestamp")])
# Timestamps from 2022-03-08 to 2022-03-15 in minute steps
arr = np.arange("2022-03-08", "2022-03-15", dtype="datetime64[m]")
# Obviously necessary to achieve a correct format for the matrix
arr = arr.reshape(-1,1)
out_data = Table.from_numpy(domain, arr)
However, the results do not match:
>>> print(arr)
[['2022-03-08T00:00']
['2022-03-08T00:01']
['2022-03-08T00:02']
...
['2022-03-14T23:57']
['2022-03-14T23:58']
['2022-03-14T23:59']]
>>> print(out_data)
[[27444960.0],
[27444961.0],
[27444962.0],
...
[27455037.0],
[27455038.0],
[27455039.0]]
Obviously I'm missing something when handing over the data from numpy but I'm having a real hard time trying to understand the documentation.
I've also found this post which seems to tackle a similar issue, but I haven't figured out how to apply the solution on my problem.
I would be really glad if anyone could help me out here. Please try to use simple terms and concepts.
Thank you for the question, and apologies for the weak documentation of the TimeVariable.
You must change two things in your code to make it work.
First, it is necessary to set whether the TimeVariable includes time and/or date data:
TimeVariable("Timestamp", have_date=True) stores only date information -- it is analogous to datetime.date
TimeVariable("Timestamp", have_time=True) stores only time information (without date) -- it is analogous to datetime.time
TimeVariable("Timestamp", have_time=True, have_date=True) stores date and time -- it is analogous to datetime.datetime
You didn't set that information in your example, so both were False by default. For your case, you must set both to True since your attribute will hold the date-time values.
The other issue is that Orange's Table stores date-time values as a UNIX epoch (seconds since 1970-01-01), so Table.from_numpy also expects values in this format. The values in your current arr array are in minutes instead; in the code below I simply cast the dtype to seconds.
Here is the working code:
from Orange.data import Domain, Table, TimeVariable
import numpy as np
# Important: set whether TimeVariable contains time and/or date
domain = Domain([TimeVariable("Timestamp", have_time=True, have_date=True)])
# Timestamps from 2022-03-08 to 2022-03-15 in minute steps
arr = np.arange("2022-03-08", "2022-03-15", dtype="datetime64[m]").astype("datetime64[s]")
# necessary to achieve a correct format for the matrix
arr = arr.reshape(-1,1)
out_data = Table.from_numpy(domain, arr)
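To double-check the result, printing the first few rows should now show readable date-times instead of the raw epoch floats from your original output (a quick sanity check, not part of the fix itself):
print(out_data.domain)
# Internally each value is still stored as UNIX epoch seconds, but the
# TimeVariable formats it back into an ISO date-time when printed.
print(out_data[:3])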
I am fairly new to Python and have a problem I'd like to solve, but I need a little bit of help.
I need to ask the user which directory path they want, which I've already figured out, but...
From there I need to figure out a way to ask the user for a specific date/time range (day-month-year, hours:minutes:seconds) and then filter out which CSV files are in that range.
From there, my program needs to go into the filtered CSV files and look at the time stamps recorded in them.
From those time stamps I need to calculate whether there are any gaps from the end of one CSV file to the start of the next.
If there are gaps, I need to return a statement that indicates how long each gap is.
I've seen a few things, but am having trouble putting it all together!
Any guidance would be appreciated!
Consider using Dask data frames (https://docs.dask.org/en/latest/dataframe.html), which work on top of Pandas data frames.
Without going much deeper into Dask, you need to know that it works in lazy mode, which means it will not do any processing until it is explicitly triggered with the compute method. That makes the coding slightly different from Pandas.
The following example solves the part about reading multiple files and finding gaps. The data files (which you can find here: https://github.com/mchiuminatto/stackoverflow/tree/master/data) are OHLC data with a frequency of D (one day), so the gap condition is that the difference between any two consecutive dates is more than 1 day.
import dask.dataframe as dd
# read all the csv files in the directory
# how much is loaded into memory is managed by Dask.
df = dd.read_csv('./data/*.csv')
df['date_time'] = dd.to_datetime(df['Time (UTC)'])
df['Time (UTC)'] = dd.to_datetime(df['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['dif'] = df['date_time'] - df['date_time'].shift(1) # calculates gaps
# no data transformation is performed until you execute compute.
df.compute().head(5)
To check one record:
df.loc['2020-01-06 22:00:00'].compute()
Filter the periods with more than one day of difference:
_mask = df['dif'] > '1 days' # time unit can be adjusted
df_gap = df[_mask].compute() # now we persist transformations in a Pandas df: df_gap
df_gap.head(5)
df_gap.tail(5)
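From there, reporting how long each gap is (as the question asks) could look roughly like this; it is only a sketch on top of the df_gap frame above, and the date-range values are placeholders that would come from the user prompt:
import pandas as pd

# After compute(), df_gap is a plain pandas DataFrame: its index is the
# timestamp where each gap ends and 'dif' holds the length of the gap.
for gap_end, row in df_gap.iterrows():
    print(f"gap of {row['dif']} ending at {gap_end}")

# To restrict the check to a user-supplied date/time range first,
# slice on the index before filtering (placeholder values shown):
start = pd.Timestamp("2020-01-01 00:00:00")
end = pd.Timestamp("2020-03-31 23:59:59")
gaps_in_range = df.loc[start:end]
gaps_in_range = gaps_in_range[gaps_in_range['dif'] > '1 days'].compute()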
I am rather new to Python and I am working with .pos files. They are not that common, but I can explain their structure.
There is a header with general information and then 15 different columns containing data. The first two columns contain the GPS time (the date in the first column and the time in the second, in the standard format YYYY/MM/DD hh:mm:ss.ms), then there are 3 columns containing coordinates or distances in meters, and then other columns with other measurements, always numbers. An example can be found here; mind only that my GPST (GPS time) is as explained above.
As a matter of fact, there are three data types in this file: datetime, integer, and floating-point numbers.
I need to import this file into Python as an array. Apparently Python can treat a .pos file as a text file, so I have tried to use the loadtxt() command, specifying the different data types (datetime64, int, float). However, it gave me an error saying that the date format could not be recognized. Then I tried the genfromtxt() command, both specifying the data types and with dtype=None. In the first case I got empty columns for date and time, and in the latter case I got the date and time as strings.
I would like the date and the time to be recognized as such and not as strings, as I will need them later on for further analyses. Does someone have an idea of how I could import this file correctly?
Please, just try to be clear because I am a neophyte programmer!
Thank you for your help.
I'm answering my own question; maybe it will be useful to someone.
A .pos file can be opened using the Pandas package as follows:
import pandas as pd
df = pd.read_table(filepath, sep=r'\s+', parse_dates={'Timestamp': [0, 1]})
In my data, the first two columns are date and time; the argument parse_dates={'Timestamp': [0, 1]} combines them and parses the result as a single datetime column.
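If the general-information header causes trouble, read_table can skip it as well. Assuming the header lines start with a marker character such as '%' (common in RTKLIB-style .pos files, but check your own file), something like this should work:
import pandas as pd

# comment='%' drops every line beginning with '%' (the header block);
# header=None is then needed because no column-name line is left.
df = pd.read_table(filepath, sep=r'\s+', comment='%', header=None,
                   parse_dates={'Timestamp': [0, 1]})
print(df.dtypes)  # the Timestamp column should show up as datetime64[ns]
If the header has a fixed number of lines instead, skiprows does the same job.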
I want to read and process realtime DDE data from a trading platform, using Excel as a 'bridge' between the trading platform (which sends out the data) and Python, which processes it and prints it back to Excel as a front-end 'GUI'. SPEED IS CRUCIAL. I need to:
read 6-10 thousand cells in Excel as fast as possible
sum ticks passed at the same time (same h:m:sec)
check if the DataFrame contains any value from a static array (e.g. large quantities)
write the output to the same Excel file (different sheet), used as the front-end output 'GUI'.
I imported the xlwings library and use it to read data from one sheet, calculate the needed values in Python and then print the results to another sheet of the same file. I want to keep Excel open and visible so it functions as an 'output dashboard'. This function runs in an infinite loop reading realtime stock prices.
import xlwings as xw
import numpy as np
import pandas as pd
...
...
tickdf = pd.DataFrame(xw.Book('datafile.xlsx').sheets['raw_data'].range((1, 5), (1500, 8)).value)
tickdf.columns = ['time', 'price', 'all-tick','symb']
tickdf = tickdf[['time','symb', 'price', 'all-tick']]
#read data and fill a pandas.df with values, then re-order columns
try:
    global ttt     # temporary global pandas.df
    global tttout  # output global pandas.df copy
    # they are global so they can be zeroed by another function
    ttt = ttt.append(tickdf, ignore_index=False)
    # at each loop, newly read ticks are appended as rows to the end of the global ttt df
    ttt.drop_duplicates(inplace=True)
    tttout = ttt.copy()
    # to avoid outputting incomplete data, for extra safety, a copy of ttt is what gets printed to the Excel file
    tttout = tttout.groupby(['time', 'symb'], as_index=False).agg({'all-tick': 'sum', 'price': 'first'})
    tttout = tttout.set_index('time')
    # sort it by time/name and set time as index
    tttout = tttout.loc[tttout['all-tick'].isin(target_ticker)]
    # find matching values, comparing against an array of a dozen values
    tttout = tttout.sort_values(by=['time', 'symb'], ascending=[False, True])
except Exception:
    pass  # exception handling not shown in the original excerpt

xw.Book(file_path).sheets['OUTPUT'].range('B2').value = tttout
I run this on an i5 @ 4.2 GHz, and this function, together with some other small bits of code, runs in 500-600 ms per loop, which is fairly good (but not fantastic!). I would like to know if there is a better approach and which step(s) might be the bottlenecks.
The code reads 1500 rows, one per listed stock in alphabetical order; each is the 'last tick' traded on the market for that specific stock, and it looks like this:
'10:00:04 | ABC | 10.33 | 50000'
'09:45:20 | XYZ | 5.260 | 200 '
'....
being time, stock symbol, price, quantity.
I want to investigate whether there are specific quantities being traded on the market, such as 1,000,000 (as it represents a huge order), or maybe just '1', which is often used as a market 'heartbeat', a sort of fake order.
My approach is to use Pandas/xlwings and the 'isin' method. Is there a more efficient approach that might improve my script's performance?
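For reference, a rough way to see which of these steps dominates the 500-600 ms would be to time each stage of the loop separately (just a sketch reusing the names from the code above):
import time

t0 = time.perf_counter()
tickdf = pd.DataFrame(xw.Book('datafile.xlsx').sheets['raw_data'].range((1, 5), (1500, 8)).value)
t_read = time.perf_counter() - t0

t0 = time.perf_counter()
# ...the append / groupby / isin / sort steps from above...
t_process = time.perf_counter() - t0

t0 = time.perf_counter()
xw.Book(file_path).sheets['OUTPUT'].range('B2').value = tttout
t_write = time.perf_counter() - t0

print(f"read {t_read:.3f}s  process {t_process:.3f}s  write {t_write:.3f}s")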
It would be faster to use a UDF written with PyXLL as that would avoid going via COM and an external process. You would have a formula in Excel with the input set to your range of data, and that would be called each time the input data updated. This would avoid the need to keep polling the data in an infinite loop, and should be much faster than running Python outside of Excel.
See https://www.pyxll.com/docs/introduction.html if you're not already familiar with PyXLL.
PyXLL could convert the input range to a pandas DataFrame for you (see https://www.pyxll.com/docs/userguide/pandas.html), but that might not be the fastest way to do it.
The quickest way to transfer data from Excel to Python is via a floating point numpy array using the "numpy_array" type in PyXLL (see https://www.pyxll.com/docs/userguide/udfs/argtypes.html#numpy-array-types).
As speed is a concern, maybe you could split the data up and have some functions that take mostly static data (eg rows and column headers), and other functions that take variable data as numpy_arrays where possible or other types where not, and then a final function to combine them all.
PyXLL can return Python objects to Excel as object handles. If you need to return intermediate results then it is generally faster to do that instead of expanding the whole dataset to an Excel range.
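A minimal sketch of what such a UDF could look like (the function name, the input range, and the watch-list quantities below are placeholder assumptions, not part of PyXLL itself):
from pyxll import xl_func
import numpy as np

@xl_func("numpy_array<float> quantities: float")
def large_order_volume(quantities):
    """Recalculated by Excel whenever the input range changes.

    'quantities' is assumed to be the range holding traded quantities;
    the watch list of 'interesting' sizes below is just an example.
    """
    watch_list = np.array([1.0, 1_000_000.0])
    values = quantities.ravel()
    return float(values[np.isin(values, watch_list)].sum())
In Excel this would then be used as a normal worksheet formula, for example =large_order_volume(H1:H1500) where column H holds the quantities, and it should recalculate whenever the DDE feed updates that range.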
@Tony Roberts, thank you.
I have one doubt and one observation.
DOUBT: The data gets updated very fast, every 50-100 ms. Would it be feasible to have a UDF called that often? Would it stay lean? I have little experience with this.
OBSERVATION: PyXLL is for sure extremely powerful, well done and well maintained, but IMHO, at $25/month it goes beyond the pure nature of the free Python language. I do understand, though, that quality has a price.
I've been trying to figure out how to generate the same Unix epoch time that I see within InfluxDB next to measurement entries.
Let me start by saying I am trying to use the same date and time in all tests:
April 01, 2017 at 2:00AM CDT
If I view a measurement in InfluxDB, I see time stamps such as:
1491030000000000000
If I view that measurement in InfluxDB using the -precision rfc3339 it appears as:
2017-04-01T07:00:00Z
So I can see that InfluxDB used UTC.
I cannot seem to generate that same timestamp through Python, however.
For instance, I've tried a few different ways:
>>> calendar.timegm(time.strptime('04/01/2017 02:00:00', '%m/%d/%Y %H:%M:%S'))
1491012000
>>> calendar.timegm(time.strptime('04/01/2017 07:00:00', '%m/%d/%Y %H:%M:%S'))
1491030000
>>> t = datetime.datetime(2017,04,01,02,00,00)
>>> print "Epoch Seconds:", time.mktime(t.timetuple())
Epoch Seconds: 1491030000.0
The last two samples above at least appear to give me the same number, but it's much shorter than what InfluxDB has. I am assuming that is related to the precision; InfluxDB does things down to the nanosecond, I think?
Python Result: 1491030000
Influx Result: 1491030000000000000
If I try to enter a measurement into InfluxDB using the result Python gives me it ends up showing as:
1491030000 = 1970-01-01T00:00:01.49103Z
So I have to add on the extra nine 0's.
I suppose there are a few ways to do this programmatically within Python if it's as simple as adding on nine 0's to the result. But I would like to know why I can't seem to generate the same precision level in just one conversion.
I have a CSV file with tons of old timestamps that are simply, "4/1/17 2:00". Every day at 2 am there is a measurement.
I need to be able to convert that to the proper format that InfluxDB needs "1491030000000000000" to insert all these old measurements.
A better understanding of what is going on and why is more important than how to programmatically solve this in Python. That said, I would be grateful for responses that can do both: explain the issue, what I am seeing and why, as well as give ideas on how to take a CSV with one column of time stamps that appear as "4/1/17 2:00" and convert them to timestamps that appear as "1491030000000000000", either in a separate file or in a second column.
InfluxDB can be told to return epoch timestamps in second precision in order to work more easily with tools/libraries that do not support nanosecond precision out of the box, like Python.
Set epoch=s in query parameters to enable this.
See influx HTTP API timestamp format documentation.
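For example, with the InfluxDB 1.x HTTP API the parameter can be passed straight from Python (the database and measurement names below are placeholders):
import requests

# epoch=s makes InfluxDB return 'time' as plain Unix seconds, which
# matches what calendar.timegm / time.mktime produce in Python.
resp = requests.get(
    "http://localhost:8086/query",
    params={
        "db": "mydb",                       # placeholder database name
        "epoch": "s",
        "q": "SELECT * FROM my_measurement LIMIT 5",
    },
)
print(resp.json())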
Something like this should work to solve your current problem. I didn't have a test csv to try this on, but it will likely work for you. It will take whatever csv file you put where "old.csv" is and create a second csv with the timestamp in nanoseconds.
import time
import datetime
import csv

def convertToNano(date):
    s = date
    # "4/1/17 2:00" is US month/day/year, so the format string is "%m/%d/%y %H:%M";
    # time.mktime converts the parsed local time to epoch seconds.
    secondsTimestamp = time.mktime(datetime.datetime.strptime(s, "%m/%d/%y %H:%M").timetuple())
    # mktime returns e.g. 1491030000.0, so replacing ".0" appends the nine zeros.
    nanoTimestamp = str(secondsTimestamp).replace(".0", "000000000")
    return nanoTimestamp

with open('old.csv', 'rb') as old_csv:
    csv_reader = csv.reader(old_csv)
    with open('new.csv', 'wb') as new_csv:
        csv_writer = csv.writer(new_csv)
        for i, row in enumerate(csv_reader):
            if i != 0:
                # Put whatever column the date appears in here
                row.append(convertToNano(row[<location of date in the row>]))
            csv_writer.writerow(row)
As to why this is happening: after reading this, it seems you aren't the only one frustrated by the issue. It seems as though InfluxDB just happens to use a different precision than most Python modules. I didn't really see any way around it other than doing the string manipulation in the date conversion, unfortunately.
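If you would rather avoid the string replacement, the same nanosecond value can be produced with integer arithmetic; this is just a small variation on the function above (same month/day/year parsing and the same local-time behaviour of mktime):
import time
import datetime

def convert_to_nano_int(date):
    # Parse the "4/1/17 2:00" style string, convert local time to epoch seconds,
    # then shift to nanoseconds by multiplying instead of appending zeros.
    seconds = time.mktime(datetime.datetime.strptime(date, "%m/%d/%y %H:%M").timetuple())
    return int(seconds) * 1_000_000_000

# On a machine set to US Central time this gives 1491030000000000000:
print(convert_to_nano_int("4/1/17 2:00"))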
Suppose there’s a sensor which records the date and time at every activation. I have this data stored as a list in a .json file in the format (e.g.) "2000-01-01T00:30:15+00:00".
Now, what I want to do is import this file in Python and use NumPy/Matplotlib to plot how many times this sensor is activated per day.
My problem is that, using this data, I don't know how to write an algorithm which counts how many times the sensor is activated daily. (This should be simple, but due to limited Python knowledge, I'm stuck.) Supposedly there is a way to split this list on the 'T', bin each recording by date (e.g. "2000-01-01") and then count the recordings on that date.
How would you count how many times the sensor is activated? (to then make a plot showing the number of activations each day?)
First of all you need to load your JSON file:
import json
with open("logfile.json", "r") as logfile:
    records = json.load(logfile)
Records will be a list or dictionary containing your records.
Assuming that your logfile looks like:
[u"2000-01-01T00:30:15+00:00",
u"2000-01-01T00:30:16+00:00",
...
]
Records will be a list of strings. So parsing the dates is just:
import datetime
for record in records:
    datepart, _ = record.split("T")
    date = datetime.datetime.strptime(datepart, "%Y-%m-%d")
Hopefully that's clear enough. Using "string".split and datetime.strptime should do the trick; you don't have to parse this into a date object just to bin it, but it may make things easier later on.
Finally, binning should be pretty straightforward using a dictionary of lists. Starting from what we've got above, let's add binning:
import collections
import datetime
date_bins = collections.defaultdict(list)
for record in records:
    datepart, _ = record.split("T")
    date = datetime.datetime.strptime(datepart, "%Y-%m-%d")
    date_bins[date].append(record)
This should give you a dictionary where each key is a date and each value is the list of records that were logged on that day.
You'll probably want to sort this by date (although you may be able to use collections.OrderedDict if the data is already in order).
Counting activations per day could be something like:
for date in date_bins:
    print("activations on %s: %s" % (date, len(date_bins[date])))
Of course it's a little bit more work to take that information and massage it into a format that matplotlib needs, but it shouldn't be too bad from here.
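For completeness, one way to get from date_bins to the plot the question asks for might be the following sketch, which simply bar-plots the counts per day with matplotlib:
import matplotlib.pyplot as plt

# Sort the dates and pull out the per-day counts.
dates = sorted(date_bins)
counts = [len(date_bins[d]) for d in dates]

plt.bar(dates, counts)
plt.xlabel("date")
plt.ylabel("activations per day")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()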
If your JSON file loads as a list like:
j_list = [('2000-01-01T00:30:15+00:00', 'xx'),
('2000-01-01T00:30:15+00:00', 'yyy'),
('2000-01-02T00:30:15+00:00', 'zzz')]
Note: this assumes the json file returns a list of lists with the timestamp as the first element. Adjust accordingly.
There are parsers in dateutil and datetime to parse the timestamp.
If counting is really all you are doing, even that might be overkill. Since itertools.groupby only groups consecutive items, this works here because the list is already ordered by timestamp. You could:
>>> from itertools import groupby
>>> [(k,len(list(l))) for k,l in groupby(j_list,lambda x: x[0][:10])]
[('2000-01-01', 2), ('2000-01-02', 1)]