Suppose there’s a sensor that records the date and time of every activation. I have this data stored as a list in a .json file, with entries in the format (e.g.) "2000-01-01T00:30:15+00:00".
What I want to do is import this file in Python and use NumPy/Matplotlib to plot how many times the sensor is activated per day.
My problem is that I don’t know how to write an algorithm that counts the daily activations from this data. (This should be simple, but due to limited Python knowledge, I’m stuck.) Presumably there is a way to split each entry at the "T", bin each recording by date (e.g. "2000-01-01"), and then count the recordings for that date.
How would you count how many times the sensor is activated each day, in order to then plot the number of activations per day?
First of all you need to load your JSON file:
import json
with open("logfile.json", "r") as logfile:
    records = json.load(logfile)
records will be a list or dictionary containing your entries.
Assuming that your logfile looks like:
[u"2000-01-01T00:30:15+00:00",
u"2000-01-01T00:30:16+00:00",
...
]
records will be a list of strings, so parsing the dates is just:
import datetime
for record in records:
    datepart, _ = record.split("T")
    date = datetime.datetime.strptime(datepart, "%Y-%m-%d")
Hopefully that's clear enough. str.split and datetime.strptime should do the trick. You don't have to parse the date part into a date object just to bin it, but it may make things easier later on.
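For instance, a minimal sketch of that shortcut (assuming records is the list of strings loaded above), binning by the raw date string with collections.Counter:
from collections import Counter

# Count activations per day by the "YYYY-MM-DD" prefix of each record,
# without converting anything to datetime objects
daily_counts = Counter(record.split("T")[0] for record in records)
# e.g. Counter({'2000-01-01': 2, ...})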
Finally, binning should be pretty straightforward using a dictionary of lists. Starting from what we've got above, let's add the binning:
import collections
import datetime
date_bins = collections.defaultdict(list)
for record in records:
    datepart, _ = record.split("T")
    date = datetime.datetime.strptime(datepart, "%Y-%m-%d")
    date_bins[date].append(record)
This should give you a dictionary where each key is a date and each value is the list of records that were logged on that day.
You'll probably want to sort this by date (although you may be able to use collections.OrderedDict if the data is already in order).
Counting activations per day could then be something like:
for date in sorted(date_bins):
    print("activations on %s: %s" % (date, len(date_bins[date])))
Of course it's a little bit more work to take that information and massage it into a format that matplotlib needs but it shouldn't be too bad from here.
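To sketch that last step (a minimal example, assuming matplotlib is installed and date_bins is the dictionary built above):
import matplotlib.pyplot as plt

# Plot the number of activations per day as a bar chart
dates = sorted(date_bins)
counts = [len(date_bins[d]) for d in dates]
plt.bar(dates, counts)
plt.xlabel("date")
plt.ylabel("activations")
plt.show()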
If your JSON file loads as a list like:
j_list = [('2000-01-01T00:30:15+00:00', 'xx'),
          ('2000-01-01T00:30:15+00:00', 'yyy'),
          ('2000-01-02T00:30:15+00:00', 'zzz')]
Note: this assumes the json file returns a list of lists with the timestamp as the first element. Adjust accordingly.
There are parsers in dateutil and datetime to parse the timestamp.
If counting is really all you are doing, even that might be overkill. You could:
>>> from itertools import groupby
>>> [(k, len(list(g))) for k, g in groupby(j_list, lambda x: x[0][:10])]
[('2000-01-01', 2), ('2000-01-02', 1)]
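One caveat: itertools.groupby only groups consecutive items, so the one-liner above assumes j_list is already sorted by timestamp. If it might not be, sort first:
>>> j_sorted = sorted(j_list, key=lambda x: x[0][:10])
>>> [(k, len(list(g))) for k, g in groupby(j_sorted, lambda x: x[0][:10])]
[('2000-01-01', 2), ('2000-01-02', 1)]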
I am trying to write a script that will create an Orange data table with just a single column containing a custom timestamp.
Use case: I need a complete timestamp so I can merge some other CSV files later on. I'm working in the Orange GUI, by the way, and not in the actual Python shell or any other IDE (in case this information makes any difference).
Here's what I have come up with so far:
from Orange.data import Domain, Table, TimeVariable
import numpy as np
domain = Domain([TimeVariable("Timestamp")])
# Timestamps from 2022-03-08 to 2022-03-15 in one-minute steps
arr = np.arange("2022-03-08", "2022-03-15", dtype="datetime64[m]")
# Reshape into a column matrix, as Table.from_numpy expects a 2-D array
arr = arr.reshape(-1,1)
out_data = Table.from_numpy(domain, arr)
However the results do not match:
>>> print(arr)
[['2022-03-08T00:00']
['2022-03-08T00:01']
['2022-03-08T00:02']
...
['2022-03-14T23:57']
['2022-03-14T23:58']
['2022-03-14T23:59']]
>>> print(out_data)
[[27444960.0],
[27444961.0],
[27444962.0],
...
[27455037.0],
[27455038.0],
[27455039.0]]
Obviously I'm missing something when handing over the data from numpy but I'm having a real hard time trying to understand the documentation.
I've also found this post, which seems to tackle a similar issue, but I haven't figured out how to apply the solution to my problem.
I would be really glad if anyone could help me out here. Please try to use simple terms and concepts.
Thank you for the question, and apologies for the weak documentation of the TimeVariable.
Two things in your code must change for it to work.
First, you need to tell the TimeVariable whether it holds time and/or date data:
TimeVariable("Timestamp", have_date=True) stores only date information -- it is analogous to datetime.date
TimeVariable("Timestamp", have_time=True) stores only time information (without date) -- it is analogous to datetime.time
TimeVariable("Timestamp", have_time=True, have_date=True) stores date and time -- it is analogous to datetime.datetime
You didn't set either flag in your example, so both defaulted to False. For your case, you must set both to True, since your attribute will hold date-time values.
The other issue is that Orange's Table stores date-time values as a UNIX epoch (seconds since 1970-01-01), so Table.from_numpy also expects values in this format. The values in your current arr array are in minutes instead; in the code below I simply cast the dtype to seconds.
Here is the working code:
from Orange.data import Domain, Table, TimeVariable
import numpy as np
# Important: set whether TimeVariable contains time and/or date
domain = Domain([TimeVariable("Timestamp", have_time=True, have_date=True)])
# Timestamps from 2022-03-08 to 2022-03-15 in one-minute steps, cast to seconds
arr = np.arange("2022-03-08", "2022-03-15", dtype="datetime64[m]").astype("datetime64[s]")
# Reshape into a column matrix, as required by Table.from_numpy
arr = arr.reshape(-1,1)
out_data = Table.from_numpy(domain, arr)
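As a quick sanity check (hedged, based on how Orange renders TimeVariable values), printing the table should now show formatted timestamps rather than raw epoch floats, something like:
>>> print(out_data[:3])
[[2022-03-08 00:00:00],
 [2022-03-08 00:01:00],
 [2022-03-08 00:02:00]]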
I've been trying to figure out how to generate the same Unix epoch time that I see within InfluxDB next to measurement entries.
Let me start by saying I am trying to use the same date and time in all tests:
April 01, 2017 at 2:00AM CDT
If I view a measurement in InfluxDB, I see time stamps such as:
1491030000000000000
If I view that measurement in InfluxDB using the -precision rfc3339 it appears as:
2017-04-01T07:00:00Z
So I can see that InfluxDB stores the timestamp in UTC.
I cannot seem to generate that same timestamp through Python, however.
For instance, I've tried a few different ways:
>>> calendar.timegm(time.strptime('04/01/2017 02:00:00', '%m/%d/%Y %H:%M:%S'))
1491012000
>>> calendar.timegm(time.strptime('04/01/2017 07:00:00', '%m/%d/%Y %H:%M:%S'))
1491030000
>>> t = datetime.datetime(2017,04,01,02,00,00)
>>> print "Epoch Seconds:", time.mktime(t.timetuple())
Epoch Seconds: 1491030000.0
The last two samples above at least appear to give me the same number, but it's much shorter than what InfluxDB has. I am assuming that is related to precision; InfluxDB does things down to the nanosecond, I think?
Python Result: 1491030000
Influx Result: 1491030000000000000
If I try to enter a measurement into InfluxDB using the result Python gives me it ends up showing as:
1491030000 = 1970-01-01T00:00:01.49103Z
So I have to add on the extra nine 0's.
I suppose there are a few ways to do this programmatically within Python if it's as simple as adding on nine 0's to the result. But I would like to know why I can't seem to generate the same precision level in just one conversion.
I have a CSV file with tons of old timestamps that are simply, "4/1/17 2:00". Every day at 2 am there is a measurement.
I need to be able to convert that to the proper format that InfluxDB needs "1491030000000000000" to insert all these old measurements.
A better understanding of what is going on and why is more important than how to programmatically solve this in Python, although I would be grateful for responses that do both: explain the issue, what I am seeing, and why, as well as give ideas on how to take a CSV with one column of timestamps that appear as "4/1/17 2:00" and convert them to timestamps that appear as "1491030000000000000", either in a separate file or in a second column.
InfluxDB can be told to return epoch timestamps in second precision in order to work more easily with tools/libraries that do not support nanosecond precision out of the box, like Python.
Set epoch=s in query parameters to enable this.
See influx HTTP API timestamp format documentation.
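For example, a hedged sketch against the InfluxDB 1.x HTTP API using the requests library (the host, database, and measurement names are placeholders for your setup):
import requests

# Ask InfluxDB for second-precision epoch timestamps via epoch=s
resp = requests.get(
    "http://localhost:8086/query",
    params={"db": "mydb", "epoch": "s", "q": "SELECT * FROM my_measurement"},
)
print(resp.json())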
Something like this should work to solve your current problem. I didn't have a test csv to try this on, but it will likely work for you. It will take whatever csv file you put where "old.csv" is and create a second csv with the timestamp in nanoseconds.
import time
import datetime
import csv

def convertToNano(date):
    # "4/1/17 2:00" is month/day/year in the question's data
    parsed = datetime.datetime.strptime(date, "%m/%d/%y %H:%M")
    # mktime interprets the time in the machine's local timezone
    secondsTimestamp = time.mktime(parsed.timetuple())
    # Scale seconds up to the nanosecond precision InfluxDB stores
    return str(int(secondsTimestamp) * 10**9)

with open('old.csv', newline='') as old_csv:
    csv_reader = csv.reader(old_csv)
    with open('new.csv', 'w', newline='') as new_csv:
        csv_writer = csv.writer(new_csv)
        for i, row in enumerate(csv_reader):
            if i != 0:
                # Put whatever column the date appears in here
                row.append(convertToNano(row[<location of date in the row>]))
            csv_writer.writerow(row)
As to why this is happening: after reading this, it seems you aren't the only one frustrated by the issue. InfluxDB simply happens to use a different (nanosecond) precision than most Python modules, which work in seconds, so unfortunately I didn't see any way around it other than scaling the seconds value up to nanoseconds yourself.
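One more hedged note: time.mktime interprets the parsed time in the machine's local timezone, so the script above only reproduces InfluxDB's value if it runs in US Central time like your data. A sketch that pins the timezone explicitly (zoneinfo requires Python 3.9+):
import datetime
from zoneinfo import ZoneInfo

# Interpret "4/1/17 2:00" as US Central time regardless of the local machine
local_dt = datetime.datetime.strptime("4/1/17 2:00", "%m/%d/%y %H:%M")
local_dt = local_dt.replace(tzinfo=ZoneInfo("America/Chicago"))
nanos = int(local_dt.timestamp()) * 10**9
print(nanos)  # 1491030000000000000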
I am trying to figure out the best way to create a list of timestamps in Python, where the values for the items in the list increment by one minute. The timestamps would be by minute, and would cover the previous 24 hours. I need to create timestamps in the format "MM/dd/yyyy HH:mm:ss", or at least containing all of those measures. The timestamps will be an axis for a graph of data that I am collecting.
Calculating the times alone isn't too bad, as I could just get the current time, convert it to seconds, and change the value by one minute very easily. However, I am kind of stuck on figuring out the date aspect of it without having to do a lot of checking, which doesn't feel very Pythonic.
Is there an easier way to do this? For example, in JavaScript, you can get a Date() object, and simply subtract one minute from the value and JS will take care of figuring out if any of the other fields need to change and how they need to change.
datetime is the way to go; you might want to check out this blog.
import datetime
now = datetime.datetime.now()
print(now)
print(now.ctime())
print(now.isoformat())
print(now.strftime("%Y%m%dT%H%M%S"))
This would output
2003-08-05 21:36:11.590000
Tue Aug 5 21:36:11 2003
2003-08-05T21:36:11.590000
20030805T213611
You can also do subtraction with datetime and timedelta objects
now = datetime.datetime.now()
minute = datetime.timedelta(seconds=60)
print(now - minute)
would output
2015-07-06 10:12:02.349574
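Putting it together for your actual task, a minimal sketch (assuming you want the previous 24 hours, oldest first, formatted as MM/dd/yyyy HH:mm:ss):
import datetime

now = datetime.datetime.now()
minute = datetime.timedelta(seconds=60)
# One timestamp per minute for the previous 24 hours, ending one minute ago
timestamps = [(now - i * minute).strftime("%m/%d/%Y %H:%M:%S")
              for i in range(24 * 60, 0, -1)]
print(timestamps[0], "...", timestamps[-1])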
You are looking for datetime and timedelta objects. See the docs.
I have a CSV file containing the personnel's clock-in/clock-out records, in this form:
3,23/02/2015,08:27,08:27,12:29,13:52,19:48
3,24/02/2015,08:17,12:36,13:59,19:28
5,23/02/2015,10:53,13:44
5,25/02/2015,09:05,12:34,12:35,13:30,19:08
5,26/02/2015,08:51,12:20,13:46,18:47,18:58
and I want to clean it, in this way:
ID, DATE, IN,BREAK_OUT, BREAK_IN, OUT, WORK_TIME
3,Monday 23/02/2015,08:27,12:29,13:52,19:48,08:00hours
3,Tuesday 24/02/2015,08:17,12:36,13:59,19:28,08:00hours
5,Monday 23/02/2015,10:53,NAN,13:44,NAN,2houres
5,Wednesday 25/02/2015,09:05,12:34,13:30,19:08,08hours
Can you help me please?
Thank you.
I'd suggest you use pandas to import the data from the file
import pandas as pd
pd.read_csv(filepath, sep=',')
should do the trick, assuming filepath leads to your CSV. I'd then suggest that you use the datetime functions to convert your strings into dates you can calculate with (I think you could also use numpy's datetime64 types; I'm just not used to them).
import datetime as dt
day = dt.datetime.strptime('23/02/2015', '%d/%m/%Y')
time_in = dt.datetime.combine(day, dt.datetime.strptime('08:27', '%H:%M').time())
should do the trick. (Note the variable can't be named in, as that's a reserved keyword in Python.) It is necessary that your clock-in value is a full datetime object, not only a time object; otherwise you cannot subtract the two, which would be the necessary next step to calculate the work time.
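As a hedged sketch of that subtraction step (the helper name and the sample times are just illustrations taken from your first row):
import datetime as dt

day = dt.datetime.strptime('23/02/2015', '%d/%m/%Y')

def at(hhmm):
    # Combine the day with an HH:MM string into a full datetime
    return dt.datetime.combine(day, dt.datetime.strptime(hhmm, '%H:%M').time())

# IN, BREAK_OUT, BREAK_IN, OUT from the row for ID 3 on 23/02/2015
time_in, break_out, break_in, time_out = at('08:27'), at('12:29'), at('13:52'), at('19:48')
work_time = (time_out - time_in) - (break_in - break_out)
print(work_time)  # 9:58:00 (11:21 at work minus the 1:23 break)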
I think this should be enough to get you started. You'll find the pandas documentation here and the datetime documentation here.
If you have further questions, try to make them more specific.
This question might help you out: How to split string into column
First, read the whole file and split the columns. Check if there's data or not and write it back into a new file.
If you need additional help, tell us what you tried, what worked for you and what didn't and so on. We won't write a complete program/script for you.
To keep track of when my files were backed up, I want the filename of each backup to be the datetime of when it was created. The files will eventually be sorted and retrieved using Python, so I can get the most recent file based on its datetime filename.
The problem is that the default datetime format can't be saved as a filename (for example, colons aren't allowed in Windows filenames):
2007-12-31 22:29:59
It can for example be saved like this:
2007-12-31 22-29-59
What is the best way to format the datetime so that I can easily sort by datetime on the name? And, for bonus points, what is the Python code to display the datetime in that way?
You should have a look at the documentation of the Python time module: http://docs.python.org/2/library/time.html#module-time
If you go to the strftime() function, you will see that it accepts a string as input, which describes the format of the string you want to get as the return value.
Example (with hyphens between each date/time token):
>>> import time
>>> s = time.strftime('%Y-%m-%d-%H-%M-%S')
>>> print(s)
2012-12-08-14-55-44
The documentation contains a complete table of directives you can use to get different tokens.
What is the best way to format the datetime so that I can easily sort by datetime?
If you want to sort files according to datetime names, note that a most-significant-first representation of a datetime (e.g. YYYYMMDDhhmmss, from year down to seconds) makes lexicographical order coincide with chronological order.
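For instance (a hypothetical sketch with made-up backup filenames following the hyphenated scheme above), a plain string sort is then already a chronological sort:
# Hypothetical backup filenames using the YYYY-MM-DD-HH-MM-SS scheme
backups = [
    "backup-2012-12-08-14-55-44.tar",
    "backup-2007-12-31-22-29-59.tar",
    "backup-2012-01-05-09-00-00.tar",
]
print(sorted(backups)[-1])  # most recent: backup-2012-12-08-14-55-44.tar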