How to parse a Timestamp() with Python? - python

I am iterating over a dictionary that contains data from a SQL database, and I want to count the number of times that user values appear between initial_date and ending_date. However, I am having some problems when I try to parse the Timestamp values. This is the code I have:
initial_date = datetime(2017, 9, 1, 0, 0, 0)
ending_date = datetime(2017, 9, 30, 0, 0, 0)
This is a sample of the dictionaries I got:
sample = [
    {'id': 100008222, 'sector name': 'BONGOX', 'site name': 'BONGO', 'region': 'EMEA',
     'open date': Timestamp('2017-09-11 00:00:00'), 'mtti': '16', 'mttr': '1', 'mttc': '2', 'user': 'John D.'},
    {'id': 100008234, 'sector name': 'BONGOY', 'site name': 'BONGO', 'region': 'EMEA',
     'open date': Timestamp('2017-09-09 12:05:00'), 'mtti': '1', 'mttr': '14', 'mttc': '7', 'user': 'John D.'},
    {'id': 101108234, 'sector name': 'BONGOA', 'site name': 'BONGO', 'region': 'EMEA',
     'open date': Timestamp('2017-09-01 10:00:00'), 'mtti': '1', 'mttr': '12', 'mttc': '1', 'user': 'John C.'},
    {'id': 101108254, 'sector name': 'BONGOB', 'site name': 'BONGO', 'region': 'EMEA',
     'open date': Timestamp('2017-09-02 20:00:00'), 'mtti': '2', 'mttr': '19', 'mttc': '73', 'user': 'John C.'},
]
This is the code that I use to count the number of times user values appear between initial_date and ending_date
from datetime import time, datetime
from collections import Counter
#This approach does not work
Counter([li['user'] for li in sample if initial_date < dateutil.parser.parse(time.strptime(str(li.get(
'open date'),"%Y-%m-%d %H:%M:%S") < ending_date])
The code above does not work; it raises the error decoding to str: need a bytes-like object, Timestamp found.
I have two questions:
How can I parse this Timestamp value that I encountered in these dictionaries?
I read in this post, Why Counter is slow, that collections.Counter is slow compared to other approaches for counting the number of times an item appears. If I want to avoid collections.Counter, how can I achieve my desired result of counting the number of times user values appear between these dates?

Use Timestamp.to_pydatetime() to convert to a datetime object.
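A minimal sketch of that conversion, reusing the question's September 2017 bounds and one Timestamp from the sample data:

```python
from datetime import datetime
from pandas import Timestamp

initial_date = datetime(2017, 9, 1)
ending_date = datetime(2017, 9, 30)

ts = Timestamp('2017-09-11 00:00:00')
dt = ts.to_pydatetime()  # convert pandas Timestamp -> datetime.datetime

print(initial_date < dt < ending_date)  # True
```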

Question: How can I parse this Timestamp value that I encountered in these dictionaries?
Using the Timestamp class from pandas:
from pandas import Timestamp
Using Counter()
# Initialize a Counter() object
c = Counter()

# Iterate data
for s in sample:
    # Get a datetime from the Timestamp
    dt = s['open date'].to_pydatetime()
    # Compare with ending_date
    if dt < ending_date:
        print('{} < {}'.format(dt, ending_date))
        # Increment the Counter key=s['user']
        c[s['user']] += 1

print(c)
Output:
2017-09-11 00:00:00 < 2017-09-30 00:00:00
2017-09-09 12:05:00 < 2017-09-30 00:00:00
2017-09-01 10:00:00 < 2017-09-30 00:00:00
2017-09-02 20:00:00 < 2017-09-30 00:00:00
Counter({'John C.': 2, 'John D.': 2})
Question: If I want to avoid using collections.Counter, how can I achieve my desired result of counting?
Without Counter()
# Initialize a dict object
c = {}

# Iterate data
for s in sample:
    # Get a datetime from the Timestamp
    dt = s['open date'].to_pydatetime()
    # Compare with ending_date
    if dt < ending_date:
        # Add key=s['user'] to the dict if it does not exist
        c.setdefault(s['user'], 0)
        # Increment the dict key=s['user']
        c[s['user']] += 1

print(c)
Output:
{'John D.': 2, 'John C.': 2}
Tested with Python: 3.4.2
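A further dict-based variant is collections.defaultdict(int), which removes the need for setdefault because missing keys start at 0. Sketched here with a shortened two-row version of the question's sample:

```python
from collections import defaultdict
from datetime import datetime
from pandas import Timestamp

ending_date = datetime(2017, 9, 30)

# Two rows, shortened from the question's sample data
sample = [
    {'open date': Timestamp('2017-09-11 00:00:00'), 'user': 'John D.'},
    {'open date': Timestamp('2017-09-02 20:00:00'), 'user': 'John C.'},
]

c = defaultdict(int)  # missing keys default to 0
for s in sample:
    if s['open date'].to_pydatetime() < ending_date:
        c[s['user']] += 1

print(dict(c))  # {'John D.': 1, 'John C.': 1}
```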

Related

Python: API request nested dictionaries to dataframe with datetime indexed values

I run a query on python to get hourly price data from an API, using the get function:
result = (requests.get(url_prices, headers=headers, params={'SpotKey':'1','Fields':'hours','FromDate':'2016-05-05','ToDate':'2016-12-05','Currency':'eur','SortType':'ascending'}).json())
where 'SpotKey' identifies the item I want to retrieve from the API; in this example '1' is the hourly price timeseries (the other parameters are self-explanatory).
The result from the query is:
{'SpotKey': '1',
'SpotName': 'APX',
'Denomination': 'eur/mwh',
'Elements': [{'Date': '2016-05-05T00:00:00.0000000',
'TimeSpans': [{'TimeSpan': '00:00-01:00', 'Value': 23.69},
{'TimeSpan': '01:00-02:00', 'Value': 21.86},
{'TimeSpan': '02:00-03:00', 'Value': 21.26},
{'TimeSpan': '03:00-04:00', 'Value': 20.26},
{'TimeSpan': '04:00-05:00', 'Value': 19.79},
{'TimeSpan': '05:00-06:00', 'Value': 19.79},
...
{'TimeSpan': '19:00-20:00', 'Value': 57.52},
{'TimeSpan': '20:00-21:00', 'Value': 49.4},
{'TimeSpan': '21:00-22:00', 'Value': 42.23},
{'TimeSpan': '22:00-23:00', 'Value': 34.99},
{'TimeSpan': '23:00-24:00', 'Value': 33.51}]}]}
where 'Elements' is the relevant list containing the timeseries, structured as nested dictionaries with 'Date' and 'TimeSpans' keys.
Each 'TimeSpans' key contains further nested dictionaries, one for each hour of the day, with a 'TimeSpan' key for the hour and a 'Value' key for the price.
I would like to transform it to a dataframe like:
Datetime eur/mwh
2016-05-05 00:00:00 23.69
2016-05-05 01:00:00 21.86
2016-05-05 02:00:00 21.26
2016-05-05 03:00:00 20.26
2016-05-05 04:00:00 19.79
... ...
2016-12-05 19:00:00 57.52
2016-12-05 20:00:00 49.40
2016-12-05 21:00:00 42.23
2016-12-05 22:00:00 34.99
2016-12-05 23:00:00 33.51
For the time being I managed to do so doing:
df = pd.concat([pd.DataFrame(x) for x in result['Elements']])
df['Date'] = pd.to_datetime(df['Date'] + ' ' + [x['TimeSpan'][:5] for x in df['TimeSpans']], errors='coerce')
df[result['Denomination']] = [x['Value'] for x in df['TimeSpans']]
df = df.set_index(df['Date'], drop=True).drop(columns=['Date','TimeSpans'])
df = df[~df.index.isnull()]
I did so because daylight-saving-time changes replace the 'TimeSpan' hourly values with a 'dts' string, giving ParseDate errors when creating the datetime index.
Since I will request data very frequently and potentially for different granularities (e.g. half-hourly), is there a better / quicker / standard way to shape so many nested dictionaries into a dataframe with the format I look for, that allows to avoid the parsing date error for daylight-saving-time changes?
thank you in advance, cheers.
You did not give examples of the 'dts' values, so I cannot verify. But in principle, treating Date as a timestamp and TimeSpan as a timedelta should give you both the ability to ignore granularity changes and potentially include additional 'dts' parsing.
def parse_time(x):
    if "dts" not in x:
        return x[:5] + ":00"
    return f"{int(x[:2]) + 1}{x[2:5]}:00"  # TODO: actually parse, handle hour overflow etc.

df = pd.DataFrame(result['Elements']).set_index("Date")
d2 = df.TimeSpans.explode().apply(pd.Series)
d2['Datetime'] = pd.to_datetime(d2.index) + pd.to_timedelta(d2.TimeSpan.apply(parse_time))
pd.DataFrame(d2.set_index(d2.Datetime).Value).rename(columns={"Value": "eur/mwh"})
gives the desired dataframe.
this should work:
df = pd.DataFrame()
cols = ['Datetime', 'eur/mwh']

# concat days together into one df
for day in result['Elements']:
    # chunk represents a day's worth of data to concat
    chunk = []
    date = pd.to_datetime(day['Date'])
    for pair in day['TimeSpans']:
        # the hour offset is just the first 2 characters of TimeSpan
        offset = pd.DateOffset(hours=int(pair['TimeSpan'][:2]))
        value = pair['Value']
        chunk.append([(date + offset), value])
    # concat the day-chunk to df
    df = pd.concat([df, pd.DataFrame(chunk, columns=cols)])
The only thing I'm not 100% sure of is the pd.to_datetime(), but if it doesn't work you just need to pass a format argument to it.
hope it helps :)

JSON to Pandas Dataframe types change

I have JSON output from m3inference package in python like this:
{'input': {'description': 'Bundeskanzlerin',
'id': '2631881902',
'img_path': '/root/m3/cache/angelamerkeicdu_224x224.jpg',
'lang': 'de',
'name': 'Angela Merkel',
'screen_name': 'angelamerkeicdu'},
'output': {'age': {'19-29': 0.0,
'30-39': 0.0001,
'<=18': 0.0001,
'>=40': 0.9998},
'gender': {'female': 0.9991, 'male': 0.0009},
'org': {'is-org': 0.0032, 'non-org': 0.9968}}}
I store it in:
org = pd.DataFrame.from_dict(json_normalize(org['output']), orient='columns')
gender.male gender.female age.<=18 ... age.>=40 org.non-org org.is-org
0 0.0009 0.9991 0.0000 ... 0.9998 0.9968 0.0032
I don't know where the 0 value in the first column is coming from. I save the org.is-org column to isorg:
isorg = org['org.is-org']
but when I append it to a pandas dataframe the dtype is object, and the value changes to
0    0.0032
Name: org.is-org, dtype: float64
not 0.0032.
How to fix this?
"i dont know where 0 value in first column coming from then i save org.isorg column to isorg"
That "0" is an index to your dataframe. Unless you specify your dataframe index, pandas will auto create the index. You can change you index instead.
code example:
org.set_index('gender.male', inplace=True)
Index is like an address to your data. It is how any data point across the dataframe or series can be accessed.
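A small sketch of that point, using a hypothetical one-row frame shaped like the question's normalized output: the leading 0 is just the auto-created RangeIndex, and .iloc[0] pulls out the bare float if that is what you actually want:

```python
import pandas as pd

# Hypothetical one-row frame mirroring the question's json_normalize output
org = pd.DataFrame({'org.is-org': [0.0032], 'org.non-org': [0.9968]})

isorg = org['org.is-org']  # a Series; the leading 0 is its auto-created index
scalar = isorg.iloc[0]     # extract the bare float instead
print(scalar)  # 0.0032
```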

How to create a dict of dicts from pandas dataframe?

I have a dataframe df
id price date zipcode
u734 8923944 2017-01-05 AERIU87
uh72 9084582 2017-07-28 BJDHEU3
u029 299433 2017-09-31 038ZJKE
I want to create a dictionary with the following structure
{'id': xxx, 'data': {'price': xxx, 'date': xxx, 'zipcode': xxx}}
What I have done so far
ids = df['id']
prices = df['price']
dates = df['date']
zips = df['zipcode']
d = {'id':idx, 'data':{'price':p, 'date':d, 'zipcode':z} for idx,p,d,z in zip(ids,prices,dates,zips)}
>>> SyntaxError: invalid syntax
but I get the error above.
What would be the correct way to do this, using either
list comprehension
OR
pandas .to_dict()
bonus points: what is the complexity of the algorithm, and is there a more efficient way to do this?
I'd suggest the list comprehension.
v = df.pop('id')
data = [
{'id' : i, 'data' : j}
for i, j in zip(v, df.to_dict(orient='records'))
]
Or a compact version,
data = [dict(id=i, data=j) for i, j in zip(df.pop('id'), df.to_dict(orient='r'))]
Note that, if you're popping id inside the expression, it has to be the first argument to zip.
print(data)
[{'data': {'date': '2017-09-31',
'price': 299433,
'zipcode': '038ZJKE'},
'id': 'u029'},
{'data': {'date': '2017-01-05',
'price': 8923944,
'zipcode': 'AERIU87'},
'id': 'u734'},
{'data': {'date': '2017-07-28',
'price': 9084582,
'zipcode': 'BJDHEU3'},
'id': 'uh72'}]

Separate pd DataFrame Rows that are dictionaries into columns

I am extracting some data from an API and having challenges transforming it into a proper dataframe.
The resulting DataFrame df is arranged as such:
Index Column
0 {'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]}
1 {'different-email#email.com': [{'action': 'data', 'date': 'date'}]}
I am trying to split the emails into one column and the list into a separate column:
Index Column1 Column2
0 email#email.com [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]}
Ideally, each 'action'/'date' would have its own separate row; however, I believe I can do the further unpacking myself.
After looking around I tried/failed lots of solutions such as:
df.apply(pd.Series) # does nothing
pd.DataFrame(df['column'].values.tolist()) # makes each dictionary key as a separate colum
where most of the rows are NaN except one which has the pair value
Edit:
As many of the questions asked the initial format of the data in the API, it's a list of dictionaries:
[{'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]},{'different-email#email.com': [{'action': 'data', 'date': 'date'}]}]
Thanks
One naive way of doing this is as below:
inp = [{'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]},
       {'different-email#email.com': [{'action': 'data', 'date': 'date'}]}]

index = 0
df = pd.DataFrame()
for each in inp:  # iterate through the list of dicts
    for k, v in each.items():  # take each key-value pair
        for eachv in v:  # the value being a list, iterate through each element
            print(str(eachv))
            # DataFrame.set_value was removed in pandas 1.0; .at is the replacement
            df.at[index, 'Column1'] = k
            df.at[index, 'Column2'] = str(eachv)
            index += 1
I am sure there might be a better way of writing this. Hope this helps :)
Assuming you have already read it in as a dataframe, you can use the following:
import ast

df['Column'] = df['Column'].apply(ast.literal_eval)
# dict.keys()/.values() are views in Python 3, so wrap them in list() before indexing
df['email'] = df['Column'].apply(lambda x: list(x.keys())[0])
df['value'] = df['Column'].apply(lambda x: list(x.values())[0])
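If the goal is just the two-column frame shown in the question, a sketch that skips literal_eval entirely (assuming the data is already the list of single-key dicts given in the edit, not strings) is:

```python
import pandas as pd

# The question's raw API payload: a list of single-key dicts
inp = [{'email#email.com': [{'action': 'data', 'date': 'date'},
                            {'action': 'data', 'date': 'date'}]},
       {'different-email#email.com': [{'action': 'data', 'date': 'date'}]}]

# Each dict has exactly one (email, list) pair; next(iter(...)) grabs it
rows = [next(iter(d.items())) for d in inp]
df = pd.DataFrame(rows, columns=['Column1', 'Column2'])
print(df)
```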

how to take the specific details out in Python that are separated by a semi colon or a slash?

I have the following results from a vet analyser
result{type:PT/APTT;error:0;PT:32.3 s;INR:0.0;APTT:119.2;code:470433200;lot:405
4H0401;date:20/01/2017 06:47;PID:TREKKER20;index:015;C1:-0.1;C2:-0.1;qclock:0;ta
rget:2;name:;Sex:;BirthDate:;operatorID:;SN:024000G0900046;version:V2.8.0.09}
Using Python, how do I separate the date, the time, and the type (PT and APTT)? Please note that the results will be different every time, so I need to write code that finds the date using the / and the time from its four digits and the : ... do I use a for loop?
This code makes further usage of fields easier by converting them to dict.
from pprint import pprint
result = "result{type:PT/APTT;error:0;PT:32.3 s;INR:0.0;APTT:119.2;code:470433200;lot:405 4H0401;date:20/01/2017 06:47;PID:TREKKER20;index:015;C1:-0.1;C2:-0.1;qclock:0;ta rget:2;name:;Sex:;BirthDate:;operatorID:;SN:024000G0900046;version:V2.8.0.09}"
if result.startswith("result{") and result.endswith("}"):
    result = result[(result.index("{") + 1):result.index("}")]
# else:
#     raise ValueError("Invalid data '" + result + "'")

# Separate fields
fields = result.split(";")

# Separate field names and values
# The first part is surely the field name, but any additional ":" must not be split,
# e.g. "date:dd/mm/yyyy HH:MM" -> "date": "dd/mm/yyyy HH:MM"
fields = [field.split(":", 1) for field in fields]
fields = {field[0]: field[1] for field in fields}

a = fields['type'].split("/")

print(fields)
pprint(fields)
print(a)
The result:
{'type': 'PT/APTT', 'error': '0', 'PT': '32.3 s', 'INR': '0.0', 'APTT': '119.2', 'code': '470433200', 'lot': '405 4H0401', 'date': '20/01/2017 06:47', 'PID': 'TREKKER20', 'index': '015', 'C1': '-0.1', 'C2': '-0.1', 'qclock': '0', 'ta rget': '2', 'name': '', 'Sex': '', 'BirthDate': '', 'operatorID': '', 'SN': '024000G0900046', 'version': 'V2.8.0.09'}
{'APTT': '119.2',
'BirthDate': '',
'C1': '-0.1',
'C2': '-0.1',
'INR': '0.0',
'PID': 'TREKKER20',
'PT': '32.3 s',
'SN': '024000G0900046',
'Sex': '',
'code': '470433200',
'date': '20/01/2017 06:47',
'error': '0',
'index': '015',
'lot': '405 4H0401',
'name': '',
'operatorID': '',
'qclock': '0',
'ta rget': '2',
'type': 'PT/APTT',
'version': 'V2.8.0.09'}
['PT', 'APTT']
Note that dictionaries are not sorted (they don't need to be in most cases as you access the fields by the keys).
If you want to split the results by semicolon:
result_array = result.split(';')
In result_array you'll get all the strings separated by semicolons; you can then access the date with result_array[index].
That's quite a bad format to store data as fields might have colons in their values, but if you have to - you can strip away the surrounding result, split the rest on a semicolon, then do a single split on a colon to get dict key-value pairs and then just build a dict from that, e.g.:
data = "result{type:PT/APTT;error:0;PT:32.3 s;INR:0.0;APTT:119.2;code:470433200;lot:405 " \
"4H0401;date:20/01/2017 06:47;PID:TREKKER20;index:015;C1:-0.1;C2:-0.1;qclock:0;ta " \
"rget:2;name:;Sex:;BirthDate:;operatorID:;SN:024000G0900046;version:V2.8.0.09}"
parsed = dict(e.split(":", 1) for e in data[7:-1].split(";"))
print(parsed["APTT"]) # 119.2
print(parsed["PT"]) # 32.3 s
print(parsed["date"]) # 20/01/2017 06:47
If you need to further separate the date field to date and time, you can just do date, time = parsed["date"].split(), although if you're going to manipulate the object I'd suggest you to use the datetime module and parse it e.g.:
import datetime
date = datetime.datetime.strptime(parsed["date"], "%d/%m/%Y %H:%M")
print(date) # 2017-01-20 06:47:00
print(date.year) # 2017
print(date.hour) # 6
# etc.
To go straight to the point and get your type, PT, APTT, date and time, use re:
import re
from source import result_gen

result = result_gen()

def from_result(*vars):
    regex = re.compile('|'.join([f'{re.escape(var)}:.*?;' for var in vars]))
    matches = dict(f.group().split(':', 1) for f in re.finditer(regex, result))
    return tuple(matches[v][:-1] for v in vars)

type, PT, APTT, datetime = from_result('type', 'PT', 'APTT', 'date')
date, time = datetime.split()
Notice that this can easily be extended if you suddenly become interested in some other 'var' in the string...
In short you can optimize this further (to avoid the split step) by capturing groups in the regex search...
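A rough sketch of that capturing-groups idea, hardcoding a shortened copy of the analyser string since result_gen is not shown here: named groups extract all the fields in one search, and the date pattern splits date and time directly:

```python
import re

# Shortened copy of the analyser output from the question
result = ("result{type:PT/APTT;error:0;PT:32.3 s;INR:0.0;APTT:119.2;"
          "date:20/01/2017 06:47;SN:024000G0900046;version:V2.8.0.09}")

# One pass with named groups; lazy .*? skips the fields in between
pattern = re.compile(
    r"type:(?P<type>[^;]*);.*?"
    r"PT:(?P<PT>[^;]*);.*?"
    r"APTT:(?P<APTT>[^;]*);.*?"
    r"date:(?P<date>\S+) (?P<time>[^;]*);"
)
m = pattern.search(result)
print(m.group('type'))                   # PT/APTT
print(m.group('PT'), m.group('APTT'))    # 32.3 s 119.2
print(m.group('date'), m.group('time'))  # 20/01/2017 06:47
```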
