Validating user access logs with pandas DataFrames - python

I am not very proficient in pandas, but I have been using it for various projects for the last year or so. I like it so far, but I have not really gotten a firm grip on it, so I would love some help. I have tried googling for days, but I'm approaching the point where I just want to use pandas as an iterator, which seems like a waste. I feel I might just be missing some basic terminology and don't know what to search for, but I am getting fed up with reading and searching.
What I am working on right now requires me to check some logs for valid access by comparing IDs and dates of access against something like a user registry. I'm using Python and pandas because they are the tools I feel most comfortable with, but I am open to suggestions for other approaches. The registry is parsed from a few Excel sheets managed by someone else, and the logs are nicely ordered CSVs.
Loading these into two DataFrames, I want to check each log entry for validity. One DataFrame acts as the registry, providing user IDs, a creation date and an end date; the other contains the logs as rows, with a user ID and a timestamp:
Registry
   created     end         ID
1  2018-09-04  NaT         66f56efc5cc6353ba
2  2018-10-09  2018-11-09  167a2c65133d9f4a3
3  2018-10-09  2018-11-09  f0efc501e52e7b1e1
Logs
                 Timestamp             ID
0  2019-08-01 00:01:48.027  4459eeab695a2
1  2019-08-01 00:06:03.981  e500df5f2c2ed
2  2019-08-01 00:06:36.100  e500df5f2c2ed
I want to check each log entry against my registry to see if access was permitted when it occurred. I have written a function that checks an ID and date against my registry, but I need to figure out how to apply the check to the whole log DataFrame:
def validate(userid, date):  # e.g. 'wayf-1234', datetime.date(2019, 11, 23)
    # keep registry rows that have a creation date and whose end date is not before the given date
    df_target = df_registry[df_registry['created'].notnull() & ~(df_registry['end'] < date)]
    return (df_target.values == userid).any()
My first inclination was to use the function directly like a row selector (not sure what to call it), but it doesn't work:
df_logs[validate(df_logs['id'], df_logs['Timestamp']) == True]
I am pretty sure it would be incredibly inefficient to build a DataFrame for every row just to check a specific date, but I'm just hacking around trying to make something work, and inefficient is fine for now. I would really love to know if someone has any input or perspectives on how to approach this.
Should I just iterate through the rows of the dataframe and apply my logic for each line (which seems to work counter to how pandas is supposed to be used) or is there a smarter way to go about it?
Thanks.

merge_asof is the tool here. It allows you to merge exactly on a list of columns (with by) and then, on another column, to find the highest value that is still at or below the value in the other DataFrame.
Here you could use:
tmp = pd.merge_asof(logs, registry, left_on=['Timestamp'],
                    right_on=['created'], by='ID')
For each row of logs, you get the row from registry with the same ID and the created date immediately at or below the Timestamp. An access is then valid if end is NaT or greater than the Timestamp (after adding one day...):
tmp['valid'] = tmp.end.isna()|(tmp.end + pd.Timedelta('1D') > tmp.Timestamp)
Beware: this only works if date columns are true pd.Timestamp (dtype datetime64)...
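To make that concrete, here is a minimal, self-contained sketch of the whole approach with toy data shaped like the frames above; note that merge_asof needs both frames sorted on their keys, and I have added a created.notna() guard (going slightly beyond the one-liner above) so that log rows with no matching registry entry at all do not come out as valid:

import pandas as pd

registry = pd.DataFrame({
    'ID': ['66f56efc5cc6353ba', '167a2c65133d9f4a3', 'f0efc501e52e7b1e1'],
    'created': pd.to_datetime(['2018-09-04', '2018-10-09', '2018-10-09']),
    'end': pd.to_datetime([None, '2018-11-09', '2018-11-09']),
})
logs = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2019-08-01 00:01:48.027', '2018-10-15 12:00:00']),
    'ID': ['4459eeab695a2', '167a2c65133d9f4a3'],
})

# merge_asof requires both frames to be sorted on their respective keys
logs = logs.sort_values('Timestamp')
registry = registry.sort_values('created')

# for each log row, pick the registry row with the same ID whose
# 'created' date is the latest one at or before the Timestamp
tmp = pd.merge_asof(logs, registry, left_on='Timestamp',
                    right_on='created', by='ID')

# valid if a registry row was found and the access falls before 'end'
# (NaT in 'end' means the account has no end date)
tmp['valid'] = tmp['created'].notna() & (
    tmp['end'].isna() | (tmp['end'] + pd.Timedelta('1D') > tmp['Timestamp'])
)
print(tmp[['ID', 'Timestamp', 'valid']])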

Related

Resample().mean() in Python/Pandas and adding the results to my dataframe when the starting point is missing

I'm pretty new to coding and have a problem resampling my dataframe with pandas. I need to resample my data ("value") to means for every 10 minutes (13:30, 13:40, etc.). The problem is that the data start around 13:36, and I can't fix this by hand because I need to do it for 143 dataframes. Resampling adds the mean at the respective index (e.g. 13:40 for the second value), but because 13:30 is not part of my indices, that value gets lost.
I'm trying two different approaches here: first, I tried every option of resample() (offset, origin, convention, ...). Then I tried adding the missing values manually with a loop, which doesn't run properly because I don't know how to access the correct spot in the list. The list does include all relevant values, though. I also tried adding a row with 13:30 as the index on top of the dataframe, but I didn't manage to convince Python that my index is legit because it's a timestamp (this part is not in the code).
Sorry for the very rough code; it just didn't work in several places, which is why I'm asking here.
If you have a possible solution, please keep in mind that it has to function within an already long loop because of the many dataframes I have to work on simultaneously.
Thank you very much!
df["tenminavg"] = df["value"].resample("10Min").mean()
df["tenminavg"] = df["tenminavg"].ffill()
ls1 = df["value"].resample("10Min").mean() #my alternative: list the resampled values in order to eventually access the first relevant timespan
for i in df.index: #this loop doesn't work. It should add the value for the first 10 min
if df["tenminavg"][i]=="nan":
if datetime.time(13,30) <= df.index.time < datetime.time(13,40):
df["tenminavg"][i] = ls1.index.loc[i.floor("10Min")]["value"] #tried to access the corresponding data point in the list
else:
continue
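One way to keep the 13:30 bin is to compute the resampled means separately and look each row's bin up via the index floored to 10 minutes, instead of assigning the resampled series straight back (which aligns on the original timestamps, so the 13:30 label gets dropped). A minimal sketch with toy data shaped like the question's (a DatetimeIndex starting at 13:36 and a "value" column); inside the real loop the existing df would simply be reused:

import numpy as np
import pandas as pd

# toy stand-in for one of the 143 dataframes: data starting at 13:36
idx = pd.date_range("2021-06-01 13:36", periods=30, freq="1Min")
df = pd.DataFrame({"value": np.arange(30.0)}, index=idx)

# means for each 10-minute bin; bins are anchored to the start of the day,
# so the first label is 13:30 even though the data begin at 13:36
means = df["value"].resample("10Min").mean()

# look each row's bin label up in `means` instead of assigning the resampled
# series directly, so the 13:30 bin is not lost to index alignment
df["tenminavg"] = means.loc[df.index.floor("10Min")].to_numpy()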

Python / Bloomberg api - historical security data incl. weekends (sat. and sun.)

I'm currently working with the blpapi and trying to get a bdh of an index including weekends. (I'll later need to match this df with another date vector.)
I'm already using
con.bdh([Index],['PX_LAST'],'19910102', today.strftime('%Y%m%d'), [("periodicitySelection", "DAILY")])
but this will return only weekdays (Mon-Fri). I know how this works in Excel with the BBG function builder, but I'm not sure about the wording within the blpapi.
Since I'll always need the first of each month,
con.bdh([Index],['PX_LAST'],'19910102', today.strftime('%Y%m%d'), [("periodicitySelection", "MONTHLY")])
won't work either, because it will return the 28th, 30th, 31st and so on.
Can anyone help here? THX!
You can use a combination of:
"nonTradingDayFillOption", "ALL_CALENDAR_DAYS" # include all days
"nonTradingDayFillMethod", "PREVIOUS_VALUE" # fill non-trading days with previous value

Is there a way to get datetime ranges based on the value of another column in a pandas dataframe?

I'm sure the question was not descriptive enough, but here is what I'm looking for. I have 3 columns in my dataframe: two are datetimes and the other is a float that contains either a 1 or a 0. It is a status column; 1 means on and 0 means off, obviously. I need to find out whether I am able to get the different ranges of times when the status was 0 and when it was 1.
Can I do this with pandas, or do I need to try something else?
Sample data from dataframe
Dataframe name is uptime. Columns in order from left to right are time_utc, state, local_time. I'm really not concerned with time_utc, so you can disregard that.
This is my first question on here as I wasn't really even sure how to google this question. Please let me know if more information is required, and I will provide what I can. Thank you in advance for any response lol.
Edit:
In the table shown in the picture, you can see it was down from 04:54:27 until 05:01:21, when it came back up, and it was down again by 05:02:16. It then stayed down until 05:09:24, and from there it was back up until 05:11:50. I am just trying to write something that can pull those ranges out, and maybe store them in another dataframe.
Edit:
I am doing a terrible job of asking this question, I know, but hopefully this picture of example output will help.
If it's by time, here you go. I don't know the column names, so I'm just going to throw in a random one:
selected_df = df.loc[(df['time_utc'].dt.hour >= 6) & (df['time_local'].dt.hour <= 8) & (df['state'] == 1)]  # Get hour range and state
selected_df.to_csv('new.csv', index=False)  # Write to new csv
I was able to solve this with an incredibly complicated IF statement.
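For anyone landing on this later, a less hand-rolled alternative to a long IF statement is to label consecutive runs of the same state and take the first and last timestamp of each run. A minimal sketch, assuming the uptime dataframe with the state and local_time columns described above (column names taken from the question):

# give every run of consecutive identical states its own group id
uptime["run"] = (uptime["state"] != uptime["state"].shift()).cumsum()

# one row per run: the state plus the time range it covers
ranges = (uptime.groupby("run")
                .agg(state=("state", "first"),
                     start=("local_time", "first"),
                     end=("local_time", "last"))
                .reset_index(drop=True))
print(ranges)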

Order_By custom date in peewee for SQLite

I made a huge mistake building up a database, but it works perfectly except for one feature. Changing the program in all the places where it needs to be changed for that feature to work would be a titanic job of weeks, so let's hope this workaround is possible.
The issue: I've stored data in a SQLite database as "dd/mm/yyyy" TextField format instead of DateField.
The need: I need to sort by date on a union query, to get the last number of records in that union following my custom date format. They come from different tables, so I can't just use rowid or similar tricks to get the last ones; I need to do it by date, and I can't change the data already stored in the database because invoices have already been created with that format ("dd/mm/yyyy" is the default date format in my country).
This is the query that captures data:
records = []
limited_to = 25
union = (facturas | contado | albaranes | presupuestos)
for record in (union
               .select_from(union.c.idunica, union.c.fecha, union.c.codigo,
                            union.c.tipo, union.c.clienterazonsocial,
                            union.c.totalimporte, union.c.pagada,
                            union.c.contabilizar, union.c.concepto1,
                            union.c.cantidad1, union.c.precio1,
                            union.c.observaciones)
               .order_by(union.c.fecha.desc())  # TODO this is what I need to change.
               .limit(limited_to)
               .tuples()):
    records.append(record)
Now, to complicate things even more, the union is already built with a really complex where clause on each source before it's transformed into a union query.
So my only hope is: Is there a way to make order_by follow a custom date format instead?
To clarify, this is the simple transformation that I'd need the order_by clause to follow, because I assume SQLite wouldn't have issues sorting if this were the date format:
def reverse_date(date: str) -> str:
    """Reverse dd/mm/yyyy dates into yyyy-mm-dd."""
    dd, mm, yyyy = date.split("/")
    return f"{yyyy}-{mm}-{dd}"
Note: I've left a lot of code out because I think it's unnecessary. This is the minimum amount of code needed to understand the problem. Let me know if you need more data.
Update: Trying this workaround, it seems to work fine. It needs more testing, but it's promising. If someone ever faces the same issue, here you go:
.order_by(sqlfunc.Substr(union.c.fecha, 7)
          .concat('/')
          .concat(sqlfunc.Substr(union.c.fecha, 4, 2))
          .concat('/')
          .concat(sqlfunc.Substr(union.c.fecha, 1, 2))
          .desc())
Happy end of 2020!
As you pointed out, if you want the dates to sort properly, they need to be in yyyy-mm-dd format, which is the text format you should always use in SQLite (or some other format with the same year, month, day order).
You might be able to do a rearrangement here using re.sub:
.order_by(re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\2-\1',
                 union.c.fecha))
Here we are using regex to capture the year, month, and day components in separate capture groups. Then, we replace with these components in the correct order for sorting.
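As written, though, re.sub runs in Python and union.c.fecha is a peewee column expression rather than a string, so the substitution cannot be evaluated inside the query itself. One way to keep the same rearrangement idea is to register the dd/mm/yyyy to yyyy-mm-dd conversion as a SQLite user-defined function through peewee and order by that. A rough sketch, where the function name iso_date and the db handle are my own placeholders, and the @db.func() decorator (or its older register_function spelling) is assumed to be available in your peewee version:

from peewee import SqliteDatabase, fn

db = SqliteDatabase('invoices.db')  # placeholder: the same handle the models already use

@db.func('iso_date')
def iso_date(fecha):
    # turn 'dd/mm/yyyy' into 'yyyy-mm-dd' so plain string ordering is chronological
    dd, mm, yyyy = fecha.split('/')
    return f'{yyyy}-{mm}-{dd}'

# then, in the union query, sort on the registered function instead of the raw column:
#   .order_by(fn.iso_date(union.c.fecha).desc())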

Why does df.loc[] return the same row twice, when there is only one?

I am currently working with a few stock price CSV files, but there is some strange behaviour of the .loc[] command after using pd.read_csv to get the data into a df.
I am doing this with 10 files, but I wanted to do a lot more before I ran into this problem, which literally ONLY happens with ONE of the files...
I basically want to subset each df to only show the data between 9:30 and 16:00 and it is a simple operation that has always worked without an issue:
open = dt.time(hour= 9, minute= 30)
close = dt.time(hour= 15, minute= 59)
but when I call:
df.loc[open]
I get:
                       Open   High     Low   Close  Volume
Date
2017-12-29 09:30:00  119.46  119.6  119.42  119.57     480
2017-12-29 09:30:00  119.46  119.6  119.42  119.57     480
BUT there are no duplicates in the CSV, and when I print parts of the DataFrame or pause the debugger to inspect the df in memory, there are also no duplicates.
This happens with any time I choose to pass and with any column names I add to the .loc[] call.
BUT only with ONE of the dataframes.
This is also messing with other parts of my script; for instance, when I want to retrieve a value from a row and use it in a calculation, it throws an error because this weirdness returns a Series when it should simply return one value.
I have used .loc and DatetimeIndexes many times before but never encountered this.
I tried resetting the index, using different times, making copies of the dataframes; nothing seems to work, and it keeps pretending that every row exists twice (in this one particular dataframe), which is not the case...
Thank you to anyone who tries to help.
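No answer is recorded here, but the symptoms (every label lookup coming back as two rows, and scalar access returning a Series) are consistent with the DatetimeIndex itself containing duplicate labels in that one parsed dataframe, even if the raw CSV looks clean. A quick check along these lines might confirm or rule that out; df stands for the problematic dataframe:

# look for duplicated index labels (timestamps) rather than duplicated rows
print(df.index.has_duplicates)
print(df.index[df.index.duplicated()])

# if duplicates are confirmed, one option is to keep only the first occurrence
df = df[~df.index.duplicated(keep='first')]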
