I'm trying to aggregate this call center data in various ways in Python, for example the mean q_time by type and priority. This is fairly straightforward using df.groupby.
However, I would also like to be able to aggregate by call volume. The problem is that each line of the data represents a call, so I'm not sure how to do it. If I'm just grouping by date then I can just use 'count' as the aggregate function, but how would I aggregate by e.g. weekday, i.e. create a data frame like:
weekday mean_row_count
1 100
2 150
3 120
4 220
5 200
6 30
7 35
Is there a good way to do this? All I can think of is looping through each weekday, counting the number of unique dates for it, and then dividing that weekday's call count by the number of unique dates. But I think this could get messy, and it might really slow things down if I also need to group by other variables, or do it by date and hour of the day.
Since the date of each call is given, one idea is to implement a function that determines the day of the week from a given date. There are many ways to do this, such as Conway's Doomsday algorithm:
https://en.wikipedia.org/wiki/Doomsday_rule
You can then go through each line, determine the weekday, and add to the count for that weekday.
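In pandas the weekday lookup is already built in, so a rough sketch of this idea (assuming the call timestamps live in a datetime column named date; adjust names to your data) could be:

import pandas as pd

# count the calls on each calendar date, then average those counts per weekday (1=Monday ... 7=Sunday)
calls_per_date = df.groupby(df['date'].dt.date).size()
calls_per_date.index = pd.to_datetime(calls_per_date.index)
mean_row_count = (calls_per_date
                  .groupby(calls_per_date.index.dayofweek + 1)
                  .mean()
                  .rename_axis('weekday')
                  .reset_index(name='mean_row_count'))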
When I find myself thinking about how to aggregate and query data in a versatile way, I think the solution is probably a database. SQLite is a lightweight embedded database with good performance for simple use cases, and Python has native support for it.
My advice here is: create a database and a table for your data, optionally add ancillary tables depending on your needs, load the data into it, and use the interactive sqlite shell or Python scripts for your queries.
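A minimal sketch of that workflow with Python's built-in sqlite3 module (table and column names are made up for illustration; call_time is assumed to be stored as ISO text, and SQLite's strftime('%w', ...) returns 0 for Sunday through 6 for Saturday):

import sqlite3

con = sqlite3.connect("calls.db")
con.execute("CREATE TABLE IF NOT EXISTS calls (call_time TEXT, q_time REAL, type TEXT, priority TEXT)")
# ... load your call rows into the calls table here ...

# calls per weekday, averaged over the distinct dates on which that weekday occurred
query = """
    SELECT strftime('%w', call_time) AS weekday,
           COUNT(*) * 1.0 / COUNT(DISTINCT date(call_time)) AS mean_row_count
    FROM calls
    GROUP BY weekday
"""
for weekday, mean_row_count in con.execute(query):
    print(weekday, mean_row_count)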
Related
I have a dataframe in pyspark (and databricks) with the following schema structure:
orders schema:
submitted_at:timestamp
submitted_yyyy_mm:string (using the format "yyyy-MM")
order_id:string
customer_id:string
sales_rep_id:string
shipping_address_attention:string
shipping_address_address:string
shipping_address_city:string
shipping_address_state:string
shipping_address_zip:integer
ingest_file_name:string
ingested_at:timestamp
I need to capture the data in my table in Delta Lake format, with a partition for every month of the order history reflected in the submitted_yyyy_mm column. I am capturing the data correctly except for two problems. One, my technique adds two columns (and the corresponding data) to the schema (I could not figure out how to do the partitioning without adding columns). Two, the partitions correctly capture all the year/months with data, but are missing the year/months without data (the requirement is that those need to be included as well). Specifically, all the months of 2017-2019 should have their own partition (so 36 months). However, my technique only created partitions for the months that actually had orders (which turned out to be 18 of the 36 months of 2017-2019).
Here is the relevant part of my code:
from pyspark.sql import functions as F, types as T

# take the pristine orders table and add the two extra columns it should not have, just to get the partition structure
df_with_year_and_month = (df_orders
    .withColumn("year", F.year(F.col("submitted_yyyy_mm").cast(T.TimestampType())))
    .withColumn("month", F.month(F.col("submitted_yyyy_mm").cast(T.TimestampType()))))

# write the data to the orders table using the year/month partitioning
df_with_year_and_month.write.partitionBy("year", "month").mode("overwrite").format("delta").saveAsTable(orders_table)
I would be grateful to anyone who might be able to help me tweak my code to fix these two issues. Thank you.
There's no issue here. That's just how it works.
You want to partition on year and month, so those values have to be in your data; there is no way around it. You should also only partition on values you want to filter on, since that is what enables partition pruning and results in faster queries. It would make no sense to partition on a field whose value is not in the data.
It is also totally normal that partitions are not created when there is no data for them. Once data is added, the corresponding partition is created if it doesn't exist yet. You don't need it any sooner than that.
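If the extra year/month columns are the main concern, one possible sketch (not a drop-in fix, and it still only creates partitions for months that actually have data) is to partition directly on the existing submitted_yyyy_mm column:

# partition on the column that is already in the schema, so nothing extra is added
(df_orders
    .write
    .partitionBy("submitted_yyyy_mm")
    .mode("overwrite")
    .format("delta")
    .saveAsTable(orders_table))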
New to Python. I have a dataframe with a datetime column (essentially huge time series data). I basically want to divide this into multiple subsets, where each subset dataframe contains one week's worth of data (starting from the first timestamp). I have been trying this with groupby and Grouper, but it returns tuples which themselves don't contain a week's worth of data. In addition, the Grouper (erstwhile TimeGrouper) documentation isn't very clear on this.
This is what I tried. Any better ideas or approaches?
grouped = uema_label_format.groupby(pd.Grouper(key='HEADER_START_TIME', freq='W'))
If your dataset is really big, it could be worth externalising this work to a time-series database and then querying it to get each week you are interested in. The results can then be loaded into pandas, while the database handles the heavy lifting. For example, in QuestDB you could get the current week as follows:
select * from yourTable where timestamp = '2020-06-22;7d'
Although this returns the data for a single week, you could iterate over it to get the individual subsets quickly, since the results come back near-instantly. You can also easily change the sample interval after the fact, for example to monthly using 1M, and still get an immediate response.
You can try this here, using the following query as an example to get one week's worth of data (roughly 3M rows) out of a 1.6-billion-row NYC taxi dataset:
select * from trips where pickup_datetime = '2015-08-01;7d';
If this would solve your use case, there is a tutorial on how to get query results from QuestDB into pandas here.
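For reference, a rough sketch of pulling such a query into pandas over QuestDB's HTTP interface (this assumes a local instance exposing the CSV export endpoint /exp on the default port 9000; check the docs for your setup):

import pandas as pd
from urllib.parse import quote

query = "select * from trips where pickup_datetime = '2015-08-01;7d'"
# the /exp endpoint returns the query result as CSV, which pandas can read straight from the URL
df_week = pd.read_csv("http://localhost:9000/exp?query=" + quote(query))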
I am using Redis with Python to store my per-second ticker data (price and volume of an instrument). I am performing r.hget(instrument, key) and facing the following issue.
My key (a string) looks like 01/01/2020-09:32:01 and keeps incrementing per second until the end of the user-specified interval, for example:
01/01/2020-09:32:01
01/01/2020-09:32:02
01/01/2020-09:32:03
...
My r.hget(instrument, key) result looks like b'672.2,432' (price and volume separated by a comma).
The issue I am facing is that a user can specify a long time interval, like 2 years, i.e. they want the data from 01/01/2020 to 31/12/2020 (d/m/y format). To perform the get operation I first have to generate the timestamps for that period and then perform a get for each key to build a pandas DataFrame. Generating these timestamps to use as keys is slowing down my process terribly (although it also ensures strict ordering, e.g. 01/01/2020-09:32:01 will definitely come before 01/01/2020-09:32:02). Is there another way to achieve the same?
If I simply do r.hgetall(...) I won't be able to satisfy the user's time interval condition.
Redis sorted sets are a good fit for such range queries. Sorted sets are made up of unique members, each with a score; in your case the timestamp (in epoch seconds) can be the score and the price and volume the member. However, since members of a sorted set must be unique, you may want to append the timestamp to the member to keep it unique.
zadd instrument 1577883600 672.2,432,1577883600
zadd instrument 1577883610 672.2,412,1577883610
After adding members to the set you can do range queries using zrangebyscore as below
zrangebyscore instrument 1577883600 1577883610
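From Python with redis-py, the equivalent calls might look roughly like this (member format as in the commands above; adjust to your data):

import redis

r = redis.Redis()

# score = timestamp in epoch seconds, member = "price,volume,timestamp" so each member stays unique
r.zadd("instrument", {"672.2,432,1577883600": 1577883600,
                      "672.2,412,1577883610": 1577883610})

# range query over a time window (both score bounds are inclusive)
rows = r.zrangebyscore("instrument", 1577883600, 1577883610)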
If your instrument contains many values, consider sharding it into multiple sets, for example one per month, like instrument:202001, instrument:202002 and so on.
The following are good reads on this topic:
Sorted Set Time Series
Sharding Structure
So to perform the get operation I have to first generate timestamps for that period and then perform the get operation...
No. This is the problem.
Make a function that calculates the timestamps and yields a smaller set of values, for a smaller time span (one week or one month).
So the new workflow will run in batches; see this loop:
generate a small set of timestamps
fetch items from redis
Pros:
minimizes memory usage
easy to change your current code to this new approach
I don't know the Redis-specific functions, so more specific solutions may be better. My idea is a general approach that I have used successfully for other problems.
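A rough sketch of that batched workflow, assuming redis-py and the key format from the question (batch size and names are just illustrative):

from datetime import datetime, timedelta
import redis

r = redis.Redis()

def key_batches(start, end, batch=timedelta(days=1)):
    # yield lists of per-second keys, one batch at a time (here one day; a week or month also works)
    current = start
    while current < end:
        batch_end = min(current + batch, end)
        keys = []
        t = current
        while t < batch_end:
            keys.append(t.strftime("%d/%m/%Y-%H:%M:%S"))
            t += timedelta(seconds=1)
        yield keys
        current = batch_end

rows = []
for keys in key_batches(datetime(2020, 1, 1, 9, 32, 1), datetime(2020, 1, 2, 16, 0, 0)):
    # hmget fetches many hash fields in a single round trip
    rows.extend(r.hmget("instrument", keys))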
Have you considered using RedisTimeSeries for this task? It is a Redis module tailored exactly to the sort of task you are describing.
You can keep two time series per instrument, holding price and volume.
With RedisTimeSeries it is easy to query over different ranges, and you can use the filtering mechanism to group different series (instrument families, for example) and query all of them at once.
// create your timeseries
TS.CREATE instrument:price LABELS instrument violin type string
TS.CREATE instrument:volume LABELS instrument violin type string
// add values
TS.ADD instrument:price 123456 9.99
TS.ADD instrument:volume 123456 42
// query timeseries
TS.RANGE instrument:price - +
TS.RANGE instrument:volume - +
// query multiple timeseries by filtering according to labels
TS.MRANGE - + FILTER instrument=violin
TS.MRANGE - + FILTER type=string
RedisTimeSeries allows running queries with aggregations such as average and standard deviation, and it uses double-delta compression, which can reduce your memory usage by over 90%.
You can check out a benchmark here.
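If you want to drive this from Python without extra dependencies, one option is redis-py's generic execute_command (a sketch mirroring the commands above; there is also a dedicated Python client for the module with friendlier wrappers):

import redis

r = redis.Redis()

# create the series and add a sample (same commands as above, sent generically)
r.execute_command("TS.CREATE", "instrument:price", "LABELS", "instrument", "violin", "type", "string")
r.execute_command("TS.ADD", "instrument:price", 123456, 9.99)

# query the full range of the series
samples = r.execute_command("TS.RANGE", "instrument:price", "-", "+")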
I need to do calculations such as the average of selected data, grouped by a collection of time ranges.
Example:
The table storing the data has several main columns:
| time_stamp | external_id | value |
Now I want to calculate the average for 20 (or more) groups of date ranges:
1) 2000-01-01 00-00-00 -> 2000-01-04 00-00-00
2) 2000-01-04 00-00-00 -> 2000-01-15 00-00-00
...
The important thing is that there are no gaps or intersections between the groups, so the first and last dates cover the full time range.
The other important thing is that within a given "date_from" to "date_to" range there can be rows that do not belong to the collection (unneeded external_id values).
I have tried 2 approaches:
1) Execute a query for each "time range" step, with the average function in the SQL query. But I don't like that: it consumes too much time across all the queries, and executing multiple queries doesn't feel like a good approach.
2) Select all required rows in one SQL request and then loop over the results. The problem is that on each step I have to check which "data group" the current datetime belongs to. This seems like a better approach (from the SQL perspective), but right now the performance is not good because of the loop within the loop. I need to figure out how to avoid executing that loop (checking which group the current timestamp belongs to) inside the main loop.
Any suggestions would be very helpful.
Actually both approaches are fine, and both could benefit from an index on the time_stamp column in your database, if you have one. Some advice on each:
Multiple queries are not such a bad idea. Your data looks fairly static, and you can run the 20 select avg(value) from data where time_stamp between date_from and date_to-style queries in 20 different connections to speed up the total operation. You also eliminate the need to transfer a lot of data from the DB to your client. The downside is that you need an additional where condition to exclude rows with unneeded external_id values; this complicates the query and can slow processing down a little if there are a lot of these values.
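A sketch of that parallel variant, assuming a DB-API driver; open_connection is a hypothetical helper that returns a fresh connection, and the %s paramstyle depends on your driver:

from concurrent.futures import ThreadPoolExecutor

def avg_for_range(date_from, date_to):
    with open_connection() as conn:  # hypothetical helper: one connection per query
        cur = conn.cursor()
        # add a filter on external_id here if you need to exclude unneeded ids
        cur.execute(
            "SELECT AVG(value) FROM data WHERE time_stamp BETWEEN %s AND %s",
            (date_from, date_to),
        )
        return cur.fetchone()[0]

# ranges is the list of (date_from, date_to) pairs from the question
with ThreadPoolExecutor(max_workers=20) as pool:
    averages = list(pool.map(lambda r: avg_for_range(*r), ranges))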
For the second approach, you could have the server sort the data by the indexed time_stamp column before sending it, and then just check whether the current item belongs to a new date range (thanks to the sorting you can be sure later items come from later dates). This reduces the inner loop to an if statement. I am not sure that loop is the bottleneck here, though; maybe you'd also like to look into streaming the results instead of waiting for them all to be fetched.
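A sketch of that single-pass version, assuming the rows arrive sorted by time_stamp and the ranges are sorted, non-overlapping and cover the data (all names are illustrative):

# rows: iterable of (time_stamp, external_id, value) ordered by time_stamp
# ranges: list of (date_from, date_to) pairs, sorted and non-overlapping
def averages_per_range(rows, ranges, wanted_ids):
    sums = [0.0] * len(ranges)
    counts = [0] * len(ranges)
    i = 0  # index of the range the current row falls into
    for ts, ext_id, value in rows:
        # advance to the range containing ts instead of scanning all ranges per row
        while i < len(ranges) and ts >= ranges[i][1]:
            i += 1
        if i == len(ranges):
            break
        if ext_id in wanted_ids:
            sums[i] += value
            counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]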
I'm running daily simulations in a batch: I do 365 simulations to get results for a full year. After every run, I want to extract some arrays from the results and add them to a pandas.DataFrame for analysis later.
I have a rough model (doing an optimisation) and a more precise model for a post-simulation, so I can get the same variable from two sources. Once the post-simulation is done, its results should overwrite the optimisation results.
To make it more complicated, the optimisation model has a smaller output interval, depending on the discretisation settings, but the final analysis will happen on the larger interval of the post-simulation.
What is the best way to construct this DataFrame?
This was my first approach:
1) creation of an empty DataFrame df for the full year, with a DateRange index at the larger post-simulation interval (= 15 minutes)
2) do the optimisation for 1 day ==> create a temporary df_temp with a DateRange index at the smaller interval
3) downsample this DataFrame to 15 minutes as described here:
4) update df with df_temp (the rows in df are still empty, except for the last row of the previous run, so I have to take df_temp[1:])
5) do the post-simulation for the same day ==> create a temporary df_temp2 with interval = 15 min
6) overwrite the corresponding rows in df with df_temp2
Which methods should I use in steps 4) and 6)? Or is there a better way from the start?
Thanks,
Roel
I think that using DataFrame.combine_first could be the way to go, but depending on the scale of the data it might be more useful to have a method like "update" that just modifies particular rows in an existing DataFrame. combine_first is more general and can produce a result of a different size than either of the inputs (because the indexes get unioned together).
https://github.com/pydata/pandas/issues/961
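For reference, a minimal sketch of steps 4) and 6) with combine_first and DataFrame.update (df, df_temp and df_temp2 as named in the question; the downsampling to 15 minutes is assumed to have happened already):

# step 4): fill the still-empty rows of df with the downsampled optimisation results
# (df_temp[1:] as in the question, so the last row of the previous run is not touched)
df = df_temp.iloc[1:].combine_first(df)

# step 6): overwrite the corresponding rows in place with the post-simulation results
df.update(df_temp2)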