How to combine the data from two different dataframes with pyspark? - python

I have two different (and very large) dataframes (details below), and I need to merge the data from both of them. Since these dataframes are huge (millions of rows in the first one and thousands in the second), I was trying to use the AWS EMR service, but I don't quite understand how it is done there, and the tutorials I've seen mostly show instructions for a single dataframe. So I've been wondering how I can use pyspark with two different dataframes.
Here are the details:
The first dataframe, say df, contains data about the people watching tv on different days. It looks like this:
id date other_data
0 0 2020-01-01 some data
1 1 2020-02-01 some data
2 2 2020-03-01 some data
3 3 2020-04-01 some data
4 4 2020-05-01 some data
Here, id is the id of the watcher, date is the date of watching, other_data contains other information (like the duration of watching, channel, etc.)
The second dataframe, say program, contains data about the programs. It looks like this:
date program start_time end_time
0 2020-01-01 program 1 14:00:00 15:00:00
1 2020-01-01 program 2 15:00:00 16:00:00
2 2020-01-01 program 3 16:00:00 17:00:00
3 2020-01-01 program 4 17:00:00 18:00:00
4 2020-01-01 program 5 18:00:00 19:00:00
Here, date is the date, program is the name of the program, start_time and end_time are the time of the program's beginning and end.
Basically, what I need to do is to create one dataframe that would contain all the information from both of these dataframes. I need this final dataframe to have a separate row for each user and each program. In other words, I need a dataframe that would duplicate each row in the first dataframe for each of the programs on the same day.
It might seem a little bit confusing, but here is an example of the final dataframe I want to receive:
id date other_data program start_time end_time
0 0 2020-01-01 some data program 1 14:00:00 15:00:00
1 0 2020-01-01 some data program 2 15:00:00 16:00:00
2 0 2020-01-01 some data program 3 16:00:00 17:00:00
3 0 2020-01-01 some data program 4 17:00:00 18:00:00
4 0 2020-01-01 some data program 5 18:00:00 19:00:00
As you can see, this final dataframe contains the data for each user and each program that was shown on the same day this user watched tv. In this particular case, the user with id=0 has watched tv on 01/01/2020. On the same day, program 1, program 2, program 3, program 4, and program 5 were shown. So, I need to have one row for each of the programs with their details. And, of course, I need the data from the first dataframe (contained in other_data).
So far, I have come up with the following approach: I iterate over the first dataframe, for each row find all the rows in the second dataframe that have the same date, merge them, and append the result to a third (final) dataframe.
Here is the code I use:
import pandas as pd

ids = []          # users' ids
dates = []        # dates
other_data = []   # other data from the first dataframe
programs = []     # all programs
start_times = []  # starting times
end_times = []    # ending times
for i, row in df.iterrows():
    temp = program.loc[program['date'] == row['date']]  # find all programs on the same date
    for j, program_row in temp.iterrows():  # iterate over the programs on the same date
        # append all the info
        ids.append(row['id'])
        dates.append(row['date'])
        other_data.append(row['other_data'])
        programs.append(program_row['program'])
        start_times.append(program_row['start_time'])
        end_times.append(program_row['end_time'])
# create final dataframe
final = pd.DataFrame({'id': ids, 'date': dates, 'other_data': other_data, 'program': programs,
                      'start_time': start_times, 'end_time': end_times})
This approach works, but it is extremely slow (considering the large size of the dataframes). I was, therefore, wondering how to split this job between several workers using AWS EMR. If I understand it correctly, I need to split the first dataframe df between the workers and, at the same time, provide each of them with the full program dataframe. Is it possible to do that? And how?
Would appreciate any help or advice!

It seems that both df and program are pandas dataframes, and a merge/join is what you need; see pandas.DataFrame.merge. Try this:
import pandas as pd
final = pd.merge(df, program, on=['date'], how='inner')
In case the pandas version is too slow, you could convert the dataframes to PySpark dataframes and perform the following steps:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("convert").getOrCreate()
df_spark = spark.createDataFrame(df)
program_spark = spark.createDataFrame(program)
final_spark = df_spark.join(F.broadcast(program_spark), on=['date'], how='inner')
Here, it is assumed that the dataframe program is a small dataframe - if not, please remove the broadcast.
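Since df itself has millions of rows, building it in pandas and converting it with createDataFrame can become a bottleneck on the driver. If the source data is already stored as files on S3, you could instead let Spark read it directly on EMR; the paths below are placeholders, so treat this as a sketch rather than a drop-in solution:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tv-program-join").getOrCreate()

# hypothetical input locations - replace with your own paths and formats
df_spark = spark.read.csv("s3://your-bucket/watchers/", header=True, inferSchema=True)
program_spark = spark.read.csv("s3://your-bucket/programs/", header=True, inferSchema=True)

# the large watchers table stays partitioned across the workers, while every
# worker receives a full copy of the small programs table via the broadcast hint
final_spark = df_spark.join(F.broadcast(program_spark), on='date', how='inner')

# hypothetical output location
final_spark.write.mode("overwrite").parquet("s3://your-bucket/joined/")
Spark then does the splitting between workers for you, and the broadcast hint is exactly the "give every worker the full program dataframe" behaviour you described.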
Hopefully this solves your issue and gets rid of the slow loops.

Related

How can I filter for rows one hour before and after a set timestamp in Python?

I am trying to filter a DataFrame to show only the values from 1 hour before to 1 hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see regarding masking by date mostly cover the case of masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows based around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame containing only the rows of the initial DataFrame that fall within 1 hour before and 1 hour after a specified timestamp, i.e. within this 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame, but now only contains rows that are within 1-hour before 2011-07-14 06:15:00, and 1-hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter that gets rid of all values before 2011-07-14 05:15:00 and after 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do it with:
import pandas as pd
import datetime as dt
data = {"date": ["2011-01-15 03:10:00", "2011-01-15 03:40:00", "2011-01-15 04:10:00", "2011-01-15 04:40:00", "2011-01-15 05:10:00", "2011-01-15 07:10:00"],
        "value": [1, 2, 3, 4, 5, 6]}
df=pd.DataFrame(data)
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S', errors='ignore')
date_search= dt.datetime.strptime("2011-01-15 05:20:00",'%Y-%m-%d %H:%M:%S')
mask = (df['date'] > date_search-dt.timedelta(hours = 1)) & (df['date'] <= date_search+dt.timedelta(hours = 1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5
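If you want the caller to pass in only a single timestamp, as asked, the same mask can be wrapped in a small helper; this is just a sketch based on the example above, and it assumes the datetime column is named date and has already been parsed:
import pandas as pd

def window_around(df, timestamp, hours=1):
    # rows strictly after (timestamp - hours) and up to (timestamp + hours), same bounds as the mask above
    ts = pd.Timestamp(timestamp)
    delta = pd.Timedelta(hours=hours)
    mask = (df['date'] > ts - delta) & (df['date'] <= ts + delta)
    return df.loc[mask]

print(window_around(df, "2011-01-15 05:20:00"))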

Extract minutes per hour based on occupancy in Excel

Is there an easy way to extract the minutes per hour a room was used based on occupancy level? I would like to get an overview of how many minutes room 1 was used from 08:00:00-08:59:59, 09:00:00-09:59:59, etc.
I have done this manually by creating time intervals for every hour, starting at, for example, 08:00:00 and ending at 08:59:59. Then I used a SUMIF formula to get the number of minutes the room was occupied per hour for one day (9 hours in total per day).
As I want to see how many minutes per hour different rooms are occupied and compare them, I wonder if there is an easier way to do this? It would be great to have a format that I could use for all rooms. However, since all rooms will have different timestamps, this might be difficult?
If anyone knows how to do this in SQL or Python, that would be very helpful as well, especially in SQL!
The link below will give you an example of the data.
In python, the most analogous data structure to a spreadsheet or a SQL table is the DataFrame from the pandas library.
First we can read in data from a spreadsheet like so:
import pandas as pd
df = pd.read_excel("<your filename>", parse_dates=[1])
df["Time"] = df.Timestamp.dt.time
Here I am going to assume you have removed your work-in-progress (table on the right in the image) and that the data is in the first worksheet of the Excel file (otherwise we'll have to pass additional options).
I've ensured that the first (Timestamp) column is correctly understood as containing date-time data. By default it will assume 09.01.2020 ... refers to the 1st of September, American-style - I'm guessing that's what you want; additional options can be passed if you were really referring to the 9th of January (which is how I'd read that date).
I then overwrote the Time column with a time object extracted from the Timestamp; this isn't really necessary, but it gets the data as close to what was in the spreadsheet as possible. The DataFrame now looks like this:
Timestamp Room name Occupancy % Time
0 2020-09-01 08:04:01 Room 1 0 08:04:01
1 2020-09-01 09:04:01 Room 1 100 09:04:01
2 2020-09-01 09:19:57 Room 1 0 09:19:57
3 2020-09-01 09:48:57 Room 1 0 09:48:57
4 2020-09-01 09:53:01 Room 1 100 09:53:01
5 2020-09-01 10:05:01 Room 1 100 10:05:01
6 2020-09-01 10:08:57 Room 1 100 10:08:57
7 2020-09-01 10:13:01 Room 1 100 10:13:01
(Note for next time: it would have been good to include text like this in your question; it makes it much easier to construct an answer if the data doesn't have to be painstakingly put together.)
Now, there are a lot of things we can do with a DataFrame like this, but I'm going to try and get to where you want to go as directly as possible.
We'll start by using the Timestamp column as the 'index' and prepending a row for the time 08:00:00 because it's not currently part of your dataset, but you indicated you want it.
df2 = df.set_index("Timestamp")
df2.loc[pd.Timestamp("09.01.2020 08:00:00")] = ("Room 1", 0.0, None)
df2.sort_index(inplace=True)
The result looks like this:
Room name Occupancy % Time
Timestamp
2020-09-01 08:00:00 Room 1 0.0 None
2020-09-01 08:04:01 Room 1 0.0 08:04:01
2020-09-01 09:04:01 Room 1 100.0 09:04:01
2020-09-01 09:19:57 Room 1 0.0 09:19:57
2020-09-01 09:48:57 Room 1 0.0 09:48:57
2020-09-01 09:53:01 Room 1 100.0 09:53:01
2020-09-01 10:05:01 Room 1 100.0 10:05:01
2020-09-01 10:08:57 Room 1 100.0 10:08:57
2020-09-01 10:13:01 Room 1 100.0 10:13:01
Now, the simplest way to do this is to start by upsampling and forward-filling the data.
upsampled = df2.resample("1s").ffill()
upsampled is a huge DataFrame with a value for every second in the range. The forward-filling ensures your occupancy % is carried forward every second until one of your original datapoints said 'it changed here'. After the change, the new value is carried forward to the next datapoint etc.
This is done to ensure we get the necessary time resolution. Normally I would now downsample. You were interested in each hour:
downsampled = upsampled.resample("1h").mean()
By taking the mean, we'll get only the numeric columns in our output, i.e. 'occupancy', and here you'll get the following:
Occupancy %
Timestamp
2020-09-01 08:00:00 0.000000
2020-09-01 09:00:00 38.194444
2020-09-01 10:00:00 100.000000
But you indicated you might want to do this 'per room', so there might be other data with e.g. 'Room 2'. In that case, we have a categorical column, Room name, that we need to group by.
This is a bit harder, because it means we have to group before we upsample, to avoid ambiguity. This is going to create a MultiIndex. We have to collapse the 'group' level of the index, then group and downsample!
grouped = df2.groupby("Room name", as_index=False).resample('1s').ffill()
grouped.index = grouped.index.get_level_values(1)
result = grouped.groupby("Room name").resample("1h").mean()
which will look something like this:
Occupancy %
Room name Timestamp
Room 1 2020-09-01 08:00:00 0.000000
2020-09-01 09:00:00 38.194444
2020-09-01 10:00:00 100.000000
Room 2 2020-09-01 08:00:00 0.000000
2020-09-01 09:00:00 38.194444
2020-09-01 10:00:00 100.000000
(I just duplicated the data for Room 1 as Room 2, so the numbers are the same)
For a neat finish, we might unstack this multi-index, pivoting the room names into columns. Then convert those percentages into the nearest number of minutes.
Thus the whole solution is:
import pandas as pd

df = pd.read_excel("<your filename>", parse_dates=[1])
df2 = df.set_index("Timestamp")
# prepend some dummy rows for every different room name
for room_name in df2["Room name"].unique():
    df2.loc[pd.Timestamp("09.01.2020 08:00:00")] = (room_name, 0.0, None)
df2.sort_index(inplace=True)
grouped = df2.groupby("Room name", as_index=False).resample('1s').ffill()
grouped.index = grouped.index.droplevel(0)
result = (
    grouped
    .groupby("Room name")
    .resample("1h")
    .mean()
    .unstack(level=0)
    .div(100)     # % -> fraction
    .mul(60)      # fraction -> minutes
    .astype(int)  # nearest number of whole minutes
)
# no longer 'Occupancy %', so drop the label
result.columns = result.columns.droplevel(0)
yielding a result like
Room name Room 1 Room 2
Timestamp
2020-09-01 08:00:00 0 0
2020-09-01 09:00:00 22 22
2020-09-01 10:00:00 60 60
which hopefully is close to what you were after.
As a starting point:
SELECT
    room_name, sum(stop - start)
FROM
    room_table
WHERE
    timestamp BETWEEN 'some_time' AND 'another_time'
GROUP BY
    room_name
In the above, the SQL table is room_table. It also assumes the start and stop fields are time types. The 'some_time' and 'another_time' values are just placeholders for the time range you are interested in.

Create Time Buckets Pandas Python and Count for missing time-range

How do you group data into time buckets and count the number of observations in each bucket? If there are none, fill the empty time buckets with 0s.
I have the following data set in a dataframe:
'''
df=
Time
0:10
5:00
5:00
5:02
5:03
5:05
5:07
5:09
6:00
6:00
6:00
'''
I would like to create 5-minute time buckets going from 00:00 to 23:59 and count how many observations fall in each bucket. If none, then 0. Basically, each time represents a unit in a queue, and I want to calculate how many fall into a given time bucket.
From the above data (example set), I would like to get the following:
Time Obs
00:00 0
00:05 0
00:10 1
00:15 0
...
05:00 2
05:05 3
05:10 2
06:00 3
...
I tried the following code
df['time_bucket'] = pd.to_datetime(df['Time']).dt.ceil('5min')
which did not work.
I tried the following as well:
df1= df.resample('5T', on ='time_bucket').count()
which results in :
Time time_bucket
time_bucket
2020-05-24 00:10:00 1 1
2020-05-24 00:15:00 0 0
2020-05-24 00:20:00 0 0
2020-05-24 00:25:00 0 0
2020-05-24 00:30:00 0 0
The time starts at 00:10 but not at 00:00; seems like it starts from the initial value of the time_bucket column.
Basically in the new column, I want to calculate the count. Eventually, I would like to create a function which takes a parameter, ex: time buckets (5, 10, 15) and create table for given time bucket with counts.
I could not find a standard way to address your specific issue - time (without date) buckets - with native pandas functions.
First, instead of your dataset, which seems to be in string format, I generated random datetime.time values.
import random
import datetime
import pandas as pd
from collections import Counter
from datetime import time, timedelta
# generate 10k lines of random data
x = [None]*10000
x = [time(hour=random.randrange(0,24), minute=random.randrange(0,60)) for item in x]
# use Counter to aggregate minute-wise data
y = Counter(x)
z = [{'time':item, 'freq':y.get(item)} for item in y]
df = pd.DataFrame(z)
# create bins
df['tbin']=df['time'].apply(lambda x: x.hour*12 + int(x.minute/5))
df['binStart']=df['time'].apply(lambda x: time(hour=x.hour, minute=(x.minute - x.minute%5)))
df['binEnd']=df['binStart'].apply(lambda a: (datetime.datetime.combine(datetime.datetime.now(), a)+timedelta(minutes=5)).time())
# grouping also orders the data
df_histogram=df.groupby(['tbin', 'binStart', 'binEnd'])['freq'].sum()
This is probably too late, but I was working on solving a similar but simpler problem and came across your unanswered question, which I thought would be more fun to solve than my own (which got solved in the process).
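A plain-pandas sketch of the zero-filled buckets the question asks for could also look like the following; it assumes the Time values are strings such as '0:10', as in the example, and simply floors each time to its bucket, counts, and reindexes against every bucket of the day:
import pandas as pd

times = ['0:10', '5:00', '5:00', '5:02', '5:03', '5:05', '5:07', '5:09', '6:00', '6:00', '6:00']
df = pd.DataFrame({'Time': times})

n = 5  # bucket size in minutes
# parsing with %H:%M attaches the default date 1900-01-01, which is enough for flooring
buckets = pd.to_datetime(df['Time'], format='%H:%M').dt.floor(f'{n}min')
counts = buckets.value_counts()

# build every n-minute bucket of the day and fill the missing ones with 0
full_day = pd.date_range('1900-01-01', periods=24 * 60 // n, freq=f'{n}min')
result = counts.reindex(full_day, fill_value=0)
result.index = result.index.strftime('%H:%M')
print(result.head())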

pandas - efficiently computing minutely returns as columns on intraday data

I have a DataFrame that looks like such:
closingDate Time Last
0 1997-09-09 2018-12-13 00:00:00 1000
1 1997-09-09 2018-12-13 00:01:00 1002
2 1997-09-09 2018-12-13 00:02:00 1001
3 1997-09-09 2018-12-13 00:03:00 1005
I want to create a DataFrame with roughly 1440 columns labeled as timestamps, where the respective daily value is the return over the prior minute:
closingDate 00:00:00 00:01:00 00:02:00
0 1997-09-09 2018-12-13 -0.08 0.02 -0.001 ...
1 1997-09-10 2018-12-13 ...
My issue is that this is a very large DataFrame (several GB), and I need to do this operation multiple times. Time and memory efficiency is key, but time being more important. Is there some vectorized, built in method to do this in pandas?
You can do this with some aggregation and by shifting your time series, which should result in more efficient calculations.
First aggregate your data by closingDate.
g = df.groupby("closingDate")
Next you can shift your data within each group by one row, i.e. by one minute.
shifted = g.shift(periods=1)
This will create a new dataframe where the Last value will be from the previous minute. Now you can join to your original dataframe based on the index.
df = df.merge(shifted, left_index=True, right_index=True)
This adds the shifted columns to the new dataframe that you can use to do your difference calculation.
df["Diff"] = (df["Last_x"] - df["Last_y"]) / df["Last_y"]
You now have all the data you're looking for. If you need each minute to be its own column, you can pivot the results. By grouping on closingDate and then applying the shift, you avoid shifting values across separate days; the first observation of each day will be NaN, since values are not carried over from the previous day.
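As a small sketch of that last pivot step (assuming pandas' default _x/_y merge suffixes from above, so the original timestamps end up in Time_x):
df["minute"] = pd.to_datetime(df["Time_x"]).dt.strftime("%H:%M:%S")
wide = df.pivot(index="closingDate", columns="minute", values="Diff")
This gives one row per closingDate and one column per minute of the day, which is the shape described in the question.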

Python pandas: filtering rows based on time criteria using pandas

I have a CSV file with millions of rows in the following format:
Amount,Price,Time
0.36,13924.98,2010-01-01 00:00:08
0.01,13900.09,2010-01-01 00:02:04
0.02,13907.59,2010-01-01 00:04:54
0.07,13907.59,2010-01-01 00:05:03
0.03,13925,2010-01-01 00:05:41
0.03,13920,2010-01-01 00:07:02
0.15,13910,2010-01-01 00:09:37
0.03,13909.99,2010-01-01 00:09:58
0.03,13909.99,2010-01-01 00:10:03
0.14,13909.99,2010-01-01 00:10:03
I want to first filter this data and then perform some calculations on the filtered data. I import it with pandas using data = pd.read_csv() to get a DataFrame.
I then transform the Time column to TimeDelta column (which I am not sure is necessary for what I want to do) where I write the time difference to the time 2010-01-01 00:00:00 by using
data['TimeDelta'] = pd.to_timedelta(pd.to_datetime(data.Time) - pd.Timedelta(days=14610)) / np.timedelta64(1, 'm')
Here comes the part that I struggle with. I want a function that returns a new DataFrame, where I want only the first row after every n minutes, where n is an integer defined by the user.
For example, if n=5, the desired output of this function for my data would be:
Amount,Price,Time
0.36,13924.98,2010-01-01 00:00:08
0.07,13907.59,2010-01-01 00:05:03
0.03,13909.99,2010-01-01 00:10:03
And the output for n=3 would be:
Amount,Price,Time
0.36,13924.98,2010-01-01 00:00:08
0.02,13907.59,2010-01-01 00:04:54
0.15,13910,2010-01-01 00:09:37
I have tried doing this using the floor and the remainder %, but being a beginner with Python I am unable to get it working.
Use pd.Grouper:
n = 5
df.groupby(pd.Grouper(key='Time', freq=f'{n}min')).first()
Amount Price
Time
2010-01-01 00:00:00 0.36 13924.98
2010-01-01 00:05:00 0.07 13907.59
2010-01-01 00:10:00 0.03 13909.99
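To turn this into the function the question asks for, with n supplied by the user, a sketch could look like this; it assumes the CSV was read into data as in the question, parses Time first, and drops any n-minute windows that contain no rows (the returned Time values are the window starts, as in the output above):
import pandas as pd

def first_every_n_minutes(data, n):
    data = data.assign(Time=pd.to_datetime(data['Time']))  # pd.Grouper needs a datetime column
    out = data.groupby(pd.Grouper(key='Time', freq=f'{n}min')).first()
    return out.dropna(subset=['Price']).reset_index()       # drop empty windows

print(first_every_n_minutes(data, 5))
print(first_every_n_minutes(data, 3))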
