I am new to pandas and would like to know how best to use time-bounded sliding windows and rolling statistics. I process a continuous stream and compute several rolling statistics (weighted average, mean, sum, max, oldest...) over several time windows (1 hr, 4 hrs, 1 day, 1 week...), also grouped by item ID.
An output stream is produced for each item with its own rolling statistics, but also with statistics from previously identified similar items (rows are linked on the closest timestamps, whose spacing varies).
I currently have custom code, written without pandas, whose large speed advantage comes from two things: rolling statistics are computed differentially (i.e. only the new data entering the sliding window and the old data leaving it are processed, not the whole window), and the variable timespans of similar items are linked as the stream arrives. I would like to switch to pandas, but I want to be sure about the expected performance first.
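To make "differential calculation" concrete, this is roughly the kind of incremental update my custom code does, shown here for a single item and a single 1-hour rolling sum (a simplified sketch, not my actual code):

    from collections import deque
    from datetime import timedelta

    class RollingSum:
        """Incremental 1-hour rolling sum for one item (illustrative sketch only)."""
        def __init__(self, span=timedelta(hours=1)):
            self.span = span
            self.window = deque()   # (timestamp, price) pairs currently in the window
            self.total = 0.0

        def update(self, timestamp, price):
            # Add the new observation to the running total.
            self.window.append((timestamp, price))
            self.total += price
            # Discard observations that fell out of the window and subtract them,
            # so each update only touches the new and expired values.
            while self.window and self.window[0][0] < timestamp - self.span:
                _, old_price = self.window.popleft()
                self.total -= old_price
            return self.total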
Is there a way to achieve similar (or better) performance with pandas? More specifically:
Does pandas recalculate each rolling statistic over all of the window's values, or does it do a differential calculation on the new/expired values? And when creating "custom functions" for rolling statistics, could I also do a differential calculation to avoid the huge cost of reprocessing all the values?
What is the most efficient way to declare multiple rolling statistics over several time windows? If I also want to group this by item, I assume I should just add something like my_stream.groupby(item_key) - would that still be efficient? (See the sketch after these questions.)
Output: for each item, I output its own rolling statistics plus statistics from similar items, but the timespans are variable (from 10 min to 40 min). How could I link each row of one item to the row of the other item with the closest older timestamp (I mean: if the time is 02:00 for Item 1, and Item 2 has data at 02:01 and 01:50, I should link to the data from 01:50)? Will it strongly impact performance?
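To make the pandas side of the question concrete, this is the kind of code I imagine writing, using the Item/Price/Date columns from the illustration below (a sketch only; I do not know whether it is idiomatic or how it performs):

    import pandas as pd

    # Per-item, time-based rolling statistics over several windows
    df = df.sort_values("Date").set_index("Date")
    for window in ["1h", "4h", "24h"]:
        roll = df.groupby("Item")["Price"]
        df["Mean_" + window] = roll.transform(lambda s: s.rolling(window).mean())
        df["Sum_" + window] = roll.transform(lambda s: s.rolling(window).sum())
    df = df.reset_index()

    # Linking each Item 1 row to the closest *older* Item 2 row looks like an
    # "as-of" join, e.g. pd.merge_asof with direction="backward"
    item1 = df[df["Item"] == 1].sort_values("Date")
    item2 = df[df["Item"] == 2].sort_values("Date")
    linked = pd.merge_asof(item1, item2, on="Date",
                           direction="backward", suffixes=("", "_similar"))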
I tried to create a quick illustration, although it is not easy to show:
Input:
Item | Price | Date
------- | ----- | --------------
1 | 10 | 2014 01:01:01
2 | 20 | 2014 01:01:02
1 | 20 | 2014 01:21:00
1 | 20 | 2014 01:31:01
Output:
Item | Date | Price | Mean1hr | Mean4hr | Mean24hr | Sum1hr | Sum4hr | Sum24hr | SimilarMean1hr | SimilarMean4hr | Similar24hr |
-------|------|--------|-------|-------------|-----------|-------|--------|-------|----------|----------|--------|
1 | 2014 01:21:00 | 15 | 8 | 3 | 30 | 30 | 35 | 16 | 14 | 10 |
Thanks a lot,
Xavier
I want to combine two datasets in Python based on multiple conditions using pandas.
The two datasets have different numbers of rows.
The first one contains almost 300k entries, while the second one contains almost 1000 entries.
More specifically, the first dataset "A" has the following columns:
Path | Line | Severity | Vulnerability | Name | Text | Title
An instance of the content of "A" is this:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress
The second dataset "B" contains the following columns:
Class | Path | DTWC | DR | DW | IDFP
An instance of the content in "B" is this:
y.x.bla.MainActivity | com.lucao.limpazap_11| 0 | 0 | 0 | 0
I want to combine these two datasets as follows:
If A['Name'] is equal to B['Path'] AND B['Class'] is contained in A['Path'],
then
merge the two rows into another data frame "C".
An output example is the following:
Suppose that A contains:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress|
and B contains:
com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
the output should be the following:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress| com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
I'm not sure if this is the best or most efficient way, but I have tested it and it works. My answer is pretty straightforward: we loop over the two dataframes and apply the desired conditions.
Suppose the dataset A is df_a and dataset B is df_b.
First we add a suffix to every column of df_a and df_b so that the rows from both frames can be combined later.
df_a.columns= [i+'_A' for i in df_a.columns]
df_b.columns= [i+'_B' for i in df_b.columns]
Then we can apply this for loop:
import pandas as pd

rows = []
# Iterate through df_a
for idx_A, v_A in df_a.iterrows():
    # Iterate through df_b
    for idx_B, v_B in df_b.iterrows():
        # Apply the condition
        if v_A['Name_A'] == v_B['Path_B'] and v_B['Class_B'] in v_A['Path_A']:
            # Cast both series to dictionaries and merge them into one row
            rows.append({**v_A.to_dict(), **v_B.to_dict()})

# Build df_c from the collected rows (DataFrame.append was removed in pandas 2.0)
df_c = pd.DataFrame(rows)
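Since df_a has around 300k rows, the nested iterrows loop may be quite slow. A possibly faster variation (a sketch using the same suffixed column names as above, not benchmarked on your data) is to merge on the equality condition first and then filter on the substring condition:

    import pandas as pd

    # Inner-join on the equality condition first (keeps only matching pairs)
    merged = df_a.merge(df_b, left_on='Name_A', right_on='Path_B')

    # Then keep only the rows where B's class appears inside A's path
    mask = merged.apply(lambda row: row['Class_B'] in row['Path_A'], axis=1)
    df_c = merged[mask].reset_index(drop=True)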
I have a simple data frame which might look like this:
| Label | Average BR_1 | Average BR_2 | Average BR_3 | Average BR_4 |
| ------- | ------------ | ------------ | ------------ | ------------ |
| Label 1 | 50 | 30 | 50 | 50 |
| Label 2 | 60 | 20 | 50 | 50 |
| Label 3 | 65 | 50 | 50 | 50 |
What I would like to be able to do is append a % symbol to the values in every one of these columns.
I know that I can do something like this for every column:
df['Average BR_1'] = df['Average BR_1'].astype(str) + '%'
However, the problem is that I read the data in from a CSV file which might contain more of these columns, so instead of Average BR_1 to Average BR_4, it might contain Average BR_1 to, say, Average BR_10.
So I would like this change to happen automatically for every column which contains Average BR_ in its column name.
I have been reading about .loc, but I only managed to replace the column values with an entirely new value, like so:
df.loc[:, ['Average BR_1', 'Average BR_2']] = "Hello"
Also, I haven't yet been able to implement regex here.
I tried with a list:
colsArr = [c for c in df.columns if 'Average BR_' in c]
print(colsArr)
But I did not manage to implement this with .loc.
I suppose I could do this using a loop, but I feel like there must be a better pandas solution; I just cannot figure it out.
Could you help and point me in the right direction?
Thank you
# extract the column names that need to be updated
cols = df.columns[df.columns.str.startswith('Average BR')]
# update the columns
df[cols] = df[cols].astype(str).add('%')
print(df)
Label Average BR_1 Average BR_2 Average BR_3 Average BR_4
0 Label 1 50% 30% 50% 50%
1 Label 2 60% 20% 50% 50%
2 Label 3 65% 50% 50% 50%
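If you want to stay closer to your .loc attempt, the same column selection works there too (a small variation of the assignment above):

    cols = df.columns[df.columns.str.startswith('Average BR')]
    df.loc[:, cols] = df[cols].astype(str).add('%')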
You can use df.update and df.filter
df.update(df.filter(like='Average BR_').astype('str').add('%'))
df
Out:
Label Average BR_1 Average BR_2 Average BR_3 Average BR_4
0 Label 1 50% 30% 50% 50%
1 Label 2 60% 20% 50% 50%
2 Label 3 65% 50% 50% 50%
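Note that df.update aligns on the index and modifies df in place, so there is nothing to assign back; the filter(like='Average BR_') call just selects the columns whose names contain that prefix.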
I have done some research on this, but couldn't find a concise method when the index is of type 'string'.
Given the following Pandas dataframe:
Platform | Action | RPG | Fighting
----------------------------------------
PC | 4 | 6 | 9
Playstat | 6 | 7 | 5
Xbox | 9 | 4 | 6
Wii | 8 | 8 | 7
I was trying to get the index (Platform) of the smallest value in the 'RPG' column, which would return 'Xbox'. I managed to make it work, but it's not efficient, and I am looking for a better/quicker/more condensed approach. Here is what I have:
# Get the 'RPG' column as a Series of all platforms' values
series1 = ign_data['RPG']
# Find the lowest value in the series
minim = min(series1)
# Get the index of that value using boolean indexing
result = series1[series1 == minim].index
# Format that index to a list, and return the first (and only) element
str_result = result.format()[0]
Use Series.idxmin:
df.set_index('Platform')['RPG'].idxmin()
#'Xbox'
or, as @Quang Hoang suggests in the comments:
df.loc[df['RPG'].idxmin(), 'Platform']
If Platform is already the index:
df['RPG'].idxmin()
EDIT
df.set_index('Platform').loc['Playstat'].idxmin()
#'Fighting'
df.set_index('Platform').idxmin(axis=1)['Playstat']
#'Fighting'
If Platform is already the index:
df.loc['Playstat'].idxmin()
I have a Spark DataFrame with data like below:
ID | UseCase
-----------------
0 | Unidentified
1 | Unidentified
2 | Unidentified
3 | Unidentified
4 | UseCase1
5 | UseCase1
6 | Unidentified
7 | Unidentified
8 | UseCase2
9 | UseCase2
10 | UseCase2
11 | Unidentified
12 | Unidentified
I have to extract the top 4 rows which have value Unidentified in column UseCase and do further processing with them. I don't want to get the middle and last two rows with Unidentified value at this point.
I want to avoid using the ID column as they are not fixed. The above data is just a sample.
When I use a map function (after converting this to an RDD) or UDFs, I end up with all 8 Unidentified rows in my output DataFrame (which is the expected behaviour of those functions).
How can this be achieved? I am working in PySpark. I don't want to use collect on the DataFrame and get it as a list to iterate over. This would defeat the purpose of Spark. The DataFrame size can go up to 4-5 GB.
Could you please suggest how this can be done?
Thanks in advance!
Just do a filter and a limit. The following code is Scala, but you'll understand the point.
Assume your dataframe is called df, then:
df.filter($"UseCase"==="Unidentified").limit(4).collect()
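Since the question is about PySpark, a rough Python equivalent (assuming the DataFrame is called df) would be:

    from pyspark.sql import functions as F

    # Keep only the Unidentified rows, then take 4 of them
    top4 = df.filter(F.col("UseCase") == "Unidentified").limit(4)

Note that without an explicit orderBy, Spark does not guarantee which 4 of the Unidentified rows limit(4) will return.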
I have a dataframe that looks like the below.
Day | Price
12-05-2015 | 73
12-06-2015 | 68
11-07-2015 | 77
10-08-2015 | 54
I would like to subtract the price for one Day from the corresponding price 30 days later. To add to the days, I've used data.loc[data['Day'] + timedelta(days=30)], but this obviously overflows near the final dates in my dataframe. Is there a way to subtract the prices without iterating over all the rows in the dataframe?
If it helps, my desired output is something like the following.
Start_Day | Price
12-05-2015 | -5
11-07-2015 | -23
You can use the df.diff() function.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.diff.html
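For example, assuming day-first dates and rows spaced roughly 30 days apart as in your sample (a sketch, not tested on your full data):

    import pandas as pd

    df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)  # assuming DD-MM-YYYY
    df = df.sort_values('Day').reset_index(drop=True)

    # diff(-1) gives price(t) - price(next row); negating it gives
    # "price roughly 30 days later minus current price"
    df['Change'] = -df['Price'].diff(-1)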