Purpose
The main purpose is to compute the share of resources used by node i relative to its neighbors:
share_i = r_i / sum_{j in N(i)} r_j
where r_i is node i's resources and sum_{j in N(i)} r_j is the sum of the resources of i's neighbors.
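For example, if node i holds r_i = 2 and its two neighbors hold 3 and 5, its share is 2 / (3 + 5) = 0.25.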
I am open to any R, Python, or even Stata solution that can achieve this task, which I am close to giving up on...
See below snippets with my previous attempts.
To achieve this goal, I am trying to perform a search of this type:
node    col1         col2      col3
i       [A]          [list]    list
j       [A, B, i]

search for i in col1; where it is found (here in node j's list), update i's col1:

node    col1         col2      col3
i       [A, j]       [list]    list
j       [A, B, i]
Data
The dataframe has about 700k rows and the lists can hold at most 20 elements. The lists in col1-col3 may be empty. Entries such as '1579301860' are stored as strings.
The first 10 entries of the df:
df[["ID","s22_12","s22_09","s22_04"]].head(10)
,ID,s22_12,s22_09,s22_04
0,547232925,[],[],[]
1,1195452119,[],[],[]
2,543827523,[],[],[]
3,1195453927,[],[],[]
4,1195456863,[],[],[]
5,403735824,[],[],[]
6,403985344,[],[],[]
7,1522725190,"['547232925', '1561895862', '1195453927', '1473969746', '1576299336', '1614620375', '1526127302', '1523072827', '398988727', '1393784634', '1628271142', '1562369345', '1615273511', '1465706815', '1546795725']","['1550103038', '547232925', '1614620375', '1500554025', '1526127302', '1523072827', '1554793443', '1393784634', '1603417699', '1560658585', '1533511207', '1439071476', '1527861165', '1539382728', '1545880720']","['1529732185', '1241865116', '1524579382', '1523072827', '1526127302', '1560851415', '1535455909', '1457280850', '1577015775', '1600877852', '1549989930', '1528007558', '1533511207', '1527861165', '1591602766']"
8,789656124,[],[],[]
9,662539468,[1195453927],[],[]
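Since the list columns appear to be stored as strings (see the CSV dump above), here is one possible preprocessing sketch in pandas; the column names come from the head(10) output, while the ast.literal_eval step and everything else is an assumption about the storage format:

import ast
import pandas as pd

cols = ["s22_12", "s22_09", "s22_04"]

# Parse the stringified lists into real Python lists (assumes values look like "['123', ...]" or [])
for c in cols:
    df[c] = df[c].apply(lambda v: ast.literal_eval(v) if isinstance(v, str) else v)

# Long format: one (ID, neighbour) row per list element, across the three columns
long_df = df.melt(id_vars="ID", value_vars=cols, value_name="neighbour")
edges = long_df.explode("neighbour").dropna(subset=["neighbour"])[["ID", "neighbour"]]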
What I tried: R Attempts
Exploding the lists and putting the data into a long format.
Then I tried two main approaches in R:
loading the long data into igraph, applying neighbors() to the graph's nodes, saving the results into lists, and using plyr to build a neighbor_df (this works, but 2 nodes take 67 seconds)
# Initialize the result data frame
result <- data.frame(Node = nodes)
#result <- as.data.frame(matrix(NA, nrow = n_nodes, ncol = 0))

neighbor_lists <- lapply(nodes, function(x) {
  neighbors <- names(neighbors(graph, x))
  if (length(neighbors) == 0) {
    neighbors <- NA
  }
  return(neighbors)
})

neighbor_df <- plyr::ldply(neighbor_lists, rbind)
names(neighbor_df) <- paste0("Neighbor", 1:ncol(neighbor_df))
result <- cbind(result, neighbor_df)
reading the long format with data.table, splitting it, and applying dcast to the splits with lapply (<- memory overload)
result_long <- edges[, .(to = to, Node = from)][, rn := .I][, .(Node, Neighbor = to, Number = rn)][order(Number),]
result_long[,cast_cat:=findInterval(Number,seq(100000,6000000,100000))]
# reshape to wide
result_wide <- dcast(result_long, Node ~ Number, value.var = "Neighbor", fill = "")
#Only tested on sample data; the target data is 19 million rows and dcast has to be split, but then it consumes 200 GB of RAM
result_wide[, (2:ncol(result_wide)) := lapply(.SD, function(x) ifelse(x == "", NA, x)), .SDcols = 2:ncol(result_wide)]
result_wide = na_move(result_wide, cols = names(result_wide[,!1]) )
result_wide<- Filter(function(x)!all(is.na(x)), result_wide)
I posted this as per Andy's request, yet I think it clutters the question.
Thanks to the comment of @Stefano Barbi:
# extract attributes characteristics:
r <- vertex_attr(g,"rcount",index=V(g))
#create a dgC sparse matrix from graph
m <- get.adjacency(g)
# premultiply the adj matrix to find the sum of the neighbors resources
sum_of_rj = r %*% m
# add node's own resources
sum_of_r = sum_of_rj + r
#find the vector of shares
share = r / sum_of_r@x
sh_tab = data.table(i = sum_of_r@Dimnames[[2]], sh = share)
sh_tab
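Since the question also allows Python, here is a rough equivalent of the same adjacency-matrix idea using networkx and scipy. This is a sketch only: the toy graph, node names, and rcount values are made up for illustration, and to_scipy_sparse_array assumes networkx >= 2.7.

import numpy as np
import networkx as nx

# Toy graph standing in for the real one built from the exploded edge list
g = nx.Graph([("a", "b"), ("a", "c"), ("b", "c")])
nx.set_node_attributes(g, {"a": 2.0, "b": 3.0, "c": 5.0}, name="rcount")

nodes = list(g.nodes())
r = np.array([g.nodes[n]["rcount"] for n in nodes])

# Sparse adjacency matrix; r @ A gives, per node, the sum of its neighbours' resources
A = nx.to_scipy_sparse_array(g, nodelist=nodes)
sum_rj = r @ A

# Mirror the R answer: add the node's own resources before taking the share
share = r / (sum_rj + r)
print(dict(zip(nodes, share)))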
Related
I want to calculate the co-occurrence of skills so that it can be used as an edge weight when constructing a graph. I have a column skill that contains the list of skills for each id in the dataset. The data format is:
skill
Product Management,RPM,Progress 4GL,IP,CAMEL,Prince2 Foundation,Continuous Integration,GSM(HLR,MSC),Programming,SS7,INAP,ClearCase,SS7 protocol,Software Development,Shell Scripting,GPRS(SGSN,GGSN),MySQL,VOIP,Linux,Agile,SIP,Diameter,Test,Oracle,Software
User Experience,Interaction Design,3D rendering,Event,Team,Graphic Design,Engineering,User Experience Design,Sales,3D Modeling,Product Marketing,Employee Training,business plan,3D,Business Development,Creative Problem Solving,Product Design,renewable energy,Electronics,news paper,Project Management,Product Development,Social Enterprise
The above are the skill lists of two ids in the dataset.
Now I want the output in the format of 3 columns (source, target, and weight count) so that in the next step I can use them for graph construction.
Source_elt Target_elt WeightCount
Can anyone share insights? That would be helpful. My end goal is to use this weight count for community detection.
I am using the following code for the co-occurrence calculation.
document = nested_list
# unique job titles
fnc_names = unique_jobtitle

# Get a list of all of the combinations you have
expanded = [tuple(itertools.combinations(d, 2)) for d in document]
expanded = itertools.chain(*expanded)
# Sort the combinations so that A,B and B,A are treated the same
expanded = [tuple(sorted(d)) for d in expanded]
# count the combinations
c = Counter(expanded)

# initialize NxN matrix with zeros
table = np.zeros((len(fnc_names), len(fnc_names)), dtype=int)
for i, v1 in enumerate(fnc_names):
    for j, v2 in enumerate(fnc_names[i:]):
        j = j + i
        table[i, j] = c[v1, v2]
        table[j, i] = c[v1, v2]

df_cooccMatrix = pd.DataFrame(table, index=fnc_names, columns=fnc_names)
df_cooccMatrix.head()
and later for the weight count
#Assign count as f edge
weight_cout = df_cooccMatrix.stack()
weight_cout = pd.DataFrame(weight_cout.rename_axis(('Source_elt', 'Target_elt')).reset_index(name='WeightCount'))
#weight_cout.sort_values(by=['WeightCount'], inplace=True, ascending=False)
#weight_cout.head(10)
But when I calculate the weight count I run into a memory error:
MemoryError: Unable to allocate 6.25 GiB for an array with shape (1678704784,) and data type int32
Can anyone help me solve this issue?
Thanks in advance
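Not an attempt at a full answer, just a sketch of one way to sidestep the allocation: build the three-column edge list directly from the Counter c computed in the snippet above, instead of materialising and stacking the dense N x N matrix.

import pandas as pd

# c maps each sorted (skill_a, skill_b) pair to its co-occurrence count
weight_cout = pd.DataFrame(
    [(a, b, n) for (a, b), n in c.items()],
    columns=["Source_elt", "Target_elt", "WeightCount"],
)
weight_cout = weight_cout.sort_values("WeightCount", ascending=False)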
I have a dataframe with the columns:
[id, range_start, range_stop, score]
If two rows' ranges overlap by x percent, I retain the row with the higher score. However, I am not sure how to pull out the rows that have no overlap with any other range. I am using a nested loop and recursion to condense overlapping ranges into a new dataframe, but this structure causes all rows to be retained when I am looking for the non-overlapping rows.
## This is my function to recursively select the highest scoring overlapping regions
def overlap_retention(df_overlap, threshold, df_nonoverlap=None):
    if df_nonoverlap != None:
        df_nonoverlap = pd.DataFrame()
    df_overlap = pd.DataFrame()
    for index, row in x.iterrows():
        rs = row['range_start']
        re = row['range_end']
        ## Silly nested loop to compare ranges between all rows
        for index2, row2 in x.drop(index).iterrows():
            rs2 = row2['range_start']
            re2 = row2['range_end']
            readRegion = [*range(rs, re, 1)]
            refRegion = [*range(rs2, re2, 1)]
            regionUnion = set(readRegion).intersection(set(refRegion))
            overlap_length = len(regionUnion)
            overlap_min = min(rs, rs2)
            overlap_max = max(re, re2)
            overlap_full_range = overlap_max - overlap_min
            overlap_percentage = (overlap_length / overlap_full_range) * 100
            ## Check if they overlap by x_percentage and retain the higher score
            if overlap_percentage > x_percentage:
                evalue = row['score']
                evalue_2 = row2['score']
                if evalue_2 > evalue:
                    df_overlap = df_overlap.append(row2)
                else:
                    df_overlap = df_overlap.append(row)
            #----------------------------------------------------------
            ## How to find non-overlapping rows without pulling everything?
            else:
                df_nonoverlap = df_nonoverlap.append(row)
    # ---------------------------------------------
    ### Recursion here to condense overlapped list further
    if len(df_overlap) > 1:
        overlap_retention(df_overlap, threshold, df_nonoverlap)
    else:
        return(df_nonoverlap)
An example input is below:
data = {'id':['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
'range_start':[1,12,11,1,20, 10],
'range_end':[4,15,15,6,23,16],
'score':[3,1,8,2,5,1]}
input = pd.DataFrame(data, columns=['id', 'range_start', 'range_end', 'score'])
The desired output can change based on the overlap threshold. In the example above id1 and id4 may both be retained or simply id1 depending on the overlap threshold:
data = {'id':['id1', 'id3', 'id5'],
'range_start':[1,11,20],
'range_end':[4,15,23],
'score':[3,8,5]}
output = pd.DataFrame(data, columns=['id', 'range_start', 'range_end', 'score'])
You can make a cartesian join between all the ranges, then find length and % of the overlap for each pair, and filter it based on the x_overlap threshold.
After that, for each range we can find the overlapping range with the highest score (which could be the range itself, with the overlap of 100%):
# set min overlap parameter
x_overlap = 0.5

# cartesian join all ranges
z = df.assign(k=1).merge(
    df.assign(k=1), on='k', suffixes=['_1', '_2'])

# find lengths of overlaps
z['len_overlap'] = (
    z[['range_end_1', 'range_end_2']].min(axis=1) -
    z[['range_start_1', 'range_start_2']].max(axis=1)).clip(0)

# we're only interested in cases where ranges overlap, so the total
# range is the range between min(start1, start2) and max(end1, end2)
z['len_total'] = (
    z[['range_end_1', 'range_end_2']].max(axis=1) -
    z[['range_start_1', 'range_start_2']].min(axis=1)).clip(0)

# find % overlap and filter out pairs above threshold
# these include 'pairs' where a range is paired to itself
z['pct_overlap'] = z['len_overlap'] / z['len_total']
z = z[z['pct_overlap'] > x_overlap]

# for each range find an overlapping range with the highest score
# (could be the range itself)
z = z.sort_values('score_2').groupby('id_1')['id_2'].last()

# filter the inputs
df_out = df[df['id'].isin(z)]
df_out
Output:
id range_start range_end score
0 id1 1 4 3
2 id3 11 15 8
4 id5 20 23 5
P.S. Please note that it is not very clear what should happen with id4 in your example. Since you don't have it in your output, I assumed (hopefully correctly) that you're not interested in zero-length ranges in the output.
P.P.S. There is a new syntax for cartesian joins in pandas 1.2.0+ with the how='cross' parameter in the merge method. In my answer I've used a version with a dummy variable k=1, which is more verbose but compatible with older versions.
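For illustration, the cross-join form of the same step (a sketch, requires pandas 1.2.0+):

# Cartesian product of all ranges without the dummy key k
z = df.merge(df, how="cross", suffixes=["_1", "_2"])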
I think you need a very clear definition of overlap. If you have [2;7], [6;10] and [7;8], which one overlaps with which one?
Avoid using input as a variable name, it shadows the function input() (to get input from the user)
If you want to select clear overlaps (only the start or the end differs), and you only have at most ONE overlap, here you go:
sorted_df = df.sort_values(by=["range_start"])
starts_earlier = sorted_df[sorted_df.range_end.shift(-1) == sorted_df.range_end]
sorted_df = df.sort_values(by=["range_end"])
ends_earlier = sorted_df[sorted_df.range_start.shift(-1) == sorted_df.range_start]
Then you can do df.drop(starts_earlier.index) and df.drop(ends_earlier.index) to remove the shorter ones.
df.shift() : https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html
This code won't work for multiple overlapping segments. If you are interested in that, let me know.
I have a pandas dataframe as shown below. There are many more columns in that frame that are not important for this task. The column id shows the sentence ID, while the columns e1 and e2 contain entities (= words) of the sentence, with their relationship in the column r.
id e1 e2 r
10 a-5 b-17 A
10 b-17 a-5 N
17 c-1 a-23 N
17 a-23 c-1 N
17 d-30 g-2 N
17 g-20 d-30 B
I also created a graph for each sentence. The graph is created from a list of edges that looks somewhat like this
[('wordB-5', 'wordA-1'), ('wordC-8', 'wordA-1'), ...]
All of those edges are in one list (of lists). Each element in that list contains all the edges of each sentence. Meaning list[0] has the edges of sentence 0 and so on.
Now I want to perform operations like these:
graph = nx.Graph(graph_edges[i])
shortest_path = nx.shortest_path(graph, source="e1", target="e2")
result_length = len(shortest_path)
result_path = shortest_path
For each row in the data frame, I'd like to calculate the shortest path (from the entity in e1 to the entity in e2) and save all of the results in a new column of the DataFrame, but I have no idea how to do that.
I tried using constructions such as these
e1 = DF["e1"].tolist()
e2 = DF["e2"].tolist()
for id in Df["sentenceID"]:
graph = nx.Graph(graph_edges[id])
shortest_path = nx.shortest_path(graph,source=e1, target=e2)
result_length = len(shortest_path)
result_path = shortest_path
to create the data but it says the target is not in the graph.
new df=
id e1 e2 r length path
10 a-5 b-17 A 4 ..
10 b-17 a-5 N 4 ..
17 c-1 a-23 N 3 ..
17 a-23 c-1 N 3 ..
17 d-30 g-2 N 7 ..
17 g-20 d-30 B 7 ..
Here's one way to do what you are trying to do, in three distinct steps so that it is easier to follow along.
Step 1: From a list of edges, build the networkx graph object.
Step 2: Create a data frame with 2 columns (For each row in this DF, we want the shortest distance and path from the e1 column to the entity in e2)
Step 3: Row by row for the DF, calculate shortest path and length. Store them in the DF as new columns.
Step 1: Build the graph and add edges, one by one
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
elist = [[('a-5', 'b-17'), ('b-17', 'c-1')],   #sentence 1
         [('c-1', 'a-23'), ('a-23', 'c-1')],   #sentence 2
         [('b-17', 'g-2'), ('g-20', 'c-1')]]   #sentence 3

graph = nx.Graph()
for sentence_edges in elist:
    for fromnode, tonode in sentence_edges:
        graph.add_edge(fromnode, tonode)

nx.draw(graph, with_labels=True, node_color='lightblue')
Step 2: Create a data frame of desired distances
#Create a data frame to store distances from the element in column e1 to e2
DF = pd.DataFrame({"e1":['c-1', 'a-23', 'c-1', 'g-2'],
"e2":['b-17', 'a-5', 'g-20', 'g-20']})
DF
Step 3: Calculate Shortest path and length, and store in the data frame
This is the final step. Calculate shortest paths and store them.
pathlist, len_list = [], []  #placeholders
for row in DF.itertuples():
    so, tar = row[1], row[2]
    path = nx.shortest_path(graph, source=so, target=tar)
    length = nx.shortest_path_length(graph, source=so, target=tar)
    pathlist.append(path)
    len_list.append(length)

#Add these lists as new columns in the DF
DF['length'] = len_list
DF['path'] = pathlist
Which produces the desired resulting data frame.
Hope this helps you.
For anyone that's interested in the solution (thanks to Ram Narasimhan):
pathlist, len_list = [], []
so, tar = DF["e1"].tolist(), DF["e2"].tolist()
id = DF["id"].tolist()

for _, s, t in zip(id, so, tar):
    graph = nx.Graph(graph_edges[_])  #Constructing each Graph
    try:
        path = nx.shortest_path(graph, source=s, target=t)
        length = nx.shortest_path_length(graph, source=s, target=t)
        pathlist.append(path)
        len_list.append(length)
    except nx.NetworkXNoPath:
        path = "No Path"
        length = "No Pathlength"
        pathlist.append(path)
        len_list.append(length)

#Add these lists as new columns in the DF
DF['length'] = len_list
DF['path'] = pathlist
I'm making my way around GroupBy, but I still need some help. Let's say that I have a DataFrame with a column Group giving each object's group number, some parameter R, and spherical coordinates RA and Dec. Here is a mock DataFrame:
df = pd.DataFrame({
'R' : (-21.0,-21.5,-22.1,-23.7,-23.8,-20.4,-21.8,-19.3,-22.5,-24.7,-19.9),
'RA': (154.362789,154.409301,154.419191,154.474165,154.424842,162.568516,8.355454,8.346812,8.728223,8.759622,8.799796),
'Dec': (-0.495605,-0.453085,-0.481657,-0.614827,-0.584243,8.214719,8.355454,8.346812,8.728223,8.759622,8.799796),
'Group': (1,1,1,1,1,2,2,2,2,2,2)
})
I want to build a selection containing, for each group, the "brightest" object, i.e. the one with the smallest R (or the greatest absolute value, since R is negative), and the 3 closest objects of the group (so I keep 4 objects in each group; we can assume that there is no group smaller than 4 objects if needed).
We assume here that we have defined the following functions:
#deg to rad
def d2r(x):
    return x * np.pi / 180.0

#rad to deg
def r2d(x):
    return x * 180.0 / np.pi

#Computes separation on a sphere
def calc_sep(phi1, theta1, phi2, theta2):
    return np.arccos(np.sin(theta1)*np.sin(theta2) +
                     np.cos(theta1)*np.cos(theta2)*np.cos(phi2 - phi1))
and that separation between two objects is given by r2d(calc_sep(RA1,Dec1,RA2,Dec2)), with RA1 as RA for the first object, and so on.
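Taking that convention at face value, the separation between the first two rows of the mock frame would be computed as follows (just an illustration of the stated calling convention):

# Separation between objects 0 and 1, following the question's convention above
sep_01 = r2d(calc_sep(df.RA[0], df.Dec[0], df.RA[1], df.Dec[1]))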
I can't figure out how to use GroupBy to achieve this...
What you can do here is build a more specific helper function that gets applied to each "sub-frame" (each group).
GroupBy is really just a facility that creates something like an iterator of (group id, DataFrame) pairs, and a function is applied to each of these when you call .groupby().apply. (That glazes over a lot of details, see here for some details on internals if you're interested.)
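A tiny illustration of that view, using the mock frame from the question:

# Each iteration yields a (group id, sub-DataFrame) pair
for gid, sub in df.groupby('Group'):
    print(gid, sub.shape)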
So after defining your three NumPy-based functions, also define:
def sep_df(df, keep=3):
    min_r = df.loc[df.R.argmin()]
    RA1, Dec1 = min_r.RA, min_r.Dec
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    idx = sep.nsmallest(keep+1).index
    return df.loc[idx]
Then just apply and you get a MultiIndex DataFrame where the first index level is the group.
print(df.groupby('Group').apply(sep_df))
Dec Group R RA
Group
1 3 -0.61483 1 -23.7 154.47416
2 -0.48166 1 -22.1 154.41919
0 -0.49561 1 -21.0 154.36279
4 -0.58424 1 -23.8 154.42484
2 8 8.72822 2 -22.5 8.72822
10 8.79980 2 -19.9 8.79980
6 8.35545 2 -21.8 8.35545
9 8.75962 2 -24.7 8.75962
With some comments interspersed:
def sep_df(df, keep=3):
    # Applied to each sub-Dataframe (this is what GroupBy does under the hood)
    # Get RA and Dec values at minimum R
    min_r = df.loc[df.R.argmin()]    # Series - row at which R is minimum
    RA1, Dec1 = min_r.RA, min_r.Dec  # Relevant 2 scalars within this row
    # Calculate separation for each pair including minimum R row
    # The result is a series of separations, same length as `df`
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    # Get index values of `keep` (default 3) smallest results
    # Retain `keep+1` values because one will be the minimum R
    # row where separation=0
    idx = sep.nsmallest(keep+1).index
    # Restrict the result to those 3 index labels + your minimum R
    return df.loc[idx]
For speed, consider passing sort=False to GroupBy if the result still works for you.
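That is, something along these lines:

# Skip sorting the group keys when the output order doesn't matter
df.groupby('Group', sort=False).apply(sep_df)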
I want to build a selection containing for each group the "brightest" object...and the 3 closest objects of the group
step 1:
create a dataframe for the brightest object in each group
maxR = df.sort_values('R').groupby('Group')['Group', 'Dec', 'RA'].head(1)
step 2:
merge the two frames on Group & calculate the separation
merged = df.merge(maxR, on = 'Group', suffixes=['', '_max'])
merged['sep'] = merged.apply(
    lambda x: r2d(calc_sep(x.RA, x.Dec, x.RA_max, x.Dec_max)),
    axis=1
)
step 3:
order the data frame, group by 'Group', (optional) discard intermediate fields & take the first 4 rows from each group
finaldf = merged.sort_values(['Group', 'sep'], ascending=[1, 1]
                             ).groupby('Group')[df.columns].head(4)
Produces the following data frame with your sample data:
Dec Group R RA
4 -0.584243 1 -23.8 154.424842
3 -0.614827 1 -23.7 154.474165
2 -0.481657 1 -22.1 154.419191
0 -0.495605 1 -21.0 154.362789
9 8.759622 2 -24.7 8.759622
8 8.728223 2 -22.5 8.728223
10 8.799796 2 -19.9 8.799796
6 8.355454 2 -21.8 8.355454
consider the df
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
df
I want to calculate the sum over a trailing 5 days, every 3 days.
I originally posted an expected output here, but what I had was incorrect: @ivan_pozdeev and @boud noticed that it described a centered window, which was not my intention. Apologies for the confusion.
Everyone's solutions capture much of what I was after.
criteria
I'm looking for smart efficient solutions that can be scaled to large data sets.
I'll be timing solutions and also considering elegance.
Solutions should also be generalizable for a variety of sample and look back frequencies.
from comments
I want a solution that generalizes to handle a look back of a specified frequency and grab anything that falls within that look back.
for the sample above, the look back is 5D and there may be 4 or 50 observations that fall within that look back.
I want the timestamp to be the last observed timestamp within the look back period.
The df you gave us is:
A
2012-12-31 0
2013-01-01 1
2013-01-02 2
2013-01-03 3
2013-01-04 4
2013-01-05 5
2013-01-06 6
2013-01-07 7
2013-01-08 8
2013-01-09 9
2013-01-10 10
You could create your rolling 5-day sum series and then resample it. I can't think of a more efficient way than this; overall it should be relatively time efficient.
df.rolling(5,min_periods=5).sum().dropna().resample('3D').first()
Out[36]:
A
2013-01-04 10.0000
2013-01-07 25.0000
2013-01-10 40.0000
Listed here are a few NumPy based solutions using bin-based summing, covering basically three scenarios.
Scenario #1 : Multiple entries per date, but no missing dates
Approach #1 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app1(df):
    # Extract the index names and values
    vals = df.A.values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False, indx[1:] > indx[:-1])
    date_id = mask.cumsum()
    search_id = np.hstack((0, np.arange(2, date_id[-1], 3), date_id[-1]+1))
    shifts = np.searchsorted(date_id, search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)), reps)

    # Perform bin based summing and subtract the repeated ones
    IDsums = np.bincount(id_arr, vals)
    allsums = IDsums[:-1] + IDsums[1:]
    allsums[1:] -= np.bincount(date_id, vals)[search_id[1:-2]]

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]]   # Use last date of group
    return pd.DataFrame(allsums, index=out_index, columns=['A'])
Approach #2 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app2(df):
    # Extract the index names and values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False, indx[1:] > indx[:-1])
    date_id = mask.cumsum()

    # Generate IDs at which shifts are to happen for a (2,3,5,8..) pattern
    # Pad with 0 and length of array at either ends as we use diff later on
    shiftIDs = (np.arange(2, date_id[-1], 3)[:, None] + np.arange(2)).ravel()
    search_id = np.hstack((0, shiftIDs, date_id[-1]+1))

    # Find the start of those shifting indices
    # Generate ID based on shifts and do bin based summing of dataframe
    shifts = np.searchsorted(date_id, search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)), reps)
    IDsums = np.bincount(id_arr, df.A.values)

    # Sum each group of 3 elems with a stride of 2, make dataframe if needed
    allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]]   # Use last date of group
    return pd.DataFrame(allsums, index=out_index, columns=['A'])
Approach #3 :
def vectorized_app3(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False, dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(), df.A.values)
    out = np.convolve(c, np.ones(W, dtype=int), 'valid')[::S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out, index=out_index, columns=['A'])
We could replace the convolution part with direct sliced summation for a modified version of it -
def vectorized_app3_v2(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False, dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(), df.A.values)
    f = c.size + S - W
    out = c[:f:S].copy()
    for i in range(1, W):
        out += c[i:f+i:S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out, index=out_index, columns=['A'])
Scenario #2 : Multiple entries per date and missing dates
Approach #4 :
def vectorized_app4(df, S=3, W=5):
    dt = df.index.values
    indx = np.append(0, ((dt[1:] - dt[:-1]) // 86400000000000).astype(int)).cumsum()
    WL = ((indx[-1]+1) // S)
    c = np.bincount(indx, df.A.values, minlength=S*WL + (W-S))
    out = np.convolve(c, np.ones(W, dtype=int), 'valid')[::S]
    grp0_lastdate = dt[0] + np.timedelta64(W-1, 'D')
    freq_str = str(S) + 'D'
    grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values
    out_index = dt[dt.searchsorted(grp_last_dt, 'right') - 1]
    return pd.DataFrame(out, index=out_index, columns=['A'])
Scenario #3 : Consecutive dates and exactly one entry per date
Approach #5 :
def vectorized_app5(df, S=3, W=5):
    vals = df.A.values
    N = (df.shape[0] - W + 2*S - 1) // S
    n = vals.strides[0]
    out = np.lib.stride_tricks.as_strided(vals, shape=(N, W),
                                          strides=(S*n, n)).sum(1)
    index_idx = (W-1) + S*np.arange(N)
    out_index = df.index[index_idx]
    return pd.DataFrame(out, index=out_index, columns=['A'])
Suggestions for creating test-data
Scenario #1 :
# Setup input for multiple dates, but no missing dates
S = 4 # Stride length (Could be edited)
W = 7 # Window length (Could be edited)
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
start_df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
reps = np.random.randint(1,4,(len(start_df)))
idx0 = np.repeat(start_df.index,reps)
df_data = np.random.randint(0,9,(len(idx0)))
df = pd.DataFrame(df_data,index=idx0,columns=['A'])
Scenario #2 :
To create setup for multiple dates and with missing dates, we could just edit the df_data creation step, like so -
df_data = np.random.randint(0,9,(len(idx0)))
Scenario #3 :
# Setup input for exactly one entry per date
S = 4 # Could be edited
W = 7
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
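As a usage sketch on the Scenario #3 frame just built: vectorized_app3 through vectorized_app5 accept S and W as arguments, so they can be passed explicitly to match the S=4, W=7 chosen in the setup (the first two approaches are hard-coded to W=5, S=3).

# Scenario #3 has exactly one entry per consecutive date, so the strided approach applies directly
out = vectorized_app5(df, S=4, W=7)
print(out)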
If the dataframe is sorted by date, what we actually have is iterating over an array while calculating something.
Here's the algorithm that calculates sums all in one iteration over the array. To understand it, see a scan of my notes below. This is the base, unoptimized version intended to showcase the algorithm (optimized ones for Python and Cython follow), and list(<call>) takes ~500 ms for an array of 100k on my system (P4). Since Python ints and ranges are relatively slow, this should benefit tremendously from being transferred to C level.
from __future__ import division
import numpy as np
#The date column is unimportant for calculations.
# I leave extracting the numbers' column from the dataframe
# and adding a corresponding element from data column to each result
# as an exercise for the reader
data = np.random.randint(100,size=100000)
def calc_trailing_data_with_interval(data, n, k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    #type data: ndarray
    #param n: number of trailing elements to sum up
    #param k: interval with which to calculate sums
    """
    lim_index = len(data) - k + 1
    nsums = int(np.ceil(n/k))
    sums = np.zeros(nsums, dtype=data.dtype)
    M = n % k
    Mp = k - M

    index = 0
    currentsum = 0
    while index < lim_index:
        for _ in range(Mp):
            #np.take is awkward, requiring a full list of indices to take
            for i in range(currentsum, currentsum + nsums - 1):
                sums[i % nsums] += data[index]
            index += 1
        for _ in range(M):
            sums += data[index]
            index += 1
        yield sums[currentsum]
        currentsum = (currentsum + 1) % nsums
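Usage sketch matching the question's parameters (trailing window n=5, sampled every k=3 elements), materialising the generator as in the timing quoted above:

result = list(calc_trailing_data_with_interval(data, n=5, k=3))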
Note that it produces the first sum at kth element, not nth (this can be changed but by sacrificing elegance - a number of dummy iterations before the main loop - and is more elegantly done by prepending data with extra zeros and discarding a number of first sums)
It can easily be generalized to any operation by replacing sums[slice]+=data[index] with operation(sums[slice],data[index]) where operation is a parameter and should be a mutating operation (like ndarray.__iadd__).
Parallelizing between any number of workers by splitting the data is just as easy (if n>k, chunks after the first one should be fed extra elements at the start).
To deduce the algorithm, I wrote a sample for a case where a decent number of sums are calculated simultaneously in order to see patterns (click the image to see it full-size).
Optimized: pure Python
Caching range objects brings the time down to ~300ms. Surprisingly, numpy functionality is of no help: np.take is unusable, and replacing currentsum logic with static slices and np.roll is a regression. Even more surprisingly, the benefit of saving output to an np.empty as opposed to yield is nonexistent.
def calc_trailing_data_with_interval(data, n, k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    #type data: ndarray
    #param n: number of trailing elements to sum up
    #param k: interval with which to calculate sums
    """
    lim_index = len(data) - k + 1
    nsums = int(np.ceil(n/k))
    sums = np.zeros(nsums, dtype=data.dtype)
    M = n % k
    Mp = k - M
    RM = range(M)    #cache for efficiency
    RMp = range(Mp)  #cache for efficiency

    index = 0
    currentsum = 0
    currentsum_ranges = [range(currentsum, currentsum + nsums - 1)
                         for currentsum in range(nsums)]  #cache for efficiency
    while index < lim_index:
        for _ in RMp:
            #np.take is unusable as it allocates another array rather than view
            for i in currentsum_ranges[currentsum]:
                sums[i % nsums] += data[index]
            index += 1
        for _ in RM:
            sums += data[index]
            index += 1
        yield sums[currentsum]
        currentsum = (currentsum + 1) % nsums
Optimized: Cython
Statically typing everything in Cython instantly speeds things up to 150ms. And (optionally) assuming np.int as dtype to be able to work with data at C level brings the time down to as little as ~11ms. At this point, saving to an np.empty does make a difference, saving an unbelievable ~6.5ms, totalling ~5.5ms.
import numpy as np
cimport numpy as np   # both imports are needed at the top of the .pyx file for the typed declarations

def calc_trailing_data_with_interval(np.ndarray data, int n, int k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    #type data: 1-d ndarray
    #param n: number of trailing elements to sum up
    #param k: interval with which to calculate sums
    """
    if not data.ndim == 1: raise TypeError("One-dimensional array required")
    cdef int lim_index = data.size - k + 1
    cdef np.ndarray result = np.empty(data.size//k, dtype=data.dtype)
    cdef int rindex = 0
    cdef int nsums = int(np.ceil(float(n)/k))
    cdef np.ndarray sums = np.zeros(nsums, dtype=data.dtype)

    #optional speedup for dtype=np.int
    cdef bint use_int_buffer = data.dtype == np.int and data.flags.c_contiguous
    cdef int[:] cdata = data
    cdef int[:] csums = sums
    cdef int[:] cresult = result

    cdef int M = n % k
    cdef int Mp = k - M

    cdef int index = 0
    cdef int currentsum = 0
    cdef int _, i
    while index < lim_index:
        for _ in range(Mp):
            #np.take is unusable as it allocates another array rather than view
            for i in range(currentsum, currentsum + nsums - 1):
                if use_int_buffer: csums[i % nsums] += cdata[index]   #optional speedup
                else: sums[i % nsums] += data[index]
            index += 1
        for _ in range(M):
            if use_int_buffer:
                for i in range(nsums): csums[i] += cdata[index]   #optional speedup
            else: sums += data[index]
            index += 1
        if use_int_buffer: cresult[rindex] = csums[currentsum]   #optional speedup
        else: result[rindex] = sums[currentsum]
        currentsum = (currentsum + 1) % nsums
        rindex += 1
    return result
For regularly-spaced dates only
Here are two methods, first a pandas way and second a numpy function.
>>> n=5 # trailing periods for rolling sum
>>> k=3 # frequency of rolling sum calc
>>> df.rolling(n).sum()[-1::-k][::-1]
A
2013-01-01 NaN
2013-01-04 10.0
2013-01-07 25.0
2013-01-10 40.0
And here's a numpy function (adapted from Jaime's numpy moving_average):
def rolling_sum(a, n=5, k=3):
    ret = np.cumsum(a.values)
    ret[n:] = ret[n:] - ret[:-n]
    return pd.DataFrame(ret[n-1:][-1::-k][::-1],
                        index=a[n-1:][-1::-k][::-1].index)

rolling_sum(df, n=6, k=4)   # default n=5, k=3
For irregularly-spaced dates (or regularly-spaced)
Simply precede with:
df.resample('D').sum().fillna(0)
For example, the above methods become:
df.resample('D').sum().fillna(0).rolling(n).sum()[-1::-k][::-1]
and
rolling_sum( df.resample('D').sum().fillna(0) )
Note that dealing with irregularly-spaced dates can be done simply and elegantly in pandas as this is a strength of pandas over almost anything else out there. But you can likely find a numpy (or numba or cython) approach that will trade off some simplicity for an increase in speed. Whether this is a good tradeoff will depend on your data size and performance requirements, of course.
For the irregularly spaced dates, I tested on the following example data and it seemed to work correctly. This will produce a mix of missing, single, and multiple entries per date:
np.random.seed(12345)
per = 11
tidx = np.random.choice( pd.date_range('2012-12-31', periods=per, freq='D'), per )
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx).sort_index()
This isn't quite perfect yet, but I've gotta go make fake blood for a Halloween party tonight... You should be able to see what I was getting at through the comments. One of the biggest speedups is finding the window edges with np.searchsorted. It doesn't quite work yet, but I'd bet it's just some index offsets that need tweaking.
import pandas as pd
import numpy as np
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
sample_freq = 3 #days
sample_width = 5 #days
sample_freq *= 86400 #seconds per day
sample_width *= 86400 #seconds per day
times = df.index.astype(np.int64)//10**9 #array of timestamps (unix time)
cumsum = np.cumsum(df.A).as_matrix() #array of cumulative sums (could eliminate extra summation with large overlap)
mat = np.array([times, cumsum]) #could eliminate temporary times and cumsum vars
def yieldstep(mat, freq):
    normtime = ((mat[0] - mat[0,0]) / freq).astype(int)   #integer numbers indicating sample number
    for i in range(max(normtime)+1):
        yield np.searchsorted(normtime, i)   #yield beginning of window index

def sumwindow(mat, i, width):   #i is the start of the window returned by yieldstep
    normtime = ((mat[0,i:] - mat[0,i]) / width).astype(int)   #same as before, but we norm to window width
    j = np.searchsorted(normtime, i, side='right') - 1   #find the right side of the window
    #return rightmost timestamp of window in seconds from unix epoch and sum of window
    return mat[0,j], mat[1,j] - mat[1,i]   #sum of window is just end - start because we did a cumsum earlier
windowed_sums = np.array([sumwindow(mat, i, sample_width) for i in yieldstep(mat, sample_freq)])
Looks like a rolling centered window where you pick up data every n days:
def rolleach(df, ndays, window):
    return df.rolling(window, center=True).sum()[ndays-1::ndays]
rolleach(df, 3, 5)
Out[95]:
A
2013-01-02 10.0
2013-01-05 25.0
2013-01-08 40.0