Pandas really slow join - python

I am trying to merge two DataFrames, where for each id_key I want to use the row with the most recent date from the right DataFrame. Note that the dates are not sorted, so it is not possible to simply use groupby.first() or groupby.last().
Left DataFrame (n=834,570) | Right DataFrame (n=1,592,005)
id_key                     | id_key  date        other_vars
1                          | 1       2015-07-06  ...
2                          | 1       2015-07-07  ...
3                          | 1       2014-04-04  ...
Using the groupby/agg approach below, it takes 8 minutes! When I convert the dates to integers, it takes 6 minutes.
gb = right.groupby('id_key')
gb.agg(lambda x: x.iloc[x.date.argmax()])
I used my own version, where I build a dictionary keyed by id that stores the date and index of the highest date seen so far. You just iterate over the whole data once, ending up with a dictionary {id_key: [highest_date, index]}.
This way, it is really fast to find just the rows that are needed.
It only takes 6 seconds to end up with the merged data; about an 85x speedup.
I have to admit I'm very surprised, as I thought pandas would be optimised for this. Does anyone have an idea what is going on, and whether the dictionary method should also be an option in pandas? It would also be simple to adapt this to other conditions, of course, like sum, min, etc.
My code:
# 1. Create a dictionary mapping id_key -> (highest date seen so far, row index)
dc = {}
for ind, (ik, d) in enumerate(zip(right['id_key'], right['date'])):
    if ik not in dc:
        dc[ik] = (d, ind)
        continue
    if (d, ind) > dc[ik]:
        dc[ik] = (d, ind)

# 2. Collect the indices all at once (repeated subsetting was slow), so we only subset once.
# inds has the same number of rows as left.
inds = []
for x in left['id_key']:
    # if the key is missing (very, very few), re-append the last row that was found
    if x in dc:
        row = dc[x][1]
    inds.append(row)

# 3. Take the values
result = right.iloc[inds]
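For reference, the usual vectorized route in pandas, and a useful baseline against the dictionary approach, is a per-group idxmax on the date. A minimal sketch, reusing the left and right frames above and assuming right['date'] is a proper datetime column with no missing values:
# keep, for each id_key, the row of `right` with the most recent date
latest = right.loc[right.groupby('id_key')['date'].idxmax()]

# then it is a plain left merge
result = left.merge(latest, on='id_key', how='left')
The per-row Python lambda inside agg is usually what dominates the 8-minute runtime; idxmax keeps the work inside pandas/NumPy.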

Related

How to exclude elements contained in another column - Pyspark DataFrame

Imagine you have a pyspark data frame df with three columns: A, B, C. I want to take the rows in the data frame where the value of B does not exist in C.
Example:
A B C
a 1 2
b 2 4
c 3 6
d 4 8
would return
A B C
a 1 2
c 3 6
What I tried
df.filter(~df.B.isin(df.C))
I also tried making the values of B into a list, but that takes a significant amount of time.
The problem is how you're using isin. For better or worse, isin can't actually handle another pyspark Column object as an input; it needs an actual collection. So one thing you could do is convert your column to a list:
col_values = df.select("C").rdd.flatMap(lambda x: x).collect()
df.filter(~df.B.isin(col_values))
Performance-wise, though, this is not ideal, as your master node is now in charge of holding the entire contents of that single column in memory. You could use a left anti join instead to get the result you need without having to transform anything into a list, and without losing the efficiency of Spark's distributed computing:
df0 = df[["C"]].withColumnRenamed("C", "B")
df.join(df0, "B", "leftanti").show()
Thanks to Emma in the comments for her contribution.
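For completeness, a minimal self-contained sketch of the anti-join approach; the sample data mirrors the question, and an existing SparkSession named spark is assumed:
# hypothetical setup so the example runs on its own
df = spark.createDataFrame(
    [("a", 1, 2), ("b", 2, 4), ("c", 3, 6), ("d", 4, 8)],
    ["A", "B", "C"],
)

df0 = df.select("C").withColumnRenamed("C", "B")
df.join(df0, "B", "leftanti").show()
# keeps only rows a (B=1) and c (B=3), since 2 and 4 also appear in column C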

python, pandas, How to find connections between each group

I am having trouble finding the connections between groups based on the associated data (groupby maybe?) in order to create a network.
For each group, if they share the same element, they are connected.
For example, my data frame looks like this:
group_number data
1 a
2 a
2 b
2 c
2 a
3 c
4 a
4 c
So the output would be
Source_group Target_group Frequency
2 1 1 (because a-a)
3 2 1 (because c-c)
4 2 2 (because a-a, c-c)
Of course the (because ...) part will not be in the output; it is just an explanation.
Thank you very much
I thought about your problem. You could do something like the following:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'group_number': [1, 2, 2, 2, 2, 3, 4, 4],
                   'data': ['a', 'a', 'b', 'c', 'a', 'c', 'a', 'c']})

# group the data using a multiindex and convert it to a dictionary
d = defaultdict(dict)
for multiindex, group in df.groupby(['group_number', 'data']):
    d[multiindex[0]][multiindex[1]] = group.data.size

# iterate over the groups twice to compare every group
# with every other group
relationships = []
for key, val in d.items():
    for k, v in d.items():
        if key != k:
            # get the references to the two compared groups
            current_row_rel = {}
            current_row_rel['Source_group'] = key
            current_row_rel['Target_group'] = k
            # this is the important part, but at this point
            # you are basically comparing the intersection of two
            # simple python collections
            current_row_rel['Frequency'] = len(set(val).intersection(v))
            relationships.append(current_row_rel)

# convert the result to a pandas DataFrame for further analysis
df = pd.DataFrame(relationships)
I'm sure that this could be done without the need to convert to a list of dictionaries; I find this solution more straightforward, however.
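For completeness, a merge-based sketch that avoids the explicit double loop (same df construction as above; group_number_x and group_number_y are the suffixes pandas adds on the self-merge). Like the loop, it counts distinct shared elements, but it lists each pair only once, with the larger group number as the source, so it also reports the (4, 1) and (4, 3) connections that follow the same rule:
import pandas as pd

df = pd.DataFrame({'group_number': [1, 2, 2, 2, 2, 3, 4, 4],
                   'data': ['a', 'a', 'b', 'c', 'a', 'c', 'a', 'c']})

# self-join the de-duplicated (group, element) pairs on the shared element
pairs = df.drop_duplicates().merge(df.drop_duplicates(), on='data')

# keep each pair of distinct groups once, larger group number as the source
pairs = pairs[pairs['group_number_x'] > pairs['group_number_y']]

result = (pairs.groupby(['group_number_x', 'group_number_y'])
               .size()
               .reset_index(name='Frequency')
               .rename(columns={'group_number_x': 'Source_group',
                                'group_number_y': 'Target_group'}))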

How to shift Index of Series by 1 row in another pandas TimeIndex?

I have a pd.DatetimeIndex named "raw_Ix", which contains all the indices I am working with, and two pandas (time) Series, "t1" and "nextloc_ixS" (both with the same time index).
The values in "nextloc_ixS" are the indices of t1.index shifted forward by one position within raw_Ix. To better understand what "nextloc_ixS" is:
nextloc_ixS = t1.index[np.searchsorted(raw_Ix, t1.index)+1]
nextloc_ixS = pd.DataFrame(nextloc_ixS, index = t1.index)
All three get passed into a function, where I need them in the following form:
1. I need to drop the t1 rows whose index is not in raw_Ix (to avoid errors, since raw_Ix could have been manipulated).
2. After that, I copy t1 deeply (let's call it t1_copy), because I need the values of nextloc_ixS as the new DatetimeIndex of t1_copy (sounds simple, but this is where I ran into difficulties).
3. Before I replace the index, I might need to save the old index of t1_copy as a column in t1_copy, for the last step (step 5).
4. The actual function selects some indices of t1_copy in a specific procedure and returns "result", which is a pd.DatetimeIndex containing some indices of t1_copy, with duplicates.
5. I need to shift "result" back by 1, but not via np.searchsorted. (Note: "result" is still artificially shifted forward, so we can set it back by getting the index locations in t1_copy.index and then looking up the "old" indices in the backup column from step 3.)
I know it sounds a bit complicated, so here is the inefficient code I have been working on:
def main(raw_Ix, t1, nextloc, nextloc_ixS=None):
    t1_copy = t1.copy(deep=True).to_frame()
    if nextloc_ixS is not None:
        nextloc_ixS = nextloc_ixS.to_frame()
        t1_copy = t1_copy.loc[t1_copy.index.intersection(pd.DatetimeIndex(raw_Ix))]
        # somehow duplicates came up, I couldn't explain why, hence the .duplicated(...)
        t1_copy = t1_copy[~t1_copy.index.duplicated(keep='first')]
        t1_copy["index_old"] = t1_copy.index.copy(deep=True)
        temp = nextloc_ixS.loc[nextloc_ixS.index.intersection(raw_Ix)].copy(deep=True)
        t1_copy.set_index(pd.DatetimeIndex(temp[~temp.index.duplicated(keep='first')].values), inplace=True)
    else:  # in this case we just need the intersection
        t1_copy = t1_copy.loc[t1.index.intersection(pd.DatetimeIndex(raw_Ix))]
        t1_copy = t1_copy[~t1_copy.index.duplicated(keep='first')]

    # func is a huge nested algorithm; for relevance, it returns indices of t1_copy
    # (same length as t1_copy, randomly chosen, with multiple duplicates)
    result = func(t1_copy, raw_Ix)

    if nextloc:
        # this is just pseudo-code
        result_locations = t1_copy.index.where(result)
        result = t1_copy["index_old"].iloc[result_locations]
So, in a nutshell:
I am trying to shift the index back, and later forward again, while avoiding np.searchsorted() and instead using the two pd.Series (or rather columns, since they get passed separately from a DataFrame).
Is there any way to do that efficiently, both in lines of code and in run time? (There is a very large number of rows.)
Your logic is complex, but it achieves two things:
1. remove rows that are not in a list (I've used a trick for this so I can use dropna())
2. shift() a column
This performs pretty well: a fraction of a second on a dataset of > 0.5m rows.
import datetime as dt
import random
import time

import numpy as np
import pandas as pd

# sample times, with some randomly missing
d = [d for d in pd.date_range(dt.datetime(2015, 5, 1, 2),
                              dt.datetime(2020, 5, 1, 4), freq="128s")
     if random.randint(0, 3) < 2]

# random manipulation of rawIdx so there are some rows where ts is not in rawIdx
df = pd.DataFrame({"ts": d,
                   "rawIdx": [x if random.randint(0, 3) <= 2
                              else x + pd.Timedelta(1, unit="s") for x in d],
                   "val": [random.randint(0, 50) for x in d]}).set_index("ts")

start = time.time()
print(f"size before: {len(df)}")
dfc = df.assign(
    # make it float64 so it can hold NaN; map False to NaN so we can dropna() the rows that are not in rawIdx
    issue=lambda dfa: np.array(np.where(dfa.index.isin(dfa["rawIdx"]), True, np.nan), dtype="float64"),
).dropna().drop(columns="issue").assign(
    # this is then just a straightforward shift; rawIdx is the same as the index due to dropna()
    nextloc_ixS=df.rawIdx.shift(-1),
)
print(f"size after: {len(dfc)}\ntime: {time.time()-start:.2f}s\n\n{dfc.head().to_string()}")
output
size before: 616264
size after: 462207
time: 0.13s
rawIdx val nextloc_ixS
ts
2015-05-01 02:02:08 2015-05-01 02:02:08 33 2015-05-01 02:06:24
2015-05-01 02:06:24 2015-05-01 02:06:24 40 2015-05-01 02:08:33
2015-05-01 02:10:40 2015-05-01 02:10:40 15 2015-05-01 02:12:48
2015-05-01 02:12:48 2015-05-01 02:12:48 45 2015-05-01 02:17:04
2015-05-01 02:17:04 2015-05-01 02:17:04 14 2015-05-01 02:21:21

Python: How to pass the current row and the next row to the DataFrame.apply() method?

I have a DataFrame with thousands of rows. Its structure is as below:
   A  B   C    D
0  q  20  'f'
1  q  14  'd'
2  o  20  'a'
I want to compare the A column of the current row and the next row. If those values are equal, I want to add the B value of the row with the lower B to the D column of the row with the greater B, and then remove the row whose B value was moved. It's like a swap process.
   A  B   C    D
0  q  20  'f'  14
1  o  20  'a'
I have thousands of rows, and the iloc, loc and at methods are slow. At the very least I want to use the DataFrame apply method. I tried some code samples, but they didn't work.
I want to do something like this:
DataFrame.apply(lambda row: self.compare(row, next(row)), axis=1)
I have a compare method, but I couldn't pass the next row to it. How can I pass the next row to the method? I am also open to faster pandas solutions.
Best not to do that with apply as it will be slow; you can look at using shift, e.g.
df['A_shift'] = df['A'].shift(1)
df['Is_Same'] = 0
df.loc[df.A_shift == df.A, 'Is_Same'] = 1
Gets a bit more complicated if you're doing the shift within groups, but still possible.
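To connect this back to the question's concrete example, here is a sketch along the same shift-based lines. It assumes columns named A, B, C as shown, that at most two consecutive rows share the same A, and that the first row of each matching pair is the one to keep:
import pandas as pd

df = pd.DataFrame({'A': ['q', 'q', 'o'],
                   'B': [20, 14, 20],
                   'C': ['f', 'd', 'a']})

# True where this row's A matches the next row's A
same_as_next = df['A'].eq(df['A'].shift(-1))

# give the kept (first) row of each pair the smaller B of the pair as D
pair_min_b = pd.concat([df['B'], df['B'].shift(-1)], axis=1).min(axis=1)
df.loc[same_as_next, 'D'] = pair_min_b[same_as_next]

# drop the second row of each matching pair
df = df[~df['A'].eq(df['A'].shift(1))].reset_index(drop=True)
print(df)
Note that D ends up as a float column because of the NaN in the unmatched row.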

Pandas Cumulative Sum using Current Row as Condition

I've got a fairly large data set of about 2 million records, each of which has a start time and an end time. I'd like to insert a field into each record that counts how many records there are in the table where:
Start time is less than or equal to "this row"'s start time
AND end time is greater than "this row"'s start time
So basically each record ends up with a count of how many events, including itself, are "active" concurrently with it.
I've been trying to teach myself pandas to do this, but I am not even sure where to start looking. I can find lots of examples of summing rows that meet a given condition like "> 2", but I can't seem to grasp how to iterate over the rows to conditionally sum a column based on values in the current row.
You can try the code below to get the final result.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[2, 10], [5, 8], [3, 8], [6, 9]]), columns=["start", "end"])

active_events = {}
for i in df.index:
    active_events[i] = len(df[(df["start"] <= df.loc[i, "start"]) & (df["end"] > df.loc[i, "start"])])

last_columns = pd.DataFrame({'No. active events': pd.Series(active_events)})
df.join(last_columns)
Here goes. This is going to be SLOW.
Note that this counts each row as overlapping with itself, so the results column will never be 0. (Subtract 1 from the result to do it the other way.)
import pandas as pd

df = pd.DataFrame({'start_time': [4, 3, 1, 2], 'end_time': [7, 5, 3, 8]})
df = df[['start_time', 'end_time']]  # just changing the order of the columns for aesthetics

def overlaps_with_row(row, frame):
    starts_before_mask = frame.start_time <= row.start_time
    ends_after_mask = frame.end_time > row.start_time
    return (starts_before_mask & ends_after_mask).sum()

df['number_which_overlap'] = df.apply(overlaps_with_row, frame=df, axis=1)
Yields:
In [8]: df
Out[8]:
start_time end_time number_which_overlap
0 4 7 3
1 3 5 2
2 1 3 1
3 2 8 2
[4 rows x 3 columns]
def counter(s: pd.Series):
    return ((df["start"] <= s["start"]) & (df["end"] >= s["start"])).sum()

df["count"] = df.apply(counter, axis=1)
This feels like a much simpler approach, using the apply method. It doesn't really compromise on speed either: apply, although not as fast as native vectorized functions like cumsum(), should still be faster than an explicit for loop.
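If apply is still too slow at 2 million rows, a fully vectorized sketch can get the same counts with two sorted searches. It uses the start/end column names from the first answer and the strict end > start condition from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [2, 5, 3, 6], 'end': [10, 8, 8, 9]})

starts = np.sort(df['start'].to_numpy())
ends = np.sort(df['end'].to_numpy())

# for each row: (#rows whose start <= this start) - (#rows whose end <= this start)
n_started = np.searchsorted(starts, df['start'].to_numpy(), side='right')
n_ended = np.searchsorted(ends, df['start'].to_numpy(), side='right')
df['count'] = n_started - n_ended
Each row counts itself, matching the behaviour of the answers above.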
