Let's say I have some data like this:
patient_id  timestamp
99          10
99          100
3014        20
3014        200
How exactly would one in pandas be able to find the largest, smallest, and average range of timestamps per patient id?
What I'm looking for is to be able to report this:
shortest range = 90 (100 - 10)
longest range = 180 (200 - 20)
average range = (180 + 90) / 2 = 135
The Setup
Create dummy DataFrame:
import pandas as pd
data = '''99 10
99 100
3014 20
3014 200'''.split('\n')
Using a nested list comprehension, split the rows, then the columns, and convert all elements to int. Then load the result into a DataFrame.
data = [[int(n) for n in item.split()] for item in data]
DF = pd.DataFrame(data, columns=['pid', 'timestamp'])
As a learning exercise, loop through each group (this assumes an arbitrary number of timestamps per pid, not just two). This is not the solution; it is just to demonstrate how groupby works:
for pid, grp in DF.groupby('pid'):
    print(pid, grp.timestamp.min(), grp.timestamp.max())
# Prints:
# 99 10 100
# 3014 20 200
The Solution
The solution is more efficient: get vectors of mins and maxes per pid, compute the ranges, and then find the min, max, and average of those ranges. The strength of Pandas is that it operates on a whole column as a unit, which makes calculations on arrays very simple, like this:
mins = DF.groupby('pid').timestamp.min()
maxs = DF.groupby('pid').timestamp.max()
ranges = maxs - mins
shortest_range = ranges.min()
longest_range = ranges.max()
average_range = ranges.mean()
print(shortest_range, longest_range, average_range)
# 90 180 135.0
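The same numbers can also be computed in a single groupby pass using agg; this is just a compact sketch of the same idea:
spans = DF.groupby('pid').timestamp.agg(['min', 'max'])
ranges = spans['max'] - spans['min']
print(ranges.min(), ranges.max(), ranges.mean())
# 90 180 135.0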
Related
Using pandas/python, I want to calculate the longest increasing subsequence of tuples for each DTE group, but efficiently with 13M rows. Right now, using apply/iteration, takes about 10 hours.
Here's roughly my problem:
DTE  Strike  Bid  Ask
1    100     10   11
1    200     16   17
1    300     17   18
1    400     11   12
1    500     12   13
1    600     13   14
2    100     10   30
2    200     15   20
2    300     16   21
import pandas as pd
df = pd.DataFrame({
    'DTE': [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'Strike': [100, 200, 300, 400, 500, 600, 100, 200, 300],
    'Bid': [10, 16, 17, 11, 12, 13, 10, 15, 16],
    'Ask': [11, 17, 18, 12, 13, 14, 30, 20, 21],
})
I would like to:
group these by DTE. Here we have two groups (DTE 1 and DTE 2). Then within each group...
find the longest paired increasing subsequence. Sort-ordering is determined by Strike, which is unique for each DTE group. So 200 Strike comes after 100 Strike.
thus, the Bid and the Ask of 200 Strike must be greater than or equal to (not strict) the 100 Strike bid and ask.
any strikes in between that do NOT have both bids and asks increasing in value are deleted.
In this case, the answer would be:
DTE  Strike  Bid  Ask
1    100     10   11
1    400     11   12
1    500     12   13
1    600     13   14
2    200     15   20
2    300     16   21
Only the LONGEST increasing subsequence is kept for EACH GROUP, not just any increasing subsequence. All other rows are dropped.
Note that the standard O(n log n) longest increasing subsequence algorithm does not work. See https://www.quora.com/How-can-the-SPOJ-problem-LIS2-be-solved for why. The example group DTE 2 will fail with the standard O(n log n) LIS solution. I am currently using the standard O(n^2) LIS solution. There is a more complicated O(n log^2 n) approach, but I do not think that is my bottleneck.
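To see why on the DTE 2 group: the standard LIS relies on lexicographic tuple comparison, which accepts (10, 30) -> (15, 20) even though the Ask decreases, whereas the requirement here is that Bid and Ask are both non-decreasing. A small illustration of the comparison only (not the full algorithm):
pairs = [(10, 30), (15, 20), (16, 21)]   # (Bid, Ask) for DTE 2, in Strike order

lexicographic = all(a <= b for a, b in zip(pairs, pairs[1:]))
both_increasing = all(a[0] <= b[0] and a[1] <= b[1] for a, b in zip(pairs, pairs[1:]))

print(lexicographic)    # True  -> standard LIS would keep all three rows
print(both_increasing)  # False -> 30 > 20, so the 100 Strike row must be dropped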
Since each row must refer to the previous rows' already-computed longest increasing subsequence, it seems you cannot do this in parallel, which means you can't vectorize? Would that mean that the only way to speed this up is to use Cython? Or are there other concurrent solutions?
My current solution looks like this:
def modify_lsc_row(row, df, longest_lsc):
    lsc_predecessor_count = 0
    lsc_predecessor_index = -1
    df_predecessors = df[(df['Bid'] <= row.Bid) &
                         (df['Ask'] <= row.Ask) &
                         (df['lsc_count'] != -1)]
    if len(df_predecessors) > 0:
        df_predecessors = df_predecessors[df_predecessors['lsc_count'] == df_predecessors['lsc_count'].max()]
        lsc_predecessor_index = df_predecessors.index.max()
        lsc_predecessor_count = df_predecessors.at[lsc_predecessor_index, 'lsc_count']

    new_predecessor_count = lsc_predecessor_count + 1
    df.at[row.name, 'lsc_count'] = new_predecessor_count
    df.at[row.name, 'prev_index'] = lsc_predecessor_index

    if new_predecessor_count >= longest_lsc.lsc_count:
        longest_lsc.lsc_count = new_predecessor_count
        longest_lsc.lsc_index = row.name


def longest_increasing_bid_ask_subsequence(df):
    original_columns = df.columns
    df.sort_values(['Strike'], ascending=True, inplace=True)
    df.set_index(['Strike'], inplace=True)
    assert df.index.is_unique

    longest_lsc = LongestLsc()  # simple container with lsc_index / lsc_count attributes
    longest_lsc.lsc_index = df.index.max()
    longest_lsc.lsc_count = 1

    df['lsc_count'] = -1

    df.apply(lambda row: modify_lsc_row(row, df, longest_lsc),
             axis=1)

    # walk back along the prev_index pointers and mark the rows of the longest chain
    while longest_lsc.lsc_index != -1:
        df.at[longest_lsc.lsc_index, 'keep'] = True
        longest_lsc.lsc_index = df.at[longest_lsc.lsc_index, 'prev_index']

    df.dropna(inplace=True)  # rows never marked 'keep' have NaN there and are dropped

    return df.reset_index()[original_columns]
df_groups = df.groupby(['DTE'], group_keys=False, as_index=False)
df = df_groups.apply(longest_increasing_bid_ask_subsequence)
Update: https://stackoverflow.com/users/15862569/alexander-volkovsky has mentioned I can use pandarallel to parallelize each DTE, since those are all independent. That speeds it up by about 5x. However, I would like to speed it up much more (particularly the actual longest increasing subsequence computation). Separately, pandarallel doesn't seem to work in PyCharm (a known issue: https://github.com/nalepae/pandarallel/issues/76).
Update: I used https://stackoverflow.com/users/15862569/alexander-volkovsky's suggestions, namely numba and numpy. Pandarallel actually slowed things down as the rest got faster and faster (probably due to overhead), so I removed it. 10 hours -> 2.8 minutes. Quite the success. The biggest wins were rewriting the O(n^2) part with numba, and not using pandas groupby+apply even just to call the numba function. I found that groupby+apply takes about as long as groupby plus pd.concat, and you can remove the pd.concat by doing what Alexander said: just select the rows you want to keep at the end, instead of concatenating all the different group DataFrames together. Tons of other small optimizations were found mostly by using the line profiler.
Updated code as follows:
import numpy as np
from numba import njit

@njit
def set_list_indices(bids, asks, indices, indices_to_keep):
    entries = len(indices)

    lis_count = np.full(entries, 0)
    prev_index = np.full(entries, -1)

    longest_lis_count = -1
    longest_lis_index = -1

    for i in range(entries):
        predecessor_counts = np.where((bids <= bids[i]) & (asks <= asks[i]), lis_count, 0)
        best_predecessor_index = len(predecessor_counts) - np.argmax(predecessor_counts[::-1]) - 1

        if best_predecessor_index < i:
            prev_index[i] = best_predecessor_index

        new_count = predecessor_counts[best_predecessor_index] + 1
        lis_count[i] = new_count

        if new_count >= longest_lis_count:
            longest_lis_count = new_count
            longest_lis_index = i

    while longest_lis_index != -1:
        indices_to_keep[indices[longest_lis_index]] = True
        longest_lis_index = prev_index[longest_lis_index]
# necessary for lis algo, and groupby will preserve the order
df = df.sort_values(['Strike'], ascending=True)
# necessary for rows that were dropped. need reindexing for lis algo
df = df.reset_index(drop=True)
df_groups = df.groupby(['DTE'])
row_indices_to_keep = np.full(len(df.index), False, dtype=bool)
for name, group in df_groups:
    bids = group['Bid'].to_numpy()
    asks = group['Ask'].to_numpy()
    indices = group.index.to_numpy()

    set_list_indices(bids, asks, indices, row_indices_to_keep)
df = df.iloc[row_indices_to_keep]
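As a quick sanity check, running this updated pipeline on the nine-row example DataFrame above should keep exactly the rows from the expected answer:
print(df[['DTE', 'Strike', 'Bid', 'Ask']].sort_values(['DTE', 'Strike']))
# DTE 1 keeps Strikes 100, 400, 500, 600; DTE 2 keeps Strikes 200, 300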
What is the complexity of your algorithm for finding the longest increasing subsequence?
This article provides an algorithm with the complexity of O(n log n).
Upd: doesn't work.
You don't even need to modify the code, because in Python comparison works for tuples: assert (1, 2) < (3, 4)
>>> seq=[(10, 11), (16, 17), (17, 18), (11, 12), (12, 13), (13, 14)]
>>> subsequence(seq)
[(10, 11), (11, 12), (12, 13), (13, 14)]
Since each row must refer to the previous rows to have already computed the longest increasing subsequence at that point, it seems you cannot do this in parallel?
Yes, but you can calculate the sequence in parallel for every DTE. You could try something like pandarallel for parallel aggregation after the .groupby().
from pandarallel import pandarallel
pandarallel.initialize()
# just an example of usage:
df.groupby("DTE").parallel_apply(subsequence)
Also try to get rid of pandas (it's pretty slow) and use raw numpy arrays and Python structs. You can calculate the LIS indexes using an O(n^2) algorithm and then just select the required rows using df.iloc.
Let us say I have the following simple data frame. But in reality, I have hundreds of thousands of rows like this.
df
ID Sales
倀굖곾ꆹ譋῾理 100
倀굖곾ꆹ 50
倀굖곾ꆹ譋῾理 70
곾ꆹ텊躥㫆 60
My idea is that I want to replace the Chinese characters with randomly generated 8-digit numbers, something like below.
ID Sales
13434535 100
67894335 50
13434535 70
10986467 60
The digits are randomly generated but they should keep uniqueness as well. For example, rows 0 and 2 are the same, so when they are replaced by a random unique ID, they should get the same new ID as well.
Can anyone help on this in Python pandas? Any solution that is already done before is also welcome.
The primary method here will be to use Series.map() on the 'ID's to assign the new values.
Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.
which is exactly what you're looking for.
Here are some options for generating the new IDs:
1. Randomly generated 8-digit integers, as asked
You can first create a map of randomly generated 8-digit integers for each of the unique IDs in the dataframe. Then use Series.map() on the 'ID's to assign the new values back. I've included a while loop to ensure that the generated IDs are unique.
import random
original_ids = df['ID'].unique()
while True:
    new_ids = {id_: random.randint(10_000_000, 99_999_999) for id_ in original_ids}
    if len(set(new_ids.values())) == len(original_ids):
        # all the generated id's were unique
        break
    # otherwise this will repeat until they are
df['ID'] = df['ID'].map(new_ids)
Output:
ID Sales
0 91154173 100
1 27127403 50
2 91154173 70
3 55892778 60
Edit & Warning: The original IDs are Chinese characters, and they are already length 8. Since there are far more than 10 distinct Chinese characters, the space of possible original IDs is much larger than the 90 million available 8-digit numbers, so with enough unique original IDs it can become impossible (or at least very slow) to generate a unique 8-digit ID for each one. Unless you are memory bound, I'd recommend using 16-24 digits, or even better, the UUID option below.
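To put a rough number on that warning (a back-of-the-envelope estimate, assuming the generated IDs are drawn uniformly from the 90 million possible 8-digit values):
import math

N = 90_000_000   # available 8-digit IDs: 10_000_000 .. 99_999_999
k = 100_000      # example: number of unique original IDs

# birthday-problem approximation of the chance that at least two generated IDs collide,
# i.e. that the while loop in option 1 has to retry at least once
p_collision = 1 - math.exp(-k * (k - 1) / (2 * N))
print(f"{p_collision:.2%}")   # effectively 100% already at 100k unique IDs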
2. Use UUIDs. [IDEAL]
You can still use the "integer" version of the ID instead of hex. This has the added benefit of not needing to check for uniqueness:
import uuid
original_ids = df['ID'].unique()
new_ids = {cid: uuid.uuid4().int for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)
(If you are okay with hex id's, change uuid.uuid4().int above to uuid.uuid4().hex.)
Output:
ID Sales
0 10302456644733067873760508402841674050 100
1 99013251285361656191123600060539725783 50
2 10302456644733067873760508402841674050 70
3 112767087159616563475161054356643068804 60
2.B. Smaller numbers from UUIDs
If the ID generated above is too long, you could truncate it, with some minor risk. Here, I'm only using the first 16 hex characters and converting those to an int. You may put that in the uniqueness loop check as done for option 1, above.
import uuid
original_ids = df['ID'].unique()
DIGITS = 16 # number of hex digits of the UUID to use
new_ids = {cid: int(uuid.uuid4().hex[:DIGITS], base=16) for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)
Output:
ID Sales
0 14173925717660158959 100
1 10599965012234224109 50
2 14173925717660158959 70
3 13414338319624454663 60
3. Creating a mapping based on the actual value:
This group of options has these advantages:
not needing a uniqueness check, since the new ID is derived deterministically from the original ID (so original IDs which were the same will generate the same new ID)
not needing a map created in advance
3.A. CRC32
(Higher probability of finding a collision with different IDs, compared to option 2.B. above.)
import zlib
df['ID'] = df['ID'].map(lambda cid: zlib.crc32(bytes(cid, 'utf-8')))
Output:
ID Sales
0 2083453980 100
1 1445801542 50
2 2083453980 70
3 708870156 60
3.B. Python's built-in hash() of the original ID [My preferred approach in this scenario]
Can be done in one line, no imports needed
Reasonably unlikely to generate collisions for IDs which are different (note that hash() of strings is randomized per Python process, so the mapping is not reproducible across runs)
df['ID'] = df['ID'].map(hash)
Output:
ID Sales
0 4663892623205934004 100
1 1324266143210735079 50
2 4663892623205934004 70
3 6251873913398988390 60
3.C. MD5Sum, or anything from hashlib
Since the IDs are expected to be small (8 chars), even with MD5, the probability of a collision is very low.
import hashlib
DIGITS = 16 # number of hex digits of the hash to use
df['ID'] = df['ID'].str.encode('utf-8').map(lambda x: int(hashlib.md5(x).hexdigest()[:DIGITS], base=16))
Output:
ID Sales
0 17469287633857111608 100
1 4297816388092454656 50
2 17469287633857111608 70
3 11434864915351595420 60
I'm not very expert in Pandas, so I'm implementing a solution for you with Numpy + Pandas. As the solution uses fast Numpy operations, it will be much faster than a pure Python solution, especially if you have thousands of rows.
Try it online!
import pandas as pd, numpy as np
df = pd.DataFrame([
    ['倀굖곾ꆹ譋῾理', 100],
    ['倀굖곾ꆹ', 50],
    ['倀굖곾ꆹ譋῾理', 70],
    ['곾ꆹ텊躥㫆', 60],
], columns = ['ID', 'Sales'])
u, iv = np.unique(df.ID.values, return_inverse = True)
while True:
    ids = np.random.randint(10 ** 7, 10 ** 8, u.size)
    if np.all(np.unique(ids, return_counts = True)[1] <= 1):
        break
df.ID = ids[iv]
print(df)
Output:
ID Sales
0 31043191 100
1 36168634 50
2 31043191 70
3 17162753 60
Given a dataframe df, create a list of the ids:
id_list = list(df.ID)
Then import the needed packages
from random import randint
from collections import deque
def idSetToNumber(id_list):
    id_set = deque(set(id_list))
    checked_numbers = []
    while len(id_set) > 0:
        # generate a candidate 8-digit id
        new_id = randint(10000000, 99999999)
        # check if the id has been used
        if new_id not in checked_numbers:
            checked_numbers.append(new_id)
            id_set.popleft()
    return checked_numbers
This gives a list of unique 8-digit numbers, one for each of your keys.
Then create a dictionary mapping each unique ID to one of the generated numbers
unique_ids = list(set(id_list))
checked_numbers = idSetToNumber(unique_ids)
name2id = {}
for i in range(len(checked_numbers)):
    name2id[unique_ids[i]] = checked_numbers[i]
Last step, replace all the pandas ID fields with the ones in the dictionary.
for i in range(df.shape[0]):
    df.loc[i, 'ID'] = str(name2id[df.ID[i]])
I would:
identify the unique ID values
build (from np.random) an array of unique values of same size
build a transformation dataframe with that array
use merge to replace the original ID values
Possible code:
trans = df[['ID']].drop_duplicates() # unique ID values
n = len(trans)
# np.random.seed(0) # uncomment for reproducible pseudo random sequences
while True:
    # build a greater array to have a higher chance to get enough unique values
    arr = np.unique(np.random.randint(10000000, 100000000, n + n // 2))
    if len(arr) >= n:
        arr = arr[:n]  # ok keep only the required number
        break
trans['new'] = arr # ok we have our transformation table
df['ID'] = df.merge(trans, how='left', on='ID')['new'] # done...
With your sample data (and with np.random.seed(0)), it gives:
ID Sales
0 12215104 100
1 48712131 50
2 12215104 70
3 70969723 60
Per @Arty's comment, np.unique will return an ascending sequence. If you do not want that, shuffle it before using it for the transformation table:
...
np.random.shuffle(arr)
trans['new'] = arr
...
I have a large data frame across different timestamps. Here is my attempt:
all_data = []
for ws in wb.worksheets():
    rows = ws.get_all_values()
    df_all_data = pd.DataFrame.from_records(rows[1:], columns=rows[0])
    all_data.append(df_all_data)
data = pd.concat(all_data)
#Change data type
data['Year'] = pd.DatetimeIndex(data['Week']).year
data['Month'] = pd.DatetimeIndex(data['Week']).month
data['Week'] = pd.to_datetime(data['Week']).dt.date
data['Application'] = data['Application'].astype('str')
data['Function'] = data['Function'].astype('str')
data['Service'] = data['Service'].astype('str')
data['Channel'] = data['Channel'].astype('str')
data['Times of alarms'] = data['Times of alarms'].astype('int')
#Compare Channel values over weeks
subchannel_df = data.pivot_table('Times of alarms', index = 'Week', columns='Channel', aggfunc='sum').fillna(0)
subchannel_df = subchannel_df.sort_index(axis=1)
[screenshot: the data frame I am working on]
What I hope to achieve:
add a percentage row (the last row vs the second-to-last row) at the end of the data frame, excluding cases such as division by zero and negative percentages
show those channels which increased more than 10% compared with last week.
I have been trying different methods to achieve this for days, but I have not managed to do it. Thank you in advance.
You could use the shift function, the equivalent of the LAG window function in SQL, to return last week's value and then perform the calculations at row level. To avoid dividing by zero you can use numpy's where function, which is equivalent to CASE WHEN in SQL. Let's say the column on which you perform the calculations is named "X":
subchannel_df["XLag"] = subchannel_df["X"].shift(periods=1).fillna(0).astype('int')
subchannel_df["ChangePercentage"] = np.where(subchannel_df["XLag"] == 0, 0, (subchannel_df["X"]-subchannel_df["XLag"])/subchannel_df["XLag"])
subchannel_df["ChangePercentage"] = (subchannel_df["ChangePercentage"]*100).round().astype("int")
subchannel_df[subchannel_df["ChangePercentage"]>10]
Output:
Channel X XLag ChangePercentage
Week
2020-06-12 12 5 140
2020-11-15 15 10 50
2020-11-22 20 15 33
2020-12-13 27 16 69
2020-12-20 100 27 270
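Since subchannel_df has one column per Channel, the same idea can be applied to every channel at once. A minimal sketch of that, assuming pandas and numpy are imported as pd and np, comparing the last week against the previous one:
last_week = subchannel_df.iloc[-1]
prev_week = subchannel_df.iloc[-2]

# leave the change undefined (NaN) where last week's value was 0, to avoid dividing by zero
change_pct = pd.Series(
    np.where(prev_week == 0, np.nan, (last_week - prev_week) / prev_week * 100),
    index=subchannel_df.columns,
)

# channels that increased by more than 10% versus last week
print(change_pct[change_pct > 10])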
Good evening all,
I have a situation where I need to split a dataframe into two complementary parts based on the value of one feature.
What I mean by this is that for every row in dataframe 1, I need a complementary row in dataframe 2 that takes on the opposite value of that specific feature.
In my source dataframe, the feature I'm referring to is stored under column "773", and it can take on values of either 0.0 or 1.0.
I came up with the following code that does this sufficiently, but it is remarkably slow. It takes about a minute to split 10,000 rows, even on my all-powerful EC2 instance.
data = chunk.iloc[:,1:776]
listy1 = []
listy2 = []
for i in range(0, len(data)):
    random_row = data.sample(n=1).iloc[0]
    listy1.append(random_row.tolist())
    if random_row["773"] == 0.0:
        x = data[data["773"] == 1.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
    else:
        x = data[data["773"] == 0.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
df1 = pd.DataFrame(listy1)
df2 = pd.DataFrame(listy2)
Note: I don't care about duplicate rows, because this data is being used to train a model that compares two objects to tell which one is "better."
Do you have some insight into why this is so slow, or any suggestions as to make this faster?
A key concept in efficient numpy/scipy/pandas coding is using library-shipped vectorized functions whenever possible. Try to process multiple rows at once instead of iterate explicitly over rows. i.e. avoid for loops and .iterrows().
The implementation provided is a little subtle in terms of indexing, but the vectorized approach itself is straightforward:
Draw the main dataset at once.
The complementary dataset: draw the 0-rows at once, the complementary 1-rows at once, and then put them into the corresponding rows at once.
Code:
import pandas as pd
import numpy as np
from datetime import datetime
np.random.seed(52) # reproducibility
n = 10000
df = pd.DataFrame(
    data={
        "773": [0, 1] * int(n / 2),
        "dummy1": list(range(n)),
        "dummy2": list(range(0, 10 * n, 10))
    }
)
t0 = datetime.now()
print("Program begins...")
# 1. draw the main dataset
draw_idx = np.random.choice(n, n) # repeatable draw
df_main = df.iloc[draw_idx, :].reset_index(drop=True)
# 2. draw the complementary dataset
# (1) count number of 1's and 0's
n_1 = np.count_nonzero(df["773"][draw_idx].values)
n_0 = n - n_1
# (2) split data for drawing
df_0 = df[df["773"] == 0].reset_index(drop=True)
df_1 = df[df["773"] == 1].reset_index(drop=True)
# (3) draw n_1 indexes in df_0 and n_0 indexes in df_1
idx_0 = np.random.choice(len(df_0), n_1)
idx_1 = np.random.choice(len(df_1), n_0)
# (4) broadcast the drawn rows into the complementary dataset
df_comp = df_main.copy()
mask_0 = (df_main["773"] == 0).values
df_comp.iloc[mask_0, :] = df_1.iloc[idx_1, :].values   # df_1 into mask_0
df_comp.iloc[~mask_0, :] = df_0.iloc[idx_0, :].values  # df_0 into ~mask_0
print(f"Program ends in {(datetime.now() - t0).total_seconds():.3f}s...")
Check
print(df_main.head(5))
773 dummy1 dummy2
0 0 28 280
1 1 11 110
2 1 13 130
3 1 23 230
4 0 86 860
print(df_comp.head(5))
773 dummy1 dummy2
0 1 19 190
1 0 74 740
2 0 28 280 <- this row is complementary to df_main
3 0 60 600
4 1 37 370
Efficiency gain: 14.23s -> 0.011s (ca. 1300x)
I am trying to read from an array I created and return the value inside the array along with the column and row it is found in. This is what I have at the moment:
import pandas as pd
import os

Dir = os.getcwd()

# collect all the .txt files in the working directory
Blks = [each for each in os.listdir(Dir) if each.endswith('.txt')]
print(Blks)

for z in Blks:
    df = pd.read_csv(z, sep=r'\s+', names=['x', 'y', 'z'])
    a = df.pivot(index='y', columns='x', values='z')
    print(a)
OUTPUTS:
x 300.00 300.25 300.50 300.75 301.00 301.25 301.50 301.75
y
200.00 100 100 100 100 100 100 100 100
200.25 100 100 100 100 110 100 100 100
200.50 100 100 100 100 100 100 100 100
x will be my columns and y the rows; inside the array are values corresponding to their adjacent column and row. As you can see above, there is an odd 110 value that is 10 above the other values. I'm trying to read the array and return the x (column) and y (row) of the value that differs by 10, by checking the values next to it (top, bottom, right, left) to calculate the difference.
Hope someone can kindly guide me in the right direction, and any beginner tips are appreciated. If it's unclear what I'm asking, please ask. I don't have years of experience with all the methodology; I have only recently started with Python.
You could use DataFrame.iloc to loop through all the values, row by row and column by column (the older DataFrame.ix accessor used here originally has since been removed from pandas).
oddCoordinates = []
for r in range(df.shape[0]):
    for c in range(df.shape[1]):
        if checkDiffFromNeighbors(df, r, c):
            oddCoordinates.append((r, c))
The row and column of values that are different from the neighbors are listed in oddCoordinates.
To check the difference between the neighbors, you could loop them and count how many different values there are:
def checkDiffFromNeighbors(df, r, c):
    # counter of how many neighbors are different
    diffCnt = 0
    # loop over the neighbor rows
    for dr in [-1, 0, 1]:
        r1 = r + dr
        # row should be a valid number
        if r1 >= 0 and r1 < df.shape[0]:
            # loop over columns in row
            for dc in [-1, 0, 1]:
                # either the row or column delta should be 0, because we do not allow diagonal neighbors
                if dr == 0 or dc == 0:
                    c1 = c + dc
                    # check legal column
                    if c1 >= 0 and c1 < df.shape[1]:
                        if df.iloc[r, c] != df.iloc[r1, c1]:
                            diffCnt += 1
    # if diffCnt == 1 then a neighbor is (probably) the odd one
    # otherwise this one is the odd one
    # Note that you could be more strict and require all neighbors to be different
    if diffCnt > 1:
        return True
    else:
        return False
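Since the question asks for the x and y values rather than positions, you can map the collected positions back to the pivoted frame's labels. A small follow-up sketch, assuming df here is the pivoted frame a from the question:
for r, c in oddCoordinates:
    print('y =', df.index[r], 'x =', df.columns[c], 'value =', df.iloc[r, c])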