I have data that looks something like this
df
Out[10]:
ID1 ID2 Price Date
0 11 21 10.99 3/15/2016
1 11 22 11.99 3/15/2016
2 12 23 5 3/15/2016
3 11 21 10.99 3/16/2016
4 11 22 12.99 3/16/2016
5 11 21 10.99 3/17/2016
6 11 22 11.99 3/17/2016
The goal is to get a unique ID for each combination of ID1 and the particular set of prices attached to its ID2's, like so:
# Desired Result
df
Out[14]:
ID1 ID2 Price Date UID
0 11 21 10.99 3/15/2016 1
1 11 22 11.99 3/15/2016 1
2 12 23 5 3/15/2016 7
3 11 21 10.99 3/16/2016 5
4 11 22 12.99 3/16/2016 5
5 11 21 10.99 3/17/2016 1
6 11 22 11.99 3/17/2016 1
Speed is an issue because of the size of the data. The best way I can come up with is below, but it is still a fair amount slower than is desirable. If anyone has a way that they think should be naturally faster I'd love to hear it. Or perhaps there is an easy way to do the within group operations in parallel to speed things up?
My method basically concatenates IDs and prices (after padding with zeros to ensure equal lengths) and then takes ranks to simplify the final ID. The bottleneck is the within-group concatenation done with .transform(np.sum).
# concatenate ID2 and Price (both already zero-padded strings, so + concatenates)
df['ID23'] = df['ID2'] + df['Price']
df
Out[12]:
ID1 ID2 Price Date ID23
0 11 21 10.99 3/15/2016 2110.99
1 11 22 11.99 3/15/2016 2211.99
2 12 23 5 3/15/2016 235
3 11 21 10.99 3/16/2016 2110.99
4 11 22 12.99 3/16/2016 2212.99
5 11 21 10.99 3/17/2016 2110.99
6 11 22 11.99 3/17/2016 2211.99
# groupby ID1 and Date and then concatenate the ID23's
grouped = df.groupby(['ID1','Date'])
df['summed'] = grouped['ID23'].transform(np.sum)
df
Out[16]:
ID1 ID2 Price Date ID23 summed UID
0 6 3 0010.99 3/15/2016 30010.99 30010.9960011.99 630010.9960011.99
1 6 6 0011.99 3/15/2016 60011.99 30010.9960011.99 630010.9960011.99
2 7 7 0000005 3/15/2016 70000005 70000005 770000005
3 6 3 0010.99 3/16/2016 30010.99 30010.9960012.99 630010.9960012.99
4 6 6 0012.99 3/16/2016 60012.99 30010.9960012.99 630010.9960012.99
5 6 3 0010.99 3/17/2016 30010.99 30010.9960011.99 630010.9960011.99
6 6 6 0011.99 3/17/2016 60011.99 30010.9960011.99 630010.9960011.99
# Concatenate ID1 on the front and take rank to get simpler ID's
df['UID'] = df['ID1'] + df['summed']
df['UID'] = df['UID'].rank(method = 'min')
# Drop unnecessary columns
df.drop(['ID23','summed'], axis=1, inplace=True)
UPDATE:
To clarify, consider the original data grouped as follows:
grouped = df.groupby(['ID1','Date'])
for name, group in grouped:
    print(group)
ID1 ID2 Price Date
0 11 21 10.99 3/15/2016
1 11 22 11.99 3/15/2016
ID1 ID2 Price Date
3 11 21 10.99 3/16/2016
4 11 22 12.99 3/16/2016
ID1 ID2 Price Date
5 11 21 10.99 3/17/2016
6 11 22 11.99 3/17/2016
ID1 ID2 Price Date
2 12 23 5 3/15/2016
UIDs should be at the group level and match if everything about a group is identical, ignoring the date. So in this case the first and third printed groups are the same, meaning that rows 0, 1, 5, and 6 should all get the same UID. Rows 3 and 4 belong to a different group because a price changed and therefore need a different UID. Row 2 is in a different group as well.
A slightly different way of looking at this problem is that I want to group as I have here, drop the date column (which was important for initially forming the groups) and then aggregate across groups which are equal once I have removed the dates.
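To make that concrete, here is a rough, untimed sketch of that reframing (column names follow the example above; the key construction and helper names are purely illustrative):
import pandas as pd

df = pd.DataFrame({'ID1':   [11, 11, 12, 11, 11, 11, 11],
                   'ID2':   [21, 22, 23, 21, 22, 21, 22],
                   'Price': [10.99, 11.99, 5.0, 10.99, 12.99, 10.99, 11.99],
                   'Date':  ['3/15/2016'] * 3 + ['3/16/2016'] * 2 + ['3/17/2016'] * 2})

# Build one hashable key per (ID1, Date) group: ID1 plus the sorted (ID2, Price) pairs.
keys = {name: (name[0], tuple(sorted(zip(g['ID2'], g['Price']))))
        for name, g in df.groupby(['ID1', 'Date'])}
# Identical keys get the same integer label (the labels differ from the example UIDs,
# but the partitioning is the same).
uid = pd.DataFrame({'ID1': [k[0] for k in keys],
                    'Date': [k[1] for k in keys],
                    'UID': pd.factorize(pd.Series(list(keys.values())))[0]})
df = df.merge(uid, on=['ID1', 'Date'])
print(df)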
Edit: The code below is actually slower than OP's solution. I'm leaving it as it is for now in case someone uses it to write a better solution.
For visualization, I'll be using the following data:
df
Out[421]:
ID1 ID2 Price Date
0 11 21 10.99 3/15/2016
1 11 22 11.99 3/15/2016
2 12 23 5.00 3/15/2016
3 11 21 10.99 3/16/2016
4 11 22 12.99 3/16/2016
5 11 21 10.99 3/17/2016
6 11 22 11.99 3/17/2016
7 11 22 11.99 3/18/2016
8 11 21 10.99 3/18/2016
9 12 22 11.99 3/18/2016
10 12 21 10.99 3/18/2016
11 12 23 5.00 3/19/2016
12 12 23 5.00 3/19/2016
First, let's group it by 'ID1' and 'Date' and aggregate the results as (sorted) tuples. I also reset the index, so there is a new column named 'index'.
gr = df.reset_index().groupby(['ID1','Date'], as_index = False)
df1 = gr.agg(lambda x : tuple(sorted(x)))
df1
Out[425]:
ID1 Date index ID2 Price
0 11 3/15/2016 (0, 1) (21, 22) (10.99, 11.99)
1 11 3/16/2016 (3, 4) (21, 22) (10.99, 12.99)
2 11 3/17/2016 (5, 6) (21, 22) (10.99, 11.99)
3 11 3/18/2016 (7, 8) (21, 22) (10.99, 11.99)
4 12 3/15/2016 (2,) (23,) (5.0,)
5 12 3/18/2016 (9, 10) (21, 22) (10.99, 11.99)
6 12 3/19/2016 (11, 12) (23, 23) (5.0, 5.0)
After all grouping is done, I'll use indices from column 'index' to access rows from df (they'd better be unique). (Notice also that df1.index and df1['index'] are completely different things.)
Now, let's group 'index' (skipping dates):
df2 = df1.groupby(['ID1','ID2','Price'], as_index = False)['index'].sum()
df2
Out[427]:
ID1 ID2 Price index
0 11 (21, 22) (10.99, 11.99) (0, 1, 5, 6, 7, 8)
1 11 (21, 22) (10.99, 12.99) (3, 4)
2 12 (21, 22) (10.99, 11.99) (9, 10)
3 12 (23,) (5.0,) (2,)
4 12 (23, 23) (5.0, 5.0) (11, 12)
I believe this is the grouping needed for the problem, so we can now add labels to df, for example like this:
df['GID'] = -1
for i, t in enumerate(df2['index']):
    df.loc[t, 'GID'] = i
df
Out[430]:
ID1 ID2 Price Date GID
0 11 21 10.99 3/15/2016 0
1 11 22 11.99 3/15/2016 0
2 12 23 5.00 3/15/2016 3
3 11 21 10.99 3/16/2016 1
4 11 22 12.99 3/16/2016 1
5 11 21 10.99 3/17/2016 0
6 11 22 11.99 3/17/2016 0
7 11 22 11.99 3/18/2016 0
8 11 21 10.99 3/18/2016 0
9 12 22 11.99 3/18/2016 2
10 12 21 10.99 3/18/2016 2
11 12 23 5.00 3/19/2016 4
12 12 23 5.00 3/19/2016 4
Or in a possibly faster but tricky way:
# EXPERIMENTAL CODE!
# Expand each tuple of row indices into one row per index; 'level_0' holds the df2 row label
df3 = df2['index'].apply(pd.Series).stack().reset_index()
# Re-index df3 by the original df row indices
df3.index = df3[0].astype(int)
# Assign the df2 row label (i.e. the group id) back to df, aligned on the index
df['GID'] = df3['level_0']
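On newer pandas (0.25+), a possibly simpler equivalent of that trick uses explode; a rough sketch:
df3 = df2['index'].explode()   # one row per original df index, labelled with df2's row label
df['GID'] = pd.Series(df3.index.values, index=df3.astype(int).values)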
I have a pandas df, like this:
ID date value
0 10 2022-01-01 100
1 10 2022-01-02 150
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200
5 10 2022-01-06 0
6 10 2022-01-07 150
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100
12 23 2022-02-01 490
13 23 2022-02-02 0
14 23 2022-02-03 350
15 23 2022-02-04 333
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211
20 23 2022-02-09 100
I would like to calculate the number of days since the last non-zero value, as in the example below. How can I use diff() for this? The calculation should restart for each ID.
Output:
ID date value days_last_value
0 10 2022-01-01 100 0
1 10 2022-01-02 150 1
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200 3
5 10 2022-01-06 0
6 10 2022-01-07 150 2
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100 5
12 23 2022-02-01 490 0
13 23 2022-02-02 0
14 23 2022-02-03 350 2
15 23 2022-02-04 333 1
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211 4
20 23 2022-02-09 100 1
Explanation below.
import pandas as pd
df = pd.DataFrame({'ID': 12 * [10] + 9 * [23],
                   'value': [100, 150, 0, 0, 200, 0, 150, 0, 0, 0, 0, 100, 490, 0, 350, 333, 0, 0, 0, 211, 100]})
days = df.groupby(['ID', (df['value'] != 0).cumsum()]).size().groupby('ID').shift(fill_value=0)
days.index = df.index[df['value'] != 0]
df['days_last_value'] = days
df
ID value days_last_value
0 10 100 0.0
1 10 150 1.0
2 10 0 NaN
3 10 0 NaN
4 10 200 3.0
5 10 0 NaN
6 10 150 2.0
7 10 0 NaN
8 10 0 NaN
9 10 0 NaN
10 10 0 NaN
11 10 100 5.0
12 23 490 0.0
13 23 0 NaN
14 23 350 2.0
15 23 333 1.0
16 23 0 NaN
17 23 0 NaN
18 23 0 NaN
19 23 211 4.0
20 23 100 1.0
First, we'll have to group by 'ID'.
We also create groups for each block of days, by creating a True/False series where value is not 0 and then performing a cumulative sum. That is the part (df['value'] != 0).cumsum(), which results in
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 4
8 4
9 4
10 4
11 5
12 6
13 6
14 7
15 8
16 8
17 8
18 8
19 9
20 10
We can use the values in this series to also group on; combining that with the 'ID' group, you have the individual blocks of days. This is the df.groupby(['ID', (df['value'] != 0).cumsum()]) part.
Now, for each block, we get its size. Since a block starts at a non-zero value and runs up to (but not including) the next one, that size is the number of days until the next non-zero value. We then shift by one within each ID, so every block receives the size of the block before it, i.e. the days since the previous non-zero value, and we fill the first block of each ID with 0. This shift has to happen per ID group, so we group by 'ID' again before shifting (we lost the grouping after doing .size()).
Now, this new series needs to be assigned back to the dataframe, but it's obviously shorter. Since its index has also been reset, we can't easily reassign it (not with df['days_last_value'], df.loc[...] or df.iloc).
Instead, we select the index values of the original dataframe where value is not zero, and set the index of the days equal to that.
Now it's an easy step to assign the days directly to the relevant column in the dataframe: pandas will match the indices.
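As a quick sanity check of those steps, here is a small sketch that prints the intermediate block sizes before and after the shift (same data as above):
import pandas as pd

df = pd.DataFrame({'ID': 12 * [10] + 9 * [23],
                   'value': [100, 150, 0, 0, 200, 0, 150, 0, 0, 0, 0, 100,
                             490, 0, 350, 333, 0, 0, 0, 211, 100]})

blocks = (df['value'] != 0).cumsum()               # a new block starts at every non-zero value
sizes = df.groupby(['ID', blocks]).size()          # block length = days until the next non-zero value
shifted = sizes.groupby('ID').shift(fill_value=0)  # days since the previous non-zero value
print(sizes)
print(shifted)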
I have a pandas.DataFrame of the following form; I'll show a simple example (in reality, it consists of hundreds of millions of rows).
I want to replace the values in column '2' with a counter that increases every time the value in column '2' changes. The values in the remaining columns (columns 1, 3, ...) should not change.
df=
index 1 2 3
0 0 a100 1
1 1.04 a100 2
2 32 a100 3
3 5.05 a105 4
4 1.01 a105 5
5 155 a105 6
6 3155.26 a105 7
7 354.12 a100 8
8 5680.13 a100 9
9 125.55 a100 10
10 13.32 a100 11
11 5656.33 a156 12
12 456.61 a156 13
13 23.52 a1235 14
14 35.35 a1235 15
15 350.20 a100 16
16 30. a100 17
17 13.50 a100 18
18 323.13 a231 19
19 15.11 a1111 20
20 11.22 a1111 21
Here is my expected result:
df=
index 1 2 3
0 0 0 1
1 1.04 0 2
2 32 0 3
3 5.05 1 4
4 1.01 1 5
5 155 1 6
6 3155.26 1 7
7 354.12 2 8
8 5680.13 2 9
9 125.55 2 10
10 13.32 2 11
11 5656.33 3 12
12 456.61 3 13
13 23.52 4 14
14 35.35 4 15
15 350.20 5 16
16 30 5 17
17 13.50 5 18
18 323.13 6 19
19 15.11 7 20
20 11.22 7 21
How do I solve this problem?
Create consecutive groups by comparing the column to its shifted values with ne (not equal), take the cumulative sum, and then subtract 1:
# if the column label is the string '2'
df['2'] = df['2'].ne(df['2'].shift()).cumsum().sub(1)

# if the column label is the number 2
df[2] = df[2].ne(df[2].shift()).cumsum().sub(1)

print(df)
index 1 2 3
0 0 0.00 0 1
1 1 1.04 0 2
2 2 32.00 0 3
3 3 5.05 1 4
4 4 1.01 1 5
5 5 155.00 1 6
6 6 3155.26 1 7
7 7 354.12 2 8
8 8 5680.13 2 9
9 9 125.55 2 10
10 10 13.32 2 11
11 11 5656.33 3 12
12 12 456.61 3 13
13 13 23.52 4 14
14 14 35.35 4 15
15 15 350.20 5 16
16 16 30.00 5 17
17 17 13.50 5 18
18 18 323.13 6 19
19 19 15.11 7 20
20 20 11.22 7 21
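To see the intermediate steps, here is a tiny illustrative sketch on a hand-made series (not the full data above):
import pandas as pd

s = pd.Series(['a100', 'a100', 'a105', 'a105', 'a100'])
changed = s.ne(s.shift())                  # True on the first row and wherever the value changes
print(changed.tolist())                    # [True, False, True, False, True]
print(changed.cumsum().sub(1).tolist())    # [0, 0, 1, 1, 2]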
I have the following dataframe:
      date    wind (°)   wind (kt)  temp (C°)  humidity(%)  currents (°)  currents (kt)  stemp (C°)  sea_temp_diff  wind_distance_diff  wind_speed_diff    temp_diff  humidity_diff  current_distance_diff  current_speed_diff
8    12018  175.000000   16.333333  25.500000    82.500000     60.000000       0.100000   25.400000      -1.066667           23.333333        -0.500000    -0.333333     -12.000000             160.000000         6.666667e-02
9    12019  180.000000   17.000000  23.344828    79.724138    230.000000       0.100000   23.827586      -0.379310           22.068966         1.068966     0.827586      -7.275862             315.172414         3.449034e+02
10   12020  365.000000  208.653846  24.192308    79.346154    355.769231     192.500000   24.730769     574.653846         1121.923077      1151.153846  1149.346154     -19.538462            1500.000000         1.538454e+03
14   22019  530.357143  372.964286  23.964286    81.964286   1270.714286    1071.560714  735.642857    -533.642857         -327.500000      -356.892857     1.857143     -10.321429            -873.571429        -8.928107e+02
15   22020  216.551724   12.689655  24.517241    81.137931    288.275862     172.565517  196.827586    -171.379310           -8.965517         3.724138     1.413793      -7.137931            -105.517241        -1.722724e+02
16   32019  323.225806  174.709677  25.225806    80.741935    260.000000     161.451613   25.709677     480.709677          486.451613       483.967742     0.387097     153.193548            1044.516129         9.677065e+02
17   32020  351.333333  178.566667  25.533333    78.800000    427.666667     166.666667   26.600000     165.533333         -141.000000      -165.766667   166.633333     158.933333               8.333333         1.500000e-01
18   42017  180.000000   14.000000  27.000000  5000.000000    200.000000       0.400000   25.400000       2.600000           20.000000        -4.000000     0.000000       0.000000             -90.000000        -1.000000e-01
19   42019  694.230769  589.769231  24.038462    69.461538    681.153846     577.046154   26.884615      -1.346154           37.307692        -1.692308     1.500000       4.769231              98.846154         1.538462e-01
20   42020  306.666667  180.066667  24.733333    75.166667    427.666667     166.666667   26.800000     165.066667          205.333333       165.200000     1.100000      -4.066667             360.333333         3.334233e+02
21   52017  146.333333   11.966667  22.900000  5000.000000    116.333333       0.410000   26.066667      -1.553333            8.666667         0.833333    -0.766667       0.000000              95.000000        -1.300000e-01
22   52019  107.741935   12.322581  23.419355    63.032258    129.354839       0.332258   25.935484      -1.774194           14.838710         0.096774    -0.612903     -14.451613             130.967742
I need to sort the 'date' column chronologically, and I'm wondering if there's a way for me to split it into two parts, with the month (e.g. '10') in one column and the year (e.g. 2017) in another, sort both of them in ascending order, and then bring them back together.
I had tried this:
australia_overview[['month','year']] = australia_overview['date'].str.split("2",expand=True)
But I am getting an error like this:
ValueError: Columns must be same length as key
How can I solve this issue?
From your DataFrame:
>>> df = pd.DataFrame({'id': [1, 2, 3, 4],
... 'date': ['1 42018', '12 32019', '8 112020', '23 42021']},
... index = [0, 1, 2, 3])
>>> df
id date
0 1 1 42018
1 2 12 32019
2 3 8 112020
3 4 23 42021
We can split the column to get the day as the first value, like so:
>>> df['day'] = df['date'].str.split(' ', expand=True)[0]
>>> df
id date day
0 1 1 42018 1
1 2 12 32019 12
2 3 8 112020 8
3 4 23 42021 23
And take the last 4 digits of the date column for the year to get the expected result:
>>> df['year'] = df['date'].str[-4:].astype(int)
>>> df
id date day year
0 1 1 42018 1 2018
1 2 12 32019 12 2019
2 3 8 112020 8 2020
3 4 23 42021 23 2021
Bonus: as asked in the comments, you can even get the month using the same principle:
>>> df['month'] = df['date'].str.split(' ', expand=True)[1].str[:-4].astype(int)
>>> df
id date day year month
0 1 1 42018 1 2018 4
1 2 12 32019 12 2019 3
2 3 8 112020 8 2020 11
3 4 23 42021 23 2021 4
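Not part of the answer above, but if the end goal is chronological sorting, one possible follow-up is to cast day to int as well and sort on the extracted parts ('parsed' is just a hypothetical helper column name):
>>> df['day'] = df['day'].astype(int)
>>> df['parsed'] = pd.to_datetime(df[['year', 'month', 'day']])
>>> df = df.sort_values('parsed').reset_index(drop=True)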
So I have a data frame that is something like this
Resource 2020-06-01 2020-06-02 2020-06-03
Name1 8 7 8
Name2 7 9 9
Name3 10 10 10
Imagine that the header literally contains all the days of the month, and that there are way more names than just three.
I need to reduce the columns to five: the first column covers the days from 2020-06-01 to 2020-06-05, and each following column runs from Saturday to Friday of the same week, or ends on the last day of the month if it falls before Friday. So for June the weeks would be:
week 1: 2020-06-01 to 2020-06-05
week 2: 2020-06-06 to 2020-06-12
week 3: 2020-06-13 to 2020-06-19
week 4: 2020-06-20 to 2020-06-26
week 5: 2020-06-27 to 2020-06-30
I have no problem defining these weeks. The problem is grouping the columns based on them.
I couldn't come up with anything.
Does someone have any ideas about this?
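One rough way to group the columns directly, sketched here on a made-up frame of the same shape (the frame construction and the sum aggregation are just assumptions for illustration; the week boundaries are the ones listed above):
import pandas as pd
import numpy as np

# Stand-in data: names as the index, one column per day of June 2020.
dates = pd.date_range('2020-06-01', '2020-06-30').strftime('%Y-%m-%d')
df = pd.DataFrame(np.random.randint(1, 10, size=(3, len(dates))),
                  index=['Name1', 'Name2', 'Name3'], columns=dates)

def week_label(col):
    # Map a date column label to one of the five custom weeks listed above.
    day = pd.Timestamp(col).day
    if day <= 5:
        return 'week 1'
    if day <= 12:
        return 'week 2'
    if day <= 19:
        return 'week 3'
    if day <= 26:
        return 'week 4'
    return 'week 5'

# Group the columns by that label and aggregate.
weekly = df.groupby(week_label, axis=1).sum()
# On pandas 2.x, where axis=1 groupby is deprecated, the equivalent is:
# weekly = df.T.groupby(week_label).sum().T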
I had to use the following code to generate your dataframe.
import pandas as pd
import numpy as np

dates = pd.date_range(start='2020-06-01', end='2020-06-30')
df = pd.DataFrame({
'Name1': np.random.randint(1, 10, size=len(dates)),
'Name2': np.random.randint(1, 10, size=len(dates)),
'Name3': np.random.randint(1, 10, size=len(dates)),
})
df = df.set_index(dates).transpose().reset_index().rename(columns={'index': 'Resource'})
Then, the solution starts from here.
# Set the first column as index
df = df.set_index(df['Resource'])
# Remove the unused column
df = df.drop(columns=['Resource'])
# Transpose the dataframe
df = df.transpose()
# Output:
Resource Name1 Name2 Name3
2020-06-01 00:00:00 3 2 7
2020-06-02 00:00:00 5 6 8
2020-06-03 00:00:00 2 3 6
...
# Bring "Resource" from index to column
df = df.reset_index()
df = df.rename(columns={'index': 'Resource'})
# Add a column "week of year"
df['week_no'] = df['Resource'].dt.weekofyear
# You can simply group by the week no column
df.groupby('week_no').sum().reset_index()
# Output:
Resource week_no Name1 Name2 Name3
0 23 38 42 41
1 24 37 30 43
2 25 38 29 23
3 26 29 40 42
4 27 2 8 3
I don't know what you want to do for the next. If you want your original form, just transpose() it back.
EDIT: OP clarified that the week should start on Saturday and end on Friday.
# 0: Monday
# 1: Tuesday
# 2: Wednesday
# 3: Thursday
# 4: Friday
# 5: Saturday
# 6: Sunday
df['weekday'] = df['Resource'].dt.weekday.apply(lambda day: 0 if day <= 4 else 1)
df['customised_weekno'] = df['week_no'] + df['weekday']
Output:
Resource Resource Name1 Name2 Name3 week_no weekday customised_weekno
0 2020-06-01 4 7 7 23 0 23
1 2020-06-02 8 6 7 23 0 23
2 2020-06-03 5 9 5 23 0 23
3 2020-06-04 7 6 5 23 0 23
4 2020-06-05 6 3 7 23 0 23
5 2020-06-06 3 7 6 23 1 24
6 2020-06-07 5 4 4 23 1 24
7 2020-06-08 8 1 5 24 0 24
8 2020-06-09 2 7 9 24 0 24
9 2020-06-10 4 2 7 24 0 24
10 2020-06-11 6 4 4 24 0 24
11 2020-06-12 9 5 7 24 0 24
12 2020-06-13 2 4 6 24 1 25
13 2020-06-14 6 7 5 24 1 25
14 2020-06-15 8 7 7 25 0 25
15 2020-06-16 4 3 3 25 0 25
16 2020-06-17 6 4 5 25 0 25
17 2020-06-18 6 8 2 25 0 25
18 2020-06-19 3 1 2 25 0 25
So, you can use customised_weekno for grouping.
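A possible final step (not shown in the answer, just a sketch): aggregate on the custom week number and transpose back to the wide names-by-weeks layout the question asks for.
# Drop the helper columns and the date column, sum per custom week, then transpose back.
weekly = (df.drop(columns=['Resource', 'week_no', 'weekday'])
            .groupby('customised_weekno').sum()
            .transpose())
weekly.columns = [f'week {i}' for i in range(1, len(weekly.columns) + 1)]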
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(25).reshape((5, 5)), index=pd.date_range('2015/01/01', periods=5, freq='D'))
df1['trading_signal']=[1,-1,1,-1,1]
df1
0 1 2 3 4 trading_signal
2015-01-01 0 1 2 3 4 1
2015-01-02 5 6 7 8 9 -1
2015-01-03 10 11 12 13 14 1
2015-01-04 15 16 17 18 19 -1
2015-01-05 20 21 22 23 24 1
and
df2
0 1 2 3 4
Date Time
2015-01-01 22:55:00 0 1 2 3 4
23:55:00 5 6 7 8 9
2015-01-02 00:55:00 10 11 12 13 14
01:55:00 15 16 17 18 19
02:55:00 20 21 22 23 24
How would I get the value of trading_signal from df1 and send it to df2?
I want an output like this:
0 1 2 3 4 trading_signal
Date Time
2015-01-01 22:55:00 0 1 2 3 4 1
23:55:00 5 6 7 8 9 1
2015-01-02 00:55:00 10 11 12 13 14 -1
01:55:00 15 16 17 18 19 -1
02:55:00 20 21 22 23 24 -1
You need to either merge or join. If you merge you need to reset_index, which is less memory efficient and slower than using join. Please read the docs on Joining a single index to a multi index:
New in version 0.14.0.
You can join a singly-indexed DataFrame with a level of a
multi-indexed DataFrame. The level will match on the name of the index
of the singly-indexed frame against a level name of the multi-indexed
frame
If you want to use join, you must name the index of df1 to be Date so that it matches the name of the first level of df2:
df1.index.names = ['Date']
df1[['trading_signal']].join(df2, how='right')
trading_signal 0 1 2 3 4
Date Time
2015-01-01 22:55:00 1 0 1 2 3 4
23:55:00 1 5 6 7 8 9
2015-01-02 00:55:00 -1 10 11 12 13 14
01:55:00 -1 15 16 17 18 19
02:55:00 -1 20 21 22 23 24
I'm joining right for a reason; if you don't understand what this means, please read the brief primer on merge methods (relational algebra).
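For comparison, a rough sketch of the merge route mentioned above (it has to materialise the MultiIndex as columns first, which is part of why join is preferable here):
merged = (df2.reset_index()
             .merge(df1[['trading_signal']], left_on='Date', right_index=True, how='left')
             .set_index(['Date', 'Time']))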