The question is still not answered!
Let's say that I have this dataframe:
import pandas as pd
Name = ['ID', 'Country', 'IBAN','ID_bal_amt', 'ID_bal_time','Dan_city','ID_bal_mod','Dan_country','ID_bal_type', 'ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ,'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country','ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ]
Value = ['TAMARA_CO', 'GERMANY','FR56', '12','June','Berlin','OPBD', '55','CRDT','432', 'August', 'CLBD','DBT', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP','432','March','FABD','CRDT']
Ccy = ['','','','EUR','EUR','','EUR','','','','EUR','EUR','USD','USD','USD','','CHF', '','DKN','','','USD','CHF']
Group = ['0','0','0','1','1','1','1','1','1','2','2','2','2','2','2','2','3','3','3','4','4','4','4']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 ID_bal_amt 12 EUR 1
4 ID_bal_time June EUR 1
5 Dan_city Berlin 1
6 ID_bal_mod OPBD EUR 1
7 Dan_country 55 1
8 ID_bal_type CRDT 1
9 ID_bal_amt 432 2
10 ID_bal_time August EUR 2
11 ID_bal_mod CLBD EUR 2
12 ID_bal_type DBT USD 2
13 Dan_sex M USD 2
14 Dan_Age 22 USD 2
15 Dan_country FRA 2
16 Dan_sex M CHF 3
17 Dan_city Madrid 3
18 Dan_country ESP DKN 3
19 ID_bal_amt 432 4
20 ID_bal_time March 4
21 ID_bal_mod FABD USD 4
22 ID_bal_type CRDT CHF 4
I want to reduce this dataframe! I want to reduce only the rows that contain the string "bal", keeping the group of rows associated with the mode "CLBD". That means I search for the value "CLBD" under the name "ID_bal_mod" and then keep all the other names (ID_bal_amt, ID_bal_time, ID_bal_mod, ID_bal_type) that are in the same group. In our example, these are the names in group 2.
In addition, I want to change their value in the column "Group" to 0.
So at the end I would like to get this new dataframe, where the index is reset too:
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_city Berlin 1
4 Dan_country 55 1
5 ID_bal_amt 432 0
6 ID_bal_time August EUR 0
7 ID_bal_mod CLBD EUR 0
8 ID_bal_type DBT USD 0
9 Dan_sex M USD 2
10 Dan_Age 22 USD 2
11 Dan_country FRA 2
12 Dan_sex M CHF 3
13 Dan_city Madrid 3
14 Dan_country ESP DKN 3
Does anyone have an efficient idea?
Thank you!
Let's try your logic:
rows_with_bal = df['Name'].str.contains('bal')
groups_with_CLBD = ((rows_with_bal & df['Value'].eq('CLBD'))
                    .groupby(df['Group']).transform('any')
                    )
# set `Group` to '0' only for the "bal" rows of groups that contain CLBD
# (the `Group` column holds strings, so assign '0' rather than 0)
df.loc[groups_with_CLBD & rows_with_bal, 'Group'] = '0'
# keep the rows without "bal", plus every row of a group that contains CLBD,
# then reset the index as requested
df = df.loc[(~rows_with_bal) | groups_with_CLBD].reset_index(drop=True)
Output:
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_city Berlin 1
4 Dan_country 55 1
5 ID_bal_amt 432 0
6 ID_bal_time August EUR 0
7 ID_bal_mod CLBD EUR 0
8 ID_bal_type DBT USD 0
9 Dan_sex M USD 2
10 Dan_Age 22 USD 2
11 Dan_country FRA 2
12 Dan_sex M CHF 3
13 Dan_city Madrid 3
14 Dan_country ESP DKN 3
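Here groupby(df['Group']).transform('any') broadcasts the "this group contains CLBD" flag onto every row of the group, which is what lets the non-"bal" rows of group 2 survive the final filter while only its "bal" rows get their Group reset to 0.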
I have a dataframe with "close_date", "open_date", "amount", "sales_rep".
sales_rep  open_date(MM/DD/YYYY)  close_date  amount
Jim        1/01/2021              2/05/2021    3
Jim        1/15/2021              4/06/2021   26
Jim        2/01/2021              2/06/2021    7
Jim        2/15/2021              3/14/2021   12
Jim        3/01/2021              4/22/2021   13
Jim        3/15/2021              3/29/2021    5
Jim        4/01/2021              4/20/2021   17
Bob        1/01/2021              1/12/2021   23
Bob        1/15/2021              2/16/2021   12
Bob        2/01/2021              3/04/2021    4
Bob        2/15/2021              4/05/2021   23
Bob        3/01/2021              3/24/2021   12
Bob        3/15/2021              4/15/2021    7
Bob        4/01/2021              5/01/2021   20
I want to create a column that tells me the open amount. So if we take the second row, we can see that the opp was closed on 4/06/2021. I want to know how many open opps there were on that date. For example, I would check that the open date of row 5 is before the close date 4/06/2021 and that the close date of row 5 is after 4/06/2021; in this case both hold, so I would add that row's amount to the sum. I also want the current row's own value to be included in the sum. This should be done separately for each sales rep in the dataframe. I have filled in the table with the expected values below.
sales_rep  open_date(MM/DD/YYYY)  close_date  amount  open_amount_sum
Jim        1/01/2021              2/05/2021    3      36
Jim        1/15/2021              4/06/2021   26      56
Jim        2/01/2021              2/06/2021    7      33
Jim        2/15/2021              3/14/2021   12      51
Jim        3/01/2021              4/22/2021   13      13
Jim        3/15/2021              3/29/2021    5      44
Jim        4/01/2021              4/20/2021   17      30
Bob        1/01/2021              1/12/2021   23      23
Bob        1/15/2021              2/16/2021   12      39
Bob        2/01/2021              3/04/2021    4      39
Bob        2/15/2021              4/05/2021   23      50
Bob        3/01/2021              3/24/2021   12      42
Bob        3/15/2021              4/15/2021    7      27
Bob        4/01/2021              5/01/2021   20      20
(For the first row, I got 36 by adding 26 and 7, the only other amounts that fit the condition, plus the 3 from the row itself.)
Edit: @RJ's solution from the comments is better. Here it is, formatted slightly differently:
df['open_amount_sum'] = df.apply(
lambda x: df[
df['sales_rep'].eq(x['sales_rep']) &
df['open_date'].le(x['close_date']) &
df['close_date'].ge(x['close_date'])
]['amount'].sum(),
axis=1,
)
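Note that both this and the solution below assume open_date and close_date hold real datetimes; with the raw MM/DD/YYYY strings, .le / .ge compare lexicographically and only happen to work for this sample (all months are single-digit and the year is the same). A minimal conversion step, assuming the column names used here:
import pandas as pd

# parse the date strings so the comparisons are chronological
df['open_date'] = pd.to_datetime(df['open_date'])
df['close_date'] = pd.to_datetime(df['close_date'])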
Here is a solution, but it is slow and kind of ugly; it can definitely be improved:
import pandas as pd
import io
df = pd.read_csv(io.StringIO(
"""
sales_rep,open_date,close_date,amount
Jim,1/01/2021,2/05/2021,3
Jim,1/15/2021,4/06/2021,26
Jim,2/01/2021,2/06/2021,7
Jim,2/15/2021,3/14/2021,12
Jim,3/01/2021,4/22/2021,13
Jim,3/15/2021,3/29/2021,5
Jim,4/01/2021,4/20/2021,17
Bob,1/01/2021,1/12/2021,23
Bob,1/15/2021,2/16/2021,12
Bob,2/01/2021,3/04/2021,4
Bob,2/15/2021,4/05/2021,23
Bob,3/01/2021,3/24/2021,12
Bob,3/15/2021,4/15/2021,7
Bob,4/01/2021,5/01/2021,20
"""
))
sum_df = df.groupby('sales_rep').apply(
    lambda g: g['close_date'].apply(
        lambda close: g.loc[
            g['open_date'].le(close) & g['close_date'].ge(close),
            'amount'
        ].sum())
).reset_index(level=0)
df['close_sum'] = sum_df['close_date']
df
Merge the dataframe onto itself, then filter, before grouping:
(df
 .merge(df, on='sales_rep')
 .query('open_date_y <= close_date_x <= close_date_y')
 .loc(axis=1)['sales_rep', 'open_date_x', 'close_date_x', 'amount_x', 'amount_y']
 .rename(columns=lambda col: col.removesuffix('_x'))
 .rename(columns={'amount_y': 'open_sum_amount'})
 .groupby(['sales_rep', 'open_date', 'close_date', 'amount'],
          sort=False,
          as_index=False)
 .sum()
)
sales_rep open_date close_date amount open_sum_amount
0 Jim 2021-01-01 2021-02-05 3 36
1 Jim 2021-01-15 2021-04-06 26 56
2 Jim 2021-02-01 2021-02-06 7 33
3 Jim 2021-02-15 2021-03-14 12 51
4 Jim 2021-03-01 2021-04-22 13 13
5 Jim 2021-03-15 2021-03-29 5 44
6 Jim 2021-04-01 2021-04-20 17 30
7 Bob 2021-01-01 2021-01-12 23 23
8 Bob 2021-01-15 2021-02-16 12 39
9 Bob 2021-02-01 2021-03-04 4 39
10 Bob 2021-02-15 2021-04-05 23 50
11 Bob 2021-03-01 2021-03-24 12 42
12 Bob 2021-03-15 2021-04-15 7 27
13 Bob 2021-04-01 2021-05-01 20 20
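Note that the self-merge materializes every within-rep row pair before filtering, so the intermediate frame grows quadratically with the number of rows per rep; that memory cost is the trade-off for avoiding the row-wise apply of the other solutions.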
I have the following dataframe:
     date    wind (°)   wind (kt)  temp (C°)  humidity(%)  currents (°)  currents (kt)  stemp (C°)  sea_temp_diff  wind_distance_diff  wind_speed_diff    temp_diff  humidity_diff  current_distance_diff  current_speed_diff
8   12018  175.000000   16.333333  25.500000    82.500000     60.000000       0.100000   25.400000      -1.066667           23.333333        -0.500000    -0.333333     -12.000000             160.000000        6.666667e-02
9   12019  180.000000   17.000000  23.344828    79.724138    230.000000       0.100000   23.827586      -0.379310           22.068966         1.068966     0.827586      -7.275862             315.172414        3.449034e+02
10  12020  365.000000  208.653846  24.192308    79.346154    355.769231     192.500000   24.730769     574.653846         1121.923077      1151.153846  1149.346154     -19.538462            1500.000000        1.538454e+03
14  22019  530.357143  372.964286  23.964286    81.964286   1270.714286    1071.560714  735.642857    -533.642857         -327.500000      -356.892857     1.857143     -10.321429            -873.571429       -8.928107e+02
15  22020  216.551724   12.689655  24.517241    81.137931    288.275862     172.565517  196.827586    -171.379310           -8.965517         3.724138     1.413793      -7.137931            -105.517241       -1.722724e+02
16  32019  323.225806  174.709677  25.225806    80.741935    260.000000     161.451613   25.709677     480.709677          486.451613       483.967742     0.387097     153.193548            1044.516129        9.677065e+02
17  32020  351.333333  178.566667  25.533333    78.800000    427.666667     166.666667   26.600000     165.533333         -141.000000      -165.766667   166.633333     158.933333               8.333333        1.500000e-01
18  42017  180.000000   14.000000  27.000000  5000.000000    200.000000       0.400000   25.400000       2.600000           20.000000        -4.000000     0.000000       0.000000             -90.000000       -1.000000e-01
19  42019  694.230769  589.769231  24.038462    69.461538    681.153846     577.046154   26.884615      -1.346154           37.307692        -1.692308     1.500000       4.769231              98.846154        1.538462e-01
20  42020  306.666667  180.066667  24.733333    75.166667    427.666667     166.666667   26.800000     165.066667          205.333333       165.200000     1.100000      -4.066667             360.333333        3.334233e+02
21  52017  146.333333   11.966667  22.900000  5000.000000    116.333333       0.410000   26.066667      -1.553333            8.666667         0.833333    -0.766667       0.000000              95.000000       -1.300000e-01
22  52019  107.741935   12.322581  23.419355    63.032258    129.354839       0.332258   25.935484      -1.774194           14.838710         0.096774    -0.612903     -14.451613             130.967742
I need to sort the 'date' column chronologically, and I'm wondering if there's a way for me to split it into two columns, with e.g. the '10' in one and the '2017' in the other, sort both of them in ascending order, and then bring them back together.
I had tried this:
australia_overview[['month','year']] = australia_overview['date'].str.split("2",expand=True)
But I am getting an error like this:
ValueError: Columns must be same length as key
How can I solve this issue?
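A note on the error first: str.split("2", expand=True) splits on every literal "2" in the string, and different rows yield different numbers of pieces, so the expanded frame has more columns than the two names ['month', 'year'] you try to assign, hence the ValueError. Splitting on the space and slicing the digits, as below, avoids this.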
From your DataFrame:
>>> df = pd.DataFrame({'id': [1, 2, 3, 4],
... 'date': ['1 42018', '12 32019', '8 112020', '23 42021']},
... index = [0, 1, 2, 3])
>>> df
id date
0 1 1 42018
1 2 12 32019
2 3 8 112020
3 4 23 42021
We can split the column to get the day as the first value, like so:
>>> df['day'] = df['date'].str.split(' ', expand=True)[0]
>>> df
id date day
0 1 1 42018 1
1 2 12 32019 12
2 3 8 112020 8
3 4 23 42021 23
And take the last 4 digits of the date column for the year, to get the expected result:
>>> df['year'] = df['date'].str[-4:].astype(int)
>>> df
id date day year
0 1 1 42018 1 2018
1 2 12 32019 12 2019
2 3 8 112020 8 2020
3 4 23 42021 23 2021
Bonus: as asked in the comments, you can even get the month using the same principle:
>>> df['month'] = df['date'].str.split(' ', expand=True)[1].str[:-4].astype(int)
>>> df
id date day year month
0 1 1 42018 1 2018 4
1 2 12 32019 12 2019 3
2 3 8 112020 8 2020 11
3 4 23 42021 23 2021 4
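To actually sort chronologically, which was the original goal, you can then assemble these parts into a real datetime column; a short sketch, assuming the day, year and month columns built above (day is still a string, hence the cast):
>>> df['parsed'] = pd.to_datetime(df[['year', 'month', 'day']].astype(int))
>>> df = df.sort_values('parsed').reset_index(drop=True)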
This is my dataset, where I have different countries, different models for the different countries, years, and the price and volume.
data_dic = {
"Country" : [1,1,1,1,2,2,2,2],
"Model" : ["A","B","B","A","A","B","B","A"],
"Year": [2005,2005,2020,2020,2005,2005,2020,2020],
"Price" : [100,172,852,953,350,452,658,896],
"Volume" : [4,8,9,10,12,6,8,9]
}
Country Model Year Price Volume
0 1 A 2005 100 4
4 2 A 2005 350 12
3 1 A 2020 953 10
7 2 A 2020 896 9
1 1 B 2005 172 8
5 2 B 2005 452 6
2 1 B 2020 852 9
6 2 B 2020 658 8
I would like to obtain the following, where 1) the column "Division_Price" is the division of the price for Country 1, Model A between the years 2005 and 2020, and 2) the column "Division_Volume" is the corresponding division in volume.
data_dic2 = {
"Country" : [1,1,1,1,2,2,2,2],
"Model" : ["A","B","B","A","A","B","B","A"],
"Year": [2005,2005,2020,2020,2005,2005,2020,2020],
"Price" : [100,172,852,953,350,452,658,896],
"Volume" : [4,8,9,10,12,6,8,9],
"Division_Price": [0.953,4.95,4.95,0.953,2.56,1.45,1.45,2.56],
"Division_Volume": [2.5,1.125,1.125,2.5,1,1.33,1.33,1],
}
print(data_dic2)
Country Model Year Price Volume Division_Price Division_Volume
0 1 A 2005 100 4 0.953 2.500
4 2 A 2005 350 12 2.560 1.000
3 1 A 2020 953 10 0.953 2.500
7 2 A 2020 896 9 2.560 1.000
1 1 B 2005 172 8 4.950 1.125
5 2 B 2005 452 6 1.450 1.330
2 1 B 2020 852 9 4.950 1.125
6 2 B 2020 658 8 1.450 1.330
My whole dataset has up to 50 countries and up to 10 models, with years ranging from 1990 to 2030.
I am still unsure how to account for the multiple conditions across the three columns so that I can divide the Price and Volume columns automatically based on Country, Year and Model.
Thanks!
You can try the following, using df.pivot, df.stack() and df.merge:
>>> df2 = ( df.pivot(index=['Year'], columns=['Model', 'Country'], values=['Price', 'Volume'])
.diff().bfill(downcast='infer').abs().stack().stack()
.sort_index(level=-1).add_prefix('Difference_')
)
>>> df2
Difference_Price Difference_Volume
Year Country Model
2005 1 A 853 6
2 A 546 3
2020 1 A 853 6
2 A 546 3
2005 1 B 680 1
2 B 206 2
2020 1 B 680 1
2 B 206 2
>>> df.merge(df2, on=['Country', 'Model', 'Year'], how='right')
Country Model Year Price Volume Difference_Price Difference_Volume
0 1 A 2005 100 4 853 6
1 2 A 2005 350 12 546 3
2 1 A 2020 953 10 853 6
3 2 A 2020 896 9 546 3
4 1 B 2005 172 8 680 1
5 2 B 2005 452 6 206 2
6 1 B 2020 852 9 680 1
7 2 B 2020 658 8 206 2
EDIT:
For your new dataframe, I think the 0.953 should be 9.530; if so, you can use pct_change and add 1:
>>> df2 = ( df.pivot(index=['Year'], columns=['Model', 'Country'], values=['Price', 'Volume'])
.pct_change(1).add(1).bfill(downcast='infer').abs().stack().stack()
.sort_index(level=-1).add_prefix('Division_').round(3)
)
>>> df2
Division_Price Division_Volume
Year Country Model
2005 1 A 9.530 2.500
2 A 2.560 0.750
2020 1 A 9.530 2.500
2 A 2.560 0.750
2005 1 B 4.953 1.125
2 B 1.456 1.333
2020 1 B 4.953 1.125
2 B 1.456 1.333
>>> df.merge(df2, on=['Country', 'Model', 'Year'], how='right')
Country Model Year Price Volume Division_Price Division_Volume
0 1 A 2005 100 4 9.530 2.500
1 2 A 2005 350 12 2.560 0.750
2 1 A 2020 953 10 9.530 2.500
3 2 A 2020 896 9 2.560 0.750
4 1 B 2005 172 8 4.953 1.125
5 2 B 2005 452 6 1.456 1.333
6 1 B 2020 852 9 4.953 1.125
7 2 B 2020 658 8 1.456 1.333
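For comparison, a minimal groupby sketch that avoids the pivot, under the assumption that each (Country, Model) pair has exactly two rows, one per year; it divides each pair's later-year value by its earlier-year value and broadcasts the ratio back to both rows:
>>> df = df.sort_values('Year')  # make sure the earlier year comes first in each pair
>>> g = df.groupby(['Country', 'Model'])
>>> df['Division_Price'] = g['Price'].transform(lambda s: s.iloc[-1] / s.iloc[0])
>>> df['Division_Volume'] = g['Volume'].transform(lambda s: s.iloc[-1] / s.iloc[0])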
My goal is to replace the last value (or the last several values) of each id with NaN. My real dataset is quite large and has groups of different sizes.
Example:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2010,2011,2012,2013,2014,2015]
percent = [120,70,37,40,50,110,140,100,90,5,52,80,60,40,70,60,50,110]
dictex ={"id":ids,"year":year,"percent [%]": percent}
dfex = pd.DataFrame(dictex)
print(dfex)
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 50
5 1 2005 110
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 52
11 2 1995 80
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 50
17 3 2015 110
My goal is to replace the last 1, 2, or 3 values of the "percent [%]" column for each id (group) with NaN.
The result should look like this (here, replacing the last 2 values of each id):
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 NaN
17 3 2015 NaN
I know there should be a relatively easy solution for this, but I'm new to Python and simply haven't been able to figure out an elegant way.
Thanks for the help!
Try using groupby, tail and index to find the index of the rows that will be modified, then use loc to change the values:
import numpy as np

nrows = 2
idx = df.groupby('id').tail(nrows).index
df.loc[idx, 'percent [%]'] = np.nan
# output
id year percent [%]
0 1 2000 120.0
1 1 2001 70.0
2 1 2002 37.0
3 1 2003 40.0
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140.0
7 2 1991 100.0
8 2 1992 90.0
9 2 1993 5.0
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60.0
13 3 2011 40.0
14 3 2012 70.0
15 3 2013 60.0
16 3 2014 NaN
17 3 2015 NaN
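An equivalent mask-based sketch, under the same assumptions, that avoids materializing the tail frame:
# cumcount(ascending=False) numbers rows from the end of each group,
# so values < nrows mark the last `nrows` rows of each id
mask = df.groupby('id').cumcount(ascending=False) < nrows
df.loc[mask, 'percent [%]'] = np.nan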
So I have a dataframe like:
Number Country StartDate EndDate
12 US 1/1/2023 12/1/2023
12 Mexico 1/1/2024 12/1/2024
And what I am trying to do is:
Number Country Date
12 US 1/1/2023
12 US 2/1/2023
12 US 3/1/2023
12 US 4/1/2023
12 US 5/1/2023
12 US 6/1/2023
12 US 7/1/2023
12 US 8/1/2023
12 US 9/1/2023
12 US 10/1/2023
12 US 11/1/2023
12 US 12/1/2023
12 Mexico 1/1/2024
12 Mexico 2/1/2024
12 Mexico 3/1/2024
12 Mexico 4/1/2024
12 Mexico 5/1/2024
12 Mexico 6/1/2024
12 Mexico 7/1/2024
12 Mexico 8/1/2024
12 Mexico 9/1/2024
12 Mexico 10/1/2024
12 Mexico 11/1/2024
12 Mexico 12/1/2024
This problem is very similar to Adding rows for each month in a dataframe based on column date.
However, that problem only accounts for a unique key of one column. In this example the unique key is the combination of Number and Country.
This is what I am currently doing; however, it only accounts for the 'Number' column, and I need to include both Number and Country since together they are the unique key.
df1 = pd.concat([pd.Series(r.Number, pd.date_range(start = r.StartDate, end = r.EndDate, freq='MS'))
for r in df1.itertuples()]).reset_index().drop_duplicates()
Create the range, then explode:
df['New']= [pd.date_range(start = x, end = y, freq='MS') for x , y in zip(df.pop('StartDate'),df.pop('EndDate'))]
df=df.explode('New')
Out[54]:
Number Country New
0 12 US 2023-01-01
0 12 US 2023-02-01
0 12 US 2023-03-01
0 12 US 2023-04-01
0 12 US 2023-05-01
0 12 US 2023-06-01
0 12 US 2023-07-01
0 12 US 2023-08-01
0 12 US 2023-09-01
0 12 US 2023-10-01
0 12 US 2023-11-01
0 12 US 2023-12-01
1 12 Mexico 2024-01-01
1 12 Mexico 2024-02-01
1 12 Mexico 2024-03-01
1 12 Mexico 2024-04-01
1 12 Mexico 2024-05-01
1 12 Mexico 2024-06-01
1 12 Mexico 2024-07-01
1 12 Mexico 2024-08-01
1 12 Mexico 2024-09-01
1 12 Mexico 2024-10-01
1 12 Mexico 2024-11-01
1 12 Mexico 2024-12-01
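If you want the column name and index to match the target frame in the question, a small cleanup on the result above:
# rename the exploded column and rebuild a clean 0..n index
df = df.rename(columns={'New': 'Date'}).reset_index(drop=True)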