As a newbie to deeper DataFrame operations, I would like to ask: how can I find, for example, the lowest campaign ID per customerid in a DataFrame like the one below? As I have learned, iterating over a DataFrame should be avoided.
orderid customerid campaignid orderdate city state zipcode paymenttype totalprice numorderlines numunits
0 1002854 45978 2141 2009-10-13 NEWTON MA 02459 VI 190.00 3 3
1 1002855 125381 2173 2009-10-13 NEW ROCHELLE NY 10804 VI 10.00 1 1
2 1002856 103122 2141 2011-06-02 MIAMI FL 33137 AE 35.22 2 2
3 1002857 130980 2173 2009-10-14 E RUTHERFORD NJ 07073 AE 10.00 1 1
4 1002886 48553 2141 2010-11-19 BALTIMORE MD 21218 VI 10.00 1 1
5 1002887 106150 2173 2009-10-15 ROWAYTON CT 06853 AE 10.00 1 1
6 1002888 27805 2173 2009-10-15 INDIANAPOLIS IN 46240 VI 10.00 1 1
7 1002889 24546 2173 2009-10-15 PLEASANTVILLE NY 10570 MC 10.00 1 1
8 1002890 43783 2173 2009-10-15 EAST STROUDSBURG PA 18301 DB 29.68 2 2
9 1003004 15688 2173 2009-10-15 ROUND LAKE PARK IL 60073 DB 19.68 1 1
10 1003044 130970 2141 2010-11-22 BLOOMFIELD NJ 07003 AE 10.00 1 1
11 1003045 40048 2173 2010-11-22 SPRINGFIELD IL 62704 MC 10.00 1 1
12 1003046 21927 2141 2010-11-22 WACO TX 76710 MC 17.50 1 1
13 1003075 130971 2141 2010-11-22 FAIRFIELD NJ 07004 MC 59.80 1 4
14 1003076 7117 2141 2010-11-22 BROOKLYN NY 11228 AE 22.50 1 1
Try the following
df.groupby('customerid')['campaignid'].min()
This groups the rows by the unique values of customerid and then finds the minimum value per group for a given column using ['column_name'].min()
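For example, to get the result back as a regular DataFrame, or to broadcast the per-customer minimum onto every order row, a small sketch (the column name min_campaignid is just illustrative):
lowest = df.groupby('customerid', as_index=False)['campaignid'].min()           # two columns: customerid, campaignid
df['min_campaignid'] = df.groupby('customerid')['campaignid'].transform('min')  # aligned to the original rows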
I was solving a practice question where I wanted to get the top 5 percentile of frauds for each state. I was able to solve it in SQL, but pandas gives me a different answer than SQL.
Full Question
Top Percentile Fraud
ABC Corp is a mid-sized insurer in the US, and in the recent past their fraudulent claims have increased significantly for their personal auto insurance portfolio.
They have developed an ML-based predictive model to identify the propensity of fraudulent claims.
Now, they assign highly experienced claim adjusters to the top 5 percentile of claims identified by the model.
Your objective is to identify the top 5 percentile of claims from each state.
Your output should be policy number, state, claim cost, and fraud score.
Question: How to get the same answer in pandas that I obtained from SQL?
My attempt
I broke the fraud score into 100 equal parts using pandas cut, took the categorical codes for each bin, and then kept values greater than or equal to 95, but this gives a different result.
I am trying to get the same answer that I got from the SQL query.
import numpy as np
import pandas as pd
url = "https://raw.githubusercontent.com/bpPrg/Share/master/data/fraud_score.tsv"
df = pd.read_csv(url,delimiter='\t')
print(df.shape) # (400, 4)
df.head(2)
policy_num state claim_cost fraud_score
0 ABCD1001 CA 4113 0.613
1 ABCD1002 CA 3946 0.156
Problem
Group by each state, and find top 5 percentile fraud scores.
My attempt
df['state_ntile'] = df.groupby('state')['fraud_score']\
.apply(lambda ser: pd.cut(ser, 100).cat.codes + 1)  # +1 makes the codes run from 1 to 100 inclusive
df.query('state_ntile >=95')\
.sort_values(['state','fraud_score'],ascending=[True,False]).reset_index(drop=True)
PostgreSQL code (I know SQL; I want the answer in pandas)
SELECT policy_num,
state,
claim_cost,
fraud_score,
a.percentile
FROM
(SELECT *,
ntile(100) over(PARTITION BY state
ORDER BY fraud_score DESC) AS percentile
FROM fraud_score)a
WHERE percentile <=5
The output I want
policy_num state claim_cost fraud_score percentile
0 ABCD1027 CA 2663 0.988 1
1 ABCD1016 CA 1639 0.964 2
2 ABCD1079 CA 4224 0.963 3
3 ABCD1081 CA 1080 0.951 4
4 ABCD1069 CA 1426 0.948 5
5 ABCD1222 FL 2392 0.988 1
6 ABCD1218 FL 1419 0.961 2
7 ABCD1291 FL 2581 0.939 3
8 ABCD1230 FL 2560 0.923 4
9 ABCD1277 FL 2057 0.923 5
10 ABCD1189 NY 3577 0.982 1
11 ABCD1117 NY 4903 0.978 2
12 ABCD1187 NY 3722 0.976 3
13 ABCD1196 NY 2994 0.973 4
14 ABCD1121 NY 4009 0.969 5
15 ABCD1361 TX 4950 0.999 1
16 ABCD1304 TX 1407 0.996 1
17 ABCD1398 TX 3191 0.978 2
18 ABCD1366 TX 2453 0.968 3
19 ABCD1386 TX 4311 0.963 4
20 ABCD1363 TX 4103 0.960 5
Thanks to Emma, I got a partial solution.
I could not get the ranks 1, 2, 3, ..., 100, but the resulting table at least matches the output of SQL. I am still learning how to use pandas.
Logic:
To get the top 5 percentile, we can use quantile values >= 0.95 as shown below:
import numpy as np
import pandas as pd
url = "https://raw.githubusercontent.com/bpPrg/Share/master/data/fraud_score.tsv"
df = pd.read_csv(url,delimiter='\t')
print(df.shape)
df['state_quantile'] = df.groupby('state')['fraud_score'].transform(lambda x: x.quantile(0.95))
dfx = df.query("fraud_score >= state_quantile").reset_index(drop=True)\
.sort_values(['state','fraud_score'],ascending=[True,False])
dfx
Result
policy_num state claim_cost fraud_score state_quantile
1 ABCD1027 CA 2663 0.988 0.94710
0 ABCD1016 CA 1639 0.964 0.94710
3 ABCD1079 CA 4224 0.963 0.94710
4 ABCD1081 CA 1080 0.951 0.94710
2 ABCD1069 CA 1426 0.948 0.94710
11 ABCD1222 FL 2392 0.988 0.91920
10 ABCD1218 FL 1419 0.961 0.91920
14 ABCD1291 FL 2581 0.939 0.91920
12 ABCD1230 FL 2560 0.923 0.91920
13 ABCD1277 FL 2057 0.923 0.91920
8 ABCD1189 NY 3577 0.982 0.96615
5 ABCD1117 NY 4903 0.978 0.96615
7 ABCD1187 NY 3722 0.976 0.96615
9 ABCD1196 NY 2994 0.973 0.96615
6 ABCD1121 NY 4009 0.969 0.96615
16 ABCD1361 TX 4950 0.999 0.96000
15 ABCD1304 TX 1407 0.996 0.96000
20 ABCD1398 TX 3191 0.978 0.96000
18 ABCD1366 TX 2453 0.968 0.96000
19 ABCD1386 TX 4311 0.963 0.96000
17 ABCD1363 TX 4103 0.960 0.96000
pure pandas
You can use rank() to get percentiles:
out = df.assign(
    percentile=(100 * df.groupby('state')['fraud_score']
                .rank(ascending=False, pct=True, method='first'))
               .astype(int)  # truncates the decimal part
).query('percentile <= 5')
The outcome is in a different order than the original df, but contains the information you seek:
>>> out
policy_num state claim_cost fraud_score percentile
15 ABCD1016 CA 1639 0.964 2
26 ABCD1027 CA 2663 0.988 1
68 ABCD1069 CA 1426 0.948 5
78 ABCD1079 CA 4224 0.963 3
80 ABCD1081 CA 1080 0.951 4
116 ABCD1117 NY 4903 0.978 2
120 ABCD1121 NY 4009 0.969 5
186 ABCD1187 NY 3722 0.976 3
188 ABCD1189 NY 3577 0.982 1
195 ABCD1196 NY 2994 0.973 4
217 ABCD1218 FL 1419 0.961 2
221 ABCD1222 FL 2392 0.988 1
229 ABCD1230 FL 2560 0.923 4
276 ABCD1277 FL 2057 0.923 5
290 ABCD1291 FL 2581 0.939 3
303 ABCD1304 TX 1407 0.996 1
360 ABCD1361 TX 4950 0.999 0
362 ABCD1363 TX 4103 0.960 5
365 ABCD1366 TX 2453 0.968 3
385 ABCD1386 TX 4311 0.963 4
397 ABCD1398 TX 3191 0.978 2
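If you prefer the SQL-style ordering (by state, then descending fraud_score), you can sort the result afterwards, for example:
out.sort_values(['state', 'fraud_score'], ascending=[True, False]).reset_index(drop=True)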
duckdb
Having spent over a decade with PostgreSQL (and the late, wonderful Greenplum), I have grown quite fond of duckdb. It is very fast, can operate straight on (from/to) parquet files, etc. Definitely a space to watch.
Here is how it looks on your data:
duckdb.query_df(df, 'df', """
SELECT policy_num,
state,
claim_cost,
fraud_score,
a.percentile
FROM
(SELECT *,
ntile(100) over(PARTITION BY state
ORDER BY fraud_score DESC) AS percentile
FROM df) as a
WHERE percentile <=5
""").df()
And the result:
policy_num state claim_cost fraud_score percentile
0 ABCD1222 FL 2392 0.988 1
1 ABCD1218 FL 1419 0.961 2
2 ABCD1291 FL 2581 0.939 3
3 ABCD1230 FL 2560 0.923 4
4 ABCD1277 FL 2057 0.923 5
5 ABCD1361 TX 4950 0.999 1
6 ABCD1304 TX 1407 0.996 1
7 ABCD1398 TX 3191 0.978 2
8 ABCD1366 TX 2453 0.968 3
9 ABCD1386 TX 4311 0.963 4
10 ABCD1363 TX 4103 0.960 5
11 ABCD1027 CA 2663 0.988 1
12 ABCD1016 CA 1639 0.964 2
13 ABCD1079 CA 4224 0.963 3
14 ABCD1081 CA 1080 0.951 4
15 ABCD1069 CA 1426 0.948 5
16 ABCD1189 NY 3577 0.982 1
17 ABCD1117 NY 4903 0.978 2
18 ABCD1187 NY 3722 0.976 3
19 ABCD1196 NY 2994 0.973 4
20 ABCD1121 NY 4009 0.969 5
Comparison
An attentive eye will reveal that there are subtle differences between the two results above (beyond the ordering). This is due to different definitions of percentiles (vs. ntile(100)).
Here is how to see these differences:
a = out.set_index('policy_num').sort_index()
b = duck_out.set_index('policy_num').sort_index()
Then:
>>> a.equals(b)
False
>>> a[(a != b).any(1)]
state claim_cost fraud_score percentile
policy_num
ABCD1361 TX 4950 0.999 0
>>> b[(a != b).any(1)]
state claim_cost fraud_score percentile
policy_num
ABCD1361 TX 4950 0.999 1
If we look at the value (before truncation) of percentile:
>>> s = (a != b).any(1)
>>> df.assign(
... percentile=(100 * df.groupby('state')['fraud_score'].rank(
... ascending=False, pct=True, method='first'))
... ).set_index('policy_num').loc[s[s].index]
state claim_cost fraud_score percentile
policy_num
ABCD1361 TX 4950 0.999 0.990099
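If you need pandas to match NTILE(100) exactly, including the oversized leading bucket that puts the two highest TX scores both in percentile 1, one possible sketch of the bucket arithmetic (the ntile helper below is an illustration, not a pandas built-in):
import numpy as np
import pandas as pd

def ntile(scores, n=100):
    # emulate SQL NTILE(n) OVER (ORDER BY scores DESC) within one group:
    # with len(scores) = q*n + r, the first r buckets hold q + 1 rows, the rest hold q
    k = scores.rank(method='first', ascending=False).astype(int) - 1  # zero-based ordered position
    q, r = divmod(len(scores), n)
    sizes = np.full(n, q)
    sizes[:r] += 1
    buckets = np.repeat(np.arange(1, n + 1), sizes)  # bucket label for each ordered position
    return pd.Series(buckets[k.to_numpy()], index=scores.index)

df['percentile'] = df.groupby('state')['fraud_score'].transform(ntile)
out_ntile = df.query('percentile <= 5')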
I have a dataframe that looks like this:
YEAR MONTH DAY_OF_MONTH DAY_OF_WEEK ORIGIN_CITY_NAME ORIGIN_STATE_ABR DEST_CITY_NAME DEST_STATE_ABR DEP_TIME DEP_DELAY_NEW ARR_TIME ARR_DELAY_NEW CANCELLED AIR_TIME
0 2020 1 1 3 Ontario CA San Francisco CA 1851 41 2053 68 0 74
1 2020 1 1 3 Ontario CA San Francisco CA 1146 0 1318 0 0 71
2 2020 1 1 3 Ontario CA San Jose CA 2016 0 2124 0 0 57
3 2020 1 1 3 Ontario CA San Jose CA 1350 10 1505 10 0 63
4 2020 1 1 3 Ontario CA San Jose CA 916 1 1023 0 0 57
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
607341 2020 1 16 4 Portland ME New York NY 554 0 846 65 0 57
607342 2020 1 17 5 Portland ME New York NY 633 33 804 23 0 69
607343 2020 1 18 6 Portland ME New York NY 657 0 810 0 0 55
607344 2020 1 19 7 Portland ME New York NY 705 5 921 39 0 54
607345 2020 1 20 1 Portland ME New York NY 628 0 741 0 0 52
I am trying to modify the columns DEP_TIME and ARR_TIME so that they have the format hh:mm. All values should be treated as strings. There are also null values in some rows that need to be accounted for. Performance is also a consideration (albeit secondary to solving the actual problem), since I need to change about 10M records in total.
The challenge for me is figuring out how to modify these values iteratively based on a condition while also having access to the original value when replacing it. I simply could not find a solution for that specific problem elsewhere; most examples replace values with a known constant.
Thanks for your help.
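A minimal vectorized sketch, assuming DEP_TIME and ARR_TIME are numeric (e.g. 1851.0, 554.0) with NaN for missing values; the toy frame and the to_hhmm helper are only illustrative:
import numpy as np
import pandas as pd

# toy data shaped like the DEP_TIME / ARR_TIME columns in the question
df = pd.DataFrame({'DEP_TIME': [1851.0, 554.0, np.nan],
                   'ARR_TIME': [2053.0, 846.0, 741.0]})

def to_hhmm(col):
    # zero-pad to four digits and insert the colon; missing values stay NaN
    s = col.dropna().astype(int).astype(str).str.zfill(4)
    return (s.str[:2] + ':' + s.str[2:]).reindex(col.index)

df[['DEP_TIME', 'ARR_TIME']] = df[['DEP_TIME', 'ARR_TIME']].apply(to_hhmm)
print(df)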
The question is still not answered!
Let's say that I have this dataframe :
import pandas as pd
Name = ['ID', 'Country', 'IBAN','ID_bal_amt', 'ID_bal_time','Dan_city','ID_bal_mod','Dan_country','ID_bal_type', 'ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ,'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country','ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ]
Value = ['TAMARA_CO', 'GERMANY','FR56', '12','June','Berlin','OPBD', '55','CRDT','432', 'August', 'CLBD','DBT', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP','432','March','FABD','CRDT']
Ccy = ['','','','EUR','EUR','','EUR','','','','EUR','EUR','USD','USD','USD','','CHF', '','DKN','','','USD','CHF']
Group = ['0','0','0','1','1','1','1','1','1','2','2','2','2','2','2','2','3','3','3','4','4','4','4']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 ID_bal_amt 12 EUR 1
4 ID_bal_time June EUR 1
5 Dan_city Berlin 1
6 ID_bal_mod OPBD EUR 1
7 Dan_country 55 1
8 ID_bal_type CRDT 1
9 ID_bal_amt 432 2
10 ID_bal_time August EUR 2
11 ID_bal_mod CLBD EUR 2
12 ID_bal_type DBT USD 2
13 Dan_sex M USD 2
14 Dan_Age 22 USD 2
15 Dan_country FRA 2
16 Dan_sex M CHF 3
17 Dan_city Madrid 3
18 Dan_country ESP DKN 3
19 ID_bal_amt 432 4
20 ID_bal_time March 4
21 ID_bal_mod FABD USD 4
22 ID_bal_type CRDT CHF 4
I want to reduce this dataframe! Among the rows whose Name contains the string "bal", I want to keep only the group of rows associated with the mode "CLBD". That means I search for the value "CLBD" under the name "ID_bal_mod" and then keep all the other names (ID_bal_amt, ID_bal_time, ID_bal_mod, ID_bal_type) that are in the same group. In our example, these are the names in group 2.
In addition, I want to change their value in the column "Group" to 0.
So at the end I would like to get this new dataframe, where the index is reset too:
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_city Berlin 1
4 Dan_country 55 1
5 ID_bal_amt 432 0
6 ID_bal_time August EUR 0
7 ID_bal_mod CLBD EUR 0
8 ID_bal_type DBT USD 0
9 Dan_sex M USD 2
10 Dan_Age 22 USD 2
11 Dan_country FRA 2
12 Dan_sex M CHF 3
13 Dan_city Madrid 3
14 Dan_country ESP DKN 3
Does anyone have an efficient idea?
Thank you
Let's try your logic:
rows_with_bal = df['Name'].str.contains('bal')
groups_with_CLBD = ((rows_with_bal & df['Value'].eq('CLBD'))
.groupby(df['Group']).transform('any')
)
# set the `Group` to 0 for `groups_with_CLBD`
df.loc[groups_with_CLBD, 'Group'] = 0
# keep the rows without bal or `groups_with_CLBD`
df = df.loc[(~rows_with_bal) | groups_with_CLBD]
Output:
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
5 Dan_city Berlin 1
7 Dan_country 55 1
9 ID_bal_amt 432 0
10 ID_bal_time August EUR 0
11 ID_bal_mod CLBD EUR 0
12 ID_bal_type DBT USD 0
13 Dan_sex M USD 0
14 Dan_Age 22 USD 0
15 Dan_country FRA 0
16 Dan_sex M CHF 3
17 Dan_city Madrid 3
18 Dan_country ESP DKN 3
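If you also want the index reset, as in the desired output, you can finish with:
df = df.reset_index(drop=True)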
I hope you can help me with this.
The df looks like this.
region AMER
country Brazil Canada Columbia Mexico United States
metro Rio de Janeiro Sao Paulo Toronto Bogota Mexico City Monterrey Atlanta Boston Chicago Culpeper Dallas Denver Houston Los Angeles Miami New York Philadelphia Seattle Silicon Valley Washington D.C.
ID
321321 2 1 1 13 15 29 1 2 1 11 6 15 3 2 14 3
23213 3
231 2 2 3 1 5 6 3 3 4 3 3 4
23213 4 1 1 1 4 1 2 27 1
21321 4 2 2 1 14 3 2 4 2
12321 1 2 1 1 1 1 10
123213 2 45 5 1
12321 1
123 1 3 2
I want to get, for each row (ID), the count of metro and country columns per region that have data, and store that count in a new column.
Regards,
RJ
You may want to try
df['new'] = df.sum(level=0, axis=1)  # sums each row across columns, grouped by the top level ('region') of the column MultiIndex
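If what you actually need is a count of the metro columns that contain data in each row, rather than a sum of their values, a possible sketch, assuming empty cells are NaN and the columns form a (region, country, metro) MultiIndex (run it before adding helper columns):
metro_count = df.notna().sum(axis=1)                          # non-empty metro columns per ID row
per_country = df.notna().T.groupby(level='country').sum().T   # the same count broken down per country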
Having grouped data, I want to drop from the results groups that contain only a single observation with the value below a certain threshold.
Initial data:
df = pd.DataFrame(data={'Province' : ['ON','QC','BC','AL','AL','MN','ON'],
'City' :['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
'Sales' : [13,6,16,8,4,3,1]})
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
Now grouping the data:
df.groupby(['Province', 'City']).sum()
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
MN Winnipeg 3
ON Toronto 13
Windsor 1
QC Montreal 6
Now the part I can't figure out is how to drop provinces with only one city (or, more generally, N observations) whose total sales are less than 10. The expected output should be:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1
I.e. MN/Winnipeg and QC/Montreal are gone from the results. Ideally, they wouldn't be completely gone but combined into a new group called 'Other', but that may be material for another question (see the sketch at the end).
you can do it this way:
In [188]: df
Out[188]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [189]: g = df.groupby(['Province', 'City']).sum().reset_index()
In [190]: g
Out[190]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
3 MN Winnipeg 3
4 ON Toronto 13
5 ON Windsor 1
6 QC Montreal 6
Now we will create a mask for those 'provinces with more than one city':
In [191]: mask = g.groupby('Province').City.transform('count') > 1
In [192]: mask
Out[192]:
0 True
1 True
2 False
3 False
4 True
5 True
6 False
dtype: bool
And we keep rows that match that mask or have total sales greater than or equal to 10:
In [193]: g[(mask) | (g.Sales >= 10)]
Out[193]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
4 ON Toronto 13
5 ON Windsor 1
I wasn't satisfied with any of the answers given, so I kept chipping at this until I figured out the following solution:
In [72]: df
Out[72]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [73]: df.groupby(['Province', 'City']).sum().groupby(level=0).filter(lambda x: len(x) > 1 or x.Sales.sum() > 10)
Out[73]:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1
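For the 'Other' idea mentioned in the question, a possible sketch that relabels the failing provinces instead of dropping them, reusing the same count/threshold mask (the 'Other' label is just illustrative):
g = df.groupby(['Province', 'City'], as_index=False)['Sales'].sum()
keep = (g.groupby('Province')['City'].transform('count') > 1) | (g['Sales'] >= 10)
g['Province'] = g['Province'].where(keep, 'Other')
g.set_index(['Province', 'City']).sort_index()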