We have data representing workers' billing history of payments and penalties after their work shifts. Sometimes a worker's penalty is wrong, because he had technical problems with his mobile app and in reality he attended the job. Later the penalty is refunded, which appears with the description 'balance_correction'.
The goal is to show the n rows preceding the correction in order to find the pattern behind the penalty.
So here is the data:
d = {'balance_id': [70775, 70775, 70775, 70775, 70775],
     'amount': [2500, 2450, -500, 500, 2700],
     'description': ['payment_for_job_order_080ecd', 'payment_for_job_order_180eca',
                     'penalty_for_being_absent_at_job', 'balance_correction',
                     'payment_for_job_order_120ecq']}
df1 = pd.DataFrame(data=d)
df1
balance_id amount description
0 70775 2500 payment_for_job_order_080ecd
1 70775 2450 payment_for_job_order_180eca
2 70775 -500 penalty_for_being_absent_at_job
3 70775 500 balance_correction
4 70775 2700 payment_for_job_order_120ecq
I try this:
df1.loc[df1['description']=='balance_correction'].iloc[:-2]
and get nothing. Using shift doesn't help either. If we need to show 2 rows, the result should be
balance_id amount description
1 70775 2450 payment_for_job_order_180eca
2 70775 -500 penalty_for_being_absent_at_job
What can solve the issue?
If the index on your data frame is sequential (0, 1, 2, 3, ...), you can filter by the index:
idx = df1.loc[df1['description'] == 'balance_correction'].index
df1.loc[(idx - 2).append(idx - 1)]
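If the index is not sequential, a positional variant works too. This is a small sketch (the n = 2 and the helper names are just illustrative) that uses iloc instead of the index labels:

import numpy as np

n = 2  # number of rows to show before each correction
pos = np.flatnonzero(df1['description'] == 'balance_correction')  # positional locations of corrections
before = np.unique(np.concatenate([np.arange(max(p - n, 0), p) for p in pos]))
df1.iloc[before]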
Related
I have a data frame that is created by an application and stores info with the following structure: each row is one mutation that affects a gene and a transcript (transcripts are the same gene but in a different configuration).
data = {'ID': ['mut1', 'mut1', 'mut1', 'mut1', 'mut2'],
        'transcript_affected': ['00001', '00002', '00003', '00001', '00001'],
        'gene_affected': ['DIABLO', 'DIABLO', 'DIABLO', 'PLNH3', 'BRCA1']}
df = pd.DataFrame(data)
df
ID transcript_affected gene_affected
mut1 00001 DIABLO
mut1 00002 DIABLO
mut1 00003 DIABLO
mut1 00001 PLNH3
mut2 00001 BRCA1
# Mut 1 affects 2 genes (DIABLO and PLNH3), mut2 affects 1 gene
From this I have two questions:
How many mutations affect more than 1 gene?
How likely (in %) is it that more than one transcript is affected when a gene is affected? I mean, the proportion of genes with more than one affected transcript. For this example it would be 33% (3 genes are affected but only one has more than 1 transcript affected).
Some ideas on how I think it could be done:
To know how many mutations affect more than 1 gene, I think this would be a start:
df.groupby('ID')['transcript_affected'].count()
ID
mut1 4
mut2 1
And then I could count how many IDs have a value greater than 1.
For the second question
df.groupby('gene_affected')['transcript_affected'].count()
gene_affected
BRCA1 1
DIABLO 3
PLNH3 1
Name: transcript_affected, dtype: int64
Then I could somehow count (I don't know how) how many were more than 1 (>= 2).
Please try this:
Answer 1:
df_1 = df.groupby('ID')['gene_affected'].nunique().reset_index()  # nunique counts distinct genes per mutation
answer_1 = df_1[df_1['gene_affected'] > 1].shape[0]
Answer 2:
df_2 = df.groupby('gene_affected')['transcript_affected'].count().reset_index()
answer_2 = (df_2[df_2['transcript_affected']>1].shape[0]/df_2.shape[0]) * 100
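For the sample data this gives answer_1 = 1 (only mut1 affects more than one gene: DIABLO and PLNH3) and answer_2 of roughly 33.3 (of the three affected genes, only DIABLO has more than one affected transcript, and 1/3 * 100 is about 33.3%).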
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements; however, this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It's a bit of a complicated query, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm=(d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .apply(max))
   .unstack()
)
produces
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 315.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we group by MotherID, PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we group by 'tm' and apply max. For each sub-dataframe d we then obtain a Series that maps tm to max(abdomCirc).
Then unstack() moves tm into the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it you can simplify the above a bit:
(df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
similar idea, same output.
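If you want the exact column names from the question, a small follow-up sketch (assuming the result of Solution 2 is stored in res and all three trimesters occur somewhere in the data):

res = res.droplevel(0, axis=1)          # drop the 'abdomCirc' level of the column MultiIndex
res.columns = ['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd']
res = res.reset_index()                 # bring MotherID and PregnancyID back as columns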
There is a handy DataFrame method called query. This should do the job for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (without manually changing the values of MotherID and PregnancyID every time for each different group of rows), you have to combine it with groupby (as you did on your own); see the sketch below.
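For example, a sketch of that combination for the first trimester (the other two follow the same pattern with the 14-26 and 27-40 week ranges):

abdomCirc1st = (df.query('gestationalAgeInWeeks <= 13')
                  .groupby(['MotherID', 'PregnancyID'])['abdomCirc']
                  .max())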
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
In this particular issue, I have an imaginary city divided into squares - basically a MxN grid of squares covering the city. M and N can be relatively big, so I have cases with more than 40,000 square cells overall.
I have a number of customers Z distributed in this grid, some cells will contain many customers while others will be empty. I would like to find a way to place the minimum number of shops (only one per cell) to be able to serve all customers, with the restriction that all customers must be “in reach” of one shop and all customers need to be included.
As an additional couple of twist, I have these constraints/issues:
There is a maximum distance that a customer can travel - if the shop is in a cell too far away then the customer cannot be associated with that shop. Edit: it’s not really a distance, it’s a measure of how easy it is for a customer to reach a shop, so I can’t use circles...
While respecting the condition (1) above, there may well be multiple shops in reaching distance of the same customer. In this case, the closest shop should win.
At the moment I’m trying to ignore the issue of costs - many customers means bigger shops and larger costs - but maybe at some point I’ll think about that too. The problem is, I have no idea of the name of the problem I’m looking at nor about possible algorithmic solutions for it: can this be solved as a Linear Programming problem?
I normally code in Python, so any suggestions on a possible algorithmic approach and/or some code/libraries to solve it would be very much appreciated.
Thank you in advance.
Edit: as a follow-up, I found out I could solve this problem as a MINLP "uncapacitated facility location problem", but all the information I have found is way too complex: I don't care to know which customer is served by which shop, I only care to know if and where a shop is built. I have a secondary way - as post processing - to associate a customer to the most appropriate shop.
All the codes I found set up this monstrous linear system associating a constraint per customer per shop (as “explained” here: https://en.m.wikipedia.org/wiki/Facility_location_problem#Uncapacitated_facility_location), so in a situation like mine I could easily end up with a linear system with millions of rows and columns, which with integer/binary variables will take about the age of the universe to solve.
There must be an easier way to handle this...
I think this can be formulated as a set covering problem.
You say:
in a situation like mine I could easily end up with a linear system
with millions of rows and columns, which with integer/binary variables
will take about the age of the universe to solve
So let's see if that is even remotely true.
Step 1: generate some data
I generated a grid of 200 x 200, yielding 40,000 cells. I place at random M=500 customers. This looks like:
---- 22 PARAMETER cloc customer locations
x y
cust1 35 75
cust2 169 84
cust3 111 18
cust4 61 163
cust5 59 102
...
cust497 196 148
cust498 115 136
cust499 63 101
cust500 92 87
Step 2: calculate reach of each customer
The next step is to determine for each customer c the allowed locations (i,j) within reach. I created a large sparse boolean matrix reach(c,i,j) for this. I used the rule: if the manhattan distance is
|i-cloc(c,'x')|+|j-cloc(c,'y')| <= 10
then the store at (i,j) can service customer c. Zeros are not stored; this data structure has 106k elements.
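In Python this reach structure could be built with a sketch along these lines (the grid size, number of customers, and the names cloc and reach are assumptions for illustration):

import numpy as np

np.random.seed(1)
M_GRID, N_GRID = 200, 200      # grid dimensions, as in Step 1
N_CUST, MAX_DIST = 500, 10     # number of customers and maximum Manhattan distance

# random customer locations, one (x, y) per customer
cloc = {c: (np.random.randint(M_GRID), np.random.randint(N_GRID)) for c in range(N_CUST)}

# reach[c] = list of cells (i, j) whose Manhattan distance to customer c is <= MAX_DIST
reach = {c: [(i, j)
             for i in range(max(0, x - MAX_DIST), min(M_GRID, x + MAX_DIST + 1))
             for j in range(max(0, y - MAX_DIST), min(N_GRID, y + MAX_DIST + 1))
             if abs(i - x) + abs(j - y) <= MAX_DIST]
         for c, (x, y) in cloc.items()}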
Step 3: Form MIP model
We form a simple MIP model:
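The model display itself is not reproduced here; reconstructed from the description below (so treat it as a sketch), it is essentially:

\min \sum_{i,j} \mathrm{place}_{i,j}
\quad \text{s.t.} \quad \sum_{(i,j) \in \mathrm{reach}(c)} \mathrm{place}_{i,j} \ge 1 \;\; \text{for every customer } c,
\qquad \mathrm{place}_{i,j} \in \{0, 1\}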
The inequality constraint says: we need at least one store that is within reach of each customer. This is a very simple model to formulate and to implement.
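Since the question mentions Python, here is a minimal sketch of this set covering model using the PuLP library (this is my own illustration, not the modelling system used for the listings in this answer; it reuses the hypothetical reach dict and grid constants from the snippet above):

import pulp

# all 40,000 candidate cells of the 200 x 200 grid
cells = [(i, j) for i in range(M_GRID) for j in range(N_GRID)]

model = pulp.LpProblem("min_num_stores", pulp.LpMinimize)
place = pulp.LpVariable.dicts("place", cells, cat="Binary")  # place[(i, j)] = 1 if a store is built in cell (i, j)

# objective: minimise the total number of stores
model += pulp.lpSum(place[cell] for cell in cells)

# coverage: at least one store within reach of every customer
for c, cells_in_reach in reach.items():
    model += pulp.lpSum(place[cell] for cell in cells_in_reach) >= 1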
Step 4: Solve
This is a large but easy MIP. It has 40,000 binary variables. It solves very fast. On my laptop it took less than 1 second with a commercial solver (3 seconds with open-source solver CBC).
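With the PuLP sketch above, solving and reading off the chosen cells would look roughly like this (CBC is the open-source solver bundled with PuLP):

model.solve(pulp.PULP_CBC_CMD(msg=False))

chosen = [cell for cell in cells if place[cell].value() > 0.5]
print("number of stores:", len(chosen))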
The solution looks like:
---- 47 VARIABLE numStores.L = 113 number of stores
---- 47 VARIABLE placeStore.L store locations
j1 j6 j7 j8 j9 j15 j16 j17 j18
i4 1
i18 1
i40 1
i70 1
i79 1
i80 1
i107 1
i118 1
i136 1
i157 1
i167 1
i193 1
+ j21 j23 j26 j28 j29 j31 j32 j36 j38
i10 1
i28 1
i54 1
i72 1
i96 1
i113 1
i147 1
i158 1
i179 1
i184 1
i198 1
+ j39 j44 j45 j46 j49 j50 j56 j58 j59
i5 1
i18 1
i39 1
i62 1
i85 1
i102 1
i104 1
i133 1
i166 1
i195 1
+ j62 j66 j67 j68 j69 j73 j74 j76 j80
i11 1
i16 1
i36 1
i61 1
i76 1
i105 1
i112 1
i117 1
i128 1
i146 1
i190 1
+ j82 j84 j85 j88 j90 j92 j95 j96 j97
i17 1
i26 1
i35 1
i48 1
i68 1
i79 1
i97 1
i136 1
i156 1
i170 1
i183 1
i191 1
+ j98 j102 j107 j111 j112 j114 j115 j116 j118
i4 1
i22 1
i36 1
i56 1
i63 1
i68 1
i88 1
i100 1
i101 1
i111 1
i129 1
i140 1
+ j119 j121 j126 j127 j132 j133 j134 j136 j139
i11 1
i30 1
i53 1
i72 1
i111 1
i129 1
i144 1
i159 1
i183 1
i191 1
+ j140 j147 j149 j150 j152 j153 j154 j156 j158
i14 1
i35 1
i48 1
i83 1
i98 1
i117 1
i158 1
i174 1
i194 1
+ j161 j162 j163 j164 j166 j170 j172 j174 j175
i5 1
i32 1
i42 1
i61 1
i69 1
i103 1
i143 1
i145 1
i158 1
i192 1
i198 1
+ j176 j178 j179 j180 j182 j183 j184 j188 j191
i6 1
i13 1
i23 1
i47 1
i61 1
i81 1
i93 1
i103 1
i125 1
i182 1
i193 1
+ j192 j193 j196
i73 1
i120 1
i138 1
i167 1
I think we have debunked your statement that a MIP model is not a feasible approach to this problem.
Note that the age of the universe is 13.7 billion years or 4.3e17 seconds. So we have achieved a speed-up of about 1e17. This is a record for me.
Note that this model does not find the optimal locations for the stores, but only a configuration that minimizes the number of stores needed to service all customers. It is optimal in that sense. But the solution will not minimize the distances between customers and stores.
I am looking to qcut or cut my "Amount" column into 10 percentile bins. Basically the describe() feature, but with 0-10%, 11-20%, 21-30%, 31-40%, 41-50%, 51-60%, 61-70%, 71-80%, 81-90%, 91-100% instead.
After the binning I'd like to create a column that shows 1-10, indicating the bin that particular amount is a part of.
I've tried the code below; however, I do not believe it's achieving what I want.
groups = df.groupby(pd.cut(df['Amount'], 10)).size()
Here is my DataFrame!
df.shape
Out[5]: (1385, 2)
df.head(10)
Out[6]:
Amount New or Repeat Customer
0 23044 New
1 15509 New
2 6184 New
3 6184 New
4 5828 New
5 5461 New
6 5143 New
7 5027 New
8 4992 New
9 4698 Repeat
Use pd.qcut:
import numpy as np
import pandas as pd

# Sample data
size = 100
df = pd.DataFrame({
'Amount': np.random.randint(5000, 20000, size),
'CustomerType': np.random.choice(['New', 'Repeat'], size)
})
# Binning
labels = ['0% to 10%'] + [f'{i+1}% to {i+10}%' for i in range(10, 100, 10)]
df['Bin'] = pd.qcut(df['Amount'], 10, labels=labels)
Result:
Amount CustomerType Bin
0 15597 Repeat 61% to 70%
1 14498 New 51% to 60%
2 6373 Repeat 0% to 10%
3 9901 Repeat 21% to 30%
4 18450 Repeat 91% to 100%
5 9337 Repeat 21% to 30%
6 19310 Repeat 91% to 100%
7 11198 New 31% to 40%
8 12485 New 41% to 50%
9 11130 New 31% to 40%
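If you also want the plain 1-10 bin number mentioned in the question, qcut can return integer codes instead of labels; a small follow-up using the same df:

# labels=False makes qcut return bin codes 0-9; add 1 to get 1-10
df['BinNumber'] = pd.qcut(df['Amount'], 10, labels=False) + 1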
I have two data frames, each having about 250K rows. I am trying to do a fuzzy lookup between the two data frames' columns. After the lookup I need the indexes of the good matches that pass the threshold.
Some details follow.
My df1:
Name State Zip combined_1
0 Auto MN 10 Auto,MN,10
1 Rtla VI 253 Rtla,VI,253
2 Huka CO 56218 Huka,CO,56218
3 kann PR 214 Kann,PR,214
4 Himm NJ 65216 Himm,NJ,65216
5 Elko NY 65418 Elko,NY,65418
6 Tasm MA 13 Tasm,MA,13
7 Hspt OH 43218 Hspt,OH,43218
My other data frame that I am trying to look upto
Name State Zip combined_2
0 Kilo NC 69521 Kilo,NC,69521
1 Kjhl FL 3369 Kjhl,FL,3369
2 Rtla VI 25301 Rtla,VI,25301
3 Illt GA 30024 Illt,GA,30024
4 Huka CO 56218 Huka,CO,56218
5 Haja OH 96766 Haja,OH,96766
6 Auto MN 010 Auto,MN,010
7 Clps TX 44155 Clps,TX,44155
If you look closely, when I do the fuzzy lookup I should get good matches for indexes 0 and 2 in my df1 against df2 indexes 6 and 4.
So, I did this,
from fuzzywuzzy import fuzz

# Save df1 index
df_1index = []
# Save df2 index
df2_indexes = []
# Save fuzzy ratio
fazz_rat = []

for index, details in enumerate(df1['combined_1']):
    for ind, information in enumerate(df2['combined_2']):
        fuzmatch = fuzz.ratio(str(details), str(information))
        if fuzmatch >= 94:
            df_1index.append(index)
            df2_indexes.append(ind)
            fazz_rat.append(fuzmatch)
        else:
            pass
As I expected, I am getting the right results for this example case:
df_1index
>> [0,2]
df2_indexes
>> [6,4]
Running this against 250K x 250K rows across the two data frames takes far too long.
How can I speed up this lookup process? Is there a pandas or Python way to improve performance for what I want?