Pandas DataFrame - Creating dynamic number of columns - python

Data:
qid qualid val
0 1845631864 227 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1 1899053658 44 1,3,3,2,2,2,3,3,4,4,4,5,5,5,5,5,5,5
2 1192887045 197 704
3 1833579269 194 139472
4 1497352469 30 120026,170154,152723,90407,63119,80077,178871,...
Problem:
The numbers separated by commas in column val need to be represented in separate columns for each row.
I don't know if Pandas allows it, but ideally each row would get exactly n columns, where n is the number of elements in that row's val.
If that is not possible, the largest number of elements in val should determine the number of columns, and rows with fewer elements should be padded with NaNs.
Example Solution 1 for Above Problem:
qid qualid val1 val2 val3 valn-3 valn-2 valn-1 valn
0 1845631864 227 0 0 0 ...... 0 0 0 0
1 1899053658 44 1 3 3 ...... 5
2 1192887045 197 704
3 1833579269 194 139472
4 1497352469 30 120026 170154 152723.....63119 80077 178871 12313
Alternate Solution 2 for Above Problem:
qid qualid val1 val2 val3 valn-3 valn-2 valn-1 valn
0 1845631864 227 0 0 0 ...... 0 0 0 0
1 1899053658 44 1 3 3 ...... 5 NaN NaN NaN
2 1192887045 197 704 NaN NaN ...... NaN NaN NaN NaN
3 1833579269 194 139472 NaN NaN ...... NaN NaN NaN
4 1497352469 30 120026 170154 152723.....63119 80077 178871 12313

You can use str.split with expand=True:
pd.concat([df,df.val.str.split(',',expand=True).add_prefix('Val_')],axis=1)
Out[29]:
qid qualid ... Val_16 Val_17
0 1845631864 227 ... 0 0
1 1899053658 44 ... 5 5
2 1192887045 197 ... None None
3 1833579269 194 ... None None
4 1497352469 30 ... None None
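For completeness, a self-contained sketch of the same approach, with the sample data truncated for brevity; the to_numeric step is an optional extra if you want numeric columns rather than strings:
import pandas as pd

# truncated version of the sample data above
df = pd.DataFrame({'qid': [1845631864, 1899053658, 1192887045],
                   'qualid': [227, 44, 197],
                   'val': ['0,0,0', '1,3,3', '704']})

# split each comma-separated string into its own columns;
# shorter rows are padded with None (Alternate Solution 2)
vals = df['val'].str.split(',', expand=True).add_prefix('Val_')

# optional: convert the new string columns to numbers (None becomes NaN)
vals = vals.apply(pd.to_numeric, errors='coerce')

out = pd.concat([df, vals], axis=1)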

Related

Concatenating the values of column and putting back to same row again

Customer Material ID Bill Quantity
0 1 64578 100
1 2 64579 58
2 3 64580 36
3 4 64581 45
4 5 64582 145
We have to concatenate the Material ID at index 0 with the Material ID at index 1 and put the result into the record at index 0, and similarly for indices 1 and 2, 3 and 4, and so on.
The result should contain only the concatenated records.
Just shift the data and combine the columns.
df.assign(new_ID=df["Material ID"] + df.shift(-1)["Material ID"])
Customer Material ID Bill Quantity new_ID
0 0 64578 100 NaN 129157.0
1 1 64579 58 NaN 129159.0
2 2 64580 36 NaN 129161.0
3 3 64581 45 NaN 129163.0
4 4 64582 145 NaN NaN
If you need to concatenate it as a str type then the following would work.
df["Material ID"] = df["Material ID"].astype(str)
df.assign(new_ID=df["Material ID"] + df.shift(-1)["Material ID"])
Customer Material ID Bill Quantity new_ID
0 0 64578 100 NaN 6457864579
1 1 64579 58 NaN 6457964580
2 2 64580 36 NaN 6458064581
3 3 64581 45 NaN 6458164582
4 4 64582 145 NaN NaN
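The question also asks for only the concatenated records; a small hedged follow-up (result is an illustrative name), assuming "only concatenated records" means rows where a new_ID could actually be formed:
result = df.assign(new_ID=df["Material ID"] + df.shift(-1)["Material ID"])
# drop the trailing row, which has no following row to concatenate with
result = result.dropna(subset=["new_ID"])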

Drop rows after maximum value in a grouped Pandas dataframe

I've got a date-ordered dataframe that can be grouped. What I am attempting to do is group by a variable (Person), determine the maximum Weight for each group (person), and then drop all rows that come after (by Date) that maximum.
Here's an example of the data:
df = pd.DataFrame({'Person': [1,1,1,1,1,2,2,2,2,2], 'Date': ['1/1/2015','2/1/2015','3/1/2015','4/1/2015','5/1/2015','6/1/2011','7/1/2011','8/1/2011','9/1/2011','10/1/2011'], 'MonthNo': [1,2,3,4,5,1,2,3,4,5], 'Weight': [100,110,115,112,108,205,210,211,215,206]})
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
3 4/1/2015 4 1 112
4 5/1/2015 5 1 108
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
9 10/1/2011 5 2 206
Here's what I want the result to look like:
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
I think it's worth noting that the start dates can be disjoint and the maximum may appear at different times.
My idea was to find the maximum for each group, get the MonthNo that maximum falls in for that group, and then discard any rows with a MonthNo greater than the MonthNo of the max Weight. So far I've been able to obtain the max by group, but I cannot get past doing a comparison based on that.
Please let me know if I can edit/provide more information, haven't posted many questions here! Thanks for the help, sorry if my formatting/question isn't clear.
Using idxmax with groupby
df.groupby('Person',sort=False).apply(lambda x : x.reset_index(drop=True).iloc[:x.reset_index(drop=True).Weight.idxmax()+1,:])
Out[131]:
Date MonthNo Person Weight
Person
1 0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
2 0 6/1/2011 1 2 205
1 7/1/2011 2 2 210
2 8/1/2011 3 2 211
3 9/1/2011 4 2 215
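If you want a flat index like the desired output rather than the extra Person index level shown in Out[131], a small follow-up sketch (out is an illustrative name, not part of the original answer):
out = df.groupby('Person', sort=False).apply(
    lambda x: x.reset_index(drop=True).iloc[:x.reset_index(drop=True).Weight.idxmax() + 1, :])
# drop the 'Person' level that apply adds to the index
out = out.reset_index(level='Person', drop=True)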
You can use groupby.transform with idxmax. The first 2 steps may not be necessary depending on how your dataframe is structured.
# convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'])
# sort by Person and Date to make index usable for next step
df = df.sort_values(['Person', 'Date']).reset_index(drop=True)
# filter for index less than idxmax transformed by group
df = df[df.index <= df.groupby('Person')['Weight'].transform('idxmax')]
print(df)
Date MonthNo Person Weight
0 2015-01-01 1 1 100
1 2015-02-01 2 1 110
2 2015-03-01 3 1 115
5 2011-06-01 1 2 205
6 2011-07-01 2 2 210
7 2011-08-01 3 2 211
8 2011-09-01 4 2 215

Pandas: Add new column based on comparison of two DFs

I have 2 dataframes that I want to compare, one to the other, and add a True/False value to a new column in the first based on the comparison.
My data resembles:
DF1:
cat sub-cat low high
3 3 1 208 223
4 3 1 224 350
8 4 1 223 244
9 4 1 245 350
13 5 1 232 252
14 5 1 253 350
DF2:
Cat Sub-Cat Rating
0 5 1 246
1 5 2 239
2 8 1 203
3 8 2 218
4 K 1 149
5 K 2 165
6 K 1 171
7 K 2 185
8 K 1 157
9 K 2 171
The desired result is for DF2 to gain an additional column containing True or False depending on whether, for the matching Cat and Sub-Cat, the Rating falls between low.min() and high.max(), or Null if no match is found to compare against.
I have been going around in circles with this for far too long with no results to speak of.
Thank you in advance for any assistance.
Update:
First row would look something like:
Cat Sub-Cat Rating In-Spec
0 5 1 246 True
As it falls within the min low and the max high.
Example: there are two rows in DF1 for cat = 5 and sub-cat = 1. I need to get the minimum low and the maximum high from those two rows and then check whether the Rating from row 0 in DF2 falls within that minimum low and maximum high.
Use join after a groupby.agg:
d2 = DF2.join(
DF1.groupby(
['cat', 'sub-cat']
).agg(dict(low='min', high='max')),
on=['Cat', 'Sub-Cat']
)
d2
Cat Sub-Cat Rating high low
0 5 1 246 350.0 232.0
1 5 2 239 NaN NaN
2 8 1 203 NaN NaN
3 8 2 218 NaN NaN
4 K 1 149 NaN NaN
5 K 2 165 NaN NaN
6 K 1 171 NaN NaN
7 K 2 185 NaN NaN
8 K 1 157 NaN NaN
9 K 2 171 NaN NaN
assign with .loc
DF2.loc[d2.eval('low <= Rating <= high'), 'In-Spec'] = True
DF2
Cat Sub-Cat Rating In-Spec
0 5 1 246 True
1 5 2 239 NaN
2 8 1 203 NaN
3 8 2 218 NaN
4 K 1 149 NaN
5 K 2 165 NaN
6 K 1 171 NaN
7 K 2 185 NaN
8 K 1 157 NaN
9 K 2 171 NaN
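To also get an explicit False for rows that matched a cat/sub-cat group but fall outside the range, while keeping a null where the join found no match, a possible follow-up sketch (not part of the original answer; it reuses the d2 frame built above):
import numpy as np

# True/False wherever a matching cat/sub-cat group exists in DF1
DF2['In-Spec'] = d2.eval('low <= Rating <= high')
# rows where the join found no matching group keep a null instead of False
DF2.loc[d2['low'].isna(), 'In-Spec'] = np.nan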
To add a new column based on a boolean expression, it would involve something along the lines of:
temp = <boolean expression involving the inequality>
df2['new column name'] = temp
However, I'm not sure I understand: the first row in your DF2 table, for instance, has a Rating of 246, which makes it true for row 13 of DF1 but false for row 14. What would you like it to return?
You can do it like this:
df2['In-Spec'] = 'False'
df2.loc[(df2['Rating'] > df1['low']) & (df2['Rating'] < df1['high']), 'In-Spec'] = 'True'
But which rows should be compared with each other? Do you want them compared by their index or by their cat & sub-cat names?

Iterate through the rows of a dataframe and reassign minimum values by group

I am working with a dataframe that looks like this.
id time diff
0 0 34 nan
1 0 36 2
2 1 43 7
3 1 55 12
4 1 59 4
5 2 2 -57
6 2 10 8
What is an efficient way to find the minimum value of 'time' for each id, and then set 'diff' to NaN at those minimum values? I am looking for a solution that results in:
id time diff
0 0 34 nan
1 0 36 2
2 1 43 nan
3 1 55 12
4 1 59 4
5 2 2 nan
6 2 10 8
groupby('id') and use idxmin to find the location of the minimum value of 'time' in each group. Finally, use loc to assign np.nan:
df.loc[df.groupby('id').time.idxmin(), 'diff'] = np.nan
df
You can group the time by id and calculate a logical vector where if the time is minimum within the group, the value is True, else False, and use the logical vector to assign NaN to the corresponding rows:
import numpy as np
import pandas as pd
df.loc[df.groupby('id')['time'].apply(lambda g: g == min(g)), "diff"] = np.nan
df
# id time diff
#0 0 34 NaN
#1 0 36 2.0
#2 1 43 NaN
#3 1 55 12.0
#4 1 59 4.0
#5 2 2 NaN
#6 2 10 8.0
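For completeness, a self-contained sketch of the idxmin approach above, reconstructing the sample frame from the question (it assumes the default integer index, which idxmin relies on):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id':   [0, 0, 1, 1, 1, 2, 2],
                   'time': [34, 36, 43, 55, 59, 2, 10],
                   'diff': [np.nan, 2, 7, 12, 4, -57, 8]})

# blank out 'diff' at the index label of the minimum 'time' within each id
df.loc[df.groupby('id').time.idxmin(), 'diff'] = np.nan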

iterate over unique values in PANDAS

I have a dataset in the following format:
Patient Date colA colB
1 1/3/2015 . 5
1 2/5/2015 3 10
1 3/5/2016 8 .
2 4/5/2014 2 .
2 etc
I am trying to define a function in pandas which treats each unique patient as an item and iterates over these unique patients to keep only the most recent observation per column (replacing all other values with missing/null). For example, for patient 1 the output would be:
Patient Date colA colB
1 1/3/2015 . .
1 2/5/2015 . 10
1 3/5/2016 8 .
I understand that I can use something like the following with .apply(), but this does not account for duplicate patient IDs...
def getrecentobs():
for i in df['Patient']:
etc
Any help or direction is much appreciated.
There is a pandas function called last which can be used with groupby to give you the last value in each group. I'm not sure why you require the blank rows, but if you need them you can join the groupby result back onto the original data frame. The sort is there because the date was not sorted in my sample data. Hope that helps.
Example:
DataFrame
id date amount code
0 3107 2010-10-20 136.4004 290
1 3001 2010-10-08 104.1800 290
2 3109 2010-10-08 276.0629 165
3 3001 2010-10-08 -177.9800 290
4 3002 2010-10-08 1871.1094 290
5 3109 2010-10-08 225.7038 155
6 3109 2010-10-08 98.5578 170
7 3107 2010-10-08 231.3949 165
8 3203 2010-10-08 333.6636 290
9 -9100 2010-10-08 3478.7500 290
If the previous rows are not needed:
b.sort_values("date").groupby(["id","date"]).last().reset_index()
The groupby aggregates the data with last, i.e. it takes the last value in each group for those columns.
Output only latest rows with values:
id date amount code
0 -9100 2010-10-08 3478.7500 290
1 3001 2010-10-08 -177.9800 290
2 3002 2010-10-08 1871.1094 290
3 3107 2010-10-08 231.3949 165
4 3107 2010-10-20 136.4004 290
5 3109 2010-10-08 98.5578 170
6 3203 2010-10-08 333.6636 290
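The joining-back step the answer mentions is not shown; a hedged sketch of one way it could look, assuming the frame above is named b (last_rows and joined are illustrative names):
# aggregate to the last amount/code per (id, date) ...
last_rows = b.sort_values("date").groupby(["id", "date"]).last()

# ... then join back onto the original frame so every original row is kept,
# with the per-group "last" values appearing as suffixed extra columns
joined = b.join(last_rows, on=["id", "date"], rsuffix="_last")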
I think you can use to_numeric to convert the '.' values to NaN, then create a mask with groupby and rank, and finally apply the mask:
print(df)
Patient Date colA colB
0 1 1/3/2015 . 5
1 1 2/5/2015 3 10
2 1 3/5/2016 8 .
3 2 4/5/2014 2 .
4 2 5/5/2014 4 .
df['colA'] = pd.to_numeric(df['colA'], errors='coerce')
df['colB'] = pd.to_numeric(df['colB'], errors='coerce')
print(df)
Patient Date colA colB
0 1 1/3/2015 NaN 5
1 1 2/5/2015 3 10
2 1 3/5/2016 8 NaN
3 2 4/5/2014 2 NaN
4 2 5/5/2014 4 NaN
print(df.groupby('Patient')[['colA','colB']].rank(method='max', ascending=False))
colA colB
0 NaN 2
1 2 1
2 1 NaN
3 2 NaN
4 1 NaN
mask = df.groupby('Patient')[['colA','colB']].rank(method='max', ascending=False) == 1
print(mask)
colA colB
0 False False
1 False True
2 True False
3 False False
4 True False
df[['colA','colB']] = df[['colA','colB']][mask]
print(df)
Patient Date colA colB
0 1 1/3/2015 NaN NaN
1 1 2/5/2015 NaN 10
2 1 3/5/2016 8 NaN
3 2 4/5/2014 NaN NaN
4 2 5/5/2014 4 NaN
I think you are looking for pandas groupby.
For example, df.groupby('Patient').last() will return a DataFrame with the last observation of each patient. If the patients are not sorted by date you can find the latest record date using the max function.
df.groupby('Patient').last()
Date colA colB
Patient
1 3/5/2016 8 .
2 etc 2 .
You can make your own functions and then call the apply() function of groupby.
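A hedged sketch of that suggestion, assuming '.' has already been converted to NaN (as in the to_numeric answer above) and the rows are date-ordered within each Patient; keep_latest is an illustrative name:
import numpy as np

def keep_latest(group):
    # null out everything except the last non-null entry in each value column
    out = group.copy()
    for col in ['colA', 'colB']:
        last_valid = group[col].last_valid_index()
        out[col] = np.nan
        if last_valid is not None:
            out.loc[last_valid, col] = group.loc[last_valid, col]
    return out

result = df.groupby('Patient', group_keys=False).apply(keep_latest)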
