I'm using Python 3.x and I have a pandas DataFrame df. It looks like:
Name Maths Science Social studies
abc 80 70 90
cde 90 60 80
xyz 100 80 85
...
...
I would like to generate a pandas DataFrame that stores the student name, the maximum marks, and the subject that contributed the maximum marks. If the maximum mark is 100, the next highest mark should be used instead. So my output DataFrame will look like:
Name Highest_Marks Subject_contributed_Max
abc 90 Social Studies
cde 90 Maths
xyz 85 Social Studies
Can you suggest how to do this?
You can use:
df2 = df.drop(columns='Name').mask(df.eq(100))
df['Highest_Marks'] = df2.max(axis=1)
df['Subject_contributed_Max'] = df2.idxmax(axis=1)
output:
Name Maths Science Social studies Highest_Marks Subject_contributed_Max
0 abc 80 70 90 90.0 Social studies
1 cde 90 60 80 90.0 Maths
2 xyz 100 80 85 85.0 Social studies
For efficiency, to avoid computing the max/idxmax reductions twice, you can compute the idxmax once and use a lookup:
import numpy as np

s = (df
 .drop(columns='Name')
 .mask(df.eq(100))
 .idxmax(axis=1)
)
idx, cols = pd.factorize(s)
df['Highest_Marks'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
df['Subject_contributed_Max'] = s
This will work:
df_melt = df.melt('Name')
df_melt = df_melt.loc[df_melt['value'] < 100]
df_melt['RN'] = df_melt.sort_values(['value'], ascending=False).groupby(['Name']).cumcount() + 1
df_melt.loc[df_melt['RN'] == 1].sort_values('Name')
This will do it:
df = (df.set_index('Name')
        .stack()
        .reset_index()
        .rename(columns={'level_1': 'Subject_contributed_Max', 0: 'Highest_Marks'})
        .sort_values(['Name', 'Highest_Marks']))
df = (df[df['Highest_Marks'] != 100]
        .groupby('Name')
        .last()
        .reset_index()[['Name', 'Highest_Marks', 'Subject_contributed_Max']])
Input:
Name Maths Science Social studies
0 abc 80 70 90
1 cde 90 60 80
2 xyz 100 80 85
Output:
Name Highest_Marks Subject_contributed_Max
0 abc 90 Social studies
1 cde 90 Maths
2 xyz 85 Social studies
UPDATE:
Here is a faster approach than my original answer above. It's similar to one of the suggestions in @mozway's answer, though it uses where() instead of mask() (and it also returns only the high marks, not the individual mark columns).
df2 = df.set_index('Name')
df2 = df2.where(df2 < 100)
df2 = (df2.assign(Highest_Marks=df2.max(axis=1).astype(int),
                  Subject_contributed_Max=df2.idxmax(axis=1))
          .reset_index()[['Name', 'Highest_Marks', 'Subject_contributed_Max']])
Prompted by a comment from the OP, I ran some benchmarks on the answers by @mozway and me (I also tried adding the answer by @ArchAngelPwn, but it doesn't seem to give comparable output in its current form).
Here are the results for a 1000-row by 9000-column dataframe:
Timeit results:
foo_1 (orig) ran in 2.4102477666456252 seconds using 3 iterations
foo_2 (refined) ran in 2.256996333327455 seconds using 3 iterations
foo_3 (where) ran in 1.1588773333545153 seconds using 3 iterations
foo_4 (mozway mask) ran in 1.17148056665125 seconds using 3 iterations
foo_5 (mozway mask lookup) ran in 1.1049298333236948 seconds using 3 iterations
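For reference, here is a minimal sketch of how such a comparison could be timed. The synthetic 1000 x 9000 frame, the foo_where name and the best-of-3 harness are assumptions for illustration, not the exact benchmark code used above:
import timeit

import numpy as np
import pandas as pd

# hypothetical synthetic data: 1000 students x 9000 subject columns
rng = np.random.default_rng(0)
marks = pd.DataFrame(rng.integers(0, 101, size=(1000, 9000)),
                     columns=[f'subj_{i}' for i in range(9000)])
marks.insert(0, 'Name', [f'student_{i}' for i in range(1000)])

def foo_where(df):
    # essentially the where() variant from the update above
    df2 = df.set_index('Name')
    df2 = df2.where(df2 < 100)
    return (df2.assign(Highest_Marks=df2.max(axis=1),
                       Subject_contributed_Max=df2.idxmax(axis=1))
               .reset_index()[['Name', 'Highest_Marks', 'Subject_contributed_Max']])

# best of 3 single runs (the original harness is not shown, so this is a guess)
print(min(timeit.repeat(lambda: foo_where(marks), number=1, repeat=3)), 'seconds')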
Related
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements so that I can replace the multiple measurements with a fixed number of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy, that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements; however, this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It's a bit of a complicated query, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm = (d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .max())
   .unstack()
)
This produces:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 315.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we group by MotherID and PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we group by 'tm' and take the max. For each sub-dataframe d we thus obtain a Series mapping tm to max(abdomCirc).
Then unstack() moves tm into the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it, you can simplify the above a bit:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
Similar idea, same output.
There is a handy DataFrame method called query(). This should do the work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (without manually changing the MotherID and PregnancyID values for each different group of rows), you have to combine it with groupby (as you did on your own); a sketch of one way to do that follows the link below.
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
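For a fully automatic version, here is a minimal sketch that bins the weeks with pd.cut and pivots the per-trimester maxima into the three columns the question asks for. This is my own assumption of a suitable approach, not part of the answers above; the sample data is the one from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'PregnancyID': [0, 0, 1, 1, 1, 2, 2, 2],
    'MotherID':    [0, 0, 1, 1, 1, 2, 2, 2],
    'gestationalAgeInWeeks': [14, 21, 20, 25, 30, 8, 9, 18],
    'abdomCirc': [150, 200, 294, 315, 350, 170, 180, np.nan],
})

# bin weeks into trimesters (0-13, 14-26, 27-40), then pivot the per-trimester max to columns
df['trimester'] = pd.cut(df['gestationalAgeInWeeks'],
                         bins=[0, 13, 26, 40],
                         labels=['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd'])
out = (df.pivot_table(index=['MotherID', 'PregnancyID'],
                      columns='trimester',
                      values='abdomCirc',
                      aggfunc='max',
                      observed=False)
         .reset_index())
print(out)
This reproduces the expected table from the question, including the NaN cells for trimesters with no measurement.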
I'm trying to use Python to get the time taken, as well as the average speed, for an object traveling between points.
The data looks somewhat like this:
location initialtime id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9km
2 2020-09-18T12:10:14.485952Z car_uno 83 8km
3 2020-09-18T11:59:14.484781Z car_duo 70 9km
7 2020-09-18T12:00:14.484653Z car_trio 85 8km
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5km
The function I'm using currently is essentially like this:
Speeds.index = pd.to_datetime(Speeds.index)
..etc
Normally I would just take the unique values of the IDs:
for x in speeds.id.unique():
    speeds[speeds.id == x]...
But this method really isn't working.
What is the best approach for simply checking whether there are multiple points for an id over time and then taking the average of the speeds over that time, or just returning the speed itself if there are not multiple values?
Is there a simpler pandas filter I could use?
Expected output is simply:
area - id - initial time - journey time - average speed
The point is to get the average speed and journey time for a vehicle going past two points.
To get the average speed and journey times you can use groupby() and pass in the columns that determine one complete journey, like id or area.
import pandas as pd
from io import StringIO
data = StringIO("""
area initialtime id speed
1 2020-09-18T12:03:14.485952Z car_uno 72
2 2020-09-18T12:10:14.485952Z car_uno 83
3 2020-09-18T11:59:14.484781Z car_duo 70
7 2020-09-18T12:00:14.484653Z car_trio 85
8 2020-09-18T12:12:14.484653Z car_trio 70
""")
df = pd.read_csv(data, delim_whitespace=True)
df["initialtime"] = pd.to_datetime(df["initialtime"])
# change to ["id", "area"] if need more granular aggregation
group_cols = ["id"]
time = df.groupby(group_cols)["initialtime"].agg([max, min]).eval('max-min').reset_index(name="journey_time")
speed = df.groupby(group_cols)["speed"].mean().reset_index(name="average_speed")
pd.merge(time, speed, on=group_cols)
id journey_time average_speed
0 car_duo 00:00:00 70.0
1 car_trio 00:12:00 77.5
2 car_uno 00:07:00 77.5
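If you also want the initial time from the expected output, a small extension of the same idea (my own addition, not part of the answer above) is to aggregate the minimum initialtime per group and merge it in as well:
# earliest timestamp per journey, merged with the journey time and average speed above
start = df.groupby(group_cols)["initialtime"].min().reset_index(name="initial_time")
result = start.merge(time, on=group_cols).merge(speed, on=group_cols)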
I tried a very intuitive solution. I'm assuming the data has already been loaded into df.
df['initialtime'] = pd.to_datetime(df['initialtime'])
result = []
for car in df['id'].unique():
_df = df[df['id'] == car].sort_values('initialtime', ascending=True)
# Where the car is leaving "from" and where it's heading "to"
_df['From'] = _df['location']
_df['To'] = _df['location'].shift(-1, fill_value=_df['location'].iloc[0])
# Auxiliary columns
_df['end_time'] = _df['initialtime'].shift(-1, fill_value=_df['initialtime'].iloc[0])
_df['end_speed'] = _df['speed'].shift(-1, fill_value=_df['speed'].iloc[0])
# Desired columns
_df['journey_time'] = _df['end_time'] - _df['initialtime']
_df['avg_speed'] = (_df['speed'] + _df['end_speed']) / 2
_df = _df[_df['journey_time'] >= pd.Timedelta(0)]
_df.drop(['location', 'distance', 'speed', 'end_time', 'end_speed'],
axis=1, inplace=True)
result.append(_df)
final_df = pd.concat(result).reset_index(drop=True)
The final DataFrame is as follows:
initialtime id From To journey_time avg_speed
0 2020-09-18 12:03:14.485952+00:00 car_uno 1 2 0 days 00:07:00 77.5
1 2020-09-18 11:59:14.484781+00:00 car_duo 3 3 0 days 00:00:00 70.0
2 2020-09-18 12:00:14.484653+00:00 car_trio 7 8 0 days 00:12:00 77.5
Here is another approach. My results differ from the other posts, so I may have misunderstood the requirements. In brief, I calculated each average speed as total distance divided by total time (for each car).
from io import StringIO
import pandas as pd
# speed in km / hour; distance in km
data = '''location initial-time id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9
2 2020-09-18T12:10:14.485952Z car_uno 83 8
3 2020-09-18T11:59:14.484781Z car_duo 70 9
7 2020-09-18T12:00:14.484653Z car_trio 85 8
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5
'''
Now create the data frame and perform the calculations:
# create data frame
df = pd.read_csv(StringIO(data), delim_whitespace=True)
df['elapsed-time'] = df['distance'] / df['speed'] # in hours
# utility function
def hours_to_hms(elapsed):
''' Convert `elapsed` (in hours) to hh:mm:ss (round to nearest sec)'''
h, m = divmod(elapsed, 1)
m *= 60
_, s = divmod(m, 1)
s *= 60
hms = '{:02d}:{:02d}:{:02d}'.format(int(h), int(m), int(round(s, 0)))
return hms
# perform calculations
start_time = df.groupby('id')['initial-time'].min()
journey_hrs = df.groupby('id')['elapsed-time'].sum().rename('elapsed-hrs')
hms = journey_hrs.apply(lambda x: hours_to_hms(x)).rename('hh:mm:ss')
ave_speed = ((df.groupby('id')['distance'].sum()
/ df.groupby('id')['elapsed-time'].sum())
.rename('ave speed (km/hr)')
.round(2))
# assemble results
result = pd.concat([start_time, journey_hrs, hms, ave_speed], axis=1)
print(result)
initial-time elapsed-hrs hh:mm:ss \
id
car_duo 2020-09-18T11:59:14.484781Z 0.128571 00:07:43
car_trio 2020-09-18T12:00:14.484653Z 0.201261 00:12:05
car_uno 2020-09-18T12:03:14.485952Z 0.221386 00:13:17
ave speed (km/hr)
id
car_duo 70.00
car_trio 77.01
car_uno 76.79
You should provide a better dataset (i.e. with identical time points) so that we can better understand the inputs, and an example of expected output so that we understand the computation of the average speed.
Thus I'm just guessing that you may be looking for df.groupby('initialtime')['speed'].mean(), if df is a dataframe containing your input data.
I can't seem to get this right... here's what I'm trying to do:
import pandas as pd
df = pd.DataFrame({
'item_id': [1,1,3,3,3],
'contributor_id': [1,2,1,4,5],
'contributor_role': ['sing', 'laugh', 'laugh', 'sing', 'sing'],
'metric_1': [80, 90, 100, 92, 50],
'metric_2': [180, 190, 200, 192, 150]
})
--->
item_id contributor_id contributor_role metric_1 metric_2
0 1 1 sing 80 180
1 1 2 laugh 90 190
2 3 1 laugh 100 200
3 3 4 sing 92 192
4 3 5 sing 50 150
And I want to reshape it into:
item_id SING_1_contributor_id SING_1_metric_1 SING_1_metric_2 SING_2_contributor_id SING_2_metric_1 SING_2_metric_2 ... LAUGH_1_contributor_id LAUGH_1_metric_1 LAUGH_1_metric_2 ... <LAUGH_2_...>
0 1 1 80 180 N/A N/A N/A ... 2 90 190 ... N/A..
1 3 4 92 192 5 50 150 ... 1 100 200 ... N/A..
Basically, for each item_id, I want to collect all relevant data into a single row. Each item could have multiple types of contributors, and there is a max for each type (e.g. max SING contributor = A per item, max LAUGH contributor = B per item). There are a set of metrics tied to each contributor (but for the same contributor, the values could be different across different items / contributor types).
I can probably achieve this through some seemingly inefficient methods (e.g. looping and matching then populating a template df), but I was wondering if there is a more efficient way to achieve this, potentially through cleverly specifying the index / values / columns in the pivot operation (or any other method..).
Thanks in advance for any suggestions!
EDIT:
Ended up adapting Ben's script below into the following:
df['role_count'] = df.groupby(['item_id', 'contributor_role']).cumcount().add(1).astype(str)
df['contributor_role'] = df.apply(lambda row: row['contributor_role'] + '_' + row['role_count'], axis=1)
df = df.set_index(['item_id','contributor_role']).unstack()
df.columns = ['_'.join(x) for x in df.columns.values]
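For reference, here is a self-contained run of that adapted snippet on the sample frame from the question (lightly tidied on my side: the row-wise apply is replaced with a vectorized string concatenation, which does the same thing):
import pandas as pd

df = pd.DataFrame({
    'item_id': [1, 1, 3, 3, 3],
    'contributor_id': [1, 2, 1, 4, 5],
    'contributor_role': ['sing', 'laugh', 'laugh', 'sing', 'sing'],
    'metric_1': [80, 90, 100, 92, 50],
    'metric_2': [180, 190, 200, 192, 150],
})

# number contributors within each (item, role) pair, e.g. sing_1, sing_2, laugh_1
df['role_count'] = df.groupby(['item_id', 'contributor_role']).cumcount().add(1).astype(str)
df['contributor_role'] = df['contributor_role'] + '_' + df['role_count']

# pivot to one row per item_id and flatten the resulting MultiIndex columns
wide = df.drop(columns='role_count').set_index(['item_id', 'contributor_role']).unstack()
wide.columns = ['_'.join(col) for col in wide.columns]
print(wide.filter(like='contributor_id'))
This yields one row per item_id with flattened columns such as contributor_id_sing_1, metric_1_sing_2 and contributor_id_laugh_1, with NaN where an item has fewer contributors of a given role (the suffix order is role-last here, unlike the SING_1_metric_1 naming in the question, but it is easy to rearrange).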
You can create an additional key with cumcount, then unstack:
df['newkey']=df.groupby('item_id').cumcount().add(1).astype(str)
df['contributor_id']=df['contributor_id'].astype(str)
s = df.set_index(['item_id','newkey']).unstack().sort_index(level=1,axis=1)
s.columns=s.columns.map('_'.join)
s
Out[38]:
contributor_id_1 contributor_role_1 ... metric_1_3 metric_2_3
item_id ...
1 1 sing ... NaN NaN
3 1 laugh ... 50.0 150.0
I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation; this will create a Series with its index aligned to the original df, so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0
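As a follow-up, the comparison the question asks for (actual concentration versus the bin average) is then a one-liner:
# deviation of each measurement from its rounded-time average
df['diff_from_mean'] = df['Concentration'] - df['meanconcentration']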
EDITED: let me copy the whole data set.
df is the store sales/inventory data:
branch daqu store store_name style color size stocked sold in_stock balance
0 huadong wenning C301 EE #��#��##�� EEBW52301M 39 160 7 4 3 -5
1 huadong wenning C301 EE #��#��##�� EEBW52301M 39 165 1 0 1 1
2 huadong wenning C301 EE #��#��##�� EEBW52301M 39 170 6 3 3 -3
dh is the transaction data (move 'amount' from store 'from' to store 'to'):
branch daqu from to style color size amount box_sum
8 huadong shanghai C306 C30C EEOM52301M 59 160 1 162
18 huadong shanghai C306 C30C EEOM52301M 39 160 1 162
25 huadong shanghai C306 C30C EETJ52301M 52 160 9 162
26 huadong shanghai C306 C30C EETJ52301M 52 155 1 162
32 huadong shanghai C306 C30C EEOW52352M 19 160 2 162
What I want is the store inventory data after the transactions, in exactly the same format as df, but with the 'in_stock' numbers updated from the original df according to the amounts in dh.
Below is what I tried:
df['full_code'] = df['store']+df['style']+df['color'].astype(str)+df['size'].astype(str)
dh['from_code'] = dh['from']+dh['style']+dh['color'].astype(str)+dh['size'].astype(str)
dh['to_code'] = dh['to']+dh['style']+dh['color'].astype(str)+dh['size'].astype(str)
# subtract from 'from' store
dh_from = pd.DataFrame(dh.groupby('from_code')['amount'].sum())
for code, stock in dh_from.iterrows() :
df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] - stock
# add to 'to' store
dh_to = pd.DataFrame(dh.groupby('to_code')['amount'].sum())
for code, stock in dh_to.iterrows() :
df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock
df.to_csv('d:/after_dh.csv')
But when I open the CSV file, the 'in_stock' values for the rows where a transaction occurred are all blank.
I think the line df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock has some problem. What's the correct way of updating the value?
ORIGINAL: I have two pandas dataframes: df1 for the inventory, df2 for the transactions.
df1 looks something like this:
full_code in_stock
1 AAA 200
2 BBB 150
3 CCC 150
df2 looks something like this:
from to full_code amount
1 XX XY AAA 30
2 XX XZ AAA 35
3 ZY OI BBB 50
4 AQ TR AAA 15
What I want is the inventory after all transactions are done.
In this case,
full_code in_stock
1 AAA 120
2 BBB 100
3 CCC 150
Note that full_code is unique in df1, but not unique in df2.
Is there any pandas way of doing this? I got mixed up between the original dataframe and a view of it, and solved it by turning them into numpy arrays and finding matching full_codes. But the resulting code is also a mess, and I wonder if there is a simpler way of doing this without turning everything into numpy arrays.
What I would do is set the index of df1 to the 'full_code' column and then call sub to subtract the other df.
What we pass for the values is the result of grouping df2 on 'full_code' and summing the 'amount' column.
An additional param for sub is fill_value; this is because product 'CCC' does not exist on the rhs, so we want its value preserved, otherwise it would become NaN:
In [25]:
total = df1.set_index('full_code')['in_stock'].sub(df2.groupby('full_code')['amount'].sum(), fill_value=0)
total.reset_index()
Out[25]:
full_code in_stock
0 AAA 120
1 BBB 100
2 CCC 150
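For the edited two-store version of the question, here is a vectorized sketch along the same lines (my own assumption building on the full_code/from_code/to_code columns constructed in the question, not part of the answer above). Note that the blanks in the original loop come from iterrows(): stock there is a whole row, so the subtraction would have needed stock['amount'].
# assuming df (inventory) and dh (transactions) from the question,
# with full_code, from_code and to_code already built as shown there

# net change per code: incoming minus outgoing amounts
outgoing = dh.groupby('from_code')['amount'].sum()
incoming = dh.groupby('to_code')['amount'].sum()
delta = incoming.sub(outgoing, fill_value=0)

# map the net change back onto the inventory rows (0 where no transaction occurred)
df['in_stock'] = df['in_stock'] + df['full_code'].map(delta).fillna(0)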