Pandas multiindex into columns - python

I have a DataFrame that looks similar to this:
          height  length
cat max       30      50
    mean      20      40
    min       10      30
dog max       70     100
    mean      50      90
    min       30      60
and want to turn it into
     height_max  height_mean  height_min  length_max  length_mean  length_min
cat          30           20          10          50           40          30
dog          70           50          30         100           90          60
The column names themselves aren't important, so numbered columns would also be fine.

You can unstack and rework the column index:
df2 = df.unstack(1)
df2.columns = df2.columns.map('_'.join)
Output:
     height_max  height_mean  height_min  length_max  length_mean  length_min
cat          30           20          10          50           40          30
dog          70           50          30         100           90          60

Related

Calculating the max, mean and min of a column in a dataframe

I calculated the max, min and mean of a column in a dataframe as follows:
g['MAX range'] = g['Current_range'].max()
g['min range'] = g['Current_range'].min()
g['mean'] = g['Current_range'].mean()
The output was as follows:
current_speed  current_range  maxrange  minrange  mean
10             25             190       25        74
20             40             190       25        74
20             41             190       25        74
80             190            190       25        74
I don't want repeated values in max range, min range and mean, but only a single value in each of those columns.
Expected output:
current_speed  current_range  maxrange  minrange  mean
10             25             190       25        74
20             40
20             41
80             190
How can I modify it?
You can add it with .loc. Example for mean:
g.loc[g.index[0], 'mean'] = g['Current_range'].mean()
It will create the mean column with the mean value in the first row and NaN in the other rows.
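Putting the three statistics together in one short sketch; the frame below is a hypothetical reconstruction of the question's data:

```python
import pandas as pd

# Hypothetical reconstruction of the question's data
g = pd.DataFrame({"current_speed": [10, 20, 20, 80],
                  "Current_range": [25, 40, 41, 190]})

# Write each statistic into the first row only; the other rows stay NaN
first = g.index[0]
g.loc[first, "maxrange"] = g["Current_range"].max()
g.loc[first, "minrange"] = g["Current_range"].min()
g.loc[first, "mean"] = g["Current_range"].mean()
print(g)
```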

Efficient way to get the count of all values within a range of a group

I have referred to this question, and the slightly modified solution works on the sample data but runs out of memory on my full data set (~3 GB of data).
I'm trying to find the count of all values within the range of a group (grouped by anchor):
The range formula is y_val +- (anchor_val / 20).
Do note that anchor_val is consistent across all anchors, e.g.:
ID  Anchor  y_val  anchor_val
12  ab      80     40
13  ab      20     40
14  abc     80     50
15  abc     80     50
16  ab      81     40
17  abd     80     50
Which would result in:
ID  Anchor  y_val  anchor_val  (anchor_val / 20)  count
12  ab      80     40          2                  1
13  ab      20     40          2                  0
14  abc     80     50          2.5                1
15  abc     79     50          2.5                1
16  ab      81     40          2                  1
17  abd     80     50          2.5                0
(I added the anchor_val/20 for clarity).
EDIT:
The current code, which causes the out-of-memory error:
df["rule_8_comp_low"] = df["y_val"] - df["anchor_val"] / 20
df["rule_8_comp_high"] = df["y_val"] + df["anchor_val"] / 20
m = df.reset_index().merge(
    df[["anchor_col", "y_val"]].reset_index(), on="anchor_col"
)
m["rule_8_to_count"] = (
    m.y_val_y.ge(m.rule_8_comp_low)
    & m.y_val_y.le(m.rule_8_comp_high)
    & (m.index_x != m.index_y)
)
df["y_val_between"] = m.groupby("index_x").rule_8_to_count.sum()

Python combine integer columns to create multiple columns with suffix

I have a dataframe with a sample of the employee survey results as shown below. The values in the delta columns are just the difference between the FY21 and FY20 columns.
Employee           leadership_fy21  leadership_fy20  leadership_delta  comms_fy21  comms_fy20  comms_delta
patrick.t#abc.com               88               50                38          90          80           10
johnson.g#abc.com               22               82               -60          80          90          -10
pamela.u#abc.com                41               94               -53          44          60          -16
yasmine.a#abc.com               90               66                24          30          10           20
I'd like to create multiple columns that
i. contain the fy21 values as percentages, and
ii. merge them with the delta-suffixed columns so that the delta values appear in parentheses.
example output would be:
Employee leadership_fy21 leadership_delta leadership_final comms_fy21 comms_delta comms_final
patrick.t#abc.com 88 38 88% (38) 90 10 90% (10)
johnson.g#abc.com 22 -60 22% (-60) 80 -10 80% (-10)
pamela.u#abc.com 41 -53 41% (-53) 44 -16 44% (-16)
yasmine.a#abc.com 90 24 90% (24) 30 20 30% (20)
I have tried the following code, but it doesn't seem to work; it might have to do with numpy not being able to combine strings. I'd appreciate any help, thank you.
# create a list of all the rating columns
ratingcollist = ['leadership', 'comms', 'wellbeing', 'teamwork']
# loop to get all the columns that match each rating
for rat in ratingcollist:
    cols = df.filter(like=rat).columns
    fy21cols = df[cols].filter(like='_fy21').columns
    deltacols = df[cols].filter(like='_delta').columns
    if len(cols) > 0:
        df[f'{rat.lower()}final'] = (
            df[fy21cols].values.astype(str) + '%'
            + '(' + df[deltacols].values.astype(str) + ')'
        )
You can do this:
def yourfunction(ratingcol):
    x = df.filter(regex=f'{ratingcol}(_delta|_fy21)')
    fy = x.filter(regex='21').iloc[:, 0].astype(str)
    delta = x.filter(regex='_delta').iloc[:, 0].astype(str)
    return fy + "%(" + delta + ")"

yourfunction('leadership')
0     88%(38)
1    22%(-60)
2    41%(-53)
3     90%(24)
Then, using a for loop, you can create your columns:
for i in ratingcollist:
    df[f"{i}_final"] = yourfunction(i)
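A self-contained sketch of the same idea using direct column lookups instead of regex filters; the DataFrame is a hypothetical two-row reconstruction of the question's data:

```python
import pandas as pd

# Hypothetical reconstruction of the survey DataFrame from the question
df = pd.DataFrame({
    "Employee": ["patrick.t#abc.com", "johnson.g#abc.com"],
    "leadership_fy21": [88, 22],
    "leadership_fy20": [50, 82],
    "leadership_delta": [38, -60],
    "comms_fy21": [90, 80],
    "comms_fy20": [80, 90],
    "comms_delta": [10, -10],
})

ratingcollist = ["leadership", "comms"]
for rat in ratingcollist:
    fy21 = df[f"{rat}_fy21"].astype(str)
    delta = df[f"{rat}_delta"].astype(str)
    # Build the combined string column, e.g. "88% (38)"
    df[f"{rat}_final"] = fy21 + "% (" + delta + ")"

print(df[["Employee", "leadership_final", "comms_final"]])
```

Converting both numeric columns to `str` first is what the original attempt was missing at the numpy level: string concatenation works element-wise on pandas Series but not on the 2-D `.values` arrays it was combining.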

combining three different timestamp dataframes using duration match

I have three DataFrames with different timestamps and frequencies. I want to combine them into one dataframe.
The first dataframe records sunlight, as given below:
df1 =
index light_data
05/01/2019 06:54:00.000 10
05/01/2019 06:55:00.000 20
05/01/2019 06:56:00.000 30
05/01/2019 06:57:00.000 40
05/01/2019 06:59:00.000 50
05/01/2019 07:01:00.000 60
05/01/2019 07:03:00.000 70
05/01/2019 07:04:00.000 80
05/01/2019 07:06:00.000 90
Second dataframe collects solar power from unit-A
df2 =
index P1
05/01/2019 06:54:24.000 100
05/01/2019 06:59:32.000 200
05/01/2019 07:04:56.000 300
Third dataframe collects solar power from unit-B
df3 =
index P2
05/01/2019 06:56:45.000 400
05/01/2019 07:01:21.000 500
05/01/2019 07:06:34.000 600
The above three are measurements coming from the field, each with its own timestamps. Now I want to combine all three into one dataframe with a single timestamp:
df1 data occurs every minute;
df2 and df3 occur every five minutes, at different times.
Combine the three dataframes using the df2 timestamp, with the seconds dropped, as the reference index.
Finally, I want the output something like as given below:
df_combine =
combine_index P1 light_data1 P2 light_data2
05/01/2019 06:54:00 100 10 400 30
05/01/2019 06:59:00 200 50 500 60
05/01/2019 07:04:00 300 80 600 90
# Note: combine_index is df2 index with no seconds
Nice question. Method 1 uses reindex with method='nearest':
df1['row'] = df1.index
s1 = df1.reindex(df2.index, method='nearest')
s2 = df1.reindex(df3.index, method='nearest')
s1 = s1.join(df2).set_index('row')
s2 = s2.join(df3).set_index('row')
pd.concat([s1, s2.reindex(s1.index, method='nearest')], axis=1)
Out[67]:
light_data A light_data B
row
2019-05-01 06:54:00 10 100 40 400
2019-05-01 06:59:00 50 200 60 500
2019-05-01 07:04:00 80 300 90 600
Or, for the last line, using merge_asof:
pd.merge_asof(s1,s2,left_index=True,right_index=True,direction='nearest')
Out[81]:
light_data_x A light_data_y B
row
2019-05-01 06:54:00 10 100 40 400
2019-05-01 06:59:00 50 200 40 400
2019-05-01 07:04:00 80 300 90 600
To make it extendable:
df1['row'] = df1.index
l = []
for i, x in enumerate([df2, df3]):
    s1 = df1.reindex(x.index, method='nearest')
    if i == 0:
        l.append(s1.join(x).set_index('row')
                   .add_suffix(x.columns[0][-1]))
    else:
        l.append(s1.join(x).set_index('row')
                   .reindex(l[0].index, method='nearest')
                   .add_suffix(x.columns[0][-1]))
pd.concat(l, axis=1)
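A self-contained sketch of the merge_asof route, reconstructing the three frames from the question and using the df2 index, floored to the minute, as the reference grid. Note that, as in the merge_asof output above, nearest-timestamp matching pairs the 06:59 reference row with the 06:56:45 reading from df3:

```python
import pandas as pd

# Hypothetical reconstruction of the three frames from the question
idx1 = pd.to_datetime([
    "2019-05-01 06:54", "2019-05-01 06:55", "2019-05-01 06:56",
    "2019-05-01 06:57", "2019-05-01 06:59", "2019-05-01 07:01",
    "2019-05-01 07:03", "2019-05-01 07:04", "2019-05-01 07:06",
])
df1 = pd.DataFrame({"light_data": [10, 20, 30, 40, 50, 60, 70, 80, 90]},
                   index=idx1)
df2 = pd.DataFrame({"P1": [100, 200, 300]},
                   index=pd.to_datetime(["2019-05-01 06:54:24",
                                         "2019-05-01 06:59:32",
                                         "2019-05-01 07:04:56"]))
df3 = pd.DataFrame({"P2": [400, 500, 600]},
                   index=pd.to_datetime(["2019-05-01 06:56:45",
                                         "2019-05-01 07:01:21",
                                         "2019-05-01 07:06:34"]))

# Attach the nearest light reading to each power reading
s2 = (df1.reindex(df2.index, method="nearest").join(df2)
         .rename(columns={"light_data": "light_data1"}))
s3 = (df1.reindex(df3.index, method="nearest").join(df3)
         .rename(columns={"light_data": "light_data2"}))

# The df2 timestamps, floored to the minute, become the reference index
s2.index = df2.index.floor("min")

# Align df3's rows to that grid by nearest timestamp
out = pd.merge_asof(s2, s3, left_index=True, right_index=True,
                    direction="nearest")
print(out)
```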

Plot histogram using two columns (values, counts) in python dataframe

I have a dataframe having multiple columns in pairs: if one column is values then the adjacent column is the corresponding counts. I want to plot a histogram using values as x variable and counts as the frequency.
For example, I have the following columns:
Age  Counts
60   1204
45   700
21   400
.    .
.    .
34   56
10   150
I want my code to bin the Age values into ten-year intervals between the minimum and maximum values, get the cumulative frequency of each interval from the Counts column, and then plot a histogram. Is there a way to do this using matplotlib?
I have tried the following but in vain:
patient_dets.plot(x='PatientAge', y='PatientAgecounts', kind='hist')
(patient_dets is the dataframe with 'PatientAge' and 'PatientAgecounts' as columns)
I think you need Series.plot.bar:
patient_dets.set_index('PatientAge')['PatientAgecounts'].plot.bar()
If you need bins, one possible solution is with pd.cut:
# helper df with group labels and min and max ages
df1 = pd.DataFrame({'G': ['14 yo and younger', '15-19', '20-24', '25-29', '30-34',
                          '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65+'],
                    'Min': [0, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65],
                    'Max': [14, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 120]})
print (df1)
G Max Min
0 14 yo and younger 14 0
1 15-19 19 15
2 20-24 24 20
3 25-29 29 25
4 30-34 34 30
5 35-39 39 35
6 40-44 44 40
7 45-49 49 45
8 50-54 54 50
9 55-59 59 55
10 60-64 64 60
11 65+ 120 65
cutoff = np.hstack([np.array(df1.Min[0]), df1.Max.values])
labels = df1.G.values
patient_dets['Groups'] = pd.cut(patient_dets.PatientAge, bins=cutoff,
                                labels=labels, right=True, include_lowest=True)
print (patient_dets)
PatientAge PatientAgecounts Groups
0 60 1204 60-64
1 45 700 45-49
2 21 400 20-24
3 34 56 30-34
4 10 150 14 yo and younger
patient_dets.groupby(['PatientAge','Groups'])['PatientAgecounts'].sum().plot.bar()
You can use pd.cut() to bin your data, and then plot with plot(kind='bar'):
import numpy as np
nBins = 10
my_bins = np.linspace(patient_dets.Age.min(), patient_dets.Age.max(), nBins)
patient_dets.groupby(pd.cut(patient_dets.Age, bins=my_bins, include_lowest=True)).sum()['Counts'].plot(kind='bar')
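Alternatively, matplotlib can consume the (value, count) pairs directly: `plt.hist` accepts a `weights` argument, so each age contributes its count to the bar height instead of 1. A sketch with the sample values from the question (column names as in the attempt above; the ten-year bin edges are an assumption):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the value/count pairs from the question
patient_dets = pd.DataFrame({"PatientAge": [60, 45, 21, 34, 10],
                             "PatientAgecounts": [1204, 700, 400, 56, 150]})

# Ten-year bins spanning the data; `weights` makes each age contribute
# its count to the bar height instead of 1
bins = np.arange(10, 71, 10)
counts, edges, _ = plt.hist(patient_dets["PatientAge"],
                            bins=bins,
                            weights=patient_dets["PatientAgecounts"])
plt.xlabel("Age")
plt.ylabel("Frequency")
```

This skips the intermediate groupby entirely, which is convenient when the raw value/count table is all you have.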
