Normalize multivalue column - python

I have a dataset like this:
import pandas as pd
data = {'Host': ['A', 'A', 'A', 'A'], 'Seq': ['0, 1, 2, 99', ' 4, 5, 6', '999, 8', '100']}
df = pd.DataFrame(data)
I want to normalize all the values.
First I reshape it like this:
Host Seq
A 0
A 1
A 2
A 99
A 4
A 5
A 6
A 999
A 8
A 100
I get there with this code:
df = df.join(df.pop('Seq')
.str.split(',', expand=True)
.stack()
.reset_index(level=1, drop=True)
.rename('Seq')).reset_index(drop=True)
Then I normalize with StandardScaler:
scaler = StandardScaler()
df['Seq'] = scaler.fit_transform(np.array(df.Seq.values).reshape(-1, 1)).reshape(-1)
Now I don't know how to get back to the original layout.
Any ideas or comments are welcome.

Assuming you haven't destroyed the index information from the original, split and explode while keeping the index:
d_ = df.assign(Seq=df.Seq.str.split(r',\s*')).explode('Seq')
d_
Host Seq
0 A 0
0 A 1
0 A 2
0 A 99
1 A 4
1 A 5
1 A 6
2 A 999
2 A 8
3 A 100
Then you can group by the index and the 'Host' column and join the values back together:
d_.groupby([d_.index, 'Host']).Seq.apply(', '.join).reset_index('Host')
Host Seq
0 A 0, 1, 2, 99
1 A 4, 5, 6
2 A 999, 8
3 A 100
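The round trip with the scaling step in between can be sketched as follows. This is only a sketch under assumptions: the z-score is computed by hand with the population standard deviation (ddof=0), which should match what StandardScaler does, and the names long, scaled and restored as well as the rounding to 3 decimals are illustrative choices, not part of the original answer. The numbers have to be converted back to strings before ', '.join can be applied.

```python
import pandas as pd

data = {'Host': ['A', 'A', 'A', 'A'],
        'Seq': ['0, 1, 2, 99', ' 4, 5, 6', '999, 8', '100']}
df = pd.DataFrame(data)

# One value per row; the original row label survives in the (repeated) index.
long = df.assign(Seq=df['Seq'].str.strip().str.split(r',\s*')).explode('Seq')

# Hand-written z-score over all values (assumed equivalent to StandardScaler).
values = long['Seq'].astype(float)
scaled = (values - values.mean()) / values.std(ddof=0)

# Turn the floats back into text, then join them per original row.
restored = (long.assign(Seq=scaled.round(3).astype(str))
                .groupby([long.index, 'Host'])['Seq']
                .apply(', '.join)
                .reset_index('Host'))
print(restored)
```

If the index had already been reset, a helper column holding the original row number would be needed before exploding.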

Related

pandas series add with previous row on condition

I need to build a running total from the previous row, where the step depends on a condition on the current cell. Here's the dataframe:
import pandas as pd
data = {'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0]}
df = pd.DataFrame(data, columns=['col1'])
df['continuous'] = df.col1
print(df)
I need to add 1 to the previous total if the cell's value is > 0, otherwise subtract 1. The result I'm expecting is:
col1 continuous
0 1 1//+1 as its non-zero
1 2 2//+1 as its non-zero
2 1 3//+1 as its non-zero
3 0 2//-1 as its zero
4 0 1
5 0 0
6 0 0// not to go less than 0
7 3 1
8 2 2
9 2 3
10 0 2
11 0 1
Case 2: here the condition is < -0.1 instead of > 0.
data = {'col1': [-0.097112634,-0.092674324,-0.089176841,-0.087302284,-0.087351866,-0.089226185,-0.092242213,-0.096446987,-0.101620036,-0.105940337,-0.109484752,-0.113515648,-0.117848816,-0.121133266,-0.123824577,-0.126030136,-0.126630895,-0.126015218,-0.124235003,-0.122715224,-0.121746573,-0.120794916,-0.120291174,-0.120323152,-0.12053229,-0.121491186,-0.122625851,-0.123819704,-0.125751858,-0.127676591,-0.129339428,-0.132342431,-0.137119556,-0.142040092,-0.14837848,-0.15439201,-0.159282645,-0.161271982,-0.162377701,-0.162838307,-0.163204393,-0.164095634,-0.165496071,-0.167224488,-0.167057078,-0.165706164,-0.163301617,-0.161423938,-0.158669389,-0.156508912,-0.15508329,-0.15365104,-0.151958972,-0.150317528,-0.149234892,-0.148259354,-0.14737422,-0.145958527,-0.144633388,-0.143120273,-0.14145652,-0.139930163,-0.138774126,-0.136710524,-0.134692221,-0.132534879,-0.129921444,-0.127974949,-0.128294058,-0.129241763,-0.132263506,-0.137828981,-0.145549768,-0.154244588,-0.163125109,-0.171814857,-0.179911465,-0.186223859,-0.190653162,-0.194761064,-0.197988536,-0.200500606,-0.20260121,-0.204797089,-0.208281065,-0.211846904,-0.215312626,-0.218696339,-0.221489975,-0.221375209,-0.220996031,-0.218558429,-0.215936558,-0.213933531,-0.21242896,-0.209682125,-0.208196607,-0.206243585,-0.202190476,-0.19913106,-0.19703291,-0.194244664,-0.189609518,-0.186600526,-0.18160171,-0.175875689,-0.170767095,-0.167453329,-0.163516985,-0.161168703,-0.158197984,-0.156378046,-0.154794499,-0.153236804,-0.15187487,-0.151623385,-0.150628282,-0.149039072,-0.14826268,-0.147535739,-0.145557646,-0.142223729,-0.139343068,-0.135355686,-0.13047743,-0.125999173,-0.12218752,-0.117021996,-0.111542982,-0.106409901,-0.101904095,-0.097910825,-0.094683375,-0.092079967,-0.088953862,-0.086268097,-0.082907394,-0.080723466,-0.078117426,-0.075431993,-0.072079536,-0.068962411,-0.064831759,-0.061257701,-0.05830671,-0.053889968,-0.048972414,-0.044763431,-0.042162829,-0.039328369,-0.038968862,-0.040450835,-0.041974942,-0.042161609,-0.04280523,-0.042702428,-0.042593856,-0.043166561,-0.043691795,-0.044093492,-0.043965231,-0.04263305,-0.040836102,-0.039605133,-0.037204273,-0.034368645,-0.032293737,-0.029037983,-0.025509509,-0.022704668,-0.021346266,-0.019881524,-0.018675734,-0.017509566,-0.017148129,-0.016671088,-0.016015011,-0.016241862,-0.016416445,-0.016548878,-0.016475455,-0.016405742,-0.015567737,-0.014190101,-0.012373151,-0.010370329,-0.008131459,-0.006729419,-0.005667607,-0.004883919,-0.004841328,-0.005403019,-0.005343759,-0.005377974,-0.00548823,-0.004889709,-0.003884973,-0.003149113,-0.002975268,-0.00283163,-0.00322658,-0.003546589,-0.004233582,-0.004448617,-0.004706967,-0.007400356,-0.010104064,-0.01230257,-0.014430498,-0.016499501,-0.015348355,-0.013974229,-0.012845464,-0.012688459,-0.012552231,-0.013719074,-0.014404172,-0.014611632,-0.013401283,-0.011807386,-0.007417753,-0.003321279,0.000363954,0.004908491,0.010151584,0.013223831,0.016746553,0.02106351,0.024571507,0.027588073,0.031313637,0.034419301,0.037016545,0.038172954,0.038237253,0.038094387,0.037783779,0.036482515,0.036080763,0.035476154,0.034107081,0.03237083,0.030934259,0.029317076,0.028236195,0.027850758,0.024612491,0.01964433,0.015153308,0.009684456,0.003336172]}
df = pd.DataFrame(data, columns=['col1'])
lim = float(-0.1)
s = df['col1'].lt(lim)
out = s.where(s, -1).cumsum()
df['sol'] = out - out.where((out < 0) & (~s)).ffill().fillna(0)
print(df)
To me, the key problem here is keeping the output from going below zero. With that in mind, we can mask the output where it is negative and adjust accordingly:
# slightly longer data to cover a corner case
df = pd.DataFrame({'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 3, 4]})
s = df.col1.gt(0)
out = s.where(s,-1).cumsum()
df['continuous'] = out - out.where((out<0)&(~s)).ffill().fillna(0)
Output:
col1 continuous
0 1 1
1 2 2
2 1 3
3 0 2
4 0 1
5 0 0
6 0 0
7 3 1
8 2 2
9 2 3
10 0 2
11 0 1
12 0 0
13 0 0
14 0 0
15 2 1
16 3 2
17 4 3
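One way to cross-check a vectorized counter like this is against a plain loop that applies the rule row by row. The sketch below also shows a different vectorized form, swapped in here for illustration rather than taken from the answer: the raw +1/-1 running sum minus its running minimum, capped at zero. The function names clipped_counter_loop and clipped_counter and the extra tricky input are hypothetical. Inputs that start with zeros and then only partly recover are worth testing with any vectorized variant.

```python
import pandas as pd

def clipped_counter_loop(flags):
    """Reference version: +1 when the flag is set, -1 otherwise, never below 0."""
    total, out = 0, []
    for up in flags:
        total = max(total + (1 if up else -1), 0)
        out.append(total)
    return out

def clipped_counter(flags):
    """Vectorized version: running sum minus its running minimum (capped at 0)."""
    steps = flags.astype(int) * 2 - 1          # True -> +1, False -> -1
    walk = steps.cumsum()                      # unclipped running sum
    floor = walk.cummin().clip(upper=0)        # how far below zero it has ever been
    return walk - floor

col1 = pd.Series([1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 3, 4])
result = clipped_counter(col1.gt(0))

# An input that dips below zero first and then only partly recovers.
tricky = pd.Series([0, 0, 1, 1, 0, 0, 0])
tricky_result = clipped_counter(tricky.gt(0))
```

For Case 2 the same function can be called with df['col1'].lt(-0.1) as the flags.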
You can do this using the cumsum function on booleans:
Give me a +1 whenever col1 is not zero:
(df.col1 != 0 ).cumsum()
Give me a -1 whenever col1 is zero:
- (df.col1 == 0 ).cumsum()
Then just add them together!
df['continuous'] = (df.col1 != 0 ).cumsum() - (df.col1 == 0 ).cumsum()
However, this does not handle the requirement you mentioned of not dropping below zero.

How to remove rows from DF as result of a groupby query?

I have this Pandas dataframe:
df = pd.DataFrame({'site': ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'], 'day': [1, 1, 1, 1, 1, 1, 2, 2, 2],
'hour': [1, 2, 3, 1, 2, 3, 1, 2, 3], 'clicks': [100, 200, 50, 0, 0, 0, 10, 0, 20]})
# site day hour clicks
# 0 a 1 1 100
# 1 a 1 2 200
# 2 a 1 3 50
# 3 b 1 1 0
# 4 b 1 2 0
# 5 b 1 3 0
# 6 a 2 1 10
# 7 a 2 2 0
# 8 a 2 3 20
I want to remove all rows for a site/day where there were 0 clicks in total. So in the example above, I want to remove the rows with site='b' and day=1.
I can basically group them and show where the sum is 0 for a day/site:
print(df.groupby(['site', 'day'])['clicks'].sum() == 0)
But what would be a straightforward way to remove the rows from the original dataframe where that condition applies?
The solution I have so far iterates over the groups, saves all site/day tuples in a list, and then separately removes all rows with those site/day combinations. That works, but I am sure there must be a more functional and elegant way to achieve this.
Option 1
Using groupby, transform and boolean indexing:
df[df.groupby(['site', 'day'])['clicks'].transform('sum') != 0]
Output:
site day hour clicks
0 a 1 1 100
1 a 1 2 200
2 a 1 3 50
6 a 2 1 10
7 a 2 2 0
8 a 2 3 20
Option 2
Using groupby and filter:
df.groupby(['site', 'day']).filter(lambda x: x['clicks'].sum() != 0)
Output:
site day hour clicks
0 a 1 1 100
1 a 1 2 200
2 a 1 3 50
6 a 2 1 10
7 a 2 2 0
8 a 2 3 20
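Both options can be run side by side on the question's data to confirm that they keep the same rows. This is a small sketch; the names group_total, by_transform and by_filter are illustrative. Note that row 7 has 0 clicks but is kept, because its site/day group as a whole has a non-zero sum.

```python
import pandas as pd

df = pd.DataFrame({'site': ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'],
                   'day': [1, 1, 1, 1, 1, 1, 2, 2, 2],
                   'hour': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'clicks': [100, 200, 50, 0, 0, 0, 10, 0, 20]})

# Option 1: broadcast each group's total back onto its rows, then mask.
group_total = df.groupby(['site', 'day'])['clicks'].transform('sum')
by_transform = df[group_total != 0]

# Option 2: keep or drop whole groups with a predicate.
by_filter = df.groupby(['site', 'day']).filter(lambda g: g['clicks'].sum() != 0)

print(by_transform.index.tolist())
print(by_filter.index.tolist())
```

The transform version avoids calling a Python function once per group, which tends to matter when there are many groups.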

Pandas groupby cumcount starting on row with a certain column value

I'd like to create two cumcount columns, depending on the values of two columns.
In the example below, I'd like one cumcount starting when colA is at least 100, and another cumcount starting when colB is at least 10.
columns = ['ID', 'colA', 'colB', 'cumcountA', 'cumountB']
data = [['A', 3, 1, '',''],
['A', 20, 4, '',''],
['A', 102, 8, 1, ''],
['A', 117, 10, 2, 1],
['B', 75, 0, '',''],
['B', 170, 12, 1, 1],
['B', 200, 13, 2, 2],
['B', 300, 20, 3, 3],
]
pd.DataFrame(columns=columns, data=data)
ID colA colB cumcountA cumountB
0 A 3 1
1 A 20 4
2 A 102 8 1
3 A 117 10 2 1
4 B 75 0
5 B 170 12 1 1
6 B 200 13 2 2
7 B 300 20 3 3
How would I calculate cumcountA and cumcountB?
You can try df.clip with lower set to your thresholds (here 100 and 10), compare the original values against the clipped ones, then group by ID and take the cumsum:
col_list = ['colA','colB']
val_list = [100,10]
df[['cumcountA','cumountB']] = (df[col_list].ge(df[col_list].clip(lower=val_list,axis=1))
.groupby(df['ID']).cumsum().replace(0,''))
print(df)
Or, maybe even better, compare directly:
df[['cumcountA','cumountB']] = (df[['colA','colB']].ge([100,10])
.groupby(df['ID']).cumsum().replace(0,''))
print(df)
ID colA colB cumcountA cumountB
0 A 3 1
1 A 20 4
2 A 102 8 1
3 A 117 10 2 1
4 B 75 0
5 B 170 12 1 1
6 B 200 13 2 2
7 B 300 20 3 3
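The cumsum above counts the rows that meet the threshold. If a value could fall back below the threshold later and the count should keep running from the first qualifying row, a per-group cummax on the flag is one possible variation. This is a sketch under that assumption; the names reached, started and counts are illustrative. On the question's data both approaches give the same numbers.

```python
import pandas as pd

df = pd.DataFrame({'ID': list('AAAABBBB'),
                   'colA': [3, 20, 102, 117, 75, 170, 200, 300],
                   'colB': [1, 4, 8, 10, 0, 12, 13, 20]})

# 1 where the threshold is met (100 for colA, 10 for colB), else 0.
reached = df[['colA', 'colB']].ge([100, 10]).astype(int)

# Within each ID, stay at 1 once the threshold has been met for the first time.
started = reached.groupby(df['ID']).cummax()

# Count rows from that point on, again per ID.
counts = started.groupby(df['ID']).cumsum()
print(counts)
```

The zeros can be replaced with empty strings afterwards, as in the answer, if a blank display is wanted.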

Selecting columns with startswith in pandas

Hi, I have some data and I want to rename one of the columns and select the columns that start with the string t.
raw_data = {'patient': [1, 1, 1, 2, 2],
'obs': [1, 2, 3, 1, 2],
'treatment': [0, 1, 0, 1, 0],
'score': ['strong', 'weak', 'normal', 'weak', 'strong'],
'tr': [1,2,3,4,5],
'tk': [6,7,8,9,10],
'ak': [11,12,13,14,15]
}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score','tr','tk','ak'])
df
patient obs treatment score tr tk ak
0 1 1 0 strong 1 6 11
1 1 2 1 weak 2 7 12
2 1 3 0 normal 3 8 13
3 2 1 1 weak 4 9 14
4 2 2 0 strong 5 10 15
So I tried the following, based on python-pandas-renaming-column-name-startswith:
df.rename(columns = {'treatment':'treat'})[['score','obs',df[df.columns[pd.Series(df.columns).str.startswith('t')]]]]
but I am getting this error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
How can I select the columns that start with t?
Thx
Converting to a Series is not necessary, but if you want to add the result to another list of columns, convert the output to a list:
cols = df.columns[df.columns.str.startswith('t')].tolist()
df = df[['score','obs'] + cols].rename(columns = {'treatment':'treat'})
Another idea is to use two masks and chain them with | for bitwise OR:
Notice:
In your solution the column names are filtered from the original column names before the rename, so it is necessary to rename afterwards.
m1 = df.columns.str.startswith('t')
m2 = df.columns.isin(['score','obs'])
df = df.loc[:, m1 | m2].rename(columns = {'treatment':'treat'})
print (df)
obs treat score tr tk
0 1 0 strong 1 6
1 2 1 weak 2 7
2 3 0 normal 3 8
3 1 1 weak 4 9
4 2 0 strong 5 10
If you need to rename first, it is necessary to assign the result back so you can filter by the renamed column names:
df = df.rename(columns = {'treatment':'treat'})
df = df.loc[:, df.columns.str.startswith('t') | df.columns.isin(['score','obs'])]
# Select the columns that start with "t"
df = df[df.columns[df.columns.str.startswith('t')]]
# Rename your column (assign the result back to keep it)
df = df.rename(columns = {'treatment':'treat'})
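Another way to pick columns by prefix is DataFrame.filter with a regular expression. The sketch below is an alternative to the answers above, not their method; the names t_cols and selected are illustrative. It selects on the original names first and renames afterwards.

```python
import pandas as pd

raw_data = {'patient': [1, 1, 1, 2, 2],
            'obs': [1, 2, 3, 1, 2],
            'treatment': [0, 1, 0, 1, 0],
            'score': ['strong', 'weak', 'normal', 'weak', 'strong'],
            'tr': [1, 2, 3, 4, 5],
            'tk': [6, 7, 8, 9, 10],
            'ak': [11, 12, 13, 14, 15]}
df = pd.DataFrame(raw_data)

# Columns whose name starts with "t" (treatment, tr, tk).
t_cols = df.filter(regex=r'^t')

# Put the two fixed columns in front, then rename.
selected = df[['score', 'obs']].join(t_cols).rename(columns={'treatment': 'treat'})
print(selected)
```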

create a dataframe from 3 other dataframes in python

I am trying to create a new df which summarises my key information, by taking that information from 3 (say) other dataframes.
dfdate = {'x1': [2, 4, 7, 5, 6],
'x2': [2, 2, 2, 6, 7],
'y1': [3, 1, 4, 5, 9]}
dfdate = pd.DataFrame(dfdate, index=range(0, 5))
dfqty = {'x1': [1, 2, 6, 6, 8],
'x2': [3, 1, 1, 7, 5],
'y1': [2, 4, 3, 2, 8]}
dfqty = pd.DataFrame(dfqty, index=range(0, 5))
dfprices = {'x1': [0, 2, 2, 4, 4],
'x2': [2, 0, 0, 3, 4],
'y1': [1, 3, 2, 1, 3]}
dfprices = pd.DataFrame(dfprices, index=range(0, 5))
Let us say the above 3 dataframes are my data. Say, some dates, qty, and prices of goods. My new df is to be constructed from the above data:
rng = len(dfprices.columns)*len(dfprices.index) # This is the len of new df
dfnew = pd.DataFrame(np.nan, index=range(0, rng), columns=['Letter', 'Number', 'date', 'qty', 'price'])
Now, this is where I struggle to piece things together. I am trying to take all the data in dfdate and put it into one column of the new df, and the same with dfqty and dfprices (so the 3x5 matrices essentially go to a 1x15 vector and are placed into the new df).
As well as that, I need a couple of columns in dfnew as identifiers, taken from the column names of the old dfs.
I've tried for loops but to no avail, and I don't know how to convert a df to a series. My desired output is:
dfnew:
'Lettercol','Numbercol', 'date', 'qty', 'price'
0 X 1 2 1 0
1 X 1 4 2 2
2 X 1 7 6 2
3 X 1 5 6 4
4 X 1 6 8 4
5 X 2 2 3 2
6 X 2 2 1 0
7 X 2 2 1 0
8 X 2 6 7 3
9 X 2 7 5 4
10 Y 1 3 2 1
11 Y 1 1 4 3
12 Y 1 4 3 2
13 Y 1 5 2 1
14 Y 1 9 8 3
where the numbers 0-14 are the index.
Letter = the letter from the column header in the dfs
Number = the number from the column header in the dfs
The next 3 columns are the data from the original dfs.
(Don't ask why the original data is in that funny format :)
Thanks so much. My last question wasn't well received, so I have tried to make this one better.
Use:
#list of DataFrames
dfs = [dfdate, dfqty, dfprices]
#list comprehension with reshape
comb = [x.unstack() for x in dfs]
#join together
df = pd.concat(comb, axis=1, keys=['date', 'qty', 'price'])
#remove second level of MultiIndex and index to column
df = df.reset_index(level=1, drop=True).reset_index().rename(columns={'index':'col'})
#extract all values without first by indexing [1:] and first letter by [0]
df['Number'] = df['col'].str[1:]
df['Letter'] = df['col'].str[0]
cols = ['Letter', 'Number', 'date', 'qty', 'price']
#change order of columns
df = df.reindex(columns=cols)
print (df)
Letter Number date qty price
0 x 1 2 1 0
1 x 1 4 2 2
2 x 1 7 6 2
3 x 1 5 6 4
4 x 1 6 8 4
5 x 2 2 3 2
6 x 2 2 1 0
7 x 2 2 1 0
8 x 2 6 7 3
9 x 2 7 5 4
10 y 1 3 2 1
11 y 1 1 4 3
12 y 1 4 3 2
13 y 1 5 2 1
14 y 1 9 8 3
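The steps above can also be written as one self-contained script. The sketch below is a variation rather than the answer's exact code: it passes a dict to pd.concat and splits the header with str.extract, which is an assumption about the header format (letters followed by digits), and it converts Number to an integer. The names frames, long and parts are illustrative.

```python
import pandas as pd

dfdate = pd.DataFrame({'x1': [2, 4, 7, 5, 6], 'x2': [2, 2, 2, 6, 7], 'y1': [3, 1, 4, 5, 9]})
dfqty = pd.DataFrame({'x1': [1, 2, 6, 6, 8], 'x2': [3, 1, 1, 7, 5], 'y1': [2, 4, 3, 2, 8]})
dfprices = pd.DataFrame({'x1': [0, 2, 2, 4, 4], 'x2': [2, 0, 0, 3, 4], 'y1': [1, 3, 2, 1, 3]})

frames = {'date': dfdate, 'qty': dfqty, 'price': dfprices}

# unstack() turns each frame into one long Series indexed by (column name, row).
long = pd.concat({name: frame.unstack() for name, frame in frames.items()}, axis=1)

# Drop the row level and move the old column name into a regular column.
long = long.reset_index(level=1, drop=True).rename_axis('col').reset_index()

# Split headers such as "x1" into a letter part and a number part.
parts = long['col'].str.extract(r'^(?P<Letter>[A-Za-z]+)(?P<Number>\d+)$')
parts['Number'] = parts['Number'].astype(int)

dfnew = pd.concat([parts, long[['date', 'qty', 'price']]], axis=1)
print(dfnew)
```

The letters stay lowercase, as in the headers; .str.upper() on the Letter column would give the uppercase form shown in the question.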
