pandas dataframe and external list interaction - python

I have a pandas dataframe df which looks like this
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.225660 0.083903
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.029690 0.188627 0.200235 0.224703 0.081434
3 0.009938 0.059595 0.109310 0.069609 0.009970 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009
Then I have a vector dk that looks like this:
[0.18,0.35,0.71,1.41,2.83,5.66,11.31,22.63,45.25,90.51,181.02]
What I need to do is:
calculate a new vector which is
psik = [np.log2(dki/1e3) for dki in dk]
calculate the sum of each row multiplied element-wise by the psik vector (like Excel's SUMPRODUCT function)
calculate the log2 of each resulting sum
expected output should be:
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10 psig dg
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083 -5.848002631 0.017361042
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.22566 0.083903 -5.903532822 0.016705502
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.02969 0.188627 0.200235 0.224703 0.081434 -5.908820802 0.016644383
3 0.009938 0.059595 0.10931 0.069609 0.00997 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249 -5.930608559 0.016394906
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009 -5.924408689 0.016465513
I would do that with a for loop cycling over the rows like this
psig, dg = [], []
for _, r in df.iterrows():
    psig_i = sum(psik[i] * ri for i, ri in enumerate(r))
    psig.append(psig_i)
    dg.append(np.log2(psig_i))
df['psig'] = psig
df['dg'] = dg
Is there any other way to update the df without iterating through its rows?
EDIT: I found the solution and I am ashamed of how simple it is
df['psig'] = df.mul(psik).sum(axis=1)
df['dg'] = df['psig'].apply(lambda x: np.log2(x))
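One caveat worth flagging (my observation, not in the original post): the dg column in the expected output above matches 2**psig rather than np.log2(psig) (log2 of a negative psig would give NaN), so depending on what is intended the last line might instead be:
df['dg'] = 2 ** df['psig']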
EDIT2: now, my df has more entries, so I have to filter it with a regex to find only the columns whose names start with "betasub".
I have my array psik and a new column `psig` in the df. I would like to calculate, for each row (i.e. each value of `psig`):
sum(((psik-psig)**2)*betasub[0...n])
I did it like this, but maybe there's a better way?
PsimPsig2 = [[(psik_i-psig_i)**2 for psik_i in psik] for psig_i in list(df['psig'])]
psikmpsigname = ['psikmpsig'+str(i) for i in range(len(psik))]
dfPsimPsig2 = pd.DataFrame(data=PsimPsig2,columns=psikmpsigname)
siggAL = np.power(2,(np.power(pd.DataFrame(df.filter(regex=r'^betasub[0-9]',axis=1).values*dfPsimPsig2.values).sum(axis=1),0.5)))
df['siggAL'] = siggAL
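For what it's worth, a fully vectorized sketch of the same computation using NumPy broadcasting (my addition; it assumes psik has one entry per betasub column and that df['psig'] already exists as above):
beta = df.filter(regex=r'^betasub[0-9]', axis=1).to_numpy()                # shape (n_rows, n_beta)
diff2 = (np.asarray(psik)[None, :] - df['psig'].to_numpy()[:, None]) ** 2  # shape (n_rows, n_beta)
df['siggAL'] = 2 ** np.sqrt((beta * diff2).sum(axis=1))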

Related

Calculate Product of length of lists in dataframe and store in a new column

I have a dataframe, whose values are lists. How can I calculate product of lengths of all lists in a row, and store in a separate column? Maybe the following example will make it clear:
test_1 = ['Protocol', 'SCADA', 'SHM System']
test_2 = ['CM', 'Finances']
test_3 = ['RBA', 'PBA']
df = pd.DataFrame({'a':[test_1,test_2,test_3],'b':[test_2]*3, 'c':[test_3]*3, 'product of len(lists)':[12,8,8]})
This sample code shows that, in the first row, the product is 3 * 2 * 2 = 12, i.e. the product of the lengths of the lists in that row, and similarly for the other rows.
How can I compute these products and store in a new column, for a dataframe whose all values are lists?
Thank you.
Try using DataFrame.applymap and DataFrame.product:
df['product of len(lists)'] = df[['a', 'b', 'c']].applymap(len).product(axis=1)
[out]
a b c product of len(lists)
0 [Protocol, SCADA, SHM System] [CM, Finances] [RBA, PBA] 12
1 [CM, Finances] [CM, Finances] [RBA, PBA] 8
2 [RBA, PBA] [CM, Finances] [RBA, PBA] 8
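As a side note (not part of the original answer): applymap was renamed to DataFrame.map in pandas 2.1, so an equivalent sketch that avoids it would be
df['product of len(lists)'] = df[['a', 'b', 'c']].apply(lambda col: col.map(len)).product(axis=1)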

combine pandas apply results as multiple columns in a single dataframe

Summary
Suppose that you apply a function to a groupby object, so that g.apply(...) for every group g in df.groupby(...) gives you a series/dataframe. How do I combine these results into a single dataframe, but with the group names as columns?
Details
I have a dataframe event_df that looks like this:
index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0
...
I want to create a sampling of the event for every note, and the sampling is done at times as given by t_df:
index t
0 0
1 0.5
2 1.0
...
So that I'd get something like this.
t C D
0 off off
0.5 on off
1.0 off on
...
What I've done so far:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    print(group_with_t)
    return group_with_t
t_arr = np.arange(0,10,0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
gb.apply(get_t_for_gb, **kwargs)
So what I get is a number of dataframes for each note, all of the same size (same as t_df):
t event
0 on
0.5 off
...
t event
0 off
0.5 on
...
How do I go from here to my desired dataframe, with each group corresponding to a column in a new dataframe, and the index being t?
EDIT: sorry, I didn't take into account that you rescale your time column, and I can't present a whole solution now because I have to leave. But I think you could do the rescaling with pandas.merge_asof on your two dataframes to get the nearest "rescaled" time, and from the merged dataframe you could try the code below. I hope this is what you wanted.
import pandas as pd
import io
sio= io.StringIO("""index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0""")
df = pd.read_csv(sio, sep=r'\s+', index_col=0)
df.groupby(['time', 'note']).agg({'event': 'first'}).unstack(-1).fillna('off')
Take the first row in each time-note group with agg({'event': 'first'}), then unstack the note index level so the note values become columns. At the end, fill all cells for which no data point was found with 'off' via fillna.
This outputs:
Out[28]:
event
note C D
time
0.50 on off
0.75 off on
1.00 off off
You might also want to try min or max in case on/off is ambiguous for a combination of time/note (if there are several rows for the same time/note where some have on and some have off) and you prefer one of these values (say, if there is a single on, then no matter how many offs there are, you want an on, etc.). If you want something like a majority vote, I would suggest adding a majority-vote column to the aggregated dataframe (before the unstack()).
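For completeness, a rough sketch of the merge_asof rescaling mentioned above (my addition, not tested against your real data; it assumes event_df and t_df as defined in the question, with direction='forward' mimicking the argwhere(t_arr >= time)[0] lookup):
# snap each event time to the next sampling point, then pivot as above
snapped = pd.merge_asof(event_df.sort_values('time'), t_df,
                        left_on='time', right_on='t', direction='forward')
sampled = snapped.groupby(['t', 'note']).agg({'event': 'first'}).unstack(-1).fillna('off')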
Oh so I found it! All I had to do was to unstack the groupby results. Going back to generating the groupby result:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    ## print(group_with_t) ## unnecessary!
    return group_with_t
t_arr = np.arange(0,10,0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
result = gb.apply(get_t_for_gb, **kwargs)
At this point, result is a dataframe with note as an index:
>> print(result)
event
note t
C 0 off
0.5 on
1.0 off
....
D 0 off
0.5 off
1.0 on
....
Doing result = result.unstack('note') does the trick:
>> result = result.unstack('note')
>> print(result)
event
note C D
t
0 off off
0.5 on off
1.0 off on
....
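One small follow-up (my addition, assuming the structure printed above): result still carries a two-level column index with event on top, which can be flattened like this:
result = result['event']   # keep only the C/D level as plain columns
# or equivalently: result.columns = result.columns.droplevel(0)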

groupby and sum two columns and set as one column in pandas

I have the following data frame:
import pandas as pd
data = pd.DataFrame()
data['Home'] = ['A','B','C','D','E','F']
data['HomePoint'] = [3,0,1,1,3,3]
data['Away'] = ['B','C','A','E','D','D']
data['AwayPoint'] = [0,3,1,1,0,0]
I want to group by the teams in the columns ['Home', 'Away'], combined under the name Team, and then sum HomePoint and AwayPoint into a column named Points:
Team Points
A 4
B 0
C 4
D 1
E 4
F 3
How can I do it?
I was trying a different approach using the following post:
Link
But I was not able to get the format that I wanted.
Greatly appreciate your advice.
Thanks
Zep.
A simple way is to create two new Series indexed by the teams:
home = pd.Series(data.HomePoint.values, data.Home)
away = pd.Series(data.AwayPoint.values, data.Away)
Then, the result you want is:
home.add(away, fill_value=0).astype(int)
Note that home + away does not work, because team F never played away, so would result in NaN for them. So we use Series.add() with fill_value=0.
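For reference, with the sample data above, home.add(away, fill_value=0).astype(int) should come out roughly like this (matching the totals you expect):
A    4
B    0
C    4
D    1
E    4
F    3
dtype: int64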
A complicated way is to use DataFrame.melt():
goo = data.melt(['HomePoint', 'AwayPoint'], var_name='At', value_name='Team')
goo.HomePoint.where(goo.At == 'Home', goo.AwayPoint).groupby(goo.Team).sum()
Or from the other perspective:
ooze = data.melt(['Home', 'Away'])
ooze.value.groupby(ooze.Home.where(ooze.variable == 'HomePoint', ooze.Away)).sum()
You can concatenate, pairwise, columns of your input dataframe. Then use groupby.sum.
# calculate number of pairs
n = len(data.columns) // 2
# create list of pairwise dataframes
df_lst = [data.iloc[:, 2*i:2*(i+1)].set_axis(['Team', 'Points'], axis=1)
          for i in range(n)]
# concatenate list of dataframes
df = pd.concat(df_lst, axis=0)
# perform groupby
res = df.groupby('Team', as_index=False)['Points'].sum()
print(res)
Team Points
0 A 4
1 B 0
2 C 4
3 D 1
4 E 4
5 F 3

Optimization of date subtraction on large dataframe - Pandas

I'm a beginner learning Python. I have a very large dataset - I'm having trouble optimizing my code to make this run faster.
My goal is to optimize all of this (my current code works but slow):
Subtract two date columns
Create new column with the result of that subtraction
Remove original two columns
Do all of this in a fast manner
Random finds:
Thinking about changing the initial file read method...
https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file
I have parse_dates=True when reading the CSV file - so could this be a slowdown? I have 50+ columns but only 1 timestamp column and 1 year column.
This column:
saledate
1 3/26/2004 0:00
2 2/26/2004 0:00
3 5/19/2011 0:00
4 7/23/2009 0:00
5 12/18/2008 0:00
Subtracted by (Should this be converted to a format like 1/1/1996?):
YearMade
1 1996
2 2001
3 2001
4 2007
5 2004
Current code:
mean_YearMade = dfx[dfx['YearMade'] > 1000]['YearMade'].mean()

def age_at_sale(df, mean_YearMade):
    '''
    INPUT: Dataframe
    OUTPUT: Dataframe
    Add a column called AgeSale
    '''
    # Column has tons of erroneous years with 1000
    df.loc[:, 'YearMade'][df['YearMade'] == 1000] = mean_YearMade
    df['saledate'] = pd.to_datetime(df['saledate'])
    df['saleyear'] = df['saledate'].dt.year
    df['Age_at_Sale'] = df['saleyear'] - df['YearMade']
    df = df.drop('saledate', axis=1)
    df = df.drop('YearMade', axis=1)
    df = df.drop('saleyear', axis=1)
    return df
Any optimization tricks would be much appreciated...
You can try using sub for the subtraction, and for selecting by condition use loc with a mask like dfx['YearMade'] > 1000. Also, creating the saleyear column is not necessary.
dfx['saledate'] = pd.to_datetime(dfx['saledate'])
mean_YearMade = dfx.loc[dfx['YearMade'] > 1000, 'YearMade'].mean()

def age_at_sale(df, mean_YearMade):
    '''
    INPUT: Dataframe
    OUTPUT: Dataframe
    Add a column called AgeSale
    '''
    df.loc[df['YearMade'] == 1000, 'YearMade'] = mean_YearMade
    df['Age_at_Sale'] = df['saledate'].dt.year.sub(df['YearMade'])
    df = df.drop(['saledate', 'YearMade'], axis=1)
    return df
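Regarding the parse_dates question in the post: restricting date parsing to the single timestamp column at read time is another option worth trying (a sketch only; the filename is made up):
dfx = pd.read_csv('sales.csv', parse_dates=['saledate'])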

How to "melt" `pandas.DataFrame` objects in Python 3?

I'm trying to melt certain columns of a pd.DataFrame while preserving the others. In this case, I want to melt the sine and cosine columns into a single values column, record which column each value came from (i.e. sine or cosine) in a new column entitled data_type, and preserve the original desc column.
How can I use pd.melt to achieve this without melting and concatenating each component manually?
# Data
a = np.linspace(0,2*np.pi,100)
DF_data = pd.DataFrame([a, np.sin(np.pi*a), np.cos(np.pi*a)], index=["t", "sine", "cosine"], columns=["t_%d"%_ for _ in range(100)]).T
DF_data["desc"] = ["info about this" for _ in DF_data.index]
The roundabout way I did it:
# Melt each part
DF_melt_A = pd.DataFrame([DF_data["t"],
                          DF_data["sine"],
                          pd.Series(DF_data.shape[0]*["sine"], index=DF_data.index, name="data_type"),
                          DF_data["desc"]]).T.reset_index()
DF_melt_A.columns = ["idx", "t", "values", "data_type", "desc"]
DF_melt_B = pd.DataFrame([DF_data["t"],
                          DF_data["cosine"],
                          pd.Series(DF_data.shape[0]*["cosine"], index=DF_data.index, name="data_type"),
                          DF_data["desc"]]).T.reset_index()
DF_melt_B.columns = ["idx", "t", "values", "data_type", "desc"]
# Merge
pd.concat([DF_melt_A, DF_melt_B], axis=0, ignore_index=True)
If I just do pd.melt(DF_data) I get a complete meltdown.
In response to the comments:
All right, so I had to create a similar df because I did not have access to your a variable. I changed your a variable to a list from 0 to 99, so t will be 0 to 99.
you could do this :
a = range(0, 100)
DF_data = pd.DataFrame([a, [np.sin(x) for x in a], [np.cos(x) for x in a]], index=["t", "sine", "cosine"], columns=["t_%d"%_ for _ in range(100)]).T
DF_data["desc"] = ["info about this" for _ in DF_data.index]
df = pd.melt(DF_data, id_vars=['t','desc'])
df.head(5)
this should return what you are looking for.
t desc variable value
0 0.0 info about this sine 0.000000
1 1.0 info about this sine 0.841471
2 2.0 info about this sine 0.909297
3 3.0 info about this sine 0.141120
4 4.0 info about this sine -0.756802
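To also get the data_type / values column names asked for in the question, var_name and value_name can be passed to melt, e.g. (a small sketch on the same DF_data):
df = pd.melt(DF_data, id_vars=['t', 'desc'],
             value_vars=['sine', 'cosine'],
             var_name='data_type', value_name='values')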
