Dataframe:
0 1 2 3 4 slicing
0 NaN Object 1 NaN NaN 0
6 NaN Object 2 NaN NaN 6
12 NaN Object 3 NaN NaN 12
18 NaN Object 4 NaN NaN 18
23 NaN Object 5 NaN NaN 23
desired output:
0 1 2 3 4 slicing
0 NaN Object1 NaN NaN NaN 0
6 NaN Object2 NaN NaN NaN 6
12 NaN Object3 NaN NaN NaN 12
18 NaN Object4 NaN NaN NaN 18
23 NaN Object5 NaN NaN NaN 23
Library: pandas
Iterate through each row in the dataset (it contains only NaNs and the string 'Object' with its corresponding number as a string, '1' to '10'), replace the string numbers with NaN, and concatenate the data within the same row.
Code for now:
df= df[df.apply(lambda row: row.astype(str).str.contains('Desk').any().df[row]+df[row], axis=1)]
Index 0 1 2 3 4
0 NaN Desk 1 NaN NaN
5 NaN Desk 2 NaN NaN
10 NaN Desk 3 NaN NaN
15 NaN Desk 4 NaN NaN
20 NaN Desk 5 NaN NaN
Here's what I did:
Using the following dataframe as an example:
0 1 2 3 4 slicing
index
0 NaN Object 1 NaN NaN 0
6 NaN Object 2 NaN A 6
12 NaN Object 3 NaN NaN 12
18 NaN NaN 4 NaN NaN 18
23 Stuff Object NaN 5 NaN 23
I perform four steps in the four lines of code below, applied where 'Object' exists in column 1: 1) replace NaNs with empty strings; 2) cast everything to string type; 3) join the row into column 1; 4) replace all the other columns with NaN.
df.loc[df['1']=='Object',['0', '2', '3','4']] = df.loc[df['1']=='Object',['0', '2', '3','4']].fillna('')
df.loc[df['1']=='Object',['0','1', '2', '3','4']] = df.loc[df['1']=='Object',['0','1', '2', '3','4']].astype(str)
df.loc[df['1']=='Object', ['1','0', '2', '3','4']] = df.loc[df['1']=='Object', ['1', '0', '2', '3','4']].agg(''.join, axis=1)
df.loc[df['1'].str.contains('Object', na = False), ['0', '2', '3','4']] = np.nan
df
0 1 2 3 4 slicing
index
0 NaN Object1 NaN NaN NaN 0
6 NaN Object2A NaN NaN NaN 6
12 NaN Object3 NaN NaN NaN 12
18 NaN NaN 4 NaN NaN 18
23 NaN ObjectStuff5 NaN NaN NaN 23
If I understand what you are trying to achieve, you should really try to work with columns instead of iterating; it is much faster. You can try something like this:
import numpy as np
columns = df.columns.tolist()
ix = df[df[columns[1]].str.contains('Object', na=False)].index
df.loc[ix, columns[1]] = df.loc[ix, columns[1]] + df.loc[ix, columns[2]]
df.loc[ix, columns[2]] = np.nan
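For reference, here is a self-contained run of that column-wise approach on the frame from the question (a sketch; the frame is rebuilt from the printed example, with the numbers in column 2 kept as strings, as the question states):
import numpy as np
import pandas as pd

# rebuild the question's frame; column labels 0-4 plus 'slicing', index as shown
df = pd.DataFrame({0: [np.nan] * 5,
                   1: ['Object'] * 5,
                   2: ['1', '2', '3', '4', '5'],
                   3: [np.nan] * 5,
                   4: [np.nan] * 5,
                   'slicing': [0, 6, 12, 18, 23]},
                  index=[0, 6, 12, 18, 23])

columns = df.columns.tolist()
ix = df[df[columns[1]].str.contains('Object', na=False)].index
df.loc[ix, columns[1]] = df.loc[ix, columns[1]] + df.loc[ix, columns[2]]
df.loc[ix, columns[2]] = np.nan
print(df)   # column 1 becomes Object1 ... Object5, column 2 becomes NaN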
I've reviewed several posts on here about better ways to loop through dataframes, but can't seem to figure out how to apply them to my specific situation.
I have a dataframe of about 2M rows, and I need to calculate six statistics for each row, one per column. There are 3 columns, so 18 in total. However, the catch is that I need to compute those stats from a fresh sample of the dataframe for each row, so the mean, median, etc. differ per row.
Here's what I have so far:
r = 0
for i in imputed_df.iterrows():
    t = imputed_df.sample(n=10)
    for (columnName) in cols:
        imputed_df.loc[r, columnName + '_mean'] = t[columnName].mean()
        imputed_df.loc[r, columnName + '_var'] = t[columnName].var()
        imputed_df.loc[r, columnName + '_std'] = t[columnName].std()
        imputed_df.loc[r, columnName + '_skew'] = t[columnName].skew()
        imputed_df.loc[r, columnName + '_kurt'] = t[columnName].kurt()
        imputed_df.loc[r, columnName + '_med'] = t[columnName].median()
But this has been running for two days without finishing. I tried to take a subset of 2000 rows from the original dataframe and even that one has been running for hours.
Is there a better way to do this?
EDIT: Added a sample dataset of what it should look like. Each suffixed column should hold the statistic calculated from that row's subset of 10 rows.
timestamp activityID w2 w3 w4
0 41.21 1.0 -1.34587 9.57245 2.83571
1 41.22 1.0 -1.76211 10.63590 2.59496
2 41.23 1.0 -2.45116 11.09340 2.23671
3 41.24 1.0 -2.42381 11.88590 1.77260
4 41.25 1.0 -2.31581 12.45170 1.50289
The problem is that you perform the operation for each column with unnecessary loops.
We can use DataFrame.agg with DataFrame.unstack and Series.set_axis to get the correct column names.
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, (10, 100))).add_prefix('col')
new_serie = df.agg(['sum', 'mean', 'var', 'std',
                    'skew', 'kurt', 'median']).unstack()
new_df = pd.concat([df,
                    new_serie.set_axis([f'{x}_{y}' for x, y in new_serie.index])
                             .to_frame().T],
                   axis=1)
# if new_df already exists:
# new_df.loc[0, :] = new_serie.set_axis([f'{x}_{y}' for x, y in new_serie.index])
col0 col1 col2 col3 col4 col5 col6 col7 col8 col9 ... \
0 8 7 6 7 6 5 8 7 8 4 ...
1 8 1 8 7 0 8 8 4 6 1 ...
2 5 6 3 5 4 9 3 0 2 5 ...
3 3 3 3 3 5 4 5 1 3 5 ...
4 7 9 4 5 6 7 0 3 4 6 ...
5 0 5 2 0 8 0 3 7 6 5 ...
6 7 0 1 4 8 9 4 9 2 9 ...
7 0 6 1 0 6 1 3 0 3 4 ...
8 3 6 1 8 3 0 7 6 8 6 ...
9 2 5 8 5 8 4 9 1 9 9 ...
col98_skew col98_kurt col98_median col99_sum col99_mean col99_var \
0 0.456435 -0.939607 3.0 39.0 3.9 6.322222
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
col99_std col99_skew col99_kurt col99_median
0 2.514403 0.402601 1.099343 4.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
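Note that the agg approach above computes each statistic once over the whole frame. If the requirement really is a fresh 10-row sample per row, as described in the question, one way to avoid the cell-by-cell .loc writes is to draw all sample indices up front and compute the statistics column-wise with NumPy/SciPy. A rough sketch under those assumptions (SciPy is an extra dependency here, sampling is done with replacement for simplicity, unlike df.sample, and the frame below is just the question's example data):
import numpy as np
import pandas as pd
from scipy import stats

imputed_df = pd.DataFrame({'timestamp': [41.21, 41.22, 41.23, 41.24, 41.25],
                           'activityID': [1.0] * 5,
                           'w2': [-1.34587, -1.76211, -2.45116, -2.42381, -2.31581],
                           'w3': [9.57245, 10.63590, 11.09340, 11.88590, 12.45170],
                           'w4': [2.83571, 2.59496, 2.23671, 1.77260, 1.50289]})
cols = ['w2', 'w3', 'w4']

n = len(imputed_df)
sample_idx = np.random.randint(0, n, size=(n, 10))   # one 10-row sample per row, with replacement

for col in cols:
    vals = imputed_df[col].to_numpy()[sample_idx]     # shape (n, 10): one sample per row
    imputed_df[col + '_mean'] = vals.mean(axis=1)
    imputed_df[col + '_var'] = vals.var(axis=1, ddof=1)
    imputed_df[col + '_std'] = vals.std(axis=1, ddof=1)
    imputed_df[col + '_skew'] = stats.skew(vals, axis=1, bias=False)
    imputed_df[col + '_kurt'] = stats.kurtosis(vals, axis=1, bias=False)
    imputed_df[col + '_med'] = np.median(vals, axis=1)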
If I have a pandas dataframe like this:
2 3 4 NaN NaN NaN
1 NaN NaN NaN NaN NaN
5 6 7 2 3 NaN
4 3 NaN NaN NaN NaN
and an array for the number I would like to shift:
array = [2, 4, 0, 3]
How do I iterate through each row to shift the columns by the number in my array to get something like this:
NaN NaN 2 3 4 NaN
NaN NaN NaN NaN 1 NaN
5 6 7 2 3 NaN
NaN NaN NaN 3 4 NaN
I was trying to do something like this but had no luck.
df = pd.DataFrame(values)
for rows in df.iterrows():
    df[rows] = df.shift[change_in_bins[rows]]
Use a for loop with loc and shift:
for index, value in enumerate([2, 4, 0, 3]):
    df.loc[index, :] = df.loc[index, :].shift(value)
print(df)
0 1 2 3 4 5
0 NaN NaN 2.0 3.0 4.0 NaN
1 NaN NaN NaN NaN 1.0 NaN
2 5.0 6.0 7.0 2.0 3.0 NaN
3 NaN NaN NaN 4.0 3.0 NaN
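A small caveat: enumerate assumes the row labels are the default 0..n-1 RangeIndex. If the index is something else, you can zip the index with the shift amounts instead (same idea, just a sketch):
for idx, value in zip(df.index, [2, 4, 0, 3]):
    df.loc[idx] = df.loc[idx].shift(value)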
So I have two dataframes
eqdf
symbol qty
0 DABIND 1
1 INFTEC 6
2 DISHTV 8
3 HINDAL 40
4 NATMIN 5
5 POWGRI 40
6 CHEPET 6
premdf
share strike lprice premperc d_strike
0 HINDAL 250.0 237.90 1.975620 5.086171
1 RELIND 1280.0 1254.30 1.642350 2.048952
2 POWGRI 205.0 201.15 1.118568 1.913995
I want to compare premdf['share'] with eqdf['symbol'], and where there is a match, append the premperc, d_strike, and strike values to the end of the matching eqdf row.
I have tried
eqdf.loc[eqdf['symbol']==premdf['share'],eqdf['premperc'] == premdf['premperc']]
I keep getting errors
ValueError: Can only compare identically-labeled Series objects
Expected Output:
eqdf
symbol qty premperc d_strike strike
0 DABIND 1 NaN NaN NaN
1 INFTEC 6 NaN NaN NaN
2 DISHTV 8 NaN NaN NaN
3 HINDAL 40 1.975620 5.086171 250.0
4 NATMIN 5 NaN NaN NaN
5 POWGRI 40 1.118568 1.913995 205.0
6 CHEPET 6 NaN NaN NaN
What is the correct way to do this?
Thanks
rename and merge
eqdf.merge(premdf.rename(columns={'share': 'symbol'}), 'left')
symbol qty strike lprice premperc d_strike
0 DABIND 1 NaN NaN NaN NaN
1 INFTEC 6 NaN NaN NaN NaN
2 DISHTV 8 NaN NaN NaN NaN
3 HINDAL 40 250.0 237.90 1.975620 5.086171
4 NATMIN 5 NaN NaN NaN NaN
5 POWGRI 40 205.0 201.15 1.118568 1.913995
6 CHEPET 6 NaN NaN NaN NaN
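If you also want to match the expected output exactly (without lprice, and with the premperc, d_strike, strike order), a small follow-up on the merged result (a sketch, assuming the frames above):
out = eqdf.merge(premdf.rename(columns={'share': 'symbol'}), how='left')
out = out.drop(columns='lprice')[['symbol', 'qty', 'premperc', 'd_strike', 'strike']]
print(out)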
Hi I have the following dataframe
z a b c
a 1 NaN NaN
ss NaN 2 NaN
cc 3 NaN NaN
aa NaN 4 NaN
ww NaN 5 NaN
ss NaN NaN 6
aa NaN NaN 7
g NaN NaN 8
j 9 NaN NaN
I would like to create a new column d to do something like this
z a b c d
a 1 NaN NaN 1
ss NaN 2 NaN 2
cc 3 NaN NaN 3
aa NaN 4 NaN 4
ww NaN 5 NaN 5
ss NaN NaN 6 6
aa NaN NaN 7 7
g NaN NaN 8 8
j 9 NaN NaN 9
The numbers are not integers; they are np.float64. The integers are just for a clear example; you may assume the numbers look like 32065431243556.62 or 763835218962767.8. Thank you for your help.
We can replace the NaNs with 0 and sum up the rows.
df['d'] = df[['a', 'b', 'c']].fillna(0).sum(axis=1)
In fact, it's not necessary to use fillna; sum skips the NaN elements automatically (skipna=True).
I'm a Python newcomer as well, and I suggest you read the pandas cookbook first.
The code is:
df['Total']=df[['a','b','c']].sum(axis=1).astype(int)
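For reference, a self-contained run of the sum approach on the question's frame (values reconstructed from the example above; the string column z is excluded from the sum, and the result stays float, matching the note about np.float64):
import numpy as np
import pandas as pd

# rebuild the question's frame from the example above
df = pd.DataFrame({'z': ['a', 'ss', 'cc', 'aa', 'ww', 'ss', 'aa', 'g', 'j'],
                   'a': [1, np.nan, 3, np.nan, np.nan, np.nan, np.nan, np.nan, 9],
                   'b': [np.nan, 2, np.nan, 4, 5, np.nan, np.nan, np.nan, np.nan],
                   'c': [np.nan, np.nan, np.nan, np.nan, np.nan, 6, 7, 8, np.nan]})

df['d'] = df[['a', 'b', 'c']].sum(axis=1)   # NaN values are skipped by default
print(df)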
You can use pd.DataFrame.ffill over axis=1:
df['D'] = df.ffill(axis=1).iloc[:, -1].astype(int)
print(df)
a b c D
0 1.0 NaN NaN 1
1 NaN 2.0 NaN 2
2 3.0 NaN NaN 3
3 NaN 4.0 NaN 4
4 NaN 5.0 NaN 5
5 NaN NaN 6.0 6
6 NaN NaN 7.0 7
7 NaN NaN 8.0 8
8 9.0 NaN NaN 9
Of course, if you have float values, int conversion is not required.
If there is only one value per row, as in the given example, you can drop the NaNs in each row and assign the remaining value to column d (note you need to restrict this to the value columns and pull out the single remaining element):
df['d'] = df[['a', 'b', 'c']].apply(lambda row: row.dropna().iloc[0], axis=1)
How can I combine this line into a pandas operation that drops the columns whose missing rate is over 90%?
This line shows every column and its missing rate:
percentage = (LoanStats_securev1_2018Q1.isnull().sum()/LoanStats_securev1_2018Q1.isnull().count()*100).sort_values(ascending = False)
Someone familiar with pandas please kindly help.
You can use dropna with a threshold:
newdf = df.dropna(axis=1, thresh=len(df)*0.1)
axis=1 targets columns, and thresh is the minimum number of non-NA values required to keep a column; requiring at least 10% non-NA values drops the columns that are more than 90% missing.
I think you need boolean indexing with the mean of a boolean mask:
df = df.loc[:, df.isnull().mean() < .9]
Sample:
np.random.seed(2018)
df = pd.DataFrame(np.random.randn(20,3), columns=list('ABC'))
df.iloc[3:8,0] = np.nan
df.iloc[:-1,1] = np.nan
df.iloc[1:,2] = np.nan
print (df)
A B C
0 -0.276768 NaN 2.148399
1 -1.279487 NaN NaN
2 -0.142790 NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 -0.172797 NaN NaN
9 -1.604543 NaN NaN
10 -0.276501 NaN NaN
11 0.704780 NaN NaN
12 0.138125 NaN NaN
13 1.072796 NaN NaN
14 -0.803375 NaN NaN
15 0.047084 NaN NaN
16 -0.013434 NaN NaN
17 -1.580231 NaN NaN
18 -0.851835 NaN NaN
19 -0.148534 0.133759 NaN
print(df.isnull().mean())
A 0.25
B 0.95
C 0.95
dtype: float64
df = df.loc[:, df.isnull().mean() < .9]
print (df)
A
0 -0.276768
1 -1.279487
2 -0.142790
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 -0.172797
9 -1.604543
10 -0.276501
11 0.704780
12 0.138125
13 1.072796
14 -0.803375
15 0.047084
16 -0.013434
17 -1.580231
18 -0.851835
19 -0.148534
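For comparison, the dropna approach from the first answer gives the same result on this sample once the threshold is expressed as a minimum non-NA count (10% of 20 rows = 2); note the boundary differs slightly when a column is missing exactly 90% of its values:
import numpy as np
import pandas as pd

np.random.seed(2018)
df = pd.DataFrame(np.random.randn(20, 3), columns=list('ABC'))
df.iloc[3:8, 0] = np.nan
df.iloc[:-1, 1] = np.nan
df.iloc[1:, 2] = np.nan

df2 = df.dropna(axis=1, thresh=int(len(df) * 0.1))   # keep columns with at least 2 non-NA values
print(df2.columns.tolist())   # ['A']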