I have a data frame similar to one below.
Date 20180601T32 20180604T33 20180605T32 20180610T33
2018-06-04 0.1 0.5 4.5 nan
2018-06-05 1.5 0.2 nan 0
2018-06-07 1.1 1.6 nan nan
2018-06-10 0.4 1.1 0 0.3
The values in columns '20180601', '20180604', '20180605' and '20180610' need to be coalesced into a new column.
I am using bfill as below, but it selects the first non-null value in each row.
coalesce_columns = ['20180601', '20180604', '20180605', '20180610']
df['obs'] = df[coalesce_columns].bfill(axis=1).iloc[:,0]
Instead of taking the value from the first column, the value should come from the column whose name matches the row's 'Date'. The expected output should be:
Date 20180601T32 20180604T33 20180605T32 20180610T33 Obs
2018-06-04 0.1 0.5 4.5 nan 0.5
2018-06-05 1.5 0.2 1.7 0 1.7
2018-06-07 1.1 1.6 nan nan nan
2018-06-10 0.4 1.1 0 0.3 0.3
Any suggestions?
Convert the Date column to the same format as the column names, then look up the matching column for each row:
df['Date'] = pd.to_datetime(df['Date'])
idx, cols = pd.factorize(df['Date'].dt.strftime('%Y%m%d'))
df['obs'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print (df)
Date 20180601 20180604 20180605 20180610 obs
0 2018-06-04 0.1 0.5 4.5 NaN 0.5
1 2018-06-05 1.5 0.2 1.7 0.0 1.7
2 2018-06-07 1.1 1.6 NaN NaN NaN
3 2018-06-10 0.4 1.1 0.0 0.3 0.3
If the column names might be integers, convert them to strings first:
df['Date'] = pd.to_datetime(df['Date'])
idx, cols = pd.factorize(df['Date'].dt.strftime('%Y%m%d'))
df['obs'] = df.rename(columns=str).reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print (df)
Date 20180601 20180604 20180605 20180610 obs
0 2018-06-04 0.1 0.5 4.5 NaN 0.5
1 2018-06-05 1.5 0.2 1.7 0.0 1.7
2 2018-06-07 1.1 1.6 NaN NaN NaN
3 2018-06-10 0.4 1.1 0.0 0.3 0.3
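For reference, here is a self-contained sketch of the same factorize-and-index lookup. The frame is rebuilt by hand from the printed output above, and the column names are assumed to be plain 'YYYYMMDD' strings (without the 'T32'/'T33' suffixes shown in the question).

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the printed output; column names are assumed
# to be plain 'YYYYMMDD' strings.
df = pd.DataFrame({
    'Date': ['2018-06-04', '2018-06-05', '2018-06-07', '2018-06-10'],
    '20180601': [0.1, 1.5, 1.1, 0.4],
    '20180604': [0.5, 0.2, 1.6, 1.1],
    '20180605': [4.5, 1.7, np.nan, 0.0],
    '20180610': [np.nan, 0.0, np.nan, 0.3],
})

# Format each date like a column name, then factorize:
# idx[i] is the position of row i's key within the unique keys in cols.
keys = pd.to_datetime(df['Date']).dt.strftime('%Y%m%d')
idx, cols = pd.factorize(keys)

# reindex creates an all-NaN column for any key with no matching column
# (here '20180607'), so the positional lookup below never fails.
values = df.reindex(columns=cols).to_numpy()
df['obs'] = values[np.arange(len(df)), idx]
print(df)
```

Rows whose date has no matching column simply get NaN, which is why the reindex step is used instead of selecting the columns directly.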
I have a pd Dataframe like this:
df = pd.DataFrame({'val':[0.1,0.2,0.3,None,None],'parent':[None,None,None,0,2]})
parent val
0 NaN 0.1
1 NaN 0.2
2 NaN 0.3
3 0.0 NaN
4 2.0 NaN
where parent represents an index within the pandas df. I want to create a new column that has either the value, or the value of the parent.
that would look like this:
parent val val_full
0 NaN 0.1 0.1
1 NaN 0.2 0.2
2 NaN 0.3 0.3
3 0.0 NaN 0.1
4 2.0 NaN 0.3
This is a fairly large dataframe (10k+ rows), so something efficient would be preferable. How can I do this without using something like .iterrows()?
In your case do
df['new'] = df.val
df.loc[df.new.isna(),'new'] = df.loc[df.parent.dropna().values,'val'].values
df
Out[289]:
val parent new
0 0.1 NaN 0.1
1 0.2 NaN 0.2
2 0.3 NaN 0.3
3 NaN 0.0 0.1
4 NaN 2.0 0.3
Or try fillna with replace
df['new'] = df.val.fillna(df.parent.replace(df.val))
Out[290]:
0 0.1
1 0.2
2 0.3
3 0.1
4 0.3
Name: val, dtype: float64
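A more explicit variant of the same idea, as a sketch: it assumes every non-null parent is a valid index label of a row that has its own val, and it only follows one level of parents. The names needs_parent and parent_labels are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [0.1, 0.2, 0.3, None, None],
                   'parent': [None, None, None, 0, 2]})

# Start from the row's own value.
df['val_full'] = df['val']

# Rows that have no value of their own but do point at a parent.
needs_parent = df['val'].isna() & df['parent'].notna()

# parent is stored as float because of the NaNs; cast back to integer labels.
parent_labels = df.loc[needs_parent, 'parent'].astype(int).to_numpy()

# Fetch the parents' values by label and write them in positionally.
df.loc[needs_parent, 'val_full'] = df['val'].reindex(parent_labels).to_numpy()
print(df)
```

Because the lookup is a single reindex, no Python-level loop over rows is involved.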
I have the following df1:
col1 col2 col3 col4 col5
A 3 4 1 2 1
B 2 1 2 3 1
C 2 3 4 2 1
On the other hand I have the df2:
type col1 col2 col3
j A 0.5 0.7 0.1
k B 0.2 0.3 0.9
l A 0.5 0.3 0.2
m C 0.8 0.7 0.1
n A 0.3 0.3 0.2
o B 0.1 0.7 0.3
Given the column type in df2, I would like to generate something like a pivot table:
col1 col2 col3 col4 col5
A 3 4 1 2 1
j 0.5 0.7 0.1
l 0.5 0.3 0.2
n 0.3 0.3 0.2
B 2 1 2 3 1
k 0.2 0.3 0.9
o 0.1 0.7 0.3
C 2 3 4 2 1
m 0.8 0.7 0.1
Is there a premade function in pandas I could use to append each row of df2 below its corresponding index in df1?
Sorry I have not included an attempt, but I have no idea how to approach this problem.
It seems you need MultiIndex here. You should not use NaN indices as shown in your desired result: the label lacks meaning. One idea is to use a non-letter indicator such as 0:
# set index as (type, current_index) for df2
df2 = df2.reset_index().set_index(['type', 'index']).sort_index()
# reassign index as (type, 0) for df1
df1.index = pd.MultiIndex.from_tuples([(i, 0) for i in df1.index])
# concatenate df1 and df2
res = pd.concat([df1, df2]).sort_index()
print(res)
col1 col2 col3 col4 col5
A 0 3.0 4.0 1.0 2.0 1.0
j 0.5 0.7 0.1 NaN NaN
l 0.5 0.3 0.2 NaN NaN
n 0.3 0.3 0.2 NaN NaN
B 0 2.0 1.0 2.0 3.0 1.0
k 0.2 0.3 0.9 NaN NaN
o 0.1 0.7 0.3 NaN NaN
C 0 2.0 3.0 4.0 2.0 1.0
m 0.8 0.7 0.1 NaN NaN
Using pd.merge and sort_index, specifying na_position='first':
pd.merge(df2.reset_index(),
         df1.reset_index().rename(columns={'index':'type'}),
         'outer')\
  .set_index(['type', 'index'])\
  .sort_index(na_position='first')
col1 col2 col3 col4 col5
type index
A NaN 3.0 4.0 1.0 2.0 1.0
j 0.5 0.7 0.1 NaN NaN
l 0.5 0.3 0.2 NaN NaN
n 0.3 0.3 0.2 NaN NaN
B NaN 2.0 1.0 2.0 3.0 1.0
k 0.2 0.3 0.9 NaN NaN
o 0.1 0.7 0.3 NaN NaN
C NaN 2.0 3.0 4.0 2.0 1.0
m 0.8 0.7 0.1 NaN NaN
As highlighted by @jpp, the docs for sort_index say:
na_position : {‘first’, ‘last’}, default ‘last’
first puts NaNs at the beginning, last puts NaNs at the end. Not implemented for MultiIndex.
even though it does, in fact, appear to be implemented.
However, if you are worried that this behavior could be inconsistent, an alternative is to call sort_values first and only then set the index. The sort_values docs carry no such "not implemented" warning.
pd.merge(df2.reset_index(),
         df1.reset_index().rename(columns={'index':'type'}),
         'outer')\
  .sort_values(['type', 'index'], na_position='first')\
  .set_index(['type', 'index'])
Similar to @jpp's answer:
d2 = df2.rename_axis('k').set_index('type', append=True).swaplevel(0, 1)
d1 = df1.set_index(np.zeros(len(df1), str), append=True).rename_axis(['type', 'k'])
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
pd.concat([d1, d2]).sort_index()
col1 col2 col3 col4 col5
type k
A 3.0 4.0 1.0 2.0 1.0
j 0.5 0.7 0.1 NaN NaN
l 0.5 0.3 0.2 NaN NaN
n 0.3 0.3 0.2 NaN NaN
B 2.0 1.0 2.0 3.0 1.0
k 0.2 0.3 0.9 NaN NaN
o 0.1 0.7 0.3 NaN NaN
C 2.0 3.0 4.0 2.0 1.0
m 0.8 0.7 0.1 NaN NaN
Alternate
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
pd.concat([
    df1.rename_axis('type').assign(k='').set_index('k', append=True),
    df2.rename_axis('k').set_index('type', append=True).swaplevel(0, 1),
]).sort_index()
col1 col2 col3 col4 col5
type k
A 3.0 4.0 1.0 2.0 1.0
j 0.5 0.7 0.1 NaN NaN
l 0.5 0.3 0.2 NaN NaN
n 0.3 0.3 0.2 NaN NaN
B 2.0 1.0 2.0 3.0 1.0
k 0.2 0.3 0.9 NaN NaN
o 0.1 0.7 0.3 NaN NaN
C 2.0 3.0 4.0 2.0 1.0
m 0.8 0.7 0.1 NaN NaN
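To make the MultiIndex approach reproducible, here is a self-contained sketch that rebuilds df1 and df2 from the question and interleaves them with pd.concat. Using an empty string as the second-level label for the df1 rows is a choice carried over from the answers above; the names parents and children are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[3, 4, 1, 2, 1], [2, 1, 2, 3, 1], [2, 3, 4, 2, 1]],
    index=list('ABC'),
    columns=['col1', 'col2', 'col3', 'col4', 'col5'],
)
df2 = pd.DataFrame(
    {'type': list('ABACAB'),
     'col1': [0.5, 0.2, 0.5, 0.8, 0.3, 0.1],
     'col2': [0.7, 0.3, 0.3, 0.7, 0.3, 0.7],
     'col3': [0.1, 0.9, 0.2, 0.1, 0.2, 0.3]},
    index=list('jklmno'),
)

# Give both frames the same two index levels: (type, k).
# The df1 rows get an empty-string key so they sort ahead of j, k, l, ...
parents = df1.rename_axis('type').assign(k='').set_index('k', append=True)
children = df2.rename_axis('k').set_index('type', append=True).swaplevel(0, 1)

# Stack the frames and let the lexicographic sort interleave them.
res = pd.concat([parents, children]).sort_index()
print(res)
```

The integer columns of df1 become floats in the result because col4 and col5 have to hold NaN for the df2 rows.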
Consider the following pandas dataframe (df):
index A B C D E F G weights
1 NaN 1 NaN NaN NaN 3 2 [0.6 , 0.2 , 0.2]
2 3 2 NaN 1 NaN NaN NaN [0.5 , 0.4 , 0.1]
3 NaN NaN 1 2 3 NaN NaN [0.8 , 0.1 , 0.1]
4 NaN 3 1 NaN NaN 2 NaN [0.9 , 0.1 , 0.0]
Desired output (values matched to their corresponding weights at row-level) :
1 NaN 0.6 NaN NaN NaN 0.2 0.2
2 0.1 0.4 NaN 0.5 NaN NaN NaN
3 NaN NaN 0.8 0.1 0.1 NaN NaN
4 NaN 0.0 0.9 NaN NaN 0.1 NaN
My current solution :
def assign_weights(row):
    for i in range(1,4):
        row.replace(i, row.weights[i-1], inplace=True)
    return row
df.apply(assign_weights, axis = 1)
Is there a faster way (for a big dataframe with more weights to be assigned) ?
Not sure if this will be faster, though:
>>> def worker(row):
... n = np.array(row['weights'])
... i = (row.notnull()) & (row.index != 'weights')
... row[i] = n[row[i].astype('int').values - 1]
... return row
>>>
>>> df.apply(worker, axis=1)
A B C D E F G weights
index
1 NaN 0.6 NaN NaN NaN 0.2 0.2 [0.6, 0.2, 0.2]
2 0.1 0.4 NaN 0.5 NaN NaN NaN [0.5, 0.4, 0.1]
3 NaN NaN 0.8 0.1 0.1 NaN NaN [0.8, 0.1, 0.1]
4 NaN 0.0 0.9 NaN NaN 0.1 NaN [0.9, 0.1, 0.0]
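If the row-wise apply turns out to be the bottleneck, one possible alternative is to do the lookup in NumPy for the whole frame at once. The following is a sketch, not a benchmarked solution: it assumes every weights list has the same length and that every non-null cell is a valid 1-based position into its row's weights.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[nan, 1, nan, nan, nan, 3, 2],
     [3, 2, nan, 1, nan, nan, nan],
     [nan, nan, 1, 2, 3, nan, nan],
     [nan, 3, 1, nan, nan, 2, nan]],
    index=[1, 2, 3, 4], columns=list('ABCDEFG'),
)
df['weights'] = pd.Series(
    [[0.6, 0.2, 0.2], [0.5, 0.4, 0.1], [0.8, 0.1, 0.1], [0.9, 0.1, 0.0]],
    index=df.index,
)

# One row of weights per data row: shape (n_rows, n_weights).
weights = np.array(df['weights'].tolist())

ranks = df.drop(columns='weights').to_numpy(dtype=float)
valid = ~np.isnan(ranks)

# Convert 1-based ranks to 0-based positions; NaN cells get a dummy
# position 0 that is masked out again below.
pos = np.where(valid, ranks, 1).astype(int) - 1

# For each row, pick weights[row, pos[row, col]] for every column.
looked_up = np.take_along_axis(weights, pos, axis=1)
out = pd.DataFrame(np.where(valid, looked_up, np.nan),
                   index=df.index, columns=list('ABCDEFG'))
print(out)
```

Whether this is actually faster depends on the data, so it is worth timing both versions on a realistic sample.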
I am working with an irregular df. I am trying to get rid of the initial NaNs and shift all values to the top leaving NaNs at the bottom. I want to perform a realignment of the value at the top which ignores the date.
==================================================================
STRIP col1 col2 col3 col4 col5 col6 col7 col8
01/12/2011 0.8 NaN NaN NaN NaN NaN NaN NaN
01/01/2012 0.8 0.8 NaN NaN NaN NaN NaN NaN
01/02/2012 0.8 0.8 0.78 NaN NaN NaN NaN NaN
01/03/2012 0.8 0.8 0.75 0.7 0.6 NaN NaN NaN
01/04/2012 0.7 0.7 0.73 0.7 0.6 0.6 NaN NaN
01/05/2012 0.7 0.7 0.72 0.7 0.6 0.6 0.1 NaN
01/06/2012 0.7 0.7 0.70 0.7 0.6 0.6 0.2 0.7
01/07/2012 0.7 0.7 0.69 0.7 0.6 0.6 0.3 0.9
01/08/2012 0.7 0.7 0.68 0.7 0.6 0.6 0.4 0.6
01/09/2012 0.7 0.7 0.67 0.7 0.6 0.6 0.5 0.4
02/01/2013 NaN NaN NaN NaN 0.5 0.6 0.8 0.3
03/01/2013 NaN NaN NaN NaN 0.5 0.6 0.7 0.2
===================================================================
the final DataFrame should look like the following:
STRIP col1 col2 col3 col4 col5 col6 col7 col8
01/12/2011 0.8 0.8 0.78 0.7 0.6 0.6 0.1 0.7
01/01/2012 0.8 0.8 0.75 0.7 0.6 0.6 0.2 0.9
01/02/2012 0.8 0.8 0.73 0.7 0.6 0.6 0.3 0.6
01/03/2012 0.8 0.7 0.72 0.7 0.6 0.6 0.4 0.4
01/04/2012 0.7 0.7 0.7 0.7 0.6 0.6 0.5 0.3
01/05/2012 0.7 0.7 0.69 0.7 0.6 0.6 0.8 0.2
01/06/2012 0.7 0.7 0.68 0.7 0.6 0.6 0.7 NaN
01/07/2012 0.7 0.7 0.67 NaN 0.5 0.6 NaN NaN
01/08/2012 0.7 0.7 NaN NaN 0.5 NaN NaN NaN
01/09/2012 0.7 NaN NaN NaN NaN NaN NaN NaN
02/01/2013 NaN NaN NaN NaN NaN NaN NaN NaN
03/01/2013 NaN NaN NaN NaN NaN NaN NaN NaN
You can iterate over the columns and shift each one up by the position of its first valid value, which first_valid_index and get_loc give you:
In [314]:
for col in df:
    df[col] = df[col].shift(-df.index.get_loc(df[col].first_valid_index()))
df
Out[314]:
col1 col2 col3 col4 col5 col6 col7 col8
STRIP
01/12/2011 0.8 0.8 0.78 0.7 0.6 0.6 0.1 0.7
01/01/2012 0.8 0.8 0.75 0.7 0.6 0.6 0.2 0.9
01/02/2012 0.8 0.8 0.73 0.7 0.6 0.6 0.3 0.6
01/03/2012 0.8 0.7 0.72 0.7 0.6 0.6 0.4 0.4
01/04/2012 0.7 0.7 0.70 0.7 0.6 0.6 0.5 0.3
01/05/2012 0.7 0.7 0.69 0.7 0.6 0.6 0.8 0.2
01/06/2012 0.7 0.7 0.68 0.7 0.6 0.6 0.7 NaN
01/07/2012 0.7 0.7 0.67 NaN 0.5 0.6 NaN NaN
01/08/2012 0.7 0.7 NaN NaN 0.5 NaN NaN NaN
01/09/2012 0.7 NaN NaN NaN NaN NaN NaN NaN
02/01/2013 NaN NaN NaN NaN NaN NaN NaN NaN
03/01/2013 NaN NaN NaN NaN NaN NaN NaN NaN
Another method using apply:
In [317]:
df.apply(lambda x: x.shift(-x.index.get_loc(x.first_valid_index())))
Out[317]:
col1 col2 col3 col4 col5 col6 col7 col8
STRIP
01/12/2011 0.8 0.8 0.78 0.7 0.6 0.6 0.1 0.7
01/01/2012 0.8 0.8 0.75 0.7 0.6 0.6 0.2 0.9
01/02/2012 0.8 0.8 0.73 0.7 0.6 0.6 0.3 0.6
01/03/2012 0.8 0.7 0.72 0.7 0.6 0.6 0.4 0.4
01/04/2012 0.7 0.7 0.70 0.7 0.6 0.6 0.5 0.3
01/05/2012 0.7 0.7 0.69 0.7 0.6 0.6 0.8 0.2
01/06/2012 0.7 0.7 0.68 0.7 0.6 0.6 0.7 NaN
01/07/2012 0.7 0.7 0.67 NaN 0.5 0.6 NaN NaN
01/08/2012 0.7 0.7 NaN NaN 0.5 NaN NaN NaN
01/09/2012 0.7 NaN NaN NaN NaN NaN NaN NaN
02/01/2013 NaN NaN NaN NaN NaN NaN NaN NaN
03/01/2013 NaN NaN NaN NaN NaN NaN NaN NaN
EDIT
If 'STRIP' is a column then you don't need get_loc:
In [319]:
df.apply(lambda x: x.shift(-x.first_valid_index()))
Out[319]:
STRIP col1 col2 col3 col4 col5 col6 col7 col8
0 01/12/2011 0.8 0.8 0.78 0.7 0.6 0.6 0.1 0.7
1 01/01/2012 0.8 0.8 0.75 0.7 0.6 0.6 0.2 0.9
2 01/02/2012 0.8 0.8 0.73 0.7 0.6 0.6 0.3 0.6
3 01/03/2012 0.8 0.7 0.72 0.7 0.6 0.6 0.4 0.4
4 01/04/2012 0.7 0.7 0.70 0.7 0.6 0.6 0.5 0.3
5 01/05/2012 0.7 0.7 0.69 0.7 0.6 0.6 0.8 0.2
6 01/06/2012 0.7 0.7 0.68 0.7 0.6 0.6 0.7 NaN
7 01/07/2012 0.7 0.7 0.67 NaN 0.5 0.6 NaN NaN
8 01/08/2012 0.7 0.7 NaN NaN 0.5 NaN NaN NaN
9 01/09/2012 0.7 NaN NaN NaN NaN NaN NaN NaN
10 02/01/2013 NaN NaN NaN NaN NaN NaN NaN NaN
11 03/01/2013 NaN NaN NaN NaN NaN NaN NaN NaN
I think you can just stack the valid numbers and nan's back together:
In [95]:
df2 = df.apply(lambda x: np.hstack((x[~x.isnull()], x[x.isnull()])), axis=0)
print(df2)
STRIP col1 col2 col3 col4 col5 col6 col7 col8
0 01/12/2011 0.8 0.8 0.78 0.7 0.6 0.6 0.1 0.7
1 01/01/2012 0.8 0.8 0.75 0.7 0.6 0.6 0.2 0.9
2 01/02/2012 0.8 0.8 0.73 0.7 0.6 0.6 0.3 0.6
3 01/03/2012 0.8 0.7 0.72 0.7 0.6 0.6 0.4 0.4
4 01/04/2012 0.7 0.7 0.70 0.7 0.6 0.6 0.5 0.3
5 01/05/2012 0.7 0.7 0.69 0.7 0.6 0.6 0.8 0.2
6 01/06/2012 0.7 0.7 0.68 0.7 0.6 0.6 0.7 NaN
7 01/07/2012 0.7 0.7 0.67 NaN 0.5 0.6 NaN NaN
8 01/08/2012 0.7 0.7 NaN NaN 0.5 NaN NaN NaN
9 01/09/2012 0.7 NaN NaN NaN NaN NaN NaN NaN
10 02/01/2013 NaN NaN NaN NaN NaN NaN NaN NaN
11 03/01/2013 NaN NaN NaN NaN NaN NaN NaN NaN
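The two approaches differ slightly: shifting by the first valid position only removes leading NaNs, while stacking valid values on top of the NaNs also closes any gaps in the middle of a column. Below is a small sketch of the second behavior on a made-up three-column frame; the helper name push_up and the sample values are illustrative only.

```python
import numpy as np
import pandas as pd

# Small made-up frame with leading NaNs, in the spirit of the question.
df = pd.DataFrame(
    {'col1': [0.8, 0.7, np.nan],
     'col2': [np.nan, 0.8, 0.7],
     'col3': [np.nan, np.nan, 0.78]},
    index=['01/12/2011', '01/01/2012', '01/02/2012'],
)

def push_up(col):
    """Move the non-null values of a column to the top, NaNs to the bottom."""
    vals = col.dropna().to_numpy()
    out = np.full(len(col), np.nan)
    out[:len(vals)] = vals
    return pd.Series(out, index=col.index)

shifted = df.apply(push_up)
print(shifted)
```

This version also copes with a column that is entirely NaN, since it never needs to locate a first valid index.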
I've got two input dataframes
df1 (note, this DF could have more columns of data)
Sample Animal Time Sex
0 1 A one male
1 2 A two male
2 3 B one female
3 4 C one male
4 5 D one female
and df2
a b c
Sample
1 0.2 0.4 0.3
2 0.5 0.7 0.2
3 0.4 0.1 0.9
4 0.4 0.2 0.3
5 0.6 0.2 0.4
and I'd like to combine them so that I get the following:
one_a one_b one_c two_a two_b two_c Sex
Animal
A 0.2 0.4 0.3 0.5 0.7 0.2 male
B 0.4 0.1 0.9 NaN NaN NaN female
C 0.4 0.2 0.3 NaN NaN NaN male
D 0.6 0.2 0.4 NaN NaN NaN female
This is how I'm doing things:
df2.reset_index(inplace = True)
df3 = pd.melt(df2, id_vars=['Sample'], value_vars=list(cols))
df4 = pd.merge(df3, df1, on='Sample')
df4['moo'] = df4['Group'] + '_' + df4['variable']
df5 = pd.pivot_table(df4, values='value', index='Animal', columns='moo')
df6 = df1.groupby('Animal').agg('first')
pd.concat([df5, df6], axis=1).drop('Sample',1).drop('Group',1)
This works just fine, but could potentially be slow for large datasets. I'm wondering if any pandas pros see a better (read: faster, more efficient) way? I'm new to pandas and can imagine there are some shortcuts here that I don't know about.
A few steps here. The key is that, in order to create columns like one_a, one_b, ..., two_c, we need to add the Time column to the Sample index to build a multi-level index and then unstack it to get the required form. Then a groupby on the Animal index level is required to aggregate the rows and reduce the number of NaNs. The rest is just formatting.
import pandas as pd
# your data
# ==============================
# set index
df1 = df1.set_index('Sample')
print(df1)
Animal Time Sex
Sample
1 A one male
2 A two male
3 B one female
4 C one male
5 D one female
print(df2)
a b c
Sample
1 0.2 0.4 0.3
2 0.5 0.7 0.2
3 0.4 0.1 0.9
4 0.4 0.2 0.3
5 0.6 0.2 0.4
# processing
# =============================
df = df1.join(df2)
df_temp = df.set_index(['Animal', 'Sex','Time'], append=True).unstack()
print(df_temp)
a b c
Time one two one two one two
Sample Animal Sex
1 A male 0.2 NaN 0.4 NaN 0.3 NaN
2 A male NaN 0.5 NaN 0.7 NaN 0.2
3 B female 0.4 NaN 0.1 NaN 0.9 NaN
4 C male 0.4 NaN 0.2 NaN 0.3 NaN
5 D female 0.6 NaN 0.2 NaN 0.4 NaN
# rename the columns if you wish
df_temp.columns = ['{}_{}'.format(x, y) for x, y in zip(df_temp.columns.get_level_values(1), df_temp.columns.get_level_values(0))]
print(df_temp)
one_a two_a one_b two_b one_c two_c
Sample Animal Sex
1 A male 0.2 NaN 0.4 NaN 0.3 NaN
2 A male NaN 0.5 NaN 0.7 NaN 0.2
3 B female 0.4 NaN 0.1 NaN 0.9 NaN
4 C male 0.4 NaN 0.2 NaN 0.3 NaN
5 D female 0.6 NaN 0.2 NaN 0.4 NaN
result = df_temp.reset_index('Sex').groupby(level='Animal').agg('max').sort_index(axis=1)
print(result)
Sex one_a one_b one_c two_a two_b two_c
Animal
A male 0.2 0.4 0.3 0.5 0.7 0.2
B female 0.4 0.1 0.9 NaN NaN NaN
C male 0.4 0.2 0.3 NaN NaN NaN
D female 0.6 0.2 0.4 NaN NaN NaN
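A shorter variant of the same unstack idea, as a sketch: it assumes each (Animal, Time) pair occurs at most once, so the unstack needs no aggregation step, and it takes Sex from the first row per animal. The intermediate names merged, wide and sex are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Sample': [1, 2, 3, 4, 5],
    'Animal': list('AABCD'),
    'Time': ['one', 'two', 'one', 'one', 'one'],
    'Sex': ['male', 'male', 'female', 'male', 'female'],
})
df2 = pd.DataFrame(
    {'a': [0.2, 0.5, 0.4, 0.4, 0.6],
     'b': [0.4, 0.7, 0.1, 0.2, 0.2],
     'c': [0.3, 0.2, 0.9, 0.3, 0.4]},
    index=pd.Index([1, 2, 3, 4, 5], name='Sample'),
)

# Attach the measurements to the sample metadata.
merged = df1.join(df2, on='Sample')

# One row per Animal, one column per (variable, Time) pair.
# Assumes (Animal, Time) is unique; duplicates would make unstack raise.
wide = merged.set_index(['Animal', 'Time'])[['a', 'b', 'c']].unstack('Time')

# Flatten the two column levels into names like 'one_a'.
wide.columns = [f'{time}_{var}' for var, time in wide.columns]
wide = wide.sort_index(axis=1)

# Sex is constant per animal in the sample data, so the first value is enough.
sex = merged.groupby('Animal')['Sex'].first()
result = wide.join(sex)
print(result)
```

Any additional per-animal columns in df1 could be carried over the same way as Sex, via the groupby on Animal.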