retrieve list element from pandas dataframe column - python

Assuming there is a pandas.DataFrame like:
pd.DataFrame([[np.nan, np.nan], [[1, 2], [3, 4]], [[11, 22], [33, 44]]], columns=['A', 'B'])
What's the easiest way to produce two pandas.DataFrames, one containing the 1st element and the other the 2nd element of every list value in the frame (NaN wherever the cell is NaN)?
pd.DataFrame([[np.nan,np.nan],[1,3],[11,33]],columns=['A','B'])
pd.DataFrame([[np.nan,np.nan],[2,4],[22,44]],columns=['A','B'])

You can use:
# replace NaN with [] - a bit of a hack
df = df.mask(df.isnull(), pd.Series([[]] * len(df.columns), index=df.columns), axis=1)
print (df)
          A         B
0        []        []
1    [1, 2]    [3, 4]
2  [11, 22]  [33, 44]
# create a new DataFrame from each column, then concatenate them together
df3 = pd.concat([pd.DataFrame(df[col].values.tolist()) for col in df],
                axis=1,
                keys=df.columns)
print (df3)
      A           B
      0     1     0     1
0   NaN   NaN   NaN   NaN
1   1.0   2.0   3.0   4.0
2  11.0  22.0  33.0  44.0
# select by xs
df1 = df3.xs(0, level=1, axis=1)
print (df1)
A B
0 NaN NaN
1 1.0 3.0
2 11.0 33.0
df2 = df3.xs(1, level=1, axis=1)
print (df2)
A B
0 NaN NaN
1 2.0 4.0
2 22.0 44.0

You can do what you need with a helper that returns the n-th element of each cell in a row.
Code:
def row_element(elem_num):
    def func(row):
        ret = []
        for item in row:
            try:
                ret.append(item[elem_num])
            except TypeError:
                # NaN is a float and cannot be indexed; keep it unchanged
                ret.append(item)
        return ret
    return func
Test Code:
df = pd.DataFrame(
    [[np.nan, np.nan], [[1, 2], [3, 4]], [[11, 22], [33, 44]]],
    columns=['A', 'B'])
print(df)
print(df.apply(row_element(0), axis=1))
print(df.apply(row_element(1), axis=1))
Results:
A B
0 NaN NaN
1 [1, 2] [3, 4]
2 [11, 22] [33, 44]
A B
0 NaN NaN
1 1.0 3.0
2 11.0 33.0
A B
0 NaN NaN
1 2.0 4.0
2 22.0 44.0
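A shorter alternative (a sketch, not taken from the answers above): the pandas .str accessor also does positional indexing on list-valued cells and returns NaN for NaN cells, so each position can be pulled out column by column:
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, np.nan], [[1, 2], [3, 4]], [[11, 22], [33, 44]]],
                  columns=['A', 'B'])

# .str[i] indexes into each list element; NaN cells stay NaN
df1 = df.apply(lambda col: col.str[0])
df2 = df.apply(lambda col: col.str[1])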

Related

Add new values to a dataframe from a specific index?

In an existing dataframe, how can I add a column of new values that starts at a specific index, extending the dataframe's index as needed?
As in this example: place the new values starting at index 2 and running through index 6:
Dataframe:
df = pd.DataFrame({
    'Col1': [1, 1, 1, 1],
    'Col2': [2, 2, 2, 2]
})
df
Output:
Col1 Col2
0 1 2
1 1 2
2 1 2
3 1 2
New values:
new_values = [3, 3, 3, 3, 3]
Desired Result:
Col1 Col2 Col3
0 1 2 NaN
1 1 2 NaN
2 1 2 3
3 1 2 3
4 NaN NaN 3
5 NaN NaN 3
6 NaN NaN 3
First pad the new list with as many NaN values as the offset you want.
Then do a concat.
You can set the Series name when you concatenate it, and that becomes the new column name.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Col1': [1, 1, 1, 1],
    'Col2': [2, 2, 2, 2]
})

new_values = [3, 3, 3, 3, 3]
offset = 2  # set your offset here
new_values = [np.nan] * offset + new_values  # looks like [nan, nan, 3, 3, ...]
new = pd.concat([df, pd.Series(new_values).rename('Col3')], axis=1)
new looks like this,
Col1 Col2 Col3
0 1.0 2.0 NaN
1 1.0 2.0 NaN
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 NaN NaN 3.0
5 NaN NaN 3.0
6 NaN NaN 3.0
The answer by @anarchy works nicely; here is a similar approach:
df = pd.DataFrame({
    'Col1': [1, 1, 1, 1],
    'Col2': [2, 2, 2, 2]
})

specific_index_offset = 2
new_data = [3, 3, 3, 3, 3]

new_df = pd.DataFrame(
    data=[[e] for e in new_data],
    columns=['Col3'],
    index=range(specific_index_offset, specific_index_offset + len(new_data))
)
desired_df = pd.concat([df, new_df], axis=1)
You can also do this at the list level:
n = 2  # offset input here
xs = [None] * n
new_values = [3, 3, 3, 3, 3]
new = xs + new_values

# create new df
df2 = pd.DataFrame({'col3': new})
df_final = pd.concat([df, df2], axis=1)
print(df_final)
Output:
Col1 Col2 col3
0 1.0 2.0 NaN
1 1.0 2.0 NaN
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 NaN NaN 3.0
5 NaN NaN 3.0
6 NaN NaN 3.0
Another option is to create a Series with the specific index, before concatenating:
new_values = [3, 3, 3, 3, 3]
arr = pd.Series(new_values, index=range(2, 7), name='Col3')
pd.concat([df, arr], axis=1)
Col1 Col2 Col3
0 1.0 2.0 NaN
1 1.0 2.0 NaN
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 NaN NaN 3.0
5 NaN NaN 3.0
6 NaN NaN 3.0
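A one-call variant of the same idea (a sketch, not from the answers above): DataFrame.join with how='outer' aligns a named Series on the index, extending the frame as needed:
import pandas as pd

df = pd.DataFrame({'Col1': [1, 1, 1, 1], 'Col2': [2, 2, 2, 2]})
new_values = [3, 3, 3, 3, 3]

# a named Series whose index starts at the desired offset
col3 = pd.Series(new_values, index=range(2, 2 + len(new_values)), name='Col3')
result = df.join(col3, how='outer')  # the outer join extends the index to 0..6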

Boolean indexing in Pandas DataFrame with MultiIndex columns

I have a DataFrame with MultiIndex columns:
import numpy as np
import pandas as pd
columns = pd.MultiIndex.from_arrays([['n1', 'n1', 'n2', 'n2'], ['p', 'm', 'p', 'm']])
values = [
    [1, 2, 3, 4],
    [np.nan, 6, 7, 8],
    [np.nan, 10, np.nan, 12],
]
df = pd.DataFrame(values, columns=columns)
    n1       n2
     p   m    p   m
0  1.0   2  3.0   4
1  NaN   6  7.0   8
2  NaN  10  NaN  12
Now I want to set m to NaN whenever p is NaN. Here's the result I'm looking for:
    n1        n2
     p    m    p    m
0  1.0  2.0  3.0  4.0
1  NaN  NaN  7.0  8.0
2  NaN  NaN  NaN  NaN
I know how to find out where p is NaN, for example using
mask = df.xs('p', level=1, axis=1).isnull()
      n1     n2
0  False  False
1   True  False
2   True   True
However, I don't know how to use this mask to set the corresponding m values in df to NaN.
You can use pd.IndexSlice to select the p columns on level 1, build an array that holds True where p is present and NaN where it is missing, and multiply the m columns by it. Multiplying by NaN propagates NaN, while multiplying by True (1.0) leaves the value unchanged:
x = df.loc[:, pd.IndexSlice[:, 'p']].notna().replace({False: float('nan')}).values
df.loc[:, pd.IndexSlice[:, 'm']] *= x
    n1       n2
     p    m   p    m
0  1.0    2  3.0    4
1  NaN  NaN  7.0    8
2  NaN  NaN  NaN  NaN
You can stack and unstack the transposed dataframe so that p and m become ordinary columns that are easy to select and change, then stack, unstack, and transpose again to get the original shape back:
df = df.T.stack(dropna=False).unstack(level=1)
df.loc[df['p'].isna(), 'm'] = np.nan
df = df.stack(dropna=False).unstack(1).T
After the first line, df is:
         m    p
n1 0   2.0  1.0
   1   6.0  NaN
   2  10.0  NaN
n2 0   4.0  3.0
   1   8.0  7.0
   2  12.0  NaN
And after the last:
    n1        n2
     m    p    m    p
0  2.0  1.0  4.0  3.0
1  NaN  NaN  8.0  7.0
2  NaN  NaN  NaN  NaN
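Since the question already computes the mask via xs, it can also be applied directly: the mask's underlying ndarray has the same shape as the m selection, so DataFrame.mask lines them up by position. A minimal sketch, assuming the df from the question:
mask = df.xs('p', level=1, axis=1).isnull()
# NaN out each 'm' wherever the same-position 'p' is NaN
df.loc[:, pd.IndexSlice[:, 'm']] = df.loc[:, pd.IndexSlice[:, 'm']].mask(mask.values)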

Combine two rows in Dataframe using a Unique Value

I converted a list into a DataFrame and now my data looks like this.
I want to use the unique Business ID to merge the two rows for each ID in this DataFrame. How can I do this?
Use first in a groupby to get the first non-null value in each group.
Consider the data frame df
df = pd.DataFrame(dict(
    Bars=[np.nan, 1, 1, np.nan],
    BusID=list('AABB'),
    Nightlife=[1, np.nan, np.nan, 1]
))
df
Bars BusID Nightlife
0 NaN A 1.0
1 1.0 A NaN
2 1.0 B NaN
3 NaN B 1.0
Then
df.groupby('BusID', as_index=False).first()
BusID Bars Nightlife
0 A 1.0 1.0
1 B 1.0 1.0
You could use something like df.groupby('Business ID').sum(). As an example:
df = pd.DataFrame(data={'a': [1, 2, 3, 1],
                        'b': [5, 6, None, None],
                        'c': [None, None, 7, 8]})
df
# a b c
# 0 1 5.0 NaN
# 1 2 6.0 NaN
# 2 3 NaN 7.0
# 3 1 NaN 8.0
new_df = df.groupby('a').sum()
new_df
# b c
# a
# 1 5.0 8.0
# 2 6.0 0.0
# 3 0.0 7.0
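One caveat, visible in the output above: sum() turns a group with no values at all into 0.0. If you would rather keep NaN in that case, groupby's sum accepts a min_count parameter. A minimal sketch against the same df:
# require at least one non-NA value per group, otherwise the result is NaN
new_df = df.groupby('a').sum(min_count=1)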

pandas combine_first but always overwrite

Can the following be improved upon?
It achieves the desired result of copying values from df2 to df1 wherever the index can be matched, but it seems inefficient and clunky.
df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]],
                   index=pd.MultiIndex.from_tuples(['AB', 'AC']),
                   columns=['X', 'Y', 'Z'])
df2 = pd.DataFrame([102, 103],
                   index=pd.MultiIndex.from_tuples(['AC', 'AD']),
                   columns=['Y'])
desired = df2.combine_first(df1).combine_first(df2)
print(df1)
print(df2)
print(desired)
Output:
df1
     X  Y  Z
A B  0  1  2
  C  3  4  5
df2
       Y
A C  102
  D  103
desired
       X    Y    Z
A B    0    1    2
  C    3  102    5
  D  NaN  103  NaN
The closest I could come to using slicing was
print(df1.loc[df2.index, df2.columns]) # This works, demonstrated lhs of below is OK
df1.loc[df2.index, df2.columns] = df2 # This fails, as does df2.values
Why not use merge?
>>> df3 = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
>>> df3
       X  Y_x    Z    Y_y
A B  0.0  1.0  2.0    NaN
  C  3.0  4.0  5.0  102.0
  D  NaN  NaN  NaN  103.0
>>> df3['Y'] = df3['Y_y'].combine_first(df3['Y_x'])
>>> df3.drop(['Y_x', 'Y_y'], axis=1)
       X    Z      Y
A B  0.0  2.0    1.0
  C  3.0  5.0  102.0
  D  NaN  NaN  103.0
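Another way to express "combine_first but always overwrite" (a sketch, not from the answer above): reindex df1 to the union of both indexes, then let DataFrame.update overwrite with df2's non-NaN values:
# grow df1 to cover df2's rows, then let df2's values win
desired = df1.reindex(df1.index.union(df2.index))
desired.update(df2)  # in-place: overwrites Y at (A, C) and (A, D)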

Replace Nulls in DataFrame with Max in Row

Is there a way (more efficient than a for loop) to replace all the nulls in a pandas DataFrame with the max value of their respective row?
I guess that is what you are looking for:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 0], 'b': [3, 0, 10], 'c':[0, 5, 34]})
a b c
0 1 3 0
1 2 0 5
2 0 10 34
You can use apply to iterate over all rows and replace 0 with the row's maximum via the replace function, which gives the expected output:
df.apply(lambda row: row.replace(0, max(row)), axis=1)
a b c
0 1 3 3
1 2 5 5
2 34 10 34
If you want to replace NaN - which seemed to be your actual goal according to your comment - you can use
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 10], 'c':[np.nan, 5, 34]})
a b c
0 1.0 3.0 NaN
1 2.0 NaN 5.0
2 NaN 10.0 34.0
df.T.fillna(df.max(axis=1)).T
yielding
a b c
0 1.0 3.0 3.0
1 2.0 5.0 5.0
2 34.0 10.0 34.0
which might be more efficient (have not done the timings) than
df.apply(lambda row: row.fillna(row.max()), axis=1)
Please note that
df.apply(lambda row: row.fillna(max(row)), axis=1)
does not work in every case, because Python's built-in max does not skip NaN the way Series.max does, as explained here.
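For a fully vectorized variant without apply (a sketch, assuming the NaN frame above): DataFrame.mask can broadcast the Series of row maxima along the index:
# replace NaN with the row's max; axis=0 aligns the row-maxima Series with the index
df = df.mask(df.isna(), df.max(axis=1), axis=0)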
