How do I merge multiple pandas dataframe columns - python

I have a dataframe similar to the one seen below.
In[2]: df = pd.DataFrame({'P1': [1, 2, None, None, None, None],'P2': [None, None, 3, 4, None, None],'P3': [None, None, None, None, 5, 6]})
Out[2]:
P1 P2 P3
0 1.0 NaN NaN
1 2.0 NaN NaN
2 NaN 3.0 NaN
3 NaN 4.0 NaN
4 NaN NaN 5.0
5 NaN NaN 6.0
And I am trying to merge all of the columns into a single P column in a new dataframe (see below).
P
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
In my actual code, I have an arbitrary list of columns that should be merged, not necessarily P1, P2, and P3 (between 1 and 5 columns). I've tried something along the following lines:
new_series = pd.Series()
desired_columns = ['P1', 'P2', 'P3']
for col in desired_columns:
    other_series = df[col]
    new_series = new_series.align(other_series)
However, this results in a tuple of Series objects, and neither of them appears to contain the data I need. I could iterate through every row and check each column, but I feel that there is likely an easy pandas solution that I am missing.

If there is only one non-None value per row, forward-fill the missing values along each row and select the last column by position:
df['P'] = df[['P1', 'P2', 'P3']].ffill(axis=1).iloc[:, -1]
print (df)
P1 P2 P3 P
0 1.0 NaN NaN 1.0
1 2.0 NaN NaN 2.0
2 NaN 3.0 NaN 3.0
3 NaN 4.0 NaN 4.0
4 NaN NaN 5.0 5.0
5 NaN NaN 6.0 6.0

Another solution:
If you do not need to restrict the operation to specific columns, you can use bfill() to fill the missing values across columns. With axis='columns' (equivalently axis=1), each NaN cell is filled with the next non-NaN value to its right in the same row, so the first column ends up holding the first non-NaN value of each row.
>>> df['P'] = df.bfill(axis=1).iloc[:, 0]
>>> df
P1 P2 P3 P
0 1.0 NaN NaN 1.0
1 2.0 NaN NaN 2.0
2 NaN 3.0 NaN 3.0
3 NaN 4.0 NaN 4.0
4 NaN NaN 5.0 5.0
5 NaN NaN 6.0 6.0
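For an arbitrary list of columns, as in the question, the back-fill approach can be wrapped in a small function. This is only a minimal sketch: the helper name coalesce_columns is hypothetical (it is not a pandas function), and it assumes that the first non-null value per row, reading left to right, is the one to keep.

```python
import pandas as pd

df = pd.DataFrame({
    'P1': [1, 2, None, None, None, None],
    'P2': [None, None, 3, 4, None, None],
    'P3': [None, None, None, None, 5, 6],
})


def coalesce_columns(frame, columns, name='P'):
    """Hypothetical helper: one-column DataFrame with the first non-null value per row."""
    # Back-fill along each row so that the first selected column
    # ends up holding the first non-null value found in that row.
    merged = frame[columns].bfill(axis=1).iloc[:, 0]
    return merged.rename(name).to_frame()


desired_columns = ['P1', 'P2', 'P3']
new_df = coalesce_columns(df, desired_columns)
print(new_df)
```

Note that ffill(axis=1).iloc[:, -1] keeps the last non-null value per row instead, so the two answers agree only when each row has at most one non-null value.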

Related

Conditional pairwise calculations in pandas

For example, I have 2 dfs:
df1
ID,col1,col2
1,5,9
2,6,3
3,7,2
4,8,5
and another df is
df2
ID,col1,col2
1,11,9
2,12,7
3,13,2
First, I want to calculate the pairwise absolute differences between col1 of df1 and col1 of df2. I am using cdist from scipy.spatial.distance with a custom function subtract_:
from scipy.spatial.distance import cdist
def subtract_(a, b):
    return abs(a - b)
d1_s = df1[['col1']]
d2_s = df2[['col1']]
dist = cdist(d1_s, d2_s, metric=subtract_)
dist_df = pd.DataFrame(dist, columns= d2_s.values.ravel())
print(dist_df)
11 12 13
6.0 7.0 8.0
5.0 6.0 7.0
4.0 5.0 6.0
3.0 4.0 5.0
Now I want to check each of these new columns (11, 12 and 13) for values less than 5. Where there is such a value, I want to do a further calculation.
For example, in column 11 one value below 5 is the 4 in row 3 (counting rows from 1). In this case, I want to take col2 of df1 at row 3, which is 2, and subtract from it col2 of df2 at row 1 (because column 11 came from the value at row 1 in df2).
My for loop for this is very complex. It would be great if there were an easier way in pandas.
Any help or suggestions would be appreciated.
The expected new dataframe is this
0,1,2
Nan,Nan,Nan
Nan,Nan,Nan
(2-9)=-7,Nan,Nan
(5-9)=-4,(5-7)=-2,Nan
Similar to Ben's answer, but with np.where:
pd.DataFrame(np.where(dist_df<5, df1.col2.values[:,None] - df2.col2.values, np.nan),
index=dist_df.index,
columns=dist_df.columns)
Output:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN
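The np.where approach can be reproduced end to end as follows. This is a sketch under one assumption: instead of cdist, the pairwise absolute differences are computed with plain NumPy broadcasting, which gives the same distance table for a single column.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'col1': [5, 6, 7, 8], 'col2': [9, 3, 2, 5]})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'col1': [11, 12, 13], 'col2': [9, 7, 2]})

# Pairwise |df1.col1 - df2.col1|: shape (4, 1) broadcast against (3,) gives (4, 3).
dist = np.abs(df1['col1'].to_numpy()[:, None] - df2['col1'].to_numpy())

# Pairwise df1.col2 - df2.col2 in the same (4, 3) layout.
col2_diff = df1['col2'].to_numpy()[:, None] - df2['col2'].to_numpy()

# Keep the col2 difference only where the col1 distance is below 5.
result = pd.DataFrame(
    np.where(dist < 5, col2_diff, np.nan),
    index=df1.index,
    columns=df2['col1'].to_numpy(),
)
print(result)
```

Because both arrays share the same (rows of df1) x (rows of df2) layout, the condition and the values line up cell by cell without any explicit loop.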
In your case, you can use numpy broadcasting with mask (df here is dist_df):
df.mask(df<5,df-(df1.col2.values[:,None]+df2.col2.values))
Out[115]:
11 12 13
0 6.0 7.0 8.0
1 5.0 6.0 7.0
2 -7.0 5.0 6.0
3 -11.0 -8.0 5.0
Update
Newdf=(df-(-df1.col2.values[:,None]+df2.col2.values)-df).where(df<5)
Out[148]:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN

Manipulating value in a column based on a rule

I have 3 columns, A, B and C, in a pandas dataframe. Wherever A is not null AND (B OR C) is not null, I want that row's value in A to be set to null.
if(dffinal['A'].loc[dffinal['A'].notnull()] &
   (dffinal['B'].loc[dffinal['B'].notnull()] |
    dffinal['C'].loc[dffinal['C'].notnull()])):
    dffinal['A'] = np.nan
This is the error I'm getting: cannot do a non-empty take from an empty axes.
Use df.loc[]:
df.loc[df.A.notna() & (df.B.notna()|df.C.notna()),'A']=np.nan
The first condition is not necessary here, because rows where A is already null are unaffected by the assignment, so the solution can be simplified:
dffinal = pd.DataFrame({
    'A':[np.nan,np.nan,4,5,5,np.nan],
    'B':[7,np.nan,np.nan,4,np.nan,np.nan],
    'C':[1,3,5,7,np.nan,np.nan],
})
print (dffinal)
A B C
0 NaN 7.0 1.0
1 NaN NaN 3.0
2 4.0 NaN 5.0
3 5.0 4.0 7.0
4 5.0 NaN NaN
5 NaN NaN NaN
mask = (dffinal['B'].notnull() | dffinal['C'].notnull())
dffinal.loc[mask, 'A'] = np.nan
print (dffinal)
A B C
0 NaN 7.0 1.0
1 NaN NaN 3.0
2 NaN NaN 5.0
3 NaN 4.0 7.0
4 5.0 NaN NaN
5 NaN NaN NaN
The output is the same when the first condition is included:
mask = dffinal['A'].notnull() & (dffinal['B'].notnull() | dffinal['C'].notnull())
dffinal.loc[mask, 'A'] = np.nan
print (dffinal)
A B C
0 NaN 7.0 1.0
1 NaN NaN 3.0
2 NaN NaN 5.0
3 NaN 4.0 7.0
4 5.0 NaN NaN
5 NaN NaN NaN
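The same rule can also be written without an explicit .loc assignment. The following is a sketch of one alternative, using Series.mask together with any(axis=1); the variable name has_b_or_c is only illustrative.

```python
import numpy as np
import pandas as pd

dffinal = pd.DataFrame({
    'A': [np.nan, np.nan, 4, 5, 5, np.nan],
    'B': [7, np.nan, np.nan, 4, np.nan, np.nan],
    'C': [1, 3, 5, 7, np.nan, np.nan],
})

# True for rows where at least one of B or C holds a value.
has_b_or_c = dffinal[['B', 'C']].notna().any(axis=1)

# Series.mask replaces values with NaN wherever the condition is True.
dffinal['A'] = dffinal['A'].mask(has_b_or_c)
print(dffinal)
```

Listing the columns in dffinal[['B', 'C']] makes it easy to extend the rule to more columns later.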

Fill missing value by averaging previous row value

I want to fill each missing value with the average of the values in the previous N rows. An example is shown below:
N=2
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, np.nan]],
                  columns=list('ABCD'))
DataFrame is like:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN NaN
Result should be:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN (4+2)/2 NaN 5
3 NaN 3.0 NaN (1+5)/2
I am wondering if there is an elegant and fast way to achieve this without a for loop.
rolling + mean + shift
You will need to modify the logic below if you want a mean to be computed from NaN and one other value, that is, in the case where one of the previous two values is null.
df = df.fillna(df.rolling(2).mean().shift())
print(df)
A B C D
0 NaN 2.0 NaN 0.0
1 3.0 4.0 NaN 1.0
2 NaN 3.0 NaN 5.0
3 NaN 3.0 NaN 3.0
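The rolling + mean + shift idea can be parameterised by N, and the case where one of the previous values is null can be handled with the min_periods argument of rolling. This is a sketch, not the answer's own code: the function name fill_with_previous_mean is hypothetical, and min_periods=1 is just one possible interpretation (average whatever non-null values exist in the window). Note that it gives a different result from the question's expected output in column A.

```python
import numpy as np
import pandas as pd

N = 2
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, np.nan]],
                  columns=list('ABCD'))


def fill_with_previous_mean(frame, n, min_periods=None):
    """Hypothetical helper: fill NaN with the mean of the previous n rows."""
    # Mean over a window of n rows, shifted down by one row so that
    # each position only sees the rows strictly before it.
    previous_mean = frame.rolling(n, min_periods=min_periods).mean().shift()
    return frame.fillna(previous_mean)


# Default: all n previous values must be present, as in the answer above.
strict = fill_with_previous_mean(df, N)

# Variant: average whichever of the previous n values are non-null.
lenient = fill_with_previous_mean(df, N, min_periods=1)

print(strict)
print(lenient)
```

In both variants the means are taken from the original values, not from values that were themselves filled in earlier rows.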

Cutting a dataframe in a loop

I have a dataset that consists of only one column. I want to cut the column into multiple dataframes.
I use a for loop to create a list of values that determine the positions at which I want to cut the dataframe.
import pandas as pd
df = pd.read_csv("column.csv", delimiter=";", header=0, index_col=(0))
number_of_pixels = int(len(df.index))
print("You have " + str(number_of_pixels) +" pixels in your file")
number_of_rows = int(input("Enter number of rows you want to create"))
lst = []  # this list contains the number of pixels per row (named lst to avoid shadowing the built-in list)
for i in range(0, number_of_rows):  # this loop fills the list with the number of pixels per row
    pixels_per_row = int(input("Enter number of pixels in row " + str(i)))
    lst.append(pixels_per_row)
print(lst)
After cutting the column into multiple dataframes, I want to transpose each dataframe and concatenate all of them back together using:
df1=df1.reset_index(drop=True)
df1=df1.T
df2=df2.reset_index(drop=True)
df2=df2.T
frames = [df1,df2]
result = pd.concat(frames, axis=0)
print(result)
So I want to create a loop that cuts my dataframe into multiple frames at the positions stored in my list.
Thank you!
This is a problem that is better solved with numpy. I'll start from the point where you have received a list from your user input. The idea is to use numpy.split to separate the values based on the cumulative number of pixels requested, and then create a new DataFrame.
Setup
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame({'val': np.random.randint(1,10,50)})
lst = [4,10,2,1,15,8,9,1]
Code
pd.DataFrame(np.split(df.val.values, np.cumsum(lst)[:-1]))
Output
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 3 3.0 7.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 4 7.0 2.0 1.0 2.0 1.0 1.0 4.0 5.0 1.0 NaN NaN NaN NaN NaN
2 1 5.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 8 4.0 3.0 5.0 8.0 3.0 5.0 9.0 1.0 8.0 4.0 5.0 7.0 2.0 6.0
5 7 3.0 2.0 9.0 4.0 6.0 1.0 3.0 NaN NaN NaN NaN NaN NaN NaN
6 7 3.0 5.0 5.0 7.0 4.0 1.0 7.0 5.0 NaN NaN NaN NaN NaN NaN
7 8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
If your list asks for more pixels than the total number of rows in your initial DataFrame, you'll get extra all-NaN rows in your output. If your lst sums to less than the total number of pixels, the leftover pixels will all be added to the last row. Since you didn't specify either of these conditions in your question, I'm not sure how you'd want to handle that.
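If you would rather reject a mismatched list than get padded or overlong rows, one option is to check the total before splitting. The sketch below assumes that a mismatch should raise an error; the function name cut_into_rows and the small sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def cut_into_rows(values, pixels_per_row):
    """Hypothetical helper: split a 1-D sequence into rows of the given lengths."""
    total = int(np.sum(pixels_per_row))
    if total != len(values):
        raise ValueError(
            f"pixel counts sum to {total}, but there are {len(values)} values"
        )
    # Split positions are the cumulative counts, without the final total.
    chunks = np.split(np.asarray(values), np.cumsum(pixels_per_row)[:-1])
    # Rows of unequal length are padded with NaN by the DataFrame constructor.
    return pd.DataFrame([chunk.tolist() for chunk in chunks])


df = pd.DataFrame({'val': np.arange(1, 11)})
result = cut_into_rows(df['val'].to_numpy(), [3, 5, 2])
print(result)
```

Each entry of the list becomes one row of the result, so no transpose or concat step is needed afterwards.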

Delete rows in dataframe based on column values

I need to rid myself of all rows with a null value in column C. Here is the code:
infile="C:\****"
df=pd.read_csv(infile)
A B C D
1 1 NaN 3
2 3 7 NaN
4 5 NaN 8
5 NaN 4 9
NaN 1 2 NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is an NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value using both column index number and column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
df = pd.DataFrame([[1,1,np.nan,3],[2,3,7,np.nan],[4,5,np.nan,8],[5,np.nan,4,9],[np.nan,1,2,np.nan]], columns = ['A','B','C','D'])
df = df[df['C'].notnull()]
df
This just proves that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
A B C D
0 1.0 1.0 NaN 3.0
1 2.0 3.0 7.0 NaN
2 4.0 5.0 NaN 8.0
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [101]: df.dropna(subset=['C'])
Out[101]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [102]: df[df.C.notnull()]
Out[102]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [103]: df = df[df.C.notnull()]
In [104]: df
Out[104]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
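Regarding the attempt with a column index number: the subset argument of dropna expects column labels, not positions. A minimal sketch of selecting the column by position, assuming the third column is the one of interest, is to look its label up through df.columns first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, np.nan, 3],
                   [2, 3, 7, np.nan],
                   [4, 5, np.nan, 8],
                   [5, np.nan, 4, 9],
                   [np.nan, 1, 2, np.nan]],
                  columns=['A', 'B', 'C', 'D'])

# Drop rows by column label.
by_label = df.dropna(subset=['C'])

# Drop rows by column position: translate position 2 into its label first.
by_position = df.dropna(subset=[df.columns[2]])

print(by_label)
```

Both calls return a new DataFrame and leave df unchanged, so the result has to be assigned back if you want to keep it.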
