Pandas dataframe: a particular case of merging - python

How can I merge the rows of dataframe1 into dataframe2?
If one of the corresponding values is NaN, the value should be copied from the other.
If both are NaN, the result stays NaN.
If neither is NaN, the first one is kept.
Dataframe1
Dataframe2
Thanks in advance

You can use combine_first:
df
Out:
col1 col2 col3 col4
0 NaN NaN 3.0 4
1 1.0 2.0 NaN 5
df.loc[0].combine_first(df.loc[1])
Out:
col1 1.0
col2 2.0
col3 3.0
col4 4.0
Name: 0, dtype: float64
In the specified format:
df.loc[0].combine_first(df.loc[1]).to_frame('Row1-2').T
Out:
col1 col2 col3 col4
Row1-2 1.0 2.0 3.0 4.0
An alternative:
df.loc[[0]].fillna(df.loc[1])
Out:
col1 col2 col3 col4
0 1.0 2.0 3.0 4
And a cleaner version of the filling, from @MaxU:
df.bfill().iloc[[0]]
Out:
col1 col2 col3 col4
0 1.0 2.0 3.0 4
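The same idea extends to two separate DataFrames, which is what the question actually asks for. A minimal sketch, assuming Dataframe1 and Dataframe2 share the same index and columns (the single-row frames below are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical stand-ins for Dataframe1 and Dataframe2
df1 = pd.DataFrame({'col1': [np.nan], 'col2': [2.0], 'col3': [np.nan]})
df2 = pd.DataFrame({'col1': [1.0], 'col2': [5.0], 'col3': [np.nan]})

# combine_first keeps df1's value where present and falls back to df2
# otherwise; cells that are NaN in both stay NaN, and where neither is
# NaN, df1 (the first frame) wins -- the three rules in the question
merged = df1.combine_first(df2)
print(merged)
#    col1  col2  col3
# 0   1.0   2.0   NaN
```

combine_first aligns on both index and columns, so rows or columns present in only one of the frames are kept as well.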

Related

How to insert NaN if value is not between fence_high and fence_low columns

I have to replace the values in the first three columns with NaN if they are >= fence_high or <= fence_low.
I have a dataframe like this:
col1 col2 col3 fence_high fence_low
0 1 3 9 9 1.5
1 2 4 6 7 1
2 4 7 -1 6.5 0
This is what I would like to achieve:
col1 col2 col3 fence_high fence_low
0 NaN 3 NaN 9 1.5
1 2 4 6 7 1
2 4 NaN NaN 6.5 0
So far I tried df_new = df[(df < df["fence_high"]) & (df > df["fence_low"])], but this gives me all NaN.
We can simply keep values where they fall between fence_low and fence_high using gt and lt to maintain index alignment:
df.loc[:, 'col1':'col3'] = df.loc[:, 'col1':'col3'].where(
    lambda x: x.gt(df['fence_low'], axis=0) & x.lt(df['fence_high'], axis=0)
)
df
col1 col2 col3 fence_high fence_low
0 NaN 3.0 NaN 9.0 1.5
1 2.0 4.0 6.0 7.0 1.0
2 4.0 NaN NaN 6.5 0.0
If a new DataFrame is needed, we can join after where to restore the columns that were not considered:
new_df = df.loc[:, 'col1':'col3'].where(
    lambda x: x.gt(df['fence_low'], axis=0) & x.lt(df['fence_high'], axis=0)
).join(df[['fence_high', 'fence_low']])
new_df:
col1 col2 col3 fence_high fence_low
0 NaN 3.0 NaN 9.0 1.5
1 2.0 4.0 6.0 7.0 1.0
2 4.0 NaN NaN 6.5 0.0
One way is to use apply. See if this helps:
import pandas as pd
import numpy as np

cols_list = ["col1", "col2", "col3"]

def compare_val(val, high, low):
    if val >= high or val <= low:
        return np.nan
    return val

def compare(row):
    result = []
    for i in cols_list:
        result.append(
            compare_val(val=row[i], high=row["fence_high"], low=row["fence_low"])
        )
    # label the result so it lines up with the original columns
    return pd.Series(result, index=cols_list)

data = [[1, 3, 9, 9, 1.5], [2, 4, 6, 7, 1], [4, 7, -1, 6.5, 0]]
df = pd.DataFrame(data, columns=[*cols_list, "fence_high", "fence_low"])
print("Original:\n", df.head())
df[cols_list] = df.apply(compare, axis=1)
print("Transformed:\n", df.head())
Output:
Original:
col1 col2 col3 fence_high fence_low
0 1 3 9 9.0 1.5
1 2 4 6 7.0 1.0
2 4 7 -1 6.5 0.0
Transformed:
col1 col2 col3 fence_high fence_low
0 NaN 3.0 NaN 9.0 1.5
1 2.0 4.0 6.0 7.0 1.0
2 4.0 NaN NaN 6.5 0.0
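As another angle on the same problem (a sketch using the question's data, not taken from the answers above): mask is the inverse of where, so the out-of-fence condition can be stated directly instead of negating it:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(
    [[1, 3, 9, 9, 1.5], [2, 4, 6, 7, 1], [4, 7, -1, 6.5, 0]],
    columns=['col1', 'col2', 'col3', 'fence_high', 'fence_low'],
)

vals = df[['col1', 'col2', 'col3']]
# mask() replaces values where the condition is True (with NaN by default),
# so we state "at or beyond either fence" directly
out_of_fence = vals.ge(df['fence_high'], axis=0) | vals.le(df['fence_low'], axis=0)
df[['col1', 'col2', 'col3']] = vals.mask(out_of_fence)
print(df)
```

The result matches the where version; choosing between them is purely a matter of which condition reads more naturally.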

Shift all NaN values in pandas to the left

I have a (250, 33866) dataframe. All the NaN values are at the end of each row, and I would like to shift those NaN values to the left of the dataframe. At the same time I want to keep the 0 column (which refers to the Id) in its place as the first one.
I was trying to define a function that loops over all rows and columns to do that, but figured it would be very inefficient for large data. Any other options? Thanks
You could reverse the columns of df, drop NaNs; build a DataFrame and reverse it back:
out = pd.DataFrame(
    df.iloc[:, ::-1].apply(lambda x: x.dropna().tolist(), axis=1).tolist(),
    columns=df.columns[::-1],
).iloc[:, ::-1]
For example, for a DataFrame that looks like below:
col0 col1 col2 col3 col4
1 1.0 2.0 3.0 10.0 20.0
2 1.0 2.0 3.0 NaN NaN
3 1.0 2.0 NaN NaN NaN
the above code produces (note that building a new DataFrame resets the row index):
col0 col1 col2 col3 col4
0 1.0 2.0 3.0 10.0 20.0
1 NaN NaN 1.0 2.0 3.0
2 NaN NaN NaN 1.0 2.0
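The question also asks to keep column 0 (the Id) in place, which the reverse-and-drop trick would otherwise shift along with everything else. A sketch of one way to handle that, applying the trick only to the non-Id columns (the small frame below is made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical frame: column 0 is an Id that must stay put
df = pd.DataFrame({0: [101, 102, 103],
                   1: [1.0, 1.0, 1.0],
                   2: [2.0, 2.0, np.nan],
                   3: [3.0, np.nan, np.nan]})

rest = df.iloc[:, 1:]
# reverse the non-Id columns, drop NaNs per row, rebuild, reverse back
shifted = pd.DataFrame(
    rest.iloc[:, ::-1].apply(lambda r: r.dropna().tolist(), axis=1).tolist(),
    columns=rest.columns[::-1],
).iloc[:, ::-1]
# glue the untouched Id column back on the left
out = pd.concat([df.iloc[:, :1], shifted], axis=1)
print(out)
```

This works because the rebuilt frame gets the same default 0..n-1 index as df, so concat lines the rows back up.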

Fill NaN values in a pandas DataFrame depending on values of cells to its left

I'm trying to fill the NaNs in a very large pandas dataframe with zeros, but only where there is a non-NaN value somewhere to their left in the same row. So for example, from this input DataFrame,
df = pd.DataFrame([[1, np.nan, 1.5, np.nan], [np.nan, 2, np.nan, np.nan]], index=['A', 'B'], columns=['col1', 'col2', 'col3', 'col4'])
which looks like:
col1 col2 col3 col4
A 1.0 NaN 1.5 NaN
B NaN 2.0 NaN NaN
The expected output would be:
col1 col2 col3 col4
A 1.0 0 1.5 0
B NaN 2.0 0 0
See how [B, col1] remains NaN because there is no non-NaN value to its left, while [A, col2], [A, col4], [B, col3] and [B, col4] have all been filled with zeros (because there are non-NaN values further left).
Does anyone have any idea on how to go on about this? Thanks a lot!
Forward fill the missing values along each row and test for non-missing values, chain that with a test for missing values in the original, and assign 0 through the resulting mask:
df[df.ffill(axis=1).notna() & df.isna()] = 0
print (df)
col1 col2 col3 col4
A 1.0 0.0 1.5 0.0
B NaN 2.0 0.0 0.0
Or you can use a cumulative sum and test for values not equal to 0 (this relies on the running sum of the real values never being exactly 0):
df[df.fillna(0).cumsum(axis=1).ne(0) & df.isna()] = 0
print (df)
col1 col2 col3 col4
A 1.0 0.0 1.5 0.0
B NaN 2.0 0.0 0.0
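A third, equivalent mask (a sketch, not from the answer above) uses a cumulative maximum on notna, which avoids the edge case where a running sum of real values happens to be exactly 0:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, np.nan, 1.5, np.nan], [np.nan, 2, np.nan, np.nan]],
                  index=['A', 'B'], columns=['col1', 'col2', 'col3', 'col4'])

# notna().cummax(axis=1) flips to True at the first non-NaN in each row and
# stays True, so ANDing with isna() selects exactly the NaNs that have a
# real value somewhere to their left
df[df.notna().cummax(axis=1) & df.isna()] = 0
print(df)
```

The output is the same as the two masks above: only [B, col1] stays NaN.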

pandas shift rows NaNs

Say we have a dataframe set up as follows:
x = pd.DataFrame(np.random.randint(1, 10, 30).reshape(5, 6),
                 columns=[f'col{i}' for i in range(6)])
x['col6'] = np.nan
x['col7'] = np.nan
col0 col1 col2 col3 col4 col5 col6 col7
0 6 5 1 5 2 4 NaN NaN
1 8 8 9 6 7 2 NaN NaN
2 8 3 9 6 6 6 NaN NaN
3 8 4 4 4 8 9 NaN NaN
4 5 3 4 3 8 7 NaN NaN
When calling x.shift(2, axis=1), col2 through col5 shift correctly, but col6 and col7 stay as NaN.
How can I overwrite the NaN values in col6 and col7 with col4's and col5's values? Is this a bug or intended?
col0 col1 col2 col3 col4 col5 col6 col7
0 NaN NaN 6.0 5.0 1.0 5.0 NaN NaN
1 NaN NaN 8.0 8.0 9.0 6.0 NaN NaN
2 NaN NaN 8.0 3.0 9.0 6.0 NaN NaN
3 NaN NaN 8.0 4.0 4.0 4.0 NaN NaN
4 NaN NaN 5.0 3.0 4.0 3.0 NaN NaN
It's possible this is a bug; you can use np.roll to achieve this:
In[11]:
x.apply(lambda row: np.roll(row, 2), axis=1)
Out[11]:
col0 col1 col2 col3 col4 col5 col6 col7
0 NaN NaN 6.0 5.0 1.0 5.0 2.0 4.0
1 NaN NaN 8.0 8.0 9.0 6.0 7.0 2.0
2 NaN NaN 8.0 3.0 9.0 6.0 6.0 6.0
3 NaN NaN 8.0 4.0 4.0 4.0 8.0 9.0
4 NaN NaN 5.0 3.0 4.0 3.0 8.0 7.0
Speed-wise, it's probably quicker to construct a new DataFrame, reusing the existing columns and passing the result of np.roll as the data argument to the DataFrame constructor:
In[12]:
x = pd.DataFrame(np.roll(x, 2, axis=1), columns = x.columns)
x
Out[12]:
col0 col1 col2 col3 col4 col5 col6 col7
0 NaN NaN 6.0 5.0 1.0 5.0 2.0 4.0
1 NaN NaN 8.0 8.0 9.0 6.0 7.0 2.0
2 NaN NaN 8.0 3.0 9.0 6.0 6.0 6.0
3 NaN NaN 8.0 4.0 4.0 4.0 8.0 9.0
4 NaN NaN 5.0 3.0 4.0 3.0 8.0 7.0
timings
In[13]:
%timeit pd.DataFrame(np.roll(x, 2, axis=1), columns = x.columns)
%timeit x.fillna(0).astype(int).shift(2, axis=1)
10000 loops, best of 3: 117 µs per loop
1000 loops, best of 3: 418 µs per loop
So constructing a new df from the result of np.roll is quicker than first filling the NaN values, casting to int, and then shifting.
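For what it's worth, the behaviour described in the question seems to stem from the frame's mixed dtypes (int columns plus float NaN columns). A sketch, assuming a reasonably recent pandas: casting everything to a single float dtype first lets shift itself do the job:

```python
import pandas as pd
import numpy as np

x = pd.DataFrame([[6, 5, 1, 5, 2, 4], [8, 8, 9, 6, 7, 2]],
                 columns=[f'col{i}' for i in range(6)])
x['col6'] = np.nan
x['col7'] = np.nan

# On a homogeneous float frame, shift moves values into col6/col7 as expected
shifted = x.astype(float).shift(2, axis=1)
print(shifted)
```

Unlike np.roll, shift does not wrap the last columns around to the front, so this matches the "overwrite col6/col7 with col4/col5" intent directly.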

How to get and show counts of values in each column

I have a pandas dataframe.
df:
col1 col2 col3 col4 col5
0 1.0 1.0 NaN NaN 1.0
1 NaN 1.0 1.0 2.0 1.0
2 2.0 NaN 1.0 NaN 1.0
I want to get, for each column, the count of rows containing each value, like the following.
Output:
col1 col2 col3 col4 col5
1.0 1 2 2 0 3
2.0 1 0 0 1 0
Or only the counts for a single value:
col1 col2 col3 col4 col5
1.0 1 2 2 0 3
Are there any ways to get my expected output?
You could use the value_counts method of pandas Series and then fillna to fill the NaN values with 0:
In [7]: df
Out[7]:
col1 col2 col3 col4 col5
0 1.0 1.0 NaN NaN 1.0
1 NaN 1.0 1.0 2.0 1.0
2 2.0 NaN 1.0 NaN 1.0
In [8]: df.apply(lambda x: x.value_counts()).fillna(0)
Out[8]:
col1 col2 col3 col4 col5
1.0 1 2.0 2.0 0.0 3.0
2.0 1 0.0 0.0 1.0 0.0
If you need int values instead of float you could also use astype with int:
In [9]: df.apply(lambda x: x.value_counts()).fillna(0).astype(int)
Out[9]:
col1 col2 col3 col4 col5
1.0 1 2 2 0 3
2.0 1 0 0 1 0
Edit: df.replace(np.NaN, 0) does not work reliably across versions, so updated to use df.fillna(0) instead.
To count the occurrences of each value in each column, use value_counts. Non-occurring values become NaN, so they need to be replaced with 0:
>>> df.apply(pd.value_counts).fillna(0)
col1 col2 col3 col4 col5
1 1 2 2 0 3
2 1 0 0 1 0
To retrieve a particular row:
>>> df.apply(pd.value_counts).fillna(0).loc[1:1]
col1 col2 col3 col4 col5
1 1 2 2 0 3
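One caveat worth adding (an aside, not part of the original answers): the top-level pd.value_counts function is deprecated in newer pandas releases, so calling the Series method through apply is the future-proof spelling:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1.0, np.nan, 2.0],
                   'col2': [1.0, 1.0, np.nan],
                   'col3': [np.nan, 1.0, 1.0],
                   'col4': [np.nan, 2.0, np.nan],
                   'col5': [1.0, 1.0, 1.0]})

# Same result as df.apply(pd.value_counts), without the deprecation warning
counts = df.apply(lambda s: s.value_counts()).fillna(0).astype(int)
print(counts)
```

The printed table matches the expected output in the question: one row per distinct value, one column per original column.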
