Assign a Series to several Rows of a Pandas DataFrame - python

I have a pandas DataFrame prepared with an index and columns; all values are NaN.
Now I have computed a result that applies to more than one row of the DataFrame, and I would like to assign it to all of those rows at once. This could be done with a loop, but I am pretty sure the assignment can be done in a single statement.
Here is a scenario:
import pandas as pd
df = pd.DataFrame(index=['A', 'B', 'C'], columns=['C1', 'C2']) # original df
s = pd.Series({'C1': 1, 'C2': 'ham'}) # a computed result
index = pd.Index(['A', 'C']) # result is valid for rows 'A' and 'C'
The naive approach is
df.loc[index, :] = s
But this does not change the DataFrame at all. It remains as
C1 C2
A NaN NaN
B NaN NaN
C NaN NaN
How can this assignment be done?

It seems we can use the underlying array data to assign -
df.loc[index, :] = s.values
Note that this assumes the order of the index in s is the same as the order of the columns in df. If that's not the case, as suggested by @Nras, we could use s[df.columns].values on the right-hand side of the assignment.
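Putting it together, a minimal sketch with the same df, s and index as above (the printed result is illustrative of what a current pandas version shows):
import pandas as pd

df = pd.DataFrame(index=['A', 'B', 'C'], columns=['C1', 'C2'])
s = pd.Series({'C1': 1, 'C2': 'ham'})
index = pd.Index(['A', 'C'])

# reorder s to match df's columns, then assign the raw array so it
# broadcasts across the selected rows
df.loc[index, :] = s[df.columns].values
print(df)
#     C1   C2
# A    1  ham
# B  NaN  NaN
# C    1  ham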

Related

Why does setting a column in a DataFrame from a Series with a different index produce a column of NaNs?

In the following code, I have one DataFrame with two rows and a series with two values.
I would like to set the Series values in the column of my DataFrame.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(2, 1), index=["one", "two"])
print(df)
s = pd.Series(np.random.randn(2), index=["four", "five"])
df.loc[:, 0] = s
print(df)
However, the Series and the DataFrame don't have the same index, which results in NaNs in the DataFrame.
0
one NaN
two NaN
In order to have my values in the column, I can simply use the .values attribute of s.
df.loc[:, 0] = s.values
I would like to understand the logic behind getting NaNs with the former approach.
Before adding values to a Series/column, pandas aligns the indices.
This enables you to assign data when indices are missing or not in the same order.
For example:
df = pd.DataFrame(np.random.randn(2, 1), index=["one", "two"])
s = pd.Series([2, 1], index=["two", "one"]) # notice the different order
df.loc[:, 0] = s
print(df)
0
one 1
two 2
You can check what should happen using reindex:
s = pd.Series(np.random.randn(2), index=["four", "five"])
s.reindex(df.index)
one NaN
two NaN
dtype: float64
Using .values or .to_numpy() converts the Series to a plain NumPy array, so no reindexing is performed.
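As a small sketch of the difference (assuming the same df and s as defined above):
s = pd.Series(np.random.randn(2), index=["four", "five"])
df.loc[:, 0] = s                      # aligned on the index -> all NaN
df.loc[:, 0] = s.reindex(df.index)    # equivalent to the line above
df.loc[:, 0] = s.to_numpy()           # positional, no alignment -> the two values land in the column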
The indexes are not matching between df and the Series s, so the alignment produces NaNs.
On the flip side, if you do s.values or s.to_list(), the Series is converted to an array or a list respectively, so there is no index-wise matching at all.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(2, 1), index=["one", "two"])
print(df)
s = pd.Series(np.random.randn(2), index=["one", "two"]) #edited here
df.loc[:, 0] = s
print(df)
0
one -0.560306
two -0.762751
0
one 0.281997
two 0.361495

pandas: how to merge columns irrespective of index

I have two dataframes with meaningless indexes but a carefully curated order, and I want to merge them while preserving that order. So, for example:
>>> df1
First
a 1
b 3
and
>>> df2
Second
c 2
d 4
After merging, what I want to obtain is this:
>>> Desired_output
First Second
AnythingAtAll 1 2 # <--- Row Names are meaningless.
SeriouslyIDontCare 3 4 # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row indices "a/b" and "c/d" is irrelevant; what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat( [df1, df2] , axis = 1, ignore_index= True )
0 1
a 1.0 NaN
b 3.0 NaN
c NaN 2.0
d NaN 4.0
# ^ obviously not what I want.
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"]=df2["Second"]
>>> df1
First second
a 1 NaN
b 3 NaN
This was screwing me up, but thanks to the suggestions from jsmart and topsail, you can sidestep the indices by directly accessing the values in the column:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution
This should also work I think:
df1["second"] = df2["second"].values
It would keep the index from the first dataframe, but since you have values in there such as "AnyThingAtAll" and "SeriouslyIdontCare" I guess any index values whatsoever are acceptable.
Basically, we are just adding the values from your series as a new column to the first dataframe.
Here's a test example similar to your described problem:
# -----------
# sample data
# -----------
df1 = pd.DataFrame({
    'x': ['a', 'b'],
    'First': [1, 3],
})
df1.set_index("x", drop=True, inplace=True)
df2 = pd.DataFrame({
    'x': ['c', 'd'],
    'Second': [2, 4],
})
df2.set_index("x", drop=True, inplace=True)
# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
   First  Second
x
a      1       2
b      3       4
The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd
# create data frames df1 and df2
df1 = pd.DataFrame(data = {'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data = {'Second': [2, 4]}, index = ['c', 'd'])
# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
First Second
a 1 2
b 3 4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
1st 2nd
0 1 2
1 3 4
ignore_index controls whether the output DataFrame keeps the original labels along the concatenation axis. If it is True, the original labels are dropped and replaced with 0 to n-1, which is exactly the 0, 1 you see as column headers in your result.
You can try
out = pd.concat( [df1.reset_index(drop=True), df2.reset_index(drop=True)] , axis = 1)
print(out)
First Second
0 1 2
1 3 4
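Another sketch (an alternative, not from the answers above): overwrite df2's index with df1's so that a plain join lines the rows up purely by position.
df2.index = df1.index   # both frames must have the same number of rows
out = df1.join(df2)
print(out)
#    First  Second
# a      1       2
# b      3       4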

Fill values in pandas column with condition involving 2 other columns

I am trying to fill the 'C' column in such a way that when the value in 'A' is not NaN, 'C' takes its value from 'B'; otherwise the value in 'C' remains unchanged.
Here's the code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['greek', 'indian', np.nan, np.nan, 'australian'],
                   'B': np.random.random(5)})
df['C'] = np.nan
df
I tried df.C = df.B.where(df.A != np.nan, np.nan), but it isn't working, I think because the condition involves another column; a for loop isn't yielding the desired result either.
How can I get there in as few lines of code as possible?
The problem is not with .where; the problem is that you are comparing the value directly against np.nan using !=:
>>> np.nan == np.nan
False
So, use a function/method that allows you to check if the value is nan or not:
>>> df.C = df.B.where(df.A.notna(), np.nan)
A B C
0 greek 0.030809 0.030809
1 indian 0.545261 0.545261
2 NaN 0.470802 NaN
3 NaN 0.716640 NaN
4 australian 0.148297 0.148297
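An equivalent sketch using np.where instead of Series.where (an alternative, not the code from the answer above):
df['C'] = np.where(df['A'].notna(), df['B'], df['C'])  # keep the existing C (NaN here) where A is NaN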

How to replace values with the adjacent column using pandas

I have a dataframe, df1.
After an outer join, the df is as below.
df1 has 4 columns: ['A','B','C','D']
ID,A,B,C,D
1,NaN,NaN,c,d
1,a,b,c,d
I need to replace the NaN values in df['A'] with df['C'].
I need to replace the NaN values in df['B'] with df['D'].
The expected output is below:
ID,A,B,C,D
1,c,d,c,d
1,a,b,c,d
In the first row, df['A'] is replaced with df['C']; where df['A'] already has a value, it has to keep df['A'].
In the first row, df['B'] is replaced with df['D']; where df['B'] already has a value, it has to keep df['B'].
You need to fill each column with the column two positions after it; one way is to use fillna, specifying the value parameter:
df.A.fillna(value=df.C, inplace=True)
df.B.fillna(value=df.D, inplace=True)
If for some reason you have a lot of columns and want to keep filling NaNs from the column two positions after, use a for loop over the first n-2 columns:
columns = ['A', 'B', 'C', 'D']
for i in range(len(columns)-2):
    df[columns[i]].fillna(df[columns[i+2]], inplace=True)
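A minimal runnable sketch on the sample rows from the question (the literal values come from the example above; plain assignment is used instead of inplace=True, but the idea is the same):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1],
                   'A': [np.nan, 'a'], 'B': [np.nan, 'b'],
                   'C': ['c', 'c'], 'D': ['d', 'd']})
df['A'] = df['A'].fillna(df['C'])
df['B'] = df['B'].fillna(df['D'])
print(df)
#    ID  A  B  C  D
# 0   1  c  d  c  d
# 1   1  a  b  c  d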

How to record the "least occuring" item in a pandas DataFrame?

I have the following pandas DataFrame, with only three columns:
import pandas as pd
dict_example = {'col1': ['A', 'A', 'A', 'A', 'A'],
                'col2': ['A', 'B', 'A', 'B', 'A'],
                'col3': ['A', 'A', 'A', 'C', 'B']}
df = pd.DataFrame(dict_example)
print(df)
col1 col2 col3
0 A A A
1 A B A
2 A A A
3 A B C
4 A A B
For the rows with differing elements, I'm trying to write a function which will return the column names of the "minority" elements.
As an example, in row 1, there are 2 A's and 1 B. Given there is only one B, I consider this the "minority". If all elements are the same, there's naturally no minority (or majority). However, if each column has a different value, I consider these columns to be minorities.
Here is what I have in mind:
col1 col2 col3 min
0 A A A []
1 A B A ['col2']
2 A A A []
3 A B C ['col1', 'col2', 'col3']
4 A A B ['col3']
I'm stumped as to how to calculate this in a computationally efficient way.
Finding the most frequently occurring items appears straightforward, either with pandas.DataFrame.mode() or by finding the most common item in a list as follows:
lst = ['A', 'B', 'A']
max(lst,key=lst.count)
But I'm not sure how I could find the least occurring items.
This solution is not simple - but I could not think of a pandas-native solution without apply, and numpy does not seem to offer much help beyond the complex-number trick below for within-row uniqueness and value counts.
If you are not fixed on adding this min column, we can use some numpy tricks to NaN out the non-least-occurring entries. First, given your dataframe, we can make a numpy array of integers to help.
v = pd.factorize(df.stack())[0].reshape(df.shape)
v = pd.factorize(df.values.flatten())[0].reshape(df.shape)
(the second version should be faster, as the stack is unnecessary)
Then, using some tricks for numpy row-wise unique elements (complex offsets mark each element as belonging to a particular row), we find the least-occurring elements and mask them in. This method is mostly from user unutbu, used in several of their answers.
def make_mask(a):
    # give each row a distinct imaginary offset so identical values in
    # different rows never compare equal in np.unique
    weight = 1j*np.linspace(0, a.shape[1], a.shape[0], endpoint=False)
    b = a + weight[:, np.newaxis]
    u, ind, c = np.unique(b, return_index=True, return_counts=True)
    b = np.full_like(a, np.nan, dtype=float)
    np.put(b, ind, c)
    m = np.nanmin(b, axis=1)
    # remove only uniques
    b[(~np.isnan(b)).sum(axis=1) == 1, :] = np.nan
    # remove lower uniques
    b[~(b == m.reshape(-1, 1))] = np.nan
    return b
m = np.isnan(make_mask(v))
df[m] = np.nan
Giving
col1 col2 col3
0 NaN NaN NaN
1 NaN B NaN
2 NaN NaN NaN
3 A B C
4 NaN NaN B
Hopefully this achieves what you want in a performant way (say, if this dataframe is quite large). Given the flattened version of the first line (which avoids stack), I would imagine this is quite fast even for very large dataframes.
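If you do want the min column of column names from the question, one possible follow-up (an assumption, not part of the original answer) is to reuse the same boolean mask m, where True marks positions that are not least-occurring:
df_orig = pd.DataFrame(dict_example)   # an unmasked copy of the data
df_orig['min'] = [list(df_orig.columns[~row]) for row in m]
print(df_orig)
# which reproduces the 'min' column sketched in the question, e.g.
# row 1 -> ['col2'], row 3 -> ['col1', 'col2', 'col3'], row 4 -> ['col3'].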
