Combine 3 dataframe columns into 1 with priority while avoiding apply - python

Let's say I have 3 different columns
Column1 Column2 Column3
0 a 1 NaN
1 NaN 3 4
2 b 6 7
3 NaN NaN 7
and I want to create 1 final column that would take first value that isn't NA, resulting in:
Column1
0 a
1 3
2 b
3 7
I would usually do this with custom apply function:
df.apply(lambda x: ...)
I need to do this for many different cases with millions of rows and this becomes very slow. Are there any operations that would take advantage of vectorization to make this faster?

Back filling missing values and select first column by [] for one column DataFrame or without for Series:
df1 = df.bfill(axis=1).iloc[:, [0]]
s = df.bfill(axis=1).iloc[:, 0]

You can use pd.fillna() for this, as below:
df['Column1'].fillna(df['Column2']).fillna(df['Column3'])
output:
0 a
1 3
2 b
3 7
For more than 3 columns, this can be placed in a for loop as below, with new_col being your output:
new_col = df['Column1']
for col in df.columns:
new_col = new_col.fillna(df[col])

Related

Dropping multiple columns in a pandas dataframe between two columns based on column names

A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number, I do not know them. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with column range. For example if you have this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4

python pandas changing several columns in dataframe based on one condition

I am new in Python and Pandas. I worked with SAS. In SAS I can use IF statement with "Do; End;" to update values of several columns based on one condition.
I tried np.where() clause but it updates only one column. The "apply(function, ...)" also updates only one column. Positioning extra update statement inside the function body didn't help.
Suggestions?
You can select which columns you want to alter, then use .apply():
df = pd.DataFrame({'a': [1,2,3],
'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df[['a','b']].apply(lambda x: x+1)
a b
0 2 5
1 3 6
2 4 7
This link may help:
You could use:
for col in df:
df[col] = np.where(df[col] == your_condition, value_if, value_else)
eg:
a b
0 0 2
1 2 0
2 1 1
3 2 0
for col in df:
df[col] = np.where(df[col]==0,12, df[col])
Output:
a b
0 12 2
1 2 12
2 1 1
3 2 12
Or if you want apply the condition only on some columns, select them in the for loop:
for col in ['a','b']:
or just in this way:
df[['a','b']] = np.where(df[['a','b']]==0,12, df[['a','b']])

Add pandas Series as new columns to a specific Dataframe row

Say I have a Dataframe
df = pd.DataFrame({'A':[0,1],'B':[2,3]})
A B
0 0 2
1 1 3
Then I have a Series generated by some other function using inputs from the first row of the df but which has no overlap with the existing df
s = pd.Series ({'C':4,'D':6})
C 4
D 6
Now I want to add s to df.loc[0] with the keys becoming new columns and the values added only to this first row. The end result for df should look like:
A B C D
0 0 2 4 6
1 1 3 NaN NaN
How would I do that? Similar questions I've found only seem to look at doing this for one column or just adding the Series as a new row at the end of the DataFrame but not updating an existing row by adding multiple new columns from a Series.
I've tried df.loc[0,list(['C','D'])] = [4,6] which was suggested in another answer but that only works if ['C','D'] are already existing columns in the Dataframe. df.assign(**s) works but then assigns the Series values to all rows.
join with transpose:
df.join(pd.DataFrame(s).T)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
Or use concat
pd.concat([df, pd.DataFrame(s).T], axis=1)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN

Issue dividing one dataframe by another

I have rewritten this question several times now as I thought I had solved the issue, however it appears not. I am currently trying to loop through columns of df1 and df2, dividing one column by the other to populate columns of a new df3 but I am having the issue that that all my cells are NaN.
My code for the loop is as follows:
#Divide One by the Other. Set up for loop
i = 0
for country in df3.columns:
df3[country] = df1.iloc[:, [i]].div(df2.iloc[:, [i]])
i += 1
The resulting df3 is a matrix full of NaNs only.
My df1 is of the structure:
And my df2 of the structure:
And I am creating my df3 as:
df3 = pd.DataFrame(index = df1.index, columns=tickers.index)
Which looks like (before population):
The only potential issue is the multi index in df3 perhaps? Struggling to see why they don't divide through.
The reason why your current approach does not work is because you're dividing pd.Series objects. pandas automatically tries to align the indices when dividing. Here's an example.
df1
5 0
4 1
3 2
2 3
1 4
dtype: int64
df2
5 0
6 1
7 2
8 3
9 4
dtype: int64
df1 / df2 # you'd expect all 1's in each row, but...
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64
Ensure that you have the same number of rows and columns in df1 and df2, and then this should becomes easy if you divide the np.array counterparts of the dataframes.
v = df1.values / df2.values
df3 = pd.DataFrame(v, index=df1.index, columns=tickers.index)

Group by value of sum of columns with Pandas

I got lost in Pandas doc and features trying to figure out a way to groupby a DataFrame by the values of the sum of the columns.
for instance, let say I have the following data :
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
I would like columns a, b and c to be grouped since they all have their sum equal to 1. The resulting DataFrame would have columns labels equals to the sum of the columns it summed. Like this :
1 9
0 2 2
1 1 3
2 0 4
Any idea to put me in the good direction ? Thanks in advance !
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
1 9
0 2 2
1 1 3
2 0 4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over the 0 axis (the index), giving you the two groups: 1 (columns a, b, and, c) and 9 (column d) . You want to group the columns (axis=1), and take the sum of each group.
Because pandas is designed with database concepts in mind, it's really expected information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print df.groupby('totals').sum().transpose()
#totals 1 9
#0 2 2
#1 1 3
#2 0 4

Categories

Resources