Pandas Apply function on Column - python

I have a few columns in the Pandas dataframe which are sparsely populated, and I wish to modify their respective values to, whether they are filled or not.
For instance, my dataframe is:
A B C D
0 8 8
1 9 4 4
2 5 3 8
3 4 1
I want it to be :
A B C D
0 8 F F 8
1 9 T F 4
2 5 F T 8
3 4 F F 1
F = False, T= True
Running pandas.isnull(df['B']) returns a new series with boolean values, how do I update it efficiently in the original dataframe?

You can just do df['B'] = df.B.notnull().

Related

Copy a Multi-Index column of a Pandas Dataframe including the second header

Background - I have a dataset with two headers that I've read from a CSV...
df = pd.read_csv(file, header=[0,1])
print(df)
A B C
a b c
---------------
0 1 2 3
1 4 5 6
2 7 8 9
...
What I Want - I'd like to duplicate one of the columns (say, A) into a new column D, such that...
A B C D
a b c a
----------------------
0 1 2 3 1
1 4 5 6 4
2 7 8 9 7
...
What I'm Getting - ...but when I do this using df['D'] = df['A'], what I get is a copy without the content of the second header...
A B C D
a b c
----------------------
0 1 2 3 1
1 4 5 6 4
2 7 8 9 7
...
TLDR - How can I capture the content of the second header of my multi-index dataframe when copying a column?
Need to specify both levels of multiindex
# assign ('A', 'a') values to ('D', 'a') column
df[('D','a')] = df[('A','a')] # or df['D','a'] = df['A']

Pandas melt function using column index positions rather than colum names

Is there a way to set column names for arguments as column index position, rather than column names?
Every example that I see is written with column names on value_vars. I need to use the column index.
For instance, instead of:
df2 = pd.melt(df,value_vars=['asset1','asset2'])
Using something similar to:
df2 = pd.melt(df,value_vars=[0,1])
Select columns names by indexing:
df = pd.DataFrame({
'asset1':list('acacac'),
'asset2':[4]*6,
'A':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4]
})
df2 = pd.melt(df,
id_vars=df.columns[[0,1]],
value_vars=df.columns[[2,3]],
var_name= 'c_name',
value_name='Value')
print (df2)
asset1 asset2 c_name Value
0 a 4 A 7
1 c 4 A 8
2 a 4 A 9
3 c 4 A 4
4 a 4 A 2
5 c 4 A 3
6 a 4 D 1
7 c 4 D 3
8 a 4 D 5
9 c 4 D 7
10 a 4 D 1
11 c 4 D 0

Returning dataframe of multiple rows/columns per one row of input

I am using apply to leverage one dataframe to manipulate a second dataframe and return results. Here is a simplified example that I realize could be more easily answered with "in" logic, but for now let's keep the use of .apply() as a constraint:
import pandas as pd
df1 = pd.DataFrame({'Name':['A','B'],'Value':range(1,3)})
df2 = pd.DataFrame({'Name':['A']*3+['B']*4+['C'],'Value':range(1,9)})
def filter_df(x, df):
return df[df['Name']==x['Name']]
df1.apply(filter_df, axis=1, args=(df2, ))
Which is returning:
0 Name Value
0 A 1
1 A 2
2 ...
1 Name Value
3 B 4
4 B 5
5 ...
dtype: object
What I would like to see instead is one formated DataFrame with Name and Value headers. All advice appreciated!
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
In my opinion, this cannot be done solely based on apply, you need pandas.concat:
result = pd.concat(df1.apply(filter_df, axis=1, args=(df2,)).to_list())
print(result)
Output
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7

How do I convert my 2D numpy array to a pandas dataframe with given categories?

I have an array called 'values' which features 2 columns of mean reaction time data from 10 individuals. The first column refers to data collected for a single individual in condition A, the second for that same individual in condition B:
array([[451.75 , 488.55555556],
[552.44444444, 590.40740741],
[629.875 , 637.62962963],
[454.66666667, 421.88888889],
[637.16666667, 539.94444444],
[538.83333333, 516.33333333],
[463.83333333, 448.83333333],
[429.2962963 , 497.16666667],
[524.66666667, 458.83333333]])
I would like to plot these data using seaborn, to display the mean values and connected single values for each individual across the two conditions. What is the simplest way to convert the array 'values' into a 3 column DataFrame, whereby one column features all the values, another features a label distinguishing that value as condition A or condition B, and a final column which provides a number for each individual (i.e., 1-10)? For example, as follows:
Value Condition Individual
451.75 A 1
488.56 B 1
488.55 A 2
...etc
melt
You can do that using pd.melt:
pd.DataFrame(data, columns=['A','B']).reset_index().melt(id_vars = 'index')\
.rename(columns={'index':'Individual'})
Individual variable value
0 0 A 451.750000
1 1 A 552.444444
2 2 A 629.875000
3 3 A 454.666667
4 4 A 637.166667
5 5 A 538.833333
6 6 A 463.833333
7 7 A 429.296296
8 8 A 524.666667
9 0 B 488.555556
10 1 B 590.407407
11 2 B 637.629630
12 3 B 421.888889
13 4 B 539.944444
14 5 B 516.333333
15 6 B 448.833333
16 7 B 497.166667
17 8 B 458.833333
This should work
import pandas as pd
import numpy as np
np_array = np.array([[451.75 , 488.55555556],
[552.44444444, 590.40740741],
[629.875 , 637.62962963],
[454.66666667, 421.88888889],
[637.16666667, 539.94444444],
[538.83333333, 516.33333333],
[463.83333333, 448.83333333],
[429.2962963 , 497.16666667],
[524.66666667, 458.83333333]])
pd_df = pd.DataFrame(np_array, columns=["A", "B"])
num_individuals = len(pd_df.index)
pd_df = pd_df.melt()
pd_df["INDIVIDUAL"] = [(i)%(num_individuals) + 1 for i in pd_df.index]
pd_df
variable value INDIVIDUAL
0 A 451.750000 1
1 A 552.444444 2
2 A 629.875000 3
3 A 454.666667 4
4 A 637.166667 5
5 A 538.833333 6
6 A 463.833333 7
7 A 429.296296 8
8 A 524.666667 9
9 B 488.555556 1
10 B 590.407407 2
11 B 637.629630 3
12 B 421.888889 4
13 B 539.944444 5
14 B 516.333333 6
15 B 448.833333 7
16 B 497.166667 8
17 B 458.833333 9

How do I get panda's "update" function to overwrite numbers in one column but not another?

Currently, I'm using:
csvdata.update(data, overwrite=True)
How can I make it update and overwrite a specific column but not another, small but simple question, is there a simple answer?
Rather than update with the entire DataFrame, just update with the subDataFrame of columns which you are interested in. For example:
In [11]: df1
Out[11]:
A B
0 1 99
1 3 99
2 5 6
In [12]: df2
Out[12]:
A B
0 a 2
1 b 4
2 c 6
In [13]: df1.update(df2[['B']]) # subset of cols = ['B']
In [14]: df1
Out[14]:
A B
0 1 2
1 3 4
2 5 6
If you want to do it for a single column:
import pandas
import numpy
csvdata = pandas.DataFrame({"a":range(12), "b":range(12)})
other = pandas.Series(list("abcdefghijk")+[numpy.nan])
csvdata["a"].update(other)
print csvdata
a b
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4
5 f 5
6 g 6
7 h 7
8 i 8
9 j 9
10 k 10
11 11 11
or, as long as the column names match, you can do this:
other = pandas.DataFrame({"a":list("abcdefghijk")+[numpy.nan], "b":list("abcdefghijk")+[numpy.nan]})
csvdata.update(other["a"])

Categories

Resources