How to add pandas data frame column based on other rows values - python

I am trying to add a new column and set its value based on other rows' values. Let's say we have the following data frame:
df = pd.DataFrame({
'B':[1,2,3,4,5,6],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
})
With this corresponding output
B C D
1 7 1
2 8 3
3 9 5
4 4 7
5 2 1
6 3 0
I want to add a new column 'E', which has the following value:
E = df.C value where B = B + 2.
For example, the first value of E should be 3 (we select the row where B = 0+2 = 2, and select C value from that row).
I tried the following
f['E'] = np.where(f.B == (f['B']+2))['C']
But it's not working

You can set B as the index and use the resulting Series to map the shifted B values:
df['E'] = df['B'].add(2).map(df.set_index('B')['C'])
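Output (here B runs from 0 to 5, C from 1 to 6, and D holds 7, 8, 9, 4, 2, 3, which matches the question's worked example rather than its constructor call):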
B C D E
0 0 1 7 3.0
1 1 2 8 4.0
2 2 3 9 5.0
3 3 4 4 6.0
4 4 5 2 NaN
5 5 6 3 NaN
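For reference, here is a minimal, self-contained sketch of the same mapping applied to the frame exactly as constructed in the question. The name `lookup` is introduced only for illustration, and the B values are assumed to be unique (a Series with duplicate index labels cannot be used for this kind of mapping).

```python
import pandas as pd

df = pd.DataFrame({
    'B': [1, 2, 3, 4, 5, 6],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
})

# Lookup table: B value -> C value of the same row (B assumed unique).
lookup = df.set_index('B')['C']

# For each row, fetch C from the row whose B equals this row's B + 2.
# Rows with no matching B (here B + 2 = 7 and 8) get NaN.
df['E'] = df['B'].add(2).map(lookup)
print(df)
```

With this data, E comes out as 9, 4, 2, 3 for the first four rows and NaN for the last two, because no row has B equal to 7 or 8.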

Related

Split DataFrame and loop functions over subframes

I have a dataframe with the following structure:
df = pd.DataFrame({'TIME':list('12121212'),'NAME':list('aabbccdd'), 'CLASS':list("AAAABBBB"),
'GRADE':[4,5,4,5,4,5,4,5]}, columns = ['TIME', 'NAME', 'CLASS','GRADE'])
print(df):
TIME NAME CLASS GRADE
0 1 a A 4
1 2 a A 5
2 1 b A 4
3 2 b A 5
4 1 c B 4
5 2 c B 5
6 1 d B 4
7 2 d B 5
What I need to do is split the above dataframe into multiple dataframes according to the variable CLASS, convert the dataframe from long to wide (such that we have NAMES as columns and GRADE as the main entry in the data matrix) and then iterate other functions over the smaller CLASS dataframes. If I create a dict object as suggested here, I obtain:
d = dict(tuple(df.groupby('CLASS')))
print(d):
{'A': TIME NAME CLASS GRADE
0 1 a A 4
1 2 a A 5
2 1 b A 4
3 2 b A 5, 'B': TIME NAME CLASS GRADE
4 1 c B 4
5 2 c B 5
6 1 d B 4
7 2 d B 5}
In order to convert the dataframe from long to wide, I used the function pivot_table from pandas:
for names, classes in d.items():
    newdata=df.pivot_table(index="TIME", columns="NAME", values="GRADE")
print(newdata):
NAME a b c d
TIME
1 4 4 4 4
2 5 5 5 5
So far so good. However, once I obtain the newdata dataframe I am not able to access the smaller dataframes created in d, since the variable CLASS is now missing from the dataframe (as it should be). Suppose I then need to iterate a function over the two smaller subframes CLASS==A and CLASS==B. How would I be able to do this using a for loop if I am not able to define the dataset structure using the column CLASS?
Try using groupby+apply to conserve the group names:
(df.groupby('CLASS')
.apply(lambda d: d.pivot_table(index="TIME", columns="NAME", values="GRADE"))
)
output:
            a    b    c    d
CLASS TIME
A     1     4.0  4.0  NaN  NaN
      2     5.0  5.0  NaN  NaN
B     1     NaN  NaN  4.0  4.0
      2     NaN  NaN  5.0  5.0
Other possibility, loop over the groups, keeping CLASS as column:
for group_name, group_df in df.groupby('CLASS', as_index=False):
    print(f'working on group {group_name}')
    print(group_df)
output:
working on group A
TIME NAME CLASS GRADE
0 1 a A 4
1 2 a A 5
2 1 b A 4
3 2 b A 5
working on group B
TIME NAME CLASS GRADE
4 1 c B 4
5 2 c B 5
6 1 d B 4
7 2 d B 5
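If separate wide tables per class are needed for later processing, one possible sketch is to pivot each group on its own and keep the results in a dict keyed by the class label. The name `wide` and the use of `mean()` as the per-class function are placeholders chosen for illustration, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'TIME': list('12121212'),
    'NAME': list('aabbccdd'),
    'CLASS': list('AAAABBBB'),
    'GRADE': [4, 5, 4, 5, 4, 5, 4, 5],
})

# One wide table per CLASS, keyed by the class label.
# Pivoting the group (not the full df) keeps only that class's names.
wide = {
    class_name: group.pivot_table(index='TIME', columns='NAME', values='GRADE')
    for class_name, group in df.groupby('CLASS')
}

for class_name, table in wide.items():
    # Any per-class function can go here; mean() is only a placeholder.
    print(class_name, table.mean().to_dict())
```

Because each table is built from its own group, the table for class A has only columns a and b, and the class label stays available as the dict key.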

Compute difference between values in dataframe column

I have this dataframe:
a b c d
4 7 5 12
3 8 2 8
1 9 3 5
9 2 6 4
I want column 'd' to become the difference between the n-th value of column 'a' and the (n+1)-th value of column 'a'.
I tried this, but it doesn't run:
for i in data.index-1:
    data.iloc[i]['d']=data.iloc[i]['a']-data.iloc[i+1]['a']
Can anyone help me?
Basically what you want is diff.
df = pd.DataFrame.from_dict({"a":[4,3,1,9]})
df["d"] = df["a"].diff(periods=-1)
print(df)
Output
a d
0 4 1.0
1 3 2.0
2 1 -8.0
3 9 NaN
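To see what diff(periods=-1) computes, it can help to write the same thing with shift. The sketch below builds both versions side by side; the column names `d_shift` and `d_diff` are made up for the comparison.

```python
import pandas as pd

df = pd.DataFrame({'a': [4, 3, 1, 9]})

# shift(-1) moves every value up one row, so row n lines up with a[n + 1].
df['d_shift'] = df['a'] - df['a'].shift(-1)

# diff(periods=-1) is the built-in form of the same subtraction.
df['d_diff'] = df['a'].diff(periods=-1)
print(df)
```

Both columns hold a[n] - a[n+1], and the last row is NaN because it has no following value.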
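Let's try a simple way with an explicit loop:
import numpy as np
import pandas as pd

df=pd.DataFrame.from_dict({'a':[2,4,8,15]})
diff=[]
for i in range(len(df)-1):
    # current value minus the next one, as asked in the question
    diff.append(df['a'][i]-df['a'][i+1])
diff.append(np.nan)  # the last row has no following value
df['d']=diff
print(df)
Output:
a d
0 2 -2.0
1 4 -4.0
2 8 -7.0
3 15 NaN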

How to change values of dataframe cells based on following row cell value?

I'm working in Python, with Pandas DataFrames.
I have a problem where my dataframe looks like this:
Index A B Copy_of_B
1 a 0 0
2 a 1 1
3 a 5 5
4 b 0 0
5 b 4 4
6 c 6 6
My expected output is:
Index A B Copy_of_B
1 a 0 1
2 a 1 1
3 a 5 5
4 b 0 4
5 b 4 4
6 c 6 6
I would like to replace the 0 values in the Copy_of_B column with the values in the following row, but I don't want to use a for loop to iterate.
Is there an easy solution for this?
Thanks,
Barna
This makes use of the fact that your DataFrame's index is composed of consecutive numbers.
Start by creating two indices:
ind = df[df.Copy_of_B == 0].index
ind2 = ind + 1
The first contains index values of rows where Copy_of_B == 0.
The second contains indices of subsequent rows.
Then, to "copy" data from subsequent rows to rows containing zeroes, run:
df.loc[ind, 'Copy_of_B'] = df.loc[ind2, 'Copy_of_B'].tolist()
As you can see, this works without any loop running over the whole DataFrame.
You can use mask and bfill:
df['Copy_of_B'] = df['B'].mask(df['B'].eq(0)).bfill()
Output:
Index A B Copy_of_B
0 1 a 0 1.0
1 2 a 1 1.0
2 3 a 5 5.0
3 4 b 0 4.0
4 5 b 4 4.0
5 6 c 6 6.0
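The mask and bfill result is a float column because NaN is used as a temporary placeholder. A minimal sketch of one way to get integers back is shown below. It assumes that a trailing 0 (which has no following row) should simply keep its original value, a rule the question does not specify. Note also that with consecutive zeros, bfill pulls in the next non-zero value rather than the immediately following row.

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('aaabbc'),
    'B': [0, 1, 5, 0, 4, 6],
})

# Hide the zeros as NaN, then pull the next non-missing value upward.
filled = df['B'].mask(df['B'].eq(0)).bfill()

# bfill leaves NaN if the last row is 0; fall back to the original value
# there (an assumption), then restore the integer dtype.
df['Copy_of_B'] = filled.fillna(df['B']).astype(int)
print(df)
```

For the data in the question this gives 1, 1, 5, 4, 4, 6 as integers.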

Get only two values from 4 specified columns and merge valid values into 2 columns

df:
index a b c d
0 1 2 NaN NaN
1 2 NaN 3 NaN
2 5 NaN 6 NaN
3 1 NaN NaN 5
Expected df:
index one two
0 1 2
1 2 3
2 5 6
3 1 5
The output example above is self-explanatory. Basically, I just need to move the two non-NaN values from columns [a, b, c, d] into another set of two columns ["one", "two"].
Back-fill the missing values along each row and select the first two columns:
df = df.bfill(axis=1).iloc[:, :2].astype(int)
df.columns = ["one", "two"]
print (df)
one two
index
0 1 2
1 2 3
2 5 6
3 1 5
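A self-contained sketch of the back-fill approach follows, with the frame rebuilt from the question's table. It relies on an assumption that holds for this data: column a is always present and each row has exactly one other non-missing value. If a could be missing, the first two back-filled columns would repeat the same value. The name `out` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 5, 1],
    'b': [2, np.nan, np.nan, np.nan],
    'c': [np.nan, 3, 6, np.nan],
    'd': [np.nan, np.nan, np.nan, 5],
})

# Back-fill along each row so the next non-missing value slides left
# into column b, then keep only the first two columns.
out = df.bfill(axis=1).iloc[:, :2].astype(int)
out.columns = ['one', 'two']
print(out)
```

Column one keeps the values of a, and column two holds whichever of b, c, d was filled in for that row.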
Or combine_first with pop (pop already removes b, c and d from df, so no separate drop is needed):
df['two']=df.pop('b').combine_first(df.pop('c')).combine_first(df.pop('d'))
df.columns=['index','one','two']
Or fillna:
df['two']=df.pop('b').fillna(df.pop('c')).fillna(df.pop('d'))
df.columns=['index','one','two']
In both cases,
print(df)
gives:
index one two
0 0 1 2.0
1 1 2 3.0
2 2 5 6.0
3 3 1 5.0
If you want output like @jezrael's, also set the index (this works in both cases):
df=df.set_index('index')
And then
print(df)
gives:
one two
index
0 1 2.0
1 2 3.0
2 5 6.0
3 1 5.0

Adding rows in dataframe based on values of another dataframe

I have the following two dataframes. Please note that 'amt' is grouped by 'id' in both dataframes.
df1
id code amt
0 A 1 5
1 A 2 5
2 B 3 10
3 C 4 6
4 D 5 8
5 E 6 11
df2
id code amt
0 B 1 9
1 C 12 10
I want to add a row to df2 for every id of df1 that is not contained in df2. For example, since ids A, D and E are not contained in df2, I want to add a row for each of them. Each appended row should contain the missing id, a null value for code, and the amt value stored in df1.
The result should be something like this:
id code name
0 B 1 9
1 C 12 10
2 A nan 5
3 D nan 8
4 E nan 11
I would highly appreciate if I can get some guidance on it.
Using pd.concat:
df=df1.drop(columns='code').drop_duplicates()
pd.concat([df2,df[~df.id.isin(df2.id)]],axis=0).rename(columns={'amt':'name'}).reset_index(drop=True)
Output:
name code id
0 9 1.0 B
1 10 12.0 C
2 5 NaN A
3 8 NaN D
4 11 NaN E
Drop duplicate ids from df1, append df2, drop every id that now appears more than once (keep=False), and append what remains to df2.
df2.append(
df1.drop_duplicates('id').append(df2)
.drop_duplicates('id', keep=False).assign(code=np.nan),
ignore_index=True
)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
Slight variation
m = ~np.in1d(df1.id.values, df2.id.values)
d = ~df1.duplicated('id').values
df2.append(df1[m & d].assign(code=np.nan), ignore_index=True)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
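DataFrame.append was removed in pandas 2.0, so the two append-based snippets above only run on older versions. Below is a sketch of the same idea with pd.concat. The frames are rebuilt from the tables in the question, and the names `missing` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'id': list('AABCDE'),
    'code': [1, 2, 3, 4, 5, 6],
    'amt': [5, 5, 10, 6, 8, 11],
})
df2 = pd.DataFrame({'id': ['B', 'C'], 'code': [1, 12], 'amt': [9, 10]})

# Rows of df1 whose id is absent from df2, one row per id, code blanked out.
missing = (
    df1[~df1['id'].isin(df2['id'])]
    .drop_duplicates('id')
    .assign(code=np.nan)
)

# Stack the new rows under df2 and renumber the index.
result = pd.concat([df2, missing], ignore_index=True)
print(result)
```

The result keeps the two original df2 rows and adds A, D and E with NaN in code and their amt from df1.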
