Pandas - adding another column with the same name as another column - python

For example, see below:
1  2  3  4
X  X  X  X
X  X  X  X
X  X  X  X
X  X  X  X
...
How would I add another column that is also headed 4? I have used:
df = df.assign(4=np.zeros(shape=(df.shape[0], 1)))
however, it just changes the existing column 4 to what I have entered.
I hope this question is clear enough! Future state should look like this:
1  2  3  4  4
X  X  X  X  X
X  X  X  X  X
X  X  X  X  X
X  X  X  X  X
...

As @anon01 states, this is not a good idea, but you can use pd.concat:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(-1, 5))
pd.concat([df, pd.Series([np.nan] * 5).rename(4)], axis=1)
And as @CameronRiddell states:
pd.concat([df, pd.Series(np.nan, name=4)], axis=1)
Output:
    0   1   2   3   4    4
0   0   1   2   3   4  NaN
1   5   6   7   8   9  NaN
2  10  11  12  13  14  NaN
3  15  16  17  18  19  NaN
4  20  21  22  23  24  NaN
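For reference, a minimal sketch (reusing the df built above; variable names are mine) of why plain assignment cannot produce the duplicate, and how the duplicated label behaves afterwards:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(-1, 5))   # columns 0..4

# Plain assignment (and assign) treats the label as a unique key,
# so this overwrites the existing column 4 instead of adding a second one.
df_overwritten = df.copy()
df_overwritten[4] = 0

# DataFrame.insert can create the duplicate directly if you opt in.
df_inserted = df.copy()
df_inserted.insert(df_inserted.shape[1], 4, np.zeros(df.shape[0]), allow_duplicates=True)

# After the concat above, the label 4 is duplicated; selecting it
# returns a DataFrame holding both columns that share the label.
out = pd.concat([df, pd.Series([np.nan] * 5).rename(4)], axis=1)
print(out[4].shape)   # (5, 2)
With duplicate labels in place, positional access such as out.iloc[:, -1] is the unambiguous way to reach the newly added column.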

Related

how to repeat each row n times in pandas so that it looks like this?

I want to know how to repeat each row n times in pandas, in the fashion shown below.
I want the result below (with df_repeat = pd.concat([df]*2, ignore_index=False) I can't get the expected result):
Original Dataset:
index  value
0      x
1      x
2      x
3      x
4      x
5      x
Dataframe I want:
index  value
0      x
0      x
1      x
1      x
2      x
2      x
3      x
3      x
4      x
4      x
5      x
5      x
You can repeat the index:
df_repeat = df.loc[df.index.repeat(2)]
output:
index value
0 0 x
0 0 x
1 1 x
1 1 x
2 2 x
2 2 x
3 3 x
3 3 x
4 4 x
4 4 x
5 5 x
5 5 x
For a clean, new index:
df_repeat = df.loc[df.index.repeat(2)].reset_index(drop=True)
output:
index value
0 0 x
1 0 x
2 1 x
3 1 x
4 2 x
5 2 x
6 3 x
7 3 x
8 4 x
9 4 x
10 5 x
11 5 x
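The same pattern generalizes from 2 to any repeat count; a minimal sketch (n is an assumed integer):
n = 3  # assumed repeat count
df_repeat = df.loc[df.index.repeat(n)].reset_index(drop=True)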
on Series
Should you have a Series as input, there is a Series.repeat method:
# create a Series from the above DataFrame
s = df.set_index('index')['value']
s.repeat(2)
output:
index
0 x
0 x
1 x
1 x
2 x
2 x
3 x
3 x
4 x
4 x
5 x
5 x
Name: value, dtype: object

Add label-column to DataFrame

I have two DataFrames for example
df1:
0 1 2 3
a 1 2 3 4
b 10 20 30 40
c 100 200 300 400
------------------
df2:
0
0 x
1 y
2 z
Now I want to combine both like:
df_new:
value label
0 1 x
1 2 x
2 3 x
3 4 x
0 10 y
1 20 y
2 30 y
3 40 y
0 100 z
1 200 z
2 300 z
3 400 z
I wrote some really awkward code like:
df_new = pd.DataFrame()
for i, j in zip(df1.index, df2.index):
    x = df1.loc[i]
    y = df2.loc[j]
    label = np.full(x.shape[0], y)
    df = pd.DataFrame({'value': x, 'label': label})
    df_new = pd.concat([df_new, df], axis=0)
print(df_new)
But I imagine there is a pandas function like pd.melt, or something similar, that can do this better at larger scale.
If both DataFrames have the same length, it is possible to create the index of df1 from column 0 of df2, reshape with DataFrame.stack, and finally do some data processing:
df = (df1.set_index(df2[0])
         .stack()
         .reset_index(level=1, drop=True)
         .rename_axis('lab')
         .reset_index(name='val')[['val', 'lab']])
print(df)
val lab
0 1 x
1 2 x
2 3 x
3 4 x
4 10 y
5 20 y
6 30 y
7 40 y
8 100 z
9 200 z
10 300 z
11 400 z
Solution with DataFrame.melt, appending the second df to the first with DataFrame.join:
df = (df1.reset_index(drop=True)
         .join(df2.add_prefix('label'))
         .melt(['label0', 'label1'], ignore_index=False)
         .sort_index(ignore_index=True)
         .drop('variable', axis=1)[['value', 'label0', 'label1']])
print(df)
value label0 label1
0 1 x xx
1 2 x xx
2 3 x xx
3 4 x xx
4 10 y yy
5 20 y yy
6 30 y yy
7 40 y yy
8 100 z zz
9 200 z zz
10 300 z zz
11 400 z zz
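As a further alternative (not from the answers above), when df1 and df2 are exactly as shown in the question, the same result can be built by flattening df1 and repeating the labels with NumPy; a sketch:
import numpy as np
import pandas as pd

df_new = pd.DataFrame({
    'value': df1.to_numpy().ravel(),                      # row-major flatten of df1
    'label': np.repeat(df2[0].to_numpy(), df1.shape[1]),  # one label per column of df1
})
This yields a fresh RangeIndex rather than the repeating 0-3 index shown in the desired output.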

Conditionally delete rows by ID in Pandas

I have a pandas df:
ID Score C D
1 2 x y
1 nan x y
1 2 x y
2 3 x y
2 2 x y
3 2 x y
3 4 x y
3 3 x y
For each ID, I'd like to remove rows where df.Score == 2, but only when a 3 or 4 is present for that ID. I'd like to keep NaNs and 2s when the only score for an ID is 2.
So I get:
ID Score C D
1 2 x y
1 nan x y
1 2 x y
2 3 x y
3 4 x y
3 3 x y
Any help, much appreciated
Use:
df[~df.groupby('ID')['Score'].apply(lambda x: x.eq(2) & x.isin([3, 4]).any())]
ID Score C D
0 1 2.0 x y
1 1 NaN x y
2 1 2.0 x y
3 2 3.0 x y
6 3 4.0 x y
7 3 3.0 x y
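An equivalent mask (an alternative sketch, not part of the original answer) can be built with groupby.transform, which keeps the original row index and avoids relying on how apply aligns it:
# True for every row whose ID group contains a 3 or 4
has_high = df.groupby('ID')['Score'].transform(lambda s: s.isin([3, 4]).any())

# drop rows scoring 2 only within those groups
out = df[~(df['Score'].eq(2) & has_high)]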

Splitting a dataframe in python

I have a dataframe df=
Type ID QTY_1 QTY_2 RES_1 RES_2
X 1 10 15 y N
X 2 12 25 N N
X 3 25 16 Y Y
X 4 14 62 N Y
X 5 21 75 Y Y
Y 1 10 15 y N
Y 2 12 25 N N
Y 3 25 16 Y Y
Y 4 14 62 N N
Y 5 21 75 Y Y
I want two different DataFrames as the result, each containing the QTY column for the rows that have a Y in the corresponding RES column.
Below is my expected result
df1=
Type ID QTY_1
X 1 10
X 3 25
X 5 21
Y 1 10
Y 3 25
Y 5 21
df2 =
Type ID QTY_2
X 3 16
X 4 62
X 5 75
Y 3 16
Y 5 75
You can do this:
df1 = df[['Type', 'ID', 'QTY_1']].loc[df.RES_1.isin(['Y', 'y'])]
df2 = df[['Type', 'ID', 'QTY_2']].loc[df.RES_2.isin(['Y', 'y'])]
or
df1 = df[['Type', 'ID', 'QTY_1']].loc[df.RES_1.str.lower() == 'y']
df2 = df[['Type', 'ID', 'QTY_2']].loc[df.RES_2.str.lower() == 'y']
Output:
>>> df1
Type ID QTY_1
0 X 1 10
2 X 3 25
4 X 5 21
5 Y 1 10
7 Y 3 25
9 Y 5 21
>>> df2
Type ID QTY_2
2 X 3 16
3 X 4 62
4 X 5 75
7 Y 3 16
9 Y 5 75
Use a dictionary
It's good practice to use a dictionary for a variable number of variables. Although in this case there may be only a couple of categories, you benefit from organized data. For example, you can access RES_1 data via dfs[1].
dfs = {i: df.loc[df['RES_' + str(i)].str.lower() == 'y', ['Type', 'ID', 'QTY_' + str(i)]]
       for i in range(1, 3)}
print(dfs)
{1: Type ID QTY_1
0 X 1 10
2 X 3 25
4 X 5 21
5 Y 1 10
7 Y 3 25
9 Y 5 21,
2: Type ID QTY_2
2 X 3 16
3 X 4 62
4 X 5 75
7 Y 3 16
9 Y 5 75}
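A brief usage sketch for the resulting dictionary (same names as above):
df1 = dfs[1]  # rows where RES_1 is Y/y, keeping QTY_1
df2 = dfs[2]  # rows where RES_2 is Y/y, keeping QTY_2

# or iterate over however many QTY/RES pairs exist
for i, frame in dfs.items():
    print(f'QTY_{i}: {len(frame)} rows')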
You need:
df1 = df.loc[(df['RES_1']=='Y') | (df['RES_1']=='y')].drop(['QTY_2', 'RES_1', 'RES_2'], axis=1)
df2 = df.loc[(df['RES_2']=='Y') | (df['RES_2']=='y')].drop(['QTY_1', 'RES_1', 'RES_2'], axis=1)
print(df1)
print(df2)
Output:
Type ID QTY_1
0 X 1 10
2 X 3 25
4 X 5 21
5 Y 1 10
7 Y 3 25
9 Y 5 21
Type ID QTY_2
2 X 3 16
3 X 4 62
4 X 5 75
7 Y 3 16
9 Y 5 75

pandas dataframe inserting null values

I have two dataframes:
index a b c d
1 x x x x
2 x nan x x
3 x x x x
4 x nan x x
index a b e
3 x nan x
4 x x x
5 x nan x
6 x x x
I want to make it into the following, where we simply get rid of the NaN values. An easier version of this question is where the second dataframe has no nan values....
index a b c d e
1 x x x x x
2 x x x x x
3 x x x x x
4 x x x x x
5 x x x x x
6 x x x x x
You may use combine_first with fillna:
DataFrame.combine_first(other): combine two DataFrame objects and default to non-null values in the frame calling the method. The resulting index and columns will be the union of the respective indexes and columns.
You can read more in the pandas documentation.
import pandas as pd
from numpy import nan

d1 = pd.DataFrame([[nan, 1, 1], [2, 2, 2], [3, 3, 3]], columns=['a', 'b', 'c'])
d1
a b c
0 NaN 1 1
1 2 2 2
2 3 3 3
d2 = pd.DataFrame([[1,nan,1],[nan,2,2],[3,3,nan]], columns=['b','d','e'])
d2
b d e
0 1 NaN 1
1 NaN 2 2
2 3 3 NaN
d2.combine_first(d1)  # d2's values are kept; d1 fills in only where d2 has NaN
a b c d e
0 NaN 1 1 NaN 1
1 2 2 2 2 2
2 3 3 3 3 NaN
d2.combine_first(d1).fillna(5) # simply fill NaN with a value
a b c d e
0 5 1 1 5 1
1 2 2 2 2 2
2 3 3 3 3 5
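Applied to the two frames from the question (df1 and df2 are assumed names for them; combine_first aligns on the index, so rows 1-6 and columns a-e all appear in the result):
# df1: columns a-d, index 1-4; df2: columns a, b, e, index 3-6
result = df1.combine_first(df2)  # df1's non-null values win, df2 fills the gaps
result = result.fillna('x')      # fill any remaining NaN, e.g. with 'x' as in the example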
Use nan_to_num to replace a nan with a number:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html
Just apply this:
from numpy import nan_to_num
df2 = df.apply(nan_to_num)
Then you can merge the arrays however you want.
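Note that nan_to_num substitutes 0.0 for NaN by default, rather than dropping the value; a tiny sketch:
import numpy as np

np.nan_to_num(np.array([1.0, np.nan, 3.0]))
# array([1., 0., 3.])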
