Add label-column to DataFrame - python

I have two DataFrames, for example:
df1:
     0    1    2    3
a    1    2    3    4
b   10   20   30   40
c  100  200  300  400

df2:
   0
0  x
1  y
2  z
Now I want to combine both like this:

df_new:
   value label
0      1     x
1      2     x
2      3     x
3      4     x
0     10     y
1     20     y
2     30     y
3     40     y
0    100     z
1    200     z
2    300     z
3    400     z
I wrote some really awkward code for this:

df_new = pd.DataFrame()
for i, j in zip(df1.index, df2.index):
    x = df1.loc[i]
    y = df2.loc[j]
    label = np.full(x.shape[0], y)
    df = pd.DataFrame({'value': x, 'label': label})
    df_new = pd.concat([df_new, df], axis=0)
print(df_new)
But I can imagine there is a pandas function like pd.melt or something similar that can do this better at a larger scale.
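For reference, a minimal setup that reproduces the sample frames above (column names and index labels taken from the question), so the answers below can be run directly:

import numpy as np
import pandas as pd

# df1: 3 rows (a, b, c) x 4 numeric columns (0-3)
df1 = pd.DataFrame([[1, 2, 3, 4],
                    [10, 20, 30, 40],
                    [100, 200, 300, 400]],
                   index=['a', 'b', 'c'])

# df2: one column named 0 holding the labels
df2 = pd.DataFrame({0: ['x', 'y', 'z']})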

If both DataFrames have the same length, you can set the index of df1 from column 0 of df2, reshape with DataFrame.stack, and finish with a bit of post-processing:
df = (df1.set_index(df2[0])
         .stack()
         .reset_index(level=1, drop=True)
         .rename_axis('lab')
         .reset_index(name='val')[['val', 'lab']])
print(df)
    val lab
0     1   x
1     2   x
2     3   x
3     4   x
4    10   y
5    20   y
6    30   y
7    40   y
8   100   z
9   200   z
10  300   z
11  400   z
Another solution uses DataFrame.melt after appending the second DataFrame to the first with DataFrame.join (here df2 is assumed to have two label columns, which become label0 and label1):
df = (df1.reset_index(drop=True)
         .join(df2.add_prefix('label'))
         .melt(['label0', 'label1'], ignore_index=False)
         .sort_index(ignore_index=True)
         .drop('variable', axis=1)[['value', 'label0', 'label1']]
      )
print(df)
    value label0 label1
0       1      x     xx
1       2      x     xx
2       3      x     xx
3       4      x     xx
4      10      y     yy
5      20      y     yy
6      30      y     yy
7      40      y     yy
8     100      z     zz
9     200      z     zz
10    300      z     zz
11    400      z     zz
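For the single-label df2 from the question, a shorter alternative (not from the answers above; a sketch assuming df1 and df2 as defined in the question) is to flatten df1 row-wise with numpy and repeat each label once per column of df1:

import numpy as np
import pandas as pd

df_new = pd.DataFrame({
    'value': df1.to_numpy().ravel(),                      # 1, 2, 3, 4, 10, 20, ...
    'label': np.repeat(df2[0].to_numpy(), df1.shape[1]),  # x, x, x, x, y, y, ...
})
print(df_new)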

Related

pandas dataframe wide to long

I have a dataframe like so:
id  type  count_x  count_y  count_z  sum_x  sum_y  sum_z
 1     A       12        1        6     34     43     25
 1     B        4        5        8     12     37     28
Now I want to reshape it from wide to long, keeping id and type as identifier columns, like so:
id  type  variable  value   calc
 1     A         x     12  count
 1     A         y      1  count
 1     A         z      6  count
 1     B         x      4  count
 1     B         y      5  count
 1     B         z      8  count
 1     A         x     34    sum
 1     A         y     43    sum
 1     A         z     25    sum
 1     B         x     12    sum
 1     B         y     37    sum
 1     B         z     28    sum
How can I achieve this?
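The sample frame can be built like this (mirroring the constructor used in one of the answers below), so the snippets can be run as-is:

import pandas as pd

df = pd.DataFrame({'id': [1, 1], 'type': ['A', 'B'],
                   'count_x': [12, 4], 'count_y': [1, 5], 'count_z': [6, 8],
                   'sum_x': [34, 12], 'sum_y': [43, 37], 'sum_z': [25, 28]})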
Try using melt:
res = pd.melt(df,id_vars=['id', 'type'])
res[['calc', 'variable']] = res.variable.str.split('_', expand=True)
    id type variable  value   calc
0    1    A        x     12  count
1    1    B        x      4  count
2    1    A        y      1  count
3    1    B        y      5  count
4    1    A        z      6  count
5    1    B        z      8  count
6    1    A        x     34    sum
7    1    B        x     12    sum
8    1    A        y     43    sum
9    1    B        y     37    sum
10   1    A        z     25    sum
11   1    B        z     28    sum
Update:

Using stack:

df1 = df.set_index(['id', 'type']).stack().rename('value').reset_index()
df1 = df1.drop('level_2', axis=1).join(
    df1['level_2'].str.split('_', n=1, expand=True)
                  .rename(columns={0: 'calc', 1: 'variable'}))
    id type  value   calc variable
0    1    A     12  count        x
1    1    A      1  count        y
2    1    A      6  count        z
3    1    A     34    sum        x
4    1    A     43    sum        y
5    1    A     25    sum        z
6    1    B      4  count        x
7    1    B      5  count        y
8    1    B      8  count        z
9    1    B     12    sum        x
10   1    B     37    sum        y
11   1    B     28    sum        z
You can use a combination of melt and str.split():
df = pd.DataFrame({'id': [1,1], 'type': ['A', 'B'], 'count_x':[12,4], 'count_y': [1,5], 'count_z': [6,8], 'sum_x': [34, 12], 'sum_y': [43, 37], 'sum_z': [25, 28]})
df_melt = df.melt(id_vars=['id', 'type'])
df_melt[['calc', 'variable']] = df_melt['variable'].str.split("_", expand=True)
df_melt
id type variable value calc
0 1 A x 12 count
1 1 B x 4 count
2 1 A y 1 count
3 1 B y 5 count
4 1 A z 6 count
5 1 B z 8 count
6 1 A x 34 sum
7 1 B x 12 sum
8 1 A y 43 sum
9 1 B y 37 sum
10 1 A z 25 sum
11 1 B z 28 sum
Assuming your pandas DataFrame is df_wide, you can get the desired result in df_long as:

df_long = df_wide.melt(id_vars=['id', 'type'],
                       value_vars=['count_x', 'count_y', 'count_z', 'sum_x', 'sum_y', 'sum_z'])
df_long['calc'] = df_long['variable'].apply(lambda x: x.split('_')[0])
df_long['variable'] = df_long['variable'].apply(lambda x: x.split('_')[1])
You could reshape the data using pivot_longer from pyjanitor:
df.pivot_longer(
    index=['id', 'type'],
    # names of the new columns; note the order:
    # the first part of each split column name goes to `calc`,
    # the second part goes to `variable`
    names_to=['calc', 'variable'],
    values_to='value',
    # delimiter for splitting the column names
    names_sep='_')
id type calc variable value
0 1 A count x 12
1 1 B count x 4
2 1 A count y 1
3 1 B count y 5
4 1 A count z 6
5 1 B count z 8
6 1 A sum x 34
7 1 B sum x 12
8 1 A sum y 43
9 1 B sum y 37
10 1 A sum z 25
11 1 B sum z 28

Pandas - adding another column with the same name as another column

For example, see below:
1  2  3  4
X  X  X  X
X  X  X  X
X  X  X  X
X  X  X  X
How would I add another column with a 4 in it? I have used:

df = df.assign(4=np.zeros(shape=(df.shape[0], 1)))

however, it just changes the existing column 4 to what I have entered.
I hope this question is clear enough! Future state should look like this:
1  2  3  4  4
X  X  X  X  X
X  X  X  X  X
X  X  X  X  X
X  X  X  X  X
As @anon01 states, this is not a good idea, but you can use pd.concat:
df = pd.DataFrame(np.arange(25).reshape(-1, 5))
pd.concat([df,pd.Series([np.nan]*5).rename(4)], axis=1)
And as @CameronRiddell states:
pd.concat([df, pd.Series(np.nan, name=4)], axis=1)
Output:
0 1 2 3 4 4
0 0 1 2 3 4 NaN
1 5 6 7 8 9 NaN
2 10 11 12 13 14 NaN
3 15 16 17 18 19 NaN
4 20 21 22 23 24 NaN
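As an aside (not from the answers quoted above), pandas can also add a duplicate-named column in place via DataFrame.insert with allow_duplicates=True; a minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(-1, 5))

# append a second column named 4, filled with NaN
df.insert(loc=df.shape[1], column=4, value=np.nan, allow_duplicates=True)
print(df)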

Python: how to reshape a Pandas dataframe and keeping the information?

I have a dataframe containing the geographical information of points.
df:
   A  B  ax  ay  bx  by
0  x  y   5   7   3   2
1  z  w   2   0   7   4
2  k  x   5   7   2   0
3  v  y   2   3   3   2
I would like to create a dataframe with the geographical info of the unique points
df1:
ID x y
0 x 5 7
1 y 3 2
2 z 2 0
3 w 7 4
4 k 5 7
5 v 2 3
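A minimal sketch that rebuilds the input frame above, so the answer below can be run as-is:

import pandas as pd

df = pd.DataFrame({'A': ['x', 'z', 'k', 'v'], 'B': ['y', 'w', 'x', 'y'],
                   'ax': [5, 2, 5, 2], 'ay': [7, 0, 7, 3],
                   'bx': [3, 7, 2, 3], 'by': [2, 4, 0, 2]})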
First flatten the values of the paired columns with numpy.ravel, create a new DataFrame with the constructor, and finally add drop_duplicates (thanks @zipa):
a = df[['A','B']].values.ravel()
b = df[['ax','bx']].values.ravel()
c = df[['ay','by']].values.ravel()
df = pd.DataFrame({'ID':a, 'x':b, 'y':c}).drop_duplicates('ID').reset_index(drop=True)
print (df)
ID x y
0 x 5 7
1 y 3 2
2 z 2 0
3 w 7 4
4 k 5 7
5 v 2 3

Splitting a dataframe in python

I have a dataframe df:
Type ID QTY_1 QTY_2 RES_1 RES_2
X 1 10 15 y N
X 2 12 25 N N
X 3 25 16 Y Y
X 4 14 62 N Y
X 5 21 75 Y Y
Y 1 10 15 y N
Y 2 12 25 N N
Y 3 25 16 Y Y
Y 4 14 62 N N
Y 5 21 75 Y Y
I want to split this into two DataFrames, each keeping Type, ID, and one QTY column, restricted to the rows where the corresponding RES column is Y.
Below is my expected result
df1=
Type ID QTY_1
X 1 10
X 3 25
X 5 21
Y 1 10
Y 3 25
Y 5 21
df2 =
Type ID QTY_2
X 3 16
X 4 62
X 5 75
Y 3 16
Y 5 75
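A minimal sketch that builds the sample frame above (RES values copied as given, including the lowercase y), so the answers below can be run directly:

import pandas as pd

df = pd.DataFrame({
    'Type': ['X'] * 5 + ['Y'] * 5,
    'ID': [1, 2, 3, 4, 5] * 2,
    'QTY_1': [10, 12, 25, 14, 21] * 2,
    'QTY_2': [15, 25, 16, 62, 75] * 2,
    'RES_1': ['y', 'N', 'Y', 'N', 'Y'] * 2,
    'RES_2': ['N', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'N', 'Y'],
})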
You can do this:
df1 = df[['Type', 'ID', 'QTY_1']].loc[df.RES_1.isin(['Y', 'y'])]
df2 = df[['Type', 'ID', 'QTY_2']].loc[df.RES_2.isin(['Y', 'y'])]
or
df1 = df[['Type', 'ID', 'QTY_1']].loc[df.RES_1.str.lower() == 'y']
df2 = df[['Type', 'ID', 'QTY_2']].loc[df.RES_2.str.lower() == 'y']
Output:
>>> df1
Type ID QTY_1
0 X 1 10
2 X 3 25
4 X 5 21
5 Y 1 10
7 Y 3 25
9 Y 5 21
>>> df2
Type ID QTY_2
2 X 3 16
3 X 4 62
4 X 5 75
7 Y 3 16
9 Y 5 75
Use a dictionary
It's good practice to use a dictionary for a variable number of variables. Although in this case there may be only a couple of categories, you benefit from organized data. For example, you can access RES_1 data via dfs[1].
dfs = {i: df.loc[df['RES_'+str(i)].str.lower() == 'y', ['Type', 'ID', 'QTY_'+str(i)]] \
for i in range(1, 3)}
print(dfs)
{1: Type ID QTY_1
0 X 1 10
2 X 3 25
4 X 5 21
5 Y 1 10
7 Y 3 25
9 Y 5 21,
2: Type ID QTY_2
2 X 3 16
3 X 4 62
4 X 5 75
7 Y 3 16
9 Y 5 75}
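Accessing the individual frames then looks like this (a usage sketch, assuming the dfs dictionary above):

qty1 = dfs[1].reset_index(drop=True)  # Type, ID, QTY_1 where RES_1 is Y/y
qty2 = dfs[2].reset_index(drop=True)  # Type, ID, QTY_2 where RES_2 is Y/y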
You need:
df1 = df.loc[(df['RES_1']=='Y') | (df['RES_1']=='y')].drop(['QTY_2', 'RES_1', 'RES_2'], axis=1)
df2 = df.loc[(df['RES_2']=='Y') | (df['RES_2']=='y')].drop(['QTY_1', 'RES_1', 'RES_2'], axis=1)
print(df1)
print(df2)
Output:
Type ID QTY_1
0 X 1 10
2 X 3 25
4 X 5 21
5 Y 1 10
7 Y 3 25
9 Y 5 21
Type ID QTY_2
2 X 3 16
3 X 4 62
4 X 5 75
7 Y 3 16
9 Y 5 75

How to reshape multiple time-series signals for use with sns.tsplot?

I'm trying to reshape data that looks like this:
t y0 y1 y2
0 0 -1 0 1
1 1 0 1 2
2 2 1 2 3
3 3 2 3 4
4 4 3 4 5
into something like this:
t trial signal value
0 0 0 y -1
1 0 1 y 0
2 0 2 y 1
3 1 0 y 0
4 1 1 y 1
5 1 2 y 2
6 2 0 y 1
7 2 1 y 2
8 2 2 y 3
9 3 0 y 2
10 3 1 y 3
11 3 2 y 4
12 4 0 y 3
13 4 1 y 4
14 4 2 y 5
so that I can feed it into sns.tsplot.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
fig = plt.figure()
num_points = 5
# Create some dummy line signals and assemble a data frame
t = np.arange(num_points)
y0 = t - 1
y1 = t
y2 = t + 1
df = pd.DataFrame(np.vstack((t, y0, y1, y2)).transpose(), columns=['t', 'y0', 'y1', 'y2'])
print(df)
# Do some magic transformations
df = pd.melt(df, id_vars=['t'])
print(df)
# Plot the time-series data
sns.tsplot(time="t", value="value", unit="trial", condition="signal", data=df, ci=[68, 95])
plt.savefig("dummy.png")
plt.close()
I'm hoping to achieve something like the examples here for my line signals:
https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.tsplot.html
http://pandas.pydata.org/pandas-docs/stable/reshaping.html
I think you can use melt for reshaping, extract the first and second characters of the variable names with the str accessor, and finally sort_values and reorder the columns:
df1 = pd.melt(df, id_vars=['t'])
#create helper Series
variable = df1['variable']
#extract second char, convert to int
df1['trial'] = variable.str[1].astype(int)
#extract first char
df1['signal'] = variable.str[0]
#sort values by column t, reset index
df1 = df1.sort_values('t').reset_index(drop=True)
#reorder columns
df1 = df1[['t','trial','signal','value']]
print(df1)
t trial signal value
0 0 0 y -1
1 0 1 y 0
2 0 2 y 1
3 1 0 y 0
4 1 1 y 1
5 1 2 y 2
6 2 0 y 1
7 2 1 y 2
8 2 2 y 3
9 3 0 y 2
10 3 1 y 3
11 3 2 y 4
12 4 0 y 3
13 4 1 y 4
14 4 2 y 5
Another solution, if all values in column signal are only y:
# remove the y from the column names; the first column name (t) stays the same
df.columns = df.columns[:1].tolist() + [int(col[1]) for col in df.columns[1:]]
print(df)
t 0 1 2
0 0 -1 0 1
1 1 0 1 2
2 2 1 2 3
3 3 2 3 4
4 4 3 4 5
df1 = pd.melt(df, id_vars=['t'], var_name='trial')
# all values in column signal are y
df1['signal'] = 'y'
# sort values by column t, reset index
df1 = df1.sort_values('t').reset_index(drop=True)
# reorder columns
df1 = df1[['t', 'trial', 'signal', 'value']]
print(df1)
    t trial signal  value
0   0     0      y     -1
1   0     1      y      0
2   0     2      y      1
3   1     0      y      0
4   1     1      y      1
5   1     2      y      2
6   2     0      y      1
7   2     1      y      2
8   2     2      y      3
9   3     0      y      2
10  3     1      y      3
11  3     2      y      4
12  4     0      y      3
13  4     1      y      4
14  4     2      y      5
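As an aside (not part of the original answers): sns.tsplot has since been deprecated and removed from seaborn; with the long-format frame above, a roughly equivalent sketch today would use sns.lineplot, which aggregates the repeated t values per signal and draws a confidence band:

import matplotlib.pyplot as plt
import seaborn as sns

# assumes df1 is the long-format frame produced above (columns t, trial, signal, value)
sns.lineplot(data=df1, x='t', y='value', hue='signal')
plt.savefig('dummy.png')
plt.close()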
