Keeping the first column value with the .melt function - Python

I want to use the DataFrame.melt function from the pandas library to reshape the data from columns into rows while keeping the first column's value. I've also tried .pivot, but it isn't working well. Please look at the example below and help me convert this:
ID Alphabet Unspecified: 1 Unspecified: 2
0 1 A G L
1 2 B NaN NaN
2 3 C H NaN
3 4 D I M
4 5 E J NaN
5 6 F K O
Into this:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O
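For reproducibility, the input frame above can be constructed like this (a sketch, assuming the blanks are genuine missing values):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Alphabet": list("ABCDEF"),
    "Unspecified: 1": ["G", np.nan, "H", "I", "J", "K"],
    "Unspecified: 2": ["L", np.nan, np.nan, "M", np.nan, "O"],
})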

Try this (assuming ID is unique and sorted; the stable sort keeps the original column order within each ID):
df = (
    pd.melt(df, "ID")
    .sort_values("ID", kind="stable")
    .drop(columns="variable")
    .dropna()
    .reset_index(drop=True)
    .rename(columns={"value": "Alphabet"})
)
print(df)
Prints:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O

Don't melt; stack instead. This directly drops the NaNs and keeps the order per row:
out = (df
       .set_index('ID')
       .stack()
       .droplevel(1)
       .reset_index(name='Alphabet')
)
Output:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O
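A hedged note: on pandas >= 2.1 you can opt into the new stack implementation via future_stack=True (slated to become the default), which no longer drops NaNs automatically, so an explicit dropna() is needed:
# Sketch for the new stack implementation; it keeps NaNs, so drop them explicitly.
out = (df
       .set_index('ID')
       .stack(future_stack=True)
       .dropna()
       .droplevel(1)
       .reset_index(name='Alphabet')
)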

One option is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index='ID',
     names_to='Alphabet',
     names_pattern=['.+'],
     sort_by_appearance=True)
 .dropna()
)
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
6 3 C
7 3 H
9 4 D
10 4 I
11 4 M
12 5 E
13 5 J
15 6 F
16 6 K
17 6 O
In the code above, names_pattern accepts a list of regular expressions to match the desired columns; all the matches are collated into a single column, named Alphabet via names_to.
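Note the gaps in the index above, left behind by the dropped NaN rows; a trailing reset_index (a small addition, not in the original answer) renumbers them:
# Assuming `out` holds the pivot_longer(...).dropna() result above:
out = out.reset_index(drop=True)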


Pandas split and append

I'm new to working with pandas and I don't know how to solve the following problem.
I have the following dataframe:
0 1 2 3 4 5
0 a 1 d 4 g 7
1 b 2 e 5 h 8
2 c 3 f 6 i 9
and I have to turn it into the following:
a 1
b 2
c 3
d 4
e 5
f 6
g 7
h 8
i 9
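For reference, the input can be rebuilt like this (a sketch; note the column labels are strings, matching the df.columns output shown in the answers below):
import pandas as pd

df = pd.DataFrame(
    [['a', 1, 'd', 4, 'g', 7],
     ['b', 2, 'e', 5, 'h', 8],
     ['c', 3, 'f', 6, 'i', 9]],
    columns=list('012345'))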
Try this:
data = {
    0: pd.concat(df[c] for c in df.columns[0::2]).reset_index(drop=True),
    1: pd.concat(df[c] for c in df.columns[1::2]).reset_index(drop=True),
}
df = pd.DataFrame(data)
Output:
>>> df
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
Explanation
First, we select every even column and group them together:
>>> df
0 1 2 3 4 5
0 a 1 d 4 g 7
1 b 2 e 5 h 8
2 c 3 f 6 i 9
>>> df.columns
Index(['0', '1', '2', '3', '4', '5'], dtype='object')
>>> even_col_names = df.columns[0::2]  # slice syntax start:stop:step: start at item 0, run to the end, take every 2nd item
>>> even_col_names
Index(['0', '2', '4'], dtype='object')
>>> even_cols = df[even_col_names]
>>> even_cols
0 2 4
0 a d g
1 b e h
2 c f i
Then, we select every odd column and group them together:
>>> odd_col_names = df.columns[1::2] # start with the 1st item, select every 2 items
>>> odd_col_names
Index(['1', '3', '5'], dtype='object')
>>> odd_cols = df[odd_col_names]
>>> odd_cols
1 3 5
0 1 4 7
1 2 5 8
2 3 6 9
Then, we concatenate the even columns into a single column:
>>> even_cols_list = [df[c] for c in even_col_names]
>>> even_cols_list
[0 a
1 b
2 c
Name: 0, dtype: object,
0 d
1 e
2 f
Name: 2, dtype: object,
0 g
1 h
2 i
Name: 4, dtype: object]
>>> even_col = pd.concat(even_cols_list).reset_index(drop=True)
>>> even_col
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
dtype: object
Then we concatenate the odd columns into a single column:
>>> odd_cols_list = [df[c] for c in odd_col_names]
>>> odd_cols_list
[0 1
1 2
2 3
Name: 1, dtype: int64,
0 4
1 5
2 6
Name: 3, dtype: int64,
0 7
1 8
2 9
Name: 5, dtype: int64]
>>> odd_col = pd.concat(odd_cols_list).reset_index(drop=True)
>>> odd_col
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
dtype: int64
Finally, we create a new dataframe with these two columns:
>>> df = pd.DataFrame({0: even_col, 1: odd_col})
>>> df
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
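As a follow-up, the same dictionary-of-concats idea generalizes from 2 to k interleaved roles per row (a sketch; k=2 reproduces the output above):
import pandas as pd

k = 2  # number of interleaved roles per row
data = {
    j: pd.concat([df[c] for c in df.columns[j::k]], ignore_index=True)
    for j in range(k)
}
out = pd.DataFrame(data)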
Convert the data to numpy, reshape within numpy (two columns), and create a new pandas dataframe (converting the numeric column back to integers):
import numpy as np

df = df.to_numpy()
df = np.reshape(df, (-1, 2))  # each input row flattens into (letter, number) pairs
df = pd.DataFrame(df).transform(pd.to_numeric, errors='ignore')
df.sort_values(1, ignore_index=True)
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
Another option would be to individually stack the numbers and strings, before recombining into a single dataframe:
numbers = df.select_dtypes('number').stack().array
strings = df.select_dtypes('object').stack().array
out = pd.concat([pd.Series(strings), pd.Series(numbers)], axis = 1)
out.sort_values(1, ignore_index = True)
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
One more option, which takes advantage of the patterns here, is pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(index=None,
                names_to=['0', '1'],
                names_pattern=['0|2|4', '1|3|5'])
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9

Construct a df such that every number within a range gets the value 'A' assigned, knowing the start and end of the range of values that belong to 'A'

Suppose I have the following Pandas dataframe:
In[285]: df = pd.DataFrame({'Name':['A','B'], 'Start': [1,6], 'End': [4,12]})
In [286]: df
Out[286]:
Name Start End
0 A 1 4
1 B 6 12
Now I would like to construct the dataframe as follows:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
My biggest struggle is in getting the 'Name' column right. Is there a smart way to do this in Python?
I would do pd.concat on a generator expression:
import numpy as np

pd.concat(pd.DataFrame({'Number': np.arange(s, e + 1)}).assign(Name=n)
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
Output:
Number Name
0 1 A
1 2 A
2 3 A
3 4 A
0 6 B
1 7 B
2 8 B
3 9 B
4 10 B
5 11 B
6 12 B
Update: As commented by @rafaelc:
pd.concat(pd.DataFrame({'Number': np.arange(s, e + 1), 'Name': n})
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
works just fine.
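A small follow-up, not part of the original answers: the repeated 0..k indices in the output come from concatenating each per-name piece with its own index; passing ignore_index=True to pd.concat yields a fresh RangeIndex instead:
import numpy as np
import pandas as pd

out = pd.concat(
    (pd.DataFrame({'Number': np.arange(s, e + 1), 'Name': n})
     for n, s, e in zip(df['Name'], df['Start'], df['End'])),
    ignore_index=True,  # renumber 0..n-1 instead of repeating each piece's index
)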
Let us do it with this example (with 3 names):
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C'], 'Start': [1,6,18], 'End': [4,12,20]})
You may create the target columns first, using list comprehensions:
name = [row.Name for i, row in df.iterrows() for _ in range(row.End - row.Start + 1)]
number = [k for i, row in df.iterrows() for k in range(row.Start, row.End + 1)]
And then you can create the target DataFrame:
expanded = pd.DataFrame({"Name": name, "Number": number})
You get:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
11 C 18
12 C 19
13 C 20
I'd take advantage of loc and index.repeat for a vectorized solution.
base = df.loc[df.index.repeat(df['End'] - df['Start'] + 1), ['Name', 'Start']]
base['Start'] += base.groupby(level=0).cumcount()
Name Start
0 A 1
0 A 2
0 A 3
0 A 4
1 B 6
1 B 7
1 B 8
1 B 9
1 B 10
1 B 11
1 B 12
Of course we can rename the columns and reset the index at the end, for a cleaner presentation.
base.rename(columns={'Start': 'Number'}).reset_index(drop=True)
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12

Divide the data set by rows

Given dataframe:
df = pd.DataFrame({'a': [1, 2, 4, 5, 6, 8],
                   'b': [5, 6, 4, 8, 9, 6],
                   'c': [6, 3, 3, 7, 8, 4],
                   'd': [1, 2, 3, 8, 7, 3],
                   'e': [3, 2, 4, 4, 6, 2],
                   'f': [3, 2, 6, 4, 5, 5]})
I want to divide/split df into several parts (2, 3, 4, ... n parts).
Desired output:
df1 =
a b c d e f
0 1 5 6 1 3 3
1 2 6 3 2 2 2
df2 =
a b c d e f
2 4 4 3 3 4 6
3 5 8 7 8 4 4
df3 =
a b c d e f
4 6 9 8 7 6 5
5 8 6 4 3 2 5
UPDATE:
The real data is not evenly divisible! It is 4351 rows × 3 columns.
Use qcut to split. How you want to store the pieces afterwards is up to you:
import pandas as pd
gp = df.groupby(pd.qcut(range(df.shape[0]), 3)) # N = 3
d = {f'df{i+1}': x[1] for i, x in enumerate(gp)}
d['df1']
# a b c d e f
#0 1 5 6 1 3 3
#1 2 6 3 2 2 2
Assuming your DataFrame can be evenly divided into n chunks:
import numpy as np

n = 3
dfs = [df.loc[i] for i in np.split(df.index, n)]
dfs is a list containing 3 dataframes.
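For the updated, unevenly sized data (4351 rows), np.array_split is the usual escape hatch: unlike np.split, it tolerates chunk counts that do not divide the length evenly. A sketch:
import numpy as np

n = 3
# The first len(df) % n chunks get one extra row; no error on uneven lengths.
dfs = [df.iloc[idx] for idx in np.array_split(np.arange(len(df)), n)]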

Creating python function to create categorical bins in pandas

I'm trying to create a reusable function in Python 2.7 (pandas) to form categorical bins, i.e. group low-frequency categories as 'other'. Can someone help me create a function for the below? col1, col2, etc. are different categorical variable columns.
##Reducing categories by binning categorical variables - column1
a = df.col1.value_counts()
#get top 5 values of index
vals = a[:5].index
df['col1_new'] = df.col1.where(df.col1.isin(vals), 'other')
df = df.drop(['col1'],axis=1)
##Reducing categories by binning categorical variables - column2
a = df.col2.value_counts()
#get top 6 values of index
vals = a[:6].index
df['col2_new'] = df.col2.where(df.col2.isin(vals), 'other')
df = df.drop(['col2'],axis=1)
You can use:
df = pd.DataFrame({'A': list('abcdefabcdefabffeg'),
                   'D': [1, 3, 5, 7, 1, 0, 1, 3, 5, 7, 1, 0, 1, 3, 5, 7, 1, 0]})
print (df)
A D
0 a 1
1 b 3
2 c 5
3 d 7
4 e 1
5 f 0
6 a 1
7 b 3
8 c 5
9 d 7
10 e 1
11 f 0
12 a 1
13 b 3
14 f 5
15 f 7
16 e 1
17 g 0
def replace_under_top(df, c, n):
    a = df[c].value_counts()
    # get the top n values of the index
    vals = a[:n].index
    # assign the column back
    df[c] = df[c].where(df[c].isin(vals), 'other')
    # rename the processed column
    df = df.rename(columns={c: c + '_new'})
    return df
Test:
df1 = replace_under_top(df, 'A', 3)
print (df1)
A_new D
0 other 1
1 b 3
2 other 5
3 other 7
4 e 1
5 f 0
6 other 1
7 b 3
8 other 5
9 other 7
10 e 1
11 f 0
12 other 1
13 b 3
14 f 5
15 f 7
16 e 1
17 other 0
df2 = replace_under_top(df, 'D', 4)
print (df2)
A D_new
0 other 1
1 b 3
2 other 5
3 other 7
4 e 1
5 f other
6 other 1
7 b 3
8 other 5
9 other 7
10 e 1
11 f other
12 other 1
13 b 3
14 f 5
15 f 7
16 e 1
17 other other
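One caveat with the function above: df[c] = ... mutates the caller's frame in place, which is why column A already shows 'other' in the second test. A copy-based variant (a sketch) keeps the original intact:
def replace_under_top(df, c, n):
    df = df.copy()  # work on a copy so the caller's frame is untouched
    vals = df[c].value_counts()[:n].index
    df[c] = df[c].where(df[c].isin(vals), 'other')
    return df.rename(columns={c: c + '_new'})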

Pandas - Merge multiple columns and sum

I have a main df like so:
index A B C
5 1 5 8
6 2 4 1
7 8 3 4
8 3 9 5
and an auxiliary df2 that I want to add to the main df like so:
index A B
5 4 2
6 4 3
7 7 1
8 6 2
Columns A & B have the same names; however, the main df contains many columns that the secondary df2 does not. I want to sum the columns that are common and leave the others as is.
Output:
index A B C
5 5 7 8
6 6 7 1
7 15 4 4
8 9 11 5
I have tried variations of df.join, pd.merge and groupby, but I'm having no luck at the moment.
Last Attempt:
df.groupby('index').sum().add(df2.groupby('index').sum())
But this does not keep the columns that are not in common.
With pd.merge, I get _x and _y suffixes.
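For reference, the two frames can be rebuilt like this (a sketch, using 'index' as the index name):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 8, 3], 'B': [5, 4, 3, 9], 'C': [8, 1, 4, 5]},
                  index=pd.Index([5, 6, 7, 8], name='index'))
df2 = pd.DataFrame({'A': [4, 4, 7, 6], 'B': [2, 3, 1, 2]},
                   index=pd.Index([5, 6, 7, 8], name='index'))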
Use add only on the common columns, obtained via intersection:
c = df.columns.intersection(df2.columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
A B C
index
5 5 7 8
6 6 7 1
7 15 4 4
8 9 11 5
If you use add alone, integer columns that are not matched are converted to floats:
df = df.add(df2, fill_value=0)
print (df)
A B C
index
5 5 7 8.0
6 6 7 1.0
7 15 4 4.0
8 9 11 5.0
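If you prefer the plain add, the float upcast can be undone afterwards by casting back to the original dtypes (a sketch, assuming the addition introduces no genuine NaNs):
# Cast the summed frame back to df's original dtypes.
summed = df.add(df2, fill_value=0).astype(df.dtypes.to_dict())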
EDIT:
If the common columns can possibly contain strings:
print (df)
A B C D
index
5 1 5 8 a
6 2 4 1 e
7 8 3 4 r
8 3 9 5 w
print (df2)
A B C D
index
5 1 5 8 a
6 2 4 1 e
7 8 3 4 r
8 3 9 5 w
The solution is similar; just filter to the numeric columns first with select_dtypes:
import numpy as np

c = df.select_dtypes(np.number).columns.intersection(df2.select_dtypes(np.number).columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
A B C D
index
5 5 7 8 a
6 6 7 1 e
7 15 4 4 r
8 9 11 5 w
Not the cleanest way, but it might work:
df_new = pd.DataFrame()
df_new['A'] = df['A'] + df2['A']
df_new['B'] = df['B'] + df2['B']
df_new['C'] = df['C']
print(df_new)
A B C
0 5 7 8
1 6 7 1
2 15 4 4
3 9 11 5
