Pandas split and append - python

I'm new to working with pandas and don't know how to solve the following problem.
I have the following dataframe:
0 1 2 3 4 5
0 a 1 d 4 g 7
1 b 2 e 5 h 8
2 c 3 f 6 i 9
and I have to turn it into the following:
a 1
b 2
c 3
d 4
e 5
f 6
g 7
h 8
i 9

Try this:
data = {
0: pd.concat(df[c] for c in df.columns[0::2]).reset_index(drop=True),
1: pd.concat(df[c] for c in df.columns[1::2]).reset_index(drop=True),
}
df = pd.DataFrame(data)
Output:
>>> df
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
Explanation
First, we select every even column and group them together:
>>> df
0 1 2 3 4 5
0 a 1 d 4 g 7
1 b 2 e 5 h 8
2 c 3 f 6 i 9
>>> df.columns
Index(['0', '1', '2', '3', '4', '5'], dtype='object')
>>> even_col_names = df.columns[0::2] # slice syntax start:stop:step: start at index 0, no stop given (run to the end), take every 2nd item
>>> even_col_names
Index(['0', '2', '4'], dtype='object')
>>> even_cols = df[even_col_names]
>>> even_cols
0 2 4
0 a d g
1 b e h
2 c f i
Then, we select every odd column and group them together:
>>> odd_col_names = df.columns[1::2] # start at index 1, take every 2nd item
>>> odd_col_names
Index(['1', '3', '5'], dtype='object')
>>> odd_cols = df[odd_col_names]
>>> odd_cols
1 3 5
0 1 4 7
1 2 5 8
2 3 6 9
Then, we concatenate the even columns into a single column:
>>> even_cols_list = [df[c] for c in even_col_names]
>>> even_cols_list
[0 a
1 b
2 c
Name: 0, dtype: object,
0 d
1 e
2 f
Name: 2, dtype: object,
0 g
1 h
2 i
Name: 4, dtype: object]
>>> even_col = pd.concat(even_cols_list).reset_index(drop=True)
>>> even_col
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
dtype: object
Then we concatenate the odd columns into a single column:
>>> odd_cols_list = [df[c] for c in odd_col_names]
>>> odd_cols_list
[0 1
1 2
2 3
Name: 1, dtype: int64,
0 4
1 5
2 6
Name: 3, dtype: int64,
0 7
1 8
2 9
Name: 5, dtype: int64]
>>> odd_col = pd.concat(odd_cols_list).reset_index(drop=True)
>>> odd_col
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
dtype: int64
Finally, we create a new dataframe with these two columns:
>>> df = pd.DataFrame({0: even_col, 1: odd_col})
>>> df
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
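The walkthrough above can be wrapped in a small helper. This is only a sketch, not part of the original answer: the function name stack_column_groups and its width parameter are hypothetical, and it assumes the number of columns is a multiple of width.

```python
import pandas as pd


def stack_column_groups(df: pd.DataFrame, width: int = 2) -> pd.DataFrame:
    """Stack consecutive groups of `width` columns on top of each other.

    Hypothetical helper: assumes len(df.columns) is a multiple of `width`.
    """
    if len(df.columns) % width != 0:
        raise ValueError("number of columns must be a multiple of width")
    data = {
        # Columns k, k + width, k + 2*width, ... end up in output column k.
        k: pd.concat([df[c] for c in df.columns[k::width]], ignore_index=True)
        for k in range(width)
    }
    return pd.DataFrame(data)


df = pd.DataFrame(
    [["a", 1, "d", 4, "g", 7],
     ["b", 2, "e", 5, "h", 8],
     ["c", 3, "f", 6, "i", 9]]
)
result = stack_column_groups(df)
print(result)
```

With width=2 this should reproduce the two-column result shown above; other widths would stack wider groups the same way.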

Convert the data to a NumPy array, reshape it into two columns, and build a new pandas dataframe (converting the numeric column back to a numeric dtype):
df = df.to_numpy()
df = np.reshape(df, (-1, 2)) # -1 lets NumPy infer the number of rows; see the docs for np.reshape
df = pd.DataFrame(df).transform(pd.to_numeric, errors='ignore') # note: errors='ignore' is deprecated in pandas 2.2 and later
df.sort_values(1, ignore_index = True)
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
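Note that the final sort only restores the order because the numbers in this example happen to be increasing. A possible variation, sketched here as an assumption rather than as the answerer's method, is to reorder the axes in NumPy so that no sort is needed. It assumes the columns come in (string, number) pairs, as in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["a", 1, "d", 4, "g", 7],
     ["b", 2, "e", 5, "h", 8],
     ["c", 3, "f", 6, "i", 9]]
)

values = df.to_numpy()                      # shape (3, 6), dtype object
blocks = values.reshape(len(df), -1, 2)     # axes: (row, pair, position in pair)
stacked = blocks.transpose(1, 0, 2)         # axes: (pair, row, position in pair)
out = pd.DataFrame(stacked.reshape(-1, 2))  # pairs stacked one below the other
out[1] = out[1].astype(int)                 # restore the integer dtype
print(out)
```

Because the row order comes from the axis permutation, this should also work when the numeric values are not sorted.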
Another option would be to individually stack the numbers and strings, before recombining into a single dataframe:
numbers = df.select_dtypes('number').stack().array
strings = df.select_dtypes('object').stack().array
out = pd.concat([pd.Series(strings), pd.Series(numbers)], axis = 1)
out.sort_values(1, ignore_index = True)
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
One more option, which takes advantage of the pattern in the column names, is pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(index=None,
names_to=['0','1'],
names_pattern= ['0|2|4', '1|3|5'])
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9

Related

Construct a df such that every number within a range gets value 'A' assigned when knowing the start and end of the range values that belong to 'A'

Suppose I have the following Pandas dataframe:
In [285]: df = pd.DataFrame({'Name':['A','B'], 'Start': [1,6], 'End': [4,12]})
In [286]: df
Out[286]:
Name Start End
0 A 1 4
1 B 6 12
Now I would like to construct the dataframe as follows:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
My biggest struggle is in getting the 'Name' column right. Is there a smart way to do this in Python?
I would use pd.concat on a generator expression:
pd.concat(pd.DataFrame({'Number': np.arange(s,e+1)})
.assign(Name=n)
for n,s,e in zip(df['Name'], df['Start'], df['End']))
Output:
Number Name
0 1 A
1 2 A
2 3 A
3 4 A
0 6 B
1 7 B
2 8 B
3 9 B
4 10 B
5 11 B
6 12 B
Update: As commented by @rafaelc:
pd.concat(pd.DataFrame({'Number': np.arange(s,e+1), 'Name': n})
for n,s,e in zip(df['Name'], df['Start'], df['End']))
works just fine.
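The output above keeps a separate 0-based index for each name. If a continuous index is wanted, as in the question's desired result, passing ignore_index=True to pd.concat is one way to get it. A minimal self-contained sketch, using the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B"], "Start": [1, 6], "End": [4, 12]})

out = pd.concat(
    [
        # The scalar name is broadcast against the array of numbers.
        pd.DataFrame({"Name": n, "Number": np.arange(s, e + 1)})
        for n, s, e in zip(df["Name"], df["Start"], df["End"])
    ],
    ignore_index=True,  # renumber the rows instead of restarting per name
)
print(out)
```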
Let us do it with this example (with 3 names):
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C'], 'Start': [1,6,18], 'End': [4,12,20]})
You may create the target columns first, using list comprehensions:
name = [row.Name for i, row in df.iterrows() for _ in range(row.End - row.Start + 1)]
number = [k for i, row in df.iterrows() for k in range(row.Start, row.End + 1)]
And then you can create the target DataFrame:
expanded = pd.DataFrame({"Name": name, "Number": number})
You get:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
11 C 18
12 C 19
13 C 20
I'd take advantage of loc and index.repeat for a vectorized solution.
base = df.loc[df.index.repeat(df['End'] - df['Start'] + 1), ['Name', 'Start']]
base['Start'] += base.groupby(level=0).cumcount()
Name Start
0 A 1
0 A 2
0 A 3
0 A 4
1 B 6
1 B 7
1 B 8
1 B 9
1 B 10
1 B 11
1 B 12
Of course, we can rename the columns and reset the index at the end for a cleaner result.
base.rename(columns={'Start': 'Number'}).reset_index(drop=True)
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
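Yet another possibility, not taken from the answers above, is DataFrame.explode: build one list of numbers per row, then expand each list into rows. This sketch assumes a pandas version where explode accepts ignore_index (1.1 or later, as far as I know).

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B"], "Start": [1, 6], "End": [4, 12]})

# One list of consecutive numbers per original row.
ranges = [list(range(s, e + 1)) for s, e in zip(df["Start"], df["End"])]

out = (
    df.assign(Number=ranges)
      .explode("Number", ignore_index=True)  # one row per list element
      .astype({"Number": int})               # explode leaves an object column
      [["Name", "Number"]]
)
print(out)
```

This is usually less efficient than the index.repeat approach for large frames, but it reads close to the problem statement.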

How to merge two dataframes without basing the merge on any index

Suppose I have two dataframes X and Y:
import pandas as pd
X = pd.DataFrame({'A':[1,4,7],'B':[2,5,8],'C':[3,6,9]})
Y = pd.DataFrame({'D':[1],'E':[11]})
In [4]: X
Out[4]:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
In [6]: Y
Out[6]:
D E
0 1 11
and then, I want to get the following result dataframe:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
how?
Assuming that Y contains only one row:
In [9]: X.assign(**Y.to_dict('records')[0])
Out[9]:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
or a much nicer alternative from @piRSquared:
In [27]: X.assign(**Y.iloc[0])
Out[27]:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
Helper dict:
In [10]: Y.to_dict('records')[0]
Out[10]: {'D': 1, 'E': 11}
Here is another way:
Y2 = pd.concat([Y]*3, ignore_index = True) # repeat the single row of Y three times
which produces:
D E
0 1 11
1 1 11
2 1 11
Then concat once again:
pd.concat([X,Y2], axis =1)
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
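A cross join is another way to express this, and it does not require Y to have exactly one row. The sketch below is an addition to the answers above and assumes pandas 1.2 or later, where merge accepts how="cross".

```python
import pandas as pd

X = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})
Y = pd.DataFrame({"D": [1], "E": [11]})

# Every row of X is paired with every row of Y; no key or index is used.
result = X.merge(Y, how="cross")
print(result)
```

If Y had several rows, the result would contain len(X) * len(Y) rows, which may or may not be what is wanted.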

Add a name to pandas dataframe index

As the picture shows, how can I add a name to the index of a pandas dataframe? When added, it should look like this:
You need to set the index name:
df.index.name = 'code'
Or rename_axis:
df = df.rename_axis('code')
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10,size=(5,5)),columns=list('ABCDE'),index=list('abcde'))
print (df)
A B C D E
a 8 8 3 7 7
b 0 4 2 5 2
c 2 2 1 0 8
d 4 0 9 6 2
e 4 1 5 3 4
df.index.name = 'code'
print (df)
A B C D E
code
a 8 8 3 7 7
b 0 4 2 5 2
c 2 2 1 0 8
d 4 0 9 6 2
e 4 1 5 3 4
df = df.rename_axis('code')
print (df)
A B C D E
code
a 8 8 3 7 7
b 0 4 2 5 2
c 2 2 1 0 8
d 4 0 9 6 2
e 4 1 5 3 4
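One practical effect of naming the index is that the name becomes the column label when the index is turned back into a column. A small sketch with made-up data (the frame below is illustrative, not the one from the question):

```python
import pandas as pd

df = pd.DataFrame({"A": [8, 0], "B": [8, 4]}, index=["a", "b"])
df = df.rename_axis("code")

# reset_index moves the named index into a regular column called "code".
as_column = df.reset_index()
print(as_column)
```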

TypeError: unhashable type: 'dict' when using pandas Multi Index

I have tried to add one dataframe that has 2 rows and about 200 columns to the top of another dataframe, but I got TypeError: unhashable type: 'dict' .
This is code I'm using:
df is first dataframe with 2 rows and about 200 columns that I am trying to add to finaldata dataframe.
finaldata.columns = pd.MultiIndex.from_arrays([df.values[0], finaldata.columns])
When I check the type of both dataframes with type(), I get pandas.core.frame.DataFrame.
It seems you need iloc to select the first and second rows of df by position:
finaldata.columns = pd.MultiIndex.from_arrays([df.iloc[0], df.iloc[1], finaldata.columns])
Sample:
df = pd.DataFrame({'a':[2,3],
'b':[5,6],
'c':[1,5],
'd':[4,5],
'e':[1,5],
'f':[8,9]})
print (df)
a b c d e f
0 2 5 1 4 1 8
1 3 6 5 5 5 9
finaldata = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (finaldata)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
names = ['first','second','third']
finaldata.columns = pd.MultiIndex.from_arrays([df.iloc[0],
df.iloc[1],
finaldata.columns], names=names)
print (finaldata)
first 2 5 1 4 1 8
second 3 6 5 5 5 9
third A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
Another solution with numpy.concatenate:
a = np.concatenate([df.values, np.array(finaldata.columns).reshape(-1,df.shape[1])]).tolist()
print (a)
[[2, 5, 1, 4, 1, 8], [3, 6, 5, 5, 5, 9], ['A', 'B', 'C', 'D', 'E', 'F']]
names = ['first','second','third']
finaldata.columns = pd.MultiIndex.from_arrays(a, names=names)
print (finaldata)
first 2 5 1 4 1 8
second 3 6 5 5 5 9
third A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
EDIT:
The solution is very similar; you only need to reindex the columns:
df = pd.DataFrame({'A':[2,3],
'B':[5,6],
'C':[1,5],
'D':[4,5],
'E':[1,5],
'F':[8,9]})
print (df)
A B C D E F
0 2 5 1 4 1 8
1 3 6 5 5 5 9
finaldata = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'E':[7,8,9],
'F':[1,3,5]})
print (finaldata)
A B E F
0 1 4 7 1
1 2 5 8 3
2 3 6 9 5
df1 = df.reindex(columns=finaldata.columns)
print (df1)
A B E F
0 2 5 1 8
1 3 6 5 9
names = ['first','second','third']
finaldata.columns = pd.MultiIndex.from_arrays([df1.iloc[0],
df1.iloc[1],
finaldata.columns], names=names)
print (finaldata)
first 2 5 1 8
second 3 6 5 9
third A B E F
0 1 4 7 1
1 2 5 8 3
2 3 6 9 5

Add a new column to a pandas DataFrame with a value that depends on the previous row

I have an existing pandas DataFrame, and I want to add a new column, where the value of each row will depend on the previous row.
for example:
df1 = pd.DataFrame(np.random.randint(10, size=(4, 4)), columns=['a', 'b', 'c', 'd'])
df1
Out[31]:
a b c d
0 9 3 3 0
1 3 9 5 1
2 1 7 5 6
3 8 0 1 7
and now I want to create column e, where for each row i the value of df1['e'][i] would be: df1['e'][i] = df1['d'][i] - df1['d'][i-1]
desired output:
df1:
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
how can I achieve this?
You can use sub with shift (df here is the question's df1):
df['e'] = df.d.sub(df.d.shift(), fill_value=0)
print (df)
a b c d e
0 9 3 3 0 0.0
1 3 9 5 1 1.0
2 1 7 5 6 5.0
3 8 0 1 7 1.0
If you need to convert to int:
df['e'] = df.d.sub(df.d.shift(), fill_value=0).astype(int)
print (df)
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
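For comparison, Series.diff computes the same row-to-row difference directly. This is an alternative sketch, not the answer's method, using the values of d from the question:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"a": [9, 3, 1, 8], "b": [3, 9, 7, 0], "c": [3, 5, 5, 1], "d": [0, 1, 6, 7]}
)

# diff() gives NaN for the first row; fill it with 0 and go back to integers.
df1["e"] = df1["d"].diff().fillna(0).astype(int)
print(df1)
```

The two approaches differ in the first row: sub with fill_value=0 treats the missing previous value as 0, so the first entry equals d[0], while diff().fillna(0) always yields 0 there. They agree here only because d[0] is 0.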
