Combining columns within the same df Python/Pandas

Combining columns within the same df Python/Pandas - python

I'm new to the programming world and can't figure out how to concatenate columns in pandas. I'm not looking to join these columns, but rather stack them on top of each other.
This is the code I have so far:
import pandas as pd
import numpy as np
df = pd.read_excel("C:\\Users\\Kit Wesselhoeft\\Documents\\NEM\\Northend Manufacturing_deletecol.xlsx")
print(df)
df = pd.concat(['A','A'])
print(df)
image here
I want to combine all the columns so that all the A's sit on top of each other, Same with the B's - E's.
How can I do this? Am I missing something?

It looks like you are looking for "append":
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,10, (3,2)),columns=list('AB'))
df2 = pd.DataFrame(np.random.randint(1,10, (3,2)),columns=list('AB'))
df3=df.append(df2)
In [2]: df3
Out[2]:
A B
0 7 6
1 8 3
2 2 1
0 2 2
1 1 3
2 5 5

If you are certain column ordering is consistent and tiled [A,B,C,A,B,C...], then you can create a new DataFrame by reshaping the old. Otherwise safer alternatives exist with pd.wide_to_long which uses the actual column names.
Sample Data
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1, 10, (3, 15)),
columns=list('BACDE')*3)
# B A C D E B A C D E B A C D E
#0 3 3 7 2 4 7 2 1 2 1 1 4 5 1 1
#1 5 2 8 4 3 5 8 3 5 9 1 8 4 5 7
#2 2 6 7 3 2 9 4 6 1 3 7 3 5 5 7
Reshape
cols = pd.unique(df.columns) # Preserves Order
pd.DataFrame(df.values.reshape(-1, len(cols)), columns=cols)
# B A C D E
#0 3 3 7 2 4
#1 7 2 1 2 1
#2 1 4 5 1 1
#3 5 2 8 4 3
#4 5 8 3 5 9
#5 1 8 4 5 7
#6 2 6 7 3 2
#7 9 4 6 1 3
#8 7 3 5 5 7
pd.wide_to_long
Useful when your columns are not in the same tiling order, or if you have more of some than others. Requires you to modify the column names by adding _N for which occurrence of the column it is.
cols = pd.unique(df.columns)
s = pd.Series(df.columns).groupby(df.columns).cumcount()
df.columns = [f'{col}_{N}' for col,N in zip(df.columns, s)]
pd.wide_to_long(df.reset_index(), stubnames=cols, i='index', j='num', sep='_').reset_index(drop=True)
# B A C D E
#0 3 3 7 2 4
#1 5 2 8 4 3
#2 2 6 7 3 2
#3 7 2 1 2 1
#4 5 8 3 5 9
#5 9 4 6 1 3
#6 1 4 5 1 1
#7 1 8 4 5 7
#8 7 3 5 5 7

The following example is relevant when you exactly know where your columns are. Building on ALollz's code:
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1, 10, (3, 15)),
columns=list('BACDE')*3)
# B A C D E B A C D E B A C D E
#0 3 3 7 2 4 7 2 1 2 1 1 4 5 1 1
#1 5 2 8 4 3 5 8 3 5 9 1 8 4 5 7
#2 2 6 7 3 2 9 4 6 1 3 7 3 5 5 7
# Using iloc
df1 = df.iloc[:, :5]
df2 = df.iloc[:,5:10]
df3 = df.iloc[:,10:]
df_final= pd.concat([df1,df2,df3]).reset_index(drop=True)
Result df_final:
B A C D E
0 3 3 7 2 4
1 5 2 8 4 3
2 2 6 7 3 2
3 7 2 1 2 1
4 5 8 3 5 9
5 9 4 6 1 3
6 1 4 5 1 1
7 1 8 4 5 7
8 7 3 5 5 7

Related

How to add dataframe to multiindex dataframe at specific location

I'm organizing data from separate files into one portable, multiindex
dataframe, with multiindex ("A", "B", "C"). Some of the info is gathered
from the filenames read in, and should populate the "A", and "B", of the
multiindex. "C" should take the form of the index of the file read in.
The columns should take the form of the columns read in.
Let's say the files read in become:
df1
0 1 2 3 4
0 0 9 9 8 5
1 0 8 2 1 2
2 9 1 6 4 3
3 1 4 1 4 4
4 5 4 6 6 2
df2
0 1 2 3 4
0 4 5 0 7 3
1 8 2 9 1 0
2 5 9 1 6 6
3 4 1 4 6 5
4 3 0 0 8 8
How do I get to this end result:
multiindex_df
0 1 2 3 4
A B C
1 1 0 0 9 9 8 5
1 0 8 2 1 2
2 9 1 6 4 3
3 1 4 1 4 4
4 5 4 6 6 2
1 2 0 4 5 0 7 3
1 8 2 9 1 0
2 5 9 1 6 6
3 4 1 4 6 5
4 3 0 0 8 8
Starting from:
import pandas as pd
import numpy as np
multiindex_df = pd.DataFrame(
index=pd.MultiIndex.from_arrays(
[[], [], []], names=["A", "B", "C"]))
df1 = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df1_a = 1
df1_b = 1
df2 = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df2_a = 1
df2_b = 2
breakpoint()
This is what I have in mind, but gives a key error:
multiindex_df.loc[(df1_a, df1_b, slice(None))] = df1
multiindex_df.loc[(df2_a, df2_b, slice(None))] = df2

You could do this as follows:
multiindex_df = pd.concat([df1, df2], keys=[1,2])
multiindex_df = pd.concat([multiindex_df], keys=[1])
multiindex_df.index.names = ['A','B','C']
print(multiindex_df)
0 1 2 3 4
A B C
1 1 0 0 9 9 8 5
1 0 8 2 1 2
2 9 1 6 4 3
3 1 4 1 4 4
4 5 4 6 6 2
2 0 4 5 0 7 3
1 8 2 9 1 0
2 5 9 1 6 6
3 4 1 4 6 5
4 3 0 0 8 8
Alternatively, you could do it like below:
# collect your dfs inside a dict
dfs = {'1': df1, '2': df2}
# create list for index tuples
multi_index = []
for idx, val in enumerate(dfs):
for x in dfs[val].index:
# append tuple to row, e.g (1,1,0), (1,1,1) etc.
multi_index.append((1,idx+1,x))
# concat your dfs
multiindex_df_two = pd.concat([df1, df2])
# create multiindex from tuples, and add names
multiindex_df_two.index = pd.MultiIndex.from_tuples(multi_index, names=['A','B','C'])
# check
multiindex_df.equals(multiindex_df_two) # True

Ouroboros' answer made
me realize that instead of trying to fit the read-in file dfs into a
formatted df, the cleaner solution is to format each individual file df,
then concat.
In order to do that, I have to re-format the file df indices into a
multiindex, which prompted this question and
answer.
Having that down, and using Ouroboros'
answer, the solution
becomes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df1_a = 1
df1_b = 1
df1.index = pd.MultiIndex.from_product(
[[df1_a], [df1_b], df1.index], names=["A", "B", "C"])
df2 = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df2_a = 1
df2_b = 2
df2.index = pd.MultiIndex.from_product(
[[df2_a], [df2_b], df2.index], names=["A", "B", "C"])
multiindex_df = pd.concat([df1, df2])
Which is obviously well suited for a loop.
Output:
df1
0 1 2 3 4
0 5 3 1 1 3
1 8 9 7 5 6
2 8 6 6 7 7
3 3 4 9 7 2
4 3 2 1 6 2
df2
0 1 2 3 4
0 5 0 6 9 3
1 7 5 5 9 6
2 2 1 9 6 3
3 9 4 3 7 0
4 5 9 5 9 6
multiiindex_df
0 1 2 3 4
A B C
1 1 0 5 3 1 1 3
1 8 9 7 5 6
2 8 6 6 7 7
3 3 4 9 7 2
4 3 2 1 6 2
2 0 5 0 6 9 3
1 7 5 5 9 6
2 2 1 9 6 3
3 9 4 3 7 0
4 5 9 5 9 6

How to add multiindex columns to existing df, preserving original index

I start with:
df
0 1 2 3 4
0 5 0 0 2 6
1 9 6 5 8 6
2 8 9 4 2 1
3 2 5 8 9 6
4 8 8 8 0 8
and want to end up with:
df
0 1 2 3 4
A B C
1 2 0 5 0 0 2 6
1 9 6 5 8 6
2 8 9 4 2 1
3 2 5 8 9 6
4 8 8 8 0 8
where A and B are known after df creation, and C is the original
index of the df.
MWE:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df_a = 1
df_b = 2
breakpoint()
What I have in mind, but gives unhashable type error:
df.reindex([df_a, df_b, df.index])

Try with pd.MultiIndex.from_product:
df.index = pd.MultiIndex.from_product(
[[df_a], [df_b], df.index], names=['A','B','C'])
df
Out[682]:
0 1 2 3 4
A B C
1 2 0 7 0 1 9 9
1 0 4 7 3 2
2 7 2 0 0 4
3 5 5 6 8 4
4 1 4 9 8 1

Sort a subset of columns of a pandas dataframe alphabetically by column name

I'm having trouble finding the solution to a fairly simple problem.
I would like to alphabetically arrange certain columns of a pandas dataframe that has over 100 columns (i.e. so many that I don't want to list them manually).
Example df:
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject':subject,
'timepoint':timepoint,
'c':c,
'd':d,
'a':a,
'b':b})
df.head()
subject timepoint c d a b
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
How could I rearrange the column names to generate a df.head() that looks like this:
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
i.e. keep the first two columns where they are and then alphabetically arrange the remaining column names.
Thanks in advance.

You can split your your dataframe based on column names, using normal indexing operator [], sort alphabetically the other columns using sort_index(axis=1), and concat back together:
>>> pd.concat([df[['subject','timepoint']],
df[df.columns.difference(['subject', 'timepoint'])]\
.sort_index(axis=1)],ignore_index=False,axis=1)
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
5 1 6 7 7 7 7
6 2 1 3 3 3 3
7 2 2 4 4 4 4
8 2 3 1 1 1 1
9 2 4 2 2 2 2
10 2 5 3 3 3 3
11 2 6 4 4 4 4
12 3 1 5 5 5 5
13 3 2 4 4 4 4
14 3 4 5 5 5 5
15 4 1 8 8 8 8
16 4 2 4 4 4 4
17 4 3 5 5 5 5
18 4 4 6 6 6 6
19 4 5 2 2 2 2
20 4 6 3 3 3 3

Specify the first two columns you want to keep (or determine them from the data), then sort all of the other columns. Use .loc with the correct list to then "sort" the DataFrame.
import numpy as np
first_cols = ['subject', 'timepoint']
#first_cols = df.columns[0:2].tolist() # OR determine first two
other_cols = np.sort(df.columns.difference(first_cols)).tolist()
df = df.loc[:, first_cols+other_cols]
print(df.head())
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6

You can try getting the dataframe columns as a list, rearrange them, and assign it back to the dataframe using df = df[cols]
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject':subject,
'timepoint':timepoint,
'c':c,
'd':d,
'a':a,
'b':b})
cols = df.columns.tolist()
cols = cols[:2] + sorted(cols[2:])
df = df[cols]

Pandas Split DataFrame using row index

I want to split dataframe by uneven number of rows using row index.
The below code:
groups = df.groupby((np.arange(len(df.index))/l[1]).astype(int))
works only for uniform number of rows.
df
a b c
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
l = [2, 5, 7]
df1
1 1 1
2 2 2
df2
3,3,3
4,4,4
5,5,5
df3
6,6,6
7,7,7
df4
8,8,8

You could use list comprehension with a little modications your list, l, first.
print(df)
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
l = [2,5,7]
l_mod = [0] + l + [max(l)+1]
list_of_dfs = [df.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]
Output:
list_of_dfs[0]
a b c
0 1 1 1
1 2 2 2
list_of_dfs[1]
a b c
2 3 3 3
3 4 4 4
4 5 5 5
list_of_dfs[2]
a b c
5 6 6 6
6 7 7 7
list_of_dfs[3]
a b c
7 8 8 8

I think this is what you need:
df = pd.DataFrame({'a': np.arange(1, 8),
'b': np.arange(1, 8),
'c': np.arange(1, 8)})
df.head()
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
last_check = 0
dfs = []
for ind in [2, 5, 7]:
dfs.append(df.loc[last_check:ind-1])
last_check = ind
Although list comprehension are much more efficient than a for loop, the last_check is necessary if you don't have a pattern in your list of indices.
dfs[0]
a b c
0 1 1 1
1 2 2 2
dfs[2]
a b c
5 6 6 6
6 7 7 7

I think this is you are looking for.,
l = [2, 5, 7]
dfs=[]
i=0
for val in l:
if i==0:
temp=df.iloc[:val]
dfs.append(temp)
elif i==len(l):
temp=df.iloc[val]
dfs.append(temp)
else:
temp=df.iloc[l[i-1]:val]
dfs.append(temp)
i+=1
Output:
a b c
0 1 1 1
1 2 2 2
a b c
2 3 3 3
3 4 4 4
4 5 5 5
a b c
5 6 6 6
6 7 7 7
Another Solution:
l = [2, 5, 7]
t= np.arange(l[-1])
l.reverse()
for val in l:
t[:val]=val
temp=pd.DataFrame(t)
temp=pd.concat([df,temp],axis=1)
for u,v in temp.groupby(0):
print v
Output:
a b c 0
0 1 1 1 2
1 2 2 2 2
a b c 0
2 3 3 3 5
3 4 4 4 5
4 5 5 5 5
a b c 0
5 6 6 6 7
6 7 7 7 7

You can create an array to use for indexing via NumPy:
import pandas as pd, numpy as np
df = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list('abc'))
L = [2, 5, 7]
idx = np.cumsum(np.in1d(np.arange(len(df.index)), L))
for _, chunk in df.groupby(idx):
print(chunk, '\n')
a b c
0 0 1 2
1 3 4 5
a b c
2 6 7 8
3 9 10 11
4 12 13 14
a b c
5 15 16 17
6 18 19 20
a b c
7 21 22 23
Instead of defining a new variable for each dataframe, you can use a dictionary:
d = dict(tuple(df.groupby(idx)))
print(d[1]) # print second groupby value
a b c
2 6 7 8
3 9 10 11
4 12 13 14

Converting one to many mapping dictionary to Dataframe

I have a dictionary as follows:
d={1:(array[2,3]), 2:(array[8,4,5]), 3:(array[6,7,8,9])}
As depicted, here the values for each key are variable length arrays.
Now I want to convert it to DataFrame. So the output looks like:
A B
1 2
1 3
2 8
2 4
2 5
3 6
3 7
3 8
3 9
I used pd.Dataframe(d), but it does not handle one to many mapping.Any help would be appreciated.

Use Series constructor with str.len for lenghts of lists (arrays was converted to lists).
Then create new DataFrame with numpy.repeat, numpy.concatenate and Index.values:
d = {1:np.array([2,3]), 2:np.array([8,4,5]), 3:np.array([6,7,8,9])}
print (d)
a = pd.Series(d)
l = a.str.len()
df = pd.DataFrame({'A':np.repeat(a.index.values, l), 'B': np.concatenate(a.values)})
print (df)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9

pd.DataFrame(
[[k, v] for k, a in d.items() for v in a.tolist()],
columns=['A', 'B']
)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
Setup
d = {1: np.array([2,3]), 2: np.array([8,4,5]), 3: np.array([6,7,8,9])}

Here's my version:
(pd.DataFrame.from_dict(d, orient='index').rename_axis('A')
.stack()
.reset_index(name='B')
.drop('level_1', axis=1)
.astype('int'))
Out[63]:
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Combining columns within the same df Python/Pandas - python

It looks like you are looking for "append": import numpy as np import pandas as pd df = pd.DataFrame(np.random.randint(1,10, (3,2)),columns=list('AB')) df2 = pd.DataFrame(np.random.randint(1,10, (3,2)),columns=list('AB')) df3=df.append(df2) In [2]: df3 Out[2]: A B 0 7 6 1 8 3 2 2 1 0 2 2 1 1 3 2 5 5

Related

How to add dataframe to multiindex dataframe at specific location

How to add multiindex columns to existing df, preserving original index

Sort a subset of columns of a pandas dataframe alphabetically by column name

Pandas Split DataFrame using row index

Converting one to many mapping dictionary to Dataframe

Categories

Resources