Concatenating multiindex dataframe with non-unique multiindex - python

I have a dataframe dfa:
                y  X1  X2  X3
Company Period
1       1       1   2   3   4
        2       3   4   5   6
        3       3   6   5   6
2       1       1   2   3   4
        2       3   4   5   6
        3       7   8   9  10
...
and dfb
Company Period
1       1
        2
        3
7       1
        2
        3
1       1
        2
        3
...
As you can see, dfb has a non-unique multiindex. I'd like to concatenate both dfs in a way that handles the non-uniqueness and adds the values of dfa to dfb everywhere the indexes are equal. So the desired result would look like this:
                y  X1  X2  X3
Company Period
1       1       1   2   3   4
        2       3   4   5   6
        3       3   6   5   6
7       1       1   2   3   4
        2       1   5   5   6
        3       1   6   8   9
1       1       1   2   3   4
        2       3   4   5   6
        3       3   6   5   6
...
I have tried the following:
dfb.join(dfa, how='left') #results in dfb
dfb = pd.concat([dfb, dfa], axis = 1, join = 'inner') #raises: ValueError: cannot handle a non-unique multi-index!
bs_df.merge(dfa.reset_index(), left_on=['Company', 'PeriodQ'], right_on=['Company', 'PeriodQ'], how='left') #results in dfb
What am I doing wrong?
I saw a similar question here, but the solution did not work for me.

You can reindex your DataFrame with duplicate indices as well and it will just repeat the corresponding rows.
In [11]: df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=['X', 'Y', 'Z'],
    ...:                   index=pd.MultiIndex.from_product([[1], [1,2,3]]))

In [12]: df
Out[12]:
     X  Y  Z
1 1  1  2  3
  2  4  5  6
  3  7  8  9

In [15]: df.loc[pd.MultiIndex.from_product([[1], [1,2,1,2]]), :]
Out[15]:
     X  Y  Z
1 1  1  2  3
  2  4  5  6
  1  1  2  3
  2  4  5  6
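Applied to the question's frames, the same trick handles the concatenation directly: select dfa's rows with dfb's duplicated index. A minimal sketch (dfa's values are copied from the question; reindex is used instead of .loc so that companies missing from dfa, like 7, yield NaN instead of raising a KeyError):
import pandas as pd

# Stand-in for dfa, with values taken from the question
idx_a = pd.MultiIndex.from_product([[1, 2], [1, 2, 3]],
                                   names=['Company', 'Period'])
dfa = pd.DataFrame({'y':  [1, 3, 3, 1, 3, 7],
                    'X1': [2, 4, 6, 2, 4, 8],
                    'X2': [3, 5, 5, 3, 5, 9],
                    'X3': [4, 6, 6, 4, 6, 10]}, index=idx_a)

# Stand-in for dfb's non-unique index (Company 1 appears twice)
idx_b = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (1, 3),
     (7, 1), (7, 2), (7, 3),
     (1, 1), (1, 2), (1, 3)], names=['Company', 'Period'])

# reindex repeats dfa's rows wherever the target index recurs;
# keys missing from dfa, such as Company 7, come back as NaN
result = dfa.reindex(idx_b)
Note that the desired output above shows values for Company 7 even though dfa contains no such rows; with an index-based lookup those rows can only come back as NaN.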

Related

How to add dataframe to multiindex dataframe at specific location

I'm organizing data from separate files into one portable, multiindex dataframe, with multiindex ("A", "B", "C"). Some of the info is gathered from the filenames read in and should populate the "A" and "B" levels of the multiindex. "C" should take the form of the index of the file read in. The columns should take the form of the columns read in.
Let's say the files read in become:
df1
   0  1  2  3  4
0  0  9  9  8  5
1  0  8  2  1  2
2  9  1  6  4  3
3  1  4  1  4  4
4  5  4  6  6  2

df2
   0  1  2  3  4
0  4  5  0  7  3
1  8  2  9  1  0
2  5  9  1  6  6
3  4  1  4  6  5
4  3  0  0  8  8
How do I get to this end result:
multiindex_df
        0  1  2  3  4
A B C
1 1 0   0  9  9  8  5
    1   0  8  2  1  2
    2   9  1  6  4  3
    3   1  4  1  4  4
    4   5  4  6  6  2
  2 0   4  5  0  7  3
    1   8  2  9  1  0
    2   5  9  1  6  6
    3   4  1  4  6  5
    4   3  0  0  8  8
Starting from:
import pandas as pd
import numpy as np

multiindex_df = pd.DataFrame(
    index=pd.MultiIndex.from_arrays(
        [[], [], []], names=["A", "B", "C"]))

df1 = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df1_a = 1
df1_b = 1

df2 = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df2_a = 1
df2_b = 2

breakpoint()
This is what I have in mind, but gives a key error:
multiindex_df.loc[(df1_a, df1_b, slice(None))] = df1
multiindex_df.loc[(df2_a, df2_b, slice(None))] = df2
You could do this as follows:
multiindex_df = pd.concat([df1, df2], keys=[1,2])
multiindex_df = pd.concat([multiindex_df], keys=[1])
multiindex_df.index.names = ['A','B','C']
print(multiindex_df)
        0  1  2  3  4
A B C
1 1 0   0  9  9  8  5
    1   0  8  2  1  2
    2   9  1  6  4  3
    3   1  4  1  4  4
    4   5  4  6  6  2
  2 0   4  5  0  7  3
    1   8  2  9  1  0
    2   5  9  1  6  6
    3   4  1  4  6  5
    4   3  0  0  8  8
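The two nested concat calls can also be collapsed into one by passing tuple keys; a sketch of the same result, assuming df1 and df2 as defined in the question:
# Each tuple key supplies the A and B levels in one step;
# `names` labels all three levels of the resulting index
multiindex_df = pd.concat([df1, df2], keys=[(1, 1), (1, 2)],
                          names=['A', 'B', 'C'])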
Alternatively, you could do it like below:
# collect your dfs inside a dict
dfs = {'1': df1, '2': df2}
# create list for index tuples
multi_index = []
for idx, val in enumerate(dfs):
    for x in dfs[val].index:
        # append tuple to row, e.g. (1,1,0), (1,1,1) etc.
        multi_index.append((1, idx+1, x))
# concat your dfs
multiindex_df_two = pd.concat([df1, df2])
# create multiindex from tuples, and add names
multiindex_df_two.index = pd.MultiIndex.from_tuples(multi_index, names=['A','B','C'])
# check
multiindex_df.equals(multiindex_df_two) # True
Ouroboros' answer made me realize that instead of trying to fit the read-in file dfs into a pre-formatted df, the cleaner solution is to format each individual file df and then concat. In order to do that, I have to re-format each file df's index into a multiindex, which prompted this question and answer. Having that down, and using Ouroboros' answer, the solution becomes:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df1_a = 1
df1_b = 1
df1.index = pd.MultiIndex.from_product(
    [[df1_a], [df1_b], df1.index], names=["A", "B", "C"])

df2 = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df2_a = 1
df2_b = 2
df2.index = pd.MultiIndex.from_product(
    [[df2_a], [df2_b], df2.index], names=["A", "B", "C"])

multiindex_df = pd.concat([df1, df2])
Which is obviously well suited for a loop (see the sketch after the output below).
Output:
df1
   0  1  2  3  4
0  5  3  1  1  3
1  8  9  7  5  6
2  8  6  6  7  7
3  3  4  9  7  2
4  3  2  1  6  2

df2
   0  1  2  3  4
0  5  0  6  9  3
1  7  5  5  9  6
2  2  1  9  6  3
3  9  4  3  7  0
4  5  9  5  9  6

multiindex_df
        0  1  2  3  4
A B C
1 1 0   5  3  1  1  3
    1   8  9  7  5  6
    2   8  6  6  7  7
    3   3  4  9  7  2
    4   3  2  1  6  2
  2 0   5  0  6  9  3
    1   7  5  5  9  6
    2   2  1  9  6  3
    3   9  4  3  7  0
    4   5  9  5  9  6
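Since the per-file steps are identical, the loop version hinted at above might look like the following sketch; the frames list with its (A, B) labels is a hypothetical stand-in for whatever is gathered from the filenames:
import pandas as pd
import numpy as np

# (A, B, frame) triples; in practice A and B come from the filenames
frames = [(1, 1, pd.DataFrame(np.random.randint(10, size=(5, 5)))),
          (1, 2, pd.DataFrame(np.random.randint(10, size=(5, 5))))]

pieces = []
for a, b, df in frames:
    # lift the flat file index into the three-level (A, B, C) multiindex
    df.index = pd.MultiIndex.from_product([[a], [b], df.index],
                                          names=["A", "B", "C"])
    pieces.append(df)

multiindex_df = pd.concat(pieces)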

Sort a subset of columns of a pandas dataframe alphabetically by column name

I'm having trouble finding the solution to a fairly simple problem.
I would like to alphabetically arrange certain columns of a pandas dataframe that has over 100 columns (i.e. so many that I don't want to list them manually).
Example df:
import pandas as pd

subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]

df = pd.DataFrame({'subject': subject,
                   'timepoint': timepoint,
                   'c': c,
                   'd': d,
                   'a': a,
                   'b': b})
df.head()
   subject  timepoint  c  d  a  b
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6
How could I rearrange the column names to generate a df.head() that looks like this:
   subject  timepoint  a  b  c  d
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6
i.e. keep the first two columns where they are and then alphabetically arrange the remaining column names.
Thanks in advance.
You can split your dataframe based on column names using the normal indexing operator [], sort the other columns alphabetically with sort_index(axis=1), and concat back together:
>>> pd.concat([df[['subject', 'timepoint']],
...            df[df.columns.difference(['subject', 'timepoint'])].sort_index(axis=1)],
...           ignore_index=False, axis=1)
    subject  timepoint  a  b  c  d
0         1          1  2  2  2  2
1         1          2  3  3  3  3
2         1          3  4  4  4  4
3         1          4  5  5  5  5
4         1          5  6  6  6  6
5         1          6  7  7  7  7
6         2          1  3  3  3  3
7         2          2  4  4  4  4
8         2          3  1  1  1  1
9         2          4  2  2  2  2
10        2          5  3  3  3  3
11        2          6  4  4  4  4
12        3          1  5  5  5  5
13        3          2  4  4  4  4
14        3          4  5  5  5  5
15        4          1  8  8  8  8
16        4          2  4  4  4  4
17        4          3  5  5  5  5
18        4          4  6  6  6  6
19        4          5  2  2  2  2
20        4          6  3  3  3  3
Specify the first two columns you want to keep (or determine them from the data), then sort all of the other columns. Use .loc with the correct list to then "sort" the DataFrame.
import numpy as np
first_cols = ['subject', 'timepoint']
#first_cols = df.columns[0:2].tolist() # OR determine first two
other_cols = np.sort(df.columns.difference(first_cols)).tolist()
df = df.loc[:, first_cols+other_cols]
print(df.head())
   subject  timepoint  a  b  c  d
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6
You can try getting the dataframe columns as a list, rearranging them, and assigning them back to the dataframe using df = df[cols]:
import pandas as pd

subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]

df = pd.DataFrame({'subject': subject,
                   'timepoint': timepoint,
                   'c': c,
                   'd': d,
                   'a': a,
                   'b': b})
cols = df.columns.tolist()
cols = cols[:2] + sorted(cols[2:])
df = df[cols]
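The same idea condenses to a single line, assuming the two columns to pin in place are always the first two:
df = df[df.columns[:2].tolist() + sorted(df.columns[2:])]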

Indexing new dataframes into new columns with pandas

I need to create a new dataframe from an existing one by selecting multiple columns, appending those column values to a new column, with the corresponding column index recorded in a second new column.
So, lets say I have this as a dataframe:
A  B  C  D  E  F
0  1  2  3  4  0
0  7  8  9  1  0
0  4  5  2  4  0
Transform into this by selecting columns B through E:
A index_value
1 1
7 1
4 1
2 2
8 2
5 2
3 3
9 3
2 3
4 4
1 4
4 4
So, for the new dataframe, column A would be all of the values from columns B through E in the old dataframe, and column index_value would correspond to the index value [starting from zero] of the selected columns.
I've been scratching my head for hours. Any help would be appreciated, thanks!
Python 3, using the pandas & numpy libraries.
#Another way
   A  B  C  D  E  F
0  0  1  2  3  4  0
1  0  7  8  9  1  0
2  0  4  5  2  4  0
# Select columns to include
start_column = 'B'
end_column = 'E'
index_column_name = 'A'

# re-stack the dataframe
df = (df.loc[:, start_column:end_column]
        .stack()
        .sort_index(level=1)
        .reset_index(level=0, drop=True)
        .to_frame())

# Create the "index_value" column
df['index_value'] = pd.Categorical(df.index).codes + 1
df.rename(columns={0: index_column_name}, inplace=True)
df.set_index(index_column_name, inplace=True)
df
   index_value
A
1            1
7            1
4            1
2            2
8            2
5            2
3            3
9            3
2            3
4            4
1            4
4            4
This is just melt
df.columns = range(df.shape[1])
s = df.melt().loc[lambda x: x.value != 0]
s
    variable  value
3          1      1
4          1      7
5          1      4
6          2      2
7          2      8
8          2      5
9          3      3
10         3      9
11         3      2
12         4      4
13         4      1
14         4      4
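If the column names from the question's desired output are wanted, the melted frame can then be relabeled; a small follow-up sketch:
# Rename and reorder to match the desired layout
s = s.rename(columns={'value': 'A', 'variable': 'index_value'})[['A', 'index_value']]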
Try using:
df = pd.melt(df[['B', 'C', 'D', 'E']])
# Or: df = df[['B', 'C', 'D', 'E']].melt()
df['variable'] = df['variable'].shift().eq(df['variable'].shift(-1)).cumsum().shift(-1).ffill()
print(df)
Output:
    variable  value
0        1.0      1
1        1.0      7
2        1.0      4
3        2.0      2
4        2.0      8
5        2.0      5
6        3.0      3
7        3.0      9
8        3.0      2
9        4.0      4
10       4.0      1
11       4.0      4

Append Pandas disjunction of 2 dataframes to first dataframe

Given 2 pandas tables, both with the 3 columns id, x and y. Several rows with the same id represent a graph with its x-y values. How would I find paths that exist in the second table but not in the first, and append them to the first table? The key problem is that the order of the graphs can differ between the two tables.
Example:
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4], 'x':[1,1,1,1,1,5,4,4,10,10,9], 'y':[4,5,6,1,2,4,4,3,1,2,2]})
df1            df2            ---------->  df1 (after appending)
id  x  y       id  x  y       id   x  y
 1  1  1        1  1  4        1   1  1
 1  1  2        1  1  5        1   1  2
 2  5  4        1  1  6        2   5  4
 2  4  4        2  1  1        2   4  4
 2  4  3        2  1  2        2   4  3
 3  1  4        3  5  4        3   1  4
 3  1  5        3  4  4        3   1  5
 3  1  6        3  4  3        3   1  6
                4 10  1        4  10  1
                4 10  2        4  10  2
                4  9  2        4   9  2
Should become:
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3,4,4,4], 'x':[1,1,5,4,4,1,1,1,10,10,9], 'y':[1,2,4,4,3,4,5,6,1,2,2]})
As you can see, up to id = 3 df1 and df2 contain the same graphs, but their order differs between the tables; for example, df1's first graph is df2's second graph. df2 also has a 4th path that is not in df1. That 4th path should be detected and appended to df1. In other words, I want the intersection of the two tables, plus the disjunction of the two appended to the first table, under the condition that the ids, i.e. the order of the paths, can differ between the tables.
Imports:
import pandas as pd
Set starting DataFrames:
df1 = pd.DataFrame({'id': [1,1,2,2,2,3,3,3],
                    'x':  [1,1,5,4,4,1,1,1],
                    'y':  [1,2,4,4,3,4,5,6]})

df2 = pd.DataFrame({'id': [1,1,1,2,2,3,3,3,4,4,4],
                    'x':  [1,1,1,1,1,5,4,4,10,10,9],
                    'y':  [4,5,6,1,2,4,4,3,1,2,2]})
Outer Merge:
df_merged = df1.merge(df2, on=['x', 'y'], how='outer')
produces:
df_merged =
    id_x   x  y  id_y
0    1.0   1  1     2
1    1.0   1  2     2
2    2.0   5  4     3
3    2.0   4  4     3
4    2.0   4  3     3
5    3.0   1  4     1
6    3.0   1  5     1
7    3.0   1  6     1
8    NaN  10  1     4
9    NaN  10  2     4
10   NaN   9  2     4
Note: id_x becomes float because the outer merge introduces NaN for rows with no match in df1, and an integer column cannot hold NaN, so pandas upcasts it to float.
Fill NaN:
df_merged.id_x = df_merged.id_x.fillna(df_merged.id_y).astype('int')
produces:
df_merged =
    id_x   x  y  id_y
0      1   1  1     2
1      1   1  2     2
2      2   5  4     3
3      2   4  4     3
4      2   4  3     3
5      3   1  4     1
6      3   1  5     1
7      3   1  6     1
8      4  10  1     4
9      4  10  2     4
10     4   9  2     4
Drop id_y:
df_merged = df_merged.drop(['id_y'], axis=1)
produces:
df_merged =
    id_x   x  y
0      1   1  1
1      1   1  2
2      2   5  4
3      2   4  4
4      2   4  3
5      3   1  4
6      3   1  5
7      3   1  6
8      4  10  1
9      4  10  2
10     4   9  2
Rename id_x to id:
df_merged = df_merged.rename(columns={'id_x': 'id'})
produces:
df_merged =
    id   x  y
0    1   1  1
1    1   1  2
2    2   5  4
3    2   4  4
4    2   4  3
5    3   1  4
6    3   1  5
7    3   1  6
8    4  10  1
9    4  10  2
10   4   9  2
Final Program is 4 lines of code:
import pandas as pd
df1 = pd.DataFrame({'id': [1,1,2,2,2,3,3,3],
                    'x':  [1,1,5,4,4,1,1,1],
                    'y':  [1,2,4,4,3,4,5,6]})

df2 = pd.DataFrame({'id': [1,1,1,2,2,3,3,3,4,4,4],
                    'x':  [1,1,1,1,1,5,4,4,10,10,9],
                    'y':  [4,5,6,1,2,4,4,3,1,2,2]})
df_merged = df1.merge(df2, on=['x', 'y'], how='outer')
df_merged.id_x = df_merged.id_x.fillna(df_merged.id_y).astype('int')
df_merged = df_merged.drop(['id_y'], axis=1)
df_merged = df_merged.rename(columns={'id_x': 'id'})
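For what it's worth, the same four steps can be written as one method chain (an equivalent sketch; note that the whole approach relies on each (x, y) pair identifying a row uniquely, as it does in this example):
df_merged = (df1.merge(df2, on=['x', 'y'], how='outer')
                .assign(id=lambda d: d.id_x.fillna(d.id_y).astype('int'))
                .drop(columns=['id_x', 'id_y'])
                [['id', 'x', 'y']])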
Please remember to put a check next to the selected answer.
Mauritius, try this code:
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4,5], 'x':[1,1,1,1,1,5,4,4,10,10,9,1], 'y':[4,5,6,1,2,4,4,3,1,2,2,2]})
df1_s = [{(x,y) for x, y in df1[['x','y']][df1.id==i].values} for i in df1.id.unique()]
def f(df2):
    data = {(x, y) for x, y in df2[['x', 'y']].values}
    if data not in df1_s:
        return True
    else:
        return False
check = df2.groupby('id').apply(f).apply(pd.Series)
ids = check[check[0]].index.values
df2 = df2.set_index('id').loc[ids].reset_index()
df1 = df1.append(df2)
OUT:
   id   x  y
0   1   1  1
1   1   1  2
2   2   5  4
3   2   4  4
4   2   4  3
5   3   1  4
6   3   1  5
7   3   1  6
0   4  10  1
1   4  10  2
2   4   9  2
3   5   1  2
I think it can be done in a simpler, more Pythonic way, but I've thought about it a lot and still don't know how. =)
Also, before appending one df to the other at the end, the code should check that the ids in df1 and df2 don't collide. I might add this later.
Does this code do what you want?
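One more compact variant along the lines the answer hints at, comparing each path's point set directly via frozensets (a sketch, assuming two paths are equal exactly when their point sets are equal, as the code above assumes):
# Point sets of the paths already present in df1
paths1 = {frozenset(map(tuple, g[['x', 'y']].values))
          for _, g in df1.groupby('id')}

# ids of df2 paths whose point set does not occur in df1
new_ids = [i for i, g in df2.groupby('id')
           if frozenset(map(tuple, g[['x', 'y']].values)) not in paths1]

df1 = pd.concat([df1, df2[df2['id'].isin(new_ids)]], ignore_index=True)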

Add multiple columns to DataFrame and set them equal to an existing column

I want to add multiple columns to a pandas DataFrame and set them equal to an existing column. Is there a simple way of doing this? In R I would do:
df <- data.frame(a=1:5)
df[c('b','c')] <- df$a
df
a b c
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
In pandas this results in KeyError: "['b' 'c'] not in index":
df = pd.DataFrame({'a': np.arange(1,6)})
df[['b','c']] = df.a
You can use the .assign() method:
In [31]: df.assign(b=df['a'], c=df['a'])
Out[31]:
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
or a little bit more creative approach:
In [41]: cols = list('bcdefg')
In [42]: df.assign(**{col:df['a'] for col in cols})
Out[42]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
another solution:
In [60]: pd.DataFrame(np.repeat(df.values, len(cols)+1, axis=1), columns=['a']+cols)
Out[60]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
NOTE: as @Cpt_Jauchefuerst mentioned in the comment, DataFrame.assign(z=1, a=1) will add columns in alphabetical order - i.e. first a will be added to existing columns and then z.
A pd.concat approach:
df = pd.DataFrame(dict(a=range(5)))
pd.concat([df.a] * 5, axis=1, keys=list('abcde'))
a b c d e
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
Turns out you can use a loop to do this:
for i in ['b','c']: df[i] = df.a
You can set them individually if you're only dealing with a few columns:
df['b'] = df['a']
df['c'] = df['a']
or you can use a loop as you discovered.
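On recent pandas versions, the multi-column assignment from the question can also be made to work by sidestepping column-name alignment with .to_numpy() (a sketch; older versions, like the one that raised the KeyError above, may still reject it):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1, 6)})

# Repeat column 'a' and assign positionally; .to_numpy() prevents
# pandas from aligning on the source column names ('a', 'a')
df[['b', 'c']] = df[['a', 'a']].to_numpy()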
