complete and repeat one dataframe along another one - python

How do you combine 2 dataframes so that one is repeated over and over and combined for every line of the other dataframe, for example :
d1 = pd.DataFrame([[1,3],[2,4]])
print(d1)
0 1
0 1 3
1 2 4
and
d2 = pd.DataFrame([['A','D'],['B','E'],['C','F']])
print(d2)
0 1
0 A D
1 B E
2 C F
combining in :
d3 = pd.DataFrame([[1,3,'A','D'],[1,3,'B','E'],[1,3,'C','F'],[2,4,'A','D'],[2,4,'B','E'],[2,4,'C','F']])
print(d3)
0 1 2 3
0 1 3 A D
1 1 3 B E
2 1 3 C F
3 2 4 A D
4 2 4 B E
5 2 4 C F
I can loop over d1 and concat, but is there any implemented functionnality already doing this ?
Thanks

I believe what you are searching for is a cross-join.
You can use the following code to get your answer, you will just need to clean up the column naming
df1 = pd.DataFrame([[1,3],[2,4]])
df2 = pd.DataFrame([['A','D'],['B','E'],['C','F']])
df1.merge(df2, how = 'cross')

I hope, this works for your solution. Create a key column with value of 1 in both dataframes and join with that key and then drop it.
import pandas as pd
d1 = pd.DataFrame([[1,3],[2,4]])
print(d1)
d2 = pd.DataFrame([['A','D'],['B','E'],['C','F']])
print(d2)
d1['key'] = 1
d2['key'] = 1
d1.merge(d2, on='key').drop('key', axis=1)

Here is an alternative solution using pd.merge() and df.assign()
d2.columns = ['2', '3']
d3 = pd.merge(d1.assign(key=1), d2.assign(key=1), on='key', suffixes=('', '')).drop('key', axis=1)
print(d3)
0 1 2 3
0 1 3 A D
1 1 3 B E
2 1 3 C F
3 2 4 A D
4 2 4 B E
5 2 4 C F

Related

Grouping the columns and identifying values which are not part of this group

I have a DataFrame which looks like this:
df:-
A B
1 a
1 a
1 b
2 c
3 d
Now using this dataFrame i want to get the following new_df:
new_df:-
item val_not_present
1 c #1 doesn't have values c and d(values not part of group 1)
1 d
2 a #2 doesn't have values a,b and d(values not part of group 2)
2 b
2 d
3 a #3 doesn't have values a,b and c(values not part of group 3)
3 b
3 c
or an individual DataFrame for each items like:
df1:
item val_not_present
1 c
1 d
df2:-
item val_not_present
2 a
2 b
2 d
df3:-
item val_not_present
3 a
3 b
3 c
I want to get all the values which are not part of that group.
You can use np.setdiff and explode:
values_b = df.B.unique()
pd.DataFrame(df.groupby("A")["B"].unique().apply(lambda x: np.setdiff1d(values_b,x)).rename("val_not_present").explode())
Output:
val_not_present
A
1 c
1 d
2 a
2 b
2 d
3 a
3 b
3 c
Another approach is using crosstab/pivot_table to get counts and then filter on where count is 0 and transform to dataframe:
m = pd.crosstab(df['A'],df['B'])
pd.DataFrame(m.where(m.eq(0)).stack().index.tolist(),columns=['A','val_not_present'])
A val_not_present
0 1 c
1 1 d
2 2 a
3 2 b
4 2 d
5 3 a
6 3 b
7 3 c
You could convert B to a categorical datatype and then compute the value counts. Categorical variables will show categories that have frequency counts of zero so you could do something like this:
df['B'] = df['B'].astype('category')
new_df = (
df.groupby('A')
.apply(lambda x: x['B'].value_counts())
.reset_index()
.query('B == 0')
.drop(labels='B', axis=1)
.rename(columns={'level_1':'val_not_present',
'A':'item'})
)

pandas groupby transpose str column

here is what I am trying to do:
>>>import pandas as pd
>>>dftemp = pd.DataFrame({'a': [1] * 3 + [2] * 3, 'b': 'a a b c d e'.split()})
a b
0 1 a
1 1 a
2 1 b
3 2 c
4 2 d
5 2 e
6 3 f
how to transpose column 'b' grouped by column 'a', so that output looks like:
a b0 b1 b2
0 1 a a b
3 2 c d e
6 3 f NaN NaN
Using pivot_table with cumcount:
(df.assign(flag=df.groupby('a').b.cumcount())
.pivot_table(index='a', columns='flag', values='b', aggfunc='first')
.add_prefix('B'))
flag B0 B1 B2
a
1 a a b
2 c d e
3 f NaN NaN
You can try of grouping by column and flattening the values associated with group and reframe it as dataframe
df = df.groupby(['a'])['b'].apply(lambda x: x.values.flatten())
pd.DataFrame(df.values.tolist(),index=df.index).add_prefix('B')
Out:
B0 B1 B2
a
1 a a b
2 c d e
3 f None None
you could probably try something like this :
>>> dftemp = pd.DataFrame({'a': [1] * 3 + [2] * 2 + [3]*1, 'b': 'a a b c d e'.split()})
>>> dftemp
a b
0 1 a
1 1 a
2 1 b
3 2 c
4 2 d
5 3 e
>>> dftemp.groupby('a')['b'].apply(lambda df: df.reset_index(drop=True)).unstack()
0 1 2
a
1 a a b
2 c d None
3 e None None
Given the ordering of your DataFrame you could find where the group changes and use np.split to create a new DataFrame.
import numpy as np
import pandas as pd
splits = dftemp[(dftemp.a != dftemp.a.shift())].index.values
df = pd.DataFrame(np.split(dftemp.b.values, splits[1:])).add_prefix('b').fillna(np.NaN)
df['a'] = dftemp.loc[splits, 'a'].values
Output
b0 b1 b2 a
0 a a b 1
1 c d e 2
2 f NaN NaN 3

Shuffle rows by a column in pandas

I have the following example of dataframe.
c1 c2
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
Given a template c1 = [3, 2, 5, 4, 1], I want to change the order of the rows based on the new order of column c1, so it will look like:
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
I found the following thread, but the shuffle is random. Cmmiw.
Shuffle DataFrame rows
If values are unique in list and also in c1 column use reindex:
df = df.set_index('c1').reindex(c1).reset_index()
print (df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
General solution working with duplicates in list and also in column:
c1 = [3, 2, 5, 4, 1, 3, 2, 3]
#create df from list
list_df = pd.DataFrame({'c1':c1})
print (list_df)
c1
0 3
1 2
2 5
3 4
4 1
5 3
6 2
7 3
#helper column for count duplicates values
df['g'] = df.groupby('c1').cumcount()
list_df['g'] = list_df.groupby('c1').cumcount()
#merge together, create index from column and remove g column
df = list_df.merge(df).drop('g', axis=1)
print (df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
5 3 c
merge
You can create a dataframe with the column specified in the wanted order then merge.
One advantage of this approach is that it gracefully handles duplicates in either df.c1 or the list c1. If duplicates not wanted then care must be taken to handle them prior to reordering.
d1 = pd.DataFrame({'c1': c1})
d1.merge(df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
searchsorted
This is less robust but will work if df.c1 is:
already sorted
one-to-one mapping
df.iloc[df.c1.searchsorted(c1)]
c1 c2
2 3 c
1 2 b
4 5 e
3 4 d
0 1 a

insert a list as row in a dataframe at a specific position

I have a list l=['a', 'b' ,'c']
and a dataframe with columns d,e,f and values are all numbers
How can I insert list l in my dataframe just below the columns.
Setup
df = pd.DataFrame(np.ones((2, 3), dtype=int), columns=list('def'))
l = list('abc')
df
d e f
0 1 1 1
1 1 1 1
Option 1
I'd accomplish this task by adding a level to the columns object
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns, l)))
df
d e f
a b c
0 1 1 1
1 1 1 1
Option 2
Use a dictionary comprehension passed to the dataframe constructor
pd.DataFrame({(i, j): df[i] for i, j in zip(df, l)})
d e f
a b c
0 1 1 1
1 1 1 1
But if you insist on putting it in the dataframe proper... (keep in mind, this turns the dataframe into dtype object and we lose significant computational efficiencies.)
Alternative 1
pd.DataFrame([l], columns=df.columns).append(df, ignore_index=True)
d e f
0 a b c
1 1 1 1
2 1 1 1
Alternative 2
pd.DataFrame([l] + df.values.tolist(), columns=df.columns)
d e f
0 a b c
1 1 1 1
2 1 1 1
Use pd.concat
In [1112]: df
Out[1112]:
d e f
0 0.517243 0.731847 0.259034
1 0.318821 0.551298 0.773115
2 0.194192 0.707525 0.804102
3 0.945842 0.614033 0.757389
In [1113]: pd.concat([pd.DataFrame([l], columns=df.columns), df], ignore_index=True)
Out[1113]:
d e f
0 a b c
1 0.517243 0.731847 0.259034
2 0.318821 0.551298 0.773115
3 0.194192 0.707525 0.804102
4 0.945842 0.614033 0.757389
Are you looking for append i.e
df = pd.DataFrame([[1,2,3]],columns=list('def'))
I = ['a','b','c']
ndf = df.append(pd.Series(I,index=df.columns.tolist()),ignore_index=True)
Output:
d e f
0 1 2 3
1 a b c
If you want add list to columns for MultiIndex:
df.columns = [df.columns, l]
print (df)
d e f
a b c
0 4 7 1
1 5 8 3
2 4 9 5
3 5 4 7
4 5 2 1
5 4 3 0
print (df.columns)
MultiIndex(levels=[['d', 'e', 'f'], ['a', 'b', 'c']],
labels=[[0, 1, 2], [0, 1, 2]])
If you want add list to specific position pos:
pos = 0
df1 = pd.DataFrame([l], columns=df.columns)
print (df1)
d e f
0 a b c
df = pd.concat([df.iloc[:pos], df1, df.iloc[pos:]], ignore_index=True)
print (df)
d e f
0 a b c
1 4 7 1
2 5 8 3
3 4 9 5
4 5 4 7
5 5 2 1
6 4 3 0
But if append this list to numeric dataframe, get mixed types - numeric with strings, so some pandas functions should failed.
Setup:
df = pd.DataFrame({'d':[4,5,4,5,5,4],
'e':[7,8,9,4,2,3],
'f':[1,3,5,7,1,0]})
print (df)

How to simply add a column level to a pandas dataframe

let say I have a dataframe that looks like this:
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df
Out[92]:
A B
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Asumming that this dataframe already exist, how can I simply add a level 'C' to the column index so I get this:
df
Out[92]:
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I saw SO anwser like this python/pandas: how to combine two dataframes into one with hierarchical column index? but this concat different dataframe instead of adding a column level to an already existing dataframe.
-
As suggested by #StevenG himself, a better answer:
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])
print(df)
# A B
# C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
option 1
set_index and T
df.T.set_index(np.repeat('C', df.shape[1]), append=True).T
option 2
pd.concat, keys, and swaplevel
pd.concat([df], axis=1, keys=['C']).swaplevel(0, 1, 1)
A solution which adds a name to the new level and is easier on the eyes than other answers already presented:
df['newlevel'] = 'C'
df = df.set_index('newlevel', append=True).unstack('newlevel')
print(df)
# A B
# newlevel C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
You could just assign the columns like:
>>> df.columns = [df.columns, ['C', 'C']]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Or for unknown length of columns:
>>> df.columns = [df.columns.get_level_values(0), np.repeat('C', df.shape[1])]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Another way for MultiIndex (appanding 'E'):
df.columns = pd.MultiIndex.from_tuples(map(lambda x: (x[0], 'E', x[1]), df.columns))
A B
E E
C D
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I like it explicit (using MultiIndex) and chain-friendly (.set_axis):
df.set_axis(pd.MultiIndex.from_product([df.columns, ['C']]), axis=1)
This is particularly convenient when merging DataFrames with different column level numbers, where Pandas (1.4.2) raises a FutureWarning (FutureWarning: merging between different levels is deprecated and will be removed ... ):
import pandas as pd
df1 = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df2 = pd.DataFrame(index=list('abcde'), data=range(10, 15), columns=pd.MultiIndex.from_tuples([("C", "x")]))
# df1:
A B
a 0 0
b 1 1
# df2:
C
x
a 10
b 11
# merge while giving df1 another column level:
pd.merge(df1.set_axis(pd.MultiIndex.from_product([df1.columns, ['']]), axis=1),
df2,
left_index=True, right_index=True)
# result:
A B C
x
a 0 0 10
b 1 1 11
Another method, but using a list comprehension of tuples as the arg to pandas.MultiIndex.from_tuples():
df.columns = pd.MultiIndex.from_tuples([(col, 'C') for col in df.columns])
df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4

Categories

Resources