Converting one to many mapping dictionary to Dataframe

I have a dictionary as follows:
d = {1: array([2, 3]), 2: array([8, 4, 5]), 3: array([6, 7, 8, 9])}
As depicted, here the values for each key are variable length arrays.
Now I want to convert it to DataFrame. So the output looks like:
A B
1 2
1 3
2 8
2 4
2 5
3 6
3 7
3 8
3 9
I used pd.DataFrame(d), but it does not handle the one-to-many mapping. Any help would be appreciated.

Use the Series constructor with str.len to get the length of each array (str.len works on list-like values, not only on strings).
Then create a new DataFrame with numpy.repeat, numpy.concatenate and Index.values:
d = {1:np.array([2,3]), 2:np.array([8,4,5]), 3:np.array([6,7,8,9])}
print (d)
a = pd.Series(d)
l = a.str.len()
df = pd.DataFrame({'A':np.repeat(a.index.values, l), 'B': np.concatenate(a.values)})
print (df)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
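
If relying on the .str accessor for non-string values feels fragile, the lengths can also be computed directly from the dictionary. This is a minimal sketch of the same repeat/concatenate idea; the names keys and lengths are illustrative and do not come from the answer above.

```python
import numpy as np
import pandas as pd

d = {1: np.array([2, 3]), 2: np.array([8, 4, 5]), 3: np.array([6, 7, 8, 9])}

keys = list(d)                       # dictionary keys in insertion order
lengths = [len(d[k]) for k in keys]  # number of values stored under each key

df = pd.DataFrame({
    'A': np.repeat(keys, lengths),              # each key repeated once per value
    'B': np.concatenate([d[k] for k in keys]),  # all values flattened in key order
})
print(df)
```

Building both columns from the same keys list keeps the repeated keys and the flattened values in the same order.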

pd.DataFrame(
[[k, v] for k, a in d.items() for v in a.tolist()],
columns=['A', 'B']
)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
Setup
d = {1: np.array([2,3]), 2: np.array([8,4,5]), 3: np.array([6,7,8,9])}

Here's my version:
(pd.DataFrame.from_dict(d, orient='index').rename_axis('A')
.stack()
.reset_index(name='B')
.drop('level_1', axis=1)
.astype('int'))
Out[63]:
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
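
On pandas 0.25 or later, Series.explode offers another route. This is a different technique from the answers above, sketched under the assumption that every dictionary value is a 1-D array.

```python
import numpy as np
import pandas as pd

d = {1: np.array([2, 3]), 2: np.array([8, 4, 5]), 3: np.array([6, 7, 8, 9])}

# One list per key; explode() then emits one row per list element
s = pd.Series({k: v.tolist() for k, v in d.items()}, name='B')

df = (s.explode()
       .astype(int)       # explode() returns object dtype, so cast back
       .rename_axis('A')  # name the index so it becomes column 'A'
       .reset_index())
print(df)
```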

Related

Sort a subset of columns of a pandas dataframe alphabetically by column name

I'm having trouble finding the solution to a fairly simple problem.
I would like to alphabetically arrange certain columns of a pandas dataframe that has over 100 columns (i.e. so many that I don't want to list them manually).
Example df:
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject':subject,
'timepoint':timepoint,
'c':c,
'd':d,
'a':a,
'b':b})
df.head()
subject timepoint c d a b
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
How could I rearrange the column names to generate a df.head() that looks like this:
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
i.e. keep the first two columns where they are and then alphabetically arrange the remaining column names.
Thanks in advance.

You can split your dataframe by column name with the normal indexing operator [], sort the other columns alphabetically with sort_index(axis=1), and concat the pieces back together:
>>> pd.concat([df[['subject','timepoint']],
df[df.columns.difference(['subject', 'timepoint'])]\
.sort_index(axis=1)],ignore_index=False,axis=1)
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
5 1 6 7 7 7 7
6 2 1 3 3 3 3
7 2 2 4 4 4 4
8 2 3 1 1 1 1
9 2 4 2 2 2 2
10 2 5 3 3 3 3
11 2 6 4 4 4 4
12 3 1 5 5 5 5
13 3 2 4 4 4 4
14 3 4 5 5 5 5
15 4 1 8 8 8 8
16 4 2 4 4 4 4
17 4 3 5 5 5 5
18 4 4 6 6 6 6
19 4 5 2 2 2 2
20 4 6 3 3 3 3

Specify the first two columns you want to keep (or determine them from the data), then sort all of the other columns. Use .loc with the correct list to then "sort" the DataFrame.
import numpy as np
first_cols = ['subject', 'timepoint']
#first_cols = df.columns[0:2].tolist() # OR determine first two
other_cols = np.sort(df.columns.difference(first_cols)).tolist()
df = df.loc[:, first_cols+other_cols]
print(df.head())
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6

You can try getting the dataframe columns as a list, rearranging them, and assigning the result back to the dataframe with df = df[cols]:
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject':subject,
'timepoint':timepoint,
'c':c,
'd':d,
'a':a,
'b':b})
cols = df.columns.tolist()
cols = cols[:2] + sorted(cols[2:])
df = df[cols]
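
The list-slicing idea can be wrapped in a small helper so it can be reused on wide frames. This is only a sketch: the function name sort_trailing_columns and the n_fixed parameter are hypothetical, and it assumes the columns to keep in place are the leading ones.

```python
import pandas as pd

def sort_trailing_columns(df, n_fixed=2):
    """Keep the first n_fixed columns in place and sort the rest by name."""
    cols = df.columns.tolist()
    return df[cols[:n_fixed] + sorted(cols[n_fixed:])]

df = pd.DataFrame({'subject': [1, 1], 'timepoint': [1, 2],
                   'c': [2, 3], 'd': [4, 5], 'a': [6, 7], 'b': [8, 9]})
out = sort_trailing_columns(df)
print(out.columns.tolist())
```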

append same Series to data frame columns

I have the following dataframe:
data=pd.DataFrame(data=[[8,4,2,6,0],[3,4,5,6,7]],columns=["a","b","c","d","e"])
Output is like this:
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
I also have the following Series:
a=pd.Series([3,4])
I want to attach the series (a) to each of the columns in data. I tried a few things with concat, but I never seem to get it right.
Expected result is:
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
2 3 3 3 3 3
3 4 4 4 4 4
Thanks in advance

You can do:
out=pd.concat([data, pd.concat([a]*data.shape[1],axis=1,keys=data.columns)],ignore_index=True)
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
2 3 3 3 3 3
3 4 4 4 4 4
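
The same result can be built in two explicit steps, which may be easier to read. This is a sketch rather than the answerer's code; the name block is illustrative.

```python
import pandas as pd

data = pd.DataFrame([[8, 4, 2, 6, 0], [3, 4, 5, 6, 7]], columns=list("abcde"))
a = pd.Series([3, 4])

# One copy of the Series for every column of data
block = pd.DataFrame({col: a for col in data.columns})

# Stack the block under the original rows and renumber the index
out = pd.concat([data, block], ignore_index=True)
print(out)
```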

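Here is a method using a for loop:
start = data.index[-1] + 1  # first new row label, computed once before the loop
for x, y in a.items():
    data.loc[start + x] = y  # the scalar y fills every column of the new row
data
Out[106]:
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
2 3 3 3 3 3
3 4 4 4 4 4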

pandas.DataFrame.apply with pandas.Series.append
I like this because it's pretty (Series.append was removed in pandas 2.0, so this needs an older version):
data.apply(pd.Series.append, to_append=a, ignore_index=True)
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
2 3 3 3 3 3
3 4 4 4 4 4
A golfier answer
data.apply(pd.Series.append, args=(a, 1))
numpy.row_stack
Very similar to rafaelc's answer
pd.DataFrame(np.row_stack([
data,
a.to_numpy()[:, None].repeat(data.shape[1], axis=1)
]), columns=data.columns)
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
2 3 3 3 3 3
3 4 4 4 4 4

Using broadcast_to
pd.concat([data, pd.DataFrame(np.broadcast_to(a.to_frame(), (len(a), data.shape[1])), columns=data.columns)], ignore_index=True)
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
2 3 3 3 3 3
3 4 4 4 4 4

Pandas - Giving all rows (particularly) duplicate rows a unique identifier

Let's say I have a DF with 5 columns and I want to make a unique 'key' for each row.
a b c d e
1 1 2 3 4 5
2 1 2 3 4 6
3 1 2 3 4 7
4 1 2 2 5 6
5 2 3 4 5 6
6 2 3 4 5 6
7 3 4 5 6 7
I'd like to create a 'key' column as follows:
a b c d e key
1 1 2 3 4 5 12345
2 1 2 3 4 6 12346
3 1 2 3 4 7 12347
4 1 2 2 5 6 12256
5 2 3 4 5 6 23456
6 2 3 4 5 6 23456
7 3 4 5 6 7 34567
Now the problem with this of course is that row 5 & 6 are duplicates.
I'd like to be able to create unique keys like so:
a b c d e key
1 1 2 3 4 5 12345_1
2 1 2 3 4 6 12346_1
3 1 2 3 4 7 12347_1
4 1 2 2 5 6 12256_1
5 2 3 4 5 6 23456_1
6 2 3 4 5 6 23456_2
7 3 4 5 6 7 34567_1
Not sure how to do this or if this is the best method - appreciate any help.
Thanks
Edit: Columns will be mostly strings, not numeric.

One way is to hash the tuple of each row:
In [11]: df.apply(lambda x: hash(tuple(x)), axis=1)
Out[11]:
1 -2898633648302616629
2 -2898619338595901633
3 -2898621714079554433
4 -9151203046966584651
5 1657626630271466437
6 1657626630271466437
7 3771657657075408722
dtype: int64
In [12]: df['key'] = df.apply(lambda x: hash(tuple(x)), axis=1)
In [13]: df['key'].astype(str) + '_' + (df.groupby('key').cumcount() + 1).astype(str)
Out[13]:
1 -2898633648302616629_1
2 -2898619338595901633_1
3 -2898621714079554433_1
4 -9151203046966584651_1
5 1657626630271466437_1
6 1657626630271466437_2
7 3771657657075408722_1
dtype: object
Note: Generally you don't need to be doing this (it's unclear why you'd want to!).
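
One caveat with the built-in hash(): for str values it is salted per process unless PYTHONHASHSEED is set, so the keys can change between runs when the columns hold strings. As a hedged alternative, pandas provides pd.util.hash_pandas_object, which returns a deterministic hash per row. The variable names below are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [1, 2, 3, 4, 6], [1, 2, 3, 4, 7], [1, 2, 2, 5, 6],
     [2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]],
    columns=list('abcde'))

# One unsigned 64-bit hash per row; index=False hashes the values only
row_hash = pd.util.hash_pandas_object(df, index=False)

# Number repeats of the same hash 1, 2, ... so every key becomes unique
counter = df.groupby(row_hash).cumcount() + 1
key = row_hash.astype(str) + '_' + counter.astype(str)
print(key)
```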

Try this:
df['key']=df.apply(lambda x:'-'.join(map(str, x.values.tolist())),axis=1)
m=~df['key'].duplicated()
s= (df.groupby(m.cumsum()).cumcount()).astype(str)
df['key']=df['key']+'_'+s
print (df)
Output:
a b c d e key
0 1 2 3 4 5 1-2-3-4-5_0
1 1 2 3 4 6 1-2-3-4-6_0
2 1 2 3 4 7 1-2-3-4-7_0
3 1 2 2 5 6 1-2-2-5-6_0
4 2 3 4 5 6 2-3-4-5-6_0
5 2 3 4 5 6 2-3-4-5-6_1
6 3 4 5 6 7 3-4-5-6-7_0
7 1 2 3 4 5 1-2-3-4-5_1
Another, much simpler way:
df['key']=df['key']+'_'+(df.groupby('key').cumcount()).astype(str)
Explanation:
First, create the base id for each row using join.
Then create a sequence s: duplicated marks repeated keys, and the cumsum over its negation starts a new group whenever a new key appears.
Finally, concatenate the key and the sequence s.
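
To get exactly the format asked for in the question (for example 23456_1 and 23456_2), the counter can start at 1 and the row values can be joined without a separator. A minimal sketch, assuming the key is built from all five columns; the names base and counter are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [1, 2, 3, 4, 6], [1, 2, 3, 4, 7], [1, 2, 2, 5, 6],
     [2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]],
    columns=list('abcde'))
cols = ['a', 'b', 'c', 'd', 'e']

# Join the row values as strings, e.g. 1, 2, 3, 4, 5 becomes '12345'
base = df[cols].astype(str).apply(''.join, axis=1)

# Count occurrences of each identical row, starting at 1
counter = df.groupby(cols).cumcount() + 1

df['key'] = base + '_' + counter.astype(str)
print(df)
```

With multi-character string values, joining without a separator can make different rows look alike ('1' + '23' and '12' + '3'), so a separator such as '-' is safer in that case.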

Maybe you can do something like the following:
import uuid
df['uuid'] = [uuid.uuid4() for __ in range(df.index.size)]

Another approach would be to use np.random.choice(range(10000,99999), len(df), replace=False) to generate unique random numbers without replacement for each row in your df:
df = pd.DataFrame(columns = ['a', 'b', 'c', 'd', 'e'],
data = [[1, 2, 3, 4, 5],[1, 2, 3, 4, 6],[1, 2, 3, 4, 7],[1, 2, 2, 5, 6],[2, 3, 4, 5, 6],[2, 3, 4, 5, 6],[3, 4, 5, 6, 7]])
df['key'] = np.random.choice(range(10000,99999), len(df), replace=False)
df
a b c d e key
0 1 2 3 4 5 10560
1 1 2 3 4 6 79547
2 1 2 3 4 7 24762
3 1 2 2 5 6 95221
4 2 3 4 5 6 79460
5 2 3 4 5 6 62820
6 3 4 5 6 7 82964
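
For repeatable keys, the same draw can be done with a seeded NumPy Generator. This is a sketch under the assumption that five-digit integer keys are acceptable; the seed value 0 and the small example frame are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [2, 2, 3, 3]})

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Draw len(df) distinct integers from 10000..99999 (the stop value is excluded)
df['key'] = rng.choice(np.arange(10000, 100000), size=len(df), replace=False)
print(df)
```

Because replace=False samples without replacement, the call raises an error instead of silently repeating keys when the frame has more rows than available values.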

Pandas Split DataFrame using row index

I want to split a dataframe into chunks with uneven numbers of rows, using the row index.
The code below:
groups = df.groupby((np.arange(len(df.index))/l[1]).astype(int))
works only for a uniform number of rows.
df
a b c
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
l = [2, 5, 7]
df1
1 1 1
2 2 2
df2
3 3 3
4 4 4
5 5 5
df3
6 6 6
7 7 7
df4
8 8 8

You could use a list comprehension after a small modification to your list, l:
print(df)
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
l = [2,5,7]
l_mod = [0] + l + [len(df)]
list_of_dfs = [df.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]
Output:
list_of_dfs[0]
a b c
0 1 1 1
1 2 2 2
list_of_dfs[1]
a b c
2 3 3 3
3 4 4 4
4 5 5 5
list_of_dfs[2]
a b c
5 6 6 6
6 7 7 7
list_of_dfs[3]
a b c
7 8 8 8
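
The same idea can be wrapped in a reusable function that pairs each start position with the next one. This is a sketch; the function name split_at is hypothetical, and it assumes the cut positions are sorted and lie within the frame.

```python
import numpy as np
import pandas as pd

def split_at(df, cuts):
    """Split df into consecutive chunks that start at each position in cuts."""
    bounds = [0] + list(cuts) + [len(df)]
    # Pair every boundary with the following one: (0, 2), (2, 5), (5, 7), (7, 8)
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

df = pd.DataFrame({'a': np.arange(1, 9), 'b': np.arange(1, 9), 'c': np.arange(1, 9)})
chunks = split_at(df, [2, 5, 7])
print([len(chunk) for chunk in chunks])
```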

I think this is what you need:
df = pd.DataFrame({'a': np.arange(1, 8),
'b': np.arange(1, 8),
'c': np.arange(1, 8)})
df.head()
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
last_check = 0
dfs = []
for ind in [2, 5, 7]:
    dfs.append(df.loc[last_check:ind-1])
    last_check = ind
Although a list comprehension is usually more efficient than a for loop, last_check is necessary if there is no pattern in your list of indices.
dfs[0]
a b c
0 1 1 1
1 2 2 2
dfs[2]
a b c
5 6 6 6
6 7 7 7
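
The loop subtracts 1 from each boundary because .loc slices by label and includes the end label, whereas .iloc slices by position and excludes the end. A short sketch of the difference, assuming the default integer index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1, 8)})

by_label = df.loc[2:4]      # labels 2, 3 and 4: the end label is included
by_position = df.iloc[2:5]  # positions 2, 3 and 4: the end position is excluded

print(by_label.equals(by_position))
```

If the index is not the default 0..n-1, the two can select different rows, so .iloc is the safer choice when the list holds positions.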

I think this is what you are looking for:
l = [2, 5, 7]
dfs=[]
i=0
for val in l:
    if i==0:
        temp=df.iloc[:val]  # first chunk starts at row 0
    else:
        temp=df.iloc[l[i-1]:val]  # later chunks start at the previous cut
    dfs.append(temp)
    i+=1
Output:
a b c
0 1 1 1
1 2 2 2
a b c
2 3 3 3
3 4 4 4
4 5 5 5
a b c
5 6 6 6
6 7 7 7
Another solution:
l = [2, 5, 7]
t= np.arange(l[-1])
l.reverse()
for val in l:
    t[:val]=val  # label every row before the cut with the cut value
temp=pd.DataFrame(t)
temp=pd.concat([df,temp],axis=1)
for u,v in temp.groupby(0):
    print(v)
Output:
a b c 0
0 1 1 1 2
1 2 2 2 2
a b c 0
2 3 3 3 5
3 4 4 4 5
4 5 5 5 5
a b c 0
5 6 6 6 7
6 7 7 7 7

You can create an array to use for indexing via NumPy:
import pandas as pd, numpy as np
df = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list('abc'))
L = [2, 5, 7]
idx = np.cumsum(np.isin(np.arange(len(df.index)), L))
for _, chunk in df.groupby(idx):
    print(chunk, '\n')
a b c
0 0 1 2
1 3 4 5
a b c
2 6 7 8
3 9 10 11
4 12 13 14
a b c
5 15 16 17
6 18 19 20
a b c
7 21 22 23
Instead of defining a new variable for each dataframe, you can use a dictionary:
d = dict(tuple(df.groupby(idx)))
print(d[1]) # print second groupby value
a b c
2 6 7 8
3 9 10 11
4 12 13 14
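
The group labels can also be computed with np.searchsorted, which counts how many cut points are at or below each row position. This is an alternative sketch, not the answerer's code, and it assumes L is sorted in ascending order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list('abc'))
L = [2, 5, 7]  # must be sorted in ascending order

# For each row position, count the cut points that are <= that position
idx = np.searchsorted(L, np.arange(len(df)), side='right')

chunks = [chunk for _, chunk in df.groupby(idx)]
print(idx)
```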

Operations on two lists with in a DataFrame using Python

I have two lists, ListA = [In_3M, Out_3M, Go_3M] and ListB = [In_6M, Out_6M, Go_6M]. The elements of the two lists are columns of the input DataFrame. I want to subtract the first element of ListA (In_3M) from the first element of ListB (In_6M) and store the result as a separate column in the output DataFrame, then repeat the same process for every remaining pair of elements.
ListA = [In_3M,Out_3M, Go_3M]
ListB = [In_6M,Out_6M, Go_6M]
Input df:
ID In_3M Out_3M Go_3M In_6M Out_6M Go_6M
A 2 3 4 4 6 6
B 3 3 5 5 6 7
C 2 3 6 4 6 8
D 3 3 7 5 6 9
Output df:
ID In_3M Out_3M Go_3M In_6M Out_6M Go_6M IN_3M-6M Out_3M-6M Go_3M-6M
A 2 3 4 4 6 6 2 3 2
B 3 3 5 5 6 7 2 3 2
C 2 3 6 4 6 8 2 3 2
D 3 3 7 5 6 9 2 3 2
I have tried many ways to do this but have not been able to solve it. Each list has around 20 elements. Please let me know if there is an efficient way to do this. Thanks in advance.

This is simple enough to do with loops, just loop over the zipped column names:
>>> df = pd.read_clipboard()
>>> df
ID In_3M Out_3M Go_3M In_6M Out_6M Go_6M
0 A 2 3 4 4 6 6
1 B 3 3 5 5 6 7
2 C 2 3 6 4 6 8
3 D 3 3 7 5 6 9
>>> ListA = ['In_3M','Out_3M', 'Go_3M']
>>> ListB = ['In_6M','Out_6M', 'Go_6M']
>>> for b, a in zip(ListB, ListA):
...     newcol = "{}-{}".format(b, a)
...     df[newcol] = df[b] - df[a]
...
>>> df
ID In_3M Out_3M Go_3M In_6M Out_6M Go_6M In_6M-In_3M Out_6M-Out_3M \
0 A 2 3 4 4 6 6 2 3
1 B 3 3 5 5 6 7 2 3
2 C 2 3 6 4 6 8 2 3
3 D 3 3 7 5 6 9 2 3
Go_6M-Go_3M
0 2
1 2
2 2
3 2
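
With around 20 column pairs, the subtraction can also be done in one step on the underlying arrays. This is a sketch under the assumption that both lists have the same length and are paired by position; the names new_cols and diff are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'D'],
    'In_3M': [2, 3, 2, 3], 'Out_3M': [3, 3, 3, 3], 'Go_3M': [4, 5, 6, 7],
    'In_6M': [4, 5, 4, 5], 'Out_6M': [6, 6, 6, 6], 'Go_6M': [6, 7, 8, 9],
})
ListA = ['In_3M', 'Out_3M', 'Go_3M']
ListB = ['In_6M', 'Out_6M', 'Go_6M']

new_cols = ["{}-{}".format(b, a) for b, a in zip(ListB, ListA)]

# Subtract plain arrays so pandas does not try to align on column names
diff = pd.DataFrame(df[ListB].to_numpy() - df[ListA].to_numpy(),
                    columns=new_cols, index=df.index)

out = pd.concat([df, diff], axis=1)
print(out)
```

Subtracting df[ListB] - df[ListA] directly would align on the differing column names and produce only NaN columns, which is why the sketch drops to NumPy arrays first.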
