I'm having trouble finding the solution to a fairly simple problem.
I would like to alphabetically arrange certain columns of a pandas dataframe that has over 100 columns (i.e. so many that I don't want to list them manually).
Example df:
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject': subject,
                   'timepoint': timepoint,
                   'c': c,
                   'd': d,
                   'a': a,
                   'b': b})
df.head()
subject timepoint c d a b
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
How could I rearrange the column names to generate a df.head() that looks like this:
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
i.e. keep the first two columns where they are and then alphabetically arrange the remaining column names.
Thanks in advance.
You can split your dataframe by column name using the normal indexing operator [], sort the remaining columns alphabetically with sort_index(axis=1), and concat the pieces back together:
>>> pd.concat([df[['subject', 'timepoint']],
...            df[df.columns.difference(['subject', 'timepoint'])].sort_index(axis=1)],
...           ignore_index=False, axis=1)
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
5 1 6 7 7 7 7
6 2 1 3 3 3 3
7 2 2 4 4 4 4
8 2 3 1 1 1 1
9 2 4 2 2 2 2
10 2 5 3 3 3 3
11 2 6 4 4 4 4
12 3 1 5 5 5 5
13 3 2 4 4 4 4
14 3 4 5 5 5 5
15 4 1 8 8 8 8
16 4 2 4 4 4 4
17 4 3 5 5 5 5
18 4 4 6 6 6 6
19 4 5 2 2 2 2
20 4 6 3 3 3 3
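As a side note, Index.difference already returns a sorted Index by default, so the explicit sort_index(axis=1) is belt-and-braces; a slightly shorter sketch of the same idea:
keep = ['subject', 'timepoint']
df = pd.concat([df[keep], df[df.columns.difference(keep)]], axis=1)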
Specify the first two columns you want to keep (or determine them from the data), then sort all of the other columns. Use .loc with the combined list to reorder the DataFrame.
import numpy as np

first_cols = ['subject', 'timepoint']
# first_cols = df.columns[:2].tolist()  # or take the first two from the data
other_cols = np.sort(df.columns.difference(first_cols)).tolist()
df = df.loc[:, first_cols + other_cols]
print(df.head())
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
You can try getting the dataframe columns as a list, rearranging it, and assigning it back to the dataframe with df = df[cols]:
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject': subject,
                   'timepoint': timepoint,
                   'c': c,
                   'd': d,
                   'a': a,
                   'b': b})
cols = df.columns.tolist()
cols = cols[:2] + sorted(cols[2:])
df = df[cols]
I have the following dataframe:
data=pd.DataFrame(data=[[8,4,2,6,0],[3,4,5,6,7]],columns=["a","b","c","d","e"])
Output is like this:
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
I also have the following Series:
a=pd.Series([3,4])
I want to attach the series a to each of the columns in data. I tried a few things with concat but never seemed to get it right.
Expected result is:
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
2 3 3 3 3 3
3 4 4 4 4 4
Thanks in advance
You can do:
out = data.append(pd.concat([a] * data.shape[1], axis=1, keys=data.columns), ignore_index=True)
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
2 3 3 3 3 3
3 4 4 4 4 4
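Note that DataFrame.append was removed in pandas 2.0; the same result with pd.concat (a sketch, using the data and a from the question):
# repeat the Series once per column, label with the original column names,
# then stack it under the original frame
rep = pd.concat([a] * data.shape[1], axis=1, keys=data.columns)
out = pd.concat([data, rep], ignore_index=True)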
Here is a method using a for loop (iteritems was removed in pandas 2.0; items works the same):
for _, y in a.items():
    data.loc[data.index[-1] + 1] = y
data
Out[106]:
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
2 3 3 3 3 3
3 4 4 4 4 4
pandas.DataFrame.apply
with pandas.Series.append
I like this because it's pretty
data.apply(pd.Series.append, to_append=a, ignore_index=True)
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
2 3 3 3 3 3
3 4 4 4 4 4
A golfier answer
data.apply(pd.Series.append, args=(a, 1))
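On pandas 2.0+, where Series.append is gone as well, the same column-wise trick can be sketched with pd.concat inside apply:
# concatenate `a` below every column; apply reassembles the frame
data.apply(lambda col: pd.concat([col, a], ignore_index=True))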
numpy.row_stack
Very similar to rafaelc's answer
# np.row_stack is an alias for np.vstack (the alias is deprecated in newer NumPy)
pd.DataFrame(np.row_stack([
    data,
    a.to_numpy()[:, None].repeat(data.shape[1], axis=1)
]), columns=data.columns)
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
2 3 3 3 3 3
3 4 4 4 4 4
Using broadcast_to
data.append(pd.DataFrame(np.broadcast_to(a.to_frame(), (len(a), data.shape[1])), columns=data.columns), ignore_index=True)
a b c d e
0 8 4 2 6 0
1 3 4 5 6 7
2 3 3 3 3 3
3 4 4 4 4 4
Let's say I have a DF with 5 columns and I want to make a unique 'key' for each row.
a b c d e
1 1 2 3 4 5
2 1 2 3 4 6
3 1 2 3 4 7
4 1 2 2 5 6
5 2 3 4 5 6
6 2 3 4 5 6
7 3 4 5 6 7
I'd like to create a 'key' column as follows:
a b c d e key
1 1 2 3 4 5 12345
2 1 2 3 4 6 12346
3 1 2 3 4 7 12347
4 1 2 2 5 6 12256
5 2 3 4 5 6 23456
6 2 3 4 5 6 23456
7 3 4 5 6 7 34567
Now the problem with this of course is that rows 5 & 6 are duplicates.
I'd like to be able to create unique keys like so:
a b c d e key
1 1 2 3 4 5 12345_1
2 1 2 3 4 6 12346_1
3 1 2 3 4 7 12347_1
4 1 2 2 5 6 12256_1
5 2 3 4 5 6 23456_1
6 2 3 4 5 6 23456_2
7 3 4 5 6 7 34567_1
Not sure how to do this or if this is the best method - appreciate any help.
Thanks
Edit: Columns will be mostly strings, not numeric.
One way is to hash the tuple of each row:
In [11]: df.apply(lambda x: hash(tuple(x)), axis=1)
Out[11]:
1 -2898633648302616629
2 -2898619338595901633
3 -2898621714079554433
4 -9151203046966584651
5 1657626630271466437
6 1657626630271466437
7 3771657657075408722
dtype: int64
In [12]: df['key'] = df.apply(lambda x: hash(tuple(x)), axis=1)
In [13]: df['key'].astype(str) + '_' + (df.groupby('key').cumcount() + 1).astype(str)
Out[13]:
1 -2898633648302616629_1
2 -2898619338595901633_1
3 -2898621714079554433_1
4 -9151203046966584651_1
5 1657626630271466437_1
6 1657626630271466437_2
7 3771657657075408722_1
dtype: object
Note: Generally you don't need to be doing this (it's unclear why you'd want to!).
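One more caveat: Python's built-in hash is randomised per interpreter session for strings (see PYTHONHASHSEED), so with the mostly-string columns mentioned in the edit these keys will not be reproducible across runs. A sketch of the same idea with pandas' stable row hash instead:
# stable across sessions, unlike built-in hash() on strings
df['key'] = pd.util.hash_pandas_object(df, index=False).astype(str)
df['key'] = df['key'] + '_' + (df.groupby('key').cumcount() + 1).astype(str)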
Try this (an extra duplicate of the first row was appended to the example to show non-adjacent duplicates):
df['key'] = df.apply(lambda x: '-'.join(x.astype(str)), axis=1)
s = df.groupby('key').cumcount().astype(str)
df['key'] = df['key'] + '_' + s
print(df)
O/P:
a b c d e key
0 1 2 3 4 5 1-2-3-4-5_0
1 1 2 3 4 6 1-2-3-4-6_0
2 1 2 3 4 7 1-2-3-4-7_0
3 1 2 2 5 6 1-2-2-5-6_0
4 2 3 4 5 6 2-3-4-5-6_0
5 2 3 4 5 6 2-3-4-5-6_1
6 3 4 5 6 7 3-4-5-6-7_0
7 1 2 3 4 5 1-2-3-4-5_1
Explanation:
first create your raw key by joining the row values (cast to str first, since the example columns are numeric).
then number repeated keys with groupby('key').cumcount(), which restarts at 0 for each distinct key and, unlike a duplicated/cumsum approach, also handles non-adjacent duplicates (see row 7 above).
finally concat the key and the counter s.
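To sanity-check that the keys really are unique:
assert df['key'].is_unique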
Maybe you can do something like the following:
import uuid
df['uuid'] = [uuid.uuid4() for _ in range(len(df))]
Another approach is to use np.random.choice(range(10000, 99999), len(df), replace=False) to draw a unique random number for each row in your df:
import numpy as np

df = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e'],
                  data=[[1, 2, 3, 4, 5], [1, 2, 3, 4, 6], [1, 2, 3, 4, 7],
                        [1, 2, 2, 5, 6], [2, 3, 4, 5, 6], [2, 3, 4, 5, 6],
                        [3, 4, 5, 6, 7]])
df['key'] = np.random.choice(range(10000, 99999), len(df), replace=False)
df
a b c d e key
0 1 2 3 4 5 10560
1 1 2 3 4 6 79547
2 1 2 3 4 7 24762
3 1 2 2 5 6 95221
4 2 3 4 5 6 79460
5 2 3 4 5 6 62820
6 3 4 5 6 7 82964
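If the keys need to be reproducible between runs, one option is to seed NumPy's newer Generator API first (a sketch):
rng = np.random.default_rng(42)  # fixed seed -> same keys on every run
df['key'] = rng.choice(np.arange(10000, 99999), len(df), replace=False)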
I want to reverse a column's values in my dataframe, but only at an individual "groupby" level. Below you can find a minimal demonstration example, where I want to "flip" the values that belong to the same letter A, B or C:
df = pd.DataFrame({"group":["A","A","A","B","B","B","B","C","C"],
"value": [1,3,2,4,4,2,3,2,5]})
group value
0 A 1
1 A 3
2 A 2
3 B 4
4 B 4
5 B 2
6 B 3
7 C 2
8 C 5
My desired output looks like this (the column is added instead of replaced purely for demonstration purposes):
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
As always, when I don't see a proper vector-style approach, I end up messing with loops just for the sake of the final output, but my current code hurts me very much:
for i in list(set(df["group"].values.tolist())):
    reversed_group = df.loc[df["group"]==i, "value"].values.tolist()[::-1]
    df.loc[df["group"]==i, "value_desired"] = reversed_group
Pandas gurus, please show me the way :)
You can use transform:
In [900]: df.groupby('group')['value'].transform(lambda x: x[::-1])
Out[900]:
0 2
1 3
2 1
3 3
4 2
5 4
6 4
7 5
8 2
Name: value, dtype: int64
Details
In [901]: df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x[::-1])
In [902]: df
Out[902]:
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
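One caveat: depending on the pandas version, a Series returned from transform may be realigned on its index, which would silently undo the reversal. Returning a plain array sidesteps the alignment (a sketch):
df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x.to_numpy()[::-1])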
I have a dictionary as follows:
d = {1: array([2, 3]), 2: array([8, 4, 5]), 3: array([6, 7, 8, 9])}
As depicted, the values for each key are variable-length arrays.
Now I want to convert it to a DataFrame, so the output looks like:
A B
1 2
1 3
2 8
2 4
2 5
3 6
3 7
3 8
3 9
I used pd.DataFrame(d), but it does not handle the one-to-many mapping. Any help would be appreciated.
Use the Series constructor, with str.len for the lengths of the arrays (str.len works element-wise on any list-like, not just strings).
Then create a new DataFrame with numpy.repeat, numpy.concatenate and Index.values:
import numpy as np
import pandas as pd

d = {1: np.array([2,3]), 2: np.array([8,4,5]), 3: np.array([6,7,8,9])}

a = pd.Series(d)
l = a.str.len()
df = pd.DataFrame({'A': np.repeat(a.index.values, l),
                   'B': np.concatenate(a.values)})
print(df)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
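On pandas 0.25 or later, Series.explode gives a shorter route to the same frame (a sketch):
a = pd.Series(d)
df = a.explode().rename_axis('A').reset_index(name='B')
df['B'] = df['B'].astype(int)  # explode leaves the values as object dtype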
pd.DataFrame(
    [[k, v] for k, a in d.items() for v in a.tolist()],
    columns=['A', 'B']
)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
Setup
d = {1: np.array([2,3]), 2: np.array([8,4,5]), 3: np.array([6,7,8,9])}
Here's my version:
(pd.DataFrame.from_dict(d, orient='index').rename_axis('A')
   .stack()
   .reset_index(name='B')
   .drop('level_1', axis=1)
   .astype('int'))
Out[63]:
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9