Reorder certain columns in pandas dataframe - python

I have a dataframe, with the following columns, in this order;
'2','4','9','A','1','B','C'
I want the first 3 columns to be A, B, C; the order of the rest doesn't matter.
Output:
'A','B','C','2','4','9'... and so on
Is this possible?
(There are hundreds of columns, so I can't put them all in a list.)

You can try to reorder like this:
first_cols = ['A','B','C']
last_cols = [col for col in df.columns if col not in first_cols]
df = df[first_cols+last_cols]
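As a quick check, the snippet below runs that reordering on a small frame that uses the column names from the question. The frame contents (all ones) are placeholders chosen for illustration.

```python
import pandas as pd

# Toy frame with the question's column order; the values are placeholders.
df = pd.DataFrame(1, index=range(3), columns=['2', '4', '9', 'A', '1', 'B', 'C'])

first_cols = ['A', 'B', 'C']
# Keep every other column in its original relative order.
last_cols = [col for col in df.columns if col not in first_cols]
reordered = df[first_cols + last_cols]

print(list(reordered.columns))
# ['A', 'B', 'C', '2', '4', '9', '1']
```

Because the remaining columns are collected from df.columns, nothing has to be typed out by hand even with hundreds of columns.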

Setup
cols = ['2','4','9','A','1','B','C']
df = pd.DataFrame(1, range(3), cols)
df
2 4 9 A 1 B C
0 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1
sorted with key
key = lambda x: (x != 'A', x != 'B', x != 'C')
df[sorted(df, key=key)]
A B C 2 4 9 1
0 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1
Better suited to a longer list of leading columns
first_cols = ['A', 'B', 'C']
key = lambda x: tuple(y != x for y in first_cols)
df[sorted(df, key=key)]
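To see why the tuple key works, it may help to print the key for a few column names. This is only an illustrative sketch on a plain list of names: False sorts before True, and sorted is stable, so columns that are not in first_cols keep their original order.

```python
first_cols = ['A', 'B', 'C']
key = lambda x: tuple(y != x for y in first_cols)
cols = ['2', '4', '9', 'A', '1', 'B', 'C']

print(key('A'))  # (False, True, True)
print(key('B'))  # (True, False, True)
print(key('2'))  # (True, True, True)

# Tuples compare element by element, so 'A' < 'B' < 'C' < everything else.
print(sorted(cols, key=key))
# ['A', 'B', 'C', '2', '4', '9', '1']
```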

Change values in each cell based on a comparison with another cell using pandas

I want to compare the value in column 0 with the values in all the other columns and change the values of those columns accordingly.
I have 4329 rows x 197 columns.
From this:
0 1 2 3
0 G G G T
1 A A G A
2 C C C C
3 T A T G
To this:
0 1 2 3
0 G 1 1 0
1 A 1 0 1
2 C 1 1 1
3 T 0 1 0
I've tried a nested for loop, which does not work and is slow.
for index, row in df.iterrows():
    for name, value in row.iteritems():
        if name == 0:
            c = value
            continue
        if value == c:
            value = 1
        else:
            value = 0
I haven't been able to piece together a way to use apply or applymap for the problem.
Here's an approach with iloc and eq:
df.iloc[:,1:] = df.iloc[:,1:].eq(df.iloc[:,0], axis=0).astype(int)
Output:
0 1 2 3
0 G 1 1 0
1 A 1 0 1
2 C 1 1 1
3 T 0 1 0
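A self-contained sketch of the same eq comparison on the question's sample data follows. It builds a new frame with pd.concat instead of assigning in place. That is a choice made here for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame([['G', 'G', 'G', 'T'],
                   ['A', 'A', 'G', 'A'],
                   ['C', 'C', 'C', 'C'],
                   ['T', 'A', 'T', 'G']])

# Compare columns 1..n with column 0, aligning on the row index (axis=0).
flags = df.iloc[:, 1:].eq(df.iloc[:, 0], axis=0).astype(int)

# Put the untouched first column back in front of the 0/1 flags.
result = pd.concat([df.iloc[:, [0]], flags], axis=1)
print(result)
```

The comparison is vectorised over the whole frame, so no Python-level loop over the 4329 x 197 cells is needed.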
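import pandas
df = pandas.DataFrame([['G', 'G', 'G', 'T'],
                       ['A', 'A', 'G', 'A'],
                       ['C', 'C', 'C', 'C'],
                       ['T', 'A', 'T', 'G']])
# Compare every column with column 0, keep columns 1-3 as 0/1 flags,
# and put the original column 0 back in front.
df2 = pandas.concat([df[0], df.apply(lambda c: df[0] == c)[[1, 2, 3]].astype(int)], axis=1)
print(df2)
I guess... there's probably a better way, though.
You could also do something like: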
df.apply(lambda c:(df[0]==c).astype(int) if c.name > 0 else c)

How to delete row when iterating into pandas Dataframe column?

This is my csv file:
A B C D J
0 1 0 0 0
0 0 0 0 0
1 1 1 0 0
0 0 0 0 0
0 0 7 0 7
Each time, I need to select two columns and delete every row in which both values are 0. For example, if I select A and B:
Input
A B
0 1
0 0
1 1
0 0
0 0
Output
A B
0 1
1 1
Then I select A and C, and so on.
I used this code for A and B, but it returns errors:
import pandas as pd
df = pd.read_csv('Book1.csv')
a=df['A']
b=df['B']
indexes_to_drop = []
for i in df.index:
    if df[(a==0) & (b==0)] :
        indexes_to_drop.append(i)
df.drop(df.index[indexes_to_drop], inplace=True )
Any help please!
First we make your desired combinations of column A with all the rest, then we use iloc to select the correct rows per column combination:
idx_ranges = [[0,i] for i in range(1, len(df.columns))]
dfs = [df[df.iloc[:, idx].ne(0).any(axis=1)].iloc[:, idx] for idx in idx_ranges]
print(dfs[0], '\n')
print(dfs[1], '\n')
print(dfs[2], '\n')
print(dfs[3])
A B
0 0 1
2 1 1
A C
2 1 1
4 0 7
A D
2 1 0
A J
2 1 0
4 0 7
Do not iterate. Create a Boolean Series to slice your DataFrame:
cols = ['A', 'B']
m = df[cols].ne(0).any(axis=1)
df.loc[m]
A B C D J
0 0 1 0 0 0
2 1 1 1 0 0
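The mask approach can be tried end to end with the sketch below. The frame is rebuilt by hand from the CSV shown in the question instead of being read from Book1.csv.

```python
import pandas as pd

# Same values as the CSV in the question.
df = pd.DataFrame({'A': [0, 0, 1, 0, 0],
                   'B': [1, 0, 1, 0, 0],
                   'C': [0, 0, 1, 0, 7],
                   'D': [0, 0, 0, 0, 0],
                   'J': [0, 0, 0, 0, 7]})

cols = ['A', 'B']
# True for rows where at least one of the selected columns is non-zero.
mask = df[cols].ne(0).any(axis=1)
print(mask.tolist())  # [True, False, True, False, False]

# Keep only those rows, restricted to the selected columns.
kept = df.loc[mask, cols]
print(kept)
```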
You can get all combinations and store them in a dict with itertools.combinations. Use .loc to select both the rows and columns you care about.
from itertools import combinations
d = {c: df.loc[df[list(c)].ne(0).any(axis=1), list(c)]
     for c in combinations(df.columns, 2)}
d[('A', 'B')]
# A B
#0 0 1
#2 1 1
d[('C', 'J')]
# C J
#2 1 0
#4 7 7
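The dictionary of pairs can be wrapped in a small function. In this sketch, nonzero_pairs is a hypothetical helper name introduced only for illustration, and the frame is rebuilt by hand from the question's CSV.

```python
from itertools import combinations

import pandas as pd


def nonzero_pairs(df):
    """Hypothetical helper: for each pair of columns, keep the rows
    where at least one of the two values is non-zero."""
    return {pair: df.loc[df[list(pair)].ne(0).any(axis=1), list(pair)]
            for pair in combinations(df.columns, 2)}


df = pd.DataFrame({'A': [0, 0, 1, 0, 0],
                   'B': [1, 0, 1, 0, 0],
                   'C': [0, 0, 1, 0, 7],
                   'D': [0, 0, 0, 0, 0],
                   'J': [0, 0, 0, 0, 7]})

d = nonzero_pairs(df)
print(len(d))  # 10 pairs for 5 columns
print(d[('A', 'D')])
```

With hundreds of columns the number of pairs grows quadratically, so it may be preferable to build only the pairs that are actually needed.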

Convert python dictionary to dataframe with dict values(list) as columns and 1,0 if that column is in dict list

I want to create a dataframe from a dictionary which is of the format
Dictionary_ = {'Key1': ['a', 'b', 'c', 'd'],'Key2': ['d', 'f'],'Key3': ['a', 'c', 'm', 'n']}
I am using
df = pd.DataFrame.from_dict(Dictionary_, orient ='index')
But it creates its own columns, up to the maximum length of the value lists, and puts the dictionary values in as cell values.
I want a df with the keys as rows and the values as columns, like:
a b c d e f m n
Key 1 1 1 1 1 0 0 0 0
Key 2 0 0 0 1 0 1 0 0
Key 3 1 0 1 0 0 0 1 1
I can do it by collecting all values of the dict, creating an empty dataframe with the dict keys as rows and the values as columns, and then iterating over each row to fetch the values from the dict and put 1 where they match a column. But this will be too slow, as my data has 200 000 rows and .loc is slow. I feel I can use pandas dummies somehow, but I don't know how to apply it here.
I feel there will be a smarter way to do this.
If performance is important, use MultiLabelBinarizer and pass keys and values:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(Dictionary_.values()),
                  columns=mlb.classes_,
                  index=Dictionary_.keys())
print (df)
a b c d f m n
Key1 1 1 1 1 0 0 0
Key2 0 0 0 1 1 0 0
Key3 1 0 1 0 0 1 1
An alternative, though slower, is to create a Series, join each list into one string with str.join, and finally call str.get_dummies:
df = pd.Series(Dictionary_).str.join('|').str.get_dummies()
print (df)
a b c d f m n
Key1 1 1 1 1 0 0 0
Key2 0 0 0 1 1 0 0
Key3 1 0 1 0 0 1 1
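The two steps of that one-liner are easier to follow when the intermediate Series is printed. This is a minimal sketch on the question's dictionary. It assumes that '|' never appears inside a value, since it is used as the separator.

```python
import pandas as pd

Dictionary_ = {'Key1': ['a', 'b', 'c', 'd'],
               'Key2': ['d', 'f'],
               'Key3': ['a', 'c', 'm', 'n']}

# Step 1: each list becomes one '|'-separated string.
joined = pd.Series(Dictionary_).str.join('|')
print(joined['Key1'])  # a|b|c|d

# Step 2: split on '|' again and build one 0/1 column per distinct value.
df = joined.str.get_dummies(sep='|')
print(list(df.columns))  # ['a', 'b', 'c', 'd', 'f', 'm', 'n']
```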
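An alternative that starts from the input DataFrame is pandas.get_dummies, but then it is necessary to aggregate with max per column label (the level argument of max was removed in pandas 2.0, so this line needs an older pandas):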
df1 = pd.DataFrame.from_dict(Dictionary_, orient ='index')
df = pd.get_dummies(df1, prefix='', prefix_sep='').max(axis=1, level=0)
print (df)
a d b c f m n
Key1 1 1 1 1 0 0 0
Key2 0 1 0 0 1 0 0
Key3 1 0 0 1 0 1 1
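On pandas versions where max no longer accepts level, one possible equivalent is to transpose, group the duplicated labels, take the max, and transpose back. This is a sketch of a substitute technique, not the answer's own code. Note that groupby sorts the labels, so the columns come out in alphabetical order.

```python
import pandas as pd

Dictionary_ = {'Key1': ['a', 'b', 'c', 'd'],
               'Key2': ['d', 'f'],
               'Key3': ['a', 'c', 'm', 'n']}

df1 = pd.DataFrame.from_dict(Dictionary_, orient='index')

# One 0/1 column per (position, value); labels such as 'c' and 'd' repeat.
dummies = pd.get_dummies(df1, prefix='', prefix_sep='', dtype=int)

# Collapse the repeated labels: group rows of the transposed frame by label.
result = dummies.T.groupby(level=0).max().T
print(result)
```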
Use get_dummies:
>>> pd.get_dummies(df).rename(columns=lambda x: x[2:]).max(axis=1, level=0)
a d b c f m n
Key1 1 1 1 1 0 0 0
Key2 0 1 0 0 1 0 0
Key3 1 0 0 1 0 1 1

Move data from one column to one of two others based on a fourth column in pandas

So in Pandas I have the following dataframe
A B C D
0 X
1 Y
0 Y
1 Y
0 X
1 X
I want to move the value in A to either C or D depending on B. The output should be something like this;
A B C D
0 X 0
1 Y 1
0 Y 0
1 Y 1
0 X 0
1 X 1
I've tried using multiple where statements like
df['C'] = np.where(str(df.B).find('X'), df.A, '')
df['D'] = np.where(str(df.B).find('Y'), df.A, '')
But that results in;
A B C D
0 X 0 0
1 Y 1 1
0 Y 0 0
1 Y 1 1
0 X 0 0
1 X 1 1
So I guess it's checking if the value exists in the column at all, which makes sense. Do I need to iterate row by row?
Don't convert to str and call find: it returns a single scalar for the whole column, and np.where treats 0 as False and any other integer as True:
print (str(df.B).find('X'))
5
The simplest approach is to compare the values, which gives a boolean Series:
print (df.B == 'X')
0 True
1 False
2 False
3 False
4 True
5 True
Name: B, dtype: bool
df['C'] = np.where(df.B == 'X', df.A, '')
df['D'] = np.where(df.B == 'Y', df.A, '')
Another solution with assign + where:
df = df.assign(C=df.A.where(df.B == 'X', ''),
               D=df.A.where(df.B == 'Y', ''))
If you need to check for substrings, use str.contains:
df['C'] = np.where(df.B.str.contains('X'), df.A, '')
df['D'] = np.where(df.B.str.contains('Y'), df.A, '')
Or:
df['C'] = df.A.where(df.B.str.contains('X'), '')
df['D'] = df.A.where(df.B.str.contains('Y'), '')
All return:
print (df)
A B C D
0 0 X 0
1 1 Y 1
2 0 Y 0
3 1 Y 1
4 0 X 0
5 1 X 1
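The difference between the scalar from find and an element-wise mask can be checked with a short sketch. The frame is rebuilt from the question's table, and the cast to object is an assumption made here so that integers and empty strings can share a column cleanly.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0, 1, 0, 1], 'B': list('XYYYXX')})

# str() renders the whole Series as one string, so find() gives one integer.
pos = str(df['B']).find('X')
print(type(pos))  # <class 'int'>

# An element-wise comparison gives one boolean per row instead.
mask_x = df['B'] == 'X'
mask_y = df['B'] == 'Y'
print(mask_x.tolist())  # [True, False, False, False, True, True]

# Keep A where the mask holds, otherwise an empty string.
df['C'] = df['A'].astype(object).where(mask_x, '')
df['D'] = df['A'].astype(object).where(mask_y, '')
print(df)
```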
Using slice assignment
n = len(df)
f, u = pd.factorize(df.B.values)
a = np.empty((n, 2), dtype=object)
a.fill('')
a[np.arange(n), f] = df.A.values
df.loc[:, ['C', 'D']] = a
df
A B C D
0 0 X 0
1 1 Y 1
2 0 Y 0
3 1 Y 1
4 0 X 0
5 1 X 1
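The slice assignment relies on what pd.factorize returns, which the sketch below prints on plain arrays holding the question's values. One caveat worth noting: the codes follow order of first appearance, so 'X' maps to the first output column only because it appears first in B.

```python
import numpy as np
import pandas as pd

A = np.array([0, 1, 0, 1, 0, 1])
B = np.array(list('XYYYXX'), dtype=object)

# codes[i] is the position of B[i] among the distinct values of B.
codes, uniques = pd.factorize(B)
print(codes.tolist())  # [0, 1, 1, 1, 0, 0]
print(list(uniques))   # ['X', 'Y']

n = len(A)
out = np.empty((n, 2), dtype=object)
out.fill('')
# For row i, write A[i] into column codes[i]; the other column stays ''.
out[np.arange(n), codes] = A
print(out)
```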

insert a list as row in a dataframe at a specific position

I have a list l=['a', 'b' ,'c']
and a dataframe with columns d,e,f and values are all numbers
How can I insert list l in my dataframe just below the columns.
Setup
df = pd.DataFrame(np.ones((2, 3), dtype=int), columns=list('def'))
l = list('abc')
df
d e f
0 1 1 1
1 1 1 1
Option 1
I'd accomplish this task by adding a level to the columns object
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns, l)))
df
d e f
a b c
0 1 1 1
1 1 1 1
Option 2
Use a dictionary comprehension passed to the dataframe constructor
pd.DataFrame({(i, j): df[i] for i, j in zip(df, l)})
d e f
a b c
0 1 1 1
1 1 1 1
But if you insist on putting it in the dataframe proper... (keep in mind, this turns the dataframe into dtype object and we lose significant computational efficiencies.)
Alternative 1 (DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat instead)
pd.DataFrame([l], columns=df.columns).append(df, ignore_index=True)
d e f
0 a b c
1 1 1 1
2 1 1 1
Alternative 2
pd.DataFrame([l] + df.values.tolist(), columns=df.columns)
d e f
0 a b c
1 1 1 1
2 1 1 1
Use pd.concat
In [1112]: df
Out[1112]:
d e f
0 0.517243 0.731847 0.259034
1 0.318821 0.551298 0.773115
2 0.194192 0.707525 0.804102
3 0.945842 0.614033 0.757389
In [1113]: pd.concat([pd.DataFrame([l], columns=df.columns), df], ignore_index=True)
Out[1113]:
d e f
0 a b c
1 0.517243 0.731847 0.259034
2 0.318821 0.551298 0.773115
3 0.194192 0.707525 0.804102
4 0.945842 0.614033 0.757389
Are you looking for append? For example (note that DataFrame.append was removed in pandas 2.0):
df = pd.DataFrame([[1,2,3]],columns=list('def'))
I = ['a','b','c']
ndf = df.append(pd.Series(I,index=df.columns.tolist()),ignore_index=True)
Output:
d e f
0 1 2 3
1 a b c
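On pandas versions without DataFrame.append, the same result can probably be obtained with pd.concat. This is a sketch of an equivalent call, not the code from the answer above.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=list('def'))
I = ['a', 'b', 'c']

# Wrap the list in a one-row frame with matching columns, then concatenate.
new_row = pd.DataFrame([I], columns=df.columns)
ndf = pd.concat([df, new_row], ignore_index=True)
print(ndf)
```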
If you want to add the list to the columns as a MultiIndex:
df.columns = [df.columns, l]
print (df)
d e f
a b c
0 4 7 1
1 5 8 3
2 4 9 5
3 5 4 7
4 5 2 1
5 4 3 0
print (df.columns)
MultiIndex(levels=[['d', 'e', 'f'], ['a', 'b', 'c']],
labels=[[0, 1, 2], [0, 1, 2]])
If you want to add the list at a specific position pos:
pos = 0
df1 = pd.DataFrame([l], columns=df.columns)
print (df1)
d e f
0 a b c
df = pd.concat([df.iloc[:pos], df1, df.iloc[pos:]], ignore_index=True)
print (df)
d e f
0 a b c
1 4 7 1
2 5 8 3
3 4 9 5
4 5 4 7
5 5 2 1
6 4 3 0
But if you append this list to a numeric dataframe, you get mixed types (numbers and strings), so some pandas functions may fail.
Setup:
df = pd.DataFrame({'d':[4,5,4,5,5,4],
                   'e':[7,8,9,4,2,3],
                   'f':[1,3,5,7,1,0]})
print (df)
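The positional insert can be wrapped in a small function. In this sketch, insert_row is a hypothetical helper name used only for illustration. It slices the frame at pos and concatenates the pieces around the new row.

```python
import pandas as pd


def insert_row(df, row, pos):
    """Hypothetical helper: return a new frame with `row` inserted
    so that it ends up at integer position `pos`."""
    new = pd.DataFrame([row], columns=df.columns)
    return pd.concat([df.iloc[:pos], new, df.iloc[pos:]], ignore_index=True)


df = pd.DataFrame({'d': [4, 5, 4, 5, 5, 4],
                   'e': [7, 8, 9, 4, 2, 3],
                   'f': [1, 3, 5, 7, 1, 0]})

result = insert_row(df, ['a', 'b', 'c'], pos=2)
print(result)
```

The original df is left unchanged, and the columns of the result hold a mix of numbers and strings.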
