Python: Creating new column names in a for loop - python

I am trying to make custom column header names for the dataframe using a for loop. Currently I am using two for loops to iterate through a dataframe, but don't know how to put new column headers in without hardcoding them. I have
import pandas
df = pandas.DataFrame({
    'A':[5,3,6,9,2,4],
    'B':[4,5,4,5,5,4],
    'C':[7,8,9,4,2,3],
    'D':[1,3,5,7,1,0],})
result = []
for i in range(len(df.columns)):
    SelectedCol = (df.iloc[:,i])
    for c in range(i+1, len(df.columns)):
        result.append(((SelectedCol+1)/ (df.iloc[:,c]+1)))
df1 = pandas.DataFrame(result)
df1=df1.transpose()
In df, the first column is taken and multiplied by the second, third, and fourth. The code then takes the second column and multiplies it by the third and fourth, and so on through the for loop, so the output columns are
'AB' , 'AC', 'AD', 'BC', 'BD', and 'CD'.
What could I add to my for loop to extract the column names so each column name of df1 can be 'Long A, Short B' , 'Long A, Short C'.... and finally 'Long C, Short D'
Thanks for your help

from itertools import combinations
for x,y in combinations(df.columns,2):
    df['Long '+x+', Short '+y]=df[x]*df[y]
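If the goal is a separate df1 that keeps the original calculation from the question, (x+1)/(y+1), the same combinations idea can be sketched with a dict comprehension. This is only a sketch under that assumption; the column label format 'Long A, Short B' is taken from the question.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'A': [5, 3, 6, 9, 2, 4],
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
})

# One output column per unordered pair of columns; the label is built
# from the pair itself, so nothing is hardcoded.
df1 = pd.DataFrame({
    f'Long {x}, Short {y}': (df[x] + 1) / (df[y] + 1)
    for x, y in combinations(df.columns, 2)
})

print(df1.columns.tolist())
```

Because the result goes into a new DataFrame, df itself is left unchanged and no transpose is needed.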

import pandas
from itertools import combinations
df = pandas.DataFrame({
    'A': [5, 3, 6, 9, 2, 4],
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0], })
# get all col name
# (iteritems() was removed in pandas 2.0; items() is the replacement)
for index, row in df.items():
    print(index)
# get all combinations of (column name, column data) pairs
result = combinations(df.items(), 2)
# calc
for name, data in result:
    _name = name[0] + data[0]
    _data = name[1] * data[1]
    df[_name] = _data
print(df)
A
B
C
D
A B C D AB AC AD BC BD CD
0 5 4 7 1 20 35 5 28 4 7
1 3 5 8 3 15 24 9 40 15 24
2 6 4 9 5 24 54 30 36 20 45
3 9 5 4 7 45 36 63 20 35 28
4 2 5 2 1 10 4 2 10 5 2
5 4 4 3 0 16 12 0 12 0 0

Related

Delete rows in apply() function or depending on apply() result

Here I have a working solution, but my question focuses on how to do this the Pandas way. I assume Pandas offers better solutions for this.
I use groupby() and then apply(axis=1) to compare the values in the rows of the groups. While doing this I decide which row to delete.
The rule doesn't matter! In this example the rule is that when values in column A differ only by 1 (the values are "near"), the second one is deleted. How the decision is made is not part of the question. There could also be a list of color names, and I would say that darkblue and marineblue are "near" and one of them should be deleted.
The initial data frame is this.
X A B
0 A 9 0 <--- DELETE
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
7 B 11 7 <--- DELETE
8 B 30 8
Row index 0 should be deleted because its value 9 is near the value 8 in row index 2. The same goes for row index 7: its value 11 is "near" 10 in row index 5.
This is the code
#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame(
    {
        'X': list('AAAAABBBB'),
        'A': [9, 14, 8, 1, 18, 10, 20, 11, 30],
        'B': range(9)
    }
)
print(df)
def mark_near_neighbors(group):
    # I snip the decision process here.
    # Delete 9 because it is "near" 8.
    default_result = pd.Series(
        data=[False] * len(group),
        index=['Delete'] * len(group)
    )
    # compare strings with ==, not with is
    if group.X.iloc[0] == 'A':
        # the 9
        default_result.iloc[0] = True
    else:
        # the 11
        default_result.iloc[2] = True
    return default_result
result = df.groupby('X').apply(mark_near_neighbors)
result = result.reset_index(drop=True)
print(result)
df = df.loc[~result]
print(df)
So in the end I use a "boolean indexing thing" to solve this
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
dtype: bool
But is there a better way to do this?
Initialize the dataframe
df = pd.DataFrame([
    ['A', 9, 0],
    ['A', 14, 1],
    ['A', 8, 2],
    ['A', 1, 3],
    ['A', 18, 4],
    ['B', 10, 5],
    ['B', 20, 6],
    ['B', 11, 7],
    ['B', 30, 8],
], columns=['X', 'A', 'B'])
Sort the dataframe by column A
df = df.sort_values('A')
Find the difference between consecutive values within each group
df["diff"] = df.groupby('X')['A'].diff()
Select the rows where the difference is not 1
result = df[df["diff"] != 1.0]
Drop the extra column and sort by index to restore the original row order
result = result.drop("diff", axis=1)
result = result.sort_index()
Sample output
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
8 B 30 8
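The sort-and-diff steps above only match a gap of exactly 1. A minimal sketch of the same idea with a threshold, assuming "near" means an absolute difference of at most thresh (thresh is a name introduced here for illustration), and using the data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'X': list('AAAAABBBB'),
    'A': [9, 14, 8, 1, 18, 10, 20, 11, 30],
    'B': range(9),
})

thresh = 1  # assumed meaning of "near": difference <= thresh

# Gap to the next-smaller value inside each group; the smallest value
# of a group has no predecessor and gets NaN.
gap = df.sort_values('A').groupby('X')['A'].diff()

# NaN compares as False, so group minimums are never marked.
# reindex() puts the mask back into the original row order.
near = gap.le(thresh).reindex(df.index)

out = df[~near]
print(out)
```

This leaves df untouched and avoids the temporary "diff" column; rows 0 and 7 are the ones filtered out for this data.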
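IIUC, you can use numpy broadcasting to compare all values within a group. I keep everything inside apply here, since that seems to be what you want:
import numpy as np
def mark_near_neighbors(group, thresh=1):
    # values of the group, sorted ascending
    a = group.to_numpy().astype(float)
    idx = np.argsort(a)
    b = a[idx]
    # pairwise absolute differences
    d = abs(b-b[:,None])
    # mask the diagonal and upper triangle so each value is only compared with the ones sorted before it
    d[np.triu_indices(d.shape[0])] = thresh+1
    # keep a value if it is farther than thresh from all of those; restore the original order
    return pd.Series((d>thresh).all(1)[np.argsort(idx)], index=group.index)
# group_keys=False keeps the original index, so the mask aligns with df
out = df[df.groupby('X', group_keys=False)['A'].apply(mark_near_neighbors)]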
output:
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
8 B 30 8

Mapping Tuple Dictionary to Multiple columns of a DataFrame

I've got a PDB DataFrame with residue insertion codes. Here is a simplified example.
d = {'ATOM' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
'residue_number' : [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
'insertion' : ['', '', '', '', '', '', 'A', 'A', 'A', '', '', '']}
df = pd.DataFrame(data = d)
Dataframe:
ATOM residue_number insertion
0 1 2
1 2 2
2 3 2
3 4 3
4 5 3
5 6 3
6 7 3 A
7 8 3 A
8 9 3 A
9 10 5
10 11 5
11 12 5
I need to renumber the residues according to a different numbering and insertion scheme. Output from the renumbering script can be formatted into a dictionary of tuples, e.g.
my_dict = {(2,): 1, (3,): 2, (3, 'A') : 3, (5, ) : (4, 'A') }
Is it possible to map this dictionary of tuples onto the two columns ['residue_number', 'insertion']? The desired output would be:
ATOM residue_number insertion
0 1 1
1 2 1
2 3 1
3 4 2
4 5 2
5 6 2
6 7 3
7 8 3
8 9 3
9 10 4 A
10 11 4 A
11 12 4 A
I've been searching and banging my head on this for a few days. I've tried mapping and MultiIndex but can't seem to find a way to map a dictionary of tuples across multiple columns. I feel like I'm thinking about it wrong somehow. Thanks for any advice!
In this case I think you need to define a function that takes your old residue_number and insertion as input and outputs the new ones. To avoid extra coding, I will work directly from the df and redefine your my_dict so that every key and value is a full tuple, e.g. (2,) becomes (2,'').
Here is the code:
import pandas as pd
d = {'ATOM' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
     'residue_number' : [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
     'insertion' : ['', '', '', '', '', '', 'A', 'A', 'A', '', '', '']}
df = pd.DataFrame(data = d)
# Our new dict with keys and values as tuples
my_new_dict = {(2,''): (1,''), (3,''): (2,''), (3,'A'): (3,''), (5,''): (4,'A') }
# We need a function that maps a tuple (residue_number, insertion) into your new_residue_number and new_insertion values
def new_residue_number(residue_number, insertion, my_new_dict):
    # keys are tuples
    key = (residue_number, insertion)
    # Return new residue_number and insertion values
    return my_new_dict[key]
# Example to see how this works
print(new_residue_number(2, '', my_new_dict)) # Output (1,'')
print(new_residue_number(5, '', my_new_dict)) # Output (4, 'A')
print(new_residue_number(3, 'A', my_new_dict)) # Output (3,'')
# Now we apply this to our df and save it in the same df in two new columns
df[['new_residue_number','new_insertion']] = df.apply(lambda row: pd.Series(new_residue_number(row['residue_number'], row['insertion'], my_new_dict)), axis=1)
I hope this can solve your problem!
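If the original my_dict should be used as it comes out of the renumbering script, the padding of short keys and bare values can be done in code instead of by hand. The following is a rough sketch of that idea; pairs, keys and new_pairs are names introduced here for illustration, and '' is assumed to mean "no insertion code" as in the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    'ATOM': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'residue_number': [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
    'insertion': ['', '', '', '', '', '', 'A', 'A', 'A', '', '', ''],
})
my_dict = {(2,): 1, (3,): 2, (3, 'A'): 3, (5,): (4, 'A')}

# Pad every key and value to a (number, insertion) pair.
# (2,) becomes (2, ''), and the bare value 1 becomes (1, '').
pairs = {
    (k + ('',))[:2]: (v if isinstance(v, tuple) else (v, ''))
    for k, v in my_dict.items()
}

# Look up each row's (residue_number, insertion) pair in the padded dict.
keys = zip(df['residue_number'], df['insertion'])
new_pairs = [pairs[k] for k in keys]

df['residue_number'] = [p[0] for p in new_pairs]
df['insertion'] = [p[1] for p in new_pairs]
print(df)
```

A pair that is missing from my_dict raises a KeyError here, which may or may not be what you want for real PDB data.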
I think we can create a DataFrame from your dictionary after modifying it so that all values are tuples. Then we can use DataFrame.join. I think this is easier (and recommended) if we convert the blank values of the insertion column to NaN.
import numpy as np
new_df = (df.assign(insertion=df['insertion'].replace(r'^\s*$', np.nan, regex=True)
                                             .mask(df['insertion'].isnull()))
            .join(pd.DataFrame({x: (y if isinstance(y, tuple) else (y, np.nan))
                                for x, y in my_dict.items()},
                               index=['new_residue_number', 'new_insertion']).T,
                  on=['residue_number', 'insertion'])
            .fillna('')
            .drop(['residue_number', 'insertion'], axis=1)
            .rename(columns={'new_residue_number': 'residue_number',
                             'new_insertion': 'insertion'}))
print(new_df)
ATOM residue_number insertion
0 1 1.0
1 2 1.0
2 3 1.0
3 4 2.0
4 5 2.0
5 6 2.0
6 7 3.0
7 8 3.0
8 9 3.0
9 10 4.0 A
10 11 4.0 A
11 12 4.0 A
Detail
print(pd.DataFrame({x:(y if isinstance(y,tuple) else (y,np.nan))
for x,y in my_dict.items()},
index = ['new_residue_number','new_insertion']).T)
new_residue_number new_insertion
2 NaN 1 NaN
3 NaN 2 NaN
A 3 NaN
5 NaN 4 A
The logic here is a simple merge, but we need to do a lot of work to turn that dictionary into a suitable DataFrame for mapping. I'd reconsider whether you can store the renumbering output in the shape of my final DataFrame s from the start.
#Turn the dict into a mapping
s = pd.DataFrame(my_dict.values())[0].explode().to_frame()
s['idx'] = s.groupby(level=0).cumcount()
s = (s.pivot(columns='idx', values=0)
.rename_axis(None, axis=1)
.rename(columns={0: 'new_res', 1: 'new_ins'}))
s.index = pd.MultiIndex.from_tuples([*my_dict.keys()], names=['residue_number', 'insertion'])
s = s.reset_index().fillna('') # Because you have '' not NaN
# residue_number insertion new_res new_ins
#0 2 1
#1 3 2
#2 3 A 3
#3 5 4 A
The mapping is now a merge. I'll leave all columns in for clarity of the logic, but you can use the commented out code to drop the original columns and rename the new columns.
df = df.merge(s, how='left')
# Your real output with
#df = (df.merge(s, how='left')
# .drop(columns=['residue_number', 'insertion'])
# .rename(columns={'new_res': 'residue_number',
# 'new_ins': 'insertion'}))
ATOM residue_number insertion new_res new_ins
0 1 2 1
1 2 2 1
2 3 2 1
3 4 3 2
4 5 3 2
5 6 3 2
6 7 3 A 3
7 8 3 A 3
8 9 3 A 3
9 10 5 4 A
10 11 5 4 A
11 12 5 4 A

How to use a Series to filter a DataFrame

I have a pandas Series with the following content.
import pandas as pd
s = pd.Series(
    data = [True, False, True, True],
    index = ['A', 'B', 'C', 'D']
)
s.index.name = 'my_id'
print(s)
my_id
A True
B False
C True
D True
dtype: bool
and a DataFrame like this.
df = pd.DataFrame({
    'A': [1, 2, 9, 4],
    'B': [9, 6, 7, 8],
    'C': [10, 91, 32, 13],
    'D': [43, 12, 7, 9]
})
print(df)
A B C D
0 1 9 10 43
1 2 6 91 12
2 9 7 32 7
3 4 8 13 9
s has A, B, C, and D as its index labels. df also has A, B, C, and D as its column names.
True in s means that the corresponding column in df will be preserved. False in s means that the corresponding column in df will be removed.
How can I generate another DataFrame with column B removed using s?
I mean I want to create the following DataFrame using s and df.
A C D
0 1 10 43
1 2 91 12
2 9 32 7
3 4 13 9
Use boolean indexing with DataFrame.loc. The : selects all rows, and the columns are filtered by the boolean Series s, which acts as a mask:
df1 = df.loc[:, s]
print (df1)
A C D
0 1 10 43
1 2 91 12
2 9 32 7
3 4 13 9
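An equivalent way to read the mask is to turn it into a list of labels first, which can help if the kept or dropped column names are needed elsewhere. A small sketch, using the s and df from the question:

```python
import pandas as pd

s = pd.Series([True, False, True, True], index=['A', 'B', 'C', 'D'])
s.index.name = 'my_id'
df = pd.DataFrame({
    'A': [1, 2, 9, 4],
    'B': [9, 6, 7, 8],
    'C': [10, 91, 32, 13],
    'D': [43, 12, 7, 9],
})

kept = s[s].index       # labels flagged True
dropped = s[~s].index   # labels flagged False

df1 = df[kept]                   # select the columns to keep
df2 = df.drop(columns=dropped)   # or drop the unwanted ones

print(df1.columns.tolist())  # ['A', 'C', 'D']
```

Both variants give the same frame as df.loc[:, s] for this data.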

change column name using index

import pandas as pd
d = {
'one': [1, 2, 3, 4, 5],
'one': [9, 8, 7, 6, 5],
'three': ['a', 'b', 'c', 'd', 'e']
}
df = pd.DataFrame(d)
I have a bigger dataframe with multiple columns that share the same name.
I want to change a column name by its position, as in R,
e.g. colnames(df)[2]='two'
I want to change the second column name from 'one' to 'two', and I want to do
that in Python.
I think the simplest way is to assign new column names with np.arange or range:
import numpy as np
# a valid dictionary has unique keys
d = {
    'one1': [1, 2, 3, 4, 5],
    'one2': [9, 8, 7, 6, 5],
    'three': ['a', 'b', 'c', 'd', 'e']
}
df = pd.DataFrame(d)
df.columns = ['one'] * 2 + ['three']
print (df)
one one three
0 1 9 a
1 2 8 b
2 3 7 c
3 4 6 d
4 5 5 e
df.columns = np.arange(len(df.columns))
#alternative
#df.columns = range(len(df.columns))
print (df)
0 1 2
0 1 9 a
1 2 8 b
2 3 7 c
3 4 6 d
4 5 5 e
Then select by name:
print (df[1])
0 9
1 8
2 7
3 6
4 5
Name: 1, dtype: int64
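To rename only one column by its position, closer to the R line colnames(df)[2]='two', the column labels can be copied to a list, changed at that position, and assigned back. A minimal sketch (note that Python positions start at 0, so the second column is position 1):

```python
import pandas as pd

# Duplicate column names have to be passed explicitly, because a dict
# cannot hold the key 'one' twice.
df = pd.DataFrame(
    [[1, 9, 'a'], [2, 8, 'b'], [3, 7, 'c']],
    columns=['one', 'one', 'three'],
)

cols = df.columns.tolist()
cols[1] = 'two'      # second column, position 1
df.columns = cols

print(df.columns.tolist())  # ['one', 'two', 'three']
```

df.rename(columns={'one': 'two'}) is not a substitute here, since it works by label and would rename both columns called 'one'.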

create a dataframe from 3 other dataframes in python

I am trying to create a new df which summarises my key information, by taking that information from 3 (say) other dataframes.
import numpy as np
import pandas as pd
dfdate = {'x1': [2, 4, 7, 5, 6],
          'x2': [2, 2, 2, 6, 7],
          'y1': [3, 1, 4, 5, 9]}
dfdate = pd.DataFrame(dfdate, index=range(5))
dfqty = {'x1': [1, 2, 6, 6, 8],
         'x2': [3, 1, 1, 7, 5],
         'y1': [2, 4, 3, 2, 8]}
dfqty = pd.DataFrame(dfqty, index=range(5))
dfprices = {'x1': [0, 2, 2, 4, 4],
            'x2': [2, 0, 0, 3, 4],
            'y1': [1, 3, 2, 1, 3]}
dfprices = pd.DataFrame(dfprices, index=range(5))
Let us say the above 3 dataframes are my data. Say, some dates, qty, and prices of goods. My new df is to be constructed from the above data:
rng = len(dfprices.columns)*len(dfprices.index) # This is the len of new df
dfnew = pd.DataFrame(np.nan, index=range(0, rng), columns=['Letter', 'Number', 'date', 'qty', 'price'])
Now, this is where I struggle to piece my stuff together. I am trying to take all the data in dfdate and put it into a column in the new df, and the same with dfqty and dfprices (so the 3x5 matrices essentially go to a 1x15 vector and are placed into the new df).
As well as that, I need a couple of columns in dfnew as identifiers, taken from the column names of the old dfs.
I've tried for loops but to no avail, and I don't know how to convert a df to a series. My desired output is:
dfnew:
'Lettercol','Numbercol', 'date', 'qty', 'price'
0 X 1 2 1 0
1 X 1 4 2 2
2 X 1 7 6 2
3 X 1 5 6 4
4 X 1 6 8 4
5 X 2 2 3 2
6 X 2 2 1 0
7 X 2 2 1 0
8 X 2 6 7 3
9 X 2 7 5 4
10 Y 1 3 2 1
11 Y 1 1 4 3
12 Y 1 4 3 2
13 Y 1 5 2 1
14 Y 1 9 8 3
where the numbers 0-14 are the index.
letter = letter from col header in DFs
number = number from col header in DFs
next 3 columns are data from the orig df's
(don't ask why the original data is in that funny format :)
thanks so much. my last Q wasn't well received so have tried to make this one better, thanks
Use:
#list of DataFrames
dfs = [dfdate, dfqty, dfprices]
#list comprehension with reshape
comb = [x.unstack() for x in dfs]
#join together
df = pd.concat(comb, axis=1, keys=['date', 'qty', 'price'])
#remove second level of MultiIndex and index to column
df = df.reset_index(level=1, drop=True).reset_index().rename(columns={'index':'col'})
#extract all values without first by indexing [1:] and first letter by [0]
df['Number'] = df['col'].str[1:]
df['Letter'] = df['col'].str[0]
cols = ['Letter', 'Number', 'date', 'qty', 'price']
#change order of columns
df = df.reindex(columns=cols)
print (df)
Letter Number date qty price
0 x 1 2 1 0
1 x 1 4 2 2
2 x 1 7 6 2
3 x 1 5 6 4
4 x 1 6 8 4
5 x 2 2 3 2
6 x 2 2 1 0
7 x 2 2 1 0
8 x 2 6 7 3
9 x 2 7 5 4
10 y 1 3 2 1
11 y 1 1 4 3
12 y 1 4 3 2
13 y 1 5 2 1
14 y 1 9 8 3
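The same reshaping can also be sketched with DataFrame.melt, which stacks the columns of a frame on top of each other. This is only an alternative sketch: it assumes all three frames share the same column order and row count, and frames is a name introduced here for illustration. The letter is upper-cased to match the desired output in the question.

```python
import pandas as pd

dfdate = pd.DataFrame({'x1': [2, 4, 7, 5, 6],
                       'x2': [2, 2, 2, 6, 7],
                       'y1': [3, 1, 4, 5, 9]})
dfqty = pd.DataFrame({'x1': [1, 2, 6, 6, 8],
                      'x2': [3, 1, 1, 7, 5],
                      'y1': [2, 4, 3, 2, 8]})
dfprices = pd.DataFrame({'x1': [0, 2, 2, 4, 4],
                         'x2': [2, 0, 0, 3, 4],
                         'y1': [1, 3, 2, 1, 3]})

frames = {'date': dfdate, 'qty': dfqty, 'price': dfprices}

# melt() without arguments yields two columns: 'variable' (the old
# column name) and 'value', with a fresh 0..14 index.
dfnew = pd.DataFrame({'col': dfdate.melt()['variable']})
for name, frame in frames.items():
    dfnew[name] = frame.melt()['value']

# Split the old column name into its letter and number parts.
dfnew.insert(0, 'Letter', dfnew['col'].str[0].str.upper())
dfnew.insert(1, 'Number', dfnew['col'].str[1:].astype(int))
dfnew = dfnew.drop(columns='col')
print(dfnew)
```

Because melt() resets the index, the three melted frames line up row by row only under the shared-layout assumption stated above.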
