Selecting columns with startswith in pandas - python

Hi, I have some data and I want to rename one of the columns and select the columns whose names start with the string 't'.
import pandas as pd

raw_data = {'patient': [1, 1, 1, 2, 2],
            'obs': [1, 2, 3, 1, 2],
            'treatment': [0, 1, 0, 1, 0],
            'score': ['strong', 'weak', 'normal', 'weak', 'strong'],
            'tr': [1, 2, 3, 4, 5],
            'tk': [6, 7, 8, 9, 10],
            'ak': [11, 12, 13, 14, 15]
            }
df = pd.DataFrame(raw_data, columns=['patient', 'obs', 'treatment', 'score', 'tr', 'tk', 'ak'])
df
   patient  obs  treatment   score  tr  tk  ak
0        1    1          0  strong   1   6  11
1        1    2          1    weak   2   7  12
2        1    3          0  normal   3   8  13
3        2    1          1    weak   4   9  14
4        2    2          0  strong   5  10  15
So, following python-pandas-renaming-column-name-startswith, I tried
df.rename(columns = {'treatment':'treat'})[['score','obs',df[df.columns[pd.Series(df.columns).str.startswith('t')]]]]
but I am getting this error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
How can I select the columns that start with 't'?
Thx

Converting to a Series is not necessary; but if you want to add the result to another list of columns, convert the output to a list:
cols = df.columns[df.columns.str.startswith('t')].tolist()
df = df[['score','obs'] + cols].rename(columns = {'treatment':'treat'})
Another idea is to use two masks and chain them by | for bitwise OR.
Notice:
In your solution the column names are filtered from the original column names before renaming, so it is necessary to rename afterwards.
m1 = df.columns.str.startswith('t')
m2 = df.columns.isin(['score','obs'])
df = df.loc[:, m1 | m2].rename(columns = {'treatment':'treat'})
print (df)
   obs  treat   score  tr  tk
0    1      0  strong   1   6
1    2      1    weak   2   7
2    3      0  normal   3   8
3    1      1    weak   4   9
4    2      0  strong   5  10
If you need to rename first, it is necessary to reassign the result back so you can filter by the renamed column names:
df = df.rename(columns = {'treatment':'treat'})
df = df.loc[:, df.columns.str.startswith('t') | df.columns.isin(['score','obs'])]

# select columns starting with "t"
df = df[df.columns[df.columns.str.startswith('t')]]
# rename the column; rename returns a new DataFrame, so assign it back
df = df.rename(columns={'treatment': 'treat'})
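As an aside, DataFrame.filter can make the same selection with a regular expression, without building a boolean mask by hand:
# select all columns whose names start with 't'
df.filter(regex='^t')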

Related

How to multiply pandas dataframe columns with dictionary value where dictionary key matches dataframe index

Is there a better way than iterating over columns to multiply column values by dictionary values where the dictionary key matches a specific dataframe column? Given a dataframe of:
import pandas as pd
df = pd.DataFrame({
    'category': [1, 2, 3, 4, 5],
    'a': [5, 4, 3, 3, 4],
    'b': [3, 2, 4, 3, 10],
    'c': [3, 2, 1, 1, 1]
})
And a dictionary of:
lookup = {1:0, 2:4, 3:1, 4:6, 5:2}
I can multiply each column other than 'category' by the dictionary value where the key matches 'category' this way:
for t in df.columns[1:]:
    df[t] = df[t].mul(df['category'].map(lookup)).fillna(df[t])
But there must be a more succinct way to do this other than iterating over columns?
import pandas as pd
df = pd.DataFrame({
    'category': [1, 2, 3, 4, 5],
    'a': [5, 4, 3, 3, 4],
    'b': [3, 2, 4, 3, 10],
    'c': [3, 2, 1, 1, 1]
})
lookup = {1:0, 2:4, 3:1, 4:6, 5:2}
out = df.set_index("category").mul(lookup, axis=0).reset_index()
print(out)
Output:
   category   a   b  c
0         1   0   0  0
1         2  16   8  8
2         3   3   4  1
3         4  18  18  6
4         5   8  20  2
Another way
df.iloc[:, 1:] = df.iloc[:, 1:].mul(df['category'].map(lookup), axis=0)
   category   a   b  c
0         1   0   0  0
1         2  16   8  8
2         3   3   4  1
3         4  18  18  6
4         5   8  20  2
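Note that these mul-based versions produce NaN for any category missing from lookup, while the original loop fell back to the unmodified values via fillna. A small sketch that keeps that fallback (a missing key becomes a no-op multiplier of 1):
# categories absent from `lookup` map to NaN; fillna(1) leaves those rows unchanged
factors = df['category'].map(lookup).fillna(1)
df.iloc[:, 1:] = df.iloc[:, 1:].mul(factors, axis=0)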

How to create multiple columns in Pandas Dataframe?

I have data as you can see in the terminal. I need it converted to the Excel sheet format shown in the Excel file by creating multi-level columns.
I researched this and tried many different things but could not achieve my goal. Then I found transpose, which gave me the shape I need, but unfortunately it reshaped columns into rows, so I got the wrong data ordering.
Current result: (screenshot omitted)
Desired result: (screenshot omitted)
What can I try next?
You can use the pivot() function and reorder the multi-column levels.
Before that, index/group the data by repeated iterations/rounds:
import itertools
import pandas as pd

data = [
    (2, 0, 0, 1),
    (10, 2, 5, 3),
    (2, 0, 0, 0),
    (10, 1, 1, 1),
    (2, 0, 0, 0),
    (10, 1, 2, 1),
]
columns = ["player_number", "cel1", "cel2", "cel3"]
df = pd.DataFrame(data=data, columns=columns)
# count how many rows each player_number has (= number of rounds)
df_nbr_plr = df[["player_number"]].groupby("player_number").agg(cnt=("player_number", "count"))
# label each block of players with its round number: 0, 0, 1, 1, 2, 2
df["round"] = list(itertools.chain.from_iterable(
    itertools.repeat(x, df_nbr_plr.shape[0]) for x in range(df_nbr_plr.iloc[0, 0])))
[Out]:
   player_number  cel1  cel2  cel3  round
0              2     0     0     1      0
1             10     2     5     3      0
2              2     0     0     0      1
3             10     1     1     1      1
4              2     0     0     0      2
5             10     1     2     1      2
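As a side note, assuming each player appears exactly once per round, the same round column can be built more simply with groupby().cumcount(), which numbers the repeated occurrences of each player:
df["round"] = df.groupby("player_number").cumcount()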
Now, pivot and reorder the column levels:
df = df.pivot(index="round", columns="player_number").reorder_levels([1,0], axis=1).sort_index(axis=1)
[Out]:
player_number     2               10
               cel1 cel2 cel3  cel1 cel2 cel3
round
0                 0    0    1     2    5    3
1                 0    0    0     1    1    1
2                 0    0    0     1    2    1
This can be done with unstack after setting player__number as the index. You have to reorder the MultiIndex columns and fill missing values / delete duplicate rows, though:
import pandas as pd

data = {"player__number": [2, 10, 2, 10, 2, 10],
        "cel1": [0, 2, 0, 1, 0, 1],
        "cel2": [0, 5, 0, 1, 0, 2],
        "cel3": [1, 3, 0, 1, 0, 1],
        }
df = pd.DataFrame(data).set_index('player__number', append=True)
# unstack, then reorder and sort the MultiIndex columns
df = df.unstack('player__number').reorder_levels([1, 0], axis=1).sort_index(axis=1)
# forward-fill the alternating NaNs and keep only every second row
df = df.ffill().iloc[1::2].reset_index(drop=True)
df.to_excel('output.xlsx')
Output: (screenshot of the resulting output.xlsx omitted)

How do I merge (insert) rows from one dataframe into another one in Pandas?

Let's suppose I have a following dataframe:
df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'val': [0, 0, 0, 0, 0]})
I want to modify the column val with values from other dataframes like these:
df1 = pd.DataFrame({'id': [2, 3], 'val': [1, 1]})
df2 = pd.DataFrame({'id': [1, 5], 'val': [2, 2]})
I need a function merge_values_into_df that would work in the way to provide the following result:
df = merge_values_into_df(df1, on='id', field='val')
df = merge_values_into_df(df2, on='id', field='val')
print(df)
   id  val
0   1    2
1   2    1
2   3    1
3   4    0
4   5    2
I need an efficient (by CPU and memory) solution because I want to apply the approach to huge dataframes.
Use DataFrame.update after converting id to the index in all DataFrames:
df = df.set_index('id')
df1 = df1.set_index('id')
df2 = df2.set_index('id')
df.update(df1)
df.update(df2)
df = df.reset_index()
print (df)
   id  val
0   1  2.0
1   2  1.0
2   3  1.0
3   4  0.0
4   5  2.0
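Note that update upcasts val to float, as the output above shows. If you need integers back, a final cast restores them (assuming no NaN remains after the updates):
df['val'] = df['val'].astype(int)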
You can concat all dataframes and drop duplicated ids, keeping the last occurrence of each id.
out = pd.concat([df, df1, df2]).drop_duplicates('id', keep='last')
print(out.sort_values('id', ignore_index=True))
# Output
   id  val
0   1    2
1   2    1
2   3    1
3   4    0
4   5    2

Removing certain Rows from subset of df

I have a pandas dataframe. All the columns right of column#2 may only contain the value 0 or 1. If they contain a value that is NOT 0 or 1, I want to remove that entire row from the dataframe.
So I created a subset of the dataframe to only contain columns right of #2
Then I found the indices of the rows that had values other than 0 or 1 and deleted them from the original dataframe.
Please see the code below:
#reading data file:
data=pd.read_csv('MyData.csv')
#all the columns right of column#2 may only contain the value 0 or 1. So "prod" is a subset of the data df containing these columns:
prod = data.iloc[:,2:]
index_prod = prod[ (prod!= 0) & (prod!= 1)].dropna().index
data = data.drop(index_prod)
However, when I run this, the index_prod vector is empty and so nothing gets dropped at all.
Edit: my friend just told me that the data was not numeric, and he fixed it by making it numeric. Can anyone please advise how I could have found that out myself? All the columns seemed numeric to me. All numbers.
You can check the dtypes with DataFrame.dtypes.
print (data.dtypes)
Or:
import numpy as np
print (data.columns.difference(data.select_dtypes(np.number).columns))
And then convert all columns except the first 2 to numeric:
data.iloc[:,2:] = data.iloc[:,2:].apply(lambda x: pd.to_numeric(x, errors='coerce'))
Or all columns:
data = data.apply(lambda x: pd.to_numeric(x, errors='coerce'))
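To see what errors='coerce' does, here is a tiny made-up illustration; non-numeric strings become NaN instead of raising an error:
s = pd.Series(['1', '0', 'x'])
print(pd.to_numeric(s, errors='coerce'))
# 0    1.0
# 1    0.0
# 2    NaN
# dtype: float64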
And lastly, apply the filtering solution:
subset = data.iloc[:,2:]
data1 = data[subset.isin([0,1]).all(axis=1)]
Let's say you have this dataframe:
data = {'A': [1, 2, 3, 4, 5], 'B': [0, 1, 4, 3, 1], 'C': [2, 1, 0, 3, 4]}
df = pd.DataFrame(data)
   A  B  C
0  1  0  2
1  2  1  1
2  3  4  0
3  4  3  3
4  5  1  4
And you want to delete the rows where column B does not contain 0 or 1. You could accomplish that with:
index = df[(df['B'] != 0) & (df['B'] != 1)].index
df = df.drop(index)
df
   A  B  C
0  1  0  2
1  2  1  1
4  5  1  4
df = df.reset_index(drop=True)
df
   A  B  C
0  1  0  2
1  2  1  1
2  5  1  4
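The same filter can be written in one step with isin, which avoids building an index of rows to drop:
df = df[df['B'].isin([0, 1])].reset_index(drop=True)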

Pandas merge duplicate DataFrame columns preserving column names

How can I merge duplicate DataFrame columns and also keep all original column names?
e.g. If I have the DataFrame
df = pd.DataFrame({"col1" : [0, 0, 1, 2, 5, 3, 7],
"col2" : [0, 1, 2, 3, 3, 3, 4],
"col3" : [0, 1, 2, 3, 3, 3, 4]})
I can remove the duplicate columns (yes the transpose is slow for large DataFrames) with
df.T.drop_duplicates().T
but this only preserves one column name per unique column
   col1  col2
0     0     0
1     0     1
2     1     2
3     2     3
4     5     3
5     3     3
6     7     4
How can I keep the information on which columns were merged? e.g. something like
   [col1]  [col2, col3]
0       0             0
1       0             1
2       1             2
3       2             3
4       5             3
5       3             3
6       7             4
Thanks!
# group the columns by their values (each row of values acts as one grouping key)
grouped_columns = df.groupby(list(df.values), axis=1).apply(lambda g: g.columns.tolist())
# pick one column from each group of duplicate columns
unique_df = df.loc[:, grouped_columns.str[0]]
# build a new column name for each group; a list cannot be used as a column
# name, so join the names instead
unique_df.columns = grouped_columns.apply("-".join)
unique_df
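Note that groupby(..., axis=1) is deprecated in recent pandas (2.x); a sketch of an equivalent that groups the transposed frame instead, with the same keys and the same resulting groups:
# group the columns (now rows of df.T) by their full value vectors
grouped_columns = df.T.groupby(list(df.values)).apply(lambda g: g.index.tolist())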
I also used T and tuple with groupby:
def f(x):
    d = x.iloc[[0]]
    d.index = ['-'.join(x.index.tolist())]
    return d

df.T.groupby(df.apply(tuple), group_keys=False).apply(f).T
