I have a table
I want to sum values of the columns beloning to the same class h.*. So, my final table will look like this:
Is it possible to aggregate by string column name?
Thank you for any suggestions!
Use lambda function first for select first 3 characters with parameter axis=1 or indexing columns names similar way and aggregate sum:
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
Or:
df1 = df.set_index('object')
df2 = df1.groupby(df1.columns.str[:3], axis=1).sum().reset_index()
Sample:
np.random.seed(123)
cols = ['object', 'h.1.1','h.1.2','h.1.3','h.1.4','h.1.5',
'h.2.1','h.2.2','h.2.3','h.2.4','h.3.1','h.3.2','h.3.3']
df = pd.DataFrame(np.random.randint(10, size=(4, 13)), columns=cols)
print (df)
object h.1.1 h.1.2 h.1.3 h.1.4 h.1.5 h.2.1 h.2.2 h.2.3 h.2.4 \
0 2 2 6 1 3 9 6 1 0 1
1 9 3 4 0 0 4 1 7 3 2
2 4 8 0 7 9 3 4 6 1 5
3 8 3 5 0 2 6 2 4 4 6
h.3.1 h.3.2 h.3.3
0 9 0 0
1 4 7 2
2 6 2 1
3 3 0 6
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
print (df2)
object h.1 h.2 h.3
0 2 21 8 9
1 9 11 13 13
2 4 27 16 9
3 8 16 16 9
The solution above works great, but is vulnerable in case the h.X goes beyond single digits. I'd recommend the following:
Sample Data:
cols = ['h.%d.%d' %(i, j) for i in range(1, 11) for j in range(1, 11)]
df = pd.DataFrame(np.random.randint(10, size=(4, len(cols))), columns=cols, index=['p_%d'%p for p in range(4)])
Proposed Solution:
new_df = df.groupby(df.columns.str.split('.').str[1], axis=1).sum()
new_df.columns = 'h.' + new_df.columns # the columns are originallly numbered 1, 2, 3. This brings it back to h.1, h.2, h.3
Alternative Solution:
Going through multiindices might be more convoluted, but may be useful while manipulating this data elsewhere.
df.columns = df.columns.str.split('.', expand=True) # Transform into a multiindex
new_df = df.sum(axis = 1, level=[0,1])
new_df.columns = new_df.columns.get_level_values(0) + '.' + new_df.columns.get_level_values(1) # Rename columns
Related
I have a Dataframe df,you can have it by running:
import pandas as pd
data = [10,20,30,40,50,60]
df = pd.DataFrame(data, columns=['Numbers'])
df
now I want to check if df's columns are in an existing list,if not then create a new column and set the column value as 0,column name is same as the value of the list:
columns_list=["3","5","8","9","12"]
for i in columns_list:
if i not in df.columns.to_list():
df[i]=0
How can I code it in one line,I have tried this:
[df[i]=0 for i in columns_list if i not in df.columns.to_list()]
However the IDE return :
SyntaxError: cannot assign to subscript here. Maybe you meant '==' instead of '='?
Any friend can help ?
try:
columns_list=["3","5","8","9","12"]
df = df.reindex(
set(
list(df.columns) + columns_list
),
axis=1,
fill_value=0,
)
You can also use the dictionary unpacking operator.
columns_list = ["3","5","8","9","12"]
df = df.assign(**{col: 0 for col in columns_list if col not in df.columns})
With the df.assign, you unpack the dictionary you create with all the columns that are not part of your columns_list and add the value 0 for that column.
If you really want that in one, single line, then move the column_list as well
df = df.assign(**{col: 0 for col in ["3","5","8","9","12"] if col not in df.columns})
import numpy as np
import pandas as pd
# Some example data
df = pd.DataFrame(
np.random.randint(10, size=(5, 6)),
columns=map(str, range(6))
)
# 0 1 2 3 4 5
# 0 9 4 8 7 3 6
# 1 6 9 0 5 3 4
# 2 7 9 0 9 0 3
# 3 4 4 6 4 6 4
# 4 6 9 7 1 5 5
columns_list=["3","5","8","9","12"]
# Figure out which columns in your list do not appear in your dataframe
# by creating a new Index and using pd.Index.difference:
df[ pd.Index(columns_list).difference(df.columns, sort=False) ] = 0
# 0 1 2 3 4 5 8 9 12
# 0 9 4 8 7 3 6 0 0 0
# 1 6 9 0 5 3 4 0 0 0
# 2 7 9 0 9 0 3 0 0 0
# 3 4 4 6 4 6 4 0 0 0
# 4 6 9 7 1 5 5 0 0 0
Let's say I have a DF like this:
Mean 1
Mean 2
Stat 1
Stat 2
ID
5
10
15
20
Z
3
6
9
12
X
Now, I want to split the dataframe to separate the data based on whether it is a #1 or #2 for each ID.
Basically I would double the amount of rows for each ID, with each one being dedicated to either #1 or #2, and a new column will be added to specify which number we are looking at. Instead of Mean 1 and 2 being on the same row, they will be listed in two separate rows, with the # column making it clear which one we are looking at. What's the best way to do this? I was trying pd.melt(), but it seems like a slightly different use case.
Mean
Stat
ID
#
5
15
Z
1
10
20
Z
2
3
9
X
1
6
12
X
2
Use pd.wide_to_long:
new_df = pd.wide_to_long(
df, stubnames=['Mean', 'Stat'], i='ID', j='#', sep=' '
).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
Or set_index then str.split the columns then stack if order must match the OP:
new_df = df.set_index('ID')
new_df.columns = new_df.columns.str.split(expand=True)
new_df = new_df.stack().rename_axis(['ID', '#']).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 Z 2 10 20
2 X 1 3 9
3 X 2 6 12
Here is a solution with melt and pivot:
df = df.melt(id_vars=['ID'], value_name='Mean')
df[['variable', '#']] = df['variable'].str.split(expand=True)
df = (df.assign(idx=df.groupby('variable').cumcount())
.pivot(index=['idx', 'ID', '#'], columns='variable').reset_index().drop(('idx', ''), axis=1))
df.columns = [col[0] if col[1] == '' else col[1] for col in df.columns]
df
Out[1]:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
I am using pandas v0.25.3. and am inexperienced but learning.
I have a dataframe and would like to swap the contents of two columns leaving the columns labels and sequence intact.
df = pd.DataFrame ({"A": [(1),(2),(3),(4)],
'B': [(5),(6),(7),(8)],
'C': [(9),(10),(11),(12)]})
This yields a dataframe,
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want to swap column contents B and C to get
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
I have tried looking at pd.DataFrame.values which sent me to numpy array and advanced slicing and got lost.
Whats the simplest way to do this?.
You can assign numpy array:
#pandas 0.24+
df[['B','C']] = df[['C','B']].to_numpy()
#oldier pandas versions
df[['B','C']] = df[['C','B']].values
Or use DataFrame.assign:
df = df.assign(B = df.C, C = df.B)
print (df)
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
Or just use:
df['B'], df['C'] = df['C'], df['B'].copy()
print(df)
Output:
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
You can also swap the labels:
df.columns = ['A','C','B']
If your DataFrame is very large, I believe this would require less from your computer than copying all the data.
If the order of the columns is important, you can then reorder them:
df = df.reindex(['A','B','C'], axis=1)
I have imported the following Excel file but would like to sort it based on Frequency descending, but then with 'Other','No data' and 'All' (the total) at the bottom in that order. Is this possible?
table1 = pd.read_excel("table1.xlsx")
table1
Use:
df = pd.DataFrame({
'generalenq':list('abcdef'),
'percentage':[1,3,5,7,1,0],
'frequency':[5,3,6,9,2,4],
})
df.loc[0, 'generalenq'] = 'All'
df.loc[2, 'generalenq'] = 'No data'
df.loc[3, 'generalenq'] = 'Other'
print (df)
generalenq percentage frequency
0 All 1 5
1 b 3 3
2 No data 5 6
3 Other 7 9
4 e 1 2
5 f 0 4
First create dictionary for ordering by some integers. Then create mask by membership with Series.isin and sorting non matched rows selected with ~ for invert mask with boolean indexing:
d = {'Other':0,'No data':1,'All':2}
mask = df['generalenq'].isin(list(d.keys()))
df1 = df[~mask].sort_values('frequency', ascending=False)
print (df1)
generalenq percentage frequency
5 f 0 4
1 b 3 3
4 e 1 2
Then filter matched rows by mask and create helper column for sorting by mapped dict:
df2 = df[mask].assign(new = lambda x: x['generalenq'].map(d)).sort_values('new').drop('new', 1)
print (df2)
generalenq percentage frequency
3 Other 7 9
2 No data 5 6
0 All 1 5
And last join together by concat:
df = pd.concat([df1, df2], ignore_index=True)
print (df)
generalenq percentage frequency
0 f 0 4
1 b 3 3
2 e 1 2
3 Other 7 9
4 No data 5 6
5 All 1 5
I have 2 sample data frames:
df1 =
a_1 b_1 a_2 b_2
1 2 3 4
5 6 7 8
and
df2 =
c
12
14
I want to add values of c as a suffix in order:
df3 =
12_a_1 12_b_1 14_a_2 14_b_2
1 2 3 4
5 6 7 8
one option is list comprehension:
import itertools
# use itertools to repeat values of df2
prefix = list(itertools.chain.from_iterable(itertools.repeat(str(x), 2) for x in df2['c'].values))
# list comprehension to create new column names
df1.columns = [p+'_'+c for c,p in zip(df1.columns, prefix)]
print(df1)
12_a_1 12_b_1 14_a_2 14_b_2
0 1 2 3 4
1 5 6 7 8
Use str.split and map
s = (df1.columns.str.split('_').str[-1].astype(int) - 1).map(df2.c)
df1.columns = s.astype(str) + '_' + df1.columns
print(df1)
Out[304]:
12_a_1 12_b_1 14_a_2 14_b_2
0 1 2 3 4
1 5 6 7 8