How to sum columns with duplicate names in pandas? - python

I have a dataframe with duplicate column names and I would like to sum these columns.
>>> df
    A  B  A  B
1  12  2  4  1
2  10  5  4  9
3   2  1  4  8
4   2  4  3  8
What I would like is something like this:
    A   B
1  16   3
2  14  14
3   6   9
4   5  12
I can select the duplicate columns in a loop, but I don't know how to remove the columns and recreate a new column with the summed values. I would like to know if there is a more elegant way.
col = list(df.columns)
dup = list(set(x for x in col if col.count(x) > 1))
for d in dup:
    summed = df[d].sum(axis=1)  # sums each set of duplicates, but leaves the original columns in place

Let us try:
sum_df = df.sum(level=0, axis=1)
(Note: the level argument of sum is deprecated in recent pandas versions; the groupby-based alternatives below are more future-proof.)

Try this:
df.groupby(lambda x: x, axis=1).sum()
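For reference, a minimal end-to-end sketch of the groupby approach (data taken from the question; note that on pandas 2.x, where axis=1 grouping is deprecated, transposing first gives the same result):

import pandas as pd

df = pd.DataFrame([[12, 2, 4, 1],
                   [10, 5, 4, 9],
                   [2, 1, 4, 8],
                   [2, 4, 3, 8]],
                  columns=['A', 'B', 'A', 'B'], index=[1, 2, 3, 4])

summed = df.groupby(level=0, axis=1).sum()  # older pandas
summed = df.T.groupby(level=0).sum().T      # pandas 2.x-friendly equivalent
print(summed)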

Related

pandas: ascending sort multiple columns but reverse sort one column

I have a pandas DataFrame that has a little over 100 columns.
There are about 50 columns that I want to sort ascending and then there is one column (a date_time column) that I want to reverse sort.
How do I go about achieving this? I know I can do something like...
df = df.sort_values(by = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time'], ascending=[True, True, True, True,... False])
... but I am trying to avoid having to type 'True' 50 times.
Just wondering if there is a shorthand way of doing this.
Thanks.
Dan
You can use:
cols = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time']
df.sort_values(by=cols, ascending=[True]*49+[False])
Or, a programmatic variant that doesn't require knowing the position of the False, using numpy:
import numpy as np

cols = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time']
df.sort_values(by=cols, ascending=np.array(cols) != 'date_time')
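A plain list comprehension achieves the same without the numpy dependency (a minimal sketch, reusing cols from above):

df.sort_values(by=cols, ascending=[c != 'date_time' for c in cols])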
It should go something like this:
to_be_reserved = "COLUMN_TO_BE_RESERVED"
df = df.sort_values(by=[col for col in df.columns if col != to_be_reserved], ignore_index=True)
# caution: this sorts the reserved column independently, so its values no longer line up with the rows sorted above
df[to_be_reserved] = df[to_be_reserved].sort_values(ascending=False, ignore_index=True)
You can also use filter if your 49 columns have a regular pattern:
# if you have a column name pattern
cols = df.filter(regex='^(column_|date_time)').columns.tolist()
ascending_false = ['date_time']
ascending = [c not in ascending_false for c in cols]
df.sort_values(by=cols, ascending=ascending)
Example:
>>> df
   column_0  column_1  date_time  value  other_value  another_value
0         4         2          6      6            1              1
1         4         4          0      6            0              2
2         3         2          6      9            0              7
3         9         2          1      7            4              7
4         6         9          2      4            4              1
>>> df.sort_values(by=cols, ascending=ascending)
   column_0  column_1  date_time  value  other_value  another_value
2         3         2          6      9            0              7
0         4         2          6      6            1              1
1         4         4          0      6            0              2
4         6         9          2      4            4              1
3         9         2          1      7            4              7

How to transpose pandas dataframe by only using the first x values according to id?

The initial dataframe looks as follows:
>>> df
id  param
1   4
1   15
1   3
2   2
2   7
4   8
4   6
4   11
How can I achieve the following scheme, putting only the first 2 values of each id into a new row? The resulting df should look as follows:
>>> df
col_a  col_b
4      15
2      7
8      6
I tried to achieve this using transpose and iloc but did not succeed.
The column names are just for clarification; it is sufficient if only a default index is displayed (e.g. 0, 1, 2, ...).
You can use a double groupby on 'id': first take the first two rows of each group, then collect the 'param' values into lists and expand them into new columns. Lastly, rename accordingly:
new = (df.groupby('id').head(2)
         .groupby('id', as_index=False)
         .agg({'param': list})
         .param.apply(pd.Series))
new.columns = ['col_a', 'col_b']
Prints:
col_a col_b
0 4 15
1 2 7
2 8 6
You can first take groupby with head(2) and then split the flat result into chunks of 2:
a = df.groupby("id")['param'].head(2).tolist()
out = pd.DataFrame([a[i:i + 2] for i in range(0, len(a), 2)], columns=['col_a', 'col_b'])
print(out)
col_a col_b
0 4 15
1 2 7
2 8 6
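For reference, a minimal self-contained version of the second approach, runnable as-is (data and column names taken from the question):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 4, 4, 4],
                   'param': [4, 15, 3, 2, 7, 8, 6, 11]})

# first two 'param' values per id, then chunk the flat list pairwise
a = df.groupby('id')['param'].head(2).tolist()
out = pd.DataFrame([a[i:i + 2] for i in range(0, len(a), 2)],
                   columns=['col_a', 'col_b'])
print(out)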

How to add items from a list to a dataframe column in Python Pandas?

I have a list containing numbers: x = (1,2,3,4,5,6,7,8)
I also have a DataFrame with 1000+ rows.
I need to assign the numbers in the list to a column (creating a new column), so that rows 1-8 contain the numbers 1-8, but after that it starts again: row 9 should contain number 1, and so on.
It seems really easy, but somehow I cannot manage to do this.
Here are two possible ways (the example uses 3 items to repeat):
with numpy.tile:
import numpy as np

df = pd.DataFrame({'col': range(10)})
x = (1, 2, 3)
df['newcol'] = np.tile(x, len(df)//len(x) + 1)[:len(df)]
with itertools:
from itertools import cycle, islice

df = pd.DataFrame({'col': range(10)})
x = (1, 2, 3)
df['newcol'] = list(islice(cycle(x), len(df)))
input:
col
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
output:
col newcol
0 0 1
1 1 2
2 2 3
3 3 1
4 4 2
5 5 3
6 6 1
7 7 2
8 8 3
9 9 1
from math import ceil
# repeat the whole sequence enough times to cover the frame, then truncate
df['new_column'] = (x * ceil(len(df) / len(x)))[:len(df)]
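A modular-indexing variant avoids building the oversized intermediate list (a minimal sketch; x can be any indexable sequence):

df['new_column'] = [x[i % len(x)] for i in range(len(df))]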

Converting rows of the same group to a single row with Dask Dataframes

I have a dask dataframe that looks like this:
group index col1 col2 col3
1 1 5 3 4
1 2 4 3 7
1 3 1 2 9
-----------------------
2 2 4 3 7
2 3 1 2 9
2 4 7 4 3
-----------------------
3 3 1 2 9
3 4 7 4 3
3 5 6 3 2
It's basically a rolling window where each group has its row and x more rows from the dataset. I need to change it to something like this:
group col1_1 col2_1 col3_1 col1_2 col2_2 col3_2 col1_3 col2_3 col3_3
1 5 3 4 4 3 7 1 2 9
2 4 3 7 1 2 9 7 4 3
3 1 2 9 7 4 3 6 3 2
So for each group I have a row that contains all the values in that group. The number of rows per group is constant across the dataset, but it isn't fixed in advance (it could be 10, say, but then it would be 10 for the whole dataset). In pandas I found a way to do it using code from this page: link.
indexCol = dff.index.name
dff.reset_index(inplace=True)
colNames = dff.columns
df = pd.pivot_table(dff, index=[indexCol],
                    columns=dff.groupby(indexCol).cumcount().add(1),
                    values=colNames, aggfunc='sum')
df.columns = df.columns.map('{0[0]}{0[1]}'.format)
The problem is that dask's pivot table does not work like pandas', and from what I have read it does not support a MultiIndex, so this code does not work with dask dataframes. I can't call compute() on the dask dataframe either, because the dataset is too big for my memory, so I have to keep it in dask.
Thank you very much for your help
Well, I figured it out in the end, so I'll post it here:
def series(x):
    # flatten one group into a single row: one key per (column, position) pair
    di = {}
    for y in x.columns:
        di.update({y + str(i + 1): t for i, t in enumerate(x[y])})
    return pd.Series(di)

# build the meta (output schema) that dask needs
dictMeta = {}
for y in colNames:
    dictMeta.update({y + str(i + 1): df[y].dtype for i in range(0, int(window))})
lista = [(k, dictMeta[k]) for k in dictMeta.keys()]  # meta can also be given as (name, dtype) pairs

# we create the 2d dataset for the model
df = dff.groupby(indexCol).apply(lambda x: series(x[colNames]), meta=dictMeta)
where colNames are the columns of the original dataset (col1, col2, and col3 in the question), indexCol is the name of the groupby column (group in the question), and window is the number of rows per group. Basically, we build a dictionary for each group and append it to the dataframe as a row. dictMeta supplies the meta, since errors sometimes happen without it.
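For reference, a minimal self-contained sketch of the same idea (assuming dask is installed; the toy frame is small enough that compute() is called only to display the result):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                    'col1':  [5, 4, 1, 4, 1, 7, 1, 7, 6],
                    'col2':  [3, 3, 2, 3, 2, 4, 2, 4, 3],
                    'col3':  [4, 7, 9, 7, 9, 3, 9, 3, 2]})
ddf = dd.from_pandas(pdf, npartitions=2)

cols, window = ['col1', 'col2', 'col3'], 3
meta = {f'{c}_{i + 1}': pdf[c].dtype for c in cols for i in range(window)}

def flatten(g):
    # one row per group: col1_1 .. col3_3
    return pd.Series({f'{c}_{i + 1}': v for c in cols for i, v in enumerate(g[c])})

out = ddf.groupby('group').apply(lambda g: flatten(g[cols]), meta=meta)
print(out.compute())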

pandas add a column with only one row

This sounds a bit weird, but it's exactly what I need right now:
I have several pandas dataframes that contain columns with float numbers, for example:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
Now I want to add a column with only one row, whose value equals the average of column 'a'; in this case, 3.0. So the new dataframe will look like this:
a b c average
0 0 1 2 3.0
1 3 4 5
2 6 7 8
And all the rows below are empty.
I've tried things like df['average'] = np.mean(df['a']) but that gives me a whole column of 3.0. Any help will be appreciated.
Assign a Series; this is cleaner:
df['average'] = pd.Series(df['a'].mean(), index=df.index[[0]])
Or, even better, assign with loc:
df.loc[df.index[0], 'average'] = df['a'].mean().item()
Filling the NaNs afterwards is straightforward:
df['average'] = df['average'].fillna('')
df
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
You can do something like:
df['average'] = [np.mean(df['a'])]+['']*(len(df)-1)
Here is a full example:
import pandas as pd
import numpy as np
df = pd.DataFrame([(0, 1, 2), (3, 4, 5), (6, 7, 8)],
                  columns=['a', 'b', 'c'])
print(df)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df['average'] = ''
df['average'][0] = df['a'].mean()  # chained assignment; df.loc[0, 'average'] = ... is the safer idiom
print(df)
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
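Putting the pieces together, a minimal self-contained sketch using the loc-based assignment (safer than chained indexing), blanking out the rows below:

import pandas as pd

df = pd.DataFrame([(0, 1, 2), (3, 4, 5), (6, 7, 8)], columns=['a', 'b', 'c'])
df.loc[df.index[0], 'average'] = df['a'].mean()  # only the first row gets a value
df['average'] = df['average'].fillna('')         # blank out the NaN rows below
print(df)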
