I have a dataframe in which some values are split in different columns:
ch1a ch1b ch1c ch2
0 0 4 10
0 0 5 9
0 6 0 8
0 7 0 7
8 0 0 6
9 0 0 5
I want to sum those columns and keep the normal ones (like ch2).
The desired result should be something like:
ch1a ch2
4 10
5 9
6 8
7 7
8 6
9 5
I took a look at both pandas functions, merge and join, but I could not find the right one for my case.
This was my first try:
df = pd.DataFrame({'ch1a': [0, 0, 0, 0, 8, 9],'ch1b': [0, 0, 6, 7, 0, 0],'ch1c': [4, 5, 0, 0, 0, 0],'ch2': [10, 9, 8, 7, 6, 5]})
df['ch1a'] = df.sum(axis=1)
del df['ch1b']
del df['ch1c']
However, the result is not what I want:
ch1a ch2
0 14 10
1 14 9
2 14 8
3 14 7
4 14 6
5 14 5
I have two questions:
How can I get my desired result?
Is there a way to merge some columns by summing their values and not have to delete the remaining columns afterwards?
This would get you the desired result:
cols_to_sum = ['ch1a', 'ch1b', 'ch1c']
df['ch1'] = df.loc[:, cols_to_sum].sum(axis=1)
df = df.drop(cols_to_sum, axis=1)
Your problem was that you were summing over all columns. Here we restrict it to the relevant ones.
I don't know how to avoid the drop though.
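If you want the summing and the dropping as one chained expression, here is a sketch (the new column name ch1 is just illustrative):
cols_to_sum = ['ch1a', 'ch1b', 'ch1c']
# build the summed column and remove the source columns in one go
out = df.assign(ch1=df[cols_to_sum].sum(axis=1)).drop(columns=cols_to_sum)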
You can do a horizontal (column-wise) groupby using axis=1:
>>> df.groupby(df.columns.str[:3], axis=1).sum()
ch1 ch2
0 4 10
1 5 9
2 6 8
3 7 7
4 8 6
5 9 5
Here I used the first three letters of the columns to determine the destination groups, but you can use functions or dictionaries or lists instead:
>>> df.groupby(lambda x: x[:3], axis=1).sum()
ch1 ch2
0 4 10
1 5 9
2 6 8
3 7 7
4 8 6
5 9 5
>>> df.groupby(['a','b','b','c'], axis=1).sum()
a b c
0 0 4 10
1 0 5 9
2 0 6 8
3 0 7 7
4 8 0 6
5 9 0 5
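Note that axis=1 in groupby is deprecated in recent pandas releases. If that affects you, an equivalent sketch transposes, groups on the index, and transposes back:
# same result for a numeric frame like this one
df.T.groupby(df.columns.str[:3]).sum().T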
I have a small subset of data here:
import pandas as pd
days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df1 = pd.DataFrame(days)
df2 = pd.Series(time)
df2 = df2.transpose()
df3 = df1*df2
df1 is a column of data and df2 is a row of data. I need a 3x9 dataframe where the row is multiplied by each value in the column to make one large dataframe.
The end result should look like:
df3 = [2 4 2 4 2 4 2 4 2
4 8 4 8 4 8 4 8 4
6 12 6 12 6 12 6 12 6 ]
The way I currently have it for my larger dataset, only a few datapoints are correctly multiplied and most are NaNs.
The dot product is one solution to this problem:
import pandas as pd
days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df1 = pd.DataFrame(days)
df2 = pd.DataFrame(time)
# use dot
df3 = df1.dot(df2.T)
df3
Output
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
Try this:
df1.dot(df2.to_frame().T)
Output:
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
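As an alternative sketch, NumPy's outer product builds the same 3x9 table directly, which can then be wrapped in a DataFrame:
import numpy as np
import pandas as pd
days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
# outer product: each value in days multiplied by the whole time row
df3 = pd.DataFrame(np.outer(days, time))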
I would like to replace values in a column, but only the values seen after a specific value.
For example, I have the following dataset:
In [108]: df = pd.DataFrame({'ID': [12, 13, 14, 15, 16, 17], 'time': [4, 10, 5, 6, 1, 3], 'A': [1, 3, 5, 4, 9, 1], 'B': [2, 4, 1, 8, 3, 4], 'C': [4, 2, 6, 7, 1, 8]})
In [109]: df
Out[109]:
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 4 8 7
4 16 1 9 3 1
5 17 3 1 4 8
I want to replace, in column "A", all the values that come after a 5 with a 1; in column "B", all the values that come after a 1 with a 6; and in column "C", all the values that come after a 7 with a 5, so it will look like this:
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 1 6 7
4 16 1 1 6 5
5 17 3 1 6 5
I know that I could use where to get sort of a similar effect, e.g. with a condition like df["A"] = np.where(x != 5, 1, x), but obviously this would change the values before the 5 as well. I can't think of anything else at the moment.
Thanks for the help.
Use DataFrame.mask with the values shifted by DataFrame.shift and compared against a dictionary; DataFrame.cummax then carries the True values forward to all following rows:
df=pd.DataFrame([[12,13,14,15,16,17],[4,10,5,6,1,3],
[1, 3,5,4,9,1],[2, 4, 1,8,3,4], [4, 2, 6,7,1,8]],
index=['ID','time','A', 'B', 'C']).T
after = {'A':5, 'B':1, 'C': 7}
new = {'A':1, 'B':6, 'C': 5}
cols = list(after.keys())
s = pd.Series(new)
df[cols] = df[cols].mask(df[cols].shift().eq(after).cummax(), s, axis=1)
print (df)
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 1 6 7
4 16 1 1 6 5
5 17 3 1 6 5
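If the one-liner is hard to read, here is an equivalent sketch that loops over the columns explicitly, using the same after and new dictionaries defined above:
for col, trigger in after.items():
    # True for every row strictly after the first occurrence of the trigger value
    mask = df[col].shift().eq(trigger).cummax()
    df.loc[mask, col] = new[col]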
I have a df that looks like
L.1  L.2  G.1  G.2
1    5    9    13
2    6    10   14
3    7    11   15
4    8    12   16
This is just an arbitrary example, but the structure of my df is exactly the same: 4 column titles and then numbers under them. I would like to stack my columns so that it looks like
L  G
1  9
2  10
3  11
4  12
5  13
6  14
7  15
8  16
If someone could help me in solving this, it would be great as I am having a really hard time doing this.
Use wide_to_long (the reset_index() creates an index column to serve as the required identifier i), then remove the MultiIndex with DataFrame.reset_index with drop=True:
df = (pd.wide_to_long(df.reset_index(), stubnames=['L','G'], i='index', j='tmp', sep='.')
.reset_index(drop=True))
print (df)
L G
0 1 9
1 2 10
2 3 11
3 4 12
4 5 13
5 6 14
6 7 15
7 8 16
Or split the columns with str.split, reshape with DataFrame.stack, sort the MultiIndex with DataFrame.sort_index, and finally remove the MultiIndex:
df.columns = df.columns.str.split('.', expand=True)
df = df.stack().sort_index(level=[1,0]).reset_index(drop=True)
print (df)
G L
0 9 1
1 10 2
2 11 3
3 12 4
4 13 5
5 14 6
6 15 7
7 16 8
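The stacked columns come out in alphabetical order (G before L); if the original L, G order matters, reorder afterwards:
df = df[['L', 'G']]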
You can convert each column to a list, concatenate the lists, and create a new dataframe from the combined lists:
import pandas as pd
df = pd.DataFrame({'L.1': [1, 2, 3, 4], 'L.2': [5, 6, 7, 8], 'G.1':[9, 10, 11, 12], 'G.2': [13, 14, 15, 16]})
new_df = pd.DataFrame({'L':df['L.1'].tolist()+df['L.2'].tolist(),
'G':df['G.1'].tolist()+df['G.2'].tolist()})
Printing new_df will give you:
L G
0 1 9
1 2 10
2 3 11
3 4 12
4 5 13
5 6 14
6 7 15
7 8 16
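A similar sketch with pd.concat avoids building intermediate Python lists:
new_df = pd.DataFrame({'L': pd.concat([df['L.1'], df['L.2']], ignore_index=True),
                       'G': pd.concat([df['G.1'], df['G.2']], ignore_index=True)})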
The columns have a pattern: some start with L, others start with G. We can use pivot_longer from pyjanitor (import janitor registers it as a DataFrame method) to abstract the process; simply pass a list of new column names and a matching list of regular expressions:
df.pivot_longer(index = None,
names_to = ['L', 'G'],
names_pattern = ['^L', '^G'])
L G
0 1 9
1 2 10
2 3 11
3 4 12
4 5 13
5 6 14
6 7 15
7 8 16
With pivot_longer you can also use the .value approach, along with a regular expression that contains groups - the captured part is retained as the column header:
df.pivot_longer(index = None,
names_to = ".value",
names_pattern = r"(.).")
L G
0 1 9
1 2 10
2 3 11
3 4 12
4 5 13
5 6 14
6 7 15
7 8 16
I want to swap all the values of my data frame. The largest value must be replaced with the smallest value (i.e. 7 with 1, 6 with 2, 5 with 3, 4 with 4, 3 with 5, and so on).
import numpy as np
import pandas as pd
import io
data = '''
Values
6
1
3
7
5
2
4
1
4
7
2
5
'''
df = pd.read_csv(io.StringIO(data))
Trial
First I want to get all the unique values from my data.
df1=df.Values.unique()
print(df1)
[6 1 3 7 5 2 4]
I have sorted it in ascending order:
sorted1 = list(np.sort(df1))
print(sorted1)
[1, 2, 3, 4, 5, 6, 7]
Then I reverse-sorted the list:
rev_sorted = list(reversed(sorted1))
print(rev_sorted)
[7, 6, 5, 4, 3, 2, 1]
Now I need to replace the max value with the min value, and so on, in my main data set (df). The old values can be replaced or a new column can be added.
Expected Output:
Values,New_Values
6,2
1,7
3,5
7,1
5,3
2,6
4,4
1,7
4,4
7,1
2,6
5,3
Here's a vectorized one -
In [51]: m,n = np.unique(df['Values'], return_inverse=True)
In [52]: df['New_Values'] = m[n.max()-n]
In [53]: df
Out[53]:
Values New_Values
0 6 2
1 1 7
2 3 5
3 7 1
4 5 3
5 2 6
6 4 4
7 1 7
8 4 4
9 7 1
10 2 6
11 5 3
Translating to pandas with pandas.factorize (which returns the integer codes first and the sorted unique values second, so the roles of m and n are swapped relative to the NumPy version) -
m,n = pd.factorize(df.Values, sort=True)
df['New_Values'] = n[m.max()-m]
Use Series.map with a dictionary created by zipping the sorted and reverse-sorted lists:
df['New'] = df['Values'].map(dict(zip(sorted1,rev_sorted)))
print (df)
Values New
0 6 2
1 1 7
2 3 5
3 7 1
4 5 3
5 2 6
6 4 4
7 1 7
8 4 4
9 7 1
10 2 6
11 5 3
import pandas as pd
import numpy as np
data=[]
columns = ['A', 'B', 'C']
data = [[0, 10, 5], [0, 12, 5], [2, 34, 13], [2, 3, 13], [4, 5, 8], [2, 4, 8], [1, 2, 4], [1, 3, 4], [3, 8, 12],[4,10,12],[6,7,12]]
df = pd.DataFrame(data, columns=columns)
print(df)
# A B C
# 0 0 10 5
# 1 0 12 5
# 2 2 34 13
# 3 2 3 13
# 4 4 5 8
# 5 2 4 8
# 6 1 2 4
# 7 1 3 4
# 8 3 8 12
# 9 4 10 12
# 10 6 7 12
Now I want to create two data frames df_train and df_test such that no two numbers of column 'C' are in the same set. E.g. in column C the element 5 should be either in the training set or the testing set. So, the rows [0, 10, 5], [0, 12, 5], [2, 34, 13] will either go into the training set or the testing set, but not into both. This choosing of elements of column C should be done randomly.
I am stuck on this step and cannot proceed.
First sample your df, then groupby 'C' and take the cumcount to distinguish duplicated values within the same group:
s=df.sample(len(df)).groupby('C').cumcount()
s
Out[481]:
5 0
7 0
2 0
1 0
0 1
6 1
10 0
4 1
3 1
8 1
9 2
dtype: int64
test=df.loc[s[s==1].index]
train=df.loc[s[s==0].index]
test
Out[483]:
A B C
0 0 10 5
6 1 2 4
4 4 5 8
3 2 3 13
8 3 8 12
train
Out[484]:
A B C
5 2 4 8
7 1 3 4
2 2 34 13
1 0 12 5
10 6 7 12
The question is not so clear about what the expected output of the train and test set dataframes should look like.
Anyway, I will try to answer.
I think you can first sort the dataframe values:
df_sorted = df.sort_values(['C'], ascending=[True])
print(df_sorted)
Out[1]:
A B C
6 1 2 4
7 1 3 4
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
8 3 8 12
9 4 10 12
10 6 7 12
2 2 34 13
3 2 3 13
Then split the sorted dataframe:
unique_c = df_sorted['C'].unique().tolist()
print(unique_c)
Out[2]:
[4, 5, 8, 12, 13]
train_set = df[df['C'] <= unique_c[2]]
val_set = df[df['C'] > unique_c[2]]
print(train_set)
# Train set dataframe
Out[3]:
A B C
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
6 1 2 4
7 1 3 4
print(val_set)
# Test set dataframe
Out[4]:
A B C
2 2 34 13
3 2 3 13
8 3 8 12
9 4 10 12
10 6 7 12
From 11 samples, after the split 6 go to the train set and 5 go to the validation set, so no samples are missing when the two dataframes are combined.
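If the intent is that all rows sharing a C value go to the same set (the group-wise reading of the question), here is a sketch that assumes scikit-learn is available: GroupShuffleSplit performs a random group-wise split, using column C as the group key.
from sklearn.model_selection import GroupShuffleSplit
# one random split; groups=df['C'] keeps rows with the same C value together
gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df['C']))
df_train = df.iloc[train_idx]
df_test = df.iloc[test_idx]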