I have two .csv files, "train_id.csv" and "train_ub.csv", that I want to load as pandas dataframes. Their dimensions are different, but they have one column in common, let's say:
train_id:
ID id_01 id_02 id_03 id_04
1 0.0 1.0 5.2 7.1
2 0.5 7.7 3.4 9.8
5 1.5 0.8 1.6 2.5
7 3.0 0.2 3.4 6.3
8 5.5 1.8 7.5 7.0
9 7.2 2.6 9.1 1.1
11 9.5 3.5 2.2 0.3
while train_ub:
ID ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 9.2 8.3
2 1.5 2.7 0.4 4.9
3 2.7 4.8 7.6 3.7
4 4.8 9.2 2.4 5.4
6 6.0 5.8 5.5 0.6
10 9.1 3.6 4.1 2.0
11 7.3 7.5 0.2 9.5
One can see that they share the first column, but each dataframe is missing some of the IDs. Is there a way in pandas to merge them column-wise in order to get a dataframe of the form:
ID id_01 id_02 id_03 id_04 ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 5.2 7.1 0.0 1.0 9.2 8.3
2 0.5 7.7 3.4 9.8 1.5 2.7 0.4 4.9
3 NaN NaN NaN NaN 2.7 4.8 7.6 3.7
4 NaN NaN NaN NaN 4.8 9.2 2.4 5.4
5 1.5 0.8 1.6 2.5 NaN NaN NaN NaN
6 NaN NaN NaN NaN 6.0 5.8 5.5 0.6
7 3.0 0.2 3.4 6.3 NaN NaN NaN NaN
8 5.5 1.8 7.5 7.0 NaN NaN NaN NaN
9 7.2 2.6 9.1 1.1 NaN NaN NaN NaN
10 NaN NaN NaN NaN 9.1 3.6 4.1 2.0
11 9.5 3.5 2.2 0.3 9.5 3.5 2.2 0.3
PS: Notice that this is an oversimplified example; the real datasets have shapes id (144233, 41) and ub (590540, 394).
You could accomplish this using an outer join. Here is the code for it:
import pandas as pd

train_id = pd.read_csv("train_id.csv")
train_ub = pd.read_csv("train_ub.csv")
train_merged = train_id.merge(train_ub, on=["ID"], how="outer")
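If you want to sanity-check the behaviour before running it on the full files, here is a minimal sketch using two small made-up frames (not the real data): the outer join keeps every ID from either side and fills the missing side with NaN.
import pandas as pd

# toy stand-ins for train_id / train_ub, just to illustrate the outer join
train_id = pd.DataFrame({"ID": [1, 2, 5], "id_01": [0.0, 0.5, 1.5]})
train_ub = pd.DataFrame({"ID": [1, 2, 3], "ub_01": [0.0, 1.5, 2.7]})

train_merged = train_id.merge(train_ub, on=["ID"], how="outer")
# IDs 1 and 2 get values from both sides; ID 3 has NaN in id_01 and ID 5 has NaN in ub_01
print(train_merged.sort_values("ID"))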
I want to use the nth value of each group as a column, without aggregating the rows, because I want to create a feature that combines a window function with an aggregation function at any point.
R:
library(tidyverse)
iris %>% arrange(Species, Sepal.Length) %>% group_by(Species) %>%
mutate(cs = cumsum(Sepal.Length), cs4th = cumsum(Sepal.Length)[4]) %>%
slice(c(1:4))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species cs cs4th
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
1 4.3 3 1.1 0.1 setosa 4.3 17.5
2 4.4 2.9 1.4 0.2 setosa 8.7 17.5
3 4.4 3 1.3 0.2 setosa 13.1 17.5
4 4.4 3.2 1.3 0.2 setosa 17.5 17.5
5 4.9 2.4 3.3 1 versicolor 4.9 20
6 5 2 3.5 1 versicolor 9.9 20
7 5 2.3 3.3 1 versicolor 14.9 20
8 5.1 2.5 3 1.1 versicolor 20 20
9 4.9 2.5 4.5 1.7 virginica 4.9 22
10 5.6 2.8 4.9 2 virginica 10.5 22
11 5.7 2.5 5 2 virginica 16.2 22
12 5.8 2.7 5.1 1.9 virginica 22 22
Python: Too long and verbose!
import numpy as np
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
iris.sort_values(['species','sepal_length']).assign(
index_species=lambda x: x.groupby('species').cumcount(),
cs=lambda x: x.groupby('species').sepal_length.cumsum(),
tmp=lambda x: np.where(x.index_species==3, x.cs, 0),
cs4th=lambda x: x.groupby('species').tmp.transform(sum)
).iloc[list(range(0,4))+list(range(50,54))+list(range(100,104))]
sepal_length sepal_width petal_length ... cs tmp cs4th
13 4.3 3.0 1.1 ... 4.3 0.0 17.5
8 4.4 2.9 1.4 ... 8.7 0.0 17.5
38 4.4 3.0 1.3 ... 13.1 0.0 17.5
42 4.4 3.2 1.3 ... 17.5 17.5 17.5
57 4.9 2.4 3.3 ... 4.9 0.0 20.0
60 5.0 2.0 3.5 ... 9.9 0.0 20.0
93 5.0 2.3 3.3 ... 14.9 0.0 20.0
98 5.1 2.5 3.0 ... 20.0 20.0 20.0
106 4.9 2.5 4.5 ... 4.9 0.0 22.0
121 5.6 2.8 4.9 ... 10.5 0.0 22.0
113 5.7 2.5 5.0 ... 16.2 0.0 22.0
101 5.8 2.7 5.1 ... 22.0 22.0 22.0
Python: my better solution (still not smart; there is room for improvement in how groupby is used)
iris.sort_values(['species', 'sepal_length']).assign(
    cs=lambda x: x.groupby('species').sepal_length.transform('cumsum'),
    cs4th=lambda x: x.merge(
        x.groupby('species', as_index=False).nth(3).loc[:, ['species', 'cs']],
        on='species'
    ).iloc[:, -1]
)
This doesn't work, because transform expects a function or a method name such as 'nth', not an expression like 'nth(3)':
iris.groupby('species').transform('nth(3)')
Here is an updated solution, using Pandas, which is still longer than what you will get with dplyr:
import seaborn as sns
import pandas as pd
iris = sns.load_dataset('iris')
iris['cs'] = (iris
              .sort_values(['species', 'sepal_length'])
              .groupby('species')['sepal_length']
              .transform('cumsum'))

M = (iris
     .sort_values(['species', 'cs'])
     .groupby('species')['cs'])
groupby has an nth function that gets you a row per group: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.nth.html
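For a quick feel of what nth returns, here is a tiny sketch on a toy frame (not the iris data). Note that older pandas versions return a result indexed by the group key, which is what the merge below relies on, while recent versions treat nth as a filtration and keep the original row index.
import pandas as pd

toy = pd.DataFrame({'g': ['a', 'a', 'b', 'b'], 'v': [10, 20, 30, 40]})
# nth(1) picks the second row of each group: 20 for 'a' and 40 for 'b'
print(toy.groupby('g')['v'].nth(1))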
iris = (iris
        .sort_values(['species', 'cs'])
        .reset_index(drop=True)
        .merge(M.nth(3), how='left', on='species')
        .rename(columns={'cs_x': 'cs',
                         'cs_y': 'cs4th'})
        )
iris.head()
sepal_length sepal_width petal_length petal_width species cs cs4th
0 4.3 3.0 1.1 0.1 setosa 4.3 17.5
1 4.4 2.9 1.4 0.2 setosa 8.7 17.5
2 4.4 3.0 1.3 0.2 setosa 13.1 17.5
3 4.4 3.2 1.3 0.2 setosa 17.5 17.5
4 4.5 2.3 1.3 0.3 setosa 22.0 17.5
Update: 16/04/2021 ... Below is a better way to achieve the OP's goal:
(iris
.sort_values(['species', 'sepal_length'])
.assign(cs = lambda df: df.groupby('species')
.sepal_length
.transform('cumsum'),
cs4th = lambda df: df.groupby('species')
.cs
.transform('nth', 3)
)
.groupby('species')
.head(4)
)
sepal_length sepal_width petal_length petal_width species cs cs4th
13 4.3 3.0 1.1 0.1 setosa 4.3 17.5
8 4.4 2.9 1.4 0.2 setosa 8.7 17.5
38 4.4 3.0 1.3 0.2 setosa 13.1 17.5
42 4.4 3.2 1.3 0.2 setosa 17.5 17.5
57 4.9 2.4 3.3 1.0 versicolor 4.9 20.0
60 5.0 2.0 3.5 1.0 versicolor 9.9 20.0
93 5.0 2.3 3.3 1.0 versicolor 14.9 20.0
98 5.1 2.5 3.0 1.1 versicolor 20.0 20.0
106 4.9 2.5 4.5 1.7 virginica 4.9 22.0
121 5.6 2.8 4.9 2.0 virginica 10.5 22.0
113 5.7 2.5 5.0 2.0 virginica 16.2 22.0
101 5.8 2.7 5.1 1.9 virginica 22.0 22.0
Now you can do it in a non-verbose way, just as you did in R, with datar in Python:
>>> from datar.datasets import iris
>>> from datar.all import f, arrange, group_by, mutate, cumsum, slice
>>>
>>> (iris >>
... arrange(f.Species, f.Sepal_Length) >>
... group_by(f.Species) >>
... mutate(cs=cumsum(f.Sepal_Length), cs4th=cumsum(f.Sepal_Length)[3]) >>
... slice(f[1:4]))
Sepal_Length Sepal_Width Petal_Length Petal_Width Species cs cs4th
0 4.3 3.0 1.1 0.1 setosa 4.3 17.5
1 4.4 2.9 1.4 0.2 setosa 8.7 17.5
2 4.4 3.0 1.3 0.2 setosa 13.1 17.5
3 4.4 3.2 1.3 0.2 setosa 17.5 17.5
4 4.9 2.4 3.3 1.0 versicolor 4.9 20.0
5 5.0 2.0 3.5 1.0 versicolor 9.9 20.0
6 5.0 2.3 3.3 1.0 versicolor 14.9 20.0
7 5.1 2.5 3.0 1.1 versicolor 20.0 20.0
8 4.9 2.5 4.5 1.7 virginica 4.9 22.0
9 5.6 2.8 4.9 2.0 virginica 10.5 22.0
10 5.7 2.5 5.0 2.0 virginica 16.2 22.0
11 5.8 2.7 5.1 1.9 virginica 22.0 22.0
[Groups: ['Species'] (n=3)]
I am the author of the package. Feel free to submit issues if you have any questions.
I have a data set in which the number of columns is a multiple of 3 (excluding the index column [0]).
I am new to Python.
Here there are 9 columns excluding the index. I want to append the 4th column to the 1st, the 5th to the 2nd, the 6th to the 3rd, then the 7th to the 1st, the 8th to the 2nd, the 9th to the 3rd, and so on for a larger data set. My larger data set will always have a number of columns that is a multiple of 3 (excluding the index column).
Also, I want the index values to repeat in the same order; in this case 6, 9, 4, 3 repeated 3 times.
import pandas as pd
import io
data =io.StringIO("""
6,5.6,4.6,8.2,2.5,9.4,7.6,9.3,4.1,1.9
9,2.3,7.8,1,4.8,6.7,8.4,45.2,8.9,1.5
4,4.8,9.1,0,7.1,5.6,3.6,63.7,7.6,4
3,9.4,10.6,7.5,1.5,4.3,14.3,36.1,6.3,0
""")
df = pd.read_csv(data,index_col=[0],header = None)
Expected Output:
df
6,5.6,4.6,8.2
9,2.3,7.8,1
4,4.8,9.1,0
3,9.4,10.6,7.5
6,2.5,9.4,7.6
9,4.8,6.7,8.4
4,7.1,5.6,3.6
3,1.5,4.3,14.3
6,9.3,4.1,1.9
9,45.2,8.9,1.5
4,63.7,7.6,4
3,36.1,6.3,0
The idea is to split the columns into a MultiIndex (chunk number, position within chunk), reshape with stack, and sort by the chunk level of the resulting row index; an ordered CategoricalIndex is created first so that sorting keeps the original row order 6, 9, 4, 3 instead of reordering it numerically:
import numpy as np

a = np.arange(len(df.columns))
df.index = pd.CategoricalIndex(df.index, ordered=True, categories=df.index.unique())
df.columns = [a // 3, a % 3]
df = df.stack(0).sort_index(level=1).reset_index(level=1, drop=True)
print(df)
0 1 2
0
6 5.6 4.6 8.2
9 2.3 7.8 1.0
4 4.8 9.1 0.0
3 9.4 10.6 7.5
6 2.5 9.4 7.6
9 4.8 6.7 8.4
4 7.1 5.6 3.6
3 1.5 4.3 14.3
6 9.3 4.1 1.9
9 45.2 8.9 1.5
4 63.7 7.6 4.0
3 36.1 6.3 0.0
Split the data frame horizontally and concatenate the components vertically:
df.columns = [1, 2, 3] * (len(df.columns) // 3)
rslt = pd.concat([df.iloc[:, i:i+3] for i in range(0, len(df.columns), 3)])
1 2 3
0
6 5.6 4.6 8.2
9 2.3 7.8 1.0
4 4.8 9.1 0.0
3 9.4 10.6 7.5
6 2.5 9.4 7.6
9 4.8 6.7 8.4
4 7.1 5.6 3.6
3 1.5 4.3 14.3
6 9.3 4.1 1.9
9 45.2 8.9 1.5
4 63.7 7.6 4.0
3 36.1 6.3 0.0
I have two columns whose data overlap for some entries (and are almost identical when they do).
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'x': [2.1, 3.1, 5.4, 1.9, np.nan, 4.3, np.nan, np.nan, np.nan],
     'y': [np.nan, np.nan, 5.3, 1.9, 3.2, 4.2, 9.1, 7.8, 4.1]
     }
)
I want the result to be a column 'xy' which contains the average of x and y when they both have values and x or y when only one of them has a value like this:
df['xy']=[2.1,3.1,5.35,1.9,3.2,4.25,9.1,7.8,4.1]
Here you go:
Solution
df['xy'] = df[['x','y']].mean(axis=1)
Output
print(df.to_string())
x y xy
0 2.1 NaN 2.10
1 3.1 NaN 3.10
2 5.4 5.3 5.35
3 1.9 1.9 1.90
4 NaN 3.2 3.20
5 4.3 4.2 4.25
6 NaN 9.1 9.10
7 NaN 7.8 7.80
8 NaN 4.1 4.10
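As a side note, this works because mean skips NaN by default (skipna=True), so rows where only one of x or y is present keep that value; a quick check on a tiny made-up frame:
import numpy as np
import pandas as pd

chk = pd.DataFrame({'x': [2.1, np.nan, 5.4], 'y': [np.nan, 3.2, 5.3]})
print(chk[['x', 'y']].mean(axis=1))  # 2.10, 3.20, 5.35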
I have appended two dataframes that have the same column names. Is there an easy way to get another column with the mean of these two appended dataframes?
Maybe the code explains it better.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3,4],'b':[10,20,30,40]})
df2 = pd.DataFrame({'a':[1.2,2.2,3.2,4.2],'b':[10.2,20.2,30.2,40.2]})
df = df1.append(df2)
print(df)
df['a_mean'] = ???
a b
0 1.0 10.0
1 2.0 20.0
2 3.0 30.0
3 4.0 40.0
0 1.2 10.2
1 2.2 20.2
2 3.2 30.2
3 4.2 40.2
How do I create a new column a_mean with the values
[1.1, 2.1, 3.1, 4.1, 1.1, 2.1, 3.1, 4.1] efficiently?
Using melt():
df = df.assign(a_mean=df1.add(df2).div(2).melt().value)
Or, taking only df, you can do:
df = df.assign(a_mean=df.groupby(df.index)['a'].mean())
a b a_mean
0 1.0 10.0 1.1
1 2.0 20.0 2.1
2 3.0 30.0 3.1
3 4.0 40.0 4.1
0 1.2 10.2 1.1
1 2.2 20.2 2.1
2 3.2 30.2 3.1
3 4.2 40.2 4.1
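For what it's worth, the second one-liner works because df1.append(df2) keeps the original row labels 0-3 twice, so grouping by df.index averages the matching rows of df1 and df2, and the 4-value result is then aligned back onto both copies by index. A small sketch of the same idea, assuming the df built in the question:
# one value per label 0..3: the row-wise mean of df1.a and df2.a
per_label_mean = df.groupby(df.index)['a'].mean()
print(per_label_mean.round(1).tolist())   # [1.1, 2.1, 3.1, 4.1]

# assigning a Series aligns on the (duplicated) index, so both copies
# of label i receive the same mean
df['a_mean'] = per_label_mean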
Try this:
df['a_mean'] = np.tile( (df1.a.to_numpy() + df2.a.to_numpy())/2, 2)
As per the comments, there is already a great answer by Anky, but to extend this method you can do this:
df['a_mean2'] = np.tile( (df.iloc[0: len(df)//2].a.to_numpy() + df.iloc[len(df)//2:].a.to_numpy())/2, 2)
Update:
# reshape the 8 values into 2 rows of 4 (the df1 part and the df2 part),
# average the rows column-wise, then tile the result over both halves
df['a_mean3'] = np.tile(df.a.to_numpy().reshape(2, -1).mean(0), 2)
Output
print(df)
a b a_mean2 a_mean a_mean3
0 1.0 10.0 1.1 1.1 1.1
1 2.0 20.0 2.1 2.1 2.1
2 3.0 30.0 3.1 3.1 3.1
3 4.0 40.0 4.1 4.1 4.1
0 1.2 10.2 1.1 1.1 1.1
1 2.2 20.2 2.1 2.1 2.1
2 3.2 30.2 3.1 3.1 3.1
3 4.2 40.2 4.1 4.1 4.1
I have a dataframe with scores of three persons (John, Terry, Henry) from day 1 to day 7.
1 2 3 4 5 6 7
John 1.3 2.8 3.0 4.4 2.6 3.1 4.8
Terry 1.1 2.3 4.1 5.5 3.7 2.1 3.8
Henry 0.3 1.0 2.0 3.0 2.7 1.1 2.8
How do I set a score ceiling such that once a score goes above 2.5, all scores from that day onwards are fixed at that value, no matter what the later scores are?
The output should be:
1 2 3 4 5 6 7
John 1.3 2.8 2.8 2.8 2.8 2.8 2.8
Terry 1.1 2.3 4.1 4.1 4.1 4.1 4.1
Henry 0.3 1.0 2.0 3.0 3.0 3.0 3.0
I tried the following, but it didn't work. I first build a boolean mask of the numbers > 2.5, then apply a mask based on the cumulative sum:
df = df.mask((df > 2.5).cumsum(axis=1) > 0, df)
You can find the first value above the threshold in each row by using where with bfill and selecting the first column with iloc:
m = (df > 2.5).cumsum(axis=1) > 0
s = df.where(m).bfill(axis=1).iloc[:, 0]
print (s)
John 2.8
Terry 4.1
Henry 3.0
Name: 1, dtype: float64
df = df.mask(m, s, axis=0)
Or shift the mask and forward fill the NaNs with the last kept values:
m = (df > 2.5).cumsum(axis=1) > 0
df = df.mask(m.shift(axis=1).fillna(False)).ffill(axis=1)
print (df)
1 2 3 4 5 6 7
John 1.3 2.8 2.8 2.8 2.8 2.8 2.8
Terry 1.1 2.3 4.1 4.1 4.1 4.1 4.1
Henry 0.3 1.0 2.0 3.0 3.0 3.0 3.0
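Another way to express the same idea, assuming the df from the question: keep values only up to and including the first one above 2.5, then forward fill the fixed value across the remaining days.
# True up to and including the first value > 2.5 in each row
m = (df > 2.5).cumsum(axis=1) <= 1
df_fixed = df.where(m).ffill(axis=1)
print(df_fixed)   # same result as above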