Pandas: concat dataframes keeping only rows with the same index - python

I have two dataframes and I want to concat them, but keep only the rows that have the same index.
My first dataframe looks like this:
a b c d e f
2018-01-05 1.702556 -0.885554 0.766257 -0.731700 -1.071232 1.806680
2018-01-06 -0.968689 -0.700311 1.024988 -0.705764 0.804285 -0.337177
2018-01-07 1.249893 -0.613356 1.975736 -0.093838 0.428004 0.634204
2018-01-08 0.430000 0.502100 0.194092 0.588685 -0.507332 1.404635
2018-01-09 1.005721 0.604771 -2.296667 0.157201 1.583537 1.359332
and I want to concat it with this dataframe:
g h
2018-01-05 13.702556 -3.885554
2018-01-06 -3.968689 -3.700311
2018-01-07 13.249893 -3.613356
2018-01-22 3.430000 3.502100
2018-01-23 13.005721 3.604771
I would like to concat just the first three lines, which have the same index, and drop the others.
My final dataframe should look like this:
a b c d e f g h
2018-01-05 1.702556 -0.885554 0.766257 -0.731700 -1.071232 1.806680 13.702556 -3.885554
2018-01-06 -0.968689 -0.700311 1.024988 -0.705764 0.804285 -0.337177 -3.968689 -3.700311
2018-01-07 1.249893 -0.613356 1.975736 -0.093838 0.428004 0.634204 13.249893 -3.613356

Try join='inner', which keeps only the index labels present in both frames (the old .ix indexer has been removed from pandas):
>>> pd.concat([df1, df2], axis=1, join='inner')
                   a         b  ...          g         h
2018-01-05  1.702556 -0.885554  ...  13.702556 -3.885554
2018-01-06 -0.968689 -0.700311  ...  -3.968689 -3.700311
2018-01-07  1.249893 -0.613356  ...  13.249893 -3.613356
[3 rows x 8 columns]
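A minimal runnable sketch of that behavior: pd.concat with join='inner' keeps only the rows whose index appears in both frames (the frames below are small illustrative stand-ins for the ones in the question):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2018-01-05", "2018-01-06", "2018-01-07"]),
)
df2 = pd.DataFrame(
    {"g": [10.0, 20.0, 99.0]},
    index=pd.to_datetime(["2018-01-05", "2018-01-06", "2018-01-22"]),
)

# join='inner' keeps only index labels present in both frames,
# so 2018-01-07 and 2018-01-22 are dropped
result = pd.concat([df1, df2], axis=1, join="inner")
print(result)
```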

Related

Create list of several datasets

I have several datasets like df_1,df_2,...df_100.
First I want to create a list of these datasets.
df=[df_1,df_2,...,df_100]
This is what I tried, but it did not work:
df = []
for i in range(1, 101):
    df.append("df_" + str(i))
I need the above so that I can do the following:
final=pandas.concat(df,ignore_index=True)
This gives me an error since df is a list of strings, not datasets. I want to create a list of several datasets.
In R, I did the following
final=do.call(rbind,mget(paste0("df_",1:100)))
Is there anything similar in python?
Use the built-in functions globals() or locals() to look up a variable by name:
>>> [globals()[d] for d in df]
Example:
>>> df_1
A B C
9l6rvsotz5 0.209350 -1.360556 0.059560
jTonmSOIVv 1.046584 0.251718 0.567056
eGaK0n8y9N -0.347716 -0.292623 0.591843
>>> df_2
A B C
TIVsJWSDWe -0.169969 0.345766 0.674683
EJjXuhL3pi -0.527015 -1.089954 -1.658116
dm3IYAyC7z 1.653666 -0.203685 -1.441150
>>> df_3
A B C
DbmE1sc3MI 0.215871 -0.382257 0.662477
9qZd6bvPVy 0.150985 0.135556 0.308615
qiVrxD64IF -1.384027 0.765303 -0.734394
>>> df = ["df_{}".format(i) for i in range(1, 4)]
>>> df
['df_1', 'df_2', 'df_3']
>>> pd.concat([globals()[d] for d in df], ignore_index=True)
A B C
0 0.209350 -1.360556 0.059560
1 1.046584 0.251718 0.567056
2 -0.347716 -0.292623 0.591843
3 -0.169969 0.345766 0.674683
4 -0.527015 -1.089954 -1.658116
5 1.653666 -0.203685 -1.441150
6 0.215871 -0.382257 0.662477
7 0.150985 0.135556 0.308615
8 -1.384027 0.765303 -0.734394
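Looking variables up through globals() works, but it is usually cleaner to build the frames into a container in the first place, so no name lookup is needed. A sketch of that pattern (the data here is illustrative):

```python
import numpy as np
import pandas as pd

# build the frames directly into a dict keyed by name,
# instead of creating df_1, df_2, ... as separate variables
frames = {
    f"df_{i}": pd.DataFrame(np.random.randn(3, 3), columns=list("ABC"))
    for i in range(1, 4)
}

# concatenate all of them, discarding the original row labels
final = pd.concat(list(frames.values()), ignore_index=True)
print(final.shape)
```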

groupby and sum two columns and set as one column in pandas

I have the following data frame:
import pandas as pd
data = pd.DataFrame()
data['Home'] = ['A','B','C','D','E','F']
data['HomePoint'] = [3,0,1,1,3,3]
data['Away'] = ['B','C','A','E','D','D']
data['AwayPoint'] = [0,3,1,1,0,0]
I want to group the columns ['Home', 'Away'] together under a single name, Team, and sum HomePoint and AwayPoint into a column named Points.
Team Points
A 4
B 0
C 4
D 1
E 4
F 3
How can I do it?
I tried a different approach using the following post:
Link
But I was not able to get the format that I wanted.
Greatly appreciate your advice.
Thanks
Zep.
A simple way is to create two new Series indexed by the teams, aggregating each side first (a team can appear more than once on the same side, as D does in Away, and duplicate labels would otherwise produce duplicate rows when aligning):
home = pd.Series(data.HomePoint.values, data.Home).groupby(level=0).sum()
away = pd.Series(data.AwayPoint.values, data.Away).groupby(level=0).sum()
Then, the result you want is:
home.add(away, fill_value=0).astype(int)
Note that home + away does not work, because team F never played away, so would result in NaN for them. So we use Series.add() with fill_value=0.
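A runnable sketch of this approach end to end, using the question's data (each side is aggregated first because D appears twice in Away):

```python
import pandas as pd

data = pd.DataFrame({
    "Home": ["A", "B", "C", "D", "E", "F"],
    "HomePoint": [3, 0, 1, 1, 3, 3],
    "Away": ["B", "C", "A", "E", "D", "D"],
    "AwayPoint": [0, 3, 1, 1, 0, 0],
})

# points scored at home and away, one entry per team on each side
home = pd.Series(data.HomePoint.values, data.Home).groupby(level=0).sum()
away = pd.Series(data.AwayPoint.values, data.Away).groupby(level=0).sum()

# fill_value=0 covers teams missing from one side (F never plays away)
points = home.add(away, fill_value=0).astype(int)
print(points)
```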
A complicated way is to use DataFrame.melt():
goo = data.melt(['HomePoint', 'AwayPoint'], var_name='At', value_name='Team')
goo.HomePoint.where(goo.At == 'Home', goo.AwayPoint).groupby(goo.Team).sum()
Or from the other perspective:
ooze = data.melt(['Home', 'Away'])
ooze.value.groupby(ooze.Home.where(ooze.variable == 'HomePoint', ooze.Away)).sum()
You can concatenate, pairwise, columns of your input dataframe. Then use groupby.sum.
# number of (team, point) column pairs
n = len(data.columns) // 2
# create list of pairwise dataframes
df_lst = [data.iloc[:, 2*i:2*(i+1)].set_axis(['Team', 'Points'], axis=1)
          for i in range(n)]
# concatenate list of dataframes
df = pd.concat(df_lst, axis=0)
# perform groupby
res = df.groupby('Team', as_index=False)['Points'].sum()
print(res)
Team Points
0 A 4
1 B 0
2 C 4
3 D 1
4 E 4
5 F 3

How to get the highest values from many columns and show in what rows it happened using pandas?

I have a dataframe from which I want to know the highest value for each column. But I also want to know in what row it happened.
With my code I have to put the name of each column each time. Is there a better way to get all highest values from all columns?
df2.loc[df2['ALL'].idxmax()]
(Screenshots of the dataframe, the current output, and the desired output are omitted here.)
You can stack your frame and then sort the values from largest to smallest and then take the first occurrence of your column names.
First I will create some fake data
df = pd.DataFrame(np.random.rand(10,5), columns=list('abcde'),
index=list('nopqrstuvw'))
df.columns.name = 'level_0'
df.index.name = 'level_1'
Output
level_0 a b c d e
level_1
n 0.417317 0.821350 0.443729 0.167315 0.281859
o 0.166944 0.223317 0.418765 0.226544 0.508055
p 0.881260 0.789210 0.289563 0.369656 0.610923
q 0.893197 0.494227 0.677377 0.065087 0.228854
r 0.394382 0.573298 0.875070 0.505148 0.334238
s 0.046179 0.039642 0.930811 0.326114 0.880804
t 0.143488 0.561449 0.832186 0.486752 0.323215
u 0.891823 0.616401 0.247078 0.497050 0.995108
v 0.888553 0.386260 0.816100 0.874761 0.769073
w 0.557239 0.601758 0.932839 0.274614 0.854063
Now stack, sort and drop all but the first column occurrence
df.stack()\
.sort_values(ascending=False)\
.reset_index()\
.drop_duplicates('level_0')\
.sort_values('level_0')[['level_0', 0, 'level_1']]
level_0 0 level_1
3 a 0.893197 q
12 b 0.821350 n
1 c 0.932839 w
9 d 0.874761 v
0 e 0.995108 u
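As a simpler alternative sketch, max() and idxmax() give the same two pieces of information directly, one row per column, without stacking:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 5)), columns=list("abcde"),
                  index=list("nopqrstuvw"))

# one row per column: the maximum value and the row label where it occurs
summary = pd.DataFrame({"max": df.max(), "row": df.idxmax()})
print(summary)
```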

Summing 3 columns in a dataframe

This should be easy:
I have a data frame with the following columns
a,b,min,w,w_min
all I want to do is sum the columns min, w, and w_min and read the result into another data frame.
I've looked, but I cannot find a previously asked question that relates directly to this. Everything I've found seems much more complex than what I'm trying to do.
You can just pass a list of cols and select these to perform the summation on:
In [64]:
df = pd.DataFrame(columns=['a','b','min','w','w_min'], data = np.random.randn(10,5) )
df
Out[64]:
a b min w w_min
0 0.626671 0.850726 0.539850 -0.669130 -1.227742
1 0.856717 2.108739 -0.079023 -1.107422 -1.417046
2 -1.116149 -0.013082 0.871393 -1.681556 -0.170569
3 -0.944121 -2.394906 -0.454649 0.632995 1.661580
4 0.590963 0.751912 0.395514 0.580653 0.573801
5 -1.661095 -0.592036 -1.278102 -0.723079 0.051083
6 0.300866 -0.060604 0.606705 1.412149 0.916915
7 -1.640530 -0.398978 0.133140 -0.628777 -0.464620
8 0.734518 1.230869 -1.177326 -0.544876 0.244702
9 -1.300137 1.328613 -1.301202 0.951401 -0.693154
In [65]:
cols=['min','w','w_min']
df[cols].sum()
Out[65]:
min -1.743700
w -1.777642
w_min -0.525050
dtype: float64
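Since the question asks to read the result into another data frame, a minimal sketch of the full round trip (column names as in the question, data randomly generated):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 5),
                  columns=['a', 'b', 'min', 'w', 'w_min'])

cols = ['min', 'w', 'w_min']
# column-wise totals as a one-row DataFrame rather than a Series
totals = df[cols].sum().to_frame().T
print(totals)
```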

DataFrame Subset

I have a dataframe already and am subsetting some of it to another dataframe.
I do that like this:
D = njm[['svntygene', 'intgr', 'lowgr', 'higr', 'lumA', 'lumB', 'wndres', 'nlbrst', 'Erneg', 'basallike']]
I want to try and set it by the integer position though, something like this:
D = njm.iloc[1:, 2:, 3:, 7:]
But I get an error. How would I do this? I read the docs but could not find a clear answer.
Also, is it possible to pass a list to this as values too?
Thanks.
This is covered in the iloc section of the documentation: you can pass a list with the desired indices.
>>> df = pd.DataFrame(np.random.random((5,5)),columns=list("ABCDE"))
>>> df
A B C D E
0 0.605594 0.229728 0.390391 0.754185 0.516801
1 0.384228 0.106261 0.457507 0.833473 0.786098
2 0.364943 0.664588 0.330835 0.846941 0.229110
3 0.025799 0.681206 0.235821 0.418825 0.878566
4 0.811800 0.761962 0.883281 0.932983 0.665609
>>> df.iloc[:,[1,2,4]]
B C E
0 0.229728 0.390391 0.516801
1 0.106261 0.457507 0.786098
2 0.664588 0.330835 0.229110
3 0.681206 0.235821 0.878566
4 0.761962 0.883281 0.665609
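On the `njm.iloc[1:, 2:, 3:, 7:]` attempt: iloc takes one row indexer and one column indexer, so several column slices have to be combined into a single list of positions first. numpy.r_ does that concatenation; a sketch with illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40).reshape(5, 8), columns=list("ABCDEFGH"))

# np.r_ joins the slices 1:3 and 4:7 into one array: [1, 2, 4, 5, 6]
sub = df.iloc[:, np.r_[1:3, 4:7]]
print(sub.columns.tolist())  # ['B', 'C', 'E', 'F', 'G']
```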
