DataFrame Subset

DataFrame Subset - python

I have a dataframe already and am subsetting some of it to another dataframe.
I do that like this:
D = njm[['svntygene', 'intgr', 'lowgr', 'higr', 'lumA', 'lumB', 'wndres', 'nlbrst', 'Erneg', 'basallike']]
I want to try and set it by the integer position though, something like this:
D = njm.iloc[1:, 2:, 3:, 7:]
But I get an error. How would I do this part? Read the docs but could not find a clear answer.
Also, is it possible to pass a list to this as values too?
Thanks.

This is covered in the iloc section of the documentation: you can pass a list with the desired indices.
>>> df = pd.DataFrame(np.random.random((5,5)),columns=list("ABCDE"))
>>> df
A B C D E
0 0.605594 0.229728 0.390391 0.754185 0.516801
1 0.384228 0.106261 0.457507 0.833473 0.786098
2 0.364943 0.664588 0.330835 0.846941 0.229110
3 0.025799 0.681206 0.235821 0.418825 0.878566
4 0.811800 0.761962 0.883281 0.932983 0.665609
>>> df.iloc[:,[1,2,4]]
B C E
0 0.229728 0.390391 0.516801
1 0.106261 0.457507 0.786098
2 0.664588 0.330835 0.229110
3 0.681206 0.235821 0.878566
4 0.761962 0.883281 0.665609

Related

How to Convert a text data into DataFrame

How i can convert the below text data into a pandas DataFrame:
(-9.83334315,-5.92063135,-7.83228037,5.55314146), (-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976), (-22.25802006,-10.12843806,-2.9688831,-2.70574665), (-20.3418791,-9.4157625,-3.348587,-7.65474665)
I want to convert this to Data frame with 4 rows and 5 columns. For example, the first row contains the first element of each parenthesis.
Thanks for your contribution.

Try this:
import pandas as pd
with open("file.txt") as f:
file = f.read()
df = pd.DataFrame([{f"name{id}": val.replace("(", "").replace(")", "") for id, val in enumerate(row.split(",")) if val} for row in file.split()])

import re
import pandas as pd
with open('file.txt') as f:
data = [re.findall(r'([\-\d.]+)',data) for data in f.readlines()]
df = pd.DataFrame(data).T.astype(float)
Output:
0 1 2 3 4
0 -9.833343 -5.531373 -11.492390 -22.258020 -20.341879
1 -5.920631 -8.310108 -1.680536 -10.128438 -9.415762
2 -7.832280 -3.280625 -4.147730 -2.968883 -3.348587
3 5.553141 -6.860671 -3.541440 -2.705747 -7.654747

Your data is basically in tuple of tuples forms, hence you can easily use pass a list of tuples instead of a tuple of tuples and get a DataFrame out of it.
Your Sample Data:
text_data = ((-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665))
Result:
As you see it's default takes up to 6 decimal place while you have 7, hence you can use pd.options.display.float_format and set it accordingly.
pd.options.display.float_format = '{:,.8f}'.format
To get your desired data, you simply use transpose altogether to get the desired result.
pd.DataFrame(list(text_data)).T
0 1 2 3 4
0 -9.83334315 -5.53137301 -11.49239039 -22.25802006 -20.34187910
1 -5.92063135 -8.31010785 -1.68053601 -10.12843806 -9.41576250
2 -7.83228037 -3.28062536 -4.14773043 -2.96888310 -3.34858700
3 5.55314146 -6.86067081 -3.54143976 -2.70574665 -7.65474665
OR
Simply, you can use as below as well, where you can create a DataFrame from a list of simple tuples.
data = (-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665)
# data = [(-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665)]
pd.DataFrame(data).T
0 1 2 3 4
0 -9.83334315 -5.53137301 -11.49239039 -22.25802006 -20.34187910
1 -5.92063135 -8.31010785 -1.68053601 -10.12843806 -9.41576250
2 -7.83228037 -3.28062536 -4.14773043 -2.96888310 -3.34858700
3 5.55314146 -6.86067081 -3.54143976 -2.70574665 -7.65474665

wrap the tuples as a list
data=[(-9.83334315,-5.92063135,-7.83228037,5.55314146),
(-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976),
(-22.25802006,-10.12843806,-2.9688831,-2.70574665),
(-20.3418791,-9.4157625,-3.348587,-7.65474665)]
df=pd.DataFrame(data, columns=['A','B','C','D'])
print(df)
output:
A B C D
0 -9.833343 -5.920631 -7.832280 5.553141
1 -5.531373 -8.310108 -3.280625 -6.860671
2 -11.492390 -1.680536 -4.147730 -3.541440
3 -22.258020 -10.128438 -2.968883 -2.705747
4 -20.341879 -9.415762 -3.348587 -7.654747

Create list of several datasets

I have several datasets like df_1,df_2,...df_100.
First I want to create a list of these datasets.
df=[df_1,df_2,...,df_100]
This is what I did which did not work for me.
df=[]
for i in range(1,101):
df.append("df_"+str(i))
I need the above one so that I do the following
final=pandas.concat(df,ignore_index=True)
This gives me an error since df is a list of strings, not datasets. I want to create a list of several datasets.
In R, I did the following
final=do.call(rbind,mget(paste0("df_",1:100)))
Is there anything similar in python?

Use built-in functions globals or locals to get variable by name
>>> [globals()[d] for d in df]
Example:
>>> df_1
A B C
9l6rvsotz5 0.209350 -1.360556 0.059560
jTonmSOIVv 1.046584 0.251718 0.567056
eGaK0n8y9N -0.347716 -0.292623 0.591843
>>> df_2
A B C
TIVsJWSDWe -0.169969 0.345766 0.674683
EJjXuhL3pi -0.527015 -1.089954 -1.658116
dm3IYAyC7z 1.653666 -0.203685 -1.441150
>>> df_3
A B C
DbmE1sc3MI 0.215871 -0.382257 0.662477
9qZd6bvPVy 0.150985 0.135556 0.308615
qiVrxD64IF -1.384027 0.765303 -0.734394
>>> df = ["df_{}".format(i) for i in range(1, 4)]
>>> df
['df_1', 'df_2', 'df_3']
>>> pd.concat([globals()[d] for d in df], ignore_index=True)
A B C
0 0.209350 -1.360556 0.059560
1 1.046584 0.251718 0.567056
2 -0.347716 -0.292623 0.591843
3 -0.169969 0.345766 0.674683
4 -0.527015 -1.089954 -1.658116
5 1.653666 -0.203685 -1.441150
6 0.215871 -0.382257 0.662477
7 0.150985 0.135556 0.308615
8 -1.384027 0.765303 -0.734394

groupby and sum two columns and set as one column in pandas

I have the following data frame:
import pandas as pd
data = pd.DataFrame()
data['Home'] = ['A','B','C','D','E','F']
data['HomePoint'] = [3,0,1,1,3,3]
data['Away'] = ['B','C','A','E','D','D']
data['AwayPoint'] = [0,3,1,1,0,0]
i want to groupby the columns ['Home', 'Away'] and change the name as Team. Then i like to sum homepoint and awaypoint as name as Points.
Team Points
A 4
B 0
C 4
D 1
E 4
F 3
How can I do it?
I was trying different approach using the following post:
Link
But I was not able to get the format that I wanted.
Greatly appreciate your advice.
Thanks
Zep.

A simple way is to create two new Series indexed by the teams:
home = pd.Series(data.HomePoint.values, data.Home)
away = pd.Series(data.AwayPoint.values, data.Away)
Then, the result you want is:
home.add(away, fill_value=0).astype(int)
Note that home + away does not work, because team F never played away, so would result in NaN for them. So we use Series.add() with fill_value=0.
A complicated way is to use DataFrame.melt():
goo = data.melt(['HomePoint', 'AwayPoint'], var_name='At', value_name='Team')
goo.HomePoint.where(goo.At == 'Home', goo.AwayPoint).groupby(goo.Team).sum()
Or from the other perspective:
ooze = data.melt(['Home', 'Away'])
ooze.value.groupby(ooze.Home.where(ooze.variable == 'HomePoint', ooze.Away)).sum()

You can concatenate, pairwise, columns of your input dataframe. Then use groupby.sum.
# calculate number of pairs
n = int(len(df.columns)/2)+1)
# create list of pairwise dataframes
df_lst = [data.iloc[:, 2*i:2*(i+1)].set_axis(['Team', 'Points'], axis=1, inplace=False) \
for i in range(n)]
# concatenate list of dataframes
df = pd.concat(df_lst, axis=0)
# perform groupby
res = df.groupby('Team', as_index=False)['Points'].sum()
print(res)
Team Points
0 A 4
1 B 0
2 C 4
3 D 1
4 E 4
5 F 3

Pandas concat dataframe row seem index

I have two dataframe and I will concat but just row will have the seem index
my first dataframe look like this
a b c d e f
20018-01-05 1.702556 -0.885554 0.766257 -0.731700 -1.071232 1.806680
20018-01-06 -0.968689 -0.700311 1.024988 -0.705764 0.804285 -0.337177
20018-01-07 1.249893 -0.613356 1.975736 -0.093838 0.428004 0.634204
20018-01-08 0.430000 0.502100 0.194092 0.588685 -0.507332 1.404635
20018-01-09 1.005721 0.604771 -2.296667 0.157201 1.583537 1.359332
and i will concat this dataframe
g h
20018-01-05 13.702556 -03.885554
20018-01-06 -03.968689 -03.700311
20018-01-07 13.249893 -03.613356
20018-01-22 03.430000 03.502100
20018-01-23 13.005721 03.604771
I would like concat just the tree first line with the seem index in drop others
my final dataframe should look like this
a b c d e f g h
20018-01-05 1.702556 -0.885554 0.766257 -0.731700 -1.071232 1.806680 13.702556 -03.885554
20018-01-06 -0.968689 -0.700311 1.024988 -0.705764 0.804285 -0.337177 -03.968689 -03.700311
20018-01-07 1.249893 -0.613356 1.975736 -0.093838 0.428004 0.634204 13.249893 -03.613356

Try this
>>> pd.concat((df1, df2), axis=1).ix[:2, 1:]
a b ... g h
0 1.702556 -0.885554 ... 13.702556 -03.885554
1 -0.968689 -0.700311 ... -03.968689 -03.700311
2 1.249893 -0.613356 ... 13.249893 -03.613356
[3 rows x 9 columns]

Summing 3 columns in a dataframe

This should be easy:
I have a data frame with the following columns
a,b,min,w,w_min
all I want to do is sum up the columns min,w,and w_min and read that result into another data frame.
I've looked, but I can not find a previously asked question that directly relates back to this. Everything I've found seems much more complex then what I'm trying to do.

You can just pass a list of cols and select these to perform the summation on:
In [64]:
df = pd.DataFrame(columns=['a','b','min','w','w_min'], data = np.random.randn(10,5) )
df
Out[64]:
a b min w w_min
0 0.626671 0.850726 0.539850 -0.669130 -1.227742
1 0.856717 2.108739 -0.079023 -1.107422 -1.417046
2 -1.116149 -0.013082 0.871393 -1.681556 -0.170569
3 -0.944121 -2.394906 -0.454649 0.632995 1.661580
4 0.590963 0.751912 0.395514 0.580653 0.573801
5 -1.661095 -0.592036 -1.278102 -0.723079 0.051083
6 0.300866 -0.060604 0.606705 1.412149 0.916915
7 -1.640530 -0.398978 0.133140 -0.628777 -0.464620
8 0.734518 1.230869 -1.177326 -0.544876 0.244702
9 -1.300137 1.328613 -1.301202 0.951401 -0.693154
In [65]:
cols=['min','w','w_min']
df[cols].sum()
Out[65]:
min -1.743700
w -1.777642
w_min -0.525050
dtype: float64

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

DataFrame Subset - python

Related

How to Convert a text data into DataFrame

Create list of several datasets

groupby and sum two columns and set as one column in pandas

Pandas concat dataframe row seem index

Summing 3 columns in a dataframe

Categories

Resources