I have several datasets, like df_1, df_2, ..., df_100.
First I want to create a list of these datasets:
df = [df_1, df_2, ..., df_100]
This is what I tried, which did not work for me:
df = []
for i in range(1, 101):
    df.append("df_" + str(i))
I need this list so that I can do the following:
final=pandas.concat(df,ignore_index=True)
This gives me an error since df is a list of strings, not datasets. I want to create a list of several datasets.
In R, I did the following
final=do.call(rbind,mget(paste0("df_",1:100)))
Is there anything similar in python?
Use the built-in functions globals() or locals() to look up a variable by name:
>>> [globals()[d] for d in df]
Example:
>>> df_1
A B C
9l6rvsotz5 0.209350 -1.360556 0.059560
jTonmSOIVv 1.046584 0.251718 0.567056
eGaK0n8y9N -0.347716 -0.292623 0.591843
>>> df_2
A B C
TIVsJWSDWe -0.169969 0.345766 0.674683
EJjXuhL3pi -0.527015 -1.089954 -1.658116
dm3IYAyC7z 1.653666 -0.203685 -1.441150
>>> df_3
A B C
DbmE1sc3MI 0.215871 -0.382257 0.662477
9qZd6bvPVy 0.150985 0.135556 0.308615
qiVrxD64IF -1.384027 0.765303 -0.734394
>>> df = ["df_{}".format(i) for i in range(1, 4)]
>>> df
['df_1', 'df_2', 'df_3']
>>> pd.concat([globals()[d] for d in df], ignore_index=True)
A B C
0 0.209350 -1.360556 0.059560
1 1.046584 0.251718 0.567056
2 -0.347716 -0.292623 0.591843
3 -0.169969 0.345766 0.674683
4 -0.527015 -1.089954 -1.658116
5 1.653666 -0.203685 -1.441150
6 0.215871 -0.382257 0.662477
7 0.150985 0.135556 0.308615
8 -1.384027 0.765303 -0.734394
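Putting it together for the original 100 frames, a minimal sketch (assuming df_1 through df_100 all already exist at module level):

import pandas as pd

# look each frame up by name, then concatenate the lot
frames = [globals()["df_{}".format(i)] for i in range(1, 101)]
final = pd.concat(frames, ignore_index=True)

That said, if you control how the frames are created, collecting them in a list or dict up front avoids the name lookup entirely.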
First off, let me say that I've already looked over various responses to similar questions, but so far none of them has really made it clear to me why (or why not) the Series and DataFrame methods differ.
Also, some of the pandas documentation is unclear. For example, the Series.reindex reference,
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html
suddenly switches to showing DataFrame examples rather than Series ones, yet the two methods don't seem to overlap exactly.
So, now to it, first with a DataFrame.
> df = pd.DataFrame(np.random.randn(6,4), index=range(6), columns=list('ABCD'))
> df
Out[544]:
A B C D
0 0.136833 -0.974500 1.708944 0.435174
1 -0.357955 -0.775882 -0.208945 0.120617
2 -0.002479 0.508927 -0.826698 -0.904927
3 1.955611 -0.558453 -0.476321 1.043139
4 -0.399369 -0.361136 -0.096981 0.092468
5 -0.130769 -0.075684 0.788455 1.640398
Now, to add new columns, I can do something simple (2 ways, same result).
> df[['X','Y']] = (99,-99)
> df.loc[:,['X','Y']] = (99,-99)
> df
Out[557]:
A B C D X Y
0 0.858615 -0.552171 1.225210 -1.700594 99 -99
1 1.062435 -1.917314 1.160043 -0.058348 99 -99
2 0.023910 1.262706 -1.924022 -0.625969 99 -99
3 1.794365 0.146491 -0.103081 0.731110 99 -99
4 -1.163691 1.429924 -0.194034 0.407508 99 -99
5 0.444909 -0.905060 0.983487 -4.149244 99 -99
Now, with a Series, I have hit a (mental?) block trying the same.
I'm going to be using a loop to construct a list of Series that will eventually be a data frame, but I want to deal with each 'row' as a Series first, (to make development easier).
> ss = pd.Series(np.random.randn(4), index=list('ABCD'))
> ss
Out[552]:
A 0.078013
B 1.707052
C -0.177543
D -1.072017
dtype: float64
> ss['X','Y'] = (99,-99)
Traceback (most recent call last):
...
KeyError: "None of [Index(['X', 'Y'], dtype='object')] are in the [index]"
Same for,
> ss[['X','Y']] = (99,-99)
> ss.loc[['X','Y']] = (99,-99)
KeyError: "None of [Index(['X', 'Y'], dtype='object')] are in the [index]"
The only way I can get this working is a rather clumsy (IMHO),
> ss['X'],ss['Y'] = (99,-99)
> ss
Out[560]:
A 0.078013
B 1.707052
C -0.177543
D -1.072017
X 99.000000
Y -99.000000
dtype: float64
I did think that, perhaps, reindexing the Series to add the new indices prior to assignment might solve the problem. It would, but then I hit an issue trying to change the index.
> ss = pd.Series(np.random.randn(4), index=list('ABCD'), name='z')
> xs = pd.Series([99,-99], index=['X','Y'], name='z')
Here I can concat my two Series to create a new one, and I can also concat the Series indices, e.g.,
> ss.index.append(xs.index)
Index(['A', 'B', 'C', 'D', 'X', 'Y'], dtype='object')
But I can't extend the current index with,
> ss.index = ss.index.append(xs.index)
ValueError: Length mismatch: Expected axis has 4 elements, new values have 6 elements
So, what intuitive leap must I make to understand why the former Series methods don't work, but (what looks like an equivalent) DataFrame method does work?
It makes passing multiple outputs back from a function into new Series elements a bit clunky; I can't make up new Series index names on the fly to insert values into my existing Series object.
I don't think you can directly modify the Series in place to add multiple values at once.
If having a new object is not an issue:
ss = pd.Series(np.random.randn(4), index=list('ABCD'), name='z')
xs = pd.Series([99,-99], index=['X','Y'], name='z')
# new object with updated index
ss = ss.reindex(ss.index.union(xs.index))
ss.update(xs)
Output:
A -0.369182
B -0.239379
C 1.099660
D 0.655264
X 99.000000
Y -99.000000
Name: z, dtype: float64
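As a side note, the same new object can be built in one step with Series.combine_first, assuming xs holds no NaN values (xs wins on any overlapping labels):

# equivalent one-liner: take xs values where present, fill the rest from ss
ss = xs.combine_first(ss)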
An in-place alternative using a function:
ss = pd.Series(np.random.randn(4), index=list('ABCD'), name='z')
xs = pd.Series([99,-99], index=['X','Y'], name='z')
def extend(s1, s2):
    s1.update(s2)  # update common indices in place
    # then add the labels that exist only in s2
    for idx, val in s2[s2.index.difference(s1.index)].items():
        s1[idx] = val
extend(ss, xs)
Updated ss:
A 0.279925
B -0.098150
C 0.910179
D 0.317218
X 99.000000
Y -99.000000
Name: z, dtype: float64
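If the two Series are guaranteed to have disjoint labels, a plain concat (again producing a new object) is the shortest route:

# only safe when the indices don't overlap; with shared labels this
# would keep duplicate entries rather than updating them
ss = pd.concat([ss, xs])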
While I have accepted @mozway's answer above, since it nicely handles extending the Series even when there are possible index conflicts, I'm adding this 'answer' to demonstrate my point about the inconsistency of the extend operation between Series and DataFrame.
If I create my Series as single-row DataFrames, as below, I can extend the 'series' as I expected.
z=pd.Index(['z'])
ss = pd.DataFrame(np.random.randn(1,4), columns=list('ABCD'),index=z)
xs = pd.DataFrame([[99,-99]], columns=['X','Y'],index=z)
ss
Out[619]:
A B C D
z 1.052589 -0.337622 -0.791994 -0.266888
ss[['x','y']] = xs
ss
Out[620]:
A B C D x y
z 1.052589 -0.337622 -0.791994 -0.266888 99 -99
type(ss)
Out[621]: pandas.core.frame.DataFrame
Note that, as a DataFrame, I don't even need a Series for the extend object.
ss[['X','Y']] = [123,-123]
ss
Out[633]:
A B C D X Y
z 0.600981 -0.473031 0.216941 0.255252 123 -123
So I've simply extended the DataFrame, but it's still a DataFrame of 1 row.
I can now either 'squeeze' the DataFrame,
zz1=ss.squeeze()
type(zz1)
Out[624]: pandas.core.series.Series
zz1
Out[625]:
A 1.052589
B -0.337622
C -0.791994
D -0.266888
x 99.000000
y -99.000000
Name: z, dtype: float64
Alternatively, I can use 'iloc[0]' to get a Series directly. Note that 'loc' with a list-like key (such as ss.loc[z] here) returns a DataFrame, not a Series, and will still require 'squeezing'.
zz2=ss.iloc[0]
type(zz2)
Out[629]: pandas.core.series.Series
zz2
Out[630]:
A 1.052589
B -0.337622
C -0.791994
D -0.266888
x 99.000000
y -99.000000
Name: z, dtype: float64
Please note, I'm not a Pandas 'wizard' so there may be other insights that I lack.
I have the following data frame:
import pandas as pd
data = pd.DataFrame()
data['Home'] = ['A','B','C','D','E','F']
data['HomePoint'] = [3,0,1,1,3,3]
data['Away'] = ['B','C','A','E','D','D']
data['AwayPoint'] = [0,3,1,1,0,0]
I want to group the 'Home' and 'Away' columns together under a single name, Team, and sum HomePoint and AwayPoint into a column named Points:
Team Points
A 4
B 0
C 4
D 1
E 4
F 3
How can I do it?
I tried a different approach using the following post:
Link
but I was not able to get the format that I wanted. I'd greatly appreciate your advice.
Thanks,
Zep.
A simple way is to create two new Series indexed by the teams:
home = pd.Series(data.HomePoint.values, data.Home)
away = pd.Series(data.AwayPoint.values, data.Away)
Then, the result you want is:
home.add(away.groupby(level=0).sum(), fill_value=0).astype(int)
Note that plain home + away does not work, for two reasons: team D appears twice in Away, so those duplicate labels must be collapsed first (the groupby(level=0).sum()), and team F never played away, so addition would leave NaN for them, hence Series.add() with fill_value=0.
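For this data, that gives:
A    4
B    0
C    4
D    1
E    4
F    3
dtype: int64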
A complicated way is to use DataFrame.melt():
goo = data.melt(['HomePoint', 'AwayPoint'], var_name='At', value_name='Team')
goo.HomePoint.where(goo.At == 'Home', goo.AwayPoint).groupby(goo.Team).sum()
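For reference, the melted frame looks like this; the where() picks HomePoint on 'Home' rows and AwayPoint on 'Away' rows before grouping by team:
    HomePoint  AwayPoint    At Team
0           3          0  Home    A
1           0          3  Home    B
2           1          1  Home    C
3           1          1  Home    D
4           3          0  Home    E
5           3          0  Home    F
6           3          0  Away    B
7           0          3  Away    C
8           1          1  Away    A
9           1          1  Away    E
10          3          0  Away    D
11          3          0  Away    D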
Or from the other perspective:
ooze = data.melt(['Home', 'Away'])
ooze.value.groupby(ooze.Home.where(ooze.variable == 'HomePoint', ooze.Away)).sum()
You can concatenate, pairwise, the columns of your input dataframe, then use groupby.sum.
# number of (team, points) column pairs
n = len(data.columns) // 2
# create list of pairwise dataframes
df_lst = [data.iloc[:, 2*i:2*(i+1)].set_axis(['Team', 'Points'], axis=1)
          for i in range(n)]
# concatenate list of dataframes
df = pd.concat(df_lst, axis=0)
# perform groupby
res = df.groupby('Team', as_index=False)['Points'].sum()
print(res)
Team Points
0 A 4
1 B 0
2 C 4
3 D 1
4 E 4
5 F 3
I have two DataFrames and I want to concatenate them, keeping only the rows that share the same index.
My first DataFrame looks like this:
a b c d e f
20018-01-05 1.702556 -0.885554 0.766257 -0.731700 -1.071232 1.806680
20018-01-06 -0.968689 -0.700311 1.024988 -0.705764 0.804285 -0.337177
20018-01-07 1.249893 -0.613356 1.975736 -0.093838 0.428004 0.634204
20018-01-08 0.430000 0.502100 0.194092 0.588685 -0.507332 1.404635
20018-01-09 1.005721 0.604771 -2.296667 0.157201 1.583537 1.359332
and I want to concatenate it with this DataFrame:
g h
20018-01-05 13.702556 -03.885554
20018-01-06 -03.968689 -03.700311
20018-01-07 13.249893 -03.613356
20018-01-22 03.430000 03.502100
20018-01-23 13.005721 03.604771
I would like to concatenate just the first three rows, the ones that share the same index, and drop the others.
My final DataFrame should look like this:
a b c d e f g h
20018-01-05 1.702556 -0.885554 0.766257 -0.731700 -1.071232 1.806680 13.702556 -03.885554
20018-01-06 -0.968689 -0.700311 1.024988 -0.705764 0.804285 -0.337177 -03.968689 -03.700311
20018-01-07 1.249893 -0.613356 1.975736 -0.093838 0.428004 0.634204 13.249893 -03.613356
Try this: concatenate along the columns with join='inner', which keeps only the index labels present in both frames.
>>> pd.concat((df1, df2), axis=1, join='inner')
                    a         b  ...          g          h
20018-01-05  1.702556 -0.885554  ...  13.702556 -03.885554
20018-01-06 -0.968689 -0.700311  ...  -03.968689 -03.700311
20018-01-07  1.249893 -0.613356  ...  13.249893 -03.613356
[3 rows x 8 columns]
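An equivalent spelling, assuming the frames are named df1 and df2 as above, uses DataFrame.join, which also aligns on the index:
>>> df1.join(df2, how='inner')
and returns the same three rows.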
I'm fairly new to programming and I have a question about using loops to recode variables in a pandas data frame that I was hoping I could get some help with.
I want to recode multiple columns in a pandas data frame from units of seconds to minutes. I've written a simple function in Python that I copy and repeat for each column, which works, but I want to automate this. I appreciate the help.
The ivf.secondsUntilCC.xxx columns contain the number of seconds until something happens, and I want the new ivf.minsUntilCC.xxx columns to hold the number of minutes. The data frame name is data.
def f(x, y):
    return x[y] / 60
data['ivf.minsUntilCC.500'] = f(data,'ivf.secondsUntilCC.500')
data['ivf.minsUntilCC.1000'] = f(data,'ivf.secondsUntilCC.1000')
data['ivf.minsUntilCC.2000'] = f(data,'ivf.secondsUntilCC.2000')
data['ivf.minsUntilCC.3000'] = f(data,'ivf.secondsUntilCC.3000')
data['ivf.minsUntilCC.4000'] = f(data,'ivf.secondsUntilCC.4000')
I would use a vectorized approach:
In [27]: df
Out[27]:
X ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 191365 906395 854268 701859 979647 914942
1 288577 300394 577555 880370 924162 897984
2 66705 493545 232603 682509 794074 204429
3 747828 504930 379035 29230 410390 287327
4 926553 913360 657640 336139 210202 356649
In [28]: df.loc[:, df.columns.str.startswith('ivf.minsUntilCC.')] /= 60
In [29]: df
Out[29]:
X ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 191365 15106.583333 14237.800000 11697.650000 16327.450000 15249.033333
1 288577 5006.566667 9625.916667 14672.833333 15402.700000 14966.400000
2 66705 8225.750000 3876.716667 11375.150000 13234.566667 3407.150000
3 747828 8415.500000 6317.250000 487.166667 6839.833333 4788.783333
4 926553 15222.666667 10960.666667 5602.316667 3503.366667 5944.150000
Setup:
df = pd.DataFrame(np.random.randint(0,10**6,(5,6)),
columns=['X','ivf.minsUntilCC.500', 'ivf.minsUntilCC.1000',
'ivf.minsUntilCC.2000', 'ivf.minsUntilCC.3000',
'ivf.minsUntilCC.4000'])
Explanation:
In [26]: df.loc[:, df.columns.str.startswith('ivf.minsUntilCC.')]
Out[26]:
ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 906395 854268 701859 979647 914942
1 300394 577555 880370 924162 897984
2 493545 232603 682509 794074 204429
3 504930 379035 29230 410390 287327
4 913360 657640 336139 210202 356649
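Translating that back to the question's column names, here is a minimal sketch (assuming the source columns follow the ivf.secondsUntilCC.xxx naming) that derives the new minutes columns from the seconds ones:

# select the seconds columns by prefix, then create matching minutes columns
sec_cols = data.columns[data.columns.str.startswith('ivf.secondsUntilCC.')]
for c in sec_cols:
    data[c.replace('secondsUntilCC', 'minsUntilCC')] = data[c] / 60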
I have a dataframe already and am subsetting some of it to another dataframe.
I do that like this:
D = njm[['svntygene', 'intgr', 'lowgr', 'higr', 'lumA', 'lumB', 'wndres', 'nlbrst', 'Erneg', 'basallike']]
I want to try selecting by integer position instead, something like this:
D = njm.iloc[1:, 2:, 3:, 7:]
But I get an error. How would I do this? I read the docs but could not find a clear answer.
Also, is it possible to pass a list of positions here too?
Thanks.
This is covered in the iloc section of the documentation: you can pass a list with the desired indices.
>>> df = pd.DataFrame(np.random.random((5,5)),columns=list("ABCDE"))
>>> df
A B C D E
0 0.605594 0.229728 0.390391 0.754185 0.516801
1 0.384228 0.106261 0.457507 0.833473 0.786098
2 0.364943 0.664588 0.330835 0.846941 0.229110
3 0.025799 0.681206 0.235821 0.418825 0.878566
4 0.811800 0.761962 0.883281 0.932983 0.665609
>>> df.iloc[:,[1,2,4]]
B C E
0 0.229728 0.390391 0.516801
1 0.106261 0.457507 0.786098
2 0.664588 0.330835 0.229110
3 0.681206 0.235821 0.878566
4 0.761962 0.883281 0.665609
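If you need to mix slice ranges with individual positions, numpy's np.r_ can build the integer array for you:
>>> import numpy as np
>>> df.iloc[:, np.r_[1:3, 4]]   # same selection as df.iloc[:, [1, 2, 4]]
          B         C         E
0  0.229728  0.390391  0.516801
1  0.106261  0.457507  0.786098
2  0.664588  0.330835  0.229110
3  0.681206  0.235821  0.878566
4  0.761962  0.883281  0.665609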