Pandas: Un-nesting / flattening hierarchical dataframe [duplicate]

This question already has answers here:
How to convert index of a pandas dataframe into a column
(9 answers)
Closed 3 years ago.
I have a dataframe from which I remove some rows. As a result, I get a dataframe in which the index looks like [1,5,6,10,11], and I would like to reset it to [0,1,2,3,4]. How can I do it?
The following seems to work:
df = df.reset_index()
del df['index']
The following does not work:
df = df.reindex()

DataFrame.reset_index is what you're looking for. If you don't want it saved as a column, then do:
df = df.reset_index(drop=True)
If you don't want to reassign:
df.reset_index(drop=True, inplace=True)
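A minimal before-and-after sketch of what drop=True does (the data here is illustrative):
import pandas as pd
df = pd.DataFrame({'a': [10, 20, 30]}, index=[1, 5, 6])  # gappy index left over after removing rows
df = df.reset_index(drop=True)  # the old index is discarded instead of being saved as a column
print(df.index.tolist())  # [0, 1, 2]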

Other solutions are to assign a RangeIndex or range:
df.index = pd.RangeIndex(len(df.index))
df.index = range(len(df.index))
These are faster, since they replace the index in place instead of building a new DataFrame:
df = pd.DataFrame({'a':[8,7], 'c':[2,4]}, index=[7,8])
df = pd.concat([df]*10000)
print (df.head())
In [298]: %timeit df1 = df.reset_index(drop=True)
The slowest run took 7.26 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 105 µs per loop
In [299]: %timeit df.index = pd.RangeIndex(len(df.index))
The slowest run took 15.05 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.84 µs per loop
In [300]: %timeit df.index = range(len(df.index))
The slowest run took 7.10 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.2 µs per loop

data1.reset_index(inplace=True)

Related

Pandas DataFrame Natural index [duplicate]


Pivot in Pandas incorrectly converts index column to row labels ... losing the index column [duplicate]


Averaging indices of table using pandas and numpy

I have been playing with pandas for a few hours now, and I was wondering whether there is a faster way to add an extra column to a table that holds the average of each row. Currently I create a new list containing the means and then incorporate it into the data frame.
This is my code:
import numpy as np
import pandas as pd
userdata={"A":[2,5],"B":[4,6]}
tab=pd.DataFrame((userdata), columns=["A","B"])
lst=[np.mean([tab.loc[i,"A"],tab.loc[i,"B"]]) for i in range(len(tab.index))]
tab["Average of A and B"]=pd.DataFrame(lst)
tab
Try df.mean(1) with assign. df.mean(1) tells pandas to calculate the mean along axis=1 (rows); axis=0 is the default.
df.assign(Mean=df.mean(1))
This produces a copy of df with the added column.
To alter the existing dataframe:
df['Mean'] = df.mean(1)
demo
tab.assign(Mean=tab.mean(1))
A B Mean
0 2 4 3.0
1 5 6 5.5
A NumPy solution would be to work with the underlying array data for performance -
tab['average'] = tab.values.mean(1)
To choose specific columns, like 'A' and 'B' -
tab['average'] = tab[['A','B']].values.mean(1)
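One caveat, assuming the frame may also hold non-numeric columns: .values would then produce an object array and .mean(1) fails, so it is safer to restrict to numeric columns first (a sketch; the column selection is illustrative):
num_cols = tab.select_dtypes(include=['number']).columns  # keep only numeric columns
tab['average'] = tab[num_cols].values.mean(1)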
Runtime test -
In [41]: tab = pd.DataFrame(np.random.randint(0,9,(10000,10)))
# @piRSquared's soln
In [42]: %timeit tab.assign(Mean=tab.mean(1))
1000 loops, best of 3: 615 µs per loop
In [43]: tab = pd.DataFrame(np.random.randint(0,9,(10000,10)))
In [44]: %timeit tab['average'] = tab.values.mean(1)
1000 loops, best of 3: 297 µs per loop
In [37]: tab = pd.DataFrame(np.random.randint(0,9,(10000,100)))
# @piRSquared's soln
In [38]: %timeit tab.assign(Mean=tab.mean(1))
100 loops, best of 3: 4.71 ms per loop
In [39]: tab = pd.DataFrame(np.random.randint(0,9,(10000,100)))
In [40]: %timeit tab['average'] = tab.values.mean(1)
100 loops, best of 3: 3.6 ms per loop
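As an aside, assuming a newer pandas (0.24+), to_numpy() is the documented replacement for the .values attribute used above and behaves the same here:
tab['average'] = tab.to_numpy().mean(1)  # same result as tab.values.mean(1) on an all-numeric frame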

How to sum values of a row of a pandas dataframe efficiently

I have a pandas dataframe with 1.5 million rows and 8 columns. I want to combine a few columns into a new column. I know how to do this, but I want to know which way is faster and more efficient. I am reproducing my code here:
import pandas as pd
import numpy as np
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
Now here is what I want to achieve
df['D']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The other alternative is to use the apply functionality of pandas:
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
I wanted to know which method takes less time when we have 1.5 million rows and have to combine 8 columns.
The first method is faster, because it is vectorized:
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
print (df)
df = pd.concat([df]*10000).reset_index(drop=True)
#[30000 rows x 3 columns]
df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
#similar timings with mul function
#df['D1']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
print (df)
In [54]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
The slowest run took 10.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 950 µs per loop
In [55]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.2 ms per loop
In [56]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 928 ms per loop
More testing with a 1.5M-row DataFrame shows that the apply method is very slow, since it loops over the rows at the Python level:
#[1500000 rows x 6 columns]
df = pd.concat([df]*500000).reset_index(drop=True)
In [62]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
10 loops, best of 3: 34.8 ms per loop
In [63]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
10 loops, best of 3: 31.5 ms per loop
In [64]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 47.3 s per loop
Using @jezrael's setup
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
df = pd.concat([df]*30000).reset_index(drop=True)
It is far more efficient to use a dot product:
np.array([[.5, .3, .2]]).dot(df.values.T).T
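A short sketch of assigning the result back as a column (the column list and weight order follow the setup above; the names are assumptions):
weights = np.array([.5, .3, .2])
df['D3'] = df[['A', 'B', 'C']].values.dot(weights)  # (n, 3) @ (3,) -> (n,)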

Does adding a column to a DataFrame involve copying data?

My question is about performance only, not semantics.
Does adding a new column to a df cause the data in the existing DataFrame to be physically copied to a new memory location (to ensure that the DataFrame occupies contiguous memory, for example)?
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(100)})
b = pd.Series(range(100))
df['b'] = b # is this operation expensive?
# equivalently df.loc[:, 'b'] = b
I know (from experimentation; I couldn't find it in the documentation) that df['b'] = b will semantically create a copy of b, which obviously requires copying the underlying data. But I have no idea whether the data in the other columns can stay where they are, or sometimes need to be moved.
Edit:
I know that adding a large number of columns is expensive. I'm only asking about adding a single column.
I also know that adding a row requires copying of the data in some cases (or always? -- not sure) for an obvious reason that the items in a single column have to be in contiguous memory.
From my experiments, I think loc is slower, and aligning a new Series with a different index is the slowest:
But I have no idea whether the data in the other columns can stay where they are, or sometimes need to be moved.
I think the data are not moved; new columns are added at the end (maybe there are some exceptions, but I don't know of any).
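A rough way to probe this (a sketch only; block consolidation is a pandas internal detail, so the result may vary across versions):
import numpy as np
import pandas as pd
probe = pd.DataFrame({'a': range(100)})
before = probe['a'].values              # typically a view into the existing block
probe['b'] = pd.Series(range(100))      # add one new column
after = probe['a'].values
print(np.shares_memory(before, after))  # True suggests column 'a' was not copied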
# using pandas 0.18.1, python 3.5
import pandas as pd
#len(df) = 10m
df = pd.DataFrame({'a': range(10000000)})
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print (df)
In [36]: %timeit df['b'] = b
10 loops, best of 3: 23.5 ms per loop
In [37]: %timeit df.loc[:, 'c'] = b
The slowest run took 5.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 40 ms per loop
In [38]: %timeit df['d'] = c
10 loops, best of 3: 22.3 ms per loop
In [39]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 39.5 ms per loop
But if the index is changed:
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(10000000)})
df.index = df.index + 15
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print (df)
In [41]: %timeit df['b'] = b
1 loop, best of 3: 656 ms per loop
In [42]: %timeit df.loc[:, 'c'] = b
1 loop, best of 3: 735 ms per loop
In [43]: %timeit df['d'] = c
10 loops, best of 3: 22.4 ms per loop
In [44]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 56.6 ms per loop
Adding a new row is fast; I think it depends on the length of the Series:
In [68]: %timeit df.loc[10000015, :] = pd.Series([1,2,3,2,4], index=df.columns)
1000 loops, best of 3: 274 µs per loop
But adding many rows this way is expensive, and I think it can be avoided.
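One common way to avoid it (a sketch with illustrative data): collect the rows first and concatenate once, rather than growing the frame row by row.
rows = [pd.Series([1, 2, 3, 2, 4], index=df.columns) for _ in range(1000)]
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)  # one allocation instead of 1000 row inserts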
