I have a Pandas DataFrame which has a list of integers inside one of the columns. I'd like to access the individual elements within this list. I've found a way to do it by using tolist() and turning it back into a DataFrame, but I am wondering if there is a simpler/better way. In this example, I add Column A to the middle element of the list in Column B.
import pandas as pd
df = pd.DataFrame({'A': (1,2,3), 'B': ([0,1,2],[3,4,5],[6,7,8])})
df['C'] = df['A'] + pd.DataFrame(df['B'].tolist())[1]
df
Is there a better way to do this?
A bit more straightforward is:
df['C'] = df['A'] + df['B'].apply(lambda x:x[1])
One option is to use apply, which should be faster than creating a DataFrame out of the column:
df['C'] = df['A'] + df.apply(lambda row: row['B'][1], axis = 1)
Some speed tests:
%timeit df['C'] = df['A'] + pd.DataFrame(df['B'].tolist())[1]
# 1000 loops, best of 3: 567 µs per loop
%timeit df['C'] = df['A'] + df.apply(lambda row: row['B'][1], axis = 1)
# 1000 loops, best of 3: 406 µs per loop
%timeit df['C'] = df['A'] + df['B'].apply(lambda x:x[1])
# 1000 loops, best of 3: 250 µs per loop
OK, slightly better. @breucopter's answer (the Series-level apply) is the fastest.
You can also simply try the following:
df['C'] = df['A'] + df['B'].str[1]
Performance of this method:
%timeit df['C'] = df['A'] + df['B'].str[1]
# 1000 loops, best of 3: 445 µs per loop
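One thing worth noting (my observation, not from the answer above): .str here does positional indexing on the list objects, and unlike apply(lambda x: x[1]), which raises an IndexError on a too-short list, .str[1] quietly yields NaN:
import pandas as pd
s = pd.Series([[0, 1, 2], [3]])
print(s.str[1])  # row 0 -> 1, row 1 -> NaN (no exception raised)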
I have a dataframe from which I remove some rows. As a result, I get a dataframe whose index looks like [1,5,6,10,11], and I would like to reset it to [0,1,2,3,4]. How can I do it?
The following seems to work:
df = df.reset_index()
del df['index']
The following does not work:
df = df.reindex()
DataFrame.reset_index is what you're looking for. If you don't want it saved as a column, then do:
df = df.reset_index(drop=True)
If you don't want to reassign:
df.reset_index(drop=True, inplace=True)
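A minimal illustration of the difference (my own example):
import pandas as pd
df = pd.DataFrame({'a': [10, 20, 30]}, index=[1, 5, 6])
print(df.reset_index())           # old index kept as a new 'index' column
print(df.reset_index(drop=True))  # old index discarded; fresh RangeIndex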
Other solutions are to assign a RangeIndex or range:
df.index = pd.RangeIndex(len(df.index))
df.index = range(len(df.index))
These are faster:
df = pd.DataFrame({'a':[8,7], 'c':[2,4]}, index=[7,8])
df = pd.concat([df]*10000)
print (df.head())
In [298]: %timeit df1 = df.reset_index(drop=True)
The slowest run took 7.26 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 105 µs per loop
In [299]: %timeit df.index = pd.RangeIndex(len(df.index))
The slowest run took 15.05 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.84 µs per loop
In [300]: %timeit df.index = range(len(df.index))
The slowest run took 7.10 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.2 µs per loop
You can also reset the index in place:
data1.reset_index(inplace=True)
Note that without drop=True this keeps the old index as a new 'index' column.
I have been playing with pandas for a few hours now, and I was wondering whether there is a faster way to add an extra column to a table holding the average of each row. Currently I build a list of the row means and then incorporate it into the data frame.
This is my code:
import numpy as np
import pandas as pd
userdata={"A":[2,5],"B":[4,6]}
tab=pd.DataFrame((userdata), columns=["A","B"])
lst=[np.mean([tab.loc[i,"A"],tab.loc[i,"B"]]) for i in range(len(tab.index))]
tab["Average of A and B"]=pd.DataFrame(lst)
tab
Try df.mean(1) with assign. df.mean(1) tells pandas to calculate the mean along axis=1, i.e. across each row; axis=0 (down each column) is the default.
df.assign(Mean=df.mean(1))
This produces a copy of df with the added column.
To alter the existing dataframe
df['Mean'] = df.mean(1)
demo
tab.assign(Mean=tab.mean(1))
A B Mean
0 2 4 3.0
1 5 6 5.5
A NumPy solution would be to work with the underlying array data for performance -
tab['average'] = tab.values.mean(1)
To choose specific columns, like 'A' and 'B' -
tab['average'] = tab[['A','B']].values.mean(1)
Runtime test -
In [41]: tab = pd.DataFrame(np.random.randint(0,9,(10000,10)))
# @piRSquared's solution
In [42]: %timeit tab.assign(Mean=tab.mean(1))
1000 loops, best of 3: 615 µs per loop
In [43]: tab = pd.DataFrame(np.random.randint(0,9,(10000,10)))
In [44]: %timeit tab['average'] = tab.values.mean(1)
1000 loops, best of 3: 297 µs per loop
In [37]: tab = pd.DataFrame(np.random.randint(0,9,(10000,100)))
# @piRSquared's solution
In [38]: %timeit tab.assign(Mean=tab.mean(1))
100 loops, best of 3: 4.71 ms per loop
In [39]: tab = pd.DataFrame(np.random.randint(0,9,(10000,100)))
In [40]: %timeit tab['average'] = tab.values.mean(1)
100 loops, best of 3: 3.6 ms per loop
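One caveat with the .values approach (my note, not from the answer): on a mixed-dtype frame, .values upcasts everything to a common dtype, often object, so select the numeric columns first:
import pandas as pd
mixed = pd.DataFrame({'A': [1, 2], 'B': [4.0, 6.0], 'C': ['x', 'y']})
print(mixed.values.dtype)              # object: the string column forces upcasting
print(mixed[['A', 'B']].values.dtype)  # float64: numeric columns only
mixed['average'] = mixed[['A', 'B']].values.mean(1)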
My question is about performance only, not semantics.
Does adding a new column to a df cause the data in the existing DataFrame to be physically copied to a new memory location (to ensure that the DataFrame occupies contiguous memory, for example)?
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(100)})
b = pd.Series(range(100))
df['b'] = b # is this operation expensive?
# equivalently df.loc[:, 'b'] = b
I know (from experimentation, couldn't find it in the documentation) that df['b'] = b will semantically create a copy of b, which obviously requires copying of underlying data. But I have no idea if the data in the other columns can stay where it was, or need to be moved sometimes.
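A minimal check of that copy semantics (a sketch using only plain pandas):
import pandas as pd
df = pd.DataFrame({'a': range(100)})
b = pd.Series(range(100))
df['b'] = b
b[0] = -1              # mutate the original Series
print(df.loc[0, 'b'])  # prints 0: df holds its own copy of the data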
Edit:
I know that adding a large number of columns is expensive. I'm only asking about adding a single column.
I also know that adding a row requires copying the data in some cases (or always? I'm not sure), for the obvious reason that the items in a single column have to be in contiguous memory.
From my experiments, I think loc is slower, and aligning a new Series with a different index is the slowest.
But I have no idea if the data in the other columns can stay where it was, or need to be moved sometimes.
I think the data is not moved; new columns are added to the end (there may be exceptions, but I don't know of any).
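One way to probe that claim (a sketch of mine, not part of the original answer; the result can depend on the pandas version and on later block consolidation) is to check whether the existing column still shares memory with a view taken before the insert:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': range(1000000)})
before = df['a'].values  # view of the existing block
df['b'] = pd.Series(range(1000000))
print(np.shares_memory(before, df['a'].values))  # True if 'a' was not moved
The timings below compare the assignment styles themselves: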
# using pandas 0.18.1, python 3.5
import pandas as pd
#len(df) = 10m
df = pd.DataFrame({'a': range(10000000)})
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print (df)
In [36]: %timeit df['b'] = b
10 loops, best of 3: 23.5 ms per loop
In [37]: %timeit df.loc[:, 'c'] = b
The slowest run took 5.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 40 ms per loop
In [38]: %timeit df['d'] = c
10 loops, best of 3: 22.3 ms per loop
In [39]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 39.5 ms per loop
But if the indexes do not match (here df.index is shifted by 15, so b must be realigned):
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(10000000)})
df.index = df.index + 15
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print (df)
In [41]: %timeit df['b'] = b
1 loop, best of 3: 656 ms per loop
In [42]: %timeit df.loc[:, 'c'] = b
1 loop, best of 3: 735 ms per loop
In [43]: %timeit df['d'] = c
10 loops, best of 3: 22.4 ms per loop
In [44]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 56.6 ms per loop
Adding a single new row is fast; I think the cost depends on the length of the Series:
In [68]: %timeit df.loc[10000015, :] = pd.Series([1,2,3,2,4], index=df.columns)
1000 loops, best of 3: 274 µs per loop
But adding many rows one at a time is expensive, and I think it should be avoided.
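One common way to avoid per-row appends (my suggestion, not from the original answer) is to collect the rows first and build the frame once:
import pandas as pd
# Appending row-by-row copies data on every insert;
# building from a list of dicts allocates only once.
rows = [{'a': i, 'b': i * 2} for i in range(1000)]
df = pd.DataFrame(rows)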
What would be the best way to convert this:
deviceid devicetype
0 b569dcb7-4498-4cb4-81be-333a7f89e65f Google
1 04d3b752-f7a1-42ae-8e8a-9322cda4fd7f Android
2 cf7391c5-a82f-4889-8d9e-0a423f132026 Android
into this:
0 {"deviceid":"b569dcb7-4498-4cb4-81be-333a7f89e65f","devicetype":["Google"]}
1 {"deviceid":"04d3b752-f7a1-42ae-8e8a-9322cda4fd7f","devicetype":["Android"]}
2 {"deviceid":"cf7391c5-a82f-4889-8d9e-0a423f132026","devicetype":["Android"]}
I've tried df.to_dict() but that just gives:
{'deviceid': {0: 'b569dcb7-4498-4cb4-81be-333a7f89e65f',
1: '04d3b752-f7a1-42ae-8e8a-9322cda4fd7f',
2: 'cf7391c5-a82f-4889-8d9e-0a423f132026'},
'devicetype': {0: 'Google', 1: 'Android', 2: 'Android'}}
You can use apply with to_json:
In [11]: s = df.apply((lambda x: x.to_json()), axis=1)
In [12]: s[0]
Out[12]: '{"deviceid":"b569dcb7-4498-4cb4-81be-333a7f89e65f","devicetype":"Google"}'
To get the list for the device type you could do this manually:
In [13]: s1 = df.apply((lambda x: {"deviceid": x["deviceid"], "devicetype": [x["devicetype"]]}), axis=1)
In [14]: s1[0]
Out[14]: {'deviceid': 'b569dcb7-4498-4cb4-81be-333a7f89e65f', 'devicetype': ['Google']}
To expand on the previous answer, to_dict() should be a little faster than to_json().
This appears to be true for a larger test data frame, but the to_dict() method is actually a little slower for the example you provided.
Large test set
In [1]: %timeit s = df.apply((lambda x: x.to_json()), axis=1)
Out[1]: 100 loops, best of 3: 5.88 ms per loop
In [2]: %timeit s = df.apply((lambda x: x.to_dict()), axis=1)
Out[2]: 100 loops, best of 3: 3.91 ms per loop
Provided example
In [3]: %timeit s = df.apply((lambda x: x.to_json()), axis=1)
Out[3]: 1000 loops, best of 3: 375 µs per loop
In [4]: %timeit s = df.apply((lambda x: x.to_dict()), axis=1)
Out[4]: 1000 loops, best of 3: 450 µs per loop
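For completeness, a possibly simpler route (my note, not covered in the answers above) is DataFrame.to_dict with orient='records', which produces the row dicts in one call; the devicetype values still need to be wrapped in lists afterwards:
import pandas as pd
df = pd.DataFrame({'deviceid': ['b569dcb7-4498-4cb4-81be-333a7f89e65f'],
                   'devicetype': ['Google']})
records = df.to_dict(orient='records')  # list of one dict per row
for r in records:
    r['devicetype'] = [r['devicetype']]  # wrap to match the target output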