I want to combine two separate DataFrames of the following shape in Python pandas:
df1 =
A B
1 1 2
2 3 4
3 5 6
df2 =
C D
1 a b
2 c d
3 e f
I want to have as follows:
df =
A B C D
1 1 2 a b
2 3 4 c d
3 5 6 e f
I am using the following code:
dat = df1.join(df2)
The problem is that my actual DataFrames have more than 2 million rows each, so this takes a long time and consumes a huge amount of memory.
Is there a faster and more memory-efficient way to do it?
Thank you in advance for helping.
If I've read your question correctly, your indexes align exactly and you just need to combine the columns into a single DataFrame. If that's right, then it turns out that copying over a column from one DataFrame to another is the fastest way to go ([92] and [93] below). f is my DataFrame in the example below:
In [85]: len(f)
Out[85]: 343720
In [87]: a = f.loc[:, ['date_val', 'price']]
In [88]: b = f.loc[:, ['red_date', 'credit_spread']]
In [89]: %timeit c = pd.concat([a, b], axis=1)
100 loops, best of 3: 7.11 ms per loop
In [90]: %timeit c = pd.concat([a, b], axis=1, ignore_index=True)
100 loops, best of 3: 10.8 ms per loop
In [91]: %timeit c = a.join(b)
100 loops, best of 3: 6.47 ms per loop
In [92]: %timeit a['red_date'] = b['red_date']
1000 loops, best of 3: 1.17 ms per loop
In [93]: %timeit a['credit_spread'] = b['credit_spread']
1000 loops, best of 3: 1.16 ms per loop
I also tried copying both columns at once, but for some strange reason it was more than twice as slow as copying each column individually.
In [94]: %timeit a[['red_date', 'credit_spread']] = b[['red_date', 'credit_spread']]
100 loops, best of 3: 5.09 ms per loop
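Applied to the original df1/df2 (a minimal sketch, assuming the indexes are identical as in the question), the column-copy approach looks like this:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 4, 6]}, index=[1, 2, 3])
df2 = pd.DataFrame({'C': ['a', 'c', 'e'], 'D': ['b', 'd', 'f']}, index=[1, 2, 3])

# Copy each column of df2 into df1 one at a time; with identical indexes
# this sidesteps most of the alignment work that join/concat do.
for col in df2.columns:
    df1[col] = df2[col]

print(df1)
#    A  B  C  D
# 1  1  2  a  b
# 2  3  4  c  d
# 3  5  6  e  f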
Given a DataFrame
>>> df
x y z
0 1 a 7
1 2 b 5
2 3 c 7
I would like to find the index of the column by name, e.g., x -> 0, z -> 2, &c.
I can do
>>> list(df.columns).index('y')
1
but it seems backwards (the pandas.indexes.base.Index class should probably be able to do it without circling back to list).
You can use Index.get_loc:
print (df.columns.get_loc('z'))
2
Another solution is Index.searchsorted (note that searchsorted assumes the column labels are sorted, which they are here):
print (df.columns.searchsorted('z'))
2
Timings:
In [86]: %timeit (df.columns.get_loc('z'))
The slowest run took 13.42 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.99 µs per loop
In [87]: %timeit (df.columns.searchsorted('z'))
The slowest run took 10.46 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.48 µs per loop
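For completeness, a small sketch (my own illustration, using the df from the question) showing how the position returned by get_loc can feed into position-based indexing:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c'], 'z': [7, 5, 7]})

# get_loc returns the integer position of a column label,
# which can then be used with iloc for position-based selection.
pos = df.columns.get_loc('z')
print(pos)              # 2
print(df.iloc[:, pos])  # the 'z' column, selected by position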
I have a pandas DataFrame with 1.5 million rows and 8 columns. I want to combine a few columns to create a new column. I know how to do this, but I wanted to know which way is faster and more efficient. I am reproducing my code here:
import pandas as pd
import numpy as np
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
Now here is what I want to achieve
df['D']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The other alternative is to use the apply functionality of pandas
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
I wanted to know which method takes less time when we have 1.5 million rows and 8 columns to combine.
The first method is faster, because it is vectorized:
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
df = pd.concat([df]*10000).reset_index(drop=True)
#[30000 rows x 3 columns]
print (df)
df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
#similar timings with mul function
#df['D1']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
print (df)
In [54]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
The slowest run took 10.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 950 µs per loop
In [55]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.2 ms per loop
In [56]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 928 ms per loop
Another test, on a 1.5M-row DataFrame; the apply method is very slow:
#[1500000 rows x 6 columns]
df = pd.concat([df]*500000).reset_index(drop=True)
In [62]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
10 loops, best of 3: 34.8 ms per loop
In [63]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
10 loops, best of 3: 31.5 ms per loop
In [64]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 47.3 s per loop
Using @jezrael's setup:
df = pd.DataFrame(columns=['A','B','C'], data=[[1,2,3],[4,5,6],[7,8,9]])
df = pd.concat([df]*30000).reset_index(drop=True)
It is far more efficient to use a dot product:
np.array([[.5, .3, .2]]).dot(df.values.T).T
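A variant of the same idea that assigns the result back as a new column (my sketch, reusing .values and .dot from the snippet above):
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['A', 'B', 'C'], data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# One matrix-vector product over the three columns; the weights
# are the 0.5/0.3/0.2 factors from the question.
weights = np.array([0.5, 0.3, 0.2])
df['D'] = df[['A', 'B', 'C']].values.dot(weights)
print(df)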
I have a dataframe df which looks like:
id location grain
0 BBG.XETR.AD.S XETR 16.545
1 BBG.XLON.VB.S XLON 6.2154
2 BBG.XLON.HF.S XLON NaN
3 BBG.XLON.RE.S XLON NaN
4 BBG.XLON.LL.S XLON NaN
5 BBG.XLON.AN.S XLON 3.215
6 BBG.XLON.TR.S XLON NaN
7 BBG.XLON.VO.S XLON NaN
In reality this DataFrame will be much larger. I would like to iterate over it, returning the 'grain' value, but I am only interested in the rows that have a value (not NaN) in the 'grain' column. So the iteration should return only the following values:
16.545
6.2154
3.215
I can iterate over the dataframe using:
for staticidx, row in df.iterrows():
    value = row['grain']
But this returns a value for every row, including those with NaN. Is there a way to either remove the NaN rows from the DataFrame or skip the rows where grain is NaN?
Many thanks
You can specify a list of columns in dropna on which to subset the data:
subset : array-like
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include
>>> df.dropna(subset=['grain'])
id location grain
0 BBG.XETR.AD.S XETR 16.5450
1 BBG.XLON.VB.S XLON 6.2154
5 BBG.XLON.AN.S XLON 3.2150
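If you still need to iterate, one minimal sketch (assuming a df like the one above) is to filter first and only then loop:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['BBG.XETR.AD.S', 'BBG.XLON.VB.S', 'BBG.XLON.HF.S'],
                   'location': ['XETR', 'XLON', 'XLON'],
                   'grain': [16.545, 6.2154, np.nan]})

# Drop the rows whose 'grain' is NaN first, then iterate only over the rest.
for idx, row in df.dropna(subset=['grain']).iterrows():
    print(row['grain'])
# 16.545
# 6.2154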
This:
df[pd.notnull(df['grain'])]
Or this:
df['grain'].dropna()
Let's compare the different methods (on an 800K-row DataFrame):
In [21]: df = pd.concat([df] * 10**5, ignore_index=True)
In [22]: df.shape
Out[22]: (800000, 3)
In [23]: %timeit df.grain[~pd.isnull(df.grain)]
The slowest run took 5.33 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 17.1 ms per loop
In [24]: %timeit df.ix[df.grain.notnull(), 'grain']
10 loops, best of 3: 23.9 ms per loop
In [25]: %timeit df[pd.notnull(df['grain'])]
10 loops, best of 3: 35.9 ms per loop
In [26]: %timeit df.grain.ix[df.grain.notnull()]
100 loops, best of 3: 17.4 ms per loop
In [27]: %timeit df.dropna(subset=['grain'])
10 loops, best of 3: 56.6 ms per loop
In [28]: %timeit df.grain[df.grain.notnull()]
100 loops, best of 3: 17 ms per loop
In [30]: %timeit df['grain'].dropna()
100 loops, best of 3: 16.3 ms per loop
My question is about performance only, not semantics.
Does adding a new column to a df cause the data in the existing DataFrame to be physically copied to a new memory location (to ensure that the DataFrame occupies contiguous memory, for example)?
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(100)})
b = pd.Series(range(100))
df['b'] = b # is this operation expensive?
# equivalently df.loc[:, 'b'] = b
I know (from experimentation; I couldn't find it in the documentation) that df['b'] = b will semantically create a copy of b, which obviously requires copying the underlying data. But I have no idea whether the data in the other columns can stay where it is, or sometimes needs to be moved.
Edit:
I know that adding a large number of columns is expensive. I'm only asking about adding a single column.
I also know that adding a row requires copying the data in some cases (or always? -- not sure), for the obvious reason that the items in a single column have to be in contiguous memory.
From my experiments, loc is slower, and aligning a new Series with a different index is the slowest:
But I have no idea whether the data in the other columns can stay where it is, or sometimes needs to be moved.
I think the data is not moved; new columns are added at the end (there may be exceptions, but I don't know of any).
# using pandas 0.18.1, python 3.5
import pandas as pd
#len(df) = 10m
df = pd.DataFrame({'a': range(10000000)})
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print (df)
In [36]: %timeit df['b'] = b
10 loops, best of 3: 23.5 ms per loop
In [37]: %timeit df.loc[:, 'c'] = b
The slowest run took 5.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 40 ms per loop
In [38]: %timeit df['d'] = c
10 loops, best of 3: 22.3 ms per loop
In [39]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 39.5 ms per loop
But if the index is changed so it no longer aligns:
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(10000000)})
df.index = df.index + 15
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print (df)
In [41]: %timeit df['b'] = b
1 loop, best of 3: 656 ms per loop
In [42]: %timeit df.loc[:, 'c'] = b
1 loop, best of 3: 735 ms per loop
In [43]: %timeit df['d'] = c
10 loops, best of 3: 22.4 ms per loop
In [44]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 56.6 ms per loop
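A common workaround in the mismatched-index case (a sketch, assuming the lengths match and positional assignment is what you actually want) is to assign the raw values, which skips index alignment entirely:
import pandas as pd

df = pd.DataFrame({'a': range(10000000)})
df.index = df.index + 15            # index no longer matches b's default index
b = pd.Series(range(10000000))

# Assigning the underlying ndarray matches rows by position, not by label,
# so pandas does not pay the alignment cost seen in the timings above.
df['b'] = b.values
print(df.head())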
Adding a single new row is fast; I think the cost depends on the length of the Series:
In [68]: %timeit df.loc[10000015, :] = pd.Series([1,2,3,2,4], index=df.columns)
1000 loops, best of 3: 274 µs per loop
But adding many rows this way is expensive, and I think it should be avoided.
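One rough way to probe the original question (my sketch; the result depends on pandas internals and version, so treat it as a hint rather than proof) is to check whether an existing column still shares memory with its old buffer after a new column is added:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(100)})
before = df['a'].values             # buffer backing column 'a' before the insert
df['b'] = pd.Series(range(100))
after = df['a'].values              # buffer backing column 'a' after the insert

# True suggests column 'a' was left in place; False would mean it was copied.
print(np.shares_memory(before, after))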
I have a very large file (5 GB), and I need to count the number of occurrences of value pairs from two columns:
a b c d e
0 2 3 1 5 4
1 2 3 2 5 4
2 1 3 2 5 4
3 2 4 1 5 3
4 2 4 1 5 3
so obviously I have to find
(2,3):2
(1,3):1
(2,4):2
How can I do that in a very fast way?
I used:
df.groupby(['a','b']).count().to_dict()
Let's say that the final result would be
a b freq
2 3 2
1 3 1
2 4 2
Approach for the first version of the question - dictionary as result
If you have high frequencies, i.e. few distinct combinations of a and b, the final dictionary will be small. If you have many different combinations, you will need lots of RAM.
If you have low frequencies and enough RAM, your approach looks good.
Some timings for 5e6 rows and numbers from 0 to 19:
>>> df = pd.DataFrame(np.random.randint(0, 19, size=(5000000, 5)), columns=list('abcde'))
>>> df.shape
(5000000, 5)
%timeit df.groupby(['a','b']).count().to_dict()
1 loops, best of 3: 552 ms per loop
%timeit df.groupby(['a','b']).size()
1 loops, best of 3: 619 ms per loop
%timeit df.groupby(['a','b']).count()
1 loops, best of 3: 588 ms per loop
Using a different range of integers, here up to sys.maxsize (9223372036854775807), changes the timings considerably:
import sys
df = pd.DataFrame(np.random.randint(0, high=sys.maxsize, size=(5000000, 5)),
columns=list('abcde'))
%timeit df.groupby(['a','b']).count().to_dict()
1 loops, best of 3: 41.3 s per loop
%timeit df.groupby(['a','b']).size()
1 loops, best of 3: 11.4 s per loop
%timeit df.groupby(['a','b']).count()
1 loops, best of 3: 12.9 s per loop
Solution for the updated question
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'freq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True)
print(g)
a b freq
0 1 3 1
1 2 3 2
2 2 4 2
It is not much faster though.
For range 0 to 19:
%%timeit
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'freq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True)
1 loops, best of 3: 564 ms per loop
For range 0 to sys.maxsize:
%%timeit
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'freq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True)
1 loops, best of 3: 10.2 s per loop
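As an alternative sketch for the updated question (my suggestion, building on the size() timing above), groupby(...).size() gives the frequency table directly, without dropping or renaming columns:
import pandas as pd

df = pd.DataFrame({'a': [2, 2, 1, 2, 2],
                   'b': [3, 3, 3, 4, 4],
                   'c': [1, 2, 2, 1, 1],
                   'd': [5, 5, 5, 5, 5],
                   'e': [4, 4, 4, 3, 3]})

# size() counts the rows in each (a, b) group; reset_index turns the
# result into a flat DataFrame with a named count column.
freq = df.groupby(['a', 'b']).size().reset_index(name='freq')
print(freq)
#    a  b  freq
# 0  1  3     1
# 1  2  3     2
# 2  2  4     2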