I have a large dataframe where each row contains a string.
I want to split each string into several columns, and also replace two character types.
The code below does the job, but it is slow on a large dataframe. Is there a faster way than using a for loop?
import re
import pandas as pd
df = pd.DataFrame(['[3.4, 3.4, 2.5]', '[3.4, 3.4, 2.5]'])
df_new = pd.DataFrame({'col1': [0,0], 'col2': [0,0], 'col3': [0,0]})
for i in range(df.shape[0]):
    df_new.iloc[i, :] = re.split(',', df.iloc[i, 0].replace('[', '').replace(']', ''))
You can do it with:
import pandas as pd
df = pd.DataFrame(['[3.4, 3.4, 2.5]', '[3.4, 3.4, 2.5]'])
df_new = df[0].str[1:-1].str.split(",", expand=True)
df_new.columns = ["col1", "col2", "col3"]
The idea is to first get rid of the [ and ] and then split by , and expand the dataframe. The last step would be to rename the columns.
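One thing to keep in mind: the split columns are still strings. If you need numeric values, an astype conversion at the end is one option (a small sketch, assuming every entry parses as a float):
# same as above, but convert the split pieces to floats
df_new = df[0].str[1:-1].str.split(",", expand=True).astype(float)
df_new.columns = ["col1", "col2", "col3"]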
Your solution can be rewritten with Series.str.strip and Series.str.split:
df1 = df[0].str.strip('[]').str.split(', ', expand=True).add_prefix('col')
print(df1)
col0 col1 col2
0 3.4 3.4 2.5
1 3.4 3.4 2.5
If performance is important, use a list comprehension instead of the pandas string functions:
df1 = pd.DataFrame([x.strip('[]').split(', ') for x in df[0]]).add_prefix('col')
Timings:
#20k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [208]: %timeit df[0].str.strip('[]').str.split(', ', expand=True).add_prefix('col')
61.5 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [209]: %timeit pd.DataFrame([x.strip('[]').split(', ') for x in df[0]]).add_prefix('col')
29.8 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Related
I have a pandas dataframe where I want to search one column for numbers matching a pattern, find them, and put them in a new column.
import pandas as pd
import regex as re
import numpy as np
data = {'numbers':['134.ABBC,189.DREB, 134.TEB', '256.EHBE, 134.RHECB, 345.DREBE', '456.RHN,256.REBN,864.TREBNSE', '256.DREB, 134.ETNHR,245.DEBHTECM'],
'rate':[434, 456, 454256, 2334544]}
df = pd.DataFrame(data)
print(df)
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = None
index_numbers = df.columns.get_loc('numbers')
index_mynumbers = df.columns.get_loc('mynumbers')
length = np.array([])
for row in range(0, len(df)):
    number = re.findall(pattern, df.iat[row, index_numbers])
    df.iat[row, index_mynumbers] = number
print(df)
I get my numbers: {'mynumbers': ['[134.ABBC, 134.TEB]', '[134.RHECB]', '[134.RHECB]']}. My dataframe is huge. Is there a better, faster method in pandas for going through my df?
Sure, use Series.str.findall instead of loops:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)
print(df)
numbers rate mynumbers
0 134.ABBC,189.DREB, 134.TEB 434 [134.ABBC, 134.TEB]
1 256.EHBE, 134.RHECB, 345.DREBE 456 [134.RHECB]
2 456.RHN,256.REBN,864.TREBNSE 454256 []
3 256.DREB, 134.ETNHR,245.DEBHTECM 2334544 [134.ETNHR]
If you want to use re.findall, that is possible too, only about 2 times slower:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].map(lambda x: re.findall(pattern, x))
# [40000 rows]
df = pd.concat([df] * 10000, ignore_index=True)
pattern = '134.[A-Z]{2,}'
In [46]: %timeit df['numbers'].map(lambda x: re.findall(pattern, x))
50 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [47]: %timeit df['numbers'].str.findall(pattern)
21.2 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
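If you only need the first match per row rather than a list of all matches, Series.str.extract is an option (a sketch; the dot is escaped here so it matches a literal period, which the original pattern does not enforce, and 'first_number' is just an illustrative column name):
# first match only; NaN where nothing matches
df['first_number'] = df['numbers'].str.extract(r'(134\.[A-Z]{2,})', expand=False)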
I have a dataframe in which one column represents some data, the other column represents indices on which I want to delete from my data. So starting from this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'data':[np.arange(1,5),np.arange(3)],'to_delete': [np.array([2]),np.array([0,2])]})
df
>>>> data to_delete
[1,2,3,4] [2]
[0,1,2] [0,2]
This is what I want to end up with:
new_df
>>>> data to_delete
[1,2,4] [2]
[1] [0,2]
I could iterate over the rows by hand and calculate the new data for each one like this:
new_data = []
for _,v in df.iterrows():
    foo = np.delete(v['data'], v['to_delete'])
    new_data.append(foo)
df.assign(data=new_data)
but I'm looking for a better way to do this.
The overhead from calling a numpy function for each row will really hurt performance here. I'd suggest going with plain Python lists instead:
df['data'] = [[j for ix, j in enumerate(i[0]) if ix not in i[1]]
              for i in df.values]
print(df)
data to_delete
0 [1, 2, 4] [2]
1 [1] [0, 2]
Timings on a 20K row dataframe:
df_large = pd.concat([df]*10000, axis=0)
%timeit [[j for ix, j in enumerate(i[0]) if ix not in i[1]]
for i in df_large.values]
# 184 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
new_data = []
for _,v in df_large.iterrows():
    foo = np.delete(v['data'], v['to_delete'])
    new_data.append(foo)
# 5.44 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_large.apply(lambda row: np.delete(row["data"],
row["to_delete"]), axis=1)
# 5.29 s ± 340 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
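If you prefer to keep the row values as numpy arrays rather than plain lists, a zip-based comprehension also avoids iterrows, although the per-row np.delete calls still cost more than the pure-list version above (a sketch):
# one np.delete call per row, but no iterrows overhead
df['data'] = [np.delete(d, idx) for d, idx in zip(df['data'], df['to_delete'])]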
You should use the apply function in order to apply a function to every row in the dataframe:
df["data"] = df.apply(lambda row: np.delete(row["data"], row["to_delete"]), axis=1)
Another solution, based on starmap:
This solution is based on a lesser-known tool from the itertools module called starmap.
Check its docs, it's worth a try!
import pandas as pd
import numpy as np
from itertools import starmap
df = pd.DataFrame({'data': [np.arange(1,5),np.arange(3)],
'to_delete': [np.array([2]),np.array([0,2])]})
# Solution:
df2 = df.copy()
A = list(starmap(lambda v, l: np.delete(v, l),
                 zip(df['data'], df['to_delete'])))
df2['data'] = A
df2
prints out:
data to_delete
0 [1, 2, 4] [2]
1 [1] [0, 2]
I have a pandas dataframe in Python as below:
df['column']
0    [abc, mno]
1    [mno, pqr]
2    [abc, mno]
3    [mno, pqr]
I want to get the count of each item, as below:
abc = 2,
mno = 4,
pqr = 2
I can iterate over each row to count, but this is not the kind of solution I'm looking for.
If there is any way where I can use iloc or anything related to that, please suggest to me.
I have looked at various solutions with a similar problem but none of them satisfied my scenario.
Here is how I'd solve it using .explode() and .value_counts(); you can then assign the result as a new column or do as you please with the output.
In one line:
print(df.explode('column')['column'].value_counts())
Full example:
import pandas as pd
data_1 = {'index':[0,1,2,3],'column':[['abc','mno'],['mno','pqr'],['abc','mno'],['mno','pqr']]}
df = pd.DataFrame(data_1)
df = df.set_index('index')
print(df)
column
index
0 [abc, mno]
1 [mno, pqr]
2 [abc, mno]
3 [mno, pqr]
Here we perform the .explode() to create individual values from the lists and value_counts() to count repetition of unique values:
df_new = df.explode('column')
print(df_new['column'].value_counts())
Output:
mno 4
abc 2
pqr 2
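If you would rather have the counts as a plain dictionary, chaining to_dict works (a small sketch):
counts = df.explode('column')['column'].value_counts().to_dict()
# counts -> {'mno': 4, 'abc': 2, 'pqr': 2}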
Use collections.Counter
from collections import Counter
from itertools import chain
Counter(chain.from_iterable(df.column))
Out[196]: Counter({'abc': 2, 'mno': 4, 'pqr': 2})
Timings on a 40k row dataframe:
df1 = pd.concat([df]*10000, ignore_index=True)
In [227]: %timeit pd.Series(Counter(chain.from_iterable(df1.column)))
14.3 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [228]: %timeit df1.column.explode().value_counts()
127 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I want to parse row values as column names and use them to look up values in a pandas dataframe.
I tried iterrows and .loc indexing without success.
import pandas as pd
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
Build a toy dataset:
coltable = StringIO("""NA;NB;NC;ND;pair;desired_result
10;60;50;20;NANB;70
20;30;10;5;NANC;30
40;30;20;10;NCND;30
""")
df = pd.read_csv(coltable, sep=";")
I want to access the column elements of the pair (e.g. first row NA=10 and NB=60) and use those values to create a new column (desired_result = 10 + 60 = 70).
I want the function that creates the new column to be compatible with np.vectorize, as the dataset is huge.
Something like this:
df['newcol'] = np.vectorize(myfunc)(pair=df['pair'])
thanks a lot for any assistance you can give!
Use DataFrame.lookup:
a = df.lookup(df.index, df['pair'].str[:2])
b = df.lookup(df.index, df['pair'].str[2:])
df['new'] = a + b
print (df)
NA NB NC ND pair desired_result new
0 10 60 50 20 NANB 70 70
1 20 30 10 5 NANC 30 30
2 40 30 20 10 NCND 30 30
Also, if there are no missing values, it is possible to use a list comprehension or apply:
#repeat dataframe 10000 times
df = pd.concat([df] * 10000, ignore_index=True)
In [263]: %%timeit
...: a = df.lookup(df.index, df['pair'].str[:2])
...: b = df.lookup(df.index, df['pair'].str[2:])
...:
...: df['new'] = a + b
...:
59.5 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %%timeit
...: a = df.lookup(df.index, [x[:2] for x in df['pair']])
...: b = df.lookup(df.index, [x[2:] for x in df['pair']])
...:
...: df['new'] = a + b
...:
60.8 ms ± 963 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [265]: %%timeit
...: a = df.lookup(df.index, df['pair'].apply(lambda x: x[:2]))
...: b = df.lookup(df.index, df['pair'].apply(lambda x: x[2:]))
...:
...: df['new'] = a + b
...:
...:
56.6 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
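A caveat: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, an equivalent can be built with factorize and numpy indexing, roughly following the replacement suggested in the pandas docs (a sketch):
import numpy as np  # needed for the row indexing below

idx_a, cols_a = pd.factorize(df['pair'].str[:2])
idx_b, cols_b = pd.factorize(df['pair'].str[2:])
a = df.reindex(cols_a, axis=1).to_numpy()[np.arange(len(df)), idx_a]
b = df.reindex(cols_b, axis=1).to_numpy()[np.arange(len(df)), idx_b]
df['new'] = a + b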
Is it possible to call the apply function on multiple columns in pandas, and if so, how does one do this? For example,
df['Duration'] = df['Hours', 'Mins', 'Secs'].apply(lambda x,y,z: timedelta(hours=x, minutes=y, seconds=z))
This is what the expected output should look like once everything comes together
Thank you.
You should use:
df['Duration'] = pd.to_timedelta(df.Hours*3600 + df.Mins*60 + df.Secs, unit='s')
When you use apply on a DataFrame with axis=1, it's a row calculation, so typically this syntax makes sense:
df['Duration'] = df.apply(lambda row: pd.Timedelta(hours=row.Hours, minutes=row.Mins,
seconds=row.Secs), axis=1)
Some timings:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Hours': np.tile([1,2,3,4],50),
'Mins': np.tile([10,20,30,40],50),
'Secs': np.tile([11,21,31,41],50)})
%timeit pd.to_timedelta(df.Hours*3600 + df.Mins*60 + df.Secs, unit='s')
#432 µs ± 5.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.apply(lambda row: pd.Timedelta(hours=row.Hours, minutes=row.Mins, seconds=row.Secs), axis=1)
#12 ms ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As always, apply should be a last resort.
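For reference, the same result can also be built from three vectorized to_timedelta calls, one per unit, instead of converting everything to seconds first (a sketch):
# one vectorized conversion per unit, then add the timedeltas
df['Duration'] = (pd.to_timedelta(df.Hours, unit='h')
                  + pd.to_timedelta(df.Mins, unit='m')
                  + pd.to_timedelta(df.Secs, unit='s'))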
Use apply on the dataframe with axis=1
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
triangles = [{ 'base': 20, 'height': 9 }, { 'base': 10, 'height': 7 }, { 'base': 40, 'height': 4 }]
triangles_df = pd.DataFrame(triangles)
def calculate_area(row):
    return row['base'] * row['height'] * 0.5
triangles_df.apply(calculate_area, axis=1)
Good luck!
This might help.
import pandas as pd
import datetime as DT
df = pd.DataFrame({"Hours": [1], "Mins": [2], "Secs": [10]})
df = df.astype(int)
df['Duration'] = df[['Hours', 'Mins', 'Secs']].apply(lambda x: DT.timedelta(hours=x[0], minutes=x[1], seconds=x[2]), axis=1)
print(df)
print(df["Duration"])
Output:
Hours Mins Secs Duration
0 1 2 10 01:02:10
0 01:02:10
dtype: timedelta64[ns]