Split and replace all strings in a pandas dataframe - python

I have a large dataframe where each row contains a string.
I want to split each string into several columns, and also replace two character types.
The code below does the job, but it is slow on a large dataframe. Is there a faster way than using a for loop?
import re
import pandas as pd
df = pd.DataFrame(['[3.4, 3.4, 2.5]', '[3.4, 3.4, 2.5]'])
df_new = pd.DataFrame({'col1': [0,0], 'col2': [0,0], 'col3': [0,0]})
for i in range(df.shape[0]):
    df_new.iloc[i, :] = re.split(',', df.iloc[i, 0].replace('[', '').replace(']', ''))

You can do it with:
import pandas as pd
df = pd.DataFrame(['[3.4, 3.4, 2.5]', '[3.4, 3.4, 2.5]'])
df_new = df[0].str[1:-1].str.split(",", expand=True)
df_new.columns = ["col1", "col2", "col3"]
The idea is to first get rid of the [ and ] characters, then split on , and expand the result into separate columns. The last step is to rename the columns.
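Note that the split leaves string values; if numeric columns are needed, a minimal follow-up sketch (assuming every entry parses as a float):
import pandas as pd
df = pd.DataFrame(['[3.4, 3.4, 2.5]', '[3.4, 3.4, 2.5]'])
# strip the brackets, split on commas, then convert the string pieces to floats
df_new = df[0].str[1:-1].str.split(",", expand=True).astype(float)
df_new.columns = ["col1", "col2", "col3"]
print(df_new.dtypes)  # all three columns are float64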

Your solution can be rewritten with Series.str.strip and Series.str.split:
df1 = df[0].str.strip('[]').str.split(', ', expand=True).add_prefix('col')
print(df1)
col0 col1 col2
0 3.4 3.4 2.5
1 3.4 3.4 2.5
If performance is important, use a list comprehension instead of the pandas string functions:
df1 = pd.DataFrame([x.strip('[]').split(', ') for x in df[0]]).add_prefix('col')
Timings:
#20k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [208]: %timeit df[0].str.strip('[]').str.split(', ', expand=True).add_prefix('col')
61.5 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [209]: %timeit pd.DataFrame([x.strip('[]').split(', ') for x in df[0]]).add_prefix('col')
29.8 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


pandas better runtime, going through dataframe

I have a pandas dataframe where I want to search one column for numbers matching a pattern, find them, and put them in a new column.
import pandas as pd
import regex as re
import numpy as np
data = {'numbers':['134.ABBC,189.DREB, 134.TEB', '256.EHBE, 134.RHECB, 345.DREBE', '456.RHN,256.REBN,864.TREBNSE', '256.DREB, 134.ETNHR,245.DEBHTECM'],
'rate':[434, 456, 454256, 2334544]}
df = pd.DataFrame(data)
print(df)
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = None
index_numbers = df.columns.get_loc('numbers')
index_mynumbers = df.columns.get_loc('mynumbers')
length = np.array([])
for row in range(0, len(df)):
    number = re.findall(pattern, df.iat[row, index_numbers])
    df.iat[row, index_mynumbers] = number
print(df)
I get my numbers: {'mynumbers': [['134.ABBC', '134.TEB'], ['134.RHECB'], [], ['134.ETNHR']]}. My dataframe is huge. Is there a better, faster method in pandas for going through my df?
Sure, use Series.str.findall instead of loops:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)
print(df)
numbers rate mynumbers
0 134.ABBC,189.DREB, 134.TEB 434 [134.ABBC, 134.TEB]
1 256.EHBE, 134.RHECB, 345.DREBE 456 [134.RHECB]
2 456.RHN,256.REBN,864.TREBNSE 454256 []
3 256.DREB, 134.ETNHR,245.DEBHTECM 2334544 [134.ETNHR]
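If you'd rather store the matches as a single string than a list, a small sketch using Series.str.join (assuming a comma-separated result is acceptable):
df['mynumbers'] = df['numbers'].str.findall(pattern).str.join(', ')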
If you want to use re.findall, it is possible, only about 2 times slower:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].map(lambda x: re.findall(pattern, x))
# [40000 rows]
df = pd.concat([df] * 10000, ignore_index=True)
pattern = '134.[A-Z]{2,}'
In [46]: %timeit df['numbers'].map(lambda x: re.findall(pattern, x))
50 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [47]: %timeit df['numbers'].str.findall(pattern)
21.2 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas change values in column based on values in other column

I have a dataframe in which one column contains some data and the other column contains the indices at which I want to delete elements from that data. So starting from this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'data':[np.arange(1,5),np.arange(3)],'to_delete': [np.array([2]),np.array([0,2])]})
df
>>>> data to_delete
[1,2,3,4] [2]
[0,1,2] [0,2]
This is what I want to end up with:
new_df
>>>> data to_delete
[1,2,4] [2]
[1] [0,2]
I could iterate over the rows by hand and calculate the new data for each one like this:
new_data = []
for _, v in df.iterrows():
    foo = np.delete(v['data'], v['to_delete'])
    new_data.append(foo)
df.assign(data=new_data)
but I'm looking for a better way to do this.
The overhead from calling a NumPy function for each row will really hurt performance here. I'd suggest you go with lists instead:
df['data'] = [[j for ix, j in enumerate(i[0]) if ix not in i[1]]
              for i in df.values]
print(df)
data to_delete
0 [1, 2, 4] [2]
1 [1] [0, 2]
Timings on a 20K row dataframe:
df_large = pd.concat([df]*10000, axis=0)
%timeit [[j for ix, j in enumerate(i[0]) if ix not in i[1]]
for i in df_large.values]
# 184 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
new_data = []
for _, v in df_large.iterrows():
    foo = np.delete(v['data'], v['to_delete'])
    new_data.append(foo)
# 5.44 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_large.apply(lambda row: np.delete(row["data"],
row["to_delete"]), axis=1)
# 5.29 s ± 340 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You should use the apply function in order to apply a function to every row in the dataframe:
df["data"] = df.apply(lambda row: np.delete(row["data"], row["to_delete"]), axis=1)
Another solution, based on starmap:
This solution is based on a lesser-known tool from the itertools module called starmap.
Check its docs; it's worth a try!
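To get a feel for starmap on its own, a tiny standalone example (each tuple is unpacked into the function's arguments, as in the itertools docs):
from itertools import starmap
list(starmap(pow, [(2, 5), (3, 2), (10, 3)]))   # [32, 9, 1000]
Applied to the dataframe problem: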
import pandas as pd
import numpy as np
from itertools import starmap
df = pd.DataFrame({'data': [np.arange(1,5),np.arange(3)],
'to_delete': [np.array([2]),np.array([0,2])]})
# Solution:
df2 = df.copy()
A = list(starmap(lambda v, l: np.delete(v, l),
                 zip(df['data'], df['to_delete'])))
df2['data'] = A
df2
prints out:
data to_delete
0 [1, 2, 4] [2]
1 [1] [0, 2]

How to get the frequency of each element in a column (containing arrays of strings) of a data frame with pandas?

I have a pandas data frame in Python as below:
df['column'] = [abc, mno]
               [mno, pqr]
               [abc, mno]
               [mno, pqr]
I want to get the count of each item, as below:
abc = 2,
mno = 4,
pqr = 2
I can iterate over each row to count, but this is not the kind of solution I'm looking for.
If there is any way I can use iloc or anything related to that, please suggest it.
I have looked at various solutions to similar problems, but none of them fit my scenario.
Here is how I'd solve it using .explode() and .value_counts(); you can then assign the result as a column or do as you please with the output:
In one line:
print(df.explode('column')['column'].value_counts())
Full example:
import pandas as pd
data_1 = {'index':[0,1,2,3],'column':[['abc','mno'],['mno','pqr'],['abc','mno'],['mno','pqr']]}
df = pd.DataFrame(data_1)
df = df.set_index('index')
print(df)
column
index
0 [abc, mno]
1 [mno, pqr]
2 [abc, mno]
3 [mno, pqr]
Here we perform the .explode() to create individual values from the lists and value_counts() to count repetition of unique values:
df_new = df.explode('column')
print(df_new['column'].value_counts())
Output:
mno 4
abc 2
pqr 2
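If you also want those frequencies attached back to each row, a minimal sketch (the column name 'freqs' is just a hypothetical choice):
counts = df_new['column'].value_counts()
# map every element of each row's list to its overall count
df['freqs'] = df['column'].map(lambda lst: [counts[x] for x in lst])
print(df)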
Use collections.Counter
from collections import Counter
from itertools import chain
Counter(chain.from_iterable(df.column))
Out[196]: Counter({'abc': 2, 'mno': 4, 'pqr': 2})
Timings:
df1 = pd.concat([df]*10000, ignore_index=True)
In [227]: %timeit pd.Series(Counter(chain.from_iterable(df1.column)))
14.3 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [228]: %timeit df1.column.explode().value_counts()
127 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Parse row values as columns and use them to look up values

I want to parse row values as columns and use them to look up values in a pandas dataframe.
I tried iterrows and .loc indexing without success.
import pandas as pd
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
Build the toy dataset:
coltable = StringIO("""NA;NB;NC;ND;pair;desired_result
10;60;50;20;NANB;70
20;30;10;5;NANC;30
40;30;20;10;NCND;30
""")
df = pd.read_csv(coltable, sep=";")
I want to access the column elements given by the pair (e.g. first row: NA=10 and NB=60) and use those values to create a new column (desired_result = 10 + 60 = 70).
I want the function that creates the new column to be compatible with np.vectorize, as the dataset is huge.
Something like this:
df['newcol'] = np.vectorize(myfunc)(pair=df['pair'])
thanks a lot for any assistance you can give!
Use DataFrame.lookup:
a = df.lookup(df.index, df['pair'].str[:2])
b = df.lookup(df.index, df['pair'].str[2:])
df['new'] = a + b
print (df)
NA NB NC ND pair desired_result new
0 10 60 50 20 NANB 70 70
1 20 30 10 5 NANC 30 30
2 40 30 20 10 NCND 30 30
If there are no missing values, it is also possible to use a list comprehension or apply:
#repeat dataframe 10000 times
df = pd.concat([df] * 10000, ignore_index=True)
In [263]: %%timeit
...: a = df.lookup(df.index, df['pair'].str[:2])
...: b = df.lookup(df.index, df['pair'].str[2:])
...:
...: df['new'] = a + b
...:
59.5 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %%timeit
...: a = df.lookup(df.index, [x[:2] for x in df['pair']])
...: b = df.lookup(df.index, [x[2:] for x in df['pair']])
...:
...: df['new'] = a + b
...:
60.8 ms ± 963 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [265]: %%timeit
...: a = df.lookup(df.index, df['pair'].apply(lambda x: x[:2]))
...: b = df.lookup(df.index, df['pair'].apply(lambda x: x[2:]))
...:
...: df['new'] = a + b
...:
...:
56.6 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
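Note that DataFrame.lookup was deprecated and later removed in newer pandas versions; if it is unavailable, a minimal sketch of an equivalent lookup with NumPy integer indexing (assuming every prefix and suffix in pair is a valid column name):
import numpy as np
rows = np.arange(len(df))
vals = df.to_numpy()   # object array; the addressed cells hold the numeric values
a = vals[rows, df.columns.get_indexer(df['pair'].str[:2])]
b = vals[rows, df.columns.get_indexer(df['pair'].str[2:])]
df['new'] = a + b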

How to use apply function on multiple columns at once

Is it possible to call the apply function on multiple columns in pandas, and if so, how does one do this? For example,
df['Duration'] = df['Hours', 'Mins', 'Secs'].apply(lambda x,y,z: timedelta(hours=x, minutes=y, seconds=z))
The expected output is a Duration column that combines the Hours, Mins and Secs values.
Thank you.
You should use:
df['Duration'] = pd.to_timedelta(df.Hours*3600 + df.Mins*60 + df.Secs, unit='s')
When you use apply on a DataFrame with axis=1, it's a row calculation, so typically this syntax makes sense:
df['Duration'] = df.apply(lambda row: pd.Timedelta(hours=row.Hours, minutes=row.Mins,
seconds=row.Secs), axis=1)
Some timings
import pandas as pd
import numpy as np
df = pd.DataFrame({'Hours': np.tile([1,2,3,4],50),
'Mins': np.tile([10,20,30,40],50),
'Secs': np.tile([11,21,31,41],50)})
%timeit pd.to_timedelta(df.Hours*3600 + df.Mins*60 + df.Secs, unit='s')
#432 µs ± 5.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.apply(lambda row: pd.Timedelta(hours=row.Hours, minutes=row.Mins, seconds=row.Secs), axis=1)
#12 ms ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As always, apply should be a last resort.
Use apply on the dataframe with axis=1
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
triangles = [{ 'base': 20, 'height': 9 }, { 'base': 10, 'height': 7 }, { 'base': 40, 'height': 4 }]
triangles_df = pd.DataFrame(triangles)
def calculate_area(row):
    return row['base'] * row['height'] * 0.5
triangles_df.apply(calculate_area, axis=1)
Good luck!
This might help.
import pandas as pd
import datetime as DT
df = pd.DataFrame({"Hours": [1], "Mins": [2], "Secs": [10]})
df = df.astype(int)
df['Duration'] = df[['Hours', 'Mins', 'Secs']].apply(lambda x: DT.timedelta(hours=x[0], minutes=x[1], seconds=x[2]), axis=1)
print(df)
print(df["Duration"])
Output:
Hours Mins Secs Duration
0 1 2 10 01:02:10
0 01:02:10
dtype: timedelta64[ns]
