Split DataFrame string column into N splits - python

I have a df:
a name
1 a/b/c
2 w/x/y/z
3 q/w/e/r/t
I want to split the name column on '/' to get this output:
id name main sub leaf
1 a/b/c a b c
2 w/x/y/z w x z
3 q/w/e/r/t q w t
i.e. the first two segments should go to main and sub respectively, and leaf should be filled with the word after the last slash.
I tried using this, but the result was incorrect:
df['name'].str.split('/', expand=True).rename(columns={0:'main',1:'sub',2:'leaf'})
Is there any way to assign the columns?

Use split with assign:
s = df['name'].str.split('/')
df = df.assign(main=s.str[0], sub=s.str[1], leaf=s.str[-1])
print (df)
a name leaf main sub
0 1 a/b/c c a b
1 2 w/x/y/z z w x
2 3 q/w/e/r/t t q w
To change the order of the columns:
s = df['name'].str.split('/')
df = df.assign(main=s.str[0], sub=s.str[1], leaf=s.str[-1])
df = df[df.columns[:-3].tolist() + ['main','sub','leaf']]
print (df)
a name main sub leaf
0 1 a/b/c a b c
1 2 w/x/y/z w x z
2 3 q/w/e/r/t q w t
Or:
s = df['name'].str.split('/')
df = df.join(pd.DataFrame({'main': s.str[0], 'sub': s.str[1], 'leaf': s.str[-1]},
                          columns=['main','sub','leaf']))
print (df)
a name main sub leaf
0 1 a/b/c a b c
1 2 w/x/y/z w x z
2 3 q/w/e/r/t q w t

Option 1
Use str.split, but don't expand the result; you should end up with a column of lists. Next, use df.assign to assign the new columns and return a new DataFrame object.
v = df['name'].str.split('/')
df.assign(
    main=v.str[ 0],
    sub=v.str[ 1],
    leaf=v.str[-1]
)
name leaf main sub
a
1 a/b/c c a b
2 w/x/y/z z w x
3 q/w/e/r/t t q w
Details
This is what v looks like:
a
1 [a, b, c]
2 [w, x, y, z]
3 [q, w, e, r, t]
Name: name, dtype: object
This is actually a lot easier to handle, because you have greater control over elements with the .str accessor. If you expand the result, you have to snap your ragged data to a tabular format to fit into a new DataFrame object, thereby introducing Nones. At that point, indexing (finding the ith or ith-last element) becomes a chore.
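For contrast, here is a sketch of what the expanded form looks like for this data (output spacing approximate):
expanded = df['name'].str.split('/', expand=True)
print(expanded)
   0  1  2     3     4
a
1  a  b  c  None  None
2  w  x  y     z  None
3  q  w  e     r     t
Getting the last element per row now requires something like expanded.apply(lambda r: r.dropna().iloc[-1], axis=1), whereas v.str[-1] does it directly.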
Option 2
Using direct assignment (to maintain order) -
df['main'] = v.str[ 0]
df['sub' ] = v.str[ 1]
df['leaf'] = v.str[-1]
df
name main sub leaf
a
1 a/b/c a b c
2 w/x/y/z w x z
3 q/w/e/r/t q w t
Note that this modifies the original dataframe instead of returning a new one, so it is cheaper. However, it becomes unwieldy if you have a large number of columns.
You might instead consider this alternative which should generalise to many more columns:
for c, i in [('main', 0), ('sub', 1), ('leaf', -1)]:
    df[c] = v.str[i]
df
name main sub leaf
a
1 a/b/c a b c
2 w/x/y/z w x z
3 q/w/e/r/t q w t
Iterate over a list of tuples. The first element in a tuple is the column name, and the second is the corresponding index to pick the result from v. You still have to assign each one separately, whether you like it or not. Using a loop would probably be a clean way of doing it.
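The same loop can also be phrased as a single assign call via a dict comprehension, which preserves column order on Python 3.6+ (a sketch of the same idea, not from the original answer):
spec = [('main', 0), ('sub', 1), ('leaf', -1)]
df = df.assign(**{c: v.str[i] for c, i in spec})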

Related

Is it possible to split a column value and add a new column at the same time for dataframe?

I have a dataframe with some columns delimited with '|', and I need to flatten this dataframe. Example:
name type
a l
b m
c|d|e n
For this df, I want to flatten it to:
name type
a l
b m
c n
d n
e n
To do this, I used this command:
df = df.assign(name=df.name.str.split('|')).explode('name').drop_duplicates()
Now, I want to do one more thing besides the above flatten operation:
name type co_occur
a l
b m
c n d
c n e
d n e
That is, not only split the 'c|d|e' into multiple rows, but also create a new column which contains a 'co_occur' relationship, in which 'c', 'd' and 'e' co-occur with each other.
I don't see an easy way to do this by modifying:
df = df.assign(name=df.name.str.split('|')).explode('name').drop_duplicates()
I think this is what you want. Use combinations and piece everything together:
from itertools import combinations
import io
import pandas as pd

data = '''name type
a l
b m
c|d|e n
j|k o
f|g|h|i p
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+', engine='python')

# hold the new dataframes as you iterate via apply()
df_hold = []

def explode_combos(x):
    combos = list(combinations(x['name'].split('|'), 2))
    df_hold.append(pd.DataFrame([{'name': c[0], 'type': x['type'], 'co_cur': c[1]}
                                 for c in combos]))
    return

# only apply() to those rows that need to be exploded
dft = df[df['name'].str.contains(r'\|')].apply(explode_combos, axis=1)
# concatenate the result
dfn = pd.concat(df_hold)
# add back the rows that weren't operated on (note the ~)
df_final = pd.concat([df[~df['name'].str.contains(r'\|')], dfn]).fillna('')
name type co_cur
0 a l
1 b m
0 c n d
1 c n e
2 d n e
0 j o k
0 f p g
1 f p h
2 f p i
3 g p h
4 g p i
5 h p i
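For reference, a minimal illustration of what combinations produces for one exploded name, which is the heart of the approach above:
from itertools import combinations
list(combinations('c|d|e'.split('|'), 2))
# [('c', 'd'), ('c', 'e'), ('d', 'e')]
Each pair becomes one output row, with the first element as name and the second as co_cur.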

Pandas DataFrame conditional selection with list comprehension

I have a dataframe with 15 columns named 0, 1, 2, ..., 14. I would like to write a method that takes in this data and a vector of length 15, and returns the dataframe conditionally selected based on the vector passed. E.g. the data passed is data_ and the vector passed is v_.
I would like to produce:
data[(data[0] == v_[0]) & (data[1] == v_[1]) & ... & (data[14] == v_[14])]
However I would like the method to be flexible, e.g. I could pass in a dataframe of 100 columns named 0, ..., 99 and a vector of length 99. My problem is that I do not know how to cleverly create [(data[0] == v_[0]) & (data[1] == v_[1]) & ... & (data[14] == v_[14])] programmatically to account for the "&" sign. Equally, I would be satisfied if someone gave me a method that could merge multiple NxM matrices filled with True and False on "and" or "or" into a single NxM matrix.
Thank you very much!
You can try this:
def custom_filter(data, v):
    if len(data.columns) == len(v):
        # If data has the same number of columns
        # as v has elements
        mask = (data == v).all(axis=1)
    else:
        # If they have a different length, we'll need to subset
        # the data first, then create our mask.
        # This attempts to subset the dataframe by assuming columns
        # 0 .. len(v) - 1 exist as columns, and will throw an error
        # otherwise
        colnames = list(range(len(v)))
        mask = (data[colnames] == v).all(axis=1)
    return data.loc[mask, :]
df = pd.DataFrame({
    "F": list("hiadsfin"),
    0: list("aaaabbbb"),
    1: list("cccdddee"),
    2: list("ffgghhij"),
    "H": list(range(1, 9))
})
v = ["a", "c", "f"]
df
F 0 1 2 H
0 h a c f 1
1 i a c f 2
2 a a c g 3
3 d a d g 4
4 s b d h 5
5 f b d h 6
6 i b e i 7
7 n b e j 8
custom_filter(df, v)
F 0 1 2 H
0 h a c f 1
1 i a c f 2
Note that with this function, if the number of columns exactly matches the length of your vector v, then you do not need to ensure the columns are labelled as 0, 1, 2, ..., len(v)-1. However if you have more columns than elements of v, you need to ensure that a subset of those columns are labelled as 0, 1, 2, ..., len(v)-1. If v is longer than there are columns in your dataframe, this will throw an error.
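As an aside, the row mask built by .all(axis=1) above can also be produced by explicitly reducing the per-column comparisons, which mirrors the chained & from the question (a sketch, assuming the df and v defined above):
import numpy as np

mask = np.logical_and.reduce([df[i] == v[i] for i in range(len(v))])
df[mask]
Swapping in np.logical_or.reduce gives the "or" variant asked about.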
This might work:
data[(data==v_.transpose())].dropna(axis=1)

Take the difference of all elements of a series with the previous ones in python pandas

I have a dataframe with sorted values labeled by ids, and I want to take the difference between the value of the first element of an id and the value of the last elements of all previous ids. The code below does what I want:
import pandas as pd

a = 'a'; b = 'b'; c = 'c'
df = pd.DataFrame(data=[*zip([a, a, a, b, b, c, a], [1, 2, 3, 5, 6, 7, 8])],
                  columns=['id', 'value'])
print(df)

current_id = ''; prev_values = {}; diffs = {}
for t in df.itertuples(index=False):
    prev_values[t.id] = t.value
    if current_id != t.id:
        current_id = t.id
    else:
        continue
    for k, v in prev_values.items():
        if k == current_id:
            continue
        diffs[(k, current_id)] = t.value - v
print(pd.DataFrame(data=diffs.values(), columns=['diff'], index=diffs.keys()))
prints:
id value
0 a 1
1 a 2
2 a 3
3 b 5
4 b 6
5 c 7
6 a 8
diff
a b 2
c 4
b c 1
a 2
c a 1
I want to do this in a vectorized manner however. I have found a way of getting the series of last elements as in:
# take the last value for a particular id
last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
print(last_value_for_id)
which gives me:
id value
2 a 3
4 b 6
5 c 7
but I can't find a way of using this to take the diffs in a vectorized manner.
Depending on how many ids you have, this works with a few thousand:
import numpy as np

# enumerate ids, should be careful
ids = [a, b, c]
num_ids = len(ids)
# compute first and last value per id
f = df.groupby('id').value.agg(['first', 'last'])
# lower-triangle mask
mask = np.array([[i >= j for j in range(num_ids)] for i in range(num_ids)])
# compute diff of first and last, then mask
diff = np.where(mask, None, f['first'].values[None, :] - f['last'].values[:, None])
diff = pd.DataFrame(diff, index=ids, columns=ids)
# stack
diff.stack()
output:
a b 2
c 4
b c 1
dtype: object
Edit for updated data:
For the updated data, the approach is similar once we create the f table:
# create blocks of consecutive id
blocks = df['id'].ne(df['id'].shift()).cumsum()
# groupby
groups = df.groupby(blocks)
# create first and last values
df['fv'] = groups.value.transform('first')
df['lv'] = groups.value.transform('last')
# the above f and ids
# note the column name change
f = df[['id','fv', 'lv']].drop_duplicates()
ids = f['id'].values
num_ids = len(ids)
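The masking and stacking then proceed exactly as above, only with the renamed columns (a sketch completing the edit):
mask = np.array([[i >= j for j in range(num_ids)] for i in range(num_ids)])
diff = np.where(mask, None, f['fv'].values[None, :] - f['lv'].values[:, None])
diff = pd.DataFrame(diff, index=ids, columns=ids)
diff.stack()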
Output:
a b 2
c 4
a 5
b c 1
a 2
c a 1
dtype: object
If you want to go further and drop the index (a,a), well, I'm so lazy :D.
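For completeness, one possible way to drop the pairs whose two ids are equal (a sketch):
out = diff.stack()
out = out[out.index.get_level_values(0) != out.index.get_level_values(1)]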
My method
s = df.groupby(df.id.shift().ne(df.id).cumsum()).agg({'id': 'first', 'value': ['min', 'max']})
s.columns = s.columns.droplevel(0)
t = s['min'].values[:, None] - s['max'].values
t = t.astype(float)
The rest is all reshaping, to match your output:
t[np.triu_indices(t.shape[1], 0)] = np.nan
newdf = pd.DataFrame(t, index=s['first'], columns=s['first'])
newdf.values[newdf.index.values[:, None] == newdf.index.values] = np.nan
newdf = newdf.T.stack()
newdf
Out[933]:
first first
a b 2.0
c 4.0
b c 1.0
a 2.0
c a 1.0
dtype: float64

Python pandas apply too slow Fuzzy Match

def fuzzy_clean(i, dfr, merge_list, key):
    for col in range(0, len(merge_list)):
        if col == 0:
            scaled_down = dfr[dfr[merge_list[col]] == i[merge_list[col]]]
        else:
            scaled_down = scaled_down[scaled_down[merge_list[col]] == i[merge_list[col]]]
    if len(scaled_down) > 0:
        if i[key] in scaled_down[key].values.tolist():
            return i[key]
        else:
            # pick the value of `key` in scaled_down closest to i[key]
            return pd.to_datetime(scaled_down[key][(scaled_down[key] - i[key]).abs().idxmin()])
    else:
        return i[key]

df[key] = df.apply(lambda i: fuzzy_clean(i, dfr, merge_list, key), axis=1)
I'm trying to eventually merge together two dataframes, dfr and df. The issue I have is that I need to merge on about 9 columns, one of which is a timestamp that doesn't quite match up between the two dataframes: sometimes it is slightly lagging, sometimes leading. I wrote the function above, which works; however, in practice it is just too slow running through hundreds of thousands of rows.
merge_list is a list of columns that the two dataframes share and that match up 100%.
key is the name of a column, 'timestamp', that they also share, and which is what doesn't match up too well.
Any suggestions for speeding this up would be greatly appreciated!
The data looks like the following:
df:
timestamp A B C
0 100 x y z
1 101 y i u
2 102 r a e
3 103 q w e
dfr:
timestamp A B C
0 100.01 x y z
1 100.99 y i u
2 101.05 y i u
3 102 r a e
4 103.01 q w e
5 103.20 q w e
I want df to look like the following:
timestamp A B C
0 100.01 x y z
1 100.99 y i u
2 102 r a e
3 103.01 q w e
Adding the final merge for reference:
def fuzzy_merge(df_left, df_right, on, key, how='outer'):
    df_right[key] = df_right.apply(lambda i: fuzzy_clean(i, df_left, on, key), axis=1)
    return pd.merge(df_left, df_right, on=on + [key], how=how, indicator=True).sort_values(key)
I've found a solution that I believe works. Pandas has merge_asof, used below; I'm still verifying possible double counting, but it seemed to do a decent job.
pd.merge_asof(left_df, right_df, on='timestamp', by=merge_list, direction='nearest')
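For reference, a sketch of that call on the sample frames above; merge_asof requires both inputs sorted by the key, and the key dtypes must match (casting df's int timestamps to float here is an assumption to make the keys comparable):
left = df.copy()
left['timestamp'] = left['timestamp'].astype(float)  # dfr's timestamps are floats
left = left.sort_values('timestamp')
right = dfr.sort_values('timestamp')
merged = pd.merge_asof(left, right, on='timestamp', by=['A', 'B', 'C'],
                       direction='nearest')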

Splitting and copying a row in pandas

I have a task that is completely driving me mad. Let's suppose we have this df:
import pandas as pd
k = {'random_col':{0:'a',1:'b',2:'c'},'isin':{0:'ES0140074008', 1:'ES0140074008ES0140074010', 2:'ES0140074008ES0140074016ES0140074024'},'n_isins':{0:1,1:2,2:3}}
k = pd.DataFrame(k)
What I want to do is duplicate each row a number of times governed by the column n_isins, which is the length of the isin column divided by 12, as ISINs are always strings of 12 characters.
So I need row 0 once, row 1 twice and row 2 three times. My real numbers are capped at 6, so it is a hard task. I began by using booleans and slicing the isin column, but that got me nowhere. Hopefully my explanation is good enough. I also need the isin column sliced like [0:12] + ' ' + [12:24]..., splitting before each 'ES', but I think I know how to do that; I just post it because it is the criterion that governs the number of times I have to copy each row. Thanks in advance!
I think you need numpy.repeat with loc, then remove the duplicates in the index with reset_index. For the new column, use a custom splitting function with numpy.concatenate:
import numpy as np

n = np.repeat(k.index, k['n_isins'])
k = k.loc[n].reset_index(drop=True)
print (k)
isin n_isins random_col
0 ES0140074008 1 a
1 ES0140074008ES0140074010 2 b
2 ES0140074008ES0140074010 2 b
3 ES0140074008ES0140074016ES0140074024 3 c
4 ES0140074008ES0140074016ES0140074024 3 c
5 ES0140074008ES0140074016ES0140074024 3 c
# https://stackoverflow.com/a/7111143/2901002
def chunks(s, n):
    """Produce `n`-character chunks from `s`."""
    for start in range(0, len(s), n):
        yield s[start:start+n]

s = np.concatenate(k['isin'].apply(lambda x: list(chunks(x, 12))))
k['new'] = pd.Series(s, index=k.index)
print (k)
isin n_isins random_col new
0 ES0140074008 1 a ES0140074008
1 ES0140074008ES0140074010 2 b ES0140074008
2 ES0140074008ES0140074010 2 b ES0140074010
3 ES0140074008ES0140074016ES0140074024 3 c ES0140074008
4 ES0140074008ES0140074016ES0140074024 3 c ES0140074016
5 ES0140074008ES0140074016ES0140074024 3 c ES0140074024
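As an aside, the 12-character chunking can also be done without a helper function by using the .str.findall accessor with a regex; a sketch of the same idea, not from the original answer:
s = np.concatenate(k['isin'].str.findall(r'.{12}'))
k['new'] = pd.Series(s, index=k.index)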
