Converting list column to string in pandas - python

I have a df called df like so. The tag_position is either a string or list. but I want them to be all strings. How can i do this? I also want to remove the white space at the end.
input
id tag_positions
1 center
2 right
3 ['left']
4 ['center ']
5 [' left']
6 ['right']
7 left
expected output
id tag_positions
1 center
2 right
3 left
4 center
5 left
6 right
7 left

You can explode and then strip:
df.tag_positions = df.tag_positions.explode().str.strip()
to get
id tag_positions
0 1 center
1 2 right
2 3 left
3 4 center
4 5 left
5 6 right
6 7 left

You can join:
df['tag_positions'].map(''.join)
Or:
df['tag_positions'].str.join('')

Try with str chain with np.where
df['tag_positions'] = np.where(df['tag_positions'].map(lambda x : type(x).__name__)=='list',df['tag_positions'].str[0],df['tag_positions'])
Also my favorite explode
df = df.explode('tag_positions')

you can convert with apply method like this
df.tag_positions = df.tag_positions.apply(lambda x : ''.join(x) if type(x) == list else x)
if all the lists have a length of 1 you can do this also:
df.tag_positions = df.tag_positions.apply(lambda x : x[0] if type(x) == list else x)

You can use apply and check if an item is an instance of a list, if yes, take the first element. and then you can just use str.strip to strip off the unwanted spaces.
df['tag_positions'].apply(lambda x: x[0] if isinstance(x, list) else x).str.strip()
OUTPUT
Out[42]:
0 center
1 right
2 left
3 center
4 left
5 right
6 left
Name: 0, dtype: object

Related

Find where word is present in string with where statement [duplicate]

I having replace issue while I try to replace a string with value from another column.
I want to replace 'Length' with df['Length'].
df["Length"]= df["Length"].replace('Length', df['Length'], regex = True)
Below is my data
Input:
**Formula** **Length**
Length 5
Length+1.5 6
Length-2.5 5
Length 4
5 5
Expected Output:
**Formula** **Length**
5 5
6+1.5 6
5-2.5 5
4 4
5 5
However, with the code I used above, it will replace my entire cell instead of Length only.
I getting below output:
I found it was due to df['column'] is used, if I used any other string the behind offset (-1.5) will not get replaced.
**Formula** **Length**
5 5
6 6
5 5
4 4
5 5
May I know is there any replace method for values from other columns?
Thank you.
If want replace by another column is necessary use DataFrame.apply:
df["Formula"]= df.apply(lambda x: x['Formula'].replace('Length', str(x['Length'])), axis=1)
print (df)
Formula Length
0 5 5
1 6+1.5 6
2 5-2.5 5
3 4 4
4 5 5
Or list comprehension:
df["Formula"]= [x.replace('Length', str(y)) for x, y in df[['Formula','Length']].to_numpy()]
Just wanted to add, that list comprehension is much faster of course:
df = pd.DataFrame({'a': ['aba'] * 1000000, 'c': ['c'] * 1000000})
%timeit df.apply(lambda x: x['a'].replace('b', x['c']), axis=1)
# 1 loop, best of 5: 11.8 s per loop
%timeit [x.replace('b', str(y)) for x, y in df[['a', 'c']].to_numpy()]
# 1 loop, best of 5: 1.3 s per loop

python remove duplicate substring parsed by comma

I have an input Pandas Series like this:
I would like to remove duplicates in each row. For example, change M,S,S to M,S.
I tried
fifa22['player_positions'] = fifa22['player_positions'].str.split(',').apply(pd.unique)
But the results are a Series of ndarray
I would like to convert the results to simple string, without the square bracket. Wondering what to do, thanks!
If it only on this one column, you should use map.
import pandas as pd
df = pd.DataFrame({
'player_positions' : "M,S,S S S,M M,M M,M M M,S S,M,M,S".split(' ')
})
print(df)
player_positions
0 M,S,S
1 S
2 S,M
3 M,M
4 M,M
5 M
6 M,S
7 S,M,M,S
out = df['player_positions'].map(lambda x: ','.join(set(x.split(','))))
print(out)
0 M,S
1 S
2 M,S
3 M
4 M
5 M
6 M,S
7 M,S
If you want to concatenate in any other way just change the , in ','.join(...) to anything else.

Stripping string values at different positions

Suppose I have the following dataframe:
df = pd.DataFrame({'X':['AB_123_CD','EF_123CD','XY_Z'],'Y':[1,2,3]})
X Y
0 AB_123_CD 1
1 EF_123CD 2
2 XY_Z 3
I want to use strip method to get rid of the first prefix such that I get
X Y
0 123_CD 1
1 123CD 2
2 Z 3
I tried doing: df.X.str.split('_').str[-1].str.strip() but since the positions of _'s are different it returns different result to the one desired above. I wonder how can I address this issue?
You're close, you can split once (n=1) from the left and keep the second one (str[1]):
df.X = df.X.str.split("_", n=1).str[1]
to get
>>> df
X Y
0 123_CD 1
1 123CD 2
2 Z 3
Try this instead:
df["X"] = df["X"].apply(lambda x: x[x.find("_")+1:])
>>> df
X Y
0 123_CD 1
1 123CD 2
2 Z 3
This keeps the entire string after the first occurence of _
The following code could do the job:
df['X'] = df.X.apply(lambda x: '_'.join(x.split('_')[1:]))
Your solution is very close. With some minor changes, it should work:
df.X.str.split('_').str[1:].str.join('_')
0 123_CD
1 123CD
2 Z
Name: X, dtype: object
You can define maxsplit in the str.split() function. It sounds like you just want to split with maxsplit 1 and take the last element:
df['X'] = df['X'].apply(lambda x: x.split('_',1)[-1])

Removing empty words from column of tokenized sentences

I have a dataframe containing lists of words in each row in the same column. I'd like to remove what I guess are spaces. I managed to get rid of some by doing:
for i in processed.text:
for x in i:
if x == '' or x==" ":
i.remove(x)
But some of them still remain.
>processed['text']
0 [have, month, #postdoc, within, on, chemical, ...
1 [hardworking, producers, iowa, so, for, state,...
2 [hardworking, producers, iowa, so, for, state,...
3 [today, time, is, to, sources, energy, much, p...
4 [thanks, gaetanos, club, c, oh, choosing, #rec...
...
130736 [gw, fossil, renewable, import, , , , , , , , ...
130737 [s, not, , go, ]
130738 [answer, deforestation, in, ]
130739 [plastic, regrind, any, and, grades, we, make,...
130740 [grid, generating, of, , , , gw]
Name: text, Length: 130741, dtype: object
>type(processed)
<class 'pandas.core.frame.DataFrame'>
Thank you very much.
Split on comma remove empty values and then combine again with comma
def remove_empty(x):
if type(x) is str:
x = x.split(",")
x = [ y for y in x if y.strip()]
return ",".join(x)
elif type(x) is list:
return [ y for y in x if y.strip()]
processed['text'] = processed['text'].apply(remove_empty)
You can use split(expand=True) to do that. Note: You dont have to specifically give spilt(' ', expand=True). By default, it takes ' ' as the value. You can replace ' ' with anything. For ex: if your words separate with , or -, then you can use that separator to split the columns.
import pandas as pd
df = pd.DataFrame({'Col1':['This is a long sentence',
'This is another long sentence',
'This is short',
'This is medium length',
'Wow. Tiny',
'Petite',
'Ok']})
print (df)
df = df.Col1.str.split(' ',expand=True)
print (df)
The output of this will be:
Original dataframe:
Col1
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny
5 Petite
6 Ok
Dataframe split into columns
0 1 2 3 4
0 This is a long sentence
1 This is another long sentence
2 This is short None None
3 This is medium length
4 Wow. Tiny None None None
5 Petite None None None None
6 Ok None None None None
If you want to limit them to 3 columns only, then use n=2
df = df.Col1.str.split(' ',n = 2, expand=True)
The output will be:
0 1 2
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny None
5 Petite None None
6 Ok None None
If you want to rename the columns to be more specific, then you can add rename to the end like this.
df = df.Col1.str.split(' ',n = 2, expand=True).rename({0:'A',1:'B',2:'C'},axis=1)
A B C
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny None
5 Petite None None
6 Ok None None
In case you want to replace all the None with '' and also prefix the column names, you can do it as follws:
df = df.Col1.str.split(expand=True).add_prefix('Col').fillna('')
Col0 Col1 Col2 Col3 Col4
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny
5 Petite
6 Ok

fill in entire dataframe cell by cell based on index AND column names?

I have a dataframe where the row indices and column headings should determine the content of each cell. I'm working with a much larger version of the following df:
df = pd.DataFrame(index = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'],
columns = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
Specifically, I want to apply the custom function edit_distance() or equivalent (see here for function code) which calculates a difference score between two strings. The two inputs are the row and column names. The following works but is extremely slow:
for seq in df.index:
for seq2 in df.columns:
df.loc[seq, seq2] = edit_distance(seq, seq2)
This produces the result I want:
ae azde afgle arlde afghijklbcmde
afghijklde 8 7 5 6 3
afghijklmde 9 8 6 7 2
ade 1 1 3 2 10
afghilmde 7 6 4 5 4
amde 2 1 3 2 9
What is a better way to do this, perhaps using applymap() ?. Everything I've tried with applymap() or apply or df.iterrows() has returned errors of the kind AttributeError: "'float' object has no attribute 'index'" . Thanks.
Turns out there's an even better way to do this. onepan's dictionary comprehension answer above is good but returns the df index and columns in random order. Using a nested .apply() accomplishes the same thing at about the same speed and doesn't change the row/column order. The key is to not get hung up on naming the df's rows and columns first and filling in the values second. Instead, do it the other way around, initially treating the future index and columns as standalone pandas Series.
series_rows = pd.Series(['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'])
series_cols = pd.Series(['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: edit_distance(x, y))))
df.index = series_rows
df.columns = series_cols
you could use comprehensions, which speeds it up ~4.5x on my pc
first = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde']
second = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde']
pd.DataFrame.from_dict({f:{s:edit_distance(f, s) for s in second} for f in first}, orient='index')
# output
# ae azde afgle arlde afghijklbcmde
# ade 1 2 2 2 2
# afghijklde 1 3 4 4 9
# afghijklmde 1 3 4 4 10
# afghilmde 1 3 4 4 8
# amde 1 3 3 3 3
# this matches to edit_distance('ae', 'afghijklde') == 8, e.g.
note I used this code for edit_distance (first response in your link):
def edit_distance(s1, s2):
if len(s1) > len(s2):
s1, s2 = s2, s1
distances = range(len(s1) + 1)
for i2, c2 in enumerate(s2):
distances_ = [i2+1]
for i1, c1 in enumerate(s1):
if c1 == c2:
distances_.append(distances[i1])
else:
distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
distances = distances_
return distances[-1]

Categories

Resources