I have columns named like this:
1:Arnston 2:Berg 3:Carlson 53:Brown
and I want to strip all the characters up to and including the :. I know I can rename the columns, but that would be pretty tedious since my numbers go up to 100.
My desired output is:
Arnston Berg Carlson Brown
Assuming that you have a frame looking something like this:
>>> df
1:Arnston 2:Berg 3:Carlson 53:Brown
0 5 0 2 1
1 9 3 2 9
2 9 2 9 7
You can use the vectorized string operators to split each entry at the first colon and then take the second part:
>>> df.columns = df.columns.str.split(":", n=1).str[1]
>>> df
Arnston Berg Carlson Brown
0 5 0 2 1
1 9 3 2 9
2 9 2 9 7
import re
s = '1:Arnston 2:Berg 3:Carlson 53:Brown'
s_minus_numbers = re.sub(r'\d+:', '', s)
Gets you
'Arnston Berg Carlson Brown'
The best solution IMO is to use pandas' str attribute on the columns. This allows for the use of regular expressions without having to import re:
df.columns = df.columns.str.extract(r'\d+:(.*)', expand=False)
Where the regex means: select everything ((.*)) after one or more digits (\d+) and a colon (:).
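Applied end to end to a frame like the one above (a sketch; `expand=False` makes `str.extract` return an Index rather than a DataFrame, so it can be assigned straight back to `df.columns`):

```python
import pandas as pd

df = pd.DataFrame({'1:Arnston': [5, 9, 9],
                   '2:Berg': [0, 3, 2],
                   '3:Carlson': [2, 2, 9],
                   '53:Brown': [1, 9, 7]})

# capture everything after the leading digits and colon
df.columns = df.columns.str.extract(r'\d+:(.*)', expand=False)
print(df.columns.tolist())  # ['Arnston', 'Berg', 'Carlson', 'Brown']
```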
You can do it with a list comprehension:
columns = '1:Arnston 2:Berg 3:Carlson 53:Brown'.split()
print('Before: {!r}'.format(columns))
columns = [col.split(':')[1] for col in columns]
print('After: {!r}'.format(columns))
Output
Before: ['1:Arnston', '2:Berg', '3:Carlson', '53:Brown']
After: ['Arnston', 'Berg', 'Carlson', 'Brown']
Another way is with a regular expression using re.sub():
import re
columns = '1:Arnston 2:Berg 3:Carlson 53:Brown'.split()
pattern = re.compile(r'^.+:')
columns = [pattern.sub('', col) for col in columns]
print(columns)
Output
['Arnston', 'Berg', 'Carlson', 'Brown']
df = pd.DataFrame({'1:Arnston': [5, 9, 9],
                   '2:Berg': [0, 3, 2],
                   '3:Carlson': [2, 2, 9],
                   '53:Brown': [1, 9, 7]})
[x.split(':')[1] for x in df.columns.factorize()[1]]
output:
['Arnston', 'Berg', 'Carlson', 'Brown']
You could use str.replace and pass a regular expression:
In [52]: df
Out[52]:
1:Arnston 2:Berg 3:Carlson 53:Brown
0 1.340711 1.261500 -0.512704 -0.064384
1 0.462526 -0.358382 0.168122 -0.660446
2 -0.089622 0.656828 -0.838688 -0.046186
3 1.041807 0.775830 -0.436045 0.162221
4 -0.422146 0.775747 0.106112 -0.044917
In [51]: df.columns.str.replace(r'\d+:', '', regex=True)
Out[51]: Index(['Arnston', 'Berg', 'Carlson', 'Brown'], dtype='object')
Related
I have an input Pandas Series like this:
I would like to remove duplicates in each row. For example, change M,S,S to M,S.
I tried
fifa22['player_positions'] = fifa22['player_positions'].str.split(',').apply(pd.unique)
But the results are a Series of ndarray
I would like to convert the results to simple string, without the square bracket. Wondering what to do, thanks!
If it's only this one column, you can use map.
import pandas as pd
df = pd.DataFrame({
'player_positions' : "M,S,S S S,M M,M M,M M M,S S,M,M,S".split(' ')
})
print(df)
player_positions
0 M,S,S
1 S
2 S,M
3 M,M
4 M,M
5 M
6 M,S
7 S,M,M,S
out = df['player_positions'].map(lambda x: ','.join(set(x.split(','))))
print(out)
0 M,S
1 S
2 M,S
3 M
4 M
5 M
6 M,S
7 M,S
If you want to join with a different separator, just change the ',' in ','.join(...) to anything else.
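One caveat: set does not guarantee the original order of the positions. If order matters, a sketch using dict.fromkeys instead, which de-duplicates while keeping first-seen order (guaranteed since Python 3.7):

```python
import pandas as pd

df = pd.DataFrame({
    'player_positions': "M,S,S S S,M M,M M,M M M,S S,M,M,S".split(' ')
})

# dict.fromkeys drops duplicates but preserves insertion order,
# unlike set, whose iteration order is arbitrary
out = df['player_positions'].map(lambda x: ','.join(dict.fromkeys(x.split(','))))
print(out.tolist())  # ['M,S', 'S', 'S,M', 'M', 'M', 'M', 'M,S', 'S,M']
```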
Suppose I have the following dataframe:
df = pd.DataFrame({'X':['AB_123_CD','EF_123CD','XY_Z'],'Y':[1,2,3]})
X Y
0 AB_123_CD 1
1 EF_123CD 2
2 XY_Z 3
I want to strip off the first prefix so that I get
X Y
0 123_CD 1
1 123CD 2
2 Z 3
I tried df.X.str.split('_').str[-1].str.strip(), but since the positions of the _s differ, it returns a different result from the one desired above. How can I address this issue?
You're close, you can split once (n=1) from the left and keep the second one (str[1]):
df.X = df.X.str.split("_", n=1).str[1]
to get
>>> df
X Y
0 123_CD 1
1 123CD 2
2 Z 3
Try this instead:
df["X"] = df["X"].apply(lambda x: x[x.find("_")+1:])
>>> df
X Y
0 123_CD 1
1 123CD 2
2 Z 3
This keeps the entire string after the first occurrence of _
The following code could do the job:
df['X'] = df.X.apply(lambda x: '_'.join(x.split('_')[1:]))
Your solution is very close. With some minor changes, it should work:
df.X.str.split('_').str[1:].str.join('_')
0 123_CD
1 123CD
2 Z
Name: X, dtype: object
You can set maxsplit in the str.split() function. It sounds like you just want to split with maxsplit=1 and take the last element:
df['X'] = df['X'].apply(lambda x: x.split('_',1)[-1])
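For illustration, the effect of maxsplit on a plain string:

```python
s = 'AB_123_CD'
print(s.split('_'))         # ['AB', '123', 'CD']
print(s.split('_', 1))      # ['AB', '123_CD'] -- only the first '_' splits
print(s.split('_', 1)[-1])  # '123_CD'
```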
I have a string of numeric characters that I want to convert to int, but I need to remove the parentheses and the numbers inside them (the parenthesised value is just a multiplier for my application; this is how I receive the data).
Here is the sample code.
import pandas as pd
voltages = ['0', '0', '0', '0', '0', '310.000 (31)', '300.000 (30)', '190.000 (19)', '0', '20.000 (2)']
df = pd.DataFrame(voltages, columns=['Voltage'])
df
Out [1]:
Voltage
0 0
1 0
2 0
3 0
4 0
5 310.000 (31)
6 300.000 (30)
7 190.000 (19)
8 0
9 20.000 (2)
How can I remove the substrings within the parentheses? Is there a pandas Series.str way to do it?
Use str.replace with regex:
df.Voltage.str.replace(r"\s\(.*", "", regex=True)
Out:
0 0
1 0
2 0
3 0
4 0
5 310.000
6 300.000
7 190.000
8 0
9 20.000
Name: Voltage, dtype: object
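Since the result is still object dtype, a follow-up sketch converting the cleaned column to numbers with pd.to_numeric:

```python
import pandas as pd

voltages = ['0', '310.000 (31)', '300.000 (30)', '20.000 (2)']
df = pd.DataFrame(voltages, columns=['Voltage'])

# strip the parenthesised multiplier, then convert the strings to numbers
cleaned = df['Voltage'].str.replace(r'\s\(.*', '', regex=True)
df['Voltage'] = pd.to_numeric(cleaned)
print(df['Voltage'].tolist())  # [0.0, 310.0, 300.0, 20.0]
```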
You can also use str.split()
df_2 = df['Voltage'].str.split(' ', n=1, expand=True).rename(columns={0: 'Voltage'})
df_2['Voltage'] = df_2['Voltage'].astype('float')
If you know the separating character will always be a space then the following is quite a neat way of doing it:
voltages = [i.rsplit(' ')[0] for i in voltages]
I think you could try this:
new_series = df['Voltage'].apply(lambda x:int(x.split('.')[0]))
df['Voltage'] = new_series
I hope it helps.
Hopefully, this will work for you:
result = source_value[:source_value.find(" (")]
NOTE: the find function requires a string as source_value. But if you have parens in your value, I assume it is a string.
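One caution with this approach: for values without a parenthesised part (such as '0'), find returns -1, and slicing with -1 would silently drop the last character. A guarded sketch applying the idea to the sample list:

```python
def strip_paren(s):
    # str.find returns -1 when ' (' is absent; slicing with [:-1] would
    # drop the last character, so keep the string unchanged in that case
    idx = s.find(' (')
    return s[:idx] if idx != -1 else s

voltages = ['0', '310.000 (31)', '20.000 (2)']
print([strip_paren(v) for v in voltages])  # ['0', '310.000', '20.000']
```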
I'm wondering about an expression for lists of strings in a dataframe.
How can I split the string values using Python?
I'm using the replace method, but I can't find a way to delete only the node number.
dataframe
index article_id
0 ['#abc_172', '#abc_249', '#abc-32', '#def-1']
1 ['#az3_2', '#bwc_4', '#xc-34', '#xc-1']
2 ['#ac_12']
3 ['#ea457870a2d32453609f52e50f84abdc_15', '#bb_3']
4 ...
... ...
I want to get like this
index article_id article_id_unique_count
0 ['abc', 'abc', 'abc', 'def'] 2
1 ['az3', 'bwc', 'xc', 'xc'] 3
2 ['ac'] 1
3 ['#ea457870a2d32453609f52e50f84abdc', 'bb'] 2
...
Use re.findall:
import re

df['article_id'] = (df.article_id.apply(lambda x: re.findall('([#a-z0-9]+)', x))
                                 .apply(lambda x: [i for i in x if not i.isdigit()]))
df['article_id_unique_count'] = df['article_id'].apply(lambda x: len(set(x)))
Output
article_id article_id_unique_count
0 [abc, abc, abc, def] 2
1 [az3, bwc, xc, xc] 3
2 [ac] 1
3 [#ea457870a2d32453609f52e50f84abdc, bb] 2
Assuming the delimiters are either - or _:
df['article_id'].map(lambda x: [re.findall('#*(.+?)[-_]', s)[0] for s in x])
Output:
0 [abc, abc, abc, def]
1 [az3, bwc, xc, xc]
2 [ac]
3 [#ea457870a2d32453609f52e50f84abdc, bb]
You can then use apply(lambda x:len(set(x))).
Notice that the first element of row 1, az3 is also correctly extracted.
Apply a regex within apply, and use set to count the unique elements in each list:
import re
df = pd.DataFrame(data={"id":[0,1,2],
"article_id":[["abc_172", "#abc_249", "#abc-32", "#def-1"],
["#az3_2", "#bwc_4", "#xc-34", "#xc-1"],
["##ea457870a2d32453609f52e50f84abdc_15"]]})
df['article_id'] = df['article_id'].apply(lambda x: [re.sub('[!#$]', '', i).split("-")[0].split("_")[0] for i in x])
df['article_id_unique_count'] = df['article_id'].apply(lambda x : len(set(x)))
id article_id article_id_unique_count
0 0 [abc, abc, abc, def] 2
1 1 [az3, bwc, xc, xc] 3
2 2 [#ea457870a2d32453609f52e50f84abdc] 1
The other solutions use apply. I always try to find a solution without apply, and came up with this one: construct a dataframe from the lists, stack it to a series, and work with str.extract and agg:
(pd.DataFrame(df.article_id.tolist(), index=df.index).stack().str.extract(r'#?(.*)[_-]')
.groupby(level=0)[0].agg([list, 'nunique'])
.rename(columns={'list': 'article_id', 'nunique': 'article_id_unique_count'}))
Out[15]:
article_id article_id_unique_count
0 [abc, abc, abc, def] 2
1 [az3, bwc, xc, xc] 3
2 [ac] 1
3 [#ea457870a2d32453609f52e50f84abdc, bb] 2
How do I take multiple lists and put them as different columns in a python dataframe? I tried this solution but had some trouble.
Attempt 1:
Take three lists and zip them together: res = zip(lst1, lst2, lst3)
Yields just one column
Attempt 2:
percentile_list = pd.DataFrame({'lst1Tite' : [lst1],
'lst2Tite' : [lst2],
'lst3Tite' : [lst3] },
columns=['lst1Tite','lst1Tite', 'lst1Tite'])
yields either one row by 3 columns (the way above) or if I transpose it is 3 rows and 1 column
How do I get a 100 row (length of each independent list) by 3 column (three lists) pandas dataframe?
I think you're almost there, try removing the extra square brackets around the lists (also, you don't need to specify the column names when you're creating a dataframe from a dict like this):
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
percentile_list = pd.DataFrame(
{'lst1Title': lst1,
'lst2Title': lst2,
'lst3Title': lst3
})
percentile_list
lst1Title lst2Title lst3Title
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
...
If you need a more performant solution you can use np.column_stack rather than zip as in your first attempt, this has around a 2x speedup on the example here, however comes at bit of a cost of readability in my opinion:
import numpy as np
percentile_list = pd.DataFrame(np.column_stack([lst1, lst2, lst3]),
columns=['lst1Title', 'lst2Title', 'lst3Title'])
Adding to Aditya Guru's answer: there is no need to use map. You can do it simply with:
pd.DataFrame(list(zip(lst1, lst2, lst3)))
This will set the column names to 0, 1, 2. To set your own column names, you can pass the keyword argument columns to the method above.
pd.DataFrame(list(zip(lst1, lst2, lst3)),
columns=['lst1_title','lst2_title', 'lst3_title'])
Adding one more scalable solution.
lists = [lst1, lst2, lst3, lst4]
df = pd.concat([pd.Series(x) for x in lists], axis=1)
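One reason this scales well: pd.concat aligns the Series on their index, so lists of unequal length are padded with NaN, whereas the dict constructor raises for unequal lengths and zip silently truncates. A sketch:

```python
import pandas as pd

# three lists of different lengths
lists = [[1, 2, 3], [4, 5], [6]]

# concat aligns on the index; shorter columns are padded with NaN
df = pd.concat([pd.Series(x) for x in lists], axis=1)
print(df.shape)  # (3, 3)
```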
There are several ways to create a dataframe from multiple lists.
list1=[1,2,3,4]
list2=[5,6,7,8]
list3=[9,10,11,12]
pd.DataFrame({'list1': list1, 'list2': list2, 'list3': list3})
pd.DataFrame(data=zip(list1,list2,list3),columns=['list1','list2','list3'])
Just adding that using the first approach it can be done as -
pd.DataFrame(list(map(list, zip(lst1,lst2,lst3))))
Adding to the above answers: we can also build the dataframe column by column on the fly:
df= pd.DataFrame()
list1 = list(range(10))
list2 = list(range(10,20))
df['list1'] = list1
df['list2'] = list2
print(df)
Hope it helps!
@oopsi used pd.concat() but didn't include the column names. You could do the following, which, unlike the first solution in the accepted answer, gives you control over the column order (it avoids dicts, which were unordered before Python 3.7):
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
s1=pd.Series(lst1,name='lst1Title')
s2=pd.Series(lst2,name='lst2Title')
s3=pd.Series(lst3 ,name='lst3Title')
percentile_list = pd.concat([s1,s2,s3], axis=1)
percentile_list
Out[2]:
lst1Title lst2Title lst3Title
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
...
You can simply use the following code:
train_data['labels'] = train_data[["LABEL1","LABEL2","LABEL3","LABEL4","LABEL5","LABEL6","LABEL7"]].values.tolist()
train_df = pd.DataFrame(train_data, columns=['text','labels'])
I just did it like this (python 3.9):
import pandas as pd
my_dict=dict(x=x, y=y, z=z) # Set column ordering here
my_df=pd.DataFrame.from_dict(my_dict)
This seems to be reasonably straightforward (albeit in 2022) unless I am missing something obvious...
In python 2 one could've used a collections.OrderedDict().