Finding and deleting sub-strings in dataframe column Python - python

I would like to find all the rows in a column that contains a unique ID as a string which starts with digits and symbols. After they have been identified, I would like to delete the first 9 characters for those unique rows, only. So far I have:
if '.20_P' in df['ID']:
df['ID']= df['ID']str.slice[: 9]
where I would like it to take this:
df['ID'] =
2.2.2020_P18dhwys
2.1.2020_P18dh234
2.4.2020_P18dh229
P18dh209
P18dh219
2.5.2020_P18dh289
and turn it into this:
df['ID'] =
P18dhwys
P18dh234
P18dh229
P18dh209
P18dh219
P18dh289

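Do a conditional row-wise apply to the same column. Slice with [9:] to drop the first 9 characters, and test for '20_P' (the literal '.20_P' never occurs in the data, because the year is written as 2020):
df['ID'] = df.apply(lambda row: row['ID'][9:] if '20_P' in row['ID'] else row['ID'], axis=1)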
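The same conditional edit can also be written with a boolean mask instead of a row-wise apply. This is only a minimal sketch: it assumes the prefix is always exactly 9 characters long and that every prefixed ID contains the literal text '20_P', both of which are read off the sample data in the question.

```python
import pandas as pd

df = pd.DataFrame({'ID': [
    '2.2.2020_P18dhwys', '2.1.2020_P18dh234', '2.4.2020_P18dh229',
    'P18dh209', 'P18dh219', '2.5.2020_P18dh289',
]})

# Plain substring test (regex=False), so no character is treated as a wildcard.
has_prefix = df['ID'].str.contains('20_P', regex=False)

# Drop the first 9 characters only on the rows that carry the prefix.
df.loc[has_prefix, 'ID'] = df.loc[has_prefix, 'ID'].str[9:]
print(df)
```

Rows without the prefix are never touched, because the assignment is restricted to the masked rows.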

You could also use a regular expression to find your substring.
The regular expression here works as follows: the capture group ( ) matches one or more (+) word characters (\w covers letters, digits, and the underscore, so the \d inside [\d\w] is redundant). This may be preceded by any number (*) of characters from the class [\d+\.] (digits, a literal + or a dot) and an optional (?) underscore _. Note that the vectorized string methods are usually faster than a row-wise df.apply(..., axis=1). So if you have a lot of data, or do this often, this is something you might want to consider.
import pandas as pd
df = pd.DataFrame({'A': [
'2.2.2020_P18dhwys',
'2.1.2020_P18dh234',
'2.4.2020_P18dh229',
'P18dh209',
'P18dh219',
'2.5.2020_P18dh289',
]})
print(df['A'].str.extract(r'[\d+\.]*_?([\d\w]+)'))
Output:
0
0 P18dhwys
1 P18dh234
2 P18dh229
3 P18dh209
4 P18dh219
5 P18dh289
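Instead of extracting the part to keep, the prefix can also be removed with a replacement. The following is a sketch under the assumption that the prefix consists only of digits and dots followed by a single underscore. The column name ID is taken from the question rather than from the answer above.

```python
import pandas as pd

df = pd.DataFrame({'ID': [
    '2.2.2020_P18dhwys', '2.1.2020_P18dh234', '2.4.2020_P18dh229',
    'P18dh209', 'P18dh219', '2.5.2020_P18dh289',
]})

# ^[\d.]+_ : a run of digits/dots at the very start, followed by an underscore.
# IDs without such a prefix do not match and are returned unchanged.
cleaned = df['ID'].str.replace(r'^[\d.]+_', '', regex=True)
print(cleaned.tolist())
```

Because the pattern is anchored with ^, digits later in the ID (such as the 18 in P18dh234) are left alone.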

If you know that the part to remove is a prefix joined to the ID with an underscore, you could do:
df['ID']= df['ID'].apply(lambda x: x.split('_')[-1])
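The split can be done with the pandas string accessor as well. This sketch assumes, like the line above, that the ID itself never contains an underscore. The sample frame is rebuilt from the question.

```python
import pandas as pd

df = pd.DataFrame({'ID': [
    '2.2.2020_P18dhwys', '2.1.2020_P18dh234', '2.4.2020_P18dh229',
    'P18dh209', 'P18dh219', '2.5.2020_P18dh289',
]})

# Split every value on '_' and keep the last piece.
# Values without an underscore yield a one-element list, so they stay as they are.
df['ID'] = df['ID'].str.split('_').str[-1]
print(df['ID'].tolist())
```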

Related

Search for substring with minimum character match in pandas

I have 1st dataFrame with column 'X' as :
X
A468593-3
A697269-2
A561044-2
A239882 04
2nd dataFrame with column 'Y' as :
Y
000A561044
000A872220
I would like to match substrings of both columns on a minimum number of characters (for example 7 characters, where only alphanumeric characters are considered for matching and all special characters are excluded).
So my output DataFrame should look like this:
X
A561044-2
Any possible solution would be highly appreciated.
Thanks in advance.
IIUC, and assuming that every value of Y starts with three zeros, you can slice Y with .str[3:] to remove them. Then you can join these values with |. Finally, you can create your mask using str.contains, which checks whether each value of a series matches a pattern (in your case the pattern looks like 'A|B', which checks whether a value contains 'A' or 'B'). This mask can then be used to filter your other data frame.
Code:
import pandas as pd
df1 = pd.DataFrame({"X": ["A468593-3", "A697269-2", "A561044-2", "A239882 04"]})
df2 = pd.DataFrame({"Y": ["000A561044", "000A872220"]})
mask = df1["X"].str.contains(f'({"|".join(df2["Y"].str[3:])})')
df1.loc[mask]
Output:
X
2 A561044-2
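If the keys in Y might contain characters that are special in a regular expression, it can help to escape them before joining. The sketch below assumes the matching key is whatever remains after leading zeros are stripped. The names keys and pattern are only illustrative.

```python
import re

import pandas as pd

df1 = pd.DataFrame({"X": ["A468593-3", "A697269-2", "A561044-2", "A239882 04"]})
df2 = pd.DataFrame({"Y": ["000A561044", "000A872220"]})

# Strip any number of leading zeros instead of slicing a fixed three characters.
keys = df2["Y"].str.lstrip("0")

# Escape each key so it is matched literally, then build one alternation pattern.
pattern = "|".join(re.escape(key) for key in keys)

mask = df1["X"].str.contains(pattern, regex=True)
result = df1.loc[mask]
print(result)
```

Note that lstrip("0") would also remove zeros that belong to the key itself if a key happened to start with 0.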
If some values in Y do not start with exactly three zeros, you can use this function to strip all leading numeric characters instead:
def remove_first_numerics(s):
    counter = 0
    # Stop at the end of the string so an all-numeric value does not raise IndexError
    while counter < len(s) and s[counter].isnumeric():
        counter += 1
    return s[counter:]
df_test = pd.DataFrame({"A": ["01Abd3Dc", "3Adv3Dc", "d31oVgZ", "10dZb1B", "CCcDx10"]})
df_test["A"].apply(lambda s: remove_first_numerics(s))
Output:
0 Abd3Dc
1 Adv3Dc
2 d31oVgZ
3 dZb1B
4 CCcDx10
Name: A, dtype: object
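The same stripping can be sketched without a helper function by anchoring a regular expression at the start of the string. One assumption to be aware of: \d and str.isnumeric() do not cover exactly the same characters (isnumeric also accepts characters such as fractions), so the two approaches only agree on ordinary digits.

```python
import pandas as pd

df_test = pd.DataFrame({"A": ["01Abd3Dc", "3Adv3Dc", "d31oVgZ", "10dZb1B", "CCcDx10"]})

# ^\d+ matches a run of digits only at the very start of each value.
stripped = df_test["A"].str.replace(r"^\d+", "", regex=True)
print(stripped.tolist())
```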

How to remove unwanted dots from strings in pandas column?

I have a data that contains the column as following:
mouse.pad.v.1.2
key.board.1.0.c30
pen.color.4.32.r
I am removing digits by
df["parts"]= df["parts"].str.replace('\d+', '')
Once the digits are removed the data looks like the following:
mouse.pad.v..
key.board...c
pen.color...r
what I want to do is to replace more than one dot from the column with just one dot. Ideal output should be
mouse.pad.v
key.board.c
pen.color.r
I tried using
df["parts"]= df["parts"].str.replace('..', '.')
But I am not sure how many dots will be combined together. Is there a way to automate it?
Try:
df["parts"] = df["parts"].str.replace(r"\.*\d+", "", regex=True)
print(df)
Prints:
parts
0 mouse.pad.v
1 key.board.c
2 pen.color.r
Input dataframe:
parts
0 mouse.pad.v.1.2
1 key.board.1.0.c30
2 pen.color.4.32.r
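To answer the question as literally asked (collapsing any run of dots into one), a two-step variant could look like the sketch below. It assumes that leading or trailing dots left over after the digits are removed should be dropped as well.

```python
import pandas as pd

df = pd.DataFrame({"parts": ["mouse.pad.v.1.2", "key.board.1.0.c30", "pen.color.4.32.r"]})

# Step 1: remove the digits, which leaves runs of dots behind.
no_digits = df["parts"].str.replace(r"\d+", "", regex=True)

# Step 2: collapse two or more consecutive dots into one, then trim dots at the ends.
df["parts"] = no_digits.str.replace(r"\.{2,}", ".", regex=True).str.strip(".")
print(df)
```

The quantifier {2,} matches runs of any length, so it does not matter how many dots end up next to each other.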

If statement: String starts with exactly 4 digits in Python/pandas

I have a column of a dataframe consisting of strings, which are either a date (e.g. "12-10-2020") or a string starting with 4 digits (e.g. "4030 - random name"). I would like to write an if statement to capture the strings which are starting with 4 digits, which is similar to this code:
string[0].isdigit()
but instead of isdigit, it should be something like:
is string which starts with 4 digits
I hope I clarified my question and let me know if it is not clear. I am btw working in pandas.
Use str.contains with a pattern anchored at the start of the string:
df[df["col"].str.contains(r'^[0-9]{4}')]
You can use str.match, which is anchored to the start of the string by default:
Example:
import pandas as pd
df = pd.DataFrame({'col': ['4030 - random name', 'other', '07-02-2022']})
df[df['col'].str.match(r'\d{4}')]
output:
col
0 4030 - random name
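Both patterns above accept strings that start with at least four digits. If "exactly 4" matters, a negative lookahead can reject a fifth digit. This is a sketch: the extra row '12345 - five digits' is made up for illustration and is not part of the question.

```python
import pandas as pd

df = pd.DataFrame({'col': ['4030 - random name', 'other', '07-02-2022', '12345 - five digits']})

# ^\d{4} requires four leading digits; (?!\d) fails if a fifth digit follows them.
exactly_four = df['col'].str.contains(r'^\d{4}(?!\d)')
result = df[exactly_four]
print(result)
```

A date written year-first, such as '2020-10-12', would still match, so this only separates the two kinds of strings when dates start with the day or month as in the question.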

How to deal with Pandas dataframe column with list containing string values, get unique words

I am trying to do some basic operations on a dataframe column (called dimensions) that contains a list. Do basic operations like df['dimensions'].str.replace() work when the dataframe column contains a list? It did not work for me. I also tried to replace the text in the column using re.sub() method and it did not work either.
This is the last column in my dataframe:
**dimensions**
[50' long]
None
[70ft long, 19ft wide, 8ft thick]
[5' high, 30' long, 18' wide]
This is what I have tried, but it did not work:
def dimension_unique_words(dimensions):
    if dimensions != 'None':
        for value in dimensions:
            new_value = re.sub(r'[^\w\s]|ft|feet', ' ', value)
            new_value = ''.join([i for i in new_value if not i.isdigit()])
            return new_value
df['new_col'] = df['dimensions'].apply(dimension_unique_words)
this is the output I got from my code:
**new_col**
NaN
None
NaN
None
NaN
None
What I want to do is to replace the numbers and the units [ft, feet, '] in the column called dimensions with a space and then apply the df.unique() on that column to get the unique values which are [long, wide, thick, high].
The expected output would be:
**new_col**
[long]
None
[long, wide, thick]
[high, long, wide]
...then I want to apply the df.unique() on the new_col to get [long, wide, thick, high]
How to do that?
First we deal with the annoyance that your 'dimensions' column is sometimes None, sometimes a list of one string element. So extract that element when it's non-null:
df['dimensions2'] = df['dimensions'].apply(lambda col: col[0] if col else None)
Next, get all alphabetic strings in each row, excluding measurements:
>>> df['dimensions2'].str.findall(r'\b([a-zA-Z]+)')
0 [long]
1 None
2 [long, wide, thick]
3 [high, long, wide]
Note we use the \b word boundary (to exclude the 'ft' from '30ft'), and to keep \b from being interpreted as a backspace character we have to write the regex as a raw string r''.
This gives you a list. You wanted a set, to prevent duplicates occurring, so:
df['dimensions2'].str.findall(r'\b([a-zA-Z]+)').apply(lambda l: set(l) if l else None)
0 {long}
1 None
2 {thick, long, wide}
3 {high, long, wide}
Use str.findall to collect all dimension words of each row into a list.
Use explode to expand each list into separate rows that keep the same index.
Then use groupby(level=0).unique() to drop duplicates per index and gather the words back into one array per row.
df['new_col'] = (
df['dimensions'].fillna('').astype(str)
.str.findall(r'\b[a-zA-Z]+\b')
.explode().dropna()
.groupby(level=0).unique()
)
Use df['new_col'].explode().dropna().unique() to get the unique dimension values:
array(['long', 'wide', 'thick', 'high'], dtype=object)
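If the column really holds Python lists with several strings each (rather than one string per row), a plain function applied per row may be easier to follow. The sketch below assumes that layout. The helper name dimension_words is hypothetical, and the sample frame is reconstructed from the question.

```python
import re

import pandas as pd

df = pd.DataFrame({'dimensions': [
    ["50' long"],
    None,
    ["70ft long", "19ft wide", "8ft thick"],
    ["5' high", "30' long", "18' wide"],
]})

def dimension_words(items):
    """Return the distinct alphabetic words of one row, or None for missing rows."""
    if not isinstance(items, list):
        return None
    found = []
    for item in items:
        # \b keeps 'ft' in '70ft' from matching, since no boundary precedes it.
        for word in re.findall(r'\b[a-zA-Z]+\b', item):
            if word not in found:
                found.append(word)
    return found

df['new_col'] = df['dimensions'].apply(dimension_words)

# Flatten the per-row lists and keep each word once, in order of first appearance.
unique_words = df['new_col'].explode().dropna().unique()
print(list(unique_words))
```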

Filter Pandas dataframe row and replace value in column

I got a big list of phone numbers in all sorts of formats:
df = pd.DataFrame(
{'phone': ['0123/12345', '0123-23456', '0123/4455-10', '0123-4455-22'],
'name': ['A-1', 'B-1', 'C/3', 'D/7']})
name phone
0 A-1 0123/12345
1 B-1 0123-23456
2 C/3 0123/4455-10
3 D/7 0123-4455-22
The formats I want are in rows #0 and #2.
When I concentrate on #1, I tried the following:
df.loc[(df.phone.str.count('-')==1) &
(df.phone.str.count('/')==0)].apply(lambda x: x.str.replace('-', '/'))
And this does the trick on the number, but unfortunately also on the name column:
name phone
1 B/1 0123/23456
But the name column must not be changed.
So I have two questions:
How can I filter the row and only change the phone column?
How can I go with #3, where I would want to replace the first occurrence of '-' to '/'?
You can use regex replace (str.replace method) on column phone only:
df['phone'] = df.phone.str.replace(r"^(\d+)-(.*)$", r"\1/\2", regex=True)
df
# name phone
#0 A-1 0123/12345
#1 B-1 0123/23456
#2 C/3 0123/4455-10
#3 D/7 0123/4455-22
Explanation on the regex:
^(\d+)-(.*)$ matches a string that starts with digits immediately followed by a dash, which is the case for rows #1 and #3. Using back references, it replaces that first dash with /. Rows #0 and #2 do not match the regex, so they are left unmodified.
Or if you're no fan of regex (like me), you can simply do this:
df['phone'] = df.phone.apply(lambda x: x.replace('-','/',1) if '/' not in x else x)
print(df)
name phone
0 A-1 0123/12345
1 B-1 0123/23456
2 C/3 0123/4455-10
3 D/7 0123/4455-22
Probably not the best or fastest way, still I feel more comfortable with it since I don't know regex yet.
Hope that was helpful.
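Both questions (touching only the phone column, and replacing only the first dash) can also be handled with a mask and a literal, non-regex replacement. This is a minimal sketch that assumes a number needs fixing exactly when it contains no slash yet.

```python
import pandas as pd

df = pd.DataFrame(
    {'phone': ['0123/12345', '0123-23456', '0123/4455-10', '0123-4455-22'],
     'name': ['A-1', 'B-1', 'C/3', 'D/7']})

# Rows whose phone number has no '/' yet are the ones to repair.
needs_fix = ~df['phone'].str.contains('/', regex=False)

# n=1 replaces only the first '-'; assigning via .loc leaves 'name' untouched.
df.loc[needs_fix, 'phone'] = (
    df.loc[needs_fix, 'phone'].str.replace('-', '/', n=1, regex=False)
)
print(df)
```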
