If statement: String starts with exactly 4 digits in Python/pandas - python

I have a dataframe column consisting of strings, which are either a date (e.g. "12-10-2020") or a string starting with 4 digits (e.g. "4030 - random name"). I would like to write an if statement that captures the strings starting with 4 digits, similar to this code:
string[0].isdigit()
but instead of isdigit, it should be something like:
is string which starts with 4 digits
I hope the question is clear; let me know if it is not. I am working in pandas, by the way.

Use str.contains:
df[df["col"].str.contains(r'^[0-9]{4}')]

You can use str.match, which is anchored to the start of the string by default:
Example:
import pandas as pd
df = pd.DataFrame({'col': ['4030 - random name', 'other', '07-02-2022']})
df[df['col'].str.match(r'\d{4}')]
output:
col
0 4030 - random name
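Both patterns above also accept strings that begin with five or more digits. If "exactly 4 digits" is meant strictly, a negative lookahead can reject a fifth digit. This is a minimal sketch; the '40305 - too long' row is a made-up value added only to show the difference:

```python
import pandas as pd

df = pd.DataFrame({'col': ['4030 - random name', '40305 - too long', '12-10-2020']})

# \d{4} matches four digits at the start (str.match is anchored there);
# (?!\d) fails the match if a fifth digit follows.
mask = df['col'].str.match(r'\d{4}(?!\d)')
result = df[mask]
print(result)
```

The boolean mask can also be used directly in place of a row-by-row if statement, for example with df.loc[mask, 'col'].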

Related

Filter on a pandas string column as numeric without creating a new column

This is a quite easy task, however, I am stuck here. I have a dataframe and there is a column with type string, so characters in it:
Category
AB00
CD01
EF02
GH03
RF04
Now I want to treat these values as numeric, filter on them, and create a subset dataframe. However, I do not want to change the dataframe in any way. I tried:
df_subset=df[df['Category'].str[2:4]<=3]
Of course this does not work, as the left-hand side is a string and cannot be compared to the number 3.
I tried
df_subset=df[int(df['Category'].str[2:4])<=3]
but I am not sure about this, I think it is wrong or not the way it should be done.
Add type conversion to your expression:
df[df['Category'].str[2:].astype(int) <= 3]
Category
0 AB00
1 CD01
2 EF02
3 GH03
As you have leading zeros, you can directly use string comparison:
df_subset = df.loc[df['Category'].str[2:4] <= '03']
Output:
Category
0 AB00
1 CD01
2 EF02
3 GH03
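The string comparison relies on every code having the same width, and astype(int) raises an error if a suffix is not numeric. If malformed rows are possible, pd.to_numeric with errors='coerce' is one option. This is a sketch under that assumption; the 'ZZXX' row is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'Category': ['AB00', 'CD01', 'EF02', 'GH03', 'RF04', 'ZZXX']})

# Convert characters 2-3 to numbers; text that is not numeric becomes NaN.
suffix = pd.to_numeric(df['Category'].str[2:4], errors='coerce')

# NaN <= 3 evaluates to False, so the malformed row is left out of the subset.
df_subset = df[suffix <= 3]
print(df_subset)
```

The conversion happens on a temporary Series, so df itself keeps its single string column.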

Python - How to split a Pandas value and only get the value between the slashs

Example:
The df['column'] has a bunch of values similar to: F/4500/O or G/2/P
The number of digits ranges from 1 to 4, as in the examples above.
How can I transform that column to keep only the number (e.g. 4500) as an integer?
I tried the split method but I can't get it right.
Thank you!
You could extract the value and convert it with pd.to_numeric:
df['number'] = pd.to_numeric(df['column'].str.extract(r'/(\d+)/', expand=False))
Example:
column number
0 F/4500/O 4500
1 G/2/P 2
How about:
df['column'].map(lambda x: int(x.split('/')[1]))
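The same split can be written with the vectorized .str accessor instead of a lambda. A minimal sketch, assuming every value has the form letter/digits/letter as in the question:

```python
import pandas as pd

df = pd.DataFrame({'column': ['F/4500/O', 'G/2/P']})

# Split on '/', take the middle piece, and convert it to an integer.
df['number'] = df['column'].str.split('/').str[1].astype(int)
print(df)
```

If some values might lack the middle part, the str.extract plus pd.to_numeric version yields NaN for those rows, whereas astype(int) raises an error.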

How to remove unwanted dots from strings in pandas column?

I have data that contains a column like the following:
mouse.pad.v.1.2
key.board.1.0.c30
pen.color.4.32.r
I am removing digits by
df["parts"] = df["parts"].str.replace(r'\d+', '', regex=True)
Once the digits are removed the data looks like the following:
mouse.pad.v..
key.board...c
pen.color...r
What I want is to replace any run of more than one dot with a single dot. The ideal output would be:
mouse.pad.v
key.board.c
pen.color.r
I tried using
df["parts"]= df["parts"].str.replace('..', '.')
But I am not sure how many dots will be combined together. Is there a way to automate it?
Try:
df["parts"] = df["parts"].str.replace(r"\.*\d+", "", regex=True)
print(df)
Prints:
parts
0 mouse.pad.v
1 key.board.c
2 pen.color.r
Input dataframe:
parts
0 mouse.pad.v.1.2
1 key.board.1.0.c30
2 pen.color.4.32.r
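To answer the literal question of collapsing an unknown number of consecutive dots, the quantifier {2,} matches any run of two or more. A sketch of a three-step version on the same input data:

```python
import pandas as pd

df = pd.DataFrame({'parts': ['mouse.pad.v.1.2', 'key.board.1.0.c30', 'pen.color.4.32.r']})

cleaned = (
    df['parts']
    .str.replace(r'\d+', '', regex=True)      # drop the digits
    .str.replace(r'\.{2,}', '.', regex=True)  # collapse runs of dots into one
    .str.strip('.')                           # remove leading/trailing dots
)
print(cleaned.tolist())
```

The final strip is needed because a run of dots at the end of a string collapses to one trailing dot rather than disappearing.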

Finding and deleting sub-strings in dataframe column Python

I would like to find all the rows in a column that contains a unique ID as a string which starts with digits and symbols. After they have been identified, I would like to delete the first 9 characters for those unique rows, only. So far I have:
if '.20_P' in df['ID']:
df['ID']= df['ID']str.slice[: 9]
where I would like it to take this:
df['ID'] =
2.2.2020_P18dhwys
2.1.2020_P18dh234
2.4.2020_P18dh229
P18dh209
P18dh219
2.5.2020_P18dh289
and turn it into this:
df['ID'] =
P18dhwys
P18dh234
P18dh229
P18dh209
P18dh219
P18dh289
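Do a conditional row-wise apply to the same column, slicing with [9:] to drop the first 9 characters. Test for '20_P' rather than '.20_P', because the in operator compares literal text and the dot is not a wildcard:
df['ID'] = df.apply(lambda row: row['ID'][9:] if '20_P' in row['ID'] else row['ID'], axis=1)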
You could also use a regular expression to find your substring.
The regular expression here works as follows: capture a substring (the parentheses) made of one or more (+) word characters (\w, which already includes digits, so \d is redundant inside the class). This may be preceded by any number (*) of digits, dots, or plus signs ([\d+\.], where + is a literal character inside the brackets) and an optional (?) underscore. Vectorized string methods are often faster than a row-wise .apply(), so this may be worth considering if you have a lot of data or run it often.
import pandas as pd
df = pd.DataFrame({'A': [
'2.2.2020_P18dhwys',
'2.1.2020_P18dh234',
'2.4.2020_P18dh229',
'P18dh209',
'P18dh219',
'2.5.2020_P18dh289',
]})
print(df['A'].str.extract(r'[\d+\.]*_?([\d\w]+)'))
Output:
0
0 P18dhwys
1 P18dh234
2 P18dh229
3 P18dh209
4 P18dh219
5 P18dh289
If you know that the string to remove is a prefix joined with an underscore, you could do
df['ID'] = df['ID'].apply(lambda x: x.split('_')[-1])
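Another option that does not depend on the prefix being exactly 9 characters long is an anchored regex replace. This is a sketch, assuming the prefix always has the form digits.digits.digits followed by an underscore:

```python
import pandas as pd

df = pd.DataFrame({'ID': [
    '2.2.2020_P18dhwys',
    '2.1.2020_P18dh234',
    '2.4.2020_P18dh229',
    'P18dh209',
    'P18dh219',
    '2.5.2020_P18dh289',
]})

# ^ anchors the pattern at the start; rows without a date prefix do not match
# and are returned unchanged.
df['ID'] = df['ID'].str.replace(r'^\d+\.\d+\.\d+_', '', regex=True)
print(df)
```

A longer prefix such as '12.10.2020_' would also be removed, which a fixed slice of 9 characters would not handle.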

replace values in pandas using regex that matches all values except the provided one

I want to use regex with pandas to replace values in a column to mark correct answer for the question.
Values in this column are '1943' - the correct one, and other years - incorrect.
The code I have now is:
incorrect_dict= {'Q1':{'^(?!1943$).*': 0}}
df = df.replace(incorrect_dict, regex=True)
and it doesn't replace values in pandas.
The regex itself seems ok, since it works when I use:
string ="1933"
regex = re.compile("^(?!1943$).*")
regex.findall(string)
I get:
[u'1933']
For string = '1943' I get 'No match was found:', so I assume the regex is OK. But when I use it with df.replace, the values are not replaced.
Thanks for any suggestions.
I suspect the years were parsed as integers. See how it fails:
In [17]: df = DataFrame({'Q1': [1933, 1943]})
In [18]: df.replace(incorrect_dict, regex=True)
Out[18]:
Q1
0 1933
1 1943
But if I convert the years to strings, it works as you expect.
In [19]: df['Q1'] = df['Q1'].map(str)
In [20]: df.replace(incorrect_dict, regex=True)
Out[20]:
Q1
0 0
1 1943
Incidentally, I'm not convinced that treating the responses as strings and using regex is the way to go. Why not keep the years as integers and evaluate df['Q1'] == 1943? The result will be True/False, meaning correct/incorrect. Seems more useful to me.
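The integer comparison suggested above might look like the following sketch. The column names Q1_correct and Q1_marked are hypothetical, chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Q1': [1933, 1943]})

# Boolean column: True where the answer is the correct year.
df['Q1_correct'] = df['Q1'] == 1943

# Keep 1943 and replace every other year with 0, without any regex.
df['Q1_marked'] = df['Q1'].where(df['Q1'] == 1943, 0)
print(df)
```

Series.where keeps the values where the condition holds and substitutes the second argument elsewhere, which reproduces the intended replacement while the column stays numeric.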
