Split strings in DataFrame and keep only certain parts - python

I have a DataFrame like this:
x = ['3.13.1.7-2.1', '3.21.1.8-2.2', '4.20.1.6-2.1', '4.8.1.2-2.0', '5.23.1.10-2.2']
df = pd.DataFrame(data = x, columns = ['id'])
id
0 3.13.1.7-2.1
1 3.21.1.8-2.2
2 4.20.1.6-2.1
3 4.8.1.2-2.0
4 5.23.1.10-2.2
I need to split each id string on the periods, and then I need to know when the second part is 13 and the third part is 1. Ideally, I would have one additional column that is a boolean value (in the above example, index 0 would be TRUE and all others would be FALSE). But I could live with multiple additional columns, where one or more contain individual string parts, and one is for said boolean value.
I first tried to just split the string into parts:
df['id_split'] = df['id'].apply(lambda x: str(x).split('.'))
This worked, however if I try to isolate only the second part of the string like this...
df['id_split'] = df['id'].apply(lambda x: str(x).split('.')[1])
...I get an error that the list index is out of range.
Yet, if I check any individual index in the DataFrame like this...
df['id_split'][0][1]
...this works, producing only the second item in the list of strings.
I guess I'm not familiar enough with what the .apply() method is doing to know why it won't accept list indices. But anyway, I'd like to know how I can isolate just the second and third parts of each string, check their values, and output a boolean based on those values, in a scalable manner (the actual dataset is millions of rows). Thanks!

Let's use str.split to get the parts, then you can compare:
parts = df['id'].str.split(r'\.', expand=True)
(parts[[1, 2]] == ['13', '1']).all(axis=1)
Output:
0 True
1 False
2 False
3 False
4 False
dtype: bool
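To store the result as a boolean column on df, you can assign it back (a small sketch reusing the parts frame built above):
df['flag'] = (parts[[1, 2]] == ['13', '1']).all(axis=1)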

You can do something like this
df['flag'] = df['id'].apply(lambda x: x.split('.')[1] == '13' and x.split('.')[2] == '1')
Output
id flag
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False

You can do it directly, like below:
df['new'] = df['id'].apply(lambda x: str(x).split('.')[1]=='13' and str(x).split('.')[2]=='1')
>>> print(df)
id new
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False
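A regex alternative (just a sketch, assuming every id has at least four dot-separated parts as in the example) avoids splitting altogether:
df['new'] = df['id'].str.match(r'[^.]+\.13\.1\.')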

Related

Pandas groupby .all method

I am trying to understand how to use .all, for example:
import pandas as pd
df = pd.DataFrame({
"user_id": [1,1,1,1,1,2,2,2,3,3,3,3],
"score": [1,2,3,4,5,3,4,5,5,6,7,8]
})
When I try:
df.groupby("user_id").all(lambda x: x["score"] > 2)
I get:
score
user_id
1 True
2 True
3 True
But I expect:
score
user_id
1 False # Since for first group of users the score is not greater than 2 for all
2 True
3 True
In fact it doesn't even matter what value I pass instead of 2, the result DataFrame always has True for the score column.
Why do I get the result that I get? How can I get my expected result?
I looked at the documentation: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.all.html, but it is very brief and did not help me.
The line
df.groupby("user_id").all(lambda x: x["score"] > 2)
is not asking "are all scores larger than 2?". The lambda is silently consumed as the skipna argument, so it is really asking "is every score in the group truthy (non-zero)?", which is always True here.
To ask what you really want, you need to do the following:
df['score'].gt(2).groupby(df['user_id']).all()
Out
user_id
1 False
2 True
3 True
groupby.all does not take a function as a parameter. Its only parameter (skipna) accepts a boolean and changes how NaN values are interpreted.
You probably want:
df['score'].gt(2).groupby(df['user_id']).all()
Which can also be written as:
df.assign(flag=df['score'].gt(2)).groupby('user_id')['flag'].all()
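If you want the per-user result broadcast back onto every row instead of one value per user, a transform works (a sketch; the column name all_gt_2 is arbitrary):
df['all_gt_2'] = df['score'].gt(2).groupby(df['user_id']).transform('all')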

Check if syntax in pandas column meets certain criteria

I have the dataframe underneath:
df = pd.DataFrame(np.array(['YM.296','MM.305','VO.081.019','VO.081.016','AM.081.002.001','AM081','SR.082','VO.081.012.001','VO.081.012.003']))
I want to know in what row the syntax is similar to 'XX.222.333' (example). So two letters followed by a stop ('.') followed by three numbers followed by a stop ('.') followed by three numbers again.
The desired outcome looks as follows:
tf = pd.DataFrame(np.array([False,False,True,True,False,False,False,False, False]))
Is there a fast and pythonic way to do this?
You can do that using str.contains and regex.
As follows:
df[0].str.contains(r'^[A-Z]{2}\.\d{3}\.\d{3}$', regex=True)
Outputs:
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 False
Name: 0, dtype: bool
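To get a boolean DataFrame like the desired tf, or to keep only the matching rows, you can reuse the mask (a small sketch):
mask = df[0].str.contains(r'^[A-Z]{2}\.\d{3}\.\d{3}$', regex=True)
tf = mask.to_frame()
matches = df[mask]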

See if a list of strings in df1['Keyword'] is found anywhere in df2['Description'] and drop rows of df2 with no matches

I have two dataframes, df1 and df2.
df1 contains a list of strings in a column called 'Keyword'. I'm trying to see whether these strings are found within a second dataframe, df2['Description'], and drop the rows of df2 that do not contain at least one of the strings from df1['Keyword'].
df1 df2
Keyword Description
Car I like driving my **car** <- Keep
Dog No keywords here! <- Drop Row
Elephant Bart gets an **Elephant** <- Keep
Bat No keywords in this sentence <- Drop Row
What I've tried:
df2['Check'] = df1["Keyword"].isin(df2["Description"])
Everything evaluates to FALSE, even when there is a match. The idea is to drop all rows that contain FALSE once the code is working.
You can define a small lookup() helper that returns the first keyword found in a string (or None if nothing matches):
def lookup(x, values):
    for value in values:
        if value.lower() in x.lower():
            return value
And then you can apply the function to df2:
df2['match'] = df2['Description'].apply(lambda x: lookup(x, df1['Keyword']))
Solution 1:
You can create a long string of unique values joined by | and use str.contains to search for those values, passing case=False:
df1['Check'] = df2['Description'].str.contains('|'.join(df1['Keyword'].unique()), case=False)
Out[1]:
Keyword Check
0 Car True
1 Dog False
2 Elephant True
3 Bat False
Solution 2: You can use a generator expression with any() to check whether a keyword appears in any of the descriptions. Make sure to normalize the case, since Python string matching is case-sensitive:
df1['Check'] = df1['Keyword'].apply(
    lambda x: any(x.lower() in y for y in df2['Description'].str.casefold())
)
# str.casefold() is potentially more robust than str.lower() if the text is not in English
df1
Out[2]:
Keyword Check
0 Car True
1 Dog False
2 Elephant True
3 Bat False
From there, just do df1[df1['Check']]:
Keyword Check
0 Car True
2 Elephant True
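To actually drop the non-matching rows from df2, as the question asks, apply the same kind of mask to df2 (a sketch; re.escape guards against regex metacharacters in the keywords):
import re
pattern = '|'.join(re.escape(k) for k in df1['Keyword'].unique())
df2 = df2[df2['Description'].str.contains(pattern, case=False)]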

Python PANDAS: Apply Multi-Line Boolean Criteria Within Group?

I have a dataset with the following general format:
id,thing_criteria_field,place_criteria_field
1,thing_1,place_2
1,thing_3,place_2
1,thing_3,place_2
1,thing_7,place_1
2,thing_3,place_3
2,thing_7,place_2
2,thing_9,place_2
2,thing_4,place_5
3,thing_1,place_1
3,thing_2,place_6
3,thing_3,place_6
3,thing_4,place_6
What I am trying to accomplish is to apply two boolean criteria within a group, where the criteria values MAY be spread across multiple records/lines of the group. If both criteria are met somewhere in the group, keep all of the group's records; if not, filter out all records for that group.
This is a simplified example. The criteria sets are huge lists which is why I concatenate them with pipes and use str.contains() with regex=True instead of something simpler.
This is what I have come up with so far but I do not think I am even on the right track for handling the possibility of multi-line criteria within groups or returning all when found.
thing_criteria = (x.df['thing_criteria_field'].str.contains('thing_1|thing2|thing3', regex=True))
place_criteria = (x.df['place_criteria_field'].str.contains('place_1', regex=True))
df_result = df.groupby('id').filter(lambda x: (thing_criteria & place_criteria).all())
This is the result set I am trying to create from the sample dataset:
id,thing_criteria_field,place_criteria_field
1,thing_1,place_2
1,thing_3,place_2
1,thing_3,place_2
1,thing_7,place_1
3,thing_1,place_1
3,thing_2,place_6
3,thing_3,place_6
3,thing_4,place_6
Any advice would be most appreciated!
Try this:
# Build a dataframe indicating whether each row meets
# each individual criterion
all_criteria = [thing_criteria, place_criteria]
cond = pd.DataFrame(all_criteria).T.assign(id=df['id'])
# Now group them by id and reduce the truth values
# .any(): test if any row in the group matches a single criterion
# .all(): test if all criteria are met in the group
match = cond.groupby('id').apply(lambda x: x.iloc[:, :-1].any().all())
ids = match[match].index
# Finally, get the ids that matches all criteria
df[df['id'].isin(ids)]
How any().all() works: let's say you have the following groups:
thing_criteria_field place_criteria_field id
0 True False 1
1 False False 1
2 False False 1
3 False True 1
-------------------------------------------------
any: True True ==> all: True
thing_criteria_field place_criteria_field id
4 False False 2
5 False False 2
6 False False 2
7 False False 2
-------------------------------------------------
any: False False ==> all: False
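An equivalent, more compact sketch uses GroupBy.transform. It assumes the criteria are defined against the full df (not inside filter), and that thing_2/thing_3 with underscores were intended in the pattern:
thing_criteria = df['thing_criteria_field'].str.contains('thing_1|thing_2|thing_3', regex=True)
place_criteria = df['place_criteria_field'].str.contains('place_1', regex=True)
keep = (thing_criteria.groupby(df['id']).transform('any')
        & place_criteria.groupby(df['id']).transform('any'))
df_result = df[keep]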

Assigning a boolean value to a blank series in a pandas data frame based on a count of values in a string in another series

I want to assign a boolean value to a currently blank column: "True" if the first column contains only one period and "False" if it contains more than one period.
This is what I've gotten to at this point and I am completely stuck:
for index, row in qbstats.iterrows():
if qbstats['qb'].count(".") > 1
...... so if it's greater than one I want to set the column labeled "num_periods_in_name" to False, otherwise to True.
I would appreciate any help, thanks.
You can use np.where():
import numpy as np
df['New Col'] = np.where(df['qb'].str.count(r'\.') > 1, False, True)
Note, you will need to escape the . with a \ as well (use a raw string like r'\.' so the backslash is preserved).
Below is an example:
qb
0 Hello.
1 helloo...
2 hello...ooo
3 Hell.o
And applying the code above gives:
qb New Col
0 Hello. True
1 helloo... False
2 hello...ooo False
3 Hell.o True
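If you'd rather avoid np.where, a direct comparison gives the boolean Series in one step (a small sketch):
df['New Col'] = df['qb'].str.count(r'\.').le(1)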
