Filter Dataframe using regular expression - python

I have a DataFrame with a column of values separated by semicolons, e.g. Patient1_Control2; Patient1_Patient3; Control1_Control3. However, I only want the rows with PatientX_ControlX or ControlX_PatientX; I don't want ControlX_ControlX or PatientX_PatientX. I thought of the method filter(regex='...'), but this does not quite do the job. I want to filter the DataFrame by a regular expression matching PatientX_ControlX or ControlX_PatientX (X meaning an arbitrary string). Is there any method for that? Thanks so much in advance. I'm still learning how to code, so every tip would be great. If you have any sources where I can learn more about regular expressions, that'd be amazing!

Filter the column data to keep only the rows that contain the relevant values -
df[df["data"].str.contains(r'Patient\d+_Control\d+|Control\d+_Patient\d+')]
For the following dataframe -
df = pd.DataFrame({"data": ["Patient1_Control2", "Patient1_Patient3", "Control1_Patient3", "Control1_Control3"]})
df[df["data"].str.contains(r'Patient\d+_Control\d+|Control\d+_Patient\d+')]
Output is -
                data
0  Patient1_Control2
2  Control1_Patient3
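For completeness, a self-contained variant that drops the same-group pairs instead of selecting the mixed ones; on the toy frame above it keeps the same rows:
import pandas as pd

df = pd.DataFrame({"data": ["Patient1_Control2", "Patient1_Patient3",
                            "Control1_Patient3", "Control1_Control3"]})

# Drop Patient_Patient and Control_Control pairs; what remains are the mixed pairs
print(df[~df["data"].str.contains(r"Patient\d+_Patient\d+|Control\d+_Control\d+")])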

Get unique years from a date column in pandas DataFrame

I have a date column in my DataFrame say df_dob and it looks like -
id     DOB
23312  31-12-9999
1482   31-12-9999
807    #VALUE!
2201   06-12-1925
653    01/01/1855
108    01/01/1855
768    1967-02-20
What I want to print is a list of unique years like `['9999', '1925', '1855', '1967']`.
Basically, through this list I just want to check whether any unwanted year is present or not.
I have tried (code pasted below) but I'm getting ValueError: time data 01/01/1855 doesn't match format specified and could not resolve it.
df_dob['DOB'] = df_dob['DOB'].replace('01/01/1855 00:00:00', '1855-01-01')
df_dob['DOB'] = pd.to_datetime(df_dob.DOB, format='%Y-%m-%d')
df_dob['DOB'] = df_dob['DOB'].dt.strftime('%Y-%m-%d')
print(np.unique(df_dob['DOB']))
# print(list(df_dob['DOB'].year.unique()))
P.S - when I print df_dob['DOB'], I get values like - 1967-02-20 00:00:00
Can you try this?
df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])
df_dob['YOB'] = df_dob['DOB'].dt.strftime('%Y')
Use pandas' unique for this, and on the year only.
So try:
print(df_dob['DOB'].dt.year.unique())
Also, you don't need to stringify your time. Also, you don't need to replace anything; pandas is smart enough to do it for you. So your overall code becomes:
df_dob['DOB'] = pd.to_datetime(df_dob.DOB)  # No need to pass format if there isn't some specific anomaly
print(df_dob['DOB'].dt.year.unique())
Edit:
Another method:
Since you have an out-of-bounds problem (the year 9999 exceeds pandas' datetime range), another method you can try is not converting the values to datetime at all, but rather finding all the four-digit numbers in each value using regex.
So,
df_dob['DOB'].str.extract(r'(\d{4})')[0].unique()
[0] because unique() is a method of pd.Series, not of a DataFrame, so we take the first (and only) column of the extracted result.
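A minimal runnable sketch of that regex approach on sample data like the table above (the frame and column names follow the question; the printed output is approximate):
import pandas as pd

df_dob = pd.DataFrame({"DOB": ["31-12-9999", "#VALUE!", "06-12-1925",
                               "01/01/1855", "1967-02-20"]})

# Pull the first run of four consecutive digits out of each value;
# rows with no match (e.g. "#VALUE!") come back as NaN and are dropped.
years = df_dob["DOB"].str.extract(r"(\d{4})")[0].dropna().unique()
print(years)  # e.g. ['9999' '1925' '1855' '1967']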
The first thing you need to know is whether the resulting values (which you said look like 1967-02-20 00:00:00) are datetimes or not. That's as simple as df_dob.info()
If the result says something like datetime64[ns] for the DOB column, you're good. If not, you'll need to cast it as a datetime. You have a couple of different formats, so that might be part of your problem; also, because there are several ways of doing this and it's a separate question, I'm not addressing it here.
We're going to leverage the speed of sets, plus a bit of pandas, and then convert that back to a list, as you wanted the final version to be one.
years = list({i for i in df_dob['DOB'].dt.year})
And just a side note: you can't use [] instead of list(), as you'd end up with a list whose single element is a set.
That's a list, as you indicated. If you want it as a column, you won't get unique values.
Nitish's answer will also work, but gives you something like: array([9999, 1925, 1855, 1967])
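A minimal runnable sketch of the set-based approach described above, using uniformly formatted toy dates (the question's mixed formats and the out-of-range year 9999 would need the handling discussed earlier):
import pandas as pd

df_dob = pd.DataFrame({"DOB": ["1925-12-06", "1855-01-01", "1967-02-20", "1967-02-20"]})
df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])

df_dob.info()  # DOB should show up as datetime64[ns]

# Collect the years into a set to deduplicate, then back into a list
years = list({y for y in df_dob["DOB"].dt.year})
print(years)  # e.g. [1925, 1855, 1967] -- set iteration order is not guaranteed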

Manipulate string in python (replace string with part of the string itself)

So I am trying to transform the data I have into a form I can work with. I have a column called "season/teams" with values that look something like "1989-90 Bos".
I would like to transform that into the string "1990" in Python using a pandas DataFrame. I read some tutorials about pd.replace() but can't seem to find a use for my scenario. How can I solve this? Thanks for the help.
FYI, I have 16k lines of data.
A snapshot of the data I am working with:
To change that field from "1989-90 BOS" to "1990" you could do the following:
df['Yr/Team'] = df['Yr/Team'].str[:2] + df['Yr/Team'].str[5:7]
If the structure of your data will always be the same, this is an easy way to do it.
If the data in your Yr/Team column has a standard format you can extract the values you need based on their position.
import pandas as pd
df = pd.DataFrame({'Yr/Team': ['1990-91 team'], 'data': [1]})
df['year'] = df['Yr/Team'].str[0:2] + df['Yr/Team'].str[5:7]
print(df)
        Yr/Team  data  year
0  1990-91 team     1  1991
You can use pd.Series.str.extract to extract a pattern from a column of strings. For example, if you want to extract the first year, second year and team into three different columns, you can use this:
df["Yr/Team"].str.extract(r"(?P<start_year>\d+)-(?P<end_year>\d+) (?P<team>\w+)")
Note the use of named groups to automatically name the columns.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
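For example, a minimal runnable sketch of that extraction, rebuilding the "1990"-style value afterwards (the "Yr/Team" and "year" column names are assumptions based on the question):
import pandas as pd

df = pd.DataFrame({"Yr/Team": ["1989-90 Bos", "1990-91 LAL"]})

parts = df["Yr/Team"].str.extract(r"(?P<start_year>\d+)-(?P<end_year>\d+) (?P<team>\w+)")

# Rebuild the full second year, e.g. "1989-90 Bos" -> "1990"
# (seasons crossing a century boundary, e.g. "1999-00", would need extra care)
df["year"] = parts["start_year"].str[:2] + parts["end_year"]
print(df)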

How to extract strings using Regex in pandas based on multiple if-conditions

dframe = pd.DataFrame({'Num': ['38-30', '38-30', '21-51', '24-11', '34-20'],
                       'Des': ['Generates Vacuum', 'Pressure low', 'Ground Problem',
                               'Leak from Controller', 'Lock Unserviceable']})
In the above dataframe I want to extract specific strings from Des column based on the Num column. For instance, if Num is equal to 38-30 then extract Vacuum from Des and if it doesn't have Vacuum, then extract Pressure. There are multiple strings that I have to extract for each Num.
I am trying to use re.extract('generator|Pressure',re.I) but I don't know how to include the if statements as I mentioned above.
My output should look like this:
Use np.where with findall
np.where(dframe.Num.eq('38-30'), dframe.Des.str.findall('Vacuum|Pressure', flags=re.IGNORECASE), '')
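A self-contained sketch of that np.where/findall combination on the question's frame (findall returns a list per matching row, and the 'Extracted' column name is just illustrative):
import re
import numpy as np
import pandas as pd

dframe = pd.DataFrame({'Num': ['38-30', '38-30', '21-51', '24-11', '34-20'],
                       'Des': ['Generates Vacuum', 'Pressure low', 'Ground Problem',
                               'Leak from Controller', 'Lock Unserviceable']})

# Only rows where Num is 38-30 get the regex matches; everything else becomes ''
dframe['Extracted'] = np.where(dframe['Num'].eq('38-30'),
                               dframe['Des'].str.findall('Vacuum|Pressure',
                                                         flags=re.IGNORECASE),
                               '')
print(dframe)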

Test Anova on multiple groups

I have the following dataframe:
I would like to use this code to compare the means across my entire dataframe:
F_statistic, pVal = stats.f_oneway(percentage_age_ss.iloc[:, 0:1],
                                   percentage_age_ss.iloc[:, 1:2],
                                   percentage_age_ss.iloc[:, 2:3],
                                   percentage_age_ss.iloc[:, 3:4])  # etc...
However, I don't want to use .iloc each time because it takes too much time. Is there another way to do it?
Thanks
Get the columns with a generator expression, then use star syntax to expand them into the argument list:
stats.f_oneway(*(percentage_age_ss[col] for col in percentage_age_ss.columns))
or, just:
stats.f_oneway(*percentage_age_ss.T.values)
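A self-contained toy example of the unpacking approach (the column names and values here are made up purely for illustration):
import pandas as pd
from scipy import stats

# Toy stand-in for percentage_age_ss
percentage_age_ss = pd.DataFrame({"g1": [1.0, 2.0, 3.0],
                                  "g2": [2.0, 3.0, 5.0],
                                  "g3": [6.0, 7.0, 9.0]})

# Each column becomes one sample passed to the one-way ANOVA
F_statistic, pVal = stats.f_oneway(*(percentage_age_ss[col]
                                     for col in percentage_age_ss.columns))
print(F_statistic, pVal)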

SQL Alchemy using Or_ looping multiple columns (Pandas Dataframes)

SUMMARY:
How to query against values from different data frame columns with table.column_name combinations in SQL Alchemy using the OR_ statement.
I'm working on a SQL Alchemy project where I pull down valid columns of a dataframe and enter them all into SQL Alchemy's filter. I've successfully got it running where it would enter all entries of a column using the head of the column like this:
qry = qry.filter(or_(*[getattr(Query_Tbl, column_head).like(x)
                       for x in df[column_head].dropna().values]))
This produced the pattern I was looking for of (tbl.column1 like a OR tbl.column1 like b...) AND- etc.
However, there are groups of the dataframe that need to be placed together: the columns are different but still need to fall within the same OR_ group,
i.e. (the desired result)
(tbl1.col1 like a OR tbl.col1 like b OR tbl.col2 like c OR tbl.col2 like d OR tbl.col3 like e...) etc.
My latest attempt was to sub-group the columns I needed grouped together, then repeat the previous style inside those groups like:
qry = qry.filter(or_((*[getattr(Query_Tbl, set_id[0]).like(x)
                        for x in df[set_id[0]].dropna().values]),
                     (*[getattr(Query_Tbl, set_id[1]).like(y)
                        for y in df[set_id[1]].dropna().values]),
                     (*[getattr(Query_Tbl, set_id[2]).like(z)
                        for z in df[set_id[2]].dropna().values])
                     ))
Where set_id is a list of 3 strings corresponding to column1, column2, and column3, so I get the designated results. However, this simply produces:
(What I'm actually getting)
(tbl.col1 like a OR tbl.col1 like b..) AND (tbl.col2 like c OR tbl.col2 like d...) AND (tbl.col3 like e OR...)
Is there a better way to go about this in SQL Alchemy to get the result I want, or would it be better to find a way of implementing column values with Pandas directly into getattr() to work it into my existing code?
Thank you for reading and in advance for your help!
It appears I was having issues with the way the dataframe was formatted, and I was reading column names into groups differently. This pattern works for anyone who wants to process multiple dataframe columns into the same OR statement.
I apologize for the issue; if anyone has any comments or questions on the subject, I will help others with this type of issue.
Alternatively, I found a much cleaner answer. Since SQLAlchemy's or_ function can be used with a variable column if you use Python's built-in getattr() function, you only need to create (column, value) pairs, whereby you can unpack both in a loop.
for group in [group_2, group_3]:
    set_id = list(set(df.columns.values) & set(group))
    if len(set_id) > 1:
        set_tuple = list()
        for column in set_id:
            for value in df[column].dropna().values:
                set_tuple.append((column, value))
        print(set_tuple)
        qry = qry.filter(or_(*[getattr(Query_Tbl, id).like(x) for id, x in set_tuple]))
        df = df.drop(group, axis=1)
If you know which columns need to be grouped in the or_ statement, you can put them into lists and iterate through them. Inside those, you create a list of tuples holding the (column, value) pairs you need. Then within the or_ call you unpack the column and value in a loop and assign them accordingly. The code is much easier to read and much more compact. I found this to be a more robust solution than explicitly writing out cases for the group sizes.
