I have a column of influenza virus names within my DataFrame. Here is a representative sampling of the name formats present:
(A/Egypt/84/2001(H1N2))
A/Brazil/1759/2004(H3N2)
A/Argentina/126/2004
I am only interested in getting out A/COUNTRY/NUMBER/YEAR from the strain names, e.g. A/Brazil/1759/2004. I have tried doing:
df['Strain Name'] = df['Original Name'].str.split("(")
However, if I try accessing .str[0], then I miss case #1. If I do .str[1], I miss cases #2 and #3.
Is there a solution that works for all three cases? Or is there some way to apply a condition in string splits, without iterating over each row in the data frame?
So, based on EdChum's recommendation, I'll post my answer here.
Minimal data frame required for tackling this problem:
Index Strain Name Year
0 (A/Egypt/84/2001(H1N2)) 2001
1 A/Brazil/1759/2004(H3N2) 2004
2 A/Argentina/126/2004 2004
Code for getting the strain names only, without parentheses or anything else inside the parentheses:
df['Strain Name'] = df['Strain Name'].str.split('(').apply(lambda x: max(x, key=len))
This works for the particular cases shown here: the trick is that the isolate's "strain name" is always the longest fragment after splitting on the opening parenthesis ("(").
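A minimal, self-contained sketch of this approach, using the sample strain names from the question (column name assumed from the frame above):

```python
import pandas as pd

df = pd.DataFrame({
    "Strain Name": [
        "(A/Egypt/84/2001(H1N2))",
        "A/Brazil/1759/2004(H3N2)",
        "A/Argentina/126/2004",
    ]
})

# Split on "(" and keep the longest fragment, which is the bare
# A/COUNTRY/NUMBER/YEAR strain name in all three formats
df["Strain Name"] = df["Strain Name"].str.split("(").apply(lambda x: max(x, key=len))
print(df["Strain Name"].tolist())
# → ['A/Egypt/84/2001', 'A/Brazil/1759/2004', 'A/Argentina/126/2004']
```

Note that this relies on the strain name being longer than any parenthesized subtype; if a subtype string could ever be longer, a regex extraction would be more robust.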
I have a pandas DataFrame with an id column and relatively large text in another column. I want to group by the id column and concatenate all of the large texts into one single text whenever the id repeats. It works great in a simple toy example, but when I run it on my real data it adds the index of the added rows to the final concatenated text. Here is my example code:
data = {"A":[1,2,2,3],"B":['asdsa','manish','shukla','wfs']}
testdf = pd.DataFrame(data)
testdf = testdf.groupby(['A'],as_index=False).agg({'B':" ".join})
As you can see, this code works great, but when I run it on my real data it adds indexes at the beginning of column B, so for A=2 it says something like "1 manish \n 2 shukla". It obviously works here, but I have no idea why it misbehaves with the larger text in my real data. Any pointers? I tried to search, but apparently no one else has run into this issue.
OK, I figured out the answer: if any rows in the DataFrame have NA or nulls, it does that. Once I removed the NAs and nulls, it worked.
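For illustration, a minimal sketch of the fix, assuming the missing values live in column B (the None row here stands in for the NA rows in the real data):

```python
import pandas as pd

data = {"A": [1, 2, 2, 3, 3], "B": ["asdsa", "manish", "shukla", None, "wfs"]}
testdf = pd.DataFrame(data)

# Drop rows where B is missing before aggregating, so that " ".join
# only ever sees strings and never a float NaN
clean = testdf.dropna(subset=["B"])
result = clean.groupby(["A"], as_index=False).agg({"B": " ".join})
print(result)
#    A              B
# 0  1          asdsa
# 1  2  manish shukla
# 2  3            wfs
```

Alternatively, `testdf["B"] = testdf["B"].fillna("")` keeps the rows but joins empty strings in place of the missing values.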
I've read several questions and answers to this, but I must be doing something wrong. I'd appreciate if someone points at me what it might be.
In my df DataFrame, the first column should always contain six digits. I'm loading the DataFrame from Excel, and some smart user thought it would be funny to add a disclaimer in the first column.
So I have in the first column something like:
['123456', '456789', '147852', 'In compliance with...']
So I need to filter only the valid records. I'm trying:
pat='\d{6}'
filter = df[0].str.contains(pat, regex=True)
This returns False for the disclaimer, but NaN for the matches, so doing df[filter] yields nothing.
What am I doing wrong?
You should be able to do that with the following.
You need to select the rows based on the regex filter.
Note that the regex you are using will also match strings containing more than six digits. I changed it to match exactly six digits.
df = df[df[df.columns[0]].str.contains('^[0-9]{6}$', regex=True)]
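As a sketch of one likely cause of the NaN entries the question describes: if the column contains missing values, str.contains returns NaN for them, and the mask can't be used for boolean indexing. Passing na=False (an addition, not part of the answer above) makes the mask safe:

```python
import pandas as pd

df = pd.DataFrame({0: ["123456", "456789", "147852", "In compliance with...", None]})

# na=False turns missing values into False instead of NaN, so the
# resulting mask is a clean boolean Series usable with df[mask]
mask = df[0].str.contains(r"^\d{6}$", regex=True, na=False)
print(df[mask])
```

The raw string (r"...") also avoids Python's invalid-escape warnings for \d.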
I'm trying to drop rows from dataframe if they 'partially' meet certain condition.
By 'partially' I mean some (not all) values in the cell meet the condition.
Let's say that I have this DataFrame.
>>> df
Title Body
0 Monday report: Stock market You should consider buying this.
1 Tuesday report: Equity XX happened.
2 Corrections and clarifications I'm sorry.
3 Today's top news Yes, it skyrocketed as I predicted.
I want to remove the entire row if the Title has "Monday report:" or "Tuesday report:".
One thing to note is that I used
TITLE = []
.... several lines of code to crawl the titles.
TITLE.append(headline)
to crawl and store them into dataframe.
Another thing is that my data are in tuples because I used
df = pd.DataFrame(list(zip(TITLE, BODY)), columns =['Title', 'Body'])
to make the dataframe.
I think that's why when I used,
df.query("'Title'.str.contains('Monday report:')")
I got an error.
When I did some googling here in StackOverflow, some advised to convert tuples into multi-index and to use filter(), drop(), or isin().
None of them worked.
Or maybe I used them in a wrong way...?
Any idea how to solve this problem?
You can do a basic filter for a condition and then pick the reverse of it using ~:
eg:
df[~df['Title'].str.contains('Monday report')] will give you output that excludes all rows containing 'Monday report' in the Title column.
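A self-contained sketch using the sample frame from the question; the | alternation in the pattern drops both report types at once (the combined pattern is an assumption, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Monday report: Stock market", "Tuesday report: Equity",
              "Corrections and clarifications", "Today's top news"],
    "Body": ["You should consider buying this.", "XX happened.",
             "I'm sorry.", "Yes, it skyrocketed as I predicted."],
})

# "|" in the regex matches either phrase; ~ inverts the boolean mask,
# keeping only the rows that match neither
out = df[~df["Title"].str.contains("Monday report:|Tuesday report:")]
print(out["Title"].tolist())
# → ['Corrections and clarifications', "Today's top news"]
```

Note that df.query("'Title'.str.contains(...)") fails because 'Title' in quotes is a string literal, not a column reference; plain boolean indexing as above sidesteps that.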
Background info
I'm working on a DataFrame where I have successfully joined two different datasets of football players using fuzzymatcher. These datasets did not have keys for an exact match and instead had to be done by their names. An example match of the name column from two databases to merge as one is the following
long_name name
L. Messi Lionel Andrés Messi Cuccittini
As part of the validation process for an 18,000-row database, I want to check the two date-of-birth columns in the merged DataFrame df, ensuring that the columns match as in the example below:
dob birth_date
1987-06-24 1987-06-24
Both date columns have been converted from strings to dates using pd.to_datetime(), e.g.
df['birth_date'] = pd.to_datetime(df['birth_date'])
My question
I have another column called 'value'. I want to update my pandas DataFrame so that if the two date columns match, the entry is unchanged; however, if the two date columns don't match, I want the data in the value column to be changed to null. This is something I can do quite easily in Excel with a date-diff calculation, but I'm unsure how in pandas.
My current code is the following:
df.loc[(df['birth_date'] != df['dob']),'value'] = np.nan
Reason for this step (feel free to skip)
The reason for this code is that it will quickly show me fuzzy matches that are inaccurate (approx 10% of total database) and allow me to quickly fix those.
Ideally I also need to work on the matching algorithm to ensure a perfect date match; however, it works quite well in its current state and the project is nearly complete. If this is something you know about, I'd be happy to hear any advice.
Many thanks in advance!
IIUC:
Try np.where.
It works as follows:
np.where(condition, x, y)  # where the condition is True, assign x; otherwise assign y
Here the condition is df['birth_date'] != df['dob'], x is np.nan, and y is the prevailing df['value']:
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
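A minimal end-to-end sketch with made-up rows, assuming both date columns have already been through pd.to_datetime() as described in the question:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "dob":        pd.to_datetime(["1987-06-24", "1990-01-01"]),
    "birth_date": pd.to_datetime(["1987-06-24", "1991-05-05"]),
    "value":      [100.0, 50.0],
})

# Keep value where the two dates agree; blank it out with NaN otherwise,
# flagging the row as a likely bad fuzzy match
df["value"] = np.where(df["birth_date"] != df["dob"], np.nan, df["value"])
print(df)
```

This is equivalent to the df.loc assignment in the question; either form works for flagging the mismatched rows.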
I'm trying to teach myself pandas and playing around with different dtypes.
I have a df as follows
df = pd.DataFrame({'ID':[0,2,"bike","cake"], 'Course':['Test','Math','Store','History'] })
print(df)
ID Course
0 0 Test
1 2 Math
2 bike Store
3 cake History
the dtype of ID is of course an object. What I want to do is remove any rows in the DF if the ID has a string in it.
I thought this would be as simple as..
df.ID.filter(regex='[\w]*')
but this returns everything, is there a sure fire method for dealing with such things?
You can use to_numeric:
df[pd.to_numeric(df.ID,errors='coerce').notnull()]
Out[450]:
Course ID
0 Test 0
1 Math 2
Another option is to convert the column to string and use str.match:
print(df[df['ID'].astype(str).str.match(r"\d+")])
# Course ID
#0 Test 0
#1 Math 2
Your code does not work, because as stated in the docs for pandas.DataFrame.filter:
Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.
Wen's answer is the correct (and fastest) way to solve this, but to explain why your regular expression doesn't work, you have to understand what \w means.
\w matches any word character, which includes [a-zA-Z0-9_]. So what you're currently matching includes digits, so everything is matched. A valid regular expression approach would be:
df.loc[df.ID.astype(str).str.match(r'\d+')]
ID Course
0 0 Test
1 2 Math
The second issue is your use of filter. It isn't filtering your ID row, it is filtering your index. A valid solution using filter would be as follows:
df.set_index('ID').filter(regex=r'^\d+$', axis=0)
Course
ID
0 Test
2 Math
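As a further alternative (not from the original answers, and requiring pandas 1.0+ where Series.str.fullmatch was added), fullmatch rejects partial matches such as "2bike" that str.match(r'\d+') would let through:

```python
import pandas as pd

df = pd.DataFrame({"ID": [0, 2, "bike", "cake"],
                   "Course": ["Test", "Math", "Store", "History"]})

# fullmatch requires the whole string to consist of digits, so mixed
# values like "2bike" are excluded, not just values starting with a letter
out = df[df["ID"].astype(str).str.fullmatch(r"\d+")]
print(out)
#   ID Course
# 0  0   Test
# 1  2   Math
```

For the sample data here all three approaches agree; fullmatch only matters when numeric and alphabetic characters can mix within one ID.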