Remove all rows that meet a regex condition - python

I'm trying to teach myself pandas and playing around with different dtypes.
I have a df as follows:
df = pd.DataFrame({'ID':[0,2,"bike","cake"], 'Course':['Test','Math','Store','History'] })
print(df)
ID Course
0 0 Test
1 2 Math
2 bike Store
3 cake History
The dtype of ID is of course object. What I want to do is remove any row from the DF if its ID contains a string.
I thought this would be as simple as:
df.ID.filter(regex='[\w]*')
but this returns everything. Is there a surefire method for dealing with such things?

You can use to_numeric:
df[pd.to_numeric(df.ID,errors='coerce').notnull()]
Out[450]:
Course ID
0 Test 0
1 Math 2
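For reference, a minimal end-to-end sketch of this approach (the reset_index call is just an optional extra to renumber the surviving rows):

import pandas as pd

df = pd.DataFrame({'ID': [0, 2, 'bike', 'cake'],
                   'Course': ['Test', 'Math', 'Store', 'History']})

# Coerce non-numeric IDs to NaN, then keep only the rows where the conversion succeeded
numeric_ids = pd.to_numeric(df.ID, errors='coerce')
clean = df[numeric_ids.notnull()].reset_index(drop=True)
print(clean)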

Another option is to convert the column to string and use str.match:
print(df[df['ID'].astype(str).str.match("\d+")])
# Course ID
#0 Test 0
#1 Math 2
Your code does not work, because as stated in the docs for pandas.DataFrame.filter:
Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.

Wen's answer is the correct (and fastest) way to solve this, but to explain why your regular expression doesn't work, you have to understand what \w means.
\w matches any word character, i.e. [a-zA-Z0-9_]. Since that set includes digits, every row is matched. A valid regular-expression approach would be:
df.loc[df.ID.astype(str).str.match(r'\d+')]
ID Course
0 0 Test
1 2 Math
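One caveat: str.match only anchors at the start of the string, so a hypothetical mixed value such as '2bikes' would still pass. If such values are possible, anchoring the pattern (or using str.fullmatch on pandas 1.1+) is safer:

# anchor both ends so the whole ID must be digits
df.loc[df.ID.astype(str).str.match(r'^\d+$')]
# or, on pandas >= 1.1
df.loc[df.ID.astype(str).str.fullmatch(r'\d+')]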
The second issue is your use of filter. It isn't filtering your ID column; it is filtering your index. A valid solution using filter would be as follows:
df.set_index('ID').filter(regex=r'^\d+$', axis=0)
Course
ID
0 Test
2 Math
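If you need ID back as a regular column afterwards, chaining .reset_index() restores it:

df.set_index('ID').filter(regex=r'^\d+$', axis=0).reset_index()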

Related

Pandas: Filter rows by regex condition

I've read several questions and answers on this, but I must be doing something wrong. I'd appreciate it if someone could point out what it is.
In my df dataframe the first column should always contain six digits. I'm loading the dataframe from Excel, and some smart user thought it would be funny to add a disclaimer in the first column.
So I have in the first column something like:
['123456', '456789', '147852', 'In compliance with...']
So I need to filter only the valid records. I'm trying:
pat='\d{6}'
filter = df[0].str.contains(pat, regex=True)
This returns False for the disclaimer but NaN for the matches, so df[filter] yields nothing.
What am I doing wrong?
You should be able to do that with the following.
You need to select the rows based on the regex filter.
Note that the regex you are using will also match anything with more than 6 digits. I changed it to match exactly 6 digits.
df = df[df[df.columns[0]].str.contains('^[0-9]{6}$', regex=True)]
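If the six-digit cells were read from Excel as numbers rather than text, .str.contains returns NaN for those non-string cells, which is exactly the behaviour described above. A hedged sketch that converts the column to string first (assuming the numeric cells come through as integers), so every cell is compared as text:

# stringify the column before matching so numeric cells don't produce NaN
mask = df[df.columns[0]].astype(str).str.contains(r'^\d{6}$', regex=True)
df = df[mask]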

Get rid of initial spaces at specific cells in Pandas

I am working with a big dataset (more than 2 million rows x 10 columns) that has a column with string values that were filled oddly. Some rows start and end with many space characters, while others don't.
What I have looks like this:
col1
0 (spaces)string(spaces)
1 (spaces)string(spaces)
2 string
3 string
4 (spaces)string(spaces)
I want to get rid of those spaces at the beginning and at the end and get something like this:
col1
0 string
1 string
2 string
3 string
4 string
Normally, for a small dataset I would use a for loop (I know it's far from optimal), but now that's not an option given the time it would take.
How can I use the power of pandas to avoid a for loop here?
Thanks!
Edit: I can't just remove all whitespace, since the strings themselves contain spaces.
df['col1'].apply(lambda x: x.strip())
might help
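A vectorized alternative (a minimal sketch, assuming the column name above) that skips the Python-level lambda, assigns the result back, and leaves NaN values untouched:

# .str.strip() removes leading and trailing whitespace in one vectorized pass
df['col1'] = df['col1'].str.strip()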

Match all values of a str column in one dataframe with another dataframe's str column

I have two pandas dataframes:
Dataframe 1:
ITEM ID TEXT
1 some random words
2 another word
3 blah
4 random words
Dataframe 2:
INDEX INFO
1 random
3 blah
I would like to match the values from the INFO column (of dataframe 2) with the TEXT column of dataframe 1. If there is a match, I would like to see a new column with a "1".
Something like this:
ITEM ID TEXT MATCH
1 some random words 1
2 another word
3 blah 1
4 random words 1
I was able to create a match per value of the INFO column that I'm looking for with this line of code:
dataframe1.loc[dataframe1['TEXT'].str.contains('blah'), 'MATCH'] = '1'
However, in reality my dataframe 2 has 5000 rows, so I cannot manually copy-paste all of this. Basically I'm looking for something like this:
dataframe1.loc[dataframe1['TEXT'].str.contains('Dataframe2[INFO]'), 'MATCH'] = '1'
I hope someone can help, thanks!
Give this a shot:
Code:
dfA['MATCH'] = dfA['TEXT'].apply(lambda x: min(len([ y for y in dfB['INFO'] if y in x]), 1))
Output:
ITEM ID TEXT MATCH
0 1 some random words 1
1 2 another word 0
2 3 blah 1
3 4 random words 1
It's a 0 if it's not a match, but that's easy enough to weed out.
There may be a better / faster native solution, but it gets the job done by iterating over both the 'TEXT' and 'INFO' columns. Depending on your use case, it may be fast enough.
Looks like .map() in lieu of .apply() would work just as well. Could make a difference in timing, again, based on your use case.
Updated to take into account string contains instead of exact match...
You could get the unique values from the column in the first dataframe, convert them to a list, and then use the eval method on the second with Series.str.contains on that list:
unique = df1['TEXT'].unique().tolist()
df2.eval("Match=Text.str.contains('|'.join(#unique))")

Count occurrences of certain values in dask.dataframe

I have a dataframe like this:
df.head()
day time resource_record
0 27 00:00:00 AAAA
1 27 00:00:00 A
2 27 00:00:00 AAAA
3 27 00:00:01 A
4 27 00:00:02 A
and want to find out how many occurrences of certain resource_records exist.
My first try was using the Series returned by value_counts(), which seems great, but does not allow me to exclude some labels afterwards, because there is no drop() implemented in dask.Series.
So I tried just to not print the undesired labels:
for row in df.resource_record.value_counts().iteritems():
    if row[0] in ['AAAA']:
        continue
    print('\t{0}\t{1}'.format(row[1], row[0]))
This works fine, but what if I ever want to work further with this data and really want it 'cleaned'? So I searched the docs a bit more and found mask(), but this feels a bit clumsy as well:
records = df.resource_record.mask(df.resource_record.map(lambda x: x in ['AAAA'])).value_counts()
I looked for a method which would allow me to count individual values only, but count() counts all values that are not NaN.
Then I found str.contains(), but I don't know how to handle the undocumented Scalar type I get returned with this code:
print(df.resource_record.str.contains('A').sum())
Output:
dd.Scalar<series-..., dtype=int64>
But even after looking at Scalar's code in dask/dataframe/core.py I didn't find a way of getting its value.
How would you efficiently count the occurrences of a certain set of values in your dataframe?
In most cases pandas syntax will work with dask as well, with the necessary addition of .compute() (or dask.compute) to actually perform the action. Until the compute, you are merely constructing the graph which defines the action.
I believe the simplest solution to your question is this:
df[df.resource_record!='AAAA'].resource_record.value_counts().compute()
Where the expression in the selector square brackets could be some mapping or function.
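For a whole set of unwanted labels, a hedged sketch of the same pattern using isin (nothing is evaluated until .compute() is called):

unwanted = ['AAAA']
# keep rows whose resource_record is not in the unwanted set, then count the rest
counts = df[~df.resource_record.isin(unwanted)].resource_record.value_counts().compute()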
One quite nice method I found is this:
counts = df.resource_record.mask(df.resource_record.isin(['AAAA'])).dropna().value_counts()
First we mask all entries we'd like to get removed, which replaces the value with NaN. Then we drop all rows with NaN and last count the occurrences of unique values.
This requires df to have no NaN values to begin with; otherwise any row that already contains NaN gets removed as well.
I expect something like
df.resource_record.drop(df.resource_record.isin(['AAAA']))
would be faster, because I believe drop would run through the dataset once, while mask + dropna runs through the dataset twice. But drop is only implemented for axis=1, and here we need axis=0.

pandas DataFrame conditional string split

I have a column of influenza virus names within my DataFrame. Here is a representative sampling of the name formats present:
(A/Egypt/84/2001(H1N2))
A/Brazil/1759/2004(H3N2)
A/Argentina/126/2004
I am only interested in getting out A/COUNTRY/NUMBER/YEAR from the strain names, e.g. A/Brazil/1759/2004. I have tried doing:
df['Strain Name'] = df['Original Name'].str.split("(")
However, if I try accessing .str[0], then I miss case #1. If I do .str[1], I miss cases #2 and #3.
Is there a solution that works for all three cases? Or is there some way to apply a condition in string splits, without iterating over each row in the data frame?
So, based on EdChum's recommendation, I'll post my answer here.
Minimal data frame required for tackling this problem:
Index Strain Name Year
0 (A/Egypt/84/2001(H1N2)) 2001
1 A/Brazil/1759/2004(H3N2) 2004
2 A/Argentina/126/2004 2004
Code for getting the strain names only, without parentheses or anything else inside the parentheses:
df['Strain Name'] = df['Strain Name'].str.split('(').apply(lambda x: max(x, key=len))
This code works for the particular cases spelled out here; the trick is that the isolate's "strain name" is the longest string after splitting on the opening parenthesis ("(").
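An alternative sketch using a regular expression instead of splitting; this assumes the wanted part always appears as four slash-separated fields ending in a four-digit year, which holds for the three example formats:

# pull out A/COUNTRY/NUMBER/YEAR, ignoring any surrounding or trailing parentheses
df['Strain Name'] = df['Strain Name'].str.extract(r'(A/[^/()]+/\d+/\d{4})', expand=False)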
