Applying Regex across entire column of a Dataframe - python

I have a Dataframe with 3 columns:
id,name,team
101,kevin, marketing
102,scott,admin\n
103,peter,finance\n
I am trying to apply a regex function to remove the unnecessary spaces and newline characters. I have the code that removes them, however I am unable to loop it through the entire DataFrame.
This is what I have tried thus far:
df['team'] = re.sub(r'[\n\r]*','',df['team'])
But this throws an error AttributeError: 'Series' object has no attribute 're'
Could anyone advise how I could apply this regex to the entire df['team'] column?

You are almost there, there are two simple ways of doing this:
import re

# option 1 - faster way
df['team'] = [re.sub(r'[\n\r]*', '', str(x)) for x in df['team']]
# option 2
df['team'] = df['team'].apply(lambda x: re.sub(r'[\n\r]*', '', str(x)))
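If you want to skip the Python-level loop entirely, pandas string methods also accept a regex; a minimal vectorized sketch, assuming the same df as above:
# vectorized: Series.str.replace applies the regex to every row at once
df['team'] = df['team'].str.replace(r'[\n\r]+', '', regex=True)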

As long as it's a DataFrame, check out replace: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
df['team'].replace({r"[\n\r]+": ''}, inplace=True, regex=True)
Regarding the regex, '*' means 0 or more; you only need '+', which is 1 or more.
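A quick illustration of the difference with plain re (the sample string is just an example):
import re
re.sub(r'[\n\r]*', '', 'admin\n')  # 'admin' - works, but also matches the empty string at every position
re.sub(r'[\n\r]+', '', 'admin\n')  # 'admin' - matches only actual runs of newline characters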

Here's a powerful technique to replace multiple words in a pandas column in one step without loops. In my code I wanted to eliminate things like 'CORPORATION', 'LLC', etc. (all of them are in the RemoveDB.csv file) from my column without using a loop. In this scenario I'm removing 40 words from the entire column in one step.
RemoveDB = pd.read_csv('RemoveDB.csv')
RemoveDB = RemoveDB['REMOVE'].tolist()
RemoveDB = '|'.join(RemoveDB)
pattern = re.compile(RemoveDB)
df['NAME'] = df['NAME'].str.replace(pattern, '', regex=True)
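One caveat, as a hedged side note: if any of the terms contain regex metacharacters (a '.' in 'CO.', for instance), it is safer to escape them before joining. A sketch with an illustrative subset of terms:
import re
terms = ['CORPORATION', 'LLC', 'CO.']  # illustrative subset of the RemoveDB terms
pattern = re.compile('|'.join(re.escape(t) for t in terms))
df['NAME'] = df['NAME'].str.replace(pattern, '', regex=True)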

Another example (without regex), but maybe still useful for someone.
id = pd.Series(['101','102','103'])
name = pd.Series(['kevin','scott','peter'])
team = pd.Series([' marketing','admin\n', 'finance\n'])
testsO = pd.DataFrame({'id': id, 'name': name, 'team': team})
print(testsO)
testsO['team'] = testsO['team'].str.strip()
print(testsO)

how to find the value in a dataframe by multiple conditions with an unknown column and return the value position and get the next row value [duplicate]

I want to get the column name from the whole dataframe (assume the dataframe contains more than 100 rows with more than 50 columns) based on a specific value contained in a specific column in pandas.
With the help of Bkmm3 (a member from India) I've succeeded with numerical terms but failed with alphabetic terms. The way I've tried is this:
df = pd.DataFrame({'A': ['APPLE', 'BALL', 'CAT'],
                   'B': ['ACTION', 'BATMAN', 'CATCHUP'],
                   'C': ['ADVERTISE', 'BEAST', 'CARTOON']})
response = input("input")
for i in df.columns:
    if (len(df.query(i + '==' + str(response))) > 0):
        print(i)
then this error arises:
Traceback (most recent call last): NameError: name 'APPLE' is not defined
Any help from you guys will be very much appreciated, thank you.
isin/eq works for DataFrames, and you can 100% vectorize this:
df.columns[df.isin(['APPLE']).any()] # df.isin([response])
Or,
df.columns[df.eq(response).any()]
Index(['A'], dtype='object')
And here's the roundabout way with DataFrame.eval and np.logical_or (were you to loop on columns):
df.columns[
    np.logical_or.reduce(
        [df.eval(f"{repr(response)} in {i}") for i in df]
    )]
Index(['A'], dtype='object')
First, the reason for your error. With pd.DataFrame.query, as with regular comparisons, you need to surround strings with quotation marks. So this would work (notice the pair of " quotations):
response = input("input")
for i in df.columns:
    if not df.query(i + '=="' + str(response) + '"').empty:
        print(i)
inputAPPLE
A
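A related trick that sidesteps the quoting problem entirely: query can reference a local variable with @, so the value never has to be embedded in the string. A sketch using the same df and response as above:
for i in df.columns:
    if not df.query(f"{i} == @response").empty:
        print(i)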
Next, you can extract index and/or columns via pd.DataFrame.any. coldspeed's solution is fine here, I'm just going to show how similar syntax can be used to extract both row and column labels.
# columns
print(df.columns[(df == response).any(1)])
Index(['A'], dtype='object')
# rows
print(df.index[(df == response).any(0)])
Int64Index([0], dtype='int64')
Notice in both cases you get as your result Index objects. The code differs only in the property being extracted and in the axis parameter of pd.DataFrame.any.
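If you also need the (row, column) positions rather than labels, as the question title suggests, one hedged way is to work on the boolean mask directly (assuming numpy is available as np):
import numpy as np
rows, cols = np.where(df.eq(response).values)
positions = list(zip(df.index[rows], df.columns[cols]))
print(positions)  # e.g. [(0, 'A')] when response is 'APPLE'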

Removing a lot of rows from a dataframe in Python

I am trying to remove non-English tweets from a large dataset in the most efficient way possible. I have tried to create a list of rows that are not English and then removing them, but removing each tweet takes a long time (the langid.classify() function is not the problem).
def removeLanguage(df):
    rowsToDelete = []
    for i in df.index:
        text = df['tweet'][i]
        try:
            if langid.classify(text)[0] != 'en':
                rowsToDelete.append(i)
                continue
        except ValueError:
            rowsToDelete.append(i)
            continue
    for i in rowsToDelete:
        df.drop(i, inplace=True)
newDf = beforeClassification(inputDf).reset_index(drop=True)
Is there a more efficient way to remove a set of rows from a DataFrame than df.drop()?
df.drop is pretty efficient, but dropping rows one at a time is not; I'd build a boolean mask and filter in a single step, something like this (langid.classify works on one string at a time, so it is applied per row):
df = df[df['tweet'].apply(lambda t: langid.classify(t)[0] == 'en')]
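A fuller sketch of the same idea, keeping the ValueError handling from the question; the 'tweet' column and inputDf come from the question, while keep_english is a hypothetical helper name:
import langid

def keep_english(df):
    def is_english(text):
        try:
            return langid.classify(text)[0] == 'en'
        except ValueError:
            return False
    # build the mask once, then filter in a single step instead of dropping row by row
    return df[df['tweet'].apply(is_english)].reset_index(drop=True)

newDf = keep_english(inputDf)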

Ignoring multiple commas while reading csv in pandas

I'm trying to read multiple files whose names start with 'site_', for example file names like site_1, site_a.
Each file has data like :
Login_id, Web
1,http://www.x1.com
2,http://www.x1.com,as.php
I need two columns in my pandas df: Login_id and Web.
I am facing an error when I try to read records like record 2.
df_0 = pd.read_csv('site_1',sep='|')
df_0[['Login_id, Web','URL']] = df_0['Login_id, Web'].str.split(',',expand=True)
I am facing the following error :
ValueError: Columns must be same length as key.
Please let me know where I am making a mistake, and suggest a good approach to solve the problem. Thanks.
Solution 1: use split with argument n=1 and expand=True.
result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']
That results in a dataframe with two columns, so if you have more columns in your dataframe, you need to concat it with your original dataframe (that also applies to the next method).
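For completeness, a minimal sketch of that concat step, assuming df still contains the combined 'Login_id, Web' column:
import pandas as pd
df = pd.concat([df.drop(columns=['Login_id, Web']), result], axis=1)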
EDIT Solution 2: there is a nicer regex-based solution which uses a pandas function:
result = df['Login_id, Web'].str.extract(r'^\s*(?P<Login_id>[^,]*),\s*(?P<URL>.*)', expand=True)
This splits the field and uses the names of the matching groups to create columns with their content. The output is:
Login_id URL
0 1 http://www.x1.com
1 2 http://www.x1.com,as.php
Solution 3: conventional version with regex:
You could do something customized, e.g with a regex:
import re
sp_re = re.compile('([^,]*),(.*)')
aux_series = df['Login_id, Web'].map(lambda val: sp_re.match(val).groups())
df['Login_id'] = aux_series.str[0]
df['URL'] = aux_series.str[1]
The result on your example data is:
Login_id, Web Login_id URL
0 1,http://www.x1.com 1 http://www.x1.com
1 2,http://www.x1.com,as.php 2 http://www.x1.com,as.php
Now you could drop the column 'Login_id, Web'.

How to modify cells in column conditionally in pandas?

I have a csv dataset which for whatever reason has an extra asterisk (*) at the end of some names. I am trying to remove them, but I'm having trouble. I just want to replace the name in the case where it ends with a *, otherwise keep it as-is.
I have tried a couple variations of the following, but with little success.
import pandas as pd
people = pd.read_csv("people.csv")
people.loc[people["name"].str[-1] == "*"] = people["name"].str[:-1]
Here I am getting the following error:
ValueError: Must have equal len keys and value when setting with an iterable
I understand why this is wrong, but I'm not sure how else to reference the values I want to change.
I could instead do something like:
starred = people.loc[people["name"].str[-1] == "*"]
starred["name"] = starred["name"].str[:-1]
I get a warning here, but this kind of works. The problem is that it only contains the previously starred people, not all of them.
I'm kind of new to this, so apologies if this is simple. I feel like it shouldn't be too hard, there should be some function to do this, but I don't know what it is.
Your syntax for pd.DataFrame.loc needs to include a column label:
df = pd.DataFrame({'name': ['John*', 'Rose', 'Summer', 'Mark*']})
df.loc[df['name'].str[-1] == '*', 'name'] = df['name'].str[:-1]
print(df)
name
0 John
1 Rose
2 Summer
3 Mark
If you only specify the first part of the indexer, you will be filtering by row label only and get back a dataframe; you cannot assign a series to a dataframe.
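As a hedged side note beyond the answer above: for this particular cleanup a conditional assignment is not strictly needed, since stripping a trailing '*' leaves unstarred names untouched:
df['name'] = df['name'].str.replace(r'\*+$', '', regex=True)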

Pandas: query string where column name contains special characters

I am working with a data frame that has a structure something like the following:
In[75]: df.head(2)
Out[75]:
statusdata participant_id association latency response \
0 complete CLIENT-TEST-1476362617727 seeya 715 dislike
1 complete CLIENT-TEST-1476362617727 welome 800 like
stimuli elementdata statusmetadata demo$gender demo$question2 \
0 Sample B semi_imp complete male 23
1 Sample C semi_imp complete female 23
I want to be able to run a query string against the column demo$gender.
I.e,
df.query("demo$gender=='male'")
But this has a problem with the $ sign. If I replace the $ sign with another delimiter (like -) then the problem persists. Can I fix up my query string to avoid this problem? I would prefer not to rename the columns as these correspond tightly with other parts of my application.
I really want to stick with a query string as it is supplied by another component of our tech stack and creating a parser would be a heavy lift for what seems like a simple problem.
Thanks in advance.
With the most recent version of pandas, you can escape a column name that contains special characters with a backtick (`):
df.query("`demo$gender` == 'male'")
Another possibility is to clean the column names as a previous step in your process, replacing special characters with more appropriate ones.
For instance:
(df
 .rename(columns=lambda value: value.replace('$', '_'))
 .query("demo_gender == 'male'")
)
For the interested, here is a simple procedure I used to accomplish the task:
# Identify invalid column names
invalid_column_names = [x for x in list(df.columns.values) if not x.isidentifier() ]
# Make replacements in the query and keep track
# NOTE: This method fails if the frame has columns called REPL_0 etc.
replacements = dict()
for cn in invalid_column_names:
    r = 'REPL_' + str(invalid_column_names.index(cn))
    query = query.replace(cn, r)
    replacements[cn] = r
inv_replacements = {replacements[k] : k for k in replacements.keys()}
df = df.rename(columns=replacements) # Rename the columns
df = df.query(query) # Carry out query
df = df.rename(columns=inv_replacements)
Which amounts to identifying the invalid column names, transforming the query and renaming the columns. Finally we perform the query and then translate the column names back.
Credit to #chrisb for their answer that pointed me in the right direction
The current implementation of query requires the string to be a valid python expression, so column names must be valid python identifiers. Your two options are renaming the column, or using a plain boolean filter, like this:
df[df['demo$gender'] =='male']
