Python: Replacing alphanumeric values in DataFrame

I have words with \t and \r at the beginning that I am trying to strip out without stripping the actual words.
For example "\tWant to go to the mall.\rTo eat something."
I have tried a few things from SO over three days. It's a pandas DataFrame, so I thought this answer was the most relevant:
Pandas DataFrame: remove unwanted parts from strings in a column
But adapting that answer into my own solution is not working:
i = df['Column'].replace(regex=False, inplace=False, to_replace='\t', value='')
I did not want to use regex, since the expression has been difficult to write given that I am attempting to strip out '\t' and, if possible, also '\r'.
Here is my regular expression: https://regex101.com/r/92CUV5/5

Try the following code:
import re

def remove_chars(text):
    return re.sub(r'[\t\r]', '', text)

i = df['Column'].map(remove_chars)
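Since the column presumably holds strings, a vectorized alternative is a sketch using the same character class with pandas' own str.replace, instead of calling a Python-level function per row:

i = df['Column'].str.replace(r'[\t\r]', '', regex=True)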

Related

Insert string in pandas column using regex if pattern is found

I have a string column in a dataframe and I'd like to insert a # at the beginning of my pattern.
For example:
My pattern is the letters 'pr' followed by any number of digits. If my column contains the value 'problem in pr123', I would change it to 'problem in #pr123'.
I'm trying a bunch of code snippets but nothing is working for me.
I tried changing the solution to produce 'pr#123', but this didn't work either:
df['desc_clean'] = df['desc_clean'].str.replace(r'([p][r])(\d+)', r'\1#\2', regex=True)
What's the best way I can replace all values in this column when I find this pattern?
If you need pr#123, you can use
df['desc_clean'] = df['desc_clean'].str.replace(r'(pr)(\d+)', r'\1#\2', regex=True)
To get #pr123, you can use
df['desc_clean'].str.replace(r'pr\d+', r'#\g<0>', regex=True)
To match pr as a whole word, you can add a word boundary, \b, in front of pr:
df['desc_clean'].str.replace(r'\bpr\d+', r'#\g<0>', regex=True)
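To see the behaviour end to end, here is a hypothetical toy frame run through two of the variants (the sample values are made up):

import pandas as pd

df = pd.DataFrame({'desc_clean': ['problem in pr123', 'supr123 is not a pr number']})

# pr123 -> pr#123 (matches pr inside longer words too)
print(df['desc_clean'].str.replace(r'(pr)(\d+)', r'\1#\2', regex=True).tolist())
# ['problem in pr#123', 'supr#123 is not a pr number']

# pr123 -> #pr123, only where pr starts a word
print(df['desc_clean'].str.replace(r'\bpr\d+', r'#\g<0>', regex=True).tolist())
# ['problem in #pr123', 'supr123 is not a pr number']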

How does pandas read_csv parse regular expressions, exactly?

I have a CSV file with the following structure:
word1|word2|word3,word4,0.20,0.20,0.11,0.54,2.70,0.07,1.75
That is, a first column of strings (some capitalized, some not) separated by '|' and by ',' (this marks differences in patterns of association), followed by seven numeric values, each separated by ','.
n.b. This dataframe has multiple millions of rows. I have tried to load it as follows:
pd.read_csv('pattern_association.csv', sep=r',(?!\D)', engine='python', chunksize=10000)
I have followed advice on here to use a regular expression that splits on each comma preceding a digit, but I need one that both keeps the first column as a whole string (ignoring the commas between words) and parses out the seven columns that consist of digits.
How can I get pandas to parse this? I always get the error:
Error could possibly be due to quotes being ignored when a
multi-char delimiter is used.
I have tried many variations, and the regex I am using seems to work on toy examples outside the context of pandas.
Thanks for any tips.
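One possible workaround (a sketch, assuming every row ends in exactly seven numeric fields and that the file has no header; the column names below are hypothetical) is to sidestep the regex separator entirely and split each line from the right:

import pandas as pd

rows = []
with open('pattern_association.csv') as fh:
    for line in fh:
        # rsplit keeps any commas inside the leading string field intact
        rows.append(line.rstrip('\n').rsplit(',', 7))

cols = ['pattern'] + [f'value_{n}' for n in range(7)]  # hypothetical names
df = pd.DataFrame(rows, columns=cols)
df[cols[1:]] = df[cols[1:]].astype(float)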

Pyspark: help filtering out any rows which have unwanted characters

Writing to a parquet file gives me an error stating that the characters " ,;{}()\n\t=" are not allowed.
I'd like to eliminate rows that have any of these characters anywhere.
Would I use "like", "rlike" or something else?
I have tried this:
df = df.filter(df.account_number.rlike('*\n*', '*\ *','*,*','*;*','*{*','*}*','*)*','*(*','*\t*') == False)
Obviously this does not work. I'm unsure what the right regex syntax is, or if I even need a regex in this particular case.
You would use rlike since it's for regular expressions:
df.filter(~df.account_number.rlike("[ ,;{}()\n\t=]"))
When you put characters between [] it means any of the following characters.
I don't see why these characters wouldn't be allowed in the dataframe rows; more likely there is an invalid character in the column names instead. You can use .withColumnRenamed() to rename the offending column.
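A minimal sketch of that rename, replacing each forbidden character in every column name with an underscore (the replacement character is my choice):

import re

for name in df.columns:
    df = df.withColumnRenamed(name, re.sub(r'[ ,;{}()\n\t=]', '_', name))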

Using the .split() function based on conditions?

How would you be able to use the .split() function based on conditions?
Lets say I have the raw data:
Apples,Oranges,Strawberries Green beans,Tomatoes,Broccoli
My intended result is:
['Apples','Oranges','Strawberries','Green beans','Tomatoes','Broccoli']
Would it be possible to have it split at commas, and also wherever a space is followed by a capital letter?
The literal interpretation of what you asked for, using re.split:
import re
pat = re.compile(r'\s(?=[A-Z])|,')
pat.split(my_str)
This is more simply done, in your case:
pat = re.compile(r'.(?=[A-Z])')
Basically, split on any character that is followed by a capital letter.
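A quick check of both patterns on the sample data (my_str is the raw string from the question):

import re

my_str = 'Apples,Oranges,Strawberries Green beans,Tomatoes,Broccoli'

# Split on a comma, or on whitespace that precedes a capital letter
print(re.split(r'\s(?=[A-Z])|,', my_str))
# ['Apples', 'Oranges', 'Strawberries', 'Green beans', 'Tomatoes', 'Broccoli']

# Same result here, since capitals only appear at the start of items
print(re.split(r'.(?=[A-Z])', my_str))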
Using regex will make the code simpler than a complicated split statement.
import re
...
re.findall(", [A-Z]", data)
Note you asked for a split on a comma, space, capital, but in your example there are no spaces after the commas.

Cleaning up commas in numbers w/ regular expressions in Python

I have been googling this one fervently, but I can't really narrow it down. I am attempting to interpret a CSV file of values, common enough sort of behaviour, but I am being punished by values over a thousand, i.e. values wrapped in quotes and containing a comma. I have kinda gotten around it by using the csv reader, which creates a list of values from the row, but I then have to pick the commas out afterwards.
For purely academic reasons, is there a better way to edit a string with regular expressions? Going from 08/09/2010,"25,132","2,909",650 to 08/09/2010,25132,2909,650.
(If you are into Vim, basically I want to put Python on this:
:1,$s/"\([0-9]*\),\([0-9]*\)"/\1\2/g :D )
Use the csv module for first-stage parsing, and a regex only for seeing if the result can be transformed to a number.
import csv, re

num_re = re.compile(r'^[0-9]+[0-9,]+$')
for row in csv.reader(open('input_file.csv')):
    for el_num in range(len(row)):
        if num_re.match(row[el_num]):
            row[el_num] = row[el_num].replace(',', '')
...although it would probably be faster not to use the regular expression at all:
for row in ([item.replace(',', '') for item in row]
            for row in csv.reader(open('input_file.csv'))):
    do_something_with_your(row)
I think what you're looking for is, assuming that commas will only appear in numbers, and that those entries will always be quoted:
import re

def remove_commas(mystring):
    return re.sub(r'"(\d+?),(\d+?)"', r'\1\2', mystring)
UPDATE:
Adding cdarke's comments below, the following should work for arbitrary-length numbers:
import re

def remove_commas_and_quotes(mystring):
    return re.sub(r'","|",|"', ',', re.sub(r'(?:(\d+?),)', r'\1', mystring))
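Applied to the line from the question, this function reproduces the desired output:

print(remove_commas_and_quotes('08/09/2010,"25,132","2,909",650'))
# 08/09/2010,25132,2909,650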
Python has a regular expressions module, "re":
http://docs.python.org/library/re.html
However, in this case, you might want to consider using the "partition" function:
>>> s = 'some_long_string,"12,345",more_string,"56,6789",and_some_more'
>>> left_part, quote_mark, right_part = s.partition('"')
>>> right_part
'12,345",more_string,"56,6789",and_some_more'
>>> number, quote_mark, remainder = right_part.partition('"')
>>> number
'12,345'
str.partition(separator) splits a string into 3 parts: the text to the left of the first occurrence of the separator, the separator itself, and the text to the right.
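Putting that together, a hypothetical helper built on partition (the function name is mine) that strips the commas from every quoted field:

def strip_quoted_commas(line):
    parts = []
    rest = line
    while True:
        left, quote, rest = rest.partition('"')
        parts.append(left)
        if not quote:  # no more quoted sections
            break
        number, _, rest = rest.partition('"')
        parts.append(number.replace(',', ''))
    return ''.join(parts)

print(strip_quoted_commas('08/09/2010,"25,132","2,909",650'))
# 08/09/2010,25132,2909,650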
Here's a simple regex for removing commas from numbers of any length:
re.sub(r'(\d+),?([\d+]?)', r'\1\2', mystring)
(Be aware that, applied to a whole CSV line, this also strips the delimiter commas that follow digits.)
