Formatting the contents of a pandas column: removing trailing text and digits - python

I've used BeautifulSoup and pandas to create a csv with columns that contain error codes and corresponding error messages.
Before formatting, the column looks something like this:
-132456ErrorMessage
-3254Some other Error
-45466You've now used 3 different examples. 2 more to go.
-10240 This time there was a space.
-1232113That was a long number.
I've successfully isolated the message text like this:
dfDSError['text'] = dfDSError['text'].map(lambda x: x.lstrip('-0123456789'))
This returns just what I want.
But I've been struggling to come up with a solution for the codes.
I tried this:
dfDSError['codes'] = dfDSError['codes'].replace(regex=True,to_replace= r'\D',value=r'')
But that will append numbers from the error message to the end of the code number. So for the third example above instead of 45466 I would get 4546632. Also I would like to keep the leading minus sign.
I thought maybe that I could somehow combine rstrip() with a regex to find where there was a nondigit or a space next to a space and remove everything else, but I've been unsuccessful.
for_removal = re.compile(r'\d\D*')
dfDSError['codes'] = dfDSError['codes'].map(lambda x: x.rstrip(re.findall(for_removal,x)))
TypeError: rstrip arg must be None, unicode or str
Any suggestions? Thanks!

You can use extract:
dfDSError[['code','text']] = dfDSError.text.str.extract('([-0-9]+)(.*)', expand=True)
print (dfDSError)
text code
0 ErrorMessage -132456
1 Some other Error -3254
2 You've now used 3 different examples. 2 more t... -45466
3 This time there was a space. -10240
4 That was a long number. -1232113
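For reference, here is a self-contained sketch of that approach (the DataFrame construction below is assumed, since the post only shows the column contents):
import pandas as pd

dfDSError = pd.DataFrame({'text': ['-132456ErrorMessage',
                                   '-3254Some other Error',
                                   '-10240 This time there was a space.']})
# One capture group grabs the signed code, the other grabs the rest of the string
dfDSError[['code', 'text']] = dfDSError.text.str.extract(r'([-0-9]+)(.*)', expand=True)
print(dfDSError)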

ValueError: Columns must be same length as key with multiple outputs

I am extracting a substring from an Excel cell and the entire string says this:
The bolts are 5" long each and 3" apart
I want to extract the length of the bolt which is 5". And I use the following code to get that
df['Bolt_Length'] = df['Description'].str.extract(r'(\s[0-9]")',expand=False)
But if the string says the following:
The bolts are 10" long each and 3" apart
and I try to use to the following code:
df['Bolt_Length'] = df['Description'].str.extract(r'(\s(\d{1,2})")',expand=False)
I get the following error message:
ValueError: Columns must be same length as key
I think Python doesn't know which number to acquire: the 10" or the 3".
How can I fix this? How do I tell Python to only go for the first "?
On another note what if I want to get both the bolt length and distance from another bolt? How do I extract the two at the same time?
Your problem is that you have two capture groups in your second regular expression (\s(\d{1,2})"), not one. So basically, you're telling Python to get the number with the ", and the same number without the ":
>>> df['Description'].str.extract(r'(\s(\d{1,2})")', expand=False)
0 1
0 5" 5
1 10" 10
You can add ?: right after the opening parenthesis of a group to make it so that it does not capture anything, though it still functions as a group. The following makes it so that the inner group, which excludes the ", does not capture:
# notice vv
>>> df['Description'].str.extract(r'(\s(?:\d{1,2})")', expand=False)
0 5"
1 10"
Name: Description, dtype: object
The error occurs because your regex contains two capturing groups, that extract two column values, BUT you assign those to a single column, df['Bolt_Length'].
You need to use as many capturing groups in the regex pattern as there are columns you assign the values to:
df['Bolt_Length'] = df['Description'].str.extract(r'\s(\d{1,2})"',expand=False)
The \s(\d{1,2})" regex only contains one pair of unescaped parentheses that form a capturing group, so this works fine since this single value is assigned to a single Bolt_Length column.
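As for the follow-up about extracting both the bolt length and the spacing at the same time, the same rule applies: use two capturing groups and assign the result to two columns. A minimal sketch, assuming the descriptions always follow the 'X" long ... Y" apart' wording of the examples (Bolt_Spacing is an illustrative name, not from the original post):
# Two capture groups -> two columns
df[['Bolt_Length', 'Bolt_Spacing']] = df['Description'].str.extract(r'(\d{1,2})"\s+long.*?(\d{1,2})"\s+apart', expand=True)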

Simplifying pandas data cleaning

I am cleaning data in my pandas dataframe, and I hope there is a better way than mine to do this.
I have input like this in the ["Count"] column of my pandas dataframe:
~186-205
4 and 4
200
800-1000
550-550[2]
10, 20 or 50
5 (four score and bla bla)
38 or 30
88-80
If somebody could tell me how to add numbers together if they say "x and x" that would be great.
However, my main goal is just to get the lowest number from each row and everything else gone.
I succeed almost entirely with my solution:
df['Count'] = df['Count'].str.replace(r"\(.*\)","") #all parentheses with content
df['Count'] = df['Count'].str.replace(r"\[.*\]","") #all square brackets with content
df['Count'] = df['Count'].str.replace("(−).*","") #minus sign (U+2212) and everything after
df['Count'] = df['Count'].str.replace("(-).*","") #hyphen-minus and everything after
df['Count'] = df['Count'].str.replace("(—).*","") #em dash and everything after
df['Count'] = df['Count'].str.replace("(\u2013).*","") #en dash and everything after
df['Count'] = df['Count'].str.replace("(or).*","") #remove "or" alternatives and everything after
df['Count'] = df['Count'].str.replace("(,).*","") #everything after commas
df['Count'] = df['Count'].replace(r'\D+', "", regex=True) #everything but digits
Any suggestions to make this more elegant?
Either in a function, a for loop, or just something smarter...
Thank you for your time.
Regarding your solution for stripping out unneeded symbols from the values: you can use the built-in re module to collect all the numbers in the string and just take the lowest one:
import re
min(map(int, re.findall(r'[0-9]+', value)))
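For example, on one of the sample rows above:
>>> import re
>>> min(map(int, re.findall(r'[0-9]+', '10, 20 or 50')))
10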
If the strings contained only valid Python expressions you might try the built-in eval function, but if you need to support different operations like 'and' to sum your numbers, you will probably need to write a small parser. There are good articles covering parsers and their parts.
Edit:
To apply it to the whole column, extract the smallest-number logic into a function and then apply that function:
import re

def get_min_number(value):
    return min(map(int, re.findall(r'[0-9]+', value)))

df['Count'].apply(get_min_number)
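For the "x and x" rows the question mentions, a rough sketch of a custom parser (an assumption on my part, following the suggestion above) is to sum the numbers when the word "and" joins them and otherwise keep the minimum. It assumes footnote brackets like "[2]" have already been stripped, as in the question's own replace calls:
import re

def parse_count(value):
    # Assumed helper: sum numbers joined by "and", otherwise take the minimum
    numbers = list(map(int, re.findall(r'[0-9]+', value)))
    if ' and ' in value:
        return sum(numbers)
    return min(numbers)

parse_count('4 and 4')       # 8
parse_count('10, 20 or 50')  # 10
parse_count('800-1000')      # 800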

How to add a space within a string if it doesn't already have one?

I am working with UK post code data and, from what I've seen, they usually have 5, 6 or 7 characters with a space between them; examples include "EC2A 2FA", "HU8 9XL" and "E1 6AW". The problem I have is that some of the post codes in the dataset do not have a space for example "EC2A2FA", "HU89XL" and "E16AW"; this causes them not to be found when I try to get their location.
I want to add spaces to the ones that don't have one and leave the ones that already have a space. I could probably use if statements to check for a space at a particular index and add a space if it is not already there, but I want to know if there is a more efficient method to add the spaces between them, for example using string formatting.
# If I have this list
post_codes = ["BH16 6FA", "HU37FD", "W4 5YE", "E50QD", "WC2H9JQ", "LE3 0PD"]
# I want to get
["BH16 6FA", "HU3 7FD", "W4 5YE", "E5 0QD", "WC2H 9JQ", "LE3 0PD"]
You can use negative numbers in slices to isolate characters at the end of the string. When a space is lacking, split out the last 3 characters and insert the space.
post_codes = ["fEC2A 2FA", "E1 6AW", "EC2A2FA", "HU89XL", "E16AW"]
for idx, code in enumerate(post_codes):
if " " not in code:
post_codes[idx] = code[:-3] + " " + code[-3:]
print(post_codes)
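A regex alternative, sketched here on the question's own list (it relies on the inward code of a UK postcode always being the final three characters):
import re

post_codes = ["BH16 6FA", "HU37FD", "W4 5YE", "E50QD", "WC2H9JQ", "LE3 0PD"]
# Insert a single space before the last three characters, collapsing any existing space
normalized = [re.sub(r"\s*(\w{3})$", r" \1", code) for code in post_codes]
print(normalized)  # ['BH16 6FA', 'HU3 7FD', 'W4 5YE', 'E5 0QD', 'WC2H 9JQ', 'LE3 0PD']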

Pandas unable to split on multiple asterisks

I'm trying to split on five asterisks in pandas, reading in data that looks like this:
"This place is not good ***** less salt on the popcorn!"
My attempt reads the reviews column and takes the element at index zero:
review = review_raw["reviews"].str.split('*****').str[0]
print(review)
The error
sre_constants.error: nothing to repeat at position 0
My expectation
This place is not good
pandas.Series.str.split
Series.str.split(pat=None, n=-1, expand=False)
Parameters:
pat : str, optional
    String or regular expression to split on. If not specified, split on whitespace.
The * character is a regex metacharacter meaning "zero or more occurrences of the preceding token"; a pattern that starts with * gives the quantifier nothing to repeat, which is why your code is failing.
You can either try escaping the character:
>>> df['review'].str.split(r'\*\*\*\*\*').str[0]
0 This place is not good
Name: review, dtype: object
Or you can just pass the regex:
>>> df['review'].str.split(r'[*]{5}').str[0]
0 This place is not good
Name: review, dtype: object
A third option is to use the built-in str.split() via apply instead of pandas' Series.str.split():
>>> df['review'].apply(lambda x: x.split('*****')).str[0]
0 This place is not good
Name: review, dtype: object
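In newer pandas versions (1.4 and later, if that applies to you), Series.str.split also accepts a regex parameter, so you can request a literal split explicitly:
>>> df['review'].str.split('*****', regex=False).str[0]
0 This place is not good
Name: review, dtype: object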
Try it with this code:
def replace_str(string):
    return str(string).replace("*****", ',').split(',')[0]

review = review_raw["reviews"].apply(lambda x: replace_str(x))
Suppose we already have a ',' in our input string; in that case the code can be tweaked a little, as below. Since I am replacing *****, I can substitute any character unlikely to appear, such as '[', in the modified answer.
def replace_str(string):
    return str(string).replace("*****", '[').split('[')[0]

review = review_raw["reviews"].apply(lambda x: replace_str(x))

Python: Checking a list with regex, filling in blanks

I've tried to find ways to do this and searched online here, but cannot find examples to help me figure this out.
I'm reading in rows from a large csv and changing each row to a list. The problem is that the data source isn't very clean. It has empty strings or bad data sometimes, and I need to fill in default values when that happens. For example:
list_ex1 = ['apple','9','','2012-03-05','455.6']
list_ex2 = ['pear','0','45','wrong_entry','565.11']
Here, list_ex1 has a blank third entry and list_ex2 has erroneous data where a date should be. To be clear, I can create a regex that limits what each of the five entries should be:
reg_ex_check = ['[A-Za-z]+','[0-9]','[0-9]','[0-9]{4}-[0-1][0-9]-[0-3][0-9]','[0-9.]+']
That is:
1st entry: A string, no numbers
2nd entry: Exactly one digit between 0 and 9
3rd entry: Exactly one digit as well.
4th entry: Date in standard format (allowing any four digit ints for year)
5th entry: Float
If an entry is blank OR does not match the regular expression, then it should be filled in/replaced with the following defaults:
default_fill = ['empty','0','0','2000-01-01','0']
I'm not sure what the best way to go about this is. I think I could write a complicated loop, but it doesn't feel very 'pythonic' to do such things.
Any better ideas?
Use zip and a conditional expression in a list comprehension:
[x if re.match(r,x) else d for x,r,d in zip(list_ex2,reg_ex_check,default_fill)]
Out[14]: ['pear', '0', '45', '2000-01-01', '565.11']
You don't really need to explicitly check for blank strings since your various regexen (plural of regex) will all fail on blank strings.
Other note: you probably still want to add an anchor for the end of your string to each regex. Using re.match ensures that it tries to match from the start, but still provides no guarantee that there is not illegal stuff after your match. Consider:
['pear and a pear tree', '0blah', '4 4 4', '2000-01-0000', '192.168.0.bananas']
The above entire list is "acceptable" if you don't add a $ anchor to the end of each regex :-)
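For example, anchoring each pattern makes every entry in that list fall back to its default (a quick sketch reusing the question's lists):
import re

reg_ex_check = ['[A-Za-z]+', '[0-9]', '[0-9]', '[0-9]{4}-[0-1][0-9]-[0-3][0-9]', '[0-9.]+']
default_fill = ['empty', '0', '0', '2000-01-01', '0']
bad = ['pear and a pear tree', '0blah', '4 4 4', '2000-01-0000', '192.168.0.bananas']

anchored = [r + '$' for r in reg_ex_check]  # reject trailing garbage
print([x if re.match(r, x) else d for x, r, d in zip(bad, anchored, default_fill)])
# ['empty', '0', '0', '2000-01-01', '0']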
What about something like this?
from itertools import starmap
list(starmap(lambda x, y, z: re.search(y, x) and x or z, zip(list_ex1, reg_ex_check, default_fill)))
