Simplifying pandas data cleaning - python

I am cleaning data in my pandas dataframe, and I hope there is a better way than mine to do this.
I have input like this in the ["Count"] column of my pandas dataframe:
~186-205
4 and 4
200
800-1000
550-550[2]
10, 20 or 50
5 (four score and bla bla)
38 or 30
88-80
If somebody could tell me how to add numbers together when they say "x and x", that would be great.
However, my main goal is just to get the lowest number from each row and have everything else gone.
I succeeded almost entirely with my solution:
df['Count'] = df['Count'].str.replace(r"\(.*\)", "", regex=True)     #parentheses with their content
df['Count'] = df['Count'].str.replace(r"\[.*\]", "", regex=True)     #square brackets with their content
df['Count'] = df['Count'].str.replace(r"(−).*", "", regex=True)      #for one type of hyphen (minus sign)
df['Count'] = df['Count'].str.replace(r"(-).*", "", regex=True)      #for another type of hyphen
df['Count'] = df['Count'].str.replace(r"(—).*", "", regex=True)      #for yet another type of hyphen
df['Count'] = df['Count'].str.replace(r"(\u2013).*", "", regex=True) #because of different formatting for hyphens (en dash)
df['Count'] = df['Count'].str.replace(r"(or).*", "", regex=True)     #remove other alternatives after "or"
df['Count'] = df['Count'].str.replace(r"(,).*", "", regex=True)      #everything after commas
df['Count'] = df['Count'].replace(r'\D+', "", regex=True)            #everything but numbers
Any suggestions to make this more elegant?
Either in a function, a loop, or just something smarter...
Thank you for your time.

Regarding your solution for stripping unneeded symbols from the values: you can use the built-in re module to collect all the numbers in the string and just take the lowest one:
import re
min(map(int, re.findall(r'[0-9]+', value)))
To support only Python operations you might try the built-in eval function, but if you need to support other operations, like 'and' meaning the numbers should be summed, you will probably need to write a small parser for further customization. This is a cool article you can check out on parsers and what their parts are.
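For the "4 and 4" rows specifically, a simple rule-based sketch (parse_count is a hypothetical helper, and the rule that "and" means "add the numbers" is an assumption, not a full parser) could look like this:
import re

def parse_count(value):
    numbers = list(map(int, re.findall(r'[0-9]+', value)))
    # assumption: "x and x" means the numbers should be added together
    if ' and ' in value:
        return sum(numbers)
    # otherwise keep the lowest number, per the main goal
    return min(numbers)

parse_count('4 and 4')    # 8
parse_count('800-1000')   # 800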
Edit:
To apply it to the whole column, extract the smallest-number logic into a function and then apply that function:
import re
def get_min_number(value):
    return min(map(int, re.findall(r'[0-9]+', value)))
df['Count'].apply(get_min_number)
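Applied to the sample values this gives, for example (a quick check; note that a bracketed footnote such as [2] is also picked up as a number, so stripping brackets first, as in the original code, may still be needed):
get_min_number('~186-205')                    # 186
get_min_number('4 and 4')                     # 4
get_min_number('5 (four score and bla bla)')  # 5
get_min_number('550-550[2]')                  # 2, because the footnote [2] is read as a number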

Related

Convert data into comma separated values

How do I convert data into comma separated values? I have this data in Excel in a single cell:
"ABCD x3 ABC, BAC x 3"
I want to convert it to
ABCD,ABCD,ABCD,ABC,BAC,BAC,BAC
but I can't find an easy way to do that. I am trying to solve it in Python so I can get structured data.
Hi Zeeshan, trying to sort the string into usable data while also multiplying certain parts of the string is kind of tricky for me.
The best solution I can think of is kind of gross, but it seems to work. Hopefully my comments aren't too confusing <3
import re
data = "ABCD x3 AB BAC x2"
#this will split the string into a list that you can iterate through.
Datalist = re.findall(r'(\w+)', data)
#create a new list for the final result
newlist = []
for object in Datalist:
    #for each object in the Datalist list,
    #if the object contains 'x' (i.e. looks like a multiplier such as 'x3')
    if re.search("x.*", object):
        #convert the multiplier to type(string) and then split the 'x' from the multiplier number string
        xvalue = str(object).split('x')
        #grab and remove the last item added to the newlist, because it hasn't been multiplied yet.
        lastitem = newlist.pop()
        #now we can add the last item back in as many times as the x value says
        newlist.extend([lastitem] * int(xvalue[1]))
    else:
        #if the object isn't a multiplier then we can just add it to the list.
        newlist.extend([object])
#print result
print(newlist)
#re.search() - looks for a match in a string
#.split() - splits a string into a list of substrings
#.pop() - removes the last item from a list and returns that item.
#.extend() - adds the items of an iterable to the end of a list
Keep in mind that to find the multiplier it is looking for x followed by a number (x1). If there is a space, for example (x 1), then it will match the x but it won't return a value, because of the space.
There might be multiple ways around this issue, and I think the best fix would be to restructure how the data is formatted in the cell.
Here are a couple of ways you can work with the data. They won't directly solve your issue, but I hope they will help you think about how you approach it (not being rude, I don't actually have a good way to handle your example <3 ).
split() will split your string at a given character and return a list of substrings you can iterate over.
data = 'ABCD ABCD ABCD ABC BAC BAC BAC'
splitdata = data.split(' ')
print(splitdata)
#prints - ['ABCD', 'ABCD', 'ABCD', 'ABC', 'BAC', 'BAC', 'BAC']
You could also try to match strings from the data:
import re
data2 = "ABCD x3 ABC BAC x3"
result = []
for match in re.finditer(r'(\w+) x(\d+)', data2):
    substring, count = match.groups()
    result.extend([substring] * int(count))
print(result)
Use re.finditer to go through the string and match the data against the pattern '(\w+) x(\d+)'.
Each match then gets added to the list.
'\w' is used to match a word character (a letter, digit or underscore).
'\d' is used to match a digit.
'+' is the quantifier; it means one or more.
So we are matching '(\w+) x(\d+)',
which broken down means: (\w+) one or more word characters, followed by a space, then 'x', followed by (\d+) one or more digits.
Because your cell data is essentially a string followed by a multiplier, then a string followed by another string and then another multiplier, the data just feels too irregular for a general solution. I think this requires a direct solution that can only work if you know exactly what data is already in the cell, which is why I think the best way to fix it is to rework the data in the cell first. I'm in no way an expert, and this answer is meant to help you think of ways around the problem and to add to the discussion :) If someone wants to correct me and offer a better solution, I would love to know myself.
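Building on the finditer idea, one possible sketch (an assumption on my part, not a tested general solution) makes the multiplier optional and tolerates a space after the 'x', so bare codes like ABC and forms like "x 3" are both handled:
import re

data = "ABCD x3 ABC, BAC x 3"
result = []
# ([A-Za-z]+)        one or more letters: the code itself
# \s*(?:x\s*(\d+))?  an optional multiplier: a literal 'x', optional spaces, then digits
for code, count in re.findall(r'([A-Za-z]+)\s*(?:x\s*(\d+))?', data):
    result.extend([code] * (int(count) if count else 1))
print(','.join(result))   # ABCD,ABCD,ABCD,ABC,BAC,BAC,BAC
This assumes the codes themselves never contain a lowercase 'x'; if they can, the data really does need a stricter format first, as noted above.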

Remove everything after second caret regex and apply to pandas dataframe column

I have a dataframe with a column that looks like this:
0 EIAB^EIAB^6
1 8W^W844^A
2 8W^W844^A
3 8W^W858^A
4 8W^W844^A
...
826136 EIAB^EIAB^6
826137 SICU^6124^A
826138 SICU^6124^A
826139 SICU^6128^A
826140 SICU^6128^A
I just want to keep everything before the second caret, e.g. 8W^W844. What regex would I use in Python? Similarly, PACU^SPAC^06 would become PACU^SPAC. And I'd like to apply it to the whole column.
I tried r'[\\^].+$' since I thought it would take the last caret and everything after, but it didn't work.
You can negate the character group to find everything except ^ and put it in a match group. You don't need to escape the ^ inside the character group, but you do need to escape the one outside.
re.match(r"([^^]+\^[^^]+)", "8W^W844^A").group(1)
This is quite useful in a pandas dataframe. Assuming you want to do this on a single column you can extract the string you want with
df['col'].str.extract(r'^([^^]+\^[^^]+)', expand=False)
NOTE
Originally, I used replace, but the extract solution suggested in the comments executed in 1/4 the time.
import pandas as pd
import numpy as np
from timeit import timeit

df = pd.DataFrame({"foo": np.arange(1_000_000)})
df["bar"] = "8W^W844^A"
df2 = df.copy()

def t1():
    df.bar.str.replace(r"([^^]+\^[^^]+).*", r"\1", regex=True)

def t2():
    df.bar.str.extract(r'^([^^]+\^[^^]+)', expand=False)

print("replace", timeit("t1()", globals=globals(), number=20))
print("extract", timeit("t2()", globals=globals(), number=20))
output
replace 39.73989862400049
extract 9.910304663004354
I don't think regex is really necessary here, just slice the string up to the position of the second caret:
>>> s = 'PACU^SPAC^06'
>>> s[:s.find("^", s.find("^") + 1)]
'PACU^SPAC'
Explanation: str.find accepts a second argument for where to start the search; place it just after the position of the first caret.
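To apply this non-regex version to the whole column, a sketch along these lines should work (assuming every value contains at least two carets, as in the sample data; if a value has fewer, find returns -1 and the slice would drop its last character):
df['col'] = df['col'].apply(lambda s: s[:s.find('^', s.find('^') + 1)])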

Do these characters have some sort of mapping function? "[1]", "[2]", "[3]",...,"[n]"

I am using this line of code
df_mask = ~df[new_col_titles[:1]].apply(lambda x : x.str.contains('|'.join(filter_list), flags=re.IGNORECASE)).any(1)
to create a mask for my df. The filter list is
filter_list = ["[1]", "[2]", "[3]", "[4]", "[5]", "[6]", "[7]", "[8]","[9]",..."[n]"]
But I am having weird results. I was hoping it would just filter the rows in column 0 of the df that contain [1]...[n], but it doesn't: it is also filtering rows that don't have those elements in them. There is somewhat of a pattern to it, though. It will filter out rows that have numbers with "characters", by which I mean £55, 2010), 55*, 55 *.
Can anyone explain what is going on and whether there is a workaround for this?
If you want to match the items in filter list exactly, use re.escape() to escape the special characters. [1] is a regular expression that just matches the digit 1, not the string [1].
df_mask = ~df[new_col_titles[:1]].apply(lambda x : x.str.contains('|'.join(re.escape(f) for f in filter_list), flags=re.IGNORECASE)).any(1)
See Reference - What does this regex mean?
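To see the difference, here is a minimal sketch of what re.escape() does to the joined pattern:
import re

filter_list = ["[1]", "[2]", "[3]"]
print('|'.join(filter_list))                        # [1]|[2]|[3]   -- character classes, each matching a single digit
print('|'.join(re.escape(f) for f in filter_list))  # \[1\]|\[2\]|\[3\]   -- literal bracketed strings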

Python Split if there is only one element

I am trying to do a Python Split but there seems to be a problem with my logic.
I have some data, separated with a semicolon. Some example of my data would be like:
89;50;20
40
I only want to retrieve one value from each row. For example, in row 1 I only want the last value, which is 20, and I want 40 from the second row.
I tried using the following code:
fields = fields.split(";")[-1]
It works for the first row; I got 20. But I am unable to get the data from the second row, as it has only one element in the split.
Then I tried using an if-else condition like the one below, but the code is unable to run.
if (len(fields.split(";")) > 0):
    fields = fields.split(";")[-1]
else:
    pass
Does anybody know how to deal with this problem? What I want to achieve is: if there is only one value in that row, I read it; if there is more than one value, I split it and take the last value.
Use strip to normalize the input. The problem is that there is an extra ; in the one-number case, so we should remove it first.
In [1]: def lnum(s):
   ...:     return s.strip(';').split(';')[-1]
   ...:
In [2]: lnum('89;50;20')
Out[2]: '20'
In [3]: lnum('89;')
Out[3]: '89'
In [5]: lnum('10;')
Out[5]: '10'
So, when you split the string '40;' using a semicolon (;), you get a list of two strings: ['40', '']. That is why fields.split(";")[-1] returns an empty string for the input '40;'.
So you can either strip the last semicolon ; before splitting, as follows:
print('40;'.rstrip(';').split(';')[-1])
OR, you can do:
fields = '40;'.split(';')
if fields[-1]:
    print(fields[-1])
else:
    print(fields[-2])
I prefer the first approach over the if/else approach. Also, have a look at the .strip(), .lstrip() and .rstrip() functions.
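For reference, a quick sketch of the difference between the three (the example string is just illustrative):
s = ';;40;'
print(s.strip(';'))    # 40   -- strips ';' from both ends
print(s.lstrip(';'))   # 40;  -- strips ';' from the left end only
print(s.rstrip(';'))   # ;;40 -- strips ';' from the right end only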
Another way is to use the re module.
from re import findall
s1 = '80;778;20'
s2 = '40'
res1 = findall(r'\d+', s1)
res2 = findall(r'\d+', s2)
print(res1[-1])
print(res2[-1])

Formatting the contents of pandas column. Removing trailing text and digits

I've used BeautifulSoup and pandas to create a csv with columns that contain error codes and corresponding error messages.
Before formatting, the columns look something like this
-132456ErrorMessage
-3254Some other Error
-45466You've now used 3 different examples. 2 more to go.
-10240 This time there was a space.
-1232113That was a long number.
I've successfully isolated the text from the codes like this:
dfDSError['text'] = dfDSError['text'].map(lambda x: x.lstrip('-0123456789'))
This returns just what I want.
But I've been struggling to come up with a solution for the codes.
I tried this:
dfDSError['codes'] = dfDSError['codes'].replace(regex=True,to_replace= r'\D',value=r'')
But that will append numbers from the error message to the end of the code number. So for the third example above instead of 45466 I would get 4546632. Also I would like to keep the leading minus sign.
I thought maybe that I could somehow combine rstrip() with a regex to find where there was a nondigit or a space next to a space and remove everything else, but I've been unsuccessful.
for_removal = re.compile(r'\d\D*')
dfDSError['codes'] = dfDSError['codes'].map(lambda x: x.rstrip(re.findall(for_removal,x)))
TypeError: rstrip arg must be None, unicode or str
Any suggestions? Thanks!
You can use extract:
dfDSError[['code','text']] = dfDSError.text.str.extract('([-0-9]+)(.*)', expand=True)
print (dfDSError)
text code
0 ErrorMessage -132456
1 Some other Error -3254
2 You've now used 3 different examples. 2 more t... -45466
3 This time there was a space. -10240
4 That was a long number. -1232113
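If the codes should end up numeric rather than strings, a small follow-up (a sketch, assuming every row matched the extract pattern) converts them while keeping the leading minus sign, and tidies the leftover leading space in the text:
dfDSError['code'] = dfDSError['code'].astype(int)   # e.g. -132456 as an integer
dfDSError['text'] = dfDSError['text'].str.strip()   # drops the leading space from rows like "-10240 This time..."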
