I have a table with a street number field that should have been all numeric values but the creators of the dataset allowed invalid values like 567M or 4321.5 to indicate an apartment number or granny unit. I need to get whole numbers into a new field and place the letter values and decimal values into a street suffix field. I've been playing around with regex and isalpha() and isalnum():
# import regex
import re
# list of values that should be all numbers
lst = ['1234', '4321.5', '567M']
# create new lists where l1 will be numeric datatypes and l2 will contain suffixes
# if there is a decimal symbol the string should be split with first value being all numbers
# and going into l1 and everything after decimal in l2
l1 = []
l2 =[]
for i in lst:
    if i.isalnum(): # all characters are numeric and good to go. Maybe need to do int()?
        l1.append(i)
    elif '.' in i: # a decimal was found and values need to be split and placed into two different lists
        i.split(".")
        l1.append(i[0])
        l2.append(i[-1])
    else:
        if i.isnumeric() == False: # if a letter is found in a list item everything prior to letter goes to l1 and letter goes to l2
            i = re.split('(\d+)', i)
            l1.append(i[0])
            l2.append(i[-1])
I immediately got this back when running the code:
['4321', '5']
And then got this for l1 and l2 (l1 being new numeric list and l2 being string suffix list):
['4321', '5']
l1
['1234', '4', '567M']
l2
['5']
Am I headed in the right direction here? I was hoping this would be simpler but the data is pretty wonky.
You can just use a regex to do the pattern matching. The expression would consist of three parts:
the street number: [1-9]\d*
an optional delimiter, e.g. '.', ',', '-' or a space: [\.,\- ]*
the suffix (any characters except whitespace): [^\s]*
import re
pattern = re.compile(r'(?P<number>[1-9]\d*)[\.,\- ]*(?P<suffix>[^\s]*)')
values = ['1234', '4321.5', '567M', ' 31-a ']  # not named `list`, to avoid shadowing the built-in
l1 = []
l2 = []
for item in values:
    match = pattern.search(item)
    if match:
        l1.append(int(match.group('number')))
        l2.append(match.group('suffix'))
print(f'l1: {l1}')
print(f'l2: {l2}')
Result:
l1: [1234, 4321, 567, 31]
l2: ['', '5', 'M', 'a']
Your first issue is that i.split(".") doesn't do anything useful by itself: it returns a new list with the split pieces and leaves i unchanged. You're not using that return value (it seems your interpreter is printing it out for you), and you go on to index into the original, unsplit string. You could use i = i.split(".") to fix that part of your code, but below I suggest a better approach.
The other issue is that your regex splitting code isn't working either. Rather than try to fix it, I'd suggest using an entirely different regex approach, which will also handle the other cases you're looking at:
for i in lst:
    match = re.match(r'(\d+)\.?(.*)', i)
    if match is None:
        raise ValueError(f"invalid address format {i}")
    l1.append(match.group(1))
    l2.append(match.group(2))
This uses a regex pattern to match the two parts of the street number you want. Any leading numeric part gets matched by the first capturing group and put into l1, and whatever else is trailing (optionally after a decimal point) goes into l2. The second part may be an empty string, if there is no extra part to the street number, but I still put it into l2, which you'll appreciate when you try to iterate over l1 and l2 together.
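As a rough sketch of how the two lists line up afterwards, here is the same pattern wrapped in a helper and run over the sample values from the question. The helper name split_street_number and the int() conversion are my own additions for illustration, not something the question requires.

```python
import re

# Same pattern as above: leading digits, an optional decimal point, then the rest.
STREET_NUMBER = re.compile(r'(\d+)\.?(.*)')

def split_street_number(value):
    """Hypothetical helper: return (number, suffix) for one raw street number."""
    match = STREET_NUMBER.match(value)
    if match is None:
        raise ValueError(f"invalid address format {value!r}")
    return int(match.group(1)), match.group(2)

lst = ['1234', '4321.5', '567M']
l1, l2 = [], []
for value in lst:
    number, suffix = split_street_number(value)
    l1.append(number)
    l2.append(suffix)

# Because every input produced exactly one entry in each list, zip pairs them correctly.
for number, suffix in zip(l1, l2):
    print(number, repr(suffix))
```

Keeping the empty suffix in l2 is what makes the zip safe here: both lists always have one entry per input value.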
Up to this function everything is fine: I get 4999 rows, which is the number I expect. Can you check the code below and tell me where I go wrong? I end up with 5095 items instead of 4999, and in the second function I get 5032 instead of 4999.
I have to get no more than 4999.
Any help is appreciated.
a=[]
for i in matches:
    a.append([i for i in list(dict.fromkeys(i))])
print(len(a))
print ((a))
result:
4999
[['23-year-old'], [' '], ['42 years old'], ['-year-old']..]
Could the '-year-old' entry be a problem here?
This is where I run into the problem:
t=[]
for i in a:
    for j in i:
        p=len(j)
        if p>1:
            r=j.replace('-', ' ').split(' ')
            # print(r)
            t+=[s for s in r if s.isdigit()]
        else:
            t+=['']
print(len(t))
print(t)
output:
5095 #This should be 4999
['23', '', '42', '', '', '30', '31', ''...]
I have the same issue with the list of genders: I end up with 5032 items.
This part has not been answered yet.
import re
fil = data['transcription']
print(fil)
gender_aux = []
for i in fil:
    try:
        gender = re.findall("female|gentleman|woman|lady|man|male|girl|boy|she|he", i) or [" "]
    except:
        gender_aux.append(' ')
        # pass
    gender_dict = {"male": ["gentleman", "man", "male", "boy",'he'],
                   "female": ["lady","female", "woman", "girl",'she']}
    for g in gender:
        if g in gender_dict['male']:
            gender_aux.append('male')
            break
        elif g in gender_dict['female']:
            gender_aux.append('female')
            break
        else:
            gender_aux+=[' ']
            break
print(len(gender_aux))
print(gender_aux)
output:
5032 #this should be 4999
['female', 'male', 'male', ' ',
This answer assumes there are no decimal values in your dataset, and that each list item contains only one string with one number per string.
If all you're after is a list containing the ages, starting from your completed list a, you can simply use:
import re
t = [re.findall(r'\d+', item[0])[0] for item in a if re.findall(r'\d+', item[0])]
This list comprehension accomplishes a few things.
Firstly, because your a list is a list of single item lists, as we iterate through each item, we obtain the value of the first (and only) item in the list using item[0]. We then perform a regex operation (hence import re) on this item, with the search pattern r'\d+' which extracts only the integer values from each string (You can check out https://regex101.com/ to play around with regex patterns to better understand how they work).
Because re.findall returns a list of matches, and because it seems each string in your dataset will only contain one match (at most), we simply take the [0] index of the resulting list as our chosen value. Where there are no matches, re.findall returns an empty list. Because empty lists evaluate to false, the if statement in our list comprehension will prevent index errors on strings where there are no numbers to be extracted.
Using your example, the resulting t array would be as follows:
['23', '42']
Note that the empty strings are not included in the final list. If you wanted to include them, you could simply add an else condition to our if statement as follows:
t = [re.findall(r'\d+', item[0])[0] if re.findall(r'\d+', item[0]) else '' for item in a]
This would result in:
['23', '', '42', '', '']
Lastly, if you wanted to convert each number (currently strings) to integer values, you could instead write:
t = [int(re.findall(r'\d+', item[0])[0]) if re.findall(r'\d+', item[0]) else '' for item in a]
which finally, would result in:
[23, '', 42, '', '']
Of course, this all assumes there are no decimal values in your dataset, and that each list item will only contain one string, with each string only containing one desired number.
For example, our re.findall with the string "I am 42 years old, and my son is 16", would return ['42', '16'], and because we only return the first item of the list, the final list would not include '16'. Keep this in mind.
Because we aren't creating any additional items (e.g. by using str.split()), we can be sure the resulting list consists of the same number of elements (so long as we use the variant with the else '' statement). If we use the first variant, the resulting list will contain only as many elements as there are elements in a containing numbers.
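To make the length argument concrete, here is a small sketch that produces exactly one output per input. It swaps re.findall for re.search (which returns the first match or None) so the pattern only runs once per item; the name ages and the last sample row are assumptions added for illustration.

```python
import re

# Sample rows in the same shape as the question's list `a` (last row is made up).
a = [['23-year-old'], [' '], ['42 years old'], ['-year-old'],
     ['I am 42 years old, and my son is 16']]

ages = []
for item in a:
    match = re.search(r'\d+', item[0])   # first run of digits, or None
    # Always append exactly one value, so len(ages) == len(a).
    ages.append(match.group() if match else '')

print(len(a), len(ages))
print(ages)
```

By contrast, the loop in the question adds one element per digit token found, so a string with two numbers adds two elements and a string such as '-year-old' adds none, which is why the count drifts away from 4999.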
I have the string below, and I want to extract the lists, dicts, and plain variables from it.
How can I split this string into that format?
s = 'list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}'
import re
m1 = re.findall(r'(?=.*,)(.*?=\[.+?\],?)', s)
for i in m1:
    print('m1:', i)
Only the first result is correct.
Does anyone know how to do this?
m1: list_c=[1,2],
m1: a=3,b=1.3,c=abch,list_a=[1,2],
Split on '=' instead; then you can work with each variable name and its value.
You still need to handle the type casting for the values (a regex, split, or try/except around a cast may help).
Also, as others have commented, using a dict may make this easier to handle.
s = 'list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}'
al = s.split('=')
var_l = [al[0]]
value_l = []
for a in al[1:-1]:
    var_l.append(a.split(',')[-1])
    value_l.append(','.join(a.split(',')[:-1]))
value_l.append(al[-1])
output = dict(zip(var_l, value_l))
print(output)
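For the type casting mentioned above, one possible sketch is to try a few converters in order and fall back to the raw string. The helper name cast_value and the use of ast.literal_eval are my own assumptions; the raw dict is the string-valued result of the split approach, written out so the snippet runs on its own.

```python
import ast

def cast_value(text):
    """Hypothetical helper: try int, then float, then a Python literal; else keep the string."""
    for convert in (int, float, ast.literal_eval):
        try:
            return convert(text)
        except (ValueError, SyntaxError):
            continue
    return text

# String values as produced by splitting on '='.
raw = {'list_c': '[1,2]', 'a': '3', 'b': '1.3', 'c': 'abch',
       'list_a': '[1,2]', 'dict_a': '{a:2,b:3}'}

typed = {name: cast_value(value) for name, value in raw.items()}
print(typed)
```

Note that '{a:2,b:3}' is not a valid Python literal because its keys are bare names, so it stays a string in this sketch; turning it into a real dict would need its own parsing step.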
You may have better luck if you more or less explicitly describe the right-hand side expressions: numbers, lists, dictionaries, and identifiers:
re.findall(r"([^=]+)=" # LHS and assignment operator
+r"([+-]?\d+(?:\.\d+)?|" # Numbers
+r"[+-]?\d+\.|" # More numbers
+r"\[[^]]+\]|" # Lists
+r"{[^}]+}|" # Dictionaries
+r"[a-zA-Z_][a-zA-Z_\d]*)", # Idents
s)
# [('list_c', '[1,2]'), ('a', '3'), ('b', '1.3'), ('c', 'abch'),
# ('list_a', '[1,2]'), ('dict_a', '{a:2,b:3}')]
The answer is as follows:
import re
from pprint import pprint
s = 'list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1],Save,Record,dict_a={a:2,b:3}'
m1 = re.findall(r"([^=]+)=" # LHS and assignment operator
+r"([+-]?\d+(?:\.\d+)?|" # Numbers
+r"[+-]?\d+\.|" # More numbers
+r"\[[^]]+\]|" # Lists
+r"{[^}]+}|" # Dictionaries
+r"[a-zA-Z_][a-zA-Z_\d]*)", # Idents
s)
temp_d = {}
for i,j in m1:
    temp = i.strip(',').split(',')
    if len(temp)>1:
        for k in temp[:-1]:
            temp_d[k]=''
        temp_d[temp[-1]] = j
    else:
        temp_d[temp[0]] = j
pprint(temp_d)
Output:
{'Record': '',
'Save': '',
'a': '3',
'b': '1.3',
'c': 'abch',
'dict_a': '{a:2,b:3}',
'list_a': '[1]',
'list_c': '[1,2]'}
Instead of picking out the types, you can start by capturing the identifiers. Here's a regex that captures all the identifiers in the string (for lowercase only, but see note):
regex = re.compile(r'([a-z]|_)+=')
#note if you want all valid variable names: r'([a-z]|[A-Z]|[0-9]|_)+'
cases = [x.group() for x in re.finditer(regex, s)]
This gives a list of all the identifiers in the string:
['list_c=', 'a=', 'b=', 'c=', 'list_a=', 'dict_a=']
We can now define a function that uses the above list to chop up s sequentially, one identifier at a time:
def chop(mystr, mylist):
    temp = mystr.partition(mylist[0])[2]
    cut = temp.find(mylist[1]) #strip leading bits
    return mystr.partition(mylist[0])[2][cut:], mylist[1:]
mystr = s[:]
temp = [mystr]
mylist = cases[:]
while len(mylist) > 1: #stop once only the last identifier is left
    mystr, mylist = chop(mystr, mylist)
    temp.append(mystr)
This (convoluted) slicing operation gives this list of strings:
['list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'list_a=[1,2],dict_a={a:2,b:3}',
'dict_a={a:2,b:3}']
Now cut off the ends using each successive entry:
result = []
for x in range(len(temp) - 1):
    cut = temp[x].find(temp[x+1]) - 1 #-1 to remove commas
    result.append(temp[x][:cut])
result.append(temp.pop()) #get the last item
Now we have the full list:
['list_c=[1,2]', 'a=3', 'b=1.3', 'c=abch', 'list_a=[1,2]', 'dict_a={a:2,b:3}']
Each element is easily parsable into key:value pairs (and is also executable via exec).
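As an illustration of that last step, here is a minimal sketch that turns the list into a dict of name/value strings with str.partition. The name pairs is made up for this example, and the values are left as strings.

```python
# The full list produced by the slicing steps above.
result = ['list_c=[1,2]', 'a=3', 'b=1.3', 'c=abch', 'list_a=[1,2]', 'dict_a={a:2,b:3}']

pairs = {}
for entry in result:
    # partition splits on the first '=' only, so any '=' inside a value would be kept.
    name, _, value = entry.partition('=')
    pairs[name] = value

print(pairs)
```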
I have very messy data, and I am trying to remove the elements that contain only letters or words. I want to keep the elements that have alphanumeric or numeric values. I tried .isalpha() but it is not working. How do I remove them?
lista = ['A8817-2938-228','12421','12323-12928-A','12323-12928',
'-','A','YDDEWE','hello','world','testing_purpose','testing purpose',
'A8232-2938-228','N7261-8271']
lista
Tried:
[i.isalnum() for i in lista] # gives boolean, but opposite of what I need.
Output:
['A8817-2938-228','12421','12323-12928-A','12323-12928','-','A8232-2938-228','N7261-8271']
Thanks!
You can add conditional checks in list comprehensions, so this is what you want:
new_list = [i for i in lista if not i.isalnum()]
print(new_list)
Output:
['A8817-2938-228', '12323-12928-A', '12323-12928', '-', 'testing_purpose', 'testing purpose', 'A8232-2938-228', 'N7261-8271']
Note that isalnum won't say True if the string contains spaces or underscores. One option is to remove them before checking: (You also need to use isalpha instead of isalnum)
new_list_2 = [i for i in lista if not i.replace(" ", "").replace("_", "").isalpha()]
print(new_list_2)
Output:
['A8817-2938-228', '12421', '12323-12928-A', '12323-12928', '-', 'A8232-2938-228', 'N7261-8271']
It seems you can just test at least one character is a digit or equality with '-':
res = [i for i in lista if any(ch.isdigit() for ch in i) or i == '-']
print(res)
['A8817-2938-228', '12421', '12323-12928-A', '12323-12928',
'-', 'A8232-2938-228', 'N7261-8271']
What type is the data in your list?
You can try this:
[str(i).isalnum() for i in lista]
Given the following list of strings:
my_list = ['element0 123 321\n', 'element1 223 32221\n', 'element2 19823 328771\n', ... ]
how can I split each entry into a list of tuples:
[ (123, 321), (223, 32221), (19823, 328771), ... ]
In my other poor attempt, I managed to extract the numbers, but I encountered a problem, the element placeholder also contains a number which this method includes! It also doesn't write to a tuple, rather a list.
numbers = list()
for s in my_list:
    for x in s:
        if x.isdigit():
            numbers.append((x))
numbers
We can first build a regex that identifies positive integers:
from re import compile
INTEGER_REGEX = compile(r'\b\d+\b')
Here \d stands for digit (so 0, 1, etc.), + for one or more, and \b are word boundaries.
We can then use INTEGER_REGEX.findall(some_string) to identify all positive integers from the input. Now the only thing left to do is iterate through the elements of the list, and convert the output of INTEGER_REGEX.findall(..) to a tuple. We can do this with:
output = [tuple(INTEGER_REGEX.findall(l)) for l in my_list]
For your given sample data, this will produce:
>>> [tuple(INTEGER_REGEX.findall(l)) for l in my_list]
[('123', '321'), ('223', '32221'), ('19823', '328771')]
Note that digits that are not separate words will not be matched. For instance the 8 in 'see you l8er' will not be matched, since it is not a word.
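If you need integers rather than strings, as in the desired output of the question, a small variation converts each match with int(). This is only a sketch using the three sample entries from the question.

```python
import re

INTEGER_REGEX = re.compile(r'\b\d+\b')

my_list = ['element0 123 321\n', 'element1 223 32221\n', 'element2 19823 328771\n']

# The digit in 'element0' is not a separate word, so it is not matched.
output = [tuple(int(n) for n in INTEGER_REGEX.findall(line)) for line in my_list]
print(output)
```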
Your attempt iterates over each character of the string. You have to split the string on whitespace instead, a task that str.split handles flawlessly.
Also, numbers.append((x)) is the same as numbers.append(x). For a tuple of one element, add a comma before the closing parenthesis, although that doesn't solve the problem either.
Now, each entry seems to contain an id (to be skipped) followed by two integers as strings, so why not split, drop the first token, and convert the rest to a tuple of integers?
my_list = ['element0 123 321\n', 'element1 223 32221\n', 'element2 19823 328771\n']
result = [tuple(map(int,x.split()[1:])) for x in my_list]
print(result)
gives:
[(123, 321), (223, 32221), (19823, 328771)]
If I have a list of strings:
first = []
last = []
my_list = [' abc 1..23',' bcd 34..405','cda 407..4032']
how would I append the numbers flanking the .. to their corresponding lists, to get:
first = [1,34,407]
last = [23,405,4032]
I wouldn't mind strings either, because I can convert them to int later:
first = ['1','34','407']
last = ['23','405','4032']
Use re.search to match the numbers between .. and store them in two different groups:
import re
first = []
last = []
for s in my_list:
    match = re.search(r'(\d+)\.\.(\d+)', s)
    first.append(match.group(1))
    last.append(match.group(2))
I'd use a regular expression:
import re
num_range = re.compile(r'(\d+)\.\.(\d+)')
first = []
last = []
my_list = [' abc 1..23',' bcd 34..405','cda 407..4032']
for entry in my_list:
    match = num_range.search(entry)
    if match is not None:
        f, l = match.groups()
        first.append(int(f))
        last.append(int(l))
This outputs integers:
>>> first
[1, 34, 407]
>>> last
[23, 405, 4032]
One more solution.
for string in my_list:
    numbers = string.split(" ")[-1]
    first_num, last_num = numbers.split("..")
    first.append(first_num)
    last.append(last_num)
It will raise a ValueError if there is no ".." after the last space in one of the strings, or if there is more than one ".." after the last space. Note that a string with no spaces at all is processed as a whole, so it only fails if it does not contain exactly one "..".
In fact, this is a good thing if you want to be sure that values were really obtained from all the strings, and that all of them were placed after the last space. You can even add a try/except block to do something in case the string it tries to process is in an unexpected format.
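A minimal sketch of that error handling, assuming you want to collect the malformed strings rather than stop. The skipped list and the extra 'no-range-here' entry are made up for illustration.

```python
my_list = [' abc 1..23', ' bcd 34..405', 'cda 407..4032', 'no-range-here']

first, last, skipped = [], [], []
for string in my_list:
    try:
        numbers = string.split(" ")[-1]
        # Unpacking fails with ValueError unless there is exactly one ".."
        first_num, last_num = numbers.split("..")
    except ValueError:
        skipped.append(string)   # remember the unexpected entry and move on
        continue
    first.append(first_num)
    last.append(last_num)

print(first, last, skipped)
```

Only appending after both values were unpacked keeps first and last the same length even when an entry is skipped.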
first=[(i.split()[1]).split("..")[0] for i in my_list]
second=[(i.split()[1]).split("..")[1] for i in my_list]