Given the following list of strings:
my_list = ['element0 123 321\n', 'element1 223 32221\n', 'element2 19823 328771\n', ... ]
how can I split each entry into a list of tuples:
[ (123, 321), (223, 32221), (19823, 328771), ... ]
In my other poor attempt, I managed to extract the numbers, but I ran into a problem: the element placeholder also contains a number, which this method includes! It also writes to a list rather than to tuples.
numbers = list()
for s in my_list:
    for x in s:
        if x.isdigit():
            numbers.append((x))
numbers
We can first build a regex that identifies positive integers:
from re import compile
INTEGER_REGEX = compile(r'\b\d+\b')
Here \d stands for digit (so 0, 1, etc.), + for one or more, and \b are word boundaries.
We can then use INTEGER_REGEX.findall(some_string) to identify all positive integers from the input. Now the only thing left to do is iterate through the elements of the list, and convert the output of INTEGER_REGEX.findall(..) to a tuple. We can do this with:
output = [tuple(INTEGER_REGEX.findall(l)) for l in my_list]
For your given sample data, this will produce:
>>> [tuple(INTEGER_REGEX.findall(l)) for l in my_list]
[('123', '321'), ('223', '32221'), ('19823', '328771')]
Note that digits that are not separate words will not be matched. For instance the 8 in 'see you l8er' will not be matched, since it is not a word.
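Since findall returns strings, the tuples above hold '123' rather than 123. A minimal sketch of getting the integers asked for in the question, assuming every match is a plain run of digits (which the pattern guarantees), is to pass each match through int before building the tuple:

```python
import re

INTEGER_REGEX = re.compile(r'\b\d+\b')
my_list = ['element0 123 321\n', 'element1 223 32221\n', 'element2 19823 328771\n']

# findall returns strings, so map each match through int before building the tuple
output = [tuple(map(int, INTEGER_REGEX.findall(line))) for line in my_list]
print(output)
# [(123, 321), (223, 32221), (19823, 328771)]
```

The 0 in 'element0' is skipped because there is no word boundary between the letter t and the digit.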
Your attempt iterates over each character of the string. You have to split the string on whitespace instead, a task that str.split does flawlessly.
Also, numbers.append((x)) is the same as numbers.append(x). For a tuple of 1 element, add a comma before the closing parenthesis, although that doesn't solve the problem either.
Now, each string seems to contain an id (to be skipped), then 2 integers as strings, so why not split, drop the first token, and convert the rest to a tuple of integers?
my_list = ['element0 123 321\n', 'element1 223 32221\n', 'element2 19823 328771\n']
result = [tuple(map(int,x.split()[1:])) for x in my_list]
print(result)
gives:
[(123, 321), (223, 32221), (19823, 328771)]
I have a table with a street number field that should have been all numeric values but the creators of the dataset allowed invalid values like 567M or 4321.5 to indicate an apartment number or granny unit. I need to get whole numbers into a new field and place the letter values and decimal values into a street suffix field. I've been playing around with regex and isalpha() and isalnum():
# import regex
import re
# list of values that should be all numbers
lst = ['1234', '4321.5', '567M']
# create new lists where l1 will be numeric datatypes and l2 will contain suffixes
# if there is a decimal symbol the string should be split with first value being all numbers
# and going into l1 and everything after decimal in l2
l1 = []
l2 =[]
for i in lst:
    if i.isalnum(): # all characters are numeric and good to go. Maybe need to do int()?
        l1.append(i)
    elif '.' in i: # a decimal was found and values need to be split and placed into two different lists
        i.split(".")
        l1.append(i[0])
        l2.append(i[-1])
    else:
        if i.isnumeric() == False: # if a letter is found in a list item everything prior to letter goes to l1 and letter goes to l2
            i = re.split('(\d+)', i)
            l1.append(i[0])
            l2.append(i[-1])
I immediately got this back when running the code:
['4321', '5']
And then got this for l1 and l2 (l1 being new numeric list and l2 being string suffix list):
l1
['1234', '4', '567M']
l2
['5']
Am I headed in the right direction here? I was hoping this would be simpler but the data is pretty wonky.
You can just use a regex to do the pattern matching. The expression would consist of three parts:
the street number: [1-9]\d*
an optional delimiter, e.g. '.', ',', '-' or a space: [\.,\- ]*
the suffix (any chars except whitespaces): [^\s]*
import re
pattern = re.compile(r'(?P<number>[1-9]\d*)[\.,\- ]*(?P<suffix>[^\s]*)')
items = ['1234', '4321.5', '567M', ' 31-a ']  # named items so the built-in list is not shadowed
l1 = []
l2 = []
for item in items:
    match = pattern.search(item)
    if match:
        l1.append(int(match.group('number')))
        l2.append(match.group('suffix'))
print(f'l1: {l1}')
print(f'l2: {l2}')
Result:
l1: [1234, 4321, 567, 31]
l2: ['', '5', 'M', 'a']
Your first issue is that i.split(".") doesn't do anything useful by itself, it returns a list with the split pieces. You're not using that (and it seems like your interpreter is printing it out for you), and you go on to index into the original, unsplit string. You could use i = i.split(".") to fix that part of your code, but below I suggest a better approach.
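As a small illustration of that point (the variable names unchanged and parts are only for this sketch), str.split returns a new list and leaves the original string untouched, so indexing i afterwards still indexes single characters:

```python
i = '4321.5'

i.split('.')          # the returned list is discarded; i itself is not modified
unchanged = i         # still the original string
first_char = i[0]     # indexing the string gives one character, not '4321'

parts = i.split('.')  # keep the result to actually use the pieces
print(unchanged, first_char, parts)
# 4321.5 4 ['4321', '5']
```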
The other issue is that your regex splitting code isn't working either. Rather than try to fix it, I'd suggest using an entirely different regex approach, which will also handle the other cases you're looking at:
for i in lst:
    match = re.match(r'(\d+)\.?(.*)', i)
    if match is None:
        # f-string so the offending value actually appears in the message
        raise ValueError(f"invalid address format {i}")
    l1.append(match.group(1))
    l2.append(match.group(2))
This uses a regex pattern to match the two parts of the street number you want. Any leading numeric part gets matched by the first capturing group and put into l1, and whatever else is trailing (optionally after a decimal point) goes into l2. The second part may be an empty string, if there is no extra part to the street number, but I still put it into l2, which you'll appreciate when you try to iterate over l1 and l2 together.
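To show why keeping the empty suffix helps, here is a self-contained sketch of the same loop followed by zip, which pairs each number with its suffix. The int() conversion and the name pairs are additions for illustration, not part of the code above:

```python
import re

lst = ['1234', '4321.5', '567M']
l1, l2 = [], []

for i in lst:
    match = re.match(r'(\d+)\.?(.*)', i)
    if match is None:
        raise ValueError(f"invalid address format {i}")
    l1.append(int(match.group(1)))  # assumption: whole numbers are wanted as ints
    l2.append(match.group(2))       # may be '' when there is no suffix

# Because both lists have one entry per input, zip keeps them aligned
pairs = list(zip(l1, l2))
print(pairs)
# [(1234, ''), (4321, '5'), (567, 'M')]
```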
Sorry if this is a very noob question, but I have tried to solve this on my own for some time and gave it a few searches (used the "map" function, etc.), and I did not find a solution. Maybe it's a small mistake somewhere, but I am new to Python and seem to have some sort of tunnel vision.
I have some text (see sample) that has numbers in between. I want to extract all numbers with regular expressions into a list and then sum them. I seem to be able to do the extraction, but struggle to convert them to integers and then sum them.
import re
df = ["test 4497 test 6702 test 8454 test",
      "7449 test"]
numlist = list()
for line in df:
    line = line.rstrip()
    numbers = re.findall("[0-9]+", line) # find numbers
    if len(numbers) < 1: continue # ignore lines with no numbers, none in this sample
    numlist.append(numbers) # create list of numbers
Calling sum(numlist) returns an error.
You don't need a regex for this. Split the strings in the list, and sum those that are numeric in a comprehension:
sum(sum(int(i) for i in s.split() if i.isnumeric()) for s in df)
# 27102
Or similarly, flatten the resulting lists, and sum once:
from itertools import chain
sum(chain.from_iterable((int(i) for i in s.split() if i.isnumeric()) for s in df))
# 27102
This is the source of your problem:
findall returns a list, which you are appending to numlist, itself a list. So you end up with a list of lists. You should instead do:
numlist.extend(numbers)
So that you end up with a single list of numbers (well, actually string representations of numbers). Then you can convert the strings to integers and sum:
the_sum = sum(int(n) for n in numlist)
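Put together with the sample data from the question, the fix might look like the sketch below. It drops the rstrip and the empty-result check from the original loop, on the assumption that extending with an empty list is harmless:

```python
import re

df = ["test 4497 test 6702 test 8454 test",
      "7449 test"]

numlist = []
for line in df:
    # extend adds each match individually, so numlist stays a flat list of strings
    numlist.extend(re.findall(r"[0-9]+", line))

the_sum = sum(int(n) for n in numlist)
print(numlist)
print(the_sum)
# ['4497', '6702', '8454', '7449']
# 27102
```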
Use a nested loop over df and append each number to numlist:
numlist = list()
for item in df:
    for word in item.split():
        if word.isnumeric():
            numlist.append(int(word))
print(numlist)
print(sum(numlist))
Out:
[4497, 6702, 8454, 7449]
27102
You could make a one-liner using list comprehension:
print(sum([int(word) for item in df for word in item.split() if word.isnumeric()]))
>>> 27102
It's as easy as
my_sum = sum(map(int, numbers_list))
Here is an option using map, filter and sum:
This first splits the strings at the spaces, filters out the non-numbers, casts the number strings to int, and finally sums them.
# if you want the sum per string in the list
sums = [sum(map(int, filter(str.isnumeric, s.split()))) for s in df]
# [19653, 7449]
# if you simply want the sum of all numbers of all strings
sum(sum(map(int, filter(str.isnumeric, s.split()))) for s in df)
# 27102
I want to check a string to see if it contains any of the words I have in my list.
The list has somewhere around 100 individual words.
I have tried using regex but can't get it to work...
string = '<div class="header_links">$$ - $$$, Dansk, Veganske retter, Glutenfri retter</div>'
list = ['Café','Afrikansk','............','Sushi','Svensk','Sydamerikansk','Syditaliensk','Szechuan','Taiwansk','Thai','Tibetansk','Østeuropæisk','Dansk']
In this case the string has 'Dansk' in it. The string could contain more than one of the words in the list.
I want to write a piece of code that prints the words in the list that are also in the string.
In this case the output should be: Dansk
If there was more than one word in the string it should be: Dansk, ...., ....
I hope someone can help.
>>> list = ['Café','Afrikansk','............','Sushi','Svensk','Sydamerikansk','Syditaliensk','Szechuan','Taiwansk','Thai','Tibetansk','Østeuropæisk','Dansk']
>>> string = """<div class="header_links">$$ - $$$, Dansk, Veganske retter, Glutenfri retter</div>"""
>>> [x for x in list if x in string]
['Dansk']
I recommend not using list as a variable name, as it usually refers to the built-in type list (like str or int).
Use a list comprehension with a membership check:
[x for x in lst if x in string]
Note that I have renamed your list to lst, as list is built-in.
Example:
string = '<div class="header_links">$$ - $$$, Dansk, Veganske retter, Glutenfri retter</div>'
lst = ['Café','Afrikansk','Sushi','Svensk','Sydamerikansk','Syditaliensk','Szechuan','Taiwansk','Thai','Tibetansk','Østeuropæisk','Dansk']
print([x for x in lst if x in string])
# ['Dansk']
In your case you can use the following (my_list here is your list of words):
string_intersection = set(string.replace(',', '').split()).intersection(my_list)
print(*string_intersection, sep =',')
output:
Dansk
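The two approaches above differ slightly: the substring check would also report 'Thai' for a string containing 'Thailand', while the set intersection only sees whitespace-separated tokens. A possible middle ground, sketched here with a shortened word list and a hypothetical variable named pattern, is a single regex alternation with word boundaries:

```python
import re

string = '<div class="header_links">$$ - $$$, Dansk, Veganske retter, Glutenfri retter</div>'
lst = ['Café', 'Sushi', 'Thai', 'Dansk']  # shortened list, for illustration only

# Escape each word and join them into one alternation; \b limits matches to whole words
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, lst)) + r')\b')

found = pattern.findall(string)
print(', '.join(found))
# Dansk
```

With around 100 words this still compiles to one pattern, so the string is scanned once rather than once per word.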
I have the following data in a list; it is a hex number:
['aaaaa955554e']
I would like to split this into ['aaaaa9,55554e'] with a comma.
I know how to split this when there is a delimiter in between, but how should I do it in this case?
Thanks
This will do what I think you are looking for:
yourlist = ['aaaaa955554e']
new_list = [','.join([x[i:i+6] for i in range(0, len(x), 6)]) for x in yourlist]
It will put a comma at every sixth character in each item in your list. (I am assuming you will have more than just one item in the list, and that the items are of unknown length. Not that it matters.)
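To illustrate the "unknown length" remark, the same slicing can be wrapped in a small helper. The name chunk_join and the second sample item are made up for this sketch; a string whose length is not a multiple of six simply ends with a shorter final piece:

```python
def chunk_join(s, size=6, sep=','):
    """Join consecutive size-character slices of s with sep."""
    return sep.join(s[i:i + size] for i in range(0, len(s), size))

yourlist = ['aaaaa955554e', 'abcdefgh']  # second item is a made-up example
new_list = [chunk_join(x) for x in yourlist]
print(new_list)
# ['aaaaa9,55554e', 'abcdef,gh']
```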
I assume you want to split at every 6th character.
Using regex:
import re
lst = ['aaaaa955554e']
newlst = re.findall(r'\w{6}', lst[0])
# ['aaaaa9', '55554e']
Using a list comprehension, which works for multiple items in lst:
lst = ['aaaaa955554e']
newlst = [item[i:i+6] for item in lst for i in range(0, len(item), 6)]
# ['aaaaa9', '55554e']
This could be done using a regular expression substitution as follows:
import re
print(re.sub(r'([a-zA-Z]+\d)(.*?)', r'\1,\2', 'aaaaa955554e', count=1))
Giving you:
aaaaa9,55554e
This splits after seeing the first digit.
If I have a list of strings:
first = []
last = []
my_list = [' abc 1..23',' bcd 34..405','cda 407..4032']
how would I append the numbers flanking the .. to their corresponding lists, to get:
first = [1,34,407]
last = [23,405,4032]
I wouldn't mind strings either, because I can convert to int later:
first = ['1','34','407']
last = ['23','405','4032']
Use re.search to match the numbers between .. and store them in two different groups:
import re
first = []
last = []
for s in my_list:
    match = re.search(r'(\d+)\.\.(\d+)', s)
    first.append(match.group(1))
    last.append(match.group(2))
I'd use a regular expression:
import re
num_range = re.compile(r'(\d+)\.\.(\d+)')
first = []
last = []
my_list = [' abc 1..23',' bcd 34..405','cda 407..4032']
for entry in my_list:
    match = num_range.search(entry)
    if match is not None:
        f, l = match.groups()
        first.append(int(f))
        last.append(int(l))
This outputs integers:
>>> first
[1, 34, 407]
>>> last
[23, 405, 4032]
One more solution.
for string in my_list:
    numbers = string.split(" ")[-1]
    first_num, last_num = numbers.split("..")
    first.append(first_num)
    last.append(last_num)
It will raise a ValueError if the part after the last space of a string (or the whole string, when it contains no space) does not contain exactly one "..".
In fact, this is a good thing if you want to be sure that values were really obtained from all the strings, and that all of them were placed after the last space. You can even add a try/except block to do something in case the string it tries to process is in an unexpected format.
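One way that error handling might look is sketched below. The extra 'broken' entry and the skipped list are made up for illustration; malformed strings are collected there instead of stopping the loop:

```python
my_list = [' abc 1..23', ' bcd 34..405', 'cda 407..4032', 'broken']  # last item is a made-up bad entry
first, last, skipped = [], [], []

for string in my_list:
    try:
        numbers = string.split(" ")[-1]
        # unpacking fails with ValueError unless there is exactly one ".."
        first_num, last_num = numbers.split("..")
    except ValueError:
        skipped.append(string)
        continue
    first.append(first_num)
    last.append(last_num)

print(first, last, skipped)
# ['1', '34', '407'] ['23', '405', '4032'] ['broken']
```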
first=[(i.split()[1]).split("..")[0] for i in my_list]
second=[(i.split()[1]).split("..")[1] for i in my_list]