Split list into sub-lists based on integer in string - python

I have a list of strings as such:
['text_1.jpg', 'othertext_1.jpg', 'text_2.jpg', 'othertext_2.jpg', ...]
In reality, there are more than 2 entries per number, but this is the general format. I would like to split this list into a list of lists as such:
[['text_1.jpg', 'othertext_1.jpg'], ['text_2.jpg', 'othertext_2.jpg'], ...]
The sub-lists are based on the integer after the underscore. My current method is to first sort the list by those numbers, as shown in the first sample above, and then iterate through each index, copying the value into a new list whenever its integer matches that of the previous entry.
I am wondering if there is a simpler, more Pythonic way of performing this task.

Try:
import re
lst = ["text_1.jpg", "othertext_1.jpg", "text_2.jpg", "othertext_2.jpg"]
r = re.compile(r"_(\d+)\.jpg")
out = {}
for val in lst:
    num = r.search(val).group(1)
    out.setdefault(num, []).append(val)
print(list(out.values()))
Prints:
[['text_1.jpg', 'othertext_1.jpg'], ['text_2.jpg', 'othertext_2.jpg']]

Similar solution to @Andrej's:
import itertools
import re
def find_number(s):
    # Python caches recently used compiled patterns, so passing the pattern
    # string to re.search is fine; feel free to compile it first
    return re.search(r'_(\d+)\.jpg', s).group(1)
l = ['text_1.jpg', 'othertext_1.jpg', 'text_2.jpg', 'othertext_2.jpg']
# groupby only groups consecutive items, so l must already be ordered by number
res = [list(v) for k, v in itertools.groupby(l, find_number)]
print(res)
# [['text_1.jpg', 'othertext_1.jpg'], ['text_2.jpg', 'othertext_2.jpg']]
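If the input is not already ordered by number, itertools.groupby will split one number into several groups. A minimal sketch of sorting first, assuming every name ends in _<digits>.jpg; the names NUMBER_RE and number_of and the sample data are made up for illustration:

```python
import itertools
import re

# Hypothetical helper names; the pattern assumes names end in "_<digits>.jpg"
NUMBER_RE = re.compile(r"_(\d+)\.jpg$")

def number_of(name):
    """Return the integer after the last underscore, e.g. 'text_10.jpg' -> 10."""
    return int(NUMBER_RE.search(name).group(1))

names = ["text_10.jpg", "othertext_2.jpg", "text_2.jpg", "othertext_10.jpg", "text_1.jpg"]

# Sort numerically (10 after 2), then group the now-consecutive equal keys
ordered = sorted(names, key=number_of)
groups = [list(g) for _, g in itertools.groupby(ordered, key=number_of)]
print(groups)
```

Converting the key to int keeps 10 after 2, which a plain string comparison would not.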

Related

Separating list of strings with similarities into different lists

I am trying to separate a list with similar strings into multiple lists in python.
E.g. let's say the list is:
lst = ["asd_A01_000.csv", "asd_A02_000.csv", "asd_A02_001.csv", "asd_A01_001.csv", "asd_A04_000.csv"]
and I want to have new lists with any new codes like "A01" (so would have A01, A02, A04 etc.) meaning the result I want would be
["asd_A01_000.csv","asd_A01_001.csv"]
["asd_A02_000.csv","asd_A02_001.csv"]
["asd_A04_000.csv"]
The numbers do not have to be in order, as long as they are in different lists.
It is pretty easy to just do this one by one using a for loop where "A01" in list, but I have codes ranging from A01-A100.
Is there an easy way to do this without doing tons of for loops?
P.S The strings are actually full file directory paths which also have _'s in them (e.g C:\Users\Name\Documents\0XX_20220719_XX\asd_A001_000.csv)
One approach:
from collections import defaultdict
lst = ["asd_A01_000.csv", "asd_A02_000.csv", "asd_A02_001.csv", "asd_A01_001.csv", "asd_A04_000.csv"]
d = defaultdict(list)
for e in lst:
    d[e.split("_")[1]].append(e)
res = list(d.values())
print(res)
Output
[['asd_A01_000.csv', 'asd_A01_001.csv'], ['asd_A02_000.csv', 'asd_A02_001.csv'], ['asd_A04_000.csv']]
You can try itertools.groupby(), which needs the list sorted by the same key first:
import itertools
lst = sorted(lst, key=lambda asd: asd.split("_")[1])
out = [list(g) for _, g in itertools.groupby(lst, lambda asd: asd.split("_")[1])]
print(out)
[['asd_A01_000.csv', 'asd_A01_001.csv'], ['asd_A02_000.csv', 'asd_A02_001.csv'], ['asd_A04_000.csv']]
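The question's P.S. mentions that the real strings are full Windows paths whose folders also contain underscores, in which case split("_")[1] would pick up part of a folder name. One possible sketch is to split only the file name; the paths below are hypothetical stand-ins, and it assumes the code is always the second underscore-separated field of the file name:

```python
from collections import defaultdict
from pathlib import PureWindowsPath

# Hypothetical paths, modelled on the format described in the question
paths = [
    r"C:\data\0XX_20220719_XX\asd_A01_000.csv",
    r"C:\data\0XX_20220719_XX\asd_A02_000.csv",
    r"C:\data\0XX_20220719_XX\asd_A01_001.csv",
]

groups = defaultdict(list)
for p in paths:
    # PureWindowsPath parses backslash paths on any operating system
    filename = PureWindowsPath(p).name      # e.g. 'asd_A01_000.csv'
    code = filename.split("_")[1]           # e.g. 'A01'
    groups[code].append(p)

result = list(groups.values())
print(result)
```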

How to change the index of an element in a list/array to another position/index without deleting/changing the original element and its value

For example, let's say I have a list as below:
list = ['list4','this1','my3','is2'] or [1,6,'one','six']
Now I want to change the index of each element to match its number, or to whatever order I see fit (it needn't be based on a number), like so (basically, move each element to wherever I want):
list = ['this1','is2','my3','list4'] or ['one',1,'six',6]
How do I do this, whether or not there are numbers?
Please help. Thanks in advance.
If you don't want to use regex and learn its mini-language, use this simpler method:
list1 = ['list4','this1', 'he5re', 'my3','is2']
def mySort(string):
    # Collect the digit characters (we expect only one number per string)
    digits = [float(char) for char in string if char.isdigit()]
    # Sort by the first digit; strings without any digit go last
    return digits[0] if digits else float('inf')
list1.sort(key = mySort)
print(list1)
Inspired by this answer: https://stackoverflow.com/a/4289557/11101156
For the first one, it is easy:
>>> lst = ['list4','this1','my3','is2']
>>> lst = sorted(lst, key=lambda x:int(x[-1]))
>>> lst
['this1', 'is2', 'my3', 'list4']
But this assumes each item is a string whose last character is numeric, and it only works as long as the numeric part of each item is a single digit; otherwise it breaks. For the second one, you need to define what "as you see fit" means before it can be sorted by any logic.
If there are multiple numeric characters:
>>> import re
>>> lst = ['lis22t4','th2is21','my3','is2']
>>> sorted(lst, key=lambda x:int(re.search(r'\d+$', x).group(0)))
['is2', 'my3', 'lis22t4', 'th2is21']
But you can always do:
>>> lst = [1,6,'one','six']
>>> lst = [lst[2], lst[0], lst[3], lst[1]]
>>> lst
['one', 1, 'six', 6]
Also, don't use Python built-ins as variable names; list is a bad variable name.
If you just want to move element in position 'y' to position 'x' of a list, you can try this one-liner, using pop and insert:
lst.insert(x, lst.pop(y))
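A quick illustration of that one-liner on the list from the question. The values of x and y are picked arbitrarily here; note that x is interpreted after the pop, when the list is one element shorter:

```python
lst = ['list4', 'this1', 'my3', 'is2']

y, x = 0, 3                   # move the element at index 0 to index 3
lst.insert(x, lst.pop(y))     # pop removes 'list4', insert puts it back at x
print(lst)
```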
If you know the order in which you want to rearrange the indexes, you can write simple code:
old_list= ['list4','this1','my3','is2']
order = [1, 3, 2, 0]
new_list = [old_list[idx] for idx in order]
If you can write your logic as a function, you can use sorted() and pass your function name as a key:
old_list= ['list4','this1','my3','is2']
def extract_number(string):
    digits = ''.join([c for c in string if c.isdigit()])
    return int(digits)
new_list = sorted(old_list, key = extract_number)
In this case the list is sorted by the number formed by joining all the digits found in each string.
a = [1,2,3,4]
def rep(s, l, ab):
    idx = l.index(s)  # current position of the element to move
    q = s
    del l[idx]
    l.insert(ab, q)   # re-insert it at the target position
    return l
l = rep(a[0], a, 2)
print(l)
Hope you like this.
It's much simpler.

Python: Compare Lists

I have two lists a and b:
a = ['146769015', '163081689', '172235774', ...]
b = [['StackOverflow (146769015)'], ['StackOverflow (146769015)'], ['StackOverflow (163081689)'], ...]
What I'm trying to do is to check if the elements of list a are in list b, and if they are, how many times they appear.
In this case the output should be:
'146769015':2
'163081689':1
I've already tried the set() function, but that does not seem to work:
print(set(a)&set(b))
And I get this:
print(set(a)&set(b))
TypeError: unhashable type: 'list'
Is it possible to do what I want?
Thank you all.
When you perform set(a) & set(b), you're trying to see which elements both lists share. There are a couple of errors in your logic.
First, your first list is made up of strings, while your second list is made up of lists, which are unhashable and so cannot be put in a set.
Second, the elements of your second list are never equal to those of your first list, because the first has only numbers and the second has numbers and letters.
Third, even if you extract only the numbers, the intersection of both sets tells you which numbers are in both sets, but not how many times they occur.
A good approach might be to extract the numbers in your second list and then count occurrences if they are present in list a:
from collections import Counter
import re
a=['146769015', '163081689', '172235774']
b=[['StackOverflow (146769015)'],['StackOverflow (146769015)'],['StackOverflow (163081689)']]
numbs = [re.search(r'\d+', elem[0]).group(0) for elem in b]
cnt = Counter()
for n in numbs:
    if n in a:
        cnt[n] += 1
Output:
Counter({'146769015': 2, '163081689': 1})
I'll leave it as homework for you to research what dictionaries and Counters are.
It's tricky because each string in a is only a substring of the strings in b; otherwise I think you could use a Counter from collections and look up each element of a as a key.
As it is, you can flatten the list and use a nested loop over it.
from collections import defaultdict
flat_list = [item for sublist in b for item in sublist]
c = defaultdict(lambda: 0)
for string in a:
    for string2 in flat_list:
        if string in string2:
            c[string] += 1
You can use a dictionary:
a=['146769015', '163081689', '172235774']
b=[['StackOverflow (146769015)'],['StackOverflow (146769015)'],['StackOverflow (163081689)']]
c = {}
for s in a:
    for d in b:
        for i in d:
            if s in i:
                if s not in c:
                    c[s] = 1
                else:
                    c[s] += 1
print(c)
Output:
{'146769015': 2, '163081689': 1}
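For longer lists, one option is to avoid the nested loops by turning a into a set and extracting the ids with a regex. This is only a sketch: it assumes each id is the number inside parentheses, and the names wanted and id_pattern are made up for illustration:

```python
import re
from collections import Counter

a = ['146769015', '163081689', '172235774']
b = [['StackOverflow (146769015)'], ['StackOverflow (146769015)'], ['StackOverflow (163081689)']]

wanted = set(a)                          # set membership tests are O(1) on average
id_pattern = re.compile(r"\((\d+)\)")    # assumption: ids appear as "(digits)"

# Count every parenthesised id in b that also appears in a
counts = Counter(
    match.group(1)
    for sublist in b
    for text in sublist
    for match in id_pattern.finditer(text)
    if match.group(1) in wanted
)
print(dict(counts))
```

Unlike a substring test, this compares whole ids, so an id in a that happens to be a prefix of a longer id in b is not counted.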

Matching elements between lists in Python - keeping location

I have two lists, both fairly long. List A contains a list of integers, some of which are repeated in list B. I can find which elements appear in both by using:
idx = set(list_A).intersection(list_B)
This returns a set of all the elements appearing in both list A and list B.
However, I would like to find a way to find the matches between the two lists and also retain information about the elements' positions in both lists. Such a function might look like:
def match_lists(list_A,list_B):
    .
    .
    .
    return match_A,match_B
where match_A would contain the positions of elements in list_A that had a match somewhere in list_B and vice-versa for match_B.
I can see how to construct such lists using a for-loop, however this feels like it would be prohibitively slow for long lists.
Regarding duplicates: list_B has no duplicates in it, if there is a duplicate in list_A then return all the matched positions as a list, so match_A would be a list of lists.
That should do the job :)
def match_list(list_A, list_B):
    intersect = set(list_A).intersection(list_B)
    interPosA = [[i for i, x in enumerate(list_A) if x == dup] for dup in intersect]
    interPosB = [i for i, x in enumerate(list_B) if x in intersect]
    return interPosA, interPosB
(Thanks to machine yearning for duplicate edit)
Use dicts or defaultdicts to store the unique values as keys that map to the indices they appear at, then combine the dicts:
from collections import defaultdict
def make_offset_dict(it):
    ret = defaultdict(list)  # Or set, the values are unique indices either way
    for i, x in enumerate(it):
        ret[x].append(i)
    return ret
dictA = make_offset_dict(A)  # A and B are the two input lists
dictB = make_offset_dict(B)
for k in dictA.keys() & dictB.keys():  # .viewkeys() on Py2
    print(k, dictA[k], dictB[k])
This iterates A and B exactly once each so it works even if they're one-time use iterators, e.g. from a file-like object, and it works efficiently, storing no more data than needed and sticking to cheap hashing based operations instead of repeated iteration.
This isn't the solution to your specific problem, but it preserves all the information needed to solve your problem and then some (e.g. it's cheap to figure out where the matches are located for any given value in either A or B); you can trivially adapt it to your use case or more complicated ones.
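One way this might be adapted to the match_A / match_B shape asked for in the question is sketched below. It assumes, as the question states, that list_B has no duplicates; the sample lists are made up, and ordering the results by position in list_B is a choice made here, not something the question specifies:

```python
from collections import defaultdict

def make_offset_dict(items):
    """Map each value to the list of indices where it occurs."""
    offsets = defaultdict(list)
    for i, x in enumerate(items):
        offsets[x].append(i)
    return offsets

def match_lists(list_A, list_B):
    offsets_A = make_offset_dict(list_A)
    match_A, match_B = [], []
    # Walk list_B in order so match_A[k] and match_B[k] describe the same value
    for j, x in enumerate(list_B):
        if x in offsets_A:                 # membership test does not add keys
            match_A.append(offsets_A[x])   # all positions of x in list_A
            match_B.append(j)              # the single position of x in list_B
    return match_A, match_B

list_A = [5, 3, 5, 8, 1]   # hypothetical sample data
list_B = [1, 5, 9]
match_A, match_B = match_lists(list_A, list_B)
print(match_A, match_B)
```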
How about this:
def match_lists(list_A, list_B):
    idx = set(list_A).intersection(list_B)
    A_indexes = []
    for i, element in enumerate(list_A):
        if element in idx:
            A_indexes.append(i)
    B_indexes = []
    for i, element in enumerate(list_B):
        if element in idx:
            B_indexes.append(i)
    return A_indexes, B_indexes
This only runs through each list once (requiring only one dict) and also works with duplicates in list_B. Note that if list_A contains duplicates, the dict keeps only the last position of each value.
def match_lists(list_A,list_B):
    da=dict((e,i) for i,e in enumerate(list_A))
    for bi,e in enumerate(list_B):
        try:
            ai=da[e]
            yield (e,ai,bi) # element e is in position ai in list_A and bi in list_B
        except KeyError:
            pass
Try this:
def match_lists(list_A, list_B):
    match_A = {}
    match_B = {}
    for elem in list_A:
        if elem in list_B:
            match_A[elem] = list_A.index(elem)
            match_B[elem] = list_B.index(elem)
    return match_A, match_B

Trying to add to dictionary values by counting occurrences in a list of lists (Python)

I'm trying to get a count of items in a list of lists and add those counts to a dictionary in Python. I have successfully made the list (it's a list of all possible combos of occurrences for individual ad viewing records) and a dictionary with keys equal to all the values that could possibly appear, and now I need to count how many times each occur and change the values in the dictionary to the count of their corresponding keys in the list of lists. Here's what I have:
import itertools
stuff=(1,2,3,4)
n=1
combs=list()
while n<=len(stuff):
    combs.append(list(itertools.combinations(stuff,n)))
    n = n+1
viewers=((1,3,4),(1,2,4),(1,4),(1,2),(1,4))
recs=list()
h=1
while h<=len(viewers):
    j=1
    while j<=len(viewers[h-1]):
        recs.append(list(itertools.combinations(viewers[h-1],j)))
        j=j+1
    h=h+1
showcount={}
for list in combs:
    for item in list:
        showcount[item]=0
for k, v in showcount:
    for item in recs:
        for item in item:
            if item == k:
                v = v+1
I've tried a bunch of different ways to do this, and I usually either get 'too many values to unpack' errors or it simply doesn't populate. There are several similar questions posted but I'm pretty new to Python and none of them really addressed what I needed close enough for me to figure it out. Many thanks.
Use a Counter instead of an ordinary dict to count things:
from collections import Counter
showcount = Counter()
for item in recs:
    showcount.update(item)
or even:
from collections import Counter
from itertools import chain
showcount = Counter(chain.from_iterable(recs))
As you can see that makes your code vastly simpler.
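A Counter only holds the combinations that were actually seen. If, as in the question, the dictionary should also list every possible combination with a count of 0, one possible sketch is to count first and then fill a plain dict. The helper name all_combos is made up for illustration, and this assumes each viewer tuple is in the same ascending order as stuff:

```python
from collections import Counter
from itertools import chain, combinations

stuff = (1, 2, 3, 4)
viewers = ((1, 3, 4), (1, 2, 4), (1, 4), (1, 2), (1, 4))

def all_combos(items):
    """Return an iterator over every non-empty combination of items, shortest first."""
    return chain.from_iterable(
        combinations(items, n) for n in range(1, len(items) + 1)
    )

# Count each combination that occurs in any viewer record
seen = Counter(chain.from_iterable(all_combos(v) for v in viewers))

# Every possible combination as a key; a Counter returns 0 for unseen keys
showcount = {combo: seen[combo] for combo in all_combos(stuff)}
print(showcount[(1, 4)], showcount[(2, 3)])
```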
If all you want to do is flatten your list of lists, you can use itertools.chain.from_iterable():
>>> import itertools
>>> listOfLists = ((1,3,4),(1,2,4),(1,4),(1,2),(1,4))
>>> flatList = itertools.chain.from_iterable(listOfLists)
The Counter object from the collections module will probably do the rest of what you want.
>>> from collections import Counter
>>> Counter(flatList)
Counter({1: 5, 4: 4, 2: 2, 3: 1})
I have some old code that resembles the issue, it might prove useful to people facing a similar problem.
import sys
file = open(sys.argv[-1], "r").read()
wordictionary={}
for word in file.split():
    if word not in wordictionary:
        wordictionary[word] = 1
    else:
        wordictionary[word] += 1
sortable = [(wordictionary[key], key) for key in wordictionary]
sortable.sort()
sortable.reverse()
for member in sortable: print (member)
First, 'flatten' the list using a generator expression: (item for sublist in combs for item in sublist).
Then, iterate over the flattened list. For each item, you either add an entry to the dict (if it doesn't already exist), or add one to the value.
d = {}
for key in (item for sublist in combs for item in sublist):
    try:
        d[key] += 1
    except KeyError:  # raised when the key is not in the dict yet
        d[key] = 1
This technique assumes all the elements of the sublists are hashable and can be used as keys.
