String Compression in Python

I have the following input:
my_list = ["x d1","y d1","z d2","t d2"]
and would like to transform it into:
Expected_result = ["d1(x,y)","d2(z,t)"]
I resorted to brute force and fell back on pandas, since I couldn't find a way to do it in plain/vanilla Python. Is there another way to solve this?
import pandas as pd
my_list = ["x d1","y d1","z d2","t d2"]
df = pd.DataFrame(my_list,columns=["col1"])
df2 = df["col1"].str.split(" ",expand = True)
df2.columns = ["col1","col2"]
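# group by a single column name (not a one-element list) so that each
# group name is a plain string rather than a tuple
grp = df2.groupby("col2")
result = []
for grp_name, data in grp:
    res = grp_name +"(" + ",".join(list(data["col1"])) + ")"
    result.append(res)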
print(result)

The code starts with an empty dictionary.
It then iterates over each item in your list and uses the split() method to split the item into two parts, named key and value in the code (for "x d1", key is "x" and value is "d1").
It then uses the setdefault() method to group by the second part: if value (e.g. "d1") is already present in the dictionary, key is appended to its existing list; otherwise a new entry is created with value as the dictionary key and a list holding key as its first element.
Finally, the list comprehension iterates over the items in the dictionary and builds one string per entry, using the join() method to concatenate the list into a single string.
result = {}
for item in my_list:
    key, value = item.split()
    result.setdefault(value, []).append(key)
# join with ',' (no space) to match the expected result in the question
output = [f"{k}({','.join(v)})" for k, v in result.items()]
print(output)
['d1(x,y)', 'd2(z,t)']

If your values are already sorted by key (d1, d2), you can use itertools.groupby:
from itertools import groupby
out = [f"{k}({','.join(x[0] for x in g)})"
       for k, g in groupby(map(str.split, my_list), lambda x: x[1])]
Output:
['d1(x,y)', 'd2(z,t)']
Otherwise you should use a dictionary as shown by @Jamiu.
A variant of your pandas solution:
out = (df['col1'].str.split(n=1, expand=True)
       .groupby(1)[0]
       .apply(lambda g: f"{g.name}({','.join(g)})")
       .tolist()
      )
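
When the items are not already sorted by key, one option is to sort by the key before calling itertools.groupby. A minimal sketch, assuming every item has exactly the form "value key"; the function name compress_sorted is made up for this illustration:

```python
from itertools import groupby


def compress_sorted(items):
    """Group "value key" strings by key; the input does not need to be sorted."""
    pairs = [item.split() for item in items]
    # list.sort() is stable, so values keep their original relative order per key
    pairs.sort(key=lambda pair: pair[1])
    return [
        f"{key}({','.join(value for value, _ in group)})"
        for key, group in groupby(pairs, key=lambda pair: pair[1])
    ]


print(compress_sorted(["x d1", "z d2", "y d1", "t d2"]))
# ['d1(x,y)', 'd2(z,t)']
```

Note that the keys are compared as strings here, so "d10" would sort before "d2".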

my_list = ["x d1","y d1","z d2","t d2"]
res = []
for item in my_list:
    a, b, *_ = item.split()
    # compare against the key prefix "dN(" so that e.g. d1 does not match d10
    if res and res[-1].startswith(f'{b}('):
        res[-1] = res[-1].replace(')', f',{a})')
    else:
        res.append(f'{b}({a})')
print(res)
['d1(x,y)', 'd2(z,t)']
Here N is the number that follows d. This code works for any number of elements per dN, as long as the items that share the same dN are adjacent in the list (for example, all d1 items first, then all d2 items, and so on). It works with any value of N, and any value can precede dN, as long as each item keeps the form "value dN".
If you need something that also works when the dN items are not grouped together, just say the word, but it will cost a little more.
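
The same single-pass idea can also be written by keeping each key and its values as separate objects and formatting only at the end, which avoids any substring checks on the already formatted text. This is only a sketch under the same assumption (items sharing a key are adjacent); the name compress_adjacent is hypothetical:

```python
def compress_adjacent(items):
    """Merge runs of adjacent "value key" items that share the same key."""
    groups = []  # list of (key, [values]) tuples in input order
    for item in items:
        value, key = item.split()[:2]
        if groups and groups[-1][0] == key:
            groups[-1][1].append(value)  # same key as the previous item
        else:
            groups.append((key, [value]))  # key changed: start a new run
    return [f"{key}({','.join(values)})" for key, values in groups]


print(compress_adjacent(["x d1", "y d1", "z d2", "t d2"]))
# ['d1(x,y)', 'd2(z,t)']
print(compress_adjacent(["z d10", "x d1"]))
# ['d10(z)', 'd1(x)']
```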

Another possible solution, which is based on pandas:
import numpy as np
import pandas as pd
(pd.DataFrame(np.array([str.split(x, ' ') for x in my_list]), columns=['b', 'a'])
 .groupby('a')['b'].apply(lambda x: f'({x.values[0]}, {x.values[1]})')
 .reset_index().sum(axis=1).tolist())
Output:
['d1(x, y)', 'd2(z, t)']
EDIT
The OP, @ShckTchamna, asked for the above solution to be made more general. This edit provides a version that works with the example the OP gives in the comment below.
my_list = ["x d1","y d1","z d2","t d2","kk d2","m d3", "n d3", "s d4"]
(pd.DataFrame(np.array([str.split(x, ' ') for x in my_list]), columns=['b', 'a'])
 .groupby('a')['b'].apply(lambda x: f'({",".join(x.values)})')
 .reset_index().sum(axis=1).tolist())
Output:
['d1(x,y)', 'd2(z,t,kk)', 'd3(m,n)', 'd4(s)']

import pandas as pd
df = pd.DataFrame(data=[e.split(' ') for e in ["x d1","y d1","z d2","t d2"]])
r = (df.groupby(1)
       .apply(lambda r:"{0}({1},{2})".format(r.iloc[0,1], r.iloc[0,0], r.iloc[1,0]))
       .reset_index()
       .rename({1:"points", 0:"coordinates"}, axis=1)
    )
print(r.coordinates.tolist())
# ['d1(x,y)', 'd2(z,t)']
print(r)
#   points coordinates
# 0     d1     d1(x,y)
# 1     d2     d2(z,t)
As a replacement for my previous answer (which also works):
import itertools as it
my_list = [e.split(' ') for e in ["x d1","y d1","z d2","t d2"]]
r=[]
for key, group in it.groupby(my_list, lambda x: x[1]):
    l=[e[0] for e in list(group)]
    r.append("{0}({1},{2})".format(key, l[0], l[1]))
print(r)
Output :
['d1(x,y)', 'd2(z,t)']
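
The format string above expects exactly two values per group. A possible variation that joins however many values each group has is sketched below; the longer sample list is only an illustration, and the items are still assumed to be sorted by key:

```python
from itertools import groupby

pairs = [e.split(" ") for e in ["x d1", "y d1", "z d2", "t d2", "kk d2", "s d4"]]
result = []
for key, group in groupby(pairs, key=lambda pair: pair[1]):
    values = [value for value, _ in group]
    # ",".join handles one, two, or more values per key
    result.append("{0}({1})".format(key, ",".join(values)))
print(result)
# ['d1(x,y)', 'd2(z,t,kk)', 'd4(s)']
```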

Related

Python merge similar list values, from defined number of characters to new list values

I have an already sorted and filtered list of files that looks similar to this:
sortList = ['aa.001', 'aa.002', 'aa.003', 'vvv.001', 'vvv.002', 'vvv.003']
and I would like a new list in which the values that share the same part before the . are merged into separate lists inside a list:
merList = [['aa.001', 'aa.002', 'aa.003'], ['vvv.001', 'vvv.002', 'vvv.003']]
I tried to write a loop, but without success, so it would be great if anyone could help fix it:
merList = []
for name in sortList:
    temp_merList = []
    for b in range(len(sortList)-1):
        if name[b][:-3] == name[b+1][:-3] and name[b] not in merList:
            temp_merList.append(name)
        else:
            merList.append(temp_merList)
print(merList)

You can use itertools.groupby:
from itertools import groupby
sortList = ['aa.001', 'aa.002', 'aa.003', 'vvv.001', 'vvv.002', 'vvv.003']
out = []
for _, g in groupby(sortList, lambda k: k.split('.')[0]):
    out.append(list(g))
print(out)
Prints:
[['aa.001', 'aa.002', 'aa.003'], ['vvv.001', 'vvv.002', 'vvv.003']]
EDIT: Another method (using a temporary dictionary):
sortList = ['aa.001', 'aa.002', 'aa.003', 'vvv.001', 'vvv.002', 'vvv.003']
tmp = {}
for name in sortList:
    tmp.setdefault(name.split('.')[0], []).append(name)
merList = list(tmp.values())
print(merList)
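
If file names may contain more than one dot, splitting on the first dot would cut the prefix short. A sketch that groups on everything before the last dot instead, assuming the numeric part always follows the final dot; the sample names and the helper prefix are made up for this illustration:

```python
from itertools import groupby

names = ['aa.001', 'aa.002', 'v1.2.001', 'v1.2.002']


def prefix(name):
    # str.rpartition splits on the LAST '.', so 'v1.2.001' -> 'v1.2'
    return name.rpartition('.')[0]


# as before, names with the same prefix must already be adjacent
merged = [list(group) for _, group in groupby(names, key=prefix)]
print(merged)
# [['aa.001', 'aa.002'], ['v1.2.001', 'v1.2.002']]
```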

Python Remove duplicates and original from nested list based on specific key

I'm trying to delete all duplicates, including the original, from a nested list based on a specific column.
Example
list = [['abc',3232,'demo text'],['def',9834,'another text'],['abc',988,'another another text'],['poi',1234,'text']]
The key column is the first one (abc, def, abc), and based on it I want to remove every item whose key value appears more than once, including the first occurrence.
So the new list should contain:
newlist = [['def',9834,'another text'],['poi',1234,'text']]
I found many similar topics but not for nested lists...
Any help please?

You can construct a list of keys
keys = [x[0] for x in list]
and select only those records for which the key occurs exactly once
newlist = [x for x in list if keys.count(x[0]) == 1]

Use collections.Counter:
from collections import Counter
lst = [['abc',3232,'demo text'],['def',9834,'another text'],['abc',988,'another another text'],['poi',1234,'text']]
d = dict(Counter(x[0] for x in lst))
print([x for x in lst if d[x[0]] == 1])
# [['def', 9834, 'another text'],
# ['poi', 1234, 'text']]
Also note that you shouldn't name your list as list as it shadows the built-in list.

Using a list comprehension.
Demo:
l = [['abc',3232,'demo text'],['def',9834,'another text'],['abc', 988,'another another text'],['poi',1234,'text']]
checkVal = [i[0] for i in l]
print( [i for i in l if not checkVal.count(i[0]) > 1 ] )
Output:
[['def', 9834, 'another text'], ['poi', 1234, 'text']]

Using collections.defaultdict for an O(n) solution:
L = [['abc',3232,'demo text'],
     ['def',9834,'another text'],
     ['abc',988,'another another text'],
     ['poi',1234,'text']]
from collections import defaultdict
d = defaultdict(list)
for key, num, txt in L:
    d[key].append([num, txt])
res = [[k, *v[0]] for k, v in d.items() if len(v) == 1]
print(res)
[['def', 9834, 'another text'],
 ['poi', 1234, 'text']]
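
The loop above unpacks exactly three fields per row. If rows can have different lengths, or the key is not the first column, a linear-time variant could count keys first and then filter. A sketch, with the function name drop_repeated_keys and its key parameter being assumptions for illustration:

```python
from collections import Counter


def drop_repeated_keys(rows, key=lambda row: row[0]):
    """Return only the rows whose key occurs exactly once, in input order."""
    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] == 1]


rows = [['abc', 3232, 'demo text'], ['def', 9834],
        ['abc', 988, 'x', 'y'], ['poi', 1234, 'text']]
print(drop_repeated_keys(rows))
# [['def', 9834], ['poi', 1234, 'text']]
```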

Python - Split array into multiple arrays

I have an array that contains file names like below:
['001_1.png', '001_2.png', '001_3.png', '002_1.png','002_2.png', '003_1.png', '003_2.png', '003_3.png', '003_4.png', ....]
I want to quickly group these files into multiple arrays like this:
[['001_1.png', '001_2.png', '001_3.png'], ['002_1.png', '002_2.png'], ['003_1.png', '003_2.png', '003_3.png', '003_4.png'], ...]
Could anyone tell me how to do it in a few lines of Python?

If your data is already sorted by the file name, you can use itertools.groupby:
files = ['001_1.png', '001_2.png', '001_3.png', '002_1.png','002_2.png',
         '003_1.png', '003_2.png', '003_3.png']
import itertools
keyfunc = lambda filename: filename[:3]
# this creates an iterator that yields `(group, filenames)` tuples,
# but `filenames` is another iterator
grouper = itertools.groupby(files, keyfunc)
# to get the result as a nested list, we iterate over the grouper to
# discard the groups and turn the `filenames` iterators into lists
result = [list(filenames) for _, filenames in grouper]
print(result)
# [['001_1.png', '001_2.png', '001_3.png'],
#  ['002_1.png', '002_2.png'],
#  ['003_1.png', '003_2.png', '003_3.png']]
Otherwise, you can base your code on this recipe, which is more efficient than sorting the list and then using groupby.
Input: Your input is a flat list, so use a regular ol' loop to iterate over it:
for filename in files:
Group identifier: The files are grouped by the first 3 letters:
group = filename[:3]
Output: The output should be a nested list rather than a dict, which can be done with
result = list(groupdict.values())
Putting it together:
files = ['001_1.png', '001_2.png', '001_3.png', '002_1.png','002_2.png',
         '003_1.png', '003_2.png', '003_3.png']
import collections
groupdict = collections.defaultdict(list)
for filename in files:
    group = filename[:3]
    groupdict[group].append(filename)
result = list(groupdict.values())
print(result)
# [['001_1.png', '001_2.png', '001_3.png'],
#  ['002_1.png', '002_2.png'],
#  ['003_1.png', '003_2.png', '003_3.png']]
Read the recipe answer for more details.
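
The three steps above can be folded into a small reusable helper. This is only a sketch of the same defaultdict approach; the name group_by and the key function used in the example are assumptions, not part of the recipe:

```python
from collections import defaultdict


def group_by(items, keyfunc):
    """Group items by keyfunc(item); works on unsorted input in one pass."""
    groups = defaultdict(list)
    for item in items:
        groups[keyfunc(item)].append(item)
    # dicts preserve insertion order (Python 3.7+), so groups appear
    # in the order in which each key was first seen
    return list(groups.values())


files = ['002_1.png', '001_1.png', '002_2.png', '001_2.png']
print(group_by(files, lambda name: name.split('_')[0]))
# [['002_1.png', '002_2.png'], ['001_1.png', '001_2.png']]
```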

Something like this should work:
import itertools
mylist = [...]
[list(v) for k,v in itertools.groupby(mylist, key=lambda x: x[:3])]
If the input list isn't sorted, then use something like this:
import itertools
mylist = [...]
keyfunc = lambda x:x[:3]
mylist = sorted(mylist, key=keyfunc)
[list(v) for k,v in itertools.groupby(mylist, key=keyfunc)]

You can do it using a dictionary.
# names such as list or dict are avoided here because they shadow the built-ins
files = ['001_1.png', '001_2.png', '003_3.png', '002_1.png', '002_2.png', '003_1.png', '003_2.png', '003_3.png', '003_4.png']
groups = {}
for item in files:
    if item[:3] not in groups:
        groups[item[:3]] = []
    groups[item[:3]].append(item)
Then you have to sort the dictionary by key.
groups = {k:v for k,v in sorted(groups.items())}
The last step is to use a list comprehension in order to achieve your requirement.
result = [v for k,v in groups.items()]
print(result)
Output
[['001_1.png', '001_2.png'], ['002_1.png', '002_2.png'], ['003_3.png', '003_1.png', '003_2.png', '003_3.png', '003_4.png']]

Using a simple iteration and dictionary.
Ex:
l = ['001_1.png', ' 001_2.png', ' 003_3.png', ' 002_1.png', ' 002_2.png', ' 003_1.png', ' 003_2.png', ' 003_3.png', ' 003_4.png']
r = {}
for i in l:
    # use the whole prefix before "_" as the key, ignoring stray whitespace
    v = i.strip().split("_")[0]
    if v not in r:
        r[v] = []
    r[v].append(i)
print(list(r.values()))
Output:
[['001_1.png', ' 001_2.png'], [' 003_3.png', ' 003_1.png', ' 003_2.png', ' 003_3.png', ' 003_4.png'], [' 002_1.png', ' 002_2.png']]

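If your list is ordered like this, here is a short script for this task.
# a is the ordered list of file names from the question
myList = []
for i in a:
    # a name ending in "_1" (before the extension) starts a new group
    if i[:-4].endswith('_1'):
        myList.append([i])
    else:
        myList[-1].append(i)
# [['001_1.png', '001_2.png', '001_3.png'], ['002_1.png', '002_2.png'], ...]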

#IYN
mini_list = []
p = ['001_1.png', '001_2.png', '001_3.png', '002_1.png','002_2.png', '003_1.png', '003_2.png', '003_3.png', '003_4.png']
new_p = []
for index, element in enumerate(p):
    if index == len(p)-1:
        mini_list.append(element)
        new_p.append(mini_list)
        break
    if element[0:3]==p[index+1][0:3]:
        mini_list.append(element)
    else:
        mini_list.append(element)
        new_p.append(mini_list)
        mini_list = []
print (new_p)
The code above will cut the initial list into sub lists and append them as individual lists into a resulting, larger list.
Note: not a few lines, but you can convert this to a function.
def list_cutter(ls):
    mini_list = []
    new_list = []
    for index, element in enumerate(ls):
        if index == len(ls)-1:
            mini_list.append(element)
            new_list.append(mini_list)
            break
        if element[0:3]==ls[index+1][0:3]:
            mini_list.append(element)
        else:
            mini_list.append(element)
            new_list.append(mini_list)
            mini_list = []
    return new_list
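
The same cutting logic can be expressed a little more compactly by remembering the previous prefix instead of looking ahead at the next element. A sketch under the same assumption (the list is already ordered so that equal prefixes are adjacent); cut_by_prefix and its width parameter are hypothetical names:

```python
def cut_by_prefix(names, width=3):
    """Split a list into runs of consecutive names sharing the same prefix."""
    result = []
    previous = None
    for name in names:
        current = name[:width]
        if current != previous:
            result.append([])  # prefix changed: start a new sublist
            previous = current
        result[-1].append(name)
    return result


p = ['001_1.png', '001_2.png', '002_1.png', '003_1.png', '003_2.png']
print(cut_by_prefix(p))
# [['001_1.png', '001_2.png'], ['002_1.png'], ['003_1.png', '003_2.png']]
```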

Filter List by Longest Element Containing a String

I want to filter a list: among all items that share the same last 4 digits, I want to print the longest one.
For example:
lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
# want to return abcdabcd1234 and poiupoiupoiupoiu7890
In this case, we print the longer of the elements containing 1234, and the longer of the elements containing 7890. Finding the longest element containing a certain element is not hard, but doing it for all items in the list (different last four digits) efficiently seems difficult.
My attempt was to first identify all the different last 4 digits using a loop and a slice:
ids=[]
for x in lst:
    ids.append(x[-4:])
ids = list(set(ids))
Next, I would search through the list by index, with a "max_length" variable and "current_id" to find the largest elements of each id. This is clearly very inefficient, and I was wondering what the best way to do this would be.

Use a dictionary:
>>> lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
>>> d = {} # to keep the longest items for digits.
>>> for item in lst:
...     key = item[-4:] # last 4 characters
...     d[key] = max(d.get(key, ''), item, key=len)
...
>>> list(d.values())
['abcdabcd1234', 'poiupoiupoiupoiu7890']

from collections import defaultdict
d = defaultdict(str)
lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
for x in lst:
    if len(x) > len(d[x[-4:]]):
        d[x[-4:]] = x
To display the results:
for key, value in d.items():
    print(key, '=', value)
which produces:
1234 = abcdabcd1234
7890 = poiupoiupoiupoiu7890

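itertools is great. Use groupby with a lambda to group the list by ending, and from there it is easy. Note that groupby only groups adjacent items, so this relies on items with the same ending sitting next to each other, as they do here: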
>>> from itertools import groupby
>>> lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
>>> [max(y, key=len) for x, y in groupby(lst, lambda l: l[-4:])]
['abcdabcd1234', 'poiupoiupoiupoiu7890']
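
If items with the same ending are not next to each other, one way to keep using groupby is to sort by the same key first. A minimal sketch with a shuffled version of the list; the helper name suffix is made up:

```python
from itertools import groupby

lst = ['gqweri7890', 'abcd1234', 'poiupoiupoiupoiu7890', 'abcdabcd1234']


def suffix(item):
    return item[-4:]  # last 4 characters


# groupby only merges adjacent items, so sort by the same key first
longest = [max(group, key=len)
           for _, group in groupby(sorted(lst, key=suffix), key=suffix)]
print(longest)
# ['abcdabcd1234', 'poiupoiupoiupoiu7890']
```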

Slightly more generic
import string
import collections
lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
# keep only the digits of each item and use them as the grouping key
z = [(''.join(c for c in x if c in string.digits), x) for x in lst]
x = collections.defaultdict(list)
for a, b in z:
    x[a].append(b)
for k in x:
    print(k, max(x[k], key=len))
1234 abcdabcd1234
7890 poiupoiupoiupoiu7890
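
A related variation, not the approach above, is to key on the trailing run of digits only, using a regular expression, so that digits elsewhere in the string do not affect the key. A sketch; the extra item 'x12y7890' is made up to show the difference:

```python
import re
from collections import defaultdict

lst = ['abcd1234', 'abcdabcd1234', 'gqweri7890', 'poiupoiupoiupoiu7890', 'x12y7890']

groups = defaultdict(list)
for item in lst:
    match = re.search(r'\d+$', item)  # digits at the very end of the string
    if match:  # items without trailing digits are skipped
        groups[match.group()].append(item)

for digits, items in groups.items():
    print(digits, max(items, key=len))
# 1234 abcdabcd1234
# 7890 poiupoiupoiupoiu7890
```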

How to split a list into subsets based on a pattern?

I'm doing this, but it feels like it could be achieved with much less code. It is Python after all. Starting with a list, I split that list into subsets based on a string prefix.
# Splitting a list into subsets
# expected outcome:
# [['sub_0_a', 'sub_0_b'], ['sub_1_a', 'sub_1_b']]
mylist = ['sub_0_a', 'sub_0_b', 'sub_1_a', 'sub_1_b']
def func(l, newlist=[], index=0):
    newlist.append([i for i in l if i.startswith('sub_%s' % index)])
    # create a new list without the items in newlist
    l = [i for i in l if i not in newlist[index]]
    if len(l):
        index += 1
        func(l, newlist, index)
func(mylist)

You could use itertools.groupby:
>>> import itertools
>>> mylist = ['sub_0_a', 'sub_0_b', 'sub_1_a', 'sub_1_b']
>>> for k,v in itertools.groupby(mylist,key=lambda x:x[:5]):
...     print(k, list(v))
...
sub_0 ['sub_0_a', 'sub_0_b']
sub_1 ['sub_1_a', 'sub_1_b']
or exactly as you specified it:
>>> [list(v) for k,v in itertools.groupby(mylist,key=lambda x:x[:5])]
[['sub_0_a', 'sub_0_b'], ['sub_1_a', 'sub_1_b']]
Of course, the usual caveats apply (make sure your list is sorted with the same key you're using to group), and you might need a slightly more complicated key function for real-world data...
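
To illustrate both caveats: a fixed slice such as x[:5] would lump 'sub_1_a' and 'sub_10_a' together, and unsorted input would produce split groups. A sketch that uses a regular expression as the key and sorts with that same key first, assuming every name starts with sub_ followed by a number; the helper name prefix and the sample list are made up:

```python
import itertools
import re

mylist = ['sub_0_a', 'sub_10_a', 'sub_0_b', 'sub_10_b']


def prefix(name):
    # 'sub_10_a' -> 'sub_10'; a fixed slice such as name[:5] would give 'sub_1'
    return re.match(r'sub_\d+', name).group()


# sort with the same key used for grouping, since groupby only merges neighbours
grouped = [list(v)
           for _, v in itertools.groupby(sorted(mylist, key=prefix), key=prefix)]
print(grouped)
# [['sub_0_a', 'sub_0_b'], ['sub_10_a', 'sub_10_b']]
```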

In [28]: mylist = ['sub_0_a', 'sub_0_b', 'sub_1_a', 'sub_1_b']
In [29]: lis=[]
In [30]: for x in mylist:
   ....:     i = x.split("_")[1]
   ....:     try:
   ....:         lis[int(i)].append(x)
   ....:     except IndexError:
   ....:         lis.append([])
   ....:         lis[-1].append(x)
   ....:
In [31]: lis
Out[31]: [['sub_0_a', 'sub_0_b'], ['sub_1_a', 'sub_1_b']]

Use itertools' groupby:
from itertools import groupby
def get_field_sub(x): return x.split('_')[1]
mylist = sorted(mylist, key=get_field_sub)
[(x, list(y)) for x, y in groupby(mylist, get_field_sub)]
