Python3 match, reverse match and dedupe - python

The intention of the code below is to process the two lists of ticker dictionaries (d1 and d2) and add each matching 'symbol' value to the pairs list if the value contains cur, but not if the value contains either item in the curpair list.
I'm successful with matching the value against cur, but I can't figure out how to do the reverse match against the items in curpair. A secondary issue is that it seems to create duplicates, likely because of the additional for loop comparing against the items in curpair. Either way, I'm not sure if there's a way to dedupe in-line or if that needs to be a separate routine.
I'm sure there may be a way to do all of this, and simplify the code at the same time, with a list comprehension, but maybe not. Trying to understand list comprehensions has so far only reassured me that my Python experience is far too brief to make sense of them yet :)
Grateful for any insights.
cur = 'EUR'
curpair = ['BUSD', 'USDT']

def get_pairs(tickers):
    pairs = []
    for entry in tickers:
        if cur in entry['symbol']:
            for cp in curpair:
                if cp not in entry['symbol']:
                    pairs.append(entry['symbol'])
    return pairs
# d1 and d2 # https://pastebin.com/NfNAeqD4
spot_pairs_list = get_pairs(d1)
margin_pairs_list = get_pairs(d2)
print(f"from d1: {spot_pairs_list}")
print(f"from d2: {margin_pairs_list}")
Output:
from d1: ['BTCEUR', 'BTCEUR', 'ETHEUR', 'ETHEUR', 'BNBEUR', 'BNBEUR', 'XRPEUR', 'XRPEUR', 'EURBUSD', 'EURUSDT', 'SXPEUR', 'SXPEUR', 'LINKEUR', 'LINKEUR', 'DOTEUR', 'DOTEUR', 'LTCEUR', 'LTCEUR', 'ADAEUR', 'ADAEUR', 'BCHEUR', 'BCHEUR', 'YFIEUR', 'YFIEUR', 'XLMEUR', 'XLMEUR', 'GRTEUR', 'GRTEUR', 'EOSEUR', 'EOSEUR', 'DOGEEUR', 'DOGEEUR', 'EGLDEUR', 'EGLDEUR', 'AVAXEUR', 'AVAXEUR', 'UNIEUR', 'UNIEUR', 'CHZEUR', 'CHZEUR', 'ENJEUR', 'ENJEUR', 'MATICEUR', 'MATICEUR', 'LUNAEUR', 'LUNAEUR', 'THETAEUR', 'THETAEUR', 'BTTEUR', 'BTTEUR', 'HOTEUR', 'HOTEUR', 'WINEUR', 'WINEUR', 'VETEUR', 'VETEUR', 'WRXEUR', 'WRXEUR', 'TRXEUR', 'TRXEUR', 'SHIBEUR', 'SHIBEUR', 'ETCEUR', 'ETCEUR', 'SOLEUR', 'SOLEUR', 'ICPEUR', 'ICPEUR']
from d2: ['ADAEUR', 'ADAEUR', 'BCHEUR', 'BCHEUR', 'BNBEUR', 'BNBEUR', 'BTCEUR', 'BTCEUR', 'DOTEUR', 'DOTEUR', 'ETHEUR', 'ETHEUR', 'EURBUSD', 'EURUSDT', 'LINKEUR', 'LINKEUR', 'LTCEUR', 'LTCEUR', 'SXPEUR', 'SXPEUR', 'XLMEUR', 'XLMEUR', 'XRPEUR', 'XRPEUR', 'YFIEUR', 'YFIEUR']

The problem with duplicate values can easily be solved by using a set instead of a list.
As for the other problem, this loop isn't doing the right thing:
for cp in curpair:
    if cp not in entry['symbol']:
        pairs.append(entry['symbol'])
This will append the symbol to the list if any of the elements in curpair is missing from it. For example, if the first cp is not in the symbol, it's accepted even if the second element is in the symbol. This is also where your duplicates come from: when neither element of curpair is in the symbol, both loop iterations append it. But it seems that you want to include only symbols that contain none of the elements in curpair.
In other words, you only want to append if cp in symbol is False for all cp.
This, indeed, can easily be done with list comprehensions:
def get_pairs(tickers):
    pairs = set()  # set instead of list
    for entry in tickers:
        symbol = entry['symbol']
        if cur in symbol and not any([cp in symbol for cp in curpair]):
            pairs.add(symbol)  # note it's 'add' for sets, not append
    return pairs
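As a quick sanity check, here is a tiny, made-up ticker list (the real d1/d2 are in the pastebin linked in the question) run through the fixed function; with cur and curpair as defined above it should return a single deduplicated entry:
d_small = [{'symbol': 'BTCEUR'}, {'symbol': 'BTCEUR'},
           {'symbol': 'EURBUSD'}, {'symbol': 'ETHUSDT'}]
print(get_pairs(d_small))  # {'BTCEUR'} -- deduplicated; EURBUSD and ETHUSDT are filtered out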
[cp in symbol for cp in curpair] is the same as this (deliberately verbose) loop:
cp_check = []
for cp in curpair:
    if cp in symbol:
        cp_check.append(True)
    else:
        cp_check.append(False)
So you will get a list of True and False values. any() returns True if any of the list elements are True, i.e., it basically does the opposite of what you want. Hence we need to reverse its truth value with not, which will give you True if all of the list elements are False, exactly what we need.
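As a side note, the same check works without building an intermediate list, using a generator expression with any(), or equivalently with all(); a minimal sketch reusing the cur/curpair values from the question and a made-up symbol:
symbol = 'BTCEUR'  # hypothetical symbol, for illustration only
print(not any(cp in symbol for cp in curpair))  # True: no forbidden substring present
print(all(cp not in symbol for cp in curpair))  # True: equivalent positive formulation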

Related

Python Remove Duplicate Dict

I am trying to find a way to remove duplicates from a dict list. I don't have to test the entire object contents because the "name" value in a given object is enough to identify duplication (i.e., duplicate name = duplicate object). My current attempt is this:
newResultArray = []
for i in range(0, len(resultArray)):
    for j in range(0, len(resultArray)):
        if(i != j):
            keyI = resultArray[i]['name']
            keyJ = resultArray[j]['name']
            if(keyI != keyJ):
                newResultArray.append(resultArray[i])
, which is wildly incorrect. Grateful for any suggestions. Thank you.
If name is unique, you should just use a dictionary to store your inner dictionaries, with name being the key. Then you won't even have the issue of duplicates, and you can remove from the list in O(1) time.
Since I don't have access to the code that populates resultArray, I'll simply show how you can convert it into a dictionary in linear time. Although the best option would be to use a dictionary instead of resultArray in the first place, if possible.
new_dictionary = {}
for item in resultArray:
    new_dictionary[item['name']] = item
If you must have a list in the end, then you can convert back into a list as such:
new_list = [v for k,v in new_dictionary.items()]
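Equivalently, since the keys aren't needed, the values view can be turned into a list directly:
new_list = list(new_dictionary.values())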
Since "name" provides uniqueness... and assuming "name" is a hashable object, you can build an intermediate dictionary keyed by "name". Any like-named dicts will simply overwrite their predecessor in the dict, giving you a list of unique dictionaries.
tmpDict = {result["name"]:result for result in resultArray}
newArray = list(tmpDict.values())
del tmpDict
You could shrink that down to
newArray = list({result["name"]:result for result in resultArray}.values())
which may be a bit obscure.
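To illustrate with a small, made-up resultArray (field names here are hypothetical), the one-liner keeps the last dict seen for each name:
resultArray = [{'name': 'a', 'val': 1}, {'name': 'b', 'val': 2}, {'name': 'a', 'val': 3}]
newArray = list({result["name"]: result for result in resultArray}.values())
print(newArray)  # [{'name': 'a', 'val': 3}, {'name': 'b', 'val': 2}]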

Deleting every item in a list corresponding to match in other list

I am showing two of many such cases to explain the problem. Each case has two lists. The first list, nu, contains IDs, which have to be matched against the first element of every tuple in the second list, nu_ew.
If a match is found, I want to delete every occurrence of the tuple with that ID from the second list, nu_ew.
The issue is that I successfully delete all desired tuples in Case-2, but in Case-1 the last occurrence of the tuple ('Na23', 0.0078838), corresponding to the last ID in nu, remains undeleted.
I am looking for any way to get the desired result. Any suggestion is greatly appreciated.
Case-1:
nu=['F19', 'U234', 'U235', 'U238', 'Cl35', 'Cl37', 'Na23']
nu_ew = [('Mg24', 0.070385), ('Mg25', 0.0092824),
('Mg26', 0.0106276), ('F19', 0.42348),
('U234', 1.083506277), ('U235', 0.0014516),
('U238', 0.202605), ('Cl35', 0.0454252),
('Cl37', 0.0153641), ('Na23', 0.047303),
('F19', 0.0521210), ('U234', 3.61168759),
('U235', 0.000483890), ('U238', 0.067535),
('F19', 0.0217170), ('Na23', 0.0078838),
('Cl35', 0.0181700), ('Cl37', 0.0061456)]
Case-2:
nu=['F19', 'U234', 'U235', 'U238']
nu_ew = [('Mg24', 0.068893), ('Mg25', 0.009085),
('Mg26', 0.0104025), ('F19', 0.414511),
('U234', 1.060551431), ('U235', 0.0014209),
('U238', 0.198313), ('Cl35', 0.0444628),
('Cl37', 0.0150386), ('Na23', 0.046301),
('F19', 0.0510167), ('U234', 5.65627430),
('U235', 0.00075782), ('U238', 0.105767),
('F19', 0.034011)]
I tried doing:
for n in nu:
    for ind, id_wf in enumerate(nu_ew):
        if n == id_wf[0]:
            del nu_ew[ind]
print(nu_ew)
I would just use a list comprehension here, something like:
result = [t for t in nu_ew if t[0] not in nu]
For larger lists
nu_as_set = set(nu)
result = [t for t in nu_ew if t[0] not in nu_as_set]
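As a quick check against the Case-1 data above, only the tuples whose ID never appears in nu should survive:
nu_as_set = set(nu)
result = [t for t in nu_ew if t[0] not in nu_as_set]
print(result)  # [('Mg24', 0.070385), ('Mg25', 0.0092824), ('Mg26', 0.0106276)]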
Two points to address.
Don't delete/add an element to a list while iterating over it. Instead, make a new list containing the result. This can be done by a for loop, or by list comprehension.
You don't have to iterate over nu; just use in. If nu is going to be larger, consider using a set, because membership tests with the in operator are faster on sets.
So either of the following will work in place of your loops.
Using a for loop:
result = []
for id_wf in nu_ew:
    if id_wf[0] not in nu:
        result.append(id_wf)
nu_ew = result
print(nu_ew)
Using a list comprehension:
nu_ew = [id_wf for id_wf in nu_ew if id_wf[0] not in nu]
Making a set out of nu is simple, just add
nu = set(nu)
or
nu_set = set(nu)
(If you want to keep the original list)
beforehand.
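Incidentally, this is why ('Na23', 0.0078838) survives in Case-1: del shifts the remaining items one position left while enumerate keeps advancing, so the tuple that slides into the freed slot is never examined. A minimal sketch of the effect on a toy list (not the question's data):
lst = ['a', 'b', 'b', 'c']
for ind, x in enumerate(lst):
    if x == 'b':
        del lst[ind]  # the second 'b' shifts into index 1, which is never revisited
print(lst)  # ['a', 'b', 'c']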

Sorting a list of dict from redis in python

In my current project I generate a list of data; each entry comes from a key in Redis, in a dedicated DB where only one type of key exists.
import operator
import redis

# settings holds the project's Redis connection config
r = redis.StrictRedis(host=settings.REDIS_AD, port=settings.REDIS_PORT, db='14')
item_list = []
keys = r.keys('*')
for key in keys:
    item = r.hgetall(key)
    item_list.append(item)
newlist = sorted(item_list, key=operator.itemgetter('Id'))
The code above lets me retrieve the data and build a list of dicts, each containing the information of one entry. The problem is that I would like to sort them by Id so they come out in order when displayed in my HTML table in the template, but the sorted call doesn't seem to work, since the table isn't sorted.
Any idea why the sorted line doesn't work? I suppose I'm missing something to make it work, but I can't find what.
EDIT :
Thanks to the answer in the comments, the problem was that my 'Id' comes out of Redis as a string and needed to be cast to int to be sorted:
key=lambda d: int(d['Id'])
All values returned from redis are apparently strings and strings do not sort numerically ("10" < "2" == True).
Therefore you need to cast it to a numerical value, probably to int (since they seem to be IDs):
newlist = sorted(item_list, key=lambda d: int(d['Id']))
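A quick illustration of why the cast matters, with made-up Id values:
items = [{'Id': '10'}, {'Id': '2'}, {'Id': '1'}]
print(sorted(items, key=lambda d: d['Id']))       # [{'Id': '1'}, {'Id': '10'}, {'Id': '2'}] -- lexicographic
print(sorted(items, key=lambda d: int(d['Id'])))  # [{'Id': '1'}, {'Id': '2'}, {'Id': '10'}] -- numeric
Note that with redis-py on Python 3 the hash fields and values come back as bytes unless the client is created with decode_responses=True.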

Remove duplicate data from an array in python

I have this array of data
data = [20001202.05, 20001202.05, 20001202.50, 20001215.75, 20021215.75]
I remove the duplicate data with list(set(data)), which gives me
data = [20001202.05, 20001202.50, 20001215.75, 20021215.75]
But I would like to remove the duplicate data, based on the numbers before the "period"; for instance, if there is 20001202.05 and 20001202.50, I want to keep one of them in my array.
As you don't care about the order of the items you keep, you could do:
>>> {int(d):d for d in data}.values()
[20001202.5, 20021215.75, 20001215.75]
If you would like to keep the lowest item, I can't think of a one-liner.
Here is a basic example for anybody who would like to add a condition on the key or value to keep.
seen = set()
result = []
for item in sorted(data):
    key = int(item)  # or whatever condition
    if key not in seen:
        result.append(item)
        seen.add(key)
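For completeness, a minimal run of that loop on the question's data; the lowest value per integer part is kept:
data = [20001202.05, 20001202.05, 20001202.50, 20001215.75, 20021215.75]
seen = set()
result = []
for item in sorted(data):
    key = int(item)
    if key not in seen:
        result.append(item)
        seen.add(key)
print(result)  # [20001202.05, 20001215.75, 20021215.75]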
Generically, with python 3.7+, because dictionaries maintain order, you can do this, even when order matters:
data = {d:None for d in data}.keys()
However for OP's original problem, OP wants to de-dup based on the integer value, not the raw number, so see the top voted answer. But generically, this will work to remove true duplicates.
data1 = [20001202.05, 20001202.05, 20001202.50, 20001215.75, 20021215.75]
ls = []
for i in data1:
    if i not in ls:
        ls.append(i)
print(ls)

Regular expressions matching words which contain the pattern but also the pattern plus something else

I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to numerical values which I need to obtain and then change. The problem is that 'xyz2' contains 'xyz' and therefore also matches the regular expression.
My code so far (where 'data' is a Python dictionary and 'specie_name_and_initial_values' is a list of lists, where each sublist contains two elements: the first is the species name and the second is the numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
    if all_keys[i] != 'Time':
        #print all_keys[i]
        pattern = re.compile(all_keys[i])
        for j in range(len(specie_name_and_initial_values)):
            print re.findall(pattern, specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify: my current code is below. It's used within a class/method-like structure.
def calculate_relative_data_based_on_initial_values(self, copasi_file, xlsx_data_file, data_type='fold_change', time='seconds'):
    copasi_tool = MineParamEstTools()
    data = pandas.io.excel.read_excel(xlsx_data_file, header=0)
    # uses custom class and method to get the list of lists from a file
    specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
    if time == 'minutes':
        data['Time'] = data['Time'] * 60
    elif time == 'hour':
        data['Time'] = data['Time'] * 3600
    elif time == 'seconds':
        print 'Time is already in seconds.'
    else:
        print 'Not a valid time unit'
    all_keys = list(data.keys())
    species = []
    for i in range(len(specie_name_and_initial_values)):
        species.append(specie_name_and_initial_values[i][0])
    for i in range(len(all_keys)):
        for j in range(len(specie_name_and_initial_values)):
            if all_keys[i] in species[j]:
                print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely over complicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also if you wanted to compare 'xyz' to 'xyz2' you would use == not in and then it would correctly return False.
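A tiny sketch of the difference, including the anchored-regex variant the question already tried:
import re

print('xyz' in 'xyz2')                   # True  -- substring containment
print('xyz' == 'xyz2')                   # False -- exact comparison
print(bool(re.match(r'^xyz$', 'xyz2')))  # False -- anchored pattern rejects it too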
You can also rewrite your own code a lot more succinctly:
for key in data:
    if key != 'Time':
        pattern = re.compile(key)
        for name, _ in specie_name_and_initial_values:
            print re.findall(pattern, name)
Based on your edit, you have somehow managed to turn the lists into strings; one option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error as lists are mutable so are not hashable.
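For reference, a minimal illustration of that point:
set([('Cyp26_G_R1',)])  # fine: tuples are hashable
set([['Cyp26_G_R1']])   # raises TypeError: unhashable type: 'list'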
