Extract field using Python - python

I am looking for help extracting one specific field.
Here is an example. I cannot split the line and pick the field by number, because the field number may change when the content changes.
Example 1
[["cn","Phone",1,"","LI(\"\")","0","19%","",""],["OS_DisplayName","Display Name",1,"","LI(\"\")","1,0","19%","",""],["OS_ProductPackage","Product Package",1,"","CO(\"\",\"REQ;1_BASIC!OS!TRV;2_Messaging!OS!OEM;3_Extended!OS!EAC;4_Enhanced!OS!APO;5_Analog Port!OS!CCA;6_Contact Center Agent\",\"\",\";\",\"\",\"\")","2,0","19%","",""],["sn","Last name",1,"","LI(\"\")","3,0","12%","",""],["givenName","First name",1,"","LI(\"\")","4,0","12%","",""],["OS_SiteCode","Site Code",1,"","LI(\"\")","5,0","19%","",""]],[["917845678923","Backup","OEM","917845678923","","CNdd_RD_91784567","","cn=917845678923,cn=Subscribers,cn=np_CNdd_RnD_WangJing,cn=IPC_APAC_1_01,cn=DN,cn=Resources,cn=Users,cn=OS"]],
Output should be
cn=917845678923,cn=Subscribers,cn=np_CNdd_RnD_WangJing,cn=IPC_APAC_1_01,cn=DN,cn=Resources,cn=Users,cn=OS
Example 2
[["cn","Phone",1,"","LI(\"\")","0","19%","",""],["OS_DisplayName","Display Name",1,"","LI(\"\")","1,0","19%","",""],["OS_ProductPackage","Product Package",1,"","CO(\"\",\"REQ;1_BASIC!OS!TRV;2_Messaging!OS!OEM;3_Extended!OS!EAC;4_Enhanced!OS!APO;5_Analog Port!OS!CCA;6_Contact Center Agent\",\"\",\";\",\"\",\"\")","2,0","19%","",""],["sn","Last name",1,"","LI(\"\")","3,0","12%","",""],["givenName","First name",1,"","LI(\"\")","4,0","12%","",""],["OS_SiteCode","Site Code",1,"","LI(\"\")","5,0","19%","",""]],[["868694755000","Yaeng Danning","EAC","Yaeng","Dainning","CNdd_DT_86869475","","cn=868694755000,cn=Subscribers,cn=np_CNdd_DN,cn=IPC_APAC_1_01,cn=DN,cn=Resources,cn=Users,cn=OS"]],
Output should be
cn=868694755000,cn=Subscribers,cn=np_CNdd_DN,cn=IPC_APAC_1_01,cn=DN,cn=Resources,cn=Users,cn=OS
Can someone help me with this?
I tried the code below, but I cannot use a constant field number (e[8]) because the field number changes.
e = line3.split('","","')
print "e"
print e
e = e[8].replace('"]],','').replace('","','').strip()
print "e:" ,e

You could flatten the list and then search through it.
myList = (['one', 'two', ['cn=blahblah', 4, [5],['hi']], [6, [[[7, 'hello']]]]])
def flatten(container):
    for i in container:
        if isinstance(i, (list,tuple)):
            for j in flatten(i):
                yield j
        else:
            yield i
flattenedList = list(flatten(myList))
for x in flattenedList:
    if str(x).startswith('cn='):
        print(x)

If you are guaranteed the cn field is the very last, you could do something like:
cnFields = array [-1][-1]
and then parse it as you see fit.
Otherwise, you'll need to iterate through the 2d array until you find a string that starts with cn=.
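If the line is valid JSON once it is wrapped in brackets (this appears to hold for the two samples in the question, but it is an assumption about the real data), one option is to parse it and search the nested lists instead of splitting on delimiters. A minimal sketch; the name find_cn and the shortened sample line are made up for illustration:

```python
import json

def find_cn(line):
    """Return a string starting with 'cn=' found in the parsed line, or None."""
    # Assumption: the line is a comma-separated run of JSON arrays with a trailing comma.
    data = json.loads("[" + line.strip().rstrip(",") + "]")
    stack = [data]
    while stack:
        item = stack.pop()
        if isinstance(item, list):
            stack.extend(item)  # descend into nested lists
        elif isinstance(item, str) and item.startswith("cn="):
            return item
    return None

# Shortened version of Example 1 (raw string so the \" escapes stay intact).
sample = r'[["cn","Phone",1,"","LI(\"\")","0","19%","",""]],[["917845678923","Backup","OEM","cn=917845678923,cn=Subscribers,cn=OS"]],'
print(find_cn(sample))
```

Because the search walks every nested list, it does not depend on the field position. The header entry "cn" is not matched, since it does not start with "cn=".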

Related

Adding text in front of separate numbers in list

I'm new to Python and hit a wall with the last print in my program.
I have a list of integers that looks like this when printed:
[0, 0, 0, 0] #just with random numbers from 1 - 1000
I want to add text in front of every random number in the list and print it out like this:
[Paul 0 Frederick 0 Ape 0 Ida 0]
Any help would be appreciated. Thanks!
Sounds like you want to make a dictionary. You could type:
import random
d = dict()
d["Paul"] = random.randint(1,100)
....
print(d)
#output: {"Paul":1, "Frederick":50, "Ape":25, "Ida":32}
Alternatively, nothing stops you from mixing strings and integers in the same list: Python is dynamically typed, so a list can hold values of different types.
If you have a list of numbers [45,5,59,253] and you want to attach names to them, you probably need a loop.
nums = [45,5,59,253]
names = ["Paul", "Frederick", "Ape", "Ida"]
d = dict()
i = 0
for n in nums:
    d[names[i]] = n
    i+=1
or, if you want a list:
nums = [45,5,59,253]
names = ["Paul", "Frederick", "Ape", "Ida"]
# named "flat" rather than "list" to avoid shadowing the built-in list type
flat = [x for y in zip(names, nums) for x in y]
You'd have to turn your random integers into strings and add them to the text (string) you want.
Example:
from random import randint
lst=[]
x = str(randint(0,1000))
text = 'Alex'
final_text = text+' '+x
lst.append(final_text)
I added the space as in your example. Accessing the numbers later will be a little more complex if you do it this way.
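To get output shaped like the one in the question, one possibility is to pair the names with the numbers using zip and join the pairs into a single string. This is only a sketch: label_numbers is a hypothetical helper name, and the bracketed format is taken from the question's example.

```python
import random

names = ["Paul", "Frederick", "Ape", "Ida"]

def label_numbers(names, nums):
    """Build a string such as '[Paul 0 Ida 7]' from two parallel lists."""
    pairs = ("%s %d" % (name, num) for name, num in zip(names, nums))
    return "[" + " ".join(pairs) + "]"

# Random numbers from 1 to 1000, one per name.
nums = [random.randint(1, 1000) for _ in names]
print(label_numbers(names, nums))
```

The numbers stay in their own list, so they remain easy to use for later calculations.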

Python insertion sorting a csv by row

My objective is to use an insertion sort to sort the contents of a CSV file by the numbers in the first column. For example, I want to take this:
[[7831703, Christian, Schmidt]
[2299817, Amber, Cohen]
[1964394, Gregory, Hanson]
[1984288, Aaron, White]
[9713285, Alexander, Kirk]
[7025528, Janice, Lee]
[6441979, Sarah, Browning]
[8815776, Rick, Wallace]
[2395480, Martin, Weinstein]
[1927432, Stephen, Morrison]]
and sort it to:
[[1927432, Stephen, Morrison]
[1964394, Gregory, Hanson]
[1984288, Aaron, White]
[2299817, Amber, Cohen]
[2395480, Martin, Weinstein]
[6441979, Sarah, Browning]
[7025528, Janice, Lee]
[7831703, Christian, Schmidt]
[8815776, Rick, Wallace]
[9713285, Alexander, Kirk]]
based on the numbers in the first column. My current Python code looks like this:
import csv
with open('EmployeeList.csv', newline='') as File:
    reader = csv.reader(File)
    readList = list(reader)
    for row in reader:
        print(row)
def insertionSort(readList):
    #Traverse through 1 to the len of the list
    for row in range(len(readList)):
        # Traverse through 1 to len(arr)
        for i in range(1, len(readList[row])):
            key = readList[row][i]
            # Move elements of arr[0..i-1], that are
            # greater than key, to one position ahead
            # of their current position
            j = i-1
            while j >=0 and key < readList[row][j] :
                readList[row] = readList[row]
                j -= 1
            readList[row] = key
insertionSort(readList)
print ("Sorted array is:")
for i in range(len(readList)):
    print ( readList[i])
The code can already sort the contents of a 2D array, but as written it tries to sort everything.
I think it would work if I got rid of the [], but in testing that hasn't given me what I need.
To clarify, I want to sort the rows by the numerical value in the first column.
Sorry if I didn't understand your need correctly, but you have a list and you need to sort it? Why don't you just use the list's sort method?
>>> data = [[7831703, "Christian", "Schmidt"],
... [2299817, "Amber", "Cohen"],
... [1964394, "Gregory", "Hanson"],
... [1984288, "Aaron", "White"],
... [9713285, "Alexander", "Kirk"],
... [7025528, "Janice", "Lee"],
... [6441979, "Sarah", "Browning"],
... [8815776, "Rick", "Wallace"],
... [2395480, "Martin", "Weinstein"],
... [1927432, "Stephen", "Morrison"]]
>>> data.sort()
>>> from pprint import pprint
>>> pprint(data)
[[1927432, 'Stephen', 'Morrison'],
[1964394, 'Gregory', 'Hanson'],
[1984288, 'Aaron', 'White'],
[2299817, 'Amber', 'Cohen'],
[2395480, 'Martin', 'Weinstein'],
[6441979, 'Sarah', 'Browning'],
[7025528, 'Janice', 'Lee'],
[7831703, 'Christian', 'Schmidt'],
[8815776, 'Rick', 'Wallace'],
[9713285, 'Alexander', 'Kirk']]
>>>
Note that here the first element of each row is an integer. This matters if you want to sort by numerical value (so that 99 comes before 100); csv.reader returns every field as a string, so convert the first column with int() first.
And don't be confused by the pprint import. You don't need it to sort; I just used it to get nicer output in the console.
Also note that list.sort() is an in-place method: it doesn't return a sorted list but sorts the list itself.
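To illustrate the point about numeric versus string comparison, here is a small sketch that reads CSV text and sorts the rows with a key function. The in-memory csv_text is invented for the example (the real data would come from the file), and it assumes every first column holds a valid integer.

```python
import csv
import io

# Invented sample data standing in for the CSV file.
csv_text = "7831703,Christian,Schmidt\n99,Amber,Cohen\n100,Gregory,Hanson\n"
rows = list(csv.reader(io.StringIO(csv_text)))

# Plain sorting compares the first column as text, so "100" sorts before "99".
as_text = sorted(rows)

# Converting the first column with int() gives numerical order.
as_numbers = sorted(rows, key=lambda row: int(row[0]))

print([row[0] for row in as_text])
print([row[0] for row in as_numbers])
```

sorted() returns a new list and leaves rows unchanged; rows.sort(key=...) would sort in place instead.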
*** EDIT ***
Here are two different approaches to a sort function. Both could be heavily optimized, but I hope they give you some ideas about how this can be done. Both should work, and you can add some print calls in the loops to see what happens.
First, a recursive version. It orders the list a little more on every run until it is fully ordered.
def recursiveSort(readList):
    # You don't want to modify the original data, so we work on a copy of it
    data = readList.copy()
    changed = False
    res = []
    while len(data): # while 1 should work here as well because eventually we break the loop
        if len(data) == 1:
            # There is only one element left. Let's add it to the end of our result.
            res.append(data[0])
            break
        if data[0][0] > data[1][0]:
            # We compare the first two elements in the list.
            # If the first one is bigger, we remove the second element from the list and append it to the result.
            # Then we raise the changed flag to record that we changed the order of the original list.
            res.append(data.pop(1))
            changed = True
        else:
            # Otherwise we remove the first element from the list and append it to the result.
            res.append(data.pop(0))
    if not changed:
        # If no changes have been made, the list is in order
        return res
    else:
        # If we made changes, we sort the list one more time.
        return recursiveSort(res)
And here is an iterative version, closer to your original function.
def iterativeSort(readList):
    res = []
    for i in range(len(readList)):
        print (res)
        #loop through the original list
        if len(res) == 0:
            # if we don't have any items in our result list, we add first element here.
            res.append(readList[i])
        else:
            done = False
            for j in range(len(res)):
                #loop through the result list this far
                if res[j][0] > readList[i][0]:
                    #if our item in list is smaller than element in res list, we insert it here
                    res.insert(j, readList[i])
                    done = True
                    break
            if not done:
                #if our item in list is bigger than all the items in result list, we put it last.
                res.append(readList[i])
    print(res)
    return res
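Since the question asked specifically for an insertion sort over rows, here is a sketch of the textbook in-place version, shifting whole rows rather than the values inside a row. The name insertion_sort_rows is hypothetical, and the code assumes the first column of every row can be converted with int().

```python
def insertion_sort_rows(rows):
    """Sort rows in place by the integer value of their first column."""
    for i in range(1, len(rows)):
        current = rows[i]
        key = int(current[0])
        j = i - 1
        # Shift rows with a larger key one position to the right.
        while j >= 0 and int(rows[j][0]) > key:
            rows[j + 1] = rows[j]
            j -= 1
        # Drop the current row into the gap that was opened.
        rows[j + 1] = current
    return rows

employees = [["7831703", "Christian", "Schmidt"],
             ["2299817", "Amber", "Cohen"],
             ["1964394", "Gregory", "Hanson"]]
for row in insertion_sort_rows(employees):
    print(row)
```

The comparison uses only the first column, so the remaining fields of each row travel with it untouched.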

Forming the sublist using same indices

Hi, I am trying to solve a problem where I have to return, as sublists, the indices of records that belong to the same person. By "same person" I mean records that share the same username, phone or email (any one of them).
I understand that these identities are mostly unique, but for the sake of the question let's assume they can repeat.
eg.
data = [("username1","phone_number1", "email1"),
("usernameX","phone_number1", "emailX"),
("usernameZ","phone_numberZ", "email1Z"),
("usernameY","phone_numberY", "emailX"),
("username2","phone_number2", "emailX")]
Expected output:
[[0,1,3,4],[2]]
Explanation: 0 and 1 have the same phone, and 1, 3 and 4 have the same email, so they all fall into one category. Index 2 falls into the other category.
My approach so far is:
data = [("username1","phone_number1", "email1"),
        ("usernameX","phone_number1", "emailX"),
        ("usernameZ","phone_numberZ", "email1Z"),
        ("usernameY","phone_numberY", "emailX"),
       ]
def match(t1,t2):
    if(t1[0] == t2[0] or t1[1] == t2[1] or t1[2] == t2[2]):
        return True
    else:
        return False
# print(match(data[1],data[3]))
together = []
for i in range(len(data)):
    temp = {i}
    for j in range(len(data)):
        if(match(data[i],data[j])):
            temp.add(j)
    together.append(temp)
for i in range(len(data)):
    ans = together[i]
    for j in range(i+1,len(data)):
        if(bool(ans.intersection(together[j]))):
            ans = ans.union(together[j])
    print(ans)
I am not able to reach the desired result.
Any help is appreciated. Thank you.
A first solution is similar to yours, with some enhancements:
Using any for the match, so that it does not need to know the number of items inside the tuples.
Checking whether a user is already identified as part of "together", to skip useless comparisons.
Here it is:
together = set()
for user_idx, user in enumerate(data):
    if user_idx in together:
        continue # That user is already matched
    # No need to check with previous users
    for other_idx, other in enumerate(data[user_idx + 1 :], user_idx + 1):
        # Match
        if any(val_ref == val_other for val_ref, val_other in zip(user, other)):
            together.update((user_idx, other_idx))
isolated = set(range(len(data))) ^ together
Another solution goes through a numpy array to identify isolated users. With numpy it is easy to compare a user to every other user (that is, to the original array). An isolated user matches only itself, exactly once on each of its fields, so summing the match counts along the fields gives, for an isolated user, the length of the tuple of fields.
import numpy as np
data = np.array(data)
# For each user, match it with the whole matrix
matches = sum(user == data for user in data)
# Isolated users only match with themselves, hence only have 1s on their line
isolated = set(np.where(np.sum(matches, axis=1) == data.shape[1])[0])
# Together are the other users
together = set(range(len(data))) ^ set(isolated)
See the matches array for a better understanding:
[[1 2 1]
[1 2 3]
[1 1 1]
[1 1 3]
[1 1 3]]
However, it does not use any of the optimisations mentioned before.
Still, numpy is fast, so it should be OK.
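Both approaches above separate matched users from isolated ones, whereas the expected output in the question groups indices transitively (0 is linked to 3 only through 1). One common technique for that kind of grouping is a union-find (disjoint-set) structure. The sketch below is a swapped-in technique, not taken from the answers above; group_indices is a hypothetical name, and it assumes values only count as a match when they appear in the same field position.

```python
def group_indices(records):
    """Group record indices that share a value in the same field, transitively."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (field position, value) -> first index holding that value
    for idx, record in enumerate(records):
        for field, value in enumerate(record):
            key = (field, value)
            if key in first_seen:
                # Merge this record's group with the group of the earlier record.
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups = {}
    for idx in range(len(records)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())

data = [("username1", "phone_number1", "email1"),
        ("usernameX", "phone_number1", "emailX"),
        ("usernameZ", "phone_numberZ", "email1Z"),
        ("usernameY", "phone_numberY", "emailX"),
        ("username2", "phone_number2", "emailX")]
print(group_indices(data))
```

Each record is visited once and each field value is looked up in a dictionary, so no pairwise comparison of all users is needed.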

List index out of range while creating jagged array

All, I am trying to create a jagged list in Python 3.x. Specifically, I am pulling a number of elements from a list of webpages using Selenium. Each row of my jagged list ("matrix") represents the contents of one of these said webpages. Each of these rows should have as many columns as there are elements pulled from its respective webpage - this number will vary from page to page.
e.g.
webpage1 has 3 elements: a,b,c
webpage2 has 6 elements: d,e,f,g,h,i
webpage3 has 4 elements: j,k,l,m
...
would look like:
[[a,b,c],
[d,e,f,g,h,i],
[j,k,l,m],...]
Here's my code, thus far:
from selenium import webdriver
chromePath = "/Users/me/Documents/2018/chromedriver"
browser = webdriver.Chrome(chromePath)
url = 'https://us.testcompany.com/eng-us/women/handbags/_/N-r4xtxc/to-1'
browser.get(url)
hrefLinkArray = []
hrefElements = browser.find_elements_by_class_name("product-item")
for eachOne in hrefElements:
    hrefLinkArray.append(eachOne.get_attribute('href'))
pics = [[]]
for y in range(0, len(hrefLinkArray)): # or type in "range(0, 1)" to debug
    browser.get(hrefLinkArray[y])
    productViews = browser.find_elements_by_xpath("// *[ @id = 'lightSlider'] / li")
    b = -1
    for a in productViews:
        b = b + 1
        # print(y) for debugging
        # print(b) for debugging
        pics[y][b] = a.get_attribute('src') # <------------ ERROR!
        # pics[y][b].append(a.get_attribute('src') GIVES SAME ERROR AS ABOVE
    del productViews[:]
browser.quit()
Whenever I run this, I get an error on the first iteration of the a in productViews loop:
line 64, in <module>
pics[y][b] = a.get_attribute('src')
IndexError: list assignment index out of range
From what I can tell, the integer references are correct (see my debugging lines in the for a in productViews loop), so pics[0][0] is a proper way to reference the jagged list. That said, I have a feeling pics[0][0] does not yet exist? Or maybe only pics[0] does? I've seen similar posts about this error, but the only solution I've understood seems to be using .append(), and even then only on a 1D list. As you can see in my code, I've used .append() successfully for hrefLinkArray, whereas it appears unsuccessful on line 64/65. I'm stumped as to why this might be.
Please let me know:
Why my .append() and [][]=... lines are throwing this error.
If there is a more efficient way to accomplish my goal, I'd like to learn!
UPDATE: using @User4343502's answer, in conjunction with @StephenRauch's input, the error is resolved and I am now getting the intended-sized jagged list! My amended code is:
listOfLists = []
for y in range(0, len(hrefLinkArray)):
    browser.get(hrefLinkArray[y])
    productViews = browser.find_elements_by_xpath("// *[ @id = 'lightSlider'] / li")
    otherList = []
    for other in productViews:
        otherList.append(other.get_attribute('src'))
    # print(otherList)
    listOfLists.append(otherList)
    del otherList[:]
    del productViews[:]
print(listOfLists)
Note, this code prints a jagged list of totally empty indices e.g. [[][],[][][][],[],[][][],[][],[][][][][]...], but that is a separate issue - I believe related to my productViews object and how it retrieves by xpath... What's important, though, is that my original question was answered. Thanks!
list.append will add an element into a list. This works regardless of what the element is.
a = [1, 2, 3]
b = [float, {}]
c = [[[None]]]
## We will append to this empty list
list_of_lists = []
for x in (a, b, c):
    list_of_lists.append(x)
## Prints: [[1, 2, 3], [<class 'float'>, {}], [[[None]]]] (on Python 3)
print(list_of_lists)
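Two list behaviours seem relevant to this thread, and the sketch below shows them without Selenium, using a plain nested list (pages, an invented stand-in for the scraped pages). First, assigning to an index that does not exist yet raises IndexError, so a row has to be grown with append. Second, append stores a reference to the list rather than a copy, so clearing that list afterwards with del row[:] also empties the stored row; this may explain the empty sublists mentioned in the update, although that is a guess about the real run.

```python
pages = [["a", "b", "c"], ["d", "e", "f", "g", "h", "i"], ["j", "k", "l", "m"]]

# 1) pics[0] exists but is empty, so there is no index 0 to assign to yet.
index_error_raised = False
pics = [[]]
try:
    pics[0][0] = "x"
except IndexError:
    index_error_raised = True

# 2) Build one fresh row per page and append it to the outer list.
def collect(pages):
    result = []
    for page in pages:
        row = []
        for element in page:
            row.append(element)
        result.append(row)
    return result

# 3) Clearing the row after appending empties the stored row too,
#    because result holds a reference to the same list object.
def collect_then_clear(pages):
    result = []
    for page in pages:
        row = list(page)
        result.append(row)
        del row[:]
    return result

print(index_error_raised)
print(collect(pages))
print(collect_then_clear(pages))
```

Creating a new list inside the loop for every page, and not clearing it afterwards, keeps each row independent.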

translate my sequence?

I have to write a script to translate this sequence:
dict = {"TTT":"F|Phe","TTC":"F|Phe","TTA":"L|Leu","TTG":"L|Leu","TCT":"S|Ser","TCC":"S|Ser",
"TCA":"S|Ser","TCG":"S|Ser", "TAT":"Y|Tyr","TAC":"Y|Tyr","TAA":"*|Stp","TAG":"*|Stp",
"TGT":"C|Cys","TGC":"C|Cys","TGA":"*|Stp","TGG":"W|Trp", "CTT":"L|Leu","CTC":"L|Leu",
"CTA":"L|Leu","CTG":"L|Leu","CCT":"P|Pro","CCC":"P|Pro","CCA":"P|Pro","CCG":"P|Pro",
"CAT":"H|His","CAC":"H|His","CAA":"Q|Gln","CAG":"Q|Gln","CGT":"R|Arg","CGC":"R|Arg",
"CGA":"R|Arg","CGG":"R|Arg", "ATT":"I|Ile","ATC":"I|Ile","ATA":"I|Ile","ATG":"M|Met",
"ACT":"T|Thr","ACC":"T|Thr","ACA":"T|Thr","ACG":"T|Thr", "AAT":"N|Asn","AAC":"N|Asn",
"AAA":"K|Lys","AAG":"K|Lys","AGT":"S|Ser","AGC":"S|Ser","AGA":"R|Arg","AGG":"R|Arg",
"GTT":"V|Val","GTC":"V|Val","GTA":"V|Val","GTG":"V|Val","GCT":"A|Ala","GCC":"A|Ala",
"GCA":"A|Ala","GCG":"A|Ala", "GAT":"D|Asp","GAC":"D|Asp","GAA":"E|Glu",
"GAG":"E|Glu","GGT":"G|Gly","GGC":"G|Gly","GGA":"G|Gly","GGG":"G|Gly"}
seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"
a=""
for y in range( 0, len ( seq)):
    c=(seq[y:y+3])
    #print(c)
    for k, v in dict.items():
        if seq[y:y+3] == k:
            alle_amino = v[::3] # all amino acids in a row: a1.1, a2.1, a3.1, a1.2, etc.
            print (v)
With this script I get the amino acids from the three frames printed under each other. How can I sort this to get all the amino acids from frame 1 next to each other, all those from frame 2 next to each other, and the same for frame 3?
For example, my results must be:
+3 SerIleLeuAlaStpProLysTrpGluProProTyrValAlaStpProIleTyrIleTyrIle
+2 PheAsnThrSerMetThrLysValGlyThrProLeuArgSerMetThrHisIleTyrIleTyr
+1 PheGlnTyrStpHisAspGlnSerGlyAsnProLeuThrStpHisAspProTyrIleTyrIle
TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA
I use Python 3.
I have one more question: can I get these results by making some changes to my own script?
You can use the following (note that this would be much easier with Biopython's translate method):
dicti = {your dictionary here}
def translate(seq):
    x = 0
    aaseq = []
    while True:
        try:
            aaseq.append(dicti[seq[x:x+3]])
            x += 3
        except (IndexError, KeyError):
            # An incomplete or unknown triplet at the end of the sequence stops the translation
            break
    return aaseq
seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"
for frame in range(3):
    print('+%i' %(frame+1), ''.join(item.split('|')[1] for item in translate(seq[frame:])))
Note that I renamed your dictionary to dicti (so as not to overwrite dict).
Some comments to help you understand:
translate takes your sequence and returns a list in which each item is the amino acid translation of the triplet at that position. Like:
aaseq = ["L|Leu","L|Leu","P|Pro", ....]
You could process this data further inside translate (to get only the one- or three-letter code), or return it as it is to be processed later, as I have done.
translate is called in
''.join(item.split('|')[1] for item in translate(seq[frame:]))
for each frame. For frame values 0, 1 and 2 it passes seq[frame:] to translate. That is, you are sending the sequences for the three different reading frames and processing them one after another. Then, in
item.split('|')[1]
I split the one- and three-letter codes for each amino acid and take the one at index 1 (the second). They are then joined into a single string.
Not too pretty, but does what you want
dct = {"TTT":"F|Phe","TTC":"F|Phe","TTA":"L|Leu","TTG":"L|Leu","TCT":"S|Ser","TCC":"S|Ser",
"TCA":"S|Ser","TCG":"S|Ser", "TAT":"Y|Tyr","TAC":"Y|Tyr","TAA":"*|Stp","TAG":"*|Stp",
"TGT":"C|Cys","TGC":"C|Cys","TGA":"*|Stp","TGG":"W|Trp", "CTT":"L|Leu","CTC":"L|Leu",
"CTA":"L|Leu","CTG":"L|Leu","CCT":"P|Pro","CCC":"P|Pro","CCA":"P|Pro","CCG":"P|Pro",
"CAT":"H|His","CAC":"H|His","CAA":"Q|Gln","CAG":"Q|Gln","CGT":"R|Arg","CGC":"R|Arg",
"CGA":"R|Arg","CGG":"R|Arg", "ATT":"I|Ile","ATC":"I|Ile","ATA":"I|Ile","ATG":"M|Met",
"ACT":"T|Thr","ACC":"T|Thr","ACA":"T|Thr","ACG":"T|Thr", "AAT":"N|Asn","AAC":"N|Asn",
"AAA":"K|Lys","AAG":"K|Lys","AGT":"S|Ser","AGC":"S|Ser","AGA":"R|Arg","AGG":"R|Arg",
"GTT":"V|Val","GTC":"V|Val","GTA":"V|Val","GTG":"V|Val","GCT":"A|Ala","GCC":"A|Ala",
"GCA":"A|Ala","GCG":"A|Ala", "GAT":"D|Asp","GAC":"D|Asp","GAA":"E|Glu",
"GAG":"E|Glu","GGT":"G|Gly","GGC":"G|Gly","GGA":"G|Gly","GGG":"G|Gly"}
seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"
def get_amino_list(s):
    for y in range(3):
        yield [s[x:x+3] for x in range(y, len(s) - 2, 3)]
for n, amn in enumerate(get_amino_list(seq), 1):
    print ("+%d " % n + "".join(dct[x][2:] for x in amn))
print(seq)
Here's my solution. I've called your "dict" variable "aminos". The function method3 returns a list of the values to the right of the "|". To merge them into a single string, just join them on "".
From looking at your code, I believe that your aminos dict contains all possible three-letter combinations. Therefore, I've removed the checks that verify this. It should run a lot faster as a result.
def codon_groups(seq, group_len=3):
    """Yields consecutive, non-overlapping groups of `group_len` items from a sequence
    """
    # Step by group_len so that each codon is read once; an incomplete trailing group is skipped
    for i in range(0, len(seq) - group_len + 1, group_len):
        yield seq[i:i+group_len]
def method3(seq, aminos):
    return [aminos[k][2:] for k in codon_groups(seq, 3)]
for i in range(3):
    print("%d: %s" % (i, "".join(method3(seq[i:], aminos))))
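On the follow-up question of whether the original single loop can be adapted: one possibility is to keep walking the sequence one base at a time and use y % 3 to decide which frame each codon belongs to. The sketch below is only an illustration. To stay short it rebuilds a codon table in the question's "X|Xxx" format from the standard genetic code listed in TCAG order; the names codon_table and frames_from_single_pass are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons ordered TTT, TTC, TTA, TTG, TCT, ...
ONE_LETTER = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
THREE_LETTER = {"F": "Phe", "L": "Leu", "S": "Ser", "Y": "Tyr", "*": "Stp",
                "C": "Cys", "W": "Trp", "P": "Pro", "H": "His", "Q": "Gln",
                "R": "Arg", "I": "Ile", "M": "Met", "T": "Thr", "N": "Asn",
                "K": "Lys", "V": "Val", "A": "Ala", "D": "Asp", "E": "Glu",
                "G": "Gly"}
# Values use the same "X|Xxx" layout as the dictionary in the question.
codon_table = {"".join(codon): "%s|%s" % (aa, THREE_LETTER[aa])
               for codon, aa in zip(product(BASES, repeat=3), ONE_LETTER)}

def frames_from_single_pass(seq, table):
    """Translate all three forward frames while stepping through seq one base at a time."""
    frames = ["", "", ""]
    for y in range(len(seq) - 2):  # stop where a full triplet no longer fits
        codon = seq[y:y + 3]
        # Codons starting at positions 0, 3, 6, ... belong to frame 1, and so on.
        frames[y % 3] += table[codon].split("|")[1]
    return frames

seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"
for number, frame in enumerate(frames_from_single_pass(seq, codon_table), 1):
    print("+%d %s" % (number, frame))
print(seq)
```

Looking the codon up directly with table[codon] replaces the inner loop over dict.items() in the original script.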
