python: getting rid of values from a list

drug_input = ['MORPHINE', 'CODEINE']

def some_function(drug_input):
    generic_drugs_mapping = {'MORPHINE': 0,
                             'something': 1,
                             'OXYCODONE': 2,
                             'OXYMORPHONE': 3,
                             'METHADONE': 4,
                             'BUPRENORPHINE': 5,
                             'HYDROMORPHONE': 6,
                             'CODEINE': 7,
                             'HYDROCODONE': 8}
row is a list.
I would like to set every member of row[..] to '' EXCEPT the members at the indexes that drug_input maps to; in this case those are 0 and 7.
So row[1], row[2], row[3], row[4], row[5], row[6], and row[8] should all become ''.
If row is initially:
row[0]='blah'
row[1]='bla1'
...
...
row[8]='bla8'
I need:
row[0]='blah' (same as before)
row[1]=''
row[2]=''
row[3]=''
...
...
row[7]='bla7'
row[8]=''
How do I do this?

You could first create a set of all the indexes that should be kept, and then set all the other ones to '':
keep = set(generic_drugs_mapping[drug] for drug in drug_input)
for i in range(len(row)):
    if i not in keep:
        row[i] = ''
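An equivalent variation (not from the answer, just a sketch) builds a new list with a comprehension instead of mutating in place:
keep = {generic_drugs_mapping[drug] for drug in drug_input}
row = [value if i in keep else '' for i, value in enumerate(row)]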

I'd set up a defaultdict unless you really need it to be a list:
from collections import defaultdict  # put this at the top of the file

class EmptyStringDict(defaultdict):
    __missing__ = lambda self, key: ''

newrow = EmptyStringDict()
for drug in drug_input:
    keep = generic_drugs_mapping[drug]
    newrow[keep] = row[keep]
saved_len = len(row)  # use this later if you need the old row length
row = newrow
Having a list that's mostly empty strings is wasteful. This will build an object that returns '' for every value except the ones actually inserted. However, you'd need to change any iterating code to use xrange(saved_len). Ideally, though, you would just modify the code that uses the list so as not to need such a thing.
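For instance, iterating code would change along these lines (a rough sketch; xrange implies Python 2, as in the answer):
for i in xrange(saved_len):
    print(row[i])  # '' for every index that was not kept, the original value otherwise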
If you really want to build the list:
newrow = [''] * len(row)  # build a list of empty strings
for drug in drug_input:
    keep = generic_drugs_mapping[drug]
    newrow[keep] = row[keep]  # fill it in where we need to
row = newrow  # throw the rest away

Related

dictionary being replaced and I am not sure why it is happening?

I have some code which is something along the lines of
storage = {}
for index, n in enumerate(dates):
    if n in specific_dates:
        for i in a_list:
            my_dict[i] = {}
            my_dict[i]["somthing"] = value
            my_dict[i]["somthing2"] = value_2
    else:
        # print(storage[dates[index - 1]]["my_dict"][i]["somthing"])
        for i in a_list:
            my_dict[i] = {}
            my_dict[i]["somthing"] = different_value - storage[dates[index - 1]]["my_dict"][i]["somthing"]
            my_dict[i]["somthing2"] = different_value_2
    storage[n]["my_dict"] = my_dict
The first pass runs the branch under if n in specific_dates:; the second pass runs the for i in a_list: loop in the else branch.
Essentially, the code captures a value on specific dates, and that value is then used for the nonspecific dates that follow until the next specific date overrides it. At every date, I save a dictionary of values inside a master dictionary called storage.
I found the problem: when I print my_dict on the second pass, my_dict[i] is literally an empty dictionary, whereas prior to that loop it was filled. Where I have put the commented-out print line, it would print value. I have fixed this by changing storage[n]["my_dict"] = my_dict to storage[n]["my_dict"] = my_dict.copy(), and I can now access value.
However, I do not really understand why this didn't work the way I expected in the first place, as I thought that assigning my_dict to storage created new memory.
I was hoping someone could explain why this is happening and why storage[dates[index - 1]]["my_dict"][i]["somthing"] doesn't create a new space in memory, if that is indeed what is happening.
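No answer is quoted here, but the behaviour comes down to ordinary Python assignment semantics: storing a dict in another dict stores a reference to the same object, not a copy. A minimal sketch (not from the post) of the difference:
my_dict = {'a': 1}
storage = {'date1': {'my_dict': my_dict}}
my_dict['a'] = 99                               # later mutation of the same object
print(storage['date1']['my_dict'])              # {'a': 99} -- storage sees the change
storage['date2'] = {'my_dict': my_dict.copy()}  # a shallow copy is a separate object
my_dict['a'] = 0
print(storage['date2']['my_dict'])              # {'a': 99} -- unaffected by later changes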

Python - Get item from a list under a list

I have a list like below.
list = [[Name,ID,Age,mark,subject],[karan,2344,23,87,Bio],[karan,2344,23,87,Mat],[karan,2344,23,87,Eng]]
I need to get only the name 'Karan' as output.
How can I get that?
This is a 2D list, so
list[i][j]
will give you the i'th list within your list and the j'th item within that list.
So to get 'karan' you want list[1][0].
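For instance (quoting the strings so the sample is valid Python, and naming it data to avoid shadowing the built-in list):
data = [['Name', 'ID', 'Age', 'mark', 'subject'],
        ['karan', 2344, 23, 87, 'Bio']]
print(data[1][0])  # karan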
I upvoted Lio Elbammalf, but decided to provide an answer that made a couple of assumptions that should have been clarified in the question:
1. The first item of the list is the headers; they really are in the list (not just shown as part of the question), and they are included because there is no guarantee that the headers will always be in the same order.
2. This is probably a CSV file.
Ignoring 2 for the moment, what you would want to do is remove the "headers" from the list (so that the rest of the list is uniform), and then find the index of "Name" (your desired output).
myinput = [["Name","ID","Age","mark","subject"],
           ["karan",2344,23,87,"Bio"],
           ["karan",2344,23,87,"Mat"],
           ["karan",2344,23,87,"Eng"]]

## Remove the headers from the list to simplify everything
headers = myinput.pop(0)

## Figure out where to find the person's Name
nameindex = headers.index("Name")

## Return a list of the Name in each row (assumes this code lives inside a function)
return [stats[nameindex] for stats in myinput]
If the name is guaranteed to be the same in each row, then you can just return myinput[0][nameindex], as suggested in the other answer.
Now, if 2 is true, I'm assuming you're using the csv module, in which case load the file using the DictReader class and then just access each row using the 'Name' key:
import csv

def loadfile(myfile):
    with open(myfile) as f:
        reader = csv.DictReader(f)
        return list(reader)

def getname(rows):
    ## This is the same return as above, and again you can just
    ## return rows[0]['Name'] if you know you only need the first one
    return [row['Name'] for row in rows]
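A quick usage sketch (the file name is hypothetical):
rows = loadfile('people.csv')
print(getname(rows))  # e.g. ['karan', 'karan', 'karan']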
In Python 3 you can do this (where ls is your list):
_, [x, _, _, _, _], *_ = ls
Now x will be 'karan'.
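For example, with the sample data quoted so it is valid Python:
ls = [['Name', 'ID', 'Age', 'mark', 'subject'],
      ['karan', 2344, 23, 87, 'Bio'],
      ['karan', 2344, 23, 87, 'Mat'],
      ['karan', 2344, 23, 87, 'Eng']]
_, [x, _, _, _, _], *_ = ls
print(x)  # karan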

Python create new array for different iteration of a for loop

I want to create a new array (or list) for every iteration. Here is my code:
import numpy as np

data2 = open('pathways.dat', 'r', errors='ignore')
pathways = data2.readlines()

special_line_indexes = []
PWY_ID = []
line_cont = []
L_PRMR = []  # Left primary

# i is the line number (first element of enumerate), while line is the line content (2nd elem of enumerate)
for CUI in just_compound_id:
    for i, line in enumerate(pathways):
        if '//' in line:
            # find the indexes of the lines containing //
            special_line_indexes = i + 1
        elif 'REACTION-LAYOUT -' in line:
            if CUI in line:
                PWY_ID.append(special_line_indexes)
Specifically, I want to create a different array PWY_ID for each iteration of CUI (the first for loop). What I end up with instead is one long array with all the output. Maybe it would be more efficient to use a dictionary, but I am not sure how to implement it in a for loop...
You can start by mapping out the data schema you want: decide what would be the key and what would be the value.
Then create a dict():
foo = dict()
You insert items into the dictionary with
foo["KEY"] = "VALUE"
For example,
foo["x"] = 12
and the value will then be stored in the dictionary.
Here is a tutorial about how to use dictionaries in Python:
https://www.youtube.com/watch?v=2j7ox_zqM4g
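Applied to the question (a rough sketch, not from the answer, reusing the names from the question's code): key the results by CUI so that each iteration of the outer loop gets its own fresh list.
PWY_ID = {}  # one list of results per compound id
for CUI in just_compound_id:
    PWY_ID[CUI] = []                     # a fresh list for this iteration
    special_line_indexes = None
    for i, line in enumerate(pathways):
        if '//' in line:
            special_line_indexes = i + 1
        elif 'REACTION-LAYOUT -' in line and CUI in line:
            PWY_ID[CUI].append(special_line_indexes)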

enumerate column headers in CSV that belong to the same tag (key) in python

I am using the following sets of generators to parse XML in to CSV:
import xml.etree.cElementTree as ElementTree
from xml.etree.ElementTree import XMLParser
import csv
def flatten_list(aList, prefix=''):
    for i, element in enumerate(aList, 1):
        eprefix = "{}{}".format(prefix, i)
        if element:
            # treat like dict
            if len(element) == 1 or element[0].tag != element[1].tag:
                yield from flatten_dict(element, eprefix)
            # treat like list
            elif element[0].tag == element[1].tag:
                yield from flatten_list(element, eprefix)
        elif element.text:
            text = element.text.strip()
            if text:
                yield eprefix[:].rstrip('.'), element.text

def flatten_dict(parent_element, prefix=''):
    prefix = prefix + parent_element.tag
    if parent_element.items():
        for k, v in parent_element.items():
            yield prefix + k, v
    for element in parent_element:
        eprefix = element.tag
        if element:
            # treat like dict - we assume that if the first two tags
            # in a series are different, then they are all different.
            if len(element) == 1 or element[0].tag != element[1].tag:
                yield from flatten_dict(element, prefix=prefix)
            # treat like list - we assume that if the first two tags
            # in a series are the same, then the rest are the same.
            else:
                # here, we put the list in dictionary; the key is the
                # tag name the list elements all share in common, and
                # the value is the list itself
                yield from flatten_list(element, prefix=eprefix)
            # if the tag has attributes, add those to the dict
            if element.items():
                for k, v in element.items():
                    yield eprefix + k, v
            # this assumes that if you've got an attribute in a tag,
            # you won't be having any text. This may or may not be a
            # good idea -- time will tell. It works for the way we are
            # currently doing XML configuration files...
        elif element.items():
            for k, v in element.items():
                yield eprefix + k, v
        # finally, if there are no child tags and no attributes, extract
        # the text
        else:
            yield eprefix, element.text

def makerows(pairs):
    headers = []
    columns = {}
    for k, v in pairs:
        if k in columns:
            columns[k].extend((v,))
        else:
            headers.append(k)
            columns[k] = [k, v]
    m = max(len(c) for c in columns.values())
    for c in columns.values():
        c.extend(' ' for i in range(len(c), m))
    L = [columns[k] for k in headers]
    rows = list(zip(*L))
    return rows

def main():
    with open('2-Response_duplicate.xml', 'r', encoding='utf-8') as f:
        xml_string = f.read()
    xml_string = xml_string.replace('&', '')  # optional, to remove ampersands
    root = ElementTree.XML(xml_string)
    # for key, value in flatten_dict(root):
    #     key = key.rstrip('.').rsplit('.', 1)[-1]
    #     print(key, value)
    writer = csv.writer(open("try5.csv", 'wt'))
    writer.writerows(makerows(flatten_dict(root)))

if __name__ == "__main__":
    main()
One column of the CSV, when opened in Excel, looks like this:
ObjectGuid
2adeb916-cc43-4d73-8c90-579dd4aa050a
2e77c588-56e5-4f3f-b990-548b89c09acb
c8743bdd-04a6-4635-aedd-684a153f02f0
1cdc3d86-f9f4-4a22-81e1-2ecc20f5e558
2c19d69b-26d3-4df0-8df4-8e293201656f
6d235c85-6a3e-4cb3-9a28-9c37355c02db
c34e05de-0b0c-44ee-8572-c8efaea4a5ee
9b0fe8f5-8ec4-4f13-b797-961036f92f19
1d43d35f-61ef-4df2-bbd9-30bf014f7e10
9cb132e8-bc69-4e4f-8f29-c1f503b50018
24fd77da-030c-4cb7-94f7-040b165191ce
0a949d4f-4f4c-467e-b0a0-40c16fc95a79
801d3091-c28e-44d2-b9bd-3bad99b32547
7f355633-426d-464b-bab9-6a294e95c5d5
This is due to the fact that there are 14 tags with name ObjectGuid. For example, one of these tags looks like this:
<ObjectGuid>2adeb916-cc43-4d73-8c90-579dd4aa050a</ObjectGuid>
My question: is there an efficient method to enumerate the headers (the keys) so that each key is numbered together with its corresponding value (the text in the XML data structure)?
It would be displayed in Excel as follows:
ObjectGuid_1 ObjectGuid_2 ObjectGuid_3 etc.
Please let me know if there is any other information that you need from me (such as sample XML). Thank you for your help.
It is a mistake to add an element, attribute, or annotative descriptor to the data set itself just for the purpose of identity. Normalizing the data should only be done if you own that data and can guarantee that doing so will not have any negative effect on other consumers (ones relying on attribute order to manipulate the DOM). Besides, what is the point of using a dict or nested dicts if the efficiency of the hash-table lookup is given right back by making O(n) checks for this new attribute? The point of hashing is random lookup.
If the data is simply structured as (key, value) pairs, which makes sense here, why not use some other contiguous data structure but treat it like a dictionary, say a namedtuple?
A second option, if you want to add additional state, is to wrap your generator in a class:
class Order:
    def __init__(self, lines):
        self.lines = lines
        self.order = []          # extra state: remembers (index, line) pairs

    def __iter__(self):
        for i, line in enumerate(self.lines, 1):
            self.order.append((i, line))
            yield line

with open('somefile.csv') as f:
    lines = Order(f)
Messing with the data: is a conversion harmless? For example, suppose we create a conversion mapping (see below). That is fine, until one of the values is blank...
field_types = [('x', float),
               ('y', float)]

with open('some.csv') as f:
    for row in csv.DictReader(f):
        row.update((key, conversion(row[key]))
                   for key, conversion in field_types)
{'x': '', 'y': 2.2} --> that is, until there is an empty data point. Kaboom: float('') raises a ValueError.
So my suggestion would be not to change or add to the data, but to change the algorithm that deals with it. If the problem is order, why not treat a tuple like a namedtuple, similar to a dictionary? The caveat is immutability, which actually fits uniform data.
*I don't understand the nested dictionary... that is for the header values, yes? Values and order: key -> key -> (key: value)? Or you could just skip the first row :p
So just skip the first row: call next() on the reader (or iterator) once before you start looping over the lines.
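As a concrete sketch of that last point (hypothetical file name):
import csv

with open('some.csv') as f:
    reader = csv.reader(f)
    header = next(reader)   # consume the header row once
    for row in reader:
        print(row)          # only data rows from here on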
*** Notables
- To iterate over multiple sequences in parallel:
h = ['a', 'b', 'c']
x = [1, 2, 3]
for i in zip(h, x):
    print(i)
# ('a', 1)
# ('b', 2)
# ('c', 3)
- Chaining:
from itertools import chain
a = [1, 2, 3]
b = ['a', 'b', 'c']
for item in chain(a, b):
    print(item)
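Returning to the actual question (neither post spells this out; the following is just a sketch under the assumption that each repeated key should become its own column): give every key a numeric suffix inside the row-building step, so the fourteen ObjectGuid values land in columns ObjectGuid_1 through ObjectGuid_14.
from collections import Counter

def makerows_enumerated(pairs):
    # hypothetical variant of makerows(): suffix every key with its
    # occurrence count, so repeated keys become separate columns
    seen = Counter()
    headers = []
    columns = {}
    for k, v in pairs:
        seen[k] += 1
        key = "{}_{}".format(k, seen[k])
        headers.append(key)
        columns[key] = [key, v]
    # pad shorter columns so zip() keeps every row aligned
    m = max(len(c) for c in columns.values())
    for c in columns.values():
        c.extend(' ' for _ in range(len(c), m))
    return list(zip(*(columns[k] for k in headers)))
It could then be dropped in as writer.writerows(makerows_enumerated(flatten_dict(root))).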

Spark Remove Line (Python)

I have a dataset that I am running statistical functions on, and I potentially need to remove the first and last line (depending on whether there is a header/trailer). What would be the easiest way to accomplish this?
dataSplit = sc.textFile(inputFile).map(lambda line: line.split(","))
I'm just a beginner with Spark, but I guess this would work. Please correct me if it doesn't work or if there are better practices.
# get file
inputRDD = sc.textFile(inputFile).cache()
# get header
header = inputRDD.first()
# get trailer, but be careful with large RDDs and collect()!
trailer = inputRDD.collect()[-1]
# remove header trailer
filtered_inputRDD = inputRDD.filter(lambda x: x != header).filter(lambda x: x != trailer)
# afterwards you can split
dataSplit = filtered_inputRDD.map(lambda line: line.split(","))
I tried something different to get the trailer in a more efficient way:
# this is a helper function which iterates through
# the part it gets and returns the last item of the part
#
# item is set to "empty" in case part is empty
# replace it with desired output for empty parts
def iterate(part):
    item = "empty"
    my_iter = iter(part)
    for item in my_iter:
        pass
    return item
# instead of collecting the RDD and returning the last item
# it now does a mapPartitions first and iterates through every part
# and returns the last items of every partition
# then you only have to collect [numPartitions] rows and the
# selection of the last item is much easier
trailer_efficient = inputRDD.mapPartitions(lambda x: [iterate(x)]).collect()[-1]
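A further option (not from the answer, just a sketch assuming the standard PySpark RDD API): tag each line with its position via zipWithIndex() and filter out index 0 and the last index, which avoids comparing line contents against the header/trailer values.
# count the lines once, then drop the first and last by position
n = inputRDD.count()
dataSplit = (inputRDD.zipWithIndex()                       # (line, index) pairs
             .filter(lambda li: li[1] != 0 and li[1] != n - 1)
             .map(lambda li: li[0].split(",")))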
