Last item on the reduce Method - python

I have this list of countries:
countries = ['Estonia', 'Finland', 'Sweden', 'Denmark', 'Norway', 'Iceland']
I need to solve the following exercise: Use reduce to concatenate all the countries and to produce this sentence: Estonia, Finland, Sweden, Denmark, Norway, and Iceland are north European countries
from functools import reduce
def sentece(pais, pais_next):
    if pais_next == 'Iceland':
        return pais + ' and ' + pais_next + ' are north European countries'
    else:
        return pais + ', ' + pais_next
countries_reduce = reduce(sentece, countries)
print(countries_reduce)
The code runs perfectly, but if I want to do this in general, how do I know which element is the last one?

The reduce function doesn't have a way to tell it what to do about the last item, only what to do about the initialization.
There are two general ways to go about it:
Just do simple concatenation with a comma and a space, but only on the first n-1 items of the list, then manually append the correctly formatted last item.
Change the last item from 'Iceland' to 'and Iceland are north European countries', then do the concatenation for the full list.

Figuring out which is the last element is a bad idea; any solution that would give you that information would be a royal hack.
Normally, you wouldn't use reduce to solve this at all (repeated concatenation is a form of Schlemiel the Painter's Algorithm, involving O(n²) work, where efficient algorithms can be O(n)), so you'd just use ', '.join, e.g.:
countries = ['Estonia', 'Finland', 'Sweden', 'Denmark', 'Norway', 'Iceland']
countries_str = f'{", ".join(countries[:-1])} and {countries[-1]} are north European countries'
or, premodifying countries[-1] to reduce the complexity to the point where an f-string isn't necessary (assuming an Oxford comma is okay):
countries = ['Estonia', 'Finland', 'Sweden', 'Denmark', 'Norway', 'Iceland']
countries[-1] = 'and ' + countries[-1] # Put the "and " prefix in front ahead of time
countries_str = ', '.join(countries) + ' are north European countries'
In the first version, join handles the consistently separated items, and the f-string wraps it to insert the last item along with the rest of the formatting.
If you must use reduce, you'd still want to handle the final element separately, either by processing it completely separately at the end, e.g.
from functools import reduce
countries = ['Estonia', 'Finland', 'Sweden', 'Denmark', 'Norway', 'Iceland']
countries_str = f'{reduce(lambda x, y: f"{x}, {y}", countries[:-1])} and {countries[-1]} are north European countries'
print(countries_str)
or by manually tweaking it ahead of time so it can be used in a consistent manner (assuming you're okay with the Oxford comma):
from functools import reduce
countries = ['Estonia', 'Finland', 'Sweden', 'Denmark', 'Norway', 'Iceland']
countries[-1] = 'and ' + countries[-1] # Put the "and " prefix in front ahead of time
countries_str = f'{reduce(lambda x, y: f"{x}, {y}", countries)} are north European countries'
print(countries_str)
Again, reduce is a bad solution to the problem; str.join (using ', '.join) is an O(n) solution (on CPython, it pre-scans the items to join to compute the final length, then preallocates the complete final str, and copies each input exactly once), where reduce is O(n²) (and unlike an actual loop using +=, it can't even benefit from the CPython reference interpreter's implementation detail that sometimes allows concatenation to mutate in-place, reducing the number of data copies).
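For the general case the question asks about, the join approach can be wrapped in a small helper that also copes with short lists. This is only a minimal sketch: the name join_with_and and the choice to always use the Oxford comma for three or more items are assumptions, not part of the answers above.

```python
def join_with_and(items):
    """Join items as 'a, b, and c' (hypothetical helper, Oxford comma assumed)."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)  # empty list -> '', single item returned unchanged
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    # join handles the uniform ', ' separators; only the last item is special
    return f"{', '.join(items[:-1])}, and {items[-1]}"


countries = ['Estonia', 'Finland', 'Sweden', 'Denmark', 'Norway', 'Iceland']
sentence = f"{join_with_and(countries)} are north European countries"
print(sentence)
```

Because the last item is picked by position (items[-1]) rather than by value, nothing in the helper depends on the list ending in 'Iceland'.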

Related

Issue in getting the correct list items/values using python

I have a list in python like this:
list1 = ['Security Name % to Net Assets* DEBENtURES 0.04, Britannia Industries Ltd. EQUity & RELAtED 96.83, HDFC Bank 6.98, ICICI 4.82']
If I get the length of this list using len(list1), it gives 1, simply because the whole thing is treated as a single string.
How can I transform my list1 such that it would look like this:
list1_altered = ['Security Name % to Net Assets* DEBENtURES 0.04', 'Britannia Industries Ltd. EQUity & RELAtED 96.83', 'HDFC Bank 6.98', 'ICICI 4.82']
After that, len(list1_altered) should give 4.
I have tried using replace(",", "\',\'"), but it doesn't give the desired result.
Please help me do this.
Replacing the commas by quotes is not going to change the fact that you'll still have a single string. You won't transform an object (string) into another (list of strings) this way.
Use str.split to generate a list of multiple strings split on a separator (here a comma followed by a space, so the pieces don't keep a leading space):
list1_altered = list1[0].split(', ')
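If the spacing around the commas is not consistent, a variant that splits on the bare comma and strips each piece is more forgiving. A minimal sketch using the list from the question:

```python
list1 = ['Security Name % to Net Assets* DEBENtURES 0.04, Britannia Industries Ltd. EQUity & RELAtED 96.83, HDFC Bank 6.98, ICICI 4.82']

# Split the single string on commas, then trim whitespace around each piece
list1_altered = [part.strip() for part in list1[0].split(',')]

print(len(list1_altered))
print(list1_altered)
```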

Convert XML file to a list of lists in Python

I have a dataset from HMDB: the Saliva Metabolites data.
This data is an XML file. What I want to do is convert this XML file to a list of lists (nested lists) in Python; however, I don't want all the nodes in the list.
Edit: here is an example of partial data for one metabolite:
<?xml version="1.0" encoding="UTF-8"?>
<hmdb xmlns="http://www.hmdb.ca">
<metabolite>
<version>4.0</version>
<creation_date>2005-11-16 15:48:42 UTC</creation_date>
<update_date>2019-01-11 19:13:56 UTC</update_date>
<accession>HMDB0000001</accession>
<status>quantified</status>
<secondary_accessions>
<accession>HMDB00001</accession>
<accession>HMDB0004935</accession>
<accession>HMDB0006703</accession>
<accession>HMDB0006704</accession>
<accession>HMDB04935</accession>
<accession>HMDB06703</accession>
<accession>HMDB06704</accession>
</secondary_accessions>
<name>1-Methylhistidine</name>
<cs_description>1-Methylhistidine, also known as 1-mhis, belongs to the class of organic compounds known as histidine and derivatives. Histidine and derivatives are compounds containing cysteine or a derivative thereof resulting from reaction of cysteine at the amino group or the carboxy group, or from the replacement of any hydrogen of glycine by a heteroatom. 1-Methylhistidine has been found in human muscle and skeletal muscle tissues, and has also been detected in most biofluids, including cerebrospinal fluid, saliva, blood, and feces. Within the cell, 1-methylhistidine is primarily located in the cytoplasm. 1-Methylhistidine participates in a number of enzymatic reactions. In particular, 1-Methylhistidine and Beta-alanine can be converted into anserine; which is catalyzed by the enzyme carnosine synthase 1. In addition, Beta-Alanine and 1-methylhistidine can be biosynthesized from anserine; which is mediated by the enzyme cytosolic non-specific dipeptidase. In humans, 1-methylhistidine is involved in the histidine metabolism pathway. 1-Methylhistidine is also involved in the metabolic disorder called the histidinemia pathway.</cs_description>
<description>One-methylhistidine (1-MHis) is derived mainly from the anserine of dietary flesh sources, especially poultry. The enzyme, carnosinase, splits anserine into b-alanine and 1-MHis. High levels of 1-MHis tend to inhibit the enzyme carnosinase and increase anserine levels. Conversely, genetic variants with deficient carnosinase activity in plasma show increased 1-MHis excretions when they consume a high meat diet. Reduced serum carnosinase activity is also found in patients with Parkinson's disease and multiple sclerosis and patients following a cerebrovascular accident. Vitamin E deficiency can lead to 1-methylhistidinuria from increased oxidative effects in skeletal muscle. 1-Methylhistidine is a biomarker for the consumption of meat, especially red meat.</description>
<synonyms>
<synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid</synonym>
<synonym>1-Methylhistidine</synonym>
<synonym>Pi-methylhistidine</synonym>
<synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate</synonym>
<synonym>1 Methylhistidine</synonym>
<synonym>1-Methyl histidine</synonym>
</synonyms>
<chemical_formula>C7H11N3O2</chemical_formula>
<smiles>CN1C=NC(C[C#H](N)C(O)=O)=C1</smiles>
<inchikey>BRMWTNUJHUMWMS-LURJTMIESA-N</inchikey>
<diseases>
<disease>
<name>Kidney disease</name>
<omim_id/>
<references>
<reference>
<reference_text>McGregor DO, Dellow WJ, Lever M, George PM, Robson RA, Chambers ST: Dimethylglycine accumulates in uremia and predicts elevated plasma homocysteine concentrations. Kidney Int. 2001 Jun;59(6):2267-72.</reference_text>
<pubmed_id>11380830</pubmed_id>
</reference>
<reference>
<reference_text>Ehrenpreis ED, Salvino M, Craig RM: Improving the serum D-xylose test for the identification of patients with small intestinal malabsorption. J Clin Gastroenterol. 2001 Jul;33(1):36-40.</reference_text>
<pubmed_id>11418788</pubmed_id>
</reference>
<reference>
</reference>
</references>
</disease>
<disease>
Importing the file:
import xml.etree.ElementTree as et
data1 = et.parse('D:/path/to/Tal/my/HMDB/DataSets/saliva_metabolites/saliva_metabolites.xml')
root = data1.getroot()
Now, not sure how to select specific nodes. Meaning, my goal is to create a list of metabolites and each metabolite from the list will contain a list of nodes (say, <accession>, <name>, <synonyms> and <diseases_name>)
In turn, those elements will contain another list (say, inside <synonyms> there will be a list of values names, or inside <diseases_name> will be the list of names of diseases and each disease will contain a list of pub_id values).
# To access the 4th node of the first metabolite
>> root[0][3].text
'HMDB0000001'
where root[0][3] represents the <accession> node.
I tried running a loop with print to understand its output, but received a list of None values:
for node in root:
    print(node.find('accession'))
None
None
None
None
None
.
.
.
Also tried
>> root.findall('./metabolite/accession')
[]
But received an empty list.
For the list of synonyms of the first metabolite I tried:
>> root[0][9].text
'\n '
# This gave the first value of synonyms
root[0][9][0].text
'\n '
I used those questions to find an answer:
How do I parse XML in Python?
how to create a list of elements from an XML file in python
Python: XML file to pandas dataframe
Convert XML into Lists of Tags and Values with Python
Generating nested lists from XML doc
Any hints, ideas would be a help, thank you for your time
You are ignoring the namespace in the XML.
<hmdb xmlns="http://www.hmdb.ca">
means that there is no <hmdb> element. There is a <hmdb> in the http://www.hmdb.ca namespace. And since it's the default namespace for this element, all descendant elements are in the same namespace, unless they override that.
So this
root.findall('./metabolite/accession')
will not return anything because you're searching in the wrong namespace.
Let's search in the http://www.hmdb.ca namespace by giving it the handle h, for convenience:
ns = {
"h": "http://www.hmdb.ca"
}
accession = root.findall('./h:metabolite/h:accession', ns)
print(accession)
This finds one element (see how it explicitly denotes the namespace when you print it):
[<Element '{http://www.hmdb.ca}accession' at 0x03E6E7B0>]
You can use the same explicit syntax in ElementTree, but it gets unwieldy very quickly:
root.findall('./{http://www.hmdb.ca}metabolite/{http://www.hmdb.ca}accession')
The shorter (and standard) syntax with the prefix: is a lot nicer to work with.
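The same prefix mapping can be reused to build the nested lists the question asks for. The following is a sketch under assumptions: SAMPLE is a hand-trimmed stand-in for the real file, and the layout of each inner list (accession, name, synonyms, diseases with their PubMed IDs) is one possible choice, not something prescribed by HMDB.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed stand-in for the HMDB file (hypothetical sample data)
SAMPLE = """<hmdb xmlns="http://www.hmdb.ca">
  <metabolite>
    <accession>HMDB0000001</accession>
    <name>1-Methylhistidine</name>
    <synonyms>
      <synonym>Pi-methylhistidine</synonym>
      <synonym>1-Methyl histidine</synonym>
    </synonyms>
    <diseases>
      <disease>
        <name>Kidney disease</name>
        <references>
          <reference><pubmed_id>11380830</pubmed_id></reference>
          <reference><pubmed_id>11418788</pubmed_id></reference>
        </references>
      </disease>
    </diseases>
  </metabolite>
</hmdb>"""

NS = {"h": "http://www.hmdb.ca"}  # prefix "h" is an arbitrary local handle
root = ET.fromstring(SAMPLE)

metabolites = []
for met in root.findall("h:metabolite", NS):
    accession = met.findtext("h:accession", namespaces=NS)
    name = met.findtext("h:name", namespaces=NS)
    synonyms = [s.text for s in met.findall("h:synonyms/h:synonym", NS)]
    # Each disease becomes [disease name, [pubmed ids]]
    diseases = [
        [d.findtext("h:name", namespaces=NS),
         [p.text for p in d.findall("h:references/h:reference/h:pubmed_id", NS)]]
        for d in met.findall("h:diseases/h:disease", NS)
    ]
    metabolites.append([accession, name, synonyms, diseases])

print(metabolites)
```

For the real file, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE); the rest of the loop stays the same.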

Why don't I get the desired output? In this recursive function?

I've been doing this course on Udacity, and this one problem has been stressing me for a while. It keeps coming back to me, and I can't really get an idea of it because I find recursive functions super confusing and complicated.
I would like to find the solution by myself, but I need some help understanding how this works and why it doesn't produce the output I want. Thank you.
# Single Gold Star
# Family Trees
# In the lecture, we showed a recursive definition for your ancestors. For this
# question, your goal is to define a procedure that finds someone's ancestors,
# given a Dictionary that provides the parent relationships.
# Here's an example of an input Dictionary:
ada_family = { 'Judith Blunt-Lytton': ['Anne Isabella Blunt', 'Wilfrid Scawen Blunt'],
'Ada King-Milbanke': ['Ralph King-Milbanke', 'Fanny Heriot'],
'Ralph King-Milbanke': ['Augusta Ada King', 'William King-Noel'],
'Anne Isabella Blunt': ['Augusta Ada King', 'William King-Noel'],
'Byron King-Noel': ['Augusta Ada King', 'William King-Noel'],
'Augusta Ada King': ['Anne Isabella Milbanke', 'George Gordon Byron'],
'George Gordon Byron': ['Catherine Gordon', 'Captain John Byron'],
'John Byron': ['Vice-Admiral John Byron', 'Sophia Trevannion'] }
# Define a procedure, ancestors(genealogy, person), that takes as its first input
# a Dictionary in the form given above, and as its second input the name of a
# person. It should return a list giving all the known ancestors of the input
# person (this should be the empty list if there are none). The order of the list
# does not matter and duplicates will be ignored.
output = []
def ancestors(genealogy, person):
    if person in genealogy:
        for candidate in genealogy[person]:
            output.append(candidate)
            ancestors(genealogy, candidate)
        return output
    else:
        return []
# Here are some examples:
print (ancestors(ada_family, 'Augusta Ada King'))
#>>> ['Anne Isabella Milbanke', 'George Gordon Byron',
# 'Catherine Gordon','Captain John Byron']
print (ancestors(ada_family, 'Judith Blunt-Lytton'))
#>>> ['Anne Isabella Blunt', 'Wilfrid Scawen Blunt', 'Augusta Ada King',
# 'William King-Noel', 'Anne Isabella Milbanke', 'George Gordon Byron',
# 'Catherine Gordon', 'Captain John Byron']
print (ancestors(ada_family, 'Dave'))
#>>> []
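You need to declare output within the function scope. As a module-level list it is never reset, so the results of earlier calls accumulate in later ones. Once output is local, each call must also merge in the list returned by its recursive calls.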
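One way to apply that advice is sketched below. It is not the course's reference solution: the use of dict.get and list.extend is a choice made here, and family is a trimmed-down copy of the dictionary from the question.

```python
def ancestors(genealogy, person):
    output = []  # fresh list on every call, so nothing leaks between calls
    for parent in genealogy.get(person, []):  # unknown person -> no parents
        output.append(parent)
        # the recursive call returns its own list; merge it into ours
        output.extend(ancestors(genealogy, parent))
    return output


# Trimmed-down version of the dictionary from the question
family = {
    'Augusta Ada King': ['Anne Isabella Milbanke', 'George Gordon Byron'],
    'George Gordon Byron': ['Catherine Gordon', 'Captain John Byron'],
}

print(ancestors(family, 'Augusta Ada King'))
print(ancestors(family, 'Dave'))
```

Calling the function twice with the same person now gives the same list both times, which is the property the global version lacked.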

Is there any way to sort nested lists without using operator.itemgetter?

I'm reading in a file and building nested lists that I then want to sort on the element at index 4 (the ZIP code):
jk43:23 Marfield Lane:Plainview:NY:10023
axe99:315 W. 115th Street, Apt. 11B:New York:NY:10027
jab44:23 Rivington Street, Apt. 3R:New York:NY:10002
ap172:19 Boxer Rd.:New York:NY:10005
jb23:115 Karas Dr.:Jersey City:NJ:07127
jb29:119 Xylon Dr.:Jersey City:NJ:07127
ak9:234 Main Street:Philadelphia:PA:08990
Here is my code:
import operator
ex3_3 = open('ex1.txt')
exw = open('ex2_sorted.txt', 'w')
data = []
for line in ex3_3:
    items = line.rstrip().split(':')
    data.append(items)
print sorted(data, key=operator.itemgetter(4))
Output:
[['jb23', '115 Karas Dr.', 'Jersey City', 'NJ', '07127'], ['jb29', '119 Xylon Dr.', 'Jersey City', 'NJ', '07127'], ['ak9', '234 Main Street', 'Philadelphia', 'PA', '08990'], ['jab44', '23 Rivington Street, Apt. 3R', 'New York', 'NY', '10002'], ['ap172', '19 Boxer Rd.', 'New York', 'NY', '10005'], ['jk43', '23 Marfield Lane', 'Plainview', 'NY', '10023'], ['axe99', '315 W. 115th Street, Apt. 11B', 'New York', 'NY', '10027']]
This all works fine; I just wonder if there is a way to do this without using "import operator".
Oh yes, there is a way:
print sorted(data,key=lambda x: x[4])
A rough workalike would be:
print sorted(data, key=lambda items: items[4])
but operator.itemgetter is a bit faster. I'm using this program to benchmark both approaches:
#!/usr/bin/env python
import timeit
withlambda = 'lst.sort(key=lambda items: items[4])'
withgetter = 'lst.sort(key=operator.itemgetter(4))'
setup = """\
import random
import operator
random.seed(0)
lst = [(random.randrange(100000), random.randrange(100000), random.randrange(100000), random.randrange(100000), random.randrange(100000))
       for _ in range(100000)]
"""
n = 1000
print "With lambda:"
print timeit.timeit(withlambda, setup, number=n)
print "With getter:"
print timeit.timeit(withgetter, setup, number=n)
It creates a random list of 100,000 5-item tuples and then runs sort() on the list 1,000 times. On my MacBook Pro with Python 2.7.2, the withlambda version runs in about 55.4s and withgetter runs in about 46.1s.
Note that as the lists grow large, the time spent in the sorting algorithm itself grows faster than the time spent fetching keys. Therefore, the difference is much greater if you're sorting lots of little lists. Running the same test with a 1,000 item list repeated 100,000 times yields 22.4s for withlambda vs. 12.5s for withgetter.
Construct or reorganize your sublist so that the thing you want to sort on is first. In your case, ZIP code, instead of being element 4, should be element 0. Then you can just sort them.
Of course the suitability of this ordering for other uses of the data must also be considered.
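The lambda-based key can be tried without a file. This sketch uses Python 3 syntax and three of the records from the question held in a string; the variable names are illustrative only.

```python
records = """jk43:23 Marfield Lane:Plainview:NY:10023
jb23:115 Karas Dr.:Jersey City:NJ:07127
ak9:234 Main Street:Philadelphia:PA:08990"""

# One inner list per line, fields split on ':'
data = [line.split(":") for line in records.splitlines()]

# Sort on the ZIP code (index 4) without importing operator
by_zip = sorted(data, key=lambda items: items[4])

zips = [row[4] for row in by_zip]
print(zips)
```

The ZIP codes are compared as strings here, which keeps leading zeros intact and orders correctly as long as every code has the same number of digits.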

How to classify users into different countries, based on the Location field

Most web applications have a Location field, in which users may enter a location of their choice.
How would you classify users into different countries, based on the location entered?
For example, I used the Stack Overflow dump of users.xml and extracted users' names, reputation and location:
['Jeff Atwood', '12853', 'El Cerrito, CA']
['Jarrod Dixon', '1114', 'Morganton, NC']
['Sneakers OToole', '200', 'Unknown']
['Greg Hurlman', '5327', 'Halfway between the boardwalk and Six Flags, NJ']
['Power-coder', '812', 'Burlington, Ontario, Canada']
['Chris Jester-Young', '16509', 'Durham, NC']
['Teifion', '7024', 'Wales']
['Grant', '3333', 'Georgia']
['TimM', '133', 'Alabama']
['Leon Bambrick', '2450', 'Australia']
['Coincoin', '3801', 'Montreal']
['Tom Grochowicz', '125', 'NJ']
['Rex M', '12822', 'US']
['Dillie-O', '7109', 'Prescott, AZ']
['Pete', '653', 'Reynoldsburg, OH']
['Nick Berardi', '9762', 'Phoenixville, PA']
['Kandis', '39', '']
['Shawn', '4248', 'philadelphia']
['Yaakov Ellis', '3651', 'Israel']
['redwards', '21', 'US']
['Dave Ward', '4831', 'Atlanta']
['Liron Yahdav', '527', 'San Rafael, CA']
['Geoff Dalgas', '648', 'Corvallis, OR']
['Kevin Dente', '1619', 'Oakland, CA']
['Tom', '3316', '']
['denny', '573', 'Winchester, VA']
['Karl Seguin', '4195', 'Ottawa']
['Bob', '4652', 'US']
['saniul', '2352', 'London, UK']
['saint_groceon', '1087', 'Houston, TX']
['Tim Boland', '192', 'Cincinnati Ohio']
['Darren Kopp', '5807', 'Woods Cross, UT']
using the following Python script:
from xml.etree import ElementTree
root = ElementTree.parse('SO Export/so-export-2009-05/users.xml').getroot()
items = ['DisplayName','Reputation','Location']
def loop1():
    for count, i in enumerate(root):
        det = [i.get(x) for x in items]
        print det
        if count > 30: break
loop1()
What is the simplest way to classify people into different countries? Are there any ready lookup tables available that provide me an output saying X location belongs to Y country?
The lookup table need not be totally accurate. Reasonably accurate answers are obtained by querying the location string on Google, or better still, Wolfram Alpha.
Your best bet is to use a geocoding API, for example through a client library like geopy.
The Google Geocoding API, for example, will return the country in the CountryNameCode-field of the response.
With just this one location field the number of false matches will probably be relatively high, but maybe it is good enough.
If you had server logs, you could also try to look up each user's IP address with an IP geocoder (more information and pointers on Wikipedia).
Force users to specify country, because you'll have to deal with ambiguities. This would be the right way.
If that's not possible, at least make your best-guess in conjunction with their IP address.
For example, ['Grant', '3333', 'Georgia']
Is this Georgia, USA?
Or is this the Republic of Georgia?
If their IP address suggests somewhere in Central Asia or Eastern Europe, then chances are it's the Republic of Georgia. If it's North America, chances are pretty good they mean Georgia, USA.
Note that mappings from IP address to country aren't 100% accurate, and the database needs to be updated regularly. In my opinion, far too much trouble.
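As a rough illustration of the lookup-table idea from the question, the sketch below checks only the last comma-separated part of the location. Everything in it is hypothetical: the two tables are tiny hand-written samples, guess_country is an invented name, and ambiguous or unrecognised entries (such as 'Georgia') are deliberately left unclassified.

```python
# Hypothetical, deliberately tiny lookup tables; a real one would be far larger
US_STATE_CODES = {"CA", "NC", "NJ", "AZ", "OH", "PA", "OR", "VA", "TX", "UT"}
COUNTRY_ALIASES = {
    "us": "United States",
    "uk": "United Kingdom",
    "canada": "Canada",
    "israel": "Israel",
    "australia": "Australia",
}


def guess_country(location):
    """Return a best-guess country for a free-text location, or None."""
    parts = [p.strip() for p in location.split(",") if p.strip()]
    if not parts:
        return None  # empty location field
    last = parts[-1]  # the state or country usually comes last
    if last in US_STATE_CODES:
        return "United States"
    return COUNTRY_ALIASES.get(last.lower())  # None when not in the table


for loc in ['El Cerrito, CA', 'Burlington, Ontario, Canada', 'London, UK', 'Georgia', '']:
    print(repr(loc), '->', guess_country(loc))
```

Anything this returns None for would still need one of the approaches above: a geocoding service, an IP lookup, or asking the user.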
