Python Variable Amount Of Input

I'm working on a program that determines whether a graph is strongly connected.
I am reading standard input on a sequence of lines.
The lines have two or three whitespace-delimited tokens: the names of the source and destination vertices, and an optional decimal edge weight.
Input might look like this:
'''
Houston Washington 1000
Vancouver Houston 300
Dallas Sacramento 800
Miami Ames 2000
SanFrancisco LosAngeles
ORD PVD 1000
'''
How can I read in this input and add it to my graph?
I believe I will be using a collection like this:
flights = collections.defaultdict(dict)
Thank you for any help!

With d as your data, you can split it on '\n', strip trailing whitespace from each line, and find the last occurrence of ' '. With that you can slice each line to get the name and the number associated with it.
Here I've stored the data in a dictionary. You can modify it according to your requirements!
Use the regular expression module's re.sub to remove the extra spaces.
>>> import re
>>> d
'\nHouston Washington 1000\nVancouver Houston 300\nDallas Sacramento 800\nMiami Ames 2000\nSanFrancisco LosAngeles\nORD PVD 1000\n'
>>> [{'Name': re.sub(r' +', ' ', each[:each.strip().rfind(' ')]).strip(), 'Flight Number': each[each.strip().rfind(' '):].strip()} for each in filter(None, d.split('\n'))]
[{'Flight Number': '1000', 'Name': 'Houston Washington'}, {'Flight Number': '300', 'Name': 'Vancouver Houston'}, {'Flight Number': '800', 'Name': 'Dallas Sacramento'}, {'Flight Number': '2000', 'Name': 'Miami Ames'}, {'Flight Number': 'LosAngeles', 'Name': 'SanFrancisco'}, {'Flight Number': '1000', 'Name': 'ORD PVD'}]
Edit:
To match your flights dict,
>>> import collections
>>> flights = collections.defaultdict(dict)  # matches the asker's collection
>>> for each in filter(None, d.split('\n')):
...     src, dst, *weight = each.split()
...     flights[src][dst] = weight[0] if weight else ''
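For the original goal of reading from standard input, here is a minimal sketch (assuming a missing weight should be stored as None and that weights should be numeric):
import collections
import sys

flights = collections.defaultdict(dict)
for line in sys.stdin:
    tokens = line.split()
    if not tokens:
        continue  # ignore blank lines
    src, dst, *weight = tokens
    # the third token is optional; convert it to a float when present
    flights[src][dst] = float(weight[0]) if weight else None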

Related

How do I split this list into pairs or single elements? Is there an easier way of doing this?

In Python I have a list of names; however, some have a surname and some do not. How would I split the list into names with surnames and names without?
I don't really know how to explain it, so please look at the code and see if you can understand (sorry if I have worded it really badly in the title).
See code below :D
names = ("Donald Trump James Barack Obama Sammy John Harry Potter")
# the names with surnames are the famous ones
# names without are regular names
list = names.split()
# I want to separate them into a list of separate names so I use split()
# but now surnames like "Trump" are counted as a name
print("Names are:",list)
This outputs
['Donald', 'Trump', 'James', 'Barack', 'Obama', 'Sammy', 'John', 'Harry', 'Potter']
I would like it to output something like ['Donald Trump', 'James', 'Barack Obama', 'Sammy', 'John', 'Harry Potter']
Any help would be appreciated
As said in the comments, you need a list of famous names.
# complete list of famous people
US_PRESIDENTS = (('Donald', 'Trump'), ('Barack', 'Obama'), ('Harry', 'Potter'))
def splitfamous(namestring):
    words = namestring.split()
    # create all tuples of 2 adjacent words and compare them to US_PRESIDENTS
    for i, name in enumerate(zip(words, words[1:])):
        if name in US_PRESIDENTS:
            words[i] = ' '.join(name)
            words[i+1] = None
    # remove empty fields and return the result
    return list(filter(None, words))
names = "Donald Trump James Barack Obama Sammy John Harry Potter"
print(splitfamous(names))
The resulting list:
['Donald Trump', 'James', 'Barack Obama', 'Sammy', 'John', 'Harry Potter']

Python for matching paper IDs in Scholarly

I have a list of the following authors for Google Scholar papers: Zoe Pikramenou, James H. R. Tucker, Alison Rodger, Timothy Dafforn. I want to extract and print titles for the papers present for at least 3 of these.
You can get a dictionary of paper info from each author using Scholarly:
from scholarly import scholarly
AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    print(author)
The output looks something like (just a small excerpt from what you'd get from one author)
{'bib': {'cites': '69',
'title': 'Chalearn looking at people and faces of the world: Face '
'analysis workshop and challenge 2016',
'year': '2016'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:_FxGoFyzp5QC',
'source': 'citations'},
{'bib': {'cites': '21',
'title': 'The NoXi database: multimodal recordings of mediated '
'novice-expert interactions',
'year': '2017'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:0EnyYjriUFMC',
'source': 'citations'},
{'bib': {'cites': '11',
'title': 'Automatic habitat classification using image analysis and '
'random forest',
'year': '2014'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:qjMakFHDy7sC',
'source': 'citations'},
{'bib': {'cites': '10',
'title': 'AutoRoot: open-source software employing a novel image '
'analysis approach to support fully-automated plant '
'phenotyping',
'year': '2017'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:hqOjcs7Dif8C',
'source': 'citations'}
How can I collect the bib and specifically title for papers which are present for three or more out of the four authors?
EDIT: in fact it's been pointed out that id_citations is not unique for each paper (my mistake); better to just use the title itself.
Expanding on my comment, you can achieve this using Pandas groupby:
import pandas as pd
from scholarly import scholarly

AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
frames = []
for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    # creating DataFrame with authors
    df = pd.DataFrame([x.__dict__ for x in author.publications])
    df['author'] = Author
    frames.append(df.copy())

# joining all author DataFrames
df = pd.concat(frames, axis=0)
# taking bib dict into separate columns
df[['title', 'cites', 'year']] = pd.DataFrame(df.bib.to_list())
# counting unique authors attached to each title
n_authors = df.groupby('title').author.nunique()
# locating the unique titles for all publications with n_authors >= 2
output = n_authors[n_authors >= 2].index
This finds 202 papers which have 2 or more of the authors in that list (out of 774 total papers). Here is an example of the output:
Index(['1, 1′-Homodisubstituted ferrocenes containing adenine and thymine nucleobases: synthesis, electrochemistry, and formation of H-bonded arrays',
'722: Iron chelation by biopolymers for an anti-cancer therapy; binding up the'ferrotoxicity'in the colon',
'A Luminescent One-Dimensional Copper (I) Polymer',
'A Unidirectional Energy Transfer Cascade Process in a Ruthenium Junction Self-Assembled by r-and-Cyclodextrins',
'A Zinc(II)-Cyclen Complex Attached to an Anthraquinone Moiety that Acts as a Redox-Active Nucleobase Receptor in Aqueous Solution',
'A ditopic ferrocene receptor for anions and cations that functions as a chromogenic molecular switch',
'A ferrocene nucleic acid oligomer as an organometallic structural mimic of DNA',
'A heterodifunctionalised ferrocene derivative that self-assembles in solution through complementary hydrogen-bonding interactions',
'A locking X-ray window shutter and collimator coupling to comply with the new Health and Safety at Work Act',
'A luminescent europium hairpin for DNA photosensing in the visible, based on trimetallic bis-intercalators',
...
'Up-Conversion Device Based on Quantum Dots With High-Conversion Efficiency Over 6%',
'Vectorial Control of Energy‐Transfer Processes in Metallocyclodextrin Heterometallic Assemblies',
'Verteporfin selectively kills hypoxic glioma cells through iron-binding and increased production of reactive oxygen species',
'Vibrational Absorption from Oxygen-Hydrogen (Oi-H2) Complexes in Hydrogenated CZ Silicon',
'Virginia review of sociology',
'Wildlife use of log landings in the White Mountain National Forest',
'Yttrium 1995',
'ZUSCHRIFTEN-Redox-Switched Control of Binding Strength in Hydrogen-Bonded Metallocene Complexes Stichworter: Carbonsauren. Elektrochemie. Metallocene. Redoxchemie …',
'[2] Rotaxanes comprising a macrocylic Hamilton receptor obtained using active template synthesis: synthesis and guest complexation',
'pH-controlled delivery of luminescent europium coated nanoparticles into platelets'],
dtype='object', name='title', length=202)
Since all of the data is in Pandas, you can also explore which authors are attached to each of the papers, as well as all of the other information you have access to within the author.publications array coming in from scholarly.
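For example, to see which of the four authors are attached to each shared title, a small sketch building on the df and output defined above:
# for each shared title, list the authors from AuthorList attached to it
authors_per_title = (
    df[df['title'].isin(output)]
    .groupby('title')['author']
    .unique()
)
print(authors_per_title.head())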
First, let's convert this into a more friendly format. You say that the id_citations is unique for each paper, so we'll use it as a hashtable/dict key.
We can then map each id_citation to the bib dict and author(s) it appears for, as a list of tuples (bib, author_name).
author_list = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
bibs = {}
for author_name in author_list:
    search_query = scholarly.search_author(author_name)
    for bib in search_query:
        bib = bib.fill()
        bibs.setdefault(bib['id_citations'], []).append((bib, author_name))
Thereafter, we can sort the keys in bibs based on how many authors are attached to them:
most_cited = sorted(bibs.items(), key=lambda k: len(k[1]))
# most_cited is now a list of tuples (key, value)
# which maps to (id_citation, [(bib1, author1), (bib2, author2), ...])
and/or filter that list to citations that have only three or more appearances:
cited_enough = [tup[1][0][0] for tup in most_cited if len(tup[1]) >= 3]
# using key [0] in the middle is arbitrary. It can be anything in the
# list, provided the bib objects are identical, but index 0 is guaranteed
# to be there.
# otherwise, the first index is to grab the list rather than the id_citation,
# and the last index is to grab the bib, rather than the author_name
and now we can retrieve the titles of the papers from there:
paper_titles = [bib['bib']['title'] for bib in cited_enough]
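Given the edit above (id_citations turned out not to be unique), the same tally can be keyed on title instead. A minimal sketch, using hypothetical stand-in data in place of the scholarly results:
from collections import defaultdict

# hypothetical stand-in for the titles collected per author via scholarly
titles_by_author = {
    'Zoe Pikramenou': ['Paper A', 'Paper B'],
    'James H. R. Tucker': ['Paper A', 'Paper C'],
    'Alison Rodger': ['Paper A', 'Paper B'],
    'Timothy Dafforn': ['Paper B'],
}

authors_per_title = defaultdict(set)
for author, titles in titles_by_author.items():
    for title in titles:
        authors_per_title[title].add(author)

# keep titles shared by three or more of the authors
shared = [t for t, authors in authors_per_title.items() if len(authors) >= 3]
print(shared)  # ['Paper A', 'Paper B']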

Capturing only specific sections/patterns of string with Regex

I have the following strings, which always follow a standard format:
'On 10/31/2018, Sally Brown picked 25 apples at the orchard.'
'On 11/01/2018, John Smith picked 12 peaches at the orchard.'
'On 09/15/2018, Jim Roe picked 10 pears at the orchard.'
I want to extract certain data fields into a series of lists:
['10/31/2018','Sally Brown','25','apples']
['11/01/2018','John Smith','12','peaches']
['09/15/2018','Jim Roe','10','pears']
As you can see, I need some of the sentence structure to be recognized, but not captured, so the program has context for where the data is located. The Regex that I thought would work is:
(?<=On\s)\d{2}\/\d{2}\/\d{4},\s(?=[A-Z][a-z]+\s[A-Z][a-z]+)\s.+?(?=\d+)\s(?=[a-z]+)\sat\sthe\sorchard\.
But of course, that is incorrect somehow.
This may be a simple question for someone, but I'm having trouble finding the answer. Thanks in advance, and someday when I'm more skilled I'll pay it forward on here.
Use \w+ to match a word; \w is shorthand for [a-zA-Z0-9_].
import re

text = '''On 10/31/2018, Sally Brown picked 25 apples at the orchard.
On 11/01/2018, John Smith picked 12 peaches at the orchard.
On 09/15/2018, Jim Roe picked 10 pears at the orchard.'''
arr = re.findall(r'On\s(.*?),\s(\w+\s\w+)\s\w+\s(\d+)\s(\w+)', text)
print(arr)
# [('10/31/2018', 'Sally Brown', '25', 'apples'),
# ('11/01/2018', 'John Smith', '12', 'peaches'),
# ('09/15/2018', 'Jim Roe', '10', 'pears')]
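Alternatively, a sketch using named groups, which also checks the date and name shapes from the question (the group names here are arbitrary):
import re

pattern = re.compile(
    r'On (?P<date>\d{2}/\d{2}/\d{4}), '
    r'(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) picked '
    r'(?P<count>\d+) (?P<fruit>[a-z]+) at the orchard\.'
)
m = pattern.search('On 10/31/2018, Sally Brown picked 25 apples at the orchard.')
if m:
    print([m['date'], m['name'], m['count'], m['fruit']])
    # ['10/31/2018', 'Sally Brown', '25', 'apples']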

Query dictionary based on a criteria and skip values that are missing

data = [
    {'firstname': 'Tom ', 'lastname': 'Frank', 'title': 'Mr',
     'education': 'B.Sc'},
    {'firstname': 'Anne ', 'middlename': 'David', 'lastname': 'Frank',
     'title': 'Doctor', 'education': 'Ph.D'},
    {'firstname': 'Ben ', 'lastname': 'William', 'title': 'Mr'}
]
I want to query the list of dictionaries based on the key 'education'. If a person's record does not have this key, the entire dictionary should be skipped. The desired output is
[('Mr Tom Frank', 'B.Sc'),
 ('Doctor Anne David Frank', 'Ph.D')]
My attempt leaves extra spaces between Tom and Frank, as in 'Mr Tom   Frank', as well as between Anne and David. Here is the actual output:
[('Mr Tom   Frank', 'B.Sc'), ('Doctor Anne  David Frank', 'Ph.D')]
I would like to avoid this if possible.
Here is the code I have written. I apologize if the code does not seem to be readable enough and I am ready to take any comments.
def qualified_applicants(data):
    full_name_education = []
    keys = ['title', 'firstname', 'middlename', 'lastname']
    for record in data:
        # check to see if 'education' is one of the keys
        if 'education' in record:
            full_name = [' '.join([record.get(key, '') for key in keys])]
            # make a tuple of the full name and education
            full_name_education.append(tuple(full_name + [record['education']]))
    return full_name_education
You can use regex:
import re

new_data = [
    (re.sub(r'\s{2,}', ' ',
            ' '.join(re.sub(r'\s+$', '', record.get(key, ''))
                     for key in ['title', 'firstname', 'middlename', 'lastname'])),
     record['education'])
    for record in data
    if 'education' in record
]
Output:
[('Mr Tom Frank', 'B.Sc'), ('Doctor Anne David Frank', 'Ph.D')]
The 'firstname' entries for your data appear to have a trailing blank. You can trim such leading and trailing white space using the strip method of the string returned by record.get(). This would make your list comprehension line be:
full_name = [' '.join([record.get(key,'').strip() for key in keys])]
to be tolerant of the extra whitespace.
FWIW, I think you would probably be better off having full_name not be a list but a plain string.
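For instance, a sketch of that change inside the loop:
if 'education' in record:
    full_name = ' '.join(record.get(key, '').strip() for key in keys)
    full_name_education.append((full_name, record['education']))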
The code seems to be working with the addition of one line, like so:
temp = [' '.join(record.get(key, '') for key in keys)]
full_name = [' '.join(full_name.split()) for full_name in temp]
The rest of the lines didn't need any change.
This could be verbose, but it is working. What is the most pythonic way of achieving the same result?
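One possible concise version, a sketch that normalizes whitespace with ' '.join(s.split()):
def qualified_applicants(data):
    keys = ('title', 'firstname', 'middlename', 'lastname')
    return [
        (' '.join(' '.join(record.get(key, '') for key in keys).split()),
         record['education'])
        for record in data
        if 'education' in record
    ]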

Conditionals on replacement in a string

So I may have the strings 'Bank of China', 'Embassy of China', and 'International China'.
I want to remove all country instances except when they are preceded by 'of ' or 'of the '.
Clearly this can be done by iterating through a list of countries, checking if the name contains a country, then checking whether 'of ' or 'of the ' appears before the country.
If these do exist then we do not remove the country; otherwise we do. The examples above would become:
'Bank of China', 'Embassy of China', and 'International'
However, iteration can be slow, particularly when you have a large list of countries and a large list of texts for replacement.
Is there a faster, more conditional way of replacing the string, so that I can still use a simple pattern match with the Python re library?
My function is along these lines:
def removeCountry(name):
    for country in countries:
        if country in name:
            if 'of ' + country in name:
                return name
            if 'of the ' + country in name:
                return name
            else:
                name = re.sub(country + '$', '', name).strip()
                return name
    return name
EDIT: I did find some info here. This does describe how to do an if, but I really want a
if not 'of '
if not 'of the '
then replace...
You could compile a few sets of regular expressions, then pass your list of input through them. Something like:
import re

countries = ['foo', 'bar', 'baz']
takes = [re.compile(r'of\s+(the)?\s*%s$' % (c), re.I) for c in countries]
subs = [re.compile(r'%s$' % (c), re.I) for c in countries]

def remove_country(s):
    for regex in takes:
        if regex.search(s):
            return s
    for regex in subs:
        s = regex.sub('', s)
    return s

print(remove_country('the bank of foo'))
print(remove_country('the bank of the baz'))
print(remove_country('the nation bar'))
''' Output:
the bank of foo
the bank of the baz
the nation
'''
It doesn't look like anything faster than linear time complexity is possible here. At least you can avoid recompiling the regular expressions a million times and improve the constant factor.
Edit: I had a few typos, but the basic idea is sound and it works. I've added an example.
I think you could use the approach in Python: how to determine if a list of words exist in a string to find any countries mentioned, then do further processing from there.
Something like
countries = [
"Afghanistan",
"Albania",
"Algeria",
"Andorra",
"Angola",
"Anguilla",
"Antigua",
"Arabia",
"Argentina",
"Armenia",
"Aruba",
"Australia",
"Austria",
"Azerbaijan",
"Bahamas",
"Bahrain",
"China",
"Russia"
# etc
]
def find_words_from_set_in_string(set_):
    set_ = set(set_)
    def words_in_string(s):
        return set_.intersection(s.split())
    return words_in_string

get_countries = find_words_from_set_in_string(countries)
then
get_countries("The Embassy of China in Argentina is down the street from the Consulate of Russia")
returns
set(['Argentina', 'China', 'Russia'])
... which obviously needs more post-processing, but very quickly tells you exactly what you need to look for.
As pointed out in the linked article, you must be wary of words ending in punctuation, which could be handled by stripping it from each token, e.g. word.strip(" \t\r\n,.!?;:'\""). You may also want to look for adjectival forms, i.e. "Russian", "Chinese", etc.
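A sketch of that tokenizing step, using a regex so trailing punctuation cannot block a match:
import re

def get_words(s):
    # pull out alphabetic runs, ignoring punctuation entirely
    return set(re.findall(r'[A-Za-z]+', s))

print(get_words('...the Consulate of Russia.') & {'Russia', 'China'})
# {'Russia'}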
Not tested:
def removeCountry(name):
    for country in countries:
        # Python's re module requires fixed-width lookbehind, so two separate
        # assertions are used instead of the variable-width (?<!of (the )?)
        name = re.sub(r'(?<!of )(?<!of the )' + country + '$', '', name).strip()
    return name
Using negative lookbehind, re.sub matches and replaces only when the country is not preceded by 'of ' or 'of the '.
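A quick check of the fixed pattern, assuming countries = ['China']:
>>> countries = ['China']
>>> removeCountry('International China')
'International'
>>> removeCountry('Embassy of China')
'Embassy of China'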
The re.sub function accepts a function as replacement text, which is called in order to get the text that should be substituted in the given match. So you could do this:
import re

def make_regex(countries):
    escaped = (re.escape(country) for country in countries)
    states = '|'.join(escaped)
    return re.compile(r'\s+(of(\sthe)?\s)?(?P<state>{})'.format(states))

def remove_name(match):
    name = match.group()
    if name.lstrip().startswith('of'):
        return name
    else:
        return name.replace(match.group('state'), '').strip()

regex = make_regex(['China', 'Italy', 'America'])
regex.sub(remove_name, 'Embassy of China, International Italy').strip()
# result: 'Embassy of China, International'
The result might contain some spurious space (in the above case a last strip() is needed). You can fix this modifying the regex to:
\s*(of(\sthe)?\s)?(?P<state>({}))
To catch the spaces before of or before the country name and avoid the bad spacing in the output.
Note that this solution can handle a whole text, not just text of the form Something of Country and Something Country. For example:
In [38]: regex = make_regex(['China'])
...: text = '''This is more complex than just "Embassy of China" and "International China"'''
In [39]: regex.sub(remove_name, text)
Out[39]: 'This is more complex than just "Embassy of China" and "International"'
an other example usage:
In [33]: countries = [
...: 'China', 'India', 'Denmark', 'New York', 'Guatemala', 'Sudan',
...: 'France', 'Italy', 'Australia', 'New Zealand', 'Brazil',
...: 'Canada', 'Japan', 'Vietnam', 'Middle-Earth', 'Russia',
...: 'Spain', 'Portugal', 'Argentina', 'San Marino'
...: ]
In [34]: template = 'Embassy of {0}, International {0}, Language of {0} is {0}, Government of {0}, {0} capital, Something {0} and something of the {0}.'
In [35]: text = 100 * '\n'.join(template.format(c) for c in countries)
In [36]: regex = make_regex(countries)
...: result = regex.sub(remove_name, text)
In [37]: result[:150]
Out[37]: 'Embassy of China, International, Language of China is, Government of China, capital, Something and something of the China.\nEmbassy of India, Internati'
