Consider this list:
data = ["http://x.com/", "http://x.com/some/dir/", "http://x.com/other", "http://y.com/something", "http://y.com/else"]
I want to deduplicate by domain, keeping the first URL for each, so the expected output is:
http://x.com/
http://y.com/something
I know about the list(set(data)) trick, but it won't work for this case.
I thought of iterating over the list and building a dict in key : value form, with the domain as the key and the whole URL as the value, taking only the first occurrence, but that technique feels crappy and unpythonic.
This gets you one entry per domain (happens to be the last, not the first):
from urllib.parse import urlparse
data = ["http://x.com/", "http://x.com/some/dir/", "http://x.com/other", "http://y.com/something", "http://y.com/else"]
result = list({urlparse(url).netloc: url for url in data}.values())
If you prefer the first:
result = list({urlparse(url).netloc: url for url in reversed(data)}.values())
print(result)
Outcome:
['http://y.com/something', 'http://x.com/']
This works as follows:
urlparse('https://somedomain.com/some/path') will break down the URL, and one of the parts, .netloc, is the domain you're after, i.e. 'somedomain.com'
{urlparse(url).netloc: url for url in reversed(data)} reverses the list data, then for each url it gets the domain and adds an entry to the dictionary being constructed, with the domain as the key and the URL as the value. Since keys in a dictionary have to be unique, every time the same domain comes up the entry is overwritten; hence the reversal, which makes the first URL in the original order the last one written, so it wins.
list(somedict.values()) just takes the values of the dictionary and turns them into a simple list.
So, that explains how result = list({urlparse(url).netloc: url for url in reversed(data)}.values()) results in ['http://y.com/something', 'http://x.com/'] for your input data.
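To make the intermediate step concrete, here is the dict the comprehension builds for the sample data (Python 3.7+, where dicts preserve insertion order):

from urllib.parse import urlparse

data = ["http://x.com/", "http://x.com/some/dir/", "http://x.com/other",
        "http://y.com/something", "http://y.com/else"]

by_domain = {urlparse(url).netloc: url for url in reversed(data)}
print(by_domain)
# {'y.com': 'http://y.com/something', 'x.com': 'http://x.com/'}
print(list(by_domain.values()))
# ['http://y.com/something', 'http://x.com/']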
I am new to Python and had a question about updating a list using a for loop. Here is my code:
urls = ['http://www.city-data.com/city/javascript:l("Abbott");' , 'http://www.city-data.com/city/javascript:l("Abernathy");' ,
'http://www.city-data.com/city/Abilene-Texas.html' ,'http://www.city-data.com/city/javascript:l("Abram-Perezville");' ,
'http://www.city-data.com/city/javascript:l("Ackerly");' , 'http://www.city-data.com/city/javascript:l("Adamsville");',
'http://www.city-data.com/city/Addison-Texas.html']
for url in urls:
    if "javascript" in url:
        print url
        url = url.replace('javascript:l("','').replace('");','-Texas.html')
        print url
for url in urls:
    if "javascript" in url:
        url = url.replace('javascript:l("','').replace('");','-Texas.html')

print "\n"
print urls
I used the first for loop to check that the syntax was correct, and it worked fine. The second for loop is the code I actually want to use, but it's not working properly. How would I go about globally updating the list with the second for loop so I can print or store the updated list outside of the loop?
You can update the list items using their index:
for i, url in enumerate(urls):
    if "javascript" in url:
        urls[i] = url.replace('javascript:l("','').replace('");','-Texas.html')
Another alternative is to use a list comprehension:
def my_replace(s):
    return s.replace('javascript:l("','').replace('");','-Texas.html')

urls[:] = [my_replace(url) if "javascript" in url else url for url in urls]
Here urls[:] means replace all the items of urls list with the new list created by the list comprehension.
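As a quick illustration of why the slice assignment matters: urls[:] = ... replaces the contents of the existing list object, so every other reference to that list sees the update, while plain rebinding creates a new object. A minimal sketch:

urls = ['a', 'b']
alias = urls            # second reference to the same list object
urls[:] = ['x', 'y']    # replace the contents in place
print(alias)            # ['x', 'y'] -- the alias sees the change
urls = ['p', 'q']       # rebinding points urls at a brand-new list
print(alias)            # still ['x', 'y']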
The reason your code didn't work is that you're assigning the variable url to something else, and changing one reference to an object to point at something else doesn't affect the other references. So, your code is equivalent to:
>>> lis = ['aa', 'bb', 'cc']
>>> url = lis[0] #create new reference to 'aa'
>>> url = lis[0].replace('a', 'd') #now assign url to a new string that was returned by `lis[0].replace`
>>> url
'dd'
>>> lis[0]
'aa'
Also note that str.replace always returns a new copy of the string; it never changes the original, because strings are immutable in Python. If lis[0] were a list and you performed an in-place operation on it using .append, .extend, etc., then that would have affected the original list as well.
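For contrast, here is the mutable case that last paragraph describes; an in-place operation through one reference is visible through all the others:

>>> lis = [['a'], ['b']]
>>> item = lis[0]        # another reference to the same inner list
>>> item.append('x')     # in-place mutation, no rebinding
>>> lis[0]
['a', 'x']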
I would like to create an Alexa skill in Python that uses data uploaded by sensors to Thingspeak. The cases where I only use one specific value are quite easy: the response from Thingspeak is the value alone. When I want to use several values, in my case to aggregate the atmospheric pressure readings to determine tendencies, the response is a JSON object like this:
{"channel":{"id":293367,"name":"Weather Station","description":"My first attempt to build a weather station based on an ESP8266 and some common sensors.","latitude":"51.473509","longitude":"7.355569","field1":"humidity","field2":"pressure","field3":"lux","field4":"rssi","field5":"temp","field6":"uv","field7":"voltage","field8":"radiation","created_at":"2017-06-25T07:35:37Z","updated_at":"2018-08-04T12:11:22Z","elevation":"121","last_entry_id":1812},"feeds":
[{"created_at":"2018-10-21T18:11:45Z","entry_id":1713,"field2":"1025.62"},
{"created_at":"2018-10-21T18:12:05Z","entry_id":1714,"field2":"1025.58"},
{"created_at":"2018-10-21T18:12:25Z","entry_id":1715,"field2":"1025.56"},
{"created_at":"2018-10-21T18:12:45Z","entry_id":1716,"field2":"1025.65"},
{"created_at":"2018-10-21T18:13:05Z","entry_id":1717,"field2":"1025.58"},
{"created_at":"2018-10-21T18:13:25Z","entry_id":1718,"field2":"1025.63"}]
I now started with
f = urllib.urlopen(link) # Get your data
json_object = json.load(f)
for entry in json_object[0]:
    print entry["field2"]
The json object is a bit recursive, it is a list containing a list with an element with an array as the value.
Now I am not quite sure how to iterate over the values of the key "field2" in the array. I am quite new to Python and also json. Perhaps anyone can help me out?
Thanks in advance!
This has nothing to do with JSON - once the JSON string is parsed by json.load(), what you get is a plain Python object (usually a dict, sometimes a list, rarely - but this would be legal - a string, int, float, boolean or None).
it is a list containing a list with an element with an array as the value.
Actually it's a dict with two keys, "channel" and "feeds". The first one has another dict for a value, and the second a list of dicts. How to use dicts and lists is extensively documented, FWIW:
https://docs.python.org/3/tutorial/datastructures.html#dictionaries
https://docs.python.org/3/library/stdtypes.html#mapping-types-dict
https://docs.python.org/3/tutorial/introduction.html#lists
https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range
Here the values you're looking for are stored under the "field2" keys of the dicts in the "feeds" key, so what you want is:
# get the list stored under the "feeds" key
feeds = json_object["feeds"]
# iterate over the list:
for feed in feeds:
    # get the value for the "field2" key
    print feed["field2"]
You have a dictionary. Use the key to access the value.
Ex:
json_object = {"channel":{"id":293367,"name":"Weather Station","description":"My first attempt to build a weather station based on an ESP8266 and some common sensors.","latitude":"51.473509","longitude":"7.355569","field1":"humidity","field2":"pressure","field3":"lux","field4":"rssi","field5":"temp","field6":"uv","field7":"voltage","field8":"radiation","created_at":"2017-06-25T07:35:37Z","updated_at":"2018-08-04T12:11:22Z","elevation":"121","last_entry_id":1812},"feeds":
[{"created_at":"2018-10-21T18:11:45Z","entry_id":1713,"field2":"1025.62"},
{"created_at":"2018-10-21T18:12:05Z","entry_id":1714,"field2":"1025.58"},
{"created_at":"2018-10-21T18:12:25Z","entry_id":1715,"field2":"1025.56"},
{"created_at":"2018-10-21T18:12:45Z","entry_id":1716,"field2":"1025.65"},
{"created_at":"2018-10-21T18:13:05Z","entry_id":1717,"field2":"1025.58"},
{"created_at":"2018-10-21T18:13:25Z","entry_id":1718,"field2":"1025.63"}]}
for entry in json_object["feeds"]:
    print entry["field2"]
Output:
1025.62
1025.58
1025.56
1025.65
1025.58
1025.63
I just figured it out; it was just as expected.
You have to get the feeds array from the dict and then iterate over the list of items, printing the value for the key field2.
# Get entries from the response
entries = json_object["feeds"]
# Iterate through each measurement and print value
for entry in entries:
    print entry['field2']
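Since the stated goal is to derive a pressure tendency from several readings, here is a minimal sketch building on the code above (link is the Thingspeak URL from the question; taking the difference between the newest and oldest reading is just one way to express a tendency):

# field2 arrives as strings, so convert to float before doing math
pressures = [float(feed["field2"]) for feed in json_object["feeds"]]

# positive = rising pressure over the window, negative = falling
tendency = pressures[-1] - pressures[0]
print(tendency)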
I am working on a project to scrape table data from a URL.
The main web domain is https://www.pro-football-reference.com. I have already written the code to scrape the table data from this domain.
A search for statistical data on this site might start with a query with set parameters. For example, here is the url with table data for all players in the National Football League who have thrown at least 25 passes in their careers:
input_url = 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td'
But this url only contains the statistical table data for players 1 - 100 on this list. So there are 7 additional urls with 100 players each and one additional url with 81 players.
The 2nd url from this query, containing a table with players 101-200, is here:
url_passing2 = 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=100'
Notice that these are exactly the same until the very last part, where there is the additional extension string '&offset=100'. Each additional page has the same host/path/query string plus '&offset=200', '&offset=300', '&offset=400', and so on up to '&offset=800'.
My question is this: starting with a url like this, how can I create a Python function that will collect a list of all of the possible url iterations from this host/path/query string, so that I can get the entire list of players who match this query?
My desired output would be a list that looks something like this:
list_or_urls: ['https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td', 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=100', 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=200', 
'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=300', 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=400', 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=500', 
'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=600', 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=700', 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=800']
Or, more concisely:
list of urls = ['&offset=0', '&offset=100', '&offset=200', '&offset=300', '&offset=400', '&offset=500', '&offset=600', '&offset=700', '&offset=800']
The following is what I have so far for my attempt at creating the function. My approach is to iterate through the urls and check whether there is a table on each url. The idea is that if there is a table on the page, append the url to my output list, and if there is not, exit the function. But this only produces a list of the first two urls -- it's not looping back to append the last 7 urls to the list.
def get_url_list(frontpage_url):
    offset_extension = ''
    output_list = [frontpage_url]
    x = 0
    for url in output_list:
        results_table = pd.read_html(url)
        table_results = pd.DataFrame(results_table)
        if table_results.empty == False:
            x += 1
            offset_extension = '&offset=' + '%d' % (100 * x)
            output_list.append(frontpage_url + offset_extension)
        else:
            exit
    return output_list[1:-1]

urls_list_output = get_url_list(sports_url_starter)
Your approach looks okay, but your for loop is incorrect. When using for you don't need to index into the list you are looping through, i.e. you shouldn't use output_list[x].
Try replacing it with something like:
results = []
for url in output_list:
    try:
        # read_html returns a list of dataframes, so add all to the results
        results.extend(pd.read_html(url))
        ...
        output_list.append(new_url)
    except ValueError:
        # if there are no tables on the page return what you have so far
        return results
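Alternatively, if the page count is known up front, as in this question (offsets 0 through 800 in steps of 100), a plain list comprehension is enough. A minimal sketch, where base_url stands for the long query URL without an offset and the bounds are the ones given in the question:

def get_url_list(base_url, max_offset=800, step=100):
    # offset=0 is just the base URL itself
    return [base_url] + ['%s&offset=%d' % (base_url, off)
                         for off in range(step, max_offset + step, step)]

urls_list_output = get_url_list(input_url)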
This seems like a simple task and I'm not sure if I've accomplished it already, or if I'm chasing my tail.
values = [value.replace('-','') for value in values] ## strips out hyphen (only 1)
print values ## outputs ['0160840020']
parcelID = str(values) ## convert to string
print parcelID ##outputs ['0160840020']
url = 'Detail.aspx?RE='+ parcelID ## outputs Detail.aspx?RE=['0160840020']
As you can see, I'm trying to append the number to the end of the URL in order to change the page via a POST parameter. My question is: how do I strip the [' prefix and '] suffix? I've already tried parcelID.strip("['") with no luck. Am I doing this correctly?
values is a list (of length 1), which is why it appears in brackets. If you want to get just the ID, do:
parcelID = values[0]
Instead of
parcelID = str(values)
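Putting it together with the data from the question:

values = ['0160840020']             # after the hyphen-stripping step
parcelID = values[0]                # the string itself, not the list's repr
url = 'Detail.aspx?RE=' + parcelID
print(url)                          # Detail.aspx?RE=0160840020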
Assuming you actually have a list of values when you perform this (and not just one item), this would solve your problem (it would also work for one item, as you have shown):
values = [value.replace('-','') for value in values] ## strips out hyphen (only 1)
# create a list of urls from the parcelIDs
urls = ['Detail.aspx?RE='+ str(parcelID) for parcelID in values]
# use each url one at a time
for url in urls:
    # do whatever you need to do with each URL
I'm quite new to programming, so I'm sure there's a terser way to pose this, but I'm trying to create a personal bookmarking program. Given multiple urls, each with a list of tags ordered by relevance, I want to be able to create a search consisting of a list of tags that returns a list of the most relevant urls. My first solution, below, is to give the first tag a value of 1, the second 2, and so on, and let the Python list sort function do the rest. Two questions:
1) Is there a much more elegant/efficient way of doing this (embarrass me!)
2) Any other general approaches to the sorting by relevance given the inputs above problem?
Much obliged.
# Given a list of saved urls each with a corresponding user-generated taglist
# (ordered by relevance), the user enters a "search" list-of-tags, and is
# returned a sorted list of urls.
# Generate sample "content" linked-list-dictionary. The rationale is to
# be able to add things like 'title' etc at later stages and to
# treat each url/note as in independent entity. But a single dictionary
# approach like "note['url1']=['b','a','c','d']" might work better?
content = []
note = {'url':'url1', 'taglist':['b','a','c','d']}
content.append(note)
note = {'url':'url2', 'taglist':['c','a','b','d']}
content.append(note)
note = {'url':'url3', 'taglist':['a','b','c','d']}
content.append(note)
note = {'url':'url4', 'taglist':['a','b','d','c']}
content.append(note)
note = {'url':'url5', 'taglist':['d','a','c','b']}
content.append(note)
# An example search term of tags, ordered by importance
# I'm using a dictionary with an ordinal number system
# This seems clumsy
search = {'d':1,'a':2,'b':3}
# Create a tagCloud with one entry for each tag that occurs
tagCloud = []
for note in content:
    for tag in note['taglist']:
        if tagCloud.count(tag) == 0:
            tagCloud.append(tag)
# Create a dictionary that associates an integer value denoting
# relevance (1 is most relevant etc) for each existing tag
d={}
for tag in tagCloud:
    try:
        d[tag] = search[tag]
    except KeyError:
        d[tag] = 100
# Create a [[relevance, tag],[],[],...] result list & sort
result = []
for note in content:
    resultNote = []
    for tag in note['taglist']:
        resultNote.append([d[tag], tag])
    resultNote.append(note['url'])
    result.append(resultNote)
result.sort()
# Remove the relevance values & recreate a list containing
# the url string followed by corresponding tags.
# It's so hacky I've forgotten how it works!
# It's mostly for display, but suggestions on "best-practice"
# intermediate-form data storage?
finalResult=[]
for note in result:
    temp = []
    temp.append(note.pop())
    for tag in note:
        temp.append(tag[1])
    finalResult.append(temp)
print "Content: ", content
print "Search: ", search
print "Final Result: ", finalResult
1) Is there a much more elegant/efficient way of doing this (embarrass me!)
Sure thing. The basic idea: quit trying to tell Python what to do, and just ask it for what you want.
content = [
    {'url':'url1', 'taglist':['b','a','c','d']},
    {'url':'url2', 'taglist':['c','a','b','d']},
    {'url':'url3', 'taglist':['a','b','c','d']},
    {'url':'url4', 'taglist':['a','b','d','c']},
    {'url':'url5', 'taglist':['d','a','c','b']}
]
search = {'d' : 1, 'a' : 2, 'b' : 3}
# We can create the tag cloud like this:
# tagCloud = set(sum((note['taglist'] for note in content), []))
# But we don't actually need it: instead, we'll just use a default value
# when looking things up in the 'search' dict.
# Create a [[relevance, tag],[],[],...] result list & sort
result = sorted(
    [
        [search.get(tag, 100), tag]
        for tag in note['taglist']
    ] + [[note['url']]]
    # The result will look like [ [relevance, tag], ..., [url] ]
    # Note that the url is wrapped in a list too. This makes the
    # last processing step easier: we just take the last element of
    # each nested list.
    for note in content
)
# Remove the relevance values & recreate a list containing
# the url string followed by corresponding tags.
finalResult = [
    [x[-1] for x in note]
    for note in result
]
print "Content: ", content
print "Search: ", search
print "Final Result: ", finalResult
I suggest you also give a weight to each tag, depending on how rare it is (e.g. a “tarantula” tag would weigh more than a “nature” tag¹). For a given URL, rare tags that are common with other URLs should mark a stronger relevance, while frequently used tags of the given URL not existing in another URL should mark down the relevance.
It's easy to convert the rules I describe above into calculations of a numerical relevance for every URL.
¹ unless all your URLs are related to “tarantulas”, of course :)
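As a rough sketch of that idea (not a full implementation; the names tag_counts and relevance are illustrative only): weight each tag by the inverse of the number of notes that carry it, so rare shared tags contribute more to the score:

from collections import Counter

# how many notes carry each tag, across the whole collection
tag_counts = Counter(tag for note in content for tag in note['taglist'])

def relevance(note, search):
    score = 0.0
    for tag in note['taglist']:
        if tag in search:
            # rarer tags (smaller count) contribute more
            score += 1.0 / tag_counts[tag]
    return score

ranked = sorted(content, key=lambda note: relevance(note, search), reverse=True)
print([note['url'] for note in ranked])

With the toy data above every tag appears in every note, so the weights only start to matter on a real collection.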