I am using .rsplit() to take the digits that come after the last comma in a string and separate each of them with further commas. The transformation should look like this:
Before:
,000
After:
,0,0,0
I am using the following method to do this:
upl = line.rsplit(",",1)[1:]
upl2 = "{}".format(",".join(list(upl[0])))
As a comparison, to ensure that the correct substring is being selected to begin with, I am also using this statement:
upl1 = "{}".format("".join(list(upl[0])))
I then print both to ensure that they are both as expected. In this example I get:
upl1 = ,000
upl2 = ,0,0,0,
I then use a .replace() statement to substitute out my before substring with my after one:
new_var = ''
for line in new_var.split("\n"):
    upl = line.rsplit(",", 1)[1:]
    upl1 = "{}".format("".join(list(upl[0])))
    upl2 = "{}".format(",".join(list(upl[0])))
    upl2 = str(upl2)
    upl1 = str(upl1)
    new_var += line.replace(upl1, upl2) + '\n'
In almost all instances of the parsed data the old substring is replaced with the new one correctly. However, in a few cases the substituted string displays as:
,0,00 when it should be ,0,0,0,
Can anyone see anything obvious as to why this might be happening? I am at a bit of a loss.
Thanks
EDIT:
Here is the Scrapy code I am using to generate the data I am manipulating. The issue comes from this line:
new_match3g += line.replace(spl1, spl2).replace(tpl1, tpl2).replace(upl1, upl2) + '\n'
The full code is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json

class ExampleSpider(CrawlSpider):
    name = "mrcrawl2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 5

    rules = [Rule(SgmlLinkExtractor(allow=('/Seasons'),deny=('/News', '/Fixtures', '/Graphics', '/Articles', '/Live', '/Matches', '/Explanations', '/Glossary', '/Players', 'ContactUs', 'TermsOfUse', 'Jobs', 'AboutUs', 'RSS'),), follow=False, callback='parse_item')]

    def parse_item(self, response):
        sel = Selector(response)
        regex = re.compile('DataStore\.prime\(\'history\', { stageId: \d+ },\[\[.*?\]\]?\)?;', re.S)
        match2g = re.search(regex, response.body)
        if match2g is not None:
            match3g = match2g.group()
            match3g = str(match3g)
            match3g = match3g.replace("'", '').replace("'", '').replace('[', '').replace(']', '').replace('] );', '')
            match3g = re.sub("DataStore\.prime\(history, { stageId: \d+ },", '', match3g)
            match3g = match3g.replace(');', '')
            #print '-' * 170, '\n', match3g.decode('utf-8'), '-' * 170, '\n'

            new_match3g = ''
            for line in match3g.split("\n"):
                upl = line.rsplit(",", 1)[1:]
                if upl:
                    upl1 = "{}".format("".join(list(upl[0])))
                    upl2 = "{}".format(",".join(list(upl[0])))
                    upl2 = str(upl2)
                    upl1 = str(upl1)
                    new_match3g += line.replace(upl1, upl2) + '\n'

                    print "UPL1 = ", upl1
                    print "UPL2 = ", upl2

            print '-' * 170, '\n', new_match3g.decode('utf-8'), '-' * 170, '\n'
            print '-' * 170, '\n', match3g.decode('utf-8'), '-' * 170, '\n'

execute(['scrapy','crawl','mrcrawl2'])
Since you've given us an example, let's trace it through:
>>> line = ',9243,46,Unterhaching,2,11333,8,13,1,133'
>>> split = line.rsplit(",",1)
>>> split
[',9243,46,Unterhaching,2,11333,8,13,1', '133']
>>> upl = split[1:]
>>> upl
['133']
>>> upl0 = upl[0]
>>> upl0
'133'
>>> upl0_list = list(upl0)
>>> upl0_list
['1', '3', '3']
>>> joined1 = "".join(upl0_list)
>>> joined1
'133'
>>> upl1 = "{}".format(joined1)
>>> upl1
'133'
>>> joined2 = ",".join(upl0_list)
>>> joined2
'1,3,3'
>>> upl2 = "{}".format(joined2)
>>> upl2
'1,3,3'
>>> upl2 = str(upl2)
>>> upl2
'1,3,3'
>>> upl1 = str(upl1)
>>> upl1
'133'
>>> r = line.replace(upl1, upl2)
>>> r
',9243,46,Unterhaching,2,11,3,33,8,13,1,1,3,3'
Again, notice that more than half of the steps don't actually do anything at all. You're converting strings to the same strings, then converting them to the same strings again; you're converting them to lists just to join them back together; etc. If you can't explain what each step is supposed to do, why are you doing them? Your code is supposed to be instructions to the computer to do something; just giving it random instructions that you don't understand isn't going to do any good.
More importantly, that's not the output you described. It has a different problem than the one you described: in addition to correctly replacing the 133 at the end with 1,3,3, it's also replacing the embedded 133 in the middle of 11333 with 11,3,33. Because that's exactly what you're asking it to do.
So, assuming that's your actual problem, rather than the problem you asked about, how do you fix that?
Well, you don't. You don't want to replace every '133' substring with '1,3,3', so don't ask it to do that. You want to make a string with everything up to the last comma, followed by the processed version of everything after the last comma. In other words:
>>> ",".join([split[0], upl2])
',9243,46,Unterhaching,2,11333,8,13,1,1,3,3'
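Folding that back into the loop from your question, a sketch might look like this (new_match3g and match3g are the same variables as in your code; lines with no comma are passed through unchanged):
new_match3g = ''
for line in match3g.split("\n"):
    split = line.rsplit(",", 1)
    if len(split) == 2:
        # rebuild: everything before the last comma, then the comma-separated digits
        new_match3g += ",".join([split[0], ",".join(split[1])]) + "\n"
    else:
        new_match3g += line + "\n"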
I'd do it this way:
>>> ",000".replace("", ",")[2:]
',0,0,0,'
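For reference, str.replace with an empty search string inserts the replacement between every character and at both ends, which is why the first two characters are then sliced off:
>>> ",000".replace("", ",")
',,,0,0,0,'
>>> ",000".replace("", ",")[2:]
',0,0,0,'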
I have a URL as follows:
http://www.example.com/boards/results/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01
I need to insert a node 'us' in this case, as follows:
http://www.example.com/boards/results/us/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01
Using Python's urlparse library, I can get to the path as follows:
path = urlparse(url).path
... and then I use a complicated and ugly routine that splits the path on slashes, inserts the new node, and then reconstructs the URL:
>>> path = urlparse(url).path
>>> path.split('/')
['', 'boards', 'results', 'current:entry1,current:entry2', 'modular', 'table', 'alltables', 'alltables', 'alltables', '2011-01-01']
>>> ps = path.split('/')
>>> ps.insert(3, 'us')
>>> '/'.join(ps)
'/boards/results/us/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01'
Is there a more elegant/pythonic way to accomplish this using default libraries?
EDIT:
The 'results' in the URL is not fixed - it can be 'results' or 'products' or 'prices' and so on. However, it will always be right after 'boards'.
path = "http://www.example.com/boards/results/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01"
replace_start_word = 'results'
replace_word_length = len(replace_start_word)
replace_index = path.find(replace_start_word)
new_url = '%s/us%s' % (path[:replace_index + replace_word_length], path[replace_index + replace_word_length:])
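Alternatively, if you want to lean on the standard library and on the fact (from the edit) that the new node always goes right after the segment that follows 'boards', a rough sketch might look like this (insert_node and the default node name are mine, not from the question):
from urlparse import urlsplit, urlunsplit  # urllib.parse on Python 3

def insert_node(url, node='us'):
    parts = urlsplit(url)
    segments = parts.path.split('/')                      # ['', 'boards', 'results', ...]
    segments.insert(segments.index('boards') + 2, node)   # right after the segment following 'boards'
    return urlunsplit(parts._replace(path='/'.join(segments)))

print insert_node("http://www.example.com/boards/results/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01")
# http://www.example.com/boards/results/us/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01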
I'm new to programming and Python.
Background
My program accepts a url. I want to extract the username from the url.
The username is the subdomain.
If the subdomain is 'www', the username should be the main part of the domain. The rest of the domain should be discarded (e.g. '.com/', '.org/').
I've tried the following:
def get_username_from_url(url):
    if url.startswith(r'http://www.'):
        user = url.replace(r'http://www.', '', 1)
        user = user.split('.')[0]
        return user
    elif url.startswith(r'http://'):
        user = url.replace(r'http://', '', 1)
        user = user.split('.')[0]
        return user
easy_url = "http://www.httpwwwweirdusername.com/"
hard_url = "http://httpwwwweirdusername.blogger.com/"
print get_username_from_url(easy_url)
# output = httpwwwweirdusername (good! expected.)
print get_username_from_url(hard_url)
# output = weirdusername (bad! username should = httpwwwweirdusername)
I've tried many other combinations using strip(), split(), and replace().
Could you advise me on how to solve this relatively simple problem?
There is a module called urlparse that is made specifically for this task:
>>> from urlparse import urlparse
>>> url = "http://httpwwwweirdusername.blogger.com/"
>>> urlparse(url).hostname.split('.')[0]
'httpwwwweirdusername'
In the case of http://www.httpwwwweirdusername.com/ it would output www, which is not desired. There are workarounds to ignore the www part, for example, getting the first item from the split hostname that is not equal to www:
>>> from urlparse import urlparse
>>> url = "http://www.httpwwwweirdusername.com/"
>>> next(item for item in urlparse(url).hostname.split('.') if item != 'www')
'httpwwwweirdusername'
>>> url = "http://httpwwwweirdusername.blogger.com/"
>>> next(item for item in urlparse(url).hostname.split('.') if item != 'www')
'httpwwwweirdusername'
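Wrapped into a function with the same shape as the one in the question, a quick sketch (Python 2's urlparse; use urllib.parse on Python 3):
from urlparse import urlparse

def get_username_from_url(url):
    # take the first hostname label that is not 'www'
    return next(label for label in urlparse(url).hostname.split('.') if label != 'www')

print get_username_from_url("http://www.httpwwwweirdusername.com/")      # httpwwwweirdusername
print get_username_from_url("http://httpwwwweirdusername.blogger.com/")  # httpwwwweirdusername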
It is possible to do this with regular expressions (the regex could probably be modified to be more accurate/efficient).
import re
url_pattern = re.compile(r'.*/(?:www\.)?(\w+)')

def get_username_from_url(url):
    match = re.match(url_pattern, url)
    if match:
        return match.group(1)
easy_url = "http://www.httpwwwweirdusername.com/"
hard_url = "http://httpwwwweirdusername.blogger.com/"
print get_username_from_url(easy_url)
print get_username_from_url(hard_url)
Which yields us:
httpwwwweirdusername
httpwwwweirdusername
I am wondering how I could make an algorithm that parses a string for the hashtag symbol '#' and returns the full string, but wherever a word starts with a '#' symbol it becomes a link. I am using Python with Google App Engine (webapp2 and Jinja2), and I am building a blog.
Thanks
A more efficient and complete way to find the "hashwords":
import functools

def hash_position(string):
    return string.find('#')

def delimiter_position(string, delimiters):
    positions = filter(lambda x: x >= 0, map(lambda delimiter: string.find(delimiter), delimiters))
    try:
        return functools.reduce(min, positions)
    except TypeError:
        return -1

def get_hashed_words(string, delimiters):
    maximum_length = len(string)
    current_hash_position = hash_position(string)
    string = string[current_hash_position:]
    results = []
    counter = 0
    while current_hash_position != -1:
        current_delimiter_position = delimiter_position(string, delimiters)
        if current_delimiter_position == -1:
            results.append(string)
        else:
            results.append(string[0:current_delimiter_position])
        # Update offsets and the haystack
        string = string[current_delimiter_position:]
        current_hash_position = hash_position(string)
        string = string[current_hash_position:]
    return results

if __name__ == "__main__":
    string = "Please #clarify: What do you #mean with returning somthing as a #link. #herp"
    delimiters = [' ', '.', ',', ':']
    print(get_hashed_words(string, delimiters))
Imperative code with updates of the haystack looks a little bit ugly but hey, that's what we get for (ab-)using mutable variables.
And I still have no idea what you mean by "returning something as a link".
Hope that helps.
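For the record, with the sample string in the __main__ block above, this prints ['#clarify', '#mean', '#link', '#herp'].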
Not sure where you get the data for the link from, but maybe something like this (the href is left as a placeholder):
[('<a href="#">%s</a>' % word) for word in input.split() if word[0] == '#']
Are you talking about Twitter? Maybe something like this (the href here is just the bare tag name, as a placeholder)?
def get_hashtag_link(hashtag):
    if hashtag.startswith("#"):
        return '<a href="%s">%s</a>' % (hashtag[1:], hashtag)

>>> get_hashtag_link("#stackoverflow")
'<a href="stackoverflow">#stackoverflow</a>'
It will return None if hashtag is not a hashtag.
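For the original use case (returning the full string with every #word turned into a link), a regex substitution sketch could look like the following; the /tags/... link target is only a placeholder, since the question doesn't say where the links should point:
import re

HASHTAG_RE = re.compile(r'#(\w+)')

def linkify_hashtags(text):
    # '/tags/%s' is a made-up target; adjust it to whatever your blog uses
    return HASHTAG_RE.sub(lambda m: '<a href="/tags/%s">#%s</a>' % (m.group(1), m.group(1)), text)

print linkify_hashtags("Please #clarify what you #mean")
# Please <a href="/tags/clarify">#clarify</a> what you <a href="/tags/mean">#mean</a>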
Here's my code:
start = raw_input('Enter a name: ')
start_j = start.replace("A", "J")
start_j = start.replace("B", "J")
start_j = start.replace("C", "J")
print "Your name is " + start_j
Is there any way to put all the letters of the alphabet in one list so that I wouldn't have to repeat the same process again and again until I reach the letter "Z"?
I tried using loops, but I still can't seem to find the right way to do it.
Here's a scenario:
The user will be prompted to input a name.
If the name contains a letter other than "J", it will be automatically replaced using the replace() function.
Hence it will print out the input starting with J
Here's an example:
site = raw_input('Enter your website: ')
site = site.replace("http://", "")
site = site.replace("https://", "")
site = site.replace("ftp://", "")
print "Your website is: " + site
An expected input would be http://www.google.com
So the expected out would become:
Enter your website: http://www.google.com
Your website is: www.google.com
I'm looking for a way to put "http://", "https://", "ftp://" all in one list so that I wouldn't have to write
site = site.replace("something", "something")
many times.
You could use a regex to replace all of the letters at once:
>>> import re
>>> re.sub(r'[A-Z]', 'J', 'This Is A Test Name')
'Jhis Js J Jest Jame'
(After edit): You can use .startswith() and string slicing:
>>> name = 'A Name'
>>>
>>> if not name.startswith('J'):
... name = 'J' + name[1:]
...
>>> name
'J Name'
Although I'm not sure why you'd even need to check with .startswith(). Either way, the result will be the same.
You can use this:
remove_from_start = ["http://", "https://", "ftp://"]
for s in remove_from_start:
    if site.startswith(s):
        site = site[len(s):]
        break
Or a regular expression based solution:
import re
regex = '^(https?|ftp)://'
site = re.sub(regex, '', site)
import re
site = raw_input('Enter your website: ')
# input http://www.google.com or https://www.google.com or ftp://www.google.com
site = re.sub('^(?:https?|ftp)://', '', site)
print "Your website is: " + site
use a dictionary:
In [100]: import string
In [101]: dic=dict.fromkeys(string.ascii_uppercase,"J")
In [104]: start_j = raw_input('Enter a name: ')
Enter a name: AaBbCc
In [105]: "".join(dic.get(x,x) for x in start_j)
Out[105]: 'JaJbJc'
Edit:
In [124]: dic={"https:":"","http:":"","ftp:":""}
In [125]: strs="http://www.google.com"
In [126]: "".join(dic.get(x,x) for x in strs.split("//"))
Out[126]: 'www.google.com'
Use re, a dict, and a lambda:
import re

replace_to = {
    "http://": "",
    "https://": "",
    "ftp://": "",
}

re.sub("^(ht|f)tps?://", lambda match: replace_to[match.group(0)], YOUR_INPUT_STRING)
If I have a keyword, how can I get the lexer, once it encounters that keyword, to just grab the rest of the line and return it as a string? That is, once it encounters an end of line, it should return everything on that line after the keyword.
Here is the line I'm looking at:
description here is the rest of my text to collect
Thus, when the lexer encounters description, I would like "here is the rest of my text to collect" returned as a string
I have the following defined, but it seems to be throwing an error:
states = (
('bcdescription', 'exclusive'),
)
def t_bcdescription(t):
r'description '
t.lexer.code_start = t.lexer.lexpos
t.lexer.level = 1
t.lexer.begin('bcdescription')
def t_bcdescription_close(t):
r'\n'
t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
t.type="BCDESCRIPTION"
t.lexer.lineno += t.valiue.count('\n')
t.lexer.begin('INITIAL')
return t
This is part of the error being returned:
File "/Users/me/Coding/wm/wm_parser/ply/lex.py", line 393, in token
raise LexError("Illegal character '%s' at index %d" % (lexdata[lexpos],lexpos), lexdata[lexpos:])
ply.lex.LexError: Illegal character ' ' at index 40
Finally, if I wanted this functionality for more than one token, how could I accomplish that?
Thanks for your time
There is no big problem with your code. In fact, I just copied your code and ran it, and it works well:
import ply.lex as lex

states = (
    ('bcdescription', 'exclusive'),
)

tokens = ("BCDESCRIPTION",)

def t_bcdescription(t):
    r'\bdescription\b'
    t.lexer.code_start = t.lexer.lexpos
    t.lexer.level = 1
    t.lexer.begin('bcdescription')

def t_bcdescription_close(t):
    r'\n'
    t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
    t.type = "BCDESCRIPTION"
    t.lexer.lineno += t.value.count('\n')
    t.lexer.begin('INITIAL')
    return t

def t_bcdescription_content(t):
    r'[^\n]+'

lexer = lex.lex()

data = 'description here is the rest of my text to collect\n'

lexer.input(data)

while True:
    tok = lexer.token()
    if not tok: break
    print tok
And the result is:
LexToken(BCDESCRIPTION,' here is the rest of my text to collect\n',1,50)
So maybe you can check other parts of your code.
As for wanting this functionality for more than one token: simply define a similar pair of rules for each keyword, and when one of those keywords appears, start capturing the rest of the line with the same approach as above; see the sketch below.
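A minimal sketch of that idea, with a second, hypothetical keyword 'title' handled the same way as 'description' (the slice here uses lexpos without the +1, since lexpos already points just past the newline):
import ply.lex as lex

states = (
    ('bcdescription', 'exclusive'),
    ('bctitle', 'exclusive'),
)

tokens = ("BCDESCRIPTION", "BCTITLE")

def t_bcdescription(t):
    r'\bdescription\b'
    t.lexer.code_start = t.lexer.lexpos
    t.lexer.begin('bcdescription')

def t_bcdescription_content(t):
    r'[^\n]+'

def t_bcdescription_close(t):
    r'\n'
    # lexpos already points just past the '\n', so no +1 is needed here
    t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos]
    t.type = "BCDESCRIPTION"
    t.lexer.begin('INITIAL')
    return t

def t_bctitle(t):
    r'\btitle\b'
    t.lexer.code_start = t.lexer.lexpos
    t.lexer.begin('bctitle')

def t_bctitle_content(t):
    r'[^\n]+'

def t_bctitle_close(t):
    r'\n'
    t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos]
    t.type = "BCTITLE"
    t.lexer.begin('INITIAL')
    return t

lexer = lex.lex()
lexer.input('description first line of text\ntitle second line of text\n')

while True:
    tok = lexer.token()
    if not tok:
        break
    print tok   # one BCDESCRIPTION token and one BCTITLE token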
Without further information, it is not obvious why you need to use a lexer/parser for this.
>>> x = 'description here is the rest of my text to collect'
>>> a, b = x.split(' ', 1)
>>> a
'description'
>>> b
'here is the rest of my text to collect'