Extra characters extracted with XPath and Python (html) - python

I have been using XPath with scrapy to extract text from HTML tags online, but when I do I get extra characters attached. An example is trying to extract a number, like "204", from a <td> tag and getting [u'204']. In some cases it's much worse. For instance, trying to extract "1 - Mathoverflow" instead returns [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']. Is there a way to prevent this, or to trim the strings so that the extra characters aren't part of the string? (I am using items to store the data.) It looks like it has something to do with formatting, so how do I get XPath to not pick that stuff up?

What does the line of code look like that returns [u'204']? It looks like what is being returned is a Python list containing a unicode string with the value you want. Nothing wrong there; just subscript it. As for the carriage returns, line feeds, and tabs, as Wai Yip Tung just answered, strip() will take them out.
Probably:
my_answer = item1['Title'][0].strip()
Or, if you are expecting several matches:
for ans_i in item1['Title']:
    do_something_with(ans_i.strip())
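For instance, taking the messier value from the question (a minimal sketch; the variable names here are made up, not from the original code):
raw = [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']  # what the XPath extraction returned
title = raw[0].strip()  # subscript the list, then strip the surrounding whitespace
print(title)  # 1 – MathOverflow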

The standard XPath function normalize-space() has exactly the effect you want.
It strips leading and trailing white space and collapses any inner run of whitespace into a single space.
So, you could use:
normalize-space(someExpression)
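In a scrapy selector this might look like the following (a sketch; the response object, the XPath, and the use of extract() are assumptions, not taken from the question):
titles = response.xpath('normalize-space(//td/text())').extract()
# e.g. [u'1 \u2013 MathOverflow'] rather than [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']
Note that normalize-space() applied to a node-set only normalizes the first node, so the expression may need adjusting if several text nodes should be returned.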

Use strip() to remove the leading and trailing white spaces.
>>> u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t '.strip()
u'1 \u2013 MathOverflow'

Related

Python: retrieve substring bounded by indentations on a text file

I am having trouble even identifying indentations in a text file with Python (the ones that appear when you press Tab). I thought that using the split function would be helpful, but it seems like there has to be a physical character that can act as the 'separator'.
Here is a sample of the text, where I am trying to retrieve the string 'John'. Assume that the spaces are the indentations:
15:50:00 John 1029384
All help is appreciated! Thanks!
Depending on the program you used to create the file, what actually gets inserted when you press Tab may be either a TAB character (\t) or a series of spaces.
You were actually right in thinking that split() is a way to do what you want. If you don't pass any arguments to it, it treats any run of whitespace (spaces and tabs alike) as a single separator:
s = "15:50:00 John 1029384"
t = "15:50:00\tJohn\t1029384"
s.split() # Output: ['15:50:00', 'John', '1029384']
t.split() # Output: ['15:50:00', 'John', '1029384']
Tabs are represented by \t. See https://www.w3schools.com/python/gloss_python_escape_characters.asp for a longer list.
So we can do the following:
s = "15:50:00 John 1029384"
s.split("\t") # Output: ['15:50:00', 'John', '1029384']
If you know regex, then you can use look-ahead and look-behind as follows:
import re
re.search("(?<=\t).*?(?=\t)", s)[0] # Output: "John"
Obviously, both methods will need to be made more robust by considering edge cases and error handling (e.g., what happens if there are fewer, or more, than two tabs in the string, and how do you identify the name in that case?). A minimal defensive sketch is shown below.
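A minimal defensive version of the split() approach might look like this (the three-field layout of time, name, and id is an assumption based on the sample line above):
def extract_name(line):
    fields = line.split("\t")
    if len(fields) != 3:
        raise ValueError("expected 3 tab-separated fields, got %d: %r" % (len(fields), line))
    return fields[1]  # the middle field is the name in the sample layout

print(extract_name("15:50:00\tJohn\t1029384"))  # John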

How can I code away html and special characters in python?

1543159687.4969957::I think I\u2019ve gotten far enough into my experiment to give an update: Last year, Child of Humanity was free for Blac\u2026 https://t.co/M3HR5fAoFZ"
This is the result that I am getting. I'd like to create a regex to replace special elements like \u2019 and \u2026 with a space. They always start with "\u" and continue for four more characters.
I'd also like to get rid of the html. It always starts with "https://t.co/" and continues for 10 characters.
I've tried the code below but it is clearly wrong.
tweet = re.sub("#[\\u].{4}", "", tweet)
Those \u characters are simply Unicode characters; there is nothing you need to do, since they will be converted automatically when you print mystring.
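A quick way to see this for yourself (a sketch; mystring here is a shortened stand-in for the actual tweet):
mystring = 'I think I\u2019ve gotten far enough into my experiment\u2026'
print(ascii(mystring))  # escaped form: 'I think I\u2019ve gotten far enough into my experiment\u2026'
print(mystring)         # printed form: I think I’ve gotten far enough into my experiment…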
As to the final url, you can do:
removed = re.sub(r'http\S*$', '', mystring) # remove the final http string.
>>> removed
'1543159687.4969957::I think I’ve gotten far enough into my experiment to give an update: Last year, Child of Humanity was free for Blac… '

How can I count and efficiently replace inner double-quotes in a string?

I have a raw data file with approximately 2.6 million lines of data, and in each of these lines I have a string representing a URL. Unfortunately, some of these URLs have a rogue quotation mark in them:
"www.stackoverflow.com/quest"ions/ask"
My approach as of right now is to count the number of quotations in a line and if it's greater than two, simply use the first quotation and the last quotation in the line to determine where the string is supposed to start and end.
Is there a more efficient way to approach this?
EDIT:
The string that specifies the URL isn't the entire line, it's only a portion of the entire line. An entire line of data is as follows, and is delimited with spaces:
asc755.usask.ca - - [13/Jul/1995:17:27:51 -0600] "GET stackoverflow.com/pos"ts/41656163 HTTP/1.0" 200 2273
So I can't actually edit anything within the intended quotations, because the intended quotations are arbitrary.
I think it depends on how many URLs are broken. But you could skip the counting, strip out all double quotes, and then add the outer ones back to the string afterwards:
s = '"www.stackoverflow.com/quest"ions/ask"'
x = '"%s"' % s.replace('"', '')
You may need to use a more powerful tool. Without seeing more examples of your inputs, I imagine you could use a simple regex to purge double quotes that are embedded within the strings. Grab everything between the outer quotes with this:
^"(.+)"$
Then replace the " with an empty string. Share more info about the data you're working with if it's more complicated than this.
Here's a link to a working capture: Link
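For the full log line from the question, the first-and-last-quote idea described in the question itself can be sketched like this (the field handling is an assumption; only the quoted request field is cleaned):
line = ('asc755.usask.ca - - [13/Jul/1995:17:27:51 -0600] '
        '"GET stackoverflow.com/pos"ts/41656163 HTTP/1.0" 200 2273')
first = line.index('"')   # opening quote of the request field
last = line.rindex('"')   # closing quote of the request field
request = line[first + 1:last].replace('"', '')  # drop any rogue inner quotes
print(request)  # GET stackoverflow.com/posts/41656163 HTTP/1.0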

Python regular expression to pull text inside of HTML quotation marks

I'm attempting to pull ticker symbols from corporations' 10-K filings on EDGAR. The ticker symbol typically appears between a pair of HTML quotation marks, e.g., "‘" or "’". An example of a typical portion of relevant text:
Our common stock has been listed on the New York Stock Exchange (“NYSE”) under the symbol “RXN”
At this point I am just trying to figure out how to deal with the occurrence of one or more of a variety of quotation marks. I can write a regex that matches one particular type of quotation mark:
re.findall(r'under[^<]*the[^<]*symbol[^<]*“*[^<]*\n',fileText)
However, I can't write a regex that looks for more than one type of quotation mark. This regex produces nothing:
re.findall(r'under[^<]*the[^<]*symbol[^<]*“*‘*’*“*[^<]*\n',fileText)
Any help would be appreciated.
Your regex looks for all of the quotes occurring together. If you're looking for any one of the possibilities, you need to put parentheses around each string and OR them together:
(?:“)*|(?:‘)*|(?:’)*|(?:“)*
The ?: makes the paren groups non-capturing, i.e., the parser won't save each one as important text. As an aside, you'll probably want to use group capturing to save the ticker symbol, which is what you're actually looking for. A very quick-and-dirty (and ugly) expression that will return ['NYSE', 'RXN'] from the given string:
re.findall(r'(?:(?:“)|(?:&#14[567];)|(?:&#822[01];))(.+?)(?:(?:“)|(?:&#14[567];)|(?:&#822[01];))', fileText)
You'd probably want to include only left quotes in the first group and right quotes in the last group, plus either-or quotes in both.
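A sketch of that refinement, with left-quote forms opening the group and right-quote forms closing it (the specific entity codes and the sample string are assumptions, not taken from an actual filing):
import re

fileText = 'listed on the New York Stock Exchange (&#8220;NYSE&#8221;) under the symbol &#8220;RXN&#8221;'
pattern = r'(?:&#8220;|&#147;|\u201c)(.*?)(?:&#8221;|&#148;|\u201d)'
print(re.findall(pattern, fileText))  # ['NYSE', 'RXN']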
You can use
re.sub("&#([0-9]+);", lambda x:chr(int(x.group(1))), text)
This works because re.sub accepts a callable for the replacement part. The number after "#" is the Unicode code point of the character, and Python's chr() function converts it to the corresponding character.
For example:
re.sub("&#([0-9]+);", lambda x:chr(int(x.group(1))),
"this is a “test“")
results in
'this is a “test“'
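As an aside (not part of the original answer), on Python 3 the standard library's html module can decode numeric entities as well, which may be simpler if you don't need a regex:
import html

print(html.unescape("this is a &#8220;test&#8220;"))  # this is a “test“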

Replace newlines in a Unicode string

I am trying to replace newline characters in a unicode string and seem to be missing some magic codes.
My particular example is that I am working on AppEngine and trying to put titles from HTML pages into a db.StringProperty() in my model.
So I do something like:
link.title = unicode(page_title,"utf-8").replace('\n','').replace('\r','')
and I get:
Property title is not multi-line
Are there other codes I should be using for the replace?
Try ''.join(unicode(page_title, 'utf-8').splitlines()). splitlines() should let the standard library take care of all the possible crazy Unicode line breaks, and then you just join them all back together with the empty string to get a single-line version.
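A quick check of that approach (shown here with an already-decoded string rather than the raw UTF-8 bytes from the question):
page_title = u'Example Title\r\nwith a stray\u2028line separator'
print(u''.join(page_title.splitlines()))  # Example Titlewith a strayline separator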
Python uses these characters for splitting in unicode.splitlines():
U+000A LINE FEED (\n)
U+000D CARRIAGE RETURN (\r)
U+001C FILE SEPARATOR
U+001D GROUP SEPARATOR
U+001E RECORD SEPARATOR
U+0085 NEXT LINE
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
As Hank says, using splitlines() will let Python take care of all of the details for you, but if you need to do it manually, then this should be the complete list.
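If you do need to handle it manually, one sketch is to delete exactly the code points listed above with str.translate (the sample title is made up):
LINE_BREAKS = dict.fromkeys(map(ord, u'\n\r\x1c\x1d\x1e\x85\u2028\u2029'))

page_title = u'Example\u2028Title\r\nWith Breaks'
print(page_title.translate(LINE_BREAKS))  # ExampleTitleWith Breaks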
It would be useful to print the repr() of the page_title that is seen to be multiline, but the obvious candidate would be '\r'.
