I want to split a string of items on a specific symbol.
I have used the following code:
data = "launch, 7:30am, watch tv, workout, snap, running, research — study and learn"
items = data.split(',')
print(', '.join([items[0], items[-1].split('—')[1]]))
What I want is to split this data and print it like this:
launch, study and learn
But a problem appears when the data changes like this:
data = "launch, 7:30am, watch tv, workout, snap, running, research — discussion, study and learn"
items = data.split(',')
print(', '.join([items[0], items[-1].split('—')[1]]))
In this case I expected to get this result:
launch, discussion, study and learn
Instead, an error appears: "list index out of range". That makes sense: I told Python to split on ",", so "discussion" and "study and learn" become separate items, and the last item no longer contains the "—" symbol. I would rather not rewrite the code for each case. Is it possible to reuse the same code for both inputs? Is it possible to read everything after the "—" symbol?
It seems your expected output depends on the word research.
We can implement this with a regex that searches for the word research and returns the characters after it.
You can try this:
# -*- coding: utf-8 -*-
import re
re.split(r' *research[^A-Za-z0-9]+', data)[-1]
#study and learn
#discussion, study and learn
Full Code:
# -*- coding: utf-8 -*-
import re
print ("{0}, {1}".format(data.split(',')[0], (re.split(r' *research[^A-Za-z0-9]+',data))[-1]))
#launch, study and learn
#launch, discussion, study and learn
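If the goal is only to read whatever follows the dash symbol, a regex may not be needed. Here is a minimal sketch (Python 3) using str.partition instead of the approach above. The function name first_and_tail is made up for illustration, and it assumes the separator is the same character the question's code splits on.

```python
def first_and_tail(data, sep="—"):
    """Return the first comma-separated item plus everything after `sep`."""
    first = data.split(",")[0].strip()
    # partition() returns (before, sep, after); `found` is empty if sep is absent.
    _, found, tail = data.partition(sep)
    if not found:
        return first
    return "{0}, {1}".format(first, tail.strip())


print(first_and_tail("launch, 7:30am, running, research — study and learn"))
print(first_and_tail("launch, 7:30am, running, research — discussion, study and learn"))
```

Unlike the regex version, this does not depend on the word research being present, but it does return only the first item when the separator is missing.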
Read more about Python regex:
https://docs.python.org/3/library/re.html
or about regular expressions here:
https://www.w3schools.com/python/python_regex.asp
A few weeks ago I needed a crawler for data collection and sorting, so I started learning Python.
The same day I wrote a simple crawler, but the code looked ugly as hell, mainly because I don't know how to do certain things and I don't know how to properly google them.
Example:
Instead of deleting [, ], ' and , in one line, I did:
extra_nr = extra_nr.replace("'", '')
extra_nr = extra_nr.replace("[", '')
extra_nr = extra_nr.replace("]", '')
extra_nr = extra_nr.replace(",", '')
I did this because I couldn't work with the list object directly, and when I did str(list object) it looked like ['this', 'and this'].
Now I'm creating a Discord bot that will upload the data I feed it to a Google spreadsheet. The code is long and ugly, and it takes about 2-3 seconds to start the bot (I don't know if this is normal; I think the more I write, the longer it takes to start, which makes me think the code is garbage). Sometimes it works, sometimes it doesn't.
My question is: how do I know that I wrote something good? And if I just keep adding stuff like in the example, how will it affect my program? If I have really long code, do I split it and call the parts only when they are needed, or how does it work?
tl;dr to get good at Python and write good code, write a lot of Python and read other people's code. Learn multiple approaches to different problem types and get a feel for which to use and when. It's something that comes over time with a lot of practice. As far as resources, I highly recommend the book "Automate the Boring Stuff with Python".
As for your code sample, you could use translate for this:
def strip(my_string):
    bad_chars = [*"[],'"]
    return my_string.translate({ord(c): None for c in bad_chars})
translate does a character by character translation of the string given a translation table, so you create a small translation table with the characters you don't want set to None.
The list of characters you don't want is created by unpacking (splatting) a string of the characters.
>>> [*"abc"] == ["a", "b", "c"]
True
Another option would be using comprehensions:
def strip(my_string):
    bad_chars = [*"[],'"]
    return "".join(c for c in my_string if c not in bad_chars)
Here we use the comprehension format (x for x in y), strictly speaking a generator expression, to walk through the characters of my_string and drop any character that appears in bad_chars. We then join the remaining characters into a string that doesn't have the specified characters in it.
You will definitely improve quickly from reading (or listening) up on Python best practices from resources like Real Python and Talk Python To Me.
Meanwhile, I'd recommend starting to use code analysers like pylint and bandit as part of your regular workflow.
In any case, welcome to the world of Python and enjoy! :-)
You can use maketrans() to define characters to remove (3rd parameter):
def clean(S): return S.translate(str.maketrans("","","[],'"))
clean("A['23']") # 'A23'
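Since the brackets, quotes and commas only appear because str() was called on a list, another option is to skip that step and join the list elements directly. This is a small sketch, not something from the original code: the name list_to_text and the default separator are assumptions chosen to match what the chained replace() calls produce.

```python
def list_to_text(items, sep=" "):
    """Join list elements into one string without going through str(list)."""
    # str() each element so non-string items (e.g. numbers) can be joined too.
    return sep.join(str(item) for item in items)


print(list_to_text(['this', 'and this']))  # this and this
```

With a single space as separator this gives the same text as str(list) followed by the four replace() calls, as long as the elements themselves contain none of the removed characters.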
So I am working with a CSV that has a many to one relationship and I have 2 problems I need assistance in solving. The first is that I have the string set up like
thisismystr=thisisanemail#addy.com,blah,blah,blah, startnewCSVcol
So I need to split the string twice, once on = and once on , as I am basically attempting to get the portion that is an e-mail address (thisisanemail#addy.com) so far I have figured out how to split the string on the = using something like this:
str = "thisismystr=thisisanemail#addy.com,blah,blah,blah"
print str.split("=")
Which returns this "thisisanemail#addy.com,blah,blah,blah"... however this leaves the ,blah,blah,blah portion to be removed... after a bit of research I am stumped as nothing explains how to remove from the middle, just the 1st part or the last part. Does anyone know how to do this?
For the 2nd part, I need to do this for multiple lines, so this is more of an advice question: is it best to plug this into a variable and loop through (like i = 1 for i, #endofCSV do splitcmd), or is there a more efficient way to do this? I am more familiar with Lua, and I am learning that the more I work with Python, the more it differs from Lua.
Please help. Thanks!
Does this solve your problem?
#!/usr/bin/env python
#-*- coding:utf-8 -*-
myString = 'thisismystr=thisisanemail#addy.com,blah,blah,blah'
a = myString.split('=')
b = []
for i in a:
    b.extend(i.split(','))
print b
I believe you want the email out of strings in this format: 'thisismystr=thisisanemail#addy.com,blah,blah,blah'
This is how you would do that:
# 'line' is used instead of 'str' to avoid shadowing the built-in str type
line = 'thisismystr=thisisanemail#addy.com,blah,blah,blah'
email = line.split('=')[1].split(',')[0]
print email
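For the second part of the question (doing this for many lines), a plain loop over the lines is usually enough. The sketch below is written for Python 3 and is only an illustration: extract_email and extract_emails are hypothetical helper names, and it assumes every useful line has the key=email,... shape shown in the question.

```python
def extract_email(line):
    """Return the text between the first '=' and the next ',' (or None)."""
    _, found, rest = line.partition("=")
    if not found:
        return None  # no '=' on this line, nothing to extract
    return rest.split(",", 1)[0].strip()


def extract_emails(lines):
    """Collect the extracted value from every line that has one."""
    emails = []
    for line in lines:
        email = extract_email(line)
        if email:
            emails.append(email)
    return emails


sample = [
    "thisismystr=thisisanemail#addy.com,blah,blah,blah",
    "otherstr=another#addy.com,blah",
]
print(extract_emails(sample))
```

An open file object can be passed as lines, since iterating over a file yields one line at a time.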
I am scraping a webpage that contains HTML that looks like this in the browser
<td>LGG® MAX multispecies probiotic consisting of four bacterial trains</td>
<td>LGG® MAX helps to reduce gastro-intestinal discomfort</td>
Taking just the LGG®: in the first instance it is the literal character, LGG®. In the second instance, ® is written as a character reference (&#174;) in the source code.
I am using Python 2.7, mechanize and BeautifulSoup.
My difficulty is that the ® is uplifted by mechanize, and carried through and is ultimately printed out or written to file.
There are many other special characters. Some are 'converted' on output and the ® are converted to a muddle.
The webpage is declared as UTF-8 and the only reference I make to encoding is when I open my out file. I've declared UTF-8. If I don't the writing to file bombs on other characters.
I am working on Windows 7. Other details:
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'cp850'
>>> locale.getdefaultlocale()
('en_GB', 'cp1252')
>>>
Can anyone give me any tips on the best way to handle the special characters? Or should they be called HTML entities? This must be a fairly common problem but I haven't been able to find any straightforward explanations on the web.
UPDATE: I've made some progress here.
The basic algorithm is
1. Read the webpage in mechanize.
2. Use Beautiful Soup to do what... as I write it down I have no idea what this pre-processing stage is for, exactly.
3. Use Beautiful Soup to extract information from a table that is orderly other than for the treatment of special characters.
4. Write the information to file delimited by | to account for punctuation in long cell entries and to allow for importing into Excel etc.
The progress is in stage 3. I've used some regex and htmlentitydefs to change the code cell entry by cell entry. See this blog post.
Remaining difficulty: the code written to file (and printed to screen) is still incorrect but it appears that the problem is now a matter of specifying the coding correctly. The problem seems smaller at least.
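The last stage (writing |-delimited rows with an explicit encoding) could look roughly like the sketch below. It assumes Python 3 rather than the Python 2.7 used in the question, and write_pipe_delimited is a made-up helper name. The csv module takes care of quoting any cell that itself contains the delimiter.

```python
import csv


def write_pipe_delimited(path, rows):
    """Write rows (lists of text cells) to `path` as UTF-8, '|'-delimited."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="|")
        writer.writerows(rows)
```

Declaring the encoding once, at the point where the file is opened, keeps the cell text as ordinary Unicode strings everywhere else in the program.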
To answer the question from the title:
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
html = u"""
<td>LGG® MAX multispecies probiotic consisting of four bacterial trains</td>
<td>LGG&#174; MAX helps to reduce gastro-intestinal discomfort</td>
"""
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
print(''.join(soup('td', text=True)))
Output
LGG® MAX multispecies probiotic consisting of four bacterial trains
LGG® MAX helps to reduce gastro-intestinal discomfort
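On Python 3 (3.4 or later), which is an assumption here since the question uses Python 2.7, the standard library can decode both named and numeric references without any extra packages. This is a sketch around html.unescape; the wrapper name decode_entities is only for illustration.

```python
import html


def decode_entities(text):
    """Turn HTML character references such as &#174; or &reg; into characters."""
    return html.unescape(text)


print(decode_entities("LGG&#174; MAX"))  # LGG® MAX
print(decode_entities("LGG&reg; MAX"))   # LGG® MAX
```

Text that already contains the literal ® character passes through unchanged, so both forms end up identical after decoding.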
I'm trying to build an xml document from scratch using xml.dom.minidom. Everything was going well until I tried to make a text node with a ® (Registered Trademark) symbol in. My objective is for when I finally hit print mydoc.toxml() this particular node will actually contain a ® symbol.
First I tried:
import xml.dom.minidom as mdom
data = '®'
which gives the rather obvious error of:
File "C:\src\python\HTMLGen\test2.py", line 3
SyntaxError: Non-ASCII character '\xae' in file C:\src\python\HTMLGen\test2.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
I have of course also tried changing the encoding of my python script to 'utf-8' using the opening line comment method, but this didn't help.
So I thought
import xml.dom.minidom as mdom
data = '&#174;' #Both accepted xml encodings for registered trademark
data = '&reg;'
text = mdom.Text()
text.data = data
print data
print text.toxml()
But because the ampersands are escaped when I print text.toxml(), I get this output:
&reg;
&amp;reg;
My question is, does anybody know of a way that I can force the ampersands not to be escaped in the output, so that I can have my special character reference carry through to the XML document?
Basically, for this node, I want print text.toxml() to produce output of &reg; or &#174; in a happy and cooperative way!
EDIT 1:
By the way, if minidom actually doesn't have this capacity, I am perfectly happy using another module that you can recommend which does.
EDIT 2:
As Hugh suggested, I tried using data = u'®' (while also using the # -*- coding: utf-8 -*- Python source tag). This almost helped, in the sense that it caused the ® symbol itself to be output to my XML. This is actually not the result I am looking for. As you may have guessed by now (and perhaps I should have specified earlier), this XML document happens to be an HTML page, which needs to work in a browser. So having ® in the document ends up causing rubbish in the browser (Â® to be precise!).
I also tried:
data = unichr(174)
text.data = data.encode('ascii','xmlcharrefreplace')
print text.toxml()
But of course this led to the same original problem, where all that happens is the ampersand gets escaped by .toxml().
My ideal scenario would be some way of escaping the ampersand so that the XML printing function won't "escape" it on my behalf for the document (in other words, achieving my original goal of having &reg; or &#174; appear in the document).
Seems like soon I'm going to have to resort to regular expressions!
EDIT 2a:
Or perhaps not. Seems like getting my html meta information correct <META http-equiv="Content-Type" Content="text/html; charset=UTF-8"> could help, but I'm not sure yet how this fits in with the xml structure...
Two options that work, one with the escaping (&#174;) and the other without. It's not really obvious why you want escaping: it's 6 bytes instead of the 2 or 3 bytes for non-CJK characters.
import xml.dom.minidom as mdom
text = mdom.Text()
# Start with unicode
text.data = u'\xae'
f = open('reg1.html', 'w')
f.write("header saying the file is ascii")
uxml = text.toxml()
bxml = uxml.encode('ascii', 'xmlcharrefreplace')
f.write(bxml)
f.close()
f = open('reg2.html', 'w')
f.write("header saying the file is UTF-8")
xml = text.toxml(encoding='UTF-8')
f.write(xml)
f.close()
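For readers on Python 3, the first option can be sketched as follows. This is an adaptation under assumptions, not the code above: the helper name text_as_ascii_xml is invented, and the node is created through a Document so that it has an owner.

```python
import xml.dom.minidom as mdom


def text_as_ascii_xml(value):
    """Serialize a text node, replacing non-ASCII characters with &#NNN;."""
    doc = mdom.Document()
    node = doc.createTextNode(value)
    # toxml() escapes markup characters such as & and <;
    # encode() then turns anything outside ASCII into a numeric reference.
    return node.toxml().encode("ascii", "xmlcharrefreplace")


print(text_as_ascii_xml("\u00ae"))  # b'&#174;'
```

The key point is the order: the ampersand in the character reference is produced after minidom has finished escaping, so it is not escaped a second time.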
If I understand correctly, what you really want is to be able to create a text node from a unicode object (e.g. u'®' or u'\u00ae') and then have toxml() output unicode characters encoded as entities (e.g. &#174;). Looking at the source of minidom.py, however, it seems that minidom doesn't support entity encoding on output except the special cases of &, ", < and >.
You also ask about alternative modules that could help, however. There are several possible candidates, but ElementTree (xml.etree) seems to do the appropriate encoding. For example, if you take the first example from this blog post by Doug Hellmann but replace:
child_with_tail.text = 'This child has regular text.'
... with:
child_with_tail.text = u'This child has regular text \u00ae.'
... and run the script, you should see the output contains:
This child has regular text &#174;.
You could also use the lxml implementation of ElementTree in that example just by replacing the import statement with:
from lxml.etree import Element, SubElement, Comment, tostring
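Without the blog post at hand, the same behaviour can be checked with a much smaller sketch. It assumes Python 3 and the standard-library xml.etree.ElementTree, whose tostring() defaults to us-ascii output; element_as_ascii_xml is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def element_as_ascii_xml(tag, text):
    """Build a single element and serialize it with the default ASCII encoding."""
    elem = ET.Element(tag)
    elem.text = text
    # With the default encoding, characters outside ASCII are written
    # as numeric character references.
    return ET.tostring(elem)


print(element_as_ascii_xml("td", "LGG\u00ae MAX"))  # b'<td>LGG&#174; MAX</td>'
```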
Update: the alternative answer from John Machin takes the nice approach of running .encode('ascii', 'xmlcharrefreplace') on the output from minidom's toxml(), which converts any non-ASCII characters to their equivalent XML numeric character references.
Default unescape:
from xml.sax.saxutils import unescape
unescape("&lt; &amp; &gt;")
The result is,
'< & >'
And, unescape more:
unescape("&apos; &quot;", {"&apos;": "'", "&quot;": '"'})
Check details here, https://wiki.python.org/moin/EscapingXml
I am working on a latex document that will require typesetting significant amounts of python source code. I'm using pygments (the python module, not the online demo) to encapsulate this python in latex, which works well except in the case of long individual lines - which simply continue off the page. I could manually wrap these lines except that this just doesn't seem that elegant a solution to me, and I prefer spending time puzzling about crazy automated solutions than on repetitive tasks.
What I would like is some way of processing the python source code to wrap the lines to a certain maximum character length, while preserving functionality. I've had a play around with some python and the closest I've come is inserting \\\n in the last whitespace before the maximum line length - but of course, if this ends up in strings and comments, things go wrong. Quite frankly, I'm not sure how to approach this problem.
So, is anyone aware of a module or tool that can process source code so that no lines exceed a certain length - or at least a good way to start to go about coding something like that?
You might want to extend your current approach a bit by using the tokenize module from the standard library to determine where to put your line breaks. That way you can see the actual tokens (COMMENT, STRING, etc.) of your source code rather than just the whitespace-separated words.
Here is a short example of what tokenize can do:
>>> from cStringIO import StringIO
>>> from tokenize import tokenize
>>>
>>> python_code = '''
... def foo(): # This is a comment
...     print 'foo'
... '''
>>>
>>> fp = StringIO(python_code)
>>>
>>> tokenize(fp.readline)
1,0-1,1: NL '\n'
2,0-2,3: NAME 'def'
2,4-2,7: NAME 'foo'
2,7-2,8: OP '('
2,8-2,9: OP ')'
2,9-2,10: OP ':'
2,11-2,30: COMMENT '# This is a comment'
2,30-2,31: NEWLINE '\n'
3,0-3,4: INDENT '    '
3,4-3,9: NAME 'print'
3,10-3,15: STRING "'foo'"
3,15-3,16: NEWLINE '\n'
4,0-4,0: DEDENT ''
4,0-4,0: ENDMARKER ''
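Building on that, here is a rough Python 3 sketch of how the token positions could be used as a starting point for a wrapper. It only reports, for each over-long line, the columns where a token starts; deciding which of those are legal break points (for example, only inside brackets) is left out. The function name token_starts_on_long_lines and the max_len parameter are made up for this illustration.

```python
import io
import tokenize

# Token types that carry layout only, not source text worth breaking before.
_LAYOUT = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.ENDMARKER)


def token_starts_on_long_lines(source, max_len=79):
    """Map each over-long line number to [(column, token type name), ...]."""
    lines = source.splitlines()
    long_rows = [i for i, line in enumerate(lines, start=1) if len(line) > max_len]
    starts = {row: [] for row in long_rows}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        row, col = tok.start
        if row in starts and tok.type not in _LAYOUT:
            starts[row].append((col, tokenize.tok_name[tok.type]))
    return starts


code = "x = 1\ny = some_function(argument_one, 'a string with spaces in it')  # comment\n"
for row, tokens in token_starts_on_long_lines(code, max_len=20).items():
    print(row, tokens)
```

Because a whole string or comment is a single token, the spaces inside it never show up as candidate positions, which is exactly the failure mode of breaking at the last whitespace.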
I use the listings package in LaTeX to insert source code; it handles syntax highlighting, line breaks, and so on.
Put the following in your preamble:
\usepackage{listings}
%\lstloadlanguages{Python} # Load only these languages
\newcommand{\MyHookSign}{\hbox{\ensuremath\hookleftarrow}}
\lstset{
% Language
language=Python,
% Basic setup
%basicstyle=\footnotesize,
basicstyle=\scriptsize,
keywordstyle=\bfseries,
commentstyle=,
% Looks
frame=single,
% Linebreaks
breaklines,
prebreak={\space\MyHookSign},
% Line numbering
tabsize=4,
stepnumber=5,
numbers=left,
firstnumber=1,
%numberstyle=\scriptsize,
numberstyle=\tiny,
% Above and beyond ASCII!
extendedchars=true
}
The package also has hooks for inline code, including entire files, showing code as figures, ...
I'd check for a reformat tool in an editor like NetBeans.
When you reformat Java, it properly fixes the lengths of lines both inside and outside of comments. If the same algorithm were applied to Python, it would work.
For Java it allows you to set any wrapping width and a bunch of other parameters. I'd be pretty surprised if that didn't exist for Python, either natively or as a plugin.
Can't tell for sure just from the description, but it's worth a try:
http://www.netbeans.org/features/python/