re.sub() gives NameError when no match - Python

So I'm trying to search and replace rows of text from a csv file, and I keep getting errors if re.sub() can't find any matches.
Say the text in a row is
text = "a00123 一二三四五"
And my code is
import re
html = "www.abcdefg.com/"
text = "a00123 一二三四五"
namelist_raw = re.sub(r'([a-z])00(\d{3})', r'\1-\2', text)
p = re.findall(r'\w', namelist_raw)
if p:
    q = re.findall(r'([a-z]-\d{3})', namelist_raw)
    for namelist in q:
        print(namelist)
else:
    namelist = "failed"
link = html + namelist
print(link)
So for this I should be getting a result of
www.abcdefg.com/a-123
so that's no problem.
But if the text is something like this,
text = "asdfdsdfd123 一二三四五"
I'll get a NameError saying name 'namelist' is not defined.
Why is that? I thought that with the if/else statement I had already covered every other case, so namelist would be "failed".

Your p = re.findall(r'\w', namelist_raw) is extracting every word character from the string, and later you only extract the values if there were matches. You do not need that check.
Next, namelist is only populated if there is a match for [a-z]-\d{3}; if there is no match, it never gets assigned. You need to account for that scenario, too.
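To see why the NameError happens, here is a minimal reproduction of the failing path (a sketch using the non-matching input from the question):

import re
text = "asdfdsdfd123 一二三四五"
namelist_raw = re.sub(r'([a-z])00(\d{3})', r'\1-\2', text)  # no match, so text comes back unchanged
p = re.findall(r'\w', namelist_raw)  # still truthy: any word character matches \w
q = re.findall(r'([a-z]-\d{3})', namelist_raw)  # empty: nothing was rewritten to letter-digits
print(bool(p), q)  # True [] -> the for loop never runs, so namelist is never assigned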
Use
import re
html = "www.abcdefg.com/"
text = "a00123 一二三四五"
p = re.findall(r'([a-z])00(\d{3})', text)  # Extract a list of tuples
namelist = []  # Init the list
for letter, number in p:
    namelist.append(f"{letter}-{number}")  # Populate namelist with formatted tuple values
if len(namelist):  # If there was a match
    namelist = "/".join(namelist)  # Create a string by joining namelist items with /
else:
    namelist = "failed"  # Else, assign failed to the namelist
link = html + namelist
print(link)
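With text = "a00123 一二三四五" this prints www.abcdefg.com/a-123, and with the non-matching "asdfdsdfd123 一二三四五" it prints www.abcdefg.com/failed.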

Related

I wrote a regex inside of a Python script to analyse XML files but sadly it's not working

I wrote a script to gather information out of an XML file. Inside, there are ENTITY declarations, and I need a regex to get the value out of them.
<!ENTITY ABC "123">
<!ENTITY BCD "234">
<!ENTITY CDE "345">
First, I open up the XML file and save the contents inside of a variable.
xml = open("file.xml", "r")
lines = xml.readlines()
Then I have a for loop:
result = "ABC"
var_search_result_list = []
var_searcher = "ENTITY\s" + result + '.*"[^"]*"\>'
for line in lines:
    var_search_result = re.match(var_searcher, line)
    if var_search_result != None:
        var_search_result_list += list(var_search_result.groups())
print(var_search_result_list)
I really want to have the value 123 inside of my var_search_result_list list. Instead, I get an empty list every time I use this. Has anybody got a solution?
Thanks in Advance - Toki
There are a few issues in the code.
You are using re.match, which has to match from the start of the string. Your pattern is ENTITY\sABC.*"([^"]*)"\>, which does not match at the start of the given example strings (they begin with <!).
If you want to add only 123, you have to use a capture group and append var_search_result.group(1) to the result list.
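To illustrate the re.match vs. re.search difference on one of the sample lines (a minimal sketch):

import re

line = '<!ENTITY ABC "123">'
print(re.match(r'ENTITY\sABC', line))   # None: match is anchored at position 0, where "<!" sits
print(re.search(r'ENTITY\sABC', line))  # a Match object: search scans the whole string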
For example:
import re
xml = open("file.xml", "r")
lines = xml.readlines()
result = "ABC"
var_search_result_list = []
var_searcher = "ENTITY\s" + result + '.*"([^"]*)"\>'
print(var_searcher)
for line in lines:
    var_search_result = re.search(var_searcher, line)
    if var_search_result:
        var_search_result_list.append(var_search_result.group(1))
print(var_search_result_list)
Output
['123']
A bit more precise pattern could be
<!ENTITY\sABC\s+"([^"]*)"\>
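For example, checking the stricter pattern against the sample entity lines (a minimal sketch):

import re

lines = ['<!ENTITY ABC "123">', '<!ENTITY BCD "234">', '<!ENTITY CDE "345">']
pattern = r'<!ENTITY\sABC\s+"([^"]*)"\>'
values = []
for line in lines:
    m = re.search(pattern, line)
    if m:
        values.append(m.group(1))
print(values)  # ['123']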

Extract text elements separated by hash & comma and store them in separate variables

I have a text file with this content:
group11#,['631', '1051']#,ADD/H/U_LS_FR_U#,group12#,['1', '1501']#,ADD/H/U_LS_FR_U#,group13#,['31', '28']#,ADD/H/UC_DT_SS#,group14#,['18', '27', '1017', '1073']#,AN/H/UC_HR_BAN#,group15#,['13']#,AD/H/U_LI_NW#,group16#,['1031']#,AN/HE/U_LE_NW_IES#
The requirement is to pull each element separated by #, and store it in a separate variable. The text file above does not have a fixed length, so if there are 200 #,-separated values, they should be stored in 200 variables.
So the expected output would be
a = group11, b = [631, 1051], c = ADD/H/U_LS_FR_U, d = group12, e = [1, 1501], f = ADD/H/U_LS_FR_U and so on
I'd use those a, b, c, d further as
url = (url+c)
rjson = {"reqparam":{"ids":[str(b)]+str(b)}]}
freq = json.dumps(rjson)
resp = request.request("Post",url,rjson)
Actually, in reqparam, 'b' has to supply values like 631 and 1051.
Not sure how to achieve this?
I've started with
with open("filename.txt", "r") as f:
    data = f.readlines()
for line in data:
    value = line.strip().split('#')
    print(value)
You should not use a new variable for each object; there are containers for this, e.g. a list.
To parse this string into a list, you can just split the string using "#," as a divider, after cutting the last symbol (which is "#") from the source:
result = src[:-1].split("#,")
But in the output sample you show that you want items which contain a list to be converted into a list. You can do this using ast.literal_eval():
import ast
result = [ast.literal_eval(s) if "[" in s else s for s in src[:-1].split("#,")]
I used a list comprehension in the previous example, but you can write it using a regular for loop:
import ast
result = []
for s in src[:-1].split("#,"):
    if "[" in s:
        try:
            converted = ast.literal_eval(s)  # string repr of list into a list
        except Exception as e:
            print(f"\"{s}\" throws an error: {e}")
        else:
            result.append(converted)
    else:
        result.append(s)
You can also use str.strip() to cut "#" and "," from the end of the string (and from the start):
src.strip("#,").split("#,")
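Putting it together on a shortened version of the sample line from the question, as a quick sanity check:

import ast

src = "group11#,['631', '1051']#,ADD/H/U_LS_FR_U#,group12#,['1', '1501']#,ADD/H/U_LS_FR_U#"
result = [ast.literal_eval(s) if "[" in s else s for s in src.strip("#,").split("#,")]
print(result)  # ['group11', ['631', '1051'], 'ADD/H/U_LS_FR_U', 'group12', ['1', '1501'], 'ADD/H/U_LS_FR_U']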

Trim links in list with Python 3.7

I have a little script in Python 3.7 (see related question here) that scrapes links from a website (http://digesto.asamblea.gob.ni/consultas/coleccion/) and saves them in a list. Unfortunately, they are only partial, and I have to trim them before I can use them as links.
This is the relevant part of the script:
list_of_links = [] # will hold the scraped links
tld = 'http://digesto.asamblea.gob.ni'
current_url = driver.current_url # for any links not starting with /
table_id = driver.find_element(By.ID, 'tableDocCollection')
rows = table_id.find_elements_by_css_selector("tbody tr") # get all table rows
for row in rows:
    row.find_element_by_css_selector('button').click()
    link = row.find_element_by_css_selector('li a[onclick*=pdf]').get_attribute("onclick")  # href
    print(list_of_links)  # trim
    if link.startswith('/'):
        list_of_links.append(tld + link)
    else:
        list_of_links.append(current_url + link)
    row.find_element_by_css_selector('button').click()
print(list_of_links)
How can I manipulate the list (shown here with only three entries as an example) so that this
["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D');return false;"]
looks like
["http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D", "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D", "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D"]
Breaking it down using the first link as an example: I get this link from the website basically as
http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;
and need to trim it to
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D.
How do I achieve this in python from the entire list?
One approach is to split on the string /consultas/coleccion/window.open(', remove the unwanted end of the second string and concatenate the two processed strings to get your result.
This should do it:
new_links = []
for link in list_of_links:
    current_strings = link.split("/consultas/coleccion/window.open('")
    current_strings[1] = current_strings[1].split("');return")[0]
    new_link = current_strings[0] + current_strings[1]
    new_links.append(new_link)
This should do the trick:
s = "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;"
s = s.replace("/consultas/coleccion/window.open('", "").replace("');return false;", "")
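Applied to the whole list with a comprehension (a small sketch building on the snippet above):

new_links = [
    s.replace("/consultas/coleccion/window.open('", "").replace("');return false;", "")
    for s in list_of_links
]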
You could use a regular expression to split the URLs in your list and let urllib.parse.urljoin() do the rest for you:
import re
from urllib.parse import urljoin
PATTERN = r"^([\S]+)window.open\('([\S]+)'"
links = ["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;"]
result = "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D"
for link in links:
    m = re.match(PATTERN, link, re.MULTILINE).groups()
    # m is now: ('http://digesto.asamblea.gob.ni/consultas/coleccion/', '/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D')
    if len(m) == 2:
        newLink = urljoin(*m)
        print(newLink)
        assert newLink == result
Returns:
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D
For that you can use a regular expression. Consider this code:
import re
out = list()
lst = ["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D');return false;"]
for el in lst:
    temp = re.sub(r"(.*?)/consultas/coleccion/window.open\('(.*?)'\).*", r"\1\2", el)
    out.append(temp)
    print(temp)
The function sub allows you to replace the part of the string that matches the specified pattern. Basically it says:
(.*?): keep all the characters before /consultas/coleccion/window.open( (here, the scheme and host)
/consultas/coleccion/window.open\(': the input string must contain this part, but it will not be kept
(.*?): keep all the characters after the previous part up to the closing ') (represented by '\))
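For the three links in lst, out then holds exactly the three trimmed URLs from the expected output above.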

Beautiful Soup .get_text() does not equal Python string when it should

I am using Beautiful Soup to grab text from an html element.
I am then using a loop and if statement to compare that text to a list of words. If they match I want to return a confirmation.
However, the code is not confirming any matches, even though print statements show there are in fact matches.
def findText():
    text = ""
    url = 'www.site.com'
    # Get url and store
    page = requests.get(url)
    # Get page content
    soup = BeautifulSoup(page.content, "html.parser")
    els = soup.select(".className")
    lists = els[1].select(".className2")
    for l in lists:
        try:
            text = l.find("li").get_text()
        except AttributeError:
            text = "null"
    return text
def isMatch(text):
    # Open csv file
    listFile = open('list.csv', 'rb')
    # prep file to be read
    newListFile = csv.reader(listFile)
    match = ""
    for r in newListFile:
        if r[0] == text.lower():
            match = True
        else:
            match = False
    return match
    congressCSVFile.close()
match is always False in the output
print(r[0]) returns (let's just say) "cat" in terminal
print(text) also returns "cat" in terminal
Your loop is the problem, or at least one of them. Once you find a record that matches, you keep going. match will only end up True if the last record matches. To fix this, simply return when you find a match:
for r in newListFile:
    if r[0] == text.lower():
        return True
return False
The match variable is not needed.
Better yet, use the any() function:
return any(r[0] == text.lower() for r in newListFile)
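Put together with a context manager so the file is always closed (a sketch; note that csv.reader in Python 3 expects the file opened in text mode, not 'rb'):

import csv

def isMatch(text):
    # open in text mode with newline='', as the csv module recommends
    with open('list.csv', newline='') as listFile:
        return any(r[0] == text.lower() for r in csv.reader(listFile))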
In your try block, use text = l.find("li").get_text(strip=True)
Soup, and HTML in general, adds a significant amount of whitespace. If you don't strip it out with the strip parameter, you may never get a match unless the whitespace is included in your list file.
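A quick illustration of the difference on an inline fragment (a minimal sketch):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<li>\n  cat\n</li>", "html.parser")
print(repr(soup.find("li").get_text()))            # '\n  cat\n'
print(repr(soup.find("li").get_text(strip=True)))  # 'cat'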

Empty text in XPath

I have written this line of code for creating a list through XPath:
classes = tree.xpath('//a[@class="pathm"]/../../../../../td[3]/font/text()')
It creates a list. There are also items containing empty text, but the list does not contain them; it contains only non-empty values. I want an empty string in the list wherever there is no text. Please help.
You can get only the //font elements and later use a loop to get each element's text, or an empty string when there is no text (or rather, when it is None):
import lxml.html
data = '''
<font>A</font>
<font></font>
<font>C</font>
'''
tree = lxml.html.fromstring(data)
fonts = tree.xpath('//font')
result = [x.text if x.text else '' for x in fonts]
print(result)
If you don't know how the list comprehension works, it does this:
result = []
for x in fonts:
    if x.text:  # not None
        result.append(x.text)
    else:
        result.append('')
print(result)
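For the sample data above, both versions print ['A', '', 'C']; the empty <font> keeps its place as an empty string.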
