Empty text in Xpath - python

I have written this line of code to build a list through XPath:
classes = tree.xpath('//a[@class="pathm"]/../../../../../td[3]/font/text()')
It creates a list, but some of the matched items contain empty text, and the list leaves those out: it holds only the non-empty values. I want an empty string in the list wherever there is no text. Please help.

You can select only the //font elements and then loop over them, reading each element's text and substituting an empty string when there is none (the text attribute is None in that case).
import lxml.html
data = '''
<font>A</font>
<font></font>
<font>C</font>
'''
tree = lxml.html.fromstring(data)
fonts = tree.xpath('//font')
result = [x.text if x.text else '' for x in fonts]
print(result)
If you don't know how a list comprehension works, it does this:
result = []
for x in fonts:
    if x.text:  # not None
        result.append(x.text)
    else:
        result.append('')
print(result)
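The same idea can be tried on a row layout closer to the one in the question. This is only a sketch: it swaps in the standard library's xml.etree.ElementTree for lxml so it runs without extra installs, and the table markup is made up for illustration. The point is that the path stops at the font element instead of ending in /text(), so empty elements stay in the list.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the third cell of each row holds a <font> element,
# and the one in the second row is empty.
data = """
<table>
  <tr><td>1</td><td>x</td><td><font>A</font></td></tr>
  <tr><td>2</td><td>y</td><td><font></font></td></tr>
  <tr><td>3</td><td>z</td><td><font>C</font></td></tr>
</table>
"""

root = ET.fromstring(data)

# Select the elements themselves, not their text nodes.
fonts = root.findall("./tr/td[3]/font")

# .text is None for an empty element, so fall back to ''.
classes = [font.text or "" for font in fonts]
print(classes)  # ['A', '', 'C']
```

With lxml the list comprehension would be the same; only the parsing and selection calls differ.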

Related

Cleaning up a scraped HTML List

I'm trying to extract names from a wiki page. Using BeautifulSoup I am able to get a very dirty list (including lots of extraneous items) that I want to clean up, however my attempt to 'sanitise' the list leaves it unchanged.
#1).
#Retrieve the page
import requests
from bs4 import BeautifulSoup
weapons_url = 'https://escapefromtarkov.gamepedia.com/Weapons'
weapons_page = requests.get(weapons_url)
weapons_soup = BeautifulSoup(weapons_page.content, 'html.parser')
#2).
#Obtain the data I need, plus a lot of unhelpful data
flithy_scraped_weapon_names = weapons_soup.find_all('td', href="", title="")
#3a).
#Identify keywords that recur in unhelpful/extraneous list items
dirt = ["mm", "predecessor", "File", "image"]
#3b). - Fails
#Remove extraneous data containing above-defined keywords
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
                          if not any(xs in s for xs in dirt)]
#4).
#Check data
print(weapon_names_sanitised)
#Returns a list identical to flithy_scraped_weapon_names
The problem is in this section:
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
                          if not any(xs in s for xs in dirt)]
It should instead be:
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
                          if not any(xs in str(s) for xs in dirt)]
The reason is that flithy_scraped_weapon_names contains Tag objects, not strings. A Tag is rendered as its markup when printed, but the test xs in s on a Tag checks whether xs is one of the tag's direct children rather than searching that markup, so nothing matches and nothing is filtered out. Converting explicitly with str(s) makes the substring test xs in str(s) work as expected.
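The difference can be seen without fetching the page. The sketch below uses FakeTag, a hypothetical stand-in written for this example (it is not the real bs4 class), which assumes that containment on a Tag looks at the list of children while str() produces the markup.

```python
class FakeTag:
    """Hypothetical stand-in for a bs4 Tag, for illustration only."""

    def __init__(self, name, children):
        self.name = name
        self.contents = children

    def __contains__(self, item):
        # Assumed behaviour: membership is tested against the children list.
        return item in self.contents

    def __str__(self):
        inner = "".join(str(child) for child in self.contents)
        return "<{0}>{1}</{0}>".format(self.name, inner)


dirt = ["mm", "predecessor", "File", "image"]
cells = [FakeTag("td", ["AK-74N"]), FakeTag("td", ["5.45x39mm"])]

# Membership on the object itself: no keyword equals a whole child, so nothing is removed.
unfiltered = [c for c in cells if not any(word in c for word in dirt)]

# Membership on the rendered string: a real substring search.
filtered = [c for c in cells if not any(word in str(c) for word in dirt)]

print(len(unfiltered))             # 2
print([str(c) for c in filtered])  # ['<td>AK-74N</td>']
```

Note that str(s) searches the whole markup, attributes included. If only the visible text should be checked, s.get_text() may be a better fit for the real Tag objects.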

Parsing a list of lists and manipulating it in place

So I have a list of lists that I need to parse through and manipulate the contents of. There are strings of numbers and words in the sublists, and I want to change the numbers into integers. I don't think it's relevant but I'll mention it just in case: my original data came from a CSV that I split on newlines, and then split again on commas.
What my code looks like:
def prep_data(data):
    list = data.split('\n') #Splits data on newline
    list = list[1:-1] #Gets rid of header and last row, which is an empty string
    prepped = []
    for x in list:
        prepped.append(x.split(','))
    for item in prepped: #Converts the item into an int if it is able to be converted
        for x in item:
            try:
                item[x] = int(item[x])
            except:
                pass
    return prepped
I tried to loop through every sublist in prepped and change the type of the values in them, but it doesn't seem like the loop does anything as the prep_data returns the same thing as it did before I implemented that for loop.
I think I see what is wrong: in the inner loop, x is each value in the sublist, not its position, so item[x] tries to index a list with a string.
def prep_data(data):
    lines = data.split('\n')  # Splits data on newline
    lines = lines[1:-1]  # Gets rid of header and last row, which is an empty string
    prepped = []
    for line in lines:
        prepped.append(line.split(','))
    for item in prepped:
        for i, x in enumerate(item):  # i is the position, x is the value
            try:
                item[i] = int(x)  # Converts the value into an int if it can be converted
            except ValueError:
                pass  # Not a number: leave the string as it is
    return prepped
I can't run this on the machine I'm on right now, but the problem seems to be that item[x] raised a TypeError on every pass and the bare except silently swallowed it, so no value was ever replaced. Assigning by position, and catching only ValueError, lets the conversion go through.
I'm not sure about your function, because maybe I didn't understand your input data, but you could try something like the following, because if you only pass, you could lose strings or other odd values:
def parse_data(raw_data):
    data_lines = raw_data.split('\n')  # Splits data on newline
    data_rows_without_header = data_lines[1:-1]  # Gets rid of header and last row, which is an empty string
    parsed_data = []
    for raw_row in data_rows_without_header:
        split_row = raw_row.split(',')
        parsed_row = []
        for value in split_row:
            try:
                parsed_row.append(int(value))
            except ValueError:
                print("The value '{}' is not castable".format(value))
                parsed_row.append(value)  # if cast fails, add the string as it is
        parsed_data.append(parsed_row)
    return parsed_data
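Since the data originally came from a CSV, a shorter variant is to let the csv module do the splitting and build new rows instead of modifying them in place. This is a sketch, not a drop-in replacement: to_int_or_keep and the sample text are made up for illustration, and it returns a new list of lists.

```python
import csv
import io


def to_int_or_keep(value):
    """Return int(value) when possible, otherwise the original string."""
    try:
        return int(value)
    except ValueError:
        return value


def prep_data(raw_text):
    rows = list(csv.reader(io.StringIO(raw_text)))
    body = rows[1:]  # skip the header row
    # Build a converted copy of every non-empty row.
    return [[to_int_or_keep(v) for v in row] for row in body if row]


sample = "name,age,city\nAnn,31,Oslo\nBob,x7,Rome\n"
prepped = prep_data(sample)
print(prepped)  # [['Ann', 31, 'Oslo'], ['Bob', 'x7', 'Rome']]
```

Catching only ValueError keeps genuine programming mistakes, such as indexing with the wrong type, visible instead of hiding them.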

re.sub() gives NameError when no match

So I'm trying to search and replace rows of texts from a csv file, and I keep getting errors from it if re.sub() can't find any matches.
Say if the text in a row is
text = "a00123 一二三四五"
And my code is
import re
html = "www.abcdefg.com/"
text = "a00123 一二三四五"
namelist_raw = re.sub(r'([a-z])00(\d{3})',r'\1-\2',text)
p = re.findall(r'\w',namelist_raw)
if p:
    q = re.findall(r'([a-z]-\d{3})',namelist_raw)
    for namelist in q:
        print(namelist)
else:
    namelist = "failed"
link = html + namelist
print(link)
For this I should get the result
www.abcdefg.com/a-123
so that's no problem.
But if the text is something like this,
text = "asdfdsdfd123 一二三四五"
I get a NameError saying name 'namelist' is not defined.
Why is that? I thought the if/else statement already covered this: in any other case, namelist is "failed".
Your p = re.findall(r'\w',namelist_raw) is extracting every word char from a string, and later, you only extract the values from the string if there were matches. You do not need that check.
Next, namelist is only populated if there is a match for [a-z]-\d{3}, but if there is no match, you do not get it populated. You need to account for that scenario, too.
Use
import re
html = "www.abcdefg.com/"
text = "a00123 一二三四五"
p = re.findall(r'([a-z])00(\d{3})', text) # Extract a list of tuples
namelist = [] # Init the list
for letter, number in p:
    namelist.append(f"{letter}-{number}") # Populate namelist with formatted tuple values
if len(namelist): # If there was a match
    namelist = "/".join(namelist) # Create a string by joining namelist items with /
else:
    namelist = "failed" # Else, assign failed to the namelist
link = html + namelist
print(link)
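If only the first code in each row matters, the same guard can be written more compactly with re.search, which returns None when nothing matches. This is a sketch under that assumption; build_link is a made-up helper name, and unlike the code above it ignores any further matches in the text.

```python
import re


def build_link(base, text):
    """Return base plus 'x-123' for the first code found, or base plus 'failed'."""
    match = re.search(r'([a-z])00(\d{3})', text)
    # The name is always assigned, whether or not there was a match.
    name = f"{match.group(1)}-{match.group(2)}" if match else "failed"
    return base + name


ok_link = build_link("www.abcdefg.com/", "a00123 一二三四五")
failed_link = build_link("www.abcdefg.com/", "asdfdsdfd123 一二三四五")
print(ok_link)      # www.abcdefg.com/a-123
print(failed_link)  # www.abcdefg.com/failed
```

The original NameError came from the for loop: when the list being iterated is empty, the loop variable is never assigned, and the else branch belonged to the outer if, which was true.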

How to replace multiple words in .docx file and save the docx file using python-docx

I'm trying to change the content of a docx file using the python-docx library. My changes consist of replacing words: I have a list of original words, ['ABC','XYZ'], which I need to replace with a list of revised words, ['PQR', 'DEF']. I also need to preserve the formatting of these words. Right now, only one change gets saved. Here is my code for reference.
def replace_string(filename='test.docx'):
    doc = Document(filename)
    list= ['ABC','XYZ']
    list2 = ['PQR','DEF']
    for p in doc.paragraphs:
        print(p.text)
        for i in range(0, len(list)):
            if list[i] in p.text:
                print('----!!SEARCH FOUND!!------')
                print(list[i])
                print(list2[i])
                print('\n')
                inline = p.runs
                # Loop added to work with runs (strings with same style)
                for i in range(len(inline)):
                    #print(inline[i].text)
                    if list[i] in inline[i].text:
                        print('----SEARCH FOUND!!------')
                        text = inline[i].text.replace(list[i], list2[i])
                        inline[i].text = text
                        print(inline[i].text)
    doc.save('dest1.docx')
    return 1
replace_string()
Original content of test.docx file:
ABC
XYZ
Revised content or saved content of dest1.docx file:
PQR
XYZ
How can I save all the replacements? The list of words may grow, and its size is not fixed.
The following code works for me. It preserves the formatting as well. Hope this helps others.
from docx import Document

def replace_string1(filename='test.docx'):
    doc = Document(filename)
    list= ['ABC','XYZ']
    list2 = ['PQR','DEF']
    for p in doc.paragraphs:
        inline = p.runs
        for j in range(0,len(inline)):  # j walks over the runs
            for i in range(0, len(list)):  # i walks over the word pairs
                inline[j].text = inline[j].text.replace(list[i], list2[i])
                print(p.text)
                print(inline[j].text)
    doc.save('dest1.docx')
    return 1
I implemented a version of JT28's solution, using a dictionary to replace the text (instead of two lists). This lets me generate paired find/replace items more simply: the key k is what I'm looking for, and v is the new substring. The function allows replacement in one paragraph or all paragraphs, depending on whether the caller is iterating (or not) over doc.paragraphs.
from docx import Document

# NEW FUNCTION:
def replacer(p, replace_dict):
    inline = p.runs  # Specify the list being used
    for j in range(0, len(inline)):
        # Iterate over the dictionary
        for k, v in replace_dict.items():
            if k in inline[j].text:
                inline[j].text = inline[j].text.replace(k, v)
    return p
# Replace Paragraphs
doc = Document(filename)  # Get the file (filename is the path to the .docx, set by the caller)
replacements = {'ABC':'PQR', 'XYZ':'DEF'}  # Build the dict
for p in doc.paragraphs:  # If needed, iter over paragraphs
    p = replacer(p, replacements)  # Call the new replacer function
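The run-level loop can be tried without a .docx file by letting any object with a text attribute play the role of a run. The sketch below does that with SimpleNamespace; replace_in_runs and the sample strings are made up for illustration, and nothing here touches python-docx itself.

```python
from types import SimpleNamespace


def replace_in_runs(runs, replacements):
    """Apply every old -> new pair to the text of each run-like object."""
    for run in runs:
        for old, new in replacements.items():
            if old in run.text:
                run.text = run.text.replace(old, new)


replacements = {"ABC": "PQR", "XYZ": "DEF"}

# Two stand-in runs, each holding a whole search word.
runs = [SimpleNamespace(text="ABC is bold, "), SimpleNamespace(text="XYZ is italic")]
replace_in_runs(runs, replacements)
print([run.text for run in runs])  # ['PQR is bold, ', 'DEF is italic']

# A search word split across two runs is not found by a per-run search.
split_runs = [SimpleNamespace(text="AB"), SimpleNamespace(text="C")]
replace_in_runs(split_runs, replacements)
print([run.text for run in split_runs])  # ['AB', 'C']
```

The second case is worth keeping in mind with real documents: because each run is searched separately, a word whose letters are spread over several runs is left unchanged.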

Python: Split between two characters

Let's say I have a ton of HTML with no newlines. I want to get each element into a list.
input = "<head><title>Example Title</title></head>"
a_list = ["<head>", "<title>Example Title</title>", "</head>"]
Something like that: splitting between each > and <.
But in Python, I don't know of a way to do that. I can only split at that string, which removes it from the output. I want to keep it, and split between the two angle brackets.
How can this be done?
Edit: Preferably, this would be done without adding the characters back in to the ends of each list item.
# initial input
a = "<head><title>Example Title</title></head>"
# split list
b = a.split('><')
# remove extra character from first and last elements
# because the split only removes >< pairs.
b[0] = b[0][1:]
b[-1] = b[-1][:-1]
# initialize new list
a_list = []
# fill new list with formatted elements
for i in range(len(b)):
    a_list.append('<{}>'.format(b[i]))
This will output the given list in python 2.7.2, but it should work in python 3 as well.
You can try this:
import re
a = "<head><title>Example Title</title></head>"
data = re.split("><", a)
new_data = [data[0]+">"]+["<" + i+">" for i in data[1:-1]] + ["<"+data[-1]]
Output:
['<head>', '<title>Example Title</title>', '</head>']
The shortest approach using re.findall() function on extended example:
import re
# extended html string
s = "<head><title>Example Title</title></head><body>hello, <b>Python</b></body>"
result = re.findall(r'(<[^>]+>[^<>]+</[^>]+>|<[^>]+>)', s)
print(result)
The output:
['<head>', '<title>Example Title</title>', '</head>', '<body>', '<b>Python</b>', '</body>']
Based on the answers by other people, I made this.
It isn't as clean as I had wanted, but it seems to work. I had originally wanted to not re-add the characters after split.
Here, I got rid of one extra argument by combining the two characters into a string. Anyways,
def split_between(string, chars):
    if len(chars) != 2:  # compare by value; "is not" tests identity
        raise IndexError("Argument chars must contain two characters.")
    result_list = [chars[1] + line + chars[0] for line in string.split(chars)]
    result_list[0] = result_list[0][1:]
    result_list[-1] = result_list[-1][:-1]
    return result_list
Credit goes to @cforeman and @Ajax1234.
Or even simpler, this:
input = "<head><title>Example Title</title></head>"
print(['<'+elem if elem[0]!='<' else elem for elem in [elem+'>' if elem[-1]!='>' else elem for elem in input.split('><') ]])
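For the original wish of not adding the characters back afterwards, one option is to split on the empty position between > and < with lookaround assertions, so nothing is consumed. This is a sketch that assumes Python 3.7 or later, where re.split accepts patterns that can match an empty string.

```python
import re

html = "<head><title>Example Title</title></head>"

# (?<=>) requires '>' just before the split point, (?=<) requires '<' just after it.
# Both are zero-width, so the brackets stay in the pieces.
parts = re.split(r"(?<=>)(?=<)", html)
print(parts)  # ['<head>', '<title>Example Title</title>', '</head>']
```

Like the other answers, this only splits text; for real-world HTML an actual parser is usually the safer choice.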
