Parse text response from http request in Python - python

I am trying to fetch data from an APU but as response I am getting the plain text. I want to read all text line by line.
This is the url variable: http://www.amfiindia.com/spages/NAVAll.txt?t=23052017073640
First snippet:
from pymongo import MongoClient
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.amfiindia.com/spages/NAVAll.txt?t=23052017073640"
request = requests.get(url)
soup = bs(request.text,"lxml")
for line in soup:
print line
break
Result: It prints out the entire text
Second snippet:
request = requests.get(url)
for line in request.text():
print line
break
Result: It prints out 1 character
request = requests.get(url)
requestText = request.text()
allMf = requestText.splitlines()
Result: Exception: 'unicode' object is not callable
I have tried few more cases but not able to read text line by line.

request.text is a property and not a method, request.text returns a unicode string, request.text() throws the error 'unicode' object is not callable.
for line in request.text.splitlines():
print line

import requests
from bs4 import BeautifulSoup as bs
url = "https://www.amfiindia.com/spages/NAVAll.txt?t=23052017073640"
request = requests.get(url)
soup = bs(request.text,"lxml")
# soup.text is to get the returned text
# split function, splits the entire text into different lines (using '\n') and stores in a list. You can define your own splitter.
# each line is stored as an element in the allLines list.
allLines = soup.text.split('\n')
for line in allLines: # you iterate through the list, and print the single lines
print(line)
break # to just print the first line, to show this works

Try this:
from pymongo import MongoClient
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.amfiindia.com/spages/NAVAll.txt?t=23052017073640"
request = requests.get(url)
soup = bs(request.text,"lxml")
for line in soup:
print line.text
break

Related

How to HTML parse a URL list using python

I have a URL list of 5 URLs within a .txt file named as URLlist.txt.
https://www.w3schools.com/php/php_syntax.asp
https://www.w3schools.com/php/php_comments.asp
https://www.w3schools.com/php/php_variables.asp
https://www.w3schools.com/php/php_echo_print.asp
https://www.w3schools.com/php/php_datatypes.asp
I need to parse all the HTML content within the 5 URLs one by one for further processing.
My current code to parse an individual URL -
import requests from bs4
import BeautifulSoup as bs #HTML parsing using beatuifulsoup
r = requests.get("https://www.w3schools.com/whatis/whatis_jquery.asp")
soup = bs(r.content)
print(soup.prettify())
The way you implement this rather depends on whether you need to process the URLs iteratively or whether it's better to gather all the content for subsequent processing. That's what I suggest. Build a dictionary where each key is a URL and the associated value is the text (HTML) return from the page. Use multithreading for greater efficiency.
import requests
from concurrent.futures import ThreadPoolExecutor
data = dict()
def readurl(url):
try:
(r := requests.get(url)).raise_for_status()
data[url] = r.text
except Exception:
pass
def main():
with open('urls.txt') as infile:
with ThreadPoolExecutor() as executor:
executor.map(readurl, map(str.strip, infile.readlines()))
print(data)
if __name__ == '__main__':
main()
Your problem will be solved using line-by-line readying and then put that line in your request.
sample:
import requests from bs4
import BeautifulSoup as bs #HTML parsing using beatuifulsoup
f = open("URLlist.txt", "r")
for line in f:
print(line) # CURRENT LINE
r = requests.get(line)
soup = bs(r.content)
print(soup.prettify())
Create a list of your links
with open('test.txt', 'r') as f:
urls = [line.strip() for line in f]
Then u can loop your parse
for url in urls:
r = requests.get(url)
...

Retrieve scrape urls from text file in BeautifulSoup

I have the following script and I would like to retrieve the URL's from a text file rather than an array. I'm new to Python and keep getting stuck!
from bs4 import BeautifulSoup
import requests
urls = ['URL1',
'URL2',
'URL3']
for u in urls:
response = requests.get(u)
data = response.text
soup = BeautifulSoup(data,'lxml')
Could you please be a little more clear about what you want?
Here is a possible answer which might or might not be what you want:
from bs4 import BeautifulSoup
import requests
with open('yourfilename.txt', 'r') as url_file:
for line in url_file:
u = line.strip()
response = requests.get(u)
data = response.text
soup = BeautifulSoup(data,'lxml')
The file was opened with the open() function; the second argument is 'r' to specify we're opening it in read-only mode. The call to open() is encapsulated in a with block so the file is automatically closed as soon as you no longer need it open.
The strip() function removes trailing whitespace (spaces, tabs, newlines) at the beginning and end of every line, for instant ' https://stackoverflow.com '.strip() becomes 'https://stackoverflow.com'.

Not getting json when using .text in bs4

In this code I think I made a mistake or something because I'm not getting the correct json when I print it, indeed I get nothing but when I index the script I get the json but using .text nothing appears I want the json alone.
CODE :
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
import requests
import selenium.webdriver as webdriver
base_url = 'https://www.instagram.com/{}'
search = input('Enter the instagram account: ')
final_url = base_url.format(quote_plus(search))
response = requests.get(final_url)
print(response.status_code)
if response.ok:
html = response.text
bs_html = BeautifulSoup(html)
scripts = bs_html.select('script[type="application/ld+json"]')
print(scripts[0].text)
Change the line print(scripts[0].text) to print(scripts[0].string).
scripts[0] is a Beautiful Soup Tag object, and its string contents can be accessed through the .string property.
Source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string
If you want to then turn the string into a json so that you can access the data, you can do something like this:
...
if response.ok:
html = response.text
bs_html = BeautifulSoup(html)
scripts = bs_html.select('script[type="application/ld+json"]')
json_output = json.loads(scripts[0].string)
Then, for example, if you run print(json_output['name']) you should be able to access the name on the account.

Counting the number of p-tags (<p>) in a website from a user's choice of link input in Python 3

Write a program that asks the user for a URL. It should then retrieve
the contents of the page at that URL and print how many <p> tags are
on that page. Your program should just print an integer.
Here's my code:
import urllib.request
link = input('Enter URL: ')
response = urllib.request.urlopen(link)
html = response.read()
counter = 0
for '<p>' in html:
counter += 1
print(counter)
However, I got this error:
Traceback (most recent call last):
File "python", line 16
SyntaxError: can't assign to literal
What would be the better method to execute this code? Should I use the find method instead?
First of all response.read() returns bytes; hence you need to cast it to string:
html = str(response.read())
then, no need for for loop, you can just use count = html.counter('<p>')
Hope it help.s
Try to use BeautifulSoup
from bs4 import BeautifulSoup
import requests
link = input('Enter URL: ')
response = requests.get(link)
html = response.text
soup = BeautifulSoup(html, 'lxml')
tags = soup.findAll('p')
print(len(tags))
This code works good:
from lxml import html
import requests
page = requests.get(input('Enter URL: '))
root = html.fromstring(page.content)
print(len(root.xpath('//p')))

beautiful soup re-parse a returned set of table rows beautiful soup

I'm trying to parse a second set of data. I make a get request to gigya status page, I parse out the part that is important with beautiful soup. Then I take the return string of html trying to parse that with beautiful soup as well but I get a markup error however the returned content string is a string as well so im not sure why..
error
Traceback (most recent call last):
File "C:\Users\Administraor\workspace\ChronoTrack\get_gigiya.py", line 17, in <module>
soup2 = BeautifulSoup(rows)
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 161, in __init__
markup = markup.read()
TypeError: 'NoneType' object is not callable
code
import requests
import sys
from bs4 import BeautifulSoup
url = ('https://console.gigya.com/site/apiStatus/getTable.ashx')
r = requests.request('GET', url)
content = str(r.content)
soup = BeautifulSoup(content)
table = soup.findAll('table')
rows = soup.findAll('tr')
rows = rows[8]
soup2 = BeautifulSoup(rows) #this is where it fails
items = soup2.findAll('td')
print items
The line soup2 = BeautifulSoup(rows) is unnecessary; rows at that point is already a BeautifulSoup.Tag object. You can simply do:
rows = rows[8]
items = rows.findAll('td')

Categories

Resources