I am having a problem with the bit of code shown below. My original code worked when I was just pulling the tweet information, but once I edited it to extract the URL within the text it started to give me problems. Nothing is printing and I am receiving these errors.
Traceback (most recent call last):
  File "C:\Users\Evan\PycharmProjects\DiscordBot1\main.py", line 22, in <module>
    get_tweets(api, "cnn")
  File "C:\Users\Evan\PycharmProjects\DiscordBot1\main.py", line 18, in get_tweets
    url2 = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
                      text)
  File "C:\Users\Evan\AppData\Local\Programs\Python\Python39\lib\re.py", line 241, in findall
    return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
I am receiving no errors before I run it, so I am extremely confused about why this is not working. It will probably be something simple, as I am new to using both Tweepy and regex.
import tweepy
import re

TWITTER_APP_SECRET = 'hidden'
TWITTER_APP_KEY = 'hidden'

auth = tweepy.OAuthHandler(TWITTER_APP_KEY, TWITTER_APP_SECRET)
api = tweepy.API(auth)

def get_tweets(api, username):
    page = 1
    while True:
        tweets = api.user_timeline(username, page=page)
        for tweet in tweets:
            text = tweet.text.encode("utf-8")
            url2 = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
            print(url2)

get_tweets(api, "cnn")
Tell me if you need more information to help me, any help is appreciated, thanks in advance.
You're getting that error because you are using a string pattern (your regex) against a string which you've turned into a bytes object via encode().
Try running your pattern directly against tweet.text without encoding it.
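For instance, a minimal sketch using the same pattern from the question, with a made-up string standing in for tweet.text:

```python
import re

# Stand-in for tweet.text -- a plain str, not encoded to bytes
text = "Breaking news https://example.com/story and more"

# Same pattern as in the question, applied to the str directly
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
print(urls)  # ['https://example.com/story']
```

In Python 3, str and bytes are distinct types, and re requires the pattern and the subject to be the same type, which is why the encode() call triggered the error.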
First of all, I am getting this error. When I try running
pip3 install --upgrade json
in an attempt to resolve the error, python is unable to find the module.
The segment of code I am working with can be found below the error, but some further direction on the code itself would also be appreciated.
Error:
Traceback (most recent call last):
  File "Chicago_cp.py", line 18, in <module>
    StopWork_data = json.load(BeautifulSoup(StopWork_response.data,'lxml'))
  File "/usr/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
TypeError: 'NoneType' object is not callable
Script:
#!/usr/bin/python
import json
from bs4 import BeautifulSoup
import urllib3

http = urllib3.PoolManager()

# Define Merge
def Merge(dict1, dict2):
    res = {**dict1, **dict2}
    return res

# Open the URL and the screen name
StopWork__url = "someJsonUrl"
Violation_url = "anotherJsonUrl"

StopWork_response = http.request('GET', StopWork__url)
StopWork_data = json.load(BeautifulSoup(StopWork_response.data,'lxml'))

Violation_response = http.request('GET', Violation_url)
Violation_data = json.load(BeautifulSoup(Violation_response.data,'lxml'))

dict3 = Merge(StopWork_data,Violation_data)
print (dict1)
json.load expects a file object, or something else with a read method. A BeautifulSoup object doesn't have a read method; when you ask it for any attribute, it tries to find a child tag with that name, i.e. a <read> tag in this case. When it doesn't find one, it returns None, which causes the error. Here's a demo:
import json
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>hi</p>", "html5lib")
assert soup.read is None
assert soup.blablabla is None
assert json.loads is not None
json.load(soup)
Output:
Traceback (most recent call last):
  File "main.py", line 8, in <module>
    json.load(soup)
  File "/usr/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
TypeError: 'NoneType' object is not callable
If the URL is returning JSON then you don't need BeautifulSoup at all because that's for parsing HTML and XML. Just use json.loads(response.data).
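As a sketch, with a made-up bytes payload standing in for response.data (urllib3 returns the body as bytes, which json.loads accepts directly on Python 3.6+):

```python
import json

# Stand-in for response.data from http.request('GET', url)
payload = b'{"rows": [{"id": 1}, {"id": 2}]}'

# No BeautifulSoup step needed: parse the JSON body directly
data = json.loads(payload)
print(data["rows"][0]["id"])  # 1
```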
I'm querying a database in Python 3.8.2, and I need the url-encoded results to be:
data = {"where":{"date":"03/30/20"}}
needed_results = ?where=%7B%22date%22%3A%20%2203%2F30%2F20%22%7D
I've tried the following
import urllib.parse
data = {"where":{"date":"03/30/20"}}
print(urllib.parse.quote_plus(data))
When I do that I get the following
Traceback (most recent call last):
  File "C:\Users\Johnathan\Desktop\Python Snippets\test_func.py", line 17, in <module>
    print(urllib.parse.quote_plus(data))
  File "C:\Users\Johnathan\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 855, in quote_plus
    string = quote(string, safe + space, encoding, errors)
  File "C:\Users\Johnathan\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 839, in quote
    return quote_from_bytes(string, safe)
  File "C:\Users\Johnathan\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 864, in quote_from_bytes
    raise TypeError("quote_from_bytes() expected bytes")
TypeError: quote_from_bytes() expected bytes
I've tried a couple of other methods and received: ?where=%7B%27date%27%3A+%2703%2F30%2F20%27%7D
Long Story Short, I need to url encode the following
data = {"where":{"date":"03/30/20"}}
needed_encoded_data = ?where=%7B%22date%22%3A%20%2203%2F30%2F20%22%7D
Thanks
where is a dictionary, and a dictionary can't be url-encoded directly. You need to turn it into a string or bytes object first. You can do that with json.dumps:
import json
import urllib.parse
data = {"where":{"date":"03/30/20"}}
print(urllib.parse.quote_plus(json.dumps(data)))
Output:
%7B%22where%22%3A+%7B%22date%22%3A+%2203%2F30%2F20%22%7D%7D
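If you need the exact form from the question, with a ?where= prefix and %20 rather than + for spaces, one way (a sketch, not tested against your API) is to encode only the value of "where" using quote() with an empty safe set:

```python
import json
import urllib.parse

data = {"where": {"date": "03/30/20"}}

# quote() with safe='' encodes spaces as %20 (quote_plus() would emit +)
# and leaves nothing unescaped, matching the needed_results form
encoded = urllib.parse.quote(json.dumps(data["where"]), safe='')
print("?where=" + encoded)  # ?where=%7B%22date%22%3A%20%2203%2F30%2F20%22%7D
```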
The code is:
from PyPDF2 import PdfFileReader

with open('HTTP_Book.pdf','rb') as file:
    pdf = PdfFileReader(file)
    pagedd = pdf.getPage(0)
    print(pagedd.extractText())
This code raises the error shown below:
TypeError: ord() expected string of length 1, but int found
I searched the internet and found Troubleshooting "TypeError: ord() expected string of length 1, but int found", but it doesn't help much. I am aware of the background of this error, but I'm not sure how it is related here.
I tried changing the PDF file and it works fine. So what is wrong: the PDF file, or is PyPDF2 unable to handle it? I know this method is not very reliable, as per the documentation:
This works well for some PDF files, but poorly for others, depending on the generator used
How should this be handled?
Traceback:
Traceback (most recent call last):
  File "pdf_reader.py", line 71, in <module>
    print(pagedd.extractText())
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 2595, in extractText
    content = ContentStream(content, self.pdf)
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 2673, in __init__
    stream = BytesIO(b_(stream.getData()))
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 350, in decodeStreamData
    data = LZWDecode.decode(data, stream.get("/DecodeParms"))
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 255, in decode
    return LZWDecode.decoder(data).decode()
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 228, in decode
    cW = self.nextCode();
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 205, in nextCode
    nextbits=ord(self.data[self.bytepos])
TypeError: ord() expected string of length 1, but int found
I found the issue: this is just a limitation of PyPDF2. I used tika and BeautifulSoup to parse and extract the text, and it worked fine, although it needs a little more work.
from tika import parser
from bs4 import BeautifulSoup
raw=parser.from_file('HTTP_Book.pdf',xmlContent=True)['content']
data=BeautifulSoup(raw,'lxml')
message=data.find(class_='page') # for first page
print(message.text)
I have this very simple Python code to read XML from the Wikipedia API:
import urllib
from xml.dom import minidom
usock = urllib.urlopen("http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500")
xmldoc=minidom.parse(usock)
usock.close()
print xmldoc.toxml()
But this code returns with these errors:
Traceback (most recent call last):
  File "/home/user/workspace/wikipediafoundations/src/list.py", line 5, in <module>
    xmldoc=minidom.parse(usock)
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: syntax error: line 1, column 62
I have no clue, as I am just learning Python. Is there a way to get a more detailed error? Does anyone know the solution? Also, please recommend a better language to do this in.
Thank You,
Venkat Rao
The URL you're requesting is an HTML representation of the XML that would be returned:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500
So the XML parser fails. You can see this by pasting the above into a browser. Try adding format=xml at the end:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500&format=xml
as documented on the linked page:
http://en.wikipedia.org/w/api.php
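Once the response is actual XML, minidom parses it without complaint. A small inline sample stands in for the API response here (Python 3 spelling, since the question's code is Python 2):

```python
from xml.dom import minidom

# Stand-in for the body returned once format=xml is appended to the URL
sample = '<?xml version="1.0"?><api><query><pages/></query></api>'

xmldoc = minidom.parseString(sample)
print(xmldoc.toxml())
```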
I've written this to try to log in to a forum (phpBB3).
import urllib2, re
import urllib, re
logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login"[logindata])
output = page.read()
However, when I run it, it comes up with:
Traceback (most recent call last):
  File "C:/Users/Mike/Documents/python/test urllib2", line 4, in <module>
    page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login"[logindata])
TypeError: string indices must be integers
Any ideas as to how to solve this?
edit
adding a comma between the string and the data gives this error instead
Traceback (most recent call last):
  File "C:/Users/Mike/Documents/python/test urllib2", line 4, in <module>
    page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login",[logindata])
  File "C:\Python25\lib\urllib.py", line 84, in urlopen
    return opener.open(url, data)
  File "C:\Python25\lib\urllib.py", line 192, in open
    return getattr(self, name)(url, data)
  File "C:\Python25\lib\urllib.py", line 327, in open_http
    h.send(data)
  File "C:\Python25\lib\httplib.py", line 711, in send
    self.sock.sendall(str)
  File "<string>", line 1, in sendall
TypeError: sendall() argument 1 must be string or read-only buffer, not list
edit2
I've changed the code from what it was to;
import urllib2, re
import urllib, re
logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
page = urllib2.urlopen("http://www.woarl.com/board/ucp.php?mode=login", logindata)
output = page.read()
This doesn't throw any error messages; it just gives three blank lines. Is this because I'm trying to read from the login page, which disappears after logging in? If so, how do I get it to display the index, which is what should appear after hitting log in?
Your line
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login"[logindata])
is semantically invalid Python. Presumably you meant
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login", [logindata])
which has a comma separating the arguments. However, what you ACTUALLY want is simply
page = urllib2.urlopen("http://www.woarl.com/board/ucp.php?mode=login", logindata)
without trying to enclose logindata in a list, and using the more up-to-date urlopen from the urllib2 library.
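For reference, a sketch of the same call in Python 3 spelling (urllib2 was later merged into urllib.request, and the POST body there must be bytes; the URL and field names are taken from the question):

```python
import urllib.parse
import urllib.request

# urlencode produces a str; the request body must be bytes in Python 3
logindata = urllib.parse.urlencode({'username': 'x', 'password': 'y'}).encode('ascii')

# Passing data= makes this a POST request
request = urllib.request.Request("http://www.woarl.com/board/ucp.php?mode=login", data=logindata)
# page = urllib.request.urlopen(request)  # uncomment to actually send it
```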
How about using a comma between the string,"http:..." and the urlencoded data, [logindata]?
Your URL string shouldn't be
"http://www.woarl.com/board/ucp.php?mode=login"[logindata]
But
"http://www.woarl.com/board/ucp.php?mode=login", logindata
I think, because [] is for arrays, and it requires an integer index. I might be wrong, as I haven't done a lot of Python.
If you call type() on logindata, you can see that it is a string:
>>> import urllib
>>> logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
>>> type(logindata)
<type 'str'>
Putting it in brackets ([]) puts it in a list context, which isn't what you want.
This would be easier with the high-level "mechanize" module.