Regex errors utilizing Tweepy in Python

I am having a problem with the bit of code shown below. My original code worked when I was just pulling the tweet information. Once I edited it to extract the URL within the text, it started giving me problems: nothing prints and I receive these errors.
Traceback (most recent call last):
File "C:\Users\Evan\PycharmProjects\DiscordBot1\main.py", line 22, in <module>
get_tweets(api, "cnn")
File "C:\Users\Evan\PycharmProjects\DiscordBot1\main.py", line 18, in get_tweets
url2 = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
text)
File "C:\Users\Evan\AppData\Local\Programs\Python\Python39\lib\re.py", line 241, in findall
return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
I get no errors before running it, so I am extremely confused about why this is not working. It is probably something simple, as I am new to using both Tweepy and regex.
import tweepy
import re

TWITTER_APP_SECRET = 'hidden'
TWITTER_APP_KEY = 'hidden'

auth = tweepy.OAuthHandler(TWITTER_APP_KEY, TWITTER_APP_SECRET)
api = tweepy.API(auth)

def get_tweets(api, username):
    page = 1
    while True:
        tweets = api.user_timeline(username, page=page)
        for tweet in tweets:
            text = tweet.text.encode("utf-8")
            url2 = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
            print(url2)

get_tweets(api, "cnn")
Let me know if you need more information; any help is appreciated. Thanks in advance.

You're getting that error because you are running a string pattern (your regex) against text that you've turned into a bytes object via encode().
Try running your pattern directly against tweet.text without encoding it.
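For example, here is a minimal sketch of the inner loop with the encode() call removed, keeping the question's pattern (the raw-string prefix is only there to avoid escape-sequence warnings):
for tweet in tweets:
    # tweet.text is already a str, so the string pattern can be applied directly
    urls = re.findall(
        r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
        tweet.text)
    print(urls)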

Related

Troubles merging two JSON URLs

First of all, I am getting the error shown below. When I try running
pip3 install --upgrade json
in an attempt to resolve it, pip is unable to find any such module.
The segment of code I am working with can be found below the error, but some further direction on the code itself would also be appreciated.
Error:
Traceback (most recent call last):
File "Chicago_cp.py", line 18, in <module>
StopWork_data = json.load(BeautifulSoup(StopWork_response.data,'lxml'))
File "/usr/lib/python3.8/json/__init__.py", line 293, in load
return loads(fp.read(),
TypeError: 'NoneType' object is not callable
Script:
#!/usr/bin/python
import json
from bs4 import BeautifulSoup
import urllib3

http = urllib3.PoolManager()

# Define Merge
def Merge(dict1, dict2):
    res = {**dict1, **dict2}
    return res

# Open the URL and the screen name
StopWork__url = "someJsonUrl"
Violation_url = "anotherJsonUrl"

StopWork_response = http.request('GET', StopWork__url)
StopWork_data = json.load(BeautifulSoup(StopWork_response.data,'lxml'))

Violation_response = http.request('GET', Violation_url)
Violation_data = json.load(BeautifulSoup(Violation_response.data,'lxml'))

dict3 = Merge(StopWork_data,Violation_data)
print (dict1)
json.load expects a file object or something else with a read method. The BeautifulSoup object doesn't have a read method. If you ask it for any attribute, it will try to find a child tag with that name, i.e. a <read> tag in this case. When it doesn't find one, it returns None, which causes the error. Here's a demo:
import json
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>hi</p>", "html5lib")
assert soup.read is None
assert soup.blablabla is None
assert json.loads is not None
json.load(soup)
Output:
Traceback (most recent call last):
File "main.py", line 8, in <module>
json.load(soup)
File "/usr/lib/python3.8/json/__init__.py", line 293, in load
return loads(fp.read(),
TypeError: 'NoneType' object is not callable
If the URL is returning JSON then you don't need BeautifulSoup at all because that's for parsing HTML and XML. Just use json.loads(response.data).
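For example, a minimal sketch of the fetch-and-merge flow without BeautifulSoup, assuming both URLs really do return JSON bodies (the placeholder URLs are kept from the question):
import json
import urllib3

http = urllib3.PoolManager()

def merge(dict1, dict2):
    return {**dict1, **dict2}

StopWork_response = http.request('GET', "someJsonUrl")
Violation_response = http.request('GET', "anotherJsonUrl")

# json.loads accepts the raw bytes of the response body directly (Python 3.6+)
StopWork_data = json.loads(StopWork_response.data)
Violation_data = json.loads(Violation_response.data)

merged = merge(StopWork_data, Violation_data)
print(merged)
Printing the merged result also sidesteps the NameError waiting at the end of the original script, where print (dict1) refers to a name that only exists inside Merge.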

Python URLencoding Specific Results

I'm querying a database in Python 3.8.2.
I need the urlencoded results to be:
data = {"where":{"date":"03/30/20"}}
needed_results = ?where=%7B%22date%22%3A%20%2203%2F30%2F20%22%7D
I've tried the following
import urllib.parse
data = {"where":{"date":"03/30/20"}}
print(urllib.parse.quote_plus(data))
When I do that I get the following
Traceback (most recent call last):
File "C:\Users\Johnathan\Desktop\Python Snippets\test_func.py", line 17, in <module>
print(urllib.parse.quote_plus(data))
File "C:\Users\Johnathan\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 855, in quote_plus
string = quote(string, safe + space, encoding, errors)
File "C:\Users\Johnathan\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 839, in quote
return quote_from_bytes(string, safe)
File "C:\Users\Johnathan\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 864, in quote_from_bytes
raise TypeError("quote_from_bytes() expected bytes")
TypeError: quote_from_bytes() expected bytes
I've tried a couple of other methods and received: ?where=%7B%27date%27%3A+%2703%2F30%2F20%27%7D
Long story short, I need to URL-encode the following:
data = {"where":{"date":"03/30/20"}}
needed_encoded_data = ?where=%7B%22date%22%3A%20%2203%2F30%2F20%22%7D
Thanks
data is a dictionary, and quote_plus expects a string or bytes object, so it can't be URL-encoded directly. You need to serialize it first.
You can do that with json.dumps:
import json
import urllib.parse
data = {"where":{"date":"03/30/20"}}
print(urllib.parse.quote_plus(json.dumps(data)))
Output:
%7B%22where%22%3A+%7B%22date%22%3A+%2203%2F30%2F20%22%7D%7D
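If what you need is the full ?where=... query string rather than just the encoded value, one possible approach (assuming the endpoint takes a single where parameter) is to serialize only the inner dictionary and let urlencode add the key:
import json
import urllib.parse

data = {"where": {"date": "03/30/20"}}
# JSON-encode the value of "where", then let urlencode build the key=value pair
query = "?" + urllib.parse.urlencode({"where": json.dumps(data["where"])})
print(query)  # ?where=%7B%22date%22%3A+%2203%2F30%2F20%22%7D
As in the output above, the space after the colon is encoded as + rather than %20; both are valid encodings of a space in a query string.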

Getting TypeError: ord() expected string of length 1, but int found error

Code is
from PyPDF2 import PdfFileReader

with open('HTTP_Book.pdf', 'rb') as file:
    pdf = PdfFileReader(file)
    pagedd = pdf.getPage(0)
    print(pagedd.extractText())
This code raises the error shown below:
TypeError: ord() expected string of length 1, but int found
I searched on the internet and found Troubleshooting "TypeError: ord() expected string of length 1, but int found",
but it doesn't help much. I am aware of the general background of this error, but I'm not sure how it is related here.
I tried a different PDF file and it works fine. So what is wrong: the PDF file, or is PyPDF2 unable to handle it? I know this method is not very reliable, as per the documentation:
This works well for some PDF files, but poorly for others, depending on the generator used
How should this be handled?
Traceback:
Traceback (most recent call last):
File "pdf_reader.py", line 71, in <module>
print(pagedd.extractText())
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 2595, in ex
tractText
content = ContentStream(content, self.pdf)
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 2673, in __
init__
stream = BytesIO(b_(stream.getData()))
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\generic.py", line 841, in
getData
decoded._data = filters.decodeStreamData(self)
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 350, in
decodeStreamData
data = LZWDecode.decode(data, stream.get("/DecodeParms"))
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 255, in
decode
return LZWDecode.decoder(data).decode()
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 228, in
decode
cW = self.nextCode();
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 205, in
nextCode
nextbits=ord(self.data[self.bytepos])
TypeError: ord() expected string of length 1, but int found
I figured out the issue. This is just a limitation of PyPDF2. I used tika and BeautifulSoup to parse and extract the text, and it worked fine, although it needs a little more work.
from tika import parser
from bs4 import BeautifulSoup

raw = parser.from_file('HTTP_Book.pdf', xmlContent=True)['content']
data = BeautifulSoup(raw, 'lxml')
message = data.find(class_='page')  # for first page
print(message.text)
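If you need every page rather than just the first, a small variation (still assuming Tika's XHTML output wraps each page in a div with class "page") is to use find_all:
# collect the text of every page div instead of only the first one
for page in data.find_all(class_='page'):
    print(page.text)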

Wikipedia with Python

I have this very simple Python code to read XML from the Wikipedia API:
import urllib
from xml.dom import minidom
usock = urllib.urlopen("http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500")
xmldoc=minidom.parse(usock)
usock.close()
print xmldoc.toxml()
But this code fails with these errors:
Traceback (most recent call last):
File "/home/user/workspace/wikipediafoundations/src/list.py", line 5, in <module><br>
xmldoc=minidom.parse(usock)<br>
File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse<br>
return expatbuilder.parse(file)<br>
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 928, in parse<br>
result = builder.parseFile(file)<br>
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile<br>
parser.Parse(buffer, 0)<br>
xml.parsers.expat.ExpatError: syntax error: line 1, column 62<br>
I have no clue, as I'm just learning Python. Is there a way to get a more detailed error? Does anyone know the solution? Also, please recommend a better language to do this in.
Thank You,
Venkat Rao
The URL you're requesting is an HTML representation of the XML that would be returned:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500
So the XML parser fails. You can see this by pasting the above in a browser. Try adding a format=xml at the end:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500&format=xml
as documented on the linked page:
http://en.wikipedia.org/w/api.php
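For illustration, a sketch of the question's Python 2 snippet with format=xml appended, plus a loop that prints the link titles, assuming each link comes back as a <pl> element with a title attribute (which is how the XML format normally renders prop=links results):
import urllib
from xml.dom import minidom

usock = urllib.urlopen("http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500&format=xml")
xmldoc = minidom.parse(usock)
usock.close()

# each linked page title is assumed to sit in a <pl title="..."> element
for pl in xmldoc.getElementsByTagName("pl"):
    print pl.getAttribute("title")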

python logging into a forum

I've written this to try and log onto a forum (phpBB3).
import urllib2, re
import urllib, re
logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login"[logindata])
output = page.read()
However, when I run it, it comes up with:
Traceback (most recent call last):
File "C:/Users/Mike/Documents/python/test urllib2", line 4, in <module>
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login"[logindata])
TypeError: string indices must be integers
Any ideas as to how to solve this?
Edit
Adding a comma between the string and the data gives this error instead:
Traceback (most recent call last):
File "C:/Users/Mike/Documents/python/test urllib2", line 4, in <module>
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login",[logindata])
File "C:\Python25\lib\urllib.py", line 84, in urlopen
return opener.open(url, data)
File "C:\Python25\lib\urllib.py", line 192, in open
return getattr(self, name)(url, data)
File "C:\Python25\lib\urllib.py", line 327, in open_http
h.send(data)
File "C:\Python25\lib\httplib.py", line 711, in send
self.sock.sendall(str)
File "<string>", line 1, in sendall
TypeError: sendall() argument 1 must be string or read-only buffer, not list
Edit 2
I've changed the code to:
import urllib2, re
import urllib, re
logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
page = urllib2.urlopen("http://www.woarl.com/board/ucp.php?mode=login", logindata)
output = page.read()
This doesn't throw any error messages; it just prints three blank lines. Is this because I'm trying to read from the login page, which disappears after logging in? If so, how do I get it to display the index page, which is what should appear after hitting log in?
Your line
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login"[logindata])
is semantically invalid Python. Presumably you meant
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login", [logindata])
which has a comma separating the arguments. However, what you ACTUALLY want is simply
page = urllib2.urlopen("http://www.woarl.com/board/ucp.php?mode=login", logindata)
without trying to enclose logindata in a list, and using the more up-to-date version of urlopen from the urllib2 library.
How about using a comma between the string "http:..." and the urlencoded data, [logindata]?
Your URL string shouldn't be
"http://www.woarl.com/board/ucp.php?mode=login"[logindata]
But
"http://www.woarl.com/board/ucp.php?mode=login", logindata
I think that's because [] is for indexing and requires an integer. I might be wrong, because I haven't done a lot of Python.
If you call type on logindata, you can see that it is a string:
>>> import urllib
>>> logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
>>> type(logindata)
<type 'str'>
Putting it in brackets ([]) puts it in a list context, which isn't what you want.
This would be easier with the high-level "mechanize" module.
