Scrape order numbers from Wix.com - python

I want to organize a PUBG tournament. To do so, a friend and I have created a website on wix.com where interested players can "buy" a free ticket and receive an order number with it. We have also set up a Discord server that is "protected" by a password bot. The required password should be the order number mentioned above.
The problem is that the only way for me to get these order numbers is to log in to the wix.com admin site and go to the order overview. There I can either download a .csv file by pressing a button or scrape the data from the page source.
I tried to program the bot to scrape the data, but it just won't do it.
I would really appreciate any help whatsoever. Please be advised that neither I nor my friend is particularly skilled in Python; we are still learning.
Thanks in advance.
Here is my code:
import os
import requests
from lxml import html
import selenium
from bs4 import BeautifulSoup
EMAIL = "<EMAIL>"
PASSWORD = "<PASSWORD>"
my_secret = os.environ['EMAIL']
my_secret = os.environ['password']
LOGIN_URL = "https://users.wix.com/signin"
URL = "https://manage.wix.com/dashboard/63e1c04d-52a6-49cc-91b2-149afa7ad7a9/events/b1839f48-412d-4274-b4de-2f8417028bcf/guest-list?referralInfo=sidebar"
def main():
    session_requests = requests.session()
    # Create payload
    payload = {
        "email": EMAIL,
        "password": PASSWORD,
    }
    # Perform login
    result = session_requests.post(LOGIN_URL, data = payload, headers = dict(referer = LOGIN_URL))
    # Scrape url
    result = session_requests.get(URL, headers = dict(referer = URL))
    tree = html.fromstring(result.content)
    Order = tree.xpath("//div[@class='_3Eow9']/a/text()")
    print(Order)

Related

How to scrape data behind a login

I am trying to extract posts from a forum named "positive wellbeing during isolation" on HealthUnlocked.com.
I can extract posts without logging in, but I cannot extract posts that require a login. I used url = 'https://solaris.healthunlocked.com/posts/positivewellbeing/popular?pageNumber={0}'.format(page) to extract posts, but I don't know how to combine it with the login, as that URL returns JSON.
I would appreciate it if you could help me.
import requests, json
import pandas as pd
from bs4 import BeautifulSoup
from time import sleep
url = "https://healthunlocked.com/private/programs/subscribed?user-id=1290456"
payload = {
    "username" : "my username goes here",
    "Password" : "my password goes here"
}
s= requests.Session()
p= s.post(url, data = payload)
headers = {"user-agent": "Mozilla/5.0"}
pages =2
data = []
listtitles=[]
listpost=[]
listreplies=[]
listpostID=[]
listauthorID=[]
listauthorName=[]
for page in range(1,pages):
    url = 'https://solaris.healthunlocked.com/posts/positivewellbeing/popular?pageNumber={0}'.format(page)
    r = requests.get(url,headers=headers)
    posts = json.loads(r.text)
    for post in posts:
        sleep(3.5)
        listtitles.append(post['title'])
        listreplies.append(post ["totalResponses"])
        listpostID.append(post["postId"])
        listauthorID.append(post ["author"]["userId"])
        listauthorName.append(post ["author"]["username"])
        url = 'https://healthunlocked.com/positivewellbeing/posts/{0}'.format(post['postId'])
        r = requests.get(url,headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        listpost.append(soup.select_one('div.post-body').get_text('|', strip=True))
## save to CSV
df=pd.DataFrame(list(zip(*[listpostID,listtitles,listpost,listreplies,listauthorID,listauthorName]))).add_prefix('Col')
df.to_csv('out1.csv',index=False)
print(df)
sleep(2)
For most websites, you first have to obtain a token by logging in. Most of the time, this is a cookie. You can then send that cookie along with authorized requests. Open the network tab in your browser's developer tools and log in with your username and password. You'll be able to see how the login request is formatted and where it is sent. From there, just try to replicate it in your code.
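The idea of logging in once and reusing the cookie can be sketched roughly as follows. This is only an illustration: `fetch_after_login` is a hypothetical helper name, and `session` is assumed to behave like a `requests.Session`, which keeps the cookies from the login response and sends them with later requests. The login URL and the form field names are assumptions that have to be taken from what the network tab shows for the real site.

```python
def fetch_after_login(session, login_url, credentials, target_url):
    """Log in with a POST, then request a protected page on the same session.

    `session` is assumed to behave like requests.Session: cookies set by the
    login response are stored and sent automatically with later requests.
    """
    login_response = session.post(login_url, data=credentials)
    # Only catches HTTP 4xx/5xx; many sites answer a failed login with 200.
    login_response.raise_for_status()

    # Same session object, so the login cookie travels with this request.
    page_response = session.get(target_url, headers={"user-agent": "Mozilla/5.0"})
    page_response.raise_for_status()
    return page_response.text
```

With requests installed, this could be called as fetch_after_login(requests.Session(), login_url, {"username": "...", "password": "..."}, target_url). The important part is that the later GET goes through the same session object rather than through a fresh requests.get call.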

Web scraping account with login info displays inaccurate information

I'm using Python's requests package to log in to Beatport, and I want to scrape the songs in my shopping cart. This is how my cart looks when I log in from a browser:
As you can see, there's a div class="cart-tracks" that contains a list ul. Each of the ul's list items represents a song, and those are the objects of interest for me here.
Here's my code:
import requests
from lxml import html
from bs4 import BeautifulSoup
USERNAME = "<MY USERNAME>"
PASSWORD = "<MY PASSWORD>"
LOGIN_URL = "https://www.beatport.com/account/login"
URL = "https://www.beatport.com/cart"
def main():
    session_requests = requests.session()
    # Get login csrf token
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    authenticity_token = list(set(tree.xpath("//input[@name='_csrf_token']/@value")))[0]
    # Create payload
    payload = {
        "username": USERNAME,
        "password": PASSWORD,
        "_csrf_token": authenticity_token
    }
    # Perform login
    result = session_requests.post(LOGIN_URL, data=payload, headers=dict(referer=LOGIN_URL))
    # Scrape url
    result = session_requests.get(URL, headers=dict(referer=URL))
    print(result.text)
if __name__ == '__main__':
    main()
However, when I print result.text, it looks like my cart is empty:
<!-- TRACKS BUCKET -->
<div class="bucket tracks cart-tracks cart-section"></div>
<!-- End Tracks Bucket -->
The login request seems to be successful when I print result.ok so I'm not sure where I'm going wrong here. Any help would be much appreciated!
EDIT: Here's the XHR request:
I believe that items is what populates the tracks. I'm pretty new to web scraping, how would I post that request?

Mintos.com login with python requests

I'm trying to write a tiny piece of software that logs into mintos.com and saves the account overview page (which is displayed after a successful login) in an HTML file. I tried some different approaches, and this is my current version.
import requests
import sys
import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
username = 'abc'
password = '123'
loginUrl = 'https://www.mintos.com/en/login'
resp = requests.get(loginUrl, auth=(username, password))
file = codecs.open("mint.html", "w", "UTF-8")
file.write(resp.text)
file.close()
When I run the code, I only save the original page, not the one I should get when logged in. I guess I'm messing up the login (I mean...there's not much else to mess up). I spent an embarrassing amount of time on this problem already.
Edit:
I also tried something along the lines of:
import requests
import sys
import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
loginUrl = "https://www.mintos.com/en/login";
username = "abc"
password = "123"
payload = {"username": username, "password": password}
with requests.session() as s:
    resp = s.post(loginUrl, data = payload)
    file = codecs.open("mint.html", "w", "UTF-8")
    file.write(resp.text)
    file.close()
Edit 2: Another non-working version, this time with _csrf_token
with requests.session() as s:
    resp = s.get(loginUrl)
    toFind = '_csrf_token" value="'
    splited = resp.text.split(toFind)[1]
    _csrf_token = splited.split('"',1)[0]
    payload = {"_username": _username, "_password": _password, "_csrf_token": _csrf_token}
    final = s.post(loginUrl, data = payload)
    file = codecs.open("mint.html", "w", "UTF-8")
    file.write(final.text)
    file.close()
But I still get the same result. The downloaded page has the same token as the one I extract, though.
Final Edit: I made it work, and I feel stupid now. I needed to use 'https://www.mintos.com/en/login/check' as my loginUrl.
The auth parameter is just shorthand for HTTPBasicAuth, which is not what most websites use. Most of them use cookies or session data to store your login information on your computer, so they can check who you are while you browse the pages.
If you want to log in to the website, you'll have to make a POST request to the login form's target and then store the cookies it sends back, returning them with every later request. This also assumes the site doesn't have any kind of "anti-bot filter" (which would prevent you from logging in without a real browser or, at least, make it much harder).
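To see why auth=(username, password) does not fill in a login form, it may help to look at what HTTP Basic authentication actually transmits: a single Authorization header holding the base64-encoded "username:password" pair. The sketch below uses only the standard library; `basic_auth_header` is a hypothetical helper written for illustration and is not part of requests.

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header value used by HTTP Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")

# The placeholder credentials from the question above.
header = basic_auth_header("abc", "123")
print(header)
```

A site whose login is an HTML form never looks at this header, which is why the GET request with auth simply returns the ordinary login page.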

How to scrape a website that requires login with Python

First of all, I know there are a bunch of similar questions but unfortunately none of them work for me. I am a relative noob in python and simple explanations and answers would be greatly appreciated.
I need to log into a site programmatically using python. I am attempting to do this using requests. I've watched YouTube videos on the subject and looked at various questions and answers, but it just doesn't work for me.
The following code is as close as I have gotten to achieving my goal. The IDE I am using is Spyder 3.1.2 with python 3.6.0. My output is coming up as [] as shown below my code. I have tried the same method with other sites and the output is always the same. I do not know what this means however. And how would I know if the code has worked?
import requests
from lxml import html
USERNAME = "username"
PASSWORD = "password"
LOGIN_URL = "https://bitbucket.org/account/signin/?next=/"
URL = "https://bitbucket.org/"
def main():
    session_requests = requests.session()
    # Get login csrf token
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]
    # Create payload
    payload = {
        "username": USERNAME,
        "password": PASSWORD,
        "csrfmiddlewaretoken": authenticity_token
    }
    # Perform login
    result = session_requests.post(LOGIN_URL, data = payload, headers = dict(referer = LOGIN_URL))
    # Scrape url
    result = session_requests.get(URL, headers = dict(referer = URL))
    tree = html.fromstring(result.content)
    bucket_names = tree.xpath("//div[@class='repo-list--repo']/a/text()")
    print(bucket_names)
if __name__ == '__main__':
    main()
runfile('C:/Users/Thomas/untitled6.py', wdir='C:/Users/Thomas')
[]
Thank you in advance.
chickencreature.
Try this:
result = requests.get(LOGIN_URL, auth=(USERNAME,PASSWORD))
Check out the answers to these similar questions: This and This.
Here is the documentation on authentication with the requests module.

python-requests and complicated forms

I'm trying to make a web scraper for my university web, but I can't get past the login page.
import requests
URL = "https://login.ull.es/cas-1/login?service=https%3A%2F%2Fcampusvirtual.ull.es%2Flogin%2Findex.php%3FauthCAS%3DCAS"
USER = "myuser"
PASS = "mypassword"
payload = {
"username": USER,
"password": PASS,
"warn": "false",
"lt": "LT-2455188-fQ7b5JcHghCg1cLYvIMzpjpSEd0rlu",
"execution": "e1s1",
"_eventId": "submit",
"submit": "submit"
}
with requests.Session() as s:
    r = s.post(URL, data=payload)
    #r = s.get(r"http://campusvirtual.ull.es/my/index.php")
    with open("test.html","w") as f:
        f.write(r.text)
That code is obviously not working, and I don't know where the mistake is. I tried putting only the username and the password in the payload (the other values come from inputs marked as hidden in the page source), but that also fails.
Can anyone point me in the right direction? Thanks. (Sorry for my English.)
The "lt": "LT-2455188-fQ7b5JcHghCg1cLYvIMzpjpSEd0rlu" is a session ID or some sort of anti-CSRF protection or similar (wild guess: an HMAC-ed random ID number). What matters is that it is not a constant value, so you will have to read it from the same URL by issuing a GET request first.
In the GET response you have something like:
<input type="hidden" name="lt" value="LT-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" />
Additionally, there is a JSESSIONID cookie that might be important.
This should be your flow:
1. GET the URL
2. extract the lt parameter and the JSESSIONID cookie from the response
3. fill the payload['lt'] field
4. set the cookie header
5. POST to the same URL
Extracting the cookie is very simple, see the requests documentation.
Extracting the lt param is a bit more difficult, but you can do it using the BeautifulSoup package. Assuming that you have the response body in a variable named text, you can use:
from bs4 import BeautifulSoup
# Find the hidden <input name="lt"> and copy its current value into the payload
payload['lt'] = BeautifulSoup(text, 'html.parser').find('input', {'name': 'lt', 'type': 'hidden'}).get('value')
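The GET, extract, POST flow described above can be sketched as follows. To keep the example runnable without extra packages, it swaps BeautifulSoup for the standard library's html.parser. `HiddenInputParser`, `hidden_fields` and `cas_login` are hypothetical names, the sample HTML and its LT-example value are made up, and `session` is assumed to behave like a `requests.Session`, which would carry the JSESSIONID cookie from the GET over to the POST by itself.

```python
from html.parser import HTMLParser

class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""

def hidden_fields(page_html):
    """Return a dict of all hidden form fields found in the page."""
    parser = HiddenInputParser()
    parser.feed(page_html)
    return parser.fields

def cas_login(session, url, username, password):
    """GET the form, copy its fresh hidden fields, add credentials, POST back."""
    form_page = session.get(url)
    payload = hidden_fields(form_page.text)  # includes the current lt value
    payload.update({"username": username, "password": password})
    return session.post(url, data=payload)

# Made-up form markup, only to show what hidden_fields returns.
sample = (
    '<form><input type="hidden" name="lt" value="LT-example" />'
    '<input type="hidden" name="execution" value="e1s1" />'
    '<input type="text" name="username" /></form>'
)
print(hidden_fields(sample))
```

Reading every hidden field instead of only lt means values such as execution are also taken fresh from each GET, rather than being hard-coded as in the question.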
