UrlOpen Redirected to default page - python

The default data link is http://tools.morningstar.co.uk/uk/fundscreener/results.aspx?LanguageId=en-GB&Universe=FOGBR%24%24ALL&CurrencyId=GBP&URLKey=t92wz0sj7c&Site=uk
But I do not want the data on this default page; I want the data under the Portfolio tab. So I used Firefox to determine the URL of the Portfolio tab and tried the following Python code:
testpage = urlopen('http://tools.morningstar.co.uk/uk/fundscreener/results.aspx?LanguageId=en-GB&Universe=FOGBR%24%24ALL&CurrencyId=GBP&URLKey=t92wz0sj7c&Site=uk&tabAction=Portfolio')
However, the page is always redirected to the default link. How do I get to the Portfolio page?

You need to pay attention to the request that is being made, along with all of its headers and data.
If you inspect the request for the "portfolio" data, you will see that a POST request is sent along with a lot of data, and that the payload (form data) is what makes the server send the portfolio data back in the response.
What you need to do is mimic that request to fetch the response data and then handle it according to your needs. You can do something like this:
import requests
from lxml import html
headers = {
'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'accept-language':'en-US,en;q=0.8,ms;q=0.6'
}
url = "http://tools.morningstar.co.uk/uk/fundscreener/results.aspx?LanguageId=en-GB&Universe=FOGBR%24%24ALL&CurrencyId=GBP&URLKey=t92wz0sj7c&Site=uk"
payload = {
'ctl00_ContentPlaceHolder1_aFundScreenerResultControl_ScriptManager1_HiddenField':';;AjaxControlToolkit, Version=3.5.7.123, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-GB:5a4df314-b4a2-4da2-a207-9284f1b1e96c:de1feab2:f2c8e708:720a52bf:f9cec9bc:589eaa30:a67c2700:ab09e3fe:87104b7c:8613aea7:3202a5a2:be6fb298',
'__EVENTTARGET':'TabAction',
'__EVENTARGUMENT':'Portfolio',
'__LASTFOCUS':'',
'__VIEWSTATE':'Or/Z5BkJx2WVGMIPWgbVTVzk9hu+/eDKDHsbG74cJRlPSPW9dXuSQt31f2njq7X4NCZF/VW7u63TU5lF3lWGIAFNRoIIWwlRVMeMWeHygunbmBVxWWO08k90rAhbOCiyeOgKoaL1lVKO0R0DGS9rjl1Gah7C2NiIyLeD8boWobKLRV47aRiqaWI9ZYprxoky4zmuNp4NP51z0QLfb/4TvQKfcXJcUHHAAknVurwXfye3cHiUGf7pOyI84E9KJscHsbowC6mejPX4XmlXLVrVXk/lupYU8yTXSp03D2vfyPcQcrxt3y/uF0kXNG+4A/hFWOQFazVk1SRMYnQlrWtQ9Ulh58Q71zEZvX3yZhnp2EA5ZnYuOfeFWCnwwUBa6s9o8uLocDK1Q4chtjXDqK7q2W89kPZoyYjmgB5xunFDt8A7Sz3IFaDkJEyPYdBPOKx1Y1zv0g3/gwBnd64UXkTlBySHZao2CB/OBNQoqI6RqI6L44nrbESabh+DHBCdcCKeL8Pj+lsM5o7P0ShXpXHbCRTPk4PiWVeP4hk1vyOFA7tiReoWEPwQvDe3sqWh+K7EHLHefW5ke6W9zy5seHuC1vfcVTwT5FUIcTaAnhoDSphsMHWPoVc/vtcfExPWUx/aC2KIf1m+DKtN/no8Frt4SYqxMGtDMSUZjMR5xhFHSaqfjv/0Gs+RVod4N+A4rYeUO07A9VTLTE8SuZ4ovxjhrEtAQ3bYqzt29leHpmFT7Pfl7OZw3t3wt6SjQX+Q3M5ozThannhRKaDJCnBZdFh7ZnY4fgCLpDNyMDq3FccJC0V6PDSuu6enpPWOcy4NJj5H+/rEqo61/e2wgmefzt7Zaygu4v66MmOKLqbWymNa6C1Xuc0u0FhERUrSWrL/rS9kwC+LA7aWFPhdnEnwPewV6yj7kWzb4IZZ6ivGs3CXYH7G2HTnsP/P2bHXNV+YaaXTkdKJXkiPF7/qQ3JKzZYhDJjj0PObqtI2RlhmeecF8Lq8SfRRrBTXWjvg48q7nXurZ7ztX28QRDHC5aP/13+X8RyvLRmiM4V3vRdMjxpt8ySZtKM5wpEA4XjTUtCWtrNKO18yc0pbMaGRm5xEoXLY/i1cHC62OvKRsEYX82q+KuGyqEKwPoElc2SbMfyEit8M6tkA12wBce4cqlUMX4D85OOKCMzhY+h0loQFSsgVFfqKpEHpH9yg5lKtRg0dZ8P301xoGCeXhBhyZIp234EAdOOQySV4iNcBykLFGOuB0w63KqVbQRejqlnj2Qd+OkoXQ4hAh9tgCXdxhOHZ1hLB3nHHNMT3TDBO+j8eXgxAE8PN9zt6Xj7qGqDmkHAlwMP3Q8er0Ms2i80x7pUzvy5ixozAbUgfuEeKtjkK7fSD0UkKMa/YELEjTkgVJm6goPPIR3D2lwNAQLyHM8xLFSy3evkpJojw+QEFw4U9n31CoO6OB15Isqy/E1MPgwq9Wz3mUn2iYH1JruwsgQqQXraUKAiyMlpfbtj2YQL39Zp+AwzPeDDbwRaCCNBFvmpapcJyMpmzlzd0tr9gdV1GoVTtWBg+UcVGSsQi4XkD+32CfUuQ+ZpFlmUoYLuYSAEFV7Y97MlqLMqW89r/BZXRXNacpizFFrnQlCnsM4Bj4DUp+K7pcxAaYRKWcH3tiQO6zhCa2b8YoawWzQ5Ij8b19Z7PLN3Yug9ldeJ2CcYOzUQebT23ofSNtCU+uTbYzzh6RE8Bg/rut8R6A1uwYBWvjfL7N7M2fUSd01pwYgJ0BfsViV1pipzpCvTL5hGf1aK25gR+T7GtIxNbrdlo6Z1LbV/xYQYIDTod5dq6wUttZJVLeLVZRkCAv+M+o7Bvd86pi82TIdC8foOPgo7OR6ykPk+aMt1pr/hBV3tmBOUvMyYADmmOZQR+L/AQ57tRukeRyACeTJq1b5icpxawI+qn71we6eAKmg5POvkbq+pI+YnoSs1Mhk9OWeJ1CPRg3P5TDMIhq
XsG4mKY6awMwZXF12/r4qb7bRnfZGFukHBAYJTRsmZsLgiicM2uJ7kchxs2U/jwVcItGHgnIYkg1r7TTJ3oFo1rHEVhFHm8dIem3iI/VUpbe/XZyEKseDxoALSbASjYxM5n2eGfBLFnHMHv8RPfrX5EBfD0ZzMAVc8MoSycTsJuJI8L912Eewk9Cz3mb7o2zF9L+8syg8NpEDy78kIa0lE+QNqvdtk3P7uCxUckKWdmKLUfU2zaTBBGkIcDo8xXktZFgC+yQbUtxD2yFC21tvSA3xJaPVWqycMiVRp3fwIabWylnRnnwLqvAjIPTKiZI5w/szdciCwzx0GhSY14xpVV+jlLlfH8KCqVBVL5NIzxRTw+ELVPHOS3orE1dKtCcOqM22GE5PsU69E7ViA+fC2Gn/HzkUfUHPBKjKixX9hTmZzOnXToBU5sdEMZ1i3Jte+xfk3YVzYv9TO9f5EiibdNgw8MdCrXwxlgYNZUob0PixOajsPed+qv2PTl+kvLOSTkw6Z6K892TJkBvpAGQvP/zSgUorcNhuAJwQVG32TnX0HypMPpVwX0SqOhZLGM9essa7guKOrA3GdIDsoA2/f4JkFlJMtVgXKGPNXr7mTCeq2H8vFfQbH/59wPfMgrxxo6s9C+Tyt5zG3lyRoTEGUr4QwBkSeHq4J6Vya3sFDH911QHrfFuaKF2auqHHGuKyCCViqpb3A1Z4/GbllXBC4cmjyKc8FfI5i2eSSEMOd95N198ZCOD7x1zXPACX9QjaMdzZbadJ9UHXYsb/7l87ujNY4x5S9oQXgfW8fva9i4oqTqMV3VXTQK8lVcFovH0OxXXpNZ+rPm8Tj5kbRGrMgp6CdyxWSLKvqYv8f57ICr6ozaxyAd8XiTM+AhkfnXsN8BcH0u1yP6WUBDkUjhBi+4lfO6Dj5r6pFIN65GqPaz0mRFDpZU3nVQ1CmmeXneh0ZT/u7tG7Ray5Md5jr9onVsWfWnbc0hbUP0ghMANhtZtcrLpFikwxxQybdsS/xWdB4dLenTMAi2hn0KQ196thhQvvhEvEWaSxuEjX+iaQB14kXwOHAsBj8Ikp4lIdBsVctVQFVNzM3+F+UfDIbpTFh4IaAvOWNZzFGZYjdKDKKIuIgSAhdkHZbjQGpvXWdx12WR1/I/aqk5dx8OFpU3Lq/thZxQ+0oODetvex87L6lKWMgUcvQQAzAXbwzFp4wcTHnQuKJ21hqotOfn8F0GmWv59/hqfH1oFpt6/ENAs162hXOdGt5kTYl7u6X+ciQiIioRLiJ/NRIOoa1T++6v2FMk9acnOfNYMxEGeBdtqmLIN70aL8wvoFLliCkUhfe4yPaFQzFo26JsnnAXUpuiDKfs5fjDS+Rk/1BfVScqDIMv8IL8RDIoWxg8NX5DOOJPwAc3uC+s/kCCpoG2L0m9FLgSBv6Nr9wuv1rt59C/K/5RETD/VP415ArnuUBrdGpuYza1FvYyCo85HREzIL2lN6yZUBUXbBBrWxa3LiGaojhfhCyflhhHs+GoM8zfY5IW7Wpvp/YMPAgxXNRtegGL80+HU/dmlkRO8nRx3eyzpcpWZ302rK9m+OYqtfUXwvFKR7ULWnk/2aHsTQe6lwifxK70QG+jhZlrJqbPGi8vSpajsGMw5iU+VJM4CEDcGhvgpzODw3LkXPvsFrdLq8eUzHXo1Ox+yiZ2zSN3vGDcGeEZiQAbG2dcNt7niW+reozfdxVQAi4uLpPGWYu8jvVnRxoMuQEKEGzIiwNNsvpgCMGdUfk0izvvkTplz8lvk6ROlhy417VtiiVXVIMTAovFNO+W3O3/17LJ9Ed7QPYdoUO4n5fidYX6r4QUAoRGowMAPHQRIdg/AqN7N/EfmDRD72t7BOqvzVetXuTVId75vKB2P0CwoQPDIy0ynLZcTRykRs38LIHwYI5irp1NUjCee7mvo1RE0asD670LM03ZFMCOu/hmgln2dk5oFeyysISdxVUQKRmI6VytwEsSv
iOZeP1cZgB5DakdSgCaloRI2JGVbZ22B9UgO6hFSvfHhox5y1p/CzIrJPd+GUB80wmFX8Kgl0DSjsf9PJNQlAKu/jb85+wvF7exNrPyrShkWE9lYjbcmBHPYc+8J3ia1N3LWtVbR1x554dYoWHVGw3VbU3bWfqLjn5Eon4x3h5R/bCwVBorVCsQ99SzWCv5J9dMRF38r8y1yA4iEUPcX89n4nl91t6cnia4THuk2hhbaBPeu2PFwnTuwiJxAknVEGFUslAXu621wvmyssftVnQ+jzirCQNJAXyE75t+pNrWmQJXrpHDxnR3V9/LFrNy3tZn61H+UkEY1QK29bUJHE+DOfSnS4QkNY3VpLpaBdBeBorOOZ6dEc+lzVDcPrgjL+1fqu/yHwFmxCN8MfreDuX5E8M86YAR/xnyYAJRMxafR1p9eIG+cgHwIBeOhCw1J0p+ydN/bNK7KousyY4OcEr4zTF6crn6LmN6C7zDqabx8cjMRmUeNl24x27LxJhakNbmQjMVPSfo6Ro/edo0L7pG+pbj9SwiaJxGkr65b8pvrTwDFKh7tLUQtZ9j4s4y6MiQ565q2OJp93rm2deHHXUsM/ziI1t9OV00dbjuhTaLTmF5u3rIpKgryYVvmIa6G081WcKCeWz46amFLg7v39SB6XNuL0AIxIow5Hu+S0oIv51+ycUZoLUnypTFe/SnlobjYxAxsJt/cnxZ4wh556EZ9rN0HaJrbbb7pA4uaDBvz4EL8ndM+zEmlacyfQlKSr+jdB0XX+An6zhNQv3D5dkz84QdmPuAavhemrwr2m30Q/tNZ8DZdsBpOyR/U86nplu9Sx5LcFGnWULX+teY24nBUUghfuhGRPEr0dHPUUgMwpQq1fpcz6YQft09B0uthQiYWhNXvnsrlvnLzZTWTZLjFfwDlNn5RZqn0fAxudbM+eOzL9xvx4TBEEpcyf5nLTuNAKvfeZm4KWcRmV+WPnDJxmf7OlTVNKsXiY7Y+bJMjgNfKMh3oQws/1+gtATMlYSdjNIzuYSglhMyXS+BPRI8dpDq2VeF/cb6AII0Pvyq0H7nRadP1xccD6hTKdb4rP7kEAAZClm7P4M4Mog+CAXePDMw3kSkRGzsbT/6rKffKp9crRcOnKwSHU2yuf+NBTES6xeaPD0R7YwjbrRHPDsOoOdQXEcn/bl0oNnnLheSZhDKdFERtlvrpVB8qZ469A2Jqw/X5QMcIrEb0gisLWRSuiCpg/zmFDqaDsj1M8evc2MPGtkKxw9IsuupthsWKxYkbwB2inJdnLwgCDx+2B5oIT7pYbLricSseF1ukjL3uEyHicA3WztLzKoLjumpzevRWBs1VnYCL0Ow0U0yABR/dz3nh0mcE6X0iBb8ulgp+zn/8CNTNEE7lVSPPn3FFr6+mNuYu5O9fn9G6lji/8muhJWTW/9bbrA/2ZVPK4pto7mmfh/OWmkHnw3Te4ZysDIOXcXD7BCixoSB/3l88JQrGB/EAqrNz6oEhXeQ9hof2EhwKI3ZoxvKh5jfDii3PWI+NJPdFFtP+zRS+P1p4aMpQC703rHkmiSFRJIIaPnnbnXNN1NhBefjkjFA6nTvUcYsbBtKQzFJbAEiBnhOo/+jgUdd31gZbZbRi2Iw+Pv07qjDgwVznE5HLwEu8y2k+mdW7f1RKIgjiOhPA4CzBcWumeo7USUDpHaLNWEP0lLiwuxxB8CigRUln763e4xFAvd+vPlBoJJsJBUezJ5OdV5AC6Fe9/UuFT8Ov+Bsknk072xPIHxLks5J6XxDNrm7mnDKTirLE3y2OLpy5gAUPc1n7UpdH08k3C4y+8iqILZXfN6WzR459QcmY7Uu2YFSbxVM6dVYsE7arsp3zyDgjCgnctbrlO2A2iJ2P7f8eGdYEnMjm8Hv4lwFfSHDKVuVoD3+2Amw5CtE9Smtdidm4OTC/C1yePG+IvtXlx+21lgPpRdWOCFmz8/bQusVvRlQCz8any5fVnXJERaeQ
xC8UMbWgmRPQs1Q0rrpe7V8LKq4W5rwmEClsmUoqWiaDXN/nuzuPY2Bm7l4Qo7B0JQd/AEA3Kw4/4L8XLbQ8JHtnamJExXbDFZp2jPjz/9igiPq5j1+/ZqJtnwHPa54R2gLGbV9plpEMOu94Og993CM4QxKN4LSD/TKUV/ik46I5H1texmN7RWMcL1gAPnO0AVbiI8sP1xNxclROHanYTudyVVKld0qrq6ht4eYOgPL2RWB8ma1i8DiqfEy2Z+iDHIDv4nB92ktT7BKNA9MEC19+Nmbv1nNgiULtt+jOZZ92XlPnVU5fYIQZEsn/VPzxFx+6yoDmdN9+aeC6k/SfWFPJhdQ6e0Kk9sOpZQowS+GpaTGw86m2K/RyyzQyzJetV58Wlon3bYwzST6N/CHxiyWJ6KiC+JOJHjKuZjoX33FKKp/LqkG7PlC599l6afuCmU/e9r+MczU+BqdMZTAHshA5mpgPqT2gjHvD2Er+J9Be/P7YHCUvOUFpnfcDWVDz0Hx3kqibf8iFKBFDzK+YHk7U0I0O7yRgLTpxf3CC3ZpEF0unuuM/29BgViIucoRdPIHceE2aTUcH27myPTT5SJ8SGpJrfr5bS264VZ7la/ewkdSVAvfKZAD95KofW0up8Bt4z14+IQCE1Xe1c2fA6Vr20junvkX4ZgQVrlWGztCX26LigP19olHT8mDFPGzCEuXXqFDSGUaJ9bpRglkd5Ps9JkKbCfnzF+PLFMByr5xDywnCaMDyzLttVzfdu32qVI3UP51UZ1lnnylXuKjuctCJiC3it20cmMf4VK41gZpGbUlWoasmnx5OGapdvwGTtrT0cVNRkg6UXjj2BuZncnTBPzNhwarsZUsBBtOAbDue7xfEOXn1lLUDQ1u8xKqwzN4tqYTbp+VQAuI3LG9FaF6cD0wQX9qyFji7oIcmsHMe9KROCKi83IdG/Td0ML2/h3yUG/VkKVTdKbX2QGNShAKCT3JPPfR49q0WIeuJ4SCskuiZfzzo2m3rdodBvqEPiGSOmlmp6/RLGF0iWtZ177aqYzeEEMwO2BpxR0ZB/+0JTD9u+DE50h83q3eRsikNc4VlZDB8JWJW3WRLU/AxRefA0rP3nes0B7Thx0MTXpRsOGjprtSKYR9h73QxDugTSNA77sCpjSutaswovVdn9Lj6k4IL6av1gRK2Wso/e8YNEglHp+VBEegXw35ZOByleDUwqdOS07xCA9hBS0Oec0v7YvFEAIJnW4FunEV2fscciH2gVpDR2s+FbKjVdT7t6qNnWT4HX2PuL2mrHDKgE/l8tDvJ2z2zZW1fTiExfTngLbTOyhlptrc8RFDAJBi3jdYw4HU9LufawJjUIukBgiX9cM5y2IykqNzM10tMsMVfxRH10lHqieGf9e0u2ht+gRmYwooRzoyenkKlWVHh8E37+yXa0SHuV5qXb/8sk1IGqE5p0wL7qWUfOTRAdWtPll30n16f7Epfl+dYI8m93uTk2FrL0Dsosdkp5BmIilduNXje1bMonEtliHrJ012Q0FIxVjEOZDUTUXYwRw0mF0eaxvKu27cJ1OqYUGfJk9zqAiAc9QnTBDL/f+zljgBzs6FWC2PWASaoMrS5Q0aNlN/y5wmFma9swHoEwrBUXr4Vi2Nyf5jj/FijJz77DaNs1J4G5uUF8Abe8HRvYc0XCtEMWqCcv3W78Px4/v/ThOMvamMcJvBh91/6Ep95/SzHvuEDpb8WsUKjwpXDdmp1k7QgMwb0ymrheZhxj/mYklx4EtnMWYwIPt02RJbEFoEcgB09chAg5x8rTh6FmJzGHmZOv7A8oEg0CvrO2pT+aiKqCTRcJsOKKvZnLXlQg1TwgJh3jCgvVVSPGIEO4RpIWMNT1/Opno6ytmiubgX5NoythDBrG5WtoAltsfomRTkb1NWOhcam9Q=',
'__VIEWSTATEGENERATOR':'7400285C',
'__VIEWSTATEENCRYPTED':'',
'__EVENTVALIDATION':'1p0545yV8Pljo87c7Dlb6kiemgGIXd5S3wYUGUoMyg6IPO12GWyBgNM7bu67YNl9f9Sx9ad9lHLIfwYtw0nDGqWYtWBnM8PHrdmxYdOb5+qUooGamIPBCCel/8Ri+FGpvNGPTZkKeuYzfomnlqr/mYoMcjdnsiQWCf7Dvou8X5p8A/pkHRReFtE8H4xIhr8X3MU6lpxBhHZKj3UK+hBHCWxEnQkGb0Nz2Pi8hyWNt5AUu830RSQnl793RwuxwQ1HCmJYFEx00c11gXmSn36PPP42OCMstDR/GpK2LUPsNQbdJ7TUq25rzG/5SIjYxWA0nQbGY/mWaY3Q9iCo7k5o9QnZEf4yLaOF5g7nEva4lTZNwx27ynyDAWrRBVE0KsGTsbIQMgMPqCV8gzc6irsluosW+EI6zW0mdaeoiBaGYHFQnJ77a5rnbpL0j6fMiDfL+5VW7jAaRnyz3Y10Cn1TlXEY4rvQjHZIxvK/rBe2WSVkyXIhrUgAl7a7EvDGnBniVdsizrBSgASrjcT9svJ0aHPEpfxJmy8nuzV0pZbXGzG1q06Xyij4SoxHfi7tf1dPjOn6+zdR2SY0/sQvXmQ35bAlbnFMKWdzyJHB0uEm6GYQtV4Dcj6fjGijxjIiQW0SgjuRd7/8k1i7MvEnD+I6MRIBhx/KNOdP3os9oP8pyMicIz8V1o7KENKwX4fyUmbIx34adCXt/DXT1sjFtNu4S0vzvOU/AtxBLOEcw+clV25xSp/94dEq/dge+K2ySRuxKt6DcNhjDMYvc1ACXbfVjANG8ar3x7n9kX14EMtnpip0RI9ypma3tOmjqip+Qc+lyc12A0jV714BfTzw6nSjYya75Idztq1gZifNt+pQn7GO7Qw2kIqNnXvpA+UkWbsTlnTKyY9gTqPHbF5XcSvtvDfNYM6mxVMJZ1MyAt4pxrCgyGAC0IswPZ8wMAablR9fNStFs67D4kyeUCU/2IVTD1/pfmMC8meLuaXpgHkl2er3Wr84H2lVL+xUd7/wCUSkLFke68SeRfqPl7dIR9hstVJC0cCTbmco0KxTzcwln3QdoxveE/N8v31Z9teZoJxdeRRFyJQFGHw81JVor2kACBsIkioLUvz4IpUxE8XEwUrjHCBJyZG3QzcQAxSBXprztdoknBgrd38wCssuCa3gQvIoMtbCaedWhmY0pA/AI0aHTHR/j5nTg9jaqeEViZF0hLVEhVz2MojXswtp0aA70YwuMPBmMCdgy5w+wSeThtsyt6j+b2NHHRkE0uqEc8D5XDUB5M2UolpsZCcuOKSwK+jwGdeb3gPWssUQShMgRTEoFLapZJKX7c8/yeZ7Lf4KrFE4pBz8+JeFJOVFl3y1ewckAFZVHdYvu1901Aq6PKTMu6kGz/LElno8fCJbyKacsZ3LtpssGhvFBN0vv0WIn4elkiLCL3u9v6oWOxaK4OIDTaVwDLjb5BvBpd2Szj5diHAG1IXoVQYvJ2VEDVwbiTUChXRZcDY6bAm7dvkYWLOxsa0whGz2xeeApbEceeQrHREelH89ucBenmuENPiF98Kf4mZQ/ThiVhFxiWAux1b0Dn0z7M/mXfposuy8ytqtRry3SJoC5V7I+7E4N5x0JyVwN/vtxBpd5h443R35RvDZ1tnscirsGzNoulevkeUqM+I6TgjrvBF00fv3isZvzUjIXK6E9cAg4G7aPyuirI+KIICi9VNxF5fDRxi2UPTHiB3NT01Vez5GVt0Tu8lpn2iakJSBjihOYORrSI+xJzbQdnCzJa1+h8UiAFXgpqWviJUVXG22wFQ1HQckAbFxU/Pcyx+QrsnDrhwihqmnwFd1fuwOy74SAvPMojpxujxWDe+37nhroEyhrk5yOB65RDUcQFS77a+3RwuNyXAodTC3QMp5lMZD1Ae8zGEBesg4zbkP7aMS+ljYBShRN6n9KYhHZ7s2Iq5V4K6GrUcOFdXP157jN4vBuj8l+Uo
BIPjpMm9KKpLnCuSjGNPIyoxfPg==',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$ddlPageSize':'20',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnFilterBySelection':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnSelectedIndex':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnSelectedFunds':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnCustomImageFileIds':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnSelectedSecurityCount':'0',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnTabs':'Snapshot,ShortTerm,Performance,Portfolio,FeesAndDetails',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnSelectedRow':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnCheckedColumns':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnExportLimit':'50',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnExportCount':'0',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$toExport':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnAllowAllFundsExport':'true',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$txtSaveSearch':'',
'ctl00$__RequestVerificationToken':'e_BrPK0DxBjkgMrhfkdyFJjp1nPzltSn0h20aUjHJPSe3W4w3FRsFQNo_YY3Ml0D1CkNGqC5PEJBigtZuvdbiYrldSMrUoOQFUjaPifPbM41'
}
r = requests.post(url, headers = headers, data = payload)
print(r.content)
root = html.fromstring(r.content)
You can now fetch the elements you need from root using XPath, such as:
root.xpath('//input[@class="some_class"]')
Refer to the scraping and lxml documentation for more details.
I have used all the payload data from the request; you can remove some fields and check which ones are absolutely necessary.
Also, follow the website rules about scraping and scrape gracefully without putting too much pressure on the website.
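The __VIEWSTATE and __EVENTVALIDATION values above were copied from one browser session, and ASP.NET pages generate them on each page load, so hard-coded copies may stop working. A minimal sketch of one alternative, assuming the server accepts the hidden fields it served a moment earlier: load the page once, collect its hidden form fields, then POST them back with the tab fields set. It uses only the standard library; hidden_fields and fetch_tab are hypothetical helper names, and whether this site accepts the reduced payload (for example without the page-size dropdown) is untested.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def hidden_fields(page_html):
    """Return a dict of the hidden form fields found in page_html."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    return parser.fields


def fetch_tab(url, tab="Portfolio"):
    """GET the page, then POST its hidden fields back with the tab selected."""
    # One opener keeps cookies between the GET and the POST.
    opener = build_opener(HTTPCookieProcessor())
    headers = {"User-Agent": "Mozilla/5.0"}
    first = opener.open(Request(url, headers=headers)).read()
    payload = hidden_fields(first.decode("utf-8", "replace"))
    # Field names taken from the payload in the answer above.
    payload["__EVENTTARGET"] = "TabAction"
    payload["__EVENTARGUMENT"] = tab
    body = urlencode(payload).encode("utf-8")
    return opener.open(Request(url, data=body, headers=headers)).read()
```

Calling fetch_tab(url) with the url from the answer would return the raw response bytes, which can be passed to html.fromstring as before.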

Related

Request for data that generates chart always empty

I am trying to scrape data that generates a chart on a website using python's request module.
My code currently looks like this:
# load modules
import os
import json
import requests as r
# url to send the call to
postURL = <insert website>
# use GET to pull cookie data
cookie_intel = r.get(postURL, verify = False)
# get cookies
search_cookies = cookie_intel.cookies
#### Request Information ####
# API request data
post_data = <insert request json>
# header information
headers = {"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"}
# results
results_post = r.post(postURL, data = post_data, cookies = search_cookies, headers = headers, verify = False)
# result
print(results_post.json())
As a quick summary: I first loaded the site and inspected it. From there I identified the URL for the request in the Network tab, checked the required request data in the Payload tab, and took the user-agent from the Request Headers tab.
The request itself works; however, the response is always empty. I have tried altering all sorts of inputs, but without success. I would highly appreciate any tips that would help me solve this issue. Thank you in advance!
In this case you have to use json= instead of data= when making the POST request, according to the requests documentation. If post_data is a dict, data= sends it form-encoded, whereas json= serializes it to a JSON body and sets the Content-Type header to application/json. Replacing this part of your code should give you the expected response.
results_post = r.post(postURL, json = post_data, cookies = search_cookies, headers = headers, verify = False)
You can also try other scraping tools such as Scrapy to crawl this data, and perhaps run the crawler in the cloud using estela.
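To make the difference between the two keywords concrete, here is a small standard-library sketch of the two request bodies. It approximates what requests builds for a dict passed as data= (form encoding) versus json= (JSON serialization); the post_data contents are hypothetical, since the question does not show the real payload.

```python
import json
from urllib.parse import urlencode

# Hypothetical payload; the real one is elided in the question.
post_data = {"chartId": 42, "range": "1y"}

# Roughly what data=post_data puts in the body
# (Content-Type: application/x-www-form-urlencoded).
form_body = urlencode(post_data)

# Roughly what json=post_data puts in the body
# (Content-Type: application/json).
json_body = json.dumps(post_data)

print(form_body)  # chartId=42&range=1y
print(json_body)  # {"chartId": 42, "range": "1y"}
```

An API that expects a JSON document will usually not parse the first form, which would be consistent with the empty result described in the question.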

Scrapy - How does a request sent using requests library to an API differs from the request that is sent using Scrapy.Request?

I am a beginner at using Scrapy and I was trying to scrape this website, https://directory.ntschools.net/#/schools, which uses JavaScript to load its contents. So I checked the Network tab and found that an API address is available (https://directory.ntschools.net/api/System/GetAllSchools). If you open this address, the data is in XML format. But when you check the Response tab while inspecting the Network tab, the data is in JSON format.
I first tried using Scrapy and sent the request to the API address WITHOUT any headers; the response it returned was XML, which threw a JSONDecodeError upon using json.loads(). So I used the header 'Accept' : 'application/json' and the response I got was JSON. That worked well:
import scrapy
import json
import requests
class NtseSpider_new(scrapy.Spider):
    name = 'ntse_new'
    header = {
        'Accept': 'application/json',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.56',
    }
    def start_requests(self):
        yield scrapy.Request('https://directory.ntschools.net/api/System/GetAllSchools', callback=self.parse, headers=self.header)
    def parse(self, response):
        data = json.loads(response.body)  # returned json response
But then I used the requests module WITHOUT any headers and the response I got was in JSON too!
import requests
import json
res = requests.get('https://directory.ntschools.net/api/System/GetAllSchools')
js = json.loads(res.content) #returned json response
Can anyone please tell me if there's any difference between both the types of requests? Is there a default response format for requests module when making a request to an API? Surely, I am missing something?
Thanks
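It's because Scrapy sets the Accept header to 'text/html,application/xhtml+xml,application/xml ...' by default. You can see that from this.
I experimented and found that the server sends a JSON response if the request has no Accept header. The requests library sends Accept: */* by default, which, going by your result, this server also answers with JSON.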
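Scrapy takes that default from its DEFAULT_REQUEST_HEADERS setting, so as an alternative to passing headers= on each Request, the setting can be overridden. A minimal sketch, assuming every request in the project should ask for JSON; it would go in the project's settings.py (or inside custom_settings on the spider):

```python
# settings.py
# Replaces Scrapy's default Accept value
# ("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8").
DEFAULT_REQUEST_HEADERS = {
    "Accept": "application/json",
    "Accept-Language": "en",
}
```

Headers passed explicitly to scrapy.Request, as in the question, still take precedence for that request.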

Extract JSON response data only visible in Network>XHR > angular response

Thanks in advance for looking into this query.
I am trying to extract data from the angular response which is not visible in the HTML code using the inspect function of Chrome browser.
I researched and looked for solutions and have been able to find the data in the Network (tab)>Fetch/XHR>Response (screenshots) and also wrote code based on the knowledge I gained researching this topic.
Response
In order to read the response, I am trying the code below, passing in the parameters and the cookies grabbed from the main URL. This segment is taken from the full code shared further below. The parameters were created based on information I found under Network (tab)>Fetch/XHR>Header.
http = urllib3.PoolManager()
r = http.request('GET',
'https://www.barchart.com/proxies/core-api/v1/quotes/get?' + urlencode(params),
headers=headers
)
QUESTIONS
Please help me confirm what I am missing or doing wrong. I want to read and store the JSON response; what should I be doing?
JSON to be extracted
Also, is there a way to read the params using a function, instead of assigning them as I have done below? What I mean is: similar to what I have done for cookies (headers = x.cookies.get_dict()), is there a way to read and assign the parameters?
Below is the full code I am using.
import requests
import urllib3
from urllib.parse import urlencode
url = 'https://www.barchart.com/etfs-funds/performance/percent-change/advances?viewName=main&timeFrame=5d&orderBy=percentChange5d&orderDir=desc'
header = {'accept': 'application/json', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
s = requests.Session()
x = s.get(url, headers=header)
headers = x.cookies.get_dict()
params = { 'lists': 'etfs.us.percent.advances.unleveraged.5d',
'orderDir': 'desc',
'fields': 'symbol,symbolName,lastPrice,weightedAlpha,percentChangeYtd,percentChange1m,percentChange3m,percentChange1y,symbolCode,symbolType,hasOptions',
'orderBy': 'percentChange',
'meta': 'field.shortName,field.type,field.description,lists.lastUpdate',
'hasOptions': 'true',
'page': '1',
'limit': '100',
'in(leverage%2C(1x))':'',
'raw': '1'}
http = urllib3.PoolManager()
r = http.request('GET',
'https://www.barchart.com/proxies/core-api/v1/quotes/get?' + urlencode(params),
headers=headers
)
r.data
r.data response is below, returning an error.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">\n<HTML><HEAD><META
HTTP-EQUIV="Content-Type" CONTENT="text/html;
charset=iso-8859-1">\n<TITLE>ERROR: The request could not be
satisfied</TITLE>\n</HEAD><BODY>\n<H1>403 ERROR</H1>\n<H2>The request
could not be satisfied.</H2>\n<HR noshade size="1px">\nRequest
blocked.\nWe can\'t connect to the server for this app or website at
this time. There might be too much traffic or a configuration error.
Try again later, or contact the app or website owner.\n<BR
clear="all">\nIf you provide content to customers through CloudFront,
you can find steps to troubleshoot and help prevent this error by
reviewing the CloudFront documentation.\n<BR clear="all">\n<HR noshade
size="1px">\n<PRE>\nGenerated by cloudfront (CloudFront)\nRequest ID:
vcjzkFEpvdtf6ihDpy4dVkYx1_lI8SUu3go8mLqJ8MQXR-KRpCvkng==\n</PRE>\n<ADDRESS>\n</ADDRESS>\n</BODY></HTML>
You can get the response by name; in your screenshot, the request named get?lists=etfs.us is the one you need. You also need to install Playwright.
There is a guide here: https://www.zenrows.com/blog/web-scraping-intercepting-xhr-requests#use-case-nseindiacom
from playwright.sync_api import sync_playwright
url = "https://www.barchart.com/etfs-funds/performance/percent-change/advances?viewName=main&timeFrame=5d&orderBy=percentChange5d&orderDir=desc"
with sync_playwright() as p:
    def handle_response(response):
        # the endpoint we are interested in
        if "get?lists=etfs.us" in response.url:
            print(response.json())
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", handle_response)
    page.goto(url, wait_until="networkidle")
    page.context.close()
    browser.close()
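On the second part of the question (reading the params instead of typing them by hand): one option is to parse them out of the URL of the intercepted request, for example response.url inside handle_response. A minimal standard-library sketch; params_from_url is a hypothetical helper, and the captured URL below is assembled from the params listed in the question purely for illustration.

```python
from urllib.parse import parse_qsl, urlsplit


def params_from_url(request_url):
    """Return the query parameters of a URL as a dict (blank values kept)."""
    query = urlsplit(request_url).query
    return dict(parse_qsl(query, keep_blank_values=True))


# Illustrative URL built from the question's params, not a captured one.
captured = (
    "https://www.barchart.com/proxies/core-api/v1/quotes/get"
    "?lists=etfs.us.percent.advances.unleveraged.5d"
    "&orderBy=percentChange&orderDir=desc&page=1&limit=100"
    "&in(leverage%2C(1x))=&raw=1"
)

params = params_from_url(captured)
print(params["lists"])
print(sorted(params))
```

Note that parse_qsl decodes percent-escapes, so the key comes back as in(leverage,(1x)); passing the decoded dict to urlencode re-encodes it once, whereas the pre-encoded key in the question's params dict would be encoded twice.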

Is there a way I can get data that is being loaded with ajax request on a website using web scraping in python?

I am trying to get the listing data on this page https://stashh.io/collection/secret1f7mahjdux4hldnn6nqc8vnu0h5466ks9u8fwg3?sort=sold_date+desc using web scraping.
Because the data is loaded with JavaScript, I can't use something like requests and BeautifulSoup. I checked the Network tab to see how the requests are sent, and found that to get the data I first need an sid for the further requests. I can get the sid with the code below:
import ast
import requests
def get_sid():
    url = "https://stashh.io/socket.io/?EIO=4&transport=polling&t=NyPfiJ-"
    response = requests.get(url)
    response.raise_for_status()
    # skip the first character, which precedes the dict-like payload
    text = response.text[1:]
    data = {"data": ast.literal_eval(text)}
    return data["data"]["sid"]
Then I use the sid to send a request to the endpoint that returns the data, using the code below:
def get_listings():
    sid = get_sid()
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
    }
    url = f"https://stashh.io/socket.io/?EIO=4&transport=polling&sid={sid}"
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    print(response.content)
    return response.json()
I am getting b'2' as the response instead of this:
434[{"nfts":[{"_id":"61ffffd9aa7f94f21e7262c0","collection":"secret1f7mahjdux4hldnn6nqc8vnu0h5466ks9u8fwg3","id":"354","fullid":"secret1f7mahjdux4hldnn6nqc8vnu0h5466ks9u8fwg3_354","name":"Amalia","thumbnail":[{"authentication":{"key":"","user":""},"file_type":"image","extension":"png","url":"https://arweave.net/7pVsbsC2M6uVDMHaVxds-oZkDNajhsrIkKEDT-vfkM8/public_image.png"}],"created_at":1644080437,"royalties_decimal_rate":3,"royalties":[{"recipient":null,"rate":20},{"recipient":null,"rate":15},{"recipient":null,"rate":15}],"isTemplate":false,"mint_on_demand":{"serial":null,"quantity":null,"version":null,"from_template":""},"template":{},"likes":[{"from":"secret19k85udnt8mzxlt3tx0gk29thgnszyjcxe8vrkt","timestamp":1644543830855}],"listing"...
I resorted to using Selenium to get the data; it works, but it's quite slow.
Is there a way I can get this data without using Selenium?
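One possible reading of the b'2' response, assuming the endpoint speaks Engine.IO protocol version 4 (as the EIO=4 query parameter suggests): the first character of each polling packet is a type code, and 2 is a ping rather than data. The expected 434[...] body would then be a message packet (4) carrying a Socket.IO acknowledgement, that is, a reply to an event the browser emitted, which a bare GET does not trigger. The sketch below only decodes packet types so the responses can be inspected; decode_polling_body is a hypothetical helper, and replaying the browser's emitted events (or using a Socket.IO client library) would still be needed to obtain the listings.

```python
# Engine.IO v4 packet type codes (assumed protocol, see lead-in).
ENGINE_IO_TYPES = {
    "0": "open",
    "1": "close",
    "2": "ping",
    "3": "pong",
    "4": "message",
    "5": "upgrade",
    "6": "noop",
}

# In v4 HTTP long-polling, packets in one body are joined by this separator.
RECORD_SEPARATOR = "\x1e"


def decode_polling_body(body):
    """Split a polling response body into (packet_type, payload) tuples."""
    packets = []
    for raw in body.split(RECORD_SEPARATOR):
        if not raw:
            continue
        packet_type = ENGINE_IO_TYPES.get(raw[0], "unknown")
        packets.append((packet_type, raw[1:]))
    return packets


print(decode_polling_body("2"))
print(decode_polling_body('0{"sid":"abc"}\x1e2'))
```

Running response.text from get_listings through decode_polling_body would show whether the server is returning data or only keep-alive packets.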

Return JSON File from Requests in Python

I've had some success using the POST requests in the past on other sites and receiving data from them but for some reason I'm having difficulty with the metacritic site.
Using chrome and the developer tools, I can see that when I begin to type in the search bar, it starts a POST request to the following url.
searchURL = 'http://www.metacritic.com/g00/3_c-6bbb.rjyfhwnynh.htr_/c-6RTWJUMJZX77x24myyux3ax2fx2fbbb.rjyfhwnynh.htrx2ffzytx78jfwhmx3fn65h.rfwpx3dcmw_$/$'
I also know that my headers need to be the following in order to get a response
headers = {'User-Agent' : "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"}
When I run this, I get a status code of 200 which indicates it worked but my response text is not what I expected. I am receiving the content of the entire page when I'm expecting json of search results. What am I missing here?
title = 'Grand Theft Auto'
#search request using POST
r = requests.post(searchURL, data = {'searchTerm' : title}, headers = headers)
print(r.status_code)
print(r.text)
You can see in the images below what I'm expecting to get.
Headers
Response
Not sure about the difference (maybe it is GDPR-related since I live in Europe, or because I have set DNT (Do Not Track) to true in Chrome), but for me, Metacritic autocomplete requests simply POST to http://www.metacritic.com/autosearch with the parameter search_term set to the search value and search_filter set to all.
From your screenshots, I think the URL for autocomplete in your browser is constructed with your session id, maybe to prevent exactly what you intend to do :)
So in your case I would try the following, in order:
1. POST to the /autosearch URL, and if that doesn't work,
2. figure out the logic that writes the session id into the URL, then make an initial request in your code to get a session id and work with that.
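The first option could look roughly like this. It is a sketch that assumes the /autosearch endpoint and the search_term and search_filter parameter names described above are still what the site expects; build_autosearch_request is a hypothetical helper that only constructs the request, so it can be inspected before anything is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint as reported in the answer above; it may have changed since.
AUTOSEARCH_URL = "http://www.metacritic.com/autosearch"


def build_autosearch_request(term, user_agent="Mozilla/5.0"):
    """Build (but do not send) a form-encoded POST for the autocomplete."""
    body = urlencode({"search_term": term, "search_filter": "all"})
    return Request(
        AUTOSEARCH_URL,
        data=body.encode("ascii"),
        headers={"User-Agent": user_agent},
    )


req = build_autosearch_request("Grand Theft Auto")
print(req.get_method(), req.full_url)
print(req.data)
```

Sending it would be urllib.request.urlopen(req).read(); the equivalent with requests is requests.post(AUTOSEARCH_URL, data={...}, headers=headers).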
