How to download file rather than HTML with requests - python

Okay, so I decided I'd like a program to download osu maps based on the map number (for lack of a better term). After doing some testing with the links to understand the redirecting, I got a program which gets to the .../download page - when I open that page in a browser, the map downloads. However, when trying to download it via requests, I get HTML.
def grab(self, identifier=None):
    if not identifier:
        print("Missing Argument: 'identifier'")
        return
    mapLink = f"https://osu.ppy.sh/beatmaps/{identifier}"
    dl = requests.get(mapLink, allow_redirects=True)
    if not dl:
        print("Error: map not found!")
        return
    mapLink2 = dl.url
    mapLink2 = f"https://osu.ppy.sh/beatmapsets/{self.parseLink(mapLink2)}/download"
    dl = requests.get(mapLink2)
    with open(f"{identifier}.osz", "wb") as f:
        f.write(dl.content)
And, in case it is necessary, here is self.parseLink:
def parseLink(self, mapLink=None):
    if not mapLink:
        return None
    id = mapLink.replace("https://osu.ppy.sh/beatmapsets/", "")
    id = id.split("#")
    return id[0]
Ideally, when I open the file at the end of grab(), it should save a usable .osz file - one which is NOT html, and can be dragged into the actual game and used. Of course, this is still extremely early in my testing, and I will figure out a way to make the filename the song name for convenience.
edit: example of an identifier is: OsuMaps().grab("1385415") in case you wanted to test

There is a very quick way to get around:
- Needing to be logged in
- Needing a specific element
This workaround comes in the form of https://bloodcat.com/osu/ - to get a download link directly to a map, all you need is: https://bloodcat.com/osu/s/<beatmap set number>.
Here is an example:
id = "653534" # this map is ILY - Panda Eyes
mapLink = f"https://bloodcat.com/osu/s/{id}" # adds id to the link
dl = requests.get(mapLink)
if len(dl.content) > 330: # see below for explanation
with open(f"{lines[i].rstrip()}.osz", "wb") as f:
f.write(dl.content)
else:
print("Map doesn't exist")
The line if len(dl.content) > 330 is my workaround for a link that doesn't work. A real .osz file is many thousands of bytes long, whereas the site's "not found" page comes in at under 330 bytes - we can use this to check whether the response is too short to be a beatmap.
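Folded into a small helper, the same idea looks roughly like this (a sketch - grab_map is a hypothetical name, and the 330-byte threshold is the one explained above):

import requests

def grab_map(set_id):
    # download a beatmap set from the bloodcat mirror and save it as <set_id>.osz
    url = f"https://bloodcat.com/osu/s/{set_id}"
    dl = requests.get(url)
    # anything at or under ~330 bytes is the mirror's "not found" page, not a real .osz
    if len(dl.content) > 330:
        with open(f"{set_id}.osz", "wb") as f:
            f.write(dl.content)
        return True
    print("Map doesn't exist")
    return False

# Example: grab_map("653534")  # ILY - Panda Eyes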
That's all! Feel free to use the code if you'd like.

Related

PDF File dedupe issue with same content, but generated at different time periods from a docx

I'm working on a PDF file dedupe project and have analyzed many libraries in Python which read a file, generate a hash value for it, and then compare that with the next file's hash to detect duplicates - similar to the logic below, or using Python's filecmp lib. But the issue I found with this logic is that if a PDF is generated from a source DOCX (Save to PDF) at different times, the outputs are not considered duplicates - even though the content is exactly the same. Why does this happen? Is there any other logic to read the content and then create a unique hash value based on the actual content?
import hashlib

def calculate_hash_val(path, blocks=65536):
    file = open(path, 'rb')
    hasher = hashlib.md5()
    data = file.read(blocks)  # read in chunks of the given block size
    while len(data) > 0:
        hasher.update(data)
        data = file.read(blocks)
    file.close()
    return hasher.hexdigest()
One of the things that happens is that the PDF writer saves metadata in the file, including the time of creation. It is invisible when you view the PDF, but it makes the hash different.
Here is an explanation of how to find and strip out that data with at least one tool; I am sure there are many others.
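If you want the hash to depend only on the textual content rather than the raw bytes, one option is to hash the extracted text instead. A minimal sketch, assuming the pypdf library (the successor of PyPDF2) is installed and accepting that text extraction is not perfectly reliable:

import hashlib
from pypdf import PdfReader  # assumption: pypdf is installed

def content_hash(path):
    # hash only the extracted text, so metadata such as the creation time is ignored
    reader = PdfReader(path)
    hasher = hashlib.md5()
    for page in reader.pages:
        text = page.extract_text() or ""
        hasher.update(text.encode("utf-8"))
    return hasher.hexdigest()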

When I run the code, it runs without errors, but the csv file is not created, why?

I found a tutorial and I'm trying to run this script; I haven't worked with Python before.
tutorial
I've already checked what is happening through logging.debug, verified that it is connecting to Google, and tried creating a csv file with other scripts.
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
import csv

def scrape_run():
    with open('/Users/Work/Desktop/searches.txt') as searches:
        for search in searches:
            userQuery = search
            raw = get("https://www.google.com/search?q=" + userQuery).text
            page = fromstring(raw)
            links = page.cssselect('.r a')
            csvfile = '/Users/Work/Desktop/data.csv'
            for row in links:
                raw_url = row.get('href')
                title = row.text_content()
                if raw_url.startswith("/url?"):
                    url = parse_qs(urlparse(raw_url).query)['q']
                    csvRow = [userQuery, url[0], title]
                    with open(csvfile, 'a') as data:
                        writer = csv.writer(data)
                        writer.writerow(csvRow)
            print(links)

scrape_run()
The TL;DR of this script is that it does three basic things:
1. Locates and opens your searches.txt file.
2. Uses those keywords and searches the first page of Google for each result.
3. Creates a new CSV file and prints the results (keyword, URLs, and page titles).
Solved: Google added a captcha because I made too many requests. It works when I use mobile internet.
Assuming the links variable is full and contains data - please verify.
If it is empty, test the API call itself; maybe it returns something different than you expected.
Other than that, I think you just need to tweak your file handling a little bit.
https://www.guru99.com/reading-and-writing-files-in-python.html
Here you can find some guidelines regarding file handling in Python.
From my perspective, you need to make sure you create the file first. Start with a script that can just create a file, then enhance it so it can write and append to the file. From there I think you are good to go and can continue with your script.
Beyond that, I think you would prefer opening the file only once instead of on each loop iteration; it could mean much faster execution time.
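As a rough illustration of the "open once" idea - a sketch that reuses links, userQuery, parse_qs and urlparse from the script above, and is untested against Google's current markup:

import csv

csvfile = '/Users/Work/Desktop/data.csv'

# open the output file once, then reuse the writer for every matching link
with open(csvfile, 'a', newline='') as data:
    writer = csv.writer(data)
    for row in links:
        raw_url = row.get('href')
        title = row.text_content()
        if raw_url.startswith("/url?"):
            url = parse_qs(urlparse(raw_url).query)['q']
            writer.writerow([userQuery, url[0], title])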
Let me know if something is not clear.

Is there a way to get profile picture updates using python Telegram-Bot?

So I'm using python-telegram-bot for telegram integration into another application. My goal is to have the profile pictures of a user on telegram within my application. (Users and group chats)
Getting a user's or group's avatar is easy, and so is downloading and using it in my app. However, what if the user changes their profile picture? I couldn't find any update message or handler in the documentation that lets a bot be notified of a picture change, not even for groups.
My first thought was to first retrieve all pictures and store the file_id in a database, then periodically check that user's/group's pictures and go back through their pictures until file_id matches the last saved file_id in the database.
This combined with a JobQueue is the best thing I can come up with, so I'll self-answer using that, but I think it's still not a perfect solution so if anyone has a better idea I'd appreciate an answer.
I'm specifically looking for a better solution for groups, since I don't think there is a way to retrieve any but the most recent picture for groups, and my application should retrieve all of them. Another flaw my self-answer has is that if a user changes the profile picture twice within those six hours, I will only get the most recent one. This can be fixed for users with the offset attribute in the bot call, but the method to get profile pictures of a group does not seem to have that.
tl;dr:
How can I retrieve updates whenever a user changes their own or a group's profile picture, in the most efficient and reliable way, using python-telegram-bot and Python 3.5?
This is using telegram.ext.JobQueue to check for profile picture updates every 6 hours.
# define job queue
j = updater.job_queue

def dl_pfps(bot, job):
    # this assumes that we have a textfile with the following
    # layout: "user_id:last_pfp_file_id" - one per line
    # later we'll write a list back into it with the newest IDs
    user_pfp_list = []
    with open("user_pfps.txt") as f:
        for line in f:
            user_id = line.split(':')[0]
            # strip the trailing newline so the comparison below works
            last_file_id = line.strip().split(':')[1]
            most_recent_pfp = bot.get_user_profile_photos(user_id, limit=1).photos[0]
            if last_file_id == most_recent_pfp[-1].file_id:
                print("No change")
                user_pfp_list.append(user_id + ":" + last_file_id)
            else:
                print("User updated profile picture. Getting full size picture...")
                # download and process the picture
                file_id = most_recent_pfp[-1].file_id
                newFile = bot.getFile(file_id)
                newFile.download('my/filename.jpg')
                user_pfp_list.append(user_id + ":" + file_id)
    # write new list back to file (overwrite current list)
    with open("user_pfps.txt", "w") as f:
        f.write("\n".join(user_pfp_list))

# check for new profile pictures every 6 hours
job_dlpfps = j.run_repeating(dl_pfps, interval=21600, first=0)
This is the best I can come up with. If you want to use this in your code you have to adjust 'my/filename.jpg' to a proper filename and you need to generate an initial list in user_pfps.txt with one line per user like this: user_id:0
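For what it's worth, seeding user_pfps.txt could look roughly like this (a sketch; known_user_ids is a hypothetical list of the user IDs you already track):

# write one "user_id:0" line per tracked user so the first run of dl_pfps
# treats every profile picture as new
known_user_ids = [123456789, 987654321]  # hypothetical IDs

with open("user_pfps.txt", "w") as f:
    f.write("\n".join("{}:0".format(uid) for uid in known_user_ids))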

Problems with PyPDF ignoring some data

Hoping for some help, as I can't find a solution.
We currently have a lot of manual data input done by people reading PDF files, and I have been asked to find a way to cut this time down. My solution would be to transform the PDF into a much more readable format, then use grep to strip out the standard fields (just leaving the data behind). This would then be uploaded into a template, then into SAP.
However, the main problem has come at the first hurdle - transforming the PDF into a txt file. The code I use is as follows -
import sys
import pyPdf

def getPDFContent(path):
    content = ""
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + "\n"
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

f = open('test.txt', 'w+')
f.write(getPDFContent("Adminform.pdf").encode("ascii", "ignore"))
f.close()
This works; however, it ignores some data from the PDF files. To show you what I mean, this PDF page -
http://s23.postimg.org/6dqykomqj/error.png
- from its first section (gender, title, name), produces the output below -
*Title: *Legal First Name (s): *Your forename and second name (if applicable) as it appears on your passport or birth certificate. Address: *Legal Surname: *Your surname as it appears on your passport or birth certificate
Basically, the actual data that I want to capture is not being converted.
Anyone have a fix for this?
Thanks,
Generally speaking, converting PDFs to text is a bad idea; it is almost always messy.
There are Linux utilities that do what you have implemented, but I don't expect them to do any better.
I can suggest Tabula, which you can find at:
http://tabula.technology/
It is meant for extracting tables out of PDFs by manually delineating the boundaries of the table, but running it on a PDF with no tables will output text with some formatting retained.
There is some automation, although it is limited.
Refer to
https://github.com/tabulapdf/tabula-extractor/wiki/Using-the-command-line-tabula-extractor-tool
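If you would rather drive Tabula from Python than from the command line, the tabula-py wrapper is one option (an assumption on my part that it fits your setup; it needs Java installed). A minimal sketch:

import tabula  # assumption: the tabula-py package, which wraps tabula-java

# extract every table-like region in the PDF and write the result to a CSV file
tabula.convert_into("Adminform.pdf", "Adminform.csv", output_format="csv", pages="all")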
Also, though it may not be entirely relevant here, you can use OpenRefine to manage messy data. Refer to
http://openrefine.org/

never ending for loop in website mapping crawler

I am working on my first Python project. I want to make a crawler that visits a website to extract all its links (to a depth of 2). It should store the links in two lists that form a one-to-one register correlating source links with the target links they contain. Then it should create a csv file with two columns (Target and Source), so I can open it with Gephi to create a graph showing the site's topographic structure.
The code breaks down at the for loop in the code execution section; it just never stops extracting links... (I've tried it with a fairly small blog, and it just never ends). What is the problem? How can I solve it?
A few points to consider:
- I'm really new to programming and Python, so I realize my code must be really unpythonic. Also, since I have been patching it together while searching for ways to solve my problems, it is somewhat rough - sorry. Thanks for your help!
# imports used by the snippet (Python 2)
import os
import re
import csv
import urllib
from itertools import izip

myurl = raw_input("Introduce URL to crawl => ")
Dominios = myurl.split('.')
Dominio = Dominios[1]

# Variables Block 1
Target = []
Source = []
Estructura = [Target, Source]
links = []

# Variables Block 2
csv_columns = ['Target', 'Source']
csv_data_list = Estructura
currentPath = os.getcwd()
csv_file = "crawleo_%s.csv" % Dominio

# Block 1 => Extract links from a page
def page_crawl(seed):
    try:
        for link in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(seed).read(), re.I):
            Source.append(seed)
            Target.append(link)
            links.append(link)
    except IOError:
        pass

# Block 2 => Write csv file
def WriteListToCSV(csv_file, csv_columns, csv_data_list):
    try:
        with open(csv_file, 'wb') as csvfile:
            writer = csv.writer(csvfile, dialect='excel', quoting=csv.QUOTE_NONNUMERIC)
            writer.writerow(csv_columns)
            writer.writerows(izip(Target, Source))
    except IOError as (errno, strerror):
        print("I/O error({0}): {1}".format(errno, strerror))
    return

# Block 3 => Code execution
page_crawl(myurl)
seed_links = (links)
for sublink in seed_links:  # Problem is with this loop
    page_crawl(sublink)
seed_sublinks = (links)
## print Estructura  # Line just to check if code was working
#for thirdlinks in seed_sublinks:  # Commented out until prior problems are solved
#    page_crawl(thirdlinks)
WriteListToCSV(csv_file, csv_columns, csv_data_list)
seed_links and links point to the same list. So when you add elements to links in the page_crawl function, you also extend the list the for loop is iterating over. What you need to do is clone the list when you create seed_links.
This happens because assigning a list to a new name does not copy it; multiple names can refer to the same object.
If you want to see this with your own eyes, try print sublink inside the for loop. You will notice that more links are printed than you initially put in. You will probably also notice that you are trying to loop over the entire web :-)
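Concretely, the smallest change is to copy the list once, for example:

page_crawl(myurl)
seed_links = list(links)   # copy, so the loop iterates over a snapshot
for sublink in seed_links:
    page_crawl(sublink)    # still appends to links, but no longer to seed_links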
I don't immediately see what is wrong. However, there are several remarks to make:
- You work with global variables, which is bad practice. You are better off using local variables that are passed back via return.
- Is it possible that a link on the second level refers back to the first level? That way you would have a loop in the data, so you need to make provisions to prevent one; investigate what is returned.
- I would implement this recursively (with the earlier provisions), because that makes the code simpler, albeit a little more abstract - see the sketch below.
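A rough sketch of that recursive approach with a visited set to guard against cycles (it assumes the same Python 2 environment as the question, and crawl is a hypothetical name):

import re
import urllib

def crawl(seed, depth, visited, target, source):
    # recursively collect (target, source) pairs down to the given depth
    if depth == 0 or seed in visited:
        return
    visited.add(seed)
    try:
        html = urllib.urlopen(seed).read()
    except IOError:
        return
    for link in re.findall('''href=["'](.[^"']+)["']''', html, re.I):
        source.append(seed)
        target.append(link)
        crawl(link, depth - 1, visited, target, source)

# usage:
# Target, Source = [], []
# crawl(myurl, 2, set(), Target, Source)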
