I am using the Instaloader package to scrape some data from Instagram.
Ideally, I am looking to scrape the posts associated with a specific hashtag. I created the code below, and it outputs lines of scraped data, but I am unable to write this output to a .txt, .csv, or .json file.
I tried to first append the loop output to a list, but the list was empty. My efforts to output to a file have also been unsuccessful.
I know I am missing a step here; please provide any input that you have! Thank you.
import instaloader
import json

L = instaloader.Instaloader()

for posts in L.get_hashtag_posts('NewYorkCity'):
    L.download_hashtag('NewYorkCity', max_count=10)
    with open('output.json', 'w') as f:
        json.dump(posts, f)
    break

print('Done')
Looking at your code, it seems the spacing might be a bit off. When you use open() with the "with" statement, it wraps the file object in an implicit try: / finally:, so the file is closed for you automatically.
In your case you created a variable f representing this file object. I believe if you make the following change you can write your JSON data to the file.
import instaloader
import json

L = instaloader.Instaloader()

for posts in L.get_hashtag_posts('NewYorkCity'):
    L.download_hashtag('NewYorkCity', max_count=10)
    with open('output.json', 'w') as f:
        # json.dump() writes to a file object; json.dumps() returns a string.
        # Note: a Post object itself is not JSON-serializable; see the sketch below.
        f.write(json.dumps(posts))
    break

print('Done')
I am sure you intended this, but if your goal with the break was to handle only the first value the loop returns, note that get_hashtag_posts() gives you an iterator, not a list, so you cannot index or slice it directly. With import itertools you could make this edit as well:
for posts in itertools.islice(L.get_hashtag_posts('NewYorkCity'), 1):
This will yield only the first item. If you would like the first 3, for example, you could use islice(..., 3).
See this tutorial on Python Data Structures
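Putting the pieces together, here is a minimal sketch of one way to do this end to end. It assumes you want plain metadata rather than the Post objects themselves (which are not JSON-serializable), and it uses the shortcode, date_utc, likes, and caption properties that Instaloader exposes on each Post:
import itertools
import json

import instaloader

L = instaloader.Instaloader()

# Collect a JSON-serializable summary of the first 10 posts for the hashtag.
records = []
for post in itertools.islice(L.get_hashtag_posts('NewYorkCity'), 10):
    records.append({
        'shortcode': post.shortcode,          # unique identifier of the post
        'date': post.date_utc.isoformat(),    # datetime -> string for JSON
        'likes': post.likes,
        'caption': post.caption,              # may be None (serialized as null)
    })

# Dump the whole list once, after the loop, instead of rewriting the file per post.
with open('output.json', 'w') as f:
    json.dump(records, f, indent=2)

print('Done')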
Objective
I'm trying to extract the GPS "Latitude" and "Longitude" data from a bunch of JPGs, and I have been successful so far, but my main problem is that when I try to write the coordinates to a text file, only one set of coordinates gets written, while my console output shows that coordinates were extracted for every image. The text file is supposed to mirror my console output.
I don't fully understand what the problem is and why it won't just write all of them instead of one. I believe the value is being overwritten somehow, or the 'GPSPhoto' module is causing some issues.
Code
from glob import glob
from GPSPhoto import gpsphoto

# Scan jpg's that are located in the same directory.
data = glob("*.jpg")

# Scan contents of images and GPS values.
for x in data:
    data = gpsphoto.getGPSData(x)
    data = [data.get("Latitude"), data.get("Longitude")]
    print("\nsource: {}".format(x), "\n ↪ {}".format(data))

# Write coordinates to a text file.
with open('output.txt', 'w') as f:
    print('Coordinates:', data, file=f)
I have tried pretty much everything I can think of, including changing the write permissions, dropping glob, loops and no loops, lists and no lists, different ways to write to the file, etc.
Any help is appreciated because I am completely lost at this point. Thank you.
You're replacing the data variable each time through the loop, not appending to a list. Collect every pair in a separate list, and use a different name for the per-image result so it doesn't clobber your list of filenames:
all_coords = []

for x in data:
    gps = gpsphoto.getGPSData(x)
    all_coords.append([gps.get("Latitude"), gps.get("Longitude")])

with open('output.txt', 'w') as f:
    print('Coordinates:', all_coords, file=f)
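If you want the text file to mirror your console output with one line per image, a small variation on the same sketch pairs each filename with its coordinates when writing:
with open('output.txt', 'w') as f:
    for x, coords in zip(data, all_coords):
        # one line per image, e.g. "source: img1.jpg -> [40.7, -74.0]"
        print("source: {} -> {}".format(x, coords), file=f)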
I'm trying to produce only the following JSON data fields, but for some reason it writes the entire page to the .html file. What am I doing wrong? It should only produce the fields referenced, e.g. title, audioSource URL, medium-sized image, etc.
r = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(r.read().decode('utf-8'))

for post in data['posts']:
    # data.append([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])
    ([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])

with io.open('criminal-json.html', 'w', encoding='utf-8') as r:
    r.write(json.dumps(data, ensure_ascii=False))
You want to differentiate between your input data and your output data. In your for loop, you are referencing the same variable data for the input that you then use for the output. You want to append the selected fields from the input to a separate list that holds the output.
Don't re-use the same variable names. Here is what you want:
import urllib
import json
import io

url = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(url.read().decode('utf-8'))

posts = []
for post in data['posts']:
    posts.append([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])

with io.open('criminal-json.html', 'w', encoding='utf-8') as r:
    r.write(json.dumps(posts, ensure_ascii=False))
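As an aside, urllib.urlopen is the Python 2 spelling. If you are on Python 3, a rough equivalent of the same approach (assuming the same response structure) would be:
import json
from urllib.request import urlopen

url = 'https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1'
data = json.loads(urlopen(url).read().decode('utf-8'))

posts = []
for post in data['posts']:
    posts.append([post['title'], post['audioSource'],
                  post['image']['medium'], post['excerpt']['long']])

# Built-in open() handles encoding in Python 3; io.open is not needed.
with open('criminal-json.html', 'w', encoding='utf-8') as f:
    f.write(json.dumps(posts, ensure_ascii=False))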
You are loading the whole JSON into the variable data, and you are dumping it without changing it. That's the reason why this is happening. What you need to do is put whatever you want into a new variable and then dump that.
See the line
([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])
It builds a list and immediately throws it away, so data remains unchanged. Do what Mark Tolonen suggested and it'll be fine.
I found a tutorial and I'm trying to run this script; I have not worked with Python before.
tutorial
I've already checked what is happening via logging.debug, verified that it is connecting to Google, and tried to create the csv file with other scripts.
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
import csv

def scrape_run():
    with open('/Users/Work/Desktop/searches.txt') as searches:
        for search in searches:
            userQuery = search
            raw = get("https://www.google.com/search?q=" + userQuery).text
            page = fromstring(raw)
            links = page.cssselect('.r a')
            csvfile = '/Users/Work/Desktop/data.csv'
            for row in links:
                raw_url = row.get('href')
                title = row.text_content()
                if raw_url.startswith("/url?"):
                    url = parse_qs(urlparse(raw_url).query)['q']
                    csvRow = [userQuery, url[0], title]
                    with open(csvfile, 'a') as data:
                        writer = csv.writer(data)
                        writer.writerow(csvRow)
    print(links)

scrape_run()
The TL;DR of this script is that it does three basic things:
1. Locates and opens your searches.txt file.
2. Uses those keywords and searches the first page of Google for each result.
3. Creates a new CSV file and prints the results (keyword, URLs, and page titles).
Solved
Google added a captcha because I sent too many requests.
It works when I use mobile internet.
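If you run into that captcha, one common mitigation (my addition, not part of the original answer) is simply to pause between queries so you stay under Google's rate limits:
import time

from requests import get

queries = ["python csv", "lxml cssselect"]  # stand-ins for lines from searches.txt

for q in queries:
    raw = get("https://www.google.com/search?q=" + q).text
    # ... parse raw and write rows exactly as in the script above ...
    time.sleep(5)  # pause a few seconds between queries to reduce the chance of a captcha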
Assuming the links variable is full and contains data (please verify). If it is empty, test the API call itself; maybe it returns something different than you expected.
Other than that, I think you just need to tweak your file handling a little bit.
https://www.guru99.com/reading-and-writing-files-in-python.html
Here you can find some guidelines regarding file handling in Python.
From my perspective, you need to make sure you create the file first. Start with a script that is able to just create a file. After that, enhance the script to be able to write and append to the file. From there on I think you are good to go and continue with your script.
Other than that, you would probably prefer opening the file only once instead of on each loop iteration; that could mean much faster execution time, as in the sketch below.
Let me know if something is not clear.
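A minimal sketch of that last point, keeping one writer alive for all rows instead of re-opening the file per row (same hypothetical path and row layout as the script above):
import csv

csvfile = '/Users/Work/Desktop/data.csv'
rows = [["query", "https://example.com", "Example title"]]  # stand-in rows

# Open the output file once, before the loop, and reuse the writer.
with open(csvfile, 'a', newline='') as data:
    writer = csv.writer(data)
    for csvRow in rows:
        writer.writerow(csvRow)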
I'm trying to pull a few pieces of data for entry into a server. I've gotten the data from a web API, and it includes a lot of information that, to me, is garbage. I need to get rid of a ton of it, but I'm having issues with where to start. The data I need runs up until "abilities", and then starts again at "name":"Contherious". And here's that link. Most of my data processing has been attempts to use regex searches, and the only pattern I can think of is that the names I need, unlike the names I don't need, have a space and lead to an ID directly after them. I'm just unclear how to grab each and every one of these names, and any help would be appreciated.
I've tried
DMG_DONE_FILE = "rawDmgDoneData.txt"
out = []

with open(DMG_DONE_FILE, 'r') as f:
    line = f.readline()
    while line:
        regex_id = search('^+"name":"\s"+(\w+)+"id":', line)
        if regex_id:
            out.append(regex_id.group(1))
        line = f.readline()
and I get errors because I generally don't know what I'm doing with regex searches
import json

# use urllib to fetch from the api;
# the example here reads from a local file for testing
with open('file.json', 'r') as f:
    entries = json.load(f)
Now you have a data structure that you can address easily, e.g. entries['entries'][0]['name'].
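For instance, to keep only the fields you care about (assuming each entry carries the keys used above and in the jq line below), a list comprehension does it in one pass:
# Keep just the name and id of every entry; the keys are assumed from the examples.
wanted = [{'name': e['name'], 'id': e['id']} for e in entries['entries']]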
Alternatively, using jq (https://stedolan.github.io/jq/):
cat file.json |jq '.entries[]| {name:.name,id:.id,type:.type,itemLevel:.itemLevel,icon:.icon,total:.total,activeTime:.activeTime,activeTimeReduced:.activeTimeReduced}'
I'm almost an absolute beginner in Python, but I have been asked to manage a difficult task. I have read many tutorials and found some very useful tips on this website, but I think this question has not been asked before, at least not in the way I tried it in the search engine.
I have managed to write some URLs into a csv file. Now I would like to write a script able to open this file, open the URLs, and write their content into a dictionary. But I have failed: my script can print these addresses, but cannot process the file.
Interestingly, my script did not give the same error message each time. Here is the latest:
req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
So I think my script faces several problems:
1. Is my method of opening the URLs the right one?
2. What is wrong in the way I build the dictionary?
Here is my attempt below. Thanks in advance to those who would help me!
import csv
import urllib

dict = {}

test = csv.reader(open("read.csv","rb"))

for z in test:
    sock = urllib.urlopen(z)
    source = sock.read()
    dict[z] = source
    sock.close()

print dict
First thing, don't shadow built-ins. Rename your dictionary to something else, since dict is the built-in used to create new dictionaries.
Secondly, the csv reader yields a list per line containing all the columns. Either reference the column explicitly with urllib.urlopen(z[0]) # First column in the line, or open the file with a normal open() and iterate through it.
Apart from that, it works for me.
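Putting both points together, a minimal sketch of the corrected script (Python 2, to match the question; pages is just an illustrative name):
import csv
import urllib

pages = {}

with open("read.csv", "rb") as csvfile:
    for z in csv.reader(csvfile):
        url = z[0]                # the first column holds the URL
        sock = urllib.urlopen(url)
        pages[url] = sock.read()  # key by the URL string, not the row list
        sock.close()

print pages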