Here is my problem.
To begin, I have a dictionary of lists in Python, like this:
"links": [{"url": "http://catherineingram.com/biography.html", "type": {"key": "/type/link"}, "title": "Biography"}, {"url": "http://www.youtube.com/watch?v=4lJK9cfXP3c", "type": {"key": "/type/link"}, "title": "Interview on Consciousness TV"}, {"url": "http://www.huffingtonpost.com/catherine-ingram/", "type": {"key": "/type/link"}, "title": "Blog on Huffington Post"}]
My goal is to get only the url and title of each link and put them in a database.
For the moment I am working only with the url, and this is what I did:
for record in csv.DictReader(open(INPUT_FILE, 'r'), fieldnames=COLUMNS, delimiter='\t'):
    j = json.loads(record['json'])
    result = []
    if 'links' in j:
        for link in j['links']:
            result.append({'url': link['url']})
            a = link['url']
            print(a)
            links = a
    else:
        links = ''
The result is:
http://catherineingram.com/biography.html
http://www.youtube.com/watch?v=4lJK9cfXP3c
http://www.huffingtonpost.com/catherine-ingram/
So it's perfect, I get exactly what I wanted, but the problem now is that when I put links in my database with:
links = a
I get only the last url in my database, not all 3 urls.
So I am trying to get all 3 urls into my database, but I only get the last one.
I hope you can help me with my problem.
Thanks for reading!
PS: if you want more detail on the code, here it is:
for record in csv.DictReader(open(INPUT_FILE, 'r'), fieldnames=COLUMNS, delimiter='\t'):
    j = json.loads(record['json'])
    result = []
    if 'links' in j:
        for link in j['links']:
            result.append({'url': link['url']})
            a = link['url']
            print(a)
            links = a
    else:
        links = ''
    # print(n)
    # links_url.append(n['url'])
    # links_title.append(n['title'])
    # links_url.append(n['url'])
    # links_title.append(n['title'])
    c.execute('INSERT INTO AUTHORS VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)',
              [record['key'],
               j.get('name'),
               j.get('eastern_order'),
               j.get('personal_name'),
               j.get('enumeration'),
               j.get('title'),
               bio,
               alternate_names,
               uris,
               j.get('location'),
               j.get('birth_date'),
               j.get('death_date'),
               j.get('date'),
               j.get('wikipedia'),
               links
               ])
db.commit()
Thanks for answering!
My goal is to put all my urls in my database.
So I did this:
for record in csv.DictReader(open(INPUT_FILE, 'r'), fieldnames=COLUMNS, delimiter='\t'):
    j = json.loads(record['json'])
    if 'bio' in j and 'value' in j['bio']:
        bio = j['bio']['value']
    else:
        bio = None
    if 'alternate_names' in j:
        for n in j['alternate_names']:
            alternate_names = n
    else:
        alternate_names = None
    if 'uris' in j:
        for n in j['uris']:
            uris = n
    else:
        uris = None
    if 'links' in j:
        for link in j['links']:
            dico = {'url': link['url']}
            print(dico['url'])
            links = dico['url']
    else:
        links = ''
    # print(n)
    # links_url.append(n['url'])
    # links_title.append(n['title'])
    # links_url.append(n['url'])
    # links_title.append(n['title'])
    c.execute('INSERT INTO AUTHORS VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)',
              [record['key'],
               j.get('name'),
               j.get('eastern_order'),
               j.get('personal_name'),
               j.get('enumeration'),
               j.get('title'),
               bio,
               alternate_names,
               uris,
               j.get('location'),
               j.get('birth_date'),
               j.get('death_date'),
               j.get('date'),
               j.get('wikipedia'),
               links
               ])
db.commit()
But when I did this, all the other elements (bio, alternate_names, ...) work, but not links, because links needs a different approach: it's a list of dictionaries, like this:
"links": [{"url": "http://catherineingram.com/biography.html", "type": {"key": "/type/link"}, "title": "Biography"}, {"url": "http://www.youtube.com/watch?v=4lJK9cfXP3c", "type": {"key": "/type/link"}, "title": "Interview on Consciousness TV"}, {"url": "http://www.huffingtonpost.com/catherine-ingram/"
For the moment I take only the url element of each dictionary and try to put all the urls of links in my database. It works perfectly when there is only 1 url, but sometimes there are 2 or 3 urls, and when that happens only the last url ends up in my database, not the others. That's my problem.
Thanks!
As I mentioned in my comment on your question, you have some indentation issues, so I am only taking a guess at what you are trying to achieve. You also assign to variables that are never referenced in the code shown, so it may very well be that they have not been declared at the right level.
for record in csv.DictReader(open(INPUT_FILE, 'r'), fieldnames=COLUMNS, delimiter='\t'):
    j = json.loads(record['json'])
    result = []  # what do you do with variable result? Should this be declared before the 'for record' statement?
    if 'links' in j:
        for link in j['links']:
            result.append({'url': link['url']})
            a = link['url']  # what do you do with variable a?
            print(a)
            links = a  # do you need both variables a and links?
            c.execute('INSERT INTO AUTHORS VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)',
                      [record['key'],
                       j.get('name'),
                       j.get('eastern_order'),
                       j.get('personal_name'),
                       j.get('enumeration'),
                       j.get('title'),
                       bio,
                       alternate_names,
                       uris,
                       j.get('location'),
                       j.get('birth_date'),
                       j.get('death_date'),
                       j.get('date'),
                       j.get('wikipedia'),
                       links
                       ])
    else:
        links = ''
    # print(n)
    # links_url.append(n['url'])
    # links_title.append(n['title'])
    # links_url.append(n['url'])
    # links_title.append(n['title'])
db.commit()  # should this be moved to follow the c.execute statement rather than doing one commit for all the inserts?
The above now writes multiple rows with identical data but with different links. That leads to an unnormalized database. Did you mean instead to write out one row where the column contains all 3 links? That, too, would be a case of an unnormalized database. Again, I am just guessing at what you meant by "I try to have 3 url in my database but I got only last."
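If you did mean one row per author with all of its links in the single links column, here is a minimal sketch of that alternative (not from your original code; it reuses your j and links names, and whether a delimited string is acceptable depends on your schema):
# Sketch: collapse all the link URLs into one comma-separated string for the links column.
if 'links' in j:
    links = ', '.join(link['url'] for link in j['links'])
else:
    links = ''
The c.execute and db.commit parts would then stay as in your original code, with one INSERT per record.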
Related
I'm new to Scrapy.
I made a script to scrape data from a website and it works fine: I get the results as a JSON file and it looks perfect.
Now when I try to use my script to scrape multiple URLs (same site), it works, and I can get the data in the JSON file for each URL, but there is a bug.
My printing structure is as below (as coded in the script):
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:} #URL1
]
When I put in 2 URLs to scrape I get this:
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:},#URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{titleDesc:,,,Content:}, #URL2
{attribute:} #URL2
]
It is still fine, but when I add more, the structure messes up and becomes like this:
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:}, #URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{Title:,,,Description:,,,Brochure:}, #URL3
{titleDesc:,,,Content:}, #URL2
{attribute:}, #URL2
{titleDesc:,,,Content:}, #URL3
{attribute:}
]
If you look closely you will notice that the title of the third URL is below the title of the second one.
Can somebody help, please?
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "attributes"
    start_urls = ["https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/161/",
                  "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/162/"]

    def parse(self, response):
        yield {
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": 'brochure'
        }
        for post in response.css(".el-collapse"):
            for i in range(len(post.css(".el-collapse-item__header"))):
                res = ""
                lst = post.css(".value-el-desc")
                x = lst[i].css(".value-el-desc p::text").extract()
                for y in x:
                    res += y.strip() + "&&"
                try:
                    yield {
                        "descTitle": post.css('.el-collapse-item__header::text')[i].get().strip(),
                        "desc": res
                    }
                except:
                    continue
                res = ""
        for post in response.css(".lie-one-canshu"):
            try:
                dicti = {"attribute": post.css('.lie-one-canshu::text')[0].get().strip()}
                yield dicti
            except:
                continue
UPDATE:
I noticed that the bug isn't permanent; sometimes I execute the script and the result is fine.
Scrapy is asynchronous, so there is no guarantee of the order in which items are output or processed, at least not out of the box. If you want all of the output from a single URL to come out together, then I suggest you only yield 1 item from each call to the parse method.
For example:
def parse(self, response):
    results = {
        'items': [{
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": 'brochure'
        }]
    }
    for post in response.css(".el-collapse"):
        for i in range(len(post.css(".el-collapse-item__header"))):
            res = ""
            lst = post.css(".value-el-desc")
            x = lst[i].css(".value-el-desc p::text").extract()
            for y in x:
                res += y.strip() + "&&"
            try:
                results['items'].append({
                    "descTitle": post.css('.el-collapse-item__header::text')[i].get().strip(),
                    "desc": res
                })
            except:
                continue
            res = ""
    for post in response.css(".lie-one-canshu"):
        try:
            results['items'].append({
                "attribute": post.css('.lie-one-canshu::text')[0].get().strip()
            })
        except:
            continue
    yield results
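With this structure each call to parse yields a single dictionary per URL, so everything scraped from one page stays grouped together in the output no matter which request finishes first. For example, running the spider as before (it is named "attributes" in your code) with scrapy crawl attributes -o results.json should then produce one top-level object per URL instead of interleaved items.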
New to programming.
Question: I need to use the "Data" below (two rows as arrays), queried from SQL, to create the message structure below.
Data from SQL using fetchall():
Data = [[100,1,4,5],[101,1,4,6]]
##expected message structure
message = {
"name":"Tom",
"Job":"IT",
"info": [
{
"id_1":"100",
"id_2":"1",
"id_3":"4",
"id_4":"5"
},
{
"id_1":"101",
"id_2":"1",
"id_3":"4",
"id_4":"6"
},
]
}
I tried to create the method below to iterate over the rows and then fill in the values. This was just a start, but it was also not working:
def create_message(data):
    for row in data:
        {
            "id_1": str(data[0][0]),
            "id_2": str(data[0][1]),
            "id_3": str(data[0][2]),
            "id_4": str(data[0][3]),
        }
Latest Code
def create_info(data):
    info = []
    for row in data:
        temp_dict = {"id_1_tom": "", "id_2_hell": "", "id_3_trip": "", "id_4_clap": ""}
        for i in range(0, 1):
            temp_dict["id_1_tom"] = str(row[i])
            temp_dict["id_2_hell"] = str(row[i+1])
            temp_dict["id_3_trip"] = str(row[i+2])
            temp_dict["id_4_clap"] = str(row[i+3])
        info.append(temp_dict)
    return info
Edit: Updated answer based on updates to the question and comment by original poster.
This function might work for the example you've given to get the desired output, based on the attempt you've provided:
def create_info(data):
    info = []
    for row in data:
        temp_dict = {}
        temp_dict['id_1_tom'] = str(row[0])
        temp_dict['id_2_hell'] = str(row[1])
        temp_dict['id_3_trip'] = str(row[2])
        temp_dict['id_4_clap'] = str(row[3])
        info.append(temp_dict)
    return info
For the input:
[[100, 1, 4, 5],[101,1,4,6]]
This function will return a list of dictionaries:
[{"id_1_tom":"100","id_2_hell":"1","id_3_trip":"4","id_4_clap":"5"},
{"id_1_tom":"101","id_2_hell":"1","id_3_trip":"4","id_4_clap":"6"}]
This can serve as the value for the key info in your dictionary message. Note that you would still have to construct the message dictionary.
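For illustration, a minimal sketch of assembling the full message with that function (the "name" and "Job" values are just the placeholders from your expected structure, and the keys are the id_*_... names used above):
data = [[100, 1, 4, 5], [101, 1, 4, 6]]

message = {
    "name": "Tom",
    "Job": "IT",
    "info": create_info(data),  # list of per-row dictionaries built above
}
print(message)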
I have two columns:
One looks like:
"Link": "https://url.com/item?variant=",
"Link": "https://url2.com/item?variant=",
"Link": "https://url3.com/item?variant=",
The second looks like:
"link extension": ["1","2"],
"link extension": ["1","2"],
"link extension": ["1","1","3"],
What I'm trying to do is to combine them together so that my Link column looks like this:
"Link": "https://url.com/item?variant=1"
"Link": "https://url.com/item?variant=2"
"Link": "https://url2.com/item?variant=1"
"Link": "https://url2.com/item?variant=2"
"Link": "https://url3.com/item?variant=1"
"Link": "https://url3.com/item?variant=2"
"Link": "https://url3.com/item?variant=3"
However, I'm a beginner at Python, and only at a basic level with pandas. I tried to find the answer and came across map/append options, but none of them seem to work; they throw various TypeErrors.
Any help or advice on what/where to read would be very helpful.
Thank you in advance.
Here is my basic code:
def parse(self, response):
    items = response.xpath("//*[@id='bc-sf-filter-products']/div")
    for item in items:
        link = item.xpath(".//div[@class='figcaption product--card--text under text-center']/a/@href").get()
        yield response.follow(url=link, callback=self.parse_item)

def parse_item(self, response):
    Title = response.xpath(".//div[@class='hide-on-mobile']/div[@class='productTitle']/text()").get()
    Item_Link = response.url
    n_item_link = f"{Item_Link}?variant="
    idre = r'("id":\d*)'  # defining regex
    id = response.xpath("//script[@id='ProductJson-product']/text()").re(idre)  # applying regex
    id1 = [item.replace('"id":', '') for item in id]  # cleaning list of url-ids
    id2 = id1[1:]  # dropping first item
    test = n_item_link.append(id2)  # doesn't work
    test2 = n_item_link.str.cat(id2)  # doesn't work either
    yield {
        'test': test,
        'test2': test2
    }
import pandas as pd

# recreating the DataFrame
df = pd.DataFrame({
    "link": ["https://url.com/item?variant=",
             "https://url2.com/item?variant=",
             "https://url3.com/item?variant="],
    "variants": [["1", "2"],
                 ["1", "2"],
                 ["1", "1", "3"]]
})

# creating a new column containing the length of each list
df["len_list"] = [len(x) for x in df["variants"].to_list()]

# creating a list of all values in df.variants and converting the values to strings
flat_list_variants = [str(item) for sublist in df["variants"].to_list() for item in sublist]

# creating a new DataFrame in which each index is repeated by the length in df["len_list"]
df_new = df.loc[df.index.repeat(df.len_list)]

# assign the list to a new column
df_new["flat_variants"] = flat_list_variants

# compose the result by concatenating the strings
df_new["results"] = df_new["link"] + df_new["flat_variants"]
I don't know exactly what your input looks like, but assuming you have a list (or another iterable) for your links and for your extensions, this will work:
def join_url(links, ext_lists):
    urls = []
    for link, extension_list in zip(links, ext_lists):
        for extension in extension_list:
            urls.append(link + extension)
    return urls
Sample input:
websites = ['web1&hello=', 'web2--', 'web3=']
extensions = [['1', '2'], ['1', '2', '3'], ['3', '1']]
url_list = join_url(websites, extensions)
print(url_list)
Output:
['web1&hello=1', 'web1&hello=2', 'web2--1', 'web2--2', 'web2--3', 'web3=3', 'web3=1']
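If your links and extensions actually live in a pandas DataFrame like the one in the other answer, the same function should work unchanged on the columns, e.g. join_url(df["link"], df["variants"]) (assuming those column names), since zip iterates over the values of the two Series.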
I want to get the hyperlink of a cell (A1, for example) in Python. I have this code so far. Thanks
properties = {
    "requests": [
        {
            "cell": {
                "HyperlinkDisplayType": "LINKED"
            },
            "fields": "userEnteredFormat.HyperlinkDisplayType"
        }
    ]
}
result = service.spreadsheets().values().get(
    spreadsheetId=spreadsheet_id, range=rangeName, body=properties).execute()
values = result.get('values', [])
How about using sheets.spreadsheets.get? This sample script supposes that the service object in your script can already be used for spreadsheets().values().get().
Sample script:
spreadsheetId = '### Spreadsheet ID ###'
range = ['sheet1!A1:A1']  # This is a sample.
result = service.spreadsheets().get(
    spreadsheetId=spreadsheetId,
    ranges=range,
    fields="sheets/data/rowData/values/hyperlink"
).execute()
If this was not useful for you, I'm sorry.
It seems to me like this is the only way to actually get the link info (address as well as display text):
result = service.spreadsheets().values().get(
    spreadsheetId=spreadsheetId, range=range_name,
    valueRenderOption='FORMULA').execute()
values = result.get('values', [])
This returns the raw content of the cells which for hyperlinks look like this for each cell:
'=HYPERLINK("sample-link","http://www.sample.com")'
For my use I've parsed it with the following simple regex:
r'=HYPERLINK\("(.*?)","(.*?)"\)'
You can check the hyperlink if you add this at the end:
print(values[0])
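For illustration, a small sketch of pulling both capture groups out of each returned cell with that regex (the variable names here are just examples; values is the list of rows returned above):
import re

HYPERLINK_RE = re.compile(r'=HYPERLINK\("(.*?)","(.*?)"\)')

for row in values:
    for cell in row:
        match = HYPERLINK_RE.match(cell)
        if match:
            # The two capture groups are the two HYPERLINK() arguments,
            # in the order they appear in the formula.
            first_arg, second_arg = match.groups()
            print(first_arg, second_arg)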
I'm querying the onename api in an effort to get the bitcoin addresses of all the users.
At the moment I'm getting all the user information as a json-esque list and then piping the output to a file; it looks like this:
[{'0': {'owner_address': '1Q2Tv6f9vXbdoxRmGwNrHbjrrK4Hv6jCsz', 'zone_file': '{"avatar": {"url": "https://s3.amazonaws.com/kd4/111"}, "bitcoin": {"address": "1NmLvYVEZqPGeQNcgFS3DdghpoqaH4r5Xh"}, "cover": {"url": "https://s3.amazonaws.com/dx3/111"}, "facebook": {"proof": {"url": "https://facebook.com/jasondrake1978/posts/10152769170542776"}, "username": "jasondrake1978"}, "graph": {"url": "https://s3.amazonaws.com/grph/111"}, "location": {"formatted": "Mechanicsville, Va"}, "name": {"formatted": "Jason Drake"}, "twitter": {"username": "000001"}, "v": "0.2", "website": "http://1642.com"}', 'verifications': [{'proof_url': 'https://facebook.com/jasondrake1978/posts/10152769170542776', 'service': 'facebook', 'valid': False, 'identifier': 'jasondrake1978'}], 'profile': {'website': 'http://1642.com', 'cover': {'url': 'https://s3.amazonaws.com/dx3/111'}, 'facebook': {'proof': {'url': 'https://facebook.com/jasondrake1978/posts/10152769170542776'}, 'username': 'jasondrake1978'}, 'twitter': {'username': '000001'}, 'bitcoin': {'address': '1NmLvYVEZqPGeQNcgFS3DdghpoqaH4r5Xh'}, 'name': {'formatted': 'Jason Drake'}, 'graph': {'url': 'https://s3.amazonaws.com/grph/111'}, 'location': {'formatted': 'Mechanicsville, Va'}, 'avatar': {'url': 'https://s3.amazonaws.com/kd4/111'}, 'v': '0.2'}}}]
What I'm really interested in is the field {"address": "1NmLvYVEZqPGeQNcgFS3DdghpoqaH4r5Xh"}; the rest of the stuff I don't need, I just want the addresses of every user.
Is there a way I can write only the addresses to a file using Python?
I'm trying to write it as something like:
1NmLvYVEZqPGeQNcgFS3DdghpoqaH4r5Xh,
1GA9RVZHuEE8zm4ooMTiqLicfnvymhzRVm,
1BJdMS9E5TUXxJcAvBriwvDoXmVeJfKiFV,
1NmLvYVEZqPGeQNcgFS3DdghpoqaH4r5Xh,
...
and so on.
I've tried a number of different ways using dump, dumps, etc. but I haven't yet been able to pin it down.
My code looks like this:
import os
import json
import requests
#import py2neo
import csv

# set up authentication parameters
#py2neo.authenticate("46.101.180.63:7474", "neo4j", "uni-bonn")

# Connect to graph and add constraints.
neo4jUrl = os.environ.get('NEO4J_URL', "http://46.101.180.63:7474/db/data/")
#graph = py2neo.Graph(neo4jUrl)

# Add uniqueness constraints.
#graph.run("CREATE CONSTRAINT ON (q:Person) ASSERT q.id IS UNIQUE;")

# Build URL.
apiUrl = "https://api.onename.com/v1/users"
# apiUrl = "https://raw.githubusercontent.com/s-matthew-english/26.04/master/test.json"

# Send GET request.
Allusersjson = requests.get(apiUrl, headers={"accept": "application/json"}).json()
#print(json)])

UsersDetails = []
for username in Allusersjson['usernames']:
    usernamex = username[:-3]
    apiUrl2 = "https://api.onename.com/v1/users/" + usernamex + "?app-id=demo-app-id&app-secret=demo-app-secret"
    userinfo = requests.get(apiUrl2, headers={"accept": "application/json"}).json()
    # try:
    #     if('bitcoin' not in userinfo[usernamex]['profile']):
    #         continue
    #     else:
    #         UsersDetails.append(userinfo)
    # except:
    #     continue
    try:
        address = userinfo[usernamex]["profile"]["bitcoin"]["address"]
        UsersDetails.append(address)
    except KeyError:
        pass  # no address

out = "\n".join(UsersDetails)
print(out)
open("out.csv", "w").write(out)
# f = csv.writer(open("test.csv", "wb+"))

# Build query.
query = """
RETURN {json}
"""

# Send Cypher query.
# py2neo.CypherQuery(graph, query).run(json=json)
# graph.run(query).run(json=json)
#graph.run(query,json=json)
Anyway, in such a situation, what's the best way to write out those addresses as CSV?
UPDATE
I ran it, and at first it worked, but then I got the following error:
Instead of adding all of the information to the UsersDetails list with
UsersDetails.append(userinfo)
you can add just the relevant part (the address):
try:
    address = userinfo[usernamex]["profile"]["bitcoin"]["address"]
    UsersDetails.append(address)
except KeyError:
    pass  # no address
except TypeError:
    pass  # ill-formed data
To print the values to the screen:
out = "\n".join(UsersDetails)
print(out)
(replace "\n" with "," for comma separated output, instead of one per line)
To save to a file:
open("out.csv", "w").write(out)
You need to reformat the list, either through map() or a list comprehension, to get it down to just the information you want. For example, if the top-level key used in the response from the api.onename.com API is always '0', you can do something like this:
UsersAddresses = [user['0']['profile']['bitcoin']['address'] for user in UsersDetails]
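As a sketch of a possible refinement (not from the original answer): since not every user has a bitcoin entry, a guarded comprehension avoids a KeyError, assuming UsersDetails holds the raw per-user dictionaries as in the question's earlier attempt:
UsersAddresses = [
    user['0']['profile']['bitcoin']['address']
    for user in UsersDetails
    if 'bitcoin' in user.get('0', {}).get('profile', {})
]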