get full text from pubmed - python

I am using the Python package Bio (Biopython) to access the PubMed database, but unfortunately I can only get the abstracts through this API.
I want to know whether it is possible to get the full text as well, and how.
molp5 is a file containing a list of molecule names like the ones below:
Flavopiridol
4-azapaullone
Here is my code:
from Bio import Entrez
import pandas as pd

def search(query):
    Entrez.email = 'xxxxx@gmail.com'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='3000',
                            retmode='text',
                            rettype='Medline',
                            term=query)
    results = Entrez.read(handle)
    return results

def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'xxxxx@gmail.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results

if __name__ == '__main__':
    # load the file containing the names of the molecules
    mol = pd.read_csv('/W2V/molp5.csv')
    mol["idx"] = mol["idx"].apply(lambda x: x.lower())
    txt = ""
    retmax = []
    for m in mol["idx"]:
        results = search(m)
        # print the number of articles available and the name of the molecule
        print(m, results['RetMax'])
        id_list = results['IdList']
        papers = fetch_details(id_list)
        for i, paper in enumerate(papers):
            try:
                # concatenate the title and abstract together
                txt += paper['MedlineCitation']['Article']['ArticleTitle']
                for j in paper['MedlineCitation']['Article']['Abstract']['AbstractText']:
                    txt += j + '\n'
            except KeyError:
                pass
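
For what it's worth, the pubmed database itself only returns citation data and abstracts; full text is only available for articles that have been deposited in PubMed Central. Below is a minimal sketch (my own illustration, not part of the code above; fetch_pmc_fulltext is a hypothetical helper) of querying the pmc database instead, which returns the article XML, including the body for open-access papers:
from Bio import Entrez

Entrez.email = 'xxxxx@gmail.com'  # placeholder address, as above

def fetch_pmc_fulltext(query, retmax=5):
    # search PubMed Central rather than PubMed
    handle = Entrez.esearch(db='pmc', term=query, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    articles = []
    for pmc_id in record['IdList']:
        # efetch on db='pmc' returns the article as XML; the full body is
        # only present when the article is open access in PMC
        fetch = Entrez.efetch(db='pmc', id=pmc_id, retmode='xml')
        articles.append(fetch.read())
        fetch.close()
    return articles
The returned XML can then be parsed with any XML parser to pull out the body sections.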

Related

Loops | list | extraction

I'm trying to import some elements from a file, do some work with them, and print the results in a list (and extract them to a file in the future). When I break the code up piece by piece I get all the information I need, but when I try to extract all the info at once with a loop I get this:
spotify:track:0PJ4RVL5wCeHDO8wHpk3YG
t
0 K S s
1 n t p
2 a e o
3 s v t
The loop breaks the word (the last element of the list) up letter by letter.
I'm using several loops, and there are some IDs that I don't know how to attach the needed info to.
import spotipy
import openpyxl
import pandas as pd
import xlsxwriter
import xlrd
from spotipy.oauth2 import SpotifyClientCredentials

path = "C:\\Users\\Karolis\\Desktop\\Python\\Failai\\Gabalai.xlsx"
wb = xlrd.open_workbook(path)
sheet = wb.sheet_by_index(0)
sheet2 = wb.sheet_by_index(0)
sheet.cell_value(0, 0)
sheet2.cell_value(0, 0)

client_id = ''
client_secret = ''

for ix in range(sheet.nrows):
    title = (sheet.cell_value(ix, 0))
    artist = (sheet2.cell_value(ix, 1))
    # here below it extracts the track's uri
    client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
    sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
    sp.trace = False
    search_query = title + ' ' + artist
    result = sp.search(search_query)
    for i in result['tracks']['items']:
        # Find a song that matches title and artist
        if (i['artists'][0]['name'] == artist) and (i['name'] == title):
            uri = (i['uri'])
            break
    else:
        try:
            # Just take the first song returned by the search (might be named differently)
            print(result['tracks']['items'][0]['uri'])
        except:
            # No results for artist and title
            print("Cannot Find URI")

# here below it extracts the track's info by its uri
uri = (i['uri'])
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
sp.trace = False
features = sp.audio_features(uri)

for p in range(len(title)):
    print(p, artist[p], title[p], uri[p])

book1.close()
I'm a beginner at programming and I'm totally lost at this point.
Thank you for any help provided
Your last loop,
for p in range(len(title)):
    print(p, artist[p], title[p], uri[p])
is explicitly printing each character of the artist, title, and uri. I think you mean to be looping over a list of titles, and indexing into lists of artists, titles, and uris, but instead you are looping over the length of a string and indexing into strings.
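A toy illustration of the difference (the lists below are made up; in your code they would be built by appending each row's values inside the main loop):
# parallel lists, one entry per track (hypothetical sample data)
titles = ["Song A", "Song B"]
artists = ["Artist A", "Artist B"]
uris = ["spotify:track:aaa", "spotify:track:bbb"]

for p in range(len(titles)):
    # p now indexes a track, not a character of a single string
    print(p, artists[p], titles[p], uris[p])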

Web scraping when the item order appears to shuffle with pagination - indeed jobs

I'm attempting to record the information from the job site Indeed.com for all jobs resulting from a specific search term.
However, although the site says there are X jobs available when I search that term (i.e., "showing Page 1 of X jobs"), the number of unique jobs I record (I remove duplicates) is much less. The number is also not consistent if I repeat the scrape, and there are duplicates.
This makes me wonder if there is some shuffling of the contents (think sampling with replacement) so that there are unique jobs that I don't see.
An alternative could be that the number of jobs isn't shown correctly on the site. For example, if you go to the last page it shows only approximately 620 jobs of the alleged 920. But this doesn't explain why I don't consistently get the same number of unique results if I run the code twice in quick succession.
Any thoughts?
The Python3 code is here if you want to run it.
Requires: requests, bs4, pandas, numpy
# modified from https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import json
from html.parser import HTMLParser
from datetime import datetime
import numpy as np
import re
from itertools import chain

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    result = s.get_data()
    # result = result.strip('\n')
    result = result.replace('\n', ' ')
    return result

keyword = 'sustainability'
state = 'mi'
page_limit = 50

# init the dataframe
columns = ['job_code', 'keyword', 'job_title', 'company_name', 'location', 'salary']
jobs = []
div_list = []

# determine max number of results
page = requests.get('http://www.indeed.com/jobs?q={}&l={}'.format(keyword, state.lower()))
soup = BeautifulSoup(page.content, "html.parser")
max_result_bs4 = soup.find(name='div', attrs={'id': 'searchCount'})  # .find('directory').text
max_results = int(max_result_bs4.contents[0].split(' ')[-2].replace(',', ''))

# scraping code:
# loop through pages
for start in chain(range(0, max_results, page_limit), range(0, max_results, page_limit)):
    url = 'https://www.indeed.com/jobs?q={}&start={}&limit={}&l={}&sort=date'.format(keyword, start, page_limit, state.lower())
    page = requests.get(url)
    time.sleep(0.01)  # ensuring at least 0.01 second between page grabs
    soup = BeautifulSoup(page.content, 'html.parser', from_encoding='utf-8')
    # record the divs
    div_list += soup.find_all(name='div', attrs={'data-tn-component': 'organicJob'})

# format the scrapes
for div in div_list:
    # creating an empty list to hold the data for each posting
    job_post = []
    # get the job code
    job_code = div['data-jk']
    job_post.append(job_code)
    # append keyword name
    job_post.append(keyword)
    # grabbing job title
    for a in div.find_all(name='a', attrs={'data-tn-element': 'jobTitle'}):
        title = a['title']
        if title:
            job_post.append(title)
        else:
            job_post.append('Not found')
    # grabbing company name
    company = div.find_all(name='span', attrs={'class': 'company'})
    if len(company) > 0:
        for b in company:
            job_post.append(b.text.strip())
    else:
        sec_try = div.find_all(name='span', attrs={'class': 'result-link-source'})
        for span in sec_try:
            job_post.append(span.text)
    if len(job_post) == 3:
        job_post.append('Not found')
    # grabbing location name
    job_post.append(state)
    # grabbing salary
    try:
        job_post.append(div.find('nobr').text)
    except:
        try:
            job_post.append(div.find(name='span', attrs={'class': 'salary no-wrap'}).text.strip())
        except:
            job_post.append('Nothing_found')
    # appending list of job post info to dataframe at index num
    jobs.append(job_post)

df = pd.DataFrame(jobs, columns=columns)

# saving df as a local csv file - define your own local path to save contents
todays_date = datetime.now().strftime('%Y%m%d')
df.to_csv('indeed_{}.csv'.format(todays_date), encoding='utf-8')

df_len = len(df)
unique_jobs = len(np.unique(df.job_code))
print('Found {} unique jobs from an alleged {} after recording {} postings'.format(unique_jobs, max_results, df_len))
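
One way to test the shuffling hypothesis directly is to compare the job codes seen on two back-to-back passes as sets. A rough sketch, assuming the page loop above is wrapped in a hypothetical helper scrape_job_codes(keyword, state) that returns the set of data-jk codes from a single pass over all pages:
# scrape_job_codes is a hypothetical wrapper around the page loop above,
# returning the set of div['data-jk'] values seen in one pass
codes_a = scrape_job_codes(keyword, state)
codes_b = scrape_job_codes(keyword, state)

print('pass A: {} unique codes, pass B: {} unique codes'.format(len(codes_a), len(codes_b)))
print('only in pass A: {}'.format(len(codes_a - codes_b)))
print('only in pass B: {}'.format(len(codes_b - codes_a)))
# non-zero differences in both directions would suggest the site is rotating
# which postings it serves, not just reporting a wrong total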

How to fix json.decoder.JSONDecodeError when I use googletrans API?

I am trying to translate a series of tweets from Italian into English. They are contained in a csv file, so I extract them with pandas to compute the sentiment with Vader. Unfortunately, I get this error: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).
I have tried both removing the emoji from the tweets and using a VPN, as suggested in some other posts, but it doesn't work.
# imports added for completeness (assumed from the code below)
import emoji
import demoji
import pandas as pd
from googletrans import Translator
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

def remove_emoji(text):
    return emoji.get_emoji_regexp().sub(u'', text)

def extract_emojis(str):
    return ''.join(c for c in str if c in emoji.UNICODE_EMOJI)

def clean_emojis(text):
    toreturn = ""
    for c in text:
        if c not in emoji.UNICODE_EMOJI:
            toreturn += c
    return toreturn

def sentiment_analyzer_scores(text, engl=True):
    if engl:
        translation = text
    else:
        try:
            emojis = extract_emojis(text)
            text = clean_emojis(text)
            demoji.replace(text)
            text = remove_emoji(text)
            text = text.encode('ascii', 'ignore').decode('ascii')
            # translator = Translator(from_lang="Italian", to_lang="English")
            # translation = translator.translate(text)
            translation = translator.translate(text).text
            # print(translation)
        except Error as e:
            print(text)
            print(e)
            pass
        text = translation + emojis
    # print(text)
    score = analyser.polarity_scores(text)
    return score['compound']

def anl_tweets(lst, engl=True):
    sents = []
    id = 0
    for tweet_text in lst:
        try:
            sentiment = sentiment_analyzer_scores(tweet_text, engl)
            sents.append(sentiment)
            id = id + 1
            print("Sentiment del tweet n° %s = %s" % (id, sentiment))
        except Error as e:
            sents.append(0)
            print(e)
    return sents

# Main
translator = Translator()
analyser = SentimentIntensityAnalyzer()
file_name = 'file.csv'
df = pd.read_csv(file_name)
print(df.shape)
# Calculate sentiment and add a column
df['tweet_sentiment'] = anl_tweets(df.tweet_text, False)
# Save the modifications
df.to_csv(file_name, encoding='utf-8', index=False)
This has nothing to do with emoji. Google has limits on how many characters you can translate, and when you reach that limit the Google API simply blocks you.
Read about the quota here
A simple solution is to break your script into multiple chunks and use a proxy server / different IP addresses.
Another option is https://pypi.org/project/translate/
(I haven't tried it though)
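If you do try that package, here is a minimal sketch of how it is used (also untested here; note that, unlike googletrans, its translate() returns a plain string rather than an object with a .text attribute):
# sketch of the pypi "translate" package mentioned above (untested)
from translate import Translator

it_to_en = Translator(from_lang="it", to_lang="en")
translation = it_to_en.translate("Questa giornata è bellissima")
print(translation)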

Entrez (biopython): how to restrict the term search to a specific journal? (PubMed)

I want to obtain all the articles in a specific journal that are related to a specific term/topic.
I am trying to do so through PubMed using the Entrez package contained in Biopython.
The corresponding Advanced PubMed search is:
(topic/term) AND "Name of the journal"[Journal]
What I tried so far is based on the code written by Marco Bonzanini (the GitHub gist containing the original code: https://gist.github.com/bonzanini/5a4c39e4c02502a8451d).
from Bio import Entrez

def search(query):
    Entrez.email = 'example@mail.com'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='20',
                            retmode='xml',
                            term=query,
                            mindate="2018/11",
                            maxdate="2019/02")
    results = Entrez.read(handle)
    return results

def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'example@mail.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results

if __name__ == '__main__':
    results = search('attention')
    id_list = results['IdList']
    papers = fetch_details(id_list)
    for i, paper in enumerate(papers['PubmedArticle']):
        print("%d) %s" % (i + 1, paper['MedlineCitation']['Article']['ArticleTitle']))
E.g. to look for articles that appeared in the Journal of experimental child psychology, change your main body like so:
if __name__ == '__main__':
    results = search('attention')
    id_list = results['IdList']
    papers = fetch_details(id_list)
    i = 0
    for paper in papers['PubmedArticle']:
        if (paper['MedlineCitation']['Article']['Journal']['Title'] ==
                'Journal of experimental child psychology'):
            i += 1
            print("%d) %s" % (i, paper['MedlineCitation']['Article']['ArticleTitle']))
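
Alternatively, since the esearch term accepts the same field tags as the Advanced search box (as in the query you wrote above), you can let PubMed do the journal filtering for you instead of filtering the fetched records, e.g.:
if __name__ == '__main__':
    # the [Journal] field tag from the Advanced search works inside the term
    results = search('attention AND "Journal of experimental child psychology"[Journal]')
    id_list = results['IdList']
    papers = fetch_details(id_list)
    for i, paper in enumerate(papers['PubmedArticle']):
        print("%d) %s" % (i + 1, paper['MedlineCitation']['Article']['ArticleTitle']))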

Python saves only one row of data

def get_user_data(self, start_url):
    html = self.session.get(url=start_url, headers=self.headers, cookies=self.cookies).content
    selector = etree.fromstring(html, etree.HTMLParser(encoding='utf-8'))
    all_user = selector.xpath('//div[contains(@class,"c") and contains(@id,"M")]')
    for i in all_user:
        user_id = i.xpath('./div[1]/a[@class="nk"]/@href')[0]
        content = i.xpath('./div[1]/span[1]')[0]
        contents = content.xpath('string(.)')
        times = i.xpath('./div/span[@class="ct"]/text()')[0]
        if len(i.xpath('./div[3]')):
            imgages = i.xpath('./div[2]/a/img/@src')
            praise_num = i.xpath('./div[3]/a[2]/text()')
            transmit_num = i.xpath('./div[3]/a[3]/text()')
        elif len(i.xpath('./div[2]')):
            imgages = i.xpath('./div[2]/a/img/@src')
            praise_num = i.xpath('./div[2]/a[3]/text()')
            transmit_num = i.xpath('./div[2]/a[4]/text()')
        else:
            imgages = ''
            praise_num = i.xpath('./div[1]/a[2]/text()')
            transmit_num = i.xpath('./div[1]/a[3]/text()')
        try:
            if re.search('from', times.encode().decode('utf-8')):
                month_day, time, device = times.split(maxsplit=2)
                self.data['mobile_phone'] = device
            else:
                time, device = times.split(maxsplit=1)
                self.data['month_day'] = ''
            self.data['create_time'] = month_day + ' ' + time
        except Exception as e:
            print('failure:', e)
        self.data['crawl_time'] = datetime.strftime(datetime.now(), '%Y-%m-%d %H:%M:%S')
        self.data['user_id'] = user_id
        self.data['contents'] = contents.encode().decode('utf-8').replace('\u200b', '')
        self.data['imgages'] = imgages
        self.data['praise_num'] = praise_num
        self.data['transmit_num'] = transmit_num
    with open('a.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(self.data) + '\n')
I am trying to grab every page of data and save it, but I wrote it wrong, because only one piece of data per page is saved to 'a.txt'. How do I write it so that every record on every page is saved correctly to 'a.txt'?
The write operation is outside the for loop, that's why it only adds the last iteration's data to the file:
with open('a.txt', 'a', encoding='utf-8') as f:
    f.write(json.dumps(self.data) + '\n')
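
A small self-contained illustration of the fix (the sample records are made up; in your code the dictionary would be the one you build for each user inside the loop):
import json

# hypothetical sample records standing in for the per-user data
records = [{'user_id': 'u1', 'contents': 'first post'},
           {'user_id': 'u2', 'contents': 'second post'}]

# open the file once, but write one JSON line per record inside the loop
with open('a.txt', 'a', encoding='utf-8') as f:
    for data in records:
        f.write(json.dumps(data, ensure_ascii=False) + '\n')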
You're overwriting the various values in self.data in every iteration of the loop.
Instead, self.data should be a list. You should create a new dictionary in each iteration and append it to the data at the end.
self.data = []
for i in all_user:
    values = {}
    ...
    values['crawl_time'] = ...
    values['user_id'] = ...
    ...
    self.data.append(values)
