XLRD/ Entrez: Search through Pubmed and extract the counts - python

I am working on a project that requires me to search PubMed using inputs from an Excel spreadsheet and print the counts of the results. I have been using xlrd and Entrez (from Biopython) to do this. Here is what I have tried.
I need to search PubMed using the name of the author, his/her medical school, a range of years, and his/her mentor's name, which are all in an Excel spreadsheet. I have used xlrd to turn each column with the required information into a list of strings.
import xlrd
sheet = xlrd.open_workbook("HEENT.xlsx").sheet_by_index(0)
med_name = []
for row in sheet.col(2):
    med_name.append(row.value)  # sheet.col() yields Cell objects; .value is the content
med_school = []
for row in sheet.col(3):
    med_school.append(row.value)
mentor = []
for row in sheet.col(9):
    mentor.append(row.value)
I have managed to print the counts of my specific queries using Entrez.
from Bio import Entrez
Entrez.email = "your#email.edu"
handle = Entrez.egquery(term="Jennifer Runch AND ((2012[Date - Publication] : 2017[Date - Publication])) ")
handle_1 = Entrez.egquery(term = "Jennifer Runch AND ((2012[Date - Publication] : 2017[Date - Publication])) AND Leoard P. Byk")
handle_2 = Entrez.egquery(term = "Jennifer Runch AND ((2012[Date - Publication] : 2017[Date - Publication])) AND Southern Illinois University School of Medicine")
record = Entrez.read(handle)
record_1 = Entrez.read(handle_1)
record_2 = Entrez.read(handle_2)
pubmed_count = []
for row in record["eGQueryResult"]:
    if row["DbName"] == "pubmed":
        pubmed_count.append(row["Count"])
for row in record_1["eGQueryResult"]:
    if row["DbName"] == "pubmed":
        pubmed_count.append(row["Count"])
for row in record_2["eGQueryResult"]:
    if row["DbName"] == "pubmed":
        pubmed_count.append(row["Count"])
print(pubmed_count)
>>>['3', '0', '0']
The problem is that I need to replace the student name ("Jennifer Runch") with the next student name in the list of student names ("med_name"), the medical school with the next school, and the current mentor's name with the next mentor's name from the list.
I think I should write a for loop after setting my email for PubMed, but I am not sure how to link the two blocks of code together. Does anyone know an efficient way to connect the two blocks of code, or a better approach than the one I have tried?
Thank you!

You have most of the code in place; it just needs to be modified slightly.
Assuming your table looks like this:
Jennifer Bunch |Southern Illinois University School of Medicine|Leonard P. Rybak
Philipp Robinson|Stanford University School of Medicine |Roger Kornberg
you could use the following code
import xlrd
from Bio import Entrez

Entrez.email = "your#email.edu"  # same placeholder as in the question

sheet = xlrd.open_workbook("HEENT.xlsx").sheet_by_index(0)

# Read name, school and mentor from each row in a single pass.
search_terms = list()
for row in range(0, sheet.nrows):
    search_terms.append([sheet.cell_value(row, 0), sheet.cell_value(row,1), sheet.cell_value(row, 2)])

pubmed_counts = list()
for search_term in search_terms:
    handle = Entrez.egquery(term="{0} AND ((2012[Date - Publication] : 2017[Date - Publication])) ".format(search_term[0]))
    handle_1 = Entrez.egquery(term = "{0} AND ((2012[Date - Publication] : 2017[Date - Publication])) AND {1}".format(search_term[0], search_term[2]))
    handle_2 = Entrez.egquery(term = "{0} AND ((2012[Date - Publication] : 2017[Date - Publication])) AND {1}".format(search_term[0], search_term[1]))
    record = Entrez.read(handle)
    record_1 = Entrez.read(handle_1)
    record_2 = Entrez.read(handle_2)
    # Fixed-size list so each position always maps to the same query.
    pubmed_count = ['', '', '']
    for row in record["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            pubmed_count[0] = row["Count"]
    for row in record_1["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            pubmed_count[1] = row["Count"]
    for row in record_2["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            pubmed_count[2] = row["Count"]
    print(pubmed_count)
    pubmed_counts.append(pubmed_count)
Output
['3', '0', '0']
['1', '0', '0']
The required modification is to make the queries variable using format.
Some other modifications which are not necessary but might be helpful:
- loop over the Excel sheet only once
- store the pubmed_count in a predefined list: if values come back empty, the size of the output would otherwise vary, making it hard to tell which value belongs to which query

Everything could be optimized and tidied further, e.g. by storing the queries in a list and looping over them, which would reduce code repetition, but for now it does the job.
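The last suggestion above, storing the queries in a list and looping over them, could look roughly like the sketch below. The helper names build_queries and extract_pubmed_count are made up for this illustration, and the record used at the end is a hand-written stand-in with the same shape as the parsed egquery result used above, so the snippet runs without contacting NCBI.

```python
# Hypothetical helpers for illustration; not part of xlrd or Biopython.
DATE_RANGE = "((2012[Date - Publication] : 2017[Date - Publication]))"


def build_queries(name, school, mentor):
    """Return the three query strings for one spreadsheet row."""
    base = "{0} AND {1}".format(name, DATE_RANGE)
    return [
        base,
        "{0} AND {1}".format(base, mentor),
        "{0} AND {1}".format(base, school),
    ]


def extract_pubmed_count(record):
    """Pick the PubMed count out of a parsed egquery record ('' if absent)."""
    for row in record["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            return row["Count"]
    return ""


queries = build_queries("Jennifer Bunch",
                        "Southern Illinois University School of Medicine",
                        "Leonard P. Rybak")

# Stand-in for Entrez.read(Entrez.egquery(term=query)).
fake_record = {"eGQueryResult": [{"DbName": "pmc", "Count": "7"},
                                 {"DbName": "pubmed", "Count": "3"}]}
count = extract_pubmed_count(fake_record)
```

In the real loop, each query string would be sent with Entrez.egquery and parsed with Entrez.read before being passed to extract_pubmed_count, so the three near-identical blocks collapse into one inner loop.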

How to extract certain paragraph from text file

def extract_book_info(self):
    books_info = []
    for file in os.listdir(self.book_folder_path):
        title = "None"
        author = "None"
        release_date = "None"
        last_update_date = "None"
        language = "None"
        producer = "None"
        with open(self.book_folder_path + file, 'r', encoding = 'utf-8') as content:
            book_info = content.readlines()
        for lines in book_info:
            if lines.startswith('Title'):
                title = lines.strip().split(': ')
            elif lines.startswith('Author'):
                try:
                    author = lines.strip().split(': ')
                except IndexError:
                    author = 'Empty'
            elif lines.startswith('Release date'):
                release_date = lines.strip().split(': ')
            elif lines.startswith('Last updated'):
                last_update_date = lines.strip().split(': ')
            elif lines.startswith('Produce by'):
                producer = lines.strip().split(': ')
            elif lines.startswith('Language'):
                language = lines.strip().split(': ')
            elif lines.startswith('***'):
                pass
        books_info.append(Book(title, author, release_date, last_update_date, producer, language, self.book_folder_path))
    with open(self.book_info_path, 'w', encoding="utf-8") as book_file:
        for book_info in books_info:
            book_file.write(book_info.__str__() + "\n")
I am using this code to extract the book title, author, release_date, last_update_date, language, and producer (plus the book path).
This is the output I get:
['Title', 'The Adventures of Sherlock Holmes'];;;['Author', 'Arthur Conan Doyle'];;;None;;;None;;;None;;;['Language', 'English'];;;data/books_data/;;;
This is the output I should get. What method should I use to achieve it?
The Adventures of Sherlock Holmes;;;Arthur Conan Doyle;;;November29,2002;;;May20,2019;;;English;;;
This is the example of input:
Title: The Adventures of Sherlock Holmes
Author: Arthur Conan Doyle
Release Date: November 29, 2002 [eBook #1661]
[Most recently updated: May 20, 2019]
Language: English
Character set encoding: UTF-8
Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez
*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***
cover
str.split gives you a list as a result, but you are assigning that whole list to a variable that is meant to hold a single value.
'Title: Sherlock Holmes'.split(': ') # => ['Title', 'Sherlock Holmes']
From what I can gather from your requirement, you want the second element of the split every time. You can get it like this:
...
for lines in book_info:
    if lines.startswith('Author'):
        _, author = lines.strip().split(': ', 1)
    elif...
Be careful: this unpacking raises a ValueError if the split result has no second element, so the except IndexError around the author field in your code would not catch it.
Also, avoid calling __str__ directly. That's what the str() function calls for you anyway. Use that instead.
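Putting this together for the sample header in the question, one possible sketch is shown below. It is not the asker's Book class: parse_header and the PREFIXES table are hypothetical names, and the handling of the "[Most recently updated: ...]" line and of the trailing "[eBook #1661]" tag is an assumption based only on the single sample shown. Unlike the desired output in the question, it keeps the spaces inside the dates.

```python
import re

# Hypothetical mapping from header prefixes (as they appear in the sample) to field names.
PREFIXES = {
    "Title": "title",
    "Author": "author",
    "Release Date": "release_date",
    "Language": "language",
    "Produced by": "producer",
}


def parse_header(lines):
    """Collect header fields from an iterable of lines; missing ones stay 'None'."""
    info = dict.fromkeys(list(PREFIXES.values()) + ["last_updated"], "None")
    for line in lines:
        line = line.strip()
        if line.startswith("***"):
            break  # the header ends where the book text starts
        if line.startswith("[Most recently updated"):
            info["last_updated"] = line.strip("[]").partition(": ")[2]
            continue
        key, sep, value = line.partition(": ")
        if sep and key in PREFIXES:
            # Drop a trailing bracketed tag such as "[eBook #1661]".
            info[PREFIXES[key]] = re.sub(r"\s*\[.*\]$", "", value)
    return info


sample = """Title: The Adventures of Sherlock Holmes
Author: Arthur Conan Doyle
Release Date: November 29, 2002 [eBook #1661]
[Most recently updated: May 20, 2019]
Language: English
Character set encoding: UTF-8
Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez
*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***
cover""".splitlines()

info = parse_header(sample)
```

str.partition always returns three parts, so a line without ": " simply yields an empty separator instead of raising an exception.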

Cannot get the value if the sharepoint column type is "Person" - Python

I am trying to extract a list from SharePoint. If the column type is "Person or Group", Python raises a KeyError, but for any other column type I can read the value.
This is my code to get the values:
print("Item title: {0}, Id: {1}".format(item.properties["Title"], item.properties['AnalystName']))
Title works but AnalystName does not. Both are the internal names in SharePoint.
authcookie = Office365('https://xxxxxxxxx.sharepoint.com', username='xxxxxxxxx', password='xxxxxxxxx').GetCookies()
site = Site('https://xxxxxxxxxxxx.sharepoint.com/sites/qualityassuranceteam', authcookie=authcookie)
new_list = site.List('Process Review - Customer Service Opt In/Opt Out')
query = {'Where': [('Gt', 'Audit Date', '2020-02-16')]}
sp_data = new_list.GetListItems(fields=['App ID', 'Analyst Name', 'Team Member Name', "Team Member's Supervisor Name",
                                        'Audit Date', 'Event Date (E.g. Call date)', 'Product Type', 'Master Contact Id',
                                        'Location', 'Team member read the disclosure?', 'Team member withheld the disclosure?',
                                        'Did the team member take the correct action?', 'Did the team member notate the account?',
                                        'Did the team member add the correct phone number?', 'Comment (Required)',
                                        'Modified'], query=query)
#print(sp_data[0])
final_file = '' #Create an empty File
num = 0
for k in sp_data:
    values = sp_data[num].values()
    val = "|".join(str(v).replace('None', 'null') for v in values) + '\n'
    num += 1
    final_file += val
file_name = 'test.txt'
with open(file_name, 'a', encoding='utf-8') as file:
    file.write(final_file)
So right now I'm getting what I want, but there is a problem: when a column is empty, it is skipped instead of producing an empty field. For example:
col-1 | col-2 | col-3 |
HI | 10 | 8 |
Hello | | 7 |
In this table, row 1 is full, so it gives me everything:
HI|10|8
but the second row gives me
Hello|7
and I need Hello||7
Person fields are exposed under different names in the item properties.
Example: UserName is changed to UserNameId and UserNameString.
That is the reason for the KeyError: the item properties do not contain the field under its original name.
Use the code below to get the person field values:
#Python Code
from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext
site_url = "enter sharepoint url"
sp_list = "enter list name"
ctx = ClientContext(site_url).with_credentials(UserCredential("username","password"))
tasks_list = ctx.web.lists.get_by_title(sp_list)
items = tasks_list.items.get().select(["*", "UserName/Id", "UserName/Title"]).expand(["UserName"]).execute_query()
for item in items: # type:ListItem
    print("{0}".format(item.properties.get('UserName').get("Title")))
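For the second problem in the question (empty columns being skipped in the pipe-delimited output), a minimal sketch is to join the values in a fixed column order instead of relying on whatever keys each row happens to contain. It assumes each row is a dict that simply omits empty columns, which is what the behavior described in the question suggests; FIELDS and row_to_line are hypothetical names.

```python
# Hypothetical column list; in practice, reuse the names passed to GetListItems(fields=...).
FIELDS = ["col-1", "col-2", "col-3"]


def row_to_line(row, fields=FIELDS):
    """Join the values of `row` in a fixed column order, keeping empty slots."""
    values = []
    for field in fields:
        value = row.get(field)  # None when the column is missing from this row
        values.append("" if value is None else str(value))
    return "|".join(values)


rows = [
    {"col-1": "HI", "col-2": 10, "col-3": 8},
    {"col-1": "Hello", "col-3": 7},  # col-2 is missing, as for an empty column
]
lines = [row_to_line(row) for row in rows]
```

Because every row is rendered against the same field list, the number of separators stays constant and each value lands in the right column.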

Running multiple querys on YouTube API by looping through title columns of CSV python

I am using YouTube's API to get comment data from a list of music videos. The way I have it working right now is by manually typing in my query, writing the data to a CSV file, and repeating this for each song, like this:
query = "song title"
query_results = service.search().list(
    part = 'snippet',
    q = query,
    order = 'relevance', # You can consider using viewCount
    maxResults = 20,
    type = 'video', # Channels might appear in search results
    relevanceLanguage = 'en',
    safeSearch = 'moderate',
).execute()
What I would like to do is use the title and artist columns from a csv file that I have containing the song titles I am trying to gather data for so I can run the program once without having to manually type in the song each time.
A friend suggested using something like this
import pandas as pd
data = pd.read_csv("metadata.csv")
def songtitle():
    for i in data.index:
        title = data.loc[i,'title']
        title = '\"' + title + '\"'
        artist = data.loc[i,'artist']
    return(artist, title)
But I am not sure how to make this work: when I run it, it only returns the final row of data, and even if it ran correctly, I don't know how I would get the entire program to repeat itself for every new song.
You can save the song titles and artists to lists, then loop over those lists to get the details.
import pandas as pd

def get_songTitles():
    data = pd.read_csv("metadata.csv")
    return data['artist'].tolist(),data['title'].tolist()

artist, song_titles = get_songTitles()
for song in song_titles:
    query_results = service.search().list(
        part = 'snippet',
        q = song,
        order = 'relevance', # You can consider using viewCount
        maxResults = 20,
        type = 'video', # Channels might appear in search results
        relevanceLanguage = 'en',
        safeSearch = 'moderate',
    ).execute()
    # query_results is overwritten on each pass; process or store it here
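If the artist should be part of the query as well, and the results kept per song, something along these lines may help. This is only a sketch: build_queries and run_searches are hypothetical names, the CSV content is made up, and the search argument stands in for a function that wraps the service.search().list(...).execute() call shown above, so the snippet runs without API credentials.

```python
import csv
import io


def build_queries(csv_file):
    """Return one search string per row, combining artist and quoted title."""
    reader = csv.DictReader(csv_file)
    return ['{0} "{1}"'.format(row["artist"], row["title"]) for row in reader]


def run_searches(queries, search):
    """Call `search` once per query and keep the results keyed by query."""
    return {query: search(query) for query in queries}


# In-memory stand-in for metadata.csv (made-up rows).
sample = io.StringIO("title,artist\nSong A,Artist One\nSong B,Artist Two\n")
queries = build_queries(sample)

# Stand-in for a function wrapping service.search().list(q=query, ...).execute().
results = run_searches(queries, lambda query: {"q": query})
```

Keeping the results in a dict keyed by query avoids the problem of each loop pass overwriting the previous response.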

Converting a text file into csv file using python

I have a requirement to convert my text files into CSV, and I am using Python to do it. My text file looks like this:
Employee Name : XXXXX
Employee Number : 12345
Age : 45
Hobbies: Tennis
Employee Name: xxx
Employee Number :123456
Hobbies : Football
I want my CSV file to have the column names Employee Name, Employee Number, Age and Hobbies, and when a particular value is not present it should have the value NA in that place. Are there any simple solutions for this? Thanks in advance.
You can do something like this:
records = """Employee Name : XXXXX
Employee Number : 12345
Age : 45
Hobbies: Tennis
Employee Name: xxx
Employee Number :123456
Hobbies : Football"""

rows = []
for record in records.split('Employee Name'):
    if not record.strip():
        continue  # skip the empty chunk before the first "Employee Name"
    name = 'NA'
    number = 'NA'
    age = 'NA'
    hobbies = 'NA'
    for field in record.strip().split('\n'):
        # Split on the first colon only and drop the surrounding whitespace.
        field_name, field_value = (part.strip() for part in field.split(':', 1))
        if field_name == "": # This is employee name, since we split on it
            name = field_value
        elif field_name == "Employee Number":
            number = field_value
        elif field_name == "Age":
            age = field_value
        elif field_name == "Hobbies":
            hobbies = field_value
    rows.append([name, number, age, hobbies])
Of course, this method assumes that there is (at least) Employee Name field in every record.
Maybe this helps you get started? It's just the static output of the first employee data. You would now need to wrap this into some sort of iteration over the file. There is very very likely a more elegant solution, but this is how you would do it without a single import statement ;)
with open('test.txt', 'r') as f:
    content = f.readlines()
output_line = "".join([line.split(':')[1].replace('\n',';').strip() for line in content[0:4]])
print(output_line)
I followed very simple steps for this. It may not be optimal, but it solves the problem. The important case I can see here is that the same key ("Employee Name" etc.) can occur multiple times in a single file.
Steps
1. Read the txt file into a list of lines.
2. Convert the list to a dict (the logic could be improved, or more complex lambdas added here).
3. Use pandas to convert the dict to CSV.
Below is the code:
import pandas

txt_file = r"test.txt"
with open(txt_file, "r") as txt:
    txt_lines = txt.read().split("\n")

txt_dict = {}
for txt_line in txt_lines:
    if ":" not in txt_line:
        continue  # skip blank lines
    k, v = txt_line.split(":", 1)
    k = k.strip()
    v = v.strip()
    # Collect every value seen for a key, in file order.
    txt_dict.setdefault(k, []).append(v)

print(pandas.DataFrame.from_dict(txt_dict, orient="index"))
Output (row order may differ depending on the Python version):
0 1
Employee Number 12345 123456
Age 45 None
Employee Name XXXXX xxx
Hobbies Tennis Football
I hope this helps.
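For the exact layout asked for in the question (one row per employee, NA for missing values), a standard-library sketch could use csv.DictWriter, whose restval argument fills in missing columns. It assumes every record starts with an "Employee Name" line, as in the sample; parse_records is a hypothetical helper, and the StringIO buffer stands in for a real output file.

```python
import csv
import io

FIELDNAMES = ["Employee Name", "Employee Number", "Age", "Hobbies"]


def parse_records(text):
    """Split 'key : value' lines into one dict per employee."""
    records = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        key, value = key.strip(), value.strip()
        if key == "Employee Name":
            records.append({})  # a new employee starts here
        if records:
            records[-1][key] = value
    return records


text = """Employee Name : XXXXX
Employee Number : 12345
Age : 45
Hobbies: Tennis
Employee Name: xxx
Employee Number :123456
Hobbies : Football"""

records = parse_records(text)

# restval fills in columns a record lacks; unknown keys are ignored.
buffer = io.StringIO()  # stands in for open("output.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, restval="NA",
                        extrasaction="ignore")
writer.writeheader()
writer.writerows(records)
csv_lines = buffer.getvalue().splitlines()
```

Grouping by employee first also avoids the misalignment that per-key lists can produce when an early record lacks a field that a later one has.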

In Python, trying to convert geocoded tsv file into geojson format

I'm trying to convert a geocoded TSV file into GeoJSON format, but I'm having trouble with it. Here's the code:
import geojson
import csv
def create_map(datafile):
    geo_map = {"type":"FeatureCollection"}
    item_list = []
    datablock = list(csv.reader(datafile))
    for i, line in enumerate(datablock):
        data = {}
        data['type'] = 'Feature'
        data['id'] = i
        data['properties']={'title': line['Movie Title'],
                            'description': line['Amenities'],
                            'date': line['Date']}
        data['name'] = {line['Location']}
        data['geometry'] = {'type':'Point',
                            'coordinates':(line['Lat'], line['Lng'])}
        item_list.append(data)
    for point in item_list:
        geo_map.setdefault('features', []).append(point)
    with open("thedamngeojson.geojson", 'w') as f:
        f.write(geojson.dumps(geo_map))
create_map('MovieParksGeocode2.tsv')
I'm getting a TypeError: list indices must be integers, not str on the data['properties'] line, but I don't understand why. Isn't that how I set values for the GeoJSON fields?
The file I'm reading from has values under these keys: Location Movie Title Date Amenities Lat Lng
The file is viewable here: https://github.com/yongcho822/Movies-in-the-park/blob/master/MovieParksGeocodeTest.tsv
Thanks guys, much appreciated as always.
You have a couple of things going on here that need to be fixed.
1. Your TSV contains newlines inside double quotes. I don't think this is intended, and it will cause some problems.
Location Movie Title Date Amenities Formatted_Address Lat Lng
"
Edgebrook Park, Chicago " A League of Their Own 7-Jun "
Family friendly activities and games. Also: crying is allowed." Edgebrook Park, 6525 North Hiawatha Avenue, Chicago, IL 60646, USA 41.9998876 -87.7627672
"
2. You don't need the geojson module to dump out JSON, which is all GeoJSON is. Just import json instead.
3. You are trying to read a TSV, but you don't pass the delimiter='\t' option that is needed for that.
4. You are trying to read keys off the rows, but you aren't using DictReader, which does that for you. Hence the TypeError about indices you mention above.
Check out my revised code block below. You still need to fix your TSV so that it is valid.
import csv
import json

def create_map(datafile):
    geo_map = {"type":"FeatureCollection"}
    item_list = []
    with open(datafile,'r') as tsvfile:
        reader = csv.DictReader(tsvfile,delimiter='\t')
        for i, line in enumerate(reader):
            print(line)
            data = {}
            data['type'] = 'Feature'
            data['id'] = i
            data['properties']={'title': line['Movie Title'],
                                'description': line['Amenities'],
                                'date': line['Date']}
            # A plain string; a set literal here would not be JSON serializable.
            data['name'] = line['Location']
            # GeoJSON positions are [longitude, latitude], given as numbers.
            data['geometry'] = {'type':'Point',
                                'coordinates':(float(line['Lng']), float(line['Lat']))}
            item_list.append(data)
    for point in item_list:
        geo_map.setdefault('features', []).append(point)
    with open("thedamngeojson.geojson", 'w') as f:
        f.write(json.dumps(geo_map))

create_map('MovieParksGeocode2.tsv')
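To see where the original TypeError comes from, the small comparison below reads the same in-memory TSV with csv.reader and with csv.DictReader. The two-line sample is a reduced version of the file shown above (only four of its columns), and the variable names are only for illustration.

```python
import csv
import io

# Reduced sample with a header row and one data row, tab-separated.
tsv_text = ("Location\tMovie Title\tLat\tLng\n"
            "Edgebrook Park\tA League of Their Own\t41.9998876\t-87.7627672\n")

# csv.reader yields plain lists, so only integer indices work.
list_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
try:
    list_rows[1]["Movie Title"]
    error = None
except TypeError as exc:
    error = exc  # list indices must be integers or slices, not str

# csv.DictReader uses the header row as keys for every following row.
dict_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
title = dict_rows[0]["Movie Title"]

# Longitude first, then latitude, converted from text to numbers.
point = [float(dict_rows[0]["Lng"]), float(dict_rows[0]["Lat"])]
```

The same file content produces a list per row in the first case and a dict per row in the second, which is why switching to DictReader makes the string keys work.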
