Noob here. I have a large number of JSON files; each is a series of blog posts in a different language. The key-value pairs are metadata about the posts, e.g. "{'author':'John Smith', 'translator':'Jane Doe'}". What I want to do is convert each to a Python dictionary, then extract the values so that I have a list of all the authors and translators across all the posts.
for lang in languages:
    f = 'posts-' + lang + '.json'
    file = codecs.open(f, 'rt', 'utf-8')
    line = string.strip(file.next())
    postAuthor[lang] = []
    postTranslator[lang] = []
    while (line):
        data = json.loads(line)
        print data['author']
        print data['translator']
When I try this method, I keep getting a KeyError for 'translator' and I'm not sure why. I've never worked with the json module before, so I tried a more complex method to see what happened:
postAuthor[lang].append(data['author'])
for translator in data.keys():
    if not data.has_key('translator'):
        postTranslator[lang] = ""
    postTranslator[lang] = data['translator']
It keeps returning an error that strings do not have an append function. This seems like a simple task and I'm not sure what I'm doing wrong.
See if this works for you:
import json

# you have lots of "posts", so let's assume
# you've stored them in some list. We'll use
# the example text you gave as one of the entries
# in said list
posts = ["{'author':'John Smith', 'translator':'Jane Doe'}"]

# strictly speaking, the single quotes in your example aren't
# valid JSON, so you'll want to switch the single quotes
# out to double quotes; you can verify this with something
# like http://jsonlint.com/
# luckily, you can easily swap out all the quotes programmatically

# so let's loop through the posts, and store the authors and translators
# in two lists
authors = []
translators = []

for post in posts:
    double_quotes_post = post.replace("'", '"')
    json_data = json.loads(double_quotes_post)

    author = json_data.get('author', None)
    translator = json_data.get('translator', None)

    if author: authors.append(author)
    if translator: translators.append(translator)

# and there you have it, a list of authors and translators
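If it helps, here is a rough sketch of folding that back into your per-language loop. It assumes (from your while loop) that each posts-<lang>.json file holds one JSON object per line, and that languages, postAuthor and postTranslator are the names from your own snippet:

import codecs
import json

postAuthor = {}
postTranslator = {}

for lang in languages:  # assumed: languages is your list of language codes
    postAuthor[lang] = []
    postTranslator[lang] = []
    with codecs.open('posts-' + lang + '.json', 'rt', 'utf-8') as f:
        for line in f:  # one JSON object per line (assumed)
            line = line.strip()
            if not line:
                continue
            data = json.loads(line)
            author = data.get('author')
            translator = data.get('translator')
            if author:
                postAuthor[lang].append(author)
            if translator:
                postTranslator[lang].append(translator)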
I have a list of UniProt IDs and need to know the PDB IDs plus the chain IDs.
With the code given on the UniProt website I can get the PDB IDs but not the chain information.
import urllib.parse
import urllib.request

url = 'https://www.uniprot.org/uploadlists/'
params = {
    'from': 'ACC+ID',
    'to': 'PDB_ID',
    'format': 'tab',
    'query': UniProtIDs
}

data = urllib.parse.urlencode(params)
data = data.encode('utf-8')
req = urllib.request.Request(url, data)

with open('UniProt_PDB_IDs.txt', 'a') as f:
    with urllib.request.urlopen(req) as q:
        response = q.read()
        f.write(response.decode('utf-8'))
So this code gets me this:
From To
A0A075B6N1 5HHM
A0A075B6N1 5HHO
A0A075B6N1 5NQK
A0A075B6T6 1AO7
A0A075B6T6 4ZDH
For the protein A0A075B6N1 with PDB ID 5HHM, the chains are E and J, so I need a way to also retrieve the chains and get something like this:
A0A075B6N1 5HHM_E
A0A075B6N1 5HHM_J
A0A075B6N1 5HHO_E
A0A075B6N1 5NQK_B
It doesn't have to be in this format; later I convert it into a dictionary with the UniProt IDs as keys and the PDB IDs as values.
Thank you for your help in advance!
A tool called localpdb was just recently released that might do exactly what you want: https://labstructbioinf.github.io/localpdb/.
Another way would be to split the structures by segments, which can easily be done with MDAnalysis Universe objects (https://www.mdanalysis.org). Assuming you have a list of PDB IDs:
import MDAnalysis as mda

# fetch structures
universe_objects = []
for pdb_id in pdb_ids:
    mmtf_object = mda.fetch_mmtf(pdb_id)
    universe_objects.append(mmtf_object)

# get rid of water and ligands and split structures into chains
universe_chains = []
for universe_object in universe_objects:
    universe_chain = universe_object.select_atoms('protein').split('segment')
    universe_chains.append(universe_chain)

# flatten nested list
universe_chain_list = [item for sublist in universe_chains for item in sublist]
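To get labels in the PDBID_chain form you asked about, one possible follow-up (my own addition, not part of the original suggestion) is to read the segment IDs off each universe; this assumes the segids of the fetched MMTF structures correspond to the chains you are after:

# Build "PDBID_chain" style labels from the protein segments of each structure.
pdb_chain_labels = []
for pdb_id, universe_object in zip(pdb_ids, universe_objects):
    protein = universe_object.select_atoms('protein')
    for segid in sorted(set(protein.segments.segids)):
        pdb_chain_labels.append(pdb_id + '_' + segid)
# e.g. ['5HHM_E', '5HHM_J', ...] if the segids match the chain letters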
Of course there are other tools you can do this with, e.g. via ProDy's HierView function!
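For completeness, a tiny sketch of the ProDy route as well (again my own addition; it assumes parsePDB can fetch the entry by its four-letter ID):

from prody import parsePDB

# Parse the structure (downloading it if it is not local) and walk its chains.
structure = parsePDB('5HHM')
hierview = structure.getHierView()
chain_ids = [chain.getChid() for chain in hierview.iterChains()]
print(['5HHM_' + chid for chid in chain_ids])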
Hope that helps.
I am trying to get a JSON response, decode it with UTF-8, and access the dictionaries in the list. The following is the JSON response:
'[{"id":26769687,"final_price":58.9,"payment_method_cost":"\\u003cem\\u003e+ 0,00 €\\u003c/em\\u003e \\u003cspan\\u003eΑντικαταβολή\\u003c/span\\u003e","net_price":53.9,"net_price_formatted":"53,90 €","final_price_formatted":"58,90 €","shop_id":649,"no_credit_card":false,"sorting_score":[-5.0,-156,-201,649,20],"payment_method_cost_supported":true,"free_shipping_cost_supported":false,"shipping_cost":"\\u003cem\\u003e+ 5,00 €\\u003c/em\\u003e \\u003cspan\\u003eΜεταφορικά\\u003c/span\\u003e","link":"/products/show/26769687"},
{"id":26771682,"final_price":55.17,"payment_method_cost":"\\u003cem\\u003e+ 2,83 €\\u003c/em\\u003e \\u003cspan\\u003eΑντικαταβολή\\u003c/span\\u003e","net_price":48.5,"net_price_formatted":"48,50 €","final_price_formatted":"55,17 €","shop_id":54,"no_credit_card":false,"sorting_score":[-3.6,-169,-84,54,10],"payment_method_cost_supported":true,"free_shipping_cost_supported":false,"shipping_cost":"\\u003cem\\u003e+ 3,84 €\\u003c/em\\u003e \\u003cspan\\u003eΜεταφορικά\\u003c/span\\u003e","link":"/products/show/26771682"}]'
which is produced by the following:
import time
import urllib.request
import numpy as np

url2besearched = 'https://www.skroutz.gr/personalization/20783507/product_prices.js?_=1569161647'
Delays = [25, 18, 24, 26, 20, 22, 19, 30]
no_of_pagedowns = 20
RandomDelays = np.random.choice(Delays)

# WAIT TIME
time.sleep(RandomDelays)

fp = urllib.request.urlopen(url2besearched)
mybytes = fp.read()
post_elems = []
mystr = mybytes.decode("utf8")
fp.close()

mystr1 = mystr.rsplit('=')
mystr2 = mystr1[1].split(";")
# I ADD THE FOLLOWING BECAUSE THE INITIAL DOES NOT HAVE ENDING BRACKETS
mystr3 = mystr2[0] + "}" + "]"

for d in mystr3:
    for key in d:
        post_elems.append([d[key], d['final_price'], d['shop_id']])
When I run the for loop, it iterates over mystr3 character by character instead of treating it as a list of dictionaries.
How can I get a list containing each dictionary's id together with its final_price and shop_id?
My desired output needs to be a list like:
post_elems =['26769687','58.9','649']
First, the API you are calling gives a weird response for some reason, so calling .json() on the response will not work, as there is a field in front of the JSON. It would be good to understand why, or to check that the URL query strings are correct. Anyway, you have already stripped that prefix, so I'll copy that code:
import requests, json
mystr = requests.get('https://www.skroutz.gr/personalization/20783507/product_prices.js?_=1569161647').text
mystr1 = mystr.rsplit('=')
mystr2 = mystr1[1].split(";")[0]
json.loads(mystr2)
This works. However, there are two things that are not great here. mystr1 is Systems Hungarian notation, which is very unpythonic; use type hinting to remind yourself what class something belongs to, not the variable name. Also, your mystr2 gives a list, a nice example of why Hungarian notation is bad.
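From there, building the list you described is just a loop over the parsed objects. A minimal sketch, assuming each entry carries the 'id', 'final_price' and 'shop_id' keys shown in your sample response:

import requests
import json

raw = requests.get('https://www.skroutz.gr/personalization/20783507/product_prices.js?_=1569161647').text
offers = json.loads(raw.rsplit('=')[1].split(';')[0])

# One [id, final_price, shop_id] entry per offer, as strings.
post_elems = [[str(o['id']), str(o['final_price']), str(o['shop_id'])] for o in offers]
print(post_elems[0])  # e.g. ['26769687', '58.9', '649']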
I am working on a project that requires me to parse massive XML files to JSON. I have written code; however, it is too slow. I have looked at using lxml and BeautifulSoup but am unsure how to proceed.
I have included my code. It works exactly how it is supposed to, except it is too slow. It took around 24 hours to go through a sub-100 MB file and parse 100,000 records.
product_data = open('productdata_29.xml', 'r')
read_product_data = product_data.read()

def record_string_to_dict(record_string):
    '''This function takes a single record in string form and iterates through
    it, and sorts it as a dictionary. Only the nodes present in the parent_rss dict
    are appended to the new dict (single_record_dict). After each record,
    single_record_dict is flushed to final_list and is then emptied.'''
    # Iterating through the string to find keys and values to put into
    # single_record_dict.
    while record_string != record_string[::-1]:
        try:
            k = record_string.index('<')
            l = record_string.index('>')
            temp_key = record_string[k + 1:l]
            record_string = record_string[l + 1:]
            m = record_string.index('<')
            temp_value = record_string[:m]
            # Cleaning the keys and values of unnecessary characters and symbols.
            if '\n' in temp_value:
                temp_value = temp_value[3:]
            if temp_key[-1] == '/':
                temp_key = temp_key[:-1]
            n = record_string.index('\n')
            record_string = record_string[n + 2:]
            # Checking parent_rss dict to see if the key from the record is present.
            # If it is, the key is replaced with the mapped key and added to
            # single_record_dict.
            if temp_key in mapped_nodes.keys():
                temp_key = mapped_nodes[temp_key]
                single_record_dict[temp_key] = temp_value
        except Exception:
            break

while len(read_product_data) > 10:
    # Goes through read_product_data to create blocks, each of which is a single
    # record.
    i = read_product_data.index('<record>')
    j = read_product_data.index('</record>') + 8
    single_record_string = read_product_data[i:j]
    single_record_string = single_record_string[9:-10]
    # Runs the previous function with the input being the single string found above.
    record_string_to_dict(single_record_string)
    # Flushes single_record_dict to final_list, and empties the dict for the next
    # record.
    final_list.append(single_record_dict)
    single_record_dict = {}
    # Removes the record that was previously processed.
    read_product_data = read_product_data[j:]
    # For keeping track / ease of use.
    print('Record ' + str(break_counter) + ' has been appended.')
    # Keeps track of the number of records. Once the set value is reached,
    # the list is flushed to a new file.
    break_counter += 1
    flush_counter += 1
    if break_counter == 100 or flush_counter == break_counter:
        record_list = open('record_list_' + str(file_counter) + '.txt', 'w')
        record_list.write(str(final_list))
        # file_counter keeps track of how many files have been created, so the next
        # file has a different int at the end.
        file_counter += 1
        record_list.close()
        # Resets break counter.
        break_counter = 0
        final_list = []
    # For testing purposes. Causes execution to stop once the number of files written
    # matches the integer.
    if file_counter == 2:
        break

print('All records have been appended.')
Any reason why you are not considering packages such as xml2json and xmltodict? See this post for working examples:
How can i convert an xml file into JSON using python?
Relevant code reproduced from above post:
xml2json
import xml2json
s = '''<?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>'''
print xml2json.xml2json(s)
xmltodict
import xmltodict, json
o = xmltodict.parse('<e> <a>text</a> <a>text</a> </e>')
json.dumps(o) # '{"e": {"a": ["text", "text"]}}'
See this post if working in Python 3:
https://pythonadventures.wordpress.com/2014/12/29/xml-to-dict-xml-to-json/
import json
import xmltodict

def convert(xml_file, xml_attribs=True):
    with open(xml_file, "rb") as f:  # notice the "rb" mode
        d = xmltodict.parse(f, xml_attribs=xml_attribs)
        return json.dumps(d, indent=4)
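For very large files you may also want to avoid reading everything into memory at once. Since the question mentions lxml, here is a rough streaming sketch of my own (not from the linked posts) using lxml.etree.iterparse; it assumes each product sits in a <record> element, as in the hand-rolled parser above, and that each of its children maps to one key/value pair:

import json
from lxml import etree

records = []
for _, elem in etree.iterparse('productdata_29.xml', tag='record'):
    # Build one dict per <record> from its direct children.
    records.append({child.tag: (child.text or '').strip() for child in elem})
    elem.clear()  # free memory for already-processed elements

with open('record_list_0.json', 'w') as out:
    json.dump(records, out, indent=4)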
You definitely don't want to be hand-parsing the XML. As well as the libraries others have mentioned, you could use an XSLT 3.0 processor. To go above 100 MB you would benefit from a streaming processor such as Saxon-EE, but up to that kind of level the open source Saxon-HE should be able to handle it. You haven't shown the source XML or target JSON, so I can't give you specific code - the assumption in XSLT 3.0 is that you probably want a customized transformation rather than an off-the-shelf one, so the general idea is to write template rules that define how different parts of your input XML should be handled.
For my programming assignment, one of the functions involves taking input from a text file (twitter data) and returning a tuple of the tweet information (see doctests for correct results on a sample file).
Sample text file: http://pastebin.com/z5ZkN3WH
Full description of function is as follows:
The parameter is the full name of a file. Open the file specified by the parameter, which is formatted as described in the data files section, and read all of the data from it. The keys of the dictionary should be the names of the candidates, and the items in the list associated with each candidate are the tweets they have sent. A tweet tuple should have the form (candidate, tweet text, date, source, favorite count, retweet count). The date, favorite count, and retweet count should be integers, and the rest of the items in the tuple should be strings.
My code so far is below:
def extract_data(metadata):
    """ list of str -> tuple of str/int
    Return extracted metadata in specified format.
    """
    date = int(metadata[1])
    source = metadata[3]
    favs = int(metadata[4])
    retweets = int(metadata[5])
    return date, source, favs, retweets

def read_tweets(file):
    """ (filename) -> dict of {str: list of tweet tuples}
    Read tweets from file and categorize into dictionary.
    >>> read_tweets('very_short_data.txt')
    {'Donald Trump': [('Donald Trump', 'Join me live in Springfield, Ohio!\\nhttps://t (dot) co/LREA7WRmOx\\n', 1477604720, 'Twitter for iPhone', 5251, 1895)]}
    """
    result = {}
    with open(file) as data:
        tweets = data.read().split('<<<EOT')
        for i, tweet in enumerate(tweets):
            line = tweet.splitlines()
            content = ' '.join(line[2:])
            meta = line[1].split(',')
            if ':' in line[0]:
                author = line[0]
                metadata = extract_data(meta)
            else:
                metadata = extract_data(meta)
            candidate = author
            result[candidate] = [(candidate, content, metadata)]
    return result
This currently results in an error: "date = int(metadata[1]) IndexError: list index out of range". I am not sure why, or what to do next. Any help would be appreciated.
Thanks
I don't think it is a good idea to split by EOT, considering candidates with no tweets don't have an EOT marker. It is better to loop through the contents line by line instead of reading all the data at once; it makes it a lot easier.
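A rough sketch of that line-by-line idea (heavily hedged: it infers the block layout from your code, i.e. a candidate header line ending in ':', then for each tweet a comma-separated metadata line, the tweet text, and an <<<EOT terminator; it reuses your extract_data and field positions):

def read_tweets(file):
    result = {}
    candidate = None
    metadata = None
    content_lines = []
    with open(file) as data:
        for raw_line in data:
            line = raw_line.rstrip('\n')
            between_tweets = metadata is None and not content_lines
            if between_tweets and not line.strip():
                continue  # skip blank lines between blocks
            if between_tweets and line.endswith(':'):
                # New candidate header: start an empty tweet list for them.
                candidate = line[:-1]
                result[candidate] = []
            elif metadata is None:
                # First line of a tweet block is the comma-separated metadata.
                metadata = line.split(',')
            elif line.startswith('<<<EOT'):
                # End of one tweet: build the tuple and reset for the next one.
                date, source, favs, retweets = extract_data(metadata)
                text = '\n'.join(content_lines) + '\n'
                result[candidate].append((candidate, text, date, source, favs, retweets))
                metadata = None
                content_lines = []
            else:
                content_lines.append(line)
    return result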
Doing the same assignment, stuck on this function as well :(
I am writing a program to extract text from a website and write it into a text file. Each entry in the text file should have 3 values separated by a tab. The first value is hard-coded to XXXX, the 2nd value should initialize to the first item on the website with a p class of "style4", and the third value is the next item on the website with a p class of "style5". The logic I'm trying to introduce is: look for the first style4 tag and write the associated string into the text file, then find the next style5 tag and write its associated string into the text file. Then look for the next p class. If it's "style4", start a new line; if it's another "style5", write it into the text file with the first style5 entry but separated with a comma (alternatively, the program could just skip the next style5).
I'm stuck on that last part, that is, getting the program to look for the next p class and evaluate it against style4 and style5. Since I was having problems with finding and evaluating the p class tag, I chose to pull my code out of the loop and just try to accomplish the first iteration of the task for starters. Here's my code so far:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
next_vendor = soup.find('p', {'class': 'style4'})
print next_vendor
next_commodity = next_vendor.find_next('p', {'class': 'style5'})
print next_commodity
next = next_commodity.find_next('p')
print next
I'd appreciate any help anybody can provide! Thanks in advance!
I am not entirely sure what output you are expecting. I am assuming that you are trying to get the data from the webpage in the format:
Alphabet \t Vendor \t Category
You can do this:
# The basic things
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
Get the td of interest:
table = soup.find('table')
data = table.find_all('tr')[-1]
data = data.find_all('td')[1:]
Now, we will create a nested output dictionary with alphabets as the keys and an inner dict as the value. The inner dict has the vendor name as key and category information as its value.
output_dict = {}
current_alphabet = ""
current_vendor = ""
for td in data:
    for p in td.find_all('p'):
        print p.text.strip()
        if p.get('class')[0] == 'style6':
            current_alphabet = p.text.strip()
            vendors = {}
            output_dict[current_alphabet] = vendors
            continue
        if p.get('class')[0] == 'style4':
            print "Here"
            current_vendor = p.text.strip()
            category = []
            output_dict[current_alphabet][current_vendor] = category
            continue
        output_dict[current_alphabet][current_vendor].append(p.text.strip())
This gets the output_dict in the format:
{ ...
  u'W': { u'WTI - Weatherproofing Technologies': [u'Roofing'],
          u'Wenger Corporation': [u'Musical Instruments and Equipment'],
          u'Williams Scotsman, Inc': [u'Modular/Portable Buildings'],
          u'Witt Company': [u'Interactive Technology']
        },
  u'X': { u'Xerox': [u"Copiers & MFD's", u'Printers']
        }
}
Skipping the earlier parts for brevity. Now it is just a matter of accessing this dictionary and writing out to a tab separated file.
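For instance, a minimal sketch (my own addition, not part of the answer above) of writing output_dict out in that Alphabet \t Vendor \t Category layout:

with open("contracts.txt", "w") as out:
    for alphabet, vendors in output_dict.items():
        for vendor, categories in vendors.items():
            # One tab-separated line per vendor; categories joined with commas.
            out.write(alphabet + "\t" + vendor + "\t" + ", ".join(categories) + "\n")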
Hope this helps.
Agree with #shaktimaan. Using a dictionary or list is a good approach here. My attempt is slightly different.
import requests as rq
from bs4 import BeautifulSoup as bsoup
import csv

url = "http://www.kcda.org/KCDA_Awarded_Contracts.htm"
r = rq.get(url)
soup = bsoup(r.content)

primary_line = soup.find_all("p", {"class": ["style4", "style5"]})
final_list = {}
for line in primary_line:
    txt = line.get_text().strip().encode("utf-8")
    if txt != "\xc2\xa0":
        if line["class"][0] == "style4":
            key = txt
            final_list[key] = []
        else:
            final_list[key].append(txt)

with open("products.csv", "wb") as ofile:
    f = csv.writer(ofile)
    for item in final_list:
        f.writerow([item, ", ".join(final_list[item])])
For the scrape, we isolate the style4 and style5 tags right away. I did not bother going for the style6 tags or the alphabet headers. We then get the text inside each tag. If the text is not a whitespace character of sorts (these are all over the tables, probably obfuscation or bad mark-up), we check whether it's style4 or style5. If it's the former, we assign it as a key to a blank list. If it's the latter, we append it to the blank list of the most recent key. The key only changes when we hit a new style4, so it's a relatively safe approach.
The last part is easy: we just use ", ".join on the value part of the key-value pair to concatenate the list as one string. We then write it to a CSV file.
Due to the dictionary being unordered, the resulting CSV file will not be sorted alphabetically.
Changing it to a tab-delimited file is up to you. That's simple enough. Hope this helps!
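If you do want the tab-delimited layout from the question, with the hard-coded "XXXX" first column, a small variant (a sketch on my part) would be:

with open("products.txt", "wb") as ofile:
    f = csv.writer(ofile, delimiter="\t")
    for vendor, categories in final_list.items():
        # XXXX <tab> vendor <tab> comma-joined categories
        f.writerow(["XXXX", vendor, ", ".join(categories)])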