My ultimate goal is to load metadata received from PubMed into a PySpark dataframe.
So far, I have managed to download the data I want from the PubMed database using a shell script.
The downloaded data is in ASN.1 format. Here is an example of a data entry:
Pubmed-entry ::= {
pmid 31782536,
medent {
em std {
year 2019,
month 11,
day 30,
hour 6,
minute 0
},
cit {
title {
name "Impact of CYP2C19 genotype and drug interactions on voriconazole
plasma concentrations: a spain pharmacogenetic-pharmacokinetic prospective
multicenter study."
},
authors {
names std {
{
name ml "Blanco Dorado S",
affil str "Pharmacy Department, University Clinical Hospital
Santiago de Compostela (CHUS). Santiago de Compostela, Spain.; Clinical
Pharmacology Group, University Clinical Hospital, Health Research Institute
of Santiago de Compostela (IDIS). Santiago de Compostela, Spain.; Department
of Pharmacology, Pharmacy and Pharmaceutical Technology, Faculty of Pharmacy,
University of Santiago de Compostela (USC). Santiago de Compostela, Spain."
},
{
name ml "Maronas O",
affil str "Genomic Medicine Group, Centro Nacional de Genotipado
(CEGEN-PRB3), CIBERER, CIMUS, University of Santiago de Compostela (USC),
Santiago de Compostela, Spain."
},
{
name ml "Latorre-Pellicer A",
affil str "Genomic Medicine Group, Centro Nacional de Genotipado
(CEGEN-PRB3), CIBERER, CIMUS, University of Santiago de Compostela (USC),
Santiago de Compostela, Spain."
},
{
name ml "Rodriguez Jato T",
affil str "Pharmacy Department, University Clinical Hospital
Santiago de Compostela (CHUS). Santiago de Compostela, Spain."
},
{
name ml "Lopez-Vizcaino A",
affil str "Pharmacy Department, University Hospital Lucus Augusti
(HULA). Lugo, Spain."
},
{
name ml "Gomez Marquez A",
affil str "Pharmacy Department, University Hospital Ourense
(CHUO). Ourense, Spain."
},
{
name ml "Bardan Garcia B",
affil str "Pharmacy Department, University Hospital Ferrol (CHUF).
A Coruna, Spain."
},
{
name ml "Belles Medall D",
affil str "Pharmacy Department, General University Hospital
Castellon (GVA). Castellon, Spain."
},
{
name ml "Barbeito Castineiras G",
affil str "Microbiology Department, University Clinical Hospital
Santiago de Compostela (CHUS). Santiago de Compostela, Spain."
},
{
name ml "Perez Del Molino Bernal ML",
affil str "Microbiology Department, University Clinical Hospital
Santiago de Compostela (CHUS). Santiago de Compostela, Spain."
},
{
name ml "Campos-Toimil M",
affil str "Department of Pharmacology, Pharmacy and Pharmaceutical
Technology, Faculty of Pharmacy, University of Santiago de Compostela (USC).
Santiago de Compostela, Spain."
},
{
name ml "Otero Espinar F",
affil str "Department of Pharmacology, Pharmacy and Pharmaceutical
Technology, Faculty of Pharmacy, University of Santiago de Compostela (USC).
Santiago de Compostela, Spain."
},
{
name ml "Blanco Hortas A",
affil str "Epidemiology Unit. Fundacion Instituto de Investigacion
Sanitaria de Santiago de Compostela (FIDIS), University Hospital Lucus
Augusti (HULA), Spain."
},
{
name ml "Duran Pineiro G",
affil str "Clinical Pharmacology Group, University Clinical
Hospital, Health Research Institute of Santiago de Compostela (IDIS).
Santiago de Compostela, Spain."
},
{
name ml "Zarra Ferro I",
affil str "Pharmacy Department, University Clinical Hospital
Santiago de Compostela (CHUS). Santiago de Compostela, Spain.; Clinical
Pharmacology Group, University Clinical Hospital, Health Research Institute
of Santiago de Compostela (IDIS). Santiago de Compostela, Spain."
},
{
name ml "Carracedo A",
affil str "Genomic Medicine Group, Centro Nacional de Genotipado
(CEGEN-PRB3), CIBERER, CIMUS, University of Santiago de Compostela (USC),
Santiago de Compostela, Spain.; Galician Foundation of Genomic Medicine,
Health Research Institute of Santiago de Compostela (IDIS), SERGAS, Santiago
de Compostela, Spain."
},
{
name ml "Lamas MJ",
affil str "Clinical Pharmacology Group, University Clinical
Hospital, Health Research Institute of Santiago de Compostela (IDIS).
Santiago de Compostela, Spain."
},
{
name ml "Fernandez-Ferreiro A",
affil str "Pharmacy Department, University Clinical Hospital
Santiago de Compostela (CHUS). Santiago de Compostela, Spain.; Clinical
Pharmacology Group, University Clinical Hospital, Health Research Institute
of Santiago de Compostela (IDIS). Santiago de Compostela, Spain.; Department
of Pharmacology, Pharmacy and Pharmaceutical Technology, Faculty of Pharmacy,
University of Santiago de Compostela (USC). Santiago de Compostela, Spain."
}
}
},
from journal {
title {
iso-jta "Pharmacotherapy",
ml-jta "Pharmacotherapy",
issn "1875-9114",
name "Pharmacotherapy"
},
imp {
date std {
year 2019,
month 11,
day 29
},
language "eng",
pubstatus aheadofprint,
history {
{
pubstatus other,
date std {
year 2019,
month 11,
day 30,
hour 6,
minute 0
}
},
{
pubstatus pubmed,
date std {
year 2019,
month 11,
day 30,
hour 6,
minute 0
}
},
{
pubstatus medline,
date std {
year 2019,
month 11,
day 30,
hour 6,
minute 0
}
}
}
}
},
ids {
pubmed 31782536,
doi "10.1002/phar.2351",
other {
db "ELocationID doi",
tag str "10.1002/phar.2351"
}
}
},
abstract "BACKGROUND: Voriconazole, a first-line agent for the treatment
of invasive fungal infections, is mainly metabolized by cytochrome P450 (CYP)
2C19. A significant portion of patients fail to achieve therapeutic
voriconazole trough concentrations, with a consequently increased risk of
therapeutic failure. OBJECTIVE: To show the association between
subtherapeutic voriconazole concentrations and factors affecting voriconazole
pharmacokinetics: CYP2C19 genotype and drug-drug interactions. METHODS:
Adults receiving voriconazole for antifungal treatment or prophylaxis were
included in a multicenter prospective study conducted in Spain. The
prevalence of subtherapeutic voriconazole troughs were analyzed in the rapid
metabolizer and ultra-rapid metabolizer patients (RMs and UMs, respectively),
and compared with the rest of the patients. The relationship between
voriconazole concentration, CYP2C19 phenotype, adverse events (AEs), and
drug-drug interactions was also assessed. RESULTS: In this study 78 patients
were included with a wide variability in voriconazole plasma levels with only
44.8% of patients attaining trough concentrations within the therapeutic
range of 1 and 5.5 microg/ml. The allele frequency of *17 variant was found
to be 29.5%. Compared with patients with other phenotypes, RMs and UMs had a
lower voriconazole plasma concentration (RM/UM: 1.85+/-0.24 microg/ml versus
other phenotypes: 2.36+/-0.26 microg/ml, ). Adverse events were more common
in patients with higher voriconazole concentrations (p<0.05). No association
between voriconazole trough concentration and other factors (age, weight,
route of administration, and concomitant administration of enzyme inducer,
enzyme inhibitor, glucocorticoids, or proton pump inhibitors) was found.
CONCLUSION: These results suggest the potential clinical utility of using
CYP2C19 genotype-guided voriconazole dosing to achieve concentrations in the
therapeutic range in the early course of therapy. Larger studies are needed
to confirm the impact of pharmacogenetics on voriconazole pharmacokinetics.",
pmid 31782536,
pub-type {
"Journal Article"
},
status publisher
}
}
This is where I am stuck. I do not know how to extract the information from the ASN.1 data and get it into a PySpark dataframe. Could anyone suggest a way of doing this?
The above data is definitely in an "ASN.1 format". The format is called ASN.1 value notation and is used to represent ASN.1 values textually. (This format pre-dates the standardization of the JSON Encoding Rules; today, one could use JSON for the same purpose, with some differences in the way the JSON would be processed compared to the ASN.1 value notation.)
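For illustration, the "em std" date fragment above could be written roughly like this in JSON (a hand-written sketch, not the output of any particular encoder):
{"em": {"std": {"year": 2019, "month": 11, "day": 30, "hour": 6, "minute": 0}}}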
The ASN.1 schema that YaFred posted above contains a few errors, as YaFred himself noted. The notation you posted also seems to contain a few errors. I have looked at the whole set of NCBI ASN.1 files and noticed that they contain several errors. Because of this, they cannot be handled by a standard-conforming ASN.1 tool (such as the ASN.1 playground) unless they are fixed. Some of those errors are easy to fix, but fixing others requires knowledge of the intent of the author of those files. This is probably because the NCBI project uses its own ASN.1 toolkit, which perhaps uses ASN.1 in a non-standard way.
I would imagine that in the NCBI toolkit there should be some means for you to decode the above value notation, so if I were you I would look into that toolkit. I am unable to give you a better suggestion because I don't know the NCBI toolkit.
Your problem may not be simple but it's worth experimenting.
Method 1:
As you have the specification, you can try looking for an ASN.1 tool (a.k.a. ASN.1 compiler) that will create a data model for you. In your case, because you downloaded a textual ASN.1 value, you need the tool to provide ASN.1 value decoders.
If the tool was generating Java code, it would go like this:
// decode a Pubmed-entry
// input is your data
Asn1ValueReader reader = new Asn1ValueReader(input);
PubmedEntry obj = PubmedEntry.readPdu(reader);
// access the data
obj.getPmid();
obj.getMedent();
A few caveats:
Tools that can do all of that will not be free (if you find one at all). The problem here is that you have a textual ASN.1 value, while tools generally provide binary decoders (BER, DER, etc.).
You have a lot of glue code to write to create the records that go into your PySpark dataframe (see the sketch below).
I wrote this some time ago, but it does not have the textual ASN.1 value decoders.
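To give an idea of that glue code, here is a minimal PySpark sketch, assuming you have already managed to decode each Pubmed-entry into a flat Python record (the field names below are examples, not anything prescribed by the schema):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("pubmed").getOrCreate()

# Hypothetical flat records produced by whatever decoder you end up using.
records = [
    Row(pmid=31782536,
        title="Impact of CYP2C19 genotype and drug interactions on voriconazole plasma concentrations",
        journal="Pharmacotherapy",
        year=2019),
]

df = spark.createDataFrame(records)
df.show(truncate=False)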
Method 2:
If your data are simple enough, and since they are textual, you can try to write your own parser (using a tool like ANTLR). This is not easy to evaluate if you are not familiar with parsers.
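For what it's worth, here is a very rough hand-rolled sketch (no ANTLR) that tokenizes the value notation and turns the nested braces into nested Python lists. It ignores many ASN.1 features (comments, doubled quotes inside strings, typed values, ...), the file name is made up, and it is only meant to show the general idea:

import re

# Crude tokenizer: quoted strings, braces, commas, and bare atoms.
TOKEN_RE = re.compile(r'"(?:[^"]|"")*"|[{},]|[^\s{},]+')

def tokenize(text):
    return TOKEN_RE.findall(text)

def parse_block(tokens, i):
    # tokens[i] must be '{'; returns (list of items, index after the block).
    assert tokens[i] == "{"
    i += 1
    items, current = [], []
    while tokens[i] != "}":
        if tokens[i] == "{":
            nested, i = parse_block(tokens, i)
            current.append(nested)
        elif tokens[i] == ",":
            items.append(current)
            current = []
            i += 1
        else:
            current.append(tokens[i])  # quoted strings keep their quotes
            i += 1
    if current:
        items.append(current)
    return items, i + 1

text = open("pubmed_entry.asn").read()              # made-up file name
tokens = tokenize(text)
entry, _ = parse_block(tokens, tokens.index("{"))   # skip the 'Pubmed-entry ::=' part
print(entry[0])                                     # e.g. ['pmid', '31782536']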
EDIT:
Unfortunately, the specification is not valid.
I have crawled some data into a JSON file, but I am unable to load it because of an "Extra data" error. What approach can I use to solve this?
My code is
import json

with open('liverpool.json', 'r') as read:
    data = json.loads(read.read())
    print(data['text'])
My JSON file is
{"created_at":"Thu May 09 03:48:38 +0000 2019","id":1126333127452794881,"id_str":"1126333127452794881","text":"RT #andihiyat: UCL\nLiverpool lawan tottenham\n\nUEL\nArsenal lawan chelsea\n\nMU jadi penonton bayaran aja udah daripada ngga ada kerjaan","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":89329977,"id_str":"89329977","name":"whatever it takes","screen_name":"ikhsanrnldy","location":null,"url":null,"description":null,"translator_type":"none","protected":false,"verified":false,"followers_count":370,"friends_count":331,"listed_count":0,"favourites_count":76,"statuses_count":16478,"created_at":"Thu Nov 12 00:43:34 +0000 2009","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"01010F","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"A8C7F7","profile_sidebar_fill_color":"C0DFEC","profile_text_color":"540AF5","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1121250950570582022\/x9GrAczT_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1121250950570582022\/x9GrAczT_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/89329977\/1556161880","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Wed May 08 21:27:39 +0000 2019","id":1126237248783917056,"id_str":"1126237248783917056","text":"UCL\nLiverpool lawan tottenham\n\nUEL\nArsenal lawan chelsea\n\nMU jadi penonton bayaran aja udah daripada ngga ada kerjaan","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":886934161,"id_str":"886934161","name":"andihiyat","screen_name":"andihiyat","location":null,"url":null,"description":"Malu bercanda sesat di jalan | http:\/\/instagram.com\/andihiyat | andihiyat#gmail.com","translator_type":"none","protected":false,"verified":false,"followers_count":606131,"friends_count":545,"listed_count":226,"favourites_count":1979,"statuses_count":31038,"created_at":"Wed Oct 17 14:26:34 +0000 
2012","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"id","contributors_enabled":false,"is_translator":false,"profile_background_color":"131516","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme14\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme14\/bg.gif","profile_background_tile":true,"profile_link_color":"009999","profile_sidebar_border_color":"EEEEEE","profile_sidebar_fill_color":"EFEFEF","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1114013770072780800\/vutmo5hd_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1114013770072780800\/vutmo5hd_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/886934161\/1547662374","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"quote_count":222,"reply_count":370,"retweet_count":1715,"favorite_count":4202,"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"in"},"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[{"screen_name":"andihiyat","name":"andihiyat","id":886934161,"id_str":"886934161","indices":[3,13]}],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"in","timestamp_ms":"1557373718499"}
{"created_at":"Thu May 09 04:02:53 +0000 2019","id":1126336711649366017,"id_str":"1126336711649366017","text":"RT #Omojuwa: A Barcelona x Ajax final became a Liverpool x Tottenham final. Whatever you are going through, you are not down and out. Pleas\u2026","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":3311255787,"id_str":"3311255787","name":"Man in caftan \ud83d\udc73\ud83c\udffd\u200d\u2642\ufe0f","screen_name":"muhd_burga","location":"North, NG","url":"http:\/\/instagram.com\/moomtaz_ng","description":"knowledge seeker \ud83d\udcda| Part time entrepreneur\ud83d\udcca| FC BAR\u00c7A\u26bd|\nBussiness account\ud83d\udc47\ud83c\udffb","translator_type":"none","protected":false,"verified":false,"followers_count":612,"friends_count":437,"listed_count":1,"favourites_count":6566,"statuses_count":6603,"created_at":"Sun Jun 07 00:22:41 +0000 2015","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1105581178629832704\/h0J3Rd22_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1105581178629832704\/h0J3Rd22_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/3311255787\/1555512617","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Wed May 08 21:06:25 +0000 2019","id":1126231906444558337,"id_str":"1126231906444558337","text":"A Barcelona x Ajax final became a Liverpool x Tottenham final. Whatever you are going through, you are not down and\u2026 https:\/\/t.co\/LIw44MJXEq","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":true,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":115628224,"id_str":"115628224","name":"JJ. 
Omojuwa","screen_name":"Omojuwa","location":"The skies | jj#omojuwa.com","url":"http:\/\/en.wikipedia.org\/wiki\/Japheth_J._Omojuwa","description":"Non nobis, sed Omnibus.","translator_type":"none","protected":false,"verified":true,"followers_count":654440,"friends_count":3331,"listed_count":1364,"favourites_count":16003,"statuses_count":530374,"created_at":"Fri Feb 19 10:01:16 +0000 2010","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"6D63FF","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme7\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme7\/bg.gif","profile_background_tile":true,"profile_link_color":"E81C4F","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"7AC3EE","profile_text_color":"3D1957","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1116486680448569344\/wNthfONq_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1116486680448569344\/wNthfONq_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/115628224\/1553848557","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"extended_tweet":{"full_text":"A Barcelona x Ajax final became a Liverpool x Tottenham final. Whatever you are going through, you are not down and out. Please, never give up! Never ever giving up after what I have seen the last 24 hours.","display_text_range":[0,206],"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[]}},"quote_count":56,"reply_count":70,"retweet_count":2239,"favorite_count":4287,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/LIw44MJXEq","expanded_url":"https:\/\/twitter.com\/i\/web\/status\/1126231906444558337","display_url":"twitter.com\/i\/web\/status\/1\u2026","indices":[117,140]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en"},"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[{"screen_name":"Omojuwa","name":"JJ. Omojuwa","id":115628224,"id_str":"115628224","indices":[3,11]}],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1557374573038"}
{"created_at":"Thu May 09 04:02:53 +0000 2019","id":1126336712198717440,"id_str":"1126336712198717440","text":"RT #LVPibai: El United eliminando al PSG en el 94\nLa Juve remontando al Atleti\nEl Ajax carg\u00e1ndose al campe\u00f3n de Europa y la Juve de Cristia\u2026","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2490066259,"id_str":"2490066259","name":"el penzil","screen_name":"p23nzil","location":"Valencia","url":"https:\/\/www.twitch.tv\/no23_","description":"Mi imaginaci\u00f3n no tiene l\u00edmites.","translator_type":"none","protected":false,"verified":false,"followers_count":88,"friends_count":410,"listed_count":1,"favourites_count":1654,"statuses_count":2618,"created_at":"Sun May 11 15:50:58 +0000 2014","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"1BE01B","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme14\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme14\/bg.gif","profile_background_tile":true,"profile_link_color":"19CF86","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1051623579043987457\/lDmHd_9b_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1051623579043987457\/lDmHd_9b_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2490066259\/1465492298","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Wed May 08 21:00:49 +0000 2019","id":1126230495954702337,"id_str":"1126230495954702337","text":"El United eliminando al PSG en el 94\nLa Juve remontando al Atleti\nEl Ajax carg\u00e1ndose al campe\u00f3n de Europa y la Juve\u2026 https:\/\/t.co\/WUoh4FtBTF","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":true,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2754746065,"id_str":"2754746065","name":"Ibai","screen_name":"LVPibai","location":"Barcelona, Espa\u00f1a","url":"http:\/\/twitch.tv\/ibaailvp","description":"Comentarista profesional trabajando para #LVPes. 
Tengo 23 a\u00f1os.","translator_type":"regular","protected":false,"verified":true,"followers_count":650201,"friends_count":545,"listed_count":897,"favourites_count":37987,"statuses_count":50886,"created_at":"Fri Aug 22 11:50:45 +0000 2014","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"000000","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1102578175983407104\/jtxwZRZd_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1102578175983407104\/jtxwZRZd_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2754746065\/1528424597","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"extended_tweet":{"full_text":"El United eliminando al PSG en el 94\nLa Juve remontando al Atleti\nEl Ajax carg\u00e1ndose al campe\u00f3n de Europa y la Juve de Cristiano \nEl City eliminado despu\u00e9s de celebrar un gol en el 93\nEl Liverpool remontando un 3-0 \nEl Ajax eliminado en el 95\n\nLa mejor Champions de la historia.","display_text_range":[0,278],"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[]}},"quote_count":405,"reply_count":278,"retweet_count":23636,"favorite_count":60937,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/WUoh4FtBTF","expanded_url":"https:\/\/twitter.com\/i\/web\/status\/1126230495954702337","display_url":"twitter.com\/i\/web\/status\/1\u2026","indices":[117,140]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"es"},"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[{"screen_name":"LVPibai","name":"Ibai","id":2754746065,"id_str":"2754746065","indices":[3,11]}],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"es","timestamp_ms":"1557374573169"}
The error showed:
raise JSONDecodeError("Extra data", s, end)
JSONDecodeError: Extra data
Your JSON file contains valid JSON on each line, but the file as a whole is not valid JSON. Read it line by line:
for line in open('liverpool.json'):
    # be aware of empty lines
    if not line.strip():
        continue
    data = json.loads(line)
    print(data['text'])
This is valid JSON:
[{"key1": "val1"}, {"key2": "val2"}]
And this is not:
{"key1": "val1"}
{"key2": "val2"}
You can do:
import json
with open('liverpool.json', 'r') as read:
    data = json.loads(read.readline())
    print(data['text'])
Notice I changed the read.read() to read.readline(). This is because your JSON file has multiple documents in it, and readline() reads only the first line, i.e. the first document.
The problem here is that your file contains multiple top-level JSON documents, while json.loads() parses exactly one JSON document from a string. One workaround is to join the lines into a single JSON array before parsing:
json.loads("[" + ",".join(line for line in open('liverpool.json') if line.strip()) + "]")
I have to directly convert an xls file to a JSON document using Python 3 and xlrd.
Table is here.
It is divided into three main categories (PUBLICATION, CONTENU, CONCLUSION) whose names are in column one (the first column is zero), and the number of rows per category can vary. Each row has three key values (INDICATEURS, EVALUATION, PROPOSITION) in columns 3, 5 and 7. There can be empty lines or missing values.
I have to convert that table to the following JSON data, which I wrote by hand as a reference. It is valid:
{
"EVALUATION": {
"PUBLICATION": [
{
"INDICATEUR": "Page de garde",
"EVALUATION": "Inexistante ou non conforme",
"PROPOSITION D'AMELIORATION": "Consulter l'example sur CANVAS"
},
{
"INDICATEUR": "Page de garde",
"EVALUATION": "Titre du TFE non conforme",
"PROPOSITION D'AMELIORATION": "Utilisez le titre avalisé par le conseil des études"
},
{
"INDICATEUR": "Orthographe et grammaire",
"EVALUATION": "Nombreuses fautes",
"PROPOSITION D'AMELIORATION": "Faire relire le document"
},
{
"INDICATEUR": "Nombre de page",
"EVALUATION": "Nombre de pages grandement différent à la norme",
"PROPOSITION D'AMELIORATION": ""
}
],
"CONTENU": [
{
"INDICATEUR": "Développement du sujet",
"EVALUATION": "Présentation de l'entreprise",
"PROPOSITION D'AMELIORATION": ""
},
{
"INDICATEUR": "Développement du sujet",
"EVALUATION": "Plan de localisation inutile",
"PROPOSITION D'AMELIORATION": "Supprimer le plan de localisation"
},
{
"INDICATEUR": "Figures et capture d'écran",
"EVALUATION": "Captures d'écran excessives",
"PROPOSITION D'AMELIORATION": "Pour chaque figure et capture d'écran se poser la question 'Qu'est-ce que cela apporte à mon sujet ?'"
},
{
"INDICATEUR": "Figures et capture d'écran",
"EVALUATION": "Captures d'écran Inutiles",
"PROPOSITION D'AMELIORATION": "Pour chaque figure et capture d'écran se poser la question 'Qu'est-ce que cela apporte à mon sujet ?'"
},
{
"INDICATEUR": "Figures et capture d'écran",
"EVALUATION": "Captures d'écran illisibles",
"PROPOSITION D'AMELIORATION": "Pour chaque figure et capture d'écran se poser la question 'Qu'est-ce que cela apporte à mon sujet ?'"
},
{
"INDICATEUR": "Conclusion",
"EVALUATION": "Conclusion inexistante",
"PROPOSITION D'AMELIORATION": ""
},
{
"INDICATEUR": "Bibliographie",
"EVALUATION": "Inexistante",
"PROPOSITION D'AMELIORATION": ""
},
{
"INDICATEUR": "Bibliographie",
"EVALUATION": "Non normalisée",
"PROPOSITION D'AMELIORATION": "Ecrire la bibliographie selon la norme APA"
}
],
"CONCLUSION": [
{
"INDICATEUR": "",
"EVALUATION": "Grave manquement sur le plan de la présentation",
"PROPOSITION D'AMELIORATION": "Lire le document 'Conseil de publication' disponible sur CANVAS"
},
{
"INDICATEUR": "",
"EVALUATION": "Risque de refus du document par le conseil des études",
"PROPOSITION D'AMELIORATION": ""
}
]
}
}
My intention is to loop through the lines, check rows[1] to identify the category, and sub-loop to add the data as dictionaries in a list for each category.
Here is my code so far :
import xlrd

file = '/home/eh/Documents/Base de Programmation/Feedback/EvaluationEI.xls'

wb = xlrd.open_workbook(file)
sheet = wb.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

def readRows():
    for rownum in range(2, sheet.nrows):
        rows = sheet.row_values(rownum)
        indicateur = rows[3]
        evaluation = rows[5]
        amelioration = rows[7]
        publication = []
        contenu = []
        conclusion = []
        if rows[1] == "PUBLICATION":
            if rows[3] == '' and rows[5] == '' and rows[7] == '':
                continue
            else:
                publication.append("INDICATEUR : " + indicateur, "EVALUATION : " + evaluation, "PROPOSITION D'AMELIORATION : " + amelioration)
        if rows[1] == "CONTENU":
            if rows[3] == '' and rows[5] == '' and rows[7] == '':
                continue
            else:
                contenu.append("INDICATEUR : " + indicateur, "EVALUATION : " + evaluation, "PROPOSITION D'AMELIORATION : " + amelioration)
        if rows[1] == "CONCLUSION":
            if rows[3] == '' and rows[5] == '' and rows[7] == '':
                continue
            else:
                conclusion.append("INDICATEUR : " + indicateur, "EVALUATION : " + evaluation, "PROPOSITION D'AMELIORATION : " + amelioration)
    print(publication)
    print(contenu)
    print(conclusion)

readRows()
I am having a hard time figuring out how to sub-loop over the right number of rows to separate the data by category.
Any help would be welcome.
Thank you in advance.
Using the json package and OrderedDict (to preserve key order), I think this gets to what you're expecting. I've modified your approach slightly so that we're not building a string literal, but rather a dict containing the data, which we can then convert with json.dumps.
As Ron noted above, your previous attempt was skipping the lines where rows[1] was not equal to one of your three key values.
This should read every line, appending to the last non-empty key:
import json
from collections import OrderedDict

import xlrd

def readRows(file, s_index=0):
    """
    file: path to xls file
    s_index: sheet_index for the xls file
    returns a dict of OrderedDict of list of OrderedDict which can be parsed to JSON
    """
    d = {"EVALUATION": OrderedDict()}  # this will be the main dict for our JSON object
    wb = xlrd.open_workbook(file)
    sheet = wb.sheet_by_index(s_index)
    # getting the data from the worksheet
    data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
    # fill the dict with data:
    for row in data[3:]:
        if row[1]:  # if there's a value, then this is a new categorie element
            categorie = row[1]
            d["EVALUATION"][categorie] = []
        if categorie:
            i, e, a = row[3::2][:3]
            if i or e or a:  # as long as there's any data in this row, we write the child element
                val = OrderedDict([("INDICATEUR", i), ("EVALUATION", e), ("PROPOSITION D'AMELIORATION", a)])
                d["EVALUATION"][categorie].append(val)
    return d
This returns a dict which can easily be converted to JSON. Screenshot of some output:
Write to file if needed:
import io  # for Python 2

d = readRows(file, 0)
with io.open(r'c:\debug\output.json', 'w', encoding='utf8') as out:
    out.write(json.dumps(d, indent=2, ensure_ascii=False))
Note: in Python 3, I don't think you need io.open.
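For example, in Python 3 a plain open() with an encoding argument should be enough:

with open('output.json', 'w', encoding='utf8') as out:
    out.write(json.dumps(d, indent=2, ensure_ascii=False))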
Is pandas not an option? Would add as a comment but don't have the rep.
From Documentation
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
import pandas

df = pandas.read_excel('path_to_file.xls')
df.to_json(path_or_buf='output_path.json', orient='table')
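Note that orient='table' gives you flat rows rather than the nested structure in the question, but you can get close by grouping the DataFrame yourself. A rough sketch, assuming the category sits in column 1 and the values in columns 3, 5 and 7 as described (the two skipped header rows are a guess and may need adjusting):

import json
import pandas as pd

df = pd.read_excel('path_to_file.xls', header=None, skiprows=2)
df[1] = df[1].replace('', pd.NA).ffill()   # carry each category name down to its rows

result = {"EVALUATION": {}}
for cat, group in df.groupby(df[1], sort=False):
    rows = []
    for _, r in group.iterrows():
        i, e, a = (r.get(c, '') for c in (3, 5, 7))
        if any(pd.notna(v) and str(v).strip() for v in (i, e, a)):
            rows.append({
                "INDICATEUR": '' if pd.isna(i) else i,
                "EVALUATION": '' if pd.isna(e) else e,
                "PROPOSITION D'AMELIORATION": '' if pd.isna(a) else a,
            })
    result["EVALUATION"][cat] = rows

print(json.dumps(result, ensure_ascii=False, indent=2))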
I've looked over this many times and can't seem to find the problem with it.
I am trying to pull 3 fields from a JSON response (engagements, shares, comments), sum them together, and then print the sum.
It seems to be returning the fields correctly, but it prints zero at the end.
I'm very new to this, but I would appreciate any help anyone can give me. I'm guessing there is something fundamental I am missing here.
import urllib2, time, csv, json, requests, urlparse, pdb

SEARCH_URL = urllib2.unquote("http://soyuz.elastic.tubularlabs.net:9200/intelligence/video/_search?q=channel_youtube_id:""%s""%20AND%20published:%3E20150715T000000Z%20AND%20published:%3C20150716T000000Z")

reader = csv.reader(open('input.csv', 'r+U'), delimiter=',', quoting=csv.QUOTE_NONE)

cookie = {"user": "2|1:0|10:1438908462|4:user|36:eyJhaWQiOiA1Njk3LCAiaWQiOiA2MzQ0fQ==|b5c4b3adbd96e54833bf8656625aedaf715d4905f39373b860c4b4bc98655e9e"}

# idsToProcess = []
# for row in reader:
#     if len(row)>0:
#         idsToProcess.append(row[0])

idsToProcess = ['qdGW_m8Rim4FeMM29keDEg']

for userID in idsToProcess:
    # print "fetching for %s.." % fbid
    url = SEARCH_URL % userID
    soyuzResponse = None
    response = requests.request("GET", url, cookies=cookie)
    ret = response.json()
    soyuzResponse = ret['hits']['hits'][0]['_source']['youtube']
    print soyuzResponse
    totalDelta = 0
    totalEngagementsVal = 0
    totalSharesVal = 0
    totalCommentsVal = 0
    valuesArr = []
    for entry in valuesArr:
        arrEngagements = entry['engagements']
        arrShares = entry['shares']
        arrComments = entry['comments']
        if len(arrEngagements) > 0:
            totalEngagementsVal = arrEngagements
        elif len(arrShares) > 0:
            totalSharesVal = arrShares
        elif len(arrComments) > 0:
            totalCommentsVal = arrComments
    print "%s,%s" % (userID, totalEngagementsVal+totalSharesVal+totalCommentsVal)
    totalDelta += totalEngagementsVal+totalSharesVal+totalCommentsVal
    time.sleep(0)

print "%s,%s" % (userID, totalDelta)
exit()
Here is the JSON I am parsing:
took: 371,
timed_out: false,
_shards: {
total: 128,
successful: 128,
failed: 0
},
hits: {
total: 1,
max_score: 9.335125,
hits: [
{
_index: "intelligence_v2",
_type: "video",
_id: "jW7mjVdzR_U",
_score: 9.335125,
_source: {
claim: [
"Blucollection%2Buser"
],
topics: [
{
title_non: "Toy",
topic_id: "/m/0138tl",
title: "Toy"
},
{
title_non: "Sofia the First",
topic_id: "/m/0ncq483",
title: "Sofia the First"
}
],
likes: 1045,
duration: 318,
channel_owner_type: "influencer",
category: "Entertainment",
imported: "20150809T230652Z",
title: "Princess Sofia Cash Register Toy Surprise - Play Doh Caja Registradora Disney Sofia the First",
audience_location: [
{
country: "US",
value: 100
}
],
comments: 10,
twitter: {
tweets: 6,
engagements: 6
},
description: "Disney Princess "Sofia Cash Register" toy unboxing review by DisneyCollector. This is the authentic Royal toy of Sofia fit for a little Princess of Enchantia. Young Girls learn early on how mathematics is important in our lives, and learn to do math, developing creativity with a super fun game! Thx 4 watching this "Disney Princess Sofia cash register" unboxing review. In this video i also used Disney Frozen Princess Anna, Nickelodeon Peppa Pig blind bag and plastilina Play-Doh. Revisión del juguete Princesita Sofía Caja Registradora Real para niños y niñas. Las niñas aprenden desde muy temprano cómo las matemáticas es importante en nuestras vidas, y aprenden a hacer matemáticas, el desarrollo de la creatividad con un juego súper divertido! Here's how to say Princess in other languages: printzesa, 公主, prinses, prenses, printsess, princesse, Prinzessin, puteri, banphrionsa, Principesse, principessa, プリンセス, princese, puteri, prinsessa,prinsesse, princesa, công chúa, tywysoges, Princesses Disney, Prinzessinen, 공주, Princesas Disney, Disney πριγκίπισσες, Дисней принцесс, 디즈니 공주, ディズニーのお姫様, Vorstin, koningsdochter, Fürstin, πριγκίπισσα, księżniczka, królewna, принцесса. Here's how register is called in other languages: Caja Registradora de Princesa Sofía, Caisse Enregistreuse Princesse Sofia, Kassa, Registrierkasse Sofia die Erste Auf einmal Prinzessin, Registratore di Cassa di La Principessa Sofia, Caixa Registadora da Princesa Sofia, ηλεκτρονική ταμειακή μηχανή Σοφία η Πριγκίπισσα, 電子式金銭登録機 ちいさなプリンセス ソフィア, София Прекрасная кассовый аппарат, 디즈니주니어 리틀 프린세스 소피아 전자 금전 등록기, máy tính tiền điện tử, daftar uang elektronik, elektronik yazarkasa, Sofia den första kassaapparat leksak, Jej Wysokość Zosia kasa zabawki, Sofia het prinsesje kassa speelgoed, София Първа касов апарат играчка, casa de marcat jucărie Sofia Întâi. Princess Sofia SLEEPOVER Slumber Party - Princesita Sofía Pijamada Real. https://www.youtube.com/watch?v=WSa-Tp7HfyQ Princesita Sofía Castillo Mágico Parlante juguete de niñas. https://www.youtube.com/watch?v=ALQm_3uhIyg Sofia the First Magical Talking Castle Royal Prep Academy. https://www.youtube.com/watch?v=gcUiY0Suzrc Play-Doh Meal Makin' Kitchen with Princess Sofia the First. https://www.youtube.com/watch?v=x_-OxnRXj6g Sofia the First Royal Prep Academy Dolls Character Collection. https://www.youtube.com/watch?v=_kNY6AkSp9g Peppa Pig Picnic Adventure Car With Princess Sofia the First. https://www.youtube.com/watch?v=KIPH3txlq1o Watch "Sofia the First Talking Magic Castle" talking Clover: https://www.youtube.com/watch?v=ALQm_3uhIyg Play-Doh Sofia the First Magic Talking Castle w/ Peppa Pig: https://www.youtube.com/watch?v=-slXqMiDrY0 Play-Doh Sofia the First Going to School Portable Classroom http://www.youtube.com/watch?v=0R-dkVAIUlA",
views: 941726,
channel_network: null,
channel_subscribers: 5054024,
youtube_id: "jW7mjVdzR_U",
facebook: {
engagements: 9,
likes: 2,
shares: 7,
comments: 0
},
location_demo_count: 1,
is_public: true,
engagements: 1070,
channel_country: "US",
demo_count: null,
monetizable: true,
youtube: {
engagements: 1055,
likes: 1045,
comments: 10
},
published: "20150715T100003Z",
channel_youtube_id: "qdGW_m8Rim4FeMM29keDEg"
}
}
]
}
}
Response from terminal after running script:
{u'engagements': 1055, u'likes': 1045, u'comments': 10}
qdGW_m8Rim4FeMM29keDEg,0
qdGW_m8Rim4FeMM29keDEg,0
Your problem is these two lines:
valuesArr = []
for entry in valuesArr:
Because valuesArr is empty, the for loop never iterates, and that's where your totals are being summed.
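A minimal sketch of what the summing step could look like instead, assuming the totals you want are the ones in the youtube object you already extracted into soyuzResponse (written in the same Python 2 style as the question):

totalEngagementsVal = soyuzResponse.get('engagements', 0)
totalSharesVal = soyuzResponse.get('shares', 0)    # this response has no 'shares', so it stays 0
totalCommentsVal = soyuzResponse.get('comments', 0)

totalDelta += totalEngagementsVal + totalSharesVal + totalCommentsVal
print "%s,%s" % (userID, totalDelta)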