How do I parse json correctly? - python

from urllib.request import urlopen
import json

def downloadPage(url):
    webpage = urlopen(url).readlines()
    return webpage

json_string = downloadPage('https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=The%20Terminator')
str1 = ''.join(map(bytes.decode, json_string))
parsed_json = json.loads(str1)
print(parsed_json)
It seems like the JSON is not parsed properly, because when I do
print(parsed_json['extract'])
I get
Traceback (most recent call last):
File "D:/Universitet/PythonProjects/myapp.py", line 14, in <module>
print(parsed_json['extract'])
KeyError: 'extract'
How can I make it work so that it extracts the value I want with
print(parsed_json['extract'])

If this is your JSON object, then you have to traverse the whole object to get to the value you want to extract.
{
    "batchcomplete":"",
    "query":{
        "pages":{
            "30327":{
                "pageid":30327,
                "ns":0,
                "title":"The Terminator",
                "extract":"The Terminator is a 1984 American science fiction film directed by James Cameron. It stars Arnold Schwarzenegger as the Terminator, a cyborg assassin sent back in time from 2029 to 1984 to kill Sarah Connor (Linda Hamilton), whose son will one day become a savior against machines in a post-apocalyptic future. Michael Biehn plays Kyle Reese, a reverent soldier sent back in time to protect Sarah. The screenplay is credited to Cameron and producer Gale Anne Hurd, while co-writer William Wisher Jr. received a credit for additional dialogue. Executive producers John Daly and Derek Gibson of Hemdale Film Corporation were instrumental in financing and production.The Terminator topped the United States box office for two weeks and helped launch Cameron's film career and solidify Schwarzenegger's status as a leading man. Its success led to a franchise consisting of several sequels, a television series, comic books, novels and video games. In 2008, The Terminator was selected by the Library of Congress for preservation in the National Film Registry as \"culturally, historically, or aesthetically significant\"."
            }
        }
    }
}
Something like:
result = parsed_json["query"]["pages"]["30327"]["extract"]
But of course you should search for the property in a proper way, iterating over properties / arrays and testing if the keys exist.
EDIT
If you know the structure is always the same, but the ID differs, then you can try something like this to handle arbitrary IDs.
for key, value in parsed_json["query"]["pages"].items():
    result = value["extract"]
    print(result)

JSON files are nested, so to get a value you have to follow the chain of keys:
parsed_json['query']['pages']['30327']['extract']
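Putting both answers together, here is a sketch of the whole flow. The API response is inlined and abbreviated so it runs offline; against the live API you could pass urlopen(url) straight to json.load instead of decoding lines by hand:

```python
import json

# Abbreviated stand-in for the MediaWiki API response; with the live URL,
# `parsed_json = json.load(urlopen(url))` does the download + parse in one step.
response_text = """
{"batchcomplete": "",
 "query": {"pages": {"30327": {"pageid": 30327,
                               "title": "The Terminator",
                               "extract": "The Terminator is a 1984 film."}}}}
"""

parsed_json = json.loads(response_text)

# The page ID changes per title, so iterate over the pages dict
# instead of hard-coding "30327".
extracts = [page["extract"] for page in parsed_json["query"]["pages"].values()]
print(extracts[0])
```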

How to scrape specified div tag from HTML code using Pandas?

Hate asking this because I've seen a lot of similar issues, but nothing that really guides me to the light unfortunately.
I'm trying to scrape the game story from this link (ultimately I want to build in the ability to do multiple links, but hoping I can handle that)
It seems like I should be able to take the div story tag, right? I am struggling through how to do that - I've tried various codes I've found online & have tried to tweak, but nothing really applies.
I found this amazing source which really taught me a lot about how to do this.
However, I'm still struggling - this is my code thus far:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
# print(soup.prettify())
story = soup.find_all('div', {'id': 'story'})
print(story)
I'm really trying to learn this, not just copy and paste. In my mind, right now this code says (in English):
Import packages needed
Get URL HTML Text Data ---- (when I printed the complete code it worked fine)
Narrow down the HTML code to only include div tags labeled as "story" -- this obviously is the hiccup
Struggling to understand; going to keep playing with this, but figured I'd turn here for some advice. Any thoughts are greatly appreciated. Just getting a blank result right now.
The page is rendered by JavaScript, which requests cannot execute, so the info (which is pulled down by the original request) never reaches the rendered HTML; it stays in its raw state, inside a script tag.
This is one way to get that story with requests:
import requests
from bs4 import BeautifulSoup as bs
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
page_obj = soup.select_one('script#__NEXT_DATA__')
json_obj = json.loads(page_obj.text)
print('Title:', json_obj['props']['pageProps']['story']['header']['headline'])
print('Date:', json_obj['props']['pageProps']['story']['date'])
print('Content:', json_obj['props']['pageProps']['story']['content'])
Result printed in terminal:
Title: Durant, Nets rout 76ers in Simmons' return to Philadelphia
Date: 2022-03-11T00:30:27
Content: ['PHILADELPHIA (AP) The 76ers fans came to boo Ben Simmons. They left booing their own team.', "Kevin Durant scored 18 of his 25 points in Brooklyn's dominating first half in the Nets' 129-100 blowout victory over the 76ers on Thursday night in Simmons' much-hyped return to Philadelphia.", 'Seth Curry added 24 points, and Kyrie Irving had 22 for the Nets. They entered in eighth place in the East, but looked like a legitimate conference contender while badly outplaying the third-place 76ers.', 'Joel Embiid had 27 points and 12 rebounds for the 76ers, and James Harden finished with just 11 points. It was the first loss for Philadelphia in six games with Harden in the lineup.', "The game was dubbed as ''Boo Ben'' night, but the raucous fans instead turned their displeasure on the home team when the 76ers went to the locker room trailing 72-51 and again when Brooklyn built a stunning 32-point lead in the third quarter.", "''I think all of us look at Ben as our brother,'' Durant said. ''We knew this was a hostile environment. It's hard to chant at Ben Simmons when you're losing by that much.''", 'Simmons, wearing a designer hockey jersey and flashy jewelry, watched from the bench, likely taking delight in the vitriol deflected away from him. The three-time All-Star is continuing to recover from a back injury that has sidelined him since being swapped for Harden in a blockbuster deal at the trade deadline.', "''We definitely felt like Ben was on our heart,'' Irving said. ''If you come at Ben, you come at us.''", "While Simmons hasn't taken the floor yet, Harden had been a boon for the 76ers unlike his time in Brooklyn, where the so-called Big 3 of Harden, Durant and Irving managed to play just 16 games together following Harden's trade to Brooklyn last January that was billed as a potentially championship move. 
Harden exchanged fist bumps with Nets staff members just before tip before a shockingly poor performance from the 10-time All-Star and former MVP.", 'Harden missed 14 of 17 field-goal attempts.', "''We just didn't have the pop that we needed,'' Harden said.", 'The only shot Simmons took was a dunk during pregame warmups that drew derisive cheers from the Philly fans.', "The boos started early, as Simmons was met with catcalls while boarding the team bus to shootaround from the Nets' downtown hotel. Simmons did oblige one fan for an autograph, with another being heard on a video widely circulated on social media yelling, ''Why the grievance? Why spit in the face of Sixers fans? We did nothing but support you for five years, Ben. You know that.''", "The heckling continued when Simmons was at the arena. He entered the court 55 minutes prior to tip, wearing a sleeveless Nets warmup shirt and sweats and spent 20 minutes passing for Patty Mills' warmup shots. He didn't embrace any of his former teammates, though he did walk the length of the court to hug a 76ers official and then exchanged fist pumps with coach Doc Rivers at halftime.", "''Looked good to me, looked happy to be here,'' Nets coach Steve Nash said. ''I think he was happy to get it out of the way.''", "A large security presence closely watched the crowd and cell phones captured every Simmons move. By the end of the game, though, many 76ers fans had left and the remaining Nets fans were chanting: ''BEN SIM-MONS! BEN SIM-MONS!'' in a remarkable turnaround from the start of the evening.", 'WELCOME BACK', 'Former 76ers Curry and Andre Drummond, who also were part of the Simmons for Harden trade, were cheered during introductions, Curry made 10 of 14 shots, including 4 of 8 from 3-point range. 
Drummond had seven points and seven rebounds.', 'MOVE OVER, REGGIE', "Harden passed Reggie Miller for third on the NBA's 3-point list when he made his 2,561st trey with 6:47 left in the first quarter.", "TRAINER'S ROOM", 'Nets: LaMarcus Aldridge (hip) missed his second straight contest.', "76ers: Danny Green sat out after injuring his left middle finger in the first half of the 76ers' 121-106 win over the Bulls on Monday.", 'TIP-INS', 'Nets: Improved to 21-15 on the road, where Irving is only allowed to play due to his vaccination status. ... Durant also had 14 rebounds and seven assists.', "76ers: Paul Millsap returned after missing Monday's game against Chicago due to personal reasons but didn't play. . Former Sixers and Hall of Famers Allen Iverson and Julus ''Dr. J'' Erving were in attendance. Erving rang the ceremonial Liberty Bell before the contest.", 'UP NEXT', 'Nets: Host New York on Sunday.', '76ers: At Orlando on Sunday.', '---', 'More AP NBA: https://apnews.com/hub/NBA and https://twitter.com/AP-Sports']
Requests documentation: https://requests.readthedocs.io/en/latest/
Also, for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
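One caveat: chained lookups like json_obj['props']['pageProps']['story']['content'] raise KeyError the moment the site reshuffles its Next.js payload. A small helper (my own sketch, not part of requests or BeautifulSoup) lets the traversal fail soft:

```python
import json

# Toy payload shaped like the __NEXT_DATA__ example above; the real page's
# schema is not a stable API, so treat the key path as an assumption.
json_obj = json.loads(
    '{"props": {"pageProps": {"story": {"date": "2022-03-11T00:30:27"}}}}'
)

def dig(obj, *keys, default=None):
    """Follow a chain of dict keys, returning `default` if any link is missing."""
    for key in keys:
        if not isinstance(obj, dict):
            return default
        obj = obj.get(key, default)
    return obj

date = dig(json_obj, 'props', 'pageProps', 'story', 'date')
missing = dig(json_obj, 'props', 'pageProps', 'story', 'headline', default='(none)')
```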

Scraping Issue with BeautifulSoup only while using for loop

I am writing Python code using the BeautifulSoup library to pull the titles and authors of all the opinion pieces from a news website. While the for loop works as intended for the titles, the find call within it that is meant to pull each title's author repeatedly returns the author of the first piece as the output.
Any ideas where I am going wrong?
The Code
from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.nytimes.com/international/').text
soup = BeautifulSoup(source, 'lxml')
opinion = soup.find('div', class_='css-717c4s')
for story in opinion.find_all('article'):
    title = story.h2.text
    print(title)
    author = opinion.find('div', class_='css-1xdt15l')
    print(author.text)
The Output:
The Nazi-Fighting Women of the Jewish Resistance
Judy Batalion
My Great-Grandfather Knew How to Fix America’s Food System
Judy Batalion
Old Pol, New Tricks
Judy Batalion
Do We Have to Pay Businesses to Obey the Law?
Judy Batalion
I Don’t Want My Role Models Erased
Judy Batalion
Progressive Christians Arise! Hallelujah!
Judy Batalion
What the 2020s Need: Sex and Romance at the Movies
Judy Batalion
Biden Has Disappeared
Judy Batalion
What Republicans Could Learn From My Grandmother
Judy Batalion
Your Home’s Value Is Based on Racism
Judy Batalion
Once I Am Fully Vaccinated, What Is Safe for Me to Do?
Judy Batalion
You should do:
author = story.find('div', class_='css-1xdt15l')
What's wrong is that you are doing
author = opinion.find('div', class_='css-1xdt15l')
which fetches the first author in the whole opinion section; since you run this statement on every iteration of the loop, you get that same first author every time.
Replacing it with
author = story.find('div', class_='css-1xdt15l')
fetches the first author within each story, and since each story has a single author, it works fine.
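The scope difference is easy to see offline. Here is a self-contained toy page (made-up class names standing in for the auto-generated css-* ones, which change over time):

```python
from bs4 import BeautifulSoup

html = """
<div class="opinion">
  <article><h2>First story</h2><div class="byline">Author A</div></article>
  <article><h2>Second story</h2><div class="byline">Author B</div></article>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
opinion = soup.find('div', class_='opinion')

authors = []
for story in opinion.find_all('article'):
    # story.find searches only inside this <article>;
    # opinion.find would return "Author A" on every pass.
    authors.append(story.find('div', class_='byline').text)

print(authors)  # ['Author A', 'Author B']
```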
It's because you keep targeting the first matching tag in the whole section, hence the single author.
Here's a fix for your code using zip():
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.nytimes.com/international/').text
soup = BeautifulSoup(source, 'lxml').find('div', class_='css-717c4s')
authors = soup.find_all('div', class_='css-1xdt15l')
stories = soup.find_all('article')
for author, story in zip(authors, stories):
    print(author.text)
    print(story.h2.text)
Output:
Judy Batalion
The Nazi-Fighting Women of the Jewish Resistance
Gracy Olmstead
My Great-Grandfather Knew How to Fix America’s Food System
Maureen Dowd
Old Pol, New Tricks
Nikolas Bowie
Do We Have to Pay Businesses to Obey the Law?
Elizabeth Becker
I Don’t Want My Role Models Erased
Nicholas Kristof
Progressive Christians Arise! Hallelujah!
Ross Douthat
What the 2020s Need: Sex and Romance at the Movies
Frank Bruni
Biden Has Disappeared
Cecilia Gentili
What Republicans Could Learn From My Grandmother
Dorothy A. Brown
Your Home’s Value Is Based on Racism
Linsey Marr, Juliet Morrison and Caitlin Rivers
Once I Am Fully Vaccinated, What Is Safe for Me to Do?
A very small mistake.
You're supposed to search on each "article"/story object in the loop, not your initial "opinion" object, i.e.
author = story.find('div', class_='css-1xdt15l')
This produces the desired output:
The Nazi-Fighting Women of the Jewish Resistance
Judy Batalion
My Great-Grandfather Knew How to Fix America’s Food System
Gracy Olmstead
Old Pol, New Tricks
Maureen Dowd
Do We Have to Pay Businesses to Obey the Law?
Nikolas Bowie
I Don’t Want My Role Models Erased
Elizabeth Becker
Progressive Christians Arise! Hallelujah!
Nicholas Kristof
What the 2020s Need: Sex and Romance at the Movies
Ross Douthat
Biden Has Disappeared
Frank Bruni
What Republicans Could Learn From My Grandmother
Cecilia Gentili
Your Home’s Value Is Based on Racism
Dorothy A. Brown
Once I Am Fully Vaccinated, What Is Safe for Me to Do?
Linsey Marr, Juliet Morrison and Caitlin Rivers

how to read nested json file in pandas dataframe?

I learned how to load and read a JSON file into a pandas dataframe. However, I have multiple JSON files about news, and each file holds a rather complicated nested structure representing the news content and its metadata. I need to read them into a pandas dataframe for downstream analysis. I figured out how to load and read a JSON file in Python, but the solution I learned doesn't work for my files. Here is an example JSON data snippet: example json file, and here is what I tried:
import os, json
import pandas as pd
path_to_json = 'FakeNewsNetData/BuzzFeed/FakeNewsContent/'  # multiple json files
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

with open('json_files[0]') as f:
    data = pd.DataFrame(json.loads(line) for line in f)
but I didn't get the expected pandas dataframe. How can I read a JSON file with a nested structure into a pandas dataframe nicely? Could anyone take a look at the example JSON data snippet and suggest a possible way to make this work? Any thoughts? Thanks
source of json data:
I used JSON data from this github repository: FakeNewsNet Dataset, so you can browse how the original data looks and create a neat pandas dataframe from it. Any idea how to get this done easily? Thanks again
update 2:
I tried the following solution but it didn't work for me:
import json
import pandas as pd
with open('FakeNewsContent/BuzzFeed_Fake_1-Webpage.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(data)
ValueError: arrays must all be same length
import os
import glob
import json
import pandas as pd
from pandas.io.json import json_normalize  # in pandas >= 1.0, use pd.json_normalize

path_to_json = 'FakeNewsNetData/BuzzFeed/FakeNewsContent/'
json_paths = glob.glob(os.path.join(path_to_json, "*.json"))
df = pd.concat((json_normalize(json.load(open(p))) for p in json_paths), axis=0)
df = df.reset_index(drop=True) # Optionally reset index.
This will load all your json files into a single dataframe.
It will also flatten the nested json hierarchy by adding '.' between the keys.
You will probably need to perform further data cleaning, e.g. by replacing the NaNs with appropriate values. This can be done with the dataframe's fillna, or by applying a function to transform individual values.
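To see the flattening concretely, here is a toy record (made up for illustration, not the FakeNewsNet schema):

```python
import pandas as pd

# Nested dicts become dot-separated column names; lists are left as-is
# in a single cell.
record = {
    "title": "Some headline",
    "meta_data": {"og": {"site_name": "Example", "locale": "en_US"}},
    "authors": ["A. Writer"],
}

df = pd.json_normalize(record)
print(sorted(df.columns))
```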
Edit
As I mentioned in the comment, the data is actually messy, so entries such as "View All Posts" can appear among the values for "authors". See the JSON "BuzzFeed_Fake_26-Webpage.json" for an example.
To remove these entries and possibly others,
# This will be a set of entries you wish to remove.
# Here we only consider "View All Posts".
invalid_entries = {"View All Posts"}
import functools
def fix(x, invalid):
    if isinstance(x, list):
        return [i for i in x if i not in invalid]
    else:
        # You can optionally choose to return [] here to fix the NaNs
        # and to standardize the types of the values in this column
        return x

fix_author = functools.partial(fix, invalid=invalid_entries)
df["authors"] = df.authors.apply(fix_author)
You need to orient your dataframe. Try the code below to update your Update 2 approach:
x = {"top_img": "http://eaglerising.com/wp-content/uploads/2016/09/terrorism-2.jpg", "text": "On Saturday, September 17 at 8:30 pm EST, an explosion rocked West 23 Street in Manhattan, in the neighborhood commonly referred to as Chelsea, injuring 29 people, smashing windows and initiating street closures. There were no fatalities. Officials maintain that a homemade bomb, which had been placed in a dumpster, created the explosion. The explosive device was removed by the police at 2:25 am and was sent to a lab in Quantico, Virginia for analysis. A second device, which has been described as a \u201cpressure cooker\u201d device similar to the device used for the Boston Marathon bombing in 2013, was found on West 27th Street between the Avenues of the Americas and Seventh Avenue. By Sunday morning, all 29 people had been released from the hospital. The Chelsea incident came on the heels of an incident Saturday morning in Seaside Heights, New Jersey where a bomb exploded in a trash can along a route where thousands of runners were present to run a 5K Marine Corps charity race. There were no casualties. By Sunday afternoon, law enforcement had learned that the NY and NJ explosives were traced to the same person.\n\nGiven that we are now living in a world where acts of terrorism are increasingly more prevalent, when a bomb goes off, our first thought usually goes to the possibility of terrorism. After all, in the last year alone, we have had several significant incidents with a massive number of casualties and injuries in Paris, San Bernardino California, Orlando Florida and Nice, to name a few. And of course, last week we remembered the 15th anniversary of the September 11, 2001 attacks where close to 3,000 people were killed at the hands of terrorists. 
However, we also live in a world where political correctness is the order of the day and the fear of being labeled a racist supersedes our natural instincts towards self-preservation which, of course, includes identifying the evil-doers. Isn\u2019t that how crimes are solved? Law enforcement tries to identify and locate the perpetrators of the crime or the \u201cbad guys.\u201d Unfortunately, our leadership \u2013 who ostensibly wants to protect us \u2013 finds their hands and their tongues tied. They are not allowed to be specific about their potential hypotheses for fear of offending anyone.\n\nNew York City Mayor Bill de Blasio \u2013 who famously ended \u201cstop-and-frisk\u201d profiling in his city \u2013 was extremely cautious when making his first remarks following the Chelsea neighborhood explosion. \u201cThere is no specific and credible threat to New York City from any terror organization,\u201d de Blasio said late Saturday at the news conference. \u201cWe believe at this point in this time this was an intentional act. I want to assure all New Yorkers that the NYPD and \u2026 agencies are at full alert\u201d, he said. Isn\u2019t \u201can intentional act\u201d terrorism? We may not know whether it is from an international terrorist group such as ISIS, or a homegrown terrorist organization or a deranged individual or group of individuals. It is still terrorism. It is not an accident. James O\u2019Neill, the New York City Police Commissioner had already ruled out the possibility that the explosion was caused by a natural gas leak at the time the Mayor made his comments. New York\u2019s Governor Andrew Cuomo was a little more direct than de Blasio saying that there was no evidence of international terrorism and that no specific groups had claimed responsibility. However, he did say that it is a question of how the word \u201cterrorism\u201d is defined. 
\u201cA bomb exploding in New York is obviously an act of terrorism.\u201d Cuomo hit the nail on the head, but why did need to clarify and caveat before making his \u201cobvious\u201d assessment?\n\nThe two candidates for president Hillary Clinton and Donald Trump also weighed in on the Chelsea explosion. Clinton was very generic in her response saying that \u201cwe need to do everything we can to support our first responders \u2013 also to pray for the victims\u201d and that \u201cwe need to let this investigation unfold.\u201d Trump was more direct. \u201cI must tell you that just before I got off the plane a bomb went off in New York and nobody knows what\u2019s going on,\u201d he said. \u201cBut boy we are living in a time\u2014we better get very tough folks. We better get very, very tough. It\u2019s a terrible thing that\u2019s going on in our world, in our country and we are going to get tough and smart and vigilant.\u201d\n\nUnfortunately, an incident like the Chelsea explosion reminds us how vulnerable our country is particularly in venues defined as \u201csoft targets.\u201d Now more than ever, America needs strong leadership which is laser-focused on protecting her citizens from terrorist attacks of all genres and is not afraid of being politically incorrect.\n\nThe views expressed in this opinion article are solely those of their author and are not necessarily either shared or endorsed by EagleRising.com", "authors": ["View All Posts", "Leonora Cravotta"], "keywords": [], "meta_data": {"description": "\u201cWe believe at this point in this time this was an intentional act,\" de Blasio said. Isn\u2019t \u201can intentional act\u201d terrorism?", "og": {"site_name": "Eagle Rising", "description": "\u201cWe believe at this point in this time this was an intentional act,\" de Blasio said. 
Isn\u2019t \u201can intentional act\u201d terrorism?", "title": "Another Terrorist Attack in NYC...Why Are we STILL Being Politically Correct", "locale": "en_US", "image": "http://eaglerising.com/wp-content/uploads/2016/09/terrorism-2.jpg", "updated_time": "2016-09-22T10:49:05+00:00", "url": "http://eaglerising.com/36942/another-terrorist-attack-in-nyc-why-are-we-still-being-politically-correct/", "type": "article"}, "robots": "noimageindex", "fb": {"app_id": 256195528075351, "pages": 135665053303678}, "article": {"section": "Political Correctness", "tag": "terrorism", "published_time": "2016-09-22T07:10:30+00:00", "modified_time": "2016-09-22T10:49:05+00:00"}, "viewport": "initial-scale=1,maximum-scale=1,user-scalable=no", "googlebot": "noimageindex"}, "canonical_link": "http://eaglerising.com/36942/another-terrorist-attack-in-nyc-why-are-we-still-being-politically-correct/", "images": ["http://constitution.com/wp-content/uploads/2017/08/confederatemonument_poll_pop.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46772-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2016/03/eagle-rising-logo3-1.png", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46729-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46764-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46731-featured-300x130.jpg", "http://pixel.quantserve.com/pixel/p-52ePUfP6_NxQ_.gif", "http://0.gravatar.com/avatar/9b4601287436c60e1c7c5b65d725151f?s=112&d=mm&r=g", "http://b.scorecardresearch.com/p?c1=2&c2=22315475&cv=2.0&cj=1", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46784-featured-300x130.png", 
"http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2016/09/terrorism-2-800x300.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2016/09/coup-375x195.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2017/04/crtv_300x600_1.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46774-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2016/09/superstar-375x195.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46763-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46612-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46761-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46642-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46735-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46750-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46755-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46752-featured-300x130.png", "http://eaglerising.com/wp-content/uploads/2016/09/terrorism-2.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46743-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46712-featured-300x130.jpg", "http://0.gravatar.com/avatar/9b4601287436c60e1c7c5b65d725151f?s=100&d=mm&r=g", 
"http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46757-featured-300x130.png"], "title": "Another Terrorist Attack in NYC\u2026Why Are we STILL Being Politically Correct \u2013 Eagle Rising", "url": "http://eaglerising.com/36942/another-terrorist-attack-in-nyc-why-are-we-still-being-politically-correct/", "summary": "", "movies": [], "publish_date": {"$date": 1474528230000}, "source": "http://eaglerising.com"}
import pandas as pd

df = pd.DataFrame.from_dict(x, orient='index')
print(df)
Reading from JSON file:
import json
import pandas as pd
with open('FakeNewsContent/BuzzFeed_Fake_1-Webpage.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame.from_dict(data, orient='index')
print(df)
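To see why orient='index' helps, here is a toy dict (values of unequal length are exactly what triggers the earlier ValueError):

```python
import pandas as pd

data = {"authors": ["A", "B"], "keywords": ["news"]}

# pd.DataFrame(data) raises a ValueError here because the columns would have
# 2 and 1 entries; orient='index' turns each key into a row and pads with NaN.
df = pd.DataFrame.from_dict(data, orient='index')
print(df)
```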

Extracting parts of emails in text files

I am trying to do some text processing on a corpus of emails.
I have a main directory containing various folders. Each folder has many .txt files, and each .txt file is basically a set of email conversations.
To give an example of what my text files look like, I am copying a similar-looking text file of emails from the publicly available Enron email corpus. I have the same type of text data, with multiple emails in one text file.
An example text file can look like below:
Message-ID: <3490571.1075846143093.JavaMail.evans#thyme>
Date: Wed, 8 Sep 1999 08:50:00 -0700 (PDT)
From: steven.kean#enron.com
To: kelly.kimberly#enron.com
Subject: Re: India And The WTO Services Negotiation
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Steven J Kean
X-To: Kelly Kimberly
X-cc:
X-bcc:
X-Folder: \Steven_Kean_Dec2000_1\Notes Folders\All documents
X-Origin: KEAN-S
X-FileName: skean.nsf
fyi
---------------------- Forwarded by Steven J Kean/HOU/EES on 09/08/99 03:49
PM ---------------------------
Joe Hillings#ENRON
09/08/99 02:52 PM
To: Joe Hillings/Corp/Enron#Enron
cc: Sanjay Bhatnagar/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Terence H
Thorn/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Ashok
Mehta/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, John
Ambler/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Steven J Kean/HOU/EES#EES,
Jeffrey Sherrick/Corp/Enron#Enron
Subject: Re: India And The WTO Services Negotiation
Sanjay: Some information of possible interest to you. I attended a meeting
this afternoon of the Coalition of Service Industries, one of the lead groups
promoting a wide range of services including energy services in the upcoming
WTO GATTS 2000 negotiations. CSI President Bob Vastine was in Delhi last week
and met with CII to discuss the upcoming WTO. CII apparently has a committee
looking into the WTO. Bob says that he told them that energy services was
among the CSI recommendations and he recalls that CII said that they too have
an interest.
Since returning from the meeting I spoke with Kiran Pastricha and told her
the above. She actually arranged the meeting in Delhi. She asked that I send
her the packet of materials we distributed last week in Brussels and London.
One of her associates is leaving for India tomorrow and will take one of
these items to Delhi.
Joe
Joe Hillings
09/08/99 11:57 AM
To: Sanjay Bhatnagar/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT
cc: Terence H Thorn/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Ashok
Mehta/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, John
Ambler/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Steven J Kean/HOU/EES#EES,
Jeffrey Sherrick/Corp/Enron#Enron (bcc: Joe Hillings/Corp/Enron)
Subject: India And The WTO Services Negotiation
Sanjay: First some information and then a request for your advice and
involvment.
A group of US companies and associations formed the US WTO Energy Services
Coalition in late May and has asked the US Government to include "energy
services" on their proposed agenda when the first meeting of the WTO GATTS
2000 ministerial convenes late this year in Seattle. Ken Lay will be among
the CEO speakers. These negotiations are expected to last three years and
cover a range of subjects including agriculture, textiles, e-commerce,
investment, etc.
This morning I visited with Sudaker Rao at the Indian Embassy to tell him
about our coalition and to seek his advice on possible interest of the GOI.
After all, India is a leader in data processing matters and has other
companies including ONGC that must be interested in exporting energy
services. In fact probably Enron and other US companies may be engaging them
in India and possibly abroad.
Sudaker told me that the GOI has gone through various phases of opposing the
services round to saying only agriculture to now who knows what. He agrees
with the strategy of our US WTO Energy Services Coalition to work with
companies and associations in asking them to contact their government to ask
that energy services be on their list of agenda items. It would seem to me
that India has such an interest. Sudaker and I agree that you are a key
person to advise us and possibly to suggest to CII or others that they make
such a pitch to the GOI Minister of Commerce.
I will ask Lora to send you the packet of materials Chris Long and I
distributed in Brussels and London last week. I gave these materials to
Sudaker today.
Everyone tells us that we need some developing countries with an interest in
this issue. They may not know what we are doing and that they are likely to
have an opportunity if energy services are ultimately negotiated.
Please review and advise us how we should proceed. We do need to get
something done in October.
Joe
PS Terry Thorn is moderating a panel on energy services at the upcoming World
Services Congress in Atlanta. The Congress will cover many services issues. I
have noted in their materials that Mr. Alliwalia is among the speakers but
not on energy services. They expect people from all over the world to
participate.
So as you can see, there can be multiple emails in one text file, with no clear separation rule other than the start of a new set of email headers (To, From, etc.).
I can use os.walk on the main directory so that it goes through each sub-directory, parses each text file in that sub-directory, and then repeats for the other sub-directories, and so on.
I need to extract certain parts of each email within a text file and store them as a new row in a dataset (CSV, pandas DataFrame, etc.).
These are the parts that would be helpful to extract and store as the columns of a row in the dataset, so that each row of the dataset is one email from one text file:
Fields:
Original Email content | From (Sender) | To (Recipient) | cc (Recipient) | Date/Time Sent | Subject of Email
Edit: I looked at the duplicate question that was added. It assumes a fixed spec and boundary, which is not the case here. I am looking for a simple regular-expression way of extracting the different fields mentioned above.
^Date:\ (?P<date>.+?$)
.+?
^From:\ (?P<sender>.+?$)
.+?
^To:\ (?P<to>.+?$)
.+?
^cc:\ (?P<cc>.+?$)
.+?
^Subject:\ (?P<subject>.+?$)
Make sure you enable dotall, multiline, and extended (verbose) modes on your regex engine.
For the example you posted it works, at least: it captures everything into different named groups (you may need to enable named-group support on the regex engine as well, depending on which one it is).
Group `date` 63-99 `Wed, 8 Sep 1999 08:50:00 -0700 (PDT)`
Group `sender` 106-127 `steven.kean@enron.com`
Group `to` 132-156 `kelly.kimberly@enron.com`
Group `cc` 650-714 `Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H`
Group `subject` 930-974 `Re: India And The WTO Services Negotiation `
https://regex101.com/r/gHUOLi/1
Then use it to iterate over your stream of text. You mention Python, so here you go:
import re

def match_email(long_string):
    # verbose-mode pattern: unescaped whitespace (including these
    # newlines) is ignored under re.X
    regex = r'''^Date:\ (?P<date>.+?$)
                .+?
                ^From:\ (?P<sender>.+?$)
                .+?
                ^To:\ (?P<to>.+?$)
                .+?
                ^cc:\ (?P<cc>.+?$)
                .+?
                ^Subject:\ (?P<subject>.+?$)'''
    # try to match the next header block; MULTILINE makes ^/$ work
    # per line and DOTALL lets .+? span lines
    match = re.search(regex, long_string.strip(),
                      re.I | re.M | re.S | re.X)
    # if there is no match, we're done
    if match is None:
        return None, long_string
    # otherwise, grab the named groups
    email = match.groupdict()
    # drop whatever matched from the original string
    long_string = long_string.strip()[match.end():]
    # return the email and the remaining string
    return email, long_string

# now iterate over the long string
emails = []
email, tail = match_email(the_long_string)
while email is not None:
    emails.append(email)
    email, tail = match_email(tail)
print(emails)
That's taken directly from here, with just some names changed.

Trouble Converting Apostrophes in JSON response

I have a response from an API returned in json format. It looks as follows:
page = requests.get(link)
page_dict = json.loads(page.content)
print page_dict
>> {u'sm_api_title': u'The Biggest Mysteries of Missing Malaysian Flight MH370', u'sm_api_keyword_array': [u'flight', u'plane', u'pilot', u'crash', u'passenger'], u'sm_api_content': u' Since the plane&#39;s disappearance early Saturday, revelations about the passenger list and plane&#39;s flight plan have left officials scrambling to decipher new complicated clues. The most dangerous parts of a flight are traditionally the takeoff and landing, but the missing jetliner disappeared about two hours into a six-hour flight, when it should have been cruising safely around 35,000 feet. The last plane to crash at altitude was Air France Flight 447, which crashed during a thunderstorm in the Atlantic Ocean en route from Rio De Janeiro to Paris. A day after the flight disappeared the biggest question authorities are asking is did the plane turn around and why? The first officer on the flight was identified as Fariq Hamid, 27, and had about 2,800 flight hours since 2007.', u'sm_api_limitation': u'Waited 0 extra seconds due to API limited mode, 89 requests left to make for today.', u'sm_api_character_count': u'773'}
As you can see, the response comes back with HTML-escaped characters like `&#39;` included in it. What is the best way to clean this response?
I've used xmllib before and gotten it to work, but when I use it with django it gives me deprecation warnings.
Thank you for the help!
You need to unescape the strings in order to decode the HTML characters. You can unescape HTML strings using the standard library:
import HTMLParser
parser = HTMLParser.HTMLParser()
unescaped_string = parser.unescape(html_escaped_string)
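Note that `HTMLParser.unescape` is Python 2 (and was later removed); on Python 3 the standard-library equivalent is `html.unescape`:

```python
import html

# &#39; is the HTML character reference for an apostrophe
escaped = "Since the plane&#39;s disappearance"
print(html.unescape(escaped))  # Since the plane's disappearance
```

The same call also handles named references such as `&amp;` and `&quot;`.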
