how to read nested json file in pandas dataframe?

I have learned how to load a JSON file into a pandas DataFrame. However, I have multiple JSON files about news, and each file holds a rather complicated nested structure that represents the news content and its metadata. I need to read them into a pandas DataFrame for downstream analysis, but the approach I learned does not work for these files. Here is an example JSON data snippet: example json file. Here is what I tried:
import os, json
import pandas as pd
path_to_json = 'FakeNewsNetData/BuzzFeed/FakeNewsContent/'  # multiple json files
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
# Open the first file by its full path (not the literal string 'json_files[0]')
with open(os.path.join(path_to_json, json_files[0])) as f:
    data = pd.DataFrame(json.loads(line) for line in f)
But I did not get the expected pandas DataFrame. How can I read a JSON file with a nested structure into a pandas DataFrame cleanly? Could anyone take a look at the example JSON snippet and suggest a way to make this work? Thanks.
Source of the JSON data:
I used JSON data from this GitHub repository: FakeNewsNet Dataset, so you can browse what the original data looks like and build a neat pandas DataFrame from it. Any idea how to get this done easily? Thanks again.
Update 2:
I tried the following solution, but it didn't work for me:
import json
import pandas as pd
with open('FakeNewsContent/BuzzFeed_Fake_1-Webpage.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(data)
ValueError: arrays must all be same length

import os
import glob
import json
import pandas as pd
path_to_json = 'FakeNewsNetData/BuzzFeed/FakeNewsContent/'
json_paths = glob.glob(os.path.join(path_to_json, "*.json"))
# pd.json_normalize is the public name since pandas 1.0; older versions
# expose it as pandas.io.json.json_normalize.
df = pd.concat((pd.json_normalize(json.load(open(p))) for p in json_paths), axis=0)
df = df.reset_index(drop=True)  # Optionally reset index.
This will load all your JSON files into a single DataFrame.
It will also flatten the nested JSON hierarchy by joining the keys with '.'.
You will probably need to perform further data cleaning, e.g. replacing the NaNs with appropriate values. This can be done with the DataFrame's fillna method, or by applying a function to transform individual values.
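To make the flattening concrete, here is a minimal hand-written sketch of how dot-joined column names arise from nested keys. It only illustrates the idea and is not pandas' actual implementation; the `flatten` helper and the small `sample` record are made up for this example.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts and merge their flattened keys
            flat.update(flatten(value, full_key, sep))
        else:
            # Lists and scalars are kept as single values
            flat[full_key] = value
    return flat


# Hypothetical record, loosely shaped like one news JSON file
sample = {
    "title": "Some headline",
    "authors": ["View All Posts", "Leonora Cravotta"],
    "meta_data": {"og": {"site_name": "Eagle Rising"}, "robots": "noimageindex"},
}
flat_sample = flatten(sample)
print(sorted(flat_sample))
```

Note that list values such as `authors` stay as lists in a single column, which is why the cleaning step on that column is still needed.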
Edit
As I mentioned in the comment, the data is actually messy, so words such as "View All Posts" can be one of the values for "authors". See the JSON "BuzzFeed_Fake_26-Webpage.json" for an example.
To remove these entries and possibly others:
# This will be a set of entries you wish to remove.
# Here we only consider "View All Posts".
invalid_entries = {"View All Posts"}
import functools
def fix(x, invalid):
    if isinstance(x, list):
        return [i for i in x if i not in invalid]
    else:
        # You can optionally choose to return [] here to fix the NaNs
        # and to standardize the types of the values in this column
        return x
fix_author = functools.partial(fix, invalid=invalid_entries)
df["authors"] = df.authors.apply(fix_author)

You need to orient your DataFrame by index. Try the code below as a fix for your Update 2 approach:
x = {"top_img": "http://eaglerising.com/wp-content/uploads/2016/09/terrorism-2.jpg", "text": "On Saturday, September 17 at 8:30 pm EST, an explosion rocked West 23 Street in Manhattan, in the neighborhood commonly referred to as Chelsea, injuring 29 people, smashing windows and initiating street closures. There were no fatalities. Officials maintain that a homemade bomb, which had been placed in a dumpster, created the explosion. The explosive device was removed by the police at 2:25 am and was sent to a lab in Quantico, Virginia for analysis. A second device, which has been described as a \u201cpressure cooker\u201d device similar to the device used for the Boston Marathon bombing in 2013, was found on West 27th Street between the Avenues of the Americas and Seventh Avenue. By Sunday morning, all 29 people had been released from the hospital. The Chelsea incident came on the heels of an incident Saturday morning in Seaside Heights, New Jersey where a bomb exploded in a trash can along a route where thousands of runners were present to run a 5K Marine Corps charity race. There were no casualties. By Sunday afternoon, law enforcement had learned that the NY and NJ explosives were traced to the same person.\n\nGiven that we are now living in a world where acts of terrorism are increasingly more prevalent, when a bomb goes off, our first thought usually goes to the possibility of terrorism. After all, in the last year alone, we have had several significant incidents with a massive number of casualties and injuries in Paris, San Bernardino California, Orlando Florida and Nice, to name a few. And of course, last week we remembered the 15th anniversary of the September 11, 2001 attacks where close to 3,000 people were killed at the hands of terrorists. 
However, we also live in a world where political correctness is the order of the day and the fear of being labeled a racist supersedes our natural instincts towards self-preservation which, of course, includes identifying the evil-doers. Isn\u2019t that how crimes are solved? Law enforcement tries to identify and locate the perpetrators of the crime or the \u201cbad guys.\u201d Unfortunately, our leadership \u2013 who ostensibly wants to protect us \u2013 finds their hands and their tongues tied. They are not allowed to be specific about their potential hypotheses for fear of offending anyone.\n\nNew York City Mayor Bill de Blasio \u2013 who famously ended \u201cstop-and-frisk\u201d profiling in his city \u2013 was extremely cautious when making his first remarks following the Chelsea neighborhood explosion. \u201cThere is no specific and credible threat to New York City from any terror organization,\u201d de Blasio said late Saturday at the news conference. \u201cWe believe at this point in this time this was an intentional act. I want to assure all New Yorkers that the NYPD and \u2026 agencies are at full alert\u201d, he said. Isn\u2019t \u201can intentional act\u201d terrorism? We may not know whether it is from an international terrorist group such as ISIS, or a homegrown terrorist organization or a deranged individual or group of individuals. It is still terrorism. It is not an accident. James O\u2019Neill, the New York City Police Commissioner had already ruled out the possibility that the explosion was caused by a natural gas leak at the time the Mayor made his comments. New York\u2019s Governor Andrew Cuomo was a little more direct than de Blasio saying that there was no evidence of international terrorism and that no specific groups had claimed responsibility. However, he did say that it is a question of how the word \u201cterrorism\u201d is defined. 
\u201cA bomb exploding in New York is obviously an act of terrorism.\u201d Cuomo hit the nail on the head, but why did need to clarify and caveat before making his \u201cobvious\u201d assessment?\n\nThe two candidates for president Hillary Clinton and Donald Trump also weighed in on the Chelsea explosion. Clinton was very generic in her response saying that \u201cwe need to do everything we can to support our first responders \u2013 also to pray for the victims\u201d and that \u201cwe need to let this investigation unfold.\u201d Trump was more direct. \u201cI must tell you that just before I got off the plane a bomb went off in New York and nobody knows what\u2019s going on,\u201d he said. \u201cBut boy we are living in a time\u2014we better get very tough folks. We better get very, very tough. It\u2019s a terrible thing that\u2019s going on in our world, in our country and we are going to get tough and smart and vigilant.\u201d\n\nUnfortunately, an incident like the Chelsea explosion reminds us how vulnerable our country is particularly in venues defined as \u201csoft targets.\u201d Now more than ever, America needs strong leadership which is laser-focused on protecting her citizens from terrorist attacks of all genres and is not afraid of being politically incorrect.\n\nThe views expressed in this opinion article are solely those of their author and are not necessarily either shared or endorsed by EagleRising.com", "authors": ["View All Posts", "Leonora Cravotta"], "keywords": [], "meta_data": {"description": "\u201cWe believe at this point in this time this was an intentional act,\" de Blasio said. Isn\u2019t \u201can intentional act\u201d terrorism?", "og": {"site_name": "Eagle Rising", "description": "\u201cWe believe at this point in this time this was an intentional act,\" de Blasio said. 
Isn\u2019t \u201can intentional act\u201d terrorism?", "title": "Another Terrorist Attack in NYC...Why Are we STILL Being Politically Correct", "locale": "en_US", "image": "http://eaglerising.com/wp-content/uploads/2016/09/terrorism-2.jpg", "updated_time": "2016-09-22T10:49:05+00:00", "url": "http://eaglerising.com/36942/another-terrorist-attack-in-nyc-why-are-we-still-being-politically-correct/", "type": "article"}, "robots": "noimageindex", "fb": {"app_id": 256195528075351, "pages": 135665053303678}, "article": {"section": "Political Correctness", "tag": "terrorism", "published_time": "2016-09-22T07:10:30+00:00", "modified_time": "2016-09-22T10:49:05+00:00"}, "viewport": "initial-scale=1,maximum-scale=1,user-scalable=no", "googlebot": "noimageindex"}, "canonical_link": "http://eaglerising.com/36942/another-terrorist-attack-in-nyc-why-are-we-still-being-politically-correct/", "images": ["http://constitution.com/wp-content/uploads/2017/08/confederatemonument_poll_pop.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46772-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2016/03/eagle-rising-logo3-1.png", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46729-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46764-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46731-featured-300x130.jpg", "http://pixel.quantserve.com/pixel/p-52ePUfP6_NxQ_.gif", "http://0.gravatar.com/avatar/9b4601287436c60e1c7c5b65d725151f?s=112&d=mm&r=g", "http://b.scorecardresearch.com/p?c1=2&c2=22315475&cv=2.0&cj=1", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46784-featured-300x130.png", 
"http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2016/09/terrorism-2-800x300.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2016/09/coup-375x195.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2017/04/crtv_300x600_1.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46774-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2016/09/superstar-375x195.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46763-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46612-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46761-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46642-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46735-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46750-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46755-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46752-featured-300x130.png", "http://eaglerising.com/wp-content/uploads/2016/09/terrorism-2.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46743-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46712-featured-300x130.jpg", "http://0.gravatar.com/avatar/9b4601287436c60e1c7c5b65d725151f?s=100&d=mm&r=g", 
"http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46757-featured-300x130.png"], "title": "Another Terrorist Attack in NYC\u2026Why Are we STILL Being Politically Correct \u2013 Eagle Rising", "url": "http://eaglerising.com/36942/another-terrorist-attack-in-nyc-why-are-we-still-being-politically-correct/", "summary": "", "movies": [], "publish_date": {"$date": 1474528230000}, "source": "http://eaglerising.com"}
import pandas as pd
df = pd.DataFrame.from_dict(x, orient='index')
print(df)
Reading from a JSON file:
import json
import pandas as pd
with open('FakeNewsContent/BuzzFeed_Fake_1-Webpage.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame.from_dict(data, orient='index')
print(df)

Related

Python package to extract sentence from a textfile based on keyword

I need a Python package that can return the relevant sentence from a text, based on the keywords provided.
For example, below is the Wikipedia page of J.J. Oppenheimer:
Early life
Childhood and education
J. Robert Oppenheimer was born in New York City on April 22, 1904,[note 1][7] to Julius Oppenheimer, a wealthy Jewish textile importer who had immigrated to the United States from Germany in 1888, and Ella Friedman, a painter.
Julius came to the United States with no money, no baccalaureate studies, and no knowledge of the English language. He got a job in a textile company and within a decade was an executive with the company. Ella was from Baltimore.[8] The Oppenheimers were non-observant Ashkenazi Jews.[9]
The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.
Oppenheimer later remarked that it brought to mind words from the Bhagavad Gita: "Now I am become Death, the destroyer of worlds."
If the string I pass is "JJ Oppenheimer birth date", it should return "J. Robert Oppenheimer was born in New York City on April 22, 1904".
If the string I pass is "JJ Oppenheimer Trinity test", it should return "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico".
I have searched a lot, but nothing comes close to what I want, and I don't know much about NLP vectorization techniques. It would be great if someone could suggest a package, if one exists.

You could use fuzzywuzzy.
fuzz.ratio(search_text, sentence).
This gives you a score of how similar two strings are.
https://github.com/seatgeek/fuzzywuzzy

There may be a module that does this for you, but you could also try to build it yourself by parsing the text and creating keyword lists such as ["date of birth", "born", "birth date", etc.], one list for each field you care about. This would allow you to find whatever information is available.
The idea is:
you grab your text or whatever you have,
you grab what you are looking for (for example, date of birth),
you then assign a list of similar words to date of birth,
you look through your file to see if you find a sentence that contains one of them.
I am pretty sure there is no module for this; maybe I am wrong, but something like this should work.
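The steps above can be sketched roughly as follows. This is only one possible reading of the idea: the `FIELD_KEYWORDS` table, the `find_sentences` helper, and the pre-split sentence list are all assumptions made for the example, and matching is plain case-insensitive substring search.

```python
# Hypothetical mapping from a field name to words that signal it
FIELD_KEYWORDS = {
    "birth date": ["date of birth", "born", "birth date"],
    "trinity test": ["trinity test"],
}

# Sentences are assumed to be split already (sentence splitting is a
# separate problem, e.g. "J. Robert" contains a period)
sentences = [
    "J. Robert Oppenheimer was born in New York City on April 22, 1904.",
    "Julius came to the United States with no money.",
    "The first atomic bomb was successfully detonated in the Trinity test in New Mexico.",
]


def find_sentences(sentences, field):
    """Return the sentences containing any keyword registered for `field`."""
    keywords = FIELD_KEYWORDS[field]
    return [s for s in sentences if any(k in s.lower() for k in keywords)]


print(find_sentences(sentences, "birth date"))
```

The obvious limitation is that every field needs a hand-maintained keyword list, so this only scales to a small, known set of questions.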

The task you describe looks like Information Retrieval. Given a query (the keywords), the model should return a list of documents (the sentences) that best match the query.
This is essentially what the answer using fuzzywuzzy is suggesting. But maybe just counting the number of occurrences of the query words in each sentence is enough (and more efficient).
The next step would be to use Tf-Idf. It is a weighting scheme that gives high scores to words that are specific to a document with respect to a set of documents (the corpus).
This results in every document having an associated vector; you will then be able to sort the documents according to their similarity to the query vector. See this SO answer for how to do that.
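As a rough illustration of the Tf-Idf idea, here is a small standard-library sketch. It is a simplified variant, not a full vector-space implementation: each sentence is scored by summing term frequency times inverse document frequency over the query words it contains, with no length normalization or cosine similarity. The function names and sample sentences are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_sentences(query, sentences):
    """Return (sentence, score) pairs sorted by a simple tf-idf score."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n_docs = len(docs)
    # Document frequency: in how many sentences does each term occur?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    query_terms = set(tokenize(query))
    scores = []
    for doc in docs:
        score = 0.0
        for term in query_terms:
            if term in doc:
                # Rare terms (low document frequency) weigh more
                score += doc[term] * math.log(n_docs / doc_freq[term])
        scores.append(score)
    order = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return [(sentences[i], scores[i]) for i in order]


corpus = [
    "J. Robert Oppenheimer was born in New York City on April 22, 1904.",
    "Julius came to the United States with no money.",
    "The first atomic bomb was successfully detonated in the Trinity test in New Mexico.",
]
best_sentence, best_score = rank_sentences("Oppenheimer Trinity test", corpus)[0]
print(best_sentence)
```

A library implementation would normally also normalize the vectors so that long sentences are not favored; that step is left out here to keep the sketch short.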

Extract probabilities and labels from FARM TextClassification

I have spent a few days exploring the excellent FARM library and its modular approach to building models. The default output (result) however is very verbose, including a multiplicity of texts, values and ASCII artwork. For my research I only require the predicted labels from my NLP text classification model, together with the individual probabilities. How do I do that? I have been experimenting with nested lists/dictionaries but am unable to neatly produce a simple list of output labels and probabilities.
# Test your model on a sample (Inference)
from farm.infer import Inferencer
from pprint import PrettyPrinter
infer_model = Inferencer(processor=processor, model=model, task_type="text_classification", gpu=True)
basic_texts = [
# a snippet or two from Dickens
{"text": "Mr Dombey had remained in his own apartment since the death of his wife, absorbed in visions of the youth, education, and destination of his baby son. Something lay at the bottom of his cool heart, colder and heavier than its ordinary load; but it was more a sense of the child’s loss than his own, awakening within him an almost angry sorrow."},
{"text": "Soon after seven o'clock we went down to dinner, carefully, by Mrs. Jellyby's advice, for the stair-carpets, besides being very deficient in stair-wires, were so torn as to be absolute traps."},
{"text": "Walter passed out at the door, and was about to close it after him, when, hearing the voices of the brothers again, and also the mention of his own name, he stood irresolutely, with his hand upon the lock, and the door ajar, uncertain whether to return or go away."},
# from Lewis Carroll
{"text": "I have kept one for many years, and have found it of the greatest possible service, in many ways: it secures my _answering_ Letters, however long they have to wait; it enables me to refer, for my own guidance, to the details of previous correspondence, though the actual Letters may have been destroyed long ago;"},
{"text": "The Queen gasped, and sat down: the rapid journey through the air had quite taken away her breath and for a minute or two she could do nothing but hug the little Lily in silence."},
{"text": "Rub as she could, she could make nothing more of it: she was in a little dark shop, leaning with her elbows on the counter, and opposite to her was an old Sheep, sitting in an arm-chair knitting, and every now and then leaving off to look at her through a great pair of spectacles."},
# G K Chesterton
{"text": "Basil and I walked rapidly to the window which looked out on the garden. It was a small and somewhat smug suburban garden; the flower beds a little too neat and like the pattern of a coloured carpet; but on this shining and opulent summer day even they had the exuberance of something natural, I had almost said tropical. "},
{"text": "This is the whole danger of our time. There is a difference between the oppression which has been too common in the past and the oppression which seems only too probable in the future."},
{"text": "But whatever else the worst doctrine of depravity may have been, it was a product of spiritual conviction; it had nothing to do with remote physical origins. Men thought mankind wicked because they felt wicked themselves. "},
]
result = infer_model.inference_from_dicts(dicts=basic_texts)
PrettyPrinter().pprint(result)
#print(result)

All logging (incl. the ASCII artwork) is done in FARM via Python's logging framework. You can simply disable the logs up to a certain level like this at the beginning of your script:
import logging
logging.disable(logging.ERROR)
Is that what you are looking for or do you rather want to adjust the output format of the model predictions? If you only need label and probability, you could do something like this:
...
basic_texts = [
{"text": "Stackoverflow is a great community"},
{"text": "It's snowing"},
]
infer_model = Inferencer(processor=processor, model=model, task_type="text_classification", gpu=True)
result = infer_model.inference_from_dicts(dicts=basic_texts)
minimal_results = []
for sample in result:
    # Only extract the top 1 prediction per sample
    top_pred = sample["predictions"][0]
    minimal_results.append({"label": top_pred["label"], "probability": top_pred["probability"]})
PrettyPrinter().pprint(minimal_results)
infer_model.close_multiprocessing_pool()
(I left out the initial model loading etc. - see this example for more details)
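If you want every prediction rather than only the first one per entry, the same idea can be sketched without FARM installed. The snippet below assumes, based only on the code above, that `result` is a list of dicts, each with a "predictions" list whose items carry "label" and "probability" keys. The `mock_result` data and the `extract_labels_and_probs` helper are made up for illustration.

```python
def extract_labels_and_probs(result):
    """Collect (label, probability) pairs from every prediction in `result`."""
    pairs = []
    for entry in result:
        for pred in entry["predictions"]:
            pairs.append((pred["label"], pred["probability"]))
    return pairs


# Hypothetical stand-in for the Inferencer output (labels and values invented)
mock_result = [
    {"predictions": [
        {"label": "dickens", "probability": 0.91},
        {"label": "carroll", "probability": 0.62},
    ]},
    {"predictions": [{"label": "chesterton", "probability": 0.78}]},
]
pairs = extract_labels_and_probs(mock_result)
print(pairs)
```

It is worth printing one raw entry of the real output first to confirm that the keys match this assumed layout.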

How do I parse json correctly?

from urllib.request import urlopen
import json
def downloadPage(url):
    webpage = urlopen(url).readlines()
    return webpage
json_string = downloadPage('https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=The%20Terminator')
str1 = ''.join(map(bytes.decode, json_string))
parsed_json = json.loads(str1)
print(parsed_json)
It seems the JSON is not parsed properly, and when I do
print(parsed_json['extract'])
I get
Traceback (most recent call last):
  File "D:/Universitet/PythonProjects/myapp.py", line 14, in <module>
    print(parsed_json['extract'])
KeyError: 'extract'
How can I make it work so it extracts the json I want with
print(parsed_json['extract'])

If this is your JSON object, then you have to traverse the whole object to get to the value you want to extract.
{
"batchcomplete":"",
"query":{
"pages":{
"30327":{
"pageid":30327,
"ns":0,
"title":"The Terminator",
"extract":"The Terminator is a 1984 American science fiction film directed by James Cameron. It stars Arnold Schwarzenegger as the Terminator, a cyborg assassin sent back in time from 2029 to 1984 to kill Sarah Connor (Linda Hamilton), whose son will one day become a savior against machines in a post-apocalyptic future. Michael Biehn plays Kyle Reese, a reverent soldier sent back in time to protect Sarah. The screenplay is credited to Cameron and producer Gale Anne Hurd, while co-writer William Wisher Jr. received a credit for additional dialogue. Executive producers John Daly and Derek Gibson of Hemdale Film Corporation were instrumental in financing and production.The Terminator topped the United States box office for two weeks and helped launch Cameron's film career and solidify Schwarzenegger's status as a leading man. Its success led to a franchise consisting of several sequels, a television series, comic books, novels and video games. In 2008, The Terminator was selected by the Library of Congress for preservation in the National Film Registry as \"culturally, historically, or aesthetically significant\"."
}
}
}
}
Something like:
result = parsed_json["query"]["pages"]["30327"]["extract"]
But of course you should search for the property in a proper way, iterating over properties / arrays and testing if the keys exist.
EDIT
If you know the structure is always the same, but the ID differs, then you can try something like this to handle arbitrary IDs.
for key, value in parsed_json["query"]["pages"].items():
    result = value["extract"]
    print(result)
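A slightly more defensive version of that loop might look like the sketch below. It assumes the response keeps the "query" → "pages" → page ID layout shown above; the `extract_all` helper and the shortened `sample_response` are made up for the example.

```python
def extract_all(parsed):
    """Map each page ID to its extract, skipping pages without one."""
    pages = parsed.get("query", {}).get("pages", {})
    return {
        page_id: page["extract"]
        for page_id, page in pages.items()
        if "extract" in page
    }


# Shortened stand-in for the API response shown above
sample_response = {
    "batchcomplete": "",
    "query": {
        "pages": {
            "30327": {
                "pageid": 30327,
                "ns": 0,
                "title": "The Terminator",
                "extract": "The Terminator is a 1984 American science fiction film.",
            }
        }
    },
}
extracts = extract_all(sample_response)
print(extracts["30327"])
```

Using `.get` with empty-dict defaults means a response without "query" or "pages" yields an empty result instead of a KeyError.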

The JSON is nested, so to get a value you have to follow the chain of keys:
parsed_json['query']['pages']['30327']['extract']

Working with span objects. [spaCy, python]

I'm not sure if this is a really dumb question, but here goes.
text_corpus = '''Insurance bosses plead guilty\n\nAnother three US insurance executives have pleaded guilty to fraud charges stemming from an ongoing investigation into industry malpractice.\n\nTwo executives from American International Group (AIG) and one from Marsh & McLennan were the latest. The investigation by New York attorney general Eliot Spitzer has now obtained nine guilty pleas. The highest ranking executive pleading guilty on Tuesday was former Marsh senior vice president Joshua Bewlay.\n\nHe admitted one felony count of scheming to defraud and faces up to four years in prison. A Marsh spokeswoman said Mr Bewlay was no longer with the company. Mr Spitzer\'s investigation of the US insurance industry looked at whether companies rigged bids and fixed prices. Last month Marsh agreed to pay $850m (£415m) to settle a lawsuit filed by Mr Spitzer, but under the settlement it "neither admits nor denies the allegations".\n'''
import spacy
def get_entities(document_text, model):
    analyzed_doc = model(document_text)
    entities = [entity for entity in analyzed_doc.ents if entity.label_ in ["PER", "ORG", "LOC", "GPE"]]
    return entities
model = spacy.load("en_core_web_sm")
entities_1 = get_entities(text_corpus, model)
entities_2 = get_entities(text_corpus, model)
But when I run the following,
entities_1[0] in entities_2
The output is False.
Why is that? The objects in both the entity lists are the same. Yet an item from one list is not in the other one. That's extremely odd. Can someone please explain why that is so to me?

This is due to the way ents are represented in spaCy. They are classes with specific implementations, so even entities_2[0] == entities_1[0] will evaluate to False. By the looks of it, the Span class does not have an implementation of __eq__ which, at first glance at least, is the simple reason why.
If you print out the value of entities_2[0] it will give you US but this is simply because the span class has a __repr__ method implemented in the same file. If you want to do a boolean comparison, one way would be to use the text property of Span and do something like:
entities_1[0].text in [e.text for e in entities_2]
edit:
As @abb pointed out, Span implements __richcmp__; however, this is applicable to the same instance of Span since it checks the position of the token itself.
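If comparing by text alone is too loose (the same string can occur several times with different labels), one option is to compare a small tuple of attributes instead. The sketch below uses `start_char`, `end_char`, `label_` and `text`, which are attributes of spaCy's Span. To keep it runnable without a model, it uses `SimpleNamespace` objects as stand-ins for spans; the `span_key` helper and the offsets are made up for the example.

```python
from types import SimpleNamespace


def span_key(span):
    """Build a hashable key describing where a span is and what it is."""
    return (span.start_char, span.end_char, span.label_, span.text)


# Stand-ins for two spans that come from two separate analyses of the same
# text (offsets are invented; real code would pass spaCy Span objects)
span_a = SimpleNamespace(start_char=38, end_char=40, label_="GPE", text="US")
span_b = SimpleNamespace(start_char=38, end_char=40, label_="GPE", text="US")
span_c = SimpleNamespace(start_char=120, end_char=123, label_="ORG", text="AIG")

keys_2 = {span_key(span_b), span_key(span_c)}
found = span_key(span_a) in keys_2
print(found)
```

With real entity lists this would read `span_key(entities_1[0]) in {span_key(e) for e in entities_2}`.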

Unexpected behavior with open() and numpy.savetxt() functions

Problem
I am trying to output statistics about a table, followed by more table data using Pandas and numpy.
When I execute the following code:
import pandas as pd
import numpy as np
data = pd.read_csv(r'c:\Documents\DS\CAStateBuildingMetrics.csv')
waterUsage = data["Water Use (All Water Sources) (kgal)"]
dept = data[["Department Name", "Property Id"]]
mean = str(waterUsage.mean())
median = str(waterUsage.median())
most = str(waterUsage.mode())
hw1 = open(r'c:\Documents\DS\testFile', "a")
hw1.write("Mean Water Usage Median Water Usage Most Common Usage Amounts\n")
hw1.write(mean+' '+median+' '+most)
np.savetxt(r'c:\Documents\DS\testFile', dept.values, fmt='%s')
The table output by np.savetxt is written into c:\Documents\DS\testFile before the statistics about mean, median, and mode water usage are written into the file. Below is the output I am describing:
Here is a sample of the table output, which ends up being 1700 rows.
Capitol Area Development Authority 1259182
Capitol Area Development Authority 1259200
Capitol Area Development Authority 1259218
California Department of Forestry and Fire Protection 3939905
California Department of Forestry and Fire Protection 3939906
California Department of Forestry and Fire Protection 3939907
After this, the script outputs the statistics in this format
Mean Water Usage Median Water Usage Most Common Usage Amounts
6913.1633414932685 182.35 0 165.0
Type: float64
Question
How do I adjust the behavior to guarantee that the statistics appear before the table?

The issue, as pointed out by @hpaulj, is that np.savetxt is given the file path rather than the already open file object hw1, so the two writes do not go through the same open file.
Replacing
np.savetxt(r'c:\Documents\DS\testFile', dept.values, fmt='%s')
with
np.savetxt(hw1, dept.values, fmt='%s')
hw1.close()
will write all the information in the expected order to the same file. Closing the file follows best practice for file handling in Python.
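The same pattern can be written with a `with` block so the file is closed automatically. The sketch below avoids the NumPy dependency by writing the rows with a plain loop as a stand-in for `np.savetxt(handle, ...)`; the `write_report` helper, the header text, and the temporary file are assumptions made for the example.

```python
import os
import tempfile


def write_report(path, header, rows):
    """Write the header, then the table rows, through a single file handle."""
    with open(path, "a") as handle:
        handle.write(header + "\n")
        for row in rows:
            # Stand-in for np.savetxt(handle, rows, fmt='%s')
            handle.write(" ".join(str(value) for value in row) + "\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    report_path = os.path.join(tmp_dir, "testFile")
    write_report(
        report_path,
        "Mean Water Usage Median Water Usage",
        [("Capitol Area Development Authority", 1259182)],
    )
    with open(report_path) as handle:
        lines = handle.read().splitlines()

print(lines)
```

Because everything goes through one handle, the writes reach the file in the order they were issued, and leaving the `with` block closes it.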
