Flatten JSON output - Python

How can I turn this data into a flat data frame?
I've tried using json_normalize and pivot, but I can't seem to get the format right.
This is my desired output format:
SiteName|SiteId|...|CompressorMeterRefID|TankID|TankNumber...|TankID|TankNumber...|TankID|... DateandTime|...
Please advise
[{'SiteName': 'Reinschmiedt 1-4H (CRP 11)',
'SiteId': 57,
'SiteRefId': 'OK10020',
'Choke': '',
'GasMeter1': 53.25,
'GasMeter1Name': 'Check Meter',
'GasMeter1RefId': '',
'GasMeter2Name': '',
'GasMeter2RefId': '',
'GasMeter3Name': '',
'GasMeter3RefId': '',
'OilMeter1Name': '',
'OilMeter1RefId': '',
'OilMeter2Name': '',
'OilMeter2RefId': '',
'WaterMeter1': 0.0,
'WaterMeter1Name': 'Water Meter',
'WaterMeter1RefId': '',
'WaterMeter2Name': '',
'WaterMeter2RefId': '',
'FlareMeterName': '',
'FlareMeterRefId': '',
'GasLiftMeterName': '',
'GasLiftMeterRefId': '',
'CompressorMeterName': '',
'CompressorMeterRefId': '',
'TankEntries': [{'TankId': 138,
'TankNumber': 2,
'TankLevelDateTime': '2018-07-01T12:00:00.0000000Z',
'TankLevelDateTimeLocal': '2018-07-01T07:00:00.0000000Z',
'TankTopGauge': 35.99,
'TankName': 'Oil Tank 209206',
'TankRefId': 0,
'TankRefId2': '',
'TankRefId3': ''},
{'TankId': 139,
'TankNumber': 3,
'TankLevelDateTime': '2018-07-01T12:00:00.0000000Z',
'TankLevelDateTimeLocal': '2018-07-01T07:00:00.0000000Z',
'TankTopGauge': 109.5,
'TankName': 'Oil Tank 209207',
'TankRefId': 0,
'TankRefId2': '',
'TankRefId3': ''}],
'DateAndTime': '2018-07-01T12:00:00.0000000Z',
'DateAndTimeLocal': '2018-07-01T07:00:00.0000000Z',
'UserName': 'ScadaVisor',
'Notes': ''},
{'SiteName': 'Allen 1-11H (CRP 8)',
.....
.....
.....

In R you can do it like this using the jsonlite package:
result <- as.data.frame(jsonlite::stream_in(textConnection(data)))
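In pandas, a minimal sketch along the same lines might look like the following, assuming the list of dicts is bound to a variable named data as in the sample above, and that pd.json_normalize is available (pandas >= 1.0; older versions expose it as pandas.io.json.json_normalize). It builds one row per tank first, then spreads the tanks into numbered column groups to approximate the wide layout; the list-valued pivot index needs pandas >= 1.1:
import pandas as pd

# One row per tank entry, carrying selected site-level fields as metadata
# (extend the meta list with the other site fields you need).
flat = pd.json_normalize(
    data,
    record_path='TankEntries',
    meta=['SiteName', 'SiteId', 'DateAndTime'],
    record_prefix='Tank.',
)

# Number the tanks within each site so they can be spread into columns.
flat['TankIdx'] = flat.groupby('SiteId').cumcount()

# One row per site; tank columns become Tank.TankId_0, Tank.TankId_1, ...
wide = flat.pivot(index=['SiteName', 'SiteId', 'DateAndTime'], columns='TankIdx')
wide.columns = [f'{name}_{idx}' for name, idx in wide.columns]
wide = wide.reset_index()
print(wide.head())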

Related

formatting dictionary keys read from pdf

How can I convert this dictionary's keys to the following?
original_di={'001': '', '002': '', '3': '24s', '004': '42s', '5': '', '006': '', '007': '', '008': '', '009': '', '010': '', '011': '', '012\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '013': '', '014': '', '015': '', '016': '', '017': '', '018': '', '019': '', '020': '', '021': '', '022': '', '023': '', '024': '', '025': '', '026': '', '027': '', '028': '', '029': '', '030': '', '031': '', '032': '', '033': '', '041': '', '042': '', '043': '', '044': '', '045': '', '046': '', '047': '', '048': '', '049': '', '050': '', '051': '', '052': '', '053': '', '054': '', '055': '', '056\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '035': '', '037': '', '039\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '034': '', '036': '', '038': '', '040\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '057': '', '092': '', '058': '', '059': '', '060': '', '061': '', '062': '', '063\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '064\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '065\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '066': '', '067': '', '068': '', '069': '', '070': '', '071': '', '072': '', '073': '', '074': '', '075': '', '076': '', '077': '', '078': '', '079': '', '080': '', '081': '', '082': '', '083': '', '084': '', '085\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '086': '', '087': '', '088': '', '089\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '090': '', '091': '', '093': '', '094': '', '095': '', '096': '', '097': '', '098': '', '099': '', '100': '', '101': '', '102': '', '103': '', '104': '', '105': '', '106': '', '107': '', '108': '', '109': '', '110': '', '111': '', '112': '', '113': '', '114': '', '115': '', '116': '', '117': '', '118': '', '119': '', '120': '', '121': '', '122': '', '123': '', '124': '', '125': '', '126': '', '127': '', '128': '', '129': '', '130': '', '131': '', '132': '', '133': '', '134': '', '135': '', '136': '', '137': '', '138': '', '139': '', '140': '', '141': '', '142': '', '143': '', '144': '', '145': '87e', '146': '', '147': '', '148': '', '149\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '150\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '151\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '152\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '153\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '154\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '155\r\r\r\r\r\r\r\r\r\r\r\r\r': 'US', '156': ''}
Some keys have extra \r or \t characters, and some keys aren't 3 digits.
Ideally, I want every key to be 3 digits (001, 003, 050, 111), without the \r or \t.
Try this: strip to remove the stray whitespace characters, and rjust to zero-pad the keys to three digits.
{k.strip().rjust(3, "0"): v.strip() for k, v in original_di.items()}
for k, v in original_di.items() - iterates over the dict; k holds the key and v the value.
int(k.strip()) - removes the whitespace characters (e.g. \r or \t) from the key and casts the string to an integer.
"{0:0=3d}".format(x) - formats the integer as a string of exactly 3 digits in every case.
: v.strip() - removes the whitespace characters (e.g. \r or \t) from the value.
Code:
original_di={'001': '', '002': '', '3': '24s', '004': '42s', '5': '', '006': '', '007': '', '008': '', '009': '', '010': '', '011': '', '012\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '013': '', '014': '', '015': '', '016': '', '017': '', '018': '', '019': '', '020': '', '021': '', '022': '', '023': '', '024': '', '025': '', '026': '', '027': '', '028': '', '029': '', '030': '', '031': '', '032': '', '033': '', '041': '', '042': '', '043': '', '044': '', '045': '', '046': '', '047': '', '048': '', '049': '', '050': '', '051': '', '052': '', '053': '', '054': '', '055': '', '056\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '035': '', '037': '', '039\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '034': '', '036': '', '038': '', '040\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '057': '', '092': '', '058': '', '059': '', '060': '', '061': '', '062': '', '063\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '064\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '065\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '066': '', '067': '', '068': '', '069': '', '070': '', '071': '', '072': '', '073': '', '074': '', '075': '', '076': '', '077': '', '078': '', '079': '', '080': '', '081': '', '082': '', '083': '', '084': '', '085\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '086': '', '087': '', '088': '', '089\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '090': '', '091': '', '093': '', '094': '', '095': '', '096': '', '097': '', '098': '', '099': '', '100': '', '101': '', '102': '', '103': '', '104': '', '105': '', '106': '', '107': '', '108': '', '109': '', '110': '', '111': '', '112': '', '113': '', '114': '', '115': '', '116': '', '117': '', '118': '', '119': '', '120': '', '121': '', '122': '', '123': '', '124': '', '125': '', '126': '', '127': '', '128': '', '129': '', '130': '', '131': '', '132': '', '133': '', '134': '', '135': '', '136': '', '137': '', '138': '', '139': '', '140': '', '141': '', '142': '', '143': '', '144': '', '145': '87e', '146': '', '147': '', '148': '', '149\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '150\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '151\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '152\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '153\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '154\r\r\r\r\r\r\r\r\r\r\r\r\r': '', '155\r\r\r\r\r\r\r\r\r\r\r\r\r': 'US', '156': ''}
print("{}".format({"{0:0=3d}".format(int(k.strip())): v.strip() for k, v in original_di.items()}))
Output:
$ python3 test.py
{'001': '', '002': '', '003': '24s', '004': '42s', '005': '', '006': '', '007': '', '008': '', '009': '', '010': '', '011': '', '012': '', '013': '', '014': '', '015': '', '016': '', '017': '', '018': '', '019': '', '020': '', '021': '', '022': '', '023': '', '024': '', '025': '', '026': '', '027': '', '028': '', '029': '', '030': '', '031': '', '032': '', '033': '', '041': '', '042': '', '043': '', '044': '', '045': '', '046': '', '047': '', '048': '', '049': '', '050': '', '051': '', '052': '', '053': '', '054': '', '055': '', '056': '', '035': '', '037': '', '039': '', '034': '', '036': '', '038': '', '040': '', '057': '', '092': '', '058': '', '059': '', '060': '', '061': '', '062': '', '063': '', '064': '', '065': '', '066': '', '067': '', '068': '', '069': '', '070': '', '071': '', '072': '', '073': '', '074': '', '075': '', '076': '', '077': '', '078': '', '079': '', '080': '', '081': '', '082': '', '083': '', '084': '', '085': '', '086': '', '087': '', '088': '', '089': '', '090': '', '091': '', '093': '', '094': '', '095': '', '096': '', '097': '', '098': '', '099': '', '100': '', '101': '', '102': '', '103': '', '104': '', '105': '', '106': '', '107': '', '108': '', '109': '', '110': '', '111': '', '112': '', '113': '', '114': '', '115': '', '116': '', '117': '', '118': '', '119': '', '120': '', '121': '', '122': '', '123': '', '124': '', '125': '', '126': '', '127': '', '128': '', '129': '', '130': '', '131': '', '132': '', '133': '', '134': '', '135': '', '136': '', '137': '', '138': '', '139': '', '140': '', '141': '', '142': '', '143': '', '144': '', '145': '87e', '146': '', '147': '', '148': '', '149': '', '150': '', '151': '', '152': '', '153': '', '154': '', '155': 'US', '156': ''}
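As a hedged aside, str.zfill gives the same zero-padding without the int round-trip:
# Equivalent comprehension using str.zfill, which left-pads a string
# with zeros to the requested width.
cleaned = {k.strip().zfill(3): v.strip() for k, v in original_di.items()}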

Limiting data in pd.DataFrame

I am trying to do the following when loading an internal data structure into pandas:
df = pd.DataFrame(self.data,
                  nrows=num_rows+500,
                  skiprows=skip_rows,
                  header=header_row,
                  usecols=limit_cols)
However, none of those arguments appear to have any effect (unlike when reading a CSV file); only the data itself is loaded. Is there another method I can use to have more control over the data that I'm ingesting? Or do I need to rebuild the data before loading it into pandas?
My input data looks like this:
data = [
['ABC', 'es-419', 'US', 'Movie', 'Full Extract', 'PARIAH', '', '', 'EST', 'Features - EST', 'HD', '2017-05-12 00:00:00', 'Open', 'WSP', '10.5000', '', '', '', '', '10.5240/8847-7152-6775-8B59-ADE0-Y', '10.5240/FFE3-D036-A9A4-9E7A-D833-1', '', '', '', '04065', '', '', '2011', '', '', '', '', '', '', '', '', '', '', '', '113811', '', '', '', '', '', '04065', '', 'Spanish (LAS)', 'US', '10', 'USA NATL SALE', '2017-05-11 00:00:00', 'TIER 3', '21', '', '', 'USA NATL SALE-SPANISH LANGUAGE', 'SPAN'],
['ABC', 'es-419', 'US', 'Movie', 'Full Extract', 'PATCH ADAMS', '', '', 'EST', 'Features - EST', 'HD', '2017-05-12 00:00:00', 'Open', 'WSP', '10.5000', '', '', '', '', '10.5240/DD84-FBF4-8F67-D6F3-47FF-1', '10.5240/B091-00D4-8215-39D8-0F33-8', '', '', '', 'U2254', '', '', '1998', '', '', '', '', '', '', '', '', '', '', '', '113811', '', '', '', '', '', 'U2254', '', 'Spanish (LAS)', 'US', '10', 'USA NATL SALE', '2017-05-11 00:00:00', 'TIER 3', '21', '', '', 'USA NATL SALE-SPANISH LANGUAGE', 'SPAN']
]
And so I'm looking to be able to state which rows it should load (or skip) and which columns it should use (usecols). Is that possible to do with an internal Python data structure?
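pd.DataFrame itself accepts none of the nrows/skiprows/usecols keywords, so one hedged approach is to slice the structure in plain Python, selecting rows before construction and columns after; the variable values below are illustrative only:
import pandas as pd

skip_rows = 0               # illustrative values, mirroring the names above
num_rows = 500
limit_cols = [0, 1, 2, 5]   # positional indices of the columns to keep

rows = data[skip_rows:skip_rows + num_rows]   # row selection up front
df = pd.DataFrame(rows).iloc[:, limit_cols]   # column selection afterwards
Alternatively, writing the rows into an in-memory buffer with io.StringIO and csv.writer and handing that to pd.read_csv would let all four keywords apply unchanged.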

How to correctly parse HTML to Unicode strings with pandas?

I'm running a Python program which fetches a UTF-8-encoded web page, extracts some text from an HTML table using pandas (read_html), and writes the result to a CSV file.
However, when I write this text to a file, the text gets written in an unexpected encoding (for example \xd0\xb9\xd1\x82\xd0\xb8).
To work around the problem I added the line i = i.split(" ").
After that, all the spaces in the CSV file are replaced by empty strings, as in the example below:
['0', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '', '', '', '', '', '', '', '', '', '', '', '2', '', '', '3\n0', '', '', '', '', '', '', '', 'number', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'last name', '', 'number', 'plan', 'NaN\n1', '', '', '', '', '', '', '', '', '', 'NaN', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'NaN', '', '', 'not', 'NaN\n2', '', '', '', '', '53494580', '', '', '', '', '', '', '', '', '', '+', '(53)494580', '', '', '', '', '', '', '', '', 'NP_551', 'NaN\n3', '', '', '', '', '53494581', '', '', '', '', '', '', '', '', '', '+', '(53)494581', '', '', '', '', '', '', '', '', 'NP_551', 'NaN\n4', '', '', '', '']
I would like to get rid of the empty '' entries. Is there a way to fix this?
Any pointers would be much appreciated.
Python code:
import csv
import requests
import pandas as pd
import html5lib

filename = "1.csv"
file = open(filename, "w", encoding='UTF-8', newline='\n')
output = csv.writer(file, dialect='excel', delimiter=' ')
r = requests.get('http://10.45.87.12/og?sh=1&CallerName=&Sys=.79.83.86.51&')
pd.set_option('max_rows', 10000)
df = pd.read_html(r.content)
for i in df:
    i = str(i)
    i = i.strip()
    i = i.encode('UTF-8').decode('UTF-8')
    i = i.split(" ")
    output.writerow(i)
file.close()
You can use the built-in filter function to remove the empty values. You can add the snippet below after i = i.split(" "):
A = ['0', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '', '', '', '', '', '', '', '', '', '', '', '2', '', '', '3\n0', '', '', '', '', '', '', '', 'number', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'last name', '', 'number', 'plan', 'NaN\n1', '', '', '', '', '', '', '', '', '', 'NaN', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'NaN', '', '', 'not', 'NaN\n2', '', '', '', '', '53494580', '', '', '', '', '', '', '', '', '', '+', '(53)494580', '', '', '', '', '', '', '', '', 'NP_551', 'NaN\n3', '', '', '', '', '53494581', '', '', '', '', '', '', '', '', '', '+', '(53)494581', '', '', '', '', '', '', '', '', 'NP_551', 'NaN\n4', '', '', '', '']
print(list(filter(None, A)))
Output:
['0', '1', '2', '3\n0', 'number', 'last name', 'number', 'plan', 'NaN\n1', 'NaN', 'NaN', 'not', 'NaN\n2', '53494580', '+', '(53)494580', 'NP_551', 'NaN\n3', '53494581', '+', '(53)494581', 'NP_551', 'NaN\n4']
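As an alternative sketch (an assumption on my part, not what the answer above does): since read_html already returns a list of DataFrames, writing each table out with to_csv sidesteps the str()/split() round-trip entirely, so the empty-string debris never appears. The URL is the one from the question:
import pandas as pd
import requests

r = requests.get('http://10.45.87.12/og?sh=1&CallerName=&Sys=.79.83.86.51&')
# read_html returns one DataFrame per table; write each to its own file.
for n, table in enumerate(pd.read_html(r.content)):
    table.to_csv('table_{}.csv'.format(n), index=False, encoding='utf-8')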

map list values when the position is not known

I have the following list of lists:
(['investmentseminar', '300', '', '', 'CNAME', '', 'domain.com.'], 7)
(['#', '300', '', '', '', '', '', '', '', 'CNAME', '', 'domain.com.'], 12)
(['#', '300', '', '', '', '', '', '', '', '', '', '', '', '', '', 'MX', '', '1', '', 'eu-smtp-inbound-1.com.'], 20)
(['#', '3600', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'TXT', '', 'MS=ms87183849'], 19)
(['#', '3600', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'TXT', '', 'MS=ms91398333'], 19)
It is from a parsed file of BIND data. I am trying to extract the record type and TTL, but the positions of the items in the list are not fixed.
This is the code I have so far:
lines = [['#', '', '', 'MX', '', '10', '', 'relay1.netnames.net.'],['#', '', '', 'MX', '', '20', '', 'relay2.netnames.net.'], ['#', '3600', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'TXT', '', 'MS=ms91398333'], ['#', '300', '', '', '', '', '', '', '', '', '', '', '', '', '', 'MX', '', '1', '', 'eu-smtp-inbound-1.com.'], ['domain.tld.', '3600', '', '', '', '', '', '', '', '', '', '', '', 'TXT', '', 'v=spf1 redirect=spf.domain.tld'],['a.ns.slf', '', '', '', '', '', '', '', '', '', 'A', '', '192.123.54.133'],['adfs', '', '', '', '', '', '', '', '', '', '', '', '', '', 'A', '', '192.123.67.20']]
record_set_list = []

def record_set(record):
    resource = {
        'Name': record[0],
        'TTL': record[1],
        'Type': record[4],
        'Value': record[-1]
    }
    record_set_list.append({'RecordSets': resource})

types = ['A', 'AAAA', 'CAA', 'CNAME', 'MX', 'NAPTR', 'PTR', 'SPF', 'SRV', 'TXT', 'ZONE']
for record in lines:  # each record is already a list of fields
    any_in = any(i in record for i in types)
    if any_in is True:
        record_set(record)
How do I match the TTL, the Type and, in the case of an MX record, the preference?
Any advice is much appreciated.
Use the built-in function filter to remove the empty strings, zip the remaining values with the corresponding keys, and make a dict.
def record_set(record):
    keys = ['Name', 'TTL', 'Type', 'Value']
    values = filter(None, record)        # drop the empty strings
    resource = dict(zip(keys, values))   # pair keys with the surviving fields
    record_set_list.append({'RecordSets': resource})
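A quick usage sketch against one of the sample rows above. Hedged assumption: once the blanks are dropped, every record carries exactly Name, TTL, Type, Value in that order; a row with a missing TTL, or an MX row with a preference field, would shift the mapping and need its own handling:
record_set_list = []
record_set(['#', '300', '', '', '', '', '', '', '', 'CNAME', '', 'domain.com.'])
print(record_set_list)
# [{'RecordSets': {'Name': '#', 'TTL': '300', 'Type': 'CNAME', 'Value': 'domain.com.'}}]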

Importing CSV from URL and displaying rows on Python by using Requests

import csv
import requests

webpage = requests.get('http://www.pjm.com/pub/account/lmpda/20160427-da.csv')
reader = csv.reader(webpage)
for row in reader:
    print(row)
Hi, I'm new to Python and I'm trying to open a CSV file from a URL and then display the rows so I can take the data that I need from it. However, I get an error saying:
Traceback (most recent call last):
  File "", line 1, in
    for row in reader:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
Thank you in advance.
You can try this:
import csv, requests

webpage = requests.get('http://www.pjm.com/pub/account/lmpda/20160427-da.csv')
# use .text so csv.reader receives strings rather than bytes
reader = csv.reader(webpage.text.splitlines())
for row in reader:
    print(row)
Hope this will help
Use .text, as you are getting bytes back in Python 3:
webpage = requests.get('http://www.pjm.com/pub/account/lmpda/20160427-da.csv')
reader = csv.reader([webpage.text])
for row in reader:
    print(row)
That gives _csv.Error: new-line character seen in unquoted field, so split the lines after decoding. Also, stream=True will let you get the data in chunks rather than all at once, so you can filter by row and write as you go:
import csv
import requests

webpage = requests.get('http://www.pjm.com/pub/account/lmpda/20160427-da.csv', stream=True)
for line in webpage:
    print(list(csv.reader(line.decode("utf-8").splitlines()))[0])
Which gives you:
['Day Ahead Hourly LMP Values for 20160427', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['00', '600', '700', '800', '900', '1000', '1100', '1200', '1300', '1400', '1500', '1600', '1700', '1800', '1900', '2000', '2100', '2200', '2300', '2400', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['1', '25.13', '25.03', '28.66', '25.94', '21.74', '19.47', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['600', '600', '600', '700', '700', '700', '800', '800', '800', '900', '900', '900', '1000', '1000', '1000', '1100', '1100', '1100', '1200', '1200', '1200', '1300', '1300', '1300', '1400', '1400', '1400', '1500', '']
['1500', '1500', '1600', '1600', '1600', '1700', '1700', '1700', '1800', '1800', '1800', '1900', '1900', '1900']
['', '2000', '2000', '2000', '2100', '2100', '2100', '2200', '2200', '2200', '2300', '2300', '2300', '2400', '2400', '2400', '']
['lLMP', 'CongestionPrice', 'MarginalLossPrice', 'TotalLMP', 'CongestionPrice', 'MarginalLossPrice', 'TotalLMP', 'CongestionPrice', 'MarginalLossPrice', 'Tot']
['alLMP', 'CongestionPrice', 'MarginalLossPrice', 'TotalLMP', 'CongestionPrice', 'MarginalLossPrice', 'TotalLMP', 'CongestionPrice', 'MarginalLossPrice', 'To']
['talLMP', 'CongestionPrice', 'MarginalLossPrice', 'TotalLMP', 'CongestionPrice', 'MarginalLossPrice', 'TotalLMP', 'CongestionPrice', 'MarginalLossPrice', 'T']
.......................................
A variation on the answer by Padriac Cunningham uses iter_lines() from Requests and decodes each line with a list comprehension:
import csv
import requests
webpage = requests.get('http://www.pjm.com/pub/account/lmpda/20160427-da.csv', stream=True)
webpage_decoded = [line.decode('utf-8') for line in webpage.iter_lines()]
reader = csv.reader(webpage_decoded)
Or, even simpler, you can have iter_lines() do the decoding:
webpage_decoded = webpage.iter_lines(decode_unicode=True)
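Putting that variation together end to end (a sketch under the same assumptions; decode_unicode=True yields str lines when the response encoding is known, so csv.reader can consume the iterator directly):
import csv
import requests

webpage = requests.get('http://www.pjm.com/pub/account/lmpda/20160427-da.csv', stream=True)
reader = csv.reader(webpage.iter_lines(decode_unicode=True))
for row in reader:
    print(row)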
