Python 'key error' while building dictionary dynamically (on the fly)

I get an error on this line of code:
result_dict['strat'][k]['name'] = current_comps[0].strip()
The error is: KeyError: 'strat'
I have an input line
PERSON1 ## CAR1 # ENTRY : 0 | EXIT : 0 ## CAR2 # M1 : YES : 10/01/17 02:00 | M2 : NO : 10/02/16 03:00 | M3 : NO : 05/07/17 11:00 | M4 : YES : 01/01/16 03:00 ## TRUCK # M3 : NO : 03/01/17 03:45 | M23 : NO : 01/01/14 07:00 | M27 : YES : 02/006/18 23:00
I'm looking to parse this input to generate the output detailed below. As part of this, I'm trying to build a dictionary, inserting both keys and values dynamically, and I'm having a lot of trouble doing this.
Could I please get some help with this?
Here is what I've tried so far:
# File read
f = open('input_data', 'r')
file_cont = f.read().splitlines()
f.close()

# json template
# Initialize dictionary
result_arr = []
result_dict = {}
k = 0

for item in file_cont:
    strat = item.split('##')
    result_dict['Person'] = strat[0].strip()
    j = 1
    while j < len(strat):
        # Split various components of the main line
        current_comps = strat[j].split('#')
        # Name of strat being parsed
        result_dict['strat'][k]['name'] = current_comps[0].strip()
        # tfs across the various time frames
        tfs = current_comps[1].split('|')
        # First travel mode
        if current_comps[0].strip() == 'CAR1':
            temp_low_arr = tfs[0].split(':')
            temp_high_arr = tfs[1].split(':')
            result_dict['strat'][k]['Entry'] = temp_low_arr[1].strip()
            result_dict['strat'][k]['Exit'] = temp_high_arr[1].strip()
        # Second travel mode
        elif current_comps[0].strip() == 'CAR2':
            z = 0
            while z < len(tfs):
                # Split components of the sign
                sign_comp_car_2 = tfs[z].split(':')
                result_dict['strat'][k]['tf'][z]['path'] = sign_comp_car_2[0].strip()
                result_dict['strat'][k]['tf'][z]['sign'] = sign_comp_car_2[1].strip()
                result_dict['strat'][k]['tf'][z]['sign_time'] = sign_comp_car_2[2].strip()
                z += 1
        # Third travel mode
        elif current_comps[0].strip() == 'CAR3':
            b = 0
            while b < len(tfs):
                # Split components of the sign
                sign_car_3 = tfs[b].split(':')
                result_dict['strat'][k]['tf'][b]['path'] = sign_car_3[0].strip()
                result_dict['strat'][k]['tf'][b]['sign'] = sign_car_3[1].strip()
                result_dict['strat'][k]['tf'][b]['sign_time'] = sign_car_3[2].strip()
                b += 1
        j += 1
    k += 1
Expected output
[{
    "Person": "",
    "Transport": [
        {
            "Name": "CAR1",
            "Entry": "0",
            "Exit": "0"
        },
        {
            "name": "CAR2:",
            "tf": [
                {
                    "path": "M1",
                    "sign": "YES",
                    "sign_time": "10/01/17 02:00"
                },
                {
                    "path": "M2",
                    "sign": "NO",
                    "sign_time": "10/02/16 03:00"
                },
                {
                    "path": "M3",
                    "sign": "NO",
                    "sign_time": "05/07/17 11:00"
                },
                {
                    "path": "M4",
                    "sign": "YES",
                    "sign_time": "01/01/16 03:00"
                }
            ]
        },
        {
            "name": "CAR3",
            "tf": [
                {
                    "path": "M3",
                    "sign": "NO",
                    "sign_time": "03/01/17 03:45"
                },
                {
                    "path": "M23",
                    "sign": "NO",
                    "sign_time": "01/01/14 07:00"
                },
                {
                    "path": "M27",
                    "sign": "Yes",
                    "sign_time": "02/006/18 23:00"
                }
            ]
        }
    ]
}]

The issue is that you assign the ['name'] field of result_dict['strat'][k] before result_dict['strat'] has been initialized: before your for-loop runs, the dictionary has no key called 'strat'.
You could have assigned something like result_dict['strat'] = dict() first (putting a dictionary at that key), but note what happens when you subscript it further with result_dict['strat'][k]: Python first resolves result_dict['strat'], expecting a subscriptable collection (such as a dict) in return. Since that key doesn't exist yet, it raises the KeyError.
What you could do instead is initialize a default dictionary:
from collections import defaultdict
...
result_dict = defaultdict(dict)
...
Otherwise, in your existing code, you could initialize a dict within result_dict before entering the loop.
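Note that defaultdict(dict) only auto-creates the first missing level: result_dict['strat'][k] and the deeper ['tf'][z] levels still have to exist before you can subscript into them. A minimal sketch (not the asker's full parser), assuming you want arbitrary nesting depth, is a recursive defaultdict:

from collections import defaultdict

# Every missing key yields another nested defaultdict, so arbitrarily deep
# assignments work without initializing each intermediate level by hand.
def tree():
    return defaultdict(tree)

result_dict = tree()
result_dict['strat'][0]['name'] = 'CAR1'          # no KeyError
result_dict['strat'][0]['tf'][0]['path'] = 'M1'   # intermediate levels created on the fly

If you need ordinary dictionaries downstream, you can convert the result back to plain dicts once it is fully built (e.g. via json.loads(json.dumps(result_dict))).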

Related

How to loop through values in JSON and assign to another dictionary

I am developing a Python/Django web app. I am trying to parse JSON into a python dictionary, read the values in the dictionary, and assign the values to another dictionary if certain conditions are met.
JSON is structured like this:
{content: {cars: [0, 1, 2]}, other_stuff: []}
Each car has multiple attributes:
0: {"make", "model", "power"...}
Each attribute has three variables:
make: {"value": "Toyota", "unit": "", "user_edited": "false"}
I am trying to assign the values in the JSON to other dictionaries: car_0, car_1 and car_2. In this case the JSON response is otherwise identical for each car, but the 'make' of the first car has been changed to 'Nissan', and I'm trying to change the make of car_0 to 'Nissan' as well. I'm parsing the JSON in the following way:
local_cars = [car_0, car_1, car_2] # Dictionaries which are already initialized.
print(local_cars[0] is local_cars[1]) # Prints: false
print(local_cars[0]['make']['value']) # Prints: Toyota (yes)
print(local_cars[1]['make']['value']) # Prints: Toyota (yes)
print(local_cars[2]['make']['value']) # Prints: Toyota (yes)

counter = 0
if request.method == 'POST':
    payload = json.loads(request.body)
    if bool(payload):
        print(len(local_cars)) # Prints: 3
        print(counter, payload['cars'][0]['make']['value']) # Prints: Nissan (yes)
        print(counter, payload['cars'][1]['make']['value']) # Prints: Toyota (yes)
        print(counter, payload['cars'][2]['make']['value']) # Prints: Toyota (yes)
        print(counter, local_cars[0]['make']['value']) # Prints: Toyota (yes)
        print(counter, local_cars[1]['make']['value']) # Prints: Toyota (yes)
        print(counter, local_cars[2]['make']['value']) # Prints: Toyota (yes)
        for target_car in payload['cars']: # Loop through all three cars in payload
            print(local_cars[0] is local_cars[1]) # false
            for attr in target_car.items(): # Loop through all key:dict pairs of a single car
                attribute_key = attr[0] # Key (eg. 'make')
                vars_dict = attr[1] # Dictionary of variables ('value': 'xx', 'unit': 'yy', 'user_edited': 'zz')
                if vars_dict['user_edited'] == 'true':
                    local_cars[counter][attribute_key]['user_edited'] = 'true'
                    local_cars[counter][attribute_key]['value'] = vars_dict['value']
            print(counter, local_cars[counter]['make']['value']) # Prints: 0, Toyota (yes), 1, Nissan (no!), 2, Nissan (no!)
            counter = counter + 1
What I don't understand is why the other cars, local_cars[1] and local_cars[2], are affected at all in this loop. As can be seen, for some reason their 'make' is changed to 'Nissan' even though it was 'Toyota' in the request body. This seems to happen in the first round of for target_car in payload['cars'].
Abandoning the loop/counter and focusing on one car does not make any difference:
for target_car in payload['cars']: --> target_car = payload['cars'][0]:
...
local_cars[0][attribute_key]['user_edited'] = 'true'
local_cars[0][attribute_key]['value'] = vars_dict['value']
What am I doing wrong? How can car_1 and car_2 be affected even after I change the only part of the code that edits values in those dictionaries so that it touches only local_cars[0]?
UPDATED
I received the correct answer for this. Before the part of the code originally posted, I had initialized the car_0, car_1 and car_2 dictionaries.
What I did before was:
default_car = model_to_dict(Car.objects.first())
car_0 = {}
car_1 = {}
car_2 = {}
attribute = {}
i = 0
for key, value in default_car.items():
    if i > 1:
        attribute[key] = {"value": value, "unit": units.get(key), "user_edited": "false"}
    i = i + 1
car_0.update(attribute)
car_1.update(attribute)
car_2.update(attribute)
local_cars = [car_0, car_1, car_2]
...
Apparently the problem was that all the car_x dictionaries shared a reference to the same attribute dictionary. I solved the problem by changing the car_x initialization to the following:
default_car = model_to_dict(Car.objects.first())
car_0 = {}
car_1 = {}
car_2 = {}
attribute_0 = {}
attribute_1 = {}
attribute_2 = {}
i = 0
for key, value in default_car.items():
    if i > 1:
        attribute_0[key] = {"value": value, "unit": units.get(key), "user_edited": "false"}
        attribute_1[key] = {"value": value, "unit": units.get(key), "user_edited": "false"}
        attribute_2[key] = {"value": value, "unit": units.get(key), "user_edited": "false"}
    i = i + 1
car_0.update(attribute_0)
car_1.update(attribute_1)
car_2.update(attribute_2)
local_cars = [car_0, car_1, car_2]
...
I think you are probably failing to take copies of car_0 etc. Don't forget that Python assignment is purely name-binding.
x = car_0
y = car_0
print( x['make']['value'] ) # 'Toyota'
print( y['make']['value'] ) # 'Toyota'
print( x is y ) # True. Both names refer to the same object
x['make']['value'] = 'foo'
print( y['make']['value'] ) # 'foo'
It should have been y = car_0.copy(), or rather y = copy.deepcopy(car_0) (dictionaries have no .deepcopy() method; deep copies come from the copy module), since a shallow copy would still share the nested per-attribute dicts.
I don't fully follow your code, but if you are still unsure then do some is testing to find out which entities are bound to the same object (and shouldn't be).
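For illustration, here is a minimal sketch of the aliasing and of the deep-copy fix, with the attribute structure simplified from the question:

import copy

# Simplified stand-in for the shared 'attribute' dict built from default_car.
attribute = {"make": {"value": "Toyota", "unit": "", "user_edited": "false"}}

# Shared references: update() copies key references, so both cars end up
# holding the very same nested 'make' dict.
car_0 = {}; car_0.update(attribute)
car_1 = {}; car_1.update(attribute)
print(car_0["make"] is car_1["make"])   # True -- editing one edits the other

# The fix: give each car its own deep copy, so nested dicts are independent.
car_0 = copy.deepcopy(attribute)
car_1 = copy.deepcopy(attribute)
car_0["make"]["value"] = "Nissan"
print(car_1["make"]["value"])           # still 'Toyota'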

Why do I have a large gap between Elasticsearch and Snowflake?

I have been tasked with building a process in Python that extracts data from Elasticsearch, drops it in an Azure Blob, after which Snowflake ingests it. The process runs on Azure Functions: it extracts an index group (like game_name.*) and, for each index in the group, creates a thread to scroll on. I save the last date of each result and pass it into the range query on the next run. The process runs every five minutes, and I offset the end of the range by 5 minutes (we have a refresh running every 2 minutes). I let the process run for a while and then do a gap analysis by taking a count(*) in both Elasticsearch and Snowflake by hour (or by day), expecting a gap of at most 1%. For one index pattern, which groups about 127 indexes, a catch-up job (for a day or more) produces the expected gap; however, as soon as I let it run on the cron job (every 5 min), after a while I get gaps of 6-10%, and only for this index group.
It looks as if the scroller function picks up N documents within the queried range, but then for some reason documents are later added (PUT) with an earlier date. Or I might be wrong and my code is doing something funny. I've talked to our team: they don't cache any docs on the client, the data is synced to a network clock (not the client's), and timestamps are sent in UTC.
Please see below the query I am using to paginate through Elasticsearch:
def query(searchSize, lastRowDateOffset, endDate, pit, keep_alive):
    body = {
        "size": searchSize,
        "query": {
            "bool": {
                "must": [
                    {
                        "exists": {
                            "field": "baseCtx.date"
                        }
                    },
                    {
                        "range": {
                            "baseCtx.date": {
                                "gt": lastRowDateOffset,
                                "lte": endDate
                            }
                        }
                    }
                ]
            }
        },
        "pit": {
            "id": pit,
            "keep_alive": keep_alive
        },
        "sort": [
            {
                "baseCtx.date": {"order": "asc", "unmapped_type": "long"}
            },
            {
                "_shard_doc": "asc"
            }
        ],
        "track_total_hits": False
    }
    return body
def scroller(pit,
             threadNr,
             index,
             lastRowDateOffset,
             endDate,
             maxThreads,
             es,
             lastRowCount,
             keep_alive="1m",
             searchSize=10000):
    cumulativeResultCount = 0
    iterationResultCount = 0
    data = []
    dataBytes = b''
    lastIndexDate = ''
    startScroll = time.perf_counter()
    while 1:
        if lastRowCount == 0: break
        #if lastRowDateOffset == endDate: lastRowCount = 0; break
        try:
            page = es.search(body=body)
        except: # It is believed that the point in time is getting closed, hence the below opens a new one
            pit = es.open_point_in_time(index=index, keep_alive=keep_alive)['id']
            body = query(searchSize, lastRowDateOffset, endDate, pit, keep_alive)
            page = es.search(body=body)
        pit = page['pit_id']
        data += page['hits']['hits']
        body['pit']['id'] = pit
        if len(data) > 0: body['search_after'] = [x['sort'] for x in page['hits']['hits']][-1]
        cumulativeResultCount += len(page['hits']['hits'])
        iterationResultCount = len(page['hits']['hits'])
        #print(f"This Iteration Result Count: {iterationResultCount} -- Cumulative Results Count: {cumulativeResultCount} -- {time.perf_counter() - startScroll} seconds")
        if iterationResultCount < searchSize: break
        if len(data) > rowsPerMB * maxSizeMB / maxThreads: break
        if time.perf_counter() - startScroll > maxProcessTimeSeconds: break
    if len(data) != 0:
        dataBytes = gzip.compress(bytes(json.dumps(data)[1:-1], encoding='utf-8'))
        lastIndexDate = max([x['_source']['baseCtx']['date'] for x in data])
    response = {
        "pit": pit,
        "index": index,
        "threadNr": threadNr,
        "dataBytes": dataBytes,
        "lastIndexDate": lastIndexDate,
        "cumulativeResultCount": cumulativeResultCount
    }
    return response
def batch(game_name, env='prod', startDate='auto', endDate='auto', writeDate=True, minutesOffset=5):
    es = Elasticsearch(
        esUrl,
        port=9200,
        timeout=300)
    lowerFormat = game_name.lower().replace(" ","_")
    indexGroup = lowerFormat + "*"
    if env == 'dev': lowerFormat, indexGroup = 'dev_' + lowerFormat, 'dev.' + indexGroup
    azFormat = re.sub(r'[^0-9a-zA-Z]+', '-', game_name).lower()
    storageContainerName = azFormat
    curFileName = f"{lowerFormat}_cursors.json"
    curBlobFilePath = f"cursors/{curFileName}"
    compressedTools = [gzip.compress(bytes('[', encoding='utf-8')), gzip.compress(bytes(',', encoding='utf-8')), gzip.compress(bytes(']', encoding='utf-8'))]
    pits = []
    lastRowCounts = []
    # Parameter and state settings
    if os.getenv(f"{lowerFormat}_maxSizeMB") is not None: maxSizeMB = int(os.getenv(f"{lowerFormat}_maxSizeMB"))
    if os.getenv(f"{lowerFormat}_maxThreads") is not None: maxThreads = int(os.getenv(f"{lowerFormat}_maxThreads"))
    if os.getenv(f"{lowerFormat}_maxProcessTimeSeconds") is not None: maxProcessTimeSeconds = int(os.getenv(f"{lowerFormat}_maxProcessTimeSeconds"))
    # Get all indices for the indexGroup
    indicesEs = list(set([(re.findall(r"^.*-", x)[0][:-1] if '-' in x else x) + '*' for x in list(es.indices.get(indexGroup).keys())]))
    indices = [{"indexName": x, "lastOffsetDate": (datetime.datetime.utcnow()-datetime.timedelta(days=5)).strftime("%Y/%m/%d 00:00:00")} for x in indicesEs]
    # Load Cursors
    cursors = getCursors(curBlobFilePath, indices)
    # Offset the current time by -5 minutes to account for the 2-3 min delay in Elasticsearch
    initTime = datetime.datetime.utcnow()
    if endDate == 'auto': endDate = f"{initTime-datetime.timedelta(minutes=minutesOffset):%Y/%m/%d %H:%M:%S}"
    print(f"Less than or Equal to: {endDate}, {keep_alive}")
    # Start Multi-Threading
    while 1:
        dataBytes = []
        dataSize = 0
        start = time.perf_counter()
        if len(pits) == 0: pits = ['' for x in range(len(cursors))]
        if len(lastRowCounts) == 0: lastRowCounts = ['' for x in range(len(cursors))]
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(cursors)) as executor:
            results = [
                executor.submit(
                    scroller,
                    pit,
                    threadNr,
                    x['indexName'],
                    x['lastOffsetDate'] if startDate == 'auto' else startDate,
                    endDate,
                    len(cursors),
                    es,
                    lastRowCount,
                    keep_alive,
                    searchSize) for x, pit, threadNr, lastRowCount in (zip(cursors, pits, list(range(len(cursors))), lastRowCounts))
            ]
            for f in concurrent.futures.as_completed(results):
                if f.result()['lastIndexDate'] != '': cursors[f.result()['threadNr']]['lastOffsetDate'] = f.result()['lastIndexDate']
                pits[f.result()['threadNr']] = f.result()['pit']
                lastRowCounts[f.result()['threadNr']] = f.result()['cumulativeResultCount']
                dataSize += f.result()['cumulativeResultCount']
                if len(f.result()['dataBytes']) > 0: dataBytes.append(f.result()['dataBytes'])
                print(f"Thread {f.result()['threadNr']+1}/{len(cursors)} -- Index {f.result()['index']} -- Results pulled {f.result()['cumulativeResultCount']} -- Cumulative Results: {dataSize} -- Process Time: {round(time.perf_counter()-start, 2)} sec")
        if dataSize == 0: break
        lastRowDateOffsetDT = datetime.datetime.strptime(max([x['lastOffsetDate'] for x in cursors]), '%Y/%m/%d %H:%M:%S')
        outFile = f"elasticsearch/live/{lastRowDateOffsetDT:%Y/%m/%d/%H}/{lowerFormat}_live_{lastRowDateOffsetDT:%Y%m%d%H%M%S}_{datetime.datetime.utcnow():%Y%m%d%H%M%S}.json.gz"
        print(f"Starting compression of {dataSize} rows -- {round(time.perf_counter()-start, 2)} sec")
        dataBytes = compressedTools[0] + compressedTools[1].join(dataBytes) + compressedTools[2]
        # Upload to Blob
        print(f"Commencing to upload data to blob -- {round(time.perf_counter()-start, 2)} sec")
        uploadJsonGzipBlobBytes(outFile, dataBytes, storageContainerName, len(dataBytes))
        print(f"File compiled: {outFile} -- {dataSize} rows -- Process Time: {round(time.perf_counter()-start, 2)} sec\n")
        # Update cursors
        if writeDate: postCursors(curBlobFilePath, cursors)
    # Clean Up
    print("Closing PITs")
    for pit in pits:
        try: es.close_point_in_time({"id": pit})
        except: pass
    print(f"Closing Connection to {esUrl}")
    es.close()
    return

# Start the process
while 1:
    batch("My App")
I think I just need a second pair of eyes to point out where the issue might be in the code. I've tried increasing the minutesOffset argument to 60 (so every 5 minutes it pulls the data from the last run until now minus 60 minutes) but had the same issue. Please help.
So the "baseCtx.date" is triggered by the client and it seems that in some cases there is a delay between when the event is triggered and when it is available to be searched. We fixed this by using the ingest pipeline as follows:
PUT _ingest/pipeline/indexDate
{
  "description": "Creates a timestamp when a document is initially indexed",
  "version": 1,
  "processors": [
    {
      "set": {
        "field": "indexDate",
        "value": "{{{_ingest.timestamp}}}",
        "tag": "indexDate"
      }
    }
  ]
}
We then set index.default_pipeline to "indexDate" in the index template settings. The index name changes every month (we append the year and month), so applying the pipeline through the template ensures every new index gets a server-side date we can scroll on.
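For illustration, a minimal sketch (field and parameter names assumed from the pipeline and code above) of how the pagination query could range and sort on the server-side timestamp instead of the client-generated baseCtx.date, so documents that arrive late with older client dates are no longer skipped:

def query_by_index_date(searchSize, lastRowDateOffset, endDate, pit, keep_alive):
    # Same shape as the original query(), but targeting "indexDate",
    # the field the ingest pipeline sets at index time.
    return {
        "size": searchSize,
        "query": {
            "bool": {
                "must": [
                    {"exists": {"field": "indexDate"}},
                    {"range": {"indexDate": {"gt": lastRowDateOffset, "lte": endDate}}}
                ]
            }
        },
        "pit": {"id": pit, "keep_alive": keep_alive},
        "sort": [
            {"indexDate": {"order": "asc", "unmapped_type": "long"}},
            {"_shard_doc": "asc"}
        ],
        "track_total_hits": False
    }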

create dict keys depending on the number of times the same value occurs

I have a dict as below. If the same value is found more than once, then a new dict key must be created with incremental numbering.
TT = {
    "TYPE_1": ['ERROR'],
    "TYPE_2": ['FATAL'],
    "TYPE_3": ["TIME_OUT"],
    "TYPE_4": ['SYNTAX'],
    "TYPE_5": ['COMPILE'],
}
m1 = "ERROR the input is not proper"
m2 = "This should have not occured FATAL"
m3 = "Sorry TIME_OUT"
m4 = "SYNTAX not proper"
m5 = "u r late its TIME_OUT"
The value "TIME_OUT" occur twice in the search.
count = 0
for key in TT.keys():
    print(key)
    Key_1 = key
    while key_1 in TT:
        count = count + 1
        key_1 = key + "_{}".format(count)
The above code gives a NameError saying key_1 is not defined.
Expected OUTPUT:
If the same value occurs more than once, then an additional dict key should be created with incremental numbering, e.g. "TYPE_3_1": ["TIME_OUT"]:
TT = {
    "TYPE_1": ['ERROR'],
    "TYPE_2": ['FATAL'],
    "TYPE_3": ["TIME_OUT"],
    "TYPE_3_1": ["TIME_OUT"],
    "TYPE_4": ['SYNTAX'],
    "TYPE_5": ['COMPILE'],
}
Please suggest on this.
There may be a much more efficient way of solving this if you can rethink some of the data structures, but if that is not an option you can try this:
inputs = ["ERROR the input is not proper",
"This should have not occured FATAL",
"Sorry TIME_OUT",
"SYNTAX not proper",
"u r late its TIME_OUT"]
basic_types = {
"TYPE_1" : ['ERROR'],
"TYPE_2": ['FATAL'],
"TYPE_3" : ["TIME_OUT"],
"TYPE_4" : ['SYNTAX'],
"TYPE_5" : ['COMPILE'],
}
type_counts = {}
results = {}
for sentence in inputs:
for basic_type in basic_types:
if basic_types.get(basic_type)[0] in sentence:
type_counts[basic_type] = type_counts.get(basic_type, 0) + 1
if type_counts[basic_type] == 1:
results[basic_type] = [basic_types.get(basic_type)[0]]
else:
results[basic_type+"_{}".format(type_counts[basic_type] - 1)] = [basic_types.get(basic_type)[0]]
print(results)
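For the five sample messages above, this should print something like the following (key order reflects insertion order):

{'TYPE_1': ['ERROR'], 'TYPE_2': ['FATAL'], 'TYPE_3': ['TIME_OUT'], 'TYPE_4': ['SYNTAX'], 'TYPE_3_1': ['TIME_OUT']}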

Python Iterating New Data into nested dictionary

I have been working on a Python Role-Playing game and I have a function to import item data from a text file. The text file is structured as follows:
WEAPON 3 sword_of_eventual_obsolescence 6 10 2 0 10
WEAPON 4 dagger_of_bluntness 2 5 3 1 0
WEAPON 5 sword_of_extreme_flimsiness 3 8 3 7 0
The data importing goes like this:
def items_get():
    import os
    global items
    items = {
        "weapon": {},
        "armour": {},
        "potion": {},
        "misc": {}
    }
    file_dir = ( os.getcwd() + '\Code\items.txt' )
    file_in = open( file_dir, 'r')
    for each_line in file_in:
        line = file_in.readline()
        line = line.split(' ')
        if line[0] == "WEAPON":
            weapon_id = line[1]
            name = line[2]
            attack_min = line[3]
            attack_max = line[4]
            range = line[5]
            weight = line[6]
            value = line[7]
            weapon_data = {
                "name": name.replace('_', ' '),
                "atk_min": attack_min,
                "atk_max": attack_max,
                "rng": range,
                "wt": weight,
                "val": value,
            }
            items["weapon"][weapon_id] = {}
            items["weapon"][weapon_id].update(weapon_data)
However, when I print items["weapon"], I get this:
{'4': {'wt': '1', 'atk_min': '2', 'atk_max': '5', 'val': '0', 'name': 'dagger of bluntness', 'rng': '3'}}
As you can see, there is only 1 item there. On other occasions I have had two even though I actually have 3 items listed. Why is this happening, and how do I get all 3 items in the dictionary?
Thanks!
:P
EDIT: Here is the data for the potions, in case you were wondering.
elif line.split()[0] == "POTION":
    _, id, name, hp_bonus, atk_bonus, range_bonus, ac_bonus, str_bonus, con_bonus, dex_bonus, int_bonus, wis_bonus, cha_bonus, wt, val = line.split()
A healing potion looks like this in the file:
POTION 1 potion_of_healing 20 0 0 0 0 0 0 0 0 0 0.1 2
for each_line in file_in:
    line = file_in.readline()
each_line already contains the next line, because iterating through a file-like object (say, with a for loop) causes it to go by lines.
On each iteration of the loop, the file pointer is advanced by one line (file-like objects, though rewindable, keep track of their last-accessed position), and then before anything is done it gets advanced once more by the readline(), so the only line that doesn't get skipped entirely is the middle one (4).
To fix this, use the loop variable (each_line) within the loop body directly and nix the file_in.readline().
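A minimal sketch of the fixed loop (with the path shortened for illustration):

# Iterate the file object directly and use the loop variable itself;
# calling readline() inside the loop advances the file a second time
# and skips every other line.
with open('items.txt', 'r') as file_in:
    for each_line in file_in:
        line = each_line.strip().split(' ')
        if line and line[0] == "WEAPON":
            weapon_id, name = line[1], line[2]
            print(weapon_id, name.replace('_', ' '))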
#noname1014, I know you know this, but I want to point out a few problems with your code (that may occur in some special cases, e.g. if you change your file name from items.txt to new_items.txt, rare_fruits.txt etc.) and make some suggestions.
Do not use \ as a path separator on Windows. Use \\ instead, otherwise you may run into problems: '\Code\timeitems.txt' will be evaluated as '\Code<TAB>imeitems.txt' because \t is a TAB character here.
Using a single \ only works in the cases where the backslash and the character that follows it (A, p, n, t, ", ' etc.) do not happen to form an escape sequence such as \n, \t, \f, \r, \b etc.
Have a look at the below example for clarification.
>>> import os
>>>
>>> print(os.getcwd() + '\Code\timeitems.txt')
E:\Users\Rishikesh\Python3\Practice\Code imeitems.txt
>>>
>>> print(os.getcwd() + '\Code\\timeitems.txt')
E:\Users\Rishikesh\Python3\Practice\Code\timeitems.txt
>>>
>>> print(os.getcwd() + '\Code\newitems.txt')
E:\Users\Rishikesh\Python3\Practice\Code
ewitems.txt
>>>
>>> print(os.getcwd() + '\\Code\\newitems.txt')
E:\Users\Rishikesh\Python3\Practice\Code\newitems.txt
>>>
>>> # Do not use it as it may work only in some cases if \ followed by any character does not construct escape sequences.
...
>>> os.getcwd() + '\Code\items.txt'
'E:\\Users\\Rishikesh\\Python3\\Practice\\Code\\items.txt'
>>>
>>> # Use \\ as path separators
...
>>> os.getcwd() + '\\Code\\items.txt'
'E:\\Users\\Rishikesh\\Python3\\Practice\\Code\\items.txt'
>>>
>>> print(os.getcwd() + '\Code\items.txt')
E:\Users\Rishikesh\Python3\Practice\Code\items.txt
>>>
>>> print(os.getcwd() + '\\Code\\items.txt')
E:\Users\Rishikesh\Python3\Practice\Code\items.txt
>>>
If your dictionary is huge and you are having trouble inspecting its items, pretty-print it using the json module; its dumps() function pretty-prints list and dictionary objects.
It is OK to place import statements inside a function, but placing them at the top of the module is the Pythonic way (https://www.python.org/dev/peps/pep-0008/#imports), and it helps in large applications with multiple functions in the same module.
Use the with statement for opening files; that way you do not need to close them explicitly.
Your code is otherwise fine, I have just modified it as below.
import os
import json

def items_get():
    items = {
        "weapon": {},
        "armour": {},
        "potion": {},
        "misc": {}
    }
    # Do not use \ as path separators in Windows. Use \\ (\t, \n, \' have special meanings)
    file_dir = ( os.getcwd() + '\\Code\\items.txt' )
    with open( file_dir, 'r') as file_in:
        lines = file_in.readlines()
        # ['WEAPON 3 sword_of_eventual_obsolescence 6 10 2 0 10\n', 'WEAPON 4 dagger_of_bluntness 2 5 3 1 0\n', 'WEAPON 5 sword_of_extreme_flimsiness 3 8 3 7 0']
    for each_line in lines:
        # Use strip() to remove any leading/trailing whitespace (\n, \t, spaces etc.)
        line = each_line.strip().split(' ')
        if line[0] == "WEAPON":
            weapon_id = line[1]
            name = line[2]
            attack_min = line[3]
            attack_max = line[4]
            range = line[5]
            weight = line[6]
            value = line[7]
            weapon_data = {
                "name": name.replace('_', ' '),
                "atk_min": attack_min,
                "atk_max": attack_max,
                "rng": range,
                "wt": weight,
                "val": value,
            }
            items["weapon"][weapon_id] = {}
            items["weapon"][weapon_id].update(weapon_data)
    return items

# Calling items_get() to get the dictionary
items = items_get()

# Pretty printing the dictionary using json.dumps()
print(json.dumps(items, indent=4))
» Output
{
    "weapon": {
        "3": {
            "name": "sword of eventual obsolescence",
            "atk_min": "6",
            "atk_max": "10",
            "rng": "2",
            "wt": "0",
            "val": "10"
        },
        "4": {
            "name": "dagger of bluntness",
            "atk_min": "2",
            "atk_max": "5",
            "rng": "3",
            "wt": "1",
            "val": "0"
        },
        "5": {
            "name": "sword of extreme flimsiness",
            "atk_min": "3",
            "atk_max": "8",
            "rng": "3",
            "wt": "7",
            "val": "0"
        }
    },
    "armour": {},
    "potion": {},
    "misc": {}
}

pandas custom file format parsing

I have data in the following format:
1_engineer_grade1 |Boolean IsMale IsNorthAmerican IsFromUSA |Name blah
2_lawyer_grade7 |Boolean IsFemale IsAlive |Children 2
I need to convert this into a dataframe with the following columns:
id job grade Bool.IsMale Bool.IsFemale Bool.IsAlive Bool.IsNorthAmerican Bool.IsFromUSA Name Children
1 engineer 1 True False False True True blah NaN
2 lawyer 7 False True True True False NaN 2
I could preprocess this data in Python and then call pd.DataFrame on it, but I was wondering if there was a better way of doing this?
UPDATE: I ended up doing the following. If there are obvious optimizations, please let me know:
with open(vwfile, encoding='latin-1') as f:
    data = []
    for line in f:
        line = [x.strip() for x in line.strip().split('|')]
        # line == [
        #     "1_engineer_grade1",
        #     "Boolean IsMale IsNorthAmerican IsFromUSA",
        #     "Name blah"
        # ]
        ident, job, grade = line[0].split("_")
        features = line[1:]
        bools = {
            "IsMale": False,
            "IsFemale": False,
            "IsNorthAmerican": False,
            "IsFromUSA": False,
            "IsAlive": False,
        }
        others = {}
        for category in features:
            if category.startswith("Boolean "):
                for feature in category.split(' ')[1:]:
                    bools[feature] = True
            else:
                feature = category.split(" ")
                # feature == ["Name", "blah"]
                others[feature[0]] = feature[1]
        featuredict = {
            'ident': ident,
            'job': job,
            'grade': grade,
        }
        featuredict.update(bools)
        featuredict.update(others)
        data.append(featuredict)
df = pd.DataFrame(data)
UPDATE 2: A million-line file took about 55 seconds to process with this approach.
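For what it's worth, a minimal sketch of an alternative (column names assumed from the question): build plain per-line dicts as above, then let pandas split the id/job/grade field in one vectorised pass instead of per line in Python:

import pandas as pd

# Two hand-written rows standing in for the parsed per-line dicts.
rows = [
    {"raw_id": "1_engineer_grade1", "Bool.IsMale": True, "Name": "blah"},
    {"raw_id": "2_lawyer_grade7", "Bool.IsFemale": True, "Children": "2"},
]
df = pd.DataFrame(rows)

# Split "1_engineer_grade1" into id / job / grade with a single regex.
df[["id", "job", "grade"]] = df["raw_id"].str.extract(r"(\d+)_(\w+?)_grade(\d+)")
df = df.drop(columns="raw_id")
print(df)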
