How to get information with Python when data is heavily nested

I have a text file which contains some data to be mined.
The structure is shown below
name (personA {
field1 : data1
field2 : data2
fieldN : dataN
subfield() {
fieldx1 : datax1
fieldxN : dataxN
}
}
name (personB {
field1 : data11
field2 : data12
fieldN : data1N
}
In some records the subfield block is absent, and in that case the output should report the subfield as unknown. Below is the code I use to extract the data:
import re
data = dict()
with open('data.txt', 'r') as fin:
FLAG, FLAGP, FLAGS = False, False, False
for line in fin:
if FLAG:
if re.search('field1', line):
d1 = line.split()[2]
data['field1'] = d1
if re.search('fieldN', line):
dN = line.split()[2]
data['fieldN'] = dN
data['fieldxn'] = 'unknown'
FLAGP = True
if FLAGS:
if re.search('fieldxN', line):
dsN = line.split()[2]
data['fieldxn'] = dsN
if re.search('name\ \(', line):
pn = line.split()[1]
FLAG = True
data['name'] = pn
if re.search('subfield', line):
FLAGS = True
if len(data) == 4:
if FLAGP:
print data
FLAGP = False
FLAG = False
FLAGS = False
The output is shown below
{'field1': 'data1', 'fieldN': 'dataN', 'name': '(personA', 'fieldxn': 'unknown'}
{'field1': 'data11', 'fieldN': 'data1N', 'name': '(personB', 'fieldxn': 'unknown'}
The problem is that I don't know where to print data, so currently I am using the statement below, which is wrong:
if len(data) == 4:
if FLAGP:
print data
FLAGP = False
FLAG = False
FLAGS = False
I would appreciate any suggestions on how to retrieve the data correctly.

I would take a different approach to parsing, storing the subfields (and other fields) in a dictionary.
# Read the whole file up front; the with block closes it afterwards
with open('data.txt', 'rt') as fin:
    data = fin.read()
### Given a string containing lines of "fieldX : valueY"
### return a dictionary of values
def getFields(field_data):
    fields = {}
    if field_data is not None:
        field_lines = field_data.strip().split("\n")
        for pair in field_lines:
            # Split on the first colon only, so values may contain colons
            name, value = pair.split(":", 1)
            fields[name.strip()] = value.strip()
    return fields
### Split the data by name
people_data = data.strip().split("name (")[1:]
### Loop through every person record
for person_data in people_data:
    name, person_data = person_data.split(" {", 1)  # split the name and the fields
    # Split out the subfield data, if any
    subfield_data = None
    if "subfield()" in person_data:
        field_data, subfield_data = person_data.split("subfield() {", 1)
        subfield_data = subfield_data.split("}")[0]
    else:
        # No subfield block: the fields run up to the closing brace
        field_data = person_data.split("}")[0]
    # Separate the fields into single lines of pairs
    fields = getFields(field_data)
    # and any subfields
    subfields = getFields(subfield_data)
    print("Person: "+str(name))
    print("Fields: "+str(fields))
    print("Sub_Fields:"+str(subfields))
Which gives me:
Person: personA
Fields: {'field1': 'data1', 'field2': 'data2', 'fieldN': 'dataN'}
Sub_Fields:{'fieldx1': 'datax1', 'fieldxN': 'dataxN'}
Person: personB
Fields: {'field1': 'data11', 'field2': 'data12', 'fieldN': 'data1N'}
Sub_Fields:{}
So you can adjust your output based on whether subfields is empty. The idea is to get your input into more flexible structures, rather than "brute-force" parsing as you have done. In the above I use split() a lot to give a more flexible way through, rather than relying on finding exact names. Obviously it depends on your design requirements too.
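To report a missing subfield as unknown, as the question asks, one option is dict.get with a default value. This is only a minimal sketch: it assumes the subfields arrive as a dict like the one getFields returns, and the helper name subfield_or_unknown and the sample dicts are made up for illustration.

```python
def subfield_or_unknown(subfields, key, default="unknown"):
    """Return subfields[key], or a placeholder when the key is absent."""
    return subfields.get(key, default)

# Sample dicts shaped like the parser output (illustrative values).
person_a_subfields = {"fieldx1": "datax1", "fieldxN": "dataxN"}
person_b_subfields = {}  # personB has no subfield block

print(subfield_or_unknown(person_a_subfields, "fieldxN"))  # dataxN
print(subfield_or_unknown(person_b_subfields, "fieldxN"))  # unknown
```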

Related

Get the value of a specific record in dicts with lists in python

I have a dict like this:
contactos = dict([
"id", id,
"nombres", nombres,
"apellidos", apellidos,
"telefonos", telefonos,
"correos", correos
])
And it works when I put a new register in every key:value, my problem is, how can I get the record for only one contact?
I have a part where I can input a number and search the position in the list of the dict, then I want to only show the record of that specific record in every key:value
I made this code, but it doesn't work.
telefo = input(Fore.LIGHTGREEN_EX + "TELEFONO CONTACTO: " + Fore.RESET)
for x in range(len(telefonos)):
if(telefonos[x] == telefo):
print(contactos["telefonos"][x])
else:
print("No encontrado")
I print only the telefono value because this is just my test code.
This should be your working script:
# I imagine your data to be something like this. If it isn't, sorry:
id = 0
nombres = ['John', 'Anna', 'Robert']
apellidos = ['J.', 'A.', 'Rob.']
telefonos = ['333-444', '222-111', '555-888']
correos = ['john#email.com', 'anna#email.com', 'rob#email.com']
# This is the part where you went wrong.
# Dictionaries are created with {}
#
# [] creates a list, not a dictionary structure.
#
# Also, keys and values must be grouped as:
# "key": value
contactos = {
    "id": id,
    "nombres": nombres,
    "apellidos": apellidos,
    "telefonos": telefonos,
    "correos": correos
}
# Now, imagine this is the input from the user:
telefo = "333-444"
for x in range(len(telefonos)):
    if telefonos[x] == telefo:
        print(contactos["telefonos"][x])
        break
else:
    # for/else: this runs only when the loop ends without hitting break
    print("No encontrado")
When testing the script, the output is 333-444.
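The question also asks how to show the whole record for one contact rather than only the phone number. A minimal sketch of one way to do that, assuming every list in contactos has the same length and order; the function name buscar_contacto and the sample data are made up for illustration.

```python
def buscar_contacto(contactos, telefono):
    """Return the full record for the contact with this phone, or None."""
    try:
        i = contactos["telefonos"].index(telefono)
    except ValueError:
        return None
    # Pick position i out of every parallel list.
    campos = ("nombres", "apellidos", "telefonos", "correos")
    return {campo: contactos[campo][i] for campo in campos}

contactos = {
    "nombres": ["John", "Anna"],
    "apellidos": ["J.", "A."],
    "telefonos": ["333-444", "222-111"],
    "correos": ["john#email.com", "anna#email.com"],
}
print(buscar_contacto(contactos, "222-111"))
print(buscar_contacto(contactos, "000-000"))  # None
```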

How to split a text file into a nested array?

I'm working on a project creating a Python Flask website that stores user logins in a text file. Each line of the text file is one user, and each user has 7 parameters stored on the line. All user parameters are separated by a ; character.
Parameters are:
username
password
first name
last name
background color
title
avatar
Sample of the text file:
joebob;pass1;joe;bob;yellow;My title!!;https://upload.wikimedia.org/wikipedia/commons/c/cd/Stick_Figure.jpg
richlong;pass2;rich;long;blue;My title2!!;https://www.iconspng.com/images/stick-figure-walking/stick-figure-walking.jpg
How do I go about storing the parameters in a Python array, and how do I access them later when I need to reference logins?
Here is what I wrote so far:
accounts = { }
def readAccounts():
file = open("assignment11-account-info.txt", "r")
for accounts in file: #line
tmp = accounts.split(';')
for data in tmp: #data in line
accounts[data[0]] = {
'user': data[0],
'pass': data[1],
'first': data[2],
'last': data[3],
'color': data[4],
'title': data[5],
'avatar': data[6].rstrip()
}
file.close()
You can use Python's built-in csv module to parse the file:
import csv
fields = ('user', 'passwd', 'first', 'last', 'color', 'title', 'avatar')
with open("assignment11-account-info.txt", "r") as file:
    reader = csv.reader(file, delimiter=';')
    result = []
    for row in reader:
        # Pair each column name with the value in the same position
        res = dict(zip(fields, row))
        result.append(res)
Or, equivalently, as a list comprehension, which is more Pythonic but harder for a beginner to read:
with open("assignment11-account-info.txt", "r") as file:
    reader = csv.reader(file, delimiter=';')
    fields = ('user', 'passwd', 'first', 'last', 'color', 'title', 'avatar')
    result = [dict(zip(fields, row)) for row in reader]
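The question also asks how to access the accounts later. One possibility is to key the parsed rows by username so a login can be looked up directly. This is a sketch under assumptions: usernames are unique, io.StringIO stands in for the real file, the avatar URLs are shortened placeholders, and check_login is a hypothetical helper.

```python
import csv
import io

FIELDS = ("user", "passwd", "first", "last", "color", "title", "avatar")

# Stand-in for the real file; the avatar URLs are placeholders.
sample = io.StringIO(
    "joebob;pass1;joe;bob;yellow;My title!!;https://example.com/a.jpg\n"
    "richlong;pass2;rich;long;blue;My title2!!;https://example.com/b.jpg\n"
)

# Map each username to its row dict for direct lookup.
accounts = {
    row[0]: dict(zip(FIELDS, row))
    for row in csv.reader(sample, delimiter=";")
    if row  # skip blank lines
}

def check_login(accounts, user, passwd):
    """Return True when the user exists and the password matches."""
    account = accounts.get(user)
    return account is not None and account["passwd"] == passwd

print(accounts["joebob"]["color"])                 # yellow
print(check_login(accounts, "richlong", "pass2"))  # True
print(check_login(accounts, "nobody", "x"))        # False
```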
Here's what I might do:
accounts = {}
with open("assignment11-account-info.txt", "r") as file:
    for line in file:
        fields = line.rstrip().split(";")
        user = fields[0]
        password = fields[1]  # "pass" is a reserved word, so it cannot be a variable name
        first = fields[2]
        last = fields[3]
        color = fields[4]
        title = fields[5]
        avatar = fields[6]
        accounts[user] = {
            "user" : user,
            "pass" : password,
            "first" : first,
            "last" : last,
            "color" : color,
            "title" : title,
            "avatar" : avatar
        }
By using with, the file handle file is closed for you automatically. This is the most Pythonic way of doing things.
So long as user is unique, you won't overwrite any entries you put in as you read through the file assignment11-account-info.txt.
If you need to deal with a case where user is repeated in the file assignment11-account-info.txt, then store a list ([...]) of records under each user rather than a single dictionary ({...}). Otherwise, reusing the value of user overwrites the previous entry for that user in accounts, and silently losing entries like that is almost always a bad thing.
If that is the case, I might do the following:
accounts = {}
with open("assignment11-account-info.txt", "r") as file:
    for line in file:
        fields = line.rstrip().split(";")
        user = fields[0]
        password = fields[1]  # "pass" is a reserved word, so it cannot be a variable name
        first = fields[2]
        last = fields[3]
        color = fields[4]
        title = fields[5]
        avatar = fields[6]
        # Start an empty list the first time this user is seen
        if user not in accounts:
            accounts[user] = []
        accounts[user].append({
            "user" : user,
            "pass" : password,
            "first" : first,
            "last" : last,
            "color" : color,
            "title" : title,
            "avatar" : avatar
        })
In this way, you preserve any cases where user is duplicated.

extracting name, email and number and save it into a variable

I want to extract the name, email and phone number of all the conversations and then save them into different variables. I want to save it like this: a=max, b=email and so on.
This is my text file:
[11:23] max : Name : max
Email : max#gmail.com
Phone : 01716345678
[11:24] harvey : hello there how can i help you
[11:24] max : can you tell me about the latest feature
and this is my code. What am I missing here?
in_file = open("chat.txt", "rt")
contents = in_file.read()
#line: str
for line in in_file:
if line.split('Name :'):
a=line
print(line)
elif line.split('Email :'):
b = line
elif line.split('Phone :'):
c = line
else:
d = line
That's not what split does, at all. You might be getting it confused with in.
In any case, a regular expression will do:
import re
string = '''[11:23] max : Name : max
Email : max#gmail.com
Phone : 01716345678
[11:24] harvey : hello there how can i help you
[11:24] max : can you tell me about the latest feature'''
keys = ['Name', 'Email', 'Phone', 'Text']
result = re.search(r'.+Name : (\w+).+Email : ([\w#\.]+).+Phone : (\d+)(.+)', string, flags=re.DOTALL).groups()
{key: data for key, data in zip(keys, result)}
Output:
{'Name': 'max',
'Email': 'max#gmail.com',
'Phone': '01716345678',
'Text': '\n\n[11:24] harvey : hello there how can i help you\n[11:24] max : can you tell me about the latest feature'}
Remove this line from your code, because read() consumes the whole file and leaves nothing for the loop to iterate over:
"contents = in_file.read()"
Also, use "in" instead of "split":
in_file = open("chat.txt", "rt")
for line in in_file:
    if 'Name' in line:
        a = line
        print(a)
    elif 'Email' in line:
        b = line
        print(b)
    elif 'Phone' in line:
        c = line
        print(c)
    else:
        d = line
        print(d)
in_file.close()
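The difference between split and in that both answers point to can be seen on a single line of the chat file. A small sketch, using one sample line from the question; the final split(":", 1) step is an added suggestion for pulling out the value, not part of the original answers.

```python
line = "Email : max#gmail.com"

# split() always returns a list, and a non-empty list is truthy,
# so `if line.split('Name :')` passes even when the text is absent.
parts = line.split("Name :")
print(parts)        # ['Email : max#gmail.com']
print(bool(parts))  # True

# `in` tests for the substring itself.
print("Name :" in line)   # False
print("Email :" in line)  # True

# Once the line is identified, split(':', 1) separates label and value.
label, value = line.split(":", 1)
print(label.strip(), value.strip())  # Email max#gmail.com
```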

Consolidating row data from DB into a list of dicts

I'm reading data from a SQLite SELECT statement. The data comes in the following form:
ID|Phone|Email|Status|Role
Multiple rows may be returned for the same ID, Phone, or Email. And for a given row, either Phone or Email can be empty/NULL. However, for the same ID, Status always has the same value, and so does Role. For example:
1|1234567892|a#email.com| active |typeA
2|3434567893|b#email.com| active |typeB
2|3434567893|c#email.com| active |typeB
3|5664567891|d#email.com|inactive|typeC
3|7942367891|d#email.com|inactive|typeC
4|5342234233| NULL | active |typeD
5| NULL |e#email.com| active |typeD
These data are returned as a list by Sqlite3, let's call it results. I need to go through them and reorganize the data to construct another list structure in Python. The final list basically consolidates the data for each ID, such that:
Each item of the final list is a dict, one for each unique ID in results. In other words, multiple rows for the same ID will be merged.
Each dict contains these keys: 'id', 'phones', 'emails', 'types', 'role', 'status'.
'phones' and 'emails' are lists that contain zero or more items, with no duplicates.
'types' is also a list; it contains 'phone', 'email', or both, with no duplicates.
The order of dicts in the final list does not matter.
So far I have come up with this:
processed = {}
for r in results:
if r['ID'] in processed:
p_data = processed[r['ID']]
if r['Phone']:
p_data['phones'].add(r['Phone'])
p_data['types'].add('phone')
if r['Email']:
p_data['emails'].add(r['Email'])
p_data['types'].add('email')
else:
p_data = {'id': r['ID'], 'status': r['Status'], 'role': r['Role']}
if r['Phone']:
p_data['phones'] = set([r['Phone']])
p_data.setdefault('types', set).add('phone')
if r['Email']:
p_data['emails'] = set([r['Email']])
p_data.setdefault('types', set).add('email')
processed[r['ID']] = p_data
consolidated = list(processed.values())
I wonder if there is a faster and/or more concise way to do this.
EDIT:
A final detail: I would prefer to have 'phones', 'emails', and 'types' in each dict as list instead of set. The reason is that I need to dump consolidated into JSON, and JSON does not allow set.
When faced with something like this I usually use:
import collections
processed = collections.defaultdict(lambda: {'phone': set(), 'email': set(), 'status': None, 'type': set()})
and then something like:
for r in results:
    for field in ['Phone', 'Email']:
        if r[field]:
            processed[r['ID']][field.lower()].add(r[field])
            processed[r['ID']]['type'].add(field.lower())
Finally, you can dump it into a dictionary or a list:
a_list = list(processed.items())  # list() is needed on Python 3, where items() returns a view
a_dict = dict(a_list)
Regarding the JSON problem with sets, you can either convert the sets to lists right before serializing or write a custom encoder (very useful!). Here is an example of one I have for dates, extended to handle sets:
import datetime
import json
import time
class JSONDateTimeEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            # Dates become Unix timestamps
            return int(time.mktime(obj.timetuple()))
        elif isinstance(obj, set):
            return list(obj)
        try:
            return json.JSONEncoder.default(self, obj)
        except TypeError:
            # Fall back to the string form for anything else
            return str(obj)
and to use it:
json.dumps(a_list, sort_keys=True, indent=2, cls=JSONDateTimeEncoder)
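The other option mentioned in this answer, converting the sets to lists right before serializing, can be sketched as follows. The sample record and the helper to_jsonable are assumptions for illustration; sorting is used only to make the output deterministic.

```python
import json

def to_jsonable(record):
    """Copy a record, turning any set value into a sorted list."""
    return {
        key: sorted(value) if isinstance(value, set) else value
        for key, value in record.items()
    }

# Illustrative record shaped like the consolidated data in the question.
record = {
    "id": 2,
    "phones": {"3434567893"},
    "emails": {"b#email.com", "c#email.com"},
    "types": {"phone", "email"},
    "status": "active",
    "role": "typeB",
}

print(json.dumps(to_jsonable(record), sort_keys=True))
```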
I assume results is a 2d list:
print(results)
#[['1', '1234567892', 'a#email.com', ' active ', 'typeA'],
#['2', '3434567893', 'b#email.com', ' active ', 'typeB'],
#['2', '3434567893', 'c#email.com', ' active ', 'typeB'],
#['3', '5664567891', 'd#email.com', 'inactive', 'typeC'],
#['3', '7942367891', 'd#email.com', 'inactive', 'typeC'],
#['4', '5342234233', ' NULL ', ' active ', 'typeD'],
#['5', ' NULL ', 'e#email.com', ' active ', 'typeD']]
Now we group this list by id:
from itertools import groupby
# groupby only merges adjacent items, so sort by id first
data_grouped = [(k, list(v)) for k, v in groupby(sorted(results, key=lambda x: x[0]), lambda x: x[0])]
# make list of column names (should correspond to results). These will be dict keys
names = ['id', 'phone', 'email', 'status', 'role']
ID_info = {g[0]: {names[i]: list(list(map(set, zip(*g[1])))[i]) for i in range(len(names))} for g in data_grouped}
Now for the types:
for k in ID_info:
    email = [i for i in ID_info[k]['email'] if i.strip() != 'NULL' and i != '']
    phone = [i for i in ID_info[k]['phone'] if i.strip() != 'NULL' and i != '']
    if email and phone:
        ID_info[k]['types'] = ['phone', 'email']
    elif email and not phone:
        ID_info[k]['types'] = ['email']
    elif phone and not email:
        ID_info[k]['types'] = ['phone']
    else:
        ID_info[k]['types'] = []
    # collapse the single-valued columns from one-item lists to plain values
    ID_info[k]['id'] = ID_info[k]['id'][0]
    ID_info[k]['role'] = ID_info[k]['role'][0]
    ID_info[k]['status'] = ID_info[k]['status'][0]
And what you asked for (a list of dicts) is returned by ID_info.values()
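For comparison, here is a compact sketch of the whole consolidation that builds lists directly, so the result can go straight to JSON. It assumes each row can be indexed by column name (for example a plain dict, or a row from a cursor configured to return name-indexed rows) and that missing values arrive as None or an empty string; consolidate is a hypothetical helper and the rows are a subset of the question's sample data.

```python
def consolidate(results):
    """Merge rows sharing an ID into one dict per ID."""
    processed = {}
    for r in results:
        # Create the per-ID record on first sight, reuse it afterwards.
        p = processed.setdefault(r["ID"], {
            "id": r["ID"], "status": r["Status"], "role": r["Role"],
            "phones": [], "emails": [], "types": [],
        })
        for column, key, kind in (("Phone", "phones", "phone"),
                                  ("Email", "emails", "email")):
            value = r[column]
            if value and value not in p[key]:
                p[key].append(value)
            if value and kind not in p["types"]:
                p["types"].append(kind)
    return list(processed.values())

rows = [
    {"ID": 2, "Phone": "3434567893", "Email": "b#email.com",
     "Status": "active", "Role": "typeB"},
    {"ID": 2, "Phone": "3434567893", "Email": "c#email.com",
     "Status": "active", "Role": "typeB"},
    {"ID": 5, "Phone": None, "Email": "e#email.com",
     "Status": "active", "Role": "typeD"},
]
for item in consolidate(rows):
    print(item)
```

The membership checks on lists are linear, which is fine for a handful of values per ID; for large groups, sets converted to lists at the end would scale better.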

How do I get an associated value in a JSON variable using Python?

How do I look up the 'id' associated with a person's 'name' when the two are in a dictionary?
user = 'PersonA'
id = ? #How do I retrieve the 'id' from the user_stream json variable?
json, stored in a variable named "user_stream"
[
{
'name': 'PersonA',
'id': '135963'
},
{
'name': 'PersonB',
'id': '152265'
},
]
You'll have to decode the JSON structure and loop through all the dictionaries until you find a match:
for person in json.loads(user_stream):
    if person['name'] == user:
        id = person['id']
        break
else:
    # The else branch is only ever reached if no match was found
    raise ValueError('No such person')
If you need to make multiple lookups, you probably want to transform this structure to a dict to ease lookups:
name_to_id = {p['name']: p['id'] for p in json.loads(user_stream)}
then look up the id directly:
id = name_to_id.get(name) # if name is not found, id will be None
The above example assumes that names are unique; if they are not, use:
from collections import defaultdict
name_to_id = defaultdict(list)
for person in json.loads(user_stream):
    name_to_id[person['name']].append(person['id'])
# lookup
ids = name_to_id.get(name, []) # list of ids, defaults to empty
This is, as always, a trade-off: you trade memory for speed.
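As a worked example of the non-unique case, the sketch below runs the defaultdict lookup on sample data. Two things are assumptions: the stream is written with double quotes so that json.loads accepts it, and the third entry with id 999999 is invented to show a duplicate name.

```python
import json
from collections import defaultdict

# Valid JSON needs double quotes; the last entry is a made-up duplicate.
user_stream = '''[
  {"name": "PersonA", "id": "135963"},
  {"name": "PersonB", "id": "152265"},
  {"name": "PersonA", "id": "999999"}
]'''

# Parse once, then collect every id seen for each name.
name_to_ids = defaultdict(list)
for person in json.loads(user_stream):
    name_to_ids[person["name"]].append(person["id"])

print(name_to_ids.get("PersonA", []))  # ['135963', '999999']
print(name_to_ids.get("Nobody", []))   # []
```

Using .get rather than indexing avoids creating an empty entry in the defaultdict for names that were never seen.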
Martijn Pieters's solution is correct, but if you intend to make many such lookups it's better to load the JSON and iterate over it just once, rather than for every lookup.
name_id = {}
for person in json.loads(user_stream):
    name = person['name']
    id = person['id']
    name_id[name] = id
user = 'PersonA'
print(name_id[user])
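persons = json.loads(...)
# A list comprehension works on Python 2 and 3; on Python 3, filter() returns
# an iterator that is always truthy and cannot be indexed
results = [p for p in persons if p['name'] == 'avi']
if results:
    id = results[0]["id"]
There can be more than one result, of course.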
