Generating duplicate values (fill down?) when parsing XML into a DataFrame - Python

I have a problem parsing XML into a data frame using Python. When I print out the values, some of them seem to 'fill down', i.e. repeat themselves (see the column adres). Does anyone know what could be wrong?
import xml.etree.ElementTree as et
import pandas as pd
import xmltodict
import json

tree = et.parse('20191125_DMG_PI.xml')
root = tree.getroot()

df_cols = ["status", "priref", "full_name", "detail", "adres"]
rows = []

for record in root:
    for child in record:
        s_priref = child.get('priref')
        for field in child.findall('Address'):
            s_address = field.find('address').text
            #for sub in field.findall('address.country'):
            #    s_country = sub.find('value').text if s_country is not None else None
        for field in child.findall('name'):
            s_full_name = field.find('value').text
        for field in child.findall('name.status'):
            s_status = field.find('value').text
        for field in child.findall('level_of_detail'):
            s_detail = field.find('value').text
        rows.append({"status": s_status,
                     "priref": s_priref,
                     "full_name": s_full_name,
                     "detail": s_detail,
                     "adres": s_address})

out_df = pd.DataFrame(rows, columns=df_cols)
print(out_df)

First off, findall() returns an empty list if nothing matches the search criteria, so in the loop

for field in child.findall("..."):
    # this body only runs if child.findall() returns a non-empty list

The consequence, in this case, is that s_address, s_full_name, s_status, and s_detail are not necessarily assigned a new value on each iteration of the outer loop. They retain whatever value they got in the most recent iteration for which the respective child.findall() call returned a non-empty list.
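A minimal, self-contained demonstration of that effect, using a made-up two-record document in which only the first record carries an Address:

import xml.etree.ElementTree as et

xml = """<root>
  <record><child><Address><address>Main St 1</address></Address></child></record>
  <record><child></child></record>
</root>"""
root = et.fromstring(xml)

for record in root:
    for child in record:
        for field in child.findall('Address'):  # empty list for record 2
            s_address = field.find('address').text
        print(s_address)  # prints 'Main St 1' twice: the stale value carries over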
The simple way to fix this is to assign them all some initial value on each iteration of the outer loop, i.e.

for child in record:
    s_priref = child.get('priref')
    s_address = ''
    s_full_name = ''
    s_detail = ''
    s_status = ''
    # ...
Although it might be better (perhaps more 'Pythonic') to do something like this:

# Map each child.findall() key to the corresponding field.find() key
# (a dict named field_map, not `dict`, so the built-in stays usable below)
field_map = {'Address': 'address',
             'name': 'value',
             'name.status': 'value',
             'level_of_detail': 'value'}
# The output column names, positionally aligned with field_map plus priref
ref = ["adres", "full_name", "status", "detail", "priref"]

for record in root:
    for child in record:
        # Re-initialize every value to an empty string for each child
        s = dict.fromkeys(field_map, '')
        s["priref"] = child.get('priref')
        for key in field_map:
            for field in child.findall(key):
                s[key] = field.find(field_map[key]).text
        rows.append(dict(zip(ref, s.values())))
This should work just the same as the other method, but it makes it easier to add more keys/fields as needed.
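For instance, to also capture the country from the commented-out part of the question, only the two mappings need to grow. This assumes the address.country element sits inside the Address element, as the commented-out code suggests, so an ElementTree path expression is used:

field_map['Address/address.country'] = 'value'  # hypothetical nested tag, per the question
ref.insert(ref.index('priref'), 'country')      # keep ref positionally aligned with s

(If you build the DataFrame with an explicit column list, extend df_cols the same way.)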

Related

Trying to access keys in a dict from their values

I'm importing a CSV into a dictionary, where there are a number of houses labelled (e.g. 1A, 1B, ...).
Rows are labelled with items such as 'coffee' etc. The table contains data indicating how much of each item each household needs.
Excel screenshot
What I am trying to do is check the values of the key-value pairs in the dictionary for anything that isn't blank (containing either 1 or 2), and then take that key-value pair plus the 'PRODUCT NUMBER' (from the CSV) and append them to a new list.
I want to create a shopping list that will contain which item I need, in what quantity, for which household.
The column containing 'week' is not important for this.
I import the CSV into Python as a dictionary like this:
import csv
import pprint
from typing import List, Dict

input_file_1 = csv.DictReader(open("DATA CWK SHOPPING DATA WEEK 1 FILE B.xlsb.csv"))
table: List[Dict[str, int]] = []  # list
for row in input_file_1:
    string_row: Dict[str, int] = {}  # dictionary
    for column in row:
        string_row[column] = row[column]
    table.append(string_row)
I found on GeeksforGeeks how to access a pair by its value. However, when I try this on my dictionary, it only seems to be able to find the last row.
# creating a new dictionary
my_dict = {"java": 100, "python": 112, "c": 11}
# list out keys and values separately
key_list = list(my_dict.keys())
val_list = list(my_dict.values())
# print key with val 100
position = val_list.index(100)
print(key_list[position])
I also tried a for-in-range loop, but that didn't seem to work either:

for row in table:
    if row["PRODUCT NUMBER"] == '1' and row["Week"] == '1':
        for i in range(8):
            if string_row.values() != ' ':
                print(row[i])
If I am unclear anywhere, please let me know and I will clear it up!
Here is a loop I made that should do what you want.

values = list(table.values())
keys = list(table.keys())
new_table = {}
index = -1
for i in range(values.count("")):
    index = values.index("", index + 1)
    new_table[keys[index]] = values[index]
If you want to remove those values from the original dict, you can add a table.pop(keys[index]) call inside the loop.
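For the shopping-list goal itself, here's a minimal sketch, assuming table is the list of row dicts built in the question and that every column other than 'PRODUCT NUMBER' and 'Week' is a house label:

shopping_list = []
for row in table:
    for house, quantity in row.items():
        if house in ('PRODUCT NUMBER', 'Week'):
            continue  # label columns, not houses
        if quantity.strip():  # keep only non-blank quantities (1 or 2)
            shopping_list.append((house, row['PRODUCT NUMBER'], quantity))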

Extract value from key-value pair of dictionary

I have a CSV file with column names (in the first row) and values (in the rest of the rows). I want to create variables to store these values for every row in a loop. I started by creating a dictionary from the CSV file and got a list of the records as key-value pairs. Now I want to create variables storing the value extracted from the key of each item, within a loop over every record. I am not sure if I am setting this up correctly.
Here is the dictionary I have.
my_dict = [{'value id': 'value1', 'name': 'name1', 'info': 'info1'},
           {'value id': 'value2', 'name': 'name2', 'info': 'info2'},
           {'value id': 'value3', 'name': 'name3', 'info': 'info3'}]

for i in len(my_dict):
    item[value id] = value1
    item[name] = name1
    item[info] = info1
The value id and name will be unique and are the identifiers of the list. Ultimately, I want to create an item object, i.e. item[info] = info1, and then I can add other code to modify item[info].
Try this:

my_dict = [{'value': 'value1', 'name': 'name1', 'info': 'info1'},
           {'value': 'value2', 'name': 'name2', 'info': 'info2'},
           {'value': 'value3', 'name': 'name3', 'info': 'info3'}]

for obj in my_dict:
    value = obj['value']
    name = obj['name']
    info = obj['info']
To expand on @aws_apprentice's point, you can capture the data by creating some additional variables:

my_dict = [{'value': 'value1', 'name': 'name1', 'info': 'info1'},
           {'value': 'value2', 'name': 'name2', 'info': 'info2'},
           {'value': 'value3', 'name': 'name3', 'info': 'info3'}]

values = []
names = []
info = []
for obj in my_dict:
    values.append(obj['value'])
    names.append(obj['name'])
    info.append(obj['info'])
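Since the question says 'value' (and 'name') are unique identifiers, you could also build an index keyed on one of them for direct access later; a small sketch:

# O(1) lookup of a record by its unique 'value' field
by_value = {obj['value']: obj for obj in my_dict}
by_value['value2']['info'] = 'updated info2'  # modify one record in place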

DataFrame from dict resulting in empty DataFrame

Hi, I wrote some code that builds up a dictionary:
def makedata(filename):
    with open(filename, "r") as file:
        for x in features:
            previous = []
            count = 0
            for line in file:
                var_name = x
                regexp = re.compile(var_name + r'.*?([0-9.-]+)')
                match = regexp.search(line)
                if match and (match.group(1)) != previous:
                    previous = match.group(1)
                    count += 1
                    if count > wlength:
                        count = 1
                    target = str(str(count) + x)
                    dict.setdefault(target, []).append(match.group(1))
            file.seek(0)

df = pd.DataFrame.from_dict(dict)
The dictionary looks good, but when I try to convert it to a DataFrame, the result is empty. I can't figure it out.
dict:
{'1meanSignalLenght': ['0.5305184', '0.48961428', '0.47203177', '0.5177274'], '1amplCor': ['0.8780955002105448', '0.8634431017504487', '0.9381169983046714', '0.9407036427333355'], '1metr10.angle1': ['0.6439386643584522', '0.6555194964997434', '0.9512436169922103', '0.23789348400794422'], '1syncVar': ['0.1344131181025432', '0.08194580887223515', '0.15922251165913678', '0.28795644612520327'], '1linVelMagn': ['0.07062673289287498', '0.08792496681784517', '0.12603999663935528', '0.14791253129369603'], '1metr6.velSum': ['0.17850601560734558', '0.15855169971072014', '0.21396496345720045', '0.2739525279330513']}
df:
Empty DataFrame
Columns: []
Index: []
I think part of your issue is that you are using the built-in name dict as if it were a variable.
Make a dictionary in your function and call it something other than dict. Have your function return that dictionary, and build the DataFrame from the return value. Right now, you are creating a DataFrame from an empty dictionary object.

df = pd.DataFrame(makedata(filename))

This should make a DataFrame from the dictionary.
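A sketch of that renamed version; features and wlength are taken from the asker's surrounding code, and the example values and filename below are made up:

import re
import pandas as pd

features = ['meanSignalLenght', 'amplCor']  # example values, per the printed dict
wlength = 4                                 # example window length

def makedata(filename):
    results = {}  # a regular dict with its own name; leaves the built-in dict alone
    with open(filename, "r") as file:
        for x in features:
            previous = []
            count = 0
            for line in file:
                match = re.search(x + r'.*?([0-9.-]+)', line)
                if match and match.group(1) != previous:
                    previous = match.group(1)
                    count += 1
                    if count > wlength:
                        count = 1
                    results.setdefault(str(count) + x, []).append(match.group(1))
            file.seek(0)
    return results

df = pd.DataFrame.from_dict(makedata('measurements.txt'))  # hypothetical filename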
You can either pass a list of dicts, simply using pd.DataFrame(list_of_dicts) (use pd.DataFrame([your_dict]) if your variable is a single dict rather than a list), or a dict of lists using pd.DataFrame.from_dict(your_dict). In the latter case the dict should be something like {"a": [1, 2, 3], "b": ["a", "b", "c"], "c": ...}.
see: Pandas Dataframe from dict with empty list value
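To make the two construction paths concrete (tiny made-up data):

import pandas as pd

# list of dicts: each dict becomes one row
pd.DataFrame([{"a": 1, "b": 2}, {"a": 3, "b": 4}])

# dict of lists: each key becomes one column
pd.DataFrame.from_dict({"a": [1, 3], "b": [2, 4]})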

Searching items of large list in large python dictionary quickly

I am currently working on making a dictionary with tuples of names as keys and floats as values, of the form {(nameA, nameB): datavalue, (nameB, nameC): datavalue, ...}.
The values come from a matrix I have loaded into a pandas DataFrame with the names as both the index and the column labels. I have created an ordered list of the keys for my final dictionary, called keys, using the function createDictionaryKeys(). The issue is that not all the names from this list appear in my data matrix. I want my final dictionary to include only the names that do appear in the data matrix.
How can I do this search while avoiding a slow linear for loop? I also created a dictionary that has each name as key and a value of 1 if it should be included and 0 otherwise. It has the form {nameA: 1, nameB: 0, ...} and is called allow_dict. I was hoping to use this to do some sort of hash search.
def createDictionary(keynamefile, seperator, datamatrix, matrixsep):
    import pandas as pd
    keys = createDictionaryKeys(keynamefile, seperator)
    final_dict = {}
    data_df = pd.read_csv(open(datamatrix), sep=matrixsep)
    pd.set_option("display.max_rows", len(data_df))
    df_indices = list(data_df.index.values)
    df_cols = list(data_df.columns.values)[1:]
    for i in df_indices:
        data_df = data_df.rename(index={i: df_cols[i]})
    data_df = data_df.drop("Unnamed: 0", 1)
    allow_dict = descriminatePromoters(HARDCODEDFILENAME, SEP, THRESHOLD)
    #print ( item for item in df_cols if allow_dict[item] == 0 ).next()
    present = [x for x in keys if x[0] in df_cols and x[1] in df_cols]
    for i in present:
        final_dict[i] = final_df.loc[i[0], i[1]]
    return final_dict
Testing membership in Python sets is O(1), so simply:

present = [x for x in keys if x[0] in set(df_cols) and x[1] in set(df_cols)]

...should give you some speed-up, though it's better to build the set once rather than on every check. Since you're iterating through in O(n) anyway (and have to, to construct your final_dict), something like:

colset = set(df_cols)
final_dict = {k: final_df.loc[k[0], k[1]]
              for k in keys
              if (k[0] in colset) and (k[1] in colset)}

would be nicer, I would think.
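If you want to convince yourself of the difference, a rough timing sketch (the sizes here are made up):

import timeit

cols = [f"name{i}" for i in range(10_000)]
colset = set(cols)

# list membership scans linearly; set membership is a hash lookup
print(timeit.timeit('"name9999" in cols', globals=globals(), number=1000))
print(timeit.timeit('"name9999" in colset', globals=globals(), number=1000))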

Storing data into namedtuples with empty fields to add other stuff

['Date,Open,High,Low,Close,Volume,Adj Close',
 '2014-02-12,1189.00,1190.00,1181.38,1186.69,1724500,1186.69',
 '2014-02-11,1180.17,1191.87,1172.21,1190.18,2050800,1190.18',
 '2014-02-10,1171.80,1182.40,1169.02,1172.93,1945200,1172.93',
 '2014-02-07,1167.63,1177.90,1160.56,1177.44,2636200,1177.44',
 '2014-02-06,1151.13,1160.16,1147.55,1159.96,1946600,1159.96',
 '2014-02-05,1143.38,1150.77,1128.02,1143.20,2394500,1143.20',
 '2014-02-04,1137.99,1155.00,1137.01,1138.16,2811900,1138.16',
 '2014-02-03,1179.20,1181.72,1132.01,1133.43,4569100,1133.43']
I need to make a namedtuple for each of the lines in this list of lines; the fields would be the words in the first line ('Date,Open,High,Low,Close,Volume,Adj Close'). I will then be making some calculations and will need to add 2 more fields at the end of each namedtuple. Any help on how I can do this?
from collections import namedtuple

data = ['Date,Open,High,Low,Close,Volume,Adj Close',
        '2014-02-12,1189.00,1190.00,1181.38,1186.69,1724500,1186.69',
        '2014-02-11,1180.17,1191.87,1172.21,1190.18,2050800,1190.18',
        '2014-02-10,1171.80,1182.40,1169.02,1172.93,1945200,1172.93',
        '2014-02-07,1167.63,1177.90,1160.56,1177.44,2636200,1177.44',
        '2014-02-06,1151.13,1160.16,1147.55,1159.96,1946600,1159.96',
        '2014-02-05,1143.38,1150.77,1128.02,1143.20,2394500,1143.20',
        '2014-02-04,1137.99,1155.00,1137.01,1138.16,2811900,1138.16',
        '2014-02-03,1179.20,1181.72,1132.01,1133.43,4569100,1133.43']

def convert_to_named_tuples(data):
    # get the names for the named tuple
    field_names = data[0].split(",")
    # these are your two extra custom fields
    field_names.append("extra1")
    field_names.append("extra2")
    # field names can't have spaces in them (they have to be valid python
    # identifiers, and "Adj Close" isn't)
    field_names = [field_name.replace(" ", "_") for field_name in field_names]
    # you can do this as many times as you like...
    # personally I'd do it manually once at the start and just check you're
    # getting the field names you expect here...
    ShareData = namedtuple("ShareData", field_names)
    # unpack the data into the named tuples
    share_data_list = []
    for row in data[1:]:
        fields = row.split(",")
        fields += [None, None]
        share_data = ShareData(*fields)
        share_data_list.append(share_data)
    return share_data_list

# check it works..
share_data_list = convert_to_named_tuples(data)
for share_data in share_data_list:
    print(share_data)
Actually, this version is better, I think, since it converts the fields into the right types. On the downside it won't take arbitrary data...
from collections import namedtuple
from datetime import datetime

data = [...same as before...]

field_names = ["Date", "Open", "High", "Low", "Close", "Volume",
               "AdjClose", "Extra1", "Extra2"]
ShareData = namedtuple("ShareData", field_names)

def convert_to_named_tuples(data):
    share_data_list = []
    for row in data[1:]:
        row = row.split(",")
        fields = (datetime.strptime(row[0], "%Y-%m-%d"),  # date
                  float(row[1]), float(row[2]),
                  float(row[3]), float(row[4]),
                  int(row[5]),     # volume
                  float(row[6]),   # adj close
                  None, None)      # extras
        share_data = ShareData(*fields)
        share_data_list.append(share_data)
    return share_data_list

# test
share_data_list = convert_to_named_tuples(data)
for share_data in share_data_list:
    print(share_data)
But I agree with the other posts... why use a namedtuple when you can use a class definition?
Any special reason why you want to use namedtuples? If you want to add fields later, maybe you should use a dictionary. If you really want to go the namedtuple way, though, you could use placeholders like:
from collections import namedtuple

field_names = data[0].replace(" ", "_").lower().split(",")
field_names += ['placeholder_1', 'placeholder_2']
Entry = namedtuple('Entry', field_names)

list_of_named_tuples = []
mock_data = [None, None]
for row in data[1:]:
    row_data = row.split(",") + mock_data
    list_of_named_tuples.append(Entry(*row_data))
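Since namedtuples are immutable, the placeholders get filled later by building a replacement tuple; _replace does exactly that. A sketch, with a made-up derived value:

for i, entry in enumerate(list_of_named_tuples):
    daily_change = float(entry.close) - float(entry.open)  # hypothetical calculation
    list_of_named_tuples[i] = entry._replace(placeholder_1=daily_change)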
If, instead, you want to parse your data into a list of dictionaries (more Pythonic, IMO), you should do:
field_names = data[0].split(",")
list_of_dicts = [dict(zip(field_names, row.split(','))) for row in data[1:]]
EDIT: Note that even though you may use dictionaries instead of namedtuples for the small dataset from your example, doing so with large amounts of data will translate into a higher memory footprint for your program.
Why don't you use a dictionary for the data? Adding additional keys is then easy:
dataList = []
keys = myData[0].split(',')
for row in myData[1:]:  # skip the header row, which only holds the keys
    tempdict = dict()
    for index, value in enumerate(row.split(',')):
        tempdict[keys[index]] = value
    # if your additional values are going to be determined here then
    # you can do whatever calculations you need and add them
    # otherwise you can work with this list elsewhere
    dataList.append(tempdict)
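With the dict approach, filling in the extra values afterwards is just assignment; for example (the derived column name here is made up):

for entry in dataList:
    # hypothetical computed field added to each row dict
    entry['Change'] = float(entry['Close']) - float(entry['Open'])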
