How to deal with columns in pandas dataframe?


I want to transform a DataFrame column whose cells are lists of dicts, like:
inputs:
col-A
[{'name':'1','age':'12'}, {'name':'2','age':'12'}]
[{'name':'3','age':'18'}, {'name':'7','age':'15'}]
....
outputs:
col-A
[{'1-age':'12'}, {'2-age':'12'}]
[{'3-age':'18'}, {'7-age':'15'}]
....
My code is:
import copy
def deal(dict_col, prefix_key):
    key_value = dict_col[prefix_key]+'-'
    dict_col.pop(prefix_key, None)
    items = copy.deepcopy(dict_col)
    for key, value in items.items():
        dict_col[key_value+key] = dict_col.pop(key)
    return dict_col
prefix = "name"
[[deal(sub_item, prefix) for sub_item in item] for item in df['col-A']]
Some items get processed more than once.
Is that because deal modifies each dict in place, so the values stored in the column are replaced immediately?
For example, for the deal function:
input:
{'name':'1','age':'12'}
output:
{'1-age':'12'}
The next input may then be {'1-age':'12'}, which no longer has a name key to work with.
How can I solve this problem?

You can use the pandas apply method for this. Here is some code:
import pandas as pd
d = {'col-A' : [[{'name' : '1', 'age': '12'}, {'name' : '2', 'age': '12'}],[{'name' : '3', 'age': '18'},{'name' : '7', 'age': '15'}]]}
df = pd.DataFrame(d)
def deal(row, prefix):
    out_list = []
    for sub_dict in row:
        out_dict = {}
        out_str = sub_dict.get(prefix) + '-'
        for k,v in sub_dict.items():
            # Skip the prefix key itself so that '1-name' is not emitted
            if k != prefix:
                out_dict[out_str + k] = v
        out_list.append(out_dict)
    return out_list
prefix = 'name'
df['col-A'] = df['col-A'].apply(lambda x : deal(x, prefix))
print(df)
You could condense part of the code into a one-liner if you prefer:
def deal(row, prefix):
    out_list = []
    for sub_dict in row:
        out_dict = dict((sub_dict[prefix] + '-' + k , sub_dict[k]) for k in sub_dict.keys() if k != prefix)
        out_list.append(out_dict)
    return out_list
prefix = 'name'
df['col-A'] = df['col-A'].apply(lambda x : deal(x, prefix))
Just for fun, you could even bring it down to a single line (not recommended due to poor readability):
prefix = "name"
df['col-A'] = df['col-A'].apply(lambda row : [dict((sub_dict[prefix] + '-' + k , sub_dict[k]) for k in sub_dict.keys() if k != prefix) for sub_dict in row])

I believe you need the dict .get method, which returns a default value if the key does not exist:
import copy
def deal(dict_col, prefix_key):
    key_value = dict_col.get(prefix_key, 'not_exist')+'-'
    dict_col.pop(prefix_key, None)
    items = copy.deepcopy(dict_col)
    for key, value in items.items():
        dict_col[key_value+key] = dict_col.pop(key)
    return dict_col
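A likely cause of the repeated processing described in the question is that deal renames keys inside the very dicts stored in the column, so any second pass (for instance re-running a notebook cell) sees dicts that are already renamed. A minimal sketch of a non-mutating variant, in plain Python so it runs without pandas; the names prefixed_copy and transform are made up for illustration, and the 'not_exist' fallback is borrowed from the answer above:

```python
def prefixed_copy(record, prefix_key, missing="not_exist"):
    """Return a new dict with renamed keys; `record` is left untouched."""
    prefix = record.get(prefix_key, missing) + "-"
    return {prefix + k: v for k, v in record.items() if k != prefix_key}


def transform(rows, prefix_key):
    """Apply prefixed_copy to every dict in every row (hypothetical helper)."""
    return [[prefixed_copy(d, prefix_key) for d in row] for row in rows]


rows = [
    [{"name": "1", "age": "12"}, {"name": "2", "age": "12"}],
    [{"name": "3", "age": "18"}, {"name": "7", "age": "15"}],
]

first = transform(rows, "name")
second = transform(rows, "name")  # safe to repeat: the input was never modified
print(first)
```

With a DataFrame, the same function could be passed to apply, e.g. df['col-A'].apply(lambda row: [prefixed_copy(d, 'name') for d in row]).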

Related

Create an organized DF from a List of mixed type items (Python)

I have a list of items in a 'variable:value' format, but the same 'variable' can appear multiple times. The only thing I know is that all values that follow the 'ID' category belong to the same 'ID', so I know how many rows I need (3 in this example).
I need to create a dataframe from this list. The problem I am encountering is that I cannot add a string value to my DF ('could not convert str to float'). I am not sure how to proceed.
import numpy as np
import pandas as pd
mylist = ['ID:1', 'Date: Oct 2', 'B:88', 'C:noun', 'D:44', 'ID:2', 'B:55', 'C:noun', 'D:45', 'ID:3',
          'Date:Sept 5', 'B:55', 'C:verb']
categories = []
for i in mylist:
    var = i.split(":")
    categories.append(var[0])
variables = list(set(categories))
df = np.empty((3,len(variables)))
df = pd.DataFrame(df)
counter = -1
for i in mylist:
    item = i.split(":")
    category = item[0]
    value = item[1]
    tracker = -1
    for j in variables:
        tracker = tracker + 1
        if j == category:
            float(value)
            df[counter, tracker] = value
    if category == "ID":
        counter = counter + 1
        float(value)
        df[counter, 0] = value
In addition, I've tried converting the items in the list to dictionary, but I am not sure if that's the best way to achieve my goal:
df = np.empty((3,len(variables)))
df = pd.DataFrame(df, columns = variables)
mydict = {}
counter = -1
for i in mylist:
    item = i.split(":")
    category = item[0]
    value = item[1]
    mydict = {category:value}
    if category == "ID":
        counter = counter + 1
        df[counter] = pd.DataFrame.from_dict(mydict)
    else:
        df[counter] = pd.DataFrame.from_dict(mydict)
Edit:
I solved it. Code below:
rows = []  # one dict per ID
mydict = {}
for i in mylist:
    item = i.split(":")
    category = item[0]
    value = item[1]
    if category == 'ID' and mydict:
        # A new 'ID' closes the row collected so far
        rows.append(mydict)
        mydict = {}
    mydict[category] = value
# Flush the last row, which no later 'ID' closes
rows.append(mydict)
df = pd.DataFrame(rows, columns = variables)

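Perhaps this works:
df = pd.DataFrame([e.split(':', 1) for e in mylist],
                  columns=['key', 'value'])
# Each 'ID' item starts a new row, so a running count of them labels the rows
df['row'] = (df['key'] == 'ID').cumsum()
df = df.pivot(index='row', columns='key', values='value') #not tested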

How to categorize a list with same prefix/suffix?

I have a list of words as below:
Data = ['pre_bbc', 'pre_nbc', 'pre_fox', 'bread_post', 'pre_news', 'lucky_post',
'banana_post', 'mike', 'john', 'edward_lear', 'winelistpdf', 'cookbookspdf']
Assuming I have no idea what the prefixes or suffixes are beforehand, and '_' is not always the delimiter between a word and its prefix/suffix, is there a way to categorize this list into groups using Python? Let's say the result I want is as below:
List0 = ['pre_bbc', 'pre_nbc', 'pre_fox', 'pre_news']
List1 = ['bread_post', 'lucky_post', 'banana_post']
List2 = ['winelistpdf', 'cookbookspdf']
Orphan_list =['mike', 'john', 'edward_lear']
There could be some tricky cases in which a word contains both a suffix and a prefix, like 'pre_voa_post'; I think such a word can be put into both lists. Also, let's assume all the elements in this list are unique.
Thanks!

This was a pretty challenging one! There are a few conditions to consider here if this needs to be fairly universal.
Minimum length for an affix
Delimiters that denote affixes
Multiple affixes
import json
def get_affix_groups(words, min=3, delimiter="_"):
    """Get groups from a word list that have matching affixes."""
    groups = {}
    for word in words:
        for item in [w for w in words if w != word]:
            for n in range(len(word) - min):
                try:
                    prefix, *_, suffix = word.split(delimiter)
                except ValueError:
                    prefix = word[:n + min]
                    suffix = word[-(n + min):]
                if item.startswith(prefix):
                    prefix_group = groups.setdefault(prefix, {word})
                    groups[prefix].add(item)
                if item.endswith(suffix):
                    suffix_group = groups.setdefault(suffix, {word})
                    groups[suffix].add(item)
    all_words = [i for w in groups.values() for i in w]
    groups["orphans"] = {word for word in words if word not in all_words}
    return groups
data = [
"pre_bbc",
"pre_nbc",
"pre_fox",
"bread_post",
"pre_news",
"lucky_post",
"banana_post",
"mike",
"john",
"edward_lear",
"winelistpdf",
"cookbookspdf",
"pre_voa_post"
]
# Print the resulting dict in a human-readable format
print(json.dumps(get_affix_groups(data), default=list, indent=2))
Output
{
"pre": [
"pre_fox",
"pre_voa_post",
"pre_bbc",
"pre_news",
"pre_nbc"
],
"post": [
"lucky_post",
"pre_voa_post",
"bread_post",
"banana_post"
],
"pdf": [
"cookbookspdf",
"winelistpdf"
],
"orphans": [
"john",
"edward_lear",
"mike"
]
}
If you really need these to be variables, you can use exec(), but it's considered bad practice.
for affix, group in get_affix_groups(data).items():
    exec(f"{affix} = {group}")

Tested with :
Data = ['pre_voa_post', 'argument', 'thermodynamic', 'winelistpdf',
'pre_bbc', 'anteroom', 'pre_nbc', 'thermostat', 'pre_fox',
'antedate', 'blabla', 'enchantment', 'pre_news', 'lucky_post',
'banana_post', 'mike', 'john', 'thermometer', 'toto', 'antenatal' ]
Function
def test(Data):
suffixes = Data.copy()
prefixes = Data.copy()
my_suffixes = {}
my_prefixes = {}
Orphan_list = []
Orphan_s = []
Orphan_p = []
while len(prefixes) > 1:
first_p = prefixes.pop(0)
prefix = ''
for elt_pref in prefixes:
i = min(len(first_p), len(elt_pref))
while i > 1:
if first_p[0:i] == elt_pref[0:i]:
prefix = first_p[0:i]
my_prefixes[prefix] = [first_p, elt_pref, ]
prefixes.remove(elt_pref)
var = 0
while var < len(prefixes):
sec_elt = prefixes[var]
if sec_elt.startswith(prefix):
my_prefixes[prefix].append(sec_elt)
prefixes.remove(sec_elt)
else:
var += 1
break
else:
i -= 1
if prefix == '':
Orphan_p.append(first_p)
if prefixes:
Orphan_p.append(prefixes[0])
while len(suffixes) > 1:
first_s = suffixes.pop(0)
suffix = ''
for elt_suf in suffixes:
j = min(len(first_s), len(elt_suf))
while j > 2:
if first_s[-j:] == elt_suf[-j:]:
suffix = first_s[-j:]
my_suffixes[suffix] = [first_s, elt_suf, ]
suffixes.remove(elt_suf)
var = 0
while var < len(suffixes):
elt_suf3 = suffixes[var]
if elt_suf3.endswith(suffix):
my_suffixes[suffix].append(elt_suf3)
suffixes.remove(elt_suf3)
else:
var += 1
break
else:
j -= 1
if suffix == '':
Orphan_s.append(first_s)
if suffixes:
Orphan_s.append(suffixes[0])
Orphan_list = list(set(Orphan_p) & set(Orphan_s))
print("my_suffixes", my_suffixes)
print("my_prefixes", my_prefixes)
print("Orphan_list", Orphan_list)
Result:
my_suffixes {'_post': ['pre_voa_post', 'bread_post', 'lucky_post', 'banana_post'],
'ment': ['argument', 'enchantment'],
'pdf': ['winelistpdf', 'cookbookspdf']}
my_prefixes {'pre_': ['pre_voa_post', 'pre_bbc', 'pre_nbc', 'pre_fox', 'pre_news'],
'thermo': ['thermodynamic', 'thermostat', 'thermometer'],
'ante': ['anteroom', 'antedate', 'antenatal']}
Orphan_list ['toto', 'mike', 'john', 'blabla', 'edward_lear']

This should not be a valid question, but:
def partition(list_of_pref, list_of_words):
    ret_list = []
    for l in list_of_pref:
        this_list = []
        ret_list.append(this_list)
        for word in list_of_words:
            if word.startswith(l):
                this_list.append(word)
    return ret_list
partition(['pre', 'banana'],['pre_bbc', 'pre_nbc', 'pre_fox', 'bread_post', 'pre_news', 'lucky_post', 'banana_post', 'mike', 'john', 'edward_lear', 'winelistpdf', 'cookbookspdf'])
Out[4]: [['pre_bbc', 'pre_nbc', 'pre_fox', 'pre_news'], ['banana_post']]
Do the same with a list of suffixes, or generate the affixes by iterating on split('_') inside your function, and you are done.
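The split('_') idea can be sketched roughly as follows. This is only an illustration under the assumption that affixes are always separated by a delimiter, so undelimited cases such as 'winelistpdf' stay ungrouped; the function name group_by_delimited_affix is made up for this example.

```python
from collections import defaultdict


def group_by_delimited_affix(words, delimiter="_"):
    """Group words by their first and last delimiter-separated part."""
    prefixes = defaultdict(list)
    suffixes = defaultdict(list)
    for word in words:
        parts = word.split(delimiter)
        if len(parts) < 2:
            continue  # no delimiter, so no candidate affix
        prefixes[parts[0]].append(word)
        suffixes[parts[-1]].append(word)

    # Keep only affixes shared by at least two words.
    prefix_groups = {a: m for a, m in prefixes.items() if len(m) > 1}
    suffix_groups = {a: m for a, m in suffixes.items() if len(m) > 1}

    grouped = {w for m in prefix_groups.values() for w in m}
    grouped |= {w for m in suffix_groups.values() for w in m}
    orphans = [w for w in words if w not in grouped]
    return prefix_groups, suffix_groups, orphans


data = ['pre_bbc', 'pre_nbc', 'pre_fox', 'bread_post', 'pre_news', 'lucky_post',
        'banana_post', 'mike', 'john', 'edward_lear', 'winelistpdf',
        'cookbookspdf', 'pre_voa_post']
prefix_groups, suffix_groups, orphans = group_by_delimited_affix(data)
print(prefix_groups)
print(suffix_groups)
print(orphans)
```

Keeping prefixes and suffixes in separate dicts lets a word like 'pre_voa_post' appear in both groups without the two kinds of affix colliding.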

How to get the last value from the list using python?

data = ['reply': '{"osc":{"version":"1.0"}}']
data1 = ['reply':'{"device":{"network":{"ipv4_dante":{"auto":"1.0"}}}}']
I need to get only the "1.0" value from data and data1 using python 3.6.
How can I achieve this?

After fixing the data:
data1 = {'reply':{"device":{"network":{"ipv4_dante":{"auto":"1.0"}}}}}
keep taking the value from the dictionary as long as it remains a dictionary:
d = data1
while isinstance(d, dict):
    d = list(d.values())[0]
print(d)
#1.0

Your data and data1 are not valid Python, so I converted them into valid dictionaries.
from functools import reduce
from operator import getitem
data = {'reply': {"osc":{"version":"1.0"}}}
data1 = {'reply':{"device":{"network":{"ipv4_dante":{"auto":"1.0"}}}}}
def get_item(keys, dict_):
    return reduce(getitem, keys, dict_)
print(get_item(['reply','osc','version'], data))
print(get_item(['reply','device','network', 'ipv4_dante',"auto"],data1))
>>>1.0
1.0
Another approach that treats data and data1 as strings:
import json
from functools import reduce
from operator import getitem
class GetValue:
    def __init__(self, string):
        self.string = string
        self.new_keys = {}
    def clean_data(self, data):
        if data[2] == '{':
            get_json_data = len(data) - 3
        else:
            get_json_data = len(data) - 1
        modified_data = [val for val in list(data[data.find(':')+1:get_json_data])
                         if val != "'" ]
        return json.loads(''.join(modified_data))
    def get_recurvise_key(self, data, dict_):
        for key, val in dict_.items():
            self.new_keys.setdefault(data,[]).append(key)
            if isinstance(val, dict):
                self.get_recurvise_key(data, val)
        return self.new_keys.get(data)
    def get_value(self):
        get_data = self.clean_data(self.string)
        get_keys = self.get_recurvise_key(self.string,get_data)
        value = reduce(getitem, get_keys, get_data)
        return value
data = """['{"device":{"network":{"ipv4_dante":{"auto":"1.0"}}}}']"""
data1 = """['reply':'{"device":{"network":{"ipv4_dante":{"auto":"1.0"}}}}']"""
obj_data = GetValue(data)
obj_data1 = GetValue(data1)
print(obj_data.get_value(), obj_data1.get_value())
>>> 1.0 1.0

You can get that result recursively like:
Code:
def bottom_value(in_data):
    def recurse_to_bottom(a_dict):
        if isinstance(a_dict, dict):
            a_key = list(a_dict)[0]
            return recurse_to_bottom(a_dict[a_key])
        return a_dict
    return recurse_to_bottom(json.loads(in_data['reply']))
Test Code:
import json
data = {'reply': '{"osc":{"version":"1.0"}}'}
data1 = {'reply':'{"device":{"network":{"ipv4_dante":{"auto":"1.0"}}}}'}
print(bottom_value(data))
print(bottom_value(data1))
Result:
1.0
1.0
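The recursion above follows only the first key at each level, which is enough for the single-key chains in the question. If a level can hold several keys, one option is to collect every leaf value instead. The sketch below assumes that case; the function name leaf_values and the extra "mtu" entry are invented for illustration and do not come from the question.

```python
import json


def leaf_values(obj):
    """Yield every non-dict value reachable from obj, depth first."""
    if isinstance(obj, dict):
        for value in obj.values():
            yield from leaf_values(value)
    else:
        yield obj


# Hypothetical reply with two leaves; "mtu" is made-up sample data.
reply = '{"device":{"network":{"ipv4_dante":{"auto":"1.0"},"mtu":"1500"}}}'
values = list(leaf_values(json.loads(reply)))
print(values)
```

For the replies in the question there is exactly one leaf, so the list has a single element.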

You can try regex:
import re
pattern=r'(?<=")[0-9.]+'
data1="""['reply': '{"osc":{"version":"1.0"}}']"""
data2="""['reply':'{"device":{"network":{"ipv4_dante":{"auto":"1.0"}}}}']"""
def find_value(data):
    return re.findall(pattern,data)[0]
first:
print(find_value(data1))
output:
1.0
second:
print(find_value(data2))
output:
1.0

Python multilevel dict to strings

I have a Python dictionary in which some of the values are themselves dictionaries. I'm trying to generate dot-delimited strings of the keys in the structure, with the value at the end. In the example below I'd want lines for entries such as FIELD0.F0_VAL1 and NAME. I could write a for loop or a recursive function to process the data, but I didn't know if there was a prebuilt method for collapsing a multilevel dictionary into delimited strings.
I was trying the following, but as you can see it just appends the sub-dictionaries.
'.'.join('%s %s\n' % i for i in a.items())
{'BOGUS1': 'BOGUS_VAL1',
'BOGUS2': 'BOGUS_VAL1',
'FIELD0': {'F0_VAL1': 1, 'F0_VAL2': 2},
'FIELD1': {'F1_VAL1': 80, 'F1_VAL2': 67, 'F1_VAL3': 100},
'FOOBAR1': 'FB_VAL1',
'NAME': 'VALUE'}
BOGUS2.BOGUS_VAL1
.NAME.VALUE
.BOGUS1.BOGUS_VAL1
.FIELD0.{'F0_VAL1': 1, 'F0_VAL2': 2}
.FIELD1.{'F1_VAL2': 67, 'F1_VAL3': 100, 'F1_VAL1': 80}
.FOOBAR1.FB_VAL1
# Wanted results
FIELD0.F0_VAL1 1
FIELD0.F0_VAL2 2
FIELD1.F1_VAL1 80
FIELD1.F1_VAL2 67
FIELD1.F1_VAL3 100
NAME VALUE

How about something like this:
def dotnotation(d, prefix = ''):
    for k, v in d.items():
        if isinstance(v, dict):
            dotnotation(v, prefix + str(k) + '.')
        else:
            print(prefix + str(k) + ' = ' + str(v))
Also the formatting can be changed according to the stored types. This should work with your example.
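If the lines are needed as data rather than printed directly, the same recursion can be written as a generator. This is a minimal sketch, assuming Python 3 and string-convertible keys; the name flatten is chosen here for illustration, and the output uses the "key value" layout from the question.

```python
def flatten(d, prefix=""):
    """Yield (dotted_key, value) pairs for every non-dict value in d."""
    for key, value in d.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse with the current path as the new prefix.
            yield from flatten(value, dotted + ".")
        else:
            yield dotted, value


a = {'BOGUS1': 'BOGUS_VAL1',
     'BOGUS2': 'BOGUS_VAL1',
     'FIELD0': {'F0_VAL1': 1, 'F0_VAL2': 2},
     'FIELD1': {'F1_VAL1': 80, 'F1_VAL2': 67, 'F1_VAL3': 100},
     'FOOBAR1': 'FB_VAL1',
     'NAME': 'VALUE'}

lines = [f"{k} {v}" for k, v in flatten(a)]
print("\n".join(lines))
```

Because it yields pairs, the caller decides the formatting, and dict(flatten(a)) gives a flat dictionary keyed by dotted path.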

Here is my approach:
def dotted_keys(dic):
    """ Generate dot notation keys from a dictionary """
    queue = [(None, dic)] # A queue of (prefix, object)
    while queue:
        prefix, current = queue.pop(0)
        for k, v in current.items():
            # Carry the full parent path so nested keys keep every level
            path = prefix + '.' + k if prefix else k
            if isinstance(v, dict):
                queue.append((path, v))
            else:
                yield path
def dict_search(dic, dotted_key, default=None):
    """ Take a dictionary and a dotted key and return the value. If not
    found, return the value specified by the default parameter.
    Example: dict_search(d, 'FIELD0.F0_VAL2')
    """
    current = dic
    keys = dotted_key.split('.')
    for k in keys:
        if k in current:
            current = current[k]
        else:
            return default
    return current
if __name__ == '__main__':
    d = {
        'BOGUS1': 'BOGUS_VAL1',
        'BOGUS2': 'BOGUS_VAL1',
        'FIELD0': {'F0_VAL1': 1, 'F0_VAL2': 2, 'XYZ': {'X1': 9}},
        'FIELD1': {'F1_VAL1': 80, 'F1_VAL2': 67, 'F1_VAL3': 100},
        'FOOBAR1': 'FB_VAL1',
        'NAME': 'VALUE'
    }
    for k in dotted_keys(d):
        print(k, '=', dict_search(d, k))
Output:
BOGUS1 = BOGUS_VAL1
BOGUS2 = BOGUS_VAL1
FOOBAR1 = FB_VAL1
NAME = VALUE
FIELD0.F0_VAL1 = 1
FIELD0.F0_VAL2 = 2
FIELD1.F1_VAL1 = 80
FIELD1.F1_VAL2 = 67
FIELD1.F1_VAL3 = 100
FIELD0.XYZ.X1 = 9
The dotted_keys function generates keys in dotted notation, while the dict_search function takes a dotted key and returns its value.

best way to parse a line in python to a dictionary

I have a file with lines like
account = "TEST1" Qty=100 price = 20.11 subject="some value" values="3=this, 4=that"
There is no special delimiter, and each key has a value that is surrounded by double quotes if it's a string but not if it is a number. There is no key without a value, though there may be blank strings, which are represented as "". There is no escape character for a quote, as it is not needed.
I want to know what is a good way to parse this kind of line with python and store the values as key-value pairs in a dictionary

We're going to need a regex for this.
import re, decimal
line = 'account = "TEST1" Qty=100 price = 20.11 subject="some value" values="3=this, 4=that"'
r= re.compile('([^ =]+) *= *("[^"]*"|[^ ]*)')
d= {}
for k, v in r.findall(line):
    if v[:1]=='"':
        d[k]= v[1:-1]
    else:
        d[k]= decimal.Decimal(v)
>>> d
{'account': 'TEST1', 'subject': 'some value', 'values': '3=this, 4=that', 'price': Decimal('20.11'), 'Qty': Decimal('100')}
You can use float instead of decimal if you prefer, but it's probably a bad idea if money is involved.
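A small illustration of why float can be a poor fit for money, using the well-known 0.1 + 0.2 case rather than the values from the question:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_total = 0.1 + 0.2

# Decimal built from strings keeps the exact decimal digits.
decimal_total = Decimal("0.1") + Decimal("0.2")

print(float_total == 0.3)               # False
print(decimal_total == Decimal("0.3"))  # True
```

Note that the Decimal values are built from strings, as in the regex answer above; Decimal(0.1) would inherit the float's rounding error.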

Maybe a bit simpler to follow is the pyparsing rendition:
from pyparsing import *
# define basic elements - use re's for numerics, faster and easier than
# composing from pyparsing objects
integer = Regex(r'[+-]?\d+')
real = Regex(r'[+-]?\d+\.\d*')
ident = Word(alphanums)
value = real | integer | quotedString.setParseAction(removeQuotes)
# define a key-value pair, and a configline as one or more of these
# wrap configline in a Dict so that results are accessible by given keys
kvpair = Group(ident + Suppress('=') + value)
configline = Dict(OneOrMore(kvpair))
src = 'account = "TEST1" Qty=100 price = 20.11 subject="some value" ' \
'values="3=this, 4=that"'
configitems = configline.parseString(src)
Now you can access your pieces using the returned configitems ParseResults object:
>>> print configitems.asList()
[['account', 'TEST1'], ['Qty', '100'], ['price', '20.11'],
['subject', 'some value'], ['values', '3=this, 4=that']]
>>> print configitems.asDict()
{'account': 'TEST1', 'Qty': '100', 'values': '3=this, 4=that',
'price': '20.11', 'subject': 'some value'}
>>> print configitems.dump()
[['account', 'TEST1'], ['Qty', '100'], ['price', '20.11'],
['subject', 'some value'], ['values', '3=this, 4=that']]
- Qty: 100
- account: TEST1
- price: 20.11
- subject: some value
- values: 3=this, 4=that
>>> print configitems.keys()
['account', 'subject', 'values', 'price', 'Qty']
>>> print configitems.subject
some value

A recursive variation of bobince's parses values with embedded equals as dictionaries:
>>> import re
>>> import pprint
>>>
>>> def parse_line(line):
...     d = {}
...     a = re.compile(r'\s*(\w+)\s*=\s*("[^"]*"|[^ ,]*),?')
...     float_re = re.compile(r'^\d.+$')
...     int_re = re.compile(r'^\d+$')
...     for k,v in a.findall(line):
...         if int_re.match(k):
...             k = int(k)
...         if v[-1] == '"':
...             v = v[1:-1]
...         if '=' in v:
...             d[k] = parse_line(v)
...         elif int_re.match(v):
...             d[k] = int(v)
...         elif float_re.match(v):
...             d[k] = float(v)
...         else:
...             d[k] = v
...     return d
...
>>> line = 'account = "TEST1" Qty=100 price = 20.11 subject="some value" values="3=this, 4=that"'
>>> pprint.pprint(parse_line(line))
{'Qty': 100,
'account': 'TEST1',
'price': 20.109999999999999,
'subject': 'some value',
'values': {3: 'this', 4: 'that'}}

If you don't want to use a regex, another option is to read the string one character at a time:
string = 'account = "TEST1" Qty=100 price = 20.11 subject="some value" values="3=this, 4=that"'
inside_quotes = False
key = None
value = ""
result = {}  # named result to avoid shadowing the built-in dict
for c in string:
    if c == '"':
        inside_quotes = not inside_quotes
    elif c == '=' and not inside_quotes:
        key = value
        value = ''
    elif c == ' ':
        if inside_quotes:
            value += ' '
        elif key and value:
            result[key] = value
            key = None
            value = ''
    else:
        value += c
result[key] = value
print(result)
