Convert literal string to list inside python

helpful
'[2, 4]'
'[0, 0]'
'[0, 1]'
'[7, 13]'
'[4, 6]'
The helpful column holds a list stored as a string. I want to split the two values (for example 2 and 4) into separate columns.
[int(each) for each in df['helpful'][0].strip('[]').split(',')]
This works for the first row, but if I do
[int(each) for each in df['helpful'].strip('[]').split(',')]
it gives me an attribute error:
AttributeError: 'Series' object has no attribute 'strip'
How can I get output like this in my DataFrame?
helpful not_helpful
2 4
0 0
0 1
7 13
4 6

As suggested by @abarnert, the first port of call is to find out why your data is coming across as strings and to try to rectify that problem.
However, if this is beyond your control, you can use ast.literal_eval as below.
import pandas as pd
from ast import literal_eval
df = pd.DataFrame({'helpful': ['[2, 4]', '[0, 0]', '[0, 1]', '[7, 13]', '[4, 6]']})
res = pd.DataFrame(df['helpful'].map(literal_eval).tolist(),
                   columns=['helpful', 'not_helpful'])
# helpful not_helpful
# 0 2 4
# 1 0 0
# 2 0 1
# 3 7 13
# 4 4 6
Explanation
From the documentation, ast.literal_eval performs the following function:
Safely evaluate an expression node or a string containing a Python
literal or container display. The string or node provided may only
consist of the following Python literal structures: strings, bytes,
numbers, tuples, lists, dicts, sets, booleans, and None.
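As a minimal sketch of what literal_eval does with these strings, assuming the five sample values from the question, the plain-Python version below parses each string and transposes the pairs into two columns. The variable names (raw, pairs, rejected) are only illustrative. It also shows that a string containing a function call is rejected instead of being executed.

```python
from ast import literal_eval

# Sample values copied from the question.
raw = ['[2, 4]', '[0, 0]', '[0, 1]', '[7, 13]', '[4, 6]']

# Each string is parsed into a real Python list of ints.
pairs = [literal_eval(text) for text in raw]

# zip(*pairs) transposes the rows into two columns.
helpful, not_helpful = (list(column) for column in zip(*pairs))
print(helpful)
print(not_helpful)

# A function call is not a literal, so literal_eval refuses to evaluate it.
try:
    literal_eval('len([2, 4])')
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

The pandas one-liner in the answer does the same parsing row by row and lets the DataFrame constructor do the transposition.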

Assuming what you've described here accurately mimics your real-world case, how about a regex with .str.extract()?
>>> import numpy as np
>>> regex = r'\[(?P<helpful>\d+),\s*(?P<not_helpful>\d+)\]'
>>> df
helpful
0 [2, 4]
1 [0, 0]
2 [0, 1]
>>> df['helpful'].str.extract(regex, expand=True).astype(np.int64)
helpful not_helpful
0 2 4
1 0 0
2 0 1
Each pattern (?P<name>...) is a named capturing group. Here, there are two: helpful and not_helpful. This assumes every value can be described by: opening bracket, 1 or more digits, comma, 0 or more whitespace characters, 1 or more digits, and closing bracket. The pandas method .str.extract(), as its name implies, "extracts" the result of match.group(i) for each i:
>>> import re
>>> regex = r'\[(?P<helpful>\d+),\s*(?P<not_helpful>\d+)\]'
>>> re.search(regex, '[2, 4]').group('helpful')
'2'
>>> re.search(regex, '[2, 4]').group('not_helpful')
'4'
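To see the same pattern applied to every row without pandas, here is a small sketch. It assumes the sample strings from the question; the names pattern, raw and rows are illustrative, and raising on a non-matching row is just one possible choice.

```python
import re

pattern = re.compile(r'\[(?P<helpful>\d+),\s*(?P<not_helpful>\d+)\]')
raw = ['[2, 4]', '[0, 0]', '[0, 1]', '[7, 13]', '[4, 6]']

rows = []
for text in raw:
    match = pattern.fullmatch(text)
    if match is None:
        # The pattern is strict: anything else is treated as bad input here.
        raise ValueError('unexpected format: {!r}'.format(text))
    # groupdict() maps each group name to the matched text; convert to int.
    rows.append({name: int(value) for name, value in match.groupdict().items()})

for row in rows:
    print(row['helpful'], row['not_helpful'])
```

The captured groups are strings, which is why both this sketch and the pandas version need an explicit conversion to integers.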

Just for fun, without any modules:
s = """
helpful
'[2, 4]'
'[0, 0]'
'[0, 1]'
'[7, 13]'
'[4, 6]'
"""
lst = s.strip().splitlines()
d = {'helpful':[], 'not_helpful':[]}
el = [tuple(int(x) for x in e.strip("'[]").split(', ')) for e in lst[1:]]
d['helpful'].extend(x[0] for x in el)
d['not_helpful'].extend(x[1] for x in el)
NUM_WIDTH = 4
COLUMN_WIDTH = max(len(k) for k in d)
# Header row: a blank cell for the row number, then the column names.
print('{:^{num_width}}{:^{column_width}}{:^{column_width}}'.format(
    ' ', *sorted(d),
    num_width=NUM_WIDTH,
    column_width=COLUMN_WIDTH
))
# Data rows, numbered from 1.
for (i, v) in enumerate(zip(d['helpful'], d['not_helpful']), 1):
    print('{:^{num_width}}{:^{column_width}}{:^{column_width}}'.format(
        i, *v,
        num_width=NUM_WIDTH,
        column_width=COLUMN_WIDTH
    ))

Related

Python splitting an int based on the char length

I’m new to Python and would like to write a simple function. I’d like to read the input array and, if a value has four or more digits, split it and print the first part and then the second part.
I’m having issues splitting the number and getting rid of the zeros in between; for example, 1006 would become 1, 6.
Input array:
a = [ 1002, 2, 3, 7 ,9, 15, 5992]
Desired output in console:
1, 2
2
3
7
9
15
59,92

You can abstract the splitting into a function and then use a list comprehension to map that function over the list. The following matches more of what you had before one of your edits, and it can of course be tweaked:
def split_num(n):
    s = str(n)
    if len(s) < 4:
        return 0, n
    else:
        a, b = s[:2], s[2:]
        if a[1] == '0':
            a = a[0]
        return int(a), int(b)

nums = [1002, 2, 3, 7, 9, 15, 5992]
result = [split_num(n) for n in nums]
for a, b in result:
    print(a, b)
Output:
1 2
0 2
0 3
0 7
0 9
0 15
59 92
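If the 0 placeholder for short numbers is not wanted, one possible tweak is to return a tuple with a single element in that case. This is only a sketch: it assumes the intended rule is "split numbers with four or more digits after the second digit and drop a zero that ends the first half", which is a guess based on the sample output in the question.

```python
def split_num(n):
    """Return the parts of n as a tuple: one part for short numbers, two otherwise."""
    s = str(n)
    if len(s) < 4:
        return (n,)
    head, tail = s[:2], s[2:]
    if head[1] == '0':
        head = head[0]          # '10' -> '1'
    return (int(head), int(tail))  # int('02') -> 2 drops the leading zero

nums = [1002, 2, 3, 7, 9, 15, 5992]
lines = [', '.join(str(part) for part in split_num(n)) for n in nums]
print('\n'.join(lines))
```

Joining the parts with ', ' prints one line per input value, so short numbers appear on their own and long ones as a pair.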

If you just want a list of the non-zero digits in the original list, you can use this:
a = [ 1002, 2, 3, 7 ,9, 15, 5992]
strings = [str(el) for el in a]
str_digits = [char for el in strings for char in el if char != '0']
and if you want the digits as ints, you can do:
int_digits = [int(el) for el in str_digits]
or go straight to
int_digits = [int(char) for el in strings for char in el if char != '0']
I'm not sure what the logic behind your desired output is, though, so I'm sorry if this isn't helpful.

Python: join string list if else oneliner

Working with Pandas, I have to rewrite queries implemented as a dict:
query = {"height": 175}
The key is the attribute for the query and the value could be a scalar or iterable.
In the first part I check if the value is not NaN and scalar.
If this condition holds, I write the query expression with ==; otherwise, if the value is an Iterable, I need to write the expression with the in keyword.
This is the actual code, which I need to fix so that it also works with Iterables.
import numpy as np
from collections import Iterable
def query_dict_to_expr(query: dict) -> str:
    expr = " and ".join(["{} == {}"
                         .format(k, v) for k, v in query.items()
                         if (not np.isnan(v)
                             and np.isscalar(v))
                         else "{} in @v".format(k) if isinstance(v, Iterable)
                         ]
                        )
    return expr
but I get an "invalid syntax" error at the else.

If I understand correctly, you don't need to check the type:
In [47]: query
Out[47]: {'height': 175, 'lst_col': [1, 2, 3]}
In [48]: ' and '.join(['{} == {}'.format(k,v) for k,v in query.items()])
Out[48]: 'height == 175 and lst_col == [1, 2, 3]'
Demo:
In [53]: df = pd.DataFrame(np.random.randint(5, size=(5,3)), columns=list('abc'))
In [54]: df
Out[54]:
a b c
0 0 0 3
1 4 2 4
2 2 2 3
3 0 1 0
4 0 4 1
In [55]: query = {"a": 0, 'b':[0,4]}
In [56]: q = ' and '.join(['{} == {}'.format(k,v) for k,v in query.items()])
In [57]: q
Out[57]: 'a == 0 and b == [0, 4]'
In [58]: df.query(q)
Out[58]:
a b c
0 0 0 3
4 0 4 1

You misplaced the if/else in the comprehension. If you put the if after the for, as in f(x) for x in iterable if g(x), it filters the elements of the iterable (and cannot be combined with an else). Instead, you want to keep all the elements, i.e. use f(x) for x in iterable, where f(x) just happens to be a ternary expression of the form a(x) if c(x) else b(x).
Instead, try like this (simplified non-numpy example):
>>> query = {"foo": 42, "bar": [1,2,3]}
>>> " and ".join(["{} == {}".format(k, v)
...               if not isinstance(v, list)
...               else "{} in {}".format(k, v)
...               for k, v in query.items()])
'foo == 42 and bar in [1, 2, 3]'
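Putting both pieces together, a comprehension can use a ternary expression before the for and a filter after it. The sketch below is one way the original function might be fixed; it uses only the standard library (math.isnan instead of np.isnan), and the helpers is_nan and is_listlike are hypothetical names introduced here, not part of pandas or numpy. Treating strings as scalars is an assumption.

```python
import math
from collections.abc import Iterable


def is_nan(value):
    # Only floats can be NaN; math.isnan would raise on lists or strings.
    return isinstance(value, float) and math.isnan(value)


def is_listlike(value):
    # Assumption: strings and bytes count as scalars, not as iterables.
    return isinstance(value, Iterable) and not isinstance(value, (str, bytes))


def query_dict_to_expr(query: dict) -> str:
    return " and ".join(
        # Ternary expression: chooses the text produced for each kept item.
        "{} in {}".format(k, list(v)) if is_listlike(v)
        else "{} == {!r}".format(k, v)
        for k, v in query.items()
        # Trailing if: filters items out entirely (no else allowed here).
        if not is_nan(v)
    )


expr = query_dict_to_expr({"height": 175, "age": [30, 40], "weight": float("nan")})
print(expr)
```

Note that collections.abc is the import location for Iterable in current Python versions; the plain collections import used in the question no longer works on recent releases.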

Python Pandas: Inconsistent behaviour of boolean indexing on a Series using the len() method

I have a Series of strings and I need to apply boolean indexing using len() on it.
In one case it works, in another case it does not:
The working case is a groupby on a DataFrame, followed by a unique() on the resulting Series and an apply(str) to change the resulting numpy.ndarray entries into strings:
import pandas as pd
df = pd.DataFrame({'A':['a','a','a','a','b','b','b','b'],'B':[1,2,2,3,4,5,4,4]})
dg = df.groupby('A')['B'].unique().apply(str)
db = dg[len(dg) > 2]
This just works fine and yields the desired result:
>>db
Out[119]: '[1 2 3]'
The following however throws KeyError: True:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
ls = ss[len(ss) > 2]
Both objects dg and ss are just Series of Strings:
>>type(dg)
Out[113]: pandas.core.series.Series
>>type(ss)
Out[114]: pandas.core.series.Series
>>type(dg['a'])
Out[115]: str
>>type(ss[0])
Out[116]: str
I'm following the syntax as described in the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
I can see a potential conflict because len(ss) on its own returns the length of the Series itself and now that exact command is used for boolean indexing ss[len(ss) > 2], but then I'd expect neither of the two examples to work.
Right now this behaviour seems inconsistent, unless I'm missing something obvious.

I think you need str.len, because you need the length of each value in the Series, not the length of the Series itself:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
print (ss.str.len())
0 1
1 1
2 2
3 2
4 4
5 2
6 3
dtype: int64
print (ss.str.len() > 2)
0 False
1 False
2 False
3 False
4 True
5 False
6 True
dtype: bool
ls = ss[ss.str.len() > 2]
print (ls)
4 eeee
6 ggg
dtype: object
If you use len, you get the length of the Series:
print (len(ss))
7
Another solution is to apply len:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
ls = ss[ss.apply(len) > 2]
print (ls)
4 eeee
6 ggg
dtype: object
The first script is wrong as well; you need to apply len there too:
df = pd.DataFrame({'A':['a','a','a','a','b','b','b','b'],'B':[1,2,2,2,4,5,4,6]})
dg = df.groupby('A')['B'].unique()
print (dg)
A
a [1, 2]
b [4, 5, 6]
Name: B, dtype: object
db = dg[dg.apply(len) > 2]
print (db)
A
b [4, 5, 6]
Name: B, dtype: object
If you cast the arrays to str, you get a different length (length of the data + length of the brackets + length of the whitespace):
dg = df.groupby('A')['B'].unique().apply(str)
print (dg)
A
a [1 2]
b [4 5 6]
Name: B, dtype: object
print (dg.apply(len))
A
a 5
b 7
Name: B, dtype: int64
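The same distinction can be seen with plain Python lists, which may make it easier to spot. This is only an analogy to the pandas behaviour described above, not pandas code; the variable names are illustrative.

```python
words = ['a', 'b', 'cc', 'dd', 'eeee', 'ff', 'ggg']

# len() of the container is one number, so the comparison is one bool.
single_flag = len(words) > 2

# Applying len() to each element gives one bool per element: a mask.
mask = [len(word) > 2 for word in words]
selected = [word for word, keep in zip(words, mask) if keep]
print(single_flag)
print(selected)

# Casting a container to str changes what len() measures:
# three elements become nine characters, brackets and separators included.
numbers = [1, 2, 3]
print(len(numbers), len(str(numbers)))
```

In pandas, ss.str.len() and ss.apply(len) play the role of the element-wise list comprehension, while len(ss) corresponds to the single number.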

Python - string to matrix representation

I have a string a="1 2 3; 4 5 6". How do I express this as a matrix [1 2 3; 4 5 6] in Python?
I want to then use another such string b, convert to a matrix and find a x b.

You can use the numpy module to create a matrix directly from a string in MATLAB-style format:
>>> import numpy as np
>>> a="1 2 3; 4 5 6"
>>> np.matrix(a)
matrix([[1, 2, 3],
[4, 5, 6]])
You can use the same library to do matrix multiplication
>>> A = np.matrix("1 2 3; 4 5 6")
>>> B = np.matrix("2 3; 4 5; 6 7")
>>> A * B
matrix([[28, 34],
[64, 79]])
Read up on the numpy library; it is a very powerful module for exactly the kind of work you are referring to.

This is one way to do it. Split the string at ';' to get the rows, then split each row at ' ', convert every piece to an int and append it to a sublist. Finally, append each sublist to the outer list:
a = "1 2 3; 4 5 6"
aSplit = a.split('; ')
l = []
for item in aSplit:
    subl = []
    for num in item.split(' '):
        subl.append(int(num))
    l.append(subl)
print(l)
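For the second half of the question (multiplying two such matrices) without numpy, the nested lists can be multiplied directly. The following is a minimal sketch; parse_matrix and matmul are hypothetical helper names, and the code assumes well-formed input whose shapes are compatible.

```python
def parse_matrix(text):
    """Turn '1 2 3; 4 5 6' into [[1, 2, 3], [4, 5, 6]]."""
    # split() with no argument tolerates extra spaces around the numbers.
    return [[int(token) for token in row.split()] for row in text.split(';')]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    # zip(*b) iterates over the columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


A = parse_matrix("1 2 3; 4 5 6")
B = parse_matrix("2 3; 4 5; 6 7")
product = matmul(A, B)
print(product)
```

For anything beyond small examples, numpy is the better tool; this version only shows what the multiplication does.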

Slicing a list from the end

Say, I have a list of values:
>>> a = [1, 2, 3, 4]
How can I make it include the end value through slicing? I expected:
>>> a[4:]
[4]
instead of:
>>> a[4:]
[]

Slicing indices start from zero
So if you have:
>>> xs = [1, 2, 3, 4]
           |  |  |  |
           V  V  V  V
           0  1  2  3   <-- index in xs
And you slice from 4 onwards you get:
>>> xs[4:]
[]
Four is the length of xs, not the last index!
However if you slice from 3 onwards (the last index of the list):
>>> xs[3:]
[4]
See: Data Structures
Many common programming languages and software systems are in fact zero-based, so please have a read of Zero-based Numbering.

Python indexes are zero based. The last element is at index 3, not 4:
>>> a = [1,2,3,4]
>>> a[3:]
[4]

a = [1,2,3,4]
a[-1:]
In Python you can index a list from the beginning (0, 1, 2, ...) or from the end (-1, -2, ...):
 1,  2,  3,  4
 |   |   |   |
 0   1   2   3   (or)
-4  -3  -2  -1
So if you want the last element of the list as a slice, you can use either a[len(a)-1:] or a[-1:].
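A short check of the variants mentioned in these answers, using the list from the question (the variable names are illustrative):

```python
a = [1, 2, 3, 4]

from_last_index = a[3:]          # slice starting at the last valid index
from_negative = a[-1:]           # same slice, counted from the end
from_computed = a[len(a) - 1:]   # same slice, index computed from the length
past_the_end = a[4:]             # starts at len(a): empty, but no error
last_item = a[-1]                # indexing (no colon) returns the element itself

print(from_last_index, from_negative, from_computed)
print(past_the_end)
print(last_item)
```

Slices never raise IndexError for out-of-range bounds, which is why a[4:] quietly returns an empty list instead of failing.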
