Trying to convert a string into a list in Pandas - python

I am trying to convert a column, which looks something like this
cleaned
['11-111']
['12-345', '67-789']
['00-000', '01-234']
into lists, since I read in this article that Pandas initially interprets lists as strings:
https://towardsdatascience.com/dealing-with-list-values-in-pandas-dataframes-a177e534f173
I am using the function mentioned in the article:
master["cleaned"] = master["cleaned"].apply(eval)
but I am getting this error:
eval() arg 1 must be a string, bytes or code object
I tried looking it up, but I can't figure it out.

df.cleaned = pd.eval(df.cleaned)
There doesn't appear to be a built-in way to deal with failures, so you can make one yourself:
def try_eval(x):
    try:
        return eval(x)
    except:
        return x

df.cleaned = df.cleaned.apply(try_eval)
Then you can look for the ones that didn't convert by doing:
df.cleaned[df.cleaned.apply(lambda x: isinstance(x, str))]
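For reference, here is a minimal, self-contained sketch of the same idea on a toy DataFrame (the frame contents are made up for illustration); it uses ast.literal_eval instead of eval as a safer parser:

import ast
import pandas as pd

# Hypothetical toy frame: the "cleaned" column holds list literals stored as strings
master = pd.DataFrame({"cleaned": ["['11-111']", "['12-345', '67-789']", "['00-000', '01-234']"]})

def try_parse(x):
    """Parse a list-literal string; leave the value untouched if parsing fails."""
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError, TypeError):
        return x

master["cleaned"] = master["cleaned"].apply(try_parse)
print(master["cleaned"].apply(type).unique())  # [<class 'list'>]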

Related

Pandas Dataframe column value's type is returning as a string though it is list of strings [duplicate]

I have a dataframe with a column containing tuple data as a string, e.g. '(5,6)'. I need to convert this to a tuple structure. One way of doing it is using ast.literal_eval(). I am using it this way:
df['Column'] = df['Column'].apply(ast.literal_eval)
Unfortunately, my data in this column also contains empty strings. ast.literal_eval() is not able to handle this, and I get this error:
SyntaxError: unexpected EOF while parsing
I am unsure if this is because it is unable to handle such a value. Based on my reading, ast.literal_eval() only works when the string contains a list, dict or tuple literal.
To overcome this I tried to create my own function and return the original value if it raises an exception.
def literal_return(val):
    try:
        return ast.literal_eval(val)
    except ValueError:
        return val

df['Column2'] = df['Column'].apply(literal_return)
Even in this case, the same error pops up. How do we handle this? It would be great even if there were a way to skip certain rows and apply the function to the rest. Any help is appreciated.
I would do it by simply requiring a string type from each entry:
from ast import literal_eval
df['column_2'] = df.column_1.apply(lambda x: literal_eval(str(x)))
If you need more advanced exception handling, you could do, for example:
def f(x):
    try:
        return literal_eval(str(x))
    except Exception as e:
        print(e)
        return []

df['column_2'] = df.column_1.apply(lambda x: f(x))
This works when the function is changed to:
def literal_return(val):
    try:
        return ast.literal_eval(val)
    except (ValueError, SyntaxError) as e:
        return val
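As a quick self-contained check of the fixed function (the column contents here are hypothetical, mixing tuple literals and an empty string):

import ast
import pandas as pd

def literal_return(val):
    try:
        return ast.literal_eval(val)
    except (ValueError, SyntaxError):
        return val

# Hypothetical data: tuple literals stored as strings, plus an empty string
df = pd.DataFrame({'Column': ['(5,6)', '', '(1,2)']})
df['Column2'] = df['Column'].apply(literal_return)
print(df['Column2'].tolist())  # [(5, 6), '', (1, 2)]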

type error list indices must be integers not str python. Why?

This is code for format-printing multiple lists:
print("{0[0]:10s} {1[0]:20} {2[0]:5} £{3[0]:6} £{4[0]:<7}".format(gtinlist, desclist, qtylist, pricelist, valuelist))
This prints the first value of each list
But as soon as I change it to:
print("{0[0:9]:10s} {1[0:9]:20} {2[0:9]:5} £{3[0:9]:6} £{4[0:9]:<7}".format(gtinlist, desclist, qtylist, pricelist, valuelist))
And if I put any number in place of the :9, it still does not work.
I don't understand why
Help?
If you want to print the 9th position, you can use
"{0[8]:10s}"
Using
"{0[0:9]:10s}"
will tell format() to use the key "0:9", which is a string but lists only have integer keys:
TypeError: list indices must be integers or slices, not str
In Python, derp[0:9] means: take a slice of the list derp from index 0 to index 8. But format() does not interpret the 0:9 as Python syntax; it only checks whether the instance has a __getitem__ method and passes the key to it:
>>> class donk():
...     def __getitem__(self, k):
...         return 5
...
>>> a = donk()
>>> 'bla{0[3]}'.format(a)
'bla5'
If you want to print the first nine elements here, maybe separated by commas, you may use:
", ".join(['{:10s}'.format(a) for a in my_list[0:9]])
Additionally, if you want to print everything in your lists on separate lines, you can use a for loop:
for i, gt in enumerate(gtinlist):
    print("{:10s} {:20} {:5} £{:6} £{:<7}".format(gt, desclist[i], qtylist[i], pricelist[i], valuelist[i]))
In this approach, len(gtinlist) must be less than or equal to the lengths of desclist, qtylist, pricelist and valuelist.
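An equivalent loop using zip avoids the manual indexing; this is only a sketch, and the list contents below are made-up examples assuming all five lists have the same length:

# Hypothetical parallel lists of equal length
gtinlist = ["12345678", "87654321"]
desclist = ["Widget", "Gadget"]
qtylist = [3, 7]
pricelist = [1.99, 4.50]
valuelist = [5.97, 31.50]

for gtin, desc, qty, price, value in zip(gtinlist, desclist, qtylist, pricelist, valuelist):
    print("{:10s} {:20} {:5} £{:6} £{:<7}".format(gtin, desc, qty, price, value))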
Apparently slicing isn't supported there: only integer-looking keys (which are turned into ints) are accepted, and everything else will be interpreted as a string index:
>>> class Foo:
...     def __getitem__(self, index):
...         print('got asked for:', type(index).__name__, repr(index))
...
>>> x = '{0[3]}'.format(Foo())
got asked for: int 3
>>> x = '{0[3:7]}'.format(Foo())
got asked for: str '3:7'
How it would look if a slice were requested:
>>> Foo()[3:7]
got asked for: slice slice(3, 7, None)
I checked the docs but couldn't really find an answer, just that "an expression of the form '[index]' does an index lookup using __getitem__()", which could in principle support slicing. But they don't even mention that integer-looking keys will be turned into ints.

Count number of strings

First of all, I do realize that this is a really simple question and please bear with me on this.
How, in Python, can I get the number of strings? I am trying to do something like this:
def func(input_strings):
    # Make the input_strings iterable
    if len(input_strings) == 1:
        input_strings = (input_strings, )
    # Do something
    for input_string in input_strings:
        analyze_string(input_string)
    return
So with this function, if the input is a list, ['string1', 'string2', 'string3'], it will loop over them; if the input is only one string like 'string1', then it will still take care of it, instead of throwing an exception.
However, len() on a string returns the number of characters in the string and wouldn't give me a 1.
I'd really appreciate your help!
Use isinstance to check whether a given value is a string:
>>> isinstance('a-string', str)
True
>>> isinstance(['a-string', 'another-string'], str)
False
def func(input_strings):
    # Make the input_strings iterable
    if isinstance(input_strings, str):
        input_strings = (input_strings, )
    # Do something
    for input_string in input_strings:
        analyze_string(input_string)
    return
Python 2.x note (Python 2.3+)
Use isinstance('a-string', basestring) if you also want to test for unicode (basestring is the superclass of str and unicode).
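If the same code has to run under both Python 2 and 3, one option (a small sketch, not from the original answers) is to pick the tuple of string types up front:

import sys

# On Python 3 only str exists; on Python 2, basestring covers both str and unicode
string_types = (str,) if sys.version_info[0] >= 3 else (basestring,)

def is_string(value):
    return isinstance(value, string_types)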
I'd suggest using *args to allow the function to accept any number of strings.
def func(*input_strings):
    for input_string in input_strings:
        analyze_string(input_string)
func("one string")
func("lots", "of", "strings")
If you then want the number of strings, you can simply use len(input_strings).
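To tie this back to the original question of counting, here is a short sketch; analyze_string is just a hypothetical placeholder:

def analyze_string(s):
    # Hypothetical placeholder analysis
    print(len(s), "characters in", repr(s))

def func(*input_strings):
    print("received", len(input_strings), "string(s)")
    for input_string in input_strings:
        analyze_string(input_string)

func("one string")              # received 1 string(s)
func("lots", "of", "strings")   # received 3 string(s)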

How to return multiple strings from a script to the rule sequence in booggie 2?

This is an issue specific to the use of python scripts in booggie 2.
I want to return multiple strings to the sequence and store them there in variables.
The script should look like this:
def getConfiguration(config_id):
    """ Signature: getConfiguration(int): string, string"""
    return "string_1", "string_2"
In the sequence I want to have this:
(param_1, param_2) = getConfiguration(1)
Please note: The booggie-project does not exist anymore but led to the development of Soley Studio which covers the same functionality.
Scripts in booggie 2 are restricted to a single return value.
But you can return an array which then contains your strings.
Sadly Python arrays are different from GrGen arrays so we need to convert them first.
So your example would look like this:
def getConfiguration(config_id):
    """ Signature: getConfiguration(int): array<string>"""
    # TypeHelper in booggie 2 contains conversion methods from Python to GrGen types
    return TypeHelper.ToSeqArray(["string_1", "string_2"])
Return a tuple:
return ("string_1", "string_2")
See this example
In [124]: def f():
   .....:     return (1,2)
   .....:
In [125]: a, b = f()
In [126]: a
Out[126]: 1
In [127]: b
Out[127]: 2
Still, it's not possible to return multiple values, but a Python list is now converted into a C# array that works in the sequence.
The Python script itself should look like this:
def getConfiguration(config_id):
    """ Signature: getConfiguration(int): array<string>"""
    return ["feature_1", "feature_2"]
In the sequence, you can then use this list as if it was an array:
config_list:array<string> # initialize array of string
(config_list) = getConfigurationList(1) # assign script output to that array
{first_item = config_list[0]} # get the first string("feature_1")
{second_item = config_list[1]} # get the second string("feature_2")

Method for guessing type of data currently represented as strings

I'm currently parsing CSV tables and need to discover the "data types" of the columns. I don't know the exact format of the values. Obviously, everything that the CSV parser outputs is a string. The data types I am currently interested in are:
integer
floating point
date
boolean
string
My current thoughts are to test a sample of rows (maybe several hundred?) in order to determine the types of data present through pattern matching.
I am particularly concerned about the date data type - is there a Python module for parsing common date idioms (obviously I will not be able to detect them all)?
What about integers and floats?
ast.literal_eval() can get the easy ones.
Dateutil comes to mind for parsing dates.
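As a minimal sketch of that idea, assuming the third-party python-dateutil package is installed (the helper name looks_like_date is just illustrative):

from dateutil import parser

def looks_like_date(s):
    """Return True if dateutil can parse the string as a date."""
    try:
        parser.parse(s)
        return True
    except (ValueError, OverflowError):
        return False

print(looks_like_date("2019-01-01"))   # True
print(looks_like_date("not a date"))   # False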
For integers and floats you could always try a cast in a try/except section
>>> f = "2.5"
>>> i = "9"
>>> ci = int(i)
>>> ci
9
>>> cf = float(f)
>>> cf
2.5
>>> g = "dsa"
>>> cg = float(g)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): dsa
>>> try:
...     cg = float(g)
... except:
...     print "g is not a float"
...
g is not a float
>>>
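The same try/except idea can be wrapped in a small reusable helper; this is only a sketch, and the name to_number is illustrative (it tries the stricter conversion first):

def to_number(s):
    """Try int first, then float; return the original string if neither works."""
    for cast in (int, float):
        try:
            return cast(s)
        except ValueError:
            pass
    return s

print(to_number("9"))     # 9
print(to_number("2.5"))   # 2.5
print(to_number("dsa"))   # dsa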
The data types I am currently interested in are...
These do not exist in a CSV file. The data is only strings. Only. Nothing more.
test a sample of rows
Tells you nothing except what you saw in the sample. The next row after your sample can be a string which looks entirely different from the sampled strings.
The only way you can process CSV files is to write CSV-processing applications that assume specific data types and attempt conversion. You cannot "discover" much about a CSV file.
If column 1 is supposed to be a date, you'll have to look at the string and work out the format. It could be anything: a number, or a typical Gregorian date in US or European format (there's no way to know whether 1/1/10 is US or European).
try:
    x = datetime.datetime.strptime(row[0], some_format)  # some_format: whatever date format you expect
except ValueError:
    # column is not valid
    pass
If column 2 is supposed to be a float, you can only do this:
try:
    y = float(row[1])
except ValueError:
    # column is not valid
    pass
If column 3 is supposed to be an int, you can only do this:
try:
    z = int(row[2])
except ValueError:
    # column is not valid
    pass
There is no way to "discover" if the CSV has floating-point digit strings except by doing float on each row. If a row fails, then someone prepared the file improperly.
Since you have to do the conversion to see if the conversion is possible, you might as well simply process the row. It's simpler and gets you the results in one pass.
Don't waste time analyzing the data. Ask the folks who created it what's supposed to be there.
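Putting those per-column attempts together into a single pass might look like the sketch below; the column order, the date format "%m/%d/%Y" and the file name data.csv are assumptions for illustration only:

import csv
import datetime

def convert_row(row):
    """Convert a hypothetical (date, float, int) row, flagging bad columns with None."""
    converted = []
    try:
        converted.append(datetime.datetime.strptime(row[0], "%m/%d/%Y"))
    except ValueError:
        converted.append(None)  # column 1 is not a valid date
    try:
        converted.append(float(row[1]))
    except ValueError:
        converted.append(None)  # column 2 is not a valid float
    try:
        converted.append(int(row[2]))
    except ValueError:
        converted.append(None)  # column 3 is not a valid int
    return converted

with open("data.csv", newline="") as handle:   # hypothetical file name
    for row in csv.reader(handle):
        print(convert_row(row))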
You may be interested in this Python library, which does exactly this kind of type guessing on both general Python data and CSV and XLS files:
https://github.com/okfn/messytables
https://messytables.readthedocs.org/ - docs
It happily scales to very large files, to streaming data off the internet etc.
There is also an even simpler wrapper library that includes a command line tool named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy!)
The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164
We tested ast.literal_eval(), but recovering from errors is pretty slow. If you want to cast data that you receive entirely as strings, I think a regex approach is faster.
Something like the following worked very well for us:
import datetime
import re

def guess_type(s):
    """Helper function to detect the appropriate type for a given string."""
    if s == "":
        return None
    elif re.match(r"\A[0-9]+\.[0-9]+\Z", s):
        return float
    elif re.match(r"\A[0-9]+\Z", s):
        return int
    # 2019-01-01 or 01/01/2019 or 01/01/19
    elif re.match(r"\A[0-9]{4}-[0-9]{2}-[0-9]{2}\Z", s) or \
         re.match(r"\A[0-9]{2}/[0-9]{2}/([0-9]{2}|[0-9]{4})\Z", s):
        return datetime.date
    elif re.match(r"\A(true|false)\Z", s):
        return bool
    else:
        return str
Tests:
assert guess_type("") == None
assert guess_type("this is a string") == str
assert guess_type("0.1") == float
assert guess_type("true") == bool
assert guess_type("1") == int
assert guess_type("2019-01-01") == datetime.date
assert guess_type("01/01/2019") == datetime.date
assert guess_type("01/01/19") == datetime.date
