I have a CSV with over 4000 rows, in which each cell contains a tuple holding a specific coordinate. I would like to convert it to a NumPy array to work with. I use pandas to convert it into a DataFrame before calling df.values. However, after calling df.values, each tuple becomes a string "(x, y)" instead. Is it possible to prevent this from happening? Thank you.
df = pd.read_csv(sample_data)
array = df.values
I think the problem is that CSV stores everything as text, so the tuples come back as strings.
You need to convert them:
import ast
df['col'] = df['col'].apply(ast.literal_eval)
Or if all columns are tuples:
df = df.applymap(ast.literal_eval)
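Putting it together for the original coordinate problem, here is a minimal sketch (the file name sample_data.csv and column name 'col' are hypothetical):
import ast
import numpy as np
import pandas as pd

df = pd.read_csv('sample_data.csv')
df['col'] = df['col'].apply(ast.literal_eval)  # "(x, y)" -> (x, y)
# df.values would now hold tuple objects; for a numeric (N, 2) array instead:
coords = np.array(df['col'].tolist())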
It seems that you are reading the file from a local path?
My answer is to use eval to convert the strings (fine if you trust the file, since eval executes arbitrary code):
df = df.apply(lambda x: x.apply(eval))
Another way to change the data type after reading the csv:
df['col'] = df['col'].apply(tuple)
Note this only works once the cells hold actual sequences (e.g. lists); applied to the raw string "(x, y)" it splits the string into individual characters, so parse with ast.literal_eval first.
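A small sketch of the difference, with made-up values:
import pandas as pd

pd.Series([[1, 2], [3, 4]]).apply(tuple)   # lists -> tuples, as intended
pd.Series(['(1, 2)']).apply(tuple)         # string -> ('(', '1', ',', ' ', '2', ')')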
I have a pandas dataframe where one column contains sets of strings (each row is a single set of strings). However, when I save this dataframe to csv and read it back into a pandas dataframe later, each set of strings in this particular column seems to have been saved as a single string. For example, the value in a given row should be a single set of strings, but it is read back in as one long string.
I need to access this data as a python set of strings, is there a way to turn this back into a set? Or better yet, have pandas read this back in as a set?
Wrapping the string in set() will not give you the original set back: set(string) builds a set of the individual characters. Parse the literal instead, e.g. with ast.literal_eval:
import ast
string = "{'+-0-', '0---', '+0+-', '0-0-', '++++', '+++0', '+++-', '+---', '0+++', '0++0', '0+00', '+-+-', '000-', '+00-'}"
new_set = ast.literal_eval(string)
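A quick check with the example string above:
print(type(new_set))       # <class 'set'>
print('+-0-' in new_set)   # True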
I think you could use a different separator while writing the dataframe to csv.
import pandas as pd
df = pd.DataFrame(["{'Ramesh','Suresh','Sachin','Venkat'}"],columns=['set'])
print('Old df \n', df)
df.to_csv('mycsv.csv', sep= ';', index=False)
new_df = pd.read_csv('mycsv.csv', sep= ';')
print('New df \n',new_df)
Output:
You can use Series.apply, I think:
Let's say your column of sets is called column_of_sets. Assuming you've already read the csv, do this to convert it back to sets:
df['column_of_sets'] = df['column_of_sets'].apply(eval)
I'm taking eval from @Cabara's comment. I think it is the best bet.
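If the csv is not fully trusted, ast.literal_eval is a safer drop-in here, since set literals are valid Python literals:
import ast
df['column_of_sets'] = df['column_of_sets'].apply(ast.literal_eval)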
I am applying a method from a custom Python library, mlfinlab.
When I apply this calculation (here is the link), under get_ema_dollar_imbalance_bars, I get back an object of type <class 'tuple'>, and towards the end of the output I see the following:
[4798 rows x 10 columns],
Empty DataFrame
Columns: []
Index: [] , dtype=object)
What I am trying to do is convert it into a pandas dataframe.
I have tried different approaches, including transforming it first to a NumPy array and then into pandas df format, but it is still not working.
Any help/guidance is appreciated and welcomed.
I have included some snapshots as well.
Best regards
First option:
Convert your tuple to a list with:
lst = list(yourtuple)
Convert to a dataframe:
df = pd.DataFrame(lst)
Second option:
I can't see the formatting of your tuple, but assuming you need more than one column and your tuple is structured as ((col1, (entry1, entry2)), (col2, (entry1, entry2))), you can do the following:
Create a dictionary:
newdict = dict(yourtuple)
Convert to a dataframe:
df = pd.DataFrame(newdict)
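A hypothetical end-to-end example of the second option (all names and values made up):
import pandas as pd

yourtuple = (('col1', (1, 2)), ('col2', (3, 4)))
newdict = dict(yourtuple)     # {'col1': (1, 2), 'col2': (3, 4)}
df = pd.DataFrame(newdict)    # two columns, two rows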
I have a list of Pandas Series objects, which I've created by doing something like this:
li = []
li.append(input_df.iloc[0])
li.append(input_df.iloc[4])
where input_df is a Pandas DataFrame.
I want to convert this list of Series objects back to a Pandas DataFrame, and was wondering if there is some easy way to do it.
Based on the post, you can do this with:
pd.DataFrame(li)
To everyone suggesting pd.concat: li is not a Series anymore. Values are being appended to a list, so the data type of li is list. To convert the list to a dataframe, use pd.DataFrame(<list name>).
Since the right answer is hidden in the comments, I thought it would be better to post it as an answer:
pd.concat(li, axis=1).T
will convert the list li of Series to DataFrame
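A toy frame to show both routes give the same rows back (column names and values are made up):
import pandas as pd

input_df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})
li = [input_df.iloc[0], input_df.iloc[4]]
print(pd.concat(li, axis=1).T)   # rows 0 and 4 back as a DataFrame
print(pd.DataFrame(li))          # same shape and values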
It seems that you wish to perform a customized melting of your dataframe.
Using the pandas library, you can do it with one line of code. I am creating the example below to replicate your problem:
import pandas as pd
input_df = pd.DataFrame(data={'1': [1, 2, 3, 4, 5],
                              '2': [1, 2, 3, 4, 5],
                              '3': [1, 2, 3, 4, 5],
                              '4': [1, 2, 3, 4, 5],
                              '5': [1, 2, 3, 4, 5]})
Using pd.DataFrame, you can create a new dataframe that collects your two selected rows:
li = []
li.append(input_df.iloc[0])
li.append(input_df.iloc[4])
new_df = pd.DataFrame(li)
If what you want is for those two rows to end up under one column, I would not pass them as a list back to the dataframe constructor.
Instead, you can just concatenate the two rows, disregarding their column names (Series.append, which older answers used, was removed in pandas 2.0):
new_df = pd.concat([input_df.iloc[0], input_df.iloc[4]])
Let me know if this answers your question.
The answer is already mentioned, but I would like to share my version:
li_df = pd.DataFrame(li).T
If you want each Series to be a row of the dataframe, you should not use concat() followed by .T, unless all your values are of the same datatype.
If your data has both numerical and string values, the transpose will mangle the dtypes, likely turning them all into object.
The right way to do this in general is:
Convert each Series to a dict.
Pass the list of dicts either into the pd.DataFrame() constructor directly or use pd.DataFrame.from_records().
In your case the following should work:
my_list_of_dicts = [s.to_dict() for s in li]
my_df = pd.DataFrame(my_list_of_dicts)
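A quick sketch (toy column names) showing the dtypes survive this route:
import pandas as pd

input_df = pd.DataFrame({'num': [1, 2], 'txt': ['a', 'b']})
li = [input_df.iloc[0], input_df.iloc[1]]
my_df = pd.DataFrame([s.to_dict() for s in li])
print(my_df.dtypes)   # num stays int64, txt stays object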
I have a large string array which I store as an ndarray named np_base: np.shape(np_base)
Out[32]: (65000000, 1)
What I intend to do is vertically slice the array in order to decompose it into multiple columns that I'll later store independently, so I tried to loop over the row indexes and append:
for i in range(65000000):
    INCDN.append(np_base[i, 0][0:5])
but this throws a memory error.
Could anybody please help me out with this issue? I've been searching for days for an alternative way to slice the string array.
Thanks,
There are many ways to apply a function to a numpy array, one of which is the following:
np_truncated = np.vectorize(lambda x: x[:5])(np_base)
Your approach of iteratively appending to a list is usually the worst-performing solution in most contexts.
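Another vectorized option, sketched under the assumption that np_base holds fixed-width unicode strings: casting to a narrower string dtype truncates every element without a Python-level loop.
import numpy as np

first5 = np_base.astype('<U5')  # a narrowing cast keeps only the first 5 characters of each string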
Alternatively, if you intend to work with many columns, you might want to use pandas.
import pandas as pd
df = pd.DataFrame(np_base, columns=["Raw"])
truncated = df.Raw.str.slice(0,5)
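If the rest of the pipeline expects numpy rather than pandas, the truncated column converts straight back:
first5_np = truncated.to_numpy()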
I have a csv file with a column whose values are dictionaries (column 'b' in the example below). But b in df is of string type even though I define it as the dictionary type. I couldn't find a solution to this question. Any suggestions?
a = pd.DataFrame({'a': [1,2,3], 'b':[{'a':3, 'haha':4}, {'c':3}, {'d':4}]})
a.to_csv('tmp.csv', index=None)
df = pd.read_csv('tmp.csv', dtype={'b':dict})
I wonder if your CSV column is actually meant to be a Python dict column, or rather JSON. If it's JSON, you can read the column as dtype=str, then parse it and use json_normalize() to explode it into multiple columns. This is an efficient solution assuming the column contains valid JSON.
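A sketch of that route, assuming the column really holds valid JSON objects (double-quoted keys; the tmp.csv written above would not qualify, since Python's dict repr uses single quotes):
import json
import pandas as pd

df = pd.read_csv('tmp.csv', dtype={'b': str})
expanded = pd.json_normalize(df['b'].apply(json.loads).tolist())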
There is no dictionary type in pandas. So you should specify object in case you want normal Python objects:
df = pd.read_csv('tmp.csv', dtype={'b':object})
This will contain strings because pandas doesn't know what dictionaries are. If you want dictionaries again, you can "eval" them with ast.literal_eval (safe string evaluation):
import ast
df['b'] = df['b'].apply(ast.literal_eval)
print(df['b'][0]['a']) # 3
If you're really confident that you never, ever run this on untrusted csvs, you could also use eval. But before you consider that, I would recommend trying to use a DataFrame only with "native" pandas or NumPy types (or maybe a DataFrame-in-DataFrame approach). It is best to avoid object dtypes as much as possible.
You could use the converters parameter.
From the documentation:
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can
either be integers or column labels.
If you know your column is well formed and includes no missing values, then you could do:
import ast
df = pd.read_csv('tmp.csv', converters={'b': ast.literal_eval})
However, for safety (as others have commented), you should probably define your own function with some basic error resilience:
def to_dict(x):
    # Parse the cell; return None for anything malformed or non-dict.
    try:
        y = ast.literal_eval(x)
        if isinstance(y, dict):
            return y
    except (ValueError, SyntaxError):
        return None
and then:
df = pd.read_csv('tmp.csv', converters = {'b': to_dict})
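Round-trip check with the frame from the question, reusing the to_dict defined above:
a = pd.DataFrame({'a': [1, 2, 3], 'b': [{'a': 3, 'haha': 4}, {'c': 3}, {'d': 4}]})
a.to_csv('tmp.csv', index=None)
df = pd.read_csv('tmp.csv', converters={'b': to_dict})
print(df['b'][0]['a'])   # 3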