Performing a JSON operation on dataframe column - python

I have a data frame in which one column holds strings, each of which can be converted to a dictionary individually with json.loads(string).
I'd like to apply json.loads() to the entire column at once, turning the column of strings into a column of dictionaries.
Is this possible?

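You can use apply or a list comprehension with the standard-library json module (pd.io.json.loads is an internal helper and is not available under that name in recent pandas versions):
import json
df['col'] = df['col'].apply(json.loads)
df['col'] = [json.loads(x) for x in df['col']]
If the strings are Python literals rather than strict JSON (for example, single-quoted keys), use ast.literal_eval instead:
import ast
df['col'] = df['col'].apply(ast.literal_eval)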
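As a minimal sketch of the difference between the two parsers, assuming a hypothetical column named col that holds JSON text: json.loads understands JSON tokens such as true and null, while ast.literal_eval only accepts Python literals.

```python
import ast
import json

import pandas as pd

# Hypothetical sample data: each cell is a JSON document stored as text.
df = pd.DataFrame({"col": ['{"a": 1, "ok": true}', '{"b": [1, 2], "ok": null}']})

# Parse every cell; the new column has object dtype and holds dicts.
df["parsed"] = df["col"].apply(json.loads)

# ast.literal_eval handles Python literals (single quotes, True, None) ...
py_literal = ast.literal_eval("{'a': 1, 'ok': True}")

# ... but rejects JSON-only tokens such as true, false and null.
try:
    ast.literal_eval('{"a": 1, "ok": true}')
    literal_eval_accepts_json = True
except (ValueError, SyntaxError):
    literal_eval_accepts_json = False
```

Which parser fits depends on how the strings were produced: text written by json.dumps needs json.loads, while text written by str() or repr() of a Python dict needs ast.literal_eval.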

Related

Python sets stored as string in a column of a pandas dataframe

I have a pandas dataframe where one column contains sets of strings (each row is a single set of strings). However, when I save this dataframe to CSV and read it back into a pandas dataframe later, each set of strings in this particular column seems to be saved as a single string. For example, the value in this particular row should be a single set of strings, but it seems to have been read in as a single string:
I need to access this data as a python set of strings, is there a way to turn this back into a set? Or better yet, have pandas read this back in as a set?
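Wrapping the string in set() will not work: set() iterates over the string, so you get a set of its individual characters. Parse the string with ast.literal_eval to turn it back into a set.
import ast
string = "{'+-0-', '0---', '+0+-', '0-0-', '++++', '+++0', '+++-', '+---', '0+++', '0++0', '0+00', '+-+-', '000-', '+00-'}"
new_set = ast.literal_eval(string)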
I think you could use a different separator when writing the dataframe to CSV.
import pandas as pd
df = pd.DataFrame(["{'Ramesh','Suresh','Sachin','Venkat'}"],columns=['set'])
print('Old df \n', df)
df.to_csv('mycsv.csv', sep= ';', index=False)
new_df = pd.read_csv('mycsv.csv', sep= ';')
print('New df \n',new_df)
Output:
I think you can use Series.apply.
Let's say your column of sets is called column_of_sets. Assuming you've already read the CSV, do this to convert the strings back to sets:
df['column_of_sets'] = df['column_of_sets'].apply(eval)
I'm taking eval from @Cabara's comment. I think it is the best bet, provided the CSV comes from a trusted source, since eval executes arbitrary code.
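As a sketch of the full round trip without eval, the column can be parsed while the CSV is read by passing ast.literal_eval through the converters parameter of read_csv. The sample data and the in-memory buffer are assumptions made for illustration; the column name column_of_sets follows the answer above.

```python
import ast
import io

import pandas as pd

# Hypothetical frame with one set of strings per row.
df = pd.DataFrame({"column_of_sets": [{"+-0-", "0---"}, {"++++"}]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the column while reading; literal_eval only accepts Python literals,
# so it cannot run arbitrary code the way eval can.
restored = pd.read_csv(buffer, converters={"column_of_sets": ast.literal_eval})
```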

How to transform a column of string into columns of a single char? Python Pandas

I am dealing with DNA sequencing data, and each value in the column looks something like "ACCGTGC". I would like to transform this into several columns, where each column contains only one character. How can I do this in Python pandas?
For performance, convert the values to lists and pass them to the DataFrame constructor:
df1 = pd.DataFrame([list(x) for x in df['col']], index=df.index)
If you need to add the new columns to the original:
df = df.join(df1)
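A small self-contained sketch of the same idea, using made-up sequences of equal length. The pos_ prefix is an illustrative choice that gives the new columns string labels.

```python
import pandas as pd

# Hypothetical sequences; all have the same length here.
df = pd.DataFrame({"col": ["ACCGTGC", "TTGACCA"]})

# One list of characters per row becomes one column per position.
chars = pd.DataFrame([list(x) for x in df["col"]], index=df.index)

# Rename the integer column labels 0..6 to pos_0..pos_6.
chars = chars.add_prefix("pos_")

result = df.join(chars)
```

If the sequences differ in length, the shorter rows are padded with missing values in the trailing columns.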

List of Series to Dataframe

I have a list having Pandas Series objects, which I've created by doing something like this:
li = []
li.append(input_df.iloc[0])
li.append(input_df.iloc[4])
where input_df is a Pandas Dataframe
I want to convert this list of Series objects back to Pandas Dataframe object, and was wondering if there is some easy way to do it
Based on the post you can do this by doing:
pd.DataFrame(li)
To everyone suggesting pd.concat: this is not a Series anymore. The values are appended to a list, so the type of li is list. To convert the list to a dataframe, use pd.DataFrame(<list name>).
Since the right answer has got hidden in the comments, I thought it would be better to mention it as an answer:
pd.concat(li, axis=1).T
will convert the list li of Series to DataFrame
It seems that you wish to perform a customized melting of your dataframe.
Using the pandas library, you can do it with one line of code. I am creating below the example to replicate your problem:
import pandas as pd
input_df = pd.DataFrame(data={
    '1': [1, 2, 3, 4, 5],
    '2': [1, 2, 3, 4, 5],
    '3': [1, 2, 3, 4, 5],
    '4': [1, 2, 3, 4, 5],
    '5': [1, 2, 3, 4, 5],
})
Using pd.DataFrame, you will be able to create your new dataframe that melts your two selected lists:
li = []
li.append(input_df.iloc[0])
li.append(input_df.iloc[4])
new_df = pd.DataFrame(li)
If what you want is for those two rows to appear in a single column, I would not collect them in a list and pass that list back to DataFrame.
Instead, you can concatenate the two rows, disregarding their column names (Series.append was removed in pandas 2.0, so use pd.concat):
new_df = pd.concat([input_df.iloc[0], input_df.iloc[4]])
Let me know if this answers your question.
The answer has already been mentioned, but I would like to share my version:
li_df = pd.DataFrame(li).T
If you want each Series to be a row of the dataframe, you should not use concat() followed by .T unless all your values are of the same datatype.
If your data has both numerical and string values, the transpose will mangle the dtypes, likely turning them all to object.
The right way to do this in general is:
Convert each Series to a dict.
Pass the list of dicts into the pd.DataFrame() constructor directly, or, if you collect the dicts in a dict keyed by row label, use pd.DataFrame.from_dict and set the orient keyword to "index".
In your case the following should work:
my_list_of_dicts = [s.to_dict() for s in li]
my_df = pd.DataFrame(my_list_of_dicts)
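The dtype concern can be checked with a short sketch. The two-column frame below is hypothetical and mixes integers with strings; passing the Series names as the index is an optional extra that keeps the original row labels, which to_dict() otherwise drops.

```python
import pandas as pd

# Hypothetical frame with one integer column and one string column.
input_df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": list("vwxyz")})
li = [input_df.iloc[0], input_df.iloc[4]]

# concat + transpose: every column ends up with object dtype.
transposed = pd.concat(li, axis=1).T

# List of dicts: the constructor infers a dtype per column again.
from_dicts = pd.DataFrame(
    [s.to_dict() for s in li],
    index=[s.name for s in li],  # keep the original row labels 0 and 4
)
```

Comparing transposed.dtypes with from_dicts.dtypes shows the difference for column a.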

How to split this data in Python pandas dataframe?

This is my pandas data frame. In the index column I want to keep only the values after the double underscore (__) and remove the rest.
Use str.split with n=1 to split on the first separator only (in case __ occurs more than once), then select the second element of each list:
df['index'].str.split('__', n=1).str[1]
Or use a list comprehension if there are no missing values and performance is important:
df['last'] = [x.split('__', 1)[1] for x in df['index']]
df['index'].apply(lambda x: x.split('__')[-1]) will also do the trick; note that it keeps only the text after the last __.
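The two approaches differ when the separator occurs more than once. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical values; the second one contains the separator twice.
df = pd.DataFrame({"index": ["abc__def", "abc__def__ghi"]})

# Split once, keep everything after the first "__".
after_first = df["index"].str.split("__", n=1).str[1]

# Split on every "__", keep only the last piece.
after_last = df["index"].apply(lambda x: x.split("__")[-1])
```

Here after_first keeps "def__ghi" for the second row, while after_last keeps only "ghi".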

How to read a column of 'dictionary' into pandas

I have a csv file with a column whose type is a dictionary (column 'b' in the example below). But b in df is a string type even though I define it as the dictionary type. I didn't find the solution to this question. Any suggestions?
a = pd.DataFrame({'a': [1,2,3], 'b':[{'a':3, 'haha':4}, {'c':3}, {'d':4}]})
a.to_csv('tmp.csv', index=None)
df = pd.read_csv('tmp.csv', dtype={'b':dict})
I wonder if your CSV column is actually meant to be a Python dict column, or rather JSON. If it's JSON, you can read the column as dtype=str, parse each value with json.loads, and then use json_normalize() on the parsed values to explode them into multiple columns. This works as long as the column contains valid JSON.
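Assuming the column really holds JSON text, the approach can be sketched as follows. json_normalize expects parsed dicts rather than strings, so each value goes through json.loads first. The CSV content and the b. prefix are made up for illustration.

```python
import io
import json

import pandas as pd

# Hypothetical CSV whose column b holds JSON objects (double-quoted keys).
csv_text = 'a,b\n1,"{""a"": 3, ""haha"": 4}"\n2,"{""c"": 3}"\n'
df = pd.read_csv(io.StringIO(csv_text), dtype={"b": str})

# Parse the JSON text, then spread the keys over separate columns.
parsed = df["b"].apply(json.loads)
expanded = pd.json_normalize(parsed.tolist())

# Prefix the new columns so they cannot collide with existing ones.
result = df[["a"]].join(expanded.add_prefix("b."))
```

Keys that are absent from a row show up as missing values in that row.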
There is no dictionary type in pandas. So you should specify object in case you want normal Python objects:
df = pd.read_csv('tmp.csv', dtype={'b':object})
This will contain strings because pandas doesn't know what dictionaries are. If you want dictionaries again, you could try to "eval" them with ast.literal_eval (safe string evaluation):
import ast
df['b'] = df['b'].apply(ast.literal_eval)
print(df['b'][0]['a']) # 3
If you're really confident that you never ever run this on untrusted csvs you could also use eval. But before you consider that I would recommend trying to use a DataFrame only with "native" pandas or NumPy types (or maybe a DataFrame in DataFrame approach). Best to avoid object dtypes as much as possible.
You could use the converters parameter.
From the documentation:
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can
either be integers or column labels.
If you know your column is well formed and includes no missing values then you could do:
import ast
df = pd.read_csv('tmp.csv', converters={'b': ast.literal_eval})
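However, for safety (as others have commented) you should probably define your own function with some basic error resilience:
def to_dict(x):
    try:
        y = ast.literal_eval(x)
    except (ValueError, SyntaxError, TypeError):
        # The cell is empty or is not a valid Python literal.
        return None
    # A literal that parses but is not a dict is also treated as missing.
    return y if isinstance(y, dict) else None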
and then:
df = pd.read_csv('tmp.csv', converters = {'b': to_dict})
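A self-contained sketch of the converter approach, reading a hypothetical in-memory CSV instead of tmp.csv. The rows are made up to cover a valid dict, a malformed cell and an empty cell.

```python
import ast
import io

import pandas as pd


def to_dict(x):
    """Return the parsed dict, or None if the cell is not a dict literal."""
    try:
        y = ast.literal_eval(x)
    except (ValueError, SyntaxError, TypeError):
        return None
    return y if isinstance(y, dict) else None


# Hypothetical CSV content: a valid dict, free text, and an empty cell.
csv_text = """a,b
1,"{'a': 3, 'haha': 4}"
2,not a dict
3,
"""

df = pd.read_csv(io.StringIO(csv_text), converters={"b": to_dict})
```

Cells that cannot be parsed come back as missing values, so they can be filtered afterwards with df['b'].notna().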
