I have a dataframe that contains JSON-style dictionaries per row, and I need to extract the fourth and eighth fields, i.e. the values of the second and fourth key-value pairs, into new columns: 'a' for row 1 and 'a' for row 2 (corresponding to Group), and '10786' for row 1 and '38971' for row 2 (corresponding to Code). The expected output is below.
dat = pd.DataFrame({ 'col1': ['{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}',
'{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}']})
expected output
a Group Code
0 {"Number":1,"Group":"a","First":"Yes","Second"... a 10786
1 {"Number":2,"Group":"a","First":"Yes","Second"... a 38971
I have tried indexing by location, but it's printing only characters rather than fields, e.g.
tuple(dat['a'].items())[0][1][4]
I also can't seem to normalize the data with json_normalize, and I'm not sure why; perhaps the JSON string is stored suboptimally. So I am quite confused, and any tips would be great. Thanks so much.
The reason pd.json_normalize is not working is that it operates on dictionaries, and your strings are not valid JSON.
Your strings are not valid JSON because the "Labs" values are wrapped in quotes that contain further, unescaped quotes.
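You can see the failure directly by feeding one of the strings to json.loads; the parser closes the "Labs" string at the first inner quote and then chokes:

```python
import json

s = '{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}'

# The parser treats '"' after '[' as the end of the "Labs" value,
# so the very next character is a syntax error.
try:
    json.loads(s)
except json.JSONDecodeError as e:
    print("not valid JSON:", e)
```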
It's possible to write a quick function to make your strings json compatible, then parse them as dictionaries, and finally use pd.json_normalize.
import pandas as pd
import json
import re
jsonify = lambda x: re.sub(pattern='"Labs":"(.*)"', repl=r'"Labs":\g<1>', string=x)  # Remove the unnecessary quotes wrapping the "Labs" list (raw string so \g is not treated as a bad escape)
dat = pd.DataFrame({ 'col1': ['{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}',
'{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}']})
json_series = dat['col1'].apply(jsonify) # Remove unnecessary quotes
json_series = json_series.apply(json.loads) # Convert string to dictionary
output = pd.json_normalize(json_series) # "Flatten" the dictionary into columns
output.insert(loc=0, column='a', value=dat['col1']) # Add the original column as a column named "a" because that's what the question requests.
                                                                                          a  Number Group First   Code  Desc             Labs
0  {"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}       1     a   Yes  10786  True  ['a', 'b', 'c']
1  {"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}       2     a   Yes  38971  True  ['a', 'b', 'c']
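If you only ever need Group and Code (as the question asks), you can skip JSON parsing entirely and pull the two values straight out of the raw strings with Series.str.extract. This is a sketch using named capture groups; it assumes Group always appears before Code in each string:

```python
import pandas as pd

dat = pd.DataFrame({'col1': ['{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}',
                             '{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}']})

# Each named group becomes a column in the result.
out = dat['col1'].str.extract(r'"Group":"(?P<Group>[^"]*)".*"Code":"(?P<Code>[^"]*)"')
out.insert(loc=0, column='a', value=dat['col1'])  # keep the original strings as column "a"
print(out)
```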
I have the string ['A', 'B', 'C'] in a cell of a dataframe imported from a CSV file by pandas (it looks exactly like (row 1, ColA) below).
Now I want to make it look like (row 2, ColA) in the dataframe, as in the output below. How do I achieve it? I also want it to look like this in Excel when I save it to CSV using to_csv.
Input:
df = pd.DataFrame({'ColA' : "['A','B','C']"}, index=[1])
Output:
            ColA
1  ['A','B','C']
2  A
   B
   C
I believe explode isn't a solution, as it separates the list into several rows.
Thank you! and Wish you guys a healthy and safe new year!
You have to do some cleanup with replace. There are two different patterns within the string, which need two separate replacements. One replacement is '\n', so the values land on new lines when sent to Excel. The other replacement is an empty string, i.e. nothing. The or operator | separates the alternative patterns that all map to the empty string ''. [ and ] are regex metacharacters, so you must escape them with \; you are basically getting rid of [, ], and '. You must also pass regex=True to replace:
df = pd.DataFrame({'ColA' : "['A','B','C']"}, index=[1])
df['ColA'] = df['ColA'].replace([r"','", r"\[|\]|'"], ['\n', ''], regex=True)
df
Out[1]:
ColA
1 A\nB\nC
And in Excel, after expanding the row height, A, B and C appear on separate lines within the one cell.
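For completeness, a quick sketch of the round trip with to_csv (the filename 'out.csv' is just an illustration). Embedded newlines are quoted automatically by the csv machinery, so each multi-line value stays inside a single cell:

```python
import pandas as pd

df = pd.DataFrame({'ColA': "['A','B','C']"}, index=[1])
df['ColA'] = df['ColA'].replace([r"','", r"\[|\]|'"], ['\n', ''], regex=True)

# The newline-containing value is written as one quoted CSV field.
df.to_csv('out.csv')  # 'out.csv' is an arbitrary example filename
```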
I'm observing random results in the column name mapping to the data when the column names are specified in a dictionary instead of a list.
In the first case below, the ordering is as expected, but in the latter case the column names get incorrectly mapped to the data.
Code:
sample_df = pd.DataFrame([[23,45,67],[99,32,11]],columns={"col1","col2","col3"})
print(sample_df)
sample_df = pd.DataFrame([[23,45,67],[99,32,11]],columns={"feat1","feat2","feat3"})
print(sample_df)
That's because your column values are neither a dictionary nor a list. You have a set there, which are inherently unordered which is why you're seeing this result. Use square brackets to denote a list (which is ordered; e.g. ["col1", "col2", "col3"]), or parentheses to denote a tuple (which is also ordered; e.g. ("col1", "col2", "col3")).
The visual difference between a set and a dictionary is that a dictionary contains mappings via a colon {"hello": "world"} (this dictionary contains 1 mapping "hello" to "world"). Whereas a set does not contain mappings such as this and simply has elements separated by commas: {"hello", "world"} (this set contains 2 items, "hello" and "world")
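To make this concrete, here is the same data from the question with the column names passed as a list instead of a set; the order is now guaranteed:

```python
import pandas as pd

# Square brackets make this a list, which preserves order.
sample_df = pd.DataFrame([[23, 45, 67], [99, 32, 11]], columns=["col1", "col2", "col3"])
print(sample_df)
#    col1  col2  col3
# 0    23    45    67
# 1    99    32    11
```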
Firstly, you specified a set in the columns parameter, and the Python set datatype is unordered and unindexed (reference). A Python dict, by contrast, must have a key and a value for each entry. Try it this way:
sample_df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
output
col1 col2
0 1 3
1 2 4
I have a column of comma-separated strings in a dataframe, as mentioned below.
df=pd.DataFrame({'a':["a,b,c"]})
df:
a
0 a,b,c
df.a.astype(str).values.tolist()
['a,b,c']
But I want the output to be ["a","b","c"], in this format. Can someone help me with the code?
The following code will result in the desired output -
df.a.apply(lambda x: str(x.split(',')))
Output -
0    ['a', 'b', 'c']
We split each value on commas and then convert the resulting list to its string representation.
Edit 1 :
This piece of code can also be used to get the same output -
df['a']=[str(i.split(',')) for i in df.a]
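Note that both snippets above store the string representation "['a', 'b', 'c']" in each cell. If you instead want actual Python lists per row (useful for further processing), Series.str.split is a sketch of one way to get them:

```python
import pandas as pd

df = pd.DataFrame({'a': ["a,b,c"]})

# str.split returns a real list per row, not a string that looks like one.
lists = df['a'].str.split(',')
print(lists[0])  # an actual Python list
```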
I have a DataFrame like:
df = pd.DataFrame({'A' : (1,2,3), 'B': ([0,1],[3,4],[6,0])})
I want to create a new column called test that displays a 1 if a 0 exists within each list in column B. The results hopefully would look like:
df = pd.DataFrame({'A' : (1,2,3), 'B': ([0,1],[3,4],[6,0]), 'test': (1,0,1)})
With a dataframe that contains strings rather than lists I have created additional columns from the string values using the following
df.loc[df['B'].str.contains(0),'test']=1
When I try this with the example df, I generate a
TypeError: first argument must be string or compiled pattern
I also tried to convert the list to a string but only retained the first integer in the list. I suspect that I need something else to access the elements within the list but don't know how to do it. Any suggestions?
This should do it for you (pd.np has been removed in recent pandas versions, so import numpy directly):
import numpy as np
df['test'] = np.where(df['B'].apply(lambda x: 0 in x), 1, 0)
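If you'd rather avoid numpy altogether, the same membership test can be done with a plain apply, casting the boolean to int for the 1/0 flag; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': (1, 2, 3), 'B': ([0, 1], [3, 4], [6, 0])})

# "0 in lst" is True/False; int() turns that into 1/0.
df['test'] = df['B'].apply(lambda lst: int(0 in lst))
print(df)
```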
I am new to numpy and pandas. I am trying to add the words and their indexes to a dataframe. The text string can be of variable length.
text = word_tokenize('this string can be of variable length')
df2 = pd.DataFrame({'index': np.array([]), 'word': np.array([])})
for i in text:
    for i, row in df2.iterrows():
        word_val = text[i]
        index_val = text.index(i)
        df2.set_value(i, 'word', word_val)
        df2.set_value(i, 'index', index_val)
print df2
To create a DataFrame from each word of your string (which can be of any length), you can directly use
df2 = pd.DataFrame(text, columns=['word'])
nltk's word_tokenize gives you a list of words, which supplies the column data, and by default pandas takes care of the index.
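If you also want each word's position as its own column (as the loop in the question tries to do), one way is to materialize the default RangeIndex with reset_index; the column name 'pos' below is an arbitrary choice for this sketch:

```python
import pandas as pd

# Stand-in for word_tokenize output, so the sketch is self-contained.
words = ['this', 'string', 'can', 'be', 'of', 'variable', 'length']

df2 = pd.DataFrame(words, columns=['word'])
# Turn the default 0..n-1 index into a regular column named 'pos'.
df2 = df2.reset_index().rename(columns={'index': 'pos'})
print(df2)
```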
Just pass the list directly into the DataFrame method:
pd.DataFrame(['i', 'am', 'a', 'fellow'], columns=['word'])
word
0 i
1 am
2 a
3 fellow
I'm not sure you want to name a column 'index'; in this case its values would be the same as the index of the DataFrame itself. Also, it's not good practice to name a column 'index', as you won't be able to access it with the df.column_name syntax, and your code could be confusing to other people.