How to split text inside a pandas dataframe into new dataframe columns - python

I have a list
list1= ['{"bank_name": null, "country": null, "url": null, "type": "Debit", "scheme": "Visa", "bin": "789452"}\n',
'{"prepaid": "", "bin": "123457", "scheme": "Visa", "type": "Debit", "bank_name": "Ohio", "url": "www.u.org", "country": "UKs"}\n']
I passed it into a dataframe:
df = pd.DataFrame({'bincol':list1})
print(df)
bincol
0 {"bank_name": null, "country": null, "url": nu...
1 {"prepaid": "", "bin": "123457", "scheme": "Vi...
I am trying to split the bincol column into new columns using this function:
import ast
import pandas as pd
from pandas import json_normalize

def explode_col(df, column_value):
    df = df.dropna(subset=[column_value])
    if isinstance(df[str(column_value)].iloc[0], str):
        df[column_value] = df[str(column_value)].apply(ast.literal_eval)
    expanded_child_df = (pd.concat({i: json_normalize(x) for i, x in df.pop(str(column_value)).items()})
                         .reset_index(level=1, drop=True)
                         .join(df, how='right', lsuffix='_left', rsuffix='_right')
                         .reset_index(drop=True))
    expanded_child_df.columns = map(str.lower, expanded_child_df.columns)
    return expanded_child_df
df2 = explode_col(df,'bincol')
But I am getting this error. Am I missing something here?
raise ValueError(f'malformed node or string: {node!r}')
ValueError: malformed node or string: <_ast.Name object at 0x7fd3aa05c400>

For your sample data, json.loads works for converting the strings to dictionaries, and then json_normalize builds the DataFrame:
import json
df = pd.json_normalize(df['bincol'].apply(json.loads))
print(df)
  bank_name country        url   type scheme     bin prepaid
0      None    None       None  Debit   Visa  789452     NaN
1      Ohio     UKs  www.u.org  Debit   Visa  123457
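As a side note on the error in the question: ast.literal_eval only understands Python literals, so the JSON null in these strings is what triggers the ValueError, while json.loads handles it. A minimal demonstration:
import ast
import json

s = '{"bank_name": null, "country": null}'
# ast.literal_eval(s) raises ValueError: malformed node or string, because null is not a Python literal
print(json.loads(s))  # {'bank_name': None, 'country': None}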

Related

Guessing data type for dataframe

Is there an algorithm to detect the data types of each column of a file or dataframe? The challenge is to suggest the data type even with wrong, missing or unnormalized data. I want to detect the data types named in the keys. My first try was to use messytables, but the result is really bad without normalizing the data first. So maybe there is an algorithm to get better results for a type suggestion, or a way to normalize the data without knowing the data. The result should match the keys from the dataframe.
import io
import pandas as pd
from messytables import CSVTableSet, type_guess

data = {
    "decimals": ["N.A", "", "111", "111.00", "111,12", "11,111.34"],
    "dates1": ["N.A.", "", "02/17/2009", "2009/02/17", "February 17, 2009", "2014, Feb 17"],
    "dates2": ["N.A.", "", "02/17/2009", "2009/02/17", "02/17/2009", "02/17/2009"],
    "dates3": ["N.A.", "", "2009/02/17", "2009/02/17", "2009/02/17", "2009/02/17"],
    "strings": ["N.A.", "", "N.A.", "N.A.", "test", "abc"],
    "integers": ["N.A.", "", "1234", "123123", "2222", "0"],
    "time": ["N.A.", "", "05:41:12", "05:40:12", "05:41:30", "06:41:12"],
    "datetime": ["N.A.", "", "10/02/2021 10:39:24", "10/02/2021 10:39:24", "10/02/2021 10:39:24", "10/02/2021 10:39:24"],
    "boolean": ["N.A.", "", "True", "False", "False", "False"]
}
df = pd.DataFrame(data)

towrite = io.BytesIO()
df.to_csv(towrite)  # write to BytesIO buffer
towrite.seek(0)
rows = CSVTableSet(towrite).tables[0]
types = type_guess(rows.sample)
print(types)  # [Integer, Integer, String, String, Date(%Y/%m/%d), String, Integer, String, Date(%d/%m/%Y %H:%M:%S), Bool]
Here is my take on your interesting question.
With the dataframe you provided, here is one way to do it:
# For each main type, define a lambda helper function which returns the number of values in the given column of said type
helpers = {
    "float": lambda df, col: df[col]
        .apply(lambda x: x.replace(".", "").isdigit() and "." in x)
        .sum(),
    "integer": lambda df, col: df[col].apply(lambda x: x.isdigit()).sum(),
    "datetime": lambda df, col: pd.to_datetime(
            df[col], errors="coerce", infer_datetime_format=True
        )
        .notna()
        .sum(),
    "bool": lambda df, col: df[col].apply(lambda x: x == "True" or x == "False").sum(),
}

# Iterate on each column of the dataframe and get the type with the maximum number of values
df_dtypes = {}
for col in df.columns:
    results = {key: helper(df, col) for key, helper in helpers.items()}
    best_result = max(results, key=results.get)
    df_dtypes[col] = best_result if max(results.values()) else "string"
print(df_dtypes)
# Output
{
"decimals": "float",
"dates1": "datetime",
"dates2": "datetime",
"dates3": "datetime",
"strings": "string",
"integers": "integer",
"time": "datetime",
"datetime": "datetime",
"boolean": "bool",
}
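Once df_dtypes is built, the guesses can be applied back to the columns. A minimal sketch, continuing from the df and df_dtypes above, under the assumption that the N.A./empty cells should simply become missing values and that commas in numbers are thousands separators:
import pandas as pd

converters = {
    "float": lambda s: pd.to_numeric(s.str.replace(",", ""), errors="coerce"),
    "integer": lambda s: pd.to_numeric(s, errors="coerce").astype("Int64"),
    "datetime": lambda s: pd.to_datetime(s, errors="coerce"),
    "bool": lambda s: s.map({"True": True, "False": False}),
    "string": lambda s: s,
}

# Convert each column with the converter matching its guessed type
for col, dtype in df_dtypes.items():
    df[col] = converters[dtype](df[col])
print(df.dtypes)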

How to remove the pipe delimiters from my json column and split it into different columns with their respective values

"description": ID|100|\nName|Sam|\nCity|New York City|\nState|New York|\nContact|1234567890|\nEmail|1234#yahoo.com|
This is how my json file looks. I wanted to convert this json file to an excel sheet and split the nested column into separate columns; I have used pandas for it but couldn't achieve it. The output I want in my excel sheet is:
ID Name City State Contact Email
100 Sam New York City New York 1234567890 1234#yahoo.com
I want to remove those pipes and the solution should be in pandas. Please help me out with this.
The output I want on the excel sheet: https://i.stack.imgur.com/QjSUU.png
The list of dict column looks like:
"assignees": [{
"id": 1234,
"username": "xyz",
"name": "XYZ",
"state": "active",
"avatar_url": "aaaaaaaaaaaaaaa",
"web_url": "bbbbbbbbbbb"
},
{
"id": 5678,
"username": "abcd",
"name": "ABCD",
"state": "active",
"avatar_url": "hhhhhhhhhhh",
"web_url": "mmmmmmmmm"
}
],
This could be one way:
import pandas as pd

df = pd.read_json('Sample.json')
df2 = pd.DataFrame()
for i in df.index:
    desc = df['description'][i]
    attributes = desc.split("\n")
    d = {}
    for attrib in attributes:
        if not (attrib.startswith('Name') or attrib.startswith('-----')):
            kv = attrib.split("|")
            d[kv[0]] = kv[1]
    df2 = df2.append(d, ignore_index=True)
print(df2)
df2.to_csv("output.csv")
Output xls:
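DataFrame.append was removed in pandas 2.0, so on newer versions the same idea can be expressed by collecting plain dicts and building the frame once at the end. A minimal sketch, assuming every line of the description is a key|value| pair (the Name/----- filter above can be re-added if your file contains header lines):
import pandas as pd

df = pd.read_json('Sample.json')

records = []
for desc in df['description']:
    d = {}
    for attrib in desc.split("\n"):
        kv = attrib.split("|")
        if len(kv) >= 2:          # skip blank or malformed lines
            d[kv[0]] = kv[1]
    records.append(d)

df2 = pd.DataFrame(records)
df2.to_csv("output.csv", index=False)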

Creating pandas dataframe from accessing specific keys of nested dictionary

How can the dictionary below be converted to the expected dataframe shown below?
{
"getArticleAttributesResponse": {
"attributes": [{
"articleId": {
"id": "2345",
"locale": "en_US"
},
"keyValuePairs": [{
"key": "tags",
"value": "[{\"displayName\": \"Nice\", \"englishName\": \"Pradeep\", \"refKey\": \"Key2\"}, {\"displayName\": \"Family Sharing\", \"englishName\": \"Sarvendra\", \"refKey\": \"Key1\", \"meta\": {\"customerDisplayable\": [false]}}}]"
}]
}]
}
}
Expected dataframe:
id displayName englishName refKey
2345 Nice Pradeep Key2
2345 Family Sharing Sarvendra Key1
df1 = pd.DataFrame(d['getArticleAttributesResponse']['attributes']).explode('keyValuePairs')
df2 = pd.concat([df1[col].apply(pd.Series) for col in df1], axis=1).assign(value=lambda x: x.value.apply(eval)).explode('value')
df = pd.concat([df2[col].apply(pd.Series) for col in df2], axis=1)
OUTPUT:
0 0 display englishName reference source
0 1234 tags Unarchived Unarchived friend monster
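If the chained one-liner above is hard to follow, the same result can be reached with a plain loop that walks the nested structure and parses the embedded JSON string in value with json.loads instead of eval. A sketch below; the input is re-typed with the stray brace in the original value string removed, which is assumed to be a typo in the question:
import json
import pandas as pd

d = {
    "getArticleAttributesResponse": {
        "attributes": [{
            "articleId": {"id": "2345", "locale": "en_US"},
            "keyValuePairs": [{
                "key": "tags",
                "value": '[{"displayName": "Nice", "englishName": "Pradeep", "refKey": "Key2"}, {"displayName": "Family Sharing", "englishName": "Sarvendra", "refKey": "Key1"}]'
            }]
        }]
    }
}

rows = []
for attr in d["getArticleAttributesResponse"]["attributes"]:
    article_id = attr["articleId"]["id"]
    for kvp in attr["keyValuePairs"]:
        # the "value" field is itself a JSON string, so parse it before expanding
        for tag in json.loads(kvp["value"]):
            rows.append({"id": article_id, **tag})

df = pd.DataFrame(rows)
print(df)
#      id     displayName englishName refKey
# 0  2345            Nice     Pradeep   Key2
# 1  2345  Family Sharing   Sarvendra   Key1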

Extracting str from pandas dataframe using json

I read a csv file into a dataframe named df.
Each row contains a str like the one below.
'{"id":2140043003,"name":"Olallo Rubio",...}'
I would like to extract "name" and "id" from each row and make a new dataframe to store them.
I use the following code to extract them, but it shows an error. Please let me know if there are any suggestions on how to solve this problem. Thanks.
JSONDecodeError: Expecting ',' delimiter: line 1 column 32 (char 31)
text={
"id": 2140043003,
"name": "Olallo Rubio",
"is_registered": True,
"chosen_currency": 'Null',
"avatar": {
"thumb": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=40&h=40&fit=crop&v=1510685152&auto=format&q=92&s=653706657ccc49f68a27445ea37ad39a",
"small": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa",
"medium": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa"
},
"urls": {
"web": {
"user": "https://www.kickstarter.com/profile/2140043003"
},
"api": {
"user": "https://api.kickstarter.com/v1/users/2140043003?signature=1531480520.09df9a36f649d71a3a81eb14684ad0d3afc83e03"
}
}
}
def extract(text, *args):
    list1 = []
    for i in args:
        list1.append(text[i])
    return list1
print(extract(text,'name','id'))
# ['Olallo Rubio', 2140043003]
Here's what I came up with using pandas.json_normalize():
import pandas as pd
sample = [{
"id": 2140043003,
"name":"Olallo Rubio",
"is_registered": True,
"chosen_currency": None,
"avatar":{
"thumb":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=40&h=40&fit=crop&v=1510685152&auto=format&q=92&s=653706657ccc49f68a27445ea37ad39a",
"small":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa",
"medium":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa"
},
"urls":{
"web":{
"user":"https://www.kickstarter.com/profile/2140043003"
},
"api":{
"user":"https://api.kickstarter.com/v1/users/2140043003?signature=1531480520.09df9a36f649d71a3a81eb14684ad0d3afc83e03"
}
}
}]
# Create dataframe
df = pd.json_normalize(sample)
# Select columns into new dataframe.
df1 = df.loc[:, ["name", "id",]]
Check df1:
Input:
print(df1)
Output:
name id
0 Olallo Rubio 2140043003
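To do the same on the dataframe read from the csv file, the string in each row has to be parsed first. A sketch, assuming a hypothetical file data.csv and a hypothetical column name raw for the record strings, and assuming those strings use Python-style literals such as True and None rather than strict JSON (which is what makes json.loads raise the JSONDecodeError); ast.literal_eval accepts those:
import ast
import pandas as pd

df = pd.read_csv("data.csv")                    # hypothetical file name
parsed = df["raw"].apply(ast.literal_eval)      # "raw" is the hypothetical column holding the record strings
df1 = pd.json_normalize(parsed.tolist())[["name", "id"]]
print(df1)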

Pandas DataFrame created for each row

I am attempting to pass data in JSON from an API to a Pandas DataFrame. I could not get pandas.read_json to work with the API data, so I'm sure this is not the best solution, but I currently have a for loop running through the JSON to extract the values I want.
Here is what I have:
import json
import urllib.request
import pandas as pd
r = urllib.request.urlopen("https://graph.facebook.com/v3.1/{page-id}/insights?access_token={access-token}&pretty=0&metric=page_impressions%2cpage_engaged_users%2cpage_fans%2cpage_video_views%2cpage_posts_impressions").read()
output = json.loads(r)
for item in output['data']:
    name = item['name']
    period = item['period']
    value = item['values'][0]['value']
    df = [{'Name': name, 'Period': period, 'Value': value}]
    df = pd.DataFrame(df)
    print(df)
And here is an excerpt of the JSON from the API:
{
"data": [
{
"name": "page_video_views",
"period": "day",
"values": [
{
"value": 634,
"end_time": "2018-11-23T08:00:00+0000"
},
{
"value": 465,
"end_time": "2018-11-24T08:00:00+0000"
}
],
"title": "Daily Total Video Views",
"description": "Daily: Total number of times videos have been viewed for more than 3 seconds. (Total Count)",
"id": "{page-id}/insights/page_video_views/day"
},
The issue I am now facing is that, because of the for loop (I believe), each row of data is being inserted into its own DataFrame, like so:
Name Period Value
0 page_video_views day 465
Name Period Value
0 page_video_views week 3257
Name Period Value
0 page_video_views days_28 9987
Name Period Value
0 page_impressions day 1402
How can I pass all of them easily into the same DataFrame like so?
Name Period Value
0 page_video_views day 465
1 page_video_views week 3257
2 page_video_views days_28 9987
3 page_impressions day 1402
Again, I know this most likely isn't the best solution so any suggestions on how to improve any aspect are very welcome.
You can create a list of dictionaries and pass it to the DataFrame constructor:
L = []
for item in output['data']:
    name = item['name']
    period = item['period']
    value = item['values'][0]['value']
    L.append({'Name': name, 'Period': period, 'Value': value})

df = pd.DataFrame(L)
Or use list comprehension:
L = [{'Name': item['name'], 'Period': item['period'], 'Value': item['values'][0]['value']}
     for item in output['data']]
df = pd.DataFrame(L)
print (df)
Name Period Value
0 page_video_views day 634
Sample for testing:
output = {
"data": [
{
"name": "page_video_views",
"period": "day",
"values": [
{
"value": 634,
"end_time": "2018-11-23T08:00:00+0000"
},
{
"value": 465,
"end_time": "2018-11-24T08:00:00+0000"
}
],
"title": "Daily Total Video Views",
"description": "Daily: Total number of times videos have been viewed for more than 3 seconds. (Total Count)",
"id": "{page-id}/insights/page_video_views/day"
}]}
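If you also want one row per entry in values, so that each end_time keeps its own row instead of only the first value being taken, pd.json_normalize with record_path can build that directly. A sketch using the sample above:
import pandas as pd

df_values = pd.json_normalize(
    output["data"],
    record_path="values",        # one row per value/end_time pair
    meta=["name", "period"],     # carry the metric name and period along
)
print(df_values)
#    value                  end_time              name period
# 0    634  2018-11-23T08:00:00+0000  page_video_views    day
# 1    465  2018-11-24T08:00:00+0000  page_video_views    day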
Try converting the dictionary to a dataframe after the json load, like:
output = json.loads(r)
df = pd.DataFrame.from_dict(output, orient='index')
df.reset_index(level=0, inplace=True)
If you are taking the data from the url, I would suggest this approach: let requests parse the JSON, then pass only the part of the response you need to the DataFrame constructor.
import requests

data = requests.get("url here").json()["data"]
# data is now a list of dictionaries, which pd.DataFrame can parse directly
df = pd.DataFrame(data)
