Pandas DataFrame created for each row - python

I am attempting to pass data in JSON from an API to a Pandas DataFrame. I could not get pandas.read_json to work with the API data, so I'm sure this is not the best solution, but I currently have a for loop running through the JSON to extract the values I want.
Here is what I have:
import json
import urllib.request
import pandas as pd

r = urllib.request.urlopen("https://graph.facebook.com/v3.1/{page-id}/insights?access_token={access-token}&pretty=0&metric=page_impressions%2cpage_engaged_users%2cpage_fans%2cpage_video_views%2cpage_posts_impressions").read()
output = json.loads(r)
for item in output['data']:
    name = item['name']
    period = item['period']
    value = item['values'][0]['value']
    df = [{'Name': name, 'Period': period, 'Value': value}]
    df = pd.DataFrame(df)
    print(df)
And here is an excerpt of the JSON from the API:
{
  "data": [
    {
      "name": "page_video_views",
      "period": "day",
      "values": [
        {
          "value": 634,
          "end_time": "2018-11-23T08:00:00+0000"
        },
        {
          "value": 465,
          "end_time": "2018-11-24T08:00:00+0000"
        }
      ],
      "title": "Daily Total Video Views",
      "description": "Daily: Total number of times videos have been viewed for more than 3 seconds. (Total Count)",
      "id": "{page-id}/insights/page_video_views/day"
    },
The issue I am now facing is that, because of the for loop (I believe), each row of data is being inserted into its own DataFrame, like so:
Name Period Value
0 page_video_views day 465
Name Period Value
0 page_video_views week 3257
Name Period Value
0 page_video_views days_28 9987
Name Period Value
0 page_impressions day 1402
How can I pass all of them easily into the same DataFrame like so?
Name Period Value
0 page_video_views day 465
1 page_video_views week 3257
2 page_video_views days_28 9987
3 page_impressions day 1402
Again, I know this most likely isn't the best solution so any suggestions on how to improve any aspect are very welcome.

You can create a list of dictionaries and pass it to the DataFrame constructor:
L = []
for item in output['data']:
    name = item['name']
    period = item['period']
    value = item['values'][0]['value']
    L.append({'Name': name, 'Period': period, 'Value': value})

df = pd.DataFrame(L)
Or use a list comprehension:
L = [{'Name': item['name'], 'Period': item['period'], 'Value': item['values'][0]['value']}
     for item in output['data']]

df = pd.DataFrame(L)
print(df)
Name Period Value
0 page_video_views day 634
Sample for testing:
output = {
    "data": [
        {
            "name": "page_video_views",
            "period": "day",
            "values": [
                {
                    "value": 634,
                    "end_time": "2018-11-23T08:00:00+0000"
                },
                {
                    "value": 465,
                    "end_time": "2018-11-24T08:00:00+0000"
                }
            ],
            "title": "Daily Total Video Views",
            "description": "Daily: Total number of times videos have been viewed for more than 3 seconds. (Total Count)",
            "id": "{page-id}/insights/page_video_views/day"
        }]}
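If you want every entry of the nested values list (not just the first), pd.json_normalize can flatten the records in one call. A minimal sketch against the sample data above; record_path points at the nested list and meta names the fields to repeat on each row:

```python
import pandas as pd

output = {
    "data": [
        {
            "name": "page_video_views",
            "period": "day",
            "values": [
                {"value": 634, "end_time": "2018-11-23T08:00:00+0000"},
                {"value": 465, "end_time": "2018-11-24T08:00:00+0000"},
            ],
        }
    ]
}

# one row per element of "values", with name/period repeated as metadata columns
df = pd.json_normalize(output["data"], record_path="values", meta=["name", "period"])
print(df)
```

This yields two rows here, one per values entry, with columns value, end_time, name and period.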

Try converting the dictionary to a dataframe after loading the json, like:
output = json.loads(r)
df = pd.DataFrame.from_dict(output, orient='index')
df.reset_index(level=0, inplace=True)

If you are taking the data from the URL, I would suggest this approach, passing only the data stored under a given attribute:
import requests

data = requests.get("url here").json()['Period']
data is now a dictionary, so you can call pd.DataFrame.from_dict to parse it:
df = pd.DataFrame.from_dict(data)

Related

Iterate json with multiple values per key for matching pattern: Set dataframe column value depending on value of other column (Faster Python Solution)

Below is the problem statement.
Inputs:
- a csv containing the time when a particular URL is hit.
- a json containing multiple url_categories, with the matching url patterns for each url_category.
Required output:
- a csv containing Time, URL, and url_category, where url_category is decided based on the URL in the input csv and the url_patterns listed per url_category. If the URL doesn't match any of the patterns, the category should be marked as 'Other'.
Objective:
- Python code that creates the required output in a FAST way.
Input csv (simplified) which contains the time when a particular URL is hit, like below.
TIME,URL
11:51,/url3a/partC
12:51,/url6/partA
13:51,/url7/partA/partA/partB
14:51,/url5/partA/partB/part1
15:51,/url3b/partA
16:51,/url8/partA/partB
17:51,/url2a/
18:51,/url5/partA/partB
19:51,/url1/part1/part2
20:51,/url4b/partA
21:51,/url9/partA/partA/partB
22:51,/url2/partA/partB
23:51,/url1a/partD
00:51,/url3/partA/partB
01:51,/url9/partA/partA/partB
02:51,/url4a/
03:51,/url5b/partA/partE
04:51,/url7/partA/partA/partB
05:51,/url1b/part1
Input json (simplified) describing URL categories with URL patterns, like below.
{
  "category1": [ "/url1/part1/part2", "/url1a/", "/url1b/part1" ],
  "category2": [ "/url2/partA/partB", "/url2a/", "/url2b/partA" ],
  "category3": [ "/url3/partA/partB", "/url3a/", "/url3b/partA" ],
  "category4": [ "/url4/partA/partB", "/url4a/", "/url4b/partA" ],
  "category5": [ "/url5/partA/partB", "/url5a/", "/url5b/partA" ]
}
I have Python code which achieves this, but it is very slow, as I'm iterating through each dataframe row and each key and value in the json. I need a solution in which the code executes much faster, as my input csv has many rows and the input json also has many url categories with many url patterns associated with each url category.
json1 = '{"category1": ["/url1/part1/part2", "/url1a/", "/url1b/part1"], "category2": ["/url2/partA/partB", "/url2a/", "/url2b/partA"], "category3": ["/url3/partA/partB", "/url3a/", "/url3b/partA"], "category4": ["/url4/partA/partB", "/url4a/", "/url4b/partA"], "category5": ["/url5/partA/partB", "/url5a/", "/url5b/partA"]}'
print(json1)
json2 = json.loads(json1)
print(f"---json2: {json2}; \ntype(json2): {type(json2)}")
df = pd.read_csv(in_csv_path)
print(df)
for i in range(len(df)):
    for key in json2:
        for url_pattern in json2[key]:
            if str(df.loc[i, "URL"]).find(str(url_pattern)) != -1:
                df.loc[i, "CATEGORY"] = key

df.fillna('Other', inplace=True)
print(df)
df.to_csv(out_csv, index=False)
Below is the output csv.
TIME,URL,CATEGORY
11:51,/url3a/partC,category3
12:51,/url6/partA,Other
13:51,/url7/partA/partA/partB,Other
14:51,/url5/partA/partB,category5
15:51,/url3b/partA,category3
16:51,/url8/partA/partB,Other
17:51,/url2a/,category2
18:51,/url5/partA/partB,category5
19:51,/url1/part1/part2,category1
20:51,/url4b/partA,category4
21:51,/url9/partA/partA/partB,Other
22:51,/url2/partA/partB,category2
23:51,/url1a/,category1
00:51,/url3/partA/partB,category3
01:51,/url9/partA/partA/partB,Other
02:51,/url4a/,category4
03:51,/url5b/partA,category5
04:51,/url7/partA/partA/partB,Other
05:51,/url1b/part1,category1
Given the following dataframe and json string:
import json
import pandas as pd
df = pd.DataFrame(
    [
        {"TIME": "11:51", "URL": "/url3a/partC"},
        {"TIME": "12:51", "URL": "/url6/partA"},
        {"TIME": "13:51", "URL": "/url7/partA/partA/partB"},
        {"TIME": "14:51", "URL": "/url5/partA/partB/part1"},
        {"TIME": "15:51", "URL": "/url3b/partA"},
        {"TIME": "16:51", "URL": "/url8/partA/partB"},
        {"TIME": "17:51", "URL": "/url2a/"},
        {"TIME": "18:51", "URL": "/url5/partA/partB"},
        {"TIME": "19:51", "URL": "/url1/part1/part2"},
        {"TIME": "20:51", "URL": "/url4b/partA"},
        {"TIME": "21:51", "URL": "/url9/partA/partA/partB"},
        {"TIME": "22:51", "URL": "/url2/partA/partB"},
        {"TIME": "23:51", "URL": "/url1a/partD"},
        {"TIME": "00:51", "URL": "/url3/partA/partB"},
        {"TIME": "01:51", "URL": "/url9/partA/partA/partB"},
        {"TIME": "02:51", "URL": "/url4a/"},
        {"TIME": "03:51", "URL": "/url5b/partA/partE"},
        {"TIME": "04:51", "URL": "/url7/partA/partA/partB"},
        {"TIME": "05:51", "URL": "/url1b/part1"},
    ]
)
categories = json.loads(
    '{"category1": ["/url1/part1/part2", "/url1a/", "/url1b/part1"], "category2": ["/url2/partA/partB", "/url2a/", "/url2b/partA"], "category3": ["/url3/partA/partB", "/url3a/", "/url3b/partA"], "category4": ["/url4/partA/partB", "/url4a/", "/url4b/partA"], "category5": ["/url5/partA/partB", "/url5a/", "/url5b/partA"]}'
)
Here is a more idiomatic way to do it:
# Rework categories
categories = {
    url.split("/")[1]: cat for cat, urls in categories.items() for url in urls
}
# Process urls
df["Category"] = df["URL"].apply(lambda x: categories.get(x.split("/")[1], "Other"))
So that:
print(df)
# Output
TIME URL Category
0 11:51 /url3a/partC category3
1 12:51 /url6/partA Other
2 13:51 /url7/partA/partA/partB Other
3 14:51 /url5/partA/partB/part1 category5
4 15:51 /url3b/partA category3
5 16:51 /url8/partA/partB Other
6 17:51 /url2a/ category2
7 18:51 /url5/partA/partB category5
8 19:51 /url1/part1/part2 category1
9 20:51 /url4b/partA category4
10 21:51 /url9/partA/partA/partB Other
11 22:51 /url2/partA/partB category2
12 23:51 /url1a/partD category1
13 00:51 /url3/partA/partB category3
14 01:51 /url9/partA/partA/partB Other
15 02:51 /url4a/ category4
16 03:51 /url5b/partA/partE category5
17 04:51 /url7/partA/partA/partB Other
18 05:51 /url1b/part1 category1
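Another vectorized option, sketched here under the same assumption the original loop makes (a pattern may match anywhere in the URL, not only in the first path segment): build one alternation regex from all patterns and map each match back to its category with pandas string methods.

```python
import re
import pandas as pd

# small hypothetical subset of the question's data for illustration
df = pd.DataFrame({"URL": ["/url3a/partC", "/url6/partA", "/url1/part1/part2"]})
categories = {
    "category1": ["/url1/part1/part2", "/url1a/", "/url1b/part1"],
    "category3": ["/url3/partA/partB", "/url3a/", "/url3b/partA"],
}

# map each raw pattern to its category; sort longest-first so the regex
# alternation prefers the most specific pattern at any given position
pattern_to_cat = {p: cat for cat, pats in categories.items() for p in pats}
alternation = "|".join(
    re.escape(p) for p in sorted(pattern_to_cat, key=len, reverse=True)
)

df["CATEGORY"] = (
    df["URL"]
    .str.extract(f"({alternation})", expand=False)  # first pattern found in the URL
    .map(pattern_to_cat)
    .fillna("Other")
)
print(df)
```

This replaces the row-by-row Python loop with a single pass through the compiled regex engine, which scales much better when the csv has many rows.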

How to remove delimited pipe from my json column and split them to different columns and their respective values

"description": ID|100|\nName|Sam|\nCity|New York City|\nState|New York|\nContact|1234567890|\nEmail|1234#yahoo.com|
This is what my json file looks like. I wanted to convert this json file to an excel sheet, splitting the nested column into separate columns, and have used pandas for it, but couldn't achieve it. The output I want in my excel sheet is:
ID Name City State Contact Email
100 Sam New York City New York 1234567890 1234#yahoo.com
I want to remove those pipes and the solution should be in pandas. Please help me out with this.
The list of dict column looks like:
"assignees": [{
"id": 1234,
"username": "xyz",
"name": "XYZ",
"state": "active",
"avatar_url": "aaaaaaaaaaaaaaa",
"web_url": "bbbbbbbbbbb"
},
{
"id": 5678,
"username": "abcd",
"name": "ABCD",
"state": "active",
"avatar_url": "hhhhhhhhhhh",
"web_url": "mmmmmmmmm"
}
],
This could be one way:
import pandas as pd

df = pd.read_json('Sample.json')
df2 = pd.DataFrame()
for i in df.index:
    desc = df['description'][i]
    attributes = desc.split("\n")
    d = {}
    for attrib in attributes:
        if not (attrib.startswith('Name') or attrib.startswith('-----')):
            kv = attrib.split("|")
            d[kv[0]] = kv[1]
    df2 = df2.append(d, ignore_index=True)

print(df2)
df2.to_csv("output.csv")
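Note that DataFrame.append was removed in pandas 2.0, so a version that collects plain dictionaries first may age better. A sketch assuming (as in the question) that the description field is always newline-separated key|value| pairs; the sample string here mirrors the question's data:

```python
import pandas as pd

descriptions = [
    "ID|100|\nName|Sam|\nCity|New York City|\nState|New York|"
    "\nContact|1234567890|\nEmail|1234#yahoo.com|"
]

# split each description on newlines, then each "key|value|" pair on the pipes
records = [
    dict(pair.split("|")[:2] for pair in desc.split("\n") if "|" in pair)
    for desc in descriptions
]
df2 = pd.DataFrame(records)
print(df2)
```

Building the full list of dicts and constructing the DataFrame once is also faster than appending row by row.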

How to split text inside a pandas dataframe into new dataframe columns

I have a list
list1= ['{"bank_name": null, "country": null, "url": null, "type": "Debit", "scheme": "Visa", "bin": "789452"}\n',
'{"prepaid": "", "bin": "123457", "scheme": "Visa", "type": "Debit", "bank_name": "Ohio", "url": "www.u.org", "country": "UKs"}\n']
I passed it into a dataframe:
df = pd.DataFrame({'bincol':list1})
print(df)
bincol
0 {"bank_name": null, "country": null, "url": nu...
1 {"prepaid": "", "bin": "123457", "scheme": "Vi...
I am trying to split the bincol column into new columns using this function:
import ast
from pandas import json_normalize

def explode_col(df, column_value):
    df = df.dropna(subset=[column_value])
    if isinstance(df[str(column_value)].iloc[0], str):
        df[column_value] = df[str(column_value)].apply(ast.literal_eval)
    expanded_child_df = (
        pd.concat({i: json_normalize(x) for i, x in df.pop(str(column_value)).items()})
        .reset_index(level=1, drop=True)
        .join(df, how='right', lsuffix='_left', rsuffix='_right')
        .reset_index(drop=True)
    )
    expanded_child_df.columns = map(str.lower, expanded_child_df.columns)
    return expanded_child_df

df2 = explode_col(df, 'bincol')
But I am getting this error. Am I missing something here?
raise ValueError(f'malformed node or string: {node!r}')
ValueError: malformed node or string: <_ast.Name object at 0x7fd3aa05c400>
For your sample data, json.loads works for converting the strings to dictionaries (ast.literal_eval fails because JSON null is not a valid Python literal); json_normalize then builds the DataFrame:
import json
df = pd.json_normalize(df['bincol'].apply(json.loads))
print(df)
bank_name country url type scheme bin prepaid
0 None None None Debit Visa 789452 NaN
1 Ohio UKs www.u.org Debit Visa 123457

python generator to pandas dataframe

I have a generator being returned from:
data = public_client.get_product_trades(product_id='BTC-USD', limit=10)
How do i turn the data in to a pandas dataframe?
the method DOCSTRING reads:
"""{"Returns": [{
"time": "2014-11-07T22:19:28.578544Z",
"trade_id": 74,
"price": "10.00000000",
"size": "0.01000000",
"side": "buy"
}, {
"time": "2014-11-07T01:08:43.642366Z",
"trade_id": 73,
"price": "100.00000000",
"size": "0.01000000",
"side": "sell"
}]}"""
I have tried:
df = [x for x in data]
df = pd.DataFrame.from_records(df)
but it does not work, as I get the error:
AttributeError: 'str' object has no attribute 'keys'
When I print the above "x for x in data" I see the list of dicts, but the end looks strange. Could this be why?
print(list(data))
[{'time': '2020-12-30T13:04:14.385Z', 'trade_id': 116918468, 'price': '27853.82000000', 'size': '0.00171515', 'side': 'sell'},{'time': '2020-12-30T12:31:24.185Z', 'trade_id': 116915675, 'price': '27683.70000000', 'size': '0.01683711', 'side': 'sell'}, 'message']
It looks to be a list of dicts but the end value is a single string 'message'.
Based on the updated question:
df = pd.DataFrame(list(data)[:-1])
Or, more cleanly:
df = pd.DataFrame([x for x in data if isinstance(x, dict)])
print(df)
time trade_id price size side
0 2020-12-30T13:04:14.385Z 116918468 27853.82000000 0.00171515 sell
1 2020-12-30T12:31:24.185Z 116915675 27683.70000000 0.01683711 sell
Oh, and BTW, you'll still need to change those strings into something usable...
So e.g.:
df['time'] = pd.to_datetime(df['time'])
for k in ['price', 'size']:
    df[k] = pd.to_numeric(df[k])
You could access the values in the dictionary and build a dataframe from it (although not particularly clean):
dict_of_data = [{
"time": "2014-11-07T22:19:28.578544Z",
"trade_id": 74,
"price": "10.00000000",
"size": "0.01000000",
"side": "buy"
}, {
"time": "2014-11-07T01:08:43.642366Z",
"trade_id": 73,
"price": "100.00000000",
"size": "0.01000000",
"side": "sell"
}]
import pandas as pd
list_of_data = [list(dict_of_data[0].values()),list(dict_of_data[1].values())]
pd.DataFrame(list_of_data, columns=list(dict_of_data[0].keys())).set_index('time')
It's straightforward: just use the pd.DataFrame constructor:
#list_of_dicts = [{
# "time": "2014-11-07T22:19:28.578544Z",
# "trade_id": 74,
# "price": "10.00000000",
# "size": "0.01000000",
# "side": "buy"
# }, {
# "time": "2014-11-07T01:08:43.642366Z",
# "trade_id": 73,
# "price": "100.00000000",
# "size": "0.01000000",
# "side": "sell"
#}]
# or if you take it from 'data' (a generator, so materialize it first)
list_of_dicts = list(data)[:-1]
df = pd.DataFrame(list_of_dicts)
df
Out[4]:
time trade_id price size side
0 2014-11-07T22:19:28.578544Z 74 10.00000000 0.01000000 buy
1 2014-11-07T01:08:43.642366Z 73 100.00000000 0.01000000 sell
UPDATE
According to the question update, it seems you have json data that is still a string...
import json
data = json.loads(data)
data = data['Returns']
pd.DataFrame(data)
time trade_id price size side
0 2014-11-07T22:19:28.578544Z 74 10.00000000 0.01000000 buy
1 2014-11-07T01:08:43.642366Z 73 100.00000000 0.01000000 sell

Write json format using pandas Series and DataFrame

I'm working with CSV files. My goal is to write a json format with the CSV file information. Specifically, I want to get a format similar to miserables.json
Example:
{"source": "Napoleon", "target": "Myriel", "value": 1},
According with the information I have the format would be:
[
{
"source": "Germany",
"target": "Mexico",
"value": 1
},
{
"source": "Germany",
"target": "USA",
"value": 2
},
{
"source": "Brazil",
"target": "Argentina",
"value": 3
}
]
However, with the code I used the output looks as follow:
[
{
"source": "Germany",
"target": "Mexico",
"value": 1
},
{
"source": null,
"target": "USA",
"value": 2
}
][
{
"source": "Brazil",
"target": "Argentina",
"value": 3
}
]
The null source must be Germany. This is one of the main problems, because there are more countries with that issue. Besides this, the information is correct. I just want to remove the several nested lists in the output and replace null with the correct country.
This is the code I used, using pandas and collections.
csvdata = pandas.read_csv('file.csv', low_memory=False, encoding='latin-1')
countries = csvdata['country'].tolist()
newcountries = list(set(countries))
for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frquency = Counter(bills)
    sourceTemp = []
    value = []
    country = element
    for k, v in frquency.items():
        sourceTemp.append(k)
        value.append(int(v))
    forceData = {'source': Series(country), 'target': Series(sourceTemp), 'value': Series(value)}
    dfForce = DataFrame(forceData)
    jsondata = dfForce.to_json(orient='records', force_ascii=False, default_handler=callable)
    parsed = json.loads(jsondata)
    newData = json.dumps(parsed, indent=4, ensure_ascii=False, sort_keys=True)
    # since to_json doesn't have an append mode this will be written to a txt file
    savetxt = open('data.txt', 'a')
    savetxt.write(newData)
    savetxt.close()
Any suggestion to solve this problem are appreciate!
Thanks
Consider removing the Series() around the scalar value, country. By wrapping it and then upsizing the dictionary of series into a dataframe, you force NaN (later converted to null in json) into the source series to match the lengths of the other series. You can see this by printing out the dfForce dataframe:
from pandas import Series
from pandas import DataFrame

country = 'Germany'
sourceTemp = ['Mexico', 'USA', 'Argentina']
value = [1, 2, 3]
forceData = {'source': Series(country),
             'target': Series(sourceTemp),
             'value': Series(value)}
dfForce = DataFrame(forceData)
#     source     target  value
# 0  Germany     Mexico      1
# 1      NaN        USA      2
# 2      NaN  Argentina      3
To resolve, simply keep country as a scalar in the dictionary of series:
forceData = {'source': country,
             'target': Series(sourceTemp),
             'value': Series(value)}
dfForce = DataFrame(forceData)
#     source     target  value
# 0  Germany     Mexico      1
# 1  Germany        USA      2
# 2  Germany  Argentina      3
By the way, you do not need a DataFrame object to output json. Simply use a list of dictionaries. Consider the following, which uses an OrderedDict collection (to maintain the order of keys). This way the growing list dumps into the text file without appending, which would render invalid json, as opposite-facing adjacent square brackets ...][... are not allowed.
from collections import OrderedDict
...
data = []
for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frquency = Counter(bills)
    for k, v in frquency.items():
        inner = OrderedDict()
        inner['source'] = element
        inner['target'] = k
        inner['value'] = int(v)
        data.append(inner)

newData = json.dumps(data, indent=4)
with open('data.json', 'w') as savetxt:
    savetxt.write(newData)
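Since pandas is already in play, the per-country Counter loop can also be collapsed into a single groupby. A sketch assuming csvdata has country and target columns as in the question (the frame below is a hypothetical stand-in for the csv):

```python
import json
import pandas as pd

# hypothetical stand-in for the csv from the question
csvdata = pd.DataFrame({
    "country": ["Germany", "Germany", "Germany", "Brazil"],
    "target": ["Mexico", "USA", "USA", "Argentina"],
})

# count (country, target) pairs, then rename to the source/target/value schema
out = (
    csvdata.groupby(["country", "target"]).size()
    .reset_index(name="value")
    .rename(columns={"country": "source"})
    .to_dict(orient="records")
)
# default=int guards against numpy integer types that json can't serialize
print(json.dumps(out, indent=4, default=int))
```

This produces the whole [{"source": ..., "target": ..., "value": ...}, ...] list in one pass, so a single json.dumps writes a valid file with no append step.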
