Extracting complex data from a DataFrame - python

I have to analyze some complex data that is in a Pandas DataFrame. I don't know the exact structure of the data inside the DataFrame; I pulled it from a JSON file and used head() to look at the top-level data.
If I want to extract the group, manufacturer, or nutrients column into a separate DataFrame, how should I go about doing that so that I can do some statistical analysis?
import json
import pandas as pd

with open("nutrients.json") as f:
    objects = [json.loads(line) for line in f]
df = pd.DataFrame(objects)
print(df.head())
group manufacturer \
0 Dairy and Egg Products
1 Dairy and Egg Products
2 Dairy and Egg Products
3 Dairy and Egg Products
4 Dairy and Egg Products
meta \
0 {'langual': [], 'nitrogen_factor': '6.38', 're...
1 {'langual': [], 'nitrogen_factor': '6.38', 're...
2 {'langual': [], 'nitrogen_factor': '6.38', 're...
3 {'langual': [], 'nitrogen_factor': '6.38', 're...
4 {'langual': [], 'nitrogen_factor': '6.38', 're...
name \
0 {'long': 'Butter, salted', 'sci': '', 'common'...
1 {'long': 'Butter, whipped, with salt', 'sci': ...
2 {'long': 'Butter oil, anhydrous', 'sci': '', '...
3 {'long': 'Cheese, blue', 'sci': '', 'common': []}
4 {'long': 'Cheese, brick', 'sci': '', 'common':...
nutrients \
0 [{'code': '203', 'value': '0.85', 'units': 'g'...
1 [{'code': '203', 'value': '0.85', 'units': 'g'...
2 [{'code': '203', 'value': '0.28', 'units': 'g'...
3 [{'code': '203', 'value': '21.40', 'units': 'g...
4 [{'code': '203', 'value': '23.24', 'units': 'g...
portions
0 [{'g': '227', 'amt': '1', 'unit': 'cup'}, {'g'...
1 [{'g': '151', 'amt': '1', 'unit': 'cup'}, {'g'...
2 [{'g': '205', 'amt': '1', 'unit': 'cup'}, {'g'...
3 [{'g': '28.35', 'amt': '1', 'unit': 'oz'}, {'g...
4 [{'g': '132', 'amt': '1', 'unit': 'cup, diced'...
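A sketch of one common approach, assuming the structure shown by head() above: group and manufacturer are plain scalar columns that can be selected directly, while nutrients holds a list of dicts per row and needs flattening into its own DataFrame first.

```python
import pandas as pd

# Tiny stand-in for the real df, mimicking the structure head() shows above
# (the values here are hypothetical).
df = pd.DataFrame({
    'group': ['Dairy and Egg Products', 'Dairy and Egg Products'],
    'manufacturer': ['', ''],
    'nutrients': [
        [{'code': '203', 'value': '0.85', 'units': 'g'}],
        [{'code': '203', 'value': '21.40', 'units': 'g'},
         {'code': '204', 'value': '28.74', 'units': 'g'}],
    ],
})

# Scalar columns: plain selection gives a separate DataFrame.
groups = df[['group', 'manufacturer']].copy()

# Nested column: flatten each row's list of dicts, keeping the original
# row index in a 'food' column so results can be joined back later.
nutrients = pd.DataFrame(
    [{'food': i, **n} for i, row in df['nutrients'].items() for n in row]
)
nutrients['value'] = pd.to_numeric(nutrients['value'])  # strings -> numbers for stats
print(nutrients)
```

pandas.json_normalize with record_path='nutrients' can do the same flattening directly from the parsed JSON objects.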


ccxt OKEx placing orders

I placed a demo order on OKEx with amount 246 and price 0.46, but when I looked at the site, the order amount was more than 11k.
I fetched info about the order:
{'info': {'accFillSz': '0', 'avgPx': '', 'cTime': '1652262833825', 'category': 'normal', 'ccy': '', 'clOrdId': 'e847386590ce4dBCc812b22b16d7807c', 'fee': '0', 'feeCcy': 'USDT', 'fillPx': '', 'fillSz': '0', 'fillTime': '', 'instId': 'XRP-USDT-SWAP', 'instType': 'SWAP', 'lever': '1', 'ordId': '444557778278035458', 'ordType': 'limit', 'pnl': '0', 'posSide': 'long', 'px': '0.45693', 'rebate': '0', 'rebateCcy': 'USDT', 'side': 'buy', 'slOrdPx': '-1', 'slTriggerPx': '0.44779', 'slTriggerPxType': 'mark', 'source': '', 'state': 'live', 'sz': '246', 'tag': '', 'tdMode': 'isolated', 'tgtCcy': '', 'tpOrdPx': '-1', 'tpTriggerPx': '0.46606', 'tpTriggerPxType': 'mark', 'tradeId': '', 'uTime': '1652262833825'}, 'id': '444557778278035458', 'clientOrderId': 'e847386590ce4dBCc812b22b16d7807c', 'timestamp': 1652262833825, 'datetime': '2022-05-11T09:53:53.825Z', 'lastTradeTimestamp': None, 'symbol': 'XRP/USDT:USDT', 'type': 'limit', 'timeInForce': None, 'postOnly': None, 'side': 'buy', 'price': 0.45693, 'stopPrice': 0.44779, 'average': None, 'cost': 0.0, 'amount': 246.0, 'filled': 0.0, 'remaining': 246.0, 'status': 'open', 'fee': {'cost': 0.0, 'currency': 'USDT'}, 'trades': [], 'fees': [{'cost': 0.0, 'currency': 'USDT'}]}
and the amount there is 246.
Here is my code:
exchange = ccxt.okx(
    {
        'apiKey': API_KEY,
        'secret': API_SECRET,
        'password': API_PASSPHRASE,
        'options': {
            'defaultType': 'swap'
        },
        'headers': {
            'x-simulated-trading': '1'
        }
    }
)
exchange.load_markets()
market = exchange.market(PAIR)
params = {
    'tdMode': 'isolated',
    'posSide': 'long',
    'instId': market['id'],
    'side': 'buy',
    'sz': 246,
    'tpOrdPx': '-1',
    'slOrdPx': '-1',
    'tpTriggerPx': str(take_profit),
    'slTriggerPx': str(stop_loss),
    'tpTriggerPxType': 'mark',
    'slTriggerPxType': 'mark',
}
order = exchange.create_order(
    f"{PAIR}", ORDER_TYPE, 'buy', summa, price, params=params)
info = exchange.fetch_order(order['id'], PAIR)
print(info)
What am I doing wrong?
For starters, you can only buy XRP on this contract in multiples of 100, so you can buy 200 or 300 but not 246.
Secondly, it looks like there is a multiplier of 100 being applied in the API, where 1 = 100 XRP. I was able to deduce this by entering 24,600 XRP, which gives roughly the $11k you mentioned.
In your case, if you were to buy 200 or 300 XRP, you would need to enter 2 or 3 as the amount in the API request.
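A hedged sketch of sizing in coins rather than contracts, assuming (as deduced above) that one contract on XRP-USDT-SWAP equals 100 XRP; in ccxt the per-market value is exposed as exchange.market(PAIR)['contractSize'] once load_markets() has run:

```python
# Hypothetical helper: convert a coin amount to the integer contract count
# that the exchange expects. contract_size would come from
# exchange.market(PAIR)['contractSize']; 100.0 is assumed for XRP-USDT-SWAP.
def coins_to_contracts(coin_amount: float, contract_size: float) -> int:
    contracts = coin_amount / contract_size
    if contracts != int(contracts):
        raise ValueError(
            f"{coin_amount} is not a whole multiple of the contract size {contract_size}"
        )
    return int(contracts)

print(coins_to_contracts(200, 100.0))  # 2
print(coins_to_contracts(300, 100.0))  # 3
# coins_to_contracts(246, 100.0) raises ValueError: not a whole number of contracts
```

With that, you would pass the contract count, not the coin amount, as the order size.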

Python Pandas, how to group list of dict and sort

I have a list of dict like:
data = [
    {'ID': '000681', 'type': 'B:G+', 'testA': '11'},
    {'ID': '000682', 'type': 'B:G+', 'testA': '-'},
    {'ID': '000683', 'type': 'B:G+', 'testA': '13'},
    {'ID': '000684', 'type': 'B:G+', 'testA': '14'},
    {'ID': '000681', 'type': 'B:G+', 'testB': '15'},
    {'ID': '000682', 'type': 'B:G+', 'testB': '16'},
    {'ID': '000683', 'type': 'B:G+', 'testB': '17'},
    {'ID': '000684', 'type': 'B:G+', 'testB': '-'}
]
How to use Pandas to get data like:
data = [
    {'ID': '000683', 'type': 'B:G+', 'testA': '13', 'testB': '17'},
    {'ID': '000681', 'type': 'B:G+', 'testA': '11', 'testB': '15'},
    {'ID': '000684', 'type': 'B:G+', 'testA': '14', 'testB': '-'},
    {'ID': '000682', 'type': 'B:G+', 'testA': '-', 'testB': '16'}
]
Rows with the same ID and type should be combined into one row, sorted by the testA and testB values.
Sorting: rows where both testA and testB have a value, and where the testA + testB total is larger, go to the top.
First convert the columns to numeric, coercing non-numeric values to NaN, and then aggregate with sum:
df = pd.DataFrame(data)
c = ['testA','testB']
df[c] = df[c].apply(lambda x: pd.to_numeric(x, errors='coerce'))
df1 = df.groupby(['ID','type'])[c].sum(min_count=1).sort_values(c).fillna('-').reset_index()
print (df1)
ID type testA testB
0 000681 B:G+ 11 15
1 000683 B:G+ 13 17
2 000684 B:G+ 14 -
3 000682 B:G+ - 16
If you want sorting by the sum of both columns, use Series.argsort:
df = pd.DataFrame(data)
c = ['testA','testB']
df[c] = df[c].apply(lambda x: pd.to_numeric(x, errors='coerce'))
df2 = df.groupby(['ID','type'])[c].sum(min_count=1)
df2 = df2.iloc[(-df2).sum(axis=1).argsort()].fillna('-').reset_index()
print (df2)
ID type testA testB
0 000683 B:G+ 13 17
1 000681 B:G+ 11 15
2 000682 B:G+ - 16
3 000684 B:G+ 14 -
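If argsort feels opaque, the same descending sort by the combined total can be written with a temporary helper column; this is just a readability variant of the answer above, shown end to end:

```python
import pandas as pd

data = [
    {'ID': '000681', 'type': 'B:G+', 'testA': '11'},
    {'ID': '000682', 'type': 'B:G+', 'testA': '-'},
    {'ID': '000683', 'type': 'B:G+', 'testA': '13'},
    {'ID': '000684', 'type': 'B:G+', 'testA': '14'},
    {'ID': '000681', 'type': 'B:G+', 'testB': '15'},
    {'ID': '000682', 'type': 'B:G+', 'testB': '16'},
    {'ID': '000683', 'type': 'B:G+', 'testB': '17'},
    {'ID': '000684', 'type': 'B:G+', 'testB': '-'},
]

df = pd.DataFrame(data)
c = ['testA', 'testB']
df[c] = df[c].apply(pd.to_numeric, errors='coerce')  # '-' becomes NaN

totals = df.groupby(['ID', 'type'])[c].sum(min_count=1)
df3 = (totals.assign(total=totals.sum(axis=1))       # NaNs are skipped in the row sum
             .sort_values('total', ascending=False)  # largest combined total first
             .drop(columns='total')
             .fillna('-')
             .reset_index())
print(df3)
```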

Web scraping using Beautifulsoup to collect dropdown values

I am new to Python and am trying to get a list of all the dropdown values from the website "https://www.sfma.org.sg/member/category", but I am failing to do so.
The code below produces an empty list:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re
import pandas as pd
page = "https://www.sfma.org.sg/member/category"
information = requests.get(page)
soup = BeautifulSoup(information.content, 'html.parser')
categories = soup.find_all('select', attrs={'class' :'w3-select w3-border'})
The desired output is the below list :-
['Alcoholic Beverage', 'Beer', 'Bottled Beverage', .........., 'Trader', 'Wholesaler']
Thanks !!
The options are loaded through JavaScript, but the data is on the page. With some crude regexes you can extract it:
import re
import json
import requests
url = 'https://www.sfma.org.sg/member/category/'
text = requests.get(url).text
d = re.findall(r'var\s*cObject\s*=\s*(.*)\s*;', text)[0]
d = re.sub(r'(\w+)(?=:)', r'"\1"', d)
d = json.loads(d.replace("'", '"'))
from pprint import pprint
pprint(d, width=200)
Prints:
{'category': [{'cat_type': '1', 'id': '1', 'name': 'Alcoholic Beverage', 'permalink': 'alcoholic-beverage', 'status': '2'},
{'cat_type': '1', 'id': '2', 'name': 'Beer', 'permalink': 'beer', 'status': '2'},
{'cat_type': '1', 'id': '3', 'name': 'Bottled Beverage', 'permalink': 'bottled-beverage', 'status': '2'},
{'cat_type': '1', 'id': '4', 'name': 'Canned Beverage', 'permalink': 'canned-beverage', 'status': '2'},
{'cat_type': '1', 'id': '5', 'name': 'Carbonated Beverage', 'permalink': 'carbonated-beverage', 'status': '2'},
{'cat_type': '1', 'id': '6', 'name': 'Cereal / Grain Beverage', 'permalink': 'cereal-grain-beverage', 'status': '2'},
{'cat_type': '1', 'id': '7', 'name': 'Cider', 'permalink': 'cider', 'status': '2'},
{'cat_type': '1', 'id': '8', 'name': 'Coffee', 'permalink': 'coffee', 'status': '2'},
{'cat_type': '1', 'id': '9', 'name': 'Distilled Water', 'permalink': 'distilled-water', 'status': '2'},
{'cat_type': '1', 'id': '10', 'name': 'Fruit / Vegetable Juice', 'permalink': 'fruit-vegetable-juice', 'status': '2'},
{'cat_type': '1', 'id': '11', 'name': 'Herbal Beverage', 'permalink': 'herbal-beverage', 'status': '2'},
{'cat_type': '1', 'id': '12', 'name': 'Instant Beverage', 'permalink': 'instant-beverage', 'status': '2'},
{'cat_type': '1', 'id': '13', 'name': 'Milk', 'permalink': 'milk', 'status': '2'},
{'cat_type': '1', 'id': '14', 'name': 'Mineral Water', 'permalink': 'mineral-water', 'status': '2'},
...and so on.
EDIT: To print just the names of the categories, you can do this:
for c in d['category']:
    print(c['name'])
Prints:
Alcoholic Beverage
Beer
Bottled Beverage
Canned Beverage
Carbonated Beverage
Cereal / Grain Beverage
Cider
...
Manufacturer
Restaurant
Retail Outlet
Supplier
Trader
Wholesaler
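Since the regex-to-JSON conversion is the fragile step, here it is isolated on a tiny inline stand-in for the embedded JavaScript object (hypothetical sample, no network needed):

```python
import json
import re

# Miniature, hypothetical stand-in for the JavaScript object embedded in the page.
text = "var cObject = {category:[{id:'1',name:'Beer'},{id:'2',name:'Cider'}]};"

d = re.findall(r'var\s*cObject\s*=\s*(.*)\s*;', text)[0]  # grab the object literal
d = re.sub(r'(\w+)(?=:)', r'"\1"', d)                     # quote the bare keys
d = json.loads(d.replace("'", '"'))                       # single -> double quotes, parse
print([c['name'] for c in d['category']])  # ['Beer', 'Cider']
```

Note that if the real page ever pads its keys with whitespace (e.g. name : '...'), the key-quoting regex would need a \s* before the lookahead.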
This is not really a proper question, but still:
categories = soup.find("select", attrs={"name": "ctype"}).find_all('option')
result = [cat.get_text() for cat in categories]

How to unpack an object of dictionaries to a range of Data Frames

I am creating a function that grabs data from an ERP system to display to the end user.
I want to unpack an object of dictionaries and create a range of Pandas DataFrames with them.
For example, I have:
troRows
{0: [{'productID': 134336, 'price': '10.0000', 'amount': '1', 'cost': 0}],
1: [{'productID': 142141, 'price': '5.5000', 'amount': '4', 'cost': 0}],
2: [{'productID': 141764, 'price': '5.5000', 'amount': '1', 'cost': 0}],
3: [{'productID': 81661, 'price': '4.5000', 'amount': '1', 'cost': 0}],
4: [{'productID': 146761, 'price': '5.5000', 'amount': '1', 'cost': 0}],
5: [{'productID': 143585, 'price': '5.5900', 'amount': '9', 'cost': 0}],
6: [{'productID': 133018, 'price': '5.0000', 'amount': '1', 'cost': 0}],
7: [{'productID': 146250, 'price': '13.7500', 'amount': '5', 'cost': 0}],
8: [{'productID': 149986, 'price': '5.8900', 'amount': '2', 'cost': 0},
{'productID': 149790, 'price': '4.9900', 'amount': '2', 'cost': 0},
{'productID': 149972, 'price': '5.2900', 'amount': '2', 'cost': 0},
{'productID': 149248, 'price': '2.0000', 'amount': '2', 'cost': 0},
{'productID': 149984, 'price': '4.2000', 'amount': '2', 'cost': 0},
Each time, the function will need to unpack some number of dictionaries, which may have different numbers of rows, into a range of DataFrames.
So, for example, this range of dictionaries would return
DF0, DF1, DF2, DF3, DF4, DF5, DF6, DF7 and DF8.
I can unpack a single Dictionary with:
pd.DataFrame(troRows[8])
which returns
amount cost price productID
0 2 0 5.8900 149986
1 2 0 4.9900 149790
2 2 0 5.2900 149972
3 2 0 2.0000 149248
4 2 0 4.2000 149984
How can I structure my code so that it does this for all the dictionaries for me?
Solution for a dictionary of DataFrames - use a dictionary comprehension, keyed by the keys of the original dictionary:
dfs = {k: pd.DataFrame(v) for k, v in troRows.items()}
print (dfs)
{0: amount cost price productID
0 1 0 10.0000 134336, 1: amount cost price productID
0 4 0 5.5000 142141, 2: amount cost price productID
0 1 0 5.5000 141764, 3: amount cost price productID
0 1 0 4.5000 81661, 4: amount cost price productID
0 1 0 5.5000 146761, 5: amount cost price productID
0 9 0 5.5900 143585, 6: amount cost price productID
0 1 0 5.0000 133018, 7: amount cost price productID
0 5 0 13.7500 146250, 8: amount cost price productID
0 2 0 5.8900 149986
1 2 0 4.9900 149790
2 2 0 5.2900 149972
3 2 0 2.0000 149248
4 2 0 4.2000 149984}
print (dfs[8])
amount cost price productID
0 2 0 5.8900 149986
1 2 0 4.9900 149790
2 2 0 5.2900 149972
3 2 0 2.0000 149248
4 2 0 4.2000 149984
Solutions for one DataFrame:
Use list comprehension with flattening and pass it to DataFrame constructor:
troRows = pd.Series([[{'productID': 134336, 'price': '10.0000', 'amount': '1', 'cost': 0}],
[{'productID': 142141, 'price': '5.5000', 'amount': '4', 'cost': 0}],
[{'productID': 141764, 'price': '5.5000', 'amount': '1', 'cost': 0}],
[{'productID': 81661, 'price': '4.5000', 'amount': '1', 'cost': 0}],
[{'productID': 146761, 'price': '5.5000', 'amount': '1', 'cost': 0}],
[{'productID': 143585, 'price': '5.5900', 'amount': '9', 'cost': 0}],
[{'productID': 133018, 'price': '5.0000', 'amount': '1', 'cost': 0}],
[{'productID': 146250, 'price': '13.7500', 'amount': '5', 'cost': 0}],
[{'productID': 149986, 'price': '5.8900', 'amount': '2', 'cost': 0},
{'productID': 149790, 'price': '4.9900', 'amount': '2', 'cost': 0},
{'productID': 149972, 'price': '5.2900', 'amount': '2', 'cost': 0},
{'productID': 149248, 'price': '2.0000', 'amount': '2', 'cost': 0},
{'productID': 149984, 'price': '4.2000', 'amount': '2', 'cost': 0}]])
df = pd.DataFrame([y for x in troRows for y in x])
Another solution for flattening your data is chain.from_iterable:
from itertools import chain
df = pd.DataFrame(list(chain.from_iterable(troRows)))
print (df)
amount cost price productID
0 1 0 10.0000 134336
1 4 0 5.5000 142141
2 1 0 5.5000 141764
3 1 0 4.5000 81661
4 1 0 5.5000 146761
5 9 0 5.5900 143585
6 1 0 5.0000 133018
7 5 0 13.7500 146250
8 2 0 5.8900 149986
9 2 0 4.9900 149790
10 2 0 5.2900 149972
11 2 0 2.0000 149248
12 2 0 4.2000 149984
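If one combined DataFrame that still remembers which dictionary key each row came from is acceptable, a middle ground is pd.concat on the dict of DataFrames, which builds a MultiIndex from the keys (sketch on a shortened, hypothetical stand-in for troRows):

```python
import pandas as pd

# Shortened stand-in for troRows, same shape as the data above.
troRows = {
    0: [{'productID': 134336, 'price': '10.0000', 'amount': '1', 'cost': 0}],
    1: [{'productID': 142141, 'price': '5.5000', 'amount': '4', 'cost': 0}],
    8: [{'productID': 149986, 'price': '5.8900', 'amount': '2', 'cost': 0},
        {'productID': 149790, 'price': '4.9900', 'amount': '2', 'cost': 0}],
}

dfs = {k: pd.DataFrame(v) for k, v in troRows.items()}

# One frame; the outer index level is the original dictionary key.
combined = pd.concat(dfs, names=['order', 'row'])
print(combined)
```

combined.loc[8] recovers the same frame as dfs[8], so nothing is lost by combining.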

Filling in blank dictionary values based on other key value pairs

I have a df that contains a column ['mjtheme_namecode'] whose entries are lists of dictionaries, each containing a code and a name. All of the codes have numbers, but some of the names are missing. I would like to fill in the missing name values based on other pairs with the same code. Here is the df column in question:
import pandas as pd
import json
import numpy as np
from pandas.io.json import json_normalize
df = pd.read_json('data/world_bank_projects.json')
print(df['mjtheme_namecode'].head(15))
0 [{'code': '8', 'name': 'Human development'}, {...
1 [{'code': '1', 'name': 'Economic management'},...
2 [{'code': '5', 'name': 'Trade and integration'...
3 [{'code': '7', 'name': 'Social dev/gender/incl...
4 [{'code': '5', 'name': 'Trade and integration'...
5 [{'code': '6', 'name': 'Social protection and ...
6 [{'code': '2', 'name': 'Public sector governan...
7 [{'code': '11', 'name': 'Environment and natur...
8 [{'code': '10', 'name': 'Rural development'}, ...
9 [{'code': '2', 'name': 'Public sector governan...
10 [{'code': '10', 'name': 'Rural development'}, ...
11 [{'code': '10', 'name': 'Rural development'}, ...
12 [{'code': '4', 'name': ''}]
13 [{'code': '5', 'name': 'Trade and integration'...
14 [{'code': '6', 'name': 'Social protection and ...
Name: mjtheme_namecode, dtype: object
I know I could make the column a separate df and then ffill, but I think I would have to reindex, so I don't think I could put it back in place afterwards. Ideally, I'd build a list (with no duplicates) of only the dict items that have both a code and a name, then iterate over the column in a for loop, replacing each missing name with the matching value from that list. Does this make sense? I'm not sure how to go about it.
You can take a similar approach of creating a new DataFrame, but then transition back:
theme= pd.DataFrame([val for pair in df['mjtheme_namecode'].values for val in pair])
mapper = theme.drop_duplicates().replace(r'', np.nan).dropna().set_index('code').name.to_dict()
Using a list comprehension to put it all together:
s = pd.Series(
    [[{'code': i['code'], 'name': mapper[i['code']]}
      for i in t] for t in df.mjtheme_namecode]
)
s.head(13)
s.head(13)
0 [{'code': '8', 'name': 'Human development'}, {...
1 [{'code': '1', 'name': 'Economic management'},...
2 [{'code': '5', 'name': 'Trade and integration'...
3 [{'code': '7', 'name': 'Social dev/gender/incl...
4 [{'code': '5', 'name': 'Trade and integration'...
5 [{'code': '6', 'name': 'Social protection and ...
6 [{'code': '2', 'name': 'Public sector governan...
7 [{'code': '11', 'name': 'Environment and natur...
8 [{'code': '10', 'name': 'Rural development'}, ...
9 [{'code': '2', 'name': 'Public sector governan...
10 [{'code': '10', 'name': 'Rural development'}, ...
11 [{'code': '10', 'name': 'Rural development'}, ...
12 [{'code': '4', 'name': 'Financial and private ...
dtype: object
As you can see, the last row (row 12) has been correctly filled in, as have the others, and you can reassign this to your original DataFrame.
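For reference, here is the same build-a-mapper-then-rewrite idea on a tiny inline sample (hypothetical data, so the mechanics are visible without the World Bank file):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of df['mjtheme_namecode'].
col = pd.Series([
    [{'code': '8', 'name': 'Human development'}],
    [{'code': '4', 'name': ''}],  # missing name to fill
    [{'code': '4', 'name': 'Financial and private sector development'}],
])

# Flatten all code/name pairs, keep only complete ones, build code -> name.
pairs = pd.DataFrame([d for row in col for d in row])
mapper = (pairs.drop_duplicates()
               .replace('', np.nan)
               .dropna()
               .set_index('code')['name']
               .to_dict())

# Rewrite every dict, looking each name up by its code.
filled = pd.Series(
    [[{'code': d['code'], 'name': mapper[d['code']]} for d in row] for row in col]
)
print(filled[1])  # the empty name is now filled in from the matching code
```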
