I have read through many articles and posts to connect to an api then format it into int/str however I did mange to make possibly the longest winded way ever its real ugly please could someone show me the shortest most efficient way to accomplish the below code any suggestions would be greatly appreciated bassically looking to print out "eos" in str format and "price" as int Thanks!
import urllib
import json
import pandas as pd
import numpy as np
import requests
r = requests.get('https://api.coinmarketcap.com/v1/ticker/eos/')
with open('events.csv','w') as fd:
fd.write(r.text)
data = pd.read_csv('events.csv', names=['Choose One'])
i = data.iloc[[6], [0]]
a = str(i)
name,price = a.split(":")
string = price[2:-1]
print(string)
It's simpler to just use pandas read_json to read the file into a data frame, read_json will automatically assign the apt datatype to each column, then use column selection to select 'name','price_usd' columns (of-course in this case there is only one row, but the same code can be used with multiple rows)
i.e.
import pandas as pd
df = pd.read_json('https://api.coinmarketcap.com/v1/ticker/eos/')
print(df[['name','price_usd']].apply(lambda row:'{}: {:.0f}'.format(ro
w['name'],row['price_usd']),axis=1))
using .0f in the format statement will display the integer part (rounded) of the price_usd value so the output will be.
0 EOS: 9
alternatively using the round function will round the float values
i.e.
In [34]: import pandas as pd
...: df = pd.read_json('https://api.coinmarketcap.com/v1/ticker/eos/')
...: print(df[['name','price_usd']].apply(lambda row:'{}: {:}'.format(row['n
...: ame'],round(row['price_usd'],2)),axis=1))
...:
...:
0 EOS: 8.99
dtype: object
Simply use json.loads(r.text) or much easier directly r.json().
Say, right now the api returns the following data:
[
{
"id": "eos",
"name": "EOS",
"symbol": "EOS",
"rank": "9",
"price_usd": "9.31992",
"price_btc": "0.00106154",
"24h_volume_usd": "596467000.0",
"market_cap_usd": "6034993504.0",
"available_supply": "647537050.0",
"total_supply": "900000000.0",
"max_supply": "1000000000.0",
"percent_change_1h": "1.3",
"percent_change_24h": "-6.81",
"percent_change_7d": "-36.4",
"last_updated": "1517755757"
}
]
If you use r.json(), you get this as a json, otherwise load it with data = json.loads(r.text) and save it to a pandas DataFrame with df = pd.DataFrame(data) which then looks like the following:
In [15]: df
Out[15]:
24h_volume_usd available_supply id last_updated market_cap_usd max_supply name percent_change_1h percent_change_24h percent_change_7d price_btc price_usd rank symbol total_supply
0 596467000.0 647537050.0 eos 1517755757 6034993504.0 1000000000.0 EOS 1.3 -6.81 -36.4 0.00106154 9.31992 9 EOS 900000000.0
Access the data with pandas indexing:
In [8]: df[['name', 'price_usd']]
Out[8]:
name price_usd
0 EOS 9.29186
Or for printing:
In [18]: print df.loc[0, 'name'], ': ', df.loc[0, 'price_usd']
EOS : 9.31992
Related
I just started with Great Expectations library and I want to know if it is possible to use it to remove invalidated data from Pandas DataFrame. And how I can do that if is possible ?
Also I want to insert invalid data to PostgreSQL database.
I didn't find anything about this in the documentation and on searching the Web.
Later Edit :
To clarify: I need that in the case great expectation for example find 5 rows in a DataFrame that are invalid (for example df.expect_column_values_to_not_be_null('age') has 5 rows with null) to remove them from original DataFrame and insert them in a PostgreSQL errors table
Great Expectations is a powerful tool to validate data.
Like all powerful tools, it's not that straightforward.
You can start from here:
import great_expectations as ge
import numpy as np
import pandas as pd
# get some random numbers and create a pandas df
df_raw = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
# initialize a "great_expectations" df
df = ge.from_pandas(df_raw)
# search for invalidate data on column 'A'.
# In this case, i'm looking for any null value from column 'A'.
df.expect_column_values_to_not_be_null('A')
Results:
{
"exception_info": null,
"expectation_config": {
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {
"column": "A",
"result_format": "BASIC"
},
"meta": {}
},
"meta": {},
"success": true,
"result": {
"element_count": 100,
"unexpected_count": 0,
"unexpected_percent": 0.0,
"partial_unexpected_list": []
}
}
Look at the response : good news !!!
There aren't null values in my df.
"unexpected_count" is equal to 0
API Reference :
https://legacy.docs.greatexpectations.io/en/latest/autoapi/great_expectations/index.html
EDIT:
If you need simply to find some invalid values and split your df into:
Clean Dataframe
Dirty Dataframe
maybe you dont need "great_expectations". you can use a function like this:
import pandas as pd
my_df = pd.DataFrame({'A': [1,2,1,2,3,0,1,1,5,2]})
def check_data_quality(dataframe):
df = dataframe
clean_df = df[df['A'].isin([1, 2])]
dirty_df = df[df["A"].isin([1, 2]) == False]
return {'clean': clean_df,
'dirty': dirty_df}
my_df_clean = check_data_quality(my_df)['clean']
my_df_dirty = check_data_quality(my_df)['dirty']
I'm trying to read such a JSON file in Python, to save only two of the values of each response part:
{
"responseHeader":{
"status":0,
"time":2,
"params":{
"q":"query",
"rows":"2",
"wt":"json"}},
"response":{"results":2,"start":0,"docs":[
{
"name":["Peter"],
"country":["England"],
"age":["23"]},
{
"name":["Harry"],
"country":["Wales"],
"age":["30"]}]
}}
For example, I want to put the name and the age in a table. I already tried it this way (based on this topic), but it's not working for me.
import json
import pandas as pd
file = open("myfile.json")
data = json.loads(file)
columns = [dct['name', 'age'] for dct in data['response']]
df = pd.DataFrame(data['response'], columns=columns)
print(df)
I also have seen more solutions of reading a JSON file, but that all were solutions of a JSON file with no other header values at the top, like responseHeader in this case. I don't know how to handle that. Anyone who can help me out?
import json
with open("myfile.json") as f:
columns = [(dic["name"],dic["age"]) for dic in json.load(f)["response"]["docs"]]
print(columns)
result:
[(['Peter'], ['23']), (['Harry'], ['30'])]
You can pass the list data["response"]["docs"] to pandas directly as it's a recordset.
df = pd.DataFrame(data["response"]["docs"])`
print(df)
>>> name country age
0 [Peter] [England] [23]
1 [Harry] [Wales] [30]
The data in you DatFrame will be bracketed though as you can see. If you want to remove the brackets you can consider the following:
for column in df.columns:
df.loc[:, column] = df.loc[:, column].str.get(0)
if column == 'age':
df.loc[:, column] = df.loc[:, column].astype(int)
sample = {"responseHeader":{
"status":0,
"time":2,
"params":{
"q":"query",
"rows":"2",
"wt":"json"}},
"response":{"results":2,"start":0,"docs":[
{
"name":["Peter"],
"country":["England"],
"age":["23"]},
{
"name":["Harry"],
"country":["Wales"],
"age":["30"]}]
}}
data = [(x['name'][0], x['age'][0]) for x in
sample['response']['docs']]
df = pd.DataFrame(names, columns=['name',
'age'])
I need to do a python script to
Read a csv file with the columns (person_id, name, flag). The file has 3000 rows.
Based on the person_id from the csv file, I need to call a URL passing the person_id to do a GET
http://api.myendpoint.intranet/get-data/1234
The URL will return some information of the person_id, like example below. I need to get all rents objects and save on my csv. My output needs to be like this
import pandas as pd
import requests
ids = pd.read_csv(f"{path}/data.csv", delimiter=';')
person_rents = df = pd.DataFrame([], columns=list('person_id','carId','price','rentStatus'))
for id in ids:
response = request.get(f'endpoint/{id["person_id"]}')
json = response.json()
person_rents.append( [person_id, rent['carId'], rent['price'], rent['rentStatus'] ] )
pd.read_csv(f"{path}/data.csv", delimiter=';' )
person_id;name;flag;cardId;price;rentStatus
1000;Joseph;1;6638;1000;active
1000;Joseph;1;5566;2000;active
Response example
{
"active": false,
"ctodx": false,
"rents": [{
"carId": 6638,
"price": 1000,
"rentStatus": "active"
}, {
"carId": 5566,
"price": 2000,
"rentStatus": "active"
}
],
"responseCode": "OK",
"status": [{
"request": 345,
"requestStatus": "F"
}, {
"requestId": 678,
"requestStatus": "P"
}
],
"transaction": false
}
After save the additional data from response on csv, i need to get data from another endpoint using the carId on the URL. The mileage result must be save in the same csv.
http://api.myendpoint.intranet/get-mileage/6638
http://api.myendpoint.intranet/get-mileage/5566
The return for each call will be like this
{"mileage":1000.0000}
{"mileage":550.0000}
The final output must be
person_id;name;flag;cardId;price;rentStatus;mileage
1000;Joseph;1;6638;1000;active;1000.0000
1000;Joseph;1;5566;2000;active;550.0000
SOmeone can help me with this script?
Could be with pandas or any python 3 lib.
Code Explanation
Create dataframe, df, with pd.read_csv.
It is expected that all of the values in 'person_id', are unique.
Use .apply on 'person_id', to call prepare_data.
prepare_data expects 'person_id' to be a str or int, as indicated by the type annotation, Union[int, str]
Call the API, which will return a dict, to the prepare_data function.
Convert the 'rents' key, of the dict, into a dataframe, with pd.json_normalize.
Use .apply on 'carId', to call the API, and extract the 'mileage', which is added to dataframe data, as a column.
Add 'person_id' to data, which can be used to merge df with s.
Convert pd.Series, s to a dataframe, with pd.concat, and then merge df and s, on person_id.
Save to a csv with pd.to_csv in the desired form.
Potential Issues
If there's an issue, it's most likely to occur in the call_api function.
As long as call_api returns a dict, like the response shown in the question, the remainder of the code will work correctly to produce the desired output.
import pandas as pd
import requests
import json
from typing import Union
def call_api(url: str) -> dict:
r = requests.get(url)
return r.json()
def prepare_data(uid: Union[int, str]) -> pd.DataFrame:
d_url = f'http://api.myendpoint.intranet/get-data/{uid}'
m_url = 'http://api.myendpoint.intranet/get-mileage/'
# get the rent data from the api call
rents = call_api(d_url)['rents']
# normalize rents into a dataframe
data = pd.json_normalize(rents)
# get the mileage data from the api call and add it to data as a column
data['mileage'] = data.carId.apply(lambda cid: call_api(f'{m_url}{cid}')['mileage'])
# add person_id as a column to data, which will be used to merge data to df
data['person_id'] = uid
return data
# read data from file
df = pd.read_csv('file.csv', sep=';')
# call prepare_data
s = df.person_id.apply(prepare_data)
# s is a Series of DataFrames, which can be combined with pd.concat
s = pd.concat([v for v in s])
# join df with s, on person_id
df = df.merge(s, on='person_id')
# save to csv
df.to_csv('output.csv', sep=';', index=False)
If there are any errors when running this code:
Leave a comment, to let me know.
edit your question, and paste the entire TraceBack, as text, into a code block.
Example
# given the following start dataframe
person_id name flag
0 1000 Joseph 1
1 400 Sam 1
# resulting dataframe using the same data for both id 1000 and 400
person_id name flag carId price rentStatus mileage
0 1000 Joseph 1 6638 1000 active 1000.0
1 1000 Joseph 1 5566 2000 active 1000.0
2 400 Sam 1 6638 1000 active 1000.0
3 400 Sam 1 5566 2000 active 1000.0
There are many different ways to implement this. One of them would be, like you started in your comment:
read the CSV file with pandas
for each line take the person_id and build a call
the delivered JSON response can then be taken from the rents
the carId is then extracted for each individual rental
finally this is collected in a row_list
the row_list is then converted back to csv via pandas
A very simple solution without any error handling could look something like this:
from types import SimpleNamespace
import pandas as pd
import requests
import json
path = '/some/path/'
df = pd.read_csv(f'{path}/data.csv', delimiter=';')
rows_list = []
for _, row in df.iterrows():
rentCall = f'http://api.myendpoint.intranet/get-data/{row.person_id}'
print(rentCall)
response = requests.get(rentCall)
r = json.loads(response.text, object_hook=lambda d: SimpleNamespace(**d))
for rent in r.rents:
mileageCall = f'http://api.myendpoint.intranet/get-mileage/{rent.carId}'
print(mileageCall)
response2 = requests.get(mileageCall)
m = json.loads(response2.text, object_hook=lambda d: SimpleNamespace(**d))
state = "active" if r.active else "inactive"
rows_list.append((row['person_id'], row['name'], row['flag'], rent.carId, rent.price, state, m.mileage))
df = pd.DataFrame(rows_list, columns=('person_id', 'name', 'flag', 'carId', 'price', 'rentStatus', 'mileage'))
print(df.to_csv(index=False, sep=';'))
Speeding up with multiprocessing
You mention that you have 3000 rows, which means that you'll have to make a lot of API calls. Depending on the connection, every one of these calls might take a while. As a result, performing this in a sequential way might be too slow. The majority of the time, your program will just be waiting on a response from the server without doing anything else.
We can improve this performance by using multiprocessing.
I use all the code from Trenton his answer, but I replace the following sequential call:
# call prepare_data
s = df.person_id.apply(prepare_data)
With a parallel alternative:
from multiprocessing import Pool
n_processes=20 # Experiment with this to see what works well
with Pool(n_processes) as p:
s=p.map(prepare_data, df.person_id)
Alternatively, a threadpool might be faster, but you'll have to test that by replacing the import with
from multiprocessing.pool import ThreadPool as Pool.
This is a follow-up question to this.
Using Pandas Style, I manage to format all of the values in a dataframe into values that have commas as thousands separators. However, when there are empty strings in a dataframe, the formatting fails.
Basically, my goal is convert from this: to this:
Can anyone help me with this?
This is the code I have so far:
import pandas as pd
from IPython.display import HTML
styles = [
hover(),
dict(selector = "th",
props = [("font-size", "110%"),
("text-align", "left"),
("background-color", "#cacaca")
]
)
]
column_01 = ["", 2000000000, "", 21000000, 3000]
df = pd.DataFrame(column_01)
int_frmt = lambda x: "{:,}".format(x) # Integer
float_frmt = lambda x: "{:,.0f}".format(x) if x > 1e3 else "{:,.2f}".format(x) # Float
str_frmt = lambda x: "{:}".format(x) # <----- Added for empty strings but fails
frmt_map = {np.dtype("int64"): int_frmt,
np.dtype("float64"): float_frmt,
np.dtype("S"): str_frmt # <----- Added for empty strings but fails
}
frmt = {col: frmt_map[df.dtypes[col]] for col in df.columns if df.dtypes[col] in frmt_map.keys()}
html = (df.style.set_table_styles(styles).format(frmt))
html
Using NumPy you could create a function to do the conversion, and vectorize() it. This can then be applied to your dataframe as follows:
import numpy as np
def thousands(x):
try:
return '{:,}'.format(int(x))
except ValueError as e:
return x
data = np.array(["","2000000000", "", "21000000", "3000"])
f_thousands = np.vectorize(thousands)
print f_thousands(data)
Giving you:
['' '2,000,000,000' '' '21,000,000' '3,000']
This attempts to convert the entry to an integer, and then use format's thousands separator. If the conversion fails, it returns the passed entry unchanged, e.g. blank
See also Python's Format Specification Mini-Language for more information.
Using Pandas, this could be done as follows:
import pandas as pd
def thousands(x):
try:
return '{:,}'.format(int(x))
except ValueError as e:
return x
data = pd.DataFrame(["","2000000000", "", "21000000", "3000"])
print data.applymap(thousands)
Giving you:
0
0
1 2,000,000,000
2
3 21,000,000
4 3,000
I have a generated file as follows:
[{"intervals": [{"overwrites": 35588.4, "latency": 479.52}, {"overwrites": 150375.0, "latency": 441.1485001192274}], "uid": "23"}]
I simplified the file a bit for space reasons (there are more columns besides for the "overwrites" and "latency" ). I would like to import the data into a dataframe so I can later on draw the latency. I tried the following:
with open(os.path.join(path, "my_file.json")) as json_file:
curr_list=json.load(json_file)
df=pd.Series(curr_list[0]['intervals'])
print df
which returned:
0 {u'overwrites': 35588.4, u'latency...
1 {u'overwrites': 150375.0, u'latency...
However I couldn't get to store df in a data structure that allows me to access the latency field as follows:
graph = df[['latency']]
graph.plot(title="latency")
Any ideas?
Thanks for the help!
I think you can use json_normalize:
import pandas as pd
from pandas.io.json import json_normalize
data = [{"intervals": [{"overwrites": 35588.4, "latency": 479.52},
{"overwrites": 150375.0, "latency": 441.1485001192274}],
"uid": "23"}]
result = json_normalize(data, 'intervals', ['uid'])
print result
latency overwrites uid
0 479.5200 35588.4 23
1 441.1485 150375.0 23