Combining Code Steps into a User-Defined Function - python

I'm writing a function that retrieves data from an API based on an ID number, reads the JSON response into a pandas dataframe, munges the dataframe, and finally compiles every dataframe together. The goal is to pass a pandas Series of ID numbers into the function and retrieve the relevant data for a list of thousands of IDs.
When I execute every step manually, the steps work: I get a nice one-row pandas dataframe with all of the columns and values that I want. But when I combine all of the steps within a function containing a for loop, it stops working.
Here are the steps:
from urllib.request import Request, urlopen
import pandas as pd

req = Request('https://gs-api.greatschools.org/schools/3601714/metrics') ##build the request
req.add_header('X-API-Key', 'XXXXXXXXXXXXXXXXXXXXXXX') ##authenticate
content = urlopen(req).read() ##retrieve the response body
data = pd.read_json(content) ##convert the JSON to a pandas dataframe
data.reset_index(inplace=True) ##reset the index
data['id'] = 3601714 ##add an id column
data.drop(columns=['head-official-name', 'head-official-email'], inplace=True) ##drop unneeded columns
##pivot the dataframe (keyword arguments spelled out for clarity;
##a list-valued index= requires pandas >= 1.1)
data.pivot(index=['enrollment',
                  'percent-free-and-reduced-price-lunch',
                  'percent-students-with-limited-english-proficiency',
                  'student-teacher-ratio',
                  'percentage-male',
                  'percentage-female',
                  'percentage-of-teachers-with-3-or-more-years-experience',
                  'percentage-of-full-time-teachers-who-are-certified',
                  'average-salary', 'id'],
           columns='index',
           values='ethnicity')
I've combined all of these steps into a function:
def demographics(universal_id):
    demo_mstr = []
    for item in universal_id:
        id = item
        req = Request(f'https://gs-api.greatschools.org/schools/{id}/metrics')
        req.add_header('X-API-Key', 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
        content = urlopen(req).read()
        data = pd.read_json(content)
        data.reset_index(inplace=True)
        data['id'] = id
        data.drop(columns=['head-official-name', 'head-official-email'], inplace=True)
        data = data.pivot(index=['enrollment',
                                 'percent-free-and-reduced-price-lunch',
                                 'percent-students-with-limited-english-proficiency',
                                 'student-teacher-ratio',
                                 'percentage-male',
                                 'percentage-female',
                                 'percentage-of-teachers-with-3-or-more-years-experience',
                                 'percentage-of-full-time-teachers-who-are-certified',
                                 'average-salary', 'id'],
                          columns='index',
                          values='ethnicity')
        demo_mstr.append(data)
    return demo_mstr
If I run the function on a test list of ID numbers, I get the following error: HTTPError: HTTP Error 422
I've rewritten the function a number of times, and I've managed to get different error types, but not a working function.
What am I missing?

Update: I am answering my own question, in the hope that it helps someone.
I figured out that the 422 error occurred because not every ID number has data associated with it in the API. Some of the API calls were returning no data, which caused the error.
As for pivot, I realized that the need for it was caused by pandas' poor handling of JSON data. In my mind, pd.read_json is only good for exploratory analysis, and even then it's kind of useless.
What you should do instead is use r.json() to unpack the raw JSON into its constituent dictionaries, then write a parse_json function that iterates over the dictionaries and converts them into the column names you want.
Converting first to a pandas dataframe, then pivoting, then trying to append dataframes together is a recipe for disaster. Stay in JSON, do what you need to do with JSON, append everything into a master list, and only convert to a pandas dataframe at the very end! A rough sketch of this pattern follows.
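Here is a minimal sketch of that pattern, assuming the requests library; parse_json is a hypothetical helper whose real body would depend on the actual API schema, and the endpoint is the one from the question:

import pandas as pd
import requests

API_KEY = 'XXXXXXXXXXXXXXXXXXXXXXX'

def parse_json(record, school_id):
    # Hypothetical helper: flatten one API response dict into a plain
    # row dict keyed by the column names you want
    return {'id': school_id, **record}

def demographics(universal_id):
    rows = []
    for school_id in universal_id:
        r = requests.get(
            f'https://gs-api.greatschools.org/schools/{school_id}/metrics',
            headers={'X-API-Key': API_KEY},
        )
        if r.status_code == 422:  # some IDs have no data; skip them
            continue
        r.raise_for_status()
        rows.append(parse_json(r.json(), school_id))
    # Only convert to a pandas dataframe at the very end
    return pd.DataFrame(rows)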

Related

How do I un-nest multiple sets of data and set them as columns

I am trying to bring this data from an API into a dataframe, but some of the stats are nested under other keys. For example, I would like each stat like 'passing_att' to be its own column, but it is nested under 'passing', and 'passing', 'rushing', and 'receiving' are all nested under 'stats'. I can bring just 'passing' into a dataframe, which shows 'passing_att' and the rest of the 'passing' data, but then I don't have the player names and all the other data I want.
The correct term is to 'flatten' a JSON object. There are many ways to do this. You could use the flatten_json pip package or write your own function. For your specific use case, though, you can use the json_normalize() function from the pandas library, because you are trying to flatten a list of JSON objects that can easily be converted into a pandas dataframe.
Once you have the nested JSON object as shown below (performed in your code):
json_list = json.loads(response.text)
Use the following code to flatten it:
df = pandas.json_normalize(json_list)
You will get all of the nested columns flattened into a single-level dataframe, with levels separated by a period. For example, the passing_att attribute can be accessed via the column name 'stats.passing.passing_att'. You can change the separator by passing the sep parameter to json_normalize. For example, to use a hyphen as the separator:
df = pandas.json_normalize(json_list, sep='-')
If you want to select only a few columns, pass a list of column names to the dataframe's [] operator. For example, to select only the player_name, position, passing_att and passing_cmp columns:
df = df[['player_name', 'position', 'stats.passing.passing_att', 'stats.passing.passing_cmp']]
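A self-contained illustration, using a made-up record shaped like the data described in the question (the real API fields may differ):

import pandas as pd

# Made-up record in the nested shape described above
json_list = [
    {
        'player_name': 'J. Smith',
        'position': 'QB',
        'stats': {
            'passing': {'passing_att': 30, 'passing_cmp': 21},
            'rushing': {'rushing_att': 4},
        },
    },
]

df = pd.json_normalize(json_list)
print(df.columns.tolist())
# ['player_name', 'position', 'stats.passing.passing_att',
#  'stats.passing.passing_cmp', 'stats.rushing.rushing_att']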

using pandas.read_csv() for malformed csv data

This is a conceptual question, so no code or reproducible example.
I am processing data pulled from a database which contains records from automated processes. The regular record contains 14 fields, with a unique ID, and 13 fields containing metrics, such as the date of creation, the time of execution, the customer ID, the type of job, and so on. The database accumulates records at the rate of dozens a day, and a couple of thousand per month.
Sometimes, the processes result in errors, which result in malformed rows. Here is an example:
id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13 /*regular record, no error, 14 fields*/
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed" /*error in column 14*/
id3,m01,m02,"NO SUCH JOB error, failed" /*error in column 4*/
id4,m01,m02,m03,m04,m05,m06,"JOB failed, no time recorded" /*error in column 7*/
The requirements are to (1) populate a dashboard from the metrics, and (2) catalog the types of errors. The ideal solution would use read_csv with on_bad_lines set to some function that returns a dataframe. My hacky solution is to munge the data by hand, row by row, and create two dataframes from the output. The presence of bad lines can be reliably detected by the keyword "failed". I have written the logic that collects the "failed" messages and produces a stacked bar chart by date. It works, but I'd rather use a pure pandas solution.
Is it possible to use pd.read_csv() to return two dataframes? If so, how would this be done? Can you point me to any example code? Or am I totally off base? Thanks.
You can load your CSV file into a dataframe and apply a filter:
df = pd.read_csv("your_file.csv", header=None)
df_filter = df.apply(lambda row: row.astype(str).str.contains('failed').any(), axis=1)
df[df_filter.values]   # this gives a dataframe of the "failed" rows
df[~df_filter.values]  # this gives a dataframe of the "non-failed" rows
You need to make sure that the keyword does not appear in your legitimate data.
PS: There might be more optimized ways to do it.
This approach reads the entire CSV into a single column, then uses a mask identifying the failed rows to split out and build good and failed dataframes.
Read the entire CSV into a single column:
import io
dfs = pd.read_fwf(sim_csv, widths=[999999], header=None)
Build a mask identifying the failed rows:
fail_msk = dfs[0].str.contains('failed')
Use that mask to split out and build separate dataframes:
df_good = pd.read_csv(io.StringIO('\n'.join(dfs[~fail_msk].squeeze())), header=None)
df_fail = pd.read_csv(io.StringIO('\n'.join(dfs[fail_msk].squeeze())), header=None)
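For a runnable miniature of the same split, here is the mask approach applied directly to the sample rows from the question. (As an aside, read_csv's on_bad_lines callable only fires on rows with too many fields; these malformed rows have too few, so the keyword mask is the more reliable route here.)

import io
import pandas as pd

raw = '''id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed"
id3,m01,m02,"NO SUCH JOB error, failed"
id4,m01,m02,m03,m04,m05,m06,"JOB failed, no time recorded"'''

# Treat each physical line as one string, then split on the keyword
lines = pd.Series(raw.splitlines())
fail_msk = lines.str.contains('failed')

df_good = pd.read_csv(io.StringIO('\n'.join(lines[~fail_msk])), header=None)
df_fail = lines[fail_msk].reset_index(drop=True).to_frame(name='raw')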

Extracting data from a dict like column in a dataframe

I have a column named info in a dataframe, with data in a dict-like format like the one below.
I would like to get another dataframe with this info, and I tried:
feature = [d.get('Feature') for d in df['info']]
but it returns None.
How can I do it? I am really having a bad time trying to get this done.
As the dict is nested, you can try pd.json_normalize(), which normalizes semi-structured JSON data into a flat table:
df_new = pd.json_normalize(df['info'])
As some of the inner dicts are nested further under a list, you may need additional handling to dig out the deeper contents. Anyway, this should serve well as a starting point for your work.
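For instance, with a made-up 'info' column (the question's actual data isn't shown, so this shape is assumed):

import pandas as pd

# Hypothetical dict-like column; the real structure may differ
df = pd.DataFrame({
    'info': [
        {'Feature': 'A', 'detail': {'score': 1, 'tags': ['x', 'y']}},
        {'Feature': 'B', 'detail': {'score': 2, 'tags': ['z']}},
    ]
})

df_new = pd.json_normalize(df['info'])
# Columns: 'Feature', 'detail.score', 'detail.tags'
# 'detail.tags' still holds lists; df_new.explode('detail.tags')
# would unpack them if needed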

Python JSON dict to dataframe no rows

So after hours of unsuccessful googling, I finally decided to post this here.
I am trying to convert some data obtained by an API call to a pandas.DataFrame().
This is my code:
response = requests.get(url)
data_as_list = response.json()['data']

for dct in data_as_list:
    json_df = pd.DataFrame.from_records(dct)

Unfortunately, the returned dataframe only contains the column names but no row data at all, even though the dictionary has some. I already tried from_dict and pd.read_json() (after dumping it into a JSON string), but all of these had the same result.
The data is a nested dictionary in JSON format and looks like this
You can make dataframes from Python lists (containing dictionaries, or nested lists) like this:
json_df = pd.DataFrame(data_as_list)
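To see why this fixes the empty-rows problem, here is a runnable miniature with a made-up payload (the real shape of response.json()['data'] is assumed to be a list of dicts, as the question's loop suggests):

import pandas as pd

# Stand-in for response.json()['data']
data_as_list = [
    {'id': 1, 'name': 'alpha', 'value': 3.5},
    {'id': 2, 'name': 'beta', 'value': 7.1},
]

# Passing the whole list builds one dataframe with a row per dict,
# instead of rebuilding (and overwriting) a dataframe per iteration
json_df = pd.DataFrame(data_as_list)
print(json_df)
#    id   name  value
# 0   1  alpha    3.5
# 1   2   beta    7.1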

Selecting Pandas DataFrame Rows Based On Conditions

I am new to Python and getting to grips with pandas. I am trying to perform a simple import CSV, filter, write CSV, but the filter seems to be dropping rows of data compared to my Access query.
I am importing via the command below:
Costs1516 = pd.read_csv('C:......../1b Data MFF adjusted.csv')
Following import, I get a data warning that the 'Service code' column contains data of multiple types (some are numerical codes, others are purely text), but the import seems to assign the dtype object, which I thought would just treat them both as strings, and all would be fine...
I want the output dataframe to have the same structure as the imported data (Costs1516), but only to include rows where 'Service code' = '110'.
I have pulled the following SQL from Access which seems to do the job well, and returns 136k rows:
SELECT [1b Data MFF adjusted].*, [1b Data MFF adjusted].[Service code]
FROM [1b Data MFF adjusted]
WHERE ((([1b Data MFF adjusted].[Service code])="110"));
My pandas equivalent is below but only returns 99k records:
Costs1516Ortho = Costs1516.loc[Costs1516['Service code'] == '110']
I have compared the two outputs and I can't see any reason why pandas is excluding some lines and including others. I'm really stuck; any suggested areas to look at or approaches to test gratefully received.
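No answer is included above, but the DtypeWarning is a strong hint: with mixed types, some of the service codes are probably parsed as the integer 110 rather than the string '110', and the string comparison silently excludes those rows. A minimal sketch of that theory, forcing the column to string at read time (the path here is abbreviated):

import pandas as pd

# Read 'Service code' as string so numeric-looking codes like 110
# are not parsed as integers in some chunks of the file
Costs1516 = pd.read_csv(
    '1b Data MFF adjusted.csv',   # hypothetical path
    dtype={'Service code': str},
)

# Now every value compares as a string
Costs1516Ortho = Costs1516.loc[Costs1516['Service code'] == '110']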
