Selecting Pandas DataFrame Rows Based On Conditions - python

I am new to Python and getting to grips with pandas. I am trying to perform a simple import CSV, filter, write CSV, but the filter seems to be dropping rows of data compared to my Access query.
I am importing via the command below:
Costs1516 = pd.read_csv('C:......../1b Data MFF adjusted.csv')
Following import I get a data warning that the service code column contains data of multiple types (some are numerical codes, others are purely text), but the import seems to assign the dtype object, which I thought would just treat them both as strings so all would be fine.
I want the output dataframe to have the same structure as the imported data (Costs1516), but only to include rows where 'Service code' = '110'.
I have pulled the following SQL from Access which seems to do the job well, and returns 136k rows:
SELECT [1b Data MFF adjusted].*, [1b Data MFF adjusted].[Service code]
FROM [1b Data MFF adjusted]
WHERE ((([1b Data MFF adjusted].[Service code])="110"));
My pandas equivalent is below but only returns 99k records:
Costs1516Ortho = Costs1516.loc[Costs1516['Service code'] == '110']
I have compared the two outputs and I can't see any reason why pandas is excluding some lines and including others. I'm really stuck; any suggested areas to look at or approaches to test gratefully received.
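Given the mixed-type warning, a likely culprit is that some 'Service code' values were parsed as integers (110) while others stayed strings ('110'), so a string comparison only matches a subset. A minimal sketch of the usual fix, forcing the column to string on import and stripping whitespace before comparing (the dtype argument and the strip are assumptions, not from the original post):
import pandas as pd

# Force the ambiguous column to string so 110 and '110' end up identical
Costs1516 = pd.read_csv('C:......../1b Data MFF adjusted.csv',
                        dtype={'Service code': str})

# Strip stray whitespace, then filter
Costs1516Ortho = Costs1516.loc[Costs1516['Service code'].str.strip() == '110']
Comparing Costs1516['Service code'].apply(type).value_counts() before and after the dtype change is a quick way to confirm whether mixed types were the cause.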

using pandas.read_csv() for malformed csv data

This is a conceptual question, so there is no code or reproducible example.
I am processing data pulled from a database which contains records from automated processes. The regular record contains 14 fields, with a unique ID, and 13 fields containing metrics, such as the date of creation, the time of execution, the customer ID, the type of job, and so on. The database accumulates records at the rate of dozens a day, and a couple of thousand per month.
Sometimes, the processes result in errors, which result in malformed rows. Here is an example:
id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13 /*regular record, no error, 14 fields*/
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed" /*error in column 14*/
id3,m01,m02,"NO SUCH JOB error, failed" /*error in column 4*/
id4,m01,m02,m03,m04,m05,m06,"JOB failed, no time recorded" /*error in column 7*/
The requirements are to (1) populate a dashboard from the metrics, and (2) catalog the types of errors. The ideal solution uses read_csv with on_bad_lines set to some function that returns a dataframe. My hacky solution is to munge the data by hand, row by row, and create two data frames from the output. The presence of the bad lines can be reliably detected by the use of the keyword "failed." I have written the logic that collects the "failed" messages and produces a stacked bar chart by date. It works, but I'd rather use a total Pandas solution.
Is it possible to use pd.read_csv() to return 2 dataframes? If so, how would this be done? Can you point me to any example code? Or am I totally off base? Thanks.
You can load your CSV file into a DataFrame and apply a filter:
import pandas as pd

df = pd.read_csv("your_file.csv", header=None)
# Flag rows where any field contains the keyword "failed"
df_filter = df.apply(lambda row: row.astype(str).str.contains('failed').any(), axis=1)
df[df_filter.values]   # this gives a dataframe of "failed" rows
df[~df_filter.values]  # this gives a dataframe of "non failed" rows
You need to make sure that your keyword does not appear in your valid data.
PS: there might be more optimized ways to do it.
This approach reads the entire CSV into a single column, then uses a mask that identifies the failed rows to split out and build good and failed dataframes.
Read the entire CSV into a single column
import io
import pandas as pd

dfs = pd.read_fwf(sim_csv, widths=[999999], header=None)
Build a mask identifying the failed rows
fail_msk = dfs[0].str.contains('failed')
Use that mask to split out and build separate dataframes
df_good = pd.read_csv(io.StringIO('\n'.join(dfs[~fail_msk].squeeze())), header=None)
df_fail = pd.read_csv(io.StringIO('\n'.join(dfs[fail_msk].squeeze())), header=None)
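A quick way to sanity-check this approach is to run it on the sample rows from the question; here sim_csv is built in memory with io.StringIO as a stand-in for the real file:
import io
import pandas as pd

sim_csv = io.StringIO(
    'id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13\n'
    'id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed"\n'
    'id3,m01,m02,"NO SUCH JOB error, failed"\n'
    'id4,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13\n'
)

dfs = pd.read_fwf(sim_csv, widths=[999999], header=None)
fail_msk = dfs[0].str.contains('failed')
df_good = pd.read_csv(io.StringIO('\n'.join(dfs[~fail_msk].squeeze())), header=None)
df_fail = pd.read_csv(io.StringIO('\n'.join(dfs[fail_msk].squeeze())), header=None)
df_good holds the two regular 14-column records; df_fail holds the two error records, with short rows padded out with NaN.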

Combining Code Steps into a User-Defined Function

I'm writing a function that retrieves data from an API based on an ID#, then reads the json response into a pandas dataframe, munges the dataframe, and finally compiles every dataframe together. The goal is to pass a pandas series of ID#'s into the function, to retrieve the relevant data for a list of thousands of IDs.
When I execute every step manually, the steps work. I get a nice one-row pandas dataframe with all of the columns and the values that I want. When I combine all of the steps within a function containing a for-loop, it stops working.
Here are the steps:
from urllib.request import Request, urlopen
import pandas as pd

req = Request('https://gs-api.greatschools.org/schools/3601714/metrics') ##request
req.add_header('X-API-Key', 'XXXXXXXXXXXXXXXXXXXXXXX') ##authenticate
content = urlopen(req).read() ##retrieve
data = pd.read_json(content) ##convert json to pandas dataframe
data.reset_index(inplace=True) ##reset index
data['id'] = 3601714 ##add id column
data.drop(columns=['head-official-name','head-official-email'],inplace=True) ##drop columns
data.pivot(index=['enrollment',
                  'percent-free-and-reduced-price-lunch',
                  'percent-students-with-limited-english-proficiency',
                  'student-teacher-ratio',
                  'percentage-male',
                  'percentage-female',
                  'percentage-of-teachers-with-3-or-more-years-experience',
                  'percentage-of-full-time-teachers-who-are-certified',
                  'average-salary', 'id'],
           columns='index', values='ethnicity') ##pivot the dataframe
I've combined all of these steps into a function:
def demographics(universal_id):
    demo_mstr = []
    for item in universal_id:
        id = item
        req = Request(f'https://gs-api.greatschools.org/schools/{id}/metrics')
        req.add_header('X-API-Key', 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
        content = urlopen(req).read()
        data = pd.read_json(content)
        data.reset_index(inplace=True)
        data['id'] = id
        data.drop(columns=['head-official-name','head-official-email'], inplace=True)
        data = data.pivot(index=['enrollment',
                                 'percent-free-and-reduced-price-lunch',
                                 'percent-students-with-limited-english-proficiency',
                                 'student-teacher-ratio',
                                 'percentage-male',
                                 'percentage-female',
                                 'percentage-of-teachers-with-3-or-more-years-experience',
                                 'percentage-of-full-time-teachers-who-are-certified',
                                 'average-salary', 'id'],
                          columns='index', values='ethnicity')
        demo_mstr.append(data)
    return demo_mstr
If I run the function on a test list of ID#s, I get the following error: HTTPError: HTTP Error 422:
I've rewritten the function a number of times, and I've managed to get different error types, but not a working function.
What am I missing?
Update: I am answering my own question, in the hopes that it helps someone.
So, I figured out that the 422 error was related to the fact that not every ID# had data associated with it in the API. Hence, some of the API calls were returning no data, which caused the error.
As for pivot, I realized that the need for pivot was caused by pandas' poor handling of json data. In my mind, pd.read_json is only good for exploratory analysis, and even then, it's kind of useless.
What you should do instead is use r.json() to unpack the raw JSON into its constituent dictionaries, and write a parse_json function that iterates over the dictionaries and converts them into the column names you desire.
Converting first to a pandas dataframe, then pivoting, then trying to append dataframes together is a recipe for disaster. Stay in JSON, do what you need to do with JSON, append all the JSON records into a master list, and only convert to a pandas dataframe at the very end!
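A minimal sketch of that flow, assuming the requests library and a hypothetical parse_json helper that flattens one API response into a flat dict (the helper's logic and the error handling are illustrative, not from the original post):
import pandas as pd
import requests

def parse_json(payload, school_id):
    # Hypothetical flattener: keep scalar metrics and tag them with the id
    record = {k: v for k, v in payload.items() if not isinstance(v, (dict, list))}
    record['id'] = school_id
    return record

def demographics(universal_id):
    records = []
    for school_id in universal_id:
        url = f'https://gs-api.greatschools.org/schools/{school_id}/metrics'
        r = requests.get(url, headers={'X-API-Key': 'XXXXXXXX'})
        if r.status_code != 200:  # e.g. the 422s for ids with no data
            continue              # skip instead of crashing the whole loop
        records.append(parse_json(r.json(), school_id))
    return pd.DataFrame(records)  # convert to a dataframe only at the very end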

Table scraping Python

I'm currently trying to parse this table: http://kuap.ru/banks/8012/balances/en
However, I ran into a problem: the table includes lots of options for drop-down lists (which I don't need), and tbody seems to end unexpectedly somewhere in the beginning of the table.
So, basically, I've got three questions:
Could you please provide working code to parse the whole table and turn it into a dataframe?
Is it possible to start parsing from a specific line in this kind of table, like "start with id..."? How?
Is it possible to parse only a specific column in a table like this (where columns don't have specific IDs)? For example, can I scrape the data only from the first two columns (names and the first column with numbers)?
Thanks a lot in advance!
import pandas as pd
df = pd.read_html("http://kuap.ru/banks/8012/balances/en", skiprows=[0])[-1]
df.drop(df.columns[-1], axis=1, inplace=True)
print(df)
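For the third question, once read_html has produced the dataframe, positional indexing with iloc covers keeping just the first two columns; a short sketch:
import pandas as pd

df = pd.read_html("http://kuap.ru/banks/8012/balances/en", skiprows=[0])[-1]
first_two = df.iloc[:, :2]  # names plus the first column with numbers
print(first_two)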

Python Searching Excel sheet for a row based on keyword and returning the row

Disclosure: I am not an expert in this, nor do I have much practice with it. I have, however, spent several hours attempting to figure this out on my own.
I have an Excel sheet with thousands of object serial numbers and the addresses where they are located. I am attempting to write a script that will search columns 'A' and 'B' for either the serial number or the address, and return the whole row with the additional information on the particular object. I am attempting to use Python to write this script as I am trying to integrate this with a preexisting script I have to access other sources. Using pandas, I have been able to load the .xls sheet and return the values of the whole spreadsheet, but I cannot figure out how to search it in such a way as to only return the row pertaining to what I am talking about.
[screenshot: excel sheet example]
Here is the code that I have:
import pandas as pd
data = pd.read_excel('my/path/to.xls')
print(data.head())
I can work with the print function to print various parts of the sheet, however whenever I try to add a search function to it, I get lost and my online research is less than helpful. A.) Is there a better python module to be using for this task? Or B.) How do I implement a search function to return a row of data as a variable to be displayed and/or used for other aspects of the program?
Something like this should work; pandas works with rows, columns and indices, and we can take advantage of all three to get your utility working.
import pandas as pd

serial_number = input("What is your serial number: ")
address = input("What is your address: ")

# read in dataframe
data = pd.read_excel('my/path/to.xls')

# filter dataframe with a str.contains method
# (na=False keeps NaN cells from breaking the boolean mask)
filter_df = data.loc[data['A'].str.contains(serial_number, na=False)
                     | data['B'].str.contains(address, na=False)]
print(filter_df)
If you have a list of items, e.g.
serial_nums = [0, 5, 9, 11]
you can use isin, which filters your dataframe based on a list:
data.loc[data['A'].isin(serial_nums)]
Hopefully this gets you started.
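To hand a matching row to the rest of your program as a variable, a short sketch building on filter_df above (assumes at least one match was found):
# Take the first matching row as a Series and reuse its fields elsewhere
if not filter_df.empty:
    row = filter_df.iloc[0]
    print(row['A'], row['B'])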

Creating a Cross Tab Query in SQL Alchemy

I did some reading on Google and in the SQLAlchemy documentation but could not find any kind of built-in functionality that could take a standard SQL-formatted table and transform it into a crosstab query like Microsoft Access does.
In the past, when using Excel and Microsoft Access, I have created "cross tab" queries. Below is the SQL code from an example:
TRANSFORM Min([Fixed Day-19_Month-8_142040].VoltageAPhase) AS MinOfVoltageAPhase
SELECT [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
FROM [Fixed Day-19_Month-8_142040]
GROUP BY [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
PIVOT [Fixed Day-19_Month-8_142040].Date;
I am very unskilled when it comes to SQL, and the only way I was able to write this was by generating it in Access.
My question is: since SQLAlchemy is really just a nice way of calling or generating SQL using Python functions/methods, is there a way I could use SQLAlchemy to call a custom query that generates the SQL code (in the above block) to make a crosstab query? Obviously, I would have to change some of the SQL to shoehorn it in with the correct fields and names, but the keywords should be the same, right?
The other problem is that, in addition to returning the objects for each entry in the table, I would need the field names; I think this is called "metadata"? The end goal is that, once I had that information, I would output it to Excel or CSV using another package.
UPDATED
Okay, so Van's suggestion to use pandas is, I think, the way to go; I'm currently in the process of figuring out how to create the crosstab:
def OnCSVfile(self, event):
    query = session.query(Exception).filter_by(company=self.company)
    data_frame = pandas.read_sql(query.statement, query.session.bind)  ## Get data frame in pandas
    pivot = data_frame.crosstab()
So I have been reading the pandas link you provided and have a question about the parameters.
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True)
Since I'm calling "crosstab" off the dataframe object, I assume there must be some kind of built-in way the dataframe recognizes column and row names. For index, would I pass in a list of strings that specify which fields I want tabulated in rows? For columns, would I pass in a list of strings that specify which field I want along the columns? From what I know about crosstab queries, there should only be one specification field for the column, right? For values, I want the minimum function, so I would have to pass some parameter to return the minimum value. Currently searching for an answer.
So if I have the following fields in my flat data frame (my original SQL query):
Name, Date and Rank
And I want to pivot the data as follows:
Name = Row of Crosstab
Date = Column of Crosstab
Rank = Min Value of Crosstab
Would the function call be something like:
data_frame.crosstab(['Name'], ['Date'], values=['Rank'], aggfunc=min)
I tried this code below:
query = session.query(Exception)
data_frame = pandas.read_sql(query.statement, query.session.bind)
row_list = pandas.Series(['meter_form'])
col_list = pandas.Series(['company'])
print(row_list)
pivot = data_frame.crosstab(row_list, col_list)
But I get an error saying that data_frame has no attribute crosstab.
I guess this might be too much new information for you at once. Nonetheless, I would approach it completely differently; I would basically use the pandas Python library to do all the tasks:
Retrieve the data: since you are using SQLAlchemy already, you can simply query the database for only the data you need (flat, without any CROSSTAB/PIVOT).
Transform: put it into a pandas.DataFrame. For example, like this:
import pandas as pd
query = session.query(FixedDay...)
df = pd.read_sql(query.statement, query.session.bind)
Pivot: call pd.crosstab(...) to create a pivot in memory; note that crosstab is a top-level pandas function, not a DataFrame method, which is why data_frame.crosstab fails. See pd.crosstab for more information.
Export: save it to Excel/CSV using DataFrame.to_excel or DataFrame.to_csv.
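Putting the steps together for the Name/Date/Rank example from the update, a sketch (session is the SQLAlchemy session from the question, and FixedDay stands in for the mapped class):
import pandas as pd

query = session.query(FixedDay)
df = pd.read_sql(query.statement, query.session.bind)

# crosstab is a top-level function: pass the actual columns as Series
pivot = pd.crosstab(index=df['Name'], columns=df['Date'],
                    values=df['Rank'], aggfunc='min')

# Equivalent call off the dataframe itself:
# pivot = df.pivot_table(index='Name', columns='Date', values='Rank', aggfunc='min')

pivot.to_excel('crosstab.xlsx')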
