SAS macros into Python Pandas

I am new to Python and I am trying to write code that performs data manipulations on separate dataframes. I want to automate the process: I pass an existing dataframe to a function, and the manipulations happen inside the function, step by step, for each dataframe separately. In SAS I can do this with macros, but I have been unable to find an equivalent in Python.
SAS code:
%macro forecast(scenario);
data work.&scenario;
set work.&scenario;
rownum = _n_;
run;
%mend;
%forecast(base);
%forecast(severe);
Here the input is two datasets, base and severe. The output will be two datasets, base and severe, each with the relevant rownum column added.
If I try to do this in Python, I can do it for a single dataframe like below:
Python code:
df_base['rownum'] = np.arange(len(df_base))
It adds a rownum column to my dataframe.
Now I want to apply the same manipulation to the other existing dataframe, df_severe, using a function (or some other technique) that works like a SAS macro. I have more manipulations to do within the function, so I would like to avoid repeating them for each dataframe individually.
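A minimal sketch of the function-based equivalent, assuming numpy is imported as np and both dataframes already exist:

import numpy as np

def forecast(df):
    # work on a copy so the caller's dataframe is not modified in place
    df = df.copy()
    df['rownum'] = np.arange(len(df))
    # ... further manipulations go here ...
    return df

df_base = forecast(df_base)
df_severe = forecast(df_severe)

Unlike the SAS macro, the function receives the dataframe object itself rather than its name, and the result is assigned back explicitly.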

Related

Converting SAS to Python

I'm converting a program of SAS code into a Python equivalent. One section I'm struggling with is how to convert a SAS macro program when the variables used within the macro are used to create a dataset. For example:
%macro program(type);
data portfolio_&type.;
set portfolio;
run;
%mend;
I basically want to create a dataframe equivalent of portfolio_&type. Any idea how I proceed with this?
Edit: I don't think I originally gave enough detail.
Say my data has a column called type, which takes either the value 'tree' or 'bush'. I want to split my data in two, run the same functions on both halves, and create separate output tables for each. In SAS this is quite simple: I write macros, which are effectively functions that take my arguments and drop them into the code, producing uniquely named datasets.
%macro program(type);
data portfolio_&type.;
set portfolio (where=(type="&type."));
run;
proc freq data=portfolio_&type.;
tables var1 / out=summary_&type.;
run;
%mend;
%program(Tree);
%program(bush);
The & allows me to drop my text into the dataset name, but I can't do this with a def function statement in Python, because I can't drop the argument into my dataframe name.
Your SAS macro dynamically names the SAS dataset fed into PROC FREQ. pandas dataframes cannot be named dynamically; you would need to use the resolved macro variable values, Portfolio_Tree and Portfolio_bush, as the dataframe names. Note that SAS dataset names are case-insensitive while pandas dataframe names are case-sensitive.
For PROC FREQ you would then have:
Portfolio_Tree['var1'].value_counts()
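More generally, the idiomatic substitute for dynamically named datasets is a dict of dataframes keyed by the macro argument. A minimal sketch, assuming portfolio is a pandas DataFrame with a type column:

portfolios = {}
summaries = {}
for t in ['Tree', 'bush']:
    # equivalent of: data portfolio_&type.; set portfolio (where=(type="&type."));
    portfolios[t] = portfolio[portfolio['type'] == t]
    # equivalent of: proc freq data=portfolio_&type.; tables var1 / out=summary_&type.;
    summaries[t] = portfolios[t]['var1'].value_counts()

summaries['Tree'] then plays the role of the SAS dataset summary_Tree.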

pandas ExcelWriter merge but keep value that's already there

I have a few small dataframes that I'm outputting to Excel on one sheet. To make them fit better, I need to merge some cells in one table, but to write this in XlsxWriter I need to specify the data parameter. I want to keep the data that was already written to the left cell by the to_excel() call. Is there a way to do this without having to specify the data parameter, or do I need to look up the value in the dataframe to put in there?
For example:
df.to_excel(writer, 'sheet') gives output similar to the following:
Then I want to merge across C:D for this table without having to specify what data should be there (because it is already in column C), using something like:
worksheet.merge_range('C1:D1', cell_format=fmat), etc.
to get below:
Is this possible? Or will I need to lookup the values in the dataframe?
You will need to look up the data from the dataframe. There is no way in XlsxWriter to write formatting on top of existing data: the data and the formatting need to be written at the same time (apart from conditional formatting, which can't be used for merging anyway).
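A minimal sketch of the lookup-then-merge approach, assuming a dataframe df written with the xlsxwriter engine; the exact cell offsets are hypothetical and depend on how to_excel laid the table out:

import pandas as pd

writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='sheet')
workbook = writer.book
worksheet = writer.sheets['sheet']
fmat = workbook.add_format({'align': 'center'})

# look up the value to_excel already wrote, then rewrite it as a merged range
value = df.iloc[0, 1]  # hypothetical position; adjust for index/header offsets
worksheet.merge_range('C1:D1', value, fmat)

writer.close()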

Python Searching Excel sheet for a row based on keyword and returning the row

Disclosure: I am not an expert in this, nor do I have much practice with it. I have, however, spent several hours attempting to figure this out on my own.
I have an Excel sheet with thousands of object serial numbers and the addresses where the objects are located. I am attempting to write a script that searches columns 'A' and 'B' for either the serial number or the address and returns the whole row, with the additional information about that particular object. I am using Python because I am trying to integrate this with a preexisting script I have for accessing other sources. Using pandas, I have been able to load the .xls sheet and return all of the values in the spreadsheet, but I cannot figure out how to search it in such a way as to return only the row I am interested in.
excel sheet example
Here is the code that I have:
import pandas as pd
data = pd.read_excel('my/path/to.xls')
print(data.head())
I can work with the print function to print various parts of the sheet; however, whenever I try to add a search function I get lost, and my online research has been less than helpful. A) Is there a better Python module to be using for this task? Or B) How do I implement a search function that returns a row of data as a variable to be displayed and/or used in other parts of the program?
Something like this should work; pandas works with rows, columns and indices, and we can take advantage of all three to get your utility working.
import pandas as pd
serial_number = input("What is your serial number: ")
address = input("What is your address: ")
# read in dataframe
data = pd.read_excel('my/path/to.xls')
# filter the dataframe with the str.contains method
filter_df = data.loc[(data['A'].str.contains(serial_number))
                     | (data['B'].str.contains(address))]
print(filter_df)
If you have a list of items, e.g.
serial_nums = [0,5,9,11]
you can use isin, which filters your dataframe based on a list:
data.loc[data['A'].isin(serial_nums)]
Hopefully this gets you started.

Is there any way to create a root-level column object in a Python dataframe, like the struct type in Scala?

I need to write some data to a dataset, and I need one column as the root with three more columns nested inside it. How can I do that in Python? I have working code for Scala:
myDf = myDf.withColumn(rootColumn, struct(myDf("column1"), myDf("column2"), myDf("column3")))
I have tried pd.MultiIndex.from_product; I do get rootColumn at the top, but that does not work for this situation. I need the exact result of the Spark code above.
The schema has three columns: one is the root, and inside that root column there are multiple columns. I need to make changes so I can write the dataframe to the dataset based on that schema.
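In plain pandas, the closest analogue is a column MultiIndex, which nests existing columns under a single root label; note this is a pandas construct, not a Spark struct, so it may not satisfy a Spark-style schema on its own. A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'column1': [1, 2],
                   'column2': [3, 4],
                   'column3': [5, 6]})

# nest all existing columns under one root-level label
df.columns = pd.MultiIndex.from_product([['rootColumn'], df.columns])

print(df['rootColumn'])  # the three original columns, grouped under the root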

Creating a Cross Tab Query in SQLAlchemy

I did some reading on Google and in the SQLAlchemy documentation but could not find any kind of built-in functionality that could take a standard SQL-formatted table and transform it into a cross tab query like Microsoft Access does.
In the past, when using Excel and Microsoft Access, I created "cross tab" queries. Below is the SQL code from an example:
TRANSFORM Min([Fixed Day-19_Month-8_142040].VoltageAPhase) AS MinOfVoltageAPhase
SELECT [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
FROM [Fixed Day-19_Month-8_142040]
GROUP BY [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
PIVOT [Fixed Day-19_Month-8_142040].Date;
I am very unskilled when it comes to SQL, and the only way I was able to write this was by generating it in Access.
My question is: since SQLAlchemy Python code is really just a nice way of calling or generating SQL using Python functions/methods, is there a way I could use SQLAlchemy to call a custom query that generates the SQL code (in the block above) to make a cross tab query? Obviously, I would have to change some of the SQL to shoehorn in the correct fields and names, but the keywords should be the same, right?
The other problem is that, in addition to returning the objects for each entry in the table, I would need the field names. I think this is called "metadata"? The end goal is that once I had that information, I would output it to Excel or CSV using another package.
UPDATED
Okay, so Van's suggestion to use pandas is, I think, the way to go. I'm currently figuring out how to create the cross tab:
def OnCSVfile(self, event):
    query = session.query(Exception).filter_by(company=self.company)
    # get the data frame in pandas
    data_frame = pandas.read_sql(query.statement, query.session.bind)
    pivot = data_frame.crosstab()
So I have been reading the pandas link you provided and have a question about the parameters.
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True)
Since I'm calling crosstab off the dataframe object, I assume there must be some built-in way the dataframe recognizes column and row names. For index, would I pass in a list of strings specifying which fields I want tabulated in rows? For columns, would I pass in a list of strings specifying which field I want along the columns? From what I know about cross tab queries, there should be only one specification field for the column, right? For values, I want the minimum, so I would have to pass some parameter to return the minimum value. I am currently searching for an answer.
So say I have the following fields in my flat data frame (my original SQL query):
Name, Date and Rank
And I want to pivot the data as follows:
Name = Row of Crosstab
Date = Column of Crosstab
Rank = Min Value of Crosstab
Would the function call be something like:
data_frame.crosstab(['Name'], ['Date'], values=['Rank'], aggfunc=min)
I tried this code below:
query = session.query(Exception)
data_frame = pandas.read_sql(query.statement, query.session.bind)
row_list = pandas.Series(['meter_form'])
col_list = pandas.Series(['company'])
print(row_list)
pivot = data_frame.crosstab(row_list, col_list)
But I get an error saying data_frame has no attribute crosstab.
I guess this might be too much new information for you at once. Nonetheless, I would approach it completely differently: I would use the pandas library to do all of the tasks.
Retrieve the data: since you are already using SQLAlchemy, you can simply query the database for only the data you need (flat, without any CROSSTAB/PIVOT).
Transform: put it into a pandas.DataFrame. For example, like this:
import pandas as pd
query = session.query(FixedDay...)
df = pd.read_sql(query.statement, query.session.bind)
Pivot: call pivot = pd.crosstab(...) to create a pivot in memory; note that crosstab is a top-level pandas function, not a DataFrame method, which is why data_frame.crosstab(...) raises the attribute error. See pd.crosstab for more information.
Export: save it to Excel/CSV using DataFrame.to_excel.
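Putting the pieces together for the Name/Date/Rank example above, a minimal sketch, assuming the SQLAlchemy session and Exception model from the question, and that the query returns Name, Date and Rank columns:

import pandas as pd

query = session.query(Exception)
df = pd.read_sql(query.statement, query.session.bind)

# crosstab is a module-level pandas function, so pass the columns explicitly
pivot = pd.crosstab(index=df['Name'],
                    columns=df['Date'],
                    values=df['Rank'],
                    aggfunc='min')

pivot.to_excel('crosstab.xlsx')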
