I'm converting a program of SAS code into a Python equivalent. One section I'm struggling with is how to convert a SAS macro program when the variables used within the macro are used to create a dataset. For example:
%macro program(type);
data portfolio_&type.;
set portfolio;
run;
I basically want to create a dataframe equivalent of portfolio_&type. Any idea how I proceed with this?
Edit: I don't think I gave enough detail originally.
Say my data has a column called type that takes a value of either 'tree' or 'bush'. I want to split my data in two, run the same processing on both, and create separate output tables for both. In SAS this is quite simple: I write macros, which are effectively functions that take my arguments and drop them into the code, creating unique datasets.
%macro program(type);
data portfolio_&type.;
set portfolio (where=(type="&type."));
run;
Proc freq data=Portfolio_&type.;
Tables var1/out=summary_&type.;
Run;
%mend;
%program(Tree);
%program(bush);
The & allows me to drop my text into the dataset name, but I can't do this with a def function in Python because I can't drop the argument into my dataframe name.
Your SAS macro wants to dynamically name the SAS dataset fed into PROC FREQ. pandas DataFrames cannot be named dynamically. You need to use the resolved macro variable values, Portfolio_Tree and Portfolio_bush, as the DataFrame names. Note that SAS dataset names are case-insensitive while pandas DataFrame names are case-sensitive.
For PROC FREQ you would then have:
Portfolio_Tree['var1'].value_counts()
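A more idiomatic Python pattern is to keep the per-type frames in a dictionary keyed by the type value instead of inventing variable names dynamically. A minimal sketch, assuming portfolio already exists as a pandas DataFrame with columns type and var1 (value_counts is only an approximation of PROC FREQ's one-way table):

import pandas as pd

portfolios = {}
summaries = {}
for type_value in ['tree', 'bush']:  # the actual, case-sensitive values in the type column
    # equivalent of: data portfolio_&type.; set portfolio (where=(type="&type.")); run;
    subset = portfolio[portfolio['type'] == type_value]
    portfolios[type_value] = subset
    # rough equivalent of: proc freq; tables var1 / out=summary_&type.;
    summaries[type_value] = subset['var1'].value_counts().reset_index()

# access them as portfolios['tree'], summaries['bush'], and so on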
Related
I am new to Python and trying to write code where I will be doing data manipulations on separate dataframes. I want to automate the process so that I can pass an existing dataframe to a function and have the manipulations happen inside the function, step by step, for each dataframe separately. In SAS I can create macros to do this task, but I am unable to find a solution in Python.
SAS code:
%macro forecast(scenario);
data work.&scenario;
set work.&scenario;
rownum = _n_;
run;
%mend;
%forecast(base);
%forecast(severe);
Here the input is two datasets, base and severe. The output will be two datasets, base and severe, each with the relevant rownum column added.
If I try to do this in Python, I can do it for a single dataframe like below:
Python code:
import numpy as np

df_base['rownum'] = np.arange(len(df_base))
It will add a column rownum for my dataframe.
Now I want to do the same manipulation for the other existing dataframe, df_severe, using a function (or some other technique) that works similarly to SAS macros. I have more manipulations to do within the function, so I would like to avoid doing everything individually for each dataframe.
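One way to mirror the macro is a plain function that takes a DataFrame and returns it with the extra column, applied once per scenario. A minimal sketch, assuming the frames are already loaded as df_base and df_severe (the dictionary at the end is just one convenient way to manage many scenarios):

import numpy as np

def forecast(df):
    # equivalent of: rownum = _n_; in the SAS data step
    df = df.copy()
    df['rownum'] = np.arange(len(df))
    # ...any further manipulations shared by all scenarios go here...
    return df

df_base = forecast(df_base)
df_severe = forecast(df_severe)

# or, if there are many scenarios, keep them in a dict keyed by scenario name:
# scenarios = {name: forecast(df) for name, df in scenarios.items()}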
I'm trying to add a column of fake data to a dataframe. It doesn't matter what the contents of the dataframe are; I just want to add a column of randomly generated fake data, e.g. randomly generated first names with one name per line. Here is some dummy data to play with, but I repeat, the contents of the dataframe do not matter:
from faker import Faker
faker = Faker("en_GB")
contact = [faker.profile() for i in range(0, 100)]
contact = spark.createDataFrame(contact)
I'm trying to create a class with functions to do this for different columns as so:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

class anonymise:
    @staticmethod
    def FstName():
        def FstName_values():
            faker = Faker("en_GB")
            return faker.first_name()
        FstName_udf = udf(FstName_values, StringType())
        return FstName_udf()
The class above has one function as an example but the actual class has multiple functions of exactly the same template, just for different columns, e.g., LastName.
Then, I'm adding in the new columns as so:
contact = contact \
.withColumn("FstName", anonymise.FstName())
I'm using this process to replace real data with realistic-looking, fake, randomly generated data.
This appears to work fine and runs quickly. However, I noticed that every time I display the new dataframe, it tries to generate an entirely new column:
(Screenshots omitted: displaying the dataframe twice in a row shows two different sets of generated names.)
This means that the dataframe isn't just one static dataframe with data and it will try to generate a new column for every subsequent command. This is causing me issues further down the line when I try to write the data to an external file.
I would just like it to generate the column once with some static data that is easily callable. I don't even want it to regenerate the same data. The generation process should happen once.
I've tried copying to a pandas dataframe but the dataframe is too large for this to work (1.3+ million rows) and I can't seem to write a smaller version to an external file anyway.
Any help on this issue appreciated!
Many thanks,
Carolina
Since you are using Spark, the computation is distributed across multiple nodes. What you can try is adding a contact.persist() after doing the anonymisation.
You can read more about persist HERE.
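For example, a minimal sketch of where the persist call would go (the trailing count() is just one way to force evaluation once; I have not benchmarked this on your data):

contact = contact.withColumn("FstName", anonymise.FstName())
# materialise the generated column so later actions reuse the cached rows
# instead of re-running the UDF on every display or write
contact = contact.persist()
contact.count()  # trigger evaluation once so the cache is populated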
So it was a pretty simple fix in the end...
By putting faker = Faker("en_GB") inside the function where it was, I was generating an instance of faker for every row. I simply had to remove it from within the function and generate the instance outside the class. So now, although it does generate the data every time a command is called, it does so very quickly even for large dataframes and I haven't run into any issues for any subsequent commands.
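For anyone hitting the same issue, the change described above looks roughly like this (same class shape as in the question, with the Faker instance created once at module level):

from faker import Faker
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

faker = Faker("en_GB")  # created once, outside the class, not per row

class anonymise:
    @staticmethod
    def FstName():
        def FstName_values():
            return faker.first_name()
        FstName_udf = udf(FstName_values, StringType())
        return FstName_udf()

contact = contact.withColumn("FstName", anonymise.FstName())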
I have a dataset with a lot of fields, so I don't want to load all of it into a pd.DataFrame, but just the basic ones.
Sometimes I would like to do some filtering upon loading, and I would like to apply the filter via the query or eval methods, which means that I need a query string of the form "PROBABILITY > 10 and DISTANCE <= 50", but these columns then need to be loaded into the dataframe.
Is it possible to extract the column names from the query string in order to load them from the dataset?
I know some magic using regex is possible, but I'm sure that it would break sooner or later, as the conditions get complicated.
So, I'm asking if there is a native pandas way to extract the column names from the query string.
I think you can use the usecols argument when you load your dataframe. I use it when I load a CSV; I don't know whether it is possible when you load from SQL or another format.

columns_to_use = ['Column1', 'Column3']
pd.read_csv(..., usecols=columns_to_use)
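Putting the two steps together, a minimal sketch (the file name and column list are placeholders, and the columns are still listed by hand rather than parsed out of the query string):

import pandas as pd

cols_to_use = ['PROBABILITY', 'DISTANCE']  # every column referenced by the filter
df = pd.read_csv('data.csv', usecols=cols_to_use)
df = df.query('PROBABILITY > 10 and DISTANCE <= 50')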
Thank you
I want to read and process realtime DDE data from a trading platform, using Excel as a 'bridge' between the trading platform (which sends out data) and Python, which processes it and prints it back to Excel as a front-end 'gui'. SPEED IS CRUCIAL. I need to:
read 6-10 thousand cells in Excel as fast as possible
sum ticks passed at same time (same h:m:sec)
check if the DataFrame contains any value from a static array (e.g. large quantities)
write output on the same excel file (different sheet), used as front-end output 'gui'.
I imported the 'xlwings' library and use it to read data from one sheet, calculate the needed values in Python, and then print the results to another sheet of the same file. I want to keep Excel open and visible so it can function as an 'output dashboard'. This function runs in an infinite loop, reading realtime stock prices.
import xlwings as xw
import numpy as np
import pandas as pd
...
...
tickdf = pd.DataFrame(xw.Book('datafile.xlsx').sheets['raw_data'].range((1, 5), (1500, 8)).value)
tickdf.columns = ['time', 'price', 'all-tick','symb']
tickdf = tickdf[['time','symb', 'price', 'all-tick']]
#read data and fill a pandas.df with values, then re-order columns
try:
    global ttt     # temporary global pandas DataFrame
    global tttout  # output global pandas DataFrame copy
    # they are global so they can be zeroed by another function
    ttt = ttt.append(tickdf, ignore_index=False)
    # at each loop, the newly read ticks are appended as rows to the end of the global ttt
    ttt.drop_duplicates(inplace=True)
    tttout = ttt.copy()
    # to avoid outputting incomplete data, a copy of ttt is used as the DataFrame
    # that gets printed to the Excel file, as an extra safety step
    tttout = tttout.groupby(['time', 'symb'], as_index=False).agg({'all-tick': 'sum', 'price': 'first'})
    tttout = tttout.set_index('time')
    # group by time/symbol and set time as the index
    tttout = tttout.loc[tttout['all-tick'].isin(target_ticker)]
    # keep only the rows whose quantity matches one of the dozen target values
    tttout = tttout.sort_values(by=['time', 'symb'], ascending=[False, True])
    xw.Book(file_path).sheets['OUTPUT'].range('B2').value = tttout
    # (the matching except clause is omitted from this excerpt)
I run this on an i5 @ 4.2 GHz, and this function, together with some other small pieces of code, runs in 500-600 ms per loop, which is fairly good (but not fantastic!). I would like to know if there is a better approach and which step(s) might be the bottlenecks.
The code reads 1500 rows, one per listed stock in alphabetical order; each of them is the 'last tick' passed on the market for that specific stock, and it looks like this:
'10:00:04 | ABC | 10.33 | 50000'
'09:45:20 | XYZ | 5.260 | 200 '
'....
being time, stock symbol, price, quantity.
I want to investigate whether some specific quantities are traded on the market, such as 1.000.000 (as it represents a huge order), or maybe just '1', which is often used as a market 'heartbeat', a sort of fake order.
My approach is to use pandas/xlwings and the 'isin' method. Is there a more efficient approach that might improve my script's performance?
It would be faster to use a UDF written with PyXLL as that would avoid going via COM and an external process. You would have a formula in Excel with the input set to your range of data, and that would be called each time the input data updated. This would avoid the need to keep polling the data in an infinite loop, and should be much faster than running Python outside of Excel.
See https://www.pyxll.com/docs/introduction.html if you're not already familiar with PyXLL.
PyXLL could convert the input range to a pandas DataFrame for you (see https://www.pyxll.com/docs/userguide/pandas.html), but that might not be the fastest way to do it.
The quickest way to transfer data from Excel to Python is via a floating point numpy array using the "numpy_array" type in PyXLL (see https://www.pyxll.com/docs/userguide/udfs/argtypes.html#numpy-array-types).
As speed is a concern, maybe you could split the data up and have some functions that take mostly static data (eg rows and column headers), and other functions that take variable data as numpy_arrays where possible or other types where not, and then a final function to combine them all.
PyXLL can return Python objects to Excel as object handles. If you need to return intermediate results then it is generally faster to do that instead of expanding the whole dataset to an Excel range.
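As a very rough illustration of the shape such a UDF could take (an untested sketch based on the PyXLL docs linked above; the signature string, function name and aggregation are assumptions, not working code from this answer):

from pyxll import xl_func

@xl_func("numpy_array<float> ticks: object")
def process_ticks(ticks):
    # 'ticks' arrives as a float numpy array straight from the Excel range,
    # avoiding a COM round-trip; do the heavy aggregation here and return
    # an object handle that another formula can expand or write out later.
    return ticks.sum(axis=0)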
@Tony Roberts, thank you.
I have one doubt and one observation.
DOUBT: The data gets updated very fast, every 50-100 ms. Would it be feasible to call a UDF that often? Would it stay lean? I have little experience with this.
OBSERVATION: PyXLL is certainly extremely powerful, well done, and well maintained, but IMHO, at $25/month it goes beyond the free nature of the Python language. I do understand, though, that quality has a price.
If my Python script is pivoting and I cannot predict how many columns will be output, can this be done with the U-SQL REDUCE statement?
e.g.
@pythonOutput =
    REDUCE @filteredBets ON [BetDetailID]
    PRODUCE [BetDetailID] string, EventID float
    USING new Extension.Python.Reducer(pyScript:@myScript);
There could be multiple columns, so I can't hard-code the names in the PRODUCE part.
Any ideas?
If you have a way to produce a SqlMap<string,string> value from within Python (I am not sure if that is supported right now, you can do it with a C# reducer :)), then you could use the map for the dynamic schema part.
If it is not supported in Python, please file a feature request at http://aka.ms/adlfeedback.
The only way right now is to serialize all the columns into a single column, either as a byte[] or string in your python script. SqlMap/SqlArray are not supported yet as output columns.
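On the Python side of the reducer, that serialization could look roughly like this, a minimal sketch assuming the usual usqlml_main entry point of the U-SQL Python extension (the pivot itself and the payload column name are placeholders; the REDUCE statement would then PRODUCE [BetDetailID] string, payload string):

import pandas as pd

def usqlml_main(df):
    # ...the real pivot that produces an unpredictable set of columns goes here...
    pivoted = df  # placeholder for the pivot logic

    out = pd.DataFrame()
    out['BetDetailID'] = pivoted['BetDetailID'].astype(str)
    # pack the remaining, variable columns into a single JSON string per row
    # so the U-SQL output schema stays fixed
    out['payload'] = pivoted.drop(columns=['BetDetailID']).apply(
        lambda row: row.to_json(), axis=1)
    return out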