Convert dynamic dataframe with randomly generated fake data to static dataframe - python

I'm trying to add a column of fake data into a dataframe. It doesn't matter what the contents of the dataframe are; I just want to add a column of randomly generated fake data, e.g., randomly generated first names with one name per row. Here is some dummy data to play with but, I repeat, the contents of the dataframe do not matter:
from faker import Faker

faker = Faker("en_GB")
contact = [faker.profile() for i in range(0, 100)]
contact = spark.createDataFrame(contact)  # spark is the active SparkSession
I'm trying to create a class with functions to do this for different columns as so:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

class anonymise:

    @staticmethod
    def FstName():
        def FstName_values():
            faker = Faker("en_GB")
            return faker.first_name()
        FstName_udf = udf(FstName_values, StringType())
        return FstName_udf()
The class above has one function as an example but the actual class has multiple functions of exactly the same template, just for different columns, e.g., LastName.
Then, I'm adding in the new columns as so:
contact = contact \
    .withColumn("FstName", anonymise.FstName())
I'm using this process to replace real data with realistic-looking, fake, randomly generated data.
This appears to work fine and runs quickly. However, I noticed that every time I display the new dataframe, it generates an entirely new column of values:
First try and second try immediately after the first: (screenshots omitted; the two displays show different generated values)
This means that the dataframe isn't one static dataframe holding fixed data; it regenerates the column on every subsequent command. This is causing me issues further down the line when I try to write the data to an external file.
I would just like it to generate the column once, with static data that is easily callable. I don't even need it to regenerate the same data; the generation process should happen only once.
I've tried copying to a pandas dataframe, but the dataframe is too large for this to work (1.3+ million rows), and I can't seem to write a smaller version to an external file anyway.
Any help on this issue appreciated!
Many thanks,
Carolina

Since you are using Spark, the computation is distributed across multiple nodes and the dataframe is re-evaluated lazily on each action, which is why the column keeps being regenerated. What you can try is adding a contact.persist() after doing the anonymization.
You can read more about persist in the Spark documentation.
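A minimal sketch of where that call could go, assuming the column from the question (persist() with its default storage level; the count() is just an action to force evaluation so the generated values are materialised once):

contact = contact.withColumn("FstName", anonymise.FstName())
contact.persist()  # later actions reuse the cached rows instead of regenerating them
contact.count()    # any action triggers the evaluation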

So it was a pretty simple fix in the end...
By putting faker = Faker("en_GB") inside the function where it was, I was creating a new instance of Faker for every row. I simply had to remove it from within the function and create the instance outside the class. So now, although it still generates the data every time a command is called, it does so very quickly even for large dataframes, and I haven't run into any issues with subsequent commands.
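A sketch of the rearranged code under that description, assuming the same class and column names as above (the only change from the original class is where the Faker instance is created):

from faker import Faker
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

faker = Faker("en_GB")  # created once, outside the class

class anonymise:

    @staticmethod
    def FstName():
        def FstName_values():
            return faker.first_name()  # reuses the single Faker instance
        return udf(FstName_values, StringType())()

contact = contact.withColumn("FstName", anonymise.FstName())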

Related

SAS macros into Python Pandas

I am new to Python and trying to write code where I will be doing data manipulations on separate dataframes. I want to automate the process so that I can pass an existing dataframe name to a function, and the manipulations will happen within the function on each dataframe, step by step, separately. In SAS I can create macros to do this task, but I am unable to find a solution in Python.
SAS code:
%macro forecast(scenario);
    data work.&scenario;
        set work.&scenario;
        rownum = _n_;
    run;
%mend;

%forecast(base);
%forecast(severe);
Here the input is two datasets base and severe. The output will be two datasets : base and severe with the relevant rownum column added for both.
If I try to do this in Python, I can do it for a single dataframe like below:
Python code:
import numpy as np

df_base['rownum'] = np.arange(len(df_base))
This adds a rownum column to that dataframe.
Now I want to do the same manipulation for the other existing dataframe, df_severe, as well, using a function (or some other technique) that works similarly to SAS macros. I have more manipulations to do within the function, so I would like to avoid doing them individually for each dataframe.
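A minimal sketch of one way to mirror the macro with a plain function (df_base and df_severe are the dataframes named above; the function name and the choice to return the modified frame are illustrative assumptions):

import numpy as np
import pandas as pd

def forecast(df):
    # equivalent of the SAS macro body: add a rownum column
    df['rownum'] = np.arange(len(df))
    # any further manipulations on df can go here, step by step
    return df

df_base = forecast(df_base)
df_severe = forecast(df_severe)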

Python Searching Excel sheet for a row based on keyword and returning the row

Disclosure: I am not an expert in this, nor do I have much practice with this. I have however, spent several hours attempting to figure this out on my own.
I have an Excel sheet with thousands of object serial numbers and the addresses where they are located. I am attempting to write a script that will search columns 'A' and 'B' for either the serial number or the address and return the whole row with the additional information on the particular object. I am using Python because I am trying to integrate this with a pre-existing script I have for accessing other sources. Using pandas, I have been able to load the .xls sheet and return all of its values, but I cannot figure out how to search it in such a way as to return only the row pertaining to the object I am looking for.
(Excel sheet example: screenshot omitted)
Here is the code that I have:
import pandas as pd
data = pd.read_excel('my/path/to.xls')
print(data.head())
I can work with the print function to print various parts of the sheet, but whenever I try to add a search function I get lost, and my online research has been less than helpful. (A) Is there a better Python module for this task? Or (B) how do I implement a search function that returns a row of data as a variable to be displayed and/or used in other parts of the program?
Something like this should work. pandas works with rows, columns and indices, and we can take advantage of all three to get your utility working:
import pandas as pd

serial_number = input("What is your serial number: ")
address = input("What is your address: ")

# read in the dataframe
data = pd.read_excel('my/path/to.xls')

# filter the dataframe with the str.contains method
filter_df = data.loc[data['A'].str.contains(serial_number)
                     | data['B'].str.contains(address)]
print(filter_df)
If you have a list of items, e.g.
serial_nums = [0, 5, 9, 11]
you can use isin, which filters your dataframe based on a list:
data.loc[data['A'].isin(serial_nums)]
Hopefully this gets you started.

Fastest approach to read and process 10k Excel cells in Python/Pandas?

I want to read and process realtime DDE data from a trading platform, using Excel as a 'bridge' between the trading platform (which sends out data) and Python, which processes it and prints it back to Excel as a front-end 'gui'. SPEED IS CRUCIAL. I need to:
read 6-10 thousand cells in Excel as fast as possible
sum ticks that arrive at the same time (same h:m:sec)
check if the DataFrame contains any value from a static array (e.g. large quantities)
write output to the same Excel file (different sheet), used as the front-end output 'gui'.
I imported the 'xlwings' library and use it to read data from one sheet, calculate the needed values in Python and then print the results to another sheet of the same file. I want to keep Excel open and visible so it can function as an 'output dashboard'. This function is run in an infinite loop reading realtime stock prices.
import xlwings as xw
import numpy as np
import pandas as pd
...
...
tickdf = pd.DataFrame(xw.Book('datafile.xlsx').sheets['raw_data'].range((1, 5), (1500, 8)).value)
tickdf.columns = ['time', 'price', 'all-tick', 'symb']
tickdf = tickdf[['time', 'symb', 'price', 'all-tick']]
# read data and fill a pandas.df with values, then re-order columns

try:
    global ttt     # this is used as a temporary global pandas.df
    global tttout  # this is used as an output global pandas.df copy
    # they are global as they can be zeroed with another function

    ttt = ttt.append(tickdf, ignore_index=False)
    # at each loop, newly read ticks are added as rows to the end of the ttt global df
    ttt.drop_duplicates(inplace=True)

    tttout = ttt.copy()
    # to prevent outputting incomplete data, for extra safety, I use a copy of ttt
    # as the DF to be printed out on the Excel file

    tttout = tttout.groupby(['time', 'symb'], as_index=False).agg({'all-tick': 'sum', 'price': 'first'})
    tttout = tttout.set_index('time')
    # sort it by time/name and set time as index

    tttout = tttout.loc[tttout['all-tick'].isin(target_ticker)]
    # find matching values comparing an array of a dozen values

    tttout = tttout.sort_values(by=['time', 'symb'], ascending=[False, True])
    xw.Book(file_path).sheets['OUTPUT'].range('B2').value = tttout
except Exception:
    # (exception handling not shown in the original snippet)
    raise
I run this on an i5 @ 4.2 GHz, and this function, together with some other small pieces of code, runs in 500-600 ms per loop, which is fairly good (but not fantastic!). I would like to know if there is a better approach and which step(s) might be the bottlenecks.
The code reads 1500 rows, one per listed stock in alphabetical order; each is the 'last tick' passed on the market for that specific stock, and it looks like this:
'10:00:04 | ABC | 10.33 | 50000'
'09:45:20 | XYZ | 5.260 | 200 '
'....
being time, stock symbol, price, quantity.
I want to investigate whether there are some specific quantities being traded on the market, such as 1.000.000 (as it represents a huge order), or maybe just '1', which is often used as a market 'heartbeat', a sort of fake order.
My approach is to use pandas/xlwings and the 'isin' method. Is there a more efficient approach that might improve my script's performance?
It would be faster to use a UDF written with PyXLL as that would avoid going via COM and an external process. You would have a formula in Excel with the input set to your range of data, and that would be called each time the input data updated. This would avoid the need to keep polling the data in an infinite loop, and should be much faster than running Python outside of Excel.
See https://www.pyxll.com/docs/introduction.html if you're not already familiar with PyXLL.
PyXLL could convert the input range to a pandas DataFrame for you (see https://www.pyxll.com/docs/userguide/pandas.html), but that might not be the fastest way to do it.
The quickest way to transfer data from Excel to Python is via a floating point numpy array using the "numpy_array" type in PyXLL (see https://www.pyxll.com/docs/userguide/udfs/argtypes.html#numpy-array-types).
As speed is a concern, maybe you could split the data up and have some functions that take mostly static data (e.g. row and column headers), other functions that take variable data as numpy_arrays where possible (or other types where not), and then a final function to combine them all.
PyXLL can return Python objects to Excel as object handles. If you need to return intermediate results then it is generally faster to do that instead of expanding the whole dataset to an Excel range.
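As a rough illustration only (the function name and the exact signature string below are assumptions based on the linked PyXLL docs, not code from this answer), a minimal UDF of this kind might look like:

from pyxll import xl_func
import numpy as np

@xl_func("numpy_array ticks: float")
def sum_ticks(ticks):
    # recalculated by Excel whenever the input range changes,
    # so no polling loop is needed on the Python side
    return float(np.nansum(ticks))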
@Tony Roberts, thank you.
I have one doubt and one observation.
DOUBT: the data gets updated very fast, every 50-100 ms. Would it be feasible to use a UDF that is called that often? Would it stay lean? I have little experience with this.
OBSERVATION: PyXLL is for sure extremely powerful, well done and well maintained, but IMHO, at $25/month it goes beyond the pure nature of the free Python language. I do understand, though, that quality has a price.

Swapping dataframe column data without changing the index for the table

While compiling a pandas table to plot certain activity on a tool, I encountered a rare error in the data that creates an extra 2 columns for certain entries. This means that one of my computed columns goes into the table 2 cells further along than the others and kills the plot.
I was hoping to find a way to pull the contents of a single cell in a row and swap it into the other cell beside it, which contains irrelevant information in the error case, but which is used for the plot of all the other pd data.
I've tried a couple of different ways to swap the data around but keep hitting errors.
My attempts to fix it include:
for rows in df['server']:
    if '%USERID' in line:
        df['server'] = df[7]  # both versions of this and below
        df['server'].replace(df['server'], df[7])
    else:
        pass

if '%USERID' in df['server']:  # Attempt to fix missing server name
    df['server'] = df[7]
else:
    pass

if '%USERID' in df['server']:
    return row['7'], row['server']
else:
    pass
I'd like the data from column 7 to be replicated in 'server', but only in the error case, i.e. where the cell in 'server' contains a string starting with '%USERID'.
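For what it's worth, a hedged sketch of one way to do that conditional copy with a boolean mask (the column names follow the question; whether the extra column is labelled 7 or '7' in the real frame is an assumption):

# mark the error rows: 'server' holds a string starting with '%USERID'
mask = df['server'].astype(str).str.startswith('%USERID')
# copy the value from column 7 into 'server' only for those rows
df.loc[mask, 'server'] = df.loc[mask, 7]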
Turns out I was over-thinking this one. I took a step back, reworked the code a bit and solved it.
Rather than trying to write one-size-fits-all code for all the data, I built separate lists for the general data and the 2 exceptions I found by writing a nested loop, and created 3 dataframes from them. These were easy enough to manipulate individually and finally concatenate together. All working fine now.
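A purely illustrative sketch of that pattern (the list names and the classification helpers are assumptions, since the answer does not show its code):

import pandas as pd

general_rows, exception1_rows, exception2_rows = [], [], []
for row in raw_rows:                 # raw_rows: the parsed input records (assumed)
    if is_exception1(row):           # hypothetical helpers that detect the two error cases
        exception1_rows.append(row)
    elif is_exception2(row):
        exception2_rows.append(row)
    else:
        general_rows.append(row)

# build the three frames, fix each up individually, then combine them
df_all = pd.concat([pd.DataFrame(general_rows),
                    pd.DataFrame(exception1_rows),
                    pd.DataFrame(exception2_rows)],
                   ignore_index=True)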

Running update of Pandas dataframe

I'd like to use a DataFrame to manage data from many trials of an experiment I'm controlling with Python code. Ideally I will have one master dataframe with a row for each trial that lives in the main function namespace, and then a separate dict (or dataframe) returned from the function that I call to execute the important bits of code for each trial.
What is the best way to do a running update of the master dataframe with this returned set of data? So far I've come up with:
df = df.append(df_trial, ignore_index=True)
or
df = pd.concat([df, df_trial])
But neither seems ideal (and both take a relatively long time according to %timeit). Is there a more Pandonic way?
You should build a list of the pieces and concatenate them all in one shot at the end.
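A short sketch of that pattern, assuming df_trial is the per-trial result named in the question (run_trial and n_trials are hypothetical stand-ins for the trial function and the number of trials):

import pandas as pd

pieces = []
for trial in range(n_trials):
    df_trial = run_trial(trial)   # returns one trial's worth of results
    pieces.append(df_trial)

# a single concatenation at the end is much cheaper than appending inside the loop
df = pd.concat(pieces, ignore_index=True)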
