Can I have two indexes within a single DataFrame? - python

I'm working with computer simulations and use a lot of variables that change from one simulation to another. I have to run short but numerous simulations (1000+), so keeping track of these is important.
Up until now, I was simply adding a new column with the data inside. So my data would look something like:
DataX, DataY, DataZ, variable1, variable2, variable3, ....
So I was basically making one column per variable.
Every time I needed new variables, I would add them as new columns.
Not efficient, but at least everything was within the same file, which was quite handy, to be honest.
My internship at my lab is about to end and my tutor asked me to clean up the code and make it so that anyone could keep using it.
The thing is, each of these variables also has two further sub-variables.
So I made a new function that gathers all those variables and builds a neat little dataframe, which looks like this:
Parameter      Value  Lambda  Mod
temperature    10     1       0
VarE           1.5    5       0.5
etc.
To make it easily accessible, I also set Parameter as the index, so I can use df_param.loc['VarE', 'Value'] for instance.
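For reference, here is a minimal sketch of that parameter frame (the values are made up; the names are the ones above):

import pandas as pd

df_param = pd.DataFrame(
    {'Parameter': ['temperature', 'VarE'],
     'Value': [10, 1.5],
     'Lambda': [1, 5],
     'Mod': [0, 0.5]}
).set_index('Parameter')

print(df_param.loc['VarE', 'Value'])  # -> 1.5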
However, because of that, the parameters are no longer in the same file as the data, which isn't handy.
Since they'll have to work with 1000+ data files when plotting, and filter everything, keeping the parameters separate from the data can lead to mistakes (which isn't possible at the moment, since everything is within the same file).
If I convert "Parameter" back into a column, I can easily do that:
Index, DataX, DataY, DataZ, Parameter, Value, Lambda, Mod
The issue I have (mostly from a practical standpoint) is that since Parameter isn't the index anymore, I can't do df_param.loc['VarE', 'Value'] anymore. I would need to know exactly which row 'VarE' is in and do df_param.loc[index, 'Value'], and with well over 15 parameters to pick from, that's a bit sketchy.
Basically, is there a way to have two indexes? Like
one index for DataX, DataY, DataZ (let's call it 'dt') and one for Value, Lambda and Mod, which would be 'Parameter'.
So two df within one, basically.
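To illustrate, a sketch of one way that could look: concatenating the data columns and the parameter columns side by side under a two-level column index (df_data here is a hypothetical stand-in for the simulation data; df_param is the parameter frame sketched above):

# hypothetical data frame holding the simulation columns
df_data = pd.DataFrame({'DataX': [0.1, 0.2], 'DataY': [1.0, 1.1], 'DataZ': [5, 6]})

# the dict keys become the outer column level: 'dt' vs 'Parameter'
combined = pd.concat({'dt': df_data, 'Parameter': df_param.reset_index()}, axis=1)

# the label-based lookup can be recovered from the combined frame:
params = combined['Parameter'].set_index('Parameter')
print(params.loc['VarE', 'Value'])  # -> 1.5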
Thank you in advance

Related

How can I .send_keys(v.S1) a variable?

I have only been programming (or coding, or whatever it is that I am doing) for a few days and could use a hand figuring out what I need to research to do what I am trying to do. I am working on a project for charity, so I don't really want to learn all kinds of things I will probably never use again; I was hoping someone could tell me how to do this, or point me in the direction of what I need to learn to make this happen.
So I have created a crawler that types text into a search bar ("Eggs", for example), takes me to the eggs results, and captures the data: brand, price, count, etc.
searchBox.send_keys(v.S1)
My problem is that I cannot figure out how to change v.S1 into v.S2 so I can automate going through many searches without having to copy and paste the code over and over again.
I am working with a main.py to call the functions, a functions.py to store the functions, and a variables.py to store the list of variables as S1, S2, S3, etc.
I have been able to get searchBox.send_keys(v.S1) to work as searchBox.send_keys(X) with a variable X = v.S1,
but for the life of me I cannot figure out how to add 1 to make X = v.S2 after the function completes the first search.
So far all the information I have needed has been under the same By.CLASS_NAME, but I have set those to variables as well, since I may need to change some of them on a per-case basis as I go.
Any help, or someone pointing me in the right direction, would be appreciated. Thanks.
To generate the character sequence v.S1, v.S2, v.S3, etc., you can use the range() function together with the argument-unpacking operator (*), and append each number to the constant string 'v.S':
elements = [*range(1,6,1)]
elements = ['v.S{0}'.format(element) for element in elements]
print(elements)
# prints -> ['v.S1', 'v.S2', 'v.S3', 'v.S4', 'v.S5']
I suggest adding all the S1, S2, ... variables to a list or tuple and then iterating over it with a simple for loop:
# varlist is a list containing all the variables
for i in varlist:
    searchBox.send_keys(i)
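If the values have to stay as S1, S2, ... attributes in variables.py, another option is to look them up by name with getattr(). A sketch, assuming variables.py is imported as v and defines S1 through S5:

import variables as v

for i in range(1, 6):
    term = getattr(v, 'S{0}'.format(i))  # fetches v.S1, v.S2, ... by name
    searchBox.send_keys(term)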

Splitting a DataFrame into filtered "sub-datasets"

So I have a DataFrame with several columns; some contain objects (strings) and some are numerical.
I'd like to create new dataframes which are "filtered" to the combination of the objects available.
To be clear, those are my object type columns:
Index(['OS', 'Device', 'Design', 'Language'], dtype='object')
["Design"] and ["Language"] have 3 options each.
I filtered ["OS"] and ["Device"] manually as I needed to match them.
However, now I want to create multiple variables, each containing a "filtered" dataframe.
For example:
I have
"android_fltr1_d1" to represent the next filter:
["OS"]=android, ["Device"]=1,["Design"]=1
and "android_fltr3_d2" to represent:
["OS"]=android, ["Device"]=3,["Design"]=2
I tried the following code (which works perfectly fine):
android_fltr1_d1 = android_fltr1[android_fltr1["Design"]==1].drop(["Design"],axis=1)
android_fltr1_d2 = android_fltr1[android_fltr1["Design"]==2].drop(["Design"],axis=1)
android_fltr1_d3 = android_fltr1[android_fltr1["Design"]==3].drop(["Design"],axis=1)
android_fltr3_d1 = android_fltr3[android_fltr3["Design"]==1].drop(["Design"],axis=1)
android_fltr3_d2 = android_fltr3[android_fltr3["Design"]==2].drop(["Design"],axis=1)
android_fltr3_d3 = android_fltr3[android_fltr3["Design"]==3].drop(["Design"],axis=1)
android_fltr5_d1 = android_fltr5[android_fltr5["Design"]==1].drop(["Design"],axis=1)
android_fltr5_d2 = android_fltr5[android_fltr5["Design"]==2].drop(["Design"],axis=1)
android_fltr5_d3 = android_fltr5[android_fltr5["Design"]==3].drop(["Design"],axis=1)
As you can guess, I don't find this efficient and would like to use a for loop to generate those variables (I'd also need to match each ["Language"] option to each filter I created, for a total of roughly 60 variables).
I thought about using something similar to .format() in the loop as a kind of placeholder, but couldn't find a way to do it.
It would probably be best to use a nested loop to create all the variables, though I'd be content even with a single loop for each column.
I find it difficult to build the for loop to do this and would be grateful for any help or directions.
Thanks!
As suggested, I tried to find my answer in: How do I create variable variables?
Yet I failed to understand how to use the globals() function in my case. I also found that using '%' formatting is not working anymore.
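For what it's worth, the usual alternative to variable variables is a dictionary keyed by the filter values. A sketch, assuming the full DataFrame is named df and has the object columns listed above:

# group once on all three filter columns and store each sub-frame under its key
subsets = {}
for (os_, device, design), group in df.groupby(['OS', 'Device', 'Design']):
    subsets[(os_, device, design)] = group.drop('Design', axis=1)

# e.g. the frame that android_fltr1_d1 used to hold:
android_fltr1_d1 = subsets[('android', 1, 1)]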

How do I use python loops to iterate through the same code with different arguments?

Moving from SAS to Python, I am trying to replicate a SAS macro-type process that uses input parameters to generate different iterations of code for each loop. In particular, I am trying to binarize continuous variables for modeling (regardless of whatever merit that may have). What I'm doing at the moment looks as follows:
Some sample data:
import numpy as np
import pandas as pd

data = [[2, 20], [4, 50], [6, 75], [1, 80], [3, 40]]
df = pd.DataFrame(data, columns=['var1', 'var2'])
Then I run the following:
df['var1_f'] = pd.cut(df['var1'], [0,1,2,3,4,5,7,np.inf], include_lowest=True, labels=['a','b','c','d','e','f','g'])
df['var2_f'] = pd.cut(df['var2'], [-np.inf,0,62,73,81,98,np.inf], include_lowest=True, labels=['a','b','c','d','e','f'])
.
.
.
df1=pd.get_dummies(df,columns=['var1_f'])
df1=pd.get_dummies(df1,columns=['var2_f'])
.
.
.
The above results in a table that contains the original DataFrame, but with columns appended that take the value 1 or 0, depending on whether the continuous variable falls into a particular band. That's great, but there must be a better way to do this than having potentially a dozen or so structurally identical blocks, differing only in variable names and cutoff/label values.
The SAS equivalent would involve replacing "varx_f", "varx", the cutoff values and the labels with placeholders that change on each iteration. In this case, I would do that through pre-defined values (as per the values in the above code), rather than dynamically.
How would I go about looping through this with different arguments for each iteration?
Apologies if this is an existing topic (I'm sure it is) - I just haven't been able to find it.
Thanks for reading!
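One data-driven way to avoid the repetition (a sketch; the specs layout is an assumption, but the bins and labels are the ones above) is to keep each variable's arguments in a dict and loop over it:

specs = {
    'var1': ([0, 1, 2, 3, 4, 5, 7, np.inf], ['a', 'b', 'c', 'd', 'e', 'f', 'g']),
    'var2': ([-np.inf, 0, 62, 73, 81, 98, np.inf], ['a', 'b', 'c', 'd', 'e', 'f']),
}

# one pd.cut per entry: the bin edges and labels travel with the column name
for col, (bins, labels) in specs.items():
    df[col + '_f'] = pd.cut(df[col], bins, include_lowest=True, labels=labels)

df1 = pd.get_dummies(df, columns=[col + '_f' for col in specs])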

Most efficient method of referencing large variables in user defined Python functions?

I was wondering if there is a more efficient method of referencing large variables (such as arrays with hundreds of thousands of entries) from a function in Python than simply passing them in as arguments. I know global is an option, but it's so unreliable, for lack of a better word, that I pretty much consider it irrelevant (unless, perhaps, somebody can explain why this isn't the case). I ask because I recently wrote a script which calls the function:
def build(unique, gene, index):
    ###Concatenates entries from arguments into a single string###
    ###Builds a list from entries in all of unique's sublists###
    hold = []
    hold.append([category[index] for category in unique[1]])
    ###Builds a list of strings concatenated from entries in other lists/arrays###
    line = ['\t'.join(gene[0:7]), '\t'.join(hold[0]), '\t'.join(gene[9:len(gene)])]
    ###Concatenates the list into a single string###
    line = '\t'.join(line)
    return line
From the loop:
for gene in table[1:]:
    buffer.append(build(unique, gene, table.index(gene)))
The variable unique is an array with about 500k entries and the loop runs about 60k times. I understand that this is bound to take a while (it's currently at about 12 minutes for this loop alone), but I'm hoping there's a way to optimize how unique is referenced in the function, so a massive array doesn't have to be passed every time.
Thanks in advance!
There is nothing large being passed here. unique is a reference to the same list every time; nothing is copied on a function call.
You will need to look elsewhere for optimisations.
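One concrete place to look in the loop above (a sketch, using the question's own names): table.index(gene) rescans the list from the front on every pass, which makes the whole loop quadratic. enumerate() hands you the position for free:

# enumerate(table[1:], start=1) yields (position, element) pairs,
# replacing the O(n) scan that table.index(gene) performs on each iteration
for index, gene in enumerate(table[1:], start=1):
    buffer.append(build(unique, gene, index))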

Python's Networkx, updating attributes "automatically"

Hi everybody. I'm building a DiGraph using NetworkX and iterating an algorithm over it. In a particular iteration, every node "n" changes a specific attribute, let's say "A_n". Now, every edge between this particular node "n" and a given predecessor "m" has another attribute of interest that depends on "A_n"; let's call it "B_mn". My question is: is it possible to update "B_mn" "automatically" by modifying "A_n", for all "n", "m" in my set of nodes? I mean, not iterating over the nodes and then over their predecessors, but using a kind of dynamic function "B_mn(A_n)" that changes its value at the very moment "A_n" changes. Is this possible?
I'm thinking of something like this. Let X and Y be numbers, and suppose that
G.node["n"]["A"] = X
G.edge["m"]["n"]["B"] = Y + G.node["n"]["A"]
I want that, by changing the value of X, the value of the attribute "B" on the edge would be updated as well.
Thank you very much in advance for your help :)
One caveat with this approach -> don't ever delete nodes.
In your example you are assigning X to G.node["n"]["A"]. If you say:
G.node["n"]["A"] = 5
G.node["n"]["A"] = 6
that destroys the old data location, and G.node["n"]["A"] now points to a new object at a new memory location.
Instead of assignment with '=', you need to do an in-place update of X, which leaves the datatype and memory location in place. That means you need a datatype which supports in-place updates, like a dictionary's .update().
Everything past here is dependent on your use case:
If the node data is a value (like an int or float), then adding values together is no problem. You can keep running calculations based on adding up changes, as long as the data is only one level deeper than where the calculation is performed.
However, if the node data is an expression of expressions...
for example, G.node.get('n')['A'] + G.node.get('m')['A'] (where G.node.get('m')['A'] is itself an expression that needs to be evaluated),
then you have one of two problems:
You will need a recursive function that does the evaluating OR
You will need to keep a running list of dictionaries outside of the Graph and perform the running evaluation there which will update the data values in the Graph.
It is possible to do this all within the graph using something like ast.literal_eval() (warning: this is not a GOOD idea).
If you only have one operation to perform (addition?), then there are some tricks you can use, like keeping a running list of the data locations and then doing a sum().
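An alternative worth mentioning (a sketch using the current NetworkX 2.x attribute API, G.nodes/G.edges, rather than the older G.node/G.edge from the question): don't store B at all, and compute it from A on demand, so it can never go stale:

import networkx as nx

G = nx.DiGraph()
G.add_node('n', A=5)
G.add_edge('m', 'n', Y=2)

def B(G, m, n):
    # hypothetical helper: B_mn = Y + A_n, evaluated at call time
    return G.edges[m, n]['Y'] + G.nodes[n]['A']

G.nodes['n']['A'] = 6   # change A_n ...
print(B(G, 'm', 'n'))   # ... and B follows: prints 8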
