I'm relatively new to Python, so I'm not sure how to approach this. I don't expect you to show code; I'm just trying to learn more about best practices in Python. I've been trying to follow the tutorial here, but I'm still slightly confused.
So, let's say I have a collection of .csv files in a folder, say in the following path: ("D:/Files/"). All the .csv files have just two columns, 'A' and 'B'. I want to create two new columns using the following criterion: multiply each row in column 'A' by 2, and multiply each row in column 'B' by 5. I then want to create a third column which holds the sum of these two new columns. The final output is the dataframe saved to a new .csv file, with the final column (and the 2 original columns).
I'm trying to code this in such a way that I can import it as a module in a different script as well, so I can then run it on multiple files in a folder. I approached coding it as follows:
import pandas as pd

def multiply_by_2(df):
    A = df['A']
    new_column_1 = [elem * 2 for elem in A]
    return new_column_1

def multiply_by_5(df):
    B = df['B']
    new_column_2 = [elem * 5 for elem in B]
    return new_column_2

def find_sum(new_column_1, new_column_2):
    C = [sum(x) for x in zip(new_column_1, new_column_2)]
    return C

## Attempt at main function
def main():
    new_column_1 = multiply_by_2()
    new_column_2 = multiply_by_5()
    sum_of_new_cols = find_sum(new_column_1, new_column_2)
Here's where I'm lost:
How do I add the new column to the dataframe? When I add df['new_column'] = sum_of_new_cols to the script, it fails with a NameError because df is not defined.
I'm not sure how to structure that last "main" function in such a way that I can save the file as a module to call from a separate script, which would allow me to apply these functions to multiple files in one go.
I can import other scripts as modules and use specific functions within a module (e.g. multiply_by_5), but how would I run the whole script (i.e. all the functions) where there is a "main" function? The tutorial I linked to says the benefit of using a "main" function is that it runs all the functions defined before it.
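For reference, here is a minimal sketch of the pattern such tutorials usually teach (the module name process_csv.py and the output file naming are my assumptions, not from the code above): main() receives the file path, so df is defined inside it, and the call at the bottom fires only when the file is run directly, not when it is imported.

# process_csv.py -- hypothetical module name
import pandas as pd

def main(path):
    df = pd.read_csv(path)               # df is defined here
    df['C'] = df['A'] * 2 + df['B'] * 5  # the combined new column
    df.to_csv(path.replace('.csv', '_out.csv'), index=False)

if __name__ == "__main__":
    main("D:/Files/example.csv")         # runs only when executed directly

Another script can then import process_csv and call process_csv.main(path) once per file in the folder.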
Any links to relevant resources would be appreciated, as I realise this is probably a really dumb question.
Related
My DF is very large. Is there a nice way (not a for loop) to modify some values within the DF and save a copy every N steps? e.g.
def modifier(x):
    x = x.split()  # more complex logic is applied here
    return x

df['new_col'] = df.old_col.apply(modifier)
Is there a nice way to add some code to the modifier function so that every 10,000 rows
df.to_pickle('make_copy.pickle')
will be called?
For saving every so-many rows, the issue is making sure the edge case is properly handled, as the last section might not be full-size. Using the approach discussed here, you could do something along the following lines. Although there is a loop, it runs only once per section. Note that if you save every section, you need a mechanism for saving each one under a new name (or else append the sections to a list of DFs and save that); a sketch of one such mechanism follows the code. This is efficient because it uses the default numerical index for splitting, so that index needs to be in place (or restored using reset_index). If it is not available, or you want to split into chunks without this module, you could explore numpy's array_split; but the same per-chunk loop would still be required to save each chunk to a file.
from more_itertools import sliced  # this module might need to be installed using pip

SLICE_SIZE = 10000
slices = sliced(range(len(df)), SLICE_SIZE)

for index in slices:
    df_slice = df.iloc[index]
    print(df_slice)  # or do anything you want with the section of the DF, such as saving it
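If each section should go to its own file, a sketch of one naming mechanism (the numbered make_copy_ file names are an assumption, extrapolated from the question) is to enumerate the slices:

from more_itertools import sliced

SLICE_SIZE = 10000
for n, index in enumerate(sliced(range(len(df)), SLICE_SIZE)):
    df.iloc[index].to_pickle(f'make_copy_{n}.pickle')  # one pickle per slice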
I'm looking to create a dynamic .withColumn, with the column "rules" being replaced by a list entry depending on the file being processed.
For example: file A has a column called "Validated" that is based on a different condition from file B, but the column name is the same in both. So can we loop through all files A-Z, applying a different rule for the same column in each file?
Here I am trying to validate many dataframes, creating an EmailAddress_Validation field on each one. Each dataframe has a different email validation rule set. The rules are stored in a list called EmailRuleList; as we loop through the datasets, the corresponding rule EmailRuleList[i] is passed in from the list.
The code below has the syntax. An example of a rule is also included, commented out with a "#" (hash).
Interestingly, if I supply the rule without the loop (the # comment), the code works, except it then obviously applies the same rule to all files.
i = 0
for FileProcessName in FileProcessListName:
    EmailAddress_Validation = EmailRuleList[i]
    # EmailAddress_Validation = when((regexp_extract(col("EmailAddress"), EmailRegEx, 0)) == (col("EmailAddress")), 0).otherwise(1)
    print(EmailAddress_Validation)
    print(FileProcessName)
    i = i + 1
    vars()[FileProcessName] = vars()[FileProcessName].withColumn("EmailAddress_Validation", EmailAddress_Validation)
Error Message: col should be Column
EmailRuleList is something like...
['when((regexp_extract(col("EmailAddress"),EmailRegEx,0))==(col("EmailAddress")),1).otherwise(0)',
'when((regexp_extract(col("EmailAddress"),EmailRegEx2,0))==(col("EmailAddress")),0).otherwise(1)',
'when((regexp_extract(col("EmailAddress"),EmailRegEx3,0))==(col("EmailAddress")),0).otherwise(1)',
'when((regexp_extract(col("EmailAddress"),EmailRegEx4,0))==(col("EmailAddress")),0).otherwise(1)']
I've tried lots of different things but am a bit stuck.
The error is in the last line of the for loop. The when condition that you want to check in .withColumn() is actually a string (each element of EmailRuleList is a string).
Since withColumn expects the second argument to be a Column, it raises the error. Here is the similar error I get when I pass something like your code to withColumn():
from pyspark.sql.functions import when, col

df.withColumn("check", "when(col('gname')=='Ana','yes').otherwise('No')").show()
# raises: col should be Column
To make it work, I have used the eval function. So, the following code wouldn't throw an error:
from pyspark.sql.functions import when, col

df.withColumn("check", eval("when(col('gname')=='Ana','yes').otherwise('No')")).show()
So, modify your code to the one given below to make it work:
i = 0
for FileProcessName in FileProcessListName:
    EmailAddress_Validation = EmailRuleList[i]
    # EmailAddress_Validation = when((regexp_extract(col("EmailAddress"), EmailRegEx, 0)) == (col("EmailAddress")), 0).otherwise(1)
    print(EmailAddress_Validation)
    print(FileProcessName)
    i = i + 1
    vars()[FileProcessName] = vars()[FileProcessName].withColumn("EmailAddress_Validation", eval(EmailAddress_Validation))
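As an aside, eval (which executes arbitrary strings) can be avoided entirely if EmailRuleList holds Column expressions instead of strings; Spark column expressions are lazy, so they can be built before being attached to any dataframe. A sketch, assuming the EmailRegEx variables hold your regex pattern strings:

from pyspark.sql.functions import when, col, regexp_extract

# Unevaluated Column expressions; nothing runs until withColumn attaches them.
EmailRuleList = [
    when(regexp_extract(col("EmailAddress"), EmailRegEx, 0) == col("EmailAddress"), 1).otherwise(0),
    when(regexp_extract(col("EmailAddress"), EmailRegEx2, 0) == col("EmailAddress"), 0).otherwise(1),
    # ...and likewise for EmailRegEx3 and EmailRegEx4
]

The loop can then pass EmailRuleList[i] to withColumn directly, with no eval call.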
So I have a DataFrame with several columns; some contain objects (strings) and some are numerical.
I'd like to create new dataframes which are "filtered" to each combination of the objects available.
To be clear, these are my object-type columns:
Index(['OS', 'Device', 'Design', 'Language'], dtype='object')
["Design"] and ["Language"] have 3 options each.
I filtered ["OS"] and ["Device"] manually as I needed to match them.
However, now I want to create multiple variables, each containing a "filtered" dataframe.
For example, I have "android_fltr1_d1" to represent the following filter:
["OS"]=android, ["Device"]=1,["Design"]=1
and "android_fltr3_d2" to represent:
["OS"]=android, ["Device"]=3,["Design"]=2
I tried the following code (which works perfectly fine):
android_fltr1_d1 = android_fltr1[android_fltr1["Design"]==1].drop(["Design"],axis=1)
android_fltr1_d2 = android_fltr1[android_fltr1["Design"]==2].drop(["Design"],axis=1)
android_fltr1_d3 = android_fltr1[android_fltr1["Design"]==3].drop(["Design"],axis=1)
android_fltr3_d1 = android_fltr3[android_fltr3["Design"]==1].drop(["Design"],axis=1)
android_fltr3_d2 = android_fltr3[android_fltr3["Design"]==2].drop(["Design"],axis=1)
android_fltr3_d3 = android_fltr3[android_fltr3["Design"]==3].drop(["Design"],axis=1)
android_fltr5_d1 = android_fltr5[android_fltr5["Design"]==1].drop(["Design"],axis=1)
android_fltr5_d2 = android_fltr5[android_fltr5["Design"]==2].drop(["Design"],axis=1)
android_fltr5_d3 = android_fltr5[android_fltr5["Design"]==3].drop(["Design"],axis=1)
As you can guess, I don't find this efficient, and I would like to use a for loop to generate those variables (I'd need to match each ["Language"] option to each filter I created, for a total of ~60 variables).
I thought about using something similar to .format() in the loop as some kind of placeholder, but couldn't find a way to do it.
It would probably be best to use a nested loop to create all the variables, though I'd be content even with a single loop for each column.
I'm finding it difficult to build such a for loop and would be grateful for any help or directions.
Thanks!
As suggested, I tried to find my answer in: How do I create variable variables?
Yet I failed to understand how to use the globals() function in my case. I also found that using '%' formatting no longer works.
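For what it's worth, here is a sketch of the dictionary alternative those answers point toward (the name filtered and the assumption that the unfiltered frame is called df are mine):

import itertools

# df is assumed to be the unfiltered frame with 'OS', 'Device', 'Design' columns
filtered = {}
for os_name, device, design in itertools.product(
        df["OS"].unique(), df["Device"].unique(), df["Design"].unique()):
    mask = (df["OS"] == os_name) & (df["Device"] == device) & (df["Design"] == design)
    filtered[f"{os_name}_fltr{device}_d{design}"] = df[mask].drop("Design", axis=1)

# filtered["android_fltr1_d1"] then plays the role of the android_fltr1_d1 variable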
I have an Excel file with multiple columns that contain address fields. I want the address lines to be in proper case. When I get them into proper case, some words like 1st, 2nd, SW (South West), 10th, etc. are transformed into 1St, 2Nd, Sw, 10Th. I need Python code to resolve this.
addr_df['ADDRESS1'] = addr_df.apply(set_propercase_fn,args=("Address1",), axis=1)
With the above code I am able to get the data into proper case. I tried the code below to patch up such cases; it did work, but not reliably.
def replacestring(val):
    reps = {'Parker': 'Borker', '1St': 'st', 'Sw': 'SW', 'S W': 'SW'}
    for i, j in reps.items():
        if i in val:
            val = val.replace(i, j)
    return val

print(addr_df['ADDRESS1'].apply(replacestring))
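A more targeted patch for just the ordinal problem (a sketch using the standard re module, not part of the original attempt) is to lowercase any St/Nd/Rd/Th that directly follows a digit:

import re

def fix_ordinals(text):
    # "1St" -> "1st", "2Nd" -> "2nd", "10Th" -> "10th"
    return re.sub(r"(\d)(St|Nd|Rd|Th)\b",
                  lambda m: m.group(1) + m.group(2).lower(), text)

addr_df['ADDRESS1'] = addr_df['ADDRESS1'].apply(fix_ordinals)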
You can try openpyxl; it is a great tool and I have been using it for a long time. With it you can modify each and every cell of an Excel file.
I have several pandas dataframes (A,B,C,D) and I want to merge each one of them individually with another dataframe (E).
I wanted to write a for loop that allows me to run the merge code for all of them and save each resulting dataframe with a different name, so for example something like:
tables = [A, B, C, D]
n = 0
for df in tables:
    merged_n = df.merge(E, left_index=True, right_index=True)
    n = n + 1
I can't find a way to give different names to the new dataframes created in the loop. I have searched Stack Overflow, but people either say this should never be done (though I couldn't find an explanation why) or say to use dictionaries, but having dataframes inside dictionaries is not as practical.
you want to clutter the namespace with automatically generated variable names? if so, don't do that. just use a dictionary.
if you really don't want to use a dictionary (really think about why you don't want to do this), you can just do it the slow-to-write, obvious way:
ea = E.merge(A)
eb = E.merge(B)
...
edit: if you really want to add vars to your namespace, which i don't recommend, you can do something like this:
l = locals()
for c in 'abcd':
    # at module scope, locals() is the same dict as globals(), so this defines ea..ed
    l[f'e{c}'] = E.merge(l[c.upper()])
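and, for comparison, a sketch of the dictionary version recommended above (assuming A-D and E share an index, as in the question's merge):

tables = {'a': A, 'b': B, 'c': C, 'd': D}
merged = {name: df.merge(E, left_index=True, right_index=True)
          for name, df in tables.items()}
# merged['a'] is A merged with E, and so on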