pandas: while loop to simultaneously advance through multiple lists and call functions - python

I want my code to:
read data from a CSV and make a dataframe: "source_df"
see if the dataframe contains any of the columns specified in a list: "possible_columns"
call a unique function to replace the values in each column whose header is found in the "possible_columns" list, then insert the modified values into a new dataframe: "destination_df"
Here it is:
import pandas as pd

# creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)

# creates destination_df
blanklist = []
destination_df = pd.DataFrame(blanklist)

# create the column header lists for comparison in the while loop
columns = source_df.head(0)
possible_columns = ['yes/no', 'true/false']

# establish the functions list and define the functions to replace column values
fix_functions_list = ['yes_no_fix()', 'true_false_fix()']

def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No", "0").replace("Yes", "1")

def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')

'''use the counter to call a unique function from the function list to replace the values in each
column whose header is found in the "possible_columns" list, insert the modified values in
"destination_df", then advance the counter'''
counter = 0
while counter < len(possible_columns):
    if possible_columns[counter] in columns:
        destination_df.insert(counter, possible_columns[counter], source_df[possible_columns[counter]])
        fix_functions_list[counter]
    counter = counter + 1

# see if it works
print(destination_df.head(10))
When I print(destination_df), I see the unmodified column values from source_df. When I call the functions independently they work, which makes me think something is going wrong in my while loop.

Your issue is that you are trying to call a function that is stored in a list as a string.

fix_functions_list[counter]

This only accesses the string value; it never runs the function. Store references to the functions themselves instead of strings:

def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No", "0").replace("Yes", "1")

def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')

fix_functions_list = {0: yes_no_fix, 1: true_false_fix}

and change the function call to:

fix_functions_list[counter]()
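As a minimal, self-contained sketch of this dispatch pattern (the column names and replacement values come from the question; the data here is made up, and the fix functions are simplified to operate on the columns they actually fill):

```python
import pandas as pd

source_df = pd.DataFrame({'yes/no': ['Yes', 'No', 'Yes'],
                          'true/false': ['True', 'False', 'True']})
destination_df = pd.DataFrame(index=source_df.index)

def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no'].replace("No", "0").replace("Yes", "1")

def true_false_fix():
    destination_df['true/false'] = destination_df['true/false'].replace('False', '1').replace('True', '0')

# map each column name directly to its fixer function (callables, not strings)
fix_functions = {'yes/no': yes_no_fix, 'true/false': true_false_fix}

possible_columns = ['yes/no', 'true/false']
for position, name in enumerate(possible_columns):
    if name in source_df.columns:
        destination_df.insert(position, name, source_df[name])
        fix_functions[name]()  # the trailing () actually calls the function

print(destination_df)
```

Keying the dict by column name rather than by counter position also removes the need to keep two lists in lockstep.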

Alternatively, build a per-column mapping dictionary and apply it with map:

# creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)

possible_columns = ['yes/no', 'true/false']
mapping_dict = {'yes/no': {"No": "0", "Yes": "1"},
                'true/false': {'False': '1', 'True': '0'}}

old_columns = [column for column in source_df.columns if column not in possible_columns]
existing_columns = [column for column in source_df.columns if column in possible_columns]

new_df = source_df[existing_columns].copy()
for column in new_df.columns:
    new_df[column] = new_df[column].map(mapping_dict[column])
new_df[old_columns] = source_df[old_columns]
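The same per-column mapping can also be applied in a single call: DataFrame.replace accepts a nested {column: {old: new}} dict and leaves unlisted columns untouched. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'yes/no': ['Yes', 'No'],
                   'true/false': ['True', 'False'],
                   'other': [1, 2]})

mapping_dict = {'yes/no': {'No': '0', 'Yes': '1'},
                'true/false': {'False': '1', 'True': '0'}}

# replace() with a nested dict only touches the listed columns;
# 'other' passes through unchanged
fixed = df.replace(mapping_dict)
print(fixed)
```

Unlike map, which returns NaN for values missing from the mapping, replace leaves unmatched values as they were.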

Related

How to create a dataframe in the for loop?

I want to create a dataframe that consists of values obtained inside the for loop.
columns = ['BIN', 'Date_of_registration', 'Tax', 'TaxName', 'KBK',
           'KBKName', 'Paynum', 'Paytype', 'EntryType', 'Writeoffdate', 'Summa']
df = pd.DataFrame(columns=columns)
I have this for loop:
for elements in tree.findall('{http://xmlns.kztc-cits/sign}payment'):
    print("hello")
    tax = elements.find('{http://xmlns.kztc-cits/sign}TaxOrgCode').text
    tax_name_ru = elements.find('{http://xmlns.kztc-cits/sign}NameTaxRu').text
    kbk = elements.find('{http://xmlns.kztc-cits/sign}KBK').text
    kbk_name_ru = elements.find('{http://xmlns.kztc-cits/sign}KBKNameRu').text
    paynum = elements.find('{http://xmlns.kztc-cits/sign}PayNum').text
    paytype = elements.find('{http://xmlns.kztc-cits/sign}PayType').text
    entry_type = elements.find('{http://xmlns.kztc-cits/sign}EntryType').text
    writeoffdate = elements.find('{http://xmlns.kztc-cits/sign}WriteOffDate').text
    summa = elements.find('{http://xmlns.kztc-cits/sign}Summa').text
    print(tax, tax_name_ru, kbk, kbk_name_ru, paynum, paytype, entry_type, writeoffdate, summa)
How can I append the acquired values to the dataframe initially created outside the for loop?
A simple way, if you only need the dataframe after the loop has completed, is to append the data to a list of lists and then convert that to a dataframe. Caveat: it is your responsibility to make sure the list ordering matches the columns, so if you change your columns in the future you have to reposition the list.

list_of_rows = []
for elements in tree.findall('{http://xmlns.kztc-cits/sign}payment'):
    # ... extract tax, tax_name_ru, etc. exactly as in the question ...
    list_of_rows.append([tax, tax_name_ru, kbk, kbk_name_ru, paynum,
                         paytype, entry_type, writeoffdate, summa])
df = pd.DataFrame(columns=columns, data=list_of_rows)
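A self-contained sketch of the same pattern with made-up rows standing in for the XML parsing, to show that the list is grown cheaply and the DataFrame is constructed exactly once:

```python
import pandas as pd

columns = ['BIN', 'Tax', 'Summa']

# pretend each tuple came out of one <payment> element
parsed_payments = [('123', 'VAT', '100.00'),
                   ('456', 'CIT', '250.50')]

list_of_rows = []
for bin_code, tax, summa in parsed_payments:
    list_of_rows.append([bin_code, tax, summa])

# one DataFrame construction at the end, instead of growing row by row
df = pd.DataFrame(columns=columns, data=list_of_rows)
print(df)
```

Appending to a Python list is O(1), whereas appending to a DataFrame copies all existing rows each time, so this pattern scales much better for large inputs.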

if statement and call function for dataframe

I know how to apply an IF condition in a Pandas DataFrame. However, my question is how to do the following:

if (df[df['col1'] == 0]):
    sys.path.append("/desktop/folder/")
    import self_module as sm
    df = sm.call_function(df)

What I really want to do is call call_function() whenever a value in col1 equals 0.

def call_function(ds):
    ds['new_age'] = (ds['age'] * 0.012345678901).round(12)
    return ds

The above is a simple example of call_function().
Since your function interacts with multiple columns and returns a whole data frame, run the conditional logic inside the method:

import numpy as np

def call_function(ds):
    ds['new_age'] = np.nan
    ds.loc[ds['col'] == 0, 'new_age'] = ds['age'].mul(0.012345678901).round(12)
    return ds

df = call_function(df)

If you are unable to modify the function, run the method on splits of the data frame and concat or append them together. Any new columns in the other split will have their values filled with NaN.

def call_function(ds):
    ds['new_age'] = (ds['age'] * 0.012345678901).round(12)
    return ds

df = pd.concat([call_function(df[df['col'] == 0].copy()),
                df[df['col'] != 0].copy()])
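A runnable sketch of the masked-assignment version, with a toy frame standing in for the real data (column names col and age follow the answer's example; the multiplier is simplified to 2 so the arithmetic is easy to check):

```python
import numpy as np
import pandas as pd

def call_function(ds):
    # create the column everywhere, then fill it only where the condition holds
    ds['new_age'] = np.nan
    ds.loc[ds['col'] == 0, 'new_age'] = ds['age'].mul(2).round(1)
    return ds

df = pd.DataFrame({'col': [0, 1, 0], 'age': [10.0, 20.0, 30.0]})
df = call_function(df)
print(df)
```

The right-hand Series aligns on the index of the selected rows, so only the rows matching the mask receive a computed value; the rest keep NaN.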

Python remove everything after specific string and loop through all rows in multiple columns in a dataframe

I have a file full of URL paths like the one below, spanning 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as "string1", and I would like to loop through all 4 columns in the dataframe, defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd

df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: Replace "&FolderCTID", delete all string after
    string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I searched Google and found similar solutions, but none of them work.
Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows in columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
first, declare a variable with your target columns.
Then use stack() and str.split to get your target output.
Finally, unstack and reapply the output to your original df.

cols_to_slice = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
string1 = "&FolderCTID"
df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)

If you want to replace these columns in your target df, then simply do:

df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
You could also find the position of the string in each cell first:

# position where string1 ends in each cell; drop the len(string1)
# term if you don't want string1 itself kept in the result
ends = df_MasterData[i].str.find(string1) + len(string1)

# .str slicing cannot take a per-row Series, so slice row by row
df_MasterData[i] = [s[:n] for s, n in zip(df_MasterData[i], ends)]
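The whole task can also be expressed as one vectorized str.replace with a regular expression. A sketch on made-up URLs (re.escape guards the & and any other regex metacharacters in the marker string):

```python
import re
import pandas as pd

string1 = "&FolderCTID"
df = pd.DataFrame({
    'Column_A': ['https://example.com/a?RootFolder=x&FolderCTID=0x0120&View=1',
                 'https://example.com/b?RootFolder=y&FolderCTID=0xFFFF&View=2'],
})

# delete the marker and everything after it, in every listed column
for col in ['Column_A']:
    df[col] = df[col].str.replace(re.escape(string1) + '.*', '', regex=True)

print(df['Column_A'].tolist())
```

Rows that do not contain the marker are left unchanged, since the pattern simply fails to match.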

Loop filters containing a list of keywords (one key word each time) in a specific column in Pandas

I want to use a loop to apply filters containing a list of different keywords (i.e. different reference numbers) to a specific column (i.e. CaseText), but filter on just one keyword each time, so that I do not have to change the keyword manually each time. I want to see the result as a dataframe.
Unfortunately, my code doesn't work: it just returns the whole dataset.
Can anyone help and find out what's wrong with my code? Additionally, it would be great if the result could be broken into separate tables for each keyword.
Many thanks.
import pandas as pd
pd.set_option('display.max_colwidth', 0)

list_of_files = ['https://open.barnet.gov.uk/download/2nq32/c1d/Open%20Data%20Planning%20Q1%2019-20%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/9wj/Open%20Data%20Planning%202018-19%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/my7/Planning%20Decisions%202017-2018%20non%20geo.csv',
                 'https://open.barnet.gov.uk/download/2nq32/303/Planning%20Decisions%202016-2017%20non%20geo.csv',
                 'https://open.barnet.gov.uk/download/2nq32/zf1/Planning%20Decisions%202015-2016%20non%20geo.csv',
                 'https://open.barnet.gov.uk/download/2nq32/9b3/Open%20Data%20Planning%202014-2015%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/6zz/Open%20Data%20Planning%202013-2014%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/r7m/Open%20Data%20Planning%202012-2013%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/fzw/Open%20Data%20Planning%202011-2012%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/x3w/Open%20Data%20Planning%202010-2011%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/tbc/Open%20Data%20Planning%202009-2010%20-%20NG.csv']

data_container = []
for filename in list_of_files:
    print(filename)
    df = pd.read_csv(filename, encoding='mac_roman')
    data_container.append(df)
all_data = pd.concat(data_container)

# I want to filter the dataset with a single keyword each time,
# because I have nearly 70 keywords to filter
reference_list = ['H/04522/11', '15/07697/FUL']

select_data = pd.DataFrame()
for keywords in reference_list:
    select_data = select_data.append(all_data[all_data['CaseText'].str.contains("reference_list", na=False)])

select_data = select_data[['CaseReference', 'CaseDate', 'ServiceTypeLabel', 'CaseText',
                           'DecisionDate', 'Decision', 'AppealRef']]
select_data.drop_duplicates(keep='first', inplace=True)
select_data
One of the problems is that the items in reference_list do not match any of the values in the column 'CaseReference'. Once you figure out which CaseReference numbers you want to search for, the code below should work for you. Just put the correct CaseReference numbers in the reference_list list.
import pandas as pd

url = ('https://open.barnet.gov.uk/download/2nq32/fzw/'
       'Open%20Data%20Planning%202011-2012%20-%20NG.csv')
data = pd.read_csv(url, encoding='mac_roman')

reference_list = ['hH/02159/13', '16/4324/FUL']
select_data = pd.DataFrame()
for keywords in reference_list:
    select_data = select_data.append(data[data['CaseReference'] == keywords],
                                     ignore_index=True)

select_data = select_data[['CaseDate', 'ServiceTypeLabel', 'CaseText',
                           'DecisionDate', 'Decision', 'AppealRef']]
select_data.drop_duplicates(keep='first', inplace=True)
select_data
This should work
import pandas as pd
pd.set_option('display.max_colwidth', 0)

list_of_files = ['https://open.barnet.gov.uk/download/2nq32/c1d/Open%20Data%20Planning%20Q1%2019-20%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/9wj/Open%20Data%20Planning%202018-19%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/my7/Planning%20Decisions%202017-2018%20non%20geo.csv',
                 'https://open.barnet.gov.uk/download/2nq32/303/Planning%20Decisions%202016-2017%20non%20geo.csv',
                 'https://open.barnet.gov.uk/download/2nq32/zf1/Planning%20Decisions%202015-2016%20non%20geo.csv',
                 'https://open.barnet.gov.uk/download/2nq32/9b3/Open%20Data%20Planning%202014-2015%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/6zz/Open%20Data%20Planning%202013-2014%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/r7m/Open%20Data%20Planning%202012-2013%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/fzw/Open%20Data%20Planning%202011-2012%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/x3w/Open%20Data%20Planning%202010-2011%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/tbc/Open%20Data%20Planning%202009-2010%20-%20NG.csv']

# this takes some time
df = pd.concat([pd.read_csv(el, engine='python') for el in list_of_files])  # read all csvs

reference_list = ['H/04522/11', '15/07697/FUL']

# create an empty dictionary; each key will be a keyword and each value
# a dataframe filtered to rows whose 'CaseText' contains that keyword
reference_dict = dict()
for el in reference_list:
    reference_dict[el] = df[(df['CaseText'].str.contains(el)) & ~(df['CaseText'].isna())]
    # notice the two conditions:
    # 1) the column CaseText should contain the keyword: df['CaseText'].str.contains(el)
    # 2) some elements in CaseText are NaN, so they need to be excluded,
    #    which is what ~(df['CaseText'].isna()) does

# you can see the resulting dataframes like so: reference_dict[keyword], for example:
reference_dict['H/04522/11']
UPDATE

If you want one dataframe that includes the cases where any of the keywords appears in the column CaseText, try this:

# let's start after having read in the data
# separate your keywords with | in one string
keywords = 'H/04522/11|15/07697/FUL'  # read up on regular expressions to understand this
final_df = df[(df['CaseText'].str.contains(keywords)) & ~(df['CaseText'].isna())]
final_df
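As a minimal, self-contained illustration of the per-keyword loop (toy data instead of the Barnet CSVs; note that str.contains(..., na=False) folds the NaN handling into a single condition, and regex=False treats characters like / literally):

```python
import pandas as pd

all_data = pd.DataFrame({
    'CaseReference': ['A1', 'A2', 'A3'],
    'CaseText': ['refers to H/04522/11', None, 'see 15/07697/FUL for details'],
})

reference_list = ['H/04522/11', '15/07697/FUL']

# one filtered frame per keyword; NaN rows are simply excluded by na=False
per_keyword = {kw: all_data[all_data['CaseText'].str.contains(kw, na=False, regex=False)]
               for kw in reference_list}

print(per_keyword['H/04522/11']['CaseReference'].tolist())
print(per_keyword['15/07697/FUL']['CaseReference'].tolist())
```

Keeping the results in a dict keyed by keyword directly gives the "separate table per keyword" output the question asks for.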

python: DataFrame.append does not append element

I have been working with Python for one week and I need some help.
I want to add a value to a database if a certain condition is fulfilled.
My program doesn't give an error, but it doesn't append an element to my database.
import pandas as pd

noTEU = pd.DataFrame()  # empty database
index_TEU = 0
for vessel in list:
    if condition is fullfilled:
        imo_vessel = pd.DataFrame({'imo': vessel}, index=[index_TEU])
        noTEU.append(imo_vessel)  # I want here to add an element to my database
        index_TEU = index_TEU + 1
If I run this, at the end I still get an empty dataframe. I have no idea why it doesn't do what I want it to do.
You should reassign the dataframe, since DataFrame.append returns a new dataframe rather than modifying the original in place:

import pandas as pd

noTEU = pd.DataFrame()  # empty database
index_TEU = 0
for vessel in list:
    if condition is fullfilled:
        imo_vessel = pd.DataFrame({'imo': vessel}, index=[index_TEU])
        noTEU = noTEU.append(imo_vessel)  # reassign to keep the appended row
        index_TEU = index_TEU + 1

Also, don't use the name list for a list, because it shadows the Python built-in.
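Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; the usual replacement is to collect the rows in a plain Python list and build the result once with pd.concat. A sketch with made-up vessel numbers and a stand-in condition:

```python
import pandas as pd

vessels = [9074729, 9321483, 9839430]  # made-up IMO numbers

rows = []
for index_TEU, vessel in enumerate(vessels):
    if vessel > 9300000:  # stand-in for the real condition
        rows.append(pd.DataFrame({'imo': [vessel]}, index=[index_TEU]))

# build the result once at the end (guard against an empty list in real code)
noTEU = pd.concat(rows)
print(noTEU)
```

This avoids the quadratic cost of re-copying the accumulated dataframe on every iteration and works on current pandas versions.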
