python: DataFrame.append does not append element - python

I have been working with Python for one week and I need some help.
I want to add a value to a database whenever a certain condition is fulfilled.
My program doesn't raise an error, but it doesn't append an element to my database:
import pandas as pd
noTEU = pd.DataFrame() # empty database
index_TEU = 0
for vessel in list:
    if condition is fulfilled:
        imo_vessel = pd.DataFrame({'imo': vessel}, index=[index_TEU])
        noTEU.append(imo_vessel) # I want here to add an element to my database
        index_TEU = index_TEU + 1
When I run this, I still end up with an empty dataframe. I have no idea why it doesn't do what I want it to do.

DataFrame.append does not modify the dataframe in place; it returns a new dataframe. You should assign the result back, such as:
import pandas as pd
noTEU = pd.DataFrame() # empty database
index_TEU = 0
for vessel in list:
    if condition is fulfilled:
        imo_vessel = pd.DataFrame({'imo': vessel}, index=[index_TEU])
        noTEU = noTEU.append(imo_vessel) # reassign so the appended row is kept
        index_TEU = index_TEU + 1
Also, don't use list as a variable name. It is not a keyword, but it is the name of Python's built-in list type, and assigning to it shadows the built-in.
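Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent pandas the reassignment above raises an AttributeError. A minimal sketch of the same idea without append, assuming made-up IMO numbers and a hypothetical has_no_teu function standing in for "condition is fulfilled":

```python
import pandas as pd

vessels = [9074729, 9321483, 9398888]  # hypothetical IMO numbers


def has_no_teu(vessel):
    # Hypothetical stand-in for "condition is fulfilled"
    return vessel % 2 == 1


# Option 1: collect the matching rows in a plain list and build the frame once.
rows = [{'imo': vessel} for vessel in vessels if has_no_teu(vessel)]
noTEU = pd.DataFrame(rows)

# Option 2: if the pieces are already DataFrames, join them with pd.concat.
pieces = [pd.DataFrame({'imo': [vessel]}) for vessel in vessels if has_no_teu(vessel)]
noTEU_concat = pd.concat(pieces, ignore_index=True)
```

Building the frame once from a list also avoids copying the whole dataframe on every iteration, which is what repeated appending does.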

Related

Trying to add prefixes to url if not present in pandas df column

I am trying to add prefixes to urls in my 'Websites' column. I can't figure out how to keep each new assignment to the helper column from overwriting everything from the previous assignment.
For example, say I have the following urls in my column:
http://www.bakkersfinedrycleaning.com/
www.cbgi.org
barstoolsand.com
This would be the desired end state:
http://www.bakkersfinedrycleaning.com/
http://www.cbgi.org
http://www.barstoolsand.com
this is as close as I have been able to get:
def nan_to_zeros(df, col):
    new_col = f"nanreplace{col}"
    df[new_col] = df[col].fillna('~')
    return df
df1 = nan_to_zeros(df1, 'Website')
df1['url_helper'] = df1.loc[~df1['nanreplaceWebsite'].str.startswith('http')| ~df1['nanreplaceWebsite'].str.startswith('www'), 'url_helper'] = 'https://www.'
df1['url_helper'] = df1.loc[df1['nanreplaceWebsite'].str.startswith('http'), 'url_helper'] = ""
df1['url_helper'] = df1.loc[df1['nanreplaceWebsite'].str.startswith('www'),'url_helper'] = 'www'
print(df1[['nanreplaceWebsite',"url_helper"]])
which just gives me a helper column of all www because the last iteration overwrites all fields.
Any direction appreciated.
Data:
{'Website': ['http://www.bakkersfinedrycleaning.com/',
'www.cbgi.org', 'barstoolsand.com']}
IIUC, there are 3 things to fix here:
1. The leading df1['url_helper'] = on each of the three lines shouldn't be there; it assigns the right-hand string to the whole column.
2. | should be & in the first condition, because 'https://www.' should be added to URLs that start with neither of the strings in the condition. The error will become apparent if we check the first condition after the other two conditions.
3. The last condition should add "http://" instead of "www".
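The three fixes above can be sketched as follows. This is only an illustration: the names has_http, has_www and the 'fixed' column are made up here, and the first prefix is written as 'http://www.' (rather than 'https://www.') so that the result matches the desired end state in the question.

```python
import pandas as pd

df1 = pd.DataFrame({'Website': ['http://www.bakkersfinedrycleaning.com/',
                                'www.cbgi.org', 'barstoolsand.com']})
s = df1['Website'].fillna('~')

has_http = s.str.startswith('http')
has_www = s.str.startswith('www')

# Start with an empty prefix, then fill in only the rows that need one.
df1['url_helper'] = ''
df1.loc[~has_http & ~has_www, 'url_helper'] = 'http://www.'  # neither prefix present
df1.loc[has_www, 'url_helper'] = 'http://'                   # only the scheme is missing

# Hypothetical output column: prefix plus original value.
df1['fixed'] = df1['url_helper'] + s
```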
Alternatively, your problem could be solved using np.select. Pass in the multiple conditions in the conditions list and their corresponding choice list and assign values accordingly:
import numpy as np
s = df1['Website'].fillna('~')
df1['fixed Website'] = np.select(
    [~(s.str.startswith('http') | ~s.str.contains('www')),
     ~(s.str.startswith('http') | s.str.contains('www'))],
    ['http://' + s, 'http://www.' + s],
    s)
Output:
                                  Website                           fixed Website
0  http://www.bakkersfinedrycleaning.com/  http://www.bakkersfinedrycleaning.com/
1                            www.cbgi.org                     http://www.cbgi.org
2                        barstoolsand.com             http://www.barstoolsand.com

pandas: while loop to simultaneously advance through multiple lists and call functions

I want my code to:
1. read data from a CSV and make a dataframe: "source_df"
2. see if the dataframe contains any columns specified in a list: "possible_columns"
3. call a unique function to replace the values in each column whose header is found in the "possible_columns" list, then insert the modified values in a new dataframe: "destination_df"
Here it is:
import pandas as pd
#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
#creates destination_df
blanklist = []
destination_df = pd.DataFrame(blanklist)
#create the column header lists for comparison in the while loop
columns = source_df.head(0)
possible_columns = ['yes/no','true/false']
#establish the functions list and define the functions to replace column values
fix_functions_list = ['yes_no_fix()','true_false_fix()']
def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")
def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
'''use the counter to call a unique function from the function list to replace the values in each column whose header is found in the "possible_columns" the list, insert the modified values in "destination_df, then advance the counter'''
counter = 0
while counter < len(possible_columns):
    if possible_columns[counter] in columns:
        destination_df.insert(counter, possible_columns[counter], source_df[possible_columns[counter]])
        fix_functions_list[counter]
    counter = counter + 1
#see if it works
print(destination_df.head(10))
When I print(destination_df), I see the unmodified column values from source_df. When I call the functions independently they work, which makes me think something is going wrong in my while loop.
Your issue is that you are trying to call a function that is stored in a list as a string.
fix_functions_list[counter]
This does not run the function; it only accesses the string value.
Instead, store the function objects themselves (no quotes, no parentheses), for example in a dict:
def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")
def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
fix_functions_list = {0:yes_no_fix,1:true_false_fix}
and change the function call in the loop as below:
fix_functions_list[counter]()
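A self-contained sketch of the same idea, for illustration only. It assumes a small in-memory source_df instead of the CSV, keys the dict by column name rather than by counter, and has each (hypothetically rewritten) fix function take a column and return the modified column instead of writing to a global dataframe:

```python
import pandas as pd

# Hypothetical stand-in for the data read from "yes-no-true-false.csv"
source_df = pd.DataFrame({
    'yes/no': ['Yes', 'No'],
    'true/false': ['True', 'False'],
    'other': [1, 2],
})


def yes_no_fix(col):
    return col.replace({'No': '0', 'Yes': '1'})


def true_false_fix(col):
    # Same mapping as in the question: 'False' -> '1', 'True' -> '0'
    return col.replace({'False': '1', 'True': '0'})


# Map each possible column name to the function object that fixes it.
fix_functions = {'yes/no': yes_no_fix, 'true/false': true_false_fix}

destination_df = pd.DataFrame()
for name, fix in fix_functions.items():
    if name in source_df.columns:
        # The parentheses here are what actually call the function.
        destination_df[name] = fix(source_df[name])
```

Keying the dict by column name removes the need for a counter and keeps each column paired with its own function.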
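Another option is to map the values of each column with a dictionary instead of separate functions:
import pandas as pd
#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
possible_columns = ['yes/no','true/false']
mapping_dict = {'yes/no': {"No":"0","Yes":"1"}, 'true/false': {'False':'1','True':'0'}}
# split the source columns into those to leave alone and those to fix
old_columns = [column for column in source_df.columns if column not in possible_columns]
existed_columns = [column for column in source_df.columns if column in possible_columns]
new_df = source_df[existed_columns].copy()
for column in new_df.columns:
    # map returns a new Series, so assign it back; values missing from the dict become NaN
    new_df[column] = new_df[column].map(mapping_dict[column])
new_df[old_columns] = source_df[old_columns]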

Python remove everything after specific string and loop through all rows in multiple columns in a dataframe

I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I have defined as "string1", and I would like to loop through all 4 columns in the dataframe defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: Replace "&FolderCTID", delete all string after
    string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I searched and found similar solutions, which are the ones used above, but none of them work.
Can anyone shed some light on this? Any assistance is appreciated.
Below are a few example rows from columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it.
First, declare a variable with your target columns.
Then use stack() and str.split to get your target output.
Finally, unstack and reapply the output to your original df.
cols_to_slice = ['Column_A','Column_B','Column_C','Column_D']
string1 = "&FolderCTID"
# [0] keeps the part before string1; [1] would keep the part after it
df[cols_to_slice].stack().str.split(string1,expand=True)[0].unstack(1)
If you want to replace these columns in your target df, then simply do:
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1,expand=True)[0].unstack(1)
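You should first get the index of the string using
indexes = len(string1) + df_MasterData[i].str.find(string1)
# This gives the position just past the end of the string, so the string itself is kept in the result
# If you don't want the string in the result, use this one instead
indexes = df_MasterData[i].str.find(string1)
Now slice each value at its own index. .str[:n] only accepts a single integer, so slice row by row (this assumes every value contains string1; str.find returns -1 otherwise):
df_MasterData[i] = [s[:n] for s, n in zip(df_MasterData[i], indexes)]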
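For what it's worth, Method 1 from the question already keeps the part before the string; in the question's loop its result is then overwritten by Methods 2 and 3, which run on the already-trimmed column. A minimal sketch with Method 1 alone, assuming two shortened, made-up URLs in place of the real data:

```python
import pandas as pd

string1 = "&FolderCTID"

# Hypothetical, shortened stand-ins for the real URL columns
df_MasterData = pd.DataFrame({
    "Column_A": ["https://example.com/a?RootFolder=x&FolderCTID=0x01&View=1"],
    "Column_B": ["https://example.com/b?RootFolder=y&FolderCTID=0x02&View=2"],
})

cols = ["Column_A", "Column_B"]
for col in cols:
    # Split once at string1 and keep only the part before it.
    df_MasterData[col] = df_MasterData[col].str.split(string1, n=1).str[0]
```

Values that do not contain string1 come back unchanged, because the split then yields a single piece.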

Loop filters containing a list of keywords (one key word each time) in a specific column in Pandas

I want to use a loop to filter a specific column (i.e. CaseText) by a list of keywords (i.e. different reference numbers), one keyword at a time, so that I do not have to change the keyword manually each time. I want to see the result as a dataframe.
Unfortunately, my code doesn't work. It just returns the whole dataset.
Can anyone help me find out what's wrong with my code? Additionally, it would be great if the resulting table could be broken into separate results for each keyword.
Many thanks.
import pandas as pd
pd.set_option('display.max_colwidth', 0)
list_of_files = ['https://open.barnet.gov.uk/download/2nq32/c1d/Open%20Data%20Planning%20Q1%2019-20%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/9wj/Open%20Data%20Planning%202018-19%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/my7/Planning%20Decisions%202017-2018%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/303/Planning%20Decisions%202016-2017%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/zf1/Planning%20Decisions%202015-2016%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/9b3/Open%20Data%20Planning%202014-2015%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/6zz/Open%20Data%20Planning%202013-2014%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/r7m/Open%20Data%20Planning%202012-2013%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/fzw/Open%20Data%20Planning%202011-2012%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/x3w/Open%20Data%20Planning%202010-2011%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/tbc/Open%20Data%20Planning%202009-2010%20-%20NG.csv']
data_container = []
for filename in list_of_files:
    print(filename)
    df = pd.read_csv(filename, encoding='mac_roman')
    data_container.append(df)
all_data = pd.concat(data_container)
reference_list = ['H/04522/11','15/07697/FUL'] # I want to filter the dataset with a single keyword each time. Because I have nearly 70 keywords to filter.
select_data = pd.DataFrame()
for keywords in reference_list:
    select_data = select_data.append(all_data[all_data['CaseText'].str.contains("reference_list", na=False)])
select_data = select_data[['CaseReference', 'CaseDate', 'ServiceTypeLabel', 'CaseText',
                           'DecisionDate', 'Decision', 'AppealRef']]
select_data.drop_duplicates(keep='first', inplace=True)
select_data
One of the problems is that the items in reference_list do not match any of the values in the 'CaseReference' column. Once you figure out which CaseReference numbers you want to search for, the code below should work for you. Just put the correct CaseReference numbers in reference_list.
import pandas as pd
url = ('https://open.barnet.gov.uk/download/2nq32/fzw/'
'Open%20Data%20Planning%202011-2012%20-%20NG.csv')
data = pd.read_csv(url, encoding='mac_roman')
reference_list = ['hH/02159/13','16/4324/FUL']
select_data = pd.DataFrame()
for keywords in reference_list:
    select_data = select_data.append(data[data['CaseReference'] == keywords],
                                     ignore_index=True)
select_data = select_data[['CaseDate', 'ServiceTypeLabel', 'CaseText',
                           'DecisionDate', 'Decision', 'AppealRef']]
select_data.drop_duplicates(keep='first', inplace=True)
select_data
This should work
import pandas as pd
pd.set_option('display.max_colwidth', 0)
list_of_files = ['https://open.barnet.gov.uk/download/2nq32/c1d/Open%20Data%20Planning%20Q1%2019-20%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/9wj/Open%20Data%20Planning%202018-19%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/my7/Planning%20Decisions%202017-2018%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/303/Planning%20Decisions%202016-2017%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/zf1/Planning%20Decisions%202015-2016%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/9b3/Open%20Data%20Planning%202014-2015%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/6zz/Open%20Data%20Planning%202013-2014%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/r7m/Open%20Data%20Planning%202012-2013%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/fzw/Open%20Data%20Planning%202011-2012%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/x3w/Open%20Data%20Planning%202010-2011%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/tbc/Open%20Data%20Planning%202009-2010%20-%20NG.csv']
# this takes some time
df = pd.concat([pd.read_csv(el, engine='python' ) for el in list_of_files]) # read all csvs
reference_list = ['H/04522/11','15/07697/FUL']
reference_dict = dict() # create an empty dictionary.
# Will populate this where each key will be the keyword and each value will be a dataframe
# filtered for 'CaseText' contain the keyword
for el in reference_list:
    reference_dict[el] = df[(df['CaseText'].str.contains(el)) & ~(df['CaseText'].isna())]
# notice the two conditions
# 1) the column CaseText should contain the keyword. (df['CaseText'].str.contains(el))
# 2) there are some elements in CaseText that are NaN so they need to be excluded
# this is what ~(df['CaseText'].isna()) does
# you can see the resulting dataframes like so: reference_dict[keyword]. for example
reference_dict['H/04522/11']
UPDATE
If you want one dataframe that includes the cases where any of the keywords is in the column CaseText, try this:
# let's start after having read in the data
# Separate your keywords with | in one string.
keywords = 'H/04522/11|15/07697/FUL' # | means "or" in a regular expression
final_df = df[(df['CaseText'].str.contains(keywords)) & ~(df['CaseText'].isna())]
final_df
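With nearly 70 keywords, typing the combined pattern by hand gets tedious. A sketch of building it from reference_list instead, assuming a tiny made-up dataframe in place of the downloaded CSVs (the CaseReference values A1 to A4 and the CaseText strings are invented for illustration):

```python
import re

import pandas as pd

# Hypothetical stand-in for the concatenated planning data
df = pd.DataFrame({
    'CaseReference': ['A1', 'A2', 'A3', 'A4'],
    'CaseText': ['Amendment to H/04522/11', None,
                 'See 15/07697/FUL for details', 'Unrelated text'],
})
reference_list = ['H/04522/11', '15/07697/FUL']

# One dataframe per keyword. regex=False treats the keyword as literal text,
# and na=False makes rows with missing CaseText count as "no match".
reference_dict = {
    kw: df[df['CaseText'].str.contains(kw, regex=False, na=False)]
    for kw in reference_list
}

# One dataframe for any keyword: escape each keyword, then join with "|".
pattern = '|'.join(re.escape(kw) for kw in reference_list)
final_df = df[df['CaseText'].str.contains(pattern, na=False)]
```

Passing na=False to str.contains does the same job as the separate isna() condition used above.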

How to append data to a dataframe without overwriting?

I'm new to python but I need it for a personal project. And so I have this lump of code. The function is to create a table and update it as necessary. The problem is that the table keeps being overwritten and I don't know why. Also I'm struggling with correctly assigning the starting position of the new lines to append, and that's why total (ends up overwritten as well) and pos are there, but I haven't figured out how to correctly use them. Any tips?
import datetime
import pandas as pd
import numpy as np
total ={}
entryTable = pd.read_csv("Entry_Table.csv")
newEntries = int(input("How many new entries?\n"))
for i in range(newEntries):
    ID = input ("ID?\n")
    VQ = int (input ("VQ?\n"))
    timeStamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    entryTable.loc[i] = [timeStamp, ID, VQ]
    entryTable.to_csv("Inventory_Table.csv")
    total[i] = 1
    pos = sum(total.values())
    print(pos)
inventoryTable = pd.read_csv("Inventory_Table.csv", index_col = 0)
Your variable 'i' runs from 0 to newEntries - 1. When you assign to row 'i' of your pandas dataframe, you overwrite whatever data is already in that row. If you want to add new rows instead, use 'n+i', where n is the number of entries in the table before the loop starts. You can determine n with either
n = len(entryTable)
or
n = entryTable.shape[0]
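A minimal sketch of the 'n+i' approach, assuming a small made-up table in place of Entry_Table.csv, fixed values in place of the input() prompts, and the default 0, 1, 2, ... index:

```python
import pandas as pd

# Hypothetical existing table with one entry
entryTable = pd.DataFrame(
    {'timeStamp': ['2020-01-01 10:00:00'], 'ID': ['a'], 'VQ': [5]})

# Hypothetical new entries (these would come from input() in the question)
new_rows = [['2020-01-02 11:00:00', 'b', 7],
            ['2020-01-02 11:05:00', 'c', 3]]

n = len(entryTable)  # number of rows before anything is added
for i, row in enumerate(new_rows):
    # n + i is always one past the last existing row, so nothing is overwritten
    entryTable.loc[n + i] = row
```

n has to be computed once before the loop; computing it inside the loop would skip rows as the table grows.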
