Update values in a column while looping over a pandas dataframe - python
I am working on a script to extract some details from images. I am trying to loop over a dataframe that has my image names. How can I add a new column to the dataframe, that populates the extracted name appropriately against the image name?
for image in df['images']:
    concatenated_name = ''.join(name)
    df.loc[image, df['images']]['names'] = concatenated_name
Expected:

Index  images  names
0      img_01  TonyStark
1      img_02  Thanos
2      img_03  Thor

Got:

Index  images  names
0      img_01  Thor
1      img_02  Thor
2      img_03  Thor
Use apply to run a function on each value of the images column:
def get_name(image):
    # code for getting the name from the image
    return name

df['names'] = df['images'].apply(get_name)
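As a quick self-contained check of the apply pattern, here is a minimal sketch where the OCR step is replaced by a made-up lookup table (demo_names is hypothetical, standing in for the real extraction):

import pandas as pd

df = pd.DataFrame({'images': ['img_01', 'img_02', 'img_03']})

# hypothetical stand-in for the real OCR extraction
demo_names = {'img_01': 'TonyStark', 'img_02': 'Thanos', 'img_03': 'Thor'}

def get_name(image):
    return demo_names[image]

df['names'] = df['images'].apply(get_name)
print(df)
#    images      names
# 0  img_01  TonyStark
# 1  img_02     Thanos
# 2  img_03       Thor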
Following your answer, which added some more details, it should be possible to shorten it to:
def get_details(filename):
    image = os.getcwd() + filename
    data = pytesseract.image_to_string(Image.open(image))
    # ...
    data = ''.join(a)
    return data

df['data'] = df['filenames'].apply(get_details)
# save df to csv / excel / other
After multiple trials, I think I have a viable solution to this question.
I was using nested functions for this exercise, such that function 1 loops over a dataframe of files and calls function 2 to extract the text, perform validation, and return a value if the image had the expected field.
First, I created an empty list which is populated during each run of function 2. At the end, the user can choose to use this list to create a dataframe.
# dataframe of filenames
df = pd.DataFrame(os.listdir(), columns=['filenames'])
df = df[df['filenames'].str.contains(r"\.png|\.jpg|\.jpeg")]  # escape the dots: str.contains treats the pattern as a regex
df['filenames'] = '\\' + df['filenames']

df1 = []  # empty list to record details

# Function 1
def extract_details(df):
    for filename in df['filenames']:
        get_details(filename)

# Function 2
def get_details(filename):
    image = os.getcwd() + filename
    data = pytesseract.image_to_string(Image.open(image))
    # ...
    data = ''.join(a)
    print(filename, data)
    df1.append([filename, data])

df_data = pd.DataFrame(df1, columns=['filenames', 'data'])  # container for final output
df_data.to_csv('data_list.csv')     # write output to a csv file
df_data.to_excel('data_list.xlsx')  # write output to an excel file
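If the filename scan ever needs to be stricter, here is a small pathlib-based sketch of the same listing step (a suggested alternative, not part of the original solution):

from pathlib import Path
import pandas as pd

# collect image files by suffix rather than by substring,
# so a name like 'img.png.bak' is not picked up
files = [f.name for f in Path.cwd().iterdir()
         if f.suffix.lower() in ('.png', '.jpg', '.jpeg')]
df = pd.DataFrame(files, columns=['filenames'])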
Related
How do I save a dataframe under a name built from variables I created earlier in the code (oldest_id and iso_date, as seen in the code)?
#fetch the data in a sequence of 1 million rows as dataframes
df1 = My_functions.get_ais_data(json1)
df2 = My_functions.get_ais_data(json2)
df3 = My_functions.get_ais_data(json3)
df_all = pd.concat([df1, df2, df3], axis=0)

#save the data frame under the names of oldest_id and the corresponding ISO date
df_all.to_csv('oldest_id + iso_date +.csv')

The last line might be silly, but I am trying to save the dataframe under the names of some variables I created earlier in the code.
You can use an f-string to embed variables in strings, like this:

df_all.to_csv(f'/path/to/folder/{oldest_id}{iso_date}.csv')
If you need the value corresponding to the variable, then mid's answer is correct, thus:

df_all.to_csv(f'/path/to/folder/{oldest_id}{iso_date}.csv')

However, if you want to use the name of the variable itself:

df_all.to_csv('/path/to/folder/' + f'{oldest_id=}'.split('=')[0] + f'{iso_date=}'.split('=')[0] + '.csv')

would do the work.
Maybe try:

file_name = f"{oldest_id}{iso_date}.csv"
df_all.to_csv(file_name)

assuming you are using Python 3.6 and up.
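To sanity-check both variants above (values are made up; the = specifier in an f-string needs Python 3.8+):

oldest_id = "A123"
iso_date = "2020-01-01"

# value-based filename
print(f'{oldest_id}{iso_date}.csv')
# A1232020-01-01.csv

# name-based filename via the = debug specifier
print(f'{oldest_id=}'.split('=')[0] + f'{iso_date=}'.split('=')[0] + '.csv')
# oldest_idiso_date.csv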
Python: remove everything after a specific string and loop through all rows in multiple columns in a dataframe
I have a file full of URL paths like the one below, spanning across 4 columns in a dataframe that I am trying to clean:

Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]

I want to remove everything after a specific string, which I defined as "string1", and I would like to loop through all 4 columns in the dataframe defined as "df_MasterData":

string1 = "&FolderCTID"

import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: delete everything after string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]

I searched and googled and found similar solutions, but none of them work. Can any guru shed some light on this? Any assistance is appreciated.

Added below are a few example rows in columns A and B for these URLs:

Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']

Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it. First declare a variable with your target columns. Then use stack() and str.split to get your target output. Finally, unstack and reapply the output to your original df.

cols_to_slice = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
string1 = "&FolderCTID"
df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)

Note the [0]: after the split, column 0 holds the part before string1, which is the part you want to keep. If you want to replace these columns in your target df, then simply do:

df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
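A tiny end-to-end check of the stack/split/unstack pattern, with made-up URLs standing in for the real data:

import pandas as pd

string1 = "&FolderCTID"
df = pd.DataFrame({
    'Column_A': ['https://x/a?RootFolder=A' + string1 + '=0x01&View=1'],
    'Column_B': ['https://x/b?RootFolder=B' + string1 + '=0x02&View=2'],
})
cols_to_slice = ['Column_A', 'Column_B']

# keep the part before string1 in every cell, restoring the original shape
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
print(df)
#                    Column_A                  Column_B
# 0  https://x/a?RootFolder=A  https://x/b?RootFolder=B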
You should first get the index of the string:

# final position of string1 within each value (this keeps string1 in the result)
indexes = len(string1) + df_MasterData[i].str.find(string1)
# if you don't want string1 in the result, drop the offset and use this one instead
indexes = df_MasterData[i].str.find(string1)

Now slice each value up to its index (.str[:n] needs a scalar, so slice row by row):

df_MasterData[i] = [s[:idx] for s, idx in zip(df_MasterData[i], indexes)]
Loop filters containing a list of keywords (one keyword each time) in a specific column in Pandas
I want to use a loop to perform filters containing a list of different keywords (i.e. different reference numbers) in a specific column (i.e. CaseText), but filter just one keyword each time, so that I do not have to change the keyword manually each time. I want to see the result in the form of a dataframe. Unfortunately, my code doesn't work: it just returns the whole dataset. Can anyone find out what's wrong with my code? Additionally, it would be great if the resulting table were broken into separate results for each keyword. Many thanks.

import pandas as pd
pd.set_option('display.max_colwidth', 0)

list_of_files = ['https://open.barnet.gov.uk/download/2nq32/c1d/Open%20Data%20Planning%20Q1%2019-20%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/9wj/Open%20Data%20Planning%202018-19%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/my7/Planning%20Decisions%202017-2018%20non%20geo.csv',
                 'https://open.barnet.gov.uk/download/2nq32/303/Planning%20Decisions%202016-2017%20non%20geo.csv',
                 'https://open.barnet.gov.uk/download/2nq32/zf1/Planning%20Decisions%202015-2016%20non%20geo.csv',
                 'https://open.barnet.gov.uk/download/2nq32/9b3/Open%20Data%20Planning%202014-2015%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/6zz/Open%20Data%20Planning%202013-2014%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/r7m/Open%20Data%20Planning%202012-2013%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/fzw/Open%20Data%20Planning%202011-2012%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/x3w/Open%20Data%20Planning%202010-2011%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/tbc/Open%20Data%20Planning%202009-2010%20-%20NG.csv']

data_container = []
for filename in list_of_files:
    print(filename)
    df = pd.read_csv(filename, encoding='mac_roman')
    data_container.append(df)
all_data = pd.concat(data_container)

reference_list = ['H/04522/11', '15/07697/FUL']

# I want to filter the dataset with a single keyword each time,
# because I have nearly 70 keywords to filter.
select_data = pd.DataFrame()
for keywords in reference_list:
    select_data = select_data.append(all_data[all_data['CaseText'].str.contains("reference_list", na=False)])

select_data = select_data[['CaseReference', 'CaseDate', 'ServiceTypeLabel', 'CaseText',
                           'DecisionDate', 'Decision', 'AppealRef']]
select_data.drop_duplicates(keep='first', inplace=True)
select_data
One of the problems is that the items in reference_list do not match any of the values in the 'CaseReference' column. Once you figure out which CaseReference numbers you want to search for, the code below should work for you. Just put the correct CaseReference numbers in the reference_list.

import pandas as pd

url = ('https://open.barnet.gov.uk/download/2nq32/fzw/'
       'Open%20Data%20Planning%202011-2012%20-%20NG.csv')
data = pd.read_csv(url, encoding='mac_roman')

reference_list = ['hH/02159/13', '16/4324/FUL']

select_data = pd.DataFrame()
for keywords in reference_list:
    select_data = select_data.append(data[data['CaseReference'] == keywords],
                                     ignore_index=True)

select_data = select_data[['CaseDate', 'ServiceTypeLabel', 'CaseText',
                           'DecisionDate', 'Decision', 'AppealRef']]
select_data.drop_duplicates(keep='first', inplace=True)
select_data
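As a side note, the append loop can be collapsed into a single isin filter — a minimal sketch using the same reference numbers:

# isin matches any of the references at once, no loop or append needed
select_data = data[data['CaseReference'].isin(reference_list)]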
This should work:

import pandas as pd
pd.set_option('display.max_colwidth', 0)

list_of_files = ['https://open.barnet.gov.uk/download/2nq32/c1d/Open%20Data%20Planning%20Q1%2019-20%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/9wj/Open%20Data%20Planning%202018-19%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/my7/Planning%20Decisions%202017-2018%20non%20geo.csv',
                 'https://open.barnet.gov.uk/download/2nq32/303/Planning%20Decisions%202016-2017%20non%20geo.csv',
                 'https://open.barnet.gov.uk/download/2nq32/zf1/Planning%20Decisions%202015-2016%20non%20geo.csv',
                 'https://open.barnet.gov.uk/download/2nq32/9b3/Open%20Data%20Planning%202014-2015%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/6zz/Open%20Data%20Planning%202013-2014%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/r7m/Open%20Data%20Planning%202012-2013%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/fzw/Open%20Data%20Planning%202011-2012%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/x3w/Open%20Data%20Planning%202010-2011%20-%20NG.csv',
                 'https://open.barnet.gov.uk/download/2nq32/tbc/Open%20Data%20Planning%202009-2010%20-%20NG.csv']

# this takes some time
df = pd.concat([pd.read_csv(el, engine='python') for el in list_of_files])  # read all csvs

reference_list = ['H/04522/11', '15/07697/FUL']

# Create an empty dictionary. We will populate it so that each key is a keyword and each
# value is a dataframe filtered for rows where 'CaseText' contains that keyword.
reference_dict = dict()

for el in reference_list:
    reference_dict[el] = df[(df['CaseText'].str.contains(el)) & ~(df['CaseText'].isna())]

# Notice the two conditions:
# 1) the column CaseText should contain the keyword: (df['CaseText'].str.contains(el))
# 2) some elements in CaseText are NaN, so they need to be excluded,
#    which is what ~(df['CaseText'].isna()) does

# You can see the resulting dataframes like so: reference_dict[keyword]. For example:
reference_dict['H/04522/11']

UPDATE

If you want one dataframe that includes the cases where any of the keywords appears in the column CaseText, try this:

# let's start after having read in the data
# separate your keywords with | in one string
keywords = 'H/04522/11|15/07697/FUL'  # read up on regular expressions to understand this
final_df = df[(df['CaseText'].str.contains(keywords)) & ~(df['CaseText'].isna())]
final_df
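One defensive tweak worth noting: str.contains treats the pattern as a regex, so if any keyword contains regex metacharacters, escape it before joining — a small sketch of the same filter:

import re

# build the pattern safely from the list; na=False drops the NaN rows
# without a separate isna() check
keywords = '|'.join(re.escape(k) for k in reference_list)
final_df = df[df['CaseText'].str.contains(keywords, na=False)]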
Python - Pandas library returns wrong column values after parsing a CSV file
SOLVED: Found the solution by myself. It turns out that when you want to retrieve specific columns by their names, you should pass the names in the order they appear inside the csv (which is really stupid for a library that is intended to save some parsing time for a developer, IMO). Correct me if I am wrong, but I don't see an option to get a specific column's values by its name if the columns are in a different order.

I am trying to read a comma separated value file with python and then parse it using the Pandas library. Since the file has many values (columns) that are not needed, I make a list of the column names I do need. Here's a look at the csv file format:

Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Attendance,Referee,HS,AS,HST,AST,HHW,AHW,HC,AC,HF,AF,HO,AO,HY,AY,HR,AR,HBP,ABP,GBH,GBD,GBA,IWH,IWD,IWA,LBH,LBD,LBA,SBH,SBD,SBA,WHH,WHD,WHA
E0,19/08/00,Charlton,Man City,4,0,H,2,0,H,20043,Rob Harris,17,8,14,4,2,1,6,6,13,12,8,6,1,2,0,0,10,20,2,3,3.2,2.2,2.9,2.7,2.2,3.25,2.75,2.2,3.25,2.88,2.1,3.2,3.1
E0,19/08/00,Chelsea,West Ham,4,2,H,1,0,H,34914,Graham Barber,17,12,10,5,1,0,7,7,19,14,2,3,1,2,0,0,10,20,1.47,3.4,5.2,1.6,3.2,4.2,1.5,3.4,6,1.5,3.6,6,1.44,3.6,6.5
E0,19/08/00,Coventry,Middlesbrough,1,3,A,1,1,D,20624,Barry Knight,6,16,3,9,0,1,8,4,15,21,1,3,5,3,1,0,75,30,2.15,3,3,2.2,2.9,2.7,2.25,3.2,2.75,2.3,3.2,2.75,2.3,3.2,2.62
E0,19/08/00,Derby,Southampton,2,2,D,1,2,A,27223,Andy D'Urso,6,13,4,6,0,0,5,8,11,13,0,2,1,1,0,0,10,10,2,3.1,3.2,1.8,3,3.5,2.2,3.25,2.75,2.05,3.2,3.2,2,3.2,3.2
E0,19/08/00,Leeds,Everton,2,0,H,2,0,H,40010,Dermot Gallagher,17,12,8,6,0,0,6,4,21,20,6,1,1,3,0,0,10,30,1.65,3.3,4.3,1.55,3.3,4.5,1.55,3.5,5,1.57,3.6,5,1.61,3.5,4.5
E0,19/08/00,Leicester,Aston Villa,0,0,D,0,0,D,21455,Mike Riley,5,5,4,3,0,0,5,4,12,12,1,4,2,3,0,0,20,30,2.15,3.1,2.9,2.3,2.9,2.5,2.35,3.2,2.6,2.25,3.25,2.75,2.4,3.25,2.5
E0,19/08/00,Liverpool,Bradford,1,0,H,0,0,D,44183,Paul Durkin,16,3,10,2,0,0,6,1,8,8,5,0,1,1,0,0,10,10,1.25,4.1,7.2,1.25,4.3,8,1.35,4,8,1.36,4,8,1.33,4,8

This list is passed to pandas.read_csv()'s names parameter. See code:
# Returns an array of the column names needed for our raw data table
def cols_to_extract():
    cols_to_use = [None] * RawDataCols.COUNT
    cols_to_use[RawDataCols.DATE] = 'Date'
    cols_to_use[RawDataCols.HOME_TEAM] = 'HomeTeam'
    cols_to_use[RawDataCols.AWAY_TEAM] = 'AwayTeam'
    cols_to_use[RawDataCols.FTHG] = 'FTHG'
    cols_to_use[RawDataCols.HG] = 'HG'
    cols_to_use[RawDataCols.FTAG] = 'FTAG'
    cols_to_use[RawDataCols.AG] = 'AG'
    cols_to_use[RawDataCols.FTR] = 'FTR'
    cols_to_use[RawDataCols.RES] = 'Res'
    cols_to_use[RawDataCols.HTHG] = 'HTHG'
    cols_to_use[RawDataCols.HTAG] = 'HTAG'
    cols_to_use[RawDataCols.HTR] = 'HTR'
    cols_to_use[RawDataCols.ATTENDANCE] = 'Attendance'
    cols_to_use[RawDataCols.HS] = 'HS'
    cols_to_use[RawDataCols.AS] = 'AS'
    cols_to_use[RawDataCols.HST] = 'HST'
    cols_to_use[RawDataCols.AST] = 'AST'
    cols_to_use[RawDataCols.HHW] = 'HHW'
    cols_to_use[RawDataCols.AHW] = 'AHW'
    cols_to_use[RawDataCols.HC] = 'HC'
    cols_to_use[RawDataCols.AC] = 'AC'
    cols_to_use[RawDataCols.HF] = 'HF'
    cols_to_use[RawDataCols.AF] = 'AF'
    cols_to_use[RawDataCols.HFKC] = 'HFKC'
    cols_to_use[RawDataCols.AFKC] = 'AFKC'
    cols_to_use[RawDataCols.HO] = 'HO'
    cols_to_use[RawDataCols.AO] = 'AO'
    cols_to_use[RawDataCols.HY] = 'HY'
    cols_to_use[RawDataCols.AY] = 'AY'
    cols_to_use[RawDataCols.HR] = 'HR'
    cols_to_use[RawDataCols.AR] = 'AR'
    return cols_to_use

# Extracts raw data from the raw data csv and populates the raw match data table in the database
def extract_raw_data(csv):
    # Clear the database table if it has any logs
    # if MatchRawData.objects.count != 0:
    #     MatchRawData.objects.delete()
    cols_to_use = cols_to_extract()
    # Read and parse the csv file
    parsed_csv = pd.read_csv(csv, delimiter=',', names=cols_to_use, header=0)
    for col in cols_to_use:
        values = parsed_csv[col].values
        for val in values:
            print(str(col) + ' --------> ' + str(val))

Where RawDataCols is an IntEnum:

class RawDataCols(IntEnum):
    DATE = 0
    HOME_TEAM = 1
    AWAY_TEAM = 2
    FTHG = 3
    HG = 4
    FTAG = 5
    AG = 6
    FTR = 7
    RES = 8
    ...

The column names are obtained using it. That part of the code works OK. The correct column name is obtained, but after trying to get its values using values = parsed_csv[col].values, pandas returns the values of a wrong column. The wrong column is around 13 indexes away from the one I am trying to get. What am I missing?
You can select columns name-wise. Just use the following line:

values = parsed_csv[["Column Name", "Column Name2"]]

Or you can select them index-wise:

cols = [1, 2, 3, 4]
values = parsed_csv[parsed_csv.columns[cols]]
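For what it's worth, read_csv's usecols parameter also selects columns by name regardless of the order the names are listed in, which addresses the "different order" concern above — a minimal sketch (the filename is assumed):

import pandas as pd

# usecols matches names against the file's own header row,
# so the order of this list does not matter
cols_to_use = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR']
parsed_csv = pd.read_csv('E0.csv', usecols=cols_to_use)
print(parsed_csv['FTR'].values)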
How to increase the dimensions of an array in TensorFlow?
I have a txt file which has 8 columns, and I am selecting 1 column for my feature extraction, which gives me 13 feature values; the shape of the output array will be [1x13]. Similarly, I have 5 txt files in a folder. I want to run a loop so that the returned variable will have 5x13 data.

def loadinfofromfile(directory, sd, channel):
    # subdir selection and read the file names in it for a particular crack type
    subdir, filenames = loadfilenamesindirectory(directory, sd)
    for i in range(5):
        # join the directory, subdirectory and the filename
        loadfile = os.path.join(directory, subdir, filenames[i])
        # load the values of that particular file into a tensor
        fileinfo = tf.constant(np.loadtxt(loadfile), tf.float32)
        # select the particular column data (chosen from crack type, channel no)
        fileinfo_trans = tf.transpose(fileinfo)
        fileinfo_back = tf.gather(fileinfo_trans, channel)
        # extracting features from the selected column data gives [1x13]
        pool = features.pooldata(fileinfo_back)
        poolfinal = tf.concat_v2([tf.expand_dims(pool, 0)], axis=0)
    return poolfinal

In the above function I am able to get [1x13] into the variable 'pool', and I am expecting the size of the variable poolfinal to be [5x13], but I get it as [1x13]. How do I concat in the vertical direction? What is the mistake I made in the loop?
Each loop iteration creates pool and poolfinal from scratch. That's why you see only one row of data in poolfinal. Instead, please try the following:

pools = []
for ...:
    pools.append(...)
poolfinal = tf.concat_v2(pools, axis=0)
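Put back into the original function, the fix would look roughly like this (a sketch: loadfilenamesindirectory and features.pooldata are the asker's own helpers, and tf.concat_v2 was renamed tf.concat in later TensorFlow releases):

def loadinfofromfile(directory, sd, channel):
    subdir, filenames = loadfilenamesindirectory(directory, sd)
    pools = []
    for i in range(5):
        loadfile = os.path.join(directory, subdir, filenames[i])
        fileinfo = tf.constant(np.loadtxt(loadfile), tf.float32)
        # pick the requested column, then extract its 13 features
        fileinfo_back = tf.gather(tf.transpose(fileinfo), channel)
        pool = features.pooldata(fileinfo_back)   # shape [13]
        pools.append(tf.expand_dims(pool, 0))     # shape [1, 13]
    # stack the 5 per-file feature rows into one [5, 13] tensor
    return tf.concat_v2(pools, axis=0)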