I am working on a project that maps queries from our old database system to the new one. I want to be able to read the FROM statements and split them into Database, Schema, Table, and Alias. The data looks like this:
FROM DatabaseA.SchemaA.Table1 tbl
INNER JOIN DatabaseB.SchemaC.Table13 tbl13
ON tbl.column12 = tbl13.column12
I have created a dataframe from the full SQL query and extracted just the FROM statement into a new dataframe (I'm using two different functions):
import pandas as pd

# Reads file - creates dataframe
def readSQL(file_path):
    # convert to a string
    file_path = file_path.decode("utf-8")
    # Set sql_script as a global variable;
    # once sql_script is populated it can
    # be called from other functions
    global sql_script
    # Open file
    # Separate every line
    df = open(file_path, 'r')
    lines = df.readlines()
    df.close()
    # print(lines)
    # Create the dataframe
    sql_script = pd.DataFrame(columns=('StatementDSC', 'ColumnTable'))
    i = 0
    StatementDSC = ""
    ColumnTable = ""
    for line in lines:
        if ('SELECT' in line) or ('FROM' in line) or ('WHERE' in line):
            StatementDSC = line.strip()  # remove \n
        else:  # (',' in line) or ('.' in line):
            ColumnTable = line.strip()  # remove \t and \n
        # Create next row
        sql_script.loc[i] = [StatementDSC, ColumnTable]
        i = i + 1
    return sql_script
# print(sql_script)
# Reads FROM statement, breaks down tables/aliases/joins
def readFROM(full_SQL):
    # Create FROM dataframe
    df = full_SQL.loc[full_SQL['StatementDSC'] == 'FROM']
    # Drop empty rows
    # df.drop(df.ColumnTable == "", axis=0)
    print(df)
    # Split ColumnTable by periods
    # Create new columns:
    # Database, Schema, Table
    split_data = df["ColumnTable"].str.split('.')
    data = split_data.to_list()
    names = ["Database", "Schema", "TableAlias"]
    new_df = pd.DataFrame(data, columns=names)
    print(new_df)
I can get the data to split from this:
StatementDSC ColumnTable
5 FROM
6 FROM PPDAC.PLAN.DOCTORS Docs
7 FROM INNER JOIN PPDAC.PLAN.DOCTORSINFO DocInfo
8 FROM ON Docs.DocID = DocInfo.DocID
To this:
Database Schema TableAlias
0 None None
1 PPDAC PLAN DOCTORS Docs
2 INNER JOIN PPDAC PLAN DOCTORSINFO DocInfo
How do I add another delimiter so these are separated into four columns, splitting on both a period and a space?
I've seen other answers, but they deal with a single line rather than repeating the split across multiple rows.
Use a regex to split on multiple delimiters: put every delimiter inside a character class (the [ brackets ]), in this case a period or a space. See this toy example:
import pandas as pd

row1list = ['1.2.3 4']
row2list = ['1.4.5 6']
row3list = ['2.7.8 9']

df = pd.DataFrame([row1list, row2list, row3list], columns=['ColumnTable'])
df2 = df['ColumnTable'].str.split('[ .]', expand=True)
print(df2)
#    0  1  2  3
# 0  1  2  3  4
# 1  1  4  5  6
# 2  2  7  8  9
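Applied to the FROM dataframe in the question, a possible sketch looks like the following. It assumes any leading join keyword should be stripped first so that every row splits into exactly four parts; the regex used for that stripping is illustrative, not exhaustive for every join variant.

import pandas as pd

df = pd.DataFrame({'ColumnTable': ['PPDAC.PLAN.DOCTORS Docs',
                                   'INNER JOIN PPDAC.PLAN.DOCTORSINFO DocInfo']})

# Remove a leading join keyword, if any, so the remaining text is always
# Database.Schema.Table Alias
clean = df['ColumnTable'].str.replace(r'^\s*(?:INNER |LEFT |RIGHT |FULL |OUTER )*JOIN\s+', '', regex=True)

# Split on either a period or a space into the four target columns
parts = clean.str.split('[ .]', expand=True)
parts.columns = ['Database', 'Schema', 'Table', 'Alias']
print(parts)
#   Database Schema        Table    Alias
# 0    PPDAC   PLAN      DOCTORS     Docs
# 1    PPDAC   PLAN  DOCTORSINFO  DocInfo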
I have the following code, which reads a CSV file and then analyzes it. One patient can have more than one illness, and I need to find how many times each illness appears across all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 minutes. Is there a way to make the query faster?
raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')

data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass",
        "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis",
        "Edema", "Consolidation"]

illnesses = pd.DataFrame({"Finding_Label": [],
                          "Count_of_Patientes_Having": [],
                          "Count_of_Times_Being_Shown_In_An_Image": []})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0

for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of the raw_data dataframe:
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are filtering not only for the rows that have the disease, but also for a specific patient ID. If you have a lot of patients, this query has to run many times. A simpler way is to drop the patient-ID filter and count all the rows that have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case, since you want the number of rows, len is what you are looking for rather than size (size is the number of cells in the dataframe).
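For example, a minimal illustration of the difference:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(len(df))   # 3 -> number of rows
print(df.size)   # 6 -> rows * columns, i.e. number of cells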
Finally, another source of error in your current code was that you were not keeping a count for every patient ID: you needed to increment illnesses.at[index, "Count_of_Patientes_Having"] rather than overwrite it with a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more Pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had the same confusion between size and len.
Likely the reason your code is slow is that you are growing a data frame row by row inside a loop, which can involve repeated in-memory copying. This is reminiscent of general-purpose Python rather than pandas programming, which ideally handles data in block-wise, vectorized operations.
Consider a cross join of your data (assuming a reasonable data size) with the list of illnesses, lining up each Finding Labels value with each illness in the same row, and keep a row only if the longer string contains the illness. Then run a couple of groupby() calls to return the count and the distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills': ills, 'key': 1}), on='key')
                    .drop(columns=['key'])
            )

# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]

# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size

illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
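As a side note, the distinct patient count can also be computed with nunique(), which avoids the helper function; an equivalent sketch over the same cross-joined raw_data:

illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills')['Patient ID'].nunique()})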
To demonstrate, consider the example below with random, seeded input data and its output.
Input data (attempting to mirror the original data)
import numpy as np
import pandas as pd

alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
        "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
        "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

np.random.seed(542019)

raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
                         'Finding Labels': np.core.defchararray.add(
                             np.core.defchararray.add(
                                 np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
                                 np.random.choice(ills, 25).astype('str')),
                             np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
                         })
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2
I have a data frame with 384 rows (and an additional dummy one at the beginning).
Each row has 4 variables I entered manually, 3 fields calculated from those 4 variables,
and 3 fields that compare each calculated variable to the previous row. Each of these fields can take one of two values (basically True/False).
Final goal: I want to arrange the data frame so that each of the 64 possible combinations of the 6 calculated fields (2^6) occurs exactly 6 times (2^6 * 6 = 384).
Each iteration builds a frequency table (pivot), and if any of the counts differs from 6 it breaks and re-randomizes the order.
The problem is that there are 384! - 12*6! possible orderings, and my computer has been running the following script for over 4 days without finding a solution.
import pandas as pd
from numpy import random

# a function that calculates if a row is congruent or in-congruent
def set_cong(df):
    if (df["left"] > df["right"] and df["left_size"] > df["right_size"]) \
            or (df["left"] < df["right"] and df["left_size"] < df["right_size"]):
        return "Cong"
    else:
        return "InC"

# open file and calculate the basic fields
DF = pd.read_csv("generator.csv")
DF["distance"] = abs(DF.right - DF.left)
DF["CR"] = DF.left > DF.right
DF["Cong"] = DF.apply(set_cong, axis=1)

again = 1
# main loop to try and find an optimal order
while again == 1:
    # make a copy of the DF to not have to load it each iteration
    df = DF.copy()
    again = 0
    df["rand"] = [[random.randint(low=1, high=100000)] for i in range(df.shape[0])]
    # 3 of the fields are calculated from the previous row, so the first row is a
    # dummy and needs to stay first when sorted
    df.rand.loc[0] = 0
    Sorted = df.sort_values(['rand'])
    Sorted["Cong_n1"] = Sorted.Cong.eq(Sorted.Cong.shift())
    Sorted["Side_n1"] = Sorted.CR.eq(Sorted.CR.shift())
    Sorted["Dist_n1"] = Sorted.distance.eq(Sorted.distance.shift())
    # here the dummy is deleted
    Sorted = Sorted.drop(0, axis=0)
    grouped = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', "Side_n1"])
    for name, group in grouped:
        if group.shape[0] != 6:
            again = 1
            break

Sorted.to_csv("Edos.csv", sep="\t", index=False)
print("bye")
The data frame looks like this:
left right size_left size_right distance cong CR distance_n1 cong_n1 side_n1
1 6 22 44 5 T F dummy dummy dummy
5 4 44 22 1 T T F T F
2 3 44 22 1 F F T F F
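As an aside, the per-iteration check can be collapsed into a single vectorized comparison instead of looping over the groups; a small sketch, assuming the same Sorted dataframe and grouping columns as in the script above:

# Check all group sizes at once instead of iterating over the groups
counts = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', 'Side_n1']).size()
again = 0 if (counts == 6).all() else 1

This only changes how each shuffle is evaluated, not how many shuffles are needed.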
I have a CSV file:
1 , name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2 , name , Kelvi20-Flipcart, LG-Walmart
3, name , Kenstar-Walmart, Sony-Amazon , Kenstar-Flipcart
4, name , LG18-Walmart, Bravia-Amazon
I need the rows to be rearranged by the websites, i.e. by the part after the '-':
1, name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2, name , , Kelv20-Flipcart, LG-Walmart
3, name , Sony-Amazon, Kenstar-Flipcart ,Kenstar-Walmart
4, name , Bravia-Amazon, ,LG18-Walmart
Is it possible using pandas? Finding the existence of a string, rearranging it, iterating through all rows, and repeating this for the next string? I went through the documentation of Series.str.contains and str.extract but was unable to find a solution.
Using sorted with key
df.iloc[:, 1:].apply(lambda x: sorted(x, key=lambda y: (y == '', y)), axis=1)
2 3 4 5
1 ABC DEF GHI JKL
2 ABC DEF GHI
3 ABC DEF GHI JKL
# df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: sorted(x, key=lambda y: (y == '', y)), axis=1)
Since you mention reindex, I think get_dummies will work:
s = pd.get_dummies(df.iloc[:, 1:], prefix='', prefix_sep='')
s = s.drop('', axis=1)
df.iloc[:, 1:] = s.mul(s.columns).values
df
df
1 2 3 4 5
1 name ABC DEF GHI JKL
2 name ABC DEF GHI
3 name ABC DEF GHI JKL
Assuming the empty value is np.nan:
# Fill in the empty values with some string to allow sorting
df.fillna('NaN', inplace=True)
# Flatten the dataframe, do the sorting and reshape back to a dataframe
pd.DataFrame(list(map(sorted, df.values)))
0 1 2 3
0 ABC DEF GHI JKL
1 ABC DEF GHI NaN
2 ABC DEF GHI JKL
UPDATE
Given the update to the question and the sample data being as follows
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
                   'b': ['1012B-Amazon', 'Kelvi20-Flipcart', 'Kenstar-Walmart', 'LG18-Walmart'],
                   'c': ['2044C-Flipcart', 'LG-Walmart', 'Sony-Amazon', 'Bravia-Amazon'],
                   'd': ['Bosh27-Walmart', np.nan, 'Kenstar-Flipcart', np.nan]})
a possible solution could be
def foo(df, retailer):
    # Find cells that contain the name of the retailer
    mask = df.where(df.apply(lambda x: x.str.contains(retailer)), '')
    # Squash the resulting mask into a series
    col = mask.max(skipna=True, axis=1)
    # Optional: trim the name of the retailer
    col = col.str.replace(f'-{retailer}', '')
    return col

df_out = pd.DataFrame(df['name'])
for retailer in ['Amazon', 'Walmart', 'Flipcart']:
    df_out[retailer] = foo(df, retailer)
resulting in
name Amazon Walmart Flipcart
0 name1 1012B Bosh27 2044C
1 name2 LG Kelvi20
2 name3 Sony Kenstar Kenstar
3 name4 Bravia LG18
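A reshape with melt and pivot is another way to reach the same layout; a rough sketch, where the column names used below (id, c1, c2, c3) are made up for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'name': ['name', 'name', 'name', 'name'],
                   'c1': ['1012B-Amazon', 'Kelvi20-Flipcart', 'Kenstar-Walmart', 'LG18-Walmart'],
                   'c2': ['2044C-Flipcart', 'LG-Walmart', 'Sony-Amazon', 'Bravia-Amazon'],
                   'c3': ['Bosh27-Walmart', np.nan, 'Kenstar-Flipcart', np.nan]})

# Go long, tag each product with its retailer suffix, then pivot so each
# retailer becomes a column
long = df.melt(id_vars=['id', 'name'], value_name='item').dropna(subset=['item'])
long['retailer'] = long['item'].str.split('-').str[-1]
wide = long.pivot(index='id', columns='retailer', values='item').reset_index()
out = df[['id', 'name']].merge(wide, on='id', how='left')
print(out)

This relies on each row having at most one item per retailer, which holds for the sample data shown in the question.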
Edit after Question Update:
This is the abc csv:
1,name,ABC,GHI,DEF,JKL
2,name,GHI,DEF,ABC,
3,name,JKL,GHI,ABC,DEF
This is the company csv (it is necessary to watch the commas carefully):
1,name,1012B-Amazon,2044C-Flipcart,Bosh27-Walmart
2,name,Kelvi20-Flipcart,LG-Walmart,
3,name,Kenstar-Walmart,Sony-Amazon,Kenstar-Flipcart
4,name,LG18-Walmart,Bravia-Amazon,
Here is the code:
import pandas as pd
import numpy as np

# These solutions assume that each non-empty value is not repeated within each row.
# If that is not the case for your data, it would be possible to do some
# transformations so that the non-empty values are unique for each row.

# "get_company" returns the company if the value is non-empty and an
# empty value if the value was empty to begin with:
def get_company(company_item):
    if pd.isnull(company_item):
        return np.nan
    else:
        company = company_item.split('-')[-1]
        return company

# Using the "define_sort_order" function, one can retrieve a template used later
# to sort all rows in the reorder_data_rows function. The template is derived from all
# values, aside from empty values, within the matrix when by_largest_row = False.
# One could also choose the single largest row to serve as the
# template for all other rows to follow. Both options work similarly when
# all rows are subsets of the largest row, i.e. every element in every
# other row (subset) can be found in the largest row (or set).
# When the rows contain unique elements, the difference is whether one wants to
# create a table with all sorted elements serving as the columns, or whether one
# wants to simply exclude elements that are not in the largest row.
# Rather than only returning the original data rows,
# one can get back a novel template with different values from that of the
# original dataset if one uses a function to operate on the template.
def define_sort_order(data, by_largest_row=False, value_filtering_function=None):
    if not by_largest_row:
        if value_filtering_function:
            data = data.applymap(value_filtering_function)
        # data.values returns a numpy array with rows and columns,
        # .flatten() puts all elements in a 1-dim array, and
        # set() gets all unique values in the array
        filtered_values = list(set(data.values.flatten()))
        filtered_values = [data_value for data_value in filtered_values if not_empty(data_value)]
        # sorted returns a list, even with np.arrays as inputs
        model_row = sorted(filtered_values)
    else:
        if value_filtering_function:
            data = data.applymap(value_filtering_function)
        row_lengths = data.apply(lambda data_row: data_row.notnull().sum(), axis=1)
        # locate the numerical index for the row with the most non-empty elements:
        model_row_idx = row_lengths.idxmax()
        # sort and filter the row with the most values:
        filtered_values = list(set(data.iloc[model_row_idx]))
        model_row = [data_value for data_value in sorted(filtered_values) if not_empty(data_value)]
    return model_row

# "not_empty" is used in the above function to filter the model list so that
# no empty elements remain
def not_empty(value):
    return pd.notnull(value) and value not in ['', ' ', None]

# Sorts all elements in each data_row into their corresponding position within the model row.
# Elements in the model row that are missing from the current data_row are replaced with np.nan.
def reorder_data_rows(data_row, model_row, check_by_function=None):
    # Here we apply the same function that we used when we originally determined the
    # ordering of the model_row. We temporarily transform the values of the data row
    # to determine whether the transformed value is in the model row; if so, we find
    # its position and place the original value there.
    if check_by_function:
        # create an empty vector that is the same length as the template (model_row)
        sorted_data_row = [np.nan] * len(model_row)
        data_row = [value for value in data_row.values if not_empty(value)]
        for value in data_row:
            value_lookup = check_by_function(value)
            if value_lookup in model_row:
                idx = model_row.index(value_lookup)
                # place company items in their respective row positions
                # as indicated by the model_row
                sorted_data_row[idx] = value
    else:
        sorted_data_row = [value if value in data_row.values else np.nan for value in model_row]
    return pd.Series(sorted_data_row)
##################### ABC ######################
# Reading the data:
# The file would automatically treat the first row as the header if the
# header=None option were not included. Note: "name" and the 1,2,3 column are not in the index.
abc = pd.read_csv("abc.csv", header=None, index_col=None)

# Returns a sorted, non-empty list. If you hard-code the order you want,
# you can simply pass the hard-coded order as the model_row argument and avoid
# all functions aside from reorder_data_rows.
model_row = define_sort_order(abc.iloc[:, 2:], False)

# Apply the "reorder_data_rows" function to each row before saving back into
# the original dataframe.
# lambda allows us to create our own function without giving it a name;
# it is useful here in order to pass two inputs to reorder_data_rows.
abc.iloc[:, 2:] = abc.iloc[:, 2:].apply(lambda abc_row: reorder_data_rows(abc_row, model_row), axis=1).values

# Save to a new csv that won't include the pandas-created indices (0,1,2)
# or column names (0,1,2,3,4):
abc.to_csv("sorted_abc.csv", header=False, index=False)
################################################

################## COMPANY #####################
company = pd.read_csv("company.csv", header=None, index_col=None)

model_row = define_sort_order(company.iloc[:, 2:], by_largest_row=False,
                              value_filtering_function=get_company)

# The only thing that changes here is that we tell the sort function what specific
# criterion to use to reorder each row. We use the result of the get_company
# function to do so. The custom function get_company takes an input such as
# Kenstar-Walmart and outputs Walmart (what's after the "-"); we then sort by the
# resulting list of companies.
# Because we used the define_sort_order function to retrieve companies rather than
# company items, we need to use the same function to reorder each element in the DataFrame.
company.iloc[:, 2:] = company.iloc[:, 2:].apply(
    lambda companies_row: reorder_data_rows(companies_row, model_row, check_by_function=get_company),
    axis=1).values

company.to_csv("sorted_company.csv", header=False, index=False)
#################################################
Here is the first result from sorted_abc.csv:
1 name ABC DEF GHI JKL
2 name ABC DEF GHI NaN
3 name ABC DEF GHI JKL
After modifying the code into the form asked about above, here is the sorted_company.csv that resulted from running the script.
1 name 1012B-Amazon 2044C-Flipcart Bosh27-Walmart
2 name NaN Kelvi20-Flipcart LG-Walmart
3 name Sony-Amazon Kenstar-Flipcart Kenstar-Walmart
4 name Bravia-Amazon NaN LG18-Walmart
I hope it helps!
I have a made-up pandas series that I split on a delimiter:
s2 = pd.Series(['2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*'])
split = s2.str.split('*')
The general logic to parse this string:
Asterisks are the delimiter
Numbers immediately before asterisks identify the length of the following block
Three indicators
C indicates field names will follow
N indicates new field values will follow
O indicates old field values will follow
Numbers immediately after indicators (tricky because they sit right next to the numbers that precede asterisks) identify how many field names or values will follow
The parsing logic and code already work on a single pandas series, so understanding them is less important than understanding how to apply the logic/code to a dataframe.
I calculate the number of fields in the string (in this case, the 3 in the second block which is C316):
number_of_fields = int(split[0][1][1:int(split[0][0])])
I apply a lot of list splitting to extract the results I need into three separate lists (field names, new values, and old values):
i = 2
string_length = int(split[0][1][int(split[0][0]):])
field_names_list = []
while i < number_of_fields + 2:
    field_name = split[0][i][0:string_length]
    field_names_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 3 + number_of_fields
string_length = int(split[0][2 + number_of_fields][string_length:])
new_values_list = []
while i < 3 + number_of_fields * 2:
    field_name = split[0][i][0:string_length]
    new_values_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 4 + number_of_fields * 2
string_length = int(split[0][3 + number_of_fields * 2][string_length:])
old_values_list = []
while i <= 3 + number_of_fields * 3:
    old_value = split[0][i][0:string_length]
    old_values_list.append(old_value)
    if i == 3 + number_of_fields * 3:
        string_length = 0
    else:
        string_length = int(split[0][i][string_length:])
    i += 1
I combine the lists into a df with three columns:
df = pd.DataFrame(
    {'field_name': field_names_list,
     'new_value': new_values_list,
     'old_value': old_values_list
     })
field_name new_value old_value
0 first_field_name field value
1 second_field_name Y
2 third_field_name hello
How would I apply this same process to a df with multiple strings? The df would look like this:
row_id string
0 24 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
1 25 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
I'm unsure how to maintain the row_id with the eventual columns. The end result should look like this:
row_id field_name new_value old_value
0 24 first_field_name field value
1 24 second_field_name Y
2 24 third_field_name hello
3 25 first_field_name field value
4 25 second_field_name Y
5 25 third_field_name hello
I know I can concatenate multiple dataframes, but that would come after maintaining the row_id. How do I keep the row_id with the corresponding values after a series of list slicing operations?
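One hedged way to keep the row_id is to wrap the single-string parsing above in a helper, build one small dataframe per input row, attach the row_id, and concatenate at the end. In the sketch below, parse_string and df_in are hypothetical names: parse_string is assumed to contain the list-slicing logic shown earlier and to return the three-column dataframe, and df_in is the dataframe with the row_id and string columns.

import pandas as pd

def parse_string(s):
    # Hypothetical wrapper: run the split/list-slicing logic above on one string
    # and return a dataframe with field_name / new_value / old_value columns.
    ...

frames = []
for row_id, string in zip(df_in['row_id'], df_in['string']):
    parsed = parse_string(string)        # one small dataframe per input string
    parsed.insert(0, 'row_id', row_id)   # keep the originating row_id on every row
    frames.append(parsed)

result = pd.concat(frames, ignore_index=True)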