I am a beginner/intermediate Python user, and when I write elaborate code (at least elaborate for me), I always try to rewrite it to reduce the number of lines where possible.
Here is the code I have written.
It basically reads all values of one data frame looking for a specific string; if the string is found, it saves the index and value in a dictionary and drops the rows where that string was found. Then it does the same with the next string...
##### Reading CSV file values and looking for variant IDs ######
# Find variant IDs (rs000000) in the CSV.
# \d+ is necessary in case a line contains "rs" followed by something other than a number; r"rs\d+" matches "rs" followed by digits.
rs = df_draft[df_draft.apply(lambda x: x.str.contains(r"rs\d+"))].dropna(how='all').dropna(axis=1, how='all')
# Now, save the results found in a dict: key=index, value=variant ID
if not rs.empty:
    ind = rs.index.to_list()
    vals = list(rs.stack().values)
    row2rs = dict(zip(ind, vals))
    print(row2rs)
    # The rows where rs was found need to be removed, because if the same row
    # contains more than one variant ID (e.g. rs# and NM_#),
    # this code would pick up the same variant more than once.
    for index, value in row2rs.items():
        # Rows where the substring 'rs' was found are dropped from df_draft to avoid repetition.
        df_draft = df_draft.drop(index)
## Same thing with the other variant IDs
# Here with variant IDs (NM_0000000) in the CSV
NM = df_draft[df_draft.apply(lambda x: x.str.contains(r"NM_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if not NM.empty:
    ind = NM.index.to_list()
    vals = list(NM.stack().values)
    row2NM = dict(zip(ind, vals))
    print(row2NM)
    for index, value in row2NM.items():
        df_draft = df_draft.drop(index)
# Here with variant IDs (NP_0000000) in the CSV
NP = df_draft[df_draft.apply(lambda x: x.str.contains(r"NP_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if not NP.empty:
    ind = NP.index.to_list()
    vals = list(NP.stack().values)
    row2NP = dict(zip(ind, vals))
    print(row2NP)
    for index, value in row2NP.items():
        df_draft = df_draft.drop(index)
# Here with the ClinVar field (RCV#) in the CSV
RCV = df_draft[df_draft.apply(lambda x: x.str.contains(r"RCV\d+"))].dropna(how='all').dropna(axis=1, how='all')
if not RCV.empty:
    ind = RCV.index.to_list()
    vals = list(RCV.stack().values)
    row2RCV = dict(zip(ind, vals))
    print(row2RCV)
    # Note: this loop must iterate over row2RCV (the original iterated over row2NP by mistake).
    for index, value in row2RCV.items():
        df_draft = df_draft.drop(index)
I was wondering about a more elegant way of writing this simple but long code.
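One way to collapse the repetition is to loop over the patterns. A minimal sketch, assuming df_draft contains only string (or NaN) cells; the names patterns and results are illustrative:

# Map a label to each regex so the four near-identical blocks become one loop.
patterns = {"rs": r"rs\d+", "NM": r"NM_\d+", "NP": r"NP_\d+", "RCV": r"RCV\d+"}
results = {}
for name, pattern in patterns.items():
    found = df_draft[df_draft.apply(lambda x: x.str.contains(pattern))].dropna(how='all').dropna(axis=1, how='all')
    if not found.empty:
        # key=index, value=variant ID, exactly as in the blocks above
        results[name] = dict(zip(found.index.to_list(), found.stack().values))
        # drop the matched rows so later patterns cannot re-match them
        df_draft = df_draft.drop(found.index)
print(results)

This would give row2rs, row2NM, etc. as results["rs"], results["NM"], and so on, while keeping the drop-after-match behaviour of the original.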
I am trying to create a new column in a Pandas dataframe which takes only one array from a list of 5 arrays (the list is called cluster_centre) and puts that array into the dataframe. It should take the array at the index that matches the value in the 'labels' column of the same dataframe (which has values of 0, 1, 2, 3 or 4). So for instance, if the sentence in a row was given a label of 2, i.e. the 'labels' column value for that row is 2, then the value of the 'cluster_centres' column in the df for that row should be cluster_centre[2]. How can I do this? The code I have attempted is pasted below:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import pandas as pd
with open('JWN_Nordstrom_MDNA_overview_2017.txt', 'r') as file:
    initial_corpus = file.read()
corpus = initial_corpus.split('. ')
# Extract sentence embeddings
embedder = SentenceTransformer('bert-base-wikipedia-sections-mean-tokens')
corpus_embeddings = embedder.encode(corpus)
# Perform KMeans clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
cluster_centre = clustering_model.cluster_centers_
# Create dataframe
All_data_df = pd.DataFrame()
All_data_df['sentences'] = corpus
All_data_df['embeddings'] = corpus_embeddings
All_data_df['labels'] = cluster_assignment
# The line below creates a ValueError
All_data_df['cluster_centres'] = cluster_centre[All_data_df['labels']]
print(All_data_df.head())
I get this error: ValueError: Wrong number of items passed 768, placement implies 1
UPDATE: I tried something new:
All_data_df = pd.DataFrame()
All_data_df['sentences'] = corpus
All_data_df['embeddings'] = corpus_embeddings
All_data_df['labels'] = cluster_assignment
#All_data_df['cluster_centres'] = 0
for index, row in All_data_df.iterrows():
    iforval = cluster_centre[row['labels']]
    All_data_df.at[index, 'cluster_centres'] = iforval
print(All_data_df.head())
But I get a new error: ValueError: Must have equal len keys and value when setting with an iterable. I printed iforval inside the loop and it does indeed return 29 correct arrays from the cluster_centre list, matching the 29 rows present in the dataframe. Now I just need to put them into the new column of the dataframe, but .at[] didn't work; I'm not sure if I am using it correctly.
EDIT/UPDATE: OK, I found a sort of solution; I don't know why I didn't realise this before. I just created a list beforehand and made that into the new column, which ended up being much simpler.
cluster_centres_list = [cluster_centre[label] for label in cluster_assignment]
all_data_df = pd.DataFrame()
all_data_df['sentences'] = corpus
all_data_df['embeddings'] = corpus_embeddings
all_data_df['labels'] = cluster_assignment
all_data_df['cluster_centres'] = cluster_centres_list
print(all_data_df.head())
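For reference, the original one-liner failed because cluster_centre[All_data_df['labels']] is a 2-D array (one row of 768 values per sentence), and pandas refuses to pack a 2-D array into a single column; hence "Wrong number of items passed 768". A sketch of an equivalent NumPy-style version, assuming cluster_centre is the (5, 768) array returned by KMeans:

# Index the centre array with the whole label column at once;
# wrapping the 2-D result in list() yields one 1-D array per row,
# which pandas will happily store as a single object column.
centres = cluster_centre[all_data_df['labels'].to_numpy()]  # shape (n_rows, 768)
all_data_df['cluster_centres'] = list(centres)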
I'm trying to count, for each row in DataFrame 1, the number of words it shares with every single row in DataFrame 2.
Based on the similarities I want to create a new data frame where the columns are the N rows of DataFrame 2
and the values are the similarity counts.
My current code is working, but it runs very slow. I'm not sure how to optimize it...
df = pd.DataFrame([])
for x in range(10000):
    save = {}
    terms_1 = data['text_tokenized'].iloc[x]
    save['code'] = data['code'].iloc[x]
    for y in range(3000):
        terms_2 = data2['terms'].iloc[y]
        similar_n = len(list(terms_2.intersection(terms_1)))
        save[data2['code'].iloc[y]] = similar_n
    df = df.append(pd.DataFrame([save]))
Update: new code (still runs slowly):
def get_sim(x, terms):
    similar_n = len(list(x.intersection(terms)))
    return similar_n

for index in icd10_terms.itertuples():
    code, terms = index[1], index[2]
    data[code] = data['text_tokenized'].apply(get_sim, args=(terms,))
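If that is still too slow, one direction worth trying (a sketch, assuming both 'text_tokenized' and 'terms' hold Python sets of strings; the variable names come from the code above) is to encode every row as a sparse 0/1 vector over a shared vocabulary and get all pairwise intersection counts from a single matrix product:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Fit one vocabulary over both columns so the 0/1 matrices share an axis.
mlb = MultiLabelBinarizer(sparse_output=True)
mlb.fit(list(data['text_tokenized']) + list(data2['terms']))
A = mlb.transform(data['text_tokenized'])  # shape (n_rows_1, vocab_size)
B = mlb.transform(data2['terms'])          # shape (n_rows_2, vocab_size)

# A @ B.T counts the shared words for every row pair in one shot.
counts = (A @ B.T).toarray()               # shape (n_rows_1, n_rows_2)
result = pd.DataFrame(counts, index=data['code'], columns=data2['code'])

This replaces the 10000 × 3000 Python-level loop with one sparse matrix multiplication.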
I am making a generic tool which can take any CSV file. The file contains a city column which needs to be geocoded to latitudes and longitudes. I have a CSV file which looks something like this. The first row holds the column names and the second row the type of each variable.
Time,M1,M2,M3,CityName
temp,num,num,num,loc
20-May-13,19,20,0,delhi
20-May-13,25,42,7,agra
20-May-13,23,35,4,mumbai
20-May-13,21,32,3,delhi
20-May-13,17,27,1,mumbai
20-May-13,16,40,5,delhi
First of all, I find the unique values in the city column and form a list from them.
filename = 'data_file.csv'
data_date = pd.read_csv(filename)
# .ix has been removed from pandas; .loc does the same column selection here.
# The type row in the sample marks the city column as "loc".
column_name = data_date.loc[:, data_date.loc[0] == "loc"]
column_work = column_name.iloc[1:]
# The city column is the only "loc" column in the sample, so it sits at position 0.
column_unique = column_work.iloc[:, 0].unique().tolist()
Secondly, I have written code for geocoding my cities.
import sys  # needed for sys.exc_info() below

def geocode(address):
    i = 0
    try:
        while i < len(geocoders):
            # try to geocode using a service
            location = geocoders[i].geocode(address)
            # if it returns a location
            if location is not None:
                # return those values
                return [location.latitude, location.longitude]
            else:
                # otherwise try the next one
                i += 1
    except:
        print(sys.exc_info()[0])
        return ['null', 'null']
    # if all services have failed to geocode, return null values
    return ['null', 'null']
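For context, geocoders is assumed to be a list of geopy geocoder instances tried in order; a hypothetical setup might look like:

from geopy.geocoders import Nominatim, ArcGIS

# Any geopy services work here; these two are just illustrative fallbacks.
geocoders = [Nominatim(user_agent="csv-city-geocoder"), ArcGIS()]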
cities = ['delhi', 'agra', 'mumbai']  # renamed from "list", which shadows the built-in
j = 0
lat = []
dout = []  # collects cities that failed to geocode
for row in cities:
    print('processing #', j)
    j += 1
    try:
        state = row
        address = state
        result = geocode(address)
        # add the lat/lon values to the row
        lat.extend(result)
    except:
        # print 'Unsuccessful'
        to_print = 'Unsuccessful'
        # row.extend(to_print)
        dout.append(row)
print(lat)
This gives me a flat list of latitudes and longitudes: [28.7040592, 77.10249019999999, 27.1766701, 78.00807449999999, 19.0759837, 72.8776559]. I want to write this to my CSV file as:
Time,M1,M2,M3,CityName,Latitude,Longitude
temp,num,num,num,loc,lat,lng
20-May-13,19,20,0,delhi,28.7040592,77.10249019999999
20-May-13,25,42,7,agra,27.1766701,78.00807449999999
20-May-13,23,35,4,mumbai,19.0759837, 72.8776559
20-May-13,21,32,3,delhi,28.7040592,77.10249019999999
20-May-13,17,27,1,mumbai,19.0759837, 72.8776559
20-May-13,16,40,5,delhi,28.7040592,77.10249019999999
I tried making separate lists of latitudes and longitudes (latitude = lat[0::2], longitude = lat[1::2]) and converting them into a dictionary {'delhi': [28.7040592, 77.10249019999999], 'agra': [27.1766701, 78.00807449999999], 'mumbai': [19.0759837, 72.8776559]}, but somehow could not figure out how to write it to a CSV file.
I think converting them into a dictionary is a good approach.
dic = {'delhi': [28.7040592, 77.10249019999999],
'agra': [27.1766701, 78.00807449999999],
'mumbai': [19.0759837, 72.8776559]}
# Create new columns
data_date["Latitude"] = data_date.apply(lambda row: dic.get(row["CityName"])[0], axis = 1)
data_date["Longitude"] = data_date.apply(lambda row: dic.get(row["CityName"])[1], axis = 1)
# Write the data back to csv file
data_date.to_csv(filename, index = False)
In this way it looks up the values for each city name in the dictionary and writes them into the new columns. Finally, it overwrites the old CSV file with the new data frame.
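If you'd rather not type the dictionary out by hand, it can be built from the flat lat list and the unique city names (a sketch, assuming lat alternates latitude/longitude in the same order as column_unique):

# Pair each city with its (lat, lng) tuple taken from the flat list.
dic = dict(zip(column_unique, zip(lat[0::2], lat[1::2])))
# e.g. {'delhi': (28.7040592, 77.10249019999999), 'agra': (27.1766701, 78.00807449999999), ...}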
I have a Python script that builds inputs for a Google chart. It correctly creates the column headers and the correct number of rows, but repeats the data from the last row in every row. I tried explicitly setting the row indices rather than using a loop (which wouldn't work in practice, but should have worked in testing), and it still gives me the same values for each entry. I also had it working when this code was on the same page as the HTML user form.
end1 = number of rows in the data table
end2 = number of columns in the data table represented by a list of column headers
viewData = data stored in database
c = connections['default'].cursor()
c.execute("SELECT * FROM {0}.\"{1}\"".format(analysis_schema, viewName))
viewData=c.fetchall()
curDesc = c.description
end1 = len(viewData)
end2 = len(curDesc)
Creates column headers:
colOrder = [curDesc[2][0]]
if activityOrCommodity == "activity":
    tableDescription = {curDesc[2][0]: ("string", "Activity")}
elif (activityOrCommodity == "commodity") or (activityOrCommodity == "aa_commodity"):
    tableDescription = {curDesc[2][0]: ("string", "Commodity")}
for i in range(3, end2):
    attValue = curDesc[i][0]
    tableDescription[curDesc[i][0]] = ("number", attValue)
    colOrder.append(curDesc[i][0])
Creates row data:
data = []
values = {}
for i in range(0, end1):
    for j in range(2, end2):
        if j == 2:
            values[curDesc[j][0]] = viewData[i][j].encode("utf-8")
        else:
            values[curDesc[j][0]] = viewData[i][j]
    data.append(values)
dataTable = gviz_api.DataTable(tableDescription)
dataTable.LoadData(data)
return dataTable.ToJSon(columns_order=colOrder)
An example javascript output:
var dt = new google.visualization.DataTable({cols:[{id:'activity',label:'Activity',type:'string'},{id:'size',label:'size',type:'number'},{id:'compositeutility',label:'compositeutility',type:'number'}],rows:[{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]}]}, 0.6);
It seems you're appending values to data, but values is never reset between iterations, so every entry in data points at the same dict object, which ends up holding the last row's data.
I assume this is not intended, right? If so, just move values inside the outer for loop in your row-building code, as sketched below.
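A minimal sketch of the fix, using the same names as the code above:

data = []
for i in range(0, end1):
    values = {}  # fresh dict per row, so rows no longer share state
    for j in range(2, end2):
        if j == 2:
            values[curDesc[j][0]] = viewData[i][j].encode("utf-8")
        else:
            values[curDesc[j][0]] = viewData[i][j]
    data.append(values)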