I would like to pass 'n' cities to travel to, along with the corresponding days spent in each city, to a function that returns a df with all possible permutations of the journey. The kayak_search_url column in the df should contain this string in the first row:
https://www.kayak.com/flights/AMS-WAW,nearby/2023-02-14/WAW-BOG,nearby/2023-02-17/BOG-MIL,nearby/2023-02-20/MIL-SDQ,nearby/2023-02-23/SDQ-AMS,nearby/2023-02-25/?sort=bestflight_a
...but instead contains this string:
https://www.kayak.com/flights/AMS-WAW,nearby/2023-02-14/AMS-BOG,nearby/2023-02-17/AMS-MIL,nearby/2023-02-20/AMS-SDQ,nearby/2023-02-23/AMS,nearby/2023-02-25/?sort=bestflight_a
I can't figure out why the origin code 'AMS' shows up instead of the chain of cities. Here's the code:
# List the cities you want to travel to and from, how long you'd like to stay in each, and the appropriate start/end dates
start_city = 'Amsterdam'
end_city = 'Amsterdam'
start_date = '2023-02-14'
cities = ['Warsaw', 'Bogota', 'Milan', 'Santo Domingo']
days = [3,3,3,2]
def generate_permutations(cities, days, start_city, end_city, start_date):
    city_to_days = dict(zip(cities, days))
    permutations = list(itertools.permutations(cities))
    df = pd.DataFrame(permutations, columns=['city' + str(i) for i in range(1, len(cities) + 1)])
    df['origin'] = start_city
    df['end'] = end_city
    first_column = df.pop('origin')
    df.insert(0, 'origin', first_column)
    st_dt = pd.to_datetime(start_date)
    df = df.assign(flight_dt_1=st_dt)
    for i in range(len(cities)):
        df['flight_dt_' + str(i + 2)] = df['flight_dt_' + str(i + 1)] + df['city' + str(i + 1)].map(city_to_days).map(lambda x: pd.Timedelta(days=x))
    # IATA city code dictionary from iata_code.csv file in repo and create Kayak 'url' column for each permutation
    iata = {'Amsterdam': 'AMS',
            'Warsaw': 'WAW',
            'Bogota': 'BOG',
            'Milan': 'MIL',
            'Santo Domingo': 'SDQ'}
    url = 'https://www.kayak.com/flights/'
    df['kayak_search_url'] = df.apply(lambda x: url + ''.join([iata[x['origin']] + '-' + iata[x['city' + str(i+1)]] +
                                                ',nearby/' + str(x['flight_dt_' + str(i+1)].strftime("%Y-%m-%d")) + '/'
                                                for i in range(len(cities))]) + iata[x['end']] + ',nearby/' + str(x['flight_dt_' + str(len(cities) + 1)].strftime("%Y-%m-%d")) +
                                                '/?sort=bestflight_a', axis=1)
    return df
Let's break down the desired URL to highlight its structure:
https://www.kayak.com/flights
/AMS-WAW,nearby/2023-02-14
/WAW-BOG,nearby/2023-02-17
/BOG-MIL,nearby/2023-02-20
/MIL-SDQ,nearby/2023-02-23
/SDQ-AMS,nearby/2023-02-25
/?sort=bestflight_a
Obviously only the middle section needs to be generated, as the other parts are static. We can also generate that middle section before constructing the dataframe:
def generate_permutations(cities, days, start_city, end_city, start_date):
    iata = {
        "Amsterdam": "AMS",
        "Warsaw": "WAW",
        "Bogota": "BOG",
        "Milan": "MIL",
        "Santo Domingo": "SDQ",
    }
    permutations = [
        (start_city,) + p + (end_city,) for p in itertools.permutations(cities)
    ]
    flight_dates = pd.to_datetime(start_date) + pd.to_timedelta(
        np.array([0] + days).cumsum(),
        unit="D",
    )
    # Generate the URLs
    urls = []
    for p in permutations:
        # The pattern for each segment is
        # START-END,nearby/yyyy-mm-dd
        mid_url = "/".join(
            [
                f"{iata[s]}-{iata[e]},nearby/{fd:%Y-%m-%d}"
                for s, e, fd in zip(p[:-1], p[1:], flight_dates)
            ]
        )
        urls.append(f"https://www.kayak.com/flights/{mid_url}/?sort=bestflight_a")
    # Generate the resulting dataframe
    return (
        pd.DataFrame(
            permutations,
            columns=["origin", *[f"city{i+1}" for i in range(len(cities))], "end"],
        )
        .merge(
            pd.DataFrame(
                flight_dates,
                index=[f"flight_dt_{i+1}" for i in range(len(flight_dates))],
            ).T,
            how="cross",
        )
        .assign(kayak_search_url=urls)
    )
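For reference, a quick sanity check using the inputs from the question (the itertools, numpy and pandas imports are assumed):

import itertools
import numpy as np
import pandas as pd

df = generate_permutations(cities, days, start_city, end_city, start_date)
print(df.loc[0, 'kayak_search_url'])
# The first permutation keeps the input city order, so the first row now chains
# WAW -> BOG -> MIL -> SDQ and should match the desired URL from the question.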
I have a problem.
After fetching my .csv file in Python I keep getting the following error:
ValueError: Location values cannot contain NANs.
My code looks like this:
df = pd.read_csv("surveyed.csv")
fc = folium.FeatureGroup(name="Tbs", overlay=True)
cf_survey_cluster = MarkerCluster(name="Tbs").add_to(map)
for i, row in df.iterrows():
    city = df.at[i, 'City']
    address = df.at[i, 'Address']
    postcode = df.at[i, 'Post Code']
    dead = df.at[i, 'Deadline']
    lat = df.at[i, 'Latitude']
    lng = df.at[i, 'Longitude']
    popup = '<b>CITY: </b>' + str(city) + '<br>' + '<b>ADDRESS: </b>' + str(address) + ', ' + str(postcode) + '<br>' + '<b>DEADLINE: </b>' + str(dead)
    cf_survey_marker = folium.Marker(location=[lat, lng], popup=popup, icon=folium.Icon(color='green', icon='glyphicon-calendar'))
My .csv file is fine, no gaps seen at all.
I searched for the following error:
ValueError: Location values cannot contain NaNs, got: [nan, nan]
but I don't know how to apply an isnull check in my code. I tried:
lat = df[df.isnull(at[i, 'Latitude'])]
but now the error shows:
The value at is not defined.
Is there any way to fix this?
UPDATE:
This approach also doesn't work:
df = pd.read_csv("surveyed.csv")
fc = folium.FeatureGroup(name="To be surveyed", overlay=True)
cf_survey_cluster = MarkerCluster(name="To be surveyed").add_to(map)
for i, row in df.iterrows():
    city = df.at[i, 'City']
    address = df.at[i, 'Address']
    postcode = df.at[i, 'Post Code']
    dead = df.at[i, 'Deadline']
    #lat = df.at[i, 'Latitude']
    #lng = df.at[i, 'Longitude']
    latlon = df.dropna(subset=['Longitude', 'Latitude'])
    popup = '<b>CITY: </b>' + str(city) + '<br>' + '<b>ADDRESS: </b>' + str(address) + ', ' + str(postcode) + '<br>' + '<b>DEADLINE: </b>' + str(dead)
    cf_survey_marker = folium.Marker(location=[latlon], popup=popup, icon=folium.Icon(color='green', icon='glyphicon-calendar'))
as I get an error:
ValueError: Expected two (lat, lon) values for location, instead got: [ City Address ... Latitude Longitude
You can use the dropna() function to remove NaN values. df.dropna(axis='columns') would drop whole columns that contain NaN; here you want to drop the rows that are missing coordinates instead:
df = df.dropna(subset=['Longitude', 'Latitude'])
df = pd.read_csv("surveyed.csv")
df = df.dropna(subset=['Longitude', 'Latitude'])
fc = folium.FeatureGroup(name="To be surveyed", overlay=True)
cf_survey_cluster = MarkerCluster(name="To be surveyed").add_to(map)
for i, row in df.iterrows():
    city = df.at[i, 'City']
    address = df.at[i, 'Address']
    postcode = df.at[i, 'Post Code']
    dead = df.at[i, 'Deadline']
    lat = df.at[i, 'Latitude']
    lng = df.at[i, 'Longitude']
    popup = '<b>CITY: </b>' + str(city) + '<br>' + '<b>ADDRESS: </b>' + str(address) + ', ' + str(postcode) + '<br>' + '<b>DEADLINE: </b>' + str(dead)
    cf_survey_marker = folium.Marker(location=[lat, lng], popup=popup, icon=folium.Icon(color='green', icon='glyphicon-calendar'))
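If you would rather keep the incomplete rows in the dataframe and just skip them when placing markers, a guard inside the loop works too (a sketch; same column names as above):

for i, row in df.iterrows():
    if pd.isna(row['Latitude']) or pd.isna(row['Longitude']):
        continue  # no usable coordinates for this row, so no marker
    folium.Marker(location=[row['Latitude'], row['Longitude']],
                  popup=str(row['City']),
                  icon=folium.Icon(color='green', icon='glyphicon-calendar')).add_to(cf_survey_cluster)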
I have a pandas dataframe that I have extracted from a JSON file for breweries I'm interested in. Most of these columns are nested lists of dictionaries. However, two columns, 'hours' and 'memberships', are being problematic.
I'd like to extract the 'hours' column into 7 columns: 'Mon_Hours', 'Tue_Hours', ... 'Sun_Hours'.
I have tried and tried to figure this out, but these two columns are proving challenging.
Here is a link to the initial data: https://www.coloradobrewerylist.com/wp-json/cbl_api/v1/locations/?location-type%5Bnin%5D=404,405&page_size=1000&page_token=1
and here is my code:
import requests
import re
import pandas as pd
import numpy as np
import csv
import json
from datetime import datetime
### get the data from the Colorado Brewery list
url = "https://www.coloradobrewerylist.com/wp-json/cbl_api/v1/locations/?location-type%5Bnin%5D=404,405&page_size=1000&page_token=1"
payload={}
headers = {}
response = requests.request("GET", url, headers=headers, data=payload)
data=response.json()
### convert results to table
pd.set_option('display.max_columns', None)
brewdf = pd.DataFrame.from_dict(data['results'])
#brewdf
############################################
#### CLEAN UP NESTED LIST-DICT COLUMNS #####
############################################
## cleanup dogs column
dogs = pd.json_normalize(brewdf['dogs'])
dogs2 = dogs.squeeze()
dogsdf = pd.json_normalize(dogs2)
dogsdf = dogsdf.drop(columns =['id','slug'])
dogsdf = dogsdf.rename(columns={'name':'dogs_allowed'})
#dogsdf
## cleanup parking column
parking = pd.json_normalize(brewdf['parking'])
parking = parking.rename(columns = {0:'Parking1',1:'Parking2',2:'Parking3'})
a = pd.json_normalize(parking['Parking1'])
b = pd.json_normalize(parking['Parking2'])
c = pd.json_normalize(parking['Parking3'])
parkcombo = pd.concat([a,b,c],ignore_index=True, axis=1)
parkcombo = parkcombo.rename(columns = {2:'P1',5:'P2',8:'P3'})
parkcombo['parking_type'] = parkcombo['P1'].map(str) + ',' + parkcombo['P2'].map(str) + ',' + parkcombo['P3'].map(str)
parkcombo['parking_type'] = parkcombo['parking_type'].str.replace(",nan",'')
parkdf = parkcombo['parking_type'].to_frame()
#parkdf
## cleanup food type column
food = pd.json_normalize(brewdf['food_type'])
food
food = food.rename(columns = {0:'Food1',1:'Food2',2:'Food3',3:'Food4',4:'Food5',5:'Food6'})
a = pd.json_normalize(food['Food1'])
b = pd.json_normalize(food['Food2'])
c = pd.json_normalize(food['Food3'])
d = pd.json_normalize(food['Food4'])
e = pd.json_normalize(food['Food5'])
f = pd.json_normalize(food['Food6'])
foodcombo = pd.concat([a,b,c,d,e,f],ignore_index=True, axis =1)
foodcombo
foodcombo = foodcombo.rename(columns = {2:'F1',5:'F2',8:'F3',11:'F4',14:'F5',17:'F6'})
foodcombo['food_type'] = foodcombo['F1'].map(str) + ',' + foodcombo['F2'].map(str) + ',' + foodcombo['F3'].map(str) + ',' + foodcombo['F4'].map(str)+ ',' + foodcombo['F5'].map(str) + ',' + foodcombo['F6'].map(str)
foodcombo['food_type'] = foodcombo['food_type'].str.replace(",nan",'')
fooddf = foodcombo['food_type'].to_frame()
#fooddf
## cleanup patio column
patio = pd.json_normalize(brewdf['patio'])
patio = patio.rename(columns = {0:'P1',1:'P2',2:'P3'})
a = pd.json_normalize(patio['P1'])
b = pd.json_normalize(patio['P2'])
c = pd.json_normalize(patio['P3'])
patiocombo = pd.concat([a,b,c],ignore_index=True, axis =1)
patiocombo
patiocombo = patiocombo.rename(columns = {2:'P1',5:'P2',8:'P3'})
patiocombo['patio_type'] = patiocombo['P1'].map(str) + ',' + patiocombo['P2'].map(str) + ',' + patiocombo['P3'].map(str)
patiocombo['patio_type'] = patiocombo['patio_type'].str.replace(",nan",'')
patiodf = patiocombo['patio_type'].to_frame()
#patiodf
## clean visitor type column
visitor = pd.json_normalize(brewdf['visitors'])
visitor
visitor = visitor.rename(columns = {0:'V1',1:'V2',2:'V3'})
a = pd.json_normalize(visitor['V1'])
b = pd.json_normalize(visitor['V2'])
c = pd.json_normalize(visitor['V3'])
visitorcombo = pd.concat([a,b,c],ignore_index=True, axis =1)
visitorcombo
visitorcombo = visitorcombo.rename(columns = {2:'V1',5:'V2',8:'V3'})
visitorcombo['visitor_type'] = visitorcombo['V1'].map(str) + ',' + visitorcombo['V2'].map(str) + ',' + visitorcombo['V3'].map(str)
visitorcombo['visitor_type'] = visitorcombo['visitor_type'].str.replace(",nan",'')
visitordf = visitorcombo['visitor_type'].to_frame()
#visitordf
## clean tour type column
tour = pd.json_normalize(brewdf['tour_type'])
tour
tour = tour.rename(columns = {0:'T1',1:'T2',2:'T3',3:'T4'})
a = pd.json_normalize(tour['T1'])
b = pd.json_normalize(tour['T2'])
c = pd.json_normalize(tour['T3'])
d = pd.json_normalize(tour['T4'])
tourcombo = pd.concat([a,b,c,d],ignore_index=True, axis =1)
tourcombo
tourcombo = tourcombo.rename(columns = {2:'T1',5:'T2',8:'T3',11:'T4'})
tourcombo['tour_type'] = tourcombo['T1'].map(str) + ',' + tourcombo['T2'].map(str) + ',' + tourcombo['T3'].map(str) + ','+ tourcombo['T4'].map(str)
tourcombo['tour_type'] = tourcombo['tour_type'].str.replace(",nan",'')
tourdf = tourcombo['tour_type'].to_frame()
#tourdf
## clean other drinks column
odrink = pd.json_normalize(brewdf['otherdrinks_type'])
odrink
odrink = odrink.rename(columns = {0:'O1',1:'O2',2:'O3',3:'O4',4:'O5',5:'O6',6:'O7',7:'O8',8:'O9'})
a = pd.json_normalize(odrink['O1'])
b = pd.json_normalize(odrink['O2'])
c = pd.json_normalize(odrink['O3'])
d = pd.json_normalize(odrink['O4'])
e = pd.json_normalize(odrink['O5'])
f = pd.json_normalize(odrink['O6'])
g = pd.json_normalize(odrink['O7'])
h = pd.json_normalize(odrink['O8'])
i = pd.json_normalize(odrink['O9'])
odrinkcombo = pd.concat([a,b,c,d,e,f,g,h,i],ignore_index=True, axis =1)
odrinkcombo
odrinkcombo = odrinkcombo.rename(columns = {2:'O1',5:'O2',8:'O3',11:'O4',14:'O5',17:'O6',20:'O7',23:'O8',26:'O9'})
odrinkcombo['odrink_type'] = odrinkcombo['O1'].map(str) + ',' + odrinkcombo['O2'].map(str) + ',' + odrinkcombo['O3'].map(str) + ','+ odrinkcombo['O4'].map(str) + ','+ odrinkcombo['O5'].map(str)+ ','+ odrinkcombo['O6'].map(str)+ ','+ odrinkcombo['O7'].map(str)+','+ odrinkcombo['O8'].map(str)+','+ odrinkcombo['O9'].map(str)
odrinkcombo['odrink_type'] = odrinkcombo['odrink_type'].str.replace(",nan",'')
odrinkdf = odrinkcombo['odrink_type'].to_frame()
#odrinkdf
## clean to-go column
togo = pd.json_normalize(brewdf['togo_type'])
togo
togo = togo.rename(columns = {0:'TG1',1:'TG2',2:'TG3',3:'TG4',4:'TG5'})
a = pd.json_normalize(togo['TG1'])
b = pd.json_normalize(togo['TG2'])
c = pd.json_normalize(togo['TG3'])
d = pd.json_normalize(togo['TG4'])
e = pd.json_normalize(togo['TG5'])
togocombo = pd.concat([a,b,c,d,e],ignore_index=True, axis =1)
togocombo
togocombo = togocombo.rename(columns = {2:'TG1',5:'TG2',8:'TG3',11:'TG4',14:'TG5'})
togocombo['togo_type'] = togocombo['TG1'].map(str) + ',' + togocombo['TG2'].map(str) + ',' + togocombo['TG3'].map(str) + ','+ togocombo['TG4'].map(str) + ','+ togocombo['TG5'].map(str)
togocombo['togo_type'] = togocombo['togo_type'].str.replace(",nan",'')
togodf = togocombo['togo_type'].to_frame()
#togodf
## clean merch column
merch = pd.json_normalize(brewdf['merch_type'])
merch
merch = merch.rename(columns = {0:'M1',1:'M2',2:'M3',3:'M4',4:'M5',5:'M6',6:'M7',7:'M8',8:'M9',9:'M10',10:'M11',11:'M12'})
a = pd.json_normalize(merch['M1'])
b = pd.json_normalize(merch['M2'])
c = pd.json_normalize(merch['M3'])
d = pd.json_normalize(merch['M4'])
e = pd.json_normalize(merch['M5'])
f = pd.json_normalize(merch['M6'])
g = pd.json_normalize(merch['M7'])
h = pd.json_normalize(merch['M8'])
i = pd.json_normalize(merch['M9'])
j = pd.json_normalize(merch['M10'])
k = pd.json_normalize(merch['M11'])
l = pd.json_normalize(merch['M12'])
merchcombo = pd.concat([a,b,c,d,e,f,g,h,i,j,k,l],ignore_index=True, axis =1)
merchcombo
merchcombo = merchcombo.rename(columns = {2:'M1',5:'M2',8:'M3',11:'M4',14:'M5',17:'M6',20:'M7',23:'M8',26:'M9',29:'M10',32:'M11',35:'M12'})
merchcombo['merch_type'] = (merchcombo['M1'].map(str) + ',' + merchcombo['M2'].map(str) + ',' + merchcombo['M3'].map(str) + ','+ merchcombo['M4'].map(str) + ','
+ merchcombo['M5'].map(str) + ',' + merchcombo['M6'].map(str)+ ',' + merchcombo['M7'].map(str) + ',' + merchcombo['M8'].map(str)
+ ',' + merchcombo['M9'].map(str)+ ',' + merchcombo['M10'].map(str)+ ',' + merchcombo['M11'].map(str)+ ',' + merchcombo['M12'].map(str))
merchcombo['merch_type'] = merchcombo['merch_type'].str.replace(",nan",'')
merchdf = merchcombo['merch_type'].to_frame()
#merchdf
### clean description column
brewdf['description'] = brewdf['description'].str.replace(r'<[^<>]*>', '', regex=True)
#brewdf
### replace nan with null
brewdf = brewdf.replace('nan',np.nan)
brewdf = brewdf.replace('None',np.nan)
brewdf
cleanedbrewdf = brewdf.drop(columns = {'food_type','tour_type','otherdrinks_type','articles','merch_type','togo_type','patio','visitors','parking','dogs'})
mergedbrewdf = pd.concat([cleanedbrewdf,dogsdf,parkdf,fooddf,patiodf,
visitordf,tourdf,odrinkdf,togodf,merchdf,],ignore_index=False,axis=1)
mergedbrewdf
### remove non-existing
finalbrewdf = mergedbrewdf.loc[(mergedbrewdf['lon'].notnull())].copy()
finalbrewdf['lon'] = finalbrewdf['lon'].astype(float)
finalbrewdf['lat'] = finalbrewdf['lat'].astype(float)
finalbrewdf
Can someone please point me in the right direction for the hours and memberships columns? Also, is there a more efficient way to look through these different columns? They have different nested list-dict lengths which I thought might prevent me from writing a function.
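On the efficiency question: every one of the blocks above does the same normalize/rename/concat dance, and each taxonomy cell appears to be a list of {'id', 'slug', 'name'} dicts (which is what the rename(columns={2: ..., 5: ..., 8: ...}) pattern suggests). If that assumption holds for your data, one hedged helper can replace all of them, regardless of list length:

def names_to_csv(df, col):
    # Join the 'name' of every dict in the cell's list into one comma-separated string.
    return df[col].apply(
        lambda cell: ','.join(d['name'] for d in cell) if isinstance(cell, list) else np.nan
    )

for col in ['dogs', 'parking', 'food_type', 'patio', 'visitors',
            'tour_type', 'otherdrinks_type', 'togo_type', 'merch_type']:
    brewdf[col + '_clean'] = names_to_csv(brewdf, col)

For 'hours', if each cell is a single dict keyed by day (an assumption; inspect brewdf['hours'].iloc[0] first), pd.json_normalize(brewdf['hours']) should expand it into one column per day, which you can then rename to Mon_Hours ... Sun_Hours.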
I wrote a web scraper which downloads table tennis data: info about players, match scores etc. I would like to display the players who lost the most matches per day. I've created a data frame, and I would like to sum p1_status and p2_status, then display each player's surname next to their number of losses.
https://gyazo.com/19c70e071db78071e83045bfcea0e772
Here is my code:
s = Service("D:/setka/chromedriver.exe")
option = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=s)
hall = 10
num = 1
filename = "C:/Users/filip/result2.csv"
f = open(filename, "w")
headers = "p1_surname, p1_name, p1_score, p2_surname, p2_name, p2_score, p1_status, p2_status \n"
f.write(headers)
while hall <= 10:
    for period in [1]:
        url = 'https://tabletennis.setkacup.com/en/schedule?date=2021-12-04&hall=' + \
            str(hall) + '&' + 'period=' + str(period)
        driver.get(url)
        time.sleep(5)
        divs = driver.find_elements(By.CSS_SELECTOR, "div.score-result")
        for div in divs:
            data = div.text.split()
            #print(data)
            if (num % 2) == 0:
                f.write(str(data[0]) + "," + str(data[1]) + "," + str(data[2] + "," + "\n"))
            else:
                f.write(str(data[0]) + "," + str(data[1]) + "," + str(data[2] + ","))
            num = num + 1
    hall = hall + 1
f.close()
df_results=pd.read_csv('C:/Users/filip/result2.csv', sep = r',',
skipinitialspace = True)
df_results.reset_index(drop=True, inplace=True)
df_results.loc[df_results['p1_score'] > df_results['p2_score'], ['p1_status','p2_status']] = ['won','lost']
df_results.loc[df_results['p1_score'] < df_results['p2_score'], ['p1_status','p2_status']] = ['lost','won']
df_results.loc[df_results['p1_score'] == df_results['p2_score'], ['p1_status','p2_status']] = ['not played','not played']
df_results.loc[((df_results['p1_score'] < 3) & (df_results['p1_score']!=0) & (df_results['p2_score'] <3) & (df_results['p2_score']!=0)), ['p1_status','p2_status']] = ['inplay','inplays']
df_results.loc[df_results['p1_status'] != df_results['p2_status'], ['match_status']] = ['finished']
df_results.loc[df_results['p1_status'] == df_results['p2_status'], ['match_status']] = ['not played']
df_results.loc[((df_results['p1_status'] =='inplay') & (df_results['p2_status']=='inplays')), ['match_status']] = ['inplay']
df_results = df_results.dropna(axis=1)
df_results.head(30)
Split your dataframe into 2 parts (p1_*, p2_*) to count the defeats of each player, then merge them.
Set up a MRE:
df = pd.DataFrame({'p1_surname': list('AABB'), 'p2_surname': list('CDCD'),
'p1_status': list('LWWW'), 'p2_status': list('WLLL')})
print(df)
# Output:
p1_surname p2_surname p1_status p2_status
0 A C L W
1 A D W L
2 B C W L
3 B D W L
>>> pd.concat([
df.filter(like='p1_').set_index('p1_surname')['p1_status'].eq('L').rename('loses'),
df.filter(like='p2_').set_index('p2_surname')['p2_status'].eq('L').rename('loses')]) \
.groupby(level=0).sum().rename_axis('surname').reset_index()
surname loses
0 A 1
1 B 0
2 C 1
3 D 2
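Applied to your df_results, where the status strings are 'won'/'lost' rather than 'L', the same pattern becomes (a sketch, assuming the column names from your CSV header):

losses = pd.concat([
    df_results.filter(like='p1_').set_index('p1_surname')['p1_status'].eq('lost').rename('losses'),
    df_results.filter(like='p2_').set_index('p2_surname')['p2_status'].eq('lost').rename('losses'),
]).groupby(level=0).sum().rename_axis('surname').reset_index()

print(losses.sort_values('losses', ascending=False))  # players with the most defeats first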
I am conducting a data science project to analyse large volumes of cancer genome data; my computer is relatively inefficient, with a slow CPU and little RAM. As a result, running through all the samples takes far too long.
I have tried trimming any excess code, I have replaced for loops with list comprehensions, and I have used multiprocessing to split up my tasks so they run faster.
import re
import xlrd
import os
import time
from multiprocessing import Pool
import collections
import pandas as pd
if os.path.exists("C:\\Users\\js769\\genomemutations\\Input\\ChromosomesVersion") == True:
    print("chromosomes in folder")
else:
    os.makedirs("C:\\Users\\js769\\genomemutations\\Input\\ChromosomesVersion")
    print(
        "Chromosome Folder Created, Please transfer current version of chromosome number base data to new file."
    )
if os.path.exists("C:\\Users\\js769\\genomemutations\\Input\\MutationSamples") == True:
    print("Add sample data to run.")
else:
    os.makedirs("C:\\Users\\js769\\genomemutations\\Input\\MutationSamples")
    print("Mutation Sample Folder Created, please add mutation sample data to folder.")
if os.path.exists("C:\\Users\\js769\\genomemutations\\output") == True:
    print("3")
else:
    os.makedirs("C:\\Users\\js769\\genomemutations\\output")
# Require editing of this so it works both on a mac or windows system. Currently this version suited to mac because of higher processing power.
# Require ability to check to see if error occurs
def Main(Yeram):
    import os
    import glob
    import errno
    import shutil
    import xlrd
    import pandas as pd
    import time
    import re
    import numpy as np

    FragmentSize = 10000000  # This is fragment size which is adjustable.
    # Code not needed
    Position1 = Yeram.vectx
    Position2 = Yeram.vecty
    samplelist = Yeram.samplelist
    dictA = Yeram.dictA
    FragmentSize = Yeram.FragmentSize
    chromosomesizes = Yeram.chromosomesizes

    def chromosomex_mutation_data(
        chromosomenumber, mutationlist
    ):  # It selects the correct chromosome mutation point data, then it selects the data before the -. Mutation data in form (12-20)
        chromosomexlist = ["0-1"]
        for mutationposition in mutationlist:
            if mutationposition[0:2] == str(chromosomenumber):
                chromosomexlist.append(mutationposition[3:])
            elif mutationposition[0:2] == (str(chromosomenumber) + ":"):
                chromosomexlist.append(mutationposition[2:])
            else:
                continue
        Puremutationdatapoints = [int(mutationposition.split("-")[0]) for mutationposition in chromosomexlist]
        return Puremutationdatapoints

    def Dictionary_Of_Fragment_mutation(FragmentSize, MutationData, ChromosomeNumber):
        chromosomes = {}  # Dictionary
        chromosomesize = chromosomesizes[ChromosomeNumber - 1]
        # Opening up specific chromosome data and calculating amount of bases present in chromosome
        Number_of_fragments = int(chromosomesize / FragmentSize)
        for mutation in MutationData:
            for i in range(0, (Number_of_fragments), 1):
                a = (
                    "Chromosome"
                    + str(ChromosomeNumber)
                    + "Fragment"
                    + str(i)
                    + ",Basepairs "
                    + str(i * FragmentSize + 1)
                    + "-"
                    + str(i * FragmentSize + FragmentSize)
                )
                if mutation in range(i * FragmentSize + 1, i * FragmentSize + FragmentSize + 1):
                    if chromosomes.get(a) == None:
                        chromosomes.update({a: 1})
                    else:
                        b = (chromosomes.get(a)) + 1
                        chromosomes.update({a: b})
                else:
                    if chromosomes.get(a) == None:
                        chromosomes.update({a: 0})
                    else:
                        continue
        return chromosomes
    # This adds mutations or no mutation to each fragment for chromosome, makes dictionaries

    def DictionaryRead(FragmentSize, Dict, ChromosomeNumber):
        chromosomesize = chromosomesizes[ChromosomeNumber - 1]
        Number_of_fragments = int(chromosomesize / FragmentSize)
        chromosomefragmentlist = []
        for i in range(0, (Number_of_fragments), 1):
            a = (
                "Chromosome"
                + str(ChromosomeNumber)
                + "Fragment"
                + str(i)
                + ",Basepairs "
                + str(i * FragmentSize + 1)
                + "-"
                + str(i * FragmentSize + FragmentSize)
            )
            chromosomefragmentlist.append(str(Dict.get((a))))
        return chromosomefragmentlist
    # This uses dictionary to create list

    def forwardpackage2(FragmentSize, PureMutationData):
        C = []  # list of data in numerical order 0 = no mutation
        for i in range(1, 23, 1):
            A = chromosomex_mutation_data(i, PureMutationData)  # Purifies Data
            B = Dictionary_Of_Fragment_mutation(FragmentSize, A, i)  # Constructs Dictionary
            C += DictionaryRead(
                FragmentSize, B, i
            )  # Uses constructed Dictionary and generates list of numbers, each number being a fragment in numerical order.
        return C

    def Mutationpointdata(Position1, Position2, dictA, FragmentSize):  # Require dictA
        vectx = Position1
        vecty = Position2
        Samplesandmutationpoints = []
        for i in range(vectx, vecty):
            print(samplelist[i])
            new = [k for k, v in dictA.items() if int(v) == samplelist[i]]
            mutationlist = [excelsheet.cell_value(i, 23) for i in new]
            mutationlist.sort()
            Samplesandmutationpoints.append(forwardpackage2(FragmentSize, mutationlist))
        return Samplesandmutationpoints

    # Opening sample data from excel table
    return Mutationpointdata(Position1, Position2, dictA, FragmentSize)  # yeram to james samples
def ChromosomeSequenceData(ChromosomeNumber):  # Formats the chromosome file into readable information
    with open(
        r"C:\Users\js769\genomemutations\Input\ChromosomesVersion\chr" + str(ChromosomeNumber) + ".fa"
    ) as text_file:
        text_data = text_file.read()
    listA = re.sub("\n", "", text_data)
    # list2=[z for z in text_data if z!= "\n"]
    if ChromosomeNumber < 10:
        ChromosomeSequenceData = listA[5:]
    else:
        ChromosomeSequenceData = listA[6:]
    return ChromosomeSequenceData

def basepercentage_single(
    i, FragmentSize, ChromosomeSequenceData
):  # Computes the percentage of known (non-N) bases for one fragment of a chromosome.
    sentence = ChromosomeSequenceData[(i * FragmentSize + 1):(i * FragmentSize + FragmentSize)]
    a = sentence.count("N") + sentence.count("n")
    c = str(((FragmentSize - a) / FragmentSize) * 100) + "%"
    return c

def basepercentage_multiple(
    FragmentSize, ChromosomeSequenceData
):  # Creates a list of known-base percentages which correspond with the dna fragments of one chromosome.
    fragmentamount = int(len(ChromosomeSequenceData) / FragmentSize)
    list = [
        basepercentage_single(i, FragmentSize, ChromosomeSequenceData) for i in range(0, (fragmentamount), 1)
    ]
    return list

def FragmentEncodedPercentage(
    FragmentSize
):  # Packages a list of known-base percentages which correspond with the dna fragments for every chromosome.
    Initial_list = [basepercentage_multiple(FragmentSize, ChromosomeSequenceData(i)) for i in range(1, 23, 1)]
    List_of_fragment_encoded_percentages = [item for sublist in Initial_list for item in sublist]
    return List_of_fragment_encoded_percentages

def chromosomefragmentlist(
    FragmentSize, ChromosomeNumber
):  # Creates a list of fragment labels for a specific chromosome.
    chromosomesize = chromosomesizes[ChromosomeNumber - 1]
    Number_of_fragments = int(chromosomesize / FragmentSize)
    chromosomefragmentlist = []
    for i in range(0, (Number_of_fragments), 1):
        a = (
            "Chromosome"
            + str(ChromosomeNumber)
            + "Fragment"
            + str(i)
            + ",Basepairs "
            + str(i * FragmentSize + 1)
            + "-"
            + str(i * FragmentSize + FragmentSize)
        )
        chromosomefragmentlist.append(str(((a))))
    return chromosomefragmentlist

def GenomeFragmentGenerator(
    FragmentSize
):  # Creates the genome fragments for all chromosomes and adds them all to a list.
    list = [chromosomefragmentlist(FragmentSize, i) for i in range(1, 23, 1)]
    A = [item for sublist in list for item in sublist]
    return A

def excelcreation(
    mutationdata, samplelist, alpha, bravo, FragmentSize, A, B
):  # Program runs sample alpha to bravo and then constructs excel table
    data = {"GenomeFragments": A, "Encoded Base Percentage": B}
    for i in range(alpha, bravo):
        data.update({str(samplelist[i]): mutationdata[i]})
    df = pd.DataFrame(data, index=A)
    export_csv = df.to_csv(
        r"C:/Users/js769/genomemutations/output/chromosomeAll.csv", index=None, header=True
    )
start_time = time.time()
# Code determine base fragment size
FragmentSize = 1000000
chromosomesizes = []  # This calculates the base pair sizes for each chromosome.
for i in range(1, 23):
    with open(r"C:\Users\js769\genomemutations\Input\ChromosomesVersion\chr" + str(i) + ".fa") as text_file:
        text_data = text_file.read()
    list = re.sub("\n", "", text_data)
    if i < 10:
        chromosomesizes.append(len(list[5:]))
    else:
        chromosomesizes.append(len(list[6:]))
wb = xlrd.open_workbook("C:/Users/js769/genomemutations/input/MutationSamples/Complete Sample For lungs.xlsx")
excelsheet = wb.sheet_by_index(0)
excelsheet.cell_value(0, 0)
sampleswithduplicates = [excelsheet.cell_value(i, 5) for i in range(1, excelsheet.nrows)]
samplelist = []
for sample in sampleswithduplicates:
    if sample not in samplelist:
        samplelist.append(int(sample))  # Constructs list of samples; each sample only comes up once
dictA = {}
counter = 1  # Creates a dictionary where it counts the
for sample in sampleswithduplicates:
    dictA.update({counter: int(sample)})
    counter = counter + 1
A = GenomeFragmentGenerator(FragmentSize)
B = FragmentEncodedPercentage(FragmentSize)
value = collections.namedtuple(
    "value", ["vectx", "vecty", "samplelist", "dictA", "FragmentSize", "chromosomesizes"]
)
SampleValues = (
    value(
        vectx=0,
        vecty=2,
        samplelist=samplelist,
        dictA=dictA,
        FragmentSize=FragmentSize,
        chromosomesizes=chromosomesizes,
    ),
    value(
        vectx=2,
        vecty=4,
        samplelist=samplelist,
        dictA=dictA,
        FragmentSize=FragmentSize,
        chromosomesizes=chromosomesizes,
    ),
    value(
        vectx=4,
        vecty=6,
        samplelist=samplelist,
        dictA=dictA,
        FragmentSize=FragmentSize,
        chromosomesizes=chromosomesizes,
    ),
    value(
        vectx=6,
        vecty=8,
        samplelist=samplelist,
        dictA=dictA,
        FragmentSize=FragmentSize,
        chromosomesizes=chromosomesizes,
    ),
    value(
        vectx=8,
        vecty=10,
        samplelist=samplelist,
        dictA=dictA,
        FragmentSize=FragmentSize,
        chromosomesizes=chromosomesizes,
    ),
    value(
        vectx=10,
        vecty=12,
        samplelist=samplelist,
        dictA=dictA,
        FragmentSize=FragmentSize,
        chromosomesizes=chromosomesizes,
    ),
    value(
        vectx=12,
        vecty=14,
        samplelist=samplelist,
        dictA=dictA,
        FragmentSize=FragmentSize,
        chromosomesizes=chromosomesizes,
    ),
    value(
        vectx=14,
        vecty=16,
        samplelist=samplelist,
        dictA=dictA,
        FragmentSize=FragmentSize,
        chromosomesizes=chromosomesizes,
    ),
)
print("starting multiprocessing")
if __name__ == "__main__":
    with Pool(4) as p:
        result = p.map(Main, SampleValues)
    Allmutationdata = []
    for i in result:
        for b in i:
            Allmutationdata.append(b)
    excelcreation(Allmutationdata, samplelist, 0, 16, FragmentSize, A, B)
    print("My program took " + str(time.time() - start_time) + " to run")
So the program runs; that isn't the issue. The issue is how long it takes to run. Can anyone spot where my code may be at fault?
This article How to make your pandas loop run 72,000x faster has really resonated with me and I think will help you.
It provides clear instructions on how to vectorize your for loops to drastically speed them up.
Methods to speed up a For Loop:
Utilize pandas iterrows()
~321 times faster
Example
for index, row in dataframe.iterrows():
    print(index, row)
Pandas Vectorization
~9280 times faster
Example
df.loc[((col1 == val1) & (col2 == val2)), column_name] = conditional_result
Numpy Vectorization
~72,000 times faster
Example
df.loc[((col1.values == val1) & (col2.values == val2)), column_name] = conditional_result
By adding .values we receive a numpy array.
Credit for the timing results goes to this article
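In the code above, the biggest cost is likely Dictionary_Of_Fragment_mutation, which loops over every mutation times every fragment and rebuilds the label string each time. Binning positions into fixed-size fragments is exactly what numpy's histogram does in one vectorized call. Here is a sketch of a replacement (the function name is mine; it reproduces the original's 1-based, inclusive fragment ranges):

import numpy as np

def fragment_mutation_counts(FragmentSize, MutationData, chromosomesize):
    # One bin per fragment: fragment i covers base pairs i*FragmentSize+1 .. (i+1)*FragmentSize.
    Number_of_fragments = int(chromosomesize / FragmentSize)
    edges = np.arange(0, (Number_of_fragments + 1) * FragmentSize, FragmentSize)
    # Shift the 1-based positions down by 1 so each inclusive range maps onto one half-open bin.
    counts, _ = np.histogram(np.asarray(MutationData) - 1, bins=edges)
    return counts  # counts[i] is the number of mutations in fragment i

The fragment labels can then be generated once per chromosome (as DictionaryRead already does) and zipped with counts, which removes the inner Python loop entirely.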
I'm trying to use this dictionary:
student_data_dict = {'Student_1': 'bbbeaddacddcddaaadbaabdad', 'Student_2': 'acbccaddcadaaacdadbcabcad', 'Student_3': 'babcabdccadcDdbccdbaadbad', 'Student_4': 'bcbcabddcadcdabccdbaadcbd', 'Student_5': 'DCBCCADDCADBDACCDBBACBCAD', 'Student_6': 'acbeccddcadbaaccabbacdcad', 'Student_7': 'BCBCBCDABADCADCCDABAACCAD', 'Student_8': 'dcbccbddcadaabcbcacabbcad', 'Student_9': 'DDBDBBCDDCCBABCCBACADAAAC', 'Student_10': 'cbbdacdacadcbadbabaabcaTa', 'Student_11': 'BDBECADCAADCAAAAACBACACAD', 'Student_12': 'DBBCCBDCCADCDABABCBAABCAD', 'Student_13': 'BCBCBCDDCADCAAACCABACACAD', 'Student_14': 'DBBECBDACADAAACBCBAAABCBD', 'Student_15': 'acbebbddcadbaacccbcaddcad', 'Student_16': 'ACBEBCDDCADBAACCAACADBCAD', 'Student_17': 'DBBCACDDCADCAABCADBABDDAD', 'Student_18': 'dcbcdcdbbddccabbdacacccbd', 'Student_19': 'dbbccbddcadaaaccbdcaaacad', 'Student_20': 'abbdaaddcadcaaccbdcaaccbd', 'Student_21': 'DCDCABDBCADAAACDCCDAACAAD', 'Student_22': 'dabdaddabddbaacdacbaaaaad', 'Student_23': 'BCBCDDDACCDCAABDDABACACAD', 'Student_24': 'ACBDCBDBBCDAACCCCBDAADCBD', 'Student_25': 'DCBCACDAADDCADCBAABACBCAD', 'Student_26': 'dcbaabdccadcdadcccbaabdbd', 'Student_27': 'abbadbddcadacbcacccacbdad'}
and store the first letter for all students as a dictionary entry, then do the same for the next letter, and so on, to result in:
{'question_1': 'babbDaBdDcBDBDaADddaDdBADda', 'question_2': 'bcacCcCcDbDBCBcCBcbbCaCCCcb', 'question_3': 'bbbbBbBbBbBBBBbBBbbbDbBBBbb', 'question_4': 'ecccCeCcDdECCEeECccdCdCDCaa', 'question_5': 'acaaCcBcBaCCBCbBAdcaAaDCAad', 'question_6': 'dabbAcCbBcABCBbCCcbaBdDBCbb', 'question_7': 'ddddDdDdCdDDDDdDDdddDdDDDdd', 'question_8': 'adcdDdAdDaCCDAdDDbddBaABAcd', 'question_9': 'ccccCcBcDcACCCcCCbccCbCBAcc', 'question_10': 'daaaAaAaCaAAAAaAAdaaAdCCDaa', 'question_11': 'ddddDdDdCdDDDDdDDdddDdDDDdd', 'question_12': 'caccBbCaBcCCCAbBCcacAbCACca', 'question_13': 'daDdDaAaAbADAAaAAcaaAaAAAdc', 'question_14': 'dadaAaDbBaAAAAaAAaaaAaACDab', 'question_15': 'acbbCcCcCdABACcCBbccCcBCCdc', 'question_16': 'adccCcCbCbAACBcCCbccDdDCBca', 'question_17': 'aaccDaDcBaABCCcAAdbbCaDCAcc', 'question_18': 'ddddBbAaAbCCABbADaddCcABAcc', 'question_19': 'bbbbBbBcCaBBBAcCBcccDbBDBbc', 'question_20': 'acaaAaAaAaAAAAaAAaaaAaAAAaa', 'question_21': 'aaaaCcAbDbCACAdDBcaaAaCACac', 'question_22': 'bbddBdCbAcABABdBDcacCaADBbb', 'question_23': 'dcbcCcCcAaCCCCcCDcccAaCCCdd', 'question_24': 'aaabAaAaATAAABaAAbabAaABAba', 'question_25': 'ddddDdDdCaDDDDdDDdddDdDDDdd'}
x = 1
all_letters = ''
letter = ''
y = 1
i = 0
z = 0
for start in student_data_dict:
    student = student_data_dict.get('Student_' + str(y))
    letter = student[z]
    all_letters = all_letters + letter
    y = y + 1
    i = i + 1
    question_data_dict["question " + str(x)] = all_letters
    if i == 27:
        z = z + 1
        x = x + 1
        i = 0
print(question_data_dict)
data_file.close()
{'question 1': 'babbDaBdDcBDBDaADddaDdBADda'}
is what I get but I can't get the answers for the other 25 questions.
I tried changing "for start in student_data_dict:" into "while z < 26:", but at the line "letter = student[z]" I get the error "'NoneType' object is not subscriptable".
num_questions = 25
answers_dict = {}
for i in range(num_questions):
    answers_dict['question' + str(i)] = ''.join(c[i] for c in student_data_dict.values())
print(answers_dict)
Will give you the result you want.
Edit
Fixed code. Extracted number of questions to a variable so it can be used as index
Edit2
I created an OrderedDict from your original dictionary to maintain answer order when iterating. Now the answers_dict contains valid data.
from collections import OrderedDict

ordered_data = OrderedDict()
for i in range(len(student_data_dict.items())):
    ordered_data['Student_' + str(i + 1)] = student_data_dict.get('Student_' + str(i + 1))

num_questions = 25
answers_dict = {}
for i in range(num_questions):
    answers_dict['question' + str(i + 1)] = ''.join(c[i] for c in ordered_data.values())
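One simplification worth knowing: in Python 3.7+ plain dicts preserve insertion order, so (assuming a modern interpreter) the OrderedDict step isn't needed and the whole thing collapses to a dict comprehension:

answers_dict = {
    'question' + str(i + 1): ''.join(v[i] for v in student_data_dict.values())
    for i in range(25)
}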
You need to reset y when you move to the next question.
Here's an alternative way to get what you're looking for with Pandas:

import pandas as pd

sdd = {k: list(v) for k, v in student_data_dict.items()}
df = pd.DataFrame(sdd)
df = df.reindex(columns=sorted(df.columns, key=lambda col: int(col.split("_")[-1])))
df.index = [f"Question {i+1}" for i in df.index]
{k: ''.join(v) for k, v in zip(df.index, df.values)}