Make This Input Function Faster - python

I'm practicing some exam questions and I've hit a time-limit issue that I can't figure out. I think it has to do with how I'm iterating through the inputs.
It's the famous Titanic dataset, so I won't bother printing a sample of the df, as I'm sure everyone is familiar with it.
The function computes the similarity between two passengers, which are provided as input. I also map the Sex column to integers so that it can be compared between passengers, as you'll see below.
I was also thinking it could be how I'm indexing and locating the values for each passenger, but again I'm not sure.
The function is as follows. The time limit is 1 second, but when no_of_queries == 100 the function takes 1.091 s.
df = pd.read_csv("titanic.csv")
mappings = {'male': 0, 'female':1}
df['Sex'] = df['Sex'].map(mappings)
def function_similarity(no_of_queries):
    for num in range(int(no_of_queries)):
        x = input()
        passenger_a, passenger_b = x.split()
        passenger_a, passenger_b = int(passenger_a), int(passenger_b)
        result = 0
        if int(df[df['PassengerId'] == passenger_a]['Pclass']) == int(df[df['PassengerId'] == passenger_b]['Pclass']):
            result += 1
        if int(df[df['PassengerId'] == passenger_a]['Sex']) == int(df[df['PassengerId'] == passenger_b]['Sex']):
            result += 3
        if int(df[df['PassengerId'] == passenger_a]['SibSp']) == int(df[df['PassengerId'] == passenger_b]['SibSp']):
            result += 1
        if int(df[df['PassengerId'] == passenger_a]['Parch']) == int(df[df['PassengerId'] == passenger_b]['Parch']):
            result += 1
        result += max(0, 2 - abs(float(df[df['PassengerId'] == passenger_a]['Age']) - float(df[df['PassengerId'] == passenger_b]['Age'])) / 10.0)
        result += max(0, 2 - abs(float(df[df['PassengerId'] == passenger_a]['Fare']) - float(df[df['PassengerId'] == passenger_b]['Fare'])) / 5.0)
        print(result / 10.0)
function_similarity(input())

Look up each passenger's row by id only once per query, for passengers a and b, instead of filtering the whole DataFrame again for every column you compare.
df = pd.read_csv("titanic.csv")
mappings = {'male': 0, 'female':1}
df['Sex'] = df['Sex'].map(mappings)
def function_similarity(no_of_queries):
    for num in range(int(no_of_queries)):
        x = input()
        passenger_a, passenger_b = x.split()
        passenger_a, passenger_b = df[df['PassengerId'] == int(passenger_a)], df[df['PassengerId'] == int(passenger_b)]
        result = 0
        if int(passenger_a['Pclass']) == int(passenger_b['Pclass']):
            result += 1
        if int(passenger_a['Sex']) == int(passenger_b['Sex']):
            result += 3
        if int(passenger_a['SibSp']) == int(passenger_b['SibSp']):
            result += 1
        if int(passenger_a['Parch']) == int(passenger_b['Parch']):
            result += 1
        result += max(0, 2 - abs(float(passenger_a['Age']) - float(passenger_b['Age'])) / 10.0)
        result += max(0, 2 - abs(float(passenger_a['Fare']) - float(passenger_b['Fare'])) / 5.0)
        print(result / 10.0)
function_similarity(input())
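The same idea can be pushed one step further by building a lookup table once, before any query is read, so that each query costs two dictionary lookups rather than two DataFrame scans. The sketch below is only an illustration under assumptions: the small DataFrame stands in for titanic.csv with made-up rows, and the names passengers and similarity are hypothetical, not part of the original code.

```python
import pandas as pd

# Tiny stand-in for titanic.csv (hypothetical rows, same column names).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925],
})
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Build the lookup once: PassengerId -> {column name: value}.
passengers = df.set_index("PassengerId").to_dict("index")

def similarity(id_a, id_b):
    """Score two passengers with the same rules as the question."""
    a, b = passengers[id_a], passengers[id_b]
    result = 0
    result += 1 if a["Pclass"] == b["Pclass"] else 0
    result += 3 if a["Sex"] == b["Sex"] else 0
    result += 1 if a["SibSp"] == b["SibSp"] else 0
    result += 1 if a["Parch"] == b["Parch"] else 0
    result += max(0, 2 - abs(a["Age"] - b["Age"]) / 10.0)
    result += max(0, 2 - abs(a["Fare"] - b["Fare"]) / 5.0)
    return result / 10.0

print(similarity(1, 3))
```

In the original program, the loop would call similarity() on the two ids read from input() and print the result. Whether this is enough to get under the 1 second limit depends on the grader, so treat it as something to measure rather than a guarantee.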

Related

Python using different module + package, and counts

yut.py is the following.
I need to define two functions:
throw_yut1(), which randomly returns '배' or '등' with probabilities of 60% and 40% respectively.
throw_yut4(), which returns the 4 results from throw_yut1() joined together, such as '배등배배' or '배배배배'.
import random
random.seed(10)
def throw_yut1():
    if random.random() <= 0.6 :
        return '배'
    else:
        return '등'
def throw_yut4():
    result = ''
    for i in range(4):
        result = result + throw_yut1()
    return result
main.py is the following.
Here, I need to call throw_yut4 1000 times and print the count and percentage of each outcome listed in the if statements.
import yut
counts = {}
for i in range(1000):
    result = yut.throw_yut4
    back = yut.throw_yut4().count('등')
    belly = yut.throw_yut4().count('배')
    if back == 3 and belly == 1:
        counts['도'] = counts.get('도', 0) + 1
    elif back == 2 and belly == 2:
        counts['개'] = counts.get('개', 0) + 1
    elif back == 1 and belly == 3:
        counts['걸'] = counts.get('걸', 0) + 1
    elif back == 0 and belly == 4:
        counts['윷'] = counts.get('윷', 0) + 1
    elif back == 4 and belly == 0:
        counts['모'] = counts.get('모', 0) + 1
for key in ['도','개','걸','윷','모']:
    print(f'{key} - {counts[key]} ({counts[key] / 1000 * 100:.1f}%)')
I keep getting
도 - 33 (3.3%)
개 - 115 (11.5%)
걸 - 131 (13.1%)
윷 - 22 (2.2%)
모 - 1 (0.1%)
but I am meant to get
도 - 157 (15.7%)
개 - 333 (33.3%)
걸 - 349 (34.9%)
윷 - 135 (13.5%)
모 - 26 (2.6%)
How can I fix my error?
I think the problem is this part of the code:
for i in range(1000):
    result = yut.throw_yut4
    back = yut.throw_yut4().count('등')
    belly = yut.throw_yut4().count('배')
Seems like it should be:
for i in range(1000):
    result = yut.throw_yut4()
    back = result.count('등')
    belly = result.count('배')
Otherwise you are counting back and belly from two independent yut.throw_yut4() calls, so unhandled (and presumably unintended) combinations such as back=4 and belly=4 are possible. This is why the totals you count add up to less than 1000 and less than 100%.
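To make the point concrete, here is a minimal self-contained sketch that throws once per iteration and classifies that same throw. It is not the original yut.py: the rng parameter, the NAMES table and the use of collections.Counter are choices made for this illustration, and the exact counts it prints are not claimed to match the expected output in the question.

```python
import random
from collections import Counter

def throw_yut4(rng):
    # Four independent sticks: '배' with probability 0.6, otherwise '등'.
    return "".join("배" if rng.random() <= 0.6 else "등" for _ in range(4))

# Number of '등' in one throw -> outcome name, following the question's if/elif chain.
NAMES = {3: "도", 2: "개", 1: "걸", 0: "윷", 4: "모"}

rng = random.Random(10)
counts = Counter()
for _ in range(1000):
    result = throw_yut4(rng)                  # throw once ...
    counts[NAMES[result.count("등")]] += 1    # ... and classify that same throw

for key in ["도", "개", "걸", "윷", "모"]:
    print(f"{key} - {counts[key]} ({counts[key] / 1000 * 100:.1f}%)")
```

Because every throw has between 0 and 4 '등', each iteration lands in exactly one bucket, so the five counts always sum to 1000.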
It seems like your counts dictionary isn't getting filled with 1000 elements. This is likely because you are calling yut.throw_yut4() twice, which gives two different results, so that in many iterations none of your conditional checks pass.
Try this instead:
result = yut.throw_yut4()
back = result.count('등')
belly = result.count('배')
You're welcome.

HackerRank Frequency Queries Error Python

You are given q queries. Each query is of the form of two integers described below:
1 x: Insert x in your data structure.
2 y: Delete one occurrence of y from your data structure, if present.
3 z: Check if any integer is present whose frequency is exactly z. If yes, print 1 else 0.
The queries are given in the form of a 2-D array queries of size q where queries[i][0] contains the operation, and queries[i][1] contains the data element.
Example:
queries = [(1,1), (2,2), (3,2), (1,1), (1,1), (2,1), (3,2)]
This would return [0,1]
def freqQuery(queries):
    lis = []
    freq = {}
    count = {}
    for pair in queries:
        if pair[0] == 1:
            freq[pair[1]] = freq.get(pair[1], 0) + 1
            count[freq[pair[1]]] = count.get(freq[pair[1]], 0) + 1
            count[freq[pair[1]] - 1] = count.get(freq[pair[1]] - 1, 1) - 1
        if pair[0] == 2:
            if pair[1] in freq:
                count[freq[pair[1]]] = count.get(freq[pair[1]], 1) - 1
                freq[pair[1]] -= 1
                count[freq[pair[1]]] = count.get(freq[pair[1]], 1) + 1
                if count[freq[pair[1]]] < 0:
                    count[freq[pair[1]]] = 0
        if pair[0] == 3:
            if pair[1] in count:
                if count[pair[1]] > 0:
                    lis.append(1)
                else:
                    lis.append(0)
            else:
                lis.append(0)
    return lis
I got 12/15 of the test cases correct, but I can't figure out what is wrong for the last couple of test cases.
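One thing worth checking, offered as a guess rather than a confirmed diagnosis of the failing tests: in the code above, a delete for a value whose frequency is already 0 still decrements it (the key stays in freq after the first insert), and the count.get(..., 1) defaults can create bucket entries that never existed. The sketch below keeps the same two-dictionary idea but only deletes when the value is actually present. The name freq_query and the use of defaultdict are choices made for this illustration.

```python
from collections import defaultdict

def freq_query(queries):
    freq = defaultdict(int)    # value -> how many copies are currently stored
    count = defaultdict(int)   # frequency -> how many values have that frequency
    answers = []
    for op, val in queries:
        if op == 1:
            old = freq[val]
            if old > 0:
                count[old] -= 1
            freq[val] = old + 1
            count[old + 1] += 1
        elif op == 2:
            old = freq[val]
            if old > 0:                 # ignore deletes of absent values
                count[old] -= 1
                freq[val] = old - 1
                if old > 1:             # frequency 0 is not tracked in count
                    count[old - 1] += 1
        elif op == 3:
            answers.append(1 if count.get(val, 0) > 0 else 0)
    return answers

print(freq_query([(1, 1), (2, 2), (3, 2), (1, 1), (1, 1), (2, 1), (3, 2)]))
```

A useful test case for comparing the two versions is insert, delete, delete, insert, query 1, for example [(1, 5), (2, 5), (2, 5), (1, 5), (3, 1)], where the value 5 ends with a frequency of exactly 1.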

Making permanent change in a dataframe using python pandas

I would like to convert the values in my dataframe from one format (X:XX:XX:XX) to another (X.X seconds).
Here is what my dataframe looks like:
Start End
0 0:00:00:00
1 0:00:00:00 0:07:37:80
2 0:08:08:56 0:08:10:08
3 0:08:13:40
4 0:08:14:00 0:08:14:84
And I would like to convert it to seconds, something like this:
Start End
0 0.0
1 0.0 457.80
2 488.56 490.08
3 493.40
4 494.0 494.84
To do that I did:
i = 0
j = 0
while j < 10:
    while i < 10:
        if data.iloc[i, j] != "":
            Value = (int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) *60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8: 10])/100)
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], Value)
            i += 1
        else:
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], "")
            i += 1
        data.update(NewValue)
    i = 0
    j += 1
But I failed to replace the old values in my dataframe permanently. When I do:
print(data)
I still get my old dataframe in the wrong format.
Could someone help me? I tried so hard!
Thank you so much!
You are using pandas.DataFrame.update, which expects a pandas DataFrame (or an object that can be coerced into one) as its argument. See the Examples section of the update documentation to really understand what update does: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
If I may suggest a more idiomatic solution: you can directly map a function over all values of a pandas Series and assign the result back to the column.
def parse_timestring(s):
    if s == "":
        return s
    else:
        # weird to use centiseconds and not milliseconds
        # l is a list with [hour, minute, second, cs]
        l = [int(nbr) for nbr in s.split(":")]
        return sum([a*b for a,b in zip(l, (3600, 60, 1, 0.01))])
df["Start"] = df["Start"].map(parse_timestring)
You can remove the if ... else ... from parse_timestring if you first replace all empty strings in your dataframe with NaN values, using df = df.replace("", numpy.nan), and then use df["Start"] = df["Start"].map(parse_timestring, na_action='ignore').
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
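As a small end-to-end illustration of why the change becomes permanent, the sketch below applies the mapping to both columns and assigns each result back to the dataframe. The sample values are taken from the question. The layout of the sample frame (which cells are empty) is an assumption, and the field order hours:minutes:seconds:centiseconds follows the arithmetic in the question's own code.

```python
import pandas as pd

def parse_timestring(s):
    """Convert 'H:MM:SS:CC' (hours:minutes:seconds:centiseconds) to seconds."""
    if s == "":
        return s  # leave empty cells untouched
    hours, minutes, seconds, centis = (int(part) for part in s.split(":"))
    return hours * 3600 + minutes * 60 + seconds + centis / 100

# Assumed sample frame; empty strings mark missing values as in the question.
data = pd.DataFrame({
    "Start": ["0:00:00:00", "0:08:08:56", "0:08:13:40"],
    "End": ["0:07:37:80", "0:08:10:08", ""],
})

# map() returns a new Series; assigning it back is what changes the dataframe.
for column in ["Start", "End"]:
    data[column] = data[column].map(parse_timestring)

print(data)
```

The key design point is the assignment data[column] = ...: calling map() or replace() without storing the result leaves the original dataframe unchanged.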
The datetime library is made to deal with such data. You should also use the apply function of pandas to avoid iterating over the dataframe like that.
You should proceed as follows:
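from datetime import timedelta
def to_seconds(value):
    # Fields are hours:minutes:seconds:centiseconds, as in the question's own arithmetic
    if value == "":
        return value  # leave empty cells untouched
    hours, minutes, seconds, centis = (int(part) for part in value.split(':'))
    delta = timedelta(hours=hours, minutes=minutes, seconds=seconds, milliseconds=10 * centis)
    return delta.total_seconds()
data['Start'] = data['Start'].apply(to_seconds)
data['End'] = data['End'].apply(to_seconds)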
Thank you so much for your help.
Your method worked. I also found a method using loops.
To summarize, my general problem was that I had an ugly csv file that I wanted to transform into a csv usable for statistics, and I wanted to use Python to do that.
My csv file looked like this:
   MiceID = 1   Beginning    End          Type of behavior
0               0:00:00:00                Video start
1               0:00:01:36                grooming type 1
2               0:00:03:18                grooming type 2
3               0:00:06:73   0:00:08:16   grooming type 1
So in my ugly csv file, when different types of behavior directly followed each other, I wrote only the moment each behavior began, without its end. I wrote the end of a behavior only when the mouse stopped grooming altogether, which allowed me to separate sequences of grooming. But this type of csv was not easy to use for statistics.
So I wanted to 1) convert all my values to seconds to have a correct format, 2) fill the gaps in the End column (a gap has to be filled with the following beginning value, as the end of a specific behavior in a sequence is the beginning of the following one), 3) create columns for the duration of each behavior, and finally 4) fill these new columns with the durations.
My question was about the first step, but I put the code for each step here separately:
step 1: transform the values into a good format
import pandas as pd
import numpy as np
data = pd.read_csv("D:/Python/TestPythonTraitementDonnéesExcel/RawDataBatch2et3.csv", engine = "python")
data.replace(np.nan, "", inplace = True)
i = 0
j = 0
while j < len(data.columns):
    while i < len(data.index):
        if (":" in data.iloc[i, j]) == True:
            Value = str((int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) *60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8: 10])/100))
            data = data.replace([data.iloc[i, j]], Value)
            data.update(data)
            i += 1
        else:
            i += 1
    i = 0
    j += 1
print(data)
step 2: fill the gaps
i = 0
j = 2
while j < len(data.columns):
    while i < len(data.index) - 1:
        if data.iloc[i, j] == "":
            data.iloc[i, j] = data.iloc[i + 1, j - 1]
            data.update(data)
            i += 1
        elif np.all(data.iloc[i:len(data.index), j] == ""):
            break
        else:
            i += 1
    i = 0
    j += 4
print(data)
step 3: create a new column for each mouse:
j = 1
k = 0
while k < len(data.columns) - 1:
    k = (j * 4) + (j - 1)
    data.insert(k, "Duree{}".format(k), "")
    data.update(data)
    j += 1
print(data)
step 4: fill the new columns with the durations
j = 4
i = 0
while j < len(data.columns):
    while i < len(data.index):
        if data.iloc[i, j - 2] != "":
            data.iloc[i, j] = str(float(data.iloc[i, j - 2]) - float(data.iloc[i, j - 3]))
            data.update(data)
            i += 1
        else:
            break
    i = 0
    j += 5
print(data)
And of course, export my new usable dataframe
data.to_csv(r"D:/Python/TestPythonTraitementDonnéesExcel/FichierPropre.csv", index = False, header = True)
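For comparison, steps 2 to 4 can also be written without explicit loops. The sketch below is a simplified illustration, not a drop-in replacement: it assumes a single mouse, values already converted to seconds, missing ends stored as NaN rather than empty strings, and a hypothetical Duration column name. The sample numbers are the converted values from the csv excerpt above.

```python
import numpy as np
import pandas as pd

# One mouse, already in seconds; NaN marks an End that was not written down.
data = pd.DataFrame({
    "Beginning": [0.0, 1.36, 3.18, 6.73],
    "End": [np.nan, np.nan, np.nan, 8.16],
    "Type of behavior": ["Video start", "grooming type 1",
                         "grooming type 2", "grooming type 1"],
})

# Step 2: a missing End is the Beginning of the following row.
data["End"] = data["End"].fillna(data["Beginning"].shift(-1))

# Steps 3 and 4: create the duration column and fill it in one go.
data["Duration"] = data["End"] - data["Beginning"]

print(data)
```

With several mice stored side by side, the same two lines would have to be repeated per block of columns, so the loop-based version may still be the more direct fit for the original file layout.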

How to optimize an O(N*M) to be O(n**2)?

I am trying to solve USACO's Milking Cows problem. The problem statement is here: https://train.usaco.org/usacoprob2?S=milk2&a=n3lMlotUxJ1
Given a series of intervals in the form of a 2D array, I have to find the longest interval of continuous milking and the longest interval in which no milking was occurring.
Ex. Given the array [[500,1200],[200,900],[100,1200]], the longest interval would be 1100 as there is continuous milking and the longest interval without milking would be 0 as there are no rest periods.
I have tried looking at whether utilizing a dictionary would decrease run times but I haven't had much success.
f = open('milk2.in', 'r')
w = open('milk2.out', 'w')
#getting the input
farmers = int(f.readline().strip())
schedule = []
for i in range(farmers):
    schedule.append(f.readline().strip().split())
#schedule = data
minvalue = 0
maxvalue = 0
#getting the minimums and maximums of the data
for time in range(farmers):
    schedule[time][0] = int(schedule[time][0])
    schedule[time][1] = int(schedule[time][1])
    if (minvalue == 0):
        minvalue = schedule[time][0]
    if (maxvalue == 0):
        maxvalue = schedule[time][1]
    minvalue = min(schedule[time][0], minvalue)
    maxvalue = max(schedule[time][1], maxvalue)
filled_thistime = 0
filled_max = 0
empty_max = 0
empty_thistime = 0
#goes through all the possible items in between the minimum and the maximum
for point in range(minvalue, maxvalue):
    isfilled = False
    #goes through all the data for each point value in order to find the best values
    for check in range(farmers):
        if point >= schedule[check][0] and point < schedule[check][1]:
            filled_thistime += 1
            empty_thistime = 0
            isfilled = True
            break
    if isfilled == False:
        filled_thistime = 0
        empty_thistime += 1
    if (filled_max < filled_thistime) :
        filled_max = filled_thistime
    if (empty_max < empty_thistime) :
        empty_max = empty_thistime
print(filled_max)
print(empty_max)
if (filled_max < filled_thistime):
    filled_max = filled_thistime
w.write(str(filled_max) + " " + str(empty_max) + "\n")
f.close()
w.close()
The program works fine, but I need to decrease the time it takes to run.
A less pretty but more efficient approach would be to solve this like a free list, though it is a bit more tricky since the ranges can overlap. This method only requires looping through the input list a single time.
def insert(start, end):
    for existing in times:
        existing_start, existing_end = existing
        # New time is a subset of existing time
        if start >= existing_start and end <= existing_end:
            return
        # New time ends during existing time
        elif end >= existing_start and end <= existing_end:
            times.remove(existing)
            return insert(start, existing_end)
        # New time starts during existing time
        elif start >= existing_start and start <= existing_end:
            # existing[1] = max(existing_end, end)
            times.remove(existing)
            return insert(existing_start, end)
        # New time is superset of existing time
        elif start <= existing_start and end >= existing_end:
            times.remove(existing)
            return insert(start, end)
    times.append([start, end])
data = [
    [500,1200],
    [200,900],
    [100,1200]
]
times = [data[0]]
for start, end in data[1:]:
    insert(start, end)
# Sort the merged blocks so that neighbours in the list are neighbours in time
times.sort()
longest_milk = 0
longest_gap = 0
for i, time in enumerate(times):
    duration = time[1] - time[0]
    if duration > longest_milk:
        longest_milk = duration
    if i != len(times) - 1 and times[i+1][0] - times[i][1] > longest_gap:
        longest_gap = times[i+1][0] - times[i][1]
print(longest_milk, longest_gap)
As stated in the comments, if the input is already sorted, the complexity can be O(n). If that's not the case, we need to sort it first and the complexity is O(n log n):
lst = [ [300,1000],
        [700,1200],
        [1500,2100] ]
from itertools import groupby
longest_milking = 0
longest_idle = 0
l = sorted(lst, key=lambda k: k[0])
for v, g in groupby(zip(l[::1], l[1::1]), lambda k: k[1][0] <= k[0][1]):
    l = [*g][0]
    if v:
        mn, mx = min(i[0] for i in l), max(i[1] for i in l)
        if mx-mn > longest_milking:
            longest_milking = mx-mn
    else:
        mx = max((i2[0] - i1[1] for i1, i2 in zip(l[::1], l[1::1])))
        if mx > longest_idle:
            longest_idle = mx
# corner case, N=1 (only one interval)
if len(lst) == 1:
    longest_milking = lst[0][1] - lst[0][0]
print(longest_milking)
print(longest_idle)
Prints:
900
300
For input:
lst = [ [500,1200],
[200,900],
[100,1200] ]
Prints:
1100
0
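The sort-first idea can also be written as a plain sweep that merges intervals while walking the sorted list. This is a swapped-in variant for illustration, not the groupby code above: it keeps track of the furthest end seen so far in the current block, which matters when a short interval is nested inside an earlier, longer one. The function name milking_stats is hypothetical.

```python
def milking_stats(intervals):
    """Return (longest continuous milking, longest idle gap) for [start, end] pairs."""
    ordered = sorted(intervals)          # O(n log n); sorts by start time
    cur_start, cur_end = ordered[0]
    longest_milk = cur_end - cur_start
    longest_idle = 0
    for start, end in ordered[1:]:
        if start <= cur_end:
            # Overlaps or touches the current block: extend it if needed.
            cur_end = max(cur_end, end)
        else:
            # A gap before this interval: record it and start a new block.
            longest_idle = max(longest_idle, start - cur_end)
            cur_start, cur_end = start, end
        longest_milk = max(longest_milk, cur_end - cur_start)
    return longest_milk, longest_idle

print(milking_stats([[300, 1000], [700, 1200], [1500, 2100]]))
print(milking_stats([[500, 1200], [200, 900], [100, 1200]]))
```

After the sort, the loop touches each interval once, so the overall cost is dominated by the O(n log n) sort rather than by the size of the time range.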

Python randomly generated walks give same outcome when graphed

from pylab import *
no_steps = 10000
number = random()
position = zeros(no_steps)
position[0] = 0
time = zeros(no_steps)
time[0] = 0
for i in range(1, no_steps):
    time[i] = time[i-1] + 1
    if number >= 0.5:
        position[i] = position[i-1] + 1
        number = random()
    else:
        position[i] = position[i-1] - 1
        number = random()
plot(time, position)
number2 = random()
position2 = zeros(no_steps)
position2[0] = 0
time2 = zeros(no_steps)
time2[0] = 0
for t2 in range(1, no_steps):
    time2[t2] = time[t2-1] + 1
    if number2 >= 0.5:
        position2[t2] = position2[t2-1] + 1
        number2 = random()
    else:
        position2[t2] = position[t2-1] - 1
        number2 = random()
plot(time2,position2)
This is supposed to generate random walks by drawing a random number at each step and checking the condition. I assumed that if it works for one walk, I can just add more of the same and put them all on the same graph at the end. However, apparently that's not how this works: the graphs that end up being plotted are extremely similar, with the positions differing by -2 for some reason. If I run each block separately in its own program, they generate two completely different walks; it's only when I put them together that it stops working as intended. What exactly am I missing?
You've accidentally reused variables from the first plot:
for t2 in range(1, no_steps):
    time2[t2] = time[t2-1] + 1
    ^^^^^       ^^^^
    if number2 >= 0.5:
        position2[t2] = position2[t2-1] + 1
        number2 = random()
    else:
        position2[t2] = position[t2-1] - 1
        ^^^^^^^^^       ^^^^^^^^
        number2 = random()
plot(time2,position2)
I would generate the random walk with a function so you don't have to worry about renaming variables like this:
import numpy
from pylab import *
no_steps = 10000
def random_walk(no_steps):
    # 2 * [0, 1] - 1 -> [0, 2] - 1 -> [-1, 1]
    directions = 2 * numpy.random.randint(0, 2, size=(1, no_steps)) - 1
    positions = numpy.cumsum(directions)
    positions -= positions[0] # To make it start from zero
    return positions
time1 = numpy.arange(0, no_steps)
plot(time1, random_walk(no_steps))
savefig('1.png')
clf()
time2 = numpy.arange(0, no_steps)
plot(time2, random_walk(no_steps))
savefig('2.png')
