How can I add the two values on either side of a /, replacing the / between them? I tried item.replace("/","+"), but that only puts a + sign in place of the / and does nothing else: if I try it on 2/1.5 and -3/-4.5, I get 2+1.5 and -3+-4.5.
My intention is to add the two values separated by / and then divide by 2, so that the results become 1.75 and -3.75 respectively for 2/1.5 and -3/-4.5.
This is my try so far:
for item in ['2/1.5','-3/-4.5']:
    print(item.replace("/","+"))
What I'm getting now:
2+1.5
-3+-4.5
Expected output (adding the two values separated by /, then dividing the result by two):
1.75
-3.75
Since / is only a separator, you don't really need to replace it with +; instead, split on it and sum up the parts:
for item in ['2/1.5', '-3/-4.5']:
    result = sum(map(float, item.split('/'))) / 2
    print(result)
or in a more generalized form:
from statistics import mean
for item in ['2/1.5', '-3/-4.5']:
    result = mean(map(float, item.split('/')))
    print(result)
You can do it using eval like this:
for item in ['2/1.5','-3/-4.5']:
    print((eval(item.replace("/","+")))/2)
My answer is not that different from others, except I don't understand why everyone is using lists. A list is not required here because it won't be altered, a tuple is fine and more efficient:
for item in '2/1.5','-3/-4.5':  # Don't need a list here
    num1, num2 = item.split('/')
    print((float(num1) + float(num2)) / 2)
A further elaboration of #daniel's answer:
[sum(map(float, item.split('/'))) / 2 for item in ('2/1.5','-3/-4.5')]
Result:
[1.75, -3.75]
You can do it like this (by splitting each string into two floats):
for item in ['2/1.5','-3/-4.5']:
    itemArray = item.split("/")
    itemResult = float(itemArray[0]) + float(itemArray[1])
    print(itemResult/2)
You can also use ast.literal_eval, which is a safer alternative to eval. Note that literal_eval only accepts literals, so it won't evaluate an expression like '2+1.5'; parse the two numbers separately instead:
from ast import literal_eval
l = ['2/1.5','-3/-4.5']
print([sum(literal_eval(x) for x in i.split('/'))/2 for i in l])
I am looking to get the closest match between two columns of string data in two separate tables. I don't think the content matters too much. There are words that I can match by pre-processing the data (lowercasing all letters, removing spaces and stop words, etc.) and doing a join, but that only gets me around 80 matches out of over 350. It is important to know that the two tables have different lengths.
I did try to use some code I found online, but it isn't working:
import sys
from difflib import SequenceMatcher, get_close_matches

def Races_chien(df1, df2):
    myList = []
    total = len(df1)
    possibilities = list(df2['Rasse'])
    s = SequenceMatcher(isjunk=None, autojunk=False)
    for idx1, df1_str in enumerate(df1['Race']):
        my_str = ('Progress : ' + str(round((idx1 / total) * 100, 3)) + '%')
        sys.stdout.write('\r' + str(my_str))
        sys.stdout.flush()
        # get 1 best match that has a ratio of at least 0.7
        best_match = get_close_matches(df1_str, possibilities, 1, 0.7)
        s.set_seq2(df1_str, best_match)
        myList.append([df1_str, best_match, s.ratio()])
    return myList
It says: TypeError: set_seq2() takes 2 positional arguments but 3 were given
How can I make this work?
I think you need the s.set_seqs(df1_str, best_match) method instead of s.set_seq2(df1_str, best_match) (see the difflib docs): set_seq2 takes a single sequence, while set_seqs sets both sequences at once.
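For illustration, the relevant part of the loop would then become something like this (a sketch; get_close_matches returns a list, which may be empty, so take its first element and guard against no match):
best_match = get_close_matches(df1_str, possibilities, 1, 0.7)
if best_match:
    s.set_seqs(df1_str, best_match[0])
    myList.append([df1_str, best_match[0], s.ratio()])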
You can use the jellyfish library, which has useful tools for comparing how similar two strings are, if that is what you are looking for.
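A minimal sketch using jellyfish's Levenshtein distance (assumes pip install jellyfish; the helper and the sample data are made up for illustration):
import jellyfish

def similarity(a, b):
    # normalized Levenshtein similarity in [0, 1]
    return 1 - jellyfish.levenshtein_distance(a, b) / max(len(a), len(b))

candidates = ['labrador retriever', 'poodle', 'beagle']  # hypothetical data
print(max(candidates, key=lambda c: similarity('labrador', c)))
# labrador retriever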
Try changing:
s = SequenceMatcher(isjunk=None, autojunk=False)
To:
s = SequenceMatcher(None, autojunk=False)
(isjunk is the first positional parameter, so passing it both positionally and by keyword would raise a TypeError).
Here is an answer I finally got:
import pandas as pd
from fuzzywuzzy import process, fuzz

value = []
similarity = []
for i in df1.col:
    ratio = process.extract(i, df2.col, limit=1)
    value.append(ratio[0][0])
    similarity.append(ratio[0][1])
df1['value'] = pd.Series(value)
df1['similarity'] = pd.Series(similarity)
This will add to df1 the value with the closest match from df2, together with the similarity %.
So I'm using pandas to filter a CSV, and I need to filter a column on three different strings, but when I use or (|) I get an error. Is there any other way I can filter on multiple strings without having to name a different variable to act as one filter each? This is the code:
# What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
bdegree = df[(df["education"] == "Bachelors") & (df["salary"] >= "50K")].count()
mdegree = df[(df["education"] == "Masters") & (df["salary"] >= "50K")].count()
phddegree = df[(df["education"] == "Doctorate") & (df["salary"] >= "50K")].count()
all_degrees = bdegree + mdegree + phddegree
print(all_degrees)
percentaje_of_more50 = (all_degrees / df["education" == "Bachelors"|"Masters"|"Doctorate"].count())*100
print("The percentaje of people with bla bla bla is", percentaje_of_more50["education"].round(1))
By the way, I'm still working on an error in the logic of this code, so just ignore that :).
== looks for an exact match, and no one's "education" equals the string "Bachelors"|"Masters"|"Doctorate" anyway; in fact | isn't even defined between plain strings, which is why that line raises an error.
You can use isin instead like:
msk = df["education"].isin(["Bachelors","Masters","Doctorate"])
The above will return a boolean Series, so using the .count method on it will just show its length, which is probably not what you want. So you need to use it to filter the relevant rows:
df[msk].count()
Then you can write percentage_of_more50 as:
percentage_of_more50 = (all_degrees / df[msk].count())*100
Note that you can also derive all_degrees using isin as well:
all_degrees = df[df["education"].isin(["Bachelors","Masters","Doctorate"]) & (df['salary']>='50K')].count()
Also df["salary"] >= "50K" works as you intend only if all salaries are below "99k" otherwise you'll end up with wrong output because if you check "100k" > "50k" it throws up False, even though it's True. One way to get rid of this problem is to fill the "salary" column data with "0"s until each entry is a certain number of characters long using str.zfill like:
df['salary'] = df['salary'].str.zfill(5)
Then each entry becomes 5 characters long. For example,
s = pd.Series(['100k','50k']).str.zfill(5)
becomes:
0 0100k
1 0050k
dtype: object
Then you can make the correct comparison.
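A quick check of that (after zfill, the string order agrees with the numeric order):
print('100k'.zfill(5) > '50k'.zfill(5))  # True  ('0100k' > '0050k')
print('100k' > '50k')                    # False (plain string comparison)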
Background
I have a function called get_player_path that takes in a list of strings, player_file_list, and an int value, total_players. For the sake of example I have reduced the list of strings and also set the int value to a very small number.
Each string in the player_file_list has either the form year-date/player_id/some_random_file.file_extension or
year-date/player_id/IDATs/some_random_number/some_random_file.file_extension
Issue
What I am essentially trying to achieve here is to go through this list and store every unique year-date/player_id path in a set until its length reaches the value of total_players.
My current approach does not seem the most efficient to me, and I am wondering if I can speed up my function get_player_path in any way?
Code
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        player_file = player_file.split("/")
        file_path = f"{player_file[0]}/{player_file[1]}/"
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)
player_file_list = [
"2020-10-27/31001804320549/31001804320549.json",
"2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
"2020-10-28/31001804320548/31001804320549.json",
"2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
"2020-10-29/31001804320547/31001804320549.json",
"2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
"2020-10-30/31001804320546/31001804320549.json",
"2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
"2020-10-31/31001804320545/31001804320549.json",
"2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
]
print(get_player_path(player_file_list, 2))
Output
['2020-10-27/31001804320549/', '2020-10-28/31001804320548/']
Let's analyze your function first:
your loop should take linear time (O(n)) in the length of the input list, assuming the path lengths are bounded by a relatively "small" number;
the sorting takes O(n log(n)) comparisons.
Thus the sorting has the dominant cost when the list becomes big. You can micro-optimize your loop as much as you want, but as long as you keep that sorting at the end, your effort won't make much of a difference with big lists.
Your approach is fine if you're just writing a Python script. If you really needed performance with huge lists, you would probably be using some other language. Nonetheless, if you really care about performance (or just want to learn new stuff), you could try one of the following approaches:
replace the generic sorting algorithm with something specific for strings; see here for example
use a trie, removing the need for sorting; this could be theoretically better but probably worse in practice (a small sketch follows this list).
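Just to illustrate the trie idea, a rough sketch (the helper is made up for illustration and not benchmarked): inserting every path into a character trie and walking each node's children in sorted key order yields the paths in sorted order without a separate global sort.
def trie_sorted(paths):
    END = object()  # sentinel marking the end of a stored string
    root = {}
    for p in paths:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node[END] = True
    out = []
    def walk(node, prefix):
        if END in node:
            out.append(prefix)
        for ch in sorted(k for k in node if k is not END):
            walk(node[ch], prefix + ch)
    walk(root, '')
    return out

print(trie_sorted(['b/2', 'a/1']))  # ['a/1', 'b/2']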
Just for completeness, as a micro-optimization, assuming the date has a fixed length of 10 characters:
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        end = player_file.find('/', 12)  # <--- len(date) + len('/') + 1
        file_path = player_file[:end]    # <---
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)
If the IDs have fixed length too, as in your example list, then you don't need any split or find, just:
LENGTH = DATE_LENGTH + ID_LENGTH + 1  # 1 is for the slash between date and id
...
for player_file in player_file_list:
    file_path = player_file[:LENGTH]
    ...
EDIT: fixed the LENGTH initialization, I had forgotten to add 1
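For completeness, here is that sketch filled in, assuming the fixed widths from the sample data (10-character dates and 14-character IDs; adjust the constants if your real paths differ):
DATE_LENGTH = 10  # e.g. '2020-10-27'
ID_LENGTH = 14    # e.g. '31001804320549'
LENGTH = DATE_LENGTH + ID_LENGTH + 1  # 1 is for the slash between date and id

def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        player_files_to_process.add(player_file[:LENGTH])
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)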
I'll leave this solution here, which can surely be further improved; hope it helps.
player_file_list = (
"2020-10-27/31001804320549/31001804320549.json",
"2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
"2020-10-28/31001804320548/31001804320549.json",
"2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
"2020-10-29/31001804320547/31001804320549.json",
"2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
"2020-10-30/31001804320546/31001804320549.json",
"2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
"2020-10-31/31001804320545/31001804320549.json",
"2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
)
def get_player_path(l, n):
    pfl = set()
    for i in l:
        i = "/".join(i.split("/")[0:2])
        if i not in pfl:
            pfl.add(i)
        if len(pfl) == n:
            return pfl
    if n > len(pfl):
        print("not enough matches")
        return
print(get_player_path(player_file_list, 2))
# {'2020-10-27/31001804320549', '2020-10-28/31001804320548'}
Use a dict so that you don't have to sort: your list is already sorted, and dicts preserve insertion order. If you still need to sort, you can always use sorted in the return statement. Add import re and replace your function as follows:
def get_player_path(player_file_list, total_players):
    dct = {re.search(r'^\w+-\w+-\w+/\w+', pf).group(): 1 for pf in player_file_list}
    return [k for i, k in enumerate(dct.keys()) if i < total_players]
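Called with the question's sample data, this returns the first two unique paths (note: without the trailing slash of the question's expected output):
print(get_player_path(player_file_list, 2))
# ['2020-10-27/31001804320549', '2020-10-28/31001804320548']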
I have the following list:
erra_eus_repo = [(u'RHSA-2017:2796', u'6.7'), (u'RHSA-2017:2796', u'6.8'), (u'RHSA-2017:2794', u'7.2'), (u'RHSA-2017:2793', u'7.3')]
What I am trying to do is take the floating-point numbers from each tuple:
6.7, 6.8, 7.2, 7.3
and get the max number for each version, grouped by the part before the dot, i.e.:
new_list = [6.8, 7.3]
Note that max() will not work here: if I have 5.9 and 5.11, I will get 5.9 as the max, but I want the result to be 5.11, since 11 > 9.
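For illustration:
>>> max(5.9, 5.11)
5.9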
What I have tried:
eus_major = []
eus_minor = []
for major in erra_eus_repo:
    minor = (major[1][2])
    major = (major[1][0])
    if major not in eus_major:
        eus_major.append(major)
    if minor not in eus_minor:
        eus_minor.append(minor)
print(eus_major, eus_minor)
Currently I am getting:
[u'6', u'7'] [u'7', u'2', u'3']
You can achieve this, for instance, with a combination of groupby and sorting:
from itertools import groupby

srt_list = sorted(erra_eus_repo, key=lambda x: x[1])
max_list = []
for key, group in groupby(srt_list, lambda x: x[1].split('.')[0]):
    max_el = max(list(group), key=lambda y: int(y[1].split('.')[1]))
    max_list.append(float(max_el[1]))
First the list is sorted on the second element of each tuple, so that elements sharing the same non-decimal part form the contiguous runs groupby needs. groupby then groups the elements into exactly that: each group represents a sequence X.Z with common X. Within each group, the program finds the element whose decimal part, treated as a stand-alone number, is largest, and appends that whole number to the max list as a float.
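For the sample list in the question, this yields:
print(max_list)  # [6.8, 7.3]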
Do not treat the version numbers as floating point; treat them as '.'-separated strings. Then split each version string (on '.') and compare the parts as integers, like this:
def normalize_version(v):
    return tuple(map(int, v.split('.')))
Then you can see:
>>> u'5.11' > u'5.9'
False
>>> normalize_version(u'5.11') > normalize_version(u'5.9')
True
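For instance (a sketch, not part of the original answer), normalize_version can be combined with itertools.groupby to get the per-major maximum the question asks for:
from itertools import groupby

versions = sorted((v for _, v in erra_eus_repo), key=normalize_version)
new_list = [float(max(group, key=normalize_version))
            for _, group in groupby(versions, key=lambda v: normalize_version(v)[0])]
print(new_list)  # [6.8, 7.3]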
Here is another take on the problem, which provides the highest value for each RHSA identifier (which is what I think you're after):
erra_eus_repo = [(u'RHSA-2017:2796', u'6.7'), (u'RHSA-2017:2796', u'6.8'), (u'RHSA-2017:2794', u'7.2'), (u'RHSA-2017:2793', u'7.3')]
eus_major = {}
for r in erra_eus_repo:
    if r[0] not in eus_major:
        eus_major[r[0]] = u'0.0'
    # compare as (major, minor) integer tuples so that e.g. 6.11 beats 6.9
    if tuple(map(int, r[1].split('.'))) > tuple(map(int, eus_major[r[0]].split('.'))):
        eus_major[r[0]] = r[1]
print(eus_major)
output:
{'RHSA-2017:2796': '6.8', 'RHSA-2017:2794': '7.2', 'RHSA-2017:2793': '7.3'}
I left the value as a string, but it could easily be cast as a float.
The following simply uses the built-in min and max functions:
erra_eus_repo = [(u'RHSA-2017:2796', u'6.7'),
(u'RHSA-2017:2796', u'6.8'),
(u'RHSA-2017:2794', u'7.2'),
(u'RHSA-2017:2793', u'7.3')]
eus_major = max(float(major[1]) for major in erra_eus_repo)
eus_minor = min(float(major[1]) for major in erra_eus_repo)
newlist = [eus_minor, eus_major]
print(newlist) # -> [6.7, 7.3]
This may look like you're trying to compare decimal values, but you really aren't. It goes without saying (but I will) that while 9 < 11, .9 > .11. So splitting the number into two separate values is really the only way to get a valid comparison.
The list is a list of tuples: you have the master list, and each item holds an RHSA identifier and a value. Apparently you want to discard the first item of each tuple and only keep the (I assume) version number. Here's some code that, while crude, will give you an idea of what to do. (I'd welcome comments on how to clean it up; one compact variant follows the code.) I've taken the list, split out the versions, split each version into major and minor, compared them, and added the value when nothing matching exists in the list yet. For the sake of testing I also added a 6.11 version number.
lstVersion = []
lstMaxVersion = []
erra_eus_repo = [(u'RHSA-2017:2796', u'6.7'), (u'RHSA-2017:2796', u'6.8'), (u'RHSA-2017:2796', u'6.11'), (u'RHSA-2017:2794', u'7.2'), (u'RHSA-2017:2793', u'7.3')]
for strItem in erra_eus_repo:
    lstVersion.append(strItem[1])
for strVersion in lstVersion:
    blnAdded = False
    intMajor = int(strVersion.split('.')[0])
    intMinor = int(strVersion.split('.')[1])
    print('intMajor: ', intMajor)
    print('intMinor: ', intMinor)
    for strMaxItem in lstMaxVersion:
        intMaxMajor = int(strMaxItem.split('.')[0])
        intMaxMinor = int(strMaxItem.split('.')[1])
        print('strMaxItem: ', strMaxItem)
        print('intMaxMajor: ', intMaxMajor)
        print('intMaxMinor: ', intMaxMinor)
        if intMajor == intMaxMajor:
            blnAdded = True
            if intMinor > intMaxMinor:
                lstMaxVersion.remove(strMaxItem)
                lstMaxVersion.append(str(intMajor) + '.' + str(intMinor))
            break  # don't keep iterating over a list we may have just modified
    if not blnAdded:
        lstMaxVersion.append(str(intMajor) + '.' + str(intMinor))
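One way to clean that up: a more compact sketch of the same idea (not from the original answer), keeping the best minor seen for each major in a dict:
best = {}
for _, version in erra_eus_repo:
    major, minor = map(int, version.split('.'))
    if minor > best.get(major, -1):
        best[major] = minor
lstMaxVersion = ['%d.%d' % (mj, mn) for mj, mn in sorted(best.items())]
print(lstMaxVersion)  # ['6.11', '7.3'] with the extra 6.11 test entry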
How would I write a function in Python to determine if a list of filenames matches a given pattern and which files are missing from that pattern? For example:
Input ->
KUMAR.3.txt
KUMAR.4.txt
KUMAR.6.txt
KUMAR.7.txt
KUMAR.9.txt
KUMAR.10.txt
KUMAR.11.txt
KUMAR.13.txt
KUMAR.15.txt
KUMAR.16.txt
Desired Output-->
KUMAR.5.txt
KUMAR.8.txt
KUMAR.12.txt
KUMAR.14.txt
Input -->
KUMAR3.txt
KUMAR4.txt
KUMAR6.txt
KUMAR7.txt
KUMAR9.txt
KUMAR10.txt
KUMAR11.txt
KUMAR13.txt
KUMAR15.txt
KUMAR16.txt
Desired Output -->
KUMAR5.txt
KUMAR8.txt
KUMAR12.txt
KUMAR14.txt
You can approach this as:
Convert the filenames to appropriate integers.
Find the missing numbers.
Combine the missing numbers with the filename template as output.
For (1), if the file structure is predictable, then this is easy.
def to_num(s, start=6):
    return int(s[start:s.index('.txt')])
Given:
lst = ['KUMAR.3.txt', 'KUMAR.4.txt', 'KUMAR.6.txt', 'KUMAR.7.txt',
       'KUMAR.9.txt', 'KUMAR.10.txt', 'KUMAR.11.txt', 'KUMAR.13.txt',
       'KUMAR.15.txt', 'KUMAR.16.txt']
you can get a list of known numbers with list(map(to_num, lst)) (wrapping in list matters in Python 3, where map returns a one-shot iterator). Of course, to look for gaps, you only really need the minimum and maximum. Combine that with the range function and you get all the numbers that you should see; then remove the numbers you've got. Sets are helpful here.
def find_gaps(int_list):
    return sorted(set(range(min(int_list), max(int_list))) - set(int_list))
Putting it all together:
missing = find_gaps(list(map(to_num, lst)))
for i in missing:
    print('KUMAR.%d.txt' % i)
Assuming the patterns are relatively static, this is easy enough with a regex:
import re

inlist = "KUMAR.3.txt KUMAR.4.txt KUMAR.6.txt KUMAR.7.txt KUMAR.9.txt KUMAR.10.txt KUMAR.11.txt KUMAR.13.txt KUMAR.15.txt KUMAR.16.txt".split()

def get_count(s):
    return int(re.match(r'.*\.(\d+)\..*', s).groups()[0])

mincount = get_count(inlist[0])
maxcount = get_count(inlist[-1])
values = set(map(get_count, inlist))
for ii in range(mincount, maxcount):
    if ii not in values:
        print('KUMAR.%d.txt' % ii)
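Note that the regex above expects a dot on both sides of the number, so it won't match the second input form (e.g. KUMAR3.txt). A variant sketch that handles both forms by anchoring on the digits just before the extension (the helper name is made up):
def find_missing(names):
    # split each name into (prefix, number) using the digits right before '.txt'
    pairs = [re.match(r'(.*?)(\d+)\.txt$', n).groups() for n in names]
    prefix = pairs[0][0]  # assumes a single common prefix
    nums = {int(n) for _, n in pairs}
    return ['%s%d.txt' % (prefix, i)
            for i in range(min(nums), max(nums)) if i not in nums]

print(find_missing(['KUMAR3.txt', 'KUMAR4.txt', 'KUMAR6.txt']))
# ['KUMAR5.txt']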