Inserting random values based on a condition - Python

I have the following DataFrame containing various information about a certain product. Input3 is a column filled with sentences randomly chosen from a list, created as shown below:
sentence_list = ['Køb online her', 'Sammenlign priser her', 'Tjek priser fra 4 butikker', 'Se produkter fra 4 butikker', 'Stort udvalg fra 4 butikker', 'Sammenlign og køb']
df["Input3"] = np.random.choice(sentence_list, size=len(df))
Full_Input is a string created by joining various columns, its content being something like: "ProductName from Brand - Buy online here - Sitename". It is created like this:
df["Full_Input"] = df['TitleTag'].astype(str) + " " + df['Input2'].astype(str) + " " + df['Input3'].astype(str) + " " + df['Input4'].astype(str) + " " + df['Input5'].astype(str)
The problem here is that Full_Input_Length (the length of Full_Input) should stay under 55. Therefore I am trying to figure out how to put a condition on the random generation of Input3, so that when it is joined with the other columns' strings, the full input length does not go over 55.
This is what I tried:
for col in range(len(df)):
    condlist = [df["Full_Input"].apply(len) < 55]
    choicelist = [sentence_list]
    df['Input3_OK'][col] = np.random.choice.select(condlist, choicelist)
As expected, it doesn't work like that. np.random.choice.select is not a thing and I am getting an AttributeError.
How can I do that instead?

If you are guaranteed to have at least one item in sentence_list that satisfies the condition, you may want to try conditioning your random selection ONLY on the values in sentence_list that are of an acceptable length:
# filter the list down to sentences that fit:
my_sentences = [s for s in sentence_list if len(s) < MAX_LENGTH]
# randomly select from the filtered list:
np.random.choice(my_sentences)
In other words, perform the filter on the list of strings BEFORE you call np.random.choice.
You can run this for each row in a dataframe like so:
def choose_string(full_input):
    return np.random.choice([
        s
        for s in sentence_list
        if len(s) + len(full_input) < 55
    ])

df["Input3_OK"] = df.Full_Input.map(choose_string)

Related

Python closest match between two string columns

I am looking to get the closest match between two columns of string data in two separate tables. I don't think the content matters too much. There are words that I can match by pre-processing the data (lowercasing all letters, removing spaces and stop words, etc.) and doing a join, but that way I only get around 80 matches out of over 350. It is important to know that the two tables have different lengths.
I did try to use some code I found online but it isn't working:
import sys
from difflib import SequenceMatcher, get_close_matches

def Races_chien(df1, df2):
    myList = []
    total = len(df1)
    possibilities = list(df2['Rasse'])
    s = SequenceMatcher(isjunk=None, autojunk=False)
    for idx1, df1_str in enumerate(df1['Race']):
        my_str = 'Progress : ' + str(round((idx1 / total) * 100, 3)) + '%'
        sys.stdout.write('\r' + str(my_str))
        sys.stdout.flush()
        # get 1 best match that has a ratio of at least 0.7
        best_match = get_close_matches(df1_str, possibilities, 1, 0.7)
        s.set_seq2(df1_str, best_match)
        myList.append([df1_str, best_match, s.ratio()])
    return myList
It says: TypeError: set_seq2() takes 2 positional arguments but 3 were given
How can I make this work?
I think you need s.set_seqs(df1_str, best_match) function instead of s.set_seq2(df1_str, best_match) (docs)
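For instance, a minimal sketch (my reading of the docs; note that get_close_matches returns a list, so you likely want its first element):
best_match = get_close_matches(df1_str, possibilities, 1, 0.7)
if best_match:
    s.set_seqs(df1_str, best_match[0])  # set both sequences at once
    myList.append([df1_str, best_match[0], s.ratio()])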
You can use the jellyfish library, which has useful tools for comparing how similar two strings are, if that is what you are looking for.
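For example, a minimal sketch; the name jaro_winkler_similarity assumes a recent jellyfish release (older versions call it jaro_winkler):
import jellyfish

def closest(query, candidates):
    # Candidate with the highest Jaro-Winkler similarity to the query.
    return max(candidates,
               key=lambda c: jellyfish.jaro_winkler_similarity(query, c))

print(closest("labrador retreiver", ["Labrador Retriever", "Golden Retriever"]))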
Note that SequenceMatcher's signature is SequenceMatcher(isjunk=None, a='', b='', autojunk=True), so your constructor call is fine; the TypeError comes from set_seq2(), which takes a single sequence:
s.set_seq2(best_match)
Here is the answer I finally arrived at:
from fuzzywuzzy import process

value = []
similarity = []
for i in df1.col:
    ratio = process.extract(i, df2.col, limit=1)
    value.append(ratio[0][0])
    similarity.append(ratio[0][1])
df1['value'] = pd.Series(value)
df1['similarity'] = pd.Series(similarity)
This adds the closest match from df2 to df1, together with the similarity percentage.
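For reference, a toy run (my example) showing the shape process.extract returns when given a plain list: a list of (match, score) tuples, where the score is a 0-100 similarity:
from fuzzywuzzy import process

choices = ["Labrador Retriever", "Golden Retriever", "Beagle"]
print(process.extract("labrador retreiver", choices, limit=1))
# -> [('Labrador Retriever', <score>)]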

Python: Rendering a chart from multidimensional data

I'm writing a script to render pie charts from multidimensional array data, and I'm struggling to keep each step synchronized.
Here's my code, with some dummy data (animal types) for example:
def parse_data(chart):
    print("\n=====Chart Data=====")
    for a, b in chart.items():  # the dictionary "chart" has items - key: a, value: b
        output = []  # list of what to print to the console
        output.append(str(a))  # key a - the name of the superset
        if type(b) is float:  # if there are no subsets and the value is a float...
            output.append(': ' + str(b) + '%')  # add that number (%) to the console output list
        elif type(b) is list:  # else if there are subsets they'll be in a list
            for c in b:
                for d, e in c.items():
                    output.append('\n ' + str(d) + ': ' + str(e) + '%')
        else:  # if 'b' is neither a float nor a list
            print('Error: Data could not be parsed correctly.')
        print('\n' + ''.join(map(str, output)))  # put the output list together and print it

chart_data = {
    "Mammal Group": [{"Hamster": 23.1}, {"Yeti": 16.4}],
    "Platypus": 14.2,
    "Reptile Group": [{"Snake": 4.0}, {"Komodo Dragon": 0.7}]
}

parse_data(chart_data)
The console output is:
=====Chart Data=====
Mammal Group
Hamster: 23.1%
Yeti: 16.4%
Platypus: 14.2%
Reptile Group
Snake: 4.0%
Komodo Dragon: 0.7%
This all looks fine so far. Groups/supersets are represented by inner slices, and subsets are represented by outer slices. Notice some animals (Platypus) do not belong to a group, and a percentage is listed next to their name directly. Other animals are subsets of some larger group/superset. The next step is to grab that data and group-by-group send the data to a function to be rendered.
Here's a mock-up animating the order I imagine it would be logical to render in: superset, then subsets of that superset, then on to the next superset if there is one. If there's no parent superset (Platypus), skip rendering of the superset and let the subset take up both inner and outer areas. The final chart will not be animated; this is only to demonstrate the order.
When each slice is created, the starting and ending angles need to be kept track of, and those angles will be different for subset slices than for superset slices. They need to be tracked so that after a set is rendered, the angle marker is placed at the end of it, ready for the next slice.
I've got the rendering function working, but trying to feed it correct data slice by slice is driving me nuts. How can the chart_data be extracted in an organized way, to feed to the rendering function? Thanks for any help!
Although it's unlikely that others will need to code something quite like this, I have reached a solution and would prefer not to leave this question unanswered, so I will attempt to explain it. Anyone randomly curious can read on....
I solved this by creating two empty lists - one for the "superset" inner slices, and one for the "subset" outer slices, and then populating them as the chart data gets iterated through. Although the subset slices are always children of a superset slice in the structure of the original chart data, that structure does not need to be preserved in order to render the chart.
The supersets that have no subset children get added to both super and sub lists, but this is invisible in the final pie chart. To understand what is meant by "invisible", imagine that the blue slice in the mock-up image is actually comprised of an inner segment and outer segment, one beside the other. If the pie chart were uncurled and laid out as two straight parallel tracks they would still line up with one another, as they have the same percentage.
But only one should be labeled, and there is more room for text in the outer area. For this reason, superset slices containing no subset slices get their name labels removed and set to data type None. Those name labels get moved to each one's neighboring subset slice instead. Slices labeled None will not be rendered, but the place marker will rotate by the percentage of that slice. To use the mock-up image as an example, this is what allows moving from the end of the green superset slice to the beginning of the orange superset slice without drawing an unwanted second slice "over" the blue slice. (Imagine the inner superset slices as being laid on top of and covering the outer subset slices.)
As most people reading this probably know, None can't be used as a dictionary key more than once: for example, the entry "Platypus": 14.2 would become None: 14.2, which would then be overwritten by "Jabberwock": 41.6 becoming None: 41.6. Because of this, I decided to convert the dictionary chart_data into a list before formatting it for the graph.
As the iteration happens, the percentages of the subset slices get tallied up so that their parent superset group matches them in size. (Size meaning width if laid out in two parallel tracks, and rotation angle if viewed in the pie chart.)
Finally, I threw in some code to check whether the chart percentages add up to 100.
chart_data = {
    "Mammal Group": [{"Hamster": 23.1}, {"Yeti": 16.4}],
    "Platypus": 14.2,
    "Reptile Group": [{"Snake": 4.0}, {"Komodo Dragon": 0.7}],
    "Jabberwock": 41.6
}

track_sup = []
track_sub = []

def dict_to_list(dict_data):  # converts a dictionary to a list (nested dictionaries are untouched)
    new_list = []
    for key, value in dict_data.items():
        super_pair = [key, value]
        new_list.append(super_pair)
    return new_list

def format_data(dict_data):
    global track_sup
    global track_sub
    track_sup = []
    track_sub = []
    print('\n\n\n====== Formatting data ======')
    super_slices = dict_to_list(dict_data)  # convert to list to allow more than one single super slice (their labels are type: None)
    chart_perc = 0.0  # for checking that the chart slices all add up to 100%
    i = 0
    while i < len(super_slices):
        tally = 0.0  # for adding up subset percentages to get each superset percentage
        is_single = True  # supersets single by default
        super_slice = super_slices[i]
        slice_label = ''
        super_pair = []
        sub_pair = []
        if type(super_slice[1]) == list:  # if [1] is a list, there are sub slices
            is_single = False  # mark superset as containing multiple subsets
            slice_label = super_slice[0]
            sub_slices = super_slice[1]
            j = 0
            while j < len(sub_slices):  # iterate sub slices to gather label names and percentages
                sub_slice = sub_slices[j]
                for k, v in sub_slice.items():  # in each dict, k is a label and v is a percentage
                    v = float(v)
                    tally = tally + v  # count toward super slice (group) percentage
                    chart_perc = chart_perc + v  # count toward chart total percentage
                    sub_pair = [k, v]  # convert each key-value pair into a list
                    print(str(sub_pair[0]) + ' ' + str(sub_pair[1]) + ' %')
                    track_sub.append(sub_pair)  # append this pair to final sub output list
                j = j + 1
            print('Group [' + slice_label + '] combined total is ' + str(tally) + ' % \n')
        elif type(super_slice[1]) == float:  # this super slice (group) contains no sub slices
            slice_label = super_slice[0]
            tally = super_slice[1]  # no sub slice percentages to add up
            chart_perc = chart_perc + super_slice[1]  # count toward chart total percentage
            sub_pair = [slice_label, tally]  # label drops to the sub slot as it only labels itself
            track_sub.append(sub_pair)  # append this pair to final sub output list
            print(slice_label + ' ' + str(tally) + ' % (Does not belong to a group)\n')
        else:
            print('Error: Could not format data.')
        if is_single:
            slice_label = None  # label removed for each single slice - only the percentage is used (for spacing)
        super_pair = [slice_label, tally]  # pair up each label name and super slice percentage in a list
        track_sup.append(super_pair)  # append this pair to final super output list
        i = i + 1
    chart_perc = round(chart_perc, 6)  # round to 6 decimal places
    short = 0.0
    if chart_perc == 100.0:
        print('______ Sum of all chart slices is 100 % ______\n')
    else:
        print('****** WARNING: Chart slices do not add up to 100 % ! ******')
        short = round(100.0 - chart_perc, 6)
        print('Sum of all chart slices is only ' + str(chart_perc) + ' % (Falling short by ' + str(short) + ' %)\n')

format_data(chart_data)
print(track_sup)
print(track_sub)
And the resulting console output:
====== Formatting data ======
Hamster 23.1 %
Yeti 16.4 %
Group [Mammal Group] combined total is 39.5 %
Platypus 14.2 % (Does not belong to a group)
Snake 4.0 %
Komodo Dragon 0.7 %
Group [Reptile Group] combined total is 4.7 %
Jabberwock 41.6 % (Does not belong to a group)
______ Sum of all chart slices is 100 % ______
[['Mammal Group', 39.5], [None, 14.2], ['Reptile Group', 4.7], [None, 41.6]]
[['Hamster', 23.1], ['Yeti', 16.4], ['Platypus', 14.2], ['Snake', 4.0], ['Komodo Dragon', 0.7], ['Jabberwock', 41.6]]
track_sup and track_sub, whose data is shown in the last two lines of output, are the two lists that will actually be used to render the chart.
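As a small illustration of the angle bookkeeping described in the question (my sketch, not part of the original solution), the start and end angles for each slice can be derived from these lists with a running marker:
def slice_angles(track):
    # Convert a [label, percent] list into (label, start_deg, end_deg) triples.
    angles = []
    marker = 0.0  # running angle marker, in degrees
    for label, percent in track:
        sweep = percent * 360.0 / 100.0
        angles.append((label, marker, marker + sweep))
        marker += sweep
    return angles

print(slice_angles(track_sup))  # inner slices (labels of single slices are None)
print(slice_angles(track_sub))  # outer slices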

More efficient way to evaluate overlap of a key-value list (Pandas)

I have a list (csv) of 1M rows like this:
Keyword,URL
Word1,URL1
Word1,URL2
..
Word1,URL100
Word2,URL4
Word1,URL101
..
Word10000,URLN
So I have 10,000 keywords with 100 URLs for each keyword. Each URL could be related to one or more keyword(s).
I need to obtain a Pandas dataframe (or a csv), like this:
Keyword1,Keyword2,Weight
Word1,Word2,5
Word1,Word3,6
Here the weight is the number of URLs shared by each pair of keywords: in the example, I suppose "Word1" and "Word2" have 5 shared URLs.
I used Pandas and did a nested iteration over the dataframes, but I need a more efficient way to do this, since nested iteration is not the best way to perform this task.
for index, row in keylist.iterrows():
    keyurlcompare = keyurl[keyurl['Keyword'] == row['Keyword']]
    idx1 = pd.Index(keyurlcompare['URL'])
    # Second iteration
    for index2, row2 in keylist.iterrows():
        keyurlcompare2 = keyurl[keyurl['Keyword'] == row2['Keyword']]
        idx2 = pd.Index(keyurlcompare2['URL'])
        # Intersection evaluation
        intersectw = idx1.intersection(idx2)
        we = len(intersectw)
        if we > 0 and row['Keyword'] != row2['Keyword']:
            df1 = pd.DataFrame([[row['Keyword'], row2['Keyword'], we]],
                               columns=['Source', 'Target', 'Weight'])
            df = df.append(df1)
            print('Keyword n. ' + str(index) + ' (' + row['Keyword'] + ') with Keyword n. '
                  + str(index2) + ' (' + row2['Keyword'] + ') - Intersect: ' + str(we))
It works and I print this kind of output:
Keyword n. 0 (word1) with Keyword n. 9908 (word2) - Intersect: 1
Keyword n. 0 (word1) with Keyword n. 10063 (word3) - Intersect: 12
Keyword n. 0 (word1) with Keyword n. 10064 (word4) - Intersect: 1
But it's obviously incredibly slow. Could you help me in finding a more efficient way to perform this task?
I would try to reverse the processing:
find all keywords per URL
build a dataframe giving per URL all keyword pairs
sum the number of occurrences per pair
Code could be:
import itertools

detail = df.groupby('URL').apply(
    lambda z: pd.DataFrame(list(itertools.combinations(z.Keyword, 2)),
                           columns=['Keyword1', 'Keyword2']))
result = (detail.reset_index(level=0)
                .groupby(['Keyword1', 'Keyword2']).count()
                .rename(columns={'URL': 'Weight'}).reset_index())
The result dataframe should be what you want. detail is rather expensive to obtain with large data (several minutes on a decent machine for the order of magnitude of data size you gave); result is much quicker. But at least there should be no memory error on a machine with more than 12 GB of RAM.
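To see the mechanics, here is the same code run on a toy frame (my own check, with made-up data; df and the column names are as in the question):
import itertools
import pandas as pd

df = pd.DataFrame({
    'Keyword': ['Word1', 'Word2', 'Word1', 'Word2', 'Word3'],
    'URL':     ['URL1',  'URL1',  'URL2',  'URL2',  'URL2'],
})

detail = df.groupby('URL').apply(
    lambda z: pd.DataFrame(list(itertools.combinations(z.Keyword, 2)),
                           columns=['Keyword1', 'Keyword2']))
result = (detail.reset_index(level=0)
                .groupby(['Keyword1', 'Keyword2']).count()
                .rename(columns={'URL': 'Weight'}).reset_index())
print(result)
# Expect (Word1, Word2) with Weight 2, since they share URL1 and URL2;
# (Word1, Word3) and (Word2, Word3) each get Weight 1.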

How to concatenate string from integers?

I'm very new to Python.
How do I get a string like this:
53 (46.49 %)
But, I'm getting this:
1 53 (1 46.49 %)
I'm trying to get the last value from the table count and the proportion (I'm not sure what it's called in pandas):
table = pd.value_counts(data[var].values, sort=False)
prop_table = (table/table.sum() * 100).round(2)
num = table[[1]].to_string()
prop = prop_table[[1]].to_string()
test = num + " (" + prop + " %)"
But it puts the index label 1 before each number.
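A likely cause, with a hedged sketch of a fix (assuming, as in the question's table[[1]], that 1 is a valid index label): indexing with a list returns a one-element Series, and to_string() on a Series prints the index label next to the value. Selecting scalars instead avoids the label:
num = table[1]          # scalar count, no index label attached
prop = prop_table[1]    # scalar rounded percentage
test = str(num) + " (" + str(prop) + " %)"
print(test)             # e.g. 53 (46.49 %)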

How to control all combination subset against userinput using itertool in Python 2.7

I want to write code to get all combinations from 5 user-input sets, where each output subset matches at most 3 elements from any of the input sets.
Example:
import itertools

userInput1 = ('a', 'b', 'c', 'd', 'e')
userInput2 = ('c', 'd', 'e', 'f', 'g')
userInput3 = ('f', 'g', 'h', 'i', 'j')
userInput4 = ('g', 'h', 'i', 'j', 'k')
userInput5 = ('k', 'l', 'm', 'n', 'o')

# Turn 5 tuples into 1 large list with no duplicates
allEntries = list(set(userInput1 + userInput2 + userInput3 + userInput4 + userInput5))

# Generate all possible 5-element combinations
allCombinations = list(itertools.combinations(allEntries, 5))

print "All combinations:"
for subset in allCombinations:
    ?????????
    print subset
How do I do this check to limit the overlap? For instance, (g,i,j,k,o) fails because it shares 4 elements with userInput4, while e.g. (a,c,j,l,o) and (k,b,a,m,n) are acceptable combinations.
There isn't a ready-made itertools solution for this check. However, you do have the correct start. Now, check each combination as you produce it:
check_set = [
    set(userInput1),
    set(userInput2),
    set(userInput3),
    set(userInput4),
    set(userInput5)
]

for five in itertools.combinations(allEntries, 5):
    five_set = set(five)
    # If there is no overlap of more than 3 elements with any
    # input set, accept the combination.
    if not any(len(five_set.intersection(user_set)) > 3
               for user_set in check_set):
        print five
        # ... or whatever you do to save the good combination.
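Equivalently (my variant of the same check), the filter can be factored into a small helper; iterating the generator directly also avoids materializing allCombinations as a list:
def acceptable(combo, input_sets, max_overlap=3):
    # True if combo shares at most max_overlap elements with every input set.
    combo_set = set(combo)
    return not any(len(combo_set & s) > max_overlap for s in input_sets)

good = [c for c in itertools.combinations(allEntries, 5)
        if acceptable(c, check_set)]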
