Find relational values in two columns in pandas - python

I am trying to extract values from one column based on another column in pandas.
For example, suppose I have two columns in a dataframe as below:
>>> check
  child parent
0     b      a
1     c      a
2     d      b
3     e      d
Now I want to extract all values in column "child" for a given value in column "parent", following the chain through its children as well.
My initial value can differ; for now suppose it is "a" in column "parent".
The length of the dataframe might also differ.
I tried the code below, but it does not work if there are more matching values and the dataframe is longer:
check = pd.read_csv("Book2.csv",encoding='cp1252')
new = (check.loc[check['parent'] == 'a', 'child']).tolist()
len(new)
a=[]
a.append(new)
for i in range(len(new)):
    new[i]
    new1 = (check.loc[check['parent'] == new[i], 'child']).tolist()
    len(new1)
    if(len(new1)>0):
        a.append(new1)
        for i in range(len(new1)):
            new2 = (check.loc[check['parent'] == new1[i], 'child']).tolist()
            if(len(new1)>0):
                a.append(new2)
flat_list = [item for sublist in a for item in sublist]
>>> flat_list
['b', 'c', 'd', 'e']
Is there an efficient way to get the desired results? It would be a great help. Please advise.

Recursion is a way to do it. Suppose that check is your dataframe, define a recursive function:
final = [] #empty list which is used to store all results
def getchilds(df, res, value):
    where = df['parent'].isin([value]) #check rows where parent is equal to value
    newvals = list(df['child'].loc[where]) #get the corresponding child values
    if len(newvals) > 0:
        res.extend(newvals)
        for i in newvals: #recursive calls using child values
            getchilds(df, res, i)
getchilds(check, final, 'a')
print(final)
print(final) prints ['b', 'c', 'd', 'e'] if check is your example.
This works if you do not have cyclic calls, like 'b' is child of 'a' and 'a' is child of 'b'. If this is the case, you need to add further checks to prevent infinite recursion.
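One possible way to add that check is to remember which values were already visited. The sketch below is only an illustration, not the recursive function above: it swaps recursion for an iterative breadth-first walk, and the name get_descendants and the inline sample dataframe are assumptions made for the example.

```python
from collections import deque

import pandas as pd


def get_descendants(df, root):
    """Collect every descendant of `root`, visiting each value at most once."""
    # Map each parent to the list of its direct children.
    children = df.groupby("parent")["child"].apply(list).to_dict()
    seen = {root}
    result = []
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:  # already-visited values are skipped, which breaks cycles
                seen.add(child)
                result.append(child)
                queue.append(child)
    return result


# Sample data copied from the question (hypothetical stand-in for Book2.csv).
check = pd.DataFrame({"child": ["b", "c", "d", "e"],
                      "parent": ["a", "a", "b", "d"]})
print(get_descendants(check, "a"))  # ['b', 'c', 'd', 'e']
```

With a cyclic table (for instance 'b' is a child of 'a' and 'a' is a child of 'b'), the walk stops after each value has been seen once instead of recursing forever.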

out_dict = {}
for v in pd.unique(check['parent']):
    out_dict[v] = list(pd.unique(check['child'][check['parent']==v]))
Then calling out_dict prints:
{'a': ['b', 'c'], 'b': ['d'], 'd': ['e']}

Let me just make a guess and say you want to get all the values of column child where the parent value is x:
import pandas as pd
def get_x_values_of_y(comparison_val, df, val_type="get_parent"):
    # column to return and column to compare against, chosen by val_type
    val_to_be_found = ["child","parent"][val_type=="get_parent"]
    val_existing = ["child","parent"][val_type != "get_parent"]
    mask_value = df[val_existing] == comparison_val
    to_be_found_column = df[mask_value][val_to_be_found]
    unique_results = to_be_found_column.unique().tolist()
    return unique_results
check = pd.read_csv("Book2.csv",encoding='cp1252')
# to get results of all parents of child "a"
print(get_x_values_of_y("a", check))
# to get results of all children of parent "b"
print(get_x_values_of_y("b", check, val_type="get_child"))
# to get results of all parents of every child
list_of_all_children = check["child"].unique().tolist()
for each_child in list_of_all_children:
    print(get_x_values_of_y(each_child, check))
# to get results of all children of every parent
list_of_all_parents = check["parent"].unique().tolist()
for each_parent in list_of_all_parents:
    print(get_x_values_of_y(each_parent, check, val_type= "get_child"))
Hope this solves your problem.

Related

Loop over each item in a row and compare with each item from another row then save the result in a new column_python

I want to loop in Python over each item in a row and compare it against the items in the corresponding row of another column.
If an item is not present in the row of the second column, it should be appended to a new list that will be converted into another column (this should also eliminate duplicates when appending, through if i not in c).
The goal is to compare the items in each row of one column against the items in the corresponding row of another column, and to save the unique values from the first column in a new column of the same df.
df columns
This is just an example; I have many more items in each row.
I tried using this code, but nothing happened, and from what I have tested the conversion of the list into a column is not correct:
a= df['final_key_concat'].tolist()
b = df['attributes_tokenize'].tolist()
c = []
for i in df.values:
    for i in a:
        if i in a:
            if i not in b:
                if i not in c:
                    c.append(i)
print(c)
df['new'] = pd.Series(c)
Any help is more than needed, thanks in advance
So seeing as you have these two variables one way would be:
a= df['final_key_concat'].tolist()
b = df['attributes_tokenize'].tolist()
Try something like this:
new = {}
for index, items in enumerate(a):
    for thing in items:
        if thing not in b[index]:
            if index in new:
                new[index].append(thing)
            else:
                new[index] = [thing]
Then map the dictionary to the df.
df['new'] = df.index.map(new)
There are better ways to do it but this should work.
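As one possible "better way", here is a minimal sketch that does the row-by-row comparison in a single pass and also drops duplicates while keeping the original order. The helper name unique_missing and the two sample rows are assumptions for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample rows; the column names are the ones used in the question.
df = pd.DataFrame({
    "final_key_concat": [["Camiseta", "Tecnica", "hombre", "Camiseta"],
                         ["deportivas", "calcetin", "hombres", "deportivas", "shoes"]],
    "attributes_tokenize": [["The", "North", "Face", "manga"],
                            ["deportivas", "calcetin", "shoes", "North"]],
})


def unique_missing(items, restricted):
    """Items not in `restricted`, first occurrence only, original order kept."""
    blocked = set(restricted)
    # dict.fromkeys drops duplicates while preserving insertion order
    return list(dict.fromkeys(x for x in items if x not in blocked))


# Build one result list per row and align it with the dataframe index.
df["new"] = pd.Series(
    [unique_missing(a, b)
     for a, b in zip(df["final_key_concat"], df["attributes_tokenize"])],
    index=df.index,
)
print(df["new"].tolist())  # [['Camiseta', 'Tecnica', 'hombre'], ['hombres']]
```

Each row is compared only against the matching row of the other column, so a word that appears in a different row of attributes_tokenize does not affect the result.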
This should be what you want:
import pandas as pd
data = {'final_key_concat':[['Camiseta', 'Tecnica', 'hombre', 'barate'],
                            ['deportivas', 'calcetin', 'hombres', 'deportivas', 'shoes']],
        'attributes_tokenize':[['The', 'North', 'Face', 'manga'],
                               ['deportivas', 'calcetin', 'shoes', 'North']]} #recreated from your image
df = pd.DataFrame(data)
a= df['final_key_concat'].tolist() #this generates a list of lists
b = df['attributes_tokenize'].tolist()#this also generates a list of lists
#Both list a and b need to be flattened so as to access their elements the way you want it
c = [itm for sblst in a for itm in sblst] #flatten list a using list comprehension
d = [itm for sblst in b for itm in sblst] #flatten list b using list comprehension
final_list = [itm for itm in c if itm not in d]#keep the elements of c that are not in d
print (final_list)
Result
['Camiseta', 'Tecnica', 'hombre', 'barate', 'hombres']
def parse_str_into_list(s):
    if s.startswith('[') and s.endswith(']'):
        return ' '.join(s.strip('[]').strip("'").split("', '"))
    return s
def filter_restrict_words(row):
    targets = parse_str_into_list(row[0]).split(' ', -1)
    restricts = parse_str_into_list(row[1]).split(' ', -1)
    print(restricts)
    # start for loop each words
    # use set type to save words or list if we need to keep words in order
    words_to_keep = []
    for word in targets:
        # condition to keep eligible words
        if word not in restricts and 3 < len(word) < 45 and word not in words_to_keep:
            words_to_keep.append(word)
    print(words_to_keep)
    return ' '.join(words_to_keep)
df['FINAL_KEYWORDS'] = df[[col_target, col_restrict]].apply(lambda x: filter_restrict_words(x), axis=1)

Is there a Python library to drop rows on two conditions without a loop?

I have a problem that I cannot solve.
I have a DataFrame with more than 150000 rows. I want to delete rows when an ID has several keys: rows with the key "X" should be deleted if the ID also has keys other than X, and rows should be kept if the ID has only the key X. Do you know of Python libraries that can do that without going through an if or a loop? Thanks.
Edit:
If the ID has only the key X, keep it; if the ID has multiple keys, delete only the rows that have X as the key for this ID.
Example :
Input
What I need :
output
Notice that "2 B X" was deleted.
First find IDs which have multiple unique Keys:
import pandas as pd
df = pd.DataFrame({'ID':['A', 'A', 'B', 'B', 'B'],
'Key':['X', 'X', 'X', 'Y', 'Z']})
g = df.groupby('ID')['Key'].nunique()
multiple_keys = list(g[g > 1].index)
print(multiple_keys)
Output:
['B']
Now use this to filter your DataFrame:
result = df[~((df['ID'].isin(multiple_keys)) & (df['Key'] == 'X'))]
print(result)
Output:
  ID Key
0  A   X
1  A   X
3  B   Y
4  B   Z
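The same two steps can probably be folded into one mask without building the intermediate list. This is only a sketch of an alternative, assuming the same sample DataFrame as above; it relies on groupby(...).transform("nunique") to broadcast the per-ID key count back onto every row.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A", "A", "B", "B", "B"],
                   "Key": ["X", "X", "X", "Y", "Z"]})

# Number of distinct keys per ID, repeated on every row of that ID.
n_keys = df.groupby("ID")["Key"].transform("nunique")

# Drop rows whose key is "X" when the same ID also has other keys.
result = df[~((n_keys > 1) & (df["Key"] == "X"))]
print(result)
```

Because everything stays vectorized, there is no explicit if or loop over the rows.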

Problem with retrieving the value of item in list/dictionary of objects in Python3

I'm trying to put string variables into list/dictionary in python3.7 and trying to retrieve them later for use.
I know that I can create a dictionary like:
string_dict1 = {"A":"A", "B":"B", "C":"C", "D":"D", "E":"E", "F":"F"}
and then retrieve the values, but that is not what I want in this specific case.
Here is the code:
A = ""
B = "ABD"
C = ""
D = "sddd"
E = ""
F = "dsas"
string_dict = {A:"A", B:"B", C:"C", D:"D", E:"E", F:"F"}
string_list = [A,B,C,D,E,F]
for key,val in string_dict.items():
    if key == "":
        print(val)
for item in string_list:
    if item == "":
        print(string_list.index(item))
The result I got is:
E
0
0
0
And the result I want is:
A
C
E
0
2
4
If you print string_dict you notice the problem:
string_dict = {A:"A", B:"B", C:"C", D:"D", E:"E", F:"F"}
print(string_dict)
# output: {'': 'E', 'ABD': 'B', 'sddd': 'D', 'dsas': 'F'}
It contains a single entry with the value "".
This is because you are associating multiple values to the same key, which is not possible in Python, so only the last assignment is kept (in this case E:"E").
If you want to associate multiple values with the same key, you could associate a list:
string_dict = {A:["A","C","E"], B:"B", D:"D", F:"F"}
Regarding the list of strings string_list, you get 0 because the method .index(item) returns the index of the first occurrence of item in the list, in your case 0. For example, if you change the list [A,B,C,D,E,F] to [B,B,C,D,E,F], your code will print 2.
If you want to print the index of the empty string in your list:
for index, value in enumerate(string_list):
    if value == '':
        print(index)
Or in a more elegant way you can use a list comprehension:
[i for i,x in enumerate(string_list) if x=='']
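To also recover the names A, C and E, one option is to group the variable names by their value rather than writing the list by hand. The sketch below is an assumption-laden illustration: the names list and the names_by_value dictionary are introduced only for this example.

```python
from collections import defaultdict

A, B, C, D, E, F = "", "ABD", "", "sddd", "", "dsas"
names = ["A", "B", "C", "D", "E", "F"]   # labels, in the same order as the values
values = [A, B, C, D, E, F]

# Group every label under the string it holds, so equal strings do not overwrite each other.
names_by_value = defaultdict(list)
for name, value in zip(names, values):
    names_by_value[value].append(name)

empty_names = names_by_value[""]
empty_indexes = [i for i, v in enumerate(values) if v == ""]
print(empty_names)    # ['A', 'C', 'E']
print(empty_indexes)  # [0, 2, 4]
```

Each key now maps to a list, so the three empty strings are all kept instead of only the last one.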
Well, I don't think there's a way to get what you want from a dictionary because of how they work. You can print your dictionary and see that it looks like this:
{'': 'E', 'ABD': 'B', 'sddd': 'D', 'dsas': 'F'}
What happened here is A was overwritten by C and then E.
But I played around with the list and here's how I got the last three digits right:
for item in string_list:
    if item != '':
        print(string_list.index(item) - 1)
This prints:
0
2
4

How to standardize the format of element in the list from big data

Trying to count unique values in the following list without using collections:
('TOILET','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
The output which I require is :
('TOILET':2,'AIR CONDITIONiNGS':3)
My code currently is
for i in Data:
    if i in number:
        number[i] += 1
    else:
        number[i] = 1
print number
Is it possible to get the output?
Using difflib.get_close_matches to help determine uniqueness
import difflib
a = ('TOILET','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
d = {}
for word in a:
    similar = difflib.get_close_matches(word, d.keys(), cutoff = 0.6, n = 1)
    #print(similar)
    if similar:
        d[similar[0]] += 1
    else:
        d[word] = 1
The actual keys in the dictionary will depend on the order of the words in the list.
difflib.get_close_matches uses difflib.SequenceMatcher to calculate the closeness (ratio) of the word against all possibilities even if the first possibility is close - then sorts by the ratio. This has the advantage of finding the closest key that has a ratio greater than the cutoff. But as the dictionary grows the searches will take longer.
If needed, you might be able to optimize a little by sorting the list first so that similar words appear in sequence and doing something like this (lazy evaluation) - choosing an appropriately large cutoff.
import difflib, collections
z = collections.OrderedDict()
a = sorted(a)
cutoff = 0.6
for word in a:
    for key in z.keys():
        if difflib.SequenceMatcher(None, word, key).ratio() > cutoff:
            z[key] += 1
            break
    else:
        z[word] = 1
Results:
>>> d
{'TOILET': 2, 'AIR CONDITIONING': 3}
>>> z
OrderedDict([('AIR CONDITIONING', 3), ('TOILET', 2)])
>>>
I imagine there are python packages that do this sort of thing and may be optimized.
I don't believe the python list has an easy built-in way to do what you are asking. It does, however, have a count method that can tell you how many of a specific element there are in a list. Example:
some_list = ['a', 'a', 'b', 'c']
some_list.count('a') #=> 2
Usually the way you get what you want is to build a dictionary of counts by taking advantage of the dict.get(key, default) method:
some_list = ['a', 'a', 'b', 'c']
counts = {}
for el in some_list:
    counts[el] = counts.get(el, 0) + 1
counts #=> {'a': 2, 'b': 1, 'c': 1}
You can try this:
import re
data = ('TOILETS','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
# replace runs of non-word characters (such as '-') with a single space
new_data = [re.sub(r"\W+", ' ', i) for i in data]
print(new_data)
final_data = {}
for i in new_data:
    # keys already seen that are a prefix of the current item
    s = [b for b in final_data if i.startswith(b)]
    if s:
        key = s[0]
        final_data[key] += 1
    else:
        final_data[i] = 1
print(final_data)
Output:
{'TOILETS': 2, 'AIR CONDITIONING': 3}
original = ('TOILETS', 'TOILETS', 'AIR CONDITIONING',
'AIR-CONDITIONINGS', 'AIR-CONDITIONING')
a_set = set(original)
result_dict = {element: original.count(element) for element in a_set}
First, making a set from original list (or tuple) gives you all values from it, but without repeating.
Then you create a dictionary with keys from that set and values as occurrences of them in the original list (or tuple), employing the count() method.
a = ['TOILETS', 'TOILETS', 'AIR CONDITIONING', 'AIR-CONDITIONINGS', 'AIR-CONDITIONING']
b = {}
for i in a:
    b.setdefault(i,0)
    b[i] += 1
You can use this code, but as Jon Clements pointed out, TOILET and TOILETS aren't the same string, so you must normalize them first.
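A rough sketch of what such a normalization step might look like, using only a plain dict as the question asks. The normalize function and its two rules (treat hyphens as spaces, drop one trailing 'S') are assumptions chosen to fit the sample data, not a general solution.

```python
def normalize(label):
    """Hypothetical rule set: hyphens become spaces and one trailing 'S' is dropped."""
    label = " ".join(label.replace("-", " ").split())
    return label[:-1] if label.endswith("S") else label


data = ('TOILET', 'TOILETS', 'AIR CONDITIONING', 'AIR-CONDITIONINGS', 'AIR-CONDITIONING')
counts = {}
for item in data:
    key = normalize(item)
    counts[key] = counts.get(key, 0) + 1
print(counts)  # {'TOILET': 2, 'AIR CONDITIONING': 3}
```

Dropping a trailing 'S' would mangle words such as 'GLASS', so real data would likely need a more careful rule or a fuzzy match like the difflib approach above.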

Python: Append double items to new array

Let's say I have an array "array_1" with these items:
A b A c
I want to get a new array "array_2" which looks like this:
b A c A
I tried this:
array_1 = ['A','b','A','c' ]
array_2 = []
for item in array_1:
    if array_1[array_1.index(item)] == array_1[array_1.index(item)].upper():
        array_2.append(array_1[array_1.index(item)+1]+array_1[array_1.index(item)])
The problem: The result looks like this:
b A b A
Does anyone know how to fix this? This would be really great!
Thanks, Nico.
It's because you have two 'A's in your array. In both cases, for 'A',
array_1[array_1.index(item)+1]
will equal 'b' because the index method returns the first index of 'A'.
To correct this behavior, I suggest using an integer that you increment for each item. In that case you'll retrieve the n-th item of the array and your program won't return the same 'A' twice.
Responding to your comment, let's take back your code and add the integer:
array_1 = ['A','b','A','c' ]
array_2 = []
i = 0
for item in array_1:
    if array_1[i] == array_1[i].upper():
        array_2.append(array_1[i+1]+array_1[i])
    i = i + 1
In that case it works, but be careful: you need to add an if statement for the case where the last item of your array is an 'A', for example, because array_1[i+1] won't exist.
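That extra check might look like the sketch below. It uses enumerate instead of a hand-maintained counter, and the trailing 'A' in the sample list is an assumption added to exercise the edge case.

```python
array_1 = ['A', 'b', 'A', 'c', 'A']  # note the trailing 'A' with nothing after it
array_2 = []
for i, item in enumerate(array_1):
    # Pair an upper-case item with its successor only when a successor exists.
    if item == item.upper() and i + 1 < len(array_1):
        array_2.append(array_1[i + 1] + item)
print(array_2)  # ['bA', 'cA']
```

The last 'A' is simply skipped instead of raising an IndexError.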
I think that a simple flat list is the wrong data structure for the job if each lower case letter is paired with the adjacent upper case letter. I would turn it into a list of two-tuples, i.e.:
['A', 'b', 'A', 'c'] becomes [('A', 'b'), ('A', 'c')]
Then if you are looping through the items in the list:
for item in output_list:
    print(item[0]) # prints 'A'
    print(item[1]) # prints 'b' (for first item)
To do this:
input_list = ['A', 'b', 'A', 'c']
output_list = []
i = 0
while i < len(input_list):
    output_list.append((input_list[i], input_list[i+1]))
    i = i + 2
Then you can swap the order of the upper case letters and the lower case letters really easily using a list comprehension:
swapped = [(item[1], item[0]) for item in output_list]
Edit:
As you might have more than one lower case letter for each upper case letter you could use a list for each group, and then have a list of these groups.
def group_items(input_list):
    output_list = []
    current_group = []
    for current_item in input_list:
        if current_item == current_item.upper() and current_group:
            # Upper case letter, so close the previous group and start a new one
            output_list.append(current_group)
            current_group = []
        current_group.append(current_item)
    if current_group:
        # do not forget the last group
        output_list.append(current_group)
    return output_list
Then you can reverse each of the internal lists really easily:
[list(reversed(group)) for group in group_items(input_list)]
According to your last comment, you can get what you want using this
array_1 = "SMITH Mike SMITH Judy".split()
surnames = array_1[0::2]  # items at even positions: 'SMITH', 'SMITH'
names = array_1[1::2]     # items at odd positions: 'Mike', 'Judy'
print(array_1)
array_1[0::2] = names
array_1[1::2] = surnames
print(array_1)
You get:
['SMITH', 'Mike', 'SMITH', 'Judy']
['Mike', 'SMITH', 'Judy', 'SMITH']
If I understood your question correctly, then you can do this:
It will work for any array with an even number of items.
array_1 = ['A','b','A','c' ]
array_2 = []
for index,itm in enumerate(array_1):
    if index % 2 == 0:
        array_2.append(array_1[index+1])
        array_2.append(array_1[index])
print(array_2)
Output:
['b', 'A', 'c', 'A']
