Get unique items in a column starting with a given string - python

Consider a column with its unique values:
df['something'].unique() =array(['aa','bb','a','c']).
Now I want to know which of the items start with 'a'.
My expected answer is
'aa','a'

I think the simplest approach is a list comprehension with a filter:
out = [x for x in df['something'].unique() if x.startswith('a')]
print (out)
['aa', 'a']
For a pandas solution, use:
s = pd.Series(df['something'].unique())
out = s[s.str.startswith('a')].tolist()
print (out)
['aa', 'a']
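If the column can contain missing values, `str.startswith` yields NaN for them, and that can make the boolean indexing fail. A minimal sketch, assuming a hypothetical DataFrame whose `something` column includes a `None`; passing `na=False` treats missing entries as non-matches:

```python
import pandas as pd

# Hypothetical data: the column holds duplicates and one missing value.
df = pd.DataFrame({'something': ['aa', 'bb', 'a', 'c', None, 'aa']})

# Reduce to unique values first, then keep those starting with 'a'.
# na=False makes missing values count as "does not start with 'a'".
s = pd.Series(df['something'].unique())
out = s[s.str.startswith('a', na=False)].tolist()
print(out)
```

Without `na=False` the mask would not be purely boolean, so this variant is mainly useful when the data has not been cleaned yet.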

Related

how to convert dataframes into list

I am trying to convert a DataFrame into a list. I wrote the following, but the output is a list of lists. How can I get a flat list, or convert the current output into one?
My code is below:
import pandas as pd
import mysql.connector
region1 = mysql.connector.connect(host="localhost", user="xxxxxx", passwd="xxxxxxxx")
query1 = "SHOW DATABASES"
df1 = pd.read_sql(query1, region1)
print(df1.values.tolist())
To flatten the current nested list into a single list:
from nltk import flatten
a = [['a'], ['b'], ['c']]
print(flatten(a))
Output:
['a', 'b', 'c']
This may help you:
df1 = pd.read_sql(query1, region1)
res = []
for col in df1.columns:
    # one inner list per column
    res.append(df1[col].values.tolist())
print(res)
You can use a list comprehension:
[i[0] for i in df1.values.tolist()]
Output:
['atest', 'btest', 'ctest', 'information_schema', 'mysql', 'performance_schema', 'sakila', 'sys', 'telusko', 'world']
That works when each inner list has only one element.
If each inner list can hold multiple elements:
[i for j in df1.values.tolist() for i in j]
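The same flattening can also be done without nltk. A minimal sketch using only pandas and the standard library; the DataFrame below is a stand-in for the `SHOW DATABASES` result, and its column name and values are assumptions for illustration:

```python
from itertools import chain

import pandas as pd

# Stand-in for pd.read_sql("SHOW DATABASES", ...); assumed column and values.
df1 = pd.DataFrame({'Database': ['information_schema', 'mysql', 'sys']})

# Option 1: take the first (and only) column as a flat list.
first_column = df1.iloc[:, 0].tolist()

# Option 2: flatten every row; this also works with several columns.
flat = list(chain.from_iterable(df1.values.tolist()))

print(first_column)
print(flat)
```

Selecting the column directly avoids building the nested list in the first place, while `chain.from_iterable` is the general-purpose choice when rows have more than one value.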

Creating a Python list with given indexes for each repeating element

First list: contains the output indexes for each category name
Second list: contains the category names as strings
Intervals=[[Indexes_Cat1],[Indexes_Cat2],[Indexes_Cat3], ...]
Category_Names=["cat1","cat2","cat3",...]
Desired Output:
list=["cat1", "cat1","cat2","cat3","cat3"]
where the position of each element in the output list is given by the Intervals list.
Ex1:
Intervals=[[0,4], [2,3] , [1,5]]
Category_Names=["a","b","c"]
Ex: Output1
["a","c","b","b","a","c"]
Edit: More Run Cases
Ex2:
Intervals=[[0,1], [2,3] , [4,5]]
Category_Names=["a","b","c"]
Ex: Output2
["a","a","b","b","c","c"]
Ex3:
Intervals=[[3,4], [1,5] , [0,2]]
Category_Names=["a","b","c"]
Ex: Output3
["c","b","c","a","a","b"]
My solution:
Create an empty list of size n.
Run a for loop for each category.
output=[""]*n
for i in range(len(Category_Names)):
    for index in Intervals[i]:
        output[index]=Category_Names[i]
Is there a better solution, or a more pythonic way? Thanks
def categorise(Intervals=[[0,4], [2,3] , [1,5]],
               Category_Names=["a","b","c"]):
    flattened = sum(Intervals, [])
    answer = [None] * (max(flattened) + 1)
    for indices, name in zip(Intervals, Category_Names):
        for i in indices:
            answer[i] = name
    return answer
assert categorise() == ['a', 'c', 'b', 'b', 'a', 'c']
assert categorise([[3,4], [1,5] , [0,2]],
                  ["a","b","c"]) == ['c', 'b', 'c', 'a', 'a', 'b']
Note that this code leaves None values in the answer if the "intervals" don't cover every integer from zero to the largest index. It is assumed that the input is compatible.
I am not sure there is a way to avoid the nested loop (I can't think of one right now), so your solution seems good.
One small improvement is to initialize the output list with one of the categories:
output = [Category_Names[0]]*n
and then start the iteration by skipping that category:
for i in range(1, len(Category_Names)):
If you know that one category appears more often than the others, use that one to initialize the list.
I hope this helps!
You can reduce the amount of strings created and use enumerate to avoid range(len(..)) for indexing.
Intervals=[[0,4], [2,3] , [1,5]]
Category_Names=["a","b","c"]
n = max(x for a in Intervals for x in a) + 1
# do not construct strings that get replaced anyhow
output=[None] * n
for i,name in enumerate(Category_Names):
    for index in Intervals[i]:
        output[index]=name
print(output)
Output:
['a', 'c', 'b', 'b', 'a', 'c']
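Another way to express the same idea is to pair every index with its category name and sort the pairs. A minimal sketch, assuming each index appears exactly once and the indexes cover 0 to n-1 without gaps; `categorise_sorted` is a hypothetical helper name, not something from the answers above:

```python
def categorise_sorted(intervals, category_names):
    """Hypothetical helper: order category names by their target index."""
    pairs = (
        (index, name)
        for indexes, name in zip(intervals, category_names)
        for index in indexes
    )
    # Sorting the (index, name) pairs puts every name at its position.
    return [name for _, name in sorted(pairs)]


result = categorise_sorted([[0, 4], [2, 3], [1, 5]], ["a", "b", "c"])
print(result)
```

This trades the preallocated list for a sort, so it is O(n log n) rather than O(n). Unlike the preallocating versions, gaps in the indexes would be silently closed up instead of showing as None.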

How do I replace characters in lists using dictionary in python

I want to write code that efficiently replaces certain characters in a list using a dictionary.
If I have:
key = {'a':'z','b':'y','c':'x'}
List = ['a','b','c']
How can I get the output
zyx
Edit to clarify: the output I actually want is
randomvariableorsomething = ['z', 'y', 'x']
My apologies.
Will [key[x] for x in List] work if I don't have a key for it in the dict?
Use get and join:
>>> ''.join(key.get(e,'') for e in List)
'zyx'
If by 'replace' you mean to change the list to the values of the dict in the order of the elements of the original list, you can do:
>>> List[:]=[key.get(e,'') for e in List]
>>> List
['z', 'y', 'x']
key = {'a':'z','b':'y','c':'x'}
List = ['a','b','c']
print([key.get(x,"No_key") for x in List])
#### Output ####
['z', 'y', 'x']
If you only want to print them as a string, then:
print(*[key.get(x,"No_key") for x in List],sep="")
#### Output ####
zyx
Just in case you need a solution without join:
ss = ''
def fun_str(x):
    global ss
    ss = ss + x
    return(ss)
print([fun_str(key.get(x,"No_key")) for x in List][-1])
#### Output ####
zyx
The names key and List are easily confused with existing Python objects or methods (dict.keys() and list objects), so I replaced them with k and lst respectively as a best practice:
[k[x] for x in lst]
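On the question of missing keys: plain indexing raises `KeyError`, while `dict.get` lets you choose a fallback. A minimal sketch, assuming a list that contains one element ('q') with no entry in the dict; the variable names `strict` and `lenient` are illustrative only:

```python
key = {'a': 'z', 'b': 'y', 'c': 'x'}
lst = ['a', 'b', 'q']  # 'q' has no entry in the dict

# Plain indexing fails as soon as a key is missing.
try:
    strict = [key[x] for x in lst]
except KeyError as exc:
    strict = None
    print('missing key:', exc)

# dict.get with the element itself as default leaves unknown items unchanged.
lenient = [key.get(x, x) for x in lst]
print(lenient)
```

Whether to drop unknown items (`key.get(x, '')`), keep them (`key.get(x, x)`), or fail loudly depends on what the data is supposed to guarantee.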

Efficient and faster implementation of finding and matching unique values in a pandas dataframe

Regarding the following Pandas dataframe,
idx = pd.MultiIndex.from_product([['A001', 'B001','C001'],
['0', '1', '2']],
names=['ID', 'Entries'])
col = ['A', 'B']
df = pd.DataFrame('-', idx, col)
df.loc['A001', 'A'] = [10,10,10]
df.loc['A001', 'B'] = [90,84,70]
df.loc['B001', 'A'] = [10,20,30]
df.loc['B001', 'B'] = [70,86,67]
df.loc['C001', 'A'] = [20,20,20]
df.loc['C001', 'B'] = [98,81,72]
df.loc['D001', 'A'] = [20,20,10]
df.loc['D001', 'B'] = [68,71,92]
#df is a dataframe
df
I am interested in finding the IDs whose 'A' column includes all the values from a given set or list. Let's define the list as [10,20]. In this case, I should get 'B001' and 'D001' as the answer, since both have all the listed values in their 'A' column.
Also, can you suggest a faster implementation, since I have to work on a really big data set?
You can use set.intersection for your calculation, and pd.Index.get_level_values to extract the first level of your index:
search = {10, 20}
idx = (set(df[df['A'] == i].index.get_level_values(0)) for i in search)
res = set.intersection(*idx)
Basically -
search_list = {10,20}
op = df.groupby(level=0)['A'].apply(lambda x: search_list.issubset(set(x))).reset_index()
print(op[op['A']]['ID'])
Thanks @Ben.T for taking out the unnecessary unique()
Output
1 B001
Name: ID, dtype: object
Explanation
df.groupby(level=0)['A'] groups by level 0 and gives you the lists -
ID
A001 [10]
B001 [10, 20, 30]
C001 [20]
Next, for each of these lists, we convert it into a set and check whether the search_list is a subset.
ID
A001 False
B001 True
C001 False
It returns a Series of boolean values which can then be used as a mask -
print(op[op['A']]['ID'])
Final Output -
1 B001
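A variant that avoids a Python-level lambda per group is to filter the rows first and then count distinct matches per ID. A minimal sketch, assuming the 'A' column holds integers and that 'D001' is part of the index (the frame below is rebuilt directly for that reason); whether it is actually faster is worth timing on your own data:

```python
import pandas as pd

# Rebuilt example frame, including 'D001' in the index (an assumption).
idx = pd.MultiIndex.from_product(
    [['A001', 'B001', 'C001', 'D001'], ['0', '1', '2']],
    names=['ID', 'Entries'],
)
df = pd.DataFrame(
    {'A': [10, 10, 10, 10, 20, 30, 20, 20, 20, 20, 20, 10]},
    index=idx,
)

search = {10, 20}

# Keep only rows whose 'A' value is one of the searched values ...
mask = df['A'].isin(list(search))
# ... then count how many distinct searched values each ID contains.
counts = df.loc[mask, 'A'].groupby(level='ID').nunique()
# An ID qualifies when it contains every searched value.
res = counts[counts == len(search)].index.tolist()
print(res)
```

IDs with no matching rows at all simply drop out after the filter, so they never reach the final comparison.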

Convert items in a list element to a list

I am forced to use comma separation in one of my input arguments to separate multiple values. So I end up with
my_string = ['a,b,c']
How can I convert this so that
my_new_string = ['a', 'b', 'c']
One possible way:
my_new_string = my_string[0].split(',')
Try this one-liner:
my_new_string = [x for y in my_string for x in y.split(',')]
Or print the result directly (split already returns a list):
print(my_string[0].split(','))
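If the input can contain spaces after the commas, or more than one list element, the pieces may need trimming. A minimal sketch, assuming a hypothetical input with stray whitespace and a second element:

```python
# Hypothetical input: spaces around some values and a second element.
my_string = ['a, b,c', 'd']

# Split every element on commas, trim whitespace, and skip empty pieces.
my_new_string = [
    part.strip()
    for item in my_string
    for part in item.split(',')
    if part.strip()
]
print(my_new_string)
```

The `if part.strip()` filter also discards empty strings produced by a trailing comma such as 'a,b,'.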
