Python string matching and giving incrementing numbers to unmatched strings

I have a set of words, list1:
user_search="management consultancy services better financial health"
user_split = nltk.word_tokenize(user_search)
user_length=len(user_split)
Assign: management=1, consultancy=2, services=3, better=4, financial=5, health=6.
Then compare this with another list of words:
list2 = ['us', 'paleri', 'home', 'us', 'consulting', 'services',
         'market', 'research', 'analysis', 'project', 'feasibility',
         'studies', 'market', 'strategy', 'business', 'plan',
         'model', 'health', 'human', ...]  # etc.
Whenever a match occurs, the matched word's number (1, 2, 3, ...) should appear at the corresponding position. Unmatched positions should be filled with new numbers that continue counting after 6 (i.e. 7, 8, 9, ...).
Expected output example:
[1] 7 8 9 10 11 3 12 13 14 15 16 17 18 19 20 21 22 6 23 24
This means strings 3 and 6, i.e. services and health, are present in this list (matched). The other numbers indicate unmatched words. user_length = 6, so the unmatched numbering starts from 7. How can I get such a result in Python?

You can use itertools.count to create a counter and iterate via next:
from itertools import count
user_search = "management consultancy services better financial health"
words = {v: k for k, v in enumerate(user_search.split(), 1)}
# {'better': 4, 'consultancy': 2, 'financial': 5,
# 'health': 6, 'management': 1, 'services': 3}
L = ['us', 'paleri', 'home', 'us', 'consulting', 'services',
     'market', 'research', 'analysis', 'project', 'feasibility',
     'studies', 'market', 'strategy', 'business', 'plan',
     'model', 'health', 'human']
c = count(start=len(words)+1)
res = [next(c) if word not in words else words[word] for word in L]
# [7, 8, 9, 10, 11, 3, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 6, 23]
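If you want to build the mapping from the nltk.word_tokenize output used in the question instead of str.split, the same approach works. A minimal sketch, assuming NLTK and its punkt tokenizer data are installed, and reusing L from above:
import nltk
from itertools import count

user_search = "management consultancy services better financial health"
user_split = nltk.word_tokenize(user_search)          # same tokens as split() for this sentence
words = {w: i for i, w in enumerate(user_split, 1)}   # word -> 1-based position
c = count(start=len(words) + 1)                       # unmatched numbering starts after user_length
res = [words[word] if word in words else next(c) for word in L]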

How to split pandas dataframe into list of dataframes by id?

I have a big pandas dataframe (about 150,000 rows). I have tried groupby('id') but it returns group tuples. I just need a list of dataframes, which I then convert into np array batches to feed into an autoencoder (like this https://www.datacamp.com/community/tutorials/autoencoder-keras-tutorial but 1D).
So I have a pandas dataset :
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John', 'John', 'John', 'John', 'Krish'], 'Age': [20, 21, 19, 18, 18, 18, 18, 18],'id': [1, 1, 2, 2, 3, 3, 3, 3]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df.head(10)
I need the same data as a list of pandas DataFrames. The list must also keep the original (unsorted) order; this is important because it is time series data.
data1 = {'Name': ['Tom', 'Joseph'], 'Age': [20, 21],'id': [1, 1]}
data2 = {'Name': ['Krish', 'John', ], 'Age': [19, 18, ],'id': [2, 2]}
data3 = {'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18],'id': [3, 3, 3, 3]}
pd_1 = pd.DataFrame(data1)
pd_2 = pd.DataFrame(data2)
pd_3 = pd.DataFrame(data3)
array_list = [pd_1,pd_2,pd_3]
array_list
How can I split the dataframe?
Or you can try:
array_list = df.groupby(df.id.values).agg(list).to_dict('records')
Output:
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'],
'Age': [18, 18, 18, 18],
'id': [3, 3, 3, 3]}]
UPDATE:
If you need a dataframe list:
df_list = [g for _,g in df.groupby('id')]
#OR
df_list = [pd.DataFrame(i) for i in df.groupby(df.id.values).agg(list).to_dict('records')]
To reset the index of each dataframe:
df_list = [g.reset_index(drop=True) for _,g in df.groupby('id')]
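Since the question's end goal is to feed these groups to an autoencoder as np array batches, the list of frames can be converted directly. A minimal sketch, where 'Age' stands in for whatever numeric columns you actually use, and sort=False keeps the original (time series) order:
df_list = [g.reset_index(drop=True) for _, g in df.groupby('id', sort=False)]
batches = [g[['Age']].to_numpy() for g in df_list]   # one 2-D array per id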
Let us group on id and, using to_dict with the list orientation, prepare records per id:
[g.to_dict('list') for _, g in df.groupby('id', sort=False)]
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18], 'id': [3, 3, 3, 3]}]
I am not sure about your need, but does something like this work for you?
df = df.set_index("id")
[df.loc[i].to_dict("list") for i in df.index.unique()]
or if you really want to keep your index in your list:
[df.query(f"id == {i}").to_dict("list") for i in df.id.unique()]
If you want to create new DataFrames storing the values (the previous answers are more relevant if you want to create a list):
This can be solved by iterating over each id using a for loop and creating a new dataframe on every iteration.
I refer you to #40498463 and the other answers for the usage of the groupby() function. Please note that I have changed the name of the id column to Id.
for Id, df in df.groupby("Id"):
    str1 = "df"
    str2 = str(Id)
    new_name = str1 + str2                              # e.g. "df1", "df2", "df3"
    exec('{} = pd.DataFrame(df)'.format(new_name))      # create a variable with that name
Output:
df1
Name Age Id
0 Tom 20 1
1 Joseph 21 1
df2
Name Age Id
2 Krish 19 2
3 John 18 2
df3
Name Age Id
4 John 18 3
5 John 18 3
6 John 18 3
7 Krish 18 3
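As a design note, creating variable names dynamically with exec is generally discouraged; a dictionary keyed by Id gives the same per-group access without it. A minimal sketch using the renamed Id column from above:
dfs = {Id: g.reset_index(drop=True) for Id, g in df.groupby('Id')}
dfs[1]   # same rows as df1 above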

Python Pandas to group columns only

I have a simple data-frame (below) and I want to reduce it to one row per Department/name pair:
I use:
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jason', 'Amy', 'Jason', 'River', 'Kate', 'David', 'Jack', 'David'],
'Department' : ['Sales', 'Operation', 'Operation', 'Sales', 'Operation', 'Sales', 'Operation', 'Sales', 'Finance', 'Finance', 'Finance'],
'Weight lost': [4, 4, 1, 4, 4, 4, 7, 2, 8, 1, 8],
'Point earned': [2, 2, 1, 2, 2, 2, 4, 1, 4, 1, 4]}
df = pd.DataFrame(data)
final = df.pivot_table(index=['Department','name'], values='Weight lost', aggfunc='count', fill_value=0).stack(dropna=False).reset_index(name='Weight_lost_count')
del final['level_2']
del final['Weight_lost_count']
print (final)
There seem to be unnecessary steps in the 'final' line.
What would be a better way to write it?
Try groupby with head
out = df.groupby(['Department','name']).head(1)
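Note that head(1) keeps every column of the first row per group; to end up with only the two grouping columns you can still select them afterwards, e.g.:
out = df.groupby(['Department', 'name']).head(1)[['Department', 'name']]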
Isn't this just drop_duplicates:
df[['Department','name']].drop_duplicates()
Output:
Department name
0 Sales Jason
1 Operation Molly
2 Operation Tina
4 Operation Amy
6 Operation River
7 Sales Kate
8 Finance David
9 Finance Jack
And to exactly match the final:
(df[['Department','name']].drop_duplicates()
.sort_values(by=['Department','name'])
)
Output:
Department name
8 Finance David
9 Finance Jack
4 Operation Amy
1 Operation Molly
6 Operation River
2 Operation Tina
0 Sales Jason
7 Sales Kate

adding child node under every parent node in for loop in treeview in tkinter

I am trying to make a treeview which displays data from a mysql database. It retrieves the data, and converts it into a list of tuples. I have already created a for loop which quite nicely takes the data and puts it into a treeview.
count = 0
for record in rows:  # for loop adding all the information from the data list, no matter how many rows there are
    my_tree.insert(parent='', index='end', iid=count, text='',
                   values=(record[0], record[1], record[2], record[3], record[4], record[5]))
    count += 1
However, under every parent node, I want the for loop to also place a child node. The current treeview looks like this
studentid firstname lastname class1 class2 class3
0 5 Ayoung ere 23 29 22
1 6 Emma 4343 24 22 25
2 7 John 343G&$ 28 26 27
3 8 Anthony #^b 26 25 22
4 9 Enshean E(! 23 26 29
5 12 Ian %^&67HN 23 25 26
6 13 Ludwig Beethoven 23 26 29
7 14 Wolfgang Mozart 23 24 26
8 19 Joseph Haydn 23 26 27
9 20 Enshean #&V 23 26 29
10 21 Enshean L^& 23 26 29
Under every person in the list there should be a child node that displays 3 pieces of information from another list that was retrieved from a database. That list looks like this:
[(22, 'Math', 'Mr. Rosario', 'D2'), (23, 'Music', 'Mr. Young', 'M1'), (24, 'Biology', 'Ms. Marks', 'C4'), (25, 'Chemistry', 'Mr. Musk', 'C2'), (26, 'Physics', 'Mr. Walrath', 'A8'), (27, 'Economics', 'Mr. Sinclair', 'E12'), (28, 'DGT', 'Mr. Turing', 'F3'), (29, 'English', 'Mr. Gibson', 'B5')]
As you can see in my treeview, under class1, class2 and class3 there is a number that corresponds to an id from the other database shown above. What I want in the child node, under each of the classes, is the name of the class. For example:
studentid firstname lastname class1 class2 class3
0 5 Ayoung ere 23 29 22
> Music English Math
1 6 Emma BI$! 24 22 25
> Biology Math Chemistry
It's kind of hard to put it into words, but hopefully you get the idea. Any help in trying to put a child node with the corresponding data under every parent node in a for loop would be appreciated.
You can turn your list into a dictionary: {number: subject}.
If your list is called rows2 you can do
dic = {values[0]: values[1] for values in rows2}
Then add the subitem in the tree with
tree.insert(count, 'end', values=('',)*3 + tuple(dic[record[i]] for i in range(3,6)))
where count is the item iid and record the values from the first database.
Full example
import tkinter as tk
from tkinter import ttk
root = tk.Tk()
columns = ['studentid', 'firstname', 'lastname', 'class1', 'class2', 'class3']
tree = ttk.Treeview(root, columns=columns)
for col in columns:
    tree.heading(col, text=col)
tree.pack()
# first database
rows = [
[5, 'Ayoung', 'ere', 23, 29, 22],
[6, 'Emma', '4343', 24, 22, 25],
[7, 'John', '343G&$', 28, 26, 27],
[8, 'Anthony', '#^b', 26, 25, 22],
[9, 'Enshean', 'E(!', 23, 26, 29],
[12, 'Ian', '%^&67HN', 23, 25, 26],
[13, 'Ludwig', 'Beethoven', 23, 26, 29],
[14, 'Wolfgang', 'Mozart', 23, 24, 26],
[19, 'Joseph', 'Haydn', 23, 26, 27],
[20, 'Enshean', '#&V', 23, 26, 29],
[21, 'Enshean', 'L^&', 23, 26, 29]
]
# second database
rows2 = [
(22, 'Math', 'Mr. Rosario', 'D2'), (23, 'Music', 'Mr. Young', 'M1'),
(24, 'Biology', 'Ms. Marks', 'C4'), (25, 'Chemistry', 'Mr. Musk', 'C2'),
(26, 'Physics', 'Mr. Walrath', 'A8'),
(27, 'Economics', 'Mr. Sinclair', 'E12'), (28, 'DGT', 'Mr. Turing', 'F3'),
(29, 'English', 'Mr. Gibson', 'B5')
]
# dictionary from second database
dic = {values[0]: values[1] for values in rows2}
# put data in treeview
for count, record in enumerate(rows):  # for loop adding all the information from the data list, no matter how many rows there are
    tree.insert(parent='', index='end', iid=count, text='',
                values=(record[0], record[1], record[2], record[3], record[4], record[5]))  # data from first database
    tree.insert(count, 'end', values=('',)*3 + tuple(dic[record[i]] for i in range(3, 6)))  # subitem using second database
root.mainloop()

Matching keywords of list elements with pandas columns

This question is a follow-up to this one, so I added it as a new question.
If my dataframe B would be something like:
ID category words bucket_id
1 audi a4, a6 94
2 bugatti veyron, chiron 86
3 mercedez s-class, e-class 79
4 dslr canon, nikon 69
5 apple iphone,macbook,ipod 51
6 finance sales,loans,sales price 12
7 politics trump, election, votes 77
8 entertainment spiderman,thor, ironmen 88
9 music beiber, rihana,drake 14
........ ..............
......... .........
I want each matched category along with its corresponding ID column as a dictionary. Something like:
{'id': 2, 'term': 'bugatti', 'bucket_id': 86}
{'id': 3, 'term': 'mercedez', 'bucket_id': 79}
{'id': 6, 'term': 'finance', 'bucket_id': 12}
{'id': 7, 'term': 'politics', 'bucket_id': 77}
{'id': 9, 'term': 'music', 'bucket_id': 14}
Edit:
I just want to map keywords that match exactly between two commas in the words column, not as substrings of other strings or together with any other words.
EDIT:
df = pd.DataFrame({'ID': [1, 2, 3],
'category': ['bugatti', 'entertainment', 'mercedez'],
'words': ['veyron,chiron', 'spiderman,thor,ironmen',
's-class,e-class,s-class'],
'bucket_id': [94, 86, 79]})
print (df)
ID category words bucket_id
0 1 bugatti veyron,chiron 94
1 2 entertainment spiderman,thor,ironmen 86
2 3 mercedez s-class,e-class,s-class 79
A = ['veyron','s-class','derman']
idx = [i for i, x in enumerate(df['words']) for y in x.split(',') if y in A]
print (idx)
[0, 2, 2]
L = (df.loc[idx, ['ID','category','bucket_id']]
.rename(columns={'category':'term'})
.to_dict(orient='records'))
print (L)
[{'ID': 1, 'term': 'bugatti', 'bucket_id': 94},
{'ID': 3, 'term': 'mercedez', 'bucket_id': 79},
{'ID': 3, 'term': 'mercedez', 'bucket_id': 79}]
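If the real words column contains spaces after the commas (as in 'a4, a6' in the question), stripping each token before the membership test keeps the comma-delimited match exact. A minimal sketch:
idx = [i for i, x in enumerate(df['words'])
       for y in x.split(',') if y.strip() in A]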

group number of counts by category

I wrote a script that goes over the data and checks for emoticons using a regex; when an emoticon is found, the counter is updated. Then the number of counts per category should be written to a list, for example category ne has 25 emoticons, category fr has 45, and so on. Here is where it goes wrong. The results I get are:
[1, 'ag', 2, 'dg', 3, 'dg', 4, 'fr', 5, 'fr', 6, 'fr', 7, 'fr', 8, 'hp', 9, 'hp', 10, 'hp', 11, 'hp', 12, 'hp', 13, 'hp', 14, 'hp', 15, 'hp', 16, 'hp', 17, 'hp', 18, 'hp', 19, 'hp', 20, 'hp', 21, 'hp', 22, 'hp', 23, 'hp', 24, 'hp', 25, 'ne', 26, 'ne', 27, 'ne', 28, 'ne', 29, 'ne', 30, 'ne', 31, 'ne', 32, 'ne', 33, 'ne', 34, 'ne', 35, 'ne', 36, 'ne', 37, 'ne', 38]
The fileids have this form: one big folder contains 7 smaller folders (one per category), and each category folder holds around 100 files:
data/ne/567.txt
The data in each of the .txt files is just one sentence, and looks like this
I am so happy today :)
This is my script:
counter = 0
lijst = []
for fileid in corpus.fileids():
    for sentence in corpus.sents(fileid):
        cat = str(fileid.split('/')[0])
        s = " ".join(sentence)
        m = re.search('(:\)|:\(|:\s|:\D|:\o|:\#)+', s)
        if m is not None:
            counter += 1
            lijst += [counter] + [cat]
You should do:
import collections
import re

counts = collections.defaultdict(lambda: 0)
for fileid in corpus.fileids():
    for sentence in corpus.sents(fileid):
        cat = str(fileid.split('/')[0])
        s = " ".join(sentence)
        # add the number of emoticon matches in this sentence to the category's count
        counts[cat] += len(re.findall('(:\)|:\(|:\s|:\D|:\o|:\#)+', s))
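If you then want the per-category totals in a flat list like the one described in the question, a minimal sketch:
lijst = sorted(counts.items())   # e.g. [('fr', 45), ('ne', 25), ...] per the figures in the question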
