This question is a follow-up to a previous one, so I posted it as a new question.
Suppose my dataframe B is something like:
ID   category        words                      bucket_id
1    audi            a4, a6                     94
2    bugatti         veyron, chiron             86
3    mercedez        s-class, e-class           79
4    dslr            canon, nikon               69
5    apple           iphone,macbook,ipod        51
6    finance         sales,loans,sales price    12
7    politics        trump, election, votes     77
8    entertainment   spiderman,thor, ironmen    88
9    music           beiber, rihana,drake       14
...  ...             ...                        ...
I want each mapped category along with its corresponding ID as a dictionary, something like:
{'id': 2, 'term': 'bugatti', 'bucket_id': 86}
{'id': 3, 'term': 'mercedez', 'bucket_id': 79}
{'id': 6, 'term': 'finance', 'bucket_id': 12}
{'id': 7, 'term': 'politics', 'bucket_id': 77}
{'id': 9, 'term': 'music', 'bucket_id': 14}
Edit:
I only want to map keywords that exactly match a complete comma-delimited token in the words column, not substrings or matches that occur alongside other words.
EDIT:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'category': ['bugatti', 'entertainment', 'mercedez'],
                   'words': ['veyron,chiron', 'spiderman,thor,ironmen',
                             's-class,e-class,s-class'],
                   'bucket_id': [94, 86, 79]})
print(df)
ID category words bucket_id
0 1 bugatti veyron,chiron 94
1 2 entertainment spiderman,thor,ironmen 86
2 3 mercedez s-class,e-class,s-class 79
A = ['veyron','s-class','derman']
idx = [i for i, x in enumerate(df['words']) for y in x.split(',') if y in A]
print(idx)
[0, 2, 2]
L = (df.loc[idx, ['ID', 'category', 'bucket_id']]
       .rename(columns={'category': 'term'})
       .to_dict(orient='records'))
print(L)
[{'ID': 1, 'term': 'bugatti', 'bucket_id': 94},
{'ID': 3, 'term': 'mercedez', 'bucket_id': 79},
{'ID': 3, 'term': 'mercedez', 'bucket_id': 79}]
I have the following list:
a = [{'cluster_id': 0,
      'points': [{'id': 1, 'name': 'Alice', 'lat': 52.523955, 'lon': 13.442362},
                 {'id': 2, 'name': 'Bob', 'lat': 52.526659, 'lon': 13.448097}]},
     {'cluster_id': 0,
      'points': [{'id': 1, 'name': 'Alice', 'lat': 52.523955, 'lon': 13.442362},
                 {'id': 2, 'name': 'Bob', 'lat': 52.526659, 'lon': 13.448097}]},
     {'cluster_id': 1,
      'points': [{'id': 3, 'name': 'Carol', 'lat': 52.525626, 'lon': 13.419246},
                 {'id': 4, 'name': 'Dan', 'lat': 52.52443559865125, 'lon': 13.41261723049818}]},
     {'cluster_id': 1,
      'points': [{'id': 3, 'name': 'Carol', 'lat': 52.525626, 'lon': 13.419246},
                 {'id': 4, 'name': 'Dan', 'lat': 52.52443559865125, 'lon': 13.41261723049818}]}]
I would like to convert this list into a dataframe with the following columns:
cluster_id
id
name
lat
lon
so that I can save it as a CSV. I tried a couple of solutions that I found, such as:
pd.concat([pd.DataFrame(l) for l in a],axis=1).T
But it didn't work as I expected.
What am I doing wrong?
Thanks
You can use pd.json_normalize:
df = pd.json_normalize(a, record_path='points', meta='cluster_id')
print(df)
id name lat lon cluster_id
0 1 Alice 52.523955 13.442362 0
1 2 Bob 52.526659 13.448097 0
2 1 Alice 52.523955 13.442362 0
3 2 Bob 52.526659 13.448097 0
4 3 Carol 52.525626 13.419246 1
5 4 Dan 52.524436 13.412617 1
6 3 Carol 52.525626 13.419246 1
7 4 Dan 52.524436 13.412617 1
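Since the stated goal was to save the result as a CSV, the normalized frame can be written out directly; the column order can also be adjusted to match the requested list ('clusters.csv' below is just a placeholder filename):

```python
import pandas as pd

a = [{'cluster_id': 0,
      'points': [{'id': 1, 'name': 'Alice', 'lat': 52.523955, 'lon': 13.442362},
                 {'id': 2, 'name': 'Bob', 'lat': 52.526659, 'lon': 13.448097}]}]

df = pd.json_normalize(a, record_path='points', meta='cluster_id')

# reorder so cluster_id comes first, matching the requested column list
df = df[['cluster_id', 'id', 'name', 'lat', 'lon']]

# 'clusters.csv' is a placeholder filename
df.to_csv('clusters.csv', index=False)
```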
I have a simple dataframe like the one below (the original post showed before/after images), and I want to reduce it to the unique Department/name combinations:
I use:
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jason', 'Amy', 'Jason', 'River', 'Kate', 'David', 'Jack', 'David'],
        'Department': ['Sales', 'Operation', 'Operation', 'Sales', 'Operation', 'Sales', 'Operation', 'Sales', 'Finance', 'Finance', 'Finance'],
        'Weight lost': [4, 4, 1, 4, 4, 4, 7, 2, 8, 1, 8],
        'Point earned': [2, 2, 1, 2, 2, 2, 4, 1, 4, 1, 4]}
df = pd.DataFrame(data)
final = (df.pivot_table(index=['Department', 'name'], values='Weight lost',
                        aggfunc='count', fill_value=0)
           .stack(dropna=False)
           .reset_index(name='Weight_lost_count'))
del final['level_2']
del final['Weight_lost_count']
print(final)
The 'final' line seems to involve unnecessary steps. What would be a better way to write it?
Try groupby with head
out = df.groupby(['Department','name']).head(1)
Isn't this just drop_duplicates?
df[['Department','name']].drop_duplicates()
Output:
Department name
0 Sales Jason
1 Operation Molly
2 Operation Tina
4 Operation Amy
6 Operation River
7 Sales Kate
8 Finance David
9 Finance Jack
And to exactly match the final:
(df[['Department','name']].drop_duplicates()
.sort_values(by=['Department','name'])
)
Output:
Department name
8 Finance David
9 Finance Jack
4 Operation Amy
1 Operation Molly
6 Operation River
2 Operation Tina
0 Sales Jason
7 Sales Kate
I need to iterate (a vectorized operation is not possible) over a very large dataframe (10 million x 70). df.iterrows and direct access via df.loc[i, col] are way too slow. In the past I would first turn the dataframe into a dictionary of dictionaries, which allows me to iterate very quickly. However, this method takes up a lot of memory and is no longer feasible for my current data.
I need to sacrifice some lookup speed to save memory. What is the best way to do this? Would turning my dataframe into a dictionary of row Series, {index: Series}, work?
Do you mean something like this:
In [1112]: pd.DataFrame(df.reset_index().to_dict(orient='records'))
Out[1112]:
index id block check
0 0 6 25 yes
1 1 6 32 no
2 2 9 18 yes
3 3 12 17 no
4 4 15 23 yes
5 5 15 11 yes
6 6 15 15 yes
In [1113]: df.reset_index().to_dict(orient='records')
Out[1113]:
[{'index': 0, 'id': 6, 'block': 25, 'check': 'yes'},
{'index': 1, 'id': 6, 'block': 32, 'check': 'no'},
{'index': 2, 'id': 9, 'block': 18, 'check': 'yes'},
{'index': 3, 'id': 12, 'block': 17, 'check': 'no'},
{'index': 4, 'id': 15, 'block': 23, 'check': 'yes'},
{'index': 5, 'id': 15, 'block': 11, 'check': 'yes'},
{'index': 6, 'id': 15, 'block': 15, 'check': 'yes'}]
you could just do this (thanks @oppressionslayer for the example df):
df
id block check
0 6 25 yes
1 6 32 no
2 9 18 yes
3 12 17 no
4 15 23 yes
5 15 11 yes
6 15 15 yes
df.to_dict('index')
output:
{0: {'id': 6, 'block': 25, 'check': 'yes'}, 1: {'id': 6, 'block': 32, 'check': 'no'}, 2: {'id': 9, 'block': 18, 'check': 'yes'}, 3: {'id': 12, 'block': 17, 'check': 'no'}, 4: {'id': 15, 'block': 23, 'check': 'yes'}, 5: {'id': 15, 'block': 11, 'check': 'yes'}, 6: {'id': 15, 'block': 15, 'check': 'yes'}}
if you specifically (for some reason) want it to be {index: Series}, you could do this, which can be accessed the same way (i.e. df_name[i][col]):
df.T.to_dict('series')
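If the underlying goal is fast row-wise iteration without the memory cost of materialising the whole frame as a dict, df.itertuples() is often a good middle ground: it yields lightweight namedtuples lazily and is usually much faster than iterrows. This is a general suggestion, not something benchmarked on a 10 million x 70 frame. A sketch on the example df above:

```python
import pandas as pd

df = pd.DataFrame({'id': [6, 6, 9], 'block': [25, 32, 18],
                   'check': ['yes', 'no', 'yes']})

# itertuples yields one namedtuple per row; fields are accessed by name,
# and no full copy of the frame is kept in memory
checked = [row.id for row in df.itertuples(index=False) if row.check == 'yes']
print(checked)  # [6, 9]
```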
One of the columns of my pandas dataframe looks like this
>> df
Item
0 [{"id":A,"value":20},{"id":B,"value":30}]
1 [{"id":A,"value":20},{"id":C,"value":50}]
2 [{"id":A,"value":20},{"id":B,"value":30},{"id":C,"value":40}]
I want to expand it as
A B C
0 20 30 NaN
1 20 NaN 50
2 20 30 40
I tried:
dfx = pd.DataFrame()
for i in range(df.shape[0]):
    df1 = pd.DataFrame(df.Item[i]).T
    header = df1.iloc[0]
    df1 = df1[1:]
    df1 = df1.rename(columns=header)
    dfx = dfx.append(df1)
But this takes a lot of time as my data is huge. What is the best way to do this?
My original json data looks like this:
[
    {
        '_id': '5b1284e0b840a768f5545ef6',
        'device': '0035sdf121',
        'customerId': '38',
        'variantId': '31',
        'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
        'item': [{'id': A, 'value': 20},
                 {'id': B, 'value': 30},
                 {'id': C, 'value': 50}]
    },
    {
        '_id': '5b1284e0b840a768f5545ef6',
        'device': '0035sdf121',
        'customerId': '38',
        'variantId': '31',
        'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
        'item': [{'id': A, 'value': 20},
                 {'id': B, 'value': 30},
                 {'id': C, 'value': 50}]
    },
    .............
]
I agree with @JeffH: you should really look at how you are constructing the DataFrame.
Assuming you are getting this data from somewhere out of your control, you can convert it to your desired DataFrame with:
In []:
pd.DataFrame(df['Item'].apply(lambda r: {d['id']: d['value'] for d in r}).values.tolist())
Out[]:
A B C
0 20 30.0 NaN
1 20 NaN 50.0
2 20 30.0 40.0
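Since the data originally arrives as a list of records each carrying an 'item' list, it may be simpler to flatten those records before ever building the stringly-typed Item column. A sketch with pd.json_normalize plus pivot_table; the variable name records is assumed, and the ids are quoted strings here even though the question's snippet shows them bare:

```python
import pandas as pd

# hypothetical trimmed record; the real ones also carry device,
# customerId, variantId and timeStamp fields
records = [{'_id': '5b1284e0b840a768f5545ef6',
            'item': [{'id': 'A', 'value': 20},
                     {'id': 'B', 'value': 30},
                     {'id': 'C', 'value': 50}]}]

# one row per item entry, keeping the parent _id as metadata
flat = pd.json_normalize(records, record_path='item', meta='_id')

# spread the ids into columns, one row per record
wide = flat.pivot_table(index='_id', columns='id', values='value')
print(wide)
```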
This question already has answers here:
Comparing List against Dict - return key if value matches list
(2 answers)
Closed 6 months ago.
I have dictionaries inside a list:
L= [{'id': 3, 'term': 'bugatti', 'bucket_id': 'ad_3'},
{'id': 4, 'term': 'mercedez', 'bucket_id': 'ad_4'},
{'id': 8, 'term': 'entertainment', 'bucket_id': 'ad_8'},
{'id': 8, 'term': 'entertainment', 'bucket_id': 'ad_8'},
{'id': 9, 'term': 'music', 'bucket_id': 'ad_9'}]
and another list:
words=['bugatti', 'entertainment', 'music','politics']
I want to match the elements of the list words against the key term and get the corresponding dictionaries. The expected output is:
new_list= [{'id': 3, 'term': 'bugatti', 'bucket_id': 'ad_3'},
{'id': 8, 'term': 'entertainment', 'bucket_id': 'ad_8'},
{'id': 8, 'term': 'entertainment', 'bucket_id': 'ad_8'},
{'id': 9, 'term': 'music', 'bucket_id': 'ad_9'}]
What I have tried:
for d in L:
    for k, v in d.items():
        for w in words:
            if v == w:
                print(k, v)
gives me only:
term bugatti
term entertainment
term entertainment
term music
Using a list comprehension.
Ex:
L= [{'id': 3, 'term': 'bugatti', 'bucket_id': 'ad_3'},
{'id': 4, 'term': 'mercedez', 'bucket_id': 'ad_4'},
{'id': 8, 'term': 'entertainment', 'bucket_id': 'ad_8'},
{'id': 8, 'term': 'entertainment', 'bucket_id': 'ad_8'},
{'id': 9, 'term': 'music', 'bucket_id': 'ad_9'}]
words=['bugatti', 'entertainment', 'music','politics']
print([i for i in L if i["term"] in words])
Output:
[{'bucket_id': 'ad_3', 'id': 3, 'term': 'bugatti'},
{'bucket_id': 'ad_8', 'id': 8, 'term': 'entertainment'},
{'bucket_id': 'ad_8', 'id': 8, 'term': 'entertainment'},
{'bucket_id': 'ad_9', 'id': 9, 'term': 'music'}]
You can use a list comprehension, but I've included the full loop so you can see the logic more clearly:
new_l = [i for i in L if i['term'] in words]
Full loop:
new_l = []
for i in L:
    if i['term'] in words:
        new_l.append(i)
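One small refinement if words ever gets long: convert it to a set first, so each membership test is O(1) instead of a scan of the whole list. A sketch on a trimmed version of the data:

```python
L = [{'id': 3, 'term': 'bugatti', 'bucket_id': 'ad_3'},
     {'id': 4, 'term': 'mercedez', 'bucket_id': 'ad_4'},
     {'id': 9, 'term': 'music', 'bucket_id': 'ad_9'}]
words = ['bugatti', 'entertainment', 'music', 'politics']

wanted = set(words)  # constant-time lookups instead of scanning the list
new_l = [d for d in L if d['term'] in wanted]
print(new_l)
```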
print([d for d in L if d["term"] in words])
The problem is that you are printing (k, v), which is just the key and value of one dictionary entry. If you want the whole dictionary, you have to put the dictionary itself in the print statement:
for d in L:
    for k, v in d.items():
        for w in words:
            if v == w:
                print(d)