Convert pandas dataframe columns into nested python dictionary

I want to create a Python dictionary from pandas DataFrame column 2 (source) and column 3 (description), grouped by column 1 (title).
I also want to get the values for the provided titles only:
titles = ['Test1', 'Test2']
   title source   description
1  Test1    ABC  description1
2  Test2    ABC  description2
3  Test2    DEF  description3
4  Test3    XYZ  description4
output = {'Test1': {'ABC': 'description1'}, 'Test2': {'ABC': 'description2', 'DEF': 'description3'}}

First filter with boolean indexing using Series.isin, then use GroupBy.apply with a lambda to build a Series of dicts, and finally call Series.to_dict:
titles = ['Test1','Test2']
d = (df[df['title'].isin(titles)]
       .groupby('title')[['source','description']]
       .apply(lambda x: dict(x.to_numpy()))
       .to_dict())
print(d)
{'Test1': {'ABC': 'description1'}, 'Test2': {'ABC': 'description2', 'DEF': 'description3'}}
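An equivalent variant (a sketch, not from the original answer) builds each inner dict from a per-group Series instead of going through a NumPy array:
d = (df[df['title'].isin(titles)]
       .groupby('title')
       .apply(lambda g: g.set_index('source')['description'].to_dict())
       .to_dict())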

You can group the dataframe by title and then use the Python zip function to create the inner dictionary from source and description. Please find the code for this below.
final_dict = dict()
all_groups = df.groupby('title')
for title in titles:
    title_group = all_groups.get_group(title)
    source_desc = dict(zip(title_group.source, title_group.description))
    final_dict[title] = source_desc  # key on the title string, not the group DataFrame
print(final_dict)

Try this:
result = {}
filter_ = ['Test1','Test2']
for x in df[df['title'].isin(filter_)].to_dict(orient='records'):
    result.setdefault(x['title'], {}).update({x['source']: x['description']})
print(result)
{'Test1': {'ABC': 'description1'}, 'Test2': {'ABC': 'description2', 'DEF': 'description3'}}

Related

retrieve multiple substrings from a string in pandas dataframe

I have a dataframe that contains strings of email addresses in the format:
d = {'Country':'A', 'Email':'123#abc.com,456#def.com,789#ghi.com'}
df = pd.DataFrame(data=d, index=[0])
and I want only the usernames of the emails, so the new dataframe should look like this:
d = {'Country':'A', 'Email':'123,456,789'}
df1 = pd.DataFrame(data=d, index=[0])
The best way I could think of is to split the original string by comma, delete the domain part of each email, and join the list back again. Are there better ways to solve this problem?
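For reference, a minimal sketch of that split/strip/join approach (assuming the single-row df above) would be:
df['Email'] = df['Email'].apply(
    lambda s: ','.join(part.split('#')[0] for part in s.split(',')))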
If you want a string as output, you can remove the part starting at #. Use str.replace with the #[^,]+ regex:
df['Email'] = df['Email'].str.replace(r'#[^,]+', '', regex=True)
Output:
Country Email
0 A 123,456,789
For a list you could use str.findall:
df['Email'] = df['Email'].str.findall(r'[^,]+(?=#)')
Output:
Country Email
0 A [123, 456, 789]
This is a regex question, not really a Pandas question, but here's a solution that returns a list (which you can join back into a string):
import re
df['Email'].apply(lambda s: re.findall(r'\w+(?=#)', s))
Output:
0 [123, 456, 789]
Name: Email, dtype: object
Try this (note the capture group before the #, so findall keeps the usernames rather than the domains):
import pandas as pd
d = {'Country':['A', 'B'], 'Email':['123#abc.com,456#def.com,789#ghi.com', '134#abc.com,436#def.com,229#ghi.com']}
df = pd.DataFrame(d)
df['Email'] = df['Email'].str.findall(r'(\w+)#').apply(', '.join)
df
Output
  Country          Email
0       A  123, 456, 789
1       B  134, 436, 229
here is one way to do it, working on the dictionary itself rather than the DataFrame
import re

# iterate through the dictionary values and remove everything from '#' up to the next ','
d = {k: re.sub(r"#[^,\']+", '', v) for k, v in d.items()}
d
{'Country': 'A', 'Email': '123,456,789'}

Pandas Dataframe from list nested in json

I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
'__type': 'units',
'data': [{'name': 'units.unit', 'type': 'STRING'},
{'name': 'units.classification', 'type': 'STRING'}]},
{'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (Pandas) DataFrame that looks like this:
  unit classification
1    A        Energie
2  bar
3  CCM        Volumen
4  CDM        Volumen
Things like pd.DataFrame(pd.json_normalize(test)['data']) are close but still throw the whole list into the column instead of making separate columns. record_path sounded right but I can't get it to work correctly either.
Any help?
It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
              if d.get('__rowType', None) == 'DATA' and 'data' in d],
             columns=['unit', 'classification'])
NB: this assumes test is the input list.
output:
unit classification
0 A Energie
1 bar
2 CCM Volumen
3 CDM Volumen
Rather than just giving you the code, I'll first explain how to do this in detail and then show the exact steps and the final code, so you understand everything for any similar situation.
When you want to create a pandas dataframe with two columns you can do this by creating a dictionary and passing it to DataFrame class:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This will result in this dataframe:
   col1  col2
0     1     3
1     2     4
So if you want the dataframe you specified in your question, the my_data dictionary should look like this:
import numpy as np

my_data = {
    'unit': ['A', 'bar', 'CCM', 'CDM'],
    'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df) + 1)
df
(Note the df.index = ... part: it is there because the index of the desired dataframe starts at 1 in your question.)
So you just have to extract these data from the input you provided and convert them into the exact my_data dictionary mentioned above. The whole code would be:
d = YOUR_DATA

# Collect the data values like ['A', 'Energie'], ['bar', ' '], etc. from the DATA rows
values = [x['data'] for x in d if x['__rowType'] == 'DATA']

# Get the column names ('unit', 'classification') from the META row
meta = list(filter(lambda x: x['__rowType'] == 'META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]

# Build the exact dictionary we need to pass to the DataFrame class
my_data = {column: [v[i] for v in values] for i, column in enumerate(columns)}

df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df) + 1)
df  # or print(df)
Note: of course you could do all of this in one complex line of code, but to avoid confusion I broke it into a couple of steps.

Get All Row Values After Split and Put Them In List

UPDATED: I have the following DataFrame:
df = pd.DataFrame({'sports': ["['soccer', 'men tennis']", "['soccer']", "['baseball', 'women tennis']"]})
print(df)
sports
0 ['soccer', 'men tennis']
1 ['soccer']
2 ['baseball', 'women tennis']
I need to extract all the unique sport names and put them into a list. I'm trying the following code:
out = pd.DataFrame(df['sports'].str.split(',').tolist()).stack()
out.value_counts().index
However, it's returning NaN values.
Desired output:
['soccer', 'men tennis', 'baseball', 'women tennis']
What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!
If these are lists, then you could explode + unique:
out = df['sports'].explode().unique().tolist()
If these are strings, then you could use ast.literal_eval first to parse it:
import ast
out = df['sports'].apply(ast.literal_eval).explode().unique().tolist()
or use ast.literal_eval in a set comprehension and unpack:
out = [*{x for lst in df['sports'].tolist() for x in ast.literal_eval(lst)}]
Output:
['soccer', 'men tennis', 'baseball', 'women tennis']
Assuming the values stored in the sports column are lists, we can flatten the column using np.hstack, then use set to get the unique values:
import numpy as np
set(np.hstack(df['sports']))
{'baseball', 'men tennis', 'soccer', 'women tennis'}
lst = []
df['sports'].apply(lambda x: [lst.append(element) for element in x])
lst = list(set(lst))
Not sure how efficient this is, but it works (again assuming the values are lists).
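A tidier variant of the same idea (a sketch, likewise assuming list values) avoids the side effect inside apply:
from itertools import chain
lst = list(set(chain.from_iterable(df['sports'])))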

How to transform unordered list to one row in pandas?

I have a word document with lists which I'm turning into a pandas dataframe. The document is made up of lists, something like:
Item1: abc
    a=2
    bc=01
Item2: abd
    bd=12
Item3: abt
I have been able to pull this data into a table like this:
d = {'item': ['item1','item1','item1','item2', 'item2', 'item3'], 'description': ['abc', 'a=2', 'bc=01', 'abd',' bd=12', 'abt']}
df = pd.DataFrame(data=d)
But the goal is to create a table like this:
d2 = {'item': ['item1', 'item2', 'item3'], 'description': ['abc', 'abd', 'abt'], 'labels': ["'a=2', 'bc=01'", 'bd=12', np.nan]}
good_df = pd.DataFrame(data=d2)
I was originally planning to:
set the item column as an index
split the dataframe in two: one with only the rows that don't have an equal sign, the other with only the rows that do
turn the rows into a list
append the list to a new column in split-dataframe-1
# split df
split_df_1 = df[~df['description'].str.contains("=")]
split_df_2 = df[df['description'].str.contains("=")]
# set new index
split_df_2 = df[df['description'].str.contains("=")].set_index('item')
split_df_2.loc('item1')
and this is where I'm stuck. I get the error "No axis named item1 for object type <class 'pandas.core.frame.DataFrame'>"
Any help with this error, or cleaner way to accomplish this task, would be super helpful.
thanks in advance
It seems the error you're getting is due to the use of parentheses instead of square brackets.
split_df_2.loc('item1')
# Should be
split_df_2.loc['item1']
This should create your intended output. First we extract the "label" rows, then we combine rows that share the same item
d = {'item': ['item1','item1','item1','item2', 'item2', 'item3'], 'description': ['abc', 'a=2', 'bc=01', 'abd',' bd=12', 'abt']}
df = pd.DataFrame(data=d)
label_mask = df["description"].str.contains("=")
new_labels = (df.loc[label_mask, :]
                .groupby("item")
                .apply(lambda g: ", ".join(g["description"])))
print(new_labels)
item
item1 a=2, bc=01
item2 bd=12
dtype: object
Now we can just add it to the "description" only rows to create the final DataFrame.
new_df = (df.loc[~label_mask, :] # Select the correct "description" rows
.set_index("item") # Change the index to be the item so our DataFrame aligns with our `new_labels` Series
.assign(labels=new_labels)) # Add `new_labels` as its own column
print(new_df)
description labels
item
item1 abc a=2, bc=01
item2 abd bd=12
item3 abt NaN
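A more compact alternative (a sketch, not from the original answers; it assumes import numpy as np) labels each row first and does everything in a single groupby pass:
mask = df['description'].str.contains('=')
good_df = (df.assign(kind=np.where(mask, 'labels', 'description'))
             .groupby(['item', 'kind'])['description']
             .agg(', '.join)
             .unstack()
             .reset_index())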

Map two dataframes and perform sum operation using a dictionary

I have a dataframe df
df
Object Action Cost1 Cost2
0 123 renovate 10000 2000
1 456 do something 0 10
2 789 review 1000 50
and a dictionary (called dictionary)
dictionary
{'Object_new': ['Object'],
 'Action_new': ['Action'],
 'Total_Cost': ['Cost1', 'Cost2']}
Further, I have an (initially empty) dataframe df_new that should contain almost the identical information as df, except that the column names need to be different (named according to the dictionary) and that some columns from df should be consolidated (e.g. via a sum operation) based on the dictionary.
The result should look like this:
df_new
Object_new Action_new Total_Cost
0 123 renovate 12000
1 456 do something 10
2 789 review 1050
How can I achieve this result using only the dictionary? I tried to use the .map() function but could not figure out how to perform the sum-operation with it.
The code to reproduce both dataframes and the dictionary is attached:
# import libraries
import pandas as pd

### create df
data_df = {'Object': [123, 456, 789],
           'Action': ['renovate', 'do something', 'review'],
           'Cost1': [10000, 0, 1000],
           'Cost2': [2000, 10, 50],
           }
df = pd.DataFrame(data_df)

### create dictionary
dictionary = {'Object_new': ['Object'],
              'Action_new': ['Action'],
              'Total_Cost': ['Cost1', 'Cost2']}

### create df_new
# data_df_new = pd.DataFrame(columns=['Object_new', 'Action_new', 'Total_Cost'])
data_df_new = {'Object_new': [123, 456, 789],
               'Action_new': ['renovate', 'do something', 'review'],
               'Total_Cost': [12000, 10, 1050],
               }
df_new = pd.DataFrame(data_df_new)
A play with groupby:
inv_dict = {x:k for k,v in dictionary.items() for x in v}
df_new = df.groupby(df.columns.map(inv_dict), axis=1).sum()
Output:
Action_new Object_new Total_Cost
0 renovate 123 12000
1 do something 456 10
2 review 789 1050
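Note that axis=1 in groupby is deprecated in recent pandas releases (around 2.1); an equivalent, as a sketch, transposes, groups on the index, and transposes back:
df_new = df.T.groupby(df.columns.map(inv_dict)).sum().T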
Given the complexity of your algorithm, I would suggest performing a Series addition operation to solve this problem.
Why? In Pandas, every column in a DataFrame works as a Series under the hood.
data_df_new = {
    'Object_new': df['Object'],
    'Action_new': df['Action'],
    'Total_Cost': (df['Cost1'] + df['Cost2'])  # Addition of two Series
}
df_new = pd.DataFrame(data_df_new)
Running this code builds every new column directly from the corresponding Series of df, with the two cost columns added together.
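Since the question asks to drive everything from the dictionary, a generalization of this idea (a sketch; it assumes every multi-column entry should be summed) would be:
df_new = pd.DataFrame({
    new_col: df[old_cols].sum(axis=1) if len(old_cols) > 1 else df[old_cols[0]]
    for new_col, old_cols in dictionary.items()
})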
You can create an empty data frame, copy the new columns into it (summing the two cost columns), and then use to_dict to convert it to a dictionary.
import pandas as pd
import numpy as np
data_df = {'Object': [123, 456, 789],
           'Action': ['renovate', 'do something', 'review'],
           'Cost1': [10000, 0, 1000],
           'Cost2': [2000, 10, 50],
           }
df = pd.DataFrame(data_df)
print(df)
MyEmptydf = pd.DataFrame()
MyEmptydf['Object_new']=df['Object']
MyEmptydf['Action_new']=df['Action']
MyEmptydf['Total_Cost'] = df['Cost1'] + df['Cost2']
print(MyEmptydf)
dictionary = MyEmptydf.to_dict(orient="index")
print(dictionary)
You can run the code here: https://repl.it/repls/RealisticVillainousGlueware
If you're trying to avoid pandas entirely and only use the dictionary, this should solve it:
Object = []
totalcost = []
action = []
for i in range(0, 3):
    Object.append(data_df['Object'][i])
    totalcost.append(data_df['Cost1'][i] + data_df['Cost2'][i])
    action.append(data_df['Action'][i])
dict2 = {'Object': Object, 'Action': action, 'TotalCost': totalcost}
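A tidier dict-only variant (a sketch) uses zip instead of a hard-coded index loop:
dict2 = {
    'Object': list(data_df['Object']),
    'Action': list(data_df['Action']),
    'TotalCost': [c1 + c2 for c1, c2 in zip(data_df['Cost1'], data_df['Cost2'])],
}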
