Creating a new column based on another column - python

I have a dataframe (called msg_df) that has a column called "messages". This column has, for each row, a list of dictionaries as values
(example:
msg_df['messages'][0]
output:
[{'id': 1, 'date': '2018-12-04T16:26:13Z', 'type': 'b'},
{'id': 2, 'date': '2018-12-11T15:28:49Z', 'type': 'i'},
{'id': 3, 'date': '2018-12-04T16:26:13Z', 'type': 'c'}] )
What I need to do is create a new column, let's call it "filtered_messages", which contains only the dictionaries whose 'type' is 'b' or 'i'.
The problem is, when I apply a list comp to a single value, it works, for example:
test = msg_df['messages'][0]
keys_list = ['b','i']
filtered = [d for d in test if d['type'] in keys_list]
filtered
output:
[{'id': 1, 'date': '2018-12-04T16:26:13Z', 'type': 'b'},
{'id': 2, 'date': '2018-12-11T15:28:49Z', 'type': 'i'}]
The output is the filtered list. However, I am not able to:
1. apply the same concept to the whole column, row by row
2. obtain a new column whose values are the filtered lists
I'm new to Python and really need some help here.
PS: Working on Jupyter, have pandas, numpy, etc.

As a general remark, this looks like an odd structure for pandas. The underlying containers in pandas are numpy arrays, which makes pandas very good at numeric processing; it can also store other kinds of elements, but keeping containers inside pandas cells is rarely a good idea.
That being said, you can use apply to apply a Python function to every element of a pandas Series, that is, to a DataFrame column:
keys_list = ['b', 'i']
msg_df['filtered_messages'] = msg_df['messages'].apply(
    lambda x: [d for d in x if d['type'] in keys_list])
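Here is a minimal, self-contained sketch of the approach (note the lambda must use its own argument, not the outer `test` variable from the question; the sample frame is rebuilt here for illustration):

```python
import pandas as pd

# Rebuild a one-row sample frame mirroring the question's data.
msg_df = pd.DataFrame({'messages': [[
    {'id': 1, 'date': '2018-12-04T16:26:13Z', 'type': 'b'},
    {'id': 2, 'date': '2018-12-11T15:28:49Z', 'type': 'i'},
    {'id': 3, 'date': '2018-12-04T16:26:13Z', 'type': 'c'},
]]})

keys_list = ['b', 'i']
# apply runs the lambda once per row; its argument is that row's list of dicts.
msg_df['filtered_messages'] = msg_df['messages'].apply(
    lambda msgs: [d for d in msgs if d['type'] in keys_list])

print(msg_df['filtered_messages'][0])
```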

Related

Why does my df save one dictionary as two duplicate rows?

I have a dictionary:
import pandas as pd
d = {'id': 1, 'name': 'Pizza', 'calories': 234}
print(pd.DataFrame(d))
When I try to turn it into a dataframe using pd.DataFrame(d), I get a dataframe with two duplicate rows of the same entry:
   id   name  calories
0   1  Pizza       234
1   1  Pizza       234
I want the outcome to be only one row for each entry, not two.
I have tried using pd.DataFrame(d) and pd.DataFrame.from_dict(d). I know I can just use df.iloc[0] or drop the duplicates to work around this, but why is the duplicate saved at all?
Pandas version is 1.4.2
Not sure if this is version-dependent, but I can't even create a DataFrame from just the two lines you've mentioned. That said, you should be able to resolve this by making each value a list.
d = {'id': [1], 'name': ['Pizza'], 'calories': [234]}
pd.DataFrame(d)
I don't know if it's this obvious to everyone, but personally I feel like a doofus only finding this out now. The solution was to put the dict inside list brackets:
pd.DataFrame([d])
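To illustrate the difference: a dict of scalars has no defined row count (recent pandas versions raise ValueError for it), while a list of dicts is a list of records, one row each. A quick sketch:

```python
import pandas as pd

d = {'id': 1, 'name': 'Pizza', 'calories': 234}

# Wrapping the dict in a list makes it one record -> one row.
df = pd.DataFrame([d])
print(df.shape)

# Alternatively, pass an explicit index for a dict of scalars.
df2 = pd.DataFrame(d, index=[0])
print(df2.equals(df))
```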

How do I create a nested dictionary to pair a dataframe's column's categories with its corresponding frequency of occurrence?

I have a dataset comprising categorical data like the one shown below:
How do I create a nested dictionary using this dataframe in such a way that the "key" will be the column, and the nested "key":"value" will be the "category":"number of times said categories occur"?
You can use collections.Counter to count the number of occurrences of each category. When fed an iterable (such as a DataFrame column), this will return a dict-like object of the type "category": count, like your inner dict.
To get this for each one of the columns, you could iterate over the columns, like so:
from collections import Counter

all_counts = {}
for column in df.columns:
    all_counts[column] = Counter(df[column])
Try as follows:
import pandas as pd

# sample data
data = {'gender': ['Male', 'Female', 'Male'],
        'heart_disease': [0, 1, 1]}
df = pd.DataFrame(data)

a_dict = {}
for x in df.columns:
    a_dict[x] = df[x].value_counts().to_dict()

print(a_dict)
output:
{'gender': {'Male': 2, 'Female': 1}, 'heart_disease': {1: 2, 0: 1}}
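The same result can also be written as a dict comprehension; a minimal sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['Male', 'Female', 'Male'],
                   'heart_disease': [0, 1, 1]})

# {column: {category: count}} in one expression
a_dict = {col: df[col].value_counts().to_dict() for col in df.columns}
print(a_dict)
```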

Sorting a Dataframe in Python

Currently I have the following:
a data file called "world_bank_projects.json"
import json
projects = json.load(open('data/world_bank_projects.json'))
from which I made a dataframe on the column "mjtheme_namecode":
from pandas import json_normalize
proj_norm = json_normalize(projects, 'mjtheme_namecode')
After which I removed the duplicated entries
proj_norm_no_dup = proj_norm.drop_duplicates()
However, when I tried to sort the dataframe by the 'code' column, it somehow doesn't work:
proj_norm_no_dup.sort_values(by = 'code')
My question is why doesn't the sort function sort 10 and 11 to the bottom of the dataframe? it sorted everything else correctly.
Edit1: mjtheme_namecode is a list of dictionaries containing the keys 'code' and 'name'. Example: 'mjtheme_namecode': [{'code': '5', 'name': 'Trade and integration'}, {'code': '4', 'name': 'Financial and private sector development'}]
After normalization, the 'code' column is a series type.
type(proj_norm_no_dup['code'])
pandas.core.series.Series
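The likely cause (not stated in the thread) is that json_normalize leaves the 'code' values as strings, so sort_values compares them lexicographically and '10' sorts before '2'. A sketch with hypothetical codes (the key parameter assumes pandas >= 1.1):

```python
import pandas as pd

# Hypothetical string codes, as json_normalize would produce them.
proj = pd.DataFrame({'code': ['5', '4', '10', '11', '2']})

# Lexicographic: '10' and '11' sort before '2'.
print(proj.sort_values(by='code')['code'].tolist())

# Compare as integers instead (or convert the column with astype(int)).
fixed = proj.sort_values(by='code', key=lambda s: s.astype(int))
print(fixed['code'].tolist())
```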

Use python/pandas to combine rows where duplicate values exist in one column

I want to use python to determine if the first instance of an ID value in the "Id" column has a match on a later row in that same column. If it does, then I want to take the value from the "Avail" column for the rows which match that initial "Id" value. Then I want to delete the rows with the duplicate Ids.
Here's my sample data:
I have a CSV file that has data like this:
Id,First,Last,Avail
abcdefg,John,Smith,4164667a-5dca-4ec6-a495-4be5b135d868=immediate
dgasgas,Nancy,Adams,f98a8fbd-fb88-49b9-894e-631ba2a6f369=immediate
gaytrjhu,John,Smith,e24ddf4c-c79f-4a84-a4ed-d92a10cc9e15=immediate
abcdefg,John,Smith,3ec0c158-8782-41ff-8388-5a10b9261b60=immediate
abcdefg,John,Smith,3ec0c158-8782-41ff-8388-c5dfe3b1276c=relative|7
Desired output (v1) (please note that I don't care about the "First" or "Last" columns from the duplicate rows; I only care about the "Avail" data from those):
Id,First,Last,Avail
abcdefg,John,Smith,4164667a-5dca-4ec6-a495-4be5b135d868=immediate;3ec0c158-8782-41ff-8388-5a10b9261b60=immediate;3ec0c158-8782-41ff-8388-5a10b9261b60=immediate
dgasgas,Nancy,Adams,f98a8fbd-fb88-49b9-894e-631ba2a6f369=immediate
gaytrjhu,John,Smith,e24ddf4c-c79f-4a84-a4ed-d92a10cc9e15=immediate
abcdefg,Nancy,Adams,3ec0c158-8782-41ff-8388-5a10b9261b60=immediate
abcdefg,John,Smith,3ec0c158-8782-41ff-8388-c5dfe3b1276c=relative|7
Then I'd like to delete the "duplicate" rows, leaving this:
Id,First,Last,Avail
abcdefg,John,Smith,4164667a-5dca-4ec6-a495-4be5b135d868=immediate;3ec0c158-8782-41ff-8388-5a10b9261b60=immediate;3ec0c158-8782-41ff-8388-5a10b9261b60=immediate
dgasgas,Nancy,Adams,f98a8fbd-fb88-49b9-894e-631ba2a6f369=immediate
gaytrjhu,John,Smith,e24ddf4c-c79f-4a84-a4ed-d92a10cc9e15=immediate
import pandas as pd

df = pd.DataFrame(data=[
    [1, 'John', 'Smith', 'a'],
    [1, 'John', 'Smith', 'b'],
    [2, 'Kate', 'Smith', 'c'],
], columns=['ID', 'First', 'Last', 'Avail'])

output = (df
          .groupby(['ID', 'First', 'Last'], as_index=False)
          .agg({'Avail': lambda x: ';'.join(x)}))
You can use groupby as @Sphinx suggested. The example above produces the style of output you requested.
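One caveat: grouping on ['ID', 'First', 'Last'] keeps rows separate when the names differ for the same Id. If only the first occurrence's names matter, as the question suggests, a sketch grouping on Id alone (sample values shortened for readability):

```python
import pandas as pd

df = pd.DataFrame(data=[
    ['abcdefg', 'John', 'Smith', 'a=immediate'],
    ['dgasgas', 'Nancy', 'Adams', 'b=immediate'],
    ['abcdefg', 'John', 'Smith', 'c=relative|7'],
], columns=['Id', 'First', 'Last', 'Avail'])

# Keep the first First/Last per Id; join every Avail value with ';'.
output = (df.groupby('Id', as_index=False, sort=False)
            .agg({'First': 'first', 'Last': 'first', 'Avail': ';'.join}))
print(output)
```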

How to subset a pandas data-frame based on a dictionary using case-insensitive matching

I have a dataframe which contains various products and their descriptions as shown in the image below:
I have a dict which contains the key-value pairs based on which the filtering has to be done:
ent_dict
{'brand': 'Dexter', 'color': 'brown', 'product': 'footwear', 'size': '32'}
As can be seen, the dict and the dataframe might contain values in different cases, hence I need to do case-insensitive matching here. There might also be numeric columns, for which normal matching will do.
Can someone please help me with this?
The code below works for string matches. You can further change the final statement to match the integers too.
import numpy as np
import pandas as pd
import re

df = pd.DataFrame({'Product': np.array(['Footwear' for i in range(4)]),
                   'Category': np.array(['Women' for i in range(4)]),
                   'Size': np.array([7, 7, 7, 8]),
                   'Color': np.array(['black', 'brown', 'blue', 'black'])})
ent_dict = {'Category': 'Women', 'Color': 'black', 'Product': 'Footwear'}

values = [i for i in ent_dict.values()]
columns = [df.filter(regex=re.compile(i, re.IGNORECASE)).columns[0]
           for i in ent_dict]
df[eval(" & ".join(["(df['{0}'] == {1})".format(col, repr(cond))
                    for col, cond in zip(columns, values)]))]
The case-insensitive search can be accomplished using str.contains:
df[eval(" & ".join(["(df['{0}'].str.contains({1}, case=False))".format(col, repr(cond))
                    for col, cond in zip(columns, values)]))]
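For anyone wary of building expressions as strings, an eval-free variant is to accumulate a boolean mask, lower-casing string columns and comparing numeric columns directly. A sketch, rebuilding the sample frame from the answer above so the snippet stands alone:

```python
import pandas as pd

df = pd.DataFrame({'Product': ['Footwear'] * 4,
                   'Category': ['Women'] * 4,
                   'Size': [7, 7, 7, 8],
                   'Color': ['black', 'brown', 'blue', 'black']})
ent_dict = {'Category': 'Women', 'Color': 'BLACK', 'Product': 'footwear'}

mask = pd.Series(True, index=df.index)
for col, value in ent_dict.items():
    if df[col].dtype == object:
        # Case-insensitive comparison for string columns.
        mask &= df[col].str.lower() == str(value).lower()
    else:
        # Plain equality for numeric columns.
        mask &= df[col] == value

print(df[mask])
```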
