Sorting a Dataframe in Python

Currently I have the following:
a data file called "world_bank_projects.json"
projects = json.load(open('data/world_bank_projects.json'))
From this I made a DataFrame of the column "mjtheme_namecode":
proj_norm = json_normalize(projects, 'mjtheme_namecode')
After which I removed the duplicated entries
proj_norm_no_dup = proj_norm.drop_duplicates()
However, when I tried to sort the dataframe by the 'code' column, it somehow doesn't work:
proj_norm_no_dup.sort_values(by = 'code')
My question is: why doesn't the sort function sort 10 and 11 to the bottom of the dataframe? It sorted everything else correctly.
Edit1: mjtheme_namecode is a list of dictionaries containing the keys 'code' and 'name'. Example: 'mjtheme_namecode': [{'code': '5', 'name': 'Trade and integration'}, {'code': '4', 'name': 'Financial and private sector development'}]
After normalization, the 'code' column is a series type.
type(proj_norm_no_dup['code'])
pandas.core.series.Series
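The behavior in the question can be reproduced with a small sketch (using made-up codes, not the actual World Bank data): after json_normalize the 'code' values are strings, and strings sort lexicographically, character by character, so '10' and '11' land between '1' and '2'. Converting the column to int gives the numeric order the question expects.

```python
import pandas as pd

# Minimal reproduction: the codes are strings after normalization,
# so sort_values compares them lexicographically.
df = pd.DataFrame({'code': ['5', '4', '10', '11', '2', '1']})

# '10' and '11' sort between '1' and '2', not at the bottom.
lexical = df.sort_values(by='code')['code'].tolist()

# Converting to int first yields the expected numeric order.
numeric = df.assign(code=df['code'].astype(int)).sort_values(by='code')['code'].tolist()
```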

Related

Flatten a Dataframe that is pivoted

I have the following code that is taking a single column and pivoting it into multiple columns. There are blanks in my result that I am trying to remove, but I am running into issues with the wrong values being applied to rows.
task_df = task_df.pivot(index=pivot_cols, columns='Field')['Value'].reset_index()
task_df[['Color','Class']] = task_df[['Color','Class']].bfill()
task_df[['Color','Class']] = task_df[['Color','Class']].ffill()
task_df = task_df.drop_duplicates()
(Screenshots of the starting, current, and desired tables omitted.)
This is basically merging all rows having the same name or id together. You can do it with this:
mergers = {'ID': 'first', 'Color': 'sum', 'Class': 'sum'}
task_df = (task_df.groupby('Name', as_index=False)
                  .aggregate(mergers)
                  .reindex(columns=task_df.columns)
                  .sort_values(by=['ID']))
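A self-contained sketch of this groupby merge, on hypothetical data shaped like the question's (the column names are the only thing taken from the original). Note that 'sum' on string columns concatenates them, which is what merges the partially filled rows.

```python
import pandas as pd

# Hypothetical flattened data: each Name appears on several rows,
# each row carrying only some of the field values.
task_df = pd.DataFrame({
    'Name': ['a', 'a', 'b', 'b'],
    'ID': [1, 1, 2, 2],
    'Color': ['red', '', 'blue', ''],
    'Class': ['', 'x', '', 'y'],
})

# Merge rows sharing the same Name: keep the first ID,
# concatenate the string fields ('sum' on strings concatenates).
mergers = {'ID': 'first', 'Color': 'sum', 'Class': 'sum'}
merged = (task_df.groupby('Name', as_index=False)
                 .aggregate(mergers)
                 .reindex(columns=task_df.columns)
                 .sort_values(by=['ID']))
```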

List sorting except specific element

I made a nested list to make a table:
information = [['name', 'age', 'sex', 'height', 'weight'], ['sam', '17', 'm', 155, 55], [...]]
I want to sort the data according to height, but when I use the .sort() method, the header row information[0] causes this error:
"TypeError: '<' not supported between instances of 'float' and 'str'"
How can I sort the data while skipping the first element?
As you can see in the first picture, I made the table to view in an Excel file, and I want to sort the elements according to the G column (this is a table of movie info). But when I try sorting with sort(), the top element is the header string "러닝타임+평점" ("running time + rating"), so sorting fails.
You can use sorted with the 4th element as key, on a slice of information that doesn't hold the header row:
information[1:] = sorted(information[1:], key=lambda x: x[3])
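Here is that one-liner on a small sample table, assuming a third data row ('kim', 'lee') invented for illustration; only the data rows are reordered and the header stays in place.

```python
# Sample table: header row followed by data rows.
information = [
    ['name', 'age', 'sex', 'height', 'weight'],
    ['sam', '17', 'm', 155, 55],
    ['kim', '18', 'f', 162, 50],
    ['lee', '16', 'm', 149, 60],
]

# Sort only the data rows by the height column (index 3),
# leaving the header row untouched.
information[1:] = sorted(information[1:], key=lambda x: x[3])
```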
if using of pandas library is acceptable then you can do this
create as dataframe.
import pandas as pd
df = pd.DataFrame(information)
# make the first row as column names
headers = df.iloc[0]
# create a updated dataframe with those new column names
updated_df = pd.DataFrame(df.values[1:], columns=headers)
# sort the values based on height
updated_df.sort_values('height', inplace=True)
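The same pandas answer as one runnable sketch, on invented sample rows: the first list becomes the column names, the rest become the data, and sort_values reorders by height.

```python
import pandas as pd

information = [
    ['name', 'age', 'sex', 'height', 'weight'],
    ['sam', '17', 'm', 155, 55],
    ['kim', '18', 'f', 162, 50],
    ['lee', '16', 'm', 149, 60],
]

df = pd.DataFrame(information)
# Promote the first row to column names, keep the rest as data.
headers = df.iloc[0]
updated_df = pd.DataFrame(df.values[1:], columns=headers)
# Sort the data rows by height.
updated_df.sort_values('height', inplace=True)
```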

Filtering pandas DataFrame with column of dictionaries using a particular key

I have a pandas DataFrame df that has one column feats made up of dictionaries. I am trying to filter the DataFrame to retain only rows that have a particular key called color in the column feats. Given below are few sample rows from that column.
{'color':'blue', 'width':'20','height':'100'}
{'color':'red', 'width':'15','height':'80'}
{'width':'25','height':'75'}
I tried a few things as shown below:
1. sub_df = df[['color' in x.keys() for x in df.feats]]
I got the following error:
AttributeError: 'NoneType' object has no attribute 'keys'
2. sub_df = df['color' in df['feats'].keys()]
I got this error:
KeyError: False
Using the suggestion in this link, I tried the following, because I know all the possible values that the key takes.
3. sub_df1 = df[df.feats.apply(lambda x: x['color'] == 'blue')]
This is the error I get:
KeyError: 'color'
I believe this is all happening because some rows do not have the key color. So the question is how do I filter the DataFrame by overcoming this problem?
Your feats column doesn't just contain dictionaries; it also has missing data stored as None. For example:
df = pd.DataFrame({'feats':[{'color':'blue', 'width':'20','height':'100'},
{'color':'red', 'width':'15','height':'80'},
{'width':'25','height':'75'},
None]})
To make sure we only check on dict type, we can use isinstance:
df[[isinstance(x, dict) and ('color' in x.keys()) for x in df.feats]]
Output:
feats
0 {'color': 'blue', 'width': '20', 'height': '100'}
1 {'color': 'red', 'width': '15', 'height': '80'}
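The same filter can also be written with apply, as a sketch on the sample frame from the answer; the isinstance guard runs before the key lookup, so None entries never raise.

```python
import pandas as pd

df = pd.DataFrame({'feats': [
    {'color': 'blue', 'width': '20', 'height': '100'},
    {'color': 'red', 'width': '15', 'height': '80'},
    {'width': '25', 'height': '75'},
    None,
]})

# Guard against non-dict entries (the None row) before checking the key.
sub_df = df[df['feats'].apply(lambda x: isinstance(x, dict) and 'color' in x)]
```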

Creating a new column based on another column

I have a dataframe (called msg_df) that has a column called "messages". This column has, for each row, a list of dictionaries as values
(example:
msg_df['messages'][0]
output:
[{'id': 1, 'date': '2018-12-04T16:26:13Z', 'type': 'b'},
{'id': 2, 'date': '2018-12-11T15:28:49Z', 'type': 'i'},
{'id': 3, 'date': '2018-12-04T16:26:13Z', 'type': 'c'}] )
What I need to do is to create a new column, let's call it "filtered_messages", which only contains the dictionaries that have 'type': 'b' and 'type': 'i'.
The problem is, when I apply a list comp to a single value, it works, for example:
test = msg_df['messages'][0]
keys_list = ['b','i']
filtered = [d for d in test if d['type'] in keys_list]
filtered
output:
[{'id': 1, 'date': '2018-12-04T16:26:13Z', 'type': 'b'},
{'id': 2, 'date': '2018-12-11T15:28:49Z', 'type': 'i'}]
the output is the filtered list, however, I am not being able to:
1. apply the same concept to the whole column, row by row
2. obtain a new column with the values being the filtered list
New to Python, really need some help over here.
PS: Working on Jupyter, have pandas, numpy, etc.
As a general remark, this is an awkward structure for pandas. Its underlying containers are numpy arrays, which means pandas is very good at numeric processing; it can store other element types, but keeping containers inside pandas cells performs poorly.
That being said, you can use apply to apply a Python function to every element of a pandas Series, said differently, to a DataFrame column:
keys_list = ['b','i']
msg_df['filtered_messages'] = msg_df['messages'].apply(
    lambda msgs: [d for d in msgs if d['type'] in keys_list])
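Putting it together as a runnable sketch, on a one-row frame built from the question's sample list: apply runs the list comprehension on each row's list of dicts and the result becomes the new column.

```python
import pandas as pd

# One-row frame mirroring the question's structure.
msg_df = pd.DataFrame({'messages': [[
    {'id': 1, 'date': '2018-12-04T16:26:13Z', 'type': 'b'},
    {'id': 2, 'date': '2018-12-11T15:28:49Z', 'type': 'i'},
    {'id': 3, 'date': '2018-12-04T16:26:13Z', 'type': 'c'},
]]})

keys_list = ['b', 'i']
# apply passes each row's list of dicts to the lambda,
# which keeps only the wanted types.
msg_df['filtered_messages'] = msg_df['messages'].apply(
    lambda msgs: [d for d in msgs if d['type'] in keys_list])
```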

selecting subset of pandas dataframe based on specified list of values [duplicate]

This question already has answers here:
Filter dataframe rows if value in column is in a set list of values [duplicate]
(7 answers)
Closed 6 years ago.
I would like to know if there is a way to select rows based on a list of values. That is, create a subset from a dataframe based on the values from a list.
To explain further, I take an example of a dataframe from Chris Albon. Suppose I have the following dataframe:
raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])
df_a
I only wish to choose rows based on the 'first_name' from the list below:
fnames = ['Alex', 'Alice', 'Ayoung']
What I have always done is to run a loop over fnames with the condition:
for fn in fnames:
    df_name = df_a[df_a['first_name'] == fn]
and then append/concat each row to a new data frame to create what I desire. Is there a better way to subset a dataframe based on values from a list?
Use the isin method:
df_name = df_a[df_a['first_name'].isin(fnames)]
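The full answer as a runnable sketch, using the question's own sample data: isin builds one vectorized boolean mask, so no loop or concat is needed.

```python
import pandas as pd

raw_data = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches'],
}
df_a = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])

fnames = ['Alex', 'Alice', 'Ayoung']
# isin returns a boolean mask marking rows whose first_name is in fnames.
df_name = df_a[df_a['first_name'].isin(fnames)]
```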
