How to sort strings with numbers in Pandas? - python

I have a Python pandas DataFrame in which a column named status contains three kinds of possible values: ok, must read x more books, and does not read any book yet, where x is an integer greater than 0.
I want to sort status values according to the order above.
Example:
   name    status
0  Paul    ok
1  Jean    must read 1 more books
2  Robert  must read 2 more books
3  John    does not read any book yet
I've found some interesting hints, using Pandas Categorical and map but I don't know how to deal with variable values modifying strings.
How can I achieve that?

Use:
a = df['status'].str.extract(r'(\d+)', expand=False).astype(float)
d = {'ok': a.max() + 1, 'does not read any book yet': -1}
df1 = df.iloc[(-df['status'].map(d).fillna(a)).argsort()]
print(df1)
   name    status
0  Paul    ok
2  Robert  must read 2 more books
1  Jean    must read 1 more books
3  John    does not read any book yet
Explanation:
First extract the integers with the regex \d+.
Then dynamically create a dictionary to map the non-numeric values to numbers; map leaves the "must read x" rows as NaN.
Fill those NaNs from the numeric Series with fillna.
Get the sorted positions with argsort.
Select the rows in that order with iloc.
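To make those steps concrete, here is a quick walk-through of the intermediate values on the example frame (a sketch; the comments show what each Series holds):

import pandas as pd

df = pd.DataFrame({'name': ['Paul', 'Jean', 'Robert', 'John'],
                   'status': ['ok', 'must read 1 more books',
                              'must read 2 more books',
                              'does not read any book yet']})

a = df['status'].str.extract(r'(\d+)', expand=False).astype(float)
# a   -> NaN, 1.0, 2.0, NaN   (only the "must read x" rows carry a number)
d = {'ok': a.max() + 1, 'does not read any book yet': -1}
key = df['status'].map(d).fillna(a)
# key -> 3.0, 1.0, 2.0, -1.0  ('ok' ranks above every x, the last status below)
print(df.iloc[(-key).argsort()])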

You can use sorted with a custom function to calculate the indices which would sort the array (much like numpy.argsort), then feed the result to pd.DataFrame.iloc:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Paul', 'Jean', 'Robert', 'John'],
                   'status': ['ok', 'must read 20 more books',
                              'must read 3 more books',
                              'does not read any book yet']})

def sort_key(x):
    # x is an (index, status) pair produced by enumerate
    if x[1] == 'ok':
        return -1
    elif x[1] == 'does not read any book yet':
        return np.inf
    else:
        return int(x[1].split()[2])

idx = [idx for idx, _ in sorted(enumerate(df['status']), key=sort_key)]
df = df.iloc[idx, :]
print(df)
print(df)
   name    status
0  Paul    ok
2  Robert  must read 3 more books
1  Jean    must read 20 more books
3  John    does not read any book yet
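If you are on pandas 1.1 or newer, sort_values also accepts a key callable, which keeps everything in one step. A sketch of the same ranking (the helper name status_rank is mine):

import numpy as np
import pandas as pd

def status_rank(s: pd.Series) -> pd.Series:
    # extract x from "must read x more books"; 'ok' and the
    # "does not read" status get fixed ranks below/above every x
    rank = s.str.extract(r'(\d+)', expand=False).astype(float)
    rank = rank.mask(s.eq('ok'), -1)
    return rank.mask(s.eq('does not read any book yet'), np.inf)

df_sorted = df.sort_values('status', key=status_rank)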

Related

Creating list based on pandas data frame conditions - Python

I have a table:

Player  Team  GS
Jack    A     NaN
John    B     1
Mike    A     1
James   A     1
And I would like to make 2 separate lists (TeamA & TeamB) so that the players are split by team, filtered so that only players with a 1 in the GS column are included. The final lists would look like:
TeamA = Mike, James
TeamB = John
In this case, Jack was excluded from the TeamA list because he did not have a 1 value in the GS column.
Any direction would help. Thanks!
You can use:
out = (df.loc[df['GS'].eq(1)]                  # filter rows
         .groupby('Team')['Player'].agg(list)  # aggregate as lists
         .to_dict()                            # convert to dict
      )
output:
{'A': ['Mike', 'James'], 'B': ['John']}
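If you specifically want the two named lists, you can then unpack the dict; using .get means a team with no qualifying players simply yields an empty list:

TeamA = out.get('A', [])   # ['Mike', 'James']
TeamB = out.get('B', [])   # ['John']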
Think of this as "filtering" the data based on certain conditions.
dataframe = pd.DataFrame(...)
# Create a mask that will be used to select team A rows
mask_team_a = dataframe['Team'] == 'A'
# Create a second mask for the GS filter
mask_gs = dataframe['GS'] == 1
# Use the .loc accessor to get the rows by combining masks with '&'
team_a_df = dataframe.loc[mask_team_a & mask_gs, :]
# You can use the same masks, but use the '~' to say 'not team A'
team_b_df = dataframe.loc[(~mask_team_a) & mask_gs, :]
team_a_list = list(team_a_df['Player'])
team_b_list = list(team_b_df['Player'])
This might be a bit verbose, but it allows for the most flexibility in the future if you need to tweak your selections.
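For the sample table this gives:

print(team_a_list)   # ['Mike', 'James']
print(team_b_list)   # ['John']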

how to get a single value (only) from a dataframe in Python

I have dataframe df_my that looks like this:

   id  name      age  major
0   1  Mark      34   English
1   2  Tom       55   Art
2   3  Peter     31   Science
3   4  Mohammad  23   Math
4   5  Mike      47   Art
...
I am trying to get the value of major (only)
I used this and it works fine when I know the id of the record
df_my["major"][3]
returns
"Math"
great
but I want to get the major for a variable record
I used
i = 3
df_my.loc[df_my["id"]==i]["major"]
and also used
i = 3
df_my[df_my["id"]==i]["major"]
but they both return
3 Math
it includes the record index too
how can I get the major only and nothing else?
You could use squeeze:
i = 3
out = df_my.loc[df_my['id'] == i, 'major'].squeeze()
Another option is iat:
out = df_my.loc[df_my['id'] == i, 'major'].iat[0]
Output:
'Science'
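One caveat: squeeze only returns a scalar when exactly one row matches; with several matches it returns the Series unchanged, and iat[0] silently takes the first hit (and raises IndexError when nothing matches at all). A defensive sketch, if ids may be missing or duplicated:

matches = df_my.loc[df_my['id'] == i, 'major']
if len(matches) == 1:
    out = matches.iat[0]   # exactly one hit: a plain scalar
elif matches.empty:
    raise KeyError(f'no row with id == {i}')
else:
    raise ValueError(f'id == {i} matches several rows')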
I also stumbled over this problem, from a little different angle:
df = pd.DataFrame({'First Name': ['Kumar'],
                   'Last Name': ['Ram'],
                   'Country': ['India'],
                   'num_var': 1})
>>> df.loc[(df['First Name'] == 'Kumar'), "num_var"]
0 1
Name: num_var, dtype: int64
>>> type(df.loc[(df['First Name'] == 'Kumar'), "num_var"])
<class 'pandas.core.series.Series'>
So it returns a Series (albeit one with only a single element). If you access through the index instead, you receive the bare integer.
df.loc[0, "num_var"]
1
type(df.loc[0, "num_var"])
<class 'numpy.int64'>
The answer on how to select the single value was already given above. However, it is interesting to note that accessing through an index always gives a single value, whereas accessing through a condition returns a Series. This is because access by index clearly identifies exactly one row, whereas a condition can match several rows.
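Following that logic, squeeze() bridges the two: it collapses a one-element Series to a scalar but leaves longer Series untouched. On the same one-row frame:

>>> df.loc[(df['First Name'] == 'Kumar'), "num_var"].squeeze()
1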
If one of the columns of your dataframe is the natural primary index for those data, then it's usually a good idea to make pandas aware of it by setting the index accordingly:
df_my.set_index('id', inplace=True)
Now you can easily get just the major value for any id value i:
df_my.loc[i, 'major']
Note that for i = 3, the output is 'Science', which is expected, as noted in the comments to your question above.
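As a side note, once the frame is indexed by id, DataFrame.at is the purpose-built accessor for a single scalar and is a bit faster than loc for this:

df_my.at[i, 'major']   # 'Science' for i = 3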

Lookup a value and if it is present in another df, return text in two new columns

I have gotten to a point with the following script but wasn't able to take it all the way, even with functions such as isin, possibly because I have never used it before.
Input df1:
   ID      Alternative ID
0  152503  009372
1  249774  249774
2  062005  196582
3  185704  185704
4  081231  081231
5  081231  062085
6  912568  222416
7  196782  195122
Input df2:
   New_ID
0  498109
1  081231
2  231051
3  062005
4  152503
5  967272
6  875612
The idea is that I want to check whether the values of ID match the ones in Alternative ID in df1. If they do, it should return Match & Correct in two new columns named Result_1 & Result_2 respectively. For the ones that do not match, look up whether they are present AT ALL in the New_ID column of df2. If they are, return NEW Match & Good in those two columns respectively. If they are not there, return Not Match & Error.
For the first part of this task, this is the code that I used:
def compl(df1):
    if (df1['ID'] == df1['Alternative ID']):
        return 'Match', 'Correct'
    elif (df1['ID'] != df1['Alternative ID']):
        ...

Here I cannot find the next step, to basically check whether the values that didn't match are present in df2, etc.

df1[['Result_1', 'Result_2']] = df1.apply(compl, axis=1, result_type='expand')
Desirable output ->

   ID      Alternative ID  Result_1   Result_2
0  152503  009372          NEW Match  Good
1  249774  249774          Match      Correct
2  062005  196582          NEW Match  Good
3  185704  185704          Match      Correct
4  081231  062085          Match      Correct
5  912568  222416          Not Match  Error
6  196782  195122          Not Match  Error
Any suggestions/approaches would be greatly appreciated
Use np.select with your conditions and desired values. np.select evaluates the conditions in order and, for each row, returns the value paired with the first condition that holds, falling back to the default when none does.
import numpy as np

conditions = [
    df1['ID'] == df1['Alternative ID'],
    df1['ID'].isin(df2['New_ID'])
]
values_result1 = ['Match', 'New match']
values_result2 = ['Correct', 'Good']

df1['Result_1'] = np.select(conditions, values_result1, 'No match')
df1['Result_2'] = np.select(conditions, values_result2, 'Error')
Output

   ID      Alternative ID  Result_1   Result_2
0  152503  9372            New match  Good
1  249774  249774          Match      Correct
2  62005   196582          New match  Good
3  185704  185704          Match      Correct
4  81231   81231           Match      Correct
5  81231   62085           New match  Good
6  912568  222416          No match   Error
7  196782  195122          No match   Error
Note: your approach, with a few more tweaks, would work, but it would be a lot slower than the vectorized approach above. Try to refrain from using apply unless there is no other way.
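One more detail, since the output above dropped the leading zeros (009372 became 9372): the ID columns were evidently parsed as integers. If the data comes from CSV files and those zeros matter, read the columns as strings, along these lines (the file names are placeholders):

df1 = pd.read_csv('df1.csv', dtype={'ID': str, 'Alternative ID': str})
df2 = pd.read_csv('df2.csv', dtype={'New_ID': str})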

looping through values based on a condition

I am trying to create a function in Python that checks whether the data in the dataframe follows a certain structure. In my case I need to ensure that the id column is structured like ID0101-10.
Here is my code, but it is not working; I keep getting an indexing error:
i = 0
for i in df["id"]:
    if ('-' in df["id"]):
        df["id"].iloc[i] = df["id"].iloc[i]
        i += 1
    else:
        df.drop(df["id"].iloc[i])
        i += 1
If you're curious about my data, it's like this:

id         name
ID0101-10  John
ID0101-11  Mary
8454       Test
MMMM       MMMM
ID0101-01  Ben
MN87876    00.00
I am trying to clean my data by dropping the dummy values.
EDIT: I get this error
TypeError: Cannot index by location index with a non-integer key
Any help is appreciated thanks
If I understand correctly, you can do this:
import pandas as pd
df = pd.DataFrame({'id': ['ID0101-10', 'ID0101-11', '8454', 'MMMM', 'ID0101-01', 'MN87876'],
                   'name': ['John', 'Mary', 'Test', 'MMMM', 'Ben', '00.00']})
result = df[df['id'].str.startswith('ID0101-')]
print(result)
Output:
   id         name
0  ID0101-10  John
1  ID0101-11  Mary
4  ID0101-01  Ben
As a general rule, you rarely need to loop over pandas dataframes, it's almost always faster to use native pandas functions.
For more complex matches you can use regular expressions: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.match.html
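For instance, to keep only ids shaped exactly like ID0101-10 (ID, four digits, a dash, two digits), something along these lines should work; na=False makes missing or non-string ids fail the match:

result = df[df['id'].str.match(r'ID\d{4}-\d{2}$', na=False)]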

Make more than one column based on an existing one

I currently have a column with data I want to parse and then put into other columns. Currently the best I can get is with the apply method:
def parse_parent_names(row):
    split = row.person_with_parent_names.split('|')[2:-1]
    return split

df['parsed'] = train_data.apply(parse_parent_names, axis=1).head()
The data is a pandas df with a column that has names separated by a pipe (|):
'person_with_parent_names'
|John|Doe|Bobba|
|Fett|Bobba|
|Abe|Bea|Cosby|
The rightmost one is the person and the leftmost the "grandest parent". I'd like to transform this into three columns, like:
grandfather  father  person
John         Doe     Bobba
             Fett    Bobba
Abe          Bea     Cosby
But with apply, the best I can get is
'parsed'
[John, Doe, Bobba]
[Fett, Bobba]
[Abe, Bea, Cosby]
I could use apply three times, but it would not be efficient to read the entire dataset three times.
Your function should be changed to compare the number of | characters and split with a ternary expression, and finally pass the result to the DataFrame constructor:
def parse_parent_names(row):
    m = row.count('|') == 4
    split = row.split('|')[1:-1] if m else row.split('|')[:-1]
    return split

cols = ['grandfather', 'father', 'person']
df1 = pd.DataFrame([parse_parent_names(x) for x in df.person_with_parent_names],
                   columns=cols)
print(df1)
  grandfather father person
0        John    Doe  Bobba
1                Fett  Bobba
2         Abe    Bea  Cosby
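An alternative sketch that avoids apply: strip the outer pipes, split once, and left-pad each list so the person always lands in the last column (the padding still loops once in Python):

parts = df['person_with_parent_names'].str.strip('|').str.split('|')
df1 = pd.DataFrame([[''] * (3 - len(p)) + p for p in parts],
                   columns=['grandfather', 'father', 'person'])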
