I currently have a column with data I want to parse and then put into other columns. Currently the best I can get is with the apply method:
def parse_parent_names(row):
    split = row.person_with_parent_names.split('|')[1:-1]
    return split

df['parsed'] = train_data.apply(parse_parent_names, axis=1).head()
The data is a pandas df with a column that has names separated by a pipe (|):
'person_with_parent_names'
|John|Doe|Bobba|
|Fett|Bobba|
|Abe|Bea|Cosby|
The rightmost name is the person and the leftmost is the "grandest parent". I'd like to transform this into three columns, like:
'grandfather' 'father' 'person'
 John          Doe      Bobba
               Fett     Bobba
 Abe           Bea      Cosby
But with apply, the best I can get is
'parsed'
[John, Doe, Bobba]
[Fett, Bobba]
[Abe, Bea, Cosby]
I could use apply three times, but it would not be efficient to read the entire dataset three times.
Change your function to count the number of pipes and pick the split slice with a ternary expression, then pass the results to the DataFrame constructor:
def parse_parent_names(row):
    # a full row like '|John|Doe|Bobba|' contains four pipes
    m = row.count('|') == 4
    # keep the leading '' as a placeholder for the missing grandfather
    # when only two names are present
    split = row.split('|')[1:-1] if m else row.split('|')[:-1]
    return split

cols = ['grandfather', 'father', 'person']
df1 = pd.DataFrame([parse_parent_names(x) for x in df.person_with_parent_names],
                   columns=cols)
print (df1)
  grandfather father person
0        John    Doe  Bobba
1                Fett  Bobba
2         Abe    Bea  Cosby
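An alternative sketch (hedged, not part of the answer above): strip the outer pipes, split, and left-pad the shorter rows so the person always lands in the last column:

names = df['person_with_parent_names'].str.strip('|').str.split('|')
df1 = pd.DataFrame([[''] * (3 - len(n)) + n for n in names], columns=cols)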
I have a table:
Player   Team   GS
Jack     A      NaN
John     B      1
Mike     A      1
James    A      1
And would like to make two separate lists (TeamA & TeamB) so that the players are split by team, keeping only the players that have a 1 in the GS column. The final lists would look like:
TeamA = Mike, James
TeamB = John
In this case, Jack was excluded from the TeamA list because he did not have a 1 value in the GS column.
Any direction would help. Thanks!
You can use:
out = (df.loc[df['GS'].eq(1)]                  # filter rows
         .groupby('Team')['Player'].agg(list)  # aggregate as lists
         .to_dict()                            # convert to dict
       )
output:
{'A': ['Mike', 'James'], 'B': ['John']}
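If you specifically want the two named lists from the question, you can then unpack the dict; using .get with a default guards against a team that has no qualifying players:

TeamA = out.get('A', [])
TeamB = out.get('B', [])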
Think of this as "filtering" the data based on certain conditions.
dataframe = pd.DataFrame(...)
# Create a mask that will be used to select team A rows
mask_team_a = dataframe['Team'] == 'A'
# Create a second mask for the GS filter
mask_gs = dataframe['GS'] == 1
# Use the .loc accessor to get the rows by combining masks with '&'
team_a_df = dataframe.loc[mask_team_a & mask_gs, :]
# You can use the same masks, but use the '~' to say 'not team A'
team_b_df = dataframe.loc[(~mask_team_a) & mask_gs, :]
team_a_list = list(team_a_df['Player'])
team_b_list = list(team_b_df['Player'])
This might be a bit verbose, but it allows for the most flexibility in the future if you need to tweak your selections.
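For the sample table above, the lists come out as:

print(team_a_list)  # ['Mike', 'James']
print(team_b_list)  # ['John']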
I have a Python Pandas Dataframe, in which a column named status contains three kinds of possible values: ok, must read x more books, does not read any books yet, where x is an integer higher than 0.
I want to sort status values according to the order above.
Example:
name status
0 Paul ok
1 Jean must read 1 more books
2 Robert must read 2 more books
3 John does not read any book yet
I've found some interesting hints using pandas Categorical and map, but I don't know how to deal with the variable values modifying the strings.
How can I achieve that?
Use:
a = df['status'].str.extract(r'(\d+)', expand=False).astype(float)
d = {'ok': a.max() + 1, 'does not read any book yet':-1}
df1 = df.iloc[(-df['status'].map(d).fillna(a)).argsort()]
print (df1)
name status
0 Paul ok
2 Robert must read 2 more books
1 Jean must read 1 more books
3 John does not read any book yet
Explanation:
First extract the integers with the regex \d+.
Then dynamically create a dictionary to map the non-numeric values.
Replace the NaNs with fillna from the numeric Series.
Get the positions with argsort.
Select the sorted values with iloc.
You can use sorted with a custom key function to compute the indices that would sort the array (much like numpy.argsort), then feed them to pd.DataFrame.iloc:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Paul', 'Jean', 'Robert', 'John'],
                   'status': ['ok', 'must read 20 more books',
                              'must read 3 more books',
                              'does not read any book yet']})

def sort_key(x):
    if x[1] == 'ok':
        return -1
    elif x[1] == 'does not read any book yet':
        return np.inf
    else:
        return int(x[1].split()[2])

idx = [idx for idx, _ in sorted(enumerate(df['status']), key=sort_key)]
df = df.iloc[idx, :]
print(df)
name status
0 Paul ok
2 Robert must read 3 more books
1 Jean must read 20 more books
3 John does not read any book yet
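As a side note, pandas 1.1+ supports a key argument to sort_values, which makes this kind of custom ordering more direct. A minimal sketch (status_rank is a hypothetical helper; like the answer above, it puts smaller x first):

import re

def status_rank(s):
    # 'ok' sorts first, 'does not read any book yet' last, numbered statuses in between
    if s == 'ok':
        return float('-inf')
    if s == 'does not read any book yet':
        return float('inf')
    return int(re.search(r'\d+', s).group())

df_sorted = df.sort_values('status', key=lambda col: col.map(status_rank))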
I have a dataframe df of thousands of items where the value of the column "group" repeats from two to ten times. The dataframe has seven columns, one of them is named "url"; another one "flag". All of them are strings.
I would like to use Pandas in order to traverse through these groups. For each group I would like to find the longest item in the "url" column and store a "0" or "1" in the "flag" column that corresponds to that item. I have tried the following but I cannot make it work. I would like to 1) get rid of the loop below, and 2) be able to compare all items in the group through df.apply(...)
all_groups = df["group"].drop_duplicates().tolist()
for item in all_groups:
    df[df["group"] == item].apply(lambda x: ...)  # here I would like to compare the items within one group
Can apply() and lambda be used in this context? Any faster way to implement this?
Thank you!
Using groupby() and .transform() you could do something like:
df['flag'] = df.groupby('group')['url'].transform(lambda x: x.str.len() == x.map(len).max())
Which provides a boolean value for df['flag']. If you need it as 0, 1 then just add .astype(int) to the end.
Unless you write code and find it's running slowly, don't sweat optimizing it. In the words of Donald Knuth, "Premature optimization is the root of all evil."
If you want to use apply and lambda (as mentioned in the question):
df = pd.DataFrame({'url': ['abc', 'de', 'fghi', 'jkl', 'm'], 'group': list('aaabb'), 'flag': 0})
Looks like:
flag group url
0 0 a abc
1 0 a de
2 0 a fghi
3 0 b jkl
4 0 b m
Then figure out which elements should have their flag variable set.
indices = df.groupby('group')['url'].apply(lambda s: s.str.len().idxmax())
df.loc[indices, 'flag'] = 1
Note this only gets the first url with maximal length. You can compare the url lengths to the maximum if you want different behavior.
So df now looks like:
flag group url
0 0 a abc
1 0 a de
2 1 a fghi
3 1 b jkl
4 0 b m
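For the "different behavior" mentioned above, a minimal sketch that flags every url tying for the group maximum instead of only the first:

max_len = df.groupby('group')['url'].transform(lambda s: s.str.len().max())
df['flag'] = (df['url'].str.len() == max_len).astype(int)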
I have data that looks like:
Identifier Category1 Category2 Category3 Category4 Category5
1000 foo bat 678 a.x ld
1000 foo bat 78 l.o op
1000 coo cat 678 p.o kt
1001 coo sat 89 a.x hd
1001 foo bat 78 l.o op
1002 foo bat 678 a.x ld
1002 foo bat 78 l.o op
1002 coo cat 678 p.o kt
What I am trying to do is compare 1000 to 1001 and to 1002 and so on. The output I want the code to give is: 1000 is the same as 1002. So, the approach I wanted to use was:
First, group all the identifier items into separate dataframes (maybe?). For example, df1 would be all rows pertaining to identifier 1000 and df2 would be all rows pertaining to identifier 1002. (Please note that I want the code to do this itself, as there are millions of rows, as opposed to me writing code to manually compare identifiers.) I have tried using the groupby feature of pandas; it does the grouping well, but then I do not know how to compare the groups.
Compare each of the groups/sub-data frames.
One method I was thinking of was reading each row of a particular identifier into an array/vector and comparing arrays/vectors using a comparison metric (Manhattan distance, cosine similarity etc).
Any help is appreciated, I am very new to Python. Thanks in advance!
You could do something like the following:
import pandas as pd

input_file = pd.read_csv("input.csv")
columns = ['Category1', 'Category2', 'Category3', 'Category4', 'Category5']
duplicate_entries = {}

for group in input_file.groupby('Identifier'):
    # transforming to tuples so that they can be used as keys in a dict
    lines = [tuple(y) for y in group[1].loc[:, columns].values.tolist()]
    key = tuple(lines)
    if key not in duplicate_entries:
        duplicate_entries[key] = []
    duplicate_entries[key].append(group[0])
Then the duplicate_entries values will have the list of duplicate Identifiers
duplicate_entries.values()
> [[1000, 1002], [1001]]
EDIT:
To get only the entries that have duplicates, you could have something like:
all_dup = [dup for dup in duplicate_entries.values() if len(dup) > 1]
Explaining the indices (sorry I didn't explain it before): iterating through the df.groupby outcome gives a tuple where the first entry is the key of the group (in this case an 'Identifier') and the second one is the sub-DataFrame of the grouped rows. So to get the lines that contain the entries we use group[1], and the 'Identifier' for that group is found at group[0]. Because we want the identifier of that entry in the duplicate_entries dict, group[0] gets us that.
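As a readability aside, tuple unpacking in the for statement avoids the group[0]/group[1] indexing; a sketch of the same loop:

for identifier, rows in input_file.groupby('Identifier'):
    lines = [tuple(y) for y in rows.loc[:, columns].values.tolist()]
    duplicate_entries.setdefault(tuple(lines), []).append(identifier)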
We could separate into groups with groupby, then sort all groups (so we can detect equal groups even when their rows are in a different order) by all columns except "Identifier", and compare the groups:
Suppose that columns = ["Identifier", "Category1", "Category2", "Category3", "Category4", "Category5"]
We can do:
groups = []
pure_groups = []
for name, group in df.groupby("Identifier"):
    pure_groups += [group]
    g_idfless = group[group.columns.difference(["Identifier"])]
    groups += [g_idfless.sort_values(columns[1:]).reset_index().drop("index", axis=1)]
And to compare them:
for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        id1 = str(pure_groups[i]["Identifier"].iloc[0])
        id2 = str(pure_groups[j]["Identifier"].iloc[0])
        print(id1 + " and " + id2 + " equal?: " + str(groups[i].equals(groups[j])))
#-->1000 and 1001 equal?: False
#-->1000 and 1002 equal?: True
#-->1001 and 1002 equal?: False
EDIT: Added code to print the identifiers of the groups that match
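With millions of rows, the pairwise loop above grows quadratically. A hedged sketch of a linear alternative (canonical_key is a hypothetical helper, not from either answer): reduce each Identifier's rows to a canonical, hashable key and group identifiers by it:

value_cols = ["Category1", "Category2", "Category3", "Category4", "Category5"]

def canonical_key(g):
    # sort the rows so groups match even when their rows appear in a different order
    return tuple(sorted(map(tuple, g[value_cols].values.tolist())))

keys = df.groupby("Identifier").apply(canonical_key)
for key, ids in keys.groupby(keys).groups.items():
    if len(ids) > 1:
        print(list(ids), "have identical rows")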
Apologies for the messy title. The problem is as follows:
I have some data frame of the form:
df1 =
Entries
0 "A Level"
1 "GCSE"
2 "BSC"
I also have a data frame of the form:
df2 =
Secondary Undergrad
0 "A Level" "BSC"
1 "GCSE" "BA"
2 "AS Level" "MSc"
I have a function which searches each entry in df1, looking for the words in each column of df2. The words that match are saved (Words_Present):
def word_search(df, group, words):
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    return Attribute
I apply this function over all entries in df1, and all columns in df2, using the following iteration:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x, i, terms))
This yields an output df which looks something like:
df1 =
Entries Secondary undergrad
0 "A Level" 1 0
1 "GCSE" 1 0
2 "AS Level" 1 0
I want to amend the word_search function to output the Words_Present list as well as the Attribute, and input these into a new column, so that my eventual df1 array looks like:
Desired dataframe:
Entries Secondary Words Found undergrad Words Found
0 "A Level" 1 "A Level" 0
1 "GCSE" 1 "GCSE" 0
2 "AS Level" 1 "AS Level" 0
If I do:
def word_search(df, group, words):
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    if Yes == 0:
        Attribute = 0
    return Attribute, Words_Present
My function therefore now has multiple outputs. So applying the following:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x, i, terms))
My output looks like this:
Entries Secondary undergrad
0 "A Level" [1,"A Level"] 0
1 "GCSE" [1, "GCSE"] 0
2 "AS Level" [1, "AS Level"] 0
The output of pd.apply() is always a pandas series, so it just shoves everything into the single cell of df[i] where i = secondary.
Is it possible to split the output of .apply into two separate columns, as shown in the desired dataframe?
I have consulted many questions, but none seem to deal directly with yielding multiple columns when the function contained within the apply statement has multiple outputs:
Applying function with multiple arguments to create a new pandas column
Create multiple columns in Pandas Dataframe from one function
Apply pandas function to column to create multiple new columns?
For example, I have also tried:
for i in df2:
    terms = df2[i].values.tolist()
    [df1[i], df1[i] + "Present"] = pd.concat([df1['course'][0:1].apply(lambda x: word_search(x, i, terms))])
but this simply yields errors such as:
raise ValueError('Length of values does not match length of ' 'index')
Is there a way to use apply, but still extract the extra information directly into multiple columns?
Many thanks, apologies for the length.
The direct answer to your question is yes: use the apply method of the DataFrame object, so you'd be doing df1.apply().
However, for this problem, and anything in pandas in general, try to vectorise rather than iterate through rows -- it's faster and cleaner.
It looks like you are trying to classify Entries into Secondary or Undergrad, and saving the keyword used to make the match. If you assume that each element of Entries has no more than one keyword match (i.e. you won't run into 'GCSE A Level'), you can do the following:
df = df1.copy()
df['secondary_words_found'] = df.Entries.str.extract('(A Level|GCSE|AS Level)')
df['undergrad_words_found'] = df.Entries.str.extract('(BSC|BA|MSc)')
df['secondary'] = df.secondary_words_found.notnull() * 1
df['undergrad'] = df.undergrad_words_found.notnull() * 1
EDIT:
In response to your issue with having many more categories and keywords, you can continue in the spirit of this solution by using an appropriate for loop and doing '(' + '|'.join(df2['Undergrad'].values) + ')' inside the extract method.
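A minimal sketch of that loop (the lower-cased column names are my own choice, and re.escape guards keywords that contain regex metacharacters):

import re

for col in df2.columns:
    pattern = '(' + '|'.join(map(re.escape, df2[col].dropna())) + ')'
    df[col.lower() + '_words_found'] = df.Entries.str.extract(pattern, expand=False)
    df[col.lower()] = df[col.lower() + '_words_found'].notnull() * 1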
However, if you have exact matches, you can do everything by a combination of pivots and joins:
keywords = df2.stack().to_frame('Entries').reset_index().drop('level_0', axis=1).rename(columns={'level_1': 'category'})
df = df1.merge(keywords, how='left')
for colname in df.category.dropna().unique():
    df[colname] = (df.category == colname) * 1  # your indicator variable
    df.loc[df.category == colname, colname + '_words_found'] = df.loc[df.category == colname, 'Entries']
The first line 'pivots' your table of keywords into a 2-column dataframe of keywords and categories. Your keyword column must be the same as the column in df1; in SQL, this would be called the foreign key that you are going to join these tables on.
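For the df2 in the question, keywords comes out like this (derived from the sample data above):

print(keywords)
#     category   Entries
# 0  Secondary   A Level
# 1  Undergrad       BSC
# 2  Secondary      GCSE
# 3  Undergrad        BA
# 4  Secondary  AS Level
# 5  Undergrad       MSc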
Also, you generally want to avoid having duplicate indexes or columns, which in your case was Words Found in the desired dataframe!
For the sake of completeness, if you insisted on using the apply method, you would iterate over each row of the DataFrame; your code would look something like this:
secondary_words = df2.Secondary.values
undergrad_words = df2.Undergrad.values

def classify(s):
    if s.Entries in secondary_words:
        return pd.Series({'Entries': s.Entries, 'Secondary': 1, 'secondary_words_found': s.Entries, 'Undergrad': 0, 'undergrad_words_found': ''})
    elif s.Entries in undergrad_words:
        return pd.Series({'Entries': s.Entries, 'Secondary': 0, 'secondary_words_found': '', 'Undergrad': 1, 'undergrad_words_found': s.Entries})
    else:
        return pd.Series({'Entries': s.Entries, 'Secondary': 0, 'secondary_words_found': '', 'Undergrad': 0, 'undergrad_words_found': ''})

df_out = df1.apply(classify, axis=1)
This second version will only work in the cases you want it to if the element in Entries is exactly the same as its corresponding element in df2. You certainly don't want to do this, as it's messier, and will be noticeably slower if you have a lot of data to work with.