I've looked into agg/apply/transform after groupby, but none of them seem to meet my need.
Here is an example DF:
import pandas as pd

df_seq = pd.DataFrame({
    'person': ['Tom', 'Tom', 'Tom', 'Lucy', 'Lucy', 'Lucy'],
    'day': [1, 2, 3, 1, 4, 6],
    'food': ['beef', 'lamb', 'chicken', 'fish', 'pork', 'venison']
})
person,day,food
Tom,1,beef
Tom,2,lamb
Tom,3,chicken
Lucy,1,fish
Lucy,4,pork
Lucy,6,venison
The day column shows that, for each person, the food is consumed in sequential order.
Now I would like to group by the person column and create a DataFrame that contains food pairs for two neighboring days/times (as shown below).
Note the day column is for example purposes only, so its values should not be used; it just indicates that the food column is in sequential order. In my real data, it's a datetime column.
person,day,food,food_next
Tom,1,beef,lamb
Tom,2,lamb,chicken
Lucy,1,fish,pork
Lucy,4,pork,venison
At the moment, I can only do this with a for-loop to iterate through all users. It's very slow.
Is it possible to use a groupby and apply/transform to achieve this, or any vectorized operations?
Create the new column with DataFrameGroupBy.shift and then remove rows with missing values in food_next with DataFrame.dropna:
df = (df_seq.assign(food_next = df_seq.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))
print(df)
person day food food_next
0 Tom 1 beef lamb
1 Tom 2 lamb chicken
3 Lucy 1 fish pork
4 Lucy 4 pork venison
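Since the real data is ordered by a datetime column rather than day, it's worth making sure the frame is sorted within each person before the shift. A small sketch, assuming the column is called date_time (a hypothetical name):
# Sort by person and the (hypothetical) date_time column so each person's rows
# are in chronological order, then pair each food with the next one.
df = (df_seq.sort_values(['person', 'date_time'])
            .assign(food_next=lambda d: d.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))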
This might be a slightly patchy answer, and it doesn't perform an aggregation in the standard sense.
First, a small querying function that, given a name and a day, will return the first result (assuming the data is pre-sorted) that matches the parameters, and failing that, returns some default value:
def get_next_food(df, person, day):
    results = df.query(f"`person` == '{person}' and `day` > {day}")
    if len(results) > 0:
        return results.iloc[0]['food']
    else:
        return "Mystery"
You can use this as follows:
get_next_food(df_seq, "Tom", 1)
> 'lamb'
Now we can use this in an apply statement to populate a new column with the results of this function applied row-wise:
df_seq['next_food'] = df_seq.apply(lambda x: get_next_food(df_seq, x['person'], x['day']), axis=1)
>
person day food next_food
0 Tom 1 beef lamb
1 Tom 2 lamb chicken
2 Tom 3 chicken Mystery
3 Lucy 1 fish pork
4 Lucy 4 pork venison
5 Lucy 6 venison Mystery
Give it a try; I'm not convinced you'll see a vast performance improvement, but it'd be interesting to find out.
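If you do want to find out, a rough timing comparison on the toy frame could look like the sketch below (on real data the groupby/shift approach from the other answer should scale much better, since the apply re-queries the whole frame for every row):
import timeit

# Row-wise apply with the querying function vs. the vectorised groupby/shift.
t_apply = timeit.timeit(
    lambda: df_seq.apply(lambda x: get_next_food(df_seq, x['person'], x['day']), axis=1),
    number=100)
t_shift = timeit.timeit(
    lambda: df_seq.groupby('person')['food'].shift(-1),
    number=100)
print(t_apply, t_shift)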
In my dataframe I have a Name column, and I want to split it based on some words.
Dataframe (dff):
id name
1 Midian Almeida(Last)
2 Robert(ASA)(first)
3 Nikole John (middle)
4 Nikole John (first)
5 Raça Negra (last)
I want to split them based on first, last, middle.
I tried the following:
dff['name'].str.split('(first)|(last)|(middle)', expand=True).add_prefix('name_')
It gives the below output:
name_0
Midian Almeida
Robert(ASA)
Nikole John
Nikole John
Raça Negra
but I want to put the split words in another column.
desired output is:
id name split option
1 Midian Almeida (Last)
2 Robert(ASA) (first)
3 Nikole John (middle)
4 Nikole John (first)
5 Raça Negra (last)
How can I do that?
This contains what you need: Pandas split on regex.
The following should work:
dff.name.str.split(r'(?i)(\(last\)|\(first\)|\(middle\))', expand=True)[[0, 1]]
The reason you need the regex is that you need the capture group, in this case the parentheses around the whole matching string (the (?i) at the start just makes the match case-insensitive, since your data contains both "(Last)" and "(last)"). If you want to play around with regex to get a better feel for it, you can use the following: https://regex101.com/
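If the goal is the two columns from the desired output (the cleaned name and the split option) in one go, str.extract is another option. A small sketch; the group name base and the new column name "split option" are just choices to match the question's output:
import re

# Pull the trailing "(first)/(last)/(middle)" marker into its own column;
# flags=re.IGNORECASE covers "(Last)" as well as "(last)".
parts = dff['name'].str.extract(
    r'^(?P<base>.*?)\s*(?P<split_option>\((?:first|last|middle)\))\s*$',
    flags=re.IGNORECASE)
dff['split option'] = parts['split_option']
dff['name'] = parts['base']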
I am trying to get a daily status count from the following DataFrame (it's a subset, the real data set is ~14k jobs with overlapping dates, only one status at any given time within a job):
Job Status User
Date / Time
1/24/2011 10:58:04 1 A Ted
1/24/2011 10:59:20 1 C Bill
2/11/2011 6:53:14 1 A Ted
2/11/2011 6:53:23 1 B Max
2/15/2011 9:43:13 1 C Bill
2/21/2011 15:24:42 1 F Jim
3/2/2011 15:55:22 1 G Phil Jr.
3/4/2011 14:57:45 1 H Ted
3/7/2011 14:11:02 1 I Jim
3/9/2011 9:57:34 1 J Tim
8/18/2014 11:59:35 2 A Ted
8/18/2014 13:56:21 2 F Bill
5/21/2015 9:30:30 2 G Jim
6/5/2015 13:17:54 2 H Jim
6/5/2015 14:40:38 2 I Ted
6/9/2015 10:39:15 2 J Tom
1/16/2015 7:45:58 3 A Phil Jr.
1/16/2015 7:48:23 3 C Jim
3/6/2015 14:09:42 3 A Bill
3/11/2015 11:16:04 3 K Jim
My initial thought (from the following link) was to groupby the job column, fill in the missing dates for each group and then ffill the statuses down.
Pandas reindex dates in Groupby
I was able to make this work... kinda... if two statuses occurred on the same date, one would not be included in the output, and consequently some statuses were missing.
I then found the following; it supposedly handles the duplicate issue, but I am unable to get it to work with my data.
Efficiently re-indexing one level with "forward-fill" in a multi-index dataframe
Am I on the right path in thinking that filling in the missing dates and then forward-filling the statuses down is the correct way to ultimately capture daily counts of individual statuses? Is there another method that makes better use of pandas features that I'm missing?
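For what it's worth, here is a minimal sketch of the approach described above, assuming the frame is called df, the Date / Time index is already parsed as a DatetimeIndex, and that keeping the last status recorded per job per day is an acceptable way to resolve the same-day duplicates:
# One status per job per day (last entry wins), expanded to a daily range per
# job by resample, then forward-filled across days with no entries.
daily_status = (
    df.groupby('Job')['Status']
      .resample('D')
      .last()
      .groupby(level='Job')
      .ffill()
)

# Daily count of jobs in each status.
counts = (daily_status
          .reset_index()
          .groupby(['Date / Time', 'Status'])
          .size()
          .unstack(fill_value=0))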
So my google-fu doesn't seem to be doing me justice with what seems like it should be a trivial procedure.
In pandas for Python I have 2 datasets that I want to merge. This works fine using .concat. The issue is that .concat reorders my columns. From a data retrieval point of view, this is trivial; from an "I just want to open the file and quickly see the most important column" point of view, this is annoying.
File1.csv
Name Username Alias1
Tom Tomfoolery TJZ
Meryl MsMeryl Mer
Timmy Midsize Yoda
File2.csv
Name Username Alias 1 Alias 2
Bob Firedbob Fire Gingy
Tom Tomfoolery TJZ Awww
Result.csv
Alias1 Alias2 Name Username
0 TJZ NaN Tom Tomfoolery
1 Mer NaN Meryl MsMeryl
2 Yoda NaN Timmy Midsize
0 Fire Gingy Bob Firedbob
1 TJZ Awww Tom Tomfoolery
The result is fine, but in the data file I'm working with I have 1,000 columns, and the 2-3 most important are now in the middle. Is there a way, in this toy example, I could have forced "Username" to be the first column and "Name" to be the second column, while obviously preserving the values below each all the way down?
Also, as a side note, when I save to file it also saves that numbering on the side (0 1 2 0 1). If there's a way to prevent that too, that'd be cool. If not, it's not a big deal since it's a quick fix to remove.
Thanks!
Assuming the concatenated DataFrame is df, you can perform the reordering of columns as follows:
important = ['Username', 'Name']
reordered = important + [c for c in df.columns if c not in important]
df = df[reordered]
print(df)
Output:
Username Name Alias1 Alias2
0 Tomfoolery Tom TJZ NaN
1 MsMeryl Meryl Mer NaN
2 Midsize Timmy Yoda NaN
0 Firedbob Bob Fire Gingy
1 Tomfoolery Tom TJZ Awww
The list of numbers [0, 1, 2, 0, 1] is the index of the DataFrame. To prevent it from being written to the output file, you can use the index=False option in to_csv():
df.to_csv('Result.csv', index=False, sep=' ')
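As a side note on the 0 1 2 0 1 numbering: another option is to have concat renumber the rows up front with ignore_index=True. A small sketch, assuming the two frames read from the files are called df1 and df2 (hypothetical names):
important = ['Username', 'Name']
df = pd.concat([df1, df2], ignore_index=True)   # fresh 0..n-1 index
df = df[important + [c for c in df.columns if c not in important]]
df.to_csv('Result.csv', index=False)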
I have 2 dataframes, one of which has supplemental information for some (but not all) of the rows in the other.
names = df({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],
'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = df({'names':['joe','mark','tim','frank'],
'classification':['thief','thief','good','thief']})
I would like to take the classification column from the info dataframe above and add it to the names dataframe above. However, when I do combined = pd.merge(names, info) the resulting dataframe is only 4 rows long. All of the rows that do not have supplemental info are dropped.
Ideally, I would have the values in those missing columns set to unknown, resulting in a dataframe where some people are thieves, some are good, and the rest are unknown.
EDIT:
One of the first answers I received suggested using an outer merge, which seems to do some weird things. Here is a code sample:
names = df({'names':['bob','frank','bob','bob','bob''james','tim','ricardo','mike','mark','joan','joe'],
'position':['dev','dev','dev','dev','dev','dev''sys','sys','sys','sup','sup','sup']})
info = df({'names':['joe','mark','tim','frank','joe','bill'],
'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna("unknown")
The strange thing is that in the output I'll get a row where the resulting name is "bobjames" and another where the position is "devsys". Finally, even though bill does not appear in the names dataframe, it shows up in the resulting dataframe. So I really need a way to say: look up a value in this other dataframe and, if you find something, tack on those columns.
In case you are still looking for an answer for this:
The "strange" things that you described are due to some minor errors in your code. For example, the first (appearance of "bobjames" and "devsys") is due to the fact that you don't have a comma between those two values in your source dataframes. And the second is because pandas doesn't care about the name of your dataframe but cares about the name of your columns when merging (you have a dataframe called "names" but also your columns are called "names"). Otherwise, it seems that the merge is doing exactly what you are looking for:
import pandas as pd
names = pd.DataFrame({'names':['bob','frank','bob','bob','bob', 'james','tim','ricardo','mike','mark','joan','joe'],
'position':['dev','dev','dev','dev','dev','dev', 'sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank','joe','bill'],
'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna('unknown', inplace=True)
which will result in:
names position classification
0 bob dev unknown
1 bob dev unknown
2 bob dev unknown
3 bob dev unknown
4 frank dev thief
5 james dev unknown
6 tim sys good
7 ricardo sys unknown
8 mike sys unknown
9 mark sup thief
10 joan sup unknown
11 joe sup thief
12 joe sup good
13 bill unknown thief
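If rows like bill, which exist only in the info dataframe, shouldn't appear at all, a left merge instead of an outer one keeps just the rows from names; something like:
# Keep every row from names; names that only exist in info (e.g. bill) are dropped.
what = pd.merge(names, info, how='left').fillna('unknown')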
I think you want to perform an outer merge:
In [60]:
pd.merge(names, info, how='outer')
Out[60]:
names position classification
0 bob dev NaN
1 frank dev thief
2 james dev NaN
3 tim sys good
4 ricardo sys NaN
5 mike sys NaN
6 mark sup thief
7 joan sup NaN
8 joe sup thief
There is a section showing the types of merge you can perform: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
The join function can also be used for an outer or inner join. In the case above, let's suppose that names is the main table (all rows from this table must occur in the result). Then, to run a left outer join, use:
what = names.set_index('names').join(info.set_index('names'), how='left')
or, with the missing values filled in:
what = names.set_index('names').join(info.set_index('names'), how='left').fillna("unknown")
The set_index calls are used to create a temporary index column (the same one in both tables). If the dataframes already contained such an index column, this step wouldn't be necessary. For example:
# define index when create dataframes
names = pd.DataFrame({'names':['bob',...],'position':['dev',...]}).set_index('names')
info = pd.DataFrame({'names':['joe',...],'classification':['thief',...]}).set_index('names')
what = names.join(info, how='left')
To perform other types of join, just change the how argument (left/right/inner/outer are allowed). More info here.
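If you'd rather end up with names as an ordinary column again (like the merge-based answers), reset_index can be chained onto the join; a small sketch:
what = (names.set_index('names')
             .join(info.set_index('names'), how='left')
             .fillna('unknown')
             .reset_index())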
Think of it as an SQL join operation. You need a left-outer join[1].
names = pd.DataFrame({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank'],'classification':['thief','thief','good','thief']})
Since there are names for which there is no classification, a left-outer join will do the job.
a = pd.merge(names, info, how='left', on='names')
The result is ...
>>> a
names position classification
0 bob dev NaN
1 frank dev thief
2 james dev NaN
3 tim sys good
4 ricardo sys NaN
5 mike sys NaN
6 mark sup thief
7 joan sup NaN
8 joe sup thief
... which is fine. All the NaN results are OK if you look at both tables.
Cheers!
[1] - http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging