Pivot table rank by Name(Index) and Title(Column) - python

I have a dataset that looks like this (the Count represents the number of times they worked):
Title    Name   Count
Coach    Bob     4
teacher  sam     5
driver   mark    8
Coach    tina   10
teacher  kate    3
driver   frank   2
I want to create a table, which I think will have to be a pivot, that sorts the names and titles by count of times worked. For example, the output would look like this:
coach     teacher   driver
tina 10   sam 5     mark 8
bob 4     kate 3    frank 2
I am familiar with general pivot table code, but I think I'm going to need something a little more comprehensive.
DF_PIV = pd.pivot_table(DF, values=['count'], index=['title', 'Name'], columns=['title'],
                        aggfunc=np.max)
I get the error ValueError: Grouper for 'view_title' not 1-dimensional, but I do not even think I am on the right track here.

You can try:
(df.set_index(['Title', df.groupby('Title').cumcount()])  # number rows within each Title
   .unstack(0)                                            # Titles become columns
   .astype(str)
   .T
   .groupby(level=1).agg(' '.join)                        # join Name and Count into one cell
   .T)
Output:
Title Coach driver teacher
0 Bob 4 mark 8 sam 5
1 tina 10 frank 2 kate 3
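Note that this keeps the original row order within each Title. If you want each column ranked by Count descending, as in the expected output, here is a minimal sketch of one way to do that (assuming the column names Title, Name, Count from the question):
import pandas as pd

df = pd.DataFrame({
    'Title': ['Coach', 'teacher', 'driver', 'Coach', 'teacher', 'driver'],
    'Name':  ['Bob', 'sam', 'mark', 'tina', 'kate', 'frank'],
    'Count': [4, 5, 8, 10, 3, 2],
})

out = (df.sort_values('Count', ascending=False)  # rank by Count first
         .assign(rank=lambda d: d.groupby('Title').cumcount())
         .set_index(['Title', 'rank'])
         .astype(str)
         .apply(' '.join, axis=1)                # "tina 10", "mark 8", ...
         .unstack(0))
print(out)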


Pandas, Dataframe, conditional sum of column for each row

I am new to Python and trying to move some of my work from Excel to Python, and wanted an Excel SUMIFS equivalent in pandas, for example something like:
SUMIFS(F:F, D:D, "<="&C2, B:B, B2, F:F, ">"&0)
In my case, I have 6 columns: a unique Trade ID, an Issuer, a Trade date, a Release date, a Trader, and a Quantity. I wanted to get a column which shows the sum of available quantity for release at each row, something like the below:
A   B       C          D            E       F         G
ID  Issuer  TradeDate  ReleaseDate  Trader  Quantity  SumOfAvailableRelease
1   Horse   1/1/2012   13/3/2012    Amy      7         0
2   Horse   2/2/2012   15/5/2012    Dave     2         0
3   Horse   14/3/2012  NaN          Dave    -3         7
4   Horse   16/5/2012  NaN          John    -4         9
5   Horse   20/5/2012  10/6/2012    John     2         9
6   Fish    6/6/2013   20/6/2013    John    11         0
7   Fish    25/6/2013  9/9/2013     Amy      4        11
8   Fish    8/8/2013   15/9/2013    Dave     5        11
9   Fish    25/9/2013  NaN          Amy     -3        20
Usually, in Excel, I just pull the SUMIFS formula down the whole column and it works; I am not sure how I can do that in Python.
Many thanks!
What you could do is use df.where. For example, you could say:
Qdf = df.where(df["Quantity"] >= 5)
and then do your sum. I don't know exactly what you want to do, since I have no knowledge of Excel, but I hope this helps.
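For reference, here is a minimal row-wise translation of that SUMIFS, as a sketch assuming the columns are named as in the table and the two date columns have been parsed as datetimes (a NaT ReleaseDate simply fails the comparison, which matches how Excel skips blanks):
import pandas as pd

df = pd.DataFrame({
    'Issuer':      ['Horse'] * 5 + ['Fish'] * 4,
    'TradeDate':   pd.to_datetime(['1/1/2012', '2/2/2012', '14/3/2012', '16/5/2012',
                                   '20/5/2012', '6/6/2013', '25/6/2013', '8/8/2013',
                                   '25/9/2013'], dayfirst=True),
    'ReleaseDate': pd.to_datetime(['13/3/2012', '15/5/2012', None, None, '10/6/2012',
                                   '20/6/2013', '9/9/2013', '15/9/2013', None],
                                  dayfirst=True),
    'Quantity':    [7, 2, -3, -4, 2, 11, 4, 5, -3],
})

def sum_available(row):
    # SUMIFS(F:F, D:D, "<="&C2, B:B, B2, F:F, ">"&0): same Issuer,
    # ReleaseDate on or before this row's TradeDate, positive Quantity.
    mask = ((df['Issuer'] == row['Issuer'])
            & (df['ReleaseDate'] <= row['TradeDate'])
            & (df['Quantity'] > 0))
    return df.loc[mask, 'Quantity'].sum()

df['SumOfAvailableRelease'] = df.apply(sum_available, axis=1)
print(df)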

How to find records with same value in one column but different value in another column

I have two pandas df with the exact same column names. One of these columns is named id_number which is unique to each table (What I mean is an id_number can only appear once in each df). I want to find all records that have the same id_number but have at least one different value in any column and store these records in a new pandas df.
I've tried merging (more specifically inner join), but it keeps only one record with that specific id_number so I can't look for any differences between the two dfs.
Let me provide an example for a clearer explanation:
Example dfs:
First DF:
id_number name type city
1 John dev Toronto
2 Alex dev Toronto
3 Tyler dev Toronto
4 David dev Toronto
5 Chloe dev Toronto
Second DF:
id_number name type city
1 John boss Vancouver
2 Alex dev Vancouver
4 David boss Toronto
5 Chloe dev Toronto
6 Kyle dev Vancouver
I want the resulting df to contain the following records:
id_number name type city
1 John dev Toronto
1 John boss Vancouver
2 Alex dev Toronto
2 Alex dev Vancouver
4 David dev Toronto
4 David boss Toronto
NOTE: I would not want records with id_number 5 to appear in the resulting df, that is because the records with id_number 5 are exactly the same in both dfs.
In reality, there are 80 columns for each record, but I think these tables make my point a little clearer. Again to summarize, I want the resulting df to contain records with same id_numbers, but a different value in any of the other columns. Thanks in advance for any help!
Here is one way: using nunique, we pick the id_number groups where any column has more than one unique value, then slice them out.
s = pd.concat([df1, df2])
# keep ids where any column has more than one unique value across the two dfs
keep = s.groupby('id_number').nunique().gt(1).any(axis=1)
s = s.loc[s.id_number.isin(keep[keep].index)]
s
Out[654]:
id_number name type city
0 1 John dev Toronto
1 2 Alex dev Toronto
3 4 David dev Toronto
0 1 John boss Vancouver
1 2 Alex dev Vancouver
2 4 David boss Toronto
Here is a way using pd.concat, drop_duplicates and duplicated:
(pd.concat([df1, df2])
   .drop_duplicates(keep=False)
   .sort_values('id_number')
   .loc[lambda x: x.id_number.duplicated(keep=False)])
Output:
id_number name type city
0 1 John dev Toronto
0 1 John boss Vancouver
1 2 Alex dev Toronto
1 2 Alex dev Vancouver
3 4 David dev Toronto
2 4 David boss Toronto
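A minimal reproduction of that second approach with the example frames from the question:
import pandas as pd

df1 = pd.DataFrame({
    'id_number': [1, 2, 3, 4, 5],
    'name': ['John', 'Alex', 'Tyler', 'David', 'Chloe'],
    'type': ['dev'] * 5,
    'city': ['Toronto'] * 5,
})
df2 = pd.DataFrame({
    'id_number': [1, 2, 4, 5, 6],
    'name': ['John', 'Alex', 'David', 'Chloe', 'Kyle'],
    'type': ['boss', 'dev', 'boss', 'dev', 'dev'],
    'city': ['Vancouver', 'Vancouver', 'Toronto', 'Toronto', 'Vancouver'],
})

result = (pd.concat([df1, df2])
            .drop_duplicates(keep=False)          # rows identical in both dfs fall out
            .sort_values('id_number')
            .loc[lambda x: x.id_number.duplicated(keep=False)])  # keep ids seen twice
print(result)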

select rows with except sqlite3

I have a database with a dataframe that contains the columns Name, Award, Winner (1 means won and 0 means did not win) and some other things that are irrelevant for this question.
I want to make a dataframe with the names of people that were selected for an actress award (all awards with the name Actress in them count) but never won, using sqlite3 in Python.
These are the first five rows of the dataframe:
Unnamed: 0 CeremonyNumber CeremonyYear CeremonyMonth CeremonyDay FilmYear Award Winner Name FilmDetails
0 0 1 1929 5 16 1927 Actor 1 Emil Jannings The Last Command
1 1 1 1929 5 16 1927 Actor 0 Richard Barthelmess The Noose
2 2 1 1929 5 16 1927 Actress 1 Janet Gaynor 7th Heaven
3 3 1 1929 5 16 1927 Actress 0 Louise Dresser A Ship Comes In
4 4 1 1929 5 16 1927 Actress 0 Gloria Swanson Sadie Thompson
I tried it with this query, but it did not give the correct result.
query = '''
select Name
from oscars
where Award like "Actress%"
except select Name
from oscars
where Award like "Actress%" and Winner == 1
'''
The outcome of this query should be a dataframe like this:
Name
0 Abigail Breslin
1 Adriana Barraza
2 Agnes Moorehead
3 Alfre Woodard
4 Ali MacGraw
In order to select all the actresses who were nominated for the award and never won, you should use AND rather than EXCEPT. Something like this should work:
SELECT Name FROM Oscars WHERE Award LIKE 'Actress%' AND Winner = 0
Refer to the sqlite docs at https://www.sqlite.org/index.html for more information.
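To get the result back as a dataframe, here is a sketch of running that query from Python with sqlite3 and pandas. The filename oscars.db is a placeholder, and DISTINCT is an addition that collapses repeat nominations into one row per name:
import sqlite3
import pandas as pd

conn = sqlite3.connect('oscars.db')  # hypothetical filename
query = '''
    SELECT DISTINCT Name
    FROM oscars
    WHERE Award LIKE 'Actress%' AND Winner = 0
'''
result = pd.read_sql_query(query, conn)
conn.close()
print(result.head())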

How to combine two dataframes and have unique key column using Pandas?

I have two dataframes with the same columns that I need to combine:
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
and
first_name last_name
0 Billy Bonder
1 Brian Black
2 Bran Balwner
When I do this:
df_new = pd.concat([df1, df2])
I get this:
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
0 Billy Bonder
1 Brian Black
2 Bran Balwner
Is there a way to have the left column have a unique number like this?
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
3 Billy Bonder
4 Brian Black
5 Bran Balwner
If not, how can I add a new key column with numbers from 1 to whatever the row count is?
As said earlier by @MaxU, you can use ignore_index=True: pass it after the [dataframe1, dataframe2] list. Note that it discards the original indexes and builds a fresh 0-to-n-1 index; leave it at its default of False if you want to keep the indexes of the input tables.
You can check whether indexes repeat with the parameter verify_integrity=True; it raises a ValueError on duplicates (you never know when you'll have to check).
But be careful, because this check can be a little slow depending on the size of your DataFrame.
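A minimal sketch with the frames from the question, plus one way to add an explicit key column if you prefer numbering from 1:
import pandas as pd

df1 = pd.DataFrame({'first_name': ['Alex', 'Amy', 'Allen'],
                    'last_name':  ['Anderson', 'Ackerman', 'Ali']})
df2 = pd.DataFrame({'first_name': ['Billy', 'Brian', 'Bran'],
                    'last_name':  ['Bonder', 'Black', 'Balwner']})

# ignore_index=True discards the old 0..2 indexes and renumbers 0..5
df_new = pd.concat([df1, df2], ignore_index=True)
print(df_new)

# if you instead want an explicit key column numbered 1 to the row count
df_new.insert(0, 'key', range(1, len(df_new) + 1))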

pandas - DataFrame expansion with outer join

First of all, I am very new to pandas and am trying to learn, so thorough answers will be appreciated.
I want to generate a pandas DataFrame representing a map of Twitter tag subtoken -> poster, where tag subtoken means anything in the set {hashtagA} U {i | i in split('_', hashtagA)}, from a table matching poster -> tweet.
For example:
In [1]: df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]])
In [2]: df
Out[2]:
0 1
0 jim i was like #yolo_omg to her
1 jack You are so #yes_omg #best_place_ever
2 neil Yo #rofl_so_funny
And from that I want to get something like
0 1
0 jim yolo_omg
1 jim yolo
2 jim omg
3 jack yes_omg
4 jack yes
5 jack omg
6 jack best_place_ever
7 jack best
8 jack place
9 jack ever
10 neil rofl_so_funny
11 neil rofl
12 neil so
13 neil funny
I managed to construct this monstrosity that actually does the job:
In [143]: df[1].str.findall('#([^\s]+)') \
.apply(pd.Series).stack() \
.apply(lambda s: [s] + s.split('_') if '_' in s else [s]) \
.apply(pd.Series).stack().to_frame().reset_index(level=0) \
.join(df, on='level_0', how='right', lsuffix='_l')[['0','0_l']]
Out[143]:
0 0_l
0 0 jim yolo_omg
1 jim yolo
2 jim omg
0 jack yes_omg
1 jack yes
2 jack omg
1 0 jack best_place_ever
1 jack best
2 jack place
3 jack ever
0 0 neil rofl_so_funny
1 neil rofl
2 neil so
3 neil funny
But I have a very strong feeling that there are much better ways of doing this, especially given that the real dataset is huge.
pandas indeed has a function for doing this natively.
Series.str.findall()
This basically applies a regex and captures the group(s) you specify in it.
So if I had your dataframe:
df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]])
What I would do is first to set the names of your columns, like this:
df.columns = ['user', 'tweet']
Or do it on creation of the dataframe:
df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]], columns=['user', 'tweet'])
Then I would simply apply the findall function with a regex:
df['tag'] = df["tweet"].str.findall("(#[^ ]*)")
And I would use the negative character group instead of a positive one; this is more likely to survive special cases.
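For comparison, here is a sketch that takes the same idea further with DataFrame.explode (available in pandas 0.25+), expanding each tag into itself plus its '_' subtokens. The column names user and tweet follow the renaming above:
import pandas as pd

df = pd.DataFrame(
    [["jim", "i was like #yolo_omg to her"],
     ["jack", "You are so #yes_omg #best_place_ever"],
     ["neil", "Yo #rofl_so_funny"]],
    columns=['user', 'tweet'])

tags = (df.assign(tag=df['tweet'].str.findall(r'#([^\s]+)'))
          .explode('tag')                      # one row per hashtag
          .assign(tag=lambda d: d['tag'].map(
              lambda t: [t] + t.split('_') if '_' in t else [t]))
          .explode('tag')                      # one row per tag or subtoken
          [['user', 'tag']]
          .reset_index(drop=True))
print(tags)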
How about using list comprehensions in Python and then reverting back to pandas? It requires a few lines of code but is perhaps more readable.
import re

# get the hash tags
tags = [re.findall(r'#([^\s]+)', t) for t in df[1]]

# make lists of the tags with subtokens for each user
st = [[t] + [s.split('_') for s in t] for t in tags]
subtokens = [[i for s in poster for i in s] for poster in st]

# put back into DataFrame with poster names
df2 = pd.DataFrame(subtokens, index=df[0]).stack()
In [250]: df2
Out[250]:
jim 0 yolo_omg
1 yolo
2 omg
jack 0 yes_omg
1 best_place_ever
2 yes
3 omg
4 best
5 place
6 ever
neil 0 rofl_so_funny
1 rofl
2 so
3 funny
dtype: object
