I have the following pandas DataFrame (currently ~500 rows):
merged_verified =
Last Verified Verified by
0 2016-07-11 John Doe
1 2016-07-11 John Doe
2 2016-07-12 John Doe
3 2016-07-11 Mary Smith
4 2016-07-12 Mary Smith
I am attempting to pivot_table() it to produce the following:
Last Verified 2016-07-11 2016-07-12
Verified by
John Doe 2 1
Mary Smith 1 1
Currently I'm running
merged_verified = merged_verified.pivot_table(index=['Verified by'], values=['Last Verified'], aggfunc='count')
which gives me close to what I need, but not exactly:
Last Verified
Verified by
John Doe 3
Mary Smith 2
I've tried a variety of things with the parameters, but none of them worked. The result above is the closest I've come to what I need. I read somewhere that I would need to add an additional column of dummy values (1's) that I can then aggregate, but that seems counter-intuitive for what I believe to be a simple DataFrame layout.
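(For reference, the dummy-values approach mentioned above would look something like this sketch: add a helper column of 1's, then sum it per person/date cell. It works, but as the answers below show, it isn't necessary.)
merged_verified['ones'] = 1
merged_verified.pivot_table(index='Verified by',
                            columns='Last Verified',
                            values='ones',
                            aggfunc='sum',
                            fill_value=0)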
You can add the columns parameter and aggregate by len:
merged_verified = merged_verified.pivot_table(index=['Verified by'],
columns=['Last Verified'],
values=['Last Verified'],
aggfunc=len)
print (merged_verified)
Last Verified  2016-07-11  2016-07-12
Verified by
John Doe                2           1
Mary Smith              1           1
Or you can also omit values:
merged_verified = merged_verified.pivot_table(index=['Verified by'],
columns=['Last Verified'],
aggfunc=len)
print (merged_verified)
Last Verified 2016-07-11 2016-07-12
Verified by
John Doe 2 1
Mary Smith 1 1
Use groupby, value_counts, and unstack:
merged_verified.groupby('Last Verified')['Verified by'].value_counts().unstack(0)
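On the five-row example this produces the same table:
Last Verified  2016-07-11  2016-07-12
Verified by
John Doe                2           1
Mary Smith              1           1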
Timing
For timing, build a large example dataframe with 1 million rows (100 dates × 10,000 random names). Note that letters was undefined in the original excerpt; a pool of ASCII letters is assumed here:
import numpy as np
import pandas as pd
from string import ascii_letters

letters = list(ascii_letters)  # assumed definition; any character pool works
idx = pd.MultiIndex.from_product(
    [
        pd.date_range('2016-03-01', periods=100),
        pd.DataFrame(np.random.choice(letters, (10000, 10))).sum(1)
    ], names=['Last Verified', 'Verified by'])
merged_verified = idx.to_series().reset_index()[idx.names]
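With this frame in place, the two approaches above can be timed against each other, e.g. in IPython (illustrative only; the measured numbers were not part of the excerpt):
%timeit merged_verified.pivot_table(index='Verified by', columns='Last Verified', aggfunc=len)
%timeit merged_verified.groupby('Last Verified')['Verified by'].value_counts().unstack(0)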
I have the following toy dataset df:
import pandas as pd
data = {
'id' : [1, 2, 3],
'name' : ['John Smith', 'Sally Jones', 'William Lee']
}
df = pd.DataFrame(data)
df
id name
0 1 John Smith
1 2 Sally Jones
2 3 William Lee
My ultimate goal is to add a column that represents a Google search of the value in the name column.
I do this using:
def create_hyperlink(search_string):
return f'https://www.google.com/search?q={search_string}'
df['google_search'] = df['name'].apply(create_hyperlink)
df
id name google_search
0 1 John Smith https://www.google.com/search?q=John Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones
2 3 William Lee https://www.google.com/search?q=William Lee
Unfortunately, the newly created google_search column returns a malformed URL. The URL should have a "+" between the first name and last name.
The google_search column should return the following:
https://www.google.com/search?q=John+Smith
It's possible to do this using split() and join().
df['foo'] = df['name'].str.split()
df['foo']
0      [John, Smith]
1    [Sally, Jones]
2     [William, Lee]
Name: foo, dtype: object
Now, joining them:
df['bar'] = ['+'.join(l) for l in df['foo']]
df
id name google_search foo bar
0 1 John Smith https://www.google.com/search?q=John Smith [John, Smith] John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones [Sally, Jones] Sally+Jones
2 3 William Lee https://www.google.com/search?q=William Lee [William, Lee] William+Lee
Lastly, creating the updated google_search column:
df['google_search'] = df['bar'].apply(create_hyperlink)
df
Is there a more elegant, streamlined, Pythonic way to do this?
Thanks!
Rather than reinvent the wheel and modify your string manually, use a library that's guaranteed to give you the right result:
from urllib.parse import quote_plus
def create_hyperlink(search_string):
return f"https://www.google.com/search?q={quote_plus(search_string)}"
Use Series.str.replace:
df['google_search'] = 'https://www.google.com/search?q=' + \
df.name.str.replace(' ','+')
print(df)
id name google_search
0 1 John Smith https://www.google.com/search?q=John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally+Jones
2 3 William Lee https://www.google.com/search?q=William+Lee
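This only replaces spaces, so the quote_plus approach above is more robust if names can contain other URL-special characters. To make the literal (non-regex) replacement explicit:
df['google_search'] = 'https://www.google.com/search?q=' + \
    df.name.str.replace(' ', '+', regex=False)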
I am trying to find the number of unique values across a combination of 2 fields. A typical example would be last name and first name. I have a data frame.
When I do the following, I just get the number of unique values for each column separately, in this case, Last and First. Not the composite count.
df[['Last Name','First Name']].nunique()
Thanks!
Group by both columns first, and then use ngroups to count the distinct combinations:
>>> df.groupby(['First Name', 'Last Name']).ngroups
IIUC, you could use value_counts() for that:
df[['Last Name','First Name']].value_counts().size
3
For another example, if you start with this extended data frame that contains some dups:
Last Name First Name
0 Smith Bill
1 Johnson Bill
2 Smith John
3 Curtis Tony
4 Taylor Elizabeth
5 Smith Bill
6 Johnson Bill
7 Smith Bill
Then value_counts() gives you the counts by unique composite last-first name:
df[['Last Name','First Name']].value_counts()
Last Name First Name
Smith Bill 3
Johnson Bill 2
Curtis Tony 1
Smith John 1
Taylor Elizabeth 1
Then the size of that Series will give you the number of unique composite last-first names:
df[['Last Name','First Name']].value_counts().size
5
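An equivalent spelling (not in the original answers) is to drop duplicate pairs and take the length:
len(df[['Last Name', 'First Name']].drop_duplicates())  # 5 for the extended frame above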
I'm having trouble formulating a statement in Pandas that would be very simple in Excel. I have a dataframe sample as follows:
colA colB colC
10 0 27:15 John Doe
11 0 24:33 John Doe
12 1 29:43 John Doe
13 Inactive John Doe None
14 N/A John Doe None
Obviously the dataframe is much larger than this, with 10,000+ rows, so I'm trying to find an easier way to do this. I want to create a column that checks whether colA is equal to 0 or 1. If so, it equals colC; if not, it equals colB. In Excel, I would simply create a new column (new_col) and write
=IF(OR(A2=0,A2=1),C2,B2)
And then drag fill the entire sheet.
I'm sure this is fairly simple but I cannot for the life of me figure this out.
Result should look like this
colA colB colC new_col
10 0 27:15 John Doe John Doe
11 0 24:33 John Doe John Doe
12 1 29:43 John Doe John Doe
13 Inactive John Doe None John Doe
14 N/A John Doe None John Doe
np.where should do the trick: where colA is 0 or 1, take colC, otherwise fall back to colB.
import numpy as np

df['new_col'] = np.where(df['colA'].isin([0, 1]), df['colC'], df['colB'])
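An equivalent spelling with pandas' own Series.where (same logic, no NumPy import; not from the original answer) keeps colC where the condition holds and substitutes colB elsewhere:
df['new_col'] = df['colC'].where(df['colA'].isin([0, 1]), df['colB'])
Either vectorized form will be far faster on 10,000+ rows than a Python-level loop.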
Here is a solution that appends your results to a list according to your conditions, then adds the list back to the dataframe as a colD column. Note .iloc is needed for the lookups because the sample index starts at 10 while enumerate counts from 0:
your_results = []
for i, data in enumerate(df["colA"]):
    if data == 0 or data == 1:
        your_results.append(df["colC"].iloc[i])
    else:
        your_results.append(df["colB"].iloc[i])
df["colD"] = your_results
I have a question regarding how to fill missing date values in a pandas dataframe.
I found a similar question (pandas fill missing dates in time series), but it doesn't answer my actual question.
I have a dataframe looking something like this:
date amount person country
01.01.2019 10 John IT
01.03.2019 5 Jane SWE
01.05.2019 3 Jim SWE
01.05.2019 10 Jim SWE
02.01.2019 10 Bob UK
02.01.2019 10 Jane SWE
02.03.2019 10 Sue IT
As you can see, there are missing values in the dates.
What I need to do is fill in the missing date values and fill the remaining column values with the values from the previous line, EXCEPT for the column 'amount', which needs to be 0, since anything else would falsify my amounts.
I know there is a command for that in Pandas (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html), but I'm not sure how to apply it to filling the missing values. Something like this?
data = data.reindex(pd.date_range("2019-01-01", "2019-01-03"), method='backfill', fill_value="0")
The expected output would be as follows:
date amount person country
01.01.2019 10 John IT
01.02.2019 0 Jane SWE
01.03.2019 5 Jane SWE
01.04.2019 0 Jane SWE
01.05.2019 3 Jim SWE
01.05.2019 10 Jim SWE
02.01.2019 10 Bob UK
02.01.2019 10 Jane SWE
02.02.2019 0 Jane SWE
02.03.2019 10 Sue IT
I would appreciate any help in that regard.
Thank you and BR
I think simplest is to replace the missing values in 2 steps:
# assumes 'date' has already been parsed and set as the index, e.g.
#   data['date'] = pd.to_datetime(data['date'], format='%m.%d.%Y')
#   data = data.set_index('date')
# reindex also requires a unique index, so duplicate dates would need
# to be aggregated or deduplicated first
data = data.reindex(pd.date_range("2014-11-01", "2019-10-31"))
data['amount'] = data['amount'].fillna(0)  # missing amounts become 0
data = data.bfill()                        # other columns copy the next known row
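A minimal end-to-end sketch of the same idea on a deduplicated slice of the sample data (assuming the month-first date format implied by the expected output):
import pandas as pd

data = pd.DataFrame({
    'date': ['01.01.2019', '01.03.2019', '01.05.2019'],
    'amount': [10, 5, 3],
    'person': ['John', 'Jane', 'Jim'],
    'country': ['IT', 'SWE', 'SWE'],
})

# parse the dates and move them into the index so reindex can align on them
data['date'] = pd.to_datetime(data['date'], format='%m.%d.%Y')
data = data.set_index('date')

# insert the missing calendar days, then fill each column appropriately
data = data.reindex(pd.date_range('2019-01-01', '2019-01-05'))
data['amount'] = data['amount'].fillna(0)  # missing amounts become 0
data = data.bfill()                        # person/country copy the next known row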
I would like to sort a Pandas dataframe by two columns, the same way Excel does. Given the following df:
Name Date
John 13/01
Mike 13/01
John 15/01
John 14/01
Mike 12/01
When adding the following code:
df=df.sort_values(['Date','Name'], ascending=[True, True])
I would expect the following result:
Name Date
John 13/01
John 14/01
John 15/01
Mike 12/01
Mike 13/01
I'm getting nothing close to this result with the code above. Any idea where the mistake is?
Many thanks!
You need to swap the columns, because you want to sort first by Name and then by Date. ascending=[True, True] can also be removed, because it is the default:
df = df.sort_values(['Name','Date'])
print (df)
Name Date
0 John 13/01
3 John 14/01
2 John 15/01
4 Mike 12/01
1 Mike 13/01
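One caveat, not part of the original answer: Date here is a string, so the sort is lexicographic. That happens to give the right order for zero-padded dd/mm values like these, but parsing to real datetimes is safer:
# assumes dd/mm format; the sample has no year, so pandas defaults it to 1900
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m')
df = df.sort_values(['Name', 'Date'])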