I have a data frame representing customers' ratings of restaurants. rating_year is the year the rating was made, first_year is the year the restaurant opened, and last_year is the last business year of the restaurant.
What I want to do is calculate the number of restaurants that opened in the same year as the restaurant in question, i.e. with the same first_year.
The problem with what I did here is that I group by restaurant_id and count first_year, but I don't exclude the other rows with the same id. I don't know the syntax to do this.
Can anyone help?
import pandas as pd
data = {'rating_id': ['1', '2','3','4','5','6','7','8','9'],
'user_id': ['56', '13','56','99','99','13','12','88','45'],
'restaurant_id': ['xxx', 'xxx','yyy','yyy','xxx','zzz','zzz','eee','eee'],
'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','1.0','2.2','0.2'],
'rating_year': ['2012','2012','2020','2001','2020','2015','2000','2003','2004'],
'first_year': ['2012', '2012','2001','2001','2012','2000','2000','2001','2001'],
'last_year': ['2020', '2020','2020','2020','2020','2015','2015','2020','2020'],
}
df = pd.DataFrame (data, columns = ['rating_id','user_id','restaurant_id','star_rating','rating_year','first_year','last_year'])
df['star_rating'] = df['star_rating'].astype(float)
df['nb_rating'] = (
df.groupby('restaurant_id')['rating_id'].transform('count')
)
#here
df['nb_opened_sameYear'] = (
df.groupby('restaurant_id')['first_year']
.transform('count')
)
df.head(10)
IIUC, you want to group by first_year and use transform with nunique on the restaurant_id column. Try:
df['nb_opened_sameYear'] = (
df.groupby('first_year')['restaurant_id']
.transform('nunique')
)
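To make the difference from the per-restaurant count concrete, here is a minimal sketch on just the two relevant columns of the question's data. The extra column `nb_others_sameYear` is a hypothetical name, added on the assumption that you may not want to count the restaurant itself.

```python
import pandas as pd

# Only the two columns that matter for this count, copied from the question.
df = pd.DataFrame({
    'restaurant_id': ['xxx', 'xxx', 'yyy', 'yyy', 'xxx', 'zzz', 'zzz', 'eee', 'eee'],
    'first_year': ['2012', '2012', '2001', '2001', '2012', '2000', '2000', '2001', '2001'],
})

# Distinct restaurants per opening year, broadcast back onto every rating row.
df['nb_opened_sameYear'] = (
    df.groupby('first_year')['restaurant_id']
      .transform('nunique')
)

# Assumption: subtract 1 if the restaurant itself should not be counted.
df['nb_others_sameYear'] = df['nb_opened_sameYear'] - 1

print(df)
```

Because `transform` returns a result aligned to the original rows, repeated ratings of the same restaurant do not inflate the count.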
I have a script like this in pandas:
dfmi['Time'] = pd.to_datetime(dfmi['Time'], format='%H:%M:%S')
dfmi['hours'] = dfmi['Time'].dt.hour
sum_dh = dfmi.groupby(['Date','hours']).agg({'Amount': 'sum', 'Price':'sum'})
dfdhsum = pd.DataFrame(sum_dh)
dfdhsum.columns = ['Amount', 'Gas Sales']
dfdhsum
And the output:
I want a distinct sum with group by, and the final result should look like this:
What is the pandas code for this?
I don't understand exactly what you want, but this statement will sum hours, Amount and Price (the column you renamed to Gas Sales) for each date:
dfmi.groupby("Date").agg({'hours': 'sum', 'Amount': 'sum', 'Price': 'sum'})
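Since the expected output is not shown, the following is only a sketch of one plausible reading: sum per date and hour first, then roll the hourly sums up to one row per date. The sample values are invented for illustration; only the column names `Date`, `Time`, `Amount`, `Price` and `Gas Sales` come from the question.

```python
import pandas as pd

# Hypothetical sample data; the real dfmi is not shown in the question.
dfmi = pd.DataFrame({
    'Date': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-02'],
    'Time': ['08:15:00', '08:45:00', '09:05:00', '08:30:00'],
    'Amount': [10, 20, 30, 40],
    'Price': [1.5, 2.5, 3.5, 4.5],
})
dfmi['hours'] = pd.to_datetime(dfmi['Time'], format='%H:%M:%S').dt.hour

# One row per (Date, hours); named aggregation renames Price to 'Gas Sales'.
per_hour = dfmi.groupby(['Date', 'hours'], as_index=False).agg(
    **{'Amount': ('Amount', 'sum'), 'Gas Sales': ('Price', 'sum')}
)

# One row per Date, summing the hourly totals.
per_day = per_hour.groupby('Date', as_index=False)[['Amount', 'Gas Sales']].sum()

print(per_hour)
print(per_day)
```

Using `as_index=False` keeps `Date` and `hours` as ordinary columns, so the second `groupby` can refer to them by name.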
I just wrote a program for college using pandas to structure some unstructured data. I definitely made it harder than it should be, but I ended up finding something interesting.
Here is the data I parsed:
Center/Daycare
825 23rd Street South
Arlington, VA 22202
703-979-BABY (2229)
22.
Maria Teresa Desaba, Owner/Director; Tony Saba, Org. >Director.
Website: www.mariateresasbabies.com
Serving children 6 wks to 5yrs full-time.
National Science Foundation Child Development Center
23.
4201 Wilson Blvd., Suite 180 22203
703-292-4794
Website: www.brighthorizons.com 112 children, ages 6 wks - 5 yrs.
7:00 a.m. – 6:00 p.m. Summer Camp for children 5 - 9 years.
Here is the code (aggressively commented for school), which is mostly irrelevant but included for completeness' sake:
import csv
import pandas as pd
lines = []
"""opening the raw data from a text file"""
with open('raw_data.txt') as f:
    lines = f.readlines()
    f.close()
"""removing new line characters"""
for i in range(len(lines)):
    lines[i] = lines[i].rstrip('\n')
df = pd.DataFrame(lines, columns=['info'], index=['business type', 'address', 'location',
'phone number', 'unknown', 'owner', 'website', 'description',
'null', 'business type', 'unknown', 'address', 'phone number',
'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
"""here I am taking every column and adding corresponding values from the original dataframe
extra data frames could be garbage collected but this serves for demonstration"""
df.index = df.index.astype('str')
df1 = df[df.index.str.contains('bus')]
df2 = df[df.index.str.contains('address')]
df3 = df[df.index.str.contains('location')]
df4 = df[df.index.str.contains('number')]
df5 = df[df.index.str.contains('know')]
df6 = df[df.index.str.contains('owner')]
df7 = df[df.index.str.contains('site')]
df8 = df[df.index.str.contains('descript')]
df9 = df[df.index.str.contains('null')]
for i in range(len(df1)):
    df['business type'][i] = df1['info'][i]
for i in range(len(df2)):
    df['address'][i] = df2['info'][i]
for i in range(len(df3)):
    df['location'][i] = df3['info'][i]
for i in range(len(df4)):
    df['phone number'][i] = df4['info'][i]
for i in range(len(df5)):
    df['unknown'][i] = df5['info'][i]
for i in range(len(df6)):
    df['owner'][i] = df6['info'][i]
for i in range(len(df7)):
    df['website'][i] = df7['info'][i]
for i in range(len(df8)):
    df['description'][i] = df8['info'][i]
for i in range(len(df9)):
    df['null'][i] = df9['info'][i]
"""dropping unnecessary columns"""
df.drop(columns='info', inplace=True)
df.drop(columns='null', inplace=True)
df.drop(columns='unknown', inplace=True)
"""changing the index values to int to make easier to drop unused rows"""
idx = []
for i in range(0, len(df)):
    idx.append(i)
df.index = idx
"""dropping unused rows"""
for i in range(2, 15):
    df.drop([i], inplace=True)
"""writing to csv and printing to console"""
df.to_csv("new.csv", index=False)
print(df.to_string())
I'm just curious why creating more columns from the label of each index[i] item here
df = pd.DataFrame(lines, columns=['info'], index=['business type', 'address', 'location',
'phone number', 'unknown', 'owner', 'website', 'description',
'null', 'business type', 'unknown', 'address', 'phone number',
'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
doesn't produce any duplicate columns.
When I add
print(df.columns)
I get the output:
Index(['info', 'business type', 'address', 'location', 'phone number',
'unknown', 'owner', 'website', 'description', 'null'],
dtype='object')
I'm curious why there are no duplicates, since I'm sure duplicates could be problematic in certain situations. pandas is interesting, I hardly understand it, and I would like to know more. Also, if you feel extra enthusiastic, any info on a more efficient way to do this would be greatly appreciated; if not, no worries, I'll eventually read the docs.
The pandas DataFrame is designed for tabular data in which all the entries in any one column have the same type (e.g. integer or string). One row usually represents one instance, sample, or individual. So the natural way to parse your data into a DataFrame is to have two rows, one for each institution, and define the columns as what you have called index (perhaps with the address split into several columns), e.g. business type, street, city, state, post code, phone number, etc.
So there would be one row per institution, and the index would be used to assign a unique identifier to each of them. That's why it's desirable for the index to contain no duplicates.
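As a small illustration of both points, here is a sketch under a few assumptions: the label list is shortened and repeats 'business type' on purpose, and the `records` layout is a hypothetical tidy form, not the only possible one. Assigning `df[label] = ''` for a label that already exists as a column overwrites that column instead of appending a second one, which is why the column index in the question ends up without duplicates even though the row index has them.

```python
import pandas as pd

# Shortened example: three lines from the sample, with a repeated row label.
lines = ['Center/Daycare', '825 23rd Street South', '4201 Wilson Blvd., Suite 180 22203']
labels = ['business type', 'address', 'address']

df = pd.DataFrame(lines, columns=['info'], index=labels)
for label in df.index:
    # The second 'address' assignment overwrites the existing column.
    df[label] = ''

print(df.index.is_unique)    # the row index still contains the duplicate label
print(df.columns.tolist())   # each label appears only once as a column

# Hypothetical tidy layout: one dict (row) per institution.
records = [
    {'address': '825 23rd Street South', 'phone number': '703-979-BABY (2229)'},
    {'address': '4201 Wilson Blvd., Suite 180 22203', 'phone number': '703-292-4794'},
]
tidy = pd.DataFrame(records)
print(tidy)
```

Building the frame from a list of dicts avoids the intermediate per-label frames and the row-dropping step entirely.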
EDIT: Thanks to Scott Boston for advising me on how to post correctly.
I have a dataframe containing clock in/out date and times from work for all employees. Sample df input is below, but the real data set has a year of data for many employees.
Question:
What I would like to do is to calculate the time spent in work for each employee over the year.
df = pd.DataFrame({'name': ['Joe Bloggs', 'Joe Bloggs', 'Joe Bloggs',
                            'Joe Bloggs', 'Jane Doe', 'Jane Doe', 'Jane Doe',
                            'Jane Doe'],
                   'Date': ['2020-06-19', '2020-06-19', '2020-06-18', '2020-06-18', '2020-06-19',
                            '2020-06-19', '2020-06-18', '2020-06-18'],
                   'Time': ["17:30:06", "09:00:00", "17:44:00", "08:34:02", "16:30:06",
                            "10:00:02", "15:45:33", "09:30:33"],
                   'type': ["Logout", "Login", "Logout",
                            "Login", "Logout", "Login",
                            "Logout", "Login"]})
You can do it this way:
#Create a datetime column combining both date and time also create year column
df['datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%Y-%m-%d %H:%M:%S')
df['year'] = df['datetime'].dt.year
#Sort the dataframe by datetime
df = df.sort_values('datetime')
#Create "sessions" worked by Login records
session = (df['type'] == 'Login').groupby(df['name']).cumsum().rename('Session_No')
#Reshape the dataframe to get the login and logout for a session on one row
#Then use diff to calculate the time worked during that session
df_time = df.set_index(['name', 'year', session, 'type'])['datetime']\
.unstack().diff(axis=1).dropna(axis=1, how='all')\
.rename(columns={'Logout':'TimeLoggedIn'})
#Sum on Name and Year
df_time.groupby(level=[0, 1]).sum()  # sum(level=...) was removed in pandas 2.0
Output:
name year TimeLoggedIn
0 Jane Doe 2020 12:45:04
1 Joe Bloggs 2020 17:40:04
Note: @warped's solution works, and works well. However, if you had an employee who worked overnight, I think that code breaks down. This answer should capture cases where an employee works past midnight.
df['Time'] = pd.to_timedelta(df['Time'])
df['Date'] = pd.to_datetime(df['Date'])
df['time_complete'] = df['Time'] + df['Date']
df.groupby(['name', 'Date']).apply(lambda x: (x.sort_values('type', ascending=True)['time_complete'].diff().dropna()))
How it works:
Convert the dates to datetime, to allow grouping.
Convert the times to timedelta, to allow arithmetic with the dates.
Create a complete timestamp ('time_complete'), to account for potential night shifts (as spotted by @ScottBoston).
Then, group by date and employee to isolate those.
So, each group now corresponds to one employee on a specific date.
Within each group, the relevant columns are 'type', 'Time' and 'time_complete'.
Sorting the rows by 'type' in ascending order puts Login before Logout.
Then, we take the difference (row n minus row n-1) of column 'time_complete' within each sorted group, which gives the time spent between login and logout.
Finally, we remove the null value that arises in the first row of each group, which has no previous row to subtract.
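For comparison, here is a self-contained sketch that pairs each Login with the following Logout per employee instead of grouping by calendar date, so a shift that crosses midnight would still be measured as one session. It reuses the question's sample data; the names `timestamp`, `session`, `wide` and `duration` are illustrative, and it assumes every Login has exactly one matching Logout.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Joe Bloggs'] * 4 + ['Jane Doe'] * 4,
    'Date': ['2020-06-19', '2020-06-19', '2020-06-18', '2020-06-18',
             '2020-06-19', '2020-06-19', '2020-06-18', '2020-06-18'],
    'Time': ['17:30:06', '09:00:00', '17:44:00', '08:34:02',
             '16:30:06', '10:00:02', '15:45:33', '09:30:33'],
    'type': ['Logout', 'Login', 'Logout', 'Login',
             'Logout', 'Login', 'Logout', 'Login'],
})

# Full timestamp, then chronological order within each employee.
df['timestamp'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df = df.sort_values(['name', 'timestamp'])

# Every Login starts a new session for that employee.
df['session'] = (df['type'] == 'Login').groupby(df['name']).cumsum()

# One row per (name, session) with Login and Logout side by side.
wide = df.set_index(['name', 'session', 'type'])['timestamp'].unstack('type')
duration = wide['Logout'] - wide['Login']

# Total time logged in per employee.
total = duration.groupby(level='name').sum()
print(total)
```

Subtracting the two columns explicitly makes the direction of the difference (Logout minus Login) visible in the code rather than relying on sort order.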
I have the following two dataframes:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['01/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
(perhaps it's clearer in the screenshots here: https://imgur.com/a/YNrWpR2)
The df2 is much larger than shown here - it contains columns for 100 companies. So for example, for the 10th company, the column names are: ReturnOnAssets.10, etc.
I have created a dictionary which maps the company names to the column names:
stocks = {'Microsoft':'','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7'}
and so on.
Now, what I am trying to achieve is adding a column "ReturnOnAssets" from df2 to df1, but for a specific company and for a specific date. So looking at df1, the first tweet (i.e. "text") contains the keyword "Amazon" and was posted on 02/30/2017. I now need to go to the relevant column for Amazon in df2 (i.e. "ReturnOnAssets.2") and fetch the value for that date.
So what I expect looks like this:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon', '10.5'], ["blala Amazon", '04/28/2017', 'Amazon', 'x'], ['blabla Netflix', '06/28/2017', 'Netflix', 'x']], columns=['text', 'date', 'keyword', 'ReturnOnAssets'])
By x I mean values which were not included in the example df1 and df2.
I am fairly new to pandas and I can't wrap my head around it. I tried:
keyword = df1['keyword']
txt = 'ReturnOnAssets.'+ stocks[keyword]
df1['ReturnOnAssets'] = df2[txt]
But I don't know how to fetch the relevant date, and this also gives me an error: "'Series' objects are mutable, thus they cannot be hashed", which probably comes from the fact that I cannot just add a whole column of keywords to the text string.
I don't know how to achieve the operation I need to do, so I would appreciate help.
It can probably be shortened and you can add if statements to deal with when there are missing values.
import pandas as pd
import numpy as np
df1 = pd.DataFrame([["blala Amazon", '05/28/2017', 'Amazon'], ["blala Facebook", '04/28/2017', 'Facebook'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'dates', 'keyword'])
df1
df2 = pd.DataFrame([['06/28/2017', '3.4', '10.2'], ['05/28/2017', '3.7', '10.5'], ['04/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAsset.1', 'ReturnOnAsset.2'])
#creating myself a bigger df2 to cover all the way to netflix
for i in range(9):
    df2['ReturnOnAsset.' + str(i)] = np.random.randint(1, 1000, df2.shape[0])
stocks = {'Microsoft':'0','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7', 'Netflix': '8'}
#new col where to store values
df1['ReturnOnAsset']=np.nan
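for index, row in df1.iterrows():
    colname = 'ReturnOnAsset.' + stocks[row['keyword']]
    # write with .loc and take the first matching value; chained assignment may not write back to df1
    df1.loc[index, 'ReturnOnAsset'] = df2.loc[df2['dates'] == row['dates'], colname].iloc[0]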
Next time, please give us correct test data. I modified your dates and dictionary so that the first and second columns match (the Netflix and Amazon values).
This code will work if and only if all dates from df1 are in df2. (Note that the column is named date in df1 and dates in df2.)
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '02/30/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['04/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
stocks = {'Microsoft':'','Apple' :'5', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Netflix':'1',
'JPMorgan' :'6', 'Alphabet': '7'}
df1["ReturnOnAssets"] = [
    df2.loc[df2["dates"] == date, "ReturnOnAssets." + stocks[keyword]].iloc[0]
    for date, keyword in zip(df1["date"], df1["keyword"])
]
df1
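If some dates or companies may be missing, a reshape-and-merge approach avoids the row loop and leaves NaN where no value exists. This is only a sketch: the numeric values, the third `ReturnOnAssets.3` column and the helper names `column_for`, `long` and `lookup` are assumptions made for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["blala Amazon", "05/28/2017", "Amazon"],
     ["blala Facebook", "04/28/2017", "Facebook"],
     ["blabla Netflix", "06/28/2017", "Netflix"]],
    columns=["text", "date", "keyword"],
)
df2 = pd.DataFrame(
    [["06/28/2017", 3.4, 10.2, 7.1],
     ["05/28/2017", 3.7, 10.5, 7.3],
     ["04/28/2017", 6.0, 10.9, 7.5]],
    columns=["dates", "ReturnOnAssets.1", "ReturnOnAssets.2", "ReturnOnAssets.3"],
)
# Netflix is deliberately absent to show the missing-value behaviour.
stocks = {"Apple": "1", "Amazon": "2", "Facebook": "3"}

# Map each company name to its full column name in df2.
column_for = {name: "ReturnOnAssets." + suffix for name, suffix in stocks.items()}

# Long format: one row per (date, column) pair.
long = (
    df2.melt(id_vars="dates", var_name="column", value_name="ReturnOnAssets")
       .rename(columns={"dates": "date"})
)

# Attach the target column name to each tweet, then look up by date and column.
lookup = df1.assign(column=df1["keyword"].map(column_for))
result = lookup.merge(long, how="left", on=["date", "column"]).drop(columns="column")
print(result)
```

A left merge keeps every row of `df1`, so tweets without a matching company or date simply get NaN instead of raising an error.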
I am trying to compare two dataframes and return different result sets based on whether a value from one dataframe is present in the other.
Here is my sample code:
pmdf = pd.DataFrame(
{
'Journal' : ['US Drug standards.','Acta veterinariae.','Bulletin of big toe science.','The UK journal of dermatology.','Journal of Hypothetical Journals'],
'ISSN': ['0096-0225', '0567-8315','0007-4977','0007-0963','8675-309J'],
}
)
pmdf = pmdf[['Journal', 'ISSN']]
jcrdf = pd.DataFrame(
{
'Full Journal Title': ['Drug standards.','Acta veterinaria.','Bulletin of marine science.','The British journal of dermatology.'],
'Abbreviated Title': ['DStan','Avet','Marsci','BritSkin'],
'Total Cites': ['223','444','324','166'],
'ISSN': ['0096-0225','0567-8315','0007-4977','0007-0963'],
'All_ISSNs': ['0096-0225,0096-0225','0567-8315,1820-7448,0567-8315','0007-4977,0007-4977','0007-0963,0007-0963,0366-077X,1365-2133']
})
jcrdf = jcrdf.set_index('Full Journal Title')
pmdf_issn = pmdf['ISSN'].values.tolist()
This line gets me the rows from dataframe jcrdf that contain the issn from dataframe pmdf
pmjcrmatch = jcrdf[jcrdf['All_ISSNs'].str.contains('|'.join(pmdf_issn))]
I wanted the following line to create a new dataframe of values from pmdf where the ISSN is not in jcrdf, so I negated the previous statement and applied it to the first dataframe.
pmjcrnomatch = pmdf[~jcrdf['All_ISSNs'].str.contains('|'.join(pmdf_issn))]
I get an error: "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)".
I don't find a lot about this specific error, at least nothing that is helping me toward a solution.
Is "str.contains" not the best way of sorting items that are and aren't in the second dataframe?
You are trying to apply the boolean index of one dataframe to another. This is only possible if the indexes of both dataframes match. In your case you should use isin.
# get all rows from jcrdf where `ALL_ISSNs` contains any of the `ISSN` in `pmdf`.
pmjcrmatch = jcrdf[jcrdf.All_ISSNs.str.contains('|'.join(pmdf.ISSN))]
# assign all remaining rows from `jcrdf` to a new dataframe.
pmjcrnomatch = jcrdf[~jcrdf.ISSN.isin(pmjcrmatch.ISSN)]
EDIT
Let's try another approach:
First I'd create a lookup for all your ISSNs and then create the diff by isolating the matches:
import pandas as pd
pmdf = pd.DataFrame(
{
'Journal' : ['US Drug standards.','Acta veterinariae.','Bulletin of big toe science.','The UK journal of dermatology.','Journal of Hypothetical Journals'],
'ISSN': ['0096-0225', '0567-8315','0007-4977','0007-0963','8675-309J'],
}
)
pmdf = pmdf[['Journal', 'ISSN']]
jcrdf = pd.DataFrame(
{
'Full Journal Title': ['Drug standards.','Acta veterinaria.','Bulletin of marine science.','The British journal of dermatology.'],
'Abbreviated Title': ['DStan','Avet','Marsci','BritSkin'],
'Total Cites': ['223','444','324','166'],
'ISSN': ['0096-0225','0567-8315','0007-4977','0007-0963'],
'All_ISSNs': ['0096-0225,0096-0225','0567-8315,1820-7448,0567-8315','0007-4977,0007-4977','0007-0963,0007-0963,0366-077X,1365-2133']
})
jcrdf = jcrdf.set_index('Full Journal Title')
# create lookup from all ISSNs to avoid expensive string matching
jcrdf_lookup = pd.DataFrame(jcrdf['All_ISSNs'].str.split(',').tolist(),
index=jcrdf.ISSN).stack(level=0).reset_index(level=0)
# compare extracted ISSNs from ALL_ISSNs with pmdf.ISSN
matches = jcrdf_lookup[jcrdf_lookup[0].isin(pmdf.ISSN)]
jcrdfmatch = jcrdf[jcrdf.ISSN.isin(matches.ISSN)]
jcrdfnomatch = pmdf[~pmdf.ISSN.isin(matches[0])]
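The same lookup idea can be sketched more compactly with `Series.explode`, building the boolean mask on pmdf itself so that it always aligns with the frame being filtered. The names `known`, `in_jcr`, `pm_match` and `pm_nomatch` are illustrative, and the sample frames are trimmed to the columns needed here.

```python
import pandas as pd

pmdf = pd.DataFrame({
    'Journal': ['US Drug standards.', 'Acta veterinariae.', 'Bulletin of big toe science.',
                'The UK journal of dermatology.', 'Journal of Hypothetical Journals'],
    'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963', '8675-309J'],
})
jcrdf = pd.DataFrame({
    'Full Journal Title': ['Drug standards.', 'Acta veterinaria.',
                           'Bulletin of marine science.', 'The British journal of dermatology.'],
    'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963'],
    'All_ISSNs': ['0096-0225,0096-0225', '0567-8315,1820-7448,0567-8315',
                  '0007-4977,0007-4977', '0007-0963,0007-0963,0366-077X,1365-2133'],
}).set_index('Full Journal Title')

# Split the comma-separated lists and flatten them into one ISSN per row.
known = set(jcrdf['All_ISSNs'].str.split(',').explode())

# The mask is computed from pmdf, so it shares pmdf's index.
in_jcr = pmdf['ISSN'].isin(known)
pm_match = pmdf[in_jcr]
pm_nomatch = pmdf[~in_jcr]

print(pm_nomatch)
```

Exact membership tests with `isin` also avoid partial matches that a regex-based `str.contains` could produce.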