I have a dataset in a dataframe that's almost 9 million rows and 30 columns. The further right you go in the columns, the more specific the data becomes, which leaves the data in the first columns very repetitive. See example:
park_code  camp_ground  parking_lot
acad       campground1  parking_lot1
acad       campground1  parking_lot2
acad       campground2  parking_lot3
bisc       campground3  parking_lot4
I'm looking to feed that information into a result set, like an object, for example:
park code: acad
campgrounds: campground1, campground2
parking lots: parking_lot1, parking_lot2, parking_lot3
park code: bisc
campgrounds: campground3, ....
.......
etc.
I'm completely at a loss as to how to do this with pandas, and I'm learning as I go; I'm used to working in SQL and databases, not pandas. If you want to see the code that's gotten me this far, here it is.
The function call:
data_handler.fetch_results(['Wildlife Watching', 'Arts and Culture'], ['Restroom'], ['Acadia National Park'], ['ME'])
def fetch_results(self, activities_selection, amenities_selection, parks_selection, states_selection):
    activities_selection_df = self.activities_df['park_code'][
        self.activities_df['activity_name'].isin(activities_selection)].drop_duplicates()
    amenities_selection_df = self.amenities_parks_df['park_code'][
        self.amenities_parks_df['amenity_name'].isin(amenities_selection)].drop_duplicates()
    states_selection_df = self.activities_df['park_code'][
        self.activities_df['park_states'].isin(states_selection)].drop_duplicates()
    parks_selection_df = self.activities_df['park_code'][
        self.activities_df['park_name'].isin(parks_selection)].drop_duplicates()
    data = activities_selection_df[
        activities_selection_df.isin(amenities_selection_df)
        & activities_selection_df.isin(states_selection_df)
        & activities_selection_df.isin(parks_selection_df)].drop_duplicates()
    pandas_select_df = pd.DataFrame(data, columns=['park_code'])
    results_df = pd.merge(pandas_select_df, self.activities_df, on='park_code', how='left')
    results_df = pd.merge(results_df,
                          self.amenities_parks_df[['park_code', 'amenity_name', 'amenity_url']],
                          on='park_code', how='left')
    results_df = pd.merge(results_df,
                          self.campgrounds_df[['park_code', 'campground_name', 'campground_url',
                                               'campground_road', 'campground_classification',
                                               'campground_general_ADA', 'campground_wheelchair_access',
                                               'campground_rv_info', 'campground_description',
                                               'campground_cell_reception', 'campground_camp_store',
                                               'campground_internet', 'campground_potable_water',
                                               'campground_toilets', 'campground_campsites_electric',
                                               'campground_staff_volunteer']],
                          on='park_code', how='left')
    results_df = pd.merge(results_df,
                          self.places_df[['park_code', 'places_title', 'places_url']],
                          on='park_code', how='left')
    results_df = pd.merge(results_df,
                          self.parking_lot_df[['park_code', 'parking_lots_name',
                                               'parking_lots_ADA_facility_description',
                                               'parking_lots_is_lot_accessible',
                                               'parking_lots_number_oversized_spaces',
                                               'parking_lots_number_ADA_spaces',
                                               'parking_lots_number_ADA_Step_Free_Spaces',
                                               'parking_lots_number_ADA_van_spaces',
                                               'parking_lots_description']],
                          on='park_code', how='left')
    # print(self.campgrounds_df.to_string(max_rows=20))
    print(results_df.to_string(max_rows=40))
Any help will be appreciated.
In general, you can group by park_code, collect the other columns into lists, and then transform the result to a dictionary:
df.groupby('park_code').agg({'camp_ground': list, 'parking_lot': list}).to_dict(orient='index')
Sample result:
{'acad': {'camp_ground': ['campground1', 'campground1', 'campground2'],
          'parking_lot': ['parking_lot1', 'parking_lot2', 'parking_lot3']},
 'bisc': {'camp_ground': ['campground3'], 'parking_lot': ['parking_lot4']}}
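Note the lists keep duplicates (campground1 appears twice for acad). If you want each list deduplicated, as in your desired output, a small variation works; here is a sketch against a toy frame that mirrors the example (your real frame has 30 columns):

import pandas as pd

# toy frame matching the example data above
df = pd.DataFrame({
    'park_code': ['acad', 'acad', 'acad', 'bisc'],
    'camp_ground': ['campground1', 'campground1', 'campground2', 'campground3'],
    'parking_lot': ['parking_lot1', 'parking_lot2', 'parking_lot3', 'parking_lot4'],
})

# aggregate order-preserving unique values instead of raw lists
result = (df.groupby('park_code')
            .agg({'camp_ground': lambda s: list(pd.unique(s)),
                  'parking_lot': lambda s: list(pd.unique(s))})
            .to_dict(orient='index'))
# {'acad': {'camp_ground': ['campground1', 'campground2'],
#           'parking_lot': ['parking_lot1', 'parking_lot2', 'parking_lot3']},
#  'bisc': {'camp_ground': ['campground3'], 'parking_lot': ['parking_lot4']}}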
I have a drug database saved in a SINGLE column of a CSV file that I can read with pandas. The file contains 750,000 rows and its elements are divided by "///". The column also ends with "///", and every row seems to end with ";".
I would like to split it into multiple columns in order to create a structured database. Capitalized words (drug information) like "ENTRY", "NAME", etc. will be the headers of these new columns.
So it has some structure, although the elements can be described by a different number and sort of information fields, meaning some elements will just have NaN in some cells. I have never worked with such an SQL-like format, and it is difficult to reproduce it as pandas code. Please see the screenshots for more information.
An example of desired output would look like this:
df = pd.DataFrame({
"ENTRY":["001", "002", "003"],
"NAME":["water", "ibuprofen", "paralen"],
"FORMULA":["H2O","C5H16O85", "C14H24O8"],
"COMPONENT":[NaN, NaN, "paracetamol"]})
I am guessing there will be a .split() involved, based on the CAPITALIZED words? A Python 3 solution would be appreciated; it could help a lot of people. Thanks!
I helped as much as I could:
import pandas as pd

cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']

# We create an auxiliary dataframe.
dfi = pd.DataFrame()

# We read the file as two fixed-width columns and keep only the needed keys.
df = pd.read_fwf(r'drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]

# To "flip" the dataframe, we first prepare an auxiliary column that indexes
# the groups running from one 'ENTRY' row to the next.
dfi['Key1'] = dfi['Key'] = df[(df['Key'] == 'ENTRY')].index
dfi = dfi.set_index('Key1')
df = df.join(dfi, lsuffix='_caller', rsuffix='_other')
df.fillna(method="ffill", inplace=True)  # note: .ffill() is the modern spelling
df = df.astype({"Key_other": "Int64"})

# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key_caller', values='Value')
df = df.reindex(columns=cols)

# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
Small code refactoring:
import pandas as pd

cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']

# We read the file as two fixed-width columns and keep only the needed keys.
df = pd.read_fwf(r'C:\Users\ф\drug\drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]

# To "flip" the dataframe, we first prepare an auxiliary column that indexes
# the groups running from one 'ENTRY' row to the next.
df['Key_other'] = None
df.loc[(df['Key'] == 'ENTRY'), 'Key_other'] = df[(df['Key'] == 'ENTRY')].index
df['Key_other'].fillna(method="ffill", inplace=True)  # note: .ffill() is the modern spelling

# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key', values='Value')
df = df.reindex(columns=cols)

# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df['NAME'] = df['NAME'].str.split(r'\(', expand=True)[0]
df.reset_index(drop=True, inplace=True)

pd.set_option('display.max_columns', 10)
print(df)
Key     ENTRY                                        NAME             FORMULA  \
0      D00001                                       Water                 H2O
1      D00002                                      Nadide       C21H28N7O14P2
2      D00003                                      Oxygen                  O2
3      D00004                              Carbon dioxide                 CO2
4      D00005                 Flavin adenine dinucleotide       C27H33N9O15P2
...       ...                                         ...                 ...
11983  D12452   Fostroxacitabine bralpamide hydrochloride  C22H30BrN4O8P. HCl
11984  D12453                                 Guretolimod        C24H34F3N5O4
11985  D12454                                Icenticaftor        C12H13F6N3O3
11986  D12455                              Lirafugratinib         C28H24FN7O2
11987  D12456                Lirafugratinib hydrochloride    C28H24FN7O2. HCl

Key   COMPONENT
0           NaN
1           NaN
2           NaN
3           NaN
4           NaN
...         ...
11983       NaN
11984       NaN
11985       NaN
11986       NaN
11987       NaN

[11988 rows x 4 columns]
It still needs a little more polishing, but I leave that to you.
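Since the records are delimited by "///", another possibility is to parse the file record-by-record in plain Python and only then build the frame. This is a sketch under assumptions: each field keyword starts at the beginning of its line, only the first line of a multi-line field is kept, and the filename 'drug' matches the read_fwf call above:

import re
import pandas as pd

cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
records = []
with open('drug') as f:
    # records are separated by '///'
    for chunk in f.read().split('///'):
        rec = {}
        for line in chunk.splitlines():
            # a field is an UPPERCASE keyword at line start, then whitespace, then a value
            m = re.match(r'([A-Z]+)\s+(.+)', line)
            if m and m.group(1) in cols and m.group(1) not in rec:
                rec[m.group(1)] = m.group(2).strip()
        if rec:
            records.append(rec)

df = pd.DataFrame(records, columns=cols)  # missing fields become NaN
# as above, the ENTRY value may still need splitting, e.g. 'D00001 ... Drug' -> 'D00001'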
I'm learning pandas method chaining and having trouble using str.contains and str.split in a chain. The data is one week's worth of information scraped from a Wikipedia page; I will be scraping several years' worth of weekly data.
This code without chaining works:
# list of data scraped from the web (renamed from `list`, which shadows the builtin):
weeks = ['Unnamed: 0', 'Preseason-Aug 11', 'Week 1-Aug 26', 'Week 2-Sep 2',
         'Week 3-Sep 9', 'Week 4-Sep 23', 'Week 5-Sep 30', 'eek 6-Oct 7', 'Week 7-Oct 14',
         'Week 8-Oct 21', 'Week 9-Oct 28', 'Week 10-Nov 4', 'Week 11-Nov 11', 'Week 12-Nov 18',
         'Week 13-Nov 25', 'Week 14Dec 2', 'Week 15-Dec 9', 'Week 16 (Final)-Jan 4', 'Unnamed: 18']

# load to dataframe:
df = pd.DataFrame(weeks)

# rename column 0 to "text":
df = df.rename(columns={0: "text"})

# remove rows that contain "Unnamed":
df = df[~df['text'].str.contains("Unnamed")]

# split "text" into 'week' and 'released' at the hyphen:
df[['week', 'released']] = df["text"].str.split(pat='-', expand=True)
Here's my attempt to rewrite it as a chain:
# load to dataframe:
df = pd.DataFrame(weeks)

# function to remove rows that contain "Unnamed":
def filter_unnamed(df):
    df = df[~df["text"].str.contains("Unnamed")]
    return df

clean_df = (df
    .rename(columns={0: "text"})
    .pipe(filter_unnamed)
    # [['week','released']] = lambda df_: df_["text"].str.split('-', expand=True)
)
The first line of the clean_df chain, renaming column 0, works.
The second line removes rows that contain "Unnamed"; it works, but is there a better way than using pipe and a function?
I'm having the most trouble with str.split in the third step (it doesn't work, so it's commented out). I tried assign for this and think it should work, but I don't know how to pass in the new column names ("week" and "released") with the str.split function.
Thanks for the help.
I also couldn't figure out how to create two columns in one go from the split... but I was able to do it by splitting twice and taking the two parts in succession (not ideal): df.assign(week = ...[0], released = ...[1]).
Note also that I reset the index.
(df.assign(week=df[0].str.split(pat='-', expand=True)[0],
           released=df[0].str.split(pat='-', expand=True)[1])
   [~df[0].str.contains("Unnamed")]
   .reset_index(drop=True)
   .rename(columns={0: "text"}))
I'm sure there's a sleeker way, but this may help.
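One sleeker possibility (my own sketch, not from the thread) keeps everything in a single chain: filter with .loc and a callable, and build both columns inside assign using .str[i] on the split:

import pandas as pd

weeks = ['Unnamed: 0', 'Preseason-Aug 11', 'Week 1-Aug 26']  # shortened sample list

clean_df = (
    pd.DataFrame(weeks)
      .rename(columns={0: "text"})
      .loc[lambda d: ~d["text"].str.contains("Unnamed")]
      .assign(week=lambda d: d["text"].str.split("-", n=1).str[0],
              released=lambda d: d["text"].str.split("-", n=1).str[1])
      .reset_index(drop=True)
)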
So, I'm new to Python and I have this dataframe with company names, country information and activity descriptions. I'm trying to group all this information by name, concatenating the country and activity strings.
First, I did something like this:
df3_['Country'] = df3_.groupby(['Name', 'Activity'])['Country'].transform(lambda x: ','.join(x))
df4_ = df3_.drop_duplicates()
df4_['Activity'] = df4_.groupby(['Name', 'Country'])['Activity'].transform(lambda x: ','.join(x))
This way I got a SettingWithCopyWarning, so I read a little about this warning and tried copying the dataframe before applying the functions (didn't work) and using .loc (didn't work either):
df3_.loc[:, 'Country'] = df3_.groupby(['Name', 'Activity'])['Country'].transform(lambda x: ','.join(x))
Any idea how to fix this?
Edit: I was asked to post an example of my data. The first picture is what I have; the second is what it should look like.
You want to group by the Company Name and then use some aggregating functions on the other columns, like:
df.groupby('Company Name').agg({'Country Code': ', '.join, 'Activity': ', '.join})
You were trying it the other way around.
Note that the empty string value ('') gets ugly with this aggregation, so you can filter it out with an aggregation like this:
df.groupby('Company Name').agg({'Country Code': lambda x: ', '.join(filter(None, x)), 'Activity': ', '.join})
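Both one-liners can be checked with a small toy frame (column names taken from the question; the data here is made up):

import pandas as pd

df = pd.DataFrame({
    'Company Name': ['A', 'A', 'B'],
    'Country Code': ['HK', '', 'US'],
    'Activity': ['Commerce', 'Transfer', 'Others'],
})

# filter(None, x) drops the empty-string country codes before joining
out = df.groupby('Company Name').agg({
    'Country Code': lambda x: ', '.join(filter(None, x)),
    'Activity': ', '.join,
})
print(out)
# expected output, roughly:
#              Country Code            Activity
# Company Name
# A                      HK  Commerce, Transfer
# B                      US              Others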
The following should work:

import pandas as pd

data = {
    'Country Code': ['HK', 'US', 'SG', 'US', '', 'US'],
    'Company Name': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Activity': ['External services', 'Commerce', 'Transfer', 'Others', 'Others', 'External services'],
}
df = pd.DataFrame(data)

# grouping
grp = df.groupby('Company Name')

# custom function that joins the values and strips any leading/trailing comma
# left behind by empty strings (equivalent to ','.join(ser.values).strip(','))
def str_replace(ser):
    s = ','.join(ser.values)
    if s[0] == ',':
        s = s[1:]
    if s[-1] == ',':
        s = s[:-1]
    return s

# using agg functions
res = grp.agg({'Country Code': str_replace, 'Activity': str_replace}).reset_index()
res
Output:
  Company Name Country Code                                    Activity
0            A  HK,US,SG,US  External services,Commerce,Transfer,Others
1            B           US                    Others,External services
Another approach, this time using transform():
# group the companies and concatenate the activities
df['Activities'] = df.groupby(['Company Name'])['Activity'] \
.transform(lambda x : ', '.join(x))
# group the companies and concatenate the country codes
df['Country Codes'] = df.groupby(['Company Name'])['Country Code'] \
.transform(lambda x : ', '.join([i for i in x if i != '']))
# the list comprehension deals with missing country codes (that have the value '')
# take this, drop the original columns and remove all the duplicates
result = df.drop(['Activity', 'Country Code'], axis=1) \
.drop_duplicates().reset_index(drop=True)
# reset index isn't really necessary
Result is:
  Company Name                                     Activities   Country Codes
0            A  External services, Commerce, Transfer, Others  HK, US, SG, US
1            B                      Others, External services              US
I have 2 dfs, dfMiss and dfSuper.
I need to create a final output that summarises the data in the 2 tables, which I am able to do, as shown in the code below:
dfCity = dfSuper \
    .groupby(by='City').count() \
    .drop(columns='Superhero ID') \
    .rename(columns={'Superhero': 'Total count'})

print("This is the df city : ")
print(dfCity)

## Convert column MissionEndDate to DateTime format
for df in dfMiss:
    # Dates are interpreted as MM/dd/yyyy by default, dayfirst=False
    df['Mission End date'] = pd.to_datetime(df['Mission End date'], dayfirst=True)
    # Get Year and Quarter, given Q1 2020 starts in April
    date = df['Mission End date'] - pd.DateOffset(months=3)
    df['Mission End quarter'] = date.dt.year.astype(str) + ' Q' + date.dt.quarter.astype(str)

## Get no. Superheros working per City per Quarter
dfCount = []
for dfM in dfMiss:
    # Merge DataFrames
    df = dfSuper.merge(dfM, left_on='Superhero ID', right_on='SID')
    df = df.pivot_table(index=['City', 'Superhero'], columns='Mission End quarter', aggfunc='nunique')
    # Get the first group (all the groups have the same values)
    df = df[df.columns[0][0]]
    # Group the values by City (effectively "collapsing" the 'Superhero' column)
    df = df.groupby(by=['City']).count()
    dfCount += [df]

## Get no. Superheros available per City per Quarter
dfFree = []
for dfC in dfCount:
    # Merge DataFrames
    df = dfCity.merge(right=dfC, on='City', how='outer').fillna(0)  # convert NaN values to 0
    # Subtract no. working superheros from total no. superheros per city
    for col in df.columns[1:]:
        df[col] = df['Total count'] - df[col]
    dfFree += [df.astype(int)]

print(dfFree)
dfResult = pd.DataFrame(dfFree)
The problem is that when I try to convert dfFree into a dataframe, I get the error:
"ValueError: Must pass 2-d input. shape=(1, 4, 5)"
The line that raises the error is
dfResult = pd.DataFrame(dfFree)
Anyone have any idea what this means and how I can convert the list into a df?
Thanks :)
Separate your code using SOLID principles (separation of concerns); as written, it is not easy to read. Here is a reworked version:
import pandas as pd

sid = [665544, 665544, 2121, 665544, 212121, 123456, 666666]
mission_end_date = ["10/10/2020", "03/03/2021", "02/02/2021", "05/12/2020", "15/07/2021", "03/06/2021", "12/10/2020"]
superhero_sid = [212121, 364331, 678523, 432432, 665544, 123456, 555555, 666666, 432432]
hero = ["Spiderman", "Ironman", "Batman", "Dr. Strange", "Thor", "Superman", "Nightwing", "Loki", "Wolverine"]
city = ["New York", "New York", "Gotham", "New York", "Asgard", "Metropolis", "Gotham", "Asgard", "New York"]

df_mission = pd.DataFrame({'sid': sid, 'mission_end_date': mission_end_date})
df_super = pd.DataFrame({'sid': superhero_sid, 'hero': hero, 'city': city})

df = df_super.merge(df_mission, on="sid", how="left")
# note: day-first strings like "15/07/2021" may need dayfirst=True here
df['mission_end_date'] = pd.to_datetime(df['mission_end_date'])
df['mission_end_date_quarter'] = df['mission_end_date'].dt.quarter
df['mission_end_date_year'] = df['mission_end_date'].dt.year
print(df.head(20))

pivot = df.pivot_table(index=['city', 'hero'], columns='mission_end_date_quarter', aggfunc='nunique').fillna(0)
print(pivot.head())
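As for the ValueError itself: pd.DataFrame expects 2-d input, but dfFree is a list of DataFrames, i.e. a 3-d structure once stacked (hence shape=(1, 4, 5)). If the goal is one combined frame, pd.concat is a likely fix (a sketch, assuming the frames in dfFree share their columns):

import pandas as pd

# stack the per-mission frames row-wise instead of wrapping the list itself
dfResult = pd.concat(dfFree, ignore_index=True)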
I have a sheet that looks like this.
Fleet Risk Control   Communication   Interpersonal relationships   Demographic   Demographic
Q_21086              Q_21087         Q_21088                       AGE           GENDER
1                    3               4                             27            Male
What I'm trying to achieve: wherever a row contains a value starting with 'Q_', merge that cell into the column header above it and return a new dataframe.
So the existing data above would become something like this:
Fleet Risk Control - Q_21086   Communication - Q_21087   Interpersonal relationships - Q_21088
1                              3                         4
I honestly have no idea where to even begin with something like this.
You could try this one. First, recreate the input:
import pandas as pd

df = pd.DataFrame({'Fleet Risk Control': ['Q_21086', 1],
                   'Communication': ['Q_21087', 3],
                   'Interpersonal relationships': ['Q_21088', 4],
                   'Demographic': ['AGE', 27],
                   'Demographic 2': ['Gender', 'Male']})
Now concatenate the header row with the first data row of df:
df.columns = df.columns + ' - ' + df.iloc[0, :]
Then keep every row except the first and drop the last two columns:
df = df.iloc[1:, :-2]
Alternatively, starting again from the original df, do it in one pass:
# rename columns: append the first-row value where it starts with 'Q_'
df.columns = [x + ' - ' + y if y.startswith('Q_') else x
              for x, y in zip(df.columns, df.iloc[0])]
# drop the columns that do not match, then drop the header row itself
to_drop = [c for c, flag in df.iloc[0].apply(lambda x: not x.startswith('Q_')).items() if flag]
df.drop(to_drop, axis=1)[1:]
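Either way, for the sample frame above the result should look like this (row index 1 carried over from the original frame):

  Fleet Risk Control - Q_21086 Communication - Q_21087 Interpersonal relationships - Q_21088
1                            1                       3                                     4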