I'm new to Python and I have a dataframe with company names, country information, and activity descriptions. I'm trying to group all this information by name, concatenating the country and activity strings.
First, I did something like this:
df3_['Country'] = df3_.groupby(['Name', 'Activity'])['Country'].transform(lambda x: ','.join(x))
df4_ = df3_.drop_duplicates()
df4_['Activity'] = df4_.groupby(['Name', 'Country'])['Activity'].transform(lambda x: ','.join(x))
This way I got a 'SettingWithCopyWarning', so I read a little about this warning and tried copying the dataframe before applying the functions (didn't work) and using .loc (didn't work either):
df3_.loc[:, 'Country'] = df3_.groupby(['Name', 'Activity'])['Country'].transform(lambda x: ','.join(x))
Any idea how to fix this?
Edit: I was asked to post an example of my data. The first pic is what I have, the second one is what it should look like
You want to group by the Company Name and then use some aggregating functions for the other columns, like:
df.groupby('Company Name').agg({'Country Code':', '.join, 'Activity':', '.join})
You were trying it the other way around.
Note that the empty string value ('') makes the output ugly with this aggregation, so you can handle it with a slightly more involved aggregation like this:
df.groupby('Company Name').agg({'Country Code':lambda x: ', '.join(filter(None,x)), 'Activity':', '.join})
The following should work:
import pandas as pd

data = {
    'Country Code': ['HK', 'US', 'SG', 'US', '', 'US'],
    'Company Name': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Activity': ['External services', 'Commerce', 'Transfer', 'Others', 'Others', 'External services'],
}
df = pd.DataFrame(data)

# grouping
grp = df.groupby('Company Name')

# custom function that joins the values and strips the stray commas
# left at either end by empty country codes
def str_replace(ser):
    s = ','.join(ser.values)
    if s.startswith(','):
        s = s[1:]
    if s.endswith(','):
        s = s[:-1]
    return s

# using agg functions
res = grp.agg({'Country Code': str_replace, 'Activity': str_replace}).reset_index()
res
Output:
Company Name Country Code Activity
0 A HK,US,SG,US External services,Commerce,Transfer,Others
1 B US Others,External services
Another approach, this time using transform():
# group the companies and concatenate the activities
df['Activities'] = df.groupby(['Company Name'])['Activity'] \
                     .transform(lambda x: ', '.join(x))

# group the companies and concatenate the country codes;
# the list comprehension deals with missing country codes (that have the value '')
df['Country Codes'] = df.groupby(['Company Name'])['Country Code'] \
                        .transform(lambda x: ', '.join([i for i in x if i != '']))

# take this, drop the original columns and remove all the duplicates
# (the reset_index isn't strictly necessary)
result = df.drop(['Activity', 'Country Code'], axis=1) \
           .drop_duplicates().reset_index(drop=True)
The result is:
Company Name Activities Country Codes
0 A External services, Commerce, Transfer, Others HK, US, SG, US
1 B Others, External services US
I have a data set in a dataframe that's almost 9 million rows and 30 columns. As you move across the columns, the data becomes more specific, which makes the data in the first columns very repetitive. See example:
park_code   camp_ground    parking_lot
acad        campground1    parking_lot1
acad        campground1    parking_lot2
acad        campground2    parking_lot3
bisc        campground3    parking_lot4
I'm looking to feed that information into a result set, like an object, for example:
park code: acad
campgrounds: campground1, campground2
parking lots: parking_lot1, parking_lot2, parking_lot3
park code: bisc
campgrounds: campground3, ...
...
etc.
I'm completely at a loss as to how to do this with pandas, and I'm learning as I go, since I'm used to working with SQL and databases rather than pandas. If you want to see the code that's gotten me this far, here it is:
function call:
data_handler.fetch_results(['Wildlife Watching', 'Arts and Culture'], ['Restroom'],
                           ['Acadia National Park'], ['ME'])
def fetch_results(self, activities_selection, amenities_selection, parks_selection, states_selection):
    activities_selection_df = self.activities_df['park_code'][
        self.activities_df['activity_name'].isin(activities_selection)].drop_duplicates()
    amenities_selection_df = self.amenities_parks_df['park_code'][
        self.amenities_parks_df['amenity_name'].isin(amenities_selection)].drop_duplicates()
    states_selection_df = self.activities_df['park_code'][
        self.activities_df['park_states'].isin(states_selection)].drop_duplicates()
    parks_selection_df = self.activities_df['park_code'][
        self.activities_df['park_name'].isin(parks_selection)].drop_duplicates()
    data = activities_selection_df[activities_selection_df.isin(amenities_selection_df) &
                                   activities_selection_df.isin(states_selection_df) &
                                   activities_selection_df.isin(parks_selection_df)].drop_duplicates()
    pandas_select_df = pd.DataFrame(data, columns=['park_code'])
    results_df = pd.merge(pandas_select_df, self.activities_df, on='park_code', how='left')
    results_df = pd.merge(results_df, self.amenities_parks_df[['park_code', 'amenity_name', 'amenity_url']],
                          on='park_code', how='left')
    results_df = pd.merge(results_df, self.campgrounds_df[['park_code', 'campground_name', 'campground_url',
                                                           'campground_road', 'campground_classification',
                                                           'campground_general_ADA',
                                                           'campground_wheelchair_access',
                                                           'campground_rv_info', 'campground_description',
                                                           'campground_cell_reception', 'campground_camp_store',
                                                           'campground_internet', 'campground_potable_water',
                                                           'campground_toilets',
                                                           'campground_campsites_electric',
                                                           'campground_staff_volunteer']], on='park_code',
                          how='left')
    results_df = pd.merge(results_df, self.places_df[['park_code', 'places_title', 'places_url']],
                          on='park_code', how='left')
    results_df = pd.merge(results_df, self.parking_lot_df[
        ['park_code', "parking_lots_name", "parking_lots_ADA_facility_description",
         "parking_lots_is_lot_accessible", "parking_lots_number_oversized_spaces",
         "parking_lots_number_ADA_spaces",
         "parking_lots_number_ADA_Step_Free_Spaces", "parking_lots_number_ADA_van_spaces",
         "parking_lots_description"]], on='park_code', how='left')
    # print(self.campgrounds_df.to_string(max_rows=20))
    print(results_df.to_string(max_rows=40))
Any help will be appreciated.
In general, you can group by park_code and collect the other columns into lists, then transform the result to a dictionary:
df.groupby('park_code').agg({'camp_ground': list, 'parking_lot': list}).to_dict(orient='index')
Sample result:
{'acad': {'camp_ground': ['campground1', 'campground1', 'campground2'],
          'parking_lot': ['parking_lot1', 'parking_lot2', 'parking_lot3']},
 'bisc': {'camp_ground': ['campground3'], 'parking_lot': ['parking_lot4']}}
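If you want the duplicates collapsed, as in the expected output above, you could swap in a deduplicating aggregation instead of list. A minimal sketch, assuming first-appearance order is what you want:
import pandas as pd

df = pd.DataFrame({
    'park_code': ['acad', 'acad', 'acad', 'bisc'],
    'camp_ground': ['campground1', 'campground1', 'campground2', 'campground3'],
    'parking_lot': ['parking_lot1', 'parking_lot2', 'parking_lot3', 'parking_lot4'],
})

# dict.fromkeys drops duplicates while keeping first-appearance order
dedup = lambda s: list(dict.fromkeys(s))
res = df.groupby('park_code').agg({'camp_ground': dedup, 'parking_lot': dedup})
print(res.to_dict(orient='index'))
# {'acad': {'camp_ground': ['campground1', 'campground2'],
#           'parking_lot': ['parking_lot1', 'parking_lot2', 'parking_lot3']},
#  'bisc': {'camp_ground': ['campground3'], 'parking_lot': ['parking_lot4']}}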
I'm trying to do a Twitter sentiment analysis between Johnny Depp and Amber Heard. I've extracted the data for the period of 2021, and the Pandas DataFrames for both individuals are stored in the df_dict dictionary described below. The error I am receiving is: unhashable type: 'Series'.
As far as I've learnt, this error happens when you use an unhashable object, such as a Series, where a hashable value is expected. I first tested it with a single key but got the same error. I've hit a roadblock and don't know how to solve this issue.
This is my preprocess method:
def preprocess(df_dict, remove_rows, keep_rows):
    for key, df in df_dict.items():
        print(key)
        initial_count = len(df_dict[key])
        df_dict[key] = (
            df
            # Make everything lower case
            .assign(Text=lambda x: x['Text'].str.lower())
            # Keep the rows that mention the name
            .query(f'Text.str.contains("{keep_rows[key]}")')
            # Remove the rows that mention the other three people
            .query(f'~Text.str.contains("{remove_rows[key]}")')
            # Remove all the URLs
            .assign(Text=lambda x: x['Text'].apply(lambda s: re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', s)))
        )
        final_count = len(df_dict[key])
        print("%d tweets kept out of %d" % (final_count, initial_count))
    return df_dict
This is the code I'm using to call the preprocess method:
df_dict = {
    'johnny depp': johnny_data,
    'amber heard': amber_data,
}
remove_rows = {
    'johnny depp': 'amber|heard|camila|vasquez|shannon|curry',
    'amber heard': 'johnny|depp|camila|vasquez|shannon|curry',
}
keep_rows = {
    'johnny depp': 'johnny|depp',
    'amber heard': 'amber|heard',
}
df_test_data = preprocess(df_dict, remove_rows, keep_rows)
I hope I've explained my issue clearly, and since this is my first post here, I also hope I've followed all the usual posting conventions.
Since DataFrame.query (with its default numexpr engine) is really meant for simple logical operations, you cannot access the Series string methods of columns inside it. As a workaround, consider assigning flag columns to then query against. Consider also Series.str.replace for the regex clean-up.
df_dict[key] = (
    df
    # Make everything lower case
    .assign(
        Text=lambda x: x['Text'].str.lower(),
        keep_flag=lambda x: x['Text'].str.contains(keep_rows[key]),
        drop_flag=lambda x: x['Text'].str.contains(remove_rows[key])
    )
    # Keep the rows that mention the name
    .query("keep_flag == True")
    # Remove the rows that mention the other three people
    .query("drop_flag == False")
    # Remove all the URLs
    .assign(
        Text=lambda x: x['Text'].str.replace(
            r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*',
            '',
            regex=True
        )
    )
    # Drop the helper flag columns
    .drop(["keep_flag", "drop_flag"], axis="columns")
)
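For what it's worth, DataFrame.query also accepts engine='python', which can evaluate the Series string methods that the default numexpr engine rejects. A minimal sketch of the original chain under that assumption:
df_dict[key] = (
    df
    # Make everything lower case
    .assign(Text=lambda x: x['Text'].str.lower())
    # the python engine evaluates .str methods that numexpr cannot
    .query(f'Text.str.contains("{keep_rows[key]}")', engine='python')
    .query(f'~Text.str.contains("{remove_rows[key]}")', engine='python')
)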
Let's say I have the following match data in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a Python program. Somewhere in my program I have the scores collected for a match stored in a list, say x = [1,0,4]. I have found where these scores exist in the data using pandas, and I can print "found" or "not found". However, I want my code to print out which name these scores correspond to. In this case the program should output "Charlie", since Charlie has all the values [1,0,4]. How can I do that?
I will have a large set of data so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easier if you set name as the index column directly when you read in the CSV:
import pandas as pd
from io import StringIO
df = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(df), index_col='name')
x = pd.Series({'match1':1, 'match2':0, 'match3':4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want, because it compares element-wise and returns a frame of True/False values. So you need to aggregate each row with .all(axis=1), which finds the rows where all comparison results were True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
You can use DataFrame.loc, which would work like this:
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
Maybe try something like this:
import pandas as pd
import numpy as np

# Make sample data
match1 = np.array([2, 2, 1])
match2 = np.array([4, 4, 0])
match3 = np.array([3, 3, 4])
name = np.array(['Alice', 'Bob', 'Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2': match2, 'match3': match3})
df

# example of the list you want to get the data from
x = [1, 0, 4]
# x = [2, 4, 3]

# should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])]

# Make a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])])

# Loop that prints out the names based on the index of mydf
# (if there is more than one matching name it will print all of them;
# if there is only one, it will print just that one)
for i in range(0, len(mydf)):
    print(mydf['name'].iloc[i])
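For what it's worth, the final loop can be collapsed into a single call over the same mask; a small sketch:
# same mask as above, with the matching names collected into a plain list
mask = (df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])
print(df.loc[mask, 'name'].tolist())  # ['Charlie']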
You can use this. Here data is your dataframe; change the name to match your own. Assuming the values in [1,0,4] are of int type:
data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])
If the data is of object (string) type, then use this instead:
data = data[(data['match1'] == "1") & (data['match2'] == "0") & (data['match3'] == "4")].index
print(data[0])
I have a dataframe as given below:
df = pd.DataFrame({
'date' :['2173-04-03 12:35:00','2173-04-03 17:00:00','2173-04-03 20:00:00','2173-04-04 11:00:00','2173-04-04 12:00:00','2173-04-04 11:30:00','2173-04-04 16:00:00','2173-04-04 22:00:00','2173-04-05 04:00:00'],
'subject_id':[1,1,1,1,1,1,1,1,1],
'val' :[5,5,5,10,10,5,5,8,8]
})
I would like to apply a couple of logics (logic_1 on the val column and logic_2 on the date column) to the code. Please find the logic below:
logic_1 = lambda x: (x.shift(2).ge(x.shift(1))) & (x.ge(x.shift(2).add(3))) & (x.eq(x.shift(-1)))
logic_2 = lambda y: (y.shift(1).ge(1)) & (y.shift(2).ge(2)) & (y.shift(-1).ge(1))
Credit to SO users for helping me with the logic.
This is what I tried:
df['label'] = ''
df['date'] = pd.to_datetime(df['date'])
df['tdiff'] = df['date'].shift(-1) - df['date']
df['tdiff'] = df['tdiff'].dt.total_seconds()/3600
df['lo_1'] = df.groupby('subject_id')['val'].transform(logic_1).map({True:'1',False:''})
df['lo_2'] = df.groupby('subject_id')['tdiff'].transform(logic_2).map({True:'1',False:''})
How can I make both logic_1 and logic_2 part of one logic statement? Is it even possible? I might have more than 2 logics as well. Instead of writing one line for each logic, is it possible to couple all the logics together in one statement?
I expect the label column in my output to be filled with 1 when both logic_1 and logic_2 are satisfied.
You have a few things to fix.
First, in logic_2, you have lambda x but use y, so you have to change that as below:
logic_2 = lambda y: (y.shift(1).ge(1)) & (y.shift(2).ge(2)) & (y.shift(-1).ge(1))
Then you can use the logics together as below.
There is no need to create a blank label column first; you can create the label column directly, like this:
df['label'] = ((df.groupby('subject_id')['val'].transform(logic_1))
               & (df.groupby('subject_id')['tdiff'].transform(logic_2))).map({True:'0',False:'1'})
Note: your logic produces all False values, so you will only get 1's if False is mapped to '1', not True.
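If the number of logics keeps growing, one way to avoid writing a separate line per logic is to reduce over the transformed masks. A minimal sketch, assuming each entry pairs a column name with the logic applied to it:
from functools import reduce
import operator

# pair each column with the logic to apply to it; extend this list as needed
logics = [('val', logic_1), ('tdiff', logic_2)]

# one boolean mask per logic, then AND them all together element-wise
masks = [df.groupby('subject_id')[col].transform(fn) for col, fn in logics]
df['label'] = reduce(operator.and_, masks).map({True: '1', False: ''})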
I am trying to filter a CSV file on a certain date in a certain column.
I am using pandas for that (total noob) and was pretty successful until I got to dates.
The CSV looks something like this (with more columns and rows, of course):
Circuit    Status         Effective Date
XXXX001    Operational    31-DEC-2007
I tried DataFrame.query (which I use for everything else) without success.
I tried DataFrame.loc (which worked for everything else) without success.
How can I get all rows that are older or newer than a given date? And if I have other conditions to filter the dataframe by, how do I combine them with the date filter?
Here's my "raw" code:
import pandas as pd
# parse_dates = ['Effective Date']
# dtypes = {'Effective Date': 'str'}
df = pd.read_csv("example.csv", dtype=object)
# , parse_dates=parse_dates, infer_datetime_format=True
# tried a lot of suggestions found on SO
cols = df.columns
cols = cols.map(lambda x: x.replace(' ', '_'))
df.columns = cols
status1 = 'Suppressed'
status2 = 'Order Aborted'
pool = '2'
region = 'EU'
date1 = '31-DEC-2017'
filt_df = df.query('Status != @status1 and Status != @status2 and Pool == @pool and Region_A == @region')
filt_df.reset_index(drop=True, inplace=True)
filt_df.to_csv('filtered.csv')
# this is working pretty well
supp_df = df.query('Status == @status1 and Effective_Date < @date1')
supp_df.reset_index(drop=True, inplace=True)
supp_df.to_csv('supp.csv')
# this is what is not working at all
I tried many approaches, but I was not able to put it together. This is just one of the many approaches I tried, so I know it is perhaps completely wrong, as no date parsing is used.
supp.csv gets saved, but the dates in it are all over the place, so there's no match with the "logic" in this code.
Thanks for any help!
Make sure you convert your date column to datetime and then filter/slice on it.
df['Effective Date'] = pd.to_datetime(df['Effective Date'])
df[df['Effective Date'] < '2017-12-31']
# This returns all the rows with dates before the 31st of December, 2017.
# You can also use query(), as shown below.
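A minimal sketch of the query() route, assuming the column renaming from the question and that the DD-MON-YYYY strings parse with an explicit format:
import pandas as pd

df = pd.read_csv("example.csv")
df.columns = df.columns.map(lambda x: x.replace(' ', '_'))

# parse the '31-DEC-2007'-style strings into real datetimes
df['Effective_Date'] = pd.to_datetime(df['Effective_Date'], format='%d-%b-%Y')

# the date filter now combines cleanly with the other conditions inside query()
status1 = 'Suppressed'
date1 = pd.Timestamp('2017-12-31')
supp_df = df.query('Status == @status1 and Effective_Date < @date1')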