I am trying to expand a column of a pandas dataframe
(see the segments column in the example below).
I am able to break it out into the components separated by ;
However, as you can see, some of the rows do not have all the
elements. So what is happening is that data which should go into
the Geo column ends up in the BusSeg column when there is no
BusSeg entry, or data that should be in the ProdServ column ends
up in the Geo column.
Ideally I would like each cell to hold only the data, correctly
placed, without the indicator. So in the Geo column it should say
'NonUs', not 'Geo=NonUs'. That is, after separating correctly, I
would like to remove the text up to and including the '=' sign in
each cell. How can I do this?
Code below:
import pandas as pd
company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
'Subseg=Tr1',
'BusSeg=Pharma',
'Geo=China;Prd=Alpha;Subseg=Tr4;',
'Prd=Beta;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
'BusSeg=Pharma;Geo=NonUs;']
print("\ndf1:")
df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True)
print(df1)
print(df1[['BusSeg','Geo','ProdServ','Sub','Misc']])
print(df1.dtypes)
print()
Your Data
import pandas as pd
company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
'Subseg=Tr1',
'BusSeg=Pharma',
'Geo=China;Prd=Alpha;Subseg=Tr4;',
'Prd=Beta;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
'BusSeg=Pharma;Geo=NonUs;']
df:
company clv date line segments
0 Rev 500 20191231 1 BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1
1 Rev 200 20191231 3 BusSeg=Dev;Prd=Alpha;Subseg=Tr1
2 Rev 3000 20191231 2 BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2
3 Rev 400 20181231 1 Subseg=Tr1
4 Rev 10 20181231 3 BusSeg=Pharma
5 Rev 300 20181231 2 Geo=China;Prd=Alpha;Subseg=Tr4;
6 Rev 560 20171231 1 Prd=Beta;Subseg=Tr1
7 Rev 500 20171231 3 BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;
8 Rev 600 20171231 2 BusSeg=Pharma;Geo=NonUs;
Comment out this line df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True) in your code, and add these two lines:
d = pd.DataFrame(df1['segments'].str.split(';')
                 .apply(lambda x: {i.split('=')[0]: i.split('=')[1] for i in x if i})
                 .to_dict()).T
df = pd.concat([df1, d], axis=1)
df:
company clv date line segments BusSeg Geo Prd Subseg
0 Rev 500 20191231 1 BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1 Pharma NonUs Alpha Tr1
1 Rev 200 20191231 3 BusSeg=Dev;Prd=Alpha;Subseg=Tr1 Dev NaN Alpha Tr1
2 Rev 3000 20191231 2 BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2 Pharma US Alpha Tr2
3 Rev 400 20181231 1 Subseg=Tr1 NaN NaN NaN Tr1
4 Rev 10 20181231 3 BusSeg=Pharma Pharma NaN NaN NaN
5 Rev 300 20181231 2 Geo=China;Prd=Alpha;Subseg=Tr4; NaN China Alpha Tr4
6 Rev 560 20171231 1 Prd=Beta;Subseg=Tr1 NaN NaN Beta Tr1
7 Rev 500 20171231 3 BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1; Pharma US Delta Tr1
8 Rev 600 20171231 2 BusSeg=Pharma;Geo=NonUs; Pharma NonUs NaN NaN
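For reference, the same reshaping can be done in one pass with str.extractall, which pulls out every key=value pair and then pivots the matches back into one column per key. A sketch (the names pairs and wide are just for illustration):
# Each match becomes one row of a MultiIndexed frame; pivoting on the
# captured key turns the pairs back into columns.
pairs = df1['segments'].str.extractall(r'(?P<key>\w+)=(?P<val>\w+)')
wide = pairs.reset_index().pivot(index='level_0', columns='key', values='val')
df = pd.concat([df1, wide], axis=1)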
I suggest filling the columns one by one instead of using split, with something like the following code:
col = ['BusSeg', 'Geo', 'ProdServ', 'Sub']  # Column names.
var = ['BusSeg', 'Geo', 'Prd', 'Subseg']    # Variable names in the 'segments' column.
for c, v in zip(col, var):
    # Dropping the trailing ';' from the pattern lets it match values at the
    # end of the string as well; \w* already stops at the next ';'.
    df1[c] = df1['segments'].str.extract(rf'{v}=(\w*)')
Here's a suggestion:
df1.segments = (df1.segments.str.split(';')
                .apply(lambda s: dict(t.split('=') for t in s if t.strip() != '')))
df2 = pd.DataFrame({col: [dict_.get(col, '') for dict_ in df1.segments]
                    for col in set().union(*df1.segments)},
                   index=df1.index)
df1.drop(columns=['segments'], inplace=True)
df1 = pd.concat([df1, df2], axis='columns')
Result:
company clv date line Subseg Geo BusSeg Prd
0 Rev 500 20191231 1 Tr1 NonUs Pharma Alpha
1 Rev 200 20191231 3 Tr1 Dev Alpha
2 Rev 3000 20191231 2 Tr2 US Pharma Alpha
3 Rev 400 20181231 1 Tr1
4 Rev 10 20181231 3 Pharma
5 Rev 300 20181231 2 Tr4 China Alpha
6 Rev 560 20171231 1 Tr1 Beta
7 Rev 500 20171231 3 Tr1 US Pharma Delta
8 Rev 600 20171231 2 NonUs Pharma
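A small note on the df2 construction: once segments holds dicts (after the first step above), pandas can build the same frame directly from the list of dicts. A sketch:
# pd.DataFrame takes the union of the dict keys as columns and fills
# missing keys with NaN; fillna('') matches the output above.
df2 = pd.DataFrame(df1.segments.tolist(), index=df1.index).fillna('')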
I have 2 dataframes. rdf is the reference dataframe I am trying to use to define each interval (Top and Bottom) to calculate an average between (over all of the depths within that interval), while ldf actually contains the values the calculation runs on. rdf defines the top and bottom for each ID an average should be run for, and there are multiple intervals for each ID.
rdf is formatted as such:
ID Top Bottom
1 2010 3000
1 4300 4500
1 4550 5000
1 7100 7700
2 3200 4100
2 4120 4180
2 4300 5300
2 5500 5520
3 2300 2380
3 3200 4500
ldf is formatted as such:
ID Depth(ft) Value1 Value2 Value3
1 2000 45 .32 423
1 2000.5 43 .33 500
1 2001 40 .12 643
1 2001.5 28 .10 20
1 2002 40 .10 34
1 2002.5 23 .11 60
1 2003 34 .08 900
1 2003.5 54 .04 1002
2 2000 40 .28 560
2 2000 38 .25 654
...
3 2000 43 .30 343
I want to use rdf to define the top and bottom of each interval and calculate the average of Value1, Value2, and Value3 within it. I would also like a count documented as well (not all of the values between the intervals necessarily exist, so it could be less than Bottom - Top). This will then modify rdf to make a new file:
new_rdf is formatted as such:
ID Top Bottom avgValue1 avgValue2 avgValue3 ThicknessCount(ft)
1 2010 3000 54 .14 456 74
1 4300 4500 23 .18 632 124
1 4550 5000 34 .24 780 111
1 7100 7700 54 .19 932 322
2 3200 4100 52 .32 134 532
2 4120 4180 16 .11 111 32
2 4300 5300 63 .29 872 873
2 5500 5520 33 .27 1111 9
3 2300 2380 63 .13 1442 32
3 3200 4500 37 .14 1839 87
I've been going back and forth on the best way to do this. I tried mimicking this time series example: Sum set of values from pandas dataframe within certain time frame
but it doesn't seem translatable:
import pandas as pd
Top = rdf['Top']
Bottom = rdf['Bottom']
Depths = ldf['DEPTH']
def get_depths(x):
    n = ldf[(ldf['DEPTH'] > x['top']) & (ldf['DEPTH'] < x['bottom'])]
    return n['ID'].values[0], n['DEPTH'].sum()
test = pd.DataFrame({'top':Top, 'bottom':Bottom})
test[['ID','Value1']] = test.apply(lambda x : get_depths(x),1).apply(pd.Series)
I get "TypeError: Invalid comparison between dtype=float64 and str"
It works if I use the samples from that post, but it doesn't work with my data. I'm also hoping there's a simpler way to do this.
Edit # 2A:
Note:
The sample DataFrames below are not exactly the same as those posted in the question.
Posting new code here that uses Top and Bottom from rdf to check DEPTH in ldf and calculate .mean() for each group using a for-loop. A range_key is created in rdf that is unique to each row, assuming that the DataFrame rdf does not have any duplicates.
# Import libraries
import numpy as np  # needed for np.nan and np.linspace below
import pandas as pd
# Create DataFrame
rdf = pd.DataFrame({
    'ID': [1,1,1,1,2,2,2,2,3,3],
    'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
    'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
    'ID': [1,1,1,1,1,1,1,1,2,2,3],
    'DEPTH': [2000,2000.5,2001,2001.5,4002,4002.5,5003,5003.5,2000,2000,2000],
    'Value1':[45,43,40,28,40,23,34,54,40,38,43],
    'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
    'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})
# Create a key for merge later
ldf['range_key'] = np.nan
rdf['range_key'] = np.linspace(1,rdf.shape[0],rdf.shape[0]).astype(int).astype(str)
# Flag each row for a range
for i in range(ldf.shape[0]):
    for j in range(rdf.shape[0]):
        d = ldf['DEPTH'][i]
        if (d >= rdf['Top'][j]) and (d <= rdf['Bottom'][j]):
            rkey = rdf['range_key'][j]
            ldf.loc[i, 'range_key'] = rkey  # .loc avoids chained assignment
            break
ldf['range_key'] = ldf['range_key'].astype(int).astype(str) # Convert to string
# Calculate mean for groups
ldf_mean = ldf.groupby(['ID','range_key']).mean().reset_index()
ldf_mean = ldf_mean.drop(['DEPTH'], axis=1)
# Merge into 'rdf'
new_rdf = rdf.merge(ldf_mean, on=['ID','range_key'], how='left')
new_rdf = new_rdf.drop(['range_key'], axis=1)
new_rdf
Output:
ID Top Bottom Value1 Value2 Value3
0 1 2000 2500 39.0 0.2175 396.5
1 1 4300 4500 NaN NaN NaN
2 1 4500 5000 NaN NaN NaN
3 1 7100 7700 NaN NaN NaN
4 2 3200 4100 NaN NaN NaN
5 2 4120 4180 NaN NaN NaN
6 2 4300 5300 NaN NaN NaN
7 2 5500 5520 NaN NaN NaN
8 3 2300 2380 NaN NaN NaN
9 3 3200 4500 NaN NaN NaN
Edit # 1:
The code below seems to work. I added an if-statement to the return from the code posted in the question above; not sure if this is what you were looking to get. It calculates the .sum(). The first interval in rdf is changed to a lower range to match the data in ldf.
# Import libraries
import pandas as pd
# Create DataFrame
rdf = pd.DataFrame({
    'ID': [1,1,1,1,2,2,2,2,3,3],
    'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
    'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
    'ID': [1,1,1,1,1,1,1,1,2,2,3],
    'DEPTH': [2000,2000.5,2001,2001.5,2002,2002.5,2003,2003.5,2000,2000,2000],
    'Value1':[45,43,40,28,40,23,34,54,40,38,43],
    'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
    'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})
##### Code from the question (copy-pasted here)
Top = rdf['Top']
Bottom = rdf['Bottom']
Depths = ldf['DEPTH']
def get_depths(x):
    n = ldf[(ldf['DEPTH'] > x['top']) & (ldf['DEPTH'] < x['bottom'])]
    if n.shape[0] > 0:
        return n['ID'].values[0], n['DEPTH'].sum()
test = pd.DataFrame({'top':Top, 'bottom':Bottom})
test[['ID','Value1']] = test.apply(lambda x : get_depths(x),1).apply(pd.Series)
Output:
test
top bottom ID Value1
0 2000 2500 1.0 14014.0
1 4300 4500 NaN NaN
2 4500 5000 NaN NaN
3 7100 7700 NaN NaN
4 3200 4100 NaN NaN
5 4120 4180 NaN NaN
6 4300 5300 NaN NaN
7 5500 5520 NaN NaN
8 2300 2380 NaN NaN
9 3200 4500 NaN NaN
Sample data and imports
import pandas as pd
import numpy as np
import random
# dfr
rdata = {'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
'Top': [2010, 4300, 4550, 7100, 3200, 4120, 4300, 5500, 2300, 3200],
'Bottom': [3000, 4500, 5000, 7700, 4100, 4180, 5300, 5520, 2380, 4500]}
dfr = pd.DataFrame(rdata)
# display(dfr.head())
ID Top Bottom
0 1 2010 3000
1 1 4300 4500
2 1 4550 5000
3 1 7100 7700
4 2 3200 4100
# df
np.random.seed(365)
random.seed(365)
rows = 10000
data = {'id': [random.choice([1, 2, 3]) for _ in range(rows)],
'depth': [np.random.randint(2000, 8000) for _ in range(rows)],
'v1': [np.random.randint(40, 50) for _ in range(rows)],
'v2': np.random.rand(rows),
'v3': [np.random.randint(20, 1000) for _ in range(rows)]}
df = pd.DataFrame(data)
df.sort_values(['id', 'depth'], inplace=True)
df.reset_index(drop=True, inplace=True)
# display(df.head())
id depth v1 v2 v3
0 1 2004 48 0.517014 292
1 1 2004 41 0.997347 859
2 1 2006 42 0.278217 851
3 1 2006 49 0.570363 32
4 1 2009 43 0.462985 409
Use each row of dfr to filter and extract stats from df
There are plenty of answers on SO dealing with "TypeError: Invalid comparison between dtype=float64 and str". The numeric columns need to be cleaned of any value that can't be converted to a numeric type.
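For instance, a minimal sketch, assuming the stray strings sit in the depth column:
# Coerce anything non-numeric to NaN so float comparisons cannot raise,
# then drop the rows that failed to parse.
df['depth'] = pd.to_numeric(df['depth'], errors='coerce')
df = df.dropna(subset=['depth'])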
This code deals with using one dataframe to filter and return metrics for another dataframe.
For each row in dfr:
Filter df
Aggregate the mean and count for v1, v2 and v3
.T to transpose the mean and count rows to columns
Convert to a numpy array
Slice the array for the 3 means and append it to v_mean
Slice the array for the max count and append the value to counts
The counts could all be the same, if there are no NaNs in the data
Convert the list of arrays, v_mean, to a dataframe, and join it to dfr_new
Add counts as a column in dfr_new
v_mean = list()
counts = list()
for idx, (i, t, b) in dfr.iterrows():  # iterate through each row of dfr
    # apply the filters and aggregate the stats
    data = (df[['v1', 'v2', 'v3']][(df.id == i) & (df.depth >= t) & (df.depth <= b)]
            .agg(['mean', 'count']).T.to_numpy())
    v_mean.append(data[:, 0])  # get the 3 means
    # get the max of the 3 counts; each column has a count, and the counts
    # could be different if there are NaNs in the data
    counts.append(data[:, 1].max())
# copy dfr to dfr_new
dfr_new = dfr.copy()
# add stats values
dfr_new = dfr_new.join(pd.DataFrame(v_mean, columns=['v1_m', 'v2_m', 'v3_m']))
dfr_new['counts'] = counts
# display(dfr_new)
ID Top Bottom v1_m v2_m v3_m counts
0 1 2010 3000 44.577491 0.496768 502.068266 542.0
1 1 4300 4500 44.555556 0.518066 530.968254 126.0
2 1 4550 5000 44.446281 0.538855 482.818182 242.0
3 1 7100 7700 44.348083 0.489983 506.681416 339.0
4 2 3200 4100 44.804040 0.487011 528.707071 495.0
5 2 4120 4180 45.096774 0.526687 520.967742 31.0
6 2 4300 5300 44.476980 0.529476 523.095764 543.0
7 2 5500 5520 46.000000 0.608876 430.500000 12.0
8 3 2300 2380 44.512195 0.456632 443.195122 41.0
9 3 3200 4500 44.554755 0.516616 501.841499 694.0
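As an aside, the per-row filtering can also be expressed with pd.cut and an IntervalIndex built from each ID's Top/Bottom pairs. A sketch, assuming the intervals within a given ID do not overlap (pd.cut requires non-overlapping bins):
# Bin each depth into its ID's [Top, Bottom] interval, then aggregate per bin.
parts = []
for i, grp in dfr.groupby('ID'):
    bins = pd.IntervalIndex.from_arrays(grp['Top'], grp['Bottom'], closed='both')
    sub = df.loc[df['id'] == i].copy()
    sub['interval'] = pd.cut(sub['depth'], bins)  # NaN for depths outside every interval
    stats = sub.groupby('interval', observed=False)[['v1', 'v2', 'v3']].mean()
    stats['counts'] = sub.groupby('interval', observed=False)['v1'].count()
    stats['ID'] = i
    parts.append(stats)
result = pd.concat(parts).reset_index()  # 'interval' holds the Top/Bottom pairs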
I want to convert the html to csv using pandas functions.
This is a part of what I read into the dataframe df:
0 1
0 sequence 2
1 trainNo K805
2 trainNumber K805
3 departStation 鹰潭
4 departStationPy yingtan
5 arriveStation 南昌
6 arriveStationPy nanchang
7 departDate 2020-05-24
8 departTime 03:55
9 arriveDate 2020-05-24
10 arriveTime 05:44
11 isStartStation False
12 isEndStation False
13 runTime 1小时49分钟
14 preSaleTime NaN
15 takeDays 0
16 isBookable True
17 seatList seatNamepriceorderPriceinventoryisBookablebutt...
18 curSeatIndex 0
seatName price orderPrice inventory isBookable buttonDisplayName buttonType
0 硬座 23.5 23.5 99 True NaN 0
1 硬卧 69.5 69.5 99 True NaN 0
2 软卧 104.5 104.5 4 True NaN 0
0 1
0 departDate 2020-05-23
1 departStationList NaN
2 endStationList NaN
3 departStationFilterMap NaN
4 endStationFilterMap NaN
5 departCityName 上海
6 arriveCityName 南昌
7 gtMinPrice NaN
My code is like this
for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
    df.to_csv(".\other.csv",index=True,encoding='utf-8-sig')
To preserve the characters in the csv, I need to use utf-8-sig encoding. But I don't know how to write the format symbol % to give each table its own filename.
,0,1
0,departDate,2020-05-23
1,departStationList,
2,endStationList,
3,departStationFilterMap,
4,endStationFilterMap,
5,departCityName,上海
6,arriveCityName,南昌
7,gtMinPrice,
This is what I got in the csv file; only the last part is preserved.
The dataframe is correct, while the csv needs correction. Can you show me how to make the correct output?
You're saving each dataframe to the same file, so each one gets overwritten until only the last remains.
Note the addition of the f-string to change the save file name, e.g. f".\other_{i}.csv".
Each dataframe is a different shape, so they won't all fit together properly in a single table.
To CSV
for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
    df.to_csv(f".\other_{i}.csv", index=True, encoding='utf-8-sig')
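If one combined file is still wanted despite the shape differences, appending is an option. A sketch, reusing the html variable from the question:
# Overwrite on the first table, then append the rest to the same file.
for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
    mode = 'w' if i == 0 else 'a'
    df.to_csv(r'.\other.csv', mode=mode, index=True, encoding='utf-8-sig')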
To Excel
with pd.ExcelWriter('output.xlsx', mode='w') as writer:
    for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
        df.to_excel(writer, sheet_name=f'Sheet{i}', encoding='utf-8-sig')
For school I have to make a project about wifi signals, and I am trying to put the data in a dataframe.
There are 208,000 rows of data.
When it comes to the code below, the code does not complete; it acts as if it were stuck in an infinite loop.
When I use only 1000 rows my program works, so I think my lists are too small, if that is possible.
Do bigger lists exist in Python? Or is it because I use bad coding?
Thanks in advance.
Edit 1:
(data is the original dataframe and WifiInfo is a column of it)
I have this format:
df = pd.DataFrame(columns=['Sender','Time','Date','Place','X','Y','Bezetting','SSID','BSSID','Signal'])
And I am trying to fill SSID, BSSID, and Signal from the column WifiInfo; for this I have to split the data.
This is what one WifiInfo value looks like:
ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88-1d-fc-2c-c0-00:-72,ODISEE#88-1d-fc-41-d2-d0:-82,CiscoC5976#58-6d-8f-19-14-38:-78,CiscoC5959#58-6d-8f-19-13-f4:-93,SNB#c8-d7-19-6f-be-b7:-99,ODISEE#88-1d-fc-2c-c5-70:-94,HackingDemo#58-6d-8f-19-11-48:-156,ODISEE#88-1d-fc-30-d4-40:-85,ODISEE#88-1d-fc-41-ac-50:-100
My current approach looks like:
for index, row in data.iterrows():
    bezettingList = list()
    ssidList = list()
    bssidList = list()
    signalList = list()
    # WifiInfo splitting
    wifis = row.WifiInfo.split(',')
    for wifi in wifis:
        # split wifi and add to lists
        ssid, bssid = wifi.split('#')
        bssid, signal = bssid.split(':')
        ssidList.append(ssid)
        bssidList.append(bssid)
        signalList.append(int(signal))
    # add bezettingen to list
    bezettingen = row.Bezetting.split(',')
    for bezetting in bezettingen:
        bezettingList.append(bezetting)
    # add lists to dataframe
    df.loc[index,'SSID'] = ssidList
    df.loc[index,'BSSID'] = bssidList
    df.loc[index,'Signal'] = signalList
    df.loc[index,'Bezetting'] = bezettingList
df.head()
IIUC, you need to first explode the row by commas so that this:
SSID BSSID Signal WifiInfo
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88- ...
becomes this:
SSID BSSID Signal WifiInfo
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83
1 NaN NaN NaN ODISEE#88-1d-fc-2c-c0-00:-72
2 NaN NaN NaN ODISEE#88-1d-fc-41-d2-d0:-82
3 NaN NaN NaN CiscoC5976#58-6d-8f-19-14-38:-78
4 NaN NaN NaN CiscoC5959#58-6d-8f-19-13-f4:-93
5 NaN NaN NaN SNB#c8-d7-19-6f-be-b7:-99
6 NaN NaN NaN ODISEE#88-1d-fc-2c-c5-70:-94
7 NaN NaN NaN HackingDemo#58-6d-8f-19-11-48:-156
8 NaN NaN NaN ODISEE#88-1d-fc-30-d4-40:-85
9 NaN NaN NaN ODISEE#88-1d-fc-41-ac-50:-100
# use `.explode`
data = data.assign(WifiInfo=data.WifiInfo.str.split(',')).explode('WifiInfo')
Now you could use .str.extract:
data['SSID'] = data['WifiInfo'].str.extract(r'(.*)#')
data['BSSID'] = data['WifiInfo'].str.extract(r'#(.*):')
data['Signal'] = data['WifiInfo'].str.extract(r':(.*)')
SSID BSSID Signal WifiInfo
0 ODISEE 88-1d-fc-41-dc-50 -83 ODISEE#88-1d-fc-41-dc-50:-83
1 ODISEE 88-1d-fc-2c-c0-00 -72 ODISEE#88-1d-fc-2c-c0-00:-72
2 ODISEE 88-1d-fc-41-d2-d0 -82 ODISEE#88-1d-fc-41-d2-d0:-82
3 CiscoC5976 58-6d-8f-19-14-38 -78 CiscoC5976#58-6d-8f-19-14-38:-78
4 CiscoC5959 58-6d-8f-19-13-f4 -93 CiscoC5959#58-6d-8f-19-13-f4:-93
5 SNB c8-d7-19-6f-be-b7 -99 SNB#c8-d7-19-6f-be-b7:-99
6 ODISEE 88-1d-fc-2c-c5-70 -94 ODISEE#88-1d-fc-2c-c5-70:-94
7 HackingDemo 58-6d-8f-19-11-48 -156 HackingDemo#58-6d-8f-19-11-48:-156
8 ODISEE 88-1d-fc-30-d4-40 -85 ODISEE#88-1d-fc-30-d4-40:-85
9 ODISEE 88-1d-fc-41-ac-50 -100 ODISEE#88-1d-fc-41-ac-50:-100
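As a side note, the three extracts can be collapsed into a single pass with named capture groups. A sketch:
# Named groups become the column names; one regex pass instead of three.
data[['SSID', 'BSSID', 'Signal']] = data['WifiInfo'].str.extract(
    r'(?P<SSID>.*)#(?P<BSSID>.*):(?P<Signal>.*)')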
If you want to keep data grouped after column explosion, I'd assign an ID for each group of entries first:
data['Group'] = pd.factorize(data['WifiInfo'])[0]+1
SSID BSSID Signal WifiInfo Group
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88- ... 1
1 NaN NaN NaN ASD#22-1d-fc-41-dc-50:-83,QWERTY#88- ... 2
# after you explode the column
SSID BSSID Signal WifiInfo Group
ODISEE 88-1d-fc-41-dc-50 -83 ODISEE#88-1d-fc-41-dc-50:-83 1
ODISEE 88-1d-fc-2c-c0-00 -72 ODISEE#88-1d-fc-2c-c0-00:-72 1
...
...
ASD 22-1d-fc-41-dc-50 -83 ASD#22-1d-fc-41-dc-50:-83 2
QWERTY 88-1d-fc-2c-c0-00 -72 QWERTY#88-1d-fc-2c-c0-00:-72 2