I have a dataframe of Wikipedia articles with geocoordinates and some statistics. The column 'Availability' contains a tuple of the languages that each article is available in (out of a selection).
What I'm trying to do is plot a bubble map with plotly, with the legend showing the availability in those languages. For example, out of ['ca', 'es'] you would have [], ['ca'], ['es'], ['ca', 'es'], meaning not available, only in Catalan, only in Spanish, or available in both, respectively.
The problem is that when trying to use those combinations to create a dataframe with only the matching rows using Dataframe.isin(), it always returns an empty df.
The columns of the dataframe are:
Columns: [French Title, Qitem, Pageviews, page_title_1, page_title_2, Availability, Lat, Lon, Text]
Here is my code:
fig = go.Figure()
scale = 500

for comb in combinations:
    df_sub = df[df['Availability'].isin(tuple(comb))]  # The problem is here. This returns an empty DF
    if len(df_sub.index) == 0:
        continue  # There are no occurrences with that comb
    fig.add_trace(go.Scattergeo(
        lat=df_sub['Lat'],
        lon=df_sub['Lon'],
        text=df_sub['Text'],
        marker=dict(
            size=df[order_by],
            sizeref=2. * max(df[order_by]) / (scale ** 2),
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode='area'
        ),
        name=comb  # Here is the underlying restriction: I need to separate the traces according to their availability.
    ))
PS: I guess it has something to do with pandas not working very well with lists or tuples as column values, but I haven't figured out how to achieve what I want. Does anyone have an idea? Thank you in advance. comb appears as a string or a tuple of strings, e.g. ('es', 'ca'), but the values in df['Availability'] print as (es, ca).
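As a side note (a minimal sketch, not part of the original post): isin() tests whether each cell is equal to one of the values you pass, so a cell holding the whole tuple ('es', 'ca') is compared against the strings 'es' and 'ca' and never matches. Passing a list that contains the tuple itself behaves differently:
import pandas as pd

s = pd.Series([('es', 'ca'), ('es',), ()])
print(s.isin(('es', 'ca')))    # all False: no cell equals the string 'es' or 'ca'
print(s.isin([('es', 'ca')]))  # first row True: the whole tuple is one of the candidate values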
Sample dataframe (sorry for the style, I'm new to Stack Overflow):
French Title Qitem Pageviews \
0 Liban Q822 53903
1 France Q142 25728
2 Biélorussie Q184 21688
3 Île Maurice Q2656389 20478
4 Affaire Dupont de Ligonnès Q16010109 16075
page_title_1 page_title_2 \
0 Líbano Líban
1 Francia França
2 Bielorrusia Bielorússia
3 Isla de Mauricio Illa Maurici
4 Asesinatos y desapariciones de Dupont de Ligonnès
Availability Lat Lon \
0 (es, ca) 33.90000000 35.53330000
1 (es, ca) 48.86700000 2.32650000
2 (es, ca) 53.528333333333 28.046666666667
3 (es, ca) -20.30084200 57.58209200
4 (es,) 47.23613230 -1.56848610
Text
0 Liban<br>(33.90000000, 35.53330000)<br>Q822
1 France<br>(48.86700000, 2.32650000)<br>Q142
2 Biélorussie<br>(53.528333333333, 28.046666666667)<br>Q184
3 Île Maurice<br>(-20.30084200, 57.58209200)<br>Q2656389
4 Affaire Dupont de Ligonnès<br>(47.23613230, -1.56848610)<br>Q16010109
You can use Series.apply() to achieve your goal:
df['Availability'].apply(lambda x: 'ca' in x)
That will return True if 'ca' is in the tuple. It can easily be modified to return a label instead, e.g. Catalan.
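For instance, a small sketch of that modification (the label names here are just illustrative):
def availability_label(langs):
    # langs is the tuple stored in the 'Availability' cell, e.g. ('es', 'ca')
    if 'ca' in langs and 'es' in langs:
        return 'Catalan and Spanish'
    elif 'ca' in langs:
        return 'Catalan'
    elif 'es' in langs:
        return 'Spanish'
    return 'Not available'

df['Label'] = df['Availability'].apply(availability_label)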
In the end I turned the tuple into a list (since I'm no longer using df.isin(), it doesn't raise the unhashable-type error) and was able to separate the traces by combination using df.apply() (thanks to mkos for the idea):
for comb in combinations:
    if len(comb) == 0:
        name = 'Not available'
        df_sub = df[df['Availability'].apply(lambda x: len(x) == 0)]
    else:
        df_sub = df[df['Availability'].apply(lambda x: set(comb) == set(x))]
        name = ','.join(comb)
    if len(df_sub.index) == 0:
        continue
    fig.add_trace(go.Scattergeo(
        lat=df_sub['Lat'],
        lon=df_sub['Lon'],
        text=df_sub['Text'],
        marker=dict(
            size=df[order_by],
            sizeref=2. * max(df[order_by]) / (scale ** 2),
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode='area'
        ),
        name=name
    ))
You can see the result here.
So, I've got two dataframes, one with 54k rows and 1 column and another with 139k rows and 3 columns. I need to check whether the values of a column from the first dataframe lie between the values of two columns in the second dataframe, and if they do, replace that value in the first dataframe with the corresponding string value from the second dataframe.
I tried doing it with simple for loops and if-else statements, but the number of iterations is huge and my cell takes forever to run. I've attached some snippets below; if there is a better way to rewrite that part of the code, it would be a great help. Thanks in advance.
First DataFrame:
ip_address_to_clean
IP_Address_clean
0 815237196
1 1577685417
2 979279225
3 3250268602
4 2103448748
... ...
54208 4145673247
54209 1344187002
54210 3156712153
54211 1947493810
54212 2872038579
54213 rows × 1 columns
Second DataFrame:
ip_boundaries_file
country lower_bound_ip_address_clean upper_bound_ip_address_clean
0 Australia 16777216 16777471
1 China 16777472 16777727
2 China 16777728 16778239
3 Australia 16778240 16779263
4 China 16779264 16781311
... ... ... ...
138841 Hong Kong 3758092288 3758093311
138842 India 3758093312 3758094335
138843 China 3758095360 3758095871
138844 Singapore 3758095872 3758096127
138845 Australia 3758096128 3758096383
138846 rows × 3 columns
Code I've written:
ip_address_to_clean_copy = ip_address_to_clean.copy()
o_ip = ip_address_to_clean['IP_Address_clean'].values
l_b = ip_boundaries_file['lower_bound_ip_address_clean'].values
for i in range(len(o_ip)):
    for j in range(len(l_b)):
        if ((ip_address_to_clean['IP_Address_clean'][i] > ip_boundaries_file['lower_bound_ip_address_clean'][j])
                and (ip_address_to_clean['IP_Address_clean'][i] < ip_boundaries_file['upper_bound_ip_address_clean'][j])):
            ip_address_to_clean_copy['IP_Address_clean'][i] = ip_boundaries_file['country'][j]
            #print(ip_address_to_clean_copy['IP_Address_clean'][i])
            #print(i)
This works (I tested it on small tables).
replacement1 = [None] * 3758096384
replacement2 = []
for _, row in ip_boundaries_file.iterrows():
    a, b, c = row['lower_bound_ip_address_clean'], row['upper_bound_ip_address_clean'], row['country']
    replacement1[a+1:b] = [len(replacement2)] * (b - a - 1)
    replacement2.append(c)
ip_address_to_clean_copy['IP_Address_clean'] = ip_address_to_clean_copy['IP_Address_clean'].apply(
    lambda x: replacement2[replacement1[x]] if (x < len(replacement1) and replacement1[x] is not None) else x)
I tweaked the lambda function to keep the original ip if it's not in the replacement table.
Notes:
Compared to my comment, I added the replacement2 table to hold the actual strings, and put the indexes in replacement1 to make it more memory efficient.
This is based on one of the methods to sort a list in O(n) when you know the contained values are bounded.
Example:
Inputs:
ip_address_to_clean = pd.DataFrame([10,33,2,179,2345,123], columns = ['IP_Address_clean'])
ip_boundaries_file = pd.DataFrame([['China',1,12],
['Australia', 20,40],
['China',2000,3000],
['France', 100,150]],
columns = ['country', 'lower_bound_ip_address_clean',
'upper_bound_ip_address_clean'])
Output:
ip_address_to_clean_copy
# Out[13]:
# IP_Address_clean
# 0 China
# 1 Australia
# 2 China
# 3 179
# 4 China
# 5 France
As I mentioned in another comment, here's another script that performs a binary (dichotomy) search on the 2nd DataFrame; it runs in O(n log(p)), which is slower than the script above, but consumes far less memory!
def replace(n, df):
    if len(df) == 0:
        return n
    i = len(df) // 2
    if df.iloc[i]['lower_bound_ip_address_clean'] < n < df.iloc[i]['upper_bound_ip_address_clean']:
        return df.iloc[i]['country']
    elif len(df) == 1:
        return n
    else:
        if n <= df.iloc[i]['lower_bound_ip_address_clean']:
            return replace(n, df.iloc[:i])   # keep every row below the midpoint
        else:
            return replace(n, df.iloc[i+1:])
ip_address_to_clean_copy['IP_Address_clean'] = ip_address_to_clean['IP_Address_clean'].apply(lambda x: replace(x,ip_boundaries_file))
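Another option worth mentioning (a sketch that is not from the original answers): if the boundary rows are sorted by lower bound and the ranges don't overlap, numpy.searchsorted gives an O(n log p) lookup fully vectorised, without the giant list or the Python-level recursion, and keeps the original ip when nothing matches. Whether the bounds should be treated as inclusive or exclusive may need adjusting to match the logic above.
import numpy as np

lower = ip_boundaries_file['lower_bound_ip_address_clean'].values
upper = ip_boundaries_file['upper_bound_ip_address_clean'].values
country = ip_boundaries_file['country'].values
ips = ip_address_to_clean['IP_Address_clean'].values

# for each ip, index of the rightmost interval whose lower bound is <= ip
pos = np.searchsorted(lower, ips, side='right') - 1
safe = np.maximum(pos, 0)                    # avoid indexing with -1
inside = (pos >= 0) & (ips <= upper[safe])   # the ip must also fall below that interval's upper bound

ip_address_to_clean_copy['IP_Address_clean'] = np.where(inside, country[safe], ips)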
I have the following dataframe "data", composed of IDs and their associated cluster numbers:
ID cluster
FP_101 1
FP_102 1
SP_209 3
SP_300 3
SP_209 1
FP_45 90
SP_50 90
FP_398 100
...
I would like to print the clusters which contain more than one ID starting with SP and/or FP.
I think I have the two parts of the answer but just don't know how to combine them in a proper way:
data = data[data['ID'].str.startswith('FP')] (same for SP)
selection function: data = data.groupby(['cluster']).filter(lambda x: x['ID'].nunique() > 1)
The result should give from the previous example :
ID cluster
FP_101 1
FP_102 1
SP_209 1
SP_209 3
SP_300 3
How can I combine/arrange these functions to obtain this result?
This is my understanding of your question; let me know if it helps:
Separating SP & FP
df['Prefix'] = df['ID'].apply(lambda x: x.split('_')[0])
Grouping by clusters
df2 = df.groupby(['cluster', 'Prefix'], as_index = False).agg({'ID':['nunique','unique']})
Filtering
df2.columns = df2.columns.to_flat_index().str.join('')
df2[df2['IDnunique']>1]
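If the goal is to get back the original rows (as in the expected output in the question) rather than the aggregated counts, a possible follow-up along the same lines; this is a sketch and assumes the rule is "keep every cluster in which at least one prefix occurs with more than one distinct ID":
# prefix ('FP' or 'SP') of every ID
prefix = df['ID'].str.split('_').str[0]
fp_sp = df[prefix.isin(['FP', 'SP'])].copy()
fp_sp['Prefix'] = prefix.loc[fp_sp.index]

# distinct IDs per (cluster, Prefix), broadcast back onto each row
nunique = fp_sp.groupby(['cluster', 'Prefix'])['ID'].transform('nunique')

clusters_to_keep = fp_sp.loc[nunique > 1, 'cluster'].unique()
result = fp_sp.loc[fp_sp['cluster'].isin(clusters_to_keep), ['ID', 'cluster']].sort_values('cluster')
print(result)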
Basically I have US census data that I have read into pandas from a CSV file.
Now I have to write a function that finds counties in a specific manner (not gonna explain that because that's not what the question is about) from the table I got from the CSV file and returns those counties.
MY TRY:
What I did: I created lists named after the columns that the function has to return, then applied the specific condition inside a for loop with an if-statement to append the entries of all required columns to their respective lists. Then I created a new DataFrame and wanted to read the entries from the lists into it. I tried the same for loop to accomplish this, but all in vain; I tried making Series out of those lists and passing them as a parameter to the DataFrame, still in vain; I made DataFrames out of those lists and tried using append() to concatenate them, still in vain. Any help would be appreciated.
CODE:
#idxl = list()
#st = list()
#cty = list()
idx2 = 0
cty_reg = pd.DataFrame(columns=('STNAME', 'CTYNAME'))
for idx in range(census_df['CTYNAME'].count()):
    if ((census_df.iloc[idx]['REGION'] == 1 or census_df.iloc[idx]['REGION'] == 2)
            and (census_df.iloc[idx]['POPESTIMATE2015'] > census_df.iloc[idx]['POPESTIMATE2014'])
            and census_df.loc[idx]['CTYNAME'].startswith('Washington')):
        #idxl.append(census_df.index[idx])
        #st.append(census_df.iloc[idx]['STNAME'])
        #cty.append(census_df.iloc[idx]['CTYNAME'])
        cty_reg.index[idx2] = census_df.index[idx]
        cty_reg.iloc[idxl2]['STNAME'] = census_df.iloc[idx]['STNAME']
        cty_reg.iloc[idxl2]['CTYNAME'] = census_df.iloc[idx]['CTYNAME']
        idx2 = idx2 + 1
cty_reg
SAMPLE TABLE:
REGION STNAME CTYNAME
0 2 "Wisconsin" "Washington County"
1 2 "Alabama" "Washington County"
2 1 "Texas" "Atauga County"
3 0 "California" "Washington County"
SAMPLE OUTPUT:
STNAME CTYNAME
0 Wisconsin Washington County
1 Alabama Washington County
I am sorry for my limited knowledge of US states and counties; I just put random state and county names in the sample table to show what I want to get out of it. Thanks in advance for the help.
There are some missing columns in the source DF posted in the OP. However, reading the loop, I don't think the loop is required at all. There are 3 filters required: REGION, POPESTIMATE2015 and CTYNAME. If I have understood the logic in the OP, then this should be feasible without the loop.
Option 1 - original answer
print(df.loc[
    (df.REGION.isin([1, 2])) &
    (df.POPESTIMATE2015 > df.POPESTIMATE2014) &
    (df.CTYNAME.str.startswith('Washington')),
    ['REGION', 'STNAME', 'CTYNAME']])
Option 2 - using and with pd.eval
q = pd.eval("(df.REGION.isin([1,2])) and \
             (df.POPESTIMATE2015 > df.POPESTIMATE2014) and \
             (df.CTYNAME.str.startswith('Washington'))",
            engine='python')
print(df.loc[q, ['REGION', 'STNAME', 'CTYNAME']])
Option 3 - using and with df.query
regions_list = [1, 2]
dfq = df.query("(REGION == @regions_list) and \
                (POPESTIMATE2015 > POPESTIMATE2014) and \
                (CTYNAME.str.startswith('Washington'))",
               engine='python')
print(dfq[['REGION', 'STNAME', 'CTYNAME']])
If I'm reading the logic in your code right, you want to select rows according to the following conditions:
REGION should be 1 or 2
POPESTIMATE2015 > POPESTIMATE2014
CTYNAME needs to start with "Washington"
In general, Pandas makes it easy to select rows based on conditions without having to iterate over the dataframe:
df = census_df[
    ((census_df.REGION == 1) | (census_df.REGION == 2)) &
    (census_df.POPESTIMATE2015 > census_df.POPESTIMATE2014) &
    (census_df.CTYNAME.str.startswith('Washington'))
]
Assuming you're selecting rows that satisfy some criteria, let's just say that select(row) is a function that returns True if the row is selected and False if not. I won't infer what the criteria are because you specifically said they were not important.
And then you wanted the STNAME and CTYNAME of that row.
So here's what you would do:
your_new_df = census_df[census_df.apply(select, axis=1)]\
.apply(lambda x: x[['STNAME', 'CTYNAME']], axis=1)
This is the one liner that will get you what you wanted provided you wrote the select function that will pick the rows.
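For illustration, a sketch with a hypothetical select that encodes the three conditions listed above (not code from the question; selecting the two columns directly is equivalent to the second apply):
def select(row):
    # the three conditions from the question
    return (row['REGION'] in (1, 2)
            and row['POPESTIMATE2015'] > row['POPESTIMATE2014']
            and row['CTYNAME'].startswith('Washington'))

your_new_df = census_df[census_df.apply(select, axis=1)][['STNAME', 'CTYNAME']]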
I'm making my way around GroupBy, but I still need some help. Let's say that I have a DataFrame with a column Group giving each object's group number, some parameter R, and spherical coordinates RA and Dec. Here is a mock DataFrame:
df = pd.DataFrame({
'R' : (-21.0,-21.5,-22.1,-23.7,-23.8,-20.4,-21.8,-19.3,-22.5,-24.7,-19.9),
'RA': (154.362789,154.409301,154.419191,154.474165,154.424842,162.568516,8.355454,8.346812,8.728223,8.759622,8.799796),
'Dec': (-0.495605,-0.453085,-0.481657,-0.614827,-0.584243,8.214719,8.355454,8.346812,8.728223,8.759622,8.799796),
'Group': (1,1,1,1,1,2,2,2,2,2,2)
})
I want to build a selection containing, for each group, the "brightest" object, i.e. the one with the smallest R (or the greatest absolute value, since R is negative), and the 3 closest objects of the group (so I keep 4 objects in each group; we can assume that there is no group smaller than 4 objects if needed).
We assume here that we have defined the following functions:
import numpy as np

# deg to rad
def d2r(x):
    return x * np.pi / 180.0

# rad to deg
def r2d(x):
    return x * 180.0 / np.pi

# Computes separation on a sphere
def calc_sep(phi1, theta1, phi2, theta2):
    return np.arccos(np.sin(theta1)*np.sin(theta2) +
                     np.cos(theta1)*np.cos(theta2)*np.cos(phi2 - phi1))
and that separation between two objects is given by r2d(calc_sep(RA1,Dec1,RA2,Dec2)), with RA1 as RA for the first object, and so on.
I can't figure out how to use GroupBy to achieve this...
What you can do here is build a more specific helper function that gets applied to each "sub-frame" (each group).
GroupBy is really just a facility that creates something like an iterator of (group id, DataFrame) pairs, and a function is applied to each of these when you call .groupby().apply. (That glosses over a lot of details; see here for more on the internals if you're interested.)
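For instance, iterating over the GroupBy object directly shows those pairs:
for group_id, sub_df in df.groupby('Group'):
    # group_id is the value of 'Group'; sub_df is the matching slice of df
    print(group_id, len(sub_df))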
So after defining your three NumPy-based functions, also define:
def sep_df(df, keep=3):
    min_r = df.loc[df.R.idxmin()]
    RA1, Dec1 = min_r.RA, min_r.Dec
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    idx = sep.nsmallest(keep + 1).index
    return df.loc[idx]
Then just apply and you get a MultiIndex DataFrame where the first index level is the group.
print(df.groupby('Group').apply(sep_df))
Dec Group R RA
Group
1 3 -0.61483 1 -23.7 154.47416
2 -0.48166 1 -22.1 154.41919
0 -0.49561 1 -21.0 154.36279
4 -0.58424 1 -23.8 154.42484
2 8 8.72822 2 -22.5 8.72822
10 8.79980 2 -19.9 8.79980
6 8.35545 2 -21.8 8.35545
9 8.75962 2 -24.7 8.75962
With some comments interspersed:
def sep_df(df, keep=3):
    # Applied to each sub-DataFrame (this is what GroupBy does under the hood)

    # Get RA and Dec values at minimum R
    min_r = df.loc[df.R.idxmin()]    # Series - row at which R is minimum
    RA1, Dec1 = min_r.RA, min_r.Dec  # Relevant 2 scalars within this row

    # Calculate separation for each pair including the minimum R row
    # The result is a series of separations, same length as `df`
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))

    # Get index values of the `keep` (default 3) smallest results
    # Retain `keep+1` values because one will be the minimum R
    # row where separation=0
    idx = sep.nsmallest(keep + 1).index

    # Restrict the result to those 3 index labels + your minimum R row
    return df.loc[idx]
For speed, consider passing sort=False to GroupBy if the result still works for you.
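That would just be the same call as above, with the extra keyword:
print(df.groupby('Group', sort=False).apply(sep_df))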
I want to built a selection containing for each group the "brightest" object...and the 3 closest objects of the group
step 1:
create a dataframe for the brightest object in each group
maxR = df.sort_values('R').groupby('Group')[['Group', 'Dec', 'RA']].head(1)
step 2:
merge the two frames on Group & calculate the separation
merged = df.merge(maxR, on = 'Group', suffixes=['', '_max'])
merged['sep'] = merged.apply(
lambda x: r2d(calc_sep(x.RA, x.Dec, x.RA_max, x.Dec_max)),
axis=1
)
step 3:
order the data frame, group by 'Group', (optional) discard intermediate fields & take the first 4 rows from each group
finaldf = merged.sort_values(['Group', 'sep'], ascending=[1,1]
).groupby('Group')[df.columns].head(4)
Produces the following data frame with your sample data:
Dec Group R RA
4 -0.584243 1 -23.8 154.424842
3 -0.614827 1 -23.7 154.474165
2 -0.481657 1 -22.1 154.419191
0 -0.495605 1 -21.0 154.362789
9 8.759622 2 -24.7 8.759622
8 8.728223 2 -22.5 8.728223
10 8.799796 2 -19.9 8.799796
6 8.355454 2 -21.8 8.355454
I have 2 columns. I need to take specific string information from each column and create a new column with new strings based on it.
In the column "Name" I have well names; I need to look at the last 4 characters of each well name, and if they contain "H", call that "HZ" in a new column.
I need to do the same thing if the column "WELLTYPE" contains specific words.
Using the data analysis program Spotfire, I can do this all in one simple equation (see below).
case
When right([UWI],4)~="H" Then "HZ"
When [WELLTYPE]~="Horizontal" Then "HZ"
When [WELLTYPE]~="Deviated" Then "D"
When [WELLTYPE]~="Multilateral" Then "ML"
else "V"
End
What would be the best way to do this in Python pandas?
Is there a simple, clean way to do this all at once, like in the Spotfire equation above?
Here is the data table with the two columns and my hoped-for outcome column (it did not copy very well into this post); I also provide the code for the table below.
Name WELLTYPE What I Want
0 HH-001HST2 Oil Horizontal HZ
1 HH-001HST Oil_Horizontal HZ
2 HB-002H Oil HZ
3 HB-002 Water_Deviated D
4 HB-002 Oil_Multilateral ML
5 HB-004 Oil V
6 HB-005 Source V
7 BB-007 Water V
Here is the code to create the dataframe
# Dataframe with hopeful outcome
raw_data = {'Name': ['HH-001HST2', 'HH-001HST', 'HB-002H', 'HB-002', 'HB-002','HB-004','HB-005','BB-007'],
'WELLTYPE':['Oil Horizontal', 'Oil_Horizontal', 'Oil', 'Water_Deviated', 'Oil_Multilateral','Oil','Source','Water'],
'What I Want': ['HZ', 'HZ', 'HZ', 'D', 'ML','V','V','V']}
df = pd.DataFrame(raw_data, columns = ['Name','WELLTYPE','What I Want'])
df
Nested 'where' variant:
df['What I Want'] = np.where(df.Name.str[-4:].str.contains('H'), 'HZ',
np.where(df.WELLTYPE.str.contains('Horizontal'),'HZ',
np.where(df.WELLTYPE.str.contains('Deviated'),'D',
np.where(df.WELLTYPE.str.contains('Multilateral'),'ML',
'V'))))
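A flatter equivalent of the same logic, as a sketch, is numpy.select, which takes the conditions and choices as parallel lists plus a default and picks the first condition that matches:
import numpy as np

conditions = [
    df.Name.str[-4:].str.contains('H'),
    df.WELLTYPE.str.contains('Horizontal'),
    df.WELLTYPE.str.contains('Deviated'),
    df.WELLTYPE.str.contains('Multilateral'),
]
choices = ['HZ', 'HZ', 'D', 'ML']

df['What I Want'] = np.select(conditions, choices, default='V')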
Using apply by row:
def criteria(row):
    # str.find returns -1 when the substring is absent, so test against -1
    # (a plain `> 0` would miss a match at position 0)
    if row.Name[-4:].find('H') != -1:
        return 'HZ'
    elif row.WELLTYPE.find('Horizontal') != -1:
        return 'HZ'
    elif row.WELLTYPE.find('Deviated') != -1:
        return 'D'
    elif row.WELLTYPE.find('Multilateral') != -1:
        return 'ML'
    else:
        return 'V'

df['want'] = df.apply(criteria, axis=1)
This feels more natural to me (obviously subjective):
from_name = df.Name.str[-4:].str.contains('H').map({True: 'HZ'})
regex = '(Horizontal|Deviated|Multilateral)'
m = dict(Horizontal='HZ', Deviated='D', Multilateral='ML')
from_well = df.WELLTYPE.str.extract(regex, expand=False).map(m)
df['What I Want'] = from_name.fillna(from_well).fillna('V')
print(df)
         Name          WELLTYPE What I Want
0  HH-001HST2    Oil Horizontal          HZ
1   HH-001HST    Oil_Horizontal          HZ
2     HB-002H               Oil          HZ
3      HB-002    Water_Deviated           D
4      HB-002  Oil_Multilateral          ML
5      HB-004               Oil           V
6      HB-005            Source           V
7      BB-007             Water           V