Selecting rows based on multiple conditions using Python pandas - python

I am trying to find the row that satisfies multiple user inputs. The result should be the single row that matches the flight date and destination entered by the user, with the origin airport fixed to Atlanta (ATL). If the user enters anything else, the program should print an error and quit.
The input data is a CSV that looks like this:
FL_DATE ORIGIN DEST DEP_TIME
5/1/2017 ATL IAD 1442
5/1/2017 MCO EWR 932
5/1/2017 IAH MIA 1011
5/1/2017 EWR TPA 1646
5/1/2017 RSW EWR 1054
5/1/2017 IAD RDU 2216
5/1/2017 IAD BDL 1755
5/1/2017 EWR RSW 1055
5/1/2017 MCO EWR 744
My current code:
import pandas as pd
df=pd.read_csv("flights.data.csv") #import data frame
input1 = input ('Enter your flight date in MM/DD/YYYY: ') #input flight date
try:
    date = str(input1) #flight date is a string
except:
    print('Invalid date') #error message if it isn't a string
    quit()
input2 = input('Enter your destination airport code: ') #input airport code
try:
    destination = str(input2) #destination is a string
except:
    print('Invalid destination airport code') #error message if it isn't a string
    quit()
df.loc[df['FL_DATE'] == date] & df[df['ORIGIN'] == 'ATL'] & df[df['DEST'] == destination]
#matches flight date and destination; origin has to equal ATL
The ideal output is just the first row when I enter 5/1/2017 as the date and IAD as the destination.

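The example below should resolve your issue. Your syntax for combining multiple conditions was wrong: each condition has to be wrapped in its own parentheses, and the combined boolean mask goes inside a single df.loc[...].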
import pandas as pd
df=pd.DataFrame({'FL_DATE':['5/1/2017'],'ORIGIN':['ATL'],'DEST':['IAD'],'DEP_TIME':[1442]})
df.loc[(df['FL_DATE'] == '5/1/2017') & (df['ORIGIN'] == 'ATL') & (df['DEST'] == 'IAD')]
Gives
DEP_TIME DEST FL_DATE ORIGIN
1442 IAD 5/1/2017 ATL
You should change your code to something like this
df.loc[(df['FL_DATE'] == date) & (df['ORIGIN'] == 'ATL') & (df['DEST'] == destination)]

In your loc statement, you need to fix the brackets and put parentheses around each condition:
df.loc[(df['FL_DATE'] == date) & (df['ORIGIN'] == 'ATL') & (df['DEST'] == destination)]
Then it works:
>>> df.loc[(df['FL_DATE'] == date) & (df['ORIGIN'] == 'ATL') & (df['DEST'] == destination)]
FL_DATE ORIGIN DEST DEP_TIME
0 5/1/2017 ATL IAD 1442
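The parentheses matter because `&` binds more tightly than `==` in Python, so without them the comparisons are not evaluated first. A minimal sketch of the whole lookup, assuming a small in-memory frame with a few rows from the question in place of the CSV; `find_flight` is a hypothetical helper name, not part of pandas:

```python
import pandas as pd

# A few rows copied from the question's sample data.
df = pd.DataFrame({
    "FL_DATE": ["5/1/2017", "5/1/2017", "5/1/2017"],
    "ORIGIN": ["ATL", "MCO", "IAD"],
    "DEST": ["IAD", "EWR", "RDU"],
    "DEP_TIME": [1442, 932, 2216],
})

def find_flight(frame, date, destination, origin="ATL"):
    """Return the rows that satisfy all three conditions (hypothetical helper)."""
    # Each comparison yields a boolean Series; & combines them element-wise.
    mask = (
        (frame["FL_DATE"] == date)
        & (frame["ORIGIN"] == origin)
        & (frame["DEST"] == destination)
    )
    return frame.loc[mask]

match = find_flight(df, "5/1/2017", "IAD")
if match.empty:
    print("No matching flight")
else:
    print(match)
```

Checking `match.empty` is one way to produce the "error and quit" behaviour the question asks for when the inputs match nothing.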

Related

ASSERTION ERROR: Issue in running SQL query

Question #1
List all the directors who directed a 'Comedy' movie in a leap year. (You need to check that the genre is 'Comedy' and the year is a leap year.) Your query should return the director name, the movie name, and the year.
%%time
def grader_1(q1):
    q1_results = pd.read_sql_query(q1,conn)
    print(q1_results.head(10))
    assert (q1_results.shape == (232,3))
#m as movie , m_director as md,Genre as g,Person as p
query1 ="""SELECT m.Title,p.Name,m.year
FROM Movie m JOIN
M_director d
ON m.MID = d.MID JOIN
Person p
ON d.PID = p.PID JOIN
M_Genre mg
ON m.MID = mg.MID JOIN
Genre g
ON g.GID = mg.GID
WHERE g.Name LIKE '%Comedy%'
AND ( m.year%4 = 0
AND m.year % 100 <> 0
OR m.year % 400 = 0 ) LIMIT 2"""
grader_1(query1)
ERROR:
title Name year
0 Mastizaade Milap Zaveri 2016
1 Harold & Kumar Go to White Castle Danny Leiner 2004
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-17-a942fcc98f72> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'def grader_1(q1):\n q1_results = pd.read_sql_query(q1,conn)\n print(q1_results.head(10))\n assert (q1_results.shape == (232,3))\n\n#m as movie , m_director as md,Genre as g,Person as p\nquery1 ="""SELECT m.Title,p.Name,m.year\nFROM Movie m JOIN \n M_director d\n ON m.MID = d.MID JOIN \n Person p\n ON d.PID = p.PID JOIN\n M_Genre mg\n ON m.MID = mg.MID JOIN\n Genre g \n ON g.GID = mg.GID\n WHERE g.Name LIKE \'%Comedy%\'\nAND ( m.year%4 = 0\nAND m.year % 100 <> 0\nOR m.year % 400 = 0 ) LIMIT 2"""\ngrader_1(query1)')
2 frames
<decorator-gen-53> in time(self, line, cell, local_ns)
/usr/local/lib/python3.7/dist-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)
1191 else:
1192 st = clock2()
-> 1193 exec(code, glob, local_ns)
1194 end = clock2()
1195 out = None
<timed exec> in <module>()
<timed exec> in grader_1(q1)
AssertionError:
I can run this SQL query on the IMDB dataset on its own, without the grader_1 function. However, when I run it through grader_1, I get an assertion error.
How can I fix this?
Your query has a LIMIT 2 clause, so it returns at most two rows, while the grader asserts that the result has shape (232, 3).
Just run it again without this clause.
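The effect of LIMIT on the row count can be reproduced with a toy table. This is only a sketch: the `movie` table and its five rows are made up for illustration and are not the IMDB schema from the question.

```python
import sqlite3

# In-memory database with a hypothetical, simplified movie table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?)",
    [("A", 2000), ("B", 2004), ("C", 2016), ("D", 1900), ("E", 2019)],
)

# Same leap-year condition as in the question (AND binds tighter than OR).
leap = "(year % 4 = 0 AND year % 100 <> 0 OR year % 400 = 0)"

all_rows = conn.execute(
    f"SELECT title, year FROM movie WHERE {leap}"
).fetchall()
limited = conn.execute(
    f"SELECT title, year FROM movie WHERE {leap} LIMIT 2"
).fetchall()

# The LIMIT version can never return more than two rows,
# so a shape assertion on the full result cannot pass.
print(len(all_rows), len(limited))
```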
Alternatively, use the following query, which has no LIMIT clause and casts the last four characters of the year column to an integer before the leap-year test:
query1 = """ SELECT M.title,Pe.Name,M.year
FROM Movie M
JOIN M_Director MD ON M.MID = MD.MID
JOIN M_Genre MG ON M.MID = MG.MID
JOIN Genre Ge ON MG.GID = Ge.GID
JOIN Person Pe ON MD.PID = Pe.PID
WHERE Ge.Name LIKE '%Comedy%'
AND CAST(SUBSTR(TRIM(M.year),-4) AS INTEGER) % 4 = 0
AND (CAST(SUBSTR(TRIM(M.year),-4) AS INTEGER) % 100 <> 0
OR CAST(SUBSTR(TRIM(M.year),-4) AS INTEGER) % 400 = 0) """
Running this query should resolve the problem.

Label dataframe column with regular expression

I have a dataframe (original_df) with a description column, and I want to create another column, Label, by searching the description for keywords with regular expressions, e.g.
description                 Label
fund trf 0049614823         transfers
alat transfer               transfers
data purchase via           airtime
alat pos buy                pos
alat web buy                others
atm wd rch debit money      withdrawals
alat pos buy                pos
salar alert charges         salary
mtn purchase via            airtime
top- up purchase via        airtime
The pseudocode I came up with was:
Input: the description column and a regular expression
Use the regular expression to search for patterns in the description column
Loop through the descriptions and create a label based on the keyword found in each description
Return the full dataframe with the Label column
I tried implementing that below, but I didn't get the logic right and I am getting a key error. I have tried everything I can think of but still can't come up with the right logic.
df = original_df['description'].sample(100)
position = 0
while position < len(df):
    if any(re.search(r"(tnf|trsf|trtr|trf|transfer)",df[position])):
        original_df['Label'] == 'transfers'
    elif any(re.search(r'(airtime|data|vtu|recharge|mtn|glo|top-up)',df[position])):
        original_df['Label'] == 'aitime
    elif any(re.search(r'(pos|web pos|)',df[position])):
        original_df['Label'] == 'pos
    elif any(re.search(r'(salary|sal|salar|allow|allowance)',df[position])):
        original_df['Label'] == 'salary'
    elif any(re.search(r'(loan|repayment|lend|borrow)',df[position])):
        original_df['Label'] == 'loan'
    elif any(re.search(r'(withdrawal|cshw|wdr|wd|wdl|withdraw|cwdr|cwd|cdwl|csw)',df[position])):
        return 'withdrawals'
    position += 1
    return others
print(df_sample)
You can put your regex logic into a function and apply that to the DataFrame. This way you can avoid your manual looping pseudocode.
Code:
import pandas as pd
df = pd.DataFrame({ 'description': [
'fund trf 0049614823',
'alat transfer',
'data purchase via',
'alat pos buy',
'alat web buy',
'atm wd rch debit money',
'alat pos buy',
'salar alert charges',
'mtn purchase via',
'top- up purchase via',
]})
   description
0  fund trf 0049614823
1  alat transfer
2  data purchase via
3  alat pos buy
4  alat web buy
5  atm wd rch debit money
6  alat pos buy
7  salar alert charges
8  mtn purchase via
9  top- up purchase via
Create a label() function based on your regex code:
import re
def label(row):
    if re.search(r'(tnf|trsf|trtr|trf|transfer)', row.description):
        result = 'transfers'
    elif re.search(r'(airtime|data|vtu|recharge|mtn|glo|top-up)', row.description):
        result = 'airtime'
    elif re.search(r'(pos|web pos)', row.description):
        result = 'pos'
    elif re.search(r'(salary|sal|salar|allow|allowance)', row.description):
        result = 'salary'
    elif re.search(r'(loan|repayment|lend|borrow)', row.description):
        result = 'loan'
    elif re.search(r'(withdrawal|cshw|wdr|wd|wdl|withdraw|cwdr|cwd|cdwl|csw)', row.description):
        result = 'withdrawals'
    else:
        result = 'other'
    return result
Then apply the label() function to the rows of df:
df['label'] = df.apply(label, axis=1)
   description             label
0  fund trf 0049614823     transfers
1  alat transfer           transfers
2  data purchase via       airtime
3  alat pos buy            pos
4  alat web buy            other
5  atm wd rch debit money  withdrawals
6  alat pos buy            pos
7  salar alert charges     salary
8  mtn purchase via        airtime
9  top- up purchase via    other
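The same idea can also be written as an ordered table of rules, which keeps the patterns in one place. The sketch below makes two assumptions that are not in the answer above: the alternatives are shortened to the substrings that already cover the longer ones (for example `sal` covers `salar` and `salary`), and `top-\s*up` is used so that "top- up" with a space is labelled as airtime. `RULES` and `label_text` are hypothetical names.

```python
import re

# Ordered (pattern, label) pairs; the first pattern that matches wins.
# Assumption: "top-\s*up" also accepts "top- up" written with a space.
RULES = [
    (re.compile(r"tnf|trsf|trtr|trf|transfer"), "transfers"),
    (re.compile(r"airtime|data|vtu|recharge|mtn|glo|top-\s*up"), "airtime"),
    (re.compile(r"pos"), "pos"),
    (re.compile(r"sal|allow"), "salary"),
    (re.compile(r"loan|repayment|lend|borrow"), "loan"),
    (re.compile(r"withdraw|cshw|wd|cdwl|csw"), "withdrawals"),
]

def label_text(text, default="other"):
    """Return the label of the first rule whose pattern occurs in text."""
    for pattern, name in RULES:
        if pattern.search(text):
            return name
    return default

for sample in ["fund trf 0049614823", "atm wd rch debit money",
               "alat web buy", "top- up purchase via"]:
    print(sample, "->", label_text(sample))
```

Because the function takes a plain string, it can be applied to a single column with `df['description'].apply(label_text)` instead of row-wise with `axis=1`.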

Get postal code from full address column in dataframe by regex str.extract() and add as new column in pandas

I have a dataframe with full addresses in one column, and I need to add a column to the same dataframe containing just the postal code: 5 digits, starting with 7. Some addresses may be empty or may not contain a postal code.
How do I extract just the postal code from the column?
For example, 76000 is the postal code in the row at index 0:
MedicalCenters["Postcode"][0]
Location(75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0))
Example Data
Venue Venue Latitude Venue Longitude Venue Category Address
0 Lab. Corregidora 20.595621 -100.392677 Medical Center Location(75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0))
I tried using a regex, but I get an error:
# get zipcode from full address
import re
MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r'\b\d{5}\b', expand=False)
ERROR
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-185-84c21a29d484> in <module>
1 # get zipcode from full address
2 import re
----> 3 MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r'\b\d{5}\b', expand=False)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in wrapper(self, *args, **kwargs)
1950 )
1951 raise TypeError(msg)
-> 1952 return func(self, *args, **kwargs)
1953
1954 wrapper.__name__ = func_name
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in extract(self, pat, flags, expand)
3037 #forbid_nonstring_types(["bytes"])
3038 def extract(self, pat, flags=0, expand=True):
-> 3039 return str_extract(self, pat, flags=flags, expand=expand)
3040
3041 #copy(str_extractall)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in str_extract(arr, pat, flags, expand)
1010 return _str_extract_frame(arr._orig, pat, flags=flags)
1011 else:
-> 1012 result, name = _str_extract_noexpand(arr._parent, pat, flags=flags)
1013 return arr._wrap_result(result, name=name, expand=expand)
1014
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in _str_extract_noexpand(arr, pat, flags)
871
872 regex = re.compile(pat, flags=flags)
--> 873 groups_or_na = _groups_or_na_fun(regex)
874
875 if regex.groups == 1:
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in _groups_or_na_fun(regex)
835 """Used in both extract_noexpand and extract_frame"""
836 if regex.groups == 0:
--> 837 raise ValueError("pattern contains no capture groups")
838 empty_row = [np.nan] * regex.groups
839
ValueError: pattern contains no capture groups
time: 39.5 ms
You need to add parentheses to make the pattern a capture group:
MedicalCenters['Address'].str.extract(r"\b(\d{5})\b")
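Since the question says the postal codes start with 7, the group can be narrowed to `7\d{4}`. A small sketch, assuming a hand-made Series in place of the `MedicalCenters['Address']` column; rows without a match come back as missing values rather than raising an error:

```python
import pandas as pd

# Hypothetical stand-in for the Address column: one real-looking address,
# one string without a postal code, and one missing value.
addresses = pd.Series([
    "75, Avenida Corregidora, Centro, Delegación Centro Histórico, "
    "Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, "
    "México, (20.5955795, -100.39274225, 0.0)",
    "no postal code here",
    None,
])

# The parentheses form the single capture group that str.extract requires;
# expand=False returns a Series instead of a one-column DataFrame.
postcodes = addresses.str.extract(r"\b(7\d{4})\b", expand=False)
print(postcodes.tolist())
```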
You can try to split the string first, then it will be easier to match the postcode:
address = '75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0'
matches = list(filter(lambda x: x.startswith('7') and len(x) == 5, address.split(', '))) # ['76000']
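So you can populate your DataFrame as follows; next(..., None) returns None instead of raising an IndexError when an address contains no postcode:
df['postcode'] = df['address'].apply(lambda address: next((x for x in address.split(', ') if x.startswith('7') and len(x) == 5), None))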
The Address column had dtype object, which is why the regex was not working:
MedicalCenters.dtypes
Venue object
Venue Latitude float64
Venue Longitude float64
Venue Category object
Health System object
geom object
Address object
Postcode object
dtype: object
time: 6.41 ms
After converting the column from object to string:
MedicalCenters['Address'] = MedicalCenters['Address'].astype('str')
I was able to apply the modified regex, thanks to glam:
# get zipcode from full address
import re
MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r"\b(\d{5})\b")

Using MapReducer MRJob and my mapper function gives me an indexerror: list index out of range

I am new to MapReduce with MRJob (and, to be honest, to Python as well). I am trying to use MRJob to count the combinations of pairs of letters, from "A" to "E", that appear in two different columns of a text file, e.g. "A", "A" = 10 occurrences, "A", "B" = 13 occurrences, "C", "E" = 6 occurrences, etc. When I run it I get a "list index out of range" error and, for the life of me, I can't figure out why.
Here is a sample of the text file that I pass to the Python MapReduce script containing the mapper and reducer functions. Each line holds a date, a time, the duration of a phone call, the customer ID of the caller (which begins with a letter from "A" to "E" designating a country), the customer ID of the person receiving the call, and keywords from the conversation. I break each line into a list and, in my mapper, pick out the indexes I am interested in, but I am not sure whether this approach is correct:
Details
2020-03-05 # 19:28 # 5:10 # A-466 # C-563 # tendremos lindo ahi fuimos derecho carajo junto acabar
2020-03-10 # 05:08 # 5:14 # C-954 # D-353 # carajo calle película acaso voz creía irá san montón ambos hablas empieza estaremos parecía mitad estén vuelto música anoche tendremos tenían dormir habitación encuentra ésa
2020-01-15 # 09:47 # 4:46 # C-413 # B-881 # pudiera dejes querido maestro hacerle llamada paz estados estuviera hablo decirle bonito linda blanco negro querida hacerte dormir empieza mayoría
2020-01-10 # 20:54 # 4:58 # E-027 # A-549 # estuviera tuviste vieja volvió solía alrededor decía maestro estaremos línea sigues
2020-03-17 # 21:38 # 5:21 # C-917 # D-138 # encima música barco tuvimos dejes damas boca
Here is the entire code of the python file:
from mrjob.job import MRJob
class MRduracion_llamadas(MRJob):
    def mapper(self, _, line):
        """
        First we need to convert the string from the text file into a list and eliminate the
        unnecessary characters, such as "#", "-", ":", which I have substituted with a ";" to
        facilitate the "split" part of this process.
        """
        table = {35 : 59, 45 : 59, 58 : 59}
        llamadas2020_text_line = [column.strip() for column in \
            (line.translate(table)).split(";")]
        #Now we can assign values to "Key" and "Values"
        print(line)
        pais_emisor = llamadas2020_text_line[7]
        pais_receptor = llamadas2020_text_line[9]
        minutos = ""
        #If a call is "x" minutes and "y" secs long, where y > 0, then we can round up
        #the minutes by 1 minute.
        if int(llamadas2020_text_line[6]) > 0:
            minutos = int(llamadas2020_text_line[5]) + 1
        else:
            minutos = int(llamadas2020_text_line[5])
        yield (pais_emisor, pais_receptor), minutos
    def reducer(self, key, values):
        yield print(key, sum(values))
if __name__ == "__main__":
    MRduracion_llamadas.run()
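One plausible cause, assuming the input file really contains the "Details" header line shown in the sample (or any blank line): such a line has no "#" separators, so the split produces a single-element list and index 7 is out of range. The sketch below isolates the parsing step in a plain function so it can be tested without mrjob; `parse_call` is a hypothetical helper, and skipping short lines is an assumption about the desired behaviour.

```python
# Same translation table as in the question: '#', '-' and ':' become ';'.
TABLE = {35: 59, 45: 59, 58: 59}

def parse_call(line):
    """Return ((caller, receiver), minutes), or None if the line does not parse.

    Hypothetical helper, not part of mrjob.
    """
    fields = [column.strip() for column in line.translate(TABLE).split(";")]
    # A data line yields at least 10 fields before the receiver letter;
    # a header or blank line yields far fewer, which triggers the IndexError.
    if len(fields) < 10:
        return None
    try:
        minutes, seconds = int(fields[5]), int(fields[6])
    except ValueError:
        return None
    if seconds > 0:
        minutes += 1  # round a partial minute up
    return (fields[7], fields[9]), minutes

print(parse_call("Details"))
print(parse_call("2020-03-05 # 19:28 # 5:10 # A-466 # C-563 # tendremos lindo"))
```

Inside the mapper, a line for which the helper returns None would simply not be yielded. Separately, `print()` returns None, so `yield print(key, sum(values))` in the reducer does not emit a key/value pair; `yield key, sum(values)` does.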

Python: Geocoding and extracting longitude, latitude, city

I have a dataframe with addresses in a column and I am using the code below to extract longitude and latitude separately. Is there a way to extract longitude and latitude together as well as extract city using this same approach?
In address column of my "add" dataframe, I have addresses in the following format: "35 Turkey Hill Rd Rt 21 Ste 101,Belchertown, MA 01007"
I am using Python 3, Spyder IDE and Windows 10 desktop.
Sample data:
Address
415 E Main St, Westfield, MA 01085
200 Silver St, Agawam, MA 01001
35 Turkey Hill Rd Rt 21 Ste 101,Belchertown, MA 01007
import sys
from geopy.geocoders import Nominatim, ArcGIS, GoogleV3
#from geopy.exc import GeocoderTimedOut
arc=ArcGIS(timeout=100)
nom = Nominatim(timeout=100)
goo = GoogleV3(timeout=100)
geocoders = [arc,nom, goo]
# Capture Longitude
def geocodelng(address):
    i = 0
    try:
        while i < len(geocoders):
            # try to geocode using a service
            location = geocoders[i].geocode(address)
            # if it returns a location
            if location != None:
                # return those values
                return location.longitude
            else:
                # otherwise try the next one
                i += 1
    except:
        # catch whatever errors, likely timeout, and return null values
        print(sys.exc_info()[0])
        return ['null','null']
    # if all services have failed to geocode, return null values
    return ['null','null']
#Extract co-ordinates
add['longitude']=add['Address'].apply(geocodelng)
# Capture Latitude
def geocodelat(address):
    i = 0
    try:
        while i < len(geocoders):
            # try to geocode using a service
            location = geocoders[i].geocode(address)
            # if it returns a location
            if location != None:
                # return those values
                return location.latitude
            else:
                # otherwise try the next one
                i += 1
    except:
        # catch whatever errors, likely timeout, and return null values
        print(sys.exc_info()[0])
        return ['null','null']
    # if all services have failed to geocode, return null values
    return ['null','null']
#Extract co-ordinates
add['latitude']=add['Address'].apply(geocodelat)
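One way to get everything in a single pass is to return a tuple from one function, so each address is geocoded once instead of twice. The sketch below runs offline: `StubGeocoder` is a hypothetical stand-in for a geopy geocoder, the coordinates are placeholder values, and the `raw["address"]["city"]` layout is an assumption modelled on what Nominatim returns when address details are requested; other services structure their raw response differently.

```python
from types import SimpleNamespace

class StubGeocoder:
    """Hypothetical stand-in for a geopy geocoder so the sketch runs offline."""
    def __init__(self, known):
        self.known = known

    def geocode(self, address):
        # Real geocoders return a Location object or None.
        return self.known.get(address)

def geocode_all(address, geocoders):
    """Return (latitude, longitude, city) from the first service that answers."""
    for geocoder in geocoders:
        try:
            location = geocoder.geocode(address)
        except Exception:
            continue  # e.g. a timeout: try the next service
        if location is not None:
            # Assumed layout of the raw response; adjust per service.
            city = location.raw.get("address", {}).get("city")
            return location.latitude, location.longitude, city
    return None, None, None

# Placeholder coordinates, for illustration only.
westfield = SimpleNamespace(
    latitude=42.12, longitude=-72.75, raw={"address": {"city": "Westfield"}}
)
services = [
    StubGeocoder({}),
    StubGeocoder({"415 E Main St, Westfield, MA 01085": westfield}),
]

result = geocode_all("415 E Main St, Westfield, MA 01085", services)
print(result)
```

With real geopy geocoders in the list, the function could be applied to the Address column and the returned tuples expanded into three columns.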
