I have a file in which each line is split on "|". I want to compare argument 5 of each line and proceed only if the values intersect. This leads to a second part:
the first two arguments (1 and 2) are compared via a dictionary, and if they are the same AND
arguments 5 and 6 overlap,
then those lines get concatenated.
How do I compare the intersection of values under the same key? The code below works across keys but not within the same key:
from functools import reduce
reduce(set.intersection, (set(val) for val in query_dict.values()))
Here is an example of lines:
text1|text2|text3|text4 text 5| text 6| text 7 text 8|
text1|text2|text12|text4 text 5| text 6| text 7|
text9|text10|text3|text4 text 5| text 11| text 12 text 8|
The output should be:
text1|text2|text3;text12|text4;text5;text4;text5|text6;text7 text8;text6
In other words, only those lines that are matching by 1st,2nd arguments (cells equal) and if 5th,6th arguments are overlapping (intersection) are concatenated.
Here is input file:
Angela Darvill|19036321|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['GB','US']|['Salford', 'Eccles', 'Manchester']
Helen Stanley|19036320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
Angela Darvill|190323121|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['US']|['Brighton', 'Eccles', 'Manchester']
Helen Stanley|19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
The output should look like:
Angela Darvill|19036321;190323121|...
Helen Stanley|19036320;19576876320|...
Angela Darvill gets stacked because two records share same name, same country and same city(-ies).
Based on your improved question :
import itertools
data = """\
Angela Darvill|19036321|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['GB','US']|['Salford', 'Eccles', 'Manchester']
Helen Stanley|19036320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
Angela Darvill|190323121|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['US']|['Brighton', 'Eccles', 'Manchester']
Helen Stanley|19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
"""
lines = tuple(tuple(line.split('|')) for line in data.splitlines())
results = []
for line_a_index, line_a in enumerate(lines):
    # we want to compare each line with each other, so we start at index+1
    for line_b_index, line_b in enumerate(lines[line_a_index+1:], start=line_a_index+1):
        assert len(line_a) >= 5, f"not enough cells ({len(line_a)}) in line {line_a_index}"
        assert len(line_b) >= 5, f"not enough cells ({len(line_b)}) in line {line_b_index}"
        assert all(isinstance(cell, str) for cell in line_a)
        assert all(isinstance(cell, str) for cell in line_b)
        columns0_are_equal = line_a[0] == line_b[0]
        columns1_are_equal = line_a[1] == line_b[1]
        columns3_are_overlap = set(line_a[3]).issubset(set(line_b[3])) or set(line_b[3]).issubset(set(line_a[3]))
        columns4_are_overlap = set(line_a[4]).issubset(set(line_b[4])) or set(line_b[4]).issubset(set(line_a[4]))
        print(f"between lines index={line_a_index} and index={line_b_index}, {columns0_are_equal=} {columns1_are_equal=} {columns3_are_overlap=} {columns4_are_overlap=}")
        if (
            columns0_are_equal
            # and columns1_are_equal
            and (columns3_are_overlap or columns4_are_overlap)
        ):
            print("MATCH!")
            results.append(
                (line_a_index, line_b_index,) + tuple(
                    ((cell_a or "") + (";" if (cell_a or cell_b) else "") + (cell_b or "")) if cell_a != cell_b
                    else cell_a
                    for cell_a, cell_b in itertools.zip_longest(line_a, line_b)
                )
            )
print("Fancy output :")
lines_to_display = set(itertools.chain.from_iterable((lines[result[0]], lines[result[1]], result[2:]) for result in results))
columns_widths = (max(len(str(index)) for result in results for index in (result[0], result[1])),) + tuple(
    max(len(cell) for cell in column)
    for column in zip(*lines_to_display)
)
for width in columns_widths:
    print("-" * width, end="|")
print("")
for result in results:
    for line_index, original_line in zip((result[0], result[1]), (lines[result[0]], lines[result[1]])):
        for column_index, cell in zip(itertools.count(), (str(line_index),) + original_line):
            if cell:
                print(cell.ljust(columns_widths[column_index]), end='|')
        print("", end='\n')  # explicit newline
    for column_index, cell in zip(itertools.count(), ("=",) + result[2:]):
        if cell:
            print(cell.ljust(columns_widths[column_index]), end='|')
    print("", end='\n')  # explicit newline
for width in columns_widths:
    print("-" * width, end="|")
print("")
expected_outputs = """\
Angela Darvill|19036321;190323121|...
Helen Stanley|19036320;19576876320|...
""".splitlines()
for result, expected_output in itertools.zip_longest(results, expected_outputs):
    actual_output = "|".join(result[2:])
    assert actual_output.startswith(expected_output[:-3])  # minus the "..."
-|--------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------|------------------|------------------------------------------------------------------------|
0|Angela Darvill|19036321 |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK. |['GB','US'] |['Salford', 'Eccles', 'Manchester'] |
2|Angela Darvill|190323121 |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK. |['US'] |['Brighton', 'Eccles', 'Manchester'] |
=|Angela Darvill|19036321;190323121 |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK. |['GB','US'];['US']|['Salford', 'Eccles', 'Manchester'];['Brighton', 'Eccles', 'Manchester']|
1|Helen Stanley |19036320 |Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US'] |['Brighton', 'Brighton'] |
3|Helen Stanley |19576876320 |Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US'] |['Brighton', 'Brighton'] |
=|Helen Stanley |19036320;19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US'] |['Brighton', 'Brighton'] |
-|--------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------|------------------|------------------------------------------------------------------------|
You can see that the lines index 0 and 2 have been merged, same for the lines index 1 and 3.
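Because each cell is still a string at this point, `set(line_a[3])` is a set of characters rather than of list items. Here is a minimal sketch of an item-level overlap check, assuming the cells always hold Python-style list literals such as `['GB','US']`. The helper names `parse_cell` and `cells_overlap` are hypothetical and not part of the answer above, and the two rows are abbreviated from the input:

```python
import ast


def parse_cell(cell):
    """Turn a cell like "['GB','US']" into the set {'GB', 'US'}."""
    # assumption: the cell is a valid Python list literal
    return set(ast.literal_eval(cell))


def cells_overlap(cell_a, cell_b):
    """Return True when the two list-like cells share at least one item."""
    return bool(parse_cell(cell_a) & parse_cell(cell_b))


# abbreviated versions of input lines 0 and 2 (address shortened)
row_0 = "Angela Darvill|19036321|School of Nursing|['GB','US']|['Salford', 'Eccles', 'Manchester']".split("|")
row_2 = "Angela Darvill|190323121|School of Nursing|['US']|['Brighton', 'Eccles', 'Manchester']".split("|")

print(cells_overlap(row_0[3], row_2[3]))  # countries share 'US'
print(cells_overlap(row_0[4], row_2[4]))  # cities share 'Eccles' and 'Manchester'
```

Using `&` (intersection) rather than `issubset` means a partial overlap, as in the two city lists here, also counts as a match.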
from itertools import zip_longest
data = """\
text1|text2|text3|text4 text 5| text 6| text 7 text 8|
text1|text2|text12|text4 text 5| text 6| text 7| text9|text10|text3|text4 text 5| text 11| text 12 text 8|
"""
lines = tuple(line.split('|') for line in data.splitlines())
number_of_lines = len(lines)
print(f"number of lines : {number_of_lines}")
print(f"number of cells in line 1 : {len(lines[0])}")
print(f"number of cells in line 2 : {len(lines[1])}")
print(f"{lines[0]=}")
print(f"{lines[1]=}")
result = []
# we want to compare each line with each other :
for line_a_index, line_a in enumerate(lines):
    for line_b_index, line_b in enumerate(lines[line_a_index+1:], start=line_a_index+1):
        assert len(line_a) >= 5, f"not enough cells ({len(line_a)}) in line {line_a_index}"
        assert len(line_b) >= 5, f"not enough cells ({len(line_b)}) in line {line_b_index}"
        assert all(isinstance(cell, str) for cell in line_a)
        assert all(isinstance(cell, str) for cell in line_b)
        if line_a[0] == line_b[0] and line_a[1] == line_b[1] and (
            line_a[5] in line_b[5] or line_a[6] in line_b[6]  # A in B
            or line_b[5] in line_a[5] or line_b[6] in line_a[6]  # B in A
        ):
            result.append(tuple(
                ((cell_a or "") + (";" if (cell_a or cell_b) else "") + (cell_b or "")) if cell_a != cell_b else cell_a
                for cell_a, cell_b in zip_longest(line_a[:5+1], line_b[:5+1])  # <-- here I truncated the lines
            ))
# I decided to have a fancy output, but I made some simplifying assumptions to make it simple
if len(result) > 1:
    raise NotImplementedError
widths = tuple(max(len(a) if a is not None else 0, len(b) if b is not None else 0, len(c) if c is not None else 0)
               for a, b, c in zip_longest(lines[0], lines[1], result[0]))
length = max(len(lines[0]), len(lines[1]), len(result[0]))
for line in (lines[0], lines[1], result[0]):
    for index, cell in zip_longest(range(length), line):
        if cell:
            print(cell.ljust(widths[index]), end='|')
    print("", end='\n')  # explicit newline
original_expected_output = "text1|text2|text3;text12|text4;text5;text4;text5|text6;text7 text8;text6"
print(f"{original_expected_output} <-- expected")
lenormju_expected_output = "text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7"
print(f"{lenormju_expected_output} <-- fixed")
output
number of lines : 2
number of cells in line 1 : 7
number of cells in line 2 : 13
lines[0]=['text1', 'text2', 'text3', 'text4 text 5', ' text 6', ' text 7 text 8', '']
lines[1]=['text1', 'text2', 'text12', 'text4 text 5', ' text 6', ' text 7', ' text9', 'text10', 'text3', 'text4 text 5', ' text 11', ' text 12 text 8', '']
text1|text2|text3 |text4 text 5| text 6| text 7 text 8 |
text1|text2|text12 |text4 text 5| text 6| text 7 | text9|text10|text3|text4 text 5| text 11| text 12 text 8|
text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7|
text1|text2|text3;text12|text4;text5;text4;text5|text6;text7 text8;text6 <-- expected
text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7 <-- fixed
EDIT:
from dataclasses import dataclass
from itertools import zip_longest
data = """\
text1|text2|text3|text4 text 5| text 6| text 7 text 8|
text1|text2|text12|text4 text 5| text 6| text 7| text9|text10|text3|text4 text 5| text 11| text 12 text 8|
"""
@dataclass
class Match:  # a convenient way to store the solutions
    line_a_index: int
    line_b_index: int
    line_result: tuple
lines = tuple(line.split('|') for line in data.splitlines())
results = []
for line_a_index, line_a in enumerate(lines):
    for line_b_index, line_b in enumerate(lines[line_a_index+1:], line_a_index+1):
        assert len(line_a) >= 5, f"not enough cells ({len(line_a)}) in line {line_a_index}"
        assert len(line_b) >= 5, f"not enough cells ({len(line_b)}) in line {line_b_index}"
        assert all(isinstance(cell, str) for cell in line_a)
        assert all(isinstance(cell, str) for cell in line_b)
        if line_a[0] == line_b[0] and line_a[1] == line_b[1] and (
            line_a[5] in line_b[5] or line_a[6] in line_b[6]  # A in B
            or line_b[5] in line_a[5] or line_b[6] in line_a[6]  # B in A
        ):
            line_result = tuple(
                ((cell_a or "") + (";" if (cell_a or cell_b) else "") + (cell_b or "")) if cell_a != cell_b else cell_a
                for cell_a, cell_b in zip_longest(line_a[:5+1], line_b[:5+1])  # <-- here I truncated the lines
            )
            results.append(Match(line_a_index=line_a_index, line_b_index=line_b_index, line_result=line_result))
# simple output of the solution
for result in results:
    print(f"line n°{result.line_a_index} matches with n°{result.line_b_index} : {result.line_result}")
line n°0 matches with n°1 : ('text1', 'text2', 'text3;text12', 'text4 text 5', ' text 6', ' text 7 text 8; text 7')
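The cell-merging expression used in the comprehensions above can be hard to read inline. Here is a sketch that pulls it into helpers, with the hypothetical names `merge_cells` and `merge_rows`. Unlike the original expression, this variant drops empty or missing cells instead of leaving a dangling `;`:

```python
from itertools import zip_longest


def merge_cells(cell_a, cell_b, sep=";"):
    """Keep one copy of identical cells, otherwise join the non-empty ones."""
    if cell_a == cell_b:
        return cell_a
    return sep.join(cell for cell in (cell_a, cell_b) if cell)


def merge_rows(row_a, row_b, sep=";"):
    """Merge two rows cell by cell; the shorter row is padded with None."""
    return tuple(merge_cells(a, b, sep) for a, b in zip_longest(row_a, row_b))


merged = merge_rows(
    ["text1", "text2", "text3", "text4 text 5"],
    ["text1", "text2", "text12", "text4 text 5"],
)
print("|".join(merged))  # text1|text2|text3;text12|text4 text 5
```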
I am trying to get weather data from the pyOWM package using the city name, but in some cases a typo in the city name
means no data is returned and the process breaks.
I want to get the weather data using lat-long instead, but I don't know how to write a function for that.
Df1:
-----
User  City          State                Zip     Lat         Long
-----------------------------------------------------------------------------
A     Kuala Lumpur  Wilayah Persekutuan  50100   5.3288907   103.1344397
B     Dublin        County Dublin        NA      50.2030506  14.5509842
C     Oconomowoc    NA                   NA      53.3640384  -6.1953066
D     Mumbai        Maharashtra          400067  19.2177166  72.9708833
E     Mratin        Stredocesky kraj     250 63  40.7560585  -5.6924778
.
.
.
----------------------------------
Code:
--------
import time
from tqdm.notebook import tqdm
import pyowm
from pyowm.utils import config
from pyowm.utils import timestamps
import pandas as pd  # needed for the DataFrame built in Step-3
cities = Df1["City"].unique().tolist()
cities1 = cities [:5]
owm = pyowm.OWM('bee8db7d50a4b777bfbb9f47d9beb7d0')
mgr = owm.weather_manager()
'''
Step-1 Define list where save the data
'''
list_wind_Speed =[]
list_tempreture =[]
list_max_temp =[]
list_min_temp =[]
list_humidity =[]
list_pressure =[]
list_city = []
list_cloud=[]
list_status =[]
list_rain =[]
'''
Step-2 Fetch data
'''
j = 0
for city in tqdm(cities1):
    j += 1
    if j < 60:
        # one_call_obs = owm.weather_at_coords(52.5244, 13.4105).weather
        # one_call_obs.current.humidity
        observation = mgr.weather_at_place(city)
        l = observation.weather
        list_city.append(city)
        list_wind_Speed.append(l.wind()['speed'])
        list_tempreture.append(l.temperature('celsius')['temp'])
        list_max_temp.append(l.temperature('celsius')['temp_max'])
        list_min_temp.append(l.temperature('celsius')['temp_min'])
        list_humidity.append(l.humidity)
        list_pressure.append(l.pressure['press'])
        list_cloud.append(l.clouds)
        list_status.append(l.detailed_status)
        list_rain.append(l.rain)
    else:
        time.sleep(60)
        j = 0
'''
Step-3 Blank data frame and store data in that
'''
df2 = pd.DataFrame()
df2["City"] = list_city
df2["Temp"] = list_tempreture
df2["Max_Temp"] = list_max_temp
df2["Min_Temp"] = list_min_temp
df2["Cloud"] = list_cloud
df2["Humidity"] = list_humidity
df2["Pressure"] = list_pressure
df2["Status"] = list_status
df2["Rain"] = list_status
df2
From the above code, I get the result as below,
City | Temp |Max_Temp|Min_Temp|Cloud |Humidity|Pressure |Status | Rain
------------------------------------------------------------------------------------------
Kuala Lumpur|29.22 |30.00 |27.78 | 20 |70 |1007 | moderate rain | moderate rain
Dublin |23.12 |26.43 |22.34 | 15 |89 | 978 | cloudy | cloudy
...
Now the process stops because of typos in some city names.
I am looking for an alternative and want to get the weather data from lat-long instead, but I don't know how to write a function that takes the Lat and Long column data.
Df1 = {'User':['A','B','C','D','E'],
       'City':['Kuala Lumpur','Dublin','Oconomowoc','Mumbai','Mratin'],
       'State':['Wilayah Persekutuan','County Dublin',None,'Maharashtra','Stredocesky kraj'],
       'Zip': ['50100',None,None,'400067','250 63'],
       'Lat':[5.3288907,50.2030506,53.3640384,19.2177166,40.7560585],
       'Long':[103.1344397,14.5509842,-6.1953066,72.9708833,-5.6924778]}
# Try to use this code to get weather data
# one_call_obs = owm.weather_at_coords(52.5244, 13.4105).weather
# one_call_obs.current.humidity
Expected Result
--------------
User | City | Lat | Long | Temp | Cloud | Humidity | Pressure | Rain | Status
-----------------------------------------------------------------------------
Catch the error if a city is not found, parse the lat/lon from the dataframe. Use that lat/lon to create a bounding box and use weather_at_places_in_bbox to get a list of observations in that area.
import time
from tqdm.notebook import tqdm
import pyowm
from pyowm.utils import config
from pyowm.utils import timestamps
import pandas as pd
from pyowm.commons.exceptions import NotFoundError, ParseAPIResponseError
df1 = pd.DataFrame({'City': ('Kuala Lumpur', 'Dublin', 'Oconomowoc', 'Mumbai', 'C airo', 'Mratin'),
'Lat': ('5.3288907', '50.2030506', '53.3640384', '19.2177166', '30.22', '40.7560585'),
'Long': ('103.1344397', '14.5509842', '-6.1953066', '72.9708833', '31', '-5.6924778')})
cities = df1["City"].unique().tolist()
owm = pyowm.OWM('bee8db7d50a4b777bfbb9f47d9beb7d0')
mgr = owm.weather_manager()
for city in cities:
    try:
        observation = mgr.weather_at_place(city)
        # print(city, observation)
    except NotFoundError:
        # get city by lat/lon
        lat_top = float(df1.loc[df1['City'] == city, 'Lat'].iloc[0])
        lon_left = float(df1.loc[df1['City'] == city, 'Long'].iloc[0])
        lat_bottom = lat_top - 0.3
        lon_right = lon_left + 0.3
        try:
            observations = mgr.weather_at_places_in_bbox(lon_left, lat_bottom, lon_right, lat_top, zoom=5)
            observation = observations[0]
        except ParseAPIResponseError:
            raise RuntimeError(f"Couldn't find {city} at lat: {lat_top} / lon: {lon_left}, try tweaking the bounding box")
    weather = observation.weather
    temp = weather.temperature('celsius')['temp']
    print(f"The current temperature in {city} is {temp}")
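The bounding-box arithmetic in the fallback can be isolated into a small function, which makes the box easier to tweak. Here is a minimal sketch, assuming the same 0.3 degree offset as above and returning the values in the order they are passed to `weather_at_places_in_bbox` in this answer. The name `bbox_from_point` is hypothetical:

```python
def bbox_from_point(lat, lon, delta=0.3):
    """Return (lon_left, lat_bottom, lon_right, lat_top) for a box anchored at (lat, lon).

    As in the answer above, the point is the top-left corner of the box,
    so the box extends `delta` degrees south and east of it.
    """
    lat_top = float(lat)
    lon_left = float(lon)
    lat_bottom = lat_top - delta
    lon_right = lon_left + delta
    return lon_left, lat_bottom, lon_right, lat_top


# the misspelled 'C airo' row from df1, whose coordinates are stored as strings
box = bbox_from_point('30.22', '31')
print(box)
```

The tuple can then be unpacked into the call, for example `mgr.weather_at_places_in_bbox(*box, zoom=5)`.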
Fill in the code to check if the text passed includes a possible U.S. zip code, formatted as follows: exactly 5 digits, and sometimes, but not always, followed by a dash with 4 more digits. The zip code needs to be preceded by at least one space, and cannot be at the start of the text.
Couldn't produce the required output.
import re
def check_zip_code (text):
    result = re.search(r"\w+\d{5}-?(\d{4})?", text)
    return result != None
print(check_zip_code("The zip codes for New York are 10001 thru 11104.")) # True
print(check_zip_code("90210 is a TV show")) # False
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001.")) # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9.")) # False
You could use
(?!\A)\b\d{5}(?:-\d{4})?\b
Full code:
import re
def check_zip_code (text):
    m = re.search(r'(?!\A)\b\d{5}(?:-\d{4})?\b', text)
    return True if m else False
print(check_zip_code("The zip codes for New York are 10001 thru 11104.")) # True
print(check_zip_code("90210 is a TV show")) # False
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001.")) # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9.")) # False
Meanwhile I found that there's a package called zipcodes which might be of additional help.
import re
def check_zip_code (text):
    return bool(re.search(r" (\b\d{5}(?!-)\b)| (\b\d{5}-\d{4}\b)", text))
assert check_zip_code("The zip codes for New York are 10001 thru 11104.") is True
assert check_zip_code("90210 is a TV show") is False
assert check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001.") is True
assert check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9.") is False
assert check_zip_code("x\n90201") is False
assert check_zip_code("the zip somewhere is 98230-0000") is True
assert check_zip_code("the zip somewhere else is not 98230-00000000") is False
import re
def check_zip_code (text):
    result = re.search(r"\d{5}[-\.d{4}]", text)
    return result != None
print(check_zip_code("The zip codes for New York are 10001 thru 11104.")) # True
print(check_zip_code("90210 is a TV show")) # False
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001.")) # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9.")) # False
Try this one.
The zip code must be preceded by a space.
This pattern is an example for this problem only:
r"[\s][\d]{5}"
Try this code and you will get the expected result:
import re
def check_zip_code (text):
    result = re.search(r"\s+\d{5}-?(\d{4})?", text)
    return result != None
import re
def check_zip_code (text):
    result = re.search(r" \d{5}|\d{5}-\d{4}", text)
    return result != None
I think this covers all the conditions
result = re.search(r"\s\d{5}?(-\d{4})?", text)
This code works perfectly and generates the required output.
import re
def check_zip_code (text):
    result = re.search(r" \d{5}(-\d{4})?", text)
    return result != None
print(check_zip_code("The zip codes for New York are 10001 thru 11104.")) # True
print(check_zip_code("90210 is a TV show")) # False
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001.")) # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9.")) # False
True
False
True
False
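The pattern ` \d{5}(-\d{4})?` passes the four examples, but it would also accept a longer digit run such as ` 123456`. Here is a hedged sketch that adds a negative lookahead to reject those cases, assuming that a zip code followed directly by another digit or a dash should not count. The name `ZIP_RE` is hypothetical:

```python
import re

# a literal space, five digits, an optional "-dddd" suffix,
# and then no further digit or dash
ZIP_RE = re.compile(r" \d{5}(?:-\d{4})?(?![\d-])")


def check_zip_code(text):
    """Return True if text contains a space-preceded U.S.-style zip code."""
    return ZIP_RE.search(text) is not None


print(check_zip_code("The zip codes for New York are 10001 thru 11104."))  # True
print(check_zip_code("90210 is a TV show"))  # False
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001."))  # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9."))  # False
print(check_zip_code("the zip somewhere else is not 98230-00000000"))  # False
print(check_zip_code("order number 123456"))  # False
```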
I have data like below:
Colindale London
London Borough of Bromley
Crystal Palace, London
Bermondsey, London
Camden, London
This is my code:
def clean_whitespace(s):
    out = str(s).replace(' ', '')
    return out.lower()
My code currently just returns the string with the whitespace removed. How can I select the first part of the string (the text before the comma)? For example:
Crystal Palace, London -> crystal-palace
Bermondsey, London -> bermondsey
Camden, London -> camden
You can try this code:
s = 'Bermondsey, London'
def clean_whitespace(s):
    out = str(s).split(',', 1)[0]
    out = out.strip()
    out = out.replace(' ', '-')
    return out.lower()
print(clean_whitespace(s))
Output:
bermondsey
Try this below :
s = "Crystal Palace, London"
output = s.split(',')[0].replace(' ', '-').lower()
print(output)
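Here is a slightly more defensive variant, as a sketch: it also collapses repeated whitespace before hyphenating. It assumes that entries without a comma, such as `Colindale London`, should be kept whole, which the question does not specify. The name `first_part_slug` is hypothetical:

```python
import re


def first_part_slug(s):
    """Lower-case the text before the first comma and hyphenate its words."""
    first = str(s).split(",", 1)[0].strip()
    # collapse any run of whitespace into a single hyphen
    return re.sub(r"\s+", "-", first).lower()


for place in ["Crystal Palace, London", "Bermondsey, London", "Camden, London"]:
    print(place, "->", first_part_slug(place))
```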
I'm trying to split a string into address, city, state and zip code, but I am unable to split it successfully.
Here is my code:
address = "4502 150th Pl SE, Bellevue, WA 98006"
my_add = address.split(',')
street = my_add[0]
city = my_add[1]
state_zip = my_add[2]
state_zip = state_zip
state = state_zip.split(' ')
print(street)
print(city)
print(state_zip)
print(state)
# 4502 150th Pl SE
# Bellevue
# WA 98006
# ['', 'WA', '98006']
I expect that address will be split as:
address : 4502 150th Pl SE
city : Bellevue
state: WA
zipcode: 98006
Can anyone help me find the best possible solution? Thanks.
If you are sure that a comma is always followed by a space, you can do this:
address = "4502 150th Pl SE, Bellevue, WA 98006"
street, city, state_info = address.split(", ")
state, zipcode = state_info.split(" ")
print("address:", street)
print("city:", city)
print("state:", state)
print("zipcode:", zipcode)
You are getting some extra spaces in there, and since you are splitting on spaces, splitting my_add[2] gives you three elements: an empty string (from before the first space), your state, and your zip code. You can add .strip() to your code to fix this:
street = my_add[0].strip()
city = my_add[1].strip()
state_zip = my_add[2].strip() # remove extra spaces
state_zip = state_zip.split(' ') # now split on space to get state and zip
state = state_zip[0] # first element: state
zip_code = state_zip[1] # second element: zip
print(street)
print(city)
print(state_zip)
print(state)
print(zip_code)
# 4502 150th Pl SE
# Bellevue
# ['WA', '98006']
# WA
# 98006
I think your solution would be the code below:
address = "4502 150th Pl SE, Bellevue, WA 98006"
my_add = address.split(',')
street = my_add[0]
city = my_add[1]
state_zip = my_add[2]
state_zip_split = state_zip.split(' ')
state_zip = state_zip_split[2]
state = state_zip_split[1]
print("Street: ", street)
print("City: ", city)
print("State Zip: ", state_zip)
print("State: ", state)
You defined state_zip as a single string; you need to split it one more time to get the state and the zip code.
You can try this.
>>> address = "4502 150th Pl SE, Bellevue, WA 98006"
>>> my_add = address.split(',')
>>> street = my_add[0]
>>> street
'4502 150th Pl SE'
>>> city = my_add[1].strip()
>>> city
'Bellevue'
>>> state_zip = my_add[2].split()[1]
>>> state_zip
'98006'
>>> state = my_add[2].split()[0]
>>> state
'WA'
Hope it helps.
One way to solve this:
import re
address = "4502 150th Pl SE, Bellevue, WA 98006"
# split on spaces and commas, dropping the empty strings
*add1, city, state, zipcode = [x for x in re.split('[ ,]', address) if x != '']
add1 = ' '.join(add1)
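Splitting on every space breaks for multi-word city names. Here is a sketch using a single regular expression with named groups, assuming the address always has the form `street, city, ST zip` with a two-letter state code. `ADDRESS_RE` and `parse_address` are hypothetical names:

```python
import re

ADDRESS_RE = re.compile(
    r"^\s*(?P<street>[^,]+?)\s*,"         # everything up to the first comma
    r"\s*(?P<city>[^,]+?)\s*,"            # everything up to the second comma
    r"\s*(?P<state>[A-Z]{2})\s+"          # two-letter state code
    r"(?P<zipcode>\d{5}(?:-\d{4})?)\s*$"  # 5-digit zip, optional 4-digit suffix
)


def parse_address(address):
    """Return a dict with street, city, state and zipcode, or None if no match."""
    match = ADDRESS_RE.match(address)
    return match.groupdict() if match else None


print(parse_address("4502 150th Pl SE, Bellevue, WA 98006"))
```

Returning `None` for anything that does not fit the assumed layout makes malformed addresses easy to detect instead of raising an unpacking error.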