I am trying to do the following:
Read through a specific portion of a text file (there is a known starting point and ending point)
While reading through these lines, check to see if a word matches a word that I have included in a list
If a match is detected, then add that specific word to a new list
I have been able to read through the text and grab the other data I need from it, but so far I have not managed to do the above.
I have tried to implement the following example: Python - Search Text File For Any String In a List
But I have failed to make it read correctly.
I have also tried to adapt the following: https://www.geeksforgeeks.org/python-finding-strings-with-given-substring-in-list/
But I was equally unsuccessful.
Here is some of my code:
import re
from itertools import islice
import os
# list of all countries
oneCountries = "Afghanistan, Albania, Algeria, Andorra, Angola, Antigua & Deps, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bhutan, Bolivia, Bosnia Herzegovina, Botswana, Brazil, Brunei, Bulgaria, Burkina, Burma, Burundi, Cambodia, Cameroon, Canada, Cape Verde, Central African Rep, Chad, Chile, China, Republic of China, Colombia, Comoros, Democratic Republic of the Congo, Republic of the Congo, Costa Rica,, Croatia, Cuba, Cyprus, Czech Republic, Danzig, Denmark, Djibouti, Dominica, Dominican Republic, East Timor, Ecuador, Egypt, El Salvador, Equatorial Guinea, Eritrea, Estonia, Ethiopia, Fiji, Finland, France, Gabon, Gaza Strip, The Gambia, Georgia, Germany, Ghana, Greece, Grenada, Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Holy Roman Empire, Honduras, Hungary, Iceland, India, Indonesia, Iran, Iraq, Republic of Ireland, Israel, Italy, Ivory Coast, Jamaica, Japan, Jonathanland, Jordan, Kazakhstan, Kenya, Kiribati, North Korea, South Korea, Kosovo, Kuwait, Kyrgyzstan, Laos, Latvia, Lebanon, Lesotho, Liberia, Libya, Liechtenstein, Lithuania, Luxembourg, Macedonia, Madagascar, Malawi, Malaysia, Maldives, Mali, Malta, Marshall Islands, Mauritania, Mauritius, Mexico, Micronesia, Moldova, Monaco, Mongolia, Montenegro, Morocco, Mount Athos, Mozambique, Namibia, Nauru, Nepal, Newfoundland, Netherlands, New Zealand, Nicaragua, Niger, Nigeria, Norway, Oman, Ottoman Empire, Pakistan, Palau, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Prussia, Qatar, Romania, Rome, Russian Federation, Rwanda, St Kitts & Nevis, St Lucia, Saint Vincent & the Grenadines, Samoa, San Marino, Sao Tome & Principe, Saudi Arabia, Senegal, Serbia, Seychelles, Sierra Leone, Singapore, Slovakia, Slovenia, Solomon Islands, Somalia, South Africa, Spain, Sri Lanka, Sudan, Suriname, Swaziland, Sweden, Switzerland, Syria, Tajikistan, Tanzania, Thailand, Togo, Tonga, Trinidad & Tobago, Tunisia, 
Turkey, Turkmenistan, Tuvalu, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Vanuatu, Vatican City, Venezuela, Vietnam, Yemen, Zambia, Zimbabwe"
countries = oneCountries.split(",")
path = "C:/Users/me/Desktop/read.txt"
thefile = open(path, errors='ignore')
countryParsing = False
for line in thefile:
    line = line.strip()
# if line.startswith("Submitting Author:"):
# if re.match(r"Submitting Author:", line):
# print("blahblah1")
# countryParsing = True
# if countryParsing == True:
# print("blahblah2")
#
# res = [x for x in line if re.search(countries, x)]
# print("blah blah3: " + str(res))
# elif re.match(r"Running Head:", line):
# countryParsing = False
# if countryParsing == True:
# res = [x for x in line if re.search(countries, x)]
# print("blah blah4: " + str(res))
# for x in countries:
# if x in thefile:
# print("a country is: " + x)
# if any(s in line for s in countries):
# listOfAuthorCountries = listOfAuthorCountries + s + ", "
# if re.match(f"Submitting Author:, line"):
The #commented out lines are versions of the code that I've tried and failed to make work properly.
As requested, this is an example of the text file that I'm trying to grab the data from. I've modified it to remove sensitive information, but in this particular case, the "new list" should be appended with a certain number of "France" entries:
txt above....
Submitting Author:
asdf, asdf (proxy)
France
asdfasdf
blah blah
asdfasdf
asdf, Provence-Alpes-Côte d'Azu 13354
France
blah blah
France
asdf
Running Head:
...more text below
Based on the three points you stated on what you want to accomplish and what I understand from your code (which may not be what you intended), I propose:
# list of all countries
countries = "Afghanistan, Albania, Algeria, Andorra, Angola, Antigua & Deps, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bhutan, Bolivia, Bosnia Herzegovina, Botswana, Brazil, Brunei, Bulgaria, Burkina, Burma, Burundi, Cambodia, Cameroon, Canada, Cape Verde, Central African Rep, Chad, Chile, China, Republic of China, Colombia, Comoros, Democratic Republic of the Congo, Republic of the Congo, Costa Rica, Croatia, Cuba, Cyprus, Czech Republic, Danzig, Denmark, Djibouti, Dominica, Dominican Republic, East Timor, Ecuador, Egypt, El Salvador, Equatorial Guinea, Eritrea, Estonia, Ethiopia, Fiji, Finland, France, Gabon, Gaza Strip, The Gambia, Georgia, Germany, Ghana, Greece, Grenada, Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Holy Roman Empire, Honduras, Hungary, Iceland, India, Indonesia, Iran, Iraq, Republic of Ireland, Israel, Italy, Ivory Coast, Jamaica, Japan, Jonathanland, Jordan, Kazakhstan, Kenya, Kiribati, North Korea, South Korea, Kosovo, Kuwait, Kyrgyzstan, Laos, Latvia, Lebanon, Lesotho, Liberia, Libya, Liechtenstein, Lithuania, Luxembourg, Macedonia, Madagascar, Malawi, Malaysia, Maldives, Mali, Malta, Marshall Islands, Mauritania, Mauritius, Mexico, Micronesia, Moldova, Monaco, Mongolia, Montenegro, Morocco, Mount Athos, Mozambique, Namibia, Nauru, Nepal, Newfoundland, Netherlands, New Zealand, Nicaragua, Niger, Nigeria, Norway, Oman, Ottoman Empire, Pakistan, Palau, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Prussia, Qatar, Romania, Rome, Russian Federation, Rwanda, St Kitts & Nevis, St Lucia, Saint Vincent & the Grenadines, Samoa, San Marino, Sao Tome & Principe, Saudi Arabia, Senegal, Serbia, Seychelles, Sierra Leone, Singapore, Slovakia, Slovenia, Solomon Islands, Somalia, South Africa, Spain, Sri Lanka, Sudan, Suriname, Swaziland, Sweden, Switzerland, Syria, Tajikistan, Tanzania, Thailand, Togo, Tonga, Trinidad & Tobago, Tunisia, Turkey, 
Turkmenistan, Tuvalu, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Vanuatu, Vatican City, Venezuela, Vietnam, Yemen, Zambia, Zimbabwe"
countries = countries.split(",")
countries = [c.strip() for c in countries]
filename = "read.txt"
filehandle = open(filename, errors='ignore')
my_other_list = []
toParse = False
for line in filehandle:
    line = line.strip()
    if line.startswith("Submitting Author:"):
        toParse = True
        continue
    elif line.startswith("Running Head:"):
        toParse = False
        continue
    elif toParse:
        for c in countries:
            if c in line:
                my_other_list.append(c)
EDIT SUMMARY
Adapted code to work on the text sample provided.
Fixed the list of countries (originally there were two commas after Costa Rica).
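One caveat with the substring test `c in line`: it also reports shorter names that are contained in longer ones, for example "Niger" on a line that says "Nigeria", or "Guinea" on a line that says "Guinea-Bissau". If each country sits on its own line, as in the sample, a whole-line comparison avoids that. Below is a minimal sketch under that assumption; the function name `countries_between`, the shortened country list and the extra "Nigeria" line are only there for illustration.

```python
import io

def countries_between(lines, countries,
                      start="Submitting Author:", end="Running Head:"):
    """Collect lines that exactly match a country name between two markers."""
    wanted = set(countries)   # set membership compares whole strings
    found = []
    parsing = False
    for raw in lines:
        line = raw.strip()
        if line.startswith(start):
            parsing = True
        elif line.startswith(end):
            parsing = False
        elif parsing and line in wanted:
            found.append(line)
    return found

# Shortened country list and sample text, for illustration only.
countries = ["France", "Niger", "Nigeria"]
sample = io.StringIO(
    "txt above....\n"
    "Submitting Author:\n"
    "asdf, asdf (proxy)\n"
    "France\n"
    "asdf, Provence-Alpes-Côte d'Azu 13354\n"
    "France\n"
    "Nigeria\n"
    "Running Head:\n"
    "France\n"
)
result = countries_between(sample, countries)
print(result)
```

The trade-off is that a country mentioned in the middle of a longer line is no longer picked up, so this only fits files laid out like the sample.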
I think your main problem is that in oneCountries the country names are separated by a comma followed by a space, but you are splitting only on the comma. As a result, the second entry in countries is " Albania", with a leading space. You need to change:
oneCountries.split(",")
to:
oneCountries.split(", ")
After that, it looks like there's enough useful stuff in your commented-out code to achieve what you want.
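To see the difference, here is a small sketch with a shortened string (the names `raw`, `by_comma`, `by_comma_space` and `cleaned` are made up for this example). Splitting on the comma and then stripping each piece is a bit more forgiving than splitting on ", ", because it also copes with a doubled comma like the one after Costa Rica and with a line break in the middle of the string:

```python
raw = "Afghanistan, Albania, Costa Rica,, Croatia, Tunisia, \nTurkey"

by_comma = raw.split(",")         # pieces keep their leading space
by_comma_space = raw.split(", ")  # cleaner, but trips on ",," and the newline

# Split on the comma, strip surrounding whitespace, drop empty pieces.
cleaned = [name.strip() for name in raw.split(",") if name.strip()]

print(repr(by_comma[1]))
print(repr(by_comma_space[1]))
print(by_comma_space)
print(cleaned)
```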
I have a csv file with 2 fields, store_name and city. There can be multiple stores in a city.
I want an output csv with 5 fields, store_name, city, address, latitude, longitude.
For example, if one entry of the csv is Starbucks, Chicago, I want the output csv to contain all the information in the 5 fields (mentioned above) as:
Starbucks, Chicago, "200 S Michigan Ave, Chicago, IL 60604, USA", 41.8164613, -87.8127855,
Starbucks, Chicago, "8 N Michigan Ave, Chicago, IL 60602, USA", 41.8164613, -87.8127855
and so on for the rest of the results.
I was trying to get this working with GeoPy and Nominatim first, before moving to the Google Maps API, although I do not know what the best approach is. Note that there are a million such entries in the source csv, but buying an API key is not an issue once it works.
I did try geocoding with Nominatim alone, using pandas, but this creates only one result in the output csv for each entry. I want to capture every result, as in the example above, and I am not sure how to implement that.
from geopy.geocoders import Nominatim
import csv, sys
import pandas as pd
import keys
in_file = str(sys.argv[1])
out_file = str('gc_' + in_file)
timeout = int(sys.argv[2])
nominatim = Nominatim(user_agent=your_key_here, timeout=timeout)
def gc(address):
    name = str(address['store_name'])
    city = str(address['city'])
    add_concat = name + ", " + city
    location = nominatim.geocode(add_concat)
    if location is not None:
        print(f'geocoded record {address.name}: {city}')
        located = pd.Series({
            'lat': location.latitude,
            'lng': location.longitude,
        })
    else:
        print(f'failed to geolocate record {address.name}: {city}')
        located = pd.Series({
            'lat': 'null',
            'lng': 'null',
        })
    return located
print('opening input.')
reader = pd.read_csv(in_file, header=0)
print('geocoding addresses.')
reader = reader.merge(reader.apply(lambda add: gc(add), axis=1), left_index=True, right_index=True)
print(f'writing to {out_file}.')
reader.to_csv(out_file, encoding='utf-8', index=False)
print('done.')
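For reference, geopy's `Nominatim.geocode` accepts `exactly_one=False`, in which case it returns a list of `Location` objects (or `None` when nothing matches) instead of a single result. The sketch below shows one possible way to turn that into one output row per match. It takes the geocoding function as a parameter so it can be tried without network access; `expand_store`, `fake_geocode` and `FakeLocation` are hypothetical names used for illustration only, and the addresses are made up.

```python
from collections import namedtuple

# Stand-in exposing the attributes read from a geopy Location (illustration only).
FakeLocation = namedtuple("FakeLocation", "address latitude longitude")

def expand_store(geocode, store_name, city, limit=10):
    """Return one (store_name, city, address, lat, lng) row per geocoding match."""
    matches = geocode(f"{store_name}, {city}", exactly_one=False, limit=limit)
    if not matches:  # None or empty list: keep the row with blank fields
        return [(store_name, city, None, None, None)]
    return [(store_name, city, m.address, m.latitude, m.longitude)
            for m in matches]

def fake_geocode(query, exactly_one=True, limit=None):
    """Offline replacement for nominatim.geocode that returns made-up data."""
    if query == "Starbucks, Chicago":
        return [
            FakeLocation("Address A", 1.0, 2.0),
            FakeLocation("Address B", 3.0, 4.0),
        ]
    return None

rows = expand_store(fake_geocode, "Starbucks", "Chicago")
missing = expand_store(fake_geocode, "Nowhere", "Atlantis")
print(rows)
print(missing)
```

With the real geocoder, `nominatim.geocode` would be passed as the first argument and the collected rows written out with something like `pd.DataFrame(all_rows, columns=[...]).to_csv(...)`. Keep in mind that the public Nominatim server is rate limited, so a million lookups is unlikely to be practical there.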
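You can use the Google Geocoding API for that purpose. Strictly speaking, reverse geocoding is (per the official documentation) the conversion of geographic coordinates into a human-readable address; the function below, despite its name, goes the other way and geocodes an address string into coordinates and address components.
I used the function below in one of my projects and it still works. You can probably adapt it to your requirements.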
import requests
GCODE_URL = 'https://maps.googleapis.com/maps/api/geocode/json?'
GCODE_KEY = 'YOUR API KEY'
def reverse_gcode(location):
    location = str(location).replace(' ','+')
    nav_req = 'address={}&key={}'.format(location,GCODE_KEY)
    request = GCODE_URL + nav_req
    result = requests.get(request)
    data = result.json()
    status = data['status']
    geo_location = {}
    if str(status) == "OK":
        sizeofjson = len(data['results'][0]['address_components'])
        for i in range(sizeofjson):
            sizeoftype = len(data['results'][0]['address_components'][i]['types'])
            if sizeoftype == 3:
                geo_location[data['results'][0]['address_components'][i]['types'][2]] = data['results'][0]['address_components'][i]['long_name']
            else:
                if data['results'][0]['address_components'][i]['types'][0] == 'administrative_area_level_1':
                    geo_location['state'] = data['results'][0]['address_components'][i]['long_name']
                elif data['results'][0]['address_components'][i]['types'][0] == 'administrative_area_level_2':
                    geo_location['city'] = data['results'][0]['address_components'][i]['long_name']
                    geo_location['town'] = geo_location['city']
                else:
                    geo_location[data['results'][0]['address_components'][i]['types'][0]] = data['results'][0]['address_components'][i]['long_name']
        formatted_address = data['results'][0]['formatted_address']
        geo_location['lat'] = data['results'][0]['geometry']['location']['lat']
        geo_location['lang'] = data['results'][0]['geometry']['location']['lng']
        geo_location['formatted_address'] = formatted_address
    return geo_location
print(reverse_gcode("Starbucks, Chicago"))
The output is a Python dictionary (built from the JSON response) and looks something like this:
{'street_number': '8', 'town': 'Cook County', 'locality': 'Chicago', 'city': 'Cook County', 'lat': 41.882413, 'neighborhood': 'Chicago Loop', 'route': 'North Michigan Avenue', 'lang': -87.62468799999999, 'postal_code': '60602', 'country': 'United States', 'formatted_address': '8 N Michigan Ave, Chicago, IL 60602, USA', 'state': 'Illinois'}
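The function above only reads `data['results'][0]`. If one row per match is needed, as in the question, the same response can be walked in full. A minimal sketch, assuming the response shape used above (`status`, and `results` entries with `formatted_address` and `geometry.location`); `rows_from_response` is a hypothetical helper, and the sample dictionary is hand-written from values that appear in this thread rather than a real API response:

```python
def rows_from_response(store_name, city, data):
    """Build one output row per entry in a geocoding JSON response."""
    if data.get("status") != "OK":
        return []
    rows = []
    for result in data.get("results", []):
        loc = result["geometry"]["location"]
        rows.append((store_name, city, result["formatted_address"],
                     loc["lat"], loc["lng"]))
    return rows

# Hand-written sample with the same shape as the response parsed above.
sample = {
    "status": "OK",
    "results": [
        {"formatted_address": "200 S Michigan Ave, Chicago, IL 60604, USA",
         "geometry": {"location": {"lat": 41.8164613, "lng": -87.8127855}}},
        {"formatted_address": "8 N Michigan Ave, Chicago, IL 60602, USA",
         "geometry": {"location": {"lat": 41.882413, "lng": -87.62468799999999}}},
    ],
}
rows = rows_from_response("Starbucks", "Chicago", sample)
for row in rows:
    print(row)
```

How many results the endpoint actually returns for a query like "Starbucks, Chicago" is not shown in this thread, so treat the two-entry sample as an assumption.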
I have just started coding this semester, so if you can use simple methods to help me find my answer, I'd appreciate it. Basically, I just want it to print the name of each dictionary and then list its contents. Oh, and just so you know, I don't actually even like sports; this was just a previous homework assignment that I wanted to improve upon. Here's what I've got, and yes, I know it doesn't work the way I want it to:
football = {
'favorite player': 'Troy Aikman',
'team': 'Dallas Cowboys',
'number': '8',
'position': 'quarterback'
}
baseball = {
'favorite player': 'Jackie Robinson',
'team': 'Brooklyn Dodgers',
'number': '42',
'position': 'second baseman'
}
hockey = {
'favorite player': 'Wayne Gretzky',
'team': 'New York Rangers',
'number': '99',
'position': 'center'
}
sports = [football, baseball, hockey]
my_sports = ['Football', 'Baseball', 'Hockey']
for my_sport in my_sports:
print(my_sport)
for sport in sports:
for question, answer in sport.items():
print(question.title + ": " + answer)
print("\n")
I want it to print:
Football
Favorite player: Troy Aikman
Team: Dallas Cowboys
Number: 8
Position: quarterback
Baseball:
Favorite player: Jackie Robinson
Team: Brooklyn Dodgers
Number: 42
Position: second baseman
...and so forth. How do I achieve the results I want? The simpler the better and please use Python 3, I know nothing of Python 2.
my_sports = {'Football': football, 'Baseball' : baseball, 'Hockey' : hockey}
for key,value in my_sports.items():
    print(key)
    for question, answer in value.items():
        print(question + ": " + answer)
    print("\n")
You can try this:
sports = {"football":football, "baseball":baseball, "hockey":hockey}
for a, b in sports.items():
    print(a)
    for c, d in b.items():
        print("{}: {}".format(c, d))
Output:
football
position: quarterback
favorite player: Troy Aikman
number: 8
team: Dallas Cowboys
baseball
position: second baseman
favorite player: Jackie Robinson
number: 42
team: Brooklyn Dodgers
hockey
position: center
favorite player: Wayne Gretzky
number: 99
team: New York Rangers
UPDATED:
I edited my answer, and the code below now works:
my_sports = {'Football': football, 'Baseball' : baseball, 'Hockey' : hockey}
for key,value in my_sports.items():
    print(key)
    for question, answer in value.items():
        # title() capitalizes each word, e.g. 'favorite player' -> 'Favorite Player'
        print(question.title() + ": " + answer)
    print("\n")
This is the result:
Football
Favorite Player: Troy Aikman
Team: Dallas Cowboys
Number: 8
Position: quarterback
Baseball
Favorite Player: Jackie Robinson
Team: Brooklyn Dodgers
Number: 42
Position: second baseman
Hockey
Favorite Player: Wayne Gretzky
Team: New York Rangers
Number: 99
Position: center
Code here:
https://repl.it/MOBO/3
The built-in zip function seems like the easiest way to combine and pair-up elements from the two lists. Here's how to use it:
sports = [football, baseball, hockey]
my_sports = ['Football', 'Baseball', 'Hockey']
for my_sport, sport in zip(my_sports, sports):
    print('\n'.join((
        '{}', # name of sport
        'Favorite player: {favorite player}',
        'Team: {team}',
        'Number: {number}',
        'Position: {position}')).format(my_sport, **sport) + '\n'
    )
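If the dictionaries might not all have the same keys, a variation on the same zip idea is to loop over the items instead of naming each key in the template. A minimal sketch (f-strings need Python 3.6 or later; the helper name `describe` is made up for this example, and only two of the dictionaries are repeated to keep it short). `str.capitalize()` upper-cases only the first letter, which gives "Favorite player" as in the desired output:

```python
football = {'favorite player': 'Troy Aikman', 'team': 'Dallas Cowboys',
            'number': '8', 'position': 'quarterback'}
baseball = {'favorite player': 'Jackie Robinson', 'team': 'Brooklyn Dodgers',
            'number': '42', 'position': 'second baseman'}

def describe(name, sport):
    """Return the sport name followed by one 'Key: value' line per entry."""
    lines = [name]
    for question, answer in sport.items():
        lines.append(f"{question.capitalize()}: {answer}")
    return "\n".join(lines)

for name, sport in zip(['Football', 'Baseball'], [football, baseball]):
    print(describe(name, sport))
    print()  # blank line between sports
```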