How to lookup data from one CSV in another CSV? - python

The crq_data file holds cities and states from a user-uploaded *.csv file.
The cityCoordinates.csv file is a library of American cities and states along with their coordinates. I would like to use it as a sort of "lookup tool": compare the uploaded .csv file against it to find each city's coordinates, which I then map in Folium.
Right now the code goes line by line and appends the coordinates one at a time (roughly one second per line). I would like it to run much faster, so that if there are 6000 lines the user doesn't have to wait 6000 seconds.
Here is part of my code:
crq_file = askopenfilename(filetypes=[('CSV Files', '*csv')])
crq_data = pd.read_csv(crq_file, encoding="utf8")
coords = pd.read_csv("cityCoordinates.csv")
for crq in range(len(crq_data)):
    task_city = crq_data.iloc[crq]["TaskCity"]
    task_state = crq_data.iloc[crq]["TaskState"]
    for coordinates in range(len(coords)):
        cityCoord = coords.iloc[coordinates]["City"]
        stateCoord = coords.iloc[coordinates]["State"]
        latCoord = coords.iloc[coordinates]["Latitude"]
        lngCoord = coords.iloc[coordinates]["Longitude"]
        if task_city == cityCoord and task_state == stateCoord:
            crq_data["CRQ Latitude"] = latCoord
            crq_data["CRQ Longitude"] = lngCoord
            print(cityCoord, stateCoord, latCoord, lngCoord)
This is an example of the current Terminal Output
Example of uploaded .csv file

I see this not as a problem of optimizing Pandas, but of finding a good data structure for fast lookups, and a good data structure for fast lookups is the dict. The dict takes memory, though; you'll need to evaluate that cost for yourself.
I mocked up what your cityCoordinates CSV could look like:
| City | State | Latitude | Longitude |
|----------|-------|------------|-------------|
| Portland | OR | 45°31′12″N | 122°40′55″W |
| Dallas | TX | 32°46′45″N | 96°48′32″W |
| Portland | ME | 43°39′36″N | 70°15′18″W |
import csv
import pprint

def cs_key(city_name: str, state_name: str) -> str:
    """Make a normalized City-State key."""
    return city_name.strip().lower() + "--" + state_name.strip().lower()

# A dict of { "city_name--state_name": (latitude, longitude), ... }
coords_lookup = {}
with open("cityCoordinates.csv", newline="") as f:
    reader = csv.DictReader(f)  # your coords file appears to have a header
    for row in reader:
        city = row["City"]
        state = row["State"]
        lat = row["Latitude"]
        lon = row["Longitude"]
        key = cs_key(city, state)
        coords_lookup[key] = (lat, lon)
pprint.pprint(coords_lookup, sort_dicts=False)
When I run that, I get:
{'portland--or': ('45°31′12″N', '122°40′55″W'),
'dallas--tx': ('32°46′45″N', '96°48′32″W'),
'portland--me': ('43°39′36″N', '70°15′18″W')}
Now, iterating the task data looks pretty much the same: we take a pair of City and State, make a normalized key out of them, then try to look up that key for known coordinates.
I mocked up some task data:
| TaskCity | TaskState |
|------------|-----------|
| Portland | OR |
| Fort Worth | TX |
| Dallas | TX |
| Boston | MA |
| Portland | ME |
and when I run this:
with open("crq_data.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        city = row["TaskCity"]
        state = row["TaskState"]
        key = cs_key(city, state)
        coords = coords_lookup.get(key, (None, None))
        if coords != (None, None):
            print(city, state, coords[0], coords[1])
I get:
Portland OR 45°31′12″N 122°40′55″W
Dallas TX 32°46′45″N 96°48′32″W
Portland ME 43°39′36″N 70°15′18″W
This solution is going to be much faster in principle because it avoids the quadratic loop over every cityCoordinates row for every task row: building the dict is one pass over the coordinates, and each task row then costs a single dict lookup. In practice, Pandas is also slow at row iteration. I'm not sure whether the same holds for indexing (iloc), but in general Pandas is designed for manipulating columns of data, and I would say it is not suited to row-oriented problems and solutions.
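To go one step further than printing, here is a minimal sketch that attaches the looked-up coordinates to each task row, which is what the question's loop was trying to do. It reuses the answer's mock data as in-memory files; the function name add_coordinates and the choice of an empty string for unmatched cities are my assumptions, not part of the original answer.

```python
import csv
import io

def cs_key(city_name: str, state_name: str) -> str:
    """Make a normalized City-State key (same scheme as in the answer)."""
    return city_name.strip().lower() + "--" + state_name.strip().lower()

def add_coordinates(coords_file, tasks_file):
    """Return the task rows with 'CRQ Latitude' and 'CRQ Longitude' added."""
    # One pass over the coordinates file builds the lookup dict.
    lookup = {
        cs_key(row["City"], row["State"]): (row["Latitude"], row["Longitude"])
        for row in csv.DictReader(coords_file)
    }
    # One pass over the task file; each row costs a single dict lookup.
    enriched = []
    for row in csv.DictReader(tasks_file):
        key = cs_key(row["TaskCity"], row["TaskState"])
        lat, lon = lookup.get(key, ("", ""))  # assumption: blank when unknown
        enriched.append({**row, "CRQ Latitude": lat, "CRQ Longitude": lon})
    return enriched

# In-memory stand-ins for the two CSV files (mock data from the answer).
COORDS_CSV = (
    "City,State,Latitude,Longitude\n"
    "Portland,OR,45°31′12″N,122°40′55″W\n"
    "Dallas,TX,32°46′45″N,96°48′32″W\n"
    "Portland,ME,43°39′36″N,70°15′18″W\n"
)
TASKS_CSV = (
    "TaskCity,TaskState\n"
    "Portland,OR\n"
    "Fort Worth,TX\n"
    "Dallas,TX\n"
    "Boston,MA\n"
    "Portland,ME\n"
)

enriched = add_coordinates(io.StringIO(COORDS_CSV), io.StringIO(TASKS_CSV))
for row in enriched:
    print(row)
```

With real files, the two io.StringIO objects would be replaced by files opened with newline="", and the enriched rows could be written back out with csv.DictWriter.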

Related

CSV file to an array to a table? (Python 3.10.4)

I'm new to Python and doing some project based learning.
I have a CSV file that I've put into an array, but I'd like to present it in PrettyTable.
Here's what I have so far:
import csv
import numpy as np
with open('destiny.csv', 'r') as f:
    data = list(csv.reader(f, delimiter=";"))
data = np.array(data)
Output is this:
['Loud Lullaby,Aggressive,Moon,Kinetic,120,Legendary,hand_cannon']
['Pribina-D,Aggressive,Gunsmith,Kinetic,120,Legendary,hand_cannon']
['True Prophecy,Aggressive,World,Kinetic,120,Legendary,hand_cannon']
['Igneous Hammer,Aggressive,Trials,Solar,120,Legendary,hand_cannon']
But I'd like to get it into this:
from prettytable import PrettyTable
myTable = PrettyTable(['Gun Name', 'Archetype', 'Source', 'Element', 'Rounds Per Minute', 'Rarity', 'Weapon Type'])
myTable.add_row(['Loud Lullaby', 'Aggressive', 'Moon', 'Kinetic', '120', 'Legendary', 'Hand Cannon'])
myTable.add_row(["Pribina-D", "Aggressive", "Gunsmith", "Kinetic", "120", "Legendary", "Hand Cannon"])
myTable.add_row(["True Prophecy", "Aggressive", "World", "Kinetic", "120", "Legendary", "Hand Cannon"])
myTable.add_row(["Igneous Hammer", "Aggressive", "Trials", "Solar", "120", "Legendary", "Hand Cannon"])
So it can look like this:
+----------------+------------+----------+---------+-------------------+-----------+-------------+
| Gun Name       | Archetype  | Source   | Element | Rounds Per Minute | Rarity    | Weapon Type |
+----------------+------------+----------+---------+-------------------+-----------+-------------+
| Loud Lullaby   | Aggressive | Moon     | Kinetic | 120               | Legendary | Hand Cannon |
| Pribina-D      | Aggressive | Gunsmith | Kinetic | 120               | Legendary | Hand Cannon |
| True Prophecy  | Aggressive | World    | Kinetic | 120               | Legendary | Hand Cannon |
| Igneous Hammer | Aggressive | Trials   | Solar   | 120               | Legendary | Hand Cannon |
+----------------+------------+----------+---------+-------------------+-----------+-------------+
Any thoughts on the best way to get the data set into the table without having to copy and paste every line into myTable.add_row? There are hundreds of lines...
[Credit to vishwasrao99 at Kaggle for this CSV file]
I just combined your two pieces of script:
import csv
import numpy as np
from prettytable import PrettyTable
with open('destiny.csv', 'r') as f:
    data = list(csv.reader(f, delimiter=";"))
data = np.array(data)
columns = ['Gun Name', 'Archetype', 'Source', 'Element', 'Rounds Per Minute', 'Rarity', 'Weapon Type']
myTable = PrettyTable(columns)
for row in data:
    # Each row holds one comma-joined string; split it into its fields.
    # (Named "fields" so the built-in list is not shadowed.)
    fields = row[0].split(",")
    myTable.add_row(fields)
print(myTable)
Note that I used split(",") to split the strings you get in your numpy array at every comma, creating identical lists as what you feed in manually in your example.
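The single-string rows in the question's output suggest the file is comma-separated rather than semicolon-separated. If that is the case, a sketch like the following lets csv.reader do the splitting, with no numpy and no manual split. The helper name read_rows and the in-memory sample are my own, for illustration only.

```python
import csv
import io

# In-memory stand-in for destiny.csv, using two rows from the question.
SAMPLE = (
    "Loud Lullaby,Aggressive,Moon,Kinetic,120,Legendary,hand_cannon\n"
    "Pribina-D,Aggressive,Gunsmith,Kinetic,120,Legendary,hand_cannon\n"
)

def read_rows(f):
    """Parse comma-separated text into one list of fields per line."""
    # csv.reader uses "," as its default delimiter; blank lines are skipped.
    return [row for row in csv.reader(f) if row]

rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    # Each row is already a list of 7 fields, so it could be passed
    # straight to myTable.add_row(row).
    print(row)
```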

Grouping CSV Rows By The Names of Users

I have a table on Python with the following data from a CSV:
| subscriberKey        | Name  | Job           |
|----------------------|-------|---------------|
| 123#yahoo.com        | Brian | Computer Tech |
| example#gmail.com    | Brian | Sales         |
| someone#google.com   | Gabby | Sales         |
| testinge#sendesk.com | Gabby | Marketing     |
| sandbox#aol.com      | Tyler | Porter        |
I want to be able to group the data by the Name and have all of the other cells come with it.
It should end up looking like this.
| subscriberKey     | Name  | Job           |
|-------------------|-------|---------------|
| 123#yahoo.com     | Brian | Computer Tech |
| example#gmail.com | Brian | Sales         |

| subscriberKey        | Name  | Job       |
|----------------------|-------|-----------|
| someone#google.com   | Gabby | Sales     |
| testinge#sendesk.com | Gabby | Marketing |

| subscriberKey   | Name  | Job    |
|-----------------|-------|--------|
| sandbox#aol.com | Tyler | Porter |
Furthermore, I want to create a new CSV file for every table that is created. I have tried to loop through the data but have failed too many times. I am currently back at the start and only have the file printing as its normal table. Can anyone help?
import csv
f = open('work.csv')
csv_f = csv.reader(f)
for row in csv_f:
    print(row)
When you are trying to group variables based on a certain key (the name in this case) a hashmap is usually a good data structure to try.
As a general solution for future readers:
1. Create an empty dictionary.
2. Choose the key that you want to group your data by.
3. Iterate over the data and parse the key and related items.
4. Add the related items to dict[key].
5. Now each key in dict will have a list of all the items related to it.
Tailored more specifically to the OP's question:
import collections

def write_csv(name, lines):
    with open(f"{name}_work.csv", "w") as f:
        for line in lines:
            f.write(','.join(line))
            f.write('\n')

if __name__ == "__main__":
    # LOAD DATA
    with open("work.csv", 'r') as f:
        lines = []
        for line in f.readlines():
            lines.append(line.strip('\n').split(','))
    # GROUP DATA BY NAME INTO A DICTIONARY
    names = collections.defaultdict(list)
    for email, name, job in lines[1:]:
        names[name].append((email, job))
    # WRITE A NEW .csv FILE FOR EACH NAME
    for name in names:
        new_lines = lines[:1]
        for email, job in names[name]:
            # Keep the header's column order: subscriberKey, Name, Job
            new_lines.append([email, name, job])
        write_csv(name, new_lines)
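The same grouping idea can also be sketched with the csv module, which handles quoting and keeps the column order from the header automatically. This is only an illustration under my own assumptions: the helper names group_rows and write_group are hypothetical, and the output goes to in-memory buffers so the sketch runs without touching the disk.

```python
import collections
import csv
import io

def group_rows(f, key="Name"):
    """Group CSV rows (as dicts) by the value in the `key` column."""
    reader = csv.DictReader(f)
    groups = collections.defaultdict(list)
    for row in reader:
        groups[row[key]].append(row)
    return reader.fieldnames, groups

def write_group(f, fieldnames, rows):
    """Write a header plus the given rows to an open file object."""
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# In-memory stand-in for work.csv, using the rows from the question.
SAMPLE = (
    "subscriberKey,Name,Job\n"
    "123#yahoo.com,Brian,Computer Tech\n"
    "example#gmail.com,Brian,Sales\n"
    "someone#google.com,Gabby,Sales\n"
    "testinge#sendesk.com,Gabby,Marketing\n"
    "sandbox#aol.com,Tyler,Porter\n"
)

fieldnames, groups = group_rows(io.StringIO(SAMPLE))
outputs = {}
for name, rows in groups.items():
    buf = io.StringIO()  # stand-in for open(f"{name}_work.csv", "w", newline="")
    write_group(buf, fieldnames, rows)
    outputs[name] = buf.getvalue()

print(outputs["Brian"])
```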

Pandas not displaying all columns when writing to

I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However, the actual CSV file I am getting looks like this:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
I am using the following code (run createCSV(); it pulls data from the COVID govt GitHub):
import csv  # csv reader
import pandas as pd  # csv parser
import collections  # not needed
import requests  # retrieves URL from gov data

def getFile():
    url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv'
    response = requests.get(url)
    print('Writing file...')
    open('us_deaths.csv', 'wb').write(response.content)

# takes raw data from link. creates CSV for each unique state and removes unneeded headings
def createCSV():
    getFile()
    # init data
    data = pd.read_csv('us_deaths.csv', delimiter=',')
    # drop extra columns
    data.drop(['UID'], axis=1, inplace=True)
    data.drop(['iso2'], axis=1, inplace=True)
    data.drop(['iso3'], axis=1, inplace=True)
    data.drop(['code3'], axis=1, inplace=True)
    data.drop(['FIPS'], axis=1, inplace=True)
    # data.drop(['Admin2'], axis=1, inplace=True)
    data.drop(['Country_Region'], axis=1, inplace=True)
    data.drop(['Lat'], axis=1, inplace=True)
    data.drop(['Long_'], axis=1, inplace=True)
    data.drop(['Combined_Key'], axis=1, inplace=True)
    # data.drop(['Province_State'], axis=1, inplace=True)
    data.to_csv('DEBUGDATA2.csv')
    # sets province_state as primary key. Searches based on date and key to create new CSVs in root directory of python app
    data = data.set_index('Province_State')
    data = data.iloc[:, 2:].rename(columns=pd.to_datetime, errors='ignore')
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('03/23/2020', '03/29/20')] \
            .to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the loop is to convert the date columns (everything after the first two) to dates, so that I can select only 03/23/2020 and beyond. If anyone has a better method of doing this, I would love to know.
To check that it works, I print out all the field names, including Admin2 (county name), Province_State, and the rest of the dates.
However, as you can see, Admin2 seems to have disappeared from my CSV. I am not sure how to make this work; if anyone has any ideas, that would be great!
I changed
data = data.set_index('Province_State')
to
data = data.set_index(['Province_State', 'Admin2'])
A multi-level index was needed for the Admin2 column to show up: to_csv writes every index level, while the date-range selection keeps only the date columns. Any smoother tips on the date-range section are still welcome.
Thanks for the help, all!

string manipulation, data wrangling, regex

I have a .txt file of 3 million rows. The file contains data that looks like this:
# RSYNC: 0 1 1 0 512 0
#$SOA 5m localhost. hostmaster.localhost. 1906022338 1h 10m 5d 1s
# random_number_ofspaces_before_this text $TTL 60s
#more random information
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com
.0-1-hub.com
.zzzy1129.cn
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
.005verf-desj.com
.01accesfunds.com
In the above data, there is a code associated with all domains listed beneath it.
I want to turn the above data into a format that can be loaded into a HiveQL/SQL. The HiveQL table should look like:
+--------------------+--------------+-------------+-----------------------------------------------------+
| domain_name | period_count | parsed_code | raw_code |
+--------------------+--------------+-------------+-----------------------------------------------------+
| test | 0 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-0m5tk.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-1-hub.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .zzzy1129.cn | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-il.ml | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .005verf-desj.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .01accesfunds.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
+--------------------+--------------+-------------+-----------------------------------------------------+
Please note that I do not want the vertical bars in any output. They are just to make the above look like a table
I'm guessing that creating a HiveQL table like the above will involve converting the .txt into a .csv or a Pandas data frame. If creating a .csv, then the .csv would probably look like:
domain_name,period_count,parsed_code,raw_code
test,0,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-1-hub.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.zzzy1129.cn,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-il.ml,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.005verf-desj.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.01accesfunds.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
I'd be interested in a Python solution, but lack familiarity with the packages and functions necessary to complete the above data wrangling steps. I'm looking for a complete solution, or code tidbits to construct my own solution. I'm guessing regular expressions will be needed to identify the "category" or "code" line in the raw data. They always start with ":127.0.1." I'd also like to parse the code out to create a parsed_code column, and a period_count column that counts the number of periods in the domain_name string. For testing purposes, please create a .txt of the sample data I have provided at the beginning of this post.
Regardless of how you want to format in the end, I suppose the first step is to separate the domain_name and code. That part is pure python
rows = []
code = None
parsed_code = None
with open('input.txt', 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith(':127'):
            code = line
            parsed_code = line.split(':')[1]
            continue
        if line.startswith('#'):
            continue
        period_count = line.count('.')
        rows.append((line, period_count, parsed_code, code))
Just for illustration, you can use pandas to format the data nicely as a table, which might help if you want to pipe this to SQL, but it's not strictly necessary. Post-processing of strings is also quite straightforward in pandas.
import pandas as pd
df = pd.DataFrame(rows, columns=['domain_name', 'period_count', 'parsed_code', 'raw_code'])
print (df)
prints this:
domain_name period_count parsed_code raw_code
0 test 0 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
1 .0-0m5tk.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
2 .0-1-hub.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
3 .zzzy1129.cn 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
4 .0-il.ml 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
5 .005verf-desj.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
6 .01accesfunds.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
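Since the question asks for a .csv, here is a hedged sketch that applies the same parsing as a generator and writes the result with csv.writer, so a 3-million-row file would never need to be held in memory. The function name parse_blocklist is hypothetical, and stripping surrounding whitespace and skipping blank lines are my assumptions about the input, not something the question states.

```python
import csv
import io

def parse_blocklist(lines):
    """Yield (domain_name, period_count, parsed_code, raw_code) tuples."""
    raw_code = parsed_code = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments carry no data
        if line.startswith(":127.0.1."):
            # A code line applies to every domain listed beneath it.
            raw_code = line
            parsed_code = line.split(":")[1]
            continue
        yield line, line.count("."), parsed_code, raw_code

# A shortened copy of the sample data from the question.
SAMPLE = """\
# RSYNC: 0 1 1 0 512 0
#more random information
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
.0-0m5tk.com
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
"""

out = io.StringIO()  # stand-in for open("output.csv", "w", newline="")
writer = csv.writer(out)
writer.writerow(["domain_name", "period_count", "parsed_code", "raw_code"])
writer.writerows(parse_blocklist(SAMPLE.splitlines()))
csv_text = out.getvalue()
print(csv_text)
```

For the real file, passing the open file object itself to parse_blocklist would stream it line by line.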
You can do all of this with the Python standard library.
HEADER = "domain_name | code"
# Open files
with open("input.txt") as f_in, open("output.txt", "w") as f_out:
    # Write header
    print(HEADER, file=f_out)
    print("-" * len(HEADER), file=f_out)
    # Parse file and output in correct format
    code = None
    for line in f_in:
        # Drop the trailing newline so that endswith("$") can match
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Ignore comments
            continue
        if line.endswith("$"):
            # Store line as the current "code"
            code = line
        else:
            # Write these domain_name entries into the
            # output file separated by ' | '
            print(line, code, sep=" | ", file=f_out)

Converting geographic coordinates from GEOSTAT to lat and lng

I've found an interesting datasource of European Population which I think could help me in achieving such a map:
The source document GEOSTAT_grid_POP_1K_2011_V2_0_1.csv looks like this:
| TOT_P | GRD_ID | CNTR_CODE | METHD_CL | YEAR | DATA_SRC | TOT_P_CON_DT |
|-------|---------------|-----------|----------|------|----------|--------------|
| 8 | 1kmN2689E4337 | DE | A | 2011 | DE | other |
| 7 | 1kmN2689E4341 | DE | A | 2011 | DE | other |
Geographic coordinates appear to be encoded in the GRD_ID column, as this document indicates (Appendix1_WP1C_production-procedures-bottom-up.pdf):
Grid cell identification codes are based on grid cell’s lower left-hand corner coordinates truncated by grid cell size (e.g. 1kmN4534E5066 is result from coordinates Y=4534672, X=5066332 and the cell size 1000)
I thought I could get lat and long by parsing the strings. For example in Python:
import re
string = "1kmN2691E4341"
lat = float(re.sub(r'.*N([0-9]+)[EW].*', r'\1', string)) / 100
lng = float(re.sub(r'.*[EW]([0-9]+)', r'\1', string)) / 100
print(lat, ",", lng)
Output 26.91 , 43.41
but that makes no sense: it does not correspond to a location in Europe!
It may be that it refers to a geographic coordinate system I'm not aware of.
Thanks to Viktor's comment, I found out that the coordinate system used in my file was EPSG:3035
Based on python's implementation of Proj4, I could achieve a convincing result with the following code:
#! /usr/bin/python
# coding: utf-8
import re
from pyproj import Proj, transform
string = "1kmN2326E3989"
# The grid ID holds kilometres; multiply by 1000 to get EPSG:3035 metres
x1 = int(re.sub(r'.*[EW]([0-9]+)', r'\1', string)) * 1000
y1 = int(re.sub(r'.*N([0-9]+)[EW].*', r'\1', string)) * 1000
inProj = Proj(init='EPSG:3035')
outProj = Proj(init='epsg:4326')
lng, lat = transform(inProj, outProj, x1, y1)
print(lat, lng)
Output : 43.9613760836 5.870517281
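The parsing step can be isolated into a small, dependency-free helper, sketched below. It assumes 1 km grid cells with an N/E identifier, as in the examples above; the name grid_id_to_xy is my own and does not come from GEOSTAT or pyproj.

```python
import re

# Matches 1 km grid IDs such as "1kmN2326E3989": northing, then easting, in km.
GRD_ID_RE = re.compile(r"^1kmN(\d+)E(\d+)$")

def grid_id_to_xy(grd_id):
    """Return the (x, y) lower-left corner of a 1 km cell, in EPSG:3035 metres."""
    match = GRD_ID_RE.match(grd_id)
    if match is None:
        raise ValueError(f"unrecognised GRD_ID: {grd_id!r}")
    north_km, east_km = (int(group) for group in match.groups())
    # x is the easting and y the northing, both converted from km to metres.
    return east_km * 1000, north_km * 1000

x, y = grid_id_to_xy("1kmN2326E3989")
print(x, y)
```

The resulting pair can then be handed to the projection step. In current pyproj releases, Transformer.from_crs("EPSG:3035", "EPSG:4326", always_xy=True).transform(x, y) should return (lng, lat) and can stand in for the Proj(init=...) and transform pair, though I have not verified it against the output quoted above.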
