How to perform a CNAME lookup on a CSV file - Python

I am in the process of automating the CNAME lookup process via Python and would like some help / thoughts on my current draft.
The goal is for the script to take each site under the column 'site' and write the CNAME of that site to another column named 'CName'.
Here is what I have now:
# pip install pandas
from tkinter.dnd import dnd_start
import pandas as pd
# pip install dnspython
from dns import resolver, reversename
# pip install xlrd, pip install xlsxwriter (socket is part of the standard library, no install needed)
from pandas.io.excel import ExcelWriter
import time
import socket
import dns.resolver
startTime = time.time()
# Import excel called logs.xlsx as dataframe
# if CSV change to pd.read_csv('logs.csv', error_bad_lines=False)
logs = pd.read_csv(path to file)
# Create DF with duplicate sites filtered out for the check
logs_filtered = logs.drop_duplicates(['site']).copy()
def cNameLookup(site):
    name = str(site).strip()
    try:
        cname = socket.AddressInfo(site)[0]
        for val in cname:
            print('CNAME Record : ', val.target)
    except:
        return 'N/A'
# Create CName column with the CName Lookup result
logs_filtered['cname'] = logs_filtered['site'].apply(cNameLookup)
# Merge DNS column to full logs matching IP
logs_filtered = logs.merge(logs_filtered[['site', 'cname']], how='left', on=['site'])
# output as Excel
writer = ExcelWriter('validated_logs.xlsx', engine='xlsxwriter',
                     options={'strings_to_urls': False})
logs_filtered.to_excel(writer, index=False)
writer.save()
print('File successfully written as validated_logs.xlsx')
print('The script took {0} seconds!'.format(time.time() - startTime))
As of now, when I run the script, all I get in the CName column is 'N/A' all the way down; it seems the CNAME lookup portion of the code is not working as intended.
Thank you in advance for any help / suggestions!
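One likely culprit: socket.AddressInfo is an enum of address flags, not a lookup function, and cNameLookup only prints inside the loop without ever returning a value, so every row falls into the except branch and ends up as 'N/A'. A minimal sketch of how the function could look using dnspython's resolver (assuming dnspython 2.x; query() is the 1.x equivalent of resolve()):
import dns.resolver

def cNameLookup(site):
    name = str(site).strip()
    try:
        # Ask DNS for the CNAME record(s) of the host name
        answers = dns.resolver.resolve(name, 'CNAME')
        # Return the target(s) as a string so .apply() fills the column
        return ', '.join(str(rdata.target) for rdata in answers)
    except Exception:
        return 'N/A'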

Related

Difficulty trying to export query results as a CSV, uploaded to SharePoint (PySpark)

I am trying to run a query, with the result saved as a CSV that is uploaded to a SharePoint folder. This is within Databricks via PySpark.
My code below is close to doing this, but the final line is not functioning correctly - the file generated in SharePoint does not contain any data, though the dataframe does.
I'm new to Python and Databricks, if anyone can provide some guidance on how to correct that final line I'd really appreciate it!
from shareplum import Site, Office365  # Office365 is needed for GetCookies() below
from shareplum.site import Version
import pandas as pd
sharepointUsername =
sharepointPassword =
sharepointSite =
website =
sharepointFolder =
# Connect to SharePoint Folder
authcookie = Office365(website, username=sharepointUsername, password=sharepointPassword).GetCookies()
site = Site(sharepointSite, version=Version.v2016, authcookie=authcookie)
folder = site.Folder(sharepointFolder)
FileName = "Data_Export.csv"
Query = "SELECT * FROM TABLE"
df = spark.sql(Query)
pandasdf = df.toPandas()
folder.upload_file(pandasdf.to_csv(FileName, encoding = 'utf-8'), FileName)
Sure my code is still garbage, but it does work. I needed to convert the dataframe into a variable containing CSV formatted data prior to uploading it to SharePoint; effectively I was trying to skip a step before. Last two lines were updated:
from shareplum import Site, Office365  # both are used below
from shareplum.site import Version
import pandas as pd
sharepointUsername =
sharepointPassword =
sharepointSite =
website =
sharepointFolder =
# Connect to SharePoint Folder
authcookie = Office365(website, username=sharepointUsername, password=sharepointPassword).GetCookies()
site = Site(sharepointSite, version=Version.v2016, authcookie=authcookie)
folder = site.Folder(sharepointFolder)
FileName = "Data_Export.csv"
Query = "SELECT * FROM TABLE"
df = spark.sql(Query).toPandas().to_csv(header=True, index=False, encoding='utf-8')
folder.upload_file(df, FileName)
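The detail that makes this work is that DataFrame.to_csv() returns the CSV content as a string when no path is passed, and that string is what gets handed to shareplum as the file body. A standalone illustration with made-up data (not the original query):
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
csv_text = df.to_csv(header=True, index=False, encoding='utf-8')  # returns a str, writes no file
print(type(csv_text))  # <class 'str'>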

File Not Found Error while Downloading Image files

I am using Windows 8.1. I have been web scraping a lot recently and have worked through quite a few errors, but now I am stuck downloading the files: they will not download, and I get a
FileNotFoundError.
I have removed all the unknown characters from the file names but still get this error. Any help?
I have also made the names lowercase just in case. The error happens when I download the 22nd item; the items before the 22nd one download fine.
My code, and the Excel file for reference:
import time
import pandas as pd
import requests
Final1 = pd.read_excel("Sneakers.xlsx")
Final1.index+=1
a = Final1.index.tolist()
Images = Final1["Images"].tolist()
Name = Final1["Name"].str.lower().tolist()
Brand = Final1["Brand"].str.lower().tolist()
s = requests.Session()
for i,n,b,l in zip(a,Name,Brand,Images):
    r = s.get(l).content
    with open("Images//" + f"{i}-{n}-{b}.jpg","wb") as f:
        f.write(r)
Excel File (Google Drive) : Excel File
It seems like you don't have an Images folder in your path.
It's better to use the os.path.join() function for joining paths in Python.
Try the below:
import os
import time
import pandas as pd
import requests
Final1 = pd.read_excel("Sneakers.xlsx")
Final1.index+=1
a = Final1.index.tolist()
Images = Final1["Images"].tolist()
Name = Final1["Name"].str.lower().tolist()
Brand = Final1["Brand"].str.lower().tolist()
# Added
if not os.path.exists("Images"):
    os.mkdir("Images")
s = requests.Session()
for i,n,b,l in zip(a,Name,Brand,Images):
    r = s.get(l).content
    # with open("Images//" + f"{i}-{n}-{b}.jpg","wb") as f:
    with open(os.path.join("Images", f"{i}-{n}-{b}.jpg"),"wb") as f:
        f.write(r)
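A slightly shorter variant of the directory check (same idea, not part of the original answer) lets os.makedirs handle the existence test:
import os

# Creates the folder if it is missing and silently does nothing if it already exists
os.makedirs("Images", exist_ok=True)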

Python on Hadoop read blocks

I have the following problem. I want to extract data from hdfs (a table called 'complaint'). I wrote the following script which actually works:
import pandas as pd
from hdfs import InsecureClient
import os
file = open ("test.txt", "wb")
print ("Step 1")
client_hdfs = InsecureClient ('http://XYZ')
N = 10
print ("Step 2")
with client_hdfs.read('/user/.../complaint/000000_0') as reader:
    print('new line')
    features = reader.read(1000000)
    file.write(features)
print('end')
file.close()
My problem now is that the folder "complaint" contains 4 files (I don't know which file type), and the read operation gives me back bytes which I can't use further. I saved the output to a text file as a test [screenshot of the unreadable output omitted].
In HDFS it looks like this: [screenshot omitted]
My question now is: is it possible to get the data separated by column in a meaningful way?
I have only found solutions for .csv files and the like, and I am somewhat stuck here... :-)
EDIT
I made changes to my solution and tried different approaches, but none of them really works. Here's the updated code:
import pandas as pd
from hdfs import InsecureClient
import os
import pypyodbc
import pyspark
from pyspark import SparkConf, SparkContext
from hdfs3 import HDFileSystem
import pyarrow.parquet as pq
import pyarrow as pa
from pyhive import hive
#Step 0: Configurations
#Connections with InsecureClient (this basically works)
#Notes: TMS1 doesn't work because of txt files
#insec_client_tms1 = InsecureClient ('http://some-adress:50070')
insec_client_tms2 = InsecureClient ('http://some-adress:50070')
#Connection with Spark (not working at the moment)
#Error: Java gateway process exited before sending its port number
#conf = SparkConf().setAppName('TMS').setMaster('spark://adress-of-node:7077')
#sc = SparkContext(conf=conf)
#Connection via PyArrow (not working)
#Error: File not found
#fs = pa.hdfs.connect(host='hdfs://node-adress', port =8020)
#print("FS: " + fs)
#connection via HDFS3 (not working)
#The module couldn't be load
#client_hdfs = HDFileSystem(host='hdfs://node-adress', port=8020)
#Connection via Hive (not working)
#no module named sasl -> I tried to install it, but it also fails
#conn = hive.Connection(host='hdfs://node-adress', port=8020, database='deltatest')
#Step 1: Extractions
print ("starting Extraction")
#Create file
file = open ("extraction.txt", "w")
#Extraction with Spark
#text = sc.textFile('/user/hive/warehouse/XYZ.db/baseorder_flags/000000_0')
#first_line = text.first()
#print (first_line)
#extraction with hive
#df = pd.read_sql ('select * from baseorder',conn)
#print ("DF: "+ df)
#extraction with hdfs3
#with client_hdfs.open('/home/deltatest/basedeviation/000000_0') as f:
# df = pd.read_parquet(f)
#Extraction with Webclient (not working)
#Error: Arrow error: IOError: seek -> fastparquet has a similar error
with insec_client_tms2.read('/home/deltatest/basedeviation/000000_0') as reader:
    features = pd.read_parquet(reader)
    print(features)
    #features = reader.read()
    #data = features.decode('utf-8', 'replace')
print("saving data to file")
file.write(data)
print('end')
file.close()
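Judging by the 000000_0 file names these look like Hive table files, and the seek error suggests the streaming WebHDFS reader does not support the random access that pyarrow's parquet reader needs. One possible workaround, sketched under those assumptions (host and paths are placeholders), is to download the file to local disk first and read it from there:
import pandas as pd
from hdfs import InsecureClient

client = InsecureClient('http://some-adress:50070')

# Copy the HDFS file to a local path; a local file supports seek,
# unlike the streaming reader returned by client.read()
local_path = client.download('/home/deltatest/basedeviation/000000_0',
                             'basedeviation_000000_0', overwrite=True)

# If the Hive table is stored as Parquet, this loads it into a DataFrame;
# if it is stored as delimited text instead, pd.read_csv with the table's
# field delimiter would be the counterpart
df = pd.read_parquet(local_path)
print(df.head())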

Unable to save csv file with Pandas

Sorry for the dumb question, but I have read lots of topics and my code still does not create and save a .csv file.
import pandas as pd
def save_csv(lista):
    try:
        print("Salvando...")
        name_path = time.strftime('%d%m%y') + '01' + '.csv'
        df = pd.DataFrame(lista, columns=["column"])
        df.to_csv(name_path, index=False)
    except:
        pass
dados = [-0.9143399074673653, -1.0944355744868517, -1.1022400576621294]
save_csv(dados)
Path name is 'DayMonthYear01.csv' (20121701.csv).
When I run the code it finishes but no file is saved.
The output of the code is just:
>>>
RESTART: C:\Users\eduhz\AppData\Local\Programs\Python\Python36-32\testeCSV.py
Salvando...
>>>
Does anyone know what I am missing?
First, as answered by @Abdou, I changed the code to show me what the error was.
import pandas as pd
import time
def save_csv(lista):
    try:
        print("Salvando...")
        name_path = time.strftime('%d%m%y') + '01' + '.csv'
        df = pd.DataFrame(lista, columns=["column"])
        df.to_csv(name_path, index=False)
    except Exception as e:
        print(e)
dados = [-0.9143399074673653, -1.0944355744868517, -1.1022400576621294]
save_csv(dados)
Then I found out it was due to a permission error:
[Errno 13] Permission denied
caused by the fact that Notepad (without being opened as Administrator) does not have access to some directories, so anything run from it wouldn't be able to write to those directories.
I tried running Notepad as Administrator, but it didn't work.
The solution was running the code from Python IDLE.
Did you import the time module? All I did was add that, and it made a 21121701.csv with the 3 entries in one column in the current working directory.
import pandas as pd
import time
def save_csv(lista):
    print("Salvando...")
    name_path = time.strftime('%d%m%y') + '01' + '.csv'
    df = pd.DataFrame(lista, columns=["column"])
    df.to_csv(name_path, index=False)
dados = [-0.9143399074673653, -1.0944355744868517, -1.1022400576621294]
save_csv(dados)
Removing the try/except gives a file permission error if you have a file of the same name already open. You have to close any file you are trying to write (on Windows at least).
Per Abdou's comment, if you (or the program) don't have write access to the directory, that would cause a permission error too.
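If write access to the current working directory is the underlying problem, one option (a sketch, not taken from the answers above) is to write to a location that is normally writable regardless of how the script was launched, such as the user's home directory:
import os
import time
import pandas as pd

def save_csv(lista):
    # Build the file name inside the user's home directory, which is
    # normally writable no matter where the script was started from
    name_path = os.path.join(os.path.expanduser('~'),
                             time.strftime('%d%m%y') + '01' + '.csv')
    pd.DataFrame(lista, columns=["column"]).to_csv(name_path, index=False)
    return name_path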

Read csv into database SQLite3 ODO Python

I am trying to read a csv into a new table in a new database using odo, sqlite3 and Python.
I am following these guides:
https://media.readthedocs.org/pdf/odo/latest/odo.pdf
http://odo.pydata.org/en/latest/perf.html?highlight=sqlite#csv-sqlite3-57m-31s
I am trying the following:
import sqlite3
import csv
from odo import odo
file_path = 'my_path/'
# In this case 'my_path/' is a substitute for my real path
db_name = 'data.sqlite'
conn = sqlite3.connect(file_path + db_name)
This creates a new sqlite file data.sqlite within file_path. I can see it there in the folder.
When I then try to read my csv into this database I get the following error:
csv_path = 'my_path/data.csv'
odo(csv_path, file_path + db_name)
conn.close()
NotImplementedError: Unable to parse uri to data resource: # lists my path
Can you help?
No thanks to the odo documentation, this successfully created a new table in a new database and read the csv file into that database:
import sqlite3
import csv
import pandas as pd
from odo import odo, discover, resource  # discover and resource are used below
# [1]
# Specify file path
file_path = 'my_path/'
# In this case 'my_path/' is a substitute for my real path
# Specify csv file path and name
csv_path = file_path + 'data.csv'
# Specify database name
db_name = 'data.sqlite'
# Connect to new database
conn = sqlite3.connect(file_path + db_name)
# [2]
# Use Odo to detect the shape and datatype of your csv:
data_shape = discover(resource(csv_path))
# Read in csv to a new table called 'data' within database 'data.sqlite'
odo(pd.read_csv(csv_path), 'sqlite:///' + file_path + 'data.sqlite::data', dshape=data_shape)
# Close database
conn.close()
Sources used in [1]:
https://docs.python.org/2/library/sqlite3.html
python odo sql AssertionError: datashape must be Record type, got 0 * {...}
Sources used in [2]:
https://stackoverflow.com/a/41584832/2254228
http://sebastianraschka.com/Articles/2014_sqlite_in_python_tutorial.html#creating-a-new-sqlite-database
https://stackoverflow.com/a/33316230/2254228
what is difference between .sqlite and .db file?
The ODO documentation is here (good luck...) https://media.readthedocs.org/pdf/odo/latest/odo.pdf
I found that the documentation on the docs website and on GitHub are different. Please use the GitHub version as a reference.
The
NotImplementedError: Unable to parse uri to data resource
error is mentioned in this section.
You could solve it by using
pip install odo[sqlite] or
pip install odo[sqlalchemy]
Then you may encounter another error if you use Windows and odo 0.5.0:
AttributeError: 'DiGraph' object has no attribute 'edge'
Installing networkx 1.11 instead of networkx 2.0 could solve this error.
(reference)
pip uninstall networkx
pip install networkx==1.11
I hope this will help
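If odo keeps misbehaving, a plain pandas plus sqlite3 route performs the same csv-to-table load without odo at all (a sketch with placeholder paths, not from the answers above):
import sqlite3
import pandas as pd

conn = sqlite3.connect('my_path/data.sqlite')  # placeholder path

# Let pandas infer the column types and create (or replace) the 'data' table
pd.read_csv('my_path/data.csv').to_sql('data', conn, if_exists='replace', index=False)

conn.close()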
