I have this structure:
Folder_1
- Scripts
  - functions
    - data_import.py
  - main_notebook.ipynb
- Data
  - sales_data_1.csv
- SQL
  - sales_data_1.sql
  - sql_2.sql
Inside data_import.py I have this function:
import os
import pandas as pd
import numpy as np
import psycopg2 as pg
sql_path = r"C:Folder_1\SQL/" # path to local sql folder
data_path = r"C:Folder_1\Data/" # path to local data folder
conn = pg.connect(
    dbname="db",
    host="host",
    user="user",
    password="pw",
    port="port",
)
conn.set_session(autocommit=True)
def get_data(sql_file_name):
    # if the file already exists, load it
    if os.path.isfile(f'{data_path}{sql_file_name}.csv'):
        df = pd.read_csv(f'{data_path}{sql_file_name}.csv')
        return df
    # otherwise, create it
    else:
        # read the query text and run it against the database
        with open(f'{sql_path}{sql_file_name}.sql', 'r') as f:
            query = f.read()
        df = pd.read_sql_query(query, conn)
        # save it to a file
        df.to_csv(f'{data_path}{sql_file_name}.csv', index=False)
        # return it
        return df
In my main_notebook.ipynb I import the functions like so:
from functions import data_import as jp
When I am trying to use it like this:
sales_data = jp.get_data('sales_data_1')
I get this:
But when I define the identical function directly in my main_notebook.ipynb (with the imports and connection above), it works: if the file is not present in the Data folder, the function runs the query, returns the actual df, and saves the .csv file inside the Data folder for the next use.
sales_data = get_data('sales_data_1') # does work as expected
But when I call it from the imported module, it returns the query instead of the actual pd.DataFrame. I am not sure where my mistake is; the goal is for the function inside the data_import module to work exactly as if it were written in the main_notebook.ipynb file.
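For reference, the load-or-create behaviour described above can be sketched independently of the database connection. This is only a minimal sketch, not the asker's module: `load_or_create` and the `loader` callback are hypothetical names, and the callback stands in for the `pd.read_sql_query` call. Paths are built with `pathlib` instead of string concatenation.

```python
import tempfile
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_create(csv_path: Path, loader: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached CSV if present; otherwise call loader() and cache its result."""
    if csv_path.is_file():
        return pd.read_csv(csv_path)
    df = loader()
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(csv_path, index=False)
    return df


# Demo with a stand-in loader (hypothetical data; a real loader would run the SQL query).
calls = []


def fake_query() -> pd.DataFrame:
    calls.append(1)
    return pd.DataFrame({"order_id": [1, 2], "amount": [9.5, 3.0]})


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "Data" / "sales_data_1.csv"
    first = load_or_create(target, fake_query)   # runs the loader and writes the CSV
    second = load_or_create(target, fake_query)  # reads the CSV; the loader is not called again
```

Because the function always returns a DataFrame on both branches, checking `type(result)` in the notebook is a quick way to see which version of the code is actually being executed.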
I have been trying to write a Python script to mail merge labels. It needs to look into a folder, open an Excel document, merge the document, and print it as a PDF. All the rows in each Excel file are part of the same document and I'd like them to be printed together. I've written a script that opens a Word template and pulls in the Excel file to populate the mail merge, but when I print it:
- The printed copy only shows the merge fields, not the information from the workbook.
- Only the first page is printed; some of the files I use to make labels would produce more than one page.
I've included the code that I have as well as pictures of what I'm currently getting and what I need the end product to look like.
If anyone can help me with this, you would be a lifesaver.
What I need:
What I'm getting:
from os import listdir
import win32com.client as win32
import pathlib
import os
import pandas as pd
pd.options.mode.chained_assignment = None
working_directory = os.getcwd()
path = pathlib.Path().resolve()
inputPath = os.path.join(path, 'Output')
outputPath = os.path.join(path, 'OutputPDF')
inputs = listdir(inputPath)
wordApp = win32.Dispatch('Word.Application')
wordApp.Visible = True
sourceDoc = wordApp.Documents.Open(os.path.join(working_directory, 'labelTemplate.docx'))
mail_merge = sourceDoc.MailMerge
for x in inputs[1:]:
    mail_merge.OpenDataSource(os.path.join(inputPath, x))
    print(x)
    # strip the extension and the 'output_' prefix to get the PDF name
    y = x.replace('.xlsx', '')
    z = y.replace('output_', '')
    print(z)
    mail_merge = wordApp.ActiveDocument
    # 17 is wdExportFormatPDF
    mail_merge.ExportAsFixedFormat(os.path.join(outputPath, z), ExportFormat=17)
I am trying to run a query, with the result saved as a CSV that is uploaded to a SharePoint folder. This is within Databricks via Pyspark.
My code below is close to doing this, but the final line is not functioning correctly - the file generated in SharePoint does not contain any data, though the dataframe does.
I'm new to Python and Databricks, if anyone can provide some guidance on how to correct that final line I'd really appreciate it!
from shareplum import Site, Office365
from shareplum.site import Version
import pandas as pd
sharepointUsername =
sharepointPassword =
sharepointSite =
website =
sharepointFolder =
# Connect to SharePoint Folder
authcookie = Office365(website, username=sharepointUsername, password=sharepointPassword).GetCookies()
site = Site(sharepointSite, version=Version.v2016, authcookie=authcookie)
folder = site.Folder(sharepointFolder)
FileName = "Data_Export.csv"
Query = "SELECT * FROM TABLE"
df = spark.sql(Query)
pandasdf = df.toPandas()
folder.upload_file(pandasdf.to_csv(FileName, encoding = 'utf-8'), FileName)
Sure my code is still garbage, but it does work. I needed to convert the dataframe into a variable containing CSV formatted data prior to uploading it to SharePoint; effectively I was trying to skip a step before. Last two lines were updated:
from shareplum import Site, Office365
from shareplum.site import Version
import pandas as pd
sharepointUsername =
sharepointPassword =
sharepointSite =
website =
sharepointFolder =
# Connect to SharePoint Folder
authcookie = Office365(website, username=sharepointUsername, password=sharepointPassword).GetCookies()
site = Site(sharepointSite, version=Version.v2016, authcookie=authcookie)
folder = site.Folder(sharepointFolder)
FileName = "Data_Export.csv"
Query = "SELECT * FROM TABLE"
df = (spark.sql(Query)).toPandas().to_csv(header=True, index=False, encoding='utf-8')
folder.upload_file(df, FileName)
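The difference between the two versions comes down to what `DataFrame.to_csv` returns. A minimal sketch with made-up data (the column names and temporary path are illustrative only): when given a path, `to_csv` writes the file and returns `None`, so the original code passed `None` to the upload call; when called without a path, it returns the CSV text, which is what the upload needs.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

with tempfile.TemporaryDirectory() as tmp:
    # With a path, to_csv writes the file to disk and returns None.
    written = df.to_csv(Path(tmp) / "Data_Export.csv", index=False)

# Without a path, to_csv returns the CSV content as a string.
csv_text = df.to_csv(index=False)

print(written)               # None
print(csv_text.splitlines()) # header row followed by one line per record
```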
I've been trying to write a simple upload function that lets the user choose a CSV file from their PC and upload it into my MongoDB database. I am currently using Python, PyMongo and pandas to do it, and it works, but only with my "local" address (C:\Users\joao.soeiro\Downloads), as shown in the code.
I'd like to know how I could make this string "dynamic" so it reads and uploads files from anywhere, not only my computer. I know it must be a silly question, but I'm really a beginner here...
I thought about creating a temporary directory using the tempfile module, but I don't know how I'd make it work in my code, which is the following:
import pandas as pd
from pymongo import MongoClient
client = MongoClient("mongodb+srv://xxx:xxx@bycardb.lrp4p.mongodb.net/myFirstDatabase?retryWrites=true&w=majority")
print('connected')
db = client['dbycar']
collection = db['users']
data = pd.read_csv(r'C:\Users\joao.soeiro\Downloads\csteste4.csv')
data.reset_index(inplace=True)
data_dict = data.to_dict("records")
collection.insert_many(data_dict)
Solved with this:
import tkinter as tk
from IPython.display import display
from tkinter import filedialog
import pandas as pd
from pymongo import MongoClient
# connecting to the db
client = MongoClient("mongodb+srv://xxxx:xxxx@bycardb.lrp4p.mongodb.net/myFirstDatabase?retryWrites=true&w=majority")
print('connected to the database')
db = client['dbycar']
collection = db['usuarios']
root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
print(file_path)
data = pd.read_csv(file_path)
data.reset_index(inplace=True)
data_dict = data.to_dict("records")
df = pd.DataFrame(data_dict)
display(df)
collection.insert_many(data_dict)
print('uploaded')
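The key change in the solution is that the path is no longer hard-coded; it arrives as a value at run time. A minimal sketch of the same idea without the GUI, assuming a hypothetical helper named `csv_to_records`: the function accepts any path (from a file dialog, `sys.argv`, or a web upload) and returns the list of dicts that `insert_many` expects. The sample file and its columns are made up for the demo.

```python
import tempfile
from pathlib import Path

import pandas as pd


def csv_to_records(csv_path):
    """Read a CSV from any location and return one dict per row."""
    data = pd.read_csv(csv_path)
    return data.to_dict("records")


# Demo with a throwaway file instead of a user-selected one.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("name,age\nAna,30\nRui,41\n", encoding="utf-8")
    records = csv_to_records(sample)

# The result could then be passed to collection.insert_many(records).
print(records)
```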
So, my data is in the format of CSV files in the OSS bucket of Alibaba Cloud.
I am currently executing a Python script, wherein:
1. I download the file to my local machine.
2. I make the changes using a Python script on my local machine.
3. I store it in AWS Cloud.
I have to modify this method and schedule a cron job in Alibaba Cloud to automate the running of this script.
The Python script will be uploaded into Task Management of Alibaba Cloud.
So the new steps will be:
1. Read a file from the OSS bucket into pandas.
2. Modify it (merging it with other data, some column changes); this will be done in pandas.
3. Store the modified file in AWS RDS.
I am stuck at the first step itself.
Error Log:
"No module found" for OSS2 & pandas.
What is the correct way of doing it?
This is a rough draft of my script (this is how I was able to execute it on my local machine):
import os,re
import oss2  # throws an error: no module found
import datetime as dt
import pandas as pd  # throws an error: no module found
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice
dates = (dt.datetime.now()+dt.timedelta(days=-1)).strftime("%Y%m%d")
def download_file(access_key_id,access_key_secret,endpoint,bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates+'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            df = pd.read_csv(bucket.get_object(order_file)) # to read into pandas
            # FUNCTION to modify and upload
            print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
import os,re
import oss2
import datetime as dt
import pandas as pd
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice
import io ## include this new library
dates = (dt.datetime.now()+dt.timedelta(days=-1)).strftime("%Y%m%d")
def download_file(access_key_id,access_key_secret,endpoint,bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    objectName = None  # stays None if nothing could be read
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates+'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            bucket_object = bucket.get_object(order_file).read() ## read the file from OSS
            img_buf = io.BytesIO(bucket_object)  # wrap the bytes in a file-like object
            df = pd.read_csv(img_buf) # to read into pandas
            # FUNCTION to modify and upload
            print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
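The change that matters in the code above is the `io.BytesIO` wrapper: `pd.read_csv` accepts a path or a file-like object, and wrapping the downloaded bytes in `BytesIO` gives it one without touching the disk. A minimal sketch with stand-in bytes, since the OSS call itself cannot be reproduced here (the CSV content is made up, and it is assumed that `.read()` on the OSS object yields bytes as in the code above):

```python
import io

import pandas as pd

# Stand-in for bucket.get_object(order_file).read(): raw CSV bytes (hypothetical content).
raw_bytes = b"order_id,amount\n1,9.5\n2,3.0\n"

buffer = io.BytesIO(raw_bytes)  # in-memory, seekable file-like object
df = pd.read_csv(buffer)        # parsed exactly as if it were a file on disk

print(df)
```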
I am working on the Data Analysis using SQL on Kaggle.
https://www.kaggle.com/dimarudov/data-analysis-using-sql/comments
However, I am not sure why tables is returning a blank database.
import numpy as np
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
path = r"C:/Users/ksumm/OneDrive/Desktop/Python Projects/Euro Soccer/database.sqlite"
database = path + 'database.sqlite'
conn = sqlite3.connect(database)
tables = pd.read_sql("""SELECT *
FROM sqlite_master
WHERE type='table';""", conn)
Output image:
Instead of using
database = path + 'database.sqlite'
you can use path directly, since it already contains the full path to the SQLite database.
Modified code:
import numpy as np
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
path = r"C:/Users/ksumm/OneDrive/Desktop/Python Projects/Euro Soccer/database.sqlite"
conn = sqlite3.connect(path)
tables = pd.read_sql("""SELECT *
FROM sqlite_master
WHERE type='table';""", conn)
OR
import numpy as np
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
path = r"C:/Users/ksumm/OneDrive/Desktop/Python Projects/Euro Soccer/"
database = path + "database.sqlite"
conn = sqlite3.connect(database)
tables = pd.read_sql("""SELECT *
FROM sqlite_master
WHERE type='table';""", conn)
path = r"C:/Users/ksumm/OneDrive/Desktop/Python Projects/Euro Soccer/database.sqlite"
database = path + 'database.sqlite'
You're appending the database name to the database name. If you look on your disk, you may find a file named
C:\Users\ksumm\OneDrive\Desktop\Python Projects\Euro Soccer\database.sqlitedatabase.sqlite
To avoid that in the future, the Python SQLite module has a slightly odd way to not create a database if it doesn't exist when you open it. The connect method accepts a URI, and the URI accepts parameters. When the filename is correct, this will do what you want:
conn = sqlite3.connect('file:%s?mode=rw' % database, uri=True )
If database does not describe an existing file, the rw mode causes the function to fail, raising a sqlite3.OperationalError exception.