I've been trying to write a simple upload function that lets the user choose a CSV file from their PC and upload it to my MongoDB database. I am currently using Python, PyMongo and pandas, and it works, but only with my "local" address (C:\Users\joao.soeiro\Downloads), as the code shows.
I'd like to know how I could make this string "dynamic" so that it reads and uploads files from anywhere, not only from my computer. I know it must be a silly question, but I'm really a beginner here...
I thought about creating a temporary directory using the tempfile module, but I don't know how I'd make it work in my code, which is the following:
import pandas as pd
from pymongo import MongoClient
client = MongoClient("mongodb+srv://xxx:xxx@bycardb.lrp4p.mongodb.net/myFirstDatabase?retryWrites=true&w=majority")
print('connected')
db = client['dbycar']
collection = db['users']
data = pd.read_csv(r'C:\Users\joao.soeiro\Downloads\csteste4.csv')
data.reset_index(inplace=True)
data_dict = data.to_dict("records")
collection.insert_many(data_dict)
Solved with this:
import tkinter as tk
from IPython.display import display
from tkinter import filedialog
import pandas as pd
from pymongo import MongoClient
# connecting to the database
client = MongoClient("mongodb+srv://xxxx:xxxx@bycardb.lrp4p.mongodb.net/myFirstDatabase?retryWrites=true&w=majority")
print('connected to the database')
db = client['dbycar']
collection = db['usuarios']
root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
print(file_path)
data = pd.read_csv(file_path)
data.reset_index(inplace=True)
data_dict = data.to_dict("records")
df = pd.DataFrame(data_dict)
display(df)
collection.insert_many(data_dict)
print('uploaded')
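If a file dialog is not an option (for example on a machine without a display), another way to avoid the hard-coded path is to take it as a command-line argument. The following is only a minimal sketch: `load_records` and `main` are hypothetical names, and it uses the standard-library `csv` module instead of pandas so that it runs without extra packages (all values therefore come back as strings). The `insert_many` call is left as a comment because it needs a live connection.

```python
import argparse
import csv
from pathlib import Path


def load_records(csv_path):
    """Read a CSV file and return a list of dicts, one per row."""
    path = Path(csv_path).expanduser()
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def main(argv=None):
    """Parse the CSV path from the command line and load its rows."""
    parser = argparse.ArgumentParser(description="Load a CSV file chosen by the user")
    parser.add_argument("csv_path", help="path to the CSV file to upload")
    args = parser.parse_args(argv)

    records = load_records(args.csv_path)
    print(f"read {len(records)} rows from {args.csv_path}")
    # collection.insert_many(records)  # same call as in the answer above
    return records
```

Calling `main()` with no arguments reads the path from `sys.argv`, so a script ending in `main()` could be run as `python upload.py some_file.csv` (the script name is just an example).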
I have this structure:
Folder_1
  - Scripts
      - functions
          - data_import.py
      - main_notebook.ipynb
  - Data
      - sales_data_1.csv
  - SQL
      - sales_data_1.sql
      - sql_2.sql
Inside data_import.py I have this function:
import os
import pandas as pd
import numpy as np
import psycopg2 as pg
sql_path = r"C:Folder_1\SQL/" # path to local sql folder
data_path = r"C:Folder_1\Data/" # path to local data folder
conn = pg.connect(
dbname="db",
host="host",
user="user",
password="pw",
port="port",
)
conn.set_session(autocommit=True)
def get_data(sql_file_name):
    # if file already exists, load it
    if os.path.isfile(f'{data_path}{sql_file_name}.csv'):
        df = pd.read_csv(f'{data_path}{sql_file_name}.csv')
        return df
    # otherwise, create it
    else:
        # get data from the internet
        query = open(f'{sql_path}{sql_file_name}.sql', 'r').read()
        df = pd.read_sql_query(query,conn)
        # save it to a file
        df.to_csv(f'{data_path}{sql_file_name}.csv', index=False)
        # return it
        return df
In my main_notebook.ipynb I import the functions like so:
from functions import data_import as jp
When I am trying to use it like this:
sales_data = jp.get_data('sales_data_1')
I get this:
But when I define the same, identical function directly in main_notebook.ipynb (with the imports and connection above), I do get the actual DataFrame: if the file is not present in the Data folder, the function correctly runs the query and saves the .csv file inside the Data folder for the next use.
sales_data = get_data('sales_data_1') # does work as expected
But when I use it after importing, it gives me the query instead of the actual pd.DataFrame. I am not sure where my mistake is; the goal is for the function inside the data_import module to work exactly as if it were written in the main_notebook.ipynb file.
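One thing that can make an imported function behave differently from the same code pasted into a notebook is that the notebook keeps the first-imported version of the module in memory, so later edits to data_import.py are not picked up. Whether that is the cause here is only a guess, since the output is not shown. The sketch below reproduces the effect with a throwaway module (`data_import_demo` is a hypothetical stand-in written to a temp folder) and shows `importlib.reload` picking up the edit.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for functions/data_import.py, written to a temp folder.
module_dir = Path(tempfile.mkdtemp())
module_file = module_dir / "data_import_demo.py"
module_file.write_text("def get_data(name):\n    return 'query text'\n")

sys.path.insert(0, str(module_dir))
jp = importlib.import_module("data_import_demo")

first = jp.get_data("sales_data_1")  # old behaviour: returns the query string

# Edit the module on disk, as you would in an editor.
module_file.write_text(
    "def get_data(name):\n    return {'rows': [], 'source': name}\n"
)

stale = jp.get_data("sales_data_1")  # still the old function held in memory

importlib.invalidate_caches()
jp = importlib.reload(jp)
fresh = jp.get_data("sales_data_1")  # now the edited version runs
```

In a notebook, restarting the kernel has the same effect, and `%load_ext autoreload` followed by `%autoreload 2` reloads edited modules automatically.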
I am trying to run a query, with the result saved as a CSV that is uploaded to a SharePoint folder. This is within Databricks via Pyspark.
My code below is close to doing this, but the final line is not functioning correctly - the file generated in SharePoint does not contain any data, though the dataframe does.
I'm new to Python and Databricks; if anyone can provide some guidance on how to correct that final line, I'd really appreciate it!
from shareplum import Site
from shareplum import Office365
from shareplum.site import Version
import pandas as pd
sharepointUsername =
sharepointPassword =
sharepointSite =
website =
sharepointFolder =
# Connect to SharePoint Folder
authcookie = Office365(website, username=sharepointUsername, password=sharepointPassword).GetCookies()
site = Site(sharepointSite, version=Version.v2016, authcookie=authcookie)
folder = site.Folder(sharepointFolder)
FileName = "Data_Export.csv"
Query = "SELECT * FROM TABLE"
df = spark.sql(Query)
pandasdf = df.toPandas()
folder.upload_file(pandasdf.to_csv(FileName, encoding = 'utf-8'), FileName)
I'm sure my code is still garbage, but it does work. I needed to convert the DataFrame into a variable containing CSV-formatted data before uploading it to SharePoint; effectively, I was trying to skip a step before. The last two lines were updated:
from shareplum import Site
from shareplum import Office365
from shareplum.site import Version
import pandas as pd
sharepointUsername =
sharepointPassword =
sharepointSite =
website =
sharepointFolder =
# Connect to SharePoint Folder
authcookie = Office365(website, username=sharepointUsername, password=sharepointPassword).GetCookies()
site = Site(sharepointSite, version=Version.v2016, authcookie=authcookie)
folder = site.Folder(sharepointFolder)
FileName = "Data_Export.csv"
Query = "SELECT * FROM TABLE"
df = spark.sql(Query).toPandas().to_csv(header=True, index=False, encoding='utf-8')
folder.upload_file(df, FileName)
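The reason the first attempt uploaded an empty file can be seen without SharePoint at all. As a small sketch with a made-up two-row DataFrame: when `to_csv` is given a path or buffer it writes there and returns `None`, and only when called without one does it return the CSV text that `upload_file` can send.

```python
import io

import pandas as pd

# Made-up example data, standing in for the Spark query result.
frame = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = frame.to_csv(index=False)

# With a path or buffer argument, it writes there and returns None.
buffer = io.StringIO()
result = frame.to_csv(buffer, index=False)

print(type(csv_text).__name__, result)
```

So `folder.upload_file(pandasdf.to_csv(FileName, ...), FileName)` was passing `None` as the file content.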
So, my data is in the format of CSV files in the OSS bucket of Alibaba Cloud.
I am currently executing a Python script, wherein:
1. I download the file onto my local machine.
2. I make the changes using a Python script on my local machine.
3. I store the result in AWS Cloud.
I have to modify this method and schedule a cron job in Alibaba Cloud to automate the running of this script.
The Python script will be uploaded into Task Management of Alibaba Cloud.
So the new steps will be:
1. Read a file from the OSS bucket into pandas.
2. Modify it (merging it with other data, some column changes); this will be done in pandas.
3. Store the modified file in AWS RDS.
I am stuck at the first step itself.
Error Log:
"No module found" for OSS2 & pandas.
What is the correct way of doing it?
This is a rough draft of my script (showing how I was able to run it on my local machine):
import os,re
import oss2  # throws an error: No module found
import datetime as dt
import pandas as pd  # throws an error: No module found
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice
dates = (dt.datetime.now()+dt.timedelta(days=-1)).strftime("%Y%m%d")
def download_file(access_key_id,access_key_secret,endpoint,bucket):
    #Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates+'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            df = pd.read_csv(bucket.get_object(order_file)) # to read into pandas
            # FUNCTION to modify and upload
        print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
import os,re
import oss2
import datetime as dt
import pandas as pd
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice
import io ## include this new library
dates = (dt.datetime.now()+dt.timedelta(days=-1)).strftime("%Y%m%d")
def download_file(access_key_id,access_key_secret,endpoint,bucket):
    #Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates+'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            bucket_object = bucket.get_object(order_file).read() ## read the file from OSS
            img_buf = io.BytesIO(bucket_object)  ## wrap the bytes in a file-like buffer
            df = pd.read_csv(img_buf) # to read into pandas
            # FUNCTION to modify and upload
        print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
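The key change in the version above is reading the object's bytes and wrapping them in `io.BytesIO`, so the reader gets something that behaves like an open file. The same idea can be tried without oss2 or pandas; in this sketch `raw_bytes` is a made-up stand-in for what `bucket.get_object(order_file).read()` returns, and the standard-library `csv` module plays the role of `pd.read_csv`.

```python
import csv
import io

# Made-up stand-in for bucket.get_object(order_file).read(): raw CSV bytes.
raw_bytes = b"order_id,amount\n1001,25.50\n1002,13.00\n"

# Wrap the bytes so they behave like an open binary file.
binary_buffer = io.BytesIO(raw_bytes)

# pd.read_csv accepts such a binary buffer directly; the csv module needs
# text, so decode it on the fly with a TextIOWrapper.
text_buffer = io.TextIOWrapper(binary_buffer, encoding="utf-8", newline="")
rows = list(csv.DictReader(text_buffer))

print(rows)
```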
I want to download an entire collection and write it to a JSON file. I've tried the code below, but it doesn't work.
import json
from pymongo import MongoClient
import pymongo
from pathlib import Path
myclient = MongoClient("mongodb+srv://<DbName>:<DbPass>@<DbName>.a3b2ai.mongodb.net/<DbName>?retryWrites=true&w=majority")
db = myclient["PlayerPrices"]
Collection = db["Playstation"]
payload = db.inventory.find( {} ) #I think this command is the problem
with open(str(Path(__file__).parents[1]) + '\Main\playstation_1.json', 'r+') as file:
    json.dump(payload, file, indent=4)
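The issue is that json.dump cannot serialize a PyMongo Cursor directly. Convert the cursor to a list of documents first, then serialize it with bson.json_util.dumps, which also handles BSON types such as ObjectId: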
# Python program demonstrating
# PyMongo Cursor to JSON
# Importing required modules
from pymongo import MongoClient
from bson.json_util import dumps, loads
# Connecting to the MongoDB server
# client = MongoClient('host_name', port_number)
client = MongoClient('localhost', 27017)
# Connecting to the database named GFG
mydatabase = client.GFG
# Accessing the collection named College
mycollection = mydatabase.College
# Creating a Cursor instance using the find() function
cursor = mycollection.find()
# Converting the cursor to a list of dictionaries
list_cur = list(cursor)
# Converting to JSON
json_data = dumps(list_cur, indent = 2)
# Writing the data to the file data.json
with open('data.json', 'w') as file:
    file.write(json_data)
Resource taken from: https://www.geeksforgeeks.org/convert-pymongo-cursor-to-json/
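To see why the cursor has to be turned into a list first, here is a small sketch that needs no database. `fake_cursor` is a hypothetical stand-in for `collection.find()`: like a cursor, it yields documents lazily. It also uses plain `json.dumps` with `default=str` as a rough fallback for values the standard encoder rejects; that is a swapped-in simplification, not what `bson.json_util.dumps` does internally.

```python
import json
from datetime import datetime


def fake_cursor():
    """Hypothetical stand-in for collection.find(): yields documents lazily."""
    yield {"_id": 1, "player": "A", "updated": datetime(2021, 1, 1)}
    yield {"_id": 2, "player": "B", "updated": datetime(2021, 1, 2)}


# Passing the lazy object straight to json fails, as in the question.
try:
    json.dumps(fake_cursor())
    failed = False
except TypeError:
    failed = True

# Materialise the documents first, then serialise the list.
documents = list(fake_cursor())
json_text = json.dumps(documents, default=str, indent=2)

print(failed)
print(json_text)
```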
I have the following problem: I want to extract data from HDFS (a table called 'complaint'). I wrote the following script, which actually works:
import pandas as pd
from hdfs import InsecureClient
import os
file = open ("test.txt", "wb")
print ("Step 1")
client_hdfs = InsecureClient ('http://XYZ')
N = 10
print ("Step 2")
with client_hdfs.read('/user/.../complaint/000000_0') as reader:
    print('new line')
    features = reader.read(1000000)
    file.write(features)
print('end')
file.close()
My problem now is that the folder "complaint" contains 4 files (I don't know which file type), and the read operation gives me back bytes that I can't use further. I saved them to a text file as a test, and it looks like this:
In HDFS it looks like this:
My question now is:
Is it possible to get the data separated by column in a meaningful way?
I only found solutions for .csv files and the like, and I'm somehow stuck here... :-)
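Since the file type is unknown, one way to narrow it down is to look at the first few bytes that `reader.read(...)` returns. Several formats commonly stored in HDFS start with a fixed signature. The sketch below checks a handful of them; `sniff_format` and the `MAGIC_BYTES` table are hypothetical helpers, and the list is not exhaustive.

```python
# Leading signatures of some formats commonly stored in HDFS.
MAGIC_BYTES = {
    b"PAR1": "parquet",
    b"ORC": "orc",
    b"Obj\x01": "avro",
    b"SEQ": "hadoop sequence file",
    b"\x1f\x8b": "gzip",
}


def sniff_format(head):
    """Guess a file format from the first bytes of its content."""
    for magic, name in MAGIC_BYTES.items():
        if head.startswith(magic):
            return name
    # No known signature: it may be plain delimited text.
    return "unknown (possibly delimited text)"


print(sniff_format(b"PAR1\x15\x04\x15"))
print(sniff_format(b"1\x01complaint text\x01open\n"))
```

In the script above this could be called as `sniff_format(features[:4])` right after the `reader.read(1000000)` line.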
EDIT
I made changes to my solution and tried different approaches, but none of them really works. Here's the updated code:
import pandas as pd
from hdfs import InsecureClient
import os
import pypyodbc
import pyspark
from pyspark import SparkConf, SparkContext
from hdfs3 import HDFileSystem
import pyarrow.parquet as pq
import pyarrow as pa
from pyhive import hive
#Step 0: Configurations
#Connections with InsecureClient (this basically works)
#Notes: TMS1 doesn't work because of txt files
#insec_client_tms1 = InsecureClient ('http://some-adress:50070')
insec_client_tms2 = InsecureClient ('http://some-adress:50070')
#Connection with Spark (not working at the moment)
#Error: Java gateway process exited before sending its port number
#conf = SparkConf().setAppName('TMS').setMaster('spark://adress-of-node:7077')
#sc = SparkContext(conf=conf)
#Connection via PyArrow (not working)
#Error: File not found
#fs = pa.hdfs.connect(host='hdfs://node-adress', port =8020)
#print("FS: " + fs)
#connection via HDFS3 (not working)
#The module couldn't be loaded
#client_hdfs = HDFileSystem(host='hdfs://node-adress', port=8020)
#Connection via Hive (not working)
#no module named sasl -> I tried to install it, but it also fails
#conn = hive.Connection(host='hdfs://node-adress', port=8020, database='deltatest')
#Step 1: Extractions
print ("starting Extraction")
#Create file
file = open ("extraction.txt", "w")
#Extraction with Spark
#text = sc.textFile('/user/hive/warehouse/XYZ.db/baseorder_flags/000000_0')
#first_line = text.first()
#print (first_line)
#extraction with hive
#df = pd.read_sql ('select * from baseorder',conn)
#print ("DF: "+ df)
#extraction with hdfs3
#with client_hdfs.open('/home/deltatest/basedeviation/000000_0') as f:
# df = pd.read_parquet(f)
#Extraction with Webclient (not working)
#Error: Arrow error: IOError: seek -> fastparquet has a similar error
with insec_client_tms2.read('/home/deltatest/basedeviation/000000_0') as reader:
    features = pd.read_parquet(reader)
    print (features)
    #features = reader.read()
    #data = features.decode('utf-8', 'replace')
    print("saving data to file")
    file.write(data)
print('end')
file.close()
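A possible explanation for the "seek" error, offered as an assumption rather than a diagnosis: Parquet readers need to jump to the footer at the end of the file, while a stream coming from an HTTP response can only be read forward. Reading all the bytes into an `io.BytesIO` first gives the reader a seekable object. The sketch below shows this with a hypothetical `ForwardOnlyStream` class standing in for the WebHDFS reader and made-up payload bytes; it does not call pandas or HDFS.

```python
import io


class ForwardOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for an HTTP response body: readable, not seekable."""

    def __init__(self, payload):
        self._inner = io.BytesIO(payload)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buffer):
        chunk = self._inner.read(len(buffer))
        buffer[:len(chunk)] = chunk
        return len(chunk)


stream = ForwardOnlyStream(b"PAR1...made-up body...PAR1")

# Seeking on the raw stream is rejected.
try:
    stream.seek(0)
    seek_failed = False
except io.UnsupportedOperation:
    seek_failed = True

# Reading everything into memory yields a seekable buffer.
buffered = io.BytesIO(stream.read())
buffered.seek(-4, io.SEEK_END)
tail = buffered.read()
buffered.seek(0)

print(seek_failed, tail)
```

With the real client, the equivalent attempt would be `pd.read_parquet(io.BytesIO(reader.read()))` inside the `with` block, assuming the file really is Parquet and fits in memory.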