Use Pandas dataframe in mrJob - python

I have a Python script and I need to use mrjob to make it faster.
How do I make the script below use mrjob?
The script below works fine for a small file, but with a large file it takes forever, so I am planning to use mrjob, which is a MapReduce package for Python. The problem is that I don't know how to use mrjob for this script. Please advise.
import os
import pandas as pd
import pyffx
import string
import sys
column='first_name'
filename="python_test.csv"
encrypted_value_list = []
alpha=string.printable
key=b'sec-key'
seperator_in='|'
seperator_out='|'
outputfile='encypted.csv'
compression_in=None
compression_out=None
df = pd.read_csv(filename,compression=compression_in, sep=seperator_in, low_memory=False, encoding='utf-8-sig')
df_null = df[df[column].isnull()]
df_notnull = df[df[column].notnull()].copy()
for index,row in df_notnull.iterrows():
    e = pyffx.String(key, alphabet=alpha, length=len(row[column]))
    encrypted_value_list.append(e.encrypt(row[column]))
df_notnull[column]=encrypted_value_list
df_merged = pd.concat([df_notnull, df_null], axis=0, ignore_index=True, sort=False)
df_merged
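With its default input protocol, mrjob hands a mapper one raw line of input at a time, so the DataFrame-wide loop above would have to be recast as a function that works on a single line. A minimal sketch of such a function, using only the standard library. The names `encrypt_field` and `fake_encrypt` are hypothetical, and the string reversal is only a stand-in for the `pyffx` call so that the sketch runs without third-party packages:

```python
import csv
import io


def encrypt_field(line, column_index, encrypt, sep="|"):
    """Encrypt one field of a single delimited line; empty fields pass through."""
    fields = next(csv.reader([line], delimiter=sep))
    if fields[column_index]:
        fields[column_index] = encrypt(fields[column_index])
    out = io.StringIO()
    csv.writer(out, delimiter=sep, lineterminator="").writerow(fields)
    return out.getvalue()


def fake_encrypt(value):
    # Stand-in for pyffx.String(...).encrypt(value); NOT real encryption.
    return value[::-1]


# Hypothetical data lines (header already removed); column 1 is first_name.
lines = ["1|Alice|NY", "2||LA", "3|Bob|SF"]
encrypted_lines = [encrypt_field(line, 1, fake_encrypt) for line in lines]
print(encrypted_lines)  # ['1|ecilA|NY', '2||LA', '3|boB|SF']
```

Inside an mrjob job this function would be called from the `mapper(self, _, line)` method. That wiring, skipping the header line, and the choice of output protocol are left out of the sketch.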

Related

(Raspberry Pi) Can a script import modules such as pandas and json without pip-installing them?

System: Raspberry Pi 4 Model B, 32-bit, running Python on Linux.
This may be a dumb question. I am planning to read data from MongoDB into Excel and also from Excel into MongoDB. Overall the .py script works fine (the code is below).
I know that if the code does "import pandas as pd", pandas has to be installed with pip from the Raspberry Pi command line.
My main question:
A Raspberry Pi has less memory than a typical laptop. Is there a way to use these packages without pip-installing all of them?
Besides, installing pandas alone with pip took about 15 minutes on the Raspberry Pi versus about 30 seconds on a laptop, and a factory might have more than a hundred Raspberry Pis recording temperature, product data, and so on along the production line.
There should be an efficient way to use pandas and pymongo without manually running pip install on every Raspberry Pi.
The memory left:
joy#raspberrypi:/ $ free
3834332/total , 223876/used , 2844436/free
The working code.py script (MongoDB to Excel):
import pandas as pd
from pymongo import MongoClient
import pymongo
from json2excel import Json2Excel
import json
from bson.objectid import ObjectId
from bson import json_util
client = pymongo.MongoClient("mongodb://localhost:27017/")
# Database Name
db = client["(practice_10_14)-0002"]
# Collection Name
col = db["(practice_10_24)read_MongoDB_to_Excel"]
# Find All: It works like Select * query of SQL.
x = col.find()
list_01 = []
for data in x:
    list_01.append(data)
    print(data)
print("= = = = = ")
df = pd.DataFrame(data,index=[0])
# select two columns
for y in df:
    print(y)
print("= = = = = ")
print(type(list_01))
print(list_01)
df = pd.DataFrame(list_01)
writer = pd.ExcelWriter('test10.24.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='welcome', index=False)
writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() also writes the file

Issues with Dask in Python

I have built a simple Dask application that uses multiprocessing to loop through files and create summaries. The code loops through all the zip files in the directory and builds a list of names while iterating through the files (a dummy task). I was not able to either print the name or append it to the list. I can't figure out what the issue is.
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
plt.ioff()
import time
import os
from pathlib import Path
import glob
import webbrowser
from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=2) # In this example I have 8 cores and processes (can also use threads if desired)
webbrowser.open(client.dashboard_link)
print(client)
os.chdir(r"D:\spx\Complete data\item_000027392")
csv_file_list=[file for file in glob.glob("*.zip")]
total_file=len(csv_file_list)
data_date=[]
columns=['Date', 'straddle_price_open', 'straddle_price_close']
summary=pd.DataFrame(columns =columns)
def my_function(i):
    df=pd.read_csv(Path(r"D:\spx\Complete data\item_000027392",csv_file_list[i]),skiprows=0)
    date = csv_file_list[i]  # name of the file handled by this task
    data_date.append(date)
    print(date)
    return date
futures = []
for i in range(0,total_file):
    future = client.submit(my_function, i)
    futures.append(future)
results = client.gather(futures)
client.close()
The idea is that I should be able to operate on the data and print outputs and charts while using Dask, but for some reason I can't.
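One likely cause is that `Client(n_workers=4, ...)` runs the tasks in separate worker processes, so `data_date.append(...)` inside `my_function` changes a copy of the list held by the worker and not the list in the script that submitted the task. A common pattern is to have each task return its result and build the list on the client side from the gathered results. A minimal sketch of that pattern follows. The function `file_summary` and the file names are hypothetical, and the built-in `map` stands in for Dask so the sketch runs without a cluster:

```python
from pathlib import Path


def file_summary(file_name):
    """Return what the task learned instead of appending to a global list."""
    return {"file": file_name, "stem": Path(file_name).stem}


file_names = ["2020-01-02.zip", "2020-01-03.zip"]  # hypothetical file names

# With dask.distributed this step would be
#   results = client.gather(client.map(file_summary, file_names))
# The built-in map is used here only as a stand-in.
results = list(map(file_summary, file_names))

# The list is built in the submitting script, from the returned values.
data_date = [r["stem"] for r in results]
print(data_date)  # ['2020-01-02', '2020-01-03']
```

Printing and charting can then be done on `results` in the submitting script, where the output is visible.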

How to let my subfile use my definition in main file?

I have this code in my main.py file:
import pandas as pd
from modules.my_self_defined import *
input='1.csv'
df=just_an_example(input)
And in ./modules/my_self_defined.py:
def just_an_example(csv_file):
    a=pd.read_csv(csv_file)
    return a
Then when I run the file, it says pd is not defined in ./modules/my_self_defined.py
How could I make it work?
You use pandas (pd) in my_self_defined.py, not in main.py. So import it in my_self_defined.py instead and it'll work.
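The reason is that every module has its own global namespace: a name imported in main.py is not visible to functions defined in another module. A minimal sketch of that behaviour, with the standard-library `json` standing in for pandas so it runs without third-party packages. The helper `make_module` and both module names are hypothetical, and building modules from source strings is only a stand-in for separate .py files:

```python
import json  # imported in the "main" namespace only
import types


def make_module(name, source):
    """Build a module object from source text, standing in for a .py file."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    return module


# Uses json without importing it, like my_self_defined.py uses pd.
without_import = make_module(
    "without_import",
    "def parse(text):\n    return json.loads(text)\n",
)
# Imports json itself, which is the fix suggested above.
with_import = make_module(
    "with_import",
    "import json\n\ndef parse(text):\n    return json.loads(text)\n",
)

try:
    without_import.parse("[1, 2]")
    failed_with = None
except NameError as exc:
    failed_with = type(exc).__name__

print(failed_with)                  # NameError
print(with_import.parse("[1, 2]"))  # [1, 2]
```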

Reading rds file into python

I am trying to read an rds file in Python using the following two pieces of code that I found on Stack Overflow:
import pyreadr
from collections import OrderedDict
result = pyreadr.read_r('Datasets.rds')
df = result["History"]
which gives me an OrderedDict of size 0, and:
import rpy2.robjects as robjects
import tzlocal
from rpy2.robjects import pandas2ri
pandas2ri.activate()
readRDS = robjects.r['readRDS']
df = readRDS('Datasets.rds')
df = pandas2ri.ri2py(df)
which runs with no error but does not show anything.
Could you please let me know what might be wrong with this code?

Pyinstaller creating exe file without pandas but still requires pandas import

I have Python code that runs fine when I run it from cmd using python filename.py.
But when I create an exe file using PyInstaller, the exe opens and prints an exception message I added (see the bottom of the code) saying there is no module named pandas.
Then, if I edit the code to import pandas and recreate the exe file, it works.
Does anyone have an idea?
I'm not using pandas in the code, and even PyCharm marks the import pandas line as redundant.
I have Windows 10 and anaconda installed.
Thank you
import csv,re
import os.path, time, datetime
import subprocess
import sys
try:
    nameOfTitle= "name"
    SName="S"
    os.chdir(r'someaddress')
    summaries_csv_path="summaries.csv"
    HtmlPath='html/-----.htm'
    HtmlPathNoDir='-----.htm'
    HtmlPathNoDirC='----.htm'
    HtmlPathNoDir='-----.htm'
    sub="name"
    headerH3="------"
    dataWlCsv="raw.zip"
    opRow = 0
    sumRow = 0
    col_num=0
    innerCount=0
    x = 0
    headerList = list()
    htmlfile = open(HtmlPath,"w")
    execfile(r'some address')#header
    readOp = csv.reader(open(r'some address.csv'),delimiter=',')
    for row in readOp: # Read a single row from the CSV file
        execfile(r'\some address.py')#logic
    execfile(r'some address.py')#footer
except:
    e = sys.exc_info()[1]
    print("<p>Error: %s</p>" % e)
    print "IN "+" MODUL!"
