Multiple output files in Hadoop streaming

I followed this post:
Multiple Output Files for Hadoop Streaming with Python Mapper
to generate multiple output files in Hadoop streaming, and that part works.
I wanted my output structure to look like this:
date--:
---code=1
---code=2
---code=3
date--:
:
:
But inside code=1 and the other directories, everything is written to a single file, and since my data is very large, the job takes a very long time to complete.
Is there any workaround for that?
package com.custom;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
public class CustomMultiOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    // The key is expected to look like "<date>/<code>".
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String leaf) {
        String[] arr = key.toString().split("/");
        String date = "date=" + arr[0];
        String code = "code=" + arr[1];
        // leaf is the task's default output file name, so records end up in date=.../code=.../<leaf>
        return new Path(date + "/" + code, leaf).toString();
    }
    // Returning null drops the key, so only the value is written.
    @Override
    protected Text generateActualKey(Text key, Text value) {
        return null;
    }
}
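For context, the output format above only works if the streaming mapper emits keys of the form date/code. The post does not show the mapper, so the following is a minimal sketch under stated assumptions: the three-column, tab-separated input layout (date, code, payload) and the function names are made up for illustration.

```python
import sys


def to_keyed_line(line):
    """Turn 'date<TAB>code<TAB>payload' into 'date/code<TAB>payload'.

    The three-column input layout is an assumption for illustration only.
    """
    date, code, payload = line.rstrip("\n").split("\t", 2)
    return "{}/{}\t{}".format(date, code, payload)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop streaming feeds input records on stdin and reads
    # key<TAB>value lines from stdout.
    for line in stdin:
        if line.strip():
            stdout.write(to_keyed_line(line) + "\n")
```

A real mapper script would end with a call to main(). The key produced here is what generateFileNameForKeyValue splits on "/".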

Related

how can i show PDF file as Bytes using QWebEngineView in PyQt5? [duplicate]

The .setHtml and .setContent methods of PyQt5's QWebEngineView have a 2 MB size limitation. When googling for ways around this, I found two methods:
Use SimpleHTTPServer to serve the file. However, this gets blocked by the firewall my company uses.
Use file URLs and point to local files. However, this is a rather bad solution, as the HTML contains confidential data and I can't leave it on the hard drive under any circumstances.
The best solution I currently see is to use file URLs and delete the file on program exit or when loadCompleted reports it is done, whichever comes first.
This is not a great solution, however, so I wanted to ask whether there is a better one I'm overlooking.

Why don't you load/link most of the content through a custom URL scheme handler?
webEngineView->page()->profile()->installUrlSchemeHandler("app", new UrlSchemeHandler(e));
class UrlSchemeHandler : public QWebEngineUrlSchemeHandler
{
    Q_OBJECT
public:
    void requestStarted(QWebEngineUrlRequestJob *request) {
        QUrl url = request->requestUrl();
        QString filePath = url.path().mid(1);
        // get the data for this url
        QByteArray data = ..
        //
        if (!data.isEmpty())
        {
            QMimeDatabase db;
            QString contentType = db.mimeTypeForFileNameAndData(filePath, data).name();
            QBuffer *buffer = new QBuffer();
            buffer->open(QIODevice::WriteOnly);
            buffer->write(data);
            buffer->close();
            // free the buffer once the request job is destroyed
            connect(request, SIGNAL(destroyed()), buffer, SLOT(deleteLater()));
            request->reply(contentType.toUtf8(), buffer);
        } else {
            request->fail(QWebEngineUrlRequestJob::UrlNotFound);
        }
    }
};
You can then load a page with webEngineView->load(QUrl("app://start.html"));
All relative paths inside the page will also be forwarded to your UrlSchemeHandler.
And remember to add the respective includes:
#include <QWebEngineUrlRequestJob>
#include <QWebEngineUrlSchemeHandler>
#include <QBuffer>
#include <QMimeDatabase>

One way to get around this is to use requests and QWebEnginePage's runJavaScript method:
import requests
from PyQt5.QtWebEngineWidgets import QWebEngineView
# a QApplication must already exist at this point
web_engine = QWebEngineView()
web_page = web_engine.page()
web_page.setHtml('')
url = 'https://youtube.com'
page_content = requests.get(url).text
# document.write writes a string of text to a document stream
# https://developer.mozilla.org/en-US/docs/Web/API/Document/write
# The backtick symbol (``) delimits a multiline string in JavaScript
web_page.runJavaScript('document.write(`{}`);'.format(page_content))
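One caveat with the snippet above: a backtick, a backslash, or `${` in the downloaded page ends or alters the JavaScript template literal. A possible way around this, sketched here with a made-up helper name (build_write_script), is to let json.dumps produce the string literal, since its double-quoted output is also a valid JavaScript string.

```python
import json


def build_write_script(html):
    """Return a JavaScript statement that writes `html` into the document.

    json.dumps escapes quotes, backslashes and newlines, and backticks
    have no special meaning inside a double-quoted string.
    """
    return "document.write({});".format(json.dumps(html))
```

The result would be passed to runJavaScript in place of the formatted template literal, for example web_page.runJavaScript(build_write_script(page_content)).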

I tried to read some values in a txt file with java and python. Finally I could not manage it with both. I cannot figure out where is the problem

I am trying to extract two kinds of values from a txt file and write them to two separate txt files. I believe my functions work properly and I cannot find any mistake in my code. I realised that neither language reads the text file as it is. For example, the txt file has 10367 lines, but when I count the lines in the code, Python reports 20735 lines. I cannot understand why this happens. I do not have in-depth knowledge of how programming languages read files. Please give me some information about the possible causes of this situation.
Thanks in advance.
This is the Python code:
def main():
    serverSpeedsList=list()
    totalSpeedsList=list()
    ssString=str()
    stString=str()
    i=0  # line counter; it was incremented without being initialised
    with open("C:\\Users\\yusuf\\OneDrive\\Masaüstü\\SpeedTests\\Logs\\log100.txt",'r') as inFile:
        for line in inFile:
            i+=1
            ss=speedOfServer(line)
            st=speedOfTotal(line)
            if ss!="":
                ssString+=ss+"\n"
                serverSpeedsList.append(ss)
            if st!="":
                stString+=st+"\n"
                totalSpeedsList.append(st)
    with open("C:\\Users\\yusuf\\OneDrive\\Masaüstü\\SpeedTests\\Results\\server100.txt",'w') as outFile:
        outFile.write(ssString)
    with open("C:\\Users\\yusuf\\OneDrive\\Masaüstü\\SpeedTests\\Results\\total100.txt",'w') as outFile:
        # write the total speeds here; the original wrote ssString a second time
        outFile.write(stString)
def speedOfServer(text):
    # return the text between 'time":"' and ' ms', or "" if either is missing
    startStr="time\":\""
    endStr=" ms"
    result=str()
    startIx=text.find(startStr)
    endIx=text.find(endStr)
    if startIx>=0 and endIx>=0:
        result=text[startIx+len(startStr) : endIx]
    return result
def speedOfTotal(text):
    # return the text between 'showProfile.php (' and 'ms', or "" if either is missing
    startStr="showProfile.php ("
    endStr="ms"
    result=str()
    startIx=text.find(startStr)
    endIx=text.find(endStr)
    if startIx>=0 and endIx>=0:
        result=text[startIx+len(startStr) : endIx]
    return result
main()
And this is the Java code that does the same:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
public class Main {
    private static String serverSpeedFileName="C:\\Users\\yusuf\\OneDrive\\Masaüstü\\JavaSpeedAnalyser\\Results\\Server\\serverSpeed100.txt";
    private static String responseSpeedFileName="C:\\Users\\yusuf\\OneDrive\\Masaüstü\\JavaSpeedAnalyser\\Results\\Total\\responseSpeed100.txt";
    private static String logFilePath="C:\\Users\\yusuf\\OneDrive\\Masaüstü\\JavaSpeedAnalyser\\Logs\\log100.txt";
    public static void main(String[] args){
        StringBuilder serverSpeeds=new StringBuilder();
        StringBuilder responseSpeeds=new StringBuilder();
        try{
            File file = new File(logFilePath);
            BufferedReader br = new BufferedReader(new FileReader(file));
            String line;
            while ((line = br.readLine()) != null){
                String serverSpeed=speedOfServer(line);
                if(!serverSpeed.isEmpty()){
                    System.out.println(serverSpeed);
                    serverSpeeds.append(serverSpeed+"\n");
                }
                String responseSpeed=speedOfResponses(line);
                if(!responseSpeed.isEmpty()){
                    responseSpeeds.append(responseSpeed+"\n");
                }
            }
            br.close();
        }catch(IOException e){
            e.printStackTrace();
        }
        // writeToFile is not shown in the original post
        writeToFile(serverSpeedFileName, serverSpeeds.toString());
        writeToFile(responseSpeedFileName, responseSpeeds.toString());
    }
    private static String speedOfServer(String text){
        String start="time\":\"";
        String end=" ms";
        String result="";
        int startIndex=text.indexOf(start);
        int endIndex=text.indexOf(end);
        if(startIndex>=0 && endIndex>=0 ){
            result=text.substring(startIndex+start.length(),endIndex);
        }
        return result;
    }
    private static String speedOfResponses(String text){
        String start="%5bF%5dshowProfile.php (";
        String end="ms)";
        String result="";
        int startIndex=text.indexOf(start);
        int endIndex=text.indexOf(end);
        if(startIndex>=0 && endIndex>=0){
            result=text.substring(startIndex+start.length(),endIndex);
        }
        return result;
    }
}
I am trying to analyse a logcat file from an Android phone, which is why I am doing this, but I cannot get it to work. Please help me.
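One possible cause, offered as a guess from the numbers rather than something the post confirms: 20735 is exactly 2 × 10367 + 1, which is what one would expect if the log is UTF-16 with Windows line endings but is read with a single-byte encoding. Each \r and \n is then separated by a NUL byte, so both count as line breaks. The sketch below checks for a byte-order mark and counts lines under a given encoding; the function names are made up for illustration.

```python
import codecs


def sniff_encoding(path, default="utf-8"):
    """Guess a text encoding from the file's byte-order mark, if it has one."""
    with open(path, "rb") as f:
        head = f.read(4)
    # UTF-32 must be checked first: its little-endian BOM starts with the UTF-16 one.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return default


def count_lines(path, encoding):
    """Count lines the way a plain 'for line in file' loop would."""
    with open(path, "r", encoding=encoding) as f:
        return sum(1 for _ in f)
```

If sniff_encoding reports utf-16 for the log file, passing encoding="utf-16" to open in the Python script should give the expected line count. A file without a byte-order mark would need a different check.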

C#: Read fast from a file that is being used by another process

I have a Python script that reads from a log file and outputs certain data from it. It reads the file like this:
import os
try:
    with open(os.path.expandvars('Path/To/My/Log.txt'), 'r') as f:
        logContent = [line.rstrip() for line in f]
except Exception as e:
    print(e)
Now I want to recreate that Python script in C#. The main problem is that the log file grows by about 30,000 lines in 30 minutes. While the program that writes the log isn't running, I can easily open the file and read from it, because it's not in use. But when that program is running, I need to read the file through a FileStream, and reading 30,000 lines then takes ages:
private string GetLog(string path)
{
    string log = "";
    FileStream reader = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
    StreamReader logFileReader = new StreamReader(reader);
    while (!logFileReader.EndOfStream)
    {
        log += logFileReader.ReadLine();
        // Your code here
    }
    // Clean up
    logFileReader.Close();
    reader.Close();
    return log;
}
Is there a way to make my code read from the file in max 5 seconds?

I got it. When I use stream.ReadToEnd() it reads everything in about 2 seconds

As you mentioned, the file is big, so it is better to use StringBuilder than string concatenation. You can also use using blocks, so there is no need to call Close() explicitly.
StringBuilder sb = new StringBuilder();
string path = "some path";
using (FileStream logFileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
    using (StreamReader logFileReader = new StreamReader(logFileStream))
    {
        while (!logFileReader.EndOfStream)
        {
            sb.Append(logFileReader.ReadLine());
        }
    }
}
string log = sb.ToString();

Open apache thrift binary files in python

I have 5 GB of data serialized with Apache Thrift and a .thrift file describing the format of the data. I have tried using thriftpy and the official thrift package, but I can't wrap my head around how to open the files.
The data is the expanded dataset from http://www.iesl.cs.umass.edu/data/wiki-links
A description of the data format can be found here https://code.google.com/p/wiki-link/wiki/ExpandedDataset

The Scala setup can be found in the ThriftSerializerFactory.scala file. Since the naming of most things is consistent throughout the Thrift libraries, you can more or less model your Python code after the Scala example:
package edu.umass.cs.iesl.wikilink.expanded.process
import org.apache.thrift.protocol.TBinaryProtocol
import org.apache.thrift.transport.TIOStreamTransport
import java.io.File
import java.io.BufferedOutputStream
import java.io.FileOutputStream
import java.io.BufferedInputStream
import java.io.FileInputStream
import java.util.zip.{GZIPOutputStream, GZIPInputStream}
object ThriftSerializerFactory {
  def getWriter(f: File) = {
    val stream = new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(f)), 2048)
    val protocol= new TBinaryProtocol(new TIOStreamTransport(stream))
    (stream, protocol)
  }
  def getReader(f: File) = {
    val stream = new BufferedInputStream(new GZIPInputStream(new FileInputStream(f)), 2048)
    val protocol = new TBinaryProtocol(new TIOStreamTransport(stream))
    (stream, protocol)
  }
}
You basically set up a stream transport and the binary protocol. If you leave the data compressed, you will have to add the gzip piece to the puzzle, but once the data are decompressed this should not be needed anymore.
The code in WikiLinkItemIterator.scala shows how to read the data files using the factory class above.
class PerFileWebpageIterator(f: File) extends Iterator[WikiLinkItem] {
  var done = false
  val (stream, proto) = ThriftSerializerFactory.getReader(f)
  private var _next: Option[WikiLinkItem] = getNext()
  private def getNext(): Option[WikiLinkItem] = try {
    Some(WikiLinkItem.decode(proto))
  } catch {case _: TTransportException => {done = true; stream.close(); None}}
  def hasNext(): Boolean = !done && (_next != None || {_next = getNext(); _next != None})
  def next(): WikiLinkItem = if (hasNext()) _next match {
    case Some(wli) => {_next = None; wli}
    case None => {throw new Exception("Next on empty iterator.")}
  } else throw new Exception("Next on empty iterator.")
}
Steps to implement:
Implement a Thrift protocol stack factory like the one above (a recommended pattern, by the way).
Instantiate the root element of each record, in our case a WikiLinkItem.
Call instance.read(proto) to read one record of data.
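The steps above might look roughly like this in Python. This is a sketch under assumptions, not a tested reader for this dataset: it assumes the official thrift package, a gzip-compressed input file, and a record class generated by the Thrift compiler from the dataset's .thrift file. It also assumes that end of file surfaces as EOFError from the transport. The function names are made up for illustration.

```python
import gzip


def open_protocol(path):
    """Return (stream, protocol) for a gzip-compressed Thrift binary file.

    The thrift imports are done lazily, so the iterator below can be
    used and tested without the package installed.
    """
    from thrift.protocol import TBinaryProtocol
    from thrift.transport import TTransport

    stream = gzip.open(path, "rb")
    transport = TTransport.TFileObjectTransport(stream)
    return stream, TBinaryProtocol.TBinaryProtocol(transport)


def iter_records(protocol, record_factory):
    """Yield one record per read() call until the stream is exhausted.

    record_factory is the generated class of the root struct
    (WikiLinkItem for this dataset).
    """
    while True:
        record = record_factory()
        try:
            record.read(protocol)
        except EOFError:
            # Assumption: the transport raises EOFError at end of file.
            return
        yield record
```

With the generated class imported, usage would be stream, proto = open_protocol(path), followed by a loop over iter_records(proto, WikiLinkItem) and a final stream.close(). For data that has already been decompressed, a plain open(path, "rb") would replace gzip.open.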

how can i pass xml format data from flex to python

I want to pass XML data from Flex to Python. I know how to send it from Flex, but my question is how to receive the passed data in Python and then insert it into MySQL. I also want to retrieve the MySQL data in Python (CGI), convert all of it into XML, and pass it back to Flex.
Thanks in advance.

See http://www.artima.com/weblogs/viewpost.jsp?thread=208528 for more details; here is a brief overview of what I think you are looking for.
The SimpleXMLRPCServer library allows you to easily create a server. Here's about the simplest server you can create, which provides two services to manipulate strings:
from random import shuffle
# Python 3 location; in Python 2 this was "from SimpleXMLRPCServer import SimpleXMLRPCServer"
from xmlrpc.server import SimpleXMLRPCServer
class MyFuncs:
    def reverse(self, text):
        x = list(text)
        x.reverse()
        return ''.join(x)
    def scramble(self, text):
        x = list(text)
        shuffle(x)
        return ''.join(x)
server = SimpleXMLRPCServer(("localhost", 8000))
server.register_instance(MyFuncs())
server.serve_forever()
Once you make a connection to the server, that server acts like a local object. You call the server's methods just like they're ordinary methods of that object.
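To illustrate the "acts like a local object" point from the Python side, here is a minimal sketch of a client using the standard library's xmlrpc.client.ServerProxy. To stay self-contained it starts a cut-down copy of the server in a background thread on a free port. The start_server and demo helpers are made up for illustration and are not part of the article.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class MyFuncs:
    def reverse(self, text):
        return text[::-1]


def start_server(host="localhost", port=0):
    """Serve MyFuncs in a background thread; port 0 picks a free port."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_instance(MyFuncs())
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def demo(text="hello"):
    server = start_server()
    host, port = server.server_address
    proxy = ServerProxy("http://{}:{}".format(host, port))
    try:
        # The remote method is called as if it were a local one.
        return proxy.reverse(text)
    finally:
        server.shutdown()
        server.server_close()
```

Against the server shown earlier, the proxy would instead point at http://localhost:8000.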
This is about as clean an RPC implementation as you can hope for (and other Python RPC libraries exist; for example, CORBA clients). But it's all text based; not very satisfying when trying to create polished applications with nice GUIs. What we'd like is the best of all worlds -- Python (or your favorite language) doing the heavy lifting under the covers, and Flex creating the user experience.
To use the as3-rpclib library, download it and unpack it somewhere. The package includes all the source code and the compiled as3-rpclib.swc library. The .swc extension indicates an archive file, and pieces of this library can be pulled out and incorporated into your final .swf. To include the library in your project, you must tell Flexbuilder where the library is located (you can get a free trial, or just use the free command-line tools and add on the Apollo portion): go to Project|Properties, select "Apollo Build Path," choose the "Library path" tab, and press the "Add SWC..." button. Next, add the namespace ak33m to your project as seen in the code below, and you're ready to create an XMLRPCObject.
Note: the only reason I used Apollo here was that I was thinking in terms of desktop applications with nice UIs. You can just as easily make it a Flex app.
Here's the entire Apollo application as a single MXML file, which I'll explain in detail:
<?xml version="1.0" encoding="utf-8"?>
<mx:ApolloApplication xmlns:mx="http://www.adobe.com/2006/mxml"
    xmlns:ak33m="http://ak33m.com/mxml" layout="absolute">
    <mx:Form>
        <mx:FormHeading label="String Modifier"/>
        <mx:FormItem label="Input String">
            <mx:TextInput id="instring" change="manipulate()"/>
        </mx:FormItem>
        <mx:FormItem label="Reversed">
            <mx:Text id="reversed"/>
        </mx:FormItem>
        <mx:FormItem label="Scrambled">
            <mx:Text id="scrambled"/>
        </mx:FormItem>
    </mx:Form>
    <ak33m:XMLRPCObject id="server" endpoint="http://localhost:8000"/>
    <mx:Script>
        <![CDATA[
            import mx.rpc.events.ResultEvent;
            import mx.rpc.events.FaultEvent;
            import mx.rpc.AsyncToken;
            import mx.controls.Alert;
            import mx.collections.ItemResponder;
            private function manipulate() : void {
                server.reverse(instring.text).addResponder(new ItemResponder(reverseResult, onFault));
                server.scramble(instring.text).addResponder(new ItemResponder(scrambleResult, onFault));
            }
            private function reverseResult(event : ResultEvent, token : AsyncToken = null) : void {
                reversed.text = event.result.toString();
            }
            private function scrambleResult(event : ResultEvent, token : AsyncToken = null) : void {
                scrambled.text = event.result.toString();
            }
            private function onFault (event : FaultEvent, token : AsyncToken = null) : void {
                Alert.show(event.fault.faultString, event.fault.faultCode);
            }
        ]]>
    </mx:Script>
</mx:ApolloApplication>
