I'm trying to use the Google Cloud Translation API to translate an Excel (or CSV) document that contains text in multiple languages; my target language is English.
I would like to use the "Translate text in batches (Advanced edition only)" code sample (link here: https://cloud.google.com/translate/docs/samples/translate-v3-batch-translate-text), but the sample contains a line that defines the source language, so there can be only one source language.
But I need to detect the language in the document first and then translate the text to English. There is a code sample for detecting the language of a simple text string, "Detecting languages (Advanced)" (link: https://cloud.google.com/translate/docs/advanced/detecting-language-v3), but I need to combine the first code sample, which translates documents but has only one source language defined, with the ability to detect the language instead of defining a single source language.
Is there a code sample like this in the resources? How could this be solved?
Here is the sample code in question:
from google.cloud import translate


def batch_translate_text(
    input_uri="gs://YOUR_BUCKET_ID/path/to/your/file.txt",
    output_uri="gs://YOUR_BUCKET_ID/path/to/save/results/",
    project_id="YOUR_PROJECT_ID",
    timeout=180,
):
    """Translates a batch of texts on GCS and stores the result in a GCS location."""
    client = translate.TranslationServiceClient()

    location = "us-central1"
    # Supported file types: https://cloud.google.com/translate/docs/supported-formats
    gcs_source = {"input_uri": input_uri}

    input_configs_element = {
        "gcs_source": gcs_source,
        "mime_type": "text/plain",  # Can be "text/plain" or "text/html".
    }
    gcs_destination = {"output_uri_prefix": output_uri}
    output_config = {"gcs_destination": gcs_destination}
    parent = f"projects/{project_id}/locations/{location}"

    # Supported language codes: https://cloud.google.com/translate/docs/language
    operation = client.batch_translate_text(
        request={
            "parent": parent,
            "source_language_code": "en",
            "target_language_codes": ["ja"],  # Up to 10 language codes here.
            "input_configs": [input_configs_element],
            "output_config": output_config,
        }
    )

    print("Waiting for operation to complete...")
    response = operation.result(timeout)

    print("Total Characters: {}".format(response.total_characters))
    print("Translated Characters: {}".format(response.translated_characters))
Unfortunately it is not possible to pass an array of values to the source_language_code field using batchTranslateText. What I can suggest is to perform detectLanguage and translateText per file.
What the code below does:
1. Extract the content to be translated. For testing purposes, the CSV files used here have only one column; the content of sample1.csv is in tl (Tagalog) and the content of sample2.csv is in es (Spanish).
2. Pass the extracted content to detect_language() to get the detected language code.
3. Pass all the required parameters to translate_text() to perform the translation.
NOTE: The code below was only tested with CSV files that have a single column. Edit the loop in main() to select the column(s) you want to extract data from; see the sketch after the code.
from google.cloud import translate
import csv


def list_to_string(s):
    """Transform a list into a single space-separated string."""
    return " ".join(s)


def detect_language(project_id, content):
    """Detecting the language of a text string."""
    client = translate.TranslationServiceClient()
    location = "global"
    parent = f"projects/{project_id}/locations/{location}"

    response = client.detect_language(
        content=content,
        parent=parent,
        mime_type="text/plain",  # mime types: text/plain, text/html
    )

    # The most probable language is listed first; return its code.
    for language in response.languages:
        return language.language_code


def translate_text(text, project_id, source_lang):
    """Translating Text."""
    client = translate.TranslationServiceClient()
    location = "global"
    parent = f"projects/{project_id}/locations/{location}"

    # Detail on supported types can be found here:
    # https://cloud.google.com/translate/docs/supported-formats
    response = client.translate_text(
        request={
            "parent": parent,
            "contents": [text],
            "mime_type": "text/plain",  # mime types: text/plain, text/html
            "source_language_code": source_lang,
            "target_language_code": "en-US",
        }
    )

    # Display the translation for each input text provided
    for translation in response.translations:
        print("Translated text: {}".format(translation.translated_text))


def main():
    project_id = "your-project-id"
    csv_files = ["sample1.csv", "sample2.csv"]
    # Perform your content extraction here if you have a different file format
    for csv_file in csv_files:
        with open(csv_file) as f:
            read_csv = csv.reader(f)
            content_csv = []
            for row in read_csv:
                content_csv.extend(row)
        content = list_to_string(content_csv)  # convert list to string
        detect = detect_language(project_id=project_id, content=content)
        translate_text(text=content, project_id=project_id, source_lang=detect)


if __name__ == "__main__":
    main()
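If your files have more than one column, adjust the row loop in main() to keep just the column that holds the text. A small sketch, assuming the text sits in the first column:
for row in read_csv:
    content_csv.append(row[0])  # keep only the first column instead of every cell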
sample1.csv:
kamusta
ayos
sample2.csv:
cómo estás
okey
Output using the code above:
Translated text: how are you okay
Translated text: how are you ok
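As a side note: if you do not need the detected code for anything else, v3 translateText can auto-detect the source language when source_language_code is omitted from the request, so a variant of translate_text() could skip detect_language() entirely. A minimal sketch of that request:
response = client.translate_text(
    request={
        "parent": parent,
        "contents": [text],
        "mime_type": "text/plain",
        # source_language_code omitted: the API detects it per request
        "target_language_code": "en-US",
    }
)
The code the API detected is then reported on each result as translation.detected_language_code.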
I'm trying to parse XML and store it in JSON format. I want to do it dynamically, so that it also works when the API keys are different. How can I change this to retrieve data from the XML dynamically and pass it along in JSON format? I need the details inside PNRAmount in the XML file.
Code:
pnr = myroot.xpath(
    "/Envelope/Body/q:SellResponse/r:BookingUpdateResponseData/r:Success/r:PNRAmount",
    namespaces=namespace,
)
balanceDue = pnr[0][0].text.strip()
AuthorizedBalanceDue = pnr[0][1].text.strip()
SegmentCount = pnr[0][2].text.strip()
PassiveSegmentCount = pnr[0][3].text.strip()
TotalCost = pnr[0][4].text.strip()
PointsBalanceDue = pnr[0][5].text.strip()
TotalPointCost = pnr[0][6].text.strip()
AlternateCurrencyBalanceDue = pnr[0][7].text.strip()
# for pnrDetails in pnr:
PNR = {
    "BalanceDue": balanceDue,
    "AuthorizedBalanceDue": AuthorizedBalanceDue,
    "SegmentCount": SegmentCount,
    "PassiveSegmentCount": PassiveSegmentCount,
    "TotalCost": TotalCost,
    "PointsBalanceDue": PointsBalanceDue,
    "TotalPointCost": TotalPointCost,
    "AlternateCurrencyBalanceDue": AlternateCurrencyBalanceDue
}
print(PNR)
XML Code:
<Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
  <Body>
    <SellResponse xmlns="http://schemas.navitaire.com/WebServices/ServiceContracts/BookingService">
      <BookingUpdateResponseData xmlns="http://schemas.navitaire.com/WebServices/DataContracts/Booking" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
        <Success>
          <RecordLocator />
          <PNRAmount type="true">
            <BalanceDue>4880.00000</BalanceDue>
            <AuthorizedBalanceDue>4880.00000</AuthorizedBalanceDue>
            <SegmentCount>1</SegmentCount>
            <PassiveSegmentCount>0</PassiveSegmentCount>
            <TotalCost>4880.00000</TotalCost>
            <PointsBalanceDue>0</PointsBalanceDue>
            <TotalPointCost>0</TotalPointCost>
            <AlternateCurrencyBalanceDue>0</AlternateCurrencyBalanceDue>
          </PNRAmount>
        </Success>
        <Warning i:nil="true" />
        <Error i:nil="true" />
        <OtherServiceInformations i:nil="true" xmlns:a="http://schemas.navitaire.com/WebServices/DataContracts/Common" />
      </BookingUpdateResponseData>
    </SellResponse>
  </Body>
</Envelope>
I've tried the static way, writing the path for each tag and getting the value with the .text method. I want to get the data dynamically, so that it still works when a key is different.
I'm trying to do something like this:
vehicle_response = detail_root.xpath('/x:Envelope/x:Body/y:VehicleLocationDetailRsp',
                                     namespaces=optionstravel_vehicle_ns())[0]
counter_location = vehicle_response.xpath('y:LocationInfo',
                                          namespaces=optionstravel_vehicle_ns())[0].get('CounterLocation')
and then passing counter_location as a value in the JSON.
You can use the SAX interface, which is built into Python. SAX offers an event-based XML parsing mechanism, which is helpful in your case since you want to parse only a particular section of the XML document. You have to implement a custom ContentHandler class to make it work.
Based on that tutorial, below is a handler class and the respective code to parse your XML file into a JSON file:
from xml.sax.handler import ContentHandler
from xml.sax import parse
import json
from collections import OrderedDict


class MyXmlPayloadHandler(ContentHandler):
    def __init__(self, section=""):
        super().__init__()
        self.desired_section_name = section  # target section of the XML doc
        self.desired_section_ongoing = False
        self.current_element = ""  # name of the currently open element
        self.output_dict = OrderedDict()  # the dictionary that will be dumped to JSON

    def startElement(self, name, attrs):
        """Called by the parser when a new XML element begins."""
        if self.desired_section_ongoing:
            self.current_element = name
            self.output_dict[name] = ""
        if self.desired_section_name in name:
            self.desired_section_ongoing = True

    def endElement(self, name):
        """Called by the parser when an XML element ends."""
        # todo: you might want to implement conversion to int here,
        # i.e. at the end of each element
        if self.desired_section_ongoing:
            if self.desired_section_name in name:
                self.desired_section_ongoing = False

    def characters(self, content):
        """Called by the parser to add characters to the element's value."""
        if content.strip() != "":
            if self.desired_section_ongoing:
                # add the (whitespace-trimmed) characters to the current element's value
                self.output_dict[self.current_element] += content.strip()


# do the job
handler = MyXmlPayloadHandler(section="PNRAmount")
parse(r"examples\sample.xml", handler)

# dump the dict into a JSON file
with open("datadict.json", "w") as f:
    json.dump(handler.output_dict, f, indent=4)
Resulting datadict.json file:
{
    "BalanceDue": "4880.00000",
    "AuthorizedBalanceDue": "4880.00000",
    "SegmentCount": "1",
    "PassiveSegmentCount": "0",
    "TotalCost": "4880.00000",
    "PointsBalanceDue": "0",
    "TotalPointCost": "0",
    "AlternateCurrencyBalanceDue": "0"
}
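If you want the numeric fields as numbers rather than strings (the todo in endElement), one possible variant is to convert each value when its element closes; treat this as an untested sketch:
def endElement(self, name):
    """Called when an XML element ends; converts numeric values in place."""
    if self.desired_section_ongoing:
        if self.desired_section_name in name:
            self.desired_section_ongoing = False
        elif name in self.output_dict:
            value = self.output_dict[name]
            try:
                # try int first, then float, otherwise keep the string
                self.output_dict[name] = int(value)
            except ValueError:
                try:
                    self.output_dict[name] = float(value)
                except ValueError:
                    pass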
You can use pandas read_xml() to parse the XML and to_json() to write the result out:
import pandas as pd

xml = 'Gorantula.xml'
namespaces = {
    "xmlns:s": "http://schemas.xmlsoap.org/soap/envelope/",
    "xmlns": "http://schemas.navitaire.com/WebServices/DataContracts/Booking",
    "xmlns:i": "http://www.w3.org/2001/XMLSchema-instance",
}

df = pd.read_xml(xml, xpath="//xmlns:PNRAmount", namespaces=namespaces)
df1 = df.T
df1.to_json(r'Gorantula.json')
print(df1)
Output, written to Gorantula.json:
{"0":{"type":true,
"BalanceDue":4880.0,
"AuthorizedBalanceDue":4880.0,
"SegmentCount":1,
"PassiveSegmentCount":0,
"TotalCost":4880.0,
"PointsBalanceDue":0,
"TotalPointCost":0,
"AlternateCurrencyBalanceDue":0}}
I am able to parse a large XML; since I am having memory issues I am using the SAX parser. I am using XMLGenerator to split the XML and then parsing it again.
My question is: is there a way to parse a large XML part by part? For example, once I have parsed the first 10000 records, I would load them into a CSV or a dataframe; that way I avoid redoing the same parse on the chunks.
import xml.sax
from collections import defaultdict
import pandas as pd
from sqlalchemy import create_engine


class EmployeeData(xml.sax.ContentHandler):
    def __init__(self):
        self.employee_dict = defaultdict(list)

    def startElement(self, tag, attr):
        self.tag = tag
        if tag == 'Employee':
            self.employee_dict['Employee_ID'].append(attr['id'])

    def characters(self, content):
        if content.strip():
            if self.tag == 'FName': self.FName = content
            elif self.tag == 'LName': self.LName = content
            elif self.tag == 'City': self.City = content

    def endElement(self, tag):
        if tag == 'FName': self.employee_dict['FName'].append(self.FName)
        elif tag == 'LName': self.employee_dict['LName'].append(self.LName)
        elif tag == 'City': self.employee_dict['City'].append(self.City)


handler = EmployeeData()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse('employee_xml.xml')

EmployeeDetails = parser.getContentHandler()
EmployeeData_out = EmployeeDetails.employee_dict

df = pd.DataFrame(EmployeeData_out, columns=EmployeeData_out.keys()).set_index('Employee_ID')

# For example I am writing to a CSV file; in reality I will be loading the data into a database table.
# I want to load the data incrementally, parsing a certain number of records at a time, e.g. 10000 records.
##con_eng = create_engine('oracle://[user]:[pass]@[host]:[port]/[schema]', echo=False)
##df.to_sql(name='target_table', con=con_eng, if_exists='append', index=False)

df.to_csv('employee_details.csv', sep=',', encoding='utf-8')
Sample XML
<?xml version="1.0" ?>
<Employees>
  <Employee id="1">
    <FName>SAM</FName>
    <LName>MARK</LName>
    <City>NewJersy</City>
  </Employee>
  <Employee id="2">
    <FName>RAJ</FName>
    <LName>KAMAL</LName>
    <City>NewYork</City>
  </Employee>
  <Employee id="3">
    <FName>Brain</FName>
    <LName>wood</LName>
    <City>Buffalo</City>
  </Employee>
  ...
  <Employee id="1000000">
    <FName>Mark</FName>
    <LName>wood</LName>
    <City>NewJersy</City>
  </Employee>
</Employees>
Given your XML is pretty shallow, consider the new v1.5 iterparse support for large XML in pandas.read_xml, available for either the lxml (default) or etree parser.
employees_df = (
    pd.read_xml(
        "Input.xml",
        iterparse={"Employee": ["id", "FName", "LName", "City"]},
        names=["Employee_ID", "FName", "LName", "City"],
        parser="etree",
    ).set_index("Employee_ID")
)
(Tested by parsing Wikipedia's 12+ GB daily article dump XML into a 3.5-million-row data frame in about 5 minutes on a laptop with 8 GB RAM. See the output in the docs.)
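If you specifically want the incremental loading described in the question (flush every 10000 records, then continue), here is a sketch using the standard library's xml.etree.ElementTree.iterparse; the batch size and the CSV target are placeholders for your database load:
import csv
import xml.etree.ElementTree as ET

BATCH_SIZE = 10000
batch = []

with open("employee_details.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Employee_ID", "FName", "LName", "City"])
    for event, elem in ET.iterparse("employee_xml.xml", events=("end",)):
        if elem.tag == "Employee":
            batch.append([
                elem.get("id"),
                elem.findtext("FName"),
                elem.findtext("LName"),
                elem.findtext("City"),
            ])
            elem.clear()  # release the element so memory stays flat
            if len(batch) >= BATCH_SIZE:
                writer.writerows(batch)  # or build a DataFrame here and df.to_sql(...)
                batch = []
    if batch:
        writer.writerows(batch)  # flush the remainder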
Try the following PowerShell:
using assembly System.Xml
using assembly System.Xml.Linq

$FILENAME = "c:\temp\test.xml"
$reader = [System.Xml.XmlReader]::Create($FILENAME)
while($reader.EOF -eq $False)
{
    if ($reader.Name -ne "Employee")
    {
        $reader.ReadToFollowing("Employee")
    }
    if ($reader.EOF -eq $False)
    {
        $employee = [System.Xml.Linq.XElement]::ReadFrom($reader)
        $id = $employee.Attribute("id").Value
        $fname = $employee.Element("FName").Value
        $lname = $employee.Element("LName").Value
        $city = $employee.Element("City").Value
        Write-Output "id = $id first = $fname last = $lname city = $city"
    }
}
I'm using Django and I want to send some data from my database into a Word document. I'm using python-docx for creating Word documents, with the ExportDocx class below. It can generate a static Word file, but I want to place dynamic data (e.g. product id=5, name="..."), basically all the details of the product, into the document.
import io

from docx import Document
from docx.shared import Inches
from django.http import StreamingHttpResponse
from rest_framework.views import APIView


class ExportDocx(APIView):
    def get(self, request, *args, **kwargs):
        queryset = Products.objects.all()  # Products model imported elsewhere
        # build the document object
        document = self.build_document()
        # save document info
        buffer = io.BytesIO()
        document.save(buffer)  # save your memory stream
        buffer.seek(0)  # rewind the stream
        # put it into a streaming content response with the docx content type
        response = StreamingHttpResponse(
            streaming_content=buffer,  # use the stream's content
            content_type='application/vnd.openxmlformats-officedocument.wordprocessingml.document'
        )
        response['Content-Disposition'] = 'attachment;filename=Test.docx'
        response["Content-Encoding"] = 'UTF-8'
        return response

    def build_document(self, *args, **kwargs):
        document = Document()
        sections = document.sections
        for section in sections:
            section.top_margin = Inches(0.95)
            section.bottom_margin = Inches(0.95)
            section.left_margin = Inches(0.79)
            section.right_margin = Inches(0.79)
        # add a header
        document.add_heading("This is a header")
        # add a paragraph
        document.add_paragraph("This is a normal style paragraph")
        # add a paragraph with an italic run, then a line break
        paragraph = document.add_paragraph()
        run = paragraph.add_run()
        run.italic = True
        run.add_text("text will have italic style")
        run.add_break()
        return document
This is the relevant entry in urls.py:
path('<int:pk>/testt/', ExportDocx.as_view(), name='generate-testt'),
How can I generate it, though? I think I need to turn the data into strings so it can work with python-docx.
The python-docx documentation: http://python-docx.readthedocs.io/
For a product record like record = {"product_id": 5, "name": "Foobar"}, you can add it to the document in your build_document() method like:
document.add_paragraph(
    "Product id: %d, Product name: %s"
    % (record["product_id"], record["name"])
)
There are other, more modern methods for interpolating strings, although this sprintf style works just fine for most cases.
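For example, the same line with an f-string (Python 3.6+), reusing the hypothetical record dict from above:
document.add_paragraph(
    f"Product id: {record['product_id']}, Product name: {record['name']}"
)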
So I found out that I needed to pass the model; I was doing it in another version of the code and forgot to add it. Basically, I just had to add these lines of code; hope this helps whoever is reading this.
def get(self, request, pk, *args, **kwargs):
    # fetch the product for this request and hand it to the builder
    product = Product.objects.get(id=pk)
    document = self.build_document(product)
And in build_document we just need to stringify the fields, simply by using f'{queryset.xxxx}':
def build_document(self, queryset):
    document = Document()
    document.add_heading(f'{queryset.first_name}')
    return document  # hand the finished document back to get()
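Any other field on the model can be added the same way before the return, for example (price here is a hypothetical field; substitute your own):
document.add_paragraph(f'Product id: {queryset.id}')
document.add_paragraph(f'Price: {queryset.price}')  # hypothetical field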
I am trying to make an intent using androidhelper in QPython OL with the code below:
action = "org.escpos.intent.action.PRINT"
packagename = "com.loopedlabs.escposprintservice" # target application
data = data # convert data to PDF byte array format
extras = {
'DATA_TYPE':'PDF',
'PDF_DATA' : data # raw PDF data
}
intent = droid.makeIntent( # make an intent
action = action,
packagename = packagename,
extras = extras
)
But I am getting the following error:
org.json.JSONException: Value [2,{"extras":{"DATA_TYPE":"PDF","PDF_DATA":"4"},"categories":null,"action":"org.escpos.intent.action.PRINT","flags":268435456},null] at 0 of type org.json.JSONArray cannot be converted to JSONObject
Additionally, I did not find any working example of droid.makeIntent(). Could somebody help, please?
I think something went wrong in extras.
What is your data? Are you sure it can be passed in this way?
Here's a working makeIntent example. This works on my phone:
#-*-coding:utf8;-*-
#qpy:2
#qpy:console
from androidhelper import Android
droid = Android()
uri2open = "https://google.com"
intent2start = droid.makeIntent("android.intent.action.VIEW", uri2open, "text/html", None, [u"android.intent.category.BROWSABLE"], None, None, None)
print(droid.startActivityForResultIntent(intent2start.result))
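Mapping the question's print intent onto the same positional signature would look roughly like this; this is an untested sketch, and whether com.loopedlabs.escposprintservice accepts PDF_DATA passed this way is an assumption:
intent2print = droid.makeIntent(
    "org.escpos.intent.action.PRINT",        # action
    None,                                    # data URI
    None,                                    # MIME type
    {"DATA_TYPE": "PDF", "PDF_DATA": data},  # extras
    None,                                    # categories
    "com.loopedlabs.escposprintservice",     # target package
    None,                                    # class name
    None,                                    # flags
)
droid.startActivityForResultIntent(intent2print.result)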
I have a function in AWS Lambda that connects to the Twitter API and returns the tweets which match a specific search query I provided via the event. A simplified version of the function is below. There are a few helper functions I use, like get_secret to manage API keys and process_tweet, which limits what data gets sent back and does things like convert the created-at date to a string. The net result is that I should get back a list of dictionaries.
import json

import tweepy


def lambda_handler(event, context):
    twitter_secret = get_secret("twitter")
    auth = tweepy.OAuthHandler(twitter_secret['api-key'],
                               twitter_secret['api-secret'])
    auth.set_access_token(twitter_secret['access-key'],
                          twitter_secret['access-secret'])
    api = tweepy.API(auth)
    cursor = tweepy.Cursor(api.search,
                           q=event['search'],
                           include_entities=True,
                           tweet_mode='extended',
                           lang='en')
    tweets = list(cursor.items())
    tweets = [process_tweet(t) for t in tweets if not t.retweeted]
    return json.dumps({"tweets": tweets})
From my desktop then, I have code which invokes the lambda function.
import json

import boto3

aws_lambda = boto3.client('lambda', region_name="us-east-1")
payload = {"search": "paint%20protection%20film filter:safe"}
lambda_response = aws_lambda.invoke(FunctionName="twitter-searcher",
                                    InvocationType="RequestResponse",
                                    Payload=json.dumps(payload))
results = lambda_response['Payload'].read()
tweets = results.decode('utf-8')
The problem is that somewhere between json.dumps-ing the output in Lambda and reading the payload in Python, the data has gotten screwy. For example, a line break which should be \n becomes \\\\n, all of the double quotes are stored as \\", and Unicode characters are all prefixed by \\. So everything that was escaped arrived with its escape characters themselves escaped. Consider this element of the list that was returned (with manual formatting):
'{\\"userid\\": 190764134,
\\"username\\": \\"CapitalGMC\\",
\\"created\\": \\"2018-09-02 15:00:00\\",
\\"tweetid\\": 1036267504673337344,
\\"text\\": \\"Protect your vehicle\'s paint! Find out how on this week\'s blog.
\\\\ud83d\\\\udc47\\\\n\\\\nhttps://url/XYMxPhVhdH https://url/mFL2Zv8nWW\\"}'
I can use regex to fix some problems (\\" and \\\\n), but the Unicode is tricky because even if I match it, how do I replace it with a properly escaped character? When I do this in R, using the aws.lambda package, everything is fine: no weird escaped escapes.
What am I doing wrong on my desktop with the response from AWS Lambda that is garbling the data?
Update
The process_tweet function is below. It literally just pulls out the bits I care to keep, formats the datetime object as a string, and returns a dictionary.
def process_tweet(tweet):
    bundle = {
        "userid": tweet.user.id,
        "username": tweet.user.screen_name,
        "created": str(tweet.created_at),
        "tweetid": tweet.id,
        "text": tweet.full_text
    }
    return bundle
Just for reference, in R the code looks like this.
payload = list(search = "paint%20protection%20film filter:safe")
results = aws.lambda::invoke_function("twitter-searcher",
                                      payload = jsonlite::toJSON(payload, auto_unbox = TRUE),
                                      type = "RequestResponse",
                                      key = creds$key,
                                      secret = creds$secret,
                                      session_token = creds$session_token,
                                      region = creds$region)
tweets = jsonlite::fromJSON(results)
str(tweets)
#> 'data.frame': 133 obs. of 5 variables:
#> $ userid : num 2231994854 407106716 33553091 7778772 782310 ...
#> $ username: chr "adaniel_080213" "Prestige_AdamL" "exclusivedetail" "tedhu" ...
#> $ created : chr "2018-09-12 14:07:09" "2018-09-12 11:31:56" "2018-09-12 10:46:55" "2018-09-12 07:27:49" ...
#> $ tweetid : num 1039878080968323072 1039839019989983232 1039827690151444480 1039777586975526912 1039699310382931968 ...
#> $ text : chr "I liked a #YouTube video https://url/97sRShN4pM Tesla Model 3 - Front End Package - Suntek Ultra Paint Protection Film" "Another #Corvette #ZO6 full body clearbra wrap completed using #xpeltech ultimate plus PPF ... Paint protection"| __truncated__ "We recently protected this Tesla Model 3 with Paint Protection Film and Ceramic Coating.#teslamodel3 #charlotte"| __truncated__ "Tesla Model 3 - Front End Package - Suntek Ultra Paint Protection Film https://url/AD1cl5dNX3" ...
tweets[131,]
#> userid username created tweetid
#> 131 190764134 CapitalGMC 2018-09-02 15:00:00 1036267504673337344
#> text
#> 131 Protect your vehicle's paint! Find out how on this week's blog.👇\n\nhttps://url/XYMxPhVhdH https://url/mFL2Zv8nWW
In your lambda function you should return a response object with a JSON object in the response body:
import json


# Lambda Function
def get_json(event, context):
    """Retrieve JSON from server."""
    # Business logic goes here.
    response = {
        "statusCode": 200,
        "headers": {},
        "body": json.dumps({
            "message": "This is the message in a JSON object."
        })
    }
    return response
Don't use json.dumps().
I had a similar issue, and when I just returned "body": content instead of "body": json.dumps(content), I could easily access and manipulate my data. Before that, I got that weird form that looks like JSON but isn't.
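Alternatively, if you keep json.dumps() in the handler, the payload your desktop receives is a JSON-encoded string that itself contains JSON, so decoding it twice undoes the double escaping. A short sketch, reusing lambda_response from the question:
import json

raw = lambda_response['Payload'].read().decode('utf-8')
# the first loads() undoes Lambda's serialization of the returned string;
# the second parses the JSON that the handler built with json.dumps()
tweets = json.loads(json.loads(raw))["tweets"]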