Romanize generic Japanese on the command line - python

I would like to transliterate generic Japanese, including Kanji, by the standard Hepburn system on the bash command line.
I've evaluated several options, but:
- Google Translate (available via Translate Shell) is only accurate for Hiragana / Katakana
- KAKASI delivers ASCII, but without Hepburn long-vowel marks (so Toukyou instead of Tōkyō)
So I would like to parse the output of http://nihongo.j-talk.com
The output is in div.outputwrap or div.output
If it's futile to do this purely with Bash tools (curl / jq?), how could I achieve it with Python / BeautifulSoup?
Sorry for not giving a snippet; I have no clue how to POST data to a website AND use the result when there is no API.

Taking a look at the HTML source of the http://nihongo.j-talk.com site, I've made a guess at its API.
Here are the steps:
1) Send a Japanese string to the server with wget and obtain the result in index.html.
2) Parse the index.html and extract Romaji strings.
Here is the sample code:
#!/bin/bash
string="日本語は、主に日本で使われている言語である。日本では法規によって「公用語」として規定されているわけではないが、各種法令(裁判所法第74条、会社計算規則第57条、特許法施行規則第2条など)において日本語を用いることが定められるなど事実>上の公用語となっており、学校教育の「国語」でも教えられる。"
uniqid="46a7e5f7e7c7d8a7d9636ecb077da485479b66bc"
wget -N --post-data "uniqid=$uniqid&Submit='Translate Now'&kanji_parts=standard&kanji=$string&converter=spaced&kana_output=romaji" http://nihongo.j-talk.com/ > /dev/null 2>&1
perl -e '
$file = "index.html";
open(FH, $file) or die "$file: $!\n";
while (<FH>) {
if (/<div id=.spaced. class=.romaji.>(.+)/) {
($str = $1) =~ s/<.*?>//g;
$str =~ s/\&\#(\d+);/&utfconv($1)/eg;
print $str, "\n";
}
}
# HTML numeric character reference (code point) to 2-byte UTF-8 - enough for macron vowels
sub utfconv {
$utf16 = shift;
my $upper = ($utf16 >> 6) & 0b0001_1111 | 0b1100_0000;
my $lower = $utf16 & 0b0011_1111 | 0b1000_0000;
pack("C2", $upper, $lower);
}'
Some comments:
- I wrote the parser in Perl just because it is familiar to me, but you can modify it or port it to another language by reading the index.html file; see the Python sketch below.
- The uniqid string is one I picked from the HTML source of the site. If it doesn't work, check what is currently embedded in the HTML source.
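For the Python / BeautifulSoup route from the question, here is a minimal sketch of the same POST-and-parse flow. It assumes the form fields from the wget call above and the div id="spaced" class="romaji" container matched by the Perl regex are still what the site uses; requests and beautifulsoup4 are third-party packages:
#!/usr/bin/env python3
# Sketch: POST Japanese text to nihongo.j-talk.com and extract the romaji.
# Form field names are copied from the wget call above; the uniqid must be
# picked from the site's current HTML source, as noted.
import requests
from bs4 import BeautifulSoup

string = "日本語は、主に日本で使われている言語である。"
uniqid = "46a7e5f7e7c7d8a7d9636ecb077da485479b66bc"

resp = requests.post("http://nihongo.j-talk.com/", data={
    "uniqid": uniqid,
    "Submit": "Translate Now",
    "kanji_parts": "standard",
    "kanji": string,
    "converter": "spaced",
    "kana_output": "romaji",
})
resp.raise_for_status()

div = BeautifulSoup(resp.text, "html.parser").find("div", id="spaced", class_="romaji")
print(div.get_text(" ", strip=True) if div else "no romaji block found")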
Hope this helps.

How can tshark and powershell redirection create a bytecode textfile?

Okay, so this is actually a problem I have been able to fix, but I still do not understand why the problem existed in the first place.
I have been using tshark on network traffic with the intention of creating a txt or csv file containing key information I can use for machine learning. At first glance the file looked perfectly fine and exactly how I imagined. However, in Python I noticed some strange initial characters, and after applying the split operator I was suddenly working with byte content.
My powershell script initially looked like this:
$src = "G:\...\train_data\"
$dst = $src+"tsharked\"
Write-Output $dst
Get-ChildItem $src -Filter *.pcap |
Foreach-Object {
$content = Get-Content $_.FullName
$filename=$_.BaseName
tshark -r $_.FullName -T fields -E separator="," -E quote=n -e ip.src -e ip.dst -e tcp.len -e frame.time_relative -e frame.time_delta > $dst$filename.txt
}
Now I try to read this file in my jupyter notebook
directory = "G://.../train_data/tsharked/"
file = open(directory+"example.txt", "r")
for line in file.readlines():
print(line)
words = line.split(",")
print(words)
break
The result looks like this
ÿþ134.169.109.51,134.169.109.25,543,0.000000000,0.000000000
['ÿþ1\x003\x004\x00.\x001\x006\x009\x00.\x001\x000\x009\x00.\x005\x001\x00', '\x001\x003\x004\x00.\x001\x006\x009\x00.\x001\x000\x009\x00.\x002\x005\x00', '\x005\x004\x003\x00', '\x000\x00.\x000\x000\x000\x000\x000\x000\x000\x000\x000\x00', '\x000\x00.\x000\x000\x000\x000\x000\x000\x000\x000\x000\x00\n']
When I opened the text file in an editor, the special characters ÿþ did not appear. This is the first time I have seen them. What do they even mean here?
Anyhow, I managed to fix this only by removing the output redirection in my PowerShell script.
$src = "G:\...\train_data\"
$dst = $src+"tsharked\"
Write-Output $dst
Get-ChildItem $src -Filter *.pcap |
Foreach-Object {
$content = Get-Content $_.FullName
$filename=$_.BaseName
$out = tshark -r $_.FullName -T fields -E separator="," -E quote=n -e ip.src -e ip.dst -e tcp.len -e frame.time_relative -e frame.time_delta
Set-Content -Path $dst$filename.txt -Value $out
}
And this is where I am asking myself: how is it possible that the output redirection in PowerShell wrote some kind of byte output? In my understanding this is simply a redirection of the console output, hence the name. How can this be anything but a string?
As of PowerShell 7.2, output from external programs is invariably decoded as text before further processing, which means that raw (byte) output can neither be passed on via | nor captured with >. See this answer for details.
PowerShell's > redirection operator is effectively an alias of Out-File, and its default character encoding therefore applies.
In Windows PowerShell, Out-File defaults to "Unicode" encoding, i.e. UTF-16LE:
This encoding uses a BOM (byte-order mark), whose bytes render as ÿþ if interpreted individually as ANSI (Windows-1252) bytes. It represents most characters as two-byte sequences,[1] which in the case of most characters in the Windows-1252 character set (itself a superset of ASCII) means that the second byte in each sequence is a NUL (0x0 byte) - this is what you're seeing.
Fortunately, in PowerShell (Core) 7+, all file-processing cmdlets now consistently default to (BOM-less) UTF-8.
To use a different encoding, either call Out-File explicitly and use its -Encoding parameter, or - as you have done, and as is generally preferable for the sake of performance when dealing with data that already is text - use Set-Content.
[1] At least two bytes are needed per character; for characters outside the so-called BMP (Basic Multilingual Plane), a pair of two-byte sequences is needed.
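As a side note on the Python half of the question: if you keep the UTF-16LE files that > produced, you can also simply tell Python the encoding when reading, and the BOM (the ÿþ bytes) is consumed for you. A minimal sketch using the paths from the question:
# Read the UTF-16LE file written by Windows PowerShell's > redirection;
# the "utf-16" codec consumes the BOM and decodes the two-byte sequences.
directory = "G://.../train_data/tsharked/"
with open(directory + "example.txt", "r", encoding="utf-16") as file:
    for line in file:
        print(line.split(","))
        break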

Use csvkit in a bash script to convert CSV to desired format?

I need to convert a large csv file into the static content format for Kirby CMS.
Say I have a csv file:
id,name,age,bio
0,bob,25,"Example bio, with a comma"
1,sam,37,"Hello World"
...
That I would like to restructure into separate folders/files like so:
1_bob/person.txt
ID: 0
----
Name: bob
----
Age: 25
----
Bio: Example bio, with a comma
2_sam/person.txt
ID: 1
----
Name: sam
----
Age: 37
----
Bio: Hello World
etc...
This is obviously a much simplified version of my data, so I had considered using csvkit because of its ability to properly parse commas in quoted fields etc.
I had found this script: https://forum.getkirby.com/t/import-from-csv/6038/15 which fails as a result of the above issue (a plain bash IFS read cannot handle commas inside quoted CSV fields)
#!/bin/bash
OLDIFS=$IFS
IFS=";"
while read number year title website slug
do
if [ ! -d "$number-$slug" ]; then
mkdir ./$number-$slug
fi
echo -e "Year: $year\n----\nTitle: $title\n----\nWebsite: $website" > $number-$slug/project.txt
done < projects.csv
IFS=$OLDIFS
I know I could write a Python script to do this fairly easily, but I was wondering if there is indeed a way to combine any of the tooling of csvkit to do this in a bash script. My assumption was to use csvcut to pull lines of data out of the csv, but of course I am still stuck on the same problem of how to parse this data and output it in the desired format.
It is usually much easier to process TSV files than CSV files with bash, awk, and many other utilities, since it avoids the need for quoting. csvformat will handle the conversion:
Using your current script:
csvformat -T projects.csv | while IFS=$'\t' read -r number year title website slug
do
if [ ! -d "$number-$slug" ]; then
mkdir ./$number-$slug
fi
echo -e "Year: $year\n----\nTitle: $title\n----\nWebsite: $website" > $number-$slug/project.txt
done
The code expects a 'slug' column for each record, which is not in the sample input; I'm assuming the actual input will have it in the 5th column.
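If you do fall back to Python, a minimal sketch with the standard csv module produces the folder/file layout from the question; the input filename people.csv is an assumption:
#!/usr/bin/env python3
# Sketch: write one Kirby-style person.txt per CSV row; csv.DictReader
# handles commas inside quoted fields correctly.
import csv
import os

with open("people.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f), start=1):
        folder = f"{i}_{row['name']}"
        os.makedirs(folder, exist_ok=True)
        fields = [f"ID: {row['id']}", f"Name: {row['name']}",
                  f"Age: {row['age']}", f"Bio: {row['bio']}"]
        with open(os.path.join(folder, "person.txt"), "w") as out:
            out.write("\n----\n".join(fields) + "\n")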

Read JSON Multiple Values into Bash Variable - not able to use any 3rd party tools like jq etc

This has been asked a million times and I know there are a million solutions. However, I'm restricted in that I can't install anything on this client server, so I have whatever bash can come up with :)
I'm referencing Parsing JSON with Unix tools and using this to read data and split into lines.
$ cat demo.json
{"rows":[{"name":"server1.domain.com","Access":"Owner","version":"99","Business":"Owner1","Owner2":"Main_Apprve","Owner1":"","Owner2":"","BUS":"Marketing","type":"data","Egroup":["ALPHA","BETA","GAMA","DELTA"],"Ename":["D","U","G","T","V"],"stage":"TEST"}]}
However, as you can see, it splits "Egroup" and the other keys with multiple entries across several lines, making it a little more difficult.
cat demo.json | sed -e 's/[{}]/''/g' | awk -v k="text" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}'
"rows":["name":"server1.domain.com"
"Access":"Owner"
"version":"99"
"Business":"Owner1"
"Owner2":"Main_Apprve"
"Owner1":""
"Owner2":""
"BUS":"Marketing"
"type":"data"
"Egroup":["ALPHA"
"BETA"
"GAMA"
"DELTA"]
"Ename":["D"
"U"
"G"
"T"
"V"]
"stage":"TEST"]
I'm trying to capture the data so I can list it using a shell script. How would you advise me to capture each variable and then reuse it for reporting in a shell script?
grep -Po '"Egroup":.*?[^\\]",' demo.json
"Egroup":["ALPHA",
As you can see this wouldn't work for lines with more than 1 entry.
Thoughts appreciated. (btw I'm open to Python and Perl options, but without having to install any extra modules to use with JSON)
Perl one-liner using the JSON module:
perl -lane 'use JSON; my $data = decode_json($_); print join( ",", @{ $data->{rows}->[0]->{Egroup} } )' demo.json
Output
ALPHA,BETA,GAMA,DELTA
If you do not have JSON installed, instead of trying to reinvent a JSON parser, you can copy the source of JSON::PP (PP means Pure Perl) and put it in your working directory:
/working/demo.json
/working/JSON/PP.pm # <- put it here
perl -lane -I/working 'use JSON::PP; my $data = decode_json($_); print join( ",", @{ $data->{rows}->[0]->{Egroup} } )' demo.json
It's simple using Python.
Example
$ python -c 'import sys, json; print json.load(sys.stdin)["rows"][0]["Egroup"]' <demo.json
[u'ALPHA', u'BETA', u'GAMA', u'DELTA']
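For Python 3, where print is a function, the equivalent one-liner (joined into a single string, which is handier for reporting) would be:
$ python3 -c 'import sys, json; print(",".join(json.load(sys.stdin)["rows"][0]["Egroup"]))' <demo.json
ALPHA,BETA,GAMA,DELTA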
I think it's good to add a jq solution, because it's a great JSON tool:
<input jq --raw-output '.rows[].Egroup|join(",")'
You can see it in action here https://jqplay.org/s/de4X1TUBG4

Trying to read DBM file

I have a stripped down real-time Linux box that interfaces with some hardware.
The configuration files are *.dbm files and I cannot access them. They seem to be some sort of key-value database but every library I have tried has come up empty.
I have tried the DBM reading libraries from Perl, Python, and Ruby with no luck. Any guidance on these files would be great; I have not seen them before.
This is what happens when I cat one file out.
DBMFILE Aug 31 2004,�
,jy �
�~���"��+�K&��gB��7JJ�
,��GLOBA.PB_COMBI�SMSI���
JG]
,��BUS_DP
PC �
'
xLokalT
J��
,��SL_DP
PC!�
��
#,��PLC_PARAMJPf,��PROJEKT�PROFIBUS new network1.00022.02.2012J,��KBL_HEADER:�JJp,��KBLJ��,��ALI-SETUPB ����
������������������JJ,,��OBJ-DEFJJ��,��ALI_CLIENTTJJ�
,��ALI_SERVERJ J\r�����2, �� ST_OV_00Boolean0Integer8 0Integer16
0Integer32
0Unsigned8
0Unsigned32Floating-Point0igned16
Octet String Jo� ,��DESCRIPT �ABB OyABB Drives RPBA-01ABBSlave1***reserved***�
�
%
So, to show what I've tried already - each attempt came up with empty objects (no key-values). *edit
perl -
#!/usr/bin/perl -w
use strict;
use DB_File;
use GDBM_File;
my ($filename, %hash, $flags, $mode, $DB_HASH) = @ARGV;
tie %hash, 'DB_File', [$filename, $flags, $mode, $DB_HASH]
or die "Cannot open $filename: $!\n";
while ( my($key, $value) = each %hash ) {
print "$key = $value\n";
}
# these unties happen automatically at program exit
untie %hash;
which returns nothing
python -
import dbm
db = dbm.open('file', 'c')
ruby -
require 'dbm'
db = DBM.open('file', 0666, DBM::WRCREAT)
Every one of these returned empty. I assume they use the same low level library. Some history/context on DBM files would be great as there seems to be some different versions.
**Edit
running file on it returns
$ file abb12mb_uncontrolledsynch_ppo2_1slave.dbm
abb12mb_uncontrolledsynch_ppo2_1slave.dbm: data
and running strings outputs
$ strings abb12mb_uncontrolledsynch_ppo2_1slave.dbm
DBMFILE
Aug 31 2004
GLOBAL
PB_COMBI
SMSI
BUS_DP
Lokal
SL_DP
PLC_PARAM
PROJEKT
PROFIBUS new network
1 .000
22.02.2012
KBL_HEADER
ALI-SETUP
OBJ-DEF
ALI_CLIENT
ALI_SERVER
ST_OV_0
Boolean
Integer8
Integer16
Integer32
Unsigned8
Unsigned16
Unsigned32
Floating-Point
Octet String
DESCRIPT
ABB Oy
ABB Drives RPBA-01
ABBSlave1
***reserved***
Just to make my comment clear, you should try using the default options for DB_File, like this
use strict;
use warnings;
use DB_File;
my ($filename) = @ARGV;
tie my %dbm, 'DB_File', $filename or die qq{Cannot open DBM file "$filename": $!};
print "$_\n" for keys %dbm;
From the documentation for Perl's dbmopen function:
[This function has been largely superseded by the tie function.]
You probably want to try tieing it with DB_File.
use DB_File;
tie %hash, 'DB_File', $filename, $flags, $mode, $DB_HASH;
Then your data is in %hash.
Might also be interesting to run file against the file to see what it actually is.
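On the Python side there is a similar diagnostic in the standard library: dbm.whichdb inspects the file's magic bytes and reports which dbm flavor, if any, wrote it (Python 3 naming shown):
import dbm
# Returns a module name such as 'dbm.gnu' or 'dbm.ndbm' if the magic
# bytes match, '' if the format is unrecognized, and None if the file
# cannot be read. An empty string here would support the suspicion that
# this is a proprietary format rather than a standard dbm database.
print(dbm.whichdb('abb12mb_uncontrolledsynch_ppo2_1slave.dbm'))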

How can I parse a C header file with Perl?

I have a header file in which there is a large struct. I need to read this structure using some program, perform some operations on each member of the structure, and write them back.
For example I have some structure like
const BYTE Some_Idx[] = {
4,7,10,15,17,19,24,29,
31,32,35,45,49,51,52,54,
55,58,60,64,65,66,67,69,
70,72,76,77,81,82,83,85,
88,93,94,95,97,99,102,103,
105,106,113,115,122,124,125,126,
129,131,137,139,140,149,151,152,
153,155,158,159,160,163,165,169,
174,175,181,182,183,189,190,193,
197,201,204,206,208,210,211,212,
213,214,215,217,218,219,220,223,
225,228,230,234,236,237,240,241,
242,247,249};
Now, I need to read this, apply some operation to each member, and create a new structure with a different order, something like:
const BYTE Some_Idx_Mod_mul_2[] = {
8,14,20, ...
...
484,494,498};
Is there any Perl library already available for this? If not Perl, something else like Python is also OK.
Can somebody please help!!!
Keeping your data lying around in a header makes it trickier to get at using other programs like Perl. Another approach you might consider is to keep this data in a database or another file and regenerate your header file as-needed, maybe even as part of your build system. The reason for this is that generating C is much easier than parsing C, it's trivial to write a script that parses a text file and makes a header for you, and such a script could even be invoked from your build system.
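As a sketch of that generate-don't-parse idea - the data file name and its one-integer-per-line format are assumptions here:
#!/usr/bin/env python3
# Sketch: keep the values in a plain text file and regenerate the header,
# applying the transformation on the way out instead of parsing C.
with open("some_idx.txt") as f:
    values = [int(line) for line in f if line.strip()]

doubled = [v * 2 for v in values]
rows = [",".join(str(v) for v in doubled[i:i + 8])
        for i in range(0, len(doubled), 8)]

with open("some_idx.h", "w") as out:
    out.write("const BYTE Some_Idx_Mod_mul_2[] = {\n")
    out.write(",\n".join(rows))
    out.write("};\n")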
Assuming that you want to keep your data in a C header file, you will need one of two things to solve this problem:
a quick one-off script to parse exactly (or close to exactly) the input you describe.
a general, well-written script that can parse arbitrary C and work generally on to lots of different headers.
The first case seems more common than the second to me, but it's hard to tell from your question if this is better solved by a script that needs to parse arbitrary C or a script that needs to parse this specific file. For code that works on your specific case, the following works for me on your input:
#!/usr/bin/perl -w
use strict;
open FILE, "<header.h" or die $!;
my @file = <FILE>;
close FILE or die $!;
my $in_block = 0;
my $regex = 'Some_Idx\[\]';
my $byte_line = '';
my @byte_entries;
foreach my $line (@file) {
  chomp $line;
  if ( $line =~ /$regex.*\{(.*)/ ) {
    $in_block = 1;
    my @digits = @{ match_digits($1) };
    push @byte_entries, @digits;
    next;
  }
  if ( $in_block ) {
    my @digits = @{ match_digits($line) };
    push @byte_entries, @digits;
  }
  if ( $line =~ /\}/ ) {
    $in_block = 0;
  }
}
print "const BYTE Some_Idx_Mod_mul_2[] = {\n";
print join ",", map { $_ * 2 } @byte_entries;
print "};\n";
sub match_digits {
  my $text = shift;
  my @digits;
  while ( $text =~ /(\d+),*/g ) {
    push @digits, $1;
  }
  return \@digits;
}
Parsing arbitrary C is a little tricky and not worth it for many applications, but maybe you need to actually do this. One trick is to let GCC do the parsing for you and read in GCC's parse tree using a CPAN module named GCC::TranslationUnit.
Here's the GCC command to compile the code, assuming you have a single file named test.c:
gcc -fdump-translation-unit -c test.c
Here's the Perl code to read in the parse tree:
use GCC::TranslationUnit;
# echo '#include <stdio.h>' > stdio.c
# gcc -fdump-translation-unit -c stdio.c
$node = GCC::TranslationUnit::Parser->parsefile('stdio.c.tu')->root;
# list every function/variable name
while($node) {
if($node->isa('GCC::Node::function_decl') or
$node->isa('GCC::Node::var_decl')) {
printf "%s declared in %s\n",
$node->name->identifier, $node->source;
}
} continue {
$node = $node->chain;
}
Sorry if this is a stupid question, but why worry about parsing the file at all? Why not write a C program that #includes the header, processes it as required, and then spits out the source for the modified header? I'm sure this would be simpler than the Perl/Python solutions, and it would be much more reliable because the header would be parsed by the C compiler's parser.
You don't really explain how the modifications should be determined, but to address your specific example:
$ perl -pi.bak -we'if ( /const BYTE Some_Idx/ .. /;/ ) { s/Some_Idx/Some_Idx_Mod_mul_2/g; s/(\d+)/$1 * 2/ge; }' header.h
Breaking that down, -p says loop through input files, putting each line in $_, running the supplied code, then printing $_. -i.bak enables in-place editing, renaming each original file with a .bak suffix and printing to a new file named whatever the original was. -w enables warnings. -e'....' supplies the code to be run for each input line. header.h is the only input file.
In the perl code, if ( /const BYTE Some_Idx/ .. /;/ ) checks that we are in a range of lines beginning with a line matching /const BYTE Some_Idx/ and ending with a line matching /;/.
s/.../.../g does a substitution as many times as possible. /(\d+)/ matches a series of digits. The /e flag says the result ($1 * 2) is code that should be evaluated to produce a replacement string, instead of simply a replacement string. $1 is the digits that should be replaced.
If all you need to do is modify structs, you can directly use a regex to split and apply changes to each value in the struct, looking for the declaration and the ending }; to know when to stop.
If you really need a more general solution, you could use a parser generator like PyParsing.
There is a Perl module called Parse::RecDescent which is a very powerful recursive descent parser generator. It comes with a bunch of examples. One of them is a grammar that can parse C.
Now, I don't think this matters in your case, but the recursive descent parsers using Parse::RecDescent are algorithmically slower (O(n^2), I think) than tools like Parse::Yapp or Parse::EYapp. I haven't checked whether Parse::EYapp comes with such a C-parser example, but if so, that's the tool I'd recommend learning.
Python solution (not full, just a hint ;)) - sorry if there are any mistakes, not tested:
import re
text = open('your file.c').read()
patt = r'(?is)(.*?{)(.*?)(}\s*;)'
m = re.search(patt, text)
g1, g2, g3 = m.group(1), m.group(2), m.group(3)
g2 = [int(i) * 2 for i in g2.split(',')]
out = open('your file 2.c', 'w')
out.write(g1 + ','.join(str(i) for i in g2) + g3)
out.close()
There is a really useful Perl module called Convert::Binary::C that parses C header files and converts structs from/to Perl data structures.
You could always use pack/unpack to read and write the data.
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
my @data;
{
  open( my $file, '<', 'Some_Idx.bin' );
  local $/ = \1; # read one byte at a time
  while( my $byte = <$file> ){
    push @data, unpack('C',$byte);
  }
  close( $file );
}
print join(',', @data), "\n";
{
  open( my $file, '>', 'Some_Idx_Mod_mul_2.bin' );
  # You have two options
  for my $byte( @data ){
    print $file pack 'C', $byte * 2;
  }
  # or
  print $file pack 'C*', map { $_ * 2 } @data;
  close( $file );
}
For the GCC::TranslationUnit example, see hparse.pl from http://gist.github.com/395160 which will make it into C::DynaLib, and also the not-yet-written Ctypes.
It parses functions for FFIs, not bare structs (contrary to Convert::Binary::C); hparse will only pick up structs if they are used as function args.
