I have a file.cc that contains an array of double values, as seen here:
double values[][4] = {
    { 0.1234, +0.5678, 0.1222, 0.9683 },
    { 0.1631, +0.4678, 0.2122, 0.6643 },
    { 0.1332, +0.5678, 0.1322, 0.1683 },
    { 0.1636, +0.7678, 0.7122, 0.6283 }
    // ... continues
};
How can I export these values to a Python list?
I cannot touch these files because they belong to an external library that is subject to modification; I want to be able to update the library without affecting my code.
This is pretty much answered in this other SO post.
But I will add a bit here. You need to define a ctypes type and then use the in_dll method.
From your example I made a .so with those values in values (e.g. with g++ -shared -fPIC file.cc -o so.so). I hope you have an idea how big the array is, or can find out from other variables in the library; otherwise this is a segfault waiting to happen.
import ctypes
lib = ctypes.CDLL('so.so')
da = ctypes.c_double * 4 * 4  # type for a 4x4 array of C doubles
da.in_dll(lib, "values")[0][0]
# 0.1234
da.in_dll(lib, "values")[0][1]
# 0.5678
da.in_dll(lib, "values")[0][2]
# 0.1222
From here I would just loop over them reading into a list.
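Something like this sketch works (assuming the 4x4 shape from the question; so.so is the placeholder library name used above):

import ctypes

lib = ctypes.CDLL('./so.so')            # placeholder path to the shared object
da = ctypes.c_double * 4 * 4            # 4 rows of 4 doubles, as in the question
values = da.in_dll(lib, "values")
matrix = [list(row) for row in values]  # plain nested Python list
print(matrix[0])                        # [0.1234, 0.5678, 0.1222, 0.9683]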
How about using a temporary file? Write the matrix to it from C and read it back in Python.
In file.cc, write a function to save the matrix to a file.
#include <stdio.h>

int save_to_file(double matrix[][4], int row) {
    FILE *fp = fopen("tmp", "w");
    if (fp == NULL)
        return -1;
    for (int i = 0; i < row; i++) {
        for (int j = 0; j < 4; j++) {
            fprintf(fp, "%f", matrix[i][j]);
            if (j == 3)
                fputc('\n', fp);   /* end of row */
            else
                fputc(' ', fp);    /* column separator */
        }
    }
    fclose(fp);
    return 0;
}
and read it back with a Python script like this:
tmp = open('tmp')
L = []
for line in tmp:
    newline = []
    t = line.split(' ')
    for string in t:
        newline.append(float(string))
    L.append(newline)
tmp.close()

for row in L:
    for number in row:
        print "%.4f" % number
    print " "
I have Scala code and Python code attempting the same task (2021 Advent of Code, day 1: https://adventofcode.com/2021/day/1).
The Python returns the correct solution, the Scala does not. I ran diff on both of the outputs and have determined that my Scala code is incorrectly evaluating the following pairs:
1001 > 992 -> false
996 > 1007 -> true
1012 > 977 -> false
The following is my Python code:
import pandas as pd

data = pd.read_csv("01_input.csv", header=None)
incr = 0
prevval = 99999
for index, row in data.iterrows():
    if index != 0:
        if row[0] > prevval:
            print(f"{index}-{row[0]}-{prevval}")
            incr += 1
            prevval = row[0]
    prevval = row[0]
print(incr)
and here is my Scala code:
import scala.io.Source; // library to read input file in

object advent_of_code_2021_01 {
  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile("01_input.csv").getLines().toList; // file as immutable list
    var increases = 0;
    for (i <- 1 until lines.length) { // iterate over list by index
      if (lines(i) > lines(i-1)) {
        increases += 1;
        println(s"$i-${lines(i)}-${lines(i-1)}")
      }
    }
    println(increases);
  }
}
I do not understand what is causing this issue on these particular values. In the shell, Scala evaluates them correctly, but I do not know where to even begin with this. Is there some behavior I need to know about that I'm not accounting for? Am I just doing something stupid? Any help is appreciated, thank you.
As @Edward Peters (https://stackoverflow.com/users/6016064/edward-peters) correctly identified, my problem was that I was doing string comparisons, not numerical comparisons, so I needed to convert my values to Int rather than leaving them as String. I did this with the very simple .toInt and it fixed all my issues.
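The pitfall is easy to demonstrate in Python too, where comparing the raw strings is also lexicographic and reproduces exactly the wrong pairs listed in the question:

print("1001" > "992")            # False, because '1' < '9'
print("996" > "1007")            # True, because '9' > '1'
print(int("996") > int("1007"))  # False once converted to integers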
Fixed Scala code:
import scala.io.Source; // library to read input file in

object advent_of_code_2021_01 {
  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile("01_input.csv").getLines().toList; // file as immutable list
    var increases = 0;
    for (i <- 1 until lines.length) { // iterate over list by index
      if (lines(i).toInt > lines(i-1).toInt) { // evaluate
        increases += 1; // increment when true
        println(s"$i-${lines(i)}-${lines(i-1)}") // debug line
      }
    }
    println(increases); // result
  }
}
I have two text files. Both contain hundreds of millions of lines (rows). The second one is about four times larger.
Both text files have two columns each. The first one is an ID (key), the second one is the Value string which has to be compared between both files.
EDIT2: There might be duplicates in Value for both files.
Structure of both text files:
ID Value
B00CC0:2610:20880:13730 cd99AABABBABABABABABABABABA
B00CC0:2549:10230:33301 cd99BABABBABBBABBBBBBAAABBB
B00CC0:1272:8504:27179 cd99BBBBBBBBAAAAAAAAABBBBBB
B00CC0:1556:10628:35055 cd99AAAABBBBABABAAAAAAAAAAB
... ...
Now I want to output every line in the second file which contains a Value occurring in the first file (exact match, not a substring!).
I tried a naive implementation in Python by just loading both files into dataframes and then filtering:
import modin.pandas as pd
import ray
ray.init()

# load 1st file
data_one = pd.read_csv(filename1, compression='gzip', header=0, sep='\t', usecols=[1], names=['Value'])
data_one_list = data_one['Value'].tolist()

# load 2nd file
data_two = pd.read_csv(filename2, compression='gzip', header=0, sep='\t', usecols=[0, 1], names=['ID', 'Value'])

# filter
data_two_filtered = data_two[data_two['Value'].isin(data_one_list)]
However, this works only if I subset the first file; otherwise it is too big and the Python script crashes (eating up all the RAM). And it is too slow anyway. I tried modin.pandas to speed up the entire process, but it does not solve my problem.
Now I have questions going into two directions:
First direction:
Do you think it is possible to develop a solution with "decent" performance in Python? Or do you think C/C++ is needed (I mention C/C++ since those are the only compiled languages I know at least well enough to solve this problem)?
Second direction:
Do you think I have to use an approach such as a hash table or a trie for lookup or do you think a simple table lookup as tested is sufficient if done correctly?
If you suggest a specific approach, what would it be (data structure, approach)?
EDIT:
I have a machine with 256 GB RAM and 64 threads.
A decent speed would be to have this filtering performed within about 1-2 minutes max.
Obviously, several solutions are possible. Since there is a lot of memory available on your computer, one could read the Value column of file 1 line by line and add each value to a set.
After that, one reads file 2 line by line and checks whether each value is in the set; if it is, one outputs the current line buffer.
Such a C program is quickly written in < 100 lines of code, especially if you use an existing set implementation. I chose https://github.com/barrust/set because it looks good and is easy to integrate, just copy set.c and set.h into your project. For a quick test, I created a file with 100 million lines of random data with a similar structure as shown in your question.
It seems that with set_init_alt you can already set a high capacity for the hash table.
With
gtime -f "CPU: %Us\tReal: %es\tRAM: %MKB" ./search file1.txt file2.txt
I measured about 45 seconds at 8.6 GB RAM for building the hash on my laptop, which seems to be a good result.
C program
The C program assumes that column 1 and column 2 are separated by spaces. It is easily adaptable if other separators are to be used.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include "set.h"

#define MAX_LINE_LENGTH 128

static char *get_value(char *buf);

static void error_exit(char *prefix, char *msg) {
    fprintf(stderr, "%s: %s\n", prefix, msg);
    exit(-1);
}

static void build_set(char *fileName, SimpleSet *set) {
    char buf[MAX_LINE_LENGTH];
    FILE *fp;
    if ((fp = fopen(fileName, "r")) == NULL) {
        error_exit("failure opening file1", strerror(errno));
    }
    while (fgets(buf, sizeof(buf), fp) != NULL) {
        char *value = get_value(buf);
        set_add(set, value);
    }
    if (ferror(fp)) {
        error_exit("error reading from file1", strerror(errno));
    }
    fclose(fp);
}

static void query_set(char *fileName, SimpleSet *set) {
    char buf[MAX_LINE_LENGTH];
    FILE *fp;
    if ((fp = fopen(fileName, "r")) == NULL) {
        error_exit("failure opening file2", strerror(errno));
    }
    while (fgets(buf, sizeof(buf), fp) != NULL) {
        char *value = get_value(buf);
        if (set_contains(set, value) == SET_TRUE) {
            printf("%s\n", buf);
        }
    }
    if (ferror(fp)) {
        error_exit("error reading from file2", strerror(errno));
    }
    fclose(fp);
}

static char *get_value(char *buf) {
    char *ptr = buf;
    while (*ptr && *ptr != ' ')   /* skip over the ID column */
        ptr++;
    while (*ptr == ' ')           /* skip the separating spaces */
        ptr++;
    char *value = ptr;
    while (*ptr && *ptr != '\n')  /* find the end of the line */
        ptr++;
    *ptr = '\0';                  /* strip the trailing newline */
    return value;
}

int main(int argc, char *argv[]) {
    if (argc != 3) {
        error_exit("usage", "search <file1> <file2>");
    }
    SimpleSet set;
    set_init_alt(&set, 500000000, NULL); /* use default hash */
    build_set(argv[1], &set);
    query_set(argv[2], &set);
    /* the cleanup takes some time, but since the program terminates anyway, it is not necessary */
    /* set_destroy(&set); */
    return 0;
}
Build command
gcc -Wall -Wextra main.c set.c -O3 -o search
Last remark
This is certainly not a perfect, fully optimized version, and of course far more advanced solutions could be developed, but perhaps it is a starting point for your own experiments.
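As an addendum on the first direction of your question (whether "decent" performance is possible in Python at all): the same set-based approach is only a few lines of Python and avoids pandas entirely, so it may be worth benchmarking on your machine before reaching for C. A minimal sketch, assuming gzipped, tab-separated files with a header line as in your code; the file names are placeholders:

import gzip

filename1 = 'file1.tsv.gz'  # placeholder names
filename2 = 'file2.tsv.gz'

# pass 1: collect every Value from file 1 into a set (O(1) average lookups)
values = set()
with gzip.open(filename1, 'rt') as f1:
    next(f1)                # skip the header line
    for line in f1:
        values.add(line.rstrip('\n').split('\t')[1])

# pass 2: stream file 2 and output every line whose Value is in the set
with gzip.open(filename2, 'rt') as f2:
    next(f2)                # skip the header line
    for line in f2:
        if line.rstrip('\n').split('\t')[1] in values:
            print(line, end='')

Whether this fits the 1-2 minute budget depends on Python's per-line overhead; the C version above will likely remain noticeably faster.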
I have been looking to speed up a basic Python function which just takes a line of text and checks it for a substring. The Python program is as follows:
import time

def fun(line):
    l = line.split(" ", 10)
    if 'TTAGGG' in l[9]:
        pass  # Do nothing

line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31"
time0 = time.time()
for i in range(10000):
    fun(line)
print time.time() - time0
I wanted to see if I could use some of the high level features of Rust to possibly gain some performance, but the code runs considerably slower. The Rust conversion is:
extern crate regex;
extern crate time;

use regex::Regex;

fn main() {
    let line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31";
    let substring: &str = "TTAGGG";
    let time0: f64 = time::precise_time_s();
    for _ in 0..10000 {
        fun(line, substring);
    }
    let time1: f64 = time::precise_time_s();
    let elapsed: f64 = time1 - time0;
    println!("{}", elapsed);
}

fn fun(line: &str, substring: &str) {
    let l: Vec<&str> = line.split(" ")
        .enumerate()
        .filter(|&(i, _)| i == 9)
        .map(|(_, e)| e)
        .collect();
    let re = Regex::new(substring).unwrap();
    if re.is_match(&l[0]) {
        // Do nothing
    }
}
On my machine, Python times this at 0.0065s vs Rust's 1.3946s.
Just checking some basic timings, the line.split() part of the code takes around 1s, and the regex step is around 0.4s. Can this really be right, or is there an issue with timing this properly?
As a baseline, I ran your Python program with Python 2.7.6. Over 10 runs, it had a mean time of 12.2ms with a standard deviation of 443μs. I don't know how you got the very good time of 6.5ms.
Running your Rust code with Rust 1.4.0-dev (febdc3b20), without optimizations, I got a mean of 958ms and a standard deviation of 33ms.
Running your code with optimizations (cargo run --release), I got a mean of 34.6ms and standard deviation of 495μs. Always do benchmarking in release mode.
There are further optimizations you can do:
Compiling the regex once, outside of the timing loop:
fn main() {
    // ...
    let substring = "TTAGGG";
    let re = Regex::new(substring).unwrap();
    // ...
    for _ in 0..10000 {
        fun(line, &re);
    }
    // ...
}

fn fun(line: &str, re: &Regex) {
    // ...
}
Produces an average of 10.4ms with a standard deviation of 678μs.
Switching to a substring match:
fn fun(line: &str, substring: &str) {
    // ...
    if l[0].contains(substring) {
        // Do nothing
    }
}
Has a mean of 8.7ms and a standard deviation of 334μs.
And finally, if you look at just the one result instead of collecting everything into a vector:
fn fun(line: &str, substring: &str) {
    let col = line.split(" ").nth(9);
    if col.map(|c| c.contains(substring)).unwrap_or(false) {
        // Do nothing
    }
}
Has a mean of 6.30ms and standard deviation of 114μs.
A direct translation of the Python would be:
extern crate time;

fn fun(line: &str) {
    let mut l = line.split(" ");
    if l.nth(9).unwrap().contains("TTAGGG") {
        // do nothing
    }
}

fn main() {
    let line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31";
    let time0 = time::precise_time_s();
    for _ in 0..10000 {
        fun(line);
    }
    println!("{}", time::precise_time_s() - time0);
}
Using cargo run --release on stable (1.2.0), I get about 0.0267s compared to about 0.0240s for Python (CPython 2.7.10). Given that Python's in on strings is just a C routine, this is reasonable.
Impressively, on beta (1.3.0) and nightly (1.4.0) this decreases to just about 0.0122s, or about twice the speed of CPython!
I have a little problem concerning string generation in C.
The following code snippet is part of a C Extension for a Python/Tkinter app which generates images (mandelbrot, gradients and such). Before anyone asks: I don't want to power up Photoshop for such a simple task - overkill...
The problem I'm having is at the end of the snippet in the last for-loop.
This function generates a PPM image file for further processing. The main goal is to generate a string containing the raster data in binary format and pass that string back to Python and then to Tkinter image data to have a preview of the result.
At the moment I write a file to disk which is pretty slow.
The iterator function returns a pointer to an RGB array.
If I now write every single color value to the file using
fputc(col[0], outfile)
it works (that is the section which is commented out).
To get closer to my main goal, I tried to merge the three color values into a string and write that to the file.
When I run that code from my Python app, I end up with a file containing just the header.
Could anyone please point me in the right direction? The whole C thing is pretty new to me, so I'm pretty much stuck here...
static PyObject* py_mandelbrotppm(PyObject* self, PyObject* args)
{
    //get filename from argument
    char *filename;
    PyArg_ParseTuple(args, "s", &filename);

    //---------- open file for writing and create header
    FILE *outfile = NULL;
    outfile = fopen(filename, "w");

    //---------- create ppm header
    char header[17];
    sprintf(header, "P6\n%d %d\n255\n", dim_x, dim_y);
    fputs(header, outfile);
    //---------- end of header generation

    for(int y = 0; y < dim_y; y++)
    {
        for(int x = 0; x < dim_x; x++)
        {
            int *col = iterator(x, y);
            char pixel[3] = {col[0], col[1], col[2]};
            fputs(pixel, outfile);
            /*
            for(int i = 0; i < 3; i++)
            {
                fputc(pixel[i], outfile);
            }
            */
        }
    }
    fclose(outfile);
    Py_RETURN_NONE;
}
You have a couple of problems with your new code.
pixel is missing a null terminator (and space for it). Fix it like this:
char pixel[4] = {col[0], col[1], col[2], '\0'};
(Note that fputs stops at the first zero byte, so any pixel with a 0 color value would still be cut short; for raw binary data, fwrite(pixel, 1, 3, outfile) would be more robust.)
But I'll let you in on a little secret. Putting a bunch of ints into an array of chars is going to truncate them and do all sorts of weird, squirrelly things. Maybe not for char-length numbers, but in terms of general style I wouldn't recommend it. Consider this:
...
for(int x = 0; x < dim_x; x++){
    int *col = iterator(x, y);
    fprintf(outfile, "%d, %d, %d", col[0], col[1], col[2]);
}
...
On the other hand, I'm a little confused as to why iterator returns ints when RGB values are from 0-255, which is precisely the range an unsigned char has:
unsigned char *col = iterator(x,y);
fprintf(outfile, "%u, %u, %u", col[0], col[1], col[2]);
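As an aside on the Tkinter preview goal from the question: once the file is a valid PPM, Tk can display it directly, since PhotoImage supports the PPM format natively. A minimal sketch (out.ppm is a placeholder for the generated file):

import tkinter as tk

root = tk.Tk()
img = tk.PhotoImage(file='out.ppm')  # Tk reads PPM files directly
tk.Label(root, image=img).pack()
root.mainloop()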
I need a way to pass an array of char* from Python to a C library using the ctypes module.
Some ways I've tried led me to segmentation faults, others to rubbish output.
As I've been struggling with this issue for some time, I've decided to write a small HowTo so other people can benefit.
Given this piece of C code:
void passPointerArray(int size, char **stringArray) {
    for (int counter = 0; counter < size; counter++) {
        printf("String number %d is : %s\n", counter, stringArray[counter]);
    }
}
We want to call it from Python using ctypes (more info about ctypes can be found in a previous post), so we write the following code:
def pass_pointer_array():
    string_set = [
        "Hello",
        "Bye Bye",
        "How do you do"
    ]
    string_length = len(string_set)
    select_type = (c_char_p * string_length)
    select = select_type()
    for key, item in enumerate(string_set):
        select[key] = item
    library.passPointerArray.argtypes = [c_int, select_type]
    library.passPointerArray(string_length, select)
Now that I read it, it appears to be very simple, but it took me quite a while to find the proper type to pass to ctypes in order to avoid segmentation faults...
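One caveat worth adding: on Python 3, c_char_p expects bytes rather than str, so assigning Python strings as above raises a TypeError there. A minimal adjustment to the loop, encoding each string first:

for key, item in enumerate(string_set):
    select[key] = item.encode('utf-8')  # c_char_p requires bytes on Python 3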