Not able to convert cassandra blob/bytes string to integer - python

I have a column family/table in Cassandra 3.0.6 with a column named "value" that is defined as the blob data type.
The CQLSH query select * from table limit 2; returns:
id     | name  | value
-------|-------|---------------------
id_001 | john  | 0x010000000000000000
id_002 | terry | 0x044097a80000000000
If I read these values using cqlengine (the DataStax Python driver), I get output like:
{'id':'id_001', 'name':'john', 'value': '\x01\x00\x00\x00\x00\x00\x00\x00\x00'}
{'id':'id_002', 'name':'terry', 'value': '\x04@\x97\xa8\x00\x00\x00\x00\x00'}
Ideally the values in the "value" field are 0 and 1514 for row 1 and row 2 respectively.
However, I am not sure how I can convert the "value" field values returned by cqlengine into 0 and 1514. I tried a few methods like ord(), decode(), etc., but nothing worked. :(
Questions:
What is this format?
'\x01\x00\x00\x00\x00\x00\x00\x00\x00' or
'\x04@\x97\xa8\x00\x00\x00\x00\x00'?
How can I convert these values to 0 and 1514?
NOTE: I am using python 2.7.9 on Linux
Any help or pointers would be useful.
Thanks,

A blob is returned as a byte array if you read it directly in Python; what you are seeing is that byte array printed via its repr, showing the raw bytes of the blob.
One way is to do the conversion explicitly in your query:
select id, name, blobasint(value) from table limit 3
There should be a conversion method with the Python driver as well.
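For the particular bytes shown in the question, a plain integer conversion will not give 0 and 1514; the payload looks like a one-byte leading marker followed by an 8-byte big-endian IEEE-754 double (0x4097A80000000000 is exactly 1514.0). A sketch of decoding that layout on Python 2.7, assuming the first byte can be skipped:

import struct

raw = '\x04@\x97\xa8\x00\x00\x00\x00\x00'   # bytes returned by the driver for row 2

# skip the first (marker) byte and unpack the remaining 8 bytes as a big-endian double
value = struct.unpack('>d', raw[1:])[0]
print(int(value))   # -> 1514

The same unpacking on '\x01\x00\x00\x00\x00\x00\x00\x00\x00' yields 0.0, matching the expected value for row 1.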

Related

psycopg3: interpret empty column strings as NULL

I am using the new psycopg3 driver and implementing a COPY ... FROM STDIN as follows:
import psycopg
import config

# Connect to an existing database
with psycopg.connect(f'postgresql://{config.USER_PG}:{config.PASS_PG}@{config.HOST_PG}:{config.PORT_PG}/{config.DATABASE_PG}') as conn:
    # Open a cursor to perform database operations
    with conn.cursor() as cur:
        cur.execute("TRUNCATE TABLE eim RESTART IDENTITY")
        # Copy buffer into PG (stream is the CSV line buffer, defined elsewhere in the script)
        with cur.copy("COPY eim(FIMNO,SPV_COMPONENT_ID,EIM_TECHNOLOGY_NAME,EIM_TECHNOLOGY_VERSION,ACTUAL_REMEDIATION_DATE,INTENDED_REMEDIATION_DATE,ISSUESTATUS,CURRENT_MANU_PHASE,CURRENT_MANU_PHASE_START,CURRENT_MANU_PHASE_END,REMEDIATION_TYPE,ISSUE_COMPONENT_ID,MIDDLEWARE_INSTANCE_NAME) FROM STDIN WITH CSV") as copy:
            print(next(stream))
            for data in stream:
                copy.write(data)
The code above works, but the strange thing is that if I omit the column names (FIMNO,SPV_COMPONENT_ID,EIM_TECHNOLOGY_NAME,EIM_TECHNOLOGY_VERSION,ACTUAL_REMEDIATION_DATE,INTENDED_REMEDIATION_DATE,ISSUESTATUS,CURRENT_MANU_PHASE,CURRENT_MANU_PHASE_START,CURRENT_MANU_PHASE_END,REMEDIATION_TYPE,ISSUE_COMPONENT_ID,MIDDLEWARE_INSTANCE_NAME), it throws an error.
My data contains null values, and the CSV looks like this (empty string for null):
,13,lot_of_other_data,
Apparently the command does not interpret the fields properly unless the column names are stated explicitly. I would like to make it work without the column names, perhaps by translating empty strings to NULL.
This is the ordinal position of my columns. Note that the last one is id, which has a serial default. I think that when I don't specify the columns, the positional inference does not work properly.
"fimno" 1
"spv_component_id" 2
"eim_technology_name" 3
"eim_technology_version" 4
"obsolete_start_date" 5
"actual_remediation_date" 6
"intended_remediation_date" 7
"issuestatus" 8
"current_manu_phase" 9
"current_manu_phase_start" 10
"current_manu_phase_end" 11
"remediation_type" 12
"issue_component_id" 13
"middleware_instance_name" 14
"id" 15
Note the extra serial id at position 15; it is not present in the string buffer that I'm writing to the table.
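One way to avoid hard-coding the column list while still skipping the serial id column is to build the list from the catalog; a sketch, assuming the same conn and stream objects as in the snippet above:

with conn.cursor() as cur:
    # read the physical column order and drop the serial "id" column,
    # which is not present in the CSV buffer
    cur.execute(
        "SELECT column_name FROM information_schema.columns "
        "WHERE table_name = 'eim' ORDER BY ordinal_position"
    )
    columns = [name for (name,) in cur.fetchall() if name != 'id']

    copy_sql = "COPY eim({}) FROM STDIN WITH CSV".format(", ".join(columns))
    with cur.copy(copy_sql) as copy:
        for data in stream:
            copy.write(data)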

Delete categorical data from a column and leave only the numerical data? I want to delete the word "SUBM-" from "SUBM-1245" in Python

I have a column in Python whose data type is object, but I want to change it to integer.
The records in that column look like:
SUBM - 4562
SUBM - 4563
and all the information in that column is like that. I want to delete the "SUBM - " part from the records and apply something similar to Excel's "replace with", filling what would be left empty with 0 so that only the numerical data remains. Can anyone suggest a way to do that?
If you are working with a column in Python, I assume you are using pandas to parse your table. In this case, you can simply use
df["mycolumn"] = df["mycolumn"].str.replace("SUBM-","")
However, you still have a column of type "object" then. A safe way to convert it to numeric is the following, where you basically throw away everything that can't be converted to a number:
df["mycolumn"] = pd.to_numeric(df["mycolumn"], errors="coerce", downcast="integer")
If you specifically need integer values (a float is not acceptable for you in case of NaN), you can afterwards fill empty cells with 0 and convert the column to integer:
df["mycolumn"] = df["mycolumn"].fillna(0).map(int) # if you specifically need integers
An alternative is to extract the numeric values using regular expressions. This automatically returns NaN if the expression does not match (i.e. also when "SUBM-" is not present in the cell):
df["mycolumn"] = df["mycolumn"].str.extract("SUBM-([0-9]*)")

Is there a more efficient way to write code for bin values in Databricks SQL?

I am using Databricks SQL and want to understand whether I can make my code lighter:
Select
  case when (
    age_18__24 is null AND
    age_25__34 is null AND
    age_35__44 is null AND
    age_45_or_more is null
  ) then 1 else 0 end as flag1...
Instead of writing out each line, is there a cool way to state that all of the columns starting with "age_" need to be null, in one or two lines of code?
If each bin is a column then you are probably going to have to spell it out, but you could use coalesce:
select
  case when
    coalesce(age_18__24, age_25__34, age_35__44, age_45_or_more) is null
  then 1 else 0
  end as flag1
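In a Databricks notebook you could also build the same flag programmatically in PySpark over every column whose name starts with "age_"; a sketch, assuming a DataFrame df that holds these columns:

from pyspark.sql import functions as F

# collect every bin column by prefix, then flag rows where all of them are null
age_cols = [c for c in df.columns if c.startswith("age_")]
flag = F.when(F.coalesce(*[F.col(c) for c in age_cols]).isNull(), 1).otherwise(0)
df = df.withColumn("flag1", flag)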

How to save values from a dataframe as a string where condition is met?

I have a .csv file that looks a bit like this:
ID | Cancelled
---|----------
1 | N
2 | Y
3 | N
4 | N
5 | Y
I want to use Python to save the cancelled IDs to a variable as a string with this format:
('2','5')
It needs to be in this format because the variable will be used later on in a bit of SQL nested within the same Python script.
Does anyone know how to do this please? I'm able to get a dataframe that holds the relevant rows using the code below, but I don't know how to strip this down to just the ID column and convert the IDs into the correct format.
V_test1 = pd.read_csv("myfile.csv")
V_test2 = V_test1[V_test1['Cancelled'] == 'Y']
If ID is the index, then use:
out = tuple('000' + V_test1.loc[V_test1['Cancelled'] == 'Y'].index.astype(str))
Otherwise use:
out = tuple('000' + V_test1.loc[V_test1['Cancelled'] == 'Y', 'ID'].astype(str))
Output:
print(out)
>>>
('0002','0005')
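Note that the answer above left-pads the IDs with '000'; if you just want the plain ('2','5') format from the question, a minimal sketch based on the question's own code:

import pandas as pd

V_test1 = pd.read_csv("myfile.csv")
# keep only cancelled rows, take the ID column as strings, and make a tuple
cancelled_ids = tuple(V_test1.loc[V_test1['Cancelled'] == 'Y', 'ID'].astype(str))
print(cancelled_ids)   # -> ('2', '5')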

How to search for multiple values in column with Sqlalchemy postgres

I have a query in PostgreSQL that searches for and selects all the rows with the value '0' or '3' inside an array column called 'news'. This column holds an array of multiple values. For example:
id | country | news
---------------------
one | xyz | {'2','4','8'}
two | esc | {'0','4','2'}
three| eec | {'9','3','5'}
So,
SELECT * FROM table WHERE news && '{"0", "3"}';
results in rows two and three being selected. Perfect. But I need to do this in SQLAlchemy.
Does anyone know how this can be written in SQLAlchemy?
@balderman helped me with resources that I used to come up with this SQLAlchemy code:
full_id_list = []
for n in ['0', '3']:
    ids = db.session.query(table).filter(table.news.op('@>')([n]))
    full_id_list.append(ids)
But is there a simpler way, without using a for loop?
Solved it, finally.
db.session.query(Table).filter(Table.news.op('&&')(['0', '3']))
All I had to do was change the operation (.op) from @> to &&, because && lets you search for multiple values inside a column that holds an array of values.
See https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#sqlalchemy.dialects.postgresql.ARRAY.Comparator
query = session.query(table).filter(table.news.contains([some_int])).all()
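If the news column is mapped with the postgresql ARRAY type, the typed comparator overlap() renders the same && operator without going through .op(). A sketch with an illustrative model (Base, NewsRow and the column types are assumptions mirroring the question's table; db.session is the question's session):

from sqlalchemy import Column, String, Text
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class NewsRow(Base):
    __tablename__ = 'table'
    id = Column(String, primary_key=True)
    country = Column(String)
    news = Column(ARRAY(Text))

# overlap() emits "news && ARRAY['0', '3']", i.e. rows sharing any of the given values
rows = db.session.query(NewsRow).filter(NewsRow.news.overlap(['0', '3'])).all()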
