I am trying to import a CSV into AWS Redshift (PostgreSQL 8.x).
The data flow is:
mysql -> parquet files on s3 -> csv files on s3 -> redshift.
Table structure
The MySQL table SQL:
create table orderitems
(
id char(36) collate utf8_bin not null
primary key,
store_id char(36) collate utf8_bin not null,
ref_type int not null,
ref_id char(36) collate utf8_bin not null,
store_product_id char(36) collate utf8_bin not null,
product_id char(36) collate utf8_bin not null,
product_name varchar(50) null,
main_image varchar(200) null,
price int not null,
count int not null,
logistics_type int not null,
time_create bigint not null,
time_update bigint not null,
...
);
I used the same SQL to create the table in Redshift, but I got an error while importing the CSV.
My code to import the CSV into Redshift (Python):
import pandas as pd
import smart_open
import psycopg2

# the parquet files were dumped by sqoop
p2 = 'xxx'
df = pd.read_parquet(path)
with smart_open.smart_open(p2, 'w') as f:
    df.to_csv(f, index=False)  # python3 default encoding is utf-8
conn = psycopg2.connect(CONN_STRING)
sql = """COPY %s FROM '%s' credentials 'aws_iam_role=%s' region 'cn-north-1'
delimiter ',' FORMAT AS CSV IGNOREHEADER 1 ; commit ;""" % (to_table, p2, AWS_IAM_ROLE)
print(sql)
cur = conn.cursor()
cur.execute(sql)
conn.close()
I got an error.
Checking STL_LOAD_ERRORS, I found the error is on the product_name column:
row_field_value : .............................................215g/...
err_code: 1204
err_reason: String length exceeds DDL length
The real_value is 伊利畅轻蔓越莓奇亚籽风味发酵乳215g/瓶 (Chinese).
So it looks like an encoding problem. Since MySQL is UTF-8 and the CSV is UTF-8 too, I don't know what is wrong.
Your column is a varchar data type with length 50. In Redshift that is 50 bytes, not 50 characters. The string in your example is 16 Chinese characters, which are 3 bytes each in UTF-8, plus five ASCII characters at one byte each, so about 53 bytes. That's longer than the byte length of the column, so the import fails.
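You can see this concretely by measuring the byte length of the failing value (a minimal check, using the string reported by STL_LOAD_ERRORS above):
# Redshift VARCHAR(50) limits values to 50 bytes, not 50 characters.
value = "伊利畅轻蔓越莓奇亚籽风味发酵乳215g/瓶"
print(len(value))                  # 21 characters
print(len(value.encode("utf-8")))  # 53 bytes, which exceeds VARCHAR(50), so COPY rejects the row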
Related
Here is my table creation code:
CREATE TABLE `crypto_historical_price2` (
`Ticker` varchar(255) COLLATE latin1_bin NOT NULL,
`Timestamp` varchar(255) COLLATE latin1_bin NOT NULL,
`PerpetualPrice` double DEFAULT NULL,
`SpotPrice` double DEFAULT NULL,
`Source` varchar(255) COLLATE latin1_bin NOT NULL,
PRIMARY KEY (`Ticker`,`Timestamp`,`Source`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin
I'm updating stuff in batches with SQL statements like the following:
sql = "INSERT INTO crypto."+TABLE+"(Ticker,Timestamp,PerpetualPrice,SpotPrice,Source) VALUES %s;" % batchdata
where batchdata is just a string of data like ('SOL', '2022-11-03 02:01:00', '31.2725', '31.2875', 'FTX'),('SOL', '2022-11-03 02:02:00', '31.3075', '31.305', 'FTX')
Now my script runs for a while, successfully inserting data into the table, but then it barfs with the following errors:
error 1265 data truncated for column PerpetualPrice
and
Duplicate entry 'SOL-2022-11-02 11:00:00-FTX' for key 'primary'
I've tried to solve the second error with
sql = "INSERT INTO crypto.crypto_historical_price2(Ticker,Timestamp,PerpetualPrice,SpotPrice,Source) VALUES %s ON DUPLICATE KEY UPDATE Ticker = VALUES(Ticker), Timestamp = VALUES(Timestamp), PerpetualPrice = VALUES(PerpetualPrice), SpotPrice = VALUES(SpotPrice), Source = VALUES(Source);" % batchdata
and
sql = "INSERT INTO crypto.crypto_historical_price2(Ticker,Timestamp,PerpetualPrice,SpotPrice,Source) VALUES %s ON DUPLICATE KEY UPDATE Ticker = VALUES(Ticker),Timestamp = VALUES(Timestamp),Source = VALUES(Source);" % batchdata
Both of the above attempted remedies run and don't throw a duplicate entry error, but they don't update the table at all.
If I pause my script for a couple of minutes and re-run it, the duplicate error goes away and it updates, which confuses me EVEN more lol.
Any ideas?
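For reference, here is a minimal sketch of the same batch upsert written with parameterized queries and an explicit commit (this is not a diagnosis of the error; it assumes mysql.connector, the crypto_historical_price2 table above, and a hypothetical rows list standing in for the tuples that batchdata was built from):
import mysql.connector

# rows stands in for the tuples that batchdata was built from (illustrative values only)
rows = [
    ('SOL', '2022-11-03 02:01:00', 31.2725, 31.2875, 'FTX'),
    ('SOL', '2022-11-03 02:02:00', 31.3075, 31.305, 'FTX'),
]
sql = (
    "INSERT INTO crypto.crypto_historical_price2 "
    "(Ticker, Timestamp, PerpetualPrice, SpotPrice, Source) "
    "VALUES (%s, %s, %s, %s, %s) "
    "ON DUPLICATE KEY UPDATE "
    "PerpetualPrice = VALUES(PerpetualPrice), SpotPrice = VALUES(SpotPrice)"
)
conn = mysql.connector.connect(**config)  # config is assumed to hold your connection details
cur = conn.cursor()
for row in rows:          # one execute per row trades speed for simplicity
    cur.execute(sql, row)
conn.commit()             # without a commit the upserts never become visible to other sessions
conn.close()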
I'm dealing with a django-silk issue, trying to figure out why it won't migrate. It says all migrations are complete, and then when I run my code it warns that I still have 8 unapplied migrations, despite double-checking with python manage.py migrate --plan. I'm pretty stumped at this point, so I began setting up queries to just populate the db with the info directly.
And now the query is giving me a syntax error that for the life of me I can't understand! Hoping there are some Postgres masters here who can tell me what I'm missing. Thanks!
Here's the query:
CREATE TABLE IF NOT EXISTS public.silk_request(
id character varying(36) COLLATE pg_catalog."default" NOT NULL,
path character varying(190) COLLATE pg_catalog."default" NOT NULL,
query_params text COLLATE pg_catalog."default" NOT NULL,
raw_body text COLLATE pg_catalog."default" NOT NULL,
body text COLLATE pg_catalog."default" NOT NULL,
method character varying(10) COLLATE pg_catalog."default" NOT NULL,
start_time timestamp with time zone NOT NULL,
view_name character varying(190) COLLATE pg_catalog."default",
end_time timestamp with time zone,
time_taken double precision,
encoded_headers text COLLATE pg_catalog."default" NOT NULL,
meta_time double precision,
meta_num_queries integer,
meta_time_spent_queries double precision,
pyprofile text COLLATE pg_catalog."default" NOT NULL,
num_sql_queries integer NOT NULL,
prof_file character varying(300) COLLATE pg_catalog."default" NOT NULL,
CONSTRAINT silk_request_pkey PRIMARY KEY (id)
) TABLESPACE pg_default;
ALTER TABLE IF EXISTS public.silk_request OWNER TO tapappdbuser;
CREATE INDEX IF NOT EXISTS silk_request_id_5a356c4f_like ON public.silk_request USING btree (id COLLATE pg_catalog."default" varchar_pattern_ops ASC NULLS LAST) TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS silk_request_path_9f3d798e ON public.silk_request USING btree (path COLLATE pg_catalog."default" ASC NULLS LAST) TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS silk_request_path_9f3d798e_like ON public.silk_request USING btree (path COLLATE pg_catalog."default" varchar_pattern_ops ASC NULLS LAST) TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS silk_request_start_time_1300bc58 ON public.silk_request USING btree (start_time ASC NULLS LAST) TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS silk_request_view_name_68559f7b ON public.silk_request USING btree (view_name COLLATE pg_catalog."default" ASC NULLS LAST) TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS silk_request_view_name_68559f7b_like ON public.silk_request USING btree (view_name COLLATE pg_catalog."default" varchar_pattern_ops ASC NULLS LAST) TABLESPACE pg_default;
Thanks!
Update:
Here's the error message. Sorry, I should've included it originally.
ERROR: syntax error at or near "("
LINE 5: ...t" NOT NULL,
CONSTRAINT silk_request_pkey PRIMARY KEY (id))
^
SQL state: 42601
Character: 1015
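One way to narrow this down (not a fix, just a debugging sketch that assumes psycopg2 and that the DDL above is saved to silk_request.sql) is to execute the statements one at a time so Postgres reports exactly which one it rejects:
import psycopg2

# Run each statement separately so the failing statement and its exact text are obvious.
# The naive split on ';' is fine here because none of these statements contain literal semicolons.
with open("silk_request.sql") as f:
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

conn = psycopg2.connect("dbname=tapapp user=tapappdbuser")  # connection details are assumed
conn.autocommit = True  # so one failure does not abort the rest of the run
cur = conn.cursor()
for stmt in statements:
    try:
        cur.execute(stmt)
    except psycopg2.Error as exc:
        print("FAILED:", stmt.splitlines()[0])
        print(exc.pgerror)
cur.close()
conn.close()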
I have a problem storing a PDF file made with the ReportLab library in a MySQL db. Here's my code:
import mysql.connector

def insertIntoDb(pdfFullPath, name, surname, gravity):
    print('PRIMA DEL MYSQL')
    print('pdf full path' + pdfFullPath)
    mydb = mysql.connector.connect(host="localhost", user="root", passwd="", database="deepface")
    with open(pdfFullPath, 'rb') as pdfvar:
        blob = pdfvar.read()
    print(blob)
    sqlQuery = "INSERT INTO diagnosi(name,surname,pdf,gravity) VALUES (%s,%s,%s,%s)"
    mycursor = mydb.cursor()
    val = (name, surname, blob, gravity)
    mycursor.execute(sqlQuery, val)
    mydb.commit()
    mycursor.close()
    mydb.close()
The console says:
mysql.connector.errors.DataError: 1406 (22001): Data too long for column 'pdf' at row 1
I have already set max_allowed_packet in the MySQL configuration file, but when I try to print the PDF (I know that I can't) I get this:
^\\7:[,1qq,N_Sd$dm-:XU2/Pga=O1f/`hY7X1nrca).:_\'-4,n*"L5r,CHFpGo:"E,MDLu7EW%CFF0$Rl?jT\'6%k%,?AF%UK6ojt/c$<^Xh=;VarY:L8cQYTgj/:CfA/j1=dbU#a<:%D;rDV[)WDu)5*"98A5kkfYAqs0FFVZk[*Mb(Rs?hIk
And another thing: how can I store it in my db?
I tried to decode it with base64 but that doesn't work.
> SHOW CREATE TABLE diagnosi;
diagnosi | CREATE TABLE `diagnosi` (
`tempId` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(30) NOT NULL,
`surname` varchar(30) NOT NULL,
`pdf` blob NOT NULL,
`gravity` varchar(50) NOT NULL,
PRIMARY KEY (`tempId`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
The MySQL BLOB datatype is limited to 2^16 bytes (64 KB) in size. The LONGBLOB datatype can hold up to 2^32 bytes (4 GB), so change the column type from BLOB to LONGBLOB.
See the storage requirements for string types in the docs.
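If it helps, here's a minimal sketch of that change run through the same connector as in the question (connection arguments copied from the code above):
import mysql.connector

# Widen the pdf column: BLOB caps out at 64 KB, LONGBLOB allows up to 4 GB.
mydb = mysql.connector.connect(host="localhost", user="root", passwd="", database="deepface")
mycursor = mydb.cursor()
mycursor.execute("ALTER TABLE diagnosi MODIFY pdf LONGBLOB NOT NULL")
mycursor.close()
mydb.close()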
I'm new to Python (3) and would like to know the following:
I'm trying to collect data from a website via pandas and would like to store the results in a MySQL database, like:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("mysql://python:" + 'pw' + "@localhost/test?charset=utf8")
url = r'http://www.boerse-frankfurt.de/devisen'
dfs = pd.read_html(url,header=0,index_col=0,encoding="UTF-8")
devisen = dfs[9] #Select the right table
devisen.to_sql(name='table_fx', con=engine, if_exists='append', index=False)
I'm receiving the following error:
....
_mysql.connection.query(self, query)
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1054, "Unknown column '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tBezeichnung\n\t\t\t\t\t\t\t\n\t\t\t\t' in 'field list'") [SQL: 'INSERT INTO tbl_fx (\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tBezeichnung\n\t\t\t\t\t\t\t\n\t\t\t\t, \n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tzum Vortag\n\t\t\t\t\t\t\t\n\t\t\t\t, \n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tLetzter Stand\n\t\t\t\t\t\t\t\n\t\t\t\t, \n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tTageshoch\n\t\t\t\t\t\t\t\n\t\t\t\t, \n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tTagestief\n\t\t\t\t\t\t\t\n\t\t\t\t, \n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t52-Wochenhoch\n\t\t\t\t\t\t\t\n\t\t\t\t, \n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t52-Wochentief\n\t\t\t\t\t\t\t\n\t\t\t\t, \n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tDatum\n\t\t\t\t\t\t\t\n\t\t\t\t, \nAktionen\t\t\t\t) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)'] [parameters: (('VAE Dirham', '-0,5421%', 45321.0, 45512.0, 45306.0, 46080.0, 38550.0, '20.02.2018 14:29:00', None), ('Armenischer Dram', '-0,0403%', 5965339.0, 5970149.0, 5961011.0, 6043443.0, 5108265.0, '20.02.2018 01:12:00', None), ....
How can SQLAlchemy INSERT the respective data into table_fx? The problem is the header with the multiple \n and \t.
The MySQL table has the following structure:
(
name varchar(10) COLLATE utf8_unicode_ci DEFAULT NULL,
bezeichnung varchar(150) COLLATE utf8_unicode_ci DEFAULT NULL,
diff_vortag varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
last double DEFAULT NULL,
day_high double DEFAULT NULL,
day_low double DEFAULT NULL,
52_week_high double DEFAULT NULL,
52_week_low double DEFAULT NULL,
date_time varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
unnamed varchar(200) COLLATE utf8_unicode_ci DEFAULT NULL
)
Any help is highly welcome.
Thank you very much in advance,
Andreas
This should do it. If you convert it to a DataFrame you can clean up the column names first. The "dfs" object you were creating was actually a list of DataFrames.
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("mysql://python:" + 'pw' + "@localhost/test?charset=utf8")
url = r'http://www.boerse-frankfurt.de/devisen'
dfs = pd.read_html(url,header=0,index_col=0,encoding="UTF-8")
devisen = dfs[9].dropna(axis=0, thresh=4) # Select right table and make a DF
devisen.columns = devisen.columns.str.strip() # Strip extraneous characters
devisen.to_sql(name='table_fx', con=engine, if_exists='append', index=False)
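If the stripped headers still don't line up with the MySQL column names, an explicit rename can bridge the two (a sketch; the mapping below is inferred from the INSERT in the error message and the table definition in the question):
# Map the scraped German headers onto the MySQL column names before writing.
column_map = {
    'Bezeichnung': 'bezeichnung',
    'zum Vortag': 'diff_vortag',
    'Letzter Stand': 'last',
    'Tageshoch': 'day_high',
    'Tagestief': 'day_low',
    '52-Wochenhoch': '52_week_high',
    '52-Wochentief': '52_week_low',
    'Datum': 'date_time',
    'Aktionen': 'unnamed',
}
devisen = devisen.rename(columns=column_map)
devisen.to_sql(name='table_fx', con=engine, if_exists='append', index=False)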
My MySQL table:
CREATE TABLE ref_data (
id BIGINT AUTO_INCREMENT NOT NULL,
symbol VARCHAR(64) NOT NULL,
metadata JSON NOT NULL,
PRIMARY KEY (id)
);
INSERT INTO ref_data (symbol, metadata)
VALUES ('XYZ', '{"currency": "USD", "tick_size": 0.01}');
My Python script:
import mysql.connector
con = mysql.connector.connect(**config)
cur = con.cursor()
cur.execute(
"""
SELECT JSON_EXTRACT(metadata, '$.tick_size')
FROM ref_data
WHERE symbol = 'CL';
""")
And the result is a unicode string:
cur.fetchall()[0][0]
>> u'0.01'
When I run the same query in MySQL Workbench I get a double. I know I could convert the string to a float, but the point of using JSON was flexibility and not having to specify what each column is, etc.
Thanks!
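For what it's worth, here are two common workarounds as a minimal sketch (assuming the same config and ref_data table as above, and querying the 'XYZ' row inserted earlier): cast to a numeric type inside SQL, or parse the returned JSON scalar on the Python side.
import json
import mysql.connector

con = mysql.connector.connect(**config)
cur = con.cursor()

# Option 1: cast inside SQL so the driver hands back a numeric type instead of a JSON scalar.
cur.execute(
    """
    SELECT CAST(JSON_EXTRACT(metadata, '$.tick_size') AS DECIMAL(18, 8))
    FROM ref_data
    WHERE symbol = 'XYZ';
    """)
print(cur.fetchall()[0][0])  # 0.01000000 (a decimal.Decimal)

# Option 2: parse the extracted JSON scalar back into a Python number.
cur.execute(
    """
    SELECT JSON_EXTRACT(metadata, '$.tick_size')
    FROM ref_data
    WHERE symbol = 'XYZ';
    """)
print(json.loads(cur.fetchall()[0][0]))  # 0.01
con.close()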