I want to detect the different dataframes in an Excel file, give each detected dataframe an ID, and store each dataframe as an object/blob in an Oracle database.
So the DB table would look like:
DF_ID | DF_BLOB
1     | <blob for df 1>
2     | <blob for df 2>
I know how to store an entire Excel file as a BLOB in Oracle (basically store excelfile.read() directly),
but I cannot read() or open() a pandas DataFrame in the same way. So how can I store the DataFrame object as a BLOB?
The go-to library for storing Python objects in a binary format is pickle.
To get a byte string instead of writing to a file, use pickle.dumps():
pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
Return the pickled representation of the object obj as a bytes object, instead of writing it to a file.
Arguments protocol, fix_imports and buffer_callback have the same meaning as in the Pickler constructor.
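As a rough sketch of how this fits together with the Oracle side (assuming the python-oracledb driver and a hypothetical DF_STORE table with the DF_ID and DF_BLOB columns from the question; connection details are placeholders):
import pickle
import pandas as pd
import oracledb  # assumption: the python-oracledb driver; cx_Oracle is very similar

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})  # stand-in for one detected dataframe

conn = oracledb.connect(user="scott", password="tiger", dsn="localhost/orclpdb1")

# Serialize the dataframe to bytes and bind it into the BLOB column.
# For very large frames you may need cur.setinputsizes(None, oracledb.DB_TYPE_BLOB).
blob_bytes = pickle.dumps(df)
with conn.cursor() as cur:
    cur.execute("INSERT INTO df_store (df_id, df_blob) VALUES (:1, :2)", [1, blob_bytes])
conn.commit()

# Read it back and unpickle
with conn.cursor() as cur:
    cur.execute("SELECT df_blob FROM df_store WHERE df_id = :1", [1])
    lob = cur.fetchone()[0]
    data = lob.read() if hasattr(lob, "read") else lob  # BLOBs may come back as LOB objects or bytes
    df_restored = pickle.loads(data)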
Related
I'm writing some pretty simple code to parse a CSV, apply a schema, save the file on GCP as Parquet, and then use BigQuery to load many Parquet files from a single folder as the same table. So I read the CSV in using pandas and apply the schema like this:
df = pd.read_csv(in_file, dtype=pd_schema, parse_dates=date_vars)
But there's an issue when a string column is all nulls in one CSV and has values in another. I'm not sure whether this is happening on the Parquet side or the BigQuery side, but I get the error:
400 Error while reading table: temp, error message: Parquet column 'blah' has type BYTE_ARRAY which does not match the target cpp_type INT64. File: gs://blah/blah.parquet
It seems an all-None object column is given the type INT64, while an object column with strings gets BYTE_ARRAY. Is there a way I can see whether this issue is happening at the Parquet level or the BigQuery level? And is there a way I can be more specific in applying a schema? The base issue seems to be that pandas assigns the dtype object when I specify str; is there maybe a pyarrow datatype I can apply that will be more specific? Thanks!
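One way to check this (a sketch, not from the original thread; file and column names are hypothetical, with 'blah' standing in for the all-null string column) is to pin the Arrow type explicitly when writing and then inspect the schema that actually landed in the Parquet file:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv("in_file.csv", dtype={"blah": "string"})

# Pin the Arrow type so an all-null column is still written as a string column
arrow_schema = pa.schema([("blah", pa.string())])
table = pa.Table.from_pandas(df[["blah"]], schema=arrow_schema, preserve_index=False)
pq.write_table(table, "out.parquet")

# Inspect what is actually in the file; this tells you whether the mismatch
# happens at the Parquet level or only when BigQuery loads it
print(pq.read_schema("out.parquet"))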
I've successfully set up a Kafka instance at my job and I've been able to pass simple 'Hello World' messages through it.
However, I'm not sure how to do more interesting things. I've got a CSV containing four records from a DB that I'm trying to move through Kafka, then pull into a DataFrame on the other side and save as a CSV again.
producer = KafkaProducer(bootstrap_servers='my-server-id:443',
....
df = pd.read_csv('data.csv')
df = df.to_json()
producer.send(mytopic, df.encode('utf8'))
On the consumer side, this comes back as a tuple (consumer record object, bool) that contains a list of my data. I can access the data as:
msg[0][0][6].decode('utf8')
But that comes in as a single string that I can't simply pass to a dataframe (it just merges everything into one thing).
I'm not sure if I even need a dataframe or a to_json() method or anything. I'm really just not sure how to organize the data so I can send it properly, then get it back and feed it into a dataframe so that I can either a) save it to a CSV or b) insert the dataframe back into a DB with to_sql.
Kafka isn't really suited to sending entire matrices/dataframes around.
You can send a list of CSV rows, JSON arrays, or preferably some other compressible binary data format such as Avro or Protobuf as whole objects. If you are working exclusively in Python, you could pickle the data you send and receive.
When you read the data, you must deserialize it, but how you do that is ultimately your choice; there is no simple answer for any given application.
The solution, for this one case, would be json_normalize followed by to_csv. And I would like to point out that Kafka isn't required for you to test that, as you definitely should be writing unit tests for this logic anyway...
import json
import pandas as pd

df = pd.read_csv('data.csv')
jdf = df.to_json(orient='records')  # a JSON array of row objects round-trips cleanly
msg_value = jdf  # pretend you got a message from Kafka, as a JSON string
df = pd.json_normalize(json.loads(msg_value))  # parse the JSON, then back to a dataframe
df.to_csv('out.csv', index=False)
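For the Kafka side itself, a minimal consumer sketch (assuming kafka-python and a producer that sends df.to_json(orient='records').encode('utf-8'); the topic name and server are placeholders taken from the question):
import json
import pandas as pd
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'mytopic',
    bootstrap_servers='my-server-id:443',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for msg in consumer:
    records = msg.value                 # already a list of dicts after deserialization
    df = pd.json_normalize(records)     # back to a dataframe
    df.to_csv('out.csv', index=False)   # or df.to_sql(...) to push it back into a DB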
I receive a JSON file every day with 10 attributes, but on some days, if an attribute has no value, they send only 9 attributes and the 10th is not present in the JSON. How can I read the JSON file in PySpark without changing the old table schema?
It seems like you should enforce a schema when reading the files.
I'm assuming you have something like this:
df = spark.read.json(path_to_json_files)
In order to preserve all the attributes/fields, use the schema like so:
df = spark.read.schema(file_schema).json(path_to_json_files)
To get file_schema you can use an old file (or files) where you know every attribute is present:
file_schema = spark.read.json(full_json_file).schema
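If re-inferring the schema from an old file every run is too slow, one option (a sketch, not from the original answer; the schema file name is hypothetical) is to persist the inferred schema as JSON and reload it:
import json
from pyspark.sql.types import StructType

# Save the inferred schema once
with open("file_schema.json", "w") as f:
    f.write(file_schema.json())

# Later runs: reload it and apply it; missing attributes simply come back as null
with open("file_schema.json") as f:
    file_schema = StructType.fromJson(json.load(f))

df = spark.read.schema(file_schema).json(path_to_json_files)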
I have web-scraped some data into a pickle file and want to write that data into a sqlite3 database. Can anybody help me out with what needs to be done?
You need to create a column of type BLOB (which is a supported datatype in Sqlite3).
Then, you can just INSERT INTO data (id, content) VALUES (?, ?) with the binary dump of your pickle object.
Here's a walk-through on sqlite3 inserts.
pickle.dumps will convert the object into a byte-string you can store in the database. pickle.loads will turn it back after being SELECT'd from the database.
You can also consider using dill for more complex objects.
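Putting those pieces together, a minimal sketch (the database and pickle file names are hypothetical; the id/content columns follow the INSERT statement above):
import pickle
import sqlite3

conn = sqlite3.connect("scraped.db")
conn.execute("CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY, content BLOB)")

# Assume the scraped object was loaded from your existing pickle file
with open("scraped.p", "rb") as f:
    scraped = pickle.load(f)

# Store the pickled bytes as a BLOB
conn.execute(
    "INSERT INTO data (id, content) VALUES (?, ?)",
    (1, sqlite3.Binary(pickle.dumps(scraped))),
)
conn.commit()

# Read it back and unpickle
row = conn.execute("SELECT content FROM data WHERE id = ?", (1,)).fetchone()
restored = pickle.loads(row[0])
conn.close()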
I am fairly new to Python (and handling files). I am using pandas and storing a dataframe in a text file.
My program requires constant changes to the dataframe, which in turn need to be updated in the text file.
Writing the whole dataframe over and over again would not be efficient (I guess, given that I may want to update only a single cell), and appending data would mean adding the whole dataframe again (which is not what I want).
And then there is the binary file option: should I store it that way, open it, edit it as a normal Python object, and have the changes reflected back in the file?
How do I achieve this?
Setting aside the discussion of whether or not you should use a database, it seems what you need is a quick way to save the DataFrame and read it back again.
You can do it with pickle. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
import pickle

# Save the DataFrame
with open("dataFrame.p", "wb") as f:
    pickle.dump(df, f)

# Load the DataFrame
with open("dataFrame.p", "rb") as f:
    df_read = pickle.load(f)
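pandas also exposes the same idea directly, if you prefer not to manage the file handles yourself (assuming pandas is imported as pd, as in the earlier snippets):
df.to_pickle("dataFrame.p")               # save
df_read = pd.read_pickle("dataFrame.p")   # load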