not getting integer type in elastic - python

I have been trying this for hours now, but it is not working as expected.
I am pushing data to Elasticsearch via a Python script. Below are some fields I want stored as integers, but they are not. The source values are sometimes None and otherwise strings, so I did this:
body['fuel_fee'] = int(rows[a][23] or 0)
body['late_fee'] = int(rows[a][24] or 0)
body['other_fee'] = int(rows[a][26] or 0)
But I see that they are still being stored as strings in Elasticsearch, and I want to run sum aggregations on these fields.
I even deleted the index and re-indexed the entire data set, so I can confirm there is no issue with previous mappings here.
Why am I not getting these fields as integers? How can I get this done?
EDIT: I am fetching the data from a Postgres database, where these fields are stored as strings, not integers. Can that have any effect? I think not, since I am type casting here in Python.

The datatype of a field is determined in one of the following ways:
When you create mappings (before indexing any real data) and explicitly tell Elasticsearch about the field type. In your example, the field fuel_fee will be mapped to long, and any record containing non-integral values will throw an error.
Based on the first document indexed, Elasticsearch determines the field type and tries to convert the field values of subsequent documents to that type.
Coming back to your question: how do you know that all your fields are stored as strings and not integers? Try GET <your-index>/_mapping and check whether your assumption is correct.
If the problem persists, try either of the following:
Create mappings before indexing any data (see the sketch after this list).
Index only one document (with Kibana or through the curl API) and check the mapping output again.
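With the 8.x Python Elasticsearch client, creating the mapping up front could look like the sketch below; the index name my-index and the host are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust for your cluster

# Declare the fee fields as integers before indexing any documents,
# so the first document cannot lock them in as strings.
es.indices.create(
    index="my-index",
    mappings={
        "properties": {
            "fuel_fee": {"type": "integer"},
            "late_fee": {"type": "integer"},
            "other_fee": {"type": "integer"},
        }
    },
)

# Equivalent to GET my-index/_mapping
print(es.indices.get_mapping(index="my-index"))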

Related

When converting Python Object to Dataframe, output is different

I am pulling data from an API, converting it into a Python object, and attempting to convert it to a DataFrame. However, when I unpack the Python object, I get more rows than are in the object.
You can see in my DataFrame that there are multiple rows for 2023-02-03. One row seems to give me the correct data while the other gives me random data. I am not sure where the extra row is coming from; I'm wondering whether it has something to do with the null values or whether I am not unpacking the Python object correctly.
My code
I double checked the raw data from the JSON response and don't see the extra values there. On Oura's UI, I checked the raw data and didn't notice anything there either.
Here's an example of what my desired output would look like (screenshot omitted).
Can anyone identify what I might be doing wrong?

MongoDB: get new values from a collection without a timestamp

I want to fetch newly added documents from a MongoDB collection that has no timestamp field. I guess my only choice is to use the ObjectId field. I am using the test dataset from GitHub: "https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json"
For example, if I add new data to this collection, how do I fetch or find these new values?
Some MongoDB collections have a timestamp field, and I use that value to get new documents. But I do not know how to do this without a timestamp.
Example dataset (screenshot omitted).
I want a filter like this, but it doesn't work:
{_id: {$gt: '622e04d69edb39455e06d4af'}}
If you don't want to create a new field in the document, you can keep a small in-memory list of the most recently inserted IDs:
SomeGlobalObj = []                          // holds the most recent ObjectIds (length limit is 10)
// You will need Redis or other outside storage if you run multiple servers.
SomeGlobalObj.unshift(newDocumentId)        // add the newest document's _id to the front
SomeGlobalObj = SomeGlobalObj.slice(0, 10)  // keep only the latest 10 IDs
Now, if you want to retrieve the latest documents, you can use this array.
If whatever you consider "new" should disappear once it has been checked, you can remove its ID from this array after the query.
In the comments you mentioned that you want to do this using Python, so I shall answer from that perspective.
In Mongo, an ObjectId is composed of 3 sections:
a 4-byte timestamp value, representing the ObjectId's creation, measured in seconds since the Unix epoch
a 5-byte random value generated once per process. This random value is unique to the machine and process.
a 3-byte incrementing counter, initialized to a random value
Because of this, we can use the ObjectId to sort or filter by creation timestamp. To construct an ObjectId for a specific date, we can use the following code:
import datetime
from bson.objectid import ObjectId

gen_time = datetime.datetime(2010, 1, 1)
dummy_id = ObjectId.from_datetime(gen_time)
result = collection.find({"_id": {"$lt": dummy_id}})
Source: objectid - Tools for working with MongoDB ObjectIds
This example will find all documents created before 2010/01/01. Substituting $gt would allow this query to function as you desire.
If you need to get the timestamp from an ObjectId, you can use the following code:
timestamp = myObjectId.generation_time
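Putting this together, here is a sketch with pymongo that returns only the documents inserted after the newest _id already seen. The connection string, database, and collection names are placeholders, and it assumes the documents use default client-generated ObjectIds (which embed a timestamp).
import datetime

from bson.objectid import ObjectId
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # adjust the connection string
collection = client["test"]["restaurants"]         # e.g. the primer-dataset collection

# Start from a synthetic ObjectId for a known date, then remember the newest _id seen.
last_seen_id = ObjectId.from_datetime(datetime.datetime(2022, 3, 1))

def fetch_new_documents():
    """Return documents inserted after last_seen_id, oldest first."""
    global last_seen_id
    new_docs = list(collection.find({"_id": {"$gt": last_seen_id}}).sort("_id", 1))
    if new_docs:
        last_seen_id = new_docs[-1]["_id"]
    return new_docs

for doc in fetch_new_documents():
    print(doc["_id"], doc.get("name"))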

How to get only the errors from the insert_rows_from_dataframe method in the BigQuery Client?

I am using the client.insert_rows_from_dataframe method to insert data into my table.
obj = client.insert_rows_from_dataframe(table=TableRef, dataframe=df)
If there are no errors, obj will be a list of empty lists, like:
> print(obj)
[[], [], []]
But how do I get the error messages out if some rows fail while inserting?
I tried
obj[["errors"]]
but that is not correct. Please help.
To achieve the results that you want, your DataFrame must have a header identical to the one in your schema. For example, if your schema in BigQuery has the fields index and name, your DataFrame should have these two columns.
Let's take a look at the example below:
I created a table in BigQuery named insert_from_dataframe which contains the fields index, name and number, respectively INTEGER, STRING and INTEGER, all of them REQUIRED.
In the first screenshot (omitted here) you can see that the insertion caused no errors; in the second, that the data was inserted successfully.
After that, I removed the value of the number column from the last row of the same data. When I tried to push it to BigQuery again, I got an error.
Given that, I would like to reinforce two points:
The error structure that is returned is a list of lists ([[], [], [], ...]). The reason is that your data is pushed in chunks (subsets of your data). In the function used you can specify how many rows each chunk will have with the parameter chunk_size=<number_of_rows>. Suppose your data has 1600 rows and your chunk size is 500: the data will be divided into 4 chunks, and the object returned after the insert request will consist of 4 lists inside a list, where each of the four lists relates to one chunk. It is also important to say that if a row fails, none of the rows inside the same chunk will be inserted into the table (see the sketch after this list).
If you are using string fields you should pay attention to the data being inserted. Sometimes Pandas reads null values as empty strings, which leads to a misinterpretation of the data by the insertion mechanism. In other words, you may end up with empty strings inserted into your table where the expected result would be an error saying that the field cannot be null.
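The sketch below shows one way to flatten and inspect the per-chunk error lists; the table id my-project.my_dataset.insert_from_dataframe and the DataFrame contents are illustrative.
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
# Fetch the table so its schema is available to the insert call.
table = client.get_table("my-project.my_dataset.insert_from_dataframe")

df = pd.DataFrame({"index": [1, 2], "name": ["a", "b"], "number": [10, 20]})

# One list of row errors is returned per chunk.
chunks = client.insert_rows_from_dataframe(table=table, dataframe=df, chunk_size=500)

errors = [err for chunk in chunks for err in chunk]  # flatten the per-chunk lists
if errors:
    for err in errors:
        # Each entry is a mapping with the index of the failing row and the reasons it was rejected.
        print(err)
else:
    print("All rows inserted successfully.")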
Finally, I would like to post here some useful links for this problem:
BigQuery client documentation
Working with missing values in Pandas
I hope it helps.

Using sqlalchemy, how to query if an entry has a column with a specific Numeric type value?

I'm using python and sqlalchemy.
I have a db with a column that is Numeric type.
I want to query the db to check if there is an entry that has a specific value in the column.
Let's assume the value I want to look for is 1.04521, and we know it's in the db.
I've tried
(result,) = session.query(exists().where(MyEntryClass.someNumericValue == 1.0452))
but result is still False even when I know it's in the db.
How do I check to see if there is an entry with a column with a specific Numeric value?
EDIT (added after the original question):
After a little more exploration, I think it's due to rounding/representation of the non-integer number.
Actually you are not getting the result because the query is never executed; query executors such as all(), first() and scalar() do that, so use scalar():
result = session.query(exists().where(MyEntryClass.someNumericValue==1.0452)).scalar()
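Below is a self-contained sketch of the corrected query; the MyEntryClass model, the in-memory SQLite engine, and the Decimal literal are illustrative assumptions.
from decimal import Decimal

from sqlalchemy import Column, Integer, Numeric, create_engine, exists
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class MyEntryClass(Base):  # hypothetical model matching the question
    __tablename__ = "my_entry"
    id = Column(Integer, primary_key=True)
    someNumericValue = Column(Numeric(10, 5))

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(MyEntryClass(someNumericValue=Decimal("1.04521")))
    session.commit()

    # scalar() executes the EXISTS query and returns the boolean result;
    # comparing against a Decimal avoids binary-float rounding surprises
    # on backends with native decimal support.
    result = session.query(
        exists().where(MyEntryClass.someNumericValue == Decimal("1.04521"))
    ).scalar()
    print(result)  # True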

What model should a SQLalchemy database column be to contain an array of data?

So I am trying to set up a database whose rows will be modified frequently. Every hour, for instance, I want to append a number to a particular field. So if self.checkmarks is stored in the database as 3, what is the best way to update this field by adding another number so that self.checkmarks now equals 3, 2? I tried declaring the column as db.Array but got an attribute error:
AttributeError: 'SQLAlchemy' object has no attribute 'Array'
I have found how to update a database, but I do not know the best way to update by adding to a list rather than replacing. My approach was as follows, but I don't think append will work because the column cannot be an array:
ven = data.query.filter_by(venid=ven['id']).first()
ven.totalcheckins = ven.totalcheckins.append(ven['stats']['checkinsCount'])
db.session.commit()
Many thanks in advance
If you really want to have a Python list as a column in SQLAlchemy you will want to have a look at the PickleType:
array = db.Column(db.PickleType(mutable=True))
Please note that you will have to use the mutable=True parameter to be able to edit the column. SQLAlchemy will detect changes automatically and they will be saved as soon as you commit them.
If you want the pickle to be human-readable you can combine it with json or other converters that suit your purposes.
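Note that newer SQLAlchemy releases no longer accept a mutable= flag on PickleType; the mutation-tracking extension plays that role instead. Here is a minimal sketch of that approach, with an illustrative Venue model standing in for the question's table.
from sqlalchemy import Column, Integer, PickleType, create_engine
from sqlalchemy.ext.mutable import MutableList
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Venue(Base):  # hypothetical model, analogous to the question's table
    __tablename__ = "venue"
    id = Column(Integer, primary_key=True)
    # MutableList tracks in-place changes such as append(), so they are
    # flushed on commit without reassigning the attribute.
    totalcheckins = Column(MutableList.as_mutable(PickleType))

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    ven = Venue(totalcheckins=[3])
    session.add(ven)
    session.commit()

    ven.totalcheckins.append(2)  # detected by MutableList
    session.commit()
    print(ven.totalcheckins)     # [3, 2]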
