Cleansing Data with Updates - MongoDB + Python

I have imported the dataset into MongoDB but am not able to cleanse the data in Python. Please see the question and the script below; I need the answers for Script 1 & 2.
The task is to import it into MongoDB, cleanse the data in Python, and update MongoDB with the cleaned data. Specifically, you'll be taking a people dataset where some of the birthday fields look like this:
{
    ...
    "birthday": ISODate("2011-03-17T11:21:36Z"),
    ...
}
And other birthday fields look like this:
{
    ...
    "birthday": "Thursday, March 17, 2011 at 7:21:36 AM",
    ...
}
MongoDB natively supports a Date datatype through BSON. This datatype is used in the first example, but a plain string is used in the second. In this assessment, you'll complete the attached notebook to script a fix that makes every document's birthday field a Date.
Download the notebook and dataset to your notebook directory. Once you have the notebook up and running, and after you've updated your connection URI in the third cell, continue through the cells until you reach the fifth cell, where you'll import the dataset. This can take up to 10 minutes depending on the speed of your Internet connection and the computing power of your computer.
After verifying that all of the documents have successfully been inserted into your cluster, you'll write a query in the 7th cell to find all of the documents that use a string for the birthday field.
To verify your understanding of the first part of this assessment, how many documents had a string value for the birthday field (the output of cell 8)?
Script 1
# Replace YYYY with a query on the people-raw collection that will return a cursor
# with only documents where the birthday field is a string
people_with_string_birthdays = YYYY

# This is the answer to verify you completed the lab:
people_with_string_birthdays.count()
Script 2
from pymongo import UpdateOne
import dateparser

updates = []
count = 0          # running count of queued updates
batch_size = 1000  # assumed; the notebook defines the batch size in an earlier cell

# Again, we're updating several thousand documents, so this will take a little while
for person in people_with_string_birthdays:
    # PyMongo converts datetime objects into BSON Dates. The dateparser.parse function
    # returns a datetime object, so we can simply do the following to update the field
    # properly. Replace ZZZZ with the correct update operator.
    updates.append(UpdateOne(
        {"_id": person["_id"]},
        {ZZZZ: {"birthday": dateparser.parse(person["birthday"])}}
    ))
    count += 1
    if count == batch_size:
        people_raw.bulk_write(updates)
        updates = []
        count = 0

if updates:
    people_raw.bulk_write(updates)
    count = 0

# If everything went well this should be zero
people_with_string_birthdays.count()

import json

with open("./people-raw.json") as dataset:
    array = {}  # maps the Python type of the birthday value to its count
    for i in dataset:
        a = json.loads(i)
        if type(a["birthday"]) not in array:
            array[type(a["birthday"])] = 1
        else:
            array[type(a["birthday"])] += 1
    print(array)
Pass the path of your people-raw.json file to open() if the JSON file is not in the same directory.
Ans : 10382

Script 1: YYYY = people_raw.find({"birthday": {"$type": "string"}}) (note the field name is lowercase "birthday" in the dataset, not "Birthday")
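Script 2: ZZZZ is the update operator that replaces a field's value, which in MongoDB is $set, so the update becomes:

updates.append(UpdateOne(
    {"_id": person["_id"]},
    {"$set": {"birthday": dateparser.parse(person["birthday"])}}
))

After the bulk writes finish, re-running the Script 1 query should find no remaining string birthdays, so the final count() is zero.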

Related

Converting InfluxQL v1 query to Flux query in python -- getting last reading for every key-value tag

So I am new to InfluxDB (v1) and even newer to InfluxDB v2 and Flux. Please don't be an arse as I am really trying hard to get my python code working again.
Recently, I upgraded my InfluxDB database from v1.8 to 2.6. This has been an absolute challenge, but I think I have things working for the most part (at least inserting data back into the database). Reading items out of the database, however, has been especially challenging, as I can't get my python code to work.
This is what I previously used in my python code when I was running InfluxDB 1.8 and using InfluxQL. Essentially I need to convert these InfluxQL queries to Flux and get the expected results:
meterids = influx_client.query('show tag values with key = "meter_id"')
metervalues = influx_client.query('select last(reading) from meter_consumption group by *;')
With InfluxDB v2.6 I must use Flux queries. For meterids I do the following and it seems to work. (This took me days to figure out.)
meterid_list = []
query_api = influx_client.query_api()
querystr = 'from(bucket: "rtlamr_bucket") \
    |> range(start: -1h) \
    |> keyValues(keyColumns: ["meter_id"])'
# this gives a bunch of meter ids but formatted like [('reading', '46259044'), ('reading', '35515159'), ...]
result = query_api.query(query=querystr)
for table in result:
    for record in table.records:
        meterid_list.append(record.get_value())
print('This is meterids: %s' % (meterid_list))
But when I try to pull the actual last readings / values for each meter_id (the meter_consumption), I can't seem to get any Flux query to work. This is what I currently have:
# metervalues = influx_client.query('select last(reading) from meter_consumption group by *;')
metervalues_list = []
querystrconsumption = 'from(bucket: "rtlamr_bucket") \
    |> range(start: -2h) \
    |> filter(fn: (r) => r._measurement == "meter_consumption") \
    |> group(columns: ["_time"], mode: "by") \
    |> last()'
resultconsumption = query_api.query(query=querystrconsumption)
for tableconsumption in resultconsumption:
    for recordconsumption in tableconsumption.records:
        metervalues_list.append(recordconsumption.get_value())
print('\n\nThis is metervalues: %s' % (metervalues_list))
Not sure if this will help, but in v1.8 of InfluxDB these were my measurements, tags and fields:
Time: timestamp
Measurement: consumption <-- consumption is the "measurement name"
Key-Value Tags (meta): meter_id, meter_type
Key-Value Fields (data): <meter_consumption in gal, ccf, etc.>
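Given that schema, a Flux query along these lines may be closer to the old GROUP BY behavior: group by the meter_id tag instead of _time, then take last() per group (a sketch, assuming the measurement is named meter_consumption as in the queries above):

querystrlast = 'from(bucket: "rtlamr_bucket") \
    |> range(start: -2h) \
    |> filter(fn: (r) => r._measurement == "meter_consumption") \
    |> group(columns: ["meter_id"]) \
    |> last()'
resultlast = query_api.query(query=querystrlast)
for table in resultlast:
    for record in table.records:
        # record.values is a dict of all columns, including the tag
        print(record.values.get("meter_id"), record.get_value())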
Any thoughts, suggestions or corrections would be most greatly appreciated. Apologies if I am not using the correct terminology. I have tried reading tons of google articles but I can't seem to figure this one out. :(

Searching sub string value in Numpy Array

First of all, I am using Python and the latest PyCharm Community edition.
I am currently working on a user interface with tkinter which requests several values from the user - two string values and one integer. Afterwards the program should search through an Excel or CSV file to find those values. Unfortunately, I am currently stuck on the first entry. I've created a numpy array out of the dataframe, since I've read that arrays are much faster when it comes to working with large data. The final excel/csv file I am working with will contain several thousand rows and up to 60 columns. In addition, the entry_name could be a sub string of a bigger string, and the search algorithm should find the fraction or the full name (example: entry: "BMW", in array([["BMW Werk", "BMW-Automobile", "BMW_Client"], ["BMW Part1", "BMW Part2", "XS-12354"]])). Afterwards I would like to proceed with other calculations based on the values in the array.
Example: entry: "BMW", in array([["BMW Werk", "Car1", "XD-12345"], ["BMW Part1", "exauster", "XS-12354"]])
The program finds "BMW Werk" and "BMW Part1" in the array and returns ["BMW Werk", "Car1", "XD-12345"] and ["BMW Part1", "exauster", "XS-12354"]
entry_name = "BMW"
path_for_excel = r"D:\Python PyCharm\Tool\Clientlist.xlsx"  # raw string for the backslashes
client_list_df = pd.read_excel(path_for_excel, engine="openpyxl")
client_list_array = client_list_df.to_numpy()

# first check if entry_name is populated (entry field in UI)
if entry_name:
    # search for the sub string at the start of each cell
    part_string_value = np.char.startswith(client_list_array.astype(str), entry_name)
    if part_string_value.any():
        index = np.where(part_string_value)
        # print found values, including the other values in the row
        print(client_list_array[index[0]])
I am able to retrieve the requested values if the client is using the correct, full name like "BMW Werk", but any typo will hinder the process, and it is very exhausting for some names to type the full name; as an example, one name looks like: "BMW Werk Bloemfontein, 123-45, Willows".
Hopefully, somebody finds time to help with my issue.
Thank you !
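One way to match the sub string anywhere in a cell rather than only at the start (a sketch based on the example above; np.char.find returns -1 where the sub string is absent):

import numpy as np

entry_name = "BMW"
client_list_array = np.array([["BMW Werk", "Car1", "XD-12345"],
                              ["BMW Part1", "exauster", "XS-12354"]])

# boolean mask of cells that contain entry_name anywhere
mask = np.char.find(client_list_array.astype(str), entry_name) >= 0
# keep the full rows in which at least one cell matched
matching_rows = client_list_array[mask.any(axis=1)]
print(matching_rows)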

Optimize Elasticsearch update script body using python bulk update

Consider this script:
import hashlib

hashed_ids = [hashlib.md5(doc.encode('utf-8')).hexdigest() for doc in shingles]
update_by_query_body = {
    "query": {
        "terms": {
            "id": ["id1", "id2"]
        }
    },
    "script": {
        "source": "long weightToAdd = params.hashed_ids.stream().filter(idFromList -> ctx._source.id.equals(idFromList)).count(); ctx._source.weight += weightToAdd;",
        "params": {
            "hashed_ids": ["id1", "id1", "id1", "id2"]
        }
    }
}
This script does what it's supposed to do, but it is very slow and from time to time raises a timeout error.
What is it supposed to do?
I need to update a field of a doc in Elasticsearch and add the count of that doc from a list inside the python code. The weight field contains the count of the doc in a dataset. The dataset needs to be updated from time to time, so the count of each document must be updated too. hashed_ids is a list of document ids that are in the new batch of data; the weight of each matched id must be increased by the count of that id in hashed_ids.
For example, let's say a doc with id=d1b145716ce1b04ea53d1ede9875e05a and weight=5 is already present in the index, and the string d1b145716ce1b04ea53d1ede9875e05a is repeated three times in hashed_ids. The update_by_query will match the doc in the database, and I need to add 3 to 5 and end up with 8 as the final weight.
I need ideas to improve the efficiency of the code.
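One direction to try (a sketch, not a verified fix, and it assumes the Elasticsearch _id matches the id field; otherwise keep update_by_query but pass a precomputed count as a param): count each id once in Python with collections.Counter, then send one small scripted update per document through the bulk helper, so every painless script just adds a number instead of streaming over the whole hashed_ids list.

from collections import Counter
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()          # assumes a default local connection
counts = Counter(hashed_ids)  # id -> number of occurrences in the new batch

actions = (
    {
        "_op_type": "update",
        "_index": "my_index",  # hypothetical index name
        "_id": doc_id,
        "script": {
            "source": "ctx._source.weight += params.n",
            "params": {"n": n},
        },
    }
    for doc_id, n in counts.items()
)
helpers.bulk(es, actions)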

Refining a Python finance tracking program

I am trying to create a basic program that tracks retirement finances. What I have got so far below takes input for ONE entry and stores it. The next time I run it, the previous values get wiped out. Ideally what I would like is a program that appends indefinitely to a list; if I open it up 2 weeks from now, I'd like to be able to see current data in dict format, plus add to it. I envision running the script, entering account names and balances, then closing it, and doing it again at a later point.
A few questions:
How do I achieve that? I think I need some loop concept to get there
Is there a more elegant way to enter the Account Name and Balance, rather than hardcoding it in the parameters like I have below? I tried input() but it only runs for the Account Name, not Balance (again, maybe loop related)
I'd like to add some error checking, so if the user doesn't enter a valid account, say (HSA, 401k or Roth) they are prompted to re-enter. Where should that input/check occur?
Thanks!
from datetime import datetime

Account = {
    "name": [],
    "month": [],
    "day": [],
    "year": [],
    "balance": []
}
finance = [Account]

def finance_data(name, month, day, year, balance):
    Account['name'].append(name)
    Account['month'].append(month)
    Account['day'].append(day)
    Account['year'].append(year)
    Account['balance'].append(balance)
    print(finance)

finance_data('HSA',
             datetime.now().month,
             datetime.now().day,
             datetime.now().year,
             500)
When you run a script and put values in variables defined in the code, the values only last for however long the program runs. Each time you run the script, it will start over from the initial state defined in the code and, thus, not save state from the last time you ran the script.
What you need is persistent data that lasts beyond the runtime of the script. Normally, we accomplish this by creating a database, using the script to write new data to the database, and then, when the script next runs, reading the old values from the database to remember what happened in the past. However, since your use case is small, it probably doesn't need a full-blown database system. Instead, I would recommend writing the data to a text file and then reading from the text file to get the old data. You could do that as follows:
# read about file IO in Python here: https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
dataStore = open("dataFile.txt", "r+")  # r+ is the read and write mode

def loadDataToAccount(dataStore):
    Account = {
        "name": [],
        "month": [],
        "day": [],
        "year": [],
        "balance": []
    }
    for line in dataStore.read().splitlines():
        (name, month, day, year, balance) = line.split("|")
        Account['name'].append(name)
        Account['month'].append(month)
        Account['day'].append(day)
        Account['year'].append(year)
        Account['balance'].append(balance)
    return Account

Account = loadDataToAccount(dataStore)
Here I am assuming that we organize the text file so that each line is an entry and the entry is "|" separated such as:
bob|12|30|1994|500
rob|11|29|1993|499
Thus, we can parse the text into the Account dictionary. Now, let's look at entering the data into the text file:
def addData(Account, dataStore):
    name = raw_input("Enter the account name: ")
    balance = raw_input("Enter the account balance: ")
    # put name and balance validation here!
    month = datetime.now().month
    day = datetime.now().day
    year = datetime.now().year
    # add to Account
    Account['name'].append(name)
    Account['month'].append(month)
    Account['day'].append(day)
    Account['year'].append(year)
    Account['balance'].append(balance)
    # also add to our dataStore (month/day/year are ints, so convert them to str before joining)
    dataStore.write(name + "|" + str(month) + "|" + str(day) + "|" + str(year) + "|" + balance + "\n")

addData(Account, dataStore)
Notice here how I wrote it to the dataStore in the expected format that I defined for reading it back in. Without writing it to the text file, the data will not be saved and available for the next time you run the script.
Also, I used raw_input to get the name and balance so that it is more dynamic. After collecting the input, you can put an if statement to make sure it is a valid name and then use some sort of while loop to keep asking for the name until they enter a valid one.
You would probably want to extract the code that adds the values to Account and put that in a helper function since we use the same code twice.
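For instance, that shared helper might look like this (a sketch, reusing the names above):

def appendToAccount(Account, name, month, day, year, balance):
    # single place that mutates the Account dictionary
    Account['name'].append(name)
    Account['month'].append(month)
    Account['day'].append(day)
    Account['year'].append(year)
    Account['balance'].append(balance)

Both loadDataToAccount and addData could then call appendToAccount instead of repeating the five append lines.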
Good luck!

Scraping blog and saving date to database causes DateError: unknown date format

I am working on a project where I scrape a number of blogs and save a selection of the data to a SQLite database, such as the title of the post, the date it was posted, and the content of the post.
The goal in the end is to do some fancy textual analyses, but right now I have a problem with writing the data to the database.
I work with the pattern library for Python (the module about databases can be found here).
I am busy with the third blog now. The data from the two other blogs is already saved in the database, and for the third blog, which is similarly structured, I adapted the code.
There are several functions, well integrated with each other, and they work fine. I also get access to all the data the right way when I try it out in IPython Notebook. When I ran the code as a trial in the Console for only one blog page (there are 43 altogether), it also worked and saved everything nicely in the database. But when I ran it again for all 43 pages, it threw a date error.
There are some comments and print statements inside the functions now which I used for debugging. The problem seems to happen in the function parse_post_info, which passes a dictionary on to the function that goes over all blog pages and opens every single post; that function saves the dictionary parse_post_info returns IF it is not None, but I think it IS empty because something about the date format goes wrong.
Also, why does the code work once, while the same code throws a date error the second time?
DateError: unknown date format for '2015-06-09T07:01:55+00:00'
Here is the function:
from pattern.db import Database, field, pk, date, STRING, INTEGER, BOOLEAN, DATE, NOW, TEXT, TableError, PRIMARY, eq, all
from pattern.web import URL, Element, DOM, plaintext

def parse_post_info(p):
    """ This function receives a post Element from the post list and
    returns a dictionary with post url, post title, labels, date.
    """
    try:
        post_header = p("header.entry-header")[0]
        title_tag = post_header("a < h1")[0]
        post_title = plaintext(title_tag.content)
        print post_title
        post_url = title_tag("a")[0].href
        date_tag = post_header("div.entry-meta")[0]
        post_date = plaintext(date_tag("time")[0].datetime).split("T")[0]
        #post_date = date(post_date_text)
        print post_date
        post_id = int(((p).id).split("-")[-1])
        post_content = get_post_content(post_url)
        labels = " "
        print labels
        return dict(blog_no=blog_no,
                    post_title=post_title,
                    post_url=post_url,
                    post_date=post_date,
                    post_id=post_id,
                    labels=labels,
                    post_content=post_content
                    )
    except:
        pass
The date() function returns a new Date, a convenient subclass of Python's datetime.datetime. It takes an integer (Unix timestamp), a string, or NOW.
The difference can come from the local time offset in the string.
Also, the expected string format is "YYYY-MM-DD hh:mm:ss".
How to convert the time format can be found here.
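A sketch of that normalization, based on the timestamp in the error message: strip the UTC offset and replace the "T" separator so the string matches "YYYY-MM-DD hh:mm:ss" before passing it to date():

# e.g. "2015-06-09T07:01:55+00:00" -> "2015-06-09 07:01:55"
raw_date = "2015-06-09T07:01:55+00:00"
cleaned = raw_date.split("+")[0].replace("T", " ")
post_date = date(cleaned)  # pattern.db's date() parses this format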
