webscraping and extracting dates - python

Using python BeautifulSoup, I'm trying to extract the date of each newspaper article from a google search page:
https://www.google.com/search?q=citi+group&tbm=nws&ei=u9_1WsetC67l5gKRt7qYBA&start=0&sa=N&biw=1600&bih=794&dpr=1
Here is my code:
from bs4 import BeautifulSoup
import requests

article_link = "https://www.google.com/search?q=citi+group&tbm=nws&ei=u9_1WsetC67l5gKRt7qYBA&start=0&sa=N&biw=1600&bih=794&dpr=1"
page = requests.get(article_link)
soup = BeautifulSoup(page.content, 'html.parser')
for links in soup.find_all('div', {'class': 'slp'}):
    date = links.get_text()
    print(date)
The source code is something like:
The output is "PE Hub (blog) - 1 day ago"
Can I just extract the date part (2018. 5. 11)?

I'm not sure exactly why BeautifulSoup is pulling it that way, but you can clean what you're pulling with regex and datetime: for relative dates like "1 day ago", strip the text and subtract a timedelta; otherwise use strptime to parse the absolute date.
from bs4 import BeautifulSoup
import requests

hold = []
article_link = "https://www.google.com/search?q=citi+group&tbm=nws&ei=u9_1WsetC67l5gKRt7qYBA&start=0&sa=N&biw=1600&bih=794&dpr=1"
page = requests.get(article_link)
soup = BeautifulSoup(page.content, 'html.parser')
for links in soup.find_all('div', {'class': 'slp'}):
    date = links.get_text()
    hold.append(date)  # collect the raw "Source - date" strings
---------
# converting to datetime values
import re
from datetime import datetime as dt
from datetime import timedelta  # needed for the "N days ago" branch

hold2 = []
for item in hold:
    item = re.sub('^.+ - ', '', item)  # strip the "PE Hub (blog) - " style prefix
    if 'ago' in item:
        item = re.sub(' days? ago$', '', item)
        hold2.append(dt.today() - timedelta(int(item)))
    else:
        item = dt.strptime(item, '%b %d, %Y')
        hold2.append(item)
hold2
[datetime.datetime(2018, 5, 12, 14, 37, 39, 653618),
datetime.datetime(2018, 5, 8, 14, 37, 39, 653636),
datetime.datetime(2018, 5, 11, 14, 37, 39, 653643),
datetime.datetime(2018, 5, 12, 14, 37, 39, 653649),
datetime.datetime(2018, 5, 8, 14, 37, 39, 653655),
datetime.datetime(2018, 5, 12, 14, 37, 39, 653661),
datetime.datetime(2018, 5, 12, 14, 37, 39, 653667),
datetime.datetime(2018, 4, 24, 0, 0),
datetime.datetime(2018, 5, 8, 14, 37, 39, 653716),
datetime.datetime(2018, 4, 25, 0, 0)]
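If the goal is the literal "2018. 5. 11" style string rather than a datetime object, one portable way to format the cleaned datetimes (a sketch; it avoids the glibc-only %-m/%-d strftime flags, which don't work on Windows) is:

```python
from datetime import datetime

def fmt_date(d):
    # Build "YYYY. M. D" without zero padding; f-string fields are
    # portable, unlike platform-specific strftime flags
    return f"{d.year}. {d.month}. {d.day}"

print(fmt_date(datetime(2018, 5, 11, 14, 37, 39)))  # 2018. 5. 11
```

Applying this over `hold2` gives the date-part strings the question asks for.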

Related

How to change the format of json to spacy/custom json format in python?

I have a JSON format generated from the doccano annotation tool. I want to convert the JSON into another format; please check both formats below.
doccano JSON format:
{"id": 2, "data": "My name is Nithin Reddy and i'm working as a Data Scientist.", "label": [[3, 8, "Misc"], [11, 23, "Person"], [32, 39, "Activity"], [45, 59, "Designation"]]}
{"id": 3, "data": "I live in Hyderabad.", "label": [[2, 6, "Misc"], [10, 19, "Location"]]}
{"id": 4, "data": "I'm pusring my master's from Bits Pilani.", "label": [[15, 24, "Education"], [29, 40, "Organization"]]}
Required JSON format:
("My name is Nithin Reddy and i'm working as a Data Scientist.", {"entities": [(3, 8, "Misc"), (11, 23, "Person"), (32, 39, "Activity"), (45, 59, "Designation")]}),
("I live in Hyderabad.", {"entities": [(2, 6, "Misc"), (10, 19, "Location")]}),
("I'm pusring my master's from Bits Pilani.", {"entities": [(15, 24, "Education"), (29, 40, "Organization")]})
I tried the code below, but it's not working:
import json

with open('data.json') as f:
    data = json.load(f)
new_data = []
for i in data:
    new_data.append((i['data'], {"entities": i['label']}))
with open('data_new.json', 'w') as f:
    json.dump(new_data, f)
Can anyone help me with Python code that converts the JSON to the required format?
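A sketch of one way to do this, assuming the export is newline-delimited JSON (one object per line, as the sample above suggests), which is why json.load() on the whole file fails:

```python
import json

def doccano_to_spacy(lines):
    """Convert doccano JSONL export lines to (text, {"entities": [...]}) pairs."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)  # parse each line as its own JSON object
        # The target format uses tuples, which JSON cannot represent, so the
        # result is kept as Python objects (e.g. pickle it) rather than
        # dumped back to a .json file.
        out.append((item["data"], {"entities": [tuple(s) for s in item["label"]]}))
    return out

sample = ['{"id": 3, "data": "I live in Hyderabad.", "label": [[2, 6, "Misc"], [10, 19, "Location"]]}']
print(doccano_to_spacy(sample))
```

Reading the real file would then be `with open('data.json') as f: result = doccano_to_spacy(f)`.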

How to read data from an Oracle database and convert the query result to a DataFrame in Python?

I created a method to read data from an Oracle DB:
import pyodbc

def read_data_oracle(server, database, username, password, query):
    cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)
    cursor = cnxn.cursor()
    cursor.execute(query)
    query_results = cursor.fetchall()
    return cursor, query_results
Then I connected to the DB and read the data as follows:
server = 'server_id'
database = 'database_name'
username = 'UID'
password = 'PWD'
query = """SELECT TOP 5 [Id], [Date], [Order_Sum] FROM [DB_Name].[dbo].[Table_Name] order by [Date] desc"""
cursor, query_results = read_data_oracle(server, database, username, password, query)
cursor.description
Output is:
(('Id', int, None, 10, 10, 0, False), ('CreatedOnUtc', datetime.datetime, None, 23, 23, 3, False), ('OrderTotal', decimal.Decimal, None, 18, 18, 4, False))
query_results
the output is:
[(611020, datetime.datetime(2021, 6, 7, 8, 43, 57, 467000), Decimal('520.3100')),
(611019, datetime.datetime(2021, 6, 7, 8, 43, 41, 967000), Decimal('281.1200')),
(611018, datetime.datetime(2021, 6, 7, 8, 38, 40, 33000), Decimal('774.4900')),
(611017, datetime.datetime(2021, 6, 7, 8, 38, 32, 210000), Decimal('774.4900')),
(611016, datetime.datetime(2021, 6, 7, 8, 37, 53, 233000), Decimal('299.7000'))]
I want to get the query results as a DataFrame, like this:
Id Date Order_Sum
611020 2021-06-07 08:43:57.467 520.3100
611019 2021-06-07 08:43:41.967 281.1200
611018 2021-06-07 08:38:40.330 774.4900
611017 2021-06-07 08:38:32.210 774.4900
611016 2021-06-07 08:37:53.233 299.7000
How to create a dataframe from query results?
Try the pd.read_sql function:
pd.read_sql(query, cnxn)
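Alternatively, if you want to build the frame from the cursor output you already have, the column names in cursor.description can supply the header. A sketch with illustrative stand-in data (no live connection):

```python
import datetime
from decimal import Decimal
import pandas as pd

# Illustrative stand-ins for cursor.description and cursor.fetchall()
description = (('Id', int), ('CreatedOnUtc', datetime.datetime), ('OrderTotal', Decimal))
query_results = [
    (611020, datetime.datetime(2021, 6, 7, 8, 43, 57, 467000), Decimal('520.3100')),
    (611019, datetime.datetime(2021, 6, 7, 8, 43, 41, 967000), Decimal('281.1200')),
]

# The first element of each description entry is the column name;
# each fetched row converts cleanly to a list for the constructor.
columns = [col[0] for col in description]
df = pd.DataFrame([list(row) for row in query_results], columns=columns)
print(df)
```

With a real cursor, replace the stand-ins with `cursor.description` and the rows returned by `fetchall()`.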

AWS Config Python Unicode Mess

I am running into an issue trying to pull usable items out of this output. I am just trying to pull a single value from this string of Unicode, and it has been super fun.
My print(response) returns this (FYI, it is much longer than this little snippet):
{u'configurationItems': [{u'configurationItemCaptureTime': datetime.datetime(2020, 6, 4, 21, 56, 31, 134000, tzinfo=tzlocal()), u'resourceCreationTime': datetime.datetime(2020, 5, 22, 16, 32, 55, 162000, tzinfo=tzlocal()), u'availabilityZone': u'Not Applicable', u'awsRegion': u'us-east-1', u'tags': {u'brassmonkeynew': u'tomtagnew'}, u'resourceType': u'AWS::DynamoDB::Table', u'resourceId': u'tj-test2', u'configurationStateId': u'1591307791134', u'relatedEvents': [], u'relationships': [], u'arn': u'arn:aws:dynamodb:us-east-1:896911201517:table/tj-test2', u'version': u'1.3', u'configurationItemMD5Hash': u'', u'supplementaryConfiguration': {u'ContinuousBackupsDescription': u'{"continuousBackupsStatus":"ENABLED","pointInTimeRecoveryDescription":{"pointInTimeRecoveryStatus":"DISABLED"}}', u'Tags': u'[{"key":"brassmonkeynew","value":"tomtagnew"}]'}, u'resourceName': u'tj-test2', u'configuration': u'{"attributeDefinitions":[{"attributeName":"tj-test2","attributeType":"S"}],"tableName":"tj-test2","keySchema":[{"attributeName":"tj-test2","keyType":"HASH"}],"tableStatus":"ACTIVE","creationDateTime":1590165175162,"provisionedThroughput":{"numberOfDecreasesToday":0,"readCapacityUnits":5,"writeCapacityUnits":5},"tableArn":"arn:aws:dynamodb:us-east-1:896911201517:table/tj-test2","tableId":"816956d7-95d1-4d31-8d18-f11b18de4643"}', u'configurationItemStatus': u'OK', u'accountId': u'896911201517'}, {u'configurationItemCaptureTime': datetime.datetime(2020, 6, 1, 16, 27, 21, 316000, tzinfo=tzlocal()), u'resourceCreationTime': datetime.datetime(2020, 5, 22, 16, 32, 55, 162000, tzinfo=tzlocal()), u'availabilityZone': u'Not Applicable', u'awsRegion': u'us-east-1', u'tags': {u'brassmonkeynew': u'tomtagnew', u'backup-schedule': u'daily'}, u'resourceType': u'AWS::DynamoDB::Table', u'resourceId': u'tj-test2', u'configurationStateId': u'1591028841316', u'relatedEvents': [], u'relationships': [], u'arn': u'arn:aws:dynamodb:us-east-1:896911201517:table/tj-test2', u'version': u'1.3', 
u'configurationItemMD5Hash': u'', u'supplementaryConfiguration': {u'ContinuousBackupsDescription': u'{"continuousBackupsStatus":"ENABLED","pointInTimeRecoveryDescription":{"pointInTimeRecoveryStatus":"DISABLED"}}', u'Tags': u'[{"key":"brassmonkeynew","value":"tomtagnew"},{"key":"backup-schedule","value":"daily"}]'}, u'resourceName': u'tj-test2', u'configuration': u'{"attributeDefinitions":[{"attributeName":"tj-test2","attributeType":"S"}],"tableName":"tj-test2","keySchema":[{"attributeName":"tj-
and so on. I have tried a few different ways of getting this info, but every time I get a KeyError.
I also tried converting this into JSON, but since I have date/time values at the top it gives me this error:
TypeError: [] is not JSON serializable
Failed attempts:
# print(response[0]["tableArn"])
print(response2)
print(response2['tableArn'])
print(response2.arn)
print(response2['configurationItems'][0]['tableArn'])
print(response2['configurationItems']['tableArn'])
print(response.configurationItems[0])
arn = response.configurationItems[0].arn
def lambda_handler(event, context):
    # print("Received event: " + json.dumps(event, indent=2))
    message = event['Records'][0]['Sns']['Message']
    print("From SNS: " + message)
    response = client.get_resource_config_history(
        resourceType='AWS::DynamoDB::Table',
        resourceId=message
    )
    response2 = dict(response)
    print(response)
    return message
Here's some Python3 code that shows how to access the elements:
import boto3
import json
import pprint

config_client = boto3.client('config')

response = config_client.get_resource_config_history(
    resourceType='AWS::DynamoDB::Table',
    resourceId='stack-table'
)

for item in response['configurationItems']:
    configuration = item['configuration']  # Returns a JSON string
    config = json.loads(configuration)     # Convert to a Python object
    pprint.pprint(config)                  # Show what's in it
    print(config['tableArn'])              # Access elements in the object
The trick is that the configuration field contains a JSON string that needs to be converted into a Python object for easy access.
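Regarding the "not JSON serializable" error when dumping the whole response: the datetime values are the culprit, and a common workaround is passing default=str to json.dumps so they are stringified. A sketch with a trimmed-down stand-in for the response:

```python
import json
import datetime

# Trimmed-down stand-in for the get_resource_config_history response,
# which contains datetime objects that json.dumps rejects by default
response = {"configurationItems": [{
    "configurationItemCaptureTime": datetime.datetime(2020, 6, 4, 21, 56, 31),
    "arn": "arn:aws:dynamodb:us-east-1:896911201517:table/tj-test2",
}]}

# default=str is called for any value json can't serialize natively
print(json.dumps(response, default=str, indent=2))
```

This is only needed when you want the whole response as JSON text; for plain element access, indexing the dict as shown above is enough.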

Can't parse boto3 client JSON response using Python

I am new to the Python language. I need to get all Amazon Web Services Identity and Access Management (IAM) policy details using Boto 3 and Python.
I tried to parse the JSON output from the Boto 3 client, and I also need to save key-value pairs into a map (policyName, Arn). The sample JSON output is like this:
{
    'ResponseMetadata': {
        'HTTPStatusCode': 200,
        'HTTPHeaders': {
            'vary': 'Accept-Encoding',
            'content-length': '19143',
            'content-type': 'text/xml',
            'date': 'Thu, 23 Feb 2017 06:39:25 GMT'
        }
    },
    u'Books': [{
        u'PolicyName': 'book1',
        u'Arn': '002dfgdfgdfgdfgvdfxgdfgdfgdfgfdg',
        u'CreateDate': datetime.datetime(2017, 2, 22, 13, 10, 55, tzinfo=tzutc()),
        u'UpdateDate': datetime.datetime(2017, 2, 22, 13, 10, 55, tzinfo=tzutc())
    }, {
        u'PolicyName': 'book2',
        u'Arn': '002dfgdfgdfgdfgvdfxgdfgdfgdfgfdg',
        u'CreateDate': datetime.datetime(2017, 2, 22, 13, 10, 55, tzinfo=tzutc()),
        u'UpdateDate': datetime.datetime(2017, 2, 22, 13, 10, 55, tzinfo=tzutc())
    }]
}
I have the following code:
iampolicylist_response = iamClient.list_policies(
    Scope='Local',
    MaxItems=150
)
print iampolicylist_response
res = json.dumps(iampolicylist_response)
print res
ret = {}
for i in res["PolicyName"]:
    ret[i["PolicyName"]] = i["Arn"]
return ret
Using json.loads, it shows an error like this:
TypeError: expected string or buffer
Using json.dumps, it shows an error like this:
TypeError: datetime.datetime(2017, 2, 22, 13, 10, 55, tzinfo=tzutc()) is not JSON serializable
What is the actual issue?
The result iampolicylist_response is already a dictionary; you do not need to parse it.
See http://boto3.readthedocs.io/en/latest/reference/services/iam.html#IAM.Client.list_policies - the response is a dictionary object.
Remove res=json.dumps(iampolicylist_response).
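Once json.dumps is removed, the (policyName, Arn) map can be built straight from the dictionary. A sketch, noting that the real list_policies response keys the list as 'Policies' (the sample above appears anonymized), and that the ARNs below are illustrative stand-ins:

```python
# Illustrative stand-in for iamClient.list_policies(Scope='Local', MaxItems=150)
iampolicylist_response = {
    "Policies": [
        {"PolicyName": "book1", "Arn": "arn:aws:iam::123456789012:policy/book1"},
        {"PolicyName": "book2", "Arn": "arn:aws:iam::123456789012:policy/book2"},
    ]
}

# The response is already a dict: index it directly, no JSON round-trip needed
ret = {p["PolicyName"]: p["Arn"] for p in iampolicylist_response["Policies"]}
print(ret)
```

The datetime values in the real response never need serializing for this, since only PolicyName and Arn are read.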

How to extract numbers from JSON API

I want to extract numbers and calculate the sum of these numbers from JSON API. The format is
{
    "comments": [
        {
            "name": "Matthias",
            "count": 97
        },
        {
            "name": "Geomer",
            "count": 97
        }
        ...
    ]
}
And my code is
import json
import urllib
url = 'http://python-data.dr-chuck.net/comments_204529.json'
print 'Retrieving', url
uh = urllib.urlopen(url)
data = uh.read()
print 'Retrieved',len(data),'characters'
result = json.loads(url)
print result
I can get the number of characters in this data, but I cannot continue because it says the JSON object cannot be decoded.
Does anyone know how to finish this code? Much appreciated!
First of all, I suggest you study the built-in Python data structures to get a better understanding of what you are dealing with.
result is a dictionary and result["comments"] is a list of dictionaries - you can use a list comprehension to get all the comment counts:
>>> import json
>>> import urllib
>>>
>>> url = 'http://python-data.dr-chuck.net/comments_204529.json'
>>> uh = urllib.urlopen(url)
>>> result = json.load(uh)
>>>
>>> [comment["count"] for comment in result["comments"]]
[100, 96, 95, 93, 85, 85, 77, 73, 73, 70, 65, 65, 65, 62, 62, 62, 61, 57, 50, 49, 46, 46, 43, 42, 39, 38, 37, 36, 34, 33, 31, 28, 28, 26, 26, 25, 22, 20, 20, 18, 17, 15, 14, 12, 10, 9, 8, 6, 5, 3]
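Since the assignment asks for the sum of the numbers, the last step is just sum() over that list. A self-contained sketch with inline data in the same shape as the feed:

```python
import json

# Inline stand-in for the comments feed payload
data = '{"comments": [{"name": "Matthias", "count": 97}, {"name": "Geomer", "count": 97}]}'
result = json.loads(data)

# Pull every count, then total them
counts = [comment["count"] for comment in result["comments"]]
print(sum(counts))  # 194
```

With the real URL, `result = json.load(uh)` as in the answer above, then the same two lines give the total.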
