Python/Beautifulsoup - Missing first row of table

Could someone please help me understand why the output for the below code is missing the first row of the table? I'm new to python, and not for lack of trying, I haven't been able to troubleshoot this myself.
import requests
import csv
from collections import OrderedDict
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

def printfunction():
    with open("C:/Users/.../audusd.csv", 'a', newline='') as f:
        wr = csv.writer(f)
        wr.writerows([(data[0], data[1], data[2], data[3], data[4], data[5])])

url = requests.get("https://au.investing.com/currencies/aud-usd-historical-data/",
                   headers={'User-Agent': 'Mozilla/5.0'})
od = OrderedDict()
content_page = soup(url.content, 'html.parser')
table = content_page.find('table', {'class': 'genTbl closedTbl historicalTbl'})
cols = [th.text for th in table.select("th")[1:]]
for row in table.select("tr + tr"):
    data = [td.text for td in row.select("td")]
    printfunction()
    print(data)
Output appears thus:
['Aug 23, 2018', '0.7246', '0.7349', '0.7355', '0.7240', '-1.37%']
['Aug 22, 2018', '0.7347', '0.7370', '0.7371', '0.7332', '-0.33%']
['Aug 21, 2018', '0.7371', '0.7341', '0.7383', '0.7332', '0.42%']
['Aug 20, 2018', '0.7340', '0.7306', '0.7344', '0.7294', '0.44%']
['Aug 19, 2018', '0.7308', '0.7316', '0.7317', '0.7308', '-0.05%']
['Aug 17, 2018', '0.7312', '0.7261', '0.7321', '0.7253', '0.70%']
['Aug 16, 2018', '0.7261', '0.7240', '0.7288', '0.7222', '0.30%']
['Aug 15, 2018', '0.7239', '0.7247', '0.7249', '0.7202', '-0.08%']
['Aug 14, 2018', '0.7245', '0.7270', '0.7284', '0.7222', '-0.33%']
['Aug 13, 2018', '0.7269', '0.7289', '0.7300', '0.7248', '-0.25%']
['Aug 12, 2018', '0.7287', '0.7278', '0.7300', '0.7273', '-0.21%']
['Aug 10, 2018', '0.7302', '0.7372', '0.7381', '0.7279', '-0.95%']
['Aug 09, 2018', '0.7372', '0.7435', '0.7456', '0.7371', '-0.81%']
['Aug 08, 2018', '0.7432', '0.7420', '0.7440', '0.7382', '0.15%']
['Aug 07, 2018', '0.7421', '0.7386', '0.7440', '0.7379', '0.46%']
['Aug 06, 2018', '0.7387', '0.7398', '0.7406', '0.7372', '-0.09%']
['Aug 05, 2018', '0.7394', '0.7397', '0.7400', '0.7394', '-0.08%']
['Aug 03, 2018', '0.7400', '0.7359', '0.7412', '0.7346', '0.54%']
['Aug 02, 2018', '0.7360', '0.7405', '0.7413', '0.7354', '-0.59%']
['Aug 01, 2018', '0.7404', '0.7427', '0.7430', '0.7389', '-0.34%']
['Jul 31, 2018', '0.7429', '0.7408', '0.7442', '0.7402', '0.30%']
['Jul 30, 2018', '0.7407', '0.7390', '0.7416', '0.7387', '0.12%']
['Jul 29, 2018', '0.7398', '0.7400', '0.7406', '0.7398', '-0.04%']
['Jul 27, 2018', '0.7401', '0.7377', '0.7416', '0.7369', '0.31%']
['Jul 26, 2018', '0.7378', '0.7456', '0.7464', '0.7370', '-1.03%']
['Jul 25, 2018', '0.7455', '0.7424', '0.7466', '0.7391', '0.46%']
Desired output (as per source table):
['Aug 24, 2018', 'x', 'x', 'x', 'x', 'x']
['Aug 23, 2018', '0.7246', '0.7349', '0.7355', '0.7240', '-1.37%']
['Aug 22, 2018', '0.7347', '0.7370', '0.7371', '0.7332', '-0.33%']
['Aug 21, 2018', '0.7371', '0.7341', '0.7383', '0.7332', '0.42%']
['Aug 20, 2018', '0.7340', '0.7306', '0.7344', '0.7294', '0.44%']
['Aug 19, 2018', '0.7308', '0.7316', '0.7317', '0.7308', '-0.05%']
['Aug 17, 2018', '0.7312', '0.7261', '0.7321', '0.7253', '0.70%']
['Aug 16, 2018', '0.7261', '0.7240', '0.7288', '0.7222', '0.30%']
['Aug 15, 2018', '0.7239', '0.7247', '0.7249', '0.7202', '-0.08%']
['Aug 14, 2018', '0.7245', '0.7270', '0.7284', '0.7222', '-0.33%']
['Aug 13, 2018', '0.7269', '0.7289', '0.7300', '0.7248', '-0.25%']
['Aug 12, 2018', '0.7287', '0.7278', '0.7300', '0.7273', '-0.21%']
['Aug 10, 2018', '0.7302', '0.7372', '0.7381', '0.7279', '-0.95%']
['Aug 09, 2018', '0.7372', '0.7435', '0.7456', '0.7371', '-0.81%']
['Aug 08, 2018', '0.7432', '0.7420', '0.7440', '0.7382', '0.15%']
['Aug 07, 2018', '0.7421', '0.7386', '0.7440', '0.7379', '0.46%']
['Aug 06, 2018', '0.7387', '0.7398', '0.7406', '0.7372', '-0.09%']
['Aug 05, 2018', '0.7394', '0.7397', '0.7400', '0.7394', '-0.08%']
['Aug 03, 2018', '0.7400', '0.7359', '0.7412', '0.7346', '0.54%']
['Aug 02, 2018', '0.7360', '0.7405', '0.7413', '0.7354', '-0.59%']
['Aug 01, 2018', '0.7404', '0.7427', '0.7430', '0.7389', '-0.34%']
['Jul 31, 2018', '0.7429', '0.7408', '0.7442', '0.7402', '0.30%']
['Jul 30, 2018', '0.7407', '0.7390', '0.7416', '0.7387', '0.12%']
['Jul 29, 2018', '0.7398', '0.7400', '0.7406', '0.7398', '-0.04%']
['Jul 27, 2018', '0.7401', '0.7377', '0.7416', '0.7369', '0.31%']
['Jul 26, 2018', '0.7378', '0.7456', '0.7464', '0.7370', '-1.03%']
['Jul 25, 2018', '0.7455', '0.7424', '0.7466', '0.7391', '0.46%']
Many thanks!
OM.

The selector tr + tr means "a tr that immediately follows a sibling tr". Assuming the header row sits in a thead, the first data row is the first tr inside the tbody, so it has no preceding sibling tr and never matches. In other words, the first row doesn't show up because you're specifically asking for it not to show up. If you want to select all the rows, just select plain tr.
If you don't know how selectors work, and just copied this from some other code that seemed close, read the docs.
If you were trying to do this because there's a tr inside the thead and you wanted to skip that one, this is not the way to do it.
You could try to come up with a complicated selector for every tr that's either before or after another tr (and hope you never run into a one-row table…), or something like that.
But, more simply, just select every tr that's inside a tbody:
for row in table.select('tbody tr'):
… or directly inside:
for row in table.select('tbody > tr'):
Or just select all the rows inside the tbody of the table:
for row in table.tbody.select('tr'):
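To see the difference between the selectors without hitting the live site, here is a minimal sketch on a small inline table. The HTML is a made-up stand-in for the page's markup (it assumes the header row sits in a thead and the data rows in a tbody), and the helper name first_cells is hypothetical:

```python
from bs4 import BeautifulSoup

# Stand-in markup: one header row in <thead>, three data rows in <tbody>.
html = """
<table>
  <thead><tr><th>Date</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>Aug 24, 2018</td><td>x</td></tr>
    <tr><td>Aug 23, 2018</td><td>0.7246</td></tr>
    <tr><td>Aug 22, 2018</td><td>0.7347</td></tr>
  </tbody>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

def first_cells(selector):
    """Return the text of the first <td> of every row the selector matches."""
    return [tr.find("td").text for tr in table.select(selector)]

# 'tr + tr' only matches rows that directly follow a sibling row,
# so the first row of the tbody is skipped.
adjacent = first_cells("tr + tr")

# 'tbody tr' matches every row inside the tbody, including the first.
body_rows = first_cells("tbody tr")

print(adjacent)
print(body_rows)
```

With this markup, the first list starts at Aug 23 while the second also contains the Aug 24 row.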

Related

Fill in missing values for missing dates in dataframe

I have the following dataframe:
df = pd.DataFrame(
    {
        'status': ['open', 'closed', 'open', 'closed', 'open', 'closed', 'open', 'closed'],
        'month': ['January 2020', 'January 2020', 'February 2020', 'February 2020', 'April 2020', 'April 2020', 'August 2020', 'August 2020'],
        'counts': [10, 12, 32, 12, 19, 40, 10, 11]
    }
)
   status          month  counts
0    open   January 2020      10
1  closed   January 2020      12
2    open  February 2020      32
3  closed  February 2020      12
4    open     April 2020      19
5  closed     April 2020      40
6    open    August 2020      10
7  closed    August 2020      11
I'm trying to get a stacked bar plot using seaborn:
sns.histplot(df, x='month', weights='counts', hue='status', multiple='stack')
The purpose is to get a plot with a continuous timeseries without missing months. How can I fill in the missing rows with values so that the dataframe would look like below?
status          month  counts
  open   January 2020      10
closed   January 2020      12
  open  February 2020      32
closed  February 2020      12
  open     March 2020       0
closed     March 2020       0
  open     April 2020      19
closed     April 2020      40
  open       May 2020       0
closed       May 2020       0
  open      June 2020       0
closed      June 2020       0
  open      July 2020       0
closed      July 2020       0
  open    August 2020      10
closed    August 2020      11

You could pivot the dataframe, and then reindex with the desired months.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'status': ['open', 'closed', 'open', 'closed', 'open', 'closed', 'open', 'closed'],
                   'month': ['January 2020', 'January 2020', 'February 2020', 'February 2020', 'April 2020', 'April 2020', 'August 2020', 'August 2020'],
                   'counts': [10, 12, 32, 12, 19, 40, 10, 11]})
months = [f'{m} 2020' for m in ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August']]
df_pivoted = df.pivot(values='counts', index='month', columns='status').reindex(months).fillna(0)
ax = df_pivoted.plot.bar(stacked=True, width=1, ec='black', rot=0, figsize=(12, 5))
plt.show()
With seaborn, you could pass order= to set the months explicitly. That doesn't work with histplot, only with barplot, which draws the bars side by side instead of stacking them.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'status': ['open', 'closed', 'open', 'closed', 'open', 'closed', 'open', 'closed'],
                   'month': ['January 2020', 'January 2020', 'February 2020', 'February 2020', 'April 2020', 'April 2020', 'August 2020', 'August 2020'],
                   'counts': [10, 12, 32, 12, 19, 40, 10, 11]})
plt.figure(figsize=(12, 5))
months = [f'{m} 2020' for m in ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August']]
ax = sns.barplot(data=df, x='month', y='counts', hue='status', order=months)
plt.tight_layout()
plt.show()
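If you would rather keep the long format from the question (one row per status and month), one possible sketch is to reindex against every month/status combination. It assumes an English locale for the month names produced by strftime, and the start and end dates are typed in by hand:

```python
import pandas as pd

df = pd.DataFrame({
    'status': ['open', 'closed', 'open', 'closed', 'open', 'closed', 'open', 'closed'],
    'month': ['January 2020', 'January 2020', 'February 2020', 'February 2020',
              'April 2020', 'April 2020', 'August 2020', 'August 2020'],
    'counts': [10, 12, 32, 12, 19, 40, 10, 11],
})

# Every month start from January to August 2020, formatted like the 'month' column.
months = pd.date_range('2020-01-01', '2020-08-01', freq='MS').strftime('%B %Y')

# One entry per (month, status) pair, in calendar order.
full_index = pd.MultiIndex.from_product([months, ['open', 'closed']],
                                        names=['month', 'status'])

# Look up the existing counts and fill the missing pairs with 0.
filled = (df.set_index(['month', 'status'])['counts']
            .reindex(full_index, fill_value=0)
            .reset_index()[['status', 'month', 'counts']])

print(filled)
```

The resulting frame has 16 rows, with zero counts for March, May, June and July.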

Sorting list that includes dates as strings

I have the following list:
data = ['Sep 14, 2020', 'Sep 10, 2020', 'Sep 6, 2020', 'Aug 28, 2020',
        'Aug 25, 2020', 'Aug 31, 2020', 'Aug 30, 2020', 'Aug 17, 2020',
        'Nov 12, 2020', 'Dec 3, 2020', 'Dec 17, 2020', 'Dec 28, 2020', 'Dec 31, 2020']
I've tried looping over the list, matching on the date in each element and moving that element behind another one, but I cannot get it working.
I want to sort the list from the first month (Jan) to the last month (Dec), but I don't even know where to start.

The elements in data are strings, so if you run sort on the list, it will be sorted alphabetically.
You can tell the sort function to treat them as dates. To do this, convert each element with datetime.strptime, like this: datetime.strptime(element, '%b %d, %Y').
So your sort becomes:
from datetime import datetime
data = ['Sep 14, 2020', 'Sep 10, 2020', 'Sep 6, 2020', 'Aug 28, 2020', 'Aug 25, 2020', 'Aug 31, 2020', 'Aug 30, 2020', 'Aug 17, 2020', 'Nov 12, 2020', 'Dec 3, 2020', 'Dec 17, 2020', 'Dec 28, 2020', 'Dec 31, 2020']
data.sort(key=lambda date: datetime.strptime(date, '%b %d, %Y'))
print(data)
Outputs:
['Aug 17, 2020', 'Aug 25, 2020', 'Aug 28, 2020', 'Aug 30, 2020', 'Aug 31, 2020', 'Sep 6, 2020', 'Sep 10, 2020', 'Sep 14, 2020', 'Nov 12, 2020', 'Dec 3, 2020', 'Dec 17, 2020', 'Dec 28, 2020', 'Dec 31, 2020']

You can convert your strings into datetime objects and then sort them:
from datetime import datetime
data = ['Sep 14, 2020', 'Sep 10, 2020', 'Sep 6, 2020', 'Aug 28, 2020', 'Aug 25, 2020', 'Aug 31, 2020', 'Aug 30, 2020', 'Aug 17, 2020', 'Nov 12, 2020', 'Dec 3, 2020', 'Dec 17, 2020', 'Dec 28, 2020', 'Dec 31, 2020']
dates = list(map(lambda time_str: datetime.strptime(time_str, "%b %d, %Y"), data))
dates.sort()
print(dates)
# or if you want them as strings
print(list(map(lambda x: x.strftime("%b %d, %Y"), dates)))

What you had:
data = ['Sep 14, 2020', 'Sep 10, 2020', 'Sep 6, 2020', 'Aug 28, 2020', 'Aug 25, 2020',
'Aug 31, 2020', 'Aug 30, 2020', 'Aug 17, 2020', 'Nov 12, 2020', 'Dec 3, 2020',
'Dec 17, 2020', 'Dec 28, 2020', 'Dec 31, 2020']
What you need to add:
import datetime
r = lambda x: datetime.datetime.strptime(x, '%b %d, %Y')
data.sort(key=r)
Result:
['Aug 17, 2020',
'Aug 25, 2020',
'Aug 28, 2020',
'Aug 30, 2020',
'Aug 31, 2020',
'Sep 6, 2020',
'Sep 10, 2020',
'Sep 14, 2020',
'Nov 12, 2020',
'Dec 3, 2020',
'Dec 17, 2020',
'Dec 28, 2020',
'Dec 31, 2020']

You can pass a key argument to the sort method which will tell Python how to sort the items. For example, suppose I wanted to sort a list of strings based on the alphabetical order of their last letters.
words = ['alfa', 'bravo', 'charlie']
words.sort(key=lambda x: x[-1])
print(words)
Output:
['alfa', 'charlie', 'bravo']
So, you would need to write a function to pass as the key that maps each date string to a value that sorts in date order. This would work, as long as all the dates fall in the same year:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
data.sort(key=lambda x: months.index(x.split()[0])*100 + int(x.split()[1].rstrip(',')))
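The month-index key above ignores the year. A sketch of a key that also copes with several years, still without importing datetime, could return a (year, month, day) tuple. The name date_key and the sample list are made up for illustration:

```python
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def date_key(text):
    """Map 'Sep 14, 2020' to (2020, 8, 14) so tuples compare in date order."""
    month, day, year = text.replace(',', '').split()
    return (int(year), MONTHS.index(month), int(day))

# Hypothetical sample that spans two years.
sample = ['Sep 14, 2020', 'Dec 3, 2019', 'Aug 17, 2020', 'Sep 6, 2020']
result = sorted(sample, key=date_key)
print(result)
# ['Dec 3, 2019', 'Aug 17, 2020', 'Sep 6, 2020', 'Sep 14, 2020']
```

Tuples compare element by element, so the year decides first, then the month, then the day.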

Pymongo convert timestamp to date as a new field

I have many entities in my collection and I have to create a new date field in collection to use for future queries.
{'_id': ObjectId('5afea920d326051990a7f337'), 'created_at': 'Fri May 18 10:21:07 +0000 2018', 'timestamp_ms': '1526638867739'}
{'_id': ObjectId('5afea920d326051990a7f339'), 'created_at': 'Fri May 18 10:21:08 +0000 2018', 'timestamp_ms': '1526638868310'}
{'_id': ObjectId('5afea972d326051c5c05bc11'), 'created_at': 'Fri May 18 10:22:30 +0000 2018', 'timestamp_ms': '1526638950799'}
{'_id': ObjectId('5afea974d326051c5c05bc16'), 'created_at': 'Fri May 18 10:22:32 +0000 2018', 'timestamp_ms': '1526638952160'}
{'_id': ObjectId('5afea974d326051c5c05bc17'), 'created_at': 'Fri May 18 10:22:32 +0000 2018', 'timestamp_ms': '1526638952841'}
I need to convert timestamp_ms into date format like this:
{'_id': ObjectId('5afea920d326051990a7f337'), 'created_at': 'Fri May 18 10:21:07 +0000 2018', 'timestamp_ms': '1526638867739', 'NewDate': '2018 05 18 10:21:07'}
{'_id': ObjectId('5afea920d326051990a7f339'), 'created_at': 'Fri May 18 10:21:08 +0000 2018', 'timestamp_ms': '1526638868310', 'NewDate': '2018 05 18 10:21:08'}
{'_id': ObjectId('5afea972d326051c5c05bc11'), 'created_at': 'Fri May 18 10:22:30 +0000 2018', 'timestamp_ms': '1526638950799', 'NewDate': '2018 05 18 10:22:30'}
{'_id': ObjectId('5afea974d326051c5c05bc16'), 'created_at': 'Fri May 18 10:22:32 +0000 2018', 'timestamp_ms': '1526638952160', 'NewDate': '2018 05 18 10:22:32'}
{'_id': ObjectId('5afea974d326051c5c05bc17'), 'created_at': 'Fri May 18 10:22:32 +0000 2018', 'timestamp_ms': '1526638952841', 'NewDate': '2018 05 18 10:22:32'}
I used this code (Python 3.6, pymongo 3.8, mongodb 4.0):
pipeline = [
    {
        '$addFields': {
            'newDate': {
                '$toDate': '$timestamp_ms'
            }
        }
    }
]
cursor = collection.aggregate(pipeline)
But it gives this error message: pymongo.errors.OperationFailure: Error parsing date string '1526638867739'; 12: Unexpected character '9'
I am not sure that aggregate is the right method for this task. datetime.strptime() might work better on 'created_at', but I haven't figured out how to use it with db.Mycollection.update_many().

Use the query below with pymongo and MongoDB 4.0:
db.test.aggregate(
    [
        {
            "$addFields": {
                "NewDate": {"$toDate": "$timestamp_ms"}
            }
        },
        {
            "$out": "test"
        },
    ],
)

I got an answer to my question from Kanika Singla at the MongoDB University Discussion Forum. Here it is, in case you have the same problem:
pipeline = [
    {
        '$project': {
            'yearMonthDayUTC': {
                '$convert': {
                    'to': 'double',
                    'input': '$timestamp_ms'
                }
            }
        }
    }, {
        '$addFields': {
            'newDate': {
                '$toDate': '$yearMonthDayUTC'
            }
        }
    }
]
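If you would rather do the conversion on the Python side, the standard library can turn the millisecond string into a datetime. This is only a sketch of the conversion itself: writing the value back to the collection is left out, and the helper name ms_to_utc is hypothetical.

```python
from datetime import datetime, timezone

def ms_to_utc(ts_ms):
    """Convert a millisecond epoch string such as '1526638867739' to an aware UTC datetime."""
    return datetime.fromtimestamp(int(ts_ms) / 1000, tz=timezone.utc)

# One document from the question, without the _id field.
doc = {'created_at': 'Fri May 18 10:21:07 +0000 2018',
       'timestamp_ms': '1526638867739'}

new_date = ms_to_utc(doc['timestamp_ms'])
formatted = new_date.strftime('%Y %m %d %H:%M:%S')

# Cross-check against 'created_at', parsed with an explicit format string.
created = datetime.strptime(doc['created_at'], '%a %b %d %H:%M:%S %z %Y')

print(formatted)  # 2018 05 18 10:21:07
print(created == new_date.replace(microsecond=0))
```

Storing the datetime object itself, rather than the formatted string, is probably more useful for later range queries.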

How can I sort dates that are in String? (python)

I've only tried
from datetime import datetime
my_dates = ['5-Nov-18', '25-Mar-17', '1-Nov-18', '7 Mar 17']
my_dates.sort(key=lambda date: datetime.strptime(date, "%d-%b-%y"))
print(my_dates)
But how can I make this work for date formats like
my_dates = ['05 Nov 2018', '25 Mar 2017', '01 Nov 2018', '07 Mar 2017']

One inelegant solution which comes to mind is replacing all spaces with dashes as shown below:
from datetime import datetime
my_dates = ['5-Nov-18', '25-Mar-17', '1-Nov-18', '7 Mar 17']
my_dates.sort(key=lambda date: datetime.strptime(date.replace(' ', '-'), "%d-%b-%y"))
print(my_dates)

If all you're looking at is the specific set of dates you have provided, just change the format in strptime():
from datetime import datetime
my_dates = ['05 Nov 2018', '25 Mar 2017', '01 Nov 2018', '07 Mar 2017']
my_dates.sort(key=lambda date: datetime.strptime(date, "%d %b %Y"))
# Spaces instead of dashes, and %Y instead of %y for the four-digit year
However, if you have varying date formats in your list, you could prepare a list of anticipated string formats and define your own function to parse against each format:
def func(date, formats):
    for frmt in formats:
        try:
            str_date = datetime.strptime(date, frmt)
            return str_date
        except ValueError:
            continue
    # might want to consider handling a scenario when nothing is returned
my_formats = ['%d-%b-%y', '%d %b %y', '%d %b %Y']
my_dates = ['5-Nov-18', '25-Mar-17', '1-Nov-18', '7 Mar 17', '05 Nov 2018', '25 Mar 2017', '01 Nov 2018', '07 Mar 2017']
my_dates.sort(key=lambda date: func(date, my_formats))
print(my_dates)
# ['7 Mar 17', '07 Mar 2017', '25-Mar-17', '25 Mar 2017', '1-Nov-18', '01 Nov 2018', '5-Nov-18', '05 Nov 2018']
The caveat here is that if an unanticipated date format shows up, the function will return None, and in Python 3 the sort then fails with a TypeError because None cannot be compared with a datetime. If that's a concern, add some handling at the end of func() for the case where all parsing attempts failed. Some devs might also say to avoid try...except, but this is the only way I could come up with.
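One way to handle the "nothing matched" case is to raise a clear error instead of returning None. The following is a sketch under that assumption; the names parse_date and FORMATS are hypothetical, and the month abbreviations assume an English locale:

```python
from datetime import datetime

# Formats to try, in order.
FORMATS = ('%d-%b-%y', '%d %b %y', '%d %b %Y')

def parse_date(text, formats=FORMATS):
    """Return a datetime for the first format that fits, or raise ValueError."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    # No format matched: fail loudly instead of handing None to sort().
    raise ValueError(f"no known date format matches {text!r}")

my_dates = ['5-Nov-18', '25 Mar 2017', '1-Nov-18', '7 Mar 17']
my_dates.sort(key=parse_date)
print(my_dates)
# ['7 Mar 17', '25 Mar 2017', '1-Nov-18', '5-Nov-18']
```

An unexpected string such as '2018/11/05' then stops the sort with a message naming the offending value.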

Dates can be sorted once they are datetime objects:
from datetime import datetime as dt
my_dates = ['05 Nov 2018', '25 Mar 2017', '01 Nov 2018', '07 Mar 2017']
my_dates = [ dt.strptime(date, '%d %b %Y') for date in my_dates ]
print(my_dates)
# [datetime.datetime(2018, 11, 5, 0, 0), datetime.datetime(2017, 3, 25, 0, 0), datetime.datetime(2018, 11, 1, 0, 0), datetime.datetime(2017, 3, 7, 0, 0)]
my_dates.sort()
print(my_dates)
# [datetime.datetime(2017, 3, 7, 0, 0), datetime.datetime(2017, 3, 25, 0, 0), datetime.datetime(2018, 11, 1, 0, 0), datetime.datetime(2018, 11, 5, 0, 0)]
my_dates = [ date.strftime('%d %b %Y') for date in my_dates ]
print(my_dates)
# ['07 Mar 2017', '25 Mar 2017', '01 Nov 2018', '05 Nov 2018']

Function to join the first 2 elements from each list in a list of lists

I have this list of lists:
listoflist = [['BOTOS', 'AUGUSTIN', 14, 'March 2016', 600, 'ALOCATIA'], ['HENDRE', 'AUGUSTIN', 14, 'February 2015', 600, 'ALOCATIA']]
^^ That was just an example; I will have many more lists in my list of lists, all in the same format.
This is my desired output:
listoflist = [['BOTOS AUGUSTIN', 14, 'March 2016', 600, 'ALOCATIA'], ['HENDRE AUGUSTIN', 14, 'February 2015', 600, 'ALOCATIA']]
Basically, in each list I want to join the first element with the second to form a full name at a single index, as in the example.
I need a function for this that takes a list as input. How can I do this in a simple way? (I don't want an extra library for this.) I use Python 3.5. Thank you so much for your time!

You can iterate through the outer list and then join the slice of the first two items:
def merge_names(lst):
    for l in lst:
        l[0:2] = [' '.join(l[0:2])]
merge_names(listoflist)
print(listoflist)
# [['BOTOS AUGUSTIN', 14, 'March 2016', 600, 'ALOCATIA'], ['HENDRE AUGUSTIN', 14, 'February 2015', 600, 'ALOCATIA']]

This simple list comprehension should do the trick:
res = [[' '.join(item[0:2]), *item[2:]] for item in listoflist]
It joins the first two items of each list and appends the rest as is.

You can try this:
f = lambda *args: [' '.join(args[:2]), *args[2:]]
listoflist = [['BOTOS', 'AUGUSTIN', 14, 'March 2016', 600, 'ALOCATIA'], ['HENDRE', 'AUGUSTIN', 14, 'February 2015', 600, 'ALOCATIA']]
final_list = [f(*i) for i in listoflist]
Output:
[['BOTOS AUGUSTIN', 14, 'March 2016', 600, 'ALOCATIA'], ['HENDRE AUGUSTIN', 14, 'February 2015', 600, 'ALOCATIA']]

You can use a list comprehension as well:
listoflist = [['BOTOS', 'AUGUSTIN', 14, 'March 2016', 600, 'ALOCATIA'], ['HENDRE', 'AUGUSTIN', 14, 'February 2015', 600, 'ALOCATIA']]
def f(lol):
    # l[2:] keeps everything after the two name fields, including the 14
    return [[' '.join(l[0:2])] + l[2:] for l in lol]
listoflist = f(listoflist)
print(listoflist)
# => [['BOTOS AUGUSTIN', 14, 'March 2016', 600, 'ALOCATIA'], ['HENDRE AUGUSTIN', 14, 'February 2015', 600, 'ALOCATIA']]
