Inconsistent index value in re module - python

I have two lists with different values. I tried to put the a list into an organized format with g.split. It works fine on the a list, but it can't filter the b list properly.
a = ['Sehingga 8 Ogos 2021: Jumlah kes COVID-19 yang dilaporkan adalah 18,688 kes (1,262,540 kes)\n\nPecahan setiap negeri (Kumulatif):\n\nSelangor - 6,565 (465,015)\nWPKL - 1,883 (140,404)\nJohor - 1,308 (100,452)\nSabah -Lagi 1,379 (93,835)\nSarawak - 581 (81,328)\nNegeri Sembilan - 1,140 (78,777)\nKedah - 1,610 (56,598)\nPulau Pinang - 694 (52,368)\nKelantan - 870 (49,433)\nPerak - 861 (43,924)\nMelaka - 526 (35,584)\nPahang - 602 (29,125)\nTerengganu - 598 (20,696)\nWP Labuan - 2 (9,711)\nWP Putrajaya - 63 (4,478)\nPerlis - 6 (812)\n\n- KPK KKM']
b = ['Sehingga 9 Ogos 2021. Jumlah kes COVID-19 yang dilaporkan adalah 17,236 kes (1,279,776 kes).\n\nPecahan setiap negeri (Kumulatif):\n\nSelangor - 5,740 (470,755)\nWPKL - 1,567 (141,971)\nJohor - 1,232 (101,684)\nSabah -Lagi 1,247 (95,082)\nSarawak - 589 (81,917)\nNegeri Sembilan - 1,215 (79,992)\nKedah - 1,328 (57,926)\nPulau Pinang - 908 (53,276)\nKelantan - 914 (50,347)\nPerak - 935 (44,859)\nMelaka - 360 (35,944)\nPahang - 604 (29,729)\nTerengganu - 501 (21,197)\nWP Labuan - 8 (9,719)\nWP Putrajaya - 66 (4,544)\nPerlis - 22 (834)\n\n- KPK KKM']
My code:
import re

out = []
for v in b:
    for g in re.findall(r"^(.*?\(.*?\))\n", v, flags=re.M):
        out.append(g.split(":")[0])
print(*out[0])
Whenever I print out[0] for the b list it only shows me Selangor - 5 , 7 4 0 (470,755), which is wrong; it should be Sehingga 9 Ogos 2021.
I tried the same code on the a list and it works properly without any issues. However, I noticed a minor difference between the two lists: one has a ':' and the other a '.' after Sehingga 8 Ogos 2021. How can I make the function work on both lists? I'm still new to re and split; does anyone have any idea on this? Thanks.

There is an issue with your data format and regex. I'm not that good at regex, but this works for me:
import re

# a and b are as defined in the question
out = []
for v in b:
    # normalize: drop the trailing '.' at line ends, then turn the
    # remaining '.' after the date into ':' so both lists match
    regex_list = re.findall(r"^(.*?\(.*?\))\n", v.replace('.\n', '\n').replace('.', ':'), flags=re.M)
    for g in regex_list:
        print(g)
        out.append(g.split(":")[0])
print(*out[0])
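An alternative that avoids rewriting the input is to make the trailing '.' optional in the pattern and split on either separator. A minimal sketch (first_entries is a hypothetical helper; it assumes the only difference between the feeds is the ':' versus '.' after the date):
import re

def first_entries(text):
    # allow an optional trailing '.' before the newline so both the
    # a-style and b-style header lines match
    matches = re.findall(r"^(.*?\(.*?\))\.?\n", text, flags=re.M)
    # keep only the part before the first ':' or '.'
    return [re.split(r"[:.]", m, maxsplit=1)[0] for m in matches]

for v in b:
    print(first_entries(v)[0])  # Sehingga 9 Ogos 2021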

Related

Why can't I scrape table data in order?

I'm trying to scrape table data off of this website:
https://www.nfl.com/standings/league/2019/REG
I have working code (below); however, the table data does not come back in the order I see on the website.
On the website I see (top-down):
Baltimore Ravens, Green Bay Packers, ..., Cincinnati Bengals
But in my code results, I see (top-down): Bengals, Lions, ..., Ravens
Why is soup returning the tags out of order? Does anyone know why this is happening? Thanks!
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.nfl.com/standings/league/2019/REG'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
print(soup)  # not sure why soup isn't returning tags in the order I see on website

table = soup.table
headers = []
for th in table.select('th'):
    headers.append(th.text)
print(headers)

df = pd.DataFrame(columns=headers)
for sup in table.select('sup'):
    sup.decompose()  # removes sup tag from the table tree so x, xz* in nfl_team_name will not show up
for tr in table.select('tr')[1:]:
    td_list = tr.select('td')
    td_str_list = [td_list[0].select('.d3-o-club-shortname')[0].text]
    td_str_list = td_str_list + [td.text for td in td_list[1:]]
    df.loc[len(df)] = td_str_list
print(df.to_string())
After the initial load, the table is dynamically sorted by the PCT column. To match what you see on the website, apply the same sort to your DataFrame using sort_values():
pd.read_html('https://www.nfl.com/standings/league/2019/REG')[0].sort_values(by='PCT', ascending=False)
Or, based on your example:
df.sort_values(by='PCT', ascending=False)
Output:
NFL Team  W   L  T  PCT    PF   PA   Net Pts  Home       Road       Div        Pct    Conf        Pct    Non-Conf   Strk  Last 5
Ravens    14  2  0  0.875  531  282  249      7 - 1 - 0  7 - 1 - 0  5 - 1 - 0  0.833  10 - 2 - 0  0.833  4 - 0 - 0  12W   5 - 0 - 0
49ers     13  3  0  0.813  479  310  169      6 - 2 - 0  7 - 1 - 0  5 - 1 - 0  0.833  10 - 2 - 0  0.833  3 - 1 - 0  2W    3 - 2 - 0
Saints    13  3  0  0.813  458  341  117      6 - 2 - 0  7 - 1 - 0  5 - 1 - 0  0.833  9 - 3 - 0   0.75   4 - 0 - 0  3W    4 - 1 - 0
Packers   13  3  0  0.813  376  313  63       7 - 1 - 0  6 - 2 - 0  6 - 0 - 0  1      10 - 2 - 0  0.833  3 - 1 - 0  5W    5 - 0 - 0
...
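One caveat worth adding: the hand-built df stores every scraped cell as a string, so it is safer to cast PCT to float before sorting. A minimal sketch against the df from the question's code (assuming 'PCT' matches the scraped header text):
# cells scraped via td.text are strings; cast PCT so the sort is
# numeric rather than lexicographic
df['PCT'] = df['PCT'].astype(float)
df = df.sort_values(by='PCT', ascending=False).reset_index(drop=True)
print(df.head().to_string())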

get financial data using Python

I have managed to write some Python code with Selenium that navigates to a webpage containing financial data in some tables.
I want to be able to extract the data and put it into Excel.
The tables seem to be HTML-based; sample markup below:
<tr>
<td class="bc2T bc2gt">Last update</td>
<td class="bc2V bc2D">03/15/2018</td><td class="bc2V bc2D">03/14/2019</td><td class="bc2V bc2D">03/12/2020</td><td class="bc2V bc2D" style="background-color:#DEFEFE;">05/22/2020</td><td class="bc2V bc2D" style="background-color:#DEFEFE;">05/20/2020</td><td class="bc2V bc2D" style="background-color:#DEFEFE;">05/18/2020</td>
</tr>
</table>
The table has the following class name:
<table class='BordCollapseYear2' style="margin-right:20px; font-size:12px; width:100%;" cellspacing=0>
Is there a way I can extract this data? Ideally I want this to be dynamic so that it can extract information for different companies.
I've never used it before, but I've seen the BeautifulSoup library mentioned a few times.
https://www.marketscreener.com/MICROSOFT-CORPORATION-4835/financials/
Take Microsoft as an example: I'd want to extract the income statement data, balance sheet, etc.
This script scrapes all tables found on the page and pretty-prints them:
import requests
from bs4 import BeautifulSoup

url = 'https://www.marketscreener.com/MICROSOFT-CORPORATION-4835/financials/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = {}
# for every table found on the page...
for table in soup.select('table.BordCollapseYear2'):
    table_name = table.find_previous('b').text
    all_data[table_name] = []
    # ...scrape every row
    for tr in table.select('tr'):
        row = [td.get_text(strip=True, separator=' ') for td in tr.select('td')]
        if len(row) == 7:
            all_data[table_name].append(row)

# pretty-print all data:
for k, v in all_data.items():
    print('Table name: {}'.format(k))
    print('-' * 160)
    for row in v:
        print(('{:<25}' * 7).format(*row))
    print()
Prints:
Table name: Valuation
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Fiscal Period: June 2017 2018 2019 2020 2021 2022
Capitalization 1 532 175 757 640 1 026 511 1 391 637 - -
Entreprise Value (EV) 1 485 388 700 112 964 870 1 315 823 1 299 246 1 276 659
P/E ratio 25,4x 46,3x 26,5x 32,3x 29,7x 25,8x
Yield 2,26% 1,70% 1,37% 1,10% 1,18% 1,31%
Capitalization / Revenue 5,51x 6,87x 8,16x 9,81x 8,89x 7,95x
EV / Revenue 5,02x 6,34x 7,67x 9,28x 8,30x 7,30x
EV / EBITDA 12,7x 15,4x 17,7x 20,2x 18,3x 15,9x
Cours sur Actif net 7,46x 9,15x 10,0x 12,1x 10,1x 8,49x
Nbr of stocks (in thousands)7 720 510 7 683 198 7 662 818 7 583 440 - -
Reference price (USD) 68,9 98,6 134 184 184 184
Last update 07/20/2017 07/19/2018 07/18/2019 05/08/2020 04/30/2020 04/30/2020
Table name: Annual Income Statement Data
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Fiscal Period: June 2017 2018 2019 2020 2021 2022
Net sales 1 96 657 110 360 125 843 141 818 156 534 174 945
EBITDA 1 38 117 45 319 54 641 65 074 70 966 80 445
Operating profit (EBIT) 129 339 35 058 42 959 52 544 57 045 65 289
Operating Margin 30,4% 31,8% 34,1% 37,1% 36,4% 37,3%
Pre-Tax Profit (EBT) 1 23 149 36 474 43 688 52 521 57 042 65 225
Net income 1 21 204 16 571 39 240 43 693 47 223 53 905
Net margin 21,9% 15,0% 31,2% 30,8% 30,2% 30,8%
EPS 2 2,71 2,13 5,06 5,68 6,18 7,11
Dividend per Share 2 1,56 1,68 1,84 2,02 2,16 2,41
Last update 07/20/2017 07/19/2018 07/18/2019 05/22/2020 05/22/2020 05/22/2020
Table name: Balance Sheet Analysis
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Fiscal Period: June 2017 2018 2019 2020 2021 2022
Net Debt 1 - - - - - -
Net Cash position 1 46 787 57 528 61 641 75 814 92 392 114 978
Leverage (Debt / EBITDA) -1,23x -1,27x -1,13x -1,17x -1,30x -1,43x
Free Cash Flow 1 31 378 32 252 38 260 41 953 46 887 53 155
ROE (Net Profit / Equities)29,4% 19,4% 42,4% 36,6% 34,5% 36,1%
Shareholders' equity 1 72 195 85 215 92 524 119 417 136 690 149 484
ROA (Net Profit / Asset) 9,76% 6,51% 14,4% 18,5% 14,6% 14,7%
Assets 1 217 276 254 580 272 703 235 800 323 445 366 702
Book Value Per Share 2 9,24 10,8 13,4 15,2 18,2 21,6
Cash Flow per Share 2 5,04 5,63 6,73 7,03 8,02 9,79
Capex 1 8 129 11 632 13 925 15 698 17 922 19 507
Capex / Sales 8,41% 10,5% 11,1% 11,1% 11,4% 11,2%
Last update 07/20/2017 07/19/2018 07/18/2019 05/22/2020 05/22/2020 05/04/2020
EDIT (to save all_data as a csv file):
import csv

with open('data.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for k, v in all_data.items():
        spamwriter.writerow([k])
        for row in v:
            spamwriter.writerow(row)
Screenshot from LibreOffice:
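If the tables are served as static HTML, pandas.read_html is a shorter route that also fits the Excel goal. A sketch under that assumption (marketscreener may block requests without browser headers, and to_excel needs openpyxl installed):
import pandas as pd

url = 'https://www.marketscreener.com/MICROSOFT-CORPORATION-4835/financials/'
# read_html returns one DataFrame per <table> matching attrs
tables = pd.read_html(url, attrs={'class': 'BordCollapseYear2'})

# write each table to its own sheet of one workbook
with pd.ExcelWriter('financials.xlsx') as writer:
    for i, df in enumerate(tables):
        df.to_excel(writer, sheet_name='table_{}'.format(i), index=False)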

How to keep duplicated rows that repeat exactly n times in a pandas DataFrame

I have a DataFrame that looks like this, with ~10k rows:
peak start peak end motif_start motif_end strand
0 948 177 3210085 3210103 -
1 948 177 3210047 3210065 +
2 062 419 3223269 3223287 -
3 062 419 3223229 3223247 +
4 062 419 3223232 3223250 +
...
Some values in the 'peak start' column repeat from 2 to 8 times. I need to extract into a new DataFrame the rows whose 'peak start' value repeats exactly n times (n between 2 and 8).
Desired output:
n=2
peak start peak end motif_start motif_end strand
0 948 177 3210085 3210103 -
1 948 177 3210047 3210065 +
n=3
peak start peak end motif_start motif_end strand
2 062 419 3223269 3223287 -
3 062 419 3223229 3223247 +
4 062 419 3223232 3223250 +
And so on for each n.
I tried:
new_df = df.groupby('peak start').head(n)
but it also returns the first n rows for values that repeat more than n times.
I am new to Python, so I'm looking for an existing method that I may not be aware of, rather than iterating over the DataFrame and counting.
Any ideas?
Use .transform with 'count' and a boolean filter:
s = df.groupby('peak_start')['peak_start'].transform('count')
df[s == 2]
peak_start peak_end motif_start motif_end strand
0 948 177 3210085 3210103 -
1 948 177 3210047 3210065 +
print(df[s == 3])
peak_start peak_end motif_start motif_end strand
2 62 419 3223269 3223287 -
3 62 419 3223229 3223247 +
4 62 419 3223232 3223250 +
Use GroupBy.transform with 'size' to perform boolean indexing:
m = df.groupby(['peak start'])['peak start'].transform('size')
#if you want to consider both
#m = df.groupby(['peak start', 'peak end'])['peak start'].transform('size')
now you can filter your dataframe:
df.loc[m.between(2, 8)] #inclusive = True by default
peak start peak end motif_start motif_end strand
0 948 177 3210085 3210103 -
1 948 177 3210047 3210065 +
2 062 419 3223269 3223287 -
3 062 419 3223229 3223247 +
4 062 419 3223232 3223250 +
df.loc[m.eq(2)]
peak start peak end motif_start motif_end strand
0 948 177 3210085 3210103 -
1 948 177 3210047 3210065 +
df.loc[m.eq(3)]
peak start peak end motif_start motif_end strand
2 062 419 3223269 3223287 -
3 062 419 3223229 3223247 +
4 062 419 3223232 3223250 +
We can also use value_counts
m = df['peak start'].value_counts()
df.loc[df['peak start'].map(m).eq(2)]
or GroupBy.filter
n = 2
my_range = range(2, 8 + 1)
df.groupby('peak start').filter(lambda group: len(group) == n)
df.groupby('peak start').filter(lambda group: len(group) in my_range)
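To get one DataFrame per repeat count in a single pass, the transform('size') mask generalizes to a dict comprehension; a minimal sketch on the question's 'peak start' column:
counts = df.groupby('peak start')['peak start'].transform('size')
# map each n in 2..8 to the rows whose 'peak start' occurs exactly n times
per_n = {n: df[counts == n] for n in range(2, 9)}
print(per_n[3])  # the rows that repeat exactly 3 times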

Convert python pandas rows to columns

Decade difference (kg) Version
0 1510 - 1500 -0.346051 v1.0h
1 1510 - 1500 -3.553251 A2011
2 1520 - 1510 -0.356409 v1.0h
3 1520 - 1510 -2.797978 A2011
4 1530 - 1520 -0.358922 v1.0h
I want to transform the pandas DataFrame so that the two unique entries in the Version column become columns. How do I do that?
The resulting DataFrame should not have a MultiIndex.
In [28]: df.pivot(index='Decade', columns='Version', values='difference (kg)')
Out[28]:
Version A2011 v1.0h
Decade
1510 - 1500 -3.553251 -0.346051
1520 - 1510 -2.797978 -0.356409
1530 - 1520 NaN -0.358922
or
In [31]: df.pivot(index='difference (kg)', columns='Version', values='Decade')
Out[31]:
Version A2011 v1.0h
difference (kg)
-3.553251 1510 - 1500 None
-2.797978 1520 - 1510 None
-0.358922 None 1530 - 1520
-0.356409 None 1520 - 1510
-0.346051 None 1510 - 1500
Both satisfy your requirements.
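Since the requirement is no MultiIndex, note that pivot leaves 'Decade' in the index and names the columns axis 'Version'; a short sketch to flatten both back out:
out = (df.pivot(index='Decade', columns='Version', values='difference (kg)')
         .reset_index()               # move 'Decade' back into a column
         .rename_axis(None, axis=1))  # drop the 'Version' columns-axis name
print(out)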

Python Function returns wrong value

periodsList = []
su = '0:'
Su = []
sun = []
SUN = ''
I'm formatting timetables by converting
extendedPeriods = ['0: 1200 - 1500',
'0: 1800 - 2330',
'2: 1200 - 1500',
'2: 1800 - 2330',
'3: 1200 - 1500',
'3: 1800 - 2330',
'4: 1200 - 1500',
'4: 1800 - 2330',
'5: 1200 - 1500',
'5: 1800 - 2330',
'6: 1200 - 1500',
'6: 1800 - 2330']
into '1200 - 1500/1800 - 2330'.
su is the day identifier.
Su and sun store intermediate values.
SUN stores the converted timetable.
for line in extendedPeriods:
    if su in line:
        Su.append(line)
for item in Su:
    sun.append(item.replace(su, '', 1).strip())
SUN = '/'.join([str(x) for x in sun])
Then I tried to write a function to apply my "converter" to the other days as well:
def formatPeriods(id, store1, store2, periodsDay):
    for line in extendedPeriods:
        if id in line:
            store1.append(line)
    for item in store1:
        store2.append(item.replace(id, '', 1).strip())
    periodsDay = '/'.join([str(x) for x in store2])
    return periodsDay
But the function returns 12 misformatted strings...
'1200 - 1500', '1200 - 1500/1200 - 1500/1800 - 2330',
You can use collections.OrderedDict here; if order doesn't matter, use collections.defaultdict:
>>> from collections import OrderedDict
>>> dic = OrderedDict()
>>> for item in extendedPeriods:
...     k, v = item.split(': ')
...     dic.setdefault(k, []).append(v)
...
>>> for k, v in dic.iteritems():
...     print "/".join(v)
...
1200 - 1500/1800 - 2330
1200 - 1500/1800 - 2330
1200 - 1500/1800 - 2330
1200 - 1500/1800 - 2330
1200 - 1500/1800 - 2330
1200 - 1500/1800 - 2330
To access a particular day you can use:
>>> print "/".join(dic['0'])  # Sunday
1200 - 1500/1800 - 2330
>>> print "/".join(dic['2'])  # Tuesday
1200 - 1500/1800 - 2330
This is your general logic:
from collections import defaultdict

d = defaultdict(list)
for i in extendedPeriods:
    bits = i.split(':')
    d[bits[0].strip()].append(bits[1].strip())
for k, v in d.iteritems():
    print k, '/'.join(v)
The output is:
0 1200 - 1500/1800 - 2330
3 1200 - 1500/1800 - 2330
2 1200 - 1500/1800 - 2330
5 1200 - 1500/1800 - 2330
4 1200 - 1500/1800 - 2330
6 1200 - 1500/1800 - 2330
To make it a function for a single day, simply select d['0'] (for Sunday, for example):
def schedule_per_day(day):
    d = defaultdict(list)
    for i in extendedPeriods:
        bits = i.split(':')
        d[bits[0].strip()].append(bits[1].strip())
    return '/'.join(d[day]) if d.get(day) else None
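Both answers above are Python 2 (print statements, iteritems). A minimal Python 3 sketch of the same grouping logic, wrapped as one helper (periods_by_day is a hypothetical name):
from collections import defaultdict

def periods_by_day(periods):
    """Group '<day>: <span>' strings into one 'span1/span2' string per day."""
    d = defaultdict(list)
    for line in periods:
        day, span = line.split(':', 1)
        d[day.strip()].append(span.strip())
    return {day: '/'.join(spans) for day, spans in d.items()}

print(periods_by_day(extendedPeriods)['0'])  # 1200 - 1500/1800 - 2330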
