Cut string if the string size exceeds 5000 bytes - python

Currently I am trying to translate a Pandas dataframe with Amazon Translate. However, the maximum length of request text allowed is 5000 bytes, and the dataframe contains multiple strings that exceed this limit.
Therefore I want to implement a solution that cuts the string in the "Content" column into chunks below 5000 bytes, with the number of chunks depending on the original string size, so that the limit is not exceeded and no text is lost.
To be more precise: the dataframe contains 3 columns:
Newspaper Date Content
6 Trouw 2018 Het is de laatste kolenmijn in Duitsland de Pr...
7 Trouw 2018 Liever wat meer kwijt aan energieheffing dan r...
8 Trouw 2018 De VVD doet een voorstel dat op het Binnenhof ...
9 Trouw 2018 In Nederland bestaat grote weerstand tegen ker...
10 Trouw 2017 Theo Potma 1932 2017 had zijn blocnote altijd...
11 Trouw 2017 Hoe en hoe snel kan Nederland zijn beloften op...
12 Trouw 2017 transitie Hoe en hoe snel kan Nederland zijn ...
14 Trouw 2016 Welke ideeën koestert Angela Merkel Henri Beun...
15 Trouw 2016 Welke ideeën koestert Angela Merkel Henri Beun...
16 Trouw 2015 Rapport Dwing burger CO\n Nederland heeft e...
Only the "Content" column should be checked for string size and cut into chunks, while keeping the original "Newspaper" and "Date" column data. By doing so I can still trace each chunk of text back to its original row.
Is there anyone who can help with such a solution?
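One possible approach (a minimal sketch, assuming the dataframe from the question is loaded as df, that the text can be split on whitespace, and that no single word exceeds the limit): split each content string into word-boundary chunks whose UTF-8 size stays under 5000 bytes, then explode so each chunk gets its own row with the original "Newspaper" and "Date" values.
import pandas as pd

MAX_BYTES = 5000

def chunk_by_bytes(text, max_bytes=MAX_BYTES):
    """Split text on whitespace into chunks whose UTF-8 size stays under max_bytes."""
    chunks, current, size = [], [], 0
    for word in text.split():
        word_size = len(word.encode('utf-8')) + 1  # +1 for the joining space
        if size + word_size > max_bytes and current:
            chunks.append(' '.join(current))
            current, size = [], 0
        current.append(word)
        size += word_size
    if current:
        chunks.append(' '.join(current))
    return chunks

# one chunk per row; Newspaper and Date are repeated, so every chunk stays traceable
df['Content'] = df['Content'].apply(chunk_by_bytes)
df = df.explode('Content', ignore_index=True)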

Related

Get the number of involved singers in a phase

I have a dataset like this
import pandas as pd
df = pd.read_csv("music.csv")
df
    name        date      singer                      language  phase
1   Yes or No   02.01.20  Benjamin Smith              en        1
2   Parabens    01.06.21  Rafael Galvao;Simon Murphy  pt;en     2
3   Love        12.11.20  Michaela Condell            en        1
4   Paz         11.07.19  Ana Perez; Eduarda Pinto    es;pt     3
5   Stop        12.01.21  Michael Conway;Gabriel Lee  en;en     1
6   Shalom      18.06.21  Shimon Cohen                hebr      1
7   Habibi      22.12.19  Fuad Khoury                 ar        3
8   viva        01.08.21  Veronica Barnes             en        1
9   Buznanna    23.09.20  Kurt Azzopardi              mt        1
10  Frieden     21.05.21  Gabriel Meier               dt        1
11  Uruguay     11.04.21  Julio Ramirez               es        1
12  Beautiful   17.03.21  Cameron Armstrong           en        3
13  Holiday     19.06.20  Bianca Watson               en        3
14  Kiwi        21.10.20  Lachlan McNamara            en        1
15  Amore       01.12.20  Vasco Grimaldi              it        1
16  La vie      28.04.20  Victor Dubois               fr        3
17  Yom         21.02.20  Ori Azerad; Naeem al-Hindi  hebr;ar   2
18  Elefthería  15.06.19  Nikolaos Gekas              gr        1
I convert it to 1NF:
import pandas as pd
import numpy as np
df = pd.read_csv("music.csv")
df['language'] = df['language'].str.split(';')
df['singer'] = df['singer'].str.split(';')
df = df.explode(['language', 'singer'])  # assign the result; explode does not work in place
d = pd.DataFrame(df)
d
And I create a dataframe. Now I would like to find out which phase has the most singers involved.
I used this:
df = df.groupby('singer')
df['phase'].value_counts().idxmax()
But I could not get a solution.
The dataframe has 42 observations, so some singers occur more than once.
Source: convert data to 1NF
You do not need to split/explode, you can directly count the number of ; per row and add 1:
df['singer'].str.count(';').add(1).groupby(df['phase']).sum()
If you want the classical split/explode:
(df.assign(singer=df['singer'].str.split(';'))
   .explode('singer')
   .groupby('phase')['singer'].count()
)
Output:
phase
1 12
2 4
3 6
Name: singer, dtype: int64
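If you only need the phase with the most singers involved, a small extension of the answer above is to chain idxmax onto those counts:
counts = df['singer'].str.count(';').add(1).groupby(df['phase']).sum()
counts.idxmax()  # -> 1 (phase 1, with 12 singers)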

Scraping issue with id_tag

I'm trying to extract data from a website with BeautifulSoup.
I'm actually stuck with this:
"Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien</a>"
I want to get the names of the translators, but the tag uses their id.
My code is:
translater = soup.find_all("a", href="/searchinternet/advanced?all_authors_id=")
I tried with str.startswith but it doesn't work.
Can someone help me, please?
Provided your HTML is correct and static (doesn't get loaded with JavaScript after the initial page load), this is one way to select that/those links:
from bs4 import BeautifulSoup as bs
html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien</a></p>'''
soup = bs(html, 'html.parser')
a = soup.select('a[href^="/searchinternet/advanced?all_authors_id="]')
print(a[0])
print(a[0].get_text(strip=True))
print(a[0].get('href'))
Result in terminal:
<a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien</a>
Camille Fabien
/searchinternet/advanced?all_authors_id=35534&SearchAction=1
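The same links can also be matched with find_all, closer to what the OP attempted, since BeautifulSoup accepts a callable as an attribute filter (a minimal sketch of that alternative):
translaters = soup.find_all('a', href=lambda h: h and h.startswith('/searchinternet/advanced?all_authors_id='))
print([a.get_text(strip=True) for a in translaters])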
EDIT: Who doesn't like a challenge?... Based on further comments made by the OP, here is a way of obtaining titles, authors, translators and illustrators from that page, considering there can be one or more translators and one or more illustrators:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
headers = {
'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
url = 'https://www.gallimard.fr/searchinternet/advanced/(editor_brand_id)/1/(fserie)/FOLIO-JUNIOR+LIVRE+HEROS%3A%3AFolio+Junior+-+Un+Livre+dont+Vous+%C3%AAtes+le+H%C3%A9ros+%40+DEFIS+FANTASTIQ%3A%3AS%C3%A9rie+D%C3%A9fis+Fantastiques/(limit)/3?date%5Bfrom%5D=1980-01-01&date%5Bto%5D=1995-01-01&SearchAction=OK'
big_list = []
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
items = soup.select('div[class="results bg_white"] > table div[class="item"]')
print()
for i in items:
    title = i.select_one('div[class="title"] h3')
    author = i.select_one('div[class="author"] a')
    history = i.select_one('p[class="collective_work_entries"]')
    translators = [[y.get_text() for y in x.find_previous_siblings('a')] for x in history.contents if "Illustrations" in x]
    illustrators = [[y.get_text() for y in x.find_next_siblings('a')] for x in history.contents if "Illustrations" in x]
    big_list.append((title.text.strip(), author.text.strip(), ', '.join([x for y in translators for x in y]), ', '.join([x for y in illustrators for x in y])))
df = pd.DataFrame(big_list, columns = ['Title', 'Author', 'Translator(s)', 'Illustrator(s)'])
print(df)
Result in terminal:
    Title                              Author               Translator(s)                      Illustrator(s)
 0  Le Sépulcre des Ombres             Jonathan Green       Noël Chassériau                    Alan Langford
 1  La Légende de Zagor                Ian Livingstone      Pascale Houssin                    Martin McKenna
 2  Les Mages de Solani                Keith Martin         Noël Chassériau                    Russ Nicholson
 3  Le Siège de Sardath                Keith P. Phillips    Yannick Surcouf                    Pete Knifton
 4  Retour à la Montagne de Feu        Ian Livingstone      Yannick Surcouf                    Martin McKenna
 5  Les Mondes de l'Aleph              Peter Darvill-Evans  Yannick Surcouf                    Tony Hough
 6  Les Mercenaires du Levant          Paul Mason           Mona de Pracontal                  Terry Oakes
 7  L'Arpenteur de la Lune             Stephen Hand         Pierre de Laubier                  Martin McKenna, Terry Oakes
 8  La Tour de la Destruction          Keith Martin         Mona de Pracontal                  Pete Knifton
 9  La Légende des Guerriers Fantômes  Stephen Hand         Alexis Galmot                      Martin McKenna
10  Le Repaire des Morts-Vivants       Dave Morris          Nicolas Grenier                    David Gallagher
11  L'Ancienne Prophétie               Paul Mason           Mona de Pracontal                  Terry Oakes
12  La Vengeance des Démons            Jim Bambra           Mona de Pracontal                  Martin McKenna
13  Le Sceptre Noir                    Keith Martin         Camille Fabien                     David Gallagher
14  La Nuit des Mutants                Peter Darvill-Evans  Anne Collas                        Alan Langford
15  L'Élu des Six Clans                Luke Sharp           Noël Chassériau                    Martin Mac Kenna, Martin McKenna
16  Le Volcan de Zamarra               Luke Sharp           Olivier Meyer                      David Gallagher
17  Les Sombres Cohortes               Ian Livingstone      Noël Chassériau                    Nik William
18  Le Vampire du Château Noir         Keith Martin         Mona de Pracontal                  Martin McKenna
19  Le Voleur d'Âmes                   Keith Martin         Mona de Pracontal                  Russ Nicholson
20  Le Justicier de l'Univers          Martin Allen         Mona de Pracontal                  Tim Sell
21  Les Esclaves de l'Eternité         Paul Mason           Sylvie Bonnet                      Bob Harvey
22  La Créature venue du Chaos         Steve Jackson        Noël Chassériau                    Alan Langford
23  Les Rôdeurs de la Nuit             Graeme Davis         Nicolas Grenier                    John Sibbick
24  L'Empire des Hommes-Lézards        Marc Gascoigne       Jean Lacroix                       David Gallagher
25  Les Gouffres de la Cruauté         Luke Sharp           Sylvie Bonnet                      Russ Nicholson
26  Les Spectres de l'Angoisse         Robin Waterfield     Mona de Pracontal                  Ian Miller
27  Le Chasseur des Étoiles            Luke Sharp           Arnaud Dupin de Beyssat            Cary Mayes, Gary Mayes
28  Les Sceaux de la Destruction       Robin Waterfield     Sylvie Bonnet                      Russ Nicholson
29  La Crypte du Sorcier               Ian Livingstone      Noël Chassériau                    John Sibbick
30  La Forteresse du Cauchemar         Peter Darvill-Evans  Mona de Pracontal                  Dave Carson
31  La Grande Menace des Robots        Steve Jackson        Danielle Plociennik                Gary Mayes
32  L'Épée du Samouraï                 Mark Smith           Pascale Jusforgues                 Alan Langford
33  L'Épreuve des Champions            Ian Livingstone      Alain Vaulont, Pascale Jusforgues  Brian Williams
34  Défis Sanglants sur l'Océan        Andrew Chapman       Jean Walter                        Bob Harvey
35  Les Démons des Profondeurs         Steve Jackson        Noël Chassériau                    Bob Harvey
36  Rendez-vous avec la M.O.R.T.       Steve Jackson        Arnaud Dupin de Beyssat            Declan Considine
37  La Planète Rebelle                 Robin Waterfield     C. Degolf                          Gary Mayes
38  Les Trafiquants de Kelter          Andrew Chapman       Anne Blanchet                      Nik Spender
39  Le Combattant de l'Autoroute       Ian Livingstone      Alain Vaulont, Pascale Jusforgues  Kevin Bulmer
40  Le Mercenaire de l'Espace          Andrew Chapman       Jean Walthers                      Geoffroy Senior
41  Le Temple de la Terreur            Ian Livingstone      Denise May                         Bill Houston
42  Le Manoir de l'Enfer               Steve Jackson
43  Le Marais aux Scorpions            Steve Jackson        Camille Fabien                     Duncan Smith
44  Le Talisman de la Mort             Steve Jackson        Camille Fabien                     Bob Harvey
45  La Sorcière des Neiges             Ian Livingstone      Michel Zénon                       Edward Crosby, Gary Ward
46  La Citadelle du Chaos              Steve Jackson        Marie-Raymond Farré                Russ Nicholson
47  La Galaxie Tragique                Steve Jackson        Camille Fabien                     Peter Jones
48  La Forêt de la Malédiction         Ian Livingstone      Camille Fabien                     Malcolm Barter
49  La Cité des Voleurs                Ian Livingstone      Henri Robillot                     Iain McCaig
50  Le Labyrinthe de la Mort           Ian Livingstone      Patricia Marais                    Iain McCaig
51  L'Île du Roi Lézard                Ian Livingstone      Fabienne Vimereu                   Alan Langford
52  Le Sorcier de la Montagne de Feu   Steve Jackson        Camille Fabien                     Russ Nicholson
Bear in mind this method fails for Le Manoir de l'Enfer (row 42 above, with empty Translator and Illustrator cells), because the word 'Illustrations' is not found in its entry. It's down to the OP to find a solution for that one.
BeautifulSoup documentation can be found at https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Also, Pandas docs can be found here: https://pandas.pydata.org/pandas-docs/stable/index.html
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("./test.html", "r"), 'html.parser')
names = []
for elem in soup:  # iterating the soup yields its top-level elements
    names.append(elem.text)

Data formatting in Python

I want to know if there is a way to modify these dates (they are in PT-BR) and put them in numerical form, in EN or PT-BR.
Início Fim
0 15 de março de 1985 14 de fevereiro de 1986
1 14 de fevereiro de 1986 23 de novembro de 1987
2 23 de novembro de 1987 15 de janeiro de 1989
3 16 de janeiro de 1989 14 de março de 1990
4 15 de março de 1990 23 de janeiro de 1992
We can setlocale LC_TIME to pt_PT (available locale names vary by platform); then to_datetime will work as expected with a format string:
import locale
import pandas as pd
locale.setlocale(locale.LC_TIME, 'pt_PT')
df = pd.DataFrame({
    'Início': ['15 de março de 1985', '14 de fevereiro de 1986',
               '23 de novembro de 1987', '16 de janeiro de 1989',
               '15 de março de 1990'],
    'Fim': ['14 de fevereiro de 1986', '23 de novembro de 1987',
            '15 de janeiro de 1989', '14 de março de 1990',
            '23 de janeiro de 1992']
})
cols = ['Início', 'Fim']
df[cols] = df[cols].apply(pd.to_datetime, format='%d de %B de %Y')
df:
Início Fim
0 1985-03-15 1986-02-14
1 1986-02-14 1987-11-23
2 1987-11-23 1989-01-15
3 1989-01-16 1990-03-14
4 1990-03-15 1992-01-23
1. Split the string inside each cell by " de ".
2. Replace the 2nd element with the corresponding month number (I suggest using a dictionary for this).
3. Join the list into a string. I suggest using str.join, but string concatenation or formatted strings work too.
Let's use an example.
date = "23 de novembro de 1987"
dates = date.split(" de ") # ['23', 'novembro', '1987']
dates[1] = "11" # ['23', '11', '1987']
numerical_date = '/'.join(dates) # "23/11/1987"
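Putting the three steps together, a minimal sketch assuming the column names from the question and a hand-written Portuguese month lookup:
MONTHS = {'janeiro': '01', 'fevereiro': '02', 'março': '03', 'abril': '04',
          'maio': '05', 'junho': '06', 'julho': '07', 'agosto': '08',
          'setembro': '09', 'outubro': '10', 'novembro': '11', 'dezembro': '12'}

def to_numeric(date):
    day, month, year = date.split(' de ')  # e.g. ['23', 'novembro', '1987']
    return '/'.join([day, MONTHS[month], year])  # "23/11/1987"

for col in ['Início', 'Fim']:
    df[col] = df[col].map(to_numeric)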

Python requests / urllib / selenium not parsing entire webpage HTML

I've been trying to parse HTML from https://www.teamrankings.com/nba/team/oklahoma-city-thunder but can't get the full page to parse. I've tried requests, urllib, and selenium with BeautifulSoup; none of them parse the full HTML. The closest I got was with urllib (code below). I've tried many different user agents and all the different parsers.
If I print webpage before using BeautifulSoup, I can see all the content. Once I use BeautifulSoup, it cuts most of it out. I've tried html.parser, lxml, and html5lib.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
url = 'https://www.teamrankings.com/nba/team/oklahoma-city-thunder'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3'})
webpage = urlopen(req).read()
print(webpage)
basketball = BeautifulSoup(webpage, 'html.parser')
print(basketball)
Thanks in advance!
Not sure what you mean by not getting all the content. Have you tried just using Pandas? It uses BeautifulSoup under the hood to parse <table> tags, and it returns the full table for me.
EDIT
In the future, be more specific in your question. It wasn't until your comments that you explained more. It's all there, you just need to iterate through it all.
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.teamrankings.com/nba/team/oklahoma-city-thunder'
response = requests.get(url)
df = pd.DataFrame()
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find_all('table')[1]
cols = [each.text for each in table.find_all('th')]
rows = table.find_all('tr')
for row in rows:
    data = [each.text for each in row.find_all('td')]
    temp_df = pd.DataFrame([data])
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat([df, temp_df]) there
    df = df.append(temp_df, sort=True).reset_index(drop=True)
df = df.dropna()
df.columns = cols
Output:
print (df)
Date Opponent Result Location W/L Div Spread Total Money
1 10/23 Utah L 95-100 Away 0-1 0-1 +9.0 Un 221.0 +339
2 10/25 Washington L 85-97 Home 0-2 0-1 -8.5 Un 218.5 -399
3 10/27 Golden State W 120-92 Home 1-2 0-1 -1.0 Un 223.5 -117
4 10/28 Houston L 112-116 Away 1-3 0-1 +10.0 Ov 227.5 +433
5 10/30 Portland L 99-102 Home 1-4 0-2 +1.5 Un 221.5 +104
6 11/02 New Orleans W 115-104 Home 2-4 0-2 -2.0 Un 228.5 -124
7 11/05 Orlando W 102-94 Home 3-4 0-2 -3.0 Un 201.5 -142
8 11/07 San Antonio L 112-121 Away 3-5 0-2 +5.0 Ov 211.5 +172
9 11/09 Golden State W 114-108 Home 4-5 0-2 -12.5 Ov 216.5 -770
10 11/10 Milwaukee L 119-121 Home 4-6 0-2 +8.5 Ov 220.0 +329
11 11/12 Indiana L 85-111 Away 4-7 0-2 +1.0 Un 213.0 -101
12 11/15 Philadelphia W 127-119 Home 5-7 0-2 +3.5 Ov 214.0 +148
13 11/18 LA Clippers L 88-90 Away 5-8 0-2 +7.5 Un 222.0 +297
14 11/19 LA Lakers L 107-112 Away 5-9 0-2 +11.0 Ov 209.5 +469
15 11/22 LA Lakers L 127-130 Home 5-10 0-2 +4.5 Ov 209.5 +186
16 11/25 Golden State W 100-97 Away 6-10 0-2 -7.5 Un 213.5 -297
17 11/27 Portland L 119-136 Away 6-11 0-3 +3.0 Ov 219.0 +137
18 11/29 New Orleans W 109-104 Home 7-11 0-3 -4.5 Un 229.0 -195
19 12/01 New Orleans W 107-104 Away 8-11 0-3 +2.5 Un 226.5 +124
20 12/04 Indiana L 100-107 Home 8-12 0-3 +1.5 Un 208.5 +102
21 12/06 Minnesota W 139-127 Home 9-12 1-3 -3.5 Ov 218.0 -160
22 12/08 Portland W 108-96 Away 10-12 2-3 +3.5 Un 223.0 +154
23 12/09 Utah W 104-90 Away 11-12 3-3 +8.5 Un 206.5 +311
24 12/11 Sacramento L 93-94 Away 11-13 3-3 +1.5 Un 207.5 +117
25 12/14 Denver L 102-110 Away 11-14 3-4 +5.5 Ov 204.0 +211
26 12/16 Chicago W 109-106 Home 12-14 3-4 -5.0 Ov 208.5 -211
27 12/18 Memphis W 126-122 Home 13-14 3-4 -6.5 Ov 219.5 -254
28 12/20 Phoenix W 126-108 Home 14-14 3-4 -3.0 Ov 224.5 -147
29 12/22 LA Clippers W 118-112 Home 15-14 3-4 -1.0 Ov 223.5 -111
30 12/26 Memphis L 97-110 Home 15-15 3-4 -5.5 Un 224.0 -242
.. ... ... ... ... ... ... ... ... ...
53 02/09 Boston 3:30 pm Home -- -- --
54 02/11 San Antonio 8:00 pm Home -- -- --
55 02/13 New Orleans 8:00 pm Away -- -- --
56 02/21 Denver 8:00 pm Home -- -- --
57 02/23 San Antonio 7:00 pm Home -- -- --
58 02/25 Chicago 8:00 pm Away -- -- --
59 02/27 Sacramento 8:00 pm Home -- -- --
60 02/28 Milwaukee 8:00 pm Away -- -- --
61 03/03 LA Clippers 8:00 pm Home -- -- --
62 03/04 Detroit 7:00 pm Away -- -- --
63 03/06 New York 7:30 pm Away -- -- --
64 03/08 Boston 6:00 pm Away -- -- --
65 03/11 Utah 8:00 pm Home -- -- --
66 03/13 Minnesota 8:00 pm Home -- -- --
67 03/15 Washington 6:00 pm Away -- -- --
68 03/17 Memphis 8:00 pm Away -- -- --
69 03/18 Atlanta 7:30 pm Away -- -- --
70 03/20 Denver 8:00 pm Home -- -- --
71 03/23 Miami 7:30 pm Away -- -- --
72 03/26 Charlotte 8:00 pm Home -- -- --
73 03/28 Golden State 8:30 pm Away -- -- --
74 03/30 Denver 9:00 pm Away -- -- --
75 04/01 Phoenix 8:00 pm Home -- -- --
76 04/04 LA Clippers 3:30 pm Away -- -- --
77 04/05 LA Lakers 9:30 pm Away -- -- --
78 04/07 Brooklyn 8:00 pm Home -- -- --
79 04/10 New York 8:00 pm Home -- -- --
80 04/11 Memphis 8:00 pm Away -- -- --
81 04/13 Utah 8:00 pm Home -- -- --
82 04/15 Dallas 7:30 pm Away -- -- --
[82 rows x 9 columns]
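For completeness, the Pandas shortcut mentioned at the start of this answer can be as small as this (a sketch; the position of the schedule table in read_html's result may vary):
import pandas as pd
import requests

url = 'https://www.teamrankings.com/nba/team/oklahoma-city-thunder'
tables = pd.read_html(requests.get(url).text)  # parses every <table> on the page
df = tables[1]  # the schedule/results table, selected by position
print(df.head())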

Print formatting to file in Python

I wish to write to a file in a formatted way. I have been searching how I can do this, and the best solution I have managed to get is this:
write_to_file.write('{:20} {:20} {:3}\n'.format(w[0][0], w[0][1], w[1]))
However, when I do this I do not get precise formatting.
det er 6
er det 5
den er 5
du kan 4
hva er 3
har en 3
er død 3
å gjøre 3
jeg vil 3
har vi 3
et dikt 2
når du 2
det var 2
må være 2
kan skrive 2
hva gjør 2
ha et 2
jeg har 2
du skal 2
vi kan 2
jeg kan 2
en vakker 2
er du 2
når man 2
får jeg 2
I get things printed in the fashion above. I need everything to align perfectly.
That's because you're still dealing with bytes: under Python 2, a field width like {:20} counts bytes, so multi-byte UTF-8 characters such as å and ø make those lines come out visually shorter. Once you start dealing with actual characters, you'll find that they align perfectly.
write_to_file.write('{:20} {:20} {:3}\n'.format(u'å', u'gjøre', u'3'))
The DataFrame class in the pandas library would also be worth looking into.
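In Python 3, where strings are sequences of characters rather than bytes, the original format spec aligns correctly as-is; a minimal sketch using a few of the word pairs from the question's output:
pairs = [(('det', 'er'), 6), (('å', 'gjøre'), 3), (('kan', 'skrive'), 2)]
with open('counts.txt', 'w', encoding='utf-8') as write_to_file:
    for w in pairs:
        write_to_file.write('{:20} {:20} {:3}\n'.format(w[0][0], w[0][1], w[1]))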
