I am trying, and so far failing, to find a way to extract textual information with spaCy and present it in a table.
An example text would be:
lines = 'From June 2020 to November 2020 the total rent was 800 Euro. It was composed of a basic rent of 600 Euro, a premium for the Heating of 100 Euro and another premium for the Garage of 100 Euro. From Dezember 2020 to January 2021 the total rent was 1000 Euro, then composed of a basic rent of 800 Euro, a premium for the Heating of 100 Euro and another premium for the Garage of 100 Euro.'
The output I would like to achieve is as follows:
| Period | Total Rent | Basic Rent | Heating Premium | Garage Premium |
|------------------------|------------|------------|-----------------|----------------|
| June 2020-November 2020 | 800 Euro | 600 Euro | 100 Euro | 100 Euro |
| Dezember 2020-January 2021 | 1000 Euro | 800 Euro | 100 Euro | 100 Euro |
So far I have tokenized the text, which seems useful. Then I iterated over the tokens and displayed only the nouns, proper nouns and numbers:
print("Iterate over the tokens and print the predicted part of speech:")
for token in doc:
    # Print the text and the predicted part of speech
    if token.pos_ in ("NOUN", "NUM", "PROPN"):
        print(token.text, token.pos_)
The result is:
June PROPN
2020 NUM
November PROPN
2020 NUM
rent NOUN
800 NUM
Euro PROPN
rent NOUN
600 NUM
Euro PROPN
premium NOUN
Heating PROPN
100 NUM
Euro PROPN
premium NOUN
Garage PROPN
100 NUM
Euro PROPN
Dezember PROPN
2020 NUM
January PROPN
2021 NUM
rent NOUN
1000 NUM
Euro PROPN
rent NOUN
800 NUM
Euro PROPN
premium NOUN
Heating PROPN
100 NUM
Euro PROPN
premium NOUN
Garage PROPN
100 NUM
Euro PROPN
This seems helpful because it contains the main parts that should appear in the table.
However, I am not sure there is a way to build the table automatically.
Does anybody have an idea?
Thanks in advance.
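Because the example sentences follow a fixed wording, one possible starting point is plain regular expressions rather than spaCy. The following is only a sketch under that assumption: the patterns, the column names and the `rows` variable are made up for this example, and any text phrased differently would need other patterns or a real NLP pipeline.

```python
import re

lines = ('From June 2020 to November 2020 the total rent was 800 Euro. '
         'It was composed of a basic rent of 600 Euro, a premium for the '
         'Heating of 100 Euro and another premium for the Garage of 100 Euro. '
         'From Dezember 2020 to January 2021 the total rent was 1000 Euro, '
         'then composed of a basic rent of 800 Euro, a premium for the '
         'Heating of 100 Euro and another premium for the Garage of 100 Euro.')

# Each period starts with "From <month> <year> to <month> <year>".
period_re = re.compile(r"From (\w+ \d{4}) to (\w+ \d{4})")

# Hypothetical patterns, tied to the exact wording of the example text.
amount_patterns = {
    "Total Rent": r"total rent was (\d+ Euro)",
    "Basic Rent": r"basic rent of (\d+ Euro)",
    "Heating Premium": r"premium for the Heating of (\d+ Euro)",
    "Garage Premium": r"premium for the Garage of (\d+ Euro)",
}

rows = []
matches = list(period_re.finditer(lines))
for i, m in enumerate(matches):
    # A segment runs from this period up to the start of the next one.
    end = matches[i + 1].start() if i + 1 < len(matches) else len(lines)
    segment = lines[m.start():end]
    row = {"Period": f"{m.group(1)}-{m.group(2)}"}
    for column, pattern in amount_patterns.items():
        found = re.search(pattern, segment)
        row[column] = found.group(1) if found else None
    rows.append(row)

for row in rows:
    print(row)
```

Since `rows` is a list of dictionaries, passing it to `pd.DataFrame(rows)` would give a table with one line per period.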
I have a dataset that I download daily from Amazon AWS. The problem is that some lines are downloaded incorrectly (see image; a sample can also be downloaded here). The two lines that start with "ref" should be appended to the previous row, which starts with "001ec214-97e6-4d84-a64a-d1bee0079fd8", so that the row is correct. I have about a hundred such cases to fix, so I wonder whether there is a way to do this kind of append with the pandas read_csv function. I also can't just ignore those rows, because I would lose useful data.
This one-liner should do:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(open('filename.csv').read().replace('\n\\\nref :', ' ref :')), sep="\t", header=None)
It reads the file and replaces the redundant newlines before loading the string representation of the csv file into pandas with StringIO.
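The effect of the replacement can be checked on a small in-memory sample. This is a sketch only: the shortened rows in `raw` are hypothetical stand-ins for the real records, and the standard library `csv` module is used here instead of pandas just to keep the check self-contained.

```python
import csv
from io import StringIO

# Hypothetical, shortened sample: the second record is broken over three
# physical lines by a lone backslash line before the "ref :" continuation.
raw = (
    "001a\tshop one\taddress one\tAUTHORIZED\n"
    "001b\tshop two\tD 5 Lt 2 urb Mariscal\n"
    "\\\n"
    "ref : altura de elektra\tDENIED\n"
    "001c\tshop three\taddress three\tAUTHORIZED\n"
)

# Same replacement as in the one-liner:
# newline, backslash, newline, "ref :" becomes " ref :".
fixed = raw.replace("\n\\\nref :", " ref :")

# Parse the repaired text as tab-separated values.
rows = list(csv.reader(StringIO(fixed), delimiter="\t"))
for row in rows:
    print(row)
```

After the replacement the broken record is a single row again, with the "ref :" text merged into the address field.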
Given the text from the Google doc in the question:
001ea9a6-2c30-4ce6-b4ee-f445803303db 596d89bf-e641-4e9c-9374-695241694a6d 45640 MUNDO JURIDICO 26204 20350 \N A0000000031010 10530377 MATEO FLORES ELMER ASOC INTEGRACION LOS OLIVOS AV NAPOLES MZ I LTE 31 visanet 650011581 1 0 mpos contactless 0 0 0 visa 421410 5969 12 24 c6d06de1-c9f0-4b4a-992b-ad3a5f263039 000066 666407 2021-07-20 09:31:23 000 301201522830426 995212014496392 158348 \N \N \N PEN 200.00 200.00 0.00 1 0 0.00 200026214 5999 ASOC INTEGRACION LOS OLIVOS AV NAPOLES MZ I LTE 31 MUNDO JURIDICO 947761525 AUTHORIZED 2021-07-20 09:31:24 2021-07-20 09:32:00 210720 093123
001ec0fd-8d0e-4332-851a-bcd93fdf0a37 ee32d094-8fc4-4d92-b788-58ca2ae2a590 36750 Chifa Maryori 18923 25313 \N A0000000031010 46753818 Chifa Maryori San isidro Peru visanet 650011581 1 0 mpos contactless 0 0 0 visa 421355 5765 04 26 7d00f10e-d0b7-40ec-9b97-d20590cb7710 000708 620744 2021-06-27 16:52:21 000 301178787424243 996211783887688 100513 \N \N \N PEN 17.00 17.00 0.00 0 0 0.00 400034782 5499 San isidro Peru Chifa Maryori +51 01 5179805 AUTHORIZED 2021-06-27 16:52:23 2021-06-27 16:52:23 210627 165221
001ec214-97e6-4d84-a64a-d1bee0079fd8 3c4a98cc-d279-4f9e-af8b-5198647889c5 33214 Inversiones Polleria el rey MPOS 15699 23053 \N \N 88846910264 JoseOrbegoso Puelle D 5 Lt 2 urb Mariscal
\
ref : altura de elektra visanet 650011581 1 0 mpos chip 0 0 0 visa 455788 2123 09 22 8b3fd975-140a-42d5-8601-1b4bcc16bd60 000022 170790 2020-10-20 11:32:44 687 \N \N \N \N \N \N PEN 1.00 1.00 0.00 0 0 0.00 200020168 5999 D 5 Lt 2 urb Mariscal
\
ref : altura de elektra Inversiones Polleria el rey MPOS 964974226 DENIED 2020-10-20 11:32:44 2020-10-20 11:32:44 201020 113244
001ec66f-6350-4a33-a1fe-b9375dac2161 34169a7a-a66f-4258-80c2-8c0d4512aa36 27044 ABRAHAM LORENZO MELGAREJO 10353 13074 \N \N 99944748991 ABRAHAM LORENZO MELGAREJO JR RENOVACION 399 visanet 650011581 1 0 mpos chip 0 0 0 visa 455788 4712 08 24 83915520-1a1f-4d5e-b118-f0c57a9d96ae 000367 161286 2020-10-14 15:59:51 000 300288755920408 995202889603642 242407 \N \N \N PEN 15.00 15.00 0.00 1 0 0.00 200012523 5811 JR RENOVACION 399 ABRAHAM LORENZO MELGAREJO 957888909 AUTHORIZED 2020-10-14 15:59:52 2020-10-14 16:00:21 201014 155951
001eebaf-bccc-47a7-87a3-be8b09eb971b c5e14889-d61c-4f18-8000-d3bac0564cfb 27605 Polleria Arroyo Express 10792 21904 \N A0000000031010 41429707 Polleria Arroyo Express San isidro Peru visanet 650011581 1 0 mpos contactless 0 0 0 visa 421355 9238 09 25 3c5268e8-5731-4fea-905d-45b03ed623d2 000835 379849 2021-01-30 19:43:36 000 301031026163928 995210303271986 928110 \N \N \N PEN 24.00 24.00 0.00 0 0 0.00 400026444 5499 San isidro Peru Polleria Arroyo Express +51 01 5179805 AUTHORIZED 2021-01-30 19:43:37 2021-01-30 19:43:37 210130 194336
...and using the code from "Concatenate lines with previous line based on number of letters in first column", you can try the following, provided the "good lines" all start with 001 (that is what the regex below checks for).
Code:
import re
import pandas as pd

pattern = r'^001.*$'
with open('aws.txt') as f, open('aws_out.csv', 'w') as output:
    all_the_data = ""
    for line in f:
        # A line that does not start with 001 continues the previous row,
        # so drop the newline that ended the previous line before appending
        if not re.search(pattern, line):
            all_the_data = re.sub("\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])
    output.write(all_the_data)

df = pd.read_csv('aws_out.csv', header=None, sep='\t')
print(df)
Output example:
I am currently trying to translate a pandas DataFrame with Amazon Translate. However, the maximum length of the request text is 5000 bytes, and the DataFrame contains multiple strings that exceed this limit.
I therefore want a solution that cuts each string in the "Content" column into chunks below 5000 bytes, with the number of chunks depending on the original string size, so that the limit is not exceeded and no text is lost.
To be more precise: the dataframe contains 3 columns:
Newspaper Date Content
6 Trouw 2018 Het is de laatste kolenmijn in Duitsland de Pr...
7 Trouw 2018 Liever wat meer kwijt aan energieheffing dan r...
8 Trouw 2018 De VVD doet een voorstel dat op het Binnenhof ...
9 Trouw 2018 In Nederland bestaat grote weerstand tegen ker...
10 Trouw 2017 Theo Potma 1932 2017 had zijn blocnote altijd...
11 Trouw 2017 Hoe en hoe snel kan Nederland zijn beloften op...
12 Trouw 2017 transitie Hoe en hoe snel kan Nederland zijn ...
14 Trouw 2016 Welke ideeën koestert Angela Merkel Henri Beun...
15 Trouw 2016 Welke ideeën koestert Angela Merkel Henri Beun...
16 Trouw 2015 Rapport Dwing burger CO\n Nederland heeft e...
Only the "Content" column should be checked for string size and cut into chunks, while the original "Newspaper" and "Date" values are kept for every chunk. That way I can still trace the text back to the original row.
Is there anyone who can help with such a solution?
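One possible approach is to split each text on whitespace and pack whole words into chunks whose UTF-8 encoding stays within the limit. The sketch below assumes UTF-8 is the encoding that counts toward the 5000 bytes and that collapsing runs of whitespace into single spaces is acceptable. The function name `split_by_bytes` is made up for this example.

```python
def split_by_bytes(text, limit=5000):
    """Split text on whitespace into chunks of at most `limit` UTF-8 bytes."""
    chunks = []
    current = []        # words of the chunk being built
    current_size = 0    # UTF-8 size of " ".join(current)
    for word in text.split():
        size = len(word.encode("utf-8"))
        if size > limit:
            raise ValueError("a single word exceeds the byte limit")
        # One extra byte for the joining space if the chunk is not empty.
        extra = size + (1 if current else 0)
        if current_size + extra > limit:
            # The word does not fit: close this chunk and start a new one.
            chunks.append(" ".join(current))
            current, current_size = [word], size
        else:
            current.append(word)
            current_size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_by_bytes("Het is de laatste kolenmijn in Duitsland", limit=20))
```

With pandas, applying the function to the column and then calling `explode` should give one row per chunk while repeating "Newspaper" and "Date", for example `df["Content"] = df["Content"].apply(split_by_bytes)` followed by `df = df.explode("Content")`.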
I'm currently scraping the following wiki page: https://en.wikipedia.org/wiki/Cargo_aircraft. It contains only one table, in the section beginning at "Comparisons". I am trying to scrape the entire table and load it into pandas. I understand how to add the first column, Aircraft, but have trouble scraping the columns from Volume onwards.
How can I add all rows (or columns) of the table to the DataFrame? I am not sure which is the better approach.
from bs4 import BeautifulSoup
import requests
import pandas as pd
#this will use request library to call wikipedia
page = requests.get('https://en.wikipedia.org/wiki/Cargo_aircraft')
#create beautifulsoup object
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', attrs={'class':'wikitable sortable'})
tabledata = table.findAll('tbody')
links = table.findAll('a')
aircraft = []
for link in links:
    aircraft.append(link.get('title'))
print(aircraft)
#pull table from Wikipedia
df = pd.DataFrame()
df['Aircraft'] = aircraft
df['Test'] = 'test'
Using pandas.read_html
Bypass BeautifulSoup and read the table directly into pandas.
pandas.read_html reads HTML tables into a list of DataFrame objects.
In this case the table is at index [1].
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/Cargo_aircraft')[1]
# df view
Aircraft Volume Payload Cruise Range Usage
0 Airbus A400M 270 m³ 37,000 kg (82,000 lb) 780 km/h (420 kn) 6,390 km (3,450 nmi) Military
1 Airbus A300-600F 391.4 m³ 48,000 kg (106,000 lb) – 7,400 km (4,000 nmi) Commercial
2 Airbus A330-200F 475 m³ 70,000 kg (154,000 lb) 871 km/h (470 kn) 7,400 km (4,000 nmi) Commercial
3 Airbus Beluga 1210 m³ 47,000 kg (104,000 lb) – 4,632 km (2,500 nmi) Commercial
4 Airbus Beluga XL 2615 m³ 53,000 kg (117,000 lb) – 4,074 km (2,200 nmi) Commercial
5 Antonov An-124 1028 m³ 150,000 kg (331,000 lb) 800 km/h (430 kn) 5,400 km (2,900 nmi) Both
6 Antonov An-225 1300 m³ 250,000 kg (551,000 lb) 800 km/h (430 kn) 15,400 km (8,316 nmi) Commercial
7 Boeing C-17 – 77,519 kg (170,900 lb) 830 km/h (450 kn) 4,482 km (2,420 nmi) Military
8 Boeing 737-700C 107.6 m³ 18,200 kg (40,000 lb) 931 km/h (503 kn) 5,330 km (2,880 nmi) Commercial
9 Boeing 757-200F 239 m³ 39,780 kg (87,700 lb) 955 km/h (516 kn) 5,834 km (3,150 nmi) Commercial
10 Boeing 747-8F 854.5 m³ 134,200 kg (295,900 lb) 908 km/h (490 kn) 8,288 km (4,475 nmi) Commercial
11 Boeing 747 LCF 1840 m³ 83,325 kg (183,700 lb) 878 km/h (474 kn) 7,800 km (4,200 nmi) Commercial
12 Boeing 767-300F 438.2 m³ 52,700 kg (116,200 lb) 850 km/h (461 kn) 6,025 km (3,225 nmi) Commercial
13 Boeing 777F 653 m³ 103,000 kg (227,000 lb) 896 km/h (484 kn) 9,070 km (4,900 nmi) Commercial
14 Bombardier Dash 8-100 39 m³ 4,700 kg (10,400 lb) 491 km/h (265 kn) 2,039 km (1,100 nmi) Commercial
15 Lockheed C-5 – 122,470 kg (270,000 lb) 919 km/h 4,440 km (2,400 nmi) Military
16 Lockheed C-130 – 20,400 kg (45,000 lb) 540 km/h (292 kn) 3,800 km (2,050 nmi) Military
17 Douglas DC-10-30 – 77,000 kg (170,000 lb) 908 km/h (490 kn) 5,790 km (3,127 nmi) Commercial
18 McDonnell Douglas MD-11 440 m³ 91,670 kg (202,100 lb) 945 km/h (520 kn) 7,320 km (3,950 nmi) Commercial
You can try:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/Cargo_aircraft')[1]
df['Volume'] = pd.Series([x[0] if x[0] != '–' else None for x in df['Volume'].str.split()]).astype(float)
df['Payload'] = pd.Series([x[0].replace(',', '') if x[0] != '–' else None for x in df['Payload'].str.split()]).astype(int)
df['Cruise'] = pd.Series([x[0] if x[0] != '–' else None for x in df['Cruise'].str.split()]).astype(float)
df['Range'] = pd.Series([x[0].replace(',', '') if x[0] != '–' else None for x in df['Range'].str.split()]).astype(int)
Result:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 6 columns):
Aircraft 19 non-null object
Volume 15 non-null float64
Payload 19 non-null int64
Cruise 16 non-null float64
Range 19 non-null int64
Usage 19 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 1.0+ KB
print(df)
Aircraft Volume Payload Cruise Range Usage
0 Airbus A400M 270.0 37000 780.0 6390 Military
1 Airbus A300-600F 391.4 48000 NaN 7400 Commercial
2 Airbus A330-200F 475.0 70000 871.0 7400 Commercial
3 Airbus Beluga 1210.0 47000 NaN 4632 Commercial
4 Airbus Beluga XL 2615.0 53000 NaN 4074 Commercial
5 Antonov An-124 1028.0 150000 800.0 5400 Both
6 Antonov An-225 1300.0 250000 800.0 15400 Commercial
7 Boeing C-17 NaN 77519 830.0 4482 Military
8 Boeing 737-700C 107.6 18200 931.0 5330 Commercial
9 Boeing 757-200F 239.0 39780 955.0 5834 Commercial
10 Boeing 747-8F 854.5 134200 908.0 8288 Commercial
11 Boeing 747 LCF 1840.0 83325 878.0 7800 Commercial
12 Boeing 767-300F 438.2 52700 850.0 6025 Commercial
13 Boeing 777F 653.0 103000 896.0 9070 Commercial
14 Bombardier Dash 8-100 39.0 4700 491.0 2039 Commercial
15 Lockheed C-5 NaN 122470 919.0 4440 Military
16 Lockheed C-130 NaN 20400 540.0 3800 Military
17 Douglas DC-10-30 NaN 77000 908.0 5790 Commercial
18 McDonnell Douglas MD-11 440.0 91670 945.0 7320 Commercial
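The per-column cleaning above keeps the first whitespace-separated token of each cell, removes thousands separators and treats the "–" placeholder as missing. As an illustration only, the same rule can be written as a small standalone function. The name `leading_number` is made up here and is not part of pandas.

```python
def leading_number(cell):
    """Return the first whitespace-separated token of `cell` as a float,
    or None when the cell only holds the "–" placeholder."""
    first = cell.split()[0]
    if first == "–":
        return None
    # Drop thousands separators such as the comma in "37,000".
    return float(first.replace(",", ""))


for cell in ["270 m³", "37,000 kg (82,000 lb)", "919 km/h", "–"]:
    print(repr(cell), "->", leading_number(cell))
```

Such a function could be applied to a column with `df['Payload'].map(leading_number)`. Columns that contain missing values stay floats in that case.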
I'm very new to Python and don't have much of a clue what I'm doing. I have a series of data that describes the performance ('Leistung') of different people ('Leistungserbringer'). Each performance is also linked to a specific value ('Taxpunkte'). I'd like to display the top 10 performances for each person, defined by the value of the performance.
byleistung = df.groupby('Leistungserbringer')
df2 = byleistung['Taxpunkte'].describe()
df2.sort_values(['mean'], ascending=[False])
count mean std min 25% 50% 75% max
Leistungserbringer
Larsson William 6188.0 99.799108 231.765598 2.50 15.81 31.61 111.71 3909.72
Karlsson Oliwer 5645.0 93.344057 277.989424 3.61 15.81 31.61 94.83 9122.68
McGregor Sean 1250.0 89.100800 136.175528 3.61 18.35 34.78 111.71 998.64
Groeneveld Arno 4045.0 84.859498 202.230230 1.93 15.81 31.61 63.23 3323.52
Heepe Simon 3776.0 82.662950 359.970010 3.61 15.81 31.61 50.47 13597.60
Bitar Wahib 7814.0 72.190337 142.399537 3.61 15.81 31.61 61.75 3634.15
Cox James 4746.0 72.036013 132.240942 2.50 15.81 31.61 50.65 1664.40
Carvalho Tomas 7415.0 60.868030 156.889297 2.86 15.81 15.81 41.50 2099.20
The 'count' is the number of performances the specific person did. In total there are 330 different performances these people have done. As an example:
byleistung = df.groupby('Leistung')
byleistung['Taxpktwert'].describe()
count unique top freq
Leistung
'(+) %-Zuschlag für Notfall B, ' 2 1 KVG 2
'+ Bronchoalveoläre Lavage (BAL)' 1 1 KVG 1
'+ Bürstenabstrich bei Bronchoskopie' 8 1 KVG 8
'+ Endobronchialer Ultraschall mit Punktion' 1 1 KVG 1
'XOLAIR Trockensub 150 mg c Solv Durchstf' 109 1 KVG 109
My DataFrame looks like this (it has 40'000 more rows):
df.head()
Leistungserbringer Anzahl Leistung AL TL Taxpktwert Taxpunkte
0 Groeneveld Arno 12 'Beratung' 147.28 87.47 KVG 234.75
1 Groeneveld Arno 12 'Konsilium' 147.28 87.47 KVG 234.75
2 Groeneveld Arno 12 'Ultra' 147.28 87.47 KVG 234.75
3 Groeneveld Arno 12 'O2-Druck' 147.28 87.47 KVG 234.75
4 Groeneveld Arno 12 'Funktion' 147.28 87.47 KVG 234.75
I want my end result to look roughly like this for each person. The ranking should be based on the product of the count per performance ('Anzahl') and the value ('Taxpunkte'):
Leistungserbringer Leistung Anzahl Taxpunkte Total Taxpkt
Larsson William 1 x a x*a
2 y b y*b
.
.
10 z c z*c
...
McGregor Sean 1 x a x*a
2 y b y*b
.
.
10 z c z*c
Any hints or recommendations of approach would be greatly appreciated.
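One possible approach is to compute the product per row, sum it per person and performance, sort, and keep the first rows of each group. The sketch below uses a small made-up DataFrame with the column names from the question. It assumes that a performance can appear on several rows for the same person and that those rows should be summed. `top_n` is set to 2 only so the effect is visible on the tiny sample.

```python
import pandas as pd

# Hypothetical miniature of the DataFrame in the question.
df = pd.DataFrame({
    "Leistungserbringer": ["A", "A", "A", "B", "B"],
    "Leistung": ["Beratung", "Konsilium", "Ultra", "Beratung", "Ultra"],
    "Anzahl": [12, 3, 5, 2, 7],
    "Taxpunkte": [10.0, 50.0, 20.0, 30.0, 5.0],
})

top_n = 2  # the question asks for 10

# Ranking value per row: count times value.
df["Total Taxpkt"] = df["Anzahl"] * df["Taxpunkte"]

# Sum per person and performance (assumption: repeated rows are added up).
totals = (
    df.groupby(["Leistungserbringer", "Leistung"], as_index=False)
      .agg({"Anzahl": "sum", "Total Taxpkt": "sum"})
)

# Sort by person, then by descending total, and keep the first top_n
# rows of each person.
top = (
    totals.sort_values(["Leistungserbringer", "Total Taxpkt"],
                       ascending=[True, False])
          .groupby("Leistungserbringer")
          .head(top_n)
          .reset_index(drop=True)
)
print(top)
```

Setting `top_n = 10` on the real data should give up to ten rows per person, ordered by 'Total Taxpkt'.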