Text to PDF Positioning Lines - python

I have a text file that I am reading and writing line by line into a PDF. The lines end up out of position on the PDF because the FPDF library left-aligns everything. I am using set_x so I can position each line to my liking. I am trying to reposition the header lines up to and including "RATE CODE CY", and then have all the data under those columns follow. Then another header block appears, and I would like to align those repeated headers the same way. I know a for loop is needed to bring in the rest of the data; the issue is that a header block comes around again, and that is where I have to apply set_x (a sketch of the loop idea follows the sample output below).
from fpdf import FPDF

pdf = FPDF("L", "mm", "A4")
pdf.add_page()
pdf.set_font('arial', style='', size=10.0)
lines = file.readlines()
header8 = lines[7]
header8_1 = " ".join(lines[8].split()[:4])
header8_2 = " ".join(lines[8].split()[4:])
header9_1 = " ".join(lines[9].split()[:5])
header9_2 = " ".join(lines[9].split()[5:])
pdf.cell(ln=0, h=5.0, align='L', w=0, txt=header8_1, border=0)
pdf.set_x(125)
pdf.cell(ln=1, h=5.0, align='L', w=0, txt=header8_2, border=0)
pdf.cell(ln=0, h=5.0, align='L', w=0, txt=header9_1, border=0)
pdf.set_x(125)
pdf.cell(ln=1, h=5.0, align='L', w=0, txt=header9_2, border=0)
Current PDF file:
READ SVC B MAXIMUM TOTAL DUE METER NO REMARKS
ACCOUNT # SERVICE ADDRESS CITY DATE DAY C KWH KWD AMOUNT
RATE CODE CY CUSTOMER NAME MAILING ADDRESS
----------------------------------------------------------------------------------------------------
11211-22222 12345 TEST HWY #86 TITUSVIL 10/12/19 29 C 1,444 189.01 ABC1234
GS-1 3 Home & ASSOC INC 1234 Miami HWY APT49
22222-33333 12345 TEST HWY #88 TITUSVIL 10/04/19 29 C 256 41.50 ABC1235
GS-1 3 DGN & ASSOC INC 1234 Miami HWY APT49
READ SVC B MAXIMUM TOTAL DUE METER NO REMARKS
ACCOUNT # SERVICE ADDRESS CITY DATE DAY C KWH KWD AMOUNT
RATE CODE CY CUSTOMER NAME MAILING ADDRESS
----------------------------------------------------------------------------------------------------
11211-22222 12345 TEST HWY #86 TITUSVIL 10/12/19 29 C 1,444 189.01 ABC1234
GS-1 3 Home & ASSOC INC 1234 Miami HWY APT49
22222-33333 12345 TEST HWY #88 TITUSVIL 10/04/19 29 C 256 41.50 ABC1235
GS-1 3 DGN & ASSOC INC 1234 Miami HWY APT49
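A minimal sketch of the loop idea (an assumption on my part: the repeated header block always contains a line starting with "RATE CODE CY", and it splits at the same point as header9 above):
# Sketch only: walk every line and re-apply set_x whenever the
# "RATE CODE CY" header line comes around again; all other lines
# are written as-is. The 5-word split mirrors header9 above.
for line in lines:
    text = line.rstrip()
    if text.startswith("RATE CODE CY"):
        parts = text.split()
        pdf.cell(ln=0, h=5.0, align='L', w=0, txt=" ".join(parts[:5]), border=0)
        pdf.set_x(125)
        pdf.cell(ln=1, h=5.0, align='L', w=0, txt=" ".join(parts[5:]), border=0)
    else:
        pdf.cell(ln=1, h=5.0, align='L', w=0, txt=text, border=0)
The other two header lines would get the same treatment with their own split points.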

Python Regex Capturing Multiple Matches in separate observations

I am trying to create variables location; contract items; contract code; federal aid using regex on the following text:
PAGE 1
BID OPENING DATE 07/25/18 FROM 0.2 MILES WEST OF ICE HOUSE 07/26/18 CONTRACT NUMBER 03-2F1304 ROAD TO 0.015 MILES WEST OF CONTRACT CODE 'A '
LOCATION 03-ED-50-39.5/48.7 DIVISION HIGHWAY ROAD 44 CONTRACT ITEMS
INSTALL SANDTRAPS AND PULLOUTS FEDERAL AID ACNH-P050-(146)E
PAGE 1
BID OPENING DATE 07/25/18 IN EL DORADO COUNTY AT VARIOUS 07/26/18 CONTRACT NUMBER 03-2H6804 LOCATIONS ALONG ROUTES 49 AND 193 CONTRACT CODE 'C ' LOCATION 03-ED-0999-VAR 13 CONTRACT ITEMS
TREE REMOVAL FEDERAL AID NONE
PAGE 1
BID OPENING DATE 07/25/18 IN LOS ANGELES, INGLEWOOD AND 07/26/18 CONTRACT NUMBER 07-296304 CULVER CITY, FROM I-105 TO PORT CONTRACT CODE 'B '
LOCATION 07-LA-405-R21.5/26.3 ROAD UNDERCROSSING 55 CONTRACT ITEMS
ROADWAY SAFETY IMPROVEMENT FEDERAL AID ACIM-405-3(056)E
This text is from one Word file; I'll be looping my code over multiple .doc files. The text above contains three location / contract items / contract code / federal aid groups, but when I use regex to create variables, only the first instance of each is captured.
The code I have right now is:
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
all_bod = []
all_cn = []
all_location = []
all_fedaid = []
all_contractcode = []
all_contractitems = []
all_file = []
text = """ PAGE 1
BID OPENING DATE 07/25/18 FROM 0.2 MILES WEST OF ICE HOUSE 07/26/18 CONTRACT NUMBER 03-2F1304 ROAD TO 0.015 MILES WEST OF CONTRACT CODE 'A '
LOCATION 03-ED-50-39.5/48.7 DIVISION HIGHWAY ROAD 44 CONTRACT ITEMS
INSTALL SANDTRAPS AND PULLOUTS FEDERAL AID ACNH-P050-(146)E
PAGE 1
BID OPENING DATE 07/25/18 IN EL DORADO COUNTY AT VARIOUS 07/26/18 CONTRACT NUMBER 03-2H6804 LOCATIONS ALONG ROUTES 49 AND 193 CONTRACT CODE 'C ' LOCATION 03-ED-0999-VAR 13 CONTRACT ITEMS
TREE REMOVAL FEDERAL AID NONE
PAGE 1
BID OPENING DATE 07/25/18 IN LOS ANGELES, INGLEWOOD AND 07/26/18 CONTRACT NUMBER 07-296304 CULVER CITY, FROM I-105 TO PORT CONTRACT CODE 'B '
LOCATION 07-LA-405-R21.5/26.3 ROAD UNDERCROSSING 55 CONTRACT ITEMS
ROADWAY SAFETY IMPROVEMENT FEDERAL AID ACIM-405-3(056)E"""
bod1 = re.search('BID OPENING DATE \s+ (\d+\/\d+\/\d+)', text)
bod2 = re.search('BID OPENING DATE\n\n(\d+\/\d+\/\d+)', text)
if not (bod1 is None):
    bod = bod1.group(1)
elif not (bod2 is None):
    bod = bod2.group(1)
else:
    bod = 'NA'
all_bod.append(bod)
# creating contract number
cn1 = re.search('CONTRACT NUMBER\n+(.*)', text)
cn2 = re.search('CONTRACT NUMBER\s+(.........)', text)
if not (cn1 is None):
    cn = cn1.group(1)
elif not (cn2 is None):
    cn = cn2.group(1)
else:
    cn = 'NA'
all_cn.append(cn)
# location
location1 = re.search('LOCATION \s+\S+', text)
location2 = re.search('LOCATION \n+\S+', text)
if not (location1 is None):
    location = location1.group(0)
elif not (location2 is None):
    location = location2.group(0)
else:
    location = 'NA'
all_location.append(location)
# federal aid
fedaid = re.search('FEDERAL AID\s+\S+', text)
fedaid = fedaid.group(0)
all_fedaid.append(fedaid)
# contract code
contractcode = re.search('CONTRACT CODE\s+\S+', text)
contractcode = contractcode.group(0)
all_contractcode.append(contractcode)
# contract items
contractitems = re.search('\d+ CONTRACT ITEMS', text)
contractitems = contractitems.group(0)
all_contractitems.append(contractitems)
This code only parses the first instance of each variable in the text:
contract-number   location               contract-items   contract-code   federal-aid
03-2F1304         03-ED-50-39.5/48.7     44               A               ACNH-P050-(146)E
But I am trying to figure out a way to get all possible instances as separate observations:
contract-number   location               contract-items   contract-code   federal-aid
03-2F1304         03-ED-50-39.5/48.7     44               A               ACNH-P050-(146)E
03-2H6804         03-ED-0999-VAR         13               C               NONE
07-296304         07-LA-405-R21.5/26.3   55               B               ACIM-405-3(056)E
The all_* variables in the code are for looping over multiple Word files - we can ignore that if we want :).
Any leads would be super helpful. Thanks so much!
import re
import pandas as pd

data = []
df = pd.DataFrame()
regex_contract_number = r"(?:CONTRACT NUMBER\s+(?P<contract_number>\S+?)\s)"
regex_location = r"(?:LOCATION\s+(?P<location>\S+))"
regex_contract_items = r"(?:(?P<contract_items>\d+)\sCONTRACT ITEMS)"
regex_federal_aid = r"(?:FEDERAL AID\s+(?P<federal_aid>\S+?)\s)"
regex_contract_code = r"(?:CONTRACT CODE\s+\'(?P<contract_code>\S+?)\s)"
regexes = [regex_contract_number, regex_location, regex_contract_items, regex_federal_aid, regex_contract_code]
for regex in regexes:
    for match in re.finditer(regex, text):
        data.append(match.groupdict())
    df = pd.concat([df, pd.DataFrame(data)], axis=1)
    data = []
df
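For reference, the reason the original attempt only captured the first instance: re.search stops at the first match, while re.finditer (used above) yields every non-overlapping match. A tiny illustration on made-up input:
import re

sample = "LOCATION 03-ED-50-39.5/48.7 DIVISION LOCATION 03-ED-0999-VAR"
print(re.search(r'LOCATION\s+(\S+)', sample).group(1))
# 03-ED-50-39.5/48.7  (first hit only)
print([m.group(1) for m in re.finditer(r'LOCATION\s+(\S+)', sample)])
# ['03-ED-50-39.5/48.7', '03-ED-0999-VAR']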

I can't print horizontally

I want to make a book catalog where the output prints horizontally, wrapping to a new line after every 3 books. I understand that we can print horizontally by using:
end = ""
BUT that only works for a single line. Since each book takes 3 lines (Title, ISBN, Price), end = "" alone can't get it done.
Below is my code
line_format = "{:50s}\n{:13s}\n{:11s}"  # one placeholder per line: title, ISBN, price
books = db.get_books(kel)
for book in books:
    print(line_format.format(str(book.title),
                             str(book.isbn),
                             "Rp. {:,}".format(book.price).replace(",", ".")))
What I got is:
Deaver - Never Game A/UK
9780008303778
Rp. 161.000
Poirot - DEATH ON THE NILE (Exp]
9780008328948
Rp. 28.000
Alchemist - 25th Anniv ed
9780062355300
Rp. 160.000
Finn- Woman in the Window [MTI]
9780062906137
Rp. 162.000
Mahurin- Blood & Honey
9780063041172
Rp. 62.000
What I want for the output is:
Deaver - Never Game DEATH ON THE NILE (Exp] Alchemist
9780008303778 9780008328948 9780062355300
Rp. 161.000 Rp. 28.000 Rp. 160.000
Woman in the Window Blood & Honey
9780062906137 9780063041172
Rp. 162.000 Rp. 62.000
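One way to approach this (a sketch, reusing db.get_books(kel) and the book attributes from the snippet above): take the books three at a time, then print one line per field for the whole group, so the columns stay aligned.
# Sketch: three books per row, one printed line each for title,
# ISBN and price. The column width of 30 is an arbitrary choice.
books = db.get_books(kel)
for start in range(0, len(books), 3):
    group = books[start:start + 3]
    print("".join("{:<30.28}".format(str(b.title)) for b in group))
    print("".join("{:<30}".format(str(b.isbn)) for b in group))
    print("".join("{:<30}".format("Rp. {:,}".format(b.price).replace(",", ".")) for b in group))
    print()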

Different Behavior for Parsing Two Similar Wikipedia Infoboxes

I have two infoboxes that look exactly the same to me, but I'm getting different behavior in mwparserfromhell. In the first instance I'm getting what I expect - the entire infobox is captured as a template object. In the second instance parts of the infobox are extracted as separate templates. This is confusing since the infoboxes look very similar to me, and I was expecting that the entire infobox could be extracted in the second case.
This is the code I'm using:
mwparserfromhell.parse(text.strip().lower()).filter_templates()
Text 1 Input:
txt1 = """{{Infobox building
| name = 666 Fifth Avenue
| former_names = Tishman Building
| status = Complete
| image = 666 Fifth Avenue by David Shankbone.jpg
| image_size = 300px
| caption =
| location = 666 Fifth Avenue<br>[[Manhattan]], [[New York (state)|New York]] 10103
| coordinates = {{coord|40.760163|-73.976204|format=dms}}
| start_date =
| completion_date = 1957
| architect = [[Carson & Lundin]]
| owner = [[Brookfield Properties]]
| cost = $40 million
| floor_area = {{convert|1,463,892|sqft|m2|abbr=on}}
| top_floor =
| floor_count = 41
| references =
| map_type =
| building_type = Office
| antenna_spire =
| roof = {{convert|483|ft|m|abbr=on}}
| elevator_count = 24 (20 passenger, 4 freight)
| structural_engineer =
| main_contractor =
| opening = November 25, 1957
| developer = Tishman Realty and Construction
| management =
}}"""
Text 1 Output:
['{{infobox building\n| name = 666 fifth avenue\n| former_names = tishman building\n| status = complete\n| image = 666 fifth avenue by david shankbone.jpg\n| image_size = 300px\n| caption = \n| location = 666 fifth avenue<br>[[manhattan]], [[new york (state)|new york]] 10103\n| coordinates = {{coord|40.760163|-73.976204|format=dms}}\n| start_date = \n| completion_date = 1957\n| architect = [[carson & lundin]]\n| owner = [[brookfield properties]]\n| cost = $40 million\n| floor_area = {{convert|1,463,892|sqft|m2|abbr=on}}\n| top_floor = \n| floor_count = 41\n| references = \n| map_type = \n| building_type = office\n| antenna_spire = \n| roof = {{convert|483|ft|m|abbr=on}}\n| elevator_count = 24 (20 passenger, 4 freight)\n| structural_engineer = \n| main_contractor = \n| opening = november 25, 1957\n| developer = tishman realty and construction\n| management = \n}}',
'{{coord|40.760163|-73.976204|format=dms}}',
'{{convert|1,463,892|sqft|m2|abbr=on}}',
'{{convert|483|ft|m|abbr=on}}']
Text 2 Input:
txt2 = """{{Infobox building
| name = Central Park Tower
| alternate_names = Nordstrom Tower
| image = Central Park Tower April 2020.jpg
| caption = Central Park Tower on April 25, 2020
| location = 225 [[57th Street (Manhattan)|West 57th Street]]<br/>[[Manhattan]], [[New York City]], [[New York (state)|New York]], [[United States|U.S.]]
| coordinates = {{coord|40.7663|-73.9810|type:landmark_globe:earth_region:US-NY|display=inline,title}}
| status = Topped Out
| start_date = 2014
| est_completion = 2020<ref name=curbed>{{cite news |author=Amy Plitt |url=https://ny.curbed.com/2017/6/1/15714666/central-park-tower-offering-plan-approval-sales-launch |title=Central Park Tower is now one step closer to launching sales |date=June 1, 2017 |access-date=August 30, 2017 |work=Curbed}}</ref>
| building_type = [[Residential]], [[retail]]
| architectural_style = [[Modern architecture|Modern]]
| architectural = {{cvt|1550|ft|0}}
| floor_count = 131<ref>{{cite web |url=https://www.architecturaldigest.com/story/new-york-city-central-park-tower-worlds-tallest-residential-building </ref><ref>{{cite web |url=https://archpaper.com/2019/09/central-park-tower-tops-out/</ref> (98 habitable floors)<ref name="auto">{{Cite web |url=http://www.skyscrapercenter.com/building/central-park-tower/14269 |title=Central Park Tower - The Skyscraper Center |website=www.skyscrapercenter.com |access-date=October 10, 2018}}</ref>
| elevator_count = 11
| cost = $3 billion<ref name="Tase">{{cite news|url=https://commercialobserver.com/2019/04/all-in-good-tase-the-crisis-for-the-american-cohort-in-tel-aviv-is-essentially-over/|title=All in Good TASE: The Crisis for the American Cohort in Tel Aviv Is Essentially Over|date=April 4, 2019|work=Commercial Observer|last=Gourarie|first=Chava}}</ref>
| floor_area = {{convert|1,285,308|sqft|m2}}<ref name="auto" />
| architect = [[Adrian Smith + Gordon Gill Architecture]]
| structural_engineer = [[WSP Global]]
| main_contractor = [[Lendlease]]
| developer = [[Extell Development Company]]
}}"""
Text 2 Output:
['{{coord|40.7663|-73.9810|type:landmark_globe:earth_region:us-ny|display=inline,title}}',
'{{cite news |author=amy plitt |url=https://ny.curbed.com/2017/6/1/15714666/central-park-tower-offering-plan-approval-sales-launch |title=central park tower is now one step closer to launching sales |date=june 1, 2017 |access-date=august 30, 2017 |work=curbed}}',
'{{cvt|1550|ft|0}}',
'{{cite web |url=https://archpaper.com/2019/09/central-park-tower-tops-out/</ref> (98 habitable floors)<ref name="auto">{{cite web |url=http://www.skyscrapercenter.com/building/central-park-tower/14269 |title=central park tower - the skyscraper center |website=www.skyscrapercenter.com |access-date=october 10, 2018}}</ref>\n| elevator_count = 11\n| cost = $3 billion<ref name="tase">{{cite news|url=https://commercialobserver.com/2019/04/all-in-good-tase-the-crisis-for-the-american-cohort-in-tel-aviv-is-essentially-over/|title=all in good tase: the crisis for the american cohort in tel aviv is essentially over|date=april 4, 2019|work=commercial observer|last=gourarie|first=chava}}</ref>\n| floor_area = {{convert|1,285,308|sqft|m2}}<ref name="auto" />\n| architect = [[adrian smith + gordon gill architecture]]\n| structural_engineer = [[wsp global]]\n| main_contractor = [[lendlease]]\n| developer = [[extell development company]]\n}}',
'{{cite web |url=http://www.skyscrapercenter.com/building/central-park-tower/14269 |title=central park tower - the skyscraper center |website=www.skyscrapercenter.com |access-date=october 10, 2018}}',
'{{cite news|url=https://commercialobserver.com/2019/04/all-in-good-tase-the-crisis-for-the-american-cohort-in-tel-aviv-is-essentially-over/|title=all in good tase: the crisis for the american cohort in tel aviv is essentially over|date=april 4, 2019|work=commercial observer|last=gourarie|first=chava}}',
'{{convert|1,285,308|sqft|m2}}']
This is a known bug in mwparserfromhell. My workaround was to build an on-the-fly regex pattern that removes the ref link but keeps the rest of the text intact, turning
{{ this is text <ref>this is a ref link}}</ref>
to
{{ this is text }}
Here's the code:
import re
import mwparserfromhell

def get_regex_str_pattern(reg_str):
    """ Creates regex string to remove specific patterns that are embedded in larger strings
    :param reg_str: String to tokenize
    :return: Regex pattern
    """
    return r'.*'.join([f"({re.escape(r.strip())})" for r in reg_str.split()])

def get_ref_clean_str(txt):
    """ Removes badly formed ref strings from wiki text
    :param txt: Wiki text
    :return: Parser object
    """
    clean_txt = txt
    wiki_text = mwparserfromhell.parse(txt)
    for r in wiki_text.filter_tags():
        if str(r.tag) in ('ref', 'br'):
            # str(r) because filter_tags() yields Tag nodes, not strings
            clean_txt = re.sub(get_regex_str_pattern(str(r)), ' ', clean_txt)
    return mwparserfromhell.parse(clean_txt)
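A hypothetical usage (my addition; whether the repaired text then parses as a single template may depend on the mwparserfromhell version):
# Strip the malformed <ref> spans from txt2, then re-parse it.
wikicode = get_ref_clean_str(txt2.strip().lower())
templates = wikicode.filter_templates()
# With the bad refs gone, the whole infobox should come back as one
# template object, e.g. templates[0].name == 'infobox building'.
print(templates[0].name)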

Is there a way to properly convert data from lists to a CSV file using BeautifulSoup?

I am trying to create a web scraper for a website. The problem is that after the collected data is stored in a list, I'm not able to write it to a CSV file properly. I have been stuck on this for ages, and hopefully someone has an idea how to fix it!
The loop to get the data from the web pages:
import csv
from htmlrequest import simple_get
from htmlrequest import BeautifulSoup

# Define variables
listData = ['Companies', 'Locations', 'Descriptions']
plus = 15
max = 30
count = 0
# while loop to repeat process till max is reached
while (count <= max):
    start = 'https://www.companiesintheuk.co.uk/find?q=Activities+of+sport+clubs&start=' + str(count) + '&s=h&t=SicCodeSearch&location=&sicCode=93120'
    raw_html = simple_get(start)
    soup = BeautifulSoup(raw_html, 'html.parser')
    for i, div in enumerate(soup.find_all('div', class_="search_result_title")):
        listData[0] = listData[0].strip() + div.text
    for i, div2 in enumerate(soup.find_all('div', class_="searchAddress")):
        listData[1] = listData[1].strip() + div2.text
    # This is extra information
    # for i, div3 in enumerate(soup.find_all('div', class_="searchSicCode")):
    #     listData[2] = listData[2].strip() + div3.text
    count = count + plus
output example if printed:
Companies
(AMG) AGILITY MANAGEMENT GROUP LTD
(KLA) LIONS/LIONESS FOOTBALL TEAMS WORLD CUP LTD
(Dissolved)
1 SPORT ORGANISATION LIMITED
100UK LTD
1066 GYMNASTICS
1066 SPECIALS
10COACHING LIMITED
147 LOUNGE LTD
147 SNOOKER AND POOL CLUB (LEICESTER) LIMITED
Locations
ENGLAND, BH8 9PS
LONDON, EC2M 2PL
ENGLAND, LS7 3JB
ENGLAND, LE2 8FN
UNITED KINGDOM, N18 2QX
AVON, BS5 0JH
UNITED KINGDOM, WC2H 9JQ
UNITED KINGDOM, SE18 5SZ
UNITED KINGDOM, EC1V 2NX
I've tried to get it into a CSV file by using this code but I can't figure out how to properly format my output! Any suggestions are welcome.
# writing to csv
with open('test.csv', 'w') as csvfile:
    write = csv.writer(csvfile, delimiter=',')
    write.writerow(['Name', 'Location'])
    write.writerow([listData[0], listData[1]])
    print("Writing has been done!")
I want the code to format the CSV file properly so that the two rows can be imported into a database.
[The original post attached screenshots here: the raw data written to 'test.csv', how the file rendered when opened, and the expected outcome.]
I'm not sure exactly how it is improperly formatted, but maybe you just need to replace with open('test.csv', 'w') with with open('test.csv', 'w+', newline='').
I've combined your code (swapping htmlrequest for the requests and bs4 modules, and collecting results in my own lists instead of listData; I've left your lists in, but they do nothing):
import csv
import bs4
import requests

# Define variables
listData = ['Companies', 'Locations', 'Descriptions']
company_list = []
locations_list = []
plus = 15
max = 30
count = 0
# while loop to repeat process till max is reached
while count <= max:
    start = 'https://www.companiesintheuk.co.uk/find?q=Activities+of+sport+clubs&start={}&s=h&t=SicCodeSearch&location=&sicCode=93120'.format(count)
    res = requests.get(start)
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    for i, div in enumerate(soup.find_all('div', class_="search_result_title")):
        listData[0] = listData[0].strip() + div.text
        company_list.append(div.text.strip())
    for i, div2 in enumerate(soup.find_all('div', class_="searchAddress")):
        listData[1] = listData[1].strip() + div2.text
        locations_list.append(div2.text.strip())
    # This is extra information
    # for i, div3 in enumerate(soup.find_all('div', class_="searchSicCode")):
    #     listData[2] = listData[2].strip() + div3.text
    count = count + plus

if len(company_list) == len(locations_list):
    with open('test.csv', 'w+', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        writer.writerow(['Name', 'Location'])
        for i in range(len(company_list)):
            writer.writerow([company_list[i], locations_list[i]])
Which generates a csv file like:
Name,Location
(AMG) AGILITY MANAGEMENT GROUP LTD,"UNITED KINGDOM, M6 6DE"
"(KLA) LIONS/LIONESS FOOTBALL TEAMS WORLD CUP LTD
(Dissolved)","ENGLAND, BD1 2PX"
0161 STUDIOS LTD,"UNITED KINGDOM, HD6 3AX"
1 CLICK SPORTS MANAGEMENT LIMITED,"ENGLAND, E10 5PW"
1 SPORT ORGANISATION LIMITED,"UNITED KINGDOM, CR2 6NF"
100UK LTD,"UNITED KINGDOM, BN14 9EJ"
1066 GYMNASTICS,"EAST SUSSEX, BN21 4PT"
1066 SPECIALS,"EAST SUSSEX, TN40 1HE"
10COACHING LIMITED,"UNITED KINGDOM, SW6 6LR"
10IS ACADEMY LIMITED,"ENGLAND, PE15 9PS"
"10TH MAN LIMITED
(Dissolved)","GLASGOW, G3 6AN"
12 GAUGE EAST MANCHESTER COMMUNITY MMA LTD,"ENGLAND, OL9 8DQ"
121 MAKING WAVES LIMITED,"TYNE AND WEAR, NE30 1AR"
121 WAVES LTD,"TYNE AND WEAR, NE30 1AR"
1-2-KICK LTD,"ENGLAND, BH8 9PS"
"147 HAVANA LIMITED
(Liquidation)","LONDON, EC2M 2PL"
147 LOUNGE LTD,"ENGLAND, LS7 3JB"
147 SNOOKER AND POOL CLUB (LEICESTER) LIMITED,"ENGLAND, LE2 8FN"
1ACTIVE LTD,"UNITED KINGDOM, N18 2QX"
1ON1 KING LTD,"AVON, BS5 0JH"
1PUTT LTD,"UNITED KINGDOM, WC2H 9JQ"
1ST SPORTS LTD,"UNITED KINGDOM, SE18 5SZ"
2 BRO PRO EVENTS LTD,"UNITED KINGDOM, EC1V 2NX"
2 SPLASH SWIM SCHOOL LTD,"ENGLAND, B36 0EY"
2 STEPPERS C.I.C.,"SURREY, CR0 6BX"
2017 MOTO LIMITED,"UNITED KINGDOM, ME2 4NW"
2020 ARCHERY LTD,"LONDON, SE16 6SS"
21 LEISURE LIMITED,"LONDON, EC4M 7WS"
261 FEARLESS CLUB UNITED KINGDOM CIC,"LANCASHIRE, LA2 8RF"
2AIM4 LIMITED,"HERTFORDSHIRE, SG2 0JD"
2POINT4 FM LTD,"LONDON, NW10 8LW"
3 LIONS SCHOOL OF SPORT LTD,"BRISTOL, BS20 8BU"
3 PT LTD,"ANTRIM, BT40 2FB"
3 PUTT LIFE LTD,"UNITED KINGDOM, LU3 2DP"
3 THIRTY SEVEN LTD,"KENT, DA9 9RS"
3:30 SOCCER SCHOOL LTD,"UNITED KINGDOM, EH6 7JB"
30 MINUTE WORKOUT (LLANISHEN) LTD,"PONTYCLUN, CF72 9UA"
321 RELAX LTD,"MID GLAMORGAN, CF83 3HL"
360 MOTOR RACING CLUB LTD,"HALSTEAD, CO9 2ET"
3LIONSATHLETICS LIMITED,"ENGLAND, S3 8DB"
3S SWIM ROMFORD LTD,"UNITED KINGDOM, DA9 9DR"
3XL EVENT MANAGEMENT LIMITED,"KENT, BR3 4NW"
3XL MOTORSPORT MANAGEMENT LIMITED,"KENT, BR3 4NW"
4 CORNER FOOTBALL LTD,"BROMLEY, BR1 5DD"
4 PRO LTD,"UNITED KINGDOM, FY5 5HT"
This seems fine to me, but your post was unclear about how exactly you expected the output to be formatted, so I can't be sure it's what you want.
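As a side note (my own suggestion, not part of the answer above), the length check plus index loop at the end can be tightened with zip, which pairs the two lists element by element and stops at the shorter one:
# Equivalent CSV writing with zip instead of manual indexing.
with open('test.csv', 'w+', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerow(['Name', 'Location'])
    writer.writerows(zip(company_list, locations_list))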

How to split a string into column names with python/pandas?

Do you know how to solve this in Python? I would like to have a DataFrame with the data arranged in the correct columns.
Thanks in advance!
Here is an example of a string from a dataframe.
' Huidigefuncties Michael Jordan 2015 - present Director Marketing & Indirect Channels, Ricoh Nederland 2010 - present Basketball Center, Center for Business-Expertise Loopbaan Michael Jordan 2012 - 2015 Director Marketing & Business Development, Ricoh Opleiding Michael Jordan 1988 - 1992 Marketing , Harvard '
Preferred result
type      from   to        function                                     organization
current   2015   present   Director Marketing & Indirect Channels      Ricoh Nederland
current   2010   present   Owner & Consultant                           Basketball Center
old       2012   2015      Director Marketing & Business Development    Ricoh
school    1988   1992      Marketing                                    Harvard
Current df
Name Data
Michael Jordan ' Huidigefuncties Michael Jordan 2015 - present Director Marketing & Indirect Channels, Ricoh Nederland 2010 - present Basketball Center, Center for Business-Expertise Loopbaan Michael Jordan 2012 - 2015 Director Marketing & Business Development, Ricoh Opleiding Michael Jordan 1988 - 1992 Marketing , Harvard '
Well, this is the solution I came up with for this problem:
import pandas as pd

beautiful_data = 'Huidigefuncties Michael Jordan 2015 - present Director Marketing & Indirect Channels, Ricoh Nederland 2010 - present Basketball Center, Center for Business-Expertise Loopbaan Michael Jordan 2012 - 2015 Director Marketing & Business Development, Ricoh Opleiding Michael Jordan 1988 - 1992 Marketing , Harvard'
main_dict = {'type': [], 'from': [], 'to': [], 'function': [], 'organization': []}
data = beautiful_data.split(' ')
i = 0
huidi_index = data.index('Huidigefuncties')
loopbaan_index = data.index('Loopbaan')
ople_index = data.index('Opleiding')
# print(data)
while i < len(data):
    # Grab the text between two section headers and tag its type.
    if data[i] == 'Huidigefuncties':
        line = ' '.join(data[i + 1: loopbaan_index])
        i = loopbaan_index
        print(line)
        type_data = 'current'
    elif data[i] == 'Loopbaan':
        line = ' '.join(data[i + 1: ople_index])
        i = ople_index
        print(line)
        type_data = 'old'
    elif data[i] == 'Opleiding':
        line = ' '.join(data[i + 1:])
        i = len(data)
        print(line)
        type_data = 'school'
    else:
        i += 1
    data_line = line.split('-')
    if len(data_line) == 2:
        # Single entry in the section: "<from> - <to> <function>, <organization>"
        print(type_data)
        main_dict['type'].append(type_data)
        from_data = data_line[0].strip().split(' ')[-1]
        print(from_data)
        main_dict['from'].append(from_data)
        to_data = data_line[1].strip().split(' ')[0]
        print(to_data)
        main_dict['to'].append(to_data)
        function_data = ' '.join(data_line[1].strip().split(' ')[1:-1])[:-1]
        print(function_data)
        main_dict['function'].append(function_data)
        organization_data = data_line[1].split(',')[-1].strip()
        print(organization_data)
        main_dict['organization'].append(organization_data)
    elif len(data_line) > 2:
        # Several entries (or a hyphenated word) in one section:
        # slide a window of two '-'-separated chunks over the section.
        j = 0
        while j < len(data_line):
            register_data = data_line[j:j + 2]
            if len(register_data) > 1:
                if len(register_data[0].split(' ')) > 1 and len(register_data[1].split(' ')) > 1:
                    if j == 0:
                        print(register_data)
                        print('----------')
                        print(type_data)
                        main_dict['type'].append(type_data)
                        from_data = register_data[0].strip().split(' ')[-1]
                        print(from_data)
                        main_dict['from'].append(from_data)
                        to_data = register_data[1].strip().split(' ')[0]
                        print(to_data)
                        main_dict['to'].append(to_data)
                        function_org = register_data[1].strip().split(',')
                        function_data = ' '.join(function_org[0].split(' ')[1:])
                        print(function_data)
                        main_dict['function'].append(function_data)
                        org_data = ' '.join(function_org[1].split(' ')[:-1]).strip()
                        print(org_data)
                        main_dict['organization'].append(org_data)
                        print('-----------')
                    else:
                        print('-----------')
                        print(register_data)
                        print(type_data)
                        main_dict['type'].append(type_data)
                        from_data = register_data[0].strip().split(' ')[-1]
                        print(from_data)
                        main_dict['from'].append(from_data)
                        to_data = register_data[1].strip().split(' ')[0]
                        print(to_data)
                        main_dict['to'].append(to_data)
                        function_org = register_data[1].strip().split(',')
                        function_data = ' '.join(function_org[0].split(' ')[1:])
                        print(function_data)
                        main_dict['function'].append(function_data)
                        org_data = ' '.join(function_org[1].split(' ')).strip()
                        print(org_data)
                        main_dict['organization'].append(org_data)
                        print('-----------')
            j += 1
df = pd.DataFrame(main_dict)
Tested
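An alternative sketch (my own take, not the answer above): split the blob on the three Dutch section headers, then pull every "<year> - <year|present> <function>, <organization>" entry out with a single regex. The section names and the entry shape are assumptions based on the one example string.
import re
import pandas as pd

section_types = {'Huidigefuncties': 'current', 'Loopbaan': 'old', 'Opleiding': 'school'}
entry_re = re.compile(r'(\d{4})\s*-\s*(present|\d{4})\s+([^,]+?)\s*,\s*(.*?)(?=\s*\d{4}\s*-|$)')

rows = []
# re.split with a capturing group keeps the headers in the result,
# so headers sit at the odd indices and their bodies follow them.
parts = re.split(r'(Huidigefuncties|Loopbaan|Opleiding)', beautiful_data)
for header, body in zip(parts[1::2], parts[2::2]):
    for frm, to, function, organization in entry_re.findall(body):
        rows.append({'type': section_types[header], 'from': frm, 'to': to,
                     'function': function, 'organization': organization.strip()})

df = pd.DataFrame(rows)
print(df)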
