How to get an item from nested arrays in Python

I am scraping a web table and appending the scraped data to an array. After appending, I get an array like this (there are arrays inside the array):
[['Action'], ['1 796004', '35', '2022-04-28', '2013 FORD FUSION TITANIUM', '43004432', '3FA6P0RU3DR297126', 'CA', 'Copart', 'Batumi, Georgia', 'CAIU7608231EBKG03172414', '2022-05-02', '2022-05-02', '0000-00-00', '', 'dock receipt', 'YES', '', 'No', '', '5/3/2022 Per auction, the title is present and will be prepared for mail out; Follow up for a tracking number-Clara5/9/2022 Per auction, they are still working on
mailing out this title; Follow up required for a tracking number-Clara5/11/2022 Per auction, the title was mailed out; tr#776771949089-Clara[Add notes]', 'A779937', '', '', '', '[edit]', ''], ['2 763189', '43', '2022-01-10', '2018 TOYOTA CAMRY', '43241241', '4T1B11HK7JU080162', 'GA', 'Copart', 'Poti, Georgia', 'MRKU5529916217682189', '2022-01-25', '2022-01-28', '2022-06-20', '2022-06-27', 'dock receipt', 'YES', '2022-01-28', 'Yes', '', '[Add notes]', 'A774742', '', '', '', '[edit]', ''], ['3 762850', '37', '2022-01-07', '2017 VOLKSWAGEN TOUAREG', '65835511', 'WVGRF7BP3HD000549', 'CA', 'Copart', 'Batumi, Georgia', 'MSDU7281152EBKG02708589', '2022-02-09', '2022-02-09', '2022-06-07', '2022-06-14', 'dock receipt', 'YES', '2022-01-20', 'Yes', '', '[Add notes]', 'A774650', '', '', '', '[edit]', ''],]
Now I want to get the 4th item of each inner array (it is actually the car model; e.g. for the first appended array it is "2013 FORD FUSION TITANIUM") from this data. I want to end up with: "2013 FORD FUSION TITANIUM", "2018 TOYOTA CAMRY", etc.
How can I achieve that?

Loop from index 1 to the end of the array so that the first subarray (['Action']) is skipped.
At every iteration, select the ith subarray and take its element at index 3 (the 4th item).
The result will be a list of the strings that you wanted.
prompt = [
['Action'],
['1 796004', '35', '2022-04-28', '2013 FORD FUSION TITANIUM', '43004432', '3FA6P0RU3DR297126', 'CA', 'Copart', 'Batumi, Georgia', 'CAIU7608231EBKG03172414', '2022-05-02', '2022-05-02', '0000-00-00', '', 'dock receipt', 'YES', '', 'No', '', '5/3/2022 Per auction, the title is present and will be prepared for mail out; Follow up for a tracking number-Clara5/9/2022 Per auction, they are still working on mailing out this title; Follow up required for a tracking number-Clara5/11/2022 Per auction, the title was mailed out; tr#776771949089-Clara[Add notes]', 'A779937', '', '', '', '[edit]', ''],
['2 763189', '43', '2022-01-10', '2018 TOYOTA CAMRY', '43241241', '4T1B11HK7JU080162', 'GA', 'Copart', 'Poti, Georgia', 'MRKU5529916217682189', '2022-01-25','2022-01-28', '2022-06-20', '2022-06-27', 'dock receipt', 'YES', '2022-01-28', 'Yes', '', '[Add notes]', 'A774742', '', '', '', '[edit]', ''],
['3 762850', '37', '2022-01-07', '2017 VOLKSWAGEN TOUAREG', '65835511', 'WVGRF7BP3HD000549', 'CA', 'Copart', 'Batumi, Georgia', 'MSDU7281152EBKG02708589', '2022-02-09', '2022-02-09', '2022-06-07', '2022-06-14', 'dock receipt', 'YES', '2022-01-20', 'Yes', '', '[Add notes]', 'A774650', '', '', '', '[edit]', ''],
]
res = [prompt[i][3] for i in range(1, len(prompt))]
print(res) # ['2013 FORD FUSION TITANIUM', '2018 TOYOTA CAMRY', '2017 VOLKSWAGEN TOUAREG']
If I misunderstood the question, please let me know.
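As a variation on the comprehension above, here is a minimal sketch that iterates over a slice instead of indices and skips any row too short to have a model column. The shortened rows, the name MODEL_INDEX, and the length check are illustrative assumptions, not part of the original answer.

```python
# Shortened copy of the scraped rows; only the first four columns are kept here.
rows = [
    ['Action'],
    ['1 796004', '35', '2022-04-28', '2013 FORD FUSION TITANIUM'],
    ['2 763189', '43', '2022-01-10', '2018 TOYOTA CAMRY'],
    ['3 762850', '37', '2022-01-07', '2017 VOLKSWAGEN TOUAREG'],
]

MODEL_INDEX = 3  # assumed position of the car model in each data row

# rows[1:] skips the ['Action'] header row; the length check guards
# against rows that are too short to contain a model column.
models = [row[MODEL_INDEX] for row in rows[1:] if len(row) > MODEL_INDEX]

print(models)
```

Iterating over the rows directly avoids the range(len(...)) bookkeeping, and the guard keeps a malformed row from raising IndexError.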

Related

BeautifulSoup Only Scraping Last Result

I am trying to scrape runner names and number of tips from this page: https://www.horseracing.net/racecards/newmarket/13-05-21
It is only returning the last runner name in the final race. I've been over and over it but can't see what I have done wrong.
Can anyone see the issue?
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
url = "https://www.horseracing.net/racecards/newmarket/13-05-21"
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
date = []
course = []
time = []
runner = []
tips = []
runner_div = soup.find_all('div', class_='row-cell-right')
for container in runner_div:
    runner_name = container.h5.a.text
    runner.append(runner_name)
    tips_no = container.find('span', class_='tip-text number-tip').text if container.find('span', class_='tip-text number-tip') else ''
    tips.append(tips_no)
print(runner_name, tips_no)
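runner_name and tips_no are overwritten on every iteration, so printing them after the loop shows only the last runner. The lists runner and tips hold every value, so try print(runner, tips) instead of print(runner_name, tips_no):
print(runner, tips)
Output: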
# ['Babindi', 'Turandot', 'Sharla', "Serena's Queen", 'Bellazada', 'Baby Alya', 'Adelita', 'Florence Street', 'Allerby', 'Puy Mary', 'Roman Mist', 'Lunar Shadow', 'Breakfastatiffanys', 'General Panic', 'Gidwa', 'Point Lynas', 'Three Dons', 'Wrought Iron', 'Desert Dreamer', 'Adatorio', 'Showmedemoney', 'The Charmer',
# 'Bascinet', 'Dashing Rat', 'Appellation', 'Cambridgeshire', 'Danni California', 'Drifting Sands', 'Lunar Gold', 'Malathaat', 'Miss Calacatta', 'Sunrise Valley', 'Sweet Expectation', 'White Lady', 'Riknnah', 'Aaddeey', 'High Commissioner', 'Kaloor', 'Rodrigo Diaz', 'Mukha Magic', 'Gauntlet', 'Hawridge Flyer', 'Clovis Point', 'Franco Grasso', 'Kemari', 'Magical Land', 'Mobarhin', 'Movin Time', 'Night Of Dreams', 'Punta Arenas', 'Constanta', 'Cosmic George', 'Taravara', 'Basilicata', 'Top Brass', 'Without Revenge', 'Grand Scheme', 'Easy Equation', 'Mr Excellency', 'Colonel Faulkner', 'Urban War', 'Freak Out', 'Alabama Boy', 'Anghaam', 'Arqoob', 'Fiordland', 'Dickens', "Shuv H'Penny King"]
# ['5', '3', '1', '3', '1', '', '1', '', '', '', '1', '', '', '', '', '1', '', '', '12', '1', '', '', '', '', '', '', '5', '', '1', '', '', '7', '', '', '1', '11', '1', '', '', '', '', '2', '', '', '1', '3', '2', '9', '', '', '', '', '5', '1', '4', '', '5', '', '1', '4', '2', '1', '3', '2', '1', '', '', '']
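To illustrate why the loop variables show only the last result while the lists show everything, here is a small sketch that needs no network access. The scraped list is a hypothetical stand-in for the page's containers, using the first three runner/tip pairs from the output above.

```python
# Hypothetical stand-in for the scraped containers: (runner name, tips) pairs.
scraped = [("Babindi", "5"), ("Turandot", "3"), ("Sharla", "1")]

runner = []
tips = []
for runner_name, tips_no in scraped:
    runner.append(runner_name)
    tips.append(tips_no)

# After the loop, the loop variables still hold only the final item.
last_seen = (runner_name, tips_no)

# The lists hold every value; zip pairs each runner with its tip count.
pairs = list(zip(runner, tips))
for name, count in pairs:
    print(name, count)
```

If one line per runner is wanted, moving the print inside the loop body would have the same effect as iterating over the zipped lists afterwards.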

How to sort a list by date correctly?

I'm trying to sort a list of data, but the sort doesn't seem to work. I first took out the date portion of each entry and compared it to the current date.
from datetime import datetime, date
test = {
'3001265': ['Samsung', 'phone', '1200', '12/1/2023', ''],
'1009453': ['Lenovo', 'tower', '599', '10/1/2020', ''],
'1167234': ['Apple', 'phone', '534', '2/1/2021', ''],
'2390112': ['Dell', 'laptop', '799', '7/2/2020', ''],
'9034210': ['Dell', 'tower', '345', '5/27/2020', ''],
'7346234': ['Lenovo', 'laptop', '239', '9/1/2020', 'damaged'],
'2347800': ['Apple', 'laptop', '999', '7/3/2020', '']
}
test_dates = []
for value, index in test.items():
    if index[3] not in test_dates:
        test_dates.append(index[3])
test_day = []
today = date.today()
for date in test_dates:
    other_date = datetime.date(datetime.strptime(date, '%m/%d/%Y'))
    if other_date < today:
        other_date = other_date.strftime('X%m/X%d/%Y').replace('X0','X').replace('X','')
        test_day.append(other_date)
print(test_day)
table = range(len(test_day))
for num in table:
    entry_list = []
    for row, entry in test.items():
        entry = [row] + entry
        for line in entry:
            if test_day[num] in line:
                entry_list.append(entry)
    sorted_entry = sorted(entry_list, key=lambda x: x[4])
    for object in sorted_entry:
        print(object)
My goal is to find all the dates that are before today and output all their information sorted by date.
This is the output:
['1009453', 'Lenovo', 'tower', '599', '10/1/2020', '']
['2390112', 'Dell', 'laptop', '799', '7/2/2020', '']
['9034210', 'Dell', 'tower', '345', '5/27/2020', '']
['7346234', 'Lenovo', 'laptop', '239', '9/1/2020', 'damaged']
['2347800', 'Apple', 'laptop', '999', '7/3/2020', '']
The ideal output has the rows ordered by the date column:
['9034210', 'Dell', 'tower', '345', '5/27/2020', '']
['2390112', 'Dell', 'laptop', '799', '7/2/2020', '']
['2347800', 'Apple', 'laptop', '999', '7/3/2020', '']
['7346234', 'Lenovo', 'laptop', '239', '9/1/2020', 'damaged']
['1009453', 'Lenovo', 'tower', '599', '10/1/2020', '']
The sort does not work because the dates are compared as strings rather than as dates, so '10/1/2020' sorts before '5/27/2020'.
Your code is longer than it needs to be; this sort is actually very easy to do.
from datetime import datetime
test = {
'3001265': ['Samsung', 'phone', '1200', '12/1/2023', ''],
'1009453': ['Lenovo', 'tower', '599', '10/1/2020', ''],
'1167234': ['Apple', 'phone', '534', '2/1/2021', ''],
'2390112': ['Dell', 'laptop', '799', '7/2/2020', ''],
'9034210': ['Dell', 'tower', '345', '5/27/2020', ''],
'7346234': ['Lenovo', 'laptop', '239', '9/1/2020', 'damaged'],
'2347800': ['Apple', 'laptop', '999', '7/3/2020', '']
}
data = list(test.values())
result = sorted(data, key=lambda x:datetime.strptime(x[3], '%m/%d/%Y'))
print(result)
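To make the difference between string order and date order concrete, here is a small sketch using three of the dates from the question. The variable names are illustrative only.

```python
from datetime import datetime

dates = ['10/1/2020', '7/2/2020', '5/27/2020']

# Plain string sort: compared character by character, so '1' < '5' < '7'.
as_strings = sorted(dates)

# Date-aware sort: each string is parsed into a datetime before comparison.
as_dates = sorted(dates, key=lambda d: datetime.strptime(d, '%m/%d/%Y'))

print(as_strings)  # ['10/1/2020', '5/27/2020', '7/2/2020']
print(as_dates)    # ['5/27/2020', '7/2/2020', '10/1/2020']
```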
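Try changing this line so that the sort key is a parsed date instead of a string (this only changes the order if entry_list holds all the matching rows when it is sorted):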
sorted_entry = sorted(entry_list, key=lambda x: x[4])
To:
sorted_entry = sorted(entry_list, key=lambda x: datetime.strptime(x[4], '%m/%d/%Y'))
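Neither snippet above both filters out future dates and keeps the dictionary keys, which the question asked for. A minimal sketch of the full task might look like the following. The helper parse and the fixed cutoff date are assumptions made so the result is reproducible; in the question the cutoff would be date.today().

```python
from datetime import datetime, date

test = {
    '3001265': ['Samsung', 'phone', '1200', '12/1/2023', ''],
    '1009453': ['Lenovo', 'tower', '599', '10/1/2020', ''],
    '1167234': ['Apple', 'phone', '534', '2/1/2021', ''],
    '2390112': ['Dell', 'laptop', '799', '7/2/2020', ''],
    '9034210': ['Dell', 'tower', '345', '5/27/2020', ''],
    '7346234': ['Lenovo', 'laptop', '239', '9/1/2020', 'damaged'],
    '2347800': ['Apple', 'laptop', '999', '7/3/2020', ''],
}

def parse(text):
    """Turn an 'm/d/YYYY' string into a date object (hypothetical helper)."""
    return datetime.strptime(text, '%m/%d/%Y').date()

cutoff = date(2020, 11, 1)  # assumption: fixed stand-in for date.today()

# Prepend each key to its row so the ID is kept, as in the question's output.
rows = [[key] + values for key, values in test.items()]

# Keep only rows dated before the cutoff, then sort all of them together.
past_rows = sorted(
    (row for row in rows if parse(row[4]) < cutoff),
    key=lambda row: parse(row[4]),
)

for row in past_rows:
    print(row)
```

Collecting every matching row first and sorting once is what lets the key function affect the final order.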

How to extract table content from CSV file after been converted from PDF with Python?

I have a PDF that contains many tables. (I converted this PDF to CSV online to extract the needed data more easily.)
The CSV rows are composed of many columns, but each table contains only 3 columns, so it is hard to know which column a cell belongs to.
I should also mention that a single cell can span more than one line and column.
An example of a table.
So is there any solution to extract these cells?
import csv
import re
class PDF_EXTRACTOR:
    FILE_NAME=None
    Ttableau=None
    NUMBER_OF_PAGES=None
    def __init__(self,fn):
        self.FILE_NAME=fn
        self.Ttableau=0
        self.NUMBER_OF_PAGES=0
    def EXTRACT_CELLULE(self):
        csv.register_dialect('mydialect',delimiter =',',skipinitialspace=True)
        print(csv.list_dialects())
        with open(self.FILE_NAME,'r',encoding='utf8',errors='ignore') as csvFile:
            reader = csv.reader(csvFile, dialect='mydialect')
            for index, row in enumerate(reader):
                print(row)
I expected an output like this:
["Region 1","Region 2", "Region 3"]
["8,3-9","AUXILIAIRES DE LA MÉTÉOROLOGIE 5.54A 5.54B 5.54C",""]
["70-72","70-90","70-72"]
["RADIONAVIGATION 5.60","FIXE","MARITIME 5.60"]
But instead I got this:
['7 450-8 100', 'FIXE', '', '', '', '']
['', 'MOBILE sauf mobile aéronautique (R)', '', '', '', '']
['', '5.144', '', '', '', '']
['8 100-8 195', 'FIXE', '', '', '', '']
['', 'MOBILE MARITIME', '', '', '', '']
['8 195-8 815', 'MOBILE MARITIME', '5.109', '5.11', '5.132', '5.145']
['', '5.111', '', '', '', '']
['8 815-8 965', 'MOBILE AÉRONAUTIQUE (R)', '', '', '', '']
['8 965-9 040', 'MOBILE AÉRONAUTIQUE (OR)', '', '', '', '']
['9 040-9 305', '9 040-9 400', '', '', '', '9 040-9 305']
['FIXE', 'FIXE', '', '', '', 'FIXE']
['9 305-9 355', '', '', '', '', '9 305-9 355']
['FIXE', '', '', '', '', 'FIXE']
['Radiolocalisation 5.145A', '', '', '', '', 'Radiolocalisation 5.145A']
['5.145B', '', '', '', '', '']
['9 355-9 400', '', '', '', '', '9 355-9 400']

Limiting data in pd.DataFrame

I am trying to implement the following with loading an internal data structure to pandas:
df = pd.DataFrame(self.data,
                  nrows=num_rows+500,
                  skiprows=skip_rows,
                  header=header_row,
                  usecols=limit_cols)
However, it doesn't appear to implement any of those arguments (as it does when reading a CSV file); it only takes the data. Is there another method I can use to have more control over the data that I'm ingesting? Or do I need to rebuild the data before loading it into pandas?
My input data looks like this:
data = [
['ABC', 'es-419', 'US', 'Movie', 'Full Extract', 'PARIAH', '', '', 'EST', 'Features - EST', 'HD', '2017-05-12 00:00:00', 'Open', 'WSP', '10.5000', '', '', '', '', '10.5240/8847-7152-6775-8B59-ADE0-Y', '10.5240/FFE3-D036-A9A4-9E7A-D833-1', '', '', '', '04065', '', '', '2011', '', '', '', '', '', '', '', '', '', '', '', '113811', '', '', '', '', '', '04065', '', 'Spanish (LAS)', 'US', '10', 'USA NATL SALE', '2017-05-11 00:00:00', 'TIER 3', '21', '', '', 'USA NATL SALE-SPANISH LANGUAGE', 'SPAN'],
['ABC', 'es-419', 'US', 'Movie', 'Full Extract', 'PATCH ADAMS', '', '', 'EST', 'Features - EST', 'HD', '2017-05-12 00:00:00', 'Open', 'WSP', '10.5000', '', '', '', '', '10.5240/DD84-FBF4-8F67-D6F3-47FF-1', '10.5240/B091-00D4-8215-39D8-0F33-8', '', '', '', 'U2254', '', '', '1998', '', '', '', '', '', '', '', '', '', '', '', '113811', '', '', '', '', '', 'U2254', '', 'Spanish (LAS)', 'US', '10', 'USA NATL SALE', '2017-05-11 00:00:00', 'TIER 3', '21', '', '', 'USA NATL SALE-SPANISH LANGUAGE', 'SPAN']
]
And so I'm looking to be able to state which rows it should load (or skip) and which columns it should skip (usecols). Is that possible to do with an internal python data structure?
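One possible approach, sketched here under assumptions, is to trim the list of lists with plain Python before handing it to pandas, since nrows, skiprows and usecols are read_csv parameters rather than DataFrame constructor arguments. The function select_rows_and_columns, its parameter names, the header row, and the 'EXTRA' row are all hypothetical; only the two film titles and years come from the data above.

```python
from itertools import islice

def select_rows_and_columns(data, skip_rows=0, num_rows=None, use_cols=None):
    """Return a trimmed copy of a list-of-lists table (hypothetical helper)."""
    stop = None if num_rows is None else skip_rows + num_rows
    selected = []
    for row in islice(data, skip_rows, stop):
        if use_cols is None:
            selected.append(list(row))
        else:
            # Keep only the requested column positions, in the given order.
            selected.append([row[i] for i in use_cols])
    return selected

# Small illustrative table; the header row and the 'EXTRA' row are made up.
data = [
    ['title', 'year', 'format'],
    ['PARIAH', '2011', 'HD'],
    ['PATCH ADAMS', '1998', 'HD'],
    ['EXTRA', '2000', 'SD'],
]

use_cols = (0, 1)
header = [data[0][i] for i in use_cols]
body = select_rows_and_columns(data, skip_rows=1, num_rows=2, use_cols=use_cols)

print(header)
print(body)
```

The trimmed result could then be loaded with pd.DataFrame(body, columns=header), which keeps the selection logic independent of pandas.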

Flatten Json Output

How can I turn this data into a flat data frame?
I've tried using json_normalize and pivot, but I can't seem to get the format right.
This is my desired output format:
SiteName|SiteId|...|CompressorMeterRefID|TankID|TankNumber...|TankID|TankNumber...|TankID|... DateandTime|...
Please advise
[{'SiteName': 'Reinschmiedt 1-4H (CRP 11)',
'SiteId': 57,
'SiteRefId': 'OK10020',
'Choke': '',
'GasMeter1': 53.25,
'GasMeter1Name': 'Check Meter',
'GasMeter1RefId': '',
'GasMeter2Name': '',
'GasMeter2RefId': '',
'GasMeter3Name': '',
'GasMeter3RefId': '',
'OilMeter1Name': '',
'OilMeter1RefId': '',
'OilMeter2Name': '',
'OilMeter2RefId': '',
'WaterMeter1': 0.0,
'WaterMeter1Name': 'Water Meter',
'WaterMeter1RefId': '',
'WaterMeter2Name': '',
'WaterMeter2RefId': '',
'FlareMeterName': '',
'FlareMeterRefId': '',
'GasLiftMeterName': '',
'GasLiftMeterRefId': '',
'CompressorMeterName': '',
'CompressorMeterRefId': '',
'TankEntries': [{'TankId': 138,
'TankNumber': 2,
'TankLevelDateTime': '2018-07-01T12:00:00.0000000Z',
'TankLevelDateTimeLocal': '2018-07-01T07:00:00.0000000Z',
'TankTopGauge': 35.99,
'TankName': 'Oil Tank 209206',
'TankRefId': 0,
'TankRefId2': '',
'TankRefId3': ''},
{'TankId': 139,
'TankNumber': 3,
'TankLevelDateTime': '2018-07-01T12:00:00.0000000Z',
'TankLevelDateTimeLocal': '2018-07-01T07:00:00.0000000Z',
'TankTopGauge': 109.5,
'TankName': 'Oil Tank 209207',
'TankRefId': 0,
'TankRefId2': '',
'TankRefId3': ''}],
'DateAndTime': '2018-07-01T12:00:00.0000000Z',
'DateAndTimeLocal': '2018-07-01T07:00:00.0000000Z',
'UserName': 'ScadaVisor',
'Notes': ''},
{'SiteName': 'Allen 1-11H (CRP 8)',
.....
.....
.....
In R you can do it like this using the jsonlite package:
result<- as.data.frame(jsonlite::stream_in(textConnection(data)))
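For a Python take on the same goal, here is a minimal sketch that spreads each entry of the nested TankEntries list into numbered columns of a single flat record. The function flatten_site and the _1, _2 column suffixes are assumptions chosen for illustration; the sample record is a shortened copy of the data in the question.

```python
def flatten_site(site, list_key="TankEntries"):
    """Copy the scalar fields and spread each nested entry into numbered columns."""
    flat = {key: value for key, value in site.items() if key != list_key}
    for number, entry in enumerate(site.get(list_key, []), start=1):
        for key, value in entry.items():
            # e.g. 'TankId' in the first entry becomes 'TankId_1'
            flat[f"{key}_{number}"] = value
    return flat

# Shortened copy of the first record from the question.
sites = [
    {
        "SiteName": "Reinschmiedt 1-4H (CRP 11)",
        "SiteId": 57,
        "TankEntries": [
            {"TankId": 138, "TankNumber": 2, "TankTopGauge": 35.99},
            {"TankId": 139, "TankNumber": 3, "TankTopGauge": 109.5},
        ],
        "DateAndTime": "2018-07-01T12:00:00.0000000Z",
    }
]

flat_rows = [flatten_site(site) for site in sites]
print(flat_rows[0])
```

The resulting list of flat dictionaries can be passed to pandas.DataFrame(flat_rows); sites with fewer tanks would simply have missing values in the higher-numbered columns. Note that in this sketch the tank columns come after the other site fields, so the columns may need reordering to match the desired layout.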
