The second and third rows should be a single row - python

from bs4 import BeautifulSoup
import urllib2
from lxml.html import fromstring
import re
import csv
import pandas as pd

wiki = "http://en.wikipedia.org/wiki/List_of_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

try:
    table = soup.find_all('table')[1]
except AttributeError as e:
    print 'No tables found, exiting'

try:
    rows = table.find_all('tr')
except AttributeError as e:
    print 'No table rows found, exiting'

try:
    first = table.find_all('tr')[0]
except AttributeError as e:
    print 'No table row found, exiting'

try:
    allRows = table.find_all('tr')[1:]
except AttributeError as e:
    print 'No table row found, exiting'

results = []
firstRow = first.find_all('td')
results.append([header.get_text() for header in firstRow])
for row in allRows:
    table_headers = row.find_all('th')
    table_data = row.find_all('td')
    if table_headers:
        results.append([headers.get_text() for headers in table_headers])
    if table_data:
        results.append([data.get_text() for data in table_data])
df = pd.DataFrame(data=results)
df
Desired output:
Margin Teams Venue Season
Innings and 579 runs | England (903-7 d) beat Australia (201 & 123) | The Oval, London | 1938
Innings and 360 runs | Australia (652–7 d) beat South Africa (159 & ..| New Wanderers Stadium, Johannesburg | 2001–02
Innings and 336 runs | West Indies (614–5 d) beat India (124 & 154) | Eden Gardens, Kolkata | 1958–59
Innings and 332 runs | Australia (645) beat England (141 & 172) | Brisbane Cricket Ground | 1946–47
Innings and 324 runs | Pakistan (643) beat New Zealand (73 & 246) | Gaddafi Stadium, Lahore | 2002

You need to collect both th and td tags:
for row in allRows:
    results.append([data.get_text() for data in row.find_all(['th', 'td'])])
And, don't forget to omit the last row, it has only Last updated: ... text inside:
allRows = table.find_all('tr')[1:-1]
Additionally, if you want to have column names in your dataframe matching table headers on a page, you need to specify columns keyword argument while creating a dataframe:
headers = [header.get_text() for header in first.find_all('td')]
results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]
df = pd.DataFrame(data=results, columns=headers)
print(df)
Produces:
Margin Teams \
0 Innings and 579 runs  England (903-7 d) beat Australia (201 & 123)
1 Innings and 360 runs  Australia (652–7 d) beat South Africa (159 & ...
2 Innings and 336 runs  West Indies (614–5 d) beat India (124 & 154)
3 Innings and 332 runs  Australia (645) beat England (141 & 172)
4 Innings and 324 runs  Pakistan (643) beat New Zealand (73 & 246)
Venue Season
0 The Oval, London 1938
1 New Wanderers Stadium, Johannesburg 2001–02
2 Eden Gardens, Kolkata 1958–59
3 Brisbane Cricket Ground 1946–47
4 Gaddafi Stadium, Lahore 2002
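
As a side note (my own sketch, not part of the original answer), you can see why searching for both tags at once fixes the split rows by running the row-merging line on a minimal, hypothetical table:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>Margin</td><td>Teams</td></tr>
  <tr><th>Innings and 579 runs</th><td>England beat Australia</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# find_all(['th', 'td']) returns each row's header and data cells together,
# in document order, instead of two separate per-tag lists.
merged = [[cell.get_text() for cell in row.find_all(["th", "td"])] for row in rows]
print(merged)
# [['Margin', 'Teams'], ['Innings and 579 runs', 'England beat Australia']]
```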

Returning only last item and splitting into columns

I'm having a couple of issues - I seem to be only returning the last item on this list. Can someone help me here please? I also want to split the df into columns filtering all of the postcodes into one column. Not sure where to start with this. Help much appreciated. Many thanks in advance!
import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = "https://www.matki.co.uk/matki-dealers/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="dealer-overview")
company_elements = results.find_all("article")
for company_element in company_elements:
    company_info = company_element.getText(separator=u', ').replace('Find out more »', '')
    print(company_info)
data = {company_info}
df = pd.DataFrame(data)
df.shape
df
IIUC, you need to replace the loop with:
df = pd.DataFrame({'info': [e.getText(separator=u', ').replace('Find out more »', '')
                            for e in company_elements]})
output:
info
0 ESP Bathrooms & Interiors, Queens Retail Park,...
1 Paul Scarr & Son Ltd, Supreme Centre, Haws Hil...
2 Stonebridge Interiors, 19 Main Street, Pontela...
3 Bathe Distinctive Bathrooms, 55 Pottery Road, ...
4 Draw A Bath Ltd, 68 Telegraph Road, Heswall, W...
.. ...
346 Warren Keys, Unit B Carrs Lane, Tromode, Dougl...
347 Haldane Fisher, Isle of Man Business Park, Coo...
348 David Scott (Agencies) Ltd, Supreme Centre, 11...
349 Ballycastle Homecare Ltd, 2 The Diamond, Bally...
350 Beggs & Partners, Great Patrick Street, Belfas...
[351 rows x 1 columns]
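
For the second part of the question (pulling the postcodes out into their own column), one sketch is `Series.str.extract` with a rough UK-postcode regex; the sample strings and the pattern below are my own approximations, not from the original answer:

```python
import pandas as pd

# Hypothetical rows shaped like the scraped 'info' strings.
df = pd.DataFrame({'info': [
    'ESP Bathrooms & Interiors, Queens Retail Park, Preston, PR1 1AB',
    'Stonebridge Interiors, 19 Main Street, Ponteland, NE20 9NH',
]})

# Approximate UK postcode: outward code, optional space, inward code.
postcode_re = r'([A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2})'
df['postcode'] = df['info'].str.extract(postcode_re, expand=False)
print(df['postcode'].tolist())
# ['PR1 1AB', 'NE20 9NH']
```

`expand=False` makes `str.extract` return a Series for a single capture group, so it can be assigned straight into a new column.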

Split lists with uncertain elements into different categories (using pandas)

I am having trouble with a pandas split. So I have a column of data that looks something like this:
Initial Dataframe
index | Address
0 | [123 New York St]
1 | [Amazing Building, 23 New Jersey St, 2F]
2 | [98 New Mexico Ave, 16F]
3 | [White House, 1600 Pennsylvania Ave, PH]
4 | [221 Baker Street]
5 | [Hogwarts]
As you can see, the list contains varying categories and number of elements. Some have building names along with addresses. Some only have addresses with building floors. I want to sort them out by category (building name, address, unit/floor number) but I'm having trouble coming up with a solution to this, as I'm a beginner python & pandas learner.
How do I split the addresses into different categories to get an output that looks like this, assuming the building names ALL start with an alphabet and I can put Null for categories with missing value?
Desired Output:
index | Building Name | Address | Unit Number
0 | Null | 123 New York St | Null
1 | Amazing Building | 23 New Jersery St. | 2F
2 | Null | 98 New Mexico Ave. | 16F
3 | White House | 1600 Pennsylvania Ave | PH
4 | Null | 221B Baker St | Null
5 | Hogwarts | Null | Null
The main thing I need is for all addresses to be in the Address Column. Thanks for any help!
Precondition: the building name starts with a letter, not a number. If a building name starts with a number, this can produce the wrong result.
import pandas as pd

df = pd.DataFrame({'addr': ['123 New York St',
                            'Amazing Building, 23 New Jersey St, 2F',
                            '98 New Mexico Ave, 16F']})

# Check the number of items in the address value
df['addr'] = df['addr'].str.split(',')
df['cnt'] = df['addr'].apply(lambda x: len(x)).values

# Function: check whether the value starts with a number
def CheckInt(s):
    try:
        int(s[0])
        return True
    except ValueError:
        return False

for i, v in df.iterrows():
    # One item in the address value
    if v.cnt == 1:
        df.loc[i, 'Address'] = v.addr[0]
    # Three items in the address value
    elif v.cnt == 3:
        df.loc[i, 'Building'] = v.addr[0]
        df.loc[i, 'Address'] = v.addr[1]
        df.loc[i, 'Unit'] = v.addr[2]
    # Two items in the address value
    else:
        if CheckInt(v.addr[0]):
            df.loc[i, 'Address'] = v.addr[0]
            df.loc[i, 'Unit'] = v.addr[1]
        else:
            df.loc[i, 'Building'] = v.addr[0]
            df.loc[i, 'Address'] = v.addr[1]
We can get the output for your input dataframe as below. If the data is different, you may have to tinker around.
import numpy as np

df['com_Address'] = df['Address'].apply(lambda x: x.replace('[', '').replace(']', '')).str.split(',')
st_list = ['St', 'Ave']
df['St_Address'] = df.apply(lambda x: [a if st in a else '' for st in st_list for a in x['com_Address']], axis=1)
df['St_Address'] = df['St_Address'].apply(lambda x: [i for i in x if i]).astype(str).apply(lambda x: x.strip("[]'"))
df['Building Name'] = df.apply(lambda x: [x['com_Address'][0] if len(x['com_Address']) == 3 else 'Null'], axis=1).astype(str).apply(lambda x: x.strip("[]'"))
df['Building Name'] = df.apply(lambda x: np.where((len(x['com_Address']) == 1) & (x['St_Address'] == ''), x['com_Address'][0], x['Building Name']), axis=1)
df['Unit Number'] = df.apply(lambda x: [x['com_Address'][2] if len(x['com_Address']) == 3 else 'Null'], axis=1).astype(str).apply(lambda x: x.strip("[]'"))
df['Unit Number'] = df.apply(lambda x: np.where((len(x['com_Address']) == 2) & (x['St_Address'] != ''), x['com_Address'][-1], x['Unit Number']), axis=1)
df
Column "com_Address" is optional. I had to create it because the 'Address' from your input came to me as a string & not as a list. If you already have it as list, you don't need this & you will have to update "com_Address" with 'Address' in the code.
Output
index Address com_Address Building Name St_Address Unit Number
0 0 [123 New York St] [ 123 New York St] Null 123 New York St Null
1 1 [Amazing Building, 23 New Jersey St, 2F] [ Amazing Building, 23 New Jersey St, 2F] Amazing Building 23 New Jersey St 2F
2 2 [98 New Mexico Ave, 16F] [ 98 New Mexico Ave, 16F] Null 98 New Mexico Ave 16F
3 3 [White House, 1600 Pennsylvania Ave, PH] [ White House, 1600 Pennsylvania Ave, PH] White House 1600 Pennsylvania Ave PH
4 4 [221 Baker Street] [ 221 Baker Street] Null 221 Baker Street Null
5 5 [Hogwarts] [ Hogwarts] Hogwarts Null
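
The branching in the first answer can also be packaged as a plain function that returns a (building, address, unit) tuple; this is my own sketch under the same precondition (building names start with a letter), with None standing in for missing slots:

```python
def split_address(parts):
    # Classify a 1-, 2-, or 3-element address list into
    # (building, address, unit); missing slots become None.
    parts = [p.strip() for p in parts]
    if len(parts) == 3:
        return tuple(parts)
    if len(parts) == 2:
        # A leading digit means street address first, unit second;
        # otherwise a building name followed by the address.
        if parts[0][0].isdigit():
            return (None, parts[0], parts[1])
        return (parts[0], parts[1], None)
    # One element: a street address if it starts with a digit,
    # otherwise a bare building name.
    if parts[0][0].isdigit():
        return (None, parts[0], None)
    return (parts[0], None, None)

print(split_address(['White House', '1600 Pennsylvania Ave', 'PH']))
# ('White House', '1600 Pennsylvania Ave', 'PH')
print(split_address(['98 New Mexico Ave', '16F']))
# (None, '98 New Mexico Ave', '16F')
print(split_address(['Hogwarts']))
# ('Hogwarts', None, None)
```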

printing multiple sections of text between two markers in python

I converted this page (it's squad lists for different sports teams) from PDF to text using this code:
import PyPDF3
import sys
import tabula
import pandas as pd
#One method
pdfFileObj = open(sys.argv[1],'rb')
pdfReader = PyPDF3.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
print(text)
The output looks like this:
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: BOHEMIANS
1
James Talbot
GK
2
Derek Pender
DF
3
Darragh Leahy
DF
.... some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: CORK CITY
1
Mark McNulty
GK
2
Colm Horgan
DF
3
Alan Bennett
DF
....some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: DERRY CITY
1
Peter Cherrie
GK
2
Conor McDermott
DF
3
Ciaran Coll
DF
I wanted to transform this output to a tab delimited file with three columns: team name, player name, and number. So for the example I gave, the output would look like:
Bohemians James Talbot 1
Bohemians Derek Pender 2
Bohemians Darragh Leahy 3
Cork City Mark McNulty 1
Cork City Colm Horgan 2
Cork City Alan Bennett 3
Derry City Peter Cherrie 1
Derry City Conor McDermott 2
Derry City Ciaran Coll 3
I know I need to first (1) Divide the file into sections based on team, and then (2) within each team section; combine each name + number field into pairs to assign each number to a name.
I wrote this little bit of code to parse the big file into each sports team:
import sys
fileopen = open(sys.argv[1])
recording = False
for line in fileopen:
    if not recording:
        if line.startswith('PREMI'):
            recording = True
    elif line.startswith('2019 SEA'):
        recording = False
    else:
        print(line)
But I'm stuck, because the above code won't divide up the block of text per team (i.e. i need multiple blocks of text extracted to separate strings or lists?). Can someone advise how to divide up the text file I have per team (so in this example, I should be left with three blocks of text...and then somehow I can work on each team-divided block of text to pair numbers and names).
This isn't necessarily true to form, and it doesn't use the other libraries you mentioned, but it was designed to give you a start. You can reformat it however you wish.
>>> string = '''2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: BOHEMIANS
1
James Talbot
GK
2
Derek Pender
DF
3
Darragh Leahy
DF
.... some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: CORK CITY
1
Mark McNulty
GK
2
Colm Horgan
DF
3
Alan Bennett
DF
....some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: DERRY CITY
1
Peter Cherrie
GK
2
Conor McDermott
DF
3
Ciaran Coll
DF'''
>>> def reorder(string):
        import re
        headers = ['Team', 'Name', 'Number']
        print('\n')
        print(headers)
        print()
        paragraphs = re.findall(r'2019[\S\s]+?(?=2019|$)', string)
        for paragraph in paragraphs:
            club = re.findall(r'(?i)CLUB:[\s]*([\S\s]+?)\n', paragraph)
            names_numbers = re.findall(r'(?i)([\d]+)[\n]{1,3}[\s]*([\S ]+)', paragraph)
            for i in range(len(names_numbers)):
                if len(club) == 1:
                    print(club[0] + ' | ' + names_numbers[i][1] + ' | ' + names_numbers[i][0])
>>> reorder(string)
['Team', 'Name', 'Number']
BOHEMIANS | James Talbot | 1
BOHEMIANS | Derek Pender | 2
BOHEMIANS | Darragh Leahy | 3
CORK CITY | Mark McNulty | 1
CORK CITY | Colm Horgan | 2
CORK CITY | Alan Bennett | 3
DERRY CITY | Peter Cherrie | 1
DERRY CITY | Conor McDermott | 2
DERRY CITY | Ciaran Coll | 3
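
To go from the printed rows to the tab-delimited three-column file the question asked for, the same two regex ideas can feed a row list instead of print calls; a sketch (the sample text is abbreviated, and I title-case the club names to match the desired output):

```python
import re

text = """2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: BOHEMIANS
1
James Talbot
GK
2
Derek Pender
DF
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: CORK CITY
1
Mark McNulty
GK
"""

rows = []
for block in re.findall(r'2019[\S\s]+?(?=2019|$)', text):
    club = re.search(r'CLUB:\s*(.+)', block).group(1).strip().title()
    # A digits-only line followed by the player's name on the next line.
    for number, name in re.findall(r'(?m)^(\d+)\n(.+)$', block):
        rows.append((club, name, number))

# Join into tab-delimited text, one player per line.
tsv = '\n'.join('\t'.join(row) for row in rows)
print(tsv)
```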

Splitting multiple pipe delimited values in multiple columns of a comma delimited CSV and mapping them to each other

I have a csv with comma delimiters that has multiple values in a column that are delimited by a pipe and I need to map them to another column with multiple pipe delimited values and then give them their own row along with data in the original row that doesn't have multiple values. My CSV looks like this (with commas between the categories):
row name city amount
1 frank | john | dave toronto | new york | anaheim 10
2 george | joe | fred fresno | kansas city | reno 20
I need it to look like this:
row name city amount
1 frank toronto 10
2 john new york 10
3 dave anaheim 10
4 george fresno 20
5 joe kansas city 20
6 fred reno 20
Maybe not the nicest, but a working solution (it works with non-piped lines and with differing pipe lengths):
df = pd.read_csv('<your_data>.csv')
str_split = ' | '

# Calculate maximum length of piped (' | ') values
df['max_len'] = df[['name', 'city']].apply(lambda x: max(len(x[0].split(str_split)),
                                                         len(x[1].split(str_split))), axis=1)
max_len = df['max_len'].max()

# Split '|' piped cell values into columns (needed at unpivot step)
# Create as many new 'name_<x>' & 'city_<x>' columns as 'max_len'
df[['name_{}'.format(i) for i in range(max_len)]] = df['name'].apply(
    lambda x: pd.Series(x.split(str_split)))
df[['city_{}'.format(i) for i in range(max_len)]] = df['city'].apply(
    lambda x: pd.Series(x.split(str_split)))

# Unpivot 'name_<x>' & 'city_<x>' columns into rows
df_pv_name = pd.melt(df, value_vars=['name_{}'.format(i) for i in range(max_len)],
                     id_vars=['amount'])
df_pv_city = pd.melt(df, value_vars=['city_{}'.format(i) for i in range(max_len)],
                     id_vars=['amount'])

# Rename unpivoted columns (these are the final columns)
df_pv_name = df_pv_name.rename(columns={'value': 'name'})
df_pv_city = df_pv_city.rename(columns={'value': 'city'})

# Remap 'city_<x>' values (rows) to 'name_<x>' to act as the key for the join (merge)
df_pv_city['variable'] = df_pv_city['variable'].map(
    {'city_{}'.format(i): 'name_{}'.format(i) for i in range(max_len)})

# Join unpivoted 'name' & 'city' dataframes
df_res = df_pv_name.merge(df_pv_city, on=['variable', 'amount'])

# Drop the 'variable' column, plus NULL rows left over when the original
# rows had unequal pipe lengths (replace how='all' with how='any' to drop
# any row containing a NULL)
df_res = df_res.drop(['variable'], axis=1).dropna(subset=['name', 'city'], how='all',
                                                  axis=0).reset_index(drop=True)
The result is:
amount name city
0 10 frank toronto
1 20 george fresno
2 10 john new york
3 20 joe kansas city
4 10 dave anaheim
5 20 fred reno
Another test input:
name city amount
0 frank | john | dave | joe | bill toronto | new york | anaheim | los angeles | caracas 10
1 george | joe | fred fresno | kansas city 20
2 danny miami 30
Result of this test (if you don't want NaN rows, replace how='all' with how='any' in the dropna step):
amount name city
0 10 frank toronto
1 20 george fresno
2 30 danny miami
3 10 john new york
4 20 joe kansas city
5 10 dave anaheim
6 20 fred NaN
7 10 joe los angeles
8 10 bill caracas
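
With a recent pandas (1.3 or later, where DataFrame.explode accepts several columns), the same reshape can be done far more directly; this is my own sketch, not part of the answer above, and it assumes each row's name and city lists have equal length:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['frank | john | dave', 'george | joe | fred'],
    'city': ['toronto | new york | anaheim', 'fresno | kansas city | reno'],
    'amount': [10, 20],
})

# Split both piped columns into lists, then explode them in lockstep;
# 'amount' is repeated automatically for each exploded element.
for col in ('name', 'city'):
    df[col] = df[col].str.split(r'\s*\|\s*')
out = df.explode(['name', 'city'], ignore_index=True)
print(out.values.tolist())
```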
Given a row:
['1','frank|joe|dave', 'toronto|new york|anaheim', '20']
you can use
itertools.zip_longest(*[value.split('|') for value in row])
(itertools.izip_longest in Python 2) to obtain the following structure:
[('1', 'frank', 'toronto', '20'),
 (None, 'joe', 'new york', None),
 (None, 'dave', 'anaheim', None)]
Here we want to replace every None value with the last seen value in the corresponding column, which can be done while looping over the result.
So, given a TSV already split by tabs, the following code should do the trick:
import itertools

def flatten_tsv(lines):
    result = []
    for line in lines:
        flat_lines = itertools.zip_longest(*[value.split('|') for value in line])
        for flat_line in flat_lines:
            result.append([result[-1][i] if v is None else v
                           for i, v in enumerate(flat_line)])
    return result
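
A quick usage check of that idea, restating the helper so the snippet runs on its own (note that Python 3 spells the function itertools.zip_longest):

```python
from itertools import zip_longest  # izip_longest in Python 2

def flatten_tsv(lines):
    result = []
    for line in lines:
        # Transpose the pipe-split cells, then forward-fill the None
        # padding from the previously emitted row.
        for flat in zip_longest(*[value.split('|') for value in line]):
            result.append([result[-1][i] if v is None else v
                           for i, v in enumerate(flat)])
    return result

rows = [['1', 'frank|joe|dave', 'toronto|new york|anaheim', '20']]
print(flatten_tsv(rows))
# [['1', 'frank', 'toronto', '20'], ['1', 'joe', 'new york', '20'], ['1', 'dave', 'anaheim', '20']]
```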

How to get all the unique words in the data frame?

I have a dataframe with a list of products and its respective review
+---------+------------------------------------------------+
| product | review |
+---------+------------------------------------------------+
| product_a | It's good for a casual lunch |
+---------+------------------------------------------------+
| product_b | Avery is one of the most knowledgable baristas |
+---------+------------------------------------------------+
| product_c | The tour guide told us the secrets |
+---------+------------------------------------------------+
How can I get all the unique words in the data frame?
I made a function:
from collections import Counter

def count_words(text):
    try:
        text = text.lower()
        words = text.split()
        count_words = Counter(words)
    except AttributeError:
        count_words = {'': 0}
    return count_words
And applied the function to the DataFrame, but that only gives me the words count for each row.
reviews['words_count'] = reviews['review'].apply(count_words)
Starting with this:
dfx
review
0 United Kingdom
1 The United Kingdom
2 Dublin, Ireland
3 Mardan, Pakistan
To get all words in the "review" column:
list(dfx['review'].str.split(' ', expand=True).stack().unique())
['United', 'Kingdom', 'The', 'Dublin,', 'Ireland', 'Mardan,', 'Pakistan']
To get counts of "review" column:
dfx['review'].str.split(' ', expand=True).stack().value_counts()
United 2
Kingdom 2
Mardan, 1
The 1
Ireland 1
Dublin, 1
Pakistan 1
dtype: int64
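
If you want a single vocabulary (or total counts) instead of one Counter per row, the per-row Counters can also be merged with sum, since Counter supports addition; a sketch with made-up sample rows:

```python
from collections import Counter

import pandas as pd

reviews = pd.DataFrame({'review': [
    "It's good for a casual lunch",
    'The tour guide told us the secrets',
]})

# One Counter per row, then merge them all into a single Counter.
per_row = reviews['review'].str.lower().str.split().apply(Counter)
total = sum(per_row, Counter())

print(sorted(total))   # every unique word, once
print(total['the'])    # 2
```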
