I am opening and reading one .csv file at a time from a folder and printing them out as follows:
ownerfiles = os.listdir(filepath)
for ownerfile in ownerfiles:
if ownerfile.endswith(".csv"):
eachfile = (filepath + ownerfile) #loops over each file in ownerfiles
with open (eachfile, 'r', encoding="UTF-8") as input_file:
next(input_file)
print(eachfile)
for idx, line in enumerate(input_file.readlines()) :
line = line.strip().split(",")
print(line)
However, when I do print(line) the files are printing as follows:
/Users/Sulz/Desktop/MSBA/Applied Data Analytics/Test_File/ownerfile_138.csv
['']
['2010-01-01 11:28:35', '16', '54', '59', '0000000040400', 'O.Coffee Hot Small', 'I', ' ', ' ', '14', '1', '0', '0.3241', '1.4900', '1.4900', '1.4900', '0.0000', '1', '0', '0', '0', '0.0000', '0.0000', '1', '44', '0', '0.00000000', '1', '0', '0', '0.0000', '0', '0', '', '0', '5', '0', '0', '0', '0', 'NULL', '0', 'NULL', '', '0', '20436', '1', '0', '0', '1']
How can I get rid of [''] before the list of all the data ??
EDIT:
I now tried reading it with the .csv module like this:
ownerfiles = os.listdir(filepath)
for ownerfile in ownerfiles:
if ownerfile.endswith(".csv"):
eachfile = (filepath + ownerfile) #loops over each file in ownerfiles
with open (eachfile, 'r', encoding="UTF-8") as input_file:
next(input_file)
reader = csv.reader(input_file, delimiter=',', quotechar='|')
for row in reader :
print(row)
However, it still prints output like this:
[] ['2010-01-01 11:28:35', '16', '54', '59', '0000000040400', 'O.Coffee Hot Small', 'I', ' ', ' ', '14', '1', '0', '0.3241', '1.4900', '1.4900', '1.4900', '0.0000', '1', '0', '0', '0', '0.0000', '0.0000', '1', '44', '0', '0.00000000', '1', '0', '0', '0.0000', '0', '0', '', '0', '5', '0', '0', '0', '0', 'NULL', '0', 'NULL', '', '0', '20436', '1', '0', '0', '1']
That's just Python's list syntax being printed. You are splitting each line on a comma which is generating a list. If you print the line before the split you'll probably get what you're looking for:
line = line.strip()
print(line)
line = line.split(",")
By the way, Python has a built in CSV module for reading and writing csv files, in case you didn't know.
EDIT: Sorry, I misread your question. Add this to the start of your readlines loop:
line = line.strip()
if not line:
continue
Related
I am trying to use beautiful soup to pull the table corresponding to the HTML code below
<table class="sortable stats_table now_sortable" id="team_pitching" data-cols-to-freeze=",2">
<caption>Team Pitching</caption>
from https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2. Here is a screenshot of the site layout and HTML code I am trying to extract from.
I was using the code
url = 'https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2'
res = requests.get(url)
soup1 = BS(res.content, 'html.parser')
table1 = soup1.find('table',{'id':'team_pitching'})
table1
I can't seem to figure out how to get this working. The table above can be extracted with the line
table1 = soup1.find('table',{'id':'team_batting'})
and I figured similar code should work for the one below. Additionally, is there a way to extract this using the table class "sortable stats_table now_sortable" rather than id?
The problem is that if you open the page normally it shows all the tables, however if you load the page with Developer Tools just the first table is shown. So, when you do your request the left tables are not included into the HTML you're getting. The table you're looking for is not shown until "Show team pitchin" button is pressed, to do this you could use Selenium and get the full HTML response.
That is because the table you are looking for - i.e. <table> with id="team_pitching" is present as a comment inside the soup. You can check it for yourself by printing soup.
You need to
Extract that comment from the soup
Convert it into a soup object
Extract the table data from the soup object.
Here is the complete code that does the above mentioned steps.
from bs4 import BeautifulSoup, Comment
import requests
url = 'https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
main_div = soup.find('div', {'id': 'all_team_pitching'})
# Extracting the comment from the above selected <div>
for comments in main_div.find_all(text=lambda x: isinstance(x, Comment)):
temp = comments.extract()
# Converting the above extracted comment to a soup object
s = BeautifulSoup(temp, 'lxml')
trs = s.find('table', {'id': 'team_pitching'}).find_all('tr')
# Printing the first five entries of the table
for tr in trs[1:5]:
print(list(tr.stripped_strings))
The first 5 entries from the table
['1', 'Tyler Ahearn', '21', '1', '0', '1.000', '1.93', '6', '0', '0', '1', '9.1', '8', '5', '2', '0', '4', '14', '0', '0', '0', '42', '1.286', '7.7', '0.0', '3.9', '13.5', '3.50']
['2', 'Jack Anderson', '20', '2', '0', '1.000', '0.79', '4', '1', '0', '0', '11.1', '6', '4', '1', '0', '3', '11', '1', '0', '0', '45', '0.794', '4.8', '0.0', '2.4', '8.7', '3.67']
['3', 'Shane Drohan', '*', '21', '0', '1', '.000', '4.08', '4', '4', '0', '0', '17.2', '15', '12', '8', '0', '11', '27', '1', '0', '2', '82', '1.472', '7.6', '0.0', '5.6', '13.8', '2.45']
['4', 'Conor Grady', '21', '2', '0', '1.000', '3.00', '4', '4', '0', '0', '15.0', '10', '5', '5', '3', '8', '15', '1', '0', '2', '68', '1.200', '6.0', '1.8', '4.8', '9.0', '1.88']
Given the following csv file:
['offre_bfr.entreprise', 'offre_bfr.nombreemp', 'offre_bfr.ca2020', 'offre_bfr.ca2019', 'offre_bfr.ca2018', 'offre_bfr.benefice2020', 'offre_bfr.benefice2019', 'offre_bfr.benefice2018', 'offre_bfr.tauxrenta2020', 'offre_bfr.tauxrenta2019', 'offre_bfr.tauxrenta2018', 'offre_bfr.tauximposition', 'offre_bfr.chargesalariale', 'offre_bfr.chargesfixes', 'offre_bfr.agedirigeant', 'offre_bfr.partdirigeant', 'offre_bfr.agemoyact', 'offre_bfr.parttotaleact', 'offre_bfr.mtdmdcred', 'offre_bfr.creditusuel', 'offre_bfr.capipropres', 'offre_bfr.dettefin', 'offre_bfr.dettenonfin', 'offre_bfr.stock', 'offre_bfr.creances', 'offre_bfr.actifimmobilise', 'offre_bfr.passiftotal', 'offre_bfr.tresorerie', 'offre_bfr.capitalisation2020', 'offre_bfr.capitalisation2019', 'offre_bfr.capitalisation2018', 'offre_bfr.nivrisque', 'offre_bfr.indconfiance', 'offre_bfr.indperseverance', 'offre_bfr.score']
['1', '15', '1.84', '5.18', '7.96', '0.48', '1.19', '0.11', '26.086956', '22.972973', '1.3819095', '17.9', '0.035295', '1.2', '55', '33', '69', '67', '10', '14.98', '0.05', '0.04', '0.21', '0.1', '0.08', '0.41', '0.8', '0.0', '7.5', '52.8', '0.16', 'Bas', '4', '4', '5.0']
['3', '3030', '546.7', '589.7', '430.9', '62.58', '20.63', '99.06', '11.446863', '3.498389', '22.989092', '17.4', '7.12959', '270.9', '46', '37', '69', '73', '2973', '1567.3', '46.97', '13.39', '61.92', '3.0', '8.0', '145.0', '278.4', '-51.0', '1063.5', '3047.8', '538.08', 'Eleve', '4', '4', '3.0']
['4', '42', '4.28', '9.13', '8.99', '0.45', '0.59', '0.08', '10.514019', '6.4622126', '0.8898776', '31.5', '0.098826', '2.2', '70', '32', '53', '68', '9', '22.4', '0.13', '0.06', '0.31', '0.1', '0.07', '0.92', '1.7', '-0.3', '42.5', '69.5', '2.73', 'Eleve', '4', '4', '3.0']
['5', '497', '92.2', '62.5', '40.3', '20.14', '6.91', '4.92', '21.843819', '11.056', '12.208437', '32.2', '1.169441', '5.1', '64', '32', '70', '68', '197', '195.0', '6.07', '1.83', '12.49', '5.9', '3.83', '16.41', '16.5', '-2.7', '1048.3', '618.8', '11.24', 'Moyen', '4', '4', '4.0']
['8', '122', '67.8', '24.5', '91.4', '12.67', '5.69', '8.43', '18.687315', '23.22449', '9.223195', '24.8', '0.287066', '19.5', '53', '35', '61', '65', '424', '183.7', '1.64', '1.92', '6.48', '4.9', '2.45', '23.6', '23.7', '-3.5', '204.2', '109.5', '5.33', 'Eleve', '4', '4', '3.0']
['11', '310', '77.5', '78.7', '24.9', '8.05', '21.76', '1.79', '10.387096', '27.649302', '7.188755', '29.0', '0.72943', '12.0', '47', '32', '65', '68', '38', '181.1', '6.55', '3.27', '8.16', '5.1', '2.08', '15.09', '36.3', '-7.0', '669.8', '705.3', '22.95', 'Eleve', '4', '4', '3.0']
['14', '283', '91.9', '52.9', '51.9', '10.48', '7.01', '12.57', '11.4037', '13.251418', '24.219654', '24.2', '0.665899', '2.3', '61', '29', '58', '71', '60', '196.7', '8.02', '2.93', '7.79', '7.0', '3.87', '25.1', '42.7', '-4.4', '434.0', '143.4', '17.18', 'Eleve', '4', '4', '3.0']
['16', '41', '5.54', '6.48', '5.5', '1.55', '1.51', '0.73', '27.97834', '23.30247', '13.272727', '15.9', '0.096473', '2.4', '71', '39', '56', '61', '29', '17.52', '0.41', '0.11', '0.62', '0.3', '0.17', '1.47', '2.4', '0.0', '36.7', '76.0', '4.2', 'Bas', '4', '4', '5.0']
I would like to create a bar chart from columns 0 and 34 of the csv file.
Here is the python script I am running:
# -*-coding:Latin-1 -*
#!/usr/bin/python
#!/usr/bin/env python
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
import csv
x = []
y = []
Bfr = csv.reader(open('/home/cloudera/PMGE/Bfr.csv'))
linesBfr = list(Bfr)
i=1
for l in linesBfr:
x.append(l[i][0])
y.append(int(l[i][34]))
plt.bar(x, y, color = 'g', width = 0.72, label = "Score")
plt.xlabel('Entreprise')
plt.ylabel('Scores')
plt.title('Scores des entreprises en BFR')
plt.legend()
plt.show()
But i'm getting the following error:
Traceback (most recent call last):
File "barplot.py", line 20, in <module>
y.append(int(l[i][34]))
IndexError: string index out of range
Can someone help me out?
Python lists are zero-indexed. You are trying to iterate to the 35th element in a 34 element list.
Firstly, there are 35 elements from 0 to 34. This means that starting your indexing i at i=1 will look for an element at the 35th index, which does not exist, or be an "index out of range". To be more specific, your code is looking for a list that does not exist. Secondly, this is not the standard way to use 2d lists in python. I suggest using a method more as such:
https://www.kite.com/python/answers/how-to-append-to-a-2d-list-in-python#:~:text=Append%20a%20list%20to%20a,list%20to%20the%202D%20list.
Hope this was helpful.
You probably meant to write this:
x = []
y = []
Bfr = csv.reader(open('/home/cloudera/PMGE/Bfr.csv'))
next(Bfr , None) # skip the header
for l in Bfr:
x.append(int(l[0]))
y.append(int(l[34]))
...
(See this question about skipping the header of a csv)
I am new to python and I am currently learning. I got text file with variable number of spaces in between words per line:
I am trying to read it as follows:
import re
...: results = []
...: with open ("../../103.Immune_gene_families/Immune_genes/Human/human_immunegene.hits") as file:
...: for line in file:
...: if not line.startswith("#"):
...: line = re.sub("\s\s+" , " ", line)
...: #print(line)
...: ens_id = line.split(" ")[1]
...: print(ens_id)
...:
But I got the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-3-469f5598d359> in <module>
6 line = re.sub("\s\s+" , " ", line)
7 #print(line)
----> 8 ens_id = line.split(" ")[1]
9 print(ens_id)
10
IndexError: list index out of range
Example lines I get with
print(line)
['ENSG00000128016', '115', '138', '107', '147', 'TF106503', '9', '32', '5.9', '8.3', '0', 'No_clan', '']
['ENSG00000128016', '135', '169', '130', '172', 'TF317698', '454', '488', '18.0', '0.00073', '0', 'No_clan', '']
['ENSG00000128016', '137', '175', '134', '196', 'TF318914', '95', '132', '21.9', '8e-05', '0', 'No_clan', '']
['ENSG00000128016', '137', '167', '130', '173', 'TF326635', '1096', '1127', '5.7', '3.3', '0', 'No_clan', '']
['ENSG00000128016', '138', '170', '133', '173', 'TF329017', '881', '912', '5.3', '4.3', '0', 'No_clan', '']
['ENSG00000128016', '139', '166', '129', '173', 'TF105541', '764', '791', '9.3', '0.38', '0', 'No_clan', '']
['ENSG00000128016', '139', '166', '132', '172', 'TF105970', '278', '305', '8.4', '0.6', '0', 'No_clan', '']
['ENSG00000128016', '140', '170', '131', '174', 'TF314946', '110', '140', '4.5', '6.3', '0', 'No_clan', '']
['ENSG00000128016', '142', '167', '134', '184', 'TF329287', '9', '33', '6.8', '2.3', '0', 'No_clan', '']
If you could help me on this regard, much appreciated.
Thank you,
AK
Welcome to SO!
If you run
string = 'abc'
print(string.split(' '))
you will see that the result is
['abc']
If you tried to string.split(' ')[1], you would generate an IndexError.
So what is happening is that, somewhere, you likely don't have the character that you are splitting on.
You get the index error because there is less than 2 elements in line.split(" "),
also meaning there was less than 2 spaces in line. Try line.split(" ")[0] instead:
import re
results = []
with open ("../../103.Immune_gene_families/Immune_genes/Human/human_immunegene.hits") as file:
for line in file:
if not line.startswith("#"):
line = re.sub("\s\s+" , " ", line)
#print(line)
ens_id = line.split(" ")[0]
print(ens_id)
I'm trying to scrape the following website:
http://mlb.mlb.com/stats/sortable_batter_vs_pitcher.jsp#season=2018&batting_team=119&batter=571771&pitching_team=133&pitcher=641941
(this is an example URL with a certain pitcher/batter matchup)
I'm able to enter the player codes and team codes easily with this function:
def matchupURL(season, batter, batterTeam, pitcher, pitcherTeam):
return "http://mlb.mlb.com/stats/sortable_batter_vs_pitcher.jsp#season=" + str(season)+ "&batting_team="+str(teamNumDict[batterTeam])+"&batter="+str(batter)+"&pitching_team="+str(teamNumDict[pitcherTeam])+"&pitcher="+str(pitcher);
which works nicely, and the returned string works when pasted into my browser.
But when i make a request a la
newURL = matchupURL(2018,i.id,x.home_team,j.id,x.away_team)
print(i+ " vs " + j)
newSes = requests.get(newURL);
html = BeautifulSoup(newSes.text, "lxml")
mydivs = html.findAll("td",{"class":"dg-ops"})
#do something with this div
I'm unable to find the div. Infact, the entire format of the HTML returned changes. Further, adding headers didnt help, nor did using urllib instead of requests.
This page is a dynamic, i.e., the content is dynamically generated by javascript and showed in the front. That is the reason you can't detect the div tag.
But in this case you can scrape easier. With inspect tool from your browser you can detect that the data comes from a GET request to an URL. For your example, you only have to provide the players id :
import requests
url = 'http://lookup-service-prod.mlb.com/json/named.stats_batter_vs_pitcher_composed.bam'
params = {"sport_code":"'mlb'","game_type":"'R'","player_id":"571771","pitcher_id":"641941"}
resp = requests.get(url, params=params).json()
print(resp)
That prints:
{'stats_batter_vs_pitcher_composed': {'stats_batter_vs_pitcher_total': {'queryResults': {'created': '2018-04-12T22:21:47', 'totalSize': '1', 'row': {'hr': '1', 'gidp': '0', 'pitcher_first_last_html': 'Emilio Pagán', 'player': 'Hernandez, Enrique', 'np': '4', 'sac': '0', 'pitcher': 'Pagan, Emilio', 'rbi': '1', 'player_first_last_html': 'Enrique Hernández', 'tb': '4', 'bats': 'R', 'xbh': '1', 'bb': '0', 'slg': '4.000', 'avg': '1.000', 'pitcher_id': '641941', 'ops': '5.000', 'hbp': '0', 'pitcher_html': 'Pagán, Emilio', 'g': '', 'd': '0', 'so': '0', 'throws': 'R', 'sf': '0', 'tpa': '1', 'h': '1', 'cs': '0', 'obp': '1.000', 't': '0', 'ao': '0', 'r': '1', 'go_ao': '-.--', 'sb': '0', 'player_html': 'Hernández, Enrique', 'sbpct': '.---', 'player_id': '571771', 'ibb': '0', 'ab': '1', 'go': '0'}}}, 'copyRight': ' Copyright 2018 MLB Advanced Media, L.P. Use of any content on this page acknowledges agreement to the terms posted here http://gdx.mlb.com/components/copyright.txt ', 'stats_batter_vs_pitcher': {'queryResults': {'created': '2018-04-12T22:21:47', 'totalSize': '1', 'row': {'hr': '1', 'gidp': '0', 'pitcher_first_last_html': 'Emilio Pagán', 'player': 'Hernandez, Enrique', 'np': '4', 'sac': '0', 'pitcher': 'Pagan, Emilio', 'rbi': '1', 'opponent': 'Oakland Athletics', 'player_first_last_html': 'Enrique Hernández', 'tb': '4', 'xbh': '1', 'bats': 'R', 'bb': '0', 'avg': '1.000', 'slg': '4.000', 'pitcher_id': '641941', 'ops': '5.000', 'hbp': '0', 'pitcher_html': 'Pagán, Emilio', 'g': '', 'd': '0', 'so': '0', 'throws': 'R', 'sport': 'MLB', 'sf': '0', 'team': 'Los Angeles Dodgers', 'tpa': '1', 'league': 'NL', 'h': '1', 'cs': '0', 'obp': '1.000', 't': '0', 'ao': '0', 'season': '2018', 'r': '1', 'go_ao': '-.--', 'sb': '0', 'opponent_league': 'AL', 'player_html': 'Hernández, Enrique', 'sbpct': '.---', 'player_id': '571771', 'ibb': '0', 'ab': '1', 'opponent_id': '133', 'team_id': '119', 'go': '0', 'opponent_sport': 'MLB'}}}}}
I have a csv files with player attributes:
['Peter Regin', '2', 'DAN', 'N', '1987', '6', '6', '199', '74', '2', '608000', '', '77', '52', '74', '72', '58', '72', '71', '72', '70', '72', '74', '68', '74', '41', '40', '51']
['Andrej Sekera', '8', 'SVK', 'N', '1987', '6', '6', '198', '72', '3', '1323000', '', '65', '39', '89', '78', '75', '70', '72', '56', '53', '56', '57', '72', '57', '59', '70', '51']
For example, I want to check if a player is a CENTER ('2' in position 1 in my list) and after I want to modify the 12 element (which is '77' for Peter Regin)
How can I do that using the CSV module ?
import csv
class ManipulationFichier:
def __init__(self, fichier):
self.fichier = fichier
def read(self):
with open(self.fichier) as f:
reader = csv.reader(f)
for row in reader:
print(row)
def write(self):
with open(self.fichier) as f:
writer = csv.writer(f)
for row in f:
if row[1] == 2:
writer.writerows(row[1] for row in f)
Which do nothing important..
Thanks,
In general, CSV files cannot be reliably modified in-place.
Read the entire file into memory (usually a list of lists, as in your example), modify the data, then write the entire file back.
Unless your file is really huge, and you do this really often, the performance hit will be negligible.