I am currently using the below code to web scrape data and then store it in a CSV file.
from bs4 import BeautifulSoup
import requests
url='https://www.business-standard.com/rss/companies-101.rss'
soup = BeautifulSoup(requests.get(url).content, 'xml')
news_items = []
for item in soup.findAll('item'):
    news_item = {}
    news_item['title'] = item.title.text
    news_item['excerpt'] = item.description.text
    print(item.link.text)
    s = BeautifulSoup(requests.get(item.link.text).content, 'html.parser')
    news_item['text'] = s.select_one('.p-content').get_text(strip=True, separator=' ')
    news_item['link'] = item.link.text
    news_item['pubDate'] = item.pubDate.text
    news_item['Category'] = 'Company'
    news_items.append(news_item)
import pandas as pd
df = pd.DataFrame(news_items)
df.to_csv('company_data.csv',index = False)
When displaying the data frame, the results look fine.
But when I open the CSV file, the columns are not as expected. Can anyone tell me the reason?
The issue is that your data contains commas and the default separator for to_csv is ",". So each comma in your data set is treated as a column boundary.
If you use df.to_excel('company_data.xlsx', index=False) you won't have this issue, since that format is not comma separated.
A csv file is not an Excel file despite what Microsoft pretends but is a text file containing records and fields. The records are separated with sequences of \r\n and the fields are separated with a delimiter, normally the comma. Fields can contain new lines or delimiters provided they are enclosed in quotation marks.
But Excel is known to have a very poor csv handling module. More exactly it can read what it has written, or what is formatted the way it would have written it. To be locale friendly, MS folks decided that they will use the locale decimal separator (which is the comma , in some West European locales) and will use another separator (the semicolon ; when the comma is the decimal separator). As a result, using Excel to read CSV files produced by other tools is a nightmare with possible workarounds like changing the separator when writing the CSV file, but no clean way. LibreOffice is far behind MS Office for most features except CSV handling. So my advice is to avoid using Excel for CSV files but use LibreOffice Calc instead.
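To make the quoting rule and the separator workaround concrete, here is a minimal sketch. The one-row data frame is a made-up stand-in for the scraped articles, not the real feed. It shows that to_csv, with its default settings, quotes a field that contains a comma or a new line, and that writing with sep=";" is one possible workaround for semicolon locales.

```python
import csv
import io

import pandas as pd

# Hypothetical row standing in for one scraped article.
frame = pd.DataFrame({
    "title": ["Profit up, shares flat"],
    "text": ["Line one\nline two, with a comma"],
})

# Default settings: comma separator, fields quoted only when necessary.
default_csv = frame.to_csv(index=False)

# Workaround for locales where Excel expects a semicolon separator.
semicolon_csv = frame.to_csv(index=False, sep=";")

# A standards-aware reader recovers the original fields, commas and all.
parsed = list(csv.reader(io.StringIO(default_csv)))
round_trip = pd.read_csv(io.StringIO(semicolon_csv), sep=";")

print(default_csv)
print(parsed[1])
```

If the file still opens badly in a spreadsheet program, that points at how the viewer imports CSV rather than at the file itself.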
I'm trying to extract a table from a PDF that contains a long list of media source names. The desired output is a single CSV file with one column listing all the sources.
I'm writing a simple Python script to extract the table data from the PDF. So far I get one CSV per table, and I then use the concat function to merge all the files.
The result is messy: the file has redundant punctuation and a lot of spaces.
Can somebody help me get a better result?
Code:
from camelot import read_pdf
import glob
import os
import pandas as pd
import numpy as np
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# Get all the tables within the file
all_tables = read_pdf("/Users/zago/code/pdftext/pdftextvenv/mimesiweb.pdf", pages = 'all')
# Show the total number of tables in the file
print("Total number of table: {}".format(all_tables.n))
# print all the tables in the file
for t in range(all_tables.n):
    print("Table n°{}".format(t))
    print((all_tables[t].df).head())
#convert to excel or csv
#all_tables.export('table.xlsx', f="excel")
all_tables.export('table.csv', f="csv")
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f,encoding = 'utf-8', sep=',') for f in all_filenames ])
#export to csv
combined_csv.to_csv("combined_csv_tables.csv", index=False, encoding="utf-8")
Starting point PDF
Result for 1 csv
Combined csv
Thanks
Select only the first column before concatenating and then save.
Just use this line of code:
combined_csv = pd.concat([pd.read_csv(f,encoding = 'utf-8', sep=',').iloc[:,0] for f in all_filenames ])
Output:
In [25]: combined_csv
Out[25]:
0 Interni.it
1 Intima.fr
2 Intimo Retail
3 Intimoda Magazine - En
4 Intorno Tirano.it
...
47 Alessandria Oggi
48 Aleteia.org
49 Alibi Online
50 Alimentando
51 All About Italy.net
Length: 2824, dtype: object
And final csv output:
There are oddities to beware of when using CSV format.
PDF pages are generally stored as a single column of text running from page edge to page edge, unless they are tagged as multi-column page areas. (This is one reason data extractors are needed to split the text into tables.)
For a single column of text like this one, a standard CSV file is no different from the plain text, except for the ONE entry that contains a comma (no other entry needs a comma in the CSV output). So if the PDF above is exported and then imported into Excel, it will look like a single column.
Thus the only steps needed are to export the text with pdftotext,
add a " to each end of every line, and rename the result to csv.
HOWEVER, see the comments after the commands.
pdftotext -layout -nopgbrk -enc UTF-8 mimesiweb.pdf
for /f "tokens=*" %t in (mimesiweb.txt) do echo "%t" >>mimesiweb.csv
Run from the command line, this correctly generates the desired output for opening in Excel.
The UTF-8 characters are written correctly to the CSV, but my old Excel always loses them on import (e.g. Accènto becomes AccÃ¨nto), even if I use CHCP 65001 (Unicode) to invoke it.
Exporting from Excel as UTF8.csv is no issue (the file shows the accents in Notepad), but when the same UTF8.csv is imported again the accented characters are lost! Newer versions of Excel may fare better.
So that is a failing in Excel. Without a better Excel import, the workaround for me would be to simply cut and paste the 2880 lines of text so that the accents are preserved. The alternative is to import the text into LibreOffice, which supports UTF-8.
However, remember to either UNCHECK comma as a separator OR wrap "each, line" in quotes as I did earlier, because of the existing comma; otherwise a second column is generated. :-)
I've found pdfplumber simpler to use for such tasks.
import pdfplumber
pdf = pdfplumber.open("Downloads/mimesiweb.pdf")
rows = []
for page in pdf.pages:
    rows.extend(page.extract_text().splitlines())
>>> len(rows)
2881
>>> rows[:3]
['WEB', 'Ultimo aggiornamento: 03 06 2020', '01net.it']
>>> rows[-3:]
['Zoneombratv.it', 'Zoom 24', 'ZOOMsud.it']
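Once the lines are in a list like rows, writing the one-column CSV can be done with the standard csv module. The following is only a sketch: the sample list, the header name "source", and the output path are hypothetical. Quoting every field keeps an entry that contains a comma in one column, and the utf-8-sig encoding writes a byte order mark, which is an assumption about what helps Excel detect UTF-8 rather than something tested against the PDF in the question.

```python
import csv
import os
import tempfile


def write_single_column_csv(rows, path, header="source"):
    """Write each string in rows as one quoted CSV field per line."""
    # utf-8-sig prepends a BOM, which many Excel versions use to detect UTF-8.
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
        writer.writerow([header])
        for row in rows:
            writer.writerow([row.strip()])


# Hypothetical sample standing in for the lines extracted from the PDF.
sample_rows = ["01net.it", "Accènto", "Example, with comma"]
out_path = os.path.join(tempfile.mkdtemp(), "sources.csv")
write_single_column_csv(sample_rows, out_path)

# Read the file back to confirm every entry stayed in a single column.
with open(out_path, newline="", encoding="utf-8-sig") as handle:
    read_back = [record[0] for record in csv.reader(handle)]
print(read_back)
```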
I think this is simple, but I am not finding an answer that works. Importing the data seems to work, but separating the numbers doesn't. The code is below. Thanks for the help.
import urllib.request
opener = urllib.request.FancyURLopener({})
url = "http://jse.amstat.org/v22n1/kopcso/BeefDemand.txt"
f = opener.open(url)
content = f.read()
# below are the 3 different ways I tried to separate the data
content.encode('string-escape').split("\\x")
content.split('\r')
content.split('\\')
I highly recommend Pandas for reading and analysing this kind of file. It supports reading directly from a url and also gives meaningful analysis ability.
import pandas
url = "http://jse.amstat.org/v22n1/kopcso/BeefDemand.txt"
df = pandas.read_table(url, sep="\t+", engine='python', index_col="Year")
Note that the file uses runs of repeated tabs as separators, which sep="\t+" handles. Because that separator is a regular expression, you also have to use the python engine.
Now that the file is read into a dataframe, we can do easy plotting for instance:
df[['ChickPrice', 'BeefPrice']].plot()
Simply use a csv.reader or csv.DictReader to parse the contents. Make sure to set the delimiter to tabs, in this case:
import requests
import csv
import re
url = "http://jse.amstat.org/v22n1/kopcso/BeefDemand.txt"
response = requests.get(url)
response.raise_for_status()
text = re.sub("\t{1,}", "\t", response.text)
reader = csv.DictReader(text.splitlines(), delimiter="\t")
for row in reader:
    print(row)
I like csv.DictReader better in this case, because it consumes the header line for you and each "row" is a dictionary. Your specific text file sometimes separates fields with repeated tabs to make it look prettier, so you'll have to take that into account in some way. In my snippet, I used a regular expression to replace every cluster of tabs with a single tab.
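The tab-collapsing step can be tried without any network access. This sketch uses a small made-up sample: the column names come from the file discussed above, but the values are invented. Note that collapsing tab clusters also hides genuinely empty fields, so it only suits files where repeated tabs are purely cosmetic.

```python
import csv
import re

# Made-up sample that mimics the file's layout: some fields are separated
# by a single tab, others by a cluster of tabs.
sample_text = (
    "Year\t\tChickPrice\tBeefPrice\n"
    "1977\t42.2\t\t\t138.3\n"
    "1978\t44.5\t146.0\n"
)


def parse_tab_clusters(text):
    """Collapse each run of tabs into one tab, then parse the rows as dicts."""
    normalized = re.sub(r"\t+", "\t", text)
    return list(csv.DictReader(normalized.splitlines(), delimiter="\t"))


records = parse_tab_clusters(sample_text)
for record in records:
    print(record["Year"], record["ChickPrice"], record["BeefPrice"])
```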
import pandas as pd
check = pd.read_csv('1.csv')
nocheck = check['CUSIP'].str[:-1]
nocheck = nocheck.to_frame()
nocheck['CUSIP'] = nocheck['CUSIP'].astype(str)
nocheck.to_csv('NoCheck.csv')
This works but while writing the csv, a value for an identifier like 0003418 (type = str) converts to 3418 (type = general) when the csv file is opened in Excel. How do I avoid this?
I couldn't find a dupe for this question, so I'll post my comment as a solution.
This is an Excel issue, not a python error. Excel autoformats numeric columns to remove leading 0's. You can "fix" this by forcing pandas to quote when writing:
import csv
# insert pandas code from question here
# use csv.QUOTE_ALL when writing CSV.
nocheck.to_csv('NoCheck.csv', quoting=csv.QUOTE_ALL)
Note that this will actually put quotes around each value in your CSV. It will render the way you want in Excel, but you may run into issues if you try to read the file some other way.
Another solution is to write the CSV without quoting and, when importing it into Excel, set the format of that column to "Text" instead of "General".
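The warning about reading the file some other way also applies to pandas itself. As a small illustration (the two identifiers are made up, and the CSV is kept in memory instead of being written to NoCheck.csv), reading the quoted file back with default settings still turns the column into integers, while asking for dtype=str keeps the leading zeros.

```python
import csv
import io

import pandas as pd

# Hypothetical identifiers with leading zeros.
nocheck = pd.DataFrame({"CUSIP": ["0003418", "0012345"]})

buffer = io.StringIO()
nocheck.to_csv(buffer, index=False, quoting=csv.QUOTE_ALL)
csv_text = buffer.getvalue()

# Default type inference: the quotes do not stop pandas parsing numbers.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str preserves the identifiers exactly.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"CUSIP": str})

print(inferred["CUSIP"].tolist())
print(as_text["CUSIP"].tolist())
```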
I'm a beginner in Python, and I'm trying to extract data from the web and display it in a table :
# import libraries
import urllib2
from bs4 import BeautifulSoup
import csv
from datetime import datetime
quote_page = 'http://www.bloomberg.com/quote/SPX:IND'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('h1', attrs={'class': 'name'})
name = name_box.text.strip()
print name
price_box = soup.find('div', attrs={'class':'price'})
price = price_box.text
print price
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([name, price, datetime.now()])
This is very basic code that extracts data from Bloomberg and writes it to a CSV file.
It should display the name in one column, the price in another, and the date in a third.
But it actually copies all of this data into the first row of index.csv.
Am I missing something in my code?
Thank you for your help !
Wikipedia:
In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
The problem isn't related to your Python code! Your script actually writes the plain text file with the fields separated by commas. It is your csv file viewer which doesn't take commas as separators. You should check in the preferences of your csv file viewer.
It looks like your CSV isn't being interpreted correctly when you import it into Excel. When I imported it into Excel, I noticed that the comma in "2,337.58" was messing up the CSV data, putting ,337.58" into a column of its own. When you import the data into Excel, you should get a popup asking how the data is represented. Pick the delimited option, select comma as the delimiter, and click Finish.
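To check what the script really writes, the row can be sent to an in-memory buffer instead of index.csv. In this sketch the index name and the timestamp are placeholders; only the price "2,337.58" comes from the discussion above. csv.writer quotes the price because it contains a comma, and csv.reader splits the line back into three fields, which suggests the file is fine and the viewer's import settings are what need attention.

```python
import csv
import io

# Placeholder name and timestamp; the price is the value discussed above.
row = ["S&P 500 Index", "2,337.58", "2017-02-17 10:00:00"]

buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()

# Parse the written line again, the way a CSV-aware viewer should.
fields = next(csv.reader(io.StringIO(line)))

print(repr(line))
print(fields)
```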
So I am quite the beginner in Python, but what I'm trying to do is to download each CSV file for the NYSE. In an excel file I have every symbol. The Yahoo API allows you to download the CSV file by adding the symbol to the base url.
My first instinct was to use pandas, but pandas doesn't store strings.
So what I have
import urllib
strs = ["" for x in range(3297)]
#strs makes the blank string array for each symbol
#somehow I need to be able to iterate each symbol into the blank array spots
while y < 3297:
    strs[y] = "symbol of each company from csv"
    y = y+1
#loop for downloading each file from the link with strs[y].
while i < 3297:
    N = urllib.URLopener()
    N.retrieve('http://ichart.yahoo.com/table.csv?s='strs[y],'File')
    i = i+1
Perhaps the solution is simpler than what I am doing.
From what I can see in this question, the difficulty is connecting your list of stock symbols to the way you read the data in pandas. For example, 'http://ichart.yahoo.com/table.csv?s='strs[y] is not valid syntax.
Valid syntax for this is
pd.read_csv('http://ichart.yahoo.com/table.csv?s={}'.format(strs[y]))
It would be helpful if you could add a few sample lines from your csv file to the question. Guessing at your structure you would do something like:
import pandas as pd
symbol_df = pd.read_csv('path_to_csv_file')
for stock_symbol in symbol_df.symbol_column_name:
    df = pd.read_csv('http://ichart.yahoo.com/table.csv?s={}'.format(stock_symbol))
    # process your dataframe here
Assuming you take that Excel file w/ the symbols and output as a CSV, you can use Python's built-in CSV reader to parse it:
import csv
import urllib  # Python 2, as in the question; provides urllib.urlopen
base_url = 'http://ichart.yahoo.com/table.csv?s={}'
reader = csv.reader(open('symbols.csv'))
for row in reader:
    symbol = row[0]
    data_csv = urllib.urlopen(base_url.format(symbol)).read()
    # save out to file, or parse with CSV library...
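For Python 3, the same idea might look like the sketch below. It assumes the symbols sit in the first column of the CSV; the sample symbols and the helper names (read_symbols, build_urls, download_all) are hypothetical. The download helper is defined but not called, because it needs network access and this sketch does not check whether the endpoint from the question still responds.

```python
import csv
import io
from urllib.parse import quote
from urllib.request import urlretrieve

# Base URL taken from the question.
BASE_URL = "http://ichart.yahoo.com/table.csv?s={}"


def read_symbols(handle):
    """Return the first column of every non-empty row as a stripped symbol."""
    return [row[0].strip() for row in csv.reader(handle) if row and row[0].strip()]


def build_urls(symbols, base_url=BASE_URL):
    """Map each symbol to its download URL, escaping unsafe characters."""
    return {symbol: base_url.format(quote(symbol)) for symbol in symbols}


def download_all(urls, suffix=".csv"):
    """Save each URL as '<symbol>.csv'. Not called here: it needs network access."""
    for symbol, url in urls.items():
        urlretrieve(url, symbol + suffix)


# Hypothetical symbol list standing in for the exported Excel sheet.
sample = io.StringIO("IBM\nGE\n\nBRK.B\n")
urls = build_urls(read_symbols(sample))
print(urls)
```

With a real file you would pass open('symbols.csv', newline='') to read_symbols and then call download_all(urls).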