I think this is simple but I am not finding an answer that works. The data importing seems to work but separating the "/" numbers doesnt code is below. thanks for the help.
import urllib.request
opener = urllib.request.FancyURLopener({})
url = "http://jse.amstat.org/v22n1/kopcso/BeefDemand.txt"
f = opener.open(url)
content = f.read()
# below are the 3 different ways I tried to separate the data
content.encode('string-escape').split("\\x")
content.split('\r')
content.split('\\')
I highly recommend Pandas for reading and analysing this kind of file. It supports reading directly from a url and also gives meaningful analysis ability.
import pandas
url = "http://jse.amstat.org/v22n1/kopcso/BeefDemand.txt"
df = pandas.read_table(url, sep="\t+", engine='python', index_col="Year")
Note that you have multiple repeated tabs as separators in that file, which is handled by the sep="\t+". The repeats also means you have to use the python engine.
Now that the file is read into a dataframe, we can do easy plotting for instance:
df[['ChickPrice', 'BeefPrice']].plot()
Simply use a csv.reader or csv.DictReader to parse the contents. Make sure to set the delimiter to tabs, in this case:
import requests
import csv
import re
url = "http://jse.amstat.org/v22n1/kopcso/BeefDemand.txt"
response = requests.get(url)
response.raise_for_status()
text = re.sub("\t{1,}", "\t", response.text)
reader = csv.DictReader(text.splitlines(), delimiter="\t")
for row in reader:
print(row)
I like csv.DictReader better in this case, because it consumes the header line for you and each "row" is a dictionary. Your specific text file sometimes seperates fields with repeated tabs to make it look prettier, so you'll have to take that into account in some way. In my snippet, I used a regular expression to replace all tab-clusters with a single tab.
Related
Dear all i often need to concat csv files with identical headers (i.e. but them into one big file). Usually i just use pandas, but i now need to operate in an enviroment were i am not at liberty to install any library. The csv and the html libs do exsit.
I also need to remove all remaining html tags like %amp; for the apercent symbol which are still within the data. I do know in which columns these can come up.
I thought aboug doing it like this, and the concat part of my code seems to work fine:
import CSV
import html
for file in files: # files is a list of csv files.
with open(file, "rt", encoding="utf-8-sig") as source, open(outfilePath, "at", newline='',
encoding='utf-8-sig') as result:
d_reader = csv.DictReader(source,delimiter=";")
# Set header based on first file in file_list:
if file == test_files[0]:
Common_header = d_reader.fieldnames
# Define DcitwriterObject
wtr = csv.DictWriter(result, fieldnames=Common_header, lineterminator='\n', delimiter=";")
# Write Header only once to emtpy file
if result.tell() == 0:
wtr.writeheader()
# If i remove this block i get my concateneated singe csv file as a result
# Howerver the html tags/encoded symbols are sill present.
for row in d_reader:
print(html.unescape(row['ColA'])) # This prints the unescaped Values in the column correctly
# If i kepp these two lines, i get an empty file with just the header as a result of the concatenation
row['ColA'] = html.unescape(row['ColA'])
row['ColB'] = html.unescape(row['ColB'])
wtr.writerows(d_reader)
I would have thought the simply suppling the encoding='utf-8-sig' part to the result file would be sufficient to get rid of the html symbols but that does not work. If you could give me a hint what i am doint wrong in the usage of the code containing the html.unescape function in my code that would be nice.
Thank you in advance
Trying to whip this out in python. Long story short I got a csv file that contains column data i need to inject into another file that is pipe delimited. My understanding is that python can't replace values, so i have to re-write the whole file with the new values.
data file(csv):
value1,value2,iwantthisvalue3
source file(txt, | delimited)
value1|value2|iwanttoreplacethisvalue3|value4|value5|etc
fixed file(txt, | delimited)
samevalue1|samevalue2| replacedvalue3|value4|value5|etc
I can't figure out how to accomplish this. This is my latest attempt(broken code):
import re
import csv
result = []
row = []
with open("C:\data\generatedfixed.csv","r") as data_file:
for line in data_file:
fields = line.split(',')
result.append(fields[2])
with open("C:\data\data.txt","r") as source_file, with open("C:\data\data_fixed.txt", "w") as fixed_file:
for line in source_file:
fields = line.split('|')
n=0
for value in result:
fields[2] = result[n]
n=n+1
row.append(line)
for value in row
fixed_file.write(row)
I would highly suggest you use the pandas package here, it makes handling tabular data very easy and it would help you a lot in this case. Once you have installed pandas import it with:
import pandas as pd
To read the files simply use:
data_file = pd.read_csv("C:\data\generatedfixed.csv")
source_file = pd.read_csv('C:\data\data.txt', delimiter = "|")
and after that manipulating these two files is easy, I'm not exactly sure how many values or which ones you want to replace, but if the length of both "iwantthisvalue3" and "iwanttoreplacethisvalue3" is the same then this should do the trick:
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3]
now all you need to do is save the dataframe (the table that we just updated) into a file, since you want to save it to a .txt file with "|" as the delimiter this is the line to do that (however you can customize how to save it in a lot of ways):
source_file.to_csv("C:\data\data_fixed.txt", sep='|', index=False)
Let me know if everything works and this helped you. I would also encourage to read up (or watch some videos) on pandas if you're planning to work with tabular data, it is an awesome library with great documentation and functionality.
I am currently using the below code to web scrape data and then store it in a CSV file.
from bs4 import BeautifulSoup
import requests
url='https://www.business-standard.com/rss/companies-101.rss'
soup = BeautifulSoup(requests.get(url).content, 'xml')
news_items = []
for item in soup.findAll('item'):
news_item = {}
news_item['title'] = item.title.text
news_item['excerpt'] = item.description.text
print(item.link.text)
s = BeautifulSoup(requests.get(item.link.text).content, 'html.parser')
news_item['text'] = s.select_one('.p-content').get_text(strip=True, separator=' ')
news_item['link'] = item.link.text
news_item['pubDate'] = item.pubDate.text
news_item['Category'] = 'Company'
news_items.append(news_item)
import pandas as pd
df = pd.DataFrame(news_items)
df.to_csv('company_data.csv',index = False)
When displaying the data frame, the results look fine as attached.enter image description here
But while opening the csv file, the columns are not as expected. enter image description hereCan anyone tell me the reason.
The issue is that your data contains commas and the default seperator for to_csv is "," So each comma in your data set is treated as a seperate column.
If you perform df.to_excel('company_data.xlsx', index=False) you won't have this issue since it is not comma seperated.
A csv file is not an Excel file despite what Microsoft pretends but is a text file containing records and fields. The records are separated with sequences of \r\n and the fields are separated with a delimiter, normally the comma. Fields can contain new lines or delimiters provided they are enclosed in quotation marks.
But Excel is known to have a very poor csv handling module. More exactly it can read what it has written, or what is formatted the way it would have written it. To be locale friendly, MS folks decided that they will use the locale decimal separator (which is the comma , in some West European locales) and will use another separator (the semicolon ; when the comma is the decimal separator). As a result, using Excel to read CSV files produced by other tools is a nightmare with possible workarounds like changing the separator when writing the CSV file, but no clean way. LibreOffice is far behind MS Office for most features except CSV handling. So my advice is to avoid using Excel for CSV files but use LibreOffice Calc instead.
I'm trying to read data from Google Fusion Tables API into Python using the csv library. It seems like querying the API returns CSV data, but when I try and use it with csv.reader, it seems to mangle the data and split it up on every character rather than just on the commas and newlines. Am I missing a step? Here's a sample I made to illustrate, using a public table:
#!/usr/bin/python
import csv
import urllib2, urllib
request_url = 'https://www.google.com/fusiontables/api/query'
query = 'SELECT * FROM 1140242 LIMIT 10'
url = "%s?%s" % (request_url, urllib.urlencode({'sql': query}))
serv_req = urllib2.Request(url=url)
serv_resp = urllib2.urlopen(serv_req)
reader = csv.reader(serv_resp.read())
for row in reader:
print row #prints out each character of each cell and the column headings
Ultimately I'd be using the csv.DictReader class, but the base reader shows the issue as well
csv.reader() takes in a file-like object.
Change
reader = csv.reader(serv_resp.read())
to
reader = csv.reader(serv_resp)
Alternatively, you could do:
reader = csv.DictReader(serv_resp)
It's not the CSV module that's causing the problem. Take a look at the output from serv_resp.read(). Try using serv_resp.readlines() instead.
I have a two-column csv which I have uploaded via an HTML page to be operated on by a python cgi script. Looking at the file on the server side, it looks to be a long string i.e for a file called test.csv with the contents.
col1, col2
x,y
has become
('upfile', 'test.csv', 'col1,col2'\t\r\nx,y')
Col1 contains the data I want to operate on (i.e. x) and col 2 contains its identifier (y). Is there a better way of doing the uploading or do I need to manually extract the fields I want - this seems potentially very error-prone
thanks
If you're using the cgi module in python, you should be able to do something like:
form = cgi.FieldStorage()
thefile = form['upfile']
reader = csv.reader(thefile.file)
header = reader.next() # list of column names
for row in reader:
# row is a list of fields
process_row(row)
See, for example, cgi programming or the python cgi module docs.
Can't you use the csv module to parse this? It certantly better than rolling your own.
Something along the lines of
import csv
import cgi
form = cgi.FieldStorage()
thefile = form['upfile']
reader = csv.reader(thefile, delimiter=',')
for row in reader:
for field in row:
doThing()
EDIT: Correcting my answer from the ars answer posted below.
Looks like your file is becoming modified by the HTML upload. Is there anything stopping you from just ftp'ing in and dropping the csv file where you need it?
Once the CSV file is more proper, here is a quick function that will put it into a 2D array:
def genTableFrCsv(incsv):
table = []
fin = open(incsv, 'rb')
reader = csv.reader(fin)
for row in reader:
table.append(row)
fin.close()
return table
From here you can then operate on the whole list in memory rather than pulling bit by bit from the file as in Vitor's solution.
The easy solution is rows = [row.split('\t') for r in csv_string.split('\r\n')]. It's only error proned if you have users from different platforms submit data. They might submit comas or tabs and their line breaks could be \n, \r\n, \r, or ^M. The easiest solution is to use regular expressions. Book mark this page if you don't know regular expressions:
http://regexlib.com/CheatSheet.aspx
And here's the solution:
import re
csv_string = 'col1,col2'\t\r\nx,y' #obviously your csv opening code goes here
rows = re.findall(r'(.*?)[\t,](.*?)',csv_string)
rows = rows[1:] # remove header
Rows is now a list of tuples for all of the rows.