I want to learn how to get CSV files from URLs.
While I can make the code below work by hard coding the name of the CSV string variable, I want to learn how to iterate through many CSV strings.
import csv
import requests
CSV_URL_1 = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv'
CSV_URL_2 = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv'
csv_list = []
for i in range(1, 3):
    concat = "CSV_URL_" + str(i)
    csv_list.append(concat)

with requests.Session() as s:
    csv_list_dict = {}
    for i in csv_list:
        download = s.get(i)  # This part is the problem
        decoded_content = download.content.decode('utf-8')
        cr = csv.reader(decoded_content.splitlines(), delimiter=',')
        my_list = list(cr)
        csv_list_dict[i] = my_list
csv_list_dict
In case it's not clear, I want the i in download = s.get(i) to behave like CSV_URL_1 on the first iteration and CSV_URL_2 on the second (I can copy the code twice and hard-code these values to get the correct result), but I can't figure out how to make the iteration work. Instead, I get a missing schema error.
What am I doing wrong?
When you do this:
concat = "CSV_URL_" + str(i)
csv_list.append(concat)
You are putting the strings "CSV_URL_1" and "CSV_URL_2" in csv_list.
But the first time your code does this:
download = s.get(i)
you are clearly expecting this to mean
download = s.get('https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv')
but it doesn't. It means
download = s.get("CSV_URL_1")
Now you can see why you are getting a missing schema error: there is no https:// in the URL. Your code is computing a variable's name as a string and then trying to use that string as if it were the variable itself.
Do this instead:
CSV_URL = [
    'https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv',
    'https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv',
]

for i in range(2):
    download = s.get(CSV_URL[i])
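For reference, here is a minimal sketch of the whole loop rewritten around that list, keeping the rest of your original code; keying the dictionary by URL is just an assumption about what you want:

import csv
import requests

CSV_URL = [
    'https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv',
    'https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv',
]

csv_list_dict = {}
with requests.Session() as s:
    for url in CSV_URL:
        download = s.get(url)  # url is the actual address, not a variable name
        decoded_content = download.content.decode('utf-8')
        cr = csv.reader(decoded_content.splitlines(), delimiter=',')
        csv_list_dict[url] = list(cr)  # keyed by URL (an assumption)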
from re import I
from requests import get

res = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
kek = []
for x in res:
    kek.append(x)
lnk = res[kek[0]]['downloads']
anime_name = res[kek[0]]['show']
for x in lnk:
    quality = x['res']
    links = x['magnet']
    data = f"{anime_name}:\n\n{quality}: {links}\n\n"
    print(data)
In this code, how can I prevent the anime name from repeating?
If I move it outside of the loop, only one link gets printed.
You can split your string: print the first half (the anime name) once outside the loop, and the second half inside the loop:
print(f"{anime_name}:\n\n")
for x in lnk:
quality = x['res']
links = x['magnet']
data = f"{quality}: {links}\n\n"
print(data)
Rewrote it a bit. Make sure you look at a 'pretty' version of the JSON response, using pprint or something similar, to understand where the elements are and where you can loop (remembering to iterate through the dict):
from requests import get

data = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
for show, info in data.items():
    print(show, '\n')
    for download in info['downloads']:
        print(download['magnet'])
        print(download['res'])
    print('\n')
Also you won't usually be able to just copy these links to get to the download, you usually need to use a torrent website.
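As a side note on the pprint suggestion above, a quick way to inspect the structure before writing any loops is a sketch like this (same URL as in the question):

from pprint import pprint
from requests import get

data = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
pprint(data)  # prints the nested dict with indentation, so the 'downloads' lists are easy to spot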
I am reading my data from a CSV file using pandas, and it works fine with nrows up to 700. But as soon as I go above 700 and try to append to a list in Python, I get a "list index out of range" error, even though the CSV has around 500K rows.
Can anyone help me understand why this is happening?
Thanks in advance.
import pandas as pd

df_email = pd.read_csv('emails.csv', nrows=800)
test_email = df_email.iloc[:, -1]
list_of_emails = []

for i in range(len(test_email)):
    # split one email on newlines, giving a Python list of all the strings in the email
    var_email = test_email[i].split("\n")
    email = {}
    message_body = ''
    for _ in var_email:
        if ":" in _:
            # use the ":" to find the elements in the list that have ":" present
            var_sentence = _.split(":")
            for j in range(len(var_sentence)):
                if var_sentence[j].lower().strip() == "from":
                    email['from'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
                elif var_sentence[j].lower().strip() == "to":
                    email['to'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
                elif var_sentence[j].lower().strip() == 'subject':
                    if var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip() == 're':
                        email['subject'] = var_sentence[var_sentence.index(var_sentence[j+2])].lower().strip()
                    else:
                        email['subject'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
        elif ":" not in _:
            message_body += _.strip()
    email['body'] = message_body
    list_of_emails.append(email)
I am not sure exactly what you are trying to do here (it would help to add example inputs and outputs), but I ran into a problem a few weeks ago that may be of the same nature.
CSV files are comma-separated, which means every comma in a line is treated as a column separator. If there is dirty input in the strings in your CSV file, it will mess up the columns you are expecting to get.
The best solution here is to have some code clean up your CSV file, change its delimiter to another character (perhaps '|', '&', or anything else that doesn't clash with the data), and revise your code to reflect those changes.
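For example, a minimal sketch of rewriting the file with a different delimiter; the output file name and the '|' character are just assumptions for illustration:

import csv

with open('emails.csv', 'r', newline='') as src, open('emails_pipe.csv', 'w', newline='') as dst:
    reader = csv.reader(src)                 # original comma-separated file
    writer = csv.writer(dst, delimiter='|')  # same rows, '|' as the new delimiter
    for row in reader:
        writer.writerow(row)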
Use the pandas library to read the file. It is very efficient and saves you the time of writing the code yourself, e.g.:
import pandas as pd

training_data = pd.read_csv("train.csv", sep=",", header=None)
I'm trying to make a very simple program where I have two CSV files with lists of domains or blog post URLs in them. I'm importing the first one and turning the column of domains and the column of prices into a dictionary, which I have managed to do.
Now I want to import the second CSV file, which is just a single column of blog post URLs.
After I import the URLs from the second file and print them out, every URL seems to be wrapped in [] within a list, like so:
[['http://www.gardening-blog.com/post-1'],['http://www.foodie-blog.com/post-2'],['http://www.someotherblog.com/post-1'].... etc etc
Is this something to do with importing with csv.reader?
Also, another question: what is the best way to strip the 'http://' and 'www.' from the list of URLs? I have tried two ways below using map and join (commented out), but they don't work; I have a feeling that's down to the list problem, though. I have done this for the dictionary, but I can't use replace on a list.
thanks
import csv

reader = csv.reader(open("domains_prices.csv", 'r'))
reader2 = csv.reader(open('orders_list.csv', 'r'))

domains_prices = {}
orders_list = list(reader2)  # import all blog post urls into a list

for domain, price in reader:
    domain = domain.replace('http://', '').replace('www', '')
    domains_prices[domain] = price

#orders_list = ''.join(orders_list).replace('http://','').split()
#map(str.strip, orders_list)

print orders_list
EDIT
Here's what I've changed; it seems to work now:
orders_list = []
for row in reader2:
    orders_list.append(','.join(row))

orders_list = [s.replace('http://', '').replace('www.', '') for s in orders_list]
Basically, csv.reader reads your CSV file and its next() method returns the next row; in Python, that row is represented as a list, even if it consists of a single field. That is why you are receiving a list of lists, each with a single element. Instead of reading it implicitly with list(reader2), you probably want to do it explicitly:
orders_list = [row[0] for row in reader2]
And since you want to remove "http://" and "www." from the URLs, you can do it right inside that construction:
orders_list = [row[0].replace("http://", "").replace("www.", "") for row in reader2]
But I would suggest being smarter about removing the scheme and "www.", as the scheme might be either "http" or "https", and I assume you only want to remove "www." from the start of the link. So you can take a look at the urlparse module, and also check whether the network location (the host part of the link) starts with "www.":
url = url.replace("www.", "", 1) if url.startswith("www.") else url
Note: the 1 in url.replace("www.", "", 1) is there to avoid removing "www" from inside the URL, for example if you have something like "www.facebook.com/best-www-address".
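If you do go the urlparse route, here is a rough sketch (not exact, production-ready code; the helper name is made up, and it assumes URLs like the ones in the question):

try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse        # Python 2

def strip_scheme_and_www(url):
    parsed = urlparse(url)
    host = parsed.netloc if parsed.netloc else parsed.path  # no scheme -> everything ends up in .path
    path = parsed.path if parsed.netloc else ''
    if host.startswith("www."):
        host = host.replace("www.", "", 1)                   # only strip a leading "www."
    return host + path

print(strip_scheme_and_www("http://www.gardening-blog.com/post-1"))  # gardening-blog.com/post-1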
And finally you end up with something like this:
links = []
for row in reader2:
    edited_link = row[0].replace("http://", "", 1) if row[0].startswith("http://") else row[0]
    edited_link = edited_link.replace("https://", "", 1) if edited_link.startswith("https://") else edited_link
    edited_link = edited_link.replace("www.", "", 1) if edited_link.startswith("www.") else edited_link
    links.append(edited_link)
My purpose is to extract one particular column from multiple data files.
So, I tried to use the glob module to read the files and extract that column from each file with for statements, like below:
filin = diri + '*_7.txt'
FileList = sorted(glob.glob(filin))

for INPUT in FileList:
    a = []
    b = []
    c = []
    T = []
    f = open(INPUT, 'r')
    f.seek(0, 0)
    for columns in (raw.strip().split() for raw in f):
        b.append(columns[11])
    t = np.array(b, float)
    print t
    t = list(t)
    T = T + [t]
    f.close()

print T
The number of data files I used is 32. So, I expected the second 'for' statement to run only 32 times, generating only 32 arrays t. However, the result doesn't look like what I expected.
I assume it may be due to the influence of the first 'for' statement, but I am not sure.
Any idea or help would be really appreciated.
Thank you,
Isaac
You clear T = [] for every file. Move the T = [] line above the first loop.
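In other words, something like this (a sketch keeping the question's variable names; diri is assumed to be defined earlier in your script, as in the question):

import glob
import numpy as np

filin = diri + '*_7.txt'           # diri defined elsewhere, as in the question
FileList = sorted(glob.glob(filin))

T = []                             # initialise once, so it accumulates across all 32 files
for INPUT in FileList:
    b = []
    f = open(INPUT, 'r')
    for columns in (raw.strip().split() for raw in f):
        b.append(columns[11])
    f.close()
    t = list(np.array(b, float))
    T = T + [t]                    # one entry per file

print T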
I am writing a program to extract text from a website and write it into a text file. Each entry in the text file should have 3 values separated by a tab. The first value is hard-coded to XXXX, the 2nd value should initialize to the first item on the website with a 'p' tag of class "style4", and the third value is the next item on the website with a 'p' tag of class "style5". The logic I'm trying to introduce is: look for the first "style4" and write the associated string into the text file, then find the next "style5" and write the associated string into the text file. Then, look for the next p class. If it's "style4", start a new line; if it's another "style5", write it into the text file with the first "style5" entry but separated with a comma (alternatively, the program could just skip the next "style5").
I'm stuck on that last part, that is, getting the program to look for the next p class and evaluate it against "style4" and "style5". Since I was having problems with finding and evaluating the p class tag, I chose to pull my code out of the loop and just try to accomplish the first iteration of the task for starters. Here's my code so far:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
next_vendor = soup.find('p', {'class': 'style4'})
print next_vendor
next_commodity = next_vendor.find_next('p', {'class': 'style5'})
print next_commodity
next = next_commodity.find_next('p')
print next
I'd appreciate any help anybody can provide! Thanks in advance!
I am not entirely sure what output you are expecting. I am assuming that you are trying to get the data on the webpage in the format:
Alphabet \t Vendor \t Category
You can do this:
# The basic things
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
Get the tds of interest:
table = soup.find('table')
data = table.find_all('tr')[-1]
data = data.find_all('td')[1:]
Now, we will create a nested output dictionary with alphabet letters as the keys and an inner dict as the value. The inner dict has the vendor name as key and category information as its value:
output_dict = {}
current_alphabet = ""
current_vendor = ""
for td in data:
    for p in td.find_all('p'):
        print p.text.strip()
        if p.get('class')[0] == 'style6':
            current_alphabet = p.text.strip()
            vendors = {}
            output_dict[current_alphabet] = vendors
            continue
        if p.get('class')[0] == 'style4':
            print "Here"
            current_vendor = p.text.strip()
            category = []
            output_dict[current_alphabet][current_vendor] = category
            continue
        output_dict[current_alphabet][current_vendor].append(p.text.strip())
This gets the output_dict in the format:
{ ...
u'W': { u'WTI - Weatherproofing Technologies': [u'Roofing'],
u'Wenger Corporation': [u'Musical Instruments and Equipment'],
u'Williams Scotsman, Inc': [u'Modular/Portable Buildings'],
u'Witt Company': [u'Interactive Technology']
},
u'X': { u'Xerox': [u"Copiers & MFD's", u'Printers']
}
}
Skipping the earlier parts for brevity. Now it is just a matter of accessing this dictionary and writing it out to a tab-separated file.
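Something along these lines should do for that last step (a sketch, assuming output_dict is built exactly as above; the output file name is arbitrary):

with open("contracts.txt", "w") as out:
    for alphabet, vendors in output_dict.items():
        for vendor, categories in vendors.items():
            # Alphabet \t Vendor \t Category (categories joined with commas)
            out.write("%s\t%s\t%s\n" % (alphabet, vendor, ", ".join(categories)))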
Hope this helps.
Agree with #shaktimaan. Using a dictionary or list is a good approach here. My attempt is slightly different.
import requests as rq
from bs4 import BeautifulSoup as bsoup
import csv

url = "http://www.kcda.org/KCDA_Awarded_Contracts.htm"
r = rq.get(url)
soup = bsoup(r.content)

primary_line = soup.find_all("p", {"class": ["style4", "style5"]})
final_list = {}
for line in primary_line:
    txt = line.get_text().strip().encode("utf-8")
    if txt != "\xc2\xa0":
        if line["class"][0] == "style4":
            key = txt
            final_list[key] = []
        else:
            final_list[key].append(txt)

with open("products.csv", "wb") as ofile:
    f = csv.writer(ofile)
    for item in final_list:
        f.writerow([item, ", ".join(final_list[item])])
For the scrape, we isolate the style4 and style5 tags right away. I did not bother going for the style6 tags or the alphabet headers. We then get the text inside each tag. If the text is not a whitespace of sorts (these are all over the tables, probably obfuscation or bad mark-up), we check whether it's style4 or style5. If it's the former, we assign it as a key to a blank list. If it's the latter, we append it to the blank list of the most recent key. The key changes every time we hit a new style4, so it's a relatively safe approach.
The last part is easy: we just use ", ".join on the value part of the key-value pair to concatenate the list as one string. We then write it to a CSV file.
Due to the dictionary being unsorted, the resulting CSV file will not be sorted alphabetically.
Changing it to a tab-delimited file is up to you. That's simple enough. Hope this helps!
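If you do want tabs, the only change in the write-out above would be passing a delimiter to the writer (the output file name here is just an example):

with open("products.tsv", "wb") as ofile:
    f = csv.writer(ofile, delimiter="\t")
    for item in final_list:
        f.writerow([item, ", ".join(final_list[item])])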