post text from files into a web page form using python - python

I have a set of text files. I need to input them one after the other to a web server. I know how to input text using mechanize but have no idea how to extract text from files stored on computer and input them one after the other. In other words, say I have 10 files on my hard disk, I need to post text from one file, submit, then post another file and the process should go on until all the files are posted. Please help me with suggestions.
Thank you.

First make a for loop to iterate through your files. Then read the files and encode them into a POST request with urllib and urllib2. You need to change the url, filename pattern, and form fields accordingly.
url = "http://www.example.com/form"
import glob
import urllib
import urllib2
for filename in glob.glob("file*.txt"):
filedata = open(filename).read()
data = urllib.urlencode({'data' : filedata})
urllib2.urlopen(url=url, data=data)

Related

Read a csv file from bitbucket using Python and convert it to a df

I am trying to read a url csv file from bitbucket and I want to read it into a df using python. Also for the work I am doing I can not read it locally , it has to be from bitbucket all the time.
Any ideas on how to do this? Thank you!
Here is my example:
url = 'https://bitbucket.EXAMPLE.com/EXAMPLE/EXAMPLE/EXAMPLE/EXAMPLE/raw/wpcProjects.csv?at=refs%2Fheads%2Fmaster'
colnames=['project_id','project_name','gourmet_url']
df7 = pd.read_csv(url, names =colnames)
However, the output is not correct, its not the df being outputted its some bad data.
You have multiple options, but your question is actually 2 separate questions.
How to get a file (.csv in this case) from a remote location.
How to load a csv into a "df" which is a pandas data frame.
For #2, you simply import pandas, and use the df = pandas.read_csv() function call. See the documentation! If the CSV file was in the current directory, you would do pandas.read_csv('myfile.csv')
The CSV is on a server somewhere. In this case, it happens to be on bitbucket's servers accessed from their website. You can fetch it and save it locally, then access it, or you can fetch it to a temporary location, read it into pandas, and discard it. You could even read the data from the file into python as a string. However, having a lot of options doesn't mean they are all useful. I am just listing them for completeness. Looking at the documentation, pandas already has remote fetching built into the read_csv() function. If the passed in path is a valid URL scheme, where, in pandas,
"Valid URL schemes include http, ftp, s3, gs, and file".
If you want to locally save it, you can use pandas to do so once again, using the .write() method of a data frame.
FOR BITBUCKET SPECIFICALLY:
You need to make sure to link to the 'raw' file on bitbucket. Get the link to the raw file, and pass that in. The link used to view the file on your web browser is not the direct link to the raw file by default, it's a webpage that offers a view into that file. Get the raw file link, then pass that into pandas.
Code example:
Assume we want (a random csv file I found on bitbucket):
https://bitbucket.org/pedrorijo91/nodejstutorial/src/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv?at=master
What you need is a link to the raw file! clicking on ... and pressing 'open raw' we get:
https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv
Let's look at this in detail, the link is the same up to the project name:
https://bitbucket.org/pedrorijo91/nodejstutorial/
afterwards, the raw file is under raw/
then it's the same pointer (random but same letters and numbers)
db4c991864e65c4d72e98a1dc94e33606e3adde9/
Finally, it's the same directory structure:
node_modules/levelmeup/data/horse_js.csv
The first link ends with a ?at=master which is parsed by the web server and originates from src/ at the web server. The second link, the actual link to the raw file, starts from raw/ and ends with .csv
import pandas as pd
RAW_Bitbucket_URL = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
df = pd.read_csv(RAW_Bitbucket_URL)
The above code is successful for me.
 You may need to download the entire file so you can try to make the request with requests and then read it as a file in pandas.read_csv().
>>> import pandas as pd
>>> import requests
>>> url = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
>>> r = requests.get(url, allow_redirects=True)
>>> open('file.csv', 'wb').write(r.content)
>>> pd.read_csv('file.csv', encoding='utf-8-sig').head()
ID Tweet Date Via
0 374667940827635712 So, yes, a 100% JS App is 100% awesome 08:59:32, 9-3, 2013 web
1 374656867466637312 "vituperating priests" who rail against JavaSc... 08:15:32, 9-3, 2013 web
2 374654221292806144 Node/Browserify/CJS folks, is there any benefit 08:05:01, 9-3, 2013 Twitter for iPhone
3 374640446955212800 100% JavaScript applications. You may get some 07:10:17, 9-3, 2013 Twitter for iPhone
4 374613490763169792 A node.js app that will order you a sandwich 05:23:10, 9-3, 2013 web

How to find a particular URL in an HTML file with python?

There is a URL with a .bin attachment in my HTML file.My goal is to extract the full link with my Python script. I am running this script across many HTML files and the location of the .bin URL may change.If I was able to get the index of the beginning of the URL and the end, I could extract it that way.
I tried doing a word search through the HTML files but there are a few .bin URLS, I only want the first one. Any ideas would be appreciated. Or any other methods.
import urllib.request, urllib.error, urllib.parse
html_link = "www.mywebsitelink.com"
response = urllib.request.urlopen(html_link)
webContent = response.read()
I suggest you look at using Regex.
In your example, you will probably be looking for something like:
^http://.+\.bin$
You can test this out and explore what each part of the Regex expression means using this helpful tool: regex101
Your code would probably look something like this:
import re
bin_url = re.search("^http://.+\.bin$", webContent)

Read XML file from URL in Python

I'm using an open source project call OpenTripPlanner which is a tool that I plan to use to simulate a lot of itineraries from one point to another at a given time. So far, I've managed to find the URL where an XML file containing all information about an itineraries is located. The XML is built upon request so the URL isn't static. The URL looks something like this :
http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK
(You need to have an OpenTripPlanner server running to open it)
Now, I want to read these XML files and do some data analysis using python 3, but I can't find a way to read the files. I've tried to use urllib.request to download the file locally, but the file that I get from this is oddly formed. It looks something like this
{"requestParameters":{"date":"2017/12/04","mode":"TRANSIT,WALK","fromPlace":"48.40915, -71.04996","toPlace":"48.41428, -71.06996","time":"8:00:00"},"plan":{"date":1512392400000,"from":{"name":"Origin","lon":-71.04996,"lat":48.40915,"orig":"","vertexType":"NORMAL"},"to":{"name":"Destination","lon":-71.06996,"lat":48.41428,"orig":"","vertexType":"NORMAL"},"itineraries":[{"duration":1538,"startTime":1512392809000,"endTime":1512394347000,"walkTime":934,"transitTime":602,"waitingTime":2,"walkDistance":1189.6595112715966,"walkLimitExceeded":false,"elevationLost":0.0,"elevationGained":0.0,"transfers":0,"legs":[{"startTime":1512392809000,"endTime":1512393537000,"departureDelay":0,"arrivalDelay":0,"realTime":false,"distance":926.553,"pathway":false,"mode":"WALK","route":"","agencyTimeZoneOffset":-18000000,"interlineWithPreviousLeg":false,"from":{"name":"Origin","lon":-71.04996,"lat":48.40915,"departure":1512392809000,"orig":"","vertexType":"NORMAL"},"to":{"name":"Roitelets / Martinets","stopId":"1:370","stopCode":"370","lon":-71.047688,"lat":48.401531,"arrival":1512393537000,"departure":1512393538000,"stopIndex":15,"stopSequence":16,"vertexType":"TRANSIT"},"legGeometry":{"points":"s{mfHb{spL|ExBp#sDl#V##lB|#j#FL?j#GbCk#|A]vEsA^KBA|C{#pCeACS~CuA`#Q","length":19},"rentedBike":false,"transitLeg":false,"duration":728.0,"steps":[{"distance":131.991,"relativeDirection":"DEPART","streetName":"Rue D.-V.-Morrier","absoluteDirection":"SOUTH","stayOn":false,"area":false,"bogusName":false,"lon":-71.04961760502248,"lat":48.4090671692228,"elevation":[]},{"distance":72.319,"relativeDirection":"LEFT","streetName":"Rue Lorenzo-Genest","absoluteDirection":"EAST","stayOn":false,"area":false,"bogusName":false,"lon":-71.0502299,"lat":48.4079519,"elevation":[]}
And when I try to open the file in a browser, I get an error that says
XML Parsing Error: not well-formed
Location: http://localhost:63342/XML_reader/file.xml?_ijt=e1d6h53s4mh1ak94sqortejf9v
Line Number 1, Column 1: ...
The script I'm using is very simple, it looks like this
import urllib.request
testfile = urllib.request.URLopener()
file_name = 'http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK'
testfile.retrieve(file_name, "file.xml")
How can I make the outputted XML files well-formed? Is there an other way besides urllib.request that I may want to try?
Thanks a lot
To import this file as JSON data (not XML) you need the JSON library
import urllib.request
import json
from pprint import pprint
testfile = urllib.request.URLopener()
file_name = 'http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK'
testfile.retrieve(file_name, "file.json")
data = json.load(open('file.json'))
pprint(data)
json.load reads the JSON data and convert into a Python object (https://docs.python.org/2/library/json.html?highlight=json%20load#json.load)
pprint is for "Pretty printing" the JSON data (https://docs.python.org/2/library/pprint.html)

How to save webpages text content as a text file using python

I did python script:
from string import punctuation
from collections import Counter
import urllib
from stripogram import html2text
myurl = urllib.urlopen("https://www.google.co.in/?gfe_rd=cr&ei=v-PPV5aYHs6L8Qfwwrlg#q=samsung%20j7")
html_string = myurl.read()
text = html2text( html_string )
file = open("/home/nextremer/Final_CF/contentBased/contentCount/hi.txt", "w")
file.write(text)
file.close()
Using this script I didn't get perfect output only some HTML code.
I want save all webpage text content in a text file.
I used urllib2 or bs4 but I didn't get results.
I don't want output as a html structure.
I want all text data from webpage
What do you mean with "webpage text"?
It seems you don't want the full HTML-File. If you just want the text you see in your browser, that is not so easily solvable, as the parsing of a HTML-document can be very complex, especially with JavaScript-rich pages.
That starts with assessing if a String between "<" and ">" is a regular tag and includes analyzing the CSS-Properties changed by JavaScript-behavior.
That is why people write very big and complex rendering-Engines for Webpage-Browsers.
You dont need to write any hard algorithms to extract data from search result. Google has a API to do this.
Here is an example:https://github.com/google/google-api-python-client/blob/master/samples/customsearch/main.py
But to use it, first you must to register in google for API Key.
All information you can find here: https://developers.google.com/api-client-library/python/start/get_started
import urllib
urllib.urlretrieve("http://www.example.com/test.html", "test.txt")

Python 3, How To Use Python To Save Data From This Page?

I am trying to save price data from this page using Python 3.x.
I want my script to go through every option under the Fund Provider dropdown, and then save the resulting table to a local file.
Unfortunately, when I look at the source code, it appears that all the menu options and table data come from JSON files, and I am not sure where to begin as I can't seem to read the files from a browser. I know how to use urlretrieve, and have used it for simple static web pages, but my skills are not advanced enough to navigate complex multiple resource documents.
Any advice on how I can achieve my goal would be most appreciated.
Sorry for doing an incorrect copy and paste with the URL. Anyway - I found a solution. What I needed to do is:
use Firebug (an extention for Firebug) to identify the location of the json files, along with the posted information.
then use urlretrieve to download the data, including post information with each request
example code:
from urllib.request import urlretrieve
import urllib
url = 'http://www.example.com'
values = {'example_param1' : 'example value 1',
'example_param2' : 'example value 2'}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8') # data should be bytes
save_path = save_root + fund_provider + '.json'
urlretrieve(url, save_path, data=data )

Categories

Resources