Corrupt File As Result of Accessing SharePoint - python

I am trying to write a Python script that will take 10+ Excel tables of data and append them into a single table. I'd also like to clear the destination table's contents beforehand, so every time the script runs it provides updated values from the 10 files.
I've tried to access these files on SharePoint, but every time I do, the file comes back corrupt. I'm in the first stages of this code, but I'd like it to at least be able to pull the data into Python so I can start manipulating it.
import requests
from requests.auth import HTTPBasicAuth

# SharePoint sharing link to the workbook
file = "https://ts.company.com/:x:/r/sites/XXXX/Shared%20Documents/General/Folder/XXXX.xlsx?d=w2ef8664034b243c89400e549b0cbf36d&csf=1&e=b5mA10"

username = 'robert.xxxxx@company.com'
password = 'password123'

# Download the file and write the response body to disk
resp = requests.get(file, auth=HTTPBasicAuth(username, password))
with open('test.xlsx', 'wb') as output:
    output.write(resp.content)
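For context, the append step I have in mind once the downloads work is roughly this (a minimal sketch; the file names are placeholders and it assumes the workbooks have already been saved locally):

import pandas as pd

# Placeholder local copies of the downloaded workbooks
local_files = ['file1.xlsx', 'file2.xlsx', 'file3.xlsx']

# Read each workbook and append them into a single table; rebuilding the
# combined frame from scratch each run effectively clears the old contents
frames = [pd.read_excel(f) for f in local_files]
combined = pd.concat(frames, ignore_index=True)
combined.to_excel('combined.xlsx', index=False)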

Related

Read a csv file from bitbucket using Python and convert it to a df

I am trying to read a CSV file from a Bitbucket URL into a DataFrame using Python. For the work I am doing I cannot read it locally; it has to come from Bitbucket every time.
Any ideas on how to do this? Thank you!
Here is my example:
import pandas as pd

url = 'https://bitbucket.EXAMPLE.com/EXAMPLE/EXAMPLE/EXAMPLE/EXAMPLE/raw/wpcProjects.csv?at=refs%2Fheads%2Fmaster'
colnames = ['project_id', 'project_name', 'gourmet_url']
df7 = pd.read_csv(url, names=colnames)
However, the output is not correct: it isn't the DataFrame being output, just some bad data.
You have multiple options, but your question is actually two separate questions:
1. How to get a file (a .csv in this case) from a remote location.
2. How to load a CSV into a "df", i.e. a pandas DataFrame.
For #2, you simply import pandas and call df = pandas.read_csv(). See the documentation! If the CSV file were in the current directory, you would do pandas.read_csv('myfile.csv').
For #1, the CSV is on a server somewhere; in this case it happens to be on Bitbucket's servers, accessed through their website. You can fetch it and save it locally, then read it; or fetch it to a temporary location, read it into pandas, and discard it. You could even read the data from the file into Python as a string. Having a lot of options doesn't mean they are all useful, though; I am just listing them for completeness. Looking at the documentation, pandas already has remote fetching built into read_csv(): the path you pass in can be a URL, and
"Valid URL schemes include http, ftp, s3, gs, and file".
If you want to save it locally, you can use pandas once again, this time with the DataFrame's .to_csv() method.
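For example (a minimal sketch; the URL is only a placeholder for the raw Bitbucket link discussed below):

import pandas as pd

# read_csv accepts a URL directly; to_csv then saves a local copy
df = pd.read_csv('https://example.com/raw/data.csv')  # placeholder URL
df.to_csv('local_copy.csv', index=False)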
FOR BITBUCKET SPECIFICALLY:
You need to make sure to link to the 'raw' file on Bitbucket. Get the link to the raw file, and pass that in. The link used to view the file in your web browser is not, by default, the direct link to the raw file; it's a webpage that offers a view into that file. Get the raw file link, then pass that into pandas.
Code example:
Assume we want (a random csv file I found on bitbucket):
https://bitbucket.org/pedrorijo91/nodejstutorial/src/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv?at=master
What you need is a link to the raw file! Clicking on ... and pressing 'Open raw', we get:
https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv
Let's look at this in detail. The link is the same up to the project name:
https://bitbucket.org/pedrorijo91/nodejstutorial/
Afterwards, the raw file is served from raw/, and then it's the same commit hash (random-looking, but identical in both links):
db4c991864e65c4d72e98a1dc94e33606e3adde9/
Finally, it's the same directory structure:
node_modules/levelmeup/data/horse_js.csv
The first link ends with ?at=master, which is parsed by the web server, and serves the file through the src/ view. The second link, the actual link to the raw file, goes through raw/ and ends with .csv.
import pandas as pd
RAW_Bitbucket_URL = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
df = pd.read_csv(RAW_Bitbucket_URL)
The above code is successful for me.
You may need to download the entire file: make the request with requests, save the content to a file, and then read that file with pandas.read_csv().
>>> import pandas as pd
>>> import requests
>>> url = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
>>> r = requests.get(url, allow_redirects=True)
>>> open('file.csv', 'wb').write(r.content)
>>> pd.read_csv('file.csv', encoding='utf-8-sig').head()
ID Tweet Date Via
0 374667940827635712 So, yes, a 100% JS App is 100% awesome 08:59:32, 9-3, 2013 web
1 374656867466637312 "vituperating priests" who rail against JavaSc... 08:15:32, 9-3, 2013 web
2 374654221292806144 Node/Browserify/CJS folks, is there any benefit 08:05:01, 9-3, 2013 Twitter for iPhone
3 374640446955212800 100% JavaScript applications. You may get some 07:10:17, 9-3, 2013 Twitter for iPhone
4 374613490763169792 A node.js app that will order you a sandwich 05:23:10, 9-3, 2013 web

How do I process API return without writing to file in Python?

I am trying to slice the results of an API response to process just the first n values in Python without writing to a file first.
Specifically, I want to do analysis on the "front page" of HN, which is just the first 30 items. However, the API (https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty) gives you the first 500 results.
Right now I'm pulling the top stories and writing them to a file, then reading the file back and truncating the string:
import json
import requests

# request JSON data from HN API
topstories = requests.get('https://hacker-news.firebaseio.com/v0/topstories.json')

# write return to .txt file named topstories.txt
with open('topstories.txt', 'w') as fd:
    fd.write(topstories.text)

# truncate the text file to top 30 stories
f = open('topstories.txt', 'r+')
f.truncate(270)
f.close()
This is inelegant and inefficient, and I will have to do it again to extract each 8-digit object ID.
How do I process this API return data as much as possible in memory, without writing to a file?
Suggestion:
User jordanm suggested the code replacement:
fd.write(json.dumps(topstories.json()[:30]))
However, that would just shift when I have to write and read the file, rather than letting me do everything else I want with the data in memory.
What you want is the io library
https://docs.python.org/3/library/io.html
Basically:
import io
f = io.StringIO(topstories.text)
f.truncate(270)
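Alternatively, building on jordanm's suggestion above, a minimal sketch that keeps everything in memory (no file at all), assuming only the first 30 IDs are needed:

import requests

# The endpoint returns a JSON array of story IDs; .json() parses it into a Python list
topstories = requests.get('https://hacker-news.firebaseio.com/v0/topstories.json')

# Slice the list to keep only the "front page" (first 30 stories)
top30_ids = topstories.json()[:30]

for story_id in top30_ids:
    print(story_id)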

How to troubleshoot corrupted BytesIO return in django?

I've got a strange issue going on in Django and am not sure how to even start troubleshooting it. I am using xlsxwriter to generate an Excel report from some data. I then want to pass the generated file back to the user. My view code looks like:
today = datetime.datetime.now()
document = gen_excel(today)
filename = 'export_{}.xlsx'.format(today.strftime('%Y-%m-%d'))
response = HttpResponse(
    document.read(),
    content_type='application/ms-excel'
)
response['Content-Disposition'] = 'attachment; filename=%s' % filename
return response
and the function gen_excel looks like:
def gen_excel(start_date):
    output = io.BytesIO()
    workbook = xlsxwriter.Workbook(output, {'in_memory': True})
    create_sheets(start_date, workbook)  # Creates and fills worksheets with data
    workbook.close()
    output.seek(0)
    return output
So from gen_excel I am returning the BytesIO stream that was filled by xlsxwriter. My issue is that (using the built-in dev server) the first time I call the view, I get a perfectly readable .xlsx file. All subsequent calls return a corrupted .xlsx file that can sort of be repaired by Excel (though that kills all the formatting, etc.). If I look at the file sizes, the first output is 40655 bytes, whereas the subsequent downloads are slightly smaller at 40336 bytes.
The kicker is that once I restart the server, the first output is once again perfect. So it isn't that my gen_excel function is changing the data on the server; it seems to be related to subsequent downloads while the server is running. Also, there are no errors output by Django, xlsxwriter, the server, etc. I just get a corrupt file for all subsequent downloads until the server is restarted.
Any ideas on how I can start troubleshooting this?
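One way to start narrowing it down (a sketch, not a confirmed fix; gen_excel_debug is a hypothetical variant of the gen_excel above): keep a server-side copy of every generated stream before it is handed to HttpResponse, so the copies can be compared with the browser downloads to see whether the corruption happens during generation or somewhere in the response path.

import datetime
import io

import xlsxwriter

def gen_excel_debug(start_date):
    output = io.BytesIO()
    workbook = xlsxwriter.Workbook(output, {'in_memory': True})
    create_sheets(start_date, workbook)  # same worksheet-filling code as in the question
    workbook.close()

    # Hypothetical debug step: save a timestamped copy of every generated
    # workbook so it can be diffed against what the browser receives
    stamp = datetime.datetime.now().strftime('%H%M%S%f')
    with open('debug_export_{}.xlsx'.format(stamp), 'wb') as fh:
        fh.write(output.getvalue())

    output.seek(0)
    return output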

python: getting npm package data from a couchdb endpoint

I want to fetch the npm package metadata. I found this endpoint, which gives me all the metadata needed.
My plan is to select some specific keys and add that data to a database (I could also store it in a JSON file, but the data is huge). I made the following script to fetch the data:
import requests
import json
import sys

db = 'https://replicate.npmjs.com'
r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs": "true"})
for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        print(line)
        decoded_line = line.decode('utf-8')
        print(json.loads(decoded_line))
Notice that even without actually including the full docs, it gets stuck in an infinite loop. I think this is because the data is huge.
A look at the head of the output from https://replicate.npmjs.com/_all_docs gives me the following:
{"total_rows":1017703,"offset":0,"rows":[
{"id":"0","key":"0","value":{"rev":"1-5fbff37e48e1dd03ce6e7ffd17b98998"}},
{"id":"0-","key":"0-","value":{"rev":"1-420c8f16ec6584c7387b19ef401765a4"}},
{"id":"0----","key":"0----","value":{"rev":"1-55f4221814913f0e8f861b1aa42b02e4"}},
{"id":"0-1-project","key":"0-1-project","value":{"rev":"1-3cc19950252463c69a5e717d9f8f0f39"}},
{"id":"0-100","key":"0-100","value":{"rev":"1-c4f41a37883e1289f469d5de2a7b505a"}},
{"id":"0-24","key":"0-24","value":{"rev":"1-e595ec3444bc1039f10c062dd86912a2"}},
{"id":"0-60","key":"0-60","value":{"rev":"2-32c17752acfe363fa1be7dbd38212b0a"}},
{"id":"0-9","key":"0-9","value":{"rev":"1-898c1d89f7064e58f052ff492e94c753"}},
{"id":"0-_-0","key":"0-_-0","value":{"rev":"1-d47c142e9460c815c19c4ed3355d648d"}},
{"id":"0.","key":"0.","value":{"rev":"1-11c33605f2e3fd88b5416106fcdbb435"}},
{"id":"0.0","key":"0.0","value":{"rev":"1-5e541d4358c255cbcdba501f45a66e82"}},
{"id":"0.0.1","key":"0.0.1","value":{"rev":"1-ce856c27d0e16438a5849a97f8e9671d"}},
{"id":"0.0.168","key":"0.0.168","value":{"rev":"1-96ab3047e57ca1573405d0c89dd7f3f2"}},
{"id":"0.0.250","key":"0.0.250","value":{"rev":"1-c07ad0ffb7e2dc51bfeae2838b8d8bd6"}},
Notice that all the documents start from the second line (i.e. they are all part of the "rows" key's value). My question is how to get only the values of the "rows" key (i.e. all the documents). I found this repository for a similar purpose, but I can't use or convert it, as I am a total beginner in JavaScript.
If there is no stream=True among the arguments of get(), then all of the data will be downloaded into memory before the loop over the lines even starts.
Then there is the problem that the individual lines are not valid JSON by themselves. You'll need an incremental JSON parser like ijson for this. ijson in turn wants a file-like object, which isn't easily obtained from a requests.Response, so I will use urllib from the Python standard library here:
#!/usr/bin/env python3
from urllib.request import urlopen

import ijson

def main():
    with urlopen('https://replicate.npmjs.com/_all_docs') as json_file:
        for row in ijson.items(json_file, 'rows.item'):
            print(row)

if __name__ == '__main__':
    main()
Is there a reason why you aren't decoding the JSON before iterating over the lines?
Can you try this:
import requests
import json

db = 'https://replicate.npmjs.com'
r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs": "true"})

# Decode and parse the whole response body as JSON
data = json.loads(r.content.decode('utf-8'))

# "rows" is a list of dicts, so use dict indexing rather than attribute access
for row in data['rows']:
    print(row['key'])

Saving scraped documents in two sheets in an excel file

I've created a scraper which is supposed to parse some documents from a webpage and save them to an Excel file, creating two sheets. However, when I run it, I can see that it only saves the documents from the last link, in a single sheet, whereas there should be two sheets with the documents from both links. I even printed the results to see what is happening in the background, but I found nothing wrong there. I think the first sheet is overwritten and the second one is never created. How do I get around this so that the data is saved in two sheets in one Excel file? Thanks in advance for taking a look.
Here is my code:
import requests
from lxml import html
from pyexcel_ods3 import save_data

name_list = ['Altronix', 'APC']

def docs_parser(link, name):
    res = requests.get(link)
    root = html.fromstring(res.text)
    vault = {}
    for post in root.cssselect(".SubBrandList a"):
        if post.text == name:
            refining_docs(post.attrib['href'], vault)

def refining_docs(new_link, vault):
    res = requests.get(new_link).text
    root = html.fromstring(res)
    sheet = root.cssselect("#BrandContent h2")[0].text
    for elem in root.cssselect(".ProductDetails"):
        name_url = elem.cssselect("a[class]")[0].attrib['href']
        vault.setdefault(sheet, []).append([str(name_url)])
        save_data("docs.ods", vault)

if __name__ == '__main__':
    for name in name_list:
        docs_parser("http://store.immediasys.com/brands/", name)
But when I write code the same way for another site, it meets the expectation, creating different sheets and saving the documents in them. Here is the link:
https://www.dropbox.com/s/bgyh1xxhew8hcvm/Pyexcel_so.txt?dl=0
Question: I think the first sheet is overwritten and the second one is never created. How do I get around this so that the data is saved in two sheets in an Excel file?
You overwrite the workbook file for every link that gets appended.
You should never call save_data(...) within a loop; call it only once, at the end of your script (see the sketch at the end of this answer).
Comparing your two scripts, there is no difference: both behave the same, overwriting the workbook file again and again. Maybe the file I/O gets overloaded, as you are overwriting the workbook file more than 160 times within a short time.
The first script should create 13 sheets:
data sheet:powerpivot-etc links:20
data sheet:flappy-owl-videos links:1
data sheet:reporting-services-videos links:20
data sheet:csharp links:14
data sheet:excel-videos links:9
data sheet:excel-vba-videos links:20
data sheet:sql-server-videos links:9
data sheet:report-builder-2016-videos links:4
data sheet:ssrs-2016-videos links:5
data sheet:sql-videos links:20
data sheet:integration-services links:19
data sheet:excel-vba-user-form links:20
data sheet:archived-videos links:16
The second script should create 2 sheets:
vault sheet:Altronix links:16
vault sheet:APC links:16
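A rough sketch of the restructuring mentioned above for the second script (same parsing logic as the question, but the vault dict is shared across both links and save_data() runs exactly once, after the loop):

import requests
from lxml import html
from pyexcel_ods3 import save_data

name_list = ['Altronix', 'APC']

def docs_parser(link, name, vault):
    res = requests.get(link)
    root = html.fromstring(res.text)
    for post in root.cssselect(".SubBrandList a"):
        if post.text == name:
            refining_docs(post.attrib['href'], vault)

def refining_docs(new_link, vault):
    res = requests.get(new_link).text
    root = html.fromstring(res)
    sheet = root.cssselect("#BrandContent h2")[0].text
    for elem in root.cssselect(".ProductDetails"):
        name_url = elem.cssselect("a[class]")[0].attrib['href']
        vault.setdefault(sheet, []).append([str(name_url)])

if __name__ == '__main__':
    vault = {}  # one shared dict -> one sheet per brand
    for name in name_list:
        docs_parser("http://store.immediasys.com/brands/", name, vault)
    save_data("docs.ods", vault)  # write the workbook once, after both links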
