This is my first post on Stack Overflow.
I'm looking to gather a large amount of data from a multitude of files in ProjectWise (PW) so I can quantify a few things about the records.
The directories I'm working with have unique numbers, and each contains files similar to the files in the other folders.
Is there a Python library I can use, or any other useful tips, for taking on this task?
It could potentially save many hours of work if I can do this with code.
A pseudocode example might look like this:

extracted_data = []
for element in data_field:                  # e.g. each unique folder number
    folder = find_folder(element)           # locate the matching directory
    if folder is not None:
        file = find_file(folder)            # locate the file of interest
        if file is not None:
            data = extract_data(file)       # extract certain data from file X
            extracted_data.append(data)
Thank you,
R
Based on a quick web search for the ProjectWise API, there is a web-based REST API available, so you'll definitely want to look into that. You'll need to read the docs carefully to figure out which endpoint does what, but once you know what information you need to send and what kind of data you'll receive, programming a basic Python interface shouldn't be too difficult. One may already exist; I didn't look too hard.
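For example, once you know which endpoint you need, a bare-bones call from Python with the requests library might look something like the sketch below. The base URL, endpoint path and authentication scheme here are placeholders; check the ProjectWise API docs for the real ones.

import requests

BASE_URL = "https://your-projectwise-server/api"     # placeholder, see the API docs
session = requests.Session()
session.auth = ("username", "password")              # or a token, depending on the docs

# Hypothetical endpoint: ask for the documents in one of the numbered folders.
resp = session.get(f"{BASE_URL}/documents", params={"folder": "12345"})
resp.raise_for_status()
for doc in resp.json():                              # assuming the API returns a JSON list
    print(doc)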
I work in Python and have to generate spreadsheets frequently to share my data with programming-naive colleagues. I embed large blocks of text explaining the contents of the spreadsheet and how it was generated into the first page of these spreadsheets routinely. I don't like relying on an associated document to explain definitions, criteria, algorithms, and reliability when I send my results out into the world.
It's really awkward to edit and store the long strings that make up these blocks of text. I'd love to store them in dedicated files that I can work with using a tool INTENDED to edit large blocks of text. I'm wondering how other people deal with this kind of situation. JSON files? YAML? Some obvious built-in functionality in Python I don't know about?
This is obviously a very open-ended question. I'm sure there are a lot of different approaches and solutions out there. It's a difficult thing to search for online, as there are a lot of obfuscating factors when you search for things like 'python large strings' or 'python text files'. I'm hoping to hear about a number of different approaches.
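For what it's worth, the simplest version of the "dedicated files" idea I mention above is to keep each block in its own plain-text file and read it in when building the workbook. A rough sketch, using openpyxl as an example library and a made-up file path:

from pathlib import Path
from openpyxl import Workbook

readme_text = Path("descriptions/results_readme.txt").read_text()   # hypothetical path

wb = Workbook()
ws = wb.active
ws.title = "README"
ws["A1"] = readme_text        # the whole explanatory block lands in one cell
wb.save("results.xlsx")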
To start I just want to state that I'm an Electrical Engineer with basic knowledge of programming.
My requirement is as follows:
- I want to create an app where I can load and view PDF files that contain tables.
- The tables in these PDFs have irregular shapes and sit in a different position on every page (that's why tools like Tabula couldn't help me).
- Each table entry is multiline and of irregular dimensions (I cannot select a whole row at a time; it has to be each element alone. Simply copying the lines to Excel won't work either, because it would need a lot of formatting).
- So I want to be able to select each table entry individually from the table (like a selection or cropping box over the required text), delete the newlines if there are any in the text, and just keep spaces.
- The generated Excel file (or Access database, I don't really mind which) should be reviewable and saveable (if those are even words XD).
I have a good knowledge of Python and a very elementary knowledge of Django, and I'm looking for an expert who can tell me what I really need to learn (and, if possible, where to learn it) to execute this project.
Is it too much for me to take on, and if I can dedicate 10 hours a week, how long would it take me to complete such a project?
Thanks all for your help in advance.
Don't use Python, use Word. Open the PDF, then step through the Tables collection to collect the data and put it into Excel. See this for an example.
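If you'd rather stay in Python, the same idea can be driven through Word via pywin32. A rough sketch; the file path is a placeholder, Word converts the PDF on open, and irregular or merged cells may need extra handling:

import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
doc = word.Documents.Open(r"C:\path\to\tables.pdf")      # placeholder path

rows = []
for t in range(1, doc.Tables.Count + 1):                 # step through the Tables collection
    table = doc.Tables(t)
    for r in range(1, table.Rows.Count + 1):
        cells = []
        for c in range(1, table.Columns.Count + 1):
            text = table.Cell(r, c).Range.Text
            cells.append(text.rstrip("\r\x07"))          # strip Word's end-of-cell marker
        rows.append(cells)

doc.Close(False)
word.Quit()

From there, rows can be written to Excel with any library you like, or with Excel automation in the same way.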
Here is the advice I can offer:
First of all, search the internet:
https://lmddgtfy.net/?q=python%20library%20tabular%20pdf
-> Camelot, which is mentioned multiple times, seems relevant.
For working with Excel sheets, I'd point you to one of the best-known libraries for manipulating DataFrames: pandas.
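To tie those two suggestions together, a minimal sketch might be: Camelot to pull the tables out of the PDF, pandas to write them to an Excel workbook. The file names, page range and the "stream" flavor are just examples:

import camelot
import pandas as pd

tables = camelot.read_pdf("drawings.pdf", pages="all", flavor="stream")

with pd.ExcelWriter("extracted_tables.xlsx") as writer:
    for i, table in enumerate(tables):
        df = table.df                                # each extracted table as a DataFrame
        df = df.replace(r"\n", " ", regex=True)      # drop newlines inside cells, keep spaces
        df.to_excel(writer, sheet_name=f"table_{i}", index=False)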
Short online courses will quickly give you enough grounding to manage your project more easily.
For the application itself, you can easily find YouTube courses on a GUI library, made by someone who will explain how to build a basic application. That could give you the entry point you're asking about. Then you can simply ask yourself what else you need, or just want, to make it better.
As for the time needed, it depends on how long it takes you to understand the basics and how much time you spend building a deeper understanding. I think that in one week, working in your free time with real interest, you could have something working (not perfect, but working, which is a good beginning).
PS: I am not sure your question fits the aims of Stack Overflow. I suggest you read this page: https://stackoverflow.com/help/how-to-ask
We're creating gamma-cat, an open data collection for gamma-ray astronomy, and are looking for advice (here, or links to resources, formats, tools, packages) on how best to set it up.
The data we have consists of measurements for different sources, from different papers. It's pretty heterogeneous, sometimes there's data for multiple sources in one paper, for each source there's usually several papers, sometimes there's no spectrum, sometimes one, sometimes many, ...
Currently we just collect the data in an input folder as YAML and CSV files, and now we'd like to expose it to users. Mainly access from Python, but also from Javascript and accessible from a static website.
The question is what format and organisation we should use for the data, and if there's any Python packages that will help us generate the output files as a set of linked data, as well as Python and Javascript packages that will help us access it?
We would like to get multiple "views" or simple "queries" of the data, e.g. "list of all sources", "list of all papers", "list of all spectra for source X", "spectrum A from paper B for source C".
For format, probably JSON would be a good choice? Although YAML is a bit nicer to read, and it's possible to have comments and ordered maps. We're storing the output files in a git repo, and have had a lot of meaningless diffs for JSON files because key order changes all the time.
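One mitigation for the diff noise: always dump with sorted keys and a fixed indent, so the output is byte-for-byte stable across runs. A minimal sketch with a made-up record:

import json

record = {"source_id": "source-001", "papers": ["paper-042"], "n_spectra": 2}   # made-up example

with open("sources.json", "w") as fh:
    json.dump(record, fh, sort_keys=True, indent=2, ensure_ascii=False)
    fh.write("\n")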
To make the datasets discoverable and linked, I don't know what to use. I found e.g. http://jsonapi.org/ but that seems to be for REST APIs, not for just a series of flat JSON files on a static webserver? Maybe it could still be used that way?
I also found http://json-ld.org/ which looks relevant, but also pretty complex. Would either of those or something else be a good choice?
And finally, we'd like to use Python scripts to generate the linked and discoverable output files from the bunch of somewhat organised YAML and CSV files we have as input. So far we've just written a bunch of Python classes and scripts based on Python dicts / lists and YAML / JSON files. Is there a Python package that would help with that task of generating the linked data files?
Apologies for the long and complex question! I hope it's still in scope for SO and someone will have some advice to share.
Judging from the breadth of your question, you are new to linked data. The least "strange" format for you might be the Data Package. In the most common case it's just a zip archive of a CSV file and JSON metadata. It has a Python package.
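For a first experiment, building a Data Package from your existing CSV files with the datapackage library might look roughly like this; the glob pattern and output name are illustrative:

from datapackage import Package

package = Package()
package.infer("input/**/*.csv")             # build resource descriptors from the CSV files
package.save("gamma-cat.datapackage.zip")   # one archive with the data plus JSON metadata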
If you have queries to the data, you should settle for a database (triplestore) with a SPARQL endpoint. Take a look at Fuseki. You can then use Turtle or RDF/XML for file export.
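Querying such an endpoint from Python could then look roughly like this (via SPARQLWrapper; the dataset name and the query are placeholders):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:3030/gammacat/sparql")   # hypothetical Fuseki dataset
sparql.setQuery("SELECT ?source WHERE { ?source a <http://example.org/GammaSource> } LIMIT 10")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["source"]["value"])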
If the data comes from some kind of a tool, you can model the domain it represents using Eclipse Lyo (tutorial).
These tools are maintained by three different communities; you can reach out to their user mailing lists separately if you have further questions about them.
So I'm trying to come up with a script to check and report on files from my SharePoint site according to some criteria. However, I need to authenticate using the company's Active Directory.
I have done some research, and everything I've found seems far too advanced for what I intend; I'd prefer not to spend the whole time learning Django custom auth. Haufe.sharepoint seems simpler, but I didn't manage to list all the folders under a URL with it.
I must confess I don't really understand SharePoint very well either, though I'm assuming it can behave sort of like a file repository for my purposes. So I won't be offended if someone points out that what I'm trying to do doesn't make sense.
Anyway, I have a URL like http://intranet/sharedir/Archive/Projects/Customer1, and under it many folders containing subfolders and files.
What I want is, given a URL like the one above, to list all the directories and files it contains. After that I will iterate over the items and apply the rules I'm interested in.
If someone could provide some Python code example or reference would be great.
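To be concrete, the sort of thing I'm imagining is sketched below, assuming an on-premises SharePoint that accepts Windows (NTLM) authentication and exposes the standard REST API; the site URL, folder path and credentials are placeholders, and I don't know if this is even the right approach:

import requests
from requests_ntlm import HttpNtlmAuth

site = "http://intranet/sharedir"
folder = "/sharedir/Archive/Projects/Customer1"
auth = HttpNtlmAuth("DOMAIN\\username", "password")
headers = {"Accept": "application/json;odata=verbose"}

# List the files directly under the folder (use /Folders for the subfolders).
url = f"{site}/_api/web/GetFolderByServerRelativeUrl('{folder}')/Files"
resp = requests.get(url, auth=auth, headers=headers)
resp.raise_for_status()

for item in resp.json()["d"]["results"]:
    print(item["Name"], item["TimeLastModified"])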
We have lots of data and some charts representing one logical item. The charts and data are stored in various files. As a result, most users can easily access and re-use the information in their applications.
However, this is not exactly a good way of storing data. Among other reasons, the charts belong to some data, the charts and data have meta-information that is not reflected in the file system, there are a lot of files, etc.
Ideally, we want:
- one big "file" that can store all information (text, data and charts)
- the "file" is human readable, portable and accessible by non-technical users
- allows typical office applications like MS Word or MS Excel to extract text, data and charts easily
- a light-weight, easy solution. Quick and dirty is sufficient. Not many users.
I am happy to use some scripting language like Python to generate the "file", third-party tools (ideally free as in beer), and everything that you find on a typical Windows-centric office computer.
Some ideas that we currently ponder:
using VB or pywin32 to script MS Word or Excel
creating html and publish it on a RESTful web server
Could you expand on the ideas above? Do you have any other ideas? What should we consider?
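As a concrete example of the HTML idea: a quick-and-dirty script like the one below produces a single self-contained file with text, a data table and a chart embedded as base64, so it opens in any browser without a server. The data and labels are made up:

import base64
import io
import matplotlib.pyplot as plt

values = [1, 4, 9, 16]                                  # made-up data

fig, ax = plt.subplots()
ax.plot(values)
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
chart = base64.b64encode(buf.getvalue()).decode("ascii")

rows = "".join(f"<tr><td>{i}</td><td>{v}</td></tr>" for i, v in enumerate(values))
html = f"""<html><body>
<h1>Report</h1>
<p>Explanatory text about the data goes here.</p>
<table border="1">{rows}</table>
<img src="data:image/png;base64,{chart}">
</body></html>"""

with open("report.html", "w") as fh:
    fh.write(html)

Word and Excel can both open an HTML file like this, which covers the "extract text and data" requirement for text and tables at least.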
I can only agree with Reef on the general concepts he presented:
You will almost certainly prefer the data in a database rather than in a single large file.
You should not worry that the data is not directly manipulated by users, because, as Reef mentioned, that can only go wrong. And you would be surprised at how ugly it can get.
Concerning the usage of MS Office integration tools, I disagree with Reef. You can quite easily create an ActiveX server (in Python, if you like) that is accessible from the MS Office suite. As long as you have a solid infrastructure that allows some sort of file share, you can use that shared area to keep your code. I guess the mess Reef was talking about is mostly about keeping users' versions of your extract/import code in sync. If you do not use some sort of shared repository (a simple shared folder), or if your infrastructure fails you often so that the shared folder becomes unavailable, you will be in great pain. One thing that is also somewhat painful if you do not have the appropriate tools but deal with many users: the ActiveX server is best registered on each machine.
So.. I just said MS Office integration is very doable. But whether it is the best thing to do is a different matter. I strongly believe you will serve your users better if you build a web-site that handles their data for them. This sort of tool however almost certainly becomes an "ongoing project". Often, even as an "ongoing project", the time saved by your users could still make it worth it. But sometimes, strategically, you want to give your users a poorer experience to control project costs. In that case the ActiveX Server I mentioned could be what you want.
Instead of using one big file, you should use a database. Yes, you can store various types of files, like GIFs, in the database if you like.
The database file would not be human-readable or accessible by non-technical users, but that is a good thing.
The database would sit behind a website that your non-technical users would use to insert, update and get data. They would be able to display it on a page or export it to CSV (or even XLS - it's not that hard, I've seen some CSV-to-XLS converters). You could also look into some open-standard document formats; I think it should be quite easy to output data in them. Do not try to output in "doc" format (though you could try "docx"). You should be able to easily teach the users how to export their data to a CSV and upload it to the site, or they could use the web interface to insert the data if they prefer.
If you allow your users to mess with the raw data, they will break it (I have seen it happen; you have no idea what those guys can do). The only way to prevent that is to make a web form that only allows them to perform the specific actions you know they are supposed to perform.
The database + web page solution is the good one. Using VB or pywin32 to script MS Office will get you into so much trouble I cannot even imagine.
You could use gnuplot or some other graphics library to draw the charts (pretty straightforward to implement; it does all the hard work for you).
I am afraid the "quick" and dirty solution is tempting, but I can only say one thing: it will not be quick. In a few weeks you will find that hacking around with MS Office scripting is messy, buggy and unreliable, and the non-technical guys will hate it and say that at other companies they used to have a simple web panel that did all this. Then you will find that you cannot get help with the scripting, because everyone uses web interfaces nowadays, as they are quite easy to implement and maintain.
This is not a small project, it's a medium-sized one, and you need to remember this while writing it. It will take some time to build and test, and you will have to add new features as the non-technical guys start using it. I knew some passionate PHP teenagers who would be able to write this panel in a week, but as I understand it you have better resources, so I hope you will come up with a really reliable, modular, extensible solution with good usability and happy users.
Good luck!