I recently asked about accessing data from SPSS and got some absolutely wonderful help here. I now have an almost identical need to read data from a Confirmit data file. I'm not finding much about the Confirmit data file format on the web. It appears that Confirmit can export to SPSS *.sav files, which might be one avenue for me. Here are my exact needs:
I need to be able to extract two different but related types of info from a market research study done using ConfirmIt:
1. I need to be able to discover the data "schema": what questions are being asked (the text of the questions), what the type of each answer is (multiple choice, yes/no, text), and what text labels are associated with each answer.
2. I need to be able to read respondents' answers and populate my data model. So for each of the questions discovered in step 1 above, I need to build a table of respondent answers.
With SPSS this was easy thanks to a data access module made freely available by IBM and a nice Python wrapper by Albert-Jan Roskam. Googling for a Confirmit equivalent, I'm not finding much info. Any insight into this is helpful. Something like a Python or Java class to read the Confirmit data would be perfect!
Assuming my best option ends up being to export to SPSS *.sav file, does anyone know if it will meet both of my use cases above (contain the questions, answers schema and also contain each participant's results)?
You can get the data schema from Confirmit's Excel definition export.
You can also export the data from Confirmit as a txt file using the same template.
I was recently given a data set from Confirmit. There are almost 4000 columns in the Excel file. I want to load it into a MySQL db. There is no way they are producing that output from a single table. Do you know how the table schema works for Confirmit?
I have thousands of scanned field books as PDFs. Each has a unique filename. In a spreadsheet I have metadata for each, where each row has:
index number, filename, info1, info2, info3, info4, etc.
filename is the exact file name of the PDF. info1 is just an example of a metadata field, such as 'Year' or whatever. There are only about 8 fields or so, and not every PDF is relevant to all of them.
I assume there should be a reasonable way to create a database (MySQL or other) by reading the spreadsheet, which I can just save as .csv or .txt or something. This part I am sure I can handle.
I want to be able to lookup/search for a pdf file based on entering in various search items based on the metadata, and get a list of results. In a web interface, or a custom window, and be able to click on the results and open the file. Basically a typical search window with predefined fields you can enter and get results - like at an old school library terminal.
I have decent coding skills in python, mostly math, but some file skills as well. Looking for guidance on what tools and approach I should take to this. My short term goal is to be able to query and find files and open whatever results. Long term want to be able to share this with the public so they can search and find stuff.
After trying to figure out what to search for online, I am obviously at a loss. How do you suggest I do this, and what tools or libraries should I use? I cannot find an example of this online, and I'm not sure how to word it.
The actual data stuff could be done with Pandas:
read the excel file into Pandas
perform the search on the Pandas dataframe, e.g. using df.query()
But this does not give you a GUI. For that you could build a web app using the Flask or Django framework. That, however, is not something one masters overnight :)
This is a good course to learn that kind of stuff: https://www.edx.org/course/cs50s-web-programming-with-python-and-javascript?index=product&queryID=01efddd992de28a8b1b27d136111a2a8&position=3
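As a rough sketch of the database route mentioned in the question, here is the search step using only Python's standard library (csv and sqlite3) instead of Pandas. It assumes the spreadsheet was saved as CSV with a header row; the column names (filename, year, location), the sample rows, and the helper names load_metadata and search are all made up for illustration.

```python
import csv
import io
import sqlite3


def load_metadata(conn, csv_file):
    """Load CSV rows (header row required) into a table named 'books'."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames
    # Column names come from our own spreadsheet, so simple quoting is enough here.
    col_sql = ", ".join('"{}" TEXT'.format(c) for c in columns)
    conn.execute("CREATE TABLE books ({})".format(col_sql))
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        "INSERT INTO books VALUES ({})".format(placeholders),
        ([row[c] for c in columns] for row in reader),
    )
    return columns


def search(conn, columns, **criteria):
    """Return filenames whose metadata contains every given search term."""
    clauses, params = [], []
    for field, term in criteria.items():
        if field not in columns:
            raise ValueError("unknown field: {}".format(field))
        clauses.append('"{}" LIKE ?'.format(field))  # LIKE ignores ASCII case
        params.append("%{}%".format(term))
    where = " AND ".join(clauses) or "1"
    cur = conn.execute(
        "SELECT filename FROM books WHERE {} ORDER BY filename".format(where),
        params,
    )
    return [r[0] for r in cur]


# Hypothetical sample data standing in for the exported spreadsheet.
SAMPLE = """filename,year,location
book_001.pdf,1952,Alaska
book_002.pdf,1953,Yukon
book_003.pdf,1952,Yukon
"""

conn = sqlite3.connect(":memory:")
cols = load_metadata(conn, io.StringIO(SAMPLE))
print(search(conn, cols, year="1952"))
print(search(conn, cols, year="1952", location="yukon"))
```

The same search function could later sit behind a Flask or Django view; for a local tool, passing a result's path to webbrowser.open may be enough to open the PDF, though how local files are handled varies by platform.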
Thanks for taking the time to read my question.
I am working on a personal project to learn python scripting for excel, and I want to learn how to move data from one workbook to another.
In this example, I am emulating a company employee ledger that has name, position, address, and more (the data is organized by row, so each employee takes up one row). The project is to transfer a selected number of people to a new ledger (another Excel file). I have a list of emails in a .txt file (it could even be another Excel file, but I thought .txt would be easier), and I want the script to run through the .txt file, get the emails, and look for any rows that have a matching email address (all emails are in column 'B'). If any are found, it should copy that entire row to the new Excel file.
I tried a lot of ways to make this work, but I could not figure it out. I am really new to python so I am not even sure if this is possible. Would really appreciate some help!
There are essentially two packages for manipulating Excel files. For reading in data and performing analysis, the standard package is pandas. You can save the results as .xlsx, but you are really only working with the tabular data and not the file itself (i.e., you are extracting data FROM the file, not working WITH the file).
What you need, however, is to manipulate the Excel files directly, which is better done with openpyxl.
You can also read files (such as your text file) with Python's built-in open function inside a with statement; it is native to Python and not a third-party import like pandas or openpyxl.
Part of learning to program includes learning how to use documentation.
As such, here is the documentation you require with sufficient examples to learn openpyxl: https://openpyxl.readthedocs.io/en/stable/
And you can learn about pandas here: https://pandas.pydata.org/docs/user_guide/index.html
And you can learn about python with open here: https://docs.python.org/3/tutorial/inputoutput.html
Hope this helps.
EDIT: It's possible I or another person can give you a specific example using your data / code etc, but you would have to provide it fully. Since you're learning, I suggest using the documentation or youtube.
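For orientation, here is a minimal sketch of how the pieces named above could fit together. It assumes one email per line in the .txt file, emails in column B (index 1), and a header in the first row of the source sheet; the function names and file layout are hypothetical, and only copy_matching_rows needs openpyxl installed.

```python
def load_emails(path):
    """Read one email per line, ignoring blank lines and letter case."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def select_rows(rows, emails, email_col=1):
    """Keep rows whose email cell (column B = index 1) is in `emails`."""
    selected = []
    for row in rows:
        cell = row[email_col] if len(row) > email_col else None
        if isinstance(cell, str) and cell.strip().lower() in emails:
            selected.append(row)
    return selected


def copy_matching_rows(src_path, dst_path, emails_path):
    """Copy the header row plus every matching row into a new workbook."""
    # Third-party dependency, imported here so the helpers above work without it.
    from openpyxl import Workbook, load_workbook

    emails = load_emails(emails_path)
    source = load_workbook(src_path, read_only=True).active
    rows = list(source.iter_rows(values_only=True))
    if not rows:
        return
    header, body = rows[0], rows[1:]  # assumes the first row is a header

    out = Workbook()
    sheet = out.active
    sheet.append(header)
    for row in select_rows(body, emails):
        sheet.append(row)
    out.save(dst_path)


# In-memory demo of the matching step with made-up rows.
rows = [
    ("Ann Lee", "ann@example.com", "Engineer"),
    ("Bob Roy", "Bob@Example.com", "Analyst"),
]
print(select_rows(rows, {"bob@example.com"}))
```

Note that this copies cell values only; formatting, formulas, and column widths are not carried over.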
To start I just want to state that I'm an Electrical Engineer with basic knowledge of programming.
My requirement is as follows:
I want to create an app where I can load and view PDF files that contain tables.
The tables in these PDF files are of irregular shapes and in a different position on every page (that's why tools like tabular couldn't help me).
Each table entry is multiline and of irregular dimensions. I cannot select a whole row at a time; each element has to be selected on its own. Simply copying the lines to Excel won't work either, because it would need a lot of formatting.
So I want to be able to select each table entry individually from the table (with something like a selection or cropping box over the required text), remove any line breaks in the text, and keep just the spaces.
The generated Excel file (or Access database, I do not really mind which) should be reviewable and saveable (if those are even words XD).
I have a good knowledge of Python and a very elementary knowledge of Django, and I'm looking for an expert who can tell me what I really need to learn (and, if possible, where to learn it) to carry out my project.
Is this too much for me to take on? If I can dedicate 10 hours a week, how long would it take me to complete such a project?
Thanks all for your help in advance.
Don't use Python, use Word. Open the pdf, then step through the tables collection to collect the data and put it into excel. See this for an example
Here is the advice I can give you:
First of all, search the internet:
https://lmddgtfy.net/?q=python%20library%20tabular%20pdf
-> Camelot, which is mentioned multiple times, seems to be relevant.
For working with the Excel sheet, I suggest one of the best-known libraries for manipulating DataFrames: Pandas.
Short online courses will quickly give you the skills to manage your project more easily.
For the application, you can easily find courses on YouTube in which someone explains how to build a basic application with a given library. That could give you the entry point you are looking for. Then you can work out what else you need, or simply want, to make it better.
As for the time needed, it depends on how long you need to understand the basics and how much time you spend on gaining a deeper understanding. I think that in one week, working in your free time with real interest, you could have something working (not perfect, but working, which is a good start).
PS: I am not sure your question is within the scope of Stack Overflow. I suggest you read this page (https://stackoverflow.com/help/how-to-ask).
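One small, self-contained piece of the project is the clean-up step the question describes: replacing line breaks inside a selected table entry with spaces before saving. The sketch below assumes the entry text has already been extracted from the PDF (by Camelot, a selection box, or anything else), which is the harder part and is not shown. The sample rows and the function names are invented; the output is CSV, which Excel can open.

```python
import csv
import io
import re


def clean_cell(text):
    """Replace line breaks (and any run of whitespace) with a single space."""
    return re.sub(r"\s+", " ", text).strip()


def rows_to_csv(rows):
    """Serialise cleaned rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([clean_cell(cell) for cell in row])
    return buffer.getvalue()


# Made-up multiline entries, as they might come out of a PDF selection.
raw_rows = [["Relay\nK1", "24 V DC\ncoil"], ["Fuse F2", "2 A,\nslow blow"]]
print(rows_to_csv(raw_rows))
```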
I've been asked to create a Python script to automate a server deployment for 80 retail stores.
As part of this script, I call a secondary script that changes multiple values in 9 XML files. The values are unique to each store, so this script has to be changed each time. After I am gone, this will be done by semi-technical or non-technical people, so we don't want them to edit the Python scripts directly for fear of breaking them.
This in mind, I would like to have these people input the store details into an XLS sheet, and a python file read this sheet and put the data it finds into the existing python script with the data to be changed.
The file will be 2 columns, with the required data in the 2nd one.
I'm sorry if this is a long explanation, but that is the gist of it. I'm using Python 2.6. Does anyone have a clue about how I can do this, or which language might be better for it? I also know Bash and JavaScript.
Thanks in advance
It depends on the complexity and volume of your data: for small files, use openpyxl; for large ones, use pandas.
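To illustrate the overall idea, here is a hedged sketch that turns two-column rows into a settings dictionary and applies them to an XML document. It is written for Python 3 (not the 2.6 mentioned in the question) and makes assumptions that would need checking against the real files: that column 1 holds the name of an XML element and column 2 its new text, and that the rows have already been read from the sheet. The element names and values are invented. Note that openpyxl reads .xlsx workbooks, not the legacy .xls format.

```python
import xml.etree.ElementTree as ET


def rows_to_settings(rows):
    """Turn (name, value) rows into a dict, skipping rows with no name."""
    settings = {}
    for row in rows:
        if len(row) >= 2 and row[0] not in (None, ""):
            value = "" if row[1] is None else str(row[1]).strip()
            settings[str(row[0]).strip()] = value
    return settings


def apply_settings(xml_text, settings):
    """Set the text of every element whose tag matches a setting name."""
    root = ET.fromstring(xml_text)
    changed = 0
    for element in root.iter():
        if element.tag in settings:
            element.text = settings[element.tag]
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical rows as they might come from a sheet; the last row is blank.
rows = [("StoreNumber", 42), ("ServerName", "store42-srv"), (None, None)]
template = "<config><StoreNumber>0</StoreNumber><ServerName>x</ServerName></config>"
new_xml, count = apply_settings(template, rows_to_settings(rows))
print(new_xml)
```

Keeping the store-specific values in the sheet and the logic in the script means the people filling in the sheet never have to touch the Python code; returning the number of changed elements gives a cheap check that a sheet was filled in completely.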
We're creating gamma-cat, an open data collection for gamma-ray astronomy, and are looking for advice (here, or links to resources, formats, tools, packages) how to best set it up.
The data we have consists of measurements for different sources, from different papers. It's pretty heterogeneous, sometimes there's data for multiple sources in one paper, for each source there's usually several papers, sometimes there's no spectrum, sometimes one, sometimes many, ...
Currently we just collect the data in an input folder as YAML and CSV files, and now we'd like to expose it to users. Mainly access from Python, but also from Javascript and accessible from a static website.
The question is what format and organisation we should use for the data, and whether there are any Python packages that will help us generate the output files as a set of linked data, as well as Python and Javascript packages that will help us access it.
We would like to get multiple "views" or simple "queries" of the data, e.g. "list of all sources", "list of all papers", "list of all spectra for source X", "spectrum A from paper B for source C".
For format, probably JSON would be a good choice? Although YAML is a bit nicer to read, and it's possible to have comments and ordered maps. We're storing the output files in a git repo, and have had a lot of meaningless diffs for JSON files because key order changes all the time.
To make the datasets discoverable and linked, I don't know what to use. I found e.g. http://jsonapi.org/ but that seems to be for REST APIs, not for just a series of flat JSON files on a static webserver? Maybe it could still be used that way?
I also found http://json-ld.org/ which looks relevant, but also pretty complex. Would either of those or something else be a good choice?
And finally, we'd like to generate the linked and discoverable files in output from just a bunch of somewhat organised YAML and CSV files in input using Python scripts. So far we just wrote a bunch of Python classes or scripts based on Python dicts / lists and YAML / JSON files. Is there a Python package that would help with that task of generating the linked data files?
Apologies for the long and complex question! I hope it's still in scope for SO and someone will have some advice to share.
Judging from the breadth of your question, you are new to linked data. The least "strange" format for you might be the Data Package. In the most common case it's just a zip archive of a CSV file and JSON metadata. It has a Python package.
If you have queries to the data, you should settle for a database (triplestore) with a SPARQL endpoint. Take a look at Fuseki. You can then use Turtle or RDF/XML for file export.
If the data comes from some kind of a tool, you can model the domain it represents using Eclipse Lyo (tutorial).
These tools are maintained by 3 different communities; you can reach out to their user mailing lists separately if you have further questions about them.
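On the narrower problem of meaningless JSON diffs from changing key order, and the "list of all sources" / "list of all papers" views: Python's json module can sort keys on output, and simple index files can be generated from the dataset records. The sketch below is one possible layout, not an established format; the record fields (source_id, paper_id, path) and the file paths are invented for illustration.

```python
import json


def dump_stable(obj):
    """Serialise with sorted keys and fixed indentation so diffs stay stable."""
    return json.dumps(obj, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


def build_index(datasets):
    """Group dataset records into per-source and per-paper lists of files."""
    sources, papers = {}, {}
    for record in datasets:
        sources.setdefault(record["source_id"], []).append(record["path"])
        papers.setdefault(record["paper_id"], []).append(record["path"])
    return {
        "sources": {key: sorted(paths) for key, paths in sources.items()},
        "papers": {key: sorted(paths) for key, paths in papers.items()},
    }


# Hypothetical records; in practice these would come from the input YAML/CSV.
datasets = [
    {"source_id": "src-2", "paper_id": "paper-b", "path": "data/paper-b/src-2_sed.json"},
    {"source_id": "src-1", "paper_id": "paper-b", "path": "data/paper-b/src-1_sed.json"},
    {"source_id": "src-1", "paper_id": "paper-a", "path": "data/paper-a/src-1_sed.json"},
]
index = build_index(datasets)
print(dump_stable(index))
```

A static website or a Javascript client could fetch such an index file first and follow the relative paths from there, which covers the simple "views" without a server-side query engine.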