Python: working with a text file or with a list

I am currently working on a script to process some log files (a few MB each). I am quite new to Python, and up to now my method has been to extract some information, write it to another text file, and so on; I would then compare the different text files and work that way. Even though I would delete the intermediate text files once I had my final output, I found the whole approach a bit messy.
As I have become more acquainted with lists, I am now trying to use them instead of text files to store and manage data.
I was wondering which is the better approach. Should I use lists more instead of text files, or does that not really matter? I would tend to think lists are better for obvious reasons, but I wanted to make sure. I hope it is not too silly a question. Thanks.
EDIT
Quick example: I used to create two text files from the log files and then compare those text files; now I am doing the same thing with lists.

Lists in Python have many methods and features that make them very flexible and easy to manage.
There are also other types similar to lists, such as generators, which produce items lazily instead of holding everything in memory.
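For example, a minimal sketch of the list-based workflow you describe (file names and the "ERROR" filter are placeholders): read each log once, keep the extracts in memory, and compare them there instead of via intermediate files.

    def read_lines(path):
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    # extract the interesting lines from each log
    errors_a = [line for line in read_lines("a.log") if "ERROR" in line]
    errors_b = [line for line in read_lines("b.log") if "ERROR" in line]

    # compare the two extracts in memory; a set gives fast membership tests
    seen_b = set(errors_b)
    only_in_a = [line for line in errors_a if line not in seen_b]
    print(only_in_a)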
Best of luck.

Related

Storing / Working With Large Text Blocks to Be Inserted Into Documents in Python

I work in Python and have to generate spreadsheets frequently to share my data with programming-naive colleagues. I routinely embed large blocks of text into the first page of these spreadsheets, explaining their contents and how they were generated. I don't like relying on an associated document to explain definitions, criteria, algorithms, and reliability when I send my results out into the world.
It's really awkward to edit and store the long strings that make up these blocks of text. I'd love to store them in dedicated files that I can work with using a tool INTENDED to edit large blocks of text. I'm wondering how other people deal with this kind of situation. JSON files? YAML? Some obvious built-in functionality in Python I don't know about?
This is obviously a very open-ended question, and I'm sure there are a lot of different approaches and solutions out there. It's a difficult thing to search for online, as there are a lot of obfuscating factors when you search for things like 'python large strings' or 'python text files'. I'm hoping to hear about a number of different approaches.
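To make the idea concrete, here is roughly the shape of what I'm imagining; this is only a sketch, and the notes/ directory layout and file names are just placeholders:

    from pathlib import Path

    NOTES_DIR = Path(__file__).parent / "notes"

    def load_note(name):
        # read one block of explanatory text, e.g. notes/methods.txt
        return (NOTES_DIR / (name + ".txt")).read_text(encoding="utf-8")

    methods_text = load_note("methods")  # paste into the spreadsheet's first page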

Python JSON API for linked data, with flat files

We're creating gamma-cat, an open data collection for gamma-ray astronomy, and are looking for advice (here, or links to resources, formats, tools, packages) on how best to set it up.
The data we have consists of measurements for different sources, from different papers. It's pretty heterogeneous, sometimes there's data for multiple sources in one paper, for each source there's usually several papers, sometimes there's no spectrum, sometimes one, sometimes many, ...
Currently we just collect the data in an input folder as YAML and CSV files, and now we'd like to expose it to users: mainly access from Python, but also from JavaScript, and accessible from a static website.
The question is what format and organisation we should use for the data, and whether there are any Python packages that would help us generate the output files as a set of linked data, as well as Python and JavaScript packages that would help us access it.
We would like to get multiple "views" or simple "queries" of the data, e.g. "list of all sources", "list of all papers", "list of all spectra for source X", "spectrum A from paper B for source C".
As for the format, JSON would probably be a good choice? Although YAML is a bit nicer to read, and it allows comments and ordered maps. We're storing the output files in a git repo, and we have had a lot of meaningless diffs for JSON files because the key order changes all the time.
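For example, we could at least make the JSON output deterministic with the standard-library json module (a small sketch; the data here is a stand-in for our real records):

    import json

    data = {"source": "X", "papers": ["B", "D"]}  # stand-in record
    with open("output.json", "w", encoding="utf-8") as f:
        # sort_keys gives a stable key order, so git diffs stay meaningful
        json.dump(data, f, indent=2, sort_keys=True, ensure_ascii=False)
        f.write("\n")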
To make the datasets discoverable and linked, I don't know what to use. I found e.g. http://jsonapi.org/ but that seems to be for REST APIs, not for just a series of flat JSON files on a static webserver? Maybe it could still be used that way?
I also found http://json-ld.org/ which looks relevant, but also pretty complex. Would either of those or something else be a good choice?
And finally, we'd like to generate the linked and discoverable output files from just a bunch of somewhat organised YAML and CSV input files using Python scripts. So far we have just written a bunch of Python classes and scripts based on Python dicts/lists and YAML/JSON files. Is there a Python package that would help with the task of generating the linked data files?
Apologies for the long and complex question! I hope it's still in scope for SO and someone will have some advice to share.
Judging from the breadth of your question, you are new to linked data. The least "strange" format for you might be the Data Package. In the most common case it's just a zip archive of a CSV file plus JSON metadata, and it has a Python package.
If you want to run queries against the data, you should settle on a database (a triplestore) with a SPARQL endpoint. Take a look at Fuseki. You can then use Turtle or RDF/XML for file export.
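If you go the Fuseki route, queries from Python could look roughly like this with the SPARQLWrapper package; the endpoint URL, dataset name, and class IRI below are placeholders, not from your setup:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://localhost:3030/gammacat/sparql")
    sparql.setQuery("""
        SELECT ?source WHERE { ?source a <http://example.org/GammaRaySource> }
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["source"]["value"])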
If the data comes from some kind of a tool, you can model the domain it represents using Eclipse Lyo (tutorial).
These tools are maintained by three different communities; you can reach out to their user mailing lists separately if you have further questions about them.

Looking for text in logs

I have a new project: developing a log reader in Python 3.5 for txt files, and I don't know how to start. The main goal is to extract pieces of the logs (substrings) from a large and complex txt log file and display them in a structured way on a web page. Could you please suggest some libraries and functions to start with? I'm sorry, but I'm quite new to Python. Thanks!
I would say you read each of the .txt files you mentioned into a list of strings, then use a for-loop:
for line in lines:
Then use some if statements to look for keywords, and print certain messages depending on the input.
With this method you could also look for certain words appearing in a certain order, by keeping previous_line = line at the end of each loop iteration; see the sketch below.
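Putting those pieces together, a minimal sketch (the file name and the keywords are placeholders):

    with open("server.log", encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]

    previous_line = None
    for line in lines:
        if "ERROR" in line:
            print("error:", line)
        # two keywords in a certain order, on consecutive lines
        if previous_line is not None and "START" in previous_line and "FAIL" in line:
            print("failure right after start:", line)
        previous_line = line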

python advanced search library

I have around 80,000 text files and I want to be able to do an advanced search on them.
Let's say I have two lists of keywords and I want to return all the files that include at least one of the keywords in the first list and at least one in the second list.
Is there already a library that does this? I don't want to rewrite it if it exists.
As you need to search the documents multiple times, you will most likely want to index the text files to make such searches as fast as possible.
Implementing a reasonable index yourself is certainly possible, but a quick search led me to Whoosh:
https://pypi.python.org/pypi/Whoosh/
http://pythonhosted.org/Whoosh/
Take a look at the documentation. It should hopefully be fairly straightforward to achieve the desired behaviour; see the sketch below.
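As a rough sketch of how that could look with Whoosh (the field names, index directory, and keywords are my own placeholders, not from your setup):

    import os
    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.qparser import QueryParser

    # build the index once
    schema = Schema(path=ID(stored=True, unique=True), content=TEXT)
    os.makedirs("indexdir", exist_ok=True)
    ix = create_in("indexdir", schema)
    writer = ix.writer()
    for name in ["a.txt", "b.txt"]:  # loop over your 80,000 files here
        with open(name, encoding="utf-8") as f:
            writer.add_document(path=name, content=f.read())
    writer.commit()

    # "at least one keyword from each list" maps to (a OR b) AND (c OR d)
    query = QueryParser("content", ix.schema).parse("(alpha OR beta) AND (gamma OR delta)")
    with ix.searcher() as searcher:
        for hit in searcher.search(query, limit=None):
            print(hit["path"])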
I get the feeling you may want MapReduce-style processing for the search. It should be very scalable, and Python has MapReduce packages available.

Comparing two files and saving the report to another file

I would like to compare the data of two files and store the report in another file. I tried using WinMerge by invoking cmd.exe with the subprocess module in Python 3.2. I was able to get the difference report but wasn't able to save it. Is there a way, with WinMerge or with any other comparison tool (DiffMerge/KDiff3), to save the difference report using cmd.exe on Windows 7? Please help.
Though your question is quite old, I am surprised it hasn't been answered yet. I was searching for an answer myself and, funnily enough, found your question. You mix quite a few questions into one post, so I decided to answer the main headline, where I suppose you are trying to compare human-readable file contents.
To compare two files, there is the difflib library, which is part of the Python standard distribution.
By the way, an example of how to build a utility to compare files can be found on Python's documentation website.
The link is here: Helpers for computing deltas
From there you can learn to add options and save the deltas to e.g. a text file. Some of these examples also produce a git-diff-like output, which may help you solve your question.
This means that, if you can run your own script, no other delta tools are required. It makes little sense to drive other tools from Python via cmd.exe and try to control them... :)
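For instance, a unified diff report written straight to a file (the file names are placeholders):

    import difflib

    with open("old.txt", encoding="utf-8") as f:
        old = f.readlines()
    with open("new.txt", encoding="utf-8") as f:
        new = f.readlines()

    diff = difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt")
    with open("report.txt", "w", encoding="utf-8") as out:
        out.writelines(diff)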
Maybe this website, with explanations and code examples, will also help you:
difflib – Compare sequences
I hope that helps you a bit.
EDIT: I forgot to mention that the last site contains a straightforward example of how to generate HTML output:
HTML Output
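In the same spirit, difflib.HtmlDiff renders a side-by-side comparison as a complete HTML page (again with placeholder file names):

    import difflib

    with open("old.txt", encoding="utf-8") as f:
        old = f.readlines()
    with open("new.txt", encoding="utf-8") as f:
        new = f.readlines()

    html = difflib.HtmlDiff().make_file(old, new, "old.txt", "new.txt")
    with open("report.html", "w", encoding="utf-8") as out:
        out.write(html)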
