Input data from multiple sources into Hadoop (HDFS) - Python

How can I put different data from multiple sources into HDFS using Python?
I already tried loading a SQL file using PySpark (in the PyCharm IDE) and it worked.
Now I need more functionality that allows me to ingest different other kinds of data into HDFS.

PySpark is very versatile - it can read many kinds of input via Spark SQL or Structured Streaming. You'll need to be more specific about which sources you are trying to load from.
However, if you want a more accessible way to ingest lots of data, that is what apache-kafka was explicitly built for. If you prefer not having to write lots of code, you may also look at apache-nifi, which integrates nicely with the Hadoop ecosystem.
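As a rough illustration of what that can look like in PySpark (the file names, HDFS paths, and JDBC connection details below are placeholders, not anything from the original question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-to-hdfs").getOrCreate()

# Read a few different kinds of sources into DataFrames
csv_df  = spark.read.option("header", "true").csv("file:///data/input/customers.csv")
json_df = spark.read.json("file:///data/input/events.json")
jdbc_df = (spark.read.format("jdbc")                                # needs the matching JDBC driver jar on the classpath
           .option("url", "jdbc:postgresql://dbhost:5432/mydb")     # placeholder connection
           .option("dbtable", "public.orders")
           .option("user", "reader")
           .option("password", "secret")
           .load())

# Write each one into HDFS (Parquet is a common storage format there)
csv_df.write.mode("overwrite").parquet("hdfs:///warehouse/customers")
json_df.write.mode("overwrite").parquet("hdfs:///warehouse/events")
jdbc_df.write.mode("overwrite").parquet("hdfs:///warehouse/orders")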

Related

Uploading files using a design pattern with Python

I upload CSV, Excel, JSON, or GeoJSON files into a PostgreSQL database using Python/Django.
I noticed that the scripts are redundant and sometimes difficult to maintain when we need to update keys or columns. Is there a way to use a design pattern? I have never used one before.
Any suggestions or links would help!
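As an illustration of what a design pattern could look like here, a minimal strategy-pattern sketch: the loader registry, table name, and connection string are hypothetical, and the GeoJSON case is only hinted at in a comment.
import pandas as pd
from sqlalchemy import create_engine

# One loader per file type - the "strategy" is looked up in this registry,
# so supporting a new format means adding one entry instead of another script.
LOADERS = {
    ".csv":  pd.read_csv,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
    # ".geojson": geopandas.read_file,  # would also need geopandas/to_postgis for the geometry column
}

def upload(path: str, table: str, engine) -> None:
    ext = path[path.rfind("."):].lower()
    frame = LOADERS[ext](path)                       # dispatch to the matching loader
    frame.to_sql(table, engine, if_exists="append", index=False)

engine = create_engine("postgresql://user:password@localhost:5432/mydb")  # placeholder DSN
upload("uploads/measurements.csv", "measurements", engine)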

Using symbolic links to organize data without actually changing locations?

So I'm building a pipeline that is not going into a real production environment. Basically, I have some data in a defined folder structure, and I want to access it from different stages in my pipeline. Right now, data is ordered kind of like this
.../data/03-14-2019/unprocessed/raw/student_id_12345-raw.csv
or
.../data/05-04-2020/processed/position/student_id_1234345-position.csv
Now, I wrote a modular pipeline that looks in a folder and runs the pipeline on all the .csv files in all the contained directories. If I point it at .../data/03-14-2019/unprocessed/raw/ then my pipeline will process all of the raw data for every student. I built this under the assumption that we were going to rename all the files to a more manageable schema, but things may have changed. My question is this: using the os.link() functionality in Python 3, would it be possible to make an alternate filepath system that includes what I want? For example, one way I may want to go through files might be:
.../data/unprocessed/student_id_12345/2019/03/14/raw/student_id_12345-raw.csv
or maybe
.../data/unprocessed/raw/2019/03/14/student_id_12345/student_id_12345-raw.csv
depending on whether I want to process a certain batch of students or only raw data from a certain day. I remember using a batch renaming tool as part of either Total Commander or Nautilus, but I don't remember if it could do symbolic links. Basically, I want to use symbolic links to build a directory structure on top of an existing structure.
I was going to try implementing this, but I figured I should check whether anyone has already done this or whether there are existing solutions before I started, as well as maybe get some suggestions on where to start. Thanks!
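As a hedged sketch of the idea: os.symlink() creates symbolic links (os.link() creates hard links), so an alternate view can be built without touching the original files. The source layout below is taken from the example paths in the question; the view root and exact target layout are assumptions.
import os
from pathlib import Path

SOURCE_ROOT = Path("data")             # existing structure: data/<MM-DD-YYYY>/<stage>/<kind>/...
VIEW_ROOT   = Path("data_by_student")  # alternate view built purely from symlinks

for csv_path in SOURCE_ROOT.rglob("*.csv"):
    # assumes the layout from the question: data/<MM-DD-YYYY>/<stage>/<kind>/<file>.csv
    if len(csv_path.parts) != 5:
        continue
    date_str, stage, kind = csv_path.parts[1:4]
    month, day, year = date_str.split("-")
    student_id = csv_path.stem.split("-")[0]         # e.g. "student_id_12345"

    # target layout: data_by_student/<stage>/<student>/<YYYY>/<MM>/<DD>/<kind>/<file>.csv
    link_path = VIEW_ROOT / stage / student_id / year / month / day / kind / csv_path.name
    link_path.parent.mkdir(parents=True, exist_ok=True)
    if not link_path.exists():
        os.symlink(csv_path.resolve(), link_path)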

Is there another/similar method for spark.read.format.load outside of Databricks?

I am trying to load an Avro file into a Spark dataframe so I can convert it to a pandas DataFrame and eventually a dictionary. The method I want to use:
df = spark.read.format("avro").load(avro_file_in_memory)
(Note: the Avro data I'm trying to load into the dataframe is already in memory, as the body of a response obtained with python requests)
However, this function relies on a package native to the Databricks environment, which I am not working in (I looked into PySpark for a similar function/code but could not see anything myself).
Is there any similar function that I can use outside of Databricks to produce the same results?
That Databricks library is open source, but was actually added to core Spark in 2.4 (though still an external library)
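Outside Databricks you can pull that external module in yourself when building the session; a sketch (the version string is an assumption - pick the spark-avro artifact matching your Spark/Scala build):
from pyspark.sql import SparkSession

# spark-avro ships separately from core Spark; spark.jars.packages fetches it at startup
spark = (SparkSession.builder
         .appName("read-avro")
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.1")
         .getOrCreate())

# Note: .load() expects a path (local/HDFS/S3), not an in-memory requests response
df = spark.read.format("avro").load("hdfs:///path/to/file.avro")
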
In any case, there's a native avro Python library, as well as fastavro, so I'm not entirely sure if you want to be starting up a JVM (because you're using Spark), just to load Avro data into a dictionary. Besides that, an Avro file consists of multiple records, so it would at the very least be a list of dictionaries
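For example, with fastavro you can decode the in-memory response straight into a list of dicts, no Spark/JVM involved (the URL is a placeholder for wherever the Avro payload comes from):
import io
import requests
import fastavro

resp = requests.get("https://example.com/export.avro")  # placeholder endpoint

# fastavro.reader takes any file-like object and yields one dict per Avro record
records = list(fastavro.reader(io.BytesIO(resp.content)))
print(records[0])
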
Basically, I think you're better off using the approach from your previous question, but start by writing the Avro data to disk, since that seems to be your current issue.
Otherwise, maybe a little more searching for what you're looking for would solve this XY problem you're having
https://github.com/ynqa/pandavro

Python JSON API for linked data, with flat files

We're creating gamma-cat, an open data collection for gamma-ray astronomy, and are looking for advice (here, or links to resources, formats, tools, packages) on how best to set it up.
The data we have consists of measurements for different sources, from different papers. It's pretty heterogeneous, sometimes there's data for multiple sources in one paper, for each source there's usually several papers, sometimes there's no spectrum, sometimes one, sometimes many, ...
Currently we just collect the data in an input folder as YAML and CSV files, and now we'd like to expose it to users. Mainly access from Python, but also from Javascript and accessible from a static website.
The question is what format and organisation we should use for the data, and whether there are any Python packages that will help us generate the output files as a set of linked data, as well as Python and Javascript packages that will help us access them.
We would like to get multiple "views" or simple "queries" of the data, e.g. "list of all sources", "list of all papers", "list of all spectra for source X", "spectrum A from paper B for source C".
For format, probably JSON would be a good choice? Although YAML is a bit nicer to read, and it's possible to have comments and ordered maps. We're storing the output files in a git repo, and have had a lot of meaningless diffs for JSON files because key order changes all the time.
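On the key-order issue specifically: if the JSON files are generated by your Python scripts, serializing with sorted keys makes the output deterministic and keeps the git diffs meaningful (a minimal sketch; the record content and file name are made up):
import json

record = {"source_id": "xyz-1", "papers": ["2016A&A...some-ref"], "n_spectra": 2}

with open("output/sources/xyz-1.json", "w") as fh:
    json.dump(record, fh, indent=2, sort_keys=True)
    fh.write("\n")  # a trailing newline also avoids noisy end-of-file diffs
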
To make the datasets discoverable and linked, I don't know what to use. I found e.g. http://jsonapi.org/ but that seems to be for REST APIs, not for just a series of flat JSON files on a static webserver? Maybe it could still be used that way?
I also found http://json-ld.org/ which looks relevant, but also pretty complex. Would either of those or something else be a good choice?
And finally, we'd like to generate the linked and discoverable output files from just a bunch of somewhat organised YAML and CSV input files, using Python scripts. So far we have just written a bunch of Python classes or scripts based on Python dicts / lists and YAML / JSON files. Is there a Python package that would help with that task of generating the linked data files?
Apologies for the long and complex question! I hope it's still in scope for SO and someone will have some advice to share.
Judging from the breadth of your question, you are new to linked data. The least "strange" format for you might be the Data Package. In the most common case it's just a zip archive of a CSV file and JSON metadata. It has a Python package.
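A rough sketch with the datapackage Python package - the exact calls are worth double-checking against its current documentation, and the file paths are placeholders:
from datapackage import Package

# Infer a descriptor (schema + resource list) from the existing CSV files and save it
package = Package()
package.infer("input/**/*.csv")
package.save("datapackage.json")

# Consumers can then load the descriptor and read any resource back as rows
package = Package("datapackage.json")
rows = package.get_resource(package.resource_names[0]).read(keyed=True)
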
If you want to run queries against the data, you should set up a database (a triplestore) with a SPARQL endpoint. Take a look at Fuseki. You can then use Turtle or RDF/XML for file export.
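Once something like Fuseki is running, a query from Python could look roughly like this (using the SPARQLWrapper package; the endpoint URL and predicate are placeholders for whatever vocabulary you settle on):
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:3030/gammacat/sparql")  # placeholder Fuseki endpoint
sparql.setQuery("""
    SELECT ?source ?paper
    WHERE { ?source <http://example.org/schema/describedIn> ?paper }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["source"]["value"], row["paper"]["value"])
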
If the data comes from some kind of a tool, you can model the domain it represents using Eclipse Lyo (tutorial).
These tools are maintained by three different communities; you can reach out to their user mailing lists separately if you have further questions about them.

Automating IBM SPSS Data Collection survey export?

I'm so sorry for the vague question here, but I'm hoping an SPSS expert will be able to help me out here. We have some surveys that are done via SPSS, from which we extract data for an internal report. Right now the process is very cumbersome and requires going to the SPSS Data Collection Interviewer Server Administration page and manually exporting data from two different projects (which takes hours at a time!). We then take that data, massage it, and upload it to another database that drives the internal report.
My question is, does anyone out there know how to automate this process? Is there a SQL Server database behind the SPSS data? Where does the .mdd file come in to play? Can my team (who is well-versed in extracting data from various sources) tap into the SQL Server database behind SPSS to get our data? Or do we need some sort of Python script and plugin?
If I'm missing information that would be helpful in answering the question, please let me know. I'm happy to provide it; I just don't know what to provide.
Thanks so much.
As mentioned by other contributors, there are a few ways to achieve this. The simplest I can suggest is using a DMS (data management script) together with the Windows Task Scheduler. Ideally you should follow the steps below.
Prerequisites:
1. You should have access to the server running IBM SPSS Data Collection
2. Basic knowledge of the Windows Task Scheduler
3. Knowledge of DMS scripting
Approach:
1. Create a new DMS script from the template
2. If you only want to perform data extraction / transformation, you only need an input and an output data source
3. In the input data source, create/build the connection string pointing to your survey on the IBM SPSS Data Collection server. Use SQL as the data source
4. In the select query, use "SELECT * FROM VDATA" if you want to export all variables
5. Set the output data connection string, selecting SPSS as the output data format (if you want the export in SPSS format)
6. Run the script manually and check that the SPSS export is what you expect
7. Create a batch file using a text editor (save it with a .bat extension) and add the lines below:
cd "C:\Program Files\IBM\SPSS\DataCollection\6\DDL\Scripts\Data Management\DMS"
Call DMSRun YOURDMSFILENAME.dms
Then add a line to copy (using XCOPY) the extracted data / files to the location where you want to process them further.
Save the file and open the Windows Task Scheduler to schedule the execution of this batch file for data extraction.
If you want to do any further processing, you can create an .mrs or .dms file and add it to the batch file.
Hope this helps!
There are a number of different ways you can accomplish easing this task and even automate it completely. However, if you are not an IBM SPSS Data Collection expert and don't have access to somebody who is or have the time to become one, I'd suggest getting in touch with some of the consultants who offer services on the platform. Internally IBM doesn't have many skilled SPSS resources available, so they rely heavily on external partners to do services on a lot of their products. This goes for IBM SPSS Data Collection in particular, but is also largely true for SPSS Statistics.
As noted by previous contributors, there is an approach using Python for data cleaning, merging and other transformations and then loading that output into your report database. For maintenance reasons I'd probably not suggest this approach. Though you are most likely able to automate the export of data from SPSS Data Collection to a .sav file with simple SPSS syntax (and an SPSS add-on data component), it is extremely error prone when upgrading either SPSS Statistics or SPSS Data Collection.
From a best practice standpoint, you ought to use the SPSS Data Collection Data Management module. It is very flexible and hardly requires any maintenance on upgrades, because you are working within the same data model framework (e.g. survey metadata, survey versions, labels etc. are handled implicitly) right until you load your transformed data into your reporting database.
Ideally the approach would be to build the mentioned SPSS Data Collection Data Management script and trigger it at the end of each completed interview. In this way your reporting will be close to real-time (you can make it actual real-time by triggering the DM script during the interview using the interview script events - just an FYI).
All scripting on the SPSS Data Collection platform including Data Management scripting is very VB-like, so for most people knowing VB, it is very easy to get started and it is documented very well in the SPSS Data Collection DDL. There you'll also be able to find examples of extracting survey data from SPSS Data Collection surveys (as well as reading and writing data to/from other databases, files etc.). There are also many examples of data manipulation and transformation.
Lastly, to answer your specific questions:
Yes, there is always an MS SQL Server behind SPSS Data Collection - no exceptions. However, generally speaking the data model is way too complex to read data out of it directly. If you have a look at it, you'll quickly realize this.
The MDD file (short for Meta Data Document) contains all the survey metadata, including data source specifications, version history etc. Without it you won't be able to make anything of the survey data in the database, which is the main reason I'd suggest staying within the SPSS Data Collection platform for as large a part of your data handling as possible. However, it is indeed just a readable XML file.
Note that the SPSS Data Collection Data Management module requires a separate license, and if the scripting needed is large or complex, you'd probably want Base Professional too, if that's not what you already use for developing the questionnaires and handling the surveys.
Hope that helps.
This isn't as clean as working directly with whatever database is holding the data, but you could do something with an exported data set:
There may or may not be a way for you to write and run an export script from inside your Admin panel or whatever. If not, you could write a simple Python script using Selenium WebDriver which logs into your admin panel and exports all data to a *.sav data file.
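A very rough sketch of what that Selenium step could look like; the URL, element locators, and credentials are all hypothetical and would have to be taken from the real admin pages:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://your-interviewer-server/admin")  # hypothetical admin URL

# Hypothetical element IDs - inspect the real login and export pages for the actual ones
driver.find_element(By.ID, "username").send_keys("report_user")
driver.find_element(By.ID, "password").send_keys("secret")
driver.find_element(By.ID, "loginButton").click()

# Navigate to the project's export page and trigger the .sav export
driver.find_element(By.LINK_TEXT, "Export Data").click()
driver.find_element(By.ID, "exportButton").click()

driver.quit()
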
Then you can use the Python SPSS extensions to write your analysis scripts. Note that these scripts have to run on a machine that has a copy of SPSS installed.
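Those extensions expose an spss module through which ordinary SPSS syntax can be submitted from Python; a tiny sketch (the file path and variable names are placeholders, and it only runs where SPSS Statistics plus its Python plugin are installed):
import spss

# Open the exported .sav file and run plain SPSS syntax from Python
spss.Submit("""
GET FILE='C:/exports/survey_export.sav'.
FREQUENCIES VARIABLES=Q1 Q2 Q3.
""")
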
Once you have your data and analysis results accessible to Python, you should be able to easily write that to your other database.
