My use case appears to be different from the suggested answers to similar questions. I need to iterate over a list of Git repos using the GitPython module, do a shallow clone, iterate over each branch, and then run an operation on the contents of each branch. The result of each operation will be captured as a DataFrame with data in specific columns.
It's been suggested that I could use a ThreadPoolExecutor for this, grab the DataFrame produced for each repo, and then aggregate them into a single DataFrame. I could use to_csv() to create a file for each repo and branch and aggregate those files when the pool finishes, but I'm wondering if I can do the same entirely in memory, without going the CSV-file route. Or is it possible for each thread to add rows to a single aggregate DataFrame without overwriting data?
Any feedback on the pros and cons of various approaches would be appreciated.
My intention would also be to avoid writing everything to CSV files. That costs time which can be avoided if you write the output to a list instead. Just create a list of DataFrames: each time a worker finishes its repo, append its result to the list. Once all DataFrames have been added to the list, it is easy to merge (actually concatenate) the list into one DataFrame.
Here is an example of how to merge a list of DataFrames into one:
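A minimal sketch of that idea, assuming a hypothetical process_repo() worker that clones a repo, walks its branches, and returns one DataFrame (the real GitPython logic is omitted):

import concurrent.futures
import pandas as pd

def process_repo(repo_url):
    # hypothetical worker: clone the repo, iterate its branches,
    # and return the results as a DataFrame
    return pd.DataFrame({"repo": [repo_url], "branch": ["main"], "result": [42]})

repo_urls = ["https://example.com/a.git", "https://example.com/b.git"]  # placeholder URLs
frames = []

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(process_repo, url) for url in repo_urls]
    for future in concurrent.futures.as_completed(futures):
        # append each finished worker's DataFrame to the list
        frames.append(future.result())

# concatenate the list of DataFrames into a single one
combined = pd.concat(frames, ignore_index=True)

Because only the main thread touches the list after the pool finishes, there is no risk of threads overwriting each other's rows, which avoids the shared-DataFrame problem entirely.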
Long-time SAS user, new to Databricks, trying to migrate some basic code.
I'm running into an extremely basic join issue but cannot find a solution.
In SAS (proc sql), when I run the following code, SAS is smart enough to realize that the joining columns are obviously on both the left and right tables, and so only produces one instance of those variables.
e.g.
proc sql;
create table work.test as select * from
data.table1 t1
left join data.table2 t2 on (t1.bene_id=t2.bene_id) and (t1.pde_id=t2.pde_id)
;
quit;
This code runs just fine.
However, when I run the same thing in Databricks, it produces both instances of the bene_id and pde_id fields, and therefore bombs out when it tries to create the table (because it's trying to create columns with the same name).
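(For reference, a rough Spark SQL equivalent of the query above, using the same table names; the select itself runs, but saving the result bombs out because both copies of bene_id and pde_id come through.)

# rough equivalent; "work.test" and the source tables are placeholders
temp = spark.sql("""
    select *
    from data.table1 t1
    left join data.table2 t2
      on t1.bene_id = t2.bene_id and t1.pde_id = t2.pde_id
""")
temp.write.saveAsTable("work.test")  # fails: duplicate column names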
I realize one solution is to not use * in the select statement and instead manually specify each field, making sure I only select a single instance of each field, but with the number of joins happening plus the number of fields I'm dealing with, this is a real waste of time.
I also came across another potential solution with this sort of syntax:
%python
from pyspark.sql import *
t1 = spark.read.table("data1")
t2 = spark.read.table("data2")
temp=t1.join(t2,["bene_id","pde_id"],"left")
However, this only suppresses the duplicates for the fields being joined on (i.e. bene_id and pde_id). If there were a 3rd field, say srvc_dt, present in both tables but not used in the join, it would again come through twice and bomb out.
Finally, I realize another solution is to write some code to dynamically rename the columns in both the left and right tables so that all columns always have unique names. I just feel like there has to be a simple way to achieve what SAS is doing without requiring all these workarounds, and I'm just not aware of it.
Thanks for any advice.
You have to either rename the columns, drop one of the duplicates before joining or use aliases as described in this answer.
Spark wants you to be very explicit about which column you want to keep, so that you are not accidentally dropping columns.
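A small sketch of the drop/rename options, reusing the table names from the question and an assumed shared column srvc_dt:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

t1 = spark.read.table("data1")
t2 = spark.read.table("data2")

# joining on the key names keeps a single copy of bene_id and pde_id
joined = t1.join(t2, ["bene_id", "pde_id"], "left")

# option 1: drop the right-hand copy of any other shared column
# (srvc_dt is an assumed example)
joined = joined.drop(t2.srvc_dt)

# option 2: rename the clashing column on one side before joining
t2_renamed = t2.withColumnRenamed("srvc_dt", "srvc_dt_t2")
joined2 = t1.join(t2_renamed, ["bene_id", "pde_id"], "left")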
I wish to copy data from S3 to Redshift.
However, the COPY command duplicates the rows every time the Lambda function is triggered:
cur.execute('copy table from S3...... ' )
Can someone suggest other ways to do it without truncating existing data?
For the commenters: I tried to push directly from the dataframe to Redshift with append. There is a library, pandas_redshift, but it needs an S3 connection first (which might solve the appending issue).
I also tried
# cur.execute('truncate')  # this would keep the table empty, but I don't have delete rights
cur.execute('select distinct * from ABC.xyz')
cur.execute('copy......')
results keep appending...
Can someone please suggest code or the right sequence of commands?
Unfortunately, there is no straightforward COPY option that performs an upsert and handles the duplicates for you.
If you don't want to truncate the table, there are two workarounds:
You can create a staging table, copy the data into it first, and then merge it into the target table; that effectively acts as an upsert (see the sketch after these links).
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html
You can use a manifest file to control which files are copied and which should be skipped.
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#copy-command-examples-manifest
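A rough sketch of the staging-table merge, following the cursor-based style in the question. The S3 path, IAM role, and the key column "id" are placeholders; ABC.xyz stands for your target table:

cur.execute("begin;")
cur.execute("create temp table stage (like ABC.xyz);")
# load the new batch into the staging table instead of the target
cur.execute("""
    copy stage
    from 's3://my-bucket/my-prefix/'
    iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
    format as csv;
""")
# delete rows that are about to be re-inserted, then insert the new batch
cur.execute("delete from ABC.xyz using stage where ABC.xyz.id = stage.id;")
cur.execute("insert into ABC.xyz select * from stage;")
cur.execute("drop table stage;")
cur.execute("commit;")

Note that the delete step needs delete rights on the target table; if you only have insert rights, the manifest approach is the remaining option.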
I have three separate indexes in Elasticsearch, each storing data every second. I would like to combine the data from all three indexes at @timestamp T and return a single document.
How do I achieve this? Is there a query I can write?
I have been reading about denormalization. Do I have to write a script, in something like Python, that creates a new document in a new index with the combined data for every unique @timestamp? If so, how does this script get executed so that the index is always up to date? Manually? Cron? Triggered by Elasticsearch when a condition is met?
I am new to elasticsearch so any help and sample code is greatly appreciated.
I've built some tools that create front-end list boxes for users, backed by dynamic Redshift tables. When new items appear in the table, they automatically appear in the list.
I want to put the list in alphabetical order in the database so the dynamic list boxes will show the data in that order.
After downloading the list from an API, I attempt to sort the list alphabetically in a Pandas dataframe before uploading. This works perfectly:
df.sort_values(['name'], inplace=True, ascending=True, kind='heapsort')
But when I then upload to Redshift in that order, the order is lost during the upload: the data ends up in chunks of alphabetically ordered segments.
db_conn = create_engine('<redshift connection>')
obj.to_sql('table_name', db_conn, index = False, if_exists = 'replace')
Because of the way the third party tool (Alteryx) works, I need to have this data in alphabetical order in the database.
How can I modify to_sql to properly upload the data in order?
While ingesting data into Redshift, the rows get distributed between the slices on each node of your Redshift cluster.
My suggestion would be to create a sort key on the column that you need sorted. Once that column has a sort key, you can run the VACUUM command to get your data sorted.
Sorry! I cannot be of much help on Python/Pandas
If I’ve made a bad assumption please comment and I’ll refocus my answer.
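For illustration only, a sketch of that suggestion run from the same SQLAlchemy engine; the table name and the "name" column are assumed. Note that to_sql with if_exists='replace' drops and recreates the table, so the sort key would need to be re-applied after each upload (or use if_exists='append' into a pre-created table).

from sqlalchemy import create_engine, text

db_conn = create_engine('<redshift connection>')

# AUTOCOMMIT is needed because VACUUM cannot run inside a transaction block
with db_conn.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
    # assumed: the table uploaded by to_sql is table_name and has a "name" column
    conn.execute(text("alter table table_name alter sortkey (name);"))
    conn.execute(text("vacuum full table_name;"))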
Using elasticsearch-py, I would like to remove all documents from a specific index, without removing the index. Given that delete_by_query was moved to a separate plugin, I want to know what is the best way to go about this?
It is highly inefficient to delete all the docs with delete-by-query. A more direct and correct approach is:
Getting the current mappings (Assuming you are not using index templates)
Dropping the index by DELETE /indexname
Creating the new index and the mappings.
This will take a second; the former will take much, much more time and cause unnecessary disk I/O.
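A sketch of that with elasticsearch-py, using an assumed index name:

from elasticsearch import Elasticsearch

es = Elasticsearch()
index = "myindex"  # assumed index name

# save the current mappings so the empty index can be recreated identically
mappings = es.indices.get_mapping(index=index)[index]["mappings"]

# drop the index entirely, then recreate it with the saved mappings
es.indices.delete(index=index)
es.indices.create(index=index, body={"mappings": mappings})

If you also rely on custom index settings or aliases, fetch and reapply those the same way before recreating.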
Use a Scroll/Scan API call to gather all Document IDs and then call batch delete on those IDs. This is the recommended replacement for the Delete By Query API based on the official documentation.
EDIT: Requested information for using this specifically in elasticsearch-py. Here is the documentation for the helpers. Use the scan helper to scan through all documents, and use the bulk helper with the delete action to delete all of the IDs.
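A sketch of that with the elasticsearch-py helpers (index name assumed; drop the _type field on newer Elasticsearch versions that no longer use mapping types):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
index = "myindex"  # assumed index name

# scan through every document, fetching only the metadata, not the _source
hits = helpers.scan(es, index=index, query={"query": {"match_all": {}}}, _source=False)

# turn each hit into a bulk delete action and stream it to the bulk helper
actions = (
    {"_op_type": "delete", "_index": hit["_index"], "_type": hit["_type"], "_id": hit["_id"]}
    for hit in hits
)
helpers.bulk(es, actions)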