Writing to elasticsearch indexes dynamically via pyspark - python

I have a pyspark DataFrame like this:
my_df = spark.read.load("some-parquet-path")
I'd like to be able to write it out to some elasticsearch indexes dynamically based on the contents of the "id" column in my DataFrame. I tried doing this:
my_df.write.format(
    "org.elasticsearch.spark.sql"
).mode('overwrite').options(**conf).save("my_index_{id}/my_type")
But I get:
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: no such index
How can I do this?
Update
This seems to work when I change the mode from 'overwrite' to 'append'. It would be great to have an explanation of why that is the case...
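A likely explanation (my reading of the elasticsearch-hadoop behaviour, not something confirmed in the question): with mode('overwrite') the connector first tries to delete the existing target, and at that point the resource name is still the literal, unresolved pattern "my_index_{id}", which does not exist, hence "no such index". With mode('append') there is no upfront delete, so the {id} placeholder is resolved per document and the matching indexes can be created on the fly (assuming es.index.auto.create is left at its default). A minimal sketch of the working write, with placeholder connection settings:
conf = {
    "es.nodes": "localhost",  # placeholder connection settings, adjust to your cluster
    "es.port": "9200",
}

my_df.write.format(
    "org.elasticsearch.spark.sql"
).mode("append").options(**conf).save("my_index_{id}/my_type")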

Related

How to populate DataFrame column from JSON string element in another column

I have a DataFrame with a "details" column that I believe is a dictionary. The initial data is a JSON string parsed with json.loads, then converted from a dictionary to DataFrame. I would like to populate a new "affectedDealId" column with the value in data['details']['actions']['affectedDealId'].
I'm hoping I can do this the "DataFrame" way without using a loop with something like:
data['affectedDealId'] = data['details'].get('actions').get('affectedDealId')
To simplify I've tried:
data['actions'] = data['details'].get('actions')
But that ends up as "None".
Also, data['details'] seems to be a Series, though I think it was a dictionary before converting it to a DataFrame.
Alternatively, I do later loop through the DataFrame. How would I access that 'affectedDealId' element?
Below is a screenshot of the DataFrame from the PyCharm debugger.
I'm making some assumptions about the details JSON, but does this help? You'll have to adjust the json.loads(x) key/index to extract the right location.
df['affectedDealId'] = df['details'].apply(lambda x: json.loads(x)['affectedDealId'])
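If the value is actually nested under 'actions', as described in the question, the same idea extends to the nested keys. A small sketch, assuming 'details' holds a JSON string shaped like {"actions": {"affectedDealId": ...}} (adjust the keys if the real layout differs):
import json
import pandas as pd

# toy frame standing in for the OP's data; the structure is assumed
df = pd.DataFrame({'details': ['{"actions": {"affectedDealId": 42}}']})

# parse the JSON string per row and walk the nested keys
df['affectedDealId'] = df['details'].apply(
    lambda x: json.loads(x).get('actions', {}).get('affectedDealId')
)
print(df['affectedDealId'])  # 0    42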
I think it would be great if you could do something like this.
This creates a new data frame off your JSON column by applying pd.Series:
data_frame_new = df['details'].apply(pd.Series)
Then reassign your data frame by concatenating data_frame_new with your existing data frame:
df = pd.concat([df, data_frame_new], axis=1)
print(df)
This approach worked for me on a recent project.
Your affectedDealId will become a column of its own with the data populated.
It may be of help to you.
Thanks

Pandas one-line filtering for the entire dataset - how is it achieved?

I am just now diving into this wonderful library and am pretty baffled by how filtering, or even column manipulation, is done and am trying to understand if this is a feature of pandas or of python itself. More precisely:
import pandas
df = pandas.read_csv('data.csv')
# Doing
df['Column'] # would display all values from Column for dataframe
# Even moreso, doing
df.loc[df['Column'] > 10] # would display all values from Column greater than 10
# and is the same with
df.loc[df.Column > 10]
So columns are both attributes and keys, so a DataFrame is both a dict and an object? Or perhaps I am missing some basic Python functionality that I don't know about... And does accessing a column basically loop over the whole dataset? How is this achieved?
Column filtering, column manipulation, and overall data manipulation in a data set are features of the pandas library itself. Once you load your data using pd.read_csv, the data set is stored as a pandas DataFrame, a dictionary-like container, and every column of the DataFrame is a pandas Series object. It depends on how you're trying to access the column, whether as an attribute (df.columnname) or as a key (df['columnname']), but either way you get the same Series back, and methods like .head(), .tail(), .shape or .isna() work with both forms of access. When you access a column, pandas looks the name up in the DataFrame's column index; if it matches, the column is returned, otherwise a KeyError or AttributeError is raised, depending on how you accessed it.
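A small self-contained example may make the mechanics clearer. The comparison df['Column'] > 10 is vectorized: it produces a boolean Series of the same length, and .loc then uses that mask to select rows, so there is no explicit Python loop. df.Column works because pandas also exposes columns as attributes when the name is a valid Python identifier. A sketch with made-up data standing in for data.csv:
import pandas as pd

# stand-in for pandas.read_csv('data.csv')
df = pd.DataFrame({'Column': [5, 12, 3, 20]})

# vectorized comparison: one True/False per row, no explicit loop
mask = df['Column'] > 10
print(mask.tolist())           # [False, True, False, True]

# .loc with a boolean Series keeps only the rows where the mask is True
print(df.loc[mask])
print(df.loc[df.Column > 10])  # same result via attribute access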

Column to Transaction ID for association rules on Pandas dataframes in Python

I imported a CSV into Python with Pandas and I would like to be able to use one of the columns as a transaction ID in order to make association rules.
(link: https://github.com/antonio1695/Python/blob/master/nearBPO/facturas.csv)
I hope someone can help me to:
Use UUID as a transaction ID for me to have a dataframe like the following:
UUID Desc
123ex Meat,Beer
In order for me to get association rules like: {Meat} => {Beer}.
Also, a recommendation on a library to do so in a simple way would be appreciated.
Thank you for your time.
You can aggregate values into a list by doing the following:
df.groupby('UUID')['Desc'].apply(list)
This will give you what you want, if you want the UUID back as a column you can call reset_index on the above:
df.groupby('UUID')['Desc'].apply(list).reset_index()
Also, even though the result is a Series, you can still export it to a CSV the same as with a DataFrame:
df.groupby('UUID')['Desc'].apply(list).to_csv(your_path)
You may need to name your index prior to exporting, or, if you find it easier, just call reset_index to restore the index back as a column and then call to_csv.
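Putting the pieces together, here is a sketch of the full flow with made-up rows mirroring the UUID/Desc layout; the association-rules step uses the mlxtend library, which is one option rather than anything prescribed in the question:
import pandas as pd

# toy rows standing in for facturas.csv
df = pd.DataFrame({
    'UUID': ['123ex', '123ex', '456ex'],
    'Desc': ['Meat', 'Beer', 'Meat'],
})

# one basket per transaction, with all items collected into a list
baskets = df.groupby('UUID')['Desc'].apply(list)
print(baskets)
# 123ex    [Meat, Beer]
# 456ex          [Meat]

# optional: mine rules with mlxtend (pip install mlxtend)
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets.tolist()).transform(baskets.tolist()), columns=te.columns_)
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[['antecedents', 'consequents', 'confidence']])  # includes {Meat} => {Beer}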

Python Pandas add new column based on existing column - "Length of values does not match length of index"

I have an existing Pandas dataframe that looks like the following:
I want to create a new column in the dataframe that contains a dictionary with word/word counts derived from an existing column that contains a body of text.
I have got this working on a single row from the dataframe with the following transformation:
from collections import Counter
obama['word_count'] = [dict(Counter(" ".join(obama['text']).split(" ")).items())]
which creates a new column that contains the expected dictionary.
While this works, it gives the following warning:
C:\Anaconda\lib\site-packages\ipykernel\__main__.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
from ipykernel import kernelapp as app
When I use the same transformation above with entire dataframe:
people['word_count'] = [dict(Counter(" ".join(people['text']).split(" ")).items())]
it fails with error message:
ValueError: Length of values does not match length of index
This appears to be an issue of indexes not matching. None of the 'text' values are missing so it is not getting out of sync that way.
I've gone to the URL in the Pandas warning and cannot grasp what it is getting at.
I've also done my Google searches, but I don't feel the results I found apply to my issue.
What is needed to make this add-column procedure work?
There are (at least) two ways to do this:
using a list comprehension, with something like:
people['word_count'] = \
[dict(Counter(i[1]['text'].split(" ")).items()) for i in people.iterrows()]
using the apply method of the DataFrame, with something like:
people['word_count'] = people.apply(
    lambda x: dict(Counter(x['text'].split(" ")).items()), axis=1)
(The second method appears to be a bit faster, but it also doesn't seem to work on the OP's DataFrame; some details are in the comments.)
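For what it's worth, the ValueError comes from the shape of the right-hand side in the original attempt: " ".join(people['text']) collapses every row into one string, so the assigned list contains exactly one dictionary, and pandas cannot align a length-1 list with a longer index. Building one dictionary per row keeps the lengths matched; a small sketch with made-up data:
from collections import Counter
import pandas as pd

# toy frame standing in for the OP's 'people' DataFrame
people = pd.DataFrame({'text': ['a b a', 'b c']})

# one word-count dict per row, so the list length equals len(people)
people['word_count'] = [dict(Counter(t.split(" "))) for t in people['text']]
print(people['word_count'].tolist())
# [{'a': 2, 'b': 1}, {'b': 1, 'c': 1}]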

Creating a Cross Tab Query in SQL Alchemy

I was doing some reading on Google and the SQLAlchemy documentation but could not find any kind of built-in functionality that could take a standard SQL-formatted table and transform it into a cross tab query like Microsoft Access does.
In the past, when using Excel and Microsoft Access, I have created "cross tab" queries. Below is the SQL code from an example:
TRANSFORM Min([Fixed Day-19_Month-8_142040].VoltageAPhase) AS MinOfVoltageAPhase
SELECT [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
FROM [Fixed Day-19_Month-8_142040]
GROUP BY [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
PIVOT [Fixed Day-19_Month-8_142040].Date;
I am very unskilled when it comes to SQL, and the only way I was able to write this was by generating it in Access.
My question is: since SQLAlchemy Python code is really just a nice way of calling or generating SQL code using Python functions/methods, is there a way I could use SQLAlchemy to call a custom query that generates the SQL code (in the block above) to make a cross tab query? Obviously, I would have to change some of the SQL code to shoehorn it in with the correct fields and names, but the keywords should be the same, right?
The other problem is that, in addition to returning the objects for each entry in the table, I would need the field names... I think this is called "metadata"? The end goal is that once I had that information, I would want to output to Excel or CSV using another package.
UPDATED
Okay, so Van's suggestion to use pandas I think is the way to go; I'm currently in the process of figuring out how to create the cross tab:
def OnCSVfile(self, event):
    query = session.query(Exception).filter_by(company=self.company)
    data_frame = pandas.read_sql(query.statement, query.session.bind)  # get the data frame in pandas
    pivot = data_frame.crosstab()
So I have been reading the pandas link you provided and have a question about the parameters.
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True)
Since, I'm calling "crosstab" off the dataframe object, I assume there must be some kind of built-in way the dataframe recognizes column and row names. For index, I would pass in a list of strings that specify which fields I want tabulated in rows? Columns I would pass in a list of strings that specifiy which field I want along the column? From what I know about cross tab queries, there should only be one specification field for column right? For values, I want minimum function, so I would have to pass some parameter to return the minimum value. Currently searching for an answer.
So if I have the following fields in my flat data frame (my original Sequel Query).
Name, Date and Rank
And I want to pivot the data as follows:
Name = Row of Crosstab
Date = Column of Crosstab
Rank = Min Value of Crosstab
Would the function call be something like:
data_frame.crosstab(['Name'], ['Date'], values=['Rank'], aggfunc=min)
I tried this code below:
query = session.query(Exception)
data_frame = pandas.read_sql(query.statement,query.session.bind)
row_list = pandas.Series(['meter_form'])
col_list = pandas.Series(['company'])
print row_list
pivot = data_frame.crosstab(row_list,col_list)
But I get this error about data_frame not having the attribute crosstab:
I guess this might be too much new information for you at once. Nonetheless, I would approach it completely differently. I would basically use the pandas Python library to do all the tasks:
Retrieve the data: since you are using SQLAlchemy already, you can simply query the database for only the data you need (flat, without any CROSSTAB/PIVOT).
Transform: put it into a pandas.DataFrame. For example, like this:
import pandas as pd
query = session.query(FixedDay...)
df = pd.read_sql(query.statement, query.session.bind)
Pivot: call pivot = pd.crosstab(...) to create a pivot in memory; note that crosstab is a top-level pandas function, not a DataFrame method. See pd.crosstab for more information.
Export: save it to Excel/CSV using DataFrame.to_excel or DataFrame.to_csv.
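Tying the update and the answer together, here is a sketch of the whole pipeline using the Name/Date/Rank example from the question. The mapped class name ExceptionRecord and the output path are placeholders; the key point is that crosstab lives on the pandas module, not on the DataFrame:
import pandas as pd

# flat query, no pivoting in SQL; 'session' and the mapped class are assumed
query = session.query(ExceptionRecord)
df = pd.read_sql(query.statement, query.session.bind)

# rows = Name, columns = Date, cells = minimum Rank
pivot = pd.crosstab(
    index=df['Name'],
    columns=df['Date'],
    values=df['Rank'],
    aggfunc='min',
)

# export the pivoted result
pivot.to_excel('crosstab.xlsx')  # or pivot.to_csv('crosstab.csv')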
