ADX, KQL and pandas - python

I have a table in a pandas DataFrame and need to go through a column and check whether each value exists in ADX. How can I do that? I was thinking of setting each entry in pandas as a variable and calling it in KQL. Any ideas how?
Something like this, but I'm not sure how:
val = df['col_name'][0]
%%kql
table_name
| where value == $val
Thanks!

Not sure I understand the requirement; it would be best if you could provide a minimal example with the df in Python and a table in ADX (using the datatable operator).
Anyway, just FYI, you can copy variables from Jupyter to Kusto using let; see the sketch below.
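For reference, a rough sketch of what that can look like, assuming the Kqlmagic extension is loaded and that table_name and value are placeholders for your real table and column. The exact variable-substitution syntax depends on your Kqlmagic version (see its query-parametrization samples), but the idea is to bind the Python value into the query with let:

val = df['col_name'][0]    # cell 1: pick the value from the pandas column

%%kql
// cell 2: Kqlmagic is expected to substitute the Python variable bound with let
// (the exact syntax may vary by Kqlmagic version)
let search_val = val;
table_name
| where value == search_val
| count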

Related

How to create a new column of datatype map<string,string> in pyspark

I want to create a new column (not derived from existing columns) having MapType(StringType(), StringType()). How do I achieve this?
So far I was able to write the code below, but it's not working:
df = df.withColumn("normalizedvariation", lit(None).cast(MapType(StringType(), StringType())))
I would also like to know the different methods to achieve the same. Thank you.
This is one way you can try; newMapColumn here would be the name of the map column (see the sketch below). If that is not what you are looking for, please let me know. Thanks!
You will also have to import the functions using the line below:
from pyspark.sql.functions import col, lit, create_map
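A minimal sketch of the create_map approach, assuming newMapColumn as the new column's name and using literal keys/values purely for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, create_map

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# build a map<string,string> column from alternating literal key/value pairs
df = df.withColumn(
    "newMapColumn",
    create_map(lit("key1"), lit("value1"), lit("key2"), lit("value2"))
)
df.printSchema()   # newMapColumn: map (key: string, value: string)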

Pyspark: Pass parameter to String Column in Dataframe

I'm quite new to PySpark and, coming from SAS, I still don't get how to handle parameters (or macro variables in SAS terminology).
I have a date parameter like "202105" and want to add it as a string column to a DataFrame.
Something like this:
date = 202105
df = df.withColumn("DATE", lit('{date}'))
I think it's quite trivial, but so far I haven't found an exact answer to my problem; maybe it's just too trivial.
Hope you guys can help me out. Best regards.
You can use string interpolation, i.e. "{}".format(...) or an f-string (f'{...}').
Example:
df.withColumn("DATE", lit("{0}".format(date)))
# or
df.withColumn("DATE", lit("{}".format(date)))
# or
df.withColumn('DATE', lit(f'{date}'))

How can I create index for python pandas dataframe?

I am importing several CSV files into Python using Jupyter Notebook and pandas, and some DataFrames are created without a proper index column. Instead, the first column, which holds data I need to manipulate, is used as the index. How can I create a regular index column as the first column? This seems like a trivial matter, but I can't find any useful help anywhere.
[screenshot: what my dataframe looks like]
[screenshot: what my dataframe should look like]
Could you please try this:
df.reset_index(inplace = True, drop = True)
Let me know if this works.
When you are reading in the CSV, use pandas.read_csv(index_col=#, ...), where # is the column to use as the index. If the files don't have a proper index column, set index_col=False.
To change the index of an existing DataFrame df, try df = df.reset_index() or df = df.set_index(#).
When you imported your csv, did you use the index_col argument? It should default to None, according to the documentation. If you don't use the argument, you should be fine.
Either way, you can force it not to use a column by using index_col=False. From the docs:
Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
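A small sketch of both suggestions, assuming 'file.csv' is a placeholder path:

import pandas as pd

# don't treat any column as the index while reading
df = pd.read_csv('file.csv', index_col=False)

# or fix an existing DataFrame: move the current index back to a column
# and get a fresh 0..n-1 RangeIndex
df = df.reset_index()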
I found the solution in the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Python 3.8.5, pandas==1.2.4:
pd.read_csv('file.csv', header=None)

Writing to elasticsearch indexes dynamically via pyspark

I have a pyspark DataFrame like this:
my_df = spark.read.load("some-parquet-path")
I'd like to be able to write it out to some elasticsearch indexes dynamically based on the contents of the "id" column in my DataFrame. I tried doing this:
my_df.write.format(
"org.elasticsearch.spark.sql"
).mode('overwrite').options(**conf).save("my_index_{id}/my_type")
But I get:
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: no such index
How can I do this?
Update
This seems to work when I change the mode from 'overwrite' to 'append'. It would be great to have an explanation of why that is the case...
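For reference, a sketch of the variant reported to work, assuming conf holds the usual es-hadoop connection options (es.nodes, es.port, and so on); the only change is the write mode:

my_df.write.format(
    "org.elasticsearch.spark.sql"
).mode('append').options(**conf).save("my_index_{id}/my_type")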

Column to transaction ID for association rules on pandas DataFrames (Python)

I imported a CSV into Python with pandas and I would like to be able to use one of the columns as a transaction ID in order to build association rules.
(link: https://github.com/antonio1695/Python/blob/master/nearBPO/facturas.csv)
I hope someone can help me to:
Use UUID as a transaction ID so that I have a DataFrame like the following:
UUID    Desc
123ex   Meat, Beer
This would let me get association rules like {Meat} => {Beer}.
Also, a recommendation on a library to do so in a simple way would be appreciated.
Thank you for your time.
You can aggregate values into a list by doing the following:
df.groupby('UUID')['Desc'].apply(list)
This will give you what you want; if you want the UUID back as a column, you can call reset_index on the above:
df.groupby('UUID')['Desc'].apply(list).reset_index()
The result is a Series, but you can still export it to a CSV the same way as with a DataFrame:
df.groupby('UUID')['Desc'].apply(list).to_csv(your_path)
You may need to name your index prior to exporting, or, if you find it easier, just reset_index to restore the index as a column and then call to_csv; see the sketch below.
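A short sketch of the export step, assuming the DataFrame has 'UUID' and 'Desc' columns and 'grouped.csv' is a placeholder path:

import pandas as pd

grouped = df.groupby('UUID')['Desc'].apply(list)

# option 1: make sure the index has a name so the CSV gets a header for it
grouped.index.name = 'UUID'
grouped.to_csv('grouped.csv')

# option 2: turn the index back into a column first
grouped.reset_index().to_csv('grouped.csv', index=False)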
