I am using dbduck to run a SQL query on the following dataframe df:
In this SQL, I need to pass the values from the dataframe column col3 using a loop:
aa = ps.sqldf("select * from result where col3= 'id1'")
print(aa)
You can iterate over the values of col3 like this, using Python f-strings:
for v in df["col3"]:
    aa = ps.sqldf(f"select * from result where col3='{v}'")
    print(aa)
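If you want a single combined DataFrame instead of printing each result, you can collect the per-value results and concatenate them. A minimal sketch, assuming each ps.sqldf call returns a pandas DataFrame and pandas is imported as pd:
import pandas as pd

# run the query once per value of col3 and stack the per-value results
parts = [ps.sqldf(f"select * from result where col3='{v}'") for v in df["col3"]]
aa = pd.concat(parts, ignore_index=True)
print(aa)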
I've got a list of column names I want to sum
columns = ['col1','col2','col3']
How can I add the three and put the result in a new column? (In an automatic way, so that I can change the column list and get new results.)
Dataframe with result I want:
col1 col2 col3 result
1 2 3 6
TL;DR:
You can do this:
from functools import reduce
from operator import add
from pyspark.sql.functions import col
df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns]))
Explanation:
The df.na.fill(0) portion is to handle nulls in your data. If you don't have any nulls, you can skip that and do this instead:
df.withColumn("result" ,reduce(add, [col(x) for x in df.columns]))
If you have static list of columns, you can do this:
df.withColumn("result", col("col1") + col("col2") + col("col3"))
But if you don't want to type out the whole column list, you need to generate the expression col("col1") + col("col2") + col("col3") iteratively. For this, you can use reduce with the add function to get this:
reduce(add, [col(x) for x in df.columns])
The columns are added two at a time, so you actually get (col("col1") + col("col2")) + col("col3") rather than col("col1") + col("col2") + col("col3"), but the effect is the same.
The col(x) ensures that you are adding Column objects rather than plain strings, so you get a numeric sum instead of a string concatenation (which would produce col1col2col3).
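Putting it together, here is a minimal, self-contained sketch of the reduce approach, assuming a local SparkSession and the one-row col1/col2/col3 data from the question:
from functools import reduce
from operator import add
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ["col1", "col2", "col3"])

# reduce folds the Column objects pairwise: ((col1 + col2) + col3)
df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns])).show()
# +----+----+----+------+
# |col1|col2|col3|result|
# +----+----+----+------+
# |   1|   2|   3|     6|
# +----+----+----+------+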
Try this:
df = df.withColumn('result', sum(df[col] for col in df.columns))
df.columns will be the list of column names from df.
Add multiple columns from a list into one column
I tried a lot of methods and the following are my observations:
PySpark's sum function doesn't support column addition (PySpark version 2.3.1).
Python's built-in sum function works for some folks but gives an error for others.
So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression to be computed as input.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns. We can also use any other complex expression to get other outputs.
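For example, a short sketch of a more complex expression built the same way (the weighting on the columns is just an illustration, not part of the original answer):
# any valid SQL expression over the columns works, e.g. a weighted sum
weights = {'a': 2, 'b': 1, 'c': 1}
expression = ' + '.join(f'{w} * {c}' for c, w in weights.items())  # "2 * a + 1 * b + 1 * c"
df = df.withColumn('weighted_sum', expr(expression))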
I have a dataframe as follows:
Items Data
enst.35 abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hoxg|,abc|framex|gtk4|enst.35|pxc|h5g|
enst.18 abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|holg|,abc|framex|gtk4|enst.35|pxc|h5g|
enst.98 abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|ho6g|,abc|framex|gtk4|enst.35|pxc|h5g|
enst.63 abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|horg|,abc|framex|
I want to extract Data based on the Items value and keep only the piece of data between the separators (,) that contains it. I want to match row 1 of col1 to row 1 of col2, row 2 of col1 to row 2 of col2, and so on.
If no match is found, fill the output column with 'NA'. There can be multiple occurrences of an id in the same column, but I want to consider only the first occurrence.
The expected output is:
abc|framex|gtk4|enst.35|pxc|h5g|
abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|
NA
I tried the following code to generate the output:
import pandas as pd
df=pd.read_table('file1.txt', sep="\t")
keywords=df['Items'].to_list()
df_map = df.Data[df.Data.str.contains('|'.join(keywords))].reindex(df.index)
But the output generated has all the terms containing the keywords:
Data
abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hoxg|abc|framex|gtk4|enst.35|pxc|h5g|
abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|holg|abc|framex|gtk4|enst.35|pxc|h5g|
abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|ho6g|abc|framex|gtk4|enst.35|pxc|h5g|
NA
What changes can I make in the code to generate the correct output as expected?
Use DataFrame.apply along axis=1 and apply a custom function which extracts the string associated with the occurrence of df['Items'] in df['Data']:
import re
import numpy as np

def find(s):
    # keep only the comma-delimited segment that contains the Items value
    mobj = re.search(rf"[^,]+{re.escape(s['Items'])}[^,]+", s['Data'])
    if mobj:
        return mobj.group(0)
    return np.nan

df['Data'] = df.apply(find, axis=1)
Or, use a faster solution:
pattern = '|'.join([rf'[^,]+{re.escape(k)}[^,]+' for k in df['Items']])
df['Data'] = df['Data'].str.findall(pattern).str.get(0)
# print(df['Data'])
0 abc|framex|gtk4|enst.35|pxc|h5g|
1 abc|frbx|hgk4|enst.18|pif|holg|
2 abc|frame|gtk|enst.98|pc|hg|
3 NaN
Name: Data, dtype: object
We can formally define a key-value pair list as follows:
kvlist = <key>[kvdelim]<value>([pairdelim]<key>[kvdelim]<value>)*
key = <string>|<quoter><string><quoter>
value = <string>|<quoter><string><quoter>
quoter = "
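To make the grammar concrete, here is a minimal parsing sketch in Python, assuming kvdelim is '=', pairdelim is ',' and the quoter is a double quote (all of these delimiter choices are assumptions for the example):
import re

def parse_kvlist(text):
    # a token is either a quoted string or a run of characters free of delimiters
    token = r'"[^"]*"|[^=,"]+'
    pairs = {}
    for m in re.finditer(rf'({token})=({token})', text):
        key, value = (t.strip('"') for t in m.groups())
        pairs[key] = value
    return pairs

print(parse_kvlist('a=1,"long key"="quoted value",b=2'))
# {'a': '1', 'long key': 'quoted value', 'b': '2'}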
My function get_data returns a tuple: two integer values.
get_data_udf = udf(lambda id: get_data(spark, id), (IntegerType(), IntegerType()))
I need to split them into two columns val1 and val2. How can I do it?
dfnew = df \
.withColumn("val", get_data_udf(col("id")))
Should I save the tuple in a column, e.g. val, and then split it somehow into two columns? Or is there a shorter way?
You can declare the UDF's return type as a StructType of StructFields so that the fields can be accessed later:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StructType, StructField, IntegerType

get_data_udf = udf(lambda id: get_data(spark, id),
                   StructType([StructField('first', IntegerType()),
                               StructField('second', IntegerType())]))

dfnew = df \
    .withColumn("val", get_data_udf(col("id"))) \
    .select('*', col('val.first').alias('first'), col('val.second').alias('second'))
Tuples can be indexed just like lists, so you can add the value for column one as get_data()[0], and for the second value in the second column you use get_data()[1].
You can also do v1, v2 = get_data() to assign the returned tuple values to the variables v1 and v2.
Take a look at this question here for further clarification.
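A tiny sketch of both ideas, with a stand-in get_data that just returns a fixed tuple:
def get_data():
    # stand-in for the real function; it just returns two integers as a tuple
    return (1, 2)

first = get_data()[0]    # index the tuple like a list
second = get_data()[1]
v1, v2 = get_data()      # or unpack it into two variables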
For example, say you have a sample dataframe of one column like below:
val df = sc.parallelize(Seq(3)).toDF()
df.show()
// Below is a function which returns a tuple
def tupleFunction(): (Int, Int) = (1, 2)

// We will create two new columns from the tuple returned above
import org.apache.spark.sql.functions.{col, typedLit}

df.withColumn("newCol", typedLit(tupleFunction.toString.replace("(", "").replace(")", "").split(",")))
  .select((0 to 1).map(i => col("newCol").getItem(i).alias(s"newColFromTuple$i")): _*)
  .show
I have a pandas dataframe grouped by certain columns. Now I want to insert the mean of the numeric values of four adjacent columns into a new column. This is what I did:
df = pd.read_csv(filename)
# in this line I extract a unique ID from the filename
id = re.search(r'(\w\w\w)', filename).group(1)
Files look like this:
col1 | col2 | col3
-----------------------
str1a | str1b | float1
My idea was now the following:
# get the numeric values
df2 = pd.DataFrame(df.groupby(['col1', 'col2']).mean()['col3']).T
# insert the id into a new column
df2.insert(0, 'ID', id)
Now loop over all of them:
for j in range(len(df2.values)):
    for k in df['col1'].unique():
        df2.insert(j+5, (k, 'mean'), df2.values[j])
df2.to_excel('text.xlsx')
But I get the following error, referring to the line with df2.insert:
TypeError: not all arguments converted during string formatting
and
if not allow_duplicates and item in self.items:
# Should this be a different kind of error??
raise ValueError('cannot insert %s, already exists' % item)
I am not sure what string formatting refers to here, since I have only numerical values being passed around.
The final output should have all values from col3 in a single row (indexed by id) and every fifth column should be the inserted mean value of the four preceding values.
If I had to work with files like yours, I would code a function to convert them to CSV... something like this:
import pandas

data = []
for lineInFile in file.read().splitlines():
    lineInFile_splited = lineInFile.split('|')
    if len(lineInFile_splited) > 1:  # keep only data rows, not the '-------' separator
        data.append(lineInFile_splited)
df = pandas.DataFrame(data, columns=['A', 'B', 'C'])  # one name per pipe-separated field
Hope it helps!
I need to map a dict to a column in a dataframe but don't know how to do it when the values in the dict are in a list. Ideally I want a new column for the values in position [0], and another column for the values in position [1].
dict = {'foo': ['A', 'B'],
        'bar': ['C', 'D']}

df = pd.DataFrame([[0, 'foo'],
                   [1, 'bar']], columns=['Col1', 'Col2'])
Below is the usual way I'd go about this task if there was only one dict value.
df['Col3'] = df.Col2.map(dict)
Series.map() will also accept functions, so you can pass in a lambda function that wraps the lookup dictionary and grabs the list item you want.
dict = {'foo': ['A', 'B'],
        'bar': ['C', 'D']}

df = pd.DataFrame([[0, 'foo'],
                   [1, 'bar']], columns=['Col1', 'Col2'])

df['Col3'] = df.Col2.map(lambda x: dict[x][0])
df['Col4'] = df.Col2.map(lambda x: dict[x][1])
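With the example df above, the two map calls should produce something like:
   Col1 Col2 Col3 Col4
0     0  foo    A    B
1     1  bar    C    D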
If I understand your question correctly, it's probably easiest to just turn the dictionary into a dataframe and use pd.DataFrame.merge:
In [40]: d = {'foo':['A','B'],
...: 'bar':['C','D']}
In [41]: df.merge(pd.DataFrame(d, index=["Col3", "Col4"]).T,
    ...:          left_on='Col2', right_index=True)
Out[41]:
Col1 Col2 Col3 Col4
0 0 foo A B
1 1 bar C D