Multiply two pyspark dataframes - python

I have a PySpark DataFrame, df1, that looks like:
CustomerID CustomerValue CustomerValue2
15 10 2
16 10 3
18 3 3
I have a second PySpark DataFrame, df2
CustomerID CustomerValue
15 2
16 3
18 4
I want to multiply all the columns of df1(I have more than two columns)
with the value of df2 join on customer ID. So i want to have something like that
CustomerID CosineCustVal CosineCustVal
15 20 4
16 30 9
18 12 9

Once you join them, you can run a for loop on the columns of df1:
from pyspark.sql import functions as F
df_joined = df1.join(df2, df1.CustomerID == df2.CustomerID)
for col_name in df_joined.columns:
if col_name != 'CustomerValue':
df_joined = df_joined.withColumn(col_name, F.column(col_name) * F.column('CustomerValue'))
Based on this article, spark will create a smart plan even though the for loop would suggest otherwise (remember that spark only starts the calculations once you call an action, until that you just assign transformations: https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations).

Related

Convert directed graph to horizontal data format in pandas python

I just started with python and know almost nothing about pandas.
so now I have a directed graph which look like this:
from ID
to ID
13
22
13
56
14
10
14
15
14
16
now I need to transform it to horizontal data format like this:
from ID
To 0
To 1
To 2
13
22
56
NAN
14
10
15
16
I find something like Pandas Merge duplicate index into single index but it seems it cannot merge rows into different colums.
Will pandas use NAN to fill in the blanks during this process due to the need to add more columns?
You can do something like this to transform your dataframe, pandas will automatically add NaN values for any number of columns you have with this solution.
df = df.groupby('from')['to'].apply(list)
## create the desired df from the lists
df = pd.DataFrame(df.tolist(),index=df.index).reset_index()
## Rename columns
df.columns = ["from ID"] + [f"To {i}" for i in range(1,len(df.columns))]

Pandas: Search and match based on two conditions

I am using the code below to make a search on a .csv file and match a column in both files and grab a different column I want and add it as a new column. However, I am trying to make the match based on two columns instead of one. Is there a way to do this?
import pandas as pd
df1 = pd.read_csv("matchone.csv")
df2 = pd.read_csv("comingfrom.csv")
def lookup_prod(ip):
for row in df2.itertuples():
if ip in row[1]:
return row[3]
else:
return '0'
df1['want'] = df1['name'].apply(lookup_prod)
df1[df1.want != '0']
print(df1)
#df1.to_csv('file_name.csv')
The code above makes a search from the column name 'samename' in both files and gets the column I request ([3]) from the df2. I want to make the code make a match for both column 'name' and another column 'price' and only if both columns in both df1 and df2 match then the code take the value on ([3]).
df 1 :
name price value
a 10 35
b 10 21
c 10 33
d 10 20
e 10 88
df 2 :
name price want
a 10 123
b 5 222
c 10 944
d 10 104
e 5 213
When the code is run (asking for the want column from d2, based on both if df1 name = df2 name) the produced result is :
name price value want
a 10 35 123
b 10 21 222
c 10 33 944
d 10 20 104
e 10 88 213
However, what I want is if both df1 name = df2 name and df1 price = df2 price, then take the column df2 want, so the desired result is:
name price value want
a 10 35 123
b 10 21 0
c 10 33 944
d 10 20 104
e 10 88 0
You need to use pandas.DataFrame.merge() method with multiple keys:
df1.merge(df2, on=['name','price'], how='left').fillna(0)
Method represents missing values as NaNs, so that the column's dtype changes to float64 but you can change it back after filling the missed values with 0.
Also please be aware that duplicated combinations of name and price in df2 will appear several times in the result.
If you are matching the two dataframes based on the name and the price, you can use df.where and df.isin
df1['want'] = df2['want'].where(df1[['name','price']].isin(df2).all(axis=1)).fillna('0')
df1
name price value want
0 a 10 35 123.0
1 b 10 21 0
2 c 10 33 944.0
3 d 10 20 104.0
4 e 10 88 0
Expanding on https://stackoverflow.com/a/73830294/20110802:
You can add the validate option to the merge in order to avoid duplication on one side (or both):
pd.merge(df1, df2, on=['name','price'], how='left', validate='1:1').fillna(0)
Also, if the float conversion is a problem for you, one option is to do an inner join first and then pd.concat the result with the "leftover" df1 where you already added a constant valued column. Would look something like:
df_inner = pd.merge(df1, df2, on=['name', 'price'], how='inner', validate='1:1')
merged_pairs = set(zip(df_inner.name, df_inner.price))
df_anti = df1.loc[~pd.Series(zip(df1.name, df1.price)).isin(merged_pairs)]
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti]) # perhaps ignore_index=True ?
Looks complicated, but should be quite performant because it filters by set. I think there might be a possibility to set name and price as index, merge on index and then filter by index to not having to do the zip-set-shenanigans, bit I'm no expert on multiindex-handling.
#Try this code it will give you expected results
import pandas as pd
df1 = pd.DataFrame({'name' :['a','b','c','d','e'] ,
'price' :[10,10,10,10,10],
'value' : [35,21,33,20,88]})
df2 = pd.DataFrame({'name' :['a','b','c','d','e'] ,
'price' :[10,5,10,10,5],
'want' : [123,222,944,104 ,213]})
new = pd.merge(df1,df2, how='left', left_on=['name','price'], right_on=['name','price'])
print(new.fillna(0))

Is it normal for pandas to need 30 secs to perform calculations on a 50k-row dataframe?

I use the pandas read_excel function to work with data. I have two excel files with 70k rows and 3 columns (the first column is date), and it only takes 4-5 seconds to combine, align the data, delete any rows with incomplete data and return a new dataframe (df) with 50k rows and 4 columns, where date is the index.
Then, i use the below code to perform some calculations and add another 2 columns in my df:
for i, row in df.iterrows():
df["new_column1"] = df["column1"] - 2 * df["column4"]
df["new_column2"]= df["column1"] - 2.5 * df["column4"]
It takes approx 30 seconds for the above code to be executed, even though the calculations are simple. Is this normal, or is there a way to speed up the execution? (i am on win 10, 16GB Ram and i7-8565U processor)
I am not particularly interested in increasing the columns in my database - getting the two new columns on a list would suffice.
Thanks.
Note that the code in your loop contains neither row nor i.
So drop for ... row and execute just:
df["new_column1"] = df["column1"] - 2 * df["column4"]
df["new_column2"]= df["column1"] - 2.5 * df["column4"]
It is enough to execute the above code only once, not in a loop.
Your code unnecessarily performs the above operations multiple times
(actually as many times as how many rows has your DataFrame) and this
is why it takes so long.
Edit following question as of 18:59Z
To perform vectorized operations like "check one column and do something
to another column", use the following schema, base on boolean indexing.
Assume that the source df contains:
column1 column4
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17
7 8 18
Then if you want to:
select rows with even value in column1,
and add some value (e.g. 200) to column4,
run:
df.loc[df.column1 % 2 == 0, 'column4'] += 200
In this example:
df.column1 % 2 == 0 - provides boolean indexing over rows,
column4 - selects particular column,
+= 200 - performs the actual operation.
The result is:
column1 column4
0 1 11
1 2 212
2 3 13
3 4 214
4 5 15
5 6 216
6 7 17
7 8 218
But there ase more complex cases, when the condition involves calling of
some custom code or you want to update several columns.
In such cases you should use either iterrow of apply, but these
operations are executed much slower.

using pandas, extract data from long format df and add it to wide format df

I have two dataframes, df1 and df2. df1 has repeat observations arranged in wide format, and df2 in long format.
import pandas as pd
df1 = pd.DataFrame({"ID":[1,2,3],"colA_1":[1,2,3],"date1":["1.1.2001", "2.1.2001","3.1.2001"],"colA_2":[4,5,6],"date2":["1.1.2002", "2.1.2002","3.1.2002"]})
df2 = pd.DataFrame({"ID":[1,1,2,2,3,3],"col1":[1,1.5,2,2.5,3,3.5],"date":["1.1.2001", "1.1.2002","2.1.2001","2.1.2002","3.1.2001","3.1.2002"], "col3":[11,12,13,14,15,16],"col4":[21,22,23,24,25,26]})
df1 looks like:
ID colA_1 date1 colA_2 date2
0 1 1 1.1.2001 4 1.1.2002
1 2 2 2.1.2001 5 2.1.2002
2 3 3 3.1.2001 6 3.1.2002
df2 looks like:
ID col1 date1 col3 col4
0 1 1.0 1.1.2001 11 21
1 1 1.5 1.1.2002 12 22
2 2 2.0 2.1.2001 13 23
3 2 2.5 2.1.2002 14 24
4 3 3.0 3.1.2001 15 25
5 3 3.5 3.1.2002 16 26
6 3 4.0 4.1.2002 17 27
I want to take a given column from df2, "col3", and then:
(1) if the columns "ID" and "date" in df2 match with the columns "ID" and "date1" in df1, I want to put the value in a new column in df1 called "colB_1".
(2) else if the columns "ID" and "date" in df2 match with the columns "ID" and "date2" in df1, I want to put the value in a new column in df1 called "colB_2".
(3) else if the columns "ID" and "date" in df2 have no match with either ("ID" and "date1") or ("ID" and "date2"), I want to ignore these rows.
So, the output of this output dataframe, df3, should look like this:
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
What is the best way to do this?
I found this link, but the answer doesn't work for my case. I would like a really explicit way to specify column matching. I think it's possible that df.mask might be able to help me, but I am not sure how to implement it.
e.g.: the following code
df3 = df1.copy()
df3["colB_1"] = ""
df3["colB_2"] = ""
filter1 = (df1["ID"] == df2["ID"]) & (df1["date1"] == df2["date"])
filter2 = (df1["ID"] == df2["ID"]) & (df1["date2"] == df2["date"])
df3["colB_1"] = df.mask(filter1, other=df2["col3"])
df3["colB_2"] = df.mask(filter2, other=df2["col3"])
gives the error
ValueError: Can only compare identically-labeled Series objects
I asked this question previously, and it was marked as closed; my question was marked as a duplicate of this one. However, this is not the case. The answers in the linked question suggest the use of either map or df.merge. Map does not work with multiple conditions (in my case, ID and date). And df.merge (the answer given for matching multiple columns) does not work in my case when one of the column names in df1 and df2 that are to be merged are different ("date" and "date1", for example).
For example, the below code:
df3 = df1.merge(df2[["ID","date","col3"]], on=['ID','date1'], how='left')
fails with a Key Error.
Also noteworthy is that I will be dealing with many different files, with many different column naming schemes, and I will need a different subset each time. This is why I would like an answer that explicitly names the columns and conditions.
Any help with this would be much appreciated.
You can the pd.wide_to_long after replacing the underscore , this will unpivot the dataframe which you can use to merge with df2 and then pivot back using unstack:
m =df1.rename(columns=lambda x: x.replace('_',''))
unpiv = pd.wide_to_long(m,['colA','date'],'ID','v').reset_index()
merge_piv = (unpiv.merge(df2[['ID','date','col3']],on=['ID','date'],how='left')
.set_index(['ID','v'])['col3'].unstack().add_prefix('colB_'))
final = df1.merge(merge_piv,left_on='ID',right_index=True)
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16

Renaming columns in pandas dataframe error [duplicate]

I have the following data frames:
print(df_a)
mukey DI PI
0 100000 35 14
1 1000005 44 14
2 1000006 44 14
3 1000007 43 13
4 1000008 43 13
print(df_b)
mukey niccdcd
0 190236 4
1 190237 6
2 190238 7
3 190239 4
4 190240 7
When I try to join these data frames:
join_df = df_a.join(df_b, on='mukey', how='left')
I get the error:
*** ValueError: columns overlap but no suffix specified: Index([u'mukey'], dtype='object')
Why is this so? The data frames do have common 'mukey' values.
Your error on the snippet of data you posted is a little cryptic, in that because there are no common values, the join operation fails because the values don't overlap it requires you to supply a suffix for the left and right hand side:
In [173]:
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
mukey_left DI PI mukey_right niccdcd
index
0 100000 35 14 NaN NaN
1 1000005 44 14 NaN NaN
2 1000006 44 14 NaN NaN
3 1000007 43 13 NaN NaN
4 1000008 43 13 NaN NaN
merge works because it doesn't have this restriction:
In [176]:
df_a.merge(df_b, on='mukey', how='left')
Out[176]:
mukey DI PI niccdcd
0 100000 35 14 NaN
1 1000005 44 14 NaN
2 1000006 44 14 NaN
3 1000007 43 13 NaN
4 1000008 43 13 NaN
The .join() function is using the index of the passed as argument dataset, so you should use set_index or use .merge function instead.
Please find the two examples that should work in your case:
join_df = LS_sgo.join(MSU_pi.set_index('mukey'), on='mukey', how='left')
or
join_df = df_a.merge(df_b, on='mukey', how='left')
This error indicates that the two tables have one or more column names that have the same column name.
The error message translates to: "I can see the same column in both tables but you haven't told me to rename either one before bringing them into the same table"
You either want to delete one of the columns before bringing it in from the other on using del df['column name'], or use lsuffix to re-write the original column, or rsuffix to rename the one that is being brought in.
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
The error indicates that the two tables have the 1 or more column names that have the same column name.
Anyone with the same error who doesn't want to provide a suffix can rename the columns instead. Also make sure the index of both DataFrames match in type and value if you don't want to provide the on='mukey' setting.
# rename example
df_a = df_a.rename(columns={'a_old': 'a_new', 'a2_old': 'a2_new'})
# set the index
df_a = df_a.set_index(['mukus'])
df_b = df_b.set_index(['mukus'])
df_a.join(df_b)
Mainly join is used exclusively to join based on the index,not on the attribute names,so change the attributes names in two different dataframes,then try to join,they will be joined,else this error is raised

Categories

Resources