PySpark: Flag rows of specific columns in df1 that exist in df2?

I'm using PySpark. I have two dataframes, call them df1 and df2. I want df1 to gain a new column flagging which rows of df1's columns (A, B) exist in df2's columns (D, E): 1 marks existence and 0 otherwise. An example of the transformation is:
df1
A  B  C
0  0  1
0  0  1
0  0  1
df2
D  E  F  G
1  2  1  2
0  0  1  2
1  2  1  2
Resulting df1
A  B  C  Exist
0  0  1  0
0  0  1  1
0  0  1  0
The focus columns from df1 are A, B and from df2 are D, E. Only the second row of these columns matches, so only that row of df1 has its newly created Exist column set to 1.
How can I achieve this?

df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")

spark.sql("""
    SELECT a, b, c,
           CASE WHEN d IS NULL AND e IS NULL THEN 0 ELSE 1 END AS exist
    FROM table1
    LEFT OUTER JOIN table2 ON a = d AND b = e
""").show()
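If you prefer the DataFrame API over SQL, here is a minimal equivalent sketch, assuming the example's column names (df2's key pairs are deduplicated first so the left join cannot multiply df1's rows):

from pyspark.sql import functions as F

# Distinct (D, E) key pairs from df2; a non-null match after the left join means the pair exists.
matches = df2.select("D", "E").distinct()

result = (
    df1.join(matches, (df1["A"] == matches["D"]) & (df1["B"] == matches["E"]), "left")
       .withColumn("Exist", F.when(F.col("D").isNull(), 0).otherwise(1))
       .drop("D", "E")
)
result.show()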

Related

Printing count of a column based on value of another column

I have a data frame:
Dept_Name  Placed
A          1
B          0
C          1
where the 'Placed' column holds a boolean value. I want to print the count of rows that have the value 1 in Placed, grouped by Dept_Name:
Dept_Name  Count(Placed == 1)
A          3
B          4
C          0
If the values are 0/1 or True/False, you can aggregate with sum; then, to name the result column Count, use Series.reset_index:
df1 = df.groupby('Dept_Name')['Placed'].sum().reset_index(name='Count')
If you need to test against some non-boolean value - e.g. to count values equal to 100:
df2 = df['Placed'].eq(100).groupby(df['Dept_Name']).sum().reset_index(name='Count')
As you have boolean 0/1 values, a simple sum will work:
out = df.groupby('Dept_Name', as_index=False).sum()
output:
Dept_Name Placed
0 A 5
1 B 0
2 C 2
For a named column:
out = df.groupby('Dept_Name', as_index=False).agg(**{'Count': ('Placed', 'sum')})
output:
Dept_Name Count
0 A 5
1 B 0
2 C 2
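Both printed outputs above come from the answerer's own, larger sample. A minimal, self-contained run of the named aggregation on the question's three-row frame:

import pandas as pd

df = pd.DataFrame({'Dept_Name': ['A', 'B', 'C'], 'Placed': [1, 0, 1]})

# Summing a 0/1 column per group counts the 1s in each group.
out = df.groupby('Dept_Name', as_index=False).agg(Count=('Placed', 'sum'))
print(out)
#   Dept_Name  Count
# 0         A      1
# 1         B      0
# 2         C      1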

Update all values in a pandas dataframe based on all instances of a specific value in another dataframe

My apologies beforehand! I have done this before a few times, but I am having some brain fog. I have two dataframes, df1 and df2. I would like to update all values in df2 that match a specific value in df1, while not changing the other values in df2. I can do this pretty easily with np.where on the columns of a single dataframe, but I am blanking on how I did this previously with two dataframes!
Goal: Set values in df2 to 0 if they are 0 in df1 - otherwise keep the df2 value.
Example
df1
A  B  C
4  0  1
0  2  0
1  4  0
df2
A  B  C
1  8  1
9  2  7
1  4  6
Expected df2 after our element swap
A  B  C
1  0  1
0  2  0
1  4  0
Brain fog is bad! Thank you for the assistance!
Using fillna (note that the masking step introduces NaN, which upcasts integer columns to float, so the result comes back as floats):
>>> df2[df1 != 0].fillna(0)
You can try
df2[df1.eq(0)] = 0
print(df2)
A B C
0 1 0 1
1 0 2 0
2 1 4 0
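Since the question mentions np.where, here is a sketch of that route as well, assuming the two frames share the same index and columns:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [4, 0, 1], 'B': [0, 2, 4], 'C': [1, 0, 0]})
df2 = pd.DataFrame({'A': [1, 9, 1], 'B': [8, 2, 4], 'C': [1, 7, 6]})

# Element-wise: take 0 where df1 is 0, otherwise keep df2's value.
df2[:] = np.where(df1.eq(0), 0, df2)

# Pandas-only equivalent: df2 = df2.mask(df1.eq(0), 0)
print(df2)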

Create a summary table with count values

My df looks like:
group  value
A      1
B      1
A      1
B      1
B      0
B      0
A      0
I want to create a df:
value  0  1
group
A      a  b
B      c  d
where a, b, c, d are the counts of 0s and 1s in groups A and B respectively.
I tried df.groupby('group').size(), but that gave an overall count and did not split the 0s and 1s. I tried a groupby count method too, but have not been able to achieve the target dataframe.
Use pd.crosstab:
pd.crosstab(df['group'], df['value'])
Output:
value 0 1
group
A 1 2
B 2 2
Use pivot table for this:
res = pd.pivot_table(df, index='group', columns='value', aggfunc='size')
>>> print(res)
value 0 1
group
A 1 2
B 2 2
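A third equivalent route counts each (group, value) pair with groupby and reshapes with unstack - a sketch on the question's data:

import pandas as pd

df = pd.DataFrame({'group': ['A', 'B', 'A', 'B', 'B', 'B', 'A'],
                   'value': [1, 1, 1, 1, 0, 0, 0]})

# Count each (group, value) pair, then pivot the value level into columns.
res = df.groupby(['group', 'value']).size().unstack(fill_value=0)
print(res)
# value  0  1
# group
# A      1  2
# B      2  2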

How to enumerate and split a CSV column in Pandas?

We have a CSV with a fixed (repeating) set of strings in a column. Say the column is called "Type" and its values are [A, B, B, C]. I want to get several new columns - as many as there are unique column values - with 0 or 1 in them, like this:
Type Type_1 Type_2 Type_3
A 1 0 0
B 0 1 0
B 0 1 0
C 0 0 1
How do I turn a column into a set of 0/1 columns?
Try with str.get_dummies
df.Type.str.get_dummies()
A B C
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
Update
df = df.join(df.pop('Type').str.get_dummies())
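The top-level pd.get_dummies gives the same result and lets you add a Type_ prefix similar to what the question shows (the suffix is the value, so Type_A rather than Type_1); a sketch, assuming Type holds single strings (astype(int) is only needed on newer pandas, where the dummies come back as booleans):

import pandas as pd

df = pd.DataFrame({'Type': ['A', 'B', 'B', 'C']})

# One 0/1 column per unique value, keeping the original column alongside.
dummies = pd.get_dummies(df['Type'], prefix='Type').astype(int)
out = df.join(dummies)
print(out)
#   Type  Type_A  Type_B  Type_C
# 0    A       1       0       0
# 1    B       0       1       0
# 2    B       0       1       0
# 3    C       0       0       1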

Pandas left join returning multiple rows

I am using Python to merge two dataframes:
join = pd.merge(df1, df2, on=["A", "B"], how="left")
Table 1:
A B
a 1
b 2
c 3
Table 2:
A B Flag C
a 1 0 20
b 2 1 40
c 3 0 60
a 1 1 80
b 2 0 10
The result that I get after left join is:
A B Flag C
a 1 0 20
a 1 1 80
b 2 1 40
b 2 0 10
c 3 0 60
Here we see that rows 1 and 2 each appear twice because of Table 2. I want to keep just one row per key, based on the Flag column: of the duplicated rows, keep the one whose Flag value is 1.
So Final Expected output is:
A B Flag C
a 1 1 80
b 2 1 40
c 3 0 60
Is there any pythonic way to do it?
# raise preferred lines to the top
df2 = df2.sort_values(by='Flag', ascending=False)
# deduplicate
df2 = df2.drop_duplicates(subset=['A','B'], keep='first')
# merge
pd.merge(df1, df2, on=['A','B'])
A B Flag C
0 a 1 1 80
1 b 2 1 40
2 c 3 0 60
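An equivalent sketch that skips the pre-sort: after the plain merge, keep, per (A, B) key, the row whose Flag is largest (ties keep the first occurrence):

join = pd.merge(df1, df2, on=['A', 'B'], how='left')

# idxmax returns the row label of the maximum Flag within each group.
out = join.loc[join.groupby(['A', 'B'])['Flag'].idxmax()].reset_index(drop=True)
print(out)
#    A  B  Flag   C
# 0  a  1     1  80
# 1  b  2     1  40
# 2  c  3     0  60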
The concept is similar to what you would do in SQL: build a separate table with the selection criteria (in this case, the maximum Flag per group), keeping enough columns to match each observation back onto the joined table.
join = pd.merge(df1, df2, how="left").reset_index()
maximums = join.groupby(by='A').max()
join = pd.merge(join, maximums, on=['Flag', 'A'])
Try joining on the index instead (pandas raises an error if on= is combined with left_index/right_index):
join = pd.merge(df1, df2.drop(columns=['A', 'B']), left_index=True, right_index=True)
print(join)
