PySpark Melting Null Columns - python

I have a dataframe that looks like this:
# +----+----------+----------+----------+
# | id | c_type_1 | c_type_2 | c_type_3 |
# +----+----------+----------+----------+
# |  1 | null     | null     | r        |
# |  2 | a        | null     | null     |
# |  3 | null     | null     | null     |
# +----+----------+----------+----------+
I need to convert it into something like this:
# +----+----------+------------+
# | id | c_type   | c_type_val |
# +----+----------+------------+
# |  1 | c_type_3 | r          |
# |  2 | c_type_1 | a          |
# |  3 | null     | null       |
# +----+----------+------------+
Each row has either exactly one non-null c_type value or all of its c_type values are null.
I'm currently melting the rows like so:
from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_cols, value_cols, c_type, c_value):
    # Build one struct per value column: (column name, column value).
    v_arr = []
    for c in value_cols:
        v_arr.append(struct(lit(c).alias(c_type), col(c).alias(c_value)))
    vars_and_vals = array(*v_arr)
    # Explode the array so each (name, value) pair becomes its own row.
    tmp = df.withColumn("vars_and_vals", explode(vars_and_vals))
    cols = id_cols + [
        col("vars_and_vals")[x].alias(x) for x in [c_type, c_value]]
    return tmp.select(*cols)
melted = melt(df, df.columns[:1], df.columns[1:4], 'c_type', 'c_type_val')
melted.filter(melted.c_type_val.isNotNull()).show()
The problem is that filtering out null c_type_val values also drops the row for id == 3 (any row whose c_type columns are all null). I need a way to melt and filter while still retaining that third row with null c_type and c_type_val.

I tried using pandas; it may give you an idea for solving this:
import pandas as pd

temp = df.filter(like='c_type')
# Stack the non-null c_type columns, then left-merge back so id 3 is kept.
df = pd.merge(df, temp[temp.notnull()].stack().reset_index(),
              left_index=True, right_on=['level_0'], how='left')
df = df[['id', 'level_1', 0]].reset_index(drop=True).rename(columns={'level_1': 'c_type', 0: 'c_type_val'})
print(df)
Output:
   id    c_type c_type_val
0   1  c_type_3          r
1   2  c_type_1          a
2   3       NaN        NaN
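Translating the same keep-the-unmatched-rows idea back to PySpark, one option (a sketch, reusing the question's melt function and Spark df) is to filter the melted rows and then left-join them back onto the ids:
melted = melt(df, df.columns[:1], df.columns[1:4], 'c_type', 'c_type_val')
# Keep only the non-null pairs, then left-join back onto the ids so an id
# with no non-null c_type (id == 3) survives with null c_type/c_type_val.
non_null = melted.filter(melted.c_type_val.isNotNull())
df.select('id').join(non_null, on='id', how='left').show()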

Related

Python - pandas remove duplicate rows based on condition

I have a CSV with data that looks like this:
+----+------+---------------------------+
| id | code | date                      |
+----+------+---------------------------+
| 1  | 2    | 2022-10-05 07:22:39+00:00 |
| 1  | 0    | 2022-11-05 02:22:35+00:00 |
| 2  | 3    | 2021-01-05 10:10:15+00:00 |
| 2  | 0    | 2019-01-11 10:05:21+00:00 |
| 2  | 1    | 2022-01-11 10:05:22+00:00 |
| 3  | 2    | 2022-10-10 11:23:43+00:00 |
+----+------+---------------------------+
I want to remove duplicate ids based on the following conditions:
For the code column, choose the value that is not equal to 0 and, among those, the one with the latest timestamp.
Add another column, prev_code, which contains a list of all the remaining code values that did not make it into the code column.
Something like this -
+----+------+-----------+
| id | code | prev_code |
+----+------+-----------+
| 1  | 2    | [0]       |
| 2  | 1    | [0,2]     |
| 3  | 2    | []        |
+----+------+-----------+
There is probably a sleeker solution, but something along the following lines should work.
import pandas as pd

df = pd.read_csv('file.csv')
# Non-zero code with the latest timestamp per id; everything else goes to prev_code.
lastcode = df[df.code != 0].groupby('id').apply(lambda block: block[block['date'] == block['date'].max()]['code'])
prev_codes = df.groupby('id').agg(code=('code', lambda x: [val for val in x if val != lastcode[x.name].values[0]]))['code']
pd.DataFrame({'id': map(lambda x: x[0], lastcode.index.values), 'code': lastcode.values, 'prev_code': prev_codes.values})
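A possibly sleeker variant, as a sketch only (column names taken from the question; prev_code here is simply every other code observed for the id, in the sorted order, so id 2 gets [3, 0]):
import pandas as pd

df = pd.read_csv('file.csv', parse_dates=['date'])

# Order each id's rows so the preferred one comes first:
# non-zero codes before zero, then the latest timestamp first.
ordered = (df.assign(nonzero=df['code'].ne(0))
             .sort_values(['id', 'nonzero', 'date'], ascending=[True, False, False]))

grouped = ordered.groupby('id')['code']
result = pd.DataFrame({
    'code': grouped.first(),
    'prev_code': grouped.apply(lambda s: list(s.iloc[1:])),
}).reset_index()
print(result)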

Python DataFrame - Select dataframe rows based on values in a column of same dataframe

I'm struggling with a dataframe-related problem.
columns = [desc[0] for desc in cursor.description]
data = cursor.fetchall()
df2 = pd.DataFrame(list(data), columns=columns)
df2 is as follows:
| Col1 | Col2 |
| -------- | -------------- |
| 2145779 | 2 |
| 8059234 | 3 |
| 2145779 | 3 |
| 4265093 | 2 |
| 2145779 | 2 |
| 1728234 | 5 |
I want to make a list of the values in Col1 where the value of Col2 is 3.
You can use boolean indexing:
out = df2.loc[df2.Col2.eq(3), "Col1"].agg(list)
print(out)
Prints:
[8059234, 2145779]
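Under the same assumption that Col2 is numeric, an equivalent spelling uses a plain comparison and tolist():
out = df2.loc[df2["Col2"] == 3, "Col1"].tolist()
print(out)  # [8059234, 2145779]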

How to replace values in a DataFrame with values from a second DataFrame, matching on one column but taking the replacement from another?

I have two DataFrames.
df1:
A    | B       | C
-----|---------|---------
25zx | b(50gh) |
50tr | a(70lc) | c(50gh)
df2:
A    | B
-----|---
25zx | T
50gh | K
50tr | K
70lc | T
I want to replace values in df1. The column I'm matching on is df2['A'], but the value I want to put into df1 is the corresponding value from df2['B'].
So the final table would look like:
df3:
A    | B    | C
-----|------|------
T    | b(K) |
K    | a(T) | c(K)
Cast df2 to a dict and use replace with regex=True:
print(df1.replace(df2.set_index("A")["B"].to_dict(), regex=True))
   A     B     C
0  T  b(K)  None
1  K  a(T)  c(K)
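For reference, a self-contained sketch of the same approach (frame and column names taken from the question; the mapping dict is built from df2 and applied with regex=True so substrings such as the 50gh inside b(50gh) are replaced too):
import pandas as pd

df1 = pd.DataFrame({
    "A": ["25zx", "50tr"],
    "B": ["b(50gh)", "a(70lc)"],
    "C": [None, "c(50gh)"],
})
df2 = pd.DataFrame({
    "A": ["25zx", "50gh", "50tr", "70lc"],
    "B": ["T", "K", "K", "T"],
})

# Map each code in df2["A"] to its replacement from df2["B"].
mapping = df2.set_index("A")["B"].to_dict()
print(df1.replace(mapping, regex=True))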

Iterative merge with tables for values – conditioned on values between an interval (Pandas)

I’m trying to merge two tables, where the rows of the left side stay unchanged and a column gets updated based on the right side. The column on the left takes the value from the right side if it is the highest value seen so far (i.e., higher than the current one on the left side) but still below an individually set threshold.
The threshold is set by the column “Snapshot”; the column “Latest value found” indicates the highest value observed so far (within the threshold).
In order to be memory efficient, the process works over many small chunks of data and needs to iterate over a list of dataframes. Each dataframe records its origin in the column “Table ID”. When the main dataframe finds a value, it stores the origin in its column “Found in”.
Example
Main table (left side)
+----+-------------------------------------------+--------------------+----------+
| ID | Snapshot timestamp (Maximum search value) | Latest value found | Found in |
+----+-------------------------------------------+--------------------+----------+
| 1  | Aug-18                                    | NULL               | NULL     |
| 2  | Aug-18                                    | NULL               | NULL     |
| 3  | May-18                                    | NULL               | NULL     |
| 4  | May-18                                    | NULL               | NULL     |
| 5  | May-18                                    | NULL               | NULL     |
+----+-------------------------------------------+--------------------+----------+
First data chunk
+-----+----------+-------------+--------------------+
| Idx | Table ID | Customer ID | Snapshot timestamp |
+-----+----------+-------------+--------------------+
| 1   | Table1   | 1           | Jan-14             |
| 2   | Table1   | 1           | Feb-14             |
| 3   | Table1   | 2           | Jan-14             |
| 4   | Table1   | 2           | Feb-14             |
| 5   | Table1   | 3           | Mar-14             |
+-----+----------+-------------+--------------------+
Result: Left-side after first merge
+----+--------------------+--------------------+----------+
| ID | Snapshot timestamp | Latest value found | Found in |
+----+--------------------+--------------------+----------+
| 1  | Aug-18             | Feb-14             | Table1   |
| 2  | Aug-18             | Feb-14             | Table1   |
| 3  | May-18             | Mar-14             | Table1   |
| 4  | May-18             | NULL               | NULL     |
| 5  | May-18             | NULL               | NULL     |
+----+--------------------+--------------------+----------+
Second data chunk
+-----+----------+-------------+--------------------+
| Idx | Table ID | Customer ID | Snapshot timestamp |
+-----+----------+-------------+--------------------+
| 1   | Table2   | 1           | Mar-15             |
| 2   | Table2   | 1           | Apr-15             |
| 3   | Table2   | 2           | Feb-14             |
| 4   | Table2   | 3           | Feb-14             |
| 5   | Table2   | 4           | Aug-19             |
+-----+----------+-------------+--------------------+
Result: Left-side after second merge
+----+--------------------+--------------------+----------+
| ID | Snapshot timestamp | Latest value found | Found in |
+----+--------------------+--------------------+----------+
| 1  | Aug-18             | Apr-15             | Table2   |
| 2  | Aug-18             | Feb-14             | Table1   |
| 3  | May-18             | Mar-14             | Table1   |
| 4  | May-18             | NULL               | NULL     |
| 5  | May-18             | NULL               | NULL     |
+----+--------------------+--------------------+----------+
Code
import pandas as pd
import numpy as np
# Main dataframe
df = pd.DataFrame({"ID": [1,2,3,4,5],
"Snapshot": ["2019-08-31", "2019-08-31","2019-05-31","2019-05-31","2019-05-31"], # the maximum interval than can be used
"Latest_value_found": [None,None,None,None,None],
"Found_in": [None,None,None,None,None]}
)
# Data chunks used for updates
Table1 = pd.DataFrame({"Idx": [1,2,3,4,5],
"Table_ID": ["Table1", "Table1", "Table1", "Table1", "Table1"],
"Customer_ID": [1,1,2,2,3],
"Snapshot_timestamp": ["2019-01-31","2019-02-28","2019-01-31","2019-02-28","2019-03-30"]}
)
Table2 = pd.DataFrame({"Idx": [1,2,3,4,5],
"Table_ID": ["Table2", "Table2", "Table2", "Table2", "Table2"],
"Customer_ID": [1,1,2,3,4],
"Snapshot_timestamp": ["2019-03-31","2019-04-30","2019-02-28","2019-02-28","2019-08-31"]}
)
list_of_data_chunks = [Table1, Table2]
# work: iteration
for data_chunk in list_of_data_chunks:
pass
# here the merging is performed iteratively
Here is my workaround, although I would try not to do this in a loop if it's just two tables. I removed your "idx" column from the joining tables.
df_list = [df, Table1, Table2]
main_df = df_list[0]
count_ = 0
for chunk in df_list[1:]:
    # Attach the chunk and keep only the best candidate row per ID.
    main_df = main_df.merge(chunk, how='left', on='ID').sort_values(by=['ID', 'Snapshot_timestamp'], ascending=[True, False])
    main_df['rownum'] = main_df.groupby(['ID']).cumcount() + 1
    if count_ < 1:
        # First chunk: take the candidate directly when it is below the threshold.
        main_df = main_df[main_df['rownum'] == 1].drop(columns=['rownum', 'Latest_value_found', 'Found_in'])
        main_df['Latest_value_found'] = np.where(main_df['Snapshot'] > main_df['Snapshot_timestamp'], main_df['Snapshot_timestamp'], pd.NaT)
        main_df['Found_in'] = np.where(main_df['Snapshot'] > main_df['Snapshot_timestamp'], main_df['Table_ID'], np.nan)
        main_df = main_df.drop(columns=['Snapshot_timestamp', 'Table_ID']).reset_index(drop=True)
        count_ += 1
    else:
        # Later chunks: compare each candidate with the value found so far.
        main_df = main_df[main_df['rownum'] == 1].drop(columns='rownum').reset_index(drop=True)
        this_table = []
        this_date = []
        for row_idx in main_df.index:
            curr_snapshot = pd.to_datetime(main_df.loc[row_idx, 'Snapshot'])
            curr_latest_val = pd.to_datetime(main_df.loc[row_idx, 'Latest_value_found'])
            curr_foundin = main_df.loc[row_idx, 'Found_in']
            next_foundin = main_df.loc[row_idx, 'Table_ID']
            next_snapshot = pd.to_datetime(main_df.loc[row_idx, 'Snapshot_timestamp'])
            if curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val == next_snapshot:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
            elif curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val > next_snapshot:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
            elif curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val < next_snapshot:
                this_date.append(next_snapshot)
                this_table.append(next_foundin)
            elif pd.isnull(curr_latest_val) and next_snapshot < curr_snapshot:
                this_date.append(next_snapshot)
                this_table.append(next_foundin)
            else:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
        main_df = main_df.drop(columns=['Latest_value_found', 'Found_in', 'Table_ID', 'Snapshot_timestamp'])
        main_df = pd.concat([main_df, pd.Series(this_date), pd.Series(this_table)], axis=1).rename(columns={0: 'Latest_value_found', 1: 'Found_in'})
        count_ += 1
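For larger chunks, a more vectorized per-chunk update may be worth sketching. The helper below is only a sketch (update_main is a hypothetical name, not from the post); it assumes the main table's ID matches Customer_ID in the chunks and converts the timestamp columns to datetimes so the comparisons are well defined.
import pandas as pd

def update_main(main_df, chunk):
    # Hypothetical helper: fold one chunk into the main table.
    out = main_df.copy()
    out['Snapshot'] = pd.to_datetime(out['Snapshot'])
    out['Latest_value_found'] = pd.to_datetime(out['Latest_value_found'])
    chunk = chunk.assign(Snapshot_timestamp=pd.to_datetime(chunk['Snapshot_timestamp']))

    merged = out.merge(chunk[['Customer_ID', 'Snapshot_timestamp', 'Table_ID']],
                       left_on='ID', right_on='Customer_ID', how='left')
    # A candidate must lie strictly below the per-row threshold and beat the
    # value found so far (or there must be no value yet).
    ok = (merged['Snapshot_timestamp'] < merged['Snapshot']) & (
        merged['Latest_value_found'].isna()
        | (merged['Snapshot_timestamp'] > merged['Latest_value_found']))
    best = (merged[ok].sort_values('Snapshot_timestamp')
                      .drop_duplicates('ID', keep='last')[['ID', 'Snapshot_timestamp', 'Table_ID']])

    out = out.merge(best, on='ID', how='left')
    out['Latest_value_found'] = out['Snapshot_timestamp'].combine_first(out['Latest_value_found'])
    out['Found_in'] = out['Table_ID'].where(out['Snapshot_timestamp'].notna(), out['Found_in'])
    return out.drop(columns=['Snapshot_timestamp', 'Table_ID'])

result = df
for data_chunk in list_of_data_chunks:
    result = update_main(result, data_chunk)
print(result)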

Filter-Out Duplicate Table Entries

I want to read in T1 and write it out as T2 (note both are .csv).
T1 contains duplicate rows; I don't want to write duplicates in T2.
T1
+------+------+---------+---------+---------+
| Type | Year | Value 1 | Value 2 | Value 3 |
+------+------+---------+---------+---------+
| a    | 8    | x       | y       | z       |
| b    | 10   | q       | r       | s       |
+------+------+---------+---------+---------+
T2
+------+------+---------+-------+
| Type | Year | Value # | Value |
+------+------+---------+-------+
| a    | 8    | 1       | x     |
| a    | 8    | 2       | y     |
| a    | 8    | 3       | z     |
| b    | 10   | 1       | q     |
| ...  | ...  | ...     | ...   |
+------+------+---------+-------+
Currently, I have this excruciatingly slow code to filter out duplicates:
no_dupes = []
for row in reader:
    type = row[0]
    year = row[1]
    index = type, year
    values_list = row[2:]
    if index not in no_dupes:
        for i, j in enumerate(values_list):
            line = [type, year, str(i + 1), str(j)]
            writer.writerow(line)  # using csv module
        no_dupes.append(index)
I cannot exaggerate how slow this code is when T1 gets large.
Is there a faster way to filter out duplicates from T1 as I write to T2?
I think you want something like this:
no_dupes = set()
for row in reader:
    type, year = row[0], row[1]
    values_list = row[2:]
    for index, value in enumerate(values_list, start=1):
        line = (type, year, index, value)
        no_dupes.add(line)

for t in no_dupes:
    writer.writerow(t)
If possible, convert reader to a set and iterate over the set instead; then there is no possibility of dupes.
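A minimal sketch of that second suggestion, assuming T1.csv and T2.csv as the file names and no header row (note that a set does not preserve the original row order):
import csv

with open('T1.csv', newline='') as f_in, open('T2.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    # Collapsing the rows into a set of tuples removes duplicate T1 rows
    # up front, so per-row membership checks are no longer needed.
    unique_rows = set(map(tuple, reader))
    for type_, year, *values in unique_rows:
        for i, value in enumerate(values, start=1):
            writer.writerow([type_, year, i, value])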
