I have a CSV with data that looks like this:
| id | code | date                      |
|----|------|---------------------------|
| 1  | 2    | 2022-10-05 07:22:39+00:00 |
| 1  | 0    | 2022-11-05 02:22:35+00:00 |
| 2  | 3    | 2021-01-05 10:10:15+00:00 |
| 2  | 0    | 2019-01-11 10:05:21+00:00 |
| 2  | 1    | 2022-01-11 10:05:22+00:00 |
| 3  | 2    | 2022-10-10 11:23:43+00:00 |
I want to remove duplicate ids based on the following conditions:
For the code column, keep a value that is not equal to 0; if several qualify, keep the one with the latest timestamp.
Add another column, prev_code, containing a list of all the remaining code values for that id (the ones not kept in the code column).
Something like this:
| id | code | prev_code |
|----|------|-----------|
| 1  | 2    | [0]       |
| 2  | 1    | [0,2]     |
| 3  | 2    | []        |
There is probably a sleeker solution but something along the following lines should work.
import pandas as pd

df = pd.read_csv('file.csv')
# for each id, the non-zero code(s) on the latest date
lastcode = df[df.code != 0].groupby('id').apply(lambda block: block[block['date'] == block['date'].max()]['code'])
# for each id, every code value other than the one kept above
prev_codes = df.groupby('id').agg(code=('code', lambda x: [val for val in x if val != lastcode[x.name].values[0]]))['code']
pd.DataFrame({'id': map(lambda x: x[0], lastcode.index.values), 'code': lastcode.values, 'prev_code': prev_codes.values})
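A possibly sleeker variant, only as a sketch: it assumes the columns really are named id, code and date, and that the dates parse cleanly. Sort so that, within each id, non-zero codes come first and newer dates before older ones; the first element per group is then the code to keep and the rest become prev_code. The order inside prev_code may differ from the example above.

import pandas as pd

df = pd.read_csv('file.csv', parse_dates=['date'])
# within each id: non-zero codes first, newest first
ordered = (df.assign(_zero=df['code'].eq(0))
             .sort_values(['id', '_zero', 'date'], ascending=[True, True, False]))
codes = ordered.groupby('id')['code'].agg(list)   # all codes per id, kept one first
out = pd.DataFrame({'id': codes.index,
                    'code': codes.str[0],
                    'prev_code': codes.str[1:]})
print(out)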
I'm struggling with a DataFrame-related problem.
columns = [desc[0] for desc in cursor.description]
data = cursor.fetchall()
df2 = pd.DataFrame(list(data), columns=columns)
df2 is as follows:
| Col1 | Col2 |
| -------- | -------------- |
| 2145779 | 2 |
| 8059234 | 3 |
| 2145779 | 3 |
| 4265093 | 2 |
| 2145779 | 2 |
| 1728234 | 5 |
I want to make a list of the values in Col1 where the value of Col2 is 3.
You can use boolean indexing:
out = df2.loc[df2.Col2.eq(3), "Col1"].agg(list)
print(out)
Prints:
[8059234, 2145779]
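The same thing can be written with .tolist(), and it is worth checking the dtype that came back from the cursor: if Col2 arrived as strings, comparing against the integer 3 would match nothing. A small sketch of both points, using the names above:

out = df2.loc[df2["Col2"] == 3, "Col1"].tolist()

# if Col2 was fetched as strings, normalize the dtype first, e.g.
# out = df2.loc[df2["Col2"].astype(int).eq(3), "Col1"].tolist()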
I have two DataFrames.
df1:
A | B | C
-----|---------|---------|
25zx | b(50gh) | |
50tr | a(70lc) | c(50gh) |
df2:
A | B
-----|-----
25zx | T
50gh | K
50tr | K
70lc | T
I want to replace values in df1. The values I'm matching on come from df2['A'], but the value I want to put into df1 is the corresponding value from df2['B'].
So the final table would look like:
df3:
A | B | C
-----|---------|---------|
T | b(K) | |
K | a(T) | c(K) |
Convert df2 to a dict and use replace:
print(df1.replace(df2.set_index("A")["B"].to_dict(), regex=True))
A B C
0 T b(K) None
1 K a(T) c(K)
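For completeness, a self-contained sketch of that answer (the DataFrames are rebuilt from the question's tables; regex=True is what lets the mapping also hit substrings such as the 50gh inside b(50gh), not just whole cell values):

import pandas as pd

df1 = pd.DataFrame({"A": ["25zx", "50tr"],
                    "B": ["b(50gh)", "a(70lc)"],
                    "C": [None, "c(50gh)"]})
df2 = pd.DataFrame({"A": ["25zx", "50gh", "50tr", "70lc"],
                    "B": ["T", "K", "K", "T"]})

mapping = df2.set_index("A")["B"].to_dict()   # {'25zx': 'T', '50gh': 'K', ...}
df3 = df1.replace(mapping, regex=True)        # keys are treated as regex patterns
print(df3)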
I'm trying to merge two tables so that the rows of the left side stay unchanged and one column gets updated from the right side. The left table's column takes the value from the right side if it is the highest value seen so far (i.e., higher than the current one on the left side) but below an individually set threshold.
The threshold is set by the column "Snapshot"; the column "Latest value found" holds the highest value observed so far (within the threshold).
To stay memory efficient, the process works over many small chunks of data and needs to iterate over a list of DataFrames. In each DataFrame the origin is recorded in the column "Table ID". When the main DataFrame finds a value, it stores the origin in its column "Found in".
Example
Main table (left side)
+----+-------------------------------------------+--------------------+----------+
| ID | Snapshot timestamp (Maximum search value) | Latest value found | Found in |
+----+-------------------------------------------+--------------------+----------+
| 1 | Aug-18 | NULL | NULL |
| 2 | Aug-18 | NULL | NULL |
| 3 | May-18 | NULL | NULL |
| 4 | May-18 | NULL | NULL |
| 5 | May-18 | NULL | NULL |
+----+-------------------------------------------+--------------------+----------+
First data chunk
+-----+----------+-------------+--------------------+
| Idx | Table ID | Customer ID | Snapshot timestamp |
+-----+----------+-------------+--------------------+
| 1 | Table1 | 1 | Jan-14 |
| 2 | Table1 | 1 | Feb-14 |
| 3 | Table1 | 2 | Jan-14 |
| 4 | Table1 | 2 | Feb-14 |
| 5 | Table1 | 3 | Mar-14 |
+-----+----------+-------------+--------------------+
Result: Left-side after first merge
+----+--------------------+--------------------+----------+
| ID | Snapshot timestamp | Latest value found | Found in |
+----+--------------------+--------------------+----------+
| 1 | Aug-18 | Feb-14 | Table1 |
| 2 | Aug-18 | Feb-14 | Table1 |
| 3 | May-18 | Mar-14 | Table1 |
| 4 | May-18 | NULL | NULL |
| 5 | May-18 | NULL | NULL |
+----+--------------------+--------------------+----------+
Second data chunk
+-----+----------+-------------+--------------------+
| Idx | Table ID | Customer ID | Snapshot timestamp |
+-----+----------+-------------+--------------------+
| 1 | Table2 | 1 | Mar-15 |
| 2 | Table2 | 1 | Apr-15 |
| 3 | Table2 | 2 | Feb-14 |
| 4 | Table2 | 3 | Feb-14 |
| 5 | Table2 | 4 | Aug-19 |
+-----+----------+-------------+--------------------+
Result: Left-side after second merge
+----+--------------------+--------------------+----------+
| ID | Snapshot timestamp | Latest value found | Found in |
+----+--------------------+--------------------+----------+
| 1 | Aug-18 | Apr-15 | Table2 |
| 2 | Aug-18 | Feb-14 | Table1 |
| 3 | May-18 | Mar-14 | Table1 |
| 4 | May-18 | NULL | NULL |
| 5 | May-18 | NULL | NULL |
+----+--------------------+--------------------+----------+
Code
import pandas as pd
import numpy as np
# Main dataframe
df = pd.DataFrame({"ID": [1,2,3,4,5],
"Snapshot": ["2019-08-31", "2019-08-31","2019-05-31","2019-05-31","2019-05-31"], # the maximum interval than can be used
"Latest_value_found": [None,None,None,None,None],
"Found_in": [None,None,None,None,None]}
)
# Data chunks used for updates
Table1 = pd.DataFrame({"Idx": [1,2,3,4,5],
"Table_ID": ["Table1", "Table1", "Table1", "Table1", "Table1"],
"Customer_ID": [1,1,2,2,3],
"Snapshot_timestamp": ["2019-01-31","2019-02-28","2019-01-31","2019-02-28","2019-03-30"]}
)
Table2 = pd.DataFrame({"Idx": [1,2,3,4,5],
"Table_ID": ["Table2", "Table2", "Table2", "Table2", "Table2"],
"Customer_ID": [1,1,2,3,4],
"Snapshot_timestamp": ["2019-03-31","2019-04-30","2019-02-28","2019-02-28","2019-08-31"]}
)
list_of_data_chunks = [Table1, Table2]
# work: iteration
for data_chunk in list_of_data_chunks:
    pass  # here the merging is performed iteratively
Here is my workaround, although I would try not to do this in a loop if it's just two tables. I removed your "idx" column from the joining tables.
df_list = [df, Table1, Table2]
main_df = df_list[0]
count_ = 0
for i in df_list[1:]:
    # note: assumes the chunk tables' Customer_ID column has been renamed to ID
    # so that the merge key matches the main table
    main_df = main_df.merge(i, how='left', on='ID').sort_values(by=['ID', 'Snapshot_timestamp'], ascending=[True, False])
    main_df['rownum'] = main_df.groupby(['ID']).cumcount() + 1
    if count_ < 1:
        # first chunk: keep the newest row per ID and fill the columns directly
        main_df = main_df[main_df['rownum'] == 1].drop(columns=['rownum', 'Latest_value_found', 'Found_in'])
        main_df['Latest_value_found'] = np.where(main_df['Snapshot'] > main_df['Snapshot_timestamp'], main_df['Snapshot_timestamp'], pd.NaT)
        main_df['Found_in'] = np.where(main_df['Snapshot'] > main_df['Snapshot_timestamp'], main_df['Table_ID'], np.nan)
        main_df = main_df.drop(columns=['Snapshot_timestamp', 'Table_ID']).reset_index(drop=True)
        count_ += 1
    else:
        # later chunks: compare the incoming value with what is already stored
        main_df = main_df[main_df['rownum'] == 1].drop(columns='rownum').reset_index(drop=True)
        this_table = []
        this_date = []
        for i in main_df.index:
            curr_snapshot = pd.to_datetime(main_df.loc[i, 'Snapshot'])
            curr_latest_val = pd.to_datetime(main_df.loc[i, 'Latest_value_found'])
            curr_foundin = main_df.loc[i, 'Found_in']
            next_foundin = main_df.loc[i, 'Table_ID']
            next_snapshot = pd.to_datetime(main_df.loc[i, 'Snapshot_timestamp'])
            if curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val == next_snapshot:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
            elif curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val > next_snapshot:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
            elif curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val < next_snapshot:
                this_date.append(next_snapshot)
                this_table.append(next_foundin)
            elif pd.isnull(curr_latest_val) and next_snapshot < curr_snapshot:
                this_date.append(next_snapshot)
                this_table.append(next_foundin)
            else:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
        main_df = main_df.drop(columns=['Latest_value_found', 'Found_in', 'Table_ID', 'Snapshot_timestamp'])
        main_df = pd.concat([main_df, pd.Series(this_date), pd.Series(this_table)], axis=1).rename(columns={0: 'Latest_value_found', 1: 'Found_in'})
        count_ += 1
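If the chunks get large, the row-by-row comparison can be replaced with a per-chunk vectorized update. The following is only a sketch of that idea, not the code above: it assumes the column names from the setup code (Snapshot, Latest_value_found, Found_in, Customer_ID, Snapshot_timestamp, Table_ID) and that the date columns have been converted with pd.to_datetime, so missing values become NaT and comparisons against them are simply False.

import pandas as pd

def apply_chunk(main_df, chunk):
    # attach each customer's threshold and keep only rows below it
    cand = (chunk.rename(columns={'Customer_ID': 'ID'})
                 .merge(main_df[['ID', 'Snapshot']], on='ID')
                 .query('Snapshot_timestamp < Snapshot'))
    # best remaining candidate per customer in this chunk
    best = (cand.sort_values('Snapshot_timestamp')
                .groupby('ID').tail(1)[['ID', 'Snapshot_timestamp', 'Table_ID']])
    merged = main_df.merge(best, on='ID', how='left')
    better = (merged['Snapshot_timestamp'].notna()
              & (merged['Latest_value_found'].isna()
                 | (merged['Snapshot_timestamp'] > merged['Latest_value_found'])))
    merged['Latest_value_found'] = merged['Latest_value_found'].where(~better, merged['Snapshot_timestamp'])
    merged['Found_in'] = merged['Found_in'].where(~better, merged['Table_ID'])
    return merged.drop(columns=['Snapshot_timestamp', 'Table_ID'])

df['Snapshot'] = pd.to_datetime(df['Snapshot'])
df['Latest_value_found'] = pd.to_datetime(df['Latest_value_found'])
for chunk in list_of_data_chunks:
    chunk = chunk.assign(Snapshot_timestamp=pd.to_datetime(chunk['Snapshot_timestamp']))
    df = apply_chunk(df, chunk)
print(df)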
I want to read in T1 and write it out as T2 (note both are .csv).
T1 contains duplicate rows; I don't want to write duplicates in T2.
T1
+------+------+---------+---------+---------+
| Type | Year | Value 1 | Value 2 | Value 3 |
+------+------+---------+---------+---------+
| a | 8 | x | y | z |
| b | 10 | q | r | s |
+------+------+---------+---------+---------+
T2
+------+------+---------+-------+
| Type | Year | Value # | Value |
+------+------+---------+-------+
| a | 8 | 1 | x |
| a | 8 | 2 | y |
| a | 8 | 3 | z |
| b | 10 | 1 | q |
| ... | ... | ... | ... |
+------+------+---------+-------+
Currently, I have this excruciatingly slow code to filter out duplicates:
no_dupes = []
for row in reader:
    type = row[0]
    year = row[1]
    index = type, year
    values_list = row[2:]
    if index not in no_dupes:
        for i, j in enumerate(values_list):
            line = [type, year, str(i + 1), str(j)]
            writer.writerow(line)  # using csv module
        no_dupes.append(index)
I cannot overstate how slow this code is when T1 gets large.
Is there a faster way to filter out duplicates from T1 as I write to T2?
I think you want something like this:
no_dupes = set()
for row in reader:
    type, year = row[0], row[1]
    values_list = row[2:]
    for index, value in enumerate(values_list, start=1):
        line = (type, year, index, value)
        no_dupes.add(line)

for t in no_dupes:
    writer.writerow(t)
If possible, convert reader to a set and iterate over that instead; then there is no possibility of duplicates.
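A minimal sketch of a faster version along those lines. The file names, the header row, and deduplicating on the (Type, Year) pair are assumptions based on the sample tables; csv.reader yields lists, which are unhashable, so the key is built as a tuple. Unlike the set-of-rows version above, this keeps the input order and writes only the first occurrence of each pair.

import csv

seen = set()  # (Type, Year) pairs already written; set membership is O(1)

with open('T1.csv', newline='') as f_in, open('T2.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(['Type', 'Year', 'Value #', 'Value'])
    next(reader)  # skip T1's header row
    for row in reader:
        key = (row[0], row[1])
        if key in seen:
            continue
        seen.add(key)
        for i, value in enumerate(row[2:], start=1):
            writer.writerow([row[0], row[1], i, value])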