Filter-Out Duplicate Table Entries - python

I want to read in T1 and write it out as T2 (note both are .csv).
T1 contains duplicate rows; I don't want to write duplicates in T2.
T1
+------+------+---------+---------+---------+
| Type | Year | Value 1 | Value 2 | Value 3 |
+------+------+---------+---------+---------+
| a    | 8    | x       | y       | z       |
| b    | 10   | q       | r       | s       |
+------+------+---------+---------+---------+
T2
+------+------+---------+-------+
| Type | Year | Value # | Value |
+------+------+---------+-------+
| a    | 8    | 1       | x     |
| a    | 8    | 2       | y     |
| a    | 8    | 3       | z     |
| b    | 10   | 1       | q     |
| ...  | ...  | ...     | ...   |
+------+------+---------+-------+
Currently, I have this excruciatingly slow code to filter out duplicates:
no_dupes = []
for row in reader:
    type = row[0]
    year = row[1]
    index = type, year
    values_list = row[2:]
    if index not in no_dupes:
        for i, j in enumerate(values_list):
            line = [type, year, str(i+1), str(j)]
            writer.writerow(line)  # using csv module
        no_dupes.append(index)
I cannot exaggerate how slow this code is when T1 gets large.
Is there a faster way to filter out duplicates from T1 as I write to T2?

I think you want something like this:
no_dupes = set()
for row in reader:
    type, year = row[0], row[1]
    values_list = row[2:]
    for index, value in enumerate(values_list, start=1):
        line = (type, year, index, value)
        no_dupes.add(line)

for t in no_dupes:
    writer.writerow(t)

If possible, convert reader to a set and iterate over the set instead; then there is no possibility of dups.
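That could look roughly like the sketch below; the file names t1.csv and t2.csv are placeholders, and note that a set does not preserve the original row order:

import csv

with open("t1.csv", newline="") as f_in, open("t2.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    # Rows are lists, which are unhashable, so convert each one to a tuple.
    # Membership testing in a set is O(1), unlike the O(n) scan of a list.
    unique_rows = {tuple(row) for row in reader}
    for type_, year, *values in unique_rows:
        for i, value in enumerate(values, start=1):
            writer.writerow([type_, year, i, value])

If the original order matters, keep a seen set and write each row the first time it appears instead of writing from the set afterwards.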

Related

Python - pandas remove duplicate rows based on condition

I have a csv which has data that looks like this
| id | code | date                       |
|----+------+----------------------------|
| 1  | 2    | 2022-10-05 07:22:39+00::00 |
| 1  | 0    | 2022-11-05 02:22:35+00::00 |
| 2  | 3    | 2021-01-05 10:10:15+00::00 |
| 2  | 0    | 2019-01-11 10:05:21+00::00 |
| 2  | 1    | 2022-01-11 10:05:22+00::00 |
| 3  | 2    | 2022-10-10 11:23:43+00::00 |
I want to remove duplicate ids based on the following conditions -
For the code column, choose a value that is not equal to 0; if there are several, choose the one with the latest timestamp.
Add another column, prev_code, containing a list of all the remaining code values that did not end up in the code column.
Something like this -
| id | code | prev_code |
|----+------+-----------|
| 1  | 2    | [0]       |
| 2  | 1    | [0,2]     |
| 3  | 2    | []        |
There is probably a sleeker solution, but something along the following lines should work.
import pandas as pd
df = pd.read_csv('file.csv')
# for each id, keep the non-zero code with the latest timestamp
lastcode = df[df.code != 0].groupby('id').apply(lambda block: block[block['date'] == block['date'].max()]['code'])
# for each id, collect every other code value into a list
prev_codes = df.groupby('id').agg(code=('code', lambda x: [val for val in x if val != lastcode[x.name].values[0]]))['code']
pd.DataFrame({'id': map(lambda x: x[0], lastcode.index.values), 'code': lastcode.values, 'prev_code': prev_codes.values})
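For comparison, here is a groupby-apply sketch of the same logic; it assumes the sample layout above and that the ISO-style date strings sort chronologically when compared as plain strings:

import pandas as pd

df = pd.read_csv('file.csv')

def pick(group):
    # Prefer non-zero codes; among those, take the one with the latest timestamp.
    candidates = group[group['code'] != 0]
    # Fall back to 0 if every code in the group is 0.
    chosen = candidates.sort_values('date')['code'].iloc[-1] if not candidates.empty else 0
    # Every other code value goes into prev_code.
    return pd.Series({'code': chosen,
                      'prev_code': [c for c in group['code'] if c != chosen]})

result = df.groupby('id').apply(pick).reset_index()
print(result)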

How to assign time stamp to the command in Python?

There are several types of commands in the third column of the text file, so I am using regular expressions to count the number of occurrences of each type of command.
For example, ACTIVE occurs 3 times and REFRESH 2 times. I would like to make the program more flexible, so I wish to record the time of each command as well.
Since a command can occur more than once, if the script associates each command with its time, users will know which ACTIVE occurred at what time. Any guidance or suggestions are welcome.
The idea is to make the script more flexible.
My code:
import re

a = a_1 = b = b_1 = c = d = e = 0
lines = open("page_stats.txt", "r").readlines()
for line in lines:
    if re.search(r"WRITING_A", line):
        a_1 += 1
    elif re.search(r"WRITING", line):
        a += 1
    elif re.search(r"READING_A", line):
        b_1 += 1
    elif re.search(r"READING", line):
        b += 1
    elif re.search(r"PRECHARGE", line):
        c += 1
    elif re.search(r"ACTIVE", line):
        d += 1
File content:
------------------------------------
| Number | Time | Command   | Data |
------------------------------------
| 1      | 0015 | ACTIVE    |      |
| 2      | 0030 | WRITING   |      |
| 3      | 0100 | WRITING_A |      |
| 4      | 0115 | PRECHARGE |      |
| 5      | 0120 | REFRESH   |      |
| 6      | 0150 | ACTIVE    |      |
| 7      | 0200 | READING   |      |
| 8      | 0314 | PRECHARGE |      |
| 9      | 0318 | ACTIVE    |      |
| 10     | 0345 | WRITING_A |      |
| 11     | 0430 | WRITING_A |      |
| 12     | 0447 | WRITING   |      |
| 13     | 0503 | PRECHARGE |      |
| 14     | 0610 | REFRESH   |      |
Assuming you want to count the occurrences of each command and store
the timestamps of each command as well, would you please try:
import re

count = {}
timestamps = {}
with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            count[m[3]] = count[m[3]] + 1 if m[3] in count else 1
            if m[3] in timestamps:
                timestamps[m[3]].append(m[2])
            else:
                timestamps[m[3]] = [m[2]]

# see the limited result (example)
#print(count["ACTIVE"])
#print(timestamps["ACTIVE"])

# see the results
for key in count:
    print("%-10s: %2d, %s" % (key, count[key], timestamps[key]))
Output:
REFRESH   :  2, ['0120', '0610']
WRITING   :  2, ['0030', '0447']
PRECHARGE :  3, ['0115', '0314', '0503']
ACTIVE    :  3, ['0015', '0150', '0318']
READING   :  1, ['0200']
WRITING_A :  3, ['0100', '0345', '0430']
m = re.split(r"\s*\|\s*", line) splits line on a pipe character which
may be preceded and/or followed by whitespace.
The list elements m[1], m[2], and m[3] then correspond to the Number, Time, and Command
columns, in that order.
The condition if len(m) > 3 and re.match(r"\d+", m[1]) skips the
header lines.
The dictionary variables count and timestamps are then created,
incremented, or appended to entry by entry.
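As a side note, the same bookkeeping can be written a little more compactly with collections.defaultdict; this is only a sketch of that variant and does not change the behaviour:

import re
from collections import defaultdict

count = defaultdict(int)        # missing keys start at 0
timestamps = defaultdict(list)  # missing keys start as []

with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            count[m[3]] += 1
            timestamps[m[3]].append(m[2])

for key in count:
    print("%-10s: %2d, %s" % (key, count[key], timestamps[key]))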

Splitting a csv into multiple csv's depending on what is in column 1 using python

I currently have a large csv containing data for a number of events.
Column one contains a number of dates as well as an id for each event.
Basically, I want to write something in Python that, whenever there is an id number (AL.....), creates a new csv with the id number as the title, containing all the data before the next id number, so I end up with a csv for each event.
For info, the whole csv contains 8 columns, but the division into individual csvs is only predicated on column one.
Use Python to split a CSV file with multiple headers
I notice this question is quite similar, but in my case I have AL followed by a different string of numbers each time, and I also want to name the new csvs after the id numbers.
You can achieve this using pandas, so let's first generate some data:
import pandas as pd
import numpy as np

def date_string():
    return str(np.random.randint(1, 32)) + "/" + str(np.random.randint(1, 13)) + "/1997"

l = [date_string() for x in range(20)]
l[0] = "AL123"
l[10] = "AL321"
df = pd.DataFrame(l, columns=['idx'])
# -->
|    | idx        |
|---:|:-----------|
|  0 | AL123      |
|  1 | 24/3/1997  |
|  2 | 8/6/1997   |
|  3 | 6/9/1997   |
|  4 | 31/12/1997 |
|  5 | 11/6/1997  |
|  6 | 2/3/1997   |
|  7 | 31/8/1997  |
|  8 | 21/5/1997  |
|  9 | 30/1/1997  |
| 10 | AL321      |
| 11 | 8/4/1997   |
| 12 | 21/7/1997  |
| 13 | 9/10/1997  |
| 14 | 31/12/1997 |
| 15 | 15/2/1997  |
| 16 | 21/2/1997  |
| 17 | 3/3/1997   |
| 18 | 16/12/1997 |
| 19 | 16/2/1997  |
So the interesting positions are 0 and 10, as those are where the AL* strings sit...
Now to filter for the AL* rows you can use:
idx = df.index[df['idx'].str.startswith('AL')]  # gets all indices where an AL* string is
dfs = np.split(df, idx)                         # splits the data at those positions
for out in dfs[1:]:
    name = out.iloc[0, 0]
    out.to_csv(name + ".csv", index=False, header=False)  # saves the data
This gives you two csv files named AL123.csv and AL321.csv with the first line being the AL* string.
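If you would rather not load the whole file with pandas, a single pass with the csv module works too. This is only a sketch: the input name events.csv is a placeholder, and it assumes every id sits in column one and starts with AL:

import csv

writer = None
out_file = None
with open("events.csv", newline="") as f:
    for row in csv.reader(f):
        if row and row[0].startswith("AL"):
            # An id row starts a new output file named after the id.
            if out_file:
                out_file.close()
            out_file = open(row[0] + ".csv", "w", newline="")
            writer = csv.writer(out_file)
        if writer:
            writer.writerow(row)
if out_file:
    out_file.close()

Rows that appear before the first id (if any) are skipped rather than written anywhere.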

Generating means of combinations in multiindex dataframe (Pandas)

I have a multiindexed dataframe where the index levels have multiple categories, something like this:
| Level1 | Level2 | Level3 | Var1 | Var2 | Var3 |
|--------|--------|--------|------|------|------|
| A      | A      | A      |      |      |      |
| A      | A      | B      |      |      |      |
| A      | B      | A      |      |      |      |
| A      | B      | B      |      |      |      |
| B      | A      | A      |      |      |      |
| B      | A      | B      |      |      |      |
| B      | B      | A      |      |      |      |
| B      | B      | B      |      |      |      |
In summary, and specifically in my case, Level 1 has 2 categories, Level 2 has 24, Level 3 has 6, and there are also Level 4 (674) and Level 5 (9), with some minor variation depending on the specific higher-level values (Level1 == 1 actually has 24 Level2s, but Level1 == 2 has 23).
I need to generate all possible combinations of 3 at Level 5, then calculate their means for Vars 1-3.
I am trying something like this:
# Resulting df to be populated
df_result = pd.DataFrame([])

# Retrieving values at Level1
lev1s = df.index.get_level_values("Level1").unique()

# Looping through each Level1 value
for lev1 in lev1s:
    # Filtering df based on Level1 value
    df_lev1 = df.query('Level1 == ' + str(lev1))
    # Repeating...
    lev2s = df_lev1.index.get_level_values("Level2").unique()
    for lev2 in lev2s:
        df_lev2 = df_lev1.query('Level2 == ' + str(lev2))
        # ... until Level3
        lev3s = df_lev2.index.get_level_values("Level3").unique()
        # Creating all combinations
        combs = itertools.combinations(lev3s, 3)
        # Looping through each combination
        for comb in combs:
            # Filtering values in combination
            df_comb = df_lev2.query('Level3 in ' + str(comb))
            # Calculating means using groupby (groupby might not be necessary,
            # but I don't believe it has much of an impact)
            df_means = df_comb.reset_index().groupby(['Level1', 'Level2']).mean()
            # Extending resulting dataframe
            df_result = df_result.append(df_means)
The thing is, after a little while, this process gets really slow. Since I have around 2 * 24 * 6 * 674 levels and 84 combinations (9 elements taken 3 at a time), I am expecting more than 16 million df_means frames to be calculated.
Is there any more efficient way to do this?
Thank you.
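One direction that might avoid re-filtering the frame for every combination is to precompute per-group sums and counts once, then combine them per combination. This is only a sketch, assuming Var1-Var3 are numeric and that each mean is taken over all rows of the selected Level3 groups, as in the code above:

import itertools
import pandas as pd

# Per-(Level1, Level2, Level3) sums and row counts, computed once.
grouped = df.groupby(level=["Level1", "Level2", "Level3"])
sums = grouped.sum()      # sums of Var1..Var3 per group
counts = grouped.size()   # number of rows per group

means, index = [], []
for (lev1, lev2), block in sums.groupby(level=["Level1", "Level2"]):
    lev3s = block.index.get_level_values("Level3")
    for comb in itertools.combinations(lev3s, 3):
        keys = [(lev1, lev2, lev3) for lev3 in comb]
        # Mean over all rows in the combination = total sum / total row count.
        means.append(sums.loc[keys].sum() / counts.loc[keys].sum())
        index.append((lev1, lev2, comb))

df_result = pd.DataFrame(
    means,
    index=pd.MultiIndex.from_tuples(index, names=["Level1", "Level2", "Comb"]),
)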

Pyspark Melting Null Columns

I have a dataframe that looks like this:
# +----+----------+----------+----------+
# | id | c_type_1 | c_type_2 | c_type_3 |
# +----+----------+----------+----------+
# | 1  | null     | null     | r        |
# | 2  | a        | null     | null     |
# | 3  | null     | null     | null     |
# +----+----------+----------+----------+
I need to convert it into something like this:
# +----+----------+------------+
# | id | c_type   | c_type_val |
# +----+----------+------------+
# | 1  | c_type_3 | r          |
# | 2  | c_type_1 | a          |
# | 3  | null     | null       |
# +----+----------+------------+
Each row only has either one c_type value or all c_type values will be null.
I'm currently melting the rows like so:
from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_cols, value_cols, c_type, c_value):
    v_arr = []
    for c in value_cols:
        v_arr.append(struct(lit(c).alias(c_type), col(c).alias(c_value)))
    vars_and_vals = array(*v_arr)
    tmp = df.withColumn("vars_and_vals", explode(vars_and_vals))
    cols = id_cols + [
        col("vars_and_vals")[x].alias(x) for x in [c_type, c_value]]
    return tmp.select(*cols)

melted = melt(df, df.columns[:1], df.columns[1:4], 'c_type', 'c_type_val')
melted.filter(melted.c_type_val.isNotNull()).show()
The problem is that filtering the null values for c_type_val filters out the row for id == 3 (any rows with null c_type). I need a way to melt and filter to retain the third row as null c_type and value.
I tried using pandas; it may give you an idea of how to solve this:
import pandas as pd
temp = df.filter(like='c_type')
df = pd.merge(df, temp[temp.notnull()].stack().reset_index(), left_index=True, right_on=['level_0'], how='left')
df = df[['id', 'level_1', 0]].reset_index(drop=True).rename(columns={'level_1': 'c_type', 0: 'c_type_val'})
print(df)
Output:
   id    c_type c_type_val
0   1  c_type_3          r
1   2  c_type_1          a
2   3       NaN        NaN
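Since the question is about PySpark, a Spark-only alternative that keeps the all-null rows is to skip the melt entirely and pick the first non-null column per row with coalesce. This is just a sketch, relying on the stated guarantee that each row has at most one non-null c_type value:

from pyspark.sql import functions as F

value_cols = ["c_type_1", "c_type_2", "c_type_3"]

# First non-null column name per row (stays null when every column is null).
c_type = F.coalesce(*[F.when(F.col(c).isNotNull(), F.lit(c)) for c in value_cols])
# First non-null value per row.
c_type_val = F.coalesce(*[F.col(c) for c in value_cols])

result = df.select("id", c_type.alias("c_type"), c_type_val.alias("c_type_val"))
result.show()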
