Pandas DataFrame removes too many rows - python

I have a dataframe with a lot of tweets and I want to remove the duplicates. The tweets are stored in fh1.df['Tweets']. i counts the non-duplicates and j counts the duplicates. In the else branch I remove the rows of the duplicates, and in the if branch I build a new list, tweetChecklist, that collects all the good tweets.
If I add i + j, I get the number of original tweets, so that part is fine. But for a reason I don't understand, the else branch removes too many rows: after the for loop the dataframe is much smaller (about 1/10 of its original size).
How does the line "fh1.df = fh1.df[fh1.df.Tweets != current_tweet]" remove too many rows?
tweetChecklist = []
for current_tweet in fh1.df['Tweets']:
    if current_tweet not in tweetChecklist:
        i = i + 1
        tweetChecklist.append(current_tweet)
    else:
        j = j + 1
        fh1.df = fh1.df[fh1.df.Tweets != current_tweet]
fh1.df['Tweets'] = pd.Series(tweetChecklist)

NOTE
Graipher's solution tells you how to generate a unique dataframe. My answer tells you why your current operation removes too many rows (per your question).
END NOTE
When you enter the "else" statement to remove the duplicated tweet you are removing ALL of the rows that have the specified tweet. Let's demonstrate:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.randint(0, 10, (10, 5)), columns=list('ABCDE'))
This produces, for example:
Out[118]:
A B C D E
0 2 7 0 5 4
1 2 8 8 3 7
2 9 7 4 6 2
3 9 7 7 9 2
4 6 5 7 6 8
5 8 8 7 6 7
6 6 1 4 5 3
7 1 4 7 8 7
8 3 2 5 8 5
9 5 8 9 2 4
With your method (assuming you want to remove duplicates from "A" instead of "Tweets") you would end up with the following, i.e. only the rows whose value in A was never duplicated:
Out[118]:
A B C D E
5 8 8 7 6 7
7 1 4 7 8 7
8 3 2 5 8 5
9 5 8 9 2 4
If you just want to make this unique, implement Graipher's suggestion. If you want to count how many duplicates you have you can do this:
total = df.shape[0]
duplicates = total - df.A.unique().size
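To make the difference concrete, here is a minimal sketch on a made-up five-row column (the data is hypothetical, not the asker's). It contrasts the boolean mask from the question, which drops every row equal to the current tweet including its first occurrence, with Series.duplicated(), which flags only the second and later occurrences.

```python
import pandas as pd

# Hypothetical toy data: "a" and "b" each appear twice
df = pd.DataFrame({"Tweets": ["a", "b", "a", "c", "b"]})

# The mask from the question: removes EVERY row equal to "a",
# so the first (good) occurrence is lost as well
all_removed = df[df.Tweets != "a"]

# duplicated() is True only for the second and later occurrences,
# so negating it keeps exactly one row per distinct tweet
only_repeats_removed = df[~df.Tweets.duplicated()]

# Number of rows that are repeats of an earlier row
n_duplicates = int(df.Tweets.duplicated().sum())

print(all_removed)
print(only_repeats_removed)
print(n_duplicates)
```

The count agrees with the `total - unique` calculation above, since each distinct value is kept once.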

In pandas there is usually a better way than iterating over the dataframe with a for loop.
In this case, what you really want is to group equal tweets together and just retain the first one. This can be achieved with pandas.DataFrame.groupby:
import random
import string
import pandas as pd
# some random one character tweets, so there are many duplicates
df = pd.DataFrame({"Tweets": random.choices(string.ascii_lowercase, k=100),
                   "Data": [random.random() for _ in range(100)]})
df.groupby("Tweets", as_index=False).first()
# Tweets Data
# 0 a 0.327766
# 1 b 0.677697
# 2 c 0.517186
# 3 d 0.925312
# 4 e 0.748902
# 5 f 0.353826
# 6 g 0.991566
# 7 h 0.761849
# 8 i 0.488769
# 9 j 0.501704
# 10 k 0.737816
# 11 l 0.428117
# 12 m 0.650945
# 13 n 0.530866
# 14 o 0.337835
# 15 p 0.567097
# 16 q 0.130282
# 17 r 0.619664
# 18 s 0.365220
# 19 t 0.005407
# 20 u 0.905659
# 21 v 0.495603
# 22 w 0.511894
# 23 x 0.094989
# 24 y 0.089003
# 25 z 0.511532
Even better, there is a method made explicitly for this, pandas.DataFrame.drop_duplicates, which is about twice as fast:
df.drop_duplicates(subset="Tweets", keep="first")
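The two approaches keep the same rows but do not return them in the same shape. A small sketch on hypothetical data: groupby sorts by the group key and builds a fresh index, while drop_duplicates preserves the original row order and index labels.

```python
import pandas as pd

# Hypothetical toy data with repeated tweets
df = pd.DataFrame({"Tweets": ["b", "a", "b", "c", "a"],
                   "Data": [1, 2, 3, 4, 5]})

# Sorted by "Tweets", with a new 0..n-1 index
by_group = df.groupby("Tweets", as_index=False).first()

# Original order and original index labels are preserved
deduped = df.drop_duplicates(subset="Tweets", keep="first")

print(by_group)
print(deduped)
```

If downstream code relies on the original index (for example to join back to other data), drop_duplicates is probably the safer choice.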

Related

Is there a way to reference a previous value in Pandas column efficiently?

I want to do some complex calculations in pandas while referencing previous values (basically I'm calculating row by row). However the loops take forever and I wanted to know if there was a faster way. Everybody keeps mentioning using shift but I don't understand how that would even work.
import pandas as pd
df = pd.DataFrame(index=range(500))
df["A"] = 2.0  # float, because the recurrence produces non-integer values
df["B"] = 5
df.loc[0, "A"] = 1
for i in range(1, len(df)):
    df.loc[i, "A"] = (df.loc[i - 1, "A"] / 3) - df.loc[i - 1, "B"] + 25
numpy_ext can be used for expanding calculations; see pandas-rolling-apply-using-multiple-columns for reference.
I have also included a simpler calculation to demonstrate the behaviour more plainly.
import numpy as np
import numpy_ext as npe
import pandas as pd

df = pd.DataFrame(index=range(5000))
df["A"] = 2
df["B"] = 5
df.loc[0, "A"] = 1
# for i in range(len(df)):
#     if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
# SO example - function of previous values in A and B
def f(A, B):
    r = np.sum(A[:-1] / 3) - np.sum(B[:-1] + 25) if len(A) > 1 else A[0]
    return r
# much simpler example, sum of previous values
def g(A):
    return np.sum(A[:-1])
df["AB_combo"] = npe.expanding_apply(f, 1, df["A"].values, df["B"].values)
df["A_running"] = npe.expanding_apply(g, 1, df["A"].values)
print(df.head(10).to_markdown())
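sample output
|    |   A |   B |   AB_combo |   A_running |
|---:|----:|----:|-----------:|------------:|
|  0 |   1 |   5 |     1      |           0 |
|  1 |   2 |   5 |   -29.6667 |           1 |
|  2 |   2 |   5 |   -59      |           3 |
|  3 |   2 |   5 |   -88.3333 |           5 |
|  4 |   2 |   5 |  -117.667  |           7 |
|  5 |   2 |   5 |  -147      |           9 |
|  6 |   2 |   5 |  -176.333  |          11 |
|  7 |   2 |   5 |  -205.667  |          13 |
|  8 |   2 |   5 |  -235      |          15 |
|  9 |   2 |   5 |  -264.333  |          17 |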
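Note that the loop in the question is a true recurrence: each new value of A depends on the value of A computed in the previous step, which shift alone cannot express because it only sees the original column. One common workaround, sketched here under the assumption that the columns fit in memory as plain NumPy arrays, is to keep the loop but run it over arrays instead of indexing the dataframe on every step.

```python
import numpy as np
import pandas as pd

# Same setup as the question, with float columns
df = pd.DataFrame({"A": 2.0, "B": 5.0}, index=range(500))
df.loc[0, "A"] = 1.0

# Work on NumPy arrays: per-element access is far cheaper than df[...][i]
a = df["A"].to_numpy(copy=True)
b = df["B"].to_numpy()

for i in range(1, len(a)):
    # each step uses the value of a computed in the previous step
    a[i] = a[i - 1] / 3 - b[i - 1] + 25

df["A"] = a
print(df.head())
```

With constant B = 5 the values settle towards 30, the fixed point of x = x/3 + 20, which is a quick sanity check on the result.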

Change some values based on condition

Can you help on the following task? I have a dataframe column such as:
index df['Q0']
0 1
1 2
2 3
3 5
4 5
5 6
6 7
7 8
8 3
9 2
10 4
11 7
I want to substitute the values in df.loc[3:8,'Q0'] with the values in df.loc[0:2,'Q0'] if df.loc[0,'Q0']!=df.loc[3,'Q0']
The result should look like the one below:
index df['Q0']
0 1
1 2
2 3
3 1
4 2
5 3
6 1
7 2
8 3
9 2
10 4
11 7
I tried the following line:
df.loc[3:8,'Q0'].where(~df.loc[0,'Q0']!=df.loc[3,'Q0']),other=df.loc[0:2,'Q0'],inplace=True)
or
df['Q0'].replace(to_replace=df.loc[3:8,'Q0'], value=df.loc[0:2,'Q0'], inplace=True)
But it doesn't work. Most likely I am doing something wrong.
Any suggestions?
You can use the cycle function:
from itertools import cycle
c = cycle(df["Q0"][0:3])
if df.Q0[0] != df.Q0[3]:
    df["Q0"][3:8] = [next(c) for _ in range(5)]
Thanks for the replies. I tried the suggestions but I have some issues:
#adnanmuttaleb -
When I applied the function to a dataframe with more than one column (e.g. 12x2 or larger), I noticed that the value in df.Q0[8] didn't change. Why?
#jezrael -
When I adjust my code to your suggestion I get the error:
ValueError: cannot copy sequence with size 5 to array axis with dimension 6
When I change the range to 6, I get wrong results:
import pandas as pd
from itertools import cycle
data={'Q0':[1,2,3,5,5,6,7,8,3,2,4,7],
      'Q0_New':[0,0,0,0,0,0,0,0,0,0,0,0]}
df = pd.DataFrame(data)
##### version 1
c = cycle(df["Q0"][0:3])
if df.Q0[0] != df.Q0[3]:
    df['Q0_New'][3:8] = [next(c) for _ in range(5)]
##### version 2
d = cycle(df.loc[0:3,'Q0'])
if df.Q0[0] != df.Q0[3]:
    df.loc[3:8,'Q0_New'] = [next(d) for _ in range(6)]
Why do we get different behaviors, and what corrections need to be made?
Thanks once more guys.
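A likely explanation, offered as a sketch rather than a definitive diagnosis: a plain integer slice such as [3:8] is positional and excludes the end, so it covers rows 3 to 7 (5 rows), while .loc[3:8] slices by label and includes both endpoints (6 rows). The same applies to .loc[0:3], which picks up 4 source values instead of the intended 3. The version below uses label slicing throughout and lets the target range decide how many values to draw; writing into Q0_New follows version 2 above.

```python
import pandas as pd
from itertools import cycle, islice

df = pd.DataFrame({"Q0": [1, 2, 3, 5, 5, 6, 7, 8, 3, 2, 4, 7],
                   "Q0_New": [0] * 12})

positional = df["Q0"].iloc[3:8]   # rows 3..7, end excluded
by_label = df.loc[3:8, "Q0"]      # rows 3..8, end included

if df.loc[0, "Q0"] != df.loc[3, "Q0"]:
    pattern = df.loc[0:2, "Q0"].tolist()   # labels 0..2 -> [1, 2, 3]
    n_rows = len(df.loc[3:8])              # 6 target rows
    # repeat the pattern just often enough to fill the target rows
    df.loc[3:8, "Q0_New"] = list(islice(cycle(pattern), n_rows))

print(df)
```

Assigning through a single .loc call also avoids the chained indexing of df['Q0_New'][3:8] = ..., which pandas may apply to a temporary copy.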

Python - Create multiple lists and zip

I am looking to produce multiple lists based on the same function which randomises data based on a list. I want to be able to easily change how many of these new lists I want to have and then combine. The code which creates each list is the following:
"""
"""
def normal_transform(R):
    R_ensemble=[]
    for i in range(0,len(R)):
        if R[i]==0:
            R_ensemble.append(0)
        else:
            R_ensemble.append(np.random.normal(loc=R[i],scale=R[i]/4,size=None))
    return R_ensemble
This perturbs each value from the list based on a normal distribution.
To combine them is fine when I just want a handful of lists:
"""
"""
ensemble_form_1,ensemble_form_2,ensemble_form_3 = [],[],[]
ensemble_form_1 = normal_transform(R)
ensemble_form_2 = normal_transform(R)
ensemble_form_3 = normal_transform(R)
zipped_ensemble = list(zip(ensemble_form_1,ensemble_form_2,ensemble_form_3))
df_ensemble = pd.DataFrame(zipped_ensemble, columns = ['Ensemble_1', 'Ensemble_2','Ensemble_3'])
return ensemble_form_1, ensemble_form_2, ensemble_form_3
How could I repeat the same randomisation process to create a fixed number of lists (say 50 or 100), and then combine them into a table? Is there an easy way to do this with a for loop, or any other method? I'd need to be able to pick out each new list/column individually, as I would be combining the results in some way.
Any help would be greatly appreciated.
You can construct multiple lists and a table like this:
import pandas as pd
import numpy as np
# Your function for creating the individual lists
def normal_transform(R):
    R_ensemble=[]
    for i in range(0,len(R)):
        if R[i]==0:
            R_ensemble.append(0)
        else:
            R_ensemble.append(np.random.normal(loc=R[i],scale=R[i]/4,size=None))
    return R_ensemble
# Construction of multiple lists and the dataframe
NUM_LISTS = 50
R = list(range(100))
data = dict()
for i in range(NUM_LISTS):
    data['Ensemble_' + str(i)] = normal_transform(R)
df_ensemble = pd.DataFrame(data)
You can access the individual lists/ columns like this:
df_ensemble['Ensemble_42']
df_ensemble[df_ensemble.columns[42]]
You can use zip() with * to create dataframe with variable number of columns. For example:
import pandas as pd

def generate_list(n):
    # ... generate your list here
    return [*range(n)]

def get_dataframe(n_columns, n):
    return pd.DataFrame(zip(*[generate_list(n) for _ in range(n_columns)]),
                        columns=['Ensemble_{}'.format(i) for i in range(1, n_columns+1)])

print(get_dataframe(8, 10))
Prints (8 columns, 10 rows):
Ensemble_1 Ensemble_2 Ensemble_3 Ensemble_4 Ensemble_5 Ensemble_6 Ensemble_7 Ensemble_8
0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9
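If all ensembles use the same perturbation rule, the whole table can probably be drawn in a single call. This is a sketch under two assumptions not stated in the question: the values in R are non-negative (a negative scale is rejected by NumPy), and a seeded np.random.default_rng generator is an acceptable replacement for np.random.normal. A scale of 0 returns the mean itself, so entries where R is 0 stay 0 without a special case.

```python
import numpy as np
import pandas as pd

NUM_LISTS = 50                       # number of ensemble columns
R = np.arange(100, dtype=float)      # example input, as in the answer above

rng = np.random.default_rng(0)       # seeded only to make the sketch repeatable

# Broadcast R down the rows: one column of draws per ensemble member
samples = rng.normal(loc=R[:, None], scale=R[:, None] / 4,
                     size=(len(R), NUM_LISTS))

df_ensemble = pd.DataFrame(
    samples, columns=[f"Ensemble_{i}" for i in range(1, NUM_LISTS + 1)])

print(df_ensemble.iloc[:5, :3])
```

Each column can still be picked out by name, for example df_ensemble["Ensemble_7"], and changing NUM_LISTS is the only edit needed to get 100 lists.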

Drop rows if value in column changes

Assume I have the following pandas data frame:
my_class value
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
9 3 10
10 3 11
11 3 12
I want to identify the indices of "my_class" where the class changes and remove n rows after and before this index. The output of this example (with n=2) should look like:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
My approach:
# where class changes happen
s = df['my_class'].ne(df['my_class'].shift(-1).fillna(df['my_class']))
# mask with `bfill` and `ffill`
df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
One possible solution is to:
Make use of the fact that the index contains consecutive integers.
Find the index values where the class changes.
For each such index i, generate the sequence of indices from i-2 to i+1 (n=2 rows on each side of the change) and concatenate them.
Retrieve the rows whose indices are not in this list.
The code to do it is:
# a non-zero difference marks the first row of a new class
ind = df[df['my_class'].diff().fillna(0) != 0].index
df[~df.index.isin([item for sublist in
                   [range(i-2, i+2) for i in ind] for item in sublist])]
An alternative that builds the list of indices to drop explicitly:
import numpy as np
import pandas as pd

my_class = np.array([1] * 3 + [2] * 6 + [3] * 3)
cols = np.c_[my_class, np.arange(len(my_class)) + 1]
df = pd.DataFrame(cols, columns=['my_class', 'value'])
df['diff'] = df['my_class'].diff().fillna(0)
idx2drop = []
for i in df[df['diff'] == 1].index:
    idx2drop += range(i - 2, i + 2)
print(df.drop(idx2drop)[['my_class', 'value']])
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
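The answers above hard-code n=2. A possible generalisation is sketched below; the helper name drop_around_changes is hypothetical, and it works on row positions rather than index labels, so it does not rely on the index being consecutive integers. It also clips positions that would fall outside the frame.

```python
import numpy as np
import pandas as pd

def drop_around_changes(df, col, n):
    """Drop the n rows before and the n rows from each change in df[col]."""
    values = df[col]
    # True where a row differs from the previous row; the first row is skipped
    changed = values.ne(values.shift()).to_numpy()
    pos = np.flatnonzero(changed[1:]) + 1
    # for each change position p, collect positions p-n .. p+n-1
    to_drop = (pos[:, None] + np.arange(-n, n)).ravel()
    to_drop = to_drop[(to_drop >= 0) & (to_drop < len(df))]
    keep = np.ones(len(df), dtype=bool)
    keep[to_drop] = False
    return df[keep]

df = pd.DataFrame({"my_class": [1] * 3 + [2] * 6 + [3] * 3,
                   "value": range(1, 13)})
result = drop_around_changes(df, "my_class", n=2)
print(result)
```

For the example frame with n=2 this keeps the rows with index 0, 5, 6 and 11, matching the expected output in the question.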

Creation dataframe from several list of lists

I need to build a dataframe from 10 lists of lists. I did it manually, but it takes a lot of time. What is a better way to do it?
I have tried to do it manually, and that works fine (#1).
I also tried code #2 for better performance, but it returns only the last column.
1
import pandas as pd
import numpy as np
a1T=[([7,8,9]),([10,11,12]),([13,14,15])]
a2T=[([1,2,3]),([5,0,2]),([3,4,5])]
print (a1T)
#Output[[7, 8, 9], [10, 11, 12], [13, 14, 15]]
vis1=np.array (a1T)
vis_1_1=vis1.T
tmp2=np.array (a2T)
tmp_2_1=tmp2.T
X=np.column_stack([vis_1_1, tmp_2_1])
dataset_all = pd.DataFrame({"Visab1":X[:,0], "Visab2":X[:,1], "Visab3":X[:,2], "Temp1":X[:,3], "Temp2":X[:,4], "Temp3":X[:,5]})
print (dataset_all)
Output: Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
> Actually I have a varying number of columns in the dataframe (500-1500), that's why I need auto-generated column names. The extra index (1, 2, 3) after the names Visab, Temp and so on is constant for every case. See the code below.
For better performance I tried code #2:
n=3 # This is varying parameter. The parameter affects the number of columns in the table.
m=2 # This is constant for every case. here is 2, because we have "Visab", "Temp"
mlist=('Visab', 'Temp')
nlist=[range(1, n)]
for j in range (1,n):
for i in range (1,m):
col=i+(j-1)*n
dataset_all=pd.DataFrame({mlist[j]+str(i):X[:, col]})
I expect output like
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
but I get no result, only the error "expected an indented block".
Ok, so the number of columns n is the number of sublists in each list, right? You can measure that with len:
len(a1T)
#Output
3
I'll simplify the answer above so you don't need X and add automatic column-names creation:
my_lists = [a1T,a2T]
my_names = ["Visab","Temp"]
dfs=[]
for one_list,name in zip(my_lists,my_names):
    n_columns = len(one_list)
    col_names=[name+"_"+str(n) for n in range(n_columns)]
    df = pd.DataFrame(one_list).T
    df.columns = col_names
    dfs.append(df)
dataset_all = pd.concat(dfs,axis=1)
#Output
Visab_0 Visab_1 Visab_2 Temp_0 Temp_1 Temp_2
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
Now it is much clearer. So you have:
X=np.column_stack([vis_1_1, tmp_2_1])
Let's create a list with the names of the columns:
columns_names = ["Visab1","Visab2","Visab3","Temp1","Temp2","Temp3"]
Now you can directly make a dataframe like this:
dataset_all = pd.DataFrame(X,columns=columns_names)
#Output
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
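With 500 to 1500 columns, typing the names by hand is not practical, so the list can be generated instead. A minimal sketch, assuming every block has the same number of columns n and that the prefixes are listed in the same order as the blocks stacked into X:

```python
import numpy as np
import pandas as pd

a1T = [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
a2T = [[1, 2, 3], [5, 0, 2], [3, 4, 5]]

# Same stacking as in the question: one column per sublist
X = np.column_stack([np.array(a1T).T, np.array(a2T).T])

n = len(a1T)                   # columns per block
prefixes = ("Visab", "Temp")   # one prefix per block, in stacking order

# Visab1..Visabn, then Temp1..Tempn
column_names = [f"{p}{i}" for p in prefixes for i in range(1, n + 1)]

dataset_all = pd.DataFrame(X, columns=column_names)
print(dataset_all)
```

Adding another block then only requires appending its transposed array to the column_stack call and its prefix to prefixes.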
