Drop or modify consecutive duplicate rows - python

Suppose we have a DataFrame with two types of data: float and ndarray (shape is always (2,)):
import numpy as np
import pandas as pd

data = [
    0.1, np.array([1.0, 0.1]), np.array([1.0, 0.1]),
    np.array([1.0, 0.1]), 0.1, 0.1, np.array([0.1, 1.0]), 1.0
]
df = pd.DataFrame(
    {'A': data}, index=[0., 1., 2.0, 2.6, 3., 3.2, 3.4, 4.0]
)
x | A
----|-------------
0.0 | 0.1
1.0 | [1.0, 0.1]
2.0 | [1.0, 0.1]
2.6 | [1.0, 0.1]
3.0 | 0.1
3.2 | 0.1
3.4 | [0.1, 1.0]
4.0 | 1.0
I would like to process consecutive duplicates in order to:
Drop repetitions if they are floats (keeping the first in the "group");
Modify each element in the "group" using all index values of this group if these elements are ndarrays.
The expected result for the given example would be something like this (here I tried to split the range [1., 0.1] proportionally over three regions):
x | A
----|-------------
0.0 | 0.1
1.0 | [1.0, 0.55]
2.0 | [0.55, 0.28]
2.6 | [0.28, 0.1]
3.0 | 0.1
3.4 | [0.1, 1.0]
4.0 | 1.0
To start with, I've tried using df != df.shift() to find duplicates, but it raises an error when comparing a float with an ndarray and would not "group" more than 2 elements.
I was also trying groupby(by=function), where function checks the dtype of the element, but it seems that groupby acts only on the index in this case.
Obviously, I can loop through rows and keep track of repetitions, but it is neither elegant nor efficient.
Do you have any suggestions?

Step 1: Drop consecutive equal floats
To check whether 2 elements of a row are equal floats, define
the following function:
def equalFloats(row):
    if (type(row.A).__name__ == 'float') and (type(row.B).__name__ == 'float'):
        return row.A == row.B
    return False
Then, temporarily add to df column B containing the previous value
in A column:
df['B'] = df.A.shift()
And to drop consecutive floats in A column (and also drop
B column) run:
df = df[~df.apply(equalFloats, axis=1)][['A']]
For the time being df contains:
A
0.0 0.1
1.0 [1.0, 0.1]
2.0 [1.0, 0.1]
2.6 [1.0, 0.1]
3.0 0.1
3.4 [0.1, 1.0]
4.0 1
4.5 2.1
To check that consecutive but different floats are not removed,
I added a row with index 4.5 and value 2.1. As you see, it was not removed.
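For reproducibility, one way to append such a test row (done before running Step 1; the exact method I used is not essential) is:
df.loc[4.5] = 2.1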
Step 2: Convert arrays in A column
Define another function:
def step2(row):
    if type(row.A).__name__ == 'ndarray':
        arr = row.A
        arr[1] = arr.sum() / row.name
        return row
    return row
(row.name is the index value of the current row).
Then apply it:
df = df.apply(step2, axis=1)
The result is:
A
0.0 0.1
1.0 [1.0, 2.1]
2.0 [1.0, 0.775]
2.6 [1.0, 0.5473372781065089]
3.0 0.1
3.4 [0.1, 0.12456747404844293]
4.0 1
4.5 2.1
If you want, change the formula in step2 to any other of your choice.
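For instance, a sketch of a formula closer to the expected output in the question would be to linearly interpolate the original [start, stop] pair over the group's index values. Here it is hard-coded for the group at indices 1.0, 2.0, 2.6, using the next index 3.0 as the right edge; generalizing it to arbitrary groups is left as an exercise:
import numpy as np

# proportional split of [1.0, 0.1] over the group's index values,
# using the next row's index (3.0) as the right edge of the group
edges = np.array([1.0, 2.0, 2.6, 3.0])
vals = np.interp(edges, [edges[0], edges[-1]], [1.0, 0.1])
rows = [np.array([vals[i], vals[i + 1]]) for i in range(len(edges) - 1)]
print(rows)  # [1.0, 0.55], [0.55, 0.28], [0.28, 0.1] -- matches the expected result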
Edit following comments
I defined df as:
A
0.0 0.1
1.0 [1.0, 0.1]
2.0 [1.0, 0.1]
2.6 [1.0, 0.1]
3.0 0.1
3.1 0.1
3.2 0.1
3.4 [0.1, 1.0]
4.0 1
4.5 2.1
It contains 3 consecutive 0.1 values.
Note that you didn't specify how many consecutive
values such a "group" may contain.
Both functions can be defined also with isinstance:
def equalFloats(row):
    if isinstance(row.A, float) and isinstance(row.B, float):
        return row.A == row.B
    return False

def step2(row):
    if isinstance(row.A, np.ndarray):
        arr = row.A
        arr[1] = arr.sum() / row.name
        return row
    return row
Then after you run:
df['B'] = df.A.shift()
df = df[~df.apply(equalFloats, axis=1)][['A']]
df = df.apply(step2, axis=1)
The result is:
A
0.0 0.1
1.0 [1.0, 1.1]
2.0 [1.0, 0.55]
2.6 [1.0, 0.4230769230769231]
3.0 0.1
3.4 [0.1, 0.3235294117647059]
4.0 1
4.5 2.1
As you can see, only the first of the three consecutive 0.1 values remained.

Related

What's the best way for looping through pandas df and comparing 2 different dataframes then performing division on values returned?

I'm currently writing Python code that compares offensive and defensive stats in basketball and I want to be able to create weights with the given stats. I have my stats saved in a dataframe according to: team, position, and other numerical stats. I want to be able to loop through each team and their respective positions and corresponding stats. e.g.:
['DAL', 'C', 0.0, 3.0, 0.5, 0.4, 0.5, 0.7, 6.4] vs ['BOS', 'C', 1.7, 6.0, 2.1, 0.1, 0.7, 1.9, 9.0]
So I would like to compare BOS vs DAL at the C position and compare points, rebounds, assists etc. If one is greater than the other then divide the greater by the lesser.
The best thing I have so far is to convert the dataframes to numpy and then proceed to loop through those and append into a blank list:
df1 = df1.to_numpy()
df2 = df2.to_numpy()
df1_array = []
df2_array = []
for x in range(len(df1)):
    for a, h in zip(away, home):
        if df1[x][0] == a or df1[x][0] == h:
            df1_array.append(df1[x])
After I get the new arrays I would then loop through them again to compare values; however, I feel like this is too rudimentary. What could be a more efficient or smarter way of executing this?
Use numpy.where to compare rows and return the truth value of ('team1' > 'team2') element-wise:
import pandas as pd
import numpy as np

# Creating the dataframe
team1 = ['DAL', 'C', 0.1, 3.0, 0.5, 0.4, 0.5, 0.7, 6.4]
team2 = ['BOS', 'C', 1.7, 6.0, 2.1, 0.1, 0.7, 1.9, 9.0]
df = pd.DataFrame({'team1': team1, 'team2': team2})

# Select the rows that contain numbers
df2 = df.iloc[2:].copy()

# Make the comparison: if team1 is larger than team2 then team1/team2, and vice versa
df2['result'] = np.where(df2['team1'] > df2['team2'],
                         df2['team1'] / df2['team2'],
                         df2['team2'] / df2['team1'])
df['result'] = df2['result'].fillna(0)
This yields
team1 team2 result
0 DAL BOS NaN
1 C C NaN
2 0.1 1.7 17.0
3 3.0 6.0 2.0
4 0.5 2.1 4.2
5 0.4 0.1 4.0
6 0.5 0.7 1.4
7 0.7 1.9 2.714286
8 6.4 9.0 1.40625
Be careful with the 0 in the first column of values in your problem description though; I changed it to 0.1, as otherwise it will give a zero-division error.
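If you prefer not to edit the data, a hedged alternative is to suppress the division wherever the denominator is zero (filling with 0 here is just a choice):
import numpy as np

t1 = df2['team1'].astype(float).to_numpy()
t2 = df2['team2'].astype(float).to_numpy()
hi, lo = np.maximum(t1, t2), np.minimum(t1, t2)
# divide only where the denominator is non-zero; elsewhere keep the 0 fill value
df2['result'] = np.divide(hi, lo, out=np.zeros_like(hi), where=lo != 0)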

python check if float value is missing

I am trying to find "missing" values in a python array of floats.
Such that in this case [1.1, 1.3, 2.1, 2.2, 2.3] I would like to print "1.2"
I don't have much experience with floats. I have tried something like How to find a missing number from a list?, but it doesn't work on floats.
Thanks!
To solve this, the problem needs to be simplified first. I am assuming that all values are floats with one decimal place, that there can be multiple ranges like 1.1-1.3 and 2.1-2.3, and that the numbers are in sorted order. Here is a solution (written in Python 3, by the way):
vals = [1.1, 1.3, 2.1, 2.2, 2.3]  # the values in which to find the missing numbers

# The logic starts from here; rounding guards against float representation error
for i in range(len(vals) - 1):
    if round(vals[i + 1] * 10) - round(vals[i] * 10) == 2:
        print(round(vals[i] * 10 + 1) / 10)

print("\nfinished")
You might want to use https://numpy.org/doc/stable/reference/generated/numpy.arange.html
and create a list of floats (if you know the start, end, and step values).
Then you can create two sets and use their difference to find the missing values.
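A minimal sketch of that idea, assuming the values are evenly spaced with step 0.1 and the start and end are known (note it treats the whole span as one range, so it also reports the values between 1.3 and 2.1):
import numpy as np

vals = [1.1, 1.3, 2.1, 2.2, 2.3]
full = {round(x, 1) for x in np.arange(1.1, 2.3 + 0.05, 0.1)}
missing = sorted(full - {round(v, 1) for v in vals})
print(missing)  # [1.2, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]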
Simplest yet dumb way:
Split float to integer and decimal parts.
Create cartesian product of both to generate Full array.
Use set and XOR to find out missing ones.
from itertools import product
source = [1.1, 1.3, 2.1, 2.2, 2.3]
separated = [str(n).split(".") for n in source]
integers, decimals = map(set, zip(*separated))
products = [float(f"{i}.{d}") for i, d in product(integers, decimals)]
print(*(set(products) ^ set(source)))
output:
1.2
I guess that the solutions to the question you quoted probably work in your case; you just need to adapt the built-in range function to numpy.arange, which allows you to create a range of floats.
It goes something like this (just a simple example):
import numpy as np

np_range = np.arange(1, 2, 0.1)
float_list = [1.2, 1.3, 1.4, 1.6]
for i in np_range:
    if round(i, 1) not in float_list:
        print(round(i, 1))
output:
1.0
1.1
1.5
1.7
1.8
1.9
This is an absolutely AWFUL way to do this, but depending on how many numbers you have in the list and how difficult the other solutions are you might appreciate it.
If you write
firstvalue = 1.1
secondvalue = 1.2
thirdvalue = 1.3
# assign these for every value you are keeping track of

if firstvalue in val:  # (or whatever you named your list)
    print("1.1 is in the list")
else:
    print("1.1 is missing!")

if secondvalue in val:
    print("1.2 is in the list")
else:
    print("1.2 is missing!")

# etc etc etc for every value in the list. It's tedious and dumb but if you have
# few enough values in your list it might be your simplest option
With numpy
import numpy as np
arr = [1.1, 1.3, 2.1, 2.2, 2.3]
find_gaps = np.array(arr).round(1)
find_gaps[np.r_[np.diff(find_gaps).round(1), False] == 0.2] + 0.1
Output
array([1.2])
Test with random data
import numpy as np
np.random.seed(10)
arr = np.arange(0.1, 10.4, 0.1)
mask = np.random.randint(0, 2, len(arr)).astype(bool)
gaps = arr[mask]
print(gaps)
find_gaps = np.array(gaps).round(1)
print('missing values:')
print(find_gaps[np.r_[np.diff(find_gaps).round(1), False] == 0.2] + 0.1)
Output
[ 0.1 0.2 0.4 0.6 0.7 0.9 1. 1.2 1.3 1.6 2.2 2.5 2.6 2.9
3.2 3.6 3.7 3.9 4. 4.1 4.2 4.3 4.5 5. 5.2 5.3 5.4 5.6
5.8 5.9 6.1 6.4 6.8 6.9 7.3 7.5 7.6 7.8 7.9 8.1 8.7 8.9
9.7 9.8 10. 10.1]
missing values:
[0.3 0.5 0.8 1.1 3.8 4.4 5.1 5.5 5.7 6. 7.4 7.7 8. 8.8 9.9]
More general solution
Find all missing values with a specific gap size
import numpy as np

def find_missing(find_gaps, gaps=1):
    find_gaps = np.array(find_gaps)
    gaps_diff = np.r_[np.diff(find_gaps).round(1), False]
    gaps_index = find_gaps[(gaps_diff >= 0.2) & (gaps_diff <= round(0.1 * (gaps + 1), 1))]
    gaps_values = np.searchsorted(find_gaps, gaps_index)
    ranges = np.vstack([(find_gaps[gaps_values] + 0.1).round(1), find_gaps[gaps_values + 1]]).T
    return np.concatenate([np.arange(start, end, 0.1001) for start, end in ranges]).round(1)
vals = [0.1,0.3, 0.6, 0.7, 1.1, 1.5, 1.8, 2.1]
print('Vals:', vals)
print('gap=1', find_missing(vals, gaps = 1))
print('gap=2', find_missing(vals, gaps = 2))
print('gap=3', find_missing(vals, gaps = 3))
Output
Vals: [0.1, 0.3, 0.6, 0.7, 1.1, 1.5, 1.8, 2.1]
gap=1 [0.2]
gap=2 [0.2 0.4 0.5 1.6 1.7 1.9 2. ]
gap=3 [0.2 0.4 0.5 0.8 0.9 1. 1.2 1.3 1.4 1.6 1.7 1.9 2. ]

Multiply each value in a pandas dataframe column with all values of 2nd dataframe column & replace each 1st dataframe value with resulting array

I have a dataframe with 4 rows and 3 columns, and all values in this first dataframe (df1) are floats. I also have a second dataframe (df2) that has a column with 8760 entries. I would like to multiply each value in column 3 of the first dataframe by all 8760 values in the second dataframe, and then replace the original value in the first dataframe with the resulting series of 8760 values. So the value in column 3 of each row of the first dataframe becomes an array of 8760 values.
import numpy as np
import pandas as pd

data = {'col1': [1.0, 2.0, 1.0, 3.0], 'col2': [.01, .04, .8, 1.0], 'col3': [0.7, 0.1, 0.5, 0.9]}
np.random.seed(123)
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(np.random.randint(0, 10, size=(1, 8760)))
So here I would like to take each value of col3 in df1 and replace it with the resulting array from multiplying that single value by all 8760 values in df2. So "0.7" would be replaced by an array of 8760 values from multiplying 0.7 by each value in df2. Is there an easy way to do this? When I tried, I just got the first value or NaN in df1 and not the array.
We can use numpy broadcasting here:
df1['result'] = (df1['col3'].to_numpy()[:, None] * df2.to_numpy()[0]).tolist()
Result:
col1 col2 col3 result
0 1.0 0.01 0.7 [1.4, 1.4, 4.2, 0.7, 2.1, 6.3, 4.2, 0.7, 0.0, ...
1 2.0 0.04 0.1 [0.2, 0.2, 0.6, 0.1, 0.3, 0.9, 0.6, 0.1, 0.0, ...
2 1.0 0.80 0.5 [1.0, 1.0, 3.0, 0.5, 1.5, 4.5, 3.0, 0.5, 0.0, ...
3 3.0 1.00 0.9 [1.8, 1.8, 5.4, 0.9, 2.7, 8.1, 5.4, 0.9, 0.0, ...
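For reference, the same broadcasting can be written as an explicit outer product, which makes the resulting shape visible (this assumes df2 really has a single row of 8760 values, as in the question):
import numpy as np

outer = np.outer(df1['col3'].to_numpy(), df2.to_numpy()[0])  # shape (4, 8760)
df1['result'] = list(outer)  # store each row as an array-valued cell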
The following single line of code will fetch you the desired result:
df1['col3'] = df1['col3'].apply(lambda x: df2.values[0]*x)
Here, for each row of df1, the value in column 'col3' is treated as a scalar and multiplied by the entire first row of df2.

Create Pandas dataframe from numpy array and use first column of the array as index

I have a numpy array (a):
array([[ 1. , 5.1, 3.5, 1.4, 0.2],
[ 1. , 4.9, 3. , 1.4, 0.2],
[ 2. , 4.7, 3.2, 1.3, 0.2],
[ 2. , 4.6, 3.1, 1.5, 0.2]])
I would like to make a pandas DataFrame with values=a, columns A, B, C, D, and the index set to the first column of my numpy array; finally it should look like this:
A B C D
1 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
2 4.6 3.1 1.5 0.2
I am trying this:
df = pd.DataFrame(a, index=a[:,0], columns=['A', 'B','C','D'])
and I get the following error:
ValueError: Shape of passed values is (5, 4), indices imply (4, 4)
Any help?
Thanks
You passed the complete array as the data param; you need to slice your array if you want just 4 columns from it as the data:
In [158]:
df = pd.DataFrame(a[:,1:], index=a[:,0], columns=['A', 'B','C','D'])
df
Out[158]:
A B C D
1 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
2 4.6 3.1 1.5 0.2
Also having duplicate values in the index will make filtering/indexing problematic
So with a[:,1:] I take all the rows but only the columns from index 1 onwards, as desired; see the docs.
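As a hedged alternative, you can also build the frame from all five columns and then promote the first one to the index (the column name 'idx' is just illustrative):
df = pd.DataFrame(a, columns=['idx', 'A', 'B', 'C', 'D']).set_index('idx')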

Differences between dataframe spearman correlation using pandas and scipy

I have a fairly big matrix (4780, 5460) and computed the Spearman correlation between rows using both pandas.DataFrame.corr and scipy.stats.spearmanr. Each function returns very different correlation coefficients, and now I am not sure which is "correct", or if my dataset is more suitable to a different implementation.
Some contextualization: the vectors (rows) I want to test for correlation do not necessarily have all the same points; there are NaNs in some columns and not in others.
df.T.corr(method='spearman')
(r, p) = spearmanr(df.T)
df2 = pd.DataFrame(index=df.index, columns=df.columns, data=r)
In[47]: df['320840_93602.563']
Out[47]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.565812
13752_42938.1206 0.877192
319002_93602.870 0.225530
328_642.148.peg.330 0.658269
...
12566_42938.19 0.818395
321125_93602.2882 0.535577
319185_93602.1135 0.678397
29724_39.3584 0.770453
321030_93602.1962 0.738722
Name: 320840_93602.563, dtype: float64
In[32]: df2['320840_93602.563']
Out[32]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.444675
13752_42938.1206 0.286933
319002_93602.870 0.225530
328_642.148.peg.330 0.606619
...
12566_42938.19 0.212265
321125_93602.2882 0.587409
319185_93602.1135 0.696172
29724_39.3584 0.097753
321030_93602.1962 0.163417
Name: 320840_93602.563, dtype: float64
scipy.stats.spearmanr is not designed to handle nan, and its behavior with nan values is undefined. [Update: scipy.stats.spearmanr now has the argument nan_policy.]
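For example, with a recent SciPy you can ask spearmanr to drop the NaNs; a minimal sketch on two made-up 1-D vectors:
import numpy as np
from scipy.stats import spearmanr

a = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
rho, p = spearmanr(a, b, nan_policy='omit')  # NaN entries are dropped
print(rho, p)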
For data without nans, the functions appear to agree:
In [92]: np.random.seed(123)
In [93]: df = pd.DataFrame(np.random.randn(5, 5))
In [94]: df.T.corr(method='spearman')
Out[94]:
0 1 2 3 4
0 1.0 -0.8 0.8 0.7 0.1
1 -0.8 1.0 -0.7 -0.7 -0.1
2 0.8 -0.7 1.0 0.8 -0.1
3 0.7 -0.7 0.8 1.0 0.5
4 0.1 -0.1 -0.1 0.5 1.0
In [95]: rho, p = spearmanr(df.values.T)
In [96]: rho
Out[96]:
array([[ 1. , -0.8, 0.8, 0.7, 0.1],
[-0.8, 1. , -0.7, -0.7, -0.1],
[ 0.8, -0.7, 1. , 0.8, -0.1],
[ 0.7, -0.7, 0.8, 1. , 0.5],
[ 0.1, -0.1, -0.1, 0.5, 1. ]])
