Pandas: how to drop rows that contain more than 2 entries? - python

I have a dataframe like the following
df
entry
0 (5, 4)
1 (4, 2, 1)
2 (0, 1)
3 (2, 7)
4 (9, 4, 3)
I would like to keep only the entries that contain two values:
df
entry
0 (5, 4)
1 (0, 1)
2 (2, 7)

If the values are tuples, use Series.str.len to get their lengths, compare with Series.le (<=), and filter by boolean indexing:
df1 = df[df['entry'].str.len().le(2)]
print (df1)
entry
0 (5, 4)
2 (0, 1)
3 (2, 7)
If the values are strings, count the commas with Series.str.count and compare with Series.lt (<):
df2 = df[df['entry'].str.count(',').lt(2)]
print (df2)
entry
0 (5,4)
2 (0,1)
3 (2,7)
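As a minimal sketch of the tuple case, assuming the column really holds Python tuples and that "two values" means exactly two (hence Series.eq rather than Series.le). The reset_index call is only there to reproduce the 0, 1, 2 index shown in the question:

```python
import pandas as pd

# Sample data copied from the question; the column holds real tuples.
df = pd.DataFrame({"entry": [(5, 4), (4, 2, 1), (0, 1), (2, 7), (9, 4, 3)]})

# .str.len() also works on tuples/lists stored in an object column.
mask = df["entry"].str.len().eq(2)

# Keep the matching rows and renumber the index from 0.
pairs = df[mask].reset_index(drop=True)
print(pairs)
```

If the frame can contain empty or one-element tuples that should also be kept, the original .le(2) comparison is the one to use.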

Related

Splitting a float tuple to multiple columns in python dataframe

I used regex to extract patterns from a csv document, mainly the pattern is (qty x volume in L), eg: 2x2L or 3x4L. (Note that 1 cell can have more than 1 pattern, eg: I want 2x4L and 3x1L)
0 []
1 [(2, x1L), (2, x4L)]
2 [(1, x1L), (1, x4L)]
3 [(2, x4L)]
4 [(1, x4L), (1, x1L)]
...
95 [(1, x2L)]
96 [(1, x1L), (1, x4L)]
97 [(2, x1L)]
98 [(6, x1L)]
99 [(6, x1L), (4, x2L), (4, x4L)]
Name: cards__name, Length: 100, dtype: object
I want to create 3 columns called "1L" "2L" and "4L" and then for every item, take the quantity and add it to the relevant row under the relevant column.
As such
1L 2L 4L
2 0 2
1 0 1
0 0 4
1 0 1
However, I am not able to index into the tuples in order to extract the quantity and the volume size for every item.
Any ideas?
Before you can use pivot, you have to normalize your columns, e.g. this way:
df['multiplier_1'] = df['order_1'].apply(lambda r: r[0])
df['base_volume_1'] = df['order_1'].apply(lambda r: r[1])
This way you can ungroup the orders and eventually split them into multiple base volumes.
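One possible way to carry the normalization through to the three-column layout is sketched below. It assumes each cell is a list of (qty, volume) string tuples as produced by the regex (for example ("2", "x1L")); the names orders, parts, row and wide are hypothetical and only used for illustration:

```python
import pandas as pd

# Hypothetical sample in the shape shown in the question.
orders = pd.Series(
    [
        [],
        [("2", "x1L"), ("2", "x4L")],
        [("1", "x1L"), ("1", "x4L")],
        [("2", "x4L")],
    ],
    name="cards__name",
)

# One row per tuple; empty lists explode to NaN and are dropped.
long = orders.explode().dropna()

# Split each tuple into a quantity and a volume label.
parts = pd.DataFrame(long.tolist(), columns=["qty", "volume"])
parts["row"] = long.index                      # remember the original row
parts["qty"] = parts["qty"].astype(int)
parts["volume"] = parts["volume"].str.lstrip("x")

# Sum quantities per original row and volume, then restore missing rows/columns.
wide = parts.pivot_table(
    index="row", columns="volume", values="qty", aggfunc="sum", fill_value=0
).reindex(index=orders.index, columns=["1L", "2L", "4L"], fill_value=0)
print(wide)
```

The final reindex keeps rows whose list was empty and adds volume columns that never occur in the data, both filled with 0.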

How to make a new row by taking the percentage of two other rows in a table on python

So I have a pandas data frame that shows the number of shots taken and the number of goals scored for a list of hockey games from different coordinates. The data frame lists the shots and goals like this (4, 2), and I want to add another column that divides the goals by the shots to give shot percentage for each coordinate. So far here is my code...
for key in contents['liveData']['plays']['allPlays']:
    # for plays in key['result']['event']:
    #     print(key)
    if key['result']['event'] == "Shot":
        # print(key['result']['event'])
        scoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if scoordinates not in shots:
            shots[scoordinates] = (1, 0)
        else:
            shots[scoordinates] = tuple(map(sum, zip((1, 0), shots[scoordinates])))
    if key['result']['event'] == "Goal":
        # print(key['result']['event'])
        gcoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if gcoordinates not in shots:
            shots[gcoordinates] = (1, 1)
        else:
            shots[gcoordinates] = tuple(map(sum, zip((1, 1), shots[gcoordinates])))
# create data frame using pandas
pd.set_option("display.max_rows", None, "display.max_columns", None)
sdf = pd.DataFrame(list(shots.items()), columns=['Coordinates', 'Occurences (S, G)'])
file.write(f"{sdf}\n")
This gives the following data frame:
Coordinates Occurences (S, G)
0 (78.0, -19.0) (2, 1)
1 (-37.0, -10.0) (2, 0)
2 (47.0, -23.0) (3, 1)
3 (53.0, 14.0) (1, 0)
4 (77.0, -2.0) (8, 4)
5 (80.0, 1.0) (12, 5)
6 (74.0, 14.0) (7, 0)
7 (87.0, -3.0) (1, 1)
If anyone can help that would be great!
Try this:
df['new_col']=df['old_col'].apply( lambda x: x[1]/x[0])
Just divide the two columns. This is the "longer" way: split the (S, G) tuples into their own columns, then divide. Alternatively, use the one-line lambda that Ave799 provided. Both work, but Ave799's version is probably the preferred way.
import pandas as pd
data = pd.DataFrame([[(78.0, -19.0),(2, 1)],
[(-37.0, -10.0),(2, 0)],
[(47.0, -23.0),(3, 1)],
[(53.0, 14.0),(1, 0)],
[(77.0, -2.0),(8, 4)],
[(80.0, 1.0),(12, 5)],
[(74.0, 14.0),(7, 0)],
[(87.0, -3.0),(1, 1)]], columns=['Coordinates','Occurences (S, G)'])
data[['S','G']] = pd.DataFrame(data['Occurences (S, G)'].tolist(), index=data.index)
data['Percentage'] = data['G'] / data['S']
Output:
print(data)
Coordinates Occurences (S, G) Percentage S G
0 (78.0, -19.0) (2, 1) 0.500000 2 1
1 (-37.0, -10.0) (2, 0) 0.000000 2 0
2 (47.0, -23.0) (3, 1) 0.333333 3 1
3 (53.0, 14.0) (1, 0) 0.000000 1 0
4 (77.0, -2.0) (8, 4) 0.500000 8 4
5 (80.0, 1.0) (12, 5) 0.416667 12 5
6 (74.0, 14.0) (7, 0) 0.000000 7 0
7 (87.0, -3.0) (1, 1) 1.000000 1 1
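For comparison, here is a small sketch of the one-line lambda applied to the same column name. The zero-shot guard is an added assumption (the data above never has S equal to 0), and the frame is a shortened, hypothetical copy of the sample:

```python
import pandas as pd

# Shortened copy of the sample data; only the (S, G) column is needed here.
data = pd.DataFrame({"Occurences (S, G)": [(2, 1), (2, 0), (3, 1), (8, 4)]})

# Goals divided by shots; return NaN instead of raising if shots were ever 0.
data["Percentage"] = data["Occurences (S, G)"].apply(
    lambda sg: sg[1] / sg[0] if sg[0] else float("nan")
)
print(data)
```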

pandas dataframe conditional selection

I have a pandas data frame.
One of its columns is named Exceptions.
Rows represent entries, and in Exceptions I store tuples.
I need to do a conditional selection of rows (there are other conditions which need to be combined with & for further selection).
>>>print(dataframe.Exceptions)
0
1
2 (sfm, sfmp)
4
3
Name: Exceptions, dtype: object
>>>'sfm' not in dataframe.Exceptions
True
How can I do this conditional selection with the tuples unpacked?
I would appreciate your suggestions.
Here's an example showing how to get tuples that have 1 in the second position.
import pandas as pd
df = pd.DataFrame({
'tups': [(0, 0), (0, 1), (0, 2), (1, 1)]
})
filtered = df[df['tups'].apply(lambda tup: tup[1] == 1)]
print(filtered)
Output:
tups
1 (0, 1)
3 (1, 1)
Is this what you're looking for?
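Applied to the question's own case, the same idea might look like the sketch below. It assumes the blank cells are empty strings (the printout does not say what they are) and uses a hypothetical second column, value, to show how the mask combines with other conditions via &:

```python
import pandas as pd

# Hypothetical frame: blanks are assumed to be empty strings.
dataframe = pd.DataFrame({
    "Exceptions": ["", "", ("sfm", "sfmp"), "", ""],
    "value": [1, 2, 3, 4, 5],
})

# True where the cell is a tuple that contains 'sfm'.
has_sfm = dataframe["Exceptions"].apply(
    lambda cell: isinstance(cell, tuple) and "sfm" in cell
)

# Rows without 'sfm', combined with another condition.
selected = dataframe[~has_sfm & (dataframe["value"] > 1)]
print(selected)
```

The plain `'sfm' not in dataframe.Exceptions` test in the question returns True because `in` on a Series checks the index labels, not the values.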

Repeating the same process for the whole dataset

Given DataFrame df:
1 1.1 2 2.1 ... 1600 1600.1
0 45.1024 7.2365 45.8769 7.1937 34.1072 8.4643
1 43.1024 8.9645 32.5798 7.7500 33.1072 9.3564
2 42.1024 6.7498 25.1027 7.3496 26.1072 6.3665
I did the following operation: I took the first pair of columns (1 and 1.1) and created an array from them. Then I did the same with the next pair (2 and 2.1).
x = df['1']
y = df['1.1']
P = np.array([x, y])
and
q = df['2']
w = df['2.1']
Q = np.array([q, w])
Final operation was:
Q_final = list(zip(Q[0], Q[1]))
P_final = list(zip(P[0], P[1]))
Now I want to do this for the whole dataset, but doing it by hand will take a lot of time. Is there a short way to iterate over all the pairs?
EDITED
In the end I'm computing
df = similaritymeasures.frechet_dist(P_final, Q_final)
so I want to get a new dataset (maybe) with all column combinations.
A simple way is to use agg across axis 1
def f(s):
    s = iter(s)
    return list(zip(s, s))
agg = df.agg(f, 1)
Then to retrieve, use .str. For example,
agg.str[0] # P_final
agg.str[1] # Q_final
.
.
.
You can also group across axis=1, assuming you want every pair of adjacent columns:
df.groupby(np.arange(len(df.columns))//2, axis=1).apply(lambda s: s.agg(list,1))
You probably don't want to create 1600 individual variables. Store this in a container, like a dict, where the keys reference the original column handles:
{idx: list(zip(gp.iloc[:, 0], gp.iloc[:, 1]))
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1)}
# or
{idx: [*map(tuple, gp.to_numpy())]
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1)}
Sample
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame((np.random.randint(1,10,(5,6))))
df.columns = ['1', '1.1', '2', '2.1', '3', '3.1']
# 1 1.1 2 2.1 3 3.1
#0 7 4 8 5 7 3
#1 7 8 5 4 8 8
#2 3 6 5 2 8 6
#3 2 5 1 6 9 1
#4 3 7 4 9 3 5
{idx: list(zip(gp.iloc[:, 0], gp.iloc[:, 1]))
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1)}
#{'1': [(7, 4), (7, 8), (3, 6), (2, 5), (3, 7)],
# '2': [(8, 5), (5, 4), (5, 2), (1, 6), (4, 9)],
# '3': [(7, 3), (8, 8), (8, 6), (9, 1), (3, 5)]}
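Since groupby with axis=1 is deprecated in recent pandas releases (2.1 and later, as far as I know), the same dict can also be built by stepping through the columns two at a time. The sketch below assumes the columns always come in adjacent pairs; curves and pairs are hypothetical names, and the place where a distance function such as frechet_dist would be called is only marked with a comment:

```python
import itertools

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randint(1, 10, (5, 6)),
                  columns=["1", "1.1", "2", "2.1", "3", "3.1"])

# Columns i and i + 1 form one curve; key it by the first column's name.
curves = {
    df.columns[i]: list(zip(df.iloc[:, i], df.iloc[:, i + 1]))
    for i in range(0, df.shape[1], 2)
}

# Every unordered pair of curves, e.g. to feed into a distance function.
pairs = list(itertools.combinations(curves, 2))
for a, b in pairs:
    # A call such as some_distance(curves[a], curves[b]) would go here.
    print(a, b, len(curves[a]), len(curves[b]))
```

With 800 curves this yields 319,600 pairs, so the distance computation itself is likely to dominate the run time.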

Rolling over multiple columns returning one result in Pandas

I'm stuck on rolling a window over multiple columns in Pandas. What I have is:
df = pd.DataFrame({'A':[1,2,3,4],'B':[5,6,7,8]})
def test(ts):
    print(ts.shape)
df.rolling(2).apply(test)
However, the problem is that ts.shape prints (2,) and I wanted it to print (2,2), that is, to include the whole window of both rows and columns.
What is wrong with my intuition about how rolling works, and how can I get the results I'm after using Pandas?
You can use a little hack: get the number of numeric columns with select_dtypes and use this scalar value:
df = pd.DataFrame({'A':[1,2,3,4],'B':[5,6,7,8], 'C':list('abcd')})
print (df)
A B C
0 1 5 a
1 2 6 b
2 3 7 c
3 4 8 d
cols = len(df.select_dtypes(include=[np.number]).columns)
print (cols)
2
def test(ts):
    print(tuple((ts.shape[0], cols)))
    return ts.sum()
df = df.rolling(2).apply(test)
(2, 2)
(2, 2)
(2, 2)
(2, 2)
(2, 2)
(2, 2)
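rolling(...).apply calls the function once per column, which is why ts is one-dimensional. If the function really needs the whole two-dimensional window, one alternative is to slice the frame manually. This is only a sketch of that idea, not what the answer above does; window, shapes and result are hypothetical names:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
window = 2

shapes = []
sums = []
# Slide over the rows; each block contains all columns of the window.
for end in range(window, len(df) + 1):
    block = df.iloc[end - window:end]
    shapes.append(block.shape)            # (2, 2) for every full window
    sums.append(block.to_numpy().sum())   # one result per window

# Label each result with the last row of its window.
result = pd.Series(sums, index=df.index[window - 1:])
print(shapes)
print(result)
```

Unlike rolling, this loop skips the incomplete leading window instead of returning NaN for it.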
