Repeating the same process for the whole dataset - python

Given DataFrame df:
         1     1.1        2     2.1  ...     1600  1600.1
0  45.1024  7.2365  45.8769  7.1937  ...  34.1072  8.4643
1  43.1024  8.9645  32.5798  7.7500  ...  33.1072  9.3564
2  42.1024  6.7498  25.1027  7.3496  ...  26.1072  6.3665
I did the following operation: I took the first couple of columns (1 and 1.1) and created an array, then did the same with the next couple (2 and 2.1):
x = df['1']
y = df['1.1']
P = np.array([x, y])
and
q = df['2']
w = df['2.1']
Q = np.array([q, w])
The final operation was:
Q_final = list(zip(Q[0], Q[1]))
P_final = list(zip(P[0], P[1]))
Now I want to do this for the whole dataset, but doing it couple by couple would take a lot of time. Is there a concise way to iterate over all the column couples?
EDIT
In the end I am computing
dist = similaritymeasures.frechet_dist(P_final, Q_final)
so I want to get a new dataset (maybe) with all column combinations.

A simple way is to use agg across axis=1:
def f(s):
    s = iter(s)
    return list(zip(s, s))

agg = df.agg(f, 1)
Then to retrieve, use .str. For example,
agg.str[0]  # P_final
agg.str[1]  # Q_final
# ... and so on for each couple
You can also groupby across axis=1, assuming you want every couple of columns:
df.groupby(np.arange(len(df.columns))//2, axis=1).apply(lambda s: s.agg(list,1))
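Since the edited goal is a Fréchet distance for every combination of couples, here is a minimal sketch building on that grouping, assuming the similaritymeasures package from the question is installed:
import itertools
import numpy as np
import similaritymeasures

# One (n_rows, 2) array of points per couple of columns.
curves = {name: gp.to_numpy()
          for name, gp in df.groupby(np.arange(len(df.columns)) // 2, axis=1)}

# Fréchet distance for every pair of couples.
dists = {(a, b): similaritymeasures.frechet_dist(curves[a], curves[b])
         for a, b in itertools.combinations(curves, 2)}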

You probably don't want to create 1600 individual variables. Store this in a container, like a dict, where the keys reference the original column handles:
{idx: list(zip(gp.iloc[:, 0], gp.iloc[:, 1]))
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1)}
# or
{idx: [*map(tuple, gp.to_numpy())]
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1)}
Sample
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randint(1, 10, (5, 6)))
df.columns = ['1', '1.1', '2', '2.1', '3', '3.1']
# 1 1.1 2 2.1 3 3.1
#0 7 4 8 5 7 3
#1 7 8 5 4 8 8
#2 3 6 5 2 8 6
#3 2 5 1 6 9 1
#4 3 7 4 9 3 5
{idx: list(zip(gp.iloc[:, 0], gp.iloc[:, 1]))
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1)}
#{'1': [(7, 4), (7, 8), (3, 6), (2, 5), (3, 7)],
# '2': [(8, 5), (5, 4), (5, 2), (1, 6), (4, 9)],
# '3': [(7, 3), (8, 8), (8, 6), (9, 1), (3, 5)]}
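These dict values are exactly the P_final and Q_final lists from the question, so (a sketch, assuming the similaritymeasures package is installed) they can be fed straight into frechet_dist:
import similaritymeasures

pairs = {idx: list(zip(gp.iloc[:, 0], gp.iloc[:, 1]))
         for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1)}
dist = similaritymeasures.frechet_dist(pairs['1'], pairs['2'])  # e.g. couple '1' vs couple '2'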

Related

Split DataFrame Column into Multiple Columns of Equal Length in Pandas [duplicate]

My dataframe is too long, and I want to wrap it onto the next columns. The approach below works, but I'm sure there's a better way. I'd like an answer that also works with even longer dataframes, wrapping at every line number equal to 1 modulo 3.
import pandas as pd
import numpy as np
def wraparound(df, row_number):
    """row_number is the first number that we wrap onto the next column."""
    n = row_number - 1
    r = df.iloc[:n]
    r = pd.concat([r, df.iloc[n:2*n].reset_index(drop=True)], axis=1)
    r = pd.concat([r, df.iloc[2*n:3*n].reset_index(drop=True)], axis=1)
    r = r.reset_index(drop=True).T.reset_index(drop=True).T
    return r
df = pd.DataFrame.from_records([
    (1, 11),
    (2, 12),
    (3, 13),
    (4, 14),
    (5, 15),
    (6, 16),
    (7, 17),
])

result = wraparound(df, 4)

expected = pd.DataFrame.from_records([
    (1, 11, 4, 14, 7, 17),
    (2, 12, 5, 15, np.nan, np.nan),
    (3, 13, 6, 16, np.nan, np.nan),
])

pd.testing.assert_frame_equal(result, expected)
You can create a MultiIndex first and then unstack with sort_index:
N = 3
a = np.arange(len(df))
df.index = [a % N, a // N]
df = df.unstack().sort_index(axis=1, level=1)
df.columns = np.arange(len(df.columns))
print (df)
0 1 2 3 4 5
0 1.0 11.0 4.0 14.0 7.0 17.0
1 2.0 12.0 5.0 15.0 NaN NaN
2 3.0 13.0 6.0 16.0 NaN NaN
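For reuse, the same idea can be wrapped in a function with the question's signature (a sketch, not part of the original answer):
import numpy as np
import pandas as pd

def wraparound(df, row_number):
    """row_number is the first row (1-indexed) that wraps onto the next columns."""
    n = row_number - 1
    out = df.copy()
    a = np.arange(len(out))
    out.index = [a % n, a // n]  # MultiIndex: (position within block, block number)
    out = out.unstack().sort_index(axis=1, level=1)
    out.columns = np.arange(len(out.columns))
    return out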

Splitting a float tuple to multiple columns in python dataframe

I used regex to extract patterns from a csv document; the main pattern is (qty x volume in L), e.g. 2x2L or 3x4L. (Note that one cell can have more than one pattern, e.g. "I want 2x4L and 3x1L".)
0 []
1 [(2, x1L), (2, x4L)]
2 [(1, x1L), (1, x4L)]
3 [(2, x4L)]
4 [(1, x4L), (1, x1L)]
...
95 [(1, x2L)]
96 [(1, x1L), (1, x4L)]
97 [(2, x1L)]
98 [(6, x1L)]
99 [(6, x1L), (4, x2L), (4, x4L)]
Name: cards__name, Length: 100, dtype: object
I want to create 3 columns called "1L", "2L" and "4L", and then for every item take the quantity and add it to the relevant row under the relevant column, like this:
1L 2L 4L
 2  0  2
 1  0  1
 0  0  2
 1  0  1
However, I am not able to index the tuples to extract the quantity and the volume size for each item.
Any ideas?
Before you can use pivot, you have to normalize your columns, e.g. this way:
df['multiplier_1'] = df['order_1'].apply(lambda r: r[0])
df['base_volume_1'] = df['order_1'].apply(lambda r: r[1])
That way you will be able to ungroup the orders and eventually split them into multiple base volumes.
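Building on that, a fuller sketch for this particular question, assuming the list-of-tuples column is named cards__name as shown above (the intermediate names are mine):
import pandas as pd

s = df['cards__name'].explode().dropna()  # one (qty, 'x4L')-style tuple per row
qty = s.str[0].astype(int)                # quantity
vol = s.str[1].str.lstrip('x')            # 'x4L' -> '4L'

out = (pd.crosstab(index=s.index, columns=vol, values=qty, aggfunc='sum')
         .reindex(df.index, fill_value=0)  # rows with no pattern get zeros
         .fillna(0)
         .astype(int))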

How to make a new row by taking the percentage of two other rows in a table on python

So I have a pandas data frame that shows the number of shots taken and the number of goals scored for a list of hockey games at different coordinates. The data frame lists the shots and goals like this: (4, 2). I want to add another column that divides the goals by the shots to give the shot percentage for each coordinate. So far here is my code...
for key in contents['liveData']['plays']['allPlays']:
    if key['result']['event'] == "Shot":
        scoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if scoordinates not in shots:
            shots[scoordinates] = (1, 0)
        else:
            shots[scoordinates] = tuple(map(sum, zip((1, 0), shots[scoordinates])))
    if key['result']['event'] == "Goal":
        gcoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if gcoordinates not in shots:
            shots[gcoordinates] = (1, 1)
        else:
            shots[gcoordinates] = tuple(map(sum, zip((1, 1), shots[gcoordinates])))

# create data frame using pandas
pd.set_option("display.max_rows", None, "display.max_columns", None)
sdf = pd.DataFrame(list(shots.items()), columns=['Coordinates', 'Occurences (S, G)'])
file.write(f"{sdf}\n")
This gives the following data frame:
Coordinates Occurences (S, G)
0 (78.0, -19.0) (2, 1)
1 (-37.0, -10.0) (2, 0)
2 (47.0, -23.0) (3, 1)
3 (53.0, 14.0) (1, 0)
4 (77.0, -2.0) (8, 4)
5 (80.0, 1.0) (12, 5)
6 (74.0, 14.0) (7, 0)
7 (87.0, -3.0) (1, 1)
If anyone can help that would be great!
Try this:
df['new_col'] = df['old_col'].apply(lambda x: x[1] / x[0])
Just divide the two columns. This is the "longer" way: split the S, G tuples into their own columns, then divide. Or use the one-liner with lambda that Ave799 provided. Both work, but Ave799's is probably the preferred way.
import pandas as pd

data = pd.DataFrame([[(78.0, -19.0), (2, 1)],
                     [(-37.0, -10.0), (2, 0)],
                     [(47.0, -23.0), (3, 1)],
                     [(53.0, 14.0), (1, 0)],
                     [(77.0, -2.0), (8, 4)],
                     [(80.0, 1.0), (12, 5)],
                     [(74.0, 14.0), (7, 0)],
                     [(87.0, -3.0), (1, 1)]], columns=['Coordinates', 'Occurences (S, G)'])

data[['S', 'G']] = pd.DataFrame(data['Occurences (S, G)'].tolist(), index=data.index)
data['Percentage'] = data['G'] / data['S']
Output:
print(data)
      Coordinates Occurences (S, G)   S  G  Percentage
0   (78.0, -19.0)            (2, 1)   2  1    0.500000
1  (-37.0, -10.0)            (2, 0)   2  0    0.000000
2   (47.0, -23.0)            (3, 1)   3  1    0.333333
3    (53.0, 14.0)            (1, 0)   1  0    0.000000
4    (77.0, -2.0)            (8, 4)   8  4    0.500000
5     (80.0, 1.0)           (12, 5)  12  5    0.416667
6    (74.0, 14.0)            (7, 0)   7  0    0.000000
7    (87.0, -3.0)            (1, 1)   1  1    1.000000

Pandas: how to drop rows that contain more than 2 entries?

I have a dataframe like the following
df
entry
0 (5, 4)
1 (4, 2, 1)
2 (0, 1)
3 (2, 7)
4 (9, 4, 3)
I would like to keep only the entries that contain two values:
df
entry
0 (5, 4)
1 (0, 1)
2 (2, 7)
If there are tuples, use Series.str.len to get the lengths, compare with Series.le for <=, and filter with boolean indexing:
df1 = df[df['entry'].str.len().le(2)]
print (df1)
entry
0 (5, 4)
2 (0, 1)
3 (2, 7)
If there are strings, count the number of , characters with Series.str.count and compare with Series.lt for <:
df2 = df[df['entry'].str.count(',').lt(2)]
print (df2)
entry
0 (5,4)
2 (0,1)
3 (2,7)

matching two different arrays and making a new array in python

I have two two-dimensional arrays, and I have to create a new array by filtering the 2nd array down to the rows whose first-column values also appear in the 1st array. The arrays are of different sizes.
Basically the idea is as follows:
file A
#x y
1 2
3 4
2 2
5 4
6 4
7 4
file B
#x1 y1
0 1
1 1
11 1
5 1
7 1
My expected output 2D array should look like
#newx newy
1 1
5 1
7 1
I tried it following way:
match = []
for i in range(len(x)):
    if x[i] == x1[i]:
        new_array = x1[i]
        match.append(new_array)
print(match)
This does not seem to work. Please suggest a way to create the new 2D array
Try np.isin.
arr1 = np.array([[1,3,2,5,6,7], [2,4,2,4,4,4]])
arr2 = np.array([[0,1,11,5,7], [1,1,1,1,1]])
arr2[:,np.isin(arr2[0], arr1[0])]
array([[1, 5, 7],
[1, 1, 1]])
np.isin(arr2[0], arr1[0]) checks whether each element of arr2[0] is in arr1[0]. Then, we use the result as the boolean index array to select elements in arr2.
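If the arrays are instead stored as (N, 2) rows of (x, y) pairs, the same idea applies (a sketch):
import numpy as np

a = np.array([[1, 2], [3, 4], [2, 2], [5, 4], [6, 4], [7, 4]])
b = np.array([[0, 1], [1, 1], [11, 1], [5, 1], [7, 1]])
b[np.isin(b[:, 0], a[:, 0])]
# array([[1, 1],
#        [5, 1],
#        [7, 1]])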
If you make a set of the first elements in A, it is fairly easy to find the elements in B to keep:
Code:
a = ((1, 2), (3, 4), (2, 2), (5, 4), (6, 4), (7, 4))
b = ((0, 1), (1, 1), (11, 1), (5, 1), (7, 1))
in_a = {i[0] for i in a}
new_b = [i for i in b if i[0] in in_a]
print(new_b)
Results:
[(1, 1), (5, 1), (7, 1)]
Output results to file as:
with open('output.txt', 'w') as f:
    for value in new_b:
        f.write(' '.join(str(v) for v in value) + '\n')
#!/usr/bin/env python3
from io import StringIO
import pandas as pd
fileA = """x y
1 2
3 4
2 2
5 4
6 4
7 4
"""
fileB = """x1 y1
0 1
1 1
11 1
5 1
7 1
"""
df1 = pd.read_csv(StringIO(fileA), delim_whitespace=True, index_col="x")
df2 = pd.read_csv(StringIO(fileB), delim_whitespace=True, index_col="x1")
df = pd.merge(df1, df2, left_index=True, right_index=True)
print(df["y1"])
# 1 1
# 5 1
# 7 1
https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
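If a plain 2-D array like the expected output is needed, the merged frame can be converted back (a sketch on top of the code above):
out = df["y1"].reset_index().to_numpy()
# array([[1, 1],
#        [5, 1],
#        [7, 1]])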
If you use pandas:
import pandas as pd
A = pd.DataFrame({'x': pd.Series([1,3,2,5,6,7]), 'y': pd.Series([2,4,2,4,4,4])})
B = pd.DataFrame({'x1': pd.Series([0,1,11,5,7]), 'y1': 1})
C = A.join(B.set_index('x1'), on='x', how='inner')
(how='inner' keeps only the matching rows; the default left join would keep every row of A, with NaN where there is no match.)
Then, if you want to drop the unneeded column and rename the remaining ones:
C = A.join(B.set_index('x1'), on='x', how='inner')
C = C.drop(['y'], axis=1)
C.columns = ['newx', 'newy']
which gives you:
>>> C
   newx  newy
0     1     1
3     5     1
5     7     1
If you are going to work with arrays, dataframes, etc - pandas is definitely worth a look: https://pandas.pydata.org/pandas-docs/stable/10min.html
Assuming that you have (x, y) pairs in your 2-D arrays, a simple loop may work:
arr1 = [[1, 2], [3, 4], [2, 2]]
arr2 = [[0, 1], [1, 1], [11, 1]]
result = []
for pair1 in arr1:
    for pair2 in arr2:
        if pair1[0] == pair2[0]:
            result.append(pair2)
print(result)
Not the best solution for smaller arrays, but for really large arrays it works fast:
import numpy as np
import pandas as pd

n1 = np.transpose(np.array([[1, 3, 2, 5, 6, 7], [2, 4, 2, 4, 4, 4]]))
n2 = np.transpose(np.array([[0, 1, 11, 5, 7], [1, 1, 1, 1, 1]]))
np.array(pd.DataFrame(n1).merge(pd.DataFrame(n2), on=0, how='inner').drop('1_x', axis=1))
