matching two different arrays and making a new array in python

matching two different arrays and making a new array in python - python

I have two two-dimensional arrays, and I have to create a new array filtering through the 2nd array where 1st column indexes match. The arrays are of different size.
basically the idea is as follow:
file A
#x y
1 2
3 4
2 2
5 4
6 4
7 4
file B
#x1 y1
0 1
1 1
11 1
5 1
7 1
My expected output 2D array should look like
#newx newy
1 1
5 1
7 1
I tried it following way:
match =[]
for i in range(len(x)):
if x[i] == x1[i]:
new_array = x1[i]
match.append(new_array)
print match
This does not seem to work. Please suggest a way to create the new 2D array

Try np.isin.
arr1 = np.array([[1,3,2,5,6,7], [2,4,2,4,4,4]])
arr2 = np.array([[0,1,11,5,7], [1,1,1,1,1]])
arr2[:,np.isin(arr2[0], arr1[0])]
array([[1, 5, 7],
[1, 1, 1]])
np.isin(arr2[0], arr1[0]) checks whether each element of arr2[0] is in arr1[0]. Then, we use the result as the boolean index array to select elements in arr2.

If you make a set out of the first element in A, then it is fairly easy to find the elements in B to keep like:
Code:
a = ((1, 2), (3, 4), (2, 2), (5, 4), (6, 4), (7, 4))
b = ((0, 1), (1, 1), (11, 1), (5, 1), (7, 1))
in_a = {i[0] for i in a}
new_b = [i for i in b if i[0] in in_a]
print(new_b)
Results:
[(1, 1), (5, 1), (7, 1)]
Output results to file as:
with open('output.txt', 'w') as f:
for value in new_b:
f.write(' '.join(str(v) for v in value) + '\n')

#!/usr/bin/env python3
from io import StringIO
import pandas as pd
fileA = """x y
1 2
3 4
2 2
5 4
6 4
7 4
"""
fileB = """x1 y1
0 1
1 1
11 1
5 1
7 1
"""
df1 = pd.read_csv(StringIO(fileA), delim_whitespace=True, index_col="x")
df2 = pd.read_csv(StringIO(fileB), delim_whitespace=True, index_col="x1")
df = pd.merge(df1, df2, left_index=True, right_index=True)
print(df["y1"])
# 1 1
# 5 1
# 7 1
https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

If you use pandas:
import pandas as pd
A = pd.DataFrame({'x': pd.Series([1,3,2,5,6,7]), 'y': pd.Series([2,4,2,4,4,4])})
B = pd.DataFrame({'x1': pd.Series([0,1,11,5,7]), 'y1': 1})
C = A.join(B.set_index('x1'), on='x')
Then if you wanted to drop the unneeded row/columns and rename the columns:
C = A.join(B.set_index('x1'), on='x')
C = C.drop(['y'], axis=1)
C.columns = ['newx', 'newy']
which gives you:
>>> C
newx newy
0 1 1.0
3 5 1.0
5 7 1.0
If you are going to work with arrays, dataframes, etc - pandas is definitely worth a look: https://pandas.pydata.org/pandas-docs/stable/10min.html

Assuming that you have (x, y) pairs in your 2-D arrays, a simple loop may work:
arr1 = [[1, 2], [3, 4], [2, 2]]
arr2 = [[0, 1], [1, 1], [11, 1]]
result = []
for pair1 in arr1:
for pair2 in arr2:
if (pair1[0] == pair2[0]):
result.append(pair2)
print(result)

Not the best solution for smaller arrays, but for really large arrays, works fast -
import numpy as np
import pandas as pd
n1 = np.transpose(np.array([[1,3,2,5,6,7], [2,4,2,4,4,4]]))
n2 = np.transpose(np.array([[0,1,11,5, 7], [1,1,1,1,1]]))
np.array(pd.DataFrame(n1).merge(pd.DataFrame(n2), on=0, how='inner').drop('1_x', axis=1))

Related

Converting a list with no tuples into a data frame

Normally when you want to create a turn a set of data into a Data Frame, you make a list for each column, create a dictionary from those lists, then create a data frame from the dictionary.
The data frame I want to create has 75 columns, all with the same number of rows. Defining lists one-by-one isn't going work. Instead I decided to make a single list and iteratively put a certain chunk of each row onto a Data Frame.
Here I will make an example where I turn a list into a data frame:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Example list
df =
a b c d e
0 0 2 4 6 8
1 1 3 5 7 9
# Result I want from the example list
Here is my test code:
import pandas as pd
import numpy as np
dict = {'a':[], 'b':[], 'c':[], 'd':[], 'e':[]}
df = pd.DataFrame(dict)
# Here is my test data frame, it contains 5 columns and no rows.
lst = np.arange(10).tolist()
# This is my test list, it looks like this lst = [0, 2, …, 9]
for i in range(len(lst)):
df.iloc[:, i] = df.iloc[:, i]\
.append(pd.Series(lst[2 * i:2 * i + 2]))
# This code is supposed to put two entries per column for the whole data frame.
# For the first column, i = 0, so [2 * (0):2 * (0) + 2] = [0:2]
# df.iloc[:, 0] = lst[0:2], so df.iloc[:, 0] = [0, 1]
# Second column i = 1, so [2 * (1):2 * (1) + 2] = [2:4]
# df.iloc[:, 1] = lst[2:4], so df.iloc[:, 1] = [2, 3]
# This is how the code was supposed to allocate lst to df.
# However it outputs an error.
When I run this code I get this error:
ValueError: cannot reindex from a duplicate axis
When I add ignore_index = True such that I have
for i in range(len(lst)):
df.iloc[:, i] = df.iloc[:, i]\
.append(pd.Series(lst[2 * i:2 * i + 2]), ignore_index = True)
I get this error:
IndexError: single positional indexer is out-of-bounds
After running the code, I check the results of df. The output is the same whether I ignore index or not.
In: df
Out:
a b c d e
0 0 NaN NaN NaN NaN
1 1 NaN NaN NaN NaN
It seems that the first loop runs fine, but the error occurs when trying to fill the second column.
Does anybody know how to get this to work? Thank you.

IIUC:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
alst = np.array(lst)
df = pd.DataFrame(alst.reshape(2,-1, order='F'), columns = [*'abcde'])
print(df)
Output:
a b c d e
0 0 2 4 6 8
1 1 3 5 7 9

Repeating the same process for the whole dataset

Given DataFrame df:
1 1.1 2 2.1 ... 1600 1600.1
0 45.1024 7.2365 45.8769 7.1937 34.1072 8.4643
1 43.1024 8.9645 32.5798 7.7500 33.1072 9.3564
2 42.1024 6.7498 25.1027 7.3496 26.1072 6.3665
I did the following operation: I chose first(1 and 1.1) couple and created an array. Then I did the same with following couple (2 and 2.1).
x = df['1']
y = df['1.1']
P = np.array([x, y])
and
q = df['2']
w = df['2.1']
Q = np.array([q, w])
Final operation was:
Q_final = list(zip(Q[0], Q[1]))
P_final = list(zip(P[0], P[1]))
Now I want to do it for the whole dataset. But it will take a lot of time. Any idea how to iterate this in a short way?
EDITED
After all I'm doing
df = similaritymeasures.frechet_dist(P_final, Q_final)
So I want to get a new dataset (maybe) with all columns combinations

A simple way is to use agg across axis 1
def f(s):
s = iter(s)
return list(zip(s,s))
agg = df.agg(f,1)
Then to retrieve, use .str. For example,
agg.str[0] # P_final
agg.str[1] # Q_final
.
.
.
Also, can groupby across axis=1, assuming you want every couple of columns
df.groupby(np.arange(len(df.columns))//2, axis=1).apply(lambda s: s.agg(list,1))

You probably don't want to create 1600 individual variables. Store this in a container, like a dict, where the keys reference the original column handles:
{idx: list(zip(gp.iloc[:, 0], gp.iloc[:, 1]))
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1)}
# or
{idx: [*map(tuple, gp.to_numpy())]
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1)}
Sample
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame((np.random.randint(1,10,(5,6))))
df.columns = ['1', '1.1', '2', '2.1', '3', '3.1']
# 1 1.1 2 2.1 3 3.1
#0 7 4 8 5 7 3
#1 7 8 5 4 8 8
#2 3 6 5 2 8 6
#3 2 5 1 6 9 1
#4 3 7 4 9 3 5
{idx: list(zip(gp.iloc[:, 0], gp.iloc[:, 1]))
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1)}
#{'1': [(7, 4), (7, 8), (3, 6), (2, 5), (3, 7)],
# '2': [(8, 5), (5, 4), (5, 2), (1, 6), (4, 9)],
# '3': [(7, 3), (8, 8), (8, 6), (9, 1), (3, 5)]}

Delete duplicate values from pandas DataFrame [duplicate]

Trying to remove duplicate based on unique values on column 'new', I have even tried two methods, but the output df.shape suggests before/after have the same df shape, meaning remove duplication fails.
import pandas
import numpy as np
import random
df = pandas.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df['new'] = [1, 1, 3, 4, 5, 1, 7, 8, 1, 10]
df['new2'] = [1, 1, 2, 4, 5, 3, 7, 8, 9, 5]
print df.shape
df.drop_duplicates('new', take_last=False)
df.groupby('new').max()
print df.shape
# output
(10, 6)
(10, 6)
[Finished in 1.0s]

You need to assign the result of drop_duplicates, by default inplace=False so it returns a copy of the modified df, as you don't pass param inplace=True your original df is unmodified:
In [106]:
df = df.drop_duplicates('new', take_last=False)
df.groupby('new').max()
Out[106]:
A B C D new2
new
1 -1.698741 -0.550839 -0.073692 0.618410 1
3 0.519596 1.686003 1.395585 1.298783 2
4 1.557550 1.249577 0.214546 -0.077569 4
5 -0.183454 -0.789351 -0.374092 -1.824240 5
7 -1.176468 0.546904 0.666383 -0.315945 7
8 -1.224640 -0.650131 -0.394125 0.765916 8
10 -1.045131 0.726485 -0.194906 -0.558927 5
if you passed inplace=True it would work:
In [108]:
df.drop_duplicates('new', take_last=False, inplace=True)
df.groupby('new').max()
Out[108]:
A B C D new2
new
1 0.334352 -0.355528 0.098418 -0.464126 1
3 -0.394350 0.662889 -1.012554 -0.004122 2
4 -0.288626 0.839906 1.335405 0.701339 4
5 0.973462 -0.818985 1.020348 -0.306149 5
7 -0.710495 0.580081 0.251572 -0.855066 7
8 -1.524862 -0.323492 -0.292751 1.395512 8
10 -1.164393 0.455825 -0.483537 1.357744 5

Import only first line of multiple csv, dummycode repeats and calculate conditional probability

I am actually a R savvy person but for this task I am using python and am overwhelmed by little things. I really hope you guys may help me.
I have three large csv files that contain ids in the first column. Each csv contains presence data of a day and each unique id stands for a person. csv1 is day 1, csv2 is day 2 and csv3 is day 3.
What I want to do now is to import only the first column for each csv because I don't need the rest and the files are quite big. Then I'd like to compare the appearance for each person (so the appearance for each id) and dummycode the information into one data frame where the first column contains the unique ids of all three csv files, the second a binary code for the presence of the id in the first csv (0=id doesn't appear in csv1, 1=id appaers in csv1) and then in column two do the same for csv2 and so on.
Last but not least, I'd like then to calculate the conditional probability for example for the case that a person that didn't came the first two days would show up on the third.
Of course a whole solution would be great but I would be pretty happy with baby steps and some python for dummies advice here, too. The thing is that I know how to do this with R but not with python and I would really appreciate your help.
Thank you very much and excuse my English.
If you need sample data, maybe this helps (the real data is huge):
csv1
id col2 col3
1 x x
2 x x
7 x x
csv2
id col2 col3
1 x x
3 x x
7 x x
4 x x
csv3
id col2 col3
1 x x
2 x x
3 x x
4 x x
5 x x
6 x x
Dummycoded df would be:
df
id csv1 csv2 csv3
1 1 1 1
2 1 0 1
3 0 1 1
4 0 1 1
5 0 0 1
6 0 0 1
7 1 1 0

You can simply open your csv files and store all the first columns (ids) in a list:
with open(csv1_name) as f1, open(csv2_name) as f2,open(csv3_name) as f3:
column1 = next(zip(*csv.reader(f1))) # in python 2 use zip(*csv.reader(f1))[0]
column2 = next(zip(*csv.reader(f2)))
column3 = next(zip(*csv.reader(f3)))
Now you have the indices of the items in new matrix that should be 1. So you can use a list comprehension to create your columns:
a = [[1, 2, 7], [1, 3, 7, 4], [1, 2, 3, 4, 5, 6]]
[[1 if j in a[i] else 0 for j in range(1, 8)] for i in range(3)]
[[0, 1, 1, 0, 0, 0, 0], [0, 1, 0, 1, 1, 0, 0], [0, 1, 1, 1, 1, 1, 1]]
Or use zip again to create the rows:
In [138]: zip(*[[1 if j in a[i] else 0 for j in range(1, 8)] for i in range(3)])
Out[138]: [(1, 1, 1), (1, 0, 1), (0, 1, 1), (0, 1, 1), (0, 0, 1), (0, 0, 1), (1, 1, 0)]
Note: if you don't know the maximum number of rows you can get that by flattening the columns list
So here is the general code:
from itertools import chain
import csv
with open(csv1_name) as f1, open(csv2_name) as f2,open(csv3_name) as f3:
column1 = next(zip(*csv.reader(f1))) # in python 2 use zip(*csv.reader(f1))[0]
column2 = next(zip(*csv.reader(f2)))
column3 = next(zip(*csv.reader(f3)))
columns = [column1, column2, column3]
m_row = max(chain.from_iterable(a))
f_number = 3
new_columns = [[1 if j in columns[i] else 0 for j in range(1, m_row + 1)] for i in range(f_number)]
# Creating new csv file
with open('new_file.csv', 'w') as f:
csvwriter = csv.writer(f)
for id, row in zip(*new_columns):
csvwriter.writerow((id,)+row)

Loading a table in numpy with row- and column-indices, like in R?

I would like to load a table in numpy, so that the first row and first column would be considered text labels. Something equivalent to this R code:
read.table("filename.txt", row.header=T)
Where the file is a delimited text file like this:
A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5
So that read in I will have an array:
[[5,4,3,2],
[1,0,9,9],
[8,7,6,5]]
With some sort of:
rownames ["X","Y","Z"]
colnames ["A","B","C","D"]
Is there such a class / mechanism?

Numpy arrays aren't perfectly suited to table-like structures. However, pandas.DataFrames are.
For what you're wanting, use pandas.
For your example, you'd do
data = pandas.read_csv('filename.txt', delim_whitespace=True, index_col=0)
As a more complete example (using StringIO to simulate your file):
from StringIO import StringIO
import pandas as pd
f = StringIO("""A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5""")
x = pd.read_csv(f, delim_whitespace=True, index_col=0)
print 'The DataFrame:'
print x
print 'Selecting a column'
print x['D'] # or "x.D" if there aren't spaces in the name
print 'Selecting a row'
print x.loc['Y']
This yields:
The DataFrame:
A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5
Selecting a column
X 2
Y 9
Z 5
Name: D, dtype: int64
Selecting a row
A 1
B 0
C 9
D 9
Name: Y, dtype: int64
Also, as #DSM pointed out, it's very useful to know about things like DataFrame.values or DataFrame.to_records() if you do need a "raw" numpy array. (pandas is built on top of numpy. In a simple, non-strict sense, each column of a DataFrame is stored as a 1D numpy array.)
For example:
In [2]: x.values
Out[2]:
array([[5, 4, 3, 2],
[1, 0, 9, 9],
[8, 7, 6, 5]])
In [3]: x.to_records()
Out[3]:
rec.array([('X', 5, 4, 3, 2), ('Y', 1, 0, 9, 9), ('Z', 8, 7, 6, 5)],
dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8'), ('D', '<i8')])

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.