I want to make a dataframe containing 3 columns. I have three lists containing the values that need to go into the dataframe in a certain order, so I want to loop over the lists to combine them and create the dataframe.
List F contains 9 values
List P contains 3 values
List A contains 3 values
The final dataframe will be exported to Excel and should look like this:
|F |P |A |
|----|----|----|
|F(0)|P(0)|A(0)|
|F(1)|P(0)|A(1)|
|F(2)|P(0)|A(2)|
|F(3)|P(1)|A(0)|
|F(4)|P(1)|A(1)|
|F(5)|P(1)|A(2)|
|F(6)|P(2)|A(0)|
|F(7)|P(2)|A(1)|
|F(8)|P(2)|A(2)|
To achieve this, I wanted to first create a list with these values and then split that into a dataframe.
I tried this to obtain the list:
df_test3 = []
for f in F:
    df_test3.append(f)
    for p in P:
        for a in A:
            df_test3.append(p)
            df_test3.append(a)
Lists P and A are in the correct order, but I can't match them with the outer loop over F. I know I have to do something with break to return to the outer loop, but I can't see how.
It returns this now:
list = [F0, P0, A0, P0, A1, P0, A2, P1, A0, etc.]
and continues to the next value of F after the inner loops are completed. How can I get all the values in the right order in the list? Or am I handling this the wrong way and should I create the dataframe right away?
Try this...
import pandas as pd

F = [1, 2, 3, 4, 5, 6, 7, 8, 9]
P = [11, 22, 33]
A = [111, 222, 333]

# repeat each value of P enough times to match the length of F
P1 = []
num1 = len(F) // len(P)
for p in P:
    P1 = P1 + [p] * num1

# tile A as a whole to match the length of F
num2 = len(F) // len(A)
A1 = A * num2

df_result = pd.DataFrame({"F": F, "P": P1, "A": A1})
# Output of df_result...
   F   P    A
0  1  11  111
1  2  11  222
2  3  11  333
3  4  22  111
4  5  22  222
5  6  22  333
6  7  33  111
7  8  33  222
8  9  33  333
Hope this helps.
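The same repetition pattern can also be written with numpy, which may read more clearly. A minimal sketch, assuming numpy is acceptable alongside pandas: np.repeat repeats each element in place (for P) and np.tile repeats the whole list (for A).

import numpy as np
import pandas as pd

F = [1, 2, 3, 4, 5, 6, 7, 8, 9]
P = [11, 22, 33]
A = [111, 222, 333]

# repeat each P value 3 times; tile A as a whole 3 times
df_result = pd.DataFrame({"F": F,
                          "P": np.repeat(P, len(F) // len(P)),
                          "A": np.tile(A, len(F) // len(A))})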
You can use cycle from itertools.
For example:
from itertools import cycle

F = [0, 1, 2, 3, 4, 5, 6, 7, 8]
P = [0, 1, 2]
A = [0, 1, 2]

zip_list = zip(F, cycle(P), cycle(A))
print(list(zip_list))
the result is: [(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 0, 0), (4, 1, 1), (5, 2, 2), (6, 0, 0), (7, 1, 1), (8, 2, 2)]
You can work from this; it may help you get to a solution.
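Note that cycling P and A in lockstep does not reproduce the ordering asked for in the question, where P advances only every three rows. For that, the Cartesian product of P and A can be zipped against F. A minimal sketch, assuming pandas for the final dataframe:

from itertools import product

import pandas as pd

F = [0, 1, 2, 3, 4, 5, 6, 7, 8]
P = [0, 1, 2]
A = [0, 1, 2]

# product(P, A) yields (P0, A0), (P0, A1), ..., (P2, A2), in the desired order
rows = [(f, p, a) for f, (p, a) in zip(F, product(P, A))]
df = pd.DataFrame(rows, columns=['F', 'P', 'A'])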
I'm trying to analyze a database of trucks and the items they carry to find out which two trucks are the most similar to one another (the trucks that share the greatest number of items). I have a csv similar to this:
truck_id | item_id
13 | 85394 *
16 | 294 *
13 | 294 *
89 | 3115
89 | 85394
13 | 294
16 | 85394 *
13 | 3115
In the above example, 16 and 13 are the most similar trucks, as they both have the 294 and 85394 items.
The entire code is too long so I'll offer pseudo code for what I'm doing:
truck_items = {}
#1
loop over the csv:
    add to truck_items a truck_id and an ARRAY with the items each truck has
#2
go over each truck in the truck_items dictionary, and compare their array
to all other arrays to get the count of similar items
#3
create a 'most_similar' key in the dictionary
#4
check in most_similar what are the two trucks with the most similarity
So I would end up with something like this:
{
13: [16, 2] // truck_1_id: [truck_2_id, number_similar_items]
89: ...
}
I understand this is not the most efficient way, as I'm going over the lists too many times, and that shouldn't be done. Is there a more efficient way?
Non-pandas solution, leveraging built-in tools such as collections.defaultdict (optional) and itertools.product (also optional, but it pushes some calculations/loops down to the C level, which is beneficial if the data set is large enough).
I think the logic itself is self-explanatory.
from collections import defaultdict
from itertools import product
trucks = [
    (13, 294),
    (13, 294),
    (13, 3115),
    (13, 85394),
    (16, 294),
    (16, 85394),
    (89, 3115),
    (89, 85394),
]

# collect each truck's items into a set (duplicates collapse automatically)
d = defaultdict(set)
for truck, load in trucks:
    d[truck].add(load)

# every ordered pair of distinct trucks, as small dicts
li = [({'truck': k1, 'items': v1},
       {'truck': k2, 'items': v2})
      for (k1, v1), (k2, v2) in product(d.items(), repeat=2)
      if k1 != k2]

# the pair with the largest intersection of item sets wins
truck_1_data, truck_2_data = max(li, key=lambda e: len(e[0]['items'] & e[1]['items']))
print(truck_1_data['truck'], truck_2_data['truck'])
outputs
13 16
Arguably a more readable version:
...
li = [{k1: v1,
       k2: v2}
      for (k1, v1), (k2, v2) in product(d.items(), repeat=2)
      if k1 != k2]

def dict_values_intersection_len(d):
    values = list(d.values())
    return len(values[0] & values[1])

truck_1, truck_2 = max(li, key=dict_values_intersection_len)
print(truck_1, truck_2)
which also outputs
13 16
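A small variation, if you want to avoid scoring each pair twice: itertools.combinations yields every unordered pair exactly once, making the k1 != k2 filter unnecessary. A sketch, reusing the d built above:

from itertools import combinations

# each unordered pair of (truck, items) entries appears exactly once
(t1, items1), (t2, items2) = max(combinations(d.items(), 2),
                                 key=lambda p: len(p[0][1] & p[1][1]))
print(t1, t2)  # 13 16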
Use groupby to gather all records for a given truck. For each group, make a set of item IDs. Make a new data frame of that data:
truck_id | items
13 | {85394, 294, 3115}
16 | {294, 85394}
89 | {3115, 85394}
Now you need to make a full cross-product of this DF with itself; filter to remove self-reference and duplicates (13-16 and 16-13, for example). If you make the product with
truck_id_left < truck_id_right (I'll leave the implementation syntax to you, dependent on the package you use), you'll get only the unique pairs.
On that series of truck pairs, simply take the set intersection of their items:
trucks | items
(13, 16) | {85394, 294}
(13, 89) | {3115}
(16, 89) | {85394}
Then find the row with the max value on that intersection.
Can you handle each of those steps? They're all covered in pandas tutorials.
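A minimal sketch of those steps, assuming the question's truck_id and item_id column names, with itertools.combinations standing in for the cross-product-plus-filter step:

from itertools import combinations

import pandas as pd

df = pd.DataFrame({'truck_id': [13, 16, 13, 89, 89, 13, 16, 13],
                   'item_id': [85394, 294, 294, 3115, 85394, 294, 85394, 3115]})

# one set of items per truck
items = df.groupby('truck_id')['item_id'].apply(set)

# unique truck pairs, scored by the size of their item intersection
best = max(combinations(items.index, 2),
           key=lambda p: len(items[p[0]] & items[p[1]]))
print(best)  # (13, 16)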
Here's a solution that seems like it might work:
I'm using pandas as my main data container, just makes stuff like this easier.
import pandas as pd
from collections import Counter
Here I'm creating a similar dataset
#creating toy data
df = pd.DataFrame({'truck_id':[1,1,2,2,2,3,3],'item_id':[1,7,1,7,5,2,2]})
that looks like this
   item_id  truck_id
0        1         1
1        7         1
2        1         2
3        7         2
4        5         2
5        2         3
6        2         3
I'm reformatting it to have a list of items for each truck
#making it so each row is a truck, and the value is a list of items
df = df.groupby('truck_id')['item_id'].apply(list)
which looks like this:
truck_id
1 [1, 7]
2 [1, 7, 5]
3 [2, 2]
Now I'm creating a function that, given a df like the previous one, counts the number of similar items on two trucks.
def get_num_similar(df, id0, id1):
    # drop duplicates within each truck, so there's only one of each item per truck,
    # then combine the two lists into a single list of items from both trucks
    comp = [*set(df.loc[id0]), *set(df.loc[id1])]
    # count how many of each item exist (should be 1 or 2)
    quants = dict(Counter(comp))
    # items with a count above 1 are carried by both trucks
    num_similar = len([quant for quant in quants.values() if quant > 1])
    return num_similar
running this:
print(get_num_similar(df, 1, 2))
results in an output of 2, which is accurate. Now just iterate over all pairs of trucks you want to analyze, and you can calculate which trucks have the most shared items, as sketched below.
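A sketch of that iteration, reusing the grouped df and get_num_similar from above:

from itertools import combinations

# score every pair of truck ids and keep the most similar pair
best_pair = max(combinations(df.index, 2),
                key=lambda pair: get_num_similar(df, *pair))
print(best_pair)  # (1, 2) for the toy data above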
I used a regex to extract patterns from a CSV document; the pattern is (qty x volume in L), e.g. 2x2L or 3x4L. (Note that one cell can contain more than one pattern, e.g. "I want 2x4L and 3x1L".)
0 []
1 [(2, x1L), (2, x4L)]
2 [(1, x1L), (1, x4L)]
3 [(2, x4L)]
4 [(1, x4L), (1, x1L)]
...
95 [(1, x2L)]
96 [(1, x1L), (1, x4L)]
97 [(2, x1L)]
98 [(6, x1L)]
99 [(6, x1L), (4, x2L), (4, x4L)]
Name: cards__name, Length: 100, dtype: object
I want to create 3 columns called "1L", "2L" and "4L", and then for every item, take the quantity and add it to the relevant row under the relevant column, like so:

1L  2L  4L
 2   0   2
 1   0   1
 0   0   2
 1   0   1
However, I am not able to index the tuple in order to extract the quantity and the volume size for every item.
Any ideas?
Before you can use pivot, you have to normalize your columns, e.g. this way:
df['multiplier_1'] = df['order_1'].apply(lambda r: r[0])   # the quantity
df['base_volume_1'] = df['order_1'].apply(lambda r: r[1])  # the volume label
That way you will be able to ungroup the orders and eventually split them into multiple base volumes.
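For the exact shape asked for in the question, here is a minimal sketch, assuming each cell holds a list of (qty, volume) tuples like (2, 'x4L'), as in the printout above:

import pandas as pd

# hypothetical data mirroring the question's extracted patterns
s = pd.Series([[], [(2, 'x1L'), (2, 'x4L')], [(1, 'x1L'), (1, 'x4L')], [(2, 'x4L')]])

# start from zeroed columns, then add each quantity under its volume column
out = pd.DataFrame(0, index=s.index, columns=['1L', '2L', '4L'])
for i, patterns in s.items():
    for qty, vol in patterns:
        out.loc[i, vol.lstrip('x')] += int(qty)
print(out)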
I have two two-dimensional arrays, and I have to create a new array by filtering the 2nd array down to the rows whose first-column values match those in the 1st array. The arrays are of different sizes.
Basically the idea is as follows:
file A
#x y
1 2
3 4
2 2
5 4
6 4
7 4
file B
#x1 y1
0 1
1 1
11 1
5 1
7 1
My expected output 2D array should look like
#newx newy
1 1
5 1
7 1
I tried it the following way:

match = []
for i in range(len(x)):
    if x[i] == x1[i]:
        new_array = x1[i]
        match.append(new_array)
print(match)

This does not seem to work. Please suggest a way to create the new 2D array.
Try np.isin.
import numpy as np

arr1 = np.array([[1, 3, 2, 5, 6, 7], [2, 4, 2, 4, 4, 4]])
arr2 = np.array([[0, 1, 11, 5, 7], [1, 1, 1, 1, 1]])
arr2[:, np.isin(arr2[0], arr1[0])]
array([[1, 5, 7],
[1, 1, 1]])
np.isin(arr2[0], arr1[0]) checks whether each element of arr2[0] is in arr1[0]. Then, we use the result as the boolean index array to select elements in arr2.
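If your arrays are stored row-wise instead (one (x, y) pair per row, as in the files above), the same trick works on the first column. A sketch:

import numpy as np

# one (x, y) pair per row, as in files A and B
a = np.array([[1, 2], [3, 4], [2, 2], [5, 4], [6, 4], [7, 4]])
b = np.array([[0, 1], [1, 1], [11, 1], [5, 1], [7, 1]])

# compare first columns, keep the matching rows of b
print(b[np.isin(b[:, 0], a[:, 0])])
# [[1 1]
#  [5 1]
#  [7 1]]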
If you make a set out of the first element of each pair in A, then it is fairly easy to find the elements in B to keep:
Code:
a = ((1, 2), (3, 4), (2, 2), (5, 4), (6, 4), (7, 4))
b = ((0, 1), (1, 1), (11, 1), (5, 1), (7, 1))
in_a = {i[0] for i in a}
new_b = [i for i in b if i[0] in in_a]
print(new_b)
Results:
[(1, 1), (5, 1), (7, 1)]
Output results to file as:
with open('output.txt', 'w') as f:
    for value in new_b:
        f.write(' '.join(str(v) for v in value) + '\n')
#!/usr/bin/env python3
from io import StringIO
import pandas as pd
fileA = """x y
1 2
3 4
2 2
5 4
6 4
7 4
"""
fileB = """x1 y1
0 1
1 1
11 1
5 1
7 1
"""
df1 = pd.read_csv(StringIO(fileA), delim_whitespace=True, index_col="x")
df2 = pd.read_csv(StringIO(fileB), delim_whitespace=True, index_col="x1")
df = pd.merge(df1, df2, left_index=True, right_index=True)
print(df["y1"])
# 1 1
# 5 1
# 7 1
https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
If you use pandas:
import pandas as pd
A = pd.DataFrame({'x': pd.Series([1,3,2,5,6,7]), 'y': pd.Series([2,4,2,4,4,4])})
B = pd.DataFrame({'x1': pd.Series([0,1,11,5,7]), 'y1': 1})
C = A.join(B.set_index('x1'), on='x')
Then if you wanted to drop the unneeded rows/columns and rename the columns:

C = A.join(B.set_index('x1'), on='x')
C = C.dropna().drop(['y'], axis=1)  # drop the non-matching rows and the unneeded column
C.columns = ['newx', 'newy']
which gives you:
>>> C
   newx  newy
0     1   1.0
3     5   1.0
5     7   1.0
If you are going to work with arrays, dataframes, etc - pandas is definitely worth a look: https://pandas.pydata.org/pandas-docs/stable/10min.html
Assuming that you have (x, y) pairs in your 2-D arrays, a simple loop may work:
arr1 = [[1, 2], [3, 4], [2, 2]]
arr2 = [[0, 1], [1, 1], [11, 1]]
result = []
for pair1 in arr1:
    for pair2 in arr2:
        if pair1[0] == pair2[0]:
            result.append(pair2)
print(result)
Not the best solution for smaller arrays, but it works fast for really large arrays:
import numpy as np
import pandas as pd

n1 = np.transpose(np.array([[1, 3, 2, 5, 6, 7], [2, 4, 2, 4, 4, 4]]))
n2 = np.transpose(np.array([[0, 1, 11, 5, 7], [1, 1, 1, 1, 1]]))

# inner-merge on the first column, then drop the left-hand y column
np.array(pd.DataFrame(n1).merge(pd.DataFrame(n2), on=0, how='inner').drop('1_x', axis=1))
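For reference, with these inputs the final expression evaluates to the matching rows:
array([[1, 1],
       [5, 1],
       [7, 1]])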
I'm stuck rolling a window over multiple columns in pandas. What I have is:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

def test(ts):
    print(ts.shape)

df.rolling(2).apply(test)
However, the problem is that ts.shape prints (2,), and I wanted it to print (2, 2), that is, to include the whole window of both rows and columns.
What is wrong about my intuition of how rolling works, and how can I get the result I'm after using pandas?
You can use a little hack - get the number of numeric columns with select_dtypes and use that scalar value:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': list('abcd')})
print(df)

   A  B  C
0  1  5  a
1  2  6  b
2  3  7  c
3  4  8  d

cols = len(df.select_dtypes(include=[np.number]).columns)
print(cols)

2

def test(ts):
    print(tuple((ts.shape[0], cols)))
    return ts.sum()

df = df.rolling(2).apply(test)

(2, 2)
(2, 2)
(2, 2)
(2, 2)
(2, 2)
(2, 2)
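If you really do need the full two-dimensional window, one workaround is to slide over the index yourself. A sketch, using the question's two-column frame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

window = 2
for i in range(len(df) - window + 1):
    ts = df.iloc[i:i + window]  # a full (rows, columns) DataFrame window
    print(ts.shape)             # prints (2, 2) each time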
I have a list in Python
numbers_list = [(2,5), (3,4), (2,6), (3,5)...]
I want to copy the list to an Excel CSV called NumberPairings, but I want each combination in a different row and each number in the row in different columns.
So I want the excel file to look like this:
Num1 Num2
2 5
3 4
2 6
3 5
I think I should use a for loop that begins with
for item in numbers_list:
But I need help using pandas to write the file the way I want it. If you think there is an easier way than pandas, I'm open to it as well.
You can separate the tuples into individual columns like this:
import pandas as pd

df = pd.DataFrame(data={'tuples': numbers_list})
df
tuples
0 (2, 5)
1 (3, 4)
2 (2, 6)
3 (3, 5)
df['Num1'] = df['tuples'].str[0]
df['Num2'] = df['tuples'].str[1]
df
tuples Num1 Num2
0 (2, 5) 2 5
1 (3, 4) 3 4
2 (2, 6) 2 6
3 (3, 5) 3 5
# optional: write out a csv without the tuples column (path is your output file path)
df.drop(['tuples'], axis=1).to_csv(path)
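If the intermediate tuples column isn't needed, the DataFrame constructor can split the tuples into columns directly. A sketch, where 'NumberPairings.csv' is an assumed file name:

import pandas as pd

numbers_list = [(2, 5), (3, 4), (2, 6), (3, 5)]

# each tuple becomes a row; the columns are named up front
df = pd.DataFrame(numbers_list, columns=['Num1', 'Num2'])
df.to_csv('NumberPairings.csv', index=False)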