I have three arrays as listed below:
users — Contains the ids of 50000 users (all distinct)
pusers — Contains the ids of the users who own posts (contains repeated ids, since one user can own many posts) [50000 values]
score — Contains the score corresponding to each value in pusers [50000 values]
Now I want to populate another array, PScore, based on the following calculation: for each occurrence of a user in pusers, I need to fetch the corresponding score and add it to PScore at the index that user has in users.
Example,
if users[5] = 23224
and pusers[6] = pusers[97] = 23224
then PScore[5] += score[6]+score[97]
Items of note:
score is related to pusers (e.g., pusers[5] has score[5])
PScore is expected to be related to users (e.g., the cumulative score of users[5] is PScore[5])
The ultimate aim is to assign a cumulative score of posts to the user who owns it.
The users who don't own any posts are assigned a score of 0.
Can anyone help me with this? I have tried several approaches, but when I run them the output screen stays blank until I press Ctrl+Z to get out.
I went through all of the following posts but I couldn't use them effectively for my scenario.
Compare values of two arrays in python
how to compare two arrays in python?
Checking if any elements in one list are in another
I am new to this forum and I'm a beginner in Python too. Any help is going to be really useful to me.
Additional Information
I'm working on a small project using StackOverflow data.
I'm using the Orange tool and I'm in the process of learning both the tool and Python.
OK, I understand that something is wrong with my approach. So shouldn't I use lists for this scenario? Can anyone please tell me how I should proceed?
A sample of the data that I have arrived at is shown below.
PUsers Score
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
13 0
77 1
77 4
77 3
77 0
77 2
77 2
77 3
102 2
105 0
108 2
108 2
117 2
Users
-1
1
2
3
4
5
8
9
10
11
13
16
17
19
20
22
23
24
25
26
27
29
30
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
48
49
50
All that I want is the total score associated with each user. Once again, the pusers list contains repetition while the users list contains unique values. I need the total score associated with each user stored in such a way that, if I say PScore[6], it refers to the total score associated with users[6].
Hope I answered the queries.
Thanks in advance.
From how you described your arrays, and since you're using Python, this looks like a perfect candidate for dictionaries.
Instead of having one array for post owner and another array for post score, you should be able to make a dictionary that maps a user id to a score. When you're taking in data, look in the dictionary to see if the user already exists. If so, add the score to the current score. If not, make a new entry. When you've looped through all the data, you should have a dictionary that maps from user id to total score.
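A minimal sketch of that approach (my illustration, reusing the users, pusers and score lists described in the question):

# One pass over the posts: accumulate the total score per post owner.
totals = {}
for owner, sc in zip(pusers, score):
    totals[owner] = totals.get(owner, 0) + sc

# Map the totals back onto users; users who own no posts get 0.
PScore = [totals.get(u, 0) for u in users]

This is a single pass over the 50000 posts plus one pass over the users, so it stays linear.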
http://docs.python.org/2/tutorial/datastructures.html#dictionaries
I think your algorithm is either wrong or broken.
Try to compute its complexity. If it's O(N^2) or more, you are likely using an inefficient algorithm: O(N^2) with 50,000 elements should take a few seconds, and O(N^3) will probably take minutes.
If you're sure of your approach, try running it with some small fake data to figure out whether it does the right thing or whether you accidentally added an infinite loop.
You can easily get it working in linear time with dictionaries.
I have done KMeans clustering and now I need to analyse each individual cluster, for example look at cluster 1, see which clients are in it, and draw conclusions.
dfRFM['idcluster'] = num_cluster
dfRFM.head()
idcliente Recencia Frecuencia Monetario idcluster
1 3 251 44 -90.11 0
2 8 1011 44 87786.44 2
6 88 537 36 8589.57 0
7 98 505 2 -179.00 0
9 156 11 15 35259.50 0
How do I group so I only see results from, let's say, idcluster 0, sorted by, say, "Monetario"? Thanks!
To filter a dataframe, the most common way is to use df[df[colname] == val]. Then you can use df.sort_values().
In your case, that would look like this:
dfRFM_id0 = dfRFM[dfRFM['idcluster']==0].sort_values('Monetario')
The way this filtering works is that dfRFM['idcluster']==0 returns a Series of True/False values, one per row. So we effectively have dfRFM[(True, False, True, True, ...)], and the dataframe returns only the rows where the value is True; that is, it selects the data where the condition holds.
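As a tiny self-contained illustration (toy data, not your actual frame):

import pandas as pd

dfRFM = pd.DataFrame({'idcluster': [0, 2, 0, 1],
                      'Monetario': [5.0, 2.0, 9.0, 1.0]})
mask = dfRFM['idcluster'] == 0               # Series of True/False, one per row
print(dfRFM[mask].sort_values('Monetario'))  # keeps only the rows where mask is True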
I think you actually just need to filter your DF!
df_new = dfRFM[dfRFM.idcluster == 0]
and then sort by Monetario
df_new = df_new.sort_values(by = 'Monetario')
Group by is really best for when you want to look at the cluster as a whole, for example if you wanted to see the average values of Recencia, Frecuencia, and Monetario for all of group 0.
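For example, a quick per-cluster profile might look like this (a sketch, assuming the column names from your dfRFM):

# Average Recencia, Frecuencia and Monetario per cluster.
cluster_profile = dfRFM.groupby('idcluster')[['Recencia', 'Frecuencia', 'Monetario']].mean()
print(cluster_profile)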
I am relatively new to Python and I've run into a problem that I cannot seem to search my way out of. I have written a function to query a third-party API. The function runs as expected and retrieves the correct results. However, my dataframe display returns the results with the columns and rows transposed. I have used this same function before without issue. I know I can transpose them into the correct position, but since I want to use this function as part of a larger function, it is important that the query return the values with the columns and rows as intended.
I've included my snippet below as well as the results and desired outcome.
import pandas as pd

def get_tiering():
    df = vendorAPI.getportfoliocustomcolumns('prod').as_dataframe()
    records = df.to_dict('records')
    return {rec['Patient']: rec for rec in records}

tiersdf = pd.DataFrame(get_tiering())
print(tiersdf)
Result: [6 rows x 198 columns]
Desired: [198 rows x 6 columns]
I am wondering if there is some DataFrame setting that I inadvertently changed? I am using Spyder version 2.2 with Python 3.9. Any guidance you can provide would be appreciated.
Thank you for your time.
Did you try tiersdf.T?
This is my sample df
p1 p2 p3
height 65 66 5
weight 62 22 6
age 32 55 8
bp 12 44 6
hr 2 8 3
and I got this after doing the transpose
height weight age bp hr
p1 65 62 32 12 2
p2 66 22 55 44 8
p3 5 6 8 6 3
I would suggest you try to change
records = df.to_dict('records')
return {rec['Patient']: rec for rec in records}
to
return df.transpose().to_dict('series')
Please, let me know if it works. Otherwise, please let us know the exact output of vendorAPI.getportfoliocustomcolumns('prod')
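For what it's worth, the transposed display is most likely just how pandas interprets the dict that get_tiering() returns: pd.DataFrame treats the outer keys of a dict-of-dicts as column labels, so each patient becomes a column. A small illustration with made-up records:

import pandas as pd

# Hypothetical stand-in for the dict returned by get_tiering().
records_by_patient = {
    "A001": {"Patient": "A001", "Tier": 1},
    "A002": {"Patient": "A002", "Tier": 3},
}
print(pd.DataFrame(records_by_patient))    # patients end up as columns (6 x 198 in your case)
print(pd.DataFrame(records_by_patient).T)  # .T flips it back: one row per patient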
I have a lot of data that I'm trying to do some basic machine learning on, kind of like the Titanic example that predicts whether a passenger survived or died (I learned this in an intro Python class) based on factors like their gender, age, fare class...
What I'm trying to predict is whether a screw fails depending on how it was made (referred to as Lot). The engineers just listed how many times a failure occurred. Here's how it's formatted.
Lot  Failed?
100  3
110  0
120  1
130  4
The values in the cells are the number of occurrences, so for example:
Lot 100 had three screws that failed
Lot 110 had 0 screws that failed
Lot 120 had one screw that failed
Lot 130 had four screws that failed
I plan on doing a logistic regression using scikit-learn, but first I need each row to be listed as a failure or not. What I'd like to see is a row for every observation, and have them listed as either a 0 (did not occur) or 1 (did occur). Here's what it'd look like after
Lot  Failed?
100  1
100  1
100  1
110  0
120  1
130  1
130  1
130  1
130  1
Here's what I've tried and what I've gotten
import pandas as pd

df = pd.DataFrame({
    'Lot': ['100', '110', '120', '130'],
    'Failed?': [3, 0, 1, 4]
})

df.loc[df.index.repeat(df['Failed?'])].reset_index(drop=True)
When I do this it repeats the rows but keeps the same values in the Failed? column.
Lot  Failed?
100  3
100  3
100  3
110  0
120  1
130  4
130  4
130  4
130  4
Any ideas? Thank you!
You can use pandas.Series.repeat with reindex, but first you need to differentiate between rows that have 0 and those that do not:
s = df[df['Failed?'].eq(0)] # "save" rows with 0 as value as they will be excluded in repeat since they are repeated 0 times.
df = df.reindex(df.index.repeat(df['Failed?'])) #repeat each row depending on value
df['Failed?'] = 1 #set all values equal to 1
df = pd.concat([df,s]).sort_index() #bring in the 0 values that we saved as 's' earlier and sort by the index to put back in order
df
#The above code as a one-liner:
(pd.concat([df.reindex(df.index.repeat(df['Failed?'])).assign(**{'Failed?' : 1}),
df[df['Failed?'].eq(0)]])
.sort_index())
Out[1]:
Lot Failed?
0 100 1
0 100 1
0 100 1
1 110 0
2 120 1
3 130 1
3 130 1
3 130 1
3 130 1
The below will give you failure or not, but I suppose you are better served by the other answer:
df.loc[df['Failed?']>0,'Failed?'] = 1
Just as a comment: this is a bit of a strange data transformation; you might want to just keep a numerical target variable.
This might be a noob question, but I'm new to coding. I used the following code to categorize my data, but I need to specify that the category should also be assigned when not all of my conditions are fulfilled together, e.g. when only 4 out of 7 conditions are met. How can I do that? I really appreciate any help you can provide.
import numpy as np

c1 = df['Stroage Condition'].eq('refrigerate')
c2 = df['Profit Per Unit'].between(100, 150)
c3 = df['Inventory Qty'] < 20

df['Restock Action'] = np.where(c1 & c2 & c3, 'Hold Current stock level', 'On Sale')
print(df)
Let's say this is your dataframe:
Stroage Condition refrigerate Profit Per Unit Inventory Qty
0 0 1 0 20
1 1 1 102 1
2 2 2 5 2
3 3 0 100 8
and the conditions are the ones you defined:
c1=df['Stroage Condition'].eq(df['refrigerate'])
c2=df['Profit Per Unit'].between(100,150)
c3=df['Inventory Qty']<20
Then you can define a small helper function and pass its result to your np.where() call. There you can define how many conditions have to be True; in this example I set the threshold to at least two.
def my_select(x, y, z):
    return np.array([x, y, z]).sum(axis=0) >= 2
Finally you run one more line:
df['Restock Action']=np.where(my_select(c1,c2,c3), 'Hold Current stock level', 'On Sale')
print(df)
This prints to the console:
Stroage Condition refrigerate Profit Per Unit Inventory Qty Restock Action
0 0 1 0 20 On Sale
1 1 1 102 1 Hold Current stock level
2 2 2 5 2 Hold Current stock level
3 3 0 100 8 Hold Current stock level
If you have more conditions or rules, you have to extend the function with as many variables as there are rules.
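If it helps, the same idea can be generalised to any number of conditions and any threshold k (a sketch I am adding for illustration, not part of the original answer):

import pandas as pd

def at_least(conditions, k):
    """True where at least k of the given boolean Series hold."""
    return pd.concat(conditions, axis=1).sum(axis=1) >= k

# For example, requiring at least 2 of the 3 conditions defined above:
# df['Restock Action'] = np.where(at_least([c1, c2, c3], 2),
#                                 'Hold Current stock level', 'On Sale')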
I have successfully used Integer Linear Programming to solve a slot filling problem using the following (+ some hard / soft constraints).
import itertools
import pulp

def opt(A, max=float('inf'), min=0):
    """Find the optimal solution."""
    M, N = len(A), len(A[0])
    # Create problem:
    prob = pulp.LpProblem("optimiser", pulp.LpMaximize)
    # Create variables:
    x = pulp.LpVariable.dicts("x", itertools.product(range(M), range(N)),
                              cat=pulp.LpBinary)
    # Constraints
Specifically, I have a range of IDs (M) with an associated range of products (N); products, despite initially being attached to multiple IDs, can only be assigned to one ID at the end of the process. I am able to do this so far.
ID Product
1 100
1 200
1 300
2 100
2 200
3 100
4 500
4 200
5 600
6 600
These are converted to a binary matrix in order to assign products to the up to 6 available slots (the min/max of which I control with soft constraints). This works fine as a basic optimizer; however, I need to add some extra rules.
For example, I want to add weights to certain products (they each have an assigned score as a separate variable), and I need the highest-scored products to be prioritised when filling the slots.
ID Product Score
1 100 5
1 200 5
1 300 2
2 100 5
2 200 5
3 100 5
4 500 1
4 200 5
5 600 4
6 600 4
So I would want the products that score 5 to be assigned to an ID first; once all those are in, then the next descending score, and so on.
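One common way to express this kind of preference in an ILP is to fold the scores into the objective, so higher-scoring assignments win when slots are scarce; if a strict priority is needed (all score-5 products placed before any score-4 product is considered), a usual trick is to widen the gaps between weights, e.g. 10**score instead of the raw score. A minimal sketch along those lines (my illustration, assuming a score matrix S aligned with the eligibility matrix A; this is not the original model):

import itertools
import pulp

def opt_weighted(A, S, slots=6):
    """A: M x N binary eligibility matrix (ID i may take product j).
    S: M x N score matrix (e.g. 0 where not eligible)."""
    M, N = len(A), len(A[0])
    prob = pulp.LpProblem("optimiser", pulp.LpMaximize)
    x = pulp.LpVariable.dicts("x", itertools.product(range(M), range(N)),
                              cat=pulp.LpBinary)
    # Objective: total score of the chosen assignments, so high scores are preferred.
    prob += pulp.lpSum(S[i][j] * x[(i, j)] for i in range(M) for j in range(N))
    for j in range(N):
        # Each product ends up with at most one ID, and only where it is eligible.
        prob += pulp.lpSum(x[(i, j)] for i in range(M)) <= 1
        for i in range(M):
            prob += x[(i, j)] <= A[i][j]
    for i in range(M):
        # At most `slots` products per ID (the soft min/max constraints are omitted here).
        prob += pulp.lpSum(x[(i, j)] for j in range(N)) <= slots
    prob.solve()
    return [(i, j) for (i, j), var in x.items() if var.value() > 0.5]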
So:
Is this possible here, and any ideas how to get there?
Could I add further weighting variables if need be?
If it's not possible with ILP, any other suggestions? (hope that's not O/T)
I'd really appreciate any help here.