Pandas - Sort order by Last Page Viewed - python

Can anyone help me sort the order of last page viewed?
I have a dataframe where I am attempting to sort it by the previous page viewed and I am having a really hard time coming up with an efficient method using Pandas.
For example from this:
+------------+------------------+----------+
| Customer | previousPagePath | pagePath |
+------------+------------------+----------+
| 1051471580 | A | D |
| 1051471580 | C | B |
| 1051471580 | A | exit |
| 1051471580 | B | A |
| 1051471580 | D | A |
| 1051471580 | entrance | C |
+------------+------------------+----------+
To this:
+------------+------------------+----------+
| Customer | previousPagePath | pagePath |
+------------+------------------+----------+
| 1051471580 | entrance | C |
| 1051471580 | C | B |
| 1051471580 | B | A |
| 1051471580 | A | D |
| 1051471580 | D | A |
| 1051471580 | A | exit |
+------------+------------------+----------+
However it could be millions of rows long for thousands of different customers so I really need to think how to make this efficient.
pd.DataFrame({
'Customer':'1051471580',
'previousPagePath': ['E','C','B','A','D','A'],
'pagePath': ['C','B','A','D','A','F']
})
Thanks!

What you're trying to do is topological sorting, which can be achieved with networkx. Note that I had to change some values in your dataframe in order to prevent it throwing a cycle error, so I hope that the data you work on contains unique values:
import networkx as nx
import pandas as pd
data = [ [1051471580, "Z", "D"], [1051471580,"C","B" ], [1051471580,"A","exit" ], [1051471580,"B","Z" ], [1051471580,"D","A" ], [1051471580,"entrance","C" ] ]
df = pd.DataFrame(data, columns=['Customer', 'previousPagePath', 'pagePath'])
edges = df[df.pagePath != df.previousPagePath].reset_index()
dg = nx.from_pandas_edgelist(edges, source='previousPagePath', target='pagePath', create_using=nx.DiGraph())
order = list(nx.lexicographical_topological_sort(dg))
result = df.set_index('previousPagePath').loc[order[:-1], :].dropna().reset_index()
result = result[['Customer', 'previousPagePath', 'pagePath']]
Output:
| | Customer | previousPagePath | pagePath |
|---:|-----------:|:-------------------|:-----------|
| 0 | 1051471580 | entrance | C |
| 1 | 1051471580 | C | B |
| 2 | 1051471580 | B | Z |
| 3 | 1051471580 | Z | D |
| 4 | 1051471580 | D | A |
| 5 | 1051471580 | A | exit |

you can sort your DataFrame by column like that.
df = pd.DataFrame({'Customer':'1051471580','previousPagePath':['E','C','B','A','D','A'], 'pagePath':['C','B','A','D','A','F']})
df.sort_values(by='previousPagePath')
and you can find the document here pandas.DataFrame.sort_values

Related

print multi-index dataframe with tabulate

How can one print a multi-index Dataframe such as the one below:
import numpy as np
import tabulate
import pandas as pd
df = pd.DataFrame(np.random.randn(4, 3),
index=pd.MultiIndex.from_product([["foo", "bar"],
["one", "two"]]),
columns=list("ABC"))
so that the two levels of the Multindex show as separate columns, much the same way pandas itself prints it:
In [16]: df
Out[16]:
A B C
foo one -0.040337 0.653915 -0.359834
two 0.271542 1.328517 1.704389
bar one -1.246009 0.087229 0.039282
two -1.217514 0.721025 -0.017185
However, tabulate prints like this:
In [28]: print(tabulate.tabulate(df, tablefmt="github", headers="keys", showindex="always"))
| | A | B | C |
|----------------|------------|-----------|------------|
| ('foo', 'one') | -0.0403371 | 0.653915 | -0.359834 |
| ('foo', 'two') | 0.271542 | 1.32852 | 1.70439 |
| ('bar', 'one') | -1.24601 | 0.0872285 | 0.039282 |
| ('bar', 'two') | -1.21751 | 0.721025 | -0.0171852 |
MultiIndexes are represented by tuples internally, so tabulate is showing you the right thing.
If you want column-like display, the easiest is to reset_index first:
print(tabulate.tabulate(df.reset_index().rename(columns={'level_0':'', 'level_1': ''}), tablefmt="github", headers="keys", showindex=False))
Output:
| | | A | B | C |
|-----|-----|-----------|-----------|-----------|
| foo | one | -0.108977 | 2.03593 | 1.11258 |
| foo | two | 0.65117 | -1.48314 | 0.391379 |
| bar | one | -0.660148 | 1.34875 | -1.10848 |
| bar | two | 0.561418 | 0.762137 | 0.723432 |
Alternatively, you can rework the MultiIndex to a single index:
df2 = df.copy()
df2.index = df.index.map(lambda x: '|'.join(f'{e:>5} ' for e in x))
print(tabulate.tabulate(df2.rename_axis('index'), tablefmt="github", headers="keys", showindex="always"))
Output:
| index | A | B | C |
|------------|-----------|-----------|-----------|
| foo | one | -0.108977 | 2.03593 | 1.11258 |
| foo | two | 0.65117 | -1.48314 | 0.391379 |
| bar | one | -0.660148 | 1.34875 | -1.10848 |
| bar | two | 0.561418 | 0.762137 | 0.723432 |

Unstack (pivot?) dataframe in Pandas

I have a dataframe somewhat like this:
ID | Relationship | First Name | Last Name | DOB | Address | Phone
0 | 2 | Self | Vegeta | Saiyan | 01/01/1949 | Saiyan Planet | 123-456-7891
1 | 2 | Spouse | Bulma | Saiyan | 04/20/1969 | Saiyan Planet | 123-456-7891
2 | 3 | Self | Krilin | Human | 08/21/1992 | Planet Earth | 789-456-4321
3 | 4 | Self | Goku | Kakarot | 05/04/1975 | Planet Earth | 321-654-9870
4 | 4 | Child | Gohan | Kakarot | 04/02/2001 | Planet Earth | 321-654-9870
5 | 5 | Self | Freezer | Fridge | 09/15/1955 | Deep Space | 456-788-9568
I'm looking to have the rows with same ID appended to the right of the first row with that ID.
Example:
ID | Relationship | First Name | Last Name | DOB | Address | Phone | Spouse_First Name | Spouse_Last Name | Spouse_DOB | Child_First Name | Child_Last Name | Child_DOB |
0 | 2 | Self | Vegeta | Saiyan | 01/01/1949 | Saiyan Planet | 123-456-7891 | Bulma | Saiyan | 04/20/1969 | | |
1 | 3 | Self | Krilin | Human | 08/21/1992 | Planet Earth | 789-456-4321 | | | | | |
2 | 4 | Self | Goku | Kakarot | 05/04/1975 | Planet Earth | 321-654-9870 | | | | Gohan | Kakarot | 04/02/2001 |
3 | 5 | Self | Freezer | Fridge | 09/15/1955 | Deep Space | 456-788-9568 | | | | | |
My real scenario dataframe has more columns, but they all have the same information when the two rows share the same ID, so no need to duplicate those in the other rows. I only need to add to the right the columns that I choose, which in this case would be First Name, Last Name, DOB with the identifier for the new column label depending on what's on the 'Relationship' column (I can rename them later if it's not possible to do in a straight way, just wanted to illustrate my point.
Now that I've said this, I want to add that I have tried different ways and seems like approaching with unstack or pivot is the way to go but I have not been successful in making it work.
Any help would be greatly appreciated.
This solution assumes that the DataFrame is indexed by the ID column.
not_self = (
df.query("Relationship != 'Self'")
.pivot(columns='Relationship')
.swaplevel(axis=1)
.reindex(
pd.MultiIndex.from_product(
(
set(df['Relationship'].unique()) - {'Self'},
df.columns.to_series().drop('Relationship')
)
),
axis=1
)
)
not_self.columns = [' '.join((a, b)) for a, b in not_self.columns]
result = df.query("Relationship == 'Self'").join(not_self)
Please let me know if this is not what was wanted.

vlookup on text field using pandas

I need to use vlookup functionality in pandas.
DataFrame 2: (FEED_NAME has no duplicate rows)
+-----------+--------------------+---------------------+
| FEED_NAME | Name | Source |
+-----------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Effiel Tower router |
+-----------+--------------------+---------------------+
DataFrame 1:
+-------------+
| SYSTEM_NAME |
+-------------+
| DMSN |
| PCSUS |
| DAMJ |
| : |
| : |
+-------------+
DataFrame 1 contains lot more number of rows. It is acutally a column in much larger table. I need to merger df1 with df2 to make it look like:
+-------------+--------------------+---------------------+
| SYSTEM_NAME | Name | Source |
+-------------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main Hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Eiffel Tower router |
| : | | |
| : | | |
+-------------+--------------------+---------------------+
in excel I just would have done VLOOKUP(,,1,TRUE) and then copied the same across all cells.
I have tried various combinations with merge and join but I keep getting KeyError:'SYSTEM_NAME'
Code:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"})
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
You missed the inplace=True argument in the line _df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}) so the _df2 columns name haven't changed. Try this instead :
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}, inplace=True)
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()

Merge two spark dataframes based on a column

I have 2 dataframes which I need to merge based on a column (Employee code). Please note that the dataframe has about 75 columns, so I am providing a sample dataset to get some suggestions/sample solutions. I am using databricks, and the datasets are read from S3.
Following are my 2 dataframes:
DATAFRAME - 1
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | | | | | | | |
|-----------------------------------------------------------------------------------|
DATAFRAME - 2
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | | | | | C | | | | |
|B10001 | | | | | | | | |T2 |
|A10001 | | | | | | | | B | |
|A10001 | | | C | | | | | | |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
I need to merge the 2 dataframes based on EMP_CODE, basically join dataframe1 with dataframe2, based on emp_code. I am getting duplicate columns when i do a join, and I am looking for some help.
Expected final dataframe:
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | C | | C | | | B | |
|B10001 | | | | | | | | |T2 |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
There are 3 rows with emp_code A10001 in dataframe1, and 1 row in dataframe2. All data should be merged as one record without any duplicate columns.
Thanks much
you can use inner join
output = df1.join(df2,['EMP_CODE'],how='inner')
also you can apply distinct at the end to remove duplicates.
output = df1.join(df2,['EMP_CODE'],how='inner').distinct()
You can do that in scala if both dataframes have same columns by
output = df1.union(df2)
First you need to aggregate the individual dataframes.
from pyspark.sql import functions as F
df1 = df1.groupBy('EMP_CODE').agg(F.concat_ws(" ", F.collect_list(df1.COLUMN1)))
you have to write this for all columns and for all dataframes.
Then you'll have to use union function on all dataframes.
df1.union(df2)
and then repeat same aggregation on that union dataframe.
What you need is a union.
If both dataframes have the same number of columns and the columns that are to be "union-ed" are positionally the same (as in your example), this will work:
output = df1.union(df2).dropDuplicates()
If both dataframes have the same number of columns and the columns that need to be "union-ed" have the same name (as in your example as well), this would be better:
output = df1.unionByName(df2).dropDuplicates()

Maximizing a combination of a series of values

This is a complicated one, but I suspect there's some principle I can apply to make it simple - I just don't know what it is.
I need to parcel out presentation slots to a class full of students for the semester. There are multiple possible dates, and multiple presentation types. I conducted a survey where students could rank their interest in the different topics. What I'd like to do is get the best (or at least a good) distribution of presentation slots to students.
So, what I have:
List of 12 dates
List of 18 students
CSV file where each student (row) has a rating 1-5 for each date
What I'd like to get:
Each student should have one of presentation type A (intro), one of presentation type B (figures) and 3 of presentation type C (aims)
Each date should have at least 1 of each type of presentation
Each date should have no more than 2 of type A or type B
Try to give students presentations that they rated highly (4 or 5)
I should note that I realize this looks like a homework problem, but it's real life :-). I was thinking that I might make a Student class for each student that contains the dates for each presentation type, but I wasn't sure what the best way to populate it would be. Actually, I'm not even sure where to start.
TL;DR: I think you're giving your students too much choice :D
But I had a shot at this problem anyway. Pretty fun exercise actually, although some of the constraints were a little vague. Most of all, I had to guess what the actual students' preference distribution would look like. I went with uniformly distributed, independent variables, although that's probably not very realistic. Still I think it should work just as well on real data as it does on my randomly generated data.
I considered brute forcing it, but a rough analysis gave me an estimate of over 10^65 possible configurations. That's kind of a lot. And since we don't have a trillion trillion years to consider all of them, we'll need a heuristic approach.
Because of the size of the problem, I tried to avoid doing any backtracking. But this meant that you could get stuck; there might not be a solution where everyone only gets dates they gave 4's and 5's.
I ended up implementing a double-edged Iterative Deepening-like search, where both the best case we're still holding out hope for (i.e., assign students to a date they gave a 5) and the worst case scenario we're willing to accept (some student might have to live with a 3) are gradually lowered until a solution is found. If we get stuck, reset, lower expectations, and try again. Tasks A and B are assigned first, and C is done only after A and B are complete, because the constraints on C are far less stringent.
I also used a weighting factor to model the trade off between maximizing students happiness with satisfying the types-of-presentations-per-day limits.
Currently it seems to find a solution for pretty much every random generated set of preferences. I included an evaluation metric; the ratio between the sum of the preference values of all assigned student/date combos, and the sum of all student ideal/top 3 preference values. For example, if student X had two fives, one four and the rest threes on his list, and is assigned to one of his fives and two threes, he gets 5+3+3=11 but could ideally have gotten 5+5+4=14; he is 11/14 = 78.6% satisfied.
After some testing, it seems that my implementation tends to produce an average student satisfaction of around 95%, at lot better than I expected :) But again, that is with fake data. Real preferences are probably more clumped, and harder to satisfy.
Below is the core of the algorihtm. The full script is ~250 lines and a bit too long for here I think. Check it out at Github.
...
# Assign a date for a given task to each student,
# preferring a date that they like and is still free.
def fill(task, lowest_acceptable, spread_weight=0.1, tasks_to_spread="ABC"):
random_order = range(nStudents) # randomize student order, so everyone
random.shuffle(random_order) # has an equal chance to get their first pick
for i in random_order:
student = students[i]
if student.dates[task]: # student is already assigned for this task?
continue
# get available dates ordered by preference and how fully booked they are
preferred = get_favorite_day(student, lowest_acceptable,
spread_weight, tasks_to_spread)
for date_nr in preferred:
date = dates[date_nr]
if date.is_available(task, student.count, lowest_acceptable == 1):
date.set_student(task, student.count)
student.dates[task] = date
break
# attempt to "fill()" the schedule while gradually lowering expectations
start_at = 5
while start_at > 1:
lowest_acceptable = start_at
while lowest_acceptable > 0:
fill("A", lowest_acceptable, spread_weight, "AAB")
fill("B", lowest_acceptable, spread_weight, "ABB")
if lowest_acceptable == 1:
fill("C", lowest_acceptable, spread_weight_C, "C")
lowest_acceptable -= 1
And here is an example result as printed by the script:
Date
================================================================================
Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
================================================================================
1 | | A | B | | C | | | | | | | |
2 | | | | | A | | | | | B | C | |
3 | | | | | B | | | C | | A | | |
4 | | | | A | | C | | | | | | B |
5 | | | C | | | | A | B | | | | |
6 | | C | | | | | | | A | B | | |
7 | | | C | | | | | B | | | | A |
8 | | | A | | C | | B | | | | | |
9 | C | | | | | | | | A | | | B |
10 | A | B | | | | | | | C | | | |
11 | B | | | A | | C | | | | | | |
12 | | | | | | A | C | | | | B | |
13 | A | | | B | | | | | | | | C |
14 | | | | | B | | | | C | | A | |
15 | | | A | C | | B | | | | | | |
16 | | | | | | A | | | | C | B | |
17 | | A | | C | | | B | | | | | |
18 | | | | | | | C | A | B | | | |
================================================================================
Total student satisfaction: 250/261 = 95.00%

Categories

Resources