SQL-style outer join for two lists - python

I have data from a platform that records a user's events - whether answers to polls, or clickstream data. I am trying to bring together a number of related datasets, each of which has a SessionID column.
Each dataset began as a csv that was read in as a series of nested lists. Not every session will have a user answering a question, or completing certain actions, so each dataset will not contain an entry for every session -- however, every session exists in at least one of the datasets.
Assume there are 5 sessions recorded:
e.g. dataset 1:
| SessionID | a | b | c | d |
|-----------|---|---|---|---|
| 1         | x | x | x | x |
| 2         | x | x | x | x |
| 5         | x | x | x | x |
e.g. dataset 2:
| SessionID | e | f | g | h |
|-----------|---|---|---|---|
| 1         | x | x | x | x |
| 3         | x | x | x | x |
| 5         | x | x | x | x |
e.g. dataset 3:
| SessionID | i | j | k | l |
|-----------|---|---|---|---|
| 2         | x | x | x | x |
| 3         | x | x | x | x |
| 4         | x | x | x | x |
How would I construct this:
| SessionID | a | b | c | d | e | f | g | h | i | j | k | l |
|-----------|---|---|---|---|---|---|---|---|---|---|---|---|
| 1         | x | x | x | x | x | x | x | x | - | - | - | - |
| 2         | x | x | x | x | - | - | - | - | x | x | x | x |
| 3         | - | - | - | - | x | x | x | x | x | x | x | x |
| 4         | - | - | - | - | - | - | - | - | x | x | x | x |
| 5         | x | x | x | x | x | x | x | x | - | - | - | - |
By far the easiest way to do this is to import each csv into pandas:
merged_df = pd.merge(dataset1, dataset2, how='outer', on="SessionID")
merged_df = pd.merge(merged_df, dataset3, how='outer', on="SessionID")
However, the requirement is that I not use any external libraries.
I'm struggling to find workable logic to detect gaps in the SessionIDs and then pad the lists with null data so that the three lists could simply be combined.
Any ideas?

How do you define "external libraries"? Does sqlite3 qualify as external or internal?
If it doesn't, and you want to think about the problem in terms of relational operations, you could load your tables into a sqlite3 database (the sqlite3 module ships with the Python standard library) and take it from there.
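A minimal sketch of that idea, assuming each dataset is a nested list whose first row is the header (the names dataset1 and dataset2 are placeholders for your parsed csvs):

import sqlite3

# Assumes dataset1/dataset2 look like [["SessionID", "a", "b", "c", "d"], ["1", "x", "x", "x", "x"], ...]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE d1 (SessionID TEXT, a TEXT, b TEXT, c TEXT, d TEXT)")
con.execute("CREATE TABLE d2 (SessionID TEXT, e TEXT, f TEXT, g TEXT, h TEXT)")
con.executemany("INSERT INTO d1 VALUES (?, ?, ?, ?, ?)", dataset1[1:])
con.executemany("INSERT INTO d2 VALUES (?, ?, ?, ?, ?)", dataset2[1:])

# Older sqlite3 versions have no FULL OUTER JOIN, so emulate it with two LEFT JOINs and a UNION.
rows = con.execute("""
    SELECT SessionID, a, b, c, d, e, f, g, h FROM d1 LEFT JOIN d2 USING (SessionID)
    UNION
    SELECT SessionID, a, b, c, d, e, f, g, h FROM d2 LEFT JOIN d1 USING (SessionID)
""").fetchall()

The third dataset can be folded in the same way, with one more CREATE/INSERT and another join.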
If the number of datasets is finite, you could create a class Session containing a dictionary where each column (a to l) is a key. If you are proficient, you could use the __getattr__ method to get "dot" notation when you need it. For the "table", I would simply use a dictionary keyed by the session id, then fill it up in three steps (dataset1, dataset2, dataset3). That way you wouldn't have to worry about gaps.
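A pure-Python sketch of that dictionary idea (the names dataset1-dataset3 and the header-row layout are assumptions about how your csvs were parsed):

# Assumes each dataset is a nested list whose first row is the header,
# e.g. dataset1 = [["SessionID", "a", "b", "c", "d"], ["1", "x", "x", "x", "x"], ...]
datasets = [dataset1, dataset2, dataset3]

# Collect every column name (excluding SessionID) across the datasets.
all_columns = []
for ds in datasets:
    all_columns.extend(col for col in ds[0][1:] if col not in all_columns)

# One dict per session, pre-filled with None so gaps need no special handling.
sessions = {}
for ds in datasets:
    header = ds[0]
    for row in ds[1:]:
        record = sessions.setdefault(row[0], {col: None for col in all_columns})
        for col, value in zip(header[1:], row[1:]):
            record[col] = value

# Rebuild one merged nested list, sorted by SessionID.
merged = [["SessionID"] + all_columns]
for session_id in sorted(sessions):
    merged.append([session_id] + [sessions[session_id][col] for col in all_columns])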

Related

PySpark - How to group by rows and then map them using custom function

Let's say I have a table that looks like this:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1 | 2 | A | 1 |
| 1 | 4 | B | 1 |
| 2 | 3 | A | 2 |
| 2 | 1 | B | 3 |
I know that there are only A and B types for a given id. What I want to achieve is to group those two rows and compute a new type C using the formula A/B, applied to both value_one and value_two, so that the table afterwards looks like:
| id | value_one | type | value_two|
|----|-----------| -----|----------|
| 1 | 0.5 | C | 1 |
| 2 | 3 | C | 0.66 |
I am new to PySpark and so far haven't been able to achieve the described result; I would appreciate any tips or solutions.
You can divide the original dataframe into two parts according to type, and then use a SQL statement to implement the calculation logic.
# Register the A rows and the B rows as separate temp views.
df.filter('type = "A"').createOrReplaceTempView('tmp1')
df.filter('type = "B"').createOrReplaceTempView('tmp2')

# Join the two views on id and compute the A/B ratios.
sql = """
select
    tmp1.id
    ,tmp1.value_one / tmp2.value_one as value_one
    ,'C' as type
    ,tmp1.value_two / tmp2.value_two as value_two
from tmp1 join tmp2 using (id)
"""
result_df = spark.sql(sql)
result_df.show(truncate=False)

How to sum with condition in a Django queryset

I am trying to sum a Django query with a condition. Suppose I have some data like this:
| Name | Type |
|------|------|
| a    | x    |
| b    | z    |
| c    | x    |
| d    | x    |
| e    | y    |
| f    | x    |
| g    | x    |
| h    | y    |
| i    | x    |
| j    | x    |
| k    | x    |
| l    | x    |
These types are strings, and each has a value: x = 1, y = 25, z = -3.
How can I sum up all the values without a loop? Currently I am using a loop:
data = A.objects.all()
sum = 0
mapp = {'x': 1, 'y': 25, 'z': -3}
for datum in list(data):
    sum = sum + mapp[datum.type]
print(sum)
To perform the calculation inside the database, use Queryset.aggregate() with an aggregate expression that uses Case/When:
from django.db.models import Sum, Case, When, FloatField

A.objects.all().aggregate(
    amount=Sum(Case(
        When(type="x", then=1),
        When(type="y", then=25),
        When(type="z", then=-3),
        default=0,
        output_field=FloatField(),
    ))
)
You could shorthand it like:
sum(mapp.get(o.type, 0) for o in data)
or, more simply, if you trust the data to contain only valid types:
sum(mapp[o.type] for o in data)
(Don't trust the data, though.)
If the mapping x = 1, y = 25, etc. is coming from another table, you can use some SQL voodoo to let the database handle the summation for you. Otherwise you have to (one way or another) loop through all results and sum them up.
You could also theoretically just count the number of x, y and z rows in the table and compute sum = x*count_of_x + y*count_of_y + z*count_of_z, based on how the example is structured.
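A sketch of that counting idea, assuming the model A and its type field from the question (the database does the counting; Python only applies the weights):

from django.db.models import Count

# Count rows per type in the database, then weight the counts in Python.
mapp = {'x': 1, 'y': 25, 'z': -3}
counts = A.objects.values('type').annotate(n=Count('type'))
total = sum(mapp.get(row['type'], 0) * row['n'] for row in counts)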

How to extract coefficients from pydynpd package in python(system gmm)?

I am trying to run a Monte Carlo simulation on a model estimated by system GMM. Therefore, I need to extract the coefficients of my model from the prettytable output of the pydynpd package in Python (https://github.com/dazhwu/pydynpd). I am looking for a command/function that, like fit().params in statsmodels, returns the coefficients as an array.
Sorry. I just saw your question. You can always post your questions at
https://github.com/dazhwu/pydynpd/issues
For example, if you run:
import pandas as pd
from pydynpd import regression

df = pd.read_csv("data.csv")
mydpd = regression.abond('n L(1:2).n w k | gmm(n, 2:4) gmm(w, 1:3) iv(k) ', df, ['id', 'year'])
The output regression table will be
+------+------------+---------------------+------------+-----------+-----+
| n | coef. | Corrected Std. Err. | z | P>|z| | |
+------+------------+---------------------+------------+-----------+-----+
| L1.n | 0.9453810 | 0.1429764 | 6.6121470 | 0.0000000 | *** |
| L2.n | -0.0860069 | 0.1082318 | -0.7946553 | 0.4268140 | |
| w | -0.4477795 | 0.1521917 | -2.9422068 | 0.0032588 | ** |
| k | 0.1235808 | 0.0508836 | 2.4286941 | 0.0151533 | * |
| _con | 1.5630849 | 0.4993484 | 3.1302492 | 0.0017466 | ** |
+------+------------+---------------------+------------+-----------+-----+
If you want to programmatically extract a value, for example the first z value (6.6121470), you can add the following:
>>>mydpd.models[0].regression_table.iloc[0]['z_value']
6.6121469997085915
Basically, the object mydpd returned above contains models. By default, it only contains one model which is models[0]. A model has a regression table which is a pandas dataframe:
>>>mydpd.models[0].regression_table
variable coefficient std_err z_value p_value sig
0 L1.n 0.945381 0.142976 6.612147 3.787856e-11 ***
1 L2.n -0.086007 0.108232 -0.794655 4.268140e-01
2 w -0.447780 0.152192 -2.942207 3.258822e-03 **
3 k 0.123581 0.050884 2.428694 1.515331e-02 *
4 _con 1.563085 0.499348 3.130249 1.746581e-03 **
So you can extract any value from this dataframe.
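In particular, to get just the coefficients as an array (analogous to fit().params in statsmodels), one option is to pull the relevant column out of that dataframe, for example:

# regression_table is a plain pandas DataFrame, so the coefficient column
# can be extracted directly (column names as shown above).
coefs = mydpd.models[0].regression_table['coefficient'].to_numpy()

# Or keep the variable names attached:
named_coefs = mydpd.models[0].regression_table.set_index('variable')['coefficient']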

Generate buckets of a numerical variable using interquartile range in pandas

I have some data stored in my pandas dataframe that shows salary for a bunch of users and their category.
| category | user_id | salary |
|----------|-----------|--------|
| A | 546457568 | 49203 |
| C | 356835679 | 49694 |
| A | 356785637 | 48766 |
| B | 45668758 | 36627 |
| C | 686794 | 59508 |
| C | 234232376 | 32765 |
| C | 4356345 | 44058 |
| A | 9878987 | 9999999|
What I would like to do is generate a new column salary_bucket that shows a bucket for salary, determined from the upper/lower limits of the interquartile range for salary.
E.g. calculate the upper/lower limits as q1 - 1.5 x iqr and q3 + 1.5 x iqr, then split that range into 10 equal buckets and assign each row to the relevant bucket based on salary. I know from exploration that there is no data below the lower limit, but for data above the upper limit I would like a separate bucket such as outside_iqr.
In the end I would like to get something like this:
| category | user_id | salary | salary_bucket |
|----------|-----------|--------|---------------|
| A | 546457568 | 49203 | 7 |
| C | 356835679 | 49694 | 7 |
| A | 356785637 | 48766 | 7 |
| B | 45668758 | 36627 | 3 |
| C | 686794 | 59508 | 5 |
| C | 234232376 | 32765 | 3 |
| C | 4356345 | 44058 | 4 |
| A | 9878987 | 9999999 | outside_iqr |
(these bucket values aren't actually calculated; they're just for illustration's sake)
Is something like qcut useful here?
You can use pandas.cut to turn continuous data into categorical data.
# First, we need to calculate our IQR.
q1 = df.salary.quantile(0.25)
q3 = df.salary.quantile(0.75)
iqr = q3 - q1
# Now let's calculate upper and lower bounds.
lower = q1 - 1.5*iqr
upper = q3 + 1.5*iqr
# Let us create our bins:
num_bins = 10
bin_width = (upper - lower) / num_bins
bins = [lower + i*bin_width for i in range(num_bins)]
bins += [upper, float('inf')] # Now we add our last bin, which will contain any value greater than the upper-bound of the IQR.
# Let us create our labels:
labels = [f'Bucket {i}' for i in range(1,num_bins+1)]
labels.append('Outside IQR')
# Finally, we add a new column to the df:
df['salary_bucket'] = pd.cut(df.salary, bins=bins, labels=labels)
So basically, you'll need to generate your own list of buckets and labels according to what you require, and then pass those as arguments to pandas.cut.
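As a quick, self-contained check of those steps (the dataframe below is just the sample data from the question; the resulting bucket labels depend on the computed quartiles):

import pandas as pd

df = pd.DataFrame({
    "category": ["A", "C", "A", "B", "C", "C", "C", "A"],
    "user_id": [546457568, 356835679, 356785637, 45668758, 686794, 234232376, 4356345, 9878987],
    "salary": [49203, 49694, 48766, 36627, 59508, 32765, 44058, 9999999],
})

# IQR-based lower/upper limits.
q1, q3 = df.salary.quantile(0.25), df.salary.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Ten equal-width bins between the limits, plus a catch-all bin above the upper limit.
num_bins = 10
bin_width = (upper - lower) / num_bins
bins = [lower + i * bin_width for i in range(num_bins)] + [upper, float("inf")]
labels = [f"Bucket {i}" for i in range(1, num_bins + 1)] + ["Outside IQR"]

df["salary_bucket"] = pd.cut(df.salary, bins=bins, labels=labels)
print(df[["salary", "salary_bucket"]])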

Pandas inter-column referencing

I have some data as follows:
+--------+------+
| Reason | Keys |
+--------+------+
| x | a |
| y | a |
| z | a |
| y | b |
| z | b |
| x | c |
| w | d |
| x | d |
| w | d |
+--------+------+
I want to get the Reason corresponding to the first occurrence of each Key. Here, I should get Reasons x, y, x, w for Keys a, b, c, d respectively. After that, I want to compute the percentage of each of those Reasons, i.e. a metric for how often each Reason occurs: x = 2/4 = 50%, and w and y = 25% each.
For the percentage, I think I can use something like value_counts(normalize=True) * 100, based on the previous step. What is a good way to proceed?
You are right about the second step and the first step could be achieved by
summary = df.groupby("Keys").first()
You can use drop_duplicates, keeping the first row for each Key:
df.drop_duplicates(['Keys'])
Out[207]:
  Reason Keys
0      x    a
3      y    b
5      x    c
6      w    d
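Putting both steps together, a minimal sketch using the sample data from the question (column names Reason and Keys as above):

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'Reason': ['x', 'y', 'z', 'y', 'z', 'x', 'w', 'x', 'w'],
    'Keys':   ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
})

# First Reason per Key (x, y, x, w for a, b, c, d)...
first_reasons = df.drop_duplicates(['Keys'])['Reason']

# ...then the percentage share of each of those Reasons: x 50%, y 25%, w 25%.
percentages = first_reasons.value_counts(normalize=True) * 100
print(percentages)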
