I am currently working on understanding PySpark and am running into a problem. I am attempting to figure out how to order a dataframe by multiple columns when one of them is a count.
As an example, say I have a dataframe (df) with three columns, A, B, and C. I want to group by A and B and then count the instances. So if there are 10 instances where A=1 and B=1, the row of the table should look like:
A | B | Count
1 | 1 | 10
I have determined that I can do this fairly easily by running:
df.groupBy('A', 'B').count()
Then if I want to order this dataframe by count (descending), this is also pretty straightforward:
df.groupBy('A', 'B').count().orderBy(desc("count"))
This next step is where I am having trouble. What if I now also want to order by column C, i.e., order first by count and then by C? I had thought that the syntax would be something akin to:
df.groupBy('A', 'B').count().orderBy(desc("count"), desc("C"))
But this does not work, presumably because once I run count(), the dataframe is limited to only the columns A, B, and count. Do I need to somehow create a new column in the original dataframe with the count column, and if so, how can I do this?
Is there another simpler way that I am missing to order by both count and C?
For clarity, an example dataframe that I would like to end up with could appear as:
A | B | Count | C
1 | 1 | 10    | 5
1 | 2 | 9     | 3
1 | 5 | 9     | 1
2 | 4 | 8     | 10
2 | 7 | 8     | 5
Any insights or guidance are greatly appreciated.
Try using a window function. The column 'C' is not in the group by, hence it is not available for ordering/sorting. If you just want the grouped columns (A, B) and the count column, you can always use a select statement to get just those after the window function.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("A", "B")
# keep C in the select so it is still available for ordering
df.withColumn('count', F.count('*').over(windowSpec)).select("A", "B", "count", "C").distinct().orderBy(F.col('count').desc(), F.col('C').desc()).show()
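If each (A, B) pair really has only one C value, or you are happy to pick one per group, a plain aggregation avoids the window entirely. A minimal sketch, assuming F.max is an acceptable way to collapse C:

from pyspark.sql import functions as F

(df.groupBy('A', 'B')
   .agg(F.count('*').alias('count'), F.max('C').alias('C'))
   .orderBy(F.col('count').desc(), F.col('C').desc())
   .show())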
I have 3 columns I want to create using Python. The most important is Column C. What I want to happen is that Row 1 in Column C starts at 2, like in Column A. Then it adds Row 2 from Column A into Column C (2+3) to get the next number in Column C, 5. This process repeats until it reaches the last number. Is this possible to do in Python? I can't use Excel because I was told to do it in Python, but I don't know how. I know how to create and enter the inputs for Columns A and B. Does anyone know how to do this?
j_t = list(map(int, input("Column A: ").split(",")))
d_t = list(map(int, input("Column B: ").split(",")))
# to store the job time and date in dictionaries
dict_spt = {}
dict_edd = {}
for i in range(len(j_t)):
    dict_spt[int(j_t[i])] = int(d_t[i])
    dict_edd[int(d_t[i])] = int(j_t[i])
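The running total in Column C is just a cumulative sum of Column A, so itertools.accumulate can build it. A minimal sketch, reusing the comma-separated input style from the question (the name column_c is made up for illustration):

from itertools import accumulate

j_t = list(map(int, input("Column A: ").split(",")))

# Column C is the running total of Column A: 2, 2+3, 2+3+..., and so on
column_c = list(accumulate(j_t))
print(column_c)  # e.g. input "2,3,4" -> [2, 5, 9]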
I am trying to select the name rows with count > 250, which are called "effective" here, and then find the mean of their ratings.
t3=dfnew.groupby('name')['ratings']
t4=t3.count()
t5=t4[t4.values>250]
t6=t3.mean()
t6[(t6.index==t5.index)]
Obviously the problem is in the last line of my code, where I want to match t6's index with t5's index. If they match, keep the row; otherwise leave it out. It is kind of like an inner join in SQL.
What should I do to modify the last line?
Suppose the dataframe looks like this:
input:
name ratings
A 1
A 2
:
A 251
B 1
B 2
:
B 230
so the intended result should be 126 ((1+251)/2)
Output
A 126
t3=dfnew.groupby('name')['ratings'].agg(['count','mean'])
t5=t3[t3['count']>250]
t5
It works fine when I aggregate two functions at the same time.
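If you prefer to keep the original two-step approach, the last line can be changed to an isin filter on the index. A minimal sketch, reusing the variable names from the question:

t3 = dfnew.groupby('name')['ratings']
t4 = t3.count()
t5 = t4[t4 > 250]            # names with more than 250 ratings
t6 = t3.mean()
t6[t6.index.isin(t5.index)]  # keep only those 'effective' names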
I am attempting to generate a dataframe (or series) based on another dataframe, selecting a different column from the first frame depending on the row, using another series. In the simplified example below, I want the frame1 values from 'a' for the first three rows and from 'b' for the final two (the picked_values series).
import numpy as np
import pandas as pd
frame1 = pd.DataFrame(np.random.randn(10).reshape(5, 2), index=range(5), columns=['a', 'b'])
picked_values = pd.Series(['a', 'a', 'a', 'b', 'b'])
frame1
a b
0 0.283519 1.462209
1 -0.352342 1.254098
2 0.731701 0.236017
3 0.022217 -1.469342
4 0.386000 -0.706614
Trying to get to the series:
0 0.283519
1 -0.352342
2 0.731701
3 -1.469342
4 -0.706614
I was hoping frame1[picked_values] would work, but this ends up with five columns.
In the real-life example, picked_values is a lot larger and calculated.
Thank you for your time.
Use df.lookup
pd.Series(frame1.lookup(picked_values.index,picked_values))
0 0.283519
1 -0.352342
2 0.731701
3 -1.469342
4 -0.706614
dtype: float64
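On newer pandas versions where DataFrame.lookup is no longer available, the same result can be rebuilt with plain NumPy indexing. A minimal sketch using Index.get_indexer (it assumes every entry of picked_values is an existing column label):

import numpy as np
import pandas as pd

rows = np.arange(len(frame1))
cols = frame1.columns.get_indexer(picked_values)  # positional index of each picked column
pd.Series(frame1.to_numpy()[rows, cols], index=frame1.index)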
Here's a NumPy-based approach using integer indexing and Index.searchsorted (this assumes the column labels are sorted):
frame1.values[frame1.index, frame1.columns.searchsorted(picked_values.values)]
# array([0.22095278, 0.86200616, 1.88047197, 0.49816937, 0.10962954])
I have two variables (columns) that are related: one represents the name of a person, the other counts the times this person works out in a week. The problem is about visualizing that data.
When I want to see the data, it looks like this:
x name wrk
0 0 E 1
1 1 A 2
2 2 B 5
3 3 A 3
4 4 C 6
Now, the letters are repeated for each time that person appears against the variable "wrk". I just want to see that letter without repetitions. For example, when I look at the mean for every person, I see one letter and its mean of "wrk":
wrk
name
A 4.625000
B 5.142857
C 5.400000
D 3.833333
E 4.785714
I just want to see every value in wrk and only one letter in name, so I thought the solution is to transform wrk into a list so that the output looks like this:
work
name
A 1:2:3:5:7:8:10
B 1:2:4:7:8
C 1:6:9
D 1:2:3:5:7:8:10
E 1:2:3:5:7:8:10
The thing is, I've searched how to do this but I haven't found the code that helps me do it. Can someone help me?
(Sorry for my English, I'm learning.)
Perhaps this?
df['wrk'] = df['wrk'].astype(str)
df = df.groupby('name')[['wrk']].agg(':'.join)
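As a quick check on the five sample rows shown in the question (a sketch; the full dataset would of course give longer strings):

import pandas as pd

df = pd.DataFrame({'name': ['E', 'A', 'B', 'A', 'C'],
                   'wrk':  [1, 2, 5, 3, 6]})
df['wrk'] = df['wrk'].astype(str)
print(df.groupby('name')[['wrk']].agg(':'.join))
#      wrk
# name
# A    2:3
# B      5
# C      6
# E      1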
I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and used a "higher" point as a reference for where the scan starts. I am trying to find the placed object throughout my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and gives out columns with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives out this:
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-indexes (I now have 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby; since base is a Series of row means, sort it directly (no column name is needed):
moy = base.sort_values().tail(1)
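If only the single row with the highest mean is needed, idxmax on the Series of means avoids sorting altogether. A minimal sketch, assuming base = df.mean(axis=1) as in the question:

best = base.idxmax()  # index label of the row with the largest mean
df.loc[[best]]        # the full original row from df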
It looks as though your data is a string, or a single column with a space between your two numbers. I suggest splitting the column into two and/or using something similar to the below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter=r"\s+")
df = df.set_index("Index")
print(df)