Pyspark Dataframe pivot and groupby count - python

I am working with a PySpark DataFrame that looks like this:
id  category
1   A
1   A
1   B
2   B
2   A
3   B
3   B
3   B
I want to unstack the category column and count the occurrences of each value. The result I want is shown below:
id  A     B
1   2     1
2   1     1
3   Null  3
I tried finding something on the internet that can help me but I couldn't find anything that could give me this specific result.

Short version; you don't have to do multiple groupBy calls:
df.groupBy("id").pivot("category").count().show()

Try this (not sure it's optimized):
df = spark.createDataFrame([(1,'A'),(1,'A'),(1,'B'),(2,'B'),(2,'A'),(3,'B'),(3,'B'),(3,'B')],['id','category'])
df = df.groupBy('id','category').count()
df.groupBy('id').pivot('category').sum('count').show()

Related

How to pivot survey/questionnaire data and count multiple option questions?

I'm trying to pivot a survey with many questions all sharing the same levels.
Basically I want to pivot from this:
Customer  Atribute 1  Atribute 2  Atribute 3
1         A           B           A
2         B           B           A
3         C           B           C
To this:
Product  Atribute 1  Atribute 2  Atribute 3
A        1           0           2
B        1           3           0
C        1           0           1
In my real data I have dozens of columns and levels (A, B, C...Z) and hundreds of customers.
I was once able to do this in R but it was years ago with an overcomplicated algorithm. I'm wondering if Python/pandas has an easy fix for this.
Assuming your DataFrame is named df, you can do it this way. The Customer column is dropped first so that its values are not counted as extra rows:
import pandas as pd
df.drop("Customer", axis=1).apply(pd.Series.value_counts).fillna(0).astype(int).rename_axis("Product")
this gives me:
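For a self-contained check, here is a minimal sketch that rebuilds the sample table from the question and applies the same value_counts idea. The DataFrame contents are copied from the question; the name counts and the extra sort_index() call are assumptions added only to make the row order explicit.

```python
import pandas as pd

# Sample survey data copied from the question
df = pd.DataFrame({
    "Customer": [1, 2, 3],
    "Atribute 1": ["A", "B", "C"],
    "Atribute 2": ["B", "B", "B"],
    "Atribute 3": ["A", "A", "C"],
})

counts = (
    df.drop(columns="Customer")        # keep only the answer columns
      .apply(pd.Series.value_counts)   # count each level per column
      .fillna(0)                       # levels missing from a column count as 0
      .astype(int)
      .sort_index()                    # assumption: order the levels alphabetically
      .rename_axis("Product")
)
print(counts)
```

With this sample data, the rows of counts should match the target table in the question.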

Dropping multiple columns in a pandas dataframe between two columns based on column names

A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number because I do not know the numbers. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with a column range. For example, if you have this dataframe:
   A  B  C  D  E
0  1  3  3  6  0
1  2  2  4  9  1
2  3  1  5  8  4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
   A  E
0  1  0
1  2  1
2  3  4
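The same idea can be wrapped in a small helper. This is only a sketch: drop_between is a hypothetical name, not a pandas function, and it assumes the column labels are unique. Label slices with .loc include both endpoints, so the first and last named columns are dropped too.

```python
import pandas as pd

def drop_between(frame, first, last):
    """Return a copy of frame without the columns from first to last, inclusive.

    Hypothetical helper; relies on .loc label slicing, which includes both ends.
    """
    to_drop = frame.loc[:, first:last].columns
    return frame.drop(columns=to_drop)

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [3, 2, 1],
    "C": [3, 4, 5],
    "D": [6, 9, 8],
    "E": [0, 1, 4],
})

result = drop_between(df, "B", "D")
print(result.columns.tolist())
```

To keep the two boundary columns and drop only what lies strictly between them, one option is to slice the resulting column index with to_drop[1:-1] before dropping.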

Getting maximum values in a column

My dataframe looks like this:
Country Code Duration
A 1 0
A 1 1
A 1 2
A 1 3
A 2 0
A 2 1
A 1 0
A 1 1
A 1 2
I need to get max values from a "Duration" column - not just a maximum value, but a list of maximum values for each sequence of numbers in this column. The output might look like this:
Country Code Duration
A 1 3
A 2 1
A 1 2
I could have grouped by "Code", but its values are often repeating, so that's probably not an option. Any help or tips would be much appreciated.
Use idxmax after creating an additional group key with diff and cumsum:
df.loc[df.groupby([df.Country,df.Code.diff().ne(0).cumsum()]).Duration.idxmax()]
  Country  Code  Duration
3       A     1         3
5       A     2         1
8       A     1         2
First we create a mask to mark the sequences. Then we groupby to create the wanted output:
m = (~df['Code'].eq(df['Code'].shift())).cumsum()
df.groupby(m).agg({'Country':'first',
                   'Code':'first',
                   'Duration':'max'}).reset_index(drop=True)
  Country  Code  Duration
0       A     1         3
1       A     2         1
2       A     1         2
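Both answers above rest on the same trick: a new run starts wherever Code differs from the previous row, and a cumulative sum of those change points gives each run its own id. A minimal self-contained sketch, assuming the sample data from the question (the names run_id and run are illustrative):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Country": ["A"] * 9,
    "Code": [1, 1, 1, 1, 2, 2, 1, 1, 1],
    "Duration": [0, 1, 2, 3, 0, 1, 0, 1, 2],
})

# True where Code changes from the previous row; cumsum turns that into run ids
run_id = df["Code"].ne(df["Code"].shift()).cumsum().rename("run")

result = (
    df.groupby(["Country", run_id], sort=False)
      .agg(Code=("Code", "first"), Duration=("Duration", "max"))
      .reset_index()
      .drop(columns="run")   # the run id is only a helper key
)
print(result)
```

Grouping on the run id rather than on Code itself is what keeps the two separate runs of Code 1 apart.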
The problem is slightly unclear. However, assuming that order is important, we can move toward a solution.
import pandas as pd
d = pd.read_csv('data.csv')
s = d.Code
d['series'] = s.ne(s.shift()).cumsum()
print(pd.DataFrame(d.groupby(['Country','Code','series'])['Duration'].max().reset_index()))
Returns:
  Country  Code  series  Duration
0       A     1       1         3
1       A     1       3         2
2       A     2       2         1
You can then drop the series column.
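You might want to check the question "pandas groupby where you get the max of one column and the min of another column"; it might be the answer you're looking for. It goes like this:
result = df.groupby(['Code', 'Country']).agg({'Duration':'max'})[['Duration']].reset_index()
Note that this groups all rows with the same Code together, so repeated runs of the same Code are merged into a single row.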

accessing Groupby Sum results [duplicate]

I have a dataframe with 2 index levels:
                   value
Trial measurement
1     0               13
      1                3
      2                4
2     0              NaN
      1               12
3     0               34
Which I want to turn into this:
Trial  measurement  value
1      0            13
1      1            3
1      2            4
2      0            NaN
2      1            12
3      0            34
How can I best do this?
I need this because I want to aggregate the data as instructed here, but I can't select my columns like that if they are in use as indices.
reset_index() is a pandas DataFrame method that moves the index values into the DataFrame as columns. Its drop parameter defaults to drop=False, which keeps the index values as columns.
All you have to do is call .reset_index() on the DataFrame:
df = df.reset_index()
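As a minimal sketch of what this does, the following rebuilds the two-level frame from the question (the construction code is an assumption; only the values and level names come from the question) and flattens it:

```python
import numpy as np
import pandas as pd

# Two-level index with the names and values shown in the question
index = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (3, 0)],
    names=["Trial", "measurement"],
)
df = pd.DataFrame({"value": [13, 3, 4, np.nan, 12, 34]}, index=index)

# Both index levels become ordinary columns; a default integer index replaces them
flat = df.reset_index()
print(flat.columns.tolist())
print(flat)
```

After this call, Trial and measurement can be selected like any other column, which is what the aggregation step in the question needs.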
This doesn't really apply to your case but could be helpful for others (like myself 5 minutes ago) to know. If the levels of a MultiIndex share the same name, like this:
             value
Trial Trial
1     0         13
      1          3
      2          4
2     0        NaN
      1         12
3     0         34
df.reset_index(inplace=True) will fail, because the columns that would be created cannot have the same name.
So you need to rename the MultiIndex levels with df.index = df.index.set_names(['Trial', 'measurement']) to get:
                   value
Trial measurement
1     0               13
1     1                3
1     2                4
2     0              NaN
2     1               12
3     0               34
And then df.reset_index(inplace=True) will work like a charm.
I encountered this problem after grouping by year and month on a datetime column (not the index) called live_date, which meant that both the year and month levels were named live_date.
There may be situations when df.reset_index() cannot be used (e.g., when you need the index, too). In this case, use index.get_level_values() to access index values directly:
df['Trial'] = df.index.get_level_values(0)
df['measurement'] = df.index.get_level_values(1)
This will assign index values to individual columns and keep the index.
See the docs for further info.
As @cs95 mentioned in a comment, to move only some levels out of the index, use:
df.reset_index(level=[...])
This avoids having to redefine your desired index after reset.
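A short sketch of a partial reset, assuming the same two-level frame as in the question (the construction code and the name partial are illustrative):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (3, 0)],
    names=["Trial", "measurement"],
)
df = pd.DataFrame({"value": [13, 3, 4, np.nan, 12, 34]}, index=index)

# Move only the "measurement" level into a column; "Trial" stays as the index
partial = df.reset_index(level="measurement")
print(partial.index.name)
print(partial.columns.tolist())
```

The level argument also accepts a list of level names or positions when more than one level should be moved.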
I ran into Karl's issue as well. I just found myself renaming the aggregated column then resetting the index.
df = pd.DataFrame(df.groupby(['arms', 'success'])['success'].sum()).rename(columns={'success':'sum'})
df = df.reset_index()
Short and simple
df2 = pd.DataFrame({'test_col': df['test_col'].describe()})
df2 = df2.reset_index()
A solution that might be helpful in cases when not every column has multiple index levels:
df.columns = df.columns.map(''.join)
Similar to Alex's solution, but in a more general form. It leaves the index untouched and adds each index level as a new column under the level's name.
for i in df.index.names:
    df[i] = df.index.get_level_values(i)
which gives
                   value  Trial  measurement
Trial measurement
1     0               13      1            0
      1                3      1            1
      2                4      1            2
...

How to drop designate row of pandas dataframe

I'm new to Python, so sorry for any mistakes; I hope you can understand me.
I have a problem much like dropping duplicate rows, but here I treat 1,2 as the same as 2,1, and there are no exact duplicates in the pandas dataframe. For example, I have df as:
first second
1 2
2 1
2 4
4 2
and I need df to eventually become:
first second
1 2
2 4
How can I tackle this problem?
Thanks in advance.
Update:
Here is another problem: the dataframe has 1,860,000 rows, so using this method raises a memory error. Is there a way to handle this?
You can use apply with sorted and then drop_duplicates:
print (df.apply(sorted, axis=1))
   first  second
0      1       2
1      1       2
2      2       4
3      2       4
df = df.apply(sorted, axis=1).drop_duplicates()
print (df)
   first  second
0      1       2
2      2       4
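For the large dataframe mentioned in the update, a row-wise apply can be slow and memory-hungry. One alternative, offered as a sketch rather than a guaranteed fix for the memory error, is to sort each row with NumPy in a single vectorized call and then drop duplicates. It assumes both columns hold comparable values of the same type; the name sorted_pairs is illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"first": [1, 2, 2, 4], "second": [2, 1, 4, 2]})

# Sort the values within each row so that (2, 1) and (1, 2) look identical
sorted_pairs = pd.DataFrame(
    np.sort(df.to_numpy(), axis=1),
    index=df.index,
    columns=df.columns,
)

# Keep the first occurrence of each unordered pair
result = sorted_pairs.drop_duplicates()
print(result)
```

If the original, unsorted values of the surviving rows are needed instead, df.loc[result.index] selects them by the retained index labels.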
