I'm trying to pivot a survey with many questions all sharing the same levels.
Basically I want to pivot from this:
Customer  Attribute 1  Attribute 2  Attribute 3
1         A            B            A
2         B            B            A
3         C            B            C
To this:
Product  Attribute 1  Attribute 2  Attribute 3
A        1            0            2
B        1            3            0
C        1            0            1
In my real data I have dozens of columns and levels (A, B, C...Z) and hundreds of customers.
I was once able to do this in R but it was years ago with an overcomplicated algorithm. I'm wondering if Python/pandas has an easy fix for this.
Assuming your DataFrame is named df, you can do it this way:
import pandas as pd
df.drop("Customer", axis=1).apply(pd.Series.value_counts).fillna(0).astype(int).rename_axis("Product")
This gives:
         Attribute 1  Attribute 2  Attribute 3
Product
A                  1            0            2
B                  1            3            0
C                  1            0            1
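Another option, if you prefer working in long format, is to melt the attribute columns and let pd.crosstab do the counting. This is a minimal sketch assuming the column names shown above:
import pandas as pd
# reshape to long format: one row per (Customer, Attribute, Product) observation
melted = df.melt(id_vars="Customer", var_name="Attribute", value_name="Product")
# cross-tabulate product levels against attribute columns; missing combinations become 0
result = pd.crosstab(melted["Product"], melted["Attribute"])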
A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number because I do not know the numbers. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with a column range. For example, if you have this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4
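If you prefer to work with positions instead of label slicing, an equivalent sketch using get_loc (same boundary names "B" and "D") is:
# find the positional boundaries of the name range, then drop that slice of columns
start = df.columns.get_loc("B")
stop = df.columns.get_loc("D")
df = df.drop(columns=df.columns[start:stop + 1])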
I am working on a PySpark dataframe which looks like the one below:
id  category
1   A
1   A
1   B
2   B
2   A
3   B
3   B
3   B
I want to unstack the category column and count the occurrences. So, the result I want is shown below:
id  A     B
1   2     1
2   1     1
3   Null  3
I tried finding something on the internet that can help me but I couldn't find anything that could give me this specific result.
Short version: you don't have to do multiple groupBy's:
df.groupBy("id").pivot("category").count().show()
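If you would rather see 0 than null for the missing (id, category) combinations, a small variation on the line above fills them in (assuming df still holds the original id/category rows):
# pivot() leaves null where a category never occurs for an id; replace with 0
df.groupBy("id").pivot("category").count().na.fill(0).show()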
Try this -- (not sure it's optimized):
df = spark.createDataFrame([(1,'A'),(1,'A'),(1,'B'),(2,'B'),(2,'A'),(3,'B'),(3,'B'),(3,'B')],['id','category'])
df = df.groupBy('id','category').count()
df.groupBy('id').pivot('category').sum('count').show()
Suppose I have a set of data with two labels, put in a pandas DataFrame:
label1 label2
0 0 a
1 1 a
2 1 a
3 1 a
4 1 a
5 2 b
6 0 b
7 1 b
8 2 b
9 0 b
10 2 c
11 1 c
12 2 c
13 0 c
14 2 c
Using the following code, the number of elements for each combination of labels can be obtained:
grouped = df.groupby(['label1', 'label2'], sort = False)
grouped.size()
The result is something like this:
label1 label2
0 a 1
1 a 4
2 b 2
0 b 2
1 b 1
2 c 3
1 c 1
0 c 1
dtype: int64
However, I would also like to compare the distribution of counts for label2 within each label1 group. The most convenient way to manipulate the data for this purpose would be a DataFrame (or some sort of table) with label1 as rows, label2 as columns, and the counts as values, like this:
a b c
0 1 2 1
1 4 1 1
2 0 2 3
After a while of searching, to my surprise, there seems to be no easy way to do this kind of DataFrame reshaping in pandas.
Using a loop is possible, but I assume it would be very slow, since the real data contains hundreds of thousands of different labels.
Moreover, there seems to be no way to get a group by label1 alone after grouping by both label1 and label2, so the loop would have to run over the combinations of labels, which might make things even slower and more complicated.
Anyone knows a smart way to do this?
Probably crosstab:
pd.crosstab(df.label1, df.label2)
Are you looking for pd.pivot_table?
df.pivot_table(index='label1', columns='label2', aggfunc='size').fillna(0)
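Since you already have the groupby result, a third option is to unstack it directly and fill the missing combinations with 0 (a sketch building on your own groupby code):
# counts per (label1, label2) as a Series, then reshape label2 into columns
df.groupby(['label1', 'label2']).size().unstack(fill_value=0)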
Let's say I have df1:
m n
0 a 1
1 b 2
2 c 3
3 d 4
and df2:
n k
0 1 z
1 2 g
I just want to get the piece of df1 where the values of column 'n' are the same as those present in df2:
m n
0 a 1
1 b 2
What's the best way to do this? It seemed trivial beforehand, but surprisingly nothing I tried worked. For example, I tried:
df1[df1["n"] == df2["n"]]
but this gave me
ValueError: Can only compare identically-labeled Series objects
You are looking for isin:
df1.loc[df1.n.isin(df2.n), :]
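An equivalent sketch using a join, which also keeps only the rows of df1 whose n appears in df2 (drop_duplicates guards against repeated keys in df2):
# inner merge on 'n' keeps only the matching rows of df1
df1.merge(df2[['n']].drop_duplicates(), on='n')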
How does one count the number of unique values that occur within each column of a pandas DataFrame?
Say I have a data frame named df that looks like this:
1 2 3 4
a yes f c
b no f e
c yes d h
I want output that shows the number of unique values within each of the four columns. The output would be something similar to this:
Column # of Unique Values
1 3
2 2
3 2
4 3
I don't need to know what the unique values are, just how many there are within each column.
I have played around with something like this:
df[all_cols].value_counts()
where all_cols is a list of all the columns in the data frame. But this counts how many times each value appears within the columns.
Any advice/suggestions would be a great help. Thanks
You could apply Series.nunique:
>>> df.apply(pd.Series.nunique)
1 3
2 2
3 2
4 3
dtype: int64
Or you could do a groupby/nunique on the unstacked version of the frame:
>>> df.unstack().groupby(level=0).nunique()
1 3
2 2
3 2
4 3
dtype: int64
Both of these produce a Series, which you could then use to build a frame with whatever column names you wanted.
You could try df.nunique()
>>> df.nunique()
1 3
2 2
3 2
4 3
dtype: int64
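If you also want the exact two-column layout from the question, you can turn that Series into a frame (a sketch, assuming the header names shown above):
# convert the Series of per-column counts into the requested two-column table
df.nunique().rename_axis('Column').reset_index(name='# of Unique Values')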