pivot_table with group and without value field - python

I have a pandas DataFrame called url like this:
location dom_category
3 'edu'
3 'gov'
3 'edu'
4 'org'
4 'others'
4 'org'
and I want this data frame to look like this:
location edu gov org others
3 2 1 0 0
4 0 0 2 1
The edu, gov, org and others columns contain the count for the specific location.
I have written the code below, but I know it is not optimal:
import numpy as np

url['val'] = 1
url_final = url.pivot_table(index=['location'], values='val',
                            columns=['dom_category'], aggfunc=np.sum)

First, if necessary, remove the quote characters with str.strip.
Then use groupby with size aggregation and reshape with unstack:
df['dom_category'] = df['dom_category'].str.strip("\'")
df = df.groupby(['location','dom_category']).size().unstack(fill_value=0)
print (df)
dom_category edu gov org others
location
3 2 1 0 0
4 0 0 2 1
Or use pivot_table:
df['dom_category'] = df['dom_category'].str.strip("\'")
df=df.pivot_table(index='location',columns='dom_category',aggfunc='size', fill_value=0)
print (df)
dom_category edu gov org others
location
3 2 1 0 0
4 0 0 2 1
Finally, you can convert the index to a column and remove the dom_category column-axis name with reset_index + rename_axis:
df = df.reset_index().rename_axis(None, axis=1)
print (df)
location edu gov org others
0 3 2 1 0 0
1 4 0 0 2 1

Using groupby and value_counts
Housekeeping: strip the quote characters:
df.dom_category = df.dom_category.str.strip("'")
Rest of Solution
df.groupby('location').dom_category.value_counts().unstack(fill_value=0)
dom_category edu gov org others
location
3 2 1 0 0
4 0 0 2 1
To get the formatting just right
df.groupby('location').dom_category.value_counts().unstack(fill_value=0) \
  .reset_index().rename_axis(None, axis=1)
location edu gov org others
0 3 2 1 0 0
1 4 0 0 2 1

Let's use str.strip, get_dummies and groupby:
df['dom_category'] = df.dom_category.str.strip("\'")
df.assign(**df.dom_category.str.get_dummies()).groupby('location').sum().reset_index()
Output:
location edu gov org others
0 3 2 1 0 0
1 4 0 0 2 1
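Another option (a minimal sketch, not from the answers above, assuming the same url frame as in the question) is pd.crosstab, which computes the counts directly without a helper column:
import pandas as pd

# Reconstruct the question's frame (values taken from the example above)
url = pd.DataFrame({'location': [3, 3, 3, 4, 4, 4],
                    'dom_category': ['edu', 'gov', 'edu', 'org', 'others', 'org']})

# crosstab counts occurrences of each dom_category per location
url_final = pd.crosstab(url['location'], url['dom_category'])
print(url_final.reset_index().rename_axis(None, axis=1))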

Related

Combine values in pandas dataframe to string

I have a dataframe similar to this:
Male Over18 Single
0 0 0 1
1 1 1 1
2 0 0 1
I would like an extra column that gets a comma-separated string of the column names where the value is 1:
Male Over18 Single CombinedString
0 0 0 1 Single
1 1 1 1 Male, Over18, Single
2 0 0 1 Single
Hope there is someone out there who can help :)
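For reference, the example frame used by the answers below can be reconstructed like this (a minimal sketch):
import pandas as pd

df = pd.DataFrame({'Male':   [0, 1, 0],
                   'Over18': [0, 1, 0],
                   'Single': [1, 1, 1]})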
One idiomatic pandas way is to perform a dot product with the column headers:
df['CombinedString'] = df.dot(df.columns+',').str.rstrip(',')
df
Male Over18 Single CombinedString
0 0 0 1 Single
1 1 1 1 Male,Over18,Single
2 0 0 1 Single
Another method would be to use .stack() and groupby.agg()
df['CombinedString'] = df.mask(df.eq(0)).stack().reset_index(1)\
.groupby(level=0)['level_1'].agg(','.join)
print(df)
Male Over18 Single CombinedString
0 0 0 1 Single
1 1 1 1 Male,Over18,Single
2 0 0 1 Single
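A slower but arguably more readable alternative (a sketch, not from the answers above) builds the string row by row with apply over the boolean columns:
bool_cols = ['Male', 'Over18', 'Single']
df['CombinedString'] = df[bool_cols].apply(
    lambda row: ','.join(col for col in bool_cols if row[col] == 1), axis=1)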

How do I merge two dataframes that don't share one common index?

Here is the first DataFrame:
In: df.head()
Out:
avg_lmp avg_load
read_year read_month trading_block
2017 3 0 24.606666 0.018033
1 32.090800 0.023771
4 0 25.136200 0.017487
1 33.487529 0.023570
5 0 24.085170 0.018008
And here is the second DataFrame that I want to join to the first one based on read_month, even if values have to repeat: for example, whether read_year is 2018 or 2019, a row with read_month 3 should get the same value for month 3.
In: df2.head()
Out:
fpc
read_month trading_block
1 0 37.501837
1 45.750000
2 0 35.531818
1 41.550000
3 0 28.348427
1 35.900000
4 0 26.250870
1 34.150000
5 0 23.599388
1 34.550000
6 0 25.617027
1 38.670000
7 0 27.531765
1 42.050000
8 0 26.628298
1 40.400000
9 0 25.201923
1 36.500000
10 0 25.299149
1 35.250000
11 0 25.349091
1 34.300000
12 0 28.249623
1 35.500000
Is it clear what I'm asking for?
You seem to have common indexes. Set them, then join:
df = df.reset_index().set_index(['read_month', 'trading_block']).join(df2)
and if you wish, restore the full index:
df.reset_index().set_index(['read_year', 'read_month', 'trading_block'])
Not sure if that is what you're after:
index avg_lmp avg_load fpc
read_year read_month trading_block
2017 3 0 0 24.606666 0.018033 28.348427
1 1 32.090800 0.023771 35.900000
4 0 2 25.136200 0.017487 26.250870
1 3 33.487529 0.023570 34.150000
5 0 4 24.085170 0.018008 23.599388
Maybe try this: just merge the two frames on read_month with an outer (full outer) join:
DataFrame1.merge(DataFrame2, left_on='read_month', right_on='read_month', how='outer')
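To make the index-alignment approach concrete, here is a minimal, self-contained sketch (using a few of the rows shown above):
import pandas as pd

df = pd.DataFrame(
    {'avg_lmp': [24.606666, 32.090800, 25.136200],
     'avg_load': [0.018033, 0.023771, 0.017487]},
    index=pd.MultiIndex.from_tuples(
        [(2017, 3, 0), (2017, 3, 1), (2017, 4, 0)],
        names=['read_year', 'read_month', 'trading_block']))

df2 = pd.DataFrame(
    {'fpc': [28.348427, 35.900000, 26.250870]},
    index=pd.MultiIndex.from_tuples(
        [(3, 0), (3, 1), (4, 0)],
        names=['read_month', 'trading_block']))

# Align on the shared index levels, join, then restore the full index
joined = (df.reset_index()
            .set_index(['read_month', 'trading_block'])
            .join(df2)
            .reset_index()
            .set_index(['read_year', 'read_month', 'trading_block']))
print(joined)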

Label encoding across multiple columns with same attributes in sckit-learn

If I have two columns as below:
Origin Destination
China USA
China Turkey
USA China
USA Turkey
USA Russia
Russia China
How would I perform label encoding while ensuring the label for the Origin column matches the one in the Destination column, i.e.
Origin Destination
0 1
0 3
1 0
1 0
1 0
2 1
If I do the encoding for each column separately, the algorithm will see China in column 1 as different from China in column 2, which is not the case.
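For reference, the example frame used in the answers below can be reconstructed as follows (a minimal sketch):
import pandas as pd

df = pd.DataFrame({
    'Origin':      ['China', 'China', 'USA', 'USA', 'USA', 'Russia'],
    'Destination': ['USA', 'Turkey', 'China', 'Turkey', 'Russia', 'China']})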
stack
df.stack().pipe(lambda s: pd.Series(pd.factorize(s.values)[0], s.index)).unstack()
Origin Destination
0 0 1
1 0 2
2 1 0
3 1 2
4 1 3
5 3 0
factorize with reshape
pd.DataFrame(
    pd.factorize(df.values.ravel())[0].reshape(df.shape),
    df.index, df.columns
)
Origin Destination
0 0 1
1 0 2
2 1 0
3 1 2
4 1 3
5 3 0
np.unique and reshape
pd.DataFrame(
    np.unique(df.values.ravel(), return_inverse=True)[1].reshape(df.shape),
    df.index, df.columns
)
Origin Destination
0 0 3
1 0 2
2 3 0
3 3 2
4 3 1
5 1 0
Disgusting Option
I couldn't stop trying stuff... sorry!
import itertools

df.applymap(
    lambda x, y={}, c=itertools.count():
        y.get(x) if x in y else y.setdefault(x, next(c))
)
Origin Destination
0 0 1
1 0 3
2 1 0
3 1 3
4 1 2
5 2 0
As pointed out by cᴏʟᴅsᴘᴇᴇᴅ, you can shorten this by assigning back to the dataframe:
df[:] = pd.factorize(df.values.ravel())[0].reshape(df.shape)
pandas Method
You could create a dictionary of {country: value} pairs and map the dataframe to that:
country_map = {country:i for i, country in enumerate(df.stack().unique())}
df['Origin'] = df['Origin'].map(country_map)
df['Destination'] = df['Destination'].map(country_map)
>>> df
Origin Destination
0 0 1
1 0 2
2 1 0
3 1 2
4 1 3
5 3 0
sklearn method
Since you tagged sklearn, you could use LabelEncoder():
from sklearn.preprocessing import LabelEncoder
le= LabelEncoder()
le.fit(df.stack().unique())
df['Origin'] = le.transform(df['Origin'])
df['Destination'] = le.transform(df['Destination'])
>>> df
Origin Destination
0 0 3
1 0 2
2 3 0
3 3 2
4 3 1
5 1 0
To get the original labels back:
>>> le.inverse_transform(df['Origin'])
# array(['China', 'China', 'USA', 'USA', 'USA', 'Russia'], dtype=object)
You can use replace:
df.replace(dict(zip(np.unique(df.values),list(range(len(np.unique(df.values)))))))
Origin Destination
0 0 3
1 0 2
2 3 0
3 3 2
4 3 1
5 1 0
A succinct and nice answer from Pir:
df.replace((lambda u: dict(zip(u, range(u.size))))(np.unique(df)))
And
df.replace(dict(zip(np.unique(df), itertools.count())))
Edit: just found out about return_inverse option to np.unique. No need to search and substitute!
df.values[:] = np.unique(df, return_inverse=True)[1].reshape(-1,2)
You could leverage the vectorized version of np.searchsorted with
df.values[:] = np.searchsorted(np.sort(np.unique(df)), df)
Or you could create an array of one-hot encodings and recover indices with argmax. Probably not a great idea if there are many countries.
df.values[:] = (df.values[...,None] == np.unique(df)).argmax(-1)
Using LabelEncoder from sklearn, you can also try:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.values.flatten())
df = df.apply(le.transform)  # reuse the single fitted encoder so both columns share labels
print(df)
Result:
Origin Destination
0 0 3
1 0 2
2 3 0
3 3 2
4 3 1
5 1 0
If you have more columns and only want to apply the encoding to selected columns of the dataframe, you can try:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# columns to select for encoding
selected_col = ['Origin','Destination']
le.fit(df[selected_col].values.flatten())
df[selected_col] = df[selected_col].apply(le.transform)
print(df)

Group a dataframe and count amount of items of a column that is not shown

OK, I admit, I had trouble formulating a good title for this, so I will give an example.
This is my sample dataframe:
df = pd.DataFrame([
    (1, "a", "good"),
    (1, "a", "good"),
    (1, "b", "good"),
    (1, "c", "bad"),
    (2, "a", "good"),
    (2, "b", "bad"),
    (3, "a", "none")], columns=["id", "type", "eval"])
What I do with it is the following:
df.groupby(["id", "type"])["id"].agg({'id':'count'})
This results in:
id
id type
1 a 2
b 1
c 1
2 a 1
b 1
3 a 1
This is fine, although later on I will need the id repeated in every row. But this is not the most important part.
What I would need now is something like this:
id good bad none
id type
1 a 2 2 0 0
b 1 1 0 0
c 1 0 1 0
2 a 1 1 0 0
b 1 0 1 0
3 a 1 0 0 1
And even better would be a result like this, because I will need this back in a dataframe (and finally in an Excel sheet) with all fields populated. In reality, there will be many more columns I am grouping by. They would have to be completely populated as well.
id good bad none
id type
1 a 2 2 0 0
1 b 1 1 0 0
1 c 1 0 1 0
2 a 1 1 0 0
2 b 1 0 1 0
3 a 1 0 0 1
Thank you for helping me out.
You can use groupby + size (with the eval column added to the grouping) or value_counts, followed by unstack:
df1 = df.groupby(["id", "type", "eval"]) \
        .size() \
        .unstack(fill_value=0) \
        .rename_axis(None, axis=1)
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
df1 = df.groupby(["id", "type"])['eval'] \
        .value_counts() \
        .unstack(fill_value=0) \
        .rename_axis(None, axis=1)
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
But when writing to Excel:
df1.to_excel('file.xlsx')
the MultiIndex is written as-is, so you need reset_index last:
df1.reset_index().to_excel('file.xlsx', index=False)
EDIT:
I forgot the id count column; id is already used as an index name, so it becomes id1:
df1.insert(0, 'id1', df1.sum(axis=1))
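Putting it together, a minimal end-to-end sketch (assuming the df defined in the question):
df1 = (df.groupby(['id', 'type'])['eval']
         .value_counts()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
df1.insert(0, 'id1', df1.sum(axis=1))            # total count per (id, type)
df1.reset_index().to_excel('file.xlsx', index=False)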

Pandas DataFrame Groupby to get Unique row condition and identify with increasing value up to Number of Groups

I have a DataFrame where a combination of column values (A, B, C) identifies a unique address. I would like to identify all such rows and assign them a unique identifier that I increment per address.
For example
A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1
I would like to generate the following
A B C D E ID
0 1 1 0 1 0
0 1 2 0 1 1
0 1 1 1 1 0
0 1 3 0 1 2
0 1 2 1 0 1
0 1 1 2 1 0
I tried the following:
id = 0
def set_id(df):
    global id
    df['ID'] = id
    id += 1
df.groupby(['A','B','C']).transform(set_id)
This returns a NULL dataframe... This is definitely not the way to do it; I am new to pandas. The above should actually use df[['A','B','C']].drop_duplicates() to get all unique values.
Thank you.
I think this is what you need:
df2 = df[['A','B','C']].drop_duplicates() #get unique values of ABC
df2 = df2.reset_index(drop = True).reset_index() #reset index to create a column named index
df2=df2.rename(columns = {'index':'ID'}) #rename index to ID
df = pd.merge(df,df2,on = ['A','B','C'],how = 'left') #append ID column with merge
# Create tuple triplet using values from columns A, B & C.
df['key'] = [triplet for triplet in zip(*[df[col].values.tolist() for col in ['A', 'B', 'C']])]
# Sort dataframe on new `key` column.
df.sort_values('key', inplace=True)
# Use `groupby` to keep running total of changes in key value.
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1
# Clean up.
del df['key']
df.sort_index(inplace=True)
>>> df
A B C D E ID
0 0 1 1 0 1 0
1 0 1 2 0 1 1
2 0 1 1 1 1 0
3 0 1 3 0 1 2
4 0 1 2 1 0 1
5 0 1 1 2 1 0
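A shorter alternative (a sketch, not from the original answers): groupby with sort=False plus ngroup numbers groups in order of first appearance, which matches the desired ID column.
import pandas as pd

# Reconstruct the example frame from the question
df = pd.DataFrame({'A': [0, 0, 0, 0, 0, 0],
                   'B': [1, 1, 1, 1, 1, 1],
                   'C': [1, 2, 1, 3, 2, 1],
                   'D': [0, 0, 1, 0, 1, 2],
                   'E': [1, 1, 1, 1, 0, 1]})

# ngroup() assigns one integer per unique (A, B, C) combination
df['ID'] = df.groupby(['A', 'B', 'C'], sort=False).ngroup()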
