How to create a Boolean indicator matrix in Python

I have the following dataset:
user artist sex country
0 1 red hot chili peppers f Germany
1 1 the black dahlia murder f Germany
2 1 goldfrapp f Germany
3 2 dropkick murphys f Germany
4 2 le tigre f Germany
...
289950 19718 bob dylan f Canada
289951 19718 pixies f Canada
289952 19718 the clash f Canada
I want to create a Boolean indicator matrix as a dataframe, with one row for each user and one column for each artist. For each row (user), the entry should be 1 if the user has that artist and 0 otherwise.
Just to mention, there are 1004 unique artists and 15000 unique users—it’s a large data set.
I have created an empty matrix using the following:
pd.DataFrame(index=user, columns=artist)
I am having difficulty populating the dataframe correctly.

There is a method in pandas called notnull.
Supposing your dataframe is named df, you can use:
df['has_artist'] = df['artist'].notnull()
This will add a boolean column named has_artist to your dataframe.
If you want 0 and 1 instead, do:
df['has_artist'] = df['artist'].notnull().astype(int)
You can also store it in a different variable and not alter your dataframe.
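For the full user-by-artist matrix described in the question, a minimal sketch (assuming the long-format dataframe is named df with columns 'user' and 'artist') could be:

import pandas as pd

# count how many times each (user, artist) pair occurs ...
indicator = pd.crosstab(df['user'], df['artist'])
# ... then turn the counts into 0/1 flags
indicator = indicator.clip(upper=1)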

Related

Getting the index of a particular element (a sentence string present in a column of a DataFrame) subject to conditions

I have a table. There are numbers in the column 'Para'. I have to find the index for a particular sentence from the column 'Country_Title' such that the value in column 'Para' is 2.
Main DataFrame 'df_countries' is shown below:
Index  Sequence  Para  Country_Title
0      5         4     India is seventh largest country
1      6         6     Australia is a continent country
2      7         2     Canada is the 2nd largest country
3      9         3     UAE is a country in Western Asia
4      10        2     China is a country in East Asia
5      11        1     Germany is in Central Europe
6      13        2     Russia is the largest country
7      14        3     Capital city of China is Beijing
Suppose my keyword is 'China', and I want to get the index for the sentence containing 'China', but only the one where 'Para' = 2.
Consider the rows at index 4 and 7; both mention 'China' in Country_Title, but I want to obtain the index for the one with 'Para' = 2, i.e., the result must be index = 4.
My Approach:
I derived another DataFrame 'df_para2_countries' from the above table, as shown below:
Index  Para  Country_Title
2      2     Canada is the 2nd largest country
4      2     China is a country in East Asia
6      2     Russia is the largest country
Now I store the country titles as:
c = list(df_para2_countries['Country_Title'])
I used a for loop to iterate through the elements of 'c' and find the index of a particular country in the table 'df_countries':
for i in c:
    if 'China' in i:
        print(i)
        ind = df_para2_countries.loc[df_para2_countries['Country_Title'] = i]
        print(ind)
The line assigning 'ind' gives an error.
I want to get the index, but this doesn't work.
Please post your suggestions on how I can approach this.
You need two equals signs (==) in your condition.
If you need only the 'index', that is, the value from your first column called index, then you can take the index of the filtered rows, convert it to a list, and get the first value, for instance:
ind = df_para2_countries.loc[df_para2_countries['Country_Title'] == i].index.to_list()[0]
Hope it works :)
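A vectorized alternative (a sketch based on the column names in the question) avoids the loop entirely:

# boolean mask: the sentence mentions the keyword AND 'Para' equals 2
mask = df_countries['Country_Title'].str.contains('China') & df_countries['Para'].eq(2)
matching_index = df_countries.index[mask].to_list()
print(matching_index)  # expected: [4] for the sample table above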

Counting number of values in a column using groupby on a specific condition in pandas

I have a dataframe which looks something like this:
dfA
name field country action
Sam elec USA POS
Sam elec USA POS
Sam elec USA NEG
Tommy mech Canada NEG
Tommy mech Canada NEG
Brian IT Spain NEG
Brian IT Spain NEG
Brian IT Spain POS
I want to group the dataframe by the first 3 columns, adding a new column "No_of_data". I do this using:
dfB = dfA.groupby(["name", "field", "country"], dropna=False).size().reset_index(name = "No_of_data")
This gives me a new dataframe which looks something like this:
dfB
name field country No_of_data
Sam elec USA 3
Tommy mech Canada 2
Brian IT Spain 3
But now I also want to add a new column to this dataframe which tells me the count of "POS" values for every combination of "name", "field" and "country". It should look something like this:
dfB
name field country No_of_data No_of_POS
Sam elec USA 3 2
Tommy mech Canada 2 0
Brian IT Spain 3 1
How do I add the new column (No_of_POS) to the table dfB when dfB doesn't contain the "POS"/"NEG" information and it needs to be taken from dfA?
You can use named aggregation in the agg method (the older dict-renaming form no longer works in recent pandas versions):
dfA.groupby(["name", "field", "country"], as_index=False)['action']\
   .agg(No_of_data='size', No_of_POS=lambda x: x.eq('POS').sum())
You can precompute the boolean before aggregating; performance should be better as the data size increases:
(dfA.assign(action=dfA.action.eq('POS'))
    .groupby(['name', 'field', 'country'], sort=False, as_index=False)
    .agg(no_of_data=('action', 'size'),
         no_of_pos=('action', 'sum')))
name field country no_of_data no_of_pos
0 Sam elec USA 3 2
1 Tommy mech Canada 2 0
2 Brian IT Spain 3 1
You can add an aggregation function when you're grouping your data. Check the agg() function; maybe this will help.
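For instance, a minimal merge-based sketch that attaches the POS count to the existing dfB (names taken from the question):

# count only the 'POS' rows per group, then attach the counts to dfB
pos_counts = (
    dfA[dfA['action'] == 'POS']
    .groupby(['name', 'field', 'country'])
    .size()
    .reset_index(name='No_of_POS')
)
dfB = dfB.merge(pos_counts, on=['name', 'field', 'country'], how='left')
# groups with no 'POS' rows get NaN from the left join, so fill with 0
dfB['No_of_POS'] = dfB['No_of_POS'].fillna(0).astype(int)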

Python Pandas fill missing zipcode with values from another dataframe based on conditions

I have a dataset in which I add coordinates to cities based on zip-codes but several of these zip-codes are missing. Also, in some cases cities are missing, states are missing, or both are missing. For example:
ca_df[['OWNER_CITY', 'OWNER_STATE', 'OWNER_ZIP']]
OWNER_CITY OWNER_STATE OWNER_ZIP
495 MIAMI SHORE PA
496 SEATTLE
However, a second dataset has city, state & the matching zip-codes. This one is complete without any missing values.
df_coord.head()
OWNER_ZIP CITY STATE
0 71937 Cove AR
1 72044 Edgemont AR
2 56171 Sherburn MN
I want to fill in the missing zip-codes in the first dataframe if:
Zip-code is empty
City is present
State is present
This is an all-or-nothing operation: either all three criteria are met and the zip-code gets filled, or nothing changes.
However, this is a fairly large dataset with > 50 million records so ideally I want to vectorize the operation by working column-wise.
Technically, that would fit np.where, but as far as I know, np.where only takes one condition, in the following format:
df1['OWNER_ZIP'] = np.where(df["cond"] ==X, df_coord['OWNER_ZIP'], "")
How do I ensure I only fill missing zip-codes when all conditions are met?
Given ca_df:
OWNER_CITY OWNER_STATE OWNER_ZIP
0 Miami Shore Florida 111
1 Los Angeles California NaN
2 Houston NaN NaN
and df_coord:
OWNER_ZIP CITY STATE
0 111 Miami Shore Florida
1 222 Los Angeles California
2 333 Houston Texas
You can use pd.notna along with the DataFrame's index like this:
inferrable_zips_df = pd.notna(ca_df["OWNER_CITY"]) & pd.notna(ca_df["OWNER_STATE"])
is_inferrable_zip = ca_df.index.isin(df_coord[inferrable_zips_df].index)
ca_df.loc[is_inferrable_zip, "OWNER_ZIP"] = df_coord["OWNER_ZIP"]
with ca_df resulting as:
OWNER_CITY OWNER_STATE OWNER_ZIP
0 Miami Shore Florida 111
1 Los Angeles California 222
2 Houston NaN NaN
I've changed the "" to np.nan, but if you still wish to use "" then you just need to change pd.notna(ca_df[...]) to ca_df[...] != "".
You can combine multiple conditions with & inside a single numpy.where call. This should give you the array of row indices which satisfy all three rules:
np.where((df["OWNER_ZIP"] == X) & (df["CITY"] == Y) & (df["STATE"] == Z))
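For instance, a rough sketch of how a combined mask could then be used for the fill (column names taken from the question; zip_lookup is a hypothetical series of looked-up zip-codes already aligned to ca_df's index):

import numpy as np

# fill OWNER_ZIP only where it is missing and both city and state are present;
# zip_lookup is a hypothetical, pre-aligned series of looked-up zip-codes
mask = ca_df['OWNER_ZIP'].isna() & ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].notna()
ca_df['OWNER_ZIP'] = np.where(mask, zip_lookup, ca_df['OWNER_ZIP'])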
Use:
print (df_coord)
OWNER_ZIP CITY STATE
0 71937 Cove AR
1 72044 Edgemont AR
2 56171 Sherburn MN
3 123 MIAMI SHORE PA
4 789 SEATTLE AA
print (ca_df)
OWNER_ZIP OWNER_CITY OWNER_STATE
0 NaN NaN NaN
1 72044 Edgemont AR
2 56171 NaN MN
3 NaN MIAMI SHORE PA
4 NaN SEATTLE NaN
First, it is necessary to check that the matching columns have the same dtypes:
#or convert ca_df['OWNER_ZIP'] to integers
df_coord['OWNER_ZIP'] = df_coord['OWNER_ZIP'].astype(str)
print (df_coord.dtypes)
OWNER_ZIP object
CITY object
STATE object
dtype: object
print (ca_df.dtypes)
OWNER_ZIP object
OWNER_CITY object
OWNER_STATE object
dtype: object
Then filter for each combination of missing and non-missing columns, add the new data by merge, convert the index to match the filtered data, and assign back:
mask1 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].isna()
df1 = ca_df[mask1].drop('OWNER_ZIP', axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask1])
ca_df.loc[mask1, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df1
mask2 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].isna() & ca_df['OWNER_ZIP'].isna()
df2 = ca_df[mask2].drop(['OWNER_ZIP','OWNER_STATE'], axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask2])
ca_df.loc[mask2, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df2
mask3 = ca_df['OWNER_CITY'].isna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].notna()
df3 = ca_df[mask3].drop(['OWNER_CITY'], axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask3])
ca_df.loc[mask3, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df3
print (ca_df)
OWNER_ZIP OWNER_CITY OWNER_STATE
0 NaN NaN NaN
1 72044 Edgemont AR
2 56171 Sherburn MN
3 123 MIAMI SHORE PA
4 789 SEATTLE AA
You can do a left join on these dataframes, joining on the columns 'city' and 'state'. That would give you the zip-code corresponding to a city and state when both values are non-null in the first dataframe (OWNER_CITY, OWNER_STATE, OWNER_ZIP), and since it is a left join, it would also preserve the rows which either don't have a zip-code or have null/empty city and state values.
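A rough sketch of that merge-based idea (column names taken from the question; the '_lookup' suffix is an assumption used to keep the original zip column):

merged = ca_df.merge(
    df_coord.rename(columns={'CITY': 'OWNER_CITY', 'STATE': 'OWNER_STATE'}),
    on=['OWNER_CITY', 'OWNER_STATE'],
    how='left',
    suffixes=('', '_lookup'),
)
# fill OWNER_ZIP only where it was missing and a lookup value exists
merged['OWNER_ZIP'] = merged['OWNER_ZIP'].fillna(merged['OWNER_ZIP_lookup'])
ca_df = merged.drop(columns='OWNER_ZIP_lookup')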

Listing unique value counts per group in a pandas dataframe

I am new to pandas and python.
I am trying to group items by one column and list the information from the data frame per group.
My dataframe:
B C D E F
1 Honda USA 2000 Washington New
2 Honda USA 2001 Salt Lake Used
3 Ford Canada 2005 Washington New
4 Toyota USA 2010 Ney York Used
5 Honda USA 2001 Salt Lake Used
6 Honda Canada 2011 Salt Lake Crashed
7 Ford Italy 2014 Rome New
I am trying to group my dataframe by column B and, for each group, list how many of each value from columns C, D, E and F appear in that group. For example, in column B there are 4 Hondas, which I group together. Then I want to list the following information - USA(3), Canada(1), 2000(1), 2001(2), 2011(1), Washington(1), Salt Lake(3), New(1), Used(2), Crashed(1) - and do the same for every group (car make) in column B:
Car Country Year City Condition
1 Honda(4) USA(3) 2000(1) Washington(1) New(1)
Canada(1) 2001(2) Salt Lake(3) Used(2)
2011(1) Crashed(1)
2 Ford(2) Canada(1) 2005(1) Washington(1) New(2)
Italy(1) 2014(1) Rome(1)
...
What I've tried so far:
df.groupby(['B'])
Which gives me back <pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d559080>
At this point, I am not sure how I should code moving on forward getting the desired results after grouping the column B.
Thank you for your suggestions.
You need to apply a custom function to each column separately: use Series.value_counts, and then join the index values to the counts:
def f(x):
    # count occurrences of each unique value in the column
    x = x.value_counts()
    # join each value with its count, e.g. 'USA(3)'
    y = x.index.astype(str) + '(' + x.astype(str) + ')'
    return y.reset_index(drop=True)

df1 = df.groupby(['B']).apply(lambda x: x.apply(f)).reset_index(drop=True)
print (df1)
B C D E F
0 Ford(2) Italy(1) 2014(1) Washington(1) New(2)
1 NaN Canada(1) 2005(1) Rome(1) NaN
2 Honda(4) USA(3) 2001(2) Salt Lake(3) Used(2)
3 NaN Canada(1) 2011(1) Washington(1) Crashed(1)
4 NaN NaN 2000(1) NaN New(1)
5 Toyota(1) USA(1) 2010(1) Ney York(1) Used(1)

Delete rows based on values in column in python

I am performing data cleaning on a .csv file for analytics. I am trying to delete the rows that have null values in their columns in Python.
Sample file:
Unnamed: 0 2012 2011 2010 2009 2008 2005
0 United States of America 760739 752423 781844 812514 843683 862220
1 Brazil 732913 717185 715702 651879 649996 NaN
2 Germany 520005 513458 515853 519010 518499 494329
3 United Kingdom (England and Wales) 310544 336997 367055 399869 419273 541455
4 Mexico 211921 212141 230687 244623 250932 239166
5 France 193081 192263 192906 193405 187937 148651
6 Sweden 87052 89457 87854 86281 84566 72645
7 Romania 17219 12299 12301 9072 9457 8898
8 Nigeria 15388 NaN 18093 14075 14692 NaN
So far I have used:
from pandas import read_csv
link = "https://docs.google.com/spreadsheets......csv"
data = read_csv(link)
data.head(100000)
How can I delete these rows?
Once you have your data loaded you just need to figure out which rows to remove (pandas' isna is used here rather than np.isnan, since np.isnan cannot handle the non-numeric country column):
bad_rows = data.isna().any(axis=1)
Then:
data[~bad_rows].head(100)
You need to use the dropna method to remove these values. Passing in how='any' into the method as an argument will remove the row if any of the values is null and how='all' will only remove the row if all of the values are null.
cleaned_data = data.dropna(how='any')
Edit 1.
It's worth noting that you may not want to create a copy of your cleaned data (i.e. cleaned_data = data.dropna(how='any')).
To save memory you can pass in the inplace option that will modify your original DataFrame and return None.
data.dropna(how='any', inplace=True)
data.head(100)
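If only certain columns should decide which rows are dropped, dropna also accepts a subset parameter; for instance (column names assumed from the sample above):

# drop rows only when the 2011 or 2005 value is missing
data.dropna(subset=['2011', '2005'], how='any', inplace=True)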
