Related
I have two dataframes and in the column 'column_to_compare value from df1' I have "sub-categorical" data which I want to match to the "names" to its "categories".
I listed the matching categories below in the matching_cat variable. I don't know if it is the right way to associate my variable to a category.
I want to create a new dataframe which compares these two dataframes with a common id and on the columns categories and names.
import pandas as pd
data1 = {'key_col': ['12563','12564','12565','12566'],
'categories': ['bird', 'dog', 'insect','insect'],'column3': ['some','other','data','there']
}
df1 = pd.DataFrame(data1)
df1
data2 = { 'key_col': ['12563','12564','12565','12566','12567'],
'names': ['falcon', 'golden retriever', 'doberman','spider','spider'],
'column_randomn': ['some','data','here','here','here'] }
df2 = pd.DataFrame(data2)
df2
matching_cat = {'bird': ['falcon', 'toucan', 'eagle'],
'dog': ['golden retriever', 'doberman'],
'insect':['spider','mosquito'] }
So, here are my two dataframes:
And I want to be able to "map" the values with the categories and output this:
Ok using your example, here is what I came up with:
import pandas as pd
data1 = {'key_col': ['12563','12564','12565','12566'],
'categories': ['bird', 'dog', 'insect','insect'],'column3': ['some','other','data','there']
}
df1 = pd.DataFrame(data1)
data2 = { 'key_col': ['12563','12564','12565','12566','12567'],
'names': ['falcon', 'golden retriever', 'doberman','spider','spider'],
'column_randomn': ['some','data','here','here','here'] }
df2 = pd.DataFrame(data2)
matching_category = {'bird': ['falcon', 'toucan', 'eagle'],
'dog': ['golden retriever', 'doberman'],
'insect': ['spider','mosquito'] }
# Function to compare rows against matching_category
def test(row):
try:
if row['names'] in matching_category[row['categories']]:
val = True
else:
val = False
except:
val=False
return val
# Merge the two dataframes based on 'key_col'
df3 = df1.merge(df2, how='outer', on='key_col')
# Call the test function on each row in the new dataframe (df3)
df3['new_col'] = df3.apply(test, axis=1)
# Drop unwanted columns
df3 = df3.drop(['column3', 'column_randomn'], axis=1)
# Create new dataframes based on whether output matches or not
output_matches = df3[df3['new_col']==True]
output_mismatches = df3[df3['new_col']==False]
# Display the dataframes
print('OUTPUT MATCHES:')
print('================================================')
print(output_matches)
print("")
print('OUTPUT MIS-MATCHES:')
print('================================================')
print(output_mismatches)
OUTPUT:
OUTPUT MATCHES:
================================================
key_col categories names new_col
0 12563 bird falcon True
1 12564 dog golden retriever True
3 12566 insect spider True
OUTPUT MIS-MATCHES:
================================================
key_col categories names new_col
2 12565 insect doberman False
4 12567 NaN spider False
I am searching through a large spreadsheet with 300 columns and over 200k rows. I would like to create a column that has the the column header and matching cell value. Some thing that looks like "Column||Value." I have the search term and the join aggregator. I can get the row index name but I'm struggling getting the matching column and specific cell. Here's me code so far
df = pd.read_excel (r"Test_file")
mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm'])).any(1)
df['extract'] = df.loc[mask] #This only give me the index name. I would like the actual matched cell contents.
df['extract2'] = Column name
df['Match'] = df[['extract', 'extract2']].agg('||'.join.axis=1)
df.drop(['extract', 'extract2'], axis=1)
Final output should look something like
Output
You can create a mask for a specific column first (I edited your 2nd line a bit), then create a new 'Match' column with all values initialized to 'No Match', and finally, change the values to your desired format ("Column||Value") for rows that are returned after applying the mask. I implemented this in the following sample code:
def match_column(df, column_name):
column_mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm']))[column_name]
df['Match'] = 'No Match'
df.loc[column_mask, 'Match'] = column_name + ' || ' + df[column_name]
return df
df = {
'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
}
df = pd.DataFrame(df)
display(df)
df = match_column(df, 'Segment')
display(df)
Output:
However, this only works for a single column. I don't know what output you want for cases when there are matches in multiple columns (if you can, please specify).
UPDATE:
If you want to use a list of columns as input and match with the first instance, you can use this instead:
def match_first_column(df, column_list):
df['Match'] = 'No Match'
# iterate over rows
for index, row in df.iterrows():
# iterate over column names
for column_name in column_list:
column_value = row[column_name]
substrings = ['Chann', 'Midm', 'Fran']
# if a match is found
if any(x in column_value for x in substrings):
# add match string
df.loc[index, 'Match'] = column_name + ' || ' + column_value
# stop iterating and move to next row
break
return df
df = {
'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
}
df = pd.DataFrame(df)
display(df)
column_list= df.columns.tolist()
match_first_column(df, column_list)
Output:
You can try:
mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm'])).any(1)
df.loc[mask, 'Match'] = '||'.join(df[['extract', 'extract2']])
df['Match'].fillna('No Match', inplace=True)
I have the following two dataframes:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['01/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
(perhaps it's clearer in the screenshots here: https://imgur.com/a/YNrWpR2)
The df2 is much larger than shown here - it contains columns for 100 companies. So for example, for the 10th company, the column names are: ReturnOnAssets.10, etc.
I have created a dictionary which maps the company names to the column names:
stocks = {'Microsoft':'','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7'}
and so on.
Now, what I am trying to achieve is adding a column "ReturnOnAssets" from d2 to d1, but for a specific company and for a specific date. So looking at df1, the first tweet (i.e. "text") contains a keyword "Amazon" and it was posted on 04/28/2017. I now need to go to df2 to the relevant column name for Amazon (i.e. "ReturnOnAssets.2") and fetch the value for the specified date.
So what I expect looks like this:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon', **'10.5'**], ["blala Amazon", '04/28/2017', 'Amazon', 'x'], ["blabla Netflix', '06/28/2017', 'Netflix', 'x']], columns=['text', 'date', 'keyword', 'ReturnOnAssets'])
By x I mean values which where not included in the example df1 and df2.
I am fairly new to pandas and I can't wrap my head around it. I tried:
keyword = df1['keyword']
txt = 'ReturnOnAssets.'+ stocks[keyword]
df1['ReturnOnAssets'] = df2[txt]
But I don't know how to fetch the relevant date, and also this gives me an error: "Series' objects are mutable, thus they cannot be hashed", which probably comes from the fact that I cannot just add a whole column of keywords to the text string.
I don't know how to achieve the operation I need to do, so I would appreciate help.
It can probably be shortened and you can add if statements to deal with when there are missing values.
import pandas as pd
import numpy as np
df1 = pd.DataFrame([["blala Amazon", '05/28/2017', 'Amazon'], ["blala Facebook", '04/28/2017', 'Facebook'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'dates', 'keyword'])
df1
df2 = pd.DataFrame([['06/28/2017', '3.4', '10.2'], ['05/28/2017', '3.7', '10.5'], ['04/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAsset.1', 'ReturnOnAsset.2'])
#creating myself a bigger df2 to cover all the way to netflix
for i in range (9):
df2[('ReturnOnAsset.' + str(i))]=np.random.randint(1, 1000, df1.shape[0])
stocks = {'Microsoft':'0','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7', 'Netflix': '8'}
#new col where to store values
df1['ReturnOnAsset']=np.nan
for index, row in df1.iterrows():
colname=('ReturnOnAsset.' + stocks[row['keyword']] )
df1['ReturnOnAsset'][index]=df2.loc[df2['dates'] ==row['dates'] , colname]
Next time please give us a correct test data, I modified your dates and dictionary for match the first and second column (netflix and amazon values).
This code will work if and only if all dates from df1 are in df2 (Note that in df1 the column name is date and in df2 the column name is dates)
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '02/30/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['04/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
stocks = {'Microsoft':'','Apple' :'5', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Netflix':'1',
'JPMorgan' :'6', 'Alphabet': '7'}
df1["ReturnOnAssets"]= [ df2["ReturnOnAssets." + stocks[ df1[ "keyword" ][ index ] ] ][ df2.index[ df2["dates"] == df1["date"][index] ][0] ] for index in range(len(df1)) ]
df1
I'm having a serious issue using dask (dask version: 1.00, pandas version: 0.23.3). I am trying to load a dask dataframe from a CSV file, filter the results into two separate dataframes, and perform operations on both.
However, after the split the dataframes and try to set the category columns as 'known', they remain 'unknown'. Thus I cannot continue with my operations (which require category columns to be 'known'.)
NOTE: I have created a minimum example as suggested using pandas instead of read_csv().
import pandas as pd
import dask.dataframe as dd
# Specify dtypes
b_dtypes = {
'symbol': 'category',
'price': 'float64',
}
i_dtypes = {
'symbol': 'category',
'price': 'object'
}
# Specify a function to quickly set dtypes
def to_dtypes(df, dtypes):
for column, dtype in dtypes.items():
if column in df.columns:
df[column] = df.loc[:, column].astype(dtype)
return df
# Set up our test data
data = [
['B', 'IBN', '9.9800'],
['B', 'PAY', '21.5000'],
['I', 'PAY', 'seventeen'],
['I', 'SPY', 'ten']
]
# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')
# Convert into dask
df = dd.from_pandas(pdf, npartitions=3)
#
## At this point 'df' simulates what I get when I read the mixed-type CSV file via dask
#
# Split the dataframe by the 'type' column
b_df = df.loc[df['type'] == 'B', :]
i_df = df.loc[df['type'] == 'I', :]
# Convert columns into our intended dtypes
b_df = to_dtypes(b_df, b_dtypes)
i_df = to_dtypes(i_df, i_dtypes)
# Let's convert our 'symbol' column to known categories
b_df = b_df.categorize(columns=['symbol'])
i_df['symbol'] = i_df['symbol'].cat.as_known()
# Is our symbol column known now?
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)
#
## print() returns 'False' for both, this makes me want to kill myself.
## (Please help...)
#
UPDATE: So it seems that if I shift the 'npartitions' parameters to 1, then print() returns True in both cases. So this appears to be an issue with the partitions containing different categories. However loading both dataframes into only two partitions is not feasible, so is there a way I can tell dask to do some sort of re-sorting to make the categories consistent across partitions?
The answer for your problem is basically contained in doc. I'm referring to the part code commented by # categorize requires computation, and results in known categoricals I'll expand here because it seems to me you're misusing loc
import pandas as pd
import dask.dataframe as dd
# Set up our test data
data = [['B', 'IBN', '9.9800'],
['B', 'PAY', '21.5000'],
['I', 'PAY', 'seventeen'],
['I', 'SPY', 'ten']
]
# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')
# Convert into dask
ddf = dd.from_pandas(pdf, npartitions=3)
# Split the dataframe by the 'type' column
# reset_index is not necessary
b_df = ddf[ddf["type"] == "B"].reset_index(drop=True)
i_df = ddf[ddf["type"] == "I"].reset_index(drop=True)
# Convert columns into our intended dtypes
b_df = b_df.categorize(columns=['symbol'])
b_df["price"] = b_df["price"].astype('float64')
i_df = i_df.categorize(columns=['symbol'])
# Is our symbol column known now? YES
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)
I have a list of columns that I want to rename a portion of based on a list of values.
I am looking at a file which has 12 months of data and each month is a different column (I need to keep it in this specific format unfortunately). This file is generated once per month and I keep the column names more general since I have to do a lot of calculations on them based the month number (for example, I need to compare information against the average of month 8, 9, and 10 every month).
Here are the columns I want to rename:
['month_1_Sign',
'month_2_Sign',
'month_3_Sign',
'month_4_Sign',
'month_5_Sign',
'month_6_Sign',
'month_7_Sign',
'month_8_Sign',
'month_9_Sign',
'month_10_Sign',
'month_11_Sign',
'month_12_Sign',
'month_1_Actual',
'month_2_Actual',
'month_3_Actual',
'month_4_Actual',
'month_5_Actual',
'month_6_Actual',
'month_7_Actual',
'month_8_Actual',
'month_9_Actual',
'month_10_Actual',
'month_11_Actual',
'month_12_Actual',
'month_1_Target',
'month_2_Target',
'month_3_Target',
'month_4_Target',
'month_5_Target',
'month_6_Target',
'month_7_Target',
'month_8_Target',
'month_9_Target',
'month_10_Target',
'month_11_Target',
'month_12_Target']
Here are the names I want to place:
required_date_range = sorted(list(pd.Series(pd.date_range((dt.datetime.today().date() + pd.DateOffset(months=-13)), periods=12, freq='MS')).dt.strftime('%Y-%m-%d')))
['2015-03-01',
'2015-04-01',
'2015-05-01',
'2015-06-01',
'2015-07-01',
'2015-08-01',
'2015-09-01',
'2015-10-01',
'2015-11-01',
'2015-12-01',
'2016-01-01',
'2016-02-01']
So month_12 columns (month_12 is always the latest month) would be changed to '2016-02-01_Sign', '2016-02-01_Actual', '2016-02-01_Target' in this example.
I tried doing this but it doesn't change anything (trying to change the month_# with the actual date it refers to):
final.replace('month_10', required_date_range[9], inplace=True)
final.replace('month_11', required_date_range[10], inplace=True)
final.replace('month_12', required_date_range[11], inplace=True)
final.replace('month_1', required_date_range[0], inplace=True)
final.replace('month_2', required_date_range[1], inplace=True)
final.replace('month_3', required_date_range[2], inplace=True)
final.replace('month_4', required_date_range[3], inplace=True)
final.replace('month_5', required_date_range[4], inplace=True)
final.replace('month_6', required_date_range[5], inplace=True)
final.replace('month_7', required_date_range[6], inplace=True)
final.replace('month_8', required_date_range[7], inplace=True)
final.replace('month_9', required_date_range[8], inplace=True)
You could construct a dict and then call map on the split column str:
In [27]:
d = dict(zip([str(x) for x in range(1,13)], required_date_range))
d
Out[27]:
{'1': '2015-03-01',
'10': '2015-12-01',
'11': '2016-01-01',
'12': '2016-02-01',
'2': '2015-04-01',
'3': '2015-05-01',
'4': '2015-06-01',
'5': '2015-07-01',
'6': '2015-08-01',
'7': '2015-09-01',
'8': '2015-10-01',
'9': '2015-11-01'}
In [26]:
df.columns = df.columns.to_series().str.rsplit('_').str[1].map(d) + '_' + df.columns.to_series().str.rsplit('_').str[-1]
df.columns
Out[26]:
Index(['2015-03-01_Sign', '2015-04-01_Sign', '2015-05-01_Sign',
'2015-06-01_Sign', '2015-07-01_Sign', '2015-08-01_Sign',
'2015-09-01_Sign', '2015-10-01_Sign', '2015-11-01_Sign',
'2015-12-01_Sign', '2016-01-01_Sign', '2016-02-01_Sign',
'2015-03-01_Actual', '2015-04-01_Actual', '2015-05-01_Actual',
'2015-06-01_Actual', '2015-07-01_Actual', '2015-08-01_Actual',
'2015-09-01_Actual', '2015-10-01_Actual', '2015-11-01_Actual',
'2015-12-01_Actual', '2016-01-01_Actual', '2016-02-01_Actual',
'2015-03-01_Target', '2015-04-01_Target', '2015-05-01_Target',
'2015-06-01_Target', '2015-07-01_Target', '2015-08-01_Target',
'2015-09-01_Target', '2015-10-01_Target', '2015-11-01_Target',
'2015-12-01_Target', '2016-01-01_Target', '2016-02-01_Target'],
dtype='object')
You're going to want to use the .rename method instead of the .replace! For instance this code:
import pandas as pd
d = {'a': [1, 2, 4], 'b':[2,3,4],'c':[3,4,5]}
df = pd.DataFrame(d)
df.rename(columns={'a': 'x1', 'b': 'x2'}, inplace=True)
Changes the 'a' and 'b' column title to 'x1' and 'x2' respectively.
The first line of the renaming code you have would change to:
final.rename(columns={'month_10':required_date_range[9]}, inplace=True)
In fact you could do every column in that one command by adding entries to the columns dictionary argument.
final.rename(columns={'month_10':required_date_range[9],
'month_9':required_date-range[8], ... (and so on) }, inplace=True)
from collections import product
df = pd.DataFrame(np.random.rand(3, 12 * 3), columns=['month_' + str(c[0]) + '_' + c[1] for c in product(range(1, 13), ['Sign', 'Actual', 'Target'])])
First create a mapping to the relevant months.
mapping = {'month_' + str(n): date for n, date in enumerate(required_date_range, 1)}
>>> mapping
{'month_1': '2015-03-01',
'month_10': '2015-12-01',
'month_11': '2016-01-01',
'month_12': '2016-02-01',
'month_2': '2015-04-01',
'month_3': '2015-05-01',
'month_4': '2015-06-01',
'month_5': '2015-07-01',
'month_6': '2015-08-01',
'month_7': '2015-09-01',
'month_8': '2015-10-01',
'month_9': '2015-11-01'}
Then reassign columns, joining the mapped month (e.g. '2016-02-01') to the rest of the column name. This was done using a list comprehension.
df.columns = [mapping.get(c[:c.find('_', 6)]) + c[c.find('_', 6):] for c in cols]
>>> df.columns.tolist()
['2015-03-01_Sign',
'2015-04-01_Sign',
'2015-05-01_Sign',
'2015-06-01_Sign',
'2015-07-01_Sign',
'2015-08-01_Sign',
'2015-09-01_Sign',
'2015-10-01_Sign',
'2015-11-01_Sign',
'2015-12-01_Sign',
'2016-01-01_Sign',
'2016-02-01_Sign',
'2015-03-01_Actual',
'2015-04-01_Actual',
'2015-05-01_Actual',
'2015-06-01_Actual',
'2015-07-01_Actual',
'2015-08-01_Actual',
'2015-09-01_Actual',
'2015-10-01_Actual',
'2015-11-01_Actual',
'2015-12-01_Actual',
'2016-01-01_Actual',
'2016-02-01_Actual',
'2015-03-01_Target',
'2015-04-01_Target',
'2015-05-01_Target',
'2015-06-01_Target',
'2015-07-01_Target',
'2015-08-01_Target',
'2015-09-01_Target',
'2015-10-01_Target',
'2015-11-01_Target',
'2015-12-01_Target',
'2016-01-01_Target',
'2016-02-01_Target']