Pandas method chaining with str.contains and str.split - python

I'm learning Pandas method chaining and having trouble using str.contains and str.split in a chain. The data is one week's worth of information scraped from a Wikipedia page; I will be scraping several years' worth of weekly data.
This code without chaining works:
#list of data scraped from web:
list = ['Unnamed: 0','Preseason-Aug 11','Week 1-Aug 26','Week 2-Sep 2',
'Week 3-Sep 9','Week 4-Sep 23','Week 5-Sep 30','eek 6-Oct 7','Week 7-Oct 14',
'Week 8-Oct 21','Week 9-Oct 28','Week 10-Nov 4','Week 11-Nov 11','Week 12-Nov 18',
'Week 13-Nov 25','Week 14Dec 2','Week 15-Dec 9','Week 16 (Final)-Jan 4','Unnamed: 18']
#load to dataframe:
df = pd.DataFrame(list)
#rename column 0 to text:
df = df.rename(columns = {0:"text"})
#remove rows that contain "Unnamed":
df = df[~df['text'].str.contains("Unnamed")]
#split column 0 into 'week' and 'released' at the hyphen:
df[['week', 'released']] = df["text"].str.split(pat = '-', expand = True)
Here's my attempt to rewrite it as a chain:
#load to dataframe:
df = pd.DataFrame(list)
#function to remove rows that contain "Unnamed"
def filter_unnamed(df):
    df = df[~df["text"].str.contains("Unnamed")]
    return df
clean_df = (df
    .rename(columns={0: "text"})
    .pipe(filter_unnamed)
    #[['week','released']] = lambda df_: df_["text"].str.split('-', expand=True)
)
The first line of the clean_df chain, which renames column 0, works.
The second line removes the rows that contain "Unnamed"; it works, but is there a better way than using pipe and a function?
I'm having the most trouble with str.split in the third step (it doesn't work, so it's commented out). I tried assign for this and think it should work, but I don't know how to pass in the new column names ("week" and "released") with the str.split function.
Thanks for the help.

I also couldn't figure out how to create two columns in one go from the split, but I was able to do it by splitting twice and accessing the first and second parts in succession (not ideal): df.assign(week = ...[0], released = ...[1]).
Note also that I reset the index.
(df.assign(week=df[0].str.split(pat='-', expand=True)[0],
           released=df[0].str.split(pat='-', expand=True)[1])
   [~df[0].str.contains("Unnamed")]
   .reset_index(drop=True)
   .rename(columns={0: "text"}))
I'm sure there's a sleeker way, but this may help.
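For what it's worth, here is a minimal fully chained sketch of one possible approach (a judgment call, not the only way): .loc with a callable replaces the pipe/function, and .str[0] / .str[1] pull the two halves of the split inside assign. The weeks list is a shortened sample of the scraped data, renamed so it no longer shadows the built-in list.
import pandas as pd

weeks = ['Unnamed: 0', 'Preseason-Aug 11', 'Week 1-Aug 26']  # shortened sample of the scraped list

clean_df = (
    pd.DataFrame({"text": weeks})
    .loc[lambda d: ~d["text"].str.contains("Unnamed")]               # drop "Unnamed" rows without pipe
    .assign(week=lambda d: d["text"].str.split("-", n=1).str[0],     # part before the first hyphen
            released=lambda d: d["text"].str.split("-", n=1).str[1]) # part after the first hyphen
    .reset_index(drop=True)
)
print(clean_df)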

Related

Is there a method to programmatically calculate index number in Python dataframe?

I have hardcoded index numbers in the for loop below to extract data from specific row ranges within the dataframe:
import pandas as pd
import re
input_csv_file = "./CSV/Officers_and_Shareholders.csv"
df = pd.read_csv(input_csv_file, skiprows=10, on_bad_lines='skip')
df.fillna('', inplace=True)
# df.drop([0, 3], inplace=True)
df.columns = ['Nama', 'Jabatan', 'Alamat', 'Klasifikasi Saham', 'Jumlah Lembar Saham', 'Total']
for i in range(len(df.columns)):
    if df["Total"][i] == '':
        shareholders = df.iloc[24:42]
        print(i, shareholders)
    else:
        officers = df.iloc[0:23]
        print(i, officers)
The for loop above works and returns separate information for shareholders and officers, but instead of using df.iloc[hardcoded number], which only works for this file, is there a way to adjust it so that Python automatically locates the officers and shareholders even when the file format changes?
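As a rough sketch (based on an assumption not stated in the original post: that the officer and shareholder rows form two contiguous blocks distinguished by whether 'Total' is empty), the boundary can be located programmatically instead of hardcoding the iloc ranges:
import pandas as pd

df = pd.read_csv("./CSV/Officers_and_Shareholders.csv", skiprows=10, on_bad_lines='skip')
df.fillna('', inplace=True)
df.columns = ['Nama', 'Jabatan', 'Alamat', 'Klasifikasi Saham', 'Jumlah Lembar Saham', 'Total']

# Flag rows whose 'Total' is empty, then find the first position where the flag changes;
# that position is treated as the officer/shareholder boundary.
empty_total = (df['Total'] == '').to_numpy()
changes = (empty_total[1:] != empty_total[:-1]).nonzero()[0]
boundary = changes[0] + 1 if len(changes) else len(df)

officers = df.iloc[:boundary]      # first contiguous block
shareholders = df.iloc[boundary:]  # remaining rows
print(officers)
print(shareholders)
Which block is the officers and which the shareholders may need swapping depending on the actual file layout.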

How to use wide_to_long (Pandas)

I have this code, which I thought would reformat the dataframe so that columns sharing the same (prefix) name are combined with their duplicates.
# Function that splits the dataframe into two separate dataframes: one with all
# unique columns and one with all duplicates
def sub_dataframes(dataframe):
    # Extract common prefix -> remove trailing digits
    columns = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()
    # Split columns
    unq_cols = columns[columns == 1].index
    # All columns from the dataframe that are not in unq_cols
    dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]
    return dataframe[unq_cols], dataframe[dup_cols]

unq_df = sub_dataframes(df)[0]
dup_df = sub_dataframes(df)[1]
print("Unique columns:\n\n{}\n\nDuplicate columns:\n\n{}".format(unq_df.columns.tolist(), dup_df.columns.tolist()))
Output:
Unique columns:
['total_tracks', 'popularity']
Duplicate columns:
['t_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 't_energy1', 't_energy2',
't_key0', 't_key1', 't_key2', 't_speech0', 't_speech1', 't_speech2', 't_acous0', 't_acous1', 't_acous2',
't_ins0', 't_ins1', 't_ins2', 't_live0', 't_live1', 't_live2', 't_val0', 't_val1', 't_val2', 't_tempo0',
't_tempo1', 't_tempo2']
Then I tried to use wide_to_long to combine columns with the same name:
cols = unq_df.columns.tolist()
temp = (pd.wide_to_long(dataset.reset_index(),
                        stubnames=['t_dur', 't_dance', 't_energy', 't_key', 't_mode',
                                   't_speech', 't_acous', 't_ins', 't_live', 't_val',
                                   't_tempo'],
                        i=['index'] + cols, j='temp', sep='t_')
          .reset_index()
          .groupby(cols, as_index=False)
          .mean())
temp
This gave me an empty result: the dataframe that's returned has "Nothing to show". I tried following a similar question, but got the same result. What am I doing wrong here? How do I fix this?
EDIT
Here is an example of how I've done it by hand, but I am trying to do it more efficiently using the built-in functions.
The desired output is the dataframe that is shown last.
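For reference, a small self-contained sketch of pd.wide_to_long on toy data shaped like the columns above (hypothetical values, only two stubs kept). Note that with column names like t_dur0 the default sep='' is what matches the numeric suffix; passing sep='t_' as in the attempt above likely matches no columns at all, which would explain the empty result.
import pandas as pd

# toy frame with the same naming pattern as t_dur0 ... t_tempo2
df = pd.DataFrame({
    "total_tracks": [3, 2],
    "popularity": [55, 70],
    "t_dur0": [200, 180], "t_dur1": [210, 190], "t_dur2": [220, 195],
    "t_tempo0": [120, 95], "t_tempo1": [122, 97], "t_tempo2": [118, 99],
})

long_df = pd.wide_to_long(
    df.reset_index(),                # keep a unique row id for `i`
    stubnames=["t_dur", "t_tempo"],  # column prefixes to gather
    i=["index", "total_tracks", "popularity"],
    j="take",                        # new column holding the numeric suffix
    # default sep="" and suffix=r"\d+" already match t_dur0, t_dur1, ...
).reset_index()

# average the repeated measurements per track
out = long_df.groupby(["total_tracks", "popularity"], as_index=False)[["t_dur", "t_tempo"]].mean()
print(out)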

How to use the filtered data in Pandas?

I am new to Pandas. Below is part of my code. I am trying to use df_filtered, which holds the rows where the codenum column equals AB123. However, from Line 14 onward, if I use df_filtered instead of excel_data_df, I get no results. The desired columns are picked correctly, but the value filtering is not happening: the codenum column should equal AB123, yet I get the entire Excel file converted to JSON with the chosen columns. Please help me understand how to use the df_filtered data from Line 14 onward.
PathLink = os.path.join(path, 'test' + '.json') #Path Name Formation
excel_data_df = pandas.read_excel('input.xlsx',
                                  sheet_name='input_sheet1',
                                  usecols=[4, 5, 6, 18, 19],  # index starts from 0
                                  names=['codenum', 'name', 'class', 'school', 'city'],
                                  dtype={'codenum': str,
                                         'name': str,
                                         'school': str,
                                         'city': str})  # Excel read and column filtering
df_filtered = excel_data_df.loc[lambda x: x['codenum'] == 'AB123'] # Row Filtering -- need to use this further
excel_data_df.columns = ['Code', 'Student Name', 'Class', 'School Name','City Name'] #renaming columns -- Line Num 14
cols_to_keep = ['Student Name', 'Class', 'School Name','City Name'] # columns to keep
excel_data_df = excel_data_df[cols_to_keep] # columns to keep
excel_data_df # columns to keep
json_str = excel_data_df.to_json(PathLink,orient='records',indent=2) #json converted file
First, a small tip: you can get rid of the lambda by writing
df_filtered = excel_data_df.loc[excel_data_df["codenum"] == "AB123"]
Afterwards, as pointed out in the comments, make sure the dataframe still contains rows after the filtering:
df_filtered = excel_data_df.loc[excel_data_df["codenum"] == "AB123"]
if df_filtered.shape[0]:  # it contains rows
    cols_to_keep = ['Student Name', 'Class', 'School Name', 'City Name']  # columns to keep
    excel_data_df = excel_data_df[cols_to_keep]  # columns to keep
    excel_data_df
    json_str = excel_data_df.to_json(PathLink, orient='records', indent=2)  # json converted file
else:  # it does not contain any rows, i.e. an empty dataframe
    print("Filtered data does not contain data")
Try the following code:
df_filtered = excel_data_df[excel_data_df['codenum'] == 'AB123']
If it still doesn't work, then "codenum" may not contain the value you are trying to filter on.
Thanks all for your inputs. Initially it was returning an empty dataframe, as suggested in the answers and comments above. I'm posting the edited, working code based on your inputs for anyone's future reference.
PathLink = os.path.join(path, 'test' + '.json')  # Path Name Formation
excel_data_df = pandas.read_excel('input.xlsx',
                                  sheet_name='input_sheet1',
                                  usecols=[3, 5, 6, 18, 19],  # index starts from 0  ## edit 1: corrected to the right column index
                                  names=['codenum', 'name', 'class', 'school', 'city'],
                                  dtype={'codenum': str,
                                         'name': str,
                                         'school': str,
                                         'city': str})  # Excel read and column filtering
print(excel_data_df['codenum'].unique())  ## edit 1: returns unique values including AB123
df_filtered = excel_data_df.loc[excel_data_df["codenum"] == 'AB123']  # row filtering  ## edit 1
print(df_filtered)  ## edit 1: check that the expected results are present in the filtered data
df_filtered.columns = ['Code', 'Student Name', 'Class', 'School Name', 'City Name']  # renaming columns
if df_filtered.shape[0]:  # it contains rows  ## edit 1
    cols_to_keep = ['Student Name', 'Class', 'School Name', 'City Name']  # columns to keep
    df_filtered = df_filtered[cols_to_keep]  # columns to keep
    df_filtered
    json_str = df_filtered.to_json(PathLink, orient='records', indent=2)  # json converted file
else:  # it does not contain any rows, i.e. an empty dataframe  ## edit 1
    print("Filtered data does not contain data")
Pandas df.loc returns the filtered result; it does not change the original dataframe. In your code you treated df.loc as if it applied a persistent filter to excel_data_df itself, but it only returns a new, filtered dataframe that you need to keep working with. See the example: df.loc returns the filtered result from the original df.
import pandas as pd
df = pd.DataFrame([[1, "AB123"], [4, "BC123"], [7, "CD123"]],columns=['A', 'B'])
print(df)
# A B
#0 1 AB123
#1 4 BC123
#2 7 CD123
print(df.loc[lambda x: x["B"] == "AB123"])
# A B
#0 1 AB123

Loop filters using a list of keywords (one keyword at a time) in a specific column in Pandas

I want to use a loop to apply filters from a list of different keywords (i.e. different reference numbers) to a specific column (CaseText), filtering on just one keyword at a time so that I do not have to change the keyword manually each time. I want to see the result as a dataframe.
Unfortunately, my code doesn't work. It just returns the whole dataset.
Can anyone help find out what's wrong with my code? Additionally, it would be great if the resulting table could be broken into separate results for each keyword.
Many thanks.
import pandas as pd
pd.set_option('display.max_colwidth', 0)
list_of_files = ['https://open.barnet.gov.uk/download/2nq32/c1d/Open%20Data%20Planning%20Q1%2019-20%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/9wj/Open%20Data%20Planning%202018-19%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/my7/Planning%20Decisions%202017-2018%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/303/Planning%20Decisions%202016-2017%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/zf1/Planning%20Decisions%202015-2016%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/9b3/Open%20Data%20Planning%202014-2015%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/6zz/Open%20Data%20Planning%202013-2014%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/r7m/Open%20Data%20Planning%202012-2013%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/fzw/Open%20Data%20Planning%202011-2012%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/x3w/Open%20Data%20Planning%202010-2011%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/tbc/Open%20Data%20Planning%202009-2010%20-%20NG.csv']
data_container = []
for filename in list_of_files:
    print(filename)
    df = pd.read_csv(filename, encoding='mac_roman')
    data_container.append(df)
all_data = pd.concat(data_container)
# I want to filter the dataset with a single keyword each time, because I have nearly 70 keywords to filter.
reference_list = ['H/04522/11', '15/07697/FUL']
select_data = pd.DataFrame()
for keywords in reference_list:
    select_data = select_data.append(all_data[all_data['CaseText'].str.contains("reference_list", na=False)])
select_data = select_data[['CaseReference', 'CaseDate', 'ServiceTypeLabel', 'CaseText',
                           'DecisionDate', 'Decision', 'AppealRef']]
select_data.drop_duplicates(keep='first', inplace=True)
select_data
One of the problems is that the items in reference_list do not match any of the values in the 'CaseReference' column. Once you figure out which CaseReference numbers you want to search for, the code below should work for you. Just put the correct CaseReference numbers in the reference_list.
import pandas as pd
url = ('https://open.barnet.gov.uk/download/2nq32/fzw/'
'Open%20Data%20Planning%202011-2012%20-%20NG.csv')
data = pd.read_csv(url, encoding='mac_roman')
reference_list = ['hH/02159/13','16/4324/FUL']
select_data = pd.DataFrame()
for keywords in reference_list:
    select_data = select_data.append(data[data['CaseReference'] == keywords],
                                     ignore_index=True)
select_data = select_data[['CaseDate', 'ServiceTypeLabel', 'CaseText',
                           'DecisionDate', 'Decision', 'AppealRef']]
select_data.drop_duplicates(keep='first', inplace=True)
select_data
This should work
import pandas as pd
pd.set_option('display.max_colwidth', 0)
list_of_files = ['https://open.barnet.gov.uk/download/2nq32/c1d/Open%20Data%20Planning%20Q1%2019-20%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/9wj/Open%20Data%20Planning%202018-19%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/my7/Planning%20Decisions%202017-2018%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/303/Planning%20Decisions%202016-2017%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/zf1/Planning%20Decisions%202015-2016%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/9b3/Open%20Data%20Planning%202014-2015%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/6zz/Open%20Data%20Planning%202013-2014%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/r7m/Open%20Data%20Planning%202012-2013%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/fzw/Open%20Data%20Planning%202011-2012%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/x3w/Open%20Data%20Planning%202010-2011%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/tbc/Open%20Data%20Planning%202009-2010%20-%20NG.csv']
# this takes some time
df = pd.concat([pd.read_csv(el, engine='python') for el in list_of_files])  # read all csvs
reference_list = ['H/04522/11', '15/07697/FUL']
# Create an empty dictionary. Each key will be a keyword and each value will be a
# dataframe filtered for rows where 'CaseText' contains that keyword.
reference_dict = dict()
for el in reference_list:
    # Notice the two conditions:
    # 1) the column CaseText should contain the keyword: df['CaseText'].str.contains(el)
    # 2) some elements in CaseText are NaN, so they need to be excluded;
    #    this is what ~(df['CaseText'].isna()) does
    reference_dict[el] = df[(df['CaseText'].str.contains(el)) & ~(df['CaseText'].isna())]
# You can see the resulting dataframes like so: reference_dict[keyword]. For example:
reference_dict['H/04522/11']
UPDATE
If you want one dataframe that includes the cases where any of the keywords appears in the CaseText column, try this:
# let's start after having read in the data
# Separate your keywords with | in one string.
keywords = 'H/04522/11|15/07697/FUL'  # read up on regular expressions to understand this
final_df = df[(df['CaseText'].str.contains(keywords)) & ~(df['CaseText'].isna())]
final_df
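A hedged refinement of the combined filter: if any of the real reference numbers contain regex metacharacters (brackets, plus signs, and so on), escaping them with re.escape keeps str.contains from misreading the pattern, and na=False can replace the separate isna() check:
import re

pattern = '|'.join(re.escape(k) for k in reference_list)       # escape each keyword, then join with |
final_df = df[df['CaseText'].str.contains(pattern, na=False)]  # na=False treats missing CaseText as non-matching
final_df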

Excel diff with python

I am looking for an algorithm to compare two Excel sheets, based on their column names, in Python.
I do not know what the columns are, so one sheet may have an additional column or both sheets can have several columns with the same name.
The easiest case is when a column in the first sheet corresponds to only one column in the second excel sheet. Then I can perform the diff on rows of that column using xlrd.
If the column name is not unique, I can verify if the columns have the same position.
Does anyone know of an already existing algorithm or have any experience in this domain?
Fast and dirty:
# Since the order of the names doesn't matter, we can use the set() option
matching_names = set(sheet_one_names) & set(sheet_two_names)
...
# Here, order does matter since we're comparing row data,
# not just whether they match at some point.
matching_rowdata = [i for i, j in zip(columndata_one, columndata_two) if i != j]
Note: this assumes you have done a few things beforehand: fetched the column names for sheet 1 via xlrd (and the same for the second sheet), and loaded the row data for both sheets into two different variables. This is just to give you an idea.
Also note that with the list-comprehension option (the second one), it's important that the rows are of the same length; otherwise entries will be skipped. As written it collects the mismatches; reverse the comparison to get the matches in the data flow.
This is a slower but functional solution:
column_a_name = ['Location', 'Building', 'Location']
column_a_data = [['Floor 1', 'Main', 'Sweden'],
['Floor 2', 'Main', 'Sweden'],
['Floor 3', 'Main', 'Sweden']]
column_b_name = ['Location', 'Building']
column_b_data = [['Sweden', 'Main', 'Floor 1'],
['Norway', 'Main', 'Floor 2'],
['Sweden', 'Main', 'Floor 3']]
matching_names = []
for pos in range(0, len(column_a_name)):
    try:
        if column_a_name[pos] == column_b_name[pos]:
            matching_names.append((column_a_name[pos], pos))
    except IndexError:
        pass  # index out of range; column lengths are not the same
mismatching_data = []
for row in range(0, len(column_a_data)):
    rowa = column_a_data[row]
    rowb = column_b_data[row]
    for name, _id in matching_names:
        if rowa[_id] != rowb[_id] and (rowa[_id] not in rowb or rowb[_id] not in rowa):
            mismatching_data.append((row, rowa[_id], rowb[_id]))
print(mismatching_data)
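For completeness, a pandas-based sketch of the same idea (file and sheet names are placeholders, and it assumes the column names shared by both sheets are unique within each sheet):
import pandas as pd

sheet_one = pd.read_excel("workbook.xlsx", sheet_name="Sheet1")
sheet_two = pd.read_excel("workbook.xlsx", sheet_name="Sheet2")

# columns present in both sheets, matched by name
shared_cols = sheet_one.columns.intersection(sheet_two.columns)

# compare row by row over the shared columns; extra rows in the longer sheet are ignored
n = min(len(sheet_one), len(sheet_two))
diff_mask = sheet_one.loc[:n - 1, shared_cols].ne(sheet_two.loc[:n - 1, shared_cols])
differences = diff_mask[diff_mask.any(axis=1)]
print(differences)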
