Cleaning raw data with multi-line header

Cleaning raw data with multi-line header - python

I need to organize data that I import from an excel database, the problem is that it have a multiple line header with client info, followed by a lot of lines with payment information. I want to get data from the header and create a new column with the number of the contract and situation of the operation (they are both in the header) and put this information in every payment line, so i can then slice the dataframe easily.
I used to work with Excel, and what i did was create a formula with IF statements in a column that would identify the number of the contract in the header, if not found would copy the cell above.
My code identified one key string in a column an then get the contract value and the status from a pre defined distance between the cells. You can see it in my python for loop below.
The python for loop became too slow, it was the primary reason I abandoned excel, so i hope there is a faster way to do it in python.
I also tried to use the .where() function but i couldn't tind a proper way to get the contract and status information from the header.
the for loop i used was something like this:
report = pd.read_excel('report_filename.xls', header = None)
for j in range(report.shape[0]):
if str(report.loc[j,1])[0:7] == 'Extract':
contract = report.loc[j + 1, 3]
status = report.loc[j + 7, 1]
report.loc['contract #', j] = contrato
report.loc['status'] = status
# Here is the final version of the code i used:
report = pd.read_excel('report_filename.xls', header = None)
report['Contract #'] = None
report['Status'] = None
for i, row in report.iterrows():
if str(row[1]).lower().startswith('extract'):
report.at[i, 'Contract #'] = report.at[i+1, 3]
report.at[i, 'Status'] = report.at[i+7, 1]
report['Contract #'] = report['Contract #'].ffill(axis = 0)
report['Status'] = report['Status'].ffill(axis = 0)
report = report[report['Status'] != 'Inactive']

Can you use pandas.iterrows?
import pandas as pd
report = pd.read_excel('report_filename.xls', header = None)
newreport = report
newreport['Contract #'] = ''
newreport['Status'] = ''
for i, row in report.iterrows():
if row[1].lower().startswith('extract'):
newreport.at[i, 'Contract #'] = report.at[i+1, 3]
newreport.at[i, 'Status'] = report.at[i+7, 1]

Related

DataFrame returns Value Error after adding auto index

This script needs to query the DC server for events. Since this is done live, each time the server is queried, it returns query results of varying lengths. The log file is long and messy, as most logs are. I need to filter only the event names and their codes and then create a DataFrame. Additionally, I need to add a third column that counts the number of times each event took place. I've done most of it but can't figure out how to fix the error I'm getting.
After doing all the filtering from Elasticsearch, I get two lists - action and code - which I have emulated here.
action_list = ['logged-out', 'logged-out', 'logged-out', 'Directory Service Access', 'Directory Service Access', 'Directory Service Access', 'logged-out', 'logged-out', 'Directory Service Access', 'created-process', 'created-process']
code_list = ['4634', '4634', '4634', '4662', '4662', '4662', '4634', '4634', '4662','4688']
I then created a list that contains only the codes that need to be filtered out.
event_code_list = ['4662', '4688']
My script is as follows:
import pandas as pd
from collections import Counter
#Create a dict that combines action and code
lists2dict = {}
lists2dict = dict(zip(action_list,code_list))
# print(lists2dict)
#Filter only wanted eventss
filtered_events = {k: v for k, v in lists2dict.items() if v in event_code_list}
# print(filtered_events)
index = 1 * pd.RangeIndex(start=1, stop=2) #add automatic index to DataFrame
df = pd.DataFrame(filtered_events,index=index)#Create DataFrame from filtered events
#Create Auto Index
count = Counter(df)
action_count = dict(Counter(count))
action_count_values = action_count.values()
# print(action_count_values)
#Convert Columns to Rows and Add Index
new_df = df.melt(var_name="Event",value_name="Code")
new_df['Count'] = action_count_values
print(new_df)
Up until this point, everything works as it should. The problem is what comes next. If there are no events, the script outputs an empty DataFrame. This works fine. However, if there are events, then we should see the events, the codes, and the number of times each event occurred. The problem is that it always outputs 1. How can I fix this? I'm sure it's something ridiculous that I'm missing.
#If no alerts, create empty DataFrame
if new_df.empty:
empty_df = pd.DataFrame(columns=['Event','Code','Count'])
empty_df['Event'] = ['-']
empty_df['Code'] = ['-']
empty_df['Count'] = ['-']
empty_df.to_html()
html = empty_df.to_html()
with open('alerts.html', 'w') as f:
f.write(html)
else: #else, output alerts + codes + count
new_df.to_html()
html = new_df.to_html()
with open('alerts.html', 'w') as f:
f.write(html)
Any help is appreciated.

It is because you are collecting the result as dictionary - the repeated records are ignored. You lost the record count here: lists2dict = dict(zip(action_list,code_list)).
You can do all these operations very easily on dataframe. Just construct a pandas dataframe from given lists, then filter by code, groupby, and aggregate as count:
df = pd.DataFrame({"Event": action_list, "Code": code_list})
df = df[df.Code.isin(event_code_list)] \
.groupby(["Event", "Code"]) \
.agg(Count = ("Code", len)) \
.reset_index()
print(df)
Output:
Event Code Count
0 Directory Service Access 4662 4
1 created-process 4688 2

How do I loop column names in a pandas dataframe?

I am new to Python and have never really used Pandas, so forgive me if this doesn't make sense. I am trying to create a df based on frontend data I am sending to a flask route. The data is looped through and appended for each row. My only problem is that I don't know how to get the df columns to reflect that. Here is my code to build the rows and the current output:
claims = csv_data["claims"]
setups = csv_data["setups"]
for setup in setups:
setup = setups[0]
offerings = setup["currentOfferings"]
considered = setup["considerationSet"]
reach_dict = setup["reach"]
favorite_dict = setup["favorite"]
summary_dict = setup["summaryMetrics"]
rows = []
for i, claim in enumerate(claims):
row = []
row.append(i + 1)
row.append(claim)
for setup in setups:
setup = setups[0]
row.append("X") if claim in setup["currentOfferings"] else row.append(float('nan'))
row.append("X") if claim in setup["considerationSet"] else row.append(float('nan'))
if claim in setup["currentOfferings"]:
reach_score = reach_dict[claim]
reach_percentage = "{:.0%}".format(reach_score)
row.append(reach_percentage)
else:
row.append(float('nan'))
if claim in setup["currentOfferings"]:
favorite_score = favorite_dict[claim]
fav_percentage = "{:.0%}".format(favorite_score)
row.append(fav_percentage)
else:
row.append(float('nan'))
rows.append(row)
I know that I can put columns = ["#", "Claims", "Setups", etc...] in the df, but that doesn't work because the rows are looping through multiple setups, and the number of setups can change. If I don't specify the column names (how it is in the image), then I just have numbers as columns names. Ideally it should loop through the data it receives in the route, and would start with "#" "Claims" as columns, and then for each setup "Setup 1", "Consideration Set 1", "Reach", "Favorite", "Setup 2", "Consideration Set 2", and so on... etc.
I tried to create a similar type of loop for the columns:
my_columns = []
for i, row in enumerate(rows):
col = []
if row[0] != None:
col.append("#")
else:
pass
if row[1] != None:
col.append("Claims")
else:
pass
if row[2] != None:
col.append("Setup")
else:
pass
if row[3] != None:
col.append("Consideration Set")
else:
pass
if row[4] != None:
col.append("Reach")
else:
pass
if row[5] != None:
col.append("Favorite")
else:
pass
my_columns.append(col)
df = pd.DataFrame(
rows,
columns = my_columns
)
But this didn't work because I have the same issue of no loop, I have 6 columns passed and 10 data columns passed. I'm not sure if I am just not doing the loop of the columns properly, or if I am making everything more complicated than it needs to be.
This is what I am trying to accomplish without having to explicitly name the columns because this is just sample data. There could end up being 3, 4, however many setups in the actual app.
what I would like the ouput to look like

I don't know if this is the most efficient way of doing something like this but I think this is what you want to achieve.
def create_columns(df):
new_cols=[]
for i in range(len(df.columns)):
repeated_cols = 6 #here is the number of columns you need to repeat for every setup
idx = 1 + i // repeated_cols
basic = ['#', 'Claims', f'Setup_{idx}', f'Consideration_Set_{idx}', 'Reach', 'Favorite']
new_cols.append(basic[i % len(basic)])
return new_cols
df.columns = create_columns(df)

If your data comes as csv then try pd.read_csv() to create dataframe.

Drop rows in dataframe if length of the name columns <=1

Please point out where i am doing wrong or a duplicate of this question
I have 11 columns in my table, i am loading data from Ceph(AWS) bucket to Postgres and while doing that i have to filter the data with the below conditions before inserting data into Postgres
Drop the entire row if there is any empty/ Null values in any column
First name and last name should have more than a single letter. Ex : first name = A or last name = P, any record either first name or last name or both , entire record/row should be dropped
Zip code should be 5 digit or greater . Max 7 digit
First name and last name records should not have [Jr, Sr, I, II, etc] in it. or drop the entire record
i have managed to execute the first step (new to pandas) but i was blocked at the next step and i believe that it might also help me solve step3 if i find a solution for step2. While doing a quick research in google, I found that i might be complicating the process by using chunks and might have to use 'concat' to apply it for all chunks or may be i am wrong but i am dealing with huge amount of data and using chunks would help me load the data faster into Postgres.
I am going to paste my code here and mention what i tried, what was the output and what would be the expected output
what i tried:
columns = [
'cust_last_nm',
'cust_frst_nm',
'cust_brth_dt',
'cust_gendr_cd',
'cust_postl_cd',
'indiv_entpr_id',
'TOKEN_1',
'TOKEN_2',
'TOKEN_3',
'TOKEN_4',
'TOKEN_KEY'
]
def push_to_pg_weekly(key):
vants = []
print(key)
key = _download_s3(key)
how_many_files_pushed.append(True)
s=sp.Popen(["wc", "-l", key], stdout=sp.PIPE)
a, b = s.communicate()
total_rows = int(a.split()[0])
rows = 0
data = pd.read_csv(key, sep="|", header=None, chunksize=100000)
for chunk in data:
rows += len(chunk)
print("Processed rows: ", (float(rows)/total_rows)*100)
chunk = chunk.dropna(axis=0) #step-1 Drop the rows where at least one element is missing.
index_names = chunk[(len(chunk[0]) <= 1) | (len(chunk[1]) <= 1)].index #step2
chunk.drop(index_names, axis=0)
chunk.to_csv("/tmp/sample.csv", sep="|", header=None, index=False)
connection = psycopg2.connect(user = os.environ.get("DATABASE_USER", “USERNAME”),
password = os.environ.get("DATABASE_PASS", “PASSWORD“),
host = os.environ.get("DATABASE_HOST", "cvlpsql.pgsql.com"),
port = 5432,
dbname = os.environ.get("DATABASE_NAME", "cvlpsql_db"),
options = "-c search_path=DATAVANT_O")
with connection.cursor() as cursor:
cursor.copy_from(open('/tmp/sample.csv'), "COVID1", sep='|')
connection.commit()
def push_to_pg():
paginator = CLIENT.get_paginator('list_objects')
pages = paginator.paginate(Bucket=bucket)
for page in pages:
if "Contents" in page:
for obj in page["Contents"]:
if obj['Key'].startswith('test/covid-2020-11-10-175213') and (obj['Key'].endswith('.txt') or obj['Key'].endswith('.csv')):
push_to_pg_weekly(obj['Key'])
os.remove(obj['Key'])
return
Data:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
j|ab|1978-01-01|M|328.0|125135976|yjYaupdG9gdlje+2HdQB+FdEEj6Lh4+WekqEuB1DSvM=|j8VuTUKll7mywqsKrqBnomppGutsoJAR+2IoH/Tq0b8=|6qNP9ch57MlX912gXS7RMg7UfjtaP6by/cR68PbzNmQ=|R5DemSNrFvcevijrktwf3aixOShNU6j7wfahyKeUyzk=|cigna_TOKEN_ENCRYPTION_KEY
j|j|1985-01-01|F|105.0|115144390|fn0r8nVzmDJUihnaQh1SXm1sLOIjzGsPDBskdX4/b+0=|Fh6facONoOiL9hCCA8Q1rtUp9n5h9VBhg2IaX9gjaKI=|NWtnZegpcpgcit2u063zQv3pcEhk4bpKHKFa9hW7LtU=|P3cVOUd6PyYN5tKezdMkVDI62aW8dv+bjIwKtAgX3OM=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
||1969-01-01|M|926.0|135112782|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
||1978-01-01|M|70.0|170737333|Q8NDJz563UrquOUUz0vD6Es05vIaAD/AfVOef4Mhj24=|k5Q02GVd0nJ6xMs1vHVM24MxV6tZ46HJNKoePcDsyoM=|C9cvHz5n+sDycUecioiWZW8USE6D2dli5gRzo4nOyvY=|z4eNSVNDAjiPU2Sw3VY+Ni1djO5fptl5FGQvfnBodr4=|cigna_TOKEN_ENCRYPTION_KEY
||1996-01-01|M|840.0|91951973|Y4kmxp0qdZVCW5pJgQmvWCfc4URg9oFnv2DWGglfQKM=|RJfyDYJjwuZ1ZDjP+5PA5S2fLS6llFD51Lg+uJ84Tus=|+PXzrKt7O79FehSnL3Q8EjGmnyZVDUfdM4zzHk1ghOY=|gjyVKjunky2Aui3dxzmeLt0U6+vT39/uILMbEiT0co8=|cigna_TOKEN_ENCRYPTION_KEY
||1960-01-01|M|180.0|64496569|80e1CgNJeO8oYQHlSn8zWYL4vVrHSPe9AnK2T2PrdII=|bJl7veT+4MlU4j2mhFpFyins0xeCFWeaA30JUzWsfqo=|0GuhUfbS4xCnCj2ms43wqmGFG5lCnfiIQdyti9moneM=|lq84jO9yhz8f9/DUM0ACVc/Rp+sKDvHznVjNnLOaRo4=|cigna_TOKEN_ENCRYPTION_KEY
||1963-01-01|M|310.0|122732991|zEvHkd5AVT7hZFR3/13dR9KzN5WSulewY0pjTFEov2Y=|eGqNbLoeCN1GJyvgaa01w+z26OtmplcrAY2vxwOZ4Y4=|6q9DPLPK5PPAItZA/x253DvdAWA/r6zIi0dtIqPIu2g=|lOl11DhznPphGQOFz6YFJ8i28HID1T6Sg7B/Y7W1M3o=|cigna_TOKEN_ENCRYPTION_KEY
||2001-01-01|F|650.0|43653178|vv/+KLdhHqUm13bWhpzBexwxgosXSIzgrxZIUwB7PDo=|78cJu1biJAlMddJT1yIzQAH1KCkyDoXiL1+Lo1I2jkw=|9/BM/hvqHYXgfmWehPP2JGGuB6lKmfu7uUsmCtpPyz8=|o/yP8bMzFl6KJ1cX+uFll1SrleCC+8BXmqBzyuGdtwM=|cigna_TOKEN_ENCRYPTION_KEY
output - data inserted into postgresDB:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
j|ab|1978-01-01|M|328.0|125135976|yjYaupdG9gdlje+2HdQB+FdEEj6Lh4+WekqEuB1DSvM=|j8VuTUKll7mywqsKrqBnomppGutsoJAR+2IoH/Tq0b8=|6qNP9ch57MlX912gXS7RMg7UfjtaP6by/cR68PbzNmQ=|R5DemSNrFvcevijrktwf3aixOShNU6j7wfahyKeUyzk=|cigna_TOKEN_ENCRYPTION_KEY
j|j|1985-01-01|F|105.0|115144390|fn0r8nVzmDJUihnaQh1SXm1sLOIjzGsPDBskdX4/b+0=|Fh6facONoOiL9hCCA8Q1rtUp9n5h9VBhg2IaX9gjaKI=|NWtnZegpcpgcit2u063zQv3pcEhk4bpKHKFa9hW7LtU=|P3cVOUd6PyYN5tKezdMkVDI62aW8dv+bjIwKtAgX3OM=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
Expected Output:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
Any answers/comments will be very much appriciated, thank you

Fastest way to do operations like this on pandas is through numpy.where.
eg for String length:
data = data[np.where((data['cust_last_nm'].str.len()>1) &
(data['cust_frst_nm'].str.len()>1), True, False)]
Note: you can add postal code condition in same way. by default in your data postal codes will read in as floats, so cast them to string first, and then set length limit:
## string length & postal code conditions together
data = data[np.where((data['cust_last_nm'].str.len()>1) &
(data['cust_frst_nm'].str.len()>1) &
(data['cust_postl_cd'].astype('str').str.len()>4) &
(data['cust_postl_cd'].astype('str').str.len()<8)
, True, False)]
EDIT:
Since you working in chunks, change the data to chunk and put this inside your loop. Also, since you don't import headers (headers=0, change column names to their index values. And convert all values to strings before comparison, since otherwise NaN columns will be treated as floats eg:
chunk = chunk[np.where((chunk[0].astype('str').str.len()>1) &
(chunk[1].astype('str').str.len()>1) &
(chunk[5].astype('str').str.len()>4) &
(chunk[5].astype('str').str.len()<8), True, False)]

Create a new column in the dataframe with a value for the length:
df['name_length'] = df.name.str.len()
Index using the new column:
df = df[df.name_length > 1]

Script keep showing "SettingCopyWithWarning'

Hello my problem is that my script keep showing below message
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
downcast=downcast
I Searched the google for a while regarding this, and it seems like my code is somehow
assigning sliced dataframe to new variable, which is problematic.
The problem is ** I can't find where my code get problematic **
I tried copy function, or seperated the nested functions, but it is not working
I attached my code below.
def case_sorting(file_get, col_get, methods_get, operator_get, value_get):
ops = {">": gt, "<": lt}
col_get = str(col_get)
value_get = int(value_get)
if methods_get is "|x|":
new_file = file_get[ops[operator_get](file_get[col_get], value_get)]
else:
new_file = file_get[ops[operator_get](file_get[col_get], np.percentile(file_get[col_get], value_get))]
return new_file
Basically what i was about to do was to make flask api that gets excel file as an input, and returns the csv file with some filtering. So I defined some functions first.
def get_brandlist(df_input, brand_input):
if brand_input == "default":
final_list = (pd.unique(df_input["브랜드"])).tolist()
else:
final_list = brand_input.split("/")
if '브랜드' in final_list:
final_list.remove('브랜드')
final_list = [x for x in final_list if str(x) != 'nan']
return final_list
Then I defined the main function
def select_bestitem(df_data, brand_name, col_name, methods, operator, value):
# // 2-1 // to remove unnecessary rows and columns with na values
df_data = df_data.dropna(axis=0 & 1, how='all')
df_data.fillna(method='pad', inplace=True)
# // 2-2 // iterate over all rows to find which row contains brand value
default_number = 0
for row in df_data.itertuples():
if '브랜드' in row:
df_data.columns = df_data.iloc[default_number, :]
break
else:
default_number = default_number + 1
# // 2-3 // create the list contains all the target brand names
brand_list = get_brandlist(df_input=df_data, brand_input=brand_name)
# // 2-4 // subset the target brand into another dataframe
df_data_refined = df_data[df_data.iloc[:, 1].isin(brand_list)]
# // 2-5 // split the dataframe based on the "brand name", and apply the input condition
df_per_brand = {}
df_per_brand_modified = {}
for brand_each in brand_list:
df_per_brand[brand_each] = df_data_refined[df_data_refined['브랜드'] == brand_each]
file = df_per_brand[brand_each].copy()
df_per_brand_modified[brand_each] = case_sorting(file_get=file, col_get=col_name, methods_get=methods,
operator_get=operator, value_get=value)
# // 2-6 // merge all the remaining dataframe
df_merged = pd.DataFrame()
for brand_each in brand_list:
df_merged = df_merged.append(df_per_brand_modified[brand_each], ignore_index=True)
final_df = df_merged.to_csv(index=False, sep=',', encoding='utf-8')
return final_df
And I am gonna import this function in my app.py later
I am quite new to all the coding, therefore really really sorry if my code is quite hard to understand, but I just really wanted to get rid of this annoying warning message. Thanks for help in advance :)

Python - Pandas library returns wrong column values after parsing a CSV file

SOLVED Found the solution by myself. Turns out that when you want to retrieve specific columns by their names you should pass the names in the order they appear inside the csv (which is really stupid for a library that is intended to save some parsing time for a developer IMO). Correct me if I am wrong but i dont see a on option to get a specific columns values by its name if the columns are in a different order...
I am trying to read a comma separated value file with python and then
parse it using Pandas library. Since the file has many values (columns) that are not needed I make a list of the column names i do need.
Here's a look at the csv file format.
Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Attendance,Referee,HS,AS,HST,AST,HHW,AHW,HC,AC,HF,AF,HO,AO,HY,AY,HR,AR,HBP,ABP,GBH,GBD,GBA,IWH,IWD,IWA,LBH,LBD,LBA,SBH,SBD,SBA,WHH,WHD,WHA
E0,19/08/00,Charlton,Man City,4,0,H,2,0,H,20043,Rob
Harris,17,8,14,4,2,1,6,6,13,12,8,6,1,2,0,0,10,20,2,3,3.2,2.2,2.9,2.7,2.2,3.25,2.75,2.2,3.25,2.88,2.1,3.2,3.1
E0,19/08/00,Chelsea,West Ham,4,2,H,1,0,H,34914,Graham
Barber,17,12,10,5,1,0,7,7,19,14,2,3,1,2,0,0,10,20,1.47,3.4,5.2,1.6,3.2,4.2,1.5,3.4,6,1.5,3.6,6,1.44,3.6,6.5
E0,19/08/00,Coventry,Middlesbrough,1,3,A,1,1,D,20624,Barry
Knight,6,16,3,9,0,1,8,4,15,21,1,3,5,3,1,0,75,30,2.15,3,3,2.2,2.9,2.7,2.25,3.2,2.75,2.3,3.2,2.75,2.3,3.2,2.62
E0,19/08/00,Derby,Southampton,2,2,D,1,2,A,27223,Andy
D'Urso,6,13,4,6,0,0,5,8,11,13,0,2,1,1,0,0,10,10,2,3.1,3.2,1.8,3,3.5,2.2,3.25,2.75,2.05,3.2,3.2,2,3.2,3.2
E0,19/08/00,Leeds,Everton,2,0,H,2,0,H,40010,Dermot
Gallagher,17,12,8,6,0,0,6,4,21,20,6,1,1,3,0,0,10,30,1.65,3.3,4.3,1.55,3.3,4.5,1.55,3.5,5,1.57,3.6,5,1.61,3.5,4.5
E0,19/08/00,Leicester,Aston Villa,0,0,D,0,0,D,21455,Mike
Riley,5,5,4,3,0,0,5,4,12,12,1,4,2,3,0,0,20,30,2.15,3.1,2.9,2.3,2.9,2.5,2.35,3.2,2.6,2.25,3.25,2.75,2.4,3.25,2.5
E0,19/08/00,Liverpool,Bradford,1,0,H,0,0,D,44183,Paul
Durkin,16,3,10,2,0,0,6,1,8,8,5,0,1,1,0,0,10,10,1.25,4.1,7.2,1.25,4.3,8,1.35,4,8,1.36,4,8,1.33,4,8
This list is passed to pandas.read_csv()'s names parameter.
See code.
# Returns an array of the column names needed for our raw data table
def cols_to_extract():
cols_to_use = [None] * RawDataCols.COUNT
cols_to_use[RawDataCols.DATE] = 'Date'
cols_to_use[RawDataCols.HOME_TEAM] = 'HomeTeam'
cols_to_use[RawDataCols.AWAY_TEAM] = 'AwayTeam'
cols_to_use[RawDataCols.FTHG] = 'FTHG'
cols_to_use[RawDataCols.HG] = 'HG'
cols_to_use[RawDataCols.FTAG] = 'FTAG'
cols_to_use[RawDataCols.AG] = 'AG'
cols_to_use[RawDataCols.FTR] = 'FTR'
cols_to_use[RawDataCols.RES] = 'Res'
cols_to_use[RawDataCols.HTHG] = 'HTHG'
cols_to_use[RawDataCols.HTAG] = 'HTAG'
cols_to_use[RawDataCols.HTR] = 'HTR'
cols_to_use[RawDataCols.ATTENDANCE] = 'Attendance'
cols_to_use[RawDataCols.HS] = 'HS'
cols_to_use[RawDataCols.AS] = 'AS'
cols_to_use[RawDataCols.HST] = 'HST'
cols_to_use[RawDataCols.AST] = 'AST'
cols_to_use[RawDataCols.HHW] = 'HHW'
cols_to_use[RawDataCols.AHW] = 'AHW'
cols_to_use[RawDataCols.HC] = 'HC'
cols_to_use[RawDataCols.AC] = 'AC'
cols_to_use[RawDataCols.HF] = 'HF'
cols_to_use[RawDataCols.AF] = 'AF'
cols_to_use[RawDataCols.HFKC] = 'HFKC'
cols_to_use[RawDataCols.AFKC] = 'AFKC'
cols_to_use[RawDataCols.HO] = 'HO'
cols_to_use[RawDataCols.AO] = 'AO'
cols_to_use[RawDataCols.HY] = 'HY'
cols_to_use[RawDataCols.AY] = 'AY'
cols_to_use[RawDataCols.HR] = 'HR'
cols_to_use[RawDataCols.AR] = 'AR'
return cols_to_use
# Extracts raw data from the raw data csv and populates the raw match data table in the database
def extract_raw_data(csv):
# Clear the database table if it has any logs
# if MatchRawData.objects.count != 0:
# MatchRawData.objects.delete()
cols_to_use = cols_to_extract()
# Read and parse the csv file
parsed_csv = pd.read_csv(csv, delimiter=',', names=cols_to_use, header=0)
for col in cols_to_use:
values = parsed_csv[col].values
for val in values:
print(str(col) + ' --------> ' + str(val))
Where RawDataCols is an IntEnum.
class RawDataCols(IntEnum):
DATE = 0
HOME_TEAM = 1
AWAY_TEAM = 2
FTHG = 3
HG = 4
FTAG = 5
AG = 6
FTR = 7
RES = 8
...
The column names are obtained using it. That part of code works ok. The correct column name is obtained but after trying to get its values using
values = parsed_csv[col].values
pandas return the values of a wrong column. The wrong column index is around 13 indexes away from the one i am trying to get. What am i missing?

You can select column by name wise.Just use following line
values = parsed_csv[["Column Name","Column Name2"]]
Or you select Index wise by
cols = [1,2,3,4]
values = parsed_csv[parsed_csv.columns[cols]]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Cleaning raw data with multi-line header - python

Related

DataFrame returns Value Error after adding auto index

How do I loop column names in a pandas dataframe?

Drop rows in dataframe if length of the name columns <=1

Script keep showing "SettingCopyWithWarning'

Python - Pandas library returns wrong column values after parsing a CSV file

Categories

Resources