I have a .txt file with three columns in it.
id ImplementationAuthority.email AssignedEngineer.email
ALU02034116 bin.a.chen#shan.cn bin.a.chen#ell.com.cn
ALU02035113 Guolin.Pan#ell.com.cn
ALU02034116 bin.a.chen#ming.com.cn Guolin.Pan#ell.com.cn
ALU02022055 fria-sha-qdv#list.com
ALU02030797 fria-che-equipment-1#phoenix.com Balagopal.Velusamy#phoenix.com
I need to create two lists containing the values from the ImplementationAuthority.email and AssignedEngineer.email columns. My code works perfectly when every column has a value (i.e. no null values), but the values get mixed up when a column contains null values.
aengg = []
iauth = []
with open('test.txt') as f:
    for i, row in enumerate(f):
        columns = row.split()
        if len(columns) == 3:
            aengg.append(columns[2])
            iauth.append(columns[1])
print aengg
print iauth
I tried this code and it works perfectly when all the column values are present.
Can anyone please suggest a solution for the null values?
It seems you don't have a real separator, so I use the number of spaces to decide which column is missing and fill the blank with None.
Try this:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
aengg = []
iauth = []
with open('C:\\temp\\test.txt') as f:
    for i, row in enumerate(f):
        columns = row.split()
        if len(columns) == 2:
            # when there are more than 17 spaces before the second element,
            # I consider it the third column, so I insert a None between them
            if row.index(columns[1]) > 17:
                columns.insert(1, None)
            # when there are fewer than 17 spaces, I consider it the second
            # column, so I append a None to the tail
            else:
                columns.append(None)
        print columns
        aengg.append(columns[2])
        iauth.append(columns[1])
print aengg
print iauth
Here is the output.
['id', 'ImplementationAuthority.email', 'AssignedEngineer.email']
['ALU02034116', 'bin.a.chen#shan.cn', 'bin.a.chen#ell.com.cn']
['ALU02035113', None, 'Guolin.Pan#ell.com.cn']
['ALU02034116', 'bin.a.chen#ming.com.cn', 'Guolin.Pan#ell.com.cn']
['ALU02022055', 'fria-sha-qdv#list.com', None]
['ALU02030797', 'fria-che-equipment-1#phoenix.com', 'Balagopal.Velusamy#phoenix.com']
['AssignedEngineer.email', 'bin.a.chen#ell.com.cn', 'Guolin.Pan#ell.com.cn', 'Guolin.Pan#ell.com.cn', None, 'Balagopal.Velusamy#phoenix.com']
['ImplementationAuthority.email', 'bin.a.chen#shan.cn', None, 'bin.a.chen#ming.com.cn', 'fria-sha-qdv#list.com', 'fria-che-equipment-1#phoenix.com']
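An alternative sketch, assuming the original file really is fixed-width (the columns line up under the header): pandas.read_fwf can infer the column boundaries and leaves a missing cell as NaN instead of shifting the remaining values, which avoids hand-tuning the 17-space threshold.
import pandas as pd

# A sketch, not the original answer's approach: treat the file as fixed-width.
df = pd.read_fwf('test.txt')
iauth = df['ImplementationAuthority.email'].tolist()
aengg = df['AssignedEngineer.email'].tolist()
print(iauth)
print(aengg)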
You need to place a 'null' or 0 as a placeholder; otherwise the interpreter reads Guolin.Pan#ell.com.cn in the second row as the second column.
Try this:
id ImplementationAuthority.email AssignedEngineer.email
ALU02034116 bin.a.chen#shan.cn bin.a.chen#ell.com.cn
ALU02035113 null Guolin.Pan#ell.com.cn
ALU02034116 bin.a.chen#ming.com.cn Guolin.Pan#ell.com.cn
ALU02022055 fria-sha-qdv#list.com null
ALU02030797 fria-che-equipment-1#phoenix.com Balagopal.Velusamy#phoenix.com
Then append the values after checking that they are not "null".
with open('test.txt') as f:
    for i, row in enumerate(f):
        columns = row.split()
        if len(columns) == 3:
            if columns[2] != "null":
                aengg.append(columns[2])
            if columns[1] != "null":
                iauth.append(columns[1])
I have a table in PowerPoint where (after rendering with some other function) every other row remains completely empty. I tried to solve that by merging every empty row with the row below it, as follows:
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

def is_empty_row(row):
    for cell in row.cells:
        if len(cell.text):
            return False
    return True

def merge_empty_row(table, index):  # Assumes no 2 consecutive rows are empty!
    row = table.rows[index]
    try:
        next_row = table.rows[index + 1]
    except IndexError:
        return
    cell_1 = row.cells[0]
    cell_2 = next_row.cells[len(next_row.cells) - 1]
    cell_1.merge(cell_2)

def fix_tables(document):
    ppt = Presentation(document)
    for slide in ppt.slides:
        for shape in slide.shapes:
            if shape.shape_type == MSO_SHAPE_TYPE.TABLE:
                for index in range(len(shape.table.rows)):
                    if is_empty_row(shape.table.rows[index]):
                        merge_empty_row(shape.table, index)
    docname = "".join(document.split(".")[0])
    ppt.save(docname + '.out.pptx')
And I am calling this function from Django on a template pptx file, only to get the following error:
Exception Type: XMLSyntaxError at /amas/analysis/1178/report/download/34
Exception Value: Opening and ending tag mismatch: r line 2 and t, line 2, column 11532 (<string>, line 2)
Any ideas?
My first choice would be avoiding inserting empty rows. But if that weren't possible for some reason you could try deleting empty rows like this:
def delete_row(row):
    tr = row._tr
    tr.getparent().remove(tr)

rows = [table.rows[i] for i in range(len(table.rows))]
empty_rows = [r for r in rows if is_empty_row(r)]
for row in empty_rows:
    delete_row(row)
You need to identify the empty rows separately beforehand because otherwise deleting them in the middle of iteration can screw up the references (change which row rows[i] points to).
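Wired into the question's fix_tables, that could look roughly like this (a sketch, assuming the is_empty_row helper from the question is defined):
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

def delete_row(row):
    # Remove the row's underlying <a:tr> element from the table's XML tree.
    tr = row._tr
    tr.getparent().remove(tr)

def fix_tables(document):
    ppt = Presentation(document)
    for slide in ppt.slides:
        for shape in slide.shapes:
            if shape.shape_type == MSO_SHAPE_TYPE.TABLE:
                table = shape.table
                # Collect the empty rows first, then delete them, so the
                # deletions don't disturb the iteration.
                empty_rows = [r for r in table.rows if is_empty_row(r)]
                for row in empty_rows:
                    delete_row(row)
    ppt.save(document.split(".")[0] + '.out.pptx')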
It's probably a silly thing, but I can't seem to correctly convert a pandas Series, originally read from an Excel sheet, to a list.
dfCI is created by importing data from an Excel sheet and looks like this:
tab var val
MsrData sortfield DetailID
MsrData strow 4
MsrData inputneeded "MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided","BiMonthlyTest"
# get list of cols for which input is needed
cols = dfCI[((dfCI['var'] == 'inputneeded') & (dfCI['tab'] == 'MsrData'))]['val'].values.tolist()
print(cols)
>> ['"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"']
# replace null text with text
invalid = 'Input Needed'
for col in cols:
    dfMSR[col] = np.where((dfMSR[col].isnull()), invalid, dfMSR[col])
However, the extra set of (single) quotes added when I converted cols from a Series to a list makes all the columns a single value, so that
col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
The desired output for cols is
cols = ["MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"]
What am I doing wrong?
Once you've got col, you can convert it to your expected output:
In [1109]: col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
In [1114]: cols = [i.strip() for i in col.replace('"', '').split(',')]
In [1115]: cols
Out[1115]: ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Another possible solution that comes to mind given the structure of cols is:
list(eval(cols[0])) # ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Although this is valid, it's less safe, and I would go with the list comprehension as #MayankPorwal suggested.
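A safer variant of that idea (a sketch, not from either answer) is ast.literal_eval, which parses the quoted, comma-separated string as a tuple of plain strings while refusing arbitrary expressions:
import ast

col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
# literal_eval only accepts Python literals, so unlike eval it cannot run code.
cols = list(ast.literal_eval(col))
print(cols)  # ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']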
Please point out where I am going wrong, or whether this is a duplicate of another question.
I have 11 columns in my table. I am loading data from a Ceph (AWS) bucket into Postgres, and while doing that I have to filter the data with the conditions below before inserting it:
1. Drop the entire row if there is any empty/null value in any column.
2. First name and last name should each have more than a single letter. E.g. if first name = A or last name = P (either one, or both), the entire record/row should be dropped.
3. Zip code should be 5 digits or greater, 7 digits at most.
4. First name and last name should not contain [Jr, Sr, I, II, etc.]; otherwise drop the entire record.
I have managed to implement the first step (I am new to pandas), but I got stuck at the next one, and I believe that solving step 2 would also help me with step 3. From some quick research on Google, I gather that I might be over-complicating things by using chunks and might have to use 'concat' to apply the filters across all chunks, or maybe I am wrong; but I am dealing with a huge amount of data, and chunking helps me load it into Postgres faster.
I will paste my code here, along with what I tried, what the output was, and what the expected output should be.
What I tried:
columns = [
    'cust_last_nm',
    'cust_frst_nm',
    'cust_brth_dt',
    'cust_gendr_cd',
    'cust_postl_cd',
    'indiv_entpr_id',
    'TOKEN_1',
    'TOKEN_2',
    'TOKEN_3',
    'TOKEN_4',
    'TOKEN_KEY'
]

def push_to_pg_weekly(key):
    vants = []
    print(key)
    key = _download_s3(key)
    how_many_files_pushed.append(True)
    s = sp.Popen(["wc", "-l", key], stdout=sp.PIPE)
    a, b = s.communicate()
    total_rows = int(a.split()[0])
    rows = 0
    data = pd.read_csv(key, sep="|", header=None, chunksize=100000)
    for chunk in data:
        rows += len(chunk)
        print("Processed rows: ", (float(rows)/total_rows)*100)
        chunk = chunk.dropna(axis=0)  # step 1: drop the rows where at least one element is missing
        index_names = chunk[(len(chunk[0]) <= 1) | (len(chunk[1]) <= 1)].index  # step 2
        chunk.drop(index_names, axis=0)
        chunk.to_csv("/tmp/sample.csv", sep="|", header=None, index=False)
        connection = psycopg2.connect(user=os.environ.get("DATABASE_USER", "USERNAME"),
                                      password=os.environ.get("DATABASE_PASS", "PASSWORD"),
                                      host=os.environ.get("DATABASE_HOST", "cvlpsql.pgsql.com"),
                                      port=5432,
                                      dbname=os.environ.get("DATABASE_NAME", "cvlpsql_db"),
                                      options="-c search_path=DATAVANT_O")
        with connection.cursor() as cursor:
            cursor.copy_from(open('/tmp/sample.csv'), "COVID1", sep='|')
        connection.commit()

def push_to_pg():
    paginator = CLIENT.get_paginator('list_objects')
    pages = paginator.paginate(Bucket=bucket)
    for page in pages:
        if "Contents" in page:
            for obj in page["Contents"]:
                if obj['Key'].startswith('test/covid-2020-11-10-175213') and (obj['Key'].endswith('.txt') or obj['Key'].endswith('.csv')):
                    push_to_pg_weekly(obj['Key'])
                    os.remove(obj['Key'])
    return
Data:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
j|ab|1978-01-01|M|328.0|125135976|yjYaupdG9gdlje+2HdQB+FdEEj6Lh4+WekqEuB1DSvM=|j8VuTUKll7mywqsKrqBnomppGutsoJAR+2IoH/Tq0b8=|6qNP9ch57MlX912gXS7RMg7UfjtaP6by/cR68PbzNmQ=|R5DemSNrFvcevijrktwf3aixOShNU6j7wfahyKeUyzk=|cigna_TOKEN_ENCRYPTION_KEY
j|j|1985-01-01|F|105.0|115144390|fn0r8nVzmDJUihnaQh1SXm1sLOIjzGsPDBskdX4/b+0=|Fh6facONoOiL9hCCA8Q1rtUp9n5h9VBhg2IaX9gjaKI=|NWtnZegpcpgcit2u063zQv3pcEhk4bpKHKFa9hW7LtU=|P3cVOUd6PyYN5tKezdMkVDI62aW8dv+bjIwKtAgX3OM=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
||1969-01-01|M|926.0|135112782|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
||1978-01-01|M|70.0|170737333|Q8NDJz563UrquOUUz0vD6Es05vIaAD/AfVOef4Mhj24=|k5Q02GVd0nJ6xMs1vHVM24MxV6tZ46HJNKoePcDsyoM=|C9cvHz5n+sDycUecioiWZW8USE6D2dli5gRzo4nOyvY=|z4eNSVNDAjiPU2Sw3VY+Ni1djO5fptl5FGQvfnBodr4=|cigna_TOKEN_ENCRYPTION_KEY
||1996-01-01|M|840.0|91951973|Y4kmxp0qdZVCW5pJgQmvWCfc4URg9oFnv2DWGglfQKM=|RJfyDYJjwuZ1ZDjP+5PA5S2fLS6llFD51Lg+uJ84Tus=|+PXzrKt7O79FehSnL3Q8EjGmnyZVDUfdM4zzHk1ghOY=|gjyVKjunky2Aui3dxzmeLt0U6+vT39/uILMbEiT0co8=|cigna_TOKEN_ENCRYPTION_KEY
||1960-01-01|M|180.0|64496569|80e1CgNJeO8oYQHlSn8zWYL4vVrHSPe9AnK2T2PrdII=|bJl7veT+4MlU4j2mhFpFyins0xeCFWeaA30JUzWsfqo=|0GuhUfbS4xCnCj2ms43wqmGFG5lCnfiIQdyti9moneM=|lq84jO9yhz8f9/DUM0ACVc/Rp+sKDvHznVjNnLOaRo4=|cigna_TOKEN_ENCRYPTION_KEY
||1963-01-01|M|310.0|122732991|zEvHkd5AVT7hZFR3/13dR9KzN5WSulewY0pjTFEov2Y=|eGqNbLoeCN1GJyvgaa01w+z26OtmplcrAY2vxwOZ4Y4=|6q9DPLPK5PPAItZA/x253DvdAWA/r6zIi0dtIqPIu2g=|lOl11DhznPphGQOFz6YFJ8i28HID1T6Sg7B/Y7W1M3o=|cigna_TOKEN_ENCRYPTION_KEY
||2001-01-01|F|650.0|43653178|vv/+KLdhHqUm13bWhpzBexwxgosXSIzgrxZIUwB7PDo=|78cJu1biJAlMddJT1yIzQAH1KCkyDoXiL1+Lo1I2jkw=|9/BM/hvqHYXgfmWehPP2JGGuB6lKmfu7uUsmCtpPyz8=|o/yP8bMzFl6KJ1cX+uFll1SrleCC+8BXmqBzyuGdtwM=|cigna_TOKEN_ENCRYPTION_KEY
Output - data inserted into the Postgres DB:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
j|ab|1978-01-01|M|328.0|125135976|yjYaupdG9gdlje+2HdQB+FdEEj6Lh4+WekqEuB1DSvM=|j8VuTUKll7mywqsKrqBnomppGutsoJAR+2IoH/Tq0b8=|6qNP9ch57MlX912gXS7RMg7UfjtaP6by/cR68PbzNmQ=|R5DemSNrFvcevijrktwf3aixOShNU6j7wfahyKeUyzk=|cigna_TOKEN_ENCRYPTION_KEY
j|j|1985-01-01|F|105.0|115144390|fn0r8nVzmDJUihnaQh1SXm1sLOIjzGsPDBskdX4/b+0=|Fh6facONoOiL9hCCA8Q1rtUp9n5h9VBhg2IaX9gjaKI=|NWtnZegpcpgcit2u063zQv3pcEhk4bpKHKFa9hW7LtU=|P3cVOUd6PyYN5tKezdMkVDI62aW8dv+bjIwKtAgX3OM=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
Expected Output:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
Any answers/comments will be very much appreciated. Thank you!
The fastest way to do operations like this in pandas is through numpy.where.
E.g. for string length:
data = data[np.where((data['cust_last_nm'].str.len()>1) &
                     (data['cust_frst_nm'].str.len()>1), True, False)]
Note: you can add the postal code condition in the same way. By default, the postal codes in your data will be read in as floats, so cast them to string first and then set the length limits:
## string length & postal code conditions together
data = data[np.where((data['cust_last_nm'].str.len()>1) &
                     (data['cust_frst_nm'].str.len()>1) &
                     (data['cust_postl_cd'].astype('str').str.len()>4) &
                     (data['cust_postl_cd'].astype('str').str.len()<8),
                     True, False)]
EDIT:
Since you are working in chunks, change data to chunk and put this inside your loop. Also, since you don't import headers (header=None), change the column names to their index values. And convert all values to strings before the comparison, since otherwise the NaN columns would be treated as floats, e.g.:
chunk = chunk[np.where((chunk[0].astype('str').str.len()>1) &
                       (chunk[1].astype('str').str.len()>1) &
                       (chunk[4].astype('str').str.len()>4) &
                       (chunk[4].astype('str').str.len()<8), True, False)]
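Put together inside the question's chunk loop, the whole filter might look like the sketch below. The integer column labels assume the order shown in the sample rows (names in columns 0 and 1, postal code in column 4), and the suffix check for step 4 is only a rough first pass with an assumed suffix list:
suffixes = {'jr', 'sr', 'i', 'ii', 'iii'}  # assumed list, extend as needed

for chunk in data:
    chunk = chunk.dropna(axis=0)  # step 1: drop rows with any missing value

    last_nm = chunk[0].astype(str).str.strip()
    frst_nm = chunk[1].astype(str).str.strip()
    postal = chunk[4].astype(str).str.strip()

    keep = ((last_nm.str.len() > 1) & (frst_nm.str.len() > 1)   # step 2
            & (postal.str.len() > 4) & (postal.str.len() < 8)   # step 3
            & ~last_nm.str.lower().isin(suffixes)                # step 4 (rough)
            & ~frst_nm.str.lower().isin(suffixes))
    chunk = chunk[keep]
    chunk.to_csv("/tmp/sample.csv", sep="|", header=None, index=False)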
Create a new column in the dataframe with a value for the length:
df['name_length'] = df.name.str.len()
Index using the new column:
df = df[df.name_length > 1]
I am trying to compare two rows of data to one another which I have stored in a list.
for x in range(0, len_data_row):
    if company_data[0][0][x] == company_data[1][0][x]:
        print ('MATCH 1: {} - {}'.format(x, company_data[0][0][x]))
        # do nothing
    if company_data[0][0][x] == None and company_data[1][0][x] != None:
        print ('MATCH 2: {} - {}'.format(x, company_data[1][0][x]))
        # update first company_id with data from 2nd
    if company_data[0][0][x] != None and company_data[1][0][x] == None:
        print ('MATCH 3: {} - {}'.format(x, company_data[0][0][x]))
        # update second company_id with data from 1st
Pseudocode of what I want to do: if the data at index [x] of a list is not None for row 2 but is blank for row 1, then write row 2's value at index [x] into row 1's data in my database.
The part I can't figure out is whether in SQLAlchemy you can specify which column is being updated by an "index" (I think in db-land "index" means something different from what I mean; I mean a list index, e.g. list[1]), and also whether you can dynamically specify which column is being updated by passing a variable to the update code. Here's what I'm looking to do (it doesn't work, of course):
def some_name(column_by_index, column_value):
    u = table_name.update().where(table_name.c.id == row_id).values(column_by_index=column_value)
    db.execute(u)
Thank you!
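For what it's worth, SQLAlchemy's values() also accepts a dict, so a column name held in a variable can be applied at runtime; here is a minimal sketch, assuming table_name, row_id and db are defined as above:
def update_column_by_name(column_name, column_value):
    # values() accepts a dict (or **{...} unpacking), so the column to update
    # can be chosen dynamically; the name must match a real column on the table.
    u = (table_name.update()
         .where(table_name.c.id == row_id)
         .values({column_name: column_value}))
    db.execute(u)

# To map a positional list index to a column name, the table's columns can be
# listed in definition order, e.g. list(table_name.c)[x].name (an assumption
# about the intended mapping, since column order comes from the table metadata).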
I have two files with 2 columns each. I need to take one column from each file and create a new file with 2 columns.
while i < 500020:
    columns = datas.readline()
    columns2 = datas2.readline()
    columns = columns.split(" ")
    columns2 = columns2.split(" ")
    colum.write(" {1} {0}".format((columns2[1]), (columns[1])))
    i = i + 1
My output is like this:
181.053131
0.0005301
168.785828
0.3596852
I want to show them on the same line, e.g.:
181.053131 0.0005301
168.785828 0.3596852
You need to remove the newline from columns2[1]:
columns2 = datas2.readline().rstrip('\n')
otherwise you'll always insert those newlines into your output.
I'd also remove the newline from columns and use an explicit newline when writing:
columns = datas.readline().rstrip('\n')
and
colum.write(" {1} {0}\n".format(columns2[1], columns[1]))