Python - issue with removing fields from the output

I have an issue with my code. I need the script to remove fields which meet all three conditions:
the CreatedBy is koala,
the Book is PI, SI, II, OT or FG,
and the Category is Cert, CertPlus, Cap or Downside.
Currently my code removes every koala row and every row with one of those books, and effectively only applies the last condition, so my current output keeps fields only if the Category is different. I would like it to remove fields ONLY if all three conditions are met (not whenever CreatedBy is koala, or Book is PI/SI/II/OT/FG), and to show everything else which is in range.
If a field is created by koala and the Category is Cert but the Book is not one of those listed, I wish to see this field, but now it is removed.
And if none of the conditions are met (e.g. CreatedBy is Extra, Book is NG and Category is Multiple), I also want to see those fields; now those are also removed from the output.
Example dataset:
In the link below - I wish to remove only those marked red:
import os
import re
import sys

current_path = os.path.dirname(os.path.realpath(sys.argv[0]))
a_path, q_path = 0, 0

def assign_path(current_path, a_path=0, q_path=0):
    files = os.listdir(current_path)
    for i in files:
        if re.search('(?i)activity', i):
            a_path = '\\'.join([current_path, i])
        elif re.search('(?i)query', i):
            q_path = '\\'.join([current_path, i])
    return a_path, q_path

a_path, q_path = assign_path(current_path)
if a_path == 0 or q_path == 0:
    files = os.listdir(current_path)
    directories = []
    for i in files:
        if os.path.isdir(i): directories.append(i)
    for i in directories:
        if re.search('(?i)input', i):
            a_path, q_path = assign_path('\\'.join([current_path, i]), a_path, q_path)

L = list(range(len(qr)))
L1 = list(range(len(qr2)))
L2 = list(range(len(ac)))
-------------------------------------------------------
import numpy as np
import pandas as pd

qr = pd.read_excel(q_path)
qr2 = pd.read_excel(q_path)
qr_rec = qr2.iloc[[0, 1]]
d = qr2.iloc[0].to_dict()
for i in list(d.keys()): d[i] = np.nan
for i in range(len(qr2)):
    if qr2.iloc[i]['LinkageLinkType'] != 'B2B_COUNTER_TRADE'\
            and qr2.iloc[i]['CreatedBy'] == 'koala_'\
            and qr2.iloc[i]['Book'] in {'PI','SI','II','OT','FG'}\
            and qr2.iloc[i]['Category'] not in {'Cert','CertPlus','Cap','Downside'}:
        while i in L: L.remove(i)
    if qr2.iloc[i]['PrimaryRiskId'] not in list(aID):
        qr_rec = qr_rec.append(qr2.iloc[i], ignore_index=True)
I have added the beginning of the code, which allows me to use the Excel file. I have two files, one of them being a_path (please disregard this one). The issue I have is with the q_path file.

Check this out:
df = pd.read_csv('stackoverflow.csv')
df
category book createdby
0 Multiple NG panda
1 Cert DG koala
2 Cap PI monkey
3 CertPlus ZZ panda
4 Cap ll joey
5 Cert OT koala
6 Cap FG koala
7 Cert PI koala
8 Block SI koala
9 Cap II koala
df.query("~(category in ['Cert', 'Cap'] and book in ['OT', 'FG', 'PI', 'II'] and createdby=='koala')")
category book createdby
0 Multiple NG panda
1 Cert DG koala
2 Cap PI monkey
3 CertPlus ZZ panda
4 Cap ll joey
8 Block SI koala
pd.DataFrame.query can be used to filter data; the ~ at the beginning is a not (negation) operator.
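For comparison, the same filter can be written as a plain boolean mask; here is a minimal sketch using the full condition lists from the question:
import pandas as pd

df = pd.read_csv('stackoverflow.csv')

# a row is removed only when ALL three conditions hold at once
to_remove = (
    df['category'].isin(['Cert', 'CertPlus', 'Cap', 'Downside'])
    & df['book'].isin(['PI', 'SI', 'II', 'OT', 'FG'])
    & (df['createdby'] == 'koala')
)
filtered = df[~to_remove]  # ~ keeps every row where not all three are met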
BR
E

Related

Filter Pandas dataframe with user input

I'm trying to develop this code where certain inputs for different variables trigger a filter and return the filtered dataframe. Each input will only ever receive a single value, which the user chooses from a few options; if the input is empty, that filter must bring back all the data.
I didn't put the user input in because I was testing the function first; however, the function always returns an empty dataframe and I can't find out why. Here is the code I was developing.
I didn't put the dataframe in because it comes from an Excel file, but if necessary I'll put together a sample that fits:
df = pd.DataFrame({"FarolAging":["Vermelho","Verde","Amarelo"],"Dias Pendentes":["20 dias","40 dias","60 dias"],"Produto":["Prod1","Prod1","Prod2"],
"Officer":["Alexandre Denardi","Alexandre Denardi","Lucas Fernandes"],"Analista":["Guilherme De Oliveira Moura","Leonardo Silva","Julio Cesar"],
"Coord":["Anna Claudia","Bruno","Bruno"]})
FarolAging1 = ['Vermelho']
DiasPendentes = []
Produto = []
Officer = []
def func(FarolAging1,DiasPendentes,Produto,Officer):
if len(Officer) <1:
Officer = df['Officer'].unique()
if len(FarolAging1) <1:
FarolAging1 = df['FarolAging'].unique()
if len(DiasPendentes) <1:
DiasPendentes = df['Dias Pendentes'].unique()
if len(Produto) <1:
Produto = df['Produto'].unique()
dados2 = df.loc[df['FarolAging'].isin([FarolAging1]) & (df['Dias Pendentes'].isin([DiasPendentes])) & (df['Produto'].isin([Produto])) & (df['Officer'].isin([Officer]))]
print(dados2)
func(FarolAging1, DiasPendentes, Produto, Officer) ```
You have to remove the square brackets in isin because you already have lists:
def func(FarolAging1, DiasPendentes, Produto, Officer):
    if len(Officer) < 1:
        Officer = df['Officer'].unique()
    if len(FarolAging1) < 1:
        FarolAging1 = df['FarolAging'].unique()
    if len(DiasPendentes) < 1:
        DiasPendentes = df['Dias Pendentes'].unique()
    if len(Produto) < 1:
        Produto = df['Produto'].unique()
    # Transform .isin([...]) into .isin(...)
    dados2 = (df.loc[df['FarolAging'].isin(FarolAging1)
                     & (df['Dias Pendentes'].isin(DiasPendentes))
                     & (df['Produto'].isin(Produto))
                     & (df['Officer'].isin(Officer))])
    print(dados2)
    return dados2  # don't forget to return something
Output:
>>> func(FarolAging1, DiasPendentes, Produto, Officer)
FarolAging Dias Pendentes Produto Officer Analista Coord
0 Vermelho 20 dias Prod1 Alexandre Denardi Guilherme De Oliveira Moura Anna Claudia
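To see why the extra brackets made the original return an empty dataframe: isin expects the collection of allowed values itself, and wrapping it in another list nests the values one level too deep, so no cell can match. A small illustration with invented values:
import pandas as pd

s = pd.Series(["Vermelho", "Verde", "Amarelo"])
chosen = ["Vermelho"]

print(s.isin(chosen).tolist())
# [True, False, False] -- each cell is tested for membership in chosen;
# with the extra brackets the allowed "value" would be the nested list
# itself, which no string cell ever equals.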

Extract part of string based on a template in Python

I'd like to use Python to read in a list of directories and store data in variables based on a template such as /home/user/Music/%artist%/[%year%] %album%.
An example would be:
artist, year, album = None, None, None
template = "/home/user/Music/%artist%/[%year%] %album%"
path = "/home/user/Music/3 Doors Down/[2002] Away From The Sun"
# ... somehow match path against template, yielding (text, key) pairs like these ...
if text == "%artist%":
    artist = key
if text == "%year%":
    year = key
if text == "%album%":
    album = key
print(artist)
# 3 Doors Down
print(year)
# 2002
print(album)
# Away From The Sun
I can do the reverse easily enough with str.replace("%artist%", artist), but how can I extract the data?
If your folder structure template is reliable, the following should work without the need for regular expressions.
path = "/home/user/Music/3 Doors Down/[2002] Away From The Sun"
path_parts = path.split("/") # divide up the path into array by slashes
print(path_parts)
artist = path_parts[4] # get element of array at index 4
year = path_parts[5][1:5] # get characters at index 1-5 for the element of array at index 5
album = path_parts[5][7:]
print(artist)
# 3 Doors Down
print(year)
# 2002
print(album)
# Away From The Sun
# to put the path back together again using an F-string (No need for str.replace)
reconstructed_path = f"/home/user/Music/{artist}/[{year}] {album}"
print(reconstructed_path)
output:
['', 'home', 'user', 'Music', '3 Doors Down', '[2002] Away From The Sun']
3 Doors Down
2002
Away From The Sun
/home/user/Music/3 Doors Down/[2002] Away From The Sun
The following works for me:
from difflib import SequenceMatcher

def extract(template, text):
    seq = SequenceMatcher(None, template, text, True)
    return [text[c:d] for tag, a, b, c, d in seq.get_opcodes() if tag == 'replace']
template = "home/user/Music/%/[%] %"
path = "home/user/Music/3 Doors Down/[2002] Away From The Sun"
artist, year, album = extract(template, path)
print(artist)
print(year)
print(album)
Output:
3 Doors Down
2002
Away From The Sun
Each template placeholder can be any single character as long as the character is not present in the value to be returned.
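If you want to keep the named %artist%-style placeholders, here is a regex-based sketch of mine (not from the answers above; it assumes Python 3.7+, where re.escape leaves % unescaped) that turns each placeholder into a named capture group:
import re

def extract_fields(template, path):
    # escape the literal parts of the template, then turn each %name%
    # placeholder into a named capturing group
    pattern = re.sub(r'%(\w+)%', r'(?P<\1>.+)', re.escape(template))
    m = re.fullmatch(pattern, path)
    return m.groupdict() if m else None

template = "/home/user/Music/%artist%/[%year%] %album%"
path = "/home/user/Music/3 Doors Down/[2002] Away From The Sun"
print(extract_fields(template, path))
# {'artist': '3 Doors Down', 'year': '2002', 'album': 'Away From The Sun'}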

What is the best way to parse large XML and generate a dataframe with the data in the XML (with Python or something else)?

I am trying to make a table (or CSV; I'm using a pandas dataframe) from the information in an XML file.
The file is here (.zip is 14 MB, XML is ~370 MB): https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip . It has package information for different languages (node.js, python, java etc.), aka the CPE 2.3 list from the US government org NVD.
This is what the first 30 lines look like:
<cpe-list xmlns:config="http://scap.nist.gov/schema/configuration/0.1" xmlns="http://cpe.mitre.org/dictionary/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.3" xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3" xmlns:ns6="http://scap.nist.gov/schema/scap-core/0.1" xmlns:meta="http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2" xsi:schemaLocation="http://scap.nist.gov/schema/cpe-extension/2.3 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary-extension_2.3.xsd http://cpe.mitre.org/dictionary/2.0 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2 https://scap.nist.gov/schema/cpe/2.1/cpe-dictionary-metadata_0.2.xsd http://scap.nist.gov/schema/scap-core/0.3 https://scap.nist.gov/schema/nvd/scap-core_0.3.xsd http://scap.nist.gov/schema/configuration/0.1 https://scap.nist.gov/schema/nvd/configuration_0.1.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd">
  <generator>
    <product_name>National Vulnerability Database (NVD)</product_name>
    <product_version>4.9</product_version>
    <schema_version>2.3</schema_version>
    <timestamp>2022-03-17T03:51:01.909Z</timestamp>
  </generator>
  <cpe-item name="cpe:/a:%240.99_kindle_books_project:%240.99_kindle_books:6::~~~android~~">
    <title xml:lang="en-US">$0.99 Kindle Books project $0.99 Kindle Books (aka com.kindle.books.for99) for android 6.0</title>
    <references>
      <reference href="https://play.google.com/store/apps/details?id=com.kindle.books.for99">Product information</reference>
      <reference href="https://docs.google.com/spreadsheets/d/1t5GXwjw82SyunALVJb2w0zi3FoLRIkfGPc7AMjRF0r4/edit?pli=1#gid=1053404143">Government Advisory</reference>
    </references>
    <cpe-23:cpe23-item name="cpe:2.3:a:\$0.99_kindle_books_project:\$0.99_kindle_books:6:*:*:*:*:android:*:*"/>
  </cpe-item>
The tree structure of the XML file is quite simple: the root is 'cpe-list', the child element is 'cpe-item', and the grandchild elements are 'title', 'references' and 'cpe23-item'.
From 'title', I want the text in the element;
from 'cpe23-item', I want the attribute 'name';
from 'references', I want the 'href' attribute of each of its 'reference' children.
The dataframe should look like this:
| cpe23_name | title_text | ref1 | ref2 | ref3 | ref_other
0 | 'cpe23name 1'| 'this is a python pkg'| 'url1'| 'url2'| NaN | NaN
1 | 'cpe23name 2'| 'this is a java pkg' | 'url1'| 'url2'| NaN | NaN
...
My code is here, finished in ~100 sec:
import time
import xml.etree.ElementTree as et
import pandas as pd

xtree = et.parse("official-cpe-dictionary_v2.3.xml")
xroot = xtree.getroot()

start_time = time.time()
df_cols = ["cpe", "text", "vendor", "product", "version", "changelog", "advisory", "others"]
title = '{http://cpe.mitre.org/dictionary/2.0}title'
ref = '{http://cpe.mitre.org/dictionary/2.0}references'
cpe_item = '{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item'
p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None
rows = []
i = 0
while i < len(xroot):
    for elm in xroot[i]:
        if elm.tag == title:
            p_text = elm.text  # assign p_text
        elif elm.tag == ref:
            for nn in elm:
                s = nn.text.lower()  # check the lowercased text of each reference
                if 'version' in s:
                    p_vers = nn.attrib.get('href')  # assign p_vers
                elif 'advisor' in s:
                    p_advi = nn.attrib.get('href')  # assign p_advi
                elif 'product' in s:
                    p_prod = nn.attrib.get('href')  # assign p_prod
                elif 'vendor' in s:
                    p_vend = nn.attrib.get('href')  # assign p_vend
                elif 'change' in s:
                    p_chan = nn.attrib.get('href')  # assign p_chan
                else:
                    p_othe = nn.attrib.get('href')  # assign p_othe
        elif elm.tag == cpe_item:
            p_cpe = elm.attrib.get("name")  # assign p_cpe
        else:
            print(elm.tag)
    row = [p_cpe, p_text, p_vend, p_prod, p_vers, p_chan, p_advi, p_othe]
    rows.append(row)
    p_cpe = None
    p_text = None
    p_vend = None
    p_prod = None
    p_vers = None
    p_chan = None
    p_advi = None
    p_othe = None
    print(len(rows))  # this shows how far I got during the running time
    i += 1
out_df1 = pd.DataFrame(rows, columns=df_cols)  # moved outside the loop by removing the indent
print("---853k rows take %s seconds ---" % (time.time() - start_time))
Updated: the faster way is to move the second-to-last line outside the loop. Since rows already collects the info on each iteration, there is no need to make a new dataframe every time.
The running time is now 136.0491042137146 seconds. Yay!
Since your XML is fairly flat, consider the recently added IO function pandas.read_xml, introduced in v1.3. Because the XML uses a default namespace, reference elements in xpath through the namespaces argument:
url = "https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip"
df = pd.read_xml(
url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}
)
If you do not have the default parser, lxml, installed, use the etree parser:
df = pd.read_xml(
url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}, parser="etree"
)
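If memory becomes a concern with the ~370 MB file, one alternative (a sketch of mine, not part of the answer above) is ElementTree's iterparse, which streams cpe-item elements and frees each one after use:
import xml.etree.ElementTree as et

ns = '{http://cpe.mitre.org/dictionary/2.0}'
rows = []
for event, elem in et.iterparse("official-cpe-dictionary_v2.3.xml", events=("end",)):
    if elem.tag == ns + 'cpe-item':
        title_el = elem.find(ns + 'title')
        rows.append((elem.get('name'), title_el.text if title_el is not None else None))
        elem.clear()  # drop the processed subtree so memory use stays flat
print(len(rows))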

How can I ensure that my PDF reading code does not return a NaN row and a duplicate row?

I am working on reading PDF files that are CVs into a DataFrame (pandas). However, after reading the files I find a NaN row and a duplicate row of the last CV (alphabetically). Is there something in the code that does this? I can't seem to figure out why. I have tried changing around the iloc index[0] parts and the fileIndex value, but have found no solution. All help is appreciated.
import glob
import pandas as pd
import PyPDF2

dataset = []
pdf_dir = "C:/Users/user/Documents/CV ML Test/CVs/"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
dataset = pd.DataFrame(columns=['FileName', 'Text'])
fileIndex = 0
for file in pdf_files:
    pdfFileObj = open(file, 'rb')  # 'rb' for read-binary mode
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    startPage = 0
    text = ''
    cleanText = ''
    while startPage <= pdfReader.numPages - 1:
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.split()
    newRow = pd.DataFrame(index=[0], columns=['FileName', 'Text'])
    newRow.iloc[0]['FileName'] = file
    newRow.iloc[0]['Text'] = text
    dataset = pd.concat([output_data, newRow], ignore_index=True)
Here is a list of the PDF files currently in the directory: https://i.stack.imgur.com/8UJZQ.png
This is the result:
FileName Text
0 NaN NaN
1 C:/Users/user/Documents/CV ML Test/CVs\1... [Copyright, ©, 1996-2018,, JobStreet.com., All...
2 C:/Users/user/Documents/CV ML Test/CVs\1... [AMY, PROFILE, Fund, accountant, with, nearly,...
3 C:/Users/user/Documents/CV ML Test/CVs\2... [BEN, Fund, Accoutant, Sep, 2016, -, Present, ...
4 C:/Users/user/Documents/CV ML Test/CVs\3... [CARRIE, Professional, Experience, Citco, Fund...
5 C:/Users/user/Documents/CV ML Test/CVs\4... [DICKSON, PROFESSIONAL, EXPERIENCE, Conifer, F...
6 C:/Users/user/Documents/CV ML Test/CVs\5... [EDWARDO, QUALIFICATION, SUMMARY, Results-driv...
7 C:/Users/user/Documents/CV ML Test/CVs\6... [FAYE, Citco, Fund, Services, (Singapore), Pte...
8 C:/Users/user/Documents/CV ML Test/CVs\7... [GIRAFFE, Work, Experience:, CITCO, Fund, Serv...
9 C:/Users/user/Documents/CV ML Test/CVs\8... [Career, Objectives, Have, strong, interest, i...
10 C:/Users/user/Documents/CV ML Test/CVs\9... [IGNATIUS, Work, Experience, Watiga, &, Co., (...
11 C:/Users/user/Documents/CV ML Test/CVs\9... [IGNATIUS, Work, Experience, Watiga, &, Co., (...
Fixed by changing the following part of the code:
dataset = pd.DataFrame(columns=['FileName', 'Text'])
for file in pdf_files:
    pdfFileObj = open(file, 'rb')  # 'rb' for read-binary mode
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    startPage = 0
    text = ''
    cleanText = ''
    while startPage <= pdfReader.numPages - 1:
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.split()
    newRow = pd.DataFrame(index=[0], columns=['FileName', 'Text'])
    newRow.iloc[0]['FileName'] = file
    newRow.iloc[0]['Text'] = text
    dataset = pd.concat([dataset, newRow], ignore_index=True)
Notice the last line has changed: the first element of the concat list now matches the accumulating dataframe, dataset instead of output_data.
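As an aside, a common pattern (a sketch using the same older PyPDF2 API as the question, not the asker's code) is to collect plain dicts in a list and build the DataFrame once at the end, which avoids repeated concat and stale-variable slips like the one above:
import glob
import pandas as pd
import PyPDF2

rows = []
for file in glob.glob("C:/Users/user/Documents/CV ML Test/CVs/*.pdf"):
    with open(file, 'rb') as fh:
        reader = PyPDF2.PdfFileReader(fh)
        # concatenate the text of every page, then clean it as the question does
        text = ''.join(reader.getPage(p).extractText() for p in range(reader.numPages))
    rows.append({'FileName': file, 'Text': text.replace('\n', '').split()})

dataset = pd.DataFrame(rows, columns=['FileName', 'Text'])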

Data Analysis using Python

I have 2 CSV files. One has city name, population and humidity; in the second, cities are mapped to states. I want to get the state-wise total population and average humidity. Can someone help? Here is the example:
CSV 1:
CityName,population,humidity
Austin,1000,20
Sanjose,2200,10
Sacramento,500,5
CSV 2:
State,city name
Ca,Sanjose
Ca,Sacramento
Texas,Austin
I would like to get this output (population summed and humidity averaged per state):
Ca,2700,7.5
Texas,1000,20
Update: the dictionary-based solution further below doesn't work, because a dictionary will hold only one value per key. I gave up and finally used a loop. The code below is working; the input I used is included too:
CSV 1:
state_name,city_name
CA,sacramento
utah,saltlake
CA,san jose
Utah,provo
CA,sanfrancisco
TX,austin
TX,dallas
OR,portland
CSV 2:
city_name population humidity
sacramento 1000 1
saltlake 300 5
san jose 500 2
provo 100 7
sanfrancisco 700 3
austin 2000 4
dallas 2500 5
portland 300 6
import csv
import pandas as pd

def mapping_within_dataframe(self, file1, file2, file3):
    self.csv1 = file1
    self.csv2 = file2
    self.outcsv = file3
    one_state_data = 0
    outfile = csv.writer(open(self.outcsv, 'w'), delimiter=',')  # self.outcsv, not the string 'self.outcsv'
    state_city = pd.read_csv(self.csv1)
    city_data = pd.read_csv(self.csv2)
    all_state = list(set(state_city.state_name))
    for one_state in all_state:
        one_state_cities = list(state_city.loc[state_city.state_name == one_state, "city_name"])
        one_state_data = 0
        for one_city in one_state_cities:
            one_city_data = city_data.loc[city_data.city_name == one_city, "population"].sum()
            one_state_data = one_state_data + one_city_data
        print(one_state, one_state_data)
        outfile.writerows(whatever)  # placeholder left as in the original
def output(file1, file2):
    f = lambda x: x.strip()  # strips newline and whitespace characters
    with open(file1) as cities:
        with open(file2) as states:
            states_dict = {}
            cities_dict = {}
            for line in states:
                line = line.split(',')
                states_dict[f(line[0])] = f(line[1])
            for line in cities:
                line = line.split(',')
                cities_dict[f(line[0])] = (int(f(line[1])), int(f(line[2])))
            for state, city in states_dict.items():
                try:
                    print(state, cities_dict[city])
                except KeyError:
                    pass

output(CSV1, CSV2)  # these are the names of the files
This gives the output you wanted. Just make sure the names of cities in both files are the same in terms of capitalization.
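For reference, here is a pandas sketch (not among the original answers; the file names csv1.csv and csv2.csv are placeholders) that merges the two files and aggregates per state:
import pandas as pd

cities = pd.read_csv("csv1.csv")  # CityName,population,humidity
states = pd.read_csv("csv2.csv")  # State,city name

merged = states.merge(cities, left_on="city name", right_on="CityName")
result = merged.groupby("State").agg(population=("population", "sum"),
                                     humidity=("humidity", "mean"))
print(result)
# population summed, humidity averaged per state:
# Ca 2700 7.5, Texas 1000 20.0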
