pandas dataframe not creating new column - python

I have data like this. I am trying to create a rule based on domain names for my project: a new column named new_url derived from the domain. If the domain contains .cdn., the new value is the string before .cdn.; otherwise it calls the url_parser library and parses the URL another way. The problem is that the CSV file I create (cleanurl.csv) has no new_url column. When I print the parsed URLs in the code I can see them, and both the if and else branches run. Could you help me please?
import pandas as pd
import url_parser
from url_parser import parse_url, get_url, get_base_url
import numpy as np

df = pd.read_csv("C:\\Users\\myuser\\Desktop\\raw_data.csv", sep=';')
i = -1
for x in df['domain']:
    i = i + 1
    print("*", x, "*")
    if '.cdn.' in x:
        parsed_url = x.split('.cdn')[0]
        print(parsed_url)
        df.iloc[i]['new_url'] = parsed_url
    else:
        parsed_url = get_url(x).domain + '.' + get_url(x).top_domain
        print(parsed_url)
        df.iloc[i]['new_url'] = parsed_url
df.to_csv("C:\\Users\\myuser\\Desktop\\cleanurl.csv", sep=';')

Use .loc[row, 'column'] to create the new column. df.iloc[i]['new_url'] = ... is chained indexing: the assignment lands on a temporary copy of the row, so the original dataframe is never modified. Assign through a single .loc call instead:
for idx, x in df['domain'].items():
    if '.cdn.' in x:
        parsed_url = x.split('.cdn')[0]
    else:
        parsed_url = get_url(x).domain + '.' + get_url(x).top_domain
    df.loc[idx, 'new_url'] = parsed_url
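The same rule can also be written without the explicit loop. As a sketch (assuming url_parser's get_url fields behave as in the question's code):
def clean_url(x):
    # take the part before '.cdn' when present, otherwise rebuild
    # the registered domain from url_parser's fields
    if '.cdn.' in x:
        return x.split('.cdn')[0]
    parsed = get_url(x)
    return parsed.domain + '.' + parsed.top_domain

df['new_url'] = df['domain'].apply(clean_url)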

Related

Apply string function to data frame

The task is to wrap URLs in an Excel file with an HTML tag.
For this, I have a function and the following code that works for one column named ANSWER:
import pandas as pd
import numpy as np
import string
import re

def hyperlinksWrapper(myString):
    # finding all substrings that look like a URL
    URLs = re.findall(r"(?P<url>https?://[^,)<;\s\n]+)", myString)
    # print(URLs)
    # replacing each URL by a link wrapped into <a> html-tags
    for link in URLs:
        wrappedLink = '<a href="' + link + '">' + link + '</a>'
        myString = myString.replace(link, wrappedLink)
    return myString

# Opening the original XLSX file
filename = "Excel.xlsx"
df = pd.read_excel(filename)
# Filling all the empty cells in the ANSWER column with the value "n/a"
df.ANSWER.replace(np.NaN, "n/a", inplace=True)
# Going through the ANSWER column and applying hyperlinksWrapper to each cell
for i in range(len(df.ANSWER)):
    df.ANSWER[i] = hyperlinksWrapper(df.ANSWER[i])
# Export to XLSX
df.to_excel('Excel_refined.xlsx')
The question is: how do I apply this not to one column, but to every cell in all the columns of the dataframe, without specifying the exact column names?
Perhaps you're looking for something like this:
import pandas as pd
import numpy as np
import string
import re

def hyperlinksWrapper(myString):
    # finding all substrings that look like a URL
    URLs = re.findall(r"(?P<url>https?://[^,)<;\s\n]+)", myString)
    # print(URLs)
    # replacing each URL by a link wrapped into <a> html-tags
    for link in URLs:
        wrappedLink = '<a href="' + link + '">' + link + '</a>'
        myString = myString.replace(link, wrappedLink)
    return myString

# dummy dataframe
df = pd.DataFrame(
    {'answer_col1': ['https://example.com', 'https://example.org', np.nan],
     'answer_col2': ['https://example.net', 'Hello', 'World']}
)

# as suggested in the comments (replaces all NaNs in df)
df.fillna("n/a", inplace=True)

# option 1
# loops over every column of df
for col in df.columns:
    # applies hyperlinksWrapper to every row in col
    df[col] = df[col].apply(hyperlinksWrapper)

# [UPDATED] option 2
# applies hyperlinksWrapper to every element of df
df = df.applymap(hyperlinksWrapper)

df.head()
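A side note: in recent pandas (2.1+), DataFrame.applymap is deprecated in favour of the element-wise DataFrame.map, so option 2 becomes:
# same element-wise application, under the newer name
df = df.map(hyperlinksWrapper)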

Extracting multiple tables into a .csv

I have a csv file which I am using to search uniprot.org for multiple variants of a protein; an example is the following page:
https://www.uniprot.org/uniprot/?query=KU168294+env&sort=score
import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv

df = pd.read_csv('Env_seq_list.csv')
second_column_df = df['Accession']
for row in second_column_df:
    theurl = 'https://www.uniprot.org/uniprot/?query=' + row + '+env&sort=score'
    page = requests.get(theurl).content
    df_list = pd.read_html(page)
    df = df_list[-1]
    print(df.loc[df['Gene names'] == 'env'])
If I run the print call it works fine and I get back the tables I'm after. I'm stuck at this point because if I instead use the pandas df.to_csv function, I cannot get it to work alongside df.loc. Additionally, df.to_csv on its own only writes the last search result to the .csv, which I'm pretty sure is because the call sits inside the for loop, but I am unsure how to fix this. Any help would be greatly appreciated :-)
I would suggest taking the df you find each time through the loop and appending it to a 'final' df. Then, outside the loop, you can run to_csv on that 'final' df. Code below:
final_df = pd.DataFrame()
for row in second_column_df:
    theurl = 'https://www.uniprot.org/uniprot/?query=' + row + '+env&sort=score'
    page = requests.get(theurl).content
    df_list = pd.read_html(page)
    df = df_list[-1]
    # keep only the rows whose 'Gene names' column is 'env'
    final_df = pd.concat([final_df, df.loc[df['Gene names'] == 'env']], axis=0)
final_df.to_csv('/path/to/save/csv')
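Concatenating inside the loop copies the accumulated frame on every iteration. A slightly cheaper pattern, sketched under the same assumptions about the page layout, is to collect the filtered tables in a list and concatenate once at the end:
frames = []
for row in second_column_df:
    theurl = 'https://www.uniprot.org/uniprot/?query=' + row + '+env&sort=score'
    page = requests.get(theurl).content
    df_list = pd.read_html(page)
    table = df_list[-1]
    # keep only the rows whose 'Gene names' column is 'env'
    frames.append(table.loc[table['Gene names'] == 'env'])
final_df = pd.concat(frames, ignore_index=True)
final_df.to_csv('/path/to/save/csv')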

How can I use the .findall() function for an Excel file, iterating through all rows of a column?

I have a big Excel sheet with information about different companies, all together in a single cell for each company, and my goal is to separate this into different columns, following patterns, to scrape the info from the first column. The original data looks like this:
My goal is to achieve a dataframe like this:
I have created the following code to use the patterns Mr., Affiliation:, E-mail:, and Mobile, because they are repeated in every single row the same way. However, I don't know how to use the findall() function to scrape all the info I want from each row of the desired column.
import openpyxl
import re
import sys
import pandas as pd

reload(sys)
sys.setdefaultencoding('utf8')

wb = openpyxl.load_workbook('/Users/ap/info1.xlsx')
ws = wb.get_sheet_by_name('Companies')
w = {'Name': [], 'Affiliation': [], 'Email': []}
for row in ws.iter_rows('C{}:C{}'.format(ws.min_row, ws.max_row)):
    for cells in row:
        aa = cells.value
        a = re.findall(r'Mr.(.*?)Affiliation:', aa, re.DOTALL)
        a1 = "".join(a).replace('\n', ' ')
        b = re.findall(r'Affiliation:(.*?)E-mail', aa, re.DOTALL)
        b1 = "".join(b).replace('\n', ' ')
        c = re.findall(r'E-mail(.*?)Mobile', aa, re.DOTALL)
        c1 = "".join(c).replace('\n', ' ')
        w['Name'].append(a1)
        w['Affiliation'].append(b1)
        w['Email'].append(c1)
        print aa
df = pd.DataFrame(data=w)
df.to_excel(r'/Users/ap/info2.xlsx')
I would go with the following, which just replaces the 'Affiliation:', 'E-mail:' and 'Mobile:' markers with a delimiter and then splits and assigns each piece to the right column:
df['Name'] = np.nan
df['Affiliation'] = np.nan
df['Email'] = np.nan
df['Mobile'] = np.nan

for i in range(0, len(df)):
    full_value = df['Companies'].loc[i]
    full_value = full_value.replace('Affiliation:', ';').replace('E-mail:', ';').replace('Mobile:', ';')
    full_value = full_value.split(';')
    # assign through a single .loc call to avoid chained-assignment issues
    df.loc[i, 'Name'] = full_value[0]
    df.loc[i, 'Affiliation'] = full_value[1]
    df.loc[i, 'Email'] = full_value[2]
    df.loc[i, 'Mobile'] = full_value[3]

del df['Companies']
print(df)
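Since every row follows the same 'Mr. ... Affiliation: ... E-mail: ... Mobile: ...' layout, a loop-free sketch with Series.str.extract and named groups is also possible (the 'Companies' column name is taken from the answer above):
import re

pattern = (r'Mr\.(?P<Name>.*?)'
           r'Affiliation:(?P<Affiliation>.*?)'
           r'E-mail:(?P<Email>.*?)'
           r'Mobile:(?P<Mobile>.*)')
# each named group becomes its own column in the result
extracted = df['Companies'].str.extract(pattern, flags=re.DOTALL)
df = df.drop(columns='Companies').join(extracted)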

Adding a column using the urllib.parse

I have a CSV imported using Pandas:
df = pd.read_csv('files_2.csv')
One of the columns in the data is Page URL. I would like to add a column to the dataframe, computed with functions from urllib.parse, something like this:
o = urlparse(df['Page URL'])
o.query            # pulls the query string
parse_qs(o.query)  # the logic for the new column
The new column should hold the results of the parse_qs(o.query) call. I am new to Python 3, and it would be great if you could point me in the right direction.
Thanks
Try applying the functions row by row; urlparse() expects a single string, not a whole column:
df['query'] = df['Page URL'].apply(lambda u: parse_qs(urlparse(u).query))
Remember that parse_qs() returns a dictionary, so you will have dictionaries in your query cells.
Using apply
Ex:
import pandas as pd
from urllib.parse import urlparse, parse_qs

df = pd.DataFrame({"Page URL": ["https://www.google.com/search?ei=0kkkk"]})
df['query'] = df['Page URL'].apply(urlparse).apply(lambda x: parse_qs(x.query))
print(df)
Output:
                                 Page URL              query
0  https://www.google.com/search?ei=0kkkk  {'ei': ['0kkkk']}
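If you then want one parameter in a flat column of its own, a small follow-up sketch ('ei' is just the example key from above):
# parse_qs maps each key to a list of values; take the first, if any
df['ei'] = df['query'].apply(lambda q: q.get('ei', [None])[0])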

How do I save my list to a dataframe keeping empty rows?

I'm trying to extract subject-verb-object triplets and then attach an ID. I am using a loop, so my list of extracted triplets keeps entries for the rows where no triplet was found. It looks like:
[]
[trump,carried,energy]
[]
[clinton,doesn't,trust]
When I print mylist it looks as expected.
However, when I try to create a dataframe from mylist, I get an error caused by the empty rows:
`IndexError: list index out of range`
I tried to include an if statement to avoid this, but the problem is the same. I also tried using reindex instead, but df2 came out empty.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import spacy
import textacy
import csv, string, re
import numpy as np
import pandas as pd

# Import csv file with pre-processing already carried out
df = pd.read_csv("pre-processed_file_1.csv", sep=",")

# Prepare dataframe to be relevant columns and unicode
df1 = df[['text_1', 'id']].copy()

import StringIO
s = StringIO.StringIO()
tweets = df1.to_csv(encoding='utf-8')

nlp = spacy.load('en')
count = 0
df2 = pd.DataFrame()
for row in df1.iterrows():
    doc = nlp(unicode(row))
    text_ext = textacy.extract.subject_verb_object_triples(doc)
    tweetID = df['id'].tolist()
    mylist = list(text_ext)
    count = count + 1
    if mylist:
        df2 = df2.append(mylist, ignore_index=True)
    else:
        df2 = df2.append('0', '0', '0')
Any help would be very appreciated. Thank you!
You're supposed to pass a DataFrame-shaped object to append; passing the raw data doesn't work. So:
df2 = df2.append([['0', '0', '0']], ignore_index=True)
You can also wrap your processing in a function process_row, then do df2 = pd.DataFrame([process_row(row) for row in df1.iterrows()]). Note that while append won't work with empty rows, the DataFrame constructor just fills them in with None. If you want empty rows to be ['0','0','0'], you have several options:
- Have your processing function return ['0','0','0'] for empty rows
- Change the list comprehension to [process_row(row) if process_row(row) else ['0','0','0'] for row in df1.iterrows()]
- Do df2 = df2.fillna('0')
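A minimal sketch of that process_row approach, reusing the question's own setup (nlp, textacy, df1; the column names are illustrative):
def process_row(row):
    # row is an (index, Series) pair from df1.iterrows()
    doc = nlp(unicode(row))
    triples = list(textacy.extract.subject_verb_object_triples(doc))
    # fall back to the '0' placeholder when no triplet was found
    return list(triples[0]) if triples else ['0', '0', '0']

df2 = pd.DataFrame([process_row(row) for row in df1.iterrows()],
                   columns=['subject', 'verb', 'object'])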
