Compare a column between 2 csvs and merge data into one csv - python

I am trying to print out a csv file by comparing a column between two csv files:
CSV1:
network, geoname
1.0.0.0/24, 123456
2.0.0.0/24, 76890
.....
CSV2:
geoname, country_code, country
123456, XX, ABC
....
....
I want to compare the geoname column between csv 1 & 2 and map the network section according to the geoname and associated country and country code.
Final CSV:
network, geoname, country code, country
1.0.0.0/24,123456, XX, ABC
NB: csv1 contains duplicate geonames as well while csv2 maps the geonames to the country.
I am trying to map the network section in csv1 using the geonames and get the associated country and code in my final csv.
The problem I am facing is that the code only runs until the second CSV file has been read to the end, so I am not able to map things properly.
#script for making GeipCountrywhois csv for input to zmap
from asyncore import write
import csv
import os
import pandas as pd
import numpy as np
import sys
indir=os.environ['HOME']+'/code/surveys/mmdb/GeoLite2-Country-CSV_20220308/'
v4file=indir+'GeoLite2-Country-Blocks-IPv4.csv'
localefile = indir+'GeoLite2-Country-Locations-en.csv'
outfile = indir+'GeoIPCountryWhois.csv'
data = []
data2 = []
data3 = []
with open(v4file, "r") as file:
reader = csv.reader(file)
of = open(outfile, "w")
writer = csv.writer(of)
for row in reader:
ip_cidr = row[0]
geoname =row[1]
data = [ip_cidr, geoname]
#print(data)
with open(localefile, "r") as file:
reader2 = csv.reader(file)
for row in reader2:
geoname_en = row[0]
cc = row[4]
country = row[5]
data2 = [geoname, cc, country]
#print(data2)
if(data[1] == data2[0]):
data3 = [ip_cidr, geoname, cc, country]
writer.writerow(data3)
print(data3)

I don't know whether you are using pandas or not, but here is a pandas solution.
First install pandas in your Python environment if it is not installed already. You can install it with the following command -
python -m pip install pandas
Then try this code block -
import pandas as pd
# load csv1, assume that its name is network.csv
network = pd.read_csv('network.csv')
# Then load csv2, assume that its name is geonames.csv
geo = pd.read_csv('geonames.csv')
# left-join the country data onto every network row, matching on geoname
final_csv = network.merge(geo, how='left', on='geoname')
# save result as csv
final_csv.to_csv('final.csv', index=False)
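A minimal sketch of the same left merge on in-memory copies of the sample data, in case it helps to see the result. The extra row `3.0.0.0/24` is made up to show a duplicate geoname, and `skipinitialspace=True` is an assumption: it is only needed if the real files have a space after each comma, as the samples in the question do.

```python
import io
import pandas as pd

# In-memory stand-ins for csv1 and csv2 (the third network row is invented
# here to show that duplicate geonames are handled).
csv1 = io.StringIO(
    "network, geoname\n"
    "1.0.0.0/24, 123456\n"
    "2.0.0.0/24, 76890\n"
    "3.0.0.0/24, 123456\n"
)
csv2 = io.StringIO(
    "geoname, country_code, country\n"
    "123456, XX, ABC\n"
)

# skipinitialspace strips the blank after each comma, so the column is
# named 'geoname' rather than ' geoname'.
network = pd.read_csv(csv1, skipinitialspace=True)
geo = pd.read_csv(csv2, skipinitialspace=True)

# how='left' keeps every network row; geonames missing from csv2 get NaN.
final = network.merge(geo, how="left", on="geoname")
print(final)
```

Every row of csv1 is kept, each duplicate geoname receives the same country data, and a geoname that has no entry in csv2 ends up with empty country columns.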

Related

I need to split one column in csv file into two columns using python

Hello everyone, I am new to Python and still learning. I have a column in a csv file with this example of value:
I want to split the column programme on the semicolon into two columns, for example
program 1: H2020-EU.3.1.
program 2: H2020-EU.3.1.7.
This is what I wrote initially
import csv
import os
with open('IMI.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    with open('new_IMI.csv', 'w') as new_file:
        csv_writer = csv.writer(new_file, delimiter='\t')
        #for line in csv_reader:
        #    csv_writer.writerow(line)
Please note that after I split the column I need to write the file out again as a csv and save it to my computer.
Please guide me.
Using .loc to iterate through each row of a dataframe is somewhat inefficient. Better to split an entire column, with the expand=True to assign to the new columns. Also as stated, easy to use pandas here:
Code:
import pandas as pd
df = pd.read_csv('IMI.csv')
df[['programme1','programme2']] = df['programme'].str.split(';', expand=True)
df.drop(['programme'], axis=1, inplace=True)
df.to_csv('IMI.csv', index=False)
Example of output:
Before:
print(df)
       id     acronym  status                    programme           topics
0  945358  BIGPICTURE  SIGNED  H2020-EU.3.1.;H2020-EU3.1.7  IMI2-2019-18-01
1  821362      EBiSC2  SIGNED  H2020-EU.3.1.;H2020-EU3.1.7  IMI2-2017-13-06
2  116026     HARMONY  SIGNED                H2020-EU.3.1.  IMI2-2015-06-04
After:
print(df)
       id     acronym  status           topics     programme1     programme2
0  945358  BIGPICTURE  SIGNED  IMI2-2019-18-01  H2020-EU.3.1.  H2020-EU3.1.7
1  821362      EBiSC2  SIGNED  IMI2-2017-13-06  H2020-EU.3.1.  H2020-EU3.1.7
2  116026     HARMONY  SIGNED  IMI2-2015-06-04  H2020-EU.3.1.           None
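One caveat with assigning the split result to exactly two column names: a value with more than one semicolon would produce more than two columns, which that assignment cannot accept. A hedged variant, on made-up values, limits the split to the first semicolon with `n=1`, so at most two parts come back.

```python
import pandas as pd

# Made-up values: one with two parts, one with a single part, one with three.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "programme": ["A;B", "A", "A;B;C"],
})

# n=1 splits on the first semicolon only, so there are never more than two parts.
parts = df["programme"].str.split(";", n=1, expand=True)
df["programme1"] = parts[0]
df["programme2"] = parts[1]
df = df.drop(columns="programme")
print(df)
```

With this choice everything after the first semicolon stays together in `programme2`, and rows without a semicolon get a missing value there.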
You can use the pandas library instead of csv.
import pandas as pd
df = pd.read_csv('IMI.csv')
p1 = {}
p2 = {}
for i in range(len(df)):
    # Only rows that contain a semicolon get split
    if ';' in df['programme'].loc[i]:
        p1[df['id'].loc[i]] = df['programme'].loc[i].split(';')[0]
        p2[df['id'].loc[i]] = df['programme'].loc[i].split(';')[1]
df['programme1'] = df['id'].map(p1)
df['programme2'] = df['id'].map(p2)
and if you want to delete the programme column (drop returns a new dataframe, so assign it back):
df = df.drop('programme', axis=1)
To save the new csv file:
df.to_csv('new_file.csv', index=False)

How to list to individual columns in CSV file?

I have a script that I have been working on where I pull info from a text file and output it to a CSV file with specific column headers.
I am having an issue with writing the output to the correct columns. Instead of writing all the port names from "interface_list" under "Interface", the script writes all of them across a single row. I am having the same issue with my other lists as well.
This is what the output looks like in the csv file:
Current Output
But I would like it to look like this:
Desired Output
I am kind of new to python but have been learning through online searches.
Can anybody help me understand how to get my lists to go in their respective columns?
Here is my code:
import netmiko
import csv
import datetime
import os
import sys
import re
import time
interface_pattern = re.compile(r'interface (\S+)')
regex_description = re.compile(r'description (.*)')
regex_switchport = re.compile(r'switchport (.*)')
with open('int-ports.txt','r') as file:
    output = file.read()
with open('HGR-Core2.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Interface', 'Description', 'Switchport'])
    interface_iter = interface_pattern.finditer(output)
    interface_list = []
    for interface in interface_iter:
        interface_list.append(interface.group(1))
    writer.writerow(interface_list)
    description_iter = regex_description.finditer(output)
    description_list = []
    for description in description_iter:
        description_list.append(description.group(1))
    writer.writerow(description_list)
    switchport_iter = regex_switchport.finditer(output)
    switchport_list = []
    for switchport in switchport_iter:
        switchport_list.append(switchport.group(0))
f.close()
Thanks.
Appending in loops can get very cumbersome.
People don't realize how quickly things can get resource hungry.
1.A.) Put all things into one dataframe that you can later export as csv
salary = [['Alice', 'Data Scientist', 122000],
['Bob', 'Engineer', 77000],
['Ann', 'Manager', 119000]]
# Method 2
import pandas as pd
df = pd.DataFrame(salary)
df.to_csv('file2.csv', index=False, header=False)
1.B.) One list to one specific column in a dataframe
L = ['Thanks You', 'Its fine no problem', 'Are you sure']
#create new df
df = pd.DataFrame({'col':L})
print (df)
                   col
0           Thanks You
1  Its fine no problem
2         Are you sure
2.) Export as csv (see the pandas to_csv documentation)
df.to_csv('name.csv',index=False)
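For the csv module itself, the usual idea is that `writerow` writes one row, so each row has to hold one item from every list. A minimal sketch, assuming a made-up config snippet in place of int-ports.txt and an in-memory buffer in place of the output file:

```python
import csv
import io
import re
from itertools import zip_longest

# Hypothetical sample of the text file contents (an assumption for illustration).
sample = """interface Gi1/0/1
 description Uplink
 switchport mode trunk
interface Gi1/0/2
 description Printer
 switchport mode access
"""

interfaces = re.findall(r'interface (\S+)', sample)
descriptions = re.findall(r'description (.*)', sample)
switchports = re.findall(r'switchport (.*)', sample)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Interface', 'Description', 'Switchport'])
# zip_longest pairs the lists item by item and pads shorter lists with ''.
rows = list(zip_longest(interfaces, descriptions, switchports, fillvalue=''))
writer.writerows(rows)
print(buffer.getvalue())
```

Pairing by position only lines up correctly when every interface has exactly one description and one switchport line. If some interfaces lack one, parsing the file one interface block at a time would be the safer route.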

Can't get rid of a column while writing data to a csv file using reverse search

I've created a script in python to read different id numbers from a csv file in order to use them with a link to populate result and write the result in a different csv file.
This is the base link https://abr.business.gov.au/ABN/View?abn= and these are the numbers (stored in a csv file) 78007306283,70007746536,95051096649 appended to that link to make them usable links. Those numbers are under ids header in the csv file. One such qualified link is https://abr.business.gov.au/ABN/View?abn=78007306283.
My script can read the numbers from a csv file, append them one by one in that link, populate the result in the website and write them in another csv file after extraction.
The only problem I'm facing is that my newly created csv file contains the ids header as well whereas I would like to exclude that column in the new csv file.
How can I get rid of a column available in the old csv file when writing the result in a new csv file?
I've tried so far:
import csv
import requests
from bs4 import BeautifulSoup
URL = "https://abr.business.gov.au/ABN/View?abn={}"
with open("itemids.csv", "r") as f, open('information.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    newfieldnames = reader.fieldnames + ['Name', 'Status']
    writer = csv.DictWriter(g, fieldnames=newfieldnames)
    writer.writeheader()
    for entry in reader:
        res = requests.get(URL.format(entry['ids']))
        soup = BeautifulSoup(res.text,"lxml")
        item = soup.select_one("span[itemprop='legalName']").text
        stat = soup.find("th",string="ABN status:").find_next_sibling().get_text(strip=True)
        print(item,stat)
        new_row = entry
        new_row['Name'] = item
        new_row['Status'] = stat
        writer.writerow(new_row)
The answer below is basically pointing out that pandas gives you some control over manipulating tables (i.e., you want to get rid of a column). You can certainly do it using csv and BeautifulSoup, but pandas accomplishes the same in fewer lines of code.
For example, just using your list of the 3 ids, could generate a table to easily write to file:
import pandas as pd
URL = "https://abr.business.gov.au/ABN/View?abn="
# Read in your csv with the ids
id_df = pd.read_csv('path/file.csv')
# Create your list of ids from that csv
id_list = list(id_df['ids'])
rows = []
for entry in id_list:
    url = URL + str(entry)
    # read_html fetches the page itself and returns a list of the tables it finds
    table = pd.read_html(url)[0]
    name = table.iloc[0, 1]
    status = table.iloc[1, 1]
    rows.append([name, status])
# Build the dataframe once at the end (DataFrame.append was removed in pandas 2.0)
results = pd.DataFrame(rows, columns=['Name', 'Status'])
results.to_csv('path/new_file.csv', index=False)
Output:
print(results)
                                           name                   status
0  AUSTRALIAN NATIONAL MEMORIAL THEATRE LIMITED  Active from 30 Mar 2000
1                MCDONNELL INDUSTRIES PTY. LTD.  Active from 24 Mar 2000
2                         FERNSPOT PTY. LIMITED  Active from 01 Nov 1999
3                         FERNSPOT PTY. LIMITED  Active from 01 Nov 1999
As far as with the code you're dealing with, I believe the issue is with:
new_row = entry
because entry refers to file f, which has that id column. What you could do is drop the column right before you write. And technically, I believe it's a dictionary you have, so you just need to delete whatever that key:value is:
I don't have a way to test at the moment, but I'm thinking it would be something like:
new_row = entry
new_row['Name'] = item
new_row['Status'] = stat
del new_row['ids']  # or whatever the key is for that id value
writer.writerow(new_row)
EDIT / ADDITIONAL
The reason it's still showing is because of this line:
newfieldnames = reader.fieldnames + ['Name', 'Status']
Since you have reader = csv.DictReader(f), it's including the ids column. So in your newfieldnames = reader.fieldnames + ['Name', 'Status'], you're including the field names from the original csv. Just drop the reader.fieldnames +, and initialize your new_row = {}
I think this should work it out:
import csv
import requests
from bs4 import BeautifulSoup
URL = "https://abr.business.gov.au/ABN/View?abn={}"
with open("itemids.csv", "r") as f, open('information.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    newfieldnames = ['Name', 'Status']
    writer = csv.DictWriter(g, fieldnames=newfieldnames)
    writer.writeheader()
    for entry in reader:
        res = requests.get(URL.format(entry['ids']))
        soup = BeautifulSoup(res.text,"lxml")
        item = soup.select_one("span[itemprop='legalName']").text
        stat = soup.find("th",string="ABN status:").find_next_sibling().get_text(strip=True)
        print(item,stat)
        new_row = {}
        new_row['Name'] = item
        new_row['Status'] = stat
        writer.writerow(new_row)
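To see the effect of the field names in isolation, here is a small sketch without any network access. The `lookup` function is a hypothetical stand-in for the scraping step, and both files are in-memory buffers. It also shows `extrasaction='ignore'`, a `csv.DictWriter` option that lets you pass the whole row and have undeclared keys such as `ids` dropped silently.

```python
import csv
import io

# Hypothetical stand-in for the scraped result of each id (no network access here).
def lookup(abn):
    return {"Name": f"Company {abn}", "Status": "Active"}

source = io.StringIO("ids\n78007306283\n70007746536\n")
target = io.StringIO()

reader = csv.DictReader(source)
# Only 'Name' and 'Status' are declared, so 'ids' never reaches the output.
# extrasaction='ignore' avoids the ValueError normally raised for extra keys.
writer = csv.DictWriter(target, fieldnames=["Name", "Status"], extrasaction="ignore")
writer.writeheader()
for entry in reader:
    entry.update(lookup(entry["ids"]))
    writer.writerow(entry)

lines = target.getvalue().splitlines()
print(lines)
```

Either approach works: build a fresh dictionary per row as in the answer above, or keep the original row and let the writer ignore the keys that are not in `fieldnames`.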
You can also do web scraping in Python with the pandas package, which takes less code. You get a data frame first and then select any column or row. Take a look at how I did it: https://medium.com/#alcarsil/python-for-cryptocurrencies-absolutely-beginners-how-to-find-penny-cryptos-and-small-caps-72de2eb6deaa

unable to write data into excel which has been split based on street name and address using python

I am trying to read address information from Excel, break it down into street name, street number, direction and zip code, and then write it to another Excel or csv file based on that classification.
Sample input
Address1
107 ALVISO DR
12418 SUNNYGLEN DR
2292 MAGNOLIA ST
2092 ATWATER AVE
1242 CARLSBAD PL
Sample output
ZipCode StreetNamePostDirectional StreetNamePreDirectional
777 E N
Based on certain set of rules. I am using the below code.
The problem is that when I write the data to the csv it just returns 1 row.
import csv
import usaddress
import xlsxwriter
file_name = 'Address.xlsx'
import pandas as pd
xl_workbook = pd.ExcelFile(file_name) # Load the excel workbook
df = xl_workbook.parse("Sheet1") # Parse the sheet into a dataframe
aList = df['Address1'].tolist()
di = {}
dicts ={}
for i in aList:
    i = str(i)
    x = usaddress.parse(i)
    for ele in x:
        try:
            di[ele[1]].append(ele[0])
        except KeyError:
            di[ele[1]] = [ele[0],]
    dicts.update(di)
with open("test.csv", "w") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(dicts.keys())
    writer.writerows(zip(*dicts.values()))
Not sure what is going wrong
If you make your table into a CSV with appropriate headers, you can do the following to access the data in each row:
import csv
# Raw string so the backslashes in the Windows path are not treated as escapes
with open(r'C:\folder\file.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        var1 = row["street name"]
        var2 = row["street number"]
        var3 = row["zip code"]
From there you can use these variables to create your output.
I hope this helps.
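A likely reason for the single row in the question is that `zip(*dicts.values())` stops at the shortest list, so a label that was seen only once limits the output to one row. A sketch of one way around it is to build one dictionary per address and let `csv.DictWriter` fill in the gaps. The parsed data below is a hand-written assumption in the same shape that `usaddress.parse` returns (a list of token and label pairs), so the example runs without the library.

```python
import csv
import io

# Hypothetical parser output: one list of (token, label) pairs per address.
parsed_addresses = [
    [("107", "AddressNumber"), ("ALVISO", "StreetName"), ("DR", "StreetNamePostType")],
    [("2292", "AddressNumber"), ("MAGNOLIA", "StreetName"), ("ST", "StreetNamePostType")],
    [("1242", "AddressNumber"), ("CARLSBAD", "StreetName")],
]

rows = []
for pairs in parsed_addresses:
    row = {}
    for token, label in pairs:
        # Join tokens that share a label, e.g. multi-word street names.
        row[label] = (row[label] + " " + token) if label in row else token
    rows.append(row)

# Collect every label seen, in first-seen order, to use as the header.
fieldnames = list(dict.fromkeys(label for row in rows for label in row))

out = io.StringIO()
# restval='' leaves the cell empty when an address lacks a label.
writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
lines = out.getvalue().splitlines()
print(lines)
```

Each address keeps its own row this way, and a missing component shows up as an empty cell instead of shortening the whole file.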

Python: extracting data values from one file with IDs from a second file

I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
Not sure if I am reading in the excel file correctly.
I also do not know how to write the extracted data to a new file once I get the code to do it.
Thanks for your help.
With pandas:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0, sep=r'\s+')
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.reindex(ID_file.index).dropna()
res.to_csv('result.csv')
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv as whitespace delimited (sep=r'\s+' does the same job as the older delim_whitespace=True, which newer pandas versions deprecate):
data_file = pd.read_csv('data.csv', index_col=0, sep=r'\s+')
It looks like this:
>>> data_file
                          Values
ID
HOT224_1_0025m_c100047_1      16
HOT224_1_0025m_c10004_1        3
HOT224_1_0025m_c100061_1       1
HOT224_1_0025m_c10010_2        1
HOT224_1_0025m_c10020_1        1
Now, read your Excel file, using the ids as index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
Then use its index with reindex to get the matching entries from your first dataframe. IDs that are not in data.csv come back as missing values, which you drop with dropna(). (Recent pandas versions raise a KeyError if you pass labels that are missing to .loc, so reindex is used here.)
res = data_file.reindex(ID_file.index).dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
You can do it using a simple dictionary in Python. Build a dictionary from File 1, then read the IDs from File 2. Look up each ID from File 2 in the dictionary and write only the matching ones to your output file. Something like this could work (it assumes the IDs have been saved as id.csv):
with open('data.csv', 'r') as f:
    lines = f.readlines()
# Skip the CSV header
lines = lines[1:]
table = {l.split()[0]: l.split()[1] for l in lines if len(l.strip()) != 0}
with open('id.csv', 'r') as f:
    lines = f.readlines()
# Skip the CSV header
lines = lines[1:]
matchedIDs = [(l.strip(), table[l.strip()]) for l in lines if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them to a file in any format you like.
I'm also new to Python programming, so the code below might not be the most efficient. The situation I assumed is that you want to find the ids in data.csv that are also in id.csv; there might be some ids in data.csv that are not in id.csv and vice versa.
import pandas as pd
data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')
# Collect the IDs from each file into plain lists
d = []
for row in data['ID']:
    d.append(row)
f = []
for row in id2['IDs']:
    f.append(row)
# Keep only the IDs that appear in both files
g = []
for i in d:
    if i in f:
        g.append(i)
data = pd.read_csv('data.csv', index_col='ID')
new_data = data.loc[g, :]
new_data.to_csv('new_data.csv')
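The three loops above can probably be collapsed into a single `isin` filter. A minimal sketch on in-memory copies of part of the sample data, where the `ids` dataframe stands in for whatever `pd.read_excel('ID.xlsx')` would return:

```python
import io
import pandas as pd

# In-memory stand-in for data.csv (whitespace separated, as in the question).
data = pd.read_csv(io.StringIO(
    "ID Values\n"
    "HOT224_1_0025m_c100047_1 16\n"
    "HOT224_1_0025m_c10004_1 3\n"
    "HOT224_1_0025m_c100061_1 1\n"
), sep=r"\s+")

# Stand-in for the ID list; the last ID does not occur in the data.
ids = pd.DataFrame({"IDs": [
    "HOT224_1_0025m_c100047_1",
    "HOT224_1_0025m_c100061_1",
    "HOT225_1_0025m_c100547_1",
]})

# Keep only the rows whose ID appears in the ID list.
matched = data[data["ID"].isin(ids["IDs"])]
print(matched.to_csv(index=False))
```

Because no missing values are introduced along the way, the Values column keeps its integer type, and `matched.to_csv('result.csv', index=False)` would write the result to a file.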
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
