Python/Numpy(CSV): Finding values, appending another csv - python

I have found other posts very closely related to this, but they are not helping.
I have a Master CSV file, and I need to find a specific string in its second column. A sample is shown below:
Name,ID,Title,Date,Prj1_Assigned,Prj1_closed,Prj2_assigned,Prj2_solved
Joshua Morales,MF6B9X,Tech_Rep, 08-Nov-2016,948,740,8,8
Betty García,ERTW77,SME, 08-Nov-2016,965,854,15,12
Kathleen Marrero,KTD684,Probation, 08-Nov-2016,946,948,na,na
Mark León,GSL89D,Tech_Rep, 08-Nov-2016,951,844,6,4
The ID column is unique, so I was trying to find 'KTD684' (for example). Once found, I need to export the values of "Date", "Prj1_Assigned", "Prj1_closed", "Prj2_assigned" and "Prj2_solved".
The export would go to a file 'KTD684.csv' (named after the ID) that already has the headers 'Date,Prj1_Assigned,Prj1_closed,Prj2_assigned,Prj2_solved'.
So far (as I am a non-programmer) I have not been able to draft this, but could someone please be kind enough to guide me through:
Finding the row with the element 'KTD684'.
Selecting the values of the below columns from that row:
Date, Prj1_Assigned, Prj1_closed, Prj2_assigned, Prj2_solved
Appending those values to the file named after the ID ('KTD684.csv').
I need to perform this for 45 user IDs, and now, with hiring in the company, it's 195. I tried to write an Excel macro (which didn't work either), but I feel Python is the most reliable option.
I know I need to at least show some basic progress, but after over 2 months of trying to learn from someone, I'm still unable to find the element in this CSV.

If I understand your problem correctly, you need to read from 2 input files:
1 containing the user IDs you are looking for (search_for.csv)
2 containing the project data related to the users (users.csv)
In that fashion, something like the following would find all the users specified in 1 inside file 2 and write each user's rows out to their own <ID>.csv. Specify your search IDs in search_for.csv, and keep in mind that this will rewrite each of those output files every time you run it.
import csv
import os

inputPatterns = open(os.curdir + '/search_for.csv', 'rt')
# Reader for the IDs (users) you are looking to find (key)
reader = csv.reader(inputPatterns)
ids = []
# Reading the IDs you are looking for from search_for.csv
for row in reader:
    ids.append(row[0])
inputPatterns.close()

# Let's see if any of the user IDs we are looking for has any project-related info;
# if so, write it to that user's output CSV
for userID in ids:
    # Organization list with names and company IDs, and its reader
    userList = open(os.curdir + '/users.csv', 'rt')
    reader = csv.reader(userList)
    # This will be the output file
    result_f = open(os.curdir + "/" + userID + ".csv", 'w', newline='')
    w = csv.writer(result_f)
    # Writing header information
    w.writerow(['Date', 'Prj1_Assigned', 'Prj1_closed', 'Prj2_assigned', 'Prj2_solved'])
    # Scanning for projects for this user and appending them
    for row in reader:
        if userID == row[1]:
            w.writerow([row[3], row[4], row[5], row[6], row[7]])
    result_f.close()
    userList.close()
For example, search_for.csv would simply list one ID per line:
KTD684
GSL89D

This is an ideal use-case for pandas:
import pandas as pd

id_list = ['KTD684']
df = pd.read_csv('input.csv')
# Only keep values that are in 'id_list'
df = df[df['ID'].isin(id_list)]
gb = df.groupby('ID')
for name, group in gb:
    with open('{}.csv'.format(name), 'a') as f:
        group.to_csv(f, header=False, index=False,
                     columns=["Date", "Prj1_Assigned", "Prj1_closed",
                              "Prj2_assigned", "Prj2_solved"])
This will open the CSV, keep only the rows whose ID is in your list (id_list), group by the values in the ID column, and save an individual CSV file for each unique ID. You just need to expand id_list to include the IDs you are interested in.
Extended example:
Reading in the CSV results in a DataFrame object like this:
df = pd.read_csv('input.csv')
Name ID Title Date Prj1_Assigned \
0 Joshua Morales MF6B9X Tech_Rep 08-Nov-2016 948
1 Betty García ERTW77 SME 08-Nov-2016 965
2 Kathleen Marrero KTD684 Probation 08-Nov-2016 946
3 Mark León GSL89D Tech_Rep 08-Nov-2016 951
Prj1_closed Prj2_assigned Prj2_solved
0 740 8 8
1 854 15 12
2 948 na na
3 844 6 4
If you just select KTD684 and GSL89D:
id_list = ['KTD684', 'GSL89D']
df = df[df['ID'].isin(id_list)]
Name ID Title Date Prj1_Assigned \
2 Kathleen Marrero KTD684 Probation 08-Nov-2016 946
3 Mark León GSL89D Tech_Rep 08-Nov-2016 951
Prj1_closed Prj2_assigned Prj2_solved
2 948 na na
3 844 6 4
The groupby operation groups on ID and exports each unique ID to its own CSV file, resulting in:
KTD684.csv
Date,Prj1_Assigned,Prj1_closed,Prj2_assigned,Prj2_solved
08-Nov-2016,946,948,na,na
GSL89D.csv
Date,Prj1_Assigned,Prj1_closed,Prj2_assigned,Prj2_solved
08-Nov-2016,951,844,6,4

Here's a pure Python approach which reads the master .csv file with csv.DictReader, matches the IDs, and appends the selected fields to a new or existing .csv file with csv.DictWriter:
from csv import DictReader
from csv import DictWriter
from os.path import isfile

def export_csv(user_id, master_csv, fieldnames, key_id, extension=".csv"):
    filename = user_id + extension
    file_exists = isfile(filename)
    with open(file=master_csv) as in_file, open(
        file=filename, mode="a", newline=""
    ) as out_file:
        # Create reading and writing objects
        csv_reader = DictReader(in_file)
        csv_writer = DictWriter(out_file, fieldnames=fieldnames)
        # Only write the header once
        if not file_exists:
            csv_writer.writeheader()
        # Go through the lines and match IDs
        for line in csv_reader:
            if line[key_id] == user_id:
                # Keep only the wanted fields and append the row to the file
                line = {k: v.strip() for k, v in line.items() if k in fieldnames}
                csv_writer.writerow(line)
Which can be called like this:
export_csv(
    user_id="KTD684",
    master_csv="master.csv",
    fieldnames=["Date", "Prj1_Assigned", "Prj1_closed", "Prj2_assigned", "Prj2_solved"],
    key_id="ID",
)
And produces the following KTD684.csv:
Date,Prj1_Assigned,Prj1_closed,Prj2_assigned,Prj2_solved
08-Nov-2016,946,948,na,na

Related

How to do vlookup without pandas using python script

I have two csv files, and I need a Python script that does a vlookup: match the values, take only the needed column, and create a new csv file. I know it can be done with pandas, but I need to do it without pandas or any 3rd-party tools.
INPUT 1 csv file
ID NAME SUBJECT
1 Raj CS
2 Allen PS
3 Bradly DP
4 Tim FS
INPUT 2 csv file
ID COUNTRY TIME
2 USA 1:00
4 JAPAN 14:00
1 ENGLAND 5:00
3 CHINA 0.00
OUTPUT csv file
ID NAME SUBJECT COUNTRY
1 Raj CS ENGLAND
2 Allen PS USA
3 Bradly DP CHINA
4 Tim FS JAPAN
There is probably a more efficient way to do it, but the basic idea is to create a nested dictionary (using the ID as the key) with the other column names and their values stored under each ID key. Then, as you iterate through each file, the dictionary is updated on the ID key.
Finally, put the rows together into a list and write it to file:
input_files = ['C:/test/input_1.csv', 'C:/test/input_2.csv']
lookup_column_name = 'ID'

output_dict = {}
for file in input_files:
    file = open(file, 'r')
    header = {}
    # Read each line in the csv
    for idx, line in enumerate(file.readlines()):
        # If it's the first line, store it as the header
        if idx == 0:
            header = line.split(',')
            # Get the index value of the lookup column from the list of headers
            header_dict = {idx: x.strip() for idx, x in enumerate(header)}
            lookup_column_idx = dict((v, k) for k, v in header_dict.items())[lookup_column_name]
            continue
        line_split = line.split(',')
        # Initialize the dictionary keyed by the lookup column
        if line_split[lookup_column_idx] not in output_dict.keys():
            output_dict[line_split[lookup_column_idx]] = {}
        # If not the lookup column, add the other columns and data to the dictionary
        for idx, value in enumerate(line_split):
            if idx != lookup_column_idx:
                output_dict[line_split[lookup_column_idx]].update({header_dict[idx]: value})
    file.close()

# Create a list of the rows that will be written to file under the correct columns
rows = []
for k, v in output_dict.items():
    header = [lookup_column_name] + list(v.keys())
    row = [k] + [output_dict[k][x].strip() for x in header if x != lookup_column_name]
    row = ','.join(row) + '\n'
    rows.append(row)

# Final list of rows, beginning with the header
output_lines = [','.join(header) + '\n'] + rows

# Writing to file
output = open('C:/test/output.csv', 'w')
output.writelines(output_lines)
output.close()
To do this without pandas (assuming you know the structure of your data and that it fits in memory), you can iterate through the csv file and store the results in a dictionary, where the ID maps to the other information that you want to keep.
You can do this for both csv files and join them manually afterwards by iterating over the keys of the dictionary.
import csv

input1 = './file1.csv'
input2 = './file2.csv'

with open(input1, 'r', encoding='utf-8-sig') as inputlist:
    with open(input2, 'r', encoding='utf-8-sig') as inputlist1:
        with open('./output.csv', 'w', newline='', encoding='utf-8-sig') as output:
            reader = csv.reader(inputlist)
            reader2 = csv.reader(inputlist1)
            writer = csv.writer(output)
            dict1 = {}
            for xl in reader2:
                dict1[xl[0]] = xl[1]
            for i in reader:
                if i[2] in dict1:
                    i.append(dict1[i[2]])
                    writer.writerow(i)
                else:
                    i.append("N/A")
                    writer.writerow(i)
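A minimal sketch of the two-dictionary variant described above (the column layout and the `join_on_id` helper are assumptions for illustration, not part of the original script): build one dict per file keyed on ID, then iterate over the keys of the first to join.

```python
# Hypothetical sketch: file1 rows are (ID, NAME, SUBJECT),
# file2 rows are (ID, COUNTRY, TIME); only COUNTRY is carried over.
def join_on_id(rows1, rows2):
    d1 = {r[0]: r[1:] for r in rows1}  # ID -> [NAME, SUBJECT]
    d2 = {r[0]: r[1:] for r in rows2}  # ID -> [COUNTRY, TIME]
    merged = []
    for key in d1:  # iterate over the keys of the first dictionary
        country = d2.get(key, ["N/A"])[0]
        merged.append([key] + d1[key] + [country])
    return merged

rows1 = [["1", "Raj", "CS"], ["2", "Allen", "PS"]]
rows2 = [["2", "USA", "1:00"], ["1", "ENGLAND", "5:00"]]
print(join_on_id(rows1, rows2))
# [['1', 'Raj', 'CS', 'ENGLAND'], ['2', 'Allen', 'PS', 'USA']]
```

Because the join goes through the dictionaries rather than row positions, the order of rows in the two files does not matter.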

Insert a line between existing lines in a CSV file using python

I am creating a script that writes lines to a CSV file using Python.
For now, my script writes the CSV in this format:
Title row
Value1;Value2;.... (more than 70)
Title row2
Value1;Value2;...
I just want to be able to read the file again and insert a line of values in between rows, like the following:
Title row
Value1;Value2;.... (more than 70)
Value1;Value2;....
Title row2
Value1;Value2;...
Do you have any ideas?
import csv

with open('csvfile.csv', mode='w') as csv_file:
    fieldnames = ['Title', 'row']
    writer = csv.DictWriter(csv_file, delimiter=';', fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'Title': 'Value1', 'row': 'Value2'})
I think you can try getting the index of the row with the header titles, then use append to add the new rows, and then combine the two dataframes again. Here is code that might work for you.
import pandas as pd

# Initialise data of lists
data = {'Title': ['Tom', 'nick', 'krish', 'jack', 'Title', 'Harry'],
        'Row': [20, 21, 19, 18, 'Row', 21]}
new_data = {'Title': ['Rahib'],
            'Row': [25]}

# Create DataFrames
df = pd.DataFrame(data)
new_df = pd.DataFrame(new_data)

index = df[df['Title'] == 'Title'].index.values.astype(int)[0]
upper_df = df.loc[:index - 1]
lower_df = df.loc[index + 1:]
upper_df = upper_df.append(new_df)
upper_df = upper_df.append(lower_df).reset_index(drop=True)
print(upper_df)
This will return the following dataframe:
Title Row
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
4 Rahib 25
5 Harry 21
Thanks to @harshal's answer and @Rahib's answer, but I had already tried pandas for a long time and couldn't get it to work; it looks like it's not really suited to my CSV format.
Finally, I looked at the posts provided by @Serge Ballesta, and in fact a simple readlines plus retrieval of the line index is a pretty simple trick:
with open(output) as myFile:
    for num, line in enumerate(myFile, 1):
        if lookup in line:
            index = num

f = open(output, "r")
contents = f.readlines()
value = ';'.join(value)
f.close()

contents.insert(index + 1, str(value) + '\n')

f = open(output, "w")
contents = "".join(contents)
f.write(contents)
f.close()
With output being the name of the file (passed as a parameter), value being a list of values (joined into a string with ";" as the delimiter), and lookup being the string I was looking for (the title row).
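As a tidier sketch, the same readlines-and-insert trick can be wrapped in a function (the name insert_after and its parameters are my own, not from the original script):

```python
def insert_after(path, lookup, values, delimiter=";"):
    """Insert one delimited row right after the last line containing `lookup`."""
    with open(path) as f:
        contents = f.readlines()
    index = 0
    for num, line in enumerate(contents, 1):
        if lookup in line:
            index = num
    # list.insert(i, x) inserts before position i, so a 1-based line
    # number lands the new row just after the matched line
    contents.insert(index, delimiter.join(values) + "\n")
    with open(path, "w") as f:
        f.write("".join(contents))
```

Called as insert_after('csvfile.csv', 'Title row2', ['Value1', 'Value2']), it would place the new row directly beneath that title line.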

Can't get rid of a column while writing data to a csv file using reverse search

I've created a script in python to read different id numbers from a csv file in order to use them with a link to populate result and write the result in a different csv file.
This is the base link https://abr.business.gov.au/ABN/View?abn= and these are the numbers (stored in a csv file) 78007306283,70007746536,95051096649 appended to that link to make them usable links. Those numbers are under ids header in the csv file. One such qualified link is https://abr.business.gov.au/ABN/View?abn=78007306283.
My script can read the numbers from a csv file, append them one by one in that link, populate the result in the website and write them in another csv file after extraction.
The only problem I'm facing is that my newly created csv file contains the ids column as well, whereas I would like to exclude that column from the new csv file.
How can I get rid of a column available in the old csv file when writing the result in a new csv file?
I've tried so far:
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://abr.business.gov.au/ABN/View?abn={}"

with open("itemids.csv", "r") as f, open('information.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    newfieldnames = reader.fieldnames + ['Name', 'Status']
    writer = csv.DictWriter(g, fieldnames=newfieldnames)
    writer.writeheader()
    for entry in reader:
        res = requests.get(URL.format(entry['ids']))
        soup = BeautifulSoup(res.text, "lxml")
        item = soup.select_one("span[itemprop='legalName']").text
        stat = soup.find("th", string="ABN status:").find_next_sibling().get_text(strip=True)
        print(item, stat)
        new_row = entry
        new_row['Name'] = item
        new_row['Status'] = stat
        writer.writerow(new_row)
This answer is basically pointing out that pandas gives you some control over manipulating tables (i.e., getting rid of a column). You can certainly do it using csv and BeautifulSoup, but pandas accomplishes the same thing in fewer lines of code.
For example, just using your list of the 3 ids, you could generate a table and easily write it to file:
import pandas as pd
import requests

URL = "https://abr.business.gov.au/ABN/View?abn="

# Read in your csv with the ids
id_df = pd.read_csv('path/file.csv')

# Create your list of ids from that csv
id_list = list(id_df['ids'])

results = pd.DataFrame()
for entry in id_list:
    url = URL + '%s' % (str(entry))
    res = requests.get(url)
    table = pd.read_html(url)[0]
    name = table.iloc[0, 1]
    status = table.iloc[1, 1]
    temp_df = pd.DataFrame([[name, status]], columns=['Name', 'Status'])
    results = results.append(temp_df).reset_index(drop=True)

results.to_csv('path/new_file.csv', index=False)
Output:
print(results)
name status
0 AUSTRALIAN NATIONAL MEMORIAL THEATRE LIMITED Active from 30 Mar 2000
1 MCDONNELL INDUSTRIES PTY. LTD. Active from 24 Mar 2000
2 FERNSPOT PTY. LIMITED Active from 01 Nov 1999
3 FERNSPOT PTY. LIMITED Active from 01 Nov 1999
As far as the code you're dealing with, I believe the issue is with:
new_row = entry
because entry comes from file f, which has that ids column. What you could do is drop the column right before you write. And technically, I believe it's a dictionary you have, so you just need to delete whatever that key:value pair is.
I don't have a way to test at the moment, but I'm thinking it would be something like:
new_row = entry
new_row['Name'] = item
new_row['Status'] = stat
del new_row['ids']  # or whatever the key is for that id value
writer.writerow(new_row)
EDIT / ADDITIONAL
The reason it's still showing is because of this line:
newfieldnames = reader.fieldnames + ['Name', 'Status']
Since you have reader = csv.DictReader(f), it includes the ids column. So in newfieldnames = reader.fieldnames + ['Name', 'Status'] you're including the field names from the original csv. Just drop reader.fieldnames + and initialize new_row = {}.
I think this should sort it out:
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://abr.business.gov.au/ABN/View?abn={}"

with open("itemids.csv", "r") as f, open('information.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    newfieldnames = ['Name', 'Status']
    writer = csv.DictWriter(g, fieldnames=newfieldnames)
    writer.writeheader()
    for entry in reader:
        res = requests.get(URL.format(entry['ids']))
        soup = BeautifulSoup(res.text, "lxml")
        item = soup.select_one("span[itemprop='legalName']").text
        stat = soup.find("th", string="ABN status:").find_next_sibling().get_text(strip=True)
        print(item, stat)
        new_row = {}
        new_row['Name'] = item
        new_row['Status'] = stat
        writer.writerow(new_row)
You can do web scraping in Python using the pandas package too, with less code. You get a data frame first, and then you can select any column or row. Take a look at how I did it: https://medium.com/@alcarsil/python-for-cryptocurrencies-absolutely-beginners-how-to-find-penny-cryptos-and-small-caps-72de2eb6deaa

Merge 2 csv file with one unique column but different header [duplicate]

This question already has answers here:
Merging two CSV files using Python
(2 answers)
Closed 7 years ago.
I want to merge 2 csv file using some scripting language (like bash script or python).
1st.csv (this data is from mysql query)
member_id,name,email,desc
03141,ej,ej@domain.com,cool
00002,jes,jes@domain.com,good
00002,charmie,charm@domain.com,sweet
2nd.csv (from mongodb query)
id,address,create_date
00002,someCity,20150825
00003,newCity,20140102
11111,,20150808
The examples are not the actual data, though I know that some of the member_id values from MySQL and the id values from MongoDB are the same.
(And I wish my output to look something like this)
desiredoutput.csv
member_id,name,email,desc,address,create_date
03141,ej,ej@domain.com,cool,,
00002,jes,jes@domain.com,good,someCity,20150825
00002,charmie,charm@domain.com,sweet,,
11111,,,,,20150808
help will be much appreciated. thanks in advance
#########################################################################
#!/usr/bin/python
import csv
import itertools as IT

filenames = ['1st.csv', '2nd.csv']
handles = [open(filename, 'rb') for filename in filenames]
readers = [csv.reader(f, delimiter=',') for f in handles]

with open('desiredoutput.csv', 'wb') as h:
    writer = csv.writer(h, delimiter=',', lineterminator='\n', )
    for rows in IT.izip_longest(*readers, fillvalue=[''] * 2):
        combined_row = []
        for row in rows:
            row = row[:1]  # column where I know there is identical data
            if len(row) == 1:
                combined_row.extend(row)
            else:
                combined_row.extend([''] * 1)
        writer.writerow(combined_row)

for f in handles:
    f.close()
#########################################################################
I just read and tried (manipulated) this code from this site too.
Since you haven't posted an attempt, I'll give you a general answer (using Python) to get you started.
Create a dict, d
Iterate over all the rows of the first file, convert each row into a list and store it in d using member_id as the key and the list as the value.
Iterate over all the rows of the second file, convert each row into a list leaving out the id column and update the list under d[id] with the new list if d[id] exists, otherwise store the new list under d[id].
Finally, iterate over the values in d and print them out comma separated to a file.
Edit
In your attempt, you are trying to use izip_longest to iterate over the rows of both files at the same time. But that would only work if both files had an equal number of rows and they were in the same order.
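To see why, here is a toy illustration (made-up rows, not your real data) of how zip_longest, the Python 3 name for izip_longest, pairs rows purely by position:

```python
from itertools import zip_longest  # izip_longest in Python 2

file1_rows = [["03141", "ej"], ["00002", "jes"]]
file2_rows = [["00002", "someCity"], ["00003", "newCity"], ["11111", ""]]

# Rows are paired strictly by position; the IDs are never compared,
# so "03141" lines up with "00002".
pairs = list(zip_longest(file1_rows, file2_rows, fillvalue=["", ""]))
print(pairs[0])  # (['03141', 'ej'], ['00002', 'someCity'])
print(pairs[2])  # (['', ''], ['11111', ''])
```

The shorter file is padded with the fillvalue, but nothing ever matches the IDs up, which is why a dictionary keyed on the ID is the better tool here.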
Anyhow, here is one way of doing it.
Note: This is using the Python 3.4+ csv module. For 2.7 it might look a little different.
import csv

d = {}
with open("file1.csv", newline="") as f:
    for row in csv.reader(f):
        # Key on member_id; pad each row with blanks for address/create_date
        d.setdefault(row[0], []).append(row + [""] * 3)
with open("file2.csv", newline="") as f:
    for row in csv.reader(f):
        # If the id was never seen in file1, start a blank row for it
        old_rows = d.setdefault(row[0], [[row[0], "", "", ""]])
        for old_row in old_rows:
            old_row[4:] = row[1:]
with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for rows in d.values():
        writer.writerows(rows)
Here goes a suggestion using pandas that I got from this answer and the pandas doc about merging.
import pandas as pd
first = pd.read_csv('1st.csv')
second = pd.read_csv('2nd.csv')
merged = pd.concat([first, second], axis=1)
This will output:
member_id name email desc id address create_date
3141 ej ej#domain.com cool 2 someCity 20150825
2 jes jes#domain.com good 11 newCity 20140102
11 charmie charm#domain.com sweet 11111 NaN 20150808
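Note that pd.concat with axis=1 just places the two frames side by side by row position; it does not match on the ID values. If the goal is to line rows up on member_id/id (as the desired output suggests), an outer merge is closer. A sketch with made-up mini frames (the column subset here is an assumption for brevity):

```python
import pandas as pd

first = pd.DataFrame({"member_id": ["03141", "00002"],
                      "name": ["ej", "jes"]})
second = pd.DataFrame({"id": ["00002", "11111"],
                       "address": ["someCity", "newCity"]})

# how="outer" keeps rows from both sides, matching on the ID columns;
# unmatched rows get NaN in the columns from the other frame
merged = first.merge(second, left_on="member_id", right_on="id", how="outer")
print(merged)
```

Here "03141" keeps its name but gets no address, and "11111" appears with an address but no name, which mirrors the blanks in the desired output.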

Python 3: How to read a csv file and store specific values as variables

I'm new to SO, new at programming, and even more so with Python haha.
I'm trying to read CSV files (which will contain different data types) and store specific values ("coordinates") as variables.
CSV file example (sorry for using code format; the text didn't want to stay put):
$id,name,last_name,age,phone_number,addrstr,addrnum
1,Constance,Harm,37,555-1234,Ocean_view,1
2,Homer,Simpson,40,555-1235,Evergreen_Terrace,742
3,John,Doe,35,555-1236,Fake_Street,123
4,Moe,Tavern,20,7648-4377,Walnut_Street,126
I want to know if there is some easy way to store a specific value using the rows as an index, for example: "take row 2 and store the 2nd value in variable name, the 3rd value in variable lastname" — and the row used for each retrieval will vary.
Not sure if this will help because my coding level is very crappy:
row =  # this value will be taken from ANOTHER csv file
people = open('people.csv', 'r')
linepeople = csv.reader(people)
data = list(linepeople)
name = int(data[row][1])
lastname = int(data[row][2])
age = int(data[row][3])
phone = int(data[row][4])
addrstr = int(data[row][5])
addrnum = int(data[row][6])
I haven't found anything very similar to guide me to a solution. (I have been reading about dictionaries; maybe that will help me?)
EDIT (please let me know if it's not allowed to edit questions): Thanks for the solutions; I'm starting to understand the possibilities, but let me give more info about my expected output:
I'm trying to create a "universal" function to get just one value at a given row/col and store that single value in a variable, not the whole row nor the whole column.
Example: I need to store the phone number of John Doe (column 5, row 4) in a variable, so that printing that variable outputs: 555-1236
You can iterate line by line. Watch out with your example code: you are trying to cast people's names into integers...
for row in linepeople:
    name = row['name']
    age = int(row['age'])
If you are going to do more complicated stuff, I recommend pandas. For starters it will try to convert numerical columns to float, and you can access them with attribute notation.
import pandas as pd
import numpy as np
people = pd.read_table('people.csv', sep=',')
people.name # all the names
people.loc[0:2] # rows 0 through 2 (loc slicing is inclusive)
You can use the CSV DictReader which will automatically assign dictionary names based on your CSV column names on a per row basis as follows:
import csv

with open("input.csv", "r") as f_input:
    csv_input = csv.DictReader(f_input)
    for row in csv_input:
        id = row['$id']
        name = row['name']
        last_name = row['last_name']
        age = row['age']
        phone_number = row['phone_number']
        addrstr = row['addrstr']
        addrnum = row['addrnum']
        print(id, name, last_name, age, phone_number, addrstr, addrnum)
This would print out your CSV entries as follows:
1 Constance Harm 37 555-1234 Ocean_view 1
2 Homer Simpson 40 555-1235 Evergreen_Terrace 742
3 John Doe 35 555-1236 Fake_Street 123
4 Moe Tavern 20 7648-4377 Walnut_Street 126
If you wanted a list of just the names, you could build them as follows:
with open("input.csv", "r") as f_input:
    csv_input = csv.DictReader(f_input)
    names = []
    for row in csv_input:
        names.append(row['name'])
print(names)
Giving:
['Constance', 'Homer', 'John', 'Moe']
As the question has changed, a rather different approach is needed. A simple get-row/col function would work, but it would be very inefficient: the file would need to be read in each time. A better approach is to use a class. This loads the file in once, after which you can get as many entries as you need. It can be done as follows:
import csv

class ContactDetails():
    def __init__(self, filename):
        with open(filename, "r") as f_input:
            csv_input = csv.reader(f_input)
            self.details = list(csv_input)

    def get_col_row(self, col, row):
        return self.details[row - 1][col - 1]

data = ContactDetails("input.csv")
phone_number = data.get_col_row(5, 4)
name = data.get_col_row(2, 4)
last_name = data.get_col_row(3, 4)
print("%s %s: %s" % (name, last_name, phone_number))
By using the class, the file is only read in once. This would print the following:
John Doe: 555-1236
Note: Python numbers indexes from 0, so your 5,4 is converted to 4,3 inside get_col_row.
