Dissociate first name/surname by comparing with a reference list - Python

I have a CSV dataset with 2,000 rows containing a messy column that mixes first names and surnames. I need to split this column into first name and surname. To do that, I have a reference base with all the first names given in France over the last twenty years.
So the source database looks like:
"name"; "town"
"Johnny Aaaaaa"; "Bordeaux"
"Bbbb Tom";"Paris"
"Ccccc Pierre Dddd" ; "Lyon"
...
I want to obtain something like:
"surname"; "firstname"; "town"
"Aaaaaa"; "Johnny "; "Bordeaux"
"Bbbb"; "Tom"; "Paris"
"Ccccc Dddd" ; "Pierre"; "Lyon"
...
And my reference database of first names:
"firstname"; "sex"
"Andre"; "M"
"Bob"; "M"
"Johnny"; "M"
...
Technically, I have to compare each row of the first base against every entry of the second base in order to identify which string corresponds to the first name...
I have no idea how to do that.
Any ideas are welcome... thanks.

Looks like you want to:
Read the data from a file, say input.csv
Extract the name and split it into first name and last name
Get the sex using the first name
And probably write the data to a new CSV, or print it.
You can follow the approach below. You could get more sophisticated and split using a regex, but here is something basic using strip:
inFile = open('input.csv', 'r')
rows = inFile.readlines()
newData = []
if len(rows) > 1:
    for row in rows[1:]:
        # Remove the newline char at the end of the line and split on ;
        data = row.rstrip('\n').split(';')
        # Remove additional spaces in your data
        name = data[0].strip()
        # Get rid of quotes
        name = name.strip('"').split(' ')
        fname = name[1]
        lname = name[0]
        city = data[1].strip()
        city = city.strip('"')
        # Now you can get the sex info from your other database;
        # save it in a list to use later
        sex = 'M'  # replace this with your db calls
        newData.append([fname, lname, sex, city])
inFile.close()

# You can put all of this in a new csv file like this (it separates the fields using commas):
outFile = open('output.csv', 'w')
for row in newData:
    outFile.write(','.join(row))
    outFile.write('\n')
outFile.close()
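For the first name lookup itself, a variant built on the csv module might look like this (a sketch: it assumes the reference file is called firstnames.csv, uses ';' as separator like the samples in the question, and treats every word found in the reference set as the first name, with the remaining words forming the surname):
import csv

# Load the reference first names into a set for fast membership tests
# (assumed file "firstnames.csv" with a header row and ';' as separator)
firstnames = set()
with open('firstnames.csv', 'r') as ref:
    reader = csv.reader(ref, delimiter=';')
    next(reader)  # skip the header row
    for row in reader:
        firstnames.add(row[0].strip().strip('"').lower())

results = []
with open('input.csv', 'r') as src:
    reader = csv.reader(src, delimiter=';')
    next(reader)  # skip the header row
    for row in reader:
        name = row[0].strip().strip('"')
        city = row[1].strip().strip('"')
        words = name.split()
        # Words present in the reference set are treated as the first name,
        # the remaining words as the surname
        first = [w for w in words if w.lower() in firstnames]
        last = [w for w in words if w.lower() not in firstnames]
        results.append([' '.join(last), ' '.join(first), city])

with open('output.csv', 'w') as out:
    writer = csv.writer(out, delimiter=';')
    writer.writerow(['surname', 'firstname', 'town'])
    writer.writerows(results)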

Well. Finally, I chose the "brute force" approach: each term of each line is compared with the 11,000 keys of my second base (converted into a dictionary). Not smart, but effective.
for row in input:
    splitted = row[0].lower().split()
    for s in splitted:
        for cle, valeur in dict.items():
            if cle == s:
                print("{} >> {}".format(cle, valeur))
Any ideas for prettier solutions are still welcome.
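For the record, the inner loop over all 11,000 keys can be replaced with a direct dictionary lookup, which is O(1) per word (a sketch; firstnames here stands for the reference dict, named this way to avoid shadowing the built-in dict):
# Direct lookup instead of scanning every key for every word
for row in input:
    for word in row[0].lower().split():
        if word in firstnames:
            print("{} >> {}".format(word, firstnames[word]))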

Related

How to organise a DataFrame like this, in Python

I have a file which has some information:
1. Movie ID (the number before a ":")
2. User ID
3. User Rating
4. Date
All elements are separated by a ",", except the Movie ID, which is separated by a colon.
If I create a DataFrame like this:
df = pd.read_csv('combined_data_1.txt', header=None, names=['Movie_ID', 'User_ID', 'Rating', 'Date'])
and print the DataFrame, I get a result which is obviously not correct.
So, if you look at the "Movie_ID" column, in the first row, there is a 1:1488844. Only the number "1" (just before the colon) should be in the "Movie_ID" column, not "1:1488844". The rest (1488844) should be in the User_ID column.
Another problem is that not every row has its correct Movie_ID; in that case it should stay "1" until I find another movie id, which again will be the number before a colon.
I know that the ids of all the movies follow a sequence, that is: 1,2,3,4,...
Another problem I saw is that, when I read the file, a split occurs whenever there is a colon, so after the first row (which doesn't get split), whenever a colon appears a row is created whose "Movie_ID" contains only, for example, "2:", not something like the first row.
In the end, I would like each row to have the movie id in "Movie_ID" and the user id, rating and date in their own columns, but I don't know how to organise it like this.
Thanks for the help!
Use shift with axis=1 and simply modify the columns:
df=df.shift(axis=1)
df['Movie_ID']=df['User_ID'].str[0]
df['User_ID']=df['User_ID'].str[2:]
And now:
print(df)
will show the desired result.
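One caveat: .str[0] keeps only the first character, so a movie id with more than one digit would be truncated. Splitting on the colon is a bit more robust (a sketch, assuming the shifted User_ID column holds values like "1:1488844"):
# Split "1:1488844" into the parts before and after the colon
parts = df['User_ID'].str.split(':', n=1, expand=True)
df['Movie_ID'] = parts[0]
df['User_ID'] = parts[1]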
I believe the issue might be coming from how your data is being stored and thus parsed due to the way your Movie ID is stored separated by a : (colon) rather than a , (comma) as would be needed in a CSV.
If you are able to parse the text so that it is delimited by commas exclusively before it is opened as a CSV, you may be able to eliminate this issue. I only note this because Pandas does not permit multiple delimiters.
Here is what I was able to come up with for splitting on both colons and commas the way you want. While I know this isn't your ultimate goal, hopefully it gets you on the right path.
import pandas as pd

with open("combined_data_1.txt") as file:
    lines = file.readlines()

# Splitting the data into a list delineated by colons
data = []
for line in lines:
    if ":" in line:
        data.append([])
    else:  # Using else here prevents the line containing the colon from being saved.
        data[len(data)-1].append(line)

for x in range(len(data)):
    print("Section " + str(x+1) + ":\n")
    print(str(data[x]) + "\n\n")
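Building on the same idea of splitting on the colon lines, the blocks can be turned directly into the DataFrame described in the question by tracking the current movie id while reading (a sketch, assuming block-header lines look like "2:" or "1:1488844,3,2005-09-06"):
import pandas as pd

rows = []
current_movie = None
with open('combined_data_1.txt') as src:
    for line in src:
        line = line.strip()
        if not line:
            continue
        if ':' in line:
            # A line such as "2:" (or "1:1488844,3,2005-09-06") starts a new movie block
            current_movie, _, rest = line.partition(':')
            line = rest
            if not line:
                continue
        user_id, rating, date = line.split(',')
        rows.append((current_movie, user_id, rating, date))

df = pd.DataFrame(rows, columns=['Movie_ID', 'User_ID', 'Rating', 'Date'])
print(df.head())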

Writing a sorted csv file from a changing list of strings. Python 2.7

I have a substantial number of strings consisting of different values (like a phone number, name, etc.).
There is a total of 10 possible values for each string, but any given string may contain only some of them; it might have none, or it might have all of them.
I'm trying to find a way to open a CSV file with 10 columns (the possible values are already known to me) and write every value of every string into its appropriate "cell", or leave a cell empty when needed.
for example:
str1=
name1
phonenum1
address1
email1
str2=
name2
phone2
email2
str3=
name3
adress3
email3
The result I'm looking for in this example should be something like:
name     phonenum     adress     email
name1    phonenum1    adress1    email1
name2    phonenum2               email2
name3                 adress3    email3
I've tried splitting the strings into a list and checking every item for its appropriate column, so I could write it into the specific cell it should go to, but I haven't found a way to write to a specific cell according to the 'type' of value (phone number, name, etc. in this case).
I found some partial answers about rewriting an existing CSV at a specific cell (like all the cells in the 3rd column, or only the 3rd row in the 4th column), but nothing I could rearrange to reach my goal successfully.
Two more difficulties I'm having are: 1. some of the values contain commas in them.
And 2. how do I reliably recognize a missing value so I can keep its cell empty? In the example above, how can I recognize that the value I'm missing is the phone number and not the name or the address, for example?
Use Numpy's genfromtxt() method to properly read CSV files; it does all the separation and comma handling for you.
Define a row class with primitive slots for the different values and blanks as default values.
Override the __str__ method for your specific needs.
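A minimal sketch of that row-class idea (the field names and defaults here are assumptions, not taken from the question):
class Row(object):
    # Fixed set of possible fields, blank by default
    __slots__ = ('name', 'phone', 'address', 'email')

    def __init__(self, **values):
        for field in self.__slots__:
            setattr(self, field, values.get(field, ''))

    def __str__(self):
        # Render as one CSV line, quoting values that contain commas
        cells = []
        for field in self.__slots__:
            value = getattr(self, field)
            cells.append('"%s"' % value if ',' in value else value)
        return ','.join(cells)

print(Row(name='name1', email='email1'))   # prints: name1,,,email1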
Let's assume your data looks like:
str1 = """
Adam
+48100200300
Street 2, Dublin
adam#adam.com
"""
str2 = """
Eva
48100000000
eva#eva.com
"""
str3 = """
Tom Jr
Street 1, London
tom#tom.com
"""
data = [str1, str2, str3]
Let's define the fields you are expecting:
field_names = [
    'name',
    'phone',
    'email',
    'address',
]
Because in your example the fields are not labelled, and as I see there may be different combinations, we need to recognize what each field contains.
Let's write a simple solution (you will definitely need a more sophisticated recognition method, but this is an example):
import re

def recognize_field_name(line):
    if not line:
        return
    if re.fullmatch('\\+?[0-9]+', line):
        return 'phone'
    if '#' in line:
        return 'email'
    if ',' in line:
        return 'address'
    return 'name'
Then let's parse the input data:
results = []
for one_string in data:
    result = {}
    for l in one_string.split("\n"):
        value = l.strip()
        field_name = recognize_field_name(value)
        if field_name:
            result[field_name] = value
    results.append(result)
And finally we may store it:
import csv

with open("/tmp/out.csv", "w") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=field_names)
    for r in results:
        writer.writerow(r)

with open("/tmp/out.csv") as show:
    print(show.read())
This will produce:
Adam,+48100200300,adam#adam.com,"Street 2, Dublin"
Eva,48100000000,eva#eva.com,
Tom Jr,,tom#tom.com,"Street 1, London"
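If a header row is wanted (the desired result in the question shows one), writer.writeheader() can be added right after creating the DictWriter:
writer = csv.DictWriter(csv_file, fieldnames=field_names)
writer.writeheader()  # writes: name,phone,email,address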
This solution is written in Python 3, but it should be easy to modify for your (2.7) needs.

Copy number file format issue (Need to modify the structure)

I have a file in a special format, .cns, which is a segmented file used to analyze copy number. It is a text file that looks like this (header plus first data line):
head -1 copynumber.cns
chromosome,start,end,gene,log2
chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067
We transformed it to a .csv so we could separate it by tab (but it didn't work well). The .cns is separated by commas but genes are a single string delimited by quotes. I hope this is useful. The output I need is something like this:
gene log2
LOC102725121 -0.28067
DDX11L1 -0.28067
OR4F5 -0.28067
PIK3CA 0.35475
NRAS 3.35475
The first step would be to separate everything by commas, then transpose the columns, and finally print the log2 value for each gene contained in that quote-delimited string. If you could help me with an R or Python script it would help a lot. Perhaps awk would work too.
I am using Linux Ubuntu 16.04.
I'm not sure if I am being clear; let me know if more detail would be useful.
Thank you!
Hope the following Python code helps:
import csv

list1 = []
with open('copynumber.cns', 'r') as file:
    exampleReader = csv.reader(file)
    for row in exampleReader:
        list1.append(row)

for row in list1:
    strings = row[3].split(',')  # Get the fourth column (the gene column) and split on commas
    for string in strings:       # Loop through each gene
        print(string + ' ' + str(row[4]))
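If the gene/log2 pairs should end up in a tab-separated file rather than just being printed, the same loop can feed csv.writer (a sketch; the output file name gene_log2.tsv is just an assumption):
import csv

with open('copynumber.cns', 'r') as src, open('gene_log2.tsv', 'w') as out:
    reader = csv.reader(src)
    writer = csv.writer(out, delimiter='\t')
    writer.writerow(['gene', 'log2'])
    next(reader)  # skip the header line (chromosome,start,end,gene,log2)
    for row in reader:
        for gene in row[3].split(','):
            writer.writerow([gene, row[4]])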

How to Store Rows from CSV File into Python and Print Data with HTML

Basically my problem is this: I have a CSV file with info on Southpark characters and an HTML template, and what I have to do is take the data row by row (stored in lists) for each character and, using the given HTML template, plug that data in to create 5 separate HTML pages named after the characters' last names.
Here is an image of the CSV file: i.imgur.com/rcIPW.png
This is what I have so far:
askfile = raw_input("What is the filename?")
southpark = []
filename = open(askfile, 'rU')
for row in filename:
    print row[0:105]
filename.close()
The above prints all the info in the IDLE shell in five rows, but I have to find a way to separate each row AND column and store it in a list (which I don't know how to do). It's pretty rudimentary code, I know. I'm trying to figure out how to store the rows and columns first; then I will have to use a function (def) to assign the data to the HTML template and then create an HTML file from that data/template. I'm still a noob; I tried searching the net but I just don't understand the stuff.
I am not allowed to use any downloadable modules, but I can use things built into Python like import csv or whatnot; really it's supposed to be written with a couple of functions, lists, strings, and loops.
Once I figure out how to separate the rows and columns and store them, I can work on implementing them into the HTML template and creating the file.
I'm not trying to have my HW done for me, it's just that I pretty much suck at programming, so any help is appreciated!
BTW I am using Python 2.7.2 and if you want to DL the CSV file click here.
UPDATE:
Okay, thanks a lot! That helped me understand what each row is printing and what info is being read by the program. Now, since I have to use functions in this program somehow, this is what I was thinking.
Each index (row[0] to row[6]) prints out a separate value, but a plain print row prints one character and all of his corresponding values, which is what I need. What I want is to print data the way print row does, but I have to store each of those 5 characters in a separate list.
Basically print row prints out all 5 characters with each of their corresponding attributes; how can I split them into 5 variables and store them as lists?
When I do print row[0] it only prints the names, and print row[1] only prints the DOB. I was thinking of creating a def function that takes print row and splits it into 5 variables in a loop, and then another def function that takes those variables/lists of data and combines them with the HTML template; at the end I have to figure out how to create HTML files in Python.
Sorry if I sound confusing, I'm just trying to make sense of it all. This is my code right now; it gives an error that there are too many values to unpack, but I am just fiddling around and trying different things to see if they work. Based on what I wanted to do above, I will probably have to delete most of this code and find a way to rewrite it with list functions like .append or .strip, etc., which I am not very familiar with.
import csv

original = file('southpark.csv', 'rU')
reader = csv.reader(original)

# List of Data
name, dob, descript, phrase, personality, character, apparel = []
count = 0

def southparkinfo():
    for row in reader:
        count += 1
        if count == 0:
            row[0] = name
            print row[0]  # Name (ex. Stan Marsh)
            print "----------------"
        elif count == 1:
            row[1] = dob
            print row[1]  # DOB
            print "----------------"
        elif count == 2:
            row[2] = descript
            print row[2]  # Descriptive saying (ex. Respect My Authoritah!)
            print "----------------"
        elif count == 3:
            row[3] = phrase
            print row[3]  # Catch Phrase (ex. Mooom!)
            print "----------------"
        elif count == 4:
            row[4] = personality
            print row[4]  # Personality (ex. Jewish)
            print "----------------"
        elif count == 5:
            row[5] = character
            print row[5]  # Characteristic (ex. Politically incorrect)
            print "----------------"
        elif count == 6:
            row[6] = apparel
            print row[6]  # Apparel (ex. red gloves)
    return

reader.close()
First and foremost, have a look at the CSV docs.
Once you understand the basics, take a look at this code. It should get you started on the right path:
import csv

original = file('southpark.csv', 'rU')
reader = csv.reader(original)

for row in reader:
    # will print each row by itself (all columns from names up to what they wear)
    print row
    print "-----------------"
    # will print first column (character names only)
    print row[0]
You want to import csv module so you can work with the CSV filetype. Open the file in universal newline mode and read it with csv.reader. Then you can use a for loop to begin iterating through the rows depending on what you want. The first print row will print a single line of all a single character's data (ie: everything from their name up to their clothing type) like so:
['Stan Marsh', 'DOB: October 19th', 'Dude!', 'Aww #$%^!', 'Star Quarterback', 'Wendy', 'red gloves']
-----------------
['Kyle Broflovski', 'DOB: May 26th', 'Kick the baby!', 'You ***!', 'Jewish', 'Canadian', 'Ushanka']
-----------------
['Eric Theodore Cartman', 'DOB: July 1', 'Respect My Authroitah!', 'Mooom!', 'Big-boned', 'Politically incorrect', 'Knit-cap!']
-----------------
['Kenny McCormick', 'DOB: March 22', 'DOD: Every other week', 'Mmff Mmff', 'MMMFFF!!!', 'Mysterion!', 'Orange Parka']
-----------------
['Leopold Butters Stotch', 'DOB:Younger than the others!', 'The 4th friend', 'Professor chaos', 'stutter', 'innocent', 'nerdy']
-----------------
Finally, the second statement print row[0] will provide you with the character names only. You can change the number and you'll be able to grab the other data as necessary. Remember, indexing starts at 0, so in your case you can only go up to 6 because column A=0, B=1, C=2, etc. To see these outputs more clearly, it's probably best if you comment out one of the print statements so you get a clearer picture of what you are grabbing.
-----------------
Stan Marsh
-----------------
Kyle Broflovski
-----------------
Eric Theodore Cartman
-----------------
Kenny McCormick
-----------------
Leopold Butters Stotch
Note I threw in that print "-----------------" so you would be able to see the different outputs.
Hope this helps you get you off to a start.
Edit: To answer your second question, the easiest way (although probably not the best way) to grab all of a single character's info would be to do something like this:
import csv
original = file('southpark.csv', 'rU')
reader = csv.reader(original)
stan = reader.next()
kyle = reader.next()
eric = reader.next()
kenny = reader.next()
butters = reader.next()
print eric
which outputs:
['Eric Theodore Cartman', 'DOB: July 1', 'Respect My Authroitah!', 'Mooom!', 'Big-boned', 'Politically incorrect', 'Knit-cap!']
Take note that if your CSV is modified so that the order of the characters changes (e.g. Butters is moved to the top), you will output the info of another character.
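From there, filling the HTML template and writing one page per character could look roughly like this (a sketch; template.html, its %s placeholders, and the last-name file naming are assumptions, since the actual template isn't shown):
import csv

original = file('southpark.csv', 'rU')
reader = csv.reader(original)

# Assumed: an HTML template with seven %s placeholders, one per column
template = open('template.html', 'rU').read()

for row in reader:
    name, dob, descript, phrase, personality, character, apparel = row
    last_name = name.split()[-1]          # e.g. "Marsh" from "Stan Marsh"
    page = template % (name, dob, descript, phrase, personality, character, apparel)
    out = open(last_name + '.html', 'w')
    out.write(page)
    out.close()

original.close()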

A solution to remove the duplicates?

My code is below. Basically, I've got a CSV file and a text file "input.txt". I'm trying to create a Python application which will take the input from "input.txt" and search through the CSV file for a match and if a match is found, then it should return the first column of the CSV file.
import csv

csv_file = csv.reader(open('some_csv_file.csv', 'r'), delimiter=",")
header = csv_file.next()
data = list(csv_file)

input_file = open("input.txt", "r")
lines = input_file.readlines()

for row in lines:
    inputs = row.strip().split(" ")
    for input in inputs:
        input = input.lower()
        for row in data:
            if any(input in terms.lower() for terms in row):
                print row[0]
Say my CSV file looks like this:
book title, author
The Rock, Herry Putter
Business Economics, Herry Putter
Yogurt, Daniel Putter
Short Story, Rick Pan
And say my input.txt looks like this:
Herry
Putter
Therefore when I run my program, it prints:
The Rock
Business Economics
The Rock
Business Economics
Yogurt
This is because it searches for all titles with "Herry" first, and then searches all over again for "Putter". So in the end, I have duplicates of the book titles. I'm trying to figure out a way to remove them...so if anyone can help, that would be greatly appreciated.
If original order does not matter, then stick the results into a set first, and then print them out at the end. But, your example is small enough where speed does not matter that much.
Stick the results in a set (which is like a list but only contains unique elements), and print at the end.
Something like:
if any(input in terms.lower() for terms in row):
    if not row[0] in my_set:
        my_set.add(row[0])
During the search stick results into a list, and only add new results to the list after first searching the list to see if the result is already there. Then after the search is done print the list.
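Putting that together, the set-based version of the loop from the question could look like this (a sketch that reuses the lines and data variables from the original code; my_set just has to be created before the loops start):
my_set = set()
for line in lines:
    for term in line.strip().lower().split(" "):
        for row in data:
            if any(term in field.lower() for field in row):
                my_set.add(row[0])

# print the unique titles once the search is done
for title in my_set:
    print title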
First, get the set of search terms you want to look for in a single list. We use set(...) here to eliminate duplicate search terms:
search_terms = set(open("input.txt", "r").read().lower().split())
Next, iterate over the rows in the data table, selecting each one that matches the search terms. Here, I'm preserving the behavior of the original code, in that we search for the case-normalized search term in any column for each row. If you just wanted to search e.g. the author column, then this would need to be tweaked:
results = [row for row in data
           if any(search_term in item.lower()
                  for item in row
                  for search_term in search_terms)]
Finally, print the results.
for row in results:
    print row[0]
If you wanted, you could also list the authors or any other info in the table. E.g.:
for row in results:
    print '%30s (by %s)' % (row[0], row[1])
