python CSV intelligent parser and column matching

I have a couple of thousand CSV files, most of which have the following columns:
threadSubject
bccList
sender_name
recipient_names
sender
dateReceived
date
recipients
subject
Unfortunately, depending on the CSV file, each column (if it is present at all) might be at a different column number, which complicates parsing.
What I need to do is extract only these selected columns from the input CSV files and put them all into a single output file.
I'm new to Python and am sure there's a perfectly easy way to achieve this, but I can't figure it out.
I'm not sure whether I should use Pandas or some other mechanism.
In logical code it should work more or less like this:
for file in (all files in current folder); do
open file;
get header and find out at which positions are interesting columns
#or match by column name;
dump interesting columns into output file in the right order;
close file;
done
The tricky part for me is getting the header...
Would any of you have any advice on how to do this in a smart, Pythonic way?
I thought about using bash and parsing it manually, but thought it might be a good idea to learn how to do it in Python with your help.
P.S. The background is that I need to go through all emails for the last 5 years and find out at what time the first and last emails were sent each day. The CSVs I have were created from Thunderbird MSF files using the Mork tool. Once I have this CSV parsing done, I'll need to find an easy way to get the time of the first and last email on the same day. But that is another story.
Thanks in advance for all advice.

If the column names are the same in all the files, use csv.DictReader to do the job.
Python csv.DictReader Documentation
You can reference the field names directly rather than the column number.
import csv
# In Python 3 the csv module expects a text-mode file opened with newline=''
with open('Path_to_file', newline='') as file:
    for record in csv.DictReader(file):
        print(record['Column_Name'])
Hope this helps.
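To connect this back to the question, here is a minimal sketch of how DictReader could be paired with csv.DictWriter to pull the selected columns out of many files into one output file. The function name merge_selected_columns and the output file name are made up for illustration, and the sketch assumes every file has a header row and that a column missing from a file should simply be left empty.

```python
import csv

# Columns named in the question; adjust as needed.
WANTED = ["threadSubject", "bccList", "sender_name", "recipient_names",
          "sender", "dateReceived", "date", "recipients", "subject"]


def merge_selected_columns(input_paths, output_path, wanted=WANTED):
    """Copy only the `wanted` columns from each input CSV into one output CSV.

    Hypothetical helper, not part of the csv module.
    """
    with open(output_path, "w", newline="") as out:
        # extrasaction="ignore" drops columns we did not ask for;
        # restval="" fills in wanted columns that a given file lacks.
        writer = csv.DictWriter(out, fieldnames=wanted,
                                extrasaction="ignore", restval="")
        writer.writeheader()
        for path in input_paths:
            with open(path, newline="") as src:
                # DictReader matches values to names using each file's own
                # header, so the column position no longer matters.
                for record in csv.DictReader(src):
                    writer.writerow(record)
```

For the "all files in the current folder" case, something like merge_selected_columns(sorted(glob.glob("*.csv")), "merged_output.csv") should work after import glob, as long as the output file is written somewhere the pattern does not pick it up on a second run.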

Related

Parsing Excel sheet based on date, number of visitors, and printing email

I am trying to parse through an Excel sheet that has columns for the website name (column A), the number of visitors (F), a contact at that website's first name (B), one for last name (C), for email (E), and date it was last modified (L).
I want to write a python script that goes through the sheet and looks at sites that have been modified in the last 3 months and prints out the name of the website and an email.

This is pretty straightforward, and a little bit of googling will help you a lot. In short, you need a library called Pandas, which is a really powerful tool for handling spreadsheets, datasets, and table-based files.
The Pandas documentation is very well written, and you can use its tutorials to work your way through the problem. Still, here is a brief overview of what you should do.
First, open the spreadsheet (Excel file) in Python using Pandas and load it into a data frame (read the docs and you'll understand).
Second, use one of the methods Pandas provides, such as where (there are a couple of others), to set a condition (for example, that the date is older than some date) and get back the masked data frame, which represents the matching part of your spreadsheet.
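As a rough illustration of the second step, the sketch below filters a small in-memory data frame by date. It uses a boolean mask with .loc rather than DataFrame.where, because where keeps the original shape and fills non-matching rows with NaN. The column names, the sample rows, and the helper modified_since are all assumptions; with a real workbook the frame would come from pd.read_excel instead.

```python
import pandas as pd

# Hypothetical stand-in for the sheet; the column names are assumed.
df = pd.DataFrame({
    "website": ["a.example", "b.example", "c.example"],
    "email": ["a@a.example", "b@b.example", "c@c.example"],
    "last_modified": pd.to_datetime(["2024-05-20", "2023-11-02", "2024-04-01"]),
})


def modified_since(frame, cutoff):
    """Return website and email for rows modified on or after `cutoff`."""
    mask = frame["last_modified"] >= cutoff
    return frame.loc[mask, ["website", "email"]]


# Fixed reference date so the example is reproducible;
# in practice pd.Timestamp.today() would take its place.
today = pd.Timestamp("2024-06-01")
recent = modified_since(df, today - pd.DateOffset(months=3))
print(recent.to_string(index=False))
```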

table for organizing students registration of subjects

I have to create my own subjects table. I have an Excel file which contains the subjects, groups, and dates that are available, and I want to create a Python program that runs over all the combinations of subjects and gives me all the available dates for the subjects I want to register in.
Actually, I have no idea even how to start.
I have four subjects; let's call them F, N, S and G.
Each one has four groups with different times during the week,
so I want to generate all the available combinations in which there is no overlap between subjects.
All I want is a hint. I don't want the whole solution, just some initial thoughts to start with.
I'm really a beginner Python programmer and I can't think of anything to launch this project.
How do I arrange them into matrices?

Save the Excel file as a CSV ("comma-separated values") file. This format is simple plain text and easy for programs to use.
In your program, open the file using open().
Use the csv module to read the opened file into a list of lists. Each element of the outer list should be another list: [subject, group, date] (or whatever columns are in your table).
Now that you have your information read into the program, look into solutions for the actual algorithm. You can google various scheduling algorithms, but this StackOverflow question gets at what you're looking for, I think, and might serve as a good starting point.
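One possible starting point for the algorithm part is brute force: try every way of picking one group per subject and keep the picks with no clashing times. The sketch below assumes the rows have already been regrouped into a nested dictionary; that layout, the sample times, and the function names are all invented for illustration.

```python
from itertools import product

# Hypothetical timetable: subject -> group -> list of (day, start_hour, end_hour).
timetable = {
    "F": {"F1": [("Mon", 9, 11)], "F2": [("Tue", 9, 11)]},
    "N": {"N1": [("Mon", 10, 12)], "N2": [("Wed", 9, 11)]},
}


def overlaps(a, b):
    """True if two (day, start, end) slots are on the same day and intersect."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def conflict_free(choice, timetable):
    """Check that no two slots of the chosen groups overlap."""
    slots = [s for subject, group in choice for s in timetable[subject][group]]
    return not any(overlaps(slots[i], slots[j])
                   for i in range(len(slots))
                   for j in range(i + 1, len(slots)))


def valid_combinations(timetable):
    """List every one-group-per-subject choice that has no time clash."""
    subjects = sorted(timetable)
    per_subject = [[(s, g) for g in sorted(timetable[s])] for s in subjects]
    return [combo for combo in product(*per_subject)
            if conflict_free(combo, timetable)]


for combo in valid_combinations(timetable):
    print(combo)
```

With four subjects of four groups each there are only 4 × 4 × 4 × 4 = 256 combinations, so checking all of them is cheap and no matrix representation is needed.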

Compare a Python list to an external file (.xls or .csv)

I will have a large list of emails that I will need to compare regularly to a small (20 to 30 entries) list of domains updated by a non-python user, likely in an .xls or .txt or .csv file. Any domains listed in this external file will need to be removed from the list. General tips on setting this up? I already know how to loop over the emails and remove any matches, but I'm less confident on the best way to reference the external file. Thanks so much.

I'd approach it by using Pandas to read the file. With read_csv you can open different types of files that separate values using delimiters (such as commas in a CSV). It returns a Pandas DataFrame that you can compare with the list of emails that you already have.
Pro tip: you probably want to store the list of emails that you already have somewhere, right? If you store them as a CSV, you can read them using Pandas too. After doing that, you can remove occurrences by following the answer on Diff between two dataframes in pandas.
Happy coding!
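For a list of only 20 to 30 domains, the standard-library csv module may be enough on its own, so the sketch below uses that instead of Pandas. It assumes the external file holds one domain per row in its first column; the helper names are hypothetical, and io.StringIO stands in for the real file.

```python
import csv
import io


def load_domains(csv_file):
    """Read one domain per row (first column) from an open CSV file into a set."""
    return {row[0].strip().lower()
            for row in csv.reader(csv_file)
            if row and row[0].strip()}


def remove_blocked(emails, blocked_domains):
    """Keep only emails whose domain (text after the last '@') is not blocked."""
    return [e for e in emails
            if e.rsplit("@", 1)[-1].lower() not in blocked_domains]


# io.StringIO stands in for the external file; a real run would pass
# open("domains.csv", newline="") with a file name of your choosing.
blocked = load_domains(io.StringIO("example.com\nSpam.org\n"))
emails = ["ann@example.com", "bob@good.net", "cy@spam.org"]
print(remove_blocked(emails, blocked))
```

Lower-casing both sides makes the comparison case-insensitive, which helps when the list is maintained by hand.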

Writing to specific cells in csv

I have to use CSVs to make a list of people's contact details, like emails, phone numbers and addresses.
I have a list of column names along the top: name, email, number, etc.
I need to write to a specific cell. Users can enter their name and then enter new information; for example, if they didn't have a phone number and now they do, they can enter it. I can find the row of a specific person, as it starts with their name, which I can search for, but then I don't know how to write to the phone number column.
My code is like this:
import csv
with open("csvfile.csv", "a") as file:
    reader = csv.reader(file)
    writer = csv.writer(file)
    for row in file:
        if row["First column"] == x:
            row[1] = "still don't have a phone"
            writer.writerow(row)
The problem seems to be that it can't read and write at the same time, but I don't know what to do. I am using Python 3.

Because you're a student, I am going to push you to figure out the full answer on your own. But the tool you will want to use is
DataFrame.iloc (from pandas)
This is integer-position-based indexing, and it could be as simple as
df.iloc[0, 1] = whatever you need it to be.
Hopefully this gets you a step closer :)
Best,
Andy
EDIT: Realized you're just using the csv module.
If you can, I would recommend loading your CSV into a dataframe with Pandas. It's an overall more powerful tool that packs in a lot of what you will need to solve this issue.
If you want, I can help you set up pandas, but see this for the answer using the csv module:
Writing to a particular cell using csv module in python
Sorry for my mistake in not reading more carefully,
Best,
Andy
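For reference, a common pattern with the plain csv module is to read every row into memory, change the one cell, and then rewrite the whole file, since a CSV file cannot be edited in place. The sketch below assumes a header row with a name column and well-formed rows; update_cell and its parameters are hypothetical names, not part of the csv module.

```python
import csv


def update_cell(path, name, column, new_value, key_column="name"):
    """Set `column` to `new_value` in every row whose `key_column` equals `name`.

    Returns True if at least one row was changed.
    """
    # Step 1: read everything, then close the file.
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Step 2: change the cell in memory.
    updated = False
    for row in rows:
        if row[key_column] == name:
            row[column] = new_value
            updated = True

    # Step 3: reopen in write mode and write all rows back.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return updated
```

A call such as update_cell("contacts.csv", "Ann", "number", "123") would then fill in Ann's phone number, assuming those column names match the header.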

Python CSV Manipulation find and replace with the data in two different CSVs

I am new to Python.
I have a huge amount of JSON data, which I have sorted and turned into two CSV files using a Python script. The data is assessment data collected for educational and research purposes.
The first CSV file contains the question IDs, question text, choice IDs (referring to answer options), and choice text. There are some other fields, but for this particular question these are enough.
The second CSV file contains the students' responses; the fields to consider in this file are the question IDs and response IDs.
I want to map each questionid in the second CSV to the questionid in the first CSV, collect all the question text, choice text and choice IDs for that questionid from the first CSV, and write them to a new CSV file. Then I also need to map the responseid in the second CSV to the choice IDs in the first CSV and write that to the newly created third CSV.
How can I do this using a Python script?
I haven't written the script for this yet, as I am struggling with the logic.

I would recommend using the pandas library, which contains all the functions you need, starting with read_csv.
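As a sketch of what the mapping could look like, DataFrame.merge can join the responses to the questions on the question ID and the chosen choice ID together. The column names and the sample rows below are assumptions standing in for the two real files, which would be loaded with pd.read_csv.

```python
import pandas as pd

# Hypothetical miniature versions of the two files; column names are assumed.
questions = pd.DataFrame({
    "questionid": [1, 1, 2, 2],
    "question_text": ["2+2?", "2+2?", "Capital of France?", "Capital of France?"],
    "choiceid": [10, 11, 20, 21],
    "choice_text": ["3", "4", "Paris", "Rome"],
})
responses = pd.DataFrame({
    "questionid": [1, 2],
    "responseid": [11, 20],
})

# Match each response to the row with the same question ID whose
# choice ID equals the response ID.
answered = responses.merge(
    questions,
    left_on=["questionid", "responseid"],
    right_on=["questionid", "choiceid"],
    how="left",  # keep responses even if no matching choice is found
)
print(answered[["questionid", "question_text", "responseid", "choice_text"]])
```

The result could then be saved with answered.to_csv("mapped.csv", index=False), where the file name is only a placeholder.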
