I have a bunch of files with file names of the form
companyname-date_somenumber.txt
I have to sort the files by company name, then by date, and copy their content in this sorted order to another text file.
Here's the approach I'm trying:
From each file name, extract the company name and the date, put these two fields in a dictionary, append this dictionary to a list, and then sort this list by company name and then by date.
Once I have the sorted order, I think I can look up the files in the folder according to that order, copy each file's content into a txt file, and I'll have my final txt file.
Here's the code I have so far:
from os import listdir
from os.path import isfile, join
from datetime import date

# path is the folder containing the files
myfiles = [f for f in listdir(path) if isfile(join(path, f))]
file_list = []
for file1 in myfiles:
    # find indices of companyname and date in the file name
    idx1 = file1.index('-', 0)
    idx2 = file1.index('_', idx1)
    company = file1[0:idx1]  # extract companyname
    thisdate = file1[idx1+1:idx2]  # extract date, which is in format MMDDYY
    entry = {}  # renamed from dict to avoid shadowing the built-in
    # extract month, day and year from thisdate
    m = thisdate[0:2]
    d = thisdate[2:4]
    y = '20' + thisdate[4:6]
    # convert into date object
    mydate = date(int(y), int(m), int(d))
    entry['date'] = mydate
    entry['company'] = company
    file_list.append(entry)
I checked the output of file_list at the end of this block of code, and I think I have my list of dicts. Now, how do I sort by company name and then by date? I looked up sorting by multiple keys online, but how would I get increasing order by date?
Is there any other way to sort a list by a string field and then a date field?
import os
from datetime import datetime

MY_DIR = 'somedirectory'
# my_files = [f for f in os.listdir(MY_DIR) if os.path.isfile(os.path.join(MY_DIR, f))]
my_files = [
    'ABC-031814_01.txt',
    'ABC-031214_02.txt',
    'DEF-010114_03.txt',
]

file_list = []
for file_name in my_files:
    # split the name into company, date string, and the rest
    company, _, rhs = file_name.partition('-')
    datestr, _, rhs = rhs.partition('_')
    file_date = datetime.strptime(datestr, '%m%d%y')
    file_list.append(dict(file_date=file_date, file_name=file_name, company=company))

for row in sorted(file_list, key=lambda x: (x.get('company'), x.get('file_date'))):
    print(row)
The function sorted takes a keyword argument key that is a function applied to each item in the sequence you're sorting. If this function returns a tuple, the sequence will be sorted by the items of the tuple in turn.
Here lambda x: (x.get('company'), x.get('file_date')) lets sorted order by company name and then by date. Because file_date is a datetime object, it compares chronologically, and sorted is ascending by default, so you get increasing date order for free.
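From there, finishing the original task is just a matter of walking the sorted list and appending each file's content to one output file. A minimal sketch, reusing file_list and MY_DIR from above (in real use the names would come from the directory listing rather than the hard-coded samples):

import os

with open('combined.txt', 'w') as out:
    for row in sorted(file_list, key=lambda x: (x['company'], x['file_date'])):
        # copy each file's content, in sorted order, into the combined file
        with open(os.path.join(MY_DIR, row['file_name'])) as src:
            out.write(src.read())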
I have over two thousand csv files in a folder, named as follows:
University_2010_USA.csv, University_2011_USA.csv, Education_2012_USA.csv, Education_2012_Mexico.csv, Education_2012_Argentina.csv,
and
Results_2010_USA.csv, Results_2011_USA.csv, Results_2012_USA.csv, Results_2012_Mexico.csv, Results_2012_Argentina.csv
I would like to match the first set of csv files with the second based on the "year" (2012, etc.) and "country" (Mexico, etc.) in the file name. Is there a way to do this quickly? Both sets of csv files have the same column names, and I'm looking at the following code:
df0 = pd.read_csv('University_2010_USA.csv')
df1 = pd.read_csv('Results_2010_USA.csv')
new_df = pd.merge(df0, df1, on=['year','country','region','sociodemographics'])
So basically, I'd need help writing a for loop that iterates over the datasets... Thanks!
Try this:
import pandas as pd
from pathlib import Path

university = []
results = []

for file in Path('/path/to/data/folder').glob('*.csv'):
    # Determine the properties from the file's name
    file_type, year, country = file.stem.split('_')
    if file_type not in ('University', 'Results'):  # note: the files are named 'Results', not 'Result'
        continue
    # Make the data frame, with 2 extra columns using properties
    # we extracted from the file's name
    tmp = pd.read_csv(file).assign(
        year=int(year),
        country=country,
    )
    if file_type == 'University':
        university.append(tmp)
    else:
        results.append(tmp)

df = pd.merge(
    pd.concat(university),
    pd.concat(results),
    on=['year', 'country', 'region', 'sociodemographics'],
)
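Since the question mentions Education files as well, here's a hedged variation (my extension, not tested against your data) that collects University and Education frames together as the left side of the merge; the column names are the same, so nothing else changes:

import pandas as pd
from pathlib import Path

left, results = [], []
for file in Path('/path/to/data/folder').glob('*.csv'):
    file_type, year, country = file.stem.split('_')
    tmp = pd.read_csv(file).assign(year=int(year), country=country)
    if file_type in ('University', 'Education'):
        left.append(tmp)
    elif file_type == 'Results':
        results.append(tmp)

df = pd.merge(pd.concat(left), pd.concat(results),
              on=['year', 'country', 'region', 'sociodemographics'])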
I have a folder with multiple html files. I want the code to go through each and every file and pick out the subject-verb-object triplets using nlp. I then want pandas to list all of them under the headings of subject, verb and object for all the files together in one data frame. The problem I face is that pandas lists only the subject-verb-object triplets from the last file, not the first two. When I print sub_verb_obj in the loop it shows 3 lists within a list, but pandas does not pick up the triplets from all 3 lists. Can someone tell me what mistake I am making?
# (imports of glob, os, pandas, textacy and a loaded spaCy nlp are assumed)
sub_verb_obj = []
folder_path = 'C:/Users/user3/.ipynb_checkpoints/xyz/xyz_2018'
for filename in glob.glob(os.path.join(folder_path, '*.html')):
    with open(filename, 'r', encoding='utf-8') as f:
        pat = f.read()
    doc = nlp(pat)
    text_ext = textacy.extract.subject_verb_object_triples(doc)
    sub_verb_obj = list(text_ext)  # reassigned on every pass

sao = pd.DataFrame(sub_verb_obj)
sao.columns = ['subject', 'verb', 'object']
sao = sao.set_index('subject')
print(sao)
How can I make sure pandas lists the subject-verb-object triplets from all the files in the folder in a single dataframe?
Because your data looks to be a list of tuples on each iteration, and that worked for a single file, I'd suggest building a dataframe on each pass through the loop, storing it in a list, and then concatenating the list of dataframes:
import glob
import os

import pandas as pd
import spacy
import textacy
from bs4 import BeautifulSoup

nlp = spacy.load('en_core_web_sm')  # assumption: any installed spaCy English model works here

df_hold_list = []
folder_path = 'C:/Users/user3/.ipynb_checkpoints/xyz/xyz_2018'
for filename in glob.glob(os.path.join(folder_path, '*.html')):
    with open(filename, 'r', encoding='utf-8') as f:
        pat = f.read()
    soup = BeautifulSoup(pat, 'html.parser')
    claim_section = soup.find_all('section', attrs={"itemprop": "claims"})
    str_sect = claim_section[0]
    claim_text = str_sect.get_text()
    text = claim_text.lower()
    doc = nlp(text)
    text_ext = textacy.extract.subject_verb_object_triples(doc)
    sub_verb_obj = list(text_ext)
    df_hold_list.append(pd.DataFrame(sub_verb_obj))  # add each new dataframe here

sao = pd.concat(df_hold_list, axis=0)  # concat all dfs on top of one another using axis=0
sao.columns = ['subject', 'verb', 'object']  # rename the columns on the final df
sao = sao.set_index('subject')
print(sao)
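One small note on the concat step: each per-file dataframe carries its own zero-based index, so the combined frame will repeat index values. If that matters, ignore_index=True gives a fresh one:

sao = pd.concat(df_hold_list, axis=0, ignore_index=True)  # fresh 0..n-1 index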
How do I go about manipulating each file in a folder based on values pulled from a dictionary? Basically, say I have x files in a folder. I use pandas to reformat the dataframe, add a column which includes the date of the report, and save the new file under its base name plus the date.
import pandas as pd
import os
from pathlib2 import Path

source = Path("Users/Yay/AlotofFiles/April")
items = os.listdir(source)
d_dates = {'0401': '04/1/2019', '0402': '4/2/2019', '0403': '04/03/2019'}

for item in items:
    for key, value in d_dates.items():
        df = pd.read_excel(item, header=None)
        df.columns = ['A', 'B', 'C']
        df = df[df['A'].str.contains("Awesome")]
        df['Date'] = value
        file_basic = "retrofile"
        short_date = key
        xlsx = ".xlsx"
        file_name = file_basic + short_date + xlsx
        df.to_excel(file_name)
I want each file to be unique and categorized by the date. In this case, I would want three files, for example "retrofile0401.xlsx", each with a column containing its date ("04/01/2019") and only the data relevant to the original file.
What actually happens is that the code loops over each item, creates three files with those values, moves on to the next file, and repeats, overwriting the earlier output until I am left with three files that are all copies of the last file. Each file has a different date and name, which is what I want, but the data is duplicated from the last file.
If I remove the second loop, it works the way I want, but then there's no way of categorizing each file by the value from the dictionary.
Try the following. I'm only making input filenames explicit to make clear what's going on. You can continue to use yours from the source.
input_filenames = [
    'retrofile0401_raw.xlsx',
    'retrofile0402_raw.xlsx',
    'retrofile0403_raw.xlsx',
]
date_dict = {
    '0401': '04/1/2019',
    '0402': '4/2/2019',
    '0403': '04/03/2019',
}

for filename in input_filenames:
    date_key = filename[9:13]
    df = pd.read_excel(filename, header=None)
    df.columns = ['A', 'B', 'C']  # assumption: name the columns so 'A' exists below
    df = df[df['A'].str.contains("Awesome")]
    df['Date'] = date_dict[date_key]
    df.to_excel('retrofile{date_key}.xlsx'.format(date_key=date_key))
filename[9:13] takes characters #9-12 from the filename. Those are the ones that correspond to your date codes.
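If the 'retrofile' prefix ever changes length, fixed-position slicing breaks silently. A slightly more robust alternative (my suggestion, not part of the original answer) pulls the first four-digit run out of the name with a regular expression:

import re

match = re.search(r'\d{4}', 'retrofile0401_raw.xlsx')
if match:
    date_key = match.group()  # '0401'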
I'm trying to solve a simple practice test question:
Parse the CSV file to:
Find only the rows where the user started before September 6th, 2010.
Next, order the values from the "words" column in ascending order (by start date)
Return the compiled "hidden" phrase
The csv file has 19 columns and 1000 rows of data, most of which are irrelevant. As the problem states, we're only concerned with sorting the start_date column in ascending order to get the associated word from the 'words' column. Together, the words give the "hidden" phrase.
The dates in the source file are Unix timestamps, so I had to convert them. I'm at the point now where I think I've got the right rows selected, but I'm having issues sorting the dates.
Here's my code:
import csv
from collections import OrderedDict
from datetime import datetime

with open('TSE_sample_data.csv', 'r', newline='') as csvIn:
    reader = csv.DictReader(csvIn)
    for row in reader:
        # convert the Unix timestamp to a sortable date string
        startdt = datetime.fromtimestamp(int(row['start_date']))
        new_startdt = datetime.strftime(startdt, '%Y%m%d')
        # find dates before Sep 6th, 2010
        if new_startdt < '20100906':
            # add the values from the 'words' column to a list
            words = []
            words.append(row['words'])
            # add the dates to a list
            dates = []
            dates.append(new_startdt)
            # create an ordered dictionary to sort the dates... this is where I'm having issues
            dict1 = OrderedDict(zip(words, dates))
            print(dict1)
            # print(list(dict1.items())[0][1])
            # dict2 = sorted([(y, x) for x, y in dict1.items()])
            # print(dict2)
When I print dict1, I'm expecting one ordered dictionary with the words and the dates as items. Instead, what I'm getting is a separate ordered dictionary for each key-value pair.
Here's the corrected version:
import csv
from collections import OrderedDict
from datetime import datetime

with open('TSE_sample_data.csv', 'r', newline='') as csvIn:
    reader = csv.DictReader(csvIn)
    words = []
    dates = []
    for row in reader:
        # convert the Unix timestamp to a sortable date string
        startdt = datetime.fromtimestamp(int(row['start_date']))
        new_startdt = datetime.strftime(startdt, '%Y%m%d')
        # find dates before Sep 6th, 2010
        if new_startdt < '20100906':
            # add the values from the 'words' column to a list
            words.append(row['words'])
            # add the dates to a list
            dates.append(new_startdt)

# This is where I was going wrong! I had to move the lines below outside of the for loop.
# Originally, because I was still inside the for loop, I was creating a new OrderedDict
# for each "row in reader" that met my if condition. By doing this outside of the loop,
# I create one ordered dict holding all of the (word, date) pairs that were found.
# create an ordered dictionary to sort by the dates
dict1 = OrderedDict(zip(words, dates))
dict2 = sorted([(y, x) for x, y in dict1.items()])
# print the hidden message
for i in dict2:
    print(i[1])
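For what it's worth, the OrderedDict isn't strictly necessary, and because the words are used as keys it would silently drop any repeated word. A simpler sketch, reusing the words and dates lists built above, sorts (date, word) pairs directly:

pairs = sorted(zip(dates, words))  # tuples compare element-wise, so this sorts by date
print(' '.join(word for _, word in pairs))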
I am writing a script that reads data from an Excel file. For that data I create an id based on the date and time. I have one missing variable, which is contained in a txt file. The txt file also has a date and time from which to create an id.
Now I would like to link the data from the Excel file and the txt file based on the id. Right now I am building two lists from the txt file: one containing the id and the other containing the value I need. Then I find the index in the id list where the id is the same in both data sets, using the enumerate function, and use that index to get the value from the value list. The code looks something like this:
datelist = []
valuelist = []
txtfile = open(folder + os.sep + "Textfile.txt", "r")
ILines = txtfile.readlines()
for i, row in enumerate(ILines):
    datelist.append(row.split(",")[1])
    valuelist.append(row.split(",")[2])

rows = myexceldata
for row in rows:
    x = row[id]
    row = row + valuelist[[i for i, e in enumerate(datelist) if e == x][0]]
However, that takes ages, and I wonder if there is a better way to do it.
The files look like that:
Excelfile:
Date Time Var1 Var2
03.02.2016 12:53:24 10 27
03.02.2016 12:53:25 10 27
03.02.2016 12:53:26 10 27
Textfile:
Date Time Var3
03.02.2016 12:53:24 16
03.02.2016 12:53:25 20
Result:
Date Time Var1 Var2 Var3
03.02.2016 12:53:24 10 27 16
03.02.2016 12:53:25 10 27 20
03.02.2016 12:53:26 10 27 *)
*) It would be perfect if this were the same value as above, but empty would be OK, too.
OK, I forgot one important thing, sorry about that: not all times from the Excel file are in the text file. The best option would be to take Var3 from the last time in the text file just before the time in the Excel file, but leaving it blank would also be an option.
If both of your files are sorted in time order then the following kind of approach would be fast:
from heapq import merge
from itertools import groupby, chain
import csv

with open('excel.txt', 'r', newline='') as f_excel, \
     open('textfile.txt', 'r', newline='') as f_text, \
     open('output.txt', 'w', newline='') as f_output:
    csv_excel = csv.reader(f_excel)
    csv_text = csv.reader(f_text)
    csv_output = csv.writer(f_output)

    header_excel = next(csv_excel)
    header_text = next(csv_text)
    csv_output.writerow(header_excel + [header_text[-1]])

    # merge the two sorted streams and group rows sharing the same date+time
    for k, g in groupby(merge(csv_text, csv_excel), key=lambda x: x[0:2]):
        csv_output.writerow(k + list(chain.from_iterable(cols[2:] for cols in g)))
This assumes your two input files are both in csv format, and works as follows:
Create csv readers/writers for all of the files. This allows the files to automatically be read in as lists of columns without requiring each line to be split.
Extract the headers from both of the files and write a combined form to the output.
Take the two input files and pass them to merge. This returns a row at a time from either input file in order.
Pass this to groupby to group rows with the same date and time together. This returns a key and a group, where the key is the date and time that matched, and the group is an iterable of the matching rows.
For each grouped entry, write the key and columns 2 onwards from each row to the output file. chain is used to produce a flat list.
This would give you an output file as follows:
Date,Time,Var1,Var2,Var3
03.02.2016,12:53:24,10,27,16
03.02.2016,12:53:25,10,27,20
As you already have the Excel data in memory, it would be passed to merge as a list of rows/columns instead of csv_excel.
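A minimal sketch of that substitution, assuming myexceldata is a list of rows such as ['03.02.2016', '12:53:24', '10', '27'] with the header removed and sorted in time order (a requirement of merge), and reusing csv_text and csv_output from the block above:

# merge the in-memory Excel rows with the text-file csv stream
for k, g in groupby(merge(csv_text, myexceldata), key=lambda x: x[0:2]):
    csv_output.writerow(k + list(chain.from_iterable(cols[2:] for cols in g)))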