Obtaining list of file creation dates and concatenating pandas dataframe - python

Hello, I am trying to get a list of file names and file creation dates from a directory and insert them into a pandas DataFrame, but I am getting a type error.
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Any help on how to do this would be great, thanks.
import os
import time
import pandas as pd

cur = os.getcwd()
folder = os.listdir(cur)

files = []
for f in folder:
    files.append(f)

creation = []
for cd in files:
    c = time.ctime(os.path.getctime(cd))
    creation.append(c)

filenames = pd.DataFrame(files, columns=['Files'])
file_creation = pd.DataFrame(creation, columns=['Date Created'])
df = pd.concat(filenames, file_creation)

The message says that the first argument of pd.concat should be the sequence of Series or DataFrames to be concatenated. You passed one DataFrame as the first argument and the other one as the second. But the second positional argument is already something else: the axis to concatenate along.
So try
df = pd.concat([filenames, file_creation], axis=1)
However, in my opinion it is not the shortest way to first create two DataFrames only to concatenate them afterwards. You can create the final DataFrame directly from the two lists:
df = pd.DataFrame({'Files': files, 'Date Created': creation})
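A minimal sketch with made-up file names and dates, showing that the two routes (concatenating two one-column frames along axis=1 vs. building one frame from a dict of lists) produce the same result:

```python
import pandas as pd

# Toy stand-ins for the lists built in the question.
files = ['a.txt', 'b.txt']
creation = ['Mon Jan  1 00:00:00 2024', 'Tue Jan  2 00:00:00 2024']

# Route 1: two one-column frames, concatenated side by side.
filenames = pd.DataFrame(files, columns=['Files'])
file_creation = pd.DataFrame(creation, columns=['Date Created'])
df1 = pd.concat([filenames, file_creation], axis=1)

# Route 2: one frame built directly from the two lists.
df2 = pd.DataFrame({'Files': files, 'Date Created': creation})

print(df1.equals(df2))  # True: both routes produce the same frame
```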

Looks like you're better off building a generator of 2-tuples (filename and the timestamp converted to an actual datetime object), then building your DataFrame directly from that, e.g.:
import pathlib
import pandas as pd

files = (
    (file.name, pd.to_datetime(file.stat().st_ctime, unit='s'))
    for file in pathlib.Path.cwd().iterdir()
)
df = pd.DataFrame(files, columns=['Files', 'Creation Time'])

Related

How to read multiple csv from folder without concatenating each file

I have a folder, and inside the folder suppose there are 1000 .csv files stored. Now I have to create a data frame based on 50 of these files, so instead of loading them line by line is there any fast approach available?
I also want the file name to be the name of my data frame.
I tried the method below, but it is not working.
# List of file that I want to load out of 1000
path = "..."
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
for i in range(0, len(file_names)):
    col_names[i] = pd.read_csv(path + col_name[i])
But when I tried to read the variable name it is not displaying any result.
Is there any way I can achieve the desired result?
I have checked various articles, but in each of them all the data is concatenated at the end, and I want each file to be loaded individually.
IIUC, what you want to do is create multiple dfs, and not concatenate them.
In that case you can read each file using read_csv and stack the returned df objects in a list:
your_paths = []  # paths to all of your wanted csvs
l = [pd.read_csv(i) for i in your_paths]  # this will give you a list of your dfs
l[0]  # one of your dfs
If you want them named, you can make it a dict with differently named keys.
You can access them individually, through index slicing or key slicing, depending on the data structure you use.
I would not recommend this approach, though, as it is counterintuitive, and multiple df objects use a little more memory than a single one.
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
data_frames = {}
for file_name in file_names:
    df = pd.read_csv(file_name)
    data_frames[file_name.split('.')[0]] = df
Now you can reach any data frame from the data_frames dictionary, e.g. data_frames['a'] to access a.csv.
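A self-contained sketch of the dict approach above, using in-memory CSV text via io.StringIO in place of real files (the file names and contents are made up) so nothing touches disk:

```python
import io
import pandas as pd

# In-memory stand-ins for a.csv and b.csv (hypothetical contents).
fake_files = {
    'a.csv': 'x,y\n1,2\n3,4\n',
    'b.csv': 'x,y\n5,6\n',
}

data_frames = {}
for file_name, text in fake_files.items():
    # io.StringIO plays the role of opening file_name from disk
    df = pd.read_csv(io.StringIO(text))
    data_frames[file_name.split('.')[0]] = df

print(data_frames['a'].shape)  # (2, 2)
print(data_frames['b'].shape)  # (1, 2)
```

Each frame stays separate and is reachable by the key derived from its file name.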
try:
import glob
p = glob.glob('folder_path_where_csv_files_stored/*.csv')  # 1. returns a list of all csv files in this folder, no need to type them one by one
d = [pd.read_csv(i) for i in p]  # 2. creates a list of dataframes: one dataframe from each csv file
df = pd.concat(d, axis=0, ignore_index=True)  # 3. creates one dataframe `df` from those dataframes in the list `d`

Concatenate two columns in pandas

Good evening, I need help getting two columns together; my brain is stuck right now. Here's my code:
import pandas as pd
import numpy as np
tabela = pd.read_csv('/content/idkfa_linkedin_user_company_202208121105.csv', sep=';')
tabela.head(2)
coluna1 = 'startPeriodoMonth'
coluna2 = 'startPeriodoYear'
pd.concat([coluna1, coluna2])
ERROR: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
I'm currently getting this error, but I really don't know what to do. By the way, I'm a beginner and don't know much about coding, so any help is very appreciated.
I am new to Pandas too, but I think I can help you. You seem to have created two string variables by encapsulating the literal strings 'startPeriodoMonth' and 'startPeriodoYear' in single quotes. I think that what you're trying to do is pass columns from your Pandas data frame, and the way to do that is to explicitly reference the df and then wrap your column name in quotes and square brackets, like this:
coluna1 = tabela['startPeriodoMonth']
This is why it is saying that you "can't concatenate an object of type string": pd.concat only accepts Series or DataFrame objects.
From what I understand, coluna1 and coluna2 are columns from tabela. You have two options.
The first is selecting the columns from the dataframe and storing them in a new dataframe:
import pandas as pd
import numpy as np
tabela = pd.read_csv('/content/idkfa_linkedin_user_company_202208121105.csv', sep=';')
tabela.head(2)
coluna1 = 'startPeriodoMonth'
coluna2 = 'startPeriodoYear'
new_df = tabela[[coluna1, coluna2]]
The second option is creating a DataFrame which contains just the desired column (one for each of the two columns), followed by concatenating these DataFrames along the column axis:
coluna1 = 'startPeriodoMonth'
coluna2 = 'startPeriodoYear'
df_column1 = tabela[[coluna1]]
df_column2 = tabela[[coluna2]]
pd_concat = [df_column1, df_column2]
result = pd.concat(pd_concat, axis=1)
You can create a new column in your existing data frame to get the desired output:
tabela['month_year'] = tabela['startPeriodoMonth'].apply(str) + '/' + tabela['startPeriodoYear'].apply(str)
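For concreteness, a minimal sketch of the new-column approach on toy data (the column names are taken from the question; the values are made up, and astype(str) is used, which is equivalent to apply(str) here):

```python
import pandas as pd

# Toy frame with the two period columns from the question.
tabela = pd.DataFrame({
    'startPeriodoMonth': [1, 12],
    'startPeriodoYear': [2021, 2022],
})

# Build a combined text column from the two numeric columns.
tabela['month_year'] = (
    tabela['startPeriodoMonth'].astype(str) + '/' + tabela['startPeriodoYear'].astype(str)
)

print(tabela['month_year'].tolist())  # ['1/2021', '12/2022']
```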

How to read multiple ann files (from brat annotation) within a folder into one pandas dataframe?

I can read one ann file into pandas dataframe as follows:
df = pd.read_csv('something/something.ann', sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
df.head()
But I don't know how to read multiple ann files into one pandas dataframe. I tried to use concat, but the result is not what I expected.
How can I read many ann files into one pandas dataframe?
It sounds like you need to use glob to pull in all the .ann files from a folder and add them to a list of dataframes. After that you probably want to join/merge/concat etc. as required.
I don't know your exact requirements, but the code below should get you close. As it stands, the script assumes that, relative to where you run it, you have a subfolder called files and that you want to pull in all the .ann files from it (it will not look at anything else). Review and change as required; it's commented per line.
import pandas as pd
import glob

path = r'./files'  # use your path
all_files = glob.glob(path + "/*.ann")

# create empty list to hold dataframes from files found
dfs = []

# for each file in the path above ending .ann
for file in all_files:
    # open the file
    df = pd.read_csv(file, sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
    # add this new (temp during the looping) frame to the end of the list
    dfs.append(df)

# at this point you have a list of frames with each list item as one .ann file. Like [annFile1, annFile2, etc.] - just not those names.

# handle a list that is empty
if len(dfs) == 0:
    print('No files found.')
    # create a dummy frame
    df = pd.DataFrame()
# or have only one item/frame and get it out
elif len(dfs) == 1:
    df = dfs[0]
# or concatenate more than one frame together
else:  # modify this join as required
    df = pd.concat(dfs, ignore_index=True)
    df = df.reset_index(drop=True)

# check what you've got
print(df.head())

Error while appending many Excel files to one in Python

I am trying to append 10 Excel files into one in Python.
The code below was used, and I am getting
TypeError: first argument must be an iterable of pandas objects,
you passed an object of type "DataFrame"
Once I change the sheet_name argument to None, the code runs perfectly.
However, all 10 Excel files have three sheets, and I only want a specific sheet per Excel file.
Is there a way to get it done?
Your help is appreciated.
import pandas as pd
import glob

path = r'Folder path'
filenames = glob.glob(path + "\*.xlsx")

finalexcelsheet = pd.DataFrame()
for file in filenames:
    df = pd.concat(pd.read_excel(file, sheet_name='Selected Sheet'), ignore_index=True, sort=False)
    finalexcelsheet = finalexcelsheet.append(df, ignore_index=True)
I can't test it, but the problem is that you use concat the wrong way, or rather that you don't need concat in your situation.
concat needs a list of dataframes, like
concat([df1, df2, ...], ...)
but read_excel returns different objects for different sheet_name=... values, and this causes the problem.
read_excel with sheet_name=None returns a dict with all sheets as separate DataFrames,
{'Sheet1': df_sheet_1, 'Sheet2': df_sheet_2, ...}
and then concat can join them into one dataframe.
read_excel with sheet_name=name returns a single dataframe,
df_sheet
and then concat has nothing to join, so it raises the error.
But this means you don't need concat at all. You should directly assign the result of read_excel to df:
for file in filenames:
    df = pd.read_excel(file, sheet_name='Selected Sheet')
    finalexcelsheet = finalexcelsheet.append(df, ignore_index=True)
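To see the sheet_name difference concretely, a small sketch that writes a throwaway two-sheet workbook and reads it back both ways (the file path and sheet names are made up; assumes openpyxl is installed, which pandas uses for .xlsx files):

```python
import os
import tempfile
import pandas as pd

# Build a throwaway two-sheet workbook to stand in for one of the real files.
tmp = os.path.join(tempfile.mkdtemp(), 'demo.xlsx')
with pd.ExcelWriter(tmp) as writer:
    pd.DataFrame({'a': [1, 2]}).to_excel(writer, sheet_name='First', index=False)
    pd.DataFrame({'a': [3]}).to_excel(writer, sheet_name='Second', index=False)

# sheet_name=None -> dict of {sheet name: DataFrame}; concat can join its values.
all_sheets = pd.read_excel(tmp, sheet_name=None)
joined = pd.concat(all_sheets.values(), ignore_index=True)

# sheet_name='First' -> a single DataFrame; there is nothing for concat to iterate over.
one_sheet = pd.read_excel(tmp, sheet_name='First')

print(len(joined), len(one_sheet))  # 3 2
```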

How to import all fields from xls as strings into a Pandas dataframe?

I am trying to import a file from xlsx into a Python Pandas dataframe. I would like to prevent fields/columns being interpreted as integers and thus losing leading zeros or other desired heterogeneous formatting.
So for an Excel sheet with 100 columns, I would do the following using a dict comprehension with range(99).
import pandas as pd

filename = r'C:\DemoFile.xlsx'
fields = {col: str for col in range(99)}
df = pd.read_excel(filename, sheet_name=0, converters=fields)
These import files do have a varying number of columns all the time, and I am looking to handle this differently than changing the range manually all the time.
Does somebody have any further suggestions or alternatives for reading Excel files into a dataframe and treating all fields as strings by default?
Many thanks!
Try this:
xl = pd.ExcelFile(r'C:\DemoFile.xlsx')
ncols = xl.book.sheet_by_index(0).ncols
df = xl.parse(0, converters={i : str for i in range(ncols)})
UPDATE:
In [261]: type(xl)
Out[261]: pandas.io.excel.ExcelFile
In [262]: type(xl.book)
Out[262]: xlrd.book.Book
Use dtype=str when calling .read_excel():
import pandas as pd

filename = r'C:\DemoFile.xlsx'
df = pd.read_excel(filename, dtype=str)
The usual solution is:
1. read in one row of data just to get the column names and number of columns
2. create the dictionary automatically where each column has a string type
3. re-read the full data using the dictionary created at step 2
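A sketch of those three steps against a throwaway workbook (the file path and contents are made up; assumes openpyxl is available for .xlsx I/O):

```python
import os
import tempfile
import pandas as pd

# Throwaway workbook standing in for the real import file.
filename = os.path.join(tempfile.mkdtemp(), 'demo.xlsx')
pd.DataFrame({'id': ['007', '042'], 'n': [1, 2]}).to_excel(filename, index=False)

# Step 1: read zero data rows just to discover the column names.
header = pd.read_excel(filename, nrows=0)

# Step 2: build a converters dict mapping every column to str.
fields = {col: str for col in header.columns}

# Step 3: re-read the full sheet with those converters.
df = pd.read_excel(filename, converters=fields)

print(df['id'].tolist())  # leading zeros survive: ['007', '042']
```

This adapts automatically to files with varying numbers of columns, since the converters dict is derived from the header rather than a hard-coded range.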
