Python read in multiple .txt files and row bind using pandas - python

I'm coming from R (and SAS) and am having an issue reading in a large set of .txt files (all stored in the same directory) and creating one large dataframe in pandas. So far I have attempted an amalgamation of code, all of which fails miserably. I assume this is a simple task but lack the experience in Python...
If it helps this is the data I would like to create one large dataframe with: http://www.ssa.gov/oact/babynames/limits.html
- the state-specific sets (50 in total, each named with its state abbreviation plus .txt)
Please help!
import pandas as pd
import glob
filelist = glob.glob(r"C:\Users\Dell\Downloads\Names\*.txt")
names = ['state', 'gender', 'year', 'name', 'count']
Then, I was thinking of using pd.concat, but am not sure - essentially I want to read in each dataset and then row.bind the sets together (given they all have the same columns)...

concat is nice since join is set to "outer" (i.e. the union of the indexes) by default. You could just as easily use df.join(), but then you must specify how="outer". Either way, you can build a dataframe quite simply:
import pandas as pd
from glob import glob as gg
data = pd.DataFrame()
names = ['state', 'gender', 'year', 'name', 'count']
for f in gg('*.txt'):
    # read_csv has no "columns" argument; pass names= instead (the SSA state files have no header row)
    tmp = pd.read_csv(f, header=None, names=names)
    data = pd.concat([data, tmp], axis=0, ignore_index=True)
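If you prefer not to grow the dataframe inside the loop, a minimal variant (again assuming the SSA state files have no header row) collects the pieces in a list and concatenates once at the end:
import pandas as pd
from glob import glob
names = ['state', 'gender', 'year', 'name', 'count']
# read each state file, then row-bind everything with a single concat call
frames = [pd.read_csv(f, header=None, names=names) for f in glob('*.txt')]
data = pd.concat(frames, ignore_index=True)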

Related

Pandas - import CSV files in folder, change column name if it contains a string, concat into one dataframe

I have a folder with about 40 CSV files containing data by month. I want to combine them all together, however one column in these CSV files is denoted as either 'implementationstatus' or 'implementation'. When I try to concat using Pandas, obviously this is a problem. I want to basically change 'implementationstatus' to 'implementation' for each CSV file as it is imported. I could run a loop for each CSV file, change the column name, export it, and then run my code again with everything changed, but that just seems prone to error or unexpected things happening.
Instead, I just want to import all the CSVs, change the column name 'implementationstatus' to 'implementation' IF APPLICABLE, and then concatenate into one data frame. My code is below.
import pandas as pd
import os
import glob
path = 'c:/mydata'
filepaths = [os.path.join(path, f) for f in os.listdir(path) if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths), join='inner', ignore_index=True)
df.columns = df.columns.str.replace('implementationstatus', 'implementation') # I know this doesn't work, but I am trying to demonstrate what I want to do
If you want to change the column name, please try this:
import glob
import pandas as pd
filenames = glob.glob('c:/mydata/*.csv')
all_data = []
for file in filenames:
    df = pd.read_csv(file)
    if 'implementationstatus' in df.columns:
        df = df.rename(columns={'implementationstatus': 'implementation'})
    all_data.append(df)
df_all = pd.concat(all_data, axis=0)
You can use a combination of the header and names parameters of pd.read_csv to solve it.
You must pass names a list containing the names of all columns in the CSV files. This lets you standardize the column names across files.
From pandas docs:
names: array-like, optional
List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
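As a rough sketch of that approach (the column names below are hypothetical, since the question doesn't list them): passing names= together with header=0 discards each file's own header row and applies one consistent set of names before concatenating.
import glob
import pandas as pd
# hypothetical column names; the list must cover every column in file order
cols = ['project', 'implementation', 'month']
filenames = glob.glob('c:/mydata/*.csv')
# header=0 skips each file's header row; names= supplies the standardized names
all_data = [pd.read_csv(f, header=0, names=cols) for f in filenames]
df_all = pd.concat(all_data, ignore_index=True)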

Dataframe instance management in Python

I recently worked on a project parsing CSV files with cable modem MAC address (CMMAC) data that made it useful to incorporate dataframes through the Pandas module. One of the problems I encountered related to the overall approach to and structure of the dataframes themselves. Specifically, I was concerned with having to increment the number of instances of dataframes to perform specific actions on the data. I did not feel that having to invoke "df1", "df2", "df3", etc. was an efficient approach to writing in Python.
Below is a segment of the code where I had to instantiate the dataframes for different actions. The sample files (file1.csv and file2.csv) are identical and posted below as well.
file1.csv and file2.csv
cmmac,match
AABBCCDDEEFF,true
001122334455,false
001122334455,false
Python script:
import os
import glob
from functools import partial
import pandas as pd
#read and concatenate all CSV files in working directory
df1 = pd.concat(map(partial(pd.read_csv, header=0), glob.glob(os.path.join('', "*.csv"))))
#sort by column labeled "cmmac"
df2 = df1.sort_values(by='cmmac')
#delete any duplicate records
df3 = df2.drop_duplicates()
#convert MAC address format to colon notation (e.g. 001122334455 to 00:11:22:33:44:55)
df3['cmmac'] = df3['cmmac'].apply(lambda x: ':'.join(x[i:i+2] for i in range(0, len(x), 2)))
There were additional actions performed on the data in the CSV files, and by the end I had thirteen dataframes (up to df13). With more complex projects I would have been in a death spiral of dataframes using this method.
The question I have is: how should dataframes be managed in order to avoid using this many instances? If it was necessary to drop a column or rearrange the columns does each one of those actions require invoking a new dataframe? In "df1" I am able to combine two distinct actions, which include reading in all CSV files and concatenating them. I was unable to add additional actions but even so that line would eventually become difficult to read. Which approach have you adopted when working with dataframes that incorporated many smaller tasks? Thanks.
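One common way to avoid the df1/df2/df3 pattern is method chaining: express the intermediate steps as a single chained expression so only the final result gets a name. A minimal sketch using the steps from the question (not code from the original post):
import glob
import pandas as pd
def format_mac(mac):
    # 001122334455 -> 00:11:22:33:44:55
    return ':'.join(mac[i:i + 2] for i in range(0, len(mac), 2))
df = (
    pd.concat(pd.read_csv(f) for f in glob.glob('*.csv'))
      .sort_values(by='cmmac')
      .drop_duplicates()
      .assign(cmmac=lambda d: d['cmmac'].apply(format_mac))
)
Each step returns a new dataframe, so no intermediate names are needed; longer steps can be pulled out into small helper functions like format_mac above.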

combining multiple files into a single file with DataFrame

I have been able to generate several CSV files through an API. Now I am trying to combine all the CSVs into a single master file so that I can then work on it. But it does not work. Below is the code I have attempted. What am I doing wrong?
import glob
import pandas as pd
from pandas import read_csv
master_df = pd.DataFrame()
# files is assumed to be the list of CSV paths generated earlier (not shown in the post)
for file in files:
    df = read_csv(file)
    master_df = pd.concat([master_df, df])
    del df
master_df.to_csv("./master_df.csv", index=False)
Although it is hard to tell what the precise problem is without more information (e.g., error message, pandas version), I believe it is that in the first iteration, master_df and df do not have the same columns. master_df is an empty DataFrame, whereas df has whatever columns are in your CSV. If this is indeed the problem, then I'd suggest storing all your dataframes (each of which represents one CSV file) in a single list, and then concatenating all of them. Like so:
import pandas as pd
df_list = [pd.read_csv(file) for file in files]
pd.concat(df_list, sort=False).to_csv("./master_df.csv", index=False)
Don't have time to find/generate a set of CSV files and test this right now, but am fairly sure this should do the job (assuming pandas version 0.23 or compatible).

Create n data frames using a for loop

I would like to know how to give different names to the data frames that I am going to create using the code below.
import pandas as pd
import glob
import os
os.chdir("/Users/path")
dataframes=[]
paths = glob.glob("*.csv")
for path in paths:
    dataset = pd.read_csv(path)
    dataframes.append(dataset)
I would like to have something like this:
df1
df2
df3
....
in order to use each of them for different analysis purposes. In the folder I have files like
analysis_for_market.csv, dataset_for_analysis.csv, test.csv, ...
Suppose I have 23 csv files (this count is given by len(dataframes), since the loop appends each df).
For each of them I would like to create a dataframe in Python in order to run different analyses.
I would do, for one of them:
df=pd.read_csv(path) (where path is "/path/analysis_for_market.csv").
and then I could work on it (adding columns, dropping them, and so on).
However, I would also like to be able to work with another dataset, let's say dataset_for_analysis.csv, so I would need to create a new dataframe, df2. This could be useful in case I would like to compare rows.
And so on. Potentially I would need a df for each dataset, so I would need 23 dataframes.
I think it could be done using a for loop, but I have no idea how to refer to each df afterwards (for example, how to call df.describe() for the two examples above).
Could you please tell me how to do this?
If you find a possible question related to mine, could you please add it in a comment, before closing my question (as a previous post was closed before solving my issues)?
Thank you for your help and understanding.
Update:
import os
import pandas as pd
import glob
os.chdir("/Users/path")
paths = glob.glob("*.csv")
dataframes=[]
df={}
for x in range(1,len(paths)):
    for path in paths:
        df["0".format(x)] = pd.read_csv(path)
        #dataframes[path] = df # it gives me the following error: TypeError: list indices must be integers or slices, not str
df["2"]
it works only for the key "0" as written in the code, but I do not know how to make the key range from 1 to len(paths)
Storing the dataframes in a dictionary keyed by their index will do the job.
import pandas as pd
import glob
import os
os.chdir("/Users/path")
df = {}
paths = glob.glob("*.csv")
for index, path in enumerate(paths):
    df[str(index)] = pd.read_csv(path)
This works fine for me. If I call df['0'], it gives me the first dataframe.
You can create a global variable with any name you like by doing
"globals()["df32"] = ..."
But that is usually viewed as poor coding practice (because you might be clobbering existing names without knowing it).
Instead, just create a dictionary mydfs (say) and do mydfs[1]=...
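A minimal sketch of the dictionary approach, keyed by filename stem so each frame can be looked up by a readable name (the filenames are the ones mentioned in the question; the rest is illustrative):
import glob
import os
import pandas as pd
mydfs = {}
for path in glob.glob('*.csv'):
    name = os.path.splitext(os.path.basename(path))[0]  # e.g. 'analysis_for_market'
    mydfs[name] = pd.read_csv(path)
# each dataframe is then addressed by name instead of df1, df2, ...
mydfs['analysis_for_market'].describe()
mydfs['dataset_for_analysis'].describe()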
from glob import glob
import pandas as pd
for i, path in enumerate(glob('*.csv')):
exec("{} = {}".format("df{0:03d}".format(i), pd.read_csv(path, encoding = 'latin-1')))
You can adjust the 0:03d bit to the number of leading zeros you'd like, or skip the padding altogether with df{0}.

Using PySpark to efficiently combine many small csv files (130,000 with 2 columns in each) into one large frame

This is another follow-up to an earlier question I posted: How can I merge these many csv files (around 130,000) using PySpark into one large dataset efficiently?
I have the following dataset https://fred.stlouisfed.org/categories/32263/downloaddata/INTRNTL_csv_2.zip
In it, there's a list of around 130,000 files. The main directory lists their sub-directories, so the first cell might be A/AAAAA, and the corresponding file would be located at /data/A/AAAAA.csv
The files all have a similar format: the first column is called DATE and the second column is a series that is always named VALUE. So, first of all, the VALUE column needs to be renamed to the file name in each csv file. Second, the frames need to be full-outer-joined with each other, with DATE as the main index. Third, I want to save the result and be able to load and manipulate it. It should be roughly N rows (the number of dates) by 130,001 columns.
I am trying to full outer join all the files into a single dataframe. I previously tried pandas but ran out of memory when trying to concat the list of files, and someone recommended that I try PySpark instead.
In a previous post I was told that I could do this:
df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv", "date DATE, value DOUBLE")
But all the columns are named value, so the frame just becomes two columns: the first is DATE and the second is VALUE. It loads quite fast, around 38 seconds, with around 3.8 million rows by 2 columns, so I know it's not doing the full outer join; it's appending the files row-wise.
So I tried the following code:
import pandas as pd
import time
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spark-dataframe-demo').getOrCreate()
from pyspark.sql import *
from pyspark.sql.functions import col
from pyspark.sql import DataFrame
from pyspark.sql.types import *
filelist = pd.read_excel("/kaggle/input/list/BF_csv_2.xlsx") #list of filenames
firstname = min(filelist.File)
length = len(filelist.File)
dff = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/" + firstname, inferSchema=True, header=True).withColumnRenamed("VALUE", firstname[:-4]) #read the first file and rename its VALUE column to the file name (minus .csv)
for row in filelist.File.items():
    # row is an (index, filename) tuple, so compare the filename itself
    if row[1] == firstname:
        continue
    print(row[1], length, end='', flush=True)
    df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/" + row[1], inferSchema=True, header=True).withColumnRenamed("VALUE", row[1][:-4])
    #df = df.select(col("DATE").alias("DATE"),col("VALUE").alias(row[1][:-4]))
    dff = dff.join(df, ['DATE'], how='full')
    length -= 1
dff.write.save('/kaggle/working/whatever', format='parquet', mode='overwrite')
So to test it, I call the df.show() function after 3 columns are merged and it's quite fast. But when I try around 25 columns, it takes around 2 minutes. When I try 500 columns it's next to impossible.
I don't think I'm doing it right. The formatting and everything is correct. But why is it taking so long? How can I use PySpark properly? Are there any better libraries to achieve what I need?
Spark doesn't do anything magical compared to other software. The strength of Spark is parallel processing. Most of the time that means you can use multiple machines to do the work. If you are running Spark locally you may have the same issues you did when using pandas.
That being said, there might be a way for you to run it locally using Spark because it can spill to disk under certain conditions and does not need to have everything in memory.
I'm not well versed in PySpark, but the approach I'd take is (a rough sketch of these steps follows at the end of this answer):
load all the files like you did, using /kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv
Use input_file_name from pyspark.sql.functions, which gives you the path of the source file for each record in your DF (df.select("date", "value", input_file_name().alias("filename")) or similar)
Parse the path into a format you'd like to have as a column (e.g. extract the filename)
the schema should look like date, value, filename at this step
use the PySpark equivalent of df.groupBy("date").pivot("filename").agg(first("value")). Note: I used first() because I think you have 1 or 0 records per date/filename combination
Also try: setting the number of partitions equal to the number of dates you have
If you want output as a single file, do not forget to repartition(1) before df.write. This step might be problematic depending on data size. You do not need to do this if you plan to keep using Spark for your work as you could load the data using the same approach as in step 1 (/new_result_data/*.csv)
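A minimal PySpark sketch of the steps above, assuming the input path and the DATE/VALUE schema from the question; the regexp that pulls the filename out of the full path is illustrative and may need adjusting to the real directory layout:
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract, first
spark = SparkSession.builder.appName('wide-join-sketch').getOrCreate()
# step 1: load every csv in one pass; the schema is the same for all files
df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv",
                    schema="DATE DATE, VALUE DOUBLE", header=True)
# steps 2-3: attach the source path and reduce it to a bare filename column
df = (df.withColumn("path", input_file_name())
        .withColumn("filename", regexp_extract("path", r"([^/]+)\.csv$", 1)))
# step 4: pivot so each filename becomes its own column, keyed by DATE
wide = df.groupBy("DATE").pivot("filename").agg(first("VALUE"))
wide.write.save('/kaggle/working/whatever', format='parquet', mode='overwrite')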
