Optionally Select Columns from Pandas Data Frame - python

I have written a function that takes a list of file-paths and then concatenates them into one large dataframe. I would like to include an argument that takes a list of column names the user is interested in looking at.
The dataframe must always contain the 'category' column if the user decides to filter the columns, but I want the default to be that it returns all of the columns. I can't quite seem to figure out how to optionally select columns from a dataframe.
Here is my function, interspersed with some pseudo code to explain what I'm talking about.
def combine_all_data(data_files, columns_needed=ALL):
    dataframes = map(pd.read_csv, data_files)
    if columns_needed != ALL:
        columns_needed = ['category'] + columns_needed
    df = pd.concat(dataframes, sort=False)[columns_needed]
    return df

If it's the ALL default you don't know how to implement, you can try this:
def combine_all_data(data_files, columns_needed=None):
    kwargs = dict()
    if columns_needed is not None:
        if 'category' not in columns_needed:
            columns_needed = ['category'] + columns_needed
        kwargs['usecols'] = columns_needed
    dataframes = [pd.read_csv(data_file, **kwargs) for data_file in data_files]
    return pd.concat(dataframes, sort=False)
The advantage of this is that you need less memory, because the columns you don't want to see are already skipped during reading.
Additionally, you return a full dataframe, not a slice of one, so you can work with it without restrictions.
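For illustration, a hedged usage sketch of this version; the file names 'jan.csv' and 'feb.csv' and the column names 'price' and 'quantity' are invented:
# All columns from every file (the default):
df_all = combine_all_data(['jan.csv', 'feb.csv'])

# Only the requested columns; 'category' is prepended automatically:
df_small = combine_all_data(['jan.csv', 'feb.csv'],
                            columns_needed=['price', 'quantity'])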

read_csv has a usecols argument:
def combine_all_data(data_files, columns_needed='ALL'):
    if columns_needed != 'ALL':
        if 'category' not in columns_needed:
            columns_needed.append('category')
        return pd.concat([pd.read_csv(x, usecols=columns_needed)
                          for x in data_files], sort=False)
    else:
        return pd.concat([pd.read_csv(x) for x in data_files], sort=False)


python pandas: columns are being renamed by function

I have a dataframe that's created by a host of functions. From there, I need to create two more dataframes off the master frame. I have another function that takes the master frame and does a few more transformations on it. One of those is changing the column names; however, that in turn changes the column names on the master, and I can't figure out why.
def create_y_df(c_dataframe: pd.DataFrame):
    x_col_list = [str(i) for i in c_dataframe.columns]
    for i, j in enumerate(x_col_list):
        if 'Unnamed:' in j:
            x_col_list[i] = x_col_list[i-1]
            x_col_list[i-1] = 'drop'
    c_dataframe.columns = x_col_list
    c_dataframe = c_dataframe.drop(['drop'], axis=1)
    c_dataframe = c_dataframe.apply(lambda x: pd.Series(x.dropna().values))
    return c_dataframe

master_df = create_master(params)
y_df = create_y_df(master_df)
After running this, if I export master_df again, the columns now include 'drop'. What's interesting is that if I remove the column-renaming loop from create_y_df but leave the x.dropna(), that portion is not applied to master_df. I just have no idea why the c_dataframe.columns = x_col_list from create_y_df() is being applied to master_df.
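The likely cause: c_dataframe and master_df are the same object, so assigning to c_dataframe.columns mutates that shared object in place, while drop and apply return new dataframes and only rebind the local name. A minimal sketch of the usual fix, assuming the rest of the function stays as above:
def create_y_df(c_dataframe: pd.DataFrame):
    # Work on a copy so the column assignment below cannot touch the caller's frame.
    c_dataframe = c_dataframe.copy()
    ...  # rest of the function, unchanged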

Optimal way to create a column by matching two other columns

The first df I have is one that has station codes and names, along with lat/long (not as relevant), like so:
code name latitude longitude
I have another df with start/end dates for travel times. This df has only the station code, not the station name, like so:
start_date start_station_code end_date end_station_code duration_sec
I am looking to add columns that have the name of the start/end stations to the second df by matching the first df "code" and second df "start_station_code" / "end_station_code".
I am relatively new to pandas, and was looking for a way to optimize doing this as my current method takes quite a while. I use the following code:
for j in range(0, len(df_stations)):
    for i in range(0, len(df)):
        if df_stations['code'][j] == df['start_station_code'][i]:
            df['start_station'][i] = df_stations['name'][j]
        if df_stations['code'][j] == df['end_station_code'][i]:
            df['end_station'][i] = df_stations['name'][j]
I am looking for a faster method, any help is appreciated. Thank you in advance.
Use merge. If you are familiar with SQL, merge is the equivalent of a JOIN (pass how="left" for a LEFT JOIN; the default is an inner join):
cols = ["code", "name"]
result = (
second_df
.merge(first_df[cols], left_on="start_station_code", right_on="code")
.merge(first_df[cols], left_on="end_station_code", right_on="code")
.rename(columns={"code_x": "start_station_code", "code_y": "end_station_code"})
)
The answer by #Code-Different is very nearly correct. However, the columns to be renamed are the name columns, not the code columns. For neatness you will likely want to drop the additional code columns that get created by the merges. Using your names for the dataframes, df and df_stations, the code needed to produce required_df is:
cols = ["code", "name"]
required_df = (
df
.merge(df_stations[cols], left_on="start_station_code", right_on="code")
.merge(df_stations[cols], left_on="end_station_code", right_on="code")
.rename(columns={"name_x": "start_station", "name_y": "end_station"})
.drop(columns = ['code_x', 'code_y'])
)
As you may notice, the merges give the dataframe duplicate 'code' columns, which get suffixed automatically; this is a built-in default of the merge command. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more detail.
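For completeness, a mapping-based alternative also avoids the explicit loops; this sketch is not from the answers above, and lookup is a name introduced here:
# Build a Series mapping station code -> station name, then map both columns.
lookup = df_stations.set_index('code')['name']
df['start_station'] = df['start_station_code'].map(lookup)
df['end_station'] = df['end_station_code'].map(lookup)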

Append to a pd.DataFrame, dynamically allocating any new columns

I want to aggregate some API responses into a DataFrame.
The request consistently returns a number of JSON key-value pairs, let's say A, B, C. Occasionally, however, it will return A, B, C, D.
I would like something comparable to SQL's OUTER JOIN, which will simply add the new row while filling the missing columns with NULL or some other placeholder.
The pandas join options insist on imposing a unique suffix for each side, which I really don't want.
Am I looking at this the wrong way?
If there is no easy solution, I could just select a subset of the consistently available columns, but I really wanted to download the lot and do the processing as a separate stage.
You can use pandas.concat, as it provides all the functionality required for your problem. Let this toy problem illustrate a possible solution.
import numpy as np
import pandas as pd

# This generates random data with some key and value pairs.
def gen_data(_size):
    import string
    keys = list(string.ascii_uppercase)
    return dict((k, [v]) for k, v in zip(np.random.choice(keys, _size),
                                         np.random.randint(1000, size=_size)))

counter = 0
df = pd.DataFrame()
while True:
    if counter > 5:
        break
    # Receive the data
    new_data = gen_data(5)
    # Converting this to a dataframe obj
    new_data = pd.DataFrame(new_data)
    # Appending this data to my stack
    df = pd.concat((df, new_data), axis=0, sort=True)
    counter += 1

df.reset_index(drop=True, inplace=True)
print(df.to_string())
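A smaller deterministic sketch of the same idea, with made-up batches, shows the automatic NaN fill when a new key D appears:
import pandas as pd

batch_1 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})
batch_2 = pd.DataFrame({'A': [4], 'B': [5], 'C': [6], 'D': [7]})

combined = pd.concat([batch_1, batch_2], ignore_index=True, sort=True)
print(combined)
#    A  B  C    D
# 0  1  2  3  NaN
# 1  4  5  6  7.0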

Pandas append() error with two dataframes

When I try to append two or more dataframes and write the result to a CSV, the values come out in a diagonal, waterfall-like pattern.
dataset = pd.read_csv('testdata.csv')
for i in segment_dist:
    for j in step:
        print_msg = str(i) + ":" + str(j)
        print("\n", i, ":", j, "\n")
        temp = pd.DataFrame(estimateRsq(dataset, j, i), columns=[print_msg])
        csv = csv.append(temp)
csv.to_csv('output.csv', encoding='utf-8', index=False)
estimateRsq() returns an array. I think this much of the code should be enough to help me out.
The format I am getting in output.csv is shown in the screenshot (not reproduced here).
Please help: how can I shift the contents up so they start from index 1?
From df.append documentation:
Append rows of other to the end of this frame, returning a new
object. Columns not in this frame are added as new columns.
If you want to add columns to the right, use pd.concat with axis=1 (meaning horizontally):
list_of_dfs = [first_df, second_df, ...]
pd.concat(list_of_dfs, axis=1)
You may want to add parameter ignore_index=True if indexes in dataframes don't match.
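A tiny illustration with invented frames:
import pandas as pd

first_df = pd.DataFrame({'a': [1, 2]})
second_df = pd.DataFrame({'b': [3, 4]})

print(pd.concat([first_df, second_df], axis=1))
#    a  b
# 0  1  3
# 1  2  4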
Build a list of dataframes, then concatenate
pd.DataFrame.append is expensive relative to list.append + a single call of pd.concat.
Therefore, you should aggregate to a list of dataframes and then use pd.concat on this list:
lst = []
for i in segment_dist:
    # do something
    temp = pd.DataFrame(...)
    lst.append(temp)

df = pd.concat(lst, ignore_index=True, axis=0)
df.to_csv(...)

How can I iterate through multiple dataframes to select a column in each in python?

For my project I'm reading in CSV files with data from every state in the US. My function converts each of these into a separate DataFrame, as I need to perform operations on each state's information.
def RanktoDF(csvFile):
    df = pd.read_csv(csvFile)
    df = df[pd.notnull(df['Index'])]  # drop all null values
    df = df[df.Index != 'Index']      # drop all extra headers
    df = df.set_index('State')        # set State as index
    return df
I apply this function to every one of my files and store the df under a name from my array varNames:
for name, s in zip(glob.glob('*.csv'), varNames):
    vars()["Crime" + s] = RanktoDF(name)
All of that works perfectly.
My problem is that I also want to create a Dataframe thats made up of one column from each of those State Dataframes.
I have tried iterating through a list of my dataframes, selecting the column (Population) I want, and appending it to a new DataFrame:
dfList
dfNewIndex = pd.DataFrame(index=CrimeRank_1980_df.index)  # create new DF with index
for name in dfList:  # dfList is my list of dataframes. See image
    newIndex = name['Population']
    dfNewIndex.append(newIndex)
    # dfNewIndex = pd.concat([dfNewIndex, dfList[name['Population']], axis=1)
My error is always the same, which tells me that name is treated as a string rather than an actual DataFrame:
TypeError Traceback (most recent call last)
<ipython-input-30-5aa85b0174df> in <module>()
3
4 for name in dfList:
----> 5 newIndex = name['Index']
6 dfNewIndex.append(newIndex)
7 # dfNewIndex = pd.concat([dfNewIndex, dfList[name['Population']], axis=1)
TypeError: string indices must be integers
I understand that my list is a list of strings rather than variables/dataframes, so my question is: how can I correct my code to do what I want, or is there an easier way of doing this?
Any solutions I've looked up have given answers where the dataframes are explicitly typed in order to be concatenated, but I have 50, so that's a little unfeasible. Any help would be appreciated.
One way would be to index into vars(), e.g.
for name in dfList:
    newIndex = vars()[name]["Population"]
Alternatively I think it would be neater to store your dataframes in a container and iterate through that, e.g.
frames = {}
for name, s in zip(glob.glob('*.csv'), varNames):
    frames["Crime" + s] = RanktoDF(name)

for name in frames:
    newIndex = frames[name]["Population"]
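From there, building the single DataFrame of Population columns the question asks about could look like this sketch (populations and df_populations are names introduced here):
# Label each state's Population column with its frame's key,
# then line the columns up side by side.
populations = {name: frames[name]['Population'] for name in frames}
df_populations = pd.concat(populations, axis=1)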
