When I work with data, very often I will have a bunch of similar objects I want to iterate over to do some processing and store the results.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(0, 1000, 20))
df2 = pd.DataFrame(np.random.randint(0, 1000, 20))
results = []
for df in [df1, df2]:
    tmp_result = df.median()  # do some processing
    results.append(tmp_result)  # append the result
The problem I have with this is that it's not clear which dataframe each result corresponds to. I thought of using the objects themselves as keys for a dict, but this won't always work: dataframes are not hashable and can't be used as dict keys:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(0, 1000, 20))
df2 = pd.DataFrame(np.random.randint(0, 1000, 20))
results = {}
for df in [df1, df2]:
    tmp_result = df.median()  # do some processing
    results[df] = tmp_result  # doesn't work: TypeError, unhashable type
I can think of a few hacks to get around this, like defining unique keys for the input objects before the loop, or storing the input and the result together as a tuple in the results list. But in my experience those approaches are rather unwieldy and error prone, and I suspect they're not terribly great for memory usage either. Mostly, I just end up using the first example and make sure I'm careful to manually keep track of the position of the results.
Are there any obvious solutions or best practices to this problem here?
You can keep the original dataframe and the result together in a class:
class Whatever:
    def __init__(self, df):
        self.df = df
        self.result = None
whatever1 = Whatever(pd.DataFrame(...))
whatever2 = Whatever(pd.DataFrame(...))
for whatever in [whatever1, whatever2]:
    whatever.result = whatever.df.median()
There are many ways to improve this depending on your situation: generate the result right in the constructor, add a method to generate and store it, compute it on the fly from a property, and so on.
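For instance, the compute-on-the-fly variation could be sketched like this (the class and attribute names here are placeholders, not from the original code):

```python
import pandas as pd
import numpy as np

class Wrapped:
    """Pairs an input dataframe with its lazily computed result."""
    def __init__(self, df):
        self.df = df
        self._result = None

    @property
    def result(self):
        # Compute on first access, then reuse the cached value
        if self._result is None:
            self._result = self.df.median()
        return self._result

w = Wrapped(pd.DataFrame(np.random.randint(0, 1000, 20)))
print(w.result)  # the median, computed on first access
```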
I would concatenate your data frames, adding an index for each data frame, then use a group-by operation.
pd.concat([df1, df2], keys=['df1', 'df2']).groupby(level=0).median()
If your actual processing is more complex, you could use .apply() instead of .median().
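A sketch of that, under the assumption that the processing is a custom function (.median() here stands in for anything more complex):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randint(0, 1000, 20))
df2 = pd.DataFrame(np.random.randint(0, 1000, 20))

# The keys label each group, so every result stays tied to its source frame
combined = pd.concat([df1, df2], keys=['df1', 'df2'])
results = combined.groupby(level=0).apply(lambda g: g.median())
print(results.loc['df1'])  # the result for df1, looked up by name
```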
You can try something like this:
dd = {'df1': df1,
'df2': df2}
results_dict = {}
for k, v in dd.items():
    results_dict[k] = v.mean()
results_df = pd.concat(results_dict, axis=1)
print(results_df)
Output:
df1 df2
0 561.65 549.85
If you want correspondingly named output dfs, search SO for using globals() in a loop and see if you can rename them to something similar to the input.
For df1, you could name it
df1.name = 'df1_output'
then use globals() to set the name of the output df to df1.name. Then you'd have df1 and df1_output.
I have a pandas dataframe with the following data: (in csv)
#list1
poke_id,symbol
0,BTC
1,ETB
2,USDC
#list2
5,SOL
6,XRP
I am able to concatenate them into one dataframe using the following code:
df = pd.concat([df1, df2], ignore_index = True)
df = df.reset_index(drop = True)
df['poke_id'] = df.index
df = df[['poke_id','symbol']]
which gives me the output: (in csv)
poke_id,symbol
0,BTC
1,ETB
2,USDC
3,SOL
4,XRP
Is there any other way to do the same? I think re-building the whole dataframe of ~4000 entries just to add ~100 more is a little pointless and cumbersome. How can I make it pick list 1 (or dataframe 1), take its highest poke_id, and just continue with i + 1 for the later entries in list 2?
Your solution is good; it is possible to simplify it:
df = pd.concat([df1, df2], ignore_index = True).rename_axis('poke_id').reset_index()
Use indexes to get the data you want from the dataframe. This is not efficient if you want large amounts of data, but it lets you take specific rows from the dataframe.
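If you want the incremental behaviour described in the question (take df1's highest poke_id and continue numbering from there, touching only the new rows), a sketch could look like this; the toy frames mirror the csv data above:

```python
import pandas as pd

df1 = pd.DataFrame({'poke_id': [0, 1, 2], 'symbol': ['BTC', 'ETB', 'USDC']})
df2 = pd.DataFrame({'poke_id': [5, 6], 'symbol': ['SOL', 'XRP']})

start = df1['poke_id'].max() + 1                          # next free id
df2 = df2.assign(poke_id=range(start, start + len(df2)))  # renumber only the new rows
df = pd.concat([df1, df2], ignore_index=True)
print(df)
```

This way the existing ~4000 rows are never renumbered; only the appended batch gets fresh ids.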
I am looking into creating a big dataframe (pandas) from several individual frames. The data is organized in MF4-Files and the number of source files varies for each cycle. The goal is to have this process automated.
Creation of Dataframes:
df = (MDF('File1.mf4')).to_dataframe(channels)
df1 = (MDF('File2.mf4')).to_dataframe(channels)
df2 = (MDF('File3.mf4')).to_dataframe(channels)
These Dataframes are then merged:
df = pd.concat([df, df1, df2], axis=0)
How can I do this without dynamically creating variables for df, df1 etc.? Or is there no other way?
I have all file paths in an array of the form:
Filepath = ['File1.mf4', 'File2.mf4', 'File3.mf4']
Now I am thinking of looping through it and dynamically creating the dataframes df, df1, ..., df1000. Any advice here?
Edit: here is the full code:
df = (MDF('File1.mf4')).to_dataframe(channels)
df1 = (MDF('File2.mf4')).to_dataframe(channels)
df2 = (MDF('File3.mf4')).to_dataframe(channels)
#The Data has some offset:
x = df.index.max()
df1.index += x
x = df1.index.max()
df2.index += x
#With correct index now the data can be merged
df = pd.concat([df, df1, df2], axis=0)
The way I'm interpreting your question is that you have a predefined list you want. So just:
l = []
for f in [ list ... of ... files ]:
    df = load_file(f)  # however you load it
    l.append(df)
big_df = pd.concat(l)
del l, df, f  # if you want to clean it up
You therefore don't need to manually specify variable names for your data sub-sections. If you also want to do checks or column renaming between the various files, you can also just put that into the for-loop (or alternatively, if you want to simplify to a list comprehension, into the load_file function body).
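If you also need the index offset from the edit, the same loop can accumulate it. Here load_file is a stand-in for MDF(...).to_dataframe(channels):

```python
import pandas as pd

def load_file(f):
    # Stand-in loader; in the real code this would be
    # MDF(f).to_dataframe(channels)
    return pd.DataFrame({'value': [1.0, 2.0, 3.0]})

frames = []
offset = 0
for f in ['File1.mf4', 'File2.mf4', 'File3.mf4']:
    df = load_file(f)
    df.index += offset       # shift past the previous file, as in the edit
    offset = df.index.max()  # note: use max() + 1 if the boundary indices must not touch
    frames.append(df)

big_df = pd.concat(frames)
```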
Try this:
df_list = [(MDF(file)).to_dataframe(channels) for file in Filepath]
df = pd.concat(df_list)
I'm trying to join two dataframes on string columns which are not identical. I realise this has been asked a lot, but I am struggling to find anything relevant to my need. The code I have is as follows:
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
xls = pd.ExcelFile(filename)
df_1 = pd.read_excel(xls, sheet_name="Sheet 1")
df_2 = pd.read_excel(xls, sheet_name="Sheet 2")
df_2['key'] = df_2['Name'].apply(lambda x : [process.extract(x, df_1['Name'], limit=1)][0][0][0])
The idea would then be to join the two dataframes based on df_2['key']. However, when I run this code it completes but does not return anything. The df sizes are as follows: df_1 (3366, 8) and df_2 (1771, 6).
Is there a better way to do this?
This code returns nothing because that is exactly what it should do: df_2['key'] = ... just appends a 'key' column to the df_2 dataframe.
If you want to merge dataframes, your code should look similar to this:
name_list_1 = df_1['Name'].tolist()
name_list_2 = df_2['Name'].tolist()
matches = list(map(lambda x: process.extractOne(
x, name_list_1, scorer=fuzz.token_set_ratio)[:2], name_list_2))
df_keys = pd.DataFrame(matches, columns=['key', 'score'])
df_2 = pd.merge(df_2, df_keys, left_index=True, right_index=True)
df_2 = df_2[df_2['score'] > 70]
df_3 = pd.merge(df_1, df_2, left_on='Name', right_on='key', how='outer')
print(df_3)
I use the extractOne method, which I guess better suits your situation. It is important to play with the scorer parameter, as it heavily affects the matching result.
You could use process.extractOne() instead. Your code will look like:
name_list_1 = df_1['Name'].tolist()
name_list_2 = df_2['Name'].tolist()
key = list(map(lambda x: process.extractOne(x, name_list_1)[0], name_list_2))
df_2['key'] = key
then you can make the join on the key column.
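That final join might look like the following sketch; the toy data stands in for the fuzzy-matched key column built above:

```python
import pandas as pd

df_1 = pd.DataFrame({'Name': ['Acme Corp', 'Globex'], 'info': [1, 2]})
df_2 = pd.DataFrame({'Name': ['ACME Corporation', 'Globex Inc.'],
                     'key': ['Acme Corp', 'Globex']})  # output of extractOne

# Join each row of df_2 to the df_1 row its key points at
joined = pd.merge(df_1, df_2, left_on='Name', right_on='key',
                  suffixes=('_1', '_2'))
print(joined)
```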
I currently have a long script that has one goal: take multiple csv tables of data, merge them into one, while performing various calculations along the way, and then output a final csv table.
I originally had this layout (see LAYOUT A), but found that this made it hard to see which columns were being added or merged, because the cleaning and operations functions are listed below everything, so you have to go up and down in the file to see how the table gets altered. This was an attempt to follow the whole keep-things-modular-and-small methodology that I've read about:
# LAYOUT A
import pandas as pd
#...
SOME_MAPPER = {'a':1, 'b':2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}
def main():
    df1 = clean_table_1('table1.csv')
    df2 = clean_table_2('table2.csv')
    df3 = clean_table_3('table3.csv')
    df = pd.merge(df1, df2, on='col_a')
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)
    df = pd.merge(df, df3, on='new_col2')
    df['new_col3'] = df['something'] + df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)
    df = df.rename(columns=COLUMNS_RENAMER)
    return df

def some_operation(x, y, z):
    #<calculations for performing on table column>

def some_other_operation(a, b):
    #<some calculation>

def clean_table_1(fn_1):
    df = pd.read_csv(fn_1)
    df['some_col1'] = 400
    def do_operations_unique_to_table1(df):
        #<operations>
        return df
    df = do_operations_unique_to_table1(df)
    return df

def clean_table_2(fn_2):
    #<similar to clean_table_1>

def clean_table_3(fn_3):
    #<similar to clean_table_1>

if __name__ == '__main__':
    main()
My next inclination was to move all the functions in-line with the main script, so it's obvious what's being done (see LAYOUT B). This makes it a bit easier to see the linearity of the operations being done, but also makes it a bit messier, so that you can't just quickly read through the main function to get the "overview" of all the operations being done.
# LAYOUT B
import pandas as pd
#...
SOME_MAPPER = {'a':1, 'b':2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}
def main():
    def clean_table_1(fn_1):
        df = pd.read_csv(fn_1)
        df['some_col1'] = 400
        def do_operations_unique_to_table1(df):
            #<operations>
            return df
        df = do_operations_unique_to_table1(df)
        return df
    df1 = clean_table_1('table1.csv')

    def clean_table_2(fn_2):
        #<similar to clean_table_1>
    df2 = clean_table_2('table2.csv')

    def clean_table_3(fn_3):
        #<similar to clean_table_1>
    df3 = clean_table_3('table3.csv')

    df = pd.merge(df1, df2, on='col_a')

    def some_operation(x, y, z):
        #<calculations for performing on table column>
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)
    df = pd.merge(df, df3, on='new_col2')

    def some_other_operation(a, b):
        #<some calculation>
    df['new_col3'] = df['something'] + df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)
    df = df.rename(columns=COLUMNS_RENAMER)
    return df

if __name__ == '__main__':
    main()
So then I think: well, why even have these functions? Would it maybe be easier to follow if it's all at the same level, just as a script, like so (LAYOUT C):
# LAYOUT C
import pandas as pd
#...
SOME_MAPPER = {'a':1, 'b':2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}
def main():
    df1 = pd.read_csv('table1.csv')
    df1['some_col1'] = 400
    df1 = #<operations on df1>

    df2 = pd.read_csv('table2.csv')
    df2['some_col2'] = 200
    df2 = #<operations on df2>

    df3 = pd.read_csv('table3.csv')
    df3['some_col3'] = 800
    df3 = #<operations on df3>

    df = pd.merge(df1, df2, on='col_a')

    def some_operation(x, y, z):
        #<calculations for performing on table column>
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)
    df = pd.merge(df, df3, on='new_col2')

    def some_other_operation(a, b):
        #<some calculation>
    df['new_col3'] = df['something'] + df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)
    df = df.rename(columns=COLUMNS_RENAMER)
    return df

if __name__ == '__main__':
    main()
The crux of the problem is finding a balance between documenting clearly which columns are being updated, changed, dropped, renamed, merged, etc. while still keeping it modular enough to fit the paradigm of "clean code".
Also, in practice, this script and others are much longer with far more tables being merged into the mix, so this quickly becomes a long list of operations. Should I be breaking up the operations into smaller files and outputting intermediate files or is that just asking to introduce errors? It's a matter of also being able to see all the assumptions made along the way and how they affect the data in its final state, without having to jump between files or scroll way up, way down, etc. to follow the data from A to B, if that makes sense.
If anyone has insights on how to best write these types of data cleaning/manipulation scripts, I would love to hear them.
It is a highly subjective topic, but here is my typical approach, with remarks and hints:
as long as possible, optimize for debug/dev time and ease
split the flow into several scripts (e.g. download, preprocess, ... for every table separately, so that for merging, every table is already prepared separately)
try to keep the same order of operations within a script (e.g. type correction, fill na, scaling, new columns, drop columns)
every wrangle script loads at the start and saves at the end
save to pickle (to avoid problems like dates being saved as strings) and to a small csv (to have an easy preview of results outside of python)
with "integration points" being data, you can easily combine different technologies (caveat: in such cases you typically don't use pickle as output but csv/another data format)
every script has clearly defined input/output and can be tested and developed separately; I also use asserts on dataframe shapes
scripts for visualization/EDA use data from wrangle scripts, but they are never part of wrangle scripts; typically they are also the bottleneck
combine scripts with e.g. bash if you want simplicity
keep the length of a script below 1 page*
*if I have long, convoluted code, before I encapsulate it in a function I check if it can be done more simply; in 80% of cases it can, but you need to know more about pandas. In return you learn something new, the pandas doc is typically better than yours, and your code becomes more declarative and idiomatic
*if there is no easy way to simplify and you use the function in many places, put it into utils.py; in the docstring put a sample >>>f(input) output and some rationale for the function
*if a function is used across many projects, it is worth making a pandas extension like https://github.com/twopirllc/pandas-ta
*if I have a lot of columns, I think a lot about hierarchy and groupings and keep them in a separate file for every table; for now it is just a py file, but I have started to consider yaml and a way to document table structure
stick to one convention, e.g. I don't use inplace=True at all
chain operations on the dataframe*
*if you have a good, meaningful name for a subchain result that could be used elsewhere, that can be a good place to split the script
remove the main function; if you keep the script according to the rules above, there is nothing wrong with a global df variable
when I read from csv, I always check what can be done directly with read_csv parameters, e.g. parsing dates
clearly mark 'TEMPORARY HACKS'; in the long term they lead to unexpected side-effects
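The "chain operations" and "asserts on dataframe shapes" points above could be sketched roughly like this (the column names and threshold are invented for illustration):

```python
import pandas as pd

# Toy raw input standing in for the load-start of a wrangle script
raw = pd.DataFrame({'speed': ['10', '520', '30'],
                    'label': ['a', 'b', 'c']})

df = (raw
      .assign(speed=lambda d: d['speed'].astype(int))  # type correction first
      .query('speed <= 500')                           # then drop invalid rows
      .rename(columns={'label': 'name'}))              # renames near the end

assert df.shape == (2, 2)  # assert on the dataframe shape, as suggested
# the save-end would go here, e.g. df.to_pickle(...) plus a small csv preview
```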
Suppose I have n number of data frames df_1, df_2, df_3, ... df_n, containing respectively columns named SPEED1 ,SPEED2, SPEED3, ..., SPEEDn, for instance:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'SPEED1':np.random.uniform(0,600,100)})
df_2 = pd.DataFrame({'SPEED2':np.random.uniform(0,600,100)})
and I want to make the same changes to all of the data frames. How do I do so by defining a function on similar lines?
def modify(df, nr):
    df_invalid_nr = df_nr[df_nr['SPEED'+str(nr)] > 500]
    df_valid_nr = ~df_invalid_nr
    Invalid_cycles_nr = df[df_invalid]
    df = df[df_valid]
    print(Invalid_cycles_nr)
    print(df)
So, when I try to run the above function
modify(df_1,1)
It returns the entire data frame without modification and the invalid cycles as an empty array. I am guessing I need to define the modification on the global dataframe somewhere in the function for this to work.
I am also not sure if I could do this another way, say just looping an iterator through all the data frames. But, I am not sure it will work.
for i in range(1, n+1):
    df_invalid_i = df_i[df_i['SPEED'+str(i)] > 500]
    df_valid_i = ~df_invalid_i
    Invalid_cycles_i = df[df_invalid]
    df = df[df_valid]
    print(Invalid_cycles_i)
    print(df)
How do I, in general, access df_1 using an iterator? It seems to be a problem.
Any help would be appreciated, thanks!
Solution
Inputs
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'SPEED1': np.random.uniform(1, 600, 100)})
df_2 = pd.DataFrame({'SPEED2': np.random.uniform(1, 600, 100)})
Code
To my mind a better approach would be to store your dfs in a list and enumerate over it, augmenting your dfs with a valid column:
for idx, df in enumerate([df_1, df_2]):
    col = 'SPEED' + str(idx+1)
    df['valid'] = df[col] <= 500
print(df_1)
       SPEED1  valid
0  516.395756  False
1   14.643694   True
2  478.085372   True
3  592.831029  False
4    1.431332   True
You can then filter for valid or invalid with df_1[df_1.valid] or df_1[df_1.valid == False]
This is a solution that fits your problem; see Another (better?) solution below for a cleaner approach, and the Notes for the explanations you need.
Another (better?) solution
If it is possible for you, re-think your code. Each DataFrame has one speed column, so just name it SPEED:
dfs = dict(df_1=pd.DataFrame({'SPEED':np.random.uniform(0,600,100)}),
df_2=pd.DataFrame({'SPEED':np.random.uniform(0,600,100)}))
It will allow you to do the following one liner:
dfs = dict(map(lambda key_val: (key_val[0],
key_val[1].assign(valid = key_val[1]['SPEED'] <= 500)),
dfs.items()))
print(dfs['df_1'])
SPEED valid
0 516.395756 False
1 14.643694 True
2 478.085372 True
3 592.831029 False
4 1.431332 True
Explanations:
dfs.items() returns a list of keys (i.e. names) and values (i.e. DataFrames)
map(foo, bar) applies the function foo (see this answer, and DataFrame assign) to all the elements of bar (i.e. to all the key/value pairs of dfs.items())
dict() casts the map to a dict.
Notes
About modify
Notice that your function modify is not returning anything... I suggest you read more about mutability and immutability in Python. This article is interesting.
You can then test the following for instance:
def modify(df):
    df = df[df.SPEED1 < 0.5]
    # The change to df is local to the function's scope;
    # it will not modify your input. Return the df...
    return df

#... and assign the output to apply the changes
df_1 = modify(df_1)
About accessing df_1 using an iterator
Notice that when you do:
for i in range(1, n+1):
    df_i something
df_i in your loop refers to the single object named df_i on every iteration (not df_1, df_2, etc.)
To call an object by its name, use globals()['df_'+str(i)] instead (Assuming that df_1 to df_n+1 are located in globals()) - from this answer.
To my mind it is not a clean approach. I don't know how you create your DataFrames, but if it is possible I suggest you store them in a dictionary instead of assigning them manually:
dfs = {}
dfs['df_1'] = ...
or a bit more automatically, if df_1 to df_n already exist - according to the first part of vestland's answer:
dfs = dict((var, eval(var)) for var in dir()
           if isinstance(eval(var), pd.core.frame.DataFrame) and 'df_' in var)
Then it would be easier for your to iterate over your DataFrames:
for i in range(1, n+1):
    dfs['df_'+str(i)] something
You can use the globals() function, which allows you to get a variable by its name.
I just add df_i = globals()["df_"+str(i)] at the beginning of the for loop:
for i in range(1, n+1):
    df_i = globals()["df_"+str(i)]
    df_invalid_i = df_i.loc[df_i['SPEED'+str(i)] > 500]
    df_valid_i = ~df_invalid_i
    Invalid_cycles_i = df[df_invalid]
    df = df[df_valid]
    print(Invalid_cycles_i)
    print(df)
Your code sample leaves me a little confused, but focusing on
I want to make the same changes to all of the data frames.
and
How do I, in general, access df_1 using an iterator?
you can do exactly that by organizing your dataframes (dfs) in a dictionary (dict).
Here's how:
Assuming you've got a bunch of variables in your namespace...
# Imports
import pandas as pd
import numpy as np
# A few dataframes with random numbers
# df_1
np.random.seed(123)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['a', 'b'])
df_1 = df_1.set_index(rng)
# df_2
np.random.seed(456)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_2 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['c', 'd'])
df_2 = df_2.set_index(rng)
# df_3
np.random.seed(789)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_3 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['e', 'f'])
df_3 = df_3.set_index(rng)
...you can identify all that are dataframes using:
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
If you've got a lot of different dataframes but would only like to focus on those that have a prefix like 'df_', you can identify those by...
dfNames = []
for elem in alldfs:
    if str(elem)[:3] == 'df_':
        dfNames.append(elem)
... and then organize them in a dict using:
myFrames = {}
for dfName in dfNames:
    myFrames[dfName] = eval(dfName)
From that list of interesting dataframes, you can subset those that you'd like to do something with. Here's how you focus only on df_1 and df_2:
invalid = ['df_3']
for inv in invalid:
    myFrames.pop(inv, None)
Now you can reference ALL your valid dfs by looping through them:
for key in myFrames.keys():
    print(myFrames[key])
And that should cover the...
How do I, in general, access df_1 using an iterator?
...part of the question.
And you can of course reference a single dataframe by its name / key in the dict:
print(myFrames['df_1'])
From here you can do something with ALL columns in ALL dataframes.
for key in myFrames.keys():
    myFrames[key] = myFrames[key]*10
    print(myFrames[key])
Or, being a bit more pythonic, you can specify a lambda function and apply that to a subset of columns
# A function
decimator = lambda x: x/10
# A subset of columns:
myCols = ['SPEED1', 'SPEED2']
Apply that function to your subset of columns in your dataframes of interest:
for key in myFrames.keys():
    for col in list(myFrames[key]):
        if col in myCols:
            myFrames[key][col] = myFrames[key][col].apply(decimator)
            print(myFrames[key][col])
So, back to your function...
modify(df_1,1)
... here's my take on it wrapped in a function.
First we'll redefine the dataframes and the function.
Oh, and with this setup, you're going to have to obtain all dfs OUTSIDE your function with alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)].
Here's the datasets and the function for an easy copy-paste:
# Imports
import pandas as pd
import numpy as np
# A few dataframes with random numbers
# df_1
np.random.seed(123)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3'])
df_1 = df_1.set_index(rng)
# df_2
np.random.seed(456)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_2 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3'])
df_2 = df_2.set_index(rng)
# df_3
np.random.seed(789)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_3 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3'])
df_3 = df_3.set_index(rng)
# A function that divides columns by 10
decimator = lambda x: x/10
# A reference to all available dataframes
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
# A function as per your request
def modify(dfs, cols, fx):
    """ Define a subset of available dataframes and a list of interesting
    columns, and apply a function to those columns.
    """
    # Subset all dataframes with names that start with df_
    dfNames = []
    for elem in alldfs:
        if str(elem)[:3] == 'df_':
            dfNames.append(elem)
    # Organize those dfs in a dict if they match the dataframe names of interest
    myFrames = {}
    for dfName in dfNames:
        if dfName in dfs:
            myFrames[dfName] = eval(dfName)
    print(myFrames)
    # Apply fx to the cols of your dfs subset
    for key in myFrames.keys():
        for col in list(myFrames[key]):
            if col in cols:
                myFrames[key][col] = myFrames[key][col].apply(fx)
# A testrun. Results in screenshots below
modify(dfs = ['df_1', 'df_2'], cols = ['SPEED1', 'SPEED2'], fx = decimator)
Here are dataframes df_1 and df_2 before manipulation:
Here are the dataframes after manipulation:
Anyway, this is how I would approach it.
Hope you'll find it useful!