avoid writing df['column'] twice when doing df['column'] = df['column'] - python

I don't even know how to phrase this but is there a way in Python to reference the text before the equals without having to actually write it again?
** EDIT - I'm using python3 in Jupyter
I seem to spend half my life writing:
df['column'] = df['column'].some_changes
Is there a way to tell Python that I'm referencing the part before the equals sign?
For example, I would write the following, where <% is just to represent the reference to the text before the = (df['column'])
df['column'] = <%.replace(np.nan)

You are looking for in-place methods.
I believe you can pass inplace=True as an argument to most methods in pandas,
so it would be something like
df['column'].replace(np.nan, inplace=True)
Edit:
You could also do
df["computed_column"] = df["original_column"].many_operations
so you still have access to the original data down the line.
And do all the needed operations at once instead of saving each step.
One of the advantages of inplace not being the default is that if a batch of operations fails midway, your data is not mangled.
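As a minimal sketch of writing the column name only once (the column name and fill value here are made up), you can also pass a column-to-value mapping to fillna with inplace=True:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; 'column' and the fill value are invented names
df = pd.DataFrame({"column": ["a", np.nan, "b"]})

# Equivalent to df['column'] = df['column'].fillna('missing'),
# but the column name appears only once:
df.fillna({"column": "missing"}, inplace=True)
```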

Related

Is overwriting variables names for lengthy operations bad style?

I quite often find myself in a situation where I undertake several steps to get from my start data input to the output I want to have, e.g. in functions/loops. To avoid making my lines very long, I sometimes overwrite the variable name I am using in these operations.
One example would be:
df_2 = df_1.loc[df_1['id'] == val]
df_2 = df_2[['c1', 'c2']]
df_2 = df_2.merge(df3, left_on='c1', right_on='c1')
The only alternative I can come up with is:
df_2 = df_1.loc[df_1['id'] == val][['c1', 'c2']]\
    .merge(df3, left_on='c1', right_on='c1')
But none of these options feels really clean. How should these situations be handled?
You can refer to this article which discusses exactly your question.
The pandas core team now encourages the use of "method chaining".
This is a style of programming in which you chain together multiple
method calls into a single statement. This allows you to pass
intermediate results from one method to the next rather than storing
the intermediate results using variables.
In addition to prettifying chained code with brackets and indentation as in #perl's answer, you might also find functions like .query() and .assign() useful for coding in a "method chaining" style.
Of course, there are some drawbacks to method chaining, especially when it is excessive:
"One drawback to excessively long chains is that debugging can be
harder. If something looks wrong at the end, you don't have
intermediate values to inspect."
Just as another option, you can put everything in brackets and then break the lines, like this:
df_2 = (df_1
        .loc[df_1['id'] == val][['c1', 'c2']]
        .merge(df3, left_on='c1', right_on='c1'))
It's generally pretty readable even if you have a lot of lines, and if you want to change the name of the output variable, you only need to change it in one place. So, a bit less verbose, and a bit easier to change than overwriting the variables.
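The same pipeline can be sketched in chaining style with .query(); the frames and values below are hypothetical stand-ins for the question's data:

```python
import pandas as pd

# Invented stand-ins for the question's df_1, df3 and val
df_1 = pd.DataFrame({"id": [1, 1, 2], "c1": ["x", "y", "x"], "c2": [10, 20, 30]})
df3 = pd.DataFrame({"c1": ["x", "y"], "c3": [100, 200]})
val = 1

df_2 = (df_1
        .query("id == @val")          # replaces .loc[df_1['id'] == val]
        [["c1", "c2"]]
        .merge(df3, left_on="c1", right_on="c1"))
```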

Delete a Portion of a CSV Cell in Python

I have recently stumbled upon a task utilizing some CSV files that are, to say the least, very poorly organized, with one cell containing what should be multiple separate columns. I would like to use this data in a Python script but want to know if it is possible to delete a portion of the row (all of it after a certain point) then write that to a dictionary.
Although I can't show the exact contents of the CSV, it looks like this:
useful. useless useless useless useless
I understand that this will most likely require either a regular expression or an endswith statement, but doing all of that to a CSV file is beyond me. Also, the period written after useful on the CSV should be removed as well, and is not a typo.
If you know the character you want to split on you can use this simple method:
good_data = bad_data.split(".")[0]
good_data = good_data.strip() # remove excess whitespace at start and end
This method will always work: split returns a list, which will always have at least one entry (the full string), whereas using index may throw an exception.
You can also limit the number of splits if necessary using split(".", N).
https://docs.python.org/2/library/stdtypes.html#str.split
>>> "good.bad.ugly".split(".", 1)
['good', 'bad.ugly']
>>> "nothing bad".split(".")
['nothing bad']
>>> stuff = "useful useless"
>>> stuff = stuff[:stuff.index(".")]
ValueError: substring not found
Actual Answer
Ok then, notice that you can use indexing for strings just like you do for lists. I.e. "this is a very long string but we only want the first 4 letters"[:4] gives "this". If we knew the index of the dot, we could get what you want just like that. For exactly this, strings have the index method. So in total you do:
stuff = "useful. useless useless useless useless"
stuff = stuff[:stuff.index(".")]
Now stuff is very useful :).
In case we are talking about a file containing multiple lines like that, you could do it for each line: truncate each line at the ".", split the rest on ",", and put everything in a dictionary.
data = {}
with open("./test.txt") as f:
    for i, line in enumerate(f.read().split("\n")):
        csv_line = line[:line.index(".")]
        for j, col in enumerate(csv_line.split(",")):
            data[(i, j)] = col
How one would do this
Notice that most people would not want to do this by hand. Working with tabled data is a common task, and there is a library called pandas for it. Maybe it would be a good idea to familiarise yourself a bit more with Python before you dive into pandas, though. I think a good point to start is this. Using pandas, your task would look like this:
import pandas as pd
pd.read_csv("./test.txt", comment=".")
giving you what is called a dataframe.
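A minimal, self-contained sketch of that comment="." trick (the sample data is invented): read_csv drops everything on a line after the comment character.

```python
import io
import pandas as pd

# Two lines shaped like the question's data; everything after the first
# '.' on each line is treated as a comment and dropped by read_csv
raw = "useful1. useless useless\nuseful2. useless useless\n"
df = pd.read_csv(io.StringIO(raw), comment=".", header=None)
```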

How can I manipulate a DataFrame name within a function?

How can I manipulate a DataFrame name within a function so that I can have a new DataFrame with a new name that is derived from the input DataFrame name in return?
let say I have this:
def some_func(df):
    # some operations
    return df_copy
and whatever df I put inside this function it should return the new df as ..._copy, e.g. some_func(my_frame) should return my_frame_copy.
Things that I considered are as follows:
As in string operations;
new_df_name = "{}_copy".format(df) -- I know this will not work since the df refers to an object but it just helps to explain what I am trying to do.
def date_timer(df):
    df_copy = df.copy()
    dates = df_copy.columns[df_copy.columns.str.contains('date')]
    for i in range(len(dates)):
        df_copy[dates[i]] = pd.to_datetime(df_copy[dates[i]].str.replace('T', ' '), errors='coerce')
    return df_copy
Actually this was the first thing that I tried. If only DataFrame had a "name" attribute which allowed us to manipulate the name, but that is not there either:
df.name
Maybe an f-string or some other kind of string operation could make it happen; if not, it might not be possible to do in Python.
I think this might be related to the variable-name assignment rules in Python, and in a sense what I want is to reverse engineer that, which is probably not possible.
Please advise...
It looks like you're trying to access / dynamically set the global/local namespace of a variable from your program.
Unless your data object belongs to a more structured namespace object, I'd discourage you from dynamically setting names with such a method since a lot can go wrong, as per the docs:
Changes may not affect the values of local and free variables used by the interpreter.
The name attribute of your df is not an ideal solution, since that attribute is not set by default, nor is it particularly common. However, here is a solid SO answer which addresses this.
You might be better off storing your data objects in a dictionary, using dates or something meaningful as keys. Example:
my_data = {}
for my_date in dates:
    df_temp = df.copy(deep=True)  # deep copy ensures no changes are translated to the parent object
    # Modify your df here (not sure what you are trying to do exactly)
    df_temp[my_date] = "foo"
    # Now save that df
    my_data[my_date] = df_temp
Hope this answers your Q. Feel free to clarify in the comments.
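As a minimal sketch of the usual alternative (the names and the example column are invented): since a function cannot know its argument's variable name, the idiomatic pattern is to return the copy and let the caller choose the new name.

```python
import pandas as pd

def some_func(df):
    df_copy = df.copy()     # work on a copy; the caller's frame is untouched
    df_copy["flag"] = True  # 'flag' is a made-up example column
    return df_copy

my_frame = pd.DataFrame({"a": [1, 2]})
my_frame_copy = some_func(my_frame)  # the caller, not the function, picks the name
```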

python string as variable reporting nan

I am extracting a name (variable value) from a document with the below:
find_name = re.search(r'^[^\d]*', clean_content)
Name = find_name.group(0)
NameUp = Name.upper()
Which works fine... it equals DAN STEPP as needed.
I then open up an excel file:
data1 = pd.read_excel(config.Excel1)
Pass it into a DataFrame and give it headers; all this works:
df = pd.DataFrame(data1)
header = df.iloc[0]
Now when I do the search with the below, it erroneously returns nan:
row_numberd1 = df[df['Member Name'].str.contains(NameUp)].index.min()
My NameUp variable equals "DAN STEPP" when I print and test it, so it does contain the correct value. However, when I use the variable in the search above, I get nan.
When I replace NameUp with the literal "DAN STEPP", not using the variable, it is found. Any thoughts on this? i.e. .str.contains("DAN STEPP")
Would you mind doing repr(NameUp)? It's slightly different from str(NameUp) in that it prints exactly what's in the string. Besides that, I'm not sure what to make of
row_numberd1 = df[df['Member Name'].str.contains(NameUp)].index.min()
I don't use pandas, but that's a lot of stuff in one line. I would check each step individually to see what's wrong. Since you said it returns the wrong thing with the NameUp variable, I would deconstruct df['Member Name'].str.contains(NameUp) to see what it spits out, and make sure that it's consistent with your testing. Have you tried any other names/values?
TL;DR: if the variable is not working and manually inputting the string is, one of two things is happening. Either the two strings differ in some minor way, or the processes by which you are testing the two are not the same.
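A small sketch of how an invisible character produces exactly this symptom (the data and the stray '\xa0' are invented): str.contains finds no match, and .min() on the resulting empty index is nan, which repr() makes visible.

```python
import math
import pandas as pd

df = pd.DataFrame({"Member Name": ["DAN STEPP", "JANE DOE"]})

needle = "DAN STEPP\xa0"   # trailing non-breaking space, invisible when printed
print(repr(needle))        # repr() exposes it: 'DAN STEPP\xa0'

# No row matches, so the filtered frame is empty and .min() gives nan
row = df[df["Member Name"].str.contains(needle, regex=False)].index.min()
print(row)
```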

How do I update value in DataFrame with mask when iterating through rows

With the below code I'm trying to update the column df_test['placed'] to 1 when the if statement is triggered and a prediction is placed. I haven't been able to get this to update correctly, though: the code runs but doesn't set the value to 1 for the respective predictions placed.
df_test['placed'] = np.zeros(len(df_test))
for i in set(df_test['id']):
    mask = df_test['id'] == i
    predictions = lm.predict(X_test[mask])
    j = np.argmax(predictions)
    if predictions[j] > 0:
        df_test['placed'][mask][j] = 1
        print(df_test['placed'][mask][j])
Answering your question
Edit: changed suggestion based on comments
The assignment part of your code, df_test['placed'][mask][j] = 1, uses what is called chained indexing. In short, your assignment only changes a temporary copy of the DataFrame that gets immediately thrown away, and never changes the original DataFrame.
To avoid this, the rule of thumb when doing assignment is: use only one set of square brackets on a single DataFrame. For your problem, that should look like:
df_test.loc[mask.nonzero()[0][j], 'placed'] = 1
(I know the mask.nonzero() uses two sets of square brackets; actually nonzero() returns a tuple, and the first element of that tuple is an ndarray. But the dataframe only uses one set, and that's the important part.)
Some other notes
There are a couple notes I have on using pandas (& numpy).
Pandas & NumPy both have a feature called broadcasting. Basically, if you're assigning a single value to an entire array, you don't need to make an array of the same size first; you can just assign the single value, and pandas/NumPy automagically figures out for you how to apply it. So the first line of your code can be replaced with df_test['placed'] = 0, and it accomplishes the same thing.
Generally speaking when working with pandas & numpy objects, loops are bad; usually you can find a way to use some combination of broadcasting, element-wise operations and boolean indexing to do what a loop would do. And because of the way those features are designed, it'll run a lot faster too. Unfortunately I'm not familiar enough with the lm.predict method to say, but you might be able to avoid the whole for-loop entirely for this code.
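Putting both notes together, here is a self-contained sketch; lm.predict is stood in for by a plain 'score' column, so the data and the model are invented:

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({"id": [1, 1, 2, 2], "score": [0.2, 0.9, 0.1, 0.8]})
df_test["placed"] = 0  # broadcasting: one scalar fills the whole column

for i in df_test["id"].unique():
    mask = df_test["id"] == i
    predictions = df_test.loc[mask, "score"].to_numpy()  # stand-in for lm.predict
    j = np.argmax(predictions)
    if predictions[j] > 0:
        # one .loc assignment on the original frame, no chained indexing;
        # with the default RangeIndex, positions and labels coincide
        df_test.loc[mask.to_numpy().nonzero()[0][j], "placed"] = 1
```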
