I currently iterate through the rows of an excel file multiple times and write in "XYZ" to a new column when the row meets certain conditions.
My current code is:
df["new_column"] = np.where(fn == True, "XYZ", "")
The issue I face is that when the fn == True condition is not satisfied, I want to do absolutely nothing and move on to checking the next row of the Excel file. I noticed that each time I iterate, the empty string replaces the "XYZ"s that are already marked in the file. Is there a way to prevent this from happening? Is there something I can use instead of the empty string ("") to prevent overwriting?
Edit:
My dataframe is a huge financial Excel file with multiple columns and rows. This data set has columns like quantity, revenue, sales, etc. Basically, I have a list that contains about 50 conditionals. For each condition, I iterate through all the rows in the Excel and for the row that matches the condition, I wanted to put an "XYZ" in the df["new_column"] flagging that row. The df["new_column"] is an added column to the original dataframe. Then, I move onto the next condition up until the 50th conditional.
I think the problem is that the way I wrote the code replaces the previously set "XYZ"s with empty strings when I proceed to check the other conditionals in the list. Basically, I want to find a way to lock "XYZ" in so it can't be overwritten.
The fn is a helper function that returns a boolean depending on if the condition equals a row in the dataframe. While I iterate, if the condition matches a row, then this function returns True and marks the df["new_column"] with "XYZ". The helper function takes in multiple arguments to check if the current condition matches any of the rows in the dataframe. I hope this explanation helps!
You can try using a lambda.
First, create the function:
def checkIfTrue(FN, new):
    # keep the existing value when the condition fails,
    # so earlier "XYZ" flags are not overwritten
    if FN:
        return "XYZ"
    return new
Then apply it to the new column like this:
df['new_column'] = df.apply(lambda row: checkIfTrue(row["fn"], row["new_column"]), axis=1)
IIUC you want to use .loc[]:
df.loc[fn, "new_column"] = 'XYZ'
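For instance, here is a minimal sketch with made-up column names and conditions standing in for the 50 conditionals: each .loc assignment writes only the rows its mask matches, so flags set by earlier conditions survive later passes.

```python
import pandas as pd

# Toy frame standing in for the OP's financial data (column names are made up)
df = pd.DataFrame({"quantity": [1, 5, 10], "revenue": [100, 50, 200]})
df["new_column"] = ""

# Each condition only touches the rows it matches, so earlier
# "XYZ" flags are left alone instead of being wiped by np.where.
df.loc[df["quantity"] > 4, "new_column"] = "XYZ"
df.loc[df["revenue"] > 150, "new_column"] = "XYZ"

print(df["new_column"].tolist())  # ['', 'XYZ', 'XYZ']
```

Running all 50 conditions this way is just 50 such assignments against the same column.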
I have a huge 800k row dataframe which I need to find the key with another dataframe.
Initially I was looping through my two dataframes and checking the values of the keys with a condition.
I was told merge could save time, but I haven't been able to make it work :(
Overall, here's the code I'm trying to adapt:
mergeTwo = pd.read_json('merge/mergeUpdate.json')
matches = pd.read_csv('archive/matches.csv')
for indexOne, value in tqdm(mergeTwo.iterrows()):
    for index, match in matches.iterrows():
        if value["gameid"] == match["gameid"]:
            print(match)
for index, value in mergeTwo.iterrows():
    test = value.to_frame().merge(matches, on='gameid')
    print(test)
The first version works without problems.
The second one complains about an unknown key (gameid).
Anyone got a solution?
Thanks in advance !
When you iterate over rows, value is a Series, which the to_frame method turns into a one-column frame with the original column names as its index. So you need to transpose it to make the second approach work:
for index, value in mergeTwo.iterrows():
    # note .T after .to_frame
    test = value.to_frame().T.merge(matches, on='gameid')
    print(test)
But iteration is redundant here; a single merge on the first frame should be enough:
mergeTwo.merge(matches, on='gameid', how='left')
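A minimal sketch with toy frames (only the shared gameid key is taken from the question; the other columns are made up):

```python
import pandas as pd

# Toy stand-ins for mergeTwo and matches; only gameid is shared
mergeTwo = pd.DataFrame({"gameid": [1, 2, 3], "team": ["A", "B", "C"]})
matches = pd.DataFrame({"gameid": [2, 3, 4], "winner": ["B", "C", "D"]})

# One vectorised merge replaces both iterrows loops; how='left'
# keeps every row of mergeTwo even when there is no match.
out = mergeTwo.merge(matches, on="gameid", how="left")
print(out)
```

Rows of mergeTwo without a matching gameid get NaN in the columns coming from matches, which also makes unmatched keys easy to spot.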
I'm writing code that data cleans my original dataframe and then spits out a dataframe of all the rows with errors. This line of code currently finds empty cells in the column 'RaceId'. I have 40 other columns and I would like to find empty cells in all of them apart from 'InRun' and 'Flucs'. How do I create a line of code that does this so I don't have to write out 40 lines of code?
My code:
df2[df2['RaceId'] == '']
If you want to find the rows that have an empty string in all of the specified columns, pass all to agg:
df[your_col_list].apply(lambda x: x == '').agg(all, axis=1)
Otherwise, pass any to agg.
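As a minimal sketch (column names other than the two excluded ones are made up), the any variant flags every row with at least one empty cell in the checked columns:

```python
import pandas as pd

# Toy frame; the real data has ~40 columns, with 'InRun' and 'Flucs' excluded by name
df2 = pd.DataFrame({
    "RaceId": ["r1", "", "r3"],
    "Horse": ["", "h2", "h3"],
    "InRun": ["", "", ""],
})

# Build the column list once instead of writing 40 comparisons by hand
cols = [c for c in df2.columns if c not in ("InRun", "Flucs")]

# any -> rows with an empty cell in at least one checked column
mask = df2[cols].apply(lambda x: x == "").agg(any, axis=1)
errors = df2[mask]
print(errors.index.tolist())  # [0, 1]
```

Note that 'InRun' is empty in every row here but never triggers the mask, since it is filtered out of cols.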
I'm using the .at function to try and save all columns under one header in a list.
The file contains entries for country and population.
df = pandas.read_csv("file.csv")
population_list = []
df2 = df[df['country'] == "India"]
for i in range(len(df2)):
    population_list = df2.at[i, 'population']
This is throwing a KeyError. However, the df.at seems to be working fine for the original dataframe. Is .at just not allowed in this case?
IIUC, you don't need to loop over your dataframe to get what you need. Simply use:
population_list = df2["population"].tolist()
If you really want to use the loop (not recommended when unnecessary), note that the index has likely changed after your filter, i.e not consecutive integers.
Try:
for i in df2.index:
    population_list.append(df2.at[i, 'population'])
Note: In your code you keep trying to reassign the entire list to one value instead of appending.
In at you pass the index value and the column name.
In the case of the "original" DataFrame everything is OK, because the index probably contains
consecutive values starting from 0.
But when you run df2 = df[df['country'] == "India"], df2 contains
only a subset of the original rows, so the index no longer holds consecutive numbers.
One possible solution is to run reset_index() on df2.
Then the index will again contain consecutive numbers and your code should raise no exception.
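A minimal sketch with made-up data of why the filtered index breaks at, and how reset_index fixes it:

```python
import pandas as pd

df = pd.DataFrame({"country": ["USA", "India", "India"],
                   "population": [330, 1400, 1400]})

df2 = df[df["country"] == "India"]
print(df2.index.tolist())  # [1, 2] - not 0..len-1, so df2.at[0, ...] raises KeyError

# drop=True discards the old index instead of keeping it as a column
df2 = df2.reset_index(drop=True)
print(df2.at[0, "population"])  # 1400
```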
Edit
But your code raises other doubts.
Remember that at returns a single value, taken from a cell
with particular index value and column, not a list.
So maybe it is enough to run:
population_India = df.set_index('country').at['India', 'population']
You don't need any list. You want to find just the population of India, a single value.
I am trying to insert or add from one dataframe to another dataframe. I am going through the original dataframe looking for certain words in one column. When I find one of these terms I want to add that row to a new dataframe.
I get the row by using.
entry = df.loc[df['A'] == item]
But when trying to add this row to another dataframe using .add, .insert, .update or other methods i just get an empty dataframe.
I have also tried adding the column to a dictionary and turning that into a dataframe but it writes data for the entire row rather than just the column value. So is there a way to add one specific row to a new dataframe from my existing variable ?
So the entry is a dataframe containing the rows you want to add?
You can simply concatenate the two dataframes using the concat function if both have the same column names:
import pandas as pd
entry = df.loc[df['A'] == item]
concat_df = pd.concat([new_df,entry])
pandas.concat reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
The append function expects a list of rows in this formation:
[row_1, row_2, ..., row_N]
where each row is a list representing the value for each column.
So, assuming you're trying to add one row, you should use:
entry = df.loc[df['A'] == item]
df2 = df2.append([entry])
Notice that unlike Python's list, the DataFrame.append function returns a new object instead of modifying the object it was called on.
Not sure how large your operations will be, but from an efficiency standpoint you're better off adding all of the found rows to a list, concatenating them together at once using pandas.concat, and then using concat again to combine the found-entries dataframe with the "insert into" dataframe. This will be much faster than calling concat on every iteration. If you're searching from a list of items search_keys, then something like:
entries = []
for i in search_keys:
    entry = df.loc[df['A'] == i]  # use the loop variable, not item
    entries.append(entry)
found_df = pd.concat(entries)
result_df = pd.concat([old_df, found_df])
I'm trying to remove the percent sign after a value in a pandas dataframe, relevant code:
for i in loansdata:
    if i.endswith('%'):
        i = i[:-1]
I was thinking that i = i[:-1] would set the new value, but it doesn't. How do I go about it? For clarity: if I print i inside the for loop, it prints without the percent sign. But if I print the whole dataframe, it has not changed.
Use str.replace to remove a specific character from a column:
df[col] = df[col].str.replace('%','')
Depending on what loansdata actually is, you're either looping over the column names or the row values of one column.
You can't modify the row contents like that, and even if you could, you should avoid loops where a vectorised solution exists.
If % appears in multiple columns you can call the above for each column, but this method only exists for str dtypes.
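Putting it together, a minimal sketch (the column name is illustrative) that strips the sign and converts to numbers in one vectorised pass:

```python
import pandas as pd

# Toy column of percent strings standing in for loansdata
df = pd.DataFrame({"int_rate": ["10.5%", "7.9%", "12.0%"]})

# regex=False treats '%' as a literal character; astype(float)
# then gives a numeric column you can actually compute with.
df["int_rate"] = df["int_rate"].str.replace("%", "", regex=False).astype(float)
print(df["int_rate"].tolist())  # [10.5, 7.9, 12.0]
```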