I have a csv file with many columns but for simplicity I am explaining the problem using only 3 columns. The column names are 'user', 'A' and 'B'. I have read the file using the read_csv function in pandas. The data is stored as a data frame.
Now I want to remove some rows in this dataframe based on their values. So if value in column A is not equal to a and column B is not equal to b I want to skip those user rows.
The problem is I want to dynamically create a dataframe to which I can append one row at a time. Also I do not know the number of rows that there would be. Therefore, I cannot specify the index when defining the dataframe.
I am using the following code:
import pandas as pd
header = ['user', 'A', 'B']
userdata = pd.read_csv('.../path/to/file.csv', sep='\t', usecols=header)
df = pd.DataFrame(columns=header)
for index, row in userdata.iterrows():
    if row['A'] != 'a' and row['B'] != 'b':
        data = {'user': row['user'], 'A': row['A'], 'B': row['B']}
        df.append(data, ignore_index=True)
The 'data' is being populated properly but I am not able to append. At the end, df comes to be empty.
Any help would be appreciated.
Thank you in advance.
Regarding your immediate problem, append() doesn't modify the DataFrame; it returns a new one. So you would have to reassign df via:
df = df.append(data,ignore_index=True)
But a better solution would be to avoid iteration altogether and simply query for the rows you want. For example:
df = userdata.query('A != "a" and B != "b"')
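As a side note, DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the reassignment fix only works on older versions. The sketch below shows two alternatives that run on current pandas: a boolean mask, and collecting rows in a plain list before building the DataFrame once. The sample data is made up for illustration and stands in for the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for the data read from the CSV file.
userdata = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u4"],
    "A": ["a", "x", "y", "a"],
    "B": ["b", "b", "z", "w"],
})

# Boolean mask: keep rows where A is not 'a' and B is not 'b'.
mask = (userdata["A"] != "a") & (userdata["B"] != "b")
filtered = userdata[mask]

# If rows really must be gathered one at a time, collect plain dicts in a
# list and build the DataFrame once at the end.
rows = []
for _, row in userdata.iterrows():
    if row["A"] != "a" and row["B"] != "b":
        rows.append({"user": row["user"], "A": row["A"], "B": row["B"]})
collected = pd.DataFrame(rows, columns=["user", "A", "B"])

print(filtered)
print(collected)
```

Building the frame once avoids copying the whole DataFrame on every iteration, which is what repeated appends do.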
I have a DataFrame which has a few columns. There is a column with a value that only appears once in the entire dataframe. I want to write a function that returns the column name of the column with that specific value. I can manually find which column it is with the usual data exploration, but since I have multiple dataframes with the same properties, I need to be able to find that column for multiple dataframes. So a somewhat generalized function would be of better use.
The problem is that I don't know beforehand which column is the one I am looking for since in every dataframe the position of that particular column with that particular value is different. Also the desired columns in different dataframes have different names, so I cannot use something like df['my_column'] to extract the column.
Thanks
You'll need to iterate columns and look for the value:
def find_col_with_value(df, value):
    for col in df:
        if (df[col] == value).any():
            return col
This will return the name of the first column that contains value. If value does not exist, it will return None.
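A quick usage sketch of the function above, on a small made-up DataFrame (the column names and values are assumptions for illustration only):

```python
import pandas as pd

def find_col_with_value(df, value):
    # Return the name of the first column containing `value`, else None.
    for col in df:
        if (df[col] == value).any():
            return col
    return None

# Hypothetical sample data.
frame = pd.DataFrame({
    "x": [1, 2],
    "y": ["foo", "secret_value"],
    "z": [3.0, 4.0],
})

found = find_col_with_value(frame, "secret_value")
missing = find_col_with_value(frame, "absent")
print(found, missing)
```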
Compare the entire DataFrame against the specific value, use any() to see whether it ever appears in each column, then use the result to select the column labels (or index the DataFrame with them if you want the Series):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.normal(0, 5, (100, 200)),
                  columns=[chr(i+40) for i in range(200)])
df.loc[5, 'Y'] = 'secret_value' # Secret value in column 'Y'
df.eq('secret_value').any().loc[lambda x: x].index
# or
df.columns[df.eq('secret_value').any()]
Index(['Y'], dtype='object')
I have another solution:
names = ds.columns
for i in names:
    for j in ds[i]:
        if j == 'your_value':
            print(i)
            break
Here you collect all the column names and then iterate over the values in each column until the value is found, at which point the name of that column is printed.
I am appending different dataframes to make one set. Occasionally, some values have the same index, so it stores the value as a series. Is there a quick way within Pandas to just overwrite the value instead of storing all the values as a series?
The question isn't very clear. If you want to resolve the duplicated-index problem, the pd.DataFrame.reset_index() method will probably be enough. But if you get duplicate rows when you concat the DataFrames, just use the pd.DataFrame.drop_duplicates() method. Otherwise, share a bit of your code or clarify what you need.
I'm not sure the code below is what you're looking for.
Say we have two dataframes with one column each, the same index, and different values, and you want to overwrite the values in one dataframe with those from the other. You can do it with a simple loop and the iloc indexer.
import pandas as pd
df_1 = pd.DataFrame({'col_1':['a','b','c','d']})
df_2 = pd.DataFrame({'col_1':['q','w','e','r']})
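rows = df_1.shape[0]
for idx in range(rows):
    # Assign through a single iloc call (row position, column position)
    # to avoid chained assignment; both frames have only 'col_1'.
    df_1.iloc[idx, 0] = df_2.iloc[idx, 0]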
Then check df_1. You should get this:
df_1
  col_1
0     q
1     w
2     e
3     r
Let me know whether this is what you wanted, so I can help further.
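If the goal is for a later value to overwrite an earlier one that shares the same index label, one option is to concatenate first and then keep only the last occurrence of each label. This is a minimal sketch under that assumption; the frames and the column name value are made up for illustration.

```python
import pandas as pd

# Hypothetical frames that share the index label 'b'.
first = pd.DataFrame({"value": [1, 2]}, index=["a", "b"])
second = pd.DataFrame({"value": [20, 3]}, index=["b", "c"])

combined = pd.concat([first, second])

# Index.duplicated(keep="last") flags every occurrence of a label except
# the last one; negating the mask keeps only the most recent row per label.
deduped = combined[~combined.index.duplicated(keep="last")]
print(deduped)
```

Using keep="first" instead would preserve the original value and discard the later ones.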
I have an array of unique elements and a dataframe.
I want to find out whether the elements in the array exist in every row of the dataframe.
P.S. I am new to Python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
            pass
Also, this way of iterating is quite expensive, is there any better way to do the same?
Thanks in Advance.
Pandas allows you to filter a whole column as if it were an Excel sheet:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can combine several filters using the logical operators & (AND) and | (OR).
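A short sketch of combining two filters. The sample data and the threshold on "Column2" are assumptions for illustration; note that each condition needs its own parentheses because & and | bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical table data.
df = pd.DataFrame({
    "Column1": ["ValueToFind", "other", "ValueToFind"],
    "Column2": [1, 5, 10],
})

# AND: both conditions must hold for a row to be kept.
both = df[(df["Column1"] == "ValueToFind") & (df["Column2"] > 3)]

# OR: a row is kept if at least one condition holds.
either = df[(df["Column1"] == "ValueToFind") | (df["Column2"] > 3)]

print(both)
print(either)
```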
You can try the following (string matching on a column goes through the .str accessor):
for i in uniqueArray:
    if newDF['MKT'].str.contains(i).any():
        # do your task
        pass
You can use the isin() method of a pd.Series object.
Assume you have a data frame named df and you want to check whether your column 'MKT' contains any items of your uniqueArray.
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will contain only the rows where the value of MKT is one of the elements of uniqueArray.
Now do your things on new_df, and join/merge/concat to the former df as you wish.
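The question asks whether each element occurs in every row, which the answers above do not check directly. A minimal sketch, assuming MKT holds strings and that a substring match is what is wanted (as in the question's `i in row['MKT']`); the sample values are made up:

```python
import pandas as pd

# Hypothetical data; 'MKT' is assumed to hold strings.
newDF = pd.DataFrame({"MKT": ["US,EU", "EU,ASIA", "EU"]})
uniqueArray = ["US", "EU", "ASIA"]

# For each element, test every row at once and require all rows to match.
# regex=False treats the element as a literal substring.
in_all_rows = {
    item: bool(newDF["MKT"].str.contains(item, regex=False).all())
    for item in uniqueArray
}
print(in_all_rows)
```

This still loops over the elements, but the per-row work is vectorized, which avoids the nested iterrows loop.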
I am trying to insert or add from one dataframe to another dataframe. I am going through the original dataframe looking for certain words in one column. When I find one of these terms I want to add that row to a new dataframe.
I get the row by using.
entry = df.loc[df['A'] == item]
But when I try to add this row to another dataframe using .add, .insert, .update or other methods, I just get an empty dataframe.
I have also tried adding the column to a dictionary and turning that into a dataframe but it writes data for the entire row rather than just the column value. So is there a way to add one specific row to a new dataframe from my existing variable ?
So the entry is a dataframe containing the rows you want to add?
You can simply concatenate the two dataframes using the concat function, provided both have the same column names:
import pandas as pd
entry = df.loc[df['A'] == item]
concat_df = pd.concat([new_df,entry])
pandas.concat reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
The append function accepts a list of rows in this form:
[row_1, row_2, ..., row_N]
where each row holds the values for each column.
So, assuming you're trying to add one row, you should use:
entry = df.loc[df['A'] == item]
df2=df2.append( [entry] )
Notice that, unlike Python's list.append, DataFrame.append returns a new object and does not change the object it is called on.
Not sure how large your operations will be, but from an efficiency standpoint, you're better off adding all of the found rows to a list, and then concatenating them together at once using pandas.concat, and then using concat again to combine the found entries dataframe with the "insert into" dataframe. This will be much faster than using concat each time. If you're searching from a list of items search_keys, then something like:
entries = []
for item in search_keys:
    entry = df.loc[df['A'] == item]
    entries.append(entry)
found_df = pd.concat(entries)
result_df = pd.concat([old_df, found_df])
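When the match is an exact equality test against a list of keys, the loop can likely be replaced by a single isin() filter. This is a sketch under that assumption; the data, old_df, and search_keys values are made up for illustration.

```python
import pandas as pd

# Hypothetical source frame, target frame, and search terms.
df = pd.DataFrame({"A": ["x", "y", "z", "x"], "B": [1, 2, 3, 4]})
old_df = pd.DataFrame({"A": ["w"], "B": [0]})
search_keys = ["x", "z"]

# One vectorized pass selects every row whose 'A' value is a search key.
found_df = df[df["A"].isin(search_keys)]

# ignore_index=True renumbers the rows of the combined frame from 0.
result_df = pd.concat([old_df, found_df], ignore_index=True)
print(result_df)
```

Unlike the loop, this keeps the matched rows in their original order rather than grouping them by search key.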
When I read in a CSV, I can say pd.read_csv('my.csv', index_col=3) and it sets the fourth column (index_col is zero-based) as the index.
How can I do the same if I have a pandas dataframe in memory? And how can I say to use the first row also as an index? The first column and row are strings, rest of the matrix is integer.
You can try this regardless of the number of rows
df = pd.read_csv('data.csv', index_col=0)
Making the first (or n-th) column the index in increasing order of verboseness:
df.set_index(list(df)[0])
df.set_index(df.columns[0])
df.set_index(df.columns.tolist()[0])
Making the first (or n-th) row the index:
df.set_index(df.iloc[0].values)
You can use both if you want a multi-level index:
df.set_index([df.iloc[0], df.columns[0]])
Observe that using a column as index will automatically drop it as column. Using a row as index is just a copy operation and won't drop the row from the DataFrame.
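One caveat: passing an array such as df.iloc[0].values to set_index only works when its length equals the number of rows, so it fails for non-square data. If "first row as index" means using the first row as column labels, a sketch like the following may be closer to what the question asks for. The sample data is an assumption: strings in the first row and column, integers elsewhere.

```python
import pandas as pd

# Hypothetical in-memory frame read without header or index.
df = pd.DataFrame([
    ["name", "c1", "c2"],
    ["r1", 1, 2],
    ["r2", 3, 4],
])

# Promote the first row to column labels, then drop it from the data.
df.columns = df.iloc[0]
df = df.iloc[1:]

# Use the first column as the row index (this removes it as a column).
df = df.set_index(df.columns[0])

# The remaining cells were stored as generic objects; restore integers.
df = df.astype(int)
print(df)
```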
Maybe try set_index()?
df = df.set_index([2])
Maybe try df = pd.read_csv(header = 0)