Get result of merge in loop for - python

I have a huge 800k row dataframe which I need to find the key with another dataframe.
Initially I was looping through my 2 dataframes with a loop and checking the value of the keys with a condition.
I was told about the possibility of using merge to save time. However, no way to make it work :(
Overall, here's the code I'm trying to adapt:
mergeTwo = pd.read_json('merge/mergeUpdate.json')
matches = pd.read_csv('archive/matches.csv')
for indexOne,value in tqdm(mergeTwo.iterrows()):
for index, match in matches.iterrows():
if value["gameid"] == match["gameid"]:
print(match)
for index, value in mergeTwo.iterrows():
test = value.to_frame().merge(matches, on='gameid')
print(test)
In my first case, my code works without worries.
In the second, this one tells me a problem of not known key (gameid)
Anyone got a solution?
Thanks in advance !

When you iterate over rows, your value is a Series which is transformed into a one-column frame by to_frame method with the original column names as its index. So you need to transpose it to make the second way work:
for index, value in mergeTwo.iterrows():
# note .T after .to_frame
test = value.to_frame().T.merge(matches, on='gameid')
print(test)
But iteration is a redundant tool, merge applied to the first frame should be enough:
mergeTwo.merge(matches, on='gameid', how='left')

Related

DataFrame.at function not working after copying dataframe

I'm using the .at function to try and save all columns under one header in a list.
The file contains entries for country and population.
df = pandas.read_csv("file.csv")
population_list = []
df2 = df[df['country'] == "India"]
for i in range(len(df2)):
population_list = df2.at[i, 'population']
This is throwing a KeyError. However, the df.at seems to be working fine for the original dataframe. Is .at just not allowed in this case?
IIUC, you don't need to loop over your dataframe to get what you need. Simply use:
population_list = df2["population"].tolist()
If you really want to use the loop (not recommended when unnecessary), note that the index has likely changed after your filter, i.e not consecutive integers.
Try:
for i in df2.index:
population_list.append(df2.at[i, 'population'])
Note: In your code you keep trying to reassign the entire list to one value instead of appending.
In at you pass the index value and the column name.
In the case of the "original" DataFrame all is OK, because probably the index contains
consecutive values starting from 0.
But when you run df2 = df[df['country'] == "India"] then df2 contains
only a subset of original rows, so the index does not contain consecutive numbers.
One of possible solutions is to run reset_index() on df2.
Then the index will again contain consecutive numbers and your code should raise no exception.
Edit
But your code raises other doubts.
Remember that at returns a single value, taken from a cell
with particular index value and column, not a list.
So maybe it is enough to run:
population_India = df.set_index('country').at['India', 'population']
You don't need any list. You want to find just the popupation of India, a single value.

How to append rows to a Pandas dataframe, and have it turn multiple overlapping cells (with the same index) into a single value, instead of a series?

I am appending different dataframes to make one set. Occasionally, some values have the same index, so it stores the value as a series. Is there a quick way within Pandas to just overwrite the value instead of storing all the values as a series?
You weren't very clear guy. If you want to resolve the duplicated indexes problem, probably the pd.Dataframe.reset_index() method will be enough. But, if you have duplicate rows when you concat the Dataframes, just use the pd.DataFrame.drop_duplicates() method. Else, share a bit of your code with or be clearer.
I'm not sure that the code below is what you're searching.
we say two dataframes, one columns, the same index and different values. and you wanna overwrite the value in one dataframe with the other. you can do it with a simple loop with iloc indexer.
import pandas as pd
df_1 = pd.DataFrame({'col_1':['a','b','c','d']})
df_2 = pd.DataFrame({'col_1':['q','w','e','r']})
rows = df_1.shape[0]
for idx in range(rows):
df_1['col_1'].iloc[idx] = df_2['col_2'].iloc[idx]
Then, you check the df_1. you should get that:
df_1
col_1
0 q
1 w
2 e
3 r
Whatever the response is what you want, let me know so I can help you.

adding row from one dataframe to another

I am trying to insert or add from one dataframe to another dataframe. I am going through the original dataframe looking for certain words in one column. When I find one of these terms I want to add that row to a new dataframe.
I get the row by using.
entry = df.loc[df['A'] == item]
But when trying to add this row to another dataframe using .add, .insert, .update or other methods i just get an empty dataframe.
I have also tried adding the column to a dictionary and turning that into a dataframe but it writes data for the entire row rather than just the column value. So is there a way to add one specific row to a new dataframe from my existing variable ?
So the entry is a dataframe containing the rows you want to add?
you can simply concatenate two dataframe using concat function if both have the same columns' name
import pandas as pd
entry = df.loc[df['A'] == item]
concat_df = pd.concat([new_df,entry])
pandas.concat reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
The append function expect a list of rows in this formation:
[row_1, row_2, ..., row_N]
While each row is a list, representing the value for each columns
So, assuming your trying to add one row, you shuld use:
entry = df.loc[df['A'] == item]
df2=df2.append( [entry] )
Notice that unlike python's list, the DataFrame.append function returning a new object and not changing the object called it.
See also enter link description here
Not sure how large your operations will be, but from an efficiency standpoint, you're better off adding all of the found rows to a list, and then concatenating them together at once using pandas.concat, and then using concat again to combine the found entries dataframe with the "insert into" dataframe. This will be much faster than using concat each time. If you're searching from a list of items search_keys, then something like:
entries = []
for i in search_keys:
entry = df.loc[df['A'] == item]
entries.append(entry)
found_df = pd.concat(entries)
result_df = pd.concat([old_df, found_df])

How to limit Pandas loc selection

I search Pandas DataFrame by loc -for example like this
x = df.loc[df.index.isin(['one','two'])]
But I need only the first row of the result. If I use
x = df.loc[df.index.isin(['one','two'])].iloc[0]
I get error in the case that no row is found. Of course, I can select all the rows (the first example) and then check if result is empty or not. But I seek some more efficient way (the dataframe can be long). Is there any?
pandas.Index.duplicated
The pandas.Index object has a duplicated method that identifies all repeated values after the first occurance.
x[~x.index.duplicated()]
If you wanted to ...
df[df.index.isin(['one', 'two']) & ~df.index.duplicated()]

Splitting Pandas Dataframe with groupby and last

I am working with a pandas dataframe where i want to group by one column, grab the last row of each group (creating a new dataframe), and then drop those rows from the original.
I've done a lot of reading and testing, and it seems that I can't do that as easily as I'd hoped. I can do a kludgy solution, but it seems inefficient and, well, kludgy.
Here's pseudocode for what I wanted to do:
df = pd.DataFrame
last_lines = df.groupby('id').last()
df.drop(last_lines.index)
creating the last_lines dataframe is fine, it's dropping those rows from the original df that's an issue. the problem is that the original index (from df) is disconnected when last_lines is created. i looked at filter and transform, but neither seems to address this problem. is there a good way to split the dataframe into two pieces based on position?
my kludge solution is to iterate over the group iterator and create a list of indexes, then drop those.
grouped = df.groupby('id')
idx_to_remove = []
for _, group in grouped:
idx_to_remove.append(group.tail(1).index[0])
df.drop(idx_to_remove)
Better suggestions?
If you use .reset_index() first, you'll get the index as a column and you can use .last() on that to get the indices you want.
last_lines = df.reset_index().groupby('A').index.last()
df.drop(last_lines)
Here the index is accessed as .index because "index" is the default name given to this column when you use reset_index. If your index has a name, you'll use that instead.
You can also "manually" grab the last index by using .apply():
last_lines = d.groupby('A').apply(lambda g: g.index[-1])
You'll probably have to do it this way if you're using a MultiIndex (since in that case using .reset_index() would add multiple columns that can't easily be combined back into indices to drop).
Try:
df.groupby('A').apply(lambda x: x.iloc[:-1, :])

Categories

Resources