I am trying to insert rows from one dataframe into another. I am going through the original dataframe looking for certain words in one column, and when I find one of these terms I want to add that row to a new dataframe.
I get the row by using:
entry = df.loc[df['A'] == item]
But when I try to add this row to another dataframe using .add, .insert, .update, or other methods, I just get an empty dataframe.
I have also tried adding the column to a dictionary and turning that into a dataframe, but it writes data for the entire row rather than just the column value. So is there a way to add one specific row to a new dataframe from my existing variable?
So the entry is a dataframe containing the rows you want to add?
You can simply concatenate the two dataframes using the concat function, as long as both have the same column names:
import pandas as pd
entry = df.loc[df['A'] == item]
concat_df = pd.concat([new_df, entry])
pandas.concat reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
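For example, a minimal sketch with made-up data (the column names and values here are just placeholders):

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'baz'], 'B': [1, 2, 3]})
new_df = pd.DataFrame({'A': ['qux'], 'B': [0]})  # the frame being added to

item = 'bar'
entry = df.loc[df['A'] == item]      # the matching row(s)
new_df = pd.concat([new_df, entry])  # concat returns a new frame, so reassign
print(new_df)                        # the 'qux' row followed by the 'bar' row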
The append function expects a list of rows in this format:
[row_1, row_2, ..., row_N]
where each row is a list containing the values for each column.
So, assuming you're trying to add one row, you should use:
entry = df.loc[df['A'] == item]
df2 = df2.append([entry])
Notice that unlike Python's list.append, the DataFrame.append function returns a new object rather than modifying the object it was called on.
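A minimal sketch with made-up data (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions pd.concat from the previous answer is the way to go):

import pandas as pd

df = pd.DataFrame({'A': ['x', 'y'], 'B': [1, 2]})  # hypothetical data
df2 = pd.DataFrame(columns=df.columns)

entry = df.loc[df['A'] == 'x']
df2 = df2.append([entry])  # reassign: append returns a new DataFrame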
Not sure how large your operations will be, but from an efficiency standpoint you're better off adding all of the found rows to a list, concatenating them together at once with pandas.concat, and then using concat again to combine the found-entries dataframe with the "insert into" dataframe. This is much faster than calling concat once per found row, since every concat copies all of the data. If you're searching from a list of items search_keys, then something like:
entries = []
for key in search_keys:
    entry = df.loc[df['A'] == key]
    entries.append(entry)
found_df = pd.concat(entries)
result_df = pd.concat([old_df, found_df])
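If the goal is just to pull every row whose 'A' value appears in search_keys, a vectorized isin lookup avoids the Python loop entirely (a sketch, assuming exact matches are what you want):

found_df = df[df['A'].isin(search_keys)]
result_df = pd.concat([old_df, found_df])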
I have a huge 800k-row dataframe whose key I need to match against another dataframe.
Initially I was looping over both dataframes and comparing the key values with a condition.
I was told that merge could save time, but I haven't managed to make it work.
Here is the code I'm trying to adapt:
mergeTwo = pd.read_json('merge/mergeUpdate.json')
matches = pd.read_csv('archive/matches.csv')
for indexOne, value in tqdm(mergeTwo.iterrows()):
    for index, match in matches.iterrows():
        if value["gameid"] == match["gameid"]:
            print(match)
for index, value in mergeTwo.iterrows():
    test = value.to_frame().merge(matches, on='gameid')
    print(test)
The first version works without problems.
The second one complains about an unknown key (gameid).
Does anyone have a solution?
Thanks in advance!
When you iterate over rows, your value is a Series, which the to_frame method turns into a one-column frame whose index is the original column names. So you need to transpose it to make the second approach work:
for index, value in mergeTwo.iterrows():
    # note the .T after .to_frame
    test = value.to_frame().T.merge(matches, on='gameid')
    print(test)
But iteration is a redundant tool here; merge applied to the first frame should be enough:
mergeTwo.merge(matches, on='gameid', how='left')
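For illustration, a minimal sketch with made-up data (column names taken from the question, values invented):

import pandas as pd

mergeTwo = pd.DataFrame({'gameid': [1, 2, 3], 'team': ['a', 'b', 'c']})
matches = pd.DataFrame({'gameid': [1, 3], 'winner': ['a', 'c']})

result = mergeTwo.merge(matches, on='gameid', how='left')
print(result)  # one row per game; winner is NaN where no match exists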
I have a pandas dataframe, and I have to fill a new column based on the values of an existing column by mapping them through a dictionary:
mydict={'key1':'val1', 'key2':'val2'}
df['new_col']=df['keys'].map(mydict)
Now I have a similar problem, but the dictionary is now a defaultdict(list)
my_defdict=defaultdict(list)
my_defdict={'key1':['val1','item1'], 'key2':['val2','item2']}
and I need a new column with the second element of the list, something like
df['new_col2']=df['keys'].map(my_defdict()[1])
which is of course wrong. How can I perform this operation without creating another normal dictionary?
Assuming all your values have at least two items per list, add .str[1] at the end:
df['new_col2'] = df['keys'].map(my_defdict).str[1]
Or,
df['new_col2'] = df['keys'].map(my_defdict).str.get(1)
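For example, a minimal sketch with the dictionary from the question and a made-up frame:

import pandas as pd
from collections import defaultdict

my_defdict = defaultdict(list, {'key1': ['val1', 'item1'], 'key2': ['val2', 'item2']})
df = pd.DataFrame({'keys': ['key1', 'key2', 'key1']})  # hypothetical data

df['new_col2'] = df['keys'].map(my_defdict).str[1]
print(df['new_col2'].tolist())  # ['item1', 'item2', 'item1']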
I've got a dataframe of analysed news articles, with each row referencing an article and columns holding some information about that article (e.g. tone).
One column of that df contains a list of FIPS country codes for the locations mentioned in the article.
I want to "extract" these country codes so that I get a dataframe in which each mentioned location has its own row, together with the other columns of the original row in which that location was referenced (there will be multiple rows with the same information but different locations, as one article may mention several locations).
I tried something like this, but iterrows() is notoriously slow, so is there any faster/more efficient way for me to do this?
Thanks a lot.
# 'events' is the column that contains the locations.
# 'event_cols' are the columns from the original df that I want to retain in the new df.
# 'df_events' is the new dataframe.
for i, row in df.iterrows():
    for location in df.events.loc[i]:
        try:
            df_storage = pd.DataFrame(row[event_cols]).T
            df_storage['loc'] = location
            df_events = df_events.append(df_storage)
        except ValueError as e:
            continue
I would group the DataFrame with groupby(), explode the lists with a combination of apply and a lambda function, and then reset the index and drop the generated level column to clean up the resulting DataFrame:
df_events = df.groupby(['event_col1', 'event_col2', 'event_col3'])['events']\
              .apply(lambda x: pd.DataFrame(x.values[0]))\
              .reset_index().drop('level_3', axis=1)
In general, I always try to find a way to use apply() before most other methods, because it is often much faster than iterating over each row.
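As an alternative, if you are on pandas 0.25 or newer, DataFrame.explode does exactly this kind of list-to-rows expansion directly (a sketch using the names from the question; event_cols is assumed to be the list of columns to keep, not including 'events'):

df_events = (df[event_cols + ['events']]
             .explode('events')
             .rename(columns={'events': 'loc'}))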
I have a JSON object inside a pandas dataframe column, which I want to pull apart and put into other columns. In the dataframe, the JSON object looks like a string containing an array of dictionaries. The array can be of variable length, including zero, or the column can even be null. I've written some code, shown below, which does what I want. The column names are built from two components, the first being the keys in the dictionaries, and the second being a substring from a key value in the dictionary.
This code works okay, but it is very slow when running on a big dataframe. Can anyone offer a faster (and probably simpler) way to do this? Also, feel free to pick holes in what I have done if you see something which is not sensible/efficient/pythonic. I'm still a relative beginner. Thanks heaps.
# Import libraries
import pandas as pd
from IPython.display import display # Used to display df's nicely in jupyter notebook.
import json
# Set some display options
pd.set_option('max_colwidth',150)
# Create the example dataframe
print("Original df:")
df = pd.DataFrame.from_dict({'ColA': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},\
'ColB': {0: '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',\
1: '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',\
2: '[{"key":"keyValue=4","valA":"48","valC":"58"}]',\
3: '[]',\
4: None}})
display(df)
# Create a temporary dataframe to append results to, record by record
dfTemp = pd.DataFrame()
# Step through all rows in the dataframe
for i in range(df.shape[0]):
    # Check whether the record is null or doesn't contain any real data
    if pd.notnull(df.iloc[i, df.columns.get_loc("ColB")]) and len(df.iloc[i, df.columns.get_loc("ColB")]) > 2:
        # Convert the JSON structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(df.iloc[i, df.columns.get_loc("ColB")])
        # The last part of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda x: x.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # Give the single record the same index number as the parent dataframe (for the merge to work)
        y.index = [df.index[i]]
        # Append this dataframe on sequentially for each row as we go through the loop
        dfTemp = dfTemp.append(y)

# Merge the new dataframe back onto the original one as extra columns, with index matching the original dataframe
df = pd.merge(df, dfTemp, how='left', left_index=True, right_index=True)
print("Processed df:")
display(df)
First, a general piece of advice about pandas. If you find yourself iterating over the rows of a dataframe, you are most likely doing it wrong.
With this in mind, we can re-write your current procedure using pandas' apply method (this will likely speed things up to begin with, as it means far fewer index lookups on the df):
def do_the_thing(row):
    # Check whether the record is null or doesn't contain any real data
    if pd.notnull(row) and len(row) > 2:
        # Convert the JSON structure into a dataframe
        x = pd.read_json(row)
        # The last part of this string (after the last =) is used as a key for the column labels
        x['key'] = x['key'].apply(lambda x: x.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # No need to re-index or append to a temp df:
        # apply sorts out the index alignment for us
        return y.iloc[0]
    else:
        return pd.Series()

df2 = df.merge(df.ColB.apply(do_the_thing), how='left', left_index=True, right_index=True)
Notice that this returns exactly the same result as before; we haven't changed the logic. The apply method sorts out the indexes, so we can just merge.
I believe that answers your question in terms of speeding it up and being a bit more idiomatic.
I think you should consider, however, what you want to do with this data structure and how you might structure it better.
Given that ColB can be of arbitrary length, you will end up with a dataframe with an arbitrary number of columns. When you come to access these values for whatever purpose, that will cause you pain, whatever the purpose is.
Are all entries in ColB important? Could you get away with keeping just the first one? Do you need to know the index of a certain valA value?
These are questions you should ask yourself, then decide on a structure which will allow you to do whatever analysis you need, without having to check a bunch of arbitrary things.
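As a rough illustration of a faster approach, here is a sketch that parses each cell once with the standard json module and builds all of the new columns in one pass (same 'valA_1'-style column names as the original code; this assumes the structure shown in the question):

import json
import pandas as pd

def parse_colb(cell):
    # Return a flat dict like {'valA_1': '8', 'valB_1': '18', ...},
    # or {} for null/empty cells
    if pd.isnull(cell):
        return {}
    out = {}
    for rec in json.loads(cell):
        suffix = rec['key'].split('=')[-1]
        for k, v in rec.items():
            if k != 'key':
                out['{}_{}'.format(k, suffix)] = v
    return out

parsed = pd.DataFrame([parse_colb(c) for c in df['ColB']], index=df.index)
df = df.join(parsed)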
Let's say I have a list of objects (in this instance, dataframes)
myList = [dataframe1, dataframe2, dataframe3 ...]
I want to loop over my list and create new objects based on the names of the list items. What I want is a pivoted version of each dataframe, called "dataframe[X]_pivot" where [X] is the identifier for that dataframe.
My pseudocode looks something like:
for d in myList:
    d + '_pivot' = d.pivot_table(index='columnA', values=['columnB'], aggfunc=np.sum)
And my desired output looks like this:
myList = [dataframe1, dataframe2 ...]
dataframe1_pivoted # contains a pivoted version of dataframe1
dataframe2_pivoted # contains a pivoted version of dataframe2
dataframe3_pivoted # contains a pivoted version of dataframe3
Help would be much appreciated.
Thanks
John
You do not want to do that. Creating variables dynamically is almost always a very bad idea. The correct thing to do is simply to use an appropriate data structure to hold your data, e.g. either a list (as your elements are all just numbered, you can just as well access them via an index) or a dictionary (if you really, really want to give a name to each individual thing):
pivoted_list = []
for df in myList:
    # the pivot operation from the question
    pivoted_df = df.pivot_table(index='columnA', values=['columnB'], aggfunc=np.sum)
    pivoted_list.append(pivoted_df)

# now access your results by index
do_something(pivoted_list[0])
do_something(pivoted_list[1])
The same thing can be expressed as a list comprehension. Assuming pivot is a function that takes a dataframe and returns the pivoted version, this is equivalent to the loop above:
pivoted_list = [pivot(df) for df in myList]
If you are certain that you want to have names for the elements, you can create a dictionary by using enumerate like this:
pivoted_dict = {}
for index, df in enumerate(myList):
    pivoted_df = df.pivot_table(index='columnA', values=['columnB'], aggfunc=np.sum)
    dfname = "dataframe{}_pivoted".format(index + 1)
    pivoted_dict[dfname] = pivoted_df

# access results by name
do_something(pivoted_dict["dataframe1_pivoted"])
do_something(pivoted_dict["dataframe2_pivoted"])
The way to achieve that is to assign into globals() under a computed name, e.g.:
globals()['dataframe1_pivoted'] = dataframe1.pivot_table(...)
[edit] After looking at your edit, I see that you may want to do something like this:
for i, d in enumerate(myList):
    globals()['dataframe%d_pivoted' % (i + 1)] = d.pivot_table(...)
However, as others have suggested, it is inadvisable to do so if it is going to create lots of global variables.
There are better ways (read: data structures) to do so.