Matching Pandas DataFrame Column Values with another DataFrame Column - python

country = []
for i in df_temp['Customer Name'].iloc[:]:
    if i in gui_broker['EXACT_DDI_CUSTOMER_NAME'].tolist():
        country.append(gui_broker["Book"].values[gui_broker['EXACT_DDI_CUSTOMER_NAME'].tolist().index(i)])
    else:
        country.append("No Book Defined")
df_temp["Country"] = country
I currently have a large DataFrame (df_temp) with one column ('Customer Name') and am trying to match it against a small DataFrame (gui_broker), which has 3 columns, one of which ('EXACT_DDI_CUSTOMER_NAME') contains all the unique values of the large DataFrame.
After matching each row of df_temp I want to create a new column in df_temp with the corresponding 'Book' value from the small DataFrame (gui_broker). I tried every apply/lambda approach I could think of, but I'm out of ideas. The code above gives me a working solution, but it's slow and not very Pandas-like...
How exactly should I proceed?

You can use pandas merge to do that, like this:
df_temp = df_temp.merge(gui_broker[['EXACT_DDI_CUSTOMER_NAME','Book']], left_on='Customer Name', right_on='EXACT_DDI_CUSTOMER_NAME', how='left' )
df_temp['Book'] = df_temp['Book'].fillna('No Book Defined')
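Note that the merge keeps the right-hand key column as well; if you don't need it afterwards, you can drop it:
df_temp = df_temp.drop(columns=['EXACT_DDI_CUSTOMER_NAME'])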

Looks like you are looking for join (docs are here)
It joins one DataFrame with another by matching the selected column(s) in the first with the index of the second.
So
df_temp.join(gui_broker
                 .set_index("EXACT_DDI_CUSTOMER_NAME")
                 .loc[:, ["Book"]],
             on="Customer Name")

I believe this should do it, using map to map the Book column of gui_broker (indexed by EXACT_DDI_CUSTOMER_NAME) onto Customer Name in df_temp:
df_temp['Country'] = (df_temp['Customer Name']
                          .map(gui_broker.set_index('EXACT_DDI_CUSTOMER_NAME').Book)
                          .fillna('No Book Defined'))
Though I would need some example data to test it with!

Related

Replace from database with python

I'm stuck with this issue:
I want to replace each value in one column of a csv file with an id.
I have vehicle names and ids in the database:
In the csv file this column looks like this:
I was thinking to use pandas, to make a replacement:
df = pd.read_csv(file).replace('ALFA ROMEO 147 (937), 10.04 - 05.10', '0')
But writing replace 2000+ times like that would be the wrong way to do it.
So, how can I use the names from the db and replace them with the correct ids?
A possible solution is to merge the second dataset with the first one:
After reading the two datasets (df1, the one from the csv file, and df2, the one with vehicle_id):
df1.merge(df2, how='left', on='vehicle')
So that the final output will be a dataset with columns:
id, vehicle, vehicle_id
Imagine df1 with columns id and vehicle, and df2 with columns vehicle and vehicle_id; the result will contain all three columns, with each row of df1 getting its matching vehicle_id.
Here you can find the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
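For illustration, a minimal sketch of that merge with made-up data (the second vehicle name and the ids here are hypothetical):
import pandas as pd

# Hypothetical stand-ins for the csv file (df1) and the database table (df2)
df1 = pd.DataFrame({'id': [1, 2],
                    'vehicle': ['ALFA ROMEO 147 (937), 10.04 - 05.10', 'SOME OTHER VEHICLE']})
df2 = pd.DataFrame({'vehicle': ['ALFA ROMEO 147 (937), 10.04 - 05.10', 'SOME OTHER VEHICLE'],
                    'vehicle_id': ['0', '1']})

merged = df1.merge(df2, how='left', on='vehicle')
# merged now has columns: id, vehicle, vehicle_id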

Combining two pandas dataframes based on column AND row VALUES

First, I haven't found this asked before - probably because I'm not using the right words to ask it. So if it has been asked, please send me in that direction.
How can I combine two pandas data frames based on column AND row? My main dataframe has a column 'years' and a column 'county', among others. Ideally, I want to add another column 'percent' from the second data frame below.
For example, I have this image of my first df:
and I have another data frame with the same 'year' column and every other column name is a string value in the original "main" dataframe's 'county' column:
How can I combine these two data frames in a way that adds another column to the 'main df'? It would be helpful to first put the second data frame in the format where there are three columns: 'year', 'county', and 'percent'. If anyone can help me with this part, I can merge it.
I think what you will want to do is transform the second dataframe to have a row for each year/county combination, and then use a left join to combine the two. I believe the melt method will do this transformation. Try this:
melted_second_df = second_df.melt(id_vars=["year"], var_name="county", value_name="percent")
combined_df = first_df.merge(
    right=melted_second_df,
    on=["year", "county"],
    how="left"
)
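To see what the reshaping does, here is a tiny sketch with hypothetical data (two counties, two years; the names and numbers are made up):
import pandas as pd

# Hypothetical wide-format second_df: one column per county
second_df = pd.DataFrame({'year': [2000, 2001],
                          'CountyA': [0.1, 0.2],
                          'CountyB': [0.3, 0.4]})

melted = second_df.melt(id_vars=["year"], var_name="county", value_name="percent")
# melted has columns year, county, percent -- one row per year/county pair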

pandas max function results in inoperable DataFrame

I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the result is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or better yet, has a better way to generate this new DataFrame?
It is Series, for one column DataFrame use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
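If you would rather keep a Series but give it a usable name, Series.rename also works; a small sketch ('maximum' is just an example name):
s = df1.max(axis=1).rename('maximum')  # named Series
df2 = s.to_frame()                     # convert to a one-column DataFrame if needed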

How to find if a value exists in all rows of a dataframe?

I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in all the rows of the dataframe.
p.s- I am new to python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive; is there a better way to do the same?
Thanks in Advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2"... etc.
df2 = df[ df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can combine several filters with AND/OR logical operators.
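For example, combining two conditions with & (AND) or | (OR) looks like this (the second column name and threshold are hypothetical):
# Parentheses around each condition are required
df3 = df[(df["Column1"] == "ValueToFind") & (df["Column2"] > 10)]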
You can try:
for i in uniqueArray:
    if newDF['MKT'].str.contains(i).any():
        # do your task
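Since the goal is to check whether an element appears in every row, you could also aggregate with all() instead of any(); a minimal sketch, assuming the 'MKT' column holds strings:
# Which elements of uniqueArray occur in every row of newDF['MKT']?
in_all_rows = {i: newDF['MKT'].str.contains(i, regex=False).all() for i in uniqueArray}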
You can use the isin() method of the pd.Series object.
Assuming you have a data frame named df, you can check whether your column 'MKT' includes any items of your uniqueArray.
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT is contained in uniqueArray.
Now do your things on new_df, and join/merge/concat to the former df as you wish.

Pandas Column of Lists to Separate Rows

I've got a dataframe that contains analysed news articles w/ each row referencing an article and columns w/ some information about that article (e.g. tone).
One column of that df contains a list of FIPS country codes of the locations that were mentioned in that article.
I want to "extract" these country codes such that I get a dataframe in which each mentioned location has its own row, along with the other columns of the original row in which that location was referenced (there will be multiple rows with the same information, but different locations, as the same article may mention multiple locations).
I tried something like this, but iterrows() is notoriously slow, so is there any faster/more efficient way for me to do this?
Thanks a lot.
'events' is the column that contains the locations
'event_cols' are the columns from the original df that I want to retain in the new df.
'df_events' is the new data frame
for i, row in df.iterrows():
    for location in df.events.loc[i]:
        try:
            df_storage = pd.DataFrame(row[event_cols]).T
            df_storage['loc'] = location
            df_events = df_events.append(df_storage)
        except ValueError as e:
            continue
I would group the DataFrame with groupby(), explode the lists with a combination of apply and a lambda function, and then reset the index and drop the level column that is created to clean up the resulting DataFrame.
df_events = df.groupby(['event_col1', 'event_col2', 'event_col3'])['events']\
              .apply(lambda x: pd.DataFrame(x.values[0]))\
              .reset_index().drop('level_3', axis=1)
In general, I always try to find a way to use apply() before most other methods, because it is often much faster than iterating over each row.
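As a side note, newer pandas versions (0.25+) also have DataFrame.explode, which turns a column of lists into one row per element directly; a minimal sketch, assuming 'events' holds the lists and event_cols is the list of columns to keep:
# One row per location, duplicating the other columns of the original row
df_events = (df[event_cols + ['events']]
             .explode('events')
             .rename(columns={'events': 'loc'}))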
