Replace values from a database with Python

I'm stuck on this issue:
I want to replace each value in one CSV column with an id.
I have vehicle names and ids in the database; in the CSV file the column holds the full vehicle names.
I was thinking of using pandas to make the replacement:
df = pd.read_csv(file).replace('ALFA ROMEO 147 (937), 10.04 - 05.10', '0')
But writing replace like this 2000+ times is clearly the wrong approach.
So how can I take the names from the db and replace them with the correct ids?

A possible solution is to merge the second dataset with the first one:
After reading the two datasets (df1, the one from the csv file, and df2, the one with vehicle_id):
df1.merge(df2, how='left', on='vehicle')
So that the final output will be a dataset with columns:
id, vehicle, vehicle_id
Imagine df1 as a small table with id and vehicle, and df2 as the vehicle-to-vehicle_id mapping; the merged result will carry the columns of both.
Here you can find the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
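Put together, a minimal sketch of the whole flow (the file name and the inline mapping are assumptions; in practice df2 could come from the database, for example via pd.read_sql):

import pandas as pd

# hypothetical file name; df1 has a 'vehicle' column with the names
df1 = pd.read_csv('data.csv')
# name -> id mapping pulled from the database; built inline here just for the sketch
df2 = pd.DataFrame({'vehicle': ['ALFA ROMEO 147 (937), 10.04 - 05.10'],
                    'vehicle_id': [0]})

out = df1.merge(df2, how='left', on='vehicle')
# if the goal is to keep the ids instead of the names:
out = out.drop(columns=['vehicle']).rename(columns={'vehicle_id': 'vehicle'})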

Related

Combining two dataframes with different rows, keeping contents of first dataframe on all rows

Good day All,
I have two data frames that needs to be merged which is a little different to the ones I found so far and could not get it working. What I am currently getting, which I am sure is to do with the index, as dataframe 1 only has 1 record. I need to copy the contents of dataframe one into new columns of dataframe 2 for all rows.
I have tried merge, append, reset index etc...
DF 1: (shown as a screenshot)
DF 2: (shown as a screenshot)
Output Requirement: (shown as a screenshot)
Any suggestions would be highly appreciated
Update:
I got it to work using the statements below. Is there a more dynamic way than specifying the column names?
mod_df['Type'] = mod_df['Type'].fillna(method="ffill")
mod_df['Date'] = mod_df['Date'].fillna(method="ffill")
mod_df['Version'] = mod_df['Version'].fillna(method="ffill")
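On the "more dynamic way" question: one option (a sketch) is to forward-fill a list of columns defined once, or the whole frame at once:

cols = ['Type', 'Date', 'Version']
mod_df[cols] = mod_df[cols].ffill()
# or, if every column should be forward-filled:
mod_df = mod_df.ffill()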
Assuming you have a single row in df1, use a cross merge:
out = df2.merge(df1, how='cross')
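For illustration, a tiny made-up example of the broadcast (how='cross' needs pandas >= 1.2; all values hypothetical):

import pandas as pd

df1 = pd.DataFrame({'Type': ['A'], 'Date': ['2021-01-01'], 'Version': [1]})  # one record
df2 = pd.DataFrame({'Value': [10, 20, 30]})

out = df2.merge(df1, how='cross')
# every row of df2 now carries df1's Type/Date/Version values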

Split groups in a table into tables of its sub-groups

I have a table that is already grouped by the first column. I would like to split the table into sub-tables containing only the corresponding second column. I would like to use pandas or something else in Python; I am not keen on using "awk" because that would require "subprocess" or "os". In the end I actually only need the entries in the second column, separated according to the first. The table can be about 10000 rows x 6 columns.
These are similar posts that I found, but I could not figure out how to modify them for my purpose.
Split pandas dataframe based on groupby
Splitting groupby() in pandas into smaller groups and combining them
The table/dataframe that I have looks like this:
P0A910 sp|A0A2C5WRC3| 84.136 0.0 100
P0A910 sp|A0A068Z9R6| 73.816 0.0 99
Q9HVD1 sp|A0A2G2MK84| 37.288 4.03e-34 99
Q9HVD1 sp|A0A1H2GM32| 40.571 6.86e-32 98
P09169 sp|A0A379DR81| 52.848 2.92e-117 99
P09169 sp|A0A127L436| 49.524 2.15e-108 98
And I would like it to be split like the following:
group1:
P0A910 A0A2C5WRC3
P0A910 A0A068Z9R6
group2:
Q9HVD1 A0A2G2MK84
Q9HVD1 A0A1H2GM32
group3:
P09169 A0A379DR81
P09169 A0A127L436
OR into lists
P0A910:
A0A2C5WRC3
A0A068Z9R6
Q9HVD1:
A0A2G2MK84
A0A1H2GM32
P09169:
A0A379DR81
A0A127L436
So your problem is really just how to separate the strings. Is this what you want:
new_col = df[1].str[3:-1]
list(new_col.groupby(df[0]))
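And to get the "OR into lists" shape from the question, a dict comprehension over the same groupby works (a sketch):

# keys are the values of the first column; values are lists of the stripped ids
lists = {key: grp.tolist() for key, grp in new_col.groupby(df[0])}
# e.g. lists['P0A910'] -> ['A0A2C5WRC3', 'A0A068Z9R6']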
So I managed to get a solution of some sort. In this solution I removed the prefixes in the second column and used groupby in pandas to group the entries by the first column. Then I looped through the groups and wrote each one to a separate csv file. I took help from @Quang's answer and this link. It could probably be done in better ways, but here is my code:
import pandas as pd

# read the .csv as a dataframe
data = pd.read_csv("BlastOut.csv")

# truncate the sp|...| wrapper from the second column (['B'])
new_col = data['B'].str[3:-1]

# replace the second column with new_col
data['B'] = new_col

# group the dataframe by the first column (['A'])
grouped = data.groupby('A')

# loop through the groups and write each one to a .csv file
# named after the group ([group_name].csv)
for group_name, group in grouped:
    group.to_csv('Out_{}.csv'.format(group_name))
Update: removed all columns except the column of interest. This is a continuation of the previous code.
import glob

# read all csv files whose filename starts with "Out_"
files = glob.glob("Out_*.csv")

# loop through all csv files
for f in files:
    df = pd.read_csv(f, index_col=0)
    # drop column "A" by title
    df.drop(["A"], axis=1, inplace=True)
    df.to_csv(f, index=False)
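For what it's worth, the two passes could be collapsed into one by writing only column B per group directly (a sketch reusing data from the first snippet):

for group_name, group in data.groupby('A'):
    group[['B']].to_csv('Out_{}.csv'.format(group_name), index=False)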

Matching Pandas DataFrame Column Values with another DataFrame Column

country = []
for i in df_temp['Customer Name'].iloc[:]:
    if i in gui_broker['EXACT_DDI_CUSTOMER_NAME'].tolist():
        country.append(gui_broker["Book"].values[gui_broker['EXACT_DDI_CUSTOMER_NAME'].tolist().index(i)])
    else:
        country.append("No Book Defined")
df_temp["Country"] = country
I currently have a large DataFrame (df_temp) with one column ('Customer Name') and am trying to match it against a small DataFrame (gui_broker) which has 3 columns, one of which holds all the unique values of the large DataFrame ('EXACT_DDI_CUSTOMER_NAME').
After matching each row of df_temp I want to create a new column in df_temp with the value of 'Book' from the small DataFrame (gui_broker), based on the match. I tried every apply/lambda method but am out of ideas. The code above gives me a solution, but it's slow and not Pandas-like...
How exactly should I proceed?
You can use pandas merge to do that, like this...
df_temp = df_temp.merge(gui_broker[['EXACT_DDI_CUSTOMER_NAME','Book']], left_on='Customer Name', right_on='EXACT_DDI_CUSTOMER_NAME', how='left' )
df_temp['Book'] = df_temp['Book'].fillna('No Book Defined')
Looks like you are looking for join (docs are here).
It joins one DataFrame with another by matching the selected column(s) in the first with the index of the second.
So
df_temp.join(gui_broker
             .set_index("EXACT_DDI_CUSTOMER_NAME")
             .loc[:, ["Book"]],
             on="Customer Name")
I believe this should do it, using map to map the Book column of gui_broker, indexed by EXACT_DDI_CUSTOMER_NAME, onto Customer Name in df_tmp:
df_tmp['Country'] = (df_tmp['Customer Name']
                     .map(gui_broker.set_index('EXACT_DDI_CUSTOMER_NAME').Book)
                     .fillna('No Book Defined'))
Though I would need some example data to test it with!
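For instance, with tiny made-up frames (all values hypothetical):

import pandas as pd

gui_broker = pd.DataFrame({'EXACT_DDI_CUSTOMER_NAME': ['Acme', 'Globex'],
                           'Book': ['EU Book', 'US Book']})
df_tmp = pd.DataFrame({'Customer Name': ['Acme', 'Initech']})

df_tmp['Country'] = (df_tmp['Customer Name']
                     .map(gui_broker.set_index('EXACT_DDI_CUSTOMER_NAME').Book)
                     .fillna('No Book Defined'))
# df_tmp['Country'] -> ['EU Book', 'No Book Defined']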

Pandas Merge - put all join-column data under one output column instead of two?

I have two CSV files, with the following schemas:
CSV1 columns:
"Id","First","Last","Email","Company"
CSV2 columns:
"PersonId","FirstName","LastName","Em","FavoriteFood"
If I load them each into a Pandas DataFrame and do newdf = df1.merge(df2, how='outer', left_on=['Last', 'First'], right_on=['LastName','FirstName'])
Then a CSV export of the joined DataFrame has a schema of:
"Id","First","Last","Email","Company","PersonId","FirstName","LastName","Em","FavoriteFood"
All the rows that were only in CSV1 have a first name printed under "First".
All the rows that were only in CSV2 have a first name printed under "FirstName".
All the rows that were in both CSV files have a first name (the exact same value, which is to be expected, since it was a "join on" value) printed under both columns.
Same problem for "Last" & "LastName".
What I'd like is an output schema more like this:
"Id","First","Last","Email","Company","PersonId","Em","FavoriteFood"
It should have all of the "first names" under the column "First" (and equivalent for "Last").
Most relational database software I'm familiar with does this (the left-side join-column names win the naming war). Does Pandas have a syntax for instructing it to do so?
I can do df1.merge(df2.rename(columns = {'LastName':'Last', 'FirstName':'First'}), how='outer', on=['Last', 'First']), but stylistically it drives me crazy to hard-code the same column names twice in my source code; it's more to fix if I change the column names in the CSV files.
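One way to avoid the duplication (a sketch): keep the rename mapping in a dict and derive the on list from it.

renames = {'LastName': 'Last', 'FirstName': 'First'}
newdf = df1.merge(df2.rename(columns=renames), how='outer',
                  on=list(renames.values()))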
One way to do it would be to just merge the same way and then drop the columns you'd like to remove:
newdf.drop(columns=['LastName', 'FirstName'], inplace=True)
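Note that with the outer merge, dropping those columns loses the names on rows that came only from CSV2; coalescing the join columns first avoids that (a sketch):

newdf['First'] = newdf['First'].fillna(newdf['FirstName'])
newdf['Last'] = newdf['Last'].fillna(newdf['LastName'])
newdf = newdf.drop(columns=['FirstName', 'LastName'])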

Compare two excel files in Pandas and return the rows which have the same values in TWO columns

I have a couple of Excel files. Both files have two common columns: Customer_Name and Customer_No. The first file has around 800k rows while the second has only 460. I want to get a dataframe containing the data common to both files, i.e. the rows from the first file whose Customer_Name and Customer_No are both found in the 2nd file. I tried using .isin, but so far I have only found examples using a single column. Thanks in advance!
Use merge:
df = pd.merge(df1, df2, on=['Customer_Name','Customer_No'])
If you have different column names use left_on and right_on:
df = pd.merge(df1,
              df2,
              left_on=['Customer_Name','Customer_No'],
              right_on=['Customer_head','Customer_Id'])
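Separately, since the question mentions .isin: a multi-column membership test is possible with a MultiIndex (a sketch; from_frame needs pandas >= 0.24):

keys2 = pd.MultiIndex.from_frame(df2[['Customer_Name', 'Customer_No']])
mask = pd.MultiIndex.from_frame(df1[['Customer_Name', 'Customer_No']]).isin(keys2)
df = df1[mask]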
IIUC, and you don't need extra columns from the second file (it will be used just for joining), you can do it this way:
common_cols = ['Customer_Name', 'Customer_No']
df = (pd.read_excel(filename1)
        .join(pd.read_excel(filename2, usecols=common_cols)
                .set_index(common_cols),
              on=common_cols, how='inner'))
I think the direct way will be something like this:
df_file1 = pd.read_csv(file1, index_col='Customer_No')  # set Customer_No as the index
df_file2 = pd.read_csv(file2, index_col='Customer_No')  # set Customer_No as the index
for index, row in df_file1.iterrows():
    if row['Customer_Name'] in df_file2['Customer_Name'].values:
        # here you can count (simply an integer), or do a more complicated
        # job like adding [index, row] to a result df, if needed
        pass
