pick specific columns from multiple csv files in python pandas - python

I am trying to build a new CSV file from several small CSV files. field1.csv and field2.csv have one column in common, ID. The output file final.csv should contain the NAME and ACC columns from field1.csv and the SCORE and TEAM columns from field2.csv, matched where ID in field1.csv equals ID in field2.csv. Where there is no matching value, the field should be blank. I am using Python pandas.
field1.csv :-
"ID","NAME","ACC","POINT"
"123","TRR","OOP","64"
"124","DEE","OOP","78"
"125","EWR","PLO","98"
field2.csv :-
"ID","SCORE","TEAM","END"
"111","92","BCC","0"
"121","80","CSS","1"
"123","87","BCC","0"
final.csv :-
"NAME","ACC","SCORE","TEAM"
"TRR","OOP","87","BCC"
"DEE","OOP","",""
"EWR","PLO","",""
Python code that I am trying,
import pandas as pd
df1 = pd.read_csv("field1.csv", index_col=[1], index_col=[2])
df2 = pd.read_csv("field2.csv", index_col=[1], index_col=[2])
finaldf = pd.concat([df1, df2])
print(finaldf)
finaldf.to_csv('final.csv')

I think you need a single index_col parameter to turn the first column into the index, usecols to filter the columns, and then join, which performs a left join by default:
df1 = pd.read_csv("field1.csv", index_col=[0], usecols=["ID","NAME","ACC"])
df2 = pd.read_csv("field2.csv", index_col=[0], usecols=["ID","SCORE","TEAM"])
finaldf = df1.join(df2)
print (finaldf)
    NAME  ACC  SCORE TEAM
ID
123  TRR  OOP   87.0  BCC
124  DEE  OOP    NaN  NaN
125  EWR  PLO    NaN  NaN
Another possible solution is to select the column subsets just before the join:
df1 = pd.read_csv("field1.csv", index_col=[0])
df2 = pd.read_csv("field2.csv", index_col=[0])
finaldf = df1[["NAME","ACC"]].join(df2[["SCORE","TEAM"]])
Finally, write the result to a file, omitting the index:
finaldf.to_csv('final.csv', index=False)
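For reference, the whole pipeline can be sketched end to end as below. This is only a minimal sketch, not the answerer's exact code. It assumes the two files look like the samples in the question (they are inlined with StringIO so the snippet runs on its own), uses merge instead of join, and reads everything with dtype=str so that SCORE stays "87" instead of becoming 87.0 once missing values appear.

```python
import pandas as pd
from io import StringIO

# Inline copies of the sample files from the question
# (assumption: the real files have the same layout).
field1 = StringIO(
    '"ID","NAME","ACC","POINT"\n'
    '"123","TRR","OOP","64"\n'
    '"124","DEE","OOP","78"\n'
    '"125","EWR","PLO","98"\n'
)
field2 = StringIO(
    '"ID","SCORE","TEAM","END"\n'
    '"111","92","BCC","0"\n'
    '"121","80","CSS","1"\n'
    '"123","87","BCC","0"\n'
)

# dtype=str keeps every value as text, so SCORE is not converted to a float.
df1 = pd.read_csv(field1, dtype=str, usecols=["ID", "NAME", "ACC"])
df2 = pd.read_csv(field2, dtype=str, usecols=["ID", "SCORE", "TEAM"])

# A left merge keeps every row of field1; rows without a match get NaN.
final = df1.merge(df2, on="ID", how="left").drop(columns="ID")

# to_csv writes NaN as an empty field by default.
csv_text = final.to_csv(index=False)
print(csv_text)
```

To write the file instead of a string, pass a path such as 'final.csv' as the first argument of to_csv.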

Related

KEGG Drug database Python script

I have a drug database saved in a SINGLE column of a CSV file that I can read with pandas. The file contains 750000 rows and its elements are divided by "///". The column also ends with "///". It seems every row ends with ";".
I would like to split it to multiple columns in order to create structured database. Capitalized words (drug information) like "ENTRY", "NAME" etc. will be headers of these new columns.
So it has some structure, although the elements can carry a different number and kind of fields, meaning some elements will just have NaN in some cells. I have never worked with such an SQL-like format, and it is difficult to reproduce as pandas code, too. Please see the screenshots for more information.
An example of desired output would look like this:
df = pd.DataFrame({
"ENTRY":["001", "002", "003"],
"NAME":["water", "ibuprofen", "paralen"],
"FORMULA":["H2O","C5H16O85", "C14H24O8"],
"COMPONENT":[float("nan"), float("nan"), "paracetamol"]})
I am guessing there will be .split() involved based on CAPITALIZED words? The Python 3 code solution would be appreciated. It can help a lot of people. Thanks!
Here is what I could come up with:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We create an additional dataframe.
dfi = pd.DataFrame()
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
dfi['Key1'] = dfi['Key'] = df[(df['Key'] == 'ENTRY')].index
dfi = dfi.set_index('Key1')
df = df.join(dfi, lsuffix='_caller', rsuffix='_other')
df = df.ffill()  # forward fill; fillna(method="ffill") is deprecated in recent pandas
df = df.astype({"Key_other": "Int64"})
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key_caller', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
Small code refactoring:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'C:\Users\ф\drug\drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
df['Key_other'] = None
df.loc[(df['Key'] == 'ENTRY'), 'Key_other'] = df[(df['Key'] == 'ENTRY')].index
df['Key_other'] = df['Key_other'].ffill()  # assign back instead of a chained inplace fill
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df['NAME'] = df['NAME'].str.split(r'\(', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
print(df)
Key ENTRY NAME FORMULA \
0 D00001 Water H2O
1 D00002 Nadide C21H28N7O14P2
2 D00003 Oxygen O2
3 D00004 Carbon dioxide CO2
4 D00005 Flavin adenine dinucleotide C27H33N9O15P2
... ... ... ...
11983 D12452 Fostroxacitabine bralpamide hydrochloride C22H30BrN4O8P. HCl
11984 D12453 Guretolimod C24H34F3N5O4
11985 D12454 Icenticaftor C12H13F6N3O3
11986 D12455 Lirafugratinib C28H24FN7O2
11987 D12456 Lirafugratinib hydrochloride C28H24FN7O2. HCl
Key COMPONENT
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
11983 NaN
11984 NaN
11985 NaN
11986 NaN
11987 NaN
[11988 rows x 4 columns]
It still needs a little polishing, which I leave to you.
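As an alternative to the pivot approach above, the file can also be read record by record in plain Python. The sketch below is hedged: it assumes each field starts with a capitalized keyword at the beginning of a line, that continuation lines are indented, and that records end with "///". The sample text and the helper name parse_records are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Made-up sample in the layout assumed above (not the real file).
sample = """\
ENTRY       D00001                      Drug
NAME        Water
FORMULA     H2O
REMARK      made-up field that is not wanted
            made-up continuation line
///
ENTRY       D00003                      Drug
NAME        Oxygen
FORMULA     O2
///
"""

cols = ["ENTRY", "NAME", "FORMULA", "COMPONENT"]


def parse_records(lines, wanted):
    """Collect one dict per record, keeping only the wanted keywords."""
    records, current = [], {}
    for line in lines:
        if line.startswith("///"):
            # End of a record: store it and start a new one.
            if current:
                records.append(current)
            current = {}
            continue
        if not line.strip() or line[:1].isspace():
            # Blank or indented continuation line: skipped in this sketch.
            continue
        key, _, value = line.partition(" ")
        if key in wanted and key not in current:
            current[key] = value.strip()
    if current:
        records.append(current)
    # Keywords missing from a record become NaN in the DataFrame.
    return pd.DataFrame(records, columns=wanted)


df = parse_records(sample.splitlines(), cols)
# Keep only the identifier from the ENTRY line.
df["ENTRY"] = df["ENTRY"].str.split().str[0]
print(df)
```

For a large file, the same function could be fed an open file object instead of sample.splitlines(), since it only iterates over lines.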

Comparing two spreadsheets, removing the duplicates and exporting the result to a csv in python

I'm trying to compare two excel spreadsheets, remove the names that appear in both spreadsheets from the first spreadsheet and then export it to a csv file using python. I am new, but here's what I have so far:
import pandas as pd
data_1 = pd.read_excel (r'names1.xlsx')
bit_data = pd.DataFrame(data_1, columns= ['Full_Name'])
bit_col = len(bit_data)
data_2 = pd.read_excel (r'force_names.xlsx')
force_data = pd.DataFrame(data_2, columns= ['FullName'])
force_col = len(force_data)
for bit_num in range(bit_col):
    for force_num in range(force_col):
        if bit_data.iloc[bit_num,0] == force_data.iloc[force_num,0]:
            data_1 = data_1.drop(data_1.index[[bit_num]])
data_1.to_csv(r"/Users/name/Desktop/Reports/Names.csv")
When I run it it gets rid of some duplicates but not all, any advice anyone has would be greatly appreciated.
Use pandas merge to get all unique names, with no duplicates. If you want to drop any names that are in both files (I'm not sure if that's what you're asking), you can do so. See this toy example:
row1list = ['G. Anderson']
row2list = ['Z. Ebra']
df1 = pd.DataFrame([row1list, row2list], columns=['FullName'])
row1list = ['G. Anderson']
row2list = ['C. Obra']
df2 = pd.DataFrame([row1list, row2list], columns=['FullName'])
df3 = df1.merge(df2, on='FullName', how='outer', indicator=True)
print(df3)
# FullName _merge
# 0 G. Anderson both
# 1 Z. Ebra left_only
# 2 C. Obra right_only
df3 = df3.loc[df3['_merge'] != 'both']
del df3['_merge']
print(df3)
# FullName
# 1 Z. Ebra
# 2 C. Obra
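If the goal is exactly what the question describes (remove from the first spreadsheet every name that also appears in the second), a boolean mask with isin is a shorter route. The loop in the question misses rows because it drops by position while data_1 keeps shrinking, so the positions stop lining up with bit_data. The sketch below uses toy frames in place of the Excel files; the column names are the ones from the question.

```python
import pandas as pd

# Toy stand-ins for the two spreadsheets (column names from the question).
data_1 = pd.DataFrame({"Full_Name": ["G. Anderson", "Z. Ebra", "G. Anderson"]})
data_2 = pd.DataFrame({"FullName": ["G. Anderson", "C. Obra"]})

# Keep only the rows of data_1 whose name does not occur in data_2.
only_in_first = data_1[~data_1["Full_Name"].isin(data_2["FullName"])]
print(only_in_first)
```

The result can then be exported with only_in_first.to_csv(path, index=False).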

adding dates to a pandas data frame

I currently have a df in pandas with a variable called 'Dates' that records the date a complaint was filed.
data = pd.read_csv("filename.csv")
Dates
Initially Received
07-MAR-08
08-APR-08
19-MAY-08
As you can see there are missing dates between when complaints are filed, also multiple complaints may have been filed on the same day. Is there a way to fill in the missing days while keeping complaints that were filed on the same day the same?
I tried creating a new df with datetime and merging the dataframes together,
days = pd.date_range(start='01-JAN-2008', end='31-DEC-2017')
df = pd.DataFrame(data=days)
df.index = range(3653)
dates = pd.merge(days, data['Dates'], how='inner')
but I get the following error:
ValueError: can not merge DataFrame with instance of type <class
'pandas.tseries.index.DatetimeIndex'>
Here are the first four rows of data
You were close, there's an issue with your input
First do:
df = pd.read_csv('filename.csv', skiprows = 1)
Then
days = pd.date_range(start='01-JAN-2008', end='31-DEC-2017')
df_clean = df.reset_index()
df_clean['idx dates'] = pd.to_datetime(df_clean['Initially Received'])
df2 = pd.DataFrame(data=days, index = range(3653), columns=['full dates'])
dates = pd.merge(df2, df_clean, left_on='full dates', right_on = 'idx dates', how='left')
Create your date range, and use merge to outer join it to the original dataframe, preserving duplicates.
import pandas as pd
from io import StringIO
TESTDATA = StringIO(
"""Dates;fruit
05-APR-08;apple
08-APR-08;banana
08-APR-08;pear
11-APR-08;grapefruit
""")
df = pd.read_csv(TESTDATA, sep=';', parse_dates=['Dates'])
dates = pd.date_range(start='04-APR-2008', end='12-APR-2008').to_frame()
pd.merge(
    df, dates, left_on='Dates', right_on=0,
    how='outer').sort_values(by=['Dates']).drop(columns=0)
# Dates fruit
# 2008-04-04 NaN
# 2008-04-05 apple
# 2008-04-06 NaN
# 2008-04-07 NaN
# 2008-04-08 banana
# 2008-04-08 pear
# 2008-04-09 NaN
# 2008-04-10 NaN
# 2008-04-11 grapefruit
# 2008-04-12 NaN
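A variant of the same idea, as a minimal sketch: give the date range a named column so both frames share the key Dates, then left-merge the data onto the full calendar. The explicit format string '%d-%b-%y' is an assumption based on the sample values such as 05-APR-08, and the name calendar is only illustrative.

```python
import pandas as pd
from io import StringIO

raw = StringIO(
    "Dates;fruit\n"
    "05-APR-08;apple\n"
    "08-APR-08;banana\n"
    "08-APR-08;pear\n"
    "11-APR-08;grapefruit\n"
)
df = pd.read_csv(raw, sep=";")
# Assumed format: day, abbreviated month name, two-digit year.
df["Dates"] = pd.to_datetime(df["Dates"], format="%d-%b-%y")

# One row per calendar day, in a column with the same name as the key.
calendar = pd.DataFrame({"Dates": pd.date_range("2008-04-04", "2008-04-12")})

# The left merge keeps every calendar day and repeats days with several rows.
filled = calendar.merge(df, on="Dates", how="left")
print(filled)
```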

How to match values of a dataframe with another dataframe in Python? [duplicate]

I am merging two csv(data frame) using below code:
import pandas as pd
a = pd.read_csv(file1,dtype={'student_id': str})
df = pd.read_csv(file2)
c=pd.merge(a,df,on='test_id',how='left')
c.to_csv('test1.csv', index=False)
I have the following CSV files
file1:
test_id, student_id
1, 01990
2, 02300
3, 05555
file2:
test_id, result
1, pass
3, fail
after merge
test_id, student_id , result
1, 1990, pass
2, 2300,
3, 5555, fail
Notice that student_id has a leading 0 and is supposed to be treated as text, but after merging and calling to_csv it is converted to a number and the leading 0 is removed.
How can I keep the column as "text" even after to_csv?
I think its to_csv function which saves back again as numeric
Added dtype={'student_id': str} while reading csv.. but while saving it as to_csv .. it again convert it to numeric
Caveat Please use merge or join. This answer is provided to give perspective on the flexibility pandas gives you and how many different ways there are to answer the same question.
a = pd.read_csv('file1.csv', converters=dict(student_id=str), skipinitialspace=True)
df = pd.read_csv('file2.csv')
results = pd.concat(
    [d.set_index('test_id') for d in [a, df]],
    axis=1, join='outer'
).reset_index()
It's not dropping the leading zero on the merge, it's dropping it on the read_csv. You can fix this by specifying that column is a string at import time:
a = pd.read_csv('file1.csv', dtype={'student_id': str}, skipinitialspace=True)
The important part is the dtype parameter. You are telling pandas to import this column as a string. The skipinitialspace parameter is set to True, because the column headers are defined with spaces, so we strip it:
test_id, student_id
        ^ The student_id starts here, at the space
The final code looks like this:
a = pd.read_csv('file1.csv', dtype={'student_id': str}, skipinitialspace=True)
df = pd.read_csv('file2.csv')
results = a.merge(df, how='left', on='test_id')
With the results dataframe looking like this:
   test_id student_id result
0        1      01990   pass
1        2      02300    NaN
2        3      05555   fail
Then when you run to_csv your result should be:
test_id,student_id, result
1,01990, pass
2,02300,
3,05555, fail
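The round trip can be checked with a small self-contained sketch. The file contents are inlined with StringIO, which is an assumption made for the example. It shows that to_csv writes the zeros as they are, and that they only disappear when the file is read again without a dtype.

```python
import pandas as pd
from io import StringIO

# Inline copies of the two sample files from the question.
file1 = StringIO("test_id, student_id\n1, 01990\n2, 02300\n3, 05555\n")
file2 = StringIO("test_id, result\n1, pass\n3, fail\n")

a = pd.read_csv(file1, dtype={"student_id": str}, skipinitialspace=True)
df = pd.read_csv(file2, skipinitialspace=True)
merged = a.merge(df, on="test_id", how="left")

# The text written by to_csv still contains the leading zeros.
csv_text = merged.to_csv(index=False)
print(csv_text)

# Reading the result back without dtype infers integers again ...
reread_default = pd.read_csv(StringIO(csv_text))
# ... while passing dtype keeps the column as text.
reread_text = pd.read_csv(StringIO(csv_text), dtype={"student_id": str})
print(reread_default["student_id"].tolist())
print(reread_text["student_id"].tolist())
```

So if the zeros seem to vanish after to_csv, it is worth checking how the output file is being opened or re-read.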
A solution with join: first call read_csv with the dtype parameter to read student_id as a string, and strip the whitespace with skipinitialspace:
df1 = pd.read_csv(file1, dtype={'student_id': str}, skipinitialspace=True)
df2 = pd.read_csv(file2, skipinitialspace=True)
df = df1.join(df2.set_index('test_id'), on='test_id')
print (df)
   test_id student_id result
0        1      01990   pass
1        2      02300    NaN
2        3      05555   fail
a = pd.read_csv(file1, dtype={'test_id': object})
df = pd.read_csv(file2, dtype={'test_id': object})
==============================================================
In[28]: pd.merge(a, df, on='test_id', how='left')
Out[28]:
test_id student_id result
0 01 1990 pass
1 02 2300 NaN
2 003 5555 fail

How can I split a column into 2 in the correct way?

I am web-scraping tables from a website and writing them to an Excel file.
My goal is to split one column into 2 columns in the correct way.
The column I want to split: "FLIGHT"
I want this form:
First example: KL744 --> KL and 0744
Second example: BE1013 --> BE and 1013
So I need to separate the FIRST 2 characters (into the first column) from the remaining characters, of which there can be 1, 2, 3 or 4. If there are 4, that's OK and I keep them; if 3, I want to put a 0 before them; if 2, I want to put 00 before them (my goal is to get 4 characters/digits in the second column).
How can I do this?
Here is my relevant code, which already contains some formatting code.
df2 = pd.DataFrame(datatable,columns = cols)
df2["UPLOAD_TIME"] = datetime.now()
mask = np.column_stack([df2[col].astype(str).str.contains(r"Scheduled", na=True) for col in df2])
df3 = df2.loc[~mask.any(axis=1)]
if os.path.isfile("output.csv"):
    df1 = pd.read_csv("output.csv", sep=";")
    df4 = pd.concat([df1,df3])
    df4.to_csv("output.csv", index=False, sep=";")
else:
    df3.to_csv("output.csv", index=False, sep=";")
Here the excel prt sc from my table:
You can use indexing with str with zfill:
df = pd.DataFrame({'FLIGHT':['KL744','BE1013']})
df['a'] = df['FLIGHT'].str[:2]
df['b'] = df['FLIGHT'].str[2:].str.zfill(4)
print (df)
   FLIGHT   a     b
0   KL744  KL  0744
1  BE1013  BE  1013
I believe your code needs:
df2 = pd.DataFrame(datatable,columns = cols)
df2['a'] = df2['FLIGHT'].str[:2]
df2['b'] = df2['FLIGHT'].str[2:].str.zfill(4)
df2["UPLOAD_TIME"] = datetime.now()
...
...
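If the column may also contain values that do not follow the two-characters-plus-digits pattern, a regular expression with str.extract is a possible alternative, sketched below. The column names CARRIER and NUMBER and the sample values 'LH52' and 'n/a' are made up for illustration. Rows that do not match the pattern end up as NaN instead of being sliced blindly.

```python
import pandas as pd

# 'LH52' and 'n/a' are made-up values added to show padding and non-matches.
df = pd.DataFrame({"FLIGHT": ["KL744", "BE1013", "LH52", "n/a"]})

# Two leading characters, then one to four digits, and nothing else.
parts = df["FLIGHT"].str.extract(r"^(?P<CARRIER>.{2})(?P<NUMBER>\d{1,4})$")

df["CARRIER"] = parts["CARRIER"]
# zfill pads the digits on the left with zeros up to a width of 4.
df["NUMBER"] = parts["NUMBER"].str.zfill(4)
print(df)
```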
