I have a drug database saved in a SINGLE column of a CSV file that I can read with Pandas. The file contains 750000 rows and its elements are divided by "///". The column also ends with "///". It seems every row ends with ";".
I would like to split it into multiple columns in order to create a structured database. Capitalized words (drug information) like "ENTRY", "NAME" etc. will be the headers of these new columns.
So it has some structure, although the elements can be described by a different number and kind of fields, meaning some elements will just have NaN in some cells. I have never worked with such an SQL-like format, and it is difficult to reproduce it as Pandas code, too. Please see the PrtScs for more information.
An example of desired output would look like this:
df = pd.DataFrame({
    "ENTRY":["001", "002", "003"],
    "NAME":["water", "ibuprofen", "paralen"],
    "FORMULA":["H2O","C5H16O85", "C14H24O8"],
    "COMPONENT":[float("nan"), float("nan"), "paracetamol"]})
I am guessing there will be .split() involved based on CAPITALIZED words? The Python 3 code solution would be appreciated. It can help a lot of people. Thanks!
Here is what I could come up with:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We create an additional dataframe.
dfi = pd.DataFrame()
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
dfi['Key1'] = dfi['Key'] = df[(df['Key'] == 'ENTRY')].index
dfi = dfi.set_index('Key1')
df = df.join(dfi, lsuffix='_caller', rsuffix='_other')
df.ffill(inplace=True)
df = df.astype({"Key_other": "Int64"})
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key_caller', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
Small code refactoring:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'C:\Users\ф\drug\drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
df['Key_other'] = None
df.loc[(df['Key'] == 'ENTRY'), 'Key_other'] = df[(df['Key'] == 'ENTRY')].index
df['Key_other'] = df['Key_other'].ffill()
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df['NAME'] = df['NAME'].str.split(r'\(', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
print(df)
Key ENTRY NAME FORMULA \
0 D00001 Water H2O
1 D00002 Nadide C21H28N7O14P2
2 D00003 Oxygen O2
3 D00004 Carbon dioxide CO2
4 D00005 Flavin adenine dinucleotide C27H33N9O15P2
... ... ... ...
11983 D12452 Fostroxacitabine bralpamide hydrochloride C22H30BrN4O8P. HCl
11984 D12453 Guretolimod C24H34F3N5O4
11985 D12454 Icenticaftor C12H13F6N3O3
11986 D12455 Lirafugratinib C28H24FN7O2
11987 D12456 Lirafugratinib hydrochloride C28H24FN7O2. HCl
Key COMPONENT
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
11983 NaN
11984 NaN
11985 NaN
11986 NaN
11987 NaN
[11988 rows x 4 columns]
It still needs a little polishing; I leave that to you.
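As an alternative to the pivot, the records can also be collected line by line. The following is only a sketch under assumptions that are not confirmed by the question: every field starts with an uppercase keyword at the beginning of the line, continuation lines start with whitespace, and a record ends at a line containing only "///". The function name parse_records and the sample text are made up for illustration.

```python
import pandas as pd

def parse_records(lines, wanted):
    """Build one row per record; a record is assumed to end at a '///' line."""
    records, current = [], {}
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "///":
            # End of the current record: store it and start a new one.
            if current:
                records.append(current)
            current = {}
            continue
        parts = line.split(None, 1)
        # Skip blank lines and continuation lines (assumed to start with whitespace).
        if not parts or line[0].isspace():
            continue
        key = parts[0]
        # Keep only the first occurrence of each wanted keyword.
        if key in wanted and key not in current:
            current[key] = parts[1].strip() if len(parts) > 1 else ""
    if current:
        records.append(current)
    # Keywords missing from a record become NaN.
    return pd.DataFrame(records, columns=wanted)

# Made-up sample in the assumed layout.
sample = """ENTRY       D00001
NAME        Water
FORMULA     H2O
///
ENTRY       D00002
NAME        Nadide
FORMULA     C21H28N7O14P2
COMPONENT   hypothetical value
///
""".splitlines()

cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
parsed = parse_records(sample, cols)
print(parsed)
```

For the real file, the lines could come from an open file object instead of the sample list.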
I am hoping someone can help me understand why a dataframe works one way but not the other. I am relatively new to Python but do have a decent understanding of Pandas. However I can't figure out why I can't add data to the empty dataframe directly first without getting data from another dataframe. I know there are plenty of ways around this. I am just curious as to why.
Thank you for any light you may be able to shed on this.
import pandas as pd
source_df = pd.DataFrame({'iset':['1001']})
print(source_df)
column_names = ['Import Set No','col2']
df = pd.DataFrame(columns = column_names)
df.loc[:,'col2'] = 2
df.loc[:,'Import Set No'] = source_df.loc[:,'iset']
print(df)
Produces:
iset
0 1001
Empty DataFrame
Columns: [Import Set No, col2]
Index: []
And:
import pandas as pd
source_df = pd.DataFrame({'iset':['1001']})
print(source_df)
column_names = ['Import Set No','col2']
df = pd.DataFrame(columns = column_names)
df.loc[:,'Import Set No'] = source_df.loc[:,'iset']
df.loc[:,'col2'] = 2
print(df)
Produces:
iset
0 1001
Index(['Import Set No', 'col2'], dtype='object')
Import Set No col2
0 1001 2
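This does not explain the ordering difference, but one way to sidestep it is to avoid the empty dataframe altogether and build the frame from the source column in one step. A minimal sketch, reusing the names from the question:

```python
import pandas as pd

source_df = pd.DataFrame({'iset': ['1001']})

# The Series supplies the index; the scalar 2 is broadcast to every row.
df = pd.DataFrame({'Import Set No': source_df['iset'], 'col2': 2})
print(df)
```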
I'm trying to read a file that doesn't have any quotes, which causes rows to have an inconsistent number of fields.
Data looks as follows:
col_a, col_b
abc, inc., 5
xyz corb, 10
Since there are no quotes around "abc, inc.", this is causing the first row to get split into 3 values, but it should actually be just 2 values.
This column is not necessarily in the first position, and there can be another bad column like this. The data has around 250 columns.
I'm reading this using pd.read_csv, how can this be resolved?
Thanks!
It's not a CSV, but since there is only one column with the errant commas you can process it with the csv module and fix the slice that holds too many column values. When a row has too many cells, assume they are the ones from the unescaped comma.
import pandas as pd
import csv
def split_badrows(fileobj, bad_col, total_cols):
    """Iterate rows, collapsing extra columns at bad_col"""
    for row in csv.reader(fileobj):
        row = [cell.strip() for cell in row]
        extras = len(row) - total_cols
        if extras > 0:
            # collapse slice at troubled column into single value
            extras += 1  # python slice doesn't include right endpoint
            row[bad_col] = ", ".join(row[bad_col:bad_col+extras])
            del row[bad_col+1:bad_col+extras]
        yield row

def df_from_badtext(fileobj, bad_col):
    """Make pandas.DataFrame from badly formatted text"""
    # the header line gives the expected number of columns
    columns = [cell.strip() for cell in next(fileobj).split(",")]
    total_cols = len(columns)
    return pd.DataFrame(split_badrows(fileobj, bad_col, total_cols),
                        columns=columns)

# test
with open("testme.txt", "w") as f:
    f.write("""col_a, col_b
abc, inc., 5
xyz corb, 10""")

with open("testme.txt") as f:
    df = df_from_badtext(f, bad_col=0)
print(df)
Split the data into a list, then transform it into a dataframe.
csv = '''col_a, col_b
abc, inc., 5
xyz corb, 10'''+'\n'
import re
import pandas as pd
reArr = re.findall('(.*),([^,]+)\n',csv)
df=pd.DataFrame(reArr[1:],columns=reArr[0])
print(df)
       col_a  col_b
0  abc, inc.      5
1   xyz corb     10
EDIT:
Thanks to tdelaney's comment below:
see if this works
pd.read_csv('foo.csv',delimiter=",(?!( [\w\d]*).,)").dropna(axis=1)
OLD:
using delimiter as ",(?!.*,)" in read_csv seems to be solving this for me
EDIT (after updated question with an additional column):
Solution 1:
You can create a function with the bad column as a parameter and use split and concat to correct the dataframe depending on that bad column. Please note that the bad_col parameter in my function is the column number, where we start counting at 1, rather than 0 (e.g. 1, 2, 3, etc. instead of 0, 1, 2, etc.):
import pandas as pd
import numpy as np
from io import StringIO
data = StringIO('''
col, col_a, col_b
000, abc, inc., 5
111, xyz corb, 10
''')
df = pd.read_csv(data, sep="|")
def fix_csv(df, bad_col):
    cols = df.columns.str.split(', ')[0]
    x = len(cols) - bad_col
    tmp = df.iloc[:,0].str.split(', ', expand=True, n=x)
    df = pd.concat([tmp.iloc[:,0],
                    tmp.iloc[:,-1].str.rsplit(', ', expand=True, n=x)],
                   axis=1)
    df.columns = cols
    return df

fix_csv(df, bad_col=2)
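The same idea (fold the surplus pieces back into the one bad column) can also be sketched on plain strings, before pandas is involved. This is an illustration only: repair_row is a made-up helper, it assumes a single bad column per row and ", " as the separator, and unlike the function above it counts bad_col from 0.

```python
def repair_row(line, n_cols, bad_col, sep=", "):
    """Split `line`, folding surplus separators into column `bad_col` (0-based)."""
    cells = line.split(sep)
    extras = len(cells) - n_cols
    if extras <= 0:
        # Row already has the expected number of cells.
        return cells
    # Re-join the pieces that belong to the bad column.
    merged = sep.join(cells[bad_col:bad_col + extras + 1])
    return cells[:bad_col] + [merged] + cells[bad_col + extras + 1:]

rows = ["000, abc, inc., 5", "111, xyz corb, 10"]
fixed = [repair_row(r, n_cols=3, bad_col=1) for r in rows]
print(fixed)
```

The repaired lists can then be passed to pd.DataFrame together with the column names.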
Solution 2 (this is if you have issues in multiple columns and you need to use more brute force):
It sounds like there is a possibility that multiple columns could be affected, as you mentioned in the comments that there is only 1 "so far".
As such, this might be a little bit of a project to clean up the data. The following code can give you an idea how to do that. The bottom-line is that you can create two different dataframes: 1) The first dataframe has the minimum number of commas (i.e. they should be the rows without any issues). 2) The other dataframe will be the dataframe with all of the issues. I've shown how you can clean the data to get to the correct number of columns and then change the data back and concat the two dataframes.
import pandas as pd
import numpy as np
from io import StringIO
data = StringIO('''
col, col_a, col_b
000, abc, inc., 5
111, xyz corb, 10
''')
df = pd.read_csv(data, sep="|")
cols = df.columns.str.split(', ')[0]
s = df.iloc[:,0].str.count(',')
df1 = df.copy()[s.eq(s.min())]
df1 = df1.iloc[:,0].str.split(', ', expand=True)
df1.columns = cols
df2 = df.copy()[s.gt(s.min())]
#inspect this dataframe manually to see how many rows affected, which columns, etc.
#cleanup df2 with some .replace so all equal commas
original = [', inc.', ', corp.']
temp = [' inc.', ' corp.']
df2.iloc[:,0] = df2.iloc[:,0].replace(original, temp, regex=True)
df2 = df2.iloc[:,0].str.split(', ', expand=True)
df2.columns = cols
#cleanup df2 by changing back to original values
df2['col_a'] = df2['col_a'].replace(temp, original, regex=True) # you can do this with other columns as well
df3 = pd.concat([df1, df2]).sort_index()
df3
Out[1]:
col col_a col_b
0 000 abc, inc. 5
1 111 xyz corb 10
Solution 3: Previous Solution (for original question when problem was only in first column - for reference)
You can read in with sep="|" as that | character is not in your .csv, so it reads all of the data into one column.
The main assumption of my solution is that the problematic column is only the first column. I use rsplit(', ') and limit the number of splits to the total number of columns minus 1 (with the example data, this is 2-1=1). Hopefully, this works with your actual data or at least gives you some idea. If your data is separated by "," instead of ", ", adjust my splits accordingly.
import pandas as pd
import numpy as np
from io import StringIO
data = StringIO('''
col_a, col_b
abc, inc., 5
xyz corb, 10
''')
df = pd.read_csv(data, sep="|")
cols = df.columns.str.split(', ')[0]
x = len(cols) - 1
df = df.iloc[:,0].str.rsplit(', ', expand=True, n=x)
df.columns = cols
df
Out[1]:
col_a col_b
0 abc, inc. 5
1 xyz corb 10
I need to extract the following json:
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"}]}
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"},{"Status":"SMART Passed","Name":"/dev/sdb"}]}
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"},{"Status":"SMART Passed","Name":"/dev/sdb"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"},{"Name":"disk1","Status":"Passed"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Failed"},{"Name":"disk1","Status":"not supported"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"}]}
Name: raw_results, dtype: object
Into separate columns. I don't know how many disks per result there might be in future. What would be the best way here?
I tried the following:
d = raw_res['raw_results'].map(json.loads).apply(pd.Series).add_prefix('raw_results.')
Gives me:
Example output might be something like
A better way would be to add each disk check as an additional row in the dataframe, with the same checkid as the row it was extracted from. So for 3 disks in the results it would generate 3 rows, one per disk.
UPDATE
This code
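# This works
import numpy as np  # pd.np is no longer available in recent pandas

dfs = []
def json_to_df(row, json_col):
    # parse the JSON string of one row into its own small dataframe
    json_df = pd.read_json(row[json_col])
    # copy the remaining columns of the row onto every disk row
    dfs.append(json_df.assign(**row.drop(json_col)))

df['raw_results'] = df['raw_results'].replace("{}", np.nan)
df = df.dropna()
df.apply(json_to_df, axis=1, json_col='raw_results')
df = pd.concat(dfs)
df.head()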
Adds an extra row for each disk (sda, sdb etc.)
So now I would need to split this column into 2: Status and Name.
df1 = df["PhysicalDisks"].apply(pd.Series)
df_final = pd.concat([df, df1], axis = 1).drop('PhysicalDisks', axis = 1)
df_final.head()
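For comparison, the "one row per disk" layout can also be built with plain json.loads and a loop. This is a sketch, not the approach used above: the checkid values are invented, and it assumes every non-empty result has a "PhysicalDisks" list.

```python
import json
import pandas as pd

# Two sample rows; the checkid values are hypothetical.
raw_res = pd.DataFrame({
    "checkid": [1, 2],
    "raw_results": [
        '{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"}]}',
        '{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"},'
        '{"Status":"SMART Passed","Name":"/dev/sdb"}]}',
    ],
})

records = []
for checkid, raw in zip(raw_res["checkid"], raw_res["raw_results"]):
    # An empty "{}" result simply contributes no rows.
    for disk in json.loads(raw).get("PhysicalDisks", []):
        records.append({"checkid": checkid, **disk})

disks = pd.DataFrame(records)
print(disks)
```

Each disk keeps the checkid of the row it came from, so Status and Name end up as separate columns without a second splitting step.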
I currently have a df in pandas with a variable called 'Dates' that records the date a complaint was filed.
data = pd.read_csv("filename.csv")
Dates
Initially Received
07-MAR-08
08-APR-08
19-MAY-08
As you can see there are missing dates between when complaints are filed, also multiple complaints may have been filed on the same day. Is there a way to fill in the missing days while keeping complaints that were filed on the same day the same?
I tried creating a new df with datetime and merging the dataframes together,
days = pd.date_range(start='01-JAN-2008', end='31-DEC-2017')
df = pd.DataFrame(data=days)
df.index = range(3653)
dates = pd.merge(days, data['Dates'], how='inner')
but I get the following error:
ValueError: can not merge DataFrame with instance of type <class 'pandas.tseries.index.DatetimeIndex'>
Here are the first four rows of data
You were close, there's an issue with your input
First do:
df = pd.read_csv('filename.csv', skiprows = 1)
Then
days = pd.date_range(start='01-JAN-2008', end='31-DEC-2017')
df_clean = df.reset_index()
df_clean['idx dates'] = pd.to_datetime(df_clean['Initially Received'])
df2 = pd.DataFrame(data=days, index = range(3653), columns=['full dates'])
dates = pd.merge(df2, df_clean, left_on='full dates', right_on = 'idx dates', how='left')
Create your date range, and use merge to outer join it to the original dataframe, preserving duplicates.
import pandas as pd
from io import StringIO
TESTDATA = StringIO(
"""Dates;fruit
05-APR-08;apple
08-APR-08;banana
08-APR-08;pear
11-APR-08;grapefruit
""")
df = pd.read_csv(TESTDATA, sep=';', parse_dates=['Dates'])
dates = pd.date_range(start='04-APR-2008', end='12-APR-2008').to_frame()
pd.merge(
df, dates, left_on='Dates', right_on=0,
how='outer').sort_values(by=['Dates']).drop(columns=0)
# Dates fruit
# 2008-04-04 NaN
# 2008-04-05 apple
# 2008-04-06 NaN
# 2008-04-07 NaN
# 2008-04-08 banana
# 2008-04-08 pear
# 2008-04-09 NaN
# 2008-04-10 NaN
# 2008-04-11 grapefruit
# 2008-04-12 NaN
I am trying to create a modified CSV file from multiple small CSV files. There is one column common to field1.csv and field2.csv. The final file final.csv will contain column["NAME"], column["ACC"] from field1.csv and column["SCORE"], column["TEAM"] from field2.csv where column["ID"] from field1.csv is equal to column["ID"] from field2.csv. If there is no value then it should be blank. I am using Python pandas.
field1.csv :-
"ID","NAME","ACC","POINT"
"123","TRR","OOP","64"
"124","DEE","OOP","78"
"125","EWR","PLO","98"
field2.csv :-
"ID","SCORE","TEAM","END"
"111","92","BCC","0"
"121","80","CSS","1"
"123","87","BCC","0"
final.csv :-
"NAME","ACC","SCORE","TEAM"
"TRR","OOP","87","BCC"
"DEE","OOP","",""
"EWR","PLO","",""
Python code that I am trying,
import pandas as pd
df1 = pd.read_csv("field1.csv", index_col=[1], index_col=[2])
df2 = pd.read_csv("field2.csv", index_col=[1], index_col=[2])
finaldf = pd.concat([df1, df2])
print(finaldf)
finaldf.to_csv('final.csv')
I think you need a single index_col parameter to convert the first column to the index, usecols to filter the columns, and join, which is a left join by default:
df1 = pd.read_csv("field1.csv", index_col=[0], usecols=["ID","NAME","ACC"])
df2 = pd.read_csv("field2.csv", index_col=[0], usecols=["ID","SCORE","TEAM"])
finaldf = df1.join(df2)
print (finaldf)
NAME ACC SCORE TEAM
ID
123 TRR OOP 87.0 BCC
124 DEE OOP NaN NaN
125 EWR PLO NaN NaN
Another possible solution is to filter the columns by subsets before the join:
df1 = pd.read_csv("field1.csv", index_col=[0])
df2 = pd.read_csv("field2.csv", index_col=[0])
finaldf = df1[["NAME","ACC"]].join(df2[["SCORE","TEAM"]])
Finally, write to a file, omitting the index:
finaldf.to_csv('final.csv', index=False)
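The same result can be sketched with merge on the ID column instead of an index join. The example below reads the sample data from strings so it runs without the files; dtype=str is my own choice here, so that SCORE stays "87" instead of becoming 87.0.

```python
from io import StringIO
import pandas as pd

field1 = StringIO('''"ID","NAME","ACC","POINT"
"123","TRR","OOP","64"
"124","DEE","OOP","78"
"125","EWR","PLO","98"
''')
field2 = StringIO('''"ID","SCORE","TEAM","END"
"111","92","BCC","0"
"121","80","CSS","1"
"123","87","BCC","0"
''')

# Read only the needed columns, keeping every value as text.
df1 = pd.read_csv(field1, dtype=str, usecols=["ID", "NAME", "ACC"])
df2 = pd.read_csv(field2, dtype=str, usecols=["ID", "SCORE", "TEAM"])

# Left merge keeps every row of field1; unmatched IDs get NaN.
finaldf = df1.merge(df2, on="ID", how="left").drop(columns="ID")

# NaN is written as an empty field by default.
csv_text = finaldf.to_csv(index=False)
print(csv_text)
```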