adding rows to empty dataframe with columns - python

I am using Pandas and want to add rows to an empty DataFrame with columns already established.
So far my code looks like this...
def addRows(cereals,lines):
    for i in np.arange(1,len(lines)):
        dt = parseLine(lines[i])
        dt = pd.Series(dt)
        print(dt)
        # YOUR CODE GOES HERE (add dt to cereals)
        cereals.append(dt, ignore_index = True)
    return(cereals)
However, when I run...
cereals = addRows(cereals,lines)
cereals
the dataframe returns with no rows, just the columns. I am not sure what I am doing wrong but I am pretty sure it has something to do with the append method. Anyone have any ideas as to what I am doing wrong?

There are two probable reasons your code is not operating as intended:
cereals.append(dt, ignore_index = True) is not doing what you think it is. You're trying to append a series, not a DataFrame there.
cereals.append(dt, ignore_index = True) does not modify cereals in place; it returns a new DataFrame, which your code discards, so you end up returning the unchanged original. An equivalent function would look like this:
>>> def foo(a):
...     a + 1
...     return a
...
>>> foo(1)
1
I haven't tested this on my machine, but I think your fixed solution would look like this:
def addRows(cereals, lines):
    for i in np.arange(1, len(lines)):
        data = parseLine(lines[i])
        # wrap data in a list so it becomes one row, not one column
        new_df = pd.DataFrame([data], columns=cereals.columns)
        cereals = cereals.append(new_df, ignore_index=True)
    return cereals
By the way, I don't really know where lines is coming from, but right away I would at least modify it to look like this:
data = [parseLine(line) for line in lines]
cereals = cereals.append(pd.DataFrame(data, columns=cereals.columns), ignore_index=True)
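Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas you would use pd.concat instead. A minimal sketch of the same function under that API, assuming (as in the loop above) that lines[0] is a header to skip and that parseLine returns one list of values per line:

def addRows(cereals, lines):
    # parse every line after the header into one row of values each
    data = [parseLine(line) for line in lines[1:]]
    new_rows = pd.DataFrame(data, columns=cereals.columns)
    # concat returns a new DataFrame; the original cereals is left untouched
    return pd.concat([cereals, new_rows], ignore_index=True)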
How to add an extra row to a pandas dataframe
You could also create a new DataFrame and just append that DataFrame to your existing one. E.g.
>>> import pandas as pd
>>> empty_alph = pd.DataFrame(columns=['letter', 'index'])
>>> alph_abc = pd.DataFrame([['a', 0], ['b', 1], ['c', 2]], columns=['letter', 'index'])
>>> empty_alph.append(alph_abc)
  letter  index
0      a    0.0
1      b    1.0
2      c    2.0
As I noted in the link, you can also use the loc method on a DataFrame:
>>> df = empty_alph.append(alph_abc)
>>> df.loc[df.shape[0]] = ['d', 3]  # df.shape[0] just finds the next position in the index
>>> df
  letter  index
0      a    0.0
1      b    1.0
2      c    2.0
3      d    3.0
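On pandas 2.0 and later, where DataFrame.append has been removed, the same concatenation is spelled with pd.concat:
>>> df = pd.concat([empty_alph, alph_abc], ignore_index=True)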

Related

KEGG Drug database Python script

I have a drug database saved in a SINGLE column of a CSV file that I can read with Pandas. The file contains 750000 rows and its elements are divided by "///". The column also ends with "///", and it seems every row ends with ";".
I would like to split it into multiple columns in order to create a structured database. Capitalized words (drug information) like "ENTRY", "NAME" etc. will be the headers of these new columns.
So it has some structure, although the elements can be described by a different number and sort of information fields, meaning some elements will just have NaN in some cells. I have never worked with such an SQL-like format, and it is difficult to reproduce it as Pandas code, too. Please see the screenshots for more information.
An example of desired output would look like this:
import numpy as np
df = pd.DataFrame({
    "ENTRY": ["001", "002", "003"],
    "NAME": ["water", "ibuprofen", "paralen"],
    "FORMULA": ["H2O", "C5H16O85", "C14H24O8"],
    "COMPONENT": [np.nan, np.nan, "paracetamol"]})
I am guessing there will be .split() involved, based on the CAPITALIZED words? A Python 3 solution would be appreciated. It could help a lot of people. Thanks!
I helped as much as I could:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We create an additional dataframe.
dfi = pd.DataFrame()
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
dfi['Key1'] = dfi['Key'] = df[(df['Key'] == 'ENTRY')].index
dfi = dfi.set_index('Key1')
df = df.join(dfi, lsuffix='_caller', rsuffix='_other')
df.fillna(method="ffill", inplace=True)
df = df.astype({"Key_other": "Int64"})
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key_caller', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
Small code refactoring:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'C:\Users\ф\drug\drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
df['Key_other'] = None
df.loc[(df['Key'] == 'ENTRY'), 'Key_other'] = df[(df['Key'] == 'ENTRY')].index
df['Key_other'].fillna(method="ffill", inplace=True)
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df['NAME'] = df['NAME'].str.split(r'\(', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
print(df)
Key     ENTRY                                        NAME             FORMULA  \
0      D00001                                       Water                 H2O
1      D00002                                      Nadide       C21H28N7O14P2
2      D00003                                      Oxygen                  O2
3      D00004                              Carbon dioxide                 CO2
4      D00005                 Flavin adenine dinucleotide       C27H33N9O15P2
...       ...                                         ...                 ...
11983  D12452  Fostroxacitabine bralpamide hydrochloride  C22H30BrN4O8P. HCl
11984  D12453                                 Guretolimod        C24H34F3N5O4
11985  D12454                                Icenticaftor        C12H13F6N3O3
11986  D12455                              Lirafugratinib         C28H24FN7O2
11987  D12456                Lirafugratinib hydrochloride    C28H24FN7O2. HCl

Key   COMPONENT
0           NaN
1           NaN
2           NaN
3           NaN
4           NaN
...         ...
11983       NaN
11984       NaN
11985       NaN
11986       NaN
11987       NaN

[11988 rows x 4 columns]
It still needs a bit more tidying up; I'll leave that to you.
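For what it's worth, the group column can also be built with a cumulative count of the 'ENTRY' marker rows, which avoids the join/ffill machinery entirely. A self-contained sketch on a toy sample (the fixed-width layout below is only my guess at what the KEGG flat file looks like; the values are illustrative):

import pandas as pd
from io import StringIO

# illustrative stand-in for the real KEGG flat file
sample = """\
ENTRY       D00001
NAME        Water
FORMULA     H2O
///
ENTRY       D00002
NAME        Paralen
FORMULA     C8H9NO2
COMPONENT   paracetamol
///
"""

cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
df = pd.read_fwf(StringIO(sample), header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# Every 'ENTRY' row starts a new record, so a running count of those
# rows gives each record its own group number.
df['group'] = (df['Key'] == 'ENTRY').cumsum()
df = (df.pivot(index='group', columns='Key', values='Value')
        .reindex(columns=cols)
        .reset_index(drop=True))
print(df)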

Moving data from a dictionary to a pandas DataFrame

I have a dictionary where the key is two parts, one the index coordinate and the other the column coordinate. I would like to use this dictionary to populate a pandas DataFrame based on these coordinates.
For example my dictionary looks like this:
final = {('BUV395', 'BUV496'): 0, ('BUV395', 'BUV563'): 0, ('BUV395', 'BUV615'): 0, ('BUV395', 'BUV661'): 0, etc...
The input to my function is the pandas DataFrame with the original data - just to give context to the code below:
def matrix_all_pairs(df):
    dataframe = pd.DataFrame(index=range(0,len(df.index.values)),columns=range(0,len(df.index.values)))
    dataframe.columns = df.index.values
    idx = list(df.index.values)
    list_fluor = list(combinations(df.index.values, 2))
    final = {}
    for fluor in list_fluor:
        if (r2_score(df.xs(fluor[0]), df.xs(fluor[1]))) < 0:
            final[fluor] = 0
        else:
            final[fluor] = (r2_score(df.xs(fluor[0]), df.xs(fluor[1])))
    for fluor, value in list_fluor:
        x = value
        dataframe.loc(idx.index(fluor[0]), fluor[1]) = x
    dataframe.index = df.index.values
    return(dataframe)
When I try to run this, it gives me "SyntaxError: can't assign to function call" for the line:
dataframe.loc(idx.index(fluor[0]), fluor[1]) = x
Is there a better way of doing this? I've seen multiple people say that populating an empty DataFrame using a loop is messy but I'm not sure how else I could do this?
I'm not sure how to post my data for people to work with - I'm new to this site.
Is this what you're asking? The first item in each tuple is the "row/index" value and the second item is the "column" header. Essentially, you have a MultiIndex Series that you want to unstack into a single-index dataframe.
df = pd.DataFrame.from_dict(final, orient='index')
df[['index','column']] = df.index.values.tolist()
df = df.set_index(['index','column'])[0].unstack()
Your example final dictionary has only one unique key in the first tuple elements so result would be:
column  BUV496  BUV563  BUV615  BUV661
index
BUV395       0       0       0       0
Alternate example to show more obviously 2-dimensional dataframe.
final = {('BUV395', 'BUV496'): 0, ('BUV395', 'BUV563'): 0, ('BUV496', 'BUV395'): 0, ('BUV496', 'BUV563'): 0, ('BUV563', 'BUV395'): 0, ('BUV563', 'BUV496'): 0}
df = pd.DataFrame.from_dict(final, orient='index')
df[['index','column']] = df.index.values.tolist()
df = df.set_index(['index','column'])[0].unstack().rename_axis(None).rename_axis(None, axis=1)
        BUV395  BUV496  BUV563
BUV395     NaN     0.0     0.0
BUV496     0.0     NaN     0.0
BUV563     0.0     0.0     NaN
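A slightly more direct route is to let pandas build the MultiIndex straight from the tuple keys: pd.Series on a dict keyed by tuples does exactly that, so the whole reshape becomes a two-liner (a sketch using the final dict from above):

s = pd.Series(final)  # tuple keys become a two-level MultiIndex
df = s.unstack()      # outer level -> rows, inner level -> columns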

Adding multiple columns to pandas df based on row values

I would like to use a function that produces multiple outputs to create multiple new columns in an existing pandas dataframe.
For example, say I have this test function which outputs 2 things:
def testfunc(TranspoId, LogId):
    thing1 = TranspoId + LogId
    thing2 = LogId - TranspoId
    return thing1, thing2
I can give those returned outputs to 2 different variables like so:
Thing1,Thing2 = testfunc(4,28)
print(Thing1)
print(Thing2)
I tried to do this with a dataframe in the following way:
data = {'Name':['Picard','Data','Guinan'],'TranspoId':[1,2,3],'LogId':[12,14,23]}
df = pd.DataFrame(data, columns = ['Name','TranspoId','LogId'])
print(df)
df['thing1','thing2'] = df.apply(lambda row: testfunc(row.TranspoId, row.LogId), axis=1)
print(df)
What I want is something that looks like this:
data = {'Name':['Picard','Data','Guinan'],'TranspoId':[1,2,3],'LogId':[12,14,23], 'Thing1':[13,16,26], 'Thing2':[11,12,20]}
df = pd.DataFrame(data, columns=['Name','TranspoId','LogId','Thing1','Thing2'])
print(df)
In the real world that function is doing a lot of heavy lifting, and I can't afford to run it twice, once for each new variable being added to the df.
I've been hitting myself in the head with this for a few hours. Any insights would be greatly appreciated.
I believe the best way is to change the order of operations and make a function that works on whole Series.
import pandas as pd
# Create function that deals with series
def testfunc(Series1, Series2):
    Thing1 = Series1 + Series2
    Thing2 = Series2 - Series1
    return Thing1, Thing2
# Create df
data = {'Name':['Picard','Data','Guinan'],'TranspoId':[1,2,3],'LogId':[12,14,23]}
df = pd.DataFrame(data, columns = ['Name','TranspoId','LogId'])
# Apply function
Thing1,Thing2 = testfunc(df['TranspoId'],df['LogId'])
print(Thing1)
print(Thing2)
# Assign new columns
df = df.assign(Thing1 = Thing1)
df = df.assign(Thing2 = Thing2)
# print df
print(df)
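As a side note, both columns can be assigned in one call, since DataFrame.assign accepts multiple keyword arguments:

df = df.assign(Thing1 = Thing1, Thing2 = Thing2)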
Your function should return a Series that calculates the new columns in one pass. Then you can use DataFrame.apply() to add the new fields.
import pandas as pd
df = pd.DataFrame( {'TranspoId':[1,2,3], 'LogId':[4,5,6]})
def testfunc(row):
    new_cols = pd.Series([
        row['TranspoId'] + row['LogId'],
        row['LogId'] - row['TranspoId']])
    return new_cols
df[['thing1','thing2']] = df.apply(testfunc, axis = 1)
print(df)
Output:
   TranspoId  LogId  thing1  thing2
0          1      4       5       3
1          2      5       7       3
2          3      6       9       3
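If the heavy function has to keep its original two-argument signature, DataFrame.apply with result_type='expand' also spreads a returned tuple across columns in a single pass; a sketch reusing the testfunc from the question:

df[['thing1', 'thing2']] = df.apply(
    lambda row: testfunc(row['TranspoId'], row['LogId']),
    axis=1, result_type='expand')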

How to efficiently filter a pandas dataframe and return a pandas series?

The question seems simple and arguably on the verge of stupid. But given my scenario, it seems that I would have to do exactly that in order to keep a bunch of calculations across several dataframes efficient.
Scenario:
I've got a bunch of pandas dataframes where the column names are constructed from a name part and a time part, such as 'AA_2018' and 'BB_2017'. I'm doing calculations on different columns from different dataframes, so I'll have to filter out the time part. As an MCVE, let's just say that I'd like to subtract the column containing 'AA' from the column containing 'BB' and ignore all other columns in this dataframe:
import pandas as pd
import numpy as np
dates = pd.date_range('20180101',periods=3)
df = pd.DataFrame(np.random.randn(3,3),index=dates,columns=['AA_2018', 'AB_2018', 'BB_2017'])
If i knew the exact name of the columns, this can easily be done using:
diff_series = df['AA_2018'] - df['BB_2017']
This returns a pandas Series since I'm using single brackets [], as opposed to the DataFrame I'd get with double brackets [[]].
My challenge:
diff_series is of type pandas.core.series.Series. But since I've got some filtering to do, I'm using df.filter() that returns a dataframe with one column and not a series:
# in:
colAA = df.filter(like = 'AA')
# out:
# AA_2018
# 2018-01-01 0.801295
# 2018-01-02 0.860808
# 2018-01-03 -0.728886
# in:
# type(colAA)
# out:
# pandas.core.frame.DataFrame
Since colAA is of type pandas.core.frame.DataFrame, the following returns a dataframe too:
# in:
colAA = df.filter(like = 'AA')
colBB = df.filter(like = 'BB')
df_filtered = colBB - colAA
# out:
#             AA_2018  BB_2017
# 2018-01-01      NaN      NaN
# 2018-01-02      NaN      NaN
# 2018-01-03      NaN      NaN
And that is not what I'm after. This is:
# in:
diff_series = df['AA_2018'] - df['BB_2017']
# out:
# 2018-01-01    0.828895
# 2018-01-02   -1.153436
# 2018-01-03   -1.159985
Why am I adamant in doing it this way?
Because I'd like to end up with a dataframe using .to_frame() with a specified name based on the filters I've used.
My presumably inefficient approach is this:
# in:
colAA_values = [item for sublist in colAA.values for item in sublist]
# (because colAA.values returns a 2D array that has to be flattened)
colBB_values = [item for sublist in colBB.values for item in sublist]
serAA = pd.Series(colAA_values, colAA.index)
serBB = pd.Series(colBB_values, colBB.index)
df_diff = (serBB - serAA).to_frame(name = 'someFilter')
# out:
#             someFilter
# 2018-01-01   -0.828895
# 2018-01-02    1.153436
# 2018-01-03    1.159985
What I've tried / What I was hoping to work:
# in:
(df.filter(like = 'BB') - df.filter(like = 'AA')).to_frame(name = 'somefilter')
# out:
# AttributeError: 'DataFrame' object has no attribute 'to_frame'
# (Of course because df.filter() returns a one-column dataframe)
I was also hoping that df.filter() could be set to return a pandas series, but no.
I guess I could have asked this question instead: How to convert a pandas dataframe column to a pandas series? But that does not seem to have an efficient built-in one-liner either; most search results handle the other direction instead. I've been messing around with potential work-arounds for quite some time now, and an obvious solution might be right around the corner, but I'm hoping some of you have a suggestion on how to do this efficiently.
All code elements for an easy copy&paste:
import pandas as pd
import numpy as np
dates = pd.date_range('20180101',periods=3)
df = pd.DataFrame(np.random.randn(3,3),index=dates,columns=['AA_2018', 'AB_2018', 'BB_2017'])
#diff_series = df[['AA_2018']] - df[['BB_2017']]
#type(diff_series)
colAA = df.filter(like = 'AA')
colBB = df.filter(like = 'BB')
df_filtered = colBB - colAA
#type(df_filtered)
#type(colAA)
#colAA.values
# colAA.values returns a 2D array that has to be flattened for use in pd.Series
colAA_values = [item for sublist in colAA.values for item in sublist]
colBB_values = [item for sublist in colBB.values for item in sublist]
serAA = pd.Series(colAA_values, colAA.index)
serBB = pd.Series(colBB_values, colBB.index)
df_diff = (serBB - serAA).to_frame(name = 'someFilter')
# Attempts:
# (df.filter(like = 'BB') - df.filter(like = 'AA')).to_frame(name = 'somefilter')
You need the opposite of to_frame: DataFrame.squeeze converts a one-column DataFrame to a Series:
colAA = df.filter(like = 'AA')
colBB = df.filter(like = 'BB')
df_filtered = colBB.squeeze() - colAA.squeeze()
print (df_filtered)
2018-01-01   -0.479247
2018-01-02   -3.801711
2018-01-03    1.567574
Freq: D, dtype: float64
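With squeeze in hand, the one-liner the question was after (a sketch using the same filters and the frame name from the question) becomes:

df_diff = (df.filter(like='BB').squeeze()
           - df.filter(like='AA').squeeze()).to_frame(name='someFilter')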

Python: Printing dataframe to csv

I am currently using this code:
import pandas as pd
import numpy as np  # needed for np.random.randn below
AllDays = ['a','b','c','d']
TempDay = pd.DataFrame( np.random.randn(4,2) )
TempDay['Dates'] = AllDays
TempDay.to_csv('H:\MyFile.csv', index = False, header = False)
But when it writes the file, the array comes before the dates, and there is a header row. I want the dates before the TemperatureArray and no header row.
Edit:
The file currently has the TemperatureArray followed by the Dates: [TemperatureArray, Date].
-0.27724356949570034,-0.3096554106726788,a
-0.10619546908708237,0.07430127684522048,b
-0.07619665345406437,0.8474460146082116,c
0.19668718143436803,-0.8072994364484335,d
I am looking to write: [Date, TemperatureArray]
a,-0.27724356949570034,-0.3096554106726788
b,-0.10619546908708237,0.07430127684522048
c,-0.07619665345406437,0.8474460146082116
d,0.19668718143436803,-0.8072994364484335
The pandas.DataFrame.to_csv method has a keyword argument, header=True, that can be turned off to disable the header row. Using it in conjunction with index=False should solve your issue.
For example, this snippet should fix your issue:
TempDay.to_csv('C:\MyFile.csv', index=False, header=False)
Here is a full example showing how it disables the header row:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(6,4))
>>> df
          0         1         2         3
0  1.295908  1.127376 -0.211655  0.406262
1  0.152243  0.175974 -0.777358 -1.369432
2  1.727280 -0.556463 -0.220311  0.474878
3 -1.163965  1.131644 -1.084495  0.334077
4  0.769649  0.589308  0.900430 -1.378006
5 -2.663476  1.010663 -0.839597 -1.195599
>>> # just assigns sequential letters to the column
>>> df[4] = [chr(i+ord('A')) for i in range(6)]
>>> df
          0         1         2         3  4
0  1.295908  1.127376 -0.211655  0.406262  A
1  0.152243  0.175974 -0.777358 -1.369432  B
2  1.727280 -0.556463 -0.220311  0.474878  C
3 -1.163965  1.131644 -1.084495  0.334077  D
4  0.769649  0.589308  0.900430 -1.378006  E
5 -2.663476  1.010663 -0.839597 -1.195599  F
>>> # here we reindex the headers and return a copy
>>> # using this form of indexing just requires you to provide
>>> # a list with all the columns you desire and in the order desired
>>> df2 = df[[4, 1, 2, 3]]
>>> df2
   4         1         2         3
0  A  1.127376 -0.211655  0.406262
1  B  0.175974 -0.777358 -1.369432
2  C -0.556463 -0.220311  0.474878
3  D  1.131644 -1.084495  0.334077
4  E  0.589308  0.900430 -1.378006
5  F  1.010663 -0.839597 -1.195599
>>> df2.to_csv('a.txt', index=False, header=False)
>>> with open('a.txt') as f:
... print(f.read())
...
A,1.1273756275298716,-0.21165535441591588,0.4062624848191157
B,0.17597366083826546,-0.7773584823122313,-1.3694320591723093
C,-0.556463084618883,-0.22031139982996412,0.4748783498361957
D,1.131643603259825,-1.084494967896866,0.334077296863368
E,0.5893080536600523,0.9004299653290818,-1.3780062860066293
F,1.0106633581546611,-0.839597332636998,-1.1955992812601897
If you need to dynamically adjust the columns, and move the last column to the first, you can do as follows:
# this returns the columns as a list
columns = df.columns.tolist()
# removes the last column, the newest one you added
tofirst_column = columns.pop(-1)
# just move it to the start
new_columns = [tofirst_column] + columns
# then reindex with the new column order
df2 = df[new_columns]
This simply builds a Python list from the current columns, moves the last one to the front, and reindexes the columns without requiring any prior knowledge of their names.
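Putting it together for the original example, a minimal sketch (assuming the same TempDay frame as in the question) that moves 'Dates' to the front and writes the file without index or header:

import numpy as np
import pandas as pd

AllDays = ['a', 'b', 'c', 'd']
TempDay = pd.DataFrame(np.random.randn(4, 2))
TempDay['Dates'] = AllDays

# move the last column ('Dates') to the front
cols = TempDay.columns.tolist()
TempDay = TempDay[cols[-1:] + cols[:-1]]

TempDay.to_csv('MyFile.csv', index=False, header=False)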
