Reading Excel and storing data with xlrd - Python

I have this data in an Excel sheet:
FT_NAME    FC_NAME   C_NAME
FT_NAME1   FC1       C1
FT_NAME2   FC21      C21
           FC22      C22
FT_NAME3   FC31      C31
           FC32      C32
FT_NAME4   FC4       C4
where the column names are FT_NAME, FC_NAME, C_NAME.
I want to store these values in a data structure for further use. Currently I am trying to store them in a list of lists, but I could not do so with the following code:
i = 4
oc = sheet.cell(i, 8).value
fcl, ocl = [], []
while oc:
    ft = sheet.cell(i, 6).value
    fc = sheet.cell(i, 7).value
    oc = sheet.cell(i, 8).value
    if ft:
        self.foreign_tables.append(ft)
        fcl.append(fc)
        ocl.append(oc)
        self.foreign_col.append(fcl)
        self.own_col.append(ocl)
        fcl, ocl = [], []
    else:
        fcl.append(fc)
        ocl.append(oc)
    i += 1
I expect the output to be:
ft = [FT_NAME1, FT_NAME2, FT_NAME3, FT_NAME4]
fc = [FC1, [FC21, FC22], [FC31, FC32], FC4]
oc = [C1, [C21, C22], [C31, C32], C4]
Could anyone please suggest a better, more Pythonic solution?

You can use pandas. It reads the data into a DataFrame, which is essentially a dictionary-like table of columns.
import pandas as pd

data = pd.read_excel('file.xlsx', 'Sheet1')
data = data.fillna(method='pad')  # forward-fill so each row gets its FT_NAME
print(data)
It gives the following output:
FT_NAME FC_NAME C_NAME
0 FT_NAME1 FC1 C1
1 FT_NAME2 FC21 C21
2 FT_NAME2 FC22 C22
3 FT_NAME3 FC31 C31
4 FT_NAME3 FC32 C32
5 FT_NAME4 FC4 C4
To get the sublist structure, try this function:
def group(data):
    output = []
    names = list(set(data['FT_NAME'].values))
    names.sort()
    output.append(names)
    headernames = list(data.columns)
    headernames.pop(0)
    for ci in list(headernames):
        column_group = []
        column_data = data[ci].values
        for name in names:
            column_group.append(list(column_data[data['FT_NAME'].values == name]))
        output.append(column_group)
    return output
If you call it like this:
ft, fc, oc = group(data)
print(ft)
print(fc)
print(oc)
you get the following output:
['FT_NAME1', 'FT_NAME2', 'FT_NAME3', 'FT_NAME4']
[['FC1'], ['FC21', 'FC22'], ['FC31', 'FC32'], ['FC4']]
[['C1'], ['C21', 'C22'], ['C31', 'C32'], ['C4']]
which is what you want, except that the single elements are now also wrapped in lists.
It is not the cleanest method, but it gets the job done.
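For comparison, here is a shorter sketch of the same grouping using pandas groupby. It assumes the padded DataFrame from above (the variable data) and, like the function, keeps single elements wrapped in lists:
# group rows by FT_NAME; iterating a GroupBy yields (name, sub-frame) pairs in sorted key order
grouped = data.groupby('FT_NAME', sort=True)
ft = [name for name, g in grouped]
fc = [list(g['FC_NAME']) for name, g in grouped]
oc = [list(g['C_NAME']) for name, g in grouped]
print(ft)
print(fc)
print(oc)
This iterates the groups three times, but for a sheet of this size that hardly matters, and it avoids the manual boolean masking inside group().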
Hope it helps.

Related

Substituting variable in a dataframe row based on other row's value

I have a dataframe that contains an ID column, a Formula column, and a Dependent ID column holding the IDs I extracted from the Formula column.
Now I have to substitute all the dependent IDs into the formulas based on the dataframe.
My approach is to run a nested loop over the rows, substituting each dependent ID in the formula using the replace function, and to keep looping until no more substitutions are possible. However, I don't know where to begin and I am not sure this is the correct approach.
I am wondering if there is any function that can make the process easier?
Here is the code to create the current dataframe:
data = pd.DataFrame({'ID': ['A1', 'A3', 'B2', 'C2', 'D3', 'E3'],
                     'Formula': ['C2/500', 'If B2 >10 then (B2*D3) + 100 else D3+10',
                                 'E3/2 +20', 'E3/2 +20', 'var_i', 'var_x'],
                     'Dependent ID': ['C2', 'B2, D3', 'E3', 'D3, E3', '', '']})
Here are examples of my current dataframe and my desired end result.
Current dataframe:
Desired end result:
Recursively replace each dependent ID inside a formula with that ID's own formula:
import pandas as pd

df = pd.DataFrame({'ID': ['A1', 'A3', 'B2', 'C2', 'D3', 'E3'],
                   'Formula': ['C2/500', 'If B2 >10 then (B2*D3) + 100 else D3+10',
                               'E3/2 +20', 'D3+E3', 'var_i', 'var_x'],
                   'Dependent ID': ['C2', 'B2,D3', 'E3', 'D3,E3', '', '']})

def find_formula(formula: str, ids: str):
    # replace all the ids inside formula with the correct formula
    if ids == '':
        return formula
    for x in ids.split(','):
        sub_formula = df.loc[df['ID'] == x, 'Formula'].values[0]
        sub_id = df.loc[df['ID'] == x, 'Dependent ID'].values[0]
        formula = formula.replace(x, find_formula(sub_formula, sub_id))
    return formula

df['new_formula'] = df.apply(lambda x: find_formula(x['Formula'], x['Dependent ID']), axis=1)
Output:
   ID                       Formula Dependent ID                                        new_formula
0  A1                        C2/500           C2                                    var_i+var_x/500
1  A3  If B2 >10 then (B2*D3) + ...        B2,D3  If var_x/2 +20 >10 then (var_x/2 +20*var_i) + ...
2  B2                      E3/2 +20           E3                                        var_x/2 +20
3  C2                         D3+E3        D3,E3                                        var_i+var_x
4  D3                         var_i                                                           var_i
5  E3                         var_x                                                           var_x

Python pandas read_csv merge every two columns and read them as a dataframe

I'm a beginner in Python and pandas, trying to figure out how to read from a CSV file in a particular way.
My datafile:
01 AAA1234 AAA32452 AAA123123 0 -9 C C A A T G A G .......
01 AAA1334 AAA12452 AAA125123 1 -9 C A T G T G T G .......
...
...
...
So I have 100,000 columns in this file and I want to merge every two columns into one, but the merging needs to start after the first 6 columns. I would prefer to do this while reading the file, if possible, instead of manipulating this huge datafile.
Desired outcome
01 AAA1234 AAA32452 AAA123123 0 -9 CC AA TG AG .......
01 AAA1334 AAA12452 AAA125123 1 -9 CA TG TG TG .......
...
...
...
That will result in a dataframe with half the columns. My datafile has no column names; the names reside in a different CSV, but that is another subject.
I'd appreciate a solution, thanks in advance!
Separate the data frame initially; I created one for experimental purposes.
Then I defined a function and passed the dataframe that needed manipulation into it as an argument:
def columns_joiner(data):
    new_data = pd.DataFrame()
    for i in range(0, 11, 2):  # You can change the range to your wish
        # Here, I had only 10 columns to concatenate (therefore the range ends at 11)
        ser = data[i] + data[i + 1]
        new_data = pd.concat([new_data, ser], axis=1)
    return new_data
I don't think this is an efficient solution. But it worked for me.
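For the specific case in the question (merging only after the first 6 columns, right after reading), here is a minimal sketch. It is untested against the real file, assumes a whitespace-separated file with no header row, and the filename datafile.txt is just a placeholder:
import pandas as pd

# Read the raw file: no header row, columns separated by whitespace
df = pd.read_csv('datafile.txt', sep=r'\s+', header=None)

fixed = df.iloc[:, :6]   # first 6 columns stay unchanged
pairs = df.iloc[:, 6:]   # remaining columns get merged two by two

# Concatenate every pair of string columns (e.g. 'C' + 'C' -> 'CC')
merged = pd.concat(
    [pairs.iloc[:, i] + pairs.iloc[:, i + 1] for i in range(0, pairs.shape[1], 2)],
    axis=1)

result = pd.concat([fixed, merged], axis=1)
result.columns = range(result.shape[1])   # renumber the columns 0..n-1
Strictly speaking this still merges after reading rather than during it, but it avoids building any intermediate copies beyond the merged frame.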

How to remove outliers in a text dataframe?

I'm writing a program that reads a text file and sorts the data into name, job, company and location fields in the form of a pandas dataframe. The location field is the same for all of the rows except for one or two outliers. I want to remove these rows from the df and put them in a separate list.
Example:
Name Job Company Location
1. n1 j1 c1 l
2. n2 j2 c2 l
3. n3 j3 c3 x
4. n4 j4 c4 l
Is there a way to remove only the row with location 'x' (row 3)?
I would extract the two groups into separate DataFrames:
same_df = df.query('Location == "<onethatisthesame>"')
Then I would repeat this using != to get the others:
other_df = df.query('Location != "<onethatisthesame>"')
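A small usage sketch on the example data from the question (assuming the common location is 'l'); the outliers can then be kept as a separate list:
import pandas as pd

df = pd.DataFrame({'Name': ['n1', 'n2', 'n3', 'n4'],
                   'Job': ['j1', 'j2', 'j3', 'j4'],
                   'Company': ['c1', 'c2', 'c3', 'c4'],
                   'Location': ['l', 'l', 'x', 'l']})

same_df = df.query('Location == "l"')     # rows with the common location
other_df = df.query('Location != "l"')    # the outlier row(s), here location 'x'

outliers = other_df.to_dict('records')    # outliers as a separate list of row dicts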
You can use:
import pandas as pd

# General pattern: df = df[df['location'] == yourRepeatedValue]
df = pd.DataFrame(columns=['location'])
df.at[1, 'location'] = 'mars'
df.at[2, 'location'] = 'pluto'
df.at[3, 'location'] = 'mars'
print(df)

df = df[df['location'] == 'mars']
print(df)
This will create a new DataFrame that only contains yourRepeatedValue.
In the example, the new df won't contain the rows whose location differs from 'mars'.
The output would be:
location
1 mars
2 pluto
3 mars
location
1 mars
3 mars

Create a new column based on previous row value and delete the current row

I have an input dataframe which can be generated from the code given below
df = pd.DataFrame({'subjectID': [1, 1, 2, 2],
                   'keys': ['H1Date', 'H1', 'H2Date', 'H2'],
                   'Values': ['10/30/2006', 4, '8/21/2006', 6.4]})
The input dataframe looks as shown below.
This is what I did:
s1 = df.set_index('subjectID').stack().reset_index()
s1.rename(columns={0: 'values'}, inplace=True)
d1 = s1[s1['level_1'].str.contains('Date')]
d2 = s1[~s1['level_1'].str.contains('Date')]
d1['g'] = d1.groupby('subjectID').cumcount()
d2['g'] = d2.groupby('subjectID').cumcount()
d3 = pd.merge(d1, d2, on=['subjectID', 'g'], how='left').drop(['g', 'level_1_x', 'level_1_y'], axis=1)
Though it works, I am afraid this may not be the best approach, as we might have more than 200 columns and 50k records. Any help to improve my code further would be very welcome.
I expect my output dataframe to look as shown below.
Maybe something like:
s = df.groupby(df['keys'].str.contains('Date').cumsum()).cumcount() + 1
final = (df.assign(s=s.astype(str))
           .set_index(['subjectID', 's'])
           .unstack()
           .sort_values(by='s', axis=1))
final.columns = final.columns.map(''.join)
print(final)
            keys1     Values1 keys2 Values2
subjectID
1          H1Date  10/30/2006    H1       4
2          H2Date   8/21/2006    H2     6.4
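An alternative sketch built on the same Date/non-Date split idea from the question; it assumes that within each subjectID the rows always alternate as a Date row followed by its value row:
dates = df[df['keys'].str.contains('Date')].reset_index(drop=True)
values = df[~df['keys'].str.contains('Date')].reset_index(drop=True)

# pair each Date row with the value row that follows it
paired = pd.DataFrame({'subjectID': dates['subjectID'],
                       'keys1': dates['keys'], 'Values1': dates['Values'],
                       'keys2': values['keys'], 'Values2': values['Values']})
print(paired)
For the example input this gives one row per subjectID with the same four paired columns as above.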

Input split for Map function in Hadoop

This is my first implementation in Hadoop. I am trying to implement my algorithm for a probabilistic dataset in MapReduce. In my dataset, the last column holds an id (the number of unique ids in the dataset equals the number of nodes in my cluster). I have to divide my dataset based on this column value, and each set of records should be processed by one of the nodes in my cluster.
For example, if I have three nodes in my cluster, then for the dataset below one node should process all the records with id=1, another those with id=2, and another those with id=3.
name time dept id
--------------------
b1 2:00pm z1 1
b2 3:00pm z2 2
c1 4:00pm y2 1
b3 3:00pm z3 3
c4 4:00pm x2 2
My map function should take each split as input and process it in parallel on each node.
I am just trying to understand which approach is possible in Hadoop: either feed this dataset as input to my map function and pass an additional argument to split the data based on the id value, or split the data beforehand into "n" (number of nodes) subsets and load them onto the nodes. If the latter is the right approach, how can the data be split based on a value and loaded onto different nodes? From my reading, Hadoop splits the data into blocks based on a specified size, so how can we specify a particular condition while loading? Just to add, I am writing my program in Python.
Could someone please advise? Thanks.
The simplest thing for you would probably be to have the mapper output the data with the id as key, which will guarantee that one reducer will get all the records for a specific id and then do your processing in the reducer phase.
For example,
Input data:
b1 2:00pm z1 1
b2 3:00pm z2 2
c1 4:00pm y2 1
b3 3:00pm z3 3
c4 4:00pm x2 2
Mapper code:
#!/usr/bin/env python
import sys
for line in sys.stdin:
line = line.strip()
cols = line.split("\t")
key = cols[-1]
print key + "\t" + line
Map output:
1 b1 2:00pm z1 1
2 b2 3:00pm z2 2
1 c1 4:00pm y2 1
3 b3 3:00pm z3 3
2 c4 4:00pm x2 2
Reducer 1 input:
1 b1 2:00pm z1 1
1 c1 4:00pm y2 1
Reducer 2 input:
2 b2 3:00pm z2 2
Reducer 3 input:
3 b3 3:00pm z3 3
Reducer code:
#!/usr/bin/env python
import sys
for line in sys.stdin:
line = line.strip()
cols = line.split("\t")
orig_line = "\t".join(cols[1:])
# do stuff...
Note that this way a single reducer might get several keys, but the data will be ordered and you can control the number of reducers with the mapred.reduce.tasks option.
EDIT
If you want to collect your data in the reducer per key, you can do something like this (not sure it will run as-is, but you get the idea):
#!/usr/bin/env python
import sys

def process_data(key_id, data_list):
    # data_list has all the lines for key_id
    pass  # do stuff here...

last_key = None
data = []
for line in sys.stdin:
    line = line.strip()
    cols = line.split("\t")
    key = cols[0]
    if last_key and key != last_key:
        process_data(last_key, data)
        data = []
    orig_line = "\t".join(cols[1:])
    data.append(orig_line)
    last_key = key
process_data(last_key, data)
If you aren't worried about running out of memory in the reducer step you can simplify the code like this:
#!/usr/bin/env python
import sys
from collections import defaultdict

def process_data(key_id, data_list):
    # data_list has all the lines for key_id
    pass  # do stuff here...

all_data = defaultdict(list)
for line in sys.stdin:
    line = line.strip()
    cols = line.split("\t")
    key = cols[0]
    orig_line = "\t".join(cols[1:])
    all_data[key].append(orig_line)

for key, data in all_data.iteritems():
    process_data(key, data)
If I understood your question correctly, the best way is to load your dataset into a Hive table and then write a UDF in Python. After that, do something like this:
select your_python_udf(name, time, dept, id) from table group by id;
This looks like a reduce phase, so you may need to set this before launching the query:
set mapred.reduce.tasks=50;
How to create a custom UDF:
Hive Plugins
Create Function
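In practice, a Python "UDF" for Hive is often wired in as a streaming script via Hive's TRANSFORM clause. A minimal sketch (hypothetical script name process_group.py, assuming the columns name, time, dept, id arrive tab-separated) might look like this:
#!/usr/bin/env python
import sys
from collections import defaultdict

# Collect all rows per id, then process each group
groups = defaultdict(list)
for line in sys.stdin:
    name, time, dept, row_id = line.strip().split("\t")
    groups[row_id].append((name, time, dept))

for row_id, rows in groups.items():
    # do stuff with all the rows of one id, then emit tab-separated output
    print(row_id + "\t" + str(len(rows)))
It would typically be hooked up with ADD FILE process_group.py; followed by something like SELECT TRANSFORM(name, time, dept, id) USING 'python process_group.py' AS (id, result) FROM your_table CLUSTER BY id; (the table and output column names here are placeholders).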
