I have a file containing mixed information, but I only need certain columns from it.
Below is my example file.
A B C D
1 2 3 abcdef
5 6 7 abcdef
1 2 3 abcdef
I want to extract from this file only the information I need. For example, the output file should look like below.
A C D # I only need columns A, C, and D.
1 3 ab # For column D, I only need "ab".
5 7 ab
1 3 ab
It is not a comma-separated file; the columns are separated by spaces.
You can still read a space-separated file with the csv module by using the delimiter kwarg:
>>> import csv
>>> with open('/tmp/data.txt') as f:
...     reader = csv.DictReader(f, delimiter=' ')
...     for row in reader:
...         print row['A'], row['C'], row['D'][:2]
...
1 3 ab
5 7 ab
1 3 ab
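If you also want to write the result to a new file rather than just print it, a minimal sketch of that step (assuming Python 3 and an output path of your choosing) could look like this:

import csv

with open('/tmp/data.txt') as f, open('/tmp/out.txt', 'w', newline='') as out:
    reader = csv.DictReader(f, delimiter=' ')
    writer = csv.writer(out, delimiter=' ')
    writer.writerow(['A', 'C', 'D'])  # header of the kept columns
    for row in reader:
        writer.writerow([row['A'], row['C'], row['D'][:2]])  # keep only the first two characters of D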
If you want something more general for managing data structures, the easiest thing you can do is use Python libraries to ease the job.
You can use the pandas library (Python Data Analysis Library) to rapidly parse the file into a DataFrame, which provides methods to do what you want.
You also need the NumPy library, because the as_matrix method (below) returns a NumPy array.
You can see your data file as a csv (comma-separated values) file that simply uses spaces as separators.
With pandas you can easily parse the file with read_csv:
import pandas as pd
import numpy as np
dataFrame = pd.read_csv("file.txt", sep = ' ')
To select columns you use the as_matrix method:
selection = dataFrame.as_matrix(['A', 'C', 'D'])
Then you can cast it back to a DataFrame to keep using its methods:
newDataFrame = pd.DataFrame(selection)
Dropping "cdef" of the "abcdef" values in the column D looks like a thing that can be solved by a simple for, and using [String][5] methods provided by Python. Its a very particular instruction and i don't know any implemented method of any library that accomplishes this.
I hope this helps.
PS: I tried to post a lot of links but the system didn't let me. I recommend looking up NumPy and pandas on Google if you don't have them.
You should check the pandas DataFrame docs for the available methods. In case you didn't understand what I did, look up the pandas.read_csv and pandas.DataFrame.as_matrix docs on Google.
And if you don't know how to work with strings, look in the Python docs for str.
Edit: Anyway, if you don't want to use libraries, you can parse the txt file into a list of lists imitating a matrix, or use the csv structure that wim mentions in his answer. Then create a function that drops columns by checking the first element of every column (the column identifier) and, with a few for loops, copies the rest into another matrix, as shown in the sketch below.
Then create another function that trims the unwanted part of the values in a column, with a few more for loops.
The point is that using functions to accomplish what you want makes the solution generic for any table managed as a matrix.
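A minimal sketch of that idea without third-party libraries (the column names to keep and the trimming of D are assumptions based on the example above):

def keep_columns(matrix, wanted):
    # matrix[0] holds the column identifiers; keep only the wanted ones
    indices = [i for i, name in enumerate(matrix[0]) if name in wanted]
    return [[row[i] for i in indices] for row in matrix]

def trim_column(matrix, name, length):
    # shorten every value in the named column to the given length
    col = matrix[0].index(name)
    for row in matrix[1:]:
        row[col] = row[col][:length]
    return matrix

with open('/tmp/data.txt') as f:
    table = [line.split() for line in f]

table = keep_columns(table, ['A', 'C', 'D'])
table = trim_column(table, 'D', 2)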
If you have more than one column like D and want to do the same thing to each of them, you can do the following, provided you're OK with selecting columns by index instead of by letter:
# your data like this
A B C D E
1 2 3 abcdef abbbb
5 6 7 abcdef abbbb
1 2 3 abcdef abbbb
Import csv, then:
>>> with open('yourdata.txt') as f:
...     reader = csv.reader(f, delimiter=' ')
...     for row in reader:
...         print(row[0], row[1], *[c[:2] for c in row[3:]])
...
A B D E
1 2 ab ab
5 6 ab ab
1 2 ab ab
The * operator before [c[:2] for c in row[3:]] is for argument unpacking. * basically converts [1, 2, 3] into 1, 2, 3, so print(*[1, 2, 3]) is identical to print(1, 2, 3). It works on tuples as well.
However, this is Python 3. If you are using Python 2, print will give you a syntax error, but you can write a wrapper function that takes the unpacked list arguments and replace print with it:
def myprint(*args):
    print ' '.join([str(i) for i in args])
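Then the loop body above simply becomes myprint(row[0], row[1], *[c[:2] for c in row[3:]]).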
Related
I'm trying to clean data that contains a lot of partial duplicates, keeping only the first row of data when the key in Col A is duplicated.
A B C D
0 foo bar lor ips
1 foo bar
2 test do kin ret
3 test do
4 er ed ln pr
expected output after cleaning
A B C D
0 foo bar lor ips
1 test do kin ret
2 er ed ln pr
I have been looking at methods such as drop_duplicates or even groupby, but they don't really help in my case: the duplicates are partial, since some rows contain empty data and only share values in col A and B.
A group by partially works, but it doesn't return the transformed data; it just filters through.
I'm very new to pandas and pointers are appreciated. I could probably do it outside pandas, but I'm thinking there might be a better way to do it.
Edit: sorry, I just noticed a mistake I made in the provided example ("test" had become "tes").
In your case, what exactly counts as a partial duplicate? Please provide a more complete example. In the example above, instead of deduplicating on Col A you could try Col B.
The expected output can be obtained with the following snippet:
print (df.drop_duplicates(subset=['B']))
Note: the suggested solution only works for the sample above; it won't work when rows have different Col A values but the same Col B value.
Being able to define the ranges in a manner similar to Excel, i.e. 'A5:B10', is important for what I need, so reading the entire sheet into a dataframe isn't very useful.
So what I need to do is read the values from multiple ranges in the Excel sheet to multiple different dataframes.
valuerange1 = ['a5:b10']
valuerange2 = ['z10:z20']
df = pd.DataFrame(values from valuerange)
df = pd.DataFrame(values from valuerange1)
or
df = pd.DataFrame(values from ['A5:B10'])
I have searched but either I have done a very poor job of searching or everyone else has gotten around this problem but I really can't.
Thanks.
Using openpyxl
Since you have indicated that you are looking for a very user-friendly way to specify the range (like the Excel syntax), and as Charlie Clark already suggested, you can use openpyxl.
The following utility function takes a workbook and a column/row range and returns a pandas DataFrame:
from openpyxl import load_workbook
from openpyxl.utils import get_column_interval
import pandas as pd
import re

def load_workbook_range(range_string, ws):
    col_start, col_end = re.findall("[A-Z]+", range_string)
    data_rows = []
    # the worksheet can be sliced directly with the range string, e.g. ws['B1:C2']
    for row in ws[range_string]:
        data_rows.append([cell.value for cell in row])
    return pd.DataFrame(data_rows, columns=get_column_interval(col_start, col_end))
Usage:
wb = load_workbook(filename='excel-sheet.xlsx',
read_only=True)
ws = wb.active
load_workbook_range('B1:C2', ws)
Output:
B C
0 5 6
1 8 9
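To get several ranges into several DataFrames, as asked in the question, you can simply call the helper once per range (the range strings below are the ones from the question, written in upper case because the helper extracts the column letters with [A-Z]+):

df1 = load_workbook_range('A5:B10', ws)
df2 = load_workbook_range('Z10:Z20', ws)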
Pandas only Solution
Given the following data in an excel sheet:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
You can load it with the following command:
pd.read_excel('excel-sheet.xlsx')
If you want to limit the data being read, the pandas.read_excel method offers a number of options. Use the parse_cols, skiprows and skip_footer arguments to select the specific subset you want to load (note that in newer pandas versions these arguments have been renamed to usecols and skipfooter):
pd.read_excel(
    'excel-sheet.xlsx',    # name of the excel file
    names=['B', 'C'],      # new column headers
    skiprows=range(0, 1),  # list of rows you want to omit at the beginning
    skip_footer=1,         # number of rows you want to skip at the end
    parse_cols='B:C'       # columns to parse (note the excel-like syntax)
)
Output:
B C
0 5 6
1 8 9
Some notes:
The API of the read_excel method is not meant to support more complex selections. In case you require a complex filter it is much easier (and cleaner) to load the whole data into a DataFrame and use the excellent slicing and indexing mechanisms provided by pandas.
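For instance, a minimal sketch of that approach (assuming the sheet layout is fixed, so that the Excel range 'A5:B10' corresponds to known row and column positions):

df = pd.read_excel('excel-sheet.xlsx', header=None)
# Excel range 'A5:B10' -> 0-based rows 4..9 and columns 0..1
sub = df.iloc[4:10, 0:2]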
The easiest way is to use pandas to get a range of values from Excel.
import pandas as pd
# if you want to read a single column range, you can use the method below
src = pd.read_excel(r'August.xlsx', usecols='A:C', sheet_name='S')
# if you have multiple ranges, e.g. columns A:C together with some other range
src = pd.read_excel(r'August.xlsx', usecols='A:C,G:I', sheet_name='S')
If you want a particular cell range, for example "B3:E5", you can use the following structure (header=2 makes row 3 the header row, and [0:2] keeps the next two rows):
src=pd.read_excel(r'August.xlsx',usecols='B:E',sheet_name='S',header=2)[0:2]
I'm trying to do something that I think should be a one-liner, but am struggling to get it right.
I have a large dataframe, we'll call it lg, and a small dataframe, we'll call it sm. Each dataframe has a start and an end column, and multiple other columns all of which are identical between the two dataframes (for simplicity, we'll call all of those columns type). Sometimes, sm will have the same start and end as lg, and if that is the case, I want sm's type to overwrite lg's type.
Here's the setup:
lg = pd.DataFrame({'start':[1,2,3,4], 'end':[5,6,7,8], 'type':['a','b','c','d']})
sm = pd.DataFrame({'start':[9,2,3], 'end':[10,6,11], 'type':['e','f','g']})
...note that the only matching ['start','end'] combo is ['2','6']
My desired output:
start end type
0 1 5 a
1 2 6 f # where sm['type'] overwrites lg['type'] because of matching ['start','end']
2 3 7 c
3 3 11 g # where there is no overwrite because 'end' does not match
4 4 8 d
5 9 10 e # where this row is added from sm
I've tried multiple versions of .merge(), merge_ordered(), etc. but to no avail. I've actually gotten it to work with merge_ordered() and drop_duplicates() only to realize that it was simply dropping the duplicate that was earlier in the alphabet, not because it was from sm.
You can set the start and end columns as the index and then use combine_first:
sm.set_index(['start', 'end']).combine_first(lg.set_index(['start', 'end'])).reset_index()
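On the sample frames this should give the desired result: sm's type wins on the matching ['start', 'end'] pair, and the non-matching rows of both frames are kept:

   start  end type
0      1    5    a
1      2    6    f
2      3    7    c
3      3   11    g
4      4    8    d
5      9   10    e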
I have a large csv file with 25 columns, that I want to read as a pandas dataframe. I am using pandas.read_csv().
The problem is that some rows have extra columns, something like that:
col1 col2 stringColumn ... col25
1 12 1 str1 3
...
33657 2 3 str4 6 4 3 #<- that line has a problem
33658 1 32 blbla #<-some columns have missing data too
When I try to read it, I get the error
CParserError: Error tokenizing data. C error: Expected 25 fields in line 33657, saw 28
The problem does not happen if the extra values appear in the first rows. For example, if I add values to the third row of the same file, it works fine:
#that example works:
col1 col2 stringColumn ... col25
1 12 1 str1 3
2 12 1 str1 3
3 12 1 str1 3 f 4
...
33657 2 3 str4 6 4 3 #<- that line has a problem
33658 1 32 blbla #<-some columns have missing data too
My guess is that pandas checks the first (n) rows to determine the number of columns, and if you have extra columns after that it has a problem parsing it.
Skipping the offending lines like suggested here is not an option, those lines contain valuable information.
Does anybody know a way around this?
In my initial post I mentioned not using error_bad_lines=False in pandas.read_csv. I have since decided that doing so is actually the more proper and elegant solution. I found this post quite useful:
Can I redirect the stdout in python into some sort of string buffer?
I added a little twist to the code shown in the answer.
import sys
import re
from cStringIO import StringIO  # on Python 3, use: from io import StringIO
import pandas as pd

fake_csv = '''1,2,3\na,b,c\na,b,c\na,b,c,d,e\na,b,c\na,b,c,d,e\na,b,c\n'''  # bad data
fname = "fake.csv"

# temporarily capture stderr, where read_csv reports the skipped lines
old_stderr = sys.stderr
sys.stderr = mystderr = StringIO()

df1 = pd.read_csv(StringIO(fake_csv),
                  error_bad_lines=False)

sys.stderr = old_stderr
log = mystderr.getvalue()

# pull the numbers (line number, expected fields, seen fields) out of each warning
isnum = re.compile(r"\d+")
lines_skipped_log = [
    isnum.findall(i) + [fname]
    for i in log.split("\n") if isnum.search(i)
]
columns = ["line_num", "flds_expct", "num_fields", "file"]
lines_skipped_log.insert(0, columns)
From there you can do anything you want with lines_skipped_log such as output to csv, create a dataframe etc.
Perhaps you have a directory full of files. You can create a list of pandas data frames out of each log and concatenate. From there you will have a log of what rows were skipped and for which files at your fingertips (literally!).
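For example, a minimal sketch of that last step (assuming you have collected one lines_skipped_log per file in a list called all_logs):

all_logs = [lines_skipped_log]  # in practice: one log list per file in the directory
skipped_frames = [pd.DataFrame(log[1:], columns=log[0]) for log in all_logs]
all_skipped = pd.concat(skipped_frames, ignore_index=True)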
A possible workaround is to specify the column names. Please refer to my answer to a similar issue: https://stackoverflow.com/a/43145539/6466550
Since I did not find an answer that completely solves the problem, here is my workaround: I found out that explicitly passing the column names with the option names=('col1', 'col2', 'stringColumn' ... 'column25', '', '', '') allows me to read the file. It forces me to read and parse every column, which is not ideal since I only need about half of them, but at least I can read the file now.
Combining the names and usecols arguments does not work; if somebody has another solution I would be happy to hear it.
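A minimal sketch of that workaround (the file name and padded column names are hypothetical; you need as many extra names as the longest bad row has surplus fields):

import pandas as pd

# 25 real columns plus a few placeholders for the rows that spill over
cols = ['col%d' % i for i in range(1, 26)] + ['extra1', 'extra2', 'extra3']
df = pd.read_csv('data.csv', names=cols, header=0)  # header=0 replaces the existing header row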
I'm trying to do some conditional parsing of excel files into Pandas dataframes. I have a group of excel files and each has some number of lines at the top of the file that are not part of the data -- some identification data based on what report parameters were used to create the report.
I want to use the ExcelFile.parse() method with skiprows=some_number but I don't know what some_number will be for each file.
I do know that the HeaderRow will start with one member of a list of possibilities. How can I tell Pandas to create the dataframe starting on the row that includes any some_string in my list of possibilities?
Or, is there a way to import the entire sheet and then remove the rows preceding the row that includes any some_string in my list of possibilities?
Most of the time I would just post-process this in pandas, i.e. diagnose, remove the rows, and correct the dtypes. This has the benefit of being easier but is arguably less elegant (I suspect it'll also be faster doing it this way!):
In [11]: df = pd.DataFrame([['blah', 1, 2], ['some_string', 3, 4], ['foo', 5, 6]])
In [12]: df
Out[12]:
0 1 2
0 blah 1 2
1 some_string 3 4
2 foo 5 6
In [13]: df[0].isin(['some_string']).argmax() # assuming it's found
Out[13]: 1
I may actually write this in plain Python, as there's probably little/no benefit in vectorizing (and I find this more readable):
def to_skip(df, preceding):
    for i, s in enumerate(df[0]):
        if s in preceding:
            return i
    raise ValueError("No preceding string found in first column")
In [21]: preceding = ['some_string']
In [22]: to_skip(df, preceding)
Out[22]: 1
In [23]: df.iloc[1:] # or whatever you need to do
Out[23]:
0 1 2
1 some_string 3 4
2 foo 5 6
The other possibility, messing about with ExcelFile and finding the row number, could be done with a for-loop as above but in openpyxl or similar. However, I don't think there would be a way to read the excel file (xml) just once if you do this.
This is somewhat unfortunate when compared to how you could do this on a csv, where you can read the first few lines (until you see the row/entry you want), and then pass this opened file to read_csv. (If you can export your Excel spreadsheet to csv then parse in pandas, that would be faster/cleaner...)
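A minimal sketch of that csv trick (the file name and marker list are assumptions; the idea is to scan with readline until the header row appears, then hand the still-open file to read_csv):

import pandas as pd

preceding = ['some_string']
with open('data.csv') as f:
    while True:
        pos = f.tell()  # remember where the current line starts
        line = f.readline()
        if not line or any(s in line for s in preceding):
            break
    f.seek(pos)  # rewind to the header row
    df = pd.read_csv(f)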
Note: read_excel isn't really that fast anyways (esp. compared to read_csv)... so IMO you want to get to pandas asap.