I'm trying to do some conditional parsing of excel files into Pandas dataframes. I have a group of excel files and each has some number of lines at the top of the file that are not part of the data -- some identification data based on what report parameters were used to create the report.
I want to use the ExcelFile.parse() method with skiprows=some_number but I don't know what some_number will be for each file.
I do know that the header row will start with one member of a list of possibilities. How can I tell Pandas to create the dataframe starting on the row that includes any some_string from my list of possibilities?
Or, is there a way to import the entire sheet and then remove the rows preceding the row that includes any some_string in my list of possibilities?
Most of the time I would just post-process this in pandas, i.e. read everything in, then diagnose, remove the offending rows, and correct the dtypes. This has the benefit of being easier, but it's arguably less elegant (I suspect it'll also be faster doing it this way!):
In [11]: df = pd.DataFrame([['blah', 1, 2], ['some_string', 3, 4], ['foo', 5, 6]])
In [12]: df
Out[12]:
0 1 2
0 blah 1 2
1 some_string 3 4
2 foo 5 6
In [13]: df[0].isin(['some_string']).argmax() # assuming it's found
Out[13]: 1
I'd actually write this in plain Python, as there's probably little/no benefit in vectorizing it (and I find this more readable):
def to_skip(df, preceding):
    for i, s in enumerate(df[0]):
        if s in preceding:
            return i
    raise ValueError("No preceding string found in first column")
In [21]: preceding = ['some_string']
In [22]: to_skip(df, preceding)
Out[22]: 1
In [23]: df.iloc[1:] # or whatever you need to do
Out[23]:
0 1 2
1 some_string 3 4
2 foo 5 6
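For completeness, a hedged sketch of how those two steps could be glued together (the file name and the candidate header strings are placeholders, and it assumes the first cell of the header row matches one of the candidates exactly):

import pandas as pd

possible_headers = ['some_string', 'another_possibility']  # hypothetical candidates

# Read the whole sheet with no header, so the junk rows come in as ordinary data
raw = pd.read_excel('report.xlsx', header=None)

# Position of the first row whose first cell is one of the candidates
header_row = raw[0].isin(possible_headers).idxmax()

# Use that row as the header and keep only the rows below it
df = raw.iloc[header_row + 1:].reset_index(drop=True)
df.columns = raw.iloc[header_row]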
The other possibility is messing about with ExcelFile and finding the row number first (again with a for-loop as above, but in openpyxl or similar). However, I don't think there would be a way to read the Excel file (xml) just once if you do this.
This is somewhat unfortunate compared to how you could do this with a csv, where you can read the first few lines (until you see the row/entry you want) and then pass the opened file to read_csv. (If you can export your Excel spreadsheet to csv and then parse it in pandas, that would be faster/cleaner...)
Note: read_excel isn't really that fast anyway (especially compared to read_csv)... so IMO you want to get into pandas ASAP.
Related
I have a multidimensional NumPy array read from a CSV file. I want to retrieve rows where a certain column matches a given value, determined dynamically.
My current array is
[[LIMS_AY60_51X, AY60_51X_61536153d7cdc55.png, 857.61389, 291.227, NO, 728.322,865.442]
[LIMS_AY60_52X, AY60_52X_615f6r53d7cdc55.png, 867.61389, 292.227, NO, 728.322,865.442]
[LIMS_AY60_53X, AY60_53X_615ft153d7cdc55.png, 877.61389, 293.227, NO, 728.322,865.442]
[LIMS_AY60_54X, AY60_54X_615u6153d7cdc55.png, 818.61389, 294.227, NO, 728.322,865.442]
[LIMS_AY60_55X, AY60_55X_615f615od7cdc55.png, 847.61389, 295.227, NO, 728.322,865.442]......]
I would like to use np.where to extract the rows matching the following criterion (second column value equal to 'AY60_52X_615f6r53d7cdc55.png'):
np.where ((vals == (:,'AY60_52X_615f6r53d7cdc55.png',:,:,:,:,:)).all(axis=1))
This attempt fails with a syntax error:
File "<ipython-input-31-a28fe9729cd4>", line 3
np.where ((vals == (:,'AY60_52X_615f6r53d7cdc55.png',:,:,:,:,:)).all(axis=1))
^
SyntaxError: invalid syntax
Any help is appreciated
If you're dealing with CSV files and tabular data handling, I'd recommend using Pandas.
Here's very briefly how that would work in your case (df is the usual variable name for a Pandas DataFrame, hence df).
import pandas as pd

df = pd.read_csv('datafile.csv')
print(df)
results in the output
code filename value1 value2 yesno anothervalue yetanothervalue
0 LIMS_AY60_51X AY60_51X_61536153d7cdc55.png 857.61389 291.227 NO 728.322 865.442
1 LIMS_AY60_52X AY60_52X_615f6r53d7cdc55.png 867.61389 292.227 NO 728.322 865.442
2 LIMS_AY60_53X AY60_53X_615ft153d7cdc55.png 877.61389 293.227 NO 728.322 865.442
3 LIMS_AY60_54X AY60_54X_615u6153d7cdc55.png 818.61389 294.227 NO 728.322 865.442
4 LIMS_AY60_55X AY60_55X_615f615od7cdc55.png 847.61389 295.227 NO 728.322 865.442
Note that the very first column is called the index. It is not in the CSV file, but automatically added by Pandas. You can ignore it here.
The column names are thought-up by me; usually, the first row of the CSV file will have column names, and otherwise Pandas will default to naming them something like "Unnamed: 0", "Unnamed: 1", "Unnamed: 2" etc.
Then, for the actual selection, you do
df['filename'] == 'AY60_52X_615f6r53d7cdc55.png'
which results in
0 False
1 True
2 False
3 False
4 False
Name: filename, dtype: bool
which is a one-dimensional pandas structure called a Series. Again, it has an index, but more importantly, the second column shows for which rows the comparison is true.
You can assign the result to a variable instead, and use that variable to access the rows that have True, as follows:
selection = df['filename'] == 'AY60_52X_615f6r53d7cdc55.png'
print(df[selection])
which yields
code filename value1 value2 yesno anothervalue yetanothervalue
1 LIMS_AY60_52X AY60_52X_615f6r53d7cdc55.png 867.61389 292.227 NO 728.322 865.442
Note that in this case, Pandas is smart enough to figure out whether you want to access a particular column (df['filename']) or a selection of rows (df[selection]). More complicated ways of accessing a dataframe are possible, but you'll have to read up on that.
You can merge some things together, and with the reading of the CSV file, it's just two lines:
df = pd.read_csv('datafile.csv')
df[ df['filename'] == 'AY60_52X_615f6r53d7cdc55.png' ]
which I think is a bit nicer than using purely NumPy. Essentially, use NumPy only when you are really dealing with (multi-dimensional) array data. Not when dealing with records / tabular structured data, as in your case. (Side note: under the hood, Pandas uses a lot of NumPy, so the speed is the same; it's largely a nicer interface with some extra functionality.)
You can do it like this using numpy:
selected_row = a[np.any(a == 'AY60_52X_615f6r53d7cdc55.png', axis=1)]
Output:
>>> selected_row
array([['LIMS_AY60_52X', 'AY60_52X_615f6r53d7cdc55.png', '867.61389', '292.227', 'NO', '728.322', '865.442']], dtype='<U32')
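For reference, a hedged sketch of how the array a could be built from the CSV in the first place (the file name is a placeholder); reading with dtype=str keeps every field as a string, so the comparison above behaves as expected:

import numpy as np

# Read the whole CSV into a 2-D array of strings
a = np.genfromtxt('datafile.csv', delimiter=',', dtype=str)

# Boolean mask over rows: True where any column equals the target filename
selected_row = a[np.any(a == 'AY60_52X_615f6r53d7cdc55.png', axis=1)]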
I have a csv column of strings, each element containing between 4 and 6 digits. If the first 4 digits equal 3372 or 2277, I want to drop the remaining digits of that element so that only 3372 or 2277 remains. I don't want to alter the other elements.
I'm guessing some loops, if statements and mapping maybe?
How would I go about this? (Please be kind. By downvoting people's posts you are discouraging people from learning. If you want to help, take time to read the post; it isn't difficult to understand.)
Rather than using loops, and especially if your csv file is rather big, I suggest you use pandas DataFrames:
import pandas as pd
# Read your file, your csv will be read in a DataFrame format (which is a matrix)
df = pd.read_csv('your_file.csv')

# Define a function to apply to each element in your DataFrame:
def update_df(x):
    if x.startswith('3372'):
        return '3372'
    elif x.startswith('2277'):
        return '2277'
    else:
        return x

# Use applymap, which applies a function to each element of your DataFrame, and collect the result in df1:
df1 = df.applymap(update_df)
print(df1)
Conversely, if you have a small dataset, you may just use loops as suggested above.
Since your values are still strings, I would use slicing to look at the first 4 chars. If they match, we'll chop the end off the string. Otherwise, we'll return the value unaltered.
Here's a function that should do what you want:
def fix_digits(val):
    if val[:4] in ('3372', '2277'):
        return val[:4]
    return val

# Here you'll need the actual code to read your CSV file
for row in csv_file:
    # Assuming your value is in the 6th column
    row[5] = fix_digits(row[5])
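If you stay with the plain csv module, a fuller hedged sketch of that loop, including writing the fixed rows back out, might look like this (the file names are placeholders, and it assumes the value of interest sits in the 6th column):

import csv

def fix_digits(val):
    if val[:4] in ('3372', '2277'):
        return val[:4]
    return val

# Hypothetical input/output file names
with open('input.csv', newline='') as src, open('fixed.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        row[5] = fix_digits(row[5])  # 6th column, index 5
        writer.writerow(row)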
I have loaded an s3 bucket with json files and parsed/flattened them into a pandas dataframe. Now I have a dataframe with 175 columns, 4 of which contain personally identifiable information.
I am looking for a quick solution for anonymising those columns (name & address). I need the mapping to be consistent, so that names or addresses of the same person occurring multiple times get the same hash.
Is there existing functionality in pandas or some other package I can use for this?
Using a Categorical would be an efficient way to do this - the main caveat is that the numbering will be based solely on the ordering in the data, so some care will be needed if this numbering scheme needs to be used across multiple columns / datasets.
df = pd.DataFrame({'ssn': [1, 2, 3, 999, 10, 1]})
df['ssn_anon'] = df['ssn'].astype('category').cat.codes
df
Out[38]:
ssn ssn_anon
0 1 0
1 2 1
2 3 2
3 999 4
4 10 3
5 1 0
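One hedged way to address the consistency caveat is to fix the categories explicitly, e.g. from the union of all values across the datasets involved, so the same value always maps to the same code (the dataset names here are placeholders):

import pandas as pd

df = pd.DataFrame({'ssn': [1, 2, 3, 999, 10, 1]})
other = pd.DataFrame({'ssn': [10, 999, 1]})

# One shared, ordered set of categories built from both datasets
categories = sorted(set(df['ssn']) | set(other['ssn']))

# Using the same categories guarantees the same code for the same value
df['ssn_anon'] = pd.Categorical(df['ssn'], categories=categories).codes
other['ssn_anon'] = pd.Categorical(other['ssn'], categories=categories).codes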
You can use ngroup or factorize from pandas:
df.groupby('ssn').ngroup()
Out[25]:
0 0
1 1
2 2
3 4
4 3
5 0
dtype: int64
pd.factorize(df.ssn)[0]
Out[26]: array([0, 1, 2, 3, 4, 0], dtype=int64)
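A small addendum, as a hedged sketch: either numbering can be attached back onto the frame as a new column (the column names here are hypothetical). Note the two schemes differ, as the outputs above show: ngroup numbers groups in sorted key order, while factorize numbers values in order of first appearance.

df['ssn_anon_sorted'] = df.groupby('ssn').ngroup()   # sorted key order
df['ssn_anon_appear'] = pd.factorize(df.ssn)[0]      # order of first appearance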
If you are doing ML with sklearn, I would recommend this approach:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.ssn).transform(df.ssn)
Out[31]: array([0, 1, 2, 4, 3, 0], dtype=int64)
You seem to be looking for a way to encrypt the strings in your dataframe. There are a bunch of Python encryption libraries, such as cryptography.
Using it is pretty simple: just apply it to each element.
import pandas as pd
from cryptography.fernet import Fernet

df = pd.DataFrame([{'a': 'a', 'b': 'b'}, {'a': 'a', 'b': 'c'}])

# Fernet needs a 32-byte url-safe base64 key, not an arbitrary password
key = Fernet.generate_key()
f = Fernet(key)

# Encrypt each element (values must be bytes)
res = df.applymap(lambda x: f.encrypt(bytes(x, 'utf-8')))

# Decrypt
res.applymap(lambda x: f.decrypt(x))
That is probably the best approach in terms of security, but it generates a long byte string that is hard to look at.
# 'a' -> b'gAAAAABaRQZYMjB7wh-_kD-VmFKn2zXajMRUWSAeridW3GJrwyebcDSpqyFGJsCEcRcf68ylQMC83G7dyqoHKUHtjskEtne8Fw=='
Another simple way to solve your problem is to create a function that maps each key to a value, creating a new value whenever a new key appears.

mapper = {}

def encode(x):
    if x not in mapper:
        # This part can be changed to anything really,
        # such as mapper[x] = randint(-10**10, 10**10);
        # just ensure it does not repeat
        mapper[x] = len(mapper) + 1
    return mapper[x]
res = df.applymap(encode)
Sounds a bit like you want to be able to reverse the process by maintaining a key somewhere. If your use case allows I would suggest replacing all the values with valid, human readable and irreversible placeholders.
John > Mark
21 Hammersmith Grove rd > 48 Brewer Street
This is good for generating usable test data for remote devs etc. You can use Faker to generate replacement values yourself (a minimal sketch follows below). If you want to maintain some utility in your data, i.e. "replace all addresses with alternate addresses within 2 miles", you could use an API I'm working on called Anon AI. We parse JSON from s3 buckets, find all the PII automatically (including in free text fields) and replace it with placeholders given your spec. We can keep consistency and reversibility if required, and it will be most useful if you want to keep a "live" anonymous version of a growing data set. We're in beta at the moment, so let me know if you would be interested in testing it out.
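For the do-it-yourself route with Faker, here is a minimal hedged sketch (the column names and the sample frame are hypothetical); caching the replacements in a dict keeps repeated names/addresses consistent:

import pandas as pd
from faker import Faker

fake = Faker()

# Caches so the same original value always gets the same replacement
name_map = {}
address_map = {}

def anon(value, cache, generator):
    if value not in cache:
        cache[value] = generator()
    return cache[value]

# Hypothetical dataframe with PII columns 'name' and 'address'
df = pd.DataFrame({'name': ['John', 'Jane', 'John'],
                   'address': ['21 Hammersmith Grove rd'] * 3})

df['name'] = df['name'].map(lambda v: anon(v, name_map, fake.name))
df['address'] = df['address'].map(lambda v: anon(v, address_map, fake.address))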
Being able to define the ranges in a manner similar to excel, i.e. 'A5:B10' is important to what I need so reading the entire sheet to a dataframe isn't very useful.
So what I need to do is read the values from multiple ranges in the Excel sheet to multiple different dataframes.
valuerange1 = ['a5:b10']
valuerange2 = ['z10:z20']
df = pd.DataFrame(values from valuerange)
df = pd.DataFrame(values from valuerange1)
or
df = pd.DataFrame(values from ['A5:B10'])
I have searched, but either I have done a very poor job of searching or everyone else has gotten around this problem, because I really can't find a way.
Thanks.
Using openpyxl
Since you have indicated that you are looking for a user-friendly way to specify the range (like the Excel syntax), and as Charlie Clark already suggested, you can use openpyxl.
The following utility function takes a workbook and a column/row range and returns a pandas DataFrame:
import re

import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils import get_column_interval

def load_workbook_range(range_string, ws):
    col_start, col_end = re.findall("[A-Z]+", range_string)
    data_rows = []
    for row in ws[range_string]:
        data_rows.append([cell.value for cell in row])
    return pd.DataFrame(data_rows, columns=get_column_interval(col_start, col_end))
Usage:
wb = load_workbook(filename='excel-sheet.xlsx', read_only=True)
ws = wb.active
load_workbook_range('B1:C2', ws)
Output:
B C
0 5 6
1 8 9
Pandas-only solution
Given the following data in an excel sheet:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
You can load it with the following command:
pd.read_excel('excel-sheet.xlsx')
If you want to limit the data being read, the pandas.read_excel method offers a number of options. Use the parse_cols, skiprows and skip_footer parameters to select the specific subset that you want to load:
pd.read_excel(
    'excel-sheet.xlsx',    # name of the excel file
    names=['B', 'C'],      # new column headers
    skiprows=range(0, 1),  # list of rows you want to omit at the beginning
    skip_footer=1,         # number of rows you want to skip at the end
    parse_cols='B:C'       # columns to parse (note the excel-like syntax)
)
Output:
B C
0 5 6
1 8 9
Some notes:
The API of the read_excel method is not meant to support more complex selections. In case you require a complex filter it is much easier (and cleaner) to load the whole data into a DataFrame and use the excellent slicing and indexing mechanisms provided by pandas.
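As a rough sketch of that approach (the file name is a placeholder), you could load the whole sheet header-less and slice with iloc, translating an Excel-style range into 0-based offsets:

import pandas as pd

# Read the whole sheet without treating any row as a header
full = pd.read_excel('excel-sheet.xlsx', header=None)

# An Excel range like A5:B10 becomes rows 4-9 and columns 0-1
# (0-based and end-exclusive in iloc)
subset = full.iloc[4:10, 0:2]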
The easiest way is to use pandas to get the range of values from Excel.
import pandas as pd

# If you want to choose a single range, you can use the method below
src = pd.read_excel(r'August.xlsx', usecols='A:C', sheet_name='S')

# If you have multiple ranges, e.g. A:C as well as some other range
src = pd.read_excel(r'August.xlsx', usecols='A:C,G:I', sheet_name='S')
If you want a particular range, for example "B3:E5", you can use the following structure.
src=pd.read_excel(r'August.xlsx',usecols='B:E',sheet_name='S',header=2)[0:2]
I have a large csv file with 25 columns, that I want to read as a pandas dataframe. I am using pandas.read_csv().
The problem is that some rows have extra columns, something like that:
col1 col2 stringColumn ... col25
1 12 1 str1 3
...
33657 2 3 str4 6 4 3 #<- that line has a problem
33658 1 32 blbla #<-some columns have missing data too
When I try to read it, I get the error
CParserError: Error tokenizing data. C error: Expected 25 fields in line 33657, saw 28
The problem does not happen if the extra values appear in the first rows. For example, if I add values to the third row of the same file, it works fine:
#that example works:
col1 col2 stringColumn ... col25
1 12 1 str1 3
2 12 1 str1 3
3 12 1 str1 3 f 4
...
33657 2 3 str4 6 4 3 #<- that line has a problem
33658 1 32 blbla #<-some columns have missing data too
My guess is that pandas checks the first (n) rows to determine the number of columns, and if you have extra columns after that it has a problem parsing it.
Skipping the offending lines like suggested here is not an option, those lines contain valuable information.
Does anybody know a way around this?
In my initial post I mentioned not using error_bad_lines=False in pandas.read_csv. I decided that actually doing so is the more proper and elegant solution. I found this post quite useful:
Can I redirect the stdout in python into some sort of string buffer?
I added a little twist to the code shown in the answer.
import sys
import re
from cStringIO import StringIO
import pandas as pd
fake_csv = '''1,2,3\na,b,c\na,b,c\na,b,c,d,e\na,b,c\na,b,c,d,e\na,b,c\n''' #bad data
fname = "fake.csv"
old_stderr = sys.stderr
sys.stderr = mystderr = StringIO()
df1 = pd.read_csv(StringIO(fake_csv), error_bad_lines=False)
sys.stderr = old_stderr
log = mystderr.getvalue()
isnum = re.compile(r"\d+")

lines_skipped_log = [
    isnum.findall(i) + [fname]
    for i in log.split("\n") if isnum.search(i)
]

columns = ["line_num", "flds_expct", "num_fields", "file"]
lines_skipped_log.insert(0, columns)
From there you can do anything you want with lines_skipped_log such as output to csv, create a dataframe etc.
Perhaps you have a directory full of files. You can create a list of pandas data frames out of each log and concatenate. From there you will have a log of what rows were skipped and for which files at your fingertips (literally!).
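For example, a minimal sketch of turning that log into a dataframe (continuing with the variables defined above):

# The first entry of lines_skipped_log holds the column names, the rest the data
skipped_df = pd.DataFrame(lines_skipped_log[1:], columns=lines_skipped_log[0])
print(skipped_df)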
A possible workaround is to specify the column names. Please refer my answer to a similar issue: https://stackoverflow.com/a/43145539/6466550
Since I did not find an answer that completely solves the problem, here is my workaround: I found out that explicitly passing the column names with the option names=('col1', 'col2', 'stringColumn' ... 'column25', '', '', '') allows me to read the file. It forces me to read and parse every column, which is not ideal since I only need about half of them, but at least I can read the file now.
Combining the arguments names and usecols does not work; if somebody has another solution I would be happy to hear it.
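To illustrate that workaround concretely, here is a hedged sketch (the file name, column names, and amount of padding are placeholders; pad the names list with enough extras to cover the widest row):

import pandas as pd

# Hypothetical names: the 25 real columns plus padding for the extra fields
names = ['col%d' % i for i in range(1, 26)] + ['extra1', 'extra2', 'extra3']

# header=0 discards the original header row and uses the padded names instead
df = pd.read_csv('datafile.csv', names=names, header=0)

# Drop the padding columns afterwards if they are not needed
df = df.drop(columns=['extra1', 'extra2', 'extra3'])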