I have a CSV column, stored as strings, where each element has between 4 and 6 digits. If the first 4 digits equal 3372 or 2277, I want to drop the last 2 digits of that element so that only 3372 or 2277 remains. I don't want to alter the other elements.
I'm guessing some loops, if statements and mapping maybe?
How would I go about this? (Please be kind. By down-rating people's posts you are discouraging people from learning. If you want to help, take time to read the post; it isn't difficult to understand.)
Rather than using loops, and especially if your CSV file is fairly big, I suggest you use pandas DataFrames:
import pandas as pd
# Read your file; the CSV is loaded into a DataFrame (a table of rows and columns)
df = pd.read_csv('your_file.csv')
# Define a function to apply to each element in your DataFrame:
def update_df(x):
    if x.startswith('3372'):
        return '3372'
    elif x.startswith('2277'):
        return '2277'
    else:
        return x
# Use applymap, which applies a function to each element of your DataFrame, and collect the result in df1:
df1 = df.applymap(update_df)
print(df1)
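As a quick sanity check, here is a minimal sketch on a tiny made-up frame of string codes (the column names and values are invented for illustration), reusing update_df from above:
import pandas as pd

sample = pd.DataFrame({'code':  ['337245', '227761', '123456'],
                       'other': ['999999', '227700', '4567']})

# Every element starting with 3372 or 2277 is truncated; everything else is untouched
print(sample.applymap(update_df))
#      code   other
# 0    3372  999999
# 1    2277    2277
# 2  123456    4567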
If, on the other hand, your dataset is small, plain loops work fine too, as suggested in the other answer.
Since your values are still strings, I would use slicing to look at the first 4 chars. If they match, we'll chop the end off the string. Otherwise, we'll return the value unaltered.
Here's a function that should do what you want:
def fix_digits(val):
    if val[:4] in ('3372', '2277'):
        return val[:4]
    return val
# Here you'll need the actual code to read your CSV file
for row in csv_file:
    # Assuming your value is in the 6th column
    row[5] = fix_digits(row[5])
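To wire that up, a minimal sketch with the standard csv module could look like this (the file names and the column position are assumptions based on the question):
import csv

with open('input.csv', newline='') as src, open('output.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # Assuming the value of interest is in the 6th column (index 5)
        row[5] = fix_digits(row[5])
        writer.writerow(row)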
I have a data frame like this:
MONTH TIME PATH RATE
0 Feb 15:24:11 enp1s0 14.71Kb
I want to create a function which can identify whether 'Kb' or 'Mb' appears in the column RATE. If an entry in column RATE ends with 'Kb' or 'Mb', I want to strip the 'Kb'/'Mb' suffix and perform an operation to convert the value into plain b.
Here's my code so far, where RATE is treated by the DataFrame as an object:
df = pd.DataFrame(listOfLists)

def strip(bytesData):
    if "Kb" in bytesData:
        bytesData/1000
    elif "Mb" in bytesData:
        bytesData/1000000

df['RATE'] = df.apply(lambda x: strip(x['byteData']), axis=1)
How can I get it to change the value within the column while stripping it of unwanted characters and converting it into the format I need? I know once this operation is complete I'll have to change it to an int, however, I can't seem to alter the data in the way I need.
Thanks in advance!
I modified your function a bit and used map(lambda x:) instead of apply, since we are working with a Series and not the full DataFrame. I also added some additional rows to provide examples for both Kb and Mb, and for the case where neither is present:
import numpy as np
import pandas as pd

example_df = pd.DataFrame({'Month': [0, 1, 2, 3],
                           'Time': ['15:32', '16:42', '17:11', '15:21'],
                           'Path': ['xxxxx', 'yyyyy', 'zzzzz', 'aaaaa'],
                           'Rate': ['14.71Kb', '18.21Mb', '19.01Kb', 'Error_1']})

def case_1(value):
    if value[-2:] == 'Kb':
        return float(value[:-2]) * 1000
    elif value[-2:] == 'Mb':
        return float(value[:-2]) * 1000000
    else:
        return np.nan

example_df['Rate'] = example_df['Rate'].map(lambda x: case_1(x))
The logic of the function is: if the value ends with Kb, multiply it by 1,000; else if it ends with Mb, multiply it by 1,000,000; otherwise simply return NaN (because neither of the two conditions is satisfied).
Output:
Month Time Path Rate
0 0 15:32 xxxxx 14710.0
1 1 16:42 yyyyy 18210000.0
2 2 17:11 zzzzz 19010.0
3 3 15:21 aaaaa NaN
Here's an alternative approach. This solution also handles other abbreviations, though it relies on the re regular-expression package from the standard library.
This approach makes a new column called Bytes. I often find it helpful to keep the original RATE column in cases like this, to verify there aren't any edge cases I haven't thought of. I also use a mapping to obtain the power of 1000 needed to convert the value into the correct number of bytes. I did add the code required to drop the original RATE column and rename the new column.
import re
def convert_to_bytes(contents):
    value, label, _ = re.split('([A-Za-z]+)', contents)
    factors = {'Kb': 1, 'Mb': 2, 'Gb': 3, 'Tb': 4}
    return float(value) * 1000**(factors[label])
df['Bytes'] = df['RATE'].map(convert_to_bytes)
# Drop original RATE column
df = df.drop('RATE', axis=1)
# Rename Bytes column to RATE
df = df.rename({'Bytes': 'RATE'}, axis='columns')
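As a quick check, you can call the function directly on a couple of sample strings (values made up for illustration):
print(convert_to_bytes('14.71Kb'))  # -> 14710.0 (up to floating-point rounding)
print(convert_to_bytes('18.21Mb'))  # -> 18210000.0 (up to floating-point rounding)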
I have a six column matrix. I want to find the row(s) where BOTH columns match the query.
I've been trying to use numpy.where, but I can't specify it to match just two columns.
#Example of the array
x = np.array([[860259, 860328, 861277, 861393, 865534, 865716],
              [860259, 860328, 861301, 861393, 865534, 865716],
              [860259, 860328, 861301, 861393, 871151, 871173]])
print(x)
#Match first column of interest
A = np.where(x[:,2] == 861301)
#Match second column of interest
B = np.where(x[:,3] == 861393)
#rows in both A and B
np.intersect1d(A, B)
#This approach works, but is not column specific for the intersect, leaving me with extra rows I don't want.
#This is the only way I can get Numpy to match the two columns, but
#when I query I will not actually know the values of columns 0,1,4,5.
#So this approach will not work.
#Specify what row should exactly look like
np.where(all([860259, 860328, 861277, 861393, 865534, 865716]))
#I want something like this:
#Where * could be any number. But I think that this approach may be
#inefficient. It would be best to just match column 2 and 3.
np.where(all([*, *, 861277, 861393, *, *]))
I'm looking for an efficient answer, because I am looking through a 150GB HDF5 file.
Thanks for your help!
If I understand you correctly, you can use slightly more advanced slicing, like this:
np.where(np.all(x[:,2:4] == [861277, 861393], axis=1))
This will give you only the rows where these two columns equal [861277, 861393].
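For example, applied to the x defined in the question (a small sketch; the pair of values comes from your pseudocode above), you can also use the boolean mask to pull out the matching rows directly:
import numpy as np

mask = np.all(x[:, 2:4] == [861277, 861393], axis=1)
print(np.where(mask))  # (array([0]),) -- only the first row matches here
print(x[mask])         # the full matching row(s)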
I'm trying to import large files (.tab/.txt, 300+ columns and 1,000,000+ rows) into Python. The files are tab-separated and the columns are filled with integer values. One of my goals is to compute the sum of each column. However, the files are too large to import with pandas.read_csv() as it consumes too much RAM.
sample data:
Therefore I wrote the following code to import one column, compute the sum of that column, store the result in a dataframe (summed_cols), delete the column, and go on with the next column of the file:
x = 10  # columns I'm interested in start at col 11
# empty dataframe to fill
summed_cols = pd.DataFrame(columns=["sample", "read sum"])
while x < 352:
    x = x + 1
    sample_col = pd.read_csv("file.txt", sep="\t", usecols=[x])
    summed_cols = summed_cols.append(pd.DataFrame({"sample": [sample_col.columns[0]],
                                                   "read sum": sum(sample_col[sample_col.columns[0]])}))
    del sample_col
Each column represents a sample, and the "read sum" is the sum of that column. So the output of this code is a dataframe with 2 columns: one sample per row in the first column, and the corresponding read sum in the second.
This code does exactly what I want, however, it is not efficient. For this large file it takes about 1-2 hours to complete the calculations. The loading of just one column in particular takes quite a long time.
My question: Is there a faster way to import just one column of this large tab file and perform the same calculations as I'm doing with the code above?
You can try something like this:
samples = []
sums = []

with open('file.txt', 'r') as f:
    for i, line in enumerate(f):
        columns = line.strip().split('\t')[10:]  # from column 10 onward
        if i == 0:  # supposing the sample names are in the first row of each column
            samples = columns                # save sample names
            sums = [0 for s in samples]      # init the sums to 0
        else:
            for n, v in enumerate(columns):
                sums[n] += float(v)

result = dict(zip(samples, sums))  # {sample_name: sum, ...}
I am not sure this will work since I don't know the content of your input file but it describes the general procedure. You open the file only once, you iterate over each line, split to get the columns, and store the data you need.
Mind that this code does not deal with missing values.
The else block can be improved using numpy (note that in Python 3, map returns an iterator, so build a float array from the columns instead):
import numpy as np
...
        else:
            sums = np.add(sums, np.array(columns, dtype=float))
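If you'd rather stay in pandas, another option worth benchmarking is reading the file in chunks, so only a slice of rows is in memory at any time. This is a rough sketch, not a drop-in replacement; the file name, separator, and the assumption that your columns of interest start at index 10 are taken from your code:
import pandas as pd

total = None
# Read 100,000 rows at a time; tune chunksize to your available RAM
for chunk in pd.read_csv('file.txt', sep='\t', chunksize=100_000):
    partial = chunk.iloc[:, 10:].sum()
    total = partial if total is None else total + partial

summed_cols = total.reset_index()
summed_cols.columns = ['sample', 'read sum']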
I want to extract the values from two different columns of a pandas dataframe, put them in a list with no duplicate values.
I have tried the following:
arr = df[['column1', 'column2']].values
thelist = []
for ix, iy in np.ndindex(arr.shape):
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])
This works but it is taking too long. The dataframe contains around 30 million rows.
Example:
column1 column2
1 adr1 adr2
2 adr1 adr2
3 adr3 adr4
4 adr4 adr5
Should generate the list with the values:
[adr1, adr2, adr3, adr4, adr5]
Can you please help me find a more efficient way of doing this, considering that the dataframe contains 30 million rows?
@ALollz gave the right answer; I'll extend from there. To convert it into a list as expected, just use list(np.unique(df.values)).
You can use just np.unique(df) (maybe this is the shortest version).
Formally, the first parameter of np.unique should be an array_like object,
but as I checked, you can also pass just a DataFrame.
Of course, if you want a plain list rather than an ndarray, write:
np.unique(df).tolist()
Edit following your comment
If you want the list unique but in the order of appearance, write:
pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()
Operation order:
reshape changes the source array into a single column.
Then a DataFrame is created, with default column name = 0.
Then [0] takes just this (the only) column.
drop_duplicates does exactly what the name says.
And the last step: tolist converts to a plain list.
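A small sketch with the example data from the question (typed in by hand here) shows both variants giving the expected result:
import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': ['adr1', 'adr1', 'adr3', 'adr4'],
                   'column2': ['adr2', 'adr2', 'adr4', 'adr5']})

print(np.unique(df).tolist())
# ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']  (sorted order)

print(pd.DataFrame(df.values.reshape(-1, 1))[0].drop_duplicates().tolist())
# ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']  (order of appearance)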
Background
I deal with a CSV datasheet that contains columns of numbers. I am working on a program that takes the first column, asks the user for a time as a float (i.e. 45 and a half hours = 45.5), and then subtracts that number from the first column. I have been successful in that regard. Now I need to find the row index of the "zero" time point. I use min to find that index and then read the value in the following column, A1.1, at that row. I need the reading at time 0 so I can normalize A1.1 to it, so that on a graph the reading at the zero time point is 1 in column A1.1 (and eventually all subsequent columns, but baby steps for me).
time_zero = float(input("Which time would you like to be set to 0?"))
df['A1']= df['A1']-time_zero
This works fine so far to set the zero time.
zero_location_series = df[df['A1'] == df['A1'].min()]
r1 = zero_location_series[' A1.1']
df[' A1.1'] = df[' A1.1']/r1
Here's where I run into trouble. The first line will correctly identify a series that I can pull off of for all my other columns. Next r1 correctly identifies the proper A1.1 value and this value is a float when I use type(r1).
However when I divide df[' A1.1']/r1 it yields only one correct value and that value is where r1/r1 = 1. All other values come out NaN.
My Questions:
How do I divide a column by a float? Why am I getting NaN?
Is there a faster way to do this, as I need to do it for 16 columns (i.e. A2/r2, A3/r3, etc.)?
Do I need to use inplace=True anywhere to make the operations stick before resaving the data, or is that only for adding/deleting rows?
Example
A DataFrame that looks like this:
http://i.imgur.com/ObUzY7p.png
The zero time sets properly (image not shown).
After dividing the column:
http://i.imgur.com/TpLUiyE.png
This should work:
df['A1.1']=df['A1.1']/df['A1.1'].min()
I think the reason df[' A1.1'] = df[' A1.1']/r1 did not work was because r1 is a series. Try r1? instead of type(r1) and pandas will tell you that r1 is a series, not an individual float number.
To do it in one attempt, you have to iterate over each column, like this:
for c in df:
    df[c] = df[c]/df[c].min()
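If you do want to keep the r1 approach from your question, one fix (a sketch, using the column names from your code) is to pull a scalar out of the one-row series before dividing, so pandas does not try to align indexes and produce NaN:
# .iloc[0] turns the one-row series into a plain float
r1 = zero_location_series[' A1.1'].iloc[0]
df[' A1.1'] = df[' A1.1'] / r1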
If you want to divide every value in the column by r1, it's best to use apply, for example:
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5])
# apply an anonymous function to the first column ([0]), dividing every value
# in the column by 3
df = df[0].apply(lambda x: x / 3.0)
print(df)
So you'd probably want something like this:
df = df["A1.1"].apply(lambda x: x/r1, 0)
This really only answers part 2 of your question. apply is probably your best bet for running a function over multiple rows and columns quickly. As for why you're getting NaNs when dividing by a float: is it possible the values in your columns are something other than floats or integers?
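For part 2, rather than apply, vectorised division also covers all 16 columns at once. This is a sketch under assumptions: the list of reading columns is hypothetical and should be replaced with your actual 16 column names:
# Hypothetical list of the reading columns to normalise
reading_cols = [' A1.1', ' A2.1', ' A3.1']  # ...extend to all 16

zero_idx = df['A1'].idxmin()  # row index of the zero time point
# Divide each reading column by its value at the zero-time row
df[reading_cols] = df[reading_cols].div(df.loc[zero_idx, reading_cols])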