I have a pandas dataframe with an integer column called TDLINX. I'm trying to convert it to a string padded with leading zeros so that every value is 7 characters long. So 7 would become "0000007".
This is the code that I used:
df_merged_total['TDLINX2'] = df.TDLINX.apply(lambda x: str(x).zfill(7))
At first glance this appeared to work, but as I went further down the file, I realized that the value in TDLINX2 was starting to get shifted. What could be causing this and what can I do to prevent it?
The shift is most likely an index-alignment issue: the new column is built from df but assigned into df_merged_total, and pandas aligns a Series on its index during assignment, so any mismatch between the two frames' indexes shifts the values. Build the column from the same frame you assign it to. For the formatting itself, you could do something like this:
>>> df = pd.DataFrame({"col":[1, 33, 555, 7777]})
>>> df["new_col"] = ["%07d" % x for x in df.col]
>>> df
    col  new_col
0     1  0000001
1    33  0000033
2   555  0000555
3  7777  0007777
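As an aside, the same padding can be done with pandas' vectorized string methods instead of a Python-level loop; a minimal sketch on the same example frame:
import pandas as pd

df = pd.DataFrame({"col": [1, 33, 555, 7777]})

# astype(str) renders the integers as strings; str.zfill pads with leading zeros
df["new_col"] = df["col"].astype(str).str.zfill(7)
print(df)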
I was just looking at some code for Random Forests, and came across these two lines.
Let's assume I have a pandas dataframe 'df' that consists of 12 columns.
What will the following code return?
X = df.iloc[:,0:11].values
Y = df.iloc[:, 12].values
To generate a dataframe to consider:
>>> df = pd.DataFrame(np.random.randint(10, size=(5, 2)),
...                    columns=['Col 1', 'Col 2'])
If we print the dataframe, we get:
>>> print(df)
   Col 1  Col 2
0      8      4
1      6      4
2      7      5
3      9      6
4      1      5
To determine what the : does, let's consider
>>> print(df.iloc[:, 0])
0    8
1    6
2    7
3    9
4    1
which appears to produce every single row in the 0-th column.
Let's try another example:
>>> print(df.iloc[0:3, 0])
0    8
1    6
2    7
It looks like that gives the rows at position 0 through position 2 in the 0-th column.
So, from playing with those examples, you can infer that : returns the full dimension. In your example, it returns all rows since the : comes first. The 0:11 returns columns at positions 0 through 10, since the end of a slice is exclusive. The 12 returns the column at position 12, i.e. the 13th column, which does not exist in a 12-column dataframe.
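If the dataframe really has only 12 columns, a quick sketch confirms both the slice boundary and the out-of-bounds access:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 12)))   # 12 columns, at positions 0 through 11

print(df.iloc[:, 0:11].shape)          # (3, 11): columns 0-10, slice end excluded

try:
    df.iloc[:, 12]
except IndexError as e:
    print(e)                           # position 12 does not exist in a 12-column frame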
X = df.iloc[:,0:11].values
The above line will return all rows and the 1st through 11th columns (positions 0 through 10; the slice end 11 is exclusive) in the form of an array.
Y = df.iloc[:, 12].values
The above line returns the values of the 13th column (position 12, not the 12th column) of the data frame in the form of an array.
Example:
Sample dataframe:
# Columns are named with a letter plus a number for convenient tracking
df = pd.DataFrame(np.random.randint(0, 120, size=(5, 14)),
                  columns=[k + l for k, l in zip(list('ABCDEFGHIJKLMN'),
                                                 [str(i) for i in range(1, 15)])])
df
X = df.iloc[:, 0:11]  # append .values for the array form
X
Y = df.iloc[:, 12].values
Y
I have a script that generates DataFrames with 1 row and 50 columns. Each cell of each DataFrame contains a string. However, with the possible exception of one cell, all of these strings are empty, so they look like this: ''. As a result, each DataFrame looks something like this:
  Col 1  Col 2  ...         Col 49  Col 50
0               ...  "Here it is."
Only one of the cells may contain a full sentence (the one in column 49 in this case), but it is unknown what the sentence is and in which column it is located. And I want to return only that sentence. What is a simple way to do this?
Use the fact that empty strings are falsy:
df.at[0, df.loc[0].astype(bool).idxmax()]
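For example, a minimal sketch with a hypothetical one-row frame:
import pandas as pd

df = pd.DataFrame([["", "", "Here it is.", ""]],
                  columns=["c1", "c2", "c3", "c4"])

# astype(bool) marks non-empty strings as True; idxmax returns the label
# of the first True, i.e. the column holding the sentence
label = df.loc[0].astype(bool).idxmax()
print(df.at[0, label])  # Here it is.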
If you use a Series instead, it is easy to filter the one cell with a non-empty element:
import pandas as pd
df = pd.DataFrame({'col1': [""], 'col2': [""], 'col3': [""], 'col4': ["some words"], 'col5': [""]})
s = df.T[0]
sentence = s[s != ""]
This transposes the dataframe, then converts it to a Series. It is of course easier and quicker if you can store the data in a Series in the first place.
Or, as RafaelC hints at in a comment: avoid storing all the empty strings in the first place, and store the non-empty string directly in your variable, skipping the dataframe completely.
Here's one solution. Given this scenario
import pandas as pd
row = ['' for i in range(50)]
row[34] = 'Raining somewhere'
pdf = pd.DataFrame([row])
which looks like
In [5]: print(pdf)
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
0 ...
[1 rows x 50 columns]
We can get a data frame that contains columns with entries that are not '' with
pdf[pdf != ''].dropna(axis=1)
which returns
                  34
0  Raining somewhere
If you just want the string
pdf[pdf != ''].dropna(axis=1).values[0][0]
returns
'Raining somewhere'
This assumes that there is only one such string in the data frame. Alternatively, if you don't want to use pdf != '', you could always use
import numpy as np
pdf.replace('', np.nan).dropna(axis=1).values[0][0]
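A further variant, reusing pdf and np from above and assuming exactly one non-empty cell: stack drops NaN entries by default in classic pandas, so only that cell survives:
# replace '' with NaN, then stack: the NaNs are dropped and one value remains
sentence = pdf.replace('', np.nan).stack().iloc[0]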
I have a dataframe generated by:
df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2], [101, ' test1', 3 ], [101,' ', 4]])
It looks like
     0      1  2
0  100  tes t  3
1  100    NaN  2
2  101  test1  3
3  101         4
I would like to fill column 1 "forward" with test and test1. I believe one approach would be to replace the whitespace-only entries with np.nan, but that is tricky since the words contain whitespace as well. I could also group by column 0 and then use the first element of each group to fill forward. Could you provide code for both alternatives? I haven't managed to get either working.
Additionally, I would like to add a column that contains the group means; that is, the final dataframe should look like this:
     0      1  2    3
0  100  tes t  3  2.5
1  100  tes t  2  2.5
2  101  test1  3  3.5
3  101  test1  4  3.5
Could you also please advise how to accomplish something like this?
Many thanks; please let me know in case you need further information.
IIUC, you could use str.strip and then check whether the stripped string is empty.
Then forward-fill the NaNs within each group using ffill and calculate the means with the groupby.transform function, as shown:
import numpy as np

df[1] = df[1].str.strip().dropna().apply(lambda x: np.nan if len(x) == 0 else x)
df[1] = df.groupby(0)[1].ffill()            # fillna(method='ffill') is deprecated
df[3] = df.groupby(0)[2].transform('mean')
df
Note: if you instead must fill the NaN values with the first element of each group, then do this:
df.groupby(0)[1].apply(lambda x: x.fillna(x.iloc[0]))
Breakdown of steps:
Since we want to apply the function only to strings, we first drop all the NaN values; otherwise we would get a TypeError, because the column mixes floats and string elements and float has no method len.
df[1].str.strip().dropna()
0    tes t   # operates only on indices where strings are present (empty strings included)
2    test1
3
Name: 1, dtype: object
The reindexing part isn't a necessary step, as the computation only touches the indices where strings are present.
Also, reset_index(drop=True) is indeed unwanted, as the groupby returns a series after fillna which can be assigned straight back to column 1.
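Putting the pieces together, a minimal end-to-end sketch on the frame from the question (using replace rather than the dropna/apply route above):
import numpy as np
import pandas as pd

df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2],
                   [101, ' test1', 3], [101, ' ', 4]])

df[1] = df[1].str.strip().replace('', np.nan)   # strip, then blank -> NaN
df[1] = df.groupby(0)[1].ffill()                # forward-fill within each group
df[3] = df.groupby(0)[2].transform('mean')      # per-group means in a new column
print(df)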
So I have a pandas DataFrame with a single column and a lot of data.
I need to access each element, not to change it (with apply()), but to parse it into another function.
When looping through the DataFrame it always stops after the first element.
If I convert it to a list first, my numbers all end up wrapped in brackets (e.g. [12] instead of 12), which breaks my code.
Does anyone see what I am doing wrong?
import pandas as pd
def go_trough_list(df):
    for number in df:
        print(number)
df = pd.read_csv("my_ids.csv")
go_trough_list(df)
df looks like:
1
0 2
1 3
2 4
dtype: object
[Finished in 1.1s]
Edit: I found one mistake. My first value is recognized as a header.
So I changed my code to:
df = pd.read_csv("my_ids.csv",header=None)
But with
for ix in df.index:
    print(df.loc[ix])
I get:
0 1
Name: 0, dtype: int64
0 2
Name: 1, dtype: int64
0 3
Name: 2, dtype: int64
0 4
Name: 3, dtype: int64
Edit: here is my solution, thanks to jezrael and Nick!
First I added header=None because my data has no header.
Then I changed my function to:
def go_through_list(df):
    new_list = df[0].apply(my_function, parameter=par1)
    return new_list
And it works perfectly! Thank you again guys, problem solved.
You can use the index as in other answers, and also iterate through the df and access the row like this:
for index, row in df.iterrows():
    print(row['column'])
However, I suggest solving the problem differently if performance is of any concern. Also, if there is only one column, it is more appropriate to use a pandas Series.
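For instance, a single-column frame collapses to a Series, and iterating a Series yields the values directly; a small sketch:
import pandas as pd

df = pd.DataFrame({"ids": [2, 3, 4]})

s = df["ids"]        # or df.squeeze("columns") if you don't know the column name
for number in s:     # yields 2, 3, 4 rather than the column name
    print(number)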
What do you mean by parse it into another function? Perhaps take the value, do something to it, and put the result into another column?
I need to access each element, not to change it (with apply()), but to parse it into another function.
Perhaps this example will help:
import pandas as pd
df = pd.DataFrame([20, 21, 12])
def square(x):
    return x**2

df['new_col'] = df[0].apply(square)  # can use a lambda here nicely
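For reference, with those inputs the resulting frame should look like this:
    0  new_col
0  20      400
1  21      441
2  12      144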
You can convert the column to a list with tolist:
for x in df['Colname'].tolist():
    print(x)
Sample:
import pandas as pd
df = pd.DataFrame({'a': pd.Series([1, 2, 3]),
                   'b': pd.Series([4, 5, 6])})
print(df)
   a  b
0  1  4
1  2  5
2  3  6
for x in df['a'].tolist():
    print(x)
1
2
3
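Note that if you only need to iterate, the intermediate list is optional; a Series is itself iterable over its values:
for x in df['a']:
    print(x)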
If you have only one column, use iloc to select the first column:
for x in df.iloc[:, 0].tolist():
    print(x)
Sample:
import pandas as pd
df = pd.DataFrame({1: pd.Series([2, 3, 4])})
print(df)
   1
0  2
1  3
2  4
for x in df.iloc[:, 0].tolist():
    print(x)
2
3
4
This works too, but it is not the recommended approach, because 1 can be a number or a string and it can raise a KeyError:
for x in df[1].tolist():
    print(x)
2
3
4
Say you have one column named 'myColumn', and you have an index on the dataframe (which is automatically created with read_csv). Try using the .loc function:
for ix in df.index:
    print(df.loc[ix, 'myColumn'])
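If performance matters and you loop over many rows, itertuples is generally faster than iterrows; a small sketch, assuming the column really is named myColumn:
for row in df.itertuples(index=False):
    print(row.myColumn)   # attribute access works for valid-identifier column names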
I want to replace values within a range of columns in a dataframe with a corresponding value in another column if the value in the range is greater than zero.
I would think that a simple replace like this would work:
df = df.loc[:,'A':'D'].replace(1, df['column_with_value_I_want'])
But that in fact does nothing as far as I can tell except drop the column_with_value_I_want, which is totally unintended, and I'm not sure why that happens.
This doesn't seem to work either:
df[df.loc[:,'A':'D']] > 0 = df['column_with_value_I_want']
It returns the error: SyntaxError: can't assign to comparison.
This seems like it should be straightforward, but I'm at a loss after trying several different things to no avail.
The dataframe I'm working with looks something like this:
df = pd.DataFrame({'A': [1, 0, 0, 1, 0, 0],
                   'B': [1, 0, 0, 1, 0, 1],
                   'C': [1, 0, 0, 1, 0, 1],
                   'D': [1, 0, 0, 1, 0, 0],
                   'column_with_value_I_want': [22.0, 15.0, 90.0, 10.0, None, 557.0]})
Not sure how to do it in Pandas per se, but it's not that difficult if you drop down to numpy.
If you're lucky enough that your entire DataFrame is numerical, you can do it as follows:
import numpy as np
m = df.to_numpy()  # as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
>>> pd.DataFrame(
...     np.where(np.logical_or(np.isnan(m), m > 0), np.tile(m[:, [4]], 5), m),
...     columns=df.columns)
    A    B    C   D  column_with_value_I_want
0  22   22   22  22                        22
1   0    0    0   0                        15
2   0    0    0   0                        90
3  10   10   10  10                        10
4   0    0    0   0                       NaN
5   0  557  557   0                       557
to_numpy (as_matrix in older pandas) converts a DataFrame to a numpy array.
np.where is numpy's ternary conditional.
np.logical_or is numpy's or.
np.isnan checks whether a value is nan.
np.tile tiles (in this case) a 2d single column to a matrix.
Unfortunately, the above will fail if some of your columns (even those not involved in this operation) are inherently non-numerical. In this case, you can do the following:
for col in ['A', 'B', 'C', 'D']:
    # where the flag is positive, take the value column; otherwise keep the original
    df[col] = np.where(df[col] > 0, df['column_with_value_I_want'], df[col])
which will work as long as the 5 relevant columns are numerical.
This uses a loop (which is frowned upon in numerical Python), but at least it iterates over columns, not rows. Assuming your data has far more rows than columns, it should be OK.
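For completeness, a pandas-native sketch of the same replacement (assuming the intent is "where the flag is greater than zero, take the value column"): DataFrame.mask with a Series as other and axis=0 broadcasts the replacement column down the rows:
cols = df.loc[:, 'A':'D']
df.loc[:, 'A':'D'] = cols.mask(cols > 0, df['column_with_value_I_want'], axis=0)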