I have a column with land dimensions in Pandas. It looks like this:
df.LotSizeDimensions.value_counts(dropna=False)
40.00X150.00 2
57.00X130.00 2
27.00X117.00 2
63.00X135.00 2
37.00X108.00 2
65.00X134.00 2
57.00X116.00 2
33x124x67x31x20x118 1
55.00X160.00 1
63.00X126.00 1
36.00X105.50 1
In rows where there is only one X, I would like to create a separate column that multiplies the two values. In rows where there is more than one X, I would like to return a zero. This is the code I came up with:
def dimensions_split(df: pd.DataFrame):
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip()
    df.LotSizeDimensions = df.LotSizeDimensions.str.upper()
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip('`"M')
    if df.LotSizeDimensions.count('X') > 1:
        return 0
    df['LotSize'] = map(int(df.LotSizeDimensions.str.split("X", 1).str[0])*int(df.LotSizeDimensions.str.split("X", 1).str[1]))
This is coming back with the following error:
TypeError: cannot convert the series to <class 'int'>
I would also like to add a line where if there are any non-numeric characters other than X, return a zero.
The idea is to first strip and uppercase the LotSizeDimensions column, then use Series.str.split with expand=True to get a DataFrame, and multiply the two split columns if there is exactly one X; otherwise 0 is returned:
import numpy as np

s = df.LotSizeDimensions.str.strip('`"M ').str.upper()
# split on 'X' into separate columns and convert to numbers
df1 = s.str.split('X', expand=True).astype(float)
# for general data with possibly non-numeric parts:
# df1 = s.str.split('X', expand=True).apply(lambda x: pd.to_numeric(x, errors='coerce'))
# multiply the two parts only where there is exactly one 'X', else 0
df['LotSize'] = np.where(s.str.count('X').eq(1), df1[0] * df1[1], 0)
print(df)
LotSizeDimensions LotSize
0 40.00X150.00 6000.0
1 57.00X130.00 7410.0
2 27.00X117.00 3159.0
3 37.00X108.00 3996.0
4 63.00X135.00 8505.0
5 65.00X134.00 8710.0
6 57.00X116.00 6612.0
7 33x124x67x31x20x118 0.0
8 55.00X160.00 8800.0
9 63.00X126.00 7938.0
10 36.00X105.50 3798.0
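To also return 0 when a part contains non-numeric characters other than X (the extra requirement in the question), a sketch building on the commented pd.to_numeric line above: coerce bad parts to NaN and treat any row whose product is NaN as 0.

import numpy as np

s = df.LotSizeDimensions.str.strip('`"M ').str.upper()
# errors='coerce' turns any non-numeric part into NaN
df1 = s.str.split('X', expand=True).apply(lambda x: pd.to_numeric(x, errors='coerce'))
prod = df1[0] * df1[1]
# keep the product only when there is exactly one 'X' and both parts were numeric
df['LotSize'] = np.where(s.str.count('X').eq(1) & prod.notna(), prod, 0)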
I get this using a list comprehension:
import pandas as pd

df = pd.DataFrame(['40.00X150.00', '57.00X130.00', '27.00X117.00',
                   '37.00X108.00', '63.00X135.00', '65.00X134.00',
                   '57.00X116.00', '33x124x67x31x20x118',
                   '55.00X160.00', '63.00X126.00', '36.00X105.50'])
df[1] = [float(str_data.strip().split("X")[0]) * float(str_data.strip().split("X")[1])
         if len(str_data.strip().split("X")) == 2 else None
         for str_data in df[0]]
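Note that this splits on an uppercase 'X' only and yields None rather than the zero the question asked for in the multi-part row; a sketch of a variant that uppercases first and returns 0:

def lot_size(str_data):
    # uppercase so both 'X' and 'x' act as separators
    parts = str_data.strip().upper().split("X")
    return float(parts[0]) * float(parts[1]) if len(parts) == 2 else 0

df[1] = [lot_size(str_data) for str_data in df[0]]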
I want to convert a dataframe which has tuples in cells into a dataframe with MultiIndex.
Here is an example of the table code:
d = {2:[(0,2),(0,4)], 3:[(826.0, 826.0),(4132.0, 4132.0)], 4:[(6019.0, 6019.0),(12037.0, 12037.0)], 6:[(18337.0, 18605.0),(36674.0, 37209.0)]}
test = pd.DataFrame(d)
This is what the dataframe looks like:
2 3 4 6
0 (0, 2) (826.0, 826.0) (6019.0, 6019.0) (18337.0, 18605.0)
1 (0, 4) (4132.0, 4132.0) (12037.0, 12037.0) (36674.0, 37209.0)
This is what I want it to look like:
2 3 4 6
0 A 0 826.0 6019.0 18337.0
B 2 826.0 6019.0 18605.0
1 A 0 4132.0 12037.0 36674.0
B 4 4132.0 12037.0 37209.0
Thanks for your help!
I'm unsure about the efficiency, because this relies on the apply method, but you could concatenate the dataframe with itself, adding an 'A' label to the first copy and a 'B' to the second. Then sort the resulting dataframe by its index, and use apply to change even rows to the first value of the tuple and odd rows to the second:
# duplicate the frame, tag the copies 'A' and 'B', and move the tag into the index
df = pd.concat([test.assign(X='A'), test.assign(X='B')]).set_index(
    'X', append=True).sort_index().rename_axis(index=(None, None))
# even rows take the first element of each tuple, odd rows the second
df.iloc[0:len(df):2] = df.iloc[0:len(df):2].apply(lambda x: x.apply(lambda y: y[0]))
df.iloc[1:len(df):2] = df.iloc[1:len(df):2].apply(lambda x: x.apply(lambda y: y[1]))
It gives as expected:
2 3 4 6
0 A 0 826 6019 18337
B 2 826 6019 18605
1 A 0 4132 12037 36674
B 4 4132 12037 37209
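An alternative sketch (my own suggestion, not from the answer above): since every cell holds a pair, the target frame can also be built directly by flattening each column's tuples against a MultiIndex made with from_product:

# flatten each column of tuples; the index interleaves 'A'/'B' per original row
idx = pd.MultiIndex.from_product([test.index, ['A', 'B']])
out = pd.DataFrame({col: [v for tup in test[col] for v in tup]
                    for col in test.columns}, index=idx)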
I am trying to count how many characters from the first column appear in the second one. They may appear in a different order and they should not be counted twice.
For example, in this df
df = pd.DataFrame(data=[["AL0", "CP1", "NM3", "PK9", "RM2"],
                        ["AL0X24", "CXP44", "MLN", "KKRR9", "22MMRRS"]]).T
the result should be:
result = [3,2,2,2,3]
Looks like set.intersection after zipping the 2 columns:
[len(set(a).intersection(set(b))) for a,b in zip(df[0],df[1])]
#[3, 2, 2, 2, 3]
The other solutions will fail when the two strings share the same character multiple times, e.g. AAL0 and AAL0X24. The result there should be 4.
from collections import Counter

df = pd.DataFrame(data=[["AL0", "CP1", "NM3", "PK9", "RM2", "AAL0"],
                        ["AL0X24", "CXP44", "MLN", "KKRR9", "22MMRRS", "AAL0X24"]]).T

def num_shared_chars(char_counter1, char_counter2):
    shared_chars = set(char_counter1.keys()).intersection(char_counter2.keys())
    return sum(min(char_counter1[k], char_counter2[k]) for k in shared_chars)

df_counter = df.applymap(Counter)
df['shared_chars'] = df_counter.apply(lambda row: num_shared_chars(row[0], row[1]), axis='columns')
Result:
0 1 shared_chars
0 AL0 AL0X24 3
1 CP1 CXP44 2
2 NM3 MLN 2
3 PK9 KKRR9 2
4 RM2 22MMRRS 3
5 AAL0 AAL0X24 4
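A side note (my own sketch, same data): Counter already implements multiset intersection via the & operator, which keeps the minimum count of each shared character, so the helper can be collapsed into one comprehension:

# Counter & Counter keeps min(count) per shared character
df['shared_chars'] = [sum((Counter(a) & Counter(b)).values())
                      for a, b in zip(df[0], df[1])]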
Sticking to the dataframe data structure, you could do:
>>> def count_common(s1, s2):
... return len(set(s1) & set(s2))
...
>>> df["result"] = df.apply(lambda x: count_common(x[0], x[1]), axis=1)
>>> df
0 1 result
0 AL0 AL0X24 3
1 CP1 CXP44 2
2 NM3 MLN 2
3 PK9 KKRR9 2
4 RM2 22MMRRS 3
I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is greater than or equal to Y, and 0 otherwise.
My code so far is:
import pandas as pd

data = pd.read_csv("XYZ.csv")
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
You can do this without using a loop by using ge, which means greater than or equal to, and casting the boolean array to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
Regarding your attempt:
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
it wouldn't work: firstly, you're using Data rather than data; secondly, even with that fixed, you'd be comparing a scalar against an array, which raises an error because such a comparison is ambiguous; thirdly, you're assigning to the entire column on every iteration, overwriting it.
You need to access the index label, which your loop didn't; you can use iteritems to do this:
In [125]:
for idx, x in df["X"].iteritems():
    if x >= df['Y'].loc[idx]:
        df.loc[idx, 'Z'] = 1
    else:
        df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary, as there is a vectorised method here.
Firstly, your code is nearly fine; you simply capitalized your dataframe name as 'Data' instead of 'data'. However, for efficient code, EdChum has a great answer above. Or another method, similar to the for loop in efficiency but with easier code to remember:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)
I am iterating through the rows of a pandas DataFrame, expanding each one out into N rows with additional info on each one (for simplicity I've made it a random number here):
from pandas import DataFrame
import pandas as pd
from numpy import random, arange

N = 3
x = DataFrame.from_dict({'farm': ['A', 'B', 'A', 'B'],
                         'fruit': ['apple', 'apple', 'pear', 'pear']})
out = DataFrame()
for i, row in x.iterrows():
    rows = pd.concat([row]*N).reset_index(drop=True)  # requires row to be a DataFrame
    out = out.append(rows.join(DataFrame({'iter': arange(N), 'value': random.uniform(size=N)})))
In this loop, row is a Series object, so the call to pd.concat doesn't work. How do I convert it to a DataFrame? (Eg. the difference between x.ix[0:0] and x.ix[0])
Thanks!
Given what you commented, I would try
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

results = x.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
This should give you a separate results dataframe. I have assumed that every farm-fruit combination is unique... there might be other ways, if we knew more about your data.
Update
Running code example
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

N = 3
df = pd.DataFrame(arange(0, 8).reshape(4, 2), columns=['low', 'high'])
df['farm'] = 'a'
df['fruit'] = arange(0, 4)
results = df.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
df
low high farm fruit
0 0 1 a 0
1 2 3 a 1
2 4 5 a 2
3 6 7 a 3
results
farm fruit
a 0 [0.176124290969, 0.459726835079, 0.999564934689]
1 [2.42920143009, 2.37484506501, 2.41474002256]
2 [4.78918572452, 4.25916442343, 4.77440617104]
3 [6.53831891152, 6.23242754976, 6.75141668088]
If instead you want a dataframe, you can update the function to:

def giveMeSomeRows(group):
    return pd.DataFrame(random.uniform(low=group.low, high=group.high, size=N))
results
0
farm fruit
a 0 0 0.281088
1 0.020348
2 0.986269
1 0 2.642676
1 2.194996
2 2.650600
2 0 4.545718
1 4.486054
2 4.027336
3 0 6.550892
1 6.363941
2 6.702316
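As a direct answer to the literal question (converting the Series that iterrows yields into a one-row DataFrame), a sketch that keeps the question's loop structure: to_frame().T turns the Series into a single row, which pd.concat can then repeat.

out = DataFrame()
for i, row in x.iterrows():
    row_df = row.to_frame().T  # Series -> one-row DataFrame
    rows = pd.concat([row_df] * N).reset_index(drop=True)
    out = out.append(rows.join(DataFrame({'iter': arange(N),
                                          'value': random.uniform(size=N)})))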
df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
df
A mask values
0 11 0 10
1 11 0 15
2 22 0 20
3 22 1 25
Now how can I group by A, keep the column names intact, and yet put a custom function's result into Z:
def calculate_df_stats(dfs):
    mask_ = list(dfs['mask'])
    mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
    return mean

df['Z'] = df.groupby('A').agg(calculate_df_stats)  # does not work
and generate:
A mask values Z
0 11 0 10 12.5
1 22 0 20 25
Whatever I do, it only replaces the values column with the masked mean.
And can your solution be applied to a function that uses two columns and returns the result in a new column?
Thanks!
Edit:
To clarify more: let's say I have this table in MySQL:
SELECT * FROM `Reader_datapoint` WHERE `wavelength` = '560'
LIMIT 200;
which gives me this result:
http://pastebin.com/qXiaWcJq
If I run now this:
SELECT *, avg(action_value) FROM `Reader_datapoint` WHERE `wavelength` = '560'
group by `reader_plate_ID`;
I get:
datapoint_ID plate_ID coordinate_x coordinate_y res_value wavelength ignore avg(action_value)
193 1 0 0 2.1783 560 NULL 2.090027083333334
481 2 0 0 1.7544 560 NULL 1.4695583333333333
769 3 0 0 2.0161 560 NULL 1.6637885416666673
How can I replicate this behaviour in pandas? Note that all the column names stay the same, the first value is taken, and the new column is added.
If you want the original columns in your result, you can first calculate the grouped and aggregated dataframe (you will have to aggregate your original columns in some way; I took the first occurrence as an example):
>>> df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
>>>
>>> grouped = df.groupby("A")
>>>
>>> result = grouped.agg('first')
>>> result
mask values
A
11 0 10
22 0 20
and then add a column 'Z' to that result by applying your function on the groupby result 'grouped':
>>> def calculate_df_stats(dfs):
... mask_ = list(dfs['mask'])
... mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
... return mean
...
>>> result['Z'] = grouped.apply(calculate_df_stats)
>>>
>>> result
mask values Z
A
11 0 10 12.5
22 0 20 20.0
In your function definition you can always use more columns (just by their name) to compute the result.
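For example, a sketch of a function that reads two columns of each group and returns one value (the function name and the 'frac' column are my own, hypothetical choices):

# hypothetical example: fraction of each group's total that is not masked out
def masked_fraction(dfs):
    kept = dfs['values'][dfs['mask'] == 0].sum()
    return kept / dfs['values'].sum()

result['frac'] = grouped.apply(masked_fraction)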