My goal is to iterate through a data frame built from a CSV file, perform a point-in-polygon operation for each row, and assign the result to a new column in the data frame. I was able to construct the function; my only problem is how to iterate through the data frame and append the results of the function.
import fiona
from shapely.geometry import Point, shape

point = Point(10.2023019, 8.43254802)
fc = fiona.open("ngaadmbndaadm2osgof20170222.geojson")
print(fc.schema)
and then look up the value for the point:
for feature in fc:
    if shape(feature['geometry']).contains(point):
        df_data['Admin2name'] = feature['properties']['admin2Name']
My data frame looks like this
It is usually not efficient to iterate through a dataframe row by row. Instead, you may want to read your lookup data into a second dataframe and merge it with your current dataframe on the X and Y coordinate columns.
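A minimal sketch of that merge approach (the lookup file name and its Admin2name column are hypothetical; only the X and Y columns come from your data):

import pandas as pd

df_data = pd.read_csv('nigeria_healthfacilities.csv', low_memory=False)
df_lookup = pd.read_csv('admin2_lookup.csv')  # hypothetical file with X, Y, and Admin2name columns
merged = df_data.merge(df_lookup, on=['X', 'Y'], how='left')  # join on the shared coordinate columns

Note that this only works when the coordinates match exactly; for a true point-in-polygon test, the spatial join below is the robust approach.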
I found a solution with another method: executing a spatial join (merge) on two geodataframes:
merged_gdf = gpd.sjoin(gdf, df_lga, how="inner", op="within")
For others who will have this problem, the full code is below:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

df_lga = gpd.read_file('ngaadmbndaadm2osgof20170222.geojson')  # first df: admin polygons
df_data = pd.read_csv('nigeria_healthfacilities.csv', low_memory=False)  # second df: facility points
geometry = [Point(xy) for xy in zip(df_data.X, df_data.Y)]  # turn the X/Y columns into shapely geometry points
crs = 'EPSG:4326'  # set crs (the {'init': 'epsg:4326'} dict form is deprecated)
gdf = gpd.GeoDataFrame(df_data, crs=crs, geometry=geometry)
# spatially join the two dfs: keep each point together with the polygon it falls within
# (in geopandas >= 0.10 the op= keyword has been renamed predicate=)
merged_gdf = gpd.sjoin(gdf, df_lga, how="inner", op="within")
merged_gdf
As far as I can tell your question isn't very clear, but I assume you're asking how to create a new column as a function of previous columns? You almost never want to iterate through the rows one by one, although if the DataFrame is small enough it won't matter, of course.
You'll want to use vectorization if possible:
import pandas as pd

df = pd.DataFrame({'A': [1,2,3,4,5,6], 'B': [6,5,4,3,2,1]})
df
>>> A B
0 1 6
1 2 5
2 3 4
3 4 3
4 5 2
5 6 1
Now you can simply define a new column directly from A and B:
df['C'] = (df['A']*df['B'] >= 8)
df
>>> A B C
0 1 6 False
1 2 5 True
2 3 4 True
3 4 3 True
4 5 2 True
5 6 1 False
Or alternatively, use apply to apply a function:
df['D'] = df['B'].apply(lambda x: -x)
df
>>> A B C D
0 1 6 False -6
1 2 5 True -5
2 3 4 True -4
3 4 3 True -3
4 5 2 True -2
5 6 1 False -1
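For a simple negation like this, the fully vectorized form df['D'] = -df['B'] gives the same result and is faster than apply.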
Related
I just wanted to ask the community and see if there is a more efficient way to do this.
I have many rows in a data frame and I am using .loc to filter values in column A so that I can perform calculations on column B.
I can easily do something like...
filter_1 = df.loc[df['Condition'] == 1]
And then perform the mathematical calculation on column B that I need.
But there are many conditions I must go through so I was wondering if I could possibly make a list of the conditions and then iterate them through the .loc function in less lines of code?
Would something like this work where I create a list, then iterate the conditions through a loop?
Thank you!
This example gets most of what I want. I just need it to show 6.4 and 7.0 in this example. How can I change the iteration so it shows the results for the unique values in column 'a'?
import pandas as pd

a = [1,2,1,2,1,2,1,2,1,2]
b = [5,1,3,5,7,20,9,5,8,4]
col = ['a', 'b']
list_1 = []
for i, j in zip(a, b):
    list_1.append([i, j])
df1 = pd.DataFrame(list_1, columns=col)

for i in a:
    aa = df1[df1['a'].isin([i])]
    aa1 = aa['b'].mean()
    print(aa1)
Solution using set:

set_a = set(a)
for i in set_a:
    aa = df1[df1['a'].isin([i])]
    aa1 = aa['b'].mean()
    print(aa1)
Solution using the pandas groupby/mean functions:
Is this what you are looking for?
import pandas as pd
a = [1,2,1,2,1,2,1,2,1,2]
b = [5,1,3,5,7,20,9,5,8,4]
df = pd.DataFrame({'a':a,'b':b})
print(df)
print(df.groupby('a').mean())
The results from this are:
Original Dataframe df:
a b
0 1 5
1 2 1
2 1 3
3 2 5
4 1 7
5 2 20
6 1 9
7 2 5
8 1 8
9 2 4
The mean of column 'b' for each unique value of 'a' is:
b
a
1 6.4
2 7.0
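If you only want the means of column 'b' rather than of every remaining column, df.groupby('a')['b'].mean() returns just that Series.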
Here you go:
df = df[(df['A'] > 1) & (df['A'] < 10)]
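If you really do have a list of conditions to run through, a minimal sketch (the column names and the mean() calculation are placeholders based on the question):

conditions = [1, 2, 3]  # hypothetical condition values
for c in conditions:
    subset = df.loc[df['Condition'] == c]
    print(subset['B'].mean())  # replace with whatever calculation you need on column B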
I have an array of numbers (I think the format makes it a pivot table) that I want to turn into a "tidy" data frame. For example, I start with variable 1 down the left, variable 2 across the top, and the value of interest in the middle, something like this:
X Y
A 1 2
B 3 4
I want to turn that into a tidy data frame like this:
V1 V2 value
A X 1
A Y 2
B X 3
B Y 4
The row and column order don't matter to me, so the following is totally acceptable:
value V1 V2
2 A Y
4 B Y
3 B X
1 A X
For my first go at this, which was able to get me the correct final answer, I looped over the rows and columns. This was terribly slow, and I suspected that some machinery in Pandas would make it go faster.
It seems that melt is close to the magic I seek, but it doesn't get me all the way there. That first array turns into this:
V2 value
0 X 1
1 X 2
2 Y 3
3 Y 4
It gets rid of my V1 variable!
Nothing is special about melt, so I will be happy to read answers that use other approaches, particularly if melt is not much faster than my nested loops and another solution is. Nonetheless, how can I go from that array to the kind of tidy data frame I want as the output?
Example dataframe:
df = pd.DataFrame({"X":[1,3], "Y":[2,4]},index=["A","B"])
Use DataFrame.reset_index with DataFrame.rename_axis and then DataFrame.melt. If you want to order the columns, we can use DataFrame.reindex.
new_df = (df.rename_axis(index = 'V1')
.reset_index()
.melt('V1',var_name='V2')
.reindex(columns = ['value','V1','V2']))
print(new_df)
Another approach DataFrame.stack:
new_df = (df.stack()
.rename_axis(index = ['V1','V2'])
.rename('value')
.reset_index()
.reindex(columns = ['value','V1','V2']))
print(new_df)
value V1 V2
0 1 A X
1 3 B X
2 2 A Y
3 4 B Y
For setting the axis names there is another alternative, as @Scott Boston points out in the comments.
Melt is a good approach, but it doesn't seem to play nicely with identifying the results by index. You can reset the index first to move it to its own column, then use that column as the id col.
test = pd.DataFrame([[1,2],[3,4]], columns=['X', 'Y'], index=['A', 'B'])
X Y
A 1 2
B 3 4
test = test.reset_index()
index X Y
0 A 1 2
1 B 3 4
test.melt('index',['X', 'Y'], 'prev cols')
index prev cols value
0 A X 1
1 B X 3
2 A Y 2
3 B Y 4
I have a dataframe as shown below
>> df
A 1
B 2
A 5
B 6
A 7
B 8
How do I reformat it to make it
A 1 5 7
B 2 6 8
Thanks
Given a data frame like this
df = pd.DataFrame(dict(one=list('ABABAB'), two=range(6)))
you can do
df.groupby('one').two.apply(lambda s: s.reset_index(drop=True)).unstack()
# 0 1 2
# one
# A 0 2 4
# B 1 3 5
or (slightly slower, and giving a slightly different result)
df.groupby('one').apply(lambda d: d.two.reset_index(drop=True))
# two 0 1 2
# one
# A 0 2 4
# B 1 3 5
The first approach works with a SeriesGroupBy, the second uses a DataFrameGroupBy.
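Another option, just a sketch and not from the original answer: number the entries within each group with cumcount and then pivot:

df['idx'] = df.groupby('one').cumcount()  # 0, 1, 2 within each group
df.pivot(index='one', columns='idx', values='two')
# idx  0  1  2
# one
# A    0  2  4
# B    1  3  5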
You can grab the values from the series and use np.reshape to get the correct dimensions. order='F' fills the array column by column (as in Fortran), while order='C' fills it row by row (as in C). Then put the result back into a DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(10)})
data = df['a'].values.reshape((2, 5), order='F')
df = pd.DataFrame(data=data, index=['a', 'b'])
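For reference, with np.arange(10) this produces:

print(df)
   0  1  2  3  4
a  0  2  4  6  8
b  1  3  5  7  9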
How did you generate this data frame? I think it should have been generated using a dictionary, and then the dataframe built from that dict:

import pandas as pd

d = {'A': [1,5,7], 'B': [2,6,8]}
df = pd.DataFrame(data=d, index=['p1','p2','p3'])

and then you can use df.T to transpose your dataframe if you need to.
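For example, df.T gives:

   p1  p2  p3
A   1   5   7
B   2   6   8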
So I created an HDF5 file with a simple dataset that looks like this
>>> pd.read_hdf('STORAGE2.h5', 'table')
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
Using this script:

import pandas as pd

store = pd.HDFStore('STORAGE2.h5')
df_tl = pd.DataFrame(dict(A=list(range(5)), B=list(range(5))))
df_tl.to_hdf('STORAGE2.h5', 'table', append=True)
I know I can select columns using
x = pd.read_hdf('STORAGE2.h5', 'table', columns=['A'])
or
x = store.select('table', where = 'columns=A')
How would I select all values in column 'A' that equal 3, or specific indices with strings in column 'A' like 'foo'? In pandas dataframes I would use df[df["A"]==3] or df[df["A"]=='foo'].
Also does it make a difference in efficiency if I use read_hdf() or store.select()?
You need to specify data_columns= (you can use True as well to make all columns searchable)
(FYI, the mode='w' will start the file over, and is just for my example)
In [50]: df_tl.to_hdf('STORAGE2.h5','table',append=True,mode='w',data_columns=['A'])
In [51]: pd.read_hdf('STORAGE2.h5','table',where='A>2')
Out[51]:
A B
3 3 3
4 4 4
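For the string case the same where syntax should work, as long as 'A' is a data column (an untested sketch):

x = pd.read_hdf('STORAGE2.h5', 'table', where='A == "foo"')

As for efficiency, read_hdf essentially opens the store, calls select, and closes it again, so for a single query there is no meaningful difference; if you are issuing many queries, keeping an HDFStore open and using store.select avoids reopening the file each time.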
I am attempting to add a Series to an empty DataFrame and cannot find an answer either in the docs or in other questions. Since you can append two DataFrames by row or by column, it would seem there must be an "axis marker" missing from a Series. Can anyone explain why this does not work?
import pandas as pd
df1 = pd.DataFrame()
s1 = pd.Series(['a',5,6])
df1 = pd.concat([df1,s1],axis = 1)
#go run some process return s2, s3, sn ...
s2 = pd.Series(['b',8,9])
df1 = pd.concat([df1,s2],axis = 1)
s3 = pd.Series(['c',10,11])
df1 = pd.concat([df1,s3],axis = 1)
If my example above is some how misleading perhaps using the example from the docs will help.
Quoting: Appending rows to a DataFrame.
While not especially efficient (since a new object must be created), you can append a
single row to a DataFrame by passing a Series or dict to append, which returns a new DataFrame as above. End Quote.
The example from the docs appends "s", which is a row from a DataFrame; "s1" is a Series, and attempting to append "s1" produces an error. My question is WHY appending "s1" will not work. The assumption behind the question is that a DataFrame must contain axis information for two axes, whereas a Series must contain information for only one axis.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.xs(3)  # the row of the DataFrame with index label 3
s1 = pd.Series([np.random.randn(4)])  # new Series of equal len
df = df.append(s, ignore_index=True)
Result
0 1
0 a b
1 5 8
2 6 9
Desired
0 1 2
0 a 5 6
1 b 8 9
You were close; you just need to transpose the result from concat:
In [14]: s1
Out[14]:
0 a
1 5
2 6
dtype: object
In [15]: s2
Out[15]:
0 b
1 8
2 9
dtype: object
In [16]: pd.concat([s1, s2], axis=1).T
Out[16]:
0 1 2
0 a 5 6
1 b 8 9
[2 rows x 3 columns]
You also don't need to create the empty DataFrame.
The best way is to use DataFrame to construct a DF from a sequence of Series, rather than using concat:
import pandas as pd
s1 = pd.Series(['a',5,6])
s2 = pd.Series(['b',8,9])
pd.DataFrame([s1, s2])
Output:
In [4]: pd.DataFrame([s1, s2])
Out[4]:
0 1 2
0 a 5 6
1 b 8 9
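Note that a named Series contributes its name as the row label, so something like pd.DataFrame([s1.rename('first'), s2.rename('second')]) would give meaningful row indices.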
A method of accomplishing the same objective as appending a Series to a DataFrame is to convert the data to an array of lists and append the array(s) to the DataFrame:

import pandas as pd

def get_example(idx):
    # build one row of data as an array of lists
    list1 = (idx + 1, idx + 2, chr(idx + 97))
    data = [list1]
    return data

df1 = pd.DataFrame()
for idx in range(4):
    data = get_example(idx)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df1 = pd.concat([df1, pd.DataFrame(data)], ignore_index=True)