I have a pandas dataframe as follows...
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([['abc', 11], ['xyz', 21],['pqr',31]]),columns=['member','value'])
What I need is to collapse the column 'member' into a single string, so that the output is an SQL query as follows...
"select * from table1 where member in ('abc','xyz','pqr')"
In my original data I have a large number of values. I couldn't figure out a way to collapse them from searching previous questions. Is there a way to do this without using a loop? Thanks.
You can use the tolist method on the column of interest, then convert the result to a tuple and then to a string:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([['abc', 11], ['xyz', 21],['pqr',31]]),columns=['member','value'])
sql_string = "select * from table1 where member in "
members = tuple(df.member.tolist())  # ('abc', 'xyz', 'pqr')
query = sql_string + str(members)    # str() renders the tuple with parentheses and quotes
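One caveat, as an aside: if the column ever holds a single value, str() renders the tuple as ('abc',), and that trailing comma is not valid SQL. A minimal sketch of a join-based alternative that avoids this (same data as above):
members = "','".join(df['member'])
query = "select * from table1 where member in ('" + members + "')"
print(query)  # select * from table1 where member in ('abc','xyz','pqr')
Note that building SQL by string concatenation is open to SQL injection; prefer parameterized queries when the values come from untrusted input.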
I have a given dataset: https://www.kaggle.com/abcsds/pokemon
I need to create a new column, based on another column holding the name of the Pokemon (string type), that will contain a list of strings instead of a single string. I need to use a function. Here is my code:
import pandas as pd
import numpy as np
df = pd.read_csv('pokemon.csv')
def transform_faves(df):
    df = df.assign(name_as_list = df.name)  # new column
    list_of_a_single_column = df['name'].tolist()
    df['name_as_list'] = list_of_a_single_column
    print(type(list_of_a_single_column))
    return df
df = transform_faves(df)
The problem is that the new column still contains strings rather than lists of strings. Why doesn't this conversion work?
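A note on what happens here: when you assign a list whose length matches the number of rows, pandas distributes one element per row, so each cell ends up holding a single string again. A minimal sketch of one way to get an actual list in every cell (the column name name is assumed, matching the code above):
import pandas as pd
df = pd.DataFrame({'name': ['Bulbasaur', 'Ivysaur', 'Venusaur']})
def transform_faves(df):
    # wrap each value in its own one-element list
    df['name_as_list'] = df['name'].apply(lambda x: [x])
    return df
df = transform_faves(df)
print(type(df['name_as_list'].iloc[0]))  # <class 'list'>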
I have the following situation:
I have multiple tables that look like this:
table1 = pd.DataFrame([[0,1],[0,1],[0,1],[0,1],[0,1]], columns=['v1','v2'])
I have one dataframe in which each element refers to one of these tables, something like this:
df = pd.DataFrame([table1, table2, table3, table4], columns=['tablename'])
I need to create a new column in df that contains, for each table, the values that I get from np.polyfit(table1['v1'],table1['v2'],1)
I have tried the following:
for x in df['tablename']:
    df.loc[:,'fit_result'] = np.polyfit(x['v1'],x['v2'],1)
but it gives me
TypeError: string indices must be integers
Is there a way to do it, or am I writing something that makes no sense?
Note: in fact, these tables are HUGE and contain more than two columns.
You can try something like this:
import numpy as np
import pandas as pd
table1 = pd.DataFrame([[0.0,0.0],[1.0,0.8],[2.0,0.9],[3.0,0.1],[4.0,-0.8],[5.0,-1.0]], columns=['table1_v1','table1_v2'])
df = pd.DataFrame([['some','random'],['values','here']], columns=['example_1','example_2'])
def fit_result(v1, v2):
    return np.polyfit(v1, v2, 1)

df['fit_result'] = df.apply(lambda row: fit_result(table1['table1_v1'].values, table1['table1_v2'].values), axis=1)
df.head()
Output
  example_1 example_2                                  fit_result
0      some    random  [-0.3028571428571428, 0.7571428571428572]
1    values      here  [-0.3028571428571428, 0.7571428571428572]
You only need to do this over all your dataframes and concat all of them at the end:
df_col = pd.concat([df1, df2], axis=1)
(see https://www.datacamp.com/community/tutorials/joining-dataframes-pandas)
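If the tables live in an ordinary Python collection rather than inside a dataframe, a minimal sketch of the whole loop-free step might look like this (the dict of tables and its contents are assumptions, not from the question):
import numpy as np
import pandas as pd

# hypothetical collection of tables keyed by name
tables = {
    'table1': pd.DataFrame({'v1': [0.0, 1.0, 2.0], 'v2': [0.0, 0.8, 0.9]}),
    'table2': pd.DataFrame({'v1': [0.0, 1.0, 2.0], 'v2': [1.0, 0.5, -0.1]}),
}

# one row per table, each holding that table's degree-1 polyfit coefficients
df = pd.DataFrame({
    'tablename': list(tables.keys()),
    'fit_result': [np.polyfit(t['v1'], t['v2'], 1) for t in tables.values()],
})
print(df)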
I have a large dataframe and would like to update specific values at known row and column indices. I would like to do this without an explicit for loop.
For example:
import string
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 10), index = range(10), columns = list(string.ascii_lowercase)[:10])
I have arbitrary arrays of indexes, columns, and values that I would like to use to update df. For example:
update_values = [0,-2,-3]
update_index = [3,5,7]
update_columns = ["d","g","i"]
I can loop over the arrays to update the original dataframe:
for i, j, v in zip(update_index, update_columns, update_values):
    df.loc[i, j] = v
but would like to use a technique not involving an explicit for loop.
Use the underlying numpy values:
indexes = map(df.columns.get_loc, update_columns)  # map column labels to positional indices
df.values[update_index, list(indexes)] = update_values
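A caveat worth noting: this works because the frame is all floats, so df.values returns a view of the underlying array; with mixed dtypes (or with copy-on-write enabled in recent pandas versions) .values can be a copy and the assignment is silently lost. A quick sanity check after the assignment:
print(df.loc[3, 'd'], df.loc[5, 'g'], df.loc[7, 'i'])  # expect 0.0 -2.0 -3.0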
Try using loc, which lets you specify the needed indexes and column names: loc[[index_names], [column_names]]
df.loc[[3,5,7], ["d","g","i"]] = [0,-2,-3]
Note that loc with two lists selects the full cross-product, so each of rows 3, 5 and 7 gets all three values (d=0, g=-2, i=-3); that is a 3x3 block update, not the pointwise update from the question's loop.
I am trying to remove the index while converting a pandas dataframe into an HTML table. The prototype is as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'list': np.random.rand(100)})
html_table = df.to_html()
I don't want to display the index in the HTML table.
Try this:
html_table = df.to_html(index=False)
It seems you need to remove the index name (note that this clears only the index's name, not the index column itself):
df = df.rename_axis(None)
Or:
df.index.name = None
To not display the index, use:
print (df.to_string(index=False))
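A quick sketch to see the difference between these suggestions side by side (using a smaller frame in the spirit of the question):
import pandas as pd
import numpy as np

df = pd.DataFrame({'list': np.random.rand(5)})
df.index.name = 'row_id'                 # name the index so the difference is visible

print(df.to_html(index=False))           # drops the index column entirely
print(df.rename_axis(None).to_html())    # keeps the index, clears only its name
print(df.to_string(index=False))         # plain-text output without the index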
I am trying to use a pandas DataFrame as the intermediate result set between two consecutive SQL queries.
I imagine it looks something like this:
import pandas.io.sql as pisql
import pyodbc
SQL_command1 = """
select * from tab_A
"""
result = pisql.read_frame(SQL_command1)
SQL_command2 = """
select *
from ? A
inner join B
on A.id = B.id
"""
pyodbc.cursor.execute(SQL_command2, result)
The SQL_command2 in the above code is simply pseudo code, where ? takes the result as input and is given the alias name A.
This is my first time using pandas, so I'm not confident whether my idea is feasible or efficient. Can anyone enlighten me, please?
Many thanks.
The pseudo code would look like this:
import pandas as pd
df_a = pd.read_csv('tab_a.csv')  # or read_sql or another read engine
df_b = pd.read_csv('tab_b.csv')
result = pd.merge(left=df_a,
                  right=df_b,
                  how='inner',
                  on='id')  # assuming 'id' is in both tables
And to select columns of a pandas dataframe, it would be something like df_a[['col1','col2','col3']].
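Since the question starts from SQL rather than CSV files, a minimal sketch of the same flow reading straight from the database (the connection string is a placeholder, and the table and column names are taken from the question):
import pandas as pd
import pyodbc

# hypothetical DSN; adjust to your driver/server/database
conn = pyodbc.connect('DSN=mydsn')

df_a = pd.read_sql('select * from tab_A', conn)
df_b = pd.read_sql('select * from tab_B', conn)

# the second SQL query becomes an in-memory join
result = pd.merge(df_a, df_b, how='inner', on='id')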