I am able to load a CSV into a pandas DataFrame, but the data is stuck in lists. How can I load directly into a pandas DataFrame from Pydrill, or unlist the DataFrame's columns and data? I've tried unlisting, and it puts everything into a list of lists.
I've used to_dataframe(), but I can't seem to find documentation on whether it takes a delimiter. pd.DataFrame doesn't work because of the Pydrill query.
import pandas as pd

reviews = drill.query("SELECT * FROM hdfs.datasets.`titanic_ML/titanic.csv` LIMIT 1000", timeout=30)
print(reviews)
df2 = reviews.to_dataframe()
df2.rename(columns=df2.iloc[0])   # note: without assignment or inplace=True this line has no effect
headers = df2.iloc[0]
print(headers)
new_df = pd.DataFrame(df2.values[1:], columns=headers)
new_df.head()
The result puts everything into lists:
["pclass","sex","age","sibsp","parch","fare","embarked","survived"]
0 ["3","1","38.0","0","0","7.8958","1","0"]
1 ["1","1","42.0","0","0","26.55","1","0"]
2 ["3","0","9.0","4","2","31.275","1","0"]
3 ["3","1","27.0","0","0","7.25","1","0"]
4 ["1","1","41.0","0","0","26.55","1","0"]
I'd like to get everything into a normal pandas dataframe.
The solution I found was this. It doesn't unlist the DataFrame, but it's an alternate route to the same result: query the database directly with psycopg2 and pd.read_sql.
connect_str = "dbname='dbname' user='dsa_ro_user'
conn = psycopg2.connect(connect_str) host='host database'
SQL = "SELECT * "
SQL += " FROM train"
df = pd.read_sql(SQL,conn)
df.head()
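If you do want to unlist the Drill result inside pandas itself, here is a minimal sketch. It assumes df2 (from to_dataframe() above) has a single column whose cells are lists of strings, with the header names in the first row, as in the output shown:
raw = df2.iloc[:, 0]                # the single column of lists
headers = raw.iloc[0]               # first row carries the header names
new_df = pd.DataFrame(raw.iloc[1:].tolist(), columns=headers)
new_df = new_df.astype(float)       # everything arrives as text; cast if you need numbers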
Try using Table Functions as described in the O'Reilly text, Chapter 4: Querying Delimited Data. This will delimit the file and apply the first row to your columns. Note: because everything is being read as text, you may need to cast your values to floats if you want to do arithmetic in your SELECT or WHERE.
This should get you what you want:
sql="""
SELECT *
FROM table(hdfs.datasets.`/titanic_ML/titanic.csv`(
type => 'text',
extractHeader => true,
fieldDelimiter => ',')
) LIMIT 1000
"""
rows = drill.query(sql, timeout=30)
df = rows.to_dataframe()
df.head()
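If you do need arithmetic or numeric comparisons, a hedged variant of the same query with casts (the column names fare and age are assumed to come from the CSV header):
sql="""
SELECT CAST(fare AS FLOAT) AS fare, CAST(age AS FLOAT) AS age
FROM table(hdfs.datasets.`/titanic_ML/titanic.csv`(
type => 'text',
extractHeader => true,
fieldDelimiter => ',')
) WHERE CAST(age AS FLOAT) > 18
"""
adults = drill.query(sql, timeout=30).to_dataframe()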
I have a data frame created with pandas; it has one column, and each row in that column holds multiple fields, like ({buy_quantity:0, symbol:nse123490,....}). I want to insert it into an Excel sheet using a pandas DataFrame with the Python xlwings library, keeping only some selected data.
import pandas as pd
import xlwings as xw

wb = xw.Book('Easy_Algo.xlsx')
ts = wb.sheets['profile']
pdata = sas.get_profile()   # sas: the data source object from my setup (not shown)
df = pd.DataFrame(pdata)
ts.range('A1').value = df[['symbol','product','avg price','buy avg']]
The output looks like this (screenshot omitted). Please help me: how do I insert only the selected data into Excel?
Considering that the dataframe below is named df and the column positions holds dicts, you can use the code below to turn the keys into columns and the values into rows.
out = df.join(pd.DataFrame(df.pop('positions').values.tolist()))
out.to_excel('Easy_Algo.xlsx', sheet_name='profile', index=False)  # store the result in an Excel file; sheet_name takes a string, not a list
Note: Make sure to add the two lines below if the column positions does not already hold dicts.
import ast
df['positions'] = df['positions'].apply(ast.literal_eval)
# A sample dataframe for testing (a list is used rather than a set so the row order is stable):
import pandas as pd
import ast
string_rows = ['{"Symbol": "NIFTY22SEP16500PE NFO", "Produc": "NRML", "Avg. Price": 16.35, "Buy Avg": 16.35}',
               '{"Symbol": "NIFTY22SEP18500CE NFO", "Produc": "NRML", "Avg. Price": 20.15, "Buy Avg": 20.15}',
               '{"Symbol": "NIFTY2292218150CE NFO", "Produc": "NRML", "Avg. Price": 18.15, "Buy Avg": 0}']
df = pd.DataFrame(string_rows, columns=['positions'])
df['positions'] = df['positions'].apply(ast.literal_eval)
out = df.join(pd.DataFrame(df.pop('positions').values.tolist()))
>>> print(out)
Symbol Produc Avg. Price Buy Avg
0 NIFTY22SEP16500PE NFO NRML 16.35 16.35
1 NIFTY22SEP18500CE NFO NRML 20.15 20.15
2 NIFTY2292218150CE NFO NRML 18.15 0.00
If I understood correctly, you want only those columns written to an Excel file:
df = df[['symbol','product','avg price','buy avg']]
df.to_excel("final.xlsx")
df.to_excel("final.xlsx", index=False)  # in case pandas generated a default index and you want to drop it
I hope this helps.
I am a beginner in programming and trying to learn to code, so please bear with my rough code. I am using pandas to find a string in a column (the combinations column in the code below) and print the entire row containing that string. Basically I need to find all the instances where the string occurs and print each full row; I can't figure out how to locate those instances and print them. My code is below.
import pandas as pd

data = pd.read_csv("signallervalues.csv", index_col=False)
data.head()
data['col1'] = data['col1'].astype(str)
data['col2'] = data['col2'].astype(str)
data['col3'] = data['col3'].astype(str)
data['col4'] = data['col4'].astype(str)
data['col5'] = data['col5'].astype(str)
data.head()
combinations = data['col1'] + data['col2'] + data['col3'] + data['col4'] + data['col5']
data['combinations'] = combinations
print(data.head())
list_of_combinations = data['combinations'].to_list()
print(list_of_combinations)
for i in list_of_combinations:
    if data['combinations'].str.contains(i).any():
        print(i + ' occurs in row')
        # I need to print the row containing the string here
    else:
        print(i + ' is occurring only once')
My data frame looks like this (screenshot omitted).
import pandas as pd
data=pd.DataFrame()
# recreating your data (more or less)
data['signaller']= pd.Series(['ciao', 'ciao', 'ciao'])
data['col6']= pd.Series(['-1-11-11', '11', '-1-11-11'])
list_of_combinations=['11', '-1-11-11']
data.reset_index(inplace=True)
# group by the values of column 6 and counting how many times they occur
g=data.groupby('col6')['index']
count= pd.DataFrame(g.count())
count=count.rename(columns={'index':'occurences'})
count.reset_index(inplace=True)
# keep only the rows whose col6 value is in 'list_of_combinations'
count[count['col6'].isin(list_of_combinations)]
My result (screenshot omitted).
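If you want to print the entire matching rows rather than just counts, a minimal sketch along the same lines (it reuses the data frame recreated above; value_counts does the grouping in one step):
counts = data['col6'].value_counts()
for value, n in counts.items():
    if n > 1:
        print(data[data['col6'] == value])   # every full row containing the repeated value
    else:
        print(value + ' occurs only once')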
I want to append a pandas DataFrame (8 columns) to an existing table in Databricks (12 columns) and fill the 4 columns that can't be matched with None values. Here is what I've tried:
spark_df = spark.createDataFrame(df)
spark_df.write.mode("append").insertInto("my_table")
It threw this error:
ParseException: "\nmismatched input ':' expecting (line 1, pos 4)\n\n== SQL ==\n my_table
It looks like Spark can't handle this operation with unmatched columns; is there any way to achieve what I want?
I think that the most natural course of action would be a select() transformation to add the missing columns to the 8-column dataframe, followed by a unionAll() transformation to merge the two. Note that unionAll() matches columns by position, not by name, so the select() must emit the columns in the same order as the wider frame.
from pyspark.sql import Row
from pyspark.sql.functions import lit

bigrow = Row(a='foo', b='bar')
bigdf = spark.createDataFrame([bigrow])
smallrow = Row(a='foobar')
smalldf = spark.createDataFrame([smallrow])
# pad the narrow frame with a null 'b' column so the schemas line up
fitdf = smalldf.select(smalldf.a, lit(None).alias('b'))
uniondf = bigdf.unionAll(fitdf)
Can you try this:
from pyspark.sql import functions as F

df = spark.createDataFrame(pandas_df)
df_table_struct = sqlContext.sql('select * from my_table limit 0')  # empty frame with the table's schema
for col in set(df_table_struct.columns) - set(df.columns):
    df = df.withColumn(col, F.lit(None))
df_table_struct = df_table_struct.unionByName(df)
df_table_struct.write.saveAsTable('my_table', mode='append')
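On Spark 3.1+, unionByName() also accepts an allowMissingColumns flag that fills unmatched columns with nulls itself, so the loop above becomes a one-liner:
df_table_struct = df_table_struct.unionByName(df, allowMissingColumns=True)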
I am pulling data using pytreasurydirect and I would like to query each unique CUSIP, append the results, and build a pandas DataFrame table. I am having difficulty generating the DataFrame; I believe it is because of the unicode structure of the data.
import pandas as pd
from pytreasurydirect import TreasuryDirect

td = TreasuryDirect()
cusip_list = [['912796PY9','08/09/2018'],['912796PY9','06/07/2018']]
result = pd.DataFrame()
for i in cusip_list:
    cusip = ''.join(i[0])
    issuedate = ''.join(i[1])
    cusip_value = td.security_info(cusip, issuedate)
    #pd.DataFrame(cusip_value.items())
    df = pd.DataFrame(cusip_value, index=['a'])
    result = result.append(df, ignore_index=False)  # appending to td itself would clobber the TreasuryDirect client
Example of data from pytreasurydirect:
Index([u'accruedInterestPer100', u'accruedInterestPer1000',
u'adjustedAccruedInterestPer1000', u'adjustedPrice',
u'allocationPercentage', u'allocationPercentageDecimals',
u'announcedCusip', u'announcementDate', u'auctionDate',
u'auctionDateYear',
...
u'totalTendered', u'treasuryDirectAccepted',
u'treasuryDirectTendersAccepted', u'type',
u'unadjustedAccruedInterestPer1000', u'unadjustedPrice',
u'updatedTimestamp', u'xmlFilenameAnnouncement',
u'xmlFilenameCompetitiveResults', u'xmlFilenameSpecialAnnouncement'],
dtype='object', length=116)
I think you want to define a function like this:
def securities(type):
    secs = td.security_type(type)
    keys = secs[0].keys() if secs else []
    seri = [pd.Series([sec[key] for sec in secs]) for key in keys]
    return pd.DataFrame(dict(zip(keys, seri)))
Then, use it:
df = securities('Bond')
df[['cusip', 'issueDate', 'maturityDate']].head()
to get results like these, for example (TreasuryDirect returns a lot of additional columns):
cusip issueDate maturityDate
0 912810SD1 2018-08-15T00:00:00 2048-08-15T00:00:00
1 912810SC3 2018-07-16T00:00:00 2048-05-15T00:00:00
2 912810SC3 2018-06-15T00:00:00 2048-05-15T00:00:00
3 912810SC3 2018-05-15T00:00:00 2048-05-15T00:00:00
4 912810SA7 2018-04-16T00:00:00 2048-02-15T00:00:00
At least, those are the results today; they will change over time as bonds are issued and, alas, mature. Note the multiple issueDates per cusip.
Finally, per the TreasuryDirect website (https://www.treasurydirect.gov/webapis/webapisecurities.htm), the possible security types are: Bill, Note, Bond, CMB, TIPS, FRN.
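If you want every type in a single frame, a small sketch building on the securities() helper above (the type list comes from the TreasuryDirect page just cited):
all_types = ['Bill', 'Note', 'Bond', 'CMB', 'TIPS', 'FRN']
all_df = pd.concat([securities(t) for t in all_types], ignore_index=True)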
So I am doing some merging in pandas using a name map, because the two files I want to merge don't have exactly matching names. My pdata sheet has dates from 2014 to 2016, but I want to filter it down to only contain dates from 1/1/2015 to 31/12/2016.
Below is the code that I currently have; I am not sure how to (or whether I can) filter on date before the merge.
import pandas as pd
path= 'C:/Users/Rukgo/Desktop/Match thing/'
name_map = pd.read_excel(path+'name_map.xls',sheetname=0)
Tdata = pd.read_excel(path+'2015_TXNs.xls',sheetname=0)
pdata = pd.read_excel(path+'Pipeline.xls', sheetname=0)
#pdata = pdata[(1/1/2015 <=pdata.date)&(pdata.date <=31/12/2015)]
merged = pd.merge(Tdata, name_map, how="left", on="Local Customer")
merged.to_excel(path+"results.xls")
mdata = pd.read_excel(path +'results.xls',sheetname=0)
final_merge = pd.merge(mdata, pdata, how='right', on='Client')
final_merge = final_merge[final_merge.Amount_USD !=0]
final_merge.to_excel(path+"Final Results.xls")
So I had a commented-out section that ended up being quite close to the code I actually needed:
pdata = pdata[(pdata['date'] >= '20150101') & (pdata['date'] <= '20151231')]
That ended up working perfectly, though it hard-codes the dates.
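If you'd rather not hard-code date strings, a hedged alternative (it assumes pdata['date'] parses with pd.to_datetime; the bounds are the same 2015 window):
pdata['date'] = pd.to_datetime(pdata['date'])
start, end = pd.Timestamp(2015, 1, 1), pd.Timestamp(2015, 12, 31)
pdata = pdata[(pdata['date'] >= start) & (pdata['date'] <= end)]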