String Comparison in Python for harmonization

I am coming from the SQL world, and we are using pandas for ETL this time. In SQL we use DIFFERENCE and SOUNDEX for string comparison, but it has not been giving the expected results lately. Is there any way to achieve this in Python?
Currently we are using code like the statement below, which returns a score for the match.
SELECT difference(soundex('string'),soundex(Col)) from table
Looking for a similar solution here. Thanks in advance
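One possible starting point, using only the standard library, is sketched below. The names soundex, difference and similarity are hypothetical helpers written for this illustration, not an existing pandas or Python API. soundex follows the classic American Soundex rules, and difference simply counts how many of the four code positions agree, so its scores may not match the SQL DIFFERENCE function in every case. similarity uses difflib.SequenceMatcher as a character-level alternative when phonetic matching is too coarse.

```python
from difflib import SequenceMatcher

# Letter-to-digit table for classic American Soundex.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(text):
    """Return a 4-character Soundex code, or '' if text has no ASCII letters."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def difference(a, b):
    """Simplified score: number of matching positions (0-4) in the two codes."""
    code_a, code_b = soundex(a), soundex(b)
    if not code_a or not code_b:
        return 0
    return sum(x == y for x, y in zip(code_a, code_b))


def similarity(a, b):
    """Character-level similarity ratio in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


print(soundex("Robert"), soundex("Rupert"), difference("Robert", "Rupert"))
```

On a pandas column, something like df["Col"].map(lambda value: difference("string", value)) would apply the score row by row, assuming the column holds strings.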

Related

How to restructure this dataframe using Python?

[What I am starting out with / What I want to end up with](https://i.stack.imgur.com/xW8Zf.jpg)
I am having trouble writing the code to transform this dataset into what you see below. I am a beginner and am just practicing using Python.
So far, I have tried the str.split, but it didn’t produce the results I was hoping for.

How to show float values with pandas.read_sql from a sqlite query

I need to get this result for an assignment using python/sqlite3.
required result
I did my query in MYSQL, and I got the answer already to the assignment question. Since I am learning, I find it easier to do the queries using MySQL Workbench first.
result in MySQLWorkbench
However, when I try to do it in a Jupyter notebook with sqlite3, it only shows zeros in the percentage column.
I am using the function pd.read_sql_query. I went to the documentation and could not find any arguments there that would do what I want, or I did not understand it. I played with the coerce_float argument, but it did not make a difference. I am learning, so sometimes I do not understand the documentation completely.
query_results = pd.read_sql_query(query1,conn)
This is what I get in my Jupyter notebook:
Output in Jupyter Notebook
I know the numbers are there because if I multiply the column "percentage_female_only_movie" by 100, I see them. I would like to know how to show them like in MySQL Workbench.
Thank you for any help. And if you know any link where I can learn about this type of issue, I would love it if you could share it.
Try df[colname] = df[colname].astype(float).
This would convert your column to float and you should see the values
I found the solution. I needed to CAST the numerator and denominator of the column I was generating in the SELECT statement of my query, because SQLite performs integer division when both operands are integers.
SELECT SUBSTR(TRIM(m.year),-4) AS Movie_year,
ROUND(CAST(fcm.Female_Cast_Only_Movies*100 AS FLOAT)/ CAST(tm.movies_total AS FLOAT),2) AS Percentage_Female_only_Movie,
tm.movies_total As Total_movies
FROM Movie AS m
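The effect of the CAST can be reproduced in isolation. The sketch below uses an in-memory SQLite database and made-up numbers (3 and 4) rather than the Movie tables from the question, so it only illustrates integer versus real division, not the full query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Both operands are integers, so SQLite truncates the quotient.
int_result = cur.execute("SELECT 3 / 4").fetchone()[0]

# Casting the operands to FLOAT forces real-number division.
float_result = cur.execute(
    "SELECT ROUND(CAST(3 * 100 AS FLOAT) / CAST(4 AS FLOAT), 2)"
).fetchone()[0]

print(int_result)    # 0
print(float_result)  # 75.0
conn.close()
```

The same queries passed to pd.read_sql_query should return the same values, since pandas only reads what SQLite has already computed.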

Python - efficient approach to build an iterating mean

I'm a Python user and I'm quite lost on the task below.
Let df be a time series of 1000 stock returns.
I would like to calculate an iterating mean as shown below:
df[0:500].mean()
df[0:501].mean()
df[0:502].mean()
...
df[0:999].mean()
df[0:1000].mean()
How can I write efficient code for this?
Many thanks
Pandas has common transformations like this built in. See for example:
df.expanding().mean()
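To start at a window of 500 as in the question, expanding also accepts a min_periods argument, for example df.expanding(min_periods=500).mean(), which leaves the earlier rows as NaN. As a rough illustration of what that computes, here is a pure-Python sketch of the same running mean built from a cumulative sum. expanding_means is a hypothetical helper name, and the sample values are made up.

```python
from itertools import accumulate


def expanding_means(values, min_periods=1):
    """Mean of values[0:n] for every n from min_periods to len(values)."""
    means = []
    # accumulate yields the running sum, so each mean costs one division
    # instead of re-summing the whole slice.
    for n, running_sum in enumerate(accumulate(values), start=1):
        if n >= min_periods:
            means.append(running_sum / n)
    return means


print(expanding_means([1, 2, 3, 4], min_periods=2))  # [1.5, 2.0, 2.5]
```

For 1000 returns and min_periods=500 this yields 501 means, one for each slice from df[0:500] to df[0:1000].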

How to write a dataframe without using pandas or any package?

I'm studying Python, and one of my goals is to write most of my code without packages. I would like to write a structure that looks like pandas's DataFrame, but without using any other package. Is there any way to do that?
Using pandas, my code looks like this:
from pandas import DataFrame
...
s = DataFrame(s, index = ind)
where ind is the result of a function.
Maybe dictionary could be the answer?
Thanks
No native Python data structure has all the features of a pandas DataFrame, which is part of why pandas was written in the first place. Using packages that others have written brings the time and work of many other people into your code and extends what it can do, much as Isaac Newton said his famous discoveries were only possible by standing on the shoulders of giants.
There's no easy answer to your question except to point out that pandas is open-source, and its implementation of the DataFrame can be found at https://github.com/pandas-dev/pandas.
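That said, the dictionary idea from the question can cover the basics. Below is a minimal sketch, assuming all that is needed is named columns plus row labels. SimpleFrame and its row method are hypothetical names invented for this example, and the class offers none of pandas's alignment, dtype handling or vectorised operations.

```python
class SimpleFrame:
    """Toy column-oriented table: a dict of equal-length lists plus row labels."""

    def __init__(self, columns, index=None):
        lengths = {len(col) for col in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        n_rows = lengths.pop() if lengths else 0
        self.columns = {name: list(col) for name, col in columns.items()}
        # Default to positional labels 0..n-1 when no index is given.
        self.index = list(index) if index is not None else list(range(n_rows))
        if len(self.index) != n_rows:
            raise ValueError("index length must match the number of rows")

    def __getitem__(self, name):
        """Return one column as a list."""
        return self.columns[name]

    def row(self, label):
        """Return one row, looked up by index label, as a dict."""
        pos = self.index.index(label)
        return {name: col[pos] for name, col in self.columns.items()}


frame = SimpleFrame({"price": [10, 12], "qty": [3, 5]}, index=["a", "b"])
print(frame.row("b"))  # {'price': 12, 'qty': 5}
```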

pandas.dataframe.set_index(column1) in python to MATLAB

I want to index a table in MATLAB on a particular column. In Python we can use set_index(column_name) from the pandas library. I want equivalent code that does the same in MATLAB. To be more precise, I want to look at the internal code of set_index() in Python. Can someone help me?
Code in MATLAB:
T = readtable('filename.csv');
I want to set an index on T.column_name here.
