Rearrange data using Python

So following the question I asked before (Replacing special patterns in a string, reading from a file), I seemed to resolve the problem there. But now I've been stuck on this problem for a month.
With the code there, the program reads the data from a file and parses it (dividing the fields with tabs). Now I want it to recognize the data (in this example the authors, year of publication, and ISSN) and rearrange it into a specific form/pattern, for example:
INPUT DATA: Aubrecht, Christoph; Özceylan, Aubrecht Dilek; Klerx, Joachim; Freire, Sérgio (2013) “Future-oriented activities as a concept for improved disaster risk management. Disaster Advances”, 6(12), 1-10. (IF = 2.272) E-ISSN 2278-4543. REVISTA INDEXADA NO WEB OF SCIENCE
AUTHORS: Aubrecht, Christoph; Özceylan, Aubrecht Dilek; Klerx, Joachim; Freire, Sérgio
YEAR: 2013
ISSN: 2278-4543
TEMPLATE 1: AUTHOR YEAR ISSN (Aubrecht, Christoph || 2013 || 2278-4543 )
TEMPLATE 2: YEAR ISSN AUTHOR (2013 || 2278-4543 || Aubrecht, Christoph )
TEMPLATE 3: ISSN YEAR AUTHOR (2278-4543 || 2013 || Aubrecht, Christoph )
The goal is then to import/export this data into Excel and then into a SQL database. I did my research, and what I concluded (not sure if rightly) is that Django seems to be a good way to go, since it can create templates (https://docs.djangoproject.com/en/dev/ref/templates/api/), or alternatively data frames using pandas (How to rearrange some elements as a data frame, Rearrange data for pandas dataframe?). But I'm not sure how to implement either, or whether it's even the best way to do it (or whether it's even possible for a program to recognize those elements and rearrange them). A minimal sketch of the recognition step is below.
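For the recognition step itself, heavy machinery like Django or pandas isn't strictly needed. Here is a minimal sketch using plain regular expressions, assuming the year is the first four-digit number in parentheses and the ISSN follows the token "ISSN":

import re

record = ('Aubrecht, Christoph; Özceylan, Aubrecht Dilek; Klerx, Joachim; '
          'Freire, Sérgio (2013) "Future-oriented activities as a concept for '
          'improved disaster risk management. Disaster Advances", 6(12), 1-10. '
          '(IF = 2.272) E-ISSN 2278-4543. REVISTA INDEXADA NO WEB OF SCIENCE')

authors = record.split('(')[0].strip()                        # everything before the year
year = re.search(r'\((\d{4})\)', record).group(1)             # first four-digit number in parentheses
issn = re.search(r'ISSN\s+(\d{4}-\d{3}[\dX])', record).group(1)

first_author = authors.split(';')[0].strip()
print('TEMPLATE 1: %s || %s || %s' % (first_author, year, issn))
print('TEMPLATE 2: %s || %s || %s' % (year, issn, first_author))
print('TEMPLATE 3: %s || %s || %s' % (issn, year, first_author))

Once the fields are captured like this, writing them out in any template order (and exporting to Excel or SQL) is just string formatting.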
Similar questions I've searched were:
using a key to rearrange string (didn't help)
Rearranging letters from a list

Related

Extract information from table-columns of a PDF file with a pattern

My bank only gives me my activity statement as a PDF, whereas business clients can get a .csv.
I need to extract data from the relevant table in that multi-page PDF. The relevant table sits below the bank's letterhead, and each page is prepended with one 1-row, 6-column information table.
The cells are like:

date          | transaction information                                        | debit   | credit
13.09./13.09. | payment by card ACME-SHOP Reference {SOME 26 digits code} {SHOP OWNER WITH LINE-BREAK} Terminal {some 8 digits code} {DATE-FORMAT LIKE THIS: 2022-09-06T14:25:11} SequenceNr. 012 Expirationd. {4 digit code} | -312,12 |
After the last such row, two tables follow side by side in a horizontal split:

table 1: a 3-row, 2-column table with information on your account
table 2: a 3-row, 2-column table with the new balance
I have taken a look at the tabula library for Python, but it doesn't seem to offer any option for pattern matching.
So my question is whether there is a more sophisticated open-source solution to this problem. It doesn't have to be in Python; I just took a guess that the Python community would be a good place to start looking for this kind of extraction tool.
Otherwise I guess I have to extract the columns and then do pattern matching to re-assemble the data, as sketched below.
Thanks
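For the fallback approach (extract the tables, then pattern-match the cells), here is a minimal sketch with tabula-py and a regular expression; the file name and the field pattern are assumptions based on the cell layout described above:

import re
import tabula  # tabula-py, a wrapper around the tabula Java library

# pull every table from the multi-page statement as pandas DataFrames
tables = tabula.read_pdf("statement.pdf", pages="all", lattice=True)

# hypothetical pattern for one transaction cell as described above
txn_pattern = re.compile(
    r'payment by card (?P<shop>.+?) Reference (?P<ref>\d{26}).*?'
    r'Terminal (?P<terminal>\d{8})', re.DOTALL)

for df in tables:
    for cell in df.astype(str).values.flatten():
        match = txn_pattern.search(cell)
        if match:
            print(match.groupdict())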

How to load complex data using PySpark

I have a CSV dataset; here is the data in text form:
Timestamp,How old are you?,What industry do you work in?,Job title,What is your annual salary?,Please indicate the currency,Where are you located? (City/state/country),How many years of post-college professional work experience do you have?,"If your job title needs additional context, please clarify here:","If ""Other,"" please indicate the currency here: "
4/24/2019 11:43:21,35-44,Government,Talent Management Asst. Director,75000,USD,"Nashville, TN",11 - 20 years,,
4/24/2019 11:43:26,25-34,Environmental nonprofit,Operations Director,"65,000",USD,"Madison, Wi",8 - 10 years,,
4/24/2019 11:43:27,18-24,Market Research,Market Research Assistant,"36,330",USD,"Las Vegas, NV",2 - 4 years,,
4/24/2019 11:43:27,25-34,Biotechnology,Senior Scientist,34600,GBP,"Cardiff, UK",5-7 years,,
4/24/2019 11:43:29,25-34,Healthcare,Social worker (embedded in primary care),55000,USD,"Southeast Michigan, USA",5-7 years,,
4/24/2019 11:43:29,25-34,Information Management,Associate Consultant,"45,000",USD,"Seattle, WA",8 - 10 years,,
4/24/2019 11:43:30,25-34,Nonprofit ,Development Manager ,"51,000",USD,"Dallas, Texas, United States",2 - 4 years,"I manage our fundraising department, primarily overseeing our direct mail, planned giving, and grant writing programs. ",
4/24/2019 11:43:30,25-34,Higher Education,Student Records Coordinator,"54,371",USD,Philadelphia,8 - 10 years,equivalent to Assistant Registrar,
4/25/2019 8:35:51,25-34,Marketing,Associate Product Manager,"43,000",USD,"Cincinnati, OH, USA",5-7 years,"I started as the Marketing Coordinator, and was given the ""Associate Product Manager"" title as a promotion. My duties remained mostly the same and include graphic design work, marketing, and product management.",
Now, I tried the below code to load the data:
df = (spark.read
      .option("header", "true")
      .option("multiline", "true")
      .option("delimiter", ",")
      .csv("path"))
For the last record the columns get split up and the output is not as expected: the value should be null for the last column, i.e. "If ""Other,"" please indicate the currency here: ", and the entire string should be kept in the earlier column, "If your job title needs additional context, please clarify here:".
I also tried .option('quote', '\"').option('escape', '\"'), but that didn't work either.
However, when I tried to load this file using pandas, it loaded correctly; I was surprised that pandas could identify where each new column starts. I could define a string schema for all the columns and load the result back into a Spark data frame, but since I am on a lower Spark version that won't work in a distributed manner, so I was exploring how Spark itself can handle this efficiently.
Any help is much appreciated.
The main issue is the consecutive double quotes in your CSV file.
You have to escape the extra double quotes in your CSV file, like this:
4/24/2019 11:43:30,25-34,Higher Education,Student Records Coordinator,"54,371",USD,Philadelphia,8 - 10 years,equivalent to Assistant Registrar,
4/25/2019 8:35:51,25-34,Marketing,Associate Product Manager,"43,000",USD,"Cincinnati, OH, USA",5-7 years,"I started as the Marketing Coordinator, and was given the \"Associate Product Manager\" title as a promotion. My duties remained mostly the same and include graphic design work, marketing, and product management.",
After this, it generates the expected result:
df2 = spark.read.option("header",True).csv("sample1.csv")
df2.show(10,truncate=False)
******** Output ********
|4/25/2019 8:35:51 |25-34 |Marketing |Associate Product Manager |43,000 |USD |Cincinnati, OH, USA |5-7 years |I started as the Marketing Coordinator, and was given the ""Associate Product Manager"" title as a promotion. My duties remained mostly the same and include graphic design work, marketing, and product management.|null |null |
Or you can use the code below, which handles the doubled quotes without editing the file:
df2 = spark.read.option("header",True).option("multiline","true").option("escape","\"").csv("sample1.csv")
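As for why pandas parsed the original file correctly: pandas' read_csv treats a doubled quote inside a quoted field as an escaped quote by default (doublequote=True, the RFC 4180 convention), which is exactly what the escape="\"" option above tells Spark to do. A quick check, assuming the data is saved as sample1.csv:

import pandas as pd

# doublequote=True is the default: "" inside a quoted field becomes a literal "
df = pd.read_csv("sample1.csv")
print(df.iloc[-1])  # the clarification text stays in its column; the currency column is NaN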

Extracting FASTA Moonlighting Protein Sequences with Python

I want to extract the FASTA files that have the amino acid sequences from the Moonlighting Protein Database (www.moonlightingproteins.org/results.php?search_text=) via Python, since it's an iterative process that I'd rather learn how to program than do manually, b/c come on, we're in 2016. The problem is I don't know how to write the code, because I'm a rookie programmer :( . The basic pseudocode would be:
for protein_name in www.moonlightingproteins.org/results.php?search_text=:
    go to the UniProt option
    download the FASTA file
    store it in a .txt file inside a given folder
Thanks in advance!
I would strongly suggest asking the authors for the database. From the FAQ:
"I would like to use the MoonProt database in a project to analyze the amino acid sequences or structures using bioinformatics."
"Please contact us at bioinformatics@moonlightingproteins.org if you are interested in using the MoonProt database for analysis of sequences and/or structures of moonlighting proteins."
Assuming you find something interesting, how are you going to cite it in your paper or your thesis? "The sequences were scraped from a public webpage without the consent of the authors"? It is much better to give credit to the original researchers.
But back to your original question.
import requests
from lxml import html

# let's download one protein at a time; change 3 to any other id
page = requests.get('http://www.moonlightingproteins.org/detail.php?id=3')
# convert the HTML document to something we can parse in Python
tree = html.fromstring(page.content)
# get all table cells
cells = tree.xpath('//td')
for i, cell in enumerate(cells):
    # if we get something which looks like a FASTA sequence, print it
    if cell.text and cell.text.startswith('>'):
        print(cell.text)
    # if we find a table cell which has UniProt in it,
    # let's print the link from the next cell
    if 'UniProt' in cell.text_content():
        link = cells[i + 1].find('a')
        if link is not None and 'href' in link.attrib:
            print(link.attrib['href'])
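To cover the last step of your pseudocode (storing each sequence in a .txt file inside a given folder), here is a minimal sketch building on the code above; the fasta_out folder and the save_fasta name are just illustrative choices:

import os
import requests
from lxml import html

os.makedirs('fasta_out', exist_ok=True)

def save_fasta(protein_id):
    # download one detail page and store any FASTA-looking cell in a .txt file
    url = 'http://www.moonlightingproteins.org/detail.php?id=%d' % protein_id
    tree = html.fromstring(requests.get(url).content)
    for cell in tree.xpath('//td'):
        if cell.text and cell.text.startswith('>'):
            with open(os.path.join('fasta_out', '%d.txt' % protein_id), 'w') as fh:
                fh.write(cell.text)

save_fasta(3)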

Technique to add csv data for varying fields

I am trying to scrape results from a site (it has no captcha, just simple roll-no authentication, and the pattern of the roll numbers is known to me). The problem is that the results are in a table format and different students have different subjects. The code I have written so far in Python is:
for row in rows:
    col = row.findAll('td')             # BeautifulSoup objects
    sub = col[1].text.encode('utf-8')   # header (subject names)
    subjectname.append(sub)
    marks = col[4].text.encode('utf-8')
    markall.append(marks)
csvwriter.writerows([subjectname])
csvwriter.writerows([markall])
I want to generate a .csv file so that I can do some data analysis on it. The problem is that I want a table with a column for each specific subject and the marks under it, but the scraper won't know when it hits a different subject and will append the marks of whatever subject it finds in that row/column pair.
How do I approach this? If I have Subject A at column 1, I want to get the marks of Subject A only and not of any other subject. Do I need to create a list for all marks? One idea is sketched below.
Edit: here's the HTML table markup: https://jsfiddle.net/rpmgku7m/
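One way to keep the marks aligned per subject is to collect one dict per student, keyed by subject name, and let csv.DictWriter fill in the blanks for subjects a student doesn't have. A minimal sketch, assuming each student's rows are parsed as in the snippet above (all_result_pages is a hypothetical list of per-student row sets):

import csv

students = []                            # one dict per student: {subject_name: mark}
for rows in all_result_pages:            # hypothetical: one parsed table per student
    record = {}
    for row in rows:
        col = row.findAll('td')
        record[col[1].text.strip()] = col[4].text.strip()
    students.append(record)

# the union of all subjects seen becomes the CSV header
subjects = sorted({s for record in students for s in record})
with open('marks.csv', 'w', newline='') as fh:
    writer = csv.DictWriter(fh, fieldnames=subjects, restval='')
    writer.writeheader()
    writer.writerows(students)

With restval='', a student who never took Subject B simply gets an empty cell in that column instead of a misaligned mark.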

How does MySQL order textual data?

This is the result of a query over a dataset of articles in MySQL, ordered by article title ascending with LIMIT 10. The collation is set to utf8_unicode_ci.
'GTA 5' Sells $800 Million In One Day
'Infinity Blade III' hits the App Store ahead of i...
‘Have you lurked her texts?’ How the directors of ...
‘Second Moon’ by Katie Paterson now on a journey a...
"Do not track" effort in trouble as Digital Advert...
"Forbes": Bill Gates wciąż najbogatszym obywatelem...
"Here Is Something False: You Only Live Once"
“That's The Dumbest Thing I've Ever Heard Of.”
[Introduction to Special Issue] The Future Is Now
1 Great Dividend You Can Buy Right Now
I thought ordering worked by the position of the character in the encoding table: ' is 39 and " is 34 in Unicode, but the curly apostrophe ’ (U+2019) and the curly double quote “ (U+201C) have much higher code points. From my understanding, the titles starting with ‘ and “ shouldn't make it into the result at all, and " should be at the top. I'm clearly missing something here.
My goal is to order this data by title in Python and get the same results as if the data were ordered in MySQL.
The gist of it is that, in order to get better sort orders, the Unicode Collation Algorithm is used, which (probably) treats “ like " and ‘ like ' when sorting.
Unfortunately this is not simple to emulate in Python, as the algorithm is non-trivial. You can look for a wrapper library like PyICU to do the hard work, although I have no guarantee it will match MySQL exactly. A sketch follows.
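For illustration, here is a minimal sketch of collation-aware sorting with PyICU (assuming the package is installed; this approximates utf8_unicode_ci but is not guaranteed to match it byte for byte):

import icu  # pip install PyICU

titles = [
    '"Here Is Something False: You Only Live Once"',
    '‘Second Moon’ by Katie Paterson now on a journey a...',
    "'GTA 5' Sells $800 Million In One Day",
    '1 Great Dividend You Can Buy Right Now',
]

# build a UCA-based collator; under its sort keys the curly and straight
# quote variants end up close together, as in the MySQL output
collator = icu.Collator.createInstance(icu.Locale('en_US'))
for title in sorted(titles, key=collator.getSortKey):
    print(title)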
