Import File to Database in PhpMyAdmin - python

I want to import a file into a phpMyAdmin database. The table should have 5 columns: id, url, lat, lon and address. However, each line of the file is structured as follows:
23947501894 https://farm2.staticflickr.com/1664/23947501894_09e21ac1c4_q.jpg 53.404021 -2.996651 Belgian Merchant Seamen, Queensway (Mersey Tunnel), Liverpool, North West England, England, CH41, United Kingdom
Most of the data I want to input is separated by a single space, except for the address at the end, which contains many spaces and commas. Is it possible to input this data into the database as is? If so, can anyone suggest how I might do this?
I am very new to phpMyAdmin and I am using Python to do this. Thanks in advance for your help; I am very stuck!

You'll have to process the text file before importing, since the delimiter also appears unescaped within your data.
The good news is that your data format makes this really easy. Take the first four spaces of each line and convert them to a special character (maybe ; or ~, something that doesn't appear anywhere else in your data). You can accomplish this with your favorite stream editor or text-manipulation program (sed, awk, perl, and python are all good candidates for this work).
There are many ways to do this (see also these answers for an idea of how many different ways exist, though note that that question is about working on an entire file, whereas we want to work on individual lines), but probably the simplest is running sed four times:
for i in $(seq 4) ; do sed -i -e 's/ /~/' ~/import.csv ; done
Make sure you do this with a copy of the file because this will edit the specified file in-place.
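Since you mention you're doing this from Python anyway, the same preprocessing is only a few lines there too. This is a minimal sketch, assuming the input file is called import.txt and every line starts with exactly four space-separated fields before the address (the file names are placeholders):
# convert the first four spaces of each line to '~' so phpMyAdmin
# can use '~' as the column separator
with open('import.txt') as src, open('import.csv', 'w') as dst:
    for line in src:
        # maxsplit=4 keeps the address (spaces, commas and all) in one piece
        dst.write('~'.join(line.rstrip('\n').split(' ', 4)) + '\n')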
From your phpMyAdmin Import tab, use ~ (or whatever separator you chose) as the value for "Columns separated with:", leave the other fields blank, and keep "auto" for "Lines terminated with:".

Log in to phpMyAdmin, then do the following (in the right-hand frame):
1. Click the Databases tab and create a database
2. Click the Import tab
3. Click Browse and select the CSV file
4. Change Format from SQL to CSV
5. Click Go
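If you'd rather skip phpMyAdmin and the CSV step entirely, you can also load the file straight from Python with a MySQL driver such as pymysql. This is a different approach from the answers above, and the connection details and table name here are placeholders:
import pymysql

conn = pymysql.connect(host='localhost', user='me',
                       password='secret', database='mydb')  # placeholders
with open('import.txt') as f:
    # split each line into exactly five fields: id, url, lat, lon, address
    rows = [line.rstrip('\n').split(' ', 4) for line in f]
with conn.cursor() as cur:
    cur.executemany(
        'INSERT INTO photos (id, url, lat, lon, address) '
        'VALUES (%s, %s, %s, %s, %s)', rows)
conn.commit()
conn.close()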


Extract text from a config file [duplicate]

I'm using a config file to inform my Python script of a few key-values, for use in authenticating the user against a website.
I have three variables: the URL, the user name, and the API token.
I've created a config file with each key on a different line, so:
url:<url string>
auth_user:<user name>
auth_token:<API token>
I want to be able to extract the text after the key words into variables, also stripping any "\n" that exists at the end of the line. Currently I'm doing this, and it works, but it seems clumsy:
from sys import argv
from re import match

with open(argv[1], mode='r') as config_file:
    lines = config_file.readlines()
    for line in lines:
        url_match = match('jira_url:', line)
        if url_match:
            jira_url = line[9:].split("\n")[0]
        user_match = match('auth_user:', line)
        if user_match:
            auth_user = line[10:].split("\n")[0]
        token_match = match('auth_token:', line)
        if token_match:
            auth_token = line[11:].split("\n")[0]
Can anybody suggest a more elegant solution? Specifically it's the ... = line[10:].split("\n")[0] lines that seem clunky to me.
I'm also slightly confused why I can't reuse my match object within the for loop, and have to create new match objects for each config item.
You could use a .yml file and read the values with the yaml.load() function:
import yaml

with open('settings.yml') as file:
    settings = yaml.load(file, Loader=yaml.FullLoader)
Now you can access the elements like settings["url"] and so on.
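For reference, the matching settings.yml would look something like this (note that, unlike the original file, YAML wants a space after each colon for this to parse as a mapping):
url: <url string>
auth_user: <user name>
auth_token: <API token>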
If the format is always <tag>:<value> you can easily parse it by splitting the line at the colon and filling up a custom dictionary:
config_file = open(filename, "r")
lines = config_file.readlines()
config_file.close()

settings = dict()
for l in lines:
    elements = l[:-1].split(':')
    settings[elements[0]] = ':'.join(elements[1:])
So, you get a dictionary that has the tags as keys and the values as values. You can then just refer to these dictionary entries in your program (e.g., if you need the auth_token, just call settings["auth_token"]).
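A slightly tidier variant of the same idea uses str.partition, which splits on the first colon only, so values that themselves contain colons (like URLs) survive without the join:
settings = dict()
with open(filename) as config_file:
    for line in config_file:
        # partition returns (key, separator, value)
        key, _, value = line.rstrip('\n').partition(':')
        settings[key] = value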
If you can add one line to the config file, configparser is a good choice:
https://docs.python.org/3/library/configparser.html
[1] config file: 1.cfg
# configparser config files need a section header
[DEFAULT]
url:<url string>
auth_user:<user name>
auth_token:<API token>
[2] Python script:
import configparser
config = configparser.ConfigParser()
config.read('1.cfg')
print(config.get('DEFAULT','url'))
print(config.get('DEFAULT','auth_user'))
print(config.get('DEFAULT','auth_token'))
[3] output
<url string>
<user name>
<API token>
configparser's methods are also useful when you can't guarantee that the config file is always complete, as shown below.
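For example, get() accepts a fallback keyword argument, so a missing option doesn't raise an exception (the fallback string here is made up):
# returns the fallback instead of raising if the option is absent
print(config.get('DEFAULT', 'url', fallback='<no url configured>'))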
You have a couple of great answers already, but I wanted to step back and provide some guidance on how you might approach these problems in the future. Getting quick answers sometimes prevents you from understanding how those people knew about the answers in the first place.
When you zoom out, the first thing that strikes me is that your task is to provide config, using a file, to your program. Software has the remarkable property of solve-once, use-anywhere. Config files have been a problem worth solving for at least 40 years, so you can bet your bottom dollar you don't need to solve this yourself. And already-solved means someone has already figured out all the little off-by-one and edge-case dramas like stripping line endings and dealing with unexpected input. The challenge, of course, is knowing what solution already exists. If you haven't spent 40 years peeling back the covers of computers to see how they tick, it's difficult to "just know". So you might have a poke around on Google for "config file format" or something.
That would lead you to one of the most prevalent config file systems on the planet - the INI file. Just as useful now as it was 30 years ago, and as a bonus, looks not too dissimilar to your example config file. Then you might search for "read INI file in Python" or something, and come across configparser and you're basically done.
Or you might see that sometime in the last 30 years, YAML became the more trendy option, and wouldn't you know it, PyYAML will do most of the work for you.
But none of this gets you any better at using Python to extract from text files in general. So zooming in a bit, you want to know how to extract parts of lines in a text file. Again, this problem is an age-old problem, and if you were to learn about this problem (rather than just be handed the solution), you would learn that this is called parsing and often involves tokenisation. If you do some research on, say "parsing a text file in python" for example, you would learn about the general techniques that work regardless of the language, such as looping over lines and splitting each one in turn.
Zooming in one more step closer, you're looking to strip the new line off the end of the string so it doesn't get included in your value. Once again, this ain't a new problem, and with the right keywords you could dig up the well-trodden solutions. This is often called "chomping" or "stripping", and with some careful search terms, you'd find rstrip() and friends, and not have to do awkward things like splitting on the '\n' character.
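For instance (with a made-up line):
>>> 'auth_user:bob\n'.rstrip()
'auth_user:bob'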
Your final question is about re-using the match object. This is much harder to research. But again, the "solution" won't necessarily show you where you went wrong. What you need to keep in mind is that the statements in the for loop are sequential. To think them through, you should literally execute them in your mind, one after another, and imagine what's happening. Each time you call match, it either returns None or a Match object. You never use the object, except to check for truthiness in the if statement. And next time you call match, you do so with different arguments, so you get a new Match object (or None). Therefore, you don't need to keep the object around at all. You can simply do:
if match('jira_url:', line):
    jira_url = line[9:].split("\n")[0]
if match('auth_user:', line):
    auth_user = line[10:].split("\n")[0]
and so on. Not only that: if the first if triggered, then you don't need to bother calling match again - it will certainly not trigger any of the other matches for the same line. So you could do:
if match('jira_url:', line):
    jira_url = line[9:].rstrip()
elif match('auth_user:', line):
    auth_user = line[10:].rstrip()
and so on.
But then you can start to think: why bother doing all these matches on the key, only to then manually slice the string at a hard-coded offset afterwards? You could just split at the colon:
# maxsplit=1 so values containing ':' (like URLs) stay intact
tokens = line.rstrip().split(':', 1)
if tokens[0] == 'jira_url':
    jira_url = tokens[1]
elif tokens[0] == 'auth_user':
    auth_user = tokens[1]
If you keep making these improvements (and there's lots more to make!), eventually you'll end up re-writing configparser, but at least you'll have learned why it's often a good idea to use an existing library where practical!

How to read list element in Python from a text file?

My text file is like below.
[0, "we break dance not hearts by Short Stack is my ringtone.... i LOVE that !!!.....\n"]
[1, "I want to write a . I think I will.\n"]
[2, "#va_stress broke my twitter..\n"]
[3, "\" "Y must people insist on talking about stupid politics on the comments of a bubblegum pop . Sorry\n"]
[4, "aww great "Picture to burn"\n"]
[5, "#jessdelight I just played ur joint two s ago. Everyone in studio was feeling it!\n"]
[6, "http://img207.imageshack.us/my.php?image=wpcl10670s.jpg her s are so perfect.\n"]
[7, "cannot hear the new due to geographic location. i am geographically undesirable. and tune-less\n"]
[8, "\" couples in public\n"]
[9, "damn wendy's commerical got that damn in my head.\n"]
[10, "i swear to cheese & crackers #zyuuup is in Detroit like every 2 months & i NEVER get to see him! i swear this blows monkeyballs!\n"]
[11, "\" getting ready for school. after i print out this\n"]
I want to read the second element of every list, meaning all the tweet texts, into an array. I wrote:
tweets = []
for line in open('tweets.txt').readlines():
    print line[1]
    tweets.append(line)
but when I look at the output, it just takes the 2nd character of every line.
When you read a text file in Python, the lines are just strings. They aren't automatically converted to some other data structure.
In your case, it looks like each line in your file contains a JSON list. In that case, you can parse the line first using json.loads(). This converts the string to a Python list which you can then take the second element of:
import json

with open('tweets.txt') as fp:
    tweets = [json.loads(line)[1] for line in fp]
Maybe you should consider using the json.loads method:
import json

tweets = []
for line in open('tweets.txt').readlines():
    tweet = json.loads(line)[1]
    print tweet
    tweets.append(tweet)
There is a more pythonic way in @Erik Cederstrand's comment.
Rather than guessing what format the data is in, you should find out.
If you're generating it yourself, and don't know how to parse back in what you're creating, change your code to generate something that can be easily parsed with the same library used to generate it, like JsonLines or CSV.
If you're ingesting it from some API, read the documentation for that API and parse it the way it's documented.
If someone handed you the file and told you to parse it, ask that someone what format it's in.
Occasionally, you do have to deal with some crufty old file in some format that was never documented and nobody remembers what it was. In that case, you do have to reverse engineer it. But what you want to do then is guess at likely possibilities, and try to parse it with as much validation and error handling as possible, to verify that you guessed right.
In this case, the format looks a lot like either JSON lines or ndjson. Both are slightly different ways of encoding multiple objects with one JSON text per line, with specific restrictions on those texts and the way they're encoded and the whitespace between them.
So, while a quick&dirty parser like this will probably work:
import json

with open('tweets.txt') as f:
    for line in f:
        tweet = json.loads(line)
        dosomething(tweet)
You probably want to use a library like jsonlines:
import jsonlines

with jsonlines.open('tweets.txt') as f:
    for tweet in f:
        dosomething(tweet)
The fact that the quick&dirty parser works on JSON lines is, of course, part of the point of that format—but if you don't actually know whether you have JSON lines or not, you're better off making sure.
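If you do stick with the quick&dirty loop, at least add enough error handling to notice when your guess about the format was wrong; a minimal sketch (dosomething is the same placeholder as above):
import json

with open('tweets.txt') as f:
    for lineno, line in enumerate(f, 1):
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError as e:
            # report which line broke instead of failing mysteriously
            raise ValueError('line {} is not valid JSON lines: {}'.format(lineno, e))
        dosomething(tweet)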
Since your input looks like Python expressions, I'd use ast.literal_eval to parse them.
Here is an example:
import ast

with open('tweets.txt') as fp:
    tweets = [ast.literal_eval(line)[1] for line in fp]

print(tweets)
Output:
['we break dance not hearts by Short Stack is my ringtone.... i LOVE that !!!.....\n', 'I want to write a . I think I will.\n', '#va_stress broke my twitter..\n', '" "Y must people insist on talking about stupid politics on the comments of a bubblegum pop . Sorry\n', 'aww great "Picture to burn"\n', '#jessdelight I just played ur joint two s ago. Everyone in studio was feeling it!\n', 'http://img207.imageshack.us/my.php?image=wpcl10670s.jpg her s are so perfect.\n', 'cannot hear the new due to geographic location. i am geographically undesirable. and tune-less\n', '" couples in public\n', "damn wendy's commerical got that damn in my head.\n", 'i swear to cheese & crackers #zyuuup is in Detroit like every 2 months & i NEVER get to see him! i swear this blows monkeyballs!\n', '" getting ready for school. after i print out this\n']

Python CSV module handling comma within quote inside a field

I am using Python's csv module to parse data from a CSV file in my application. While testing the application, my colleague entered a piece of sample text copy-pasted from a random website.
The sample text has double quotes inside the field and a comma within those double quotes. The commas outside of double quotes are correctly handled by the csv module, but the comma inside the double quotes spills into the next column. I looked at the CSV specification, and the field does comply with it: the embedded double quotes are escaped by doubling them.
I checked the file in libreoffice and it is handled correctly.
Here's one line from the csv data where I'm having a problem:
company_name,company_revenue,company_start_year,company_website,company_description,company_email
Acme Inc,80000000000000,2004,http://google.com,"The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as ""Acme Rocket-Powered Products, Inc."" based in Fairfield, New Jersey. Many of its products appear to be produced specifically for Wile E. Coyote; for example, the Acme Giant Rubber Band, subtitled ""(For Tripping Road Runners)"".
Sometimes, Acme can also send living creatures through the mail, though that isn't done very often. Two examples of this are the Acme Wild-Cat, which had been used on Elmer Fudd and Sam Sheepdog (which doesn't maul its intended victim); and Acme Bumblebees in one-fifth bottles (which sting Wile E. Coyote). The Wild Cat was used in the shorts Don't Give Up the Sheep and A Mutt in a Rut, while the bees were used in the short Zoom and Bored.
While their products leave much to be desired, Acme delivery service is second to none; Wile E. can merely drop an order into a mailbox (or enter an order on a website, as in the Looney Tunes: Back in Action movie), and have the product in his hands within seconds.",roadrunner#acme.com
Here's what it looks like in the debug log:
2014-08-27 21:35:53,922 - DEBUG: company_website=http://google.com
2014-08-27 21:35:53,923 - DEBUG: company_revenue=80000000000000
2014-08-27 21:35:53,923 - DEBUG: company_start_year=2004
2014-08-27 21:35:53,923 - DEBUG: account_description=The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as "Acme Rocket-Powered Products
2014-08-27 21:35:53,924 - DEBUG: company_name=Acme Inc
2014-08-27 21:35:53,925 - DEBUG: company_email=Inc."" based in Fairfield
The relevant piece of code to handle csv parsing:
with open(csvfile, 'rU') as contactsfile:
    # sniff for dialect of csvfile so we can automatically determine
    # what delimiters to use
    try:
        dialect = csv.Sniffer().sniff(contactsfile.read(2048))
    except:
        dialect = 'excel'
    get_total_jobs(contactsfile, dialect)
    contacts = csv.DictReader(contactsfile, dialect=dialect, skipinitialspace=True,
                              quoting=csv.QUOTE_MINIMAL)

    # Start reading the rows
    for row in contacts:
        process_job()
        for key, value in row.iteritems():
            logging.debug("{}={}".format(key, value))
I understand that this is just junk data and we'll likely never encounter data quite like it, but the CSV files we receive are not within our control, so we can hit such an edge case. And since it's a valid CSV file, which libreoffice handles correctly, it makes sense for me to handle it correctly as well.
I have searched for other questions on CSV handling where people had problems with either quotes or commas within a field. I have both of those working fine; my problem is when a comma is nested within quotes within a field. There is a question with the same problem, Comma in Double Quotes in CSV File, which does solve the issue, but in a hackish way that doesn't preserve the contents as they are given to me, even though they are valid per RFC 4180.
The Dialect.doublequote attribute controls how instances of quotechar appearing inside a field should themselves be quoted. When True, the character is doubled. When False, the escapechar is used as a prefix to the quotechar. It defaults to True.
The sniffer is setting the doublequote attribute to False, but the CSV you posted should be parsed with doublequote = True:
import csv

with open(csvfile, 'rb') as contactsfile:
    # sniff for dialect of csvfile so we can automatically determine
    # what delimiters to use
    try:
        dialect = csv.Sniffer().sniff(contactsfile.read(2048))
    except:
        dialect = 'excel'
    # get_total_jobs(contactsfile, dialect)
    contactsfile.seek(0)
    contacts = csv.DictReader(contactsfile, dialect=dialect, skipinitialspace=True,
                              quoting=csv.QUOTE_MINIMAL, doublequote=True)

    # Start reading the rows
    for row in contacts:
        for key, value in row.iteritems():
            print("{}={}".format(key, value))
yields
company_description=The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as "Acme Rocket-Powered Products, Inc." based in Fairfield, New Jersey. Many of its products appear to be produced specifically for Wile E. Coyote; for example, the Acme Giant Rubber Band, subtitled "(For Tripping Road Runners)".
Sometimes, Acme can also send living creatures through the mail, though that isn't done very often. Two examples of this are the Acme Wild-Cat, which had been used on Elmer Fudd and Sam Sheepdog (which doesn't maul its intended victim); and Acme Bumblebees in one-fifth bottles (which sting Wile E. Coyote). The Wild Cat was used in the shorts Don't Give Up the Sheep and A Mutt in a Rut, while the bees were used in the short Zoom and Bored.
While their products leave much to be desired, Acme delivery service is second to none; Wile E. can merely drop an order into a mailbox (or enter an order on a website, as in the Looney Tunes: Back in Action movie), and have the product in his hands within seconds.
company_website=http://google.com
company_start_year=2004
company_name=Acme Inc
company_revenue=80000000000000
company_email=roadrunner#acme.com
Also, per the docs, in Python2 the filehandle should be opened in 'rb' mode, not 'rU' mode:
If csvfile is a file object, it must be opened with the ‘b’ flag on
platforms where that makes a difference.
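In Python 3 the same advice translates to opening the file in text mode with newline='' and letting the csv module manage line endings itself; a minimal sketch of the reading loop (sniffing omitted):
import csv

with open(csvfile, newline='') as contactsfile:
    contacts = csv.DictReader(contactsfile, skipinitialspace=True)
    for row in contacts:
        for key, value in row.items():  # iteritems() no longer exists in Python 3
            print("{}={}".format(key, value))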

Writing a script for fetching the specific part of a web page in a loop for offline use

I have a specific use case. I am preparing for the GRE. Every time a new word comes up, I look it up at
www.mnemonicdictionary.com for its meanings and mnemonics. I want to write a script, preferably in Python (or, if someone could point me to an already existing tool; I don't know Python well but I am learning now), which takes a list of words from a text file, looks each one up at this site, fetches just the relevant portion (meaning and mnemonics), and stores it in another text file for offline use. Is it possible to do so? I tried to look at the source of these pages too, but along with HTML tags they also have some AJAX functions.
Could someone outline a complete way to go about this?
Example: for the word impecunious, the related HTML source looks like this:
<ul class='wordnet'><li><p>(adj.) not having enough money to pay for necessities</p><u>synonyms</u> : hard up , in straitened circumstances , penniless , penurious , pinched<p></p></li></ul>
but the web page renders like this:
•(adj.) not having enough money to pay for necessities
synonyms : hard up , in straitened circumstances , penniless , penurious , pinched
If you have Bash (version 4+) and wget, here's an example:
#!/bin/bash

template="http://www.mnemonicdictionary.com/include/ajaxSearch.php?word=%s&event=search"

while read -r word
do
    url=$(printf "$template" "$word")
    data=$(wget -O- -q "$url")
    # strip everything up to and including the first space, then
    # print everything before the first HTML tag
    data=${data#* }
    echo "$word: ${data%%<*}"
done < file
Sample output
$> more file
synergy
tranquil
jester
$> bash dict.sh
synergy: the working together of two things (muscles or drugs for example) to produce an effect greater than the sum of their individual effects
tranquil: (of a body of water) free from disturbance by heavy waves
jester: a professional clown employed to entertain a king or nobleman in the Middle Ages
Update: include the mnemonic
template="http://www.mnemonicdictionary.com/include/ajaxSearch.php?word=%s&event=search"

while read -r word
do
    url=$(printf "$template" "$word")
    data=$(wget -O- -q "$url")
    data=${data#* }
    # pull out the contents of the class='mnemonic' element
    m=${data#*class=\'mnemonic\'}
    m=${m%%</p>*}
    m="${m##* }"
    echo "$word: ${data%%<*}, mnemonic: $m"
done < file
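Since the question asks for Python, here is a rough Python 3 equivalent of the Bash loop above. It is untested, assumes the same ajaxSearch.php endpoint and markup, and mirrors the shell string-chopping rather than doing proper HTML parsing:
import urllib.parse
import urllib.request

template = ('http://www.mnemonicdictionary.com/include/ajaxSearch.php'
            '?word={}&event=search')

with open('file') as f:
    for word in (line.strip() for line in f):
        url = template.format(urllib.parse.quote(word))
        data = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
        # drop everything up to the first space (${data#* }), then cut at
        # the first tag (${data%%<*}), just like the shell version
        data = data.split(' ', 1)[-1]
        print('{}: {}'.format(word, data.split('<', 1)[0]))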
Use curl and sed from a Bash shell (either Linux, Mac, or Windows with Cygwin).
If I get a second I will write a quick script ... gotta give the baby a bath now though.

Extracting data from MS Word

I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.
Which is the best way to do this:
VBA macro from inside Word to create CSV and then upload to the DB?
VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
Python script via win32com then upload to DB?
The last one is attractive to me as the web-interface is being built with Django, but I've never used win32com or tried scripting Word from python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though - all the text is in Tables, and when I pull the strings out of the CELLS I want, I get a strange little box character at the end of each string. My code looks like:
sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
If Target = "" Then
ExportText = ""
Else
ExportText = Descr & Chr(44) & Assign & Chr(44) & _
Target & Chr(13) & Chr(10)
Print #fnum, ExportText
End If
Next n
Close #fnum
What's up with the little control character box? Is some kind of character code coming across from Word?
Word has a little marker thingy that it puts at the end of every cell of text in a table.
It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
Left(Target, Len(Target) - 1)
By the way, instead of
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Try this:
For Each row In Application.ActiveDocument.Tables(2).Rows
    Descr = row.Cells(2).Range.Text
Next row
Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:
from win32com.client import Dispatch

word = Dispatch('Word.Application')
doc = word.Documents.Open('d:\\stuff\\myfile.doc')
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=2)  # 2 = wdFormatText
This is untested, but I think something like that will just open the file and save it as plain text – you could then read the text into python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it off hand; documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there's some good examples there.
You can also run makepy.py (part of the pythonwin distribution) to generate python "signatures" for the COM functions available, and then look through it as a kind of documentation.
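If you do go the Python route, here is a rough win32com sketch of the same table walk the VBA macro above performs. It is untested, the path and table index are placeholders, and the [:-2] slice strips the two-character end-of-cell marker (the carriage return plus the "little box") discussed earlier:
from win32com.client import Dispatch

word = Dispatch('Word.Application')
doc = word.Documents.Open('d:\\temp\\minutes.doc')  # placeholder path
table = doc.Tables(2)
for n in range(1, table.Rows.Count + 1):  # Word tables are 1-indexed
    descr = table.Cell(n, 2).Range.Text[:-2]
    assign = table.Cell(n, 3).Range.Text[:-2]
    target = table.Cell(n, 4).Range.Text[:-2]
    if target:
        print('{},{},{}'.format(descr, assign, target))
doc.Close()
word.Quit()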
You could use OpenOffice. It can open Word files and can also run Python macros.
I'd say look at the related questions on the right -->
The top one seems to have some good ideas for going the python route.
How about saving the file as XML, then using Python or something else to pull the data out of Word and into the database?
It is possible to programmatically save a Word document as HTML and to import the table(s) it contains into Access. This requires very little effort.
