Data hidden automatically when converting text to a DataFrame in Python

I have an issue with data being hidden. When I print the extracted data as text, all the data is shown properly. The code below prints the extracted data; its output is also given.
import ocrmypdf
import pdfplumber

path = "G:\\SKM.pdf"
ocrmypdf.ocr(path, "output.pdf")        # OCR the PDF once, via the Python API
invoice = pdfplumber.open("output.pdf")
count_pages = len(invoice.pages)
page = invoice.pages[count_pages - 1]   # last page
text = page.extract_text(x_tolerance=2)
print(text)
Output:
Order Number : 202100050 Order Date : 25.11.2021
Client Number : 145 Delivery Date : Pending
Currency : Euro Contact Perso: Martin
Payment Condition : Due Email : martin#def.com
When I convert the text to a DataFrame and print it, some data, such as the order date, delivery date, and email address, is partially hidden. The output is given below.
import pandas as pd

ds = pd.DataFrame(text.split('\n'))
print(ds)
Output:
1 Order Number : 202100050 Order Date : ...
2 Client Number : 145 Delivery Date : Pen...
3 Currency : Euro Contact Perso: Martin
4 Payment Condition : Due Email : martin#d...
What is the reason, and how can I solve this issue?

The data is not lost; pandas just truncates long cells and wide frames when printing. Try using a table formatter such as tabulate, which you must first install with pip install tabulate; pandas can then use it to print the dataframe formatted:
ds = pd.DataFrame(text.split('\n'))
print(ds.to_markdown())
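Alternatively, you can change pandas' display options so that print() itself stops truncating. A minimal sketch, reusing the text variable from the question (by default pandas cuts each cell at display.max_colwidth, 50 characters, and wraps frames to the terminal width):
import pandas as pd

pd.set_option('display.max_colwidth', None)  # never truncate cell contents
pd.set_option('display.width', None)         # auto-detect the terminal width

ds = pd.DataFrame(text.split('\n'))
print(ds)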

Related

How to delete icons from comments in csv files using pandas

I am trying to delete an icon which appears in many rows of my csv file. When I create a dataframe object using pd.read_csv it shows a green squared check icon, but if I open the csv using Excel I see ✅ instead. I tried to delete it using the split function, because the verification status is separated from the comment by |:
df['reviews'] = df['reviews'].apply(lambda x: x.split('|')[1])
I noticed it doesn't detect the "|" separator when the review contains the icon mentioned above.
I am not sure if it is an encoding problem. I tried adding encoding='utf-8' to pandas read_csv, but it didn't solve the problem.
Thanks in advance.
I would like to add that this is a pic of the csv file opened in Excel.
You can remove non-Latin-1 characters (the check mark is not representable in Latin-1) using the encode/decode methods:
>>> df
reviews
0 ✓ Trip Verified
1 Verified
>>> df['reviews'].str.encode('latin1', errors='ignore').str.decode('latin1')
0 Trip Verified
1 Verified
Name: reviews, dtype: object
Say you had the following dataframe:
reviews
0 ✅ Trip Verified
1 Not Verified
2 Not Verified
3 ✅ Trip Verified
You can use the replace method to remove the ✅ symbol, which is Unicode character U+2705.
df['reviews'] = df['reviews'].apply(lambda x: x.replace('\u2705',''))
Here is the full example:
Code:
import pandas as pd
df = pd.DataFrame({"reviews":['\u2705 Trip Verified', 'Not Verified', 'Not Verified', '\u2705 Trip Verified']})
df['reviews'] = df['reviews'].apply(lambda x: x.replace('\u2705',''))
print(df)
Output:
reviews
0 Trip Verified
1 Not Verified
2 Not Verified
3 Trip Verified
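If other emoji besides U+2705 can appear in the file, a more general sketch (assuming you want to drop every non-ASCII character and trim the leftover leading space) is a regex replacement:
import pandas as pd

df = pd.DataFrame({"reviews": ['\u2705 Trip Verified', 'Not Verified']})
# Remove every non-ASCII character, then trim leftover whitespace
df['reviews'] = df['reviews'].str.replace(r'[^\x00-\x7F]+', '', regex=True).str.strip()
print(df)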

Save API-filtered data in an Excel file

I have a problem with some data from the Binance API.
What I want is to keep a list of the USDT-paired coins from Binance. The problem is that Binance gives me a complete list of all kinds of paired coins, and I'm unable to filter out only the USDT-paired ones.
I need to save only the USDT-paired coins in an Excel file.
I have written this code to get the complete coin list:
import requests, json
data=requests.get('https://api.binance.com' + '/api/v1/ticker/allBookTickers')
data=json.loads(data.content)
print(data)
What you need is just the pandas module. You can try the code below:
import requests
import json
import pandas as pd
data=requests.get('https://api.binance.com' + '/api/v1/ticker/allBookTickers')
data=json.loads(data.content)
dataframe = pd.DataFrame(data)
dataframe.to_excel("my_data.xlsx")
Your file will be saved in the same directory as the script and is named my_data.xlsx.
Note that the dataframe variable is something like what follows:
    symbol  bidPrice    bidQty  askPrice    askQty
0   ETHBTC  0.068918    1.7195  0.068919    0.0219
1   LTCBTC  0.002926    7.943   0.002927    4.368
2   BNBBTC  0.009438    4.493   0.009439    3.072
3   NEOBTC  0.000499  385.33    0.0005    793.74
4  QTUMETH  0.002231  304.3     0.002235   60.9
As per your comment, you need the pairs of coins ending with USDT. Therefore you need to filter the dataframe using a regex:
import requests
import json
import pandas as pd
data=requests.get('https://api.binance.com' + '/api/v1/ticker/allBookTickers')
data=json.loads(data.content)
dataframe = pd.DataFrame(data)
dataframe = dataframe[dataframe["symbol"].str.contains("USDT$")]
dataframe.to_excel("my_data.xlsx")
dataframe
which results in an output such as what follows:
      symbol  bidPrice    bidQty  askPrice    askQty
11   BTCUSDT  44260      0.11608  44260      1.56671
12   ETHUSDT  3116.59    5.0673   3116.6    12.3602
98   BNBUSDT  428.2    124.404    428.3     45.021
125  BCCUSDT  0          0        0          0
Note that I have shown just the first four rows of the dataframe.
Explanation
The regex USDT$ matches strings that end with USDT; the dollar sign anchors the pattern to the end of the string.
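Equivalently, if you'd rather avoid regex, pandas' str.endswith does the same filtering:
dataframe = dataframe[dataframe["symbol"].str.endswith("USDT")]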

Pandas slices rows and adds them to the first columns

I'm trying to display all the columns of a csv file.
And this is the code I'm using:
import pandas as pd

pd.options.display.max_colwidth = None
pd.options.display.max_columns = None
excel1 = pd.read_csv('CO-Chats1.csv', sep=';')
But when I read it, I get this.
Case Owner Resolved Date/Time Case Origin Case Number Status \
0 Reinaldo Franco 10/16/2021, 3:54 PM Chat 20546561 Resolved
1 Catalina Sanchez 10/16/2021, 5:38 AM Chat 5625033 Resolved
Subject
0 General Support
1 Support for payment
I'm not sure what causes the \ or why the remaining columns are printed below the first ones.
If you are in a Jupyter notebook, try using display() instead of print() to see the output.
excel1 = pd.read_csv('CO-Chats1.csv', sep=';')
display(excel1)
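The \ is not a data problem: it is the line-continuation marker pandas prints when a frame is wider than display.width, so the leftover columns are shown in a second block. Outside a notebook, a minimal sketch that widens the plain-text output instead:
import pandas as pd

pd.options.display.width = None               # auto-detect the terminal width
pd.options.display.expand_frame_repr = False  # never split columns across blocks

excel1 = pd.read_csv('CO-Chats1.csv', sep=';')
print(excel1)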

Compare 2 values in an Excel file and get the common values

I am trying to extract particular data using Python. I have one month of data which records how many jobs failed with a given return code over that month.
I have 30 Excel files, and so far I have loaded the data into my dataframe using the code below:
import glob2
import os
import pandas as pd

def concatenate(indir="C:\\Users\\hp", outfile="C:\\Users\\hp\\new1.csv"):
    os.chdir(indir)
    filelist = glob2.glob("*.csv")
    dfList = []
    for f in filelist:
        print(f)
        df = pd.read_csv(f)
        dfList.append(df)
    concatDf = pd.concat(dfList, axis=0)
    b = concatDf[['JOB NAME', ' RC ']]
I have extracted the required columns, and now I have to perform an operation on them so that I can see how many jobs failed for the same reason over the month.
Input:
STATUS JOB NAME RC DATE TIME
R ABCDEFGH U0900 18163 19:53
X SSTUFGHI C0001 18164 2:04
R LMNOPQRS SB37 18164 2:41
R ABCDEFGH U0900 18164 3:36
Output required:
JOB NAME RC
ABCDEFGH U0900
ABCDEFGH U0900
I don't understand how to compare the two values and get the above output. Please help me; I am very new to Python.
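A minimal sketch of one way to get that output, assuming the concatenated frame concatDf from the code above and that the column really is named ' RC ' with its surrounding spaces: duplicated() with keep=False marks every row whose (JOB NAME, RC) pair occurs more than once.
b = concatDf[['JOB NAME', ' RC ']]

# keep=False flags all occurrences of a repeated (JOB NAME, RC) pair,
# i.e. jobs that failed more than once with the same return code
repeated = b[b.duplicated(keep=False)]
print(repeated)

# A count per (job, return code) pair is often more useful:
print(b.groupby(['JOB NAME', ' RC ']).size())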

How to get historical weather data (Temperature) on an hourly basis for Chicago, IL

I need historical weather data (temperature) on an hourly basis for Chicago, IL (zip code 60603).
Basically I need it for the months of June and July 2017, either hourly or at 15-minute intervals.
I have searched NOAA, Weather Underground, etc., but haven't found anything relevant to my use case. I tried my hand at scraping with R and Python, but no luck.
Here is a snippet of each attempt.
R:
library(httr)
library(XML)

url <- "http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdXMLclient.php"
response <- GET(url, query = list(zipCodeList = "10001",
                                  product = "time-series",
                                  begin = format(Sys.Date(), "%Y-%m-%d"),
                                  Unit = "e",
                                  temp = "temp", rh = "rh", wspd = "wspd"))
doc <- content(response, type = "text/xml", encoding = "UTF-8")  # XML document with the data

# extract the date-times
dates <- doc["//time-layout/start-valid-time"]
dates <- as.POSIXct(xmlSApply(dates, xmlValue), format = "%Y-%m-%dT%H:%M:%S")

# extract the actual data
data <- doc["//parameters/*"]
data <- sapply(data, function(d) removeChildren(d, kids = list("name")))
result <- do.call(data.frame, lapply(data, function(d) xmlSApply(d, xmlValue)))
colnames(result) <- sapply(data, xmlName)

# combine into a data frame
result <- data.frame(dates, result)
head(result)
Error :
Error in UseMethod("xmlSApply") :
no applicable method for 'xmlSApply' applied to an object of class "list"
Python:
from pydap.client import open_url

# setup the connection
url = ('http://nomads.ncdc.noaa.gov/dods/NCEP_NARR_DAILY/197901/197901/'
       'narr-a_221_197901dd_hh00_000')
modelconn = open_url(url)
tmp2m = modelconn['tmp2m']
# grab the data
lat_index = 200 # you could tie this to tmp2m.lat[:]
lon_index = 200 # you could tie this to tmp2m.lon[:]
print(tmp2m.array[:,lat_index,lon_index] )
Error :
HTTPError: 503 Service Temporarily Unavailable
Any other solution in R or Python is appreciated, as is a link to any related online dataset.
There is an R package rwunderground, but I've not had much success getting what I want out of it. In all honesty, I'm not sure if that's the package, or if it is me.
Eventually, I broke down and wrote a quick diddy to get the daily weather history for personal weather stations. You'll need to sign up for a Weather Underground API token (I'll leave that to you). Then you can use the following:
library(rjson)

api_key <- "your_key_here"
date <- seq(as.Date("2017-06-01"), as.Date("2017-07-31"), by = 1)
pws <- "KILCHICA403"
Weather <- vector("list", length = length(date))

for(i in seq_along(Weather)){
  url <- paste0("http://api.wunderground.com/api/", api_key,
                "/history_", format(date[i], format = "%Y%m%d"), "/q/pws:",
                pws, ".json")
  result <- rjson::fromJSON(paste0(readLines(url), collapse = " "))
  Weather[[i]] <- do.call("rbind", lapply(result[[2]][[3]], as.data.frame,
                                          stringsAsFactors = FALSE))
  Sys.sleep(6)
}
Weather <- do.call("rbind", Weather)
There's a call to Sys.sleep, which causes the loop to wait 6 seconds before going to the next iteration. This is done because the free API only allows ten calls per minute (up to 500 per day).
Also, some days may not have data. Remember that this connects to a personal weather station. There could be any number of reasons that it stopped uploading data, including internet outages, power outages, or the owner turned off the link to Weather Underground. If you can't get the data off of one station, try another nearby and fill in the gaps.
To get a weather station code, go to weatherunderground.com and enter your desired zip code into the search bar. Click on the "Change" link. You will see the station code of the current station, along with options for other stations nearby.
Just to provide a Python solution for whoever comes by this question looking for one: this will (per the post) go through each day in June and July 2017, getting all observations for a given location. It does not restrict results to 15-minute or hourly intervals, but it does provide all data observed on each day. Additional parsing of the observation time per observation is necessary, but this is a start.
WunderWeather
pip install WunderWeather
pip install arrow
import arrow  # learn more: https://python.org/pypi/arrow
from WunderWeather import weather  # learn more: https://python.org/pypi/WunderWeather

api_key = ''
extractor = weather.Extract(api_key)
zip = '02481'
begin_date = arrow.get("201706", "YYYYMM")
end_date = arrow.get("201708", "YYYYMM").shift(days=-1)

for date in arrow.Arrow.range('day', begin_date, end_date):
    # get date object for feature
    # http://wunderweather.readthedocs.io/en/latest/WunderWeather.html#WunderWeather.weather.Extract.date
    date_weather = extractor.date(zip, date.format('YYYYMMDD'))

    # use shortcut to get observations and data
    # http://wunderweather.readthedocs.io/en/latest/WunderWeather.html#WunderWeather.date.Observation
    for observation in date_weather.observations:
        print("Date:", observation.date_pretty)
        print("Temp:", observation.temp_f)
If you are solving an ML task and want to try historical weather data, I recommend the Python library upgini for smart enrichment. It contains 12 years of historical weather data for 68 countries.
My usage code is the following:
%pip install -Uq upgini

from upgini import SearchKey, FeaturesEnricher
from upgini.metadata import CVType, RuntimeParameters

## define search keys
search_keys = {
    "Date": SearchKey.DATE,
    "country": SearchKey.COUNTRY,
    "postal_code": SearchKey.POSTAL_CODE
}

## define X_train / y_train
X_train = df_prices.drop(columns=['Target'])
y_train = df_prices.Target

## define Features Enricher
features_enricher = FeaturesEnricher(
    search_keys = search_keys,
    cv = CVType.time_series
)

X_enriched = features_enricher.fit_transform(X_train, y_train, calculate_metrics=True)
As a result you'll get a dataframe with new features that have non-zero feature importance on the target, such as temperature, wind speed, etc.
Web: https://upgini.com GitHub: https://github.com/upgini
