this post was submitted on 24 Dec 2023
204 points (95.5% liked)

submitted 11 months ago* (last edited 11 months ago) by [email protected] to c/[email protected]
 

I've been reading Mastering Regular Expressions by Jeffrey E.F. Friedl, and since nobody in my life (aside from my wife) cares, I thought I’d share something I'm pretty proud of: my first set of regular expressions, which I wrote myself to manipulate the text I'm working with.

What I’m so happy about is that I wrote these expressions. I understand exactly what they do and the purpose of each character in each expression.

I've used regex in the past, stuff cobbled together from Stack Overflow, but I never really understood how the expressions worked or what they meant, just that they did what I needed them to do at the time.

I'm only about 10% of the way through the book, but already I understand so much more than I ever did about regex (I also recognize I have a lot to learn).

I wrote the expressions to be used with egrep and sed to generate and clean up a list of filenames pulled out of tarballs (movies I've ripped from my DVD collection and tarballed to archive them).

The first expression I wrote was this one used with tar and egrep to list the files in the tarball and get just the name of the video file:

tar -tzvf file.tar.gz | egrep -o '\/[^/]*\.m(kv|p4)' > movielist

Which gives me a list of movies of which this is an example:

/The.Hunger.Games.(2012).[tmdbid-70160].mp4
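
To see what that expression keeps without a real tarball on hand, here is a minimal sketch that feeds one made-up line, shaped like `tar -tzvf` output, through the same pattern. The listing line, the `movies/` directory, and the `extract_name` helper are assumptions for illustration. It uses `grep -E`, the current spelling of `egrep`, and leaves the slash unescaped since it needs no backslash here.

```shell
# Hypothetical listing line in the shape `tar -tzvf` prints (made up for illustration).
listing='-rw-r--r-- user/group 123456 2023-12-24 10:00 movies/The.Hunger.Games.(2012).[tmdbid-70160].mp4'

# -o prints only the matched part: a slash, then any run of non-slash characters,
# ending in .mkv or .mp4. The earlier slash in "user/group" cannot match because
# another slash appears before any .mkv/.mp4 ending.
extract_name() {
    grep -Eo '/[^/]*\.m(kv|p4)'
}

printf '%s\n' "$listing" | extract_name
# prints: /The.Hunger.Games.(2012).[tmdbid-70160].mp4
```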

Then I used sed with the expression groups to remove:

  • the leading forward slash
  • Everything from .[ to the end
  • All of the periods in between words

The last expression finds runs of one or more spaces and replaces each run with a single space.

This is the full sed command:

sed -E -i -e 's/^\///; s/\.\[[a-z]+-[0-9]+\]\.m(p4|kv)//; s/[^a-zA-Z0-9\(\)&-]/ /g; s/ +/ /g' movielist

Which leaves me with a pretty list of movies that looks like this:

The Hunger Games (2012)
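
For anyone who wants to watch the substitutions work one at a time, here is a rough sketch that applies the same four expressions to standard input instead of editing movielist in place. The `clean_title` name is just a label for this example, not part of the original workflow.

```shell
# The same four substitutions, one per -e, reading stdin and writing stdout:
#   1. drop the leading slash
#   2. drop ".[tag-digits].mp4" or ".[tag-digits].mkv" (the match is not anchored to the end)
#   3. turn every character outside the allowed set into a space
#   4. squeeze runs of spaces down to one
clean_title() {
    sed -E \
        -e 's/^\///' \
        -e 's/\.\[[a-z]+-[0-9]+\]\.m(p4|kv)//' \
        -e 's/[^a-zA-Z0-9\(\)&-]/ /g' \
        -e 's/ +/ /g'
}

printf '%s\n' '/The.Hunger.Games.(2012).[tmdbid-70160].mp4' | clean_title
# prints: The Hunger Games (2012)
```

One thing this makes easy to check: any character outside the bracket list, such as an apostrophe in a title, also becomes a space.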

I'm sure this could be done more elegantly, and I'm happy for any feedback on how to do that! For now, I'm just excited that I'm beginning to understand regex and how to use it!

Edit: fixed title so it didn’t say “regex expressions”

[–] [email protected] 10 points 11 months ago (2 children)

Wait. Are there flavors of regex? Every time I have to use regex it hurts my brain, and I never need to do it enough to actually sit down and learn it properly like OP is doing. Just knowing there are different ways of doing the same things in an already mind-baffling language blows me away even more.

[–] [email protected] 20 points 11 months ago* (last edited 11 months ago) (3 children)

Yeah. The only one you really need to care about (especially under Linux) is PCRE, the good ol' Perl Compatible Regular Expressions. For the most part, every other flavor is a derivative of that. Microsoft had a weird version for a while, but that may be completely dead now, thankfully.

Learning the syntax of regex is fairly easy. Hell, I still have to use this cheat sheet more often now that my perl skills are no longer needed or even relevant.

Regex isn't that hard. The challenge is identifying and understanding patterns in the data that you are filtering. Here is a brain hack: if you have pages and pages of logs that you need to filter, open up one of the log files, stare at the screen and hold the page down key for several dozen pages. Patterns can be easily seen in the blur of text that is quickly scrolling across the screen. (Our brains love to find patterns in noise, btw.) The patterns that you see will give you focus points for developing regular expressions to match, i.e., you start breaking strings into chunks, and seeing the ebb and flow of data streaming across a screen helps. Anomalies in the data "stream" are easy to spot as well.
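
A different way to get a similar overview, offered as a rough sketch and not as the scrolling trick itself: replace every run of digits with a placeholder, then count how often each resulting line "shape" shows up. The sample log lines and the `line_shapes` name are made up for illustration.

```shell
# Collapse each run of digits to N so lines that differ only in numbers look identical,
# then count the distinct shapes and list the most frequent first.
line_shapes() {
    sed -E 's/[0-9]+/N/g' | sort | uniq -c | sort -rn
}

# Made-up log lines for illustration.
printf '%s\n' \
    'GET /item/101 200 12ms' \
    'GET /item/202 200 8ms' \
    'GET /item/303 200 15ms' \
    'ERROR timeout after 30s' | line_shapes
```

The common shapes at the top suggest what a regex should match, and the rare ones at the bottom are the anomalies.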

From a security and efficiency standpoint, you should also understand where the most processing takes place so you don't kill whatever platform you are working on.

Sorry for the rambling, but I am getting older and feel the need to pass on a ton of tips and tricks whenever I can for these "archaic" languages.

[–] [email protected] 6 points 11 months ago

That screen scrolling tip is gold. I’ve often used that trick to spot anomalies in data. Hadn’t considered using it to spot the patterns for regex.

[–] [email protected] 2 points 11 months ago* (last edited 11 months ago) (1 children)

The only one you really need to care about (especially under Linux) is PCRE,

Well, no. sed, grep, awk, vi, etc. use POSIX regexes. Some GNU implementations, such as grep with -P, also provide a Perl-compatible mode via a non-portable option. In modern programming languages like Go and Rust, the standard regex engines are compatible with RE2, a relatively new dialect developed at Google that is not described in Friedl's book (you may think of it as an extension of the extended POSIX dialect). Even Raku has its own dialect, incompatible with Perl's as well as with the other ones.

Nowadays it is common to move away from Perl-like engines; however, they are still widely used in PCRE-based software and in software written in Python, JS, etc.
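
A small sketch of how the two POSIX dialects differ in practice: in a basic regex (grep's default) `+` is an ordinary character, while in an extended regex (`grep -E`) it means "one or more". The `sample` helper and its two input lines are made up for illustration.

```shell
# Two made-up input lines: one with repeated letters, one containing a literal plus sign.
sample() {
    printf '%s\n' 'aaa' 'a+'
}

# Basic regex (default): + is literal, so only the line "a+" matches.
sample | grep 'a+'

# Extended regex (-E): + means one or more of the preceding item, so both lines match.
sample | grep -E 'a+'
```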

[–] [email protected] 1 points 11 months ago (1 children)

POSIX? Never heard of her.

While you are likely 100% correct, the legacy Perl developer side of me is making nasty comments to you with illegible syntax.

[–] [email protected] 2 points 11 months ago

Perl introduced powerful backtracking regexes that were widely adopted. However, they can be damn slow in some cases, which is why RE2 rejected backtracking while keeping some Perl-like elements. Both basic and extended POSIX regexes are also non-backtracking because they are older than Perl.

[–] [email protected] 1 points 11 months ago

Thanks for the comprehensive reply! I have only used it for quite simple things, like getting the IDs out of log lines where certain keywords exist. Great tip about pattern searching!

Merry Christmas

[–] ricecake 5 points 11 months ago

Yes. Most things use PCRE, or Perl Compatible Regular Expressions, but there are other flavors. Usually they lack features or have slightly different syntax.